When Your Multi-Agent System Fails, Don’t Blame the Router


As a product manager, I frequently run into problems that call for a specific product management strategy, whether it’s to enable a conversation with a stakeholder, drive a decision, or figure out what the next steps should be. I’ve always found myself reaching for some framework or structured approach, and at some point I started wondering what it would look like if I built an agentic system that could help identify the right strategy for a given problem, apply that strategy alongside my input, and offer me different ways of thinking about it.

I had been following a lot of writing about the recent advances in Claude’s capabilities, and the step change felt significant enough to try building something real. The initial design was fairly straightforward: the user provides a problem statement, a router classifies the problem type using keyword and semantic understanding, and then delegates it to the right specialized agent. I had agents covering product strategy, prioritization, user research, stakeholder management, technical PM skills, product discovery, and pricing. The idea was that the system would be able to figure out which stage of problem solving I was in and surface the right framework to help me get to a more refined state of thinking about that problem.
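To make the v1 design concrete, here is a minimal sketch of that routing idea. The agent names, keywords, and scoring are illustrative assumptions of mine, not the actual implementation, but they capture the shape of it: classify the problem statement, then hand the whole thing to a single specialized agent.

```python
# Hypothetical v1 router: keyword scoring picks exactly one agent.
AGENTS = {
    "prioritization": "prioritization_agent",
    "user_research": "user_research_agent",
    "stakeholder": "stakeholder_agent",
    "pricing": "pricing_agent",
}

KEYWORDS = {
    "prioritization": ["prioritize", "roadmap", "scoring"],
    "user_research": ["interview", "persona", "usability"],
    "stakeholder": ["stakeholder", "alignment", "buy-in"],
    "pricing": ["pricing", "packaging", "willingness to pay"],
}

def route(problem_statement: str) -> str:
    """Pick the single agent whose keywords best match the statement."""
    text = problem_statement.lower()
    scores = {
        domain: sum(word in text for word in words)
        for domain, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return AGENTS[best]
```

The one-problem-one-agent assumption is baked into the return type: a single label, no matter how many dimensions the problem actually spans.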

What Broke

Even as I was putting the design together, something about it felt off. This was my first time building an agentic system in a knowledge work context, which is a very nuanced and open-ended domain compared to more structured applications. When I tested the system on a real problem at work, it immediately jumped to solution validation when what I actually needed was problem exploration. The system wasn’t reasoning about where I was in my process. It wasn’t considering what assumptions I had already made, or what the downstream impacts of those assumptions might be on the solution design, the value estimate, or even the scope of the problem itself.

My first instinct was to go fix the router. It looked like a classification issue, something that could be addressed with better keywords, smarter semantic matching, maybe some few-shot examples to help it distinguish between problem types more accurately.

But as I started working through what actually went wrong, I realized the router wasn’t where the problem lived. What I had done was conflate three very different dimensions of problem solving into individual agents:

  1. Workflow stages like exploration vs. validation
  2. Analytical tools like stakeholder mapping, experiment design, RICE scoring
  3. Skill domains like pricing strategy, product design, experimentation

The architecture was built on the assumption that one problem maps to one agent. In practice though, a single problem statement touches a workflow stage, pulls from multiple analytical tools, and draws on different skill domains in ways that shift as you progress through the stages of solving it. Improving the router’s classification accuracy wouldn’t have addressed any of this, because the issue wasn’t how well the system was routing. It was how the agents themselves were decomposed.
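A small, hypothetical example makes the mismatch visible. The problem statement and field names below are mine, invented for illustration, but they show why a single routing label loses information that a decomposition across all three dimensions preserves:

```python
# The same problem statement under both framings (illustrative only).
problem = "Should we launch usage-based pricing for the mobile tier?"

# v1: the router collapses everything into a single agent label.
v1_routing = "pricing_agent"

# v2: the problem actually spans a workflow stage, multiple
# analytical tools, and multiple skill domains at once.
v2_decomposition = {
    "workflow_stage": "exploration",  # not validation yet
    "analytical_tools": ["stakeholder_mapping", "experiment_design"],
    "skill_domains": ["pricing_strategy", "experimentation"],
}
```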

What I Rebuilt

When things feel stuck like this, I sometimes go back to basics with my stakeholders and map out the entire process end to end. I did the same thing here: traced the flow from problem conception to constraints validation to value estimation to solution design to prioritization and productionization. That mapping exercise is what led to the v2 design.

The new system has five genuinely distinct modes, each producing a different kind of analytical output, and a thin orchestrator whose job is to detect where the user is in their thinking rather than trying to match keywords to frameworks. Instead of asking “which framework fits this query?”, the orchestrator is trying to reason about something closer to “given this stage of problem solving you seem to be in, what would be useful to think about next?”

Part of what makes this work is the concept of thinking maturity. If a problem is well defined and you’ve already mapped out the stakeholders, the constraints, and the value pathways, then you’re in a relatively mature state of understanding. But most of the time you don’t start there, and the system needs to recognize that and meet you where you are. The five modes I landed on are: Discover and Frame, Evaluate, Validate, Constraints and Solution Design, and Size and Prioritize.
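A rough sketch of what stage detection could look like, assuming the orchestrator reduces thinking maturity to a handful of boolean signals. The signal names are my own simplification; the real reasoning is done by the model, not a decision ladder, but the ordering mirrors the five modes:

```python
# Illustrative stage detection: each unmet maturity signal pulls the
# user back to the earliest mode where work is still missing.
MODES = [
    "discover_and_frame",
    "evaluate",
    "validate",
    "constraints_and_solution_design",
    "size_and_prioritize",
]

def detect_mode(context: dict) -> str:
    """Map thinking-maturity signals to one of the five modes."""
    if not context.get("problem_framed"):
        return "discover_and_frame"
    if not context.get("options_evaluated"):
        return "evaluate"
    if not context.get("assumptions_validated"):
        return "validate"
    if not context.get("solution_designed"):
        return "constraints_and_solution_design"
    return "size_and_prioritize"
```

The point of the ordering is exactly the failure from v1: a user with nothing framed yet should land in Discover and Frame, never in Validate.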

The Assumption Register

One of the design decisions I’m most interested in seeing play out is the assumption register. In its current form it’s a JSON object, though I can see it eventually moving to a graph database. The register proactively flags assumptions that are high impact but low confidence, and when an assumption gets invalidated during any phase of the problem solving process, the system traces the dependencies and recommends backward transitions to the stages that were built on top of that assumption.
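In its JSON form, an entry in the register might look something like the following. The field names are assumptions of mine rather than the actual schema, but they show the two pieces the design depends on: impact and confidence for proactive flagging, and a list of dependents for dependency tracing.

```python
# Illustrative assumption register entry (field names are hypothetical).
register = {
    "A1": {
        "statement": "Both channels will bring enough targeted users",
        "impact": "high",
        "confidence": "low",
        "dependents": ["problem_framing", "solution_design", "value_estimate"],
        "status": "open",
    },
}

def flag_risky(register: dict) -> list:
    """Surface assumptions that are high impact but low confidence."""
    return [
        aid for aid, a in register.items()
        if a["impact"] == "high"
        and a["confidence"] == "low"
        and a["status"] == "open"
    ]
```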

To make this more concrete: say I’m building a predictive model to target new customers across both mobile and desktop channels. One of my working assumptions is that both channels together will bring enough targeted users to meet the growth goals I’ve scoped out. I progress to the fourth stage, where I’m defining the solution and its constraints, and I learn that my technical team doesn’t actually have bandwidth from the web team to launch the desktop campaigns.

That single invalidated assumption has implications that ripple across everything I’ve built up to that point: the way I framed the problem, the solution I designed around it, the value I estimated it would deliver. The role of the assumption register is to maintain a map of all the points across the problem space that were resting on any given assumption, so that when something breaks you can see clearly what else needs to be revisited rather than discovering the consequences piecemeal weeks later.
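The backward-transition mechanic can be sketched in a few lines. This is a self-contained toy version of the desktop-bandwidth example, with invented names; the real register’s tracing is richer, but the core move is the same: mark the assumption invalid, then return every stage that was resting on it.

```python
# Toy register for the desktop-bandwidth example (names illustrative).
register = {
    "desktop_bandwidth": {
        "statement": "Web team has bandwidth to launch desktop campaigns",
        "dependents": ["problem_framing", "solution_design", "value_estimate"],
        "status": "open",
    },
}

def invalidate(register: dict, assumption_id: str) -> list:
    """Mark an assumption invalid and return the stages built on it,
    i.e. the recommended backward transitions."""
    entry = register[assumption_id]
    entry["status"] = "invalidated"
    return entry["dependents"]
```

When `invalidate` fires, the recommended transitions are exactly the stages you would otherwise discover piecemeal weeks later.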

This is, in a lot of ways, what a product manager is supposed to be doing: maintaining awareness of all these dependencies and how they connect to each other across the problem space. But when you’re working on four or five problem statements at the same time, keeping all of those mappings in your head with any real fidelity is not realistic. Having an intelligent system that tracks this alongside you changes what’s possible in terms of how many problems you can manage well at the same time.

A Few Other Decisions Worth Noting

I built the system with flexibility for knowledge bases to be dynamically injected. In real projects, you’re always operating in a mode of continuous discovery. You learn things from stakeholder interviews and user conversations that should influence the analysis you’ve already done, and the system needs to be able to accommodate new information rather than treating the initial input as the complete picture.
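A minimal sketch of what that injection could look like, assuming a simple working-context object (the class and method names here are hypothetical): findings accumulate over time instead of the initial problem statement being treated as the complete picture.

```python
# Illustrative dynamic knowledge injection.
class WorkingContext:
    def __init__(self, problem_statement: str):
        self.problem_statement = problem_statement
        self.knowledge = []  # findings accumulated through discovery

    def inject(self, source: str, finding: str) -> None:
        """Fold a new discovery (interview note, metric, doc) into context."""
        self.knowledge.append({"source": source, "finding": finding})

ctx = WorkingContext("Grow targeted users via mobile and desktop")
ctx.inject("stakeholder_interview", "Web team has no bandwidth this quarter")
```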

The data structures currently live as JSON documents and Markdown files, but I’ve designed them with the understanding that they’ll eventually need to move to a graph database and vector stores as the system matures. The principle here was to design fast, test fast, and harden incrementally rather than overbuilding infrastructure before validating that the core approach works.

And then there’s the question of human oversight, which I deliberately kept heavy in this version of the design. I believe strongly that you should start from a defensive position where the human reviews and approves at many points in the process. As the agent demonstrates over time that it can perform specific tasks reliably, you can gradually extend more autonomy to it. Autonomy, in other words, has to be earned through demonstrable and consistent behavior, not granted upfront with the hope that things won’t break in production. I’ve been thinking about this idea quite a bit, and I think it deserves a longer exploration on its own. More on that in a future post.
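One way to operationalize earned autonomy, sketched with illustrative thresholds of my own choosing: the agent only skips human review for a task type once it has a long enough and consistent enough track record on that task, and the default is always the defensive position.

```python
from collections import defaultdict

class AutonomyGate:
    """Require human review until a task type earns autonomy
    through a consistent track record (thresholds illustrative)."""

    def __init__(self, min_runs: int = 20, min_success_rate: float = 0.95):
        self.min_runs = min_runs
        self.min_success_rate = min_success_rate
        self.history = defaultdict(list)  # task_type -> [bool outcomes]

    def record(self, task_type: str, success: bool) -> None:
        self.history[task_type].append(success)

    def requires_review(self, task_type: str) -> bool:
        runs = self.history[task_type]
        if len(runs) < self.min_runs:
            return True  # defensive default: human approves
        return sum(runs) / len(runs) < self.min_success_rate
```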


Aisle Intelligence

Thoughts on AI and Retail