Concepts
Routing strategies
Single-shot vs sequential, optimal stopping
Two ways for the router to pick a model: score them all upfront and send to the winner (single-shot), or start cheap and escalate only if the output fails validation (sequential). Both run the same Expected Utility math.
Expected Utility (single-shot)
The default mode. The router evaluates every configured model in parallel and routes the query to the single best candidate.
- Best for — balanced workloads, low latency, mixed query types.
- Logic — immediately selects the
m_ithat maximizesp_i · R − α · c_i − β · t_i. - Network calls — exactly one LLM call per request.
If the chosen model's output fails quality validation (truncation, broken JSON, refusal), the router still escalates to a fallback — but the default path is one shot, one model.
Tiered Assessment (sequential)
A multi-stage approach. The router starts with the cheapest viable model, validates the output, then decides whether the answer is good enough or whether to escalate.
- Best for — maximum cost optimization, autonomous agentic workflows where many queries are easy and a few are hard.
- Network calls — usually one. Up to three on hard queries.
How it works
For each request:
- First attempt — start with the cheapest available model.
- Post-response calibration — once the response arrives, re-run feature extraction with diagnostic metrics appended: confidence scores, response entropy, and logprobs (if the model exposes them).
- Validation — compute a calibrated probability
p_actualusing the full feature set. This is "how likely is this answer correct, given the response we just got?" - Optimal stopping — compare the utility of the current answer against the potential utility of escalating to a more powerful model (assuming
p_next = 1.0for the gold-standard fallback). - Escalate only if worth it — the router escalates only when the potential gain in accuracy outweighs the additional cost and latency.
Why sequential saves money. Most queries to a coding agent are routine — a small refactor, a quick lookup, a formatting fix. A cheap or local model nails them on the first try. The router never escalates. You pay zero or near-zero per call. Only the hard queries — the ones that genuinely need Opus or GPT-5.4 Thinking — pay the flagship price.
Circuit breaker
Independent of which strategy you pick, the router runs a circuit breaker on every model:
- Automatic failure detection — repeated timeouts, 500 errors, or invalid responses trip the circuit.
- Temporary isolation — a tripped model is bypassed for subsequent requests. Routing falls back to the next best candidate.
- Graceful recovery — after a configurable timeout, the circuit goes half-open and lets a few test requests through. If they succeed, the circuit closes and traffic resumes normally.
Dynamic capability discovery
On startup, the router probes newly discovered models in the background to determine their true capabilities:
- Does the model support
tool_calling(function calling)? - Does the model expose
logprobsfor advanced confidence scoring?
These probes are out-of-band and don't pollute your historical metrics. Tool-intensive requests are only routed to capable models. Models without native function-calling support fall through the MCP/ACP translation layer.
Concurrency limits
To prevent overwhelming specific providers — especially local instances like Ollama — you can cap parallel requests per model in ~/.reality_router/user_models.json:
{
"qwen3-coder:30b": {
"thread_limit": 2
},
"gpt-5.4-thinking": {
"thread_limit": 10
}
}
The router maintains an internal semaphore per model. If a model is at its limit, the router automatically skips it and routes to the next-best candidate. No stalls, no errors.
Which strategy should I pick?
- Pick single-shot if latency matters, your queries vary in difficulty, or your workload is mostly interactive (humans waiting on responses).
- Pick sequential if you're running autonomous agents (RooCode, OpenClaw, AutoGPT), have lots of routine queries mixed with occasional hard ones, and care primarily about minimizing cost.
You can change strategy at any time by re-running ./start.sh.