# Auto Routing for LLM Models

### What is Auto Routing?

Auto Routing is an intelligent model selection system that automatically picks the best AI model for each request — so you don't have to. Instead of manually choosing a specific model every time, you simply enable Auto Routing and let the system decide which model delivers the best results based on your priorities.

The system learns from every interaction. It tracks each model's speed, cost, and response quality, then uses that data to make better decisions over time. If the selected model runs into an issue, Auto Routing switches to a backup model behind the scenes, so your response still arrives as expected.

### How It Works

When you send a message, Auto Routing goes through a multi-stage decision process to find the best model for that specific request:

#### Step 1 — Request Classification

The system instantly analyzes your request to understand its complexity. It looks at several factors:

* **Message length** — Short questions vs. long detailed prompts
* **Tool usage** — Whether the AI needs to call tools or functions, and how many
* **Knowledge retrieval (RAG)** — Whether the AI needs to search through documents
* **Vision input** — Whether images are part of the conversation

Based on these factors, the request is classified into one of three complexity levels:

| Complexity   | When it applies                                                                  | Preferred models                                   |
| ------------ | -------------------------------------------------------------------------------- | -------------------------------------------------- |
| **Simple**   | Short prompts, no tools, no documents, no images                                 | Lightweight, fast, cost-efficient models           |
| **Moderate** | Medium-length prompts, some tool usage, document retrieval, or image input       | Mid-range capable models                           |
| **Complex**  | Long prompts, many tools, document retrieval combined with tools, vision + tools | Powerful premium models with large context windows |

This classification uses lightweight heuristics rather than an LLM call, so it adds negligible latency.
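The classification rules in the table above can be sketched as a small heuristic function. This is only an illustration: the `Request` type and the numeric thresholds (`SHORT_PROMPT`, `LONG_PROMPT`, `MANY_TOOLS`) are assumptions, since the actual cut-offs are not documented here.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_chars: int         # length of the user message
    tool_count: int = 0       # number of tools the model may call
    uses_rag: bool = False    # document retrieval is involved
    has_images: bool = False  # vision input is present


# Hypothetical thresholds; the real cut-offs are not documented.
SHORT_PROMPT = 500
LONG_PROMPT = 4000
MANY_TOOLS = 3


def classify(req: Request) -> str:
    """Return 'simple', 'moderate', or 'complex' using cheap heuristics."""
    has_tools = req.tool_count > 0
    if (
        req.prompt_chars >= LONG_PROMPT
        or req.tool_count >= MANY_TOOLS
        or (req.uses_rag and has_tools)
        or (req.has_images and has_tools)
    ):
        return "complex"
    if req.prompt_chars > SHORT_PROMPT or has_tools or req.uses_rag or req.has_images:
        return "moderate"
    return "simple"
```

Because the function only inspects a few fields, it runs in constant time and needs no model call.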

#### Step 2 — Model Tiering

Every available model across all your configured providers is categorized into a tier based on its capabilities:

| Tier         | Characteristics                                                                      |
| ------------ | ------------------------------------------------------------------------------------ |
| **Premium**  | Large context window (100K+ tokens), supports tool calling — the most capable models |
| **Balanced** | Mid-range capabilities — solid all-around performers                                 |
| **Economy**  | Smaller context window (≤16K tokens), no vision support — fast and cheap             |

The system matches request complexity to model tiers. Complex requests prefer Premium models; simple requests prefer Economy models — so you're never paying for horsepower you don't need.
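As a rough sketch of the tiering rules above, the function below assigns a tier from a model's capabilities. The `ModelInfo` type is hypothetical, and reading "100K+" as 100,000 tokens and "≤16K" as 16,384 tokens is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ModelInfo:
    name: str
    context_window: int    # in tokens
    supports_tools: bool
    supports_vision: bool


def tier_of(model: ModelInfo) -> str:
    """Map capabilities to a tier, following the table above."""
    if model.context_window >= 100_000 and model.supports_tools:
        return "premium"
    if model.context_window <= 16_384 and not model.supports_vision:
        return "economy"
    return "balanced"


# Which tier each complexity level prefers.
PREFERRED_TIER = {
    "simple": "economy",
    "moderate": "balanced",
    "complex": "premium",
}
```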

#### Step 3 — Model Scoring

Each candidate model is scored across three dimensions using real, tracked performance data:

* **Quality** — How good are this model's responses? (based on ongoing quality evaluations)
* **Cost** — How much does this model cost per request? (cheaper = higher score)
* **Speed** — How fast does this model respond? (faster = higher score)

These three scores are combined using the weights from your chosen priority mode (see below). The system also applies:

* A **tier alignment bonus** — Models whose tier matches the request complexity get a scoring boost, ensuring the right class of model is preferred.
* A **reliability penalty** — Models with a low success rate (frequent errors) get their score reduced, so unreliable models naturally fall to the bottom.

For newly added models that haven't been used enough yet, the system uses sensible defaults based on the model's tier until it has collected enough real data (at least 5 invocations).
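The scoring step might look roughly like the sketch below. The weighted sum, the 5-invocation minimum, and the existence of a tier bonus and reliability penalty come from the description above; the size of the bonus, the reliability threshold, the penalty formula, and the per-tier defaults are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    quality: float       # 0..1, higher is better
    cost: float          # 0..1, normalized so cheaper = higher
    speed: float         # 0..1, normalized so faster = higher
    success_rate: float  # 0..1
    samples: int         # number of recorded invocations
    tier: str            # "premium", "balanced", or "economy"


MIN_SAMPLES = 5           # from the text: at least 5 invocations
TIER_BONUS = 0.10         # hypothetical bonus size
RELIABILITY_FLOOR = 0.90  # hypothetical threshold for "low success rate"
TIER_DEFAULTS = {         # hypothetical (quality, cost, speed) defaults
    "premium": (0.9, 0.3, 0.4),
    "balanced": (0.7, 0.6, 0.6),
    "economy": (0.5, 0.9, 0.9),
}


def score(stats: ModelStats, weights: dict, preferred_tier: str) -> float:
    """Combine quality, cost, and speed, then apply bonus and penalty."""
    if stats.samples < MIN_SAMPLES:
        # Not enough real data yet: fall back to tier defaults.
        quality, cost, speed = TIER_DEFAULTS[stats.tier]
        reliability = 1.0
    else:
        quality, cost, speed = stats.quality, stats.cost, stats.speed
        reliability = stats.success_rate
    total = (
        weights["quality"] * quality
        + weights["cost"] * cost
        + weights["speed"] * speed
    )
    if stats.tier == preferred_tier:
        total += TIER_BONUS    # tier alignment bonus
    if reliability < RELIABILITY_FLOOR:
        total *= reliability   # reliability penalty (assumed form)
    return total
```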

#### Step 4 — Selection with Exploration

The top-scoring model is selected as the primary choice, with multiple backup models ranked behind it for failover.

To keep the system from settling on the same models forever, a built-in **exploration mechanism** gives an under-tested model a small chance (5%) of being promoted to the top of the list. This gives newly added models a fair shot and lets the system keep discovering better options instead of relying only on existing favorites.

#### Step 5 — Failover Protection

If the selected model fails (see Failover System below), the system automatically and invisibly tries the next-best model from the ranked list. You can configure how many backup models are kept ready (1 to 10, default is 3).

### Priority Modes

When you enable Auto Routing, you choose a priority mode that controls how the three scoring dimensions (quality, cost, speed) are weighted:

#### 🏆 Quality First

| Quality | Cost | Speed |
| ------- | ---- | ----- |
| 60%     | 20%  | 20%   |

Picks the smartest, most capable model available — even if it's a bit slower or costs more. Best for tasks where accuracy and depth matter most, like research, writing, complex analysis, or customer-facing interactions.

#### 💰 Cost First

| Quality | Cost | Speed |
| ------- | ---- | ----- |
| 15%     | 60%  | 25%   |

Picks the most affordable model that can still handle the job. Ideal when you're processing high volumes, running batch operations, or want to keep spending low without completely sacrificing quality.

#### ⚡ Speed First

| Quality | Cost | Speed |
| ------- | ---- | ----- |
| 15%     | 25%  | 60%   |

Picks the fastest-responding model. Great for real-time applications, chatbots, quick Q&A, or any situation where responsiveness is more important than getting the absolute best answer.

#### ⚖️ Balanced

| Quality | Cost | Speed |
| ------- | ---- | ----- |
| 34%     | 33%  | 33%   |

Weighs quality, cost, and speed roughly equally. A good default for everyday use when you don't have a strong preference and want a sensible trade-off across all dimensions.
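To make the effect of the four modes concrete, here is a small worked example. The weights are taken from the tables above; the mode keys, the two models, and their normalized scores are made up for illustration.

```python
# Weights per priority mode, as listed in the tables above.
PRIORITY_WEIGHTS = {
    "quality_first": {"quality": 0.60, "cost": 0.20, "speed": 0.20},
    "cost_first":    {"quality": 0.15, "cost": 0.60, "speed": 0.25},
    "speed_first":   {"quality": 0.15, "cost": 0.25, "speed": 0.60},
    "balanced":      {"quality": 0.34, "cost": 0.33, "speed": 0.33},
}

# Made-up normalized scores (0..1, higher is better) for two
# hypothetical models.
MODELS = {
    "large-model": {"quality": 0.95, "cost": 0.20, "speed": 0.40},
    "small-model": {"quality": 0.50, "cost": 0.90, "speed": 0.90},
}


def best_model(mode: str) -> str:
    """Return the model with the highest weighted score for a mode."""
    weights = PRIORITY_WEIGHTS[mode]

    def weighted(scores: dict) -> float:
        return sum(weights[k] * scores[k] for k in weights)

    return max(MODELS, key=lambda name: weighted(MODELS[name]))
```

With these sample numbers, `best_model("quality_first")` picks `large-model` (0.69 vs. 0.66), while the other three modes pick `small-model`.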

### DAG Evaluation Pipeline

One of the most powerful features of Auto Routing is its **DAG (Directed Acyclic Graph) Evaluation Pipeline**, the engine that evaluates how good each AI response actually was after it has been delivered.

#### What It Does

After every response is sent to you, the system runs a background evaluation that grades the response across multiple quality dimensions — all in parallel for maximum speed. The results feed back into the routing system, so models that consistently produce better answers are automatically preferred for future requests.

#### How It Works

The pipeline runs **up to four quality evaluations simultaneously** using parallel processing:

```
                    ┌─────────────────┐
                    │  User Question  │
                    │  + AI Response  │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
              ▼              ▼              ▼
        ┌──────────┐  ┌──────────┐  ┌──────────────┐
        │Relevancy │  │Coherence │  │ Helpfulness  │
        │  Check   │  │  Check   │  │    Check     │
        └────┬─────┘  └────┬─────┘  └──────┬───────┘
              │              │              │
              │    ┌─────────────────┐      │
              │    │  Faithfulness   │      │   (only when RAG
              │    │     Check       │      │    context exists)
              │    └────────┬────────┘      │
              │              │              │
              ▼              ▼              ▼
        ┌────────────────────────────────────────┐
        │      Weighted Composite Score          │
        │          (0.0 — 1.0)                   │
        └────────────────────────────────────────┘
```

Each evaluation produces a score between 0.0 and 1.0, and they are combined into a single composite score using these weights:

**When document retrieval (RAG) is used:**

| Metric       | Weight |
| ------------ | ------ |
| Relevancy    | 35%    |
| Faithfulness | 25%    |
| Coherence    | 20%    |
| Helpfulness  | 20%    |

**When no document retrieval is used:**

| Metric       | Weight    |
| ------------ | --------- |
| Relevancy    | 40%       |
| Coherence    | 30%       |
| Helpfulness  | 30%       |
| Faithfulness | *Skipped* |

Faithfulness is only evaluated when the AI was working with retrieved documents — because it measures whether the answer stayed true to the source material. When there's no source material, it's not applicable, so the weights redistribute to the remaining metrics.

If any individual evaluation fails, it is excluded from the composite score rather than dragging the score down, which keeps scoring robust and fair.
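One way to compute the composite is sketched below, using the weights from the two tables. How a failed evaluation is excluded is not specified in detail; this sketch assumes the remaining weights are renormalized so they still sum to 1, and represents a failed check as `None`.

```python
from typing import Dict, Optional

WEIGHTS_WITH_RAG = {
    "relevancy": 0.35,
    "faithfulness": 0.25,
    "coherence": 0.20,
    "helpfulness": 0.20,
}
WEIGHTS_NO_RAG = {
    "relevancy": 0.40,
    "coherence": 0.30,
    "helpfulness": 0.30,
}


def composite_score(
    scores: Dict[str, Optional[float]], used_rag: bool
) -> Optional[float]:
    """Weighted composite in 0..1; None means every evaluation failed."""
    weights = WEIGHTS_WITH_RAG if used_rag else WEIGHTS_NO_RAG
    # Keep only metrics that apply and that produced a score.
    available = {
        metric: value
        for metric, value in scores.items()
        if metric in weights and value is not None
    }
    total_weight = sum(weights[m] for m in available)
    if total_weight == 0:
        return None
    # Assumption: renormalize over the metrics that succeeded.
    return sum(weights[m] * v for m, v in available.items()) / total_weight
```

For example, scores of 1.0 (relevancy), 0.5 (coherence), and 0.5 (helpfulness) without RAG give 0.40 + 0.15 + 0.15 = 0.70.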

### Quality Evaluation — LLM-as-Judge

The quality evaluation system uses an **LLM-as-Judge** approach, where a separate AI model evaluates the quality of each response. This is powered by DeepEval, a specialized evaluation framework.

#### The Four Quality Dimensions

**Relevancy**

Checks whether the AI actually answered the question that was asked. A response that goes off-topic or answers a different question will score low, even if the content itself is well-written.

**Faithfulness**

Only evaluated when the AI used retrieved documents (RAG). Checks whether the answer stays true to the source material — this is an anti-hallucination check. If the model invents facts that aren't in the documents, it scores low.

**Coherence**

Evaluates the logical flow, internal consistency, and readability of the response. Is it well-structured? Does it make sense from start to finish? Is it easy to follow?

**Helpfulness**

Measures whether the response is complete, accurate, and actionable. Did it actually help the user? Is there enough detail? Can the user act on the information provided?

#### Graceful Fallback

If the advanced evaluation framework (DeepEval) is not available, the system automatically falls back to heuristic-based scoring that uses simpler methods:

* **Relevancy fallback** — Measures keyword overlap between the question and answer
* **Coherence fallback** — Analyzes sentence structure, punctuation quality, and readability signals
* **Helpfulness fallback** — Compares response length and detail level relative to the question
* **Faithfulness fallback** — Uses a baseline neutral score

This ensures quality tracking always works, even in minimal or resource-constrained deployments.
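As an illustration of the relevancy fallback, a keyword-overlap heuristic could be as simple as the sketch below. The stopword list, the tokenization, and the neutral value of 0.5 are assumptions, not the documented behavior.

```python
import re

# Small illustrative stopword list (an assumption, not the real one).
STOPWORDS = {
    "a", "an", "the", "is", "are", "of", "to", "and", "in",
    "how", "what", "do", "i", "my", "your",
}
NEUTRAL_SCORE = 0.5  # hypothetical neutral baseline


def keyword_overlap(question: str, answer: str) -> float:
    """Fraction of the question's keywords that also appear in the answer."""
    q_words = set(re.findall(r"[a-z0-9]+", question.lower())) - STOPWORDS
    a_words = set(re.findall(r"[a-z0-9]+", answer.lower()))
    if not q_words:
        return NEUTRAL_SCORE  # nothing to compare against
    return len(q_words & a_words) / len(q_words)
```

A heuristic like this is much cruder than an LLM judge, since it cannot recognize paraphrases, but it keeps a quality signal flowing when the judge is unavailable.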

### Performance Metrics Tracking

Auto Routing continuously tracks detailed performance metrics for every model, building a real-time understanding of how each model performs.

#### What's Tracked

| Metric                 | Description                                                                                |
| ---------------------- | ------------------------------------------------------------------------------------------ |
| **Average Latency**    | How fast the model responds on average                                                     |
| **P95 Latency**        | Response time at the 95th percentile; captures slow-tail rather than average performance   |
| **Cost per 1K Tokens** | How much the model costs, normalized for fair comparison                                   |
| **Success Rate**       | Percentage of requests that complete without errors                                        |
| **Quality Score**      | Composite evaluation score from the DAG pipeline                                           |
| **Sample Count**       | Number of times the model has been used — metrics require at least 5 samples to be trusted |

#### How Metrics Stay Fresh

* The system maintains a **rolling window of the last 1,000 data points** per metric per model. Older data is automatically trimmed, so the metrics always reflect recent performance.
* All metric data **expires after 7 days of inactivity**, ensuring stale models don't linger in the system.
* Quality scores use **Exponential Moving Average (EMA)** — each new quality evaluation contributes 10% to the running average. This means the system adapts to changing model performance smoothly: recent evaluations have more influence, but a single bad evaluation can't wildly swing the score.
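The rolling window and the EMA update can be sketched as follows. The class and method names are hypothetical; the window size (1,000) and the 10% contribution per new evaluation come from the list above. The nearest-rank P95 calculation is one common definition and is an assumption here, and the 7-day expiry is not modeled.

```python
from collections import deque
from typing import Optional

WINDOW = 1000     # last 1,000 data points per metric
EMA_ALPHA = 0.10  # each new evaluation contributes 10%


class ModelMetrics:
    def __init__(self) -> None:
        # deque(maxlen=...) drops the oldest entry automatically.
        self.latencies = deque(maxlen=WINDOW)
        self.quality: Optional[float] = None

    def record_latency(self, seconds: float) -> None:
        self.latencies.append(seconds)

    def record_quality(self, score: float) -> None:
        if self.quality is None:
            self.quality = score  # first evaluation seeds the average
        else:
            self.quality = EMA_ALPHA * score + (1 - EMA_ALPHA) * self.quality

    def p95_latency(self) -> Optional[float]:
        if not self.latencies:
            return None
        ordered = sorted(self.latencies)
        # Nearest-rank percentile using integer math: ceil(0.95 * n) - 1.
        index = (95 * len(ordered) + 99) // 100 - 1
        return ordered[index]
```

With `EMA_ALPHA = 0.10`, a running score of 1.0 followed by a single evaluation of 0.0 only drops to 0.9, which is the smoothing effect described above.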

#### Automatic Collection

Metrics are collected completely automatically and transparently. Every time any AI model is called — whether through Auto Routing or not — the system records the latency, cost, and success/failure. Quality evaluations run asynchronously in the background after every response, with no impact on response delivery speed.

The metrics collection system is designed to be fully non-blocking: even if the metrics storage is temporarily unavailable, your AI responses continue uninterrupted.

### Failover System

Auto Routing includes a sophisticated failover system that ensures your AI workflows stay running even when individual models have problems.

#### How Failover Works

1. The Auto Router produces a ranked list of candidate models (best first, with backups).
2. Before calling a model, the Failover Manager checks whether it's currently healthy.
3. If the model fails, the system records the failure and immediately tries the next model in the list.
4. This continues until a model succeeds or all candidates have been tried.

#### Errors That Trigger Failover

Not all errors result in failover — only infrastructure-level failures do:

| Error Type                   | Triggers Failover | Cooldown Period |
| ---------------------------- | ----------------- | --------------- |
| **Rate Limit**               | ✅ Yes             | 2 minutes       |
| **Connection Error**         | ✅ Yes             | 30 seconds      |
| **Server Unavailable**       | ✅ Yes             | 60 seconds      |
| **Authentication Error**     | ✅ Yes             | 5 minutes       |
| **Bad Request** (user error) | ❌ No              | —               |
| **Content Filter**           | ❌ No              | —               |

Models in cooldown are temporarily removed from the candidate list, so the system won't waste time retrying a model that just failed.
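The failover loop and the cooldown table can be combined in a sketch like the one below. `ModelCallError`, `FailoverManager`, and `call_with_failover` are hypothetical names; only the error categories and cooldown durations come from the table above.

```python
import time

# Cooldown per error type, in seconds (from the table above).
COOLDOWN_SECONDS = {
    "rate_limit": 120,
    "connection_error": 30,
    "server_unavailable": 60,
    "authentication_error": 300,
}


class ModelCallError(Exception):
    """Hypothetical error type carrying an error category."""

    def __init__(self, kind: str) -> None:
        super().__init__(kind)
        self.kind = kind


class FailoverManager:
    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock
        self._blocked_until = {}

    def is_healthy(self, model: str) -> bool:
        return self._clock() >= self._blocked_until.get(model, 0.0)

    def record_failure(self, model: str, kind: str) -> None:
        self._blocked_until[model] = self._clock() + COOLDOWN_SECONDS[kind]


def call_with_failover(candidates, call, manager):
    """Try candidates in ranked order; return (model, response)."""
    last_error = None
    for model in candidates:
        if not manager.is_healthy(model):
            continue  # still cooling down, skip it
        try:
            return model, call(model)
        except ModelCallError as err:
            if err.kind not in COOLDOWN_SECONDS:
                raise  # e.g. bad request or content filter: no failover
            manager.record_failure(model, err.kind)
            last_error = err
    raise RuntimeError("all candidates failed or are cooling down") from last_error
```

Passing the clock in as a parameter is a design choice that makes the cooldown logic easy to test without waiting in real time.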

#### Circuit Breaker Protection

For models that fail repeatedly, the system implements a circuit breaker pattern:

1. **Tracking** — The system counts failures per model within a 5-minute sliding window.
2. **Trip** — After **3 failures within 5 minutes**, the circuit breaker opens and the model is **completely blocked for 10 minutes**.
3. **Recovery** — After the 10-minute block period, the circuit resets and the model becomes eligible again.

This prevents the system from repeatedly hammering a model that's clearly having sustained problems, while still allowing it to recover automatically.
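A minimal sketch of this circuit breaker, assuming a per-model deque of failure timestamps, is shown below. The class and method names are hypothetical; the 5-minute window, 3-failure threshold, and 10-minute block come from the steps above.

```python
import time
from collections import defaultdict, deque

FAILURE_WINDOW = 300   # 5-minute sliding window, in seconds
FAILURE_THRESHOLD = 3  # failures needed to trip the breaker
BLOCK_SECONDS = 600    # 10-minute block


class CircuitBreaker:
    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock
        self._failures = defaultdict(deque)  # model -> failure timestamps
        self._open_until = {}                # model -> time the block ends

    def record_failure(self, model: str) -> None:
        now = self._clock()
        window = self._failures[model]
        window.append(now)
        # Drop failures that have slid out of the window.
        while window and now - window[0] > FAILURE_WINDOW:
            window.popleft()
        if len(window) >= FAILURE_THRESHOLD:
            self._open_until[model] = now + BLOCK_SECONDS
            window.clear()

    def allows(self, model: str) -> bool:
        until = self._open_until.get(model)
        if until is None:
            return True
        if self._clock() >= until:
            del self._open_until[model]  # block expired: reset the circuit
            return True
        return False
```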

#### Configurable Backup Depth

You can configure how many backup models are kept ready for failover, from 1 to 10 (default: 3). Higher values mean more resilience, but when every model is failing, the final error takes longer to surface because more alternatives are tried before giving up.

### Model Discovery & Exploration

A common problem with automated systems is that they can get "stuck" always choosing the same models, never trying new or updated options. Auto Routing solves this with a built-in exploration mechanism.

#### How Exploration Works

* On each request, there's a **5% chance** that the system will promote a randomly selected under-tested model to the top of the candidate list.
* A model is considered "under-tested" if it has fewer than 5 recorded data points.
* This means every newly added model will naturally get tried within a reasonable number of requests, without requiring manual intervention.
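The exploration rule above resembles an epsilon-greedy strategy and could be sketched like this. The function name and the `sample_counts` mapping are hypothetical; the 5% rate and the 5-sample threshold come from the list above.

```python
import random

EXPLORATION_RATE = 0.05  # 5% chance per request
MIN_SAMPLES = 5          # fewer data points than this = under-tested


def apply_exploration(ranked, sample_counts, rng=random):
    """Occasionally promote a random under-tested model to the top.

    `ranked` is the candidate list, best first. `rng` only needs
    `random()` and `choice()`, so a seeded random.Random also works.
    """
    under_tested = [m for m in ranked if sample_counts.get(m, 0) < MIN_SAMPLES]
    if not under_tested or rng.random() >= EXPLORATION_RATE:
        return list(ranked)  # the usual case: keep the scored order
    promoted = rng.choice(under_tested)
    return [promoted] + [m for m in ranked if m != promoted]
```

Because the rest of the ranking is preserved behind the promoted model, a failed exploration attempt still falls back to the best-scoring candidates.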

#### Why This Matters

* **New models get discovered** — When you add a new provider or a new model becomes available, it won't be ignored just because it has no performance history.
* **Changing models get re-evaluated** — If a provider improves their model, the exploration mechanism ensures the updated version gets tested.
* **No manual tuning needed** — You don't have to periodically switch models to test them; the system does it automatically at a rate that doesn't noticeably impact overall performance.

### The Continuous Learning Loop

All the components described above work together as a continuous feedback loop that makes the system smarter with every interaction:

```
┌──────────────────────────────────────────────────────────┐
│                                                          │
│   1. Request arrives                                     │
│          │                                               │
│          ▼                                               │
│   2. Request Classifier → complexity level               │
│          │                                               │
│          ▼                                               │
│   3. Auto Router → scores models → ranked candidates     │
│          │                                               │
│          ▼                                               │
│   4. Failover Manager → filters unhealthy models         │
│          │                                               │
│          ▼                                               │
│   5. Top healthy model is called                         │
│          │                                               │
│          ├──── Success ──┐                               │
│          │               │                               │
│          │               ▼                               │
│          │    6. Metrics recorded (latency, cost)         │
│          │               │                               │
│          │               ▼                               │
│          │    7. Response delivered to user               │
│          │               │                               │
│          │               ▼                               │
│          │    8. Background: DAG Evaluation Pipeline      │
│          │       runs quality checks in parallel          │
│          │               │                               │
│          │               ▼                               │
│          │    9. Quality score feeds back into metrics    │
│          │               │                               │
│          │               ▼                               │
│          │   10. Next request benefits from updated data  │
│          │               │                               │
│          │               └────────── loops back to 1 ────┤
│          │                                               │
│          └──── Failure ──┐                               │
│                          │                               │
│                          ▼                               │
│              Cooldown / Circuit Breaker applied           │
│                          │                               │
│                          ▼                               │
│              Try next candidate → back to step 5         │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

The net effect: the system automatically discovers which models perform best for which types of requests, adapts to changing model quality and availability, handles failures gracefully, and gets measurably smarter over time — all without any manual configuration or intervention.

### Cross-Provider Intelligence

Auto Routing works across **all your configured AI providers simultaneously**. Whether you use OpenAI, Anthropic, Google, Mistral, or any other supported provider, the system considers models from all of them and picks the best option regardless of vendor.

This gives you several advantages:

* **Best-of-breed selection** — The system can pick OpenAI for one task and Anthropic for another, depending on which performs better for that specific type of request.
* **Vendor diversification** — You're not locked into a single provider. If one provider has an outage, models from other providers automatically take over.
* **Cost optimization** — Different providers have different pricing. The system can find the cheapest option across all providers, not just within one.

#### Provider Exclusion

You can exclude specific providers from Auto Routing if needed. For example:

* Compliance requirements that restrict certain vendors
* Data residency concerns with specific providers
* Personal preference or organizational policy
* Testing a specific provider's performance in isolation

Excluded providers are completely removed from consideration — their models won't appear in the candidate list under any circumstances.

### When to Use Auto Routing

| Scenario                                                                            | Recommended                  |
| ----------------------------------------------------------------------------------- | ---------------------------- |
| You use multiple AI providers and don't want to pick a model every time             | ✅ Yes                        |
| You want the best quality without manually testing every model                      | ✅ Yes                        |
| You need high availability and don't want to worry about outages                    | ✅ Yes                        |
| You want to optimize costs across providers                                         | ✅ Yes                        |
| You handle diverse requests (simple + complex) and want each to use the right model | ✅ Yes                        |
| You're adding new models frequently and want them evaluated automatically           | ✅ Yes                        |
| You have a specific model you always want to use for a particular task              | ❌ Use direct model selection |
| You need absolute control over which model handles every request                    | ❌ Use direct model selection |

***

### Configuration Options

| Option                    | Description                                           | Default  |
| ------------------------- | ----------------------------------------------------- | -------- |
| **Priority Mode**         | Quality First, Cost First, Speed First, or Balanced   | Balanced |
| **Excluded Providers**    | List of providers to never use in auto routing        | None     |
| **Max Failover Attempts** | How many backup models to try before giving up (1–10) | 3        |
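For readers who think in code, the three options could be modeled as a small validated settings object. The field names, mode identifiers, and `filter_candidates` helper are hypothetical; only the defaults and the 1 to 10 range come from the table above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

PRIORITY_MODES = {"quality_first", "cost_first", "speed_first", "balanced"}


@dataclass
class AutoRoutingConfig:
    priority_mode: str = "balanced"
    excluded_providers: List[str] = field(default_factory=list)
    max_failover_attempts: int = 3

    def __post_init__(self) -> None:
        if self.priority_mode not in PRIORITY_MODES:
            raise ValueError(f"unknown priority mode: {self.priority_mode}")
        if not 1 <= self.max_failover_attempts <= 10:
            raise ValueError("max_failover_attempts must be between 1 and 10")


def filter_candidates(
    candidates: List[Tuple[str, str]], config: AutoRoutingConfig
) -> List[Tuple[str, str]]:
    """Drop (provider, model) pairs whose provider is excluded."""
    excluded = {p.lower() for p in config.excluded_providers}
    return [(prov, model) for prov, model in candidates if prov.lower() not in excluded]
```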

### Getting Started

1. Open your AI app configuration.
2. In the Model Settings section, you'll see the **Auto Routing** card at the top.
3. Toggle it **on** with the switch.
4. Expand the card to choose your preferred **priority mode** (Quality, Cost, Speed, or Balanced).
5. That's it — the system handles everything from here.

You can switch priority modes or turn off Auto Routing at any time without affecting your existing conversations. The system starts learning immediately from the first request and gets smarter with every interaction.

