Multi Factor
Overview
multi_factor is a selection algorithm that composes four raw runtime signals — quality, latency, cost, and load — into a single weighted score per candidate, with optional SLO hard ceilings that prune candidates before scoring.
It aligns to config/algorithm/selection/multi-factor.yaml and addresses issue #37.
Key Advantages
- Single-decision SLO-aware routing without orchestrating multiple selectors.
- Each signal is a live source: quality from `quality_score` config, latency from `pkg/latency` percentiles, cost from pricing, load from `pkg/inflight`.
- Min-max normalization across the candidate set means weights have intuitive meaning regardless of absolute signal scales.
- No model state to train. No external service required.
- Hard SLO ceilings (TPOT, TTFT, cost, in-flight) prune unsafe candidates before scoring.
What Problem Does It Solve?
Real routes care about more than one dimension at once: a faster, cheaper model and a slower, better model can sit in the same candidate pool, and the "right" answer depends on current load and SLO targets, not just the static config. Existing single-signal selectors (`latency_aware`, cost-only routing, quality-only routing) force a hard choice. `multi_factor` lets one decision rule express a smooth tradeoff across all four dimensions, with optional hard SLO ceilings to fence off unsafe candidates.
When to Use
- A decision has two or more candidate models that differ along multiple dimensions (e.g. a faster, cheaper model and a slower, better model) and you want a smooth tradeoff knob.
- You want SLO enforcement (e.g. "never route to a model with p95 TPOT > 200ms") without writing a separate decision rule.
- Quality, latency, cost, and load all matter and no single one dominates.
Sibling Algorithms
- `latency_aware` is a special case of this: latency-only scoring. Use it when the other dimensions truly do not matter.
- `hybrid` composes other selectors (Elo + RouterDC + AutoMix) into one. `multi_factor` composes raw signals directly. Both are useful and complementary.
Algorithm Principle
For each candidate model $m$ in the candidate set, after SLO filtering:

$$
\text{score}(m) = w_q \, \hat{q}_m + w_l \, (1 - \hat{l}_m) + w_c \, (1 - \hat{c}_m) + w_n \, (1 - \hat{n}_m)
$$

Where:
- $\hat{q}_m$, $\hat{l}_m$, $\hat{c}_m$, $\hat{n}_m$ are quality / latency / cost / load values min-max normalized to [0, 1] across the surviving candidate set.
- Latency, cost, and load are inverted (`1 - ...`) because lower-is-better.
- Quality is direct because higher-is-better.
- Weights $w_q$, $w_l$, $w_c$, $w_n$ are normalized to sum to 1 (negative weights clamped to zero). Equal weights are the default the algorithm falls back to.
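The scoring rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the router's implementation: the function names, the dictionary layout of a candidate, the handling of a signal that is identical across all candidates, and the equal-weight fallback when every weight is zero are all hypothetical. The sketch expects a non-empty candidate set.

```python
# Illustrative sketch only; names and edge-case handling are assumptions,
# not the router's actual implementation.
SIGNALS = ("quality", "latency", "cost", "load")
LOWER_IS_BETTER = {"latency", "cost", "load"}


def normalize_weights(weights):
    """Clamp negative weights to zero, then rescale so the weights sum to 1."""
    clamped = {s: max(0.0, float(weights.get(s, 0.0))) for s in SIGNALS}
    total = sum(clamped.values())
    if total == 0.0:
        # Assumption: equal weights when no positive weight is configured.
        return {s: 1.0 / len(SIGNALS) for s in SIGNALS}
    return {s: w / total for s, w in clamped.items()}


def min_max(values):
    """Scale values to [0, 1] across the candidate set."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a signal that is identical for every candidate maps to 0.0,
        # so it shifts all scores equally and cannot change the ranking.
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def score_candidates(candidates, weights):
    """Score candidates that have already passed SLO filtering."""
    w = normalize_weights(weights)
    names = list(candidates)
    scores = dict.fromkeys(names, 0.0)
    for signal in SIGNALS:
        normalized = min_max([candidates[n][signal] for n in names])
        for name, value in zip(names, normalized):
            # Invert lower-is-better signals so a higher score is always better.
            term = 1.0 - value if signal in LOWER_IS_BETTER else value
            scores[name] += w[signal] * term
    return scores


# Hypothetical raw signals for two surviving candidates.
candidates = {
    "fast-cheap": {"quality": 0.70, "latency": 40.0, "cost": 0.5, "load": 10},
    "slow-good": {"quality": 0.90, "latency": 120.0, "cost": 4.0, "load": 2},
}
weights = {"quality": 0.4, "latency": 0.2, "cost": 0.2, "load": 0.2}

scores = score_candidates(candidates, weights)
best = max(scores, key=scores.get)  # "slow-good": about 0.6 versus about 0.4
```

With only two candidates, min-max normalization maps every signal to exactly 0 or 1, so the result is easy to check by hand: `slow-good` collects the quality and load weights (0.4 + 0.2), while `fast-cheap` collects the latency and cost weights (0.2 + 0.2).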
SLO Filtering
Before scoring, any candidate that exceeds a non-zero ceiling is removed:
- `max_tpot_ms`: p95 (or configured) TPOT observed via `pkg/latency`
- `max_ttft_ms`: p95 (or configured) TTFT observed via `pkg/latency`
- `max_cost_per_1m`: configured prompt pricing
- `max_inflight`: current in-flight request count from `pkg/inflight`
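A minimal sketch of the ceiling check follows, assuming each candidate is a dictionary that already holds its observed values. The field names `tpot_ms`, `ttft_ms`, and `inflight` are hypothetical stand-ins for whatever `pkg/latency` and `pkg/inflight` report; only the ceiling names and `prompt_per_1m` come from this page. A ceiling that is missing or zero is treated as disabled, and a value exactly at the ceiling passes, since only candidates that exceed it are removed.

```python
# Illustrative sketch only; the observed-value field names are assumptions.
# Each pair is (ceiling key in the slo block, field holding the observed value).
SLO_CHECKS = (
    ("max_tpot_ms", "tpot_ms"),
    ("max_ttft_ms", "ttft_ms"),
    ("max_cost_per_1m", "prompt_per_1m"),
    ("max_inflight", "inflight"),
)


def passes_slo(candidate, slo):
    """Return True when the candidate exceeds no active ceiling."""
    for ceiling_key, value_key in SLO_CHECKS:
        ceiling = slo.get(ceiling_key) or 0  # missing, None, or 0 means no ceiling
        if ceiling > 0 and candidate[value_key] > ceiling:
            return False
    return True


def filter_by_slo(candidates, slo):
    """Keep only the candidates that satisfy every active ceiling."""
    return {name: c for name, c in candidates.items() if passes_slo(c, slo)}


slo = {"max_tpot_ms": 200, "max_ttft_ms": 800, "max_cost_per_1m": 5.0, "max_inflight": 50}

# Hypothetical observed values.
candidates = {
    "model-a": {"tpot_ms": 150, "ttft_ms": 600, "prompt_per_1m": 3.0, "inflight": 12},
    "model-b": {"tpot_ms": 240, "ttft_ms": 600, "prompt_per_1m": 3.0, "inflight": 12},
    "model-c": {"tpot_ms": 200, "ttft_ms": 800, "prompt_per_1m": 5.0, "inflight": 50},
}

survivors = filter_by_slo(candidates, slo)  # model-b is pruned: TPOT 240 > 200
```

The sketch does not cover a candidate with no latency observations yet; how that case is treated is not specified here.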
If all candidates are filtered out, behavior is controlled by on_no_candidates:
| Value | Behavior |
|---|---|
| `cheapest` (default) | Return the candidate with the lowest configured `prompt_per_1m` |
| `first` | Return the first candidate as listed |
| `fail` | Return an error to the caller |
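The three policies in the table can be sketched as a small dispatch function. This is a hedged illustration: `resolve_no_candidates` and `NoCandidatesError` are hypothetical names, and the sketch assumes the fallback is chosen from the original, pre-filter candidate list, passed in as `(name, prompt_per_1m)` pairs in configured order.

```python
# Illustrative sketch only; function and exception names are hypothetical.
class NoCandidatesError(Exception):
    """Raised when every candidate was pruned and the policy is 'fail'."""


def resolve_no_candidates(candidates, policy="cheapest"):
    """Pick a fallback from (name, prompt_per_1m) pairs in configured order."""
    if not candidates:
        raise ValueError("the original candidate list is empty")
    if policy == "cheapest":
        # min() keeps the earliest entry on ties, so list order breaks ties.
        return min(candidates, key=lambda c: c[1])[0]
    if policy == "first":
        return candidates[0][0]
    if policy == "fail":
        raise NoCandidatesError("all candidates exceeded an SLO ceiling")
    raise ValueError(f"unknown on_no_candidates policy: {policy!r}")


# Hypothetical candidates, all of which were filtered out by the SLO ceilings.
original = [("model-a", 6.0), ("model-b", 5.5), ("model-c", 9.0)]

fallback_cheapest = resolve_no_candidates(original)        # "model-b"
fallback_first = resolve_no_candidates(original, "first")  # "model-a"
```

Defaulting to `cheapest` keeps traffic flowing while bounding spend; `fail` suits routes where violating an SLO is worse than returning an error.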
Configuration
```yaml
algorithm:
  type: multi_factor
  multi_factor:
    weights:
      quality: 0.4
      latency: 0.2
      cost: 0.2
      load: 0.2
    slo:
      max_tpot_ms: 200        # optional, omit for no ceiling
      max_ttft_ms: 800        # optional
      max_cost_per_1m: 5.0    # optional, USD per 1M prompt tokens
      max_inflight: 50        # optional
    latency_percentile: 95    # which percentile to read (default 95)
    on_no_candidates: cheapest
```