Version: v0.1

Elo Rating Selection

Elo Rating selection uses a runtime rating system to rank models based on user feedback. Models that receive positive feedback gain rating points; those with negative feedback lose points. Over time, better-performing models rise to the top and get selected more often.

This approach uses the Bradley-Terry model (pairwise comparison framework) to continuously improve model selection through online learning.

Note on RouteLLM: The RouteLLM paper (Ong et al.) trains static router models on preference data and achieves ~50% cost reduction (2x savings). Our implementation takes a different approach: a runtime Elo rating system that updates dynamically based on live feedback rather than pre-trained static routing.

Algorithm Flow

Mathematical Foundation

Expected Score (Bradley-Terry Model)

The expected probability that model A beats model B:

E_A = 1 / (1 + 10^((R_B - R_A) / 400))

Where:

R_A = Rating of model A
R_B = Rating of model B
E_A = Expected score (probability A wins)

Rating Update

After feedback, ratings update as:

R'_A = R_A + K × (S_A - E_A)

Where:

K = K-factor (controls volatility, default: 32)
S_A = Actual outcome (1 = win, 0.5 = tie, 0 = loss)
E_A = Expected score

Example Calculation

Model A: Rating 1500, Model B: Rating 1400
Expected score for A: 1 / (1 + 10^((1400-1500)/400)) = 0.64

If A wins (S_A = 1):
  New rating = 1500 + 32 × (1 - 0.64) = 1500 + 11.5 = 1511.5

If A loses (S_A = 0):
  New rating = 1500 + 32 × (0 - 0.64) = 1500 - 20.5 = 1479.5

Core Algorithm (Go)

// Select returns the model with highest Elo rating
func (s *EloSelector) Select(ctx context.Context, selCtx *SelectionContext) (*SelectionResult, error) {
    var bestModel string
    var bestRating float64 = math.Inf(-1)
    
    for _, candidate := range selCtx.CandidateModels {
        rating := s.getRating(candidate.Model, selCtx.DecisionName)
        if rating > bestRating {
            bestRating = rating
            bestModel = candidate.Model
        }
    }
    
    return &SelectionResult{
        SelectedModel: bestModel,
        Score:         bestRating,
        Method:        MethodElo,
    }, nil
}

// UpdateFeedback adjusts ratings based on user feedback
func (s *EloSelector) UpdateFeedback(winner, loser string, tie bool) {
    ratingA := s.getRating(winner)
    ratingB := s.getRating(loser)
    
    // Calculate expected scores
    expectedA := 1.0 / (1.0 + math.Pow(10, (ratingB-ratingA)/400.0))
    expectedB := 1.0 - expectedA
    
    // Actual scores
    var actualA, actualB float64
    if tie {
        actualA, actualB = 0.5, 0.5
    } else {
        actualA, actualB = 1.0, 0.0
    }
    
    // Update ratings
    s.setRating(winner, ratingA + s.kFactor*(actualA-expectedA))
    s.setRating(loser, ratingB + s.kFactor*(actualB-expectedB))
}

How It Works

All models start with an initial rating (default: 1500)
Users provide feedback (thumbs up/down) after responses
Ratings adjust based on the feedback and the "expected" outcome
Higher-rated models are selected more often

Configuration

decision:
  algorithm:
    type: elo
    elo:
      k_factor: 32              # How quickly ratings change (16-64 typical)
      initial_rating: 1500      # Starting rating for new models
      storage_path: /data/elo-ratings.json  # Persist ratings
      auto_save_interval: 1m    # Save frequency

models:
  - name: gpt-4
    backend: openai
  - name: gpt-3.5-turbo
    backend: openai
  - name: claude-3-opus
    backend: anthropic

Submitting Feedback

Use the feedback API to update model ratings:

# Positive feedback (model performed well)
curl -X POST http://localhost:8080/api/v1/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "request_id": "req-123",
    "model": "gpt-4",
    "rating": 1
  }'

# Negative feedback (model performed poorly)
curl -X POST http://localhost:8080/api/v1/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "request_id": "req-456",
    "model": "gpt-3.5-turbo",
    "rating": -1
  }'

Viewing Current Ratings

curl http://localhost:8080/api/v1/ratings

Response:

{
  "ratings": {
    "gpt-4": 1632,
    "gpt-3.5-turbo": 1485,
    "claude-3-opus": 1558
  },
  "last_updated": "2024-01-15T10:30:00Z"
}

K-Factor Tuning

K-Factor	Behavior	Use Case
16	Slow, stable changes	Mature systems with consistent feedback
32	Balanced (default)	Most production use cases
64	Fast, volatile changes	Rapid experimentation, new deployments

Persistent Storage

Enable storage_path to persist ratings across restarts:

elo:
  storage_path: /data/elo-ratings.json
  auto_save_interval: 1m

The file is atomic-write safe and includes backup rotation.

Best Practices

Collect consistent feedback: Ensure feedback reflects actual quality
Start with default K-factor: Adjust only after observing behavior
Enable persistence: Avoid losing ratings on restarts
Monitor rating drift: Watch for models dominating unfairly
Bootstrap with priors: Set initial ratings based on known quality if available

Elo Rating Selection

Algorithm Flow​

Mathematical Foundation​

Expected Score (Bradley-Terry Model)​

Rating Update​

Example Calculation​

Core Algorithm (Go)​

How It Works​

Configuration​

Submitting Feedback​

Viewing Current Ratings​

K-Factor Tuning​

Persistent Storage​

Best Practices​