Version: v0.1

Thompson Sampling Selection

Thompson Sampling is a Bayesian approach to the exploration-exploitation tradeoff. Here it is applied to LLM model selection, treated as a multi-armed bandit problem: each candidate model is an arm, and the selector balances trying less-tested models (exploration) against using models already known to perform well (exploitation).

Reference: A Tutorial on Thompson Sampling by Russo, Van Roy, Kazerouni, Osband & Wen. This comprehensive tutorial covers the theoretical foundations we apply here.

Algorithm Flow

Mathematical Foundation

Beta Distribution

Each model maintains a Beta distribution representing success probability:

P(θ | α, β) = Beta(α, β)

where:
  α = prior_alpha + successes
  β = prior_beta + failures
  θ = true success probability (unknown)

Sampling Process

For each selection, sample from each model's posterior:

θ_i ~ Beta(α_i, β_i)   for each model i

Select model: argmax_i(θ_i)
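
The Go code later on this page calls `s.sampleBeta` without showing it, and Go's standard library has no Beta sampler. The sketch below shows one common way to produce the draw θ_i ~ Beta(α_i, β_i): take X ~ Gamma(α, 1) and Y ~ Gamma(β, 1) and return X / (X + Y), with the Gamma draws generated by the Marsaglia-Tsang method. The names `BetaSampler`, `NewBetaSampler`, and `SelectArm` are hypothetical and are not taken from the project, which may sample differently.

```go
package main

import (
	"math"
	"math/rand"
)

// BetaSampler draws from Beta distributions (hypothetical helper type).
type BetaSampler struct {
	rng *rand.Rand
}

// NewBetaSampler returns a sampler seeded for reproducible draws.
func NewBetaSampler(seed int64) *BetaSampler {
	return &BetaSampler{rng: rand.New(rand.NewSource(seed))}
}

// gamma draws from Gamma(shape, 1) using the Marsaglia-Tsang method.
func (b *BetaSampler) gamma(shape float64) float64 {
	if shape < 1 {
		// Boost small shapes: Gamma(a) = Gamma(a+1) * U^(1/a).
		u := b.rng.Float64()
		return b.gamma(shape+1) * math.Pow(u, 1/shape)
	}
	d := shape - 1.0/3.0
	c := 1 / math.Sqrt(9*d)
	for {
		x := b.rng.NormFloat64()
		v := 1 + c*x
		if v <= 0 {
			continue // reject: the cube below must be positive
		}
		v = v * v * v
		u := b.rng.Float64()
		if math.Log(u) < 0.5*x*x+d-d*v+d*math.Log(v) {
			return d * v
		}
	}
}

// Sample draws theta ~ Beta(alpha, beta) as X / (X + Y) with Gamma X and Y.
func (b *BetaSampler) Sample(alpha, beta float64) float64 {
	x := b.gamma(alpha)
	y := b.gamma(beta)
	return x / (x + y)
}

// SelectArm samples once per arm and returns the index of the largest draw.
// alphas and betas must have equal, non-zero length.
func (b *BetaSampler) SelectArm(alphas, betas []float64) int {
	best, bestSample := 0, -1.0
	for i := range alphas {
		if s := b.Sample(alphas[i], betas[i]); s > bestSample {
			best, bestSample = i, s
		}
	}
	return best
}
```

Passing a fixed seed keeps the draws reproducible in tests; a production selector would more likely seed from the clock or another entropy source.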

Bayesian Update

After feedback, update the selected model's distribution:

If success: α' = α + 1
If failure: β' = β + 1

Expected Value and Variance

E[θ] = α / (α + β)           # Expected success rate
Var[θ] = αβ / ((α+β)²(α+β+1))  # Uncertainty

High variance → more exploration (uncertain model)
Low variance → more exploitation (confident about model)
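
The two formulas above translate directly into code. The following is a minimal sketch; the function names `betaMean` and `betaVariance` are illustrative and do not come from the project.

```go
package main

// betaMean returns E[theta] = alpha / (alpha + beta).
func betaMean(alpha, beta float64) float64 {
	return alpha / (alpha + beta)
}

// betaVariance returns Var[theta] = alpha*beta / ((alpha+beta)^2 * (alpha+beta+1)).
func betaVariance(alpha, beta float64) float64 {
	sum := alpha + beta
	return alpha * beta / (sum * sum * (sum + 1))
}
```

For instance, Beta(1, 1) and Beta(50, 50) both have mean 0.5, but the variance falls from 1/12 (about 0.083) to about 0.0025, so samples from the second distribution stay much closer to 0.5.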

Core Algorithm (Go)

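// Select picks a model using Thompson Sampling: draw one sample from each
// candidate's Beta posterior and return the candidate with the largest draw.
func (s *ThompsonSelector) Select(ctx context.Context, selCtx *SelectionContext) (*SelectionResult, error) {
    // Guard against an empty candidate list, which would otherwise yield an
    // empty model name (uses the standard "errors" package).
    if len(selCtx.CandidateModels) == 0 {
        return nil, errors.New("thompson: no candidate models")
    }

    var bestModel string
    bestSample := -1.0 // below any Beta sample, which lies in [0, 1]

    userID := s.getUserID(selCtx)

    for _, candidate := range selCtx.CandidateModels {
        alpha, beta := s.getParams(userID, candidate.Model)

        // Sample from this model's Beta(alpha, beta) posterior
        sample := s.sampleBeta(alpha, beta)

        if sample > bestSample {
            bestSample = sample
            bestModel = candidate.Model
        }
    }

    return &SelectionResult{
        SelectedModel: bestModel,
        Score:         bestSample,
        Method:        MethodThompson,
    }, nil
}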

// UpdateFeedback adjusts Beta parameters
func (s *ThompsonSelector) UpdateFeedback(userID, model string, success bool) {
    alpha, beta := s.getParams(userID, model)
    
    if success {
        s.setParams(userID, model, alpha+1, beta)
    } else {
        s.setParams(userID, model, alpha, beta+1)
    }
}

How It Works

1. Each model maintains a Beta distribution representing its success probability.
2. For each request, sample from each model's distribution.
3. Select the model with the highest sampled value.
4. Update the distribution based on feedback.

This approach automatically explores uncertain options while exploiting known good ones.
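
To see the loop converge, the sketch below simulates it against arms with fixed, hidden success rates. It is a toy under stated assumptions, not the project's selector: the names `arm`, `sampleBetaInt`, and `simulate` are hypothetical, every arm starts from a (1, 1) prior, and the Beta sampler only works for integer parameters because it builds each Gamma draw as a sum of exponential draws.

```go
package main

import "math/rand"

// arm holds integer Beta parameters plus the hidden rate used by the simulation.
type arm struct {
	alpha, beta int
	trueRate    float64
}

// gammaInt draws Gamma(k, 1) for integer k >= 1 as a sum of k Exp(1) draws.
func gammaInt(rng *rand.Rand, k int) float64 {
	sum := 0.0
	for i := 0; i < k; i++ {
		sum += rng.ExpFloat64()
	}
	return sum
}

// sampleBetaInt draws Beta(a, b) for integer a, b >= 1 as X / (X + Y).
func sampleBetaInt(rng *rand.Rand, a, b int) float64 {
	x := gammaInt(rng, a)
	y := gammaInt(rng, b)
	return x / (x + y)
}

// simulate runs Thompson Sampling for the given number of rounds and
// returns how many times each arm was selected.
func simulate(seed int64, trueRates []float64, rounds int) []int {
	rng := rand.New(rand.NewSource(seed))
	arms := make([]arm, len(trueRates))
	for i, r := range trueRates {
		arms[i] = arm{alpha: 1, beta: 1, trueRate: r} // uniform prior
	}
	pulls := make([]int, len(arms))
	for t := 0; t < rounds; t++ {
		// Steps 2 and 3: sample every arm, keep the largest draw.
		best, bestSample := 0, -1.0
		for i := range arms {
			if s := sampleBetaInt(rng, arms[i].alpha, arms[i].beta); s > bestSample {
				best, bestSample = i, s
			}
		}
		pulls[best]++
		// Step 4: observe a simulated outcome and update the chosen arm.
		if rng.Float64() < arms[best].trueRate {
			arms[best].alpha++
		} else {
			arms[best].beta++
		}
	}
	return pulls
}
```

Calling `simulate(42, []float64{0.3, 0.8}, 2000)` returns the selection counts for a weak and a strong arm; the strong arm should receive the large majority of selections, with the weak arm tried mostly in the early rounds.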

Configuration

decision:
  algorithm:
    type: thompson
    thompson:
      prior_alpha: 1.0        # Prior successes (optimistic: higher)
      prior_beta: 1.0         # Prior failures (pessimistic: higher)
      per_user: true          # Per-user personalization
      decay_factor: 0.1       # Decay old observations
      min_samples: 10         # Minimum samples before exploitation

models:
  - name: gpt-4
    backend: openai
  - name: gpt-3.5-turbo
    backend: openai
  - name: claude-3-opus
    backend: anthropic

Key Parameters

Parameter	Default	Description
`prior_alpha`	1.0	Prior successes; higher = more optimistic
`prior_beta`	1.0	Prior failures; higher = more pessimistic
`per_user`	false	Maintain separate distributions per user
`decay_factor`	0.0	Decay rate for old observations (0 = no decay)
`min_samples`	10	Minimum samples before full exploitation
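
This page does not specify how `decay_factor` is applied to the parameters. One plausible reading, shown here purely as an assumption, is to shrink the accumulated evidence toward the prior by a factor of (1 - decay_factor) before each update. The functions `decayTowardPrior` and `updateWithDecay` are hypothetical and may not match the project's behavior.

```go
package main

// decayTowardPrior shrinks a posterior parameter toward its prior value.
// ASSUMPTION: this is one possible interpretation of decay_factor, not the
// project's documented behavior. decay = 0 leaves the value unchanged;
// decay = 1 resets it to the prior.
func decayTowardPrior(value, prior, decay float64) float64 {
	return prior + (1-decay)*(value-prior)
}

// updateWithDecay decays both parameters, then records one outcome.
func updateWithDecay(alpha, beta, priorAlpha, priorBeta, decay float64, success bool) (float64, float64) {
	alpha = decayTowardPrior(alpha, priorAlpha, decay)
	beta = decayTowardPrior(beta, priorBeta, decay)
	if success {
		alpha++
	} else {
		beta++
	}
	return alpha, beta
}
```

Under this reading, the evidence beyond the prior can never exceed about 1 / decay_factor observations (10 for a value of 0.1), which keeps the distribution responsive when a model's quality drifts.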

Prior Settings

The prior (alpha, beta) shapes initial behavior:

Setting	Behavior
(1, 1)	Uniform prior - equal exploration
(2, 1)	Optimistic - assume models are good
(1, 2)	Pessimistic - assume models need proving
(10, 10)	Confident prior - slow to change
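
The effect of each prior can be quantified: after n observations the posterior mean is a weighted average of the prior mean and the observed success rate, with the prior carrying weight (α₀ + β₀) / (α₀ + β₀ + n). The sketch below computes both quantities; the function names are illustrative only.

```go
package main

// priorWeight returns the share of the posterior mean contributed by the
// prior after n observations: (alpha0 + beta0) / (alpha0 + beta0 + n).
func priorWeight(priorAlpha, priorBeta float64, n int) float64 {
	strength := priorAlpha + priorBeta
	return strength / (strength + float64(n))
}

// posteriorMean returns (alpha0 + successes) / (alpha0 + beta0 + successes + failures).
func posteriorMean(priorAlpha, priorBeta float64, successes, failures int) float64 {
	a := priorAlpha + float64(successes)
	b := priorBeta + float64(failures)
	return a / (a + b)
}
```

After 8 successes and 2 failures, the (1, 1) prior gives a posterior mean of 9/12 = 0.75, while the (10, 10) prior gives 18/30 = 0.60, because it still carries two thirds of the weight (20/30) compared with one sixth (2/12) for the uniform prior.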

Per-User Personalization

With per_user: true, each user gets their own model distributions:

thompson:
  per_user: true

This allows the system to learn that User A prefers GPT-4 while User B prefers Claude.
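
One way to back `getParams` and `setParams` is a map keyed by user and model, where the user part of the key is dropped when `per_user` is off. The sketch below is an assumption about how such storage could look; `paramStore`, `paramKey`, and their methods are hypothetical names, and the project's actual storage is not shown on this page.

```go
package main

import "sync"

// paramKey identifies one Beta distribution (hypothetical type).
type paramKey struct {
	userID string
	model  string
}

// paramStore keeps (alpha, beta) per key and is safe for concurrent use.
type paramStore struct {
	mu         sync.Mutex
	perUser    bool
	priorAlpha float64
	priorBeta  float64
	params     map[paramKey][2]float64
}

func newParamStore(perUser bool, priorAlpha, priorBeta float64) *paramStore {
	return &paramStore{
		perUser:    perUser,
		priorAlpha: priorAlpha,
		priorBeta:  priorBeta,
		params:     make(map[paramKey][2]float64),
	}
}

// key drops the user ID when per-user tracking is off, so all users share
// one distribution per model.
func (s *paramStore) key(userID, model string) paramKey {
	if !s.perUser {
		userID = ""
	}
	return paramKey{userID: userID, model: model}
}

// getLocked returns stored parameters or the prior; the caller holds s.mu.
func (s *paramStore) getLocked(k paramKey) (float64, float64) {
	if p, ok := s.params[k]; ok {
		return p[0], p[1]
	}
	return s.priorAlpha, s.priorBeta
}

// get returns the parameters for a user/model pair.
func (s *paramStore) get(userID, model string) (float64, float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.getLocked(s.key(userID, model))
}

// record applies one outcome as a single locked read-modify-write.
func (s *paramStore) record(userID, model string, success bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	k := s.key(userID, model)
	a, b := s.getLocked(k)
	if success {
		a++
	} else {
		b++
	}
	s.params[k] = [2]float64{a, b}
}
```

Doing the read and the write under one lock avoids lost updates when two feedback events for the same user and model arrive concurrently.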

Feedback Integration

Thompson Sampling updates via the feedback API:

# Positive feedback (success)
curl -X POST http://localhost:8080/api/v1/feedback \
  -H "Content-Type: application/json" \
  -d '{"request_id": "req-123", "model": "gpt-4", "rating": 1}'

# Negative feedback (failure)
curl -X POST http://localhost:8080/api/v1/feedback \
  -H "Content-Type: application/json" \
  -d '{"request_id": "req-456", "model": "gpt-4", "rating": -1}'
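
On the server side, a payload like the ones above has to be turned into the `success` flag that `UpdateFeedback` expects. The sketch below assumes a positive `rating` means success and a negative one means failure, matching the two examples; rejecting a rating of 0 is an additional assumption. `feedbackRequest` and `parseFeedback` are hypothetical names, not the project's handler.

```go
package main

import (
	"encoding/json"
	"errors"
)

// feedbackRequest mirrors the JSON fields used in the curl examples.
type feedbackRequest struct {
	RequestID string `json:"request_id"`
	Model     string `json:"model"`
	Rating    int    `json:"rating"`
}

// parseFeedback decodes a feedback body and maps the rating to a success flag.
// ASSUMPTION: rating > 0 is a success, rating < 0 a failure, 0 is rejected.
func parseFeedback(body []byte) (model string, success bool, err error) {
	var req feedbackRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return "", false, err
	}
	if req.Model == "" {
		return "", false, errors.New("feedback: missing model")
	}
	switch {
	case req.Rating > 0:
		return req.Model, true, nil
	case req.Rating < 0:
		return req.Model, false, nil
	default:
		return "", false, errors.New("feedback: rating must be non-zero")
	}
}
```

The returned model and flag can then be passed, together with the user ID, to `UpdateFeedback`.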

Best Practices

Start with uniform priors: (1, 1) unless you have prior knowledge
Enable per-user for personalization: Learn individual preferences
Use decay for non-stationary environments: When model quality changes
Set min_samples appropriately: Too low = premature exploitation
