Thompson Sampling Selection
Thompson Sampling is a Bayesian approach to the exploration-exploitation tradeoff. Here it is applied to LLM selection, framed as a multi-armed bandit problem: each model is an arm, and the algorithm balances trying less-tested models (exploration) against using models already known to perform well (exploitation).
Reference: A Tutorial on Thompson Sampling by Russo, Van Roy, Kazerouni, Osband & Wen. This comprehensive tutorial covers the theoretical foundations we apply here.
Algorithm Flow
Mathematical Foundation
Beta Distribution
Each model maintains a Beta distribution over its unknown success probability θ:
P(θ | α, β) = Beta(α, β)
where:
α = prior_alpha + successes
β = prior_beta + failures
θ = true success probability (unknown)
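The parameter definitions above can be sketched as a small data structure. This is a minimal illustration, not the project's actual implementation: the class name `BetaPosterior`, its field names, and the default uniform prior Beta(1, 1) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    """Hypothetical per-model posterior over the success probability θ."""

    prior_alpha: float = 1.0  # assumption: uniform prior Beta(1, 1)
    prior_beta: float = 1.0
    successes: int = 0
    failures: int = 0

    @property
    def alpha(self) -> float:
        # α = prior_alpha + successes
        return self.prior_alpha + self.successes

    @property
    def beta(self) -> float:
        # β = prior_beta + failures
        return self.prior_beta + self.failures

    def mean(self) -> float:
        # Posterior mean of a Beta(α, β) distribution: α / (α + β)
        return self.alpha / (self.alpha + self.beta)


posterior = BetaPosterior(successes=8, failures=2)
print(posterior.alpha, posterior.beta, posterior.mean())
```

With 8 successes and 2 failures on top of a Beta(1, 1) prior, the posterior is Beta(9, 3), whose mean is 9 / 12 = 0.75.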
Sampling Process
For each selection, sample from each model's posterior:
θ_i ~ Beta(α_i, β_i) for each model i
Select model: argmax_i(θ_i)
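The sampling step might look like the sketch below, which uses only the Python standard library (`random.Random.betavariate`). The function name `select_model`, the dictionary layout, and the model names are illustrative assumptions.

```python
import random


def select_model(posteriors: dict, rng: random.Random) -> str:
    """Draw θ_i ~ Beta(α_i, β_i) for every model and return the argmax.

    `posteriors` maps a model name to its (α, β) pair (hypothetical layout).
    """
    samples = {
        name: rng.betavariate(alpha, beta)
        for name, (alpha, beta) in posteriors.items()
    }
    # Select the model whose sampled θ is largest.
    return max(samples, key=samples.get)


rng = random.Random(0)  # seeded only to make the example repeatable
posteriors = {
    "model_a": (9.0, 3.0),  # hypothetical: 8 successes, 2 failures, Beta(1, 1) prior
    "model_b": (1.0, 1.0),  # hypothetical: untested model, prior only
}
print(select_model(posteriors, rng))
```

Because the choice depends on random draws rather than on posterior means, an untested model with a wide posterior still wins some selections, which is where the exploration comes from.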
Bayesian Update
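After feedback arrives, update only the selected model's distribution. Because the Beta distribution is conjugate to Bernoulli (success/failure) feedback, the update is a simple increment: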
If success: α' = α + 1
If failure: β' = β + 1
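The update rule, combined with the sampling step, gives a complete select-observe-update loop. The following is a sketch under stated assumptions: the `true_rates` values are made-up simulation inputs standing in for real feedback, every model starts from a Beta(1, 1) prior, and the names `update` and `simulate` are hypothetical.

```python
import random


def update(alpha: float, beta: float, success: bool) -> tuple:
    """Return the updated (α, β): α + 1 on success, β + 1 on failure."""
    return (alpha + 1, beta) if success else (alpha, beta + 1)


def simulate(true_rates: dict, rounds: int, seed: int = 0):
    """Run Thompson Sampling against simulated Bernoulli feedback."""
    rng = random.Random(seed)
    posteriors = {name: (1.0, 1.0) for name in true_rates}  # assumed Beta(1, 1) prior
    pulls = {name: 0 for name in true_rates}
    for _ in range(rounds):
        # Sample θ_i from each posterior and pick the argmax.
        samples = {n: rng.betavariate(a, b) for n, (a, b) in posteriors.items()}
        chosen = max(samples, key=samples.get)
        # Simulated feedback; in practice this comes from real usage.
        success = rng.random() < true_rates[chosen]
        # Only the selected model's distribution is updated.
        a, b = posteriors[chosen]
        posteriors[chosen] = update(a, b, success)
        pulls[chosen] += 1
    return posteriors, pulls


posteriors, pulls = simulate({"model_a": 0.9, "model_b": 0.3}, rounds=2000)
print(pulls)
```

Over many rounds, selections should concentrate on the model with the higher simulated success rate, while the weaker model is still sampled occasionally.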