Explore-then-commit algorithm

Multi-armed bandit problem

The multi-armed bandit problem is a sequential game in which a player has to choose, at each turn, between <math>K</math> actions (arms). Behind every arm <math>a</math> there is an unknown distribution <math>\nu_a</math> that lies in a set <math>\mathcal{D}</math> known to the player (for example, <math>\mathcal{D}</math> can be the set of Gaussian distributions or of Bernoulli distributions).

At each turn <math>t</math> the player chooses (pulls) an arm <math>a_t</math> and then receives an observation <math>X_t</math> drawn from the distribution <math>\nu_{a_t}</math>.
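The interaction above can be sketched as a small simulator. This is an illustrative sketch, not part of the problem statement: the class name is made up, and Gaussian arms are assumed as one possible choice of <math>\mathcal{D}</math>.

```python
import random

class GaussianBandit:
    """K-armed bandit where each arm a has a hidden Gaussian distribution nu_a."""

    def __init__(self, means, sigma=1.0):
        self.means = means               # mu_a for each arm (hidden from the player)
        self.sigma = sigma
        self.pulls = [0] * len(means)    # N_a(t): number of pulls of each arm so far

    def pull(self, a):
        """The player pulls arm a_t = a and observes X_t ~ nu_{a_t}."""
        self.pulls[a] += 1
        return random.gauss(self.means[a], self.sigma)
```

The player only sees the returned observations; the `means` attribute stands for the unknown quantities the algorithm must learn about.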

Regret minimization

The goal is to minimize the regret at time <math>T</math>, defined as <math>R_T := \sum_{a=1}^K \Delta_a \mathbb{E}[N_a(T)]</math> where
* <math>\mu_a := \mathbb{E}[\nu_a]</math> is the mean of arm <math>a</math>
* <math>\mu^* := \max_a \mu_a</math> is the highest mean
* <math>\Delta_a := \mu^* - \mu_a</math> is the gap of arm <math>a</math>
* <math>N_a(t)</math> is the number of pulls of arm <math>a</math> up to turn <math>t</math>
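Given the arm means and the pull counts, the regret formula can be evaluated term by term. A hypothetical helper for illustration, using realized pull counts in place of the expectations <math>\mathbb{E}[N_a(T)]</math>:

```python
def regret(means, pulls):
    """Evaluate R_T = sum_a Delta_a * N_a(T) for given means mu_a and counts N_a(T).

    means: list of mu_a for each arm.
    pulls: list of N_a(T), the number of pulls of each arm up to time T.
    """
    mu_star = max(means)  # mu* = max_a mu_a, the highest mean
    # Each pull of a suboptimal arm contributes its gap Delta_a = mu* - mu_a.
    return sum((mu_star - mu_a) * n_a for mu_a, n_a in zip(means, pulls))
```

For instance, with means 0.5 and 0.9 and 10 pulls of the worse arm out of 100, the regret is <math>0.4 \times 10 = 4</math>.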

The player has to find an algorithm that chooses at each turn <math>t</math> which arm to pull based on the previous actions and observations <math>(a_s,X_s)_{s < t}</math> to minimize the regret <math>R_T</math>.

This is a trade-off problem between exploration (finding the arm with the highest mean) and exploitation (playing the arm which is perceived to be the best as much as possible).
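The explore-then-commit strategy of the title resolves this trade-off in the simplest possible way: explore every arm a fixed number of times, then commit to the empirically best arm for the rest of the horizon. A minimal sketch, assuming a `bandit_pull` function supplied by the environment (the names and signature are illustrative, not canonical):

```python
def explore_then_commit(bandit_pull, K, T, m):
    """Explore-then-commit: pull each of the K arms m times (exploration),
    then play the arm with the highest empirical mean for the remaining
    T - K*m rounds (exploitation). Returns the arm committed to."""
    sums = [0.0] * K
    # Exploration phase: round-robin, m pulls per arm.
    for i in range(K * m):
        a = i % K
        sums[a] += bandit_pull(a)
    # Commit phase: play the empirically best arm for the rest of the horizon.
    best = max(range(K), key=lambda a: sums[a] / m)
    for _ in range(K * m, T):
        bandit_pull(best)
    return best
```

The exploration length <math>m</math> controls the trade-off: a small <math>m</math> risks committing to a suboptimal arm, while a large <math>m</math> wastes pulls on arms already known to be worse.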
