Reinforcement Learning: An Introduction by Sutton and Barto[1] is a book that is universally recommended to beginners in their RL studies. The first chapter is an extended text-heavy introduction. The second chapter deals with multiarmed bandits, i.e. slot machines with multiple arms, and is the subject of today’s post.

Before getting into the what and how of bandits, I’d like to address the why, since the “why” can guard against getting lost in the details / not seeing the forest for the trees.

Why discuss multiarmed bandits?

RL treats the problem of trying to achieve a goal in an environment where an agent is not instructed about which actions to take to achieve that goal, in contrast to supervised learning problems. Learning the best actions to take is a complicated problem, since the best actions depend on what state an agent is in, e.g. an agent trying to get to a goalpost east of its current location as quickly as possible may find that moving east is a generally good policy, but not if there is a fire-breathing dragon in the way, in which case, it might make sense to move up or down to navigate around the obstacle.

Multiarmed bandits are simpler problem: a single state system. No matter which action an agent takes, i.e. which slot machine arm the agent pulls, the agent ends up back in the same state; the distribution of rewards as a consequence of the agent’s action remains the same, assuming a stationary distribution of rewards, and actions have no effect on subsequent states or rewards. This simple case study is useful for building intuition and introducing RL concepts that will be expanded on in later chapters of [1].

Key RL concepts introduced by the multiarmed bandit problem

The nature of the problem

Agent has a goal: In RL and multiarmed bandit problems, we want to figure out the strategy, or “policy” in RL lingo, that will maximize our rewards. For the simple bandit problem, this goal is equivalent to maximizing the reward — literally, money! — for each arm pull.

Unlike supervised learning, no ground truth is supplied: Each slot has a different distribution of rewards, but the agent playing the machine does not know that distribution. Instead, the agent has to try different actions and evaluate how good the actions are. The goodness of an action is straightforwardly determined by its immediate reward in the bandit case.

Exploration vs. exploitation: Based on a few trials, one arm may appear to yield the highest rewards, but the agent may decide to try others occasionally to improve its estimates of the rewards, an example of balancing exploration and exploitation. The various algorithms handle exploration vs. exploitation differently, but this example introduces one method that is simple but widely-used in practice: the epsilon-greedy algorithm, which takes greedy actions most of the time (exploits) but takes random actions (explores) a fraction epsilon of the time.

Different approaches to learning a policy

model-free: All the strategies discussed in [1] for solving the bandit problem are “model-free” strategies. In real world applications, a model of the world is rarely available, and the agent has to figure out how to act based on sampled experience, and the same applies to the bandit case; even though bandits are a simpler single state system (we don’t have to model transitions from state to state), an agent still does not know the model that generates the probability of a reward \(r\) given an action \(a\), \(P(r|a)\) and has to figure that out from trial and error.

There are model-based algorithms that attempt to model the environment’s transition dynamics from data, but many popular algorithms today are model-free because of the difficulty of modeling those dynamics.

Learning action-values

The bandit problem introduces the idea of estimating the expected value associated with each action, namely the action-value function in RL terms. The concept is very intuitive — as an agent pulls on different bandit arms, it will accumulate rewards associated with each arm. A simple way to estimate the expected value per arm is just to average the rewards generated by pulling on each slot. The policy that follows is then implicit, namely, take the action / pull on the arm with the highest estimated action-value!

Historically, RL formalism has dealt with estimating value functions and using them to figure out a policy, which includes the Q-Learning (“Q” stands for action-value!) approach we mentioned in our earlier post.

Learning policies directly

[1] also use the bandit problem to introduce a type of algorithm that approaches the problem, not indirectly by learning a value function and deriving the policy from those value functions, but by parameterizing the policy directly and learning the parameters that optimize the rewards. This class of algorithm is a “policy gradient method” and is very popular today for its nice convergence properties. After the foreshadowing in the bandit problem, policy gradients only reappear very late in [1] — chapter 13!

We now provide code for concreteness.

Ground truth is hidden in our multiarmed bandit

The Bandit class initializes a multiarmed bandit. The distribution of rewards per arm follows a Gaussian distribution with some mean dollar amount.

class Bandit:
    """N-armed bandit with stationary distribution of rewards per arm.
    Each arm (action) is identified by an integer.
    def __init__(self, n_arms: int, mu: float, sigma: float):
        self.n_arms = n_arms
        self.std = sigma
        # a dict of the mean action_value per arm, w/ each action_value sampled from a Gaussian
        self.action_values = {k: s for k, s in enumerate(np.random.normal(mu, sigma, n_arms))}
        self.actions = list(self.action_values.keys())  # arms of the bandit

    def __call__(self, action: int) -> float:
        """Get reward from bandit for action"""
        return np.random.normal(self.action_values[action], self.std)

Implementation detail: the means per arm, stored in self.action_values, are drawn from a Gaussian distribution upon initialization).

The agent doesn’t know the true mean rewards per arm — it only sees a sample reward when he takes the action of pulling on a particular bandit arm (__call__).

Action, reward, update strategy

For every action the agent takes, it gets a reward. With each additional interaction with the bandit, the agent has a new data point it can use to update its strategy (whether indirectly, via an updated action-value estimate, or directly in the policy gradient).

class BaseBanditAlgo(ABC):
    """Base class for algorithms to maximize the rewards 
    for the multiarmed bandit problem"""
    def __init__(self, bandit: Bandit):
        self.bandit = bandit
        self.timestep = 0
        self.rewards = []

    def _select_action(self) -> int:

    def _update_for_action_and_reward(self, action: int, reward: float):

    def run(self) -> float:
        action = self._select_action()
        reward = self.bandit(action)
        self._update_for_action_and_reward(action, reward)
        return reward

    def __call__(self, n_timesteps: int):
        for i in range(n_timesteps):
            self.timestep += 1

Two types of strategies: value based and policy based

  1. value based - agents try to directly estimate the value of each action (and whose policies, i.e. probability of selecting an action, are therefore implicit, since the agent will want to choose the action that has the highest value)
  2. policy based - agents don’t try to directly estimate the value of an action and instead directly store the policy, i.e. the probability of taking each action.

An example of a value based strategy / action-value method for the bandit problem is the EpsilonGreedy approach, which selects the action associated with the highest estimated action-value with probability \(1-\epsilon\), but chooses a random arm a fraction \(\epsilon\) of the time as part of its exploration strategy.

class EpsilonGreedy(BaseEstimateActionValueAlgo):
    """Greedy algorithm that explores/samples from the non-greedy action some fraction, 
    epsilon, of the time.

    - For a basic greedy algorithm, set epsilon = 0.
    - For optimistic intialization, set q_init > mu, the mean of the Gaussian from
      which the real values per bandit arm are sampled (default is 0).
    def __init__(self, bandit: Bandit, epsilon: float, **kwargs):
        super().__init__(bandit, **kwargs)
        self.epsilon = epsilon

    def _select_action(self) -> int:
        if np.random.sample() < self.epsilon:
            # take random action
            a = np.random.choice(self.bandit.actions)
            # take greedy action
            a = max(self.est_action_values, key=lambda key: self.est_action_values[key])
        return a

(See end of post for additional action-value methods.)

An example of a policy based strategy is the GradientBandit method, which stores its policy, the probability per action in self.preferences. It learns these preferences by doing stochastic gradient ascent along the preferences in the gradient of the expected reward in _update_for_action_and_reward (see [1] for derivation).

class GradientBandit(BaseBanditAlgo):
    """Algorithm that does not try to estimate action values directly and, instead, tries to learn
    a preference for each action (equivalent to stochastic gradient ascent along gradient in expected
    reward over preferences).
    def __init__(self, bandit: Bandit, alpha: float):
        self.alpha = alpha  # step-size
        self.reward_baseline_avg = 0
        self.preferences = {action: 0 for action in bandit.actions}

    def _calc_probs_from_preferences(self):
        """Probabilities per action follow a Boltzmann distribution over the preferences """
        exp_preferences_for_action = {action: np.exp(v) for action, v in self.preferences.items()}
        partition_fxn = sum(exp_preferences_for_action.values())
        self.probabilities_for_action = OrderedDict({action: v / partition_fxn for action, v in 

    def _select_action(self) -> int:
        return np.random.choice(list(self.probabilities_for_action.keys()), 

    def _update_for_action_and_reward(self, action, reward):
        """Update preferences"""
        reward_diff = reward - self.reward_baseline_avg

        # can we combine these updates into single expression using kronecker delta?
        self.preferences[action] += self.alpha * reward_diff * (1 - self.probabilities_for_action[action])
        for a in self.bandit.actions:
            if a == action:
                self.preferences[a] -= self.alpha * reward_diff * self.probabilities_for_action[a]

        self.reward_baseline_avg += 1/self.timestep * reward_diff

Extra: Total rewards for different bandit algorithms

We have discussed a bunch of different bandit algorithms, but haven’t seen what rewards they yield in practice!

In this Jupyter notebook, we run the algorithms through a range of values for their parameters to compare their cumulative rewards across 1000 timesteps (also averaged across many trials of different bandits to smooth things out). In the end, we arrive at a plot of the parameter study, that reproduces Figure 2.6 in [1].

![parameter study]({static}/images/reproduce_multiarmed_bandit_parameter_study.png)


[1] Sutton and Barto - Reinforcement Learning: An Introduction (2nd Edition)

Like this post? Share on: Twitter | Facebook | Email

Cathy Yeh Avatar Cathy Yeh Cathy Yeh got a PhD at UC Santa Barbara studying soft-matter/polymer physics. After stints in a translational modeling group at Pfizer in San Diego, Driver (a no longer extant startup trying to match cancer patients to clinical trials) in San Francisco, and Square, she is currently participating in the OpenAI Scholars program.



Machine learning