Deep reinforcement learning algorithms can be hard to debug, so in the absence of a stack trace it helps to visualize as much as possible. How do we know whether the learned policy and value functions make sense? Plotting these quantities in real time as an agent interacts with an environment can help us answer that question.
Here’s an example of an agent wandering around a custom gridworld environment. When the agent executes the `toggle` action in front of an unopened red gift, it receives a reward of 1 point, and the gift turns grey/inactive.
The model is an actor-critic, a type of policy gradient algorithm (for a nice introduction, see Jonathan’s battleship post) that uses a neural network to parametrize its policy and value functions.
This agent barely “meets expectations” (notably, it gets stuck at an opened gift between frames 5 and 35), but the values and policy mostly make sense. For example, we tend to see spikes in value when the agent is immediately in front of an unopened gift, while the policy simultaneously assigns a much higher probability to the appropriate `toggle` action. (We’d achieve better performance by incorporating some memory into the model, e.g. in the form of an LSTM.)
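To make the actor-critic parametrization concrete, here is a minimal numpy sketch of a network with a shared hidden layer feeding a softmax policy head and a scalar value head. The layer sizes, weight initialization, and `get_action` name are illustrative assumptions, not the trained model from the video.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a flattened observation and a small discrete action space.
OBS_DIM, HIDDEN, N_ACTIONS = 16, 32, 4

# One shared hidden layer feeding two heads: a policy head (softmax over
# actions) and a value head (a single scalar). Weights are random here;
# in practice they would come from training.
W_shared = rng.normal(scale=0.1, size=(OBS_DIM, HIDDEN))
W_policy = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))
W_value = rng.normal(scale=0.1, size=(HIDDEN, 1))

def get_action(obs):
    """Return (action, policy probabilities, scalar value) for one observation."""
    h = np.tanh(obs @ W_shared)
    logits = h @ W_policy
    policy = np.exp(logits - logits.max())   # stable softmax
    policy /= policy.sum()
    value = float((h @ W_value)[0])
    action = int(rng.choice(N_ACTIONS, p=policy))
    return action, policy, value

action, policy, value = get_action(rng.normal(size=OBS_DIM))
```

Sampling the action from the policy (rather than taking the argmax) matches how a stochastic policy is typically executed during training and evaluation.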
We’re sharing some helper code to generate the matplotlib plots of the value and policy functions that are shown in the video.
- Training of the model is not included. You’ll need to load a trained actor-critic model, along with access to its policy and value functions for plotting. Here, the trained model has been loaded into an object with a `get_action` method that returns the `action` to take, along with a numpy array of `policy` probabilities and a scalar `value` for the observation at the current time step.
- The minigridworld environment conforms to the OpenAI gym API, and the `for` loop is a standard implementation for interacting with the environment.
- The gridworld environment already has a built-in method for rendering the environment in interactive mode.
- matplotlib’s `relim` function, together with autoscaling, is used to update the figures at each step. In particular, this allows us to show what appears to be a sliding window over time of the value function line plot. When running the script, the plots pop up as three separate figures.
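Putting the pieces above together, here is a minimal sketch of the interaction-and-plotting loop. The gridworld environment and trained model are replaced by assumed stand-ins (`StubEnv` and a placeholder `get_action`), and the headless Agg backend is used so the script runs without a display; interactively you would render the real environment and call `plt.pause` between steps.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this when running interactively
import matplotlib.pyplot as plt

N_ACTIONS, WINDOW = 4, 50  # sliding-window length for the value line plot

class StubEnv:
    """Stand-in with the gym API shape (reset/step); replace with the gridworld."""
    def reset(self):
        return np.zeros(4)
    def step(self, action):
        obs = np.random.default_rng(action).normal(size=4)
        return obs, 0.0, False, {}

def get_action(obs):
    """Placeholder for the trained model's method (action, policy, value)."""
    policy = np.full(N_ACTIONS, 1.0 / N_ACTIONS)
    return 0, policy, float(obs.sum())

# One figure for the value line plot, one for the policy bar chart.
value_fig, value_ax = plt.subplots()
value_line, = value_ax.plot([], [])
policy_fig, policy_ax = plt.subplots()
policy_bars = policy_ax.bar(range(N_ACTIONS), np.zeros(N_ACTIONS))
policy_ax.set_ylim(0, 1)

env = StubEnv()
obs = env.reset()
values = []
for step in range(100):
    action, policy, value = get_action(obs)
    values.append(value)

    # Show only the last WINDOW values so the line plot scrolls over time.
    recent = values[-WINDOW:]
    xs = list(range(len(values) - len(recent), len(values)))
    value_line.set_data(xs, recent)
    value_ax.relim()           # recompute data limits from the new line data...
    value_ax.autoscale_view()  # ...and rescale the axes, giving the sliding window

    # Update the policy bar heights in place.
    for bar, p in zip(policy_bars, policy):
        bar.set_height(p)

    # Interactively you'd call fig.canvas.draw_idle() and plt.pause(0.01) here.
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
```

Calling `relim` before `autoscale_view` is what makes the axes follow the moving window: `set_data` alone does not refresh the axes’ cached data limits.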
- Berkeley Deep RL Bootcamp, Core Lecture 6: Nuts and Bolts of Deep RL Experimentation, by John Schulman (video | slides): great advice on the debugging process and what to plot
- OpenAI Spinning Up: Intro to Policy Optimization