<p><em>EFAVDB: Everybody’s Favorite Data Blog</em></p>
<h2>Utility engines</h2>
<p>2020-09-13 · Jonathan Landy</p>
<p>A person’s happiness does not depend only on their current lot in life, but
also on the rate of change of their lot. This is because a person’s prior
history informs their expectations. Here, we build a model that highlights
this emotional “path-dependence” quality of utility. Interestingly, we find
that this quality can be gamed: one can increase net happiness via gradual
deprivation punctuated by sudden jolts of increased consumption, as shown in
the cartoon below. Within our model this is the optimal strategy; in
particular, it beats steady consumption.</p>
<p align="center">
<img src="images/engine.png">
</p>
<h3 id="the-utility-function-model">The utility function model</h3>
<p>In this post, we assume that the utility realized by a person over a time <span class="math">\(T\)</span>
is given by
</p>
<div class="math">\begin{eqnarray} \tag{1} \label{1} U(T) = \int_0^T \left (a \vert
x^{\prime}(t) \vert + b \vert x^{\prime}(t) \vert^2 \right)
\text{sign}(x^{\prime}(t)) dt.
\end{eqnarray}</div>
<p>
Here, <span class="math">\(x(t)\)</span> is a measure of consumption (one’s “lot in life”) at time <span class="math">\(t\)</span>.
The model is a natural Taylor expansion, relevant for small changes in
<span class="math">\(x(t)\)</span>. The integrand is positive when consumption is rising and negative
when it is falling. We ask whether varying <span class="math">\(x(t)\)</span>,
subject to a fixed average constraint, can increase net happiness
relative to the steady-state consumption solution. The answer is yes, and this
can be understood qualitatively by considering the two terms above:</p>
<ul>
<li>
<p>First term, <span class="math">\(a \vert x^{\prime}(t) \vert \text{sign}(x^{\prime}(t))\)</span>: This
term is proportional to the rate of change of consumption.
We assume that <span class="math">\(a > 0\)</span>, so the term is negative while consumption
declines and positive while it recovers. Because we will be interested in
repeating cycles, we assume that <span class="math">\(x(t)\)</span> is periodic with period <span class="math">\(T\)</span>. In this
case, the linear term — while possibly acutely felt at each moment — will
integrate to zero over each period.</p>
</li>
<li>
<p>Second term, <span class="math">\(b \vert x^{\prime}(t) \vert^2 \text{sign}(x^{\prime}(t))\)</span>: This
term is non-linear. It is very weak for small rates of change but kicks in
strongly during abrupt changes. We again assume that <span class="math">\(b>0\)</span>, so that
sudden drops in consumption are very painful and sudden gains are a strong boost.</p>
</li>
</ul>
<p>With the above comments in place, we can now see how our figure gives a net
gain in utility: On average, only the quadratic term matters and this will
effectively only contribute during sudden jumps. The declines in our figure
are gradual, and so contribute only weakly in this term. However, the
increases are sudden and each give a significant utility “fix” as a consequence.</p>
<p>For those interested, we walk through the simple mathematics needed to exactly
optimize our utility function in an appendix. Concluding comments on the
practical application of these ideas are covered next.</p>
<h3 id="practical-considerations">Practical considerations</h3>
<p>A few comments:</p>
<ul>
<li>
<p>Many people treat themselves on occasion — with chocolates, vacations, etc.
— perhaps empirically realizing that varying things improves their long
term happiness. It is interesting to consider the possibility of optimizing
this effect, which we do with our toy model here: In this model, we do not
want to live in a steady state salted with occasional treats: Instead, we
want the saw-tooth shape of consumption shown in our figure.</p>
</li>
<li>
<p>A sad fact of life is that progress tends to be gradual, while set backs tend
to occur suddenly — e.g., stocks tend to move in patterns like this. This
is the worst way things could go, according to our model.</p>
</li>
<li>
<p>True, human utility functions are certainly more complex than what we have
considered here.</p>
</li>
<li>
<p>It is interesting to contrast models of utility with conservative physical
systems, where the energy of a state is not path dependent, but depends only
on the current state. Path dependence means that two identical people in the
same current situation can have very different valuations of their lot in life.</p>
</li>
</ul>
<p>The appendix below discusses the mathematical optimization of (\ref{1}).</p>
<h3 id="appendix-optimizing-ref1">Appendix — optimizing (\ref{1})</h3>
<p>For simplicity, we consider a path that goes down from <span class="math">\(t=0\)</span> to <span class="math">\(t_0\)</span> — making
its way down by <span class="math">\(\Delta x\)</span>, then goes back up to where it started from <span class="math">\(t_0\)</span> to
<span class="math">\(T\)</span>. It is easy to see that the first term integrates to zero in this case,
provided we start and end at the same value of <span class="math">\(x\)</span>. Now, consider the second
term. On the way down, we have
</p>
<div class="math">\begin{eqnarray}
\int_0^{t_0} \vert x^{\prime} \vert^2 dt &\equiv & t_0 \langle \vert x^{\prime}
\vert^2 \rangle_{t_0}
\\ &\geq & t_0 \langle \vert x^{\prime} \vert \rangle^2_{t_0}
\\
&=& t_0 \left( \frac{\Delta x}{t_0} \right)^2
\tag{2}
\end{eqnarray}</div>
<p>
The inequality here is equivalent to the statement that the variance of the
rate of change of our consumption is non-negative. We get equality — and minimal
loss from the quadratic term on the way down — if the slope is constant
throughout. That is, we want a linear drop in <span class="math">\(x(t)\)</span> from <span class="math">\(0\)</span> to <span class="math">\(t_0\)</span>. With
this choice, we get
</p>
<div class="math">\begin{eqnarray}
\int_0^{t_0} \vert x^{\prime} \vert^2 dt = \frac{\Delta x^2}{t_0}. \tag{3}
\end{eqnarray}</div>
<p>
On the way back up, the quadratic term contributes positively, so we now want
to maximize the analogous integral. This is achieved by making the recovery as
quick as possible, say over a window of time <span class="math">\(t_{r}\)</span>, with <span class="math">\(r\)</span> standing for
recovery. We further decrease the loss on the way down by stretching out the
decline, taking <span class="math">\(t_0 \to T\)</span>. In this limit, our integral of
the quadratic over all time goes to
</p>
<div class="math">\begin{eqnarray}
\text{gain} = b \Delta x^2 \left (\frac{1}{t_r} - \frac{1}{T} \right)
\tag{4}.
\end{eqnarray}</div>
<p>
This gives the optimal lift possible — that realized by the saw-tooth approach
shown in our first figure above. </p>
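The optimality of the saw-tooth can also be checked numerically. Below is a minimal sketch, assuming <span class="math">\(a = b = 1\)</span> and an illustrative recovery window, that discretizes equation (1) and compares the saw-tooth utility against the gain predicted by equation (4):

```python
import numpy as np

def utility(x, dt, a=1.0, b=1.0):
    """Discretized equation (1): integral of (a|x'| + b|x'|^2) sign(x')."""
    v = np.diff(x) / dt  # consumption rate x'(t)
    return np.sum((a * np.abs(v) + b * np.abs(v) ** 2) * np.sign(v)) * dt

T, n = 1.0, 10_000
dt = T / n
t = np.linspace(0.0, T, n + 1)

t_r = 0.05       # recovery window (illustrative choice)
dx = 1.0         # size of the decline
t0 = T - t_r     # duration of the gradual decline

# Saw-tooth path: linear decline over [0, t0], fast linear recovery over [t0, T]
x_saw = np.where(t < t0, -dx * t / t0, -dx + dx * (t - t0) / t_r)

u_saw = utility(x_saw, dt)
u_steady = utility(np.zeros_like(t), dt)      # steady consumption: zero utility
gain = dx ** 2 * (1.0 / t_r - 1.0 / t0)       # equation (4) with b = 1, t0 = T - t_r
```

With these numbers the discretized saw-tooth utility matches the analytic gain of equation (4) closely, while the steady path scores zero.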
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
<h2>2 + 1 = 4, by quinoa</h2>
<p>2020-07-03 · Jonathan Landy</p>
<p align="center">
<img src="images/quinoa.jpg">
</p>
<p>I was struck the other day by the following: The cooking instructions on my
Bob’s tri-colored quinoa package said to combine 2 cups of water with 1 cup of
dried quinoa, which would ultimately create 4 cups of cooked quinoa. See image above.</p>
<p>My first reaction was to believe that some error had been made. However, I
then realized that the explanation was packing: When one packs spheres or
other awkward solid shapes into a container, they cannot fill the
space completely. Little pockets of air sit between the spheres. A quick
Google search for the packing fraction of spheres gives a value of about <span class="math">\(0.74\)</span> for a
crystalline structure and about <span class="math">\(0.64\)</span> for random packings — apparently a
universal law.</p>
<p>We can get a similar number out of my quinoa instructions: Suppose that
before the quinoa is cooked, the water fills its volume completely. After
cooking, however, the water is absorbed into the quinoa and forced to share its
packing fraction. The quinoa itself occupies the same volume before and after
cooking, so the water must be responsible for the volume growth: its effective
volume went from 2 cups to 3, or
</p>
<div class="math">\begin{eqnarray} \tag{1} \label{1}
2 = \rho \times 3,
\end{eqnarray}</div>
<p>
where <span class="math">\(\rho\)</span> is the packing fraction of the quinoa “spheres”. We conclude that
the packing fraction is <span class="math">\(\rho = 2/3\)</span>, very close to the googled value of <span class="math">\(\rho
= 0.64\)</span>.</p>
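The arithmetic behind equation (1) can be spelled out in a couple of lines:

```python
# Packing-fraction estimate implied by the cooking instructions:
# 2 cups water + 1 cup dry quinoa -> 4 cups cooked quinoa.
water, dry, cooked = 2.0, 1.0, 4.0

# The quinoa still occupies 1 cup, so the absorbed water fills the remaining
# region at the quinoa packing fraction rho: water = rho * (cooked - dry).
rho = water / (cooked - dry)

print(rho)  # 2/3, or about 0.667, close to the ~0.64 random packing value
```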
<h2>Long term credit assignment with temporal reward transport</h2>
<p>2020-06-29 · Cathy Yeh</p>
<h1 id="summary">Summary</h1>
<p>Standard reinforcement learning algorithms struggle with poor sample efficiency in the presence of sparse rewards with long temporal delays between action and effect. To address the long term credit assignment problem, we build on the work of [1] to use “temporal reward transport” (<span class="caps">TRT</span>) to augment the immediate rewards of significant state-action pairs with rewards from the distant future using an attention mechanism to identify candidates for <span class="caps">TRT</span>. A series of gridworld experiments show clear improvements in learning when <span class="caps">TRT</span> is used in conjunction with a standard advantage actor critic algorithm.</p>
<h1 id="introduction">Introduction</h1>
<p>Episodic reinforcement learning (<span class="caps">RL</span>) models the interaction of an agent with an environment as a Markov Decision Process with a finite number of time steps <span class="math">\(T\)</span>. The environment dynamics <span class="math">\(p(s’,r|s, a)\)</span> are modeled as a joint probability distribution over the next state <span class="math">\(s'\)</span> and reward <span class="math">\(r\)</span> picked up along the way given the previous state <span class="math">\(s\)</span> and action <span class="math">\(a\)</span>. In general, the agent does not have access to an exact model of the environment.</p>
<p>The agent’s goal is to maximize its cumulative rewards, the discounted returns <span class="math">\(G_t\)</span>,</p>
<div class="math">\begin{eqnarray}\label{return} \tag{1}
G_t := R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}
\end{eqnarray}</div>
<p>where <span class="math">\(0 \leq \gamma \leq 1\)</span>, and <span class="math">\(R_{t}\)</span> is the reward at time <span class="math">\(t\)</span>. In episodic <span class="caps">RL</span>, the discount factor <span class="math">\(\gamma\)</span> is often used to account for uncertainty in the future, to favor rewards now vs. later, and as a variance reduction technique, e.g. in policy gradient methods [2, 3].</p>
<p>Using a discount factor <span class="math">\(\gamma < 1\)</span> introduces a timescale: a reward <span class="math">\(n\)</span> timesteps in the future is exponentially suppressed by a factor <span class="math">\(\exp(-n/\tau_{\gamma})\)</span>. The number of timesteps it takes for a reward to decay by a factor of <span class="math">\(1/e\)</span> is <span class="math">\(\tau_{\gamma} = 1/(1-\gamma)\)</span>, which follows from writing <span class="math">\(\gamma = 1 - 1/\tau_{\gamma}\)</span> and comparing with the limit in (\ref{discount-timescale})</p>
<div class="math">\begin{align}\label{discount-timescale}\tag{2}
\gamma ^ n \approx \frac{1}{e} = \lim_{n \rightarrow \infty} \left(1 - \frac{1}{n} \right)^n
\end{align}</div>
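As a quick numerical check of this timescale (a sketch, using the <span class="math">\(\gamma=0.99\)</span> value from our experiments below):

```python
import math

gamma = 0.99
tau = 1.0 / (1.0 - gamma)   # discount timescale, ~100 steps for gamma = 0.99

# A reward tau steps in the future is suppressed by roughly gamma^tau ~ 1/e
suppression = gamma ** tau
print(tau, suppression, math.exp(-1.0))
```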
<hr/>
<p>The state value function <span class="math">\(v_{\pi}(s)\)</span> is the expected return when starting in state <span class="math">\(s\)</span>, following policy <span class="math">\(\pi(a|s) := p(a|s)\)</span>, a function of the current state.</p>
<div class="math">\begin{eqnarray}\label{state-value} \tag{3}
v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]
\end{eqnarray}</div>
<p>Policy gradient algorithms improve the policy by using gradient ascent along the gradient of the value function.</p>
<div class="math">\begin{eqnarray}\label{policy-gradient} \tag{4}
\nabla_{\theta} v_\pi(s_0) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(A_t | S_t) \mathcal{R}(\tau)\right],
\end{eqnarray}</div>
<p>where <span class="math">\(\tau \sim \pi\)</span> describes the agent’s trajectory following policy <span class="math">\(\pi\)</span> beginning from state <span class="math">\(s_0\)</span>, and <span class="math">\(\mathcal{R}(\tau)\)</span> is a function of the rewards obtained along the trajectory. In practice, policy gradients approximate the expected value in (\ref{policy-gradient}) by sampling, which results in very high variance estimates of the gradient.</p>
<p>Common techniques to reduce the variance of the estimated policy gradient include [2]</p>
<ol>
<li>only assigning credit for rewards (the “rewards-to-go”) accumulated after a particular action was taken instead of crediting the action for all rewards from the trajectory.</li>
<li>subtracting a baseline from the rewards weight that is independent of action. Oftentimes, this baseline is the value function in (\ref{state-value}).</li>
<li>using a large batch size.</li>
<li>using the value function (\ref{state-value}) to bootstrap the returns some number of steps into the future instead of using the full raw discounted return, giving rise to a class of algorithms called actor critics that learn a policy and value function in parallel. For example, one-step bootstrapping would approximate the discounted returns in (\ref{return}) as</li>
</ol>
<div class="math">\begin{eqnarray}\label{bootstrap} \tag{5}
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \approx R_{t+1} + \gamma V(S_{t+1}),
\end{eqnarray}</div>
<p>where <span class="math">\(V(S_{t+1})\)</span> is the estimate of the value of state <span class="math">\(S_{t+1}\)</span> (\ref{state-value}).</p>
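Techniques 2 and 4 above can be sketched in a few lines; the trajectory numbers here are illustrative, not taken from our experiments:

```python
import numpy as np

def one_step_targets(rewards, next_values, dones, gamma=0.99):
    """Bootstrapped return targets from equation (5): G_t ~ R_{t+1} + gamma V(S_{t+1}).

    `dones` zeroes out the bootstrap term at terminal states.
    """
    return rewards + gamma * next_values * (1.0 - dones)

# Toy 3-step trajectory (illustrative numbers only)
rewards = np.array([1.0, 0.0, 2.0])       # R_{t+1}
values = np.array([0.5, 0.4, 0.3])        # V(S_t), the baseline
next_values = np.array([0.4, 0.3, 0.0])   # V(S_{t+1})
dones = np.array([0.0, 0.0, 1.0])         # episode ends at the last step

targets = one_step_targets(rewards, next_values, dones)
advantages = targets - values             # baseline subtraction (technique 2)
```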
<p>All of these techniques typically make use of discounting, so an action receives little credit for rewards that happen more than <span class="math">\(\tau_{\gamma}\)</span> timesteps in the future, making it challenging for standard reinforcement learning algorithms to learn effective policies in situations where action and effect are separated by long temporal delays.</p>
<h1 id="results">Results</h1>
<h2 id="temporal-reward-transport">Temporal reward transport</h2>
<p>We use temporal reward transport (or <span class="caps">TRT</span>), inspired directly by the Temporal Value Transport algorithm from [1], to mitigate the loss of signal from discounting by splicing temporally delayed future rewards to the immediate rewards following an action that the <span class="caps">TRT</span> algorithm determines should receive credit.</p>
<p>To assign credit to a specific observation-action pair, we use an attention layer in a neural network binary classifier. The classifier predicts whether the undiscounted returns for an episode are below or above a certain threshold. If a particular observation and its associated action are highly attended to for the classification problem, then that triggers the splicing of future rewards in the episode to that particular observation-action pair.</p>
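The splicing step can be sketched as follows; the function name, the attention threshold, and the choice to transport the raw sum of distal rewards are illustrative assumptions, not our exact implementation:

```python
import numpy as np

def temporal_reward_transport(rewards, attention, threshold=0.5, tau_gamma=100):
    """Hypothetical sketch of TRT reward splicing (names/defaults are placeholders).

    rewards:   (T,) immediate rewards for one episode
    attention: (T,) attention scores from the classifier, one per timestep
    """
    rewards = np.asarray(rewards, dtype=float)
    attention = np.asarray(attention, dtype=float)
    out = rewards.copy()
    for t in np.nonzero(attention > threshold)[0]:
        # Rewards more than tau_gamma steps ahead are effectively invisible to
        # the discounted return; transport their sum back to the attended step.
        out[t] += rewards[t + tau_gamma:].sum()
    return out
```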
<p>Model training is divided into two parts:</p>
<ol>
<li>Experience collection using the current policy in an advantage actor critic (<span class="caps">A2C</span>) model.</li>
<li>Parameter updates for the <span class="caps">A2C</span> model and binary classifier.</li>
</ol>
<p><span class="caps">TRT</span> happens between step 1 and 2; it plays no role in experience collection, but modifies the collected rewards through the splicing mechanism, thereby affecting the advantage and, consequently, the policy gradient in (\ref{policy-gradient}).</p>
<div class="math">\begin{eqnarray} \notag
R_t \rightarrow R_t + [\text{distal rewards} (t' > t + \tau_\gamma)]
\end{eqnarray}</div>
<h2 id="environment-for-experiments">Environment for experiments</h2>
<p>We created a <a href="https://github.com/openai/gym">gym</a> <a href="https://github.com/maximecb/gym-minigrid">gridworld</a> environment to specifically study long term credit assignment. The environment is a simplified version of the 3-d DeepMind Lab experiments laid out in [1]. As in [1], we structure the environment to comprise three phases. In the first phase, the agent must take an action that yields no immediate reward. In the second phase, the agent engages with distractions that yield immediate rewards. In the final phase, the agent can acquire a distal reward, depending on the action it took in phase 1.</p>
<p>Concretely, the gridworld environment consists of:</p>
<p>(1) Empty grid with key: agent can pick up the key but receives no immediate reward for picking it up.</p>
<p align="center">
<img alt="Phase 1" src="images/ltca_gridworld_p1.png" style="width:190px;"/>
</p>
<p>(2) Distractor phase: Agent engages with distractors, gifts that yield immediate rewards.</p>
<p align="center">
<img alt="Phase 2" src="images/ltca_gridworld_p2.png" style="width:190px;"/>
</p>
<p>(3) Delayed reward phase: Agent should move to a green goal grid cell. If the agent is carrying the key when it reaches the goal, it is rewarded extra points.</p>
<p align="center">
<img alt="Phase 3" src="images/ltca_gridworld_p3.png" style="width:190px;"/>
</p>
<p>The agent remains in each phase for a fixed period of time, regardless of how quickly it finishes the intermediate task, and then teleports to the next phase. At the end of each episode, the environment resets with a different random seed that randomizes the placement of the key in phase 1 and distractor objects in phase 2.</p>
<h2 id="experiments">Experiments</h2>
<p>In all experiments, we fix the time spent in phase 1 and phase 3, the number of distractor gifts in phase 2, as well as the distal reward in phase 3. In phase 3, the agent receives 5 points for reaching the goal without a key and 20 points for reaching the goal carrying a key (with a small penalty proportional to step count to encourage moving quickly to the goal).</p>
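The phase-3 reward structure above can be sketched as follows; the step-penalty coefficient is a placeholder, not the value used in our experiments:

```python
def phase3_reward(carrying_key, n_steps, step_penalty=0.01):
    """Sketch of the phase-3 reward: 20 points with the key, 5 without,
    minus a small penalty proportional to the number of steps taken."""
    base = 20.0 if carrying_key else 5.0
    return base - step_penalty * n_steps
```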
<p>Our evaluation metric for each experiment is the distal reward obtained in phase 3, which focuses on whether the agent learns to pick up the key in phase 1 in order to acquire the distal reward, although we verify that the agent is also learning to open the gifts in phase 2 by plotting the overall returns (see “Data and code availability” section).</p>
<p>Each experiment varies a particular parameter in the second phase, namely, the time delay, distractor reward size, and distractor reward variance, and compares the performance of the baseline <span class="caps">A2C</span> algorithm with <span class="caps">A2C</span> supplemented with <span class="caps">TRT</span> (<span class="caps">A2C</span>+<span class="caps">TRT</span>).</p>
<h3 id="time-delay-in-distractor-phase">Time delay in distractor phase</h3>
<p>We vary the time spent in the distractor phase, <span class="math">\(T_2\)</span>, as a multiple of the discount factor timescale. We used a discount factor of <span class="math">\(\gamma=0.99\)</span>, which corresponds to a timescale of ~100 steps according to (\ref{discount-timescale}). We ran experiments for <span class="math">\(T_2 = (0, 0.5, 1, 2) * \tau_{\gamma}\)</span>. The distractor reward is 3 points per gift.</p>
<p align="center">
<img alt="P2 time delay expt" src="images/ltca_time_delay_expt_plots.png" style="width:800px;"/>
</p>
<p><small>Fig 1. Returns in phase 3 for time delays in phase 2 of 0.5<span class="math">\(\tau_{\gamma}\)</span>, <span class="math">\(\tau_{\gamma}\)</span>, and 2<span class="math">\(\tau_{\gamma}\)</span>.
</small></p>
<p>As the environment becomes more challenging from left to right with increasing time delay, we see that <span class="caps">A2C</span> plateaus around 5 points in phase 3, corresponding to reaching the goal without the key, whereas <span class="caps">A2C</span>+<span class="caps">TRT</span> increasingly learns to pick up the key over the training period.</p>
<h3 id="distractor-reward-size">Distractor reward size</h3>
<p>We vary the size of the distractor rewards, 4 gifts for the agent to toggle open, in phase 2. We run experiments for a reward of 0, 1, 5, and 8, resulting in maximum possible rewards in phase 2 of 0, 4, 20, and 32.</p>
<p>In comparison, the maximum possible reward in phase 3 is 20.</p>
<p align="center">
<img alt="P2 reward size expt" src="images/ltca_reward_expt_plots.png" style="width:800px;"/>
</p>
<p><small>Fig 2. Returns in phase 3 for distractor rewards of size 0, 5, and 8.
</small></p>
<p>As in the time delay experiments, <span class="caps">A2C</span>+<span class="caps">TRT</span> makes progress learning to pick up the key as the distractor reward size increases, whereas <span class="caps">A2C</span> does not within the training period.</p>
<h3 id="distractor-reward-variance">Distractor reward variance</h3>
<p>We fix the mean reward size of the gifts in phase 2 at 5, but change the variance of the rewards by drawing each reward from a uniform distribution centered at 5, with minimum and maximum ranges of [5, 5], [3, 7], and [0, 10], corresponding to variances of 0, 1.33, and 8.33, respectively.</p>
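The quoted variances follow from the standard formula for a uniform distribution:

```python
def uniform_variance(lo, hi):
    """Variance of a uniform distribution on [lo, hi]: (hi - lo)^2 / 12."""
    return (hi - lo) ** 2 / 12.0

for lo, hi in [(5, 5), (3, 7), (0, 10)]:
    print(lo, hi, uniform_variance(lo, hi))  # 0, ~1.33, ~8.33
```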
<p align="center">
<img alt="P2 reward variance expt" src="images/ltca_reward_var_expt_plots.png" style="width:800px;"/>
</p>
<p><small>Fig 3. Returns in phase 3 for distractor reward variance of size 0, 1.33, and 8.33.
</small></p>
<p>The signal-to-noise ratio of the policy gradient, defined as the ratio of the squared magnitude of the expected gradient to the variance of the gradient estimate, was shown in [1] to be approximately inversely proportional to the variance of the distractor rewards in phase 2 (for <span class="math">\(\gamma = 1\)</span>). The poor performance of <span class="caps">A2C</span> in the highest variance (lowest signal-to-noise ratio) case is consistent with this observation: its performance shows only a small standard deviation around the plateau value of 5, compared to the time delay and distractor reward size experiments.</p>
<h1 id="discussion">Discussion</h1>
<p>Like temporal value transport introduced in [1], <span class="caps">TRT</span> is a heuristic. Nevertheless, coupling this heuristic with <span class="caps">A2C</span> has been shown to improve performance on several tasks characterized by delayed rewards that are a challenge for standard deep <span class="caps">RL</span>.</p>
<p>Our contribution is a simplified, modular implementation of core ideas in [1], namely, splicing additional rewards from the distant future onto state-action pairs identified as significant through a self-attention mechanism. Unlike [1], we implement the self-attention mechanism in a completely separate model and splice the rewards-to-go instead of an estimated value. In addition to the modularity that comes from splitting out the attention mechanism for <span class="caps">TRT</span> into a separate model, another advantage of decoupling the models is that we can increase the learning rate of the classifier without destabilizing the learning of the main actor critic model if the classification problem is comparatively easy.</p>
<p><strong>Related work</strong></p>
<p>Other works also draw on the idea of using hindsight to reduce the variance of policy gradient estimates, and hence increase sample efficiency. “Hindsight credit assignment,” proposed in [7], similarly learns discriminative models in hindsight that give rise to a modified form of the value function, evaluated using tabular models in a few toy environments (not focused specifically on the long term credit assignment problem). <span class="caps">RUDDER</span> [8] is closer in spirit to [1] and <span class="caps">TRT</span> in its focus on redistributing rewards to significant state-action pairs, but it identifies those pairs using saliency analysis on an <span class="caps">LSTM</span> rather than an attention mechanism.</p>
<p><strong>Future work</strong></p>
<p>The robustness of the <span class="caps">TRT</span> algorithm should be further assessed on a wider variety of environments, including e.g. Atari Bowling, which is another environment with a delayed reward task used for evaluations by [1] and [8]. It remains to be seen whether the attention mechanism and <span class="caps">TRT</span> can handle more complex scenarios, in particular scenarios where a sequence of actions must be taken. Just as it is difficult to extract interpretable features from a linear model in the presence of multicollinearity, it is possible that the attention-based classifier may encounter similar problems identifying important state-action pairs when a sequence of actions is required, as our model has no mechanism for causal reasoning.</p>
<p>Although our experiments only evaluated <span class="caps">TRT</span> on <span class="caps">A2C</span>, coupling it with any policy gradient method based on sampling action space should yield similar benefits, which could be straightforwardly tested with our modular implementation.</p>
<p>A benefit of using self-attention over an <span class="caps">LSTM</span> is its temporal granularity. However, a downside is that our approach relies on having the full context of the episode available to the attention mechanism ([1] similarly relies on full episodes), in contrast to other methods that can handle commonly used truncation windows with a bootstrapped final value for non-terminal states. Holding full episodes in memory can become untenable for very long episodes, but we have not yet worked out a way to handle this situation in the current setup.</p>
<p>Our first pass implementation transported the raw rewards-to-go instead of the value estimate used in [1], but it is unclear whether transporting the rewards-to-go (essentially re-introducing a portion of the undiscounted Monte Carlo returns) for a subset of important state-action pairs provides a strong enough signal to outweigh the advantages of using a bootstrapped estimate intended for variance reduction; the answer may depend on the particular task/environment and is of course contingent on the quality of the value estimate.</p>
<p>The classifier model itself has a lot of room for experimentation. The idea of using a classifier was motivated by a wish to easily extract state-action pairs with high returns from the attention layer, although we have yet to explore whether this provides a clear benefit over a regression model like [1].</p>
<p>The binary classifier is trained to predict whether the rewards-to-go of each subsequence of an episode exceed a moving average of maximum returns. On the one hand, this is less intuitive than only making the prediction for undiscounted returns of the full episode, and it introduces highly non-iid inputs for the classifier, which can make training less stable. On the other hand, one can interpret the current format as a form of data augmentation that produces more instances of the positive class (high-return episodes), which benefits the classifier.</p>
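<p>As a concrete sketch of this labeling scheme (the function name is ours; we assume undiscounted rewards-to-go and a scalar threshold):</p>

```python
import numpy as np

def subsequence_labels(rewards, threshold):
    """Classifier targets: for each timestep, does the undiscounted
    rewards-to-go from that point exceed the moving-average threshold?"""
    rewards = np.asarray(rewards, dtype=float)
    rewards_to_go = np.cumsum(rewards[::-1])[::-1]  # suffix sums
    return (rewards_to_go > threshold).astype(int)
```

<p>For example, <code>subsequence_labels([1, 1, 1], threshold=2)</code> labels only the first timestep positive, since the later suffixes sum to 2 and 1.</p>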
<p>If the classifier were modified to make only a single prediction per episode, it may be necessary to create a buffer of recent experiences, shifting the distribution of data towards more positive samples for the classifier to draw from in addition to episodes generated from the most recent policy. (This relies on the untested assumption that the classifier would be less sensitive than the main actor critic model to training on such off-policy data, while benefiting from the higher incidence of the positive class.)</p>
<p>Finally, the <span class="caps">TRT</span> algorithm introduces additional hyperparameters that could benefit from additional tuning, including the constant factor multiplying the transported rewards and the attention score threshold to trigger <span class="caps">TRT</span>.</p>
<h1 id="methods">Methods</h1>
<h2 id="environment">Environment</h2>
<p>The agent receives a partial observation of the environment, the 7x7 grid in front of it, with each grid cell encoding 3 input values, resulting in 7x7x3 values total (not pixels).</p>
<p>The gridworld environment supports 7 actions: left, right, forward, pickup, drop, toggle, done.</p>
<p>The environment consists of three phases:</p>
<ul>
<li>Phase 1 “key”: 6x6 grid cells, time spent = 30 steps</li>
<li>Phase 2 “gifts”: 10x10 grid cells, time spent = 50 steps (except for the time delay experiment, which varies the time spent)</li>
<li>Phase 3 “goal”: 7x7 grid cells, time spent = 70 steps</li>
</ul>
<p>If the agent picks up the key in phase 1, it is initialized carrying the key in phase 3, but not in phase 2. The carrying state is visible to the agent in phases 1 and 3. Except for the time delay experiment, each episode is 150 timesteps.</p>
<p><strong>Distractor rewards in phase 2:</strong></p>
<ul>
<li>4 distractor objects, gifts that the agent can toggle open, that yield immediate rewards</li>
<li>Each opened gift yields a mean reward of 3 points (except in the reward size experiment)</li>
<li>Gift rewards have a variance of 0 (except in the reward variance experiment)</li>
</ul>
<p><strong>Distal rewards in phase 3:</strong></p>
<p>5 points for reaching the goal without a key and 20 points for reaching the goal carrying a key. There is a small penalty of <code>-0.9 * step_count / max_steps</code>, with <code>max_steps = 70</code>, to encourage moving quickly to the goal. For convenience of parallelizing experience collection of complete episodes, the time in the final phase is fixed, even if the agent finishes the task of navigating to the green goal earlier. Furthermore, for convenience of tracking rewards acquired in the final phase, the agent only receives the reward for task completion in the last step of the final phase, even though this last reward reflects the time and state in which the agent initially reached the goal.</p>
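<p>For concreteness, the phase-3 reward can be sketched as a small function (the name and signature are ours, not from the codebase):</p>

```python
def distal_reward(has_key: bool, step_count: int, max_steps: int = 70) -> float:
    """Phase-3 reward: 20 points with the key, 5 without, minus a small
    penalty that grows with the number of steps taken to reach the goal."""
    base = 20.0 if has_key else 5.0
    return base - 0.9 * step_count / max_steps
```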
<p>Note, unlike the Reconstructive Memory Agent in [1], our agent does not have the ability to encode and reconstruct memories, and our environment is not set up to test for that ability.</p>
<h2 id="agent-model">Agent model</h2>
<p>The agent’s model is an actor critic consisting of 3 components: an image-encoding convolutional net (<span class="caps">CNN</span>), a recurrent neural net layer providing memory, and dual heads outputting the policy and value. We used an open-sourced model from [4] that has been extensively tested on gym-minigrid environments.</p>
<p align="center">
<img alt="A2C model" src="images/ltca_a2c_model.png" style="width:300px;"/>
</p>
<p><small>Fig 4. <span class="caps">A2C</span> model with three convolutional layers, <span class="caps">LSTM</span>, and dual policy and value function heads.
</small></p>
<p>The image encoder consists of three convolutional layers interleaved with rectified linear (ReLU) activation functions. A max pooling layer also immediately precedes the second convolutional layer.</p>
<p>The encoded image is followed by a single Long Short Term Memory (<span class="caps">LSTM</span>) layer.</p>
<p>The <span class="caps">LSTM</span> outputs a hidden state which feeds into the dual heads of the actor critic. Both heads consist of two fully connected linear layers sandwiching a tanh activation layer. The output of the actor, the policy, is the same size as the action space in the environment. The output of the critic is a scalar corresponding to the estimated value.</p>
<h2 id="binary-classifier-with-self-attention">Binary classifier with self-attention</h2>
<p>The inputs to the binary classifier are the sequence of image embeddings output by the actor critic model’s <span class="caps">CNN</span> (not the hidden state of the <span class="caps">LSTM</span>) and one-hot encoded actions taken in that state.</p>
<p>The action passes through a linear layer with 32 hidden units before concatenation with the image embedding.</p>
<p>Next, the concatenated vector <span class="math">\(\mathbf{x}_i\)</span> undergoes three separate linear transformations, playing the role of “query”, “key” and “value” (see [6] for an excellent explanation upon which we based our implementation of attention). Each transformation projects the vector to a space of size equal to the length of the episode.</p>
<div class="math">\begin{eqnarray}\label{key-query} \tag{6}
\mathbf{q}_i &=& \mathbf{W}_q \mathbf{x}_i \\
\mathbf{k}_i &=& \mathbf{W}_k \mathbf{x}_i \\
\mathbf{v}_i &=& \mathbf{W}_v \mathbf{x}_i \\
\end{eqnarray}</div>
<p>The self-attention layer outputs a weighted average over the value vectors, where the weight is not a parameter of the neural net, but the dot product of the query and key vectors.</p>
<div class="math">\begin{eqnarray}\label{self-attention} \tag{7}
w'_{ij} &=& \mathbf{q}_i^\top \mathbf{k}_j \\
w_{ij} &=& \text{softmax}(w'_{ij}) \\
\mathbf{y_i} &=& \sum_j w_{ij} \mathbf{v_j}
\end{eqnarray}</div>
<p>The dot product in (\ref{self-attention}) is between embeddings of different frames in an episode. We apply masking to the weight matrix before the softmax in (\ref{self-attention}) to ensure that observations from different episodes do not attend to each other, in addition to future masking (observations can only attend to past observations in the same episode).</p>
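<p>A minimal numpy sketch of (6)-(7) for a single episode and a single attention head (batching and cross-episode masking omitted for brevity):</p>

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """Self-attention with future masking: frame i attends only to
    frames j <= i. X has shape (T, d); returns (weights, outputs)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T                                    # w'_ij = q_i . k_j
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # entries with j > i
    scores = np.where(future, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                   # row-wise softmax
    return w, w @ V                                     # weights, y_i
```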
<p>The output of the attention layer then passes through a fully connected layer with 64 hidden units, followed by a ReLU activation, and the final output is a scalar, the logit predicting whether the rewards-to-go from a given observation are below or above a threshold.</p>
<p>The threshold itself is a moving average over the maximum undiscounted returns seen across network updates, where the averaging window is a hyperparameter that should balance updating the threshold in response to higher returns due to an improving policy (in general, increasing, although monotonicity is not enforced) with not increasing so quickly such that there are too few episodes in the positive (high returns) class in a given batch of collected experiences.</p>
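<p>The bookkeeping for such a threshold might look like the following sketch (the class name and window default are ours):</p>

```python
from collections import deque

class ReturnThreshold:
    """Moving average over the maximum undiscounted returns observed
    across recent network updates; the window size is a hyperparameter."""
    def __init__(self, window: int = 10):
        self.maxima = deque(maxlen=window)

    def update(self, episode_returns) -> float:
        """Record the max return of this batch; return the new threshold."""
        self.maxima.append(max(episode_returns))
        return sum(self.maxima) / len(self.maxima)
```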
<p align="center">
<img alt="Classifier model" src="images/ltca_classifier_model.png" style="width:300px;"/>
</p>
<p><small>Fig 5. Binary classifier model with attention, accepting sequences as input.
</small></p>
<h2 id="temporal-reward-transport_1">Temporal reward transport</h2>
<p>After collecting a batch of experiences by following the <span class="caps">A2C</span> model’s policy, we calculate the attention scores <span class="math">\(w_{ij}\)</span> from (\ref{self-attention}) using observations from the full episode as context.</p>
<p align="center">
<img alt="Attention scores single episode" src="images/ltca_attention_single_episode.png" style="width:400px;"/>
</p>
<p><small>Fig 6. Attention scores for a single episode with future masking (of the upper right triangle). The bright vertical stripes correspond to two highly attended state-action pairs.
</small></p>
<p>We define the importance of observation <span class="math">\(i\)</span> as its average weight in the attention matrix, ignoring masked regions:</p>
<div class="math">\begin{eqnarray}\label{importance} \tag{8}
\text{importance}_i = \frac{1}{T - i} \sum_{j \geq i}^{T} w_{ij}
\end{eqnarray}</div>
<p align="center">
<img alt="Importances for a batch of frames" src="images/ltca_importances.png" style="width:600px;"/>
</p>
<p><small>Fig 7. Importances for a batch of collected experiences (16 processes x 600 frames = 9600 frames), with frame or step number on the horizontal axis and process number on the vertical axis.
</small></p>
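<p>Reading <span class="math">\(w_{ij}\)</span> in (8) as the attention paid to frame <span class="math">\(i\)</span> by frames at times <span class="math">\(j \geq i\)</span> (the bright columns of Fig 6), the importance amounts to a column average over unmasked entries:</p>

```python
import numpy as np

def importance(w):
    """Average attention each frame receives from itself and later frames,
    given a future-masked (lower-triangular) attention matrix w, shape (T, T)."""
    T = w.shape[0]
    unmasked_counts = T - np.arange(T)  # column i has T - i unmasked rows
    return w.sum(axis=0) / unmasked_counts
```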
<p>Observations with an importance score above a threshold hyperparameter (between 0 and 1) are eligible for <span class="caps">TRT</span>. After identifying the candidates for <span class="caps">TRT</span>, we add the distal rewards-to-go, weighted by the importance and by a hyperparameter <span class="math">\(\alpha_{TRT}\)</span> that tunes the impact of the <span class="caps">TRT</span> rewards, to the original reward <span class="math">\(r_i\)</span> obtained during experience collection:</p>
<div class="math">\begin{eqnarray}\label{trt} \tag{9}
r_i &\rightarrow& r_i + \text{TRT-reward}_i \\
\text{TRT-reward}_i &\equiv& \alpha_{TRT} * \text{importance}_i * \text{rewards-to-go}_i
\end{eqnarray}</div>
<p>We define the distal rewards-to-go in (\ref{trt}) as the total undiscounted returns from observation <span class="math">\(i\)</span>, excluding rewards accumulated in an immediate time window of size equal to the discount factor timescale <span class="math">\(\tau_\gamma\)</span> defined in (\ref{discount-timescale}). This temporal exclusion zone helps prevent overcounting rewards.</p>
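<p>Putting (9) and the exclusion window together, a sketch of the transport step (argument names are illustrative):</p>

```python
import numpy as np

def trt_rewards(rewards, importances, threshold, alpha_trt, tau_gamma):
    """Add alpha * importance * distal rewards-to-go to each reward whose
    importance exceeds the threshold; rewards within tau_gamma steps are
    excluded from the transported total to help avoid overcounting."""
    r = np.asarray(rewards, dtype=float)
    out = r.copy()
    for i in range(len(r)):
        if importances[i] > threshold:
            distal = r[i + tau_gamma:].sum()  # rewards beyond the window
            out[i] += alpha_trt * importances[i] * distal
    return out
```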
<p>We calculate the advantage after <span class="caps">TRT</span> using the generalized advantage estimation algorithm <span class="caps">GAE</span>-<span class="math">\(\lambda\)</span> [3] with <span class="math">\(\lambda=0.95\)</span>, which, analogous to <span class="caps">TD</span>-<span class="math">\(\lambda\)</span> [2], calculates the advantage from an exponentially weighted average over 1- to n-step bootstrapped estimates of the <span class="math">\(Q\)</span> value. One of the benefits of using <span class="caps">GAE</span>-<span class="math">\(\lambda\)</span> is the spillover effect that enables the <span class="caps">TRT</span>-reinforced rewards to directly affect neighboring state-action pairs in addition to the triggering state-action pair.</p>
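<p>The <span class="caps">GAE</span> computation itself is a short backward recursion; a self-contained sketch:</p>

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation [3]: an exponentially weighted
    average of n-step advantage estimates, via a backward recursion."""
    T = len(rewards)
    adv = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv
```

<p>With <span class="math">\(\lambda=0\)</span> this reduces to one-step <span class="caps">TD</span> errors; with <span class="math">\(\lambda=1\)</span>, to Monte Carlo returns minus the value baseline.</p>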
<h2 id="training">Training</h2>
<p>For experience collection, we used 16 parallel processes, with 600 frames collected per process for a batch size of 9600 frames between parameter updates.</p>
<p>The <span class="caps">A2C</span> model loss per time step is</p>
<div class="math">\begin{eqnarray} \label{a2c-loss} \tag{10}
\mathcal{L}_{A2C} \equiv \mathcal{L}_{policy} + \alpha_{value} \mathcal{L}_{value} - \alpha_{entropy} \text{entropy},
\end{eqnarray}</div>
<p>where</p>
<div class="math">\begin{eqnarray} \notag
\mathcal{L}_{policy} &=& - \hat{A}_t \log p(a_t | o_t, h_t), \\
\mathcal{L}_{value} &=& \left\Vert \hat{V}(o_t, h_t) - R_t \right\Vert^2,
\end{eqnarray}</div>
<p>and <span class="math">\(o_t\)</span> and <span class="math">\(h_t\)</span> are the observation and hidden state from the <span class="caps">LSTM</span> at time <span class="math">\(t\)</span>, respectively. We accumulate the losses defined in (\ref{a2c-loss}) by iterating over batches of consecutive time steps equal to the size of the <span class="caps">LSTM</span> memory of 10, i.e. truncated backpropagation in time for 10 timesteps.</p>
<p>The classifier with attention has a <a href="https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss">binary cross entropy loss</a>, where the contribution to the loss from positive examples is weighted by a factor of 2.</p>
<p>We clip both gradient norms according to a hyperparameter <span class="math">\(\text{max_grad_norm}=0.5\)</span>, and we optimize both models using RMSprop with learning rates of 0.01, RMSprop <span class="math">\(\alpha=0.99\)</span>, and RMSprop <span class="math">\(\epsilon=10^{-8}\)</span>.</p>
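<p>Global gradient-norm clipping, as applied here to both models, can be sketched in a few lines (operating on plain arrays for illustration):</p>

```python
import numpy as np

def clip_grad_norm(grads, max_norm=0.5):
    """Rescale a list of gradient arrays so their global L2 norm is at
    most max_norm; returns the clipped grads and the original norm."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads, total
```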
<h1 id="data-and-code-availability">Data and code availability</h1>
<h2 id="environment_1">Environment</h2>
<p>Code for the components of the 3 phase environment is in our <a href="https://github.com/frangipane/gym-minigrid">fork</a> of <a href="https://github.com/maximecb/gym-minigrid">gym-minigrid</a>.</p>
<p>The base environment for running the experiments is defined in <a href="https://github.com/frangipane/rl-credit/blob/master/rl_credit/examples/environment.py">https://github.com/frangipane/rl-credit/</a>. Each experiment script subclasses that base environment, varying some parameter in the distractor phase.</p>
<h2 id="experiments_1">Experiments</h2>
<p>The parameters and results of the experiments are documented in the following publicly available reports on Weights and Biases:</p>
<ul>
<li><a href="https://app.wandb.ai/frangipane/distractor_time_delays/reports/Distractor-Gift-time-delay--VmlldzoxMjYyNzY">Distractor phase time delays</a></li>
<li><a href="https://app.wandb.ai/frangipane/distractor_reward_size/reports/Distractor-gift-reward-size--VmlldzoxMjcxMTI">Distractor phase reward size</a></li>
<li><a href="https://app.wandb.ai/frangipane/distractor_reward_variance/reports/Distractor-gift-variance--VmlldzoxMjgzNTc">Distractor phase variance of rewards</a></li>
</ul>
<p>Code for running the experiments is at <a href="https://github.com/frangipane/rl-credit">https://github.com/frangipane/rl-credit</a> in the examples/ submodule.</p>
<h1 id="acknowledgements">Acknowledgements</h1>
<p>Thank you to OpenAI, my OpenAI mentor J. Tworek, Microsoft for the cloud computing credits, Square for supporting my participation in the program, and my 2020 cohort of Scholars: A. Carrera, P. Mishkin, K. Ndousse, J. Orbay, A. Power (especially for the tip about future masking in transformers), and K. Slama.</p>
<h1 id="references">References</h1>
<p>[1] Hung C, Lillicrap T, Abramson J, et al. 2019. <a href="https://www.nature.com/articles/s41467-019-13073-w">Optimizing agent behavior over long time scales by transporting value</a>. Nat Commun 10, 5223.</p>
<p>[2] Sutton R, Barto A. 2018. <a href="http://incompleteideas.net/book/RLbook2018.pdf">Reinforcement Learning: An Introduction (2nd Edition)</a>. Cambridge (<span class="caps">MA</span>): <span class="caps">MIT</span> Press.</p>
<p>[3] Schulman J, Moritz P, Levine S, et al. 2016. <a href="https://arxiv.org/abs/1506.02438">High-Dimensional Continuous Control Using Generalized Advantage Estimation</a>. <span class="caps">ICLR</span>.</p>
<p>[4] Willems L. <a href="https://github.com/lcswillems/rl-starter-files"><span class="caps">RL</span> Starter Files</a> and <a href="https://github.com/lcswillems/torch-ac">Torch <span class="caps">AC</span></a>. GitHub.</p>
<p>[5] Chevalier-Boisvert M, Willems L, Pal S. 2018. <a href="https://github.com/maximecb/gym-minigrid">Minimalistic Gridworld Environment for OpenAI Gym</a>. GitHub.</p>
<p>[6] Bloem P. 2019. <a href="http://www.peterbloem.nl/blog/transformers">Transformers from Scratch</a> [blog]. [accessed 2020 May 1]. http://www.peterbloem.nl/blog/transformers.</p>
<p>[7] Harutyunyan A, Dabney W, Mesnard T. 2019. <a href="http://papers.nips.cc/paper/9413-hindsight-credit-assignment.pdf">Hindsight Credit Assignment</a>. Advances in Neural Information Processing Systems 32: 12488–12497.</p>
<p>[8] Arjona-Medina J, Gillhofer M, Widrich M, et al. 2019. <a href="https://papers.nips.cc/paper/9509-rudder-return-decomposition-for-delayed-rewards.pdf"><span class="caps">RUDDER</span>: Return Decomposition for Delayed Rewards</a>. Advances in Neural Information Processing Systems 32: 13566–13577.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Visualizing an actor critic algorithm in real time2020-05-07T12:00:00-07:002020-05-07T12:00:00-07:00Cathy Yehtag:efavdb.com,2020-05-07:/visualize-actor-critic<p>Deep reinforcement learning algorithms can be hard to debug, so it helps to visualize as much as possible in the absence of a stack trace [1]. How do we know if the learned policy and value functions make sense? Seeing these quantities plotted in real time as an agent is interacting with an environment can help us answer that question.</p>
<p>Here’s an example of an agent wandering around a custom <a href="https://github.com/frangipane/gym-minigrid">gridworld</a> environment. When the agent executes the <code>toggle</code> action in front of an unopened red gift, it receives a reward of 1 point, and the gift turns grey/inactive.</p>
<iframe width="640" height="360" src="https://www.youtube.com/embed/M3PMwPFRoc8" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>The model is an actor critic, a type of policy gradient algorithm (for a nice introduction, see Jonathan’s <a href="https://efavdb.com/battleship">battleship</a> post or [2]) that uses a neural network to parametrize its policy and value functions.</p>
<p>This agent barely “meets expectations” — notably getting stuck at an opened gift between frames 5-35 — but the values and policy mostly make sense. For example, we tend to see spikes in value when the agent is immediately in front of an unopened gift while the policy simultaneously outputs a much higher probability of taking the appropriate <code>toggle</code> action in front of the unopened gift. (We’d achieve better performance by incorporating some memory into the model in the form of an <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/"><span class="caps">LSTM</span></a>).</p>
<p>We’re sharing a little helper code to generate the matplotlib plots of the value and policy functions that are shown in the video.</p>
<script src="https://gist.github.com/frangipane/4adca6481bf55f2260ff215c5686851b.js"></script>
<p><strong>Comments</strong></p>
<ul>
<li>Training of the model is not included. You’ll need to load a trained actor critic model, along with access to its policy and value functions for plotting. Here, the trained model has been loaded into <code>agent</code> with a <code>get_action</code> method that returns the <code>action</code> to take, along with a numpy array of <code>policy</code> probabilities and a scalar <code>value</code> for the observation at the current time step.</li>
<li>The minigridworld environment conforms to the OpenAI gym <span class="caps">API</span>, and the <code>for</code> loop is a standard implementation for interacting with the environment.</li>
<li>The gridworld environment already has a built-in method for rendering the environment in interactive mode: <code>env.render('human')</code>.</li>
<li>Matplotlib’s <code>autoscale_view</code> and <code>relim</code> functions are used to make updates to the figures at each step. In particular, this allows us to show what appears to be a sliding window over time of the value function line plot. When running the script, the plots pop up as three separate figures.</li>
</ul>
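<p>To illustrate the sliding-window update pattern described above (a sketch of the idea, not the gist itself; the Agg backend keeps it headless, so swap in an interactive backend for live viewing):</p>

```python
import matplotlib
matplotlib.use("Agg")  # headless; use e.g. TkAgg when running live
import matplotlib.pyplot as plt
import numpy as np

def make_value_plot(window=50):
    """Return a figure and an update(step, value) closure that appends a
    point, keeps only the last `window` points, and rescales the axes."""
    fig, ax = plt.subplots()
    line, = ax.plot([], [])
    ax.set_xlabel("step")
    ax.set_ylabel("value")

    def update(step, value):
        x, y = line.get_data()
        line.set_data(np.append(x, step)[-window:], np.append(y, value)[-window:])
        ax.relim()            # recompute data limits from the line
        ax.autoscale_view()   # rescale axes to the sliding window
        fig.canvas.draw_idle()

    return fig, update
```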
<h3 id="references">References</h3>
<p>[1] Berkeley Deep <span class="caps">RL</span> bootcamp - Core Lecture 6 Nuts and Bolts of Deep <span class="caps">RL</span> Experimentation — John Schulman (<a href="https://youtu.be/8EcdaCk9KaQ">video</a> | <a href="https://drive.google.com/open?id=0BxXI_RttTZAhc2ZsblNvUHhGZDA">slides</a>) - great advice on the debugging process, things to plot</p>
<p>[2] OpenAI Spinning Up: <a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html">Intro to policy optimization</a></p>2-D random walks are special2020-04-10T00:00:00-07:002020-04-10T00:00:00-07:00Dustin McIntoshtag:efavdb.com,2020-04-10:/random-walk-scaling<p>Here, we examine the statistics behind discrete random walks on square lattices in <span class="math">\(M\)</span> dimensions, with focus on two metrics (see figure below for an example in 2-D): 1. <span class="math">\(R\)</span>, the final distance traveled from origin (measured by the Euclidean norm) and 2. <span class="math">\(N_{unique}\)</span>, the number of unique locations visited on the lattice.</p>
<p align="center">
<img src="images/example_rw.png">
</p>
<p>We envision a single random walker on an <span class="math">\(M\)</span>-D lattice and allow it to wander randomly throughout the lattice, taking <span class="math">\(N\)</span> steps. We’ll examine how the distributions of <span class="math">\(R\)</span> and <span class="math">\(N_{unique}\)</span> vary with <span class="math">\(M\)</span> and <span class="math">\(N\)</span>; we’ll show that their averages, <span class="math">\(\langle R \rangle\)</span> and <span class="math">\(\langle N_{unique} \rangle\)</span>, and their standard deviations, <span class="math">\(\sigma_R\)</span> and <span class="math">\(\sigma_{N_{unique}}\)</span>, scale as power laws with <span class="math">\(N\)</span>. The dependence of the exponents and scaling factors on <span class="math">\(M\)</span> is interesting and can be only partially reconciled with theory.</p>
<p>A simple simulation of random walks is easy to write in python for arbitrary dimensions (see <a
href="https://colab.research.google.com/drive/13GYlaTvO-Wu_3ep_Pa0mRZo-CYelDFmf">this colab notebook</a>, <a href="https://github.com/dustinmcintosh/random-walks">github</a>).</p>
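<p>For reference, here is a minimal version of such a simulation (our notebook differs in details) that returns both metrics for a single walk:</p>

```python
import numpy as np

def random_walk_metrics(n_steps, n_dims, rng=None):
    """Simulate one lattice random walk; return (R, N_unique)."""
    if rng is None:
        rng = np.random.default_rng()
    # each step moves +-1 along one randomly chosen axis
    axes = rng.integers(n_dims, size=n_steps)
    signs = rng.choice([-1, 1], size=n_steps)
    steps = np.zeros((n_steps, n_dims), dtype=int)
    steps[np.arange(n_steps), axes] = signs
    path = np.cumsum(steps, axis=0)
    r_final = np.linalg.norm(path[-1])
    sites = np.vstack([np.zeros((1, n_dims), dtype=int), path])  # include origin
    n_unique = len(np.unique(sites, axis=0))
    return r_final, n_unique
```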
<p>Here’s a look at the distribution of our two metrics for <span class="math">\(N = 1000\)</span> for a few different dimensionalities:</p>
<p align="center">
<img src="images/unique_locations_visited_1000.png">
</p>
<p>Let’s make some high-level sense of these results:</p>
<ul>
<li>
<p><span class="math">\(\langle R \rangle\)</span> depends only weakly on <span class="math">\(M\)</span> while <span class="math">\(\langle N_{unique} \rangle\)</span> clearly increases with <span class="math">\(M\)</span>. Both of these results make sense: In 1-D, the walker always has an equal chance to step further away from the origin or closer to it. It also always has at least a 50% chance of backtracking to a position it has already visited. As you add dimensions, it becomes less likely to step immediately closer or further from the origin and more likely to wander in an orthogonal direction, increasing the distance from the origin by roughly the same amount independent of <em>which</em> orthogonal direction, while also visiting completely new parts of the lattice.</p>
</li>
<li>
<p>In 1-D, if you take an even number of steps, <span class="math">\(R\)</span> is always an even integer, so the distribution of <span class="math">\(R\)</span> appears “stripe-y” above. <span class="math">\(R(M=1)\)</span> can be understood as something like a <a href="https://en.wikipedia.org/wiki/Folded_normal_distribution">folded normal distribution</a> for large <span class="math">\(N\)</span>.</p>
</li>
<li>
<p>As <span class="math">\(M\)</span> gets larger, the distribution of <span class="math">\(R\)</span> tightens up. We are essentially taking <span class="math">\(M\)</span> 1-D random walks, each with <span class="math">\(\sim N/M\)</span> steps and adding the results in quadrature. The result of this is that our walker is less likely to be very close to the origin and simultaneously less likely to wander too far afield as it is more likely to have traveled a little bit in many orthogonal dimensions.</p>
</li>
<li>
<p>The width of the <span class="math">\(N_{unique}\)</span> distribution, <span class="math">\(\sigma_{N_{unique}}\)</span>, depends in an interesting way on <span class="math">\(M\)</span>, increasing from 1-D to 2-D and decreasing thereafter. This seems to be because the 2-D distribution is centered farthest from the extremes of 0 and <span class="math">\(N\)</span>.</p>
</li>
</ul>
<p>Let’s take a look at the means of our two metrics as a function of <span class="math">\(N\)</span> for various <span class="math">\(M\)</span> (apologies for using a gif here, but it helps with 1. keeping this post concise and 2. comparing the plots as <span class="math">\(M\)</span> changes incrementally. If you prefer static images, I’ve included them at <a href="https://github.com/dustinmcintosh/random-walks">github</a>):</p>
<p align="center">
<img src="images/R_and_Nu_vs_n.gif">
</p>
<p>Note, this is a <a href="https://en.wikipedia.org/wiki/Log%E2%80%93log_plot">log-log plot</a>; power laws show up as straight lines. One of the first things you’ll notice is that the variation in dependence on <span class="math">\(N\)</span> across different <span class="math">\(M\)</span> is fairly banal for <span class="math">\(R\)</span>, but much more interesting for <span class="math">\(N_{unique}\)</span>. In fact, it is a well-known result (see, e.g., <a href="https://math.stackexchange.com/questions/103142/expected-value-of-random-walk">here</a>) that <span class="math">\(\langle R \rangle \sim N^\alpha\)</span> with <span class="math">\(\alpha = 0.5\)</span> for all <span class="math">\(M\)</span>. <span class="math">\(\langle N_{unique} \rangle\)</span>, on the other hand, seems similar to <span class="math">\(\langle R \rangle\)</span> in 1-D, but the dependence on <span class="math">\(N\)</span> increases dramatically thereafter and approaches <span class="math">\(\langle N_{unique} \rangle \approx N^\beta\)</span> with <span class="math">\(\beta=1\)</span> at higher dimensions. If you look closely, you’ll note that 2-D is particularly special (more on that shortly).</p>
<p>We can also look at the widths (standard deviations) of the two distributions (again, you’ll note that 2-D is particularly interesting):</p>
<p align="center">
<img src="images/std_vs_N.gif">
</p>
<p>First, note that all 4 of the quantities describing the distributions of <span class="math">\(R\)</span> and <span class="math">\(N_{unique}\)</span> plotted above are well-described by power laws. Thus, we can extract eight parameters to describe them (four exponents and four scaling factors) as a function of <span class="math">\(M\)</span>:</p>
<div class="math">\begin{eqnarray}
\langle R \rangle &\approx& R_0 * N^\alpha \\
\langle N_{unique} \rangle &\approx& N_0 * N^\beta \\
\sigma_R &\approx& \sigma_{R,0} * N^\gamma \\
\sigma_{N_{unique}} &\approx& \sigma_{N_{unique},0} * N^\delta \\
\end{eqnarray}</div>
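<p>Each of these exponents and scaling factors can be extracted with a least-squares fit in log-log space; a minimal sketch:</p>

```python
import numpy as np

def fit_power_law(N, y):
    """Fit y ~ y0 * N**exponent by linear regression on log-log axes;
    returns (y0, exponent)."""
    slope, intercept = np.polyfit(np.log(N), np.log(y), 1)
    return np.exp(intercept), slope
```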
<p align="center">
<img src="images/exponents_scaling_vs_M.png">
</p>
<p>I’ve plotted all the theoretical results that I could find for these parameters as lines above:</p>
<ul>
<li>
<p>As mentioned, <span class="math">\(\alpha = 0.5\)</span> for all <span class="math">\(M\)</span>. (blue line top plot)</p>
</li>
<li>
<p><span class="math">\(R_0\)</span> gives <span class="math">\(\langle R \rangle\)</span> its weak dependence on <span class="math">\(M\)</span>, varying according to an elegant ratio of Gamma functions as described <a href="https://math.stackexchange.com/questions/103142/expected-value-of-random-walk">here</a>. (blue dash-dot line bottom plot)</p>
</li>
<li>
<p>It has been shown that <span class="math">\(\beta = 0.5\)</span> for <span class="math">\(M=1\)</span> and <span class="math">\(\beta = 1\)</span> for all <span class="math">\(M \geq 3\)</span>. (orange dash and line in the top plot) I found these theoretical results in this <a href="https://www.osti.gov/servlets/purl/4637387">wonderful paper</a>, which you should really glance at, if not read, just to see an interesting piece of history - it’s by George H. Vineyard in 1963 (not so long ago for such a fundamental math problem!) at Brookhaven National Lab for the <span class="caps">US</span> Atomic Energy Commission. The paper was written on a typewriter; here is a sample equation, complete with hand-written scribbles to indicate the vectors and a clearly corrected typo on the first cosine:</p>
</li>
</ul>
<p align="center">
<img src="images/Vineyard_pic.png">
</p>
<ul>
<li>In the same paper, Vineyard derives that <span class="math">\(N_{unique, 0} = \sqrt{8/\pi}\)</span> in 1-D and <span class="math">\(N_{unique, 0} \approx 0.659462670\)</span> (yes with all those sig. figs.) in 3-D as derived from evaluating Watson’s Integrals (which, I take it, is what the integrals in the image above are called). (orange dashes in bottom plot)</li>
</ul>
<p>The most interesting thing about all this, to me, is that there is no known theoretical result for <span class="math">\(\beta\)</span> in 2-D. From our data, we get <span class="math">\(\beta = 0.87 \pm 0.02\)</span>. An interesting phenomenological argument for <span class="math">\(\beta\)</span>’s dependence on <span class="math">\(M\)</span>: The random walker spends most of its time roaming around within a distance of <span class="math">\(\langle R \rangle \sim N^{\alpha}\)</span> from the origin (where <span class="math">\(\alpha=1/2\)</span>, independent of <span class="math">\(M\)</span>), as that is where it is going to end up. In 1-D, there are only <span class="math">\(\sim 2 N^{1/2}\)</span> sites within this distance, so the number of unique sites visited scales with <span class="math">\(N^{1/2}\)</span>. In 3-D, there are <span class="math">\(\sim \frac{4}{3} \pi N^{3/2}\)</span> sites in the sphere of radius <span class="math">\(N^{1/2}\)</span>, which is <span class="math">\(\gg N\)</span>, so the walker explores <span class="math">\(\sim N\)</span> sites (as it can visit no more than it walks), and similarly for larger <span class="math">\(M\)</span>.</p>
<p><strong>This is what makes 2-D random walks special</strong>: the number of sites within <span class="math">\(N^{1/2}\)</span> distance of the origin is <span class="math">\(\sim N\)</span>, scaling as the number of steps taken. The scaling arguments above therefore no longer apply, leaving us somewhere in between: <span class="math">\(1/2 < \beta < 1\)</span>.</p>
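<p>This counting argument is easy to probe numerically. Below is a minimal sketch of my own (not the notebook code linked later; <code>walk_stats</code> is a hypothetical helper name) that simulates simple lattice walks, counts unique sites visited, and forms a crude two-point estimate of <span class="math">\(\beta\)</span> in 2-D:</p>

```python
import numpy as np

def walk_stats(n_steps, dim, n_trials=200, seed=0):
    """Mean end-to-end distance <R> and mean number of unique lattice sites
    visited, for simple random walks on Z^dim (one axis stepped per move)."""
    rng = np.random.default_rng(seed)
    r_sum, uniq_sum = 0.0, 0
    for _ in range(n_trials):
        axes = rng.integers(0, dim, size=n_steps)   # which axis to step along
        signs = rng.choice((-1, 1), size=n_steps)   # step direction
        steps = np.zeros((n_steps, dim), dtype=np.int64)
        steps[np.arange(n_steps), axes] = signs
        path = np.cumsum(steps, axis=0)
        r_sum += np.linalg.norm(path[-1])
        # unique sites include the origin, plus every site on the path
        uniq_sum += len({tuple(p) for p in path} | {(0,) * dim})
    return r_sum / n_trials, uniq_sum / n_trials

# Crude two-point estimate of beta in 2-D from N_unique ~ N^beta
n1, n2 = 500, 2000
_, u1 = walk_stats(n1, dim=2)
_, u2 = walk_stats(n2, dim=2)
beta = np.log(u2 / u1) / np.log(n2 / n1)
```

<p>With enough trials, the 2-D estimate should land between the 1-D and 3-D limits of 1/2 and 1, consistent with the fit quoted above.</p>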
<p>A few final notes on the exponents and scaling factors that I don’t understand in terms of theory:</p>
<ul>
<li>We don’t get exactly <span class="math">\(\beta=1\)</span> from the data in 3-D above, but that may be because some of our <span class="math">\(N\)</span> are not quite large enough.</li>
<li>I have not seen any theory for <span class="math">\(\sigma_{R,0}\)</span>, <span class="math">\(\sigma_{N_{unique},0}\)</span>, <span class="math">\(\gamma\)</span>, or <span class="math">\(\delta\)</span>.</li>
<li><span class="math">\(\gamma\)</span> appears to also be independent of <span class="math">\(M\)</span> at 0.5. Surely this isn’t hard to derive?</li>
<li><span class="math">\(\delta\)</span> increases dramatically from 1-D to 2-D (again, 2-D appears the most interesting case) and slowly settles back down to 0.5 at large <span class="math">\(M\)</span>.</li>
<li><span class="math">\(\sigma_{N_{unique},0}\)</span> has a strange effect for <span class="math">\(M=2\)</span> also, dipping before going back up and then finally decaying towards zero for <span class="math">\(M>4\)</span>.</li>
<li>For large <span class="math">\(M\)</span>, the data indicate that <span class="math">\(\sigma_{R,0} \approx \sigma_{N_{unique},0}\)</span> and <span class="math">\(\gamma \approx \delta \approx 0.5\)</span>, so that <span class="math">\(\sigma_{R} \approx \sigma_{N_{unique}}\)</span>. Further (not shown here, but on <a href="https://github.com/dustinmcintosh/random-walks/blob/master/figures/scaling_of_std_dev_vs_m.png">github</a>), <span class="math">\(\sigma_{R,0} \approx \sigma_{N_{unique},0} \sim \sqrt{1/M}\)</span>. In total, <span class="math">\(\sigma_{R} \approx \sigma_{N_{unique}} \sim \sqrt{N/M}\)</span> for large <span class="math">\(M\)</span> and <span class="math">\(N\)</span>.</li>
</ul>
<p>I really enjoyed working on this post from end-to-end: writing the random walker class, finding a great python jackknife function from astropy (see the <a href="https://colab.research.google.com/drive/13GYlaTvO-Wu_3ep_Pa0mRZo-CYelDFmf">colab notebook</a>), hacking my way through matplotlib, discovering the amazing Vineyard paper, and just generally exploring this incredibly rich problem that is so easy to state yet difficult to theoretically solve. I’ll also note that the 2-D results are of particular interest to me given their applicability to the <a href="https://www.efavdb.com/world-wandering-dudes">World Wandering Dudes</a> project that I have been working on.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
<h1>Q-learning and DQN</h1>
<p>2020-04-06, by Cathy Yeh (tag:efavdb.com,2020-04-06:/dqn)</p>
<p>Q-learning is a reinforcement learning (<span class="caps">RL</span>) algorithm that is the basis for deep Q networks (<span class="caps">DQN</span>), the algorithm by Google DeepMind that achieved human-level performance for a range of Atari games and kicked off the deep <span class="caps">RL</span> revolution starting in 2013-2015.</p>
<p>We begin with some historical context, then provide an overview of value function methods / Q-learning, and conclude with a discussion of <span class="caps">DQN</span>.</p>
<p>If you want to skip straight to code, the implementation of <span class="caps">DQN</span> that we used to train the agent playing Atari Breakout below is available <a href="https://github.com/frangipane/reinforcement-learning/tree/master/DQN">here</a>.</p>
<p align="center">
<img alt="Atari Breakout" src="images/atari_breakout.gif" style="width:250px;"/>
</p>
<p>If you watch the video long enough, you’ll see the agent has learned a strategy that favors breaking bricks at the edges so the ball “breaks out” to the upper side, resulting in a cascade of points.</p>
<h2 id="historical-context">Historical context</h2>
<p>The theories that underpin today’s reinforcement learning algorithms were developed decades ago. For example, Watkins developed Q-learning, a value function method, in <a href="http://www.cs.rhul.ac.uk/~chrisw/thesis.html">1989</a>, and Williams proposed the <span class="caps">REINFORCE</span> policy gradient method in <a href="https://link.springer.com/content/pdf/10.1007%2FBF00992696.pdf">1992</a>. So why the recent surge of interest in deep <span class="caps">RL</span>?</p>
<h3 id="representational-power-from-neural-networks">Representational power from Neural Networks</h3>
<p>Until 2013, most applications of <span class="caps">RL</span> relied on hand-engineered inputs for value function and policy representations, which drastically limited their scope of applicability to the real world. Mnih et al. [1] made use of advances in computational power and neural network (<span class="caps">NN</span>) architectures to train a deep <span class="caps">NN</span> for <em>value function approximation</em>, showing that NNs can learn a useful representation from raw pixel inputs in Atari games.</p>
<h3 id="variations-on-a-theme-vanilla-rl-algorithms-dont-work-well-out-of-the-box">Variations on a theme: vanilla <span class="caps">RL</span> algorithms don’t work well out-of-the-box</h3>
<p>The basic <span class="caps">RL</span> algorithms that were developed decades ago do not work well in practice without modifications. For example, <span class="caps">REINFORCE</span> relies on Monte Carlo estimates of the performance gradient; such estimates of the performance gradient are high variance, resulting in unstable or impractically slow learning (poor sample efficiency). The original Q-learning algorithm also suffers from instability due to correlated sequential training data and parameter updates affecting both the estimator and target, creating a “moving target” and hence divergences.</p>
<p>We can think of these original <span class="caps">RL</span> algorithms as the Wright Brothers plane.</p>
<p align="center">
<img alt="Wright brothers plane" src="images/wright_brothers_plane.png" style="width:500px;"/>
</p>
<p>The foundational shape is there and recognizable in newer models. However, the enhancements of newer algorithms aren’t just bells and whistles — they have enabled the move from toy problems into more functional territory.</p>
<h2 id="q-learning">Q-learning</h2>
<h3 id="background">Background</h3>
<p><span class="caps">RL</span> models the sequential decision-making problem as a Markov Decision Process (<span class="caps">MDP</span>): transitions from state to state involve both environment dynamics and an agent whose actions affect both the probability of transitioning to the next state and the reward received.</p>
<p>The goal is to find a policy, a mapping from states to actions, that maximizes the agent’s expected return, i.e. its cumulative future rewards.</p>
<p>Q-learning is an algorithm for learning the eponymous <span class="math">\(Q(s,a)\)</span> action-value function, defined as the expected returns for each state-action <span class="math">\((s,a)\)</span> pair, corresponding to following the optimal policy.</p>
<h3 id="goal-solve-the-bellman-optimality-equation">Goal: solve the Bellman optimality equation</h3>
<p>Recall that <span class="math">\(q_*\)</span> is described by a self-consistent, recursive relation, the Bellman optimality equation, that falls out from the Markov property [6, 7] of MDPs</p>
<div class="math">\begin{eqnarray}\label{action-value-bellman-optimality} \tag{1}
q_*(s, a) &=& \mathbb{E}_{\pi_*} [R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a] \\
&=& \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a') ]
\end{eqnarray}</div>
<p>where <span class="math">\(0 \leq \gamma \leq 1\)</span> is the <em>discount rate</em> which characterizes how much we weight rewards now vs. later, <span class="math">\(R_{t+1}\)</span> is the reward at timestep <span class="math">\(t+1\)</span>, and <span class="math">\(p(s', r | s, a)\)</span> is the environment transition dynamics.</p>
<p>Our <a href="https://efavdb.com/intro-rl-toy-example.html">introduction to <span class="caps">RL</span></a> provides more background on the Bellman equations in case (\ref{action-value-bellman-optimality}) looks unfamiliar.</p>
<h3 id="the-q-learning-approach-to-solving-the-bellman-equation">The Q-learning approach to solving the Bellman equation</h3>
<p>We use capitalized <span class="math">\(Q\)</span> to denote an estimate and lowercase <span class="math">\(q\)</span> to denote the real action-value function. The Q-learning algorithm makes the following update:</p>
<div class="math">\begin{eqnarray}\label{q-learning} \tag{2}
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]
\end{eqnarray}</div>
<p>The quantity in square brackets in (\ref{q-learning}) is exactly 0 for the optimal action-value function, <span class="math">\(q_*\)</span>, based on (\ref{action-value-bellman-optimality}). We can think of it as an error term, “the Bellman error”, that describes how far off the target quantity <span class="math">\(R_{t+1} + \gamma \max_a Q(S_{t+1}, a)\)</span> is from our current estimate <span class="math">\(Q(S_t, A_t)\)</span>.</p>
<p>The goal with Q-learning is to iteratively calculate (\ref{q-learning}), updating our estimate of <span class="math">\(Q\)</span> to reduce the Bellman error, until we have converged on a solution.</p>
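<p>To make the update in (\ref{q-learning}) concrete, here is a minimal tabular sketch on a toy problem of my own devising (a deterministic 5-state chain, not an environment from the references): the agent starts at state 0 and earns a reward of 1 for reaching state 4.</p>

```python
import numpy as np

# Toy deterministic chain: states 0..4, actions 0 (left) / 1 (right);
# arriving at state 4 yields reward 1 and ends the episode.
N_STATES, GOAL = 5, 4

def env_step(s, a):
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, GOAL)
    return s2, float(s2 == GOAL), s2 == GOAL  # next state, reward, done

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, 2))
alpha, gamma, eps = 0.5, 0.9, 0.1

for _ in range(500):  # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection, with random tie-breaking
        if rng.random() < eps or Q[s, 0] == Q[s, 1]:
            a = int(rng.integers(2))
        else:
            a = int(Q[s].argmax())
        s2, r, done = env_step(s, a)
        # the Q-learning update: move Q(s,a) toward the TD(0) target
        target = r + gamma * (0.0 if done else Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2

print([int(Q[s].argmax()) for s in range(GOAL)])  # -> [1, 1, 1, 1]
```

<p>After training, the greedy policy points right at every non-terminal state, and the estimate for the state-action pair just before the goal approaches the reward of 1.</p>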
<p><strong>Q-learning makes two approximations:</strong></p>
<p>I. It replaces the expectation value in (\ref{action-value-bellman-optimality}) with sampled estimates, similar to Monte Carlo estimates. Unlike the dynamic programming approach we described in an earlier <a href="https://efavdb.com/dp-in-rl.html">post</a>, sampling is necessary since we don’t have access to the model of the environment, i.e. the environment transition dynamics.</p>
<p><span class="caps">II</span>. It replaces the target <span class="math">\(R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a')\)</span> in (\ref{action-value-bellman-optimality}), which contains the true action-value function <span class="math">\(q_*\)</span>, with the one-step temporal difference, <span class="caps">TD</span>(0), target <span class="math">\(R_{t+1} + \gamma \max_a Q(S_{t+1}, a)\)</span>. The <span class="caps">TD</span>(0) target is an example of <em>bootstrapping</em> because it makes use of the current estimate of the action-value function, instead of, say, the cumulative rewards from an entire episode, which would be a Monte Carlo target. Temporal difference methods reduce the variance that comes from sampling a single trajectory, as Monte Carlo does, at the cost of introducing bias from using an approximate function in the target for updates.</p>
<p>Figure 8.11 of [7] nicely summarizes the types of approximations and their limits in the following diagram:</p>
<p><img alt="backup approximations" src="https://efavdb.com/images/backup_limits_diagram_sutton_barto.png"/></p>
<h2 id="deep-q-networks-dqn">Deep Q-Networks (<span class="caps">DQN</span>)</h2>
<h3 id="key-contributions-to-q-learning">Key contributions to Q-learning</h3>
<p>The <span class="caps">DQN</span> authors made two key enhancements to the original Q-learning algorithm to actually make it work:</p>
<ol>
<li>
<p><strong>Experience replay buffer</strong>: to reduce the instability caused by training on highly correlated sequential data, store samples (transition tuples <span class="math">\((s, a, s', r)\)</span>) in an “experience replay buffer”. Cut down correlations by randomly sampling the buffer for minibatches of training data. The idea of experience replay was introduced by <a href="http://www.incompleteideas.net/lin-92.pdf">Lin in 1992</a>.</p>
</li>
<li>
<p><strong>Freeze the target network</strong>: to address the instability caused by chasing a moving target, freeze the target network and only update it periodically with the latest parameters from the trained estimator.</p>
</li>
</ol>
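<p>The second enhancement is simple to express in code. Below is a minimal PyTorch sketch of a frozen target network with periodic hard updates; the tiny architecture and the sync interval are illustrative choices of mine, not the paper’s values:</p>

```python
import copy
import torch
import torch.nn as nn

# A small online Q-network; architecture here is illustrative only.
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))

# The target network starts as an exact copy and is never trained directly.
target_net = copy.deepcopy(q_net)
for p in target_net.parameters():
    p.requires_grad_(False)

TARGET_SYNC_EVERY = 1000  # gradient steps between hard updates (illustrative)

def maybe_sync(step):
    # Periodically copy the online weights into the frozen target network,
    # so the TD target stays fixed between syncs instead of "moving".
    if step % TARGET_SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```

<p>Between syncs, the targets computed from <code>target_net</code> are constant with respect to the parameters being optimized, which is what stabilizes the updates.</p>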
<p>These modifications enabled [1] to successfully train a deep Q-network, an action-value function approximated by a convolutional neural net, on the high dimensional visual inputs of a variety of Atari games.</p>
<p>The authors also employed a number of tweaks and data-preprocessing steps on top of the aforementioned key enhancements. One preprocessing trick of note was the concatenation of the four most recent frames as the input to the Q-network, in order to provide some sense of velocity or trajectory, e.g. the trajectory of a ball in games such as Pong or Breakout. This decision helps uphold the assumption that the problem is a Markov Decision Process, which underlies the Bellman optimality equations and the Q-learning algorithm; with only a single frame, the agent observes just a fraction of the state of the environment, turning the problem into a <a href="https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process">partially observable <span class="caps">MDP</span></a>.</p>
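<p>The frame-concatenation trick can be sketched with a <code>deque</code>; this is a simplified stand-in of my own (the real preprocessing in the baselines wrappers also handles frame-skipping, resizing, and grayscaling):</p>

```python
from collections import deque
import numpy as np

# Keep the last 4 preprocessed frames; stacking them lets the Q-network
# infer velocity. The 84x84 frame size follows the DQN paper.
frames = deque(maxlen=4)

def observe(frame):
    """Append a frame and return the stacked observation; at episode
    start, pad the history with copies of the first frame."""
    if not frames:
        frames.extend([frame] * frames.maxlen)
    else:
        frames.append(frame)
    return np.stack(frames, axis=0)  # shape (4, 84, 84), the network input
```
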
<h3 id="dqn-implementation-in-code"><span class="caps">DQN</span> implementation in code</h3>
<p>We’ve implemented <span class="caps">DQN</span> <a href="https://github.com/frangipane/reinforcement-learning/blob/master/DQN/dqn.py">here</a>, tested for (1) the <a href="https://gym.openai.com/envs/CartPole-v1/">Cartpole</a> toy problem, which uses a multilayer perceptron <code>MLPCritic</code> as the Q-function approximator for non-visual input data, and (2) Atari Breakout, which uses a convolutional neural network <code>CNNCritic</code> as the Q-function approximator for the (visual) Atari pixel data.</p>
<p>The Cartpole problem is trainable on the average modern laptop <span class="caps">CPU</span>, but we recommend using a beefier setup with GPUs and lots of memory to do Q-learning on Atari. Thanks to the OpenAI Scholars program and Microsoft, we were able to train <span class="caps">DQN</span> on Breakout using an Azure <a href="https://docs.microsoft.com/en-us/azure/virtual-machines/nc-series">Standard_NC24</a> consisting of 224 GiB <span class="caps">RAM</span> and 2 K80 GPUs.</p>
<p>The values from the <span class="math">\(Q\)</span> estimator and frozen target network are fed into the Huber loss that is used to update the parameters of the Q-function in this code snippet:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">compute_loss_q</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>
<span class="n">o</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">o2</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s1">'obs'</span><span class="p">],</span> <span class="n">data</span><span class="p">[</span><span class="s1">'act'</span><span class="p">],</span> <span class="n">data</span><span class="p">[</span><span class="s1">'rew'</span><span class="p">],</span> <span class="n">data</span><span class="p">[</span><span class="s1">'obs2'</span><span class="p">],</span> <span class="n">data</span><span class="p">[</span><span class="s1">'done'</span><span class="p">]</span>
<span class="c1"># Pick out q-values associated with / indexed by the action that was taken</span>
<span class="c1"># for that observation</span>
<span class="n">q</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="n">ac</span><span class="o">.</span><span class="n">q</span><span class="p">(</span><span class="n">o</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">a</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">long</span><span class="p">())</span>
<span class="c1"># Bellman backup for Q function</span>
<span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="c1"># Targets come from frozen target Q-network</span>
<span class="n">q_target</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="n">target_q_network</span><span class="p">(</span><span class="n">o2</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">values</span>
<span class="n">backup</span> <span class="o">=</span> <span class="n">r</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">d</span><span class="p">)</span> <span class="o">*</span> <span class="n">gamma</span> <span class="o">*</span> <span class="n">q_target</span>
<span class="n">loss_q</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">smooth_l1_loss</span><span class="p">(</span><span class="n">q</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">backup</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="k">return</span> <span class="n">loss_q</span>
</pre></div>
<p>The experience replay buffer was adapted from OpenAI’s Spinning Up in <span class="caps">RL</span> [6] code tutorials:</p>
<div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">ReplayBuffer</span><span class="p">:</span>
<span class="sd">"""</span>
<span class="sd"> A simple FIFO experience replay buffer for DDPG agents.</span>
<span class="sd"> Copied from: https://github.com/openai/spinningup/blob/master/spinup/algos/pytorch/ddpg/ddpg.py#L11,</span>
<span class="sd"> modified action buffer for discrete action space.</span>
<span class="sd"> """</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">obs_dim</span><span class="p">,</span> <span class="n">act_dim</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">store</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">obs</span><span class="p">,</span> <span class="n">act</span><span class="p">,</span> <span class="n">rew</span><span class="p">,</span> <span class="n">next_obs</span><span class="p">,</span> <span class="n">done</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">obs_buf</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">ptr</span><span class="p">]</span> <span class="o">=</span> <span class="n">obs</span>
<span class="bp">self</span><span class="o">.</span><span class="n">obs2_buf</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">ptr</span><span class="p">]</span> <span class="o">=</span> <span class="n">next_obs</span>
<span class="bp">self</span><span class="o">.</span><span class="n">act_buf</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">ptr</span><span class="p">]</span> <span class="o">=</span> <span class="n">act</span>
<span class="bp">self</span><span class="o">.</span><span class="n">rew_buf</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">ptr</span><span class="p">]</span> <span class="o">=</span> <span class="n">rew</span>
<span class="bp">self</span><span class="o">.</span><span class="n">done_buf</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">ptr</span><span class="p">]</span> <span class="o">=</span> <span class="n">done</span>
<span class="bp">self</span><span class="o">.</span><span class="n">ptr</span> <span class="o">=</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">ptr</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="bp">self</span><span class="o">.</span><span class="n">max_size</span>
<span class="bp">self</span><span class="o">.</span><span class="n">size</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">size</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">max_size</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">sample_batch</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">):</span>
<span class="n">idxs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">size</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">replace</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">batch</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">obs</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">obs_buf</span><span class="p">[</span><span class="n">idxs</span><span class="p">],</span>
<span class="n">obs2</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">obs2_buf</span><span class="p">[</span><span class="n">idxs</span><span class="p">],</span>
<span class="n">act</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">act_buf</span><span class="p">[</span><span class="n">idxs</span><span class="p">],</span>
<span class="n">rew</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">rew_buf</span><span class="p">[</span><span class="n">idxs</span><span class="p">],</span>
<span class="n">done</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">done_buf</span><span class="p">[</span><span class="n">idxs</span><span class="p">])</span>
<span class="k">return</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="n">torch</span><span class="o">.</span><span class="n">as_tensor</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">int32</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">device</span><span class="p">)</span> <span class="k">if</span> <span class="n">k</span> <span class="o">==</span> <span class="s1">'act'</span>
<span class="k">else</span> <span class="n">torch</span><span class="o">.</span><span class="n">as_tensor</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">device</span><span class="p">)</span>
<span class="k">for</span> <span class="n">k</span><span class="p">,</span><span class="n">v</span> <span class="ow">in</span> <span class="n">batch</span><span class="o">.</span><span class="n">items</span><span class="p">()}</span>
</pre></div>
<p>Finally, we used OpenAI’s baselines <a href="https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py">Atari wrappers</a> to handle the rather involved data preprocessing steps.</p>
<p>You can see logs and plots like this plot of the mean raw returns per step in the environment for the Atari <span class="caps">DQN</span> training run in our <a href="https://app.wandb.ai/frangipane/dqn/runs/30fhfv6y?workspace=user-frangipane">wandb dashboard</a>.</p>
<p><img alt="training curve" src="https://efavdb.com/images/atari_training_returns.png"/></p>
<h2 id="conclusion">Conclusion</h2>
<p>From a pedagogical point of view, Q-learning is a good study for someone getting off the ground with <span class="caps">RL</span> since it pulls together many core <span class="caps">RL</span> concepts, namely:</p>
<ol>
<li>Model the sequential decision making process as an <strong><span class="caps">MDP</span></strong> where environment dynamics are unknown.</li>
<li>Frame the problem as finding <strong>action-value functions</strong> that satisfy the Bellman equations.</li>
<li>Iteratively solve the Bellman equations using <strong>bootstrapped estimates</strong> from samples of an agent’s interactions with an environment.</li>
<li>Use neural networks to <strong>approximate value functions</strong> to handle the more realistic situation of an observation space being too high-dimensional to be stored in a table.</li>
</ol>
<p><span class="caps">DQN</span> itself, built on top of vanilla Q-learning, is noteworthy because its modifications (experience replay and frozen target networks) are what make Q-learning actually work, demonstrating that the devil is in the details.</p>
<p>Furthermore, the <span class="caps">DQN</span> tricks have been incorporated in many other <span class="caps">RL</span> algorithms, e.g. see [6] for more examples. The tricks aren’t necessarily “pretty”, but they come from understanding/intuition about shortcomings of the basic algorithms.</p>
<h2 id="references">References</h2>
<p><strong>Papers</strong></p>
<ul>
<li>[1] Mnih et al 2015 - <a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#the-optimal-q-function-and-the-optimal-action">Human-level control through deep reinforcement learning</a></li>
</ul>
<p><strong>Video lectures</strong></p>
<ul>
<li>[2] David Silver - <span class="caps">RL</span> lecture 6 Value Function Approximation (<a href="https://www.youtube.com/watch?v=UoPei5o4fps">video</a>, <a href="https://www.davidsilver.uk/wp-content/uploads/2020/03/FA.pdf">slides</a>)</li>
<li>[3] Sergey Levine’s lecture (<span class="caps">CS285</span>) on value function methods (<a href="https://www.youtube.com/watch?v=doR5bMe-Wic&list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A&index=8&t=129s">video</a>, <a href="http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-7.pdf">slides</a>)</li>
<li>[4] Sergey Levine’s lecture (<span class="caps">CS285</span>) on deep <span class="caps">RL</span> with Q-functions (<a href="https://www.youtube.com/watch?v=7Lwf-BoIu3M&list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A&index=9&t=0s">video</a>, <a href="http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-8.pdf">slides</a>)</li>
<li>[5] Vlad Mnih - Berkeley Deep <span class="caps">RL</span> Bootcamp 2017 - Core Lecture 3 <span class="caps">DQN</span> + Variants (<a href="https://www.youtube.com/watch?v=fevMOp5TDQs">video</a>, <a href="https://drive.google.com/open?id=0BxXI_RttTZAhVUhpbDhiSUFFNjg">slides</a>)</li>
</ul>
<p><strong>Books / tutorials</strong></p>
<ul>
<li>[6] OpenAI - Spinning Up: <a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#the-optimal-q-function-and-the-optimal-action">The Optimal Q-Function and the Optimal Action</a></li>
<li>[7] Sutton and Barto - <a href="http://incompleteideas.net/book/RLbook2018.pdf">Reinforcement Learning: An Introduction (2nd Edition)</a>, section 6.5 “Q-learning: Off-policy <span class="caps">TD</span> Control”, section 16.5 “Human-level Video Game Play”</li>
</ul>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Sample pooling to reduce needed disease screening test counts2020-03-29T00:00:00-07:002020-03-29T00:00:00-07:00Jonathan Landytag:efavdb.com,2020-03-29:/pooling<p>Pooling of test samples can be used to reduce the mean number of test counts
required to determine who in a set of subjects carries a disease. E.g., if the
blood samples of a set of office workers are combined and tested, and the test
comes back negative, then …</p><p>Pooling of test samples can be used to reduce the mean number of test counts
required to determine who in a set of subjects carries a disease. E.g., if the
blood samples of a set of office workers are combined and tested, and the test
comes back negative, then the full office can be ruled out as disease carriers
using just a single test (whereas the naive approach would require testing each
separately). However, if the test comes back positive, then a refined search
through the workers must be carried out to decide which have the disease and
which do not.</p>
<p>Here, we consider two methods for refined search when a group is flagged
positive, and provide python code that can be used to find the optimal pooling
strategy. The optimal choice depends on the frequency of disease within the testing
population, <span class="math">\(p\)</span>.</p>
<p>Summary of the impact of pooling:</p>
<ul>
<li>If <span class="math">\(p = O(1)\)</span>, so that many people have the illness, pooling doesn’t help. </li>
<li>If <span class="math">\(p = 0.1\)</span>, perhaps typical of people being screened with symptoms, we can
reduce the test count needed by about <span class="math">\(\sim 0.6\)</span> using pooling, and the two refined
search methods we consider perform similarly here.</li>
<li>If <span class="math">\(p = 0.001\)</span>, so that positive cases are rare — perhaps useful for
screening an office of workers expected to be healthy, then we can cut the
mean test count by a factor of <span class="math">\(50\)</span>, and the bisection method for refined search performs best here (details below).</li>
</ul>
<p>Code for this analysis can be found at our github (<a href="https://github.com/EFavDB/pooling/blob/master/pooling_samples.ipynb">link</a>).</p>
<h4 id="covid19-background-strategies-considered-here"><strong><span class="caps">COVID19</span> background, strategies considered here</strong></h4>
<p>The idea of pooling is an old one, but I happened on the idea when an article
was posted about it to the statistics subreddit this past week (<a
href="https://www.reddit.com/r/statistics/comments/fl3dlw/q_if_you_could_test_batches_of_64_samples_for/">link</a>).
There the question was posed what the optimal pooling count would be,
motivating this post.</p>
<p>I imagine pooling may be useful for <span class="caps">COVID19</span> under two conditions: (1)
situations where testing capacity is the limiting factor (as opposed to speed
of diagnosis, say), and (2) situations where a great many people need to be
screened and it is unlikely that any of them have the disease — e.g., daily tests
within a large office building.</p>
<p>We consider two pooling methods here: (1) A simple method where if the test
on the group comes back positive, we immediately screen each individual. (2) A
bisection method, where if a group comes back positive, we split it in two and
run the test on each subgroup, repeating from there recursively. E.g., in a
group of size 16 with one positive, the recursive approach generates the following
set of test subsets (see notebook on our github linked above for code)</p>
<div class="highlight"><pre><span></span><span class="n">seq</span> <span class="o">=</span> <span class="n">generate_random_seq</span><span class="p">()</span>
<span class="n">test_counts_needed</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">test_count</span><span class="p">(</span><span class="n">seq</span><span class="p">))</span>
<span class="n">total</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">16</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
<span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
<span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
<span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
</pre></div>
<p>Here, the 13th individual had the disease, and the bisection method required a
total of 9 tests (one for each row above) to determine the full set of diagnoses. Note that
9 is less than 16, the number needed when we screen everyone from the start.</p>
<p>Our purpose is to provide code and equations that can be used to select between these two
methods, should anyone want to apply this idea. Caveat: we currently ignore
any possibility of error in the tests themselves, which may make the approach invalid for
some or all of the current <span class="caps">COVID19</span> tests. Error rates should be studied next
where appropriate.</p>
<h4 id="model-and-results"><strong>Model and results</strong></h4>
<p>We posit that we have a pool of
</p>
<div class="math">\begin{eqnarray}
N = 2^{\mathbb{K}} \tag{1} \label{count_pop}
\end{eqnarray}</div>
<p>
people to be tested. In the first round, we pool all their samples and test the
group. If the group comes back positive, we then run one of the refined methods to
figure out exactly which people have the illness. Each person is assumed to have a probability <span class="math">\(p\)</span> of having the disease, independently of the others.
Below, we ask how to set <span class="math">\(\mathbb{K}\)</span> — which determines the pooling size —
so as to minimize the mean number of tests needed divided by <span class="math">\(N\)</span>, which can be
considered the pooling reduction factor.</p>
<p>The mean number of tests needed from the simple strategy is
</p>
<div class="math">\begin{eqnarray}\tag{2} \label{simple_result}
\overline{N}_{simple} = (1 - p)^N\times 1 + \left [1 - (1-p)^N \right] \times (1 + N)
\end{eqnarray}</div>
<p>
The mean number needed in the bisection strategy is
</p>
<div class="math">\begin{eqnarray} \tag{3} \label{bisection_result}
\overline{N}_{bisection} = 1 + 2 \sum_{k=0}^{\mathbb{K}-1} 2^k \left (1 - (1 -p)^{2^{\mathbb{K}-k}} \right)
\end{eqnarray}</div>
<p>
The proof of (\ref{simple_result}) is straightforward and we give an argument for
(\ref{bisection_result}) in an appendix. A cell of our notebook checks this
and confirms its accuracy.</p>
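<p>The same kind of check can be sketched with a quick Monte Carlo estimate (our own code, with hypothetical helper names) compared against the theoretical mean bisection count:</p>

```python
import numpy as np

def bisection_count(seq):
    """Tests used by the bisection strategy on a 0/1 array of diagnoses."""
    if not seq.any() or len(seq) == 1:
        return 1  # a negative pooled test, or a single individual, ends here
    m = len(seq) // 2
    return 1 + bisection_count(seq[:m]) + bisection_count(seq[m:])

def mc_mean_tests(p, K, n_trials=20000, seed=0):
    """Monte Carlo estimate of the mean test count for groups of size 2**K."""
    rng = np.random.default_rng(seed)
    return np.mean([bisection_count(rng.random(2 ** K) < p)
                    for _ in range(n_trials)])

def theory_mean_tests(p, K):
    """Theoretical mean, summing expected positives over the K split levels."""
    return 1 + 2 * sum(2 ** k * (1 - (1 - p) ** (2 ** (K - k)))
                       for k in range(K))
```

<p>For, e.g., <span class="math">\(p = 0.1\)</span> and <span class="math">\(\mathbb{K} = 3\)</span>, the two agree to within Monte Carlo error.</p>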
<p>Using the above results, our code produces plots of the mean number of tests
needed to screen a population vs <span class="math">\(\mathbb{K}\)</span>, and then reads off the optimal
<span class="math">\(\mathbb{K}\)</span> for each method. The plots below give the results for the three <span class="math">\(p\)</span> values
noted in the abstract.</p>
<ul>
<li>
<p>Case 1: <span class="math">\(p = 0.5\)</span>, large fraction of disease carriers. Main result: The
pooling strategies both cause the mean number of tests to be larger than if
we just screened each individual from the start (seen here because the y-axis
values are always bigger than 1). The approach is not useful here.
<img alt="![parameter study]({static}/images/pooling_05.png)" src="https://efavdb.com/images/pooling_05.png"></p>
</li>
<li>
<p>Case 2: <span class="math">\(p = 0.1\)</span>, modest fraction of disease carriers. Main result: The two
methods both give comparable benefits. It is optimal to pool using
<span class="math">\(\mathbb{K}=2\)</span>, which gives groups of <span class="math">\(N = 4\)</span> patients. This cuts the number of
needed tests by a factor of <span class="math">\(0.6\)</span>.
<img alt="![parameter study]({static}/images/pooling_01.png)" src="https://efavdb.com/images/pooling_01.png"></p>
</li>
<li>
<p>Case 3: <span class="math">\(p = 0.001\)</span>, small fraction of disease carriers. Main result:
Bisection wins; the optimal <span class="math">\(\mathbb{K} = 9\)</span> here, which gives a pooling
group of size <span class="math">\(512\)</span>. We cut the test count needed by a factor of <span class="math">\(50\)</span>. Note:
we also show a histogram of the number of tests needed in simulated runs of
such a system. Most runs need only a single test; there is a second peak near
<span class="math">\(20\)</span> tests, with a long tail after that.
<img alt="![parameter study]({static}/images/pooling_01.png)" src="https://efavdb.com/images/pooling_0001.png">
<img alt="![parameter study]({static}/images/pooling_hist.png)" src="https://efavdb.com/images/pooling_hist.png"></p>
</li>
</ul>
<p>The code to generate the optimal <span class="math">\(\mathbb{K}\)</span> plots above is given below. This
can be used to generate generalized plots like those above for any <span class="math">\(p\)</span>. The
histogram plot is contained in our github repo, linked in our abstract. Our
appendix follows.</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="o">%</span><span class="n">pylab</span> <span class="n">inline</span>
<span class="n">K</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">P_POSITIVE</span> <span class="o">=</span> <span class="mf">0.05</span>
<span class="k">def</span> <span class="nf">theory_bisection</span><span class="p">(</span><span class="n">p</span><span class="o">=</span><span class="n">P_POSITIVE</span><span class="p">,</span> <span class="n">K</span><span class="o">=</span><span class="n">K</span><span class="p">):</span>
<span class="n">count</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">([</span><span class="mi">2</span> <span class="o">**</span> <span class="n">k</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">p</span><span class="p">)</span> <span class="o">**</span> <span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="p">(</span><span class="n">K</span> <span class="o">-</span> <span class="n">k</span><span class="p">)))</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">K</span><span class="p">)]</span> <span class="p">)</span>
<span class="k">return</span> <span class="n">count</span> <span class="o">/</span> <span class="mi">2</span> <span class="o">**</span> <span class="n">K</span>
<span class="k">def</span> <span class="nf">theory_simple</span><span class="p">(</span><span class="n">p</span><span class="o">=</span><span class="n">P_POSITIVE</span><span class="p">,</span> <span class="n">K</span><span class="o">=</span><span class="n">K</span><span class="p">):</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">**</span> <span class="n">K</span>
<span class="n">p0</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">p</span><span class="p">)</span> <span class="o">**</span> <span class="n">n</span>
<span class="n">count</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">*</span> <span class="n">p0</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">n</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">p0</span><span class="p">)</span>
<span class="k">return</span> <span class="n">count</span> <span class="o">/</span> <span class="n">n</span>
<span class="nb">print</span> <span class="s1">'Bisection: fraction of full testing: </span><span class="si">%2.2f</span><span class="s1">'</span> <span class="o">%</span> <span class="p">(</span><span class="n">theory_bisection</span><span class="p">())</span>
<span class="nb">print</span> <span class="s1">'Simple: fraction of full testing: </span><span class="si">%2.2f</span><span class="s1">'</span> <span class="o">%</span> <span class="p">(</span><span class="n">theory_simple</span><span class="p">())</span>
<span class="n">p</span> <span class="o">=</span> <span class="mf">0.1</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[</span><span class="n">theory_bisection</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">15</span><span class="p">)]</span>
<span class="n">min_index</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argmin</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="s1">'o--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'bisection (min = </span><span class="si">%2.2f</span><span class="s1">)'</span><span class="o">%</span><span class="n">data</span><span class="p">[</span><span class="n">min_index</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">min_index</span><span class="p">,</span> <span class="n">data</span><span class="p">[</span><span class="n">min_index</span><span class="p">],</span> <span class="s1">'ro'</span><span class="p">,</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[</span><span class="n">theory_simple</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">15</span><span class="p">)]</span>
<span class="n">min_index</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argmin</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="s1">'o--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'simple (min = </span><span class="si">%2.2f</span><span class="s1">)'</span><span class="o">%</span><span class="n">data</span><span class="p">[</span><span class="n">min_index</span><span class="p">],</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">min_index</span><span class="p">,</span> <span class="n">data</span><span class="p">[</span><span class="n">min_index</span><span class="p">],</span> <span class="s1">'go'</span><span class="p">,</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Test count reduction vs log_2 pooling size, p = </span><span class="si">%0.3f</span><span class="s1">'</span> <span class="o">%</span><span class="n">p</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'log_2 pooling size'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'mean tests / pooling size'</span><span class="p">)</span>
</pre></div>
<h4 id="appendix-derivation-of-refbisection_result"><strong>Appendix: Derivation of (\ref{bisection_result})</strong></h4>
<p>Consider a binary tree whose root node is the initial test. Each node
has two children, corresponding to the tests of the two subgroups into which its
group is split; these must be run if the parent test is positive. Level <span class="math">\(0\)</span> is the initial
test, and the tests <span class="math">\(k\)</span> rows down form level <span class="math">\(k\)</span>. There are a total of <span class="math">\(2^k\)</span>
possible tests to run at this level, and there are <span class="math">\(\mathbb{K}\)</span> levels in all.</p>
<p>The number of tests that must be run at level <span class="math">\(k\)</span> is set by the number of
positive tests at level <span class="math">\(k-1\)</span>: each positive test spawns two child tests. We therefore have
</p>
<div class="math">\begin{eqnarray}
\text{Number of tests} = 1 + 2 \sum_{k=0}^{\mathbb{K} - 1} \left(\text{number positive at level } k\right)
\end{eqnarray}</div>
<p>
Averaging this equation gives
</p>
<div class="math">\begin{eqnarray}
\overline{\text{Number of tests}} &=& 1 + 2 \sum_{k=0}^{\mathbb{K} - 1} 2^k \times prob(\text{test at level k positive}) \\
&=& 1 + 2 \sum_{k=0}^{\mathbb{K} - 1} 2^k \times [ 1- (1 - p)^{2^{\mathbb{K} - k}}].
\end{eqnarray}</div>
<p>
The inner factor here is the probability that a given test at level <span class="math">\(k\)</span>
comes back positive — such a test pools <span class="math">\(N / 2^k = 2^{\mathbb{K} - k}\)</span> people.
(Any subgroup containing a positive person is in fact tested, since its parent group must also test positive.)
This is the result shown above in (\ref{bisection_result}).</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Dynamic programming in reinforcement learning2020-03-28T12:00:00-07:002020-03-28T12:00:00-07:00Cathy Yehtag:efavdb.com,2020-03-28:/reinforcement-learning-dynamic-programming<h2 id="background">Background</h2>
<p>We discuss how to use dynamic programming (<span class="caps">DP</span>) to solve reinforcement learning (<span class="caps">RL</span>) problems where we have a perfect model of the environment. <span class="caps">DP</span> is a general approach to solving problems by breaking them into subproblems that can be solved separately, cached, then combined to solve the overall problem …</p><h2 id="background">Background</h2>
<p>We discuss how to use dynamic programming (<span class="caps">DP</span>) to solve reinforcement learning (<span class="caps">RL</span>) problems where we have a perfect model of the environment. <span class="caps">DP</span> is a general approach to solving problems by breaking them into subproblems that can be solved separately, cached, then combined to solve the overall problem.</p>
<p>We’ll use a toy model, taken from [1], of a student transitioning between five states in college, which we also used in our <a href="https://efavdb.com/intro-rl-toy-example.html">introduction</a> to <span class="caps">RL</span>:</p>
<p><img alt="student MDP" src="https://efavdb.com/images/student_mdp.png"></p>
<p>The model (dynamics) of the environment describes the probabilities of receiving a reward <span class="math">\(r\)</span> in the next state <span class="math">\(s'\)</span> given the current state <span class="math">\(s\)</span> and action <span class="math">\(a\)</span> taken, <span class="math">\(p(s’, r | s, a)\)</span>. We can read these dynamics off the diagram of the student Markov Decision Process (<span class="caps">MDP</span>), for example:</p>
<p><span class="math">\(p(s'=\text{CLASS2}, r=-2 | s=\text{CLASS1}, a=\text{study}) = 1.0\)</span></p>
<p><span class="math">\(p(s'=\text{CLASS2}, r=1 | s=\text{CLASS3}, a=\text{pub}) = 0.4\)</span></p>
<p>If you’d like to jump straight to code, see this <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/student_MDP_dynamic_programming_solutions.ipynb">jupyter notebook</a>.</p>
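<p>One simple way to encode such dynamics in code (our own illustrative representation, not necessarily the notebook’s) is a dict keyed by <span class="math">\((s, a)\)</span> pairs:</p>

```python
# Partial dynamics p(s', r | s, a) for the student MDP, transcribed from the
# two transitions quoted above; the remaining entries would be read off the
# diagram in the same way.
dynamics = {
    ("CLASS1", "study"): [(1.0, "CLASS2", -2)],  # (prob, next state, reward)
    ("CLASS3", "pub"): [(0.4, "CLASS2", 1)],     # one of the pub outcomes
}

def p(s_next, r, s, a):
    """Return p(s', r | s, a) under the (partial) table above."""
    return sum(prob for prob, s2, r2 in dynamics.get((s, a), [])
               if s2 == s_next and r2 == r)

print(p("CLASS2", -2, "CLASS1", "study"))  # 1.0
```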
<h3 id="the-role-of-value-functions-in-rl">The role of value functions in <span class="caps">RL</span></h3>
<p>The agent’s (student’s) policy maps states to actions, <span class="math">\(\pi(a|s) := p(a|s)\)</span>.
The goal is to find the optimal policy <span class="math">\(\pi_*\)</span> that will maximize the expected cumulative rewards, the discounted return <span class="math">\(G_t\)</span>, in each state <span class="math">\(s\)</span>.</p>
<p>The value functions, <span class="math">\(v_{\pi}(s)\)</span> and <span class="math">\(q_{\pi}(s, a)\)</span>, in MDPs formalize this goal.</p>
<div class="math">\begin{eqnarray}
v_{\pi}(s) &=& \mathbb{E}_{\pi}[G_t | S_t = s] \\
q_{\pi}(s, a) &=& \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a]
\end{eqnarray}</div>
<p>We want to be able to calculate the value function for an arbitrary policy, i.e. <em>prediction</em>, as well as use the value functions to find an optimal policy, i.e. the <em>control</em> problem.</p>
<h2 id="policy-evaluation">Policy evaluation</h2>
<p>Policy evaluation deals with the problem of calculating the value function for some arbitrary policy. In our introduction to <span class="caps">RL</span> <a href="https://efavdb.com/intro-rl-toy-example.html">post</a>, we showed that the value functions obey self-consistent, recursive relations that make them amenable to <span class="caps">DP</span> approaches given a model of the environment.</p>
<p>These recursive relations are the Bellman expectation equations, which write the value of each state in terms of an average over the values of its successor / neighboring states, along with the expected reward along the way.</p>
<p>The Bellman expectation equation for <span class="math">\(v_{\pi}(s)\)</span> is</p>
<div class="math">\begin{eqnarray}\label{state-value-bellman} \tag{1}
v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s’, r} p(s’, r | s, a) [r + \gamma v_{\pi}(s’) ],
\end{eqnarray}</div>
<p>where <span class="math">\(\gamma\)</span> is the discount factor <span class="math">\(0 \leq \gamma \leq 1\)</span> that weights the importance of future vs. current returns. <strong><span class="caps">DP</span> turns (\ref{state-value-bellman}) into an update rule</strong> (\ref{policy-evaluation}), <span class="math">\(\{v_k(s’)\} \rightarrow v_{k+1}(s)\)</span>, which iteratively converges towards the solution, <span class="math">\(v_\pi(s)\)</span>, for (\ref{state-value-bellman}):</p>
<div class="math">\begin{eqnarray}\label{policy-evaluation} \tag{2}
v_{k+1}(s) = \sum_{a} \pi(a|s) \sum_{s’, r} p(s’, r | s, a) [r + \gamma v_k(s’) ]
\end{eqnarray}</div>
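<p>A minimal sketch of this update rule on a tiny hypothetical two-state <span class="caps">MDP</span> (not the student <span class="caps">MDP</span>; the states, actions, and rewards here are made up for illustration):</p>

```python
import numpy as np

# Toy MDP: state 1 is terminal; in state 0, "stay" earns +1 and loops back,
# "quit" earns 0 and terminates.
dynamics = {  # (s, a) -> list of (prob, next_state, reward, done)
    (0, "stay"): [(1.0, 0, 1.0, False)],
    (0, "quit"): [(1.0, 1, 0.0, True)],
}

def policy_evaluation(policy, dynamics, n_states, gamma=0.9, tol=1e-8):
    """Sweep v_{k+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a)[r + gamma v_k(s')]."""
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            new_v = sum(
                pi_a * prob * (r + gamma * (0.0 if done else v[s2]))
                for a, pi_a in policy.get(s, {}).items()
                for prob, s2, r, done in dynamics[(s, a)]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

v = policy_evaluation({0: {"stay": 1.0}}, dynamics, n_states=2)
print(round(v[0], 4))  # 10.0: the "always stay" policy is worth 1 / (1 - 0.9)
```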
<p>Applying policy evaluation to our student model for an agent with a random policy, we arrive at the following state value function (see <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/student_MDP_dynamic_programming_solutions.ipynb">jupyter notebook</a> for implementation):</p>
<p><img alt="student MDP value function random policy" src="https://efavdb.com/images/student_mdp_values_random_policy.png"></p>
<h2 id="finding-the-optimal-value-functions-and-policy">Finding the optimal value functions and policy</h2>
<h3 id="policy-iteration">Policy iteration</h3>
<p>We can evaluate the value functions for a given policy by turning the Bellman expectation equation (\ref{state-value-bellman}) into an update equation with the iterative policy evaluation algorithm.</p>
<p>But how do we use value functions to achieve our end goal of finding an optimal policy that corresponds to the optimal value functions?</p>
<p>Imagine we know the value function of a policy. If, in some state, the greedy action <span class="math">\(\text{arg} \max_a q_{\pi}(s,a)\)</span> differs from the policy’s action (equivalently, if <span class="math">\(\max_a q_{\pi}(s,a) > v_\pi(s)\)</span>), then the policy is not optimal: we can improve it by taking the greedy action in that state and then following the original policy from there onwards.</p>
<p>The <em>policy iteration</em> algorithm involves taking turns calculating the value function for a policy (policy evaluation) and improving on the policy (policy improvement) by taking the greedy action in each state for that value function until converging to <span class="math">\(\pi_*\)</span> and <span class="math">\(v_*\)</span> (see [2] for pseudocode for policy iteration).</p>
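<p>Policy iteration can be sketched in a few lines of Python for the student <span class="caps">MDP</span>, with the dynamics encoded as (state, action) mapped to (probability, next state, reward) triples and reward values assumed from Silver’s lecture example. One practical wrinkle: with <span class="math">\(\gamma = 1\)</span>, an improper starting policy, e.g. one that loops in the <span class="caps">FACEBOOK</span> state forever, would make exact policy evaluation diverge, so the evaluation sweeps are capped:</p>

```python
# Student MDP (rewards assumed from Silver's lecture example).
P = {
    ("CLASS1", "study"):      [(1.0, "CLASS2", -2)],
    ("CLASS1", "facebook"):   [(1.0, "FACEBOOK", -1)],
    ("CLASS2", "study"):      [(1.0, "CLASS3", -2)],
    ("CLASS2", "sleep"):      [(1.0, "SLEEP", 0)],
    ("CLASS3", "study"):      [(1.0, "SLEEP", 10)],
    ("CLASS3", "pub"):        [(0.2, "CLASS1", 1), (0.4, "CLASS2", 1), (0.4, "CLASS3", 1)],
    ("FACEBOOK", "facebook"): [(1.0, "FACEBOOK", -1)],
    ("FACEBOOK", "quit"):     [(1.0, "CLASS1", 0)],
}
STATES = ["CLASS1", "CLASS2", "CLASS3", "FACEBOOK", "SLEEP"]
ACTIONS = {s: [a for (s0, a) in P if s0 == s] for s in STATES}

def q_from_v(s, a, v, gamma):
    """Expected return of taking action a in state s, then following v."""
    return sum(p * (r + gamma * v[s1]) for p, s1, r in P[(s, a)])

def evaluate(pi, gamma=1.0, tol=1e-10, max_sweeps=1000):
    """Evaluate a deterministic policy; the sweep cap guards against
    improper starting policies that never reach the terminal state."""
    v = {s: 0.0 for s in STATES}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in STATES:
            if ACTIONS[s]:
                new = q_from_v(s, pi[s], v, gamma)
                delta, v[s] = max(delta, abs(new - v[s])), new
        if delta < tol:
            break
    return v

def policy_iteration(gamma=1.0):
    pi = {s: ACTIONS[s][0] for s in STATES if ACTIONS[s]}  # arbitrary start
    while True:
        v = evaluate(pi, gamma)                                # policy evaluation
        new_pi = {s: max(ACTIONS[s], key=lambda a: q_from_v(s, a, v, gamma))
                  for s in pi}                                 # greedy improvement
        if new_pi == pi:
            return pi, v
        pi = new_pi

pi_star, v_star = policy_iteration()
print(pi_star)  # study in all three classes, quit from FACEBOOK
```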
<h3 id="value-iteration">Value iteration</h3>
<p>Unlike policy iteration, the value iteration algorithm does not require complete convergence of policy evaluation before policy improvement, and, in fact, makes use of just a single iteration of policy evaluation. Just as policy evaluation could be viewed as turning the Bellman expectation equation into an update, value iteration turns the Bellman optimality equation into an update.</p>
<p>In our previous <a href="https://efavdb.com/intro-rl-toy-example.html">post</a> introducing <span class="caps">RL</span> using the student example, we saw that the optimal value functions are the solutions to the Bellman optimality equation, e.g. for the optimal state-value function:</p>
<div class="math">\begin{eqnarray}\label{state-value-bellman-optimality} \tag{3}
v_*(s) &=& \max_a q_{\pi_*}(s, a) \\
&=& \max_a \mathbb{E} [R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a] \\
&=& \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s') ]
\end{eqnarray}</div>
<p>As a <span class="caps">DP</span> update equation, (\ref{state-value-bellman-optimality}) becomes:
</p>
<div class="math">\begin{eqnarray}\label{value-iteration} \tag{4}
v_{k+1}(s) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_k(s') ]
\end{eqnarray}</div>
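<p>A minimal Python sketch of update (\ref{value-iteration}) for the student <span class="caps">MDP</span>, with the dynamics encoded as (state, action) mapped to (probability, next state, reward) triples and reward values assumed from Silver’s lecture example:</p>

```python
# Student MDP (rewards assumed from Silver's lecture example).
P = {
    ("CLASS1", "study"):      [(1.0, "CLASS2", -2)],
    ("CLASS1", "facebook"):   [(1.0, "FACEBOOK", -1)],
    ("CLASS2", "study"):      [(1.0, "CLASS3", -2)],
    ("CLASS2", "sleep"):      [(1.0, "SLEEP", 0)],
    ("CLASS3", "study"):      [(1.0, "SLEEP", 10)],
    ("CLASS3", "pub"):        [(0.2, "CLASS1", 1), (0.4, "CLASS2", 1), (0.4, "CLASS3", 1)],
    ("FACEBOOK", "facebook"): [(1.0, "FACEBOOK", -1)],
    ("FACEBOOK", "quit"):     [(1.0, "CLASS1", 0)],
}
STATES = ["CLASS1", "CLASS2", "CLASS3", "FACEBOOK", "SLEEP"]
ACTIONS = {s: [a for (s0, a) in P if s0 == s] for s in STATES}

def value_iteration(gamma=1.0, tol=1e-10):
    """Iterate v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma*v_k(s')]."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if ACTIONS[s]:
                new = max(sum(p * (r + gamma * v[s1]) for p, s1, r in P[(s, a)])
                          for a in ACTIONS[s])
                delta, v[s] = max(delta, abs(new - v[s])), new
        if delta < tol:
            return v

v_star = value_iteration()
# read off the greedy (optimal) policy from the converged values (gamma = 1)
pi_star = {s: max(ACTIONS[s], key=lambda a: sum(p * (r + v_star[s1]) for p, s1, r in P[(s, a)]))
           for s in STATES if ACTIONS[s]}
print(v_star)   # {'CLASS1': 6.0, 'CLASS2': 8.0, 'CLASS3': 10.0, 'FACEBOOK': 6.0, 'SLEEP': 0.0}
print(pi_star)  # {'CLASS1': 'study', 'CLASS2': 'study', 'CLASS3': 'study', 'FACEBOOK': 'quit'}
```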
<p>Value iteration combines (truncated) policy evaluation with policy improvement in a single step: each state value is updated with the expected reward plus the discounted values of the successor states under the greedy action, i.e. the action that maximizes the right hand side of (\ref{value-iteration}).</p>
<p>Applying value iteration to our student model, we arrive at the following optimal state value function, with the optimal policy delineated by red arrows (see <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/student_MDP_dynamic_programming_solutions.ipynb">jupyter notebook</a>):</p>
<p><img alt="student MDP optimal policy and value function" src="https://efavdb.com/images/student_mdp_optimal_policy.png"></p>
<h2 id="summary">Summary</h2>
<p>We’ve discussed how to solve for (a) the value functions of an arbitrary policy and (b) the optimal value functions and optimal policy. Solving (a) involves turning the Bellman expectation equations into an update, whereas (b) involves turning the Bellman optimality equations into an update. Both algorithms are guaranteed to converge (see [1] for notes on how the contraction mapping theorem guarantees this).</p>
<p>You can see the application of both policy evaluation and value iteration to the student model problem in this <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/student_MDP_dynamic_programming_solutions.ipynb">jupyter notebook</a>.</p>
<h2 id="references"><a name="References">References</a></h2>
<p>[1] David Silver’s <span class="caps">RL</span> Course Lecture 3 - Planning by Dynamic Programming (<a href="https://www.youtube.com/watch?v=Nd1-UUMVfz4">video</a>,
<a href="https://www.davidsilver.uk/wp-content/uploads/2020/03/DP.pdf">slides</a>)</p>
<p>[2] Sutton and Barto -
<a href="http://incompleteideas.net/book/RLbook2018.pdf">Reinforcement Learning: An Introduction</a> - Chapter 4: Dynamic Programming</p>
<p>[3] Denny Britz’s <a href="https://github.com/dennybritz/reinforcement-learning/tree/master/DP">notes</a> on <span class="caps">RL</span> and <span class="caps">DP</span>, including crisp implementations in code of policy evaluation, policy iteration, and value iteration for the gridworld example discussed in [2].</p>
<p>[4] Deep <span class="caps">RL</span> Bootcamp Lecture 1: Motivation + Overview + Exact Solution Methods, by Pieter Abbeel (<a href="https://www.youtube.com/watch?v=qaMdN6LS9rA">video</a>, <a href="https://drive.google.com/open?id=0BxXI_RttTZAhVXBlMUVkQ1BVVDQ">slides</a>) - a very compressed intro.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Introduction to reinforcement learning by example2020-03-11T12:00:00-07:002020-03-11T12:00:00-07:00Cathy Yehtag:efavdb.com,2020-03-11:/intro-rl-toy-example<p>We take a top-down approach to introducing reinforcement learning (<span class="caps">RL</span>) by starting with a toy example: a student going through college. In order to frame the problem from the <span class="caps">RL</span> point-of-view, we’ll walk through the following steps:</p>
<ul>
<li><strong>Setting up a model of the problem</strong> as a Markov Decision Process …</li></ul><p>We take a top-down approach to introducing reinforcement learning (<span class="caps">RL</span>) by starting with a toy example: a student going through college. In order to frame the problem from the <span class="caps">RL</span> point-of-view, we’ll walk through the following steps:</p>
<ul>
<li><strong>Setting up a model of the problem</strong> as a Markov Decision Process, the framework that underpins the <span class="caps">RL</span> approach to sequential decision-making problems</li>
<li><strong>Deciding on an objective</strong>: maximize rewards</li>
<li><strong>Writing down an equation whose solution is our objective</strong>: Bellman equations</li>
</ul>
<p>David Silver walks through this example in his <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html">lecture notes</a> on <span class="caps">RL</span>, but as far as we can tell, does not provide code, so we’re sharing our implementation, comprising:</p>
<ul>
<li>the student’s college <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/discrete_limit_env.py">environment</a> using the OpenAI gym package.</li>
<li>a <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/student_MDP.ipynb">jupyter notebook</a> sampling from the model</li>
</ul>
<h2 id="student-in-toy-college">Student in toy college</h2>
<p>We model the student as an agent in a college environment who can move between five states: <span class="caps">CLASS</span> 1, 2, 3, the <span class="caps">FACEBOOK</span> state, and <span class="caps">SLEEP</span> state. The states are represented by the four circles and square. The <span class="caps">SLEEP</span> state — the square with no outward bound arrows — is a terminal state, i.e. once a student reaches that state, her journey is finished.</p>
<p><img alt="student MDP" src="https://efavdb.com/images/student_mdp.png"></p>
<p>Actions that a student can take in her current state are labeled in red (facebook/quit/study/sleep/pub) and influence which state she’ll find herself in next.</p>
<p>In this model, most state transitions are deterministic functions of the action in the current state, e.g. if she decides to study in <span class="caps">CLASS</span> 1, then she’ll definitely advance to <span class="caps">CLASS</span> 2. The single non-deterministic state transition is if she goes pubbing while in <span class="caps">CLASS</span> 3, where the pubbing action is indicated by a solid dot; she can end up in <span class="caps">CLASS</span> 1, 2 or back in 3 with probability 0.2, 0.4, or 0.4, respectively, depending on how reckless the pubbing was.</p>
<p>The model also specifies the reward <span class="math">\(R\)</span> associated with acting in one state and ending up in the next. In this example, the dynamics, <span class="math">\(p(s',r|s,a)\)</span>, are given to us, i.e. we have a full model of the environment, and, hopefully, the rewards have been designed to capture the actual end goal of the student.</p>
<h2 id="markov-decision-process">Markov Decision Process</h2>
<p>Formally, we’ve modeled the student’s college experience as a finite Markov Decision Process (<span class="caps">MDP</span>). The dynamics are Markov because the probability of ending up in the next state depends only on the current state and action, not on any history leading up to the current state. The Markov property is integral to the simplification of the equations that describe the model, which we’ll see in a bit.</p>
<p>The components of an <span class="caps">MDP</span> are:</p>
<ul>
<li><span class="math">\(S\)</span> - the set of possible states</li>
<li><span class="math">\(R\)</span> - the set of (scalar) rewards</li>
<li><span class="math">\(A\)</span> - the set of possible actions in each state</li>
</ul>
<p>The dynamics of the system are described by the probabilities of receiving a reward in the next state given the current state and action taken, <span class="math">\(p(s',r|s,a)\)</span>. In this example, the <span class="caps">MDP</span> is finite because there are a finite number of states, rewards, and actions.</p>
<p>The student’s agency in this environment comes from how she decides to act in each state. The mapping of a state to actions is the <strong>policy</strong>, <span class="math">\(\pi(a|s) := p(a|s)\)</span>, and can be a deterministic or stochastic function of her state.</p>
<p>Suppose we have an indifferent student who always chooses actions randomly. We can sample from the <span class="caps">MDP</span> to get some example trajectories the student might experience with this policy. In the sample trajectories below, the states are enclosed in parentheses <code>(STATE)</code>, and actions enclosed in square brackets <code>[action]</code>.</p>
<p><strong>Sample trajectories</strong>:</p>
<div class="highlight"><pre><span></span><span class="p">(</span><span class="n">CLASS1</span><span class="p">)</span><span class="c1">--[facebook]-->(FACEBOOK)--[facebook]-->(FACEBOOK)--[facebook]-->(FACEBOOK)--[facebook]-->(FACEBOOK)--[quit]-->(CLASS1)--[facebook]-->(FACEBOOK)--[quit]-->(CLASS1)--[study]-->(CLASS2)--[sleep]-->(SLEEP)</span>
<span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="c1">--[quit]-->(CLASS1)--[study]-->(CLASS2)--[study]-->(CLASS3)--[study]-->(SLEEP)</span>
<span class="p">(</span><span class="n">SLEEP</span><span class="p">)</span><span class="w"></span>
<span class="p">(</span><span class="n">CLASS1</span><span class="p">)</span><span class="c1">--[facebook]-->(FACEBOOK)--[quit]-->(CLASS1)--[study]-->(CLASS2)--[sleep]-->(SLEEP)</span>
<span class="p">(</span><span class="n">FACEBOOK</span><span class="p">)</span><span class="c1">--[facebook]-->(FACEBOOK)--[facebook]-->(FACEBOOK)--[facebook]-->(FACEBOOK)--[facebook]-->(FACEBOOK)--[quit]-->(CLASS1)--[facebook]-->(FACEBOOK)--[quit]-->(CLASS1)--[study]-->(CLASS2)--[study]-->(CLASS3)--[pub]-->(CLASS2)--[study]-->(CLASS3)--[study]-->(SLEEP)</span>
</pre></div>
<p><strong>Rewards following a random policy</strong>:</p>
<p>Under this random policy, what total reward would the student expect when starting from any of the states? We can estimate the expected rewards by summing up the rewards per trajectory and plotting the distributions of total rewards per starting state:</p>
<p><img alt="histogram of sampled returns" src="https://efavdb.com/images/intro_rl_histogram_sampled_returns.png"></p>
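<p>As a sketch, such estimates can be generated by rolling out episodes against the model under the random policy. The encoding here, (state, action) mapped to (probability, next state, reward) triples, and the reward values are our assumptions, taken from Silver’s lecture example:</p>

```python
import random

# Student MDP (rewards assumed from Silver's lecture example).
P = {
    ("CLASS1", "study"):      [(1.0, "CLASS2", -2)],
    ("CLASS1", "facebook"):   [(1.0, "FACEBOOK", -1)],
    ("CLASS2", "study"):      [(1.0, "CLASS3", -2)],
    ("CLASS2", "sleep"):      [(1.0, "SLEEP", 0)],
    ("CLASS3", "study"):      [(1.0, "SLEEP", 10)],
    ("CLASS3", "pub"):        [(0.2, "CLASS1", 1), (0.4, "CLASS2", 1), (0.4, "CLASS3", 1)],
    ("FACEBOOK", "facebook"): [(1.0, "FACEBOOK", -1)],
    ("FACEBOOK", "quit"):     [(1.0, "CLASS1", 0)],
}
ACTIONS = {s: [a for (s0, a) in P if s0 == s]
           for s in ["CLASS1", "CLASS2", "CLASS3", "FACEBOOK", "SLEEP"]}

def rollout(start):
    """One episode under the uniform-random policy; returns the sum of rewards."""
    s, total = start, 0
    while ACTIONS[s]:  # SLEEP has no actions, so it ends the episode
        a = random.choice(ACTIONS[s])
        _, s, r = random.choices(P[(s, a)], weights=[t[0] for t in P[(s, a)]])[0]
        total += r
    return total

random.seed(0)
returns = [rollout("CLASS1") for _ in range(20000)]
print(sum(returns) / len(returns))  # sample mean lands near -1.3
```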
<h2 id="maximizing-rewards-discounted-return-and-value-functions">Maximizing rewards: discounted return and value functions</h2>
<p>We’ve just seen how we can estimate rewards starting from each state given a random policy. Next, we’ll formalize our goal in terms of maximizing returns.</p>
<h3 id="returns">Returns</h3>
<p>We simply summed the rewards from the sample trajectories above, but the quantity we often want to maximize in practice is the <strong>discounted return <span class="math">\(G_t\)</span></strong>, which is a sum of the weighted rewards:</p>
<div class="math">\begin{eqnarray}\label{return} \tag{1}
G_t := R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}
\end{eqnarray}</div>
<p>where <span class="math">\(0 \leq \gamma \leq 1\)</span>. <span class="math">\(\gamma\)</span> is the <em>discount rate</em> which characterizes how much we weight rewards now vs. later. Discounting is mathematically useful for avoiding infinite returns in MDPs without a terminal state and allows us to account for uncertainty in the future when we don’t have a perfect model of the environment.</p>
<p><strong>Aside</strong></p>
<p>The discount factor introduces a time scale since it says that we don’t care about rewards that are far in the future. The half-life (actually, the <span class="math">\(1/e\)</span> life) of a reward in units of time steps is <span class="math">\(1/(1-\gamma)\)</span>, which comes from a definition of <span class="math">\(1/e\)</span>:</p>
<div class="math">\begin{align}
\frac{1}{e} = \lim_{n \rightarrow \infty} \left(1 - \frac{1}{n} \right)^n
\end{align}</div>
<p><span class="math">\(\gamma = 0.99\)</span> is often used in practice, which corresponds to a <span class="math">\(1/e\)</span>-life of 100 timesteps since <span class="math">\(0.99^{100} = (1 - 1/100)^{100} \approx 1/e\)</span>.</p>
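<p>This timescale claim is easy to check numerically; a small sketch (the helper name is ours):</p>

```python
import math

def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k R_{t+k+1} for a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1, 1, 1], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75

# gamma = 0.99: a reward 1/(1 - 0.99) = 100 steps away is discounted to ~1/e
print(round(0.99 ** 100, 3), round(1 / math.e, 3))  # 0.366 0.368
```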
<h3 id="value-functions">Value functions</h3>
<p>Earlier, we were able to estimate the expected undiscounted returns starting from each state by sampling from the <span class="caps">MDP</span> under a random policy. Value functions formalize this notion of the “goodness” of being in a state.</p>
<h4 id="state-value-function-v">State value function <span class="math">\(v\)</span></h4>
<p>The <strong>state value function</strong> <span class="math">\(v_{\pi}(s)\)</span> is the expected return when starting in state <span class="math">\(s\)</span>, following policy <span class="math">\(\pi\)</span>.</p>
<div class="math">\begin{eqnarray}\label{state-value} \tag{2}
v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]
\end{eqnarray}</div>
<p>The state value function can be written as a recursive relationship, the Bellman expectation equation, which expresses the value of a state in terms of the values of its neighbors by making use of the Markov property.</p>
<div class="math">\begin{eqnarray}\label{state-value-bellman} \tag{3}
v_{\pi}(s) &=& \mathbb{E}_{\pi}[G_t | S_t = s] \\
&=& \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\
&=& \sum_{a} \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi}(s') ]
\end{eqnarray}</div>
<p>This equation expresses the value of a state as an average over the discounted value of its neighbor / successor states, plus the expected reward transitioning from <span class="math">\(s\)</span> to <span class="math">\(s'\)</span>, and <span class="math">\(v_{\pi}\)</span> is the unique<a href="#unique">*</a> solution. The distribution of rewards depends on the student’s policy since her actions influence her future rewards.</p>
<p><em>Note on terminology</em>:
Policy <em>evaluation</em> uses the Bellman expectation equation to solve for the value function given a policy <span class="math">\(\pi\)</span> and environment dynamics <span class="math">\(p(s', r | s, a)\)</span>. This is different from policy iteration and value iteration, which are concerned with finding an optimal policy.</p>
<p>We can solve the Bellman equation for the value function as an alternative to the sampling we did earlier for the student toy example. Since the problem has a small number of states and actions, and we have full knowledge of the environment, an exact solution is feasible by directly solving the system of linear equations or iteratively using dynamic programming. Here is the solution to (\ref{state-value-bellman}) for <span class="math">\(v\)</span> under a random policy in the student example (compare to the sample means in the histogram of returns):</p>
<p><img alt="student MDP value function random policy" src="https://efavdb.com/images/student_mdp_values_random_policy.png"></p>
<p>We can verify that the solution is self-consistent by spot checking the value of a state in terms of the values of its neighboring states according to the Bellman equation, e.g. the <span class="caps">CLASS1</span> state with <span class="math">\(v_{\pi}(\text{CLASS1}) = -1.3\)</span>:</p>
<div class="math">$$
v_{\pi}(\text{CLASS1}) = 0.5 \left[-2 + 2.7\right] + 0.5 \left[-1 - 2.3\right] = -1.3
$$</div>
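<p>The direct linear-algebra route mentioned above can also be sketched with numpy: restrict the system to the non-terminal states (ordered <span class="caps">CLASS</span> 1, 2, 3, <span class="caps">FACEBOOK</span>), drop transitions into <span class="caps">SLEEP</span> since its value is zero, and solve <span class="math">\((I - \gamma P_\pi) v = r_\pi\)</span>. The reward values baked into <code>r_pi</code> are assumed from Silver’s lecture example:</p>

```python
import numpy as np

# expected one-step reward under the uniform-random policy, per state
r_pi = np.array([0.5 * (-2) + 0.5 * (-1),  # CLASS1: study / facebook
                 0.5 * (-2) + 0.5 * 0,     # CLASS2: study / sleep
                 0.5 * 10 + 0.5 * 1,       # CLASS3: study / pub
                 0.5 * (-1) + 0.5 * 0])    # FACEBOOK: facebook / quit
# transition matrix under the random policy (SLEEP column dropped)
P_pi = np.array([[0.0, 0.5, 0.0, 0.5],
                 [0.0, 0.0, 0.5, 0.0],
                 [0.1, 0.2, 0.2, 0.0],
                 [0.5, 0.0, 0.0, 0.5]])
gamma = 1.0
v = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v.round(1))  # -1.3, 2.7, 7.4, -2.3, matching the figure
```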
<h4 id="action-value-function-q">Action value function <span class="math">\(q\)</span></h4>
<p>Another value function is the action value function <span class="math">\(q_{\pi}(s, a)\)</span>, which is the expected return from a state <span class="math">\(s\)</span> if we follow a policy <span class="math">\(\pi\)</span> after taking an action <span class="math">\(a\)</span>:</p>
<div class="math">\begin{eqnarray}\label{action-value} \tag{4}
q_{\pi}(s, a) := \mathbb{E}_{\pi} [ G_t | S_t = s, A_t = a ]
\end{eqnarray}</div>
<p>We can also write <span class="math">\(v\)</span> and <span class="math">\(q\)</span> in terms of each other. For example, the state value function can be viewed as an average over the action value functions for that state, weighted by the probability of taking each action, <span class="math">\(\pi\)</span>, from that state:</p>
<div class="math">\begin{eqnarray}\label{state-value-one-step-backup} \tag{5}
v_{\pi}(s) = \sum_{a} \pi(a|s) q_{\pi}(s, a)
\end{eqnarray}</div>
<p>Rewriting <span class="math">\(v\)</span> in terms of <span class="math">\(q\)</span> in (\ref{state-value-one-step-backup}) is useful later for thinking about the “advantage”, <span class="math">\(A(s,a)\)</span>, of taking an action in a state, namely how much better is an action in that state than the average?</p>
<div class="math">\begin{align}
A(s,a) \equiv q(s,a) - v(s)
\end{align}</div>
<hr>
<p><strong>Why <span class="math">\(q\)</span> in addition to <span class="math">\(v\)</span>?</strong></p>
<p>Looking ahead, we almost never have access to the environment dynamics in real world problems, but solving for <span class="math">\(q\)</span> instead of <span class="math">\(v\)</span> lets us get around this problem; we can figure out the best action to take in a state solely using <span class="math">\(q\)</span> (we further expand on this in our <a href="#optimalq">discussion</a> below on the Bellman optimality equation for <span class="math">\(q_*\)</span>).</p>
<p>A concrete example of using <span class="math">\(q\)</span> is provided in our <a href="https://efavdb.com/multiarmed-bandits">post</a> on multiarmed bandits (an example of a simple single-state <span class="caps">MDP</span>), which discusses agents/algorithms that don’t have access to the true environment dynamics. The strategy amounts to estimating the action value function of the slot machine and using those estimates to inform which slot machine arms to pull in order to maximize rewards.</p>
<hr>
<h2 id="optimal-value-and-policy">Optimal value and policy</h2>
<p>The crux of the <span class="caps">RL</span> problem is finding a policy that maximizes the expected return. A policy <span class="math">\(\pi\)</span> is defined to be at least as good as another policy <span class="math">\(\pi'\)</span> if <span class="math">\(v_{\pi}(s) \geq v_{\pi'}(s)\)</span> for all states. We are guaranteed<a href="#unique">*</a> an optimal state value function <span class="math">\(v_*\)</span> which corresponds to one or more optimal policies <span class="math">\(\pi_*\)</span>.</p>
<p>Recall that the value function for an arbitrary policy can be written in terms of an average over the action values for that state (\ref{state-value-one-step-backup}). In contrast, the optimal value function <span class="math">\(v_*\)</span> must be consistent with following a policy that selects the action that maximizes the action value functions from a state, i.e. taking a <span class="math">\(\max\)</span> (\ref{state-value-bellman-optimality}) instead of an average (\ref{state-value-one-step-backup}) over <span class="math">\(q\)</span>, leading to the <strong>Bellman optimality equation</strong> for <span class="math">\(v_*\)</span>:</p>
<div class="math">\begin{eqnarray}\label{state-value-bellman-optimality} \tag{6}
v_*(s) &=& \max_a q_{\pi_*}(s, a) \\
&=& \max_a \mathbb{E}_{\pi_*} [R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a] \\
&=& \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s') ]
\end{eqnarray}</div>
<p>The optimal policy immediately follows: take the action in a state that maximizes the right hand side of (\ref{state-value-bellman-optimality}). The <a href="https://en.wikipedia.org/wiki/Bellman_equation#Bellman's_Principle_of_Optimality">principle of optimality</a>, which applies to the Bellman optimality equation, means that this greedy policy actually corresponds to the optimal policy! Note: Unlike the Bellman expectation equations, the Bellman optimality equations are a nonlinear system of equations due to taking the max.</p>
<p>The Bellman optimality equation for the action value function <span class="math">\(q_*(s,a)\)</span><a name="optimalq"></a> is:</p>
<div class="math">\begin{eqnarray}\label{action-value-bellman-optimality} \tag{7}
q_*(s, a) &=& \mathbb{E}_{\pi_*} [R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a] \\
&=& \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a') ]
\end{eqnarray}</div>
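<p>Equation (\ref{action-value-bellman-optimality}) can itself be turned into an update, a <span class="math">\(q\)</span>-flavored value iteration. A sketch for the student <span class="caps">MDP</span> follows (dynamics encoded as (state, action) mapped to (probability, next state, reward) triples; reward values assumed from Silver’s lecture example). Note the last step: once <span class="math">\(q_*\)</span> is in hand, acting greedily requires no model at all:</p>

```python
# Student MDP (rewards assumed from Silver's lecture example).
P = {
    ("CLASS1", "study"):      [(1.0, "CLASS2", -2)],
    ("CLASS1", "facebook"):   [(1.0, "FACEBOOK", -1)],
    ("CLASS2", "study"):      [(1.0, "CLASS3", -2)],
    ("CLASS2", "sleep"):      [(1.0, "SLEEP", 0)],
    ("CLASS3", "study"):      [(1.0, "SLEEP", 10)],
    ("CLASS3", "pub"):        [(0.2, "CLASS1", 1), (0.4, "CLASS2", 1), (0.4, "CLASS3", 1)],
    ("FACEBOOK", "facebook"): [(1.0, "FACEBOOK", -1)],
    ("FACEBOOK", "quit"):     [(1.0, "CLASS1", 0)],
}
STATES = ["CLASS1", "CLASS2", "CLASS3", "FACEBOOK", "SLEEP"]
ACTIONS = {s: [a for (s0, a) in P if s0 == s] for s in STATES}

def q_value_iteration(gamma=1.0, tol=1e-10):
    """Iterate q(s,a) <- sum_{s',r} p(s',r|s,a) [r + gamma * max_a' q(s',a')]."""
    q = {sa: 0.0 for sa in P}
    while True:
        delta = 0.0
        for (s, a) in P:
            new = sum(p * (r + gamma * max((q[(s1, a1)] for a1 in ACTIONS[s1]), default=0.0))
                      for p, s1, r in P[(s, a)])
            delta, q[(s, a)] = max(delta, abs(new - q[(s, a)])), new
        if delta < tol:
            return q

q_star = q_value_iteration()
# acting greedily uses q_star alone -- no dynamics needed at decision time
pi_star = {s: max(ACTIONS[s], key=lambda a: q_star[(s, a)])
           for s in STATES if ACTIONS[s]}
print(pi_star)  # {'CLASS1': 'study', 'CLASS2': 'study', 'CLASS3': 'study', 'FACEBOOK': 'quit'}
```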
<hr>
<p>Looking ahead: In practice, without knowledge of the environment dynamics, <span class="caps">RL</span> algorithms based on solving value functions can approximate the expectation in (\ref{action-value-bellman-optimality}) by sampling, i.e. interacting with the environment, and iteratively selecting the action that maximizes <span class="math">\(q\)</span> in each state the agent lands in along its trajectory. This works because the maximum occurs <strong>inside</strong> the summation in (\ref{action-value-bellman-optimality}). In contrast, this sampling approach doesn’t work for (\ref{state-value-bellman-optimality}) because the maximum sits <strong>outside</strong> the summation there; that’s why action value functions are so useful when we lack a model of the environment!</p>
<hr>
<p>Here is the optimal state value function and policy for the student example, which we solve for in a later post:</p>
<p><img alt="student MDP optimal value function" src="https://efavdb.com/images/student_mdp_optimal_values.png"></p>
<p>Comparing the two, the value of every state under the optimal policy exceeds its value under the random policy.</p>
<h2 id="summary">Summary</h2>
<p>We’ve discussed how the problem of sequential decision making can be framed as an <span class="caps">MDP</span> using the student toy <span class="caps">MDP</span> as an example. The goal in <span class="caps">RL</span> is to figure out a policy — what actions to take in each state — that maximizes our returns.</p>
<p>MDPs provide a framework for approaching the problem by defining the value of each state, the value functions, and using the value functions to define what a “best policy” means. The value functions are unique solutions to the Bellman equations, and the <span class="caps">MDP</span> is “solved” when we know the optimal value function.</p>
<p>Much of reinforcement learning centers around trying to solve these equations under different conditions, e.g. unknown environment dynamics and large — possibly continuous — states and/or action spaces that require approximations to the value functions.</p>
<p>We’ll discuss how we arrived at the solutions for this toy problem in a future post!</p>
<h3 id="example-code">Example code</h3>
<p>Code for sampling from the student environment under a random policy in order to generate the trajectories and histograms of returns is available in this <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/student_MDP.ipynb">jupyter notebook</a>.</p>
<p>The <a href="https://github.com/frangipane/reinforcement-learning/blob/master/02-dynamic-programming/discrete_limit_env.py">code</a> for the student environment creates an environment with an <span class="caps">API</span> that is compatible with OpenAI gym — specifically, it is derived from the <code>gym.envs.toy_text.DiscreteEnv</code> environment.</p>
<p><a name="unique">*</a>The uniqueness of the solution to the Bellman equations for finite MDPs is stated without proof in Ref [2], but Ref [1] motivates it briefly via the <em>contraction mapping theorem</em>.</p>
<h2 id="references">References</h2>
<p>[1] David Silver’s <span class="caps">RL</span> Course Lecture 2 - (<a href="https://www.youtube.com/watch?v=lfHX2hHRMVQ">video</a>,
<a href="https://www.davidsilver.uk/wp-content/uploads/2020/03/MDP.pdf">slides</a>)</p>
<p>[2] Sutton and Barto -
<a href="http://incompleteideas.net/book/RLbook2018.pdf">Reinforcement Learning: An Introduction</a> - Chapter 3: Finite Markov Decision Processes</p>
A Framework for Studying Population Dynamics2020-03-08T00:00:00-08:002020-03-08T00:00:00-08:00Dustin McIntoshtag:efavdb.com,2020-03-08:/world-wandering-dudes<p>In this post, I want to briefly introduce a new side project for the blog with applications to understanding population dynamics, natural selection, game theory, and probably more.</p>
<p><a href="https://github.com/dustinmcintosh/world_wandering_dudes">World Wandering Dudes</a> is a simulation framework in which you initiate a “world” which consists of a “field” and a set of …</p><p>In this post, I want to briefly introduce a new side project for the blog with applications to understanding population dynamics, natural selection, game theory, and probably more.</p>
<p><a href="https://github.com/dustinmcintosh/world_wandering_dudes">World Wandering Dudes</a> is a simulation framework in which you initiate a “world” which consists of a “field” and a set of “creatures” (dudes). The field has food on it. Each day, the creatures run around gathering the food which they need to survive and reproduce.</p>
<h3 id="example">Example</h3>
<p>Here’s an example of a few days passing in a world where food randomly sprouts each day, never spoiling, initiated with a single creature (of particular note: after day 1 passes and there are two creatures, one of them doesn’t store enough food to reproduce at the end of the day):
<img alt="" src="https://efavdb.com/images/the_first_days.gif"></p>
<p>Taking a snapshot of the world at the end of each day for the first 20 or so days, you can see the creatures take over the full field before settling into a general equilibrium state.
<img alt="" src="https://efavdb.com/images/each_day.gif"></p>
<h3 id="how-the-world-works">How the world works</h3>
<p>A. Each day consists of a number of discrete time steps. During each time step, the creatures move around the field randomly; if they find food, they grab it from the field and store it.</p>
<p>B. At the end of the day, a few things happen:</p>
<ol>
<li>
<p>Each creature must eat some food. If they don’t have enough stored, they die.</p>
</li>
<li>
<p>If they have enough food after eating, they may also reproduce. Offspring may have mutated properties: for example, they may move a little faster each time step (speedy creatures) or require less food (efficient creatures).</p>
</li>
<li>
<p>The food may spoil throughout the world (or not) and new food may sprout on the field.</p>
</li>
</ol>
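<p>The daily cycle above can be sketched in miniature. This is a simplified toy model, not the repo’s actual implementation; names like <code>sprout_rate</code> and <code>meal_cost</code> are invented for illustration:</p>

```python
import random

def run_day(creatures, field_food, steps=10, sprout_rate=5, meal_cost=1):
    """One simplified day: creatures wander and gather, then eat, die, or reproduce."""
    # A. time steps: each creature has some chance of landing on food per step
    for _ in range(steps):
        for c in creatures:
            if field_food > 0 and random.random() < 0.3:
                field_food -= 1
                c["stored"] += 1
    # B1/B2. end of day: eat or starve; reproduce if enough food is left over
    survivors = []
    for c in creatures:
        if c["stored"] < meal_cost:
            continue  # starved
        c["stored"] -= meal_cost
        survivors.append(c)
        if c["stored"] >= meal_cost:  # enough left over to reproduce
            c["stored"] -= meal_cost
            survivors.append({"stored": 0})
    # B3. new food sprouts on the field
    field_food += sprout_rate
    return survivors, field_food

creatures, food = [{"stored": 0}], 20
for day in range(5):
    creatures, food = run_day(creatures, food)
```

<p>Mutation and spoilage are omitted here; the point is only the eat/die/reproduce bookkeeping at the end of each day.</p>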
<h3 id="examining-the-historical-record">Examining the historical record</h3>
<p>You can also look at the historical record for the field and examine some metrics, including total creature count, birth/death rate, the mutation composition of the creatures, amount of stored food, amount of ungathered food, and more:
<img alt="" src="https://efavdb.com/images/example_history.png">
Some phenomenological notes on this particular case (more details on the math behind some of this in future posts):</p>
<ul>
<li>The dynamics of the world are stochastic. For example, sometimes the first creature doesn’t find any food and dies immediately.</li>
<li>The population initially grows roughly exponentially as food becomes plentiful across the map.</li>
<li>With the accumulated food on the field from the initial low-population days, the creatures grow in numbers beyond a sustainable population and a period of starvation and population culling follows. :(</li>
<li>The population reaches an equilibrium at which the number of creatures is nearly the same as the amount of food sprouted each day (it’s not exactly equal!).</li>
<li>At equilibrium, the rate at which creatures are being born is equal to the rate at which they die (on average) and both appear to be about a third of the total population (it’s not a third!).</li>
<li>As mentioned above, upon reproduction the creatures will mutate and the fitter creatures may take over the world. In this particular case, efficient creatures come about first and quickly take over the population. The world can actually sustain a higher population of efficient vs normal/speedy creatures, so the total population increases accordingly. Shortly thereafter, a few speedy creatures start to show up and they, slowly, take over the world, out-competing the efficient creatures and slowly suppressing the overall population.</li>
</ul>
<p>More to come on extensions of this project and understanding the math behind it in the future.</p>
<h3 id="check-it-out-yourself">Check it out yourself</h3>
<p>The github repository is <a href="https://github.com/dustinmcintosh/world_wandering_dudes">here</a>.</p>
<p>You’ll need a handful of the usual Python packages for data science.</p>
<div class="highlight"><pre><span></span>git clone https://github.com/dustinmcintosh/world_wandering_dudes
<span class="nb">cd</span> world_wandering_dudes
</pre></div>
<p>Update the directory for saving figures in <code>SET_ME.py</code> if you’d like to store them somewhere special.</p>
<p>Run the sample code:</p>
<div class="highlight"><pre><span></span>python scripts/basic_simulation.py
</pre></div>
<p>You can recycle the same world again using:</p>
<div class="highlight"><pre><span></span>python scripts/basic_simulation.py -wp my_world.pkl
</pre></div>Multiarmed bandits in the context of reinforcement learning2020-02-25T12:00:00-08:002020-02-25T12:00:00-08:00Cathy Yehtag:efavdb.com,2020-02-25:/multiarmed-bandits<p><a href="http://incompleteideas.net/book/RLbook2018.pdf">Reinforcement Learning: An Introduction</a> by Sutton and Barto[1] is a book that is universally recommended to beginners in their <span class="caps">RL</span> studies. The first chapter is an extended text-heavy introduction. The second chapter deals with multiarmed bandits, i.e. slot machines with multiple arms, and is the subject of today …</p><p><a href="http://incompleteideas.net/book/RLbook2018.pdf">Reinforcement Learning: An Introduction</a> by Sutton and Barto[1] is a book that is universally recommended to beginners in their <span class="caps">RL</span> studies. The first chapter is an extended text-heavy introduction. The second chapter deals with multiarmed bandits, i.e. slot machines with multiple arms, and is the subject of today’s post.</p>
<p>Before getting into the <em>what</em> and <em>how</em> of bandits, I’d like to address the <strong>why</strong>, since the “why” can guard against getting lost in the details / not seeing the forest for the trees.</p>
<h1 id="why-discuss-multiarmed-bandits">Why discuss multiarmed bandits?</h1>
<p><span class="caps">RL</span> treats the problem of trying to achieve a goal in an environment where an agent is <em>not</em> instructed about which actions to take to achieve that goal, in contrast to supervised learning problems. Learning the best actions to take is a complicated problem, since the best actions depend on what state an agent is in, e.g. an agent trying to get to a goalpost east of its current location as quickly as possible may find that moving east is a generally good policy, but not if there is a fire-breathing dragon in the way, in which case, it might make sense to move up or down to navigate around the obstacle.</p>
<p>Multiarmed bandits are a simpler problem: a single-state system. No matter which action an agent takes, i.e. which slot machine arm the agent pulls, the agent ends up back in the same state; the distribution of rewards as a consequence of the agent’s action remains the same, assuming a stationary distribution of rewards, and actions have no effect on subsequent states or rewards. This simple case study is useful for building intuition and introducing <span class="caps">RL</span> concepts that will be expanded on in later chapters of [1].</p>
<h1 id="key-rl-concepts-introduced-by-the-multiarmed-bandit-problem">Key <span class="caps">RL</span> concepts introduced by the multiarmed bandit problem</h1>
<h2 id="the-nature-of-the-problem">The nature of the problem</h2>
<p><strong>Agent has a goal</strong>: In <span class="caps">RL</span> and multiarmed bandit problems, we want to figure out the strategy, or “policy” in <span class="caps">RL</span> lingo, that will maximize our rewards. For the simple bandit problem, this goal is equivalent to maximizing the reward — literally, money! — for each arm pull.</p>
<p><strong>Unlike supervised learning, no ground truth is supplied</strong>: Each slot has a different distribution of rewards, but the agent playing the machine does not know that distribution. Instead, the agent has to try different actions and evaluate how good the actions are. The goodness of an action is straightforwardly determined by its immediate reward in the bandit case.</p>
<p><strong>Exploration vs. exploitation</strong>: Based on a few trials, one arm may appear to yield the highest rewards, but the agent may decide to try others occasionally to improve its estimates of the rewards, an example of balancing exploration and exploitation. The various algorithms handle exploration vs. exploitation differently, but this example introduces one method that is simple but widely used in practice: the epsilon-greedy algorithm, which takes greedy actions most of the time (exploits) but takes random actions (explores) a fraction epsilon of the time.</p>
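<p>In code, the epsilon-greedy rule is just a coin flip. Here is a minimal standalone sketch of the idea (the post’s full <code>EpsilonGreedy</code> class appears later):</p>

```python
import random

def epsilon_greedy(est_values, epsilon=0.1):
    """Pick a random arm with probability epsilon, else the best-looking arm."""
    if random.random() < epsilon:
        return random.choice(list(est_values))       # explore
    return max(est_values, key=est_values.get)       # exploit

# toy estimates per arm; arm 1 currently looks best
estimates = {0: 0.5, 1: 1.2, 2: 0.9}
arm = epsilon_greedy(estimates)
```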
<h3 id="different-approaches-to-learning-a-policy">Different approaches to learning a policy</h3>
<p><strong>model-free</strong>: All the strategies discussed in [1] for solving the bandit problem are “model-free” strategies. In real-world applications, a model of the world is rarely available, and the agent has to figure out how to act from sampled experience. The same applies to the bandit case: even though bandits are a simpler single-state system (we don’t have to model transitions from state to state), an agent still does not know the model that generates the probability of a reward <span class="math">\(r\)</span> given an action <span class="math">\(a\)</span>, <span class="math">\(P(r|a)\)</span>, and has to figure that out by trial and error.</p>
<p>There <em>are</em> model-based algorithms that attempt to model the environment’s transition dynamics from data, but many popular algorithms today are model-free because of the difficulty of modeling those dynamics.</p>
<h4 id="learning-action-values">Learning action-values</h4>
<p>The bandit problem introduces the idea of estimating the expected value associated with each action, namely the <em>action-value function</em> in <span class="caps">RL</span> terms. The concept is very intuitive — as an agent pulls on different bandit arms, it will accumulate rewards associated with each arm. A simple way to estimate the expected value per arm is just to average the rewards generated by pulling on each slot. The policy that follows is then implicit, namely, take the action / pull on the arm with the highest estimated action-value!</p>
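<p>The sample average needn’t be recomputed from scratch after every pull; it can be maintained incrementally in constant time per update, as in the standard update from [1]:</p>

```python
def update_estimate(q, n, reward):
    """Incremental sample average: Q_n = Q_{n-1} + (R - Q_{n-1}) / n."""
    return q + (reward - q) / n

q, n = 0.0, 0
for reward in [1.0, 3.0, 2.0]:
    n += 1
    q = update_estimate(q, n, reward)
# q is now the mean of all rewards seen so far for this arm
```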
<p>Historically, <span class="caps">RL</span> formalism has dealt with estimating value functions and using them to figure out a policy, which includes the Q-Learning (“Q” stands for action-value!) approach we mentioned in our earlier <a href="https://efavdb.com/openai-scholars-intro">post</a>.</p>
<h4 id="learning-policies-directly">Learning policies directly</h4>
<p>[1] also uses the bandit problem to introduce a type of algorithm that approaches the problem, not indirectly by learning a value function and deriving the policy from it, but by parameterizing the policy directly and learning the parameters that optimize the rewards. This class of algorithm is a “policy gradient method” and is very popular today for its nice convergence properties. After the foreshadowing in the bandit problem, policy gradients only reappear very late in [1] — chapter 13!</p>
<p>We now provide code for concreteness.</p>
<h1 id="ground-truth-is-hidden-in-our-multiarmed-bandit">Ground truth is hidden in our multiarmed bandit</h1>
<p>The <code>Bandit</code> class initializes a multiarmed bandit. The distribution of rewards per arm follows a Gaussian distribution with some mean dollar amount.</p>
<div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">Bandit</span><span class="p">:</span>
<span class="sd">"""N-armed bandit with stationary distribution of rewards per arm.</span>
<span class="sd"> Each arm (action) is identified by an integer.</span>
<span class="sd"> """</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_arms</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">mu</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">sigma</span><span class="p">:</span> <span class="nb">float</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">n_arms</span> <span class="o">=</span> <span class="n">n_arms</span>
<span class="bp">self</span><span class="o">.</span><span class="n">std</span> <span class="o">=</span> <span class="n">sigma</span>
<span class="c1"># a dict of the mean action_value per arm, w/ each action_value sampled from a Gaussian</span>
<span class="bp">self</span><span class="o">.</span><span class="n">action_values</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="n">s</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">s</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">,</span> <span class="n">n_arms</span><span class="p">))}</span>
<span class="bp">self</span><span class="o">.</span><span class="n">actions</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">action_values</span><span class="o">.</span><span class="n">keys</span><span class="p">())</span> <span class="c1"># arms of the bandit</span>
<span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">action</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span>
<span class="sd">"""Get reward from bandit for action"""</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">action_values</span><span class="p">[</span><span class="n">action</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">std</span><span class="p">)</span>
</pre></div>
<p>Implementation detail: the means per arm, stored in <code>self.action_values</code>, are drawn from a Gaussian distribution upon initialization.</p>
<p>The agent doesn’t know the true mean rewards per arm — it only sees a sample reward when it takes the action of pulling on a particular bandit arm (<code>__call__</code>).</p>
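<p>To see that the hidden mean is recoverable only through sampling, average many pulls of a single arm. This standalone check uses the same Gaussian setup as the <code>Bandit</code> class above:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma = 1.5, 1.0                        # hidden from the agent
pulls = rng.normal(true_mean, sigma, size=10_000)  # 10,000 pulls of one arm
estimate = pulls.mean()                            # the agent's sample-average estimate
# the estimate converges to true_mean at rate sigma / sqrt(n)
```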
<h1 id="action-reward-update-strategy">Action, reward, update strategy</h1>
<p>For every action the agent takes, it gets a reward. With each additional interaction with the bandit, the agent has a new data point it can use to update its strategy (whether indirectly, via an updated action-value estimate, or directly in the policy gradient).</p>
<div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">BaseBanditAlgo</span><span class="p">(</span><span class="n">ABC</span><span class="p">):</span>
<span class="sd">"""Base class for algorithms to maximize the rewards </span>
<span class="sd"> for the multiarmed bandit problem"""</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">bandit</span><span class="p">:</span> <span class="n">Bandit</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">bandit</span> <span class="o">=</span> <span class="n">bandit</span>
<span class="bp">self</span><span class="o">.</span><span class="n">timestep</span> <span class="o">=</span> <span class="mi">0</span>
<span class="bp">self</span><span class="o">.</span><span class="n">rewards</span> <span class="o">=</span> <span class="p">[]</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">_select_action</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
<span class="k">pass</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">_update_for_action_and_reward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">action</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">reward</span><span class="p">:</span> <span class="nb">float</span><span class="p">):</span>
<span class="k">pass</span>
<span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span>
<span class="n">action</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_select_action</span><span class="p">()</span>
<span class="n">reward</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">bandit</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_update_for_action_and_reward</span><span class="p">(</span><span class="n">action</span><span class="p">,</span> <span class="n">reward</span><span class="p">)</span>
<span class="k">return</span> <span class="n">reward</span>
<span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_timesteps</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_timesteps</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">timestep</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="bp">self</span><span class="o">.</span><span class="n">rewards</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">run</span><span class="p">())</span>
</pre></div>
<h2 id="two-types-of-strategies-value-based-and-policy-based">Two types of strategies: value based and policy based</h2>
<ol>
<li>value based: agents try to estimate the value of each action
directly; their policies (i.e., the probability of selecting an
action) are therefore implicit, since the agent will choose
the action with the highest estimated value</li>
<li>policy based: agents don’t try to estimate the value
of an action and instead store the policy directly, i.e., the
probability of taking each action.</li>
</ol>
<p>An example of a <strong>value based</strong> strategy / action-value method for the
bandit problem is the <code>EpsilonGreedy</code> approach, which selects the
action associated with the highest estimated action-value with probability <span class="math">\(1-\epsilon\)</span>, but chooses a random arm
a fraction <span class="math">\(\epsilon\)</span> of the time as part of its exploration strategy.</p>
<div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">EpsilonGreedy</span><span class="p">(</span><span class="n">BaseEstimateActionValueAlgo</span><span class="p">):</span>
<span class="sd">"""Greedy algorithm that explores/samples from the non-greedy action some fraction, </span>
<span class="sd"> epsilon, of the time.</span>
<span class="sd"> - For a basic greedy algorithm, set epsilon = 0.</span>
<span class="sd"> - For optimistic intialization, set q_init > mu, the mean of the Gaussian from</span>
<span class="sd"> which the real values per bandit arm are sampled (default is 0).</span>
<span class="sd"> """</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">bandit</span><span class="p">:</span> <span class="n">Bandit</span><span class="p">,</span> <span class="n">epsilon</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="n">bandit</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">epsilon</span> <span class="o">=</span> <span class="n">epsilon</span>
<span class="k">def</span> <span class="nf">_select_action</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
<span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">sample</span><span class="p">()</span> <span class="o"><</span> <span class="bp">self</span><span class="o">.</span><span class="n">epsilon</span><span class="p">:</span>
<span class="c1"># take random action</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">bandit</span><span class="o">.</span><span class="n">actions</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="c1"># take greedy action</span>
<span class="n">a</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">est_action_values</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">key</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">est_action_values</span><span class="p">[</span><span class="n">key</span><span class="p">])</span>
<span class="k">return</span> <span class="n">a</span>
</pre></div>
<p>(See end of post for additional action-value methods.)</p>
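<p>The parent class <code>BaseEstimateActionValueAlgo</code> is referenced but not shown in the post. Here is a plausible standalone sketch, assuming it tracks an incremental sample average per arm; in the real code it presumably subclasses <code>BaseBanditAlgo</code>, and the attribute names (<code>est_action_values</code>, <code>q_init</code>) are inferred from the <code>EpsilonGreedy</code> code above:</p>

```python
class BaseEstimateActionValueAlgo:
    """Plausible sketch of the unshown parent class: keeps an incremental
    sample-average estimate of the value of each arm."""
    def __init__(self, bandit, q_init: float = 0.0):
        self.bandit = bandit
        # q_init > mu gives optimistic initialization
        self.est_action_values = {a: q_init for a in bandit.actions}
        self.action_counts = {a: 0 for a in bandit.actions}

    def _update_for_action_and_reward(self, action: int, reward: float):
        self.action_counts[action] += 1
        q = self.est_action_values[action]
        self.est_action_values[action] = q + (reward - q) / self.action_counts[action]

class _StubBandit:  # minimal stand-in for the Bandit class above
    actions = [0, 1]

algo = BaseEstimateActionValueAlgo(_StubBandit())
algo._update_for_action_and_reward(0, 2.0)
algo._update_for_action_and_reward(0, 4.0)  # running average for arm 0 is now 3.0
```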
<p>An example of a <strong>policy based</strong> strategy is the <code>GradientBandit</code>
method, which stores its policy, the probability per action in
<code>self.preferences</code>. It learns these preferences by doing stochastic
gradient ascent along the preferences in the gradient of the expected
reward in <code>_update_for_action_and_reward</code> (see [1] for derivation).</p>
<div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">GradientBandit</span><span class="p">(</span><span class="n">BaseBanditAlgo</span><span class="p">):</span>
<span class="sd">"""Algorithm that does not try to estimate action values directly and, instead, tries to learn</span>
<span class="sd"> a preference for each action (equivalent to stochastic gradient ascent along gradient in expected</span>
<span class="sd"> reward over preferences).</span>
<span class="sd"> """</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">bandit</span><span class="p">:</span> <span class="n">Bandit</span><span class="p">,</span> <span class="n">alpha</span><span class="p">:</span> <span class="nb">float</span><span class="p">):</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="n">bandit</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">alpha</span> <span class="o">=</span> <span class="n">alpha</span> <span class="c1"># step-size</span>
<span class="bp">self</span><span class="o">.</span><span class="n">reward_baseline_avg</span> <span class="o">=</span> <span class="mi">0</span>
<span class="bp">self</span><span class="o">.</span><span class="n">preferences</span> <span class="o">=</span> <span class="p">{</span><span class="n">action</span><span class="p">:</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">action</span> <span class="ow">in</span> <span class="n">bandit</span><span class="o">.</span><span class="n">actions</span><span class="p">}</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_calc_probs_from_preferences</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">_calc_probs_from_preferences</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="sd">"""Probabilities per action follow a Boltzmann distribution over the preferences """</span>
<span class="n">exp_preferences_for_action</span> <span class="o">=</span> <span class="p">{</span><span class="n">action</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">action</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">preferences</span><span class="o">.</span><span class="n">items</span><span class="p">()}</span>
<span class="n">partition_fxn</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">exp_preferences_for_action</span><span class="o">.</span><span class="n">values</span><span class="p">())</span>
<span class="bp">self</span><span class="o">.</span><span class="n">probabilities_for_action</span> <span class="o">=</span> <span class="n">OrderedDict</span><span class="p">({</span><span class="n">action</span><span class="p">:</span> <span class="n">v</span> <span class="o">/</span> <span class="n">partition_fxn</span> <span class="k">for</span> <span class="n">action</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span>
<span class="n">exp_preferences_for_action</span><span class="o">.</span><span class="n">items</span><span class="p">()})</span>
<span class="k">def</span> <span class="nf">_select_action</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">probabilities_for_action</span><span class="o">.</span><span class="n">keys</span><span class="p">()),</span>
<span class="n">p</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">probabilities_for_action</span><span class="o">.</span><span class="n">values</span><span class="p">()))</span>
<span class="k">def</span> <span class="nf">_update_for_action_and_reward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">action</span><span class="p">,</span> <span class="n">reward</span><span class="p">):</span>
<span class="sd">"""Update preferences"""</span>
<span class="n">reward_diff</span> <span class="o">=</span> <span class="n">reward</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">reward_baseline_avg</span>
<span class="c1"># can we combine these updates into single expression using kronecker delta?</span>
<span class="bp">self</span><span class="o">.</span><span class="n">preferences</span><span class="p">[</span><span class="n">action</span><span class="p">]</span> <span class="o">+=</span> <span class="bp">self</span><span class="o">.</span><span class="n">alpha</span> <span class="o">*</span> <span class="n">reward_diff</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">probabilities_for_action</span><span class="p">[</span><span class="n">action</span><span class="p">])</span>
<span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">bandit</span><span class="o">.</span><span class="n">actions</span><span class="p">:</span>
<span class="k">if</span> <span class="n">a</span> <span class="o">==</span> <span class="n">action</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">else</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">preferences</span><span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-=</span> <span class="bp">self</span><span class="o">.</span><span class="n">alpha</span> <span class="o">*</span> <span class="n">reward_diff</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">probabilities_for_action</span><span class="p">[</span><span class="n">a</span><span class="p">]</span>
<span class="bp">self</span><span class="o">.</span><span class="n">reward_baseline_avg</span> <span class="o">+=</span> <span class="mi">1</span><span class="o">/</span><span class="bp">self</span><span class="o">.</span><span class="n">timestep</span> <span class="o">*</span> <span class="n">reward_diff</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_calc_probs_from_preferences</span><span class="p">()</span>
</pre></div>
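<p>The preference-to-probability mapping and the preference update can be checked in isolation. This standalone recreation of the two pieces of <code>GradientBandit</code> above uses arrays instead of dicts, and subtracts the max preference before exponentiating for numerical stability (a common refinement not in the class above):</p>

```python
import numpy as np

def softmax(prefs):
    """Boltzmann distribution over preferences."""
    e = np.exp(prefs - prefs.max())  # max-subtraction for numerical stability
    return e / e.sum()

prefs = np.zeros(3)
probs = softmax(prefs)  # uniform over the three arms

# One gradient step after arm 1 pays a reward above the baseline:
alpha, reward_diff, action = 0.1, 2.0, 1
prefs[action] += alpha * reward_diff * (1 - probs[action])
for a in range(3):
    if a != action:
        prefs[a] -= alpha * reward_diff * probs[a]

new_probs = softmax(prefs)  # probability mass shifts toward arm 1
```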
<h1 id="extra-total-rewards-for-different-bandit-algorithms">Extra: Total rewards for different bandit algorithms</h1>
<p>We have discussed a bunch of different bandit algorithms, but haven’t seen what rewards they yield in practice!</p>
<p>In this
<a href="https://github.com/frangipane/reinforcement-learning/blob/master/00-Introduction/multiarmed_bandits.ipynb">Jupyter notebook</a>,
we run the algorithms through a range of values for their parameters
to compare their cumulative rewards across 1000 timesteps (also
averaged across many trials of different bandits to smooth things
out). In the end, we arrive at a plot of the parameter study that
reproduces Figure 2.6 in [1].</p>
<p><img alt="parameter study" src="https://efavdb.com/images/reproduce_multiarmed_bandit_parameter_study.png"></p>
<h1 id="references">References</h1>
<p>[1] Sutton and Barto - <a href="http://incompleteideas.net/book/RLbook2018.pdf">Reinforcement Learning: An Introduction (2nd Edition)</a></p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Introduction to OpenAI Scholars 20202020-02-14T09:00:00-08:002020-02-14T09:00:00-08:00Cathy Yehtag:efavdb.com,2020-02-14:/openai-scholars-intro<p>Two weeks ago, I started at the <a href="https://openai.com/blog/openai-scholars-spring-2020/">OpenAI Scholars</a> program, which provides the opportunity to study and work full time on a project in an area of deep learning over 4 months. I’m having a blast! It’s been a joy focusing 100% on learning and challenging myself in …</p><p>Two weeks ago, I started at the <a href="https://openai.com/blog/openai-scholars-spring-2020/">OpenAI Scholars</a> program, which provides the opportunity to study and work full time on a project in an area of deep learning over 4 months. I’m having a blast! It’s been a joy focusing 100% on learning and challenging myself in an atmosphere full of friendly intellectual energy and drive.</p>
<p>My mentor is Jerry Tworek, an OpenAI research scientist who works on reinforcement learning (<span class="caps">RL</span>) in robotics, and I’ve also chosen to focus on <span class="caps">RL</span> during the program. I constructed a <a href="https://docs.google.com/document/d/1MlM5bxMqqiUIig5I6Y28fegvbqokjuvS2llVd2dIIRE/edit?usp=sharing">syllabus</a> that will definitely evolve over time, but I’ll try to keep it up-to-date to serve as a useful record for myself and a guide for others who might be interested in a similar course of study.</p>
<p>Some casual notes from the last two weeks:</p>
<p>(1) There are manifold benefits to working on a topic that is in my mentor’s area of expertise. For example, I’ve already benefited from Jerry’s intuition around hyperparameter tuning and debugging <span class="caps">RL</span>-specific problems, as well as his guidance on major concepts I should focus on in my first month, namely, model-free <span class="caps">RL</span> divided broadly into Q-Learning and Policy Gradients.</p>
<p>(2) <strong>Weights <span class="amp">&</span> Biases</strong> at <a href="https://wandb.com">wandb.com</a> is a fantastic free tool for tracking machine learning experiments that many people use at OpenAI. It was staggeringly simple to integrate wandb with my training script — both for local runs and in the cloud! Just ~4 extra lines of code, and logged metrics automagically appear in my wandb dashboard, with auto-generated plots grouped by experiment name, saved artifacts, etc.</p>
<p>Here’s an example of a <a href="https://app.wandb.ai/frangipane/dqn?workspace=user-frangipane">dashboard</a> tracking experiments for my first attempt at implementing a deep <span class="caps">RL</span> algorithm from scratch (<span class="caps">DQN</span>, or Deep Q learning). The script that is generating the experiments is still a work in progress, but you can see how few lines were required to integrate with wandb <a href="https://github.com/frangipane/reinforcement-learning/blob/master/DQN/dqn.py">here</a>. Stay tuned for a blog post about <span class="caps">DQN</span> itself in the future!</p>
<p>(3) I’ve found it very helpful to parallelize reading Sutton and Barto’s <a href="http://incompleteideas.net/book/RLbook2018.pdf">Reinforcement Learning: An Introduction</a>, <em>the</em> classic text on <span class="caps">RL</span>, with watching David Silver’s pedagogical online <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html">lectures</a>. Silver’s lectures follow the book closely for the first few chapters, then start condensing several chapters per lecture beginning around lecture 4 or 5 — helpful since I’m aiming to ramp up on <span class="caps">RL</span> over a short period of time! Silver also supplements with insightful explanations and material that aren’t covered in the book, e.g. insights about the convergence properties of some <span class="caps">RL</span> algorithms.</p>
<p>Note, Silver contributed to the work on Deep Q Learning applied to Atari that generated a lot of interest in deep <span class="caps">RL</span> beginning in 2013, leading to a <a href="https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf">publication</a> in Nature in 2015, so his lecture 6 on Value Function Approximation (<a href="https://www.youtube.com/watch?v=UoPei5o4fps">video</a>, <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/FA.pdf">slides</a>) is a perfect accompaniment to reading the paper.</p>Universal limiting mean return of CPPI investment portfolios2019-12-09T16:14:00-08:002019-12-09T16:14:00-08:00Jonathan Landytag:efavdb.com,2019-12-09:/universal-mean-return-formula-of-the-cppi-investment-strategy<p><span class="caps">CPPI</span>* is a risk management tactic that can be applied to
any investment portfolio. The approach entails banking a percentage of
profits whenever a new all time high wealth is achieved, thereby
ensuring that a portfolio’s drawdown never goes below some maximum
percentage. Here, I review <span class="caps">CPPI</span> and then …</p><p><span class="caps">CPPI</span>* is a risk management tactic that can be applied to
any investment portfolio. The approach entails banking a percentage of
profits whenever a new all time high wealth is achieved, thereby
ensuring that a portfolio’s drawdown never goes below some maximum
percentage. Here, I review <span class="caps">CPPI</span> and then consider the mean growth rate
of a <span class="caps">CPPI</span> portfolio. I find that in a certain, common limit, this mean
growth is given by a universal formula,
(\ref{cppi_asymptotic_growth}) below. This universal result does not
depend in detail on the statistics of the investment in question, but
instead only on its mean return and standard deviation. I illustrate
the formula’s accuracy with a simulation in python.</p>
<p>*<span class="caps">CPPI</span> = “Constant Proportion Portfolio Insurance”</p>
<h2 id="introduction">Introduction</h2>
<p>The drawdown of an investment portfolio at a given date is equal to
the amount of money lost relative to its maximum held capital up to
that date. This is illustrated in the figure at right — a portfolio
that once held $100 now holds only $90, so the drawdown is currently
$10. <a href="https://efavdb.com/wp-content/uploads/2019/12/dd.png"><img alt="dd" src="https://efavdb.com/wp-content/uploads/2019/12/dd.png"></a></p>
<p><span class="caps">CPPI</span> is a method that can be applied to guarantee that the maximum
fractional drawdown is never more than some predetermined value — the
idea is to simply squirrel away an appropriate portion of earnings
whenever we hit a new maximum account value, and only risk what’s left
over from that point on. For example, to cap the max loss at 50%, one
should only risk 50% of the initial capital and then continue to bank
50% of any additional earnings whenever a new all time high is reached.</p>
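<p>A minimal sketch of this banking rule in python (my own illustration; the function name and interface are invented for clarity):</p>

```python
def split_wealth(wealth, high_water_mark, pi=0.5):
    """Split total wealth into a banked (safe) piece and an at-risk piece.

    Whenever a new all time high is reached, a fraction pi of the peak
    wealth is moved to savings, so the drawdown can never exceed a
    fraction (1 - pi) of the all time high.
    """
    high_water_mark = max(wealth, high_water_mark)  # register any new high
    safe = pi * high_water_mark                     # banked at the last high
    at_risk = wealth - safe                         # all that remains exposed
    return safe, at_risk, high_water_mark

# Example: cap the maximum fractional drawdown at 50%.
safe, at_risk, hwm = split_wealth(100.0, 100.0, pi=0.5)
print(safe, at_risk)  # 50.0 50.0
```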
<p>According to Wikipedia, the first person to study <span class="caps">CPPI</span> was Perold, who derived the statistical properties of a <span class="caps">CPPI</span> portfolio’s value at time <span class="math">\(t\)</span>, assuming the underlying stochastic investment follows a Wiener process. I was introduced to the <span class="caps">CPPI</span> concept by the book “Algorithmic Trading” by Ernest Chan. This book implicitly poses the question of what the mean return is for general, discrete investment strategies. Here, I show that a universal formula applies in this case, valid at low to modest leverages and small unit investment Sharpe ratios — this result is given in equation (\ref{cppi_asymptotic_growth}) below.</p>
<p>The post proceeds as follows: In the next section, I define some notation and then write down the limiting result. In the following section, I give a numerical example in python. Finally, an appendix contains a derivation of the main result. This makes use of the universal limiting drawdown distribution result from my <a href="http://efavdb.github.io/universal-drawdown-statistics-in-investing">prior post</a>.</p>
<h2 id="cppi-formulation-and-universal-mean-growth-rate"><span class="caps">CPPI</span> formulation and universal mean growth rate</h2>
<p>In this section, I review the <span class="caps">CPPI</span> strategy and give the limiting mean return result. Consider a portfolio that at time <span class="math">\(t\)</span> has value,
</p>
<div class="math">$$
\begin{align}\tag{1}
W_t = S_t + \Gamma_t
\end{align}
$$</div>
<p>
where <span class="math">\(W\)</span> is our total wealth, <span class="math">\(S\)</span> is the banked (safe) portion, and <span class="math">\(\Gamma\)</span> is the portion we are willing to bet or gamble. The savings is set so that each time we reach a new all time high wealth, <span class="math">\(S\)</span> is adjusted to be equal to a fraction <span class="math">\(\Pi\)</span> of the net wealth. When this is done, the value of <span class="math">\(\Gamma\)</span> must also be adjusted downwards — some of the investment money is moved to savings. Before adjustment, the result of a bet moves <span class="math">\(\Gamma\)</span> to
</p>
<div class="math">\begin{align}\label{gamble_eom_cppi} \tag{2}
\tilde{\Gamma}_{t+1} \equiv \left (1 + f (g_{t} - 1)\right) \Gamma_t.
\end{align}</div>
<p>
Here, the tilde at left indicates that this is the value before any reallocation is applied, in case we have reached a new high. The outcome <span class="math">\(g_t\)</span> is a stochastic investment return variable, and <span class="math">\(f\)</span> is our “leverage” — a constant that encodes how heavily we bet on the game. I will assume all <span class="math">\(g_i\)</span> are independent random variables that are identically distributed with distribution <span class="math">\(p(g)\)</span>. I will assume that <span class="math">\(f\)</span> is a fixed value throughout time, and will write
</p>
<div class="math">\begin{align} \tag{3} \label{3}
f = \frac{1}{\phi} \frac{ \langle g -1 \rangle }{ \text{var}(g)}
\end{align}</div>
<p>
This re-parameterization helps to make the math work out more nicely below. It is also motivated by the Kelly formula, which specifies the gambling leverage that maximizes wealth at long times (here, setting <span class="math">\(\phi \to 1\)</span> gives the Kelly exposure).</p>
<p>The equations above define the <span class="caps">CPPI</span> scheme. The main result of this post is that in the limit where the leverage is not too high, and the stochastic <span class="math">\(g\)</span> has small Sharpe ratio (mean return over standard deviation), the mean log wealth at time <span class="math">\(t\)</span> satisfies
</p>
<div class="math">\begin{align} \label{cppi_asymptotic_growth} \tag{4}
\frac{1}{t}\langle \log W_{t} - \log W_{0} \rangle \sim \frac{ \langle g -1 \rangle^2 }{ \text{var}(g)} (1-\Pi) \frac{(2 \phi -1)}{2 \phi ^2}.
\end{align}</div>
<p>
This result can be used to estimate the mean growth of a portfolio to which <span class="caps">CPPI</span> is applied. Before illustrating its accuracy via a simulation below, I highlight a few points about the result:</p>
<ol>
<li>The fact that (\ref{cppi_asymptotic_growth}) is universal makes it very practical to apply to a real world investment: Rather than having to estimate the full distribution for <span class="math">\(g\)</span>, we need only estimate its mean and variance to get an estimate for the portfolio’s long-time mean return.</li>
<li>The mean return (\ref{cppi_asymptotic_growth}) is equivalent to that found by Perold for Gaussian processes — as expected, since the result is “universal”.</li>
<li>The maximum return at fixed <span class="math">\(\Pi\)</span> is again obtained at <span class="math">\(\phi = 1\)</span>, the Kelly maximum.</li>
<li>If <span class="math">\(\phi = 1/2\)</span>, we are at twice Kelly exposure and the mean return is zero — this is a well-known result. At <span class="math">\(\phi < 1/2\)</span>, we are above twice Kelly and the return is negative.</li>
<li>The mean return is reduced by a factor <span class="math">\((1-\Pi)\)</span>, the fraction of each new high in wealth that we leave exposed to loss. It is interesting that the return is not suppressed faster than this as we hold out more wealth, given that after a loss we have lower exposure than we would otherwise.</li>
<li>One can ask what <span class="math">\(\phi\)</span> gives us the same mean return using <span class="caps">CPPI</span> as we would obtain at full exposure using <span class="math">\(\phi_0\)</span>. E.g., consider the case of <span class="math">\(\Pi = 1/2\)</span>, which corresponds to insuring that half of our wealth is protected from loss. Equating the mean gains of these two (taking <span class="math">\(\phi_0 = 2\)</span> as an example) gives
<div class="math">\begin{align}
\frac{1}{2} \frac{(2 \phi -1)}{2 \phi ^2} = \frac{(2 (2)-1)}{2 (2)^2} = \frac{3}{8}.
\end{align}</div>
The two roots for <span class="math">\(\phi\)</span> are plotted versus <span class="math">\(\phi_0\)</span> below. <a href="https://efavdb.com/wp-content/uploads/2019/12/solution_half_safe.png"><img alt="solution_half_safe" src="https://efavdb.com/wp-content/uploads/2019/12/solution_half_safe.png"></a></li>
</ol>
<p>Notice that we can’t find solutions for all <span class="math">\(\phi_0\)</span> — it’s not possible to match the mean return for high leverages at full exposure when we force protection of some of our assets.</p>
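<p>A quick numerical way to see this (my own sketch; <code>matched_phi</code> is an invented helper): the matching condition can be rewritten as a quadratic in <span class="math">\(\phi\)</span>, and for large enough full-exposure return its discriminant is negative.</p>

```python
import numpy as np

def matched_phi(phi0, pi=0.5):
    """Real roots of (1 - pi)(2 phi - 1)/(2 phi^2) = (2 phi0 - 1)/(2 phi0^2).

    Rearranged, this is the quadratic
        r phi^2 - (1 - pi) phi + (1 - pi)/2 = 0,
    where r is the full-exposure mean return at leverage parameter phi0.
    Returns a sorted array of the real roots (possibly empty).
    """
    r = (2 * phi0 - 1) / (2 * phi0 ** 2)
    roots = np.roots([r, -(1 - pi), (1 - pi) / 2])
    return np.sort(roots[np.isreal(roots)].real)

print(matched_phi(4.0))  # two real roots: CPPI can match this milder full-exposure return
print(matched_phi(2.0))  # empty array: no phi matches the phi_0 = 2 return
```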
<p>We now turn to a simulation example.</p>
<h2 id="python-cppi-simulation">Python <span class="caps">CPPI</span> simulation</h2>
<p>In the code below, we consider a system where in each step we either “win” or “lose”: the unit bet outcome is <span class="math">\(g = 1.02\)</span> on a win and <span class="math">\(g = 1/1.02\)</span> on a loss, so the money we risk is multiplied by <span class="math">\(1 + f(g - 1)\)</span> each step. We take the probability of winning to be <span class="math">\(0.65\)</span>. This game has a Sharpe ratio of <span class="math">\(0.32\)</span>, small enough that our approximation should work well. The code below carries out a simulated repeated investment game over 100 trials — we hope that it is clear what is happening at each step.</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="c1"># investment definitions -- a random walk</span>
<span class="n">LIFT_ON_WIN</span> <span class="o">=</span> <span class="mf">1.02</span>
<span class="n">LIFT_ON_LOSS</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">/</span> <span class="n">LIFT_ON_WIN</span>
<span class="n">P_WIN</span> <span class="o">=</span> <span class="mf">0.65</span>
<span class="n">P_LOSS</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">P_WIN</span>
<span class="n">g_minus_1_bar</span> <span class="o">=</span> <span class="n">P_WIN</span> <span class="o">*</span> <span class="p">(</span><span class="n">LIFT_ON_WIN</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="n">P_LOSS</span> <span class="o">*</span> <span class="p">(</span><span class="n">LIFT_ON_LOSS</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">var_g</span> <span class="o">=</span> <span class="n">P_WIN</span> <span class="o">*</span> <span class="p">(</span><span class="n">LIFT_ON_WIN</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span> <span class="o">+</span> <span class="n">P_LOSS</span> <span class="o">*</span> <span class="p">(</span><span class="n">LIFT_ON_LOSS</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span> <span class="o">-</span> <span class="p">(</span><span class="n">g_minus_1_bar</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span>
<span class="n">full_kelly</span> <span class="o">=</span> <span class="n">g_minus_1_bar</span> <span class="o">/</span> <span class="n">var_g</span>
<span class="n">sharpe</span> <span class="o">=</span> <span class="n">g_minus_1_bar</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">var_g</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Sharpe ratio of g unit bet: </span><span class="si">%.4f</span><span class="s1">'</span> <span class="o">%</span> <span class="n">sharpe</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">simulate_once</span><span class="p">(</span><span class="n">phi</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="n">steps</span><span class="p">):</span>
<span class="n">initial_value</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">full_kelly</span> <span class="o">/</span> <span class="n">phi</span>
<span class="n">current_nav</span> <span class="o">=</span> <span class="n">initial_value</span>
<span class="n">current_max</span> <span class="o">=</span> <span class="n">current_nav</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">steps</span><span class="p">):</span>
<span class="c1"># update current max</span>
<span class="n">current_max</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">current_nav</span><span class="p">,</span> <span class="n">current_max</span><span class="p">)</span>
<span class="c1"># calculate current effective nav</span>
<span class="n">current_drawdown</span> <span class="o">=</span> <span class="n">current_max</span> <span class="o">-</span> <span class="n">current_nav</span>
<span class="n">gamma</span> <span class="o">=</span> <span class="n">current_max</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">pi</span><span class="p">)</span> <span class="o">-</span> <span class="n">current_drawdown</span>
<span class="c1"># play round of investment game</span>
<span class="n">dice_roll</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">()</span>
<span class="n">win</span> <span class="o">=</span> <span class="p">(</span><span class="n">dice_roll</span> <span class="o"><</span> <span class="n">P_WIN</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">win</span>
<span class="n">g</span> <span class="o">=</span> <span class="n">LIFT_ON_WIN</span> <span class="o">*</span> <span class="n">win</span> <span class="o">+</span> <span class="n">LIFT_ON_LOSS</span> <span class="o">*</span> <span class="n">loss</span>
<span class="n">nav_change</span> <span class="o">=</span> <span class="n">gamma</span> <span class="o">*</span> <span class="n">f</span> <span class="o">*</span> <span class="p">(</span><span class="n">g</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="c1"># update wealth</span>
<span class="n">current_nav</span> <span class="o">+=</span> <span class="n">nav_change</span>
<span class="k">return</span> <span class="n">current_nav</span>
<span class="n">end_results</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># kelly and cppi properties</span>
<span class="n">PHI</span> <span class="o">=</span> <span class="mf">10.0</span>
<span class="n">PI</span> <span class="o">=</span> <span class="o">.</span><span class="mi">75</span>
<span class="n">STEPS</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="n">TRIALS</span> <span class="o">=</span> <span class="mi">100</span>
<span class="c1"># simulation loop</span>
<span class="k">for</span> <span class="n">trial</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">TRIALS</span><span class="p">):</span>
<span class="n">end_nav</span> <span class="o">=</span> <span class="n">simulate_once</span><span class="p">(</span><span class="n">phi</span><span class="o">=</span><span class="n">PHI</span><span class="p">,</span> <span class="n">pi</span><span class="o">=</span><span class="n">PI</span><span class="p">,</span> <span class="n">steps</span><span class="o">=</span><span class="n">STEPS</span><span class="p">)</span>
<span class="n">end_results</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">end_nav</span><span class="p">)</span>
<span class="n">theory</span> <span class="o">=</span> <span class="p">(</span><span class="n">g_minus_1_bar</span> <span class="o">**</span> <span class="mi">2</span> <span class="o">/</span> <span class="n">var_g</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">PI</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">PHI</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">PHI</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Experiment: </span><span class="si">%2.5f</span><span class="s1">'</span> <span class="o">%</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">end_results</span><span class="p">))</span> <span class="o">/</span> <span class="n">STEPS</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Theory: </span><span class="si">%2.5f</span><span class="s1">'</span> <span class="o">%</span> <span class="n">theory</span><span class="p">)</span>
<span class="c1"># OUTPUT:</span>
<span class="c1"># Sharpe ratio of g unit bet: 0.3249</span>
<span class="c1"># Experiment: 0.00251</span>
<span class="c1"># Theory: 0.00251</span>
</pre></div>
<p>The last lines above show the output of our print statements. In particular, the last two lines show the mean growth rate observed over the 100 trials and the theoretical value (\ref{cppi_asymptotic_growth}) — these agree to three decimal places.</p>
<p>Using a loop over <span class="math">\(\phi\)</span> values, I used the code above to obtain the plot below of mean returns vs <span class="math">\(\phi\)</span>. This shows that the limiting result works quite well over most <span class="math">\(\phi\)</span> — though there is some systematic, modest discrepancy at small <span class="math">\(\phi\)</span>. This is expected as the quadratic expansion for log wealth used below starts to break down at high exposures. Nevertheless, the fit is qualitatively quite good at all <span class="math">\(\phi\)</span>. This suggests that our result (\ref{cppi_asymptotic_growth}) can be used for quick mean return forecasts for most practical, applied cases of <span class="caps">CPPI</span>.</p>
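<p>For reference, the theory curve in that plot can be generated directly from eq. (4); a short sketch using the same game parameters as the simulation above:</p>

```python
import numpy as np

# Theory curve from eq. (4) for the simulated coin-flip game above.
LIFT_ON_WIN = 1.02
LIFT_ON_LOSS = 1 / LIFT_ON_WIN
P_WIN, P_LOSS = 0.65, 0.35
g_minus_1_bar = P_WIN * (LIFT_ON_WIN - 1) + P_LOSS * (LIFT_ON_LOSS - 1)
var_g = (P_WIN * (LIFT_ON_WIN - 1) ** 2
         + P_LOSS * (LIFT_ON_LOSS - 1) ** 2
         - g_minus_1_bar ** 2)

PI = 0.75
phis = np.linspace(0.5, 10, 50)
theory = (g_minus_1_bar ** 2 / var_g) * (1 - PI) * (2 * phis - 1) / (2 * phis ** 2)
# As expected: zero mean return at twice Kelly (phi = 0.5), maximum near Kelly (phi = 1).
```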
<p><a href="https://efavdb.com/wp-content/uploads/2019/12/mean_cppi_growth_rate.png"><img alt="mean_cppi_growth_rate" src="https://efavdb.com/wp-content/uploads/2019/12/mean_cppi_growth_rate.png"></a></p>
<h2 id="appendix-derivation-of-mean-return">Appendix: Derivation of mean return</h2>
<p>We give a rough sketch here of a proof of (\ref{cppi_asymptotic_growth}). Our aim is to quickly get the universal form in a relatively clean way. Note that these results may be new, or perhaps well known to finance theorists — I’m not sure.</p>
<p>To begin let us define
</p>
<div class="math">\begin{align}
\Gamma^*_t = \max_{t^{\prime} \leq t} \Gamma_{t^{\prime}}.
\end{align}</div>
<p>
This is the maximum <span class="math">\(\Gamma\)</span> seen to date at time <span class="math">\(t\)</span>. Necessarily, this is the value of <span class="math">\(\Gamma\)</span> as of the most recent all time high preceding <span class="math">\(t\)</span>. If <span class="math">\(\Gamma_t < \Gamma^*_t\)</span>, we say that we are in drawdown by value <span class="math">\(\Gamma^*_t - \Gamma_t\)</span>. At all times, we have
</p>
<div class="math">\begin{align}\nonumber
S_t &= \frac{\Pi}{1 - \Pi} \Gamma^*_t \\
&\equiv \rho \Gamma^*_t.
\end{align}</div>
<p>
This result holds because the saved portion is <span class="math">\(\Pi\)</span> times the net wealth when we reach a new high and <span class="math">\(\Gamma^*\)</span> is what’s left over, <span class="math">\((1 - \Pi)\)</span> times the net wealth at that time.</p>
<p>From the above definitions, our net wealth after a step is given by
</p>
<div class="math">\begin{align}\nonumber
W_{t+1} &= W_{t} \left ( 1 + f(g_t -1) \frac{\Gamma_t}{W_{t}} \right ) \\ \nonumber
&= W_{t} \left ( 1 + f(g_t -1) \frac{\Gamma_t}{S_t + \Gamma_t} \right ) \\
&= W_{t} \left ( 1 + f(g_t -1) \frac{\frac{\Gamma_t}{\Gamma^*_t}}{\rho+ \frac{\Gamma_t}{\Gamma^*_t}} \right )
\end{align}</div>
<p>
Iterating and taking the logarithm we obtain
</p>
<div class="math">\begin{align}\nonumber \tag{A1} \label{A1}
\log W_{t} &= \log W_{0} + \sum_{i=0}^{t-1} \log \left ( 1 + f(g_i -1) \frac{\frac{\Gamma_i}{\Gamma^*_i}}{\rho+ \frac{\Gamma_i}{\Gamma^*_i}} \right ) \\
&\approx \log W_{0} + \sum_{i=0}^{t-1} f (g_i -1) \frac{\frac{\Gamma_i}{\Gamma^*_i}}{\rho+ \frac{\Gamma_i}{\Gamma^*_i}} - \frac{f^2}{2} \left ((g_i -1) \frac{\frac{\Gamma_i}{\Gamma^*_i}}{\rho+ \frac{\Gamma_i}{\Gamma^*_i}}\right)^2 + \ldots
\end{align}</div>
<p>
The series expansion in the second line can be shown to converge quickly provided we have selected a leverage <span class="math">\(f\)</span> that always results in a small percentage change in our net wealth each step. Note that it is the breakdown of this expansion that causes the slight divergence at low <span class="math">\(\phi\)</span> in our last plot above.</p>
<p>Our aim is to evaluate the average of the last equation. The first key point needed to do this is to note that at step <span class="math">\(i\)</span>, we have <span class="math">\(g_i\)</span> and <span class="math">\(\Gamma_i / \Gamma_i^*\)</span> independent (the outcome of the unit bet doesn’t depend on how much we have to wager at this time). This allows us to factor the averages above into one over <span class="math">\(g\)</span> and over the <span class="math">\(\Gamma_i / \Gamma_i^*\)</span> distribution. The former is relatively easy to write down if we assume some properties for <span class="math">\(f\)</span> and <span class="math">\(g\)</span>. To proceed on the latter, we note that
</p>
<div class="math">\begin{align}\nonumber
\log \frac{\Gamma_i}{\Gamma^*_i} = \log \Gamma_i - \log \Gamma^*_i
\end{align}</div>
<p>
is the drawdown of a random walk, with steps defined by (\ref{gamble_eom_cppi}). We have argued in our last post that the tail of this drawdown distribution is such that
</p>
<div class="math">\begin{align}\nonumber
p( \log \Gamma_i - \log \Gamma^*_i = -k) \propto \exp(-\alpha k).
\end{align}</div>
<p>
where <span class="math">\(\alpha\)</span> is given in that post as an implicit function of the statistics of <span class="math">\(g\)</span>. This tail form will hold almost everywhere when the Sharpe ratio is small. We will assume this here.</p>
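<p>This tail form is easy to check numerically. The sketch below (my own check, with arbitrary illustrative drift and volatility) simulates a long walk for <span class="math">\(\log \Gamma\)</span> and compares the mean drawdown against the exponential-tail prediction <span class="math">\(1/\alpha = \sigma^2 / (2\mu)\)</span>:</p>

```python
import numpy as np

# If the drawdown k of log Gamma is roughly exponential with rate
# alpha = 2 mu / sigma^2, its mean should be close to sigma^2 / (2 mu).
# The drift and volatility here are arbitrary illustrative choices with a
# small per-step Sharpe ratio (0.1).
rng = np.random.default_rng(0)
mu, sigma = 0.005, 0.05
log_gamma = np.cumsum(rng.normal(mu, sigma, 200_000))
drawdown = np.maximum.accumulate(log_gamma) - log_gamma
print(drawdown.mean())  # should be near sigma**2 / (2 * mu) = 0.25
```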
<p>Using the change of variables rule, we get
</p>
<div class="math">\begin{align}\nonumber
p\left(\frac{\Gamma_i}{\Gamma^*_i} = k\right) \sim \begin{cases}
\alpha k^{\alpha - 1} & \text{if } k \in (0, 1) \\
0 & \text{else}
\end{cases}
\end{align}</div>
<p>
Again, this is an approximation that assumes we spend relatively little time within one jump from the current all time high — a result that will hold in the small Sharpe ratio limit. With this result, we obtain
</p>
<div class="math">\begin{align}\nonumber \tag{A2} \label{A2}
\left \langle \frac{\frac{\Gamma_i}{\Gamma^*_i}}{\rho+ \frac{\Gamma_i}{\Gamma^*_i}} \right \rangle &\equiv \int_0^1 \frac{x}{\rho + x} \alpha x^{\alpha - 1} dx \\
&= \frac{\alpha \, _2F_1\left(1,\alpha +1;\alpha +2;-\frac{1}{\rho }\right)}{\rho (\alpha +1) }
\end{align}</div>
<p>
Here, <span class="math">\(_2F_1\)</span> is the hypergeometric function. Similarly,
</p>
<div class="math">\begin{align}\nonumber \tag{A3} \label{A3}
\left \langle \left( \frac{\frac{\Gamma_i}{\Gamma^*_i}}{\rho+ \frac{\Gamma_i}{\Gamma^*_i}} \right)^2 \right \rangle &\equiv \int_0^1 \left( \frac{x}{\rho + x} \right)^2 \alpha x^{\alpha - 1} dx \\
&= \frac{\alpha \left(\frac{\rho }{\rho +1}-\frac{(\alpha +1) \, _2F_1\left(1,\alpha
+2;\alpha +3;-\frac{1}{\rho }\right)}{\alpha +2}\right)}{\rho ^2}
\end{align}</div>
<p>
Note that both of the last two lines go to one as <span class="math">\(\rho \to 0\)</span>, the limit where we invest our entire net worth and protect nothing. In this case, the growth equations are just those for a fully invested account. If you plug the last two results into the second line of (\ref{A1}), you get an expression for the mean return.</p>
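<p>The two closed forms above are easy to verify numerically; a quick check (my own, using scipy) at arbitrary values of <span class="math">\(\alpha\)</span> and <span class="math">\(\rho\)</span>:</p>

```python
from scipy.integrate import quad
from scipy.special import hyp2f1

alpha, rho = 1.5, 2.0  # arbitrary test values

# (A2): first moment of x / (rho + x) under the density alpha * x**(alpha - 1)
lhs_a2, _ = quad(lambda x: x / (rho + x) * alpha * x ** (alpha - 1), 0, 1)
rhs_a2 = alpha * hyp2f1(1, alpha + 1, alpha + 2, -1 / rho) / (rho * (alpha + 1))

# (A3): second moment of the same ratio
lhs_a3, _ = quad(lambda x: (x / (rho + x)) ** 2 * alpha * x ** (alpha - 1), 0, 1)
rhs_a3 = alpha * (rho / (rho + 1)
                  - (alpha + 1) / (alpha + 2) * hyp2f1(1, alpha + 2, alpha + 3, -1 / rho)) / rho ** 2

print(lhs_a2, rhs_a2)  # agree to quadrature precision
print(lhs_a3, rhs_a3)
```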
<p>To get the above results, we assumed a small Sharpe ratio. Therefore, to simplify things, we can use the value of <span class="math">\(\alpha\)</span> that we derived in our last post that is valid in this limit. This was given by <span class="math">\(\alpha \sim 2 \mu / \sigma^2\)</span>, where <span class="math">\(\mu\)</span> and <span class="math">\(\sigma\)</span> are the mean and standard deviation of the random walk. We now evaluate these to get an expression for <span class="math">\(\alpha\)</span> in terms of the statistics of <span class="math">\(g\)</span>. First, we note that the mean drift of <span class="math">\(\log \Gamma\)</span> is given by</p>
</p>
<div class="math">\begin{align}\nonumber
\mu &\equiv \langle \log(1 + f (g-1)) \rangle \\ \nonumber
&\sim f \langle g -1 \rangle - \frac{f^2}{2} \langle (g-1)^2 \rangle \\
&\sim f \langle g -1 \rangle - \frac{f^2}{2} \text{var}(g).
\end{align}</div>
<p>
The last line follows from the assumption that the Sharpe ratio for <span class="math">\(g -1\)</span> is small, so that
</p>
<div class="math">\begin{align}\nonumber
\langle (g-1)^2 \rangle &= \left ( \langle (g-1)^2 \rangle - \langle (g-1) \rangle^2 \right) + \langle (g-1) \rangle^2 \\
&= \text{var}(g) \left ( 1 + \frac{ \langle (g-1) \rangle^2}{\text{var}(g)} \right ).
\end{align}</div>
<p>
Similarly, one can show that
</p>
<div class="math">\begin{align}\nonumber
\sigma^2 &\equiv \text{var} \log(1 + f (g-1)) \\
&\sim f^2 \text{var}(g).
\end{align}</div>
<p>
This gives
</p>
<div class="math">\begin{align}\nonumber
\alpha &\sim 2 \frac{\mu}{\sigma^2} \\ \nonumber
&\sim 2 \frac{ f \langle g -1 \rangle - \frac{f^2}{2} \text{var}(g)}{ f^2 \text{var}(g)} \\
&= \frac{2}{f} \frac{ \langle g -1 \rangle }{ \text{var}(g)} - 1
\end{align}</div>
<p>
The second term in the first line can be neglected because we require the change in value to be a small fraction of our net wealth. We anticipate applying the algorithm to values of <span class="math">\(f\)</span> of order <span class="math">\(f \sim O( \frac{ \langle g -1 \rangle }{ \text{var}(g)})\)</span>, so the above is <span class="math">\(O(1)\)</span>. If we plug these results into the limiting form for <span class="math">\(\alpha\)</span> and use (\ref{3}) for <span class="math">\(f\)</span>, we get
</p>
<div class="math">\begin{align} \tag{A4} \label{A4}
\alpha \sim 2 \phi -1.
\end{align}</div>
<p>
We won’t show the details, but if you plug (\ref{A2}), (\ref{A3}), and (\ref{A4}) into (\ref{A1}), this gives (\ref{cppi_asymptotic_growth}). To get there, you need to show that a big collapse of terms occurs in the two hypergeometric functions. This collapse can be derived using the series expansion for <span class="math">\(_2F_1\)</span> in about one page. The collapse of terms only occurs in the small Sharpe ratio limit, where <span class="math">\(\alpha\)</span> is given as above. We note that for some walks, an exponential form holds everywhere. In this case, the more general expression using <span class="math">\(_2F_1\)</span> applies even at high Sharpe ratios — though we still require <span class="math">\(f\)</span> small.</p>
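<p>As a quick sanity check on the two approximations above, one can compare them against direct simulation. The sketch below (our own illustration, with a made-up low-Sharpe distribution for <span class="math">\(g\)</span> and a hypothetical bet fraction <span class="math">\(f\)</span>) verifies that <span class="math">\(\mu \approx f \langle g-1 \rangle - \frac{f^2}{2} \text{var}(g)\)</span> and <span class="math">\(\sigma^2 \approx f^2 \text{var}(g)\)</span> in this regime:</p>

```python
import math
import random

random.seed(0)

f = 0.5  # hypothetical bet fraction, for illustration only
# hypothetical low-Sharpe return distribution: g - 1 ~ N(0.002, 0.05)
returns = [random.gauss(0.002, 0.05) for _ in range(200_000)]

# exact sample mean and variance of log(1 + f(g - 1))
logs = [math.log(1.0 + f * r) for r in returns]
mu_exact = sum(logs) / len(logs)
var_exact = sum((x - mu_exact) ** 2 for x in logs) / len(logs)

# the small-Sharpe approximations derived above
mean_r = sum(returns) / len(returns)
var_r = sum((r - mean_r) ** 2 for r in returns) / len(returns)
mu_approx = f * mean_r - 0.5 * f ** 2 * var_r
sigma2_approx = f ** 2 * var_r
```

<p>For these parameters both approximations agree with the exact sample values to well within the sampling noise.</p>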
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Universal drawdown statistics in investing2019-12-05T15:46:00-08:002019-12-05T15:46:00-08:00Jonathan Landytag:efavdb.com,2019-12-05:/universal-drawdown-statistics-in-investing<p>We consider the equilibrium drawdown distribution for a biased random walk — in the context of a repeated investment game, the drawdown at a given time is how much has been lost relative to the maximum capital held up to that time. We show that in the tail, this is exponential …</p><p>We consider the equilibrium drawdown distribution for a biased random walk — in the context of a repeated investment game, the drawdown at a given time is how much has been lost relative to the maximum capital held up to that time. We show that in the tail, this is exponential. Further, when mean drift is small, this has an exponent that is universal in form, depending only on the mean and standard deviation of the step distribution. We give simulation examples in python consistent with the results.</p>
<h2 id="introduction-and-main-results">Introduction and main results</h2>
<p>In this post, we consider a topic of high interest to investors and gamblers alike — the statistics of drawdown. This is the amount of money the investor has lost relative to their maximum held capital to date. <a href="https://efavdb.com/wp-content/uploads/2019/12/dd.png"><img alt="dd" src="https://efavdb.com/wp-content/uploads/2019/12/dd.png"></a></p>
<p>For example, if an investor once held $100, but now holds only $90, their drawdown is currently $10. We will provide some results that characterize how unlikely it is for the investor to sit at a large drawdown of <span class="math">\(k\)</span>, given knowledge of the statistics of their bets.</p>
<p>We will take as our model system a biased random walk. The probability that at step <span class="math">\(t\)</span> the investment goes from <span class="math">\(k^{\prime}\)</span> to <span class="math">\(k\)</span> will be taken to be independent of time and given by
</p>
<div class="math">\begin{eqnarray}\tag{1} \label{step_distribution}
p(k^{\prime} \to k) = \tau(k - k^{\prime}).
\end{eqnarray}</div>
<p>
We will assume that this has a positive bias <span class="math">\(\mu\)</span>, so that on average the investor makes money. With this assumption, we show below that for <span class="math">\(\vert k \vert\)</span> more than a few step sizes, the drawdown distribution has an exponential form,
</p>
<div class="math">\begin{eqnarray}\tag{2} \label{exponential}
p(k) \propto \exp\left( - \alpha \vert k \vert \right)
\end{eqnarray}</div>
<p>
where the decay constant <span class="math">\(\alpha\)</span> satisfies
</p>
<div class="math">\begin{eqnarray}\tag{3} \label{dd_decay_eqn}
1 = \int_{-\infty}^{\infty} \exp\left( \alpha j \right) \tau(-j) dj.
\end{eqnarray}</div>
<p>
The form (\ref{exponential}) holds for general distributions and (\ref{dd_decay_eqn}) provides the formula for obtaining <span class="math">\(\alpha\)</span> in this case. However, in the limit where the mean drift <span class="math">\(\mu\)</span> in <span class="math">\(\tau\)</span> is small relative to its standard deviation, <span class="math">\(\sigma\)</span>, we show that the solution to (\ref{dd_decay_eqn}) has a universal form, giving
</p>
<div class="math">\begin{eqnarray}\tag{4} \label{exponential_universal}
p(k) \propto \exp\left( - 2 \frac{\mu}{\sigma^2} \vert k \vert \right).
\end{eqnarray}</div>
<p>
Because it is difficult to find very high drift investments, this simple form should hold for most real world investments (under the assumption of a Markov process). It can be used to give one a sense of how much time they can expect to sit at a particular drawdown, given estimates for <span class="math">\(\mu\)</span> and <span class="math">\(\sigma\)</span>.</p>
<p>The results (\ref{exponential} - \ref{exponential_universal}) are the main results of this post. These may be new, but could also be well-known to finance theorists — we are not sure. We illustrate their accuracy in the following section using a numerical example, and provide derivations in an appendix.</p>
<h2 id="numerical-examples-in-python">Numerical examples in python</h2>
<p>Here, we will consider two different kinds of random walk — one where the steps are always the same size, but there is bias in the forward direction, and the other where the steps are taken from a Gaussian or normal distribution. The code below carries out a simulated investing scenario over one million steps.</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="k">def</span> <span class="nf">binary</span><span class="p">(</span><span class="n">mu</span><span class="p">):</span>
    <span class="sd">"""</span>
<span class="sd">    Return either mu - 1 or mu + 1 with equal probability.</span>
<span class="sd">    Note unit std.</span>
<span class="sd">    """</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">([</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span> <span class="o">+</span> <span class="n">mu</span>

<span class="k">def</span> <span class="nf">normal_random_step</span><span class="p">(</span><span class="n">mu</span><span class="p">):</span>
    <span class="sd">"""</span>
<span class="sd">    Return a random unit normal with unit std.</span>
<span class="sd">    """</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">()</span> <span class="o">+</span> <span class="n">mu</span>

<span class="c1"># CONSTANTS</span>
<span class="n">TIME_STEPS</span> <span class="o">=</span> <span class="mi">10</span> <span class="o">**</span> <span class="mi">6</span>
<span class="n">MU</span> <span class="o">=</span> <span class="mf">0.1</span>

<span class="c1"># BINARY WALK</span>
<span class="n">STEP_FUNC</span> <span class="o">=</span> <span class="n">binary</span>
<span class="n">position</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">max_position_to_date</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">drawdowns_binary</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">time</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">TIME_STEPS</span><span class="p">):</span>
    <span class="n">position</span> <span class="o">+=</span> <span class="n">STEP_FUNC</span><span class="p">(</span><span class="n">MU</span><span class="p">)</span>
    <span class="n">max_position_to_date</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">max_position_to_date</span><span class="p">,</span> <span class="n">position</span><span class="p">)</span>
    <span class="n">drawdowns_binary</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">max_position_to_date</span> <span class="o">-</span> <span class="n">position</span><span class="p">)</span>

<span class="c1"># GAUSSIAN / NORMAL WALK</span>
<span class="n">STEP_FUNC</span> <span class="o">=</span> <span class="n">normal_random_step</span>
<span class="n">position</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">max_position_to_date</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">drawdowns_normal</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">time</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">TIME_STEPS</span><span class="p">):</span>
    <span class="n">position</span> <span class="o">+=</span> <span class="n">STEP_FUNC</span><span class="p">(</span><span class="n">MU</span><span class="p">)</span>
    <span class="n">max_position_to_date</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">max_position_to_date</span><span class="p">,</span> <span class="n">position</span><span class="p">)</span>
    <span class="n">drawdowns_normal</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">max_position_to_date</span> <span class="o">-</span> <span class="n">position</span><span class="p">)</span>
</pre></div>
<p><a href="https://efavdb.com/wp-content/uploads/2019/12/dd_normal.png"><img alt="dd_normal" src="https://efavdb.com/wp-content/uploads/2019/12/dd_normal.png"></a></p>
<p>You can see in the code that we have a loop over steps. At each step, we append to a list of observed drawdown values. A plot of the histogram of these values for the Normal case at <span class="math">\(\mu = 0.1\)</span> is shown at right.</p>
<p>To check whether our theoretical forms are accurate, it is useful to plot the cumulative distribution functions vs the theoretical forms — the latter will again be exponential with the same <span class="math">\(\alpha\)</span> values as the probability distribution functions. It turns out that the exponent <span class="math">\(\alpha\)</span> that solves (\ref{dd_decay_eqn}) is always given by the universal form for a Gaussian. However, for the binary walker, we need to solve for this numerically in general. The following code snippet does this.</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">scipy.optimize</span> <span class="kn">import</span> <span class="n">fsolve</span>
<span class="c1"># Solving numerically for binary case.</span>
<span class="n">binary_alpha_func</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span> <span class="o">*</span> <span class="n">MU</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">cosh</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">alpha_initial_guess</span> <span class="o">=</span> <span class="o">-</span><span class="mi">4</span>
<span class="n">alpha_solution</span> <span class="o">=</span> <span class="n">fsolve</span><span class="p">(</span><span class="n">binary_alpha_func</span><span class="p">,</span> <span class="n">alpha_initial_guess</span><span class="p">)</span>
</pre></div>
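<p>As a cross-check that avoids <code>scipy</code>, the same root can be found by simple bisection. For the binary step, (\ref{dd_decay_eqn}) is equivalent to <span class="math">\(\log \cosh(\alpha) = \mu \alpha\)</span> for the positive root <span class="math">\(\alpha\)</span>, and for a small drift like <span class="math">\(\mu = 0.1\)</span> the result should land close to the universal value <span class="math">\(2\mu/\sigma^2 = 0.2\)</span>. A sketch:</p>

```python
import math

MU = 0.1  # small drift; the binary step has unit standard deviation

def root_condition(alpha):
    # equivalent to 1 = exp(-alpha * MU) * cosh(alpha), i.e. eq. (3)
    # specialized to the binary step distribution
    return math.log(math.cosh(alpha)) - MU * alpha

lo, hi = 1e-6, 5.0  # root_condition(lo) < 0 < root_condition(hi)
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if root_condition(mid) < 0:
        lo = mid
    else:
        hi = mid

alpha_numeric = 0.5 * (lo + hi)    # close to 0.201
alpha_universal = 2 * MU           # small-drift form, eq. (4), with sigma = 1
```

<p>The numerical root exceeds the universal estimate by only about one percent here, consistent with the small-drift expansion.</p>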
<p>A plot of the function above and the solution when <span class="math">\(\mu = 0.85\)</span> is shown below. Note that there is always an unphysical solution at <span class="math">\(\alpha =0\)</span> — this should be ignored.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2019/12/binary_sol.png"><img alt="binary_sol" src="https://efavdb.com/wp-content/uploads/2019/12/binary_sol.png"></a></p>
<p>Using the above results, I have plotted the empirical cdfs versus <span class="math">\(k\)</span> for both walk distributions. The values are shown below for <span class="math">\(\mu = 0.1\)</span> (left) and <span class="math">\(\mu = 0.85\)</span> (right). The slopes of the theoretical and numerical results are what should be compared as these give the value of <span class="math">\(\alpha\)</span>. Note that <span class="math">\(\mu = 0.1\)</span> is a small drift relative to the standard deviation (<span class="math">\(\sigma = 1\)</span>, here), but <span class="math">\(\mu = 0.85\)</span> is not. This is why at left the universal form gives us a good fit to the decay rates for both systems, but at right we need our numerical solution to (\ref{dd_decay_eqn}) to get the binary decay rate.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2019/12/results.png"><img alt="results" src="https://efavdb.com/wp-content/uploads/2019/12/results.png"></a></p>
<p>In conclusion, we have found that the exponential form of drawdown works quite well in these examples, with the theoretical results (\ref{exponential} - \ref{exponential_universal}) providing methods for identifying the exponents. In particular, the plot at left above illustrates the universality of form (\ref{exponential_universal}) — it holds for all walks, provided we are in the small bias limit.</p>
<h2 id="appendix-derivations">Appendix: Derivations</h2>
<p>To derive the exponential form, we consider an integral equation for the drawdown probability <span class="math">\(p\)</span>. At equilibrium, we have
</p>
<div class="math">\begin{eqnarray}\tag{A1}
p(k) = \int_{-\infty}^{0} p(k^{\prime}) T(k^{\prime}, k) dk^{\prime}.
\end{eqnarray}</div>
<p>
where <span class="math">\(T\)</span> is the transition function for the drawdown process. In the tail, we can ignore the boundary at zero and this goes to
</p>
<div class="math">\begin{eqnarray}\tag{A2}
p(k) = \int_{-\infty}^{\infty} p(k^{\prime}) \tau(k - k^{\prime}) dk^{\prime},
\end{eqnarray}</div>
<p>
where we have taken the upper limit to infinity, assuming that the transition function has a finite range so that this is acceptable. We can solve this by positing an exponential solution of the form
</p>
<div class="math">\begin{eqnarray}\tag{A3}
p(k) \equiv A \exp\left(\alpha k \right).
\end{eqnarray}</div>
<p>
Plugging this into the above gives
</p>
<div class="math">\begin{eqnarray} \nonumber
A \exp\left(\alpha k \right) &=& \int_{-\infty}^{\infty} A \exp\left(\alpha k^{\prime} \right) \tau(k - k^{\prime}) dk^{\prime} \\ \tag{A4}
&=& A \exp\left(\alpha k \right) \int_{-\infty}^{\infty} \exp\left( \alpha j \right) \tau(-j) dj
\end{eqnarray}</div>
<p>
Dividing both sides by <span class="math">\(A \exp\left(\alpha k \right)\)</span> gives the condition (\ref{dd_decay_eqn}), confirming the exponential form (\ref{exponential}).</p>
<p>Now, to get the universal form, we make use of the cumulant expansion, writing
</p>
<div class="math">\begin{eqnarray} \nonumber
1 &=& \int_{-\infty}^{\infty} \exp\left( \alpha j \right) \tau(-j) dj \\
&\equiv & \exp \left ( - \mu \alpha + \sigma^2 \frac{\alpha^2}{2} + \ldots \right) \tag{A5}
\end{eqnarray}</div>
<p>
Provided the expansion converges quickly, we obtain
</p>
<div class="math">\begin{eqnarray}
- \mu \alpha + \sigma^2 \frac{\alpha^2}{2} + \ldots = 0 \tag{A6}
\end{eqnarray}</div>
<p>
giving
</p>
<div class="math">\begin{eqnarray} \label{cppi_alpha_asymptotic} \tag{A7}
\alpha \sim 2 \frac{\mu}{\sigma^2}
\end{eqnarray}</div>
<p>
With this solution, the <span class="math">\(k\)</span>-th term in the cumulant expansion goes like
</p>
<div class="math">\begin{eqnarray} \tag{A8}
\frac{2^k}{k!} \left( \frac{\mu}{\sigma^2} \right)^k O(\overline{x^k}) \sim \frac{2^k}{k!} \left( \frac{\mu}{\sigma} \right)^k
\end{eqnarray}</div>
<p>
assuming that the jumps are constrained to some length scale proportional to <span class="math">\(\sigma\)</span>. We see that provided the drift-to-standard-deviation ratio is small, the series converges quickly and our approximation is universally good. Unless you’re cursed with an unusually large drift ratio, this form should work well.</p>
TimeMarker class for python2019-09-14T22:31:00-07:002019-09-14T22:31:00-07:00Jonathan Landytag:efavdb.com,2019-09-14:/timemarker-class-for-python<p>We give a simple class for marking the time at different points in a code block and then printing out the time gaps between adjacent marked points. This is useful for identifying slow spots in code.</p>
<h2 id="the-timemarker-class">The TimeMarker class</h2>
<p>In the past, whenever I needed to speed up a block …</p><p>We give a simple class for marking the time at different points in a code block and then printing out the time gaps between adjacent marked points. This is useful for identifying slow spots in code.</p>
<h2 id="the-timemarker-class">The TimeMarker class</h2>
<p>In the past, whenever I needed to speed up a block of python code, the first thing I would do was import the time package, then manually insert a set of lines of the form <code>t1 = time.time()</code>, <code>t2 = time.time()</code>, etc. Then at the end, <code>print(t2 - t1, t3 - t2, ...)</code>, etc. This works reasonably well, but I found it annoying and time-consuming to have to save each time point to a different variable name. In particular, this prevented quick copy and paste of the time marker line. I finally thought to fix it this evening: Behold the <code>TimeMarker</code> class, which solves this problem for me:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">time</span>

<span class="k">class</span> <span class="nc">TimeMarker</span><span class="p">():</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">markers</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">def</span> <span class="nf">mark</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">markers</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">())</span>

    <span class="k">def</span> <span class="nf">print_markers</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">pair</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">markers</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">markers</span><span class="p">[</span><span class="mi">1</span><span class="p">:]):</span>
            <span class="nb">print</span><span class="p">(</span><span class="n">pair</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">pair</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
</pre></div>
<p>Here is a simple code example: </p>
<div class="highlight"><pre><span></span><span class="n">tm</span> <span class="o">=</span> <span class="n">TimeMarker</span><span class="p">()</span>
<span class="n">tm</span><span class="o">.</span><span class="n">mark</span><span class="p">()</span>
<span class="nb">sum</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">10</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span>
<span class="n">tm</span><span class="o">.</span><span class="n">mark</span><span class="p">()</span>
<span class="nb">sum</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">10</span> <span class="o">**</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">tm</span><span class="o">.</span><span class="n">mark</span><span class="p">()</span>
<span class="n">tm</span><span class="o">.</span><span class="n">print_markers</span><span class="p">()</span>
<span class="c1"># (output...) </span>
<span class="c1"># 7.9870223999e-05 </span>
<span class="c1"># 0.0279731750488 </span>
</pre></div>
<p>The key here is that I can quickly paste the <code>tm.mark()</code> repeatedly throughout my code and quickly check where the slow part sits.</p>Backpropagation in neural networks2019-07-18T23:35:00-07:002019-07-18T23:35:00-07:00Cathy Yehtag:efavdb.com,2019-07-18:/backpropagation-in-neural-networks<h2 id="overview">Overview</h2>
<p>We give a short introduction to neural networks and the backpropagation algorithm for training neural networks. Our overview is brief because we assume familiarity with partial derivatives, the chain rule, and matrix multiplication.</p>
<p>We also hope this post will be a quick reference for those already familiar with the …</p><h2 id="overview">Overview</h2>
<p>We give a short introduction to neural networks and the backpropagation algorithm for training neural networks. Our overview is brief because we assume familiarity with partial derivatives, the chain rule, and matrix multiplication.</p>
<p>We also hope this post will be a quick reference for those already familiar with the notation used by Andrew Ng in his course on <a href="https://www.coursera.org/learn/neural-networks-deep-learning/">“Neural Networks and Deep Learning”</a>, the first in the deeplearning.ai series on Coursera. That course provides but doesn’t derive the vectorized form of the backpropagation equations, so we hope to fill in that small gap while using the same notation.</p>
<h2 id="introduction-neural-networks">Introduction: neural networks</h2>
<h3 id="a-single-neuron-acting-on-a-single-training-example">A single neuron acting on a single training example</h3>
<p><img alt="single neuron" src="https://efavdb.com/wp-content/uploads/2019/07/single_neuron-e1563431237482.png"></p>
<p>The basic building block of a neural network is the composition of a nonlinear function (like a <a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoid</a>, <a href="http://mathworld.wolfram.com/HyperbolicTangent.html">tanh</a>, or <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">ReLU</a>) <span class="math">\(g(z)\)</span></p>
<div class="math">\begin{eqnarray} \nonumber
a^{[l]} = g(z^{[l]})
\end{eqnarray}</div>
<p>with a linear function acting on a (multidimensional) input, <span class="math">\(a\)</span>.
</p>
<div class="math">\begin{eqnarray} \nonumber
z^{[l]} = w^{[l]T} a^{[l-1]} + b^{[l]}
\end{eqnarray}</div>
<p>These building blocks, i.e. “nodes” or “neurons” of the neural network, are arranged in layers, with the layer denoted by superscript square brackets, e.g. <span class="math">\([l]\)</span> for the <span class="math">\(l\)</span>th layer. <span class="math">\(n_l\)</span> denotes the number of neurons in layer <span class="math">\(l\)</span>.</p>
<h3 id="forward-propagation">Forward propagation</h3>
<p>Forward propagation is the computation of the multiple linear and nonlinear transformations of the neural network on the input data. We can rewrite the above equations in vectorized form to handle multiple training examples and neurons per layer as</p>
<div class="math">\begin{eqnarray} \tag{1} \label{1}
A^{[l]} = g(Z^{[l]})
\end{eqnarray}</div>
<p>with a linear function acting on a (multidimensional) input, <span class="math">\(A\)</span>.
</p>
<div class="math">\begin{eqnarray} \tag{2} \label{2}
Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}
\end{eqnarray}</div>
<p>The outputs or activations, <span class="math">\(A^{[l-1]}\)</span>, of the previous layer serve as inputs for the linear functions, <span class="math">\(Z^{[l]}\)</span>. If <span class="math">\(n_l\)</span> denotes the number of neurons in layer <span class="math">\(l\)</span>, and <span class="math">\(m\)</span> denotes the number of training examples in one (mini)batch pass through the neural network, then the dimensions of these matrices are:</p>
<table>
<thead>
<tr>
<th>Variable</th>
<th>Dimensions</th>
</tr>
</thead>
<tbody>
<tr>
<td><span class="math">\(A^{[l]}\)</span></td>
<td>(<span class="math">\(n_l\)</span>, <span class="math">\(m\)</span>)</td>
</tr>
<tr>
<td><span class="math">\(Z^{[l]}\)</span></td>
<td>(<span class="math">\(n_l\)</span>, <span class="math">\(m\)</span>)</td>
</tr>
<tr>
<td><span class="math">\(W^{[l]}\)</span></td>
<td>(<span class="math">\(n_l\)</span>, <span class="math">\(n_{l-1}\)</span>)</td>
</tr>
<tr>
<td><span class="math">\(b^{[l]}\)</span></td>
<td>(<span class="math">\(n_l\)</span>, 1)</td>
</tr>
</tbody>
</table>
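<p>The dimensions in the table can be verified with a toy layer. The snippet below is a minimal sketch of equation (\ref{2}), with made-up weights, for a layer with <span class="math">\(n_l = 3\)</span>, <span class="math">\(n_{l-1} = 2\)</span>, and a batch of <span class="math">\(m = 5\)</span> examples (pure Python for self-containment; real code would use numpy):</p>

```python
# pure-Python shape check of Z = W A + b
def matmul(X, Y):
    # columns of Y come from zip(*Y); standard row-by-column product
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

n_l, n_prev, m = 3, 2, 5
W = [[0.1] * n_prev for _ in range(n_l)]      # shape (n_l, n_prev)
A_prev = [[1.0] * m for _ in range(n_prev)]   # shape (n_prev, m)
b = [[0.5] for _ in range(n_l)]               # shape (n_l, 1), broadcast over columns

Z = [[wa + b[i][0] for wa in row] for i, row in enumerate(matmul(W, A_prev))]

print(len(Z), len(Z[0]))  # 3 5 -> Z has shape (n_l, m), as in the table
```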
<p>For example, this neural network consists of only a single hidden layer with 3 neurons in layer 1.</p>
<p><img alt="neural network" src="https://efavdb.com/wp-content/uploads/2019/07/2layer_nn-e1563432145388.png"></p>
<p>The matrix <span class="math">\(W^{[1]}\)</span> has dimensions (3, 2) because there are 3 neurons in layer 1 and 2 inputs from the previous layer (in this example, the inputs are the raw data, <span class="math">\(\vec{x} = (x_1, x_2)\)</span>). Each row of <span class="math">\(W^{[1]}\)</span> corresponds to a vector of weights for a neuron in layer 1.</p>
<p><img alt="weights matrix" src="https://efavdb.com/wp-content/uploads/2016/06/weights_matrix-e1563432287786.png"></p>
<p>The final output of the neural network is a prediction in the last layer <span class="math">\(L\)</span>, and the closeness of the prediction <span class="math">\(A^{[L](i)}\)</span> to the true label <span class="math">\(y^{(i)}\)</span> for training example <span class="math">\(i\)</span> is quantified by a loss function <span class="math">\(\mathcal{L}(y^{(i)}, A^{[L](i)})\)</span>, where superscript <span class="math">\((i)\)</span> denotes the <span class="math">\(i\)</span>th training example. For classification, the typical choice for <span class="math">\(\mathcal{L}\)</span> is the <a href="https://en.wikipedia.org/wiki/Cross_entropy">cross-entropy loss</a> (log loss).</p>
<p>The cost <span class="math">\(J\)</span> is the average loss over all <span class="math">\(m\)</span> training examples in the dataset.</p>
<div class="math">\begin{eqnarray} \tag{3} \label{3}
J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(y^{(i)}, A^{[L](i)})
\end{eqnarray}</div>
<h3 id="minimizing-the-cost-with-gradient-descent">Minimizing the cost with gradient descent</h3>
<p>The task of training a neural network is to find the set of parameters <span class="math">\(W\)</span> and <span class="math">\(b\)</span> (with different <span class="math">\(W\)</span> and <span class="math">\(b\)</span> for different nodes in the network) that will give us the best predictions, i.e. minimize the cost (\ref{3}).</p>
<p>Gradient descent is the workhorse that we employ for this optimization problem. We randomly initialize the parameters <span class="math">\(W\)</span> and <span class="math">\(b\)</span> for each node, then iteratively update the parameters by moving them in the direction that is opposite to the gradient of the cost.</p>
<div class="math">\begin{eqnarray} \nonumber
W_\text{new} &=& W_\text{previous} - \alpha \frac{\partial J}{\partial W} \\
b_\text{new} &=& b_\text{previous} - \alpha \frac{\partial J}{\partial b}
\end{eqnarray}</div>
<p>
<span class="math">\(\alpha\)</span> is the learning rate, a hyperparameter that needs to be tuned during the training process. The gradient of the cost is calculated by the backpropagation algorithm.</p>
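<p>The update rule is easy to see in action on a toy one-parameter cost whose gradient we can write down by hand. This minimal sketch is our own example, not part of the network code:</p>

```python
# Toy cost J(w) = (w - 3)^2 with gradient dJ/dw = 2 (w - 3);
# repeated updates w <- w - alpha * dJ/dw converge to the minimizer w = 3.
def gradient_descent(w0, grad, alpha=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)  # move opposite to the gradient
    return w

w_star = gradient_descent(w0=0.0, grad=lambda w: 2 * (w - 3))
# w_star ends up very close to 3; too large an alpha (here, > 1) would diverge
```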
<h2 id="backpropagation-equations">Backpropagation equations</h2>
<p>These are the vectorized backpropagation (<span class="caps">BP</span>) equations which we wish to derive:</p>
<div class="math">\begin{eqnarray} \nonumber
dW^{[l]} &\equiv& \frac{\partial J}{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]}A^{[l-1]T} \tag{BP1} \label{BP1} \\
db^{[l]} &\equiv& \frac{\partial J}{\partial b^{[l]}} = \frac{1}{m} \sum_{i=1}^m dZ^{[l](i)} \tag{BP2} \label{BP2} \\
dA^{[l-1]} &\equiv& \frac{\partial \mathcal{L}}{\partial A^{[l-1]}} = W^{[l]T}dZ^{[l]} \tag{BP3} \label{BP3} \\
dZ^{[l]} &\equiv& \frac{\partial \mathcal{L}}{\partial Z^{[l]}} = dA^{[l]} * g'(Z^{[l]}) \tag{BP4} \label{BP4}
\end{eqnarray}</div>
<p>
The <span class="math">\(*\)</span> in the last line denotes element-wise multiplication.</p>
<p><span class="math">\(W\)</span> and <span class="math">\(b\)</span> are the parameters we want to learn (update). The <span class="caps">BP</span> equations also include two expressions, <span class="math">\(dZ\)</span> and <span class="math">\(dA\)</span>, for the partial derivatives of the loss with respect to the linear and nonlinear activations of each training example: these are intermediate terms that appear in the calculation of <span class="math">\(dW\)</span> and <span class="math">\(db\)</span>.</p>
<h3 id="chain-rule">Chain rule</h3>
<p>We’ll need the chain rule for <a href="https://en.wikipedia.org/wiki/Total_derivative">total derivatives</a>, which describes how the change in a function <span class="math">\(f\)</span> with respect to a variable <span class="math">\(x\)</span> can be calculated as a sum over the contributions from intermediate functions <span class="math">\(u_i\)</span> that depend on <span class="math">\(x\)</span>:</p>
<div class="math">\begin{eqnarray} \nonumber
\frac{\partial f(u_1, u_2, ..., u_k)}{\partial x} = \sum_{i=1}^k \frac{\partial f}{\partial u_i} \frac{\partial u_i}{\partial x}
\end{eqnarray}</div>
<p>
where the <span class="math">\(u_i\)</span> are functions of <span class="math">\(x\)</span>. This expression reduces to the single variable chain rule when only one <span class="math">\(u_i\)</span> is a function of <span class="math">\(x\)</span>.</p>
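<p>A quick numerical sanity check of the total-derivative chain rule, using a toy function of our own choosing (not from the post):</p>

```python
import math

# f(u1, u2) = u1 * u2 with u1 = x^2 and u2 = sin(x), so the chain rule gives
# df/dx = (df/du1)(du1/dx) + (df/du2)(du2/dx) = u2 * 2x + u1 * cos(x).
def f_of_x(x):
    return x**2 * math.sin(x)

def chain_rule_derivative(x):
    u1, u2 = x**2, math.sin(x)
    return u2 * 2 * x + u1 * math.cos(x)

x, h = 1.3, 1e-6
numerical = (f_of_x(x + h) - f_of_x(x - h)) / (2 * h)  # central difference
analytic = chain_rule_derivative(x)
# numerical and analytic agree to within the finite-difference error
```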
<p>The gradients for every node can be calculated in a single backward pass through the network, starting with the last layer and working backwards, towards the input layer. As we work backwards, we cache the values of <span class="math">\(dZ\)</span> and <span class="math">\(dA\)</span> from previous calculations, which are then used to compute the derivative for variables that are further upstream in the computation graph. The dependency of the derivatives of upstream variables on downstream variables, i.e. cached derivatives, is manifested in the <span class="math">\(\frac{\partial f}{\partial u_i}\)</span> term in the chain rule. (Backpropagation is a dynamic programming algorithm!)</p>
<h3 id="the-chain-rule-applied-to-backpropagation">The chain rule applied to backpropagation</h3>
<p>In this section, we apply the chain rule to derive the vectorized form of equations <span class="caps">BP</span>(1-4). Without loss of generality, we’ll index an element of the matrix or vector on the left hand side of <span class="caps">BP</span>(1-4); the notation for applying the chain rule is therefore straightforward because the derivatives are just with respect to scalars.</p>
<p><strong><span class="caps">BP1</span></strong>
The partial derivative of the cost with respect to the <span class="math">\(s\)</span>th component (corresponding to the <span class="math">\(s\)</span>th input) of <span class="math">\(\vec{w}\)</span> in the <span class="math">\(r\)</span>th node in layer <span class="math">\(l\)</span> is:</p>
<div class="math">\begin{eqnarray}
dW^{[l]}_{rs} &\equiv& \frac{\partial J}{\partial W^{[l]}_{rs}} \\
&=& \frac{1}{m} \sum_{i=1}^m \frac{\partial \mathcal{L}}{\partial W^{[l]}_{rs}} \\
&=& \frac{1}{m} \sum_{i=1}^m \frac{\partial \mathcal{L}}{\partial Z^{[l]}_{ri}} \frac{\partial Z^{[l]}_{ri}}{\partial W^{[l]}_{rs}} \tag{4} \label{4}
\end{eqnarray}</div>
<p>
The last line is due to the chain rule.</p>
<p>The first term in (\ref{4}) is <span class="math">\(dZ^{[l]}_{ri}\)</span> by definition (\ref{<span class="caps">BP4</span>}). We can simplify the second term of (\ref{4}) using the definition of the linear function (\ref{2}), which we rewrite below explicitly for the <span class="math">\(i\)</span>th training example in the <span class="math">\(r\)</span>th node in the <span class="math">\(l\)</span>th layer in order to be able to more easily keep track of indices when we take derivatives of the linear function:
</p>
<div class="math">\begin{eqnarray} \tag{5} \label{5}
Z^{[l]}_{ri} = \sum_j^{n_{l-1}} W^{[l]}_{rj} A^{[l-1]}_{ji} + b^{[l]}_r
\end{eqnarray}</div>
<p>
where <span class="math">\(n_{l-1}\)</span> denotes the number of nodes in layer <span class="math">\(l-1\)</span>.</p>
<p>Therefore,
</p>
<div class="math">\begin{eqnarray}
dW^{[l]}_{rs} &=& \frac{1}{m} \sum_{i=1}^m dZ^{[l]}_{ri} A^{[l-1]}_{si} \\
&=& \frac{1}{m} \sum_{i=1}^m dZ^{[l]}_{ri} A^{[l-1]T}_{is} \\
&=& \frac{1}{m} \left( dZ^{[l]} A^{[l-1]T} \right)_{rs}
\end{eqnarray}</div>
<p><strong><span class="caps">BP2</span></strong>
The partial derivative of the cost with respect to <span class="math">\(b\)</span> in the <span class="math">\(r\)</span>th node in layer <span class="math">\(l\)</span> is:</p>
<div class="math">\begin{eqnarray}
db^{[l]}_r &\equiv& \frac{\partial J}{\partial b^{[l]}_r} \\
&=& \frac{1}{m} \sum_{i=1}^m \frac{\partial \mathcal{L}}{\partial b^{[l]}_r} \\
&=& \frac{1}{m} \sum_{i=1}^m \frac{\partial \mathcal{L}}{\partial Z^{[l]}_{ri}} \frac{\partial Z^{[l]}_{ri}}{\partial b^{[l]}_r} \tag{6} \label{6} \\
&=& \frac{1}{m} \sum_{i=1}^m dZ^{[l]}_{ri}
\end{eqnarray}</div>
<p>
(\ref{6}) is due to the chain rule. The first term in (\ref{6}) is <span class="math">\(dZ^{[l]}_{ri}\)</span> by definition (\ref{<span class="caps">BP4</span>}). The second term of (\ref{6}) simplifies to <span class="math">\(\partial Z^{[l]}_{ri} / \partial b^{[l]}_r = 1\)</span> from (\ref{5}).</p>
<p><strong><span class="caps">BP3</span></strong>
The partial derivative of the loss for the <span class="math">\(i\)</span>th example with respect to the nonlinear activation in the <span class="math">\(r\)</span>th node in layer <span class="math">\(l-1\)</span> is:</p>
<div class="math">\begin{eqnarray}
dA^{[l-1]}_{ri} &\equiv& \frac{\partial \mathcal{L}}{\partial A^{[l-1]}_{ri}} \\
&=& \sum_{k=1}^{n_l} \frac{\partial \mathcal{L}}{\partial Z^{[l]}_{ki}} \frac{\partial Z^{[l]}_{ki}}{\partial A^{[l-1]}_{ri}} \tag{7} \label{7} \\
&=& \sum_{k=1}^{n_l} dZ^{[l]}_{ki} W^{[l]}_{kr} \tag{8} \label{8} \\
&=& \sum_{k=1}^{n_l} W^{[l]T}_{rk} dZ^{[l]}_{ki} \\
&=& \left( W^{[l]T} dZ^{[l]} \right)_{ri}
\end{eqnarray}</div>
<p>
The application of the chain rule (\ref{7}) includes a sum over the nodes in layer <span class="math">\(l\)</span> whose linear functions take <span class="math">\(A^{[l-1]}_{ri}\)</span> as an input, assuming the nodes between layers <span class="math">\(l-1\)</span> and <span class="math">\(l\)</span> are fully-connected. The first term in (\ref{8}) is by definition <span class="math">\(dZ\)</span> (\ref{<span class="caps">BP4</span>}); from (\ref{5}), the second term in (\ref{8}) evaluates to <span class="math">\(\partial Z^{[l]}_{ki} / \partial A^{[l-1]}_{ri} = W^{[l]}_{kr}\)</span>.</p>
<p><strong><span class="caps">BP4</span></strong>
The partial derivative of the loss for the <span class="math">\(i\)</span>th example with respect to the linear activation in the <span class="math">\(r\)</span>th node in layer <span class="math">\(l\)</span> is:</p>
<div class="math">\begin{eqnarray}
dZ^{[l]}_{ri} &\equiv& \frac{\partial \mathcal{L}}{\partial Z^{[l]}_{ri}} \\
&=& \frac{\partial \mathcal{L}}{\partial A^{[l]}_{ri}} \frac{\partial A^{[l]}_{ri}}{\partial Z^{[l]}_{ri}} \\
&=& dA^{[l]}_{ri} * g'(Z^{[l]}_{ri})
\end{eqnarray}</div>
<p>
The second line follows from the chain rule (single-variable here, since only a single nonlinear activation depends directly on <span class="math">\(Z^{[l]}_{ri}\)</span>). <span class="math">\(g'(Z)\)</span> is the derivative of the nonlinear activation function with respect to its input, which depends on the activation function assigned to that particular node, e.g. sigmoid vs. tanh vs. ReLU.</p>
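<p>The four equations can be exercised together in a short numpy script. Below is a sketch of our own (the layer sizes, random seed, and sigmoid activations are our choices, not the course's) that runs one backward pass through a one-hidden-layer network and checks a backpropagated gradient against a numerical derivative of the cost:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

n0, n1, n2, m = 2, 3, 1, 5                    # layer widths and number of examples
X = rng.normal(size=(n0, m))                  # A^[0], the raw inputs
Y = (rng.random((n2, m)) > 0.5).astype(float)
W1, b1 = rng.normal(size=(n1, n0)), np.zeros((n1, 1))
W2, b2 = rng.normal(size=(n2, n1)), np.zeros((n2, 1))

def forward(W1, b1, W2, b2):
    Z1 = W1 @ X + b1; A1 = sigmoid(Z1)        # equation (5), layer 1
    Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)       # layer 2
    return Z1, A1, Z2, A2

def cost(W1, b1, W2, b2):                     # cross-entropy cost, equation (3)
    A2 = forward(W1, b1, W2, b2)[3]
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m

Z1, A1, Z2, A2 = forward(W1, b1, W2, b2)
dA2 = (A2 - Y) / (A2 * (1 - A2))              # dL/dA^[2] for cross-entropy
dZ2 = dA2 * A2 * (1 - A2)                     # BP4: g'(z) = g(z)(1 - g(z)) for sigmoid
dW2 = dZ2 @ A1.T / m                          # BP1
db2 = np.sum(dZ2, axis=1, keepdims=True) / m  # BP2
dA1 = W2.T @ dZ2                              # BP3
dZ1 = dA1 * A1 * (1 - A1)                     # BP4, one layer up
dW1 = dZ1 @ X.T / m                           # BP1 with A^[0] = X

# numerical check of one entry of dW1 by central difference
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps; W1m[0, 0] -= eps
numerical = (cost(W1p, b1, W2, b2) - cost(W1m, b1, W2, b2)) / (2 * eps)
# numerical agrees with dW1[0, 0] to high precision
```

Note that <code>dW1</code> comes out with shape (3, 2), matching the <span class="math">\(W^{[1]}\)</span> dimensions discussed at the top of the post.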
<h3 id="conclusion">Conclusion</h3>
<p>Backpropagation efficiently executes gradient descent for updating the parameters of a neural network by ordering and caching the calculations of the gradient of the cost with respect to the parameters in the nodes. This post is a little heavy on notation since the focus is on deriving the vectorized formulas for backpropagation, but we hope it complements the lectures in Week 3 of Andrew Ng’s <a href="https://www.coursera.org/learn/neural-networks-deep-learning/">“Neural Networks and Deep Learning”</a> course as well as the excellent, but even more notation-heavy, resources on matrix calculus for backpropagation that are linked below.</p>
<hr>
<p><strong>More resources on vectorized backpropagation</strong>
<a href="https://explained.ai/matrix-calculus/index.html">The matrix calculus you need for deep learning</a> - from explained.ai
<a href="http://neuralnetworksanddeeplearning.com/chap2.html">How the backpropagation algorithm works</a> - Chapter 2 of the Neural Networks and Deep Learning free online text</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>An orientational integral2019-07-02T07:04:00-07:002019-07-02T07:04:00-07:00Jonathan Landytag:efavdb.com,2019-07-02:/an-orientational-integral<p>We evaluate an integral having to do with vector averages over all
orientations in an n-dimensional space.</p>
<h2 id="problem-definition">Problem definition</h2>
<p>Let <span class="math">\(\hat{v}\)</span> be a unit vector in <span class="math">\(n\)</span>-dimensions and consider the orientation average of
</p>
<div class="math">\begin{eqnarray} \tag{1} \label{1}
J \equiv \langle \hat{v} \cdot \vec{a}_1 \hat{v} \cdot \vec{a}_2 \ldots \hat{v} \cdot \vec{a}_k \rangle
\end{eqnarray}</div>
<p>
where <span class="math">\(\vec{a}_1, \ldots, \vec{a}_k\)</span> are some given fixed vectors. For example, if all <span class="math">\(\vec{a}_i\)</span> are equal to <span class="math">\(\hat{x}\)</span>, we want the orientation average of <span class="math">\(v_x^k\)</span>.</p>
<h2 id="solution">Solution</h2>
<p>We’ll evaluate our integral using parameter differentiation of the multivariate Gaussian integral. Let
</p>
<div class="math">\begin{eqnarray} \nonumber
I &=& \frac{1}{(2 \pi)^{n/2}} \int e^{- \frac{\vert \vec{v} \vert^2}{2} + \sum_{i=1}^k \alpha_i \vec{v} \cdot \vec{a}_i} d^nv \\ \tag{2} \label{2}
&=& \exp \left [- \frac{1}{2} \vert \sum_{i=1}^k \alpha_i \vec{a}_i \vert^2 \right]
\end{eqnarray}</div>
<p>
The expression in the second line follows from completing the square in the exponent in the first — for review, see our post on the normal distribution, <a href="http://efavdb.github.io/normal-distributions">here</a>. Now, we consider a particular derivative of <span class="math">\(I\)</span> with respect to the <span class="math">\(\alpha\)</span> parameters. From the first line of (\ref{2}), we have
</p>
<div class="math">\begin{eqnarray} \tag{3} \label{3}
\partial_{\alpha_1}\ldots \partial_{\alpha_k}I \vert_{\vec{\alpha}=0} &=& \frac{1}{(2 \pi)^{n/2}} \int e^{- \frac{\vert \vec{v} \vert^2}{2}} \prod_{i=1}^k \vec{v} \cdot \vec{a}_i d^n v \\
&\equiv & \frac{1}{(2 \pi)^{n/2}} \int_0^{\infty} e^{- \frac{\vert \vec{v} \vert^2}{2}} v^{n + k -1} dv \int \prod_{i=1}^k \hat{v} \cdot \vec{a}_i d \Omega_v \\
&=& \frac{2^{k/2 - 1}}{\pi^{n/2}} \Gamma(\frac{n+k}{2}) \times \int \prod_{i=1}^k \hat{v} \cdot \vec{a}_i d \Omega_v
\end{eqnarray}</div>
<p>
The second factor above is almost our desired orientation average <span class="math">\(J\)</span> — the only thing it’s missing is the normalization, which we can get by evaluating this integral without any <span class="math">\(\vec{a}\)</span><span class="quo">’</span>s.</p>
<p>Next, we evaluate the parameter derivative considered above in a second way, using the second line of (\ref{2}). This gives,
</p>
<div class="math">\begin{eqnarray} \tag{4} \label{4}
\partial_{\alpha_1}\ldots \partial_{\alpha_k}I \vert_{\vec{\alpha}=0} &=& \partial_{\alpha_1}\ldots \partial_{\alpha_k} \exp \left [- \frac{1}{2} \vert \sum_{i=1}^k \alpha_i \vec{a}_i \vert^2 \right] \vert_{\vec{\alpha}=0} \\
&=& \sum_{\text{pairings}} (\vec{a}_{i_1} \cdot \vec{a}_{i_2}) (\vec{a}_{i_3} \cdot \vec{a}_{i_4})\ldots (\vec{a}_{i_{k-1}} \cdot \vec{a}_{i_k})
\end{eqnarray}</div>
<p>
The sum here is over all possible, unique pairings of the indices. You can see this is correct by carrying out the differentiation one parameter at a time.</p>
<p>To complete the calculation, we equate (\ref{3}) and (\ref{4}). This gives
</p>
<div class="math">\begin{eqnarray} \tag{5}\label{5}
\int \prod_{i=1}^k \hat{v} \cdot \vec{a}_i d \Omega_v = \frac{\pi^{n/2}} {2^{k/2 - 1}\Gamma(\frac{n+k}{2})}\sum_{\text{pairings}} (\vec{a}_{i_1} \cdot \vec{a}_{i_2}) (\vec{a}_{i_3} \cdot \vec{a}_{i_4})\ldots (\vec{a}_{i_{k-1}} \cdot \vec{a}_{i_k})
\end{eqnarray}</div>
<p>
Again, to get the desired average, we need to divide the above by the normalization factor. This is given by the value of the integral (\ref{5}) when <span class="math">\(k = 0\)</span>. This gives,
</p>
<div class="math">\begin{eqnarray}\tag{6}\label{6}
J = \frac{1}{2^{k/2}}\frac{\Gamma(n/2)}{\Gamma(\frac{n+k}{2})} \sum_{\text{pairings}} (\vec{a}_{i_1} \cdot \vec{a}_{i_2}) (\vec{a}_{i_3} \cdot \vec{a}_{i_4})\ldots (\vec{a}_{i_{k-1}} \cdot \vec{a}_{i_k})
\end{eqnarray}</div>
<h2 id="example">Example</h2>
<p>Consider the case where <span class="math">\(k=2\)</span> and <span class="math">\(\vec{a}_1 = \vec{a}_2 = \hat{x}\)</span>. In this case, we note that the average of <span class="math">\(\hat{v}_x^2\)</span> is equal to the average along any other orientation. This means we have
</p>
<div class="math">\begin{eqnarray}\nonumber \tag{7} \label{7}
\langle \hat{v}_x^2 \rangle &=& \frac{1}{n} \sum_{i=1}^n \langle \hat{v}_i^2 \rangle = \frac{1}{n} \langle \vert \hat{v} \vert^2 \rangle \\
&=& \frac{1}{n}
\end{eqnarray}</div>
<p>
We get this same result from our more general formula: Plugging in <span class="math">\(k=2\)</span> and <span class="math">\(\vec{a}_1 = \vec{a}_2 = \hat{x}\)</span> into (\ref{6}), we obtain
</p>
<div class="math">\begin{eqnarray}\nonumber \tag{8} \label{8}
\langle \hat{v}_x^2 \rangle &=& \frac{1}{2}\frac{\Gamma(n/2)}{\Gamma(\frac{n}{2} + 1)} \\
&=& \frac{1}{n}
\end{eqnarray}</div>
<p>
The two results agree.</p>
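<p>Formula (\ref{6}) is also easy to spot-check by Monte Carlo. The sketch below is our own verification (the vectors and sample count are arbitrary choices): we sample unit vectors uniformly by normalizing Gaussian draws and compare the sample average to the <span class="math">\(k=2\)</span> prediction <span class="math">\(J = (\vec{a}_1 \cdot \vec{a}_2)/n\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n, samples = 4, 500_000
a1 = np.array([1.0, 2.0, 0.0, -1.0])
a2 = np.array([0.5, 0.0, 3.0, 1.0])

# normalizing i.i.d. Gaussian vectors gives the uniform distribution on the sphere
v = rng.normal(size=(samples, n))
v /= np.linalg.norm(v, axis=1, keepdims=True)

J_mc = np.mean((v @ a1) * (v @ a2))  # orientation average of (v.a1)(v.a2)
J_exact = a1 @ a2 / n                # equation (6) with k = 2: one pairing, a1.a2
# J_mc matches J_exact up to Monte Carlo error
```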
Compounding benefits of tax protected accounts2019-06-28T09:51:00-07:002019-06-28T09:51:00-07:00Jonathan Landytag:efavdb.com,2019-06-28:/compounding-benefits-of-tax-protected-accounts<p>Here, we highlight one of the most important benefits of tax protected accounts (eg Traditional and Roth IRAs and 401ks). Specifically, we review the fact that not having to pay taxes on any investment growth that occurs while the money is held in the account results in compounding / exponential growth with a larger exponent than would be obtained in a traditional account.</p>
<h2 id="the-growth-equations">The growth equations</h2>
<p>Here, we consider three types of investment account: A standard bank account without tax protection, a traditional tax protected account, and a Roth tax protected account. We’ll consider an idealized situation where we earn regular income of <span class="math">\(D_0^{\prime}\)</span> at time <span class="math">\(0\)</span> and then place this wealth (taxed, as appropriate for each case) into an investment that always returns a growth factor of <span class="math">\(g\)</span>. For simplicity, we’ll assume that our tax rate never changes and is given by <span class="math">\(t\)</span>. In the next three sections, we calculate expressions for the final wealth at time <span class="math">\(T\)</span> that results from each account. Following that, we compare the results.</p>
<h3 id="standard-account">Standard account</h3>
<p>In the standard account, the initial income must be taxed before it can be invested. With <span class="math">\(t\)</span> denoting the tax rate introduced above, the money left after tax at the start is<br>
</p>
<div class="math">\begin{eqnarray} \tag{1} \label{1}
D_0 = D_0^{\prime} (1 - t).
\end{eqnarray}</div>
<p><br>
We place this money into an idealized investment that always returns a growth of <span class="math">\(g\)</span>. Therefore, after one year, the net wealth before tax is<br>
</p>
<div class="math">\begin{eqnarray}\tag{2} \label{2}
D_1^{\prime} = D_0 (1 + g).
\end{eqnarray}</div>
<p><br>
The portion <span class="math">\(D_0 g\)</span> is new income that must be taxed, so after tax we have<br>
</p>
<div class="math">\begin{eqnarray}\tag{3} \label{3}
D_1 = D_0 + D_0 g (1 - t) = D_0[1 + g(1-t)].
\end{eqnarray}</div>
<p><br>
If we iterate this expression up to time <span class="math">\(T\)</span>, we obtain<br>
</p>
<div class="math">\begin{eqnarray}\nonumber
D_T &=& D_0[1 + g(1-t)]^T \\
&\equiv & D_0^{\prime} (1 - t)[1 + g(1-t)]^T \tag{4} \label{4}
\end{eqnarray}</div>
<p><br>
This is our equation for the final, post-tax wealth obtained from the standard account.</p>
<h3 id="traditional-tax-protected-account">Traditional tax protected account</h3>
<p>In the traditional account, we do not need to pay tax at time <span class="math">\(0\)</span> on our initial <span class="math">\(D_0^{\prime}\)</span> dollars. Instead, this wealth is immediately put into our growth investment for <span class="math">\(T\)</span> years. This gives a pretax wealth at time <span class="math">\(T\)</span> of<br>
</p>
<div class="math">\begin{eqnarray}\tag{5} \label{5}
D_T^{\prime} = D_0^{\prime} [1 + g]^T.
\end{eqnarray}</div>
<p><br>
However, when this money is taken out at time <span class="math">\(T\)</span> it must be taxed. This gives<br>
</p>
<div class="math">\begin{eqnarray}\tag{6} \label{6}
D_T = D_0^{\prime} (1-t) [1 + g]^T.
\end{eqnarray}</div>
<p><br>
This is the equation that describes the net wealth generated by the traditional tax protected account.</p>
<h3 id="roth-tax-protected-account">Roth tax protected account</h3>
<p>In the Roth account, we do pay taxes on the initial <span class="math">\(D_0^{\prime}\)</span> at time <span class="math">\(0\)</span>. However, once this is done, we never need to pay taxes again, even when taking the money out at expiration. Therefore, the net wealth at time <span class="math">\(T\)</span> is<br>
</p>
<div class="math">\begin{eqnarray}\tag{7} \label{7}
D_T = D_0^{\prime} (1-t) (1 + g)^T
\end{eqnarray}</div>
<p><br>
Notice that this expression is identical to that for the traditional tax protected account.</p>
<h2 id="comparison">Comparison</h2>
<p>Now that we have derived expressions for the final wealth in the three types of account, we can easily compare them. First, note that (\ref{4}), (\ref{6}), and (\ref{7}) all share the common factor of <span class="math">\(D_0^{\prime} (1-t) \equiv D_0\)</span>, which can be considered the initial post-tax wealth. This means that the only difference between the standard and tax protected accounts is the effective growth rate: The growth rate term for the standard account is<br>
</p>
<div class="math">\begin{eqnarray}
\text{growth factor (standard account)} = [1 + g(1-t)]^T \tag{8} \label{8}
\end{eqnarray}</div>
<p><br>
while that for the two tax protected accounts is<br>
</p>
<div class="math">\begin{eqnarray}
\text{growth factor (tax protected)}=[1 + g]^T \tag{9}\label{9}
\end{eqnarray}</div>
<p><br>
These two factors may look similar, but they represent exponential growth with different exponents. Consequently, for large <span class="math">\(T\)</span>, the growth from (\ref{9}) can be much larger than that from (\ref{8}). To illustrate this point, we tabulate the two functions assuming <span class="math">\(7\)</span> percent growth for <span class="math">\(30\)</span> years at a few representative tax rates below. Notice that the two growth factors are similar when the tax rate is low, which makes sense because taxation has little effect in this limit. In the opposite limit, however, the tax protected account comes out far ahead: 7.61 vs. 2.81 for the standard account at a 50 percent tax rate!</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">standard</span><span class="p">(</span><span class="n">T</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">t</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">g</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">t</span><span class="p">))</span> <span class="o">**</span> <span class="n">T</span>

<span class="k">def</span> <span class="nf">tax_protected</span><span class="p">(</span><span class="n">T</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">t</span><span class="p">):</span>
    <span class="c1"># t is unused: growth inside the account is never taxed</span>
    <span class="k">return</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">g</span><span class="p">)</span> <span class="o">**</span> <span class="n">T</span>
<span class="n">taxes</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">]</span>
<span class="n">standard_values</span> <span class="o">=</span> <span class="p">[</span><span class="n">standard</span><span class="p">(</span><span class="mi">30</span><span class="p">,</span> <span class="mf">0.07</span><span class="p">,</span> <span class="n">t</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">taxes</span><span class="p">]</span>
<span class="n">protected_values</span> <span class="o">=</span> <span class="p">[</span><span class="n">tax_protected</span><span class="p">(</span><span class="mi">30</span><span class="p">,</span> <span class="mf">0.07</span><span class="p">,</span> <span class="n">t</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">taxes</span><span class="p">]</span>
<span class="c1"># output:</span>
<span class="c1"># TAX RATE:  0.1,  0.2,  0.3,  0.4,  0.5</span>
<span class="c1"># STANDARD:  6.25, 5.13, 4.20, 3.44, 2.81</span>
<span class="c1"># PROTECTED: 7.61, 7.61, 7.61, 7.61, 7.61</span>
</pre></div>
<h2 id="final-comments">Final comments</h2>
<p>We’ve considered an idealized situation here in order to highlight the important point that tax protected accounts enjoy much larger compounding / exponential growth rates than do standard accounts. This can have a very big effect when taxation is high. However, there are other important characteristics not captured by our simplified model. One important case is that the benefits of the traditional and Roth accounts can differ if one’s tax rate changes over time. If interested, you should look into this elsewhere.</p>
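<p>That last point can be made concrete with a small sketch (our own illustration, built from expressions (\ref{6}) and (\ref{7}) above): with a constant tax rate the traditional and Roth accounts tie, but if the rate at withdrawal differs from the rate at contribution, the traditional account wins when rates fall and the Roth wins when rates rise.</p>

```python
def traditional(D0_prime, g, T, t_withdrawal):
    # taxed once, on the way out -- equation (6)
    return D0_prime * (1 + g) ** T * (1 - t_withdrawal)

def roth(D0_prime, g, T, t_contribution):
    # taxed once, up front -- equation (7)
    return D0_prime * (1 - t_contribution) * (1 + g) ** T

# constant 30% rate: the two accounts are equivalent
tie = abs(traditional(100, 0.07, 30, 0.3) - roth(100, 0.07, 30, 0.3)) < 1e-9
# rate falls to 20% by withdrawal time: deferring the tax (traditional) wins
trad_wins = traditional(100, 0.07, 30, 0.2) > roth(100, 0.07, 30, 0.3)
```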
Utility functions and immigration2019-06-21T09:24:00-07:002019-06-21T09:24:00-07:00Jonathan Landytag:efavdb.com,2019-06-21:/utility-functions-and-immigration<p>We consider how the <span class="caps">GDP</span> or utility output of a city depends on the number of people living within it. From this, we derive some interesting consequences that can inform both government and individual attitudes towards newcomers.</p>
<h3 id="the-utility-function-and-benefit-per-person">The utility function and benefit per person</h3>
<p>In this post, we will consider an idealized town whose net output <span class="math">\(U\)</span> (the <span class="caps">GDP</span>) scales as a power law with the number of people <span class="math">\(N\)</span> living within it. That is, we’ll assume,<br>
</p>
<div class="math">\begin{eqnarray} \tag{1} \label{1}
U(N) = a N^{\gamma}.
\end{eqnarray}</div>
<p><br>
We’ll assume that the average benefit captured per person is their share of this utility,<br>
</p>
<div class="math">\begin{eqnarray} \tag{2} \label{2}
BPP(N) = U(N) / N = a N^{\gamma -1}.
\end{eqnarray}</div>
<p><br>
What can we say about the above <span class="math">\(a\)</span> and <span class="math">\(\gamma\)</span>? Well, we must have <span class="math">\(a> 0\)</span> if the society is productive. Further, because cities allow for more complex economies as the number of occupants grows, we must have <span class="math">\(\gamma > 1\)</span>. These are the only assumptions we will make here. Below, we’ll see that they imply some interesting consequences.</p>
<h3 id="marginal-benefits">Marginal benefits</h3>
<p>When a new person immigrates to a city, its <span class="math">\(N\)</span> value goes up by one. Here, we consider how the utility and benefit per person changes when this occurs. The increase in net utility is simply<br>
</p>
<div class="math">\begin{eqnarray}\tag{3} \label{3}
\partial_N U(N) = a \gamma N^{\gamma -1}.
\end{eqnarray}</div>
<p><br>
Notice that because we have <span class="math">\(\gamma > 1\)</span>, (\ref{3}) is a function that increases with <span class="math">\(N\)</span>. That is, cities with larger populations benefit more (as a collective) per immigrant newcomer than cities with smaller <span class="math">\(N\)</span> do. This implies that the governments of large cities should be more enthusiastic about welcoming newcomers than those of smaller cities.</p>
<p>Now consider the marginal benefit per person when one new person moves to this city. This is simply<br>
</p>
<div class="math">\begin{eqnarray}\tag{4} \label{4}
\partial_N BPP(N) = a (\gamma - 1) N^{\gamma -2}.
\end{eqnarray}</div>
<p><br>
Notice that this differs from the form (\ref{3}) that describes the marginal increase in total city utility. In particular, while (\ref{4}) is positive, it is not necessarily increasing with <span class="math">\(N\)</span>: If <span class="math">\(\gamma < 2\)</span>, (\ref{4}) decreases with <span class="math">\(N\)</span>. In cities with <span class="math">\(\gamma\)</span> values like this, the net new wealth captured per existing citizen from each new immigrant quickly decays to zero. The consequence is that city governments and existing citizens can have a conflict of interest when it comes to immigration.</p>
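<p>As a quick numerical check of Eqs. (1)-(4), the sketch below evaluates the four formulas directly. The parameter values <span class="math">\(a = 1\)</span> and <span class="math">\(\gamma = 1.5\)</span> are arbitrary illustrations, not estimates from the post:</p>

```python
# Check of Eqs. (1)-(4) with illustrative parameters a = 1, gamma = 1.5.
a, gamma = 1.0, 1.5

def U(N):
    return a * N ** gamma                      # Eq. (1): net city output

def BPP(N):
    return U(N) / N                            # Eq. (2): benefit per person

def dU(N):
    return a * gamma * N ** (gamma - 1)        # Eq. (3): marginal utility

def dBPP(N):
    return a * (gamma - 1) * N ** (gamma - 2)  # Eq. (4): marginal benefit per person

# With 1 < gamma < 2: marginal utility grows with N, but the marginal
# benefit per existing person decays with N -- the conflict described above.
assert dU(10000) > dU(100)
assert dBPP(10000) < dBPP(100)
```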
<h3 id="equilibration">Equilibration</h3>
<p>In a local population that has freedom of movement, we can expect the migration of people to push the benefit per person to be equal across cities. In cases like this, we should then have<br>
</p>
<div class="math">\begin{eqnarray}\tag{5} \label{5}
a_i N_i^{\gamma_i -1} \approx a_j N_j^{\gamma_j -1},
\end{eqnarray}</div>
<p><br>
for each pair of cities <span class="math">\(i\)</span> and <span class="math">\(j\)</span> between which migration costs are low. We point out that this is not the condition required to maximize the net, global output, which is the score an authoritarian government might try to maximize. To maximize net utility, we need the marginal utility to be equal across cities, which means<br>
</p>
<div class="math">\begin{eqnarray}\tag{6} \label{6}
\partial_{N_i} U_i(N_i) = \partial_{N_j} U_j(N_j)
\end{eqnarray}</div>
<p><br>
or,<br>
</p>
<div class="math">\begin{eqnarray}\tag{7} \label{7}
a_i \gamma_i N_i^{\gamma_i -1} = a_j \gamma_j N_j^{\gamma_j -1}.
\end{eqnarray}</div>
<p><br>
We see that (\ref{5}) and (\ref{7}) differ in that there are <span class="math">\(\gamma\)</span> factors in (\ref{7}) that are not present in (\ref{5}). This implies that as long as the <span class="math">\(\gamma\)</span> values differ across cities, there will be a conflict of interest between the migrants and the government.</p>
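<p>The gap between conditions (5) and (7) is easy to see numerically. The sketch below uses two hypothetical cities with a fixed total population and arbitrary illustrative parameters, locating each condition’s population split by grid search:</p>

```python
# Two hypothetical cities, U_i = a_i * N_i^gamma_i, with N_1 + N_2 fixed.
# Compare the free-migration split (equal benefit per person, Eq. (5))
# with the split that equalizes marginal utilities (Eq. (7)).
# All parameter values here are arbitrary illustrations.
a1, g1 = 1.0, 1.3
a2, g2 = 1.0, 1.6
N_TOTAL = 1000

def bpp(a, g, n):
    return a * n ** (g - 1)          # benefit per person, as in Eq. (2)

def marginal(a, g, n):
    return a * g * n ** (g - 1)      # marginal utility, as in Eq. (3)

def split(score):
    # population n of city 1 that best equalizes `score` across the two cities
    return min(range(1, N_TOTAL),
               key=lambda n: abs(score(a1, g1, n) - score(a2, g2, N_TOTAL - n)))

n_free = split(bpp)       # where migrants settle under free movement
n_opt = split(marginal)   # what a net-output maximizer would choose
assert n_free != n_opt    # gamma_1 != gamma_2 implies a conflict of interest
```

Because the two cities have different <span class="math">\(\gamma\)</span> values here, the two splits disagree, which is exactly the conflict of interest described above.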
The speed of traffic2019-06-14T09:21:00-07:002019-06-14T09:21:00-07:00Jonathan Landytag:efavdb.com,2019-06-14:/the-speed-of-traffic<p>We use a simple argument to estimate the speed of traffic on a highway as a function of the density of cars. The idea is to simply calculate the maximum speed that traffic could go without supporting a growing traffic jam.</p>
<h3 id="jam-dissipation-argument">Jam dissipation argument</h3>
<p>To estimate the speed of traffic …</p><p>We use a simple argument to estimate the speed of traffic on a highway as a function of the density of cars. The idea is to simply calculate the maximum speed that traffic could go without supporting a growing traffic jam.</p>
<h3 id="jam-dissipation-argument">Jam dissipation argument</h3>
<p>To estimate the speed of traffic as a function of density, we’ll calculate an upper bound and argue that actual traffic speeds must be described by an equation similar to that obtained. To derive our upper bound, we’ll consider what happens when a small traffic jam forms. If the speed of cars is such that the rate of exit from the jam is larger than the rate at which new cars enter it, then the jam will dissipate. On the other hand, if this doesn’t hold, the jam will grow, causing speeds to drop until the jam can dissipate. This sets the bound. Although we consider a jam to keep the argument simple, what we really have in mind is any other sort of modest slow-down that may occur.</p>
<p>To begin, we introduce some definitions. (1) Let <span class="math">\(\lambda\)</span> be the density of cars in units of <span class="math">\([cars / mile]\)</span>. (2) Next we consider the rate of exit from a jam: Note that when traffic is stopped, a car cannot move until the car in front of it does. Because a human is driving the car, there is a slight delay between the time that one car moves and the car behind it moves. Let <span class="math">\(T\)</span> be this delay time in <span class="math">\([hours]\)</span>. (3) Let <span class="math">\(v\)</span> be the speed of traffic outside the jam in units of <span class="math">\([miles / hour]\)</span>.</p>
<p>With the above definitions, we now consider the rate at which cars exit a jam. This is the number of cars that can exit the jam per hour, which is simply<br>
</p>
<div class="math">\begin{eqnarray} \tag{1} \label{1}
r_{out} = \frac{1}{T}.
\end{eqnarray}</div>
<p><br>
Next, the rate at which cars enter the jam is given by<br>
</p>
<div class="math">\begin{eqnarray} \tag{2} \label{2}
r_{in} = \lambda v.
\end{eqnarray}</div>
<p><br>
Requiring that <span class="math">\(r_{out} > r_{in}\)</span> we get<br>
</p>
<div class="math">\begin{eqnarray} \label{3} \tag{3}
v < \frac{1}{\lambda T}.
\end{eqnarray}</div>
<p><br>
This is our bound and estimate for the speed of traffic. We note that this form for <span class="math">\(v\)</span> follows from dimensional analysis, so the actual speed of traffic must have the same algebraic form as our upper bound (\ref{3}) — it can differ by a constant factor in front, but should have the same <span class="math">\(\lambda\)</span> and <span class="math">\(T\)</span> dependence.</p>
<h3 id="plugging-in-numbers">Plugging in numbers</h3>
<p>I estimate <span class="math">\(T\)</span>, the delay time between car movements to be about one second, which in hours is<br>
</p>
<div class="math">\begin{eqnarray} \tag{4} \label{4}
T \approx 0.00028\ [hour].
\end{eqnarray}</div>
<p><br>
Next for <span class="math">\(\lambda\)</span>, note that a typical car is about 10 feet long and a mile is around 5000 feet, so the maximum for <span class="math">\(\lambda\)</span> is around <span class="math">\( \lambda \lesssim 500 [cars / mile]\)</span>. Consider a case where there is a car every 10 car lengths or so. In this case, the density will go down from the maximum by a factor of 10, or<br>
</p>
<div class="math">\begin{eqnarray}\tag{5} \label{5}
\lambda \approx 50 \ [cars / mile].
\end{eqnarray}</div>
<p><br>
Plugging (\ref{4}) and (\ref{5}) into (\ref{3}), we obtain<br>
</p>
<div class="math">\begin{eqnarray} \tag{6}
v \lesssim \frac{1}{0.00028 * 50} \approx 70\ [mile / hour],
\end{eqnarray}</div>
<p><br>
quite close to our typical highway traffic speeds (and speed limits).</p>
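<p>The arithmetic of Eqs. (3)-(6) takes only a few lines of Python. Note that using a delay of exactly one second gives 72 rather than the rounded 70 quoted above:</p>

```python
T = 1.0 / 3600            # Eq. (4): ~1 second reaction delay, in hours
lam = 50.0                # Eq. (5): cars per mile, one car per ~10 car lengths

v_max = 1.0 / (lam * T)   # Eq. (3): upper bound on traffic speed
print(round(v_max))       # 72 [miles / hour]

# The same rule at higher density: a car every ~5 lengths, so lambda ~ 100
print(round(1.0 / (100.0 * T)))  # 36 [miles / hour]
```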
<h3 id="final-comments">Final comments</h3>
<p>The above bound clearly depends on what values you plug in — I picked numbers that seemed reasonable, but admit I adjusted them a bit till I got the final number I wanted for <span class="math">\(v\)</span>. Anecdotally, I’ve found the result to work well at other densities: For example, when traffic is slow on the highway near my house, if I see that there is a car every 5 car lengths, the speed tends to be about <span class="math">\(30 [miles / hour]\)</span> — so the scaling rule seems to work. The last thing I should note is that Wikipedia has an article outlining some of the extensive research literature on traffic flows — you can see that <a href="https://en.wikipedia.org/wiki/Traffic_flow">here</a>.</p>
Linear compression in python: PCA vs unsupervised feature selection2018-08-11T07:30:00-07:002018-08-11T07:30:00-07:00Jonathan Landytag:efavdb.com,2018-08-11:/unsupervised-feature-selection-in-python-with-linselect<p>We illustrate the application of two linear compression algorithms in python: Principal component analysis (<span class="caps">PCA</span>) and least-squares feature selection. Both can be used to compress a passed array, and they both work by stripping out redundant columns from the array. The two differ in that <span class="caps">PCA</span> operates in a particular …</p><p>We illustrate the application of two linear compression algorithms in python: Principal component analysis (<span class="caps">PCA</span>) and least-squares feature selection. Both can be used to compress a passed array, and they both work by stripping out redundant columns from the array. The two differ in that <span class="caps">PCA</span> operates in a particular rotated frame, while the feature selection solution operates directly on the original columns. As we illustrate below, <span class="caps">PCA</span> always gives a stronger compression. However, the feature selection solution is often comparably strong, and its output has the benefit of being relatively easy to interpret — a virtue that is important for many applications.</p>
<p>We use our python package <code>linselect</code> to carry out efficient feature selection-based compression below — this is available on pypi (<code>pip install linselect</code>) and <a href="https://github.com/EFavDB/linselect">GitHub</a>.</p>
<h2 id="linear-compression-algorithms">Linear compression algorithms</h2>
<p><a href="https://efavdb.com/wp-content/uploads/2018/06/simple_line.jpg"><img alt="simple_line" src="https://efavdb.com/wp-content/uploads/2018/06/simple_line.jpg"></a></p>
<p>To compress a data array having <span class="math">\(n\)</span> columns, linear compression algorithms begin by fitting a <span class="math">\(k\)</span>-dimensional line, or <em>hyperplane</em>, to the data (with <span class="math">\(k < n\)</span>). Any point in the hyperplane can be uniquely identified using a basis of <span class="math">\(k\)</span> components. Marking down each point’s projected location in the hyperplane using these components then gives a <span class="math">\(k\)</span>-column, compressed representation of the data. This idea is illustrated in Fig. 1 at right, where a line is fit to some two-component data. Projecting the points onto the line and then marking down how far along the line each projected point sits, we obtain a one-column compression. Carrying out this process can be useful if storage space is at a premium or if any operations need to be applied to the array (usually operations will run much faster on the compressed format). Further, compressed data is often easier to interpret and visualize, thanks to its reduced dimension.</p>
<p>In this post, we consider two automated linear compression algorithms: principal component analysis (<span class="caps">PCA</span>) and least-squares unsupervised feature selection. These differ because they are obtained from different hyperplane fitting strategies: The <span class="caps">PCA</span> approach is obtained from the <span class="math">\(k\)</span>-dimensional hyperplane fit that minimizes the data’s total squared-projection error. In general, the independent variables of this fit — i.e., the <span class="math">\(k\)</span> components specifying locations in the fit plane — end up being some linear combinations of the original <span class="math">\(x_i\)</span><span class="quo">‘</span>s. In contrast, the feature selection strategy intelligently picks a subset of the original array columns as predictors and then applies the usual least-squares fit to the others for compression [1]. These approaches are illustrated in the left and right panels of Fig. 2 below. The two fit lines there look very similar, but the encodings returned by these strategies differ qualitatively: The 1-d compression returned by <span class="caps">PCA</span> is how far along the <span class="math">\(PCA_1\)</span> direction a point sits (this is some linear combination of <span class="math">\(x_1\)</span> and <span class="math">\(x_2\)</span> — see figure), while the feature selection solution simply returns each point’s <span class="math">\(x_1\)</span> value. One of our goals here is to explain why this difference can favor the feature selection approach in certain applications.</p>
<p>Our post proceeds as follows: In the next section, we consider two representative applications in python: (1) The compression of a data set of tech-sector stock price quotes, and (2) the visualization of some economic summary statistics on the G20 nations. Working through these applications, we are able to familiarize ourselves with the output of the two algorithms, and also through contrast to highlight their relative virtues. The discussion section summarizes what we learn. Finally, a short appendix covers some of the formal mathematics of compression. There, we prove that linear compression-decompression operators are always projections.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2018/06/pca_vs_linselect.jpg"><img alt="pca_vs_linselect" src="https://efavdb.com/wp-content/uploads/2018/06/pca_vs_linselect.jpg"></a>
<strong>Fig. 2</strong>. A cartoon illustrating the projection that results when applying <span class="caps">PCA</span> (left) and unsupervised feature selection — via <code>linselect</code> (right): The original 2-d big dots are replaced by their small dot, effectively-1-d approximations — a projection.</p>
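<p>The appendix’s claim that a linear compress-then-decompress map is a projection can be verified numerically. Here is a minimal sketch using synthetic data (not the stock data below), building the rank-2 <span class="caps">PCA</span> projection by hand:</p>

```python
import numpy as np

# A compress-then-decompress map P should satisfy P @ P == P (a projection).
rng = np.random.RandomState(0)
X = rng.randn(100, 5)                # synthetic 5-column data array
X_c = X - X.mean(axis=0)             # center the columns, as PCA does

_, _, Vt = np.linalg.svd(X_c, full_matrices=False)
V = Vt[:2].T                         # 5 x 2 orthonormal basis of the fit plane
P = V @ V.T                          # compress (V.T) then decompress (V)

assert np.allclose(P @ P, P)         # idempotent: a projection
assert np.allclose(P, P.T)           # PCA's projection is also orthogonal
```

The feature-selection compression yields a projection as well, though in general an oblique rather than orthogonal one.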
<h2 id="applications">Applications</h2>
<p>Both data sets explored below are available on our Github, <a href="https://github.com/EFavDB/linselect_demos">here</a>.</p>
<h3 id="stock-prices">Stock prices</h3>
<h4 id="loading-and-compressing-the-data">Loading and compressing the data</h4>
<p>In this section, we apply our algorithms to a prepared data set of one year’s worth of daily percentage price lifts on 50 individual tech stocks [2]. We expect these stocks to each be governed by a common set of market forces, motivating the idea that a substantial compression might be possible. This is true, and the compressed arrays that result may be more efficiently operated on, as noted above. In addition, we’ll see below that we can learn something about the full data set by examining the compression outputs.</p>
<p>The code below loads our data, smooths it over a running 30 day window (to remove idiosyncratic noise that is not of much interest), prints out the first three rows, compresses the data using our two methods, and then finally prints out the first five <span class="caps">PCA</span> components and the top five selected stocks.</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">PCA</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span>
<span class="kn">from</span> <span class="nn">linselect</span> <span class="kn">import</span> <span class="n">FwdSelect</span>
<span class="c1"># CONSTANTS</span>
<span class="n">KEEP</span> <span class="o">=</span> <span class="mi">5</span> <span class="c1"># compression dimension</span>
<span class="n">WINDOW_SIZE</span> <span class="o">=</span> <span class="mi">30</span> <span class="c1"># smoothing window size</span>
<span class="c1"># LOAD AND SMOOTH THE DATA</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'stocks.csv'</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">rolling</span><span class="p">(</span><span class="n">WINDOW_SIZE</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">WINDOW_SIZE</span><span class="p">:]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:</span><span class="mi">3</span><span class="p">])</span>
<span class="n">TICKERS</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span><span class="o">.</span><span class="n">columns</span><span class="o">.</span><span class="n">values</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span><span class="o">.</span><span class="n">values</span>
<span class="c1"># PCA COMPRESSION</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">StandardScaler</span><span class="p">()</span>
<span class="n">pca</span> <span class="o">=</span> <span class="n">PCA</span><span class="p">(</span><span class="n">n_components</span><span class="o">=</span><span class="n">KEEP</span><span class="p">)</span>
<span class="n">pca</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">s</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X</span><span class="p">))</span>
<span class="n">X_compressed_pca</span> <span class="o">=</span> <span class="n">pca</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">s</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X</span><span class="p">))</span>
<span class="c1"># FEATURE SELECTION COMPRESSION</span>
<span class="n">selector</span> <span class="o">=</span> <span class="n">FwdSelect</span><span class="p">()</span>
<span class="n">selector</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">X_compressed_linselect</span> <span class="o">=</span> <span class="n">X</span><span class="p">[:,</span> <span class="n">selector</span><span class="o">.</span><span class="n">ordered_features</span><span class="p">[:</span><span class="n">KEEP</span><span class="p">]]</span>
<span class="c1"># PRINT OUT FIRST FIVE PCA COMPONENTs, TOP FIVE STOCKS</span>
<span class="nb">print</span><span class="p">(</span><span class="n">pca</span><span class="o">.</span><span class="n">components_</span><span class="p">[:</span><span class="n">KEEP</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">TICKERS</span><span class="p">[</span><span class="n">selector</span><span class="o">.</span><span class="n">ordered_features</span><span class="p">][:</span><span class="n">KEEP</span><span class="p">])</span>
</pre></div>
<p>The output of the above print statements:</p>
<div class="highlight"><pre><span></span><span class="c1"># The first three rows of the data frame:</span>
<span class="n">date</span> <span class="n">AAPL</span> <span class="n">ADBE</span> <span class="n">ADP</span> <span class="n">ADSK</span> <span class="n">AMAT</span> <span class="n">AMZN</span> \
<span class="mi">30</span> <span class="mi">2017</span><span class="o">-</span><span class="mi">05</span><span class="o">-</span><span class="mi">31</span> <span class="mf">0.002821</span> <span class="mf">0.002994</span> <span class="mf">0.000248</span> <span class="mf">0.009001</span> <span class="mf">0.006451</span> <span class="mf">0.003237</span>
<span class="mi">31</span> <span class="mi">2017</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">01</span> <span class="mf">0.003035</span> <span class="mf">0.002776</span> <span class="mf">0.000522</span> <span class="mf">0.008790</span> <span class="mf">0.005487</span> <span class="mf">0.003450</span>
<span class="mi">32</span> <span class="mi">2017</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">02</span> <span class="mf">0.003112</span> <span class="mf">0.002964</span> <span class="o">-</span><span class="mf">0.000560</span> <span class="mf">0.008573</span> <span class="mf">0.005523</span> <span class="mf">0.003705</span>
<span class="n">ASML</span> <span class="n">ATVI</span> <span class="n">AVGO</span> <span class="o">...</span> <span class="n">T</span> <span class="n">TSLA</span> <span class="n">TSM</span> \
<span class="mi">30</span> <span class="mf">0.000755</span> <span class="mf">0.005933</span> <span class="mf">0.003988</span> <span class="o">...</span> <span class="o">-</span><span class="mf">0.001419</span> <span class="mf">0.004500</span> <span class="mf">0.003590</span>
<span class="mi">31</span> <span class="mf">0.002174</span> <span class="mf">0.006369</span> <span class="mf">0.003225</span> <span class="o">...</span> <span class="o">-</span><span class="mf">0.001125</span> <span class="mf">0.003852</span> <span class="mf">0.004279</span>
<span class="mi">32</span> <span class="mf">0.001566</span> <span class="mf">0.006014</span> <span class="mf">0.005343</span> <span class="o">...</span> <span class="o">-</span><span class="mf">0.001216</span> <span class="mf">0.004130</span> <span class="mf">0.004358</span>
<span class="n">TWTR</span> <span class="n">TXN</span> <span class="n">VMW</span> <span class="n">VZ</span> <span class="n">WDAY</span> <span class="n">WDC</span> <span class="n">ZNGA</span>
<span class="mi">30</span> <span class="mf">0.008292</span> <span class="mf">0.001467</span> <span class="mf">0.001984</span> <span class="o">-</span><span class="mf">0.001741</span> <span class="mf">0.006103</span> <span class="mf">0.002916</span> <span class="mf">0.007811</span>
<span class="mi">31</span> <span class="mf">0.008443</span> <span class="mf">0.001164</span> <span class="mf">0.002026</span> <span class="o">-</span><span class="mf">0.001644</span> <span class="mf">0.006303</span> <span class="mf">0.003510</span> <span class="mf">0.008379</span>
<span class="mi">32</span> <span class="mf">0.007796</span> <span class="mf">0.000637</span> <span class="mf">0.001310</span> <span class="o">-</span><span class="mf">0.001333</span> <span class="mf">0.006721</span> <span class="mf">0.002836</span> <span class="mf">0.008844</span>
<span class="c1"># PCA top components:</span>
<span class="p">[[</span> <span class="mf">0.10548148</span><span class="p">,</span> <span class="mf">0.20601986</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.0126039</span> <span class="p">,</span> <span class="mf">0.20139121</span><span class="p">,</span> <span class="o">...</span><span class="p">],</span>
<span class="p">[</span><span class="o">-</span><span class="mf">0.11739195</span><span class="p">,</span> <span class="mf">0.02536787</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.2044143</span> <span class="p">,</span> <span class="mf">0.08462741</span><span class="p">,</span> <span class="o">...</span><span class="p">],</span>
<span class="p">[</span> <span class="mf">0.03251305</span><span class="p">,</span> <span class="mf">0.10796197</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.00463919</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.17564998</span><span class="p">,</span> <span class="o">...</span><span class="p">],</span>
<span class="p">[</span> <span class="mf">0.08678107</span><span class="p">,</span> <span class="mf">0.1931497</span> <span class="p">,</span> <span class="o">-</span><span class="mf">0.16850867</span><span class="p">,</span> <span class="mf">0.16260134</span><span class="p">,</span> <span class="o">...</span><span class="p">],</span>
<span class="p">[</span><span class="o">-</span><span class="mf">0.0174396</span> <span class="p">,</span> <span class="mf">0.01174769</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.11617622</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.01036602</span><span class="p">,</span> <span class="o">...</span><span class="p">]]</span>
<span class="c1"># Feature selector output:</span>
<span class="p">[</span><span class="s1">'WDAY'</span><span class="p">,</span> <span class="s1">'PYPL'</span><span class="p">,</span> <span class="s1">'AMZN'</span><span class="p">,</span> <span class="s1">'LRCX'</span><span class="p">,</span> <span class="s1">'HPQ'</span><span class="p">]</span>
</pre></div>
<p>Lines 22 and 27 in the first code block above are the two compressed versions of the original data array, line 16. For each row, the first compression stores the amplitude of that date’s stock changes along each of the first five <span class="caps">PCA</span> components (printed below line 17 of the second code block), while the second compression is simply equal to the five columns of the original array corresponding to the stocks picked out by the selector (printed below line 24 of the second code block).</p>
<h4 id="exploring-the-encodings">Exploring the encodings</h4>
<p>Working with the compressed arrays obtained above provides some immediate operational benefits: Manipulations of the compressed arrays can be carried out more quickly and they require less memory for storage. Here, we review how valuable insight can also be obtained from our compressions — via study of the compression components.</p>
<p>First, we consider the <span class="caps">PCA</span> components. It turns out that these components are the eigenvectors of the correlation matrix of our data set (<span class="math">\(X^T \cdot X\)</span>) — that is, they are the collective fluctuation modes present in the data set (those who have studied classical mechanics can imagine the system as one where the different stocks are masses that are connected by springs, and these eigenvectors are the modes of the system). Using this fact, one can show that the components evolve in an uncorrelated manner. Further, one can show that projecting the data set down onto the top <span class="math">\(k\)</span> modes gives the minimum squared projection error of all possible <span class="math">\(k\)</span>-component projections. The first component then describes the largest-amplitude fluctuation pattern exhibited in the data. From line 18 above, this is <span class="math">\([ 0.105, 0.206, -0.012, 0.201, ... ]\)</span>. These coefficients tell us that when the first stock (<span class="caps">AAPL</span>) goes up by some amount, the second (<span class="caps">ADBE</span>) typically goes up by about twice as much (this follows from the fact that 0.206 is about twice as big as 0.105), etc. This isn’t the full story of course, because each day’s movements are a superposition (sum) of the amplitudes along each of the <span class="caps">PCA</span> components. Including more of these components in a compression allows one to capture more of the detailed correlation patterns exhibited in the data. However, each additional <span class="caps">PCA</span> component provides progressively less value as one moves down the ranking — it is this fact that allows a good compression to be obtained using only a minority of these modes.</p>
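<p>The eigenvector claim is easy to check numerically. The following sketch (using a random, hypothetical data array rather than the stock data) confirms that sklearn’s <span class="caps">PCA</span> components coincide, up to sign, with the eigenvectors of the centered data’s <span class="math">\(X^T \cdot X\)</span>:</p>

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
X -= X.mean(axis=0)  # center, so X^T X is proportional to the covariance matrix

pca = PCA().fit(X)

# eigen-decomposition of X^T X, reordered by decreasing eigenvalue
evals, evecs = np.linalg.eigh(X.T @ X)
evecs = evecs[:, np.argsort(evals)[::-1]]

# each PCA component matches the corresponding eigenvector, up to overall sign
for k in range(5):
    assert np.isclose(abs(np.dot(pca.components_[k], evecs[:, k])), 1.0)
```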
<p>Whereas the <span class="caps">PCA</span> components directly encode the collective, correlated fluctuations exhibited in our data, the feature selection solution attempts to identify a minimally-redundant subset of the original array’s columns — one that is representative of the full set. This strategy is best understood in the limit where the original columns fall into a set of discrete clusters (in our example, we might expect the businesses operating in a particular sub-sector to fall into a single cluster). In such cases, a good compression is obtained by selecting one representative column from each cluster: Once the representatives are selected, each of the other members of a given cluster can be approximately reconstructed using its selected representative as a predictor. In the above, we see that our automated feature selector has worked well, in that the companies selected (‘<span class="caps">WDAY</span>’, ‘<span class="caps">PYPL</span>’, ‘<span class="caps">AMZN</span>’, ‘<span class="caps">LRCX</span>’, and ‘<span class="caps">HPQ</span>’) each operate in a different part of the tech landscape [3]. In general, we can expect the feature selector to attempt to mimic the <span class="caps">PCA</span> approach, in that it will seek columns that fluctuate in a nearly orthogonal manner. However, whereas the <span class="caps">PCA</span> components highlight which columns fluctuate together, the feature selector attempts to throw out all but one of the columns that fluctuate together — a sort-of dual approach.</p>
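<p>This clustering picture can be illustrated with a toy, numpy-only version of forward selection (synthetic data and a simplified greedy search, not <code>linselect</code> itself): given two clusters of strongly-correlated columns, the first two columns selected come from different clusters.</p>

```python
import numpy as np

rng = np.random.RandomState(1)
n = 500
a, b = rng.randn(n), rng.randn(n)  # two independent "cluster drivers"
# columns 0-2 track driver a, columns 3-5 track driver b (plus small noise)
X = np.column_stack([a + 0.1 * rng.randn(n) for _ in range(3)] +
                    [b + 0.1 * rng.randn(n) for _ in range(3)])

def r2_of_subset(X, cols):
    """Fraction of total variance of X explained by least-squares fits on X[:, cols]."""
    Z = X[:, cols]
    coef, *_ = np.linalg.lstsq(Z, X, rcond=None)
    resid = X - Z @ coef
    return 1 - resid.var(axis=0).sum() / X.var(axis=0).sum()

# greedy forward selection of two columns
selected = []
for _ in range(2):
    best = max(set(range(6)) - set(selected),
               key=lambda j: r2_of_subset(X, selected + [j]))
    selected.append(best)

# one representative per cluster: one column from {0,1,2}, one from {3,4,5}
assert selected[0] // 3 != selected[1] // 3
```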
<h4 id="compression-strength">Compression strength</h4>
<p>To decide how many compression components are needed for a given application, one need only consider the variance explained as a function of the compression dimension — this is equal to one minus the average squared error of the projections that result from the compressions (see footnote [4] for a visualization of the error that results from compression here). In the two python packages we’re using, one can access these values as follows:</p>
<div class="highlight"><pre><span></span><span class="o">>></span> <span class="nb">print</span> <span class="n">np</span><span class="o">.</span><span class="n">cumsum</span><span class="p">(</span><span class="n">pca</span><span class="o">.</span><span class="n">explained_variance_ratio_</span><span class="p">)</span>
<span class="p">[</span> <span class="mf">0.223</span> <span class="mf">0.367</span> <span class="mf">0.493</span> <span class="mf">0.598</span> <span class="mf">0.696</span><span class="p">]</span>
<span class="o">>></span> <span class="nb">print</span> <span class="p">[</span><span class="n">var</span> <span class="o">/</span> <span class="mf">50.0</span> <span class="k">for</span> <span class="n">var</span> <span class="ow">in</span> <span class="n">selector</span><span class="o">.</span><span class="n">ordered_cods</span><span class="p">[:</span><span class="n">KEEP</span><span class="p">]]</span>
<span class="p">[</span> <span class="mf">0.169</span> <span class="mf">0.316</span> <span class="mf">0.428</span> <span class="mf">0.530</span> <span class="mf">0.612</span><span class="p">]</span>
</pre></div>
<p>The printed lines above show that both algorithms capture more than <span class="math">\(50\%\)</span> of the variance exhibited in the data using only 4 components out of a possible 50 (for the feature selector, these components are 4 of the 50 stocks themselves). The <span class="caps">PCA</span> compressions are stronger in each dimension because <span class="caps">PCA</span> is unconstrained — it can use any linear combination of the initial features for compression components, whereas the feature selector is constrained to use a subset of the original features.</p>
<p>A plot of the values above across all compression dimensions is shown in Fig. 3 below. Looking at this plot, we see an elbow somewhere between <span class="math">\(5\)</span> and <span class="math">\(10\)</span> retained components. This implies that our <span class="math">\(50\)</span>-dimensional data set mostly lies within a subspace of dimension <span class="math">\(k \in (5, 10)\)</span>. Using any <span class="math">\(k\)</span> in that interval will provide a decent compression, and a satisfyingly large dimensional reduction — a typical result of applying these algorithms to large, raw data sets. Again, this is useful because it allows one to stop tracking redundant columns that offer little incremental value.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2018/06/cod_stocks.png"><img alt="cod_stocks" src="https://efavdb.com/wp-content/uploads/2018/06/cod_stocks.png"></a> <strong>Fig. 3</strong>. Plots of the compression strength (coefficient of determination or <span class="math">\(r^2\)</span>) for our two compression algorithms versus compression dimension. We see two things: (1) <span class="caps">PCA</span> gives a slightly stronger compression at each dimension, and (2) The full data set spans 50 dimensions, but the elbow in the plots suggests the data largely sits in a subspace having dimension between 5 to 10.</p>
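<p>In practice, a target dimension can be read off the cumulative variance-explained curve programmatically. A minimal sketch, using a made-up curve shaped like that in Fig. 3:</p>

```python
import numpy as np

# hypothetical cumulative variance-explained values, one per compression dimension
cum_var = np.array([0.22, 0.37, 0.49, 0.60, 0.70, 0.76, 0.80, 0.83, 0.85, 0.87])

# smallest dimension whose cumulative variance explained reaches 75%
k = int(np.argmax(cum_var >= 0.75)) + 1
print(k)  # 6
```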
<h3 id="g20-economic-summary-stats">G20 economic summary stats</h3>
<h4 id="loading-and-compressing-the-data_1">Loading and compressing the data</h4>
<p>In this section, we explore economic summary statistics on the 19 individual countries belonging to the G20 [5]. We scraped this data from data.un.org — for example, the link used for the United States can be found <a href="http://data.un.org/en/iso/us.html">here</a>. Our aim here will be to illustrate how compression algorithms can be used to aid in the visualization of a data set: Plotting the rows of a data set allows one to quickly get a sense for the relationship between them (here, the different G20 countries). Because we cannot plot in more than two or three dimensions, compression is a necessary first step in this process.</p>
<p>A sample row from our data set is given below — the values for Argentina.</p>
<div class="highlight"><pre><span></span>GDP growth rate(annual %, const. 2005 prices) 2.40
GDP per capita(current US$) 14564
Economy: Agriculture(% of GVA) 6
Economy: Industry(% of GVA) 27.8
Economy: Services and other activity(% of GVA) 66.2
Employment: Agriculture(% of employed) 2
Employment: Industry(% of employed) 24.8
Employment: Services(% of employed) 73.1
Unemployment(% of labour force) 6.5
CPI: Consumer Price Index(2000=100) 332
Agricultural production index(2004-2006=100) 119
Food production index(2004-2006=100) 119
International trade: Exports(million US$) / GPV 0.091
International trade: Imports(million US$) / GPV 0.088
Balance of payments, current account / GPV -0.025
Labour force participation(female pop. %) 48.6
Labour force participation(male pop. %) 74.4
</pre></div>
<p>Comparing each of the 19 countries across these 17 fields would be a complicated task. However, by considering a plot like Fig. 3 for this data set, we learned that many of these fields are highly correlated (plot not shown). This means that we can indeed get a reasonable, approximate understanding of the relationship between these economies by compressing down to two dimensions and plotting the result. The code to obtain these compressions follows:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">linselect</span> <span class="kn">import</span> <span class="n">FwdSelect</span>
<span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">PCA</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="c1"># LOADING THE DATA</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'g20.csv'</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">values</span>
<span class="n">countries</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">values</span>
<span class="c1"># FEATURE SELECTION</span>
<span class="n">selector</span> <span class="o">=</span> <span class="n">FwdSelect</span><span class="p">()</span>
<span class="n">selector</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">x1</span><span class="p">,</span> <span class="n">y1</span> <span class="o">=</span> <span class="n">X</span><span class="p">[:,</span> <span class="n">selector</span><span class="o">.</span><span class="n">ordered_features</span><span class="p">[:</span><span class="mi">2</span><span class="p">]]</span><span class="o">.</span><span class="n">T</span>
<span class="c1"># PRINCIPAL COMPONENT ANALYSIS</span>
<span class="n">pca</span> <span class="o">=</span> <span class="n">PCA</span><span class="p">()</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">StandardScaler</span><span class="p">()</span>
<span class="n">x2</span><span class="p">,</span> <span class="n">y2</span> <span class="o">=</span> <span class="n">pca</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">s</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X</span><span class="p">))</span><span class="o">.</span><span class="n">T</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span>
</pre></div>
<p>The plots of the <span class="math">\((x_1, y_1)\)</span> and <span class="math">\((x_2, y_2)\)</span> compressions obtained above are given in Fig. 4.</p>
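<p>For reference, plots like those of Fig. 4 can be produced with a few lines of matplotlib. The sketch below uses stand-in arrays; in the post itself, <code>(x1, y1)</code>, <code>(x2, y2)</code>, and <code>countries</code> come from the snippet above.</p>

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, for scripted use
import matplotlib.pyplot as plt
import numpy as np

# stand-in data; replace with the outputs of the compression snippet above
rng = np.random.RandomState(0)
countries = ['ar', 'au', 'br', 'ca', 'cn']
x1, y1 = rng.randn(5), rng.randn(5)
x2, y2 = rng.randn(5), rng.randn(5)

fig, (ax_sel, ax_pca) = plt.subplots(2, 1, figsize=(6, 10))
for ax, (x, y), title in [(ax_sel, (x1, y1), 'feature selection'),
                          (ax_pca, (x2, y2), 'PCA')]:
    ax.scatter(x, y)
    for xi, yi, label in zip(x, y, countries):
        ax.annotate(label, (xi, yi))  # tag each point with its country code
    ax.set_title(title)
fig.savefig('g20_compressions.png')
```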
<h4 id="visualizing-and-interpreting-the-compressed-data">Visualizing and interpreting the compressed data</h4>
<p>The first thing to note about Fig. 4 is that the geometries of the upper (feature selection) and lower (<span class="caps">PCA</span>) plots are very similar — the neighbors of each country are the same in the two plots. As we know from our discussion above, the first two <span class="caps">PCA</span> components must give a stronger compressed representation of the data than is obtained from the feature selection solution. However, given that similar country relationships are suggested by the two plots, the upper, feature selection view might be preferred. <em>This is because its axes retain their original meaning and are relatively easy to interpret</em>: The y-axis is a measure of the relative scale of international trade within each of the individual economies and the x-axis is a measure of the internal makeup of the economies.</p>
<p>Examining the upper, feature selection plot of Fig. 4, a number of interesting insights can be found. One timely observation: International trade exports are a lower percentage of <span class="caps">GDP</span> for the <span class="caps">US</span> than for any other country considered (for imports, it is third, just after Argentina and Brazil). This observation might be related to the <span class="caps">US</span> administration’s recent willingness to exchange tariff increases with other countries. Nations in the same quadrant include Great Britain (gb), Japan (jp), and Australia (au) — each a relatively industrialized and geographically isolated nation. In the opposite limits, we have Germany (de) and India (in). The former is relatively industrial and not isolated, while the latter’s economy weights agriculture relatively highly.</p>
<h4 id="summary">Summary</h4>
<p>In this section, we illustrated a general analysis method that allows one to quickly gain insight into a data set: Visual study of the compressed data via a plot. Using this approach, we first found here that the G20 nations are best differentiated economically by considering how important international trade is to their economies and also the makeup of their economies (agricultural or other) — i.e., these are the two features that best explain the full data set of 17 columns that we started with. Plotting the data across these two variables and considering the commonalities of neighboring countries, we were able to identify some natural hypotheses about the factors influencing the individual economies. Specifically, geography appears to inform at least one of their key characteristics: more isolated countries often trade less. This is an interesting insight, and one that is quickly arrived at through the compression / plotting strategy.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2018/06/pca_linselect_g20.jpg"><img alt="pca_linselect_g20" src="https://efavdb.com/wp-content/uploads/2018/06/pca_linselect_g20.jpg"></a>
<strong>Fig. 4</strong>. Plots of the compressed economic summary statistics on the G20 nations, taken from data.un.org: <code>linselect</code> unsupervised feature selection (upper) and <span class="caps">PCA</span> (lower).</p>
<h2 id="discussion">Discussion</h2>
<p>In this post, we have seen that carrying out compressions on a data set can provide insight into the original data. By examining the <span class="caps">PCA</span> components, we gain access to the collective fluctuations present within the data. The feature selection solution returns a minimal subset of the original features that captures the broad stroke information contained in the original full set — in cases where clusters are present, the minimal set contains a representative from each. Both methods allow one to determine the effective dimension of a given data set — when applied to raw data sets, this is often much lower than the apparent dimension due to heavy redundancy.</p>
<p>In general, compressing a data set down into lower dimensions will make the data easier to interpret. We saw this in the second, G20 economic example above, where the original feature set contained many columns. Compressing this down into two dimensions quickly gave us a sense of the relationships between the different economies. The <span class="caps">PCA</span> and feature selection solutions gave similar plots there, but the feature selection solution had the extra benefit of providing easily interpreted axes.</p>
<p>When one’s goal is to use compression for operational efficiency gains, the appropriate dimension can be identified by plotting the variance explained versus compression dimension. Because <span class="caps">PCA</span> is unconstrained, it will give a stronger compression at any dimension. However, the feature selection approach has its own operational advantages: Once a representative subset of features has been identified, one can often simply stop tracking the others. Doing this can result in a huge cost savings for large data pipelines. A similar savings is not possible for <span class="caps">PCA</span>, because evaluation of the <span class="caps">PCA</span> components requires one to first evaluate each of the original feature / column values for a given data point. A similar consideration is also important in some applications: For example, when developing a stock portfolio, transaction costs may make it prohibitively expensive to purchase all of the stocks present in a given sector. By purchasing only a representative subset, a minimal portfolio can be constructed without incurring a substantial transaction cost burden.</p>
<p>In summary, the two compression methods we have considered here are very similar, but subtly different. Appreciating these differences allows one to choose the best approach for a given application.</p>
<h2 id="appendix-compression-as-projection">Appendix: Compression as projection</h2>
<p>We can see that the composite linear compression-decompression operator is a projection operator as follows: If <span class="math">\(X\)</span> is our data array, the general equations describing compression and decompression are,
</p>
<div class="math">\begin{eqnarray}
\label{A1} \tag{A1}
X_{compressed} &=& X \cdot M_{compression} \\
\label{A2} \tag{A2}
X_{approx} &=& X_{compressed} \cdot M_{decompression}.
\end{eqnarray}</div>
<p>
Here, <span class="math">\(M_{compression}\)</span> is an <span class="math">\(n \times k\)</span> matrix and <span class="math">\(M_{decompression}\)</span> is a <span class="math">\(k \times n\)</span> matrix. The squared error of the approximation is,
</p>
<div class="math">\begin{eqnarray}
\Lambda &=& \sum_{i,j} \left (X_{ij} - X_{approx, ij}\right)^2 \\
&=& \sum_j \Vert X_j - X_{compressed} \cdot M_{decompression, j} \Vert^2. \label{A3} \tag{A3}
\end{eqnarray}</div>
<p>
This second line here shows that we can minimize the entire squared error by minimizing each of the column squared errors independently. Further, each of the column-level minimizations is equivalent to a least-squares linear regression problem: We treat the column vector <span class="math">\(M_{decompression, j}\)</span> as an unknown coefficient vector, and attempt to set these so that the squared error of the fit to <span class="math">\(X_j\)</span> — using the columns of <span class="math">\(X_{compressed}\)</span> as features — is minimized. We’ve worked out the least-squares linear fit solution in <a href="http://efavdb.github.io/linear-regression">another post</a> (it’s also a well-known result). Plugging this result in, we get the optimal <span class="math">\(M_{decompression}\)</span>,
</p>
<div class="math">\begin{eqnarray} \label{A4}
M_{decompression}^* &=& \left ( X_{compressed}^T X_{compressed} \right)^{-1} X_{compressed}^T X \tag{A4}
\\
&=& \left ( M_{compression}^T X^T X M_{compression} \right)^{-1} M_{compression}^T X^T X.
\end{eqnarray}</div>
<p>
To obtain the second line here, we have used (\ref{A1}), the definition of <span class="math">\(X_{compressed}\)</span>.</p>
<p>What happens if we try to compress our approximate matrix a second time? Nothing: The matrix product <span class="math">\(M_{compression} M_{decompression}^*\)</span> is a projection operator. That is, it satisfies the condition
</p>
<div class="math">\begin{eqnarray}
(M_{compression} M_{decompression}^*)^2 = M_{compression} M_{decompression}^*. \label{A5} \tag{A5}
\end{eqnarray}</div>
<p>
This result is easy enough to confirm using (\ref{A4}). What (\ref{A5}) means geometrically is that our compression operator projects a point in <span class="math">\(n\)</span>-dimensional space onto a subspace of dimension <span class="math">\(k\)</span>. Once a point sits in this subspace, hitting the point with the composite operator has no effect, as the new point already sits in the projected subspace. This is consistent with our 2-d cartoon depicting the effect of <span class="caps">PCA</span> and <code>linselect</code>, above. However, this is also true for general choices of <span class="math">\(M_{compression}\)</span>, provided we use the optimal <span class="math">\(M_{decompression}\)</span> associated with it.</p>
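<p>The projection property (A5), with the optimal decompression matrix of (A4), can be verified numerically using random matrices:</p>

```python
import numpy as np

rng = np.random.RandomState(0)
n, k = 6, 2
X = rng.randn(50, n)
Mc = rng.randn(n, k)  # an arbitrary compression matrix

G = X.T @ X
# optimal decompression matrix, per equation (A4)
Md = np.linalg.inv(Mc.T @ G @ Mc) @ (Mc.T @ G)

P = Mc @ Md
assert np.allclose(P @ P, P)          # idempotent, i.e. a projection: eq. (A5)
assert np.linalg.matrix_rank(P) == k  # projects onto a k-dimensional subspace
```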
<h2 id="footnotes">Footnotes</h2>
<p>[1] For a discussion on how <span class="caps">PCA</span> selects its <span class="math">\(k\)</span> components, see our prior <a href="http://efavdb.github.io/principal-component-analysis">post</a> on the topic. To identify good feature subsets, <code>linselect</code> uses the stepwise selection strategy. This is described in its <a href="https://github.com/EFavDB/linselect">readme</a>. Here, we simply use the forward selection approach, but <code>linselect</code> supports fairly general stepwise search protocols.</p>
<p>[2] The tickers included are: <span class="caps">AAPL</span>, <span class="caps">ADBE</span>, <span class="caps">ADP</span>, <span class="caps">ADSK</span>, <span class="caps">AMAT</span>, <span class="caps">AMZN</span>, <span class="caps">ASML</span>, <span class="caps">ATVI</span>, <span class="caps">AVGO</span>, <span class="caps">BABA</span>, <span class="caps">BIDU</span>, <span class="caps">CRM</span>, <span class="caps">CSCO</span>, <span class="caps">CTSH</span>, <span class="caps">EA</span>, <span class="caps">FB</span>, <span class="caps">GOOG</span>, <span class="caps">GPRO</span>, <span class="caps">HPE</span>, <span class="caps">HPQ</span>, <span class="caps">IBM</span>, <span class="caps">INFY</span>, <span class="caps">INTC</span>, <span class="caps">INTU</span>, <span class="caps">ITW</span>, <span class="caps">LRCX</span>, <span class="caps">MSFT</span>, <span class="caps">NFLX</span>, <span class="caps">NOK</span>, <span class="caps">NVDA</span>, <span class="caps">NXPI</span>, <span class="caps">OMC</span>, <span class="caps">ORCL</span>, <span class="caps">PANW</span>, <span class="caps">PYPL</span>, <span class="caps">QCOM</span>, <span class="caps">SAP</span>, <span class="caps">SNAP</span>, <span class="caps">SQ</span>, <span class="caps">SYMC</span>, T, <span class="caps">TSLA</span>, <span class="caps">TSM</span>, <span class="caps">TWTR</span>, <span class="caps">TXN</span>, <span class="caps">VMW</span>, <span class="caps">VZ</span>, <span class="caps">WDAY</span>, <span class="caps">WDC</span>, and <span class="caps">ZNGA</span>.</p>
<p>[3] Workday (<span class="caps">WDAY</span>) is a SaaS company that offers a product to businesses, Paypal (<span class="caps">PYPL</span>) is a company that provides payments infrastructure supporting e-commerce, Amazon (<span class="caps">AMZN</span>) is an e-commerce company, Lam Research (<span class="caps">LRCX</span>) makes chip-fabrication equipment, and Hewlett-Packard (<span class="caps">HPQ</span>) makes computers. Each of these is a representative of a different sub-sector.</p>
<p>[4] We can also get a sense of the compression error by plotting the compressed traces for one of the stocks. <a href="https://efavdb.com/wp-content/uploads/2018/06/sq.png"><img alt="sq" src="https://efavdb.com/wp-content/uploads/2018/06/sq.png"></a> The plot at right does this for Square, Inc. The ups and downs of <span class="caps">SQ</span> are largely captured by both methods. However, some finer details are lost in the compressions. Similar accuracy levels are seen for each of the other stocks in the full set (not shown here).</p>
<p>[5] The missing twentieth member of the G20 is the <span class="caps">EU</span>. We don’t consider the <span class="caps">EU</span> here simply because the site we scraped from does not have a page dedicated to it.</p>
linselect demo: a tech sector stock analysis (2018-05-31, Jonathan Landy)
<p>This is a tutorial post relating to our python feature selection package, <code>linselect</code>. The package allows one to easily identify minimal, informative feature subsets within a given data set.</p>
<p>Here, we demonstrate <code>linselect</code><span class="quo">‘</span>s basic <span class="caps">API</span> by exploring the relationship between the daily percentage lifts of 50 tech stocks over one trading year. We will be interested in identifying minimal stock subsets that can be used to predict the lifts of the others.</p>
<p>This is a demonstration walkthrough, with commentary and interpretation throughout. See the package docs folder for docstrings that succinctly detail the <span class="caps">API</span>.</p>
<p>Contents:</p>
<ul>
<li>Load the data and examine some stock traces</li>
<li>FwdSelect, RevSelect; supervised, single target</li>
<li>FwdSelect, RevSelect; supervised, multiple targets</li>
<li>FwdSelect, RevSelect; unsupervised</li>
<li>GenSelect</li>
</ul>
<p>The data and a Jupyter notebook containing the code for this demo are available on our github, <a href="https://github.com/EFavDB/linselect_demos">here</a>.</p>
<p>The <code>linselect</code> package can be found on our github, <a href="https://github.com/efavdb/linselect">here</a>.</p>
<h2 id="1-load-the-data-and-examine-some-stock-traces">1 - Load the data and examine some stock traces</h2>
<p>In this tutorial, we will explore using <code>linselect</code> to carry out various feature selection tasks on a prepared data set of daily percentage lifts for 50 of the largest tech stocks. This covers data from 2017-04-18 to 2018-04-13. In this section, we load the data and take a look at a couple of the stock traces that we will be studying.</p>
<h3 id="load-data">Load data</h3>
<p>The code snippet below loads the data and shows a small sample.</p>
<div class="highlight"><pre><span></span><span class="c1"># load packages </span>
<span class="kn">from</span> <span class="nn">linselect</span> <span class="kn">import</span> <span class="n">FwdSelect</span><span class="p">,</span> <span class="n">RevSelect</span><span class="p">,</span> <span class="n">GenSelect</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="c1"># load the data, print out a sample </span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'stocks.csv'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:</span><span class="mi">3</span><span class="p">,</span> <span class="p">:</span><span class="mi">5</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1"># date AAPL ADBE ADP ADSK </span>
<span class="c1"># 0 2017-04-18 -0.004442 -0.001385 0.000687 0.004884 </span>
<span class="c1"># 1 2017-04-19 -0.003683 0.003158 0.001374 0.017591 </span>
<span class="c1"># 2 2017-04-20 0.012511 0.009215 0.009503 0.005459 </span>
<span class="c1"># (248, 51) </span>
</pre></div>
<p>The last line here shows that there were 248 trading days in the range considered.</p>
<h3 id="plot-some-stock-traces">Plot some stock traces</h3>
<p>The plot below shows Apple’s and Google’s daily lifts on top of each other, over our full date range (the code for the plot can be found in our notebook). Visually, it’s clear that the two are highly correlated — when one goes up or down, the other tends to as well. This suggests that it should be possible to get a good fit to any one of the stocks using the changes in each of the other stocks.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2018/05/apple_google.jpg"><img alt="apple_google" src="https://efavdb.com/wp-content/uploads/2018/05/apple_google.jpg"></a></p>
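<p>To make the visual impression quantitative, one can compute a Pearson correlation between two lift series. The sketch below is self-contained, using synthetic series driven by a shared “market” factor in place of the actual stock columns; with the real data one would simply pass the two <code>df</code> columns to <code>np.corrcoef</code>.</p>

```python
# Sketch: quantifying the co-movement of two daily-lift series with a
# Pearson correlation. Synthetic series stand in for the two stocks here;
# with the real data one would use, e.g., df['AAPL'] and df['GOOGL'].
import numpy as np

rng = np.random.RandomState(0)
market = rng.normal(0, 0.01, size=248)            # shared "market" factor
stock_a = market + rng.normal(0, 0.004, size=248) # stock-specific noise
stock_b = market + rng.normal(0, 0.004, size=248)

corr = np.corrcoef(stock_a, stock_b)[0, 1]
print(round(corr, 2))  # large, since both track the shared factor
```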
<p>In general, a stock’s daily price change should be a function of the market at large, the behavior of its market segment(s) and sub-segment(s), and some idiosyncratic behavior special to the company in question. Given this intuition, it seems reasonable to expect that a given stock can be fit well using the lifts of just a small subset of the other stocks — stocks representative of the sectors relevant to the stock in question. Adding multiple stocks from each segment shouldn’t provide much additional value, since these should be redundant. We’ll confirm this intuition below and use <code>linselect</code> to identify these optimal subsets.</p>
<p><strong>Lesson</strong>: The fluctuations of related stocks are often highly correlated. Below, we will be using <code>linselect</code> to find minimal subsets of the 50 stocks that we can use to develop good linear fits to one, multiple, or all of the others.</p>
<h2 id="2-fwdselect-and-revselect-supervised-single-target">2 - FwdSelect and RevSelect; supervised, single target</h2>
<p>Goal: Demonstrate how to identify subsets of the stocks that can be used to fit a given target stock well.</p>
<ul>
<li>First we carry out a <code>FwdSelect</code> fit to identify good choices.</li>
<li>Next, we compare the <code>FwdSelect</code> and <code>RevSelect</code> results.</li>
</ul>
<h3 id="forward-selection-applied-to-aapl">Forward selection applied to <span class="caps">AAPL</span></h3>
<p>The code snippet below uses our forward selection class, <code>FwdSelect</code>, to seek the best feature subsets for fitting <span class="caps">AAPL</span>’s performance.</p>
<div class="highlight"><pre><span></span><span class="c1"># Define X, y variables </span>
<span class="k">def</span> <span class="nf">get_feature_tickers</span><span class="p">(</span><span class="n">targets</span><span class="p">):</span>
<span class="n">all_tickers</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span><span class="o">.</span><span class="n">columns</span>
<span class="k">return</span> <span class="nb">list</span><span class="p">(</span><span class="n">c</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">all_tickers</span> <span class="k">if</span> <span class="n">c</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">targets</span><span class="p">)</span>
<span class="n">TARGET_TICKERS</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'AAPL'</span><span class="p">]</span>
<span class="n">FEATURE_TICKERS</span> <span class="o">=</span> <span class="n">get_feature_tickers</span><span class="p">(</span><span class="n">TARGET_TICKERS</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">FEATURE_TICKERS</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">TARGET_TICKERS</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="c1"># Forward step-wise selection </span>
<span class="n">selector</span> <span class="o">=</span> <span class="n">FwdSelect</span><span class="p">()</span>
<span class="n">selector</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="c1"># Print out main results of selection process (ordered feature indices, CODs) </span>
<span class="nb">print</span><span class="p">(</span><span class="n">selector</span><span class="o">.</span><span class="n">ordered_features</span><span class="p">[:</span><span class="mi">3</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">selector</span><span class="o">.</span><span class="n">ordered_cods</span><span class="p">[:</span><span class="mi">3</span><span class="p">])</span>
<span class="c1"># [25, 7, 41] </span>
<span class="c1"># [0.43813848, 0.54534304, 0.58577418] </span>
</pre></div>
<p>The last two lines above print out the main outputs of <code>FwdSelect</code>:</p>
<ul>
<li>The <code>ordered_features</code> list provides the indices of the features, ranked by the algorithm. The first index shown provides the best possible single feature fit to <span class="caps">AAPL</span>, the second index provides the next best addition, etc. Note that we can get the tickers corresponding to these indices using:<br>
<code>print([FEATURE_TICKERS[i] for i in selector.ordered_features[:3]])
# ['MSFT', 'AVGO', 'TSM']</code><br>
A little thought plus a Google search rationalizes why these might be the top three predictors for <span class="caps">AAPL</span>: First, Microsoft is probably a good representative of the large-scale tech sector, and second, the latter two companies work closely with Apple. <span class="caps">AVGO</span> (Broadcom) supplies wireless chips for Apple’s devices, while <span class="caps">TSM</span> (Taiwan Semiconductor) makes the processors for iPhones and iPads — and may perhaps soon also provide the CPUs for all Apple computers. Apparently, we can predict <span class="caps">AAPL</span> performance using only a combination of (a) a read on the tech sector at large, plus (b) a bit of idiosyncratic information also present in <span class="caps">AAPL</span>’s partner stocks.</li>
<li>The <code>ordered_cods</code> list records the coefficient of determination (<span class="caps">COD</span> or R^2) of the fits in question — the first number gives the <span class="caps">COD</span> obtained with just <span class="caps">MSFT</span>, the second with <span class="caps">MSFT</span> and <span class="caps">AVGO</span>, etc.</li>
</ul>
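<p>For reference, the <span class="caps">COD</span> tracked by these fits is just the usual R^2 of an ordinary least-squares regression. A self-contained numpy sketch on synthetic data (not the stock set) shows the quantity being computed:</p>

```python
# Sketch: the COD (R^2) of a single-feature least-squares fit, computed
# directly with numpy on synthetic data (not the stock data set).
import numpy as np

rng = np.random.RandomState(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)   # y partly explained by x

# Least-squares fit y ~ a * x + b
A = np.column_stack([x, np.ones_like(x)])
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A.dot(coef)

cod = 1 - resid.var() / y.var()      # fraction of variance explained
print(round(cod, 2))                 # near 0.8 for these variances
```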
<p>A plot of the values in <code>ordered_cods</code> versus feature count is given below. Here, we have labeled the x-axis with the tickers corresponding to the elements of our <code>selector.ordered_features</code>. We see that the top three features almost fit <span class="caps">AAPL</span>’s performance as well as the full set!</p>
<p><a href="https://efavdb.com/wp-content/uploads/2018/05/apple.png"><img alt="apple" src="https://efavdb.com/wp-content/uploads/2018/05/apple.png"></a></p>
<p><strong>Lesson</strong>: We can often use <code>linselect</code> to significantly reduce the dimension of a given feature set, with minimal cost in performance. This can be used to compress a data set and can also improve our understanding of the problem considered.</p>
<p><strong>Lesson</strong>: To get a feel for the effective number of useful features we have at hand, we can plot the output <code>ordered_cods</code> versus feature count.</p>
<h3 id="compare-forward-and-reverse-selection-applied-to-tsla">Compare forward and reverse selection applied to <span class="caps">TSLA</span></h3>
<p>The code snippet below applies both <code>FwdSelect</code> and <code>RevSelect</code> to seek minimal subsets that fit Tesla’s daily lifts well. The resulting <span class="caps">COD</span> curves are plotted below the snippet. These show that <code>FwdSelect</code> performs slightly better here when two or fewer features are included, but that <code>RevSelect</code> finds better subsets after that.</p>
<p><strong>Lesson</strong>: In general, we expect forward selection to work better when looking for small subsets and reverse selection to perform better at large subsets.</p>
<div class="highlight"><pre><span></span><span class="c1"># Define X, y variables </span>
<span class="n">TARGET_TICKERS</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'TSLA'</span><span class="p">]</span>
<span class="n">FEATURE_TICKERS</span> <span class="o">=</span> <span class="n">get_feature_tickers</span><span class="p">(</span><span class="n">TARGET_TICKERS</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">FEATURE_TICKERS</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">TARGET_TICKERS</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="c1"># Forward step-wise selection </span>
<span class="n">selector</span> <span class="o">=</span> <span class="n">FwdSelect</span><span class="p">()</span>
<span class="n">selector</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="c1"># Reverse step-wise selection </span>
<span class="n">selector2</span> <span class="o">=</span> <span class="n">RevSelect</span><span class="p">()</span>
<span class="n">selector2</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</pre></div>
<p><a href="https://efavdb.com/wp-content/uploads/2018/05/rev2.jpg"><img alt="rev2" src="https://efavdb.com/wp-content/uploads/2018/05/rev2.jpg"></a></p>
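<p>For intuition, the greedy forward sweep can be sketched in a few lines of plain numpy. This brute-force version is for illustration only; <code>linselect</code> itself uses a much more efficient stepwise update scheme.</p>

```python
# Sketch: a naive greedy forward selection by R^2, illustrating the kind
# of sweep FwdSelect performs (brute-force refits; for intuition only).
import numpy as np

def cod(X, y):
    """R^2 of the least-squares fit of y on the columns of X (plus intercept)."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A.dot(coef)
    return 1 - resid.var() / y.var()

def forward_select(X, y):
    """Return feature indices in greedy order and the running CODs."""
    chosen, cods, remaining = [], [], list(range(X.shape[1]))
    while remaining:
        # Add whichever remaining feature most improves the fit.
        best = max(remaining, key=lambda j: cod(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
        cods.append(cod(X[:, chosen], y))
    return chosen, cods

rng = np.random.RandomState(2)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] + X[:, 2] + rng.normal(size=300)  # features 0 and 2 informative

order, cods = forward_select(X, y)
print(order[:2])  # the two informative features come out first
```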
<h2 id="3-fwdselect-and-revselect-supervised-multiple-targets">3 - FwdSelect and RevSelect; supervised, multiple targets</h2>
<p>In the code below, we seek feature subsets that perform well when fitting multiple targets simultaneously.</p>
<p><strong>Lesson</strong>: <code>linselect</code> can be used to find minimal feature subsets useful for fitting multiple targets. The optimal, “perfect score” <span class="caps">COD</span> in this case is equal to the number of targets (three in our example).</p>
<div class="highlight"><pre><span></span><span class="c1"># Define X, y variables </span>
<span class="n">TARGET_TICKERS</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'TSLA'</span><span class="p">,</span> <span class="s1">'ADP'</span><span class="p">,</span> <span class="s1">'NFLX'</span><span class="p">]</span>
<span class="n">FEATURE_TICKERS</span> <span class="o">=</span> <span class="n">get_feature_tickers</span><span class="p">(</span><span class="n">TARGET_TICKERS</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">FEATURE_TICKERS</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">TARGET_TICKERS</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="c1"># Forward step-wise selection </span>
<span class="n">selector</span> <span class="o">=</span> <span class="n">FwdSelect</span><span class="p">()</span>
<span class="n">selector</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="c1"># Reverse step-wise selection </span>
<span class="n">selector2</span> <span class="o">=</span> <span class="n">RevSelect</span><span class="p">()</span>
<span class="n">selector2</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</pre></div>
<p><a href="https://efavdb.com/wp-content/uploads/2018/05/multiple.jpg"><img alt="multiple" src="https://efavdb.com/wp-content/uploads/2018/05/multiple.jpg"></a></p>
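<p>The summed-<span class="caps">COD</span> convention is easy to check directly: with targets that are exact linear functions of the features, each contributes an R^2 of 1. A self-contained sketch on synthetic data, using two targets for brevity:</p>

```python
# Sketch: with multiple targets, the reported COD is the sum of per-target
# R^2 values, so a perfect fit scores the number of targets (here, 2).
import numpy as np

rng = np.random.RandomState(3)
X = rng.normal(size=(100, 3))
Y = X.dot(rng.normal(size=(3, 2)))   # 2 targets, exact linear maps of X

coef, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X.dot(coef)
total_cod = sum(1 - resid[:, k].var() / Y[:, k].var() for k in range(2))
print(round(total_cod, 6))  # → 2.0
```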
<h2 id="4-fwdselect-and-revselect-unsupervised">4 - FwdSelect and RevSelect; unsupervised</h2>
<p>Here, we seek those features that provide the best linear representation of the full feature set. This goal is analogous to that addressed by <span class="caps">PCA</span>, but is a feature selection variant: Whereas <span class="caps">PCA</span> returns a set of linear combinations of the original features, the approach here returns a subset of the original features themselves. This has the benefit of leaving one with an interpretable feature subset.</p>
<p>(Note: See [1] for more examples like this. There, I show that if you try to fit smoothed versions of the stock performances, very good, small subsets can be found. Without smoothing, noise obscures this point).</p>
<p><strong>Lesson</strong>: Unsupervised selection seeks to find those features that best describe the full data set — a feature selection analog of <span class="caps">PCA</span>.</p>
<p><strong>Lesson</strong>: Again, a perfect <span class="caps">COD</span> score is equal to the number of targets. In the unsupervised case, this is also the number of features (50 in our example).</p>
<div class="highlight"><pre><span></span><span class="c1"># Set X equal to full data set. </span>
<span class="n">ALL_TICKERS</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span><span class="o">.</span><span class="n">columns</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">ALL_TICKERS</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="c1"># Stepwise regressions </span>
<span class="n">selector</span> <span class="o">=</span> <span class="n">FwdSelect</span><span class="p">()</span>
<span class="n">selector</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">selector2</span> <span class="o">=</span> <span class="n">RevSelect</span><span class="p">()</span>
<span class="n">selector2</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
</pre></div>
<p><a href="https://efavdb.com/wp-content/uploads/2018/05/unsupervised.jpg"><img alt="unsupervised" src="https://efavdb.com/wp-content/uploads/2018/05/unsupervised.jpg"></a></p>
<h2 id="5-genselect">5 - GenSelect</h2>
<p><code>GenSelect</code><span class="quo">‘</span>s <span class="caps">API</span> is designed to expose the full flexibility of the efficient linear stepwise algorithm, and it is therefore somewhat more complex than that of <code>FwdSelect</code> and <code>RevSelect</code>. Here, our aim is to quickly demo this <span class="caps">API</span>.</p>
<p>The essential ingredients:</p>
<ul>
<li>We pass only a single data matrix <code>X</code>, and must specify which columns are the predictors and which are targets.</li>
<li>Because we might sweep up and down, we cannot define an <code>ordered_features</code> list as in <code>FwdSelect</code> and <code>RevSelect</code> (the best subset of size three now may not contain the features in the best subset of size two). Instead, <code>GenSelect</code> maintains a dictionary <code>best_results</code> that stores information on the best results seen so far for each possible feature count. The keys of this dictionary correspond to the possible feature set sizes. The values are also dictionaries, each having two keys: <code>s</code> and <code>cod</code>. These specify the best feature subset seen so far with size equal to the outer key, and the corresponding <span class="caps">COD</span>, respectively.</li>
<li>We can move back and forth, adding features to or removing them from the predictor set. We can specify the search protocol for doing this.</li>
<li>We can reposition our search to any predictor set location and continue the search from there.</li>
<li>We can access the costs of each possible move from our current location, without stepping.</li>
</ul>
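<p>For concreteness, here is a toy illustration of the <code>best_results</code> layout described above, with fabricated values (an actual instance is populated by <code>search</code>):</p>

```python
# Sketch: the shape of GenSelect's best_results dictionary, shown with
# fabricated values. Outer keys are feature-set sizes; 's' flags which
# columns are in the best subset of that size, 'cod' is its score.
import numpy as np

best_results = {
    1: {'s': np.array([False, True, False, False]), 'cod': 0.44},
    2: {'s': np.array([False, True, False, True]),  'cod': 0.55},
}

best_two = best_results[2]
print(int(best_two['s'].sum()), best_two['cod'])  # size matches the key
```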
<p>If an <span class="math">\(m \times n\)</span> data matrix <code>X</code> is passed to <code>GenSelect</code>, three Boolean arrays define the state of the search.</p>
<ul>
<li><code>s</code> — This array specifies which of the columns are currently being used as predictors.</li>
<li><code>targets</code> — This specifies which of the columns are the target variables.</li>
<li><code>mobile</code> — This specifies which of the columns are free to enter or leave the predictor set; columns locked into or out of the fit are marked <code>False</code>.</li>
</ul>
<p>Note: We usually want the targets to not be mobile — though this is not the case in unsupervised applications. One might sometimes also want to lock certain features into the predictor set, and the <code>mobile</code> parameter can be used to accomplish this.</p>
<h3 id="use-genselect-to-carry-out-a-forward-sweep-for-tsla">Use GenSelect to carry out a forward sweep for <span class="caps">TSLA</span></h3>
<p>The code below carries out a single forward sweep for <span class="caps">TSLA</span>. Note that the <code>protocol</code> argument of <code>search</code> is set to <code>(1, 0)</code>, which gives a forward search (see docstrings). For this reason, our results match those of <code>FwdSelect</code> at this point.</p>
<p><strong>Lesson</strong>: Setting up a basic <code>GenSelect</code> call requires defining a few input parameters.</p>
<div class="highlight"><pre><span></span><span class="c1"># Define X </span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">ALL_TICKERS</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="c1"># Define targets and mobile Boolean arrays </span>
<span class="n">TARGET_TICKERS</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'TSLA'</span><span class="p">]</span>
<span class="n">FEATURE_TICKERS</span> <span class="o">=</span> <span class="n">get_feature_tickers</span><span class="p">(</span><span class="n">TARGET_TICKERS</span><span class="p">)</span>
<span class="n">targets</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">in1d</span><span class="p">(</span><span class="n">ALL_TICKERS</span><span class="p">,</span> <span class="n">TARGET_TICKERS</span><span class="p">)</span>
<span class="n">mobile</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">in1d</span><span class="p">(</span><span class="n">ALL_TICKERS</span><span class="p">,</span> <span class="n">FEATURE_TICKERS</span><span class="p">)</span>
<span class="c1"># Set up search with an initial position. Then search. </span>
<span class="n">selector</span> <span class="o">=</span> <span class="n">GenSelect</span><span class="p">()</span>
<span class="n">selector</span><span class="o">.</span><span class="n">position</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">mobile</span><span class="o">=</span><span class="n">mobile</span><span class="p">,</span> <span class="n">targets</span><span class="o">=</span><span class="n">targets</span><span class="p">)</span>
<span class="n">selector</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">protocol</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="n">steps</span><span class="o">=</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="c1"># Review best 3 feature set found </span>
<span class="nb">print</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">ALL_TICKERS</span><span class="p">)[</span><span class="n">selector</span><span class="o">.</span><span class="n">best_results</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="s1">'s'</span><span class="p">]],</span> <span class="n">selector</span><span class="o">.</span><span class="n">best_results</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="s1">'cod'</span><span class="p">])</span>
<span class="c1"># ['ATVI' 'AVGO' 'CTSH'] 0.225758 </span>
</pre></div>
<h3 id="continue-the-search-above">Continue the search above</h3>
<p>A <code>GenSelect</code> instance always retains a summary of the best results it has seen so far. This means that we can continue a search where we left off after a <code>search</code> call completes. Below, we reposition our search and sweep back and forth to better explore a particular region. Note that this slightly improves our result.</p>
<p><strong>Lesson</strong>: We can carry out general search protocols using <code>GenSelect</code><span class="quo">‘</span>s <code>position</code> and <code>search</code> methods.</p>
<div class="highlight"><pre><span></span><span class="c1"># Reposition back to the best fit of size 3 seen above. </span>
<span class="n">s</span> <span class="o">=</span> <span class="n">selector</span><span class="o">.</span><span class="n">best_results</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="s1">'s'</span><span class="p">]</span>
<span class="n">selector</span><span class="o">.</span><span class="n">position</span><span class="p">(</span><span class="n">s</span><span class="o">=</span><span class="n">s</span><span class="p">)</span>
<span class="c1"># Now sweep back and forth around there a few times. </span>
<span class="n">STEPS</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">SWEEPS</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">selector</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">protocol</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">steps</span><span class="o">=</span><span class="n">STEPS</span><span class="p">)</span>
<span class="n">selector</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">protocol</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">STEPS</span><span class="p">,</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">STEPS</span><span class="p">),</span> <span class="n">steps</span><span class="o">=</span><span class="n">SWEEPS</span> <span class="o">*</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">STEPS</span><span class="p">)</span>
<span class="c1"># Review the best size-3 feature set found now (slightly improved relative to the first pass above) </span>
<span class="nb">print</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">ALL_TICKERS</span><span class="p">)[</span><span class="n">selector</span><span class="o">.</span><span class="n">best_results</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="s1">'s'</span><span class="p">]],</span> <span class="n">selector</span><span class="o">.</span><span class="n">best_results</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="s1">'cod'</span><span class="p">])</span>
<span class="c1"># ['AMZN' 'NVDA' 'ZNGA'] 0.229958 </span>
</pre></div>
<h3 id="compare-to-forward-and-reverse-search-results">Compare to forward and reverse search results</h3>
<p>Below, we compare the <span class="caps">COD</span> curves obtained from our three selector classes.</p>
<p><strong>Lesson</strong>: <code>GenSelect</code> can be used to do a more thorough search than <code>FwdSelect</code> and <code>RevSelect</code>, and so can sometimes find better feature subsets.</p>
<div class="highlight"><pre><span></span><span class="c1"># Get the best COD values seen for each feature set size from GenSelect search </span>
<span class="n">gen_select_cods</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]):</span>
<span class="k">if</span> <span class="n">i</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">selector</span><span class="o">.</span><span class="n">best_results</span><span class="p">:</span>
<span class="k">break</span>
<span class="n">gen_select_cods</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">selector</span><span class="o">.</span><span class="n">best_results</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="s1">'cod'</span><span class="p">])</span>
<span class="c1"># Plot cod versus feature set size. </span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">gen_select_cods</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'GenSelect'</span><span class="p">)</span>
<span class="c1"># FwdSelect again to get corresponding results. </span>
<span class="n">selector2</span> <span class="o">=</span> <span class="n">FwdSelect</span><span class="p">()</span>
<span class="n">selector2</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="n">mobile</span><span class="p">],</span> <span class="n">X</span><span class="p">[:,</span> <span class="n">targets</span><span class="p">])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">selector2</span><span class="o">.</span><span class="n">ordered_cods</span><span class="p">,</span><span class="s1">'--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'FwdSelect'</span><span class="p">)</span>
<span class="c1"># RevSelect again to get corresponding results. </span>
<span class="n">selector3</span> <span class="o">=</span> <span class="n">RevSelect</span><span class="p">()</span>
<span class="n">selector3</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="n">mobile</span><span class="p">],</span> <span class="n">X</span><span class="p">[:,</span> <span class="n">targets</span><span class="p">])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">selector3</span><span class="o">.</span><span class="n">ordered_cods</span><span class="p">,</span> <span class="s1">'-.'</span><span class="p">,</span><span class="n">label</span><span class="o">=</span><span class="s1">'RevSelect'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Coefficient of Determination (COD or R^2) for </span><span class="si">{target}</span><span class="s1"> vs features retained'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
<span class="n">target</span><span class="o">=</span><span class="s1">', '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">TARGET_TICKERS</span><span class="p">)))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
<p><a href="https://efavdb.com/wp-content/uploads/2018/05/comparison.jpg"><img alt="comparison" src="https://efavdb.com/wp-content/uploads/2018/05/comparison.jpg"></a></p>
<h3 id="examine-the-cost-of-removing-a-feature-from-the-predictor-set">Examine the cost of removing a feature from the predictor set</h3>
<p>Below, we reposition to the best feature set of size 10 seen so far. We then apply the method <code>reverse_cods</code> to expose the cost of removing each of these features from the predictor set at this point. Were we to take a reverse step, the feature with the least cost would be the one dropped (<span class="caps">FB</span>, from the plot).</p>
<p><strong>Lesson</strong>: We can easily access the costs associated with removing individual features from our current location. We can also access the <span class="caps">COD</span> gains associated with adding in new features by calling the <code>forward_cods</code> method.</p>
<div class="highlight"><pre><span></span><span class="c1"># Reposition </span>
<span class="n">s</span> <span class="o">=</span> <span class="n">selector</span><span class="o">.</span><span class="n">best_results</span><span class="p">[</span><span class="mi">10</span><span class="p">][</span><span class="s1">'s'</span><span class="p">]</span>
<span class="n">selector</span><span class="o">.</span><span class="n">position</span><span class="p">(</span><span class="n">s</span><span class="o">=</span><span class="n">s</span><span class="p">)</span>
<span class="c1"># Get costs to remove a feature (see also `forward_cods` method) </span>
<span class="n">costs</span> <span class="o">=</span> <span class="n">selector</span><span class="o">.</span><span class="n">reverse_cods</span><span class="p">()[</span><span class="n">s</span><span class="p">]</span>
<span class="n">TICKERS</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">ALL_TICKERS</span><span class="p">)[</span><span class="n">selector</span><span class="o">.</span><span class="n">best_results</span><span class="p">[</span><span class="mi">10</span><span class="p">][</span><span class="s1">'s'</span><span class="p">]]</span>
<span class="c1"># Plot costs to remove each feature given current position </span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">costs</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">TICKERS</span><span class="p">)),</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">90</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="n">TICKERS</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
<p><a href="https://efavdb.com/wp-content/uploads/2018/05/cost.jpg"><img alt="cost" src="https://efavdb.com/wp-content/uploads/2018/05/cost.jpg"></a></p>
<h2 id="final-comments">Final comments</h2>
<p>In this tutorial, we’ve illustrated many of the basic <span class="caps">API</span> calls available in <code>linselect</code>. In a future tutorial post, we plan to illustrate some interesting use cases of some of these <span class="caps">API</span> calls — e.g., how to use <code>GenSelect</code><span class="quo">‘</span>s arguments to explore the value of supplemental features, added to an already existing data set.</p>
<h2 id="references">References</h2>
<p>[1] J. Landy. Stepwise regression for unsupervised learning, 2017. <a href="https://arxiv.org/abs/1706.03265">arxiv.1706.03265</a>.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Making AI Interpretable with Generative Adversarial Networks2018-04-05T09:43:00-07:002018-04-05T09:43:00-07:00Damien RJtag:efavdb.com,2018-04-05:/gans<p>It has been quite a while since I have posted, largely because soon after I started my job at Square I had a child! I hope to have some newer blog posts soon. But along those lines I want to share a <a href="https://medium.com/square-corner-blog/making-ai-interpretable-with-generative-adversarial-networks-766abc953edf">blog post</a> I did with a coworker (<a href="https://www.linkedin.com/in/juan-hernandez-025a5532/">Juan Hernandez …</a></p><p>It has been quite a while since I have posted, largely because soon after I started my job at Square I had a child! I hope to have some newer blog posts soon. But along those lines I want to share a <a href="https://medium.com/square-corner-blog/making-ai-interpretable-with-generative-adversarial-networks-766abc953edf">blog post</a> I did with a coworker (<a href="https://www.linkedin.com/in/juan-hernandez-025a5532/">Juan Hernandez</a>) for Square that gives a taste of some of the cool data science work we have been up to. This post covers work we did to create a framework for making models interpretable.<br>
<a href="https://medium.com/square-corner-blog/making-ai-interpretable-with-generative-adversarial-networks-766abc953edf"><img alt="" src="https://cdn-images-1.medium.com/max/800/1*lhYEmrsW9kqgB8nfIb9GJQ.png"></a></p>Integration method to map model scores to conversion rates from example data2018-03-03T17:53:00-08:002018-03-03T17:53:00-08:00Jonathan Landytag:efavdb.com,2018-03-03:/integration-method-to-map-model-scores-to-conversion-rates-from-example-data<p>This note addresses the typical applied problem of estimating from data how a target “conversion rate” function varies with some available scalar score function — e.g., estimating conversion rates from some marketing campaign as a function of a targeting model score. The idea centers around estimating the integral of the …</p><p>This note addresses the typical applied problem of estimating from data how a target “conversion rate” function varies with some available scalar score function — e.g., estimating conversion rates from some marketing campaign as a function of a targeting model score. The idea centers around estimating the integral of the rate function; differentiating this gives the rate function. The method is a variation on a standard technique for estimating pdfs via fits to empirical cdfs.</p>
<h3 id="problem-definition-and-naive-binning-solution">Problem definition and naive binning solution</h3>
<p>Here, we are interested in estimating a rate function, <span class="math">\(p \equiv p(x)\)</span>, representing the probability of some “conversion” event as a function of <span class="math">\(x\)</span>, some scalar model score. To do this, we assume we have access to a finite set of score-outcome data of the form <span class="math">\(\{(x_i, n_i), i= 1, \ldots ,k\}\)</span>. Here, <span class="math">\(x_i\)</span> is the score for example <span class="math">\(i\)</span> and <span class="math">\(n_i \in \{0,1\}\)</span> is its conversion indicator.</p>
<p>There are a number of standard methods for estimating rate functions. For example, if the score <span class="math">\(x\)</span> is a prior estimate for the conversion rate, a trivial mapping <span class="math">\(p(x) = x\)</span> may work. This won’t work if the score function in question is not an estimate for <span class="math">\(p\)</span>. A more general approach is to bin together example data points that have similar scores: The observed conversion rate within each bin can then be used as an estimate for the true conversion rate in the bin’s score range. An example output of this approach is shown in Fig. 1. Another option is to create a moving average, analogous to the binned solution.</p>
<p>The simple binning approach introduces two inefficiencies: (1) Binning coarsens a data set, resulting in a loss of information. (2) The data in one bin does not inform the estimates in the other bins, precluding exploitation of any global smoothness constraints that could be placed on <span class="math">\(p\)</span> as a function of <span class="math">\(x\)</span>. The moving average approach is also subject to these issues. The method we discuss below alleviates both inefficiencies.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2018/03/image17.png"><img alt="" src="https://efavdb.com/wp-content/uploads/2018/03/image17.png"></a><br>
Fig. 1. Binned probability estimate approach: All data with scores in a given range are grouped together, and the outcomes from those data points are used to estimate the conversion rate in each bin. Here, the x-axis represents score range, data was grouped into six bins, and mean and standard deviation of the outcome probabilities were estimated from the observed outcomes within each bin.</p>
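<p>The binned estimate of Fig. 1 takes only a few lines. Below is a minimal sketch of this naive approach (not the code used for the figure): the bin count, the synthetic data with true rate <span class="math">\(p(x) = x^2\)</span>, and the use of a binomial standard error within each bin are all illustrative assumptions.</p>

```python
import numpy as np

np.random.seed(0)

# Synthetic score-outcome data with true rate p(x) = x^2 (an assumption).
x = np.sort(np.random.rand(600))
n = np.random.binomial(1, x ** 2)

# Group the scores into equal-width bins.
bins = 6
edges = np.linspace(x.min(), x.max(), bins + 1)
idx = np.clip(np.digitize(x, edges) - 1, 0, bins - 1)
centers = 0.5 * (edges[:-1] + edges[1:])

# Per-bin conversion rate and binomial standard error.
counts = np.bincount(idx, minlength=bins)
p_hat = np.array([n[idx == b].mean() for b in range(bins)])
se = np.sqrt(p_hat * (1 - p_hat) / counts)
```

<p>Each bin's estimate uses only the data inside it, which is exactly the information loss the integration method below avoids.</p>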
<h3 id="efficient-estimates-by-integration">Efficient estimates by integration</h3>
<p>It can be difficult to directly fit a rate function <span class="math">\(p(x)\)</span> using score-outcome data because data of this type does not lie on a continuous curve (the y-values alternate between 0 and 1, depending on the outcome for each example). However, if we consider the empirical integral of the available data, we obtain a smooth, increasing function that is much easier to fit.</p>
<p>To evaluate the empirical integral, we assume the samples are first sorted by <span class="math">\(x\)</span> and define<br>
</p>
<div class="math">$$ \tag{1} \label{1}
\delta x_i \equiv x_i - x_{i-1}.
$$</div>
<p><br>
Next, the empirical integral is taken as<br>
</p>
<div class="math">$$ \tag{2} \label{2}
\hat{J}(x_j) \equiv \sum_{i=1}^{j} n_i \delta x_i,
$$</div>
<p><br>
which approximates the integral<br>
</p>
<div class="math">$$\tag{3} \label{3}
J(x_j) \equiv \int_{x_0}^{x_j} p(x) dx.
$$</div>
<p><br>
We can think of (\ref{3}) as the number of expected conversions given density-<span class="math">\(1\)</span> sampling over the <span class="math">\(x\)</span> range noted. Taking a fit to the <span class="math">\(\{(x_i, \hat{J}(x_i))\}\)</span> values gives a smooth estimate for (\ref{3}). Differentiating with respect to <span class="math">\(x\)</span> then gives an estimate for <span class="math">\(p(x)\)</span>. Fig. 2 illustrates the approach. Here, I fit the available data to a quadratic, capturing the growth in <span class="math">\(p\)</span> with <span class="math">\(x\)</span>.</p>
<p>The example in Fig. 2 has no error bar shown. One way to obtain error bars would be to work with a particular fit form. The uncertainty in the fit coefficients could then be used to estimate uncertainties in the values at each point.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2018/03/image16-1.png"><img alt="image16" src="https://efavdb.com/wp-content/uploads/2018/03/image16-1.png"></a></p>
<p>Fig. 2. (Left) A plot of the empirical integral of the data used to generate Fig. 1 is in blue. A quadratic fit is shown in red. (Right) The derivative of the red fit function at left is shown, an estimate for the rate function in question, <span class="math">\(p\equiv p(x)\)</span>.</p>
<h3 id="example-python-code">Example python code</h3>
<p>The code snippet below carries out the procedure described above on a simple example. An example output is shown in Fig. 3 at the bottom of the section. Running the code multiple times gives one a sense of the error that is present in the predictions. In practical applications, this sort of resampling isn’t possible, so the error analysis procedure suggested above should be carried out to get a better sense of the error involved.</p>
<div class="highlight"><pre><span></span><span class="o">%</span><span class="n">pylab</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">scipy.optimize</span> <span class="kn">import</span> <span class="n">curve_fit</span>
<span class="k">def</span> <span class="nf">p_given_x</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">x</span> <span class="o">**</span> <span class="mi">2</span>
<span class="k">def</span> <span class="nf">outcome_given_p</span><span class="p">(</span><span class="n">p</span><span class="p">):</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">binomial</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="p">)</span>
<span class="c1"># Generate some random data </span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">200</span><span class="p">))</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">p_given_x</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">outcome_given_p</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
<span class="c1"># Calculate delta x, get weighted outcomes </span>
<span class="n">delta_x</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="o">-</span> <span class="n">x</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">weighted_y</span> <span class="o">=</span> <span class="n">y</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">delta_x</span>
<span class="c1"># Integrate and fit </span>
<span class="n">j</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">cumsum</span><span class="p">(</span><span class="n">weighted_y</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">fit_func</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
<span class="k">return</span> <span class="n">a</span> <span class="o">*</span> <span class="n">x</span> <span class="o">**</span> <span class="mi">3</span> <span class="o">+</span> <span class="n">b</span> <span class="o">*</span> <span class="n">x</span> <span class="o">**</span> <span class="mi">2</span>
<span class="n">popt</span><span class="p">,</span> <span class="n">pcov</span> <span class="o">=</span> <span class="n">curve_fit</span><span class="p">(</span><span class="n">fit_func</span><span class="p">,</span> <span class="n">x</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">j</span><span class="p">)</span>
<span class="n">j_fit</span> <span class="o">=</span> <span class="n">fit_func</span><span class="p">(</span><span class="n">x</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="o">*</span><span class="n">popt</span><span class="p">)</span>
<span class="c1"># Finally, differentiate and compare to actual p </span>
<span class="n">p_fit</span> <span class="o">=</span> <span class="p">(</span><span class="n">j_fit</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="o">-</span> <span class="n">j_fit</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="n">delta_x</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="c1"># Plots </span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">3</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">j</span><span class="p">,</span><span class="s1">'*'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'empirical integral'</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">j_fit</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'fit to integral'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">[:</span><span class="o">-</span><span class="mi">2</span><span class="p">],</span> <span class="n">p_fit</span><span class="p">,</span> <span class="s1">'g'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'fit to p versus x'</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="s1">'k--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'actual p versus x'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
</pre></div>
<p><a href="https://efavdb.com/wp-content/uploads/2018/03/example_fit.png"><img alt="example_fit" src="https://efavdb.com/wp-content/uploads/2018/03/example_fit.png"></a></p>
<p>Fig. 3. The result of one run of the algorithm on a data set where <span class="math">\(p(x) \equiv x^2\)</span>, given 200 random samples of <span class="math">\(x \in (0, 1)\)</span>.</p>
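<p>The error analysis suggested above can be made concrete: <code>curve_fit</code> also returns <code>pcov</code>, the covariance matrix of the fit coefficients, and because the derivative of the fit is linear in those coefficients, its variance at each point follows from standard linear error propagation, g(x)^T pcov g(x) with g(x) = (3x^2, 2x). The sketch below is an illustration with synthetic data and a two-coefficient cubic fit form, not code from the original post.</p>

```python
import numpy as np
from scipy.optimize import curve_fit

np.random.seed(0)

# Synthetic data with true rate p(x) = x^2 (an assumption for illustration).
x = np.sort(np.random.rand(200))
y = np.random.binomial(1, x ** 2)

# Empirical integral of the outcomes, as in the snippet above.
delta_x = np.diff(x)
j = np.cumsum(y[:-1] * delta_x)

def fit_func(x, a, b):
    return a * x ** 3 + b * x ** 2

popt, pcov = curve_fit(fit_func, x[:-1], j)

# Rate estimate: the derivative of the fit, p_hat(x) = 3 a x^2 + 2 b x.
xs = np.linspace(0.0, 1.0, 50)
p_hat = 3 * popt[0] * xs ** 2 + 2 * popt[1] * xs

# Linear error propagation: var p_hat(x) = g(x)^T pcov g(x), g(x) = (3x^2, 2x).
g = np.vstack([3 * xs ** 2, 2 * xs])
p_err = np.sqrt(np.einsum('is,ij,js->s', g, pcov, g))
```

<p>These error bars reflect only the coefficient uncertainty; they do not account for misspecification of the assumed fit form itself.</p>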
Gaussian Processes2017-11-25T09:53:00-08:002017-11-25T09:53:00-08:00Jonathan Landytag:efavdb.com,2017-11-25:/gaussian-processes<p>We review the math and code needed to fit a Gaussian Process (<span class="caps">GP</span>) regressor to data. We conclude with a demo of a popular application, fast function minimization through <span class="caps">GP</span>-guided search. The gif below illustrates this approach in action — the red points are samples from the hidden red curve …</p><p>We review the math and code needed to fit a Gaussian Process (<span class="caps">GP</span>) regressor to data. We conclude with a demo of a popular application, fast function minimization through <span class="caps">GP</span>-guided search. The gif below illustrates this approach in action — the red points are samples from the hidden red curve. Using these samples, we attempt to leverage GPs to find the curve’s minimum as fast as possible.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2017/11/full_search.gif"><img alt="full_search" src="https://efavdb.com/wp-content/uploads/2017/11/full_search.gif"></a></p>
<p>Appendices contain quick reviews on (i) the <span class="caps">GP</span> regressor posterior derivation, (ii) SKLearn’s <span class="caps">GP</span> implementation, and (iii) <span class="caps">GP</span> classifiers.</p>
<h3 id="introduction">Introduction</h3>
<p>Gaussian Processes (GPs) provide a tool for treating the following general problem: A function <span class="math">\(f(x)\)</span> is sampled at <span class="math">\(n\)</span> points, resulting in a set of noisy<span class="math">\(^1\)</span> function measurements, <span class="math">\(\{f(x_i) = y_i \pm \sigma_i, i = 1, \ldots, n\}\)</span>. Given these available samples, can we estimate the probability that <span class="math">\(f = \hat{f}\)</span>, where <span class="math">\(\hat{f}\)</span> is some candidate function?</p>
<p>To decompose and isolate the ambiguity associated with the above challenge, we begin by applying Bayes’s rule,
</p>
<div class="math">\begin{eqnarray} \label{Bayes} \tag{1}
p(\hat{f} \vert \{y\}) = \frac{p(\{y\} \vert \hat{f} ) p(\hat{f})}{p(\{y\}) }.
\end{eqnarray}</div>
<p>
The quantity at left above is shorthand for the probability we seek — the probability that <span class="math">\(f = \hat{f}\)</span>, given our knowledge of the sampled function values <span class="math">\(\{y\}\)</span>. To evaluate this, one can define and then evaluate the quantities at right. Defining the first in the numerator requires some assumption about the source of error in our measurement process. The second function in the numerator is the prior — it is here that the strongest assumptions must be made. For example, we’ll see below that the prior effectively dictates the probability of a given smoothness for the <span class="math">\(f\)</span> function in question.</p>
<p>In the <span class="caps">GP</span> approach, both quantities in the numerator at right above are taken to be multivariate Normals / Gaussians. The specific parameters of this Gaussian can be selected to ensure that the resulting fit is good — but the Normality requirement is essential for the mathematics to work out. Taking this approach, we can write down the posterior analytically, which then allows for some useful applications. For example, we used this approach to obtain the curves shown in the top figure of this post — these were obtained through random sampling from the posterior of a fitted <span class="caps">GP</span>, pinned to equal measured values at the two pinched points shown. Posterior samples are useful for visualization and also for taking Monte Carlo averages.</p>
<p>In this post, we (i) review the math needed to calculate the posterior above, (ii) discuss numerical evaluations and fit some example data using GPs, and (iii) review how a fitted <span class="caps">GP</span> can help to quickly minimize a cost function — e.g., a machine learning cross-validation score. Appendices cover the derivation of the <span class="caps">GP</span> regressor posterior, SKLearn’s <span class="caps">GP</span> implementation, and <span class="caps">GP</span> Classifiers.</p>
<p>Our minimal python class SimpleGP used below is available on our GitHub, <a href="https://github.com/EFavDB/gaussian_processes">here</a>.</p>
<p>Note: To understand the mathematical details covered in this post, one should be familiar with multivariate normal distributions — these are reviewed in our prior post, <a href="http://efavdb.github.io/normal-distributions">here</a>. These details can be skipped by those primarily interested in applications.</p>
<h3 id="analytic-evaluation-of-the-posterior">Analytic evaluation of the posterior</h3>
<p>To evaluate the left side of (\ref{Bayes}), we will evaluate the right. Only the terms in the numerator need to be considered, because the denominator does not depend on <span class="math">\(\hat{f}\)</span>. This means that the denominator must equate to a normalization factor, common to all candidate functions. In this section, we will first write down the assumed forms for the two terms in the numerator and then consider the posterior that results.</p>
<p>The first assumption that we will make is that if the true function is <span class="math">\(\hat{f}\)</span>, then our <span class="math">\(y\)</span>-measurements are independent and Gaussian-distributed about <span class="math">\(\hat{f}(x)\)</span>. This assumption implies that the first term on the right of (\ref{Bayes}) is
</p>
<div class="math">\begin{eqnarray} \tag{2} \label{prob}
p(\{y\} \vert \hat{f} ) \equiv \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma_i^2}} \exp \left ( - \frac{(y_i - \hat{f}(x_i) )^2}{2 \sigma_i^2} \right).
\end{eqnarray}</div>
<p>
The <span class="math">\(y_i\)</span> above are the actual measurements made at our sample points, and the <span class="math">\(\sigma_i^2\)</span> are their variance uncertainties.</p>
<p>The second thing we must do is assume a form for <span class="math">\(p(\hat{f})\)</span>, our prior. We restrict attention to a set of points <span class="math">\(\{x_i: i = 1, \ldots, N\}\)</span>, where the first <span class="math">\(n\)</span> points are the points that have been sampled, and the remaining <span class="math">\((N-n)\)</span> are test points at other locations — points where we would like to estimate the joint statistics<span class="math">\(^2\)</span> of <span class="math">\(f\)</span>. To progress, we simply assume a multi-variate Normal distribution for <span class="math">\(f\)</span> at these points, governed by a covariance matrix <span class="math">\(\Sigma\)</span>. This gives
</p>
<div class="math">\begin{eqnarray} \label{prior} \tag{3}
&&p(f(x_1), \ldots, f(x_N) ) \sim \\
&& \frac{1}{\sqrt{ (2 \pi)^{N} \vert \Sigma \vert }} \exp \left ( - \frac{1}{2} \sum_{ij=1}^N f_i \Sigma^{-1}_{ij} f_j \right).
\end{eqnarray}</div>
<p>
Here, we have introduced the shorthand, <span class="math">\(f_i \equiv f(x_i)\)</span>. Notice that we have implicitly assumed that the mean of our normal distribution is zero above. This is done for simplicity: If a non-zero mean is appropriate, this can be added in to the analysis, or subtracted from the underlying <span class="math">\(f\)</span> to obtain a new one with zero mean.</p>
<p>The particular form of <span class="math">\(\Sigma\)</span> is where all of the modeler’s insight and ingenuity must be placed when working with GPs. Researchers who know their topic very well can assert well-motivated, complex priors — often taking the form of a sum of terms, each capturing some physically-relevant contribution to the statistics of their problem at hand. In this post, we’ll assume the simple form
</p>
<div class="math">\begin{eqnarray} \tag{4} \label{covariance}
\Sigma_{ij} \equiv \sigma^2 \exp \left( - \frac{(x_i - x_j)^2}{2 l^2}\right).
\end{eqnarray}</div>
<p>
Notice that with this assumed form, if <span class="math">\(x_i\)</span> and <span class="math">\(x_j\)</span> are close together, the exponential will be nearly equal to one. This ensures that nearby points are highly correlated, forcing all high-probability functions to be smooth. The rate at which (\ref{covariance}) dies down as two test points move away from each other is controlled by the length-scale parameter <span class="math">\(l.\)</span> If this is large (small), the curve will be smooth over a long (short) distance. We illustrate these points in the next section, and also explain how an appropriate length scale can be inferred from the sample data at hand in the section after that.</p>
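<p>To make the role of the length scale concrete, here is a minimal numpy sketch of the covariance (\ref{covariance}), together with samples drawn from the corresponding zero-mean prior; the particular <span class="math">\(\sigma\)</span>, <span class="math">\(l\)</span>, and grid values are arbitrary choices for illustration.</p>

```python
import numpy as np

np.random.seed(0)

def squared_exp_kernel(x1, x2, sigma=1.0, length=1.0):
    """Covariance (4): Sigma_ij = sigma^2 exp(-(x_i - x_j)^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return sigma ** 2 * np.exp(-d ** 2 / (2 * length ** 2))

x = np.linspace(0, 1, 50)
K_smooth = squared_exp_kernel(x, x, length=0.5)   # long length scale
K_rough = squared_exp_kernel(x, x, length=0.05)   # short length scale

# Zero-mean prior samples; a small jitter keeps the matrices numerically PSD.
jitter = 1e-9 * np.eye(len(x))
f_smooth = np.random.multivariate_normal(np.zeros(len(x)), K_smooth + jitter)
f_rough = np.random.multivariate_normal(np.zeros(len(x)), K_rough + jitter)
```

<p>Plotting <code>f_smooth</code> and <code>f_rough</code> shows the effect directly: the long length scale yields slowly varying curves, while the short one yields rapidly fluctuating ones.</p>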
<p>Now, if we combine (\ref{prob}) and (\ref{prior}) and plug this into (\ref{Bayes}), we obtain an expression for the posterior, <span class="math">\(p(f \vert \{y\})\)</span>. This function is an exponential whose argument is a quadratic in the <span class="math">\(f_i\)</span>. In other words, like the prior, the posterior is a multi-variate normal. With a little work, one can derive explicit expressions for the mean and covariance of this distribution: Using block notation, with <span class="math">\(0\)</span> corresponding to the sample points and <span class="math">\(1\)</span> to the test points, the marginal distribution at the test points is
</p>
<div class="math">\begin{eqnarray} \tag{5} \label{posterior}
&& p(\textbf{f}_1 \vert \{y\}) =\
&& N\left ( \Sigma_{10} \frac{1}{\sigma^2 I_{00} + \Sigma_{00}} \cdot \textbf{y}, \Sigma_{11} - \Sigma_{10} \frac{1}{\sigma^2 I_{00} + \Sigma_{00}} \Sigma_{01} \right).
\end{eqnarray}</div>
<p>
Here,
</p>
<div class="math">\begin{eqnarray} \tag{6} \label{sigma_mat}
\sigma^2 I_{00} \equiv
\left( \begin{array}{cccc}
\sigma_1^2 & 0 & \ldots &0 \\
0 & \sigma_2^2 & \ldots &0 \\
\ldots & & & \\
0 & 0 & \ldots & \sigma_n^2
\end{array} \right),
\end{eqnarray}</div>
<p>
and <span class="math">\(\textbf{y}\)</span> is the length-<span class="math">\(n\)</span> vector of measurements,
</p>
<div class="math">\begin{eqnarray}\tag{7} \label{y_vec}
\textbf{y}^T \equiv (y_1, \ldots, y_n).
\end{eqnarray}</div>
<p>Equation (\ref{posterior}) is one of the main results for Gaussian Process regressors — this result is all one needs to evaluate the posterior. Notice that the mean at all points is linear in the sampled values <span class="math">\(\textbf{y}\)</span> and that the variance at each point is reduced near the measured values. Those interested in a careful derivation of this result can consult our appendix — we actually provide two derivations there. However, in the remainder of the body of the post, we will simply explore applications of this formula.</p>
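<p>As a sketch of how (\ref{posterior}) can be evaluated directly, the helper below (ours, separate from the <code>SimpleGP</code> class introduced next) computes the posterior mean and covariance at a set of test points under the covariance (\ref{covariance}):</p>

```python
import numpy as np

def gp_posterior(x0, y, noise_var, x1, sigma=1.0, length=1.0):
    """Mean and covariance of the posterior (5) at test points x1,
    given samples y at x0 with per-point noise variances noise_var."""
    x0, x1 = np.asarray(x0, float), np.asarray(x1, float)
    k = lambda a, b: sigma**2 * np.exp(-(a[:, None] - b[None, :])**2
                                       / (2 * length**2))
    K00 = k(x0, x0) + np.diag(noise_var)   # sigma^2 I_00 + Sigma_00
    K10 = k(x1, x0)                        # Sigma_10
    mean = K10 @ np.linalg.solve(K00, np.asarray(y, float))
    cov = k(x1, x1) - K10 @ np.linalg.solve(K00, K10.T)
    return mean, cov

# At a nearly noise-free sample point, the posterior hugs the measurement.
mean, cov = gp_posterior([-0.5, 2.5], [0.5, 0.0], [1e-4, 0.0625], [-0.5])
```

Note that solving the linear system is preferred to forming the inverse explicitly, but the two are equivalent to (\ref{posterior}).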
<h3 id="numerical-evaluations-of-the-posterior">Numerical evaluations of the posterior</h3>
<p>In this section, we will demonstrate how two typical applications of (\ref{posterior}) can be carried out: (i) Evaluation of the mean and standard deviation of the posterior distribution at a test point <span class="math">\(x\)</span>, and (ii) Sampling functions <span class="math">\(\hat{f}\)</span> directly from the posterior. The former is useful in that it can be used to obtain confidence intervals for <span class="math">\(f\)</span> at all locations, and the latter is useful both for visualization and also for obtaining general Monte Carlo averages over the posterior. Both concepts are illustrated in the header image for this post: In this picture, we fit a <span class="caps">GP</span> to a one-d function that had been measured at two locations. The blue shaded region represents a one-sigma confidence interval for the function value at each location, and the colored curves are posterior samples.</p>
<p>The code for our <code>SimpleGP</code> fitter class is available on our <a href="https://github.com/EFavDB/gaussian_processes">GitHub</a>. We’ll explain a bit how this works below, but those interested in the details should examine the code — it’s a short script and should be largely self-explanatory.</p>
<h4 id="intervals">Intervals</h4>
<p>The code snippet below initializes our <code>SimpleGP</code> class, defines some sample locations, values, and uncertainties, then evaluates the mean and standard deviation of the posterior at a set of test points. Briefly, this is carried out as follows: The <code>fit</code> method evaluates the inverse matrix <span class="math">\(\left [ \sigma^2 I_{00} + \Sigma_{00} \right]^{-1}\)</span> that appears in (\ref{posterior}) and saves the result for later use — this allows us to avoid reevaluation of this inverse at each test point. Next, (\ref{posterior}) is evaluated once for each test point through the call to the <code>interval</code> method.</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="c1"># Initialize fitter -- set covariance parameters</span>
<span class="n">WIDTH_SCALE</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="n">LENGTH_SCALE</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">SimpleGP</span><span class="p">(</span><span class="n">WIDTH_SCALE</span><span class="p">,</span> <span class="n">LENGTH_SCALE</span><span class="p">,</span> <span class="n">noise</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c1"># Insert observed sample data here, fit</span>
<span class="n">sample_x</span> <span class="o">=</span> <span class="p">[</span><span class="o">-</span><span class="mf">0.5</span><span class="p">,</span> <span class="mf">2.5</span><span class="p">]</span>
<span class="n">sample_y</span> <span class="o">=</span> <span class="p">[</span><span class="o">.</span><span class="mi">5</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
<span class="n">sample_s</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">]</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">sample_x</span><span class="p">,</span> <span class="n">sample_y</span><span class="p">,</span> <span class="n">sample_s</span><span class="p">)</span>
<span class="c1"># Get the mean and std at each point in x_test</span>
<span class="n">test_x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="o">.</span><span class="mi">05</span><span class="p">)</span>
<span class="n">means</span><span class="p">,</span> <span class="n">stds</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">interval</span><span class="p">(</span><span class="n">test_x</span><span class="p">)</span>
</pre></div>
<p>In the above, <code>WIDTH_SCALE</code> and <code>LENGTH_SCALE</code> are needed to specify the covariance matrix (\ref{covariance}). The former corresponds to <span class="math">\(\sigma\)</span> and the latter to <span class="math">\(l\)</span> in that equation. Increasing <code>WIDTH_SCALE</code> corresponds to asserting less certainty as to the magnitude of the unknown function, and increasing <code>LENGTH_SCALE</code> corresponds to increasing how smooth we expect the function to be. The figure below illustrates these points: Here, the blue intervals were obtained by setting <code>WIDTH_SCALE = LENGTH_SCALE = 1</code> and the orange intervals were obtained by setting <code>WIDTH_SCALE = 0.5</code> and <code>LENGTH_SCALE = 2</code>. The result is that the orange posterior estimate is tighter and smoother than the blue posterior. In both plots, the solid curve is a plot of the mean of the posterior distribution, and the vertical bars are one sigma confidence intervals.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2017/11/intervals.jpg"><img alt="intervals" src="https://efavdb.com/wp-content/uploads/2017/11/intervals.jpg"></a></p>
<h4 id="posterior-samples">Posterior samples</h4>
<p>To sample actual functions from the posterior, we will simply evaluate the mean and covariance matrix in (\ref{posterior}) again, this time passing in the multiple test point locations at which we would like to know the resulting sampled functions. Once we have the mean and covariance matrix of the posterior at these test points, we can pull samples from (\ref{posterior}) using an external library for multivariate normal sampling — for this purpose, we used the python package numpy. The last step in the code snippet below carries out these steps.</p>
<div class="highlight"><pre><span></span><span class="c1"># Insert observed sample data here.</span>
<span class="n">sample_x</span> <span class="o">=</span> <span class="p">[</span><span class="o">-</span><span class="mf">1.5</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.7</span><span class="p">,</span> <span class="mf">1.4</span><span class="p">,</span> <span class="mf">2.5</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">]</span>
<span class="n">sample_y</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="o">.</span><span class="mi">5</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">]</span>
<span class="n">sample_s</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.01</span><span class="p">]</span>
<span class="c1"># Initialize fitter -- set covariance parameters</span>
<span class="n">WIDTH_SCALE</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="n">LENGTH_SCALE</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">SimpleGP</span><span class="p">(</span><span class="n">WIDTH_SCALE</span><span class="p">,</span> <span class="n">LENGTH_SCALE</span><span class="p">,</span> <span class="n">noise</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">sample_x</span><span class="p">,</span> <span class="n">sample_y</span><span class="p">,</span> <span class="n">sample_s</span><span class="p">)</span>
<span class="c1"># Get the mean and std at each point in test_x</span>
<span class="n">test_x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="o">.</span><span class="mi">05</span><span class="p">)</span>
<span class="n">means</span><span class="p">,</span> <span class="n">stds</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">interval</span><span class="p">(</span><span class="n">test_x</span><span class="p">)</span>
<span class="c1"># Sample here</span>
<span class="n">SAMPLES</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">samples</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">test_x</span><span class="p">,</span> <span class="n">SAMPLES</span><span class="p">)</span>
</pre></div>
<p>Notice that in the <code>sample_x</code>, <code>sample_y</code>, and <code>sample_s</code> assignments here, we’ve added in a few additional function sample locations (for fun). The resulting intervals and posterior samples are shown in the figure below. Notice that near the sampled points, the posterior is fairly well localized. However, on the left side of the plot, the posterior approaches the prior once we have moved a distance <span class="math">\(\geq 1\)</span>, the length scale chosen for the covariance matrix (\ref{covariance}).</p>
<p><a href="https://efavdb.com/wp-content/uploads/2017/11/samples.jpg"><img alt="samples" src="https://efavdb.com/wp-content/uploads/2017/11/samples.jpg"></a></p>
<h3 id="selecting-the-covariance-hyper-parameters">Selecting the covariance hyper-parameters</h3>
<p>In the above, we demonstrated that the length scale of our covariance form dramatically affects the posterior — the shape of the intervals and also of the samples from the posterior. Appropriately setting these parameters is a general problem that can make working with GPs a challenge. Here, we describe two methods that can be used to intelligently set such hyper-parameters, given some sampled data.</p>
<h4 id="cross-validation">Cross-validation</h4>
<p>A standard method for setting hyper-parameters is to make use of a cross-validation scheme. This entails splitting the available sample data into a training set and a test set. One fits the <span class="caps">GP</span> to the training set using one set of hyper-parameters, then evaluates the accuracy of the model on the held out test set. One then repeats this process across many hyper-parameter choices, and selects that set which resulted in the best test set performance.</p>
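<p>The snippet below sketches this procedure for the length-scale hyper-parameter, using toy data and a bare-bones posterior-mean helper of our own. A real workflow would typically use a library and k-fold splits rather than a single train/validation split:</p>

```python
import numpy as np

def gp_mean(x_train, y_train, x_test, length, sigma=1.0, noise_var=1e-2):
    """Posterior mean under the covariance (4), as in equation (5)."""
    k = lambda a, b: sigma**2 * np.exp(-(a[:, None] - b[None, :])**2
                                       / (2 * length**2))
    K = k(x_train, x_train) + noise_var * np.eye(len(x_train))
    return k(x_test, x_train) @ np.linalg.solve(K, y_train)

# Toy data from a smooth function, split into train / validation halves.
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 40)
y = np.sin(x) + 0.1 * rng.standard_normal(40)
x_tr, y_tr, x_va, y_va = x[::2], y[::2], x[1::2], y[1::2]

# Score a grid of candidate length scales by held-out mean squared error.
grid = [0.01, 0.1, 1.0, 10.0]
errs = [float(np.mean((gp_mean(x_tr, y_tr, x_va, l) - y_va)**2)) for l in grid]
best = grid[int(np.argmin(errs))]
```

A very short length scale reverts to the prior away from the training points, and a very long one oversmooths; both score poorly on the held-out half.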
<h4 id="marginal-likelihood-maximization">Marginal Likelihood Maximization</h4>
<p>Often, one is interested in applying GPs in limits where evaluation of samples is expensive. This means that one often works with GPs in limits where only a small number of samples are available. In cases like this, the optimal hyper-parameters can vary quickly as the number of training points is increased. This means that the optimal selections obtained from a cross-validation scheme may be far from the optimal set that applies when one trains on the full sample set<span class="math">\(^3\)</span>.</p>
<p>An alternative general approach for setting the hyper-parameters is to maximize the marginal likelihood. That is, we try to maximize the likelihood of seeing the samples we have seen — optimizing over the choice of available hyper-parameters. Formally, the marginal likelihood is evaluated by integrating out the unknown <span class="math">\(\hat{f}^4\)</span>,
</p>
<div class="math">\begin{eqnarray} \tag{8}
p(\{y\} \vert \Sigma) \equiv \int p(\{y\} \vert f) p(f \vert \Sigma) df.
\end{eqnarray}</div>
<p>
Carrying out the integral directly can be done just as we have evaluated the posterior distribution in our appendix. However, a faster method is to note that after integrating out the <span class="math">\(f\)</span>, the <span class="math">\(y\)</span> values must be normally distributed as
</p>
<div class="math">\begin{eqnarray}\tag{9}
p(\{y\} \vert \Sigma) \sim N(0, \Sigma + \sigma^2 I_{00}),
\end{eqnarray}</div>
<p>
where <span class="math">\(\sigma^2 I_{00}\)</span> is defined as in (\ref{sigma_mat}). This gives
</p>
<div class="math">\begin{eqnarray} \tag{10} \label{marginallikelihood}
\log p(\{y\}) \sim - \log \vert \Sigma + \sigma^2 I_{00} \vert - \textbf{y} \cdot ( \Sigma + \sigma^2 I_{00} )^{-1} \cdot \textbf{y}.
\end{eqnarray}</div>
<p>
The two terms above compete: The second, data-fit term favors covariance matrices under which the observed values <span class="math">\(\textbf{y}\)</span> are likely. Maximizing this alone would tend to result in an overfitting of the data. However, this term is counteracted by the first, which comes from the normalization of the Gaussian integral. The first term grows as the covariance matrix approaches singularity (e.g., for long decay lengths or small diagonal variances), so it favors smooth, simple priors. It acts as a regularization term that suppresses overly complex fits.</p>
<p>In practice, to maximize (\ref{marginallikelihood}), one typically makes use of gradient descent, using analytical expressions for the gradient. This is the approach taken by SKLearn. Being able to optimize the hyper-parameters of a <span class="caps">GP</span> is one of this model’s virtues. Unfortunately, (\ref{marginallikelihood}) is not guaranteed to be convex and multiple local minima often exist. To obtain a good minimum, one can attempt to initialize at some well-motivated point. Alternatively, one can reinitialize the gradient descent repeatedly at random points, finally selecting the best option at the end.</p>
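<p>For small problems, a brute-force scan of (\ref{marginallikelihood}) can substitute for gradient-based optimization. The sketch below (helper and grid are ours) evaluates the log marginal likelihood over a grid of length scales for the sample data used earlier:</p>

```python
import numpy as np

def log_marginal_likelihood(x, y, noise_var, sigma, length):
    """Equation (10): log p({y}), dropping additive constants and
    the overall factor of 1/2 (which do not affect the argmax)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    K = sigma**2 * np.exp(-(x[:, None] - x[None, :])**2 / (2 * length**2))
    K += np.diag(noise_var)
    _, logdet = np.linalg.slogdet(K)  # stable log-determinant
    return -logdet - y @ np.linalg.solve(K, y)

# Grid scan over the length scale for the sample data used above.
x = [-1.5, -0.5, 0.7, 1.4, 2.5, 3.0]
y = [1, 2, 2, 0.5, 0, 0.5]
s2 = [0.01**2, 0.25**2, 0.5**2, 0.01**2, 0.3**2, 0.01**2]
grid = [0.1, 0.5, 1.0, 2.0, 5.0]
best = max(grid, key=lambda l: log_marginal_likelihood(x, y, s2, 1.0, l))
```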
<h3 id="function-minimum-search-and-machine-learning">Function minimum search and machine learning</h3>
<p>We’re now ready to introduce one of the most popular applications of GPs: fast, guided function minimum search. In this problem, one is able to iteratively obtain noisy samples of a function, and the aim is to identify as quickly as possible the global minimum of the function. Gradient descent could be applied in cases like this, but this approach generally requires repeated sampling if the function is not convex. To reduce the number of steps / samples required, one can attempt to apply a more general, explore-exploit type strategy — one balancing the desire to optimize about the current best known minimum with the goal of seeking out new local minima that are potentially even better. <span class="caps">GP</span> posteriors provide a natural starting point for developing such strategies.</p>
<p>The idea behind the <span class="caps">GP</span>-guided search approach is to develop a score function on top of the <span class="caps">GP</span> posterior. This score function should be chosen to encode some opinion of the value of searching a given point — preferably one that takes an explore-exploit flavor. Once each point is scored, the point with the largest (or smallest, as appropriate) score is sampled. The process is then repeated iteratively until one is satisfied. Many score functions are possible. We discuss four possible choices below, then give an example.</p>
<ul>
<li><strong>Gaussian Lower Confidence Bound (<span class="caps">GLCB</span>)</strong>.
The <span class="caps">GLCB</span> scores each point <span class="math">\(x\)</span> as
<div class="math">\begin{eqnarray}\tag{11}
s_{\kappa}(x) = \mu(x) - \kappa \sigma(x).
\end{eqnarray}</div>
Here, <span class="math">\(\mu\)</span> and <span class="math">\(\sigma\)</span> are the <span class="caps">GP</span> posterior estimates for the mean and standard deviation for the function at <span class="math">\(x\)</span> and <span class="math">\(\kappa\)</span> is a control parameter. Notice that the first <span class="math">\(\mu(x)\)</span> term encourages exploitation around the best known local minimum. Similarly, the second <span class="math">\(\kappa \sigma\)</span> term encourages exploration — search at points where the <span class="caps">GP</span> is currently most unsure of the true function value.</li>
<li><strong>Gaussian Probability of Improvement (<span class="caps">GPI</span>)</strong>.
If the smallest value seen so far is <span class="math">\(y\)</span>, we can score each point using the probability that the true function value at that point is less than <span class="math">\(y\)</span>. That is, we can write
<div class="math">\begin{eqnarray}\tag{12}
s(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \int_{-\infty}^y e^{-(v - \mu)^2 / (2 \sigma^2)} dv.
\end{eqnarray}</div>
</li>
<li><strong>Gaussian Expected Improvement (<span class="caps">EI</span>)</strong>.
A popular variant of the above is the so-called expected improvement.
This is defined as
<div class="math">\begin{eqnarray} \tag{13}
s(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \int_{-\infty}^y e^{-(v - \mu)^2 / (2 \sigma^2)} (y - v) dv.
\end{eqnarray}</div>
This score function tends to encourage more exploration than the probability of improvement, since it values uncertainty more highly.</li>
<li><strong>Probability is minimum</strong>.
A final score function of interest is simply the probability that the point in question is the minimum. One way to obtain this score is to sample from the posterior many times. For each sample, we mark its global minimum, then take a majority vote for where to sample next.</li>
</ul>
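<p>The first three scores above have closed forms in terms of the standard normal CDF <span class="math">\(\Phi\)</span> and PDF <span class="math">\(\varphi\)</span> — in particular, the integrals (12) and (13) reduce to standard expressions. A sketch (function names are ours):</p>

```python
from math import erf, exp, pi, sqrt

def _Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def _phi(z):
    """Standard normal PDF."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def glcb(mu, sigma, kappa=2.0):
    """Gaussian lower confidence bound (11); sample its minimizer next."""
    return mu - kappa * sigma

def prob_improvement(mu, sigma, y_best):
    """Probability (12) that the true value lies below the best seen, y_best."""
    return _Phi((y_best - mu) / sigma)

def expected_improvement(mu, sigma, y_best):
    """Expected improvement: closed form of the integral (13),
    (y - mu) * Phi(z) + sigma * phi(z) with z = (y - mu) / sigma."""
    z = (y_best - mu) / sigma
    return (y_best - mu) * _Phi(z) + sigma * _phi(z)
```

Note that the expected improvement grows with <span class="math">\(\sigma\)</span> even when <span class="math">\(\mu\)</span> is far above the best value seen, which is the sense in which it values uncertainty more highly than the probability of improvement.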
<p>The gif at the top of this page (copied below) illustrates an actual <span class="caps">GP</span>-guided search, carried out in python using the package skopt<span class="math">\(^5\)</span>. The red curve at left is the (hidden) curve <span class="math">\(f\)</span> whose global minimum is being sought. The red points are the samples that have been obtained so far, and the green shaded curve is the <span class="caps">GP</span> posterior confidence interval for each point — this gradually improves as more samples are obtained. At right is the Expected Improvement (<span class="caps">EI</span>) score function at each point that results from analysis on top of the <span class="caps">GP</span> posterior — the score function used to guide search in this example. The process is initialized with five random samples, followed by guided search. Notice that as the process evolves, the first few samples focus on exploitation of known local minima. However, after a handful of iterations, the diminishing returns of continuing to sample these locations loses out to the desire to explore the middle points — where the actual global minimum sits and is found.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2017/11/full_search.gif"><img alt="full_search" src="https://efavdb.com/wp-content/uploads/2017/11/full_search.gif"></a></p>
<h3 id="discussion">Discussion</h3>
<p>In this post we’ve overviewed much of the math of GPs: The math needed to get to the posterior, how to sample from the posterior, and finally how to make practical use of the posterior.</p>
<p>In principle, GPs represent a powerful tool that can be used to fit any function. In practice, the challenge in wielding this tool seems to sit mainly with selection of appropriate hyper-parameters — the search for appropriate parameters often gets stuck in local minima, causing fits to go off the rails. Nevertheless, when done correctly, application of GPs can provide some valuable performance gains — and they are always fun to visualize.</p>
<p>Some additional topics relating to GPs are contained in our appendices. For those interested in even more detail, we can recommend the free online text by Rasmussen and Williams<span class="math">\(^6\)</span>.</p>
<h3 id="appendix-a-derivation-of-posterior">Appendix A: Derivation of posterior</h3>
<p>In this appendix, we present two methods to derive the posterior (\ref{posterior}).</p>
<h4 id="method-1">Method 1</h4>
<p>We will begin by completing the square. Combining (\ref{prob}) and (\ref{prior}), a little algebra gives
</p>
<div class="math">\begin{align} \tag{A1} \label{square_complete}
p(f_1, \ldots, f_N \vert \{y\}) &\sim \exp \left (-\sum_{i=1}^n \frac{(y_i - f_i)^2}{2 \sigma^2_i} - \frac{1}{2} \sum_{ij=1}^N f_i \Sigma^{-1}_{ij} f_j \right) \\
&\sim N\left ( \frac{1}{\Sigma^{-1} + \frac{1}{\sigma^2} I } \cdot \frac{1}{\sigma^2} I \cdot \textbf{y}, \frac{1}{\Sigma^{-1} + \frac{1}{\sigma^2} I } \right).
\end{align}</div>
<p>
Here, <span class="math">\(\frac{1}{\sigma^2} I\)</span> is defined as in (\ref{sigma_mat}), but has zeros in all rows outside of the sample set. To obtain the expression (\ref{posterior}), we must identify the block structure of the inverse matrix that appears above.</p>
<p>To start, we write
</p>
<div class="math">\begin{align} \tag{A2} \label{matrix_to_invert}
\frac{1}{\Sigma^{-1} + \frac{1}{\sigma^2}I } &= \Sigma \frac{1}{I + \frac{1}{\sigma^2}I \Sigma} \\
&= \Sigma \left( \begin{matrix}
I_{00} + \frac{1}{\sigma^2}I_{00} \Sigma_{00} & \frac{1}{\sigma^2}I_{00} \Sigma_{01}\\
0 & I_{11}
\end{matrix} \right)^{-1},
\end{align}</div>
<p>
where we are using block notation. To evaluate the inverse that appears above, we will make use of the block matrix inversion formula,
</p>
<div class="math">\begin{align}
&\left( \begin{matrix}
A & B\\
C & D
\end{matrix} \right)^{-1} = \\
&\left( \begin{matrix}
(A - B D^{-1} C)^{-1} & - (A - B D^{-1} C)^{-1} B D^{-1} \\
-D^{-1} C (A - B D^{-1} C)^{-1} & D^{-1} + D^{-1} C (A - B D^{-1} C) B D^{-1}
\end{matrix} \right).
\end{align}</div>
<p>
The matrix (\ref{matrix_to_invert}) has blocks <span class="math">\(C = 0\)</span> and <span class="math">\(D=I\)</span>, which simplifies the above significantly. Plugging in, we obtain
</p>
<div class="math">\begin{align} \label{shifted_cov} \tag{A3}
\frac{1}{\Sigma^{-1} + \frac{1}{\sigma^2}I } =
\Sigma \left( \begin{matrix}
\frac{1}{I_{00} + \frac{1}{\sigma^2}I_{00} \Sigma_{00}} & - \frac{1}{I_{00} + \frac{1}{\sigma^2}I_{00} \Sigma_{00}} \Sigma_{01}\\
0 & I_{11}
\end{matrix} \right)
\end{align}</div>
<p>
With this result and (\ref{square_complete}), we can read off the mean of the test set as
</p>
<div class="math">\begin{align} \tag{A4} \label{mean_test}
& \left [ [ \Sigma^{-1} + \frac{1}{\sigma^2} I_{00} ]^{-1} \cdot \frac{1}{\sigma^2} I_{00} \cdot \textbf{y} \right ]_1 \\
&= \Sigma_{10} \frac{1}{I_{00} + \frac{1}{\sigma^2}I_{00} \Sigma_{00}} \frac{1}{\sigma^2} I_{00} \cdot \textbf{y} \\
&= \Sigma_{10} \frac{1}{\sigma^2 I_{00} + \Sigma_{00}} \cdot \textbf{y},
\end{align}</div>
<p>
where we have multiplied the numerator and denominator by the inverse of <span class="math">\(\frac{1}{\sigma^2}I_{00}\)</span> in the second line. Similarly, the covariance of the test set is given by the lower right block of (\ref{shifted_cov}). This is,
</p>
<div class="math">\begin{align}\tag{A5} \label{covariance_test}
\Sigma_{11} - \Sigma_{10} \cdot \frac{1}{\sigma^2 I_{00} + \Sigma_{00}} \cdot \Sigma_{01}.
\end{align}</div>
<p>
The results (\ref{mean_test}) and (\ref{covariance_test}) give (\ref{posterior}).</p>
<h4 id="method-2">Method 2</h4>
<p>In this second method, we consider the joint distribution of a set of test points <span class="math">\(\textbf{f}_1\)</span> and the set of observed samples <span class="math">\(\textbf{f}_0\)</span>. Again, we assume that the function density has mean zero. The joint probability density for the two is then
</p>
<div class="math">\begin{align}\tag{A6}
p(\textbf{f}_0, \textbf{f}_1) \sim N \left (
\left ( \begin{matrix}
0 \\
0
\end{matrix} \right),
\left ( \begin{matrix}
\Sigma_{00} & \Sigma_{01} \\
\Sigma_{10} & \Sigma_{11}
\end{matrix} \right )
\right )
\end{align}</div>
<p>
Now, we use the result
</p>
<div class="math">\begin{align} \tag{A7}
p( \textbf{f}_1 \vert \textbf{f}_0) &= \frac{p( \textbf{f}_0, \textbf{f}_1)}{p( \textbf{f}_0)}.
\end{align}</div>
<p>
The last two expressions are all that are needed to derive (\ref{posterior}). The main challenge involves completing the square, and this can be done with the block matrix inversion formula, as in the previous derivation.</p>
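<p>As a small numerical sanity check of our own, one can verify that the square-completion form (\ref{square_complete}) and the block expressions (\ref{posterior}) agree:</p>

```python
import numpy as np

n0 = 3                                   # first 3 points are samples, last 2 test
x = np.linspace(-2.0, 2.0, 5)
Sigma = np.exp(-(x[:, None] - x[None, :])**2 / 2)
y = np.array([0.5, -1.0, 0.3])
s2 = 0.1                                 # measurement variance at each sample

# Method 1: complete the square over all five points (equation A1).
D = np.zeros((5, 5))
D[:n0, :n0] = np.eye(n0) / s2            # (1/sigma^2) I, zero on test rows
cov_full = np.linalg.inv(np.linalg.inv(Sigma) + D)
mean_full = cov_full @ D @ np.concatenate([y, np.zeros(2)])

# Block form (equation 5), marginalized to the test points.
S00, S01 = Sigma[:n0, :n0], Sigma[:n0, n0:]
S10, S11 = Sigma[n0:, :n0], Sigma[n0:, n0:]
A = np.linalg.inv(s2 * np.eye(n0) + S00)
mean_block = S10 @ A @ y
cov_block = S11 - S10 @ A @ S01

# The test-point block of the full posterior matches the block formulas.
```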
<h3 id="appendix-b-sklearn-implementation-and-other-kernels">Appendix B: SKLearn implementation and other kernels</h3>
<p><a href="https://efavdb.com/wp-content/uploads/2017/11/sklearn.jpg"><img alt="sklearn" src="https://efavdb.com/wp-content/uploads/2017/11/sklearn.jpg"></a></p>
<p>SKLearn provides the <code>GaussianProcessRegressor</code> class. This allows one to carry out fits and sampling in any dimension — i.e., it is more general than our minimal class in that it can fit feature vectors in more than one dimension. In addition, the <code>fit</code> method of the SKLearn class attempts to find an optimal set of hyper-parameters for a given set of data. This is done through maximization of the marginal likelihood, as described above. Here, we provide some basic notes on this class and the built-in kernels that one can use to define the covariance matrix <span class="math">\(\Sigma\)</span> in (\ref{prior}). We also include a simple code snippet illustrating calls.</p>
<h4 id="pre-defined-kernels">Pre-defined Kernels</h4>
<ul>
<li>Radial-basis function (<span class="caps">RBF</span>): This is the default — equivalent to our (\ref{covariance}). The <span class="caps">RBF</span> is characterized by a scale parameter, <span class="math">\(l\)</span>. In more than one dimension, this can be a vector, allowing for anisotropic correlation lengths.</li>
<li>White kernel: This is used for noise estimation — the docs suggest it is useful for estimating the global noise level, but not a pointwise one.</li>
<li>Matern: This is a generalized exponential decay, where the exponent is a power law in the separation distance. Special limits include the <span class="caps">RBF</span> and also an absolute-distance exponential decay. Certain parameter choices give sample functions that are once or twice differentiable.</li>
<li>Rational quadratic: This is <span class="math">\(\left(1 + \frac{d^2}{2 \alpha l^2}\right)^{-\alpha}\)</span>, equivalent to a scale mixture of RBF kernels over length scales.</li>
<li>Exp-Sine-Squared: This allows one to model periodic functions. This is just like the <span class="caps">RBF</span>, but the distance that gets plugged in is the sine of the actual distance. A periodicity parameter exists, as well as a “variance”
— the scale of the Gaussian suppression.</li>
<li>Dot product kernel: This takes the form <span class="math">\(1 + x_i \cdot x_j\)</span>. It’s not stationary, in the sense that the result changes if a constant translation is added in. The docs state that this kernel results from a linear regression analysis in which <span class="math">\(N(0,1)\)</span> priors are placed on the coefficients.</li>
<li>Kernels as objects: The kernels are objects, but support binary operations between them to create more complicated kernels — e.g., addition, multiplication, and exponentiation (the latter simply raises the initial kernel to a power). They all support analytic gradient evaluation. You can access all of the parameters in a kernel that you define via some helper functions — e.g., <code>kernel.get_params()</code>; <code>kernel.hyperparameters</code> is a list of all the hyper-parameters.</li>
</ul>
<h4 id="parameters">Parameters</h4>
<ul>
<li><code>n_restarts_optimizer</code>: This is the number of times to restart the fit — useful for exploration of multiple local minima. The default is zero.</li>
<li><code>alpha</code>: This optional argument allows one to pass in uncertainties for each measurement.</li>
<li><code>normalize_y</code>: This is used to indicate that the mean of the <span class="math">\(y\)</span>-values we’re looking for is not necessarily zero.</li>
</ul>
<h4 id="example-call">Example call</h4>
<p>The code snippet below carries out a simple fit. The result is the plot shown at the top of this section.</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">sklearn.gaussian_process.kernels</span> <span class="kn">import</span> <span class="n">RBF</span><span class="p">,</span> <span class="n">ConstantKernel</span> <span class="k">as</span> <span class="n">C</span>
<span class="kn">from</span> <span class="nn">sklearn.gaussian_process</span> <span class="kn">import</span> <span class="n">GaussianProcessRegressor</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="c1"># Build a model</span>
<span class="n">kernel</span> <span class="o">=</span> <span class="n">C</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">(</span><span class="mf">1e-3</span><span class="p">,</span> <span class="mf">1e3</span><span class="p">))</span> <span class="o">*</span> <span class="n">RBF</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="p">(</span><span class="mf">0.5</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="n">gp</span> <span class="o">=</span> <span class="n">GaussianProcessRegressor</span><span class="p">(</span><span class="n">kernel</span><span class="o">=</span><span class="n">kernel</span><span class="p">,</span> <span class="n">n_restarts_optimizer</span><span class="o">=</span><span class="mi">9</span><span class="p">)</span>
<span class="c1"># Some data</span>
<span class="n">xobs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mf">1.5</span><span class="p">],</span> <span class="p">[</span><span class="o">-</span><span class="mi">3</span><span class="p">]])</span>
<span class="n">yobs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">3</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
<span class="c1"># Fit the model to the data (optimize hyper parameters)</span>
<span class="n">gp</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">xobs</span><span class="p">,</span> <span class="n">yobs</span><span class="p">)</span>
<span class="c1"># Plot points and predictions</span>
<span class="n">x_set</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">)</span>
<span class="n">x_set</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">x_set</span><span class="p">])</span>
<span class="n">means</span><span class="p">,</span> <span class="n">sigmas</span> <span class="o">=</span> <span class="n">gp</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">x_set</span><span class="p">,</span> <span class="n">return_std</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">errorbar</span><span class="p">(</span><span class="n">x_set</span><span class="p">,</span> <span class="n">means</span><span class="p">,</span> <span class="n">yerr</span><span class="o">=</span><span class="n">sigmas</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_set</span><span class="p">,</span> <span class="n">means</span><span class="p">,</span> <span class="s1">'g'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="n">colors</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'g'</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">,</span> <span class="s1">'k'</span><span class="p">]</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">colors</span><span class="p">:</span>
<span class="n">y_set</span> <span class="o">=</span> <span class="n">gp</span><span class="o">.</span><span class="n">sample_y</span><span class="p">(</span><span class="n">x_set</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1000</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_set</span><span class="p">,</span> <span class="n">y_set</span><span class="p">,</span> <span class="n">c</span> <span class="o">+</span> <span class="s1">'--'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
</pre></div>
<p>More details on the sklearn implementation can be found <a href="http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html">here</a>.</p>
<h3 id="appendix-c-gp-classifiers">Appendix C: <span class="caps">GP</span> Classifiers</h3>
<p>Here, we describe how GPs are often used to fit binary classification data — data where the response variable <span class="math">\(y\)</span> can take on values of either <span class="math">\(0\)</span> or <span class="math">\(1\)</span>. The mathematics for <span class="caps">GP</span> Classifiers does not work out as cleanly as it does for <span class="caps">GP</span> Regressors. The reason is that the <span class="math">\(0 / 1\)</span> response is not Gaussian-distributed, which means that the posterior over <span class="math">\(f\)</span> is not Gaussian either. To make progress, one approximates the posterior as Gaussian, via the Laplace approximation.</p>
<p>The starting point is to write down a form for the probability of seeing a given <span class="math">\(y\)</span> value at <span class="math">\(x\)</span>. Encoding the two classes as <span class="math">\(y \in \{-1, +1\}\)</span>, one takes this to have the form,
</p>
<div class="math">\begin{align} \tag{A8} \label{classifier}
p(y \vert f(x)) = \frac{1}{1 + \exp\left (- y \times f(x)\right)}.
\end{align}</div>
<p>
This form is a natural non-linear generalization of logistic regression — see our post on this topic, <a href="http://efavdb.github.io/logistic-regression">here</a>.</p>
<p>To proceed, the prior for <span class="math">\(f\)</span> is taken to once again have form (\ref{prior}). Using this and (\ref{classifier}), we obtain the posterior for <span class="math">\(f\)</span>
</p>
<div class="math">\begin{align}
p(f \vert y) &\sim \prod_{i=1}^N \frac{1}{1 + \exp\left (- y_i \times f(x_i)\right)} \exp \left ( - \frac{1}{2} \sum_{ij=1}^N f_i \Sigma^{-1}_{ij} f_j \right) \\
&\approx N(\mu, \Sigma^{\prime}) \tag{A9}
\end{align}</div>
<p>
Here, the last line is the Laplace / Normal approximation to the line above it. Using this form, one can easily obtain confidence intervals and samples from the approximate posterior, as was done for regressors.</p>
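<p>As a concrete illustration, scikit-learn provides a <code>GaussianProcessClassifier</code> built on exactly this Laplace-approximation scheme. A minimal sketch follows — the toy data and kernel choices here are ours, made up for illustration:</p>

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Toy 1-d binary data: class 0 on the left, class 1 on the right
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit; the Laplace approximation to the posterior over f is handled internally
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)).fit(X, y)

# Class-membership probabilities at new test points
probs = gpc.predict_proba(np.array([[-2.5], [0.0], [2.5]]))
```

As with the regressor, the kernel hyperparameters are tuned during the fit by maximizing the (approximate) marginal likelihood.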
<h3 id="footnotes">Footnotes</h3>
<p>[1] The size of the <span class="math">\(\sigma_i\)</span> determines how precisely we know the function value at each of the <span class="math">\(x_i\)</span> points sampled — if they are all <span class="math">\(0\)</span>, we know the function exactly at these points, but not anywhere else.</p>
<p>[2] One might wonder whether introducing more points to the analysis would change the posterior statistics for the original <span class="math">\(N\)</span> points in question. It turns out that this is not the case for GPs: If one is interested only in the joint-statistics of these <span class="math">\(N\)</span> points, all others integrate out. For example, consider the goal of identifying the posterior distribution of <span class="math">\(f\)</span> at only a single test point <span class="math">\(x\)</span>. In this case, the posterior for the <span class="math">\(N = n+1\)</span> points follows from Bayes’s rule,
</p>
<div class="math">\begin{align} \tag{f1}
p(f(x_1), \ldots, f(x_n), f(x_{n+1}) \vert \{y\}) = \frac{p(\{y\} \vert f) p(f)}{p(\{y\})}.
\end{align}</div>
<p>
Now, by assumption, <span class="math">\(p(\{y\} \vert f)\)</span> depends only on <span class="math">\(f(x_1),\ldots, f(x_n)\)</span> — the values of <span class="math">\(f\)</span> where <span class="math">\(y\)</span> was sampled. Integrating over all points except the sample set and test point <span class="math">\(x\)</span> gives
</p>
<div class="math">\begin{align} \tag{f2}
&p(f(x_1), \ldots, f(x_{n+1}) \vert \{y\}) =\\
& \frac{p(\{y\} \vert f(x_1),\ldots,f(x_n))}{p(\{y\})} \int p(f) \prod_{i \not \in \{1, \ldots, n+1\}} df_i
\end{align}</div>
<p>
The result of the integral above is a Normal distribution — one with covariance given by the original covariance function evaluated only at the points <span class="math">\(\{x_1, \ldots, x_{N} \}\)</span>. This fact is proven in our post on Normal distributions — see equation (22) of that post, <a href="http://efavdb.github.io/normal-distributions">here</a>. The result implies that we can get the correct sampling statistics on any set of test points, simply by analyzing these alongside the sampled points. This fact is what allows us to tractably treat the formally-infinite number of degrees of freedom associated with GPs.</p>
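<p>This marginalization property is easy to check numerically: sampling a multivariate normal built from a covariance function and then keeping only a subset of coordinates reproduces the corresponding sub-block of the covariance matrix. A small sketch (the grid, length scale, and sample count below are arbitrary choices of ours):</p>

```python
import numpy as np

# RBF covariance on five grid points (length scale chosen arbitrarily)
xs = np.linspace(0.0, 1.0, 5)
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / 0.3 ** 2)

# Draw many samples of the 5-dimensional Gaussian
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(5), K, size=200_000)

# Empirical covariance of coordinates 0 and 2 alone...
emp = np.cov(samples[:, [0, 2]].T)
# ...matches the corresponding sub-block of K: integrating out the other
# coordinates leaves the covariance of the retained points unchanged
sub = K[np.ix_([0, 2], [0, 2])]
```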
<p>[3] We have a prior post illustrating this point — see <a href="http://efavdb.github.io/model-selection">here</a>.</p>
<p>[4] The marginal likelihood is equal to the denominator of (\ref{Bayes}), which we previously ignored.</p>
<p>[5] We made this gif through adapting the skopt tutorial code, <a href="https://scikit-optimize.github.io/notebooks/bayesian-optimization.html">here</a>.</p>
<p>[6] For the free text by Rasmussen and Williams, see <a href="http://www.gaussianprocess.org/">here</a>.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Martingales2017-10-20T07:44:00-07:002017-10-20T07:44:00-07:00Jonathan Landytag:efavdb.com,2017-10-20:/martingales<p>Here, I give a quick review of the concept of a Martingale. A Martingale is a sequence of random variables satisfying a specific expectation conservation law. If one can identify a Martingale relating to some other sequence of random variables, its use can sometimes make quick work of certain expectation …</p><p>Here, I give a quick review of the concept of a Martingale. A Martingale is a sequence of random variables satisfying a specific expectation conservation law. If one can identify a Martingale relating to some other sequence of random variables, its use can sometimes make quick work of certain expectation value evaluations.</p>
<p>This note is adapted from Chapter 2 of <em>Stochastic Calculus and Financial Applications</em>, by J. Michael Steele.</p>
<h3 id="definition">Definition</h3>
<p>Often in random processes, one is interested in characterizing a sequence of random variables <span class="math">\(\{X_i\}\)</span>. The example we will keep in mind is a set of variables <span class="math">\(X_i \in \{-1, 1\}\)</span> corresponding to the steps of an unbiased random walk in one dimension. A Martingale process <span class="math">\(M_i = f(X_1, X_2, \ldots X_i)\)</span> is a random variable derived from the <span class="math">\(X_i\)</span> variables that satisfies the following conservation law
</p>
<div class="math">\begin{align} \tag{1}
E(M_i | X_1, \ldots X_{i-1}) = M_{i-1}.
\end{align}</div>
<p>
For example, in the unbiased random walk example, if we take <span class="math">\(S_n = \sum_{i=1}^n X_i\)</span>, then <span class="math">\(E(S_n \vert X_1, \ldots X_{n-1}) = S_{n-1}\)</span>, so <span class="math">\(S_n\)</span> is a Martingale. If we can develop or identify a Martingale for a given <span class="math">\(\{X_i\}\)</span> process, it can often help us to quickly evaluate certain expectation values relating to the underlying process. Three useful Martingales follow.</p>
<ol>
<li>Again, the sum <span class="math">\(S_n = \sum_{i=1}^n X_i\)</span> is a Martingale, provided <span class="math">\(E(X_i) = 0\)</span> for all <span class="math">\(i\)</span>.</li>
<li>The expression <span class="math">\(S_n^2 - n \sigma^2\)</span> is a Martingale, provided <span class="math">\(E(X_i) = 0\)</span> and <span class="math">\(E(X_i^2) = \sigma^2\)</span> for all <span class="math">\(i\)</span>. Proof: <div class="math">\begin{align} \tag{2}
E(S_n^2 - n \sigma^2 | X_1, \ldots X_{n-1}) &= \sigma^2 + 2 E(X_n) S_{n-1} + S_{n-1}^2 - n \sigma^2\\
&= S_{n-1}^2 - (n-1) \sigma^2.
\end{align}</div>
</li>
<li>The product <span class="math">\(P_n = \prod_{i=1}^n X_i\)</span> is a Martingale, provided <span class="math">\(E(X_i) = 1\)</span> for all <span class="math">\(i\)</span>. One example of interest is
<div class="math">\begin{align} \tag{3}
P_n = \frac{\exp \left ( \lambda \sum_{i=1}^n X_i\right)}{E(\exp \left ( \lambda X \right))^n}.
\end{align}</div>
Here, <span class="math">\(\lambda\)</span> is a free tuning parameter. If we choose a <span class="math">\(\lambda\)</span> such that <span class="math">\(E(\exp(\lambda X)) = 1\)</span> for our process, we can get a particularly simple form.</li>
</ol>
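<p>A quick numerical sanity check of example 2: for the unbiased <span class="math">\(\pm 1\)</span> walk, the sample mean of <span class="math">\(S_n^2 - n\)</span> should hover near zero at every <span class="math">\(n\)</span>. A sketch (the horizon, trial count, and seed below are arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 20, 100_000
steps = rng.choice([-1, 1], size=(trials, n))   # the X_i of the unbiased walk
S = steps.cumsum(axis=1)                        # S_1, ..., S_n for each trial

# E(S_n^2 - n sigma^2) is conserved, and equals 0 here since S_0 = 0, sigma^2 = 1
martingale_means = (S ** 2 - np.arange(1, n + 1)).mean(axis=0)
```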
<h3 id="stopped-processes">Stopped processes</h3>
<p>In some games, we may want to set up rules that say we will stop the game at time <span class="math">\(\tau\)</span> if some condition is met at index <span class="math">\(\tau\)</span>. For example, we may stop a random walk (initialized at zero) if the walker gets to either position <span class="math">\(A\)</span> or <span class="math">\(-B\)</span> (wins <span class="math">\(A\)</span> or loses <span class="math">\(B\)</span>). This motivates defining the stopped Martingale as,
</p>
<div class="math">\begin{align}
M_{n \wedge \tau} = \begin{cases}
M_n &\text{if } \tau \geq n \\
M_{\tau} &\text{else}. \tag{4}
\end{cases}
\end{align}</div>
<p>
Here, we prove that if <span class="math">\(M_n\)</span> is a Martingale, then so is <span class="math">\(M_{n \wedge \tau}\)</span>. This is useful because it tells us that the stopped Martingale obeys the same conservation law as the unstopped version.</p>
<p>First, we note that if <span class="math">\(A_i \equiv f_2(X_1, \ldots X_{i-1})\)</span> is some function of the observations so far, then the transformed process
</p>
<div class="math">\begin{align} \tag{5}
\tilde{M}_n \equiv M_0 + \sum_{i=1}^n A_i (M_i - M_{i-1})
\end{align}</div>
<p>
is also a Martingale. Proof:
</p>
<div class="math">\begin{align} \tag{6}
E(\tilde{M}_n | X_1, \ldots X_{n-1}) = A_n \left ( E(M_n | X_1, \ldots X_{n-1}) - M_{n-1} \right) + \tilde{M}_{n-1} = \tilde{M}_{n-1}.
\end{align}</div>
<p>With this result we can prove the stopped Martingale is also a Martingale. We can do that by writing <span class="math">\(A_i = 1(\tau \geq i)\)</span> — where <span class="math">\(1\)</span> is the indicator function. Plugging this into the above, we get the transformed Martingale,
</p>
<div class="math">\begin{align} \nonumber \tag{7}
\tilde{M}_n &= M_0 + \sum_{i=1}^n 1(\tau \geq i) (M_i - M_{i-1}) \\
&= \begin{cases}
M_n & \text{if } \tau \geq n \\
M_{\tau} & \text{else}.
\end{cases}
\end{align}</div>
<p>
This is the stopped Martingale — indeed a Martingale, by the above.</p>
<h3 id="example-applications">Example applications</h3>
<h4 id="problem-1">Problem 1</h4>
<p>Consider an unbiased random walker that takes steps of size <span class="math">\(1\)</span>. If we stop the walk as soon as he reaches either <span class="math">\(A\)</span> or <span class="math">\(-B\)</span>, what is the probability that he is at <span class="math">\(A\)</span> when the game stops?</p>
<p>Solution: Let <span class="math">\(\tau\)</span> be the stopping time and let <span class="math">\(S_n = \sum_{i=1}^n X_i\)</span> be the walker’s position at time <span class="math">\(n\)</span>. We know that <span class="math">\(S_n\)</span> is a Martingale. By the above, so then is <span class="math">\(S_{n \wedge \tau}\)</span>, the stopped process Martingale. By the Martingale property
</p>
<div class="math">\begin{align} \tag{8}
E(S_{n \wedge \tau}) = E(S_{i \wedge \tau})
\end{align}</div>
<p>
for all <span class="math">\(i\)</span>. In particular, plugging in <span class="math">\(i = 0\)</span> gives <span class="math">\(E(S_{n \wedge \tau}) = 0\)</span>. If we take <span class="math">\(n \to \infty\)</span>, then
</p>
<div class="math">\begin{align} \tag{9}
\lim_{n \to \infty} E(S_{n \wedge \tau}) \to E(S_{\tau}) = 0.
\end{align}</div>
<p>
But we also have
</p>
<div class="math">\begin{align} \tag{10}
E(S_{\tau}) = P(A) \cdot A - (1 - P(A)) \cdot B.
\end{align}</div>
<p>
Equating (9) and (10) gives
</p>
<div class="math">\begin{equation} \tag{11}
P(A) = \frac{B}{A + B}
\end{equation}</div>
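<p>This result is easy to corroborate with a short simulation (the endpoints, trial count, and seed below are arbitrary choices):</p>

```python
import random

def p_hit_A(A, B, trials=20_000, seed=0):
    """Fraction of unbiased +/-1 walks from 0 that reach A before -B."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        s = 0
        while -B < s < A:
            s += rng.choice((-1, 1))
        wins += (s == A)
    return wins / trials

# Theory: P(A) = B / (A + B), e.g. 2/5 for A = 3, B = 2
```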
<h4 id="problem-2">Problem 2</h4>
<p>In the game above, what is the expected stopping time? Solution: Use the stopped version of the Martingale <span class="math">\(S_n^2 - n \sigma^2\)</span>.</p>
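<p>Carrying that suggestion through: optional stopping applied to <span class="math">\(S_n^2 - n \sigma^2\)</span> (unit steps, so <span class="math">\(\sigma^2 = 1\)</span>), together with the result of Problem 1, gives <span class="math">\(E(\tau) = AB\)</span>. A quick simulation check (parameters arbitrary):</p>

```python
import random

def mean_stop_time(A, B, trials=20_000, seed=1):
    """Average number of steps before an unbiased walk from 0 hits A or -B."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        s = steps = 0
        while -B < s < A:
            s += rng.choice((-1, 1))
            steps += 1
        total += steps
    return total / trials

# E(S_tau^2 - tau) = 0 and E(S_tau^2) = P(A) A^2 + (1 - P(A)) B^2 = AB
```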
<h4 id="problem-3">Problem 3</h4>
<p>In a biased version of the random walk game, what is the probability of stopping at <span class="math">\(A\)</span>? Solution: Use the stopped Martingale of form <span class="math">\(P_n = \frac{\exp \left ( \lambda \sum_{i=1}^n X_i\right)}{E(\exp \left ( \lambda X \right))^n}\)</span>, with <span class="math">\(\exp[\lambda] = q/p\)</span>, where <span class="math">\(p = 1-q\)</span> is the probability of a step to the right.</p>
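<p>Completing the calculation: with <span class="math">\(r \equiv q/p\)</span>, the choice <span class="math">\(\exp[\lambda] = q/p\)</span> makes <span class="math">\(r^{S_n}\)</span> a Martingale, and optional stopping gives <span class="math">\(E(r^{S_\tau}) = 1\)</span>, which rearranges to <span class="math">\(P(A) = (1 - r^B)/(1 - r^{A+B})\)</span>. A simulation sketch (all parameters below are arbitrary):</p>

```python
import random

def p_hit_A_biased(A, B, p, trials=40_000, seed=2):
    """Fraction of biased walks from 0 (step +1 w.p. p) that reach A before -B."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        s = 0
        while -B < s < A:
            s += 1 if rng.random() < p else -1
        wins += (s == A)
    return wins / trials

def p_hit_A_theory(A, B, p):
    """Closed form from the exponential martingale, r = q / p (p != 1/2)."""
    r = (1 - p) / p
    return (1 - r ** B) / (1 - r ** (A + B))
```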
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Logistic Regression2017-07-29T19:10:00-07:002017-07-29T19:10:00-07:00Jonathan Landytag:efavdb.com,2017-07-29:/logistic-regression<p>We review binary logistic regression. In particular, we derive a) the equations needed to fit the algorithm via gradient descent, b) the maximum likelihood fit’s asymptotic coefficient covariance matrix, and c) expressions for model test point class membership probability confidence intervals. We also provide python code implementing a minimal …</p><p>We review binary logistic regression. In particular, we derive a) the equations needed to fit the algorithm via gradient descent, b) the maximum likelihood fit’s asymptotic coefficient covariance matrix, and c) expressions for model test point class membership probability confidence intervals. We also provide python code implementing a minimal “LogisticRegressionWithError” class whose “predict_proba” method returns prediction confidence intervals alongside its point estimates.</p>
<p>Our python code can be downloaded from our github page, <a href="https://github.com/EFavDB/logistic-regression-with-error">here</a>. Its use requires the jupyter, numpy, sklearn, and matplotlib packages.</p>
<h3 id="introduction">Introduction</h3>
<p>The logistic regression model is a linear classification model that can be used to fit binary data — data where the label one wishes to predict can take on one of two values — e.g., <span class="math">\(0\)</span> or <span class="math">\(1\)</span>. Its linear form makes it a convenient choice of model for fits that are required to be interpretable. Another of its virtues is that it can — with relative ease — be set up to return both point estimates and confidence intervals for test point class membership probabilities. The availability of confidence intervals allows one to flag test points where the model prediction is not precise, which can be useful in some applications — e.g., fraud detection.</p>
<p>In this note, we derive the expressions needed to fit the logistic model to a training data set. We assume the training data consists of a set of <span class="math">\(n\)</span> feature vector-label pairs, <span class="math">\(\{(\vec{x}_i, y_i), \ i = 1, 2, \ldots, n\}\)</span>, where the feature vectors <span class="math">\(\vec{x}_i\)</span> belong to some <span class="math">\(m\)</span>-dimensional space and the labels are binary, <span class="math">\(y_i \in \{0, 1\}.\)</span> The logistic model states that the probability of belonging to class <span class="math">\(1\)</span> is given by
</p>
<div class="math">\begin{eqnarray}\tag{1} \label{model1}
p(y=1 \vert \vec{x}) \equiv \frac{1}{1 + e^{- \vec{\beta} \cdot \vec{x} } },
\end{eqnarray}</div>
<p>
where <span class="math">\(\vec{\beta}\)</span> is a coefficient vector characterizing the model. Note that with this choice of sign in the exponent, predictor vectors <span class="math">\(\vec{x}\)</span> having a large, positive component along <span class="math">\(\vec{\beta}\)</span> will be predicted to have a large probability of being in class <span class="math">\(1\)</span>. The probability of class <span class="math">\(0\)</span> is given by the complement,
</p>
<div class="math">\begin{eqnarray}\tag{2} \label{model2}
p(y=0 \vert \vec{x}) \equiv 1 - p(y=1 \vert \vec{x}) = \frac{1}{1 + e^{ \vec{\beta} \cdot \vec{x} } }.
\end{eqnarray}</div>
<p>
The latter equality above follows from simplifying algebra, after plugging in (\ref{model1}) for <span class="math">\(p(y=1 \vert \vec{x}).\)</span></p>
<p>To fit the Logistic model to a training set — i.e., to find a good choice for the fit parameter vector <span class="math">\(\vec{\beta}\)</span> — we consider here only the maximum-likelihood solution. This is that <span class="math">\(\vec{\beta}^*\)</span> that maximizes the conditional probability of observing the training data. The essential results we review below are 1) a proof that the maximum likelihood solution can be found by gradient descent, and 2) a derivation for the asymptotic covariance matrix of <span class="math">\(\vec{\beta}\)</span>. This latter result provides the basis for returning point estimate confidence intervals.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2017/07/errorbar.png"><img alt="errorbar" src="https://efavdb.com/wp-content/uploads/2017/07/errorbar.png"></a></p>
<p>On our GitHub <a href="https://github.com/EFavDB/logistic-regression-with-error">page</a>, we provide a Jupyter notebook that contains some minimal code extending the SKLearn LogisticRegression class. This extension makes use of the results presented here and allows for class probability confidence intervals to be returned for individual test points. In the notebook, we apply the algorithm to the SKLearn Iris dataset. The figure at right illustrates the output of the algorithm along a particular cut through the Iris data set parameter space. The y-axis represents the probability of a given test point belonging to Iris class <span class="math">\(1\)</span>. The error bars in the plot provide insight that is completely missed when considering the point estimates only. For example, notice that the error bars are quite large for each of the far right points, despite the fact that the point estimates there are each near <span class="math">\(1\)</span>. Without the error bars, the high probability of these point estimates might easily be misinterpreted as implying high model confidence.</p>
<p>Our derivations below rely on some prerequisites: Properties of covariance matrices, the multivariate Cramer-Rao theorem, and properties of maximum likelihood estimators. These concepts are covered in two of our prior posts [<span class="math">\(1\)</span>, <span class="math">\(2\)</span>].</p>
<h3 id="optimization-by-gradient-descent">Optimization by gradient descent</h3>
<p>In this section, we derive expressions for the gradient of the negative-log likelihood loss function and also demonstrate that this loss is everywhere convex. The latter result is important because it implies that gradient descent can be used to find the maximum likelihood solution.</p>
<p>Again, to fit the logistic model to a training set, our aim is to find — and also to set the parameter vector to — the maximum likelihood value. Assuming the training set samples are independent, the likelihood of observing the training set labels is given by
</p>
<div class="math">\begin{eqnarray}
L &\equiv& \prod_i p(y_i \vert \vec{x}_i) \\
&=& \prod_{i: y_i = 1} \frac{1}{1 + e^{-\vec{\beta} \cdot \vec{x}_i}} \prod_{i: y_i = 0} \frac{1}{1 + e^{\vec{\beta} \cdot \vec{x}_i}}.
\tag{3} \label{likelihood}
\end{eqnarray}</div>
<p>
Maximizing this is equivalent to minimizing its negative logarithm — a cost function that is somewhat easier to work with,
</p>
<div class="math">\begin{eqnarray}
J &\equiv& -\log L \\
&=& \sum_{\{i: y_i = 1 \}} \log \left (1 + e^{- \vec{\beta} \cdot \vec{x}_i } \right ) + \sum_{\{i: y_i = 0 \}} \log \left (1 + e^{\vec{\beta} \cdot \vec{x}_i } \right ).
\tag{4} \label{costfunction}
\end{eqnarray}</div>
<p>
The maximum-likelihood solution, <span class="math">\(\vec{\beta}^*\)</span>, is that coefficient vector that minimizes the above. Note that <span class="math">\(\vec{\beta}^*\)</span> will be a function of the random sample, and so will itself be a random variable — characterized by a distribution having some mean value, covariance, etc. Given enough samples, a theorem on maximum-likelihood asymptotics (Cramer-Rao) guarantees that this distribution will be unbiased — i.e., it will have mean value given by the correct parameter values — and will also be of minimal covariance [<span class="math">\(1\)</span>]. This theorem is one of the main results motivating use of the maximum-likelihood solution.</p>
<p>Because <span class="math">\(J\)</span> is convex (demonstrated below), the logistic regression maximum-likelihood solution can always be found by gradient descent. That is, one need only iteratively update <span class="math">\(\vec{\beta}\)</span> in the direction of the negative <span class="math">\(\vec{\beta}\)</span>-gradient of <span class="math">\(J\)</span>, which is
</p>
<div class="math">\begin{eqnarray}
- \nabla_{\vec{\beta}} J &=& \sum_{\{i: y_i = 1 \}}\vec{x}_i \frac{ e^{- \vec{\beta} \cdot \vec{x}_i } }{1 + e^{- \vec{\beta} \cdot \vec{x}_i }}
- \sum_{\{i: y_i = 0 \}} \vec{x}_i \frac{ e^{\vec{\beta} \cdot \vec{x}_i }}{1 + e^{\vec{\beta} \cdot \vec{x}_i } } \\
&\equiv& \sum_{\{i: y_i = 1 \}}\vec{x}_i p(y=0 \vert \vec{x}_i)
-\sum_{\{i: y_i = 0 \}} \vec{x}_i p(y= 1 \vert \vec{x}_i). \tag{5} \label{gradient}
\end{eqnarray}</div>
<p>
Notice that the terms that contribute the most here are those that are most strongly misclassified — i.e., those where the model’s predicted probability for the observed class is very low. For example, a point with true label <span class="math">\(y=1\)</span> but large model <span class="math">\(p(y=0 \vert \vec{x})\)</span> will contribute a significant push on <span class="math">\(\vec{\beta}\)</span> in the direction of <span class="math">\(\vec{x}\)</span> — so that the model will be more likely to predict <span class="math">\(y=1\)</span> at this point going forward. Notice that the contribution of a term above is also proportional to the length of its feature vector — training points further from the origin have a stronger impact on the optimization process than those near the origin (at fixed classification difficulty).</p>
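<p>The descent step above can be sketched in a few lines. The learning rate, iteration count, and synthetic data below are illustrative choices of ours, and no intercept or regularization is included:</p>

```python
import numpy as np

def fit_logistic_gd(X, y, lr=0.5, n_iter=5_000):
    """Gradient descent on J: ascend the log-likelihood from beta = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p1 = 1.0 / (1.0 + np.exp(-X @ beta))   # p(y = 1 | x) under current beta
        beta += lr * X.T @ (y - p1) / len(y)   # (y - p1) reproduces eq. (5) term by term
    return beta

# Synthetic data drawn from the model itself
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_beta = np.array([1.5, -1.0])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
beta_hat = fit_logistic_gd(X, y)   # should land near true_beta
```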
<p>The Hessian (second partial derivative) matrix of the cost function follows from taking a second gradient of the above. With a little algebra, one can show that this has <span class="math">\(i-j\)</span> component given by,
</p>
<div class="math">\begin{eqnarray}
H(J)_{ij} &\equiv& -\partial_{\beta_j} \partial_{\beta_i} \log L \\
&=& \sum_k x_{k; i} x_{k; j} p(y= 0 \vert \vec{x}_k) p(y= 1 \vert \vec{x}_k). \tag{6} \label{Hessian}
\end{eqnarray}</div>
<p>
We can prove that this is positive semi-definite using the fact that a matrix <span class="math">\(M\)</span> is necessarily positive semi-definite if <span class="math">\(\vec{s}^T \cdot M \cdot \vec{s} \geq 0\)</span> for all real <span class="math">\(\vec{s}\)</span> [<span class="math">\(2\)</span>]. Dotting our Hessian above on both sides by an arbitrary vector <span class="math">\(\vec{s}\)</span>, we obtain
</p>
<div class="math">\begin{eqnarray}
\vec{s}^T \cdot H \cdot \vec{s} &\equiv& \sum_k \sum_{ij} s_i x_{k; i} x_{k; j} s_j p(y= 0 \vert \vec{x}_k) p(y= 1 \vert \vec{x}_k) \\
&=& \sum_k \vert \vec{s} \cdot \vec{x}_k \vert^2 p(y= 0 \vert \vec{x}_k) p(y= 1 \vert \vec{x}_k) \geq 0.
\tag{7} \label{convex}
\end{eqnarray}</div>
<p>
The last form follows from the fact that both <span class="math">\(p(y= 0 \vert \vec{x}_k)\)</span> and <span class="math">\(p(y= 1 \vert \vec{x}_k)\)</span> are non-negative. This holds for any <span class="math">\(\vec{\beta}\)</span> and any <span class="math">\(\vec{s}\)</span>, which implies that our Hessian is everywhere positive semi-definite. Because of this, convex optimization strategies — e.g., gradient descent — can always be applied to find the global maximum-likelihood solution.</p>
<h3 id="coefficient-uncertainty-and-significance-tests">Coefficient uncertainty and significance tests</h3>
<p>The solution <span class="math">\(\vec{\beta}^*\)</span> that minimizes <span class="math">\(J\)</span> — which can be found by gradient descent — is a maximum likelihood estimate. In the asymptotic limit of a large number of samples, maximum-likelihood parameter estimates saturate the Cramer-Rao lower bound [<span class="math">\(2\)</span>]. That is, the parameter covariance matrix satisfies [<span class="math">\(3\)</span>],
</p>
<div class="math">\begin{eqnarray}
\text{cov}(\vec{\beta}^*, \vec{\beta}^*) &\sim& H(J)^{-1} \\
&\approx& \left( \sum_k \vec{x}_{k} \vec{x}_{k}^T \, p(y= 0 \vert \vec{x}_k) p(y= 1 \vert \vec{x}_k) \right)^{-1}.
\tag{8} \label{covariance}
\end{eqnarray}</div>
<p>
Notice that the covariance matrix will be small if the denominator above is large. Along a given direction, this requires that the training set contains samples over a wide range of values in that direction (we discuss this at some length in the analogous section of our post on Linear Regression [<span class="math">\(4\)</span>]). For a term to contribute in the denominator, the model must also have some confusion about its values: If there are no difficult-to-classify training examples, this means that there are no examples near the decision boundary. When this occurs, there will necessarily be a lot of flexibility in where the decision boundary is placed, resulting in large parameter variances.</p>
<p>Although the form above only holds in the asymptotic limit, we can always use it to approximate the true covariance matrix — keeping in mind that the accuracy of the approximation will degrade when working with small training sets. For example, using (\ref{covariance}), the asymptotic variance for a single parameter can be approximated by
</p>
<div class="math">\begin{eqnarray}
\tag{9} \label{single_cov}
\sigma^2_{\beta^*_i} = \text{cov}(\vec{\beta}^*, \vec{\beta}^*)_{ii}.
\end{eqnarray}</div>
<p>
In the asymptotic limit, the maximum-likelihood parameters will be Normally-distributed [<span class="math">\(1\)</span>], so we can provide confidence intervals for the parameters as
</p>
<div class="math">\begin{eqnarray}
\tag{10} \label{parameter_interval}
\beta_i \in \left ( \beta^*_i - z \sigma_{\beta^*_i}, \beta_i^* + z \sigma_{\beta^*_i} \right),
\end{eqnarray}</div>
<p>
where the value of <span class="math">\(z\)</span> sets the size of the interval. For example, choosing <span class="math">\(z = 2\)</span> gives an interval construction procedure that will cover the true value approximately <span class="math">\(95\%\)</span> of the time — a result of Normal statistics [<span class="math">\(5\)</span>]. Checking which intervals do not cross zero provides a method for identifying which features contribute significantly to a given fit.</p>
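<p>To make this concrete, the following minimal sketch fits a model to synthetic data and applies (\ref{covariance}) and (\ref{parameter_interval}). Here we take Newton steps rather than plain gradient descent, since (\ref{Hessian}) gives the Hessian in closed form; the data set, seed, and variable names are purely illustrative:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: an intercept plus two features; the last coefficient
# is truly zero, so its interval should typically cover zero.
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 2.0, 0.0])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

# Newton's method: each step solves H(J) . delta = gradient of log L.
beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    H = X.T @ (X * (p * (1 - p))[:, None])  # Hessian, eq. (6)
    beta += np.linalg.solve(H, X.T @ (y - p))

# Asymptotic covariance, eq. (8), and z = 2 intervals, eq. (10).
cov = np.linalg.inv(H)
sigma = np.sqrt(np.diag(cov))
lower, upper = beta - 2 * sigma, beta + 2 * sigma
significant = (lower > 0) | (upper < 0)  # interval does not cross zero
```

With this setup the interval for the second coefficient should sit well away from zero, while the truly-zero coefficient will typically be flagged as insignificant.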
<h3 id="prediction-confidence-intervals">Prediction confidence intervals</h3>
<p>The probability of class <span class="math">\(1\)</span> for a test point <span class="math">\(\vec{x}\)</span> is given by (\ref{model1}). Notice that this depends on <span class="math">\(\vec{x}\)</span> and <span class="math">\(\vec{\beta}\)</span> only through the dot product <span class="math">\(\vec{x} \cdot \vec{\beta}\)</span>. At fixed <span class="math">\(\vec{x}\)</span>, the variance (uncertainty) in this dot product follows from the coefficient covariance matrix above: We have [<span class="math">\(2\)</span>],
</p>
<div class="math">\begin{eqnarray}
\tag{11} \label{logit_var}
\sigma^2_{\vec{x} \cdot \vec{\beta}} \equiv \vec{x}^T \cdot \text{cov}(\vec{\beta}^*, \vec{\beta}^*) \cdot \vec{x}.
\end{eqnarray}</div>
<p>
With this result, we can obtain an expression for the confidence interval for the dot product, or equivalently a confidence interval for the class probability. For example, the asymptotic interval for class <span class="math">\(1\)</span> probability is given by
</p>
<div class="math">\begin{eqnarray}
\tag{12} \label{prob_interval}
p(y=1 \vert \vec{x}) \in \left ( \frac{1}{1 + e^{- \vec{x} \cdot \vec{\beta}^* + z \sigma_{\vec{x} \cdot \vec{\beta}^*}}}, \frac{1}{1 + e^{- \vec{x} \cdot \vec{\beta}^* - z \sigma_{\vec{x} \cdot \vec{\beta}^*}}} \right),
\end{eqnarray}</div>
<p>
where <span class="math">\(z\)</span> again sets the size of the interval as above (<span class="math">\(z=2\)</span> gives a <span class="math">\(95\%\)</span> confidence interval, etc. [<span class="math">\(5\)</span>]), and <span class="math">\(\sigma_{\vec{x} \cdot \vec{\beta}^*}\)</span> is obtained from (\ref{covariance}) and (\ref{logit_var}).</p>
<p>The results (\ref{covariance}), (\ref{logit_var}), and (\ref{prob_interval}) are used in our Jupyter notebook. There we provide code for a minimal Logistic Regression class implementation that returns both point estimates and prediction confidence intervals for each test point. We used this code to generate the plot shown in the post introduction. Again, the code can be downloaded <a href="https://github.com/EFavDB/logistic-regression-with-error">here</a> if you are interested in trying it out.</p>
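<p>For readers who prefer not to download the notebook, a self-contained stand-in sketch of (\ref{logit_var}) and (\ref{prob_interval}) follows (synthetic data; the helper name is ours, not the notebook's):</p>

```python
import numpy as np

rng = np.random.default_rng(4)

# Fit a one-feature model (plus intercept) to synthetic data.
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ np.array([0.0, 1.5])))).astype(float)

beta = np.zeros(2)
for _ in range(25):  # Newton steps, using the Hessian of eq. (6)
    p = 1 / (1 + np.exp(-X @ beta))
    H = X.T @ (X * (p * (1 - p))[:, None])
    beta += np.linalg.solve(H, X.T @ (y - p))

cov = np.linalg.inv(H)  # eq. (8), at the point estimate

def predict_with_interval(x, z=2.0):
    """Point estimate and confidence interval for p(y=1|x), eqs. (11)-(12)."""
    logit = x @ beta
    s = np.sqrt(x @ cov @ x)  # sigma_{x . beta}, eq. (11)
    lo, mid, hi = [1 / (1 + np.exp(-(logit + d))) for d in (-z * s, 0.0, z * s)]
    return lo, mid, hi

lo, mid, hi = predict_with_interval(np.array([1.0, 0.5]))
```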
<h3 id="summary">Summary</h3>
<p>In this note, we have 1) reviewed how to fit a logistic regression model to a binary data set for classification purposes, and 2) derived the expressions needed to return class membership probability confidence intervals for test points.</p>
<p>Confidence intervals are not available out-of-the-box for many machine learning models, despite the practical utility they often provide. That logistic regression returns meaningful error bars with relative ease is therefore a notable advantage.</p>
<h3 id="footnotes">Footnotes</h3>
<p>[<span class="math">\(1\)</span>] Our notes on the maximum-likelihood estimators can be found <a href="http://efavdb.github.io/maximum-likelihood-asymptotics">here</a>.</p>
<p>[<span class="math">\(2\)</span>] Our notes on covariance matrices and the multivariate Cramer-Rao theorem can be found <a href="http://efavdb.github.io/multivariate-cramer-rao-bound">here</a>.</p>
<p>[<span class="math">\(3\)</span>] The Cramer-Rao identity [<span class="math">\(2\)</span>] states that the covariance matrix of the maximum-likelihood estimators approaches the inverse of the Hessian matrix of the negative log-likelihood, evaluated at the true parameter values. Here, we approximate this by evaluating the Hessian at the maximum-likelihood point estimate instead.</p>
<p>[<span class="math">\(4\)</span>] Our notes on linear regression can be found <a href="http://efavdb.github.io/linear-regression">here</a>.</p>
<p>[<span class="math">\(5\)</span>] Our notes on Normal distributions can be found <a href="http://efavdb.github.io/normal-distributions">here</a>.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Normal Distributions2017-05-13T21:48:00-07:002017-05-13T21:48:00-07:00Jonathan Landytag:efavdb.com,2017-05-13:/normal-distributions<p>I review — and provide derivations for — some basic properties of Normal distributions. Topics currently covered: (i) Their normalization, (ii) Samples from a univariate Normal, (iii) Multivariate Normal distributions, (iv) Central limit theorem.</p>
<h3 id="introduction">Introduction</h3>
<p><a href="https://efavdb.com/wp-content/uploads/2017/05/carl-f-gauss-4.jpg"><img alt="carl-f-gauss-4" src="https://efavdb.com/wp-content/uploads/2017/05/carl-f-gauss-4.jpg"></a></p>
<p>This post contains a running list of properties (with derivations) relating to Normal (Gaussian) distributions. Normal distributions are important for two principal reasons: Their significance a la the central limit theorem and their appearance in saddle point approximations to more general integrals. As usual, the results here assume familiarity with calculus and linear algebra.</p>
<p>Pictured at right is an image of Gauss — “Few, but ripe.”</p>
<h3 id="normalization">Normalization</h3>
<ul>
<li>Consider the integral
<div class="math">\begin{align} \tag{1}
I = \int_{-\infty}^{\infty} e^{-x^2} dx.
\end{align}</div>
To evaluate, consider the value of <span class="math">\(I^2\)</span>. This is
<div class="math">\begin{align}\tag{2}
I^2 &= \int_{-\infty}^{\infty} e^{-x^2} dx \int_{-\infty}^{\infty} e^{-y^2} dy \\
&= \int_0^{\infty} e^{-r^2} 2 \pi r dr = -\pi e^{-r^2} \vert_0^{\infty} = \pi.
\end{align}</div>
Here, I have used the usual trick of transforming the integral over the plane to one over polar <span class="math">\((r, \theta)\)</span> coordinates. The result above gives the normalization for the Normal distribution.</li>
</ul>
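<p>A quick numerical check of this result, using a trapezoid sum on a truncated grid (the truncation point is arbitrary, chosen so the neglected tails are far below the target accuracy):</p>

```python
import numpy as np

# Trapezoid-rule check of the integral I on a wide grid; the integrand
# is below 4e-44 at |x| = 10, so the truncation error is negligible.
x = np.linspace(-10.0, 10.0, 200_001)
f = np.exp(-x**2)
I = np.sum((f[1:] + f[:-1]) / 2) * (x[1] - x[0])
error = abs(I - np.sqrt(np.pi))
```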
<h3 id="samples-from-a-univariate-normal">Samples from a univariate normal</h3>
<ul>
<li>
<p>Suppose <span class="math">\(N\)</span> independent samples are taken from a Normal distribution. The sample mean is defined as <span class="math">\(\hat{\mu} = \frac{1}{N}\sum x_i\)</span> and the sample variance as <span class="math">\(\hat{S}^2 \equiv \frac{1}{N-1} \sum (x_i - \hat{\mu})^2\)</span>. These two statistics are independent. Further, the former is Normally distributed with variance <span class="math">\(\sigma^2/N\)</span> and the latter is proportional to a <span class="math">\(\chi_{N-1}^2\)</span> variable.</p>
<p><em>Proof:</em> Let the sample be <span class="math">\(\textbf{x} = (x_1, x_2, \ldots, x_N)\)</span>. The mean <span class="math">\(\hat{\mu} = \textbf{x} \cdot \textbf{1}/N\)</span> is proportional to the projection of <span class="math">\(\textbf{x}\)</span> along the unit vector <span class="math">\(\textbf{1}/\sqrt{N}\)</span>. Similarly, <span class="math">\((N-1)\hat{S}^2\)</span> is the squared length of <span class="math">\(\textbf{x} - (\textbf{x} \cdot \textbf{1} / N)\textbf{1} = \textbf{x} - (\textbf{x} \cdot \textbf{1} / \sqrt{N})\textbf{1}/\sqrt{N}\)</span>, which is the squared length of <span class="math">\(\textbf{x}\)</span> projected into the space orthogonal to <span class="math">\(\textbf{1}\)</span>. Because the joint density of the independent <span class="math">\(\{x_i\}\)</span> is rotationally symmetric, projections onto orthogonal directions are independent: the former is Normal and the latter <span class="math">\(\chi^2_{N-1}.\)</span></p>
</li>
<li>
<p>The result above implies that the weight for sample <span class="math">\(\textbf{x}\)</span> can be written as
<div class="math">\begin{align} \tag{3}
p(\textbf{x} \vert \mu, \sigma^2) = \frac{1}{(2 \pi \sigma^2)^{N/2}} e^{-\left (N (\hat{\mu} - \mu)^2 + (N-1)S^2\right)/(2 \sigma^2) }.
\end{align}</div>
</p>
</li>
<li>Aside on sample variance: Given independent samples from any distribution, dividing by <span class="math">\(N-1\)</span> gives an unbiased estimate for the population variance. However, if the samples are not independent (e.g., a direct trace from <span class="caps">MCMC</span>), this factor is not appropriate: We have
<div class="math">\begin{align} \nonumber
(N-1)E(S^2) &= E(\sum (x_i - \overline{x})^2) \\
&= E(\sum (x_i - \mu)^2 - N ( \overline{x} - \mu)^2 ) \\ \tag{4}
&= N [\sigma^2 - \text{var}(\overline{x})] \label{sample_var}
\end{align}</div>
If the samples are independent, the above gives <span class="math">\((N-1) \sigma^2\)</span>. However, if the samples are all the same, <span class="math">\(\text{var}(\overline{x}) = \sigma^2\)</span>, giving <span class="math">\(S^2=0\)</span>. In general, the relationship between the samples determines whether <span class="math">\(S^2\)</span> is biased or not.</li>
<li>From the results above, the quantity
<div class="math">\begin{align} \label{t-var} \tag{5}
(\hat{\mu}- \mu)/(S/\sqrt{N})
\end{align}</div>
is the ratio of two independent variables — the numerator a Normal and the denominator the square root of an independent <span class="math">\(\chi^2_{N-1}\)</span> variable. This quantity follows a universal distribution called the <span class="math">\(t\)</span>-distribution. One can write down closed-form expressions for the <span class="math">\(t\)</span>. For example, when <span class="math">\(N=2\)</span>, you get a Cauchy variable: the ratio of one Normal over the absolute value of another, independent Normal (see above). In general, <span class="math">\(t\)</span>-distributions have power law tails. A key point is that we cannot evaluate (\ref{t-var}) numerically if we do not know <span class="math">\(\mu\)</span>. Nevertheless, we can use the known distribution of the above to specify its likely range. Using this, we can then construct a confidence interval for <span class="math">\(\mu\)</span>.</li>
<li>Consider now a situation where you have two separate Normal distributions. To compare their variances you can take samples from the two and then construct the quantity
<div class="math">\begin{align}\label{f-var} \tag{6}
\frac{S_x^2 / \sigma_x^2}{ S_y^2/ \sigma_y^2}.
\end{align}</div>
This is the ratio of two independent <span class="math">\(\chi^2\)</span> variables, each divided by its number of degrees of freedom, resulting in what is referred to as an <span class="math">\(F\)</span>-distributed variable. Like (\ref{t-var}), we often cannot evaluate (\ref{f-var}) numerically. Instead, we use a tabulated cdf of the <span class="math">\(F\)</span>-distribution to derive confidence intervals for the ratio of the two underlying variances. Aside: The <span class="math">\(F\)</span>-distribution arises in the analysis of both <span class="caps">ANOVA</span> and linear regression. Note also that the square of a <span class="math">\(t\)</span>-distributed variable (Normal over the square root of a <span class="math">\(\chi^2\)</span> variable) is <span class="math">\(F\)</span>-distributed.</li>
</ul>
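<p>The universality of (\ref{t-var}) is easy to see by simulation. The sketch below draws many samples at an arbitrary <span class="math">\(\mu\)</span> and <span class="math">\(\sigma\)</span> and compares an empirical quantile of the pivot to the tabulated <span class="math">\(t_{N-1}\)</span> value (the sample sizes and seed are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw many size-N Normal samples at arbitrary mu = 3, sigma = 2, and
# form the pivot (mu_hat - mu) / (S / sqrt(N)) for each.
N, reps = 5, 200_000
samples = rng.normal(loc=3.0, scale=2.0, size=(reps, N))
mu_hat = samples.mean(axis=1)
S = samples.std(axis=1, ddof=1)
pivot = (mu_hat - 3.0) / (S / np.sqrt(N))

# The tabulated 97.5% quantile of the t distribution with 4 degrees of
# freedom is about 2.776, independent of mu and sigma.
q = np.quantile(pivot, 0.975)
```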
<h3 id="multivariate-normals">Multivariate Normals</h3>
<ul>
<li>Consider a set of <span class="math">\(d\)</span> jointly-distributed variables <span class="math">\(x\)</span> having normal distribution
<div class="math">\begin{align} \tag{7}
p(x) = \sqrt{\frac{ \text{det}(M)} {(2 \pi)^d}} \exp \left [- \frac{1}{2} x^T \cdot M \cdot x \right ],
\end{align}</div>
with <span class="math">\(M\)</span> a real, symmetric matrix. The correlation of two components is given by
<div class="math">\begin{align}\tag{8}
\langle x_i x_j \rangle = M^{-1}_{ij}.
\end{align}</div>
<em>Proof:</em> Let
<div class="math">\begin{align}\tag{9}
I = \int dx \exp \left [- \frac{1}{2} x^T \cdot M \cdot x \right ].
\end{align}</div>
Then,
<div class="math">\begin{align}\tag{10}
\partial_{M_{ij}} \log I = -\frac{1}{2} \langle x_i x_j \rangle.
\end{align}</div>
We can also evaluate this using the normalization of the integral as
<div class="math">\begin{align} \nonumber
\partial_{M_{ij}} \log I &= - \frac{1}{2} \sum_{\alpha} \frac{1}{\lambda_{\alpha}} \partial_{M_{ij}} \lambda_{\alpha} \\ \nonumber
&= - \frac{1}{2} \sum_{\alpha} \frac{1}{\lambda_{\alpha}} v_{\alpha i } v_{\alpha j} \\
&= - \frac{1}{2} M^{-1}_{ij}. \tag{11}
\end{align}</div>
Here, I’ve used the result <span class="math">\( \partial_{M_{ij}} \lambda_{\alpha} = v_{\alpha i } v_{\alpha j}\)</span>. I give a proof of this next. The last line follows by expressing <span class="math">\(M\)</span> in terms of its eigenbasis. Comparing the last two lines above gives the result.</li>
<li>
<p>Consider a matrix <span class="math">\(M\)</span> having eigenvalues <span class="math">\(\{\lambda_{\alpha}\}\)</span>. The first derivative of <span class="math">\(\lambda_{\alpha}\)</span> with respect to <span class="math">\(M_{ij}\)</span> is given by <span class="math">\(v_{\alpha, i} v_{\alpha, j}\)</span>, where <span class="math">\(v_{\alpha}\)</span> is the unit eigenvector corresponding to the eigenvalue <span class="math">\(\lambda_{\alpha}\)</span>.</p>
<p><em>Proof:</em> The eigenvalue in question is given by
<div class="math">\begin{align} \tag{12}
\lambda_{\alpha} = \sum_{ij} v_{\alpha i} M_{ij} v_{\alpha j}.
\end{align}</div>
If we differentiate with respect to <span class="math">\(M_{ab}\)</span>, say, we obtain
<div class="math">\begin{align} \nonumber
\partial_{M_{ab}} \lambda_{\alpha} &= \sum_{ij} \delta_{ia} \delta_{jb} v_{\alpha i} v_{\alpha j} + 2 v_{\alpha i} M_{ij} \partial_{M_{ab}} v_{\alpha j} \\
&= v_{\alpha a} v_{\alpha b} + 2 \lambda_{\alpha} v_{\alpha } \cdot \partial_{M_{ab}} v_{\alpha }
\tag{13}.
\end{align}</div>
The last term above must be zero since the length of <span class="math">\(v_{\alpha }\)</span> is fixed at <span class="math">\(1\)</span>.</p>
</li>
<li>
<p>The conditional distribution. Let <span class="math">\(x\)</span> be a vector of jointly distributed variables of mean zero and covariance matrix <span class="math">\(\Sigma\)</span>. If we segment the variables into two sets, <span class="math">\(x_0\)</span> and <span class="math">\(x_1\)</span>, the distribution of <span class="math">\(x_1\)</span> at fixed <span class="math">\(x_0\)</span> is also normal. Here, we find the mean and covariance. We have
<div class="math">\begin{align} \label{multivargaucond} \tag{14}
p(x) = \mathcal{N} e^{-\frac{1}{2} x_0^T \Sigma^{-1}_{00} x_0} e^{ -\frac{1}{2} \left \{ x_1^T \Sigma^{-1}_{11} x_1 + 2 x_1^T \Sigma^{-1}_{10} x_0 \right \} }
\end{align}</div>
Here, <span class="math">\(\Sigma^{-1}_{ij}\)</span> refers to the <span class="math">\(i-j\)</span> block of the inverse. To complete the square, we write
<div class="math">\begin{align} \tag{15}
x_1^T \Sigma^{-1}_{11} x_1 + 2 x_1^T \Sigma^{-1}_{10} x_0 + c = (x_1 + a)^T \Sigma^{-1}_{11} ( x_1 + a).
\end{align}</div>
Comparing both sides, we find
<div class="math">\begin{align} \tag{16}
x_1^T \Sigma^{-1}_{10} x_0 = x_1^T \Sigma^{-1}_{11} a
\end{align}</div>
This holds for any value of <span class="math">\(x_1^T\)</span>, so we must have
<div class="math">\begin{align}\tag{17}
a = \left( \Sigma^{-1}_{11} \right)^{-1} \Sigma^{-1}_{10} x_0 .
\end{align}</div>
Plugging the last few results into (\ref{multivargaucond}), we obtain
<div class="math">\begin{align} \nonumber
p(x) = \mathcal{N} e^{-\frac{1}{2} x_0^T \left( \Sigma^{-1}_{00} -
\Sigma^{-1}_{01} \left( \Sigma^{-1}_{11} \right)^{-1} \Sigma^{-1}_{10} \right) x_0}\times \\
e^{ -\frac{1}{2} \left (x_1 + \left( \Sigma^{-1}_{11} \right)^{-1} \Sigma^{-1}_{10} x_0 \right) \Sigma^{-1}_{11} \left (x_1 + \left( \Sigma^{-1}_{11} \right)^{-1} \Sigma^{-1}_{10} x_0 \right) } \tag{18} \label{multivargaucondfix}
\end{align}</div>
This shows that <span class="math">\(x_0\)</span> and <span class="math">\(x_1 + \left( \Sigma^{-1}_{11} \right)^{-1} \Sigma^{-1}_{10} x_0\)</span> are independent. This formula also shows that the average value of <span class="math">\(x_1\)</span> shifts at fixed <span class="math">\(x_0\)</span>,
<div class="math">\begin{align}\tag{19}
\langle x_1 \rangle = \langle x_1 \rangle_0 - \left( \Sigma^{-1}_{11} \right)^{-1} \Sigma^{-1}_{10} x_0.
\end{align}</div>
With some work, we can rewrite this as
<div class="math">\begin{align} \tag{20}
\langle x_1 \rangle = \langle x_1 \rangle_0 + \Sigma_{10} \Sigma_{00}^{-1} x_0.
\end{align}</div>
There are two ways to prove this equivalent form holds. One is to make use of the expression for the inverse of a block matrix. The second is to note that the above is simply the linear response to a shift in <span class="math">\(x_0\)</span> — see post on linear regression.</p>
</li>
<li>If we integrate over <span class="math">\(x_1\)</span> in (\ref{multivargaucondfix}), we obtain the distribution for <span class="math">\(x_0\)</span>. This is
<div class="math">\begin{align} \tag{21}
p(x_0) = \mathcal{N} e^ {-\frac{1}{2} x_0^T \left( \Sigma^{-1}_{00} -
\Sigma^{-1}_{01} \left( \Sigma^{-1}_{11} \right)^{-1} \Sigma^{-1}_{10} \right) x_0}
\end{align}</div>
The block-diagonal inverse theorem can be used to show that this is equivalent to
<div class="math">\begin{align} \tag{22}
p(x_0) = \mathcal{N} e^{ -\frac{1}{2} x_0^T \left( \Sigma_{00} \right)^{-1} x_0}
\end{align}</div>
Another way to see this is correct is to make use of the fact that the coefficient matrix in the normal is the inverse of the correlation matrix. We know that after integrating out the values of <span class="math">\(x_1\)</span>, we remain normal, and the covariance matrix will simply be given by that for <span class="math">\(x_0\)</span>.</li>
<li>The covariance of the <span class="caps">CDF</span> transform in multivariate case — a result needed for fitting Gaussian Copulas to data: Let <span class="math">\(x_1, x_2\)</span> be jointly distributed Normal variables with covariance matrix
<div class="math">\begin{align}
C = \left( \begin{array}{cc}
1 & \rho \\
\rho & 1
\end{array} \right)
\end{align}</div>
The <span class="caps">CDF</span> transform of <span class="math">\(x_i\)</span> is defined as
<div class="math">\begin{align}
X_i \equiv \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{x_i} \exp\left( -\frac{\tilde{x}_i^2}{2} \right)d\tilde{x}_i.
\end{align}</div>
Here, we’ll calculate the covariance of <span class="math">\(X_1\)</span> and <span class="math">\(X_2\)</span>. Up to a constant that does not depend on <span class="math">\(\rho\)</span>, this is given by the integral
<div class="math">\begin{align}
J \equiv &\frac{1}{\sqrt{(2 \pi)^2 \text{det} C}} \int d\vec{x} e^{-\frac{1}{2} \vec{x} \cdot C^{-1} \cdot \vec{x}} \times \\
&\frac{1}{2 \pi} \int_{-\infty}^{x_1}\int_{-\infty}^{x_2} e^{-\frac{\tilde{x}_1^2}{2} -\frac{\tilde{x}_2^2}{2}}d\tilde{x}_1 d\tilde{x}_2.
\end{align}</div>
To progress, we first write
<div class="math">\begin{align}
\exp\left( -\frac{\tilde{x}_i^2}{2} \right ) = \frac{1}{\sqrt{2\pi }}\int \exp \left (- \frac{1}{2} k_i^2 + i k_i \tilde{x}_i \right ) dk_i
\end{align}</div>
We will substitute this equation into the prior line and then integrate over the <span class="math">\(\tilde{x}_i\)</span> using the result
<div class="math">\begin{align}
\int_{-\infty}^{x_i} \exp \left ( i k_i \tilde{x}_i \right ) d \tilde{x}_i = \frac{e^{i k_i x_i}}{i k_i}.
\end{align}</div>
This gives
<div class="math">\begin{align}
J = &\frac{-1}{(2 \pi)^3 \sqrt{\text{det} C} } \int_{k_1} \int_{k_2} \frac{e^{-\frac{1}{2} (k_1^2 + k_2^2)}}{k_1 k_2} \times \\
&\int d\vec{x} e^{-\frac{1}{2} \vec{x} \cdot C^{-1} \cdot \vec{x} + i \vec{k} \cdot \vec{x}}
\end{align}</div>
The integral on <span class="math">\(\vec{x}\)</span> can now be carried out by completing the square. This gives
<div class="math">\begin{align}
J = \frac{-1}{(2 \pi)^2} \int_{k_1} \int_{k_2} \frac{1}{k_1 k_2}
\exp\left( -\frac{1}{2} \vec{k} \cdot (C + I) \cdot \vec{k} \right)
\end{align}</div>
We now differentiate with respect to <span class="math">\(\rho\)</span> to get rid of the <span class="math">\(k_1 k_2\)</span> in the denominator. This gives
<div class="math">\begin{align} \nonumber
\partial_{\rho} J &= \frac{1}{(2 \pi)^2} \int_{k_1} \int_{k_2}
\exp\left( -\frac{1}{2} \vec{k} \cdot (C + I) \cdot \vec{k} \right) \\ \nonumber
&= \frac{1}{2 \pi } \frac{1}{\sqrt{\text{det}(C + I)}} \\
&= \frac{1}{4 \pi } \frac{1}{\sqrt{1 - \frac{\rho^2}{4}}}.
\end{align}</div>
The last step is to integrate with respect to <span class="math">\(\rho\)</span>, but we will now switch back to the original goal of calculating the covariance of the two <span class="caps">CDF</span> transforms, <span class="math">\(P\)</span>, rather than <span class="math">\(J\)</span> itself. At <span class="math">\(\rho = 0\)</span>, we must have <span class="math">\(P(\rho=0) = 0\)</span>, since the transforms will also be uncorrelated in this limit. This gives
<div class="math">\begin{align} \nonumber
P &= \int_0^{\rho} \frac{1}{4 \pi } \frac{1}{\sqrt{1 - \frac{\rho^2}{4}}} d \rho \\
&= \frac{1}{2 \pi } \sin^{-1} \left( \frac{\rho}{2} \right). \tag{23}
\end{align}</div>
Using a similar calculation, we find that the diagonal terms of the <span class="caps">CDF</span> covariance matrix are <span class="math">\(1/12\)</span>.</li>
</ul>
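<p>The equivalence of the two conditional-mean forms, (19) and (20), can also be checked numerically, without invoking the block-inverse theorem. A sketch with a randomly generated covariance matrix:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Random 4x4 positive-definite covariance, partitioned into block 0
# (first two coordinates) and block 1 (last two).
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + 4 * np.eye(4)
P = np.linalg.inv(Sigma)  # precision matrix, Sigma^{-1}

S00, S10 = Sigma[:2, :2], Sigma[2:, :2]
P11, P10 = P[2:, 2:], P[2:, :2]

# The conditional-mean coefficient on x_0, in both forms: eqs. (19) and (20).
form_19 = -np.linalg.solve(P11, P10)
form_20 = S10 @ np.linalg.inv(S00)
```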
<h3 id="central-limit-theorem">Central Limit Theorem</h3>
<ul>
<li>
<p>Let <span class="math">\(x_1, x_2, \ldots, x_N\)</span> be <span class="caps">IID</span> random variables with an mgf that exists near <span class="math">\(0\)</span>. Let <span class="math">\(E(x_i) = \mu\)</span> and <span class="math">\(\text{var}(x_i) = \sigma^2\)</span>. Then the variable <span class="math">\(\frac{\overline{x} - \mu}{\sigma / \sqrt{N}}\)</span> approaches standard normal as <span class="math">\(N \to \infty\)</span>.</p>
<p><em>Proof:</em> Let <span class="math">\(y_i =\frac{x_i - \mu}{\sigma}\)</span>. Then,
<div class="math">\begin{align}\tag{24}
\tilde{y} \equiv \frac{\overline{x} - \mu}{\sigma / \sqrt{N}} = \frac{1}{\sqrt{N}} \sum_i y_i.
\end{align}</div>
Using the fact that the mgf of a sum of independent variables is given by the product of their mgfs, the quantity at left is
<div class="math">\begin{align} \tag{25}
m_{\tilde{y}}(t) = \left [ m_{y}\left (\frac{t}{\sqrt{N}} \right) \right]^N.
\end{align}</div>
We now expand the term in brackets using a Taylor series, obtaining
<div class="math">\begin{align} \tag{26}
m_{\tilde{y}}(t) &= \left [1 + \frac{t^2}{2 N } + O\left (\frac{t^3}{ N^{3/2}} \right) \right]^N \\ &\to \exp\left ( \frac{t^2}{2} \right),
\end{align}</div>
where the latter form is the fixed <span class="math">\(t\)</span> limit as <span class="math">\(N \to \infty\)</span>. This is the mgf for a <span class="math">\(N(0,1)\)</span> variable, proving the result.</p>
</li>
<li>
<p>One can get a sense of the accuracy of the normal approximation at fixed <span class="math">\(N\)</span> through consideration of higher moments. For example, suppose we have an even distribution with mgf <span class="math">\(1 + t^2 /2 + (1 + \kappa^{\prime}) t^4 / 8 + \ldots\)</span>. Then the mgf for the scaled average above will be
<div class="math">\begin{align}\nonumber
m_{\tilde{y}} &= \left [1 + \frac{t^2}{2 N } + \frac{(1 + \kappa^{\prime}) t^4}{8 N^2 } + \ldots \right]^N \\
&= 1 + \frac{t^2}{2} + \left (1 + \frac{\kappa^{\prime}}{ N } \right) \frac{t^4}{8} + \ldots \tag{27}
\end{align}</div>
This shows that the deviation in the kurtosis away from its <span class="math">\(N(0,1)\)</span> value decays like <span class="math">\(1/N\)</span>.</p>
</li>
</ul>
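<p>The <span class="math">\(1/N\)</span> decay of the excess kurtosis is easy to confirm by simulation. Below we use uniform variables, whose excess kurtosis of <span class="math">\(-6/5\)</span> should be suppressed by a factor of <span class="math">\(N\)</span> in the standardized mean (the sample sizes and seed are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# Standardized means of N uniform draws; the uniform has excess
# kurtosis -6/5, so the mean's excess kurtosis should be near -6/(5N).
N, reps = 10, 200_000
x = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(reps, N))  # variance 1
y = np.sqrt(N) * x.mean(axis=1)

m2 = np.mean(y**2)
excess_kurtosis = np.mean(y**4) / m2**2 - 3.0
```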
Model AUC depends on test set difficulty2017-03-18T22:36:00-07:002017-03-18T22:36:00-07:00Jonathan Landytag:efavdb.com,2017-03-18:/model-auc-depends-on-test-set-difficulty<p>The <span class="caps">AUC</span> score is a popular summary statistic that is often used to communicate the performance of a classifier. However, we illustrate here that this score depends not only on the quality of the model in question, but also on the difficulty of the test set considered: If samples are added to a test set that are easily classified, the <span class="caps">AUC</span> will go up — even if the model studied has not improved. In general, this behavior implies that isolated, single <span class="caps">AUC</span> scores cannot be used to meaningfully qualify a model’s performance. Instead, the <span class="caps">AUC</span> should be considered a score that is primarily useful for comparing and ranking multiple models — each at a common test set difficulty.</p>
<h3 id="introduction">Introduction</h3>
<p>An important challenge in building good classification algorithms is their optimization: When an adjustment is made to an algorithm, we need a score that lets us decide whether or not the change was an improvement. Many scores are available for this purpose. One popular, general-purpose score for characterizing binary classifiers is the model <span class="caps">AUC</span> score (defined below).</p>
<p>The purpose of this post is to illustrate a subtlety associated with the <span class="caps">AUC</span> that is not always appreciated: The score depends strongly on the difficulty of the test set used to measure model performance. In particular, if any soft-balls are added to a test set that are easily classified (i.e., are far from any decision boundary), the <span class="caps">AUC</span> will increase. This increase does not imply a model improvement. Two key take-aways follow:</p>
<ul>
<li>The <span class="caps">AUC</span> is not an appropriate score for comparing models validated on test sets with differing sampling distributions. Comparing the AUCs of models trained on differently-distributed samples therefore requires care: The training sets may have different distributions, but the test sets must not.</li>
<li>A single <span class="caps">AUC</span> measure cannot typically be used to meaningfully communicate the quality of a single model (though single-model <span class="caps">AUC</span> scores are often reported!).</li>
</ul>
<p>The primary utility of the <span class="caps">AUC</span> is that it allows one to compare multiple models at fixed test set difficulty: If a model change results in an increase in the <span class="caps">AUC</span> at fixed test set distribution, it can often be considered an improvement.</p>
<p>We review the definition of the <span class="caps">AUC</span> below and then demonstrate the issues alluded to above.</p>
<h3 id="the-auc-score-reviewed">The <span class="caps">AUC</span> score, reviewed</h3>
<p>Here, we quickly review the definition of the <span class="caps">AUC</span>. This is a score that can be used to quantify the accuracy of a binary classification algorithm on a given test set <span class="math">\(\mathcal{S}\)</span>. The test set consists of a set of feature vector-label pairs of the form
</p>
<div class="math">\begin{eqnarray}\tag{1}
\mathcal{S} = \{(\textbf{x}_i, y_i) \}.
\end{eqnarray}</div>
<p>Here, <span class="math">\(\textbf{x}_i\)</span> is the set of features, or predictor variables, for example <span class="math">\(i\)</span> and <span class="math">\(y_i \in \{0,1 \}\)</span> is the label for example <span class="math">\(i\)</span>. A classifier function <span class="math">\(\hat{p}_1(\textbf{x})\)</span> is one that attempts to guess the value of <span class="math">\(y_i\)</span> given only the feature vector <span class="math">\(\textbf{x}_i\)</span>. In particular, the output of the function <span class="math">\(\hat{p}_1(\textbf{x}_i)\)</span> is an estimate for the probability that the label <span class="math">\(y_i\)</span> is equal to <span class="math">\(1\)</span>. If the algorithm is confident that the class is <span class="math">\(1\)</span> (<span class="math">\(0\)</span>), the probability returned will be large (small).</p>
<p>To characterize model performance, we can set a threshold value of <span class="math">\(p^*\)</span> and mark all examples in the test set with <span class="math">\(\hat{p}(\textbf{x}_i) > p^*\)</span> as being candidates for class one. The fraction of the truly positive examples in <span class="math">\(\mathcal{S}\)</span> marked in this way is referred to as the true-positive rate (<span class="caps">TPR</span>) at threshold <span class="math">\(p^*\)</span>. Similarly, the fraction of negative examples in <span class="math">\(\mathcal{S}\)</span> marked is referred to as the false-positive rate (<span class="caps">FPR</span>) at threshold <span class="math">\(p^*\)</span>. Plotting the <span class="caps">TPR</span> against the <span class="caps">FPR</span> across all thresholds gives the model’s so-called receiver operating characteristic (<span class="caps">ROC</span>) curve. A hypothetical example is shown below in blue. The dashed line is the <span class="math">\(y=x\)</span> line, which corresponds to the <span class="caps">ROC</span> curve of a random classifier (one returning a uniform random <span class="math">\(p\)</span> value each time).</p>
<p><a href="https://efavdb.com/wp-content/uploads/2017/03/example.png"><img alt="example" src="https://efavdb.com/wp-content/uploads/2017/03/example.png"></a></p>
<p>Notice that if the threshold is set to <span class="math">\(p^* = 1\)</span>, no positive or negative examples will typically be marked as candidates, as this would require one-hundred percent confidence of class <span class="math">\(1\)</span>. This means that we can expect an <span class="caps">ROC</span> curve to always go through the point <span class="math">\((0,0)\)</span>. Similarly, with <span class="math">\(p^*\)</span> set to <span class="math">\(0\)</span>, all examples should be marked as candidates for class <span class="math">\(1\)</span> — and so an <span class="caps">ROC</span> curve should also always go through the point <span class="math">\((1,1)\)</span>. In between, we hope to see a curve that increases in the <span class="caps">TPR</span> direction more quickly than in the <span class="caps">FPR</span> direction — since this would imply that the examples the model is most confident about tend to actually be class <span class="math">\(1\)</span> examples. In general, the larger the Area Under the (<span class="caps">ROC</span>) Curve (the blue curve in the figure above), the better. We call this area the “<span class="caps">AUC</span> score for the model” — the topic of this post.</p>
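<p>The construction just described can be sketched in a few lines. The snippet below is a minimal illustration (numpy only; the function names are our own, not a standard API): sweeping the threshold over the observed scores traces out the ROC curve, and the trapezoid rule gives the area.</p>

```python
import numpy as np

def roc_points(y_true, scores):
    """Sweep the threshold p* over the observed scores, marking examples
    with score > p* as class-1 candidates, and record (FPR, TPR)."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    fpr, tpr = [0.0], [0.0]            # p* = 1: nothing marked -> (0, 0)
    for t in np.sort(np.unique(scores))[::-1]:
        tpr.append(float((scores[y_true == 1] > t).mean()))
        fpr.append(float((scores[y_true == 0] > t).mean()))
    fpr.append(1.0)                    # p* = 0: everything marked -> (1, 1)
    tpr.append(1.0)
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Trapezoid-rule area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc(fpr, tpr))  # 0.75
```

On this four-point toy input, one of the two positives outranks both negatives and the other outranks one of them, so three of the four positive-negative pairs are correctly ordered and the area comes out to 0.75.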
<h3 id="auc-sensitivity-to-test-set-difficulty"><span class="caps">AUC</span> sensitivity to test set difficulty</h3>
<p>To illustrate the sensitivity of the <span class="caps">AUC</span> score to test set difficulty, we now consider a toy classification problem: In particular, we consider a set of unit-variance normal distributions, each having a different mean <span class="math">\(\mu_i\)</span>. From each distribution, we will take a single sample <span class="math">\(x_i\)</span>. From this, we will attempt to estimate whether or not the corresponding mean satisfies <span class="math">\(\mu_i > 0\)</span>. That is, our test set will take the form <span class="math">\(\mathcal{S} = \{(x_i, \mu_i)\}\)</span>, where <span class="math">\(x_i \sim N(\mu_i, 1)\)</span>. For different <span class="math">\(\mathcal{S}\)</span>, we will study the <span class="caps">AUC</span> of the classifier function,</p>
<div class="math">\begin{eqnarray} \label{classifier} \tag{2}
\hat{p}(x) = \frac{1}{2} (1 + \text{tanh}(x))
\end{eqnarray}</div>
<p>
A plot of this function is shown below. You can see that if any test sample <span class="math">\(x_i\)</span> is far to the right (left) of <span class="math">\(x=0\)</span>, the model will classify the sample as positive (negative) with high certainty. At intermediate values near the boundary, the estimated probability of being in the positive class rises smoothly from zero to one.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2017/03/classifier-2.png"><img alt="classifier" src="https://efavdb.com/wp-content/uploads/2017/03/classifier-2.png"></a></p>
<p>Notice that if a test example has a mean very close to zero, it will be difficult to classify that example as positive or negative. This is because both positive and negative <span class="math">\(x\)</span> samples are nearly equally likely in this case. This means that the model cannot do much better than a random guess for such <span class="math">\(\mu\)</span>. On the other hand, if an example <span class="math">\(\mu\)</span> is selected that is very far from the origin, a single sample <span class="math">\(x\)</span> from <span class="math">\(N(\mu, 1)\)</span> will be sufficient to make a very good guess as to whether <span class="math">\(\mu > 0\)</span>. Such examples are hard to get wrong: they are soft-balls.</p>
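<p>This intuition can be made quantitative: for <span class="math">\(x \sim N(\mu, 1)\)</span> with <span class="math">\(\mu > 0\)</span>, the probability that a single sample lands on the correct side of the origin is <span class="math">\(\Phi(\mu)\)</span>, the standard normal CDF. A minimal check (standard library only; the helper name here is ours):</p>

```python
from math import erf, sqrt

def prob_correct_sign(mu):
    """P(x > 0) for x ~ N(mu, 1) with mu > 0: the chance that a single
    sample lands on the correct side of the decision boundary."""
    return 0.5 * (1 + erf(mu / sqrt(2)))

# Near the boundary, a single sample is barely better than a coin flip;
# far from it, classification is nearly certain.
print(round(prob_correct_sign(0.1), 3))  # 0.54
print(round(prob_correct_sign(3.0), 3))  # 0.999
```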
<p>The impact of adding soft-balls to the test set on the <span class="caps">AUC</span> for model (\ref{classifier}) can be studied by changing the sampling distribution of <span class="math">\(\mathcal{S}\)</span>. The following python snippet takes samples <span class="math">\(\mu_i\)</span> from three distributions — one tight about <span class="math">\(0\)</span> (resulting in a very difficult test set), one that is very wide containing many soft-balls that are easily classified, and one that is intermediate. The <span class="caps">ROC</span> curves that result from these three cases are shown following the code. The three curves are very different, with the <span class="caps">AUC</span> of the soft-ball set very large and that of the tight set close to that of the random classifier. Yet, in each case the model considered was the same — (\ref{classifier}). How could the <span class="caps">AUC</span> have improved?!</p>
<div class="highlight"><pre><span></span>import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics

def classifier(x):
    """Model (2): estimated probability that the underlying mean is positive."""
    return 0.5 * (1 + np.tanh(x))

SAMPLES = 1000
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 3.5))
for means_std in [3, 0.5, 0.001]:
    means = means_std * np.random.randn(SAMPLES)
    x_set = np.random.randn(SAMPLES) + means
    predictions = [classifier(item) for item in x_set]
    fpr, tpr, thresholds = metrics.roc_curve(1 * (means > 0), predictions)
    ax1.plot(fpr, tpr, label=means_std)
    ax2.plot(means, 0 * means, '*', label=means_std)
ax1.plot([0, 1], [0, 1], 'k--')
ax1.legend(loc='lower right', shadow=True)
ax2.legend(loc='lower right', shadow=True)
ax1.set_title('TPR versus FPR -- The ROC curve')
ax2.set_title('Means sampled for each case')
</pre></div>
<p><a href="https://efavdb.com/wp-content/uploads/2017/03/Examples.png"><img alt="Examples" src="https://efavdb.com/wp-content/uploads/2017/03/Examples.png"></a></p>
<p>The explanation for the differing <span class="caps">AUC</span> values above is clear: Consider, for example, the effect of adding soft-ball negatives to <span class="math">\(\mathcal{S}\)</span>. In this case, the model (\ref{classifier}) will be able to correctly identify almost all true positive examples at a much higher threshold than that where it begins to mis-classify the introduced negative soft-balls. This means that the <span class="caps">ROC</span> curve will now hit a <span class="caps">TPR</span> value of <span class="math">\(1\)</span> well before the <span class="caps">FPR</span> does (since the latter requires that all negatives, including the soft-balls, be mis-classified). Similarly, if many soft-ball positives are added in, these will be easily identified as such well before any negative examples are mis-classified. This again raises the <span class="caps">ROC</span> curve and increases the <span class="caps">AUC</span> — all without any improvement in the actual model quality, which we have held fixed.</p>
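<p>The effect is easy to reproduce numerically. The sketch below (numpy only; the rank-statistic AUC helper and the particular soft-ball distribution are our own choices for illustration) appends far-negative soft-balls to a difficult test set and re-scores the unchanged model:</p>

```python
import numpy as np

rng = np.random.RandomState(0)

def classifier(x):
    # The tanh model of Eq. (2).
    return 0.5 * (1 + np.tanh(x))

def auc_score(y, s):
    """AUC via its rank interpretation: the probability that a randomly
    chosen positive example is scored above a randomly chosen negative."""
    pos, neg = s[y == 1], s[y == 0]
    return (pos[:, None] > neg[None, :]).mean()

# A difficult test set: means tightly clustered about the boundary.
means = 0.5 * rng.randn(1000)
x = rng.randn(1000) + means
auc_hard = auc_score(1 * (means > 0), classifier(x))

# Append soft-ball negatives: means far below zero, trivially classified.
easy_means = -5.0 + 0.1 * rng.randn(500)
means_all = np.concatenate([means, easy_means])
x_all = np.concatenate([x, rng.randn(500) + easy_means])
auc_soft = auc_score(1 * (means_all > 0), classifier(x_all))

# Same model, easier test set: the AUC rises.
print(auc_hard < auc_soft)  # True
```

Every pair involving a soft-ball negative is almost surely ranked correctly, so diluting the test set with them pushes the pairwise statistic, and hence the AUC, upward.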
<h3 id="discussion">Discussion</h3>
<p>The toy example considered above illustrates the general point that the <span class="caps">AUC</span> of a model is really a function of both the model and the test set it is applied to. Keeping this in mind will help prevent incorrect interpretations of the <span class="caps">AUC</span>. A special case to watch out for in practice is the situation where the <span class="caps">AUC</span> changes upon adjustment of the training and testing protocol (which can result, for example, from changes to how training examples are collected for the model). If you see such a change occur in your work, consider carefully whether the difficulty of the test set has changed in the process. If so, the change in the <span class="caps">AUC</span> may not indicate a change in model quality.</p>
<p>Because the <span class="caps">AUC</span> score of a model can depend strongly on the difficulty of the test set, reporting this score alone will generally not provide much insight into the accuracy of the model — which really depends only on performance near the true decision boundary and not on soft-ball performance. Because of this, it may be a good practice to always report <span class="caps">AUC</span> scores for optimized models next to those of some fixed baseline model. Comparing the differences of the two <span class="caps">AUC</span> scores provides an approximate method for removing the effect of test set difficulty. If you come across an isolated, high <span class="caps">AUC</span> score in the wild, remember that this does not imply a good model!</p>
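<p>As a sketch of this reporting practice (the baseline here is hypothetical — we take it to be the same model fed a noisier copy of the signal — and the rank-statistic AUC helper is our own), one would report the model's AUC next to the baseline's on the same test set, rather than the model's score alone:</p>

```python
import numpy as np

rng = np.random.RandomState(1)

def auc_score(y, s):
    """AUC via its rank interpretation."""
    pos, neg = s[y == 1], s[y == 0]
    return (pos[:, None] > neg[None, :]).mean()

# A single test set of moderate difficulty.
means = 0.5 * rng.randn(2000)
x = rng.randn(2000) + means
y = 1 * (means > 0)

auc_model = auc_score(y, np.tanh(x))
# Hypothetical fixed baseline: the same model applied to a noisier signal.
auc_base = auc_score(y, np.tanh(x + rng.randn(2000)))

# Report the pair (or their difference), not the model AUC in isolation.
print(auc_model, auc_base)
```

Reporting the difference against a fixed baseline measured on the same test set factors out, at least approximately, the shared contribution of test set difficulty.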
<p>A special situation exists where reporting an isolated <span class="caps">AUC</span> score for a single model can provide value: The case where the test set employed shares the same distribution as the application set (the space where the model will be deployed). In this case, performance within the test set directly relates to expected performance during application. However, applying the <span class="caps">AUC</span> in such situations is not always useful. For example, if the positive class sits within only a small subset of feature space, samples taken from much of the rest of the space will be “soft-balls” — examples easily classified as not being in the positive class. Measuring the <span class="caps">AUC</span> on test sets over the full feature space in this context will always result in <span class="caps">AUC</span> values near one — making it difficult to register improvements in the model near the decision boundary through measurement of the <span class="caps">AUC</span>.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Simple python to LaTeX parser2016-11-18T12:59:00-08:002016-11-18T12:59:00-08:00Jonathan Landytag:efavdb.com,2016-11-18:/simple-python-to-latex-parser<p>We demo a script that converts python numerical commands to LaTeX format. A notebook available on our GitHub page will take this and pretty print the result.</p>
<h3 id="introduction">Introduction</h3>
<p>Here, we provide a simple script that accepts numerical python commands in string format and converts them into LaTeX markup. An example input / output follows:</p>
<div class="highlight"><pre><span></span>s = 'f(x_123, 2) / (2 + 3/(1 + z(np.sqrt((x + 3)/3)))) + np.sqrt(2 ** w) * np.tanh(2 * math.pi * x)'
print(command_to_latex(s))
## output:
## \frac{f \left ({x}_{123} , 2 \right )}{2 + \frac{3}{1 + z \left ( \sqrt{\frac{x + 3}{3}} \right )}} + \sqrt{{2}^{w}} \cdot \tanh \left (2 \cdot \pi \cdot x \right )
</pre></div>
<p>If the output shown here is plugged into a LaTeX editor, we get the following result:</p>
<div class="math">\begin{eqnarray}\tag{1}
\frac{f \left ({x}_{123} , 2 \right )}{2 + \frac{3}{1 + z \left ( \sqrt{\frac{x + 3}{3}} \right )}} + \sqrt{{2}^{w}} \cdot \tanh \left (2 \cdot \pi \cdot x \right )
\end{eqnarray}</div>
<p>
Our Jupyter <a href="https://github.com/EFavDB/python_command_to_latex">notebook</a> automatically pretty prints to this form.</p>
<p>We provide the script here as it may be useful for two sorts of applications: 1) facilitating write-ups of completed projects, and 2) visualizing typed-up formulas to aid checks of their accuracy. The latter is particularly helpful for lengthy commands, which are often hard to read in python format.</p>
<p>We note that the Python package SymPy also provides a simple command-to-LaTeX parser. However, I have had trouble getting it to output results when any functions appear that have not been defined — we illustrate this issue in the notebook.</p>
<p>As usual, our code can be downloaded from our github page <a href="https://github.com/EFavDB/python_command_to_latex">here</a>.</p>
<h3 id="code">Code</h3>
<p>The main code segment follows. The method command_to_latex recursively computes the LaTeX for any combinations of variables grouped together via parentheses. The base case occurs when there are no parentheses left, at which point the method parse_simple_eqn is called, which converts simple commands to LaTeX. The results are then recombined within the recursive method. Additional replacements can be easily added in the appropriate lines below.</p>
<div class="highlight"><pre><span></span>def parse_simple_eqn(q):
    """ Return TeX equivalent of a command without parentheses. """
    # Define replacement rules.  Raw strings prevent accidental escape
    # sequences (e.g., '\tan' would otherwise begin with a tab character).
    simple_replacements = [
        [' ', ''],
        ['**', '^'],
        ['*', r' \cdot '],
        ['math.', ''],
        ['np.', ''],
        ['pi', r'\pi'],
        ['tan', r'\tan'],
        ['cos', r'\cos'],
        ['sin', r'\sin'],
        ['sec', r'\sec'],
        ['csc', r'\csc'],
    ]
    complex_replacements = [
        ['^', '{{{i1}}}^{{{i2}}}'],
        ['_', '{{{i1}}}_{{{i2}}}'],
        ['/', r'\frac{{{i1}}}{{{i2}}}'],
        ['sqrt', r'\sqrt{{{i2}}}'],
    ]
    # Carry out simple replacements
    for pair in simple_replacements:
        q = q.replace(pair[0], pair[1])
    # Now complex replacements
    for item in ['*', '/', '+', '-', '^', '_', ',', 'sqrt']:
        q = q.replace(item, ' ' + item + ' ')
    q_split = q.split()
    for index, item in enumerate(q_split):
        for pair in complex_replacements:
            if item == pair[0]:
                if item == 'sqrt':
                    match_str = " ".join(q_split[index:index + 2])
                else:
                    match_str = " ".join(q_split[index - 1:index + 2])
                q = q.replace(match_str, pair[1].format(
                    i1=q_split[index - 1], i2=q_split[index + 1]))
    return q

def command_to_latex(q, index=0):
    """ Recursively eliminate parentheses, then apply parse_simple_eqn. """
    open_index, close_index = -1, -1
    for q_index, i in enumerate(q):
        if i == '(':
            open_index = q_index
        elif i == ')':
            close_index = q_index
            break
    if open_index != -1:
        o = q[:open_index] + '@' + str(index) + q[close_index + 1:]
        m = q[open_index + 1:close_index]
        o_tex = command_to_latex(o, index + 1)
<span class="n">m_tex</span> <span class="o">=</span> <span class="n">command_to_latex</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
<span class="c1"># Clean up redundant parentheses at recombination</span>
<span class="n">r_index</span> <span class="o">=</span> <span class="n">o_tex</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">'@'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">index</span><span class="p">))</span>
<span class="k">if</span> <span class="n">o_tex</span><span class="p">[</span><span class="n">r_index</span> <span class="o">-</span> <span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="s1">'{'</span><span class="p">:</span>
<span class="k">return</span> <span class="n">o_tex</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'@'</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="n">index</span><span class="p">),</span> <span class="n">m_tex</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="n">o_tex</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'@'</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="n">index</span><span class="p">),</span>
<span class="s1">' </span><span class="se">\\</span><span class="s1">left ('</span> <span class="o">+</span> <span class="n">m_tex</span> <span class="o">+</span> <span class="s1">' </span><span class="se">\\</span><span class="s1">right )'</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="n">parse_simple_eqn</span><span class="p">(</span><span class="n">q</span><span class="p">)</span>
</pre></div>
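<p>To see how the recursion gets its footing, here is a self-contained sketch of the innermost-parenthesis scan it relies on. The helper name <code>innermost</code> is our own illustration, not part of the code above:</p>

```python
def innermost(q):
    """Locate the first innermost '(...)' pair: remember the most recent '('
    seen, then stop at the first ')'. Returns (open_index, close_index),
    or None when the string contains no parenthesized group."""
    open_index, close_index = -1, -1
    for q_index, ch in enumerate(q):
        if ch == '(':
            open_index = q_index
        elif ch == ')':
            close_index = q_index
            break
    if open_index == -1:
        return None
    return open_index, close_index

# The inner "c + d" group is found first, which is the group
# command_to_latex recurses on before reassembling the outer expression.
innermost("a + (b * (c + d))")  # → (9, 15)
```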
<p>That’s it!</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Deep reinforcement learning, battleship2016-10-15T13:52:00-07:002016-10-15T13:52:00-07:00Jonathan Landytag:efavdb.com,2016-10-15:/battleship<p>Here, we provide a brief introduction to reinforcement learning (<span class="caps">RL</span>) — a general technique for training programs to play games efficiently. Our aim is to explain its practical implementation: We cover some basic theory and then walk through a minimal python program that trains a neural network to play the game …</p><p>Here, we provide a brief introduction to reinforcement learning (<span class="caps">RL</span>) — a general technique for training programs to play games efficiently. Our aim is to explain its practical implementation: We cover some basic theory and then walk through a minimal python program that trains a neural network to play the game battleship.</p>
<h3 id="introduction">Introduction</h3>
<p>Reinforcement learning (<span class="caps">RL</span>) techniques are methods that can be used to teach algorithms to play games efficiently. Like supervised machine-learning (<span class="caps">ML</span>) methods, <span class="caps">RL</span> algorithms learn from data — in this case, past game play data. However, whereas supervised-learning algorithms train only on data that is already available, <span class="caps">RL</span> addresses the challenge of performing well while still in the process of collecting data. In particular, we seek design principles that</p>
<ul>
<li>Allow programs to identify good strategies from past examples,</li>
<li>Enable fast learning of new strategies through continued game play.</li>
</ul>
<p>The reason we particularly want our algorithms to learn fast here is that <span class="caps">RL</span> is most fruitfully applied in contexts where training data is limited — or where the space of strategies is so large that it would be difficult to explore exhaustively. It is in these regimes that supervised techniques have trouble and <span class="caps">RL</span> methods shine.</p>
<p>In this post, we review one general <span class="caps">RL</span> training procedure: The policy-gradient, deep-learning scheme. We review the theory behind this approach in the next section. Following that, we walk through a simple python implementation that trains a neural network to play the game battleship.</p>
<p>Our python code can be downloaded from our github page, <a href="https://github.com/EFavDB/battleship">here</a>. It requires the jupyter, tensorflow, numpy, and matplotlib packages.</p>
<h3 id="policy-gradient-deep-rl">Policy-gradient, deep <span class="caps">RL</span></h3>
<p>Policy-gradient, deep <span class="caps">RL</span> algorithms consist of two main components: A policy network and a rewards function. We detail these two below and then describe how they work together to train good models.</p>
<h4 id="the-policy-network">The policy network</h4>
<p>The policy for a given deep <span class="caps">RL</span> algorithm is a neural network that maps state values <span class="math">\(s\)</span> to probabilities for given game actions <span class="math">\(a\)</span>. In other words, the input layer of the network accepts a numerical encoding of the environment — the state of the game at a particular moment. When this input is fed through the network, the values at the output layer correspond to the log probabilities that each of the actions available to us is optimal — one output node is present for each possible action that we can choose. Note that if we knew with certainty which move we should take, only one output node would have a finite probability. However, if our network is uncertain which action is optimal, more than one output node will have finite weight.</p>
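<p>The forward pass of such a policy network can be sketched in a few lines of numpy. This mirrors the tanh-hidden-layer, softmax-output structure of the tensorflow network built later in the walkthrough; the weights here are random placeholders of our own choosing:</p>

```python
import numpy as np

def policy_forward(x, W1, b1, W2, b2):
    """Map a board encoding x to action probabilities:
    a tanh hidden layer followed by a softmax over the output logits."""
    h = np.tanh(x @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

# Random placeholder weights for a board of size 5.
rng = np.random.default_rng(0)
x = np.array([-1.0, -1.0, -1.0, 1.0, 0.0])
probs = policy_forward(x,
                       rng.normal(size=(5, 5)), np.zeros(5),
                       rng.normal(size=(5, 5)), np.zeros(5))
# probs is a length-5 distribution over candidate bombing sites.
```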
<p>To illustrate the above, we present a diagram of the network used in our battleship program below. (For a review of the rules of battleship, see footnote [1].) For simplicity, we work with a 1-d battleship grid. We then encode our current knowledge of the environment using one input neuron for each of our opponent’s grid positions. In particular, we use the following encoding for each neuron / index:</p>
<div class="math">\begin{align} \label{input} \tag{1}
x_{0,i} = \begin{cases}
-1 & \text{Have not yet bombed $i$} \\
\ 0 & \text{Have bombed $i$, no ship} \\
+1 & \text{Have bombed $i$, ship present}.
\end{cases}
\end{align}</div>
<p>
In our example figure below, we have five input neurons, so the board is of size five. The first three neurons have value <span class="math">\(-1\)</span>, implying we have not yet bombed those grid points. The last two are <span class="math">\(+1\)</span> and <span class="math">\(0\)</span>, respectively, implying that a ship sits at the fourth site, but not at the fifth.</p>
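<p>This encoding is simple to produce in code. A minimal sketch, where the helper name <code>encode_board</code> and its arguments are our own illustration rather than part of the program below:</p>

```python
def encode_board(bombed, hits, board_size):
    """Encode the board per Eq. (1): -1 = not yet bombed,
    0 = bombed with no ship, +1 = bombed with ship present."""
    return [1 if i in hits else (0 if i in bombed else -1)
            for i in range(board_size)]

# The five-cell example from the figure: cells 3 and 4 bombed, cell 3 a hit.
encode_board(bombed={3, 4}, hits={3}, board_size=5)  # → [-1, -1, -1, 1, 0]
```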
<p><a href="https://efavdb.com/wp-content/uploads/2016/10/nn.jpg"><img alt="network" src="https://efavdb.com/wp-content/uploads/2016/10/nn.jpg"></a></p>
<p>Note that in the output layer of the policy network shown, the first three values are labeled with log probabilities. These values correspond to the probabilities that we should next bomb each of these indices, respectively. We cannot re-bomb the fourth and fifth grid points, so although the network may output some values to these neurons, we’ll ignore them.</p>
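<p>Ignoring the outputs for already-bombed cells amounts to masking those probabilities and renormalizing the rest, the same trick the <code>play_game</code> method uses later in the walkthrough. A small sketch (helper name ours):</p>

```python
def masked_probs(probs, already_bombed):
    """Zero out the probabilities of already-bombed cells, then
    renormalize so the remaining entries form a distribution."""
    masked = [p * (i not in already_bombed) for i, p in enumerate(probs)]
    total = sum(masked)
    return [p / total for p in masked]

# With cells 3 and 4 excluded, all probability mass moves to cells 0-2.
masked_probs([0.1, 0.2, 0.3, 0.2, 0.2], {3, 4})
```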
<p>Before moving on, we note that the reason we use a neural network for our policy is to allow for efficient generalization: For games like Go that have a very large number of states, it is not feasible to collect data on every possible board position. This is exactly the context where <span class="caps">ML</span> algorithms excel — generalizing from past observations to make good predictions for new situations. In order to keep our focus on <span class="caps">RL</span>, we won’t review how <span class="caps">ML</span> algorithms work in this post (however, you can check out our <a href="http://efavdb.github.io/archives">archives</a> section for relevant primers). Instead we simply note that — utilizing these tools — we can get good performance by training only on a <em>representative subset</em> of games — allowing us to avoid study of the full set, which can be much larger.</p>
<h4 id="the-rewards-function">The rewards function</h4>
<p>To train an <span class="caps">RL</span> algorithm, we must carry out an iterative game play / scoring process: We play games according to our current policy, selecting moves with frequencies proportional to the probabilities output by the network. If the actions taken resulted in good outcomes, we want to strengthen the probability of those actions going forward.</p>
<p>The rewards function is the tool we use to formally score our outcomes in past games — we will encourage our algorithm to try to maximize this quantity during game play. In effect, it is a hyper-parameter for the <span class="caps">RL</span> algorithm: many different functions could be used, each resulting in different learning characteristics. For our battleship program, we have used the function
</p>
<div class="math">\begin{align} \label{rewards} \tag{2}
r(a;t_0) = \sum_{t \geq t_0} \left ( h(t) - \overline{h(t)} \right) (0.5)^{t-t_0}.
\end{align}</div>
<p>
Given a completed game log, this function looks at the action <span class="math">\(a\)</span> taken at time <span class="math">\(t_0\)</span> and returns a weighted sum of hit values <span class="math">\(h(t)\)</span> for this and all future steps in the game. Here, <span class="math">\(h(t)\)</span> is <span class="math">\(1\)</span> if we had a hit at step <span class="math">\(t\)</span> and is <span class="math">\(0\)</span> otherwise.</p>
<p>In arriving at (\ref{rewards}), we admit that we did not carry out a careful search over the set of all possible rewards functions. However, we have confirmed that this choice results in good game play, and it is well-motivated: In particular, we note that the weighting term <span class="math">\((0.5)^{t-t_0}\)</span> serves to strongly incentivize a hit on the current move (we get a reward of <span class="math">\(1\)</span> for a hit at <span class="math">\(t_0\)</span>), but a hit at <span class="math">\((t_0 + 1)\)</span> also rewards the action at <span class="math">\(t_0\)</span> — with value <span class="math">\(0.5\)</span>. Similarly, a hit at <span class="math">\((t_0 + 2)\)</span> rewards <span class="math">\(0.25\)</span>, etc. This weighted look-ahead aspect of (\ref{rewards}) serves to encourage efficient exploration of the board: It forces the program to care about moves that will enable future hits. The other ingredient of note present in (\ref{rewards}) is the subtraction of <span class="math">\(\overline{h(t)}\)</span>. This is the expected reward that a random network would obtain. By pulling this out, we only reward our network if it is outperforming random choices — this results in a net speed-up of the learning process.</p>
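<p>As a concrete check of (\ref{rewards}), ignoring the baseline term, a hit log of <code>[0, 1, 1]</code> gives the move at <span class="math">\(t_0 = 0\)</span> a reward of <span class="math">\(0.5 + 0.25 = 0.75\)</span>. A minimal sketch (helper name ours; the full implementation, including the baseline, appears later in the walkthrough):</p>

```python
def discounted_hits(hit_log, t0, gamma=0.5):
    """Sum of gamma**(t - t0) * h(t) for t >= t0: Eq. (2)
    without the baseline subtraction."""
    return sum((gamma ** (t - t0)) * h
               for t, h in enumerate(hit_log) if t >= t0)

discounted_hits([0, 1, 1], t0=0)  # → 0.75
```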
<h4 id="stochastic-gradient-descent">Stochastic gradient descent</h4>
<p>In order to train our algorithm to maximize captured rewards during game play, we apply gradient descent. To carry this out, we imagine allowing our network parameters <span class="math">\(\theta\)</span> to vary at some particular step in the game. Averaging over all possible actions, the gradient of the expected rewards is then formally,
</p>
<div class="math">\begin{align} \nonumber
\partial_{\theta} \langle r(a \vert s) \rangle &\equiv & \partial_{\theta} \int p(a \vert \theta, s) r(a \vert s) da \\ \nonumber
&=& \int p(a \vert \theta, s) r(a \vert s) \partial_{\theta} \log \left ( p(a \vert \theta, s) \right) da \\
&\equiv & \langle r(a \vert s) \partial_{\theta} \log \left ( p(a \vert \theta, s) \right) \rangle. \tag{3} \label{formal_ev}
\end{align}</div>
<p>
Here, the <span class="math">\(p(a)\)</span> values are the action probability outputs of our network.</p>
<p>Unfortunately, we usually can’t evaluate the last line above. However, what we can do is approximate it using a sampled value: We simply play a game with our current network, then replace the expected value above by the reward actually captured on the <span class="math">\(i\)</span>-th move,
</p>
<div class="math">\begin{align}
\hat{g}_i = r(a_i) \nabla_{\theta} \log p(a_i \vert s_i, \theta). \tag{4} \label{estimator}
\end{align}</div>
<p>
Here, <span class="math">\(a_i\)</span> is the action that was taken, <span class="math">\(r(a_i)\)</span> is the reward that was captured, and the derivative of the logarithm shown can be evaluated via back-propagation (aside for those experienced with neural networks: this is the derivative of the cross-entropy loss function that would apply if you treated the event like a supervised-learning training example — with the selected action <span class="math">\(a_i\)</span> taken as the label). The function <span class="math">\(\hat{g}_i\)</span> provides a noisy estimate of the desired gradient, but taking many steps will result in a “stochastic” gradient descent, on average pushing us towards correct rewards maximization.</p>
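<p>The identity in (\ref{formal_ev}) can be verified numerically for a toy two-action softmax policy: the score-function expression <span class="math">\(\langle r \, \partial_{\theta} \log p \rangle\)</span> matches a finite-difference derivative of the expected reward. A self-contained check (all names here are ours):</p>

```python
import numpy as np

def expected_reward(theta, r=(1.0, 0.0)):
    """Expected reward of a two-action softmax policy with logits (theta, 0)."""
    p = np.exp([theta, 0.0])
    p /= p.sum()
    return p[0] * r[0] + p[1] * r[1]

theta, eps = 0.3, 1e-6

# Finite-difference estimate of d/dtheta E[r].
numeric = (expected_reward(theta + eps) - expected_reward(theta - eps)) / (2 * eps)

# Analytic E[r * dlog p / dtheta]: dlog p0/dtheta = 1 - p0, dlog p1/dtheta = -p0.
p = np.exp([theta, 0.0])
p /= p.sum()
analytic = p[0] * 1.0 * (1 - p[0]) + p[1] * 0.0 * (-p[0])

assert abs(numeric - analytic) < 1e-6
```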
<h4 id="summary-of-the-training-process">Summary of the training process</h4>
<p>In summary, then, <span class="caps">RL</span> training proceeds iteratively: To initialize an iterative step, we first play a game with our current policy network, selecting moves stochastically according to the network’s output. After the game is complete, we then score our outcome by evaluating the rewards captured on each move — for example, in the battleship game we use (\ref{rewards}). Once this is done, we then estimate the gradient of the rewards function using (\ref{estimator}). Finally, we update the network parameters, moving <span class="math">\(\theta \to \theta + \alpha \sum \hat{g}_i\)</span>, with <span class="math">\(\alpha\)</span> a small step size parameter. To continue, we then play a new game with the updated network, etc.</p>
<p>To see that this process does, in fact, encourage actions that have resulted in good outcomes during training, note that (\ref{estimator}) is proportional to the rewards captured at the step <span class="math">\(i\)</span>. Consequently, when we adjust our parameters in the direction of (\ref{estimator}), we will strongly encourage those actions that have resulted in large rewards outcomes. Further, those moves with negative rewards are actually suppressed. In this way, over time, the network will learn to examine the system and suggest those moves that will likely produce the best outcomes.</p>
<p>That’s it for the basics of deep, policy-gradient <span class="caps">RL</span>. We now turn to our python example, battleship.</p>
<h3 id="python-code-walkthrough-battleship-rl">Python code walkthrough — battleship <span class="caps">RL</span></h3>
<p>Load the needed packages.</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="nn">tf</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">pylab</span>
</pre></div>
<p>Define our network — a fully connected, three-layer system. The code below is mostly tensorflow boilerplate that can be picked up by going through their first tutorials. The one unusual thing is that the learning rate is fed in through a placeholder rather than fixed as a constant. This will allow us to vary our step sizes with the observed rewards captured below.</p>
<div class="highlight"><pre><span></span><span class="n">BOARD_SIZE</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">SHIP_SIZE</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">hidden_units</span> <span class="o">=</span> <span class="n">BOARD_SIZE</span>
<span class="n">output_units</span> <span class="o">=</span> <span class="n">BOARD_SIZE</span>
<span class="n">input_positions</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">BOARD_SIZE</span><span class="p">))</span>
<span class="n">labels</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">int64</span><span class="p">)</span>
<span class="n">learning_rate</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[])</span>
<span class="c1"># Generate hidden layer</span>
<span class="n">W1</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">truncated_normal</span><span class="p">([</span><span class="n">BOARD_SIZE</span><span class="p">,</span> <span class="n">hidden_units</span><span class="p">],</span>
<span class="n">stddev</span><span class="o">=</span><span class="mf">0.1</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="nb">float</span><span class="p">(</span><span class="n">BOARD_SIZE</span><span class="p">))))</span>
<span class="n">b1</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">zeros</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="n">hidden_units</span><span class="p">]))</span>
<span class="n">h1</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">input_positions</span><span class="p">,</span> <span class="n">W1</span><span class="p">)</span> <span class="o">+</span> <span class="n">b1</span><span class="p">)</span>
<span class="c1"># Second layer -- linear classifier for action logits</span>
<span class="n">W2</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">truncated_normal</span><span class="p">([</span><span class="n">hidden_units</span><span class="p">,</span> <span class="n">output_units</span><span class="p">],</span>
<span class="n">stddev</span><span class="o">=</span><span class="mf">0.1</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="nb">float</span><span class="p">(</span><span class="n">hidden_units</span><span class="p">))))</span>
<span class="n">b2</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">zeros</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="n">output_units</span><span class="p">]))</span>
<span class="n">logits</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">h1</span><span class="p">,</span> <span class="n">W2</span><span class="p">)</span> <span class="o">+</span> <span class="n">b2</span>
<span class="n">probabilities</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">logits</span><span class="p">)</span>
<span class="n">init</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">initialize_all_variables</span><span class="p">()</span>
<span class="n">cross_entropy</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">sparse_softmax_cross_entropy_with_logits</span><span class="p">(</span>
<span class="n">logits</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s1">'xentropy'</span><span class="p">)</span>
<span class="n">train_step</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">train</span><span class="o">.</span><span class="n">GradientDescentOptimizer</span><span class="p">(</span>
<span class="n">learning_rate</span><span class="o">=</span><span class="n">learning_rate</span><span class="p">)</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span><span class="n">cross_entropy</span><span class="p">)</span>
<span class="c1"># Start TF session</span>
<span class="n">sess</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">init</span><span class="p">)</span>
</pre></div>
<p>Next, we define a method that will allow us to play a game using our network. The <span class="caps">TRAINING</span> variable specifies whether to select moves stochastically or to always take the move the network currently considers best. Note that the method returns a set of logs that record the game proceedings. These are needed for training.</p>
<div class="highlight"><pre><span></span><span class="n">TRAINING</span> <span class="o">=</span> <span class="kc">True</span>
<span class="k">def</span> <span class="nf">play_game</span><span class="p">(</span><span class="n">training</span><span class="o">=</span><span class="n">TRAINING</span><span class="p">):</span>
<span class="sd">""" Play game of battleship using network."""</span>
<span class="c1"># Select random location for ship</span>
<span class="n">ship_left</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="n">BOARD_SIZE</span> <span class="o">-</span> <span class="n">SHIP_SIZE</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">ship_positions</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">ship_left</span><span class="p">,</span> <span class="n">ship_left</span> <span class="o">+</span> <span class="n">SHIP_SIZE</span><span class="p">))</span>
<span class="c1"># Initialize logs for game</span>
<span class="n">board_position_log</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">action_log</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">hit_log</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># Play through game</span>
<span class="n">current_board</span> <span class="o">=</span> <span class="p">[[</span><span class="o">-</span><span class="mi">1</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">BOARD_SIZE</span><span class="p">)]]</span>
<span class="k">while</span> <span class="nb">sum</span><span class="p">(</span><span class="n">hit_log</span><span class="p">)</span> <span class="o"><</span> <span class="n">SHIP_SIZE</span><span class="p">:</span>
<span class="n">board_position_log</span><span class="o">.</span><span class="n">append</span><span class="p">([[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">current_board</span><span class="p">[</span><span class="mi">0</span><span class="p">]]])</span>
<span class="n">probs</span> <span class="o">=</span> <span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">([</span><span class="n">probabilities</span><span class="p">],</span> <span class="n">feed_dict</span><span class="o">=</span><span class="p">{</span><span class="n">input_positions</span><span class="p">:</span><span class="n">current_board</span><span class="p">})[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="n">probs</span> <span class="o">=</span> <span class="p">[</span><span class="n">p</span> <span class="o">*</span> <span class="p">(</span><span class="n">index</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">action_log</span><span class="p">)</span> <span class="k">for</span> <span class="n">index</span><span class="p">,</span> <span class="n">p</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">probs</span><span class="p">)]</span>
<span class="n">probs</span> <span class="o">=</span> <span class="p">[</span><span class="n">p</span> <span class="o">/</span> <span class="nb">sum</span><span class="p">(</span><span class="n">probs</span><span class="p">)</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">probs</span><span class="p">]</span>
<span class="k">if</span> <span class="n">training</span> <span class="o">==</span> <span class="kc">True</span><span class="p">:</span>
<span class="n">bomb_index</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">BOARD_SIZE</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">probs</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">bomb_index</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">probs</span><span class="p">)</span>
<span class="c1"># update board, logs</span>
<span class="n">hit_log</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="mi">1</span> <span class="o">*</span> <span class="p">(</span><span class="n">bomb_index</span> <span class="ow">in</span> <span class="n">ship_positions</span><span class="p">))</span>
<span class="n">current_board</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">bomb_index</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">*</span> <span class="p">(</span><span class="n">bomb_index</span> <span class="ow">in</span> <span class="n">ship_positions</span><span class="p">)</span>
<span class="n">action_log</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">bomb_index</span><span class="p">)</span>
<span class="k">return</span> <span class="n">board_position_log</span><span class="p">,</span> <span class="n">action_log</span><span class="p">,</span> <span class="n">hit_log</span>
</pre></div>
<p>Our implementation of the rewards function (\ref{rewards}):</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">rewards_calculator</span><span class="p">(</span><span class="n">hit_log</span><span class="p">,</span> <span class="n">gamma</span><span class="o">=</span><span class="mf">0.5</span><span class="p">):</span>
<span class="sd">""" Discounted sum of future hits over trajectory"""</span>
<span class="n">hit_log_weighted</span> <span class="o">=</span> <span class="p">[(</span><span class="n">item</span> <span class="o">-</span>
<span class="nb">float</span><span class="p">(</span><span class="n">SHIP_SIZE</span> <span class="o">-</span> <span class="nb">sum</span><span class="p">(</span><span class="n">hit_log</span><span class="p">[:</span><span class="n">index</span><span class="p">]))</span> <span class="o">/</span> <span class="nb">float</span><span class="p">(</span><span class="n">BOARD_SIZE</span> <span class="o">-</span> <span class="n">index</span><span class="p">))</span> <span class="o">*</span> <span class="p">(</span>
<span class="n">gamma</span> <span class="o">**</span> <span class="n">index</span><span class="p">)</span> <span class="k">for</span> <span class="n">index</span><span class="p">,</span> <span class="n">item</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">hit_log</span><span class="p">)]</span>
<span class="k">return</span> <span class="p">[((</span><span class="n">gamma</span><span class="p">)</span> <span class="o">**</span> <span class="p">(</span><span class="o">-</span><span class="n">i</span><span class="p">))</span> <span class="o">*</span> <span class="nb">sum</span><span class="p">(</span><span class="n">hit_log_weighted</span><span class="p">[</span><span class="n">i</span><span class="p">:])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">hit_log</span><span class="p">))]</span>
</pre></div>
<p>Finally, our training loop. Here, we iteratively play through many games, scoring after each game, then adjusting parameters — setting the placeholder learning rate equal to <span class="caps">ALPHA</span> times the rewards captured.</p>
<div class="highlight"><pre><span></span><span class="n">game_lengths</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">TRAINING</span> <span class="o">=</span> <span class="kc">True</span> <span class="c1"># Boolean specifies training mode</span>
<span class="n">ALPHA</span> <span class="o">=</span> <span class="mf">0.06</span> <span class="c1"># step size</span>
<span class="k">for</span> <span class="n">game</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10000</span><span class="p">):</span>
<span class="n">board_position_log</span><span class="p">,</span> <span class="n">action_log</span><span class="p">,</span> <span class="n">hit_log</span> <span class="o">=</span> <span class="n">play_game</span><span class="p">(</span><span class="n">training</span><span class="o">=</span><span class="n">TRAINING</span><span class="p">)</span>
<span class="n">game_lengths</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">action_log</span><span class="p">))</span>
<span class="n">rewards_log</span> <span class="o">=</span> <span class="n">rewards_calculator</span><span class="p">(</span><span class="n">hit_log</span><span class="p">)</span>
<span class="k">for</span> <span class="n">reward</span><span class="p">,</span> <span class="n">current_board</span><span class="p">,</span> <span class="n">action</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">rewards_log</span><span class="p">,</span> <span class="n">board_position_log</span><span class="p">,</span> <span class="n">action_log</span><span class="p">):</span>
<span class="c1"># Take step along gradient</span>
<span class="k">if</span> <span class="n">TRAINING</span><span class="p">:</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">([</span><span class="n">train_step</span><span class="p">],</span>
<span class="n">feed_dict</span><span class="o">=</span><span class="p">{</span><span class="n">input_positions</span><span class="p">:</span><span class="n">current_board</span><span class="p">,</span> <span class="n">labels</span><span class="p">:[</span><span class="n">action</span><span class="p">],</span> <span class="n">learning_rate</span><span class="p">:</span><span class="n">ALPHA</span> <span class="o">*</span> <span class="n">reward</span><span class="p">})</span>
</pre></div>
<p>Running this last cell, we see that the training works! The following is an example trace from the play_game() method, with the variable <span class="caps">TRAINING</span> set to False. This illustrates an intelligent move selection process.</p>
<div class="highlight"><pre><span></span><span class="c1"># Example game trace output</span>
<span class="p">([[[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">]],</span>
<span class="p">[[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">]],</span>
<span class="p">[[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">]],</span>
<span class="p">[[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">]],</span>
<span class="p">[[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">]]],</span>
<span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">],</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
</pre></div>
<p>Here, the first five lines are the board encodings that the network was fed at each step — using (\ref{input}). The second-to-last row presents the sequential grid selections that were made. Finally, the last row is the hit log. Notice that the first two moves nicely sample different regions of the board. After this, a hit was recorded at <span class="math">\(6\)</span>. The algorithm then intelligently selects <span class="math">\(7\)</span> and <span class="math">\(8\)</span>, which it can infer must be the final locations of the ship.</p>
<p>The plot below provides further characterization of the learning process. It shows the running average game length (steps required to fully bomb the ship) versus training epoch. The program learns the basics quite quickly, then continues to improve gradually over time [2].</p>
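<p>The running average shown in that plot can be computed with a simple trailing window. A minimal sketch (the window length here is an assumption for illustration, not the value used for the plot):</p>

```python
import numpy as np

def running_average(values, window=50):
    """Trailing mean over the last `window` entries (shorter at the start)."""
    values = np.asarray(values, dtype=float)
    return np.array([values[max(0, i - window + 1): i + 1].mean()
                     for i in range(len(values))])

# Smooth a noisy sequence of game lengths:
print(running_average([10, 8, 9, 7, 6], window=3))
```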
<p><a href="https://efavdb.com/wp-content/uploads/2016/10/trace.jpg"><img alt="trace" src="https://efavdb.com/wp-content/uploads/2016/10/trace.jpg"></a></p>
<h3 id="summary">Summary</h3>
<p>In this post, we have covered a variant of <span class="caps">RL</span> — namely, the policy-gradient, deep <span class="caps">RL</span> scheme. This method typically defaults to the current best-known strategy, but occasionally samples from other approaches, resulting in an iterative improvement in policy. The two main ingredients here are the policy network and the rewards function. Whereas in supervised learning the network architecture is usually where most of the thinking goes, in the <span class="caps">RL</span> context it is the rewards function that typically requires the most thought. A good choice should be as local in time as possible, so as to facilitate training (dependence on distant forecasts results in slow learning). However, the rewards function should also directly target the ultimate end of the process, namely “winning” the game; without care, it can instead encourage side quests that aren’t necessary. Balancing these two competing demands can be a challenge, and rewards function design is therefore something of an art form.</p>
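<p>To see the locality trade-off concretely, consider the discounted return-to-go for a simple hit sequence. The sketch below (illustrative only — not the post’s exact rewards function) shows that a small discount factor concentrates credit near the step that produced the hit, while a large one spreads credit back to early moves:</p>

```python
def discounted_returns(rewards, gamma):
    """Return-to-go at each step: R_t = sum over k >= t of gamma**(k - t) * r_k."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

rewards = [0, 0, 1, 0, 0]  # a single hit on the third move
print(discounted_returns(rewards, 0.1))  # credit stays near the hit
print(discounted_returns(rewards, 0.9))  # early moves share the credit
```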
<p>Our brief introduction here was intended only to illustrate the gist of how <span class="caps">RL</span> is carried out in practice. For further details, we can recommend two resources: the textbook by Sutton and Barto [3] and a recent talk by John Schulman [4].</p>
<h3 id="footnotes-and-references">Footnotes and references</h3>
<p>[1] Game rules: Battleship is a two-player game. Both players begin with a finite regular grid of positions — hidden from their opponent — and a set of “ships”. Each player receives the same quantity of each type of ship. At the start of the game, each player places the ships on their grid in whatever locations they like, subject to some constraints: A ship of length 2, say, must occupy two contiguous indices on the board, and no two ships can occupy the same grid location. Once placed, the ships are fixed in position for the remainder of the game. At this point, game play begins, with the goal being to sink the opponent’s ships. The locations of the enemy ships are initially unknown because we cannot see the opponent’s grid. To find the ships, one “bombs” indices on the enemy grid — with the players bombing in turns. When an opponent index is bombed, the opponent must truthfully state whether or not a ship was located at the index bombed. Whoever succeeds in bombing all their opponent’s occupied indices first wins the game. Therefore, the problem reduces to finding the enemy ship indices as quickly as possible.</p>
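<p>A minimal sketch of these placement rules for the 1-d, single-ship variant considered in this post (the constants are assumed values for illustration):</p>

```python
import random

BOARD_SIZE, SHIP_SIZE = 10, 3  # assumed values for illustration

def place_ship():
    """Place one ship on SHIP_SIZE contiguous indices of a 1-d board."""
    start = random.randint(0, BOARD_SIZE - SHIP_SIZE)
    return list(range(start, start + SHIP_SIZE))

print(place_ship())  # e.g. [4, 5, 6]
```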
<p>[2] One of my colleagues (<span class="caps">HC</span>) has suggested that the program likely begins to overfit at some point. However, the 1-d version of the game has so few possible ship locations that characterizing this effect via a training and test set split does not seem appropriate. This approach could work, though, were we to move to higher dimensions and introduce multiple ships.</p>
<p>[3] Sutton and Barto, (2016). “Reinforcement Learning: An Introduction”. Text site, <a href="https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html">here</a>.</p>
<p>[4] John Schulman, (2016). “Bay Area Deep Learning School”. Youtube recording of talk available <a href="https://www.youtube.com/watch?v=9dXiAecyJrY">here</a>.</p>
GPU-accelerated Theano & Keras with Windows 102016-09-22T21:48:00-07:002016-09-22T21:48:00-07:00Damien RJtag:efavdb.com,2016-09-22:/gpu-accelerated-theano-keras-with-windows-10<p>There are many tutorials with directions for how to use your Nvidia graphics card for <span class="caps">GPU</span>-accelerated Theano and Keras for Linux, but there is only limited information out there for you if you want to set everything up with Windows and the current <span class="caps">CUDA</span> toolkit. This is a shame …</p><p>There are many tutorials with directions for how to use your Nvidia graphics card for <span class="caps">GPU</span>-accelerated Theano and Keras on Linux, but there is only limited information out there if you want to set everything up with Windows and the current <span class="caps">CUDA</span> toolkit. This is a shame, because there are a large number of computers out there with very nice video cards that only run Windows, and it is not always practical to use a virtual machine or dual-boot. So for today’s post we will go over how to get everything running on Windows 10, saving you all the trial and error I went through. (All of these steps should also work in earlier versions of Windows.)</p>
<h2 id="dependencies">Dependencies</h2>
<p>Before getting started, make sure you have the following:</p>
<ul>
<li><span class="caps">NVIDIA</span> card that supports <span class="caps">CUDA</span> (<a href="https://developer.nvidia.com/cuda-gpus">link</a>)</li>
<li>Python 2.7 (<a href="http://conda.pydata.org/miniconda.html">Anaconda</a> preferably)</li>
<li>Compilers for C/C++</li>
<li><span class="caps">CUDA</span> 7.5</li>
<li><span class="caps">GCC</span> for code generated by Theano</li>
</ul>
<h2 id="setup">Setup</h2>
<h3 id="visual-studio-2013-community-edition-update-4">Visual Studio 2013 Community Edition Update 4</h3>
<p>First, go and download the installer for <a href="https://www.visualstudio.com/en-us/news/vs2013-community-vs.aspx">Visual Studio 2013 Community Edition Update 4</a>. You cannot use the 2015 version because it is not yet supported by <span class="caps">CUDA</span>. When installing, there is no need to install any of the optional packages. When you are done, add the compiler directory, <strong>C:\Program Files (x86)\Microsoft Visual Studio 12.0\<span class="caps">VC</span>\bin</strong>, to your Windows path.</p>
<p>To add something to your Windows path, go to System and then Advanced system settings:</p>
<p>System → Advanced system settings → Environment Variables → Path.</p>
<h3 id="cuda"><span class="caps">CUDA</span></h3>
<p>Next, go to <span class="caps">NVIDIA</span>’s website and <a href="https://developer.nvidia.com/cuda-downloads">download</a> the <span class="caps">CUDA</span> 7.5 toolkit, selecting the right version for your computer. When installing it, make sure to pick the custom install if you don’t want your video card drivers to be overwritten by the versions that come with the toolkit, which are often out of date. If it turns out that your drivers are older than those that come with the toolkit, then there is no harm in updating them; otherwise, only pick the three boxes starting with <span class="caps">CUDA</span>.</p>
<h3 id="gcc"><span class="caps">GCC</span></h3>
<p>The last thing we need is a <span class="caps">GCC</span> compiler; I recommend <a href="http://tdm-gcc.tdragon.net/download"><span class="caps">TDM</span>-gcc</a>. Install the 64-bit version and then add the compiler to your Windows path; the installer has an option to do this for you automatically if you wish.</p>
<p>To make sure that everything is working at this point, run the following commands on the command line (cmd.exe). If it finds the path for every tool, you are good to go.</p>
<div class="highlight"><pre><span></span>where gcc
where cl
where nvcc
where cudafe
where cudafe++
</pre></div>
<h3 id="theano-and-keras">Theano and Keras</h3>
<p>At this point it is easy to install Theano and Keras: just use pip (or conda and pip)!</p>
<div class="highlight"><pre><span></span><span class="err">conda install mingw libpython</span>
<span class="err">pip install theano</span>
<span class="err">pip install keras</span>
</pre></div>
<p>After installing the Python libraries, you need to tell Theano to use the <span class="caps">GPU</span> instead of the <span class="caps">CPU</span>. A lot of older posts would have you set this in the system environment, but it is possible to make a config file in your home directory named “<em>.theanorc.txt</em>” instead. This also makes it easy to switch out config files. Inside the file, put the following:</p>
<div class="highlight"><pre><span></span><span class="k">[global]</span>
<span class="na">device</span> <span class="o">=</span> <span class="s">gpu</span>
<span class="na">floatX</span> <span class="o">=</span> <span class="s">float32</span>
<span class="k">[nvcc]</span>
<span class="na">compiler_bindir</span><span class="o">=</span><span class="s">C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin</span>
</pre></div>
<p>Lastly, set up the Keras config file <code>~/.keras/keras.json</code>. If you haven’t started Keras yet, the folder and file won’t be there, but you can create them. Inside the config, put the following:</p>
<div class="highlight"><pre><span></span><span class="err">{</span>
<span class="err"> "image_dim_ordering": "tf",</span>
<span class="err"> "epsilon": 1e-07,</span>
<span class="err"> "floatx": "float32",</span>
<span class="err"> "backend": "theano"</span>
<span class="err">}</span>
</pre></div>
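<p>If you prefer, the same config file can be written programmatically. A quick sketch (the path follows the Keras default; adjust if your setup differs):</p>

```python
import json
import os

# Write ~/.keras/keras.json with the settings above.
config = {
    "image_dim_ordering": "tf",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "theano",
}
keras_dir = os.path.join(os.path.expanduser("~"), ".keras")
if not os.path.isdir(keras_dir):  # also works on Python 2.7
    os.makedirs(keras_dir)
with open(os.path.join(keras_dir, "keras.json"), "w") as f:
    json.dump(config, f, indent=4)
```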
<h2 id="testing-theano-with-gpu">Testing Theano with <span class="caps">GPU</span></h2>
<p>Using the following Python code, check whether your installation of Theano is using your <span class="caps">GPU</span>.</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">theano</span> <span class="kn">import</span> <span class="n">function</span><span class="p">,</span> <span class="n">config</span><span class="p">,</span> <span class="n">shared</span><span class="p">,</span> <span class="n">sandbox</span>
<span class="kn">import</span> <span class="nn">theano.tensor</span> <span class="k">as</span> <span class="nn">T</span>
<span class="kn">import</span> <span class="nn">numpy</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="n">vlen</span> <span class="o">=</span> <span class="mi">10</span> <span class="o">*</span> <span class="mi">30</span> <span class="o">*</span> <span class="mi">768</span> <span class="c1"># 10 x #cores x # threads per core</span>
<span class="n">iters</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="n">rng</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">RandomState</span><span class="p">(</span><span class="mi">22</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">shared</span><span class="p">(</span><span class="n">numpy</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">rng</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="n">vlen</span><span class="p">),</span> <span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">))</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">function</span><span class="p">([],</span> <span class="n">T</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">maker</span><span class="o">.</span><span class="n">fgraph</span><span class="o">.</span><span class="n">toposort</span><span class="p">())</span>
<span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">iters</span><span class="p">):</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">f</span><span class="p">()</span>
<span class="n">t1</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Looping </span><span class="si">%d</span><span class="s2"> times took </span><span class="si">%f</span><span class="s2"> seconds"</span> <span class="o">%</span> <span class="p">(</span><span class="n">iters</span><span class="p">,</span> <span class="n">t1</span> <span class="o">-</span> <span class="n">t0</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Result is </span><span class="si">%s</span><span class="s2">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">r</span><span class="p">,))</span>
<span class="k">if</span> <span class="n">numpy</span><span class="o">.</span><span class="n">any</span><span class="p">([</span><span class="nb">isinstance</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">op</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">Elemwise</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">f</span><span class="o">.</span><span class="n">maker</span><span class="o">.</span><span class="n">fgraph</span><span class="o">.</span><span class="n">toposort</span><span class="p">()]):</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Used the cpu'</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Used the gpu'</span><span class="p">)</span>
</pre></div>
<h2 id="testing-keras-with-gpu">Testing Keras with <span class="caps">GPU</span></h2>
<p>This code will make sure that everything is working by training a model on some random data. The first run might take a little longer because the software needs to do some compiling.</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Sequential</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Activation</span>
<span class="c1"># for a single-input model with 2 classes (binary):</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">input_dim</span><span class="o">=</span><span class="mi">784</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'sigmoid'</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="s1">'rmsprop'</span><span class="p">,</span>
<span class="n">loss</span><span class="o">=</span><span class="s1">'binary_crossentropy'</span><span class="p">,</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s1">'accuracy'</span><span class="p">])</span>
<span class="c1"># generate dummy data</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">((</span><span class="mi">1000</span><span class="p">,</span> <span class="mi">784</span><span class="p">))</span>
<span class="n">labels</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="c1"># train the model, iterating on the data in batches</span>
<span class="c1"># of 32 samples</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">nb_epoch</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span>
</pre></div>
<p>If everything works you will see something like this!</p>
<p><a href="https://efavdb.com/wp-content/uploads/2016/09/output.png"><img alt="output" src="https://efavdb.com/wp-content/uploads/2016/09/output.png"></a></p>
<p>Now you can start playing with neural networks using your <span class="caps">GPU</span>!</p>Hyperparameter sample-size dependence2016-08-21T00:00:00-07:002016-08-21T00:00:00-07:00Jonathan Landytag:efavdb.com,2016-08-21:/model-selection<p>Here, we briefly review a subtlety associated with machine-learning model selection: the fact that the optimal hyperparameters for a model can vary with training set size, <span class="math">\(N.\)</span> To illustrate this point, we derive expressions for the optimal strength for both <span class="math">\(L_1\)</span> and <span class="math">\(L_2\)</span> regularization in single-variable models. We find that …</p><p>Here, we briefly review a subtlety associated with machine-learning model selection: the fact that the optimal hyperparameters for a model can vary with training set size, <span class="math">\(N.\)</span> To illustrate this point, we derive expressions for the optimal strength for both <span class="math">\(L_1\)</span> and <span class="math">\(L_2\)</span> regularization in single-variable models. We find that the optimal <span class="math">\(L_2\)</span> approaches a finite constant as <span class="math">\(N\)</span> increases, but that the optimal <span class="math">\(L_1\)</span> decays exponentially fast with <span class="math">\(N.\)</span> Sensitive dependence on <span class="math">\(N\)</span> such as this should be carefully extrapolated out when optimizing mission-critical models.</p>
<h3 id="introduction">Introduction</h3>
<p>There are two steps one must carry out to fit a machine-learning model. First, a specific model form and cost function must be selected; second, the model must be fit to the data. The first of these steps is often treated by making use of a training-test data split: One trains a set of candidate models on a fraction of the available data and then validates their performance on a held-out test set. The model that performs best on the latter is then selected for production.</p>
<p>Our purpose here is to highlight a subtlety to watch out for when carrying out an optimization as above: the fact that the optimal model can depend sensitively on training set size <span class="math">\(N\)</span>. This observation suggests that the training-test split paradigm must sometimes be applied with care: Because a subsample is used for training in the first, selection step, the model identified as optimal there may not be best when training on the full data set.</p>
<p>To illustrate the above points, our main effort here is to present some toy examples where the optimal hyperparameters can be characterized exactly: We derive the optimal <span class="math">\(L_1\)</span> and <span class="math">\(L_2\)</span> regularization strength for models having only a single variable. These examples illustrate two opposite limits: The latter approaches a finite constant as <span class="math">\(N\)</span> increases, but the former varies exponentially with <span class="math">\(N\)</span>. This shows that strong <span class="math">\(N\)</span>-dependence can sometimes occur, but is not necessarily always an issue. In practice, a simple way to check for sensitivity is to vary the size of your training set during model selection: If a strong dependence is observed, care should be taken during the final extrapolation.</p>
<p>We now walk through our two examples.</p>
<h3 id="l_2-optimization"><span class="math">\(L_2\)</span> optimization</h3>
<p>We start off by positing that we have a method for generating a Bayesian posterior for a parameter <span class="math">\(\theta\)</span> that is a function of a vector of <span class="math">\(N\)</span> random samples <span class="math">\(\textbf{x}\)</span>. To simplify our discussion, we assume that — given a flat prior — this is unbiased and normal with variance <span class="math">\(\sigma^2\)</span>. We write <span class="math">\(\theta_0 \equiv \theta_0(\textbf{x})\)</span> for the maximum a posteriori (<span class="caps">MAP</span>) value under the flat prior. With the introduction of an <span class="math">\(L_2\)</span> prior, the posterior for <span class="math">\(\theta\)</span> is then
</p>
<div class="math">$$\tag{1}
P\left(\theta \vert \theta_0(\textbf{x})\right) \propto \exp\left( - \frac{(\theta - \theta_0)^2}{2 \sigma^2} - \Lambda \theta^2 \right).
$$</div>
<p>
Setting the derivative of the log of the above to zero, the <span class="caps">MAP</span> point estimate is given by
</p>
<div class="math">$$\tag{2}
\hat{\theta} = \frac{\theta_0}{1 + 2 \Lambda \sigma^2}.
$$</div>
<p>
The average squared error of this estimate is obtained by averaging over the possible <span class="math">\(\theta_0\)</span> values. Our assumptions above imply that <span class="math">\(\theta_0\)</span> is normal about the true parameter value, <span class="math">\(\theta_*\)</span>, so we have
</p>
<div class="math">\begin{eqnarray}
\langle (\hat{\theta} - \theta_*)^2 \rangle &\equiv& \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi \sigma^2}}
e^{ - \frac{(\theta_0 - \theta_*)^2}{2 \sigma^2}} \left ( \frac{\theta_0}{1 + 2 \Lambda \sigma^2} - \theta_* \right)^2 d \theta_0 \\
&=& \frac{ 4 \Lambda^2 \sigma^4 \theta_*^2 }{(1 + 2 \Lambda \sigma^2 )^2} + \frac{\sigma^2}{\left( 1 + 2 \Lambda \sigma^2 \right)^2}. \tag{3} \label{error}
\end{eqnarray}</div>
<p>
The optimal <span class="math">\(\Lambda\)</span> is readily obtained by minimizing this average error. This gives,
</p>
<div class="math">$$ \label{result}
\Lambda = \frac{1}{2 \theta_*^2}, \tag{4}
$$</div>
<p>
a constant, independent of sample size. The mean squared error with this choice is obtained by plugging (\ref{result}) into (\ref{error}). This gives
</p>
<div class="math">$$
\langle (\hat{\theta} - \theta_*)^2 \rangle = \frac{\sigma^2}{1 + \sigma^2 / \theta_*^2}. \tag{5}
$$</div>
<p>
Notice that this is strictly less than <span class="math">\(\sigma^2\)</span> — the variance one would get without regularization — and that the benefit is largest when <span class="math">\(\sigma^2 \gg \theta_*^2\)</span>. That is, <span class="math">\(L_2\)</span> regularization is most effective when <span class="math">\(\theta_*\)</span> is hard to differentiate from zero — an intuitive result!</p>
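<p>As a sanity check on Eqs. (3)-(5), one can minimize the averaged error numerically over a grid of <span class="math">\(\Lambda\)</span> values. The sketch below (plain numpy, with hypothetical values for <span class="math">\(\theta_*\)</span> and <span class="math">\(\sigma\)</span>) confirms that the minimizer sits at <span class="math">\(1 / (2 \theta_*^2)\)</span> and that the resulting minimum error matches (5):</p>

```python
import numpy as np

# Hypothetical values for the true parameter and the posterior width
theta_star, sigma = 2.0, 1.5

def mse(lam):
    """Mean squared error of the L2-regularized estimator, Eq. (3)."""
    s2 = sigma ** 2
    return (4 * lam ** 2 * s2 ** 2 * theta_star ** 2 + s2) / (1 + 2 * lam * s2) ** 2

# Brute-force minimization over a fine grid of regularization strengths
lams = np.linspace(0.0, 1.0, 200001)
lam_opt = lams[np.argmin(mse(lams))]

# Matches the analytic optimum (4) and the minimum error (5)
assert abs(lam_opt - 1 / (2 * theta_star ** 2)) < 1e-3
assert abs(mse(lam_opt) - sigma ** 2 / (1 + sigma ** 2 / theta_star ** 2)) < 1e-6
```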
<h3 id="l_1-optimization"><span class="math">\(L_1\)</span> optimization</h3>
<p>The analysis for <span class="math">\(L_1\)</span> optimization is similar to the above, but slightly more involved. We go through it quickly. The posterior with an <span class="math">\(L_1\)</span> prior is given by
</p>
<div class="math">$$ \tag{6}
P\left(\theta \vert \theta_0(\textbf{x})\right) \propto \exp\left( - \frac{(\theta - \theta_0)^2}{2 \sigma^2} - \Lambda \vert \theta \vert \right).
$$</div>
<p>
Assuming for simplicity that <span class="math">\(\theta_* > 0\)</span> (so that the negative branch of the soft-thresholded estimator can be neglected), the <span class="caps">MAP</span> value is now
</p>
<div class="math">$$ \tag{7}
\hat{\theta} = \begin{cases}
\theta_0 - \Lambda \sigma^2 & \text{if } \theta_0 - \Lambda \sigma^2 > 0 \\
0 & \text{else}.
\end{cases}
$$</div>
<p>
The mean squared error of the estimator is
</p>
<div class="math">$$ \tag{8}
\langle (\hat{\theta} - \theta_*)^2 \rangle \equiv \int \frac{1}{\sqrt{2 \pi \sigma^2}}
e^{ - \frac{(\theta_0 - \theta_*)^2}{2 \sigma^2}} \left ( \hat{\theta} - \theta_* \right)^2 d \theta_0.
$$</div>
<p>
This can be evaluated in terms of error functions. The optimal value of <span class="math">\(\Lambda\)</span> is obtained by differentiating the above with respect to <span class="math">\(\Lambda\)</span> and setting the result to zero. Doing this, one finds that it satisfies the equation
</p>
<div class="math">$$ \tag{9}
e^{ - \frac{(\tilde{\Lambda}- \tilde{\theta_*})^2}{2} } - \sqrt{\frac{\pi}{2}} \tilde{\Lambda} \ \text{erfc}\left( \frac{\tilde{\Lambda} - \tilde{\theta_*}}{\sqrt{2}} \right ) = 0,
$$</div>
<p>
where <span class="math">\(\tilde{\Lambda} \equiv \sigma \Lambda\)</span> and <span class="math">\(\tilde{\theta_*} \equiv \theta_* / \sigma\)</span>. In general, the equation above must be solved numerically. However, in the case where <span class="math">\(\theta_* \gg \sigma\)</span> — relevant when <span class="math">\(N\)</span> is large — we can obtain a clean asymptotic solution. In this case, we have <span class="math">\(\tilde{\theta_*} \gg 1\)</span> and we expect <span class="math">\(\tilde{\Lambda}\)</span> to be small. This implies that the above equation can be approximated as
</p>
<div class="math">$$ \tag{10}
e^{ - \frac{\tilde{\theta_*}^2}{2} } - \sqrt{2 \pi} \tilde{\Lambda} \sim 0.
$$</div>
<p>
Solving gives
</p>
<div class="math">\begin{eqnarray} \tag{11}
\Lambda \sim \frac{1}{\sqrt{2 \pi \sigma^2}} e^{ - \frac{\theta_*^2}{2 \sigma^2}} \sim \frac{\sqrt{N}}{\sqrt{2 \pi \sigma_1^2}} e^{ - \frac{N \theta_*^2}{2 \sigma_1^2}}.
\end{eqnarray}</div>
<p>
Here, in the last step we have made the <span class="math">\(N\)</span>-dependence explicit, writing <span class="math">\(\sigma^2 = \sigma_1^2 / N\)</span> — a form that follows when our samples <span class="math">\(\textbf{x}\)</span> are independent. Whereas the optimal <span class="math">\(L_2\)</span> regularization strength approaches a constant, our result here shows that the optimal <span class="math">\(L_1\)</span> strength decays exponentially to zero as <span class="math">\(N\)</span> increases.</p>
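<p>The asymptotic form can be checked against a direct numerical solution of the stationarity condition (9), whose two terms enter with opposite signs so that a root in <span class="math">\(\tilde{\Lambda}\)</span> exists. A bisection sketch in plain python, with a hypothetical value for <span class="math">\(\theta_* / \sigma\)</span>:</p>

```python
import math

def stationarity(lam_t, theta_t):
    """Left side of the stationarity condition for the rescaled L1
    strength; its two terms enter with opposite signs."""
    return math.exp(-(lam_t - theta_t) ** 2 / 2) \
        - math.sqrt(math.pi / 2) * lam_t * math.erfc((lam_t - theta_t) / math.sqrt(2))

theta_t = 4.0  # theta_* / sigma: hypothetical, "large-N" regime

# Bisection: stationarity is positive near 0 and negative at theta_t
lo, hi = 1e-12, theta_t
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if stationarity(mid, theta_t) > 0:
        lo = mid
    else:
        hi = mid
root = 0.5 * (lo + hi)

# Asymptotic solution (11), in rescaled variables
asym = math.exp(-theta_t ** 2 / 2) / math.sqrt(2 * math.pi)
assert abs(root - asym) / asym < 0.01
```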
<h3 id="summary">Summary</h3>
<p>The subtlety that we have discussed here is likely already familiar to those with significant applied modeling experience: optimal model hyperparameters can vary with training set size. However, the two toy examples we have presented are interesting in that they allow for this <span class="math">\(N\)</span> dependence to be derived explicitly. Interestingly, we have found that the <span class="caps">MSE</span>-minimizing <span class="math">\(L_2\)</span> regularization remains finite, even at large training set size, but the optimal <span class="math">\(L_1\)</span> regularization goes to zero in this same limit. For small and medium <span class="math">\(N\)</span>, this exponential dependence represents a strong sensitivity to <span class="math">\(N\)</span> — one that must be carefully taken into account when extrapolating to the full training set.</p>
<p>One can imagine many other situations where hyperparameters vary strongly with <span class="math">\(N\)</span>. For example, very complex systems may allow for ever-increasing model complexity as more data becomes available. Again, in practice, the most straightforward method to check for this is to vary the size of the training set during model selection. If strong dependence is observed, this should be extrapolated out to obtain the truly optimal model for production.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Bayesian Statistics: MCMC2016-08-07T18:37:00-07:002016-08-07T18:37:00-07:00Jonathan Landytag:efavdb.com,2016-08-07:/metropolis<p>We review the Metropolis algorithm — a simple Markov Chain Monte Carlo (<span class="caps">MCMC</span>) sampling method — and its application to estimating posteriors in Bayesian statistics. A simple python example is provided.</p>
<h2 id="introduction">Introduction</h2>
<p>One of the central aims of statistics is to identify good methods for fitting models to data. One way to …</p><p>We review the Metropolis algorithm — a simple Markov Chain Monte Carlo (<span class="caps">MCMC</span>) sampling method — and its application to estimating posteriors in Bayesian statistics. A simple python example is provided.</p>
<h2 id="introduction">Introduction</h2>
<p>One of the central aims of statistics is to identify good methods for fitting models to data. One way to do this is through the use of Bayes’ rule: If <span class="math">\(\textbf{x}\)</span> is a vector of <span class="math">\(k\)</span> samples from a distribution and <span class="math">\(\textbf{z}\)</span> is a vector of model parameters, Bayes’ rule gives
</p>
<div class="math">\begin{align} \tag{1} \label{Bayes}
p(\textbf{z} \vert \textbf{x}) = \frac{p(\textbf{x} \vert \textbf{z}) p(\textbf{z})}{p(\textbf{x})}.
\end{align}</div>
<p>
Here, the probability at left, <span class="math">\(p(\textbf{z} \vert \textbf{x})\)</span> — the “posterior” — is a function that tells us how likely it is that the underlying true parameter values are <span class="math">\(\textbf{z}\)</span>, given the information provided by our observations <span class="math">\(\textbf{x}\)</span>. Notice that if we could solve for this function, we would be able to identify which parameter values are most likely — those that are good candidates for a fit. We could also use the posterior’s variance to quantify how uncertain we are about the true, underlying parameter values.</p>
<p>Bayes’ rule gives us a method for evaluating the posterior — now our goal: We need only evaluate the right side of (\ref{Bayes}). The quantities shown there are</p>
<p><span class="math">\(p(\textbf{x} \vert \textbf{z})\)</span> — This is the probability of seeing <span class="math">\(\textbf{x}\)</span> at fixed parameter values <span class="math">\(\textbf{z}\)</span>. Note that if the model is specified, we can often immediately write this part down. For example, if we have a Normal distribution model, specifying <span class="math">\(\textbf{z}\)</span> means that we have specified the Normal’s mean and variance. Given these, we can say how likely it is to observe any <span class="math">\(\textbf{x}\)</span>.</p>
<p><span class="math">\(p(\textbf{z})\)</span> — the “prior”. This is something we insert by hand before taking any data. We choose its form so that it covers the values we expect are reasonable for the parameters in question.</p>
<p><span class="math">\(p(\textbf{x})\)</span> — the denominator. Notice that this doesn’t depend on <span class="math">\(\textbf{z}\)</span>, and so represents a normalization constant for the posterior.</p>
<p>It turns out that the last term above can sometimes be difficult to evaluate analytically, and so we must often resort to numerical methods for estimating the posterior. Monte Carlo sampling is one of the most common approaches taken for doing this. The idea behind Monte Carlo is to take many samples <span class="math">\(\{\textbf{z}_i\}\)</span> from the posterior (\ref{Bayes}). Once these are obtained, we can approximate population averages by averages over the samples. For example, the true posterior average <span class="math">\(\langle\textbf{z} \rangle \equiv \int \textbf{z} p(\textbf{z} \vert \textbf{x}) d \textbf{z}\)</span> can be approximated by <span class="math">\(\overline{\textbf{z}} \equiv \frac{1}{N}\sum_i \textbf{z}_i\)</span>, the sample average. By the law of large numbers, the sample averages are guaranteed to approach the distribution averages as <span class="math">\(N \to \infty\)</span>. This means that Monte Carlo can always be used to obtain very accurate parameter estimates, provided we take <span class="math">\(N\)</span> sufficiently large — and that we can find a convenient way to sample from the posterior. In this post, we review one simple variant of Monte Carlo that allows for posterior sampling: the Metropolis algorithm.</p>
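<p>The sample-average idea at the heart of Monte Carlo is easy to illustrate in isolation, before worrying about how to sample from a posterior. A minimal numpy sketch (the target distribution and its mean here are hypothetical choices):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 2.0

# Sample averages approximate the population mean, with error
# shrinking roughly like 1 / sqrt(N) as the sample count grows
errors = [abs(rng.normal(loc=true_mean, size=n).mean() - true_mean)
          for n in (100, 10000, 1000000)]
assert errors[-1] < 0.01
```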
<h2 id="metropolis-algorithm">Metropolis Algorithm</h2>
<h3 id="iterative-procedure">Iterative Procedure</h3>
<p>Metropolis is an iterative, try-accept algorithm. We initialize the algorithm by selecting a parameter vector <span class="math">\(\textbf{z}\)</span> at random. Following this, we repeatedly carry out the following two steps to obtain additional posterior samples:</p>
<ol>
<li>Identify a next candidate sample <span class="math">\(\textbf{z}_j\)</span> via some random process. This candidate selection step can be informed by the current sample’s position, <span class="math">\(\textbf{z}_i\)</span>. For example, one could require that the next candidate be selected from those parameter vectors a given step-size distance from the current sample, <span class="math">\(\textbf{z}_j \in \{\textbf{z}_k: \vert \textbf{z}_i - \textbf{z}_k \vert = \delta \}\)</span>. However, while the candidate selected can depend on the current sample, it must not depend on any prior history of the sampling process. Whatever the process chosen (there’s some flexibility here), we write <span class="math">\(t_{i,j}\)</span> for the rate of selecting <span class="math">\(\textbf{z}_j\)</span> as the next candidate given the current sample is <span class="math">\(\textbf{z}_i\)</span>.</li>
<li>Once a candidate is identified, we either accept or reject it via a second random process. If it is accepted, we record it as the next sample and go back to step one, using it to inform the next candidate selection. Otherwise, we record the current sample again, as a repeat sample, and return to the candidate-selection step, as above. Here, we write <span class="math">\(A_{i,j}\)</span> for the rate of accepting <span class="math">\(\textbf{z}_j\)</span>, given that it was selected as the next candidate, starting from <span class="math">\(\textbf{z}_i\)</span>.</li>
</ol>
<h3 id="selecting-the-trial-and-acceptance-rates">Selecting the trial and acceptance rates</h3>
<p><a href="https://efavdb.com/wp-content/uploads/2016/08/Untitled-1.jpg"><img alt="Untitled-1" src="https://efavdb.com/wp-content/uploads/2016/08/Untitled-1.jpg"></a></p>
<p>In order to ensure that our above process selects samples according to the distribution (\ref{Bayes}), we need to appropriately set the <span class="math">\(\{t_{i,j}\}\)</span> and <span class="math">\(\{A_{i,j}\}\)</span> values. To do that, note that at equilibrium one must see the same number of hops from <span class="math">\(\textbf{z}_i\)</span> to <span class="math">\(\textbf{z}_j\)</span> as hops from <span class="math">\(\textbf{z}_j\)</span> to <span class="math">\(\textbf{z}_i\)</span> (if this did not hold, one would see a net shifting of weight from one to the other over time, contradicting the assumption of equilibrium). If <span class="math">\(\rho_i\)</span> is the fraction of samples the process takes from state <span class="math">\(i\)</span>, this condition can be written as
</p>
<div class="math">\begin{align} \label{inter}
\rho_i t_{i,j} A_{i,j} = \rho_j t_{j,i} A_{j,i} \tag{3}
\end{align}</div>
<p>
To select a process that returns the desired sampling weight, we solve (\ref{inter}) for the ratio <span class="math">\(\rho_i / \rho_j\)</span> and then equate this to the ratio required by (\ref{Bayes}). This gives
</p>
<div class="math">\begin{align} \tag{4} \label{cond}
\frac{\rho_i}{\rho_j} = \frac{t_{j,i} A_{j,i}}{t_{i,j} A_{i,j}}
\equiv \frac{p(\textbf{x} \vert \textbf{z}_i)p(\textbf{z}_i)}{p(\textbf{x} \vert \textbf{z}_j)p(\textbf{z}_j)}.
\end{align}</div>
<p>
Now, the single constraint above is not sufficient to pin down all of our degrees of freedom. In the Metropolis case, we choose the following working balance: The trial rates between states are set equal, <span class="math">\(t_{i,j} = t_{j,i}\)</span> (but remain unspecified — left to the discretion of the coder on a case-by-case basis), and we set
</p>
<div class="math">$$ \tag{5}
A_{i,j} = \begin{cases}
1, & \text{if } p(\textbf{z}_j \vert \textbf{x}) > p(\textbf{z}_i \vert \textbf{x}) \\
\frac{p(\textbf{x} \vert \textbf{z}_j)p(\textbf{z}_j)}{p(\textbf{x} \vert \textbf{z}_i)p(\textbf{z}_i)} \equiv \frac{p(\textbf{z}_j \vert \textbf{x})}{p(\textbf{z}_i \vert \textbf{x})}, & \text{else}.
\end{cases}
$$</div>
<p>
This last equation says that we choose to always accept a candidate sample if it is more likely than the current one. However, if the candidate is less likely, we only accept it a fraction of the time, with rate equal to the relative probability ratio of the two states. For example, if the candidate is only 80% as likely as the current sample, we accept it 80% of the time. That’s it for Metropolis — a simple <span class="caps">MCMC</span> algorithm, guaranteed to satisfy (\ref{cond}), and to therefore equilibrate to (\ref{Bayes})! An example follows.</p>
<h3 id="coding-example">Coding example</h3>
<p>The following python snippet illustrates the Metropolis algorithm in action. Here, we take 15 samples from a Normal distribution of variance one and true mean also equal to one. We pretend not to know the mean (but assume we do know the variance), assume a uniform prior for the mean, and then run the algorithm to obtain two hundred thousand samples from the mean’s posterior. <a href="https://efavdb.com/wp-content/uploads/2016/08/result-1.png"><img alt="result" src="https://efavdb.com/wp-content/uploads/2016/08/result-1.png"></a> The histogram at right summarizes the results, obtained by dropping the first 1% of the samples (to protect against bias towards the initialization value). Averaging over the samples returns a mean estimate of <span class="math">\(\mu \approx 1.4 \pm 0.5\)</span> (95% confidence interval), consistent with the true value of <span class="math">\(1\)</span>.</p>
<div class="highlight"><pre><span></span><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="c1"># Take some samples</span>
<span class="n">true_mean</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="n">true_mean</span><span class="p">,</span> <span class="kp">size</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="n">total_samples</span> <span class="o">=</span> <span class="mi">200000</span>
<span class="c1"># Function used to decide move acceptance</span>
<span class="k">def</span> <span class="nf">posterior_numerator</span><span class="p">(</span><span class="n">mu</span><span class="p">):</span>
<span class="kp">prod</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">X</span><span class="p">:</span>
<span class="kp">prod</span> <span class="o">*=</span> <span class="n">np</span><span class="o">.</span><span class="kp">exp</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">mu</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">return</span> <span class="kp">prod</span>
<span class="c1"># Initialize MCMC, then iterate</span>
<span class="n">z1</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">posterior_samples</span> <span class="o">=</span> <span class="p">[</span><span class="n">z1</span><span class="p">]</span>
<span class="k">while</span> <span class="nb">len</span><span class="p">(</span><span class="n">posterior_samples</span><span class="p">)</span> <span class="o"><</span> <span class="n">total_samples</span><span class="p">:</span>
<span class="n">z_current</span> <span class="o">=</span> <span class="n">posterior_samples</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">z_candidate</span> <span class="o">=</span> <span class="n">z_current</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">()</span> <span class="o">-</span> <span class="mf">0.5</span>
<span class="n">rel_prob</span> <span class="o">=</span> <span class="n">posterior_numerator</span><span class="p">(</span>
<span class="n">z_candidate</span><span class="p">)</span> <span class="o">/</span> <span class="n">posterior_numerator</span><span class="p">(</span><span class="n">z_current</span><span class="p">)</span>
<span class="k">if</span> <span class="n">rel_prob</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
<span class="n">posterior_samples</span><span class="o">.</span><span class="kp">append</span><span class="p">(</span><span class="n">z_candidate</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">trial_toss</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">()</span>
<span class="k">if</span> <span class="n">trial_toss</span> <span class="o"><</span> <span class="n">rel_prob</span><span class="p">:</span>
<span class="n">posterior_samples</span><span class="o">.</span><span class="kp">append</span><span class="p">(</span><span class="n">z_candidate</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">posterior_samples</span><span class="o">.</span><span class="kp">append</span><span class="p">(</span><span class="n">z_current</span><span class="p">)</span>
<span class="c1"># Drop some initial samples and thin</span>
<span class="n">thinned_samples</span> <span class="o">=</span> <span class="n">posterior_samples</span><span class="p">[</span><span class="mi">2000</span><span class="p">:]</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">thinned_samples</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">"Histogram of MCMC samples"</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
<h3 id="summary">Summary</h3>
<p>To summarize, we have reviewed the application of <span class="caps">MCMC</span> to Bayesian statistics. <span class="caps">MCMC</span> is a general tool for obtaining samples from a probability distribution. It can be applied whenever one can conveniently specify the relative probability of two states — and so is particularly apt for situations where only the normalization constant of a distribution is difficult to evaluate, precisely the problem with the posterior (\ref{Bayes}). The method entails carrying out an iterative try-accept algorithm, where the rates of trial and acceptance can be adjusted, but must be balanced so that the equilibrium distribution that results approaches the desired form. The key equation enabling us to strike this balance is (\ref{inter}) — the zero flux condition (aka the <em>detailed balance</em> condition to physicists) that holds between states at equilibrium.</p>
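<p>This zero-flux condition is simple enough to verify directly on a toy discrete chain. The numpy sketch below (with hypothetical, unnormalized target weights) checks that symmetric trial rates combined with the Metropolis acceptance rule satisfy (\ref{inter}) for a two-state system:</p>

```python
import numpy as np

# Hypothetical unnormalized target weights for a two-state chain
p = np.array([1.0, 3.0])
rho = p / p.sum()          # equilibrium distribution
t = 0.5                    # symmetric trial rates, t_ij = t_ji

def A(i, j):
    """Metropolis acceptance rate for a hop from state i to state j."""
    return min(1.0, p[j] / p[i])

# Net flux between the two states vanishes at equilibrium
flux_01 = rho[0] * t * A(0, 1)
flux_10 = rho[1] * t * A(1, 0)
assert abs(flux_01 - flux_10) < 1e-12
```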
<script type="text/javascript">
</script>Interpreting the results of linear regression2016-06-29T14:54:00-07:002016-06-29T14:54:00-07:00Cathy Yehtag:efavdb.com,2016-06-29:/interpret-linear-regression<p>Our <a href="http://efavdb.github.io/linear-regression">last post</a> showed how to obtain the least-squares solution for linear regression and discussed the idea of sampling variability in the best estimates for the coefficients. In this post, we continue the discussion about uncertainty in linear regression — both in the estimates of individual linear regression coefficients and the …</p><p>Our <a href="http://efavdb.github.io/linear-regression">last post</a> showed how to obtain the least-squares solution for linear regression and discussed the idea of sampling variability in the best estimates for the coefficients. In this post, we continue the discussion about uncertainty in linear regression — both in the estimates of individual linear regression coefficients and the quality of the overall fit.</p>
<p>Specifically, we’ll discuss how to calculate the 95% confidence intervals and p-values from hypothesis tests that are output by many statistical packages like python’s statsmodels or R. An example with code is provided at the end.</p>
<h2 id="review">Review</h2>
<p>We wish to predict a scalar response variable <span class="math">\(y_i\)</span> given a vector of predictors <span class="math">\(\vec{x}_i\)</span> of dimension <span class="math">\(K\)</span>. In linear regression, we assume that <span class="math">\(y_i\)</span> is a linear function of <span class="math">\(\vec{x}_i\)</span>, parameterized by a set of coefficients <span class="math">\(\vec{\beta}\)</span> and an error term <span class="math">\(\epsilon_i\)</span>. The linear model (in matrix format and dropping the arrows over the vectors) for predicting <span class="math">\(N\)</span> response variables is
</p>
<div class="math">\begin{align}\tag{1}
y = X\beta + \epsilon.
\end{align}</div>
<p>The dimensions of each component are: dim(<span class="math">\(X\)</span>) = (<span class="math">\(N\)</span>,<span class="math">\(K\)</span>), dim(<span class="math">\(\beta\)</span>) = (<span class="math">\(K\)</span>,1), dim(<span class="math">\(y\)</span>) = dim(<span class="math">\(\epsilon\)</span>) = (<span class="math">\(N\)</span>,1), where <span class="math">\(N\)</span> = # of examples, <span class="math">\(K\)</span> = # of regressors / predictors, counting an optional intercept/constant term.</p>
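<p>As a quick sanity check on these dimensions, here is a minimal numpy sketch (the numbers here are illustrative, not from the example later in the post):</p>

```python
import numpy as np

# Shape check for y = X beta + epsilon with N examples and K regressors.
N, K = 20, 3
X = np.random.normal(size=(N, K))        # dim (N, K)
beta = np.ones((K, 1))                   # dim (K, 1)
epsilon = np.random.normal(size=(N, 1))  # dim (N, 1)
y = np.dot(X, beta) + epsilon
print(y.shape)  # (20, 1)
```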
<p>The ordinary least-squares best estimator of the coefficients, <span class="math">\(\hat{\beta}\)</span>, was <a href="http://efavdb.github.io/linear-regression">derived last time</a>:
</p>
<div class="math">\begin{align}\tag{2}\label{optimal}
\hat{\beta} = (X'X)^{-1}X'y,
\end{align}</div>
<p>where the hat “^” denotes an estimator, not a true population parameter.</p>
<p>(\ref{optimal}) is a point estimate, but fitting different samples of data from the population will cause the best estimators to shift around. The amount of shifting can be explained by the variance-covariance matrix of <span class="math">\(\hat{\beta}\)</span>, <a href="http://efavdb.github.io/linear-regression">also derived</a> last time (independent of assumptions of normality):
</p>
<div class="math">\begin{align}\tag{3}\label{cov}
cov(\hat{\beta}, \hat{\beta}) = \sigma^2 (X'X)^{-1}.
\end{align}</div>
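<p>Equation (\ref{cov}) can be checked numerically: refit the model over many simulated samples and compare the empirical covariance of <span class="math">\(\hat{\beta}\)</span> with <span class="math">\(\sigma^2 (X'X)^{-1}\)</span>. A sketch with a toy design matrix (the setup and constants are illustrative):</p>

```python
import numpy as np

# Monte-Carlo check of eq. (3): the sampling covariance of beta_hat
# should match sigma^2 (X'X)^{-1} when errors are iid with variance sigma^2.
rng = np.random.RandomState(0)
N, sigma = 50, 2.0
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # fixed design matrix
beta = np.array([1.0, 3.0])
XtX_inv = np.linalg.inv(np.dot(X.T, X))

fits = []
for _ in range(5000):
    y = np.dot(X, beta) + sigma * rng.normal(size=N)
    fits.append(np.dot(XtX_inv, np.dot(X.T, y)))  # least-squares estimate
empirical = np.cov(np.array(fits).T)
theoretical = sigma**2 * XtX_inv  # eq. (3); agrees up to sampling noise
```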
<h2 id="goodness-of-fit-r2">Goodness of fit - <span class="math">\(R^2\)</span></h2>
<p>To get a better feel for (\ref{cov}), it’s helpful to rewrite it in terms of the coefficient of determination <span class="math">\(R^2\)</span>. <span class="math">\(R^2\)</span> measures how much of the variation in the response variable <span class="math">\(y\)</span> is explained by variation in the regressors <span class="math">\(X\)</span> (as opposed to the unexplained variation from <span class="math">\(\epsilon\)</span>).</p>
<p>The variation in <span class="math">\(y\)</span>, i.e. the “total sum of squares” <span class="math">\(SST\)</span>, can be partitioned into the sum of two terms, “regression sum of squares” and “error sum of squares”: <span class="math">\(SST = SSR + SSE\)</span>.</p>
<p>For convenience, let’s center <span class="math">\(y\)</span> and <span class="math">\(X\)</span> around their means, e.g. <span class="math">\(y \rightarrow y - \bar{y}\)</span> so that the mean <span class="math">\(\bar{y}=0\)</span> for the centered variables. Then,
</p>
<div class="math">\begin{align}\tag{4}\label{SS}
SST &= \sum_{i=1}^N (y_i - \bar{y})^2 = y'y \\
SSR &= \sum_{i=1}^N (\hat{y}_i - \bar{y})^2 = \hat{y}'\hat{y} \\
SSE &= \sum_{i=1}^N (y_i - \hat{y}_i)^2 = e'e,
\end{align}</div>
<p>where <span class="math">\(\hat{y} \equiv X\hat{\beta}\)</span>. Then <span class="math">\(R^2\)</span> is defined as the ratio of the regression sum of squares to the total sum of squares:
</p>
<div class="math">\begin{align}\tag{5}\label{R2}
R^2 \equiv \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
\end{align}</div>
<p><span class="math">\(R^2\)</span> ranges between 0 and 1, with 1 being a perfect fit. According to (\ref{cov}), the variance of a single coefficient <span class="math">\(\hat{\beta}_k\)</span> is proportional to the quantity <span class="math">\((X'X)_{kk}^{-1}\)</span>, where <span class="math">\(k\)</span> denotes the kth diagonal element of <span class="math">\((X'X)^{-1}\)</span>, and can be rewritten as
</p>
<div class="math">\begin{align}\tag{6}\label{cov2}
var(\hat{\beta}_k) &= \sigma^2 (X'X)_{kk}^{-1} \\ &= \frac{\sigma^2}{(1 - R_k^2)\sum_i^N (x_{ik} - \bar{x}_k)^2},
\end{align}</div>
<p>where <span class="math">\(R_k^2\)</span> is the <span class="math">\(R^2\)</span> in the regression of the kth variable, <span class="math">\(x_k\)</span>, against the other predictors <a href="#A1">[A1]</a>.</p>
<p>The key observation from (\ref{cov2}) is that the precision in the estimator decreases if the fit is made over highly correlated regressors, for which <span class="math">\(R_k^2\)</span> approaches 1. This problem of multicollinearity in linear regression will be manifested in our simulated example.</p>
<p>(\ref{cov2}) is also consistent with the observation from our previous post that, all things being equal, the precision in the estimator increases if the fit is made over a direction of greater variance in the data.</p>
<p>In the next section, <span class="math">\(R^2\)</span> will again be useful for interpreting the behavior of one of our test statistics.</p>
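<p>The variance inflation in (\ref{cov2}) is easy to see numerically. The snippet below (a toy setup; the variable names are ours) compares <span class="math">\((X'X)^{-1}_{kk}\)</span> for uncorrelated versus highly correlated regressors:</p>

```python
import numpy as np

# Illustration of eq. (6): the variance factor (X'X)^{-1}_{kk} blows up
# as regressor k becomes collinear with the others (R_k^2 -> 1).
rng = np.random.RandomState(0)
N = 200
x1 = rng.normal(size=N)

def variance_factor(rho):
    """(X'X)^{-1}_{00} for X = [x1, x2] with corr(x1, x2) about rho."""
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=N)
    X = np.column_stack([x1, x2])
    return np.linalg.inv(np.dot(X.T, X))[0, 0]

print(variance_factor(0.0))   # small: regressors nearly orthogonal
print(variance_factor(0.99))  # roughly 1/(1 - 0.99^2), i.e. ~50x larger
```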
<h2 id="calculating-test-statistics">Calculating test statistics</h2>
<p>If we assume that the vector of errors has a multivariate normal distribution, <span class="math">\(\epsilon \sim N(0, \sigma^2I)\)</span>, then we can construct test statistics to characterize the uncertainty in the regression. In this section, we’ll calculate</p>
<p>(a) <strong>confidence intervals</strong> - random intervals around individual estimators <span class="math">\(\hat{\beta}_k\)</span> that, if constructed for regressions over multiple samples, would contain the true population parameter, <span class="math">\(\beta_k\)</span>, a certain fraction, e.g. 95%, of the time.
(b) <strong>p-value</strong> - the probability of events as extreme or more extreme than an observed value (a test statistic) occurring under the null hypothesis. If the p-value is less than a given significance level <span class="math">\(\alpha\)</span> (a common choice is <span class="math">\(\alpha = 0.05\)</span>), then the null hypothesis is rejected, e.g. a regression coefficient is said to be significant.</p>
<p>From the assumption of the distribution of <span class="math">\(\epsilon\)</span>, it follows that <span class="math">\(\hat{\beta}\)</span> has a multivariate normal distribution <a href="#A2">[A2]</a>:
</p>
<div class="math">\begin{align}\tag{7}
\hat{\beta} \sim N(\beta, \sigma^2 (X'X)^{-1}).
\end{align}</div>
<p> To be explicit, a single coefficient, <span class="math">\(\hat{\beta}_k\)</span>, is distributed as
</p>
<div class="math">\begin{align}\tag{8}
\hat{\beta}_k \sim N(\beta_k, \sigma^2 (X'X)_{kk}^{-1}).
\end{align}</div>
<p>This variable can be standardized as a z-score:
</p>
<div class="math">\begin{align}\tag{9}
z_k = \frac{\hat{\beta}_k - \beta_k}{\sqrt{\sigma^2 (X'X)_{kk}^{-1}}} \sim N(0,1)
\end{align}</div>
<p>In practice, we don’t know the population parameter, <span class="math">\(\sigma^2\)</span>, so we can’t use the z-score. Instead, we can construct a pivotal quantity, a t-statistic. The t-statistic for <span class="math">\(\hat{\beta}_k\)</span> follows a t-distribution with n-K degrees of freedom <a href="#ref1">[1]</a>,
</p>
<div class="math">\begin{align}\tag{10}\label{tstat}
t_{\hat{\beta}_k} = \frac{\hat{\beta}_k - \beta_k}{s(\hat{\beta}_k)} \sim t_{n-K},
\end{align}</div>
<p> where <span class="math">\(s(\hat{\beta}_k)\)</span> is the standard error of <span class="math">\(\hat{\beta}_k\)</span>
</p>
<div class="math">\begin{align}\tag{11}
s(\hat{\beta}_k)^2 = \hat{\sigma}^2 (X'X)_{kk}^{-1},
\end{align}</div>
<p> and <span class="math">\(\hat{\sigma}^2\)</span> is the unbiased estimator of <span class="math">\(\sigma^2\)</span>, computed from the residuals <span class="math">\(e = y - X\hat{\beta}\)</span>:
</p>
<div class="math">\begin{align}\tag{12}
\hat{\sigma}^2 = \frac{e'e}{n - K}.
\end{align}</div>
<h3 id="confidence-intervals-around-regression-coefficients">Confidence intervals around regression coefficients</h3>
<p>The <span class="math">\((1-\alpha)\)</span> confidence interval around an estimator, <span class="math">\(\hat{\beta}_k \pm \Delta\)</span>, is defined such that the probability of a random interval containing the true population parameter is <span class="math">\((1-\alpha)\)</span>:
</p>
<div class="math">\begin{align}\tag{13}
P[\hat{\beta}_k - \Delta < \beta_k < \hat{\beta}_k + \Delta ] = 1 - \alpha,
\end{align}</div>
<p> where <span class="math">\(\Delta = t_{1-\alpha/2, n-K} \, s(\hat{\beta}_k)\)</span>, and <span class="math">\(t_{1-\alpha/2, n-K}\)</span> is the critical value of the t-distribution with <span class="math">\(n-K\)</span> degrees of freedom that leaves probability <span class="math">\(\alpha/2\)</span> in the upper tail.</p>
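<p>In code, the interval half-width <span class="math">\(\Delta\)</span> can be computed with scipy’s t-distribution. Plugging in the height coefficient from the example below (<span class="math">\(\hat{\beta}_k = 6.1857\)</span>, <span class="math">\(s(\hat{\beta}_k) = 0.756\)</span>, 18 degrees of freedom) recovers its reported 95% interval:</p>

```python
from scipy.stats import t

def conf_interval(beta_hat_k, se_k, dof, alpha=0.05):
    """(1 - alpha) interval: beta_hat_k +/- t_{1-alpha/2, n-K} * s(beta_hat_k)."""
    delta = t.ppf(1 - alpha / 2, dof) * se_k
    return beta_hat_k - delta, beta_hat_k + delta

lo, hi = conf_interval(6.1857, 0.756, dof=18)
print(lo, hi)  # approximately (4.60, 7.77), matching the summary table below
```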
<h3 id="t-test-for-the-significance-of-a-predictor">t-test for the significance of a predictor</h3>
<p>Directly related to the calculation of confidence intervals is testing whether a regressor, <span class="math">\(\hat{\beta}_k\)</span>, is statistically significant. The t-statistic for the kth regression coefficient under the null hypothesis that <span class="math">\(x_k\)</span> and <span class="math">\(y\)</span> are independent follows a t-distribution with n-K degrees of freedom, c.f. (\ref{tstat}) with <span class="math">\(\beta_k = 0\)</span>:
</p>
<div class="math">\begin{align}\tag{14}
t = \frac{\hat{\beta}_k - 0}{s(\hat{\beta}_k)} \sim t_{n-K}.
\end{align}</div>
<p>We reject the null hypothesis if the p-value of the observed t-statistic is less than <span class="math">\(\alpha\)</span>.</p>
<p>According to (\ref{cov2}), <span class="math">\(s(\hat{\beta}_k)\)</span> increases with multicollinearity. Hence, the estimator must be more “extreme” in order to be statistically significant in the presence of multicollinearity.</p>
<h3 id="f-test-for-the-significance-of-the-regression">F-test for the significance of the regression</h3>
<p>Whereas the t-test considers the significance of a single regressor, the F-test evaluates the significance of the entire regression, where the null hypothesis is that <em>all</em> the population coefficients except the constant are equal to zero: <span class="math">\(\beta_1 = \beta_2 = ... = \beta_{K-1} = 0\)</span>.</p>
<p>The F-statistic under the null hypothesis follows an F-distribution with {K-1, N-K} degrees of freedom <a href="#ref1">[1]</a>:
</p>
<div class="math">\begin{align}\tag{15}\label{F}
F = \frac{SSR/(K-1)}{SSE/(N-K)} \sim F_{K-1, N-K}.
\end{align}</div>
<p>It is useful to rewrite the F-statistic in terms of <span class="math">\(R^2\)</span> by substituting the expressions from (\ref{SS}) and (\ref{R2}):
</p>
<div class="math">\begin{align}\tag{16}\label{F2}
F = \frac{(N-K) R^2}{(K-1) (1-R^2)}
\end{align}</div>
<p>Notice how, for fixed <span class="math">\(R^2\)</span>, the F-statistic decreases with an increasing number of predictors <span class="math">\(K\)</span>. Adding uninformative predictors to the model will decrease the significance of the regression, which motivates parsimony in constructing linear models.</p>
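<p>Equation (\ref{F2}) can be evaluated directly from the reported <span class="math">\(R^2\)</span> values. Using the (rounded) numbers from the two fits in the example below:</p>

```python
def f_statistic(r2, n_obs, k):
    """F-statistic of the overall regression from R^2, per eq. (16)."""
    return (n_obs - k) * r2 / ((k - 1) * (1 - r2))

# weight ~ height: K = 2 (intercept + height), N = 20
print(f_statistic(0.788, 20, 2))  # approximately 66.9, vs 66.87 reported
# weight ~ height + shoe size: K = 3; nearly the same R^2, smaller F
print(f_statistic(0.789, 20, 3))  # approximately 31.8, vs 31.86 reported
```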
<h2 id="example">Example</h2>
<p>With these formulas in hand, let’s consider the problem of predicting the weight of adult women using some simulated data (loosely based on reality). We’ll look at two models:
(1) <strong>weight ~ height</strong>.
As expected, height will be a strong predictor of weight, corroborated by a significant p-value for the coefficient of height in the model.
(2) <strong>weight ~ height + shoe size</strong>.
Height and shoe size are strongly correlated in the simulated data, while height is still a strong predictor of weight. We’ll find that neither of the predictors has a significant individual p-value, a consequence of collinearity.</p>
<p>First, import some libraries. We use <code>statsmodels.api.OLS</code> for the linear regression since it contains a much more detailed report on the results of the fit than <code>sklearn.linear_model.LinearRegression</code>.</p>
<div class="highlight"><pre><span></span>import numpy as np
import statsmodels.api as sm
from scipy.stats import t, f  # f is needed below for the F-test p-value
import random
</pre></div>
<p>Next, set the population parameters for the simulated data.</p>
<div class="highlight"><pre><span></span># height (inches)
mean_height = 65
std_height = 2.25
# shoe size (inches)
mean_shoe_size = 7.5
std_shoe_size = 1.25
# correlation between height and shoe size
r_height_shoe = 0.98  # height and shoe size are highly correlated
# covariance b/w height and shoe size
var_height_shoe = r_height_shoe*std_height*std_shoe_size
# matrix of means, mu, and covariance, cov
mu = (mean_height, mean_shoe_size)
cov = [[np.square(std_height), var_height_shoe],
       [var_height_shoe, np.square(std_shoe_size)]]
</pre></div>
<p>Generate the simulated data:</p>
<div class="highlight"><pre><span></span># number of data points
n = 20
np.random.seed(85)  # seed numpy's RNG (random.seed does not affect np.random)
# height and shoe size
X1 = np.random.multivariate_normal(mu, cov, n)
# height, alone
X0 = X1[:, 0]
weight = -220 + np.random.normal(X0*5.5, 10, n)
</pre></div>
<p>Below, the simulated variables are plotted against each other.
<a href="https://efavdb.com/wp-content/uploads/2016/06/scatter_height_weight_shoesize_cropped.png"><img alt="scatterplots" src="https://efavdb.com/wp-content/uploads/2016/06/scatter_height_weight_shoesize_cropped.png"></a></p>
<p>Fit the linear models:</p>
<div class="highlight"><pre><span></span># add column of ones for intercept
X0 = sm.add_constant(X0)
X1 = sm.add_constant(X1)
# "OLS" stands for Ordinary Least Squares
sm0 = sm.OLS(weight, X0).fit()
sm1 = sm.OLS(weight, X1).fit()
</pre></div>
<p>Look at the summary report, <code>sm0.summary()</code>, for the weight ~ height model.</p>
<div class="highlight"><pre><span></span>                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.788
Model:                            OLS   Adj. R-squared:                  0.776
Method:                 Least Squares   F-statistic:                     66.87
Date:                Wed, 29 Jun 2016   Prob (F-statistic):           1.79e-07
Time:                        14:28:08   Log-Likelihood:                -70.020
No. Observations:                  20   AIC:                             144.0
Df Residuals:                      18   BIC:                             146.0
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       -265.2764     49.801     -5.327      0.000      -369.905  -160.648
x1             6.1857      0.756      8.178      0.000         4.596     7.775
==============================================================================
Omnibus:                        0.006   Durbin-Watson:                   2.351
Prob(Omnibus):                  0.997   Jarque-Bera (JB):                0.126
Skew:                           0.002   Prob(JB):                        0.939
Kurtosis:                       2.610   Cond. No.                     1.73e+03
==============================================================================
</pre></div>
<p>The height variable, <code>x1</code>, is significant according to the t-test, as is the intercept, denoted <code>const</code> in the report. Also, notice that the coefficient used to simulate the dependence of weight on height (<span class="math">\(\beta_1\)</span> = 5.5) is contained in the 95% confidence interval of <code>x1</code>.</p>
<p>Next, let’s look at the summary report, <code>sm1.summary()</code>, for the weight ~ height + shoe_size model.</p>
<div class="highlight"><pre><span></span>                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.789
Model:                            OLS   Adj. R-squared:                  0.765
Method:                 Least Squares   F-statistic:                     31.86
Date:                Wed, 29 Jun 2016   Prob (F-statistic):           1.78e-06
Time:                        14:28:08   Log-Likelihood:                -69.951
No. Observations:                  20   AIC:                             145.9
Df Residuals:                      17   BIC:                             148.9
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       -333.1599    204.601     -1.628      0.122      -764.829    98.510
x1             7.4944      3.898      1.923      0.071        -0.729    15.718
x2            -2.3090      6.739     -0.343      0.736       -16.527    11.909
==============================================================================
Omnibus:                        0.015   Durbin-Watson:                   2.342
Prob(Omnibus):                  0.993   Jarque-Bera (JB):                0.147
Skew:                           0.049   Prob(JB):                        0.929
Kurtosis:                       2.592   Cond. No.                     7.00e+03
==============================================================================
</pre></div>
<p>Neither of the regressors <code>x1</code> and <code>x2</code> is significant at a significance level of <span class="math">\(\alpha=0.05\)</span>. In the simulated data, adult female weight has a positive linear correlation with height and shoe size, but the strong collinearity of the predictors (simulated with a correlation coefficient of 0.98) causes each variable to fail a t-test in the model — and even results in the wrong sign for the dependence on shoe size.</p>
<p>Although the predictors fail individual t-tests, the overall regression <em>is</em> significant, i.e. the predictors are jointly informative, according to the F-test.</p>
<p>Notice, however, that the F-statistic has decreased (and its p-value correspondingly increased) compared to the simple linear model, as expected from (\ref{F2}), since including the extra variable, shoe size, did not improve <span class="math">\(R^2\)</span> but did increase <span class="math">\(K\)</span>.</p>
<p>Let’s manually calculate the standard error, t-statistics, F-statistic, corresponding p-values, and confidence intervals using the equations from above.</p>
<div class="highlight"><pre><span></span># OLS solution, eqn of form ax=b => (X'X)*beta_hat = X'*y
beta_hat = np.linalg.solve(np.dot(X1.T, X1), np.dot(X1.T, weight))
# residuals
epsilon = weight - np.dot(X1, beta_hat)
# degrees of freedom of residuals
dof = X1.shape[0] - X1.shape[1]
# best estimator of sigma
sigma_hat = np.sqrt(np.dot(epsilon, epsilon) / dof)
# standard error of beta_hat
s = sigma_hat * np.sqrt(np.diag(np.linalg.inv(np.dot(X1.T, X1)), 0))
# 95% confidence intervals
# +/-t_{1-alpha/2, n-K} = t.interval(1-alpha, dof)
conf_intervals = beta_hat.reshape(3, 1) + s.reshape(3, 1) * np.array(t.interval(0.95, dof))
# t-statistics under null hypothesis
t_stat = beta_hat / s
# p-values (survival function sf = 1 - CDF)
p_values = t.sf(abs(t_stat), dof)*2
# SSR (regression sum of squares), divided by its degrees of freedom
y_hat = np.dot(X1, beta_hat)
y_mu = np.mean(weight)
mean_SSR = np.dot((y_hat - y_mu).T, (y_hat - y_mu))/(len(beta_hat)-1)
# f-statistic
f_stat = mean_SSR / np.square(sigma_hat)
print('f-statistic:', f_stat, '\n')
# p-value of f-statistic
from scipy.stats import f  # ensure the F distribution is available
p_values_f_stat = f.sf(abs(f_stat), dfn=(len(beta_hat)-1), dfd=dof)
print('p-value of f-statistic:', p_values_f_stat, '\n')
</pre></div>
<p>The values printed by these manual calculations, shown below, are consistent with the summary report:</p>
<div class="highlight"><pre><span></span><span class="n">beta_hat</span><span class="p">:</span> <span class="p">[</span><span class="o">-</span><span class="mi">333</span><span class="p">.</span><span class="mi">15990097</span> <span class="mi">7</span><span class="p">.</span><span class="mi">49444671</span> <span class="o">-</span><span class="mi">2</span><span class="p">.</span><span class="mi">30898743</span><span class="p">]</span>
<span class="n">degrees</span> <span class="k">of</span> <span class="n">freedom</span> <span class="k">of</span> <span class="n">residuals</span><span class="p">:</span> <span class="mi">17</span>
<span class="n">sigma_hat</span><span class="p">:</span> <span class="mi">8</span><span class="p">.</span><span class="mi">66991550428</span>
<span class="n">standard</span> <span class="n">error</span> <span class="k">of</span> <span class="n">beta_hat</span><span class="p">:</span> <span class="p">[</span> <span class="mi">204</span><span class="p">.</span><span class="mi">60056111</span> <span class="mi">3</span><span class="p">.</span><span class="mi">89776076</span> <span class="mi">6</span><span class="p">.</span><span class="mi">73900599</span><span class="p">]</span>
<span class="n">confidence</span> <span class="n">intervals</span><span class="p">:</span>
<span class="p">[[</span> <span class="o">-</span><span class="mi">7</span><span class="p">.</span><span class="mi">64829352</span><span class="n">e</span><span class="o">+</span><span class="mi">02</span> <span class="mi">9</span><span class="p">.</span><span class="mi">85095501</span><span class="n">e</span><span class="o">+</span><span class="mi">01</span><span class="p">]</span>
<span class="p">[</span> <span class="o">-</span><span class="mi">7</span><span class="p">.</span><span class="mi">29109662</span><span class="n">e</span><span class="o">-</span><span class="mi">01</span> <span class="mi">1</span><span class="p">.</span><span class="mi">57180031</span><span class="n">e</span><span class="o">+</span><span class="mi">01</span><span class="p">]</span>
<span class="p">[</span> <span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">65270473</span><span class="n">e</span><span class="o">+</span><span class="mi">01</span> <span class="mi">1</span><span class="p">.</span><span class="mi">19090724</span><span class="n">e</span><span class="o">+</span><span class="mi">01</span><span class="p">]]</span>
<span class="n">t</span><span class="o">-</span><span class="k">statistics</span><span class="p">:</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">.</span><span class="mi">62834305</span> <span class="mi">1</span><span class="p">.</span><span class="mi">92275698</span> <span class="o">-</span><span class="mi">0</span><span class="p">.</span><span class="mi">34263027</span><span class="p">]</span>
<span class="n">p</span><span class="o">-</span><span class="k">values</span> <span class="k">of</span> <span class="n">t</span><span class="o">-</span><span class="k">statistics</span><span class="p">:</span> <span class="p">[</span> <span class="mi">0</span><span class="p">.</span><span class="mi">1218417</span> <span class="mi">0</span><span class="p">.</span><span class="mi">07142839</span> <span class="mi">0</span><span class="p">.</span><span class="mi">73607656</span><span class="p">]</span>
<span class="n">f</span><span class="o">-</span><span class="n">statistic</span><span class="p">:</span> <span class="mi">31</span><span class="p">.</span><span class="mi">8556171105</span>
<span class="n">p</span><span class="o">-</span><span class="n">value</span> <span class="k">of</span> <span class="n">f</span><span class="o">-</span><span class="n">statistic</span><span class="p">:</span> <span class="mi">1</span><span class="p">.</span><span class="mi">77777555162</span><span class="n">e</span><span class="o">-</span><span class="mi">06</span>
</pre></div>
<p>The full code is available as an <a href="https://github.com/EFavDB/linear-regression">IPython notebook on github</a>.</p>
<h2 id="summary">Summary</h2>
<p>Assuming a multivariate normal distribution for the residuals in linear regression allows us to construct test statistics and therefore specify uncertainty in our fits.</p>
<p>A t-test judges the explanatory power of a predictor in isolation, although the standard error that appears in the calculation of the t-statistic is a function of the other predictors in the model. On the other hand, an F-test is a global test that judges the explanatory power of all the predictors together, and we’ve seen that parsimony in choosing predictors can improve the quality of the overall regression.</p>
<p>We’ve also seen that multicollinearity can throw off the results of individual t-tests as well as obscure the interpretation of the signs of the fitted coefficients. A symptom of multicollinearity is when none of the individual coefficients are significant but the overall F-test is significant.</p>
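<p>To make the standard-error inflation concrete, here is a minimal numerical sketch (the data and names are invented for illustration, not taken from the analysis above). The inflation follows directly from the diagonal of <span class="math">\((X'X)^{-1}\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 100, 1.0

# independent design: two uncorrelated predictors
x1 = rng.normal(size=N)
x2_indep = rng.normal(size=N)
# collinear design: x2 is nearly a copy of x1
x2_coll = x1 + 0.05 * rng.normal(size=N)

def coef_standard_errors(X, sigma):
    """Standard errors sigma * sqrt(diag((X'X)^{-1}))."""
    return sigma * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))

se_indep = coef_standard_errors(np.column_stack([x1, x2_indep]), sigma)
se_coll = coef_standard_errors(np.column_stack([x1, x2_coll]), sigma)

# collinearity sharply inflates the individual coefficient standard errors,
# which is what washes out the individual t-tests
assert np.all(se_coll > 2 * se_indep)
```

<p>The near-copy predictor inflates both standard errors by a factor of roughly <span class="math">\(1/\sqrt{1 - R_k^2}\)</span>, exactly the mechanism quantified in the appendix below.</p>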
<h3 id="reference">Reference</h3>
<p>[1] Greene, W., Econometric Analysis, Seventh edition, Prentice Hall, 2011 - <a href="http://people.stern.nyu.edu/wgreene/MathStat/Outline.htm">chapters available online</a></p>
<h3 id="appendix">Appendix</h3>
<p>[A1]
We specifically want the kth diagonal element from the inverse moment matrix, <span class="math">\((X'X)^{-1}\)</span>. The matrix <span class="math">\(X\)</span> can be <a href="https://en.wikipedia.org/wiki/Block_matrix">partitioned</a> as </p>
<div class="math">$$[X_{(k)} \vec{x}_k],$$</div>
<p> where <span class="math">\(\vec{x}_k\)</span> is an <span class="math">\(N \times 1\)</span> column vector containing the kth variable of each of the <span class="math">\(N\)</span> samples, and <span class="math">\(X_{(k)}\)</span> is the <span class="math">\(N \times (K-1)\)</span> matrix containing the rest of the variables and the constant intercept. For convenience, let <span class="math">\(X_{(k)}\)</span> and <span class="math">\(\vec{x}_k\)</span> be centered about their (column-wise) means.</p>
<p>Matrix multiplication of the block-partitioned form of <span class="math">\(X\)</span> with its transpose results in the following block matrix:
</p>
<div class="math">\begin{align}
(X'X) =
\begin{bmatrix}
X_{(k)}'X_{(k)} & X_{(k)}'\vec{x}_k \\
\vec{x}_k'X_{(k)} & \vec{x}_k'\vec{x}_k
\end{bmatrix}
\end{align}</div>
<p>The above matrix has four blocks, and <a href="https://en.wikipedia.org/wiki/Block_matrix#Block_matrix_inversion">can be inverted blockwise</a> to obtain another matrix with four blocks. The lower right block corresponding to the kth diagonal element of the inverted matrix is a scalar:
</p>
<div class="math">\begin{align}
(X'X)^{-1}_{kk} &= [\vec{x}_k'\vec{x}_k - \vec{x}_k'X_{(k)}(X_{(k)}'X_{(k)})^{-1}X_{(k)}'\vec{x}_k]^{-1} \\
&= \left[\vec{x}_k'\vec{x}_k \left( 1 - \frac{\vec{x}_k'X_{(k)}(X_{(k)}'X_{(k)})^{-1}X_{(k)}'\vec{x}_k}{\vec{x}_k'\vec{x}_k} \right)\right]^{-1}
\end{align}</div>
<p>Then the numerator of the fraction in the parentheses above can be simplified:
</p>
<div class="math">\begin{align}
\vec{x}_k'X_{(k)} ((X_{(k)}'X_{(k)})^{-1}X_{(k)}'\vec{x}_k) &= \vec{x}_k' X_{(k)} \hat{\beta}_{(k)} \\
&= (X_{(k)}\hat{\beta}_{(k)} + \epsilon_k)'X_{(k)}\hat{\beta}_{(k)} \\
&= \hat{x}_k'\hat{x}_k,
\end{align}</div>
<p>where <span class="math">\(\hat{\beta}_{(k)}\)</span> is the <span class="caps">OLS</span> solution for the coefficients in the regression of <span class="math">\(\vec{x}_k\)</span> on the remaining variables <span class="math">\(X_{(k)}\)</span>: <span class="math">\(\vec{x}_k = X_{(k)} \beta_{(k)} + \epsilon_k\)</span>. In the last line, we used one of the constraints on the residuals — that the residuals and predictors are uncorrelated, <span class="math">\(\epsilon_k'X_{(k)} = 0\)</span>. Plugging in this simplification for the numerator and using the definition of <span class="math">\(R^2\)</span> from (\ref{R2}), we obtain our final result:
</p>
<div class="math">\begin{align}
(X'X)^{-1}_{kk} &= \left[\vec{x}_k'\vec{x}_k \left( 1 - \frac{\hat{x}_k'\hat{x}_k}{\vec{x}_k'\vec{x}_k} \right)\right]^{-1} \\
&= \left[\vec{x}_k'\vec{x}_k ( 1 - R_k^2 )\right]^{-1}
\end{align}</div>
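<p>This identity is easy to verify numerically. The following sketch (synthetic, centered data; not part of the original derivation) checks each diagonal element of <span class="math">\((X'X)^{-1}\)</span> against the <span class="math">\(R_k^2\)</span> form:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 3
X = rng.normal(size=(N, K))
X = X - X.mean(axis=0)  # center columns, as assumed in [A1]

XtX_inv = np.linalg.inv(X.T @ X)

for k in range(K):
    xk = X[:, k]
    X_rest = np.delete(X, k, axis=1)       # X_{(k)}: all columns but the kth
    beta_k, *_ = np.linalg.lstsq(X_rest, xk, rcond=None)
    xk_hat = X_rest @ beta_k               # fitted values from regressing x_k on the rest
    R2_k = (xk_hat @ xk_hat) / (xk @ xk)   # R^2 of that regression (centered data)
    # diagonal of the inverse moment matrix equals [x_k'x_k (1 - R_k^2)]^{-1}
    assert np.isclose(XtX_inv[k, k], 1.0 / ((xk @ xk) * (1.0 - R2_k)))
```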
<p>[A2]
</p>
<div class="math">\begin{align}
\hat{\beta} &= (X'X)^{-1}X'y \\
&= (X'X)^{-1}X'(X\beta + \epsilon) \\
&= \beta + (X'X)^{-1}X'N(0, \sigma^2I) \\
& \sim N(\beta, \sigma^2 (X'X)^{-1})
\end{align}</div>
<p> The last line is by properties of <a href="https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Affine_transformation">affine transformations on multivariate normal distributions</a>.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Linear Regression2016-05-29T11:27:00-07:002016-05-29T11:27:00-07:00Jonathan Landytag:efavdb.com,2016-05-29:/linear-regression<p>We review classical linear regression using vector-matrix notation. In particular, we derive a) the least-squares solution, b) the fit’s coefficient covariance matrix — showing that the coefficient estimates are most precise along directions that have been sampled over a large range of values (the high variance directions, a la <span class="caps">PCA</span>), and c) an unbiased estimate for the underlying sample variance (assuming normal sample variance in this last case). We then review how these last two results can be used to provide confidence intervals / hypothesis tests for the coefficient estimates. Finally, we show that similar results follow from a Bayesian approach.</p>
<p>Last edited July 23, 2016.</p>
<h3 id="introduction">Introduction</h3>
<p>Here, we consider the problem of fitting a linear curve to <span class="math">\(N\)</span> data points of the form <span class="math">\((\vec{x}_i, y_i),\)</span> where the <span class="math">\(\{\vec{x}_i\}\)</span> are column vectors of predictors that sit in an <span class="math">\(L\)</span>-dimensional space and the <span class="math">\(\{y_i\}\)</span> are the response values we wish to predict given the <span class="math">\(\{x_i\}\)</span>. The linear approximation will be defined by a set of coefficients, <span class="math">\(\{\beta_j\}\)</span> so that
</p>
<div class="math">\begin{align}
\hat{y}_i \equiv \sum_j x_{i,j} \beta_j = \vec{x}_i^T \cdot \vec{\beta} . \tag{1} \label{1}
\end{align}</div>
<p>
We seek the <span class="math">\(\vec{\beta}\)</span> that minimizes the average squared <span class="math">\(y\)</span> error,
</p>
<div class="math">\begin{align} \tag{2} \label{2}
J = \sum_i \left ( y_i - \hat{y}_i \right)^2 = \sum_i \left (y_i - \vec{x}_i^T \cdot \vec{\beta} \right)^2.
\end{align}</div>
<p>
It turns out that this is a problem where one can easily derive an analytic expression for the optimal solution. It’s also possible to derive an expression for the variance in the optimal solution — that is, how much we might expect the optimal parameter estimates to change were we to start with some other <span class="math">\(N\)</span> data points instead. These estimates can then be used to generate confidence intervals for the coefficient estimates. Here, we review these results, give a simple interpretation to the theoretical variance, and finally show that the same results follow from a Bayesian approach.</p>
<h3 id="optimal-solution">Optimal solution</h3>
<p>We seek the coefficient vector that minimizes (\ref{2}). We can find this by differentiating this cost function with respect to <span class="math">\(\vec{\beta}\)</span>, setting the result to zero. This gives,
</p>
<div class="math">\begin{align} \tag{3}
\partial_{\beta_j} J = -2 \sum_i \left (y_i - \sum_k x_{i,k} \beta_k \right) x_{i,j} = 0.
\end{align}</div>
<p>
We next define the design matrix <span class="math">\(X\)</span> so that <span class="math">\(X_{i,j} = x_{i,j}\)</span>, the <span class="math">\(j\)</span>-th component of <span class="math">\(\vec{x}_i\)</span>. Plugging this into the above, we obtain
</p>
<div class="math">\begin{align}
\partial_{\beta_j} J &= -2 \sum_i X_{j,i}^T \left (y_i - \sum_k X_{i,k} \beta_k \right) = 0 \\
\Rightarrow \quad & X^T \cdot \left ( \vec{y} - X \cdot \vec{\beta}\right ) = 0.\tag{4}
\end{align}</div>
<p>
Rearranging gives
</p>
<div class="math">\begin{align}
X^T X \cdot \vec{\beta} = X^T \cdot \vec{y} \to
\vec{\beta} = (X^T X)^{-1} \cdot X^T \cdot \vec{y} \tag{5} \label{optimal}
\end{align}</div>
<p>
This is the squared-error-minimizing solution.</p>
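<p>As a sanity check, the closed form (\ref{optimal}) can be compared against numpy’s built-in least-squares routine. A sketch with made-up data:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 100, 3
X = rng.normal(size=(N, L))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# solve the normal equations X^T X beta = X^T y of eq. (5);
# np.linalg.solve avoids forming the explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# cross-check against numpy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

<p>Solving the normal equations directly, rather than inverting <span class="math">\(X^T X\)</span>, is both cheaper and numerically safer; the two routes agree to machine precision on well-conditioned data.</p>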
<h3 id="parameter-covariance-matrix">Parameter covariance matrix</h3>
<p>Now, when one carries out a linear fit to some data, the best line often does not go straight through all of the data. Here, we consider the case where the reason for the discrepancy is not that the posited linear form is incorrect, but that there are some hidden variables not measured that the <span class="math">\(y\)</span>-values also depend on. Assuming our data points represent random samples over these hidden variables, we can model their effect as adding a random noise term to the form (\ref{1}), so that
</p>
<div class="math">\begin{align}\tag{6} \label{noise}
y_i = \vec{x}_i^T \cdot \vec{\beta}_{true} + \epsilon_i,
\end{align}</div>
<p>
with <span class="math">\(\langle \epsilon_i \rangle =0\)</span>, <span class="math">\(\langle \epsilon_i^2 \rangle = \sigma^2\)</span>, and <span class="math">\(\vec{\beta}_{true}\)</span> the exact (but unknown) coefficient vector.</p>
<p>Plugging (\ref{noise}) into (\ref{optimal}), we see that <span class="math">\(\langle \vec{\beta} \rangle = \vec{\beta}_{true}\)</span>. However, the variance of the <span class="math">\(\epsilon_i\)</span> injects some uncertainty into our fit: Each realization of the noise will generate slightly different <span class="math">\(y\)</span> values, causing the <span class="math">\(\vec{\beta}\)</span> fit coefficients to vary. To estimate the magnitude of this effect, we can calculate the covariance matrix of <span class="math">\(\vec{\beta}\)</span>. At fixed (constant) <span class="math">\(X\)</span>, plugging in (\ref{optimal}) for <span class="math">\(\vec{\beta}\)</span> gives
</p>
<div class="math">\begin{align}
cov(\vec{\beta}, \vec{\beta}) &= cov \left( (X^T X)^{-1} \cdot X^T \cdot \vec{y} , \vec{y}^T \cdot X \cdot (X^T X)^{-1, T} \right) \\
&= (X^T X)^{-1} \cdot X^T \cdot cov(\vec{y}^T, \vec{y} ) \cdot X \cdot (X^T X)^{-1, T}
\\
&= \sigma^2 \left( X^T X \right)^{-1} \cdot X^T X \cdot \left( X^T X \right)^{-1, T} \\
&= \sigma^2 \left( X^T X \right)^{-1}. \tag{7} \label{cov}
\end{align}</div>
<p>
In the third line here, note that we have assumed that the <span class="math">\(\epsilon_i\)</span> are independent, so that <span class="math">\(cov(\vec{y},\vec{y}) = \sigma^2 I.\)</span> We’ve also used the fact that <span class="math">\(X^T X\)</span> is symmetric.</p>
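<p>The covariance result (\ref{cov}) can be checked by Monte Carlo: hold <span class="math">\(X\)</span> fixed, refit over many noise realizations, and compare the empirical coefficient covariance to <span class="math">\(\sigma^2 (X^T X)^{-1}\)</span>. A sketch with arbitrary made-up numbers:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 30, 0.5
X = rng.normal(size=(N, 2))          # design held fixed across noise realizations
beta_true = np.array([1.0, -2.0])
cov_theory = sigma**2 * np.linalg.inv(X.T @ X)

betas = []
for _ in range(20000):
    y = X @ beta_true + sigma * rng.normal(size=N)
    betas.append(np.linalg.solve(X.T @ X, X.T @ y))
cov_mc = np.cov(np.array(betas).T)   # empirical covariance of the fitted coefficients

# empirical and theoretical covariances agree up to Monte Carlo error
assert np.allclose(cov_mc, cov_theory, atol=0.05 * cov_theory.max())
```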
<p>To get a feel for the significance of (\ref{cov}), it is helpful to consider the case where the average <span class="math">\(x\)</span> values are zero. In this case,
</p>
<div class="math">\begin{align}
\left( X^T X \right)_{i,j} \equiv \sum_k \delta X_{k,i} \delta X_{k,j} \equiv N \times \langle x_i, x_j\rangle. \tag{8} \label{corr_mat}
\end{align}</div>
<p>
<a href="https://efavdb.com/wp-content/uploads/2016/05/scatter.jpg"><img alt="scatter of sampled predictor values" src="https://efavdb.com/wp-content/uploads/2016/05/scatter.jpg"></a> That is, <span class="math">\(X^T X\)</span> is proportional to the correlation matrix of our <span class="math">\(x\)</span> values. This correlation matrix is real and symmetric, and thus has an orthonormal set of eigenvectors. The eigenvalue corresponding to the <span class="math">\(k\)</span>-th eigenvector gives the variance of our data set’s <span class="math">\(k\)</span>-th component values in this basis — details can be found in our <a href="http://efavdb.github.io/principal-component-analysis">article on <span class="caps">PCA</span></a>. This implies a simple interpretation of (\ref{cov}): The variance in the <span class="math">\(\vec{\beta}\)</span> coefficients will be lowest for predictors parallel to the highest variance <span class="caps">PCA</span> components (e.g., <span class="math">\(x_1\)</span> in the figure shown) and highest for predictors parallel to the lowest variance <span class="caps">PCA</span> components (<span class="math">\(x_2\)</span> in the figure). This observation can often be exploited during an experiment’s design: If a particular coefficient is desired to high accuracy, one should make sure to sample the corresponding predictor over a wide range.
<p>[Note: Cathy gives an interesting, alternative interpretation for the parameter estimate variances in a follow-up post, <a href="http://efavdb.github.io/interpret-linear-regression">here</a>.]</p>
<h3 id="unbiased-estimator-for-sigma2">Unbiased estimator for <span class="math">\(\sigma^2\)</span></h3>
<p>The result (\ref{cov}) gives an expression for the variance of the parameter coefficients in terms of the underlying sample variance <span class="math">\(\sigma^2\)</span>. In practice, <span class="math">\(\sigma^2\)</span> is often not provided and must be estimated from the observations at hand. Assuming that the <span class="math">\(\{\epsilon_i\}\)</span> in (\ref{noise}) are independent <span class="math">\(\mathcal{N}(0, \sigma^2)\)</span> random variables, we now show that the following provides an unbiased estimate for this variance:
</p>
<div class="math">$$
S^2 \equiv \frac{1}{N-L} \sum_i \left ( y_i - \vec{x}_i^T \cdot \vec{\beta} \right) ^2. \tag{9} \label{S}
$$</div>
<p>
Note that this is a normalized sum of squared residuals from our fit, with <span class="math">\((N-L)\)</span> as the normalization constant — the number of samples minus the number of fit parameters. To prove that <span class="math">\(\langle S^2 \rangle = \sigma^2\)</span>, we plug in (\ref{optimal}) for <span class="math">\(\vec{\beta}\)</span>, combining with (\ref{noise}) for <span class="math">\(\vec{y}\)</span>. This gives
</p>
<div class="math">\begin{align} \nonumber
S^2 &= \frac{1}{N-L} \sum_i \left ( y_i - \vec{x}_i^T \cdot (X^T X)^{-1} \cdot X^T \cdot \{ X \cdot \vec{\beta}_{true} + \vec{\epsilon} \} \right) ^2 \\ \nonumber
&= \frac{1}{N-L} \sum_i \left ( \{y_i - \vec{x}_i^T \cdot\vec{\beta}_{true} \} - \vec{x}_i^T \cdot (X^T X)^{-1} \cdot X^T \cdot \vec{\epsilon} \right) ^2 \\
&= \frac{1}{N-L} \sum_i \left ( \epsilon_i - \vec{x}_i^T \cdot (X^T X)^{-1} \cdot X^T \cdot \vec{\epsilon} \right) ^2 \tag{10}. \label{S2}
\end{align}</div>
<p>
The second term in the last line is the <span class="math">\(i\)</span>-th component of the vector
</p>
<div class="math">$$
X \cdot (X^T X)^{-1} \cdot X^T \cdot \vec{\epsilon} \equiv \mathbb{P} \cdot \vec{\epsilon}. \tag{11} \label{projection}
$$</div>
<p>
Here, <span class="math">\(\mathbb{P}\)</span> is a projection operator — this follows from the fact that <span class="math">\(\mathbb{P}^2 = \mathbb{P}\)</span>. When it appears in (\ref{projection}), <span class="math">\(\mathbb{P}\)</span> first maps <span class="math">\(\vec{\epsilon}\)</span> to its components along the <span class="math">\(L\)</span> columns of <span class="math">\(X\)</span>, rescales the result using (\ref{corr_mat}), then maps it back into the original <span class="math">\(N\)</span>-dimensional space. The net effect is to project <span class="math">\(\vec{\epsilon}\)</span> into an <span class="math">\(L\)</span>-dimensional subspace of the full <span class="math">\(N\)</span>-dimensional space (more on the <span class="math">\(L\)</span>-dimensional subspace just below). Plugging (\ref{projection}) into (\ref{S2}), we obtain
</p>
<div class="math">$$
S^2 = \frac{1}{N-L} \sum_i \left ( \epsilon_i - (\mathbb{P} \cdot \vec{\epsilon})_i \right)^2 \equiv \frac{1}{N-L} \left \vert \vec{\epsilon} - \mathbb{P} \cdot \vec{\epsilon} \right \vert^2. \label{S3} \tag{12}
$$</div>
<p>
This final form gives the result: <span class="math">\(\vec{\epsilon}\)</span> is an <span class="math">\(N\)</span>-dimensional vector of independent, <span class="math">\(\mathcal{N}(0, \sigma^2)\)</span> variables, and (\ref{S3}) shows that <span class="math">\(S^2\)</span> is equal to <span class="math">\(1/(N-L)\)</span> times the squared length of an <span class="math">\((N-L)\)</span>-dimensional projection of it (the part along <span class="math">\(\mathbb{I} - \mathbb{P}\)</span>). The squared length of this projection will on average be <span class="math">\((N-L) \sigma^2\)</span>, so that <span class="math">\(\langle S^2 \rangle = \sigma^2\)</span>.</p>
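<p>Both the projection properties and the unbiasedness of <span class="math">\(S^2\)</span> are easy to confirm numerically. In this sketch (arbitrary dimensions, synthetic noise), <span class="math">\(\mathbb{P}\)</span> is built as in (\ref{projection}):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, sigma = 25, 4, 1.5
X = rng.normal(size=(N, L))
P = X @ np.linalg.solve(X.T @ X, X.T)      # the projection operator of eq. (11)

assert np.allclose(P @ P, P)                       # idempotent: P^2 = P
assert np.isclose(np.trace(np.eye(N) - P), N - L)  # I - P projects onto N - L dimensions

# Monte Carlo: the average of S^2 should come out to sigma^2
draws = []
for _ in range(20000):
    eps = sigma * rng.normal(size=N)
    resid = eps - P @ eps                  # the residual vector (I - P) eps
    draws.append(resid @ resid / (N - L))  # S^2 as in eq. (12)
assert abs(np.mean(draws) - sigma**2) < 0.05
```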
<p>We need to make two final points before moving on. First, note that <span class="math">\((N-L) S^2\)</span> is the squared length of this projection, which (in a suitably rotated basis) is a sum of squares of <span class="math">\((N-L)\)</span> independent <span class="math">\(\mathcal{N}(0, \sigma^2)\)</span> variables. It follows that
</p>
<div class="math">$$
\frac{(N-L) S^2}{\sigma^2} \sim \chi_{N-L}^2. \tag{13} \label{chi2}
$$</div>
<p>
Second, <span class="math">\(S^2\)</span> is independent of <span class="math">\(\vec{\beta}\)</span>: We can see this by rearranging (\ref{optimal}) as
</p>
<div class="math">$$
\vec{\beta} = \vec{\beta}_{true} + (X^T X)^{-1} \cdot X^T \cdot \vec{\epsilon}. \tag{14} \label{beta3}
$$</div>
<p>
We can left multiply this by <span class="math">\(X\)</span> without loss to obtain
</p>
<div class="math">$$
X \cdot \vec{\beta} = X \cdot \vec{\beta}_{true} + \mathbb{P} \cdot \vec{\epsilon}, \tag{15} \label{beta2}
$$</div>
<p>
where we have used (\ref{projection}). Comparing (\ref{beta2}) and (\ref{S3}), we see that the components of <span class="math">\(\vec{\epsilon}\)</span> that inform <span class="math">\(\vec{\beta}\)</span> are in the subspace fixed by <span class="math">\(\mathbb{P}\)</span>. This is the space complementary to that informing <span class="math">\(S^2\)</span>, implying that <span class="math">\(S^2\)</span> is independent of <span class="math">\(\vec{\beta}\)</span>.</p>
<h3 id="confidence-intervals-and-hypothesis-tests">Confidence intervals and hypothesis tests</h3>
<p>The results above immediately provide us with a method for generating confidence intervals for the individual coefficient estimates (continuing with our Normal error assumption): From (\ref{beta3}), it follows that the coefficients are themselves Normal random variables, with variance given by (\ref{cov}). Further, <span class="math">\(S^2\)</span> provides an unbiased estimate for <span class="math">\(\sigma^2\)</span>, proportional to a <span class="math">\(\chi^2_{N-L}\)</span> random variable. Combining these results gives
</p>
<div class="math">$$
\frac{\beta_{i,true} - \beta_{i}}{\sqrt{\left(X^T X\right)^{-1}_{ii} S^2}} \sim t_{(N-L)}. \tag{16}
$$</div>
<p>
That is, the pivot at left follows a Student’s <span class="math">\(t\)</span>-distribution with <span class="math">\((N-L)\)</span> degrees of freedom (i.e., it’s the ratio of a standard Normal to the square root of an independent chi-squared variable divided by its degrees of freedom). A rearrangement of the above gives the following level <span class="math">\(\alpha\)</span> confidence interval for the true value:
</p>
<div class="math">$$
\beta_i - t_{(N-L), \alpha /2} \sqrt{\left(X^T X \right)^{-1}_{ii} S^2}\leq \beta_{i, true} \leq \beta_i + t_{(N-L), \alpha /2} \sqrt{\left(X^T X \right)^{-1}_{ii} S^2} \tag{17} \label{interval},
$$</div>
<p>
where <span class="math">\(\beta_i\)</span> is obtained from the solution (\ref{optimal}). The interval above can be inverted to generate level <span class="math">\(\alpha\)</span> hypothesis tests. In particular, we note that a test of the null — that a particular coefficient is actually zero — would not be rejected if (\ref{interval}) contains the origin. This approach is often used to test whether some data is consistent with the assertion that a predictor is linearly related to the response.</p>
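<p>The interval (\ref{interval}) is straightforward to compute directly. The sketch below uses synthetic data (the setup is invented; only the scipy calls are standard):</p>

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
N, L = 40, 2
X = rng.normal(size=(N, L))
beta_true = np.array([1.0, 0.0])          # the second predictor has no real effect
y = X @ beta_true + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y              # least-squares solution (5)
resid = y - X @ beta_hat
S2 = resid @ resid / (N - L)              # unbiased variance estimate (9)
se = np.sqrt(np.diag(XtX_inv) * S2)       # standard errors of the coefficients

alpha = 0.05
t_crit = t.ppf(1 - alpha / 2, df=N - L)   # two-sided critical value
lower, upper = beta_hat - t_crit * se, beta_hat + t_crit * se
# a coefficient is deemed significant at level alpha iff its interval excludes zero
```

<p>Checking whether each interval contains the origin implements exactly the null-hypothesis test described above.</p>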
<p>[Again, see Cathy’s follow-up post <a href="http://efavdb.github.io/interpret-linear-regression">here</a> for an alternate take on these results.]</p>
<h3 id="bayesian-analysis">Bayesian analysis</h3>
<p>The final thing we wish to do here is consider the problem from a Bayesian perspective, using a flat prior on the <span class="math">\(\vec{\beta}\)</span>. In this case, assuming a Gaussian form for the <span class="math">\(\epsilon_i\)</span> in (\ref{noise}) gives
</p>
<div class="math">\begin{align}\tag{18} \label{18}
p(\vec{\beta} \vert \{y_i\}) \propto p(\{y_i\} \vert \vec{\beta}) p(\vec{\beta}) \propto \exp \left [ -\frac{1}{2 \sigma^2}\sum_i \left (y_i - \vec{\beta} \cdot \vec{x}_i \right)^2\right].
\end{align}</div>
<p>
Notice that this posterior form for <span class="math">\(\vec{\beta}\)</span> is also Gaussian, and is centered about the solution (\ref{optimal}). Formally, we can write the exponent here in the form
</p>
<div class="math">\begin{align}
-\frac{1}{2 \sigma^2}\sum_i \left (y_i - \vec{\beta} \cdot \vec{x}_i \right)^2 \equiv -\frac{1}{2} \delta \vec{\beta}^T \cdot \frac{1}{\Sigma^2} \cdot \delta \vec{\beta} + \text{const}, \tag{19}
\end{align}</div>
<p>
where <span class="math">\(\delta \vec{\beta}\)</span> is the deviation of <span class="math">\(\vec{\beta}\)</span> from the least-squares solution (\ref{optimal}), and <span class="math">\(\Sigma^2\)</span> is the covariance matrix for the components of <span class="math">\(\vec{\beta}\)</span>, as implied by the posterior form (\ref{18}). We can get the components of its inverse by differentiating (\ref{18}) twice. This gives,
</p>
<div class="math">\begin{align}
\left ( \frac{1}{\Sigma^2}\right)_{jk} &= \frac{1}{2 \sigma^2} \partial_{\beta_j} \partial_{\beta_k} \sum_i \left (y_i - \vec{\beta} \cdot \vec{x}_i \right)^2 \\
&= -\frac{1}{\sigma^2}\partial_{\beta_j} \sum_i \left (y_i - \vec{\beta} \cdot \vec{x}_i \right) x_{i,k} \\
&= \frac{1}{\sigma^2} \sum_i x_{i,j} x_{i,k} = \frac{1}{\sigma^2} (X^T X)_{jk}. \tag{20}
\end{align}</div>
<p>
In other words, <span class="math">\(\Sigma^2 = \sigma^2 (X^T X)^{-1}\)</span>, in agreement with the classical expression (\ref{cov}).</p>
<h3 id="summary">Summary</h3>
<p>In summary, we’ve gone through a quick derivation of the linear fit solution that minimizes the sum of squared <span class="math">\(y\)</span> errors for a given set of data. We’ve also considered the variance of this solution, showing that the resulting form is closely related to the principal components of the predictor variables sampled. The covariance solution (\ref{cov}) tells us that all parameters have standard deviations that decrease like <span class="math">\(1/\sqrt{N}\)</span>, with <span class="math">\(N\)</span> the number of samples. However, predictors that are sampled over wider ranges always have coefficient estimates that are more precise: Sampling over many different values gives a better read on how the underlying function being fit varies with a predictor. Following this, assuming normal errors, we showed that <span class="math">\(S^2\)</span> provides an unbiased, chi-squared-distributed estimator for the sample variance — one that is independent of the parameter estimates. This allowed us to write down a confidence interval for the <span class="math">\(i\)</span>-th coefficient. Finally, we showed that the Bayesian, Gaussian approach gives similar results: The posterior that results is centered about the classical solution, and has a covariance matrix equal to that obtained by the classical approach.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Average queue wait times with random arrivals2016-04-23T09:51:00-07:002016-04-23T09:51:00-07:00Jonathan Landytag:efavdb.com,2016-04-23:/average-queue-wait-times-with-random-arrivals<p>Queries ping a certain computer server at random times, on average <span class="math">\(\lambda\)</span> arriving per second. The server can respond to one per second and those that can’t be serviced immediately are queued up. What is the average wait time per query? Clearly if <span class="math">\(\lambda \ll 1\)</span>, the average wait time is zero. But if <span class="math">\(\lambda > 1\)</span>, the queue grows indefinitely and the answer is infinity! Here, we give a simple derivation of the general result — (9) below.</p>
<h3 id="introduction">Introduction</h3>
<p>The mathematics of queue waiting times — first worked out by <a href="https://en.wikipedia.org/wiki/Erlang_(unit)">Agner Krarup Erlang</a> — is interesting for two reasons. First, as noted above, queues can exhibit phase-transition-like behaviors: If the average arrival time is shorter than the average time it takes to serve a customer, the line will grow indefinitely, causing the average wait time to diverge. Second, when the average arrival time is greater than the service time, waiting times are governed entirely by fluctuations — and so can’t be estimated well using mean-field arguments. For example, in the very low arrival rate limit, the only situation where anyone would ever have to wait at all is that where someone else happens to arrive just before them — an unlucky, rare event.</p>
<p>Besides being interesting from a theoretical perspective, an understanding of queue formation phenomena is also critical for many practical applications — both in computer science and in wider industry settings. Optimal staffing of a queue requires a careful estimate of the expected customer arrival rate. If too many workers are staffed, the average wait time will be nicely low, but workers will largely be idle. Staff too few, and the business could enter into the divergent queue length regime — certainly resulting in unhappy customers and lost business (or dropped queries). Staffing just the right amount requires a sensitive touch — and in complex cases, a good understanding of the theory.</p>
<p>In order to derive the average wait time for queues of different sorts, one often works within the framework of Markov processes. This approach is very general and elementary, but requires a bit of effort to develop the machinery needed to get to the end results. Here, we demonstrate an alternative, sometimes faster approach that is based on writing down an integral equation for the wait time distribution. We consider only a simple case — that where the queue is serviced by only one staff member, the customers arrive at random times via a Poisson process, and each customer requires the same service time: one second.</p>
<h3 id="integral-equation-formulation">Integral equation formulation</h3>
<p>Suppose the <span class="math">\(N\)</span>-th customer arrives at time <span class="math">\(0\)</span>, and let <span class="math">\(P(t)\)</span> be the probability that this customer has to wait a time <span class="math">\(t\geq 0\)</span> before being served. This wait time can be written in terms of the arrival and wait times of the previous customer: If this previous customer arrived at time <span class="math">\(t^{\prime}\)</span> and has to wait a time <span class="math">\(w\)</span> before being served, his service will conclude at time <span class="math">\(t = t^{\prime} + w + 1\)</span>. If this is greater than <span class="math">\(0\)</span>, the <span class="math">\(N\)</span>-th customer will have to wait before being served. In particular, he will wait <span class="math">\(t\)</span> if the previous customer waited <span class="math">\(w = t - t^{\prime} - 1\)</span>.</p>
<p>The above considerations allow us to write down an equation satisfied by the wait time distribution. If we let the probability that the previous customer arrived at <span class="math">\(t^{\prime}\)</span> be <span class="math">\(A(t^{\prime})\)</span>, we have (for <span class="math">\(t > 0\)</span>)
</p>
<div class="math">\begin{eqnarray}
\tag{1} \label{int_eqn}
P(t) &=& \int_{-\infty}^{0^-} A(t^{\prime}) P(t - t^{\prime} - 1) d t^{\prime} \\
&=& \int_{-\infty}^{0^-} \lambda e^{\lambda t^{\prime}} P(t - t^{\prime} - 1) d t^{\prime}
\end{eqnarray}</div>
<p>
Here, in the first equality we’re simply averaging over the possible arrival times of the previous customer (which had to occur before the <span class="math">\(N\)</span>-th, at <span class="math">\(0\)</span>), multiplying by the probability <span class="math">\(P(t - t^{\prime} - 1)\)</span> that this customer had to wait the amount of time <span class="math">\(w\)</span> needed so that the <span class="math">\(N\)</span>-th customer will wait <span class="math">\(t\)</span>. We also use the symmetry that each customer has the same wait time distribution at steady state. In the second equality, we have plugged in the arrival time distribution appropriate for our Poisson model.</p>
<p>To proceed, we differentiate both sides of (\ref{int_eqn}) with respect to <span class="math">\(t\)</span>,
</p>
<div class="math">\begin{eqnarray}\tag{2} \label{int2}
P^{\prime}(t) &=& \int_{-\infty}^{0^-} \lambda e^{\lambda t^{\prime}} \frac{d}{dt}P(t - t^{\prime} - 1) d t^{\prime} \\
&=& - \int_{-\infty}^{0^-} \lambda e^{\lambda t^{\prime}} \frac{d}{dt^{\prime}}P(t - t^{\prime} - 1) d t^{\prime}.
\end{eqnarray}</div>
<p>
The second equality follows after noticing that the derivative with respect to <span class="math">\(t\)</span> can be traded for minus the derivative with respect to <span class="math">\(t^{\prime}\)</span>. Integrating by parts, we obtain
</p>
<div class="math">\begin{eqnarray}
P^{\prime}(t) = \lambda \left [P(t) - P(t-1) \right], \tag{3} \label{sol}
\end{eqnarray}</div>
<p>
a delay differential equation for the wait time distribution. This could be integrated numerically to get the full solution. However, our interest here is primarily the mean waiting time — as we show next, it’s easy to extract this part of the solution analytically.</p>
<h3 id="probability-of-no-wait-and-the-mean-wait-time">Probability of no wait and the mean wait time</h3>
<p>We can obtain a series of useful relations by multiplying (\ref{sol}) by powers of <span class="math">\(t\)</span> and integrating. The first such expression is obtained by multiplying by <span class="math">\(t^1\)</span>. Doing this and integrating its left side, we obtain
</p>
<div class="math">\begin{eqnarray} \tag{4} \label{int3}
\int_{0^{+}}^{\infty} P^{\prime}(t) t \, dt = \left . P(t) t \right |_{0^{+}}^{\infty} - \int_{0^+}^{\infty} P(t) \, dt = P(0) - 1.
\end{eqnarray}</div>
<p>
Similarly integrating its right side, we obtain</p>
<div class="math">\begin{eqnarray}\tag{5} \label{int4}
\lambda \int_{0^{+}}^{\infty} t \left [P(t) - P(t-1) \right] dt = \lambda \left [ \overline{t} - \overline{(t + 1)} \right ] = - \lambda.
\end{eqnarray}</div>
<p>
Equating the last two lines, we obtain the probability of no wait,
</p>
<div class="math">\begin{eqnarray} \tag{6} \label{int5}
P(0) = 1 - \lambda.
\end{eqnarray}</div>
<p>
This shows that when the arrival rate is low, the probability of no wait goes to one — an intuitively reasonable result. On the other hand, as <span class="math">\(\lambda \to 1\)</span>, the probability of no wait approaches zero. In between, the idle time fraction of our staffer (which is equal to the probability of no wait, given a random arrival time) grows linearly, connecting these two limits.</p>
<p>To obtain an expression for the average wait time, we carry out a similar analysis to that above, but multiply (\ref{sol}) by <span class="math">\(t^2\)</span> instead. The integral on the left is then
</p>
<div class="math">\begin{eqnarray} \tag{7} \label{int6}
\int_{0^{+}}^{\infty} P^{\prime}(t) t^2 dt = \left . P(t) t^2 \right |_{0^{+}}^{\infty} - 2\int_{0^+}^{\infty} P(t) t dt = - 2 \overline{t}.
\end{eqnarray}</div>
<p>
Similarly, the integral at right is
</p>
<div class="math">\begin{eqnarray} \tag{8} \label{fin_int}
\lambda \int_{0^{+}}^{\infty} t^2 \left [P(t) - P(t-1) \right] dt &=& \lambda \left [ \overline{ t^2} - \overline{ (t + 1)^2} \right ] \\
&=& - \lambda (2 \overline{t} +1).
\end{eqnarray}</div>
<p>
Equating the last two lines and rearranging gives our solution for the average wait,
</p>
<div class="math">\begin{eqnarray} \tag{9} \label{fin}
\overline{t} = \frac{\lambda}{2 (1 - \lambda)}.
\end{eqnarray}</div>
<p>
As advertised, this diverges as <span class="math">\(\lambda \to 1\)</span> — see the illustration in the plot below. It’s very interesting that even as <span class="math">\(\lambda\)</span> approaches this extreme limit, the line is still empty a nonzero fraction <span class="math">\(1 - \lambda\)</span> of the time — see (\ref{int5}). Evidently, for any <span class="math">\(\lambda < 1\)</span>, some idle time can’t be avoided, even while the average wait diverges.</p>
<p><img alt="average wait time" src="https://efavdb.com/wp-content/uploads/2016/04/Screen-Shot-2016-04-23-at-5.02.38-PM.png"></p>
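<p>Both (\ref{int5}) and (\ref{fin}) are easy to check by direct simulation. The sketch below (not from the original post; the function name <code>simulate_queue</code> is ours) iterates the standard Lindley recursion for successive customer waits, with exponential inter-arrival gaps and one-second service:</p>

```python
import random

def simulate_queue(lam, n_customers=200_000, seed=0):
    """Monte Carlo check of the unit-service, single-server queue.
    Waits obey the Lindley recursion W_{n+1} = max(0, W_n + 1 - A_{n+1}),
    with A_{n+1} the exponential gap between arrivals n and n+1."""
    rng = random.Random(seed)
    wait = 0.0
    total_wait = 0.0
    no_wait = 0
    for _ in range(n_customers):
        gap = rng.expovariate(lam)         # Poisson arrivals at rate lam
        wait = max(0.0, wait + 1.0 - gap)  # previous wait + 1 s service - gap
        total_wait += wait
        no_wait += wait == 0.0
    return total_wait / n_customers, no_wait / n_customers

mean_wait, p_no_wait = simulate_queue(0.5)
# Theory: mean wait = lam / (2 (1 - lam)) = 0.5 and P(no wait) = 1 - lam = 0.5;
# the two estimates above should land close to these values.
```

<p>With <span class="math">\(\lambda = 0.5\)</span>, both estimates come out near <span class="math">\(0.5\)</span>, in agreement with (\ref{int5}) and (\ref{fin}).</p>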
<h3 id="conclusions-and-extensions">Conclusions and extensions</h3>
<p>To carry this approach further, one could consider the case where the queue feeds <span class="math">\(k\)</span> staff, rather than just one. I’ve made progress on this effort in certain cases, but have been stumped on the general problem. One interesting thing you can intuit about this <span class="math">\(k\)</span>-staff version is that one approaches the mean-field analysis as <span class="math">\(k\to \infty\)</span> (adding more staff tends to smooth things over, resulting in a diminishing of the importance of the randomness of the arrival times). This means that as <span class="math">\(k\)</span> grows, we’ll have very little average wait time for any <span class="math">\(\lambda<1\)</span>, but again divergent wait times for any <span class="math">\(\lambda \geq 1\)</span> — like an infinite step function. Another direction one could pursue is to allow the service times to follow a distribution. Both cases can also be worked out using the Markov approach — references to such work can be found in the link provided in the introduction.</p>
Improved Bonferroni correction factors for multiple pairwise comparisons2016-04-10T07:58:00-07:002016-04-10T07:58:00-07:00Jonathan Landytag:efavdb.com,2016-04-10:/bonferroni-correction-for-multiple-pairwise-comparison-tests<p>A common task in applied statistics is the pairwise comparison of the responses of <span class="math">\(N\)</span> treatment groups in some statistical test — the goal being to decide which pairs exhibit differences that are statistically significant. Now, because there is one comparison being made for each pairing, a naive application of the Bonferroni correction analysis suggests that one should set the individual pairwise test sizes to <span class="math">\(\alpha_i \to \alpha_f/{N \choose 2}\)</span> in order to obtain a desired family-wise type 1 error rate of <span class="math">\(\alpha_f\)</span>. Indeed, this solution is suggested by many texts. However, implicit in the Bonferroni analysis is the assumption that the comparisons being made are each mutually independent. This is not the case here, and we show that as a consequence the naive approach often returns type 1 error rates far from those desired. We provide adjusted formulas that allow for error-free Bonferroni-like corrections to be made.</p>
<p>(edit (7/4/2016): After posting this article, I’ve since found that the method we suggest here is related to / is a generalization of Tukey’s range test — see <a href="https://en.wikipedia.org/wiki/Tukey%27s_range_test">here</a>.)</p>
<p>(edit (6/11/2018): I’ve added the notebook used below to our Github, <a href="https://github.com/EFavDB/improved_bonferroni">here</a>)</p>
<h3 id="introduction">Introduction</h3>
<p>In this post, we consider a particular kind of statistical test where one examines <span class="math">\(N\)</span> different treatment groups, measures some particular response within each, and then decides which of the <span class="math">\({N \choose 2}\)</span> pairs appear to exhibit responses that differ significantly. This is called the pairwise comparison problem (or sometimes “posthoc analysis”). It comes up in many contexts, and in general it will be of interest whenever one is carrying out a multiple-treatment test.</p>
<p>Our specific interest here is in identifying the appropriate individual measurement error bars needed to guarantee a given family-wise type 1 error rate, <span class="math">\(\alpha_f\)</span>. Briefly, <span class="math">\(\alpha_f\)</span> is the probability that we incorrectly make any assertion that two measurements differ significantly when the true effect sizes we’re trying to measure are actually all the same. This can happen due to the nature of statistical fluctuations. For example, when measuring the heights of <span class="math">\(N\)</span> identical objects, measurement error can cause us to incorrectly think that some pairs have slightly different heights, even though that’s not the case. A classical approach to addressing this problem is given by the Bonferroni approximation: If we consider <span class="math">\(\mathcal{N}\)</span> independent comparisons, and each has an individual type 1 error rate of <span class="math">\(\alpha_i,\)</span> then the family-wise probability of not making any type 1 errors is simply the product of the probabilities that we don’t make any individual type 1 errors,
</p>
<div class="math">$$ \tag{1} \label{bon1}
p_f = (1 - \alpha_f) = p_i^{\mathcal{N}} \equiv \left ( 1 - \alpha_i \right)^{\mathcal{N}} \approx 1 - \mathcal{N} \alpha_i.
$$</div>
<p>
The last equality here is an expansion that holds when <span class="math">\(p_f\)</span> is close to <span class="math">\(1\)</span>, the limit we usually work in. Rearranging (\ref{bon1}) gives a simple expression,
</p>
<div class="math">$$ \tag{2} \label{bon2}
\alpha_i = \frac{\alpha_f}{\mathcal{N}}.
$$</div>
<p>
This is the (naive) Bonferroni approximation — it states that one should use individual tests of size <span class="math">\(\alpha_f / \mathcal{N}\)</span> in order to obtain a family-wise error rate of <span class="math">\(\alpha_f\)</span>.</p>
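<p>In code, the naive prescription for the pairwise comparison problem is a one-liner (a sketch; the helper name <code>naive_bonferroni</code> is ours):</p>

```python
from math import comb

def naive_bonferroni(alpha_f, n_groups):
    """Naive per-comparison test size for all pairwise comparisons of
    n_groups treatments: alpha_i = alpha_f / C(n_groups, 2)."""
    return alpha_f / comb(n_groups, 2)

# e.g., a target family-wise rate of 0.05 across N = 4 groups (6 pairs)
# gives individual test sizes of 0.05 / 6.
```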
<p>The reason why we refer to (\ref{bon2}) as the naive Bonferroni approximation is that it doesn’t actually apply to the problem we consider here. The reason why is that <span class="math">\(p_f \not = p_i^{\mathcal{N}}\)</span> in (\ref{bon1}) if the <span class="math">\(\mathcal{N}\)</span> comparisons considered are not independent: This is generally the case for our system of <span class="math">\(\mathcal{N} = {N \choose 2}\)</span> comparisons, since they are based on an underlying set of measurements having only <span class="math">\(N\)</span> degrees of freedom (the object heights, in our example). Despite this obvious issue, the naive approximation is often applied in this context. Here, we explore the nature of the error incurred in such applications, and we find that it is sometimes very significant. We also show that it’s actually quite simple to apply the principle behind the Bonferroni approximation without error: One need only find a way to evaluate the true <span class="math">\(p_f\)</span> for any particular choice of error bars. Inverting this then allows one to identify the error bars needed to obtain the desired <span class="math">\(p_f\)</span>.</p>
<h3 id="general-treatment">General treatment</h3>
<p>In this section, we derive a formal expression for the type 1 error rate in the pairwise comparison problem. For simplicity, we will assume 1) that the uncertainty in each of our <span class="math">\(N\)</span> individual measurements is the same (e.g., the variance in the case of Normal variables), and 2) that our pairwise tests assert that two measurements differ statistically if and only if they are more than <span class="math">\(k\)</span> units apart.</p>
<p>To proceed, we consider the probability that a type 1 error does not occur, <span class="math">\(p_f\)</span>. This requires that all <span class="math">\(N\)</span> measurements sit within <span class="math">\(k\)</span> units of each other. For any set of values satisfying this condition, let the smallest of the set be <span class="math">\(x\)</span>. We have <span class="math">\(N\)</span> choices for which of the treatments sits at this position. The remaining <span class="math">\((N-1)\)</span> values must all be within the region <span class="math">\((x, x+k)\)</span>. Because we’re considering the type 1 error rate, we can assume that each of the independent measurements takes on the same distribution <span class="math">\(P(x)\)</span>. These considerations imply
</p>
<div class="math">$$ \tag{3} \label{gen}
p_{f} \equiv 1 - \alpha_{f} = N \int_{-\infty}^{\infty} P(x) \left \{\int_x^{x+k} P(y) dy \right \}^{N-1} dx.
$$</div>
<p>
Equation (\ref{gen}) is our main result. It is nice for a couple of reasons. First, its form implies that when <span class="math">\(N\)</span> is large it will scale like <span class="math">\(a \times p_{1,eff}^N\)</span>, for some <span class="math">\(k\)</span>-dependent numbers <span class="math">\(a\)</span> and <span class="math">\(p_{1,eff}\)</span>. This is reminiscent of the expression (\ref{bon1}), where <span class="math">\(p_f\)</span> took the form <span class="math">\(p_i^{\mathcal{N}}\)</span>. Here, we see that the correct value actually scales like some number to the <span class="math">\(N\)</span>-th power, not the <span class="math">\(\mathcal{N}\)</span>-th. This reflects the fact that we actually only have <span class="math">\(N\)</span> independent degrees of freedom here, not <span class="math">\({N \choose 2}\)</span>. Second, when the inner integral above can be carried out formally, (\ref{gen}) can be expressed as a single one-dimensional integral. In such cases, the integral can be evaluated numerically for any <span class="math">\(k\)</span>, allowing one to conveniently identify the <span class="math">\(k\)</span> that returns any specific, desired <span class="math">\(p_f\)</span>. We illustrate both points in the next two sections, where we consider Normal and Cauchy variables, respectively.</p>
<h3 id="normally-distributed-responses">Normally-distributed responses</h3>
<p>We now consider the case where the individual statistics are each Normally-distributed about zero, and we reject any pair if they are more than <span class="math">\(k \times \sqrt{2} \sigma\)</span> apart, with <span class="math">\(\sigma^2\)</span> the variance of the individual statistics. In this case, the inner integral of (\ref{gen}) goes to
</p>
<div class="math">$$\tag{4} \label{inner_g}
\frac{1}{\sqrt{2 \pi \sigma^2}} \int_x^{x+k \sqrt{2} \sigma} \exp\left [ -\frac{y^2}{2 \sigma^2} \right] dy = \frac{1}{2} \left [\text{erf}(k + \frac{x}{\sqrt{2} \sigma}) - \text{erf}(\frac{x}{\sqrt{2} \sigma})\right].
$$</div>
<p>
Plugging this into (\ref{gen}), we obtain
</p>
<div class="math">$$\tag{5} \label{exact_g}
p_f = \int \frac{N e^{-x^2 / 2 \sigma^2}}{\sqrt{2 \pi \sigma^2}} \exp \left ((N-1) \log \frac{1}{2} \left [\text{erf}(k + \frac{x}{\sqrt{2} \sigma}) - \text{erf}(\frac{x}{\sqrt{2} \sigma})\right]\right)dx.
$$</div>
<p>
This exact expression (\ref{exact_g}) can be used to obtain the <span class="math">\(k\)</span> value needed to achieve any desired family-wise type 1 error rate. Example solutions obtained in this way are compared to the <span class="math">\(k\)</span>-values returned by the naive Bonferroni approach in the table below. The last column <span class="math">\(p_{f,Bon}\)</span> shown is the family-wise success rate that you get when you plug in <span class="math">\(k_{Bon},\)</span> the naive Bonferroni <span class="math">\(k\)</span> value targeting <span class="math">\(p_{f,exact}\)</span>.</p>
<table>
<thead>
<tr>
<th><span class="math">\(N\)</span></th>
<th><span class="math">\(p_{f,exact}\)</span></th>
<th><span class="math">\(k_{exact}\)</span></th>
<th><span class="math">\(k_{Bon}\)</span></th>
<th><span class="math">\(p_{f, Bon}\)</span></th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>0.9</td>
<td>2.29</td>
<td>2.39</td>
<td>0.921</td>
</tr>
<tr>
<td>8</td>
<td>0.9</td>
<td>2.78</td>
<td>2.91</td>
<td>0.929</td>
</tr>
<tr>
<td>4</td>
<td>0.95</td>
<td>2.57</td>
<td>2.64</td>
<td>0.959</td>
</tr>
<tr>
<td>8</td>
<td>0.95</td>
<td>3.03</td>
<td>3.1</td>
<td>0.959</td>
</tr>
</tbody>
</table>
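<p>The exact <span class="math">\(k\)</span> values in the table can be reproduced by evaluating (\ref{exact_g}) numerically. The sketch below (not from the original post; it sets <span class="math">\(\sigma = 1\)</span>, uses a composite Simpson rule, and the function name is ours) returns <span class="math">\(p_f\)</span> for a given <span class="math">\(k\)</span> and <span class="math">\(N\)</span>; a simple scan or root-find over <span class="math">\(k\)</span> then recovers the tabulated values:</p>

```python
import math

def p_family_normal(k, n, grid=4001, lim=12.0):
    """Evaluate eq. (5) with sigma = 1: the family-wise probability of no
    type 1 error for n Normal measurements, rejecting any pair more than
    k * sqrt(2) apart.  Composite Simpson rule on [-lim, lim]."""
    h = 2.0 * lim / (grid - 1)
    total = 0.0
    for i in range(grid):
        x = -lim + i * h
        # probability that another measurement lands in (x, x + k*sqrt(2))
        inner = 0.5 * (math.erf(k + x / math.sqrt(2)) - math.erf(x / math.sqrt(2)))
        f = n * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi) * inner ** (n - 1)
        total += f * (1 if i in (0, grid - 1) else (4 if i % 2 else 2))
    return total * h / 3.0

# Table check: p_family_normal(2.29, 4) and p_family_normal(3.03, 8)
# should come out near 0.90 and 0.95, respectively.
```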
<p>Examining the table shown, you can see that the naive approach is consistently overestimating the <span class="math">\(k\)</span> values (error bars) needed to obtain the desired family-wise rates — but not dramatically so. The reason for the near-accuracy is that the two solutions scale essentially the same way with <span class="math">\(N\)</span>. To see this, one can carry out an asymptotic analysis of (\ref{exact_g}). We skip the details and note only that at large <span class="math">\(N\)</span> we have
</p>
<div class="math">$$\tag{6} \label{asy_g}
p_f \sim \text{erf} \left ( \frac{k}{2}\right)^N
\sim \left (1 - \frac{e^{-k^2 / 4}}{k \sqrt{\pi}/2} \right)^N.
$$</div>
<p>
This is interesting because the individual pairwise tests have p-values given by
</p>
<div class="math">$$ \tag{7} \label{asy_i}
p_i = \int_{-k\sqrt{2}\sigma}^{k\sqrt{2}\sigma} \frac{e^{-x^2 / (4 \sigma^2)}}{\sqrt{4 \pi \sigma^2 }} dx = \text{erf}(k /\sqrt{2}) \sim 1 - \frac{e^{-k^2/2}}{k \sqrt{\pi/2}}.
$$</div>
<p>
At large <span class="math">\(k\)</span>, this is dominated by the exponential. Comparing with (\ref{asy_g}), this implies
</p>
<div class="math">$$ \tag{8} \label{fin_g}
p_f \sim \left (1 - \alpha_i^{1/2} \right)^N \sim 1 - N \alpha_i^{1/2} \equiv 1 - \alpha_f.
$$</div>
<p>
Fixing <span class="math">\(\alpha_f\)</span>, this requires that <span class="math">\(\alpha_i\)</span> scale like <span class="math">\(N^{-2}\)</span>, the same scaling with <span class="math">\(N\)</span> as the naive Bonferroni solution. Thus, in the case of Normal variables, the Bonferroni approximation provides an inexact, but reasonable approximation (nevertheless, we suggest going with the exact approach using (\ref{exact_g}), since it’s just as easy!). We show in the next section that this is not the case for Cauchy variables.</p>
<h3 id="cauchy-distributed-variables">Cauchy-distributed variables</h3>
<p>We’ll now consider the case of <span class="math">\(N\)</span> independent, identically-distributed Cauchy variables having half widths <span class="math">\(a\)</span>,
</p>
<div class="math">$$ \tag{9} \label{c_dist}
P(x) = \frac{a}{\pi} \frac{1}{a^2 + x^2}.
$$</div>
<p>
When we compare any two, we will reject the null if they are more than <span class="math">\(ka\)</span> apart. With this choice, the inner integral of (\ref{gen}) is now
</p>
<div class="math">$$
\tag{10} \label{inner_c}
\frac{a}{\pi} \int_x^{x+ k a} \frac{1}{a^2 + y^2} dy =\\ \frac{1}{\pi} \left [\tan^{-1}(k + x/a) - \tan^{-1}(x/a) \right].
$$</div>
<p>
Plugging this into (\ref{gen}) now gives</p>
<div class="math">$$\tag{11} \label{exact_c}
p_f = \int \frac{N a/\pi}{a^2 + x^2} e^{(N-1) \log
\frac{1}{\pi} \left [\tan^{-1}(k + x/a) - \tan^{-1}(x/a) \right]
} dx.
$$</div>
<p>
This is the analog of (\ref{exact_g}) for Cauchy variables — it can be used to find the exact <span class="math">\(k\)</span> value needed to obtain a given family-wise type 1 error rate. The table below compares the exact values to those returned by the naive Bonferroni analysis [obtained using the fact that the difference between two independent Cauchy variables of width <span class="math">\(a\)</span> is itself a Cauchy distributed variable, but with width <span class="math">\(2a\)</span>].</p>
<table>
<thead>
<tr>
<th><span class="math">\(N\)</span></th>
<th><span class="math">\(p_{f,exact}\)</span></th>
<th><span class="math">\(k_{exact}\)</span></th>
<th><span class="math">\(k_{Bon}\)</span></th>
<th><span class="math">\(p_{f, Bon}\)</span></th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>0.9</td>
<td>27</td>
<td>76</td>
<td>0.965</td>
</tr>
<tr>
<td>8</td>
<td>0.9</td>
<td>55</td>
<td>350</td>
<td>0.985</td>
</tr>
<tr>
<td>4</td>
<td>0.95</td>
<td>53</td>
<td>153</td>
<td>0.983</td>
</tr>
<tr>
<td>8</td>
<td>0.95</td>
<td>107</td>
<td>700</td>
<td>0.993</td>
</tr>
</tbody>
</table>
<p>In this case, you can see that the naive Bonferroni approximation performs badly. For example, in the last line, it suggests using error bars that are seven times too large for each point estimate. The error gets even worse as <span class="math">\(N\)</span> grows: Again, skipping the details, we note that in this limit, (\ref{exact_c}) scales like
</p>
<div class="math">$$\tag{12} \label{asym_c}
p_f \sim \left [\frac{2}{\pi} \tan^{-1}(k/2) \right]^N.
$$</div>
<p>
This can be related to the individual <span class="math">\(p_i\)</span> values, which are given by
</p>
<div class="math">$$ \tag{13} \label{asym2_c}
p_i = \int_{-ka}^{ka} \frac{2 a / \pi}{4 a^2 + x^2}dx = \frac{2}{\pi}\tan^{-1}(k/2).
$$</div>
<p>
Comparing the last two lines, we obtain
</p>
<div class="math">$$ \tag{14} \label{asym3_c}
p_f \equiv 1 - \alpha_f \sim p_i^N \sim 1 - N \alpha_i.
$$</div>
<p>
Although we’ve been a bit sloppy with coefficients here, (\ref{asym3_c}) gives the correct leading <span class="math">\(N\)</span>-dependence: <span class="math">\(k_{exact} \sim 1/\alpha_i \propto N\)</span>. We can see this linear scaling in the table above. This explains why <span class="math">\(k_{exact}\)</span> and <span class="math">\(k_{Bon}\)</span> — which scales like <span class="math">\({N \choose 2} \sim N^2\)</span> — differ more and more as <span class="math">\(N\)</span> grows. In this case, you should definitely never use the naive approximation, but instead stick to the exact analysis based on (\ref{exact_c}).</p>
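<p>As a sanity check on the Cauchy table, note that no pair differs by more than <span class="math">\(k a\)</span> exactly when the sample range stays below <span class="math">\(k a\)</span>, which is easy to estimate by Monte Carlo. The sketch below (not from the original post; it takes <span class="math">\(a = 1\)</span>, generates standard Cauchy draws via the inverse CDF <span class="math">\(\tan(\pi(u - 1/2))\)</span>, and the function name is ours) estimates <span class="math">\(p_f\)</span> this way:</p>

```python
import math
import random

def p_family_cauchy(k, n, trials=300_000, seed=0):
    """Monte Carlo estimate of the family-wise no-error probability for n
    standard Cauchy draws (a = 1): the fraction of trials in which the
    sample range -- and hence every pairwise gap -- stays below k."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
        hits += (max(xs) - min(xs)) < k
    return hits / trials

# Table check: p_family_cauchy(27, 4) should land near 0.9, while the
# naive k_Bon = 76 noticeably overshoots that target.
```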
<h3 id="conclusion">Conclusion</h3>
<p>Some people criticize the Bonferroni correction factor as being too conservative. However, our analysis here suggests that this feeling may be due in part to its occasional improper application. The naive approximation simply does not apply in the case of pairwise comparisons because the <span class="math">\({N \choose 2}\)</span> pairs considered are not independent — there are only <span class="math">\(N\)</span> independent degrees of freedom in this problem. Although the naive correction does not apply to the problem of pairwise comparisons, we’ve shown here that it remains a simple matter to correctly apply the principle behind it: One can easily select any desired family-wise type 1 error rate through an appropriate selection of the individual test sizes — just use (\ref{gen})!</p>
<p>We hope you enjoyed this post — we anticipate writing a bit more on hypothesis testing in the near future.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Try Caffe pre-installed on a VirtualBox image2016-03-22T15:02:00-07:002016-03-22T15:02:00-07:00Cathy Yehtag:efavdb.com,2016-03-22:/caffe-virtualbox<p>A previous <a href="http://efavdb.github.io/deep-learning-with-jupyter-on-aws">post</a> showed beginners how to try out deep learning libraries by</p>
<ol>
<li>using an Amazon Machine Image (<span class="caps">AMI</span>) pre-installed with deep learning libraries</li>
<li>setting up a Jupyter notebook server to play with said libraries</li>
</ol>
<p>If you have VirtualBox and <a href="https://www.vagrantup.com/">Vagrant</a>, you can follow a similar procedure on your own computer. The advantage is that you can develop locally, then deploy on an expensive <span class="caps">AWS</span> <span class="caps">EC2</span> gpu instance when your scripts are ready.</p>
<p>For example, <a href="http://caffe.berkeleyvision.org/">Caffe</a>, the machine vision framework, lets you transition seamlessly between <span class="caps">CPU</span> and <span class="caps">GPU</span> mode, and is available as a <a href="https://atlas.hashicorp.com/malthejorgensen/boxes/caffe-deeplearning">vagrant box</a> running Ubuntu 14.04 (64-bit<a href="#virtualization">**</a>), with Caffe pre-installed.</p>
<p>To add the box, type on the command line:
<code>vagrant box add malthejorgensen/caffe-deeplearning</code></p>
<p>If you don’t already have VirtualBox and Vagrant installed, you can find instructions online, or look at my <a href="#vagrant_install">dotfiles</a> to get an idea.</p>
<hr>
<h2 id="gotchas">Gotchas</h2>
<h3 id="ssh-authentication-failure"><span class="caps">SSH</span> authentication failure</h3>
<p>For me, the box shipped with the wrong public key in <code>/home/vagrant/.ssh/authorized_keys</code>, which produced an “authentication failure” when starting the box with <code>vagrant up</code>. To fix it:</p>
<p>Manually ssh into the box: <code>vagrant ssh</code>.</p>
<p>Then type (key taken from <a href="https://raw.githubusercontent.com/mitchellh/vagrant/master/keys/vagrant.pub">here</a>):</p>
<div class="highlight"><pre><span></span><span class="nb">echo</span> <span class="s2">"ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA6NF8iallvQVp22WDkTkyrtvp9eWW6A8YVr+kz4TjGYe7gHzIw+niNltGEFHzD8+v1I2YJ6oXevct1YeS0o9HZyN1Q9qgCgzUFtdOKLv6IedplqoPkcmF0aYet2PkEDo3MlTBckFXPITAMzF8dJSIFo9D8HfdOV0IAdx4O7PtixWKn5y2hMNG0zQPyUecp4pzC6kivAIhyfHilFR61RGL+GPXQ2MWZWFYbAGjyiYJnAmCP3NOTd0jMZEnDkbUvxhMmBYSdETk1rRgm+R4LOzFUGaHqHDLKLX+FIPKcF96hrucXzcWyLbIbEgE98OHlnVYCzRdK8jlqm8tehUc9c9WhQ== vagrant insecure public key"</span> > ~/.ssh/authorized_keys
</pre></div>
<p>Log out of the box and reload it with <code>vagrant reload</code>; the <span class="caps">SSH</span> authentication error should now be resolved.</p>
<h3 id="jupyter-notebook-server">Jupyter notebook server</h3>
<p>By default, the box runs a notebook server on port 8003, started from the <code>/home/vagrant/caffe/examples</code> directory, used in conjunction with this port-forwarding rule in the Vagrantfile:
<code>config.vm.network "forwarded_port", guest: 8003, host: 8003</code>
With this default setup, go to <code>http://localhost:8003</code> in your browser to access <code>/home/vagrant/caffe/examples</code>.</p>
<p>The default server setup limits access to <code>/home/vagrant/caffe/examples</code> only, so I prefer to configure my own Jupyter notebook server on port 8888 (forwarding port 8888 in the Vagrantfile as well) and then start the server from <code>/home/vagrant</code>, or wherever I’m working. To do this:</p>
<p>Log in to the box: <code>vagrant ssh</code></p>
<p>Then create the notebook config file <code>~/.jupyter/jupyter_notebook_config.py</code> containing the following lines:</p>
<div class="highlight"><pre><span></span><span class="err">c.NotebookApp.ip = '*'</span>
<span class="err">c.NotebookApp.open_browser = False</span>
<span class="err">c.NotebookApp.port = 8888</span>
</pre></div>
<h3 id="vagrantfile">Vagrantfile</h3>
<p>Here’s the vagrant file that worked for me:</p>
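<p>The original file isn’t reproduced here, so take the following as a minimal sketch consistent with the settings discussed in this post (the <code>malthejorgensen/caffe-deeplearning</code> box, plus forwarding for the default port 8003 and the custom port 8888), not as the exact file:</p>

```shell
# Hypothetical reconstruction of a minimal Vagrantfile matching the
# settings described in this post; your actual file may differ.
cat > Vagrantfile <<'EOF'
Vagrant.configure(2) do |config|
  config.vm.box = "malthejorgensen/caffe-deeplearning"
  config.vm.network "forwarded_port", guest: 8003, host: 8003
  config.vm.network "forwarded_port", guest: 8888, host: 8888
end
EOF
```

With this file in place, <code>vagrant up</code> from the same directory boots the box with both ports forwarded.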
<hr>
<ul>
<li>Scripts to <a href="https://github.com/frangipane/.dotfiles/blob/master/install/apt-get.sh">install Virtualbox</a> (line 31 and onwards) and <a href="https://github.com/frangipane/.dotfiles/blob/master/install/install-vagrant.sh">install Vagrant</a>.</li>
</ul>
<p>** This is a 64-bit box, so you need to have Intel <span class="caps">VT</span>-x enabled in your <span class="caps">BIOS</span>.</p>Start deep learning with Jupyter notebooks in the cloud2016-03-10T20:41:00-08:002016-03-10T20:41:00-08:00Cathy Yehtag:efavdb.com,2016-03-10:/deep-learning-with-jupyter-on-aws<p>Want a quick and easy way to play around with deep learning libraries? Puny <span class="caps">GPU</span> got you down? Thanks to Amazon Web Services (<span class="caps">AWS</span>) — specifically, <span class="caps">AWS</span> Elastic Compute Cloud (<span class="caps">EC2</span>) — no data scientist need be left behind.</p>
<p>Jupyter/IPython notebooks are indispensable tools for learning and tinkering. This post shows how to set up a public Jupyter notebook server in <span class="caps">EC2</span> and then access it remotely through your web browser, just as you would if you were using a notebook launched from your own laptop.</p>
<p>For a beginner, having to both set up deep learning libraries and navigate the <span class="caps">AWS</span> menagerie feels like getting thrown into the deep end when you just want to stick a toe in. You can skip the hassle of setting up deep learning frameworks from scratch by choosing an Amazon Machine Image (<span class="caps">AMI</span>) that comes pre-installed with the libraries and their dependencies. (Concerned about costs? — see the note<a href="#note1">*</a> at the bottom of this post.)</p>
<p>For example, the Stanford class, <a href="http://cs231n.stanford.edu/">CS231n: Convolutional Neural Networks for Visual Recognition</a>, has provided a public <span class="caps">AMI</span> with these specs:</p>
<ul>
<li>cs231n_caffe_torch7_keras_lasagne_v2</li>
<li><span class="caps">AMI</span> <span class="caps">ID</span>: ami-125b2c72 in the us-west-1 region</li>
<li>Use a g2.2xlarge instance.</li>
<li>Caffe, Torch7, Theano, Keras and Lasagne are pre-installed. Python bindings of caffe are available. It has <span class="caps">CUDA</span> 7.5 and CuDNN v3.</li>
</ul>
<p>If you’re new to <span class="caps">AWS</span>, CS231n provides a nice step-by-step <a href="http://cs231n.github.io/aws-tutorial/"><span class="caps">AWS</span> tutorial</a> with lots of screenshots. We’re just going to tweak their procedure to enable access to Jupyter/IPython notebooks.</p>
<p>After you’re done, you’ll be able to work through tutorials in notebook format like those provided by caffe in their examples folder, e.g. <a href="http://nbviewer.jupyter.org/github/BVLC/caffe/blob/master/examples/00-classification.ipynb">00-classification.ipynb</a>.</p>
<p>We’ve written a little bash script <code>jupyter_userdata.sh</code> to execute Jupyter’s <a href="http://jupyter-notebook.readthedocs.org/en/latest/public_server.html">instructions</a> for setting up a public notebook server, so you don’t have to manually configure the notebook server every time you want to spin up a new <span class="caps">AMI</span> instance.</p>
<p>For the script to work, Jupyter itself should already be installed — which it is in the CS231n <span class="caps">AMI</span>.</p>
<p>You just have to edit the password in the script. To generate a hashed password, use IPython:</p>
<div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">1</span><span class="p">]:</span> <span class="kn">from</span> <span class="nn">notebook.auth</span> <span class="kn">import</span> <span class="n">passwd</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">2</span><span class="p">]:</span> <span class="n">passwd</span><span class="p">()</span>
<span class="n">Enter</span> <span class="n">password</span><span class="p">:</span>
<span class="n">Verify</span> <span class="n">password</span><span class="p">:</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">2</span><span class="p">]:</span> <span class="s1">'sha1:bcd259ccf...<your hashed password here>'</span>
</pre></div>
<p>Replace the right hand side of line 24 in the script with the hashed password you just generated.</p>
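<p>For reference, the line in question presumably assigns your hash to Jupyter’s <code>c.NotebookApp.password</code> setting; the sketch below is illustrative only (the variable name and config file path here are assumptions, not copied from the actual script):</p>

```shell
# Illustrative sketch only: the variable name and config path are
# assumptions, not lines taken from jupyter_userdata.sh.
HASHED_PASSWORD="sha1:bcd259ccf...<your hashed password here>"  # paste your own hash
echo "c.NotebookApp.password = u'${HASHED_PASSWORD}'" >> jupyter_notebook_config.py
```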
<hr>
<p>Then, follow these steps to launch an <span class="caps">EC2</span> instance.</p>
<p><strong>1.</strong> First, follow the CS231n <a href="http://cs231n.github.io/aws-tutorial/"><span class="caps">AWS</span> tutorial</a> up until the step <em>“Choose the instance type <code>g2.2xlarge</code>, and click on “Review and Launch”</em>.</p>
<p>Don’t click on “Review and Launch” yet!</p>
<p><strong>2.</strong> Here’s where we add a couple extra steps to the tutorial.<a href="#note2">**</a></p>
<p>We’re going to supply the shell script as <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html">user-data</a>, a way to pass in scripts to automate configurations to your <span class="caps">AMI</span>. Instead of clicking on “Review and Launch”, click on the gray button in the lower right “Next: Configure Instance Details”.</p>
<p>In the next page, click on the arrowhead next to “Advanced Details” to expand its options. Click on the radio button next to “As text”, then copy and paste the text from <code>jupyter_userdata.sh</code> (modified with your password) into the field.</p>
<p>Warning: if you click on “As file” instead and browse to wherever you saved <code>jupyter_userdata.sh</code>, the file must first be <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html">base64-encoded</a>.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2016/03/Step3_Configure-Instance-Details.png"><img alt="" src="https://efavdb.com/wp-content/uploads/2016/03/Step3_Configure-Instance-Details.png"></a></p>
<p><strong>3.</strong> Next, (skipping steps 4. and 5.) click on the link to “6. Configure Security Group” near the top of the page. By default, <span class="caps">SSH</span> is enabled, but we need to enable access to the notebook server, whose port we’ve set as 8888 in the bash script.</p>
<p>Click on the grey button “Add Rule”, then for the new rule, choose Type: Custom <span class="caps">TCP</span> Rule; Protocol: <span class="caps">TCP</span>; Port Range: 8888; Source: Anywhere.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2016/03/Step6_Configure-Security-Group.png"><img alt="" src="https://efavdb.com/wp-content/uploads/2016/03/Step6_Configure-Security-Group.png"></a></p>
<p><strong>4.</strong> Now, pick up where you left off in the CS231n tutorial (“<em>… click on “Review and Launch</em>“.), which takes you to “Step 7. Review Instance Launch”. Complete the tutorial.</p>
<hr>
<p>Check that the Jupyter notebook server was set up correctly:</p>
<ol>
<li>ssh into your instance (see CS231n instructions).</li>
<li>Navigate to <code>~/caffe/examples</code>.</li>
<li>Start the notebook server using the <code>jupyter notebook</code> command.</li>
<li>
<p>In your web browser, access the notebook server with https://PUBLIC_IP:8888, where PUBLIC_IP is the public <span class="caps">IP</span> of your instance, displayed from the instance description on your <span class="caps">AWS</span> dashboard. Your browser will warn that your self-signed certificate is insecure or unrecognized.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2016/03/scary-browser-warning.png"><img alt="" src="https://efavdb.com/wp-content/uploads/2016/03/scary-browser-warning.png"></a></p>
<p>That’s ok — click past the warnings, and you should get a sign-in page. Type in your password.</p>
</li>
<li>
<p>Next, you should see the files and directories in <code>/home/ubuntu/caffe/examples</code></p>
</li>
<li>Open one of the example notebooks, e.g. <code>00-classification.ipynb</code>, and try running some cells to make sure everything is working.</li>
</ol>
<p>Voila! We hope this guide removes some obstacles to getting started. Happy learning!</p>
<hr>
<p>* The cost of running a <span class="caps">GPU</span> instance is high compared to many other instance types, but still very reasonable if you’re just tinkering for a few hours on a pre-trained model, not training a whole neural network from scratch.</p>
<p>Check out the <a href="https://aws.amazon.com/ec2/pricing/">pricing</a> for an <span class="caps">EC2</span> instance in the section “On-Demand Instance Prices” after selecting the region of your <span class="caps">AMI</span>. At the time of writing, the cost of an on-demand <code>g2.2xlarge</code> instance in the <span class="caps">US</span> West (Northern California) region was $0.7/hour, whereas the price of a <a href="https://aws.amazon.com/ec2/spot/pricing/">spot</a> instance (a cheaper alternative which will automatically terminate when the spot pricing exceeds your bid) was $0.3/hour.</p>
<p>** If you followed the CS231n tutorial exactly and forgot to supply user data, you can still use this script. First modify the security configuration of your instance according to step <strong>3</strong>. Then use the <code>scp</code> command to copy the script from your local computer to your instance, <code>ssh</code> into your instance, then execute the script: <code>source jupyter_userdata.sh</code>. If you need help with using <code>scp</code>, see “To use <span class="caps">SCP</span> to transfer a file” in this <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html">guide</a>.</p>Dotfiles for peace of mind2016-02-23T12:18:00-08:002016-02-23T12:18:00-08:00Cathy Yehtag:efavdb.com,2016-02-23:/dotfiles<p>Reinstalling software and configuring settings on a new computer is a pain. After my latest hard drive failure set the stage for yet another round of download-extract-install and configuration file twiddling, it was time to overhaul my approach. <em>“Enough is enough!”</em></p>
<p>This post walks through</p>
<ol>
<li>how to back up and automate the installation and configuration process</li>
<li>how to set up a minimal framework for data science</li>
</ol>
<p>We’ll use a <a href="https://github.com/EFavDB/dotfiles">dotfiles repository</a> on Github to illustrate both points in parallel.</p>
<hr>
<p>Dotfiles are named after the configuration files that start with a dot in Unix-based systems. These files are hidden from view in your home directory, but visible with a <code>$ ls -a</code> command. Some examples are <code>.bashrc</code> (for configuring the bash shell), <code>.gitconfig</code> (for configuring git), and <code>.emacs</code> (for configuring the Emacs text editor).</p>
<p>Let’s provide a concrete example of a customization: suppose you have a hard time remembering the syntax to extract a file (“Is it tar -xvf, -jxvf, or -zxvf?”). If you’re using a bash shell, you can define a function, <code>extract()</code> in your <code>.bashrc</code> file that makes life a little easier:</p>
<div class="highlight"><pre><span></span><span class="err">extract() { </span>
<span class="err">if [ -f "$1" ]; then </span>
<span class="err">case "$1" in </span>
<span class="err">*.tar.bz2) tar -jxvf "$1" ;; </span>
<span class="err">*.tar.gz) tar -zxvf "$1" ;; </span>
<span class="err">*.bz2) bunzip2 "$1" ;; </span>
<span class="err">*.dmg) hdiutil mount "$1" ;; </span>
<span class="err">*.gz) gunzip "$1" ;; </span>
<span class="err">*.tar) tar -xvf "$1" ;; </span>
<span class="err">*.tbz2) tar -jxvf "$1" ;; </span>
<span class="err">*.tgz) tar -zxvf "$1" ;; </span>
<span class="err">*.zip) unzip "$1" ;; </span>
<span class="err">*.ZIP) unzip "$1" ;; </span>
<span class="err">*.pax) cat "$1" | pax -r ;; </span>
<span class="err">*.pax.Z) uncompress "$1" --stdout | pax -r ;; </span>
<span class="err">*.Z) uncompress "$1" ;; </span>
<span class="err">*) echo "'$1' cannot be extracted/mounted via extract()" ;; </span>
<span class="err">esac </span>
<span class="err">else </span>
<span class="err">echo "'$1' is not a valid file to extract" </span>
<span class="err">fi </span>
<span class="err">} </span>
</pre></div>
<p>So the next time you have to extract a file <code>some_file.tar.bz2</code>, just type <code>extract some_file.tar.bz2</code> in bash. (This example was found in this <a href="https://github.com/webpro/dotfiles/blob/master/system/.function_fs#L23">dotfiles repo</a>.)</p>
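<p>As a quick sanity check, the snippet below exercises a trimmed-down copy of <code>extract()</code> (just the <code>.tar.gz</code> branch of the full version above) on a throwaway archive:</p>

```shell
# Trimmed-down extract(): only the .tar.gz case from the full function above.
extract() {
  if [ -f "$1" ]; then
    case "$1" in
      *.tar.gz) tar -zxvf "$1" ;;
      *) echo "'$1' cannot be extracted via extract()" ;;
    esac
  else
    echo "'$1' is not a valid file to extract"
  fi
}

mkdir -p demo && echo "hello" > demo/greeting.txt
tar -zcf demo.tar.gz demo
rm -r demo                  # remove the original directory...
extract demo.tar.gz         # ...and recover it with extract()
cat demo/greeting.txt       # -> hello
```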
<p>The structure of my dotfiles takes after the <a href="https://github.com/webpro/dotfiles">repo</a> described by Lars Kappert in the article <a href="https://medium.com/@webprolific/getting-started-with-dotfiles-43c3602fd789#.eis4hwbff">“Getting Started With Dotfiles”</a>. However, my repo is pared down significantly, with minor modifications for my Linux Mint system (his is <span class="caps">OS</span> X) and a focus on packages for data science.</p>
<hr>
<h2 id="a-framework-for-data-science">A framework for data science</h2>
<p>This starter environment only has a few parts. We need a text editor — preferably one that can support multiple languages encountered in data science — and a way to manage scientific/statistical software packages.</p>
<h3 id="components">Components</h3>
<p>The setup consists of:</p>
<ul>
<li><a href="https://www.gnu.org/software/emacs/">Emacs</a> — a powerful text editor that can be customized to provide an <span class="caps">IDE</span>-like experience for both python and R, while providing syntax highlighting for other languages, e.g. markdown, LaTeX, shell, lisp, and so on. (More on customizing Emacs in a future post.)</li>
<li><a href="http://conda.pydata.org/docs/">Conda</a> — both a package manager and environment manager. Advantages:<ul>
<li>Packages are easy to install compared to pip, e.g. see a post by the <a href="http://technicaldiscovery.blogspot.com/2013/12/why-i-promote-conda.html">author of numpy</a>.</li>
<li>Conda is language agnostic in terms of both managing packages and environments for different languages (as opposed to pip/virtualenv/venv). This feature is great if you use both python and R.</li>
<li>Standard python scientific computing libraries like numpy, scipy, matplotlib, etc. are available in the conda repository.</li>
</ul>
</li>
</ul>
<p>I use the system package manager (i.e. <code>apt-get install ...</code>) to install a few packages like git, but otherwise rely on Conda to install Python (multiple versions are okay!), R, and their libraries.</p>
<p>I like how clean the conda installation feels. Any packages installed by Conda, as well as different versions of Python itself, are neatly organized under the <code>miniconda3</code> directory in my home directory. In contrast, my previous Linux setups were littered with software installations from various package managers, along with sometimes unsuccessful attempts to compile software from source.</p>
<h3 id="workflow">Workflow</h3>
<p>My workflow with Conda follows this helpful <a href="http://stiglerdiet.com/blog/2015/Nov/24/my-python-environment-workflow-with-conda/">post</a> by Tim Hopper. Each project gets its own directory and is associated with an environment whose dependencies are specified by an <code>environment.yml</code> file.</p>
<p>For example, create a folder for a project, my_proj. Within the project folder, create a bare-bones <code>environment.yml</code> file to specify a dependency on python 3 and matplotlib:</p>
<div class="highlight"><pre><span></span><span class="n">name</span><span class="o">:</span> <span class="n">my_proj</span>
<span class="n">dependencies</span><span class="o">:</span>
<span class="o">-</span> <span class="n">python</span><span class="o">=</span><span class="mi">3</span>
<span class="o">-</span> <span class="n">matplotlib</span>
</pre></div>
<p>Then, to create the conda environment named after that directory, run <code>$ conda env create</code> inside the my_proj directory. To activate the virtual environment, run <code>$ source activate my_proj</code>.</p>
<p>Activating a conda environment can be further automated with <a href="https://github.com/kennethreitz/autoenv">autoenv</a>. Autoenv automatically activates the environment for you when you <code>$ cd</code> into a project directory. You just need to create a <code>.env</code> file that contains the command to activate your environment, e.g. <code>source activate my_proj</code>, under the project directory.</p>
<p>Tim has written a convenient bash function, <code>conda-env-file</code> (see <a href="#conda-env-file">below</a>), for generating a basic <code>environment.yml</code> file and <code>.env</code> file, which I’ve incorporated into my own dotfiles, along with autoenv. The order of commands that I type in bash then follows:</p>
<ol>
<li><code>mkdir my_proj</code> # create project folder</li>
<li><code>cd my_proj</code> # enter project directory</li>
<li><code>conda-env-file</code> # execute homemade function to create environment.yml and .env</li>
<li><code>conda env create</code> # conda creates an environment “my_proj” that is named after the project directory (using environment.yml)</li>
<li><code>cd ..</code></li>
<li><code>cd my_proj</code> # autoenv automatically activates environment (using the file .env) when you re-enter the directory</li>
</ol>
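<p>The file-generating part of this sequence can be sketched directly in bash; this just inlines what <code>conda-env-file</code> does, while the conda and autoenv steps are left as comments since they require those tools to be installed:</p>

```shell
# Sketch of steps 1-4 above, inlining the file generation that
# conda-env-file performs. The conda/autoenv steps are comments only,
# since they require those tools to be installed.
proj=my_proj
mkdir -p "$proj"
printf "name: %s\ndependencies:\n- pip\n- python\n" "$proj" > "$proj/environment.yml"
printf "source activate %s\n" "$proj" > "$proj/.env"
# cd my_proj && conda env create   # builds the "my_proj" environment
# cd .. && cd my_proj              # autoenv activates it via .env
```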
<hr>
<h2 id="the-dotfiles-layout">The dotfiles layout</h2>
<p>Below is the layout of the directories and files (generated by the <code>tree</code> command) in the <a href="https://github.com/EFavDB/dotfiles">dotfiles repo</a>.</p>
<div class="highlight"><pre><span></span><span class="err">.</span>
<span class="err">├── install</span>
<span class="err">│ ├── apt-get.sh</span>
<span class="err">│ ├── conda.sh</span>
<span class="err">│ ├── git.sh</span>
<span class="err">│ ├── install-emacs.sh</span>
<span class="err">│ └── install-miniconda.sh</span>
<span class="err">├── install.sh</span>
<span class="err">├── runcom</span>
<span class="err">│ ├── .bash_profile</span>
<span class="err">│ ├── .bashrc</span>
<span class="err">│ └── .profile</span>
<span class="err">└── system</span>
<span class="err"> ├── env</span>
<span class="err"> ├── functions</span>
<span class="err"> └── path</span>
</pre></div>
<h3 id="configuration">Configuration</h3>
<p>There are any number of dotfiles that can be configured (for example, see the collection <a href="http://dotfiles.github.io/">here</a>), but this repo only provides customizations for the dotfiles <code>.profile</code>, <code>.bash_profile</code>, and <code>.bashrc</code> — located in the directory <code>runcom</code> (which stands for “run commands”) — which contain commands that are executed at login or during interactive non-login shell sessions. For details about the role of shell initialization dotfiles, see the <a href="#aside">end</a> of this post.</p>
<p>Instead of putting all our customizations in one long, unwieldy dotfile, it’s helpful to divide them into chunks, which we keep in the subfolder, <code>system</code>.</p>
<p>The files <code>env</code>, <code>functions</code>, <code>path</code> are sourced in a loop by the dotfiles in <code>runcom</code>. For example, <code>.bashrc</code> sources <code>functions</code> and <code>env</code>:</p>
<div class="highlight"><pre><span></span><span class="err">for DOTFILE in "$DOTFILES_DIR"/system/{functions,env}; do </span>
<span class="err">[ -f "$DOTFILE" ] && . "$DOTFILE" </span>
<span class="err">done </span>
</pre></div>
<p>Let’s take a look at the configurations in each of these files:<br>
</p>
<p><strong>env</strong> - enables autoenv for activating virtual environments</p>
<div class="highlight"><pre><span></span><span class="err">[ -f /opt/autoenv/activate.sh ] && . /opt/autoenv/activate.sh</span>
</pre></div>
<p><strong>functions</strong> - defines a custom function, <code>conda-env-file</code>, that generates an <code>environment.yml</code> that lists the dependencies for a conda virtual environment, and a one-line file <code>.env</code> (not to be confused with <code>env</code> in the previous bullet point) used by autoenv. (In addition to pip and python, I include the dependencies ipython, jedi, and flake8 needed by my Emacs python <span class="caps">IDE</span> setup.) </p>
<div class="highlight"><pre><span></span><span class="k">function</span> <span class="n">conda</span><span class="o">-</span><span class="n">env</span><span class="o">-</span><span class="n">file</span> <span class="err">{</span>
<span class="o">#</span> <span class="k">Create</span> <span class="n">conda</span> <span class="n">environment</span><span class="p">.</span><span class="n">yml</span> <span class="n">file</span> <span class="k">and</span> <span class="n">autoenv</span> <span class="n">activation</span> <span class="n">file</span>
<span class="o">#</span> <span class="n">based</span> <span class="k">on</span> <span class="n">directory</span> <span class="n">name</span><span class="p">.</span>
<span class="n">autoenvfilename</span><span class="o">=</span><span class="s1">'.env'</span>
<span class="n">condaenvfilename</span><span class="o">=</span><span class="s1">'environment.yml'</span>
<span class="n">foldername</span><span class="o">=</span><span class="err">$</span><span class="p">(</span><span class="n">basename</span> <span class="err">$</span><span class="n">PWD</span><span class="p">)</span>
<span class="k">if</span> <span class="p">[</span> <span class="o">!</span> <span class="o">-</span><span class="n">f</span> <span class="err">$</span><span class="n">condaenvfilename</span> <span class="p">];</span> <span class="k">then</span>
<span class="n">printf</span> <span class="ss">"name: $foldername\ndependencies:\n- pip\n- python\n- ipython\n- jedi\n- flake8"</span> <span class="o">></span> <span class="err">$</span><span class="n">condaenvfilename</span>
<span class="n">echo</span> <span class="ss">"$condaenvfilename created."</span>
<span class="k">else</span>
<span class="n">echo</span> <span class="ss">"$condaenvfilename already exists."</span>
<span class="n">fi</span>
<span class="k">if</span> <span class="p">[</span> <span class="o">!</span> <span class="o">-</span><span class="n">f</span> <span class="err">$</span><span class="n">autoenvfilename</span> <span class="p">];</span> <span class="k">then</span>
<span class="n">printf</span> <span class="ss">"source activate $foldername\n"</span> <span class="o">></span> <span class="err">$</span><span class="n">autoenvfilename</span>
<span class="n">echo</span> <span class="ss">"$autoenvfilename created."</span>
<span class="k">else</span>
<span class="n">echo</span> <span class="ss">"$autoenvfilename already exists."</span>
<span class="n">fi</span>
<span class="err">}</span>
</pre></div>
<p><strong>path</strong> - prepends the miniconda3 path to the <span class="caps">PATH</span> environment variable. For example, calls to python will default to the Miniconda3 version (3.5.1 in my case) rather than my system version (2.7).</p>
<div class="highlight"><pre><span></span><span class="err">export PATH="/home/$USER/miniconda3/bin:$PATH"</span>
</pre></div>
<p>Now, we’re done with configuring the dotfiles in this repo (apart from Emacs, which is treated separately). We just have to create symlinks in our home directory to the dotfiles in <code>runcom</code>, which is performed by the shell script, <code>install.sh</code>:</p>
<div class="highlight"><pre><span></span><span class="o">##</span> <span class="p">...</span>
<span class="n">ln</span> <span class="o">-</span><span class="n">sfv</span> <span class="ss">"$DOTFILES_DIR/runcom/.bash_profile"</span> <span class="o">~</span>
<span class="n">ln</span> <span class="o">-</span><span class="n">sfv</span> <span class="ss">"$DOTFILES_DIR/runcom/.profile"</span> <span class="o">~</span>
<span class="n">ln</span> <span class="o">-</span><span class="n">sfv</span> <span class="ss">"$DOTFILES_DIR/runcom/.bashrc"</span> <span class="o">~</span>
<span class="o">##</span> <span class="p">...</span>
</pre></div>
<h3 id="installation">Installation</h3>
<p>In addition to setting up dotfiles symlinks, <code>install.sh</code> automates the installation of all our data science tools via calls to each of the scripts in the <code>install</code> subfolder. Each script is named after the mechanism of installation (i.e. <code>apt-get</code>, <code>conda</code>, <code>git</code>) or purpose (to install Miniconda and Emacs).</p>
<ul>
<li><strong>apt-get.sh</strong> - installs a handful of programs using the system package manager, including <code>build-essentials</code>, which is needed to compile programs from source. Also enables source-code repositories (not enabled by default in Linux Mint 17), to be used for compiling emacs from source.</li>
<li><strong>install-emacs.sh</strong> - build Emacs 24.4 from source, which is needed for compatibility with the Magit plug-in (git for Emacs). At the time of writing, only Emacs 24.3 was available on the system repo.</li>
<li><strong>install-miniconda.sh</strong> - <a href="http://conda.pydata.org/docs/">miniconda</a> includes just conda, conda-build, and python. I prefer this lightweight version to the Anaconda version, which comes with more than 150 scientific packages by default. <em>A note from the Miniconda downloads page: “There are two variants of the installer: Miniconda is Python 2 based and Miniconda3 is Python 3 based… the choice of which Miniconda is installed only affects the root environment. Regardless of which version of Miniconda you install, you can still install both Python 2.x and Python 3.x environments. The other difference is that the Python 3 version of Miniconda will default to Python 3 when creating new environments and building packages.” (I chose Miniconda3.)</em></li>
<li><strong>conda.sh</strong> - uses conda to install popular scientific Python packages, R and some popular R packages, and packages for <span class="caps">IDE</span> support in Emacs.</li>
<li><strong>git.sh</strong> - installs <a href="https://github.com/kennethreitz/autoenv">autoenv</a> for working with virtual environment directories. Also clones the configurations from my <a href="https://github.com/frangipane/emacs">Emacs repo</a>.</li>
</ul>
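<p>The dispatch logic in <code>install.sh</code> can be sketched roughly as follows. This is a hypothetical illustration, not the repo's actual script (the real file may call each installer explicitly and in a specific order); the function name <code>run_installers</code> is ours:</p>

```shell
# Hypothetical sketch of how install.sh can dispatch to the scripts in
# the install/ subfolder: run every *.sh it finds there, in sorted order.
run_installers() {
  local dir="$1"
  for script in "$dir"/*.sh; do
    [ -r "$script" ] || continue   # skip if the glob matched nothing
    echo "==> $script"
    bash "$script" || return 1     # stop on the first failing installer
  done
}
```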
<hr>
<h3 id="conclusion">Conclusion</h3>
<p>The <a href="https://github.com/EFavDB/dotfiles">dotfiles repo</a> discussed in this post will remain in this minimal state on GitHub so that it can be easily parsed and built upon. It’s the most straightforward to adopt if you are on a similar system (Linux Mint or Ubuntu 14.04), as I haven’t put in checks for <span class="caps">OSX</span>. If you don’t like Emacs, feel free to comment out the relevant lines in <code>install.sh</code> and <code>install/git.sh</code>, and replace with your editor of choice.</p>
<p>Also take a look at other collections of <a href="https://github.com/webpro/awesome-dotfiles">awesome dotfiles</a> for nuggets (like the <code>extract()</code> function) to co-opt. And enjoy the peace of mind that comes with having dotfiles insurance!</p>
<hr>
<h3 id="notes-on-shell-initialization-dotfiles"><em>Notes on shell initialization dotfiles</em></h3>
<p>The handling of the dotfiles .profile, .bash_profile, and .bashrc is frequently a source of <a href="http://superuser.com/questions/183870/difference-between-bashrc-and-bash-profile">confusion</a> that we’ll try to clear up here.</p>
<p>For example, .profile and .bash_profile are both recommended for setting environment variables, so what’s the point of having both?</p>
<p><strong>.profile</strong><br>
.profile is loaded upon login to a Unix system (for most distributions) and is where you should put customizations that apply to your whole session, e.g. environment variable assignments like <code>PATH</code> that are not specifically related to bash. .profile holds anything that should be available (1) to graphical applications — like a program launched from a <span class="caps">GUI</span> by clicking on an icon or menu — or (2) to <code>sh</code>, which is run by graphical <a href="https://wiki.archlinux.org/index.php/display_manager">display managers</a> like <span class="caps">GDM</span>/LightDM/<span class="caps">MDM</span> when your computer boots up in graphics mode (the most common scenario these days). Note that even though the default login shell in Ubuntu is bash, the default system shell used during the bootup process is <a href="https://wiki.ubuntu.com/DashAsBinSh">dash, not bash</a> (<code>readlink -f /bin/sh</code> outputs <code>/bin/dash</code>).</p>
<p>Let’s give a concrete example of case (1): the miniconda installer provides a default option to add the miniconda binaries to the search path in .bashrc: <code>export PATH="/home/$USER/miniconda3/bin:$PATH"</code>. Assuming you’ve used <code>conda</code> (not <code>apt-get</code>) to install python scientific computing libraries and have set the path in .bashrc, if Emacs is launched from an icon on the desktop, then Emacs plugins that depend on those libraries (e.g. <code>ein</code>, a plugin that integrates IPython with Emacs) will throw an error, since the graphical invocation only loads .profile and the miniconda binaries would therefore not be in the search path. (On the other hand, there would be no problem launching Emacs from the terminal via <code>$ emacs</code>.) For this reason, it’s preferable to add the miniconda path in .profile instead of .bashrc.</p>
<p>For changes to .profile to take effect, you have to log out entirely and then log back in.</p>
<p><strong>.bash_profile</strong><br>
Like .profile, .bash_profile should contain environment variable definitions. I haven’t yet encountered a situation where a configuration can be set in .bash_profile that can’t be set in .profile or .bashrc.</p>
<p>Therefore, my .bash_profile just loads .profile and .bashrc. Some choose to bypass .bash_profile entirely and only have .profile (which bash reads if .bash_profile or .bash_login don’t exist) and .bashrc.</p>
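<p>As a concrete sketch, a minimal .bash_profile of this kind can look like the fragment below. This is an illustration of the idea, not the repo's exact file:</p>

```shell
# ~/.bash_profile -- defer everything to .profile and .bashrc
# (illustrative sketch; $PS1 is only set in interactive shells, so the
# guard keeps .bashrc from loading in non-interactive contexts)
[ -r ~/.profile ] && . ~/.profile
[ -n "$PS1" ] && [ -r ~/.bashrc ] && . ~/.bashrc
```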
<p><strong>.bashrc</strong><br>
Definitions of aliases, functions, and other settings you’d want in an interactive command line belong in .bashrc, which is sourced by interactive, non-login shells.</p>
<p><strong>login, non-login, interactive, and non-interactive shells</strong></p>
<ul>
<li>To check if you’re in a login shell, type on the command line <code>echo $0</code>. If the output is <code>-bash</code>, then you’re in a login shell. If the output is <code>bash</code>, then it’s not a login shell (see <code>man bash</code>).</li>
<li>Usually, a shell started from a new terminal in a <span class="caps">GUI</span> will be an interactive, non-login shell. The notable exception is <span class="caps">OSX</span>, whose terminal defaults to starting login shells. Thus, an <span class="caps">OSX</span> user may blithely sweep customizations that would ordinarily be placed in .bashrc — like aliases and functions — into .bash_profile and not bother with creating a .bashrc at all. However, those settings would not be properly initialized if the terminal default is changed to non-login shells.</li>
<li>If you ssh in or login on a text console, then you get an interactive, login shell.</li>
<li>More examples in <a href="http://unix.stackexchange.com/questions/38175/difference-between-login-shell-and-non-login-shell/46856#46856">this StackExchange thread</a>.</li>
</ul>
<p>This discussion might seem pedantic since you can often get away with a less careful setup. In my experience, though, what can go wrong will probably go wrong, so best to be proactive.</p>
<h2 id="independent-component-analysis">Independent component analysis</h2>
<p><em>2016-02-14, Jonathan Landy</em></p>
<p>Two microphones are placed in a room where two conversations are taking place simultaneously. Given these two recordings, can one “remix” them in some prescribed way to isolate the individual conversations? Yes! In this post, we review one simple approach to solving this type of problem, Independent Component Analysis (<span class="caps">ICA</span>). We share an ipython document implementing <span class="caps">ICA</span> and link to a youtube video illustrating its application to audio de-mixing.</p>
<h3 id="introduction">Introduction</h3>
<p>To formalize the problem posed in the abstract, let two desired conversation signals be represented by <span class="math">\(c_1(t)\)</span> and <span class="math">\(c_2(t)\)</span>, and two mixed microphone recordings of these by <span class="math">\(m_1(t)\)</span> and <span class="math">\(m_2(t)\)</span>. We’ll assume that the latter are both linear combinations of the former, with
</p>
<div class="math">\begin{align} \label{mean}
m_1(t) &= a_1 c_1(t) + a_2 c_2(t) \\
m_2(t) &= a_3 c_1(t) + a_4 c_2(t). \label{1} \tag{1}
\end{align}</div>
<p>
Here, we stress that the <span class="math">\(a_i\)</span> coefficients in (\ref{1}) are hidden from us: We only have access to the <span class="math">\(m_i\)</span>. Hypothetical illustrations are given in the figure below. Given only these mixed signals, we’d like to recover the underlying <span class="math">\(c_i\)</span> used to construct them (spoiler: a sine wave and a saw-tooth function were used for this figure).</p>
<p><a href="https://efavdb.com/wp-content/uploads/2016/02/mixed2.jpg"><img alt="mixed" src="https://efavdb.com/wp-content/uploads/2016/02/mixed2.jpg"></a></p>
<p>Amazingly, it turns out that with the introduction of a modest assumption, a simple solution to our problem can be obtained: We need only assume that the desired <span class="math">\(c_i\)</span> are mutually independent<span class="math">\(^1\)</span>. This assumption is helpful because it turns out that when two independent signals are added together, the resulting mixture is always “more Gaussian” than either of the individual, independent signals (a la the central limit theorem). Seeking linear combinations of the available <span class="math">\(m_i\)</span> that locally extremize their non-Gaussian character therefore provides a way to identify the pure, unmixed signals. This approach to solving the problem is called “Independent Component Analysis”, or <span class="caps">ICA</span>.</p>
<p>Here, we demonstrate the principle of <span class="caps">ICA</span> through consideration of the audio de-mixing problem. This is a really impressive application. However, one should strive to remember that the algorithm is not a one-trick-pony. <span class="caps">ICA</span> is an unsupervised machine learning algorithm of general applicability — similar in nature, and complementary to, the more familiar <a href="http://efavdb.github.io/principal-component-analysis"><span class="caps">PCA</span></a> algorithm. Whereas in <span class="caps">PCA</span> we seek the feature-space directions that maximize captured variance, in <span class="caps">ICA</span> we seek those directions that maximize the “interestingness” of the distribution — i.e., the non-Gaussian character of the resulting projections. It can be fruitfully applied in many contexts<span class="math">\(^2\)</span>.</p>
<p>We turn now to the problem of audio de-mixing via <span class="caps">ICA</span>.</p>
<h3 id="audio-de-mixing">Audio de-mixing</h3>
<p>In this post, we use the kurtosis of a signal to quantify its degree of “non-Gaussianess”. For a given signal <span class="math">\(x(t)\)</span>, this is defined as
</p>
<div class="math">$$
\kappa(x) \equiv \left \langle \left (x- \langle x \rangle \right)^4 \right \rangle - 3 \left \langle \left (x- \langle x \rangle \right)^2 \right \rangle^2, \label{2} \tag{2}
$$</div>
<p>
where brackets represent an average over time (or index). It turns out that the kurtosis is always zero for a Gaussian-distributed signal, so (\ref{2}) is a natural choice of score function for measuring deviation away from Gaussian behavior<span class="math">\(^3\)</span>. Essentially, it’s a measure of how flat a distribution is — with numbers greater (smaller) than 0 corresponding to distributions that are more (less) flat than a Gaussian.</p>
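<p>As a quick numerical sanity check (our own illustration, with a helper function <code>kurtosis</code> that we name here for convenience), the kurtosis of (\ref{2}) is indeed near zero for Gaussian samples and negative for a “flatter” uniform signal:</p>

```python
import numpy as np

def kurtosis(x):
    """Kurtosis per Eq. (2): fourth central moment minus 3 * variance^2."""
    x = x - np.mean(x)
    return np.mean(x ** 4) - 3 * np.mean(x ** 2) ** 2

rng = np.random.default_rng(0)
k_gauss = kurtosis(rng.normal(size=200_000))       # close to 0
k_uniform = kurtosis(rng.uniform(-1, 1, 200_000))  # close to 1/5 - 3/9 = -0.133
```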
<p>With (\ref{2}) chosen as our score function, we can now jump right into applying <span class="caps">ICA</span>. The code snippet below considers all possible mixtures of two mixed signals <span class="math">\(m_1\)</span> and <span class="math">\(m_2\)</span>, obtains the resulting signal kurtosis values, and plots the result.</p>
<div class="highlight"><pre>
import numpy as np
import matplotlib.pyplot as plt

def kurtosis_of_mixture(c1):
    # weight on m2 fixed so that c1**2 + c2**2 = 1
    c2 = np.sqrt(1 - c1 ** 2)
    s = c1 * m1 + c2 * m2               # remixed signal
    s = (s - np.mean(s)) / np.std(s)    # center and normalize to unit variance
    return np.mean(s ** 4) - 3          # kurtosis, Eq. (2), for a unit-variance signal

c_array = np.arange(-1, 1, 0.001)
k_array = [kurtosis_of_mixture(c) for c in c_array]
plt.plot(c_array, k_array)
</pre></div>
<p><a href="https://efavdb.com/wp-content/uploads/2016/02/k3.jpg"><img alt="k" src="https://efavdb.com/wp-content/uploads/2016/02/k3.jpg"></a></p>
<p>Inside <code>kurtosis_of_mixture</code>, we define the “remixed” signal <span class="math">\(s\)</span>, which is a linear combination of the two mixed signals <span class="math">\(m_1\)</span> and <span class="math">\(m_2\)</span>. Note that we normalize the signal so that it always has variance <span class="math">\(1\)</span> — this simply eliminates an arbitrary scale factor from the analysis. Similarly, we specify <span class="math">\(c_2\)</span> as a function of <span class="math">\(c_1\)</span>, requiring the sum of their squared values to equal one — this fixes another arbitrary scale factor.</p>
<p>When we applied the code above to the two signals shown in the introduction, we obtained the top plot at right. This shows the kurtosis of <span class="math">\(s\)</span> as a function of <span class="math">\(c_1\)</span>, the weight applied to signal <span class="math">\(m_1\)</span>. Notice that there are two internal extrema in this plot: a peak near <span class="math">\(-0.9\)</span> and a local minimum near <span class="math">\(-0.7\)</span>. These are the two <span class="math">\(c_1\)</span> weight choices that <span class="caps">ICA</span> suggests may relate to the pure, underlying signals we seek. To plot each of these signals, we used code similar to the following (the code shown is just for the maximum)</p>
<div class="highlight"><pre><span></span><span class="n">index1</span> <span class="o">=</span> <span class="n">k_array</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="nb">max</span><span class="p">(</span><span class="n">k_array</span><span class="p">))</span>
<span class="n">c1</span> <span class="o">=</span> <span class="n">c_array</span><span class="p">[</span><span class="n">index1</span><span class="p">]</span>
<span class="n">c2</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">c1</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">int16</span><span class="p">(</span><span class="n">item</span><span class="p">)</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">c1</span> <span class="o">*</span> <span class="n">x1</span> <span class="o">+</span> <span class="n">c2</span> <span class="o">*</span> <span class="n">x2</span><span class="p">])</span>
<span class="n">plot</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
</pre></div>
<p>This code finds the index where the kurtosis was maximized, generates the corresponding remix, and plots the result. Applying this, the bottom figure at right popped out. It worked! — and with just a few lines of code, which makes it seem all the more amazing. In summary, we looked for linear combinations of the <span class="math">\(m_i\)</span> shown in the introduction that resulted in a stationary kurtosis — plotting these combinations, we found that these were precisely the pure signals we sought<span class="math">\(^4\)</span>.</p>
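<p>For readers who want to reproduce the whole procedure end to end, here is a self-contained sketch. The mixing coefficients, signal frequencies, and sample window below are our own illustrative choices, not those used for the original figure; the incommensurate periods keep the sine and saw-tooth approximately independent over the window:</p>

```python
import numpy as np

t = np.arange(0, 200, 0.01)
c1_pure = np.sin(2 * np.pi * np.sqrt(2) * t)   # pure signal 1: sine
c2_pure = 2 * (t % 1) - 1                      # pure signal 2: saw-tooth
m1 = 0.6 * c1_pure + 0.4 * c2_pure             # "hidden" mixtures, a la Eq. (1)
m2 = 0.45 * c1_pure + 0.55 * c2_pure

def kurtosis(s):
    s = (s - np.mean(s)) / np.std(s)
    return np.mean(s ** 4) - 3

weights = np.arange(-1, 1, 0.001)
k = np.array([kurtosis(w * m1 + np.sqrt(1 - w ** 2) * m2) for w in weights])

# Both pure signals are sub-Gaussian here (kurtosis -1.5 for the sine,
# -1.2 for the saw-tooth), so the most negative kurtosis should occur
# where the remix isolates the sine; for these mixing coefficients that
# happens near w = -0.809.
w_min = weights[np.argmin(k)]
```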
<p>A second application to actual audio clips is demoed in our youtube video linked below. The full ipython file utilized in the video can be downloaded on our github page, <a href="https://github.com/EFavDB/ICA">here</a><span class="math">\(^5\)</span>.</p>
<h3 id="conclusion">Conclusion</h3>
<p>We hope this little post has you convinced that <span class="caps">ICA</span> is a powerful, yet straightforward algorithm<span class="math">\(^6\)</span>. Although we’ve only discussed one application here, many others can be found online: the analysis of financial data, a proposal to use <span class="caps">ICA</span> to isolate a desired wifi signal from a crowded frequency band, the analysis of brain waves (see the article mentioned in reference 2), and so on. In general, the potential application set of <span class="caps">ICA</span> may be as large as that for <span class="caps">PCA</span>. Next time you need to do some unsupervised learning or data compression, definitely keep it in mind.</p>
<h3 id="footnotes-and-references">Footnotes and references</h3>
<p>[1] Formally, saying that two signals are independent means that the evolution of one conveys no information about that of the other.</p>
<p>[2] For those interested in further reading on the theory and applications of <span class="caps">ICA</span>, we can recommend the review article by Hyvärinen and Oja — “Independent Component Analysis: Algorithms and Applications” — available for free online.</p>
<p>[3] Other metrics can also be used in the application of <span class="caps">ICA</span>. The kurtosis is easy to evaluate and is also well-motivated because of the fact that it is zero for any Gaussian. However, there are non-Gaussian distributions that also have zero kurtosis. Further, as seen in our linked youtube video, peaks in the kurtosis plot need not always correspond to the pure signals. A much more rigorous approach is to use the mutual information of the signals as your score. This function is zero if and only if you’ve found a projection that results in a fully independent set of signals. Thus, it will always work. The problem with this choice is that it is much harder to evaluate — thus, simpler scores are often used in practice, even though they aren’t necessarily rigorously correct. The article mentioned in footnote 2 gives a good review of some other popular score function choices.</p>
<p>[4] In general, symmetry arguments imply that the pure signals will correspond to local extrema in the kurtosis landscape. This works because the kurtosis of <span class="math">\(x_1 + a x_2\)</span> is the same as that of <span class="math">\(x_1 - a x_2\)</span>, when <span class="math">\(x_1\)</span> and <span class="math">\(x_2\)</span> are independent. To complete the argument, you need to consider coefficient expansions in the mixed space. The fact that the pure signals can sometimes sit at kurtosis local minima doesn’t really jive with the intuitive argument about mixtures being more Gaussian — but that was a vague statement anyways. A rigorous, alternative introduction could be made via mutual information, as mentioned in the previous footnote.</p>
<p>[5] To run the script, you’ll need ipython installed, as well as the python packages: scipy, numpy, matplotlib, and pyaudio — see instructions for the latter <a href="https://people.csail.mit.edu/hubert/pyaudio/">here</a>. The pip install command for pyaudio didn’t work for me on my mac, but the following line did:
<code>pip install --global-option='build_ext' --global-option='-I/usr/local/include' --global-option='-L/usr/local/lib' pyaudio</code></p>
<p>[6] Of course, things get a bit more complicated when you have a large number of signals. However, fast, simple algorithms have been found to carry this out even in high dimensions. See the reference in footnote 2 for discussion.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
<h2 id="maximum-likelihood-asymptotics">Maximum-likelihood asymptotics</h2>
<p><em>2015-12-30, Jonathan Landy</em></p>
<p>In this post, we review two facts about maximum-likelihood estimators: 1) They are consistent, meaning that they converge to the correct values given a large number of samples, <span class="math">\(N\)</span>, and 2) They satisfy the <a href="http://efavdb.github.io/multivariate-cramer-rao-bound">Cramer-Rao</a> lower bound for unbiased parameter estimates in this same limit — that is, they have the lowest possible variance of any unbiased estimator, in the <span class="math">\(N\gg 1\)</span> limit.</p>
<h3 id="introduction">Introduction</h3>
<p>We begin with a simple example maximum-likelihood inference problem: Suppose one has obtained <span class="math">\(N\)</span> independent samples <span class="math">\(\{x_1, x_2, \ldots, x_N\}\)</span> from a Gaussian distribution of unknown mean <span class="math">\(\mu\)</span> and variance <span class="math">\(\sigma^2\)</span>. In order to obtain a maximum-likelihood estimate for these parameters, one asks which <span class="math">\(\hat{\mu}\)</span> and <span class="math">\(\hat{\sigma}^2\)</span> would be most likely to generate the samples observed. To find these, we first write down the probability of observing the samples, given our model. This is simply
</p>
<div class="math">$$
P(\{x_1, x_2, \ldots, x_N\} \vert \mu, \sigma^2) =\\ \exp\left [ \sum_{i=1}^N \left (-\frac{1}{2} \log (2 \pi \sigma^2) -\frac{1}{2 \sigma^2} (x_i - \mu)^2\right ) \right ]. \tag{1} \label{1}
$$</div>
<p>
To obtain the maximum-likelihood estimates, we maximize (\ref{1}): Setting its derivatives with respect to <span class="math">\(\mu\)</span> and <span class="math">\(\sigma^2\)</span> to zero and solving gives
</p>
<div class="math">\begin{align}\label{mean}
\hat{\mu} &= \frac{1}{N} \sum_i x_i \tag{2} \\
\hat{\sigma}^2 &= \frac{1}{N} \sum_i (x_i - \hat{\mu})^2. \tag{3} \label{varhat}
\end{align}</div>
<p>
These are mean and variance values that would be most likely to generate our observation set <span class="math">\(\{x_i\}\)</span>. Our solutions show that they are both functions of the random observation set. Because of this, <span class="math">\(\hat{\mu}\)</span> and <span class="math">\(\hat{\sigma}^2\)</span> are themselves random variables, changing with each sample set that happens to be observed. Their distributions can be characterized by their mean values, variances, etc.</p>
<p>The average squared error of a parameter estimator is determined entirely by its bias and variance — see eq (2) of <a href="http://efavdb.github.io/bayesian-linear-regression">prior post</a>. Now, one can show that the <span class="math">\(\hat{\mu}\)</span> estimate of (\ref{mean}) is unbiased, but this is not the case for the variance estimator (\ref{varhat}) — one should (famously) divide by <span class="math">\(N-1\)</span> instead of <span class="math">\(N\)</span> here to obtain an unbiased estimator<span class="math">\(^1\)</span>. This shows that maximum-likelihood estimators need not be unbiased. Why then are they so popular? One reason is that these estimators are guaranteed to be unbiased when <span class="math">\(N\)</span>, the sample size, is large. Further, in this same limit, these estimators achieve the minimum possible variance for any unbiased parameter estimate — as set by the fundamental <a href="http://efavdb.github.io/multivariate-cramer-rao-bound">Cramer-Rao</a> bound. The purpose of this post is to review simple proofs of these latter two facts about maximum-likelihood estimators<span class="math">\(^2\)</span>.</p>
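<p>A short simulation makes the bias concrete (the distribution parameters and trial count below are arbitrary choices of ours): averaging the estimator of (\ref{varhat}) over many synthetic data sets of size <span class="math">\(N=10\)</span> drawn from a Gaussian with <span class="math">\(\sigma^2 = 4\)</span> should give roughly <span class="math">\(\frac{N-1}{N}\sigma^2 = 3.6\)</span>, not <span class="math">\(4\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma2_true, N, trials = 1.0, 4.0, 10, 20_000

x = rng.normal(mu_true, np.sqrt(sigma2_true), size=(trials, N))
mu_hat = x.mean(axis=1, keepdims=True)           # Eq. (2), one per data set
sigma2_hat = np.mean((x - mu_hat) ** 2, axis=1)  # Eq. (3), one per data set

# biased at finite N: the average is close to (N-1)/N * sigma2_true = 3.6
mean_sigma2_hat = float(sigma2_hat.mean())
```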
<h3 id="consistency">Consistency</h3>
<p>Let <span class="math">\(P(x \vert \theta^*)\)</span> be some distribution characterized by a parameter <span class="math">\(\theta^*\)</span> that is unknown. We will show that the maximum-likelihood estimator converges to <span class="math">\(\theta^*\)</span> when <span class="math">\(N\)</span> is large: As in (\ref{1}), the maximum-likelihood solution is that <span class="math">\(\theta\)</span> that maximizes
</p>
<div class="math">$$\tag{4} \label{4}
J \equiv \frac{1}{N}\sum_{i=1}^N \log P(x_i \vert \theta),
$$</div>
<p>
where the <span class="math">\(\{x_i\}\)</span> are the independent samples taken from <span class="math">\(P(x \vert \theta^*)\)</span>. By the law of large numbers, when <span class="math">\(N\)</span> is large, this average over the samples converges to its population mean. In other words,
</p>
<div class="math">$$\tag{5}
\lim_{N \to \infty}J \rightarrow \int_x P(x \vert \theta^*) \log P(x \vert \theta) dx.
$$</div>
<p>
We will show that <span class="math">\(\theta^*\)</span> is the <span class="math">\(\theta\)</span> value that maximizes the above. We can do this directly, writing
</p>
<div class="math">$$
\begin{align}
J(\theta) - J(\theta^*) & = \int_x P(x \vert \theta^*) \log \left ( \frac{P(x \vert \theta) }{P(x \vert \theta^*)}\right) \\
& \leq \int_x P(x \vert \theta^*) \left ( \frac{P(x \vert \theta) }{P(x \vert \theta^*)} - 1 \right) \\
& = \int_x P(x \vert \theta) - P(x \vert \theta^*) = 1 - 1 = 0. \tag{6} \label{6}
\end{align}
$$</div>
<p>
Here, we have used <span class="math">\(\log t \leq t-1\)</span> in the second line. Rearranging the above shows that <span class="math">\(J(\theta^*) \geq J(\theta)\)</span> for all <span class="math">\(\theta\)</span> — when <span class="math">\(N \gg 1\)</span>, meaning that <span class="math">\(J\)</span> is maximized at <span class="math">\(\theta^*\)</span>. That is, the maximum-likelihood estimator <span class="math">\(\hat{\theta} \to \theta^*\)</span> in this limit<span class="math">\(^3\)</span>.</p>
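<p>A quick numerical illustration of consistency (our own toy check, using a Gaussian model with known variance, where the maximum-likelihood estimate of the mean is just the sample mean): the error <span class="math">\(\vert \hat{\theta} - \theta^* \vert\)</span> shrinks as <span class="math">\(N\)</span> grows, roughly like <span class="math">\(1/\sqrt{N}\)</span>.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 2.0  # arbitrary "true" mean for this illustration

# |theta_hat - theta*| for growing sample sizes
errors = {N: abs(np.mean(rng.normal(theta_star, 1.0, size=N)) - theta_star)
          for N in (10, 1_000, 100_000)}
```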
<h3 id="optimal-variance">Optimal variance</h3>
<p>To derive the variance of a general maximum-likelihood estimator, we will see how its average value changes upon introduction of a small Bayesian prior, <span class="math">\(P(\theta) \sim \exp(\Lambda \theta)\)</span>. The trick will be to evaluate the change in two separate ways — this takes a few lines, but is quite straightforward. In the first approach, we do a direct maximization: The quantity to be maximized is now
</p>
<div class="math">$$ \label{7}
J = \sum_{i=1}^N \log P(x_i \vert \theta) + \Lambda \theta. \tag{7}
$$</div>
<p>
Because we take <span class="math">\(\Lambda\)</span> small, we can use a Taylor expansion to find the new solution, writing
</p>
<div class="math">$$ \label{8}
\hat{\theta} = \theta^* + \theta_1 \Lambda + O(\Lambda^2). \tag{8}
$$</div>
<p>
Setting the derivative of (\ref{7}) to zero, with <span class="math">\(\theta\)</span> given by its value in (\ref{8}), we obtain
</p>
<div class="math">$$
\sum_{i=1}^N \partial_{\theta} \left . \log P(x_i \vert \theta) \right \vert_{\theta^*} + \\ \sum_{i=1}^N \partial_{\theta}^2 \left . \log P(x_i \vert \theta) \right \vert_{\theta^*} \times \theta_1 \Lambda + \Lambda + O(\Lambda^2) = 0. \tag{9} \label{9}
$$</div>
<p>
The first term here goes to zero at large <span class="math">\(N\)</span>, as above. Setting the terms at <span class="math">\(O(\Lambda^1)\)</span> to zero gives
</p>
<div class="math">$$
\theta_1 = - \frac{1}{ \sum_{i=1}^N \partial_{\theta}^2 \left . \log P(x_i \vert \theta) \right \vert_{\theta^*} }. \tag{10} \label{10}
$$</div>
<p>
Plugging this back into (\ref{8}) gives the first order correction to <span class="math">\(\hat{\theta}\)</span> due to the perturbation. Next, as an alternative approach, we evaluate the change in <span class="math">\(\theta\)</span> by maximizing the <span class="math">\(P(\theta)\)</span> distribution, expanding about its unperturbed global maximum, <span class="math">\(\theta^*\)</span>: We write, formally,
</p>
<div class="math">$$\tag{11} \label{11}
P(\theta) = e^{ - a_0 - a_2 (\theta - \theta^*)^2 - a_3 (\theta - \theta^*)^3 + \ldots + \Lambda \theta}.
$$</div>
<p>
Differentiating to maximize (\ref{11}), and again assuming a solution of form (\ref{8}), we obtain
</p>
<div class="math">$$\label{12} \tag{12}
-2 a_2 \times \theta_1 \Lambda + \Lambda + O(\Lambda^2) = 0 \ \ \to \ \ \theta_1 = \frac{1}{2 a_2}.
$$</div>
<p>
We now require consistency between our two approaches, equating (\ref{10}) and (\ref{12}). This gives an expression for <span class="math">\(a_2\)</span>. Plugging this back into (\ref{11}) then gives (for the unperturbed distribution)
</p>
<div class="math">$$\tag{13} \label{13}
P(\theta) = \mathcal{N} \exp \left [ N \frac{ \langle \partial_{\theta}^2 \left . \log P(x, \theta) \right \vert_{\theta^*} \rangle }{2} (\theta - \theta^*)^2 + \ldots \right].
$$</div>
<p>
Using this Gaussian approximation<span class="math">\(^4\)</span>, we can now read off the large <span class="math">\(N\)</span> variance of <span class="math">\(\hat{\theta}\)</span> as
</p>
<div class="math">$$\tag{14} \label{14}
var(\hat{\theta}) = - \frac{1}{N} \times \frac{1}{\langle \partial_{\theta}^2 \left . \log P(x, \theta) \right \vert_{\theta^*} \rangle }.
$$</div>
<p>
This is the lowest possible variance for any unbiased estimator, as set by the Cramér-Rao bound. The proof shows that maximum-likelihood estimators always saturate this bound in the large <span class="math">\(N\)</span> limit — a remarkable result. We discuss the intuitive meaning of the Cramér-Rao bound in a <a href="http://efavdb.github.io/multivariate-cramer-rao-bound">prior post</a>.</p>
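<p>As a quick sanity check (our own illustration, not part of the derivation above), we can verify (\ref{14}) by simulation for an exponential model, where the Fisher information is known in closed form:</p>

```python
import numpy as np

# Our own check of (14): for the exponential model
# P(x|theta) = theta * exp(-theta * x), the MLE is theta_hat = 1/mean(x)
# and d^2/dtheta^2 log P = -1/theta^2, so (14) predicts
# var(theta_hat) ~ theta^2 / N at large N.
rng = np.random.default_rng(0)
theta, N, trials = 2.0, 2000, 2000

samples = rng.exponential(scale=1.0 / theta, size=(trials, N))
theta_hat = 1.0 / samples.mean(axis=1)   # one MLE per trial

empirical_var = theta_hat.var()
predicted_var = theta**2 / N             # -1/(N <d^2 log P>) with <.> = -1/theta^2

print(empirical_var, predicted_var)      # agree to within a few percent
```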
<h3 id="footnotes">Footnotes</h3>
<p>[1] To see that (\ref{varhat}) is biased, we just need to evaluate the average of <span class="math">\(\sum_i (x_i - \hat{\mu})^2\)</span>. This is</p>
<div class="math">$$
\overline{\sum_i x_i^2 - 2 \sum_{i,j} \frac{x_i x_j}{N} + \sum_{i,j,k} \frac{x_j x_k}{N^2}} = N \overline{x^2} - (N-1) \overline{x}^2 - \overline{x^2} \\
= (N-1) \left ( \overline{x^2} - \overline{x}^2 \right) \equiv (N-1) \sigma^2.
$$</div>
<p>
Dividing through by <span class="math">\(N\)</span>, we see that <span class="math">\(\overline{\hat{\sigma}^2} = \left(\frac{N-1}{N}\right)\sigma^2\)</span>: the estimator is biased for any finite <span class="math">\(N\)</span>, though the bias goes to zero at large <span class="math">\(N\)</span>.</p>
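<p>A short simulation (our own illustration; the constants are arbitrary) confirms the <span class="math">\(\left(\frac{N-1}{N}\right)\)</span> factor:</p>

```python
import numpy as np

# Our own illustration of the (N-1)/N bias factor derived above.
rng = np.random.default_rng(1)
N, trials, sigma2 = 10, 200_000, 1.0

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
naive_var = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

# The naive estimator averages to (N-1)/N * sigma^2 = 0.9, not 1.0:
print(naive_var.mean())
```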
<p>[2] The consistency proof is taken from lecture notes by D. Panchenko, see <a href="http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lecture-notes/lecture3.pdf">here</a>. Professor Panchenko is quite famous for having proven the correctness of the Parisi ansatz in replica theory. Our variance proof is original — please let us know if you have seen it elsewhere. Note that it can also be easily extended to derive the covariance matrix of a set of maximum-likelihood estimators that are jointly distributed — we cover only the scalar case here, for simplicity.</p>
<p>[3] The proof here actually only shows that there is no <span class="math">\(\theta\)</span> that gives larger likelihood than <span class="math">\(\theta^*\)</span> in the large <span class="math">\(N\)</span> limit. However, for some problems, it is possible that more than one <span class="math">\(\theta\)</span> maximizes the likelihood. A trivial example is given by the case where the distribution is actually only a function of <span class="math">\((\theta - \theta_0)^2\)</span>. In this case, both values <span class="math">\(\theta_0 \pm (\theta^* - \theta_0)\)</span> will necessarily maximize the likelihood.</p>
<p>[4] It’s a simple matter to carry this analysis further, including the cubic and higher order terms in the expansion (\ref{11}). These lead to correction terms for (\ref{14}), smaller in magnitude than that given there. These terms become important when <span class="math">\(N\)</span> decreases in magnitude.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Principal component analysis2015-12-05T22:22:00-08:002015-12-05T22:22:00-08:00Jonathan Landytag:efavdb.com,2015-12-05:/principal-component-analysis<p>We review the two essentials of principal component analysis (“<span class="caps">PCA</span>”): 1) The principal components of a set of data points are the eigenvectors of the correlation matrix of these points in feature space. 2) Projecting the data onto the subspace spanned by the first <span class="math">\(k\)</span> of these — listed in descending …</p><p>We review the two essentials of principal component analysis (“<span class="caps">PCA</span>”): 1) The principal components of a set of data points are the eigenvectors of the correlation matrix of these points in feature space. 2) Projecting the data onto the subspace spanned by the first <span class="math">\(k\)</span> of these — listed in descending eigenvalue order — provides the best possible <span class="math">\(k\)</span>-dimensional approximation to the data, in the sense of captured variance.</p>
<h3 id="introduction">Introduction</h3>
<p>One way to introduce principal component analysis is to consider the problem of least-squares fits: Consider, for example, the figure shown below. To fit a line to this data, one might attempt to minimize the squared <span class="math">\(y\)</span> residuals (actual minus fit <span class="math">\(y\)</span> values). However, if the <span class="math">\(x\)</span> and <span class="math">\(y\)</span> values are considered to be on an equal footing, this <span class="math">\(y\)</span>-centric approach is not quite appropriate. A natural alternative is to attempt instead to find the line that minimizes the <em>total squared projection error</em>: If <span class="math">\((x_i, y_i)\)</span> is a data point, and <span class="math">\((\hat{x}_i, \hat{y}_i)\)</span> is the point closest to it on the regression line (aka, its “projection” onto the line), we attempt to minimize
</p>
<div class="math">$$\tag{1} \label{score}
J = \sum_i (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2.
$$</div>
<p><a href="https://efavdb.com/wp-content/uploads/2015/12/projection.png"><img alt="margin around decision boundary" src="https://efavdb.com/wp-content/uploads/2015/12/projection.png"></a></p>
<p>The summands here are illustrated in the figure: The dotted lines shown are the projection errors for each data point relative to the red line. The minimizer of (\ref{score}) is the line that minimizes the sum of the squares of these values.</p>
<p>Generalizing the above problem, one could ask which <span class="math">\(k\)</span>-dimensional hyperplane passes closest to a set of data points in <span class="math">\(N\)</span>-dimensions. Being able to identify the solution to this problem can be very helpful when <span class="math">\(N \gg 1\)</span>. The reason is that in high-dimensional, applied problems, many features are often highly-correlated. When this occurs, projection of the data onto a <span class="math">\(k\)</span>-dimensional subspace can often result in a great reduction in memory usage (one moves from needing to store <span class="math">\(N\)</span> values for each data point to <span class="math">\(k\)</span>) with minimal loss of information (if the points are all near the plane, replacing them by their projections causes little distortion). Projection onto subspaces can also be very helpful for visualization: For example, plots of <span class="math">\(N\)</span>-dimensional data projected onto a best two-dimensional subspace can allow one to get a feel for a dataset’s shape.</p>
<p>At first glance, the task of actually minimizing (\ref{score}) may appear daunting. However, it turns out this can be done easily using linear algebra. One need only carry out the following three steps:</p>
<ul>
<li>Preprocessing: If appropriate, shift features and normalize so that they all have mean <span class="math">\(\mu = 0\)</span> and variance <span class="math">\(\sigma^2 = 1\)</span>. The latter, scaling step is needed to account for differences in units, which may cause variations along one component to look artificially large or small relative to those along other components (eg, one raw component might be a measure in centimeters, and another in kilometers).</li>
<li>Compute the covariance matrix. Assuming there are <span class="math">\(m\)</span> data points, the <span class="math">\(i\)</span>, <span class="math">\(j\)</span> component of this matrix is given by:
<div class="math">$$\tag{2} \label{2} \Sigma_{ij}^2 = \frac{1}{m}\sum_{l=1}^m (f_{l,i} - \mu_i) (f_{l,j} - \mu_j)\\ = \langle x_i \vert \left (\frac{1}{m} \sum_{l=1}^m \vert \delta f_l \rangle \langle \delta f_l \vert \right) \vert x_j \rangle.$$</div>
Note that, at right, we are using bracket notation for vectors. We make further use of this below — see footnote [1] at bottom for review. We’ve also written <span class="math">\(\vert \delta f_l \rangle\)</span> for the vector <span class="math">\(\vert f_l \rangle - \sum_{i = 1}^n \mu_i \vert x_i \rangle\)</span> — the vector <span class="math">\(\vert f_l \rangle\)</span> with the dataset’s centroid subtracted out.</li>
<li>Project all feature vectors onto the <span class="math">\(k\)</span> eigenvectors <span class="math">\(\{\vert v_j \rangle\)</span>, <span class="math">\(j = 1 ,2 \ldots, k\}\)</span> of <span class="math">\(\Sigma^2\)</span> that have the largest eigenvalues <span class="math">\(\lambda_j\)</span>, writing
<div class="math">$$\tag{3} \label{3}
\vert \delta f_i \rangle \approx \sum_{j = 1}^k \langle v_j \vert \delta f_i \rangle \times \vert v_j\rangle.
$$</div>
The term <span class="math">\(\langle v_j \vert \delta f_i \rangle\)</span> above is the coefficient of the vector <span class="math">\(\vert \delta f_i \rangle\)</span> along the <span class="math">\(j\)</span>-th principal component. If we set <span class="math">\(k = N\)</span> above, (\ref{3}) becomes an identity. However, when <span class="math">\(k < N\)</span>, the expression represents an approximation only, with the vector <span class="math">\(\vert \delta f_i \rangle\)</span> approximated by its projection into the subspace spanned by the largest <span class="math">\(k\)</span> principal components.</li>
</ul>
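<p>The three steps above can be sketched in a few lines of numpy (a minimal illustration with synthetic data; the variable names are ours, not from the post):</p>

```python
import numpy as np

# Minimal numpy sketch of the three PCA steps (synthetic data; names ours).
rng = np.random.default_rng(2)
f = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # m=500 points, correlated features

# Step 1: preprocess -- zero mean, unit variance per feature.
f = (f - f.mean(axis=0)) / f.std(axis=0)

# Step 2: covariance matrix, Sigma^2_{ij} = (1/m) sum_l df_{l,i} df_{l,j}.
m = f.shape[0]
Sigma2 = f.T @ f / m

# Step 3: project onto the k eigenvectors with largest eigenvalues.
k = 2
vals, vecs = np.linalg.eigh(Sigma2)            # eigenvalues in ascending order
top = vecs[:, np.argsort(vals)[::-1][:k]]      # columns = top-k principal components
coeffs = f @ top                               # <v_j | delta f_i> coefficients
approx = coeffs @ top.T                        # rank-k approximation of each point

# The projected variance equals the sum of the top-k eigenvalues:
print(coeffs.shape, (coeffs**2).mean(axis=0).sum())
```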
<p>The steps above are all that are needed to carry out a <span class="caps">PCA</span> analysis/compression of any dataset. We show in the next section why this solution will indeed provide the <span class="math">\(k\)</span>-dimensional hyperplane resulting in minimal dataset projection error.</p>
<h3 id="mathematics-of-pca">Mathematics of <span class="caps">PCA</span></h3>
<p>To understand <span class="caps">PCA</span>, we proceed in three steps.</p>
<ol>
<li>Significance of a partial trace: Let <span class="math">\(\{\textbf{u}_j \}\)</span> be some arbitrary orthonormal basis set that spans our full <span class="math">\(N\)</span>-dimensional space, and consider the sum
<div class="math">\begin{align}\tag{4} \label{4}
\sum_{j = 1}^k \langle u_j \vert \Sigma^2 \vert u_j \rangle = \frac{1}{m} \sum_{i,j} \langle u_j \vert \delta f_i \rangle \langle \delta f_i \vert u_j \rangle\\ = \frac{1}{m} \sum_{i,j} \langle \delta f_i \vert u_j \rangle \langle u_j \vert \delta f_i \rangle\\ \equiv \frac{1}{m} \sum_{i} \langle \delta f_i \vert P \vert \delta f_i \rangle.
\end{align}</div>
To obtain the first equality here, we have used <span class="math">\(\Sigma^2 = \frac{1}{m} \sum_{i} \vert \delta f_i \rangle \langle \delta f_i \vert\)</span>, which follows from (\ref{2}). To obtain the last, we have written <span class="math">\(P\)</span> for the projection operator onto the space spanned by the first <span class="math">\(k\)</span> <span class="math">\(\{\textbf{u}_j \}\)</span>. Note that this last equality implies that the partial trace is equal to the average squared length of the projected feature vectors — that is, the variance of the projected data set.</li>
<li>Notice that the projection error is simply given by the total trace of <span class="math">\(\Sigma^2\)</span>, minus the partial trace above. Thus, minimization of the projection error is equivalent to maximization of the projected variance, (\ref{4}).</li>
<li>We now consider which basis maximizes (\ref{4}). To do that, we decompose the <span class="math">\(\{\textbf{u}_i \}\)</span> in terms of the eigenvectors <span class="math">\(\{\textbf{v}_j\}\)</span> of <span class="math">\(\Sigma^2\)</span>, writing
<div class="math">\begin{align} \tag{5} \label{5}
\vert u_i \rangle = \sum_j \vert v_j \rangle \langle v_j \vert u_i \rangle \equiv \sum_j u_{ij} \vert v_j \rangle.
\end{align}</div>
Here, we’ve inserted the identity in the <span class="math">\(\{v_j\}\)</span> basis, and written <span class="math">\( \langle v_j \vert u_i \rangle \equiv u_{ij}\)</span>. With these definitions, the partial trace becomes
<div class="math">\begin{align}\tag{6} \label{6}
\sum_{i=1}^k \langle u_i \vert \Sigma^2 \vert u_i \rangle = \sum_{i,j,l} u_{ij}u_{il} \langle v_j \vert \Sigma^2 \vert v_l \rangle \\= \sum_{i=1}^k\sum_{j} u_{ij}^2 \lambda_j.
\end{align}</div>
The last equality here follows from the fact that the <span class="math">\(\{\textbf{v}_i\}\)</span> are the eigenvectors of <span class="math">\(\Sigma^2\)</span> — we have also used the fact that they are orthonormal, which follows from the fact that <span class="math">\(\Sigma^2\)</span> is a real, symmetric matrix. The sum (\ref{6}) is proportional to a weighted average of the eigenvalues of <span class="math">\(\Sigma^2\)</span>. We have a total mass of <span class="math">\(k\)</span> to spread out amongst the <span class="math">\(N\)</span> eigenvalues. The maximum mass that can sit on any one eigenvalue is one. This follows since <span class="math">\(\sum_{i = 1}^k u_{ij}^2 \leq \sum_{i = 1}^N u_{ij}^2 =1\)</span>, the latter equality following from the fact that <span class="math">\( \sum_{i = 1}^N u_{ij}^2\)</span> is an expression for the squared length of <span class="math">\(\vert v_j\rangle\)</span> in the <span class="math">\(\{u_i\}\)</span> basis. Under these constraints, the maximum possible average one can get in (\ref{6}) occurs when all the mass sits on the largest <span class="math">\(k\)</span> eigenvalues, with each of these eigenvalues weighted with mass one. This condition occurs if and only if the first <span class="math">\(k\)</span> <span class="math">\(\{\textbf{u}_i\}\)</span> span the same space as that spanned by the first <span class="math">\(k\)</span> <span class="math">\(\{\textbf{v}_j\}\)</span> — those with the <span class="math">\(k\)</span> largest eigenvalues.</li>
</ol>
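<p>A quick numerical check of this claim (our own illustration): the variance captured by the top-<span class="math">\(k\)</span> eigenvectors should be at least that captured by any other orthonormal <span class="math">\(k\)</span>-frame:</p>

```python
import numpy as np

# Our own numerical check: the top-k eigenvectors of Sigma^2 capture at
# least as much variance as a random orthonormal k-frame.
rng = np.random.default_rng(3)
df = rng.normal(size=(1000, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])
df = df - df.mean(axis=0)
Sigma2 = df.T @ df / len(df)

k = 2
vals = np.linalg.eigvalsh(Sigma2)
best = np.sort(vals)[::-1][:k].sum()            # partial trace in the eigenbasis

q, _ = np.linalg.qr(rng.normal(size=(4, k)))    # random orthonormal k-frame
random_capture = np.trace(q.T @ Sigma2 @ q)     # partial trace in that frame

print(best >= random_capture)   # True
```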
<p>That’s it for the mathematics of <span class="caps">PCA</span>.</p>
<h3 id="footnotes">Footnotes</h3>
<p>[1] <em>Review of bracket notation</em>: <span class="math">\(\vert x \rangle\)</span> represents a regular vector, <span class="math">\(\langle x \vert\)</span> is its transpose, and <span class="math">\(\langle y \vert x \rangle\)</span> represents the dot product of <span class="math">\(x\)</span> and <span class="math">\(y\)</span>. So, for example, when the term in parentheses at the right side of (\ref{2}) acts on the vector <span class="math">\(\vert x_j \rangle\)</span> to its right, you get <span class="math">\( \frac{1}{m} \sum_{k=1}^m \vert \delta f_k \rangle \left (\langle \delta f_k \vert x_j \rangle\right).\)</span> Here, <span class="math">\( \left (\langle \delta f_k \vert x_j \rangle\right)\)</span> is a dot product, a scalar, and <span class="math">\(\vert \delta f_k \rangle\)</span> is a vector. The result is thus a weighted sum of vectors. In other words, the bracketed term (\ref{2}) acts on a vector and returns a linear combination of other vectors. That means it is a matrix, as is any other object of form <span class="math">\(\sum_i \vert a_i \rangle \langle b_i \vert\)</span>. A special, important example is the identity matrix: Given any complete, orthonormal set of vectors <span class="math">\(\{x_j\}\)</span>, the identity matrix <span class="math">\(I\)</span> can be written as <span class="math">\(I = \sum_i \vert x_i \rangle \langle x_i \vert\)</span>. This identity is often used to make a change of basis.</p>
NBA 2015-16!!!2015-10-25T16:30:00-07:002015-10-25T16:30:00-07:00Jonathan Landytag:efavdb.com,2015-10-25:/nba-2015-16<p><span class="caps">NBA</span> is back this Tuesday! The <a href="http://efavdb.github.io/nba-dash">dashboard</a> and <a href="http://efavdb.github.io/weekly-nba-predictions">weekly predictions</a> are now live*, once again. These will each be updated daily, with game winner predictions, hypothetical who-would-beat-whom daily matchup predictions, and more. For a discussion on how we make our predictions, see our first <a href="http://efavdb.github.io/nba-learner-2013-14-warmup">post</a> on this topic. Note that …</p><p><span class="caps">NBA</span> is back this Tuesday! The <a href="http://efavdb.github.io/nba-dash">dashboard</a> and <a href="http://efavdb.github.io/weekly-nba-predictions">weekly predictions</a> are now live*, once again. These will each be updated daily, with game winner predictions, hypothetical who-would-beat-whom daily matchup predictions, and more. For a discussion on how we make our predictions, see our first <a href="http://efavdb.github.io/nba-learner-2013-14-warmup">post</a> on this topic. Note that our approach does not make use of any bookie predictions (unlike many other sites), and so provides an independent look at the game.</p>
<p>This season, we hope to crack 70% accuracy!</p>
<ul>
<li>Note that we have left up last season’s completed games results, for review purposes. Once every team has played one game, we’ll switch it over to the current season’s results.</li>
</ul>Support Vector Machines for classification2015-10-22T14:24:00-07:002015-10-22T14:24:00-07:00Cathy Yehtag:efavdb.com,2015-10-22:/svm-classification<p>To whet your appetite for support vector machines, here’s a quote from machine learning researcher Andrew Ng:</p>
<blockquote>
<p>“SVMs are among the best (and many believe are indeed the best) ‘off-the-shelf’ supervised learning algorithms.”</p>
</blockquote>
<p><a href="http://commons.wikimedia.org/wiki/File%3AAndrew_Ng.png" title="See page for author [CC BY 3.0 us (http://creativecommons.org/licenses/by/3.0/us/deed.en)], via Wikimedia Commons"><img alt="Andrew Ng" src="//upload.wikimedia.org/wikipedia/commons/5/5c/Andrew_Ng.png"></a></p>
<p>Professor Ng covers SVMs in his excellent <a href="https://www.coursera.org/learn/machine-learning">Machine Learning <span class="caps">MOOC</span></a>, a gateway for many into the …</p><p>To whet your appetite for support vector machines, here’s a quote from machine learning researcher Andrew Ng:</p>
<blockquote>
<p>“SVMs are among the best (and many believe are indeed the best) ‘off-the-shelf’ supervised learning algorithms.”</p>
</blockquote>
<p><a href="http://commons.wikimedia.org/wiki/File%3AAndrew_Ng.png" title="See page for author [CC BY 3.0 us (http://creativecommons.org/licenses/by/3.0/us/deed.en)], via Wikimedia Commons"><img alt="Andrew Ng" src="//upload.wikimedia.org/wikipedia/commons/5/5c/Andrew_Ng.png"></a></p>
<p>Professor Ng covers SVMs in his excellent <a href="https://www.coursera.org/learn/machine-learning">Machine Learning <span class="caps">MOOC</span></a>, a gateway for many into the realm of data science, but leaves out some details, motivating us to put together some notes here to answer the question:</p>
<p><span class="dquo">“</span>What are the <em>support vectors</em> in support vector machines?”</p>
<p>We also provide <a href="https://github.com/EFavDB/svm-classification/blob/master/svm.ipynb">python code</a> using scikit-learn’s svm module to fit a binary classification problem using a custom kernel, along with code to generate the (awesome!) interactive plots in Part 3.</p>
<p>This post consists of three sections:</p>
<ul>
<li>Part 1 sets up the problem from a geometric point of view and then shows how it can be framed as an optimization problem.</li>
<li>Part 2 transforms the optimization problem and uncovers the support vectors in the process.</li>
<li>Part 3 discusses how kernels can be used to separate non-linearly separable data.</li>
</ul>
<hr>
<h2 id="part-1-defining-the-margin">Part 1: Defining the margin</h2>
<h3 id="maximizing-the-margin">Maximizing the margin</h3>
<p>The figure below shows a binary classification problem (points labeled <span class="math">\(y_i = \pm 1\)</span>) that is linearly separable.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2015/05/binaryclass_2d.png"><img alt="" src="https://efavdb.com/wp-content/uploads/2015/05/binaryclass_2d.png"></a></p>
<p>There are many possible decision boundaries that would perfectly separate the two classes, but an <span class="caps">SVM</span> will choose the line in 2-d (or “hyperplane”, more generally) that maximizes the margin around the boundary.</p>
<p>Intuitively, we can be very confident about the labels of points that fall far from the boundary, but we’re less confident about points near the boundary.
</p>
<h3 id="formulating-the-margin-with-geometry">Formulating the margin with geometry</h3>
<p>Any point <span class="math">\(\boldsymbol{x}\)</span> lying on the separating hyperplane satisfies
<span class="math">\(\boldsymbol{w} \cdot \boldsymbol{x} + b = 0,\)</span>
where <span class="math">\(\boldsymbol{w}\)</span> is the vector normal to the plane, and <span class="math">\(b\)</span> is a constant that sets how much the plane is shifted relative to the origin. The distance of the plane from the origin is <span class="math">\(\vert b \vert / \| \boldsymbol{w} \|\)</span>.</p>
<p><a href="https://efavdb.com/wp-content/uploads/2015/05/binaryclass_margin.png"><img alt="" src="https://efavdb.com/wp-content/uploads/2015/05/binaryclass_margin.png"></a></p>
<p>Now draw parallel planes on either side of the decision boundary, so we have what looks like a road, with the decision boundary as the median, and the additional planes as gutters. The margin, i.e. the width of the road, is (<span class="math">\(d_+ + d_-\)</span>) and is restricted by the data points closest to the boundary, which lie on the gutters.</p>
<p>The half-spaces bounded by the planes on the gutters are:</p>
<p><span class="math">\(\boldsymbol{w} \cdot \boldsymbol{x} + b \geq +a\)</span>, for <span class="math">\(y_i = +1\)</span></p>
<p><span class="math">\(\boldsymbol{w} \cdot \boldsymbol{x} + b \leq -a\)</span>, for <span class="math">\(y_i = -1\)</span></p>
<p>These two conditions can be put more succinctly:</p>
<p><span class="math">\(y_i (\boldsymbol{w} \cdot \boldsymbol{x} + b) \geq a, \forall \; i\)</span></p>
<p>Some arithmetic leads to the equation for the margin:</p>
<p><span class="math">\(d_+ + d_- = 2a / \| \boldsymbol{w} \|\)</span></p>
<p>Without loss of generality, we can set <span class="math">\(a=1\)</span>, since <span class="math">\(a\)</span> only sets the scale (units) of <span class="math">\(b\)</span> and <span class="math">\(\boldsymbol{w}\)</span>. To maximize the margin, then, we need to maximize <span class="math">\(1 / \| \boldsymbol{w} \|\)</span>. However, this is an unpleasant (non-convex) objective function. Instead, we equivalently minimize <span class="math">\(\| \boldsymbol{w}\|^2\)</span>, which is convex.</p>
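<p>A minimal sketch of these ideas (our own toy data, not from the post’s notebook), using scikit-learn’s svm module: with the <span class="math">\(a=1\)</span> convention, the margin recovered from the fitted coefficients is <span class="math">\(2/\| \boldsymbol{w} \|\)</span>:</p>

```python
import numpy as np
from sklearn import svm

# Our own toy example: two linearly separable clusters separated by a
# gap of width 2 along the x-axis. With the a=1 convention, the
# hard-margin solution has w=(1,0), b=-1, so the margin 2/||w|| is 2.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

clf = svm.SVC(kernel="linear", C=1e6)   # large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)
print(margin)                  # ~2.0
print(clf.support_vectors_)    # the points sitting on the gutters
```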
<h3 id="the-optimization-problem">The optimization problem</h3>
<p>Maximizing the margin boils down to a constrained optimization problem: minimize some quantity <span class="math">\(f(w)\)</span>, subject to constraints <span class="math">\(g(w,b)\)</span>. This optimization problem is particularly nice because it is convex; the objective <span class="math">\(\| \boldsymbol{w}\|^2\)</span> is convex, as are the constraints, which are linear.</p>
<p>In other words, we are faced with a <a href="http://en.wikipedia.org/wiki/Quadratic_programming">quadratic programming</a> problem. The standard format of the optimization problem for the separable case is</p>
<div class="m