
<!DOCTYPE html>
<html lang="en">
<!-- Produced from a LaTeX source file.  Note that the production is done -->
<!-- by a very rough-and-ready (and buggy) script, so the HTML and other  -->
<!-- code is quite ugly!  Later versions should be better.                -->
  <head>
    <meta charset="utf-8">
    <meta name="citation_title" content="Neural Networks and Deep Learning">
    <meta name="citation_author" content="Nielsen, Michael A.">
    <meta name="citation_publication_date" content="2014">
    <meta name="citation_fulltext_html_url" content="http://neuralnetworksanddeeplearning.com">
    <meta name="citation_publisher" content="Determination Press">
    <link rel="icon" href="nnadl_favicon.ICO" />
    <title>Neural Networks and Deep Learning</title>
    <script src="assets/jquery.min.js"></script>
    <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: {inlineMath: [['$','$']]},
        "HTML-CSS":
          {scale: 92},
        TeX: { equationNumbers: { autoNumber: "AMS" }}});
    </script>
    <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>


    <link href="assets/style.css" rel="stylesheet">
    <link href="assets/pygments.css" rel="stylesheet">

<style>
/* Adapted from */
/* https://groups.google.com/d/msg/mathjax-users/jqQxrmeG48o/oAaivLgLN90J, */
/* by David Cervone */

@font-face {
    font-family: 'MJX_Math';
    src: url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); /* IE9 Compat Modes */
    src: url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot?iefix') format('eot'),
    url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff')  format('woff'),
    url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf')  format('opentype'),
    url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Math-Italic.svg#MathJax_Math-Italic') format('svg');
}

@font-face {
    font-family: 'MJX_Main';
    src: url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); /* IE9 Compat Modes */
    src: url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot?iefix') format('eot'),
    url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff')  format('woff'),
    url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf')  format('opentype'),
    url('http://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Main-Regular.svg#MathJax_Main-Regular') format('svg');
}
</style>

  </head>
  <body><div class="header"><h1 class="chapter_number">
  <a href="">Chapter 3</a></h1>
  <h1 class="chapter_title"><a href="">Improving the way neural networks learn</a></h1></div><div class="section"><div id="toc">
<p class="toc_title"><a href="index.html">Neural Networks and Deep Learning</a></p><p class="toc_not_mainchapter"><a href="about.html">What this book is about</a></p><p class="toc_not_mainchapter"><a href="exercises_and_problems.html">On the exercises and problems</a></p><p class='toc_mainchapter'><a id="toc_using_neural_nets_to_recognize_handwritten_digits_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_using_neural_nets_to_recognize_handwritten_digits" src="images/arrow.png" width="15px"></a><a href="chap1.html">Using neural nets to recognize handwritten digits</a><div id="toc_using_neural_nets_to_recognize_handwritten_digits" style="display: none;"><p class="toc_section"><ul><a href="chap1.html#perceptrons"><li>Perceptrons</li></a><a href="chap1.html#sigmoid_neurons"><li>Sigmoid neurons</li></a><a href="chap1.html#the_architecture_of_neural_networks"><li>The architecture of neural networks</li></a><a href="chap1.html#a_simple_network_to_classify_handwritten_digits"><li>A simple network to classify handwritten digits</li></a><a href="chap1.html#learning_with_gradient_descent"><li>Learning with gradient descent</li></a><a href="chap1.html#implementing_our_network_to_classify_digits"><li>Implementing our network to classify digits</li></a><a href="chap1.html#toward_deep_learning"><li>Toward deep learning</li></a></ul></p></div>
<script>
$('#toc_using_neural_nets_to_recognize_handwritten_digits_reveal').click(function() {
   var src = $('#toc_img_using_neural_nets_to_recognize_handwritten_digits').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow.png');
   };
   $('#toc_using_neural_nets_to_recognize_handwritten_digits').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_how_the_backpropagation_algorithm_works_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_how_the_backpropagation_algorithm_works" src="images/arrow.png" width="15px"></a><a href="chap2.html">How the backpropagation algorithm works</a><div id="toc_how_the_backpropagation_algorithm_works" style="display: none;"><p class="toc_section"><ul><a href="chap2.html#warm_up_a_fast_matrix-based_approach_to_computing_the_output_from_a_neural_network"><li>Warm up: a fast matrix-based approach to computing the output  from a neural network</li></a><a href="chap2.html#the_two_assumptions_we_need_about_the_cost_function"><li>The two assumptions we need about the cost function</li></a><a href="chap2.html#the_hadamard_product_$s_\odot_t$"><li>The Hadamard product, $s \odot t$</li></a><a href="chap2.html#the_four_fundamental_equations_behind_backpropagation"><li>The four fundamental equations behind backpropagation</li></a><a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)"><li>Proof of the four fundamental equations (optional)</li></a><a href="chap2.html#the_backpropagation_algorithm"><li>The backpropagation algorithm</li></a><a href="chap2.html#the_code_for_backpropagation"><li>The code for backpropagation</li></a><a href="chap2.html#in_what_sense_is_backpropagation_a_fast_algorithm"><li>In what sense is backpropagation a fast algorithm?</li></a><a href="chap2.html#backpropagation_the_big_picture"><li>Backpropagation: the big picture</li></a></ul></p></div>
<script>
$('#toc_how_the_backpropagation_algorithm_works_reveal').click(function() {
   var src = $('#toc_img_how_the_backpropagation_algorithm_works').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow.png');
   };
   $('#toc_how_the_backpropagation_algorithm_works').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_improving_the_way_neural_networks_learn_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_improving_the_way_neural_networks_learn" src="images/arrow.png" width="15px"></a><a href="chap3.html">Improving the way neural networks learn</a><div id="toc_improving_the_way_neural_networks_learn" style="display: none;"><p class="toc_section"><ul><a href="chap3.html#the_cross-entropy_cost_function"><li>The cross-entropy cost function</li></a><a href="chap3.html#overfitting_and_regularization"><li>Overfitting and regularization</li></a><a href="chap3.html#weight_initialization"><li>Weight initialization</li></a><a href="chap3.html#handwriting_recognition_revisited_the_code"><li>Handwriting recognition revisited: the code</li></a><a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters"><li>How to choose a neural network's hyper-parameters?</li></a><a href="chap3.html#other_techniques"><li>Other techniques</li></a></ul></p></div>
<script>
$('#toc_improving_the_way_neural_networks_learn_reveal').click(function() {
   var src = $('#toc_img_improving_the_way_neural_networks_learn').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow.png');
   };
   $('#toc_improving_the_way_neural_networks_learn').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_a_visual_proof_that_neural_nets_can_compute_any_function" src="images/arrow.png" width="15px"></a><a href="chap4.html">A visual proof that neural nets can compute any function</a><div id="toc_a_visual_proof_that_neural_nets_can_compute_any_function" style="display: none;"><p class="toc_section"><ul><a href="chap4.html#two_caveats"><li>Two caveats</li></a><a href="chap4.html#universality_with_one_input_and_one_output"><li>Universality with one input and one output</li></a><a href="chap4.html#many_input_variables"><li>Many input variables</li></a><a href="chap4.html#extension_beyond_sigmoid_neurons"><li>Extension beyond sigmoid neurons</li></a><a href="chap4.html#fixing_up_the_step_functions"><li>Fixing up the step functions</li></a><a href="chap4.html#conclusion"><li>Conclusion</li></a></ul></p></div>
<script>
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal').click(function() {
   var src = $('#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow.png');
   };
   $('#toc_a_visual_proof_that_neural_nets_can_compute_any_function').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_why_are_deep_neural_networks_hard_to_train_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_why_are_deep_neural_networks_hard_to_train" src="images/arrow.png" width="15px"></a><a href="chap5.html">Why are deep neural networks hard to train?</a><div id="toc_why_are_deep_neural_networks_hard_to_train" style="display: none;"><p class="toc_section"><ul><a href="chap5.html#the_vanishing_gradient_problem"><li>The vanishing gradient problem</li></a><a href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets"><li>What's causing the vanishing gradient problem?  Unstable gradients in deep neural nets</li></a><a href="chap5.html#unstable_gradients_in_more_complex_networks"><li>Unstable gradients in more complex networks</li></a><a href="chap5.html#other_obstacles_to_deep_learning"><li>Other obstacles to deep learning</li></a></ul></p></div>
<script>
$('#toc_why_are_deep_neural_networks_hard_to_train_reveal').click(function() {
   var src = $('#toc_img_why_are_deep_neural_networks_hard_to_train').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow.png');
   };
   $('#toc_why_are_deep_neural_networks_hard_to_train').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_deep_learning_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_deep_learning" src="images/arrow.png" width="15px"></a>Deep learning<div id="toc_deep_learning" style="display: none;"><p class="toc_section"><ul><li>Convolutional neural networks</li><li>Pretraining</li><li>Recurrent neural networks, Boltzmann machines, and other  models</li><li>Is there a universal thinking algorithm?</li><li>On the future of neural networks</li></ul></p></div>
<script>
$('#toc_deep_learning_reveal').click(function() {
   var src = $('#toc_img_deep_learning').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_deep_learning").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_deep_learning").attr('src', 'images/arrow.png');
   };
   $('#toc_deep_learning').toggle('fast', function() {});
});</script><p class="toc_not_mainchapter"><a href="acknowledgements.html">Acknowledgements</a></p><p class="toc_not_mainchapter"><a href="faq.html">Frequently Asked Questions</a></p>
<hr>
<span class="sidebar_title">Sponsors</span>
<br/>
<a href='http://www.ersatz1.com/'><img src='assets/ersatz.png' width='140px' style="padding: 0px 0px 10px 8px; border-style: none;"></a>

<a href='http://gsquaredcapital.com/'><img src='assets/gsquared.png' width='150px' style="padding: 0px 0px 10px 10px; border-style: none;"></a>

<a href='http://www.tineye.com'><img src='assets/tineye.png' width='150px'
style="padding: 0px 0px 10px 8px; border-style: none;"></a>

<a href='http://www.visionsmarts.com'><img
src='assets/visionsmarts.png' width='160px' style="padding: 0px 0px
0px 0px; border-style: none;"></a> <br/>


<!--
<p class="sidebar">Thanks to all the <a
href="supporters.html">supporters</a> who made the book possible.
Thanks also to all the contributors to the <a
href="bugfinder.html">Bugfinder Hall of Fame</a>.  </p>

<p class="sidebar">The book is currently a beta release, and is still
under active development.  Please send error reports to
mn@michaelnielsen.org.  For other enquiries, please see the <a
href="faq.html">FAQ</a> first.</p>
-->

<p class="sidebar">Thanks to all the <a
href="supporters.html">supporters</a> who made the book possible.
Thanks also to all the contributors to the <a
        href="bugfinder.html">Bugfinder Hall of Fame</a>.
For the Japanese edition, many thanks also go to the <a
href="translators.html">translators</a>.

</p>


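<p class="sidebar">The book is currently a beta release, and is still under active development.
Please send error reports to mn@michaelnielsen.org, and questions about the Japanese edition to muranushi@gmail.com.
For other enquiries, please see the <a
href="faq.html">FAQ</a> first.</p>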


<hr>
<span class="sidebar_title">Resources</span>

<p class="sidebar">
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning">Code repository</a></p>

<p class="sidebar">
<a href="http://eepurl.com/BYr9L">Mailing list for book announcements</a>
</p>

<p class="sidebar">
<a href="http://eepurl.com/0Xxjb">Michael Nielsen's project announcement mailing list</a>
</p>

<hr>
<a href="http://michaelnielsen.org"><img src="assets/Michael_Nielsen_Web_Small.jpg" width="160px" style="border-style: none;"/></a>

<p class="sidebar">
  By <a href="http://michaelnielsen.org">Michael Nielsen</a> / Sep-Dec 2014 <br >  Translation: <a href="https://github.com/nnadl-ja/nnadl_site_ja">the "Neural Networks and Deep Learning" translation project</a>
</p>
</div>
</p>

<!--
<p>When a golf player is first learning to play golf, they usually spend
most of their time developing a basic swing.  Only gradually do they
develop other shots, learning to chip, draw and fade the ball,
building on and modifying their basic swing.  In a similar way, up to
now we've focused on understanding the backpropagation algorithm.
It's our "basic swing", the foundation for learning in most work on
neural networks.  In this chapter I explain a suite of techniques
which can be used to improve on our vanilla implementation of
backpropagation, and so improve the way our networks learn.</p>-->

<p>
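When golfers first learn the game, they usually spend most of their time developing a basic swing. Only gradually do they develop other shots, learning to chip, draw and fade the ball, building on and modifying their basic swing.
In a similar way, up to now we've focused on understanding the backpropagation algorithm. It's our "basic swing", the foundation for learning in most work on neural networks. In this chapter I explain a suite of techniques which can be used to improve on our vanilla implementation of backpropagation, and so improve the way our networks learn.</p>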


<!--<p>The techniques we'll develop in this chapter include: a better choice
of cost function, known as
<a href="chap3.html#the_cross-entropy_cost_function">the
  cross-entropy</a> cost function; four so-called
<a href="chap3.html#overfitting_and_regularization">"regularization"
  methods</a> (L1 and L2 regularization, dropout, and artificial
expansion of the training data), which make our networks better at
generalizing beyond the training data; a
<a href="chap3.html#weight_initialization">better method for
  initializing the weights</a> in the network; and a
<a href="#how_to_choose_a_neural_network's_hyper-parameters">set
  of heuristics to help choose good hyper-parameters</a> for the network.
I'll also overview <a href="chap3.html#other_techniques">several other
  techniques</a> in less depth.  The discussions are largely independent
of one another, and so you may jump ahead if you wish.  We'll also
<a href="#handwriting_recognition_revisited_the_code">implement</a>
many of the techniques in running code, and use them to improve the
results obtained on the handwriting classification problem studied in
<a href="chap1.html">Chapter 1</a>.</p> -->

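<p>The techniques we'll develop in this chapter include: a better choice of cost function, known as <a href="chap3.html#the_cross-entropy_cost_function">the cross-entropy</a> cost function;
four so-called <a href="chap3.html#overfitting_and_regularization">"regularization" methods</a> (L1 and L2 regularization, dropout, and artificial expansion of the training data), which make our networks better at generalizing beyond the training data;
a <a href="chap3.html#weight_initialization">better method for initializing the weights</a> in the network;
and a <a href="#how_to_choose_a_neural_network's_hyper-parameters">set of heuristics to help choose good hyper-parameters</a> for the network. I'll also overview <a href="chap3.html#other_techniques">several other techniques</a> in less depth.
The discussions are largely independent of one another, and so you may jump ahead if you wish. We'll also <a href="#handwriting_recognition_revisited_the_code">implement</a> many of the techniques in running code, and use them to improve the results obtained on the handwriting classification problem studied in <a href="chap1.html">Chapter 1</a>.</p>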

<!--
<p>Of course, we're only covering a few of the many, many techniques
which have been developed for use in neural nets.  The philosophy is
that the best entree to the plethora of available techniques is
in-depth study of a few of the most important.  Mastering those
important techniques is not just useful in its own right, but will
also deepen your understanding of what problems can arise when you use
neural networks.  That will leave you well prepared to quickly pick up
other techniques, as you need them.</p>-->


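<p>Of course, we're only covering a few of the many, many techniques which have been developed for use in neural nets. The philosophy is that the best entree to the plethora of available techniques is in-depth study of a few of the most important. Mastering those important techniques is not just useful in its own right, but will also deepen your understanding of what problems can arise when you use neural networks. That will leave you well prepared to quickly pick up other techniques, as you need them.</p>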

<!--<p></p><p></p><p></p><p><h3><a name="the_cross-entropy_cost_function"></a><a href="#the_cross-entropy_cost_function">The cross-entropy cost function</a></h3></p>-->
<p></p><p></p><p></p><p><h3><a name="the_cross-entropy_cost_function"></a><a href="#the_cross-entropy_cost_function">The cross-entropy cost function</a></h3></p>
<!--<p>Most of us find it unpleasant to be wrong.  Soon after beginning to
learn the piano I gave my first performance before an audience.  I was
nervous, and began playing the piece an octave too low.  I got
confused, and couldn't continue until someone pointed out my error.  I
was very embarassed.  Yet while unpleasant, we also learn quickly when
we're decisively wrong.  You can bet that the next time I played
before an audience I played in the correct octave!  By contrast, we
learn more slowly when our errors are less well-defined.</p>-->
<p>
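Most of us find it unpleasant to be wrong. Soon after beginning to learn the piano I gave my first performance before an audience. I was nervous, and began playing the piece an octave too low. I got confused, and couldn't continue until someone pointed out my error. I was very embarrassed. Yet while unpleasant, we also learn quickly when we're decisively wrong. You can bet that the next time I played before an audience I played in the correct octave! By contrast, we learn more slowly when our errors are less well-defined.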
</p>
<!--<p>Ideally, we hope and expect that our neural networks will learn fast
from their errors.  Is this what happens in practice?  To answer this
question, let's look at a toy example.  The example involves a neuron
with just one input:</p>-->
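<p>Ideally, we hope and expect that our neural networks will learn fast from their errors. Is this what happens in practice? To answer this question, let's look at a toy example. The example involves a neuron with just one input: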
</p>
<p><center>
<img src="images/tikz28.png"/>
</center></p>
<!--<p>We'll train this neuron to do something ridiculously easy: take the
input $1$ to the output $0$.  Of course, this is such a trivial task
that we could easily figure out an appropriate weight and bias by
hand, without using a learning algorithm.  However, it turns out to be
illuminating to use gradient descent to attempt to learn a weight and
bias.  So let's take a look at how the neuron learns.</p>-->
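<p>We'll train this neuron to do something ridiculously easy: take the input $1$ to the output $0$. Of course, this is such a trivial task that we could easily figure out an appropriate weight and bias by hand, without using a learning algorithm. However, it turns out to be illuminating to use gradient descent to attempt to learn a weight and bias. So let's take a look at how the neuron learns.</p>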
<!--<p>To make things definite, I'll pick the initial weight to be $0.6$ and
the initial bias to be $0.9$.  These are generic choices used as a
place to begin learning, I wasn't picking them to be special in any
way.  The initial output from the neuron is $0.82$, so quite a bit of
learning will be needed before our neuron gets near the desired
output, $0.0$.  Click on "Run" in the bottom right corner below to
see how the neuron learns an output much closer to $0.0$.  Note that
this isn't a pre-recorded animation, your browser is actually
computing the gradient, then using the gradient to update the weight
and bias, and displaying the result.  The learning rate is $\eta =
0.15$, which turns out to be slow enough that we can follow what's
happening, but fast enough that we can get substantial learning in
just a few seconds.  The cost is the quadratic cost function, $C$,
introduced back in Chapter 1.  I'll remind you of the exact form of
the cost function shortly, so there's no need to go and dig up the
definition.  Note that you can run the animation multiple times by
clicking on "Run" again.</p>-->
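<p>To make things definite, I'll pick the initial weight to be $0.6$ and the initial bias to be $0.9$. These are generic choices used as a place to begin learning; I wasn't picking them to be special in any way. The initial output from the neuron is $0.82$, so quite a bit of learning will be needed before our neuron gets near the desired output, $0.0$. Click on "Run" in the bottom right corner below to see how the neuron learns an output much closer to $0.0$. Note that this isn't a pre-recorded animation: your browser is actually computing the gradient, then using the gradient to update the weight and bias, and displaying the result. The learning rate is $\eta = 0.15$, which turns out to be slow enough that we can follow what's happening, but fast enough that we can get substantial learning in just a few seconds. The cost is the quadratic cost function, $C$, introduced back in Chapter 1. I'll remind you of the exact form of the cost function shortly, so there's no need to go and dig up the definition. Note that you can run the animation multiple times by clicking on "Run" again.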
</p>
<p>
<script type="text/javascript" src="js/paper.js"></script>
<script type="text/paperscript" src="js/saturation1.js" canvas="saturation1">
</script>
<center>
<canvas id="saturation1" width="520" height="300"></canvas>
</center></p>
<!--<p>As you can see, the neuron rapidly learns a weight and bias that
drives down the cost, and gives an output from the neuron of about
$0.09$.  That's not quite the desired output, $0.0$, but it is pretty
good.  Suppose, however, that we instead choose both the starting
weight and the starting bias to be $2.0$.  In this case the initial
output is $0.98$, which is very badly wrong.  Let's look at how the
neuron learns to output $0$ in this case.  Click on "Run" again:</p>-->
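<p>As you can see, the neuron rapidly learns a weight and bias that drives down the cost, and gives an output from the neuron of about $0.09$. That's not quite the desired output, $0.0$, but it is pretty good. Suppose, however, that we instead choose both the starting weight and the starting bias to be $2.0$. In this case the initial output is $0.98$, which is very badly wrong. Let's look at how the neuron learns to output $0$ in this case. Click on "Run" again: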
</p>
<p>
<script type="text/paperscript" src="js/saturation2.js" canvas="saturation2">
</script>
<a id="saturation2_anchor"></a>
<center>
<canvas id="saturation2" width="520" height="300"></canvas>
</center></p>
<!--<p>Although this example uses the same learning rate ($\eta = 0.15$), we
can see that learning starts out much more slowly.  Indeed, for the
first 150 or so learning epochs, the weights and biases don't change
much at all.  Then the learning kicks in and, much as in our first
example, the neuron's output rapidly moves closer to $0.0$.</p>-->
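<p>Although this example uses the same learning rate ($\eta = 0.15$), we can see that learning starts out much more slowly. Indeed, for the first 150 or so learning epochs, the weights and biases don't change much at all. Then the learning kicks in and, much as in our first example, the neuron's output rapidly moves closer to $0.0$.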
</p>
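<p>The two runs above can also be reproduced outside the browser. What follows is a minimal sketch, not the code behind the animation: the function names are hypothetical, the epoch count of 300 is an assumption, and the update uses the standard identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$ together with the quadratic cost for the single training pair $x = 1$, $y = 0$.</p>

```python
import math


def sigmoid(z):
    """Logistic function computed by the neuron."""
    return 1.0 / (1.0 + math.exp(-z))


def train_single_neuron(w, b, eta=0.15, epochs=300, x=1.0, y=0.0):
    """Plain gradient descent on the quadratic cost C = (y - a)^2 / 2.

    Returns a list with the neuron's output before each update,
    followed by the output after the final update (epochs + 1 values).
    The default of 300 epochs is an assumption, not taken from the text.
    """
    outputs = []
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        outputs.append(a)
        # dC/dz = (a - y) * sigma'(z), using sigma'(z) = a * (1 - a)
        delta = (a - y) * a * (1.0 - a)
        w -= eta * delta * x  # dC/dw = delta * x
        b -= eta * delta      # dC/db = delta
    outputs.append(sigmoid(w * x + b))
    return outputs


good_start = train_single_neuron(w=0.6, b=0.9)
bad_start = train_single_neuron(w=2.0, b=2.0)

for epoch in (0, 50, 150, 300):
    print(f"epoch {epoch:3d}: output {good_start[epoch]:.2f} (w=0.6, b=0.9), "
          f"{bad_start[epoch]:.2f} (w=b=2.0)")
```

<p>Printing the output at a few epochs makes the contrast visible without the animation: the first run moves away from its initial output almost immediately, while the second run stays close to its initial output for many epochs before it starts to fall.</p>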
<!--<p>This behaviour is strange when contrasted to human learning.  As I
said at the beginning of this section, we often learn fastest when
we're badly wrong about something.  But we've just seen that our
artificial neuron has a lot of difficulty learning when it's badly
wrong - far more difficulty than when it's just a little wrong.
What's more, it turns out that this behaviour occurs not just in this
toy model, but in more general networks.  Why is learning so slow?
And can we find a way of avoiding this slowdown?</p>-->
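<p>This behaviour is strange when contrasted to human learning. As I said at the beginning of this section, we often learn fastest when we're badly wrong about something. But we've just seen that our artificial neuron has a lot of difficulty learning when it's badly wrong, far more difficulty than when it's just a little wrong. What's more, it turns out that this behaviour occurs not just in this toy model, but in more general networks. Why is learning so slow? And can we find a way of avoiding this slowdown?</p>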
<!--<p>To understand the origin of the problem, consider that our neuron
learns by changing the weight and bias at a rate determined by the
partial derivatives of the cost function, $\partial C/\partial w$ and
$\partial C / \partial b$.  So saying "learning is slow" is really
the same as saying that those partial derivatives are small.  The
challenge is to understand why they are small.  To understand that,
let's compute the partial derivatives.  Recall that we're using the
quadratic cost function, which, from
Equation <span id="margin_619820868292_reveal" class="equation_link">(6)</span><span id="margin_619820868292" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_619820868292_reveal').click(function() {$('#margin_619820868292').toggle('slow', function() {});});</script>, is given by
<a class="displaced_anchor" name="eqtn54"></a>\begin{eqnarray}
  C = \frac{(y-a)^2}{2},
\tag{54}\end{eqnarray}-->
<p>To understand the origin of the problem, consider that our neuron learns by changing the weight and bias at a rate determined by the partial derivatives of the cost function, $\partial C/\partial w$ and $\partial C / \partial b$. So saying "learning is slow" is really the same as saying that those partial derivatives are small. The challenge is to understand why they are small. To understand that, let's compute the partial derivatives. Recall that we're using the quadratic cost function, which, from Equation <span id="margin_619820868292_reveal" class="equation_link">(6)</span><span id="margin_619820868292" class="marginequation" style="display: none;"><a href="chap1.html#eqtn6" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_619820868292_reveal').click(function() {$('#margin_619820868292').toggle('slow', function() {});});</script>, is given by
<a class="displaced_anchor" name="eqtn54"></a>\begin{eqnarray}
  C = \frac{(y-a)^2}{2},
\tag{54}\end{eqnarray}
<!--where $a$ is the neuron's output when the training input $x = 1$ is
used, and $y = 0$ is the corresponding desired output.  To write this
more explicitly in terms of the weight and bias, recall that $a =
\sigma(z)$, where $z = wx+b$.  Using the chain rule to differentiate
with respect to the weight and bias we get
<a class="displaced_anchor" name="eqtn55"></a><a class="displaced_anchor" name="eqtn56"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \tag{55}\\

  \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z),
\tag{56}\end{eqnarray}
where I have substituted $x = 1$ and $y = 0$.  To understand the
behaviour of these expressions, let's look more closely at the
$\sigma'(z)$ term on the right-hand side.  Recall the shape of the
$\sigma$ function:</p>-->
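where $a$ is the neuron's output when the training input $x = 1$ is used, and $y = 0$ is the corresponding desired output. To write this more explicitly in terms of the weight and bias, recall that $a = \sigma(z)$, where $z = wx+b$. Using the chain rule to differentiate with respect to the weight and bias we get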
<a class="displaced_anchor" name="eqtn55"></a><a class="displaced_anchor" name="eqtn56"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z) \tag{55}\\

  \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z),
\tag{56}\end{eqnarray}
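where I have substituted $x = 1$ and $y = 0$. To understand the behaviour of these expressions, let's look more closely at the $\sigma'(z)$ term on the right-hand side. Recall the shape of the $\sigma$ function: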
</p>
<p>
<div id="sigmoid_graph"><a name="sigmoid_graph"></a></div>
<script type="text/javascript" src="http://d3js.org/d3.v3.min.js"></script>
<script>
function s(x) {return 1/(1+Math.exp(-x));}
var m = [40, 120, 50, 120];
var height = 290 - m[0] - m[2];
var width = 600 - m[1] - m[3];
var xmin = -5;
var xmax = 5;
var sample = 400;
var x1 = d3.scale.linear().domain([0, sample]).range([xmin, xmax]);
var data = d3.range(sample).map(function(d){ return {
        x: x1(d),
        y: s(x1(d))};
    });
var x = d3.scale.linear().domain([xmin, xmax]).range([0, width]);
var y = d3.scale.linear()
                .domain([0, 1])
                .range([height, 0]);
var line = d3.svg.line()
    .x(function(d) { return x(d.x); })
    .y(function(d) { return y(d.y); })
var graph = d3.select("#sigmoid_graph")
    .append("svg")
    .attr("width", width + m[1] + m[3])
    .attr("height", height + m[0] + m[2])
    .append("g")
    .attr("transform", "translate(" + m[3] + "," + m[0] + ")");
var xAxis = d3.svg.axis()
                  .scale(x)
                  .tickValues(d3.range(-4, 5, 1))
                  .orient("bottom")
graph.append("g")
    .attr("class", "x axis")
    .attr("transform", "translate(0, " + height + ")")
    .call(xAxis);
var yAxis = d3.svg.axis()
                  .scale(y)
                  .tickValues(d3.range(0, 1.01, 0.2))
                  .orient("left")
                  .ticks(5)
graph.append("g")
    .attr("class", "y axis")
    .call(yAxis);
graph.append("path").attr("d", line(data));
graph.append("text")
     .attr("class", "x label")
     .attr("text-anchor", "end")
     .attr("x", width/2)
     .attr("y", height+35)
     .text("z");
graph.append("text")
        .attr("x", (width / 2))
        .attr("y", -10)
        .attr("text-anchor", "middle")
        .style("font-size", "16px")
        .text("sigmoid function");
</script>
</p>
<!--<p>We can see from this graph that when the neuron's output is close to
$1$, the curve gets very flat, and so $\sigma'(z)$ gets very small.
Equations <span id="margin_506983300551_reveal" class="equation_link">(55)</span><span id="margin_506983300551" class="marginequation" style="display: none;"><a href="chap3.html#eqtn55" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z)  \nonumber\end{eqnarray}</a></span><script>$('#margin_506983300551_reveal').click(function() {$('#margin_506983300551').toggle('slow', function() {});});</script> and <span id="margin_277682059598_reveal" class="equation_link">(56)</span><span id="margin_277682059598" class="marginequation" style="display: none;"><a href="chap3.html#eqtn56" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}</a></span><script>$('#margin_277682059598_reveal').click(function() {$('#margin_277682059598').toggle('slow', function() {});});</script> then tell us that
$\partial C / \partial w$ and $\partial C / \partial b$ get very
small.  This is the origin of the learning slowdown.  What's more, as
we shall see a little later, the learning slowdown occurs for
essentially the same reason in more general neural networks, not just
the toy example we've been playing with.</p>-->
<p>We can see from this graph that when the neuron's output is close to $1$, the curve gets very flat, and so $\sigma'(z)$ gets very small. Equations <span id="margin_506983300551_reveal" class="equation_link">(55)</span><span id="margin_506983300551" class="marginequation" style="display: none;"><a href="chap3.html#eqtn55" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z)  \nonumber\end{eqnarray}</a></span><script>$('#margin_506983300551_reveal').click(function() {$('#margin_506983300551').toggle('slow', function() {});});</script> and <span id="margin_277682059598_reveal" class="equation_link">(56)</span><span id="margin_277682059598" class="marginequation" style="display: none;"><a href="chap3.html#eqtn56" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}</a></span><script>$('#margin_277682059598_reveal').click(function() {$('#margin_277682059598').toggle('slow', function() {});});</script> then tell us that $\partial C / \partial w$ and $\partial C / \partial b$ get very small. This is the origin of the learning slowdown. What's more, as we shall see a little later, the learning slowdown occurs for essentially the same reason in more general neural networks, not just the toy example we've been playing with.</p>
<!--<p><h4><a name="introducing_the_cross-entropy_cost_function"></a><a href="#introducing_the_cross-entropy_cost_function">Introducing the cross-entropy cost function</a></h4></p>-->
<p><h4><a name="introducing_the_cross-entropy_cost_function"></a><a href="#introducing_the_cross-entropy_cost_function">Introducing the cross-entropy cost function</a></h4></p>
<!--<p>How can we address the learning slowdown?  It turns out that we can
solve the problem by replacing the quadratic cost with a different
cost function, known as the cross-entropy.  To understand the
cross-entropy, let's move a little away from our super-simple toy
model.  We'll suppose instead that we're trying to train a neuron with
several input variables, $x_1, x_2, \ldots$, corresponding weights
$w_1, w_2, \ldots$, and a bias, $b$:
<center>
<img src="images/tikz29.png"/>
</center>
The output from the neuron is, of course, $a = \sigma(z)$, where $z =
\sum_j w_j x_j+b$ is the weighted sum of the inputs.  We define the
cross-entropy cost function for this neuron by
<a class="displaced_anchor" name="eqtn57"></a>\begin{eqnarray}
  C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right],
\tag{57}\end{eqnarray}
where $n$ is the total number of items of training data, the sum is
over all training inputs, $x$, and $y$ is the corresponding desired
output.</p>-->
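<p>How can we address the learning slowdown? It turns out that we can solve the problem by replacing the quadratic cost with a different cost function, known as the cross-entropy. To understand the cross-entropy, let's move a little away from our super-simple toy model. We'll suppose instead that we're trying to train a neuron with several input variables, $x_1, x_2, \ldots$, corresponding weights $w_1, w_2, \ldots$, and a bias, $b$: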
<center>
<img src="images/tikz29.png"/>
</center>
The output from the neuron is, of course, $a = \sigma(z)$, where $z =
\sum_j w_j x_j+b$ is the weighted sum of the inputs. We define the cross-entropy cost function for this neuron by
<a class="displaced_anchor" name="eqtn57"></a>\begin{eqnarray}
  C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right],
\tag{57}\end{eqnarray}
where $n$ is the total number of items of training data, the sum is over all training inputs, $x$, and $y$ is the corresponding desired output.
</p>
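<p>As a rough illustration of Equation (57), here is a minimal sketch in plain Python. The function names <tt>sigmoid</tt> and <tt>cross_entropy_cost</tt>, and the small data set, are assumptions made up for this example; they are not taken from the book's code.</p>

```python
import math

def sigmoid(z):
    """The sigmoid function, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy_cost(w, b, inputs, targets):
    """Equation (57) for a single sigmoid neuron.

    w is the list of weights, b the bias, inputs a list of input
    vectors x, and targets the list of desired outputs y.
    """
    total = 0.0
    for x, y in zip(inputs, targets):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b   # weighted input
        a = sigmoid(z)                                     # neuron output
        total += y * math.log(a) + (1 - y) * math.log(1 - a)
    return -total / len(inputs)

# Hypothetical training data: three inputs with two input variables each.
inputs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [0.0, 1.0, 1.0]
cost = cross_entropy_cost([0.5, -0.5], 0.1, inputs, targets)
```

<p>With all weights and the bias set to zero the neuron outputs $0.5$ for every input, and the cost is then $\ln 2$ whatever the desired outputs are.</p>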
<!--<p>It's not obvious that the expression <span id="margin_691582954050_reveal" class="equation_link">(57)</span><span id="margin_691582954050" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span><script>$('#margin_691582954050_reveal').click(function() {$('#margin_691582954050').toggle('slow', function() {});});</script>
fixes the learning slowdown problem.  In fact, frankly, it's not even
obvious that it makes sense to call this a cost function!  Before
addressing the learning slowdown, let's see in what sense the
cross-entropy can be interpreted as a cost function.</p>-->
<p>It's not obvious that the expression <span id="margin_691582954050_reveal" class="equation_link">(57)</span><span id="margin_691582954050" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span><script>$('#margin_691582954050_reveal').click(function() {$('#margin_691582954050').toggle('slow', function() {});});</script> fixes the learning slowdown problem. In fact, frankly, it's not even obvious that it makes sense to call this a cost function! Before addressing the learning slowdown, let's see in what sense the cross-entropy can be interpreted as a cost function.
</p>
<!--<p>Two properties in particular make it reasonable to interpret the
cross-entropy as a cost function.  First, it's non-negative, that is,
$C > 0$.  To see this, notice that all the individual terms in the sum
in <span id="margin_607315497502_reveal" class="equation_link">(57)</span><span id="margin_607315497502" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span><script>$('#margin_607315497502_reveal').click(function() {$('#margin_607315497502').toggle('slow', function() {});});</script> are positive, since: (a) both
logarithms are of numbers in the range $0$ to $1$, and thus are
negative; and (b) there is a minus sign out the front.</p>-->
<p>
Two properties in particular make it reasonable to interpret the cross-entropy as a cost function. First, it's non-negative, that is, $C > 0$. To see this, notice that: (a) all the individual terms in the sum in <span id="margin_607315497502_reveal" class="equation_link">(57)</span><span id="margin_607315497502" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span><script>$('#margin_607315497502_reveal').click(function() {$('#margin_607315497502').toggle('slow', function() {});});</script> are negative, since both logarithms are of numbers in the range $0$ to $1$; and (b) there is a minus sign out the front of the sum.
</p>
<!--<p>Second, if the neuron's actual output is close to the desired output,
i.e., $y = y(x)$ for all training inputs $x$, then the cross-entropy
will be close to zero*<span class="marginnote">
*To prove this I will need to assume
  that the desired output $y(x)$ is either $0$ or $1$.  This is
  usually the case when solving classification problems, for example,
  or when computing Boolean functions.  To understand what happens
  when we don't make this assumption, see the exercises at the end of
  this section.</span>.  To see this, suppose for example that $y(x) = 0$
and $a \approx 0$ for some input $x$.  This is a case when the neuron
is doing a good job on that input.  We see that the first term in the
expression <span id="margin_161232079323_reveal" class="equation_link">(57)</span><span id="margin_161232079323" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span><script>$('#margin_161232079323_reveal').click(function() {$('#margin_161232079323').toggle('slow', function() {});});</script> for the cost vanishes, since
$y(x) = 0$, while the second term is just $-\ln (1-a) \approx 0$.  A
similar analysis holds when $y(x) = 1$ and $a \approx 1$.  And so the
contribution to the cost will be low provided the actual output is
close to the desired output.</p>-->
<p>Second, if the neuron's actual output is close to the desired output for all training inputs $x$, i.e., $a \approx y(x)$, then the cross-entropy will be close to zero*<span class="marginnote">
*To prove this I will need to assume that the desired output $y(x)$ is either $0$ or $1$. This is usually the case when solving classification problems, for example, or when computing Boolean functions. To understand what happens when we don't make this assumption, see the exercises at the end of this section.</span>. To see this, suppose for example that $y(x) = 0$ and $a \approx 0$ for some input $x$. This is a case when the neuron is doing a good job on that input. We see that the first term in the expression <span id="margin_161232079323_reveal" class="equation_link">(57)</span><span id="margin_161232079323" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span><script>$('#margin_161232079323_reveal').click(function() {$('#margin_161232079323').toggle('slow', function() {});});</script> for the cost vanishes, since $y(x) = 0$, while the second term is just $-\ln (1-a) \approx 0$. A similar analysis holds when $y(x) = 1$ and $a \approx 1$. And so the contribution to the cost will be low provided the actual output is close to the desired output.
</p>
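<p>To make the second property concrete, the following sketch evaluates the contribution of a single training input to the sum in Equation (57). The helper name <tt>cost_term</tt> and the output values $0.01$ and $0.99$ are assumptions picked only for illustration.</p>

```python
import math

def cost_term(a, y):
    """Contribution of one training input to the cross-entropy, before averaging."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

good_zero = cost_term(0.01, 0)   # desired output 0, actual output close to 0
good_one = cost_term(0.99, 1)    # desired output 1, actual output close to 1
bad_zero = cost_term(0.99, 0)    # desired output 0, actual output close to 1
```

<p>The two "good" cases both give $-\ln 0.99$, which is close to zero, while the badly wrong case gives $-\ln 0.01$, which is several hundred times larger.</p>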
<!--<p>Summing up, the cross-entropy is positive, and tends toward zero as
the neuron gets better at computing the desired output, $y$, for all
training inputs, $x$.  These are both properties we'd intuitively
expect for a cost function.  Indeed, both properties are also
satisfied by the quadratic cost. So that's good news for the
cross-entropy.  But the cross-entropy cost function has the benefit
that, unlike the quadratic cost, it avoids the problem of learning
slowing down.  To see this, let's compute the partial derivative of
the cross-entropy cost with respect to the weights.  We substitute $a
= \sigma(z)$ into <span id="margin_261134997286_reveal" class="equation_link">(57)</span><span id="margin_261134997286" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span><script>$('#margin_261134997286_reveal').click(function() {$('#margin_261134997286').toggle('slow', function() {});});</script>, and apply the chain
rule twice, obtaining:
<a class="displaced_anchor" name="eqtn58"></a><a class="displaced_anchor" name="eqtn59"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w_j} & = & -\frac{1}{n} \sum_x \left(
    \frac{y }{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right)
  \frac{\partial \sigma}{\partial w_j} \tag{58}\\
 & = & -\frac{1}{n} \sum_x \left(
    \frac{y}{\sigma(z)}
    -\frac{(1-y)}{1-\sigma(z)} \right)\sigma'(z) x_j.
\tag{59}\end{eqnarray}-->
<p>Summing up, the cross-entropy is positive, and tends toward zero as the neuron gets better at computing the desired output, $y$, for all training inputs, $x$. These are both properties we'd intuitively expect for a cost function. Indeed, both properties are also satisfied by the quadratic cost. So that's good news for the cross-entropy. But the cross-entropy cost function has the benefit that, unlike the quadratic cost, it avoids the problem of learning slowing down. To see this, let's compute the partial derivative of the cross-entropy cost with respect to the weights. We substitute $a = \sigma(z)$ into <span id="margin_261134997286_reveal" class="equation_link">(57)</span><span id="margin_261134997286" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span><script>$('#margin_261134997286_reveal').click(function() {$('#margin_261134997286').toggle('slow', function() {});});</script>, and apply the chain rule twice, obtaining:
<a class="displaced_anchor" name="eqtn58"></a><a class="displaced_anchor" name="eqtn59"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w_j} & = & -\frac{1}{n} \sum_x \left(
    \frac{y }{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right)
  \frac{\partial \sigma}{\partial w_j} \tag{58}\\
 & = & -\frac{1}{n} \sum_x \left(
    \frac{y}{\sigma(z)}
    -\frac{(1-y)}{1-\sigma(z)} \right)\sigma'(z) x_j.
\tag{59}\end{eqnarray}
<!--Putting everything over a common denominator and simplifying this
becomes:
<a class="displaced_anchor" name="eqtn60"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w_j} & = & \frac{1}{n}
  \sum_x \frac{\sigma'(z) x_j}{\sigma(z) (1-\sigma(z))}
  (\sigma(z)-y).
\tag{60}\end{eqnarray}
Using the definition of the sigmoid function, $\sigma(z) =
1/(1+e^{-z})$, and a little algebra we can show that $\sigma'(z) =
\sigma(z)(1-\sigma(z))$.  I'll ask you to verify this in an exercise
below, but for now let's accept it as given.  We see that the
$\sigma'(z)$ and $\sigma(z)(1-\sigma(z))$ terms cancel in the equation
just above, and it simplifies to become:
<a class="displaced_anchor" name="eqtn61"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w_j} =  \frac{1}{n} \sum_x x_j(\sigma(z)-y).
\tag{61}\end{eqnarray}
-->
Putting everything over a common denominator and simplifying, this becomes:
<a class="displaced_anchor" name="eqtn60"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w_j} & = & \frac{1}{n}
  \sum_x \frac{\sigma'(z) x_j}{\sigma(z) (1-\sigma(z))}
  (\sigma(z)-y).
\tag{60}\end{eqnarray}
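Using the definition of the sigmoid function, $\sigma(z) =
1/(1+e^{-z})$, and a little algebra we can show that $\sigma'(z) =
\sigma(z)(1-\sigma(z))$. I'll ask you to verify this in an exercise below, but for now let's accept it as given. We see that the $\sigma'(z)$ and $\sigma(z)(1-\sigma(z))$ terms cancel in the equation just above, and it simplifies to become: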
<a class="displaced_anchor" name="eqtn61"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w_j} =  \frac{1}{n} \sum_x x_j(\sigma(z)-y).
\tag{61}\end{eqnarray}
<!-- This is a beautiful expression.  It tells us that the rate at which
the weight learns is controlled by $\sigma(z)-y$, i.e., by the error
in the output.  The larger the error, the faster the neuron will
learn.  This is just what we'd intuitively expect.  In particular, it
avoids the learning slowdown caused by the $\sigma'(z)$ term in the
analogous equation for the quadratic cost, Equation <span id="margin_907356608801_reveal" class="equation_link">(55)</span><span id="margin_907356608801" class="marginequation" style="display: none;"><a href="chap3.html#eqtn55" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z)  \nonumber\end{eqnarray}</a></span><script>$('#margin_907356608801_reveal').click(function() {$('#margin_907356608801').toggle('slow', function() {});});</script>.
When we use the cross-entropy, the $\sigma'(z)$ term gets cancelled
out, and we no longer need worry about it being small.  This
cancellation is the special miracle ensured by the cross-entropy cost
function.  Actually, it's not really a miracle.  As we'll see later,
the cross-entropy was specially chosen to have just this property.</p>-->
This is a beautiful expression. It tells us that the rate at which the weight learns is controlled by $\sigma(z)-y$, i.e., by the error in the output. The larger the error, the faster the neuron will learn. This is just what we'd intuitively expect. In particular, it avoids the learning slowdown caused by the $\sigma'(z)$ term in the analogous equation for the quadratic cost, Equation <span id="margin_907356608801_reveal" class="equation_link">(55)</span><span id="margin_907356608801" class="marginequation" style="display: none;"><a href="chap3.html#eqtn55" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
    \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z)  \nonumber\end{eqnarray}</a></span><script>$('#margin_907356608801_reveal').click(function() {$('#margin_907356608801').toggle('slow', function() {});});</script>.
When we use the cross-entropy, the $\sigma'(z)$ term gets cancelled out, and we no longer need worry about it being small. This cancellation is the special miracle ensured by the cross-entropy cost function. Actually, it's not really a miracle. As we'll see later, the cross-entropy was specially chosen to have just this property.</p>
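<p>The effect of the cancellation can be seen numerically. The sketch below evaluates the two weight gradients for the toy neuron with input $x = 1$, desired output $y = 0$, and the badly wrong starting point $w = b = 2.0$. The variable names are assumptions made for this example, and $\sigma'(z) = \sigma(z)(1-\sigma(z))$ is used as stated in the text.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.0, 0.0        # the single toy training input and its desired output
w, b = 2.0, 2.0        # a starting point at which the neuron is saturated
a = sigmoid(w * x + b)
sigma_prime = a * (1 - a)

grad_quadratic = (a - y) * sigma_prime * x    # Equation (55)
grad_cross_entropy = (a - y) * x              # Equation (61) with n = 1
```

<p>Here the output is roughly $0.98$, so the cross-entropy gradient is roughly $0.98$ as well, whereas the quadratic gradient is multiplied by $\sigma'(z) \approx 0.018$ and comes out more than fifty times smaller.</p>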
<!--<p>In a similar way, we can compute the partial derivative for the bias.
I won't go through all the details again, but you can easily verify
that
<a class="displaced_anchor" name="eqtn62"></a>\begin{eqnarray}
  \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z)-y).
\tag{62}\end{eqnarray}
Again, this avoids the learning slowdown caused by the $\sigma'(z)$
term in the analogous equation for the quadratic cost,
Equation <span id="margin_93784536101_reveal" class="equation_link">(56)</span><span id="margin_93784536101" class="marginequation" style="display: none;"><a href="chap3.html#eqtn56" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
    \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}</a></span><script>$('#margin_93784536101_reveal').click(function() {$('#margin_93784536101').toggle('slow', function() {});});</script>.</p>-->
<p>In a similar way, we can compute the partial derivative for the bias. I won't go through all the details again, but you can easily verify that
<a class="displaced_anchor" name="eqtn62"></a>\begin{eqnarray}
  \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z)-y).
\tag{62}\end{eqnarray}
Again, this avoids the learning slowdown caused by the $\sigma'(z)$ term in the analogous equation for the quadratic cost, Equation <span id="margin_93784536101_reveal" class="equation_link">(56)</span><span id="margin_93784536101" class="marginequation" style="display: none;"><a href="chap3.html#eqtn56" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
    \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}</a></span><script>$('#margin_93784536101_reveal').click(function() {$('#margin_93784536101').toggle('slow', function() {});});</script>.</p>
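<p>One way to gain confidence in Equations (61) and (62) without redoing the algebra is to compare them against finite differences of the cost. The following is only a sketch for a neuron with a single input; the data, the step size <tt>h</tt>, and the function names are assumptions chosen for illustration.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, xs, ys):
    """Cross-entropy cost, Equation (57), for a neuron with one input."""
    total = 0.0
    for x, y in zip(xs, ys):
        a = sigmoid(w * x + b)
        total += y * math.log(a) + (1 - y) * math.log(1 - a)
    return -total / len(xs)

def analytic_gradients(w, b, xs, ys):
    """Equations (61) and (62)."""
    n = len(xs)
    dw = sum(x * (sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
    db = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
    return dw, db

# Hypothetical training data and parameters.
xs, ys = [1.0, -0.5, 2.0], [0.0, 1.0, 1.0]
w, b, h = 0.6, 0.9, 1e-6

dw, db = analytic_gradients(w, b, xs, ys)
# Central differences approximate the partial derivatives of the cost.
dw_numeric = (cost(w + h, b, xs, ys) - cost(w - h, b, xs, ys)) / (2 * h)
db_numeric = (cost(w, b + h, xs, ys) - cost(w, b - h, xs, ys)) / (2 * h)
```

<p>The analytic and numeric values should agree to many decimal places. A check like this is not a proof, but a disagreement would point to a mistake in the derivation.</p>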
<!--<p><h4><a name="exercise_35813"></a><a href="#exercise_35813">Exercise</a></h4><ul>
    <li> Verify that $\sigma'(z) = \sigma(z)(1-\sigma(z))$.</p><p></ul></p>-->
<p><h4><a name="exercise_35813"></a><a href="#exercise_35813">Exercise</a></h4><ul>
    <li> Verify that $\sigma'(z) = \sigma(z)(1-\sigma(z))$.</p><p></ul></p>
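<p>Before working through the algebra, it can be reassuring to test the identity numerically. The sketch below compares a central-difference estimate of $\sigma'(z)$ with $\sigma(z)(1-\sigma(z))$ at a few arbitrarily chosen points; it is a sanity check under those assumptions, not a substitute for the derivation the exercise asks for.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

h = 1e-6            # step size for the finite difference (an arbitrary choice)
max_gap = 0.0
for z in [-4.0, -1.0, 0.0, 0.5, 3.0]:
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    identity = sigmoid(z) * (1 - sigmoid(z))
    max_gap = max(max_gap, abs(numeric - identity))
```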
<!--<p>Let's return to the toy example we played with earlier, and explore
what happens when we use the cross-entropy instead of the quadratic
cost.  To re-orient ourselves, we'll begin with the case where the
quadratic cost did just fine, with starting weight $0.6$ and starting
bias $0.9$.  Press "Run" to see what happens when we replace the
quadratic cost by the cross-entropy:</p>-->
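<p>Let's return to the toy example we played with earlier, and explore what happens when we use the cross-entropy instead of the quadratic cost. To re-orient ourselves, we'll begin with the case where the quadratic cost did just fine, with starting weight $0.6$ and starting bias $0.9$. Press "Run" to see what happens when we replace the quadratic cost by the cross-entropy: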
</p>
<!--<p><script type="text/paperscript" src="js/saturation3.js" canvas="saturation3"></script><center><canvas id="saturation3" width="520" height="300"></canvas></center></p>
<p>Unsurprisingly, the neuron learns perfectly well in this instance,
just as it did earlier.  And now let's look at the case where our
neuron got stuck before (<a href="#saturation2_anchor">link</a>, for
comparison), with the weight and bias both starting at $2.0$:</p>-->
<p><script type="text/paperscript" src="js/saturation3.js" canvas="saturation3"></script><center><canvas id="saturation3" width="520" height="300"></canvas></center></p>
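<p>Unsurprisingly, the neuron learns perfectly well in this instance, just as it did earlier. And now let's look at the case where our neuron got stuck before (<a href="#saturation2_anchor">link</a>, for comparison), with the weight and bias both starting at $2.0$: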
</p>
<!--<p><script type="text/paperscript" src="js/saturation4.js" canvas="saturation4"></script><center><canvas id="saturation4" width="520" height="300"></canvas></center></p><p>Success!  This time the neuron learned quickly, just as we hoped.  If
you observe closely you can see that the slope of the cost curve was
much steeper initially than the initial flat region on the
corresponding curve for the quadratic cost.  It's that steepness which
the cross-entropy buys us, preventing us from getting stuck just when
we'd expect our neuron to learn fastest, i.e., when the neuron starts
out badly wrong.</p>-->
<p><script type="text/paperscript" src="js/saturation4.js" canvas="saturation4"></script><center><canvas id="saturation4" width="520" height="300"></canvas></center></p>
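<p>Success! This time the neuron learned quickly, just as we hoped. If you observe closely you can see that the slope of the cost curve was much steeper initially than the initial flat region on the corresponding curve for the quadratic cost. It's that steepness which the cross-entropy buys us, preventing us from getting stuck just when we'd expect our neuron to learn fastest, i.e., when the neuron starts out badly wrong.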
</p>
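<p>The behaviour in the two animations can be approximated offline with a few lines of Python. This is only a sketch: it trains the one-input neuron on the single example $x = 1$, $y = 0$, starting from $w = b = 2.0$, and the function name <tt>train</tt>, the number of epochs, and the use of $\eta = 0.15$ for both costs are assumptions made for illustration rather than the settings behind the animations.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(cost_name, w, b, eta, epochs, x=1.0, y=0.0):
    """Gradient descent on one training input; returns the output after each epoch."""
    outputs = []
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        delta = a - y                   # cross-entropy: Equations (61) and (62)
        if cost_name == "quadratic":
            delta *= a * (1 - a)        # quadratic cost keeps the sigma'(z) factor
        w -= eta * delta * x
        b -= eta * delta
        outputs.append(sigmoid(w * x + b))
    return outputs

quadratic = train("quadratic", 2.0, 2.0, eta=0.15, epochs=100)
cross_entropy = train("cross-entropy", 2.0, 2.0, eta=0.15, epochs=100)
```

<p>Under these assumptions the output for the quadratic cost is still above $0.9$ after $50$ epochs, while the output for the cross-entropy has already dropped well below $0.3$. What matters is the shape of the two curves: the quadratic run starts on a plateau, and the cross-entropy run does not.</p>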
<!--<p>I didn't say what learning rate was used in the examples just
illustrated.  Earlier, with the quadratic cost, we used $\eta = 0.15$.
Should we have used the same learning rate in the new examples?  In
fact, with the change in cost function it's not possible to say
precisely what it means to use the "same" learning rate; it's an
apples and oranges comparison.  For both cost functions I simply
experimented to find a learning rate that made it possible to see what
is going on.</p>-->
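<p>I didn't say what learning rate was used in the examples just illustrated. Earlier, with the quadratic cost, we used $\eta = 0.15$. Should we have used the same learning rate in the new examples? In fact, with the change in cost function it's not possible to say precisely what it means to use the "same" learning rate; it's an apples and oranges comparison. For both cost functions I simply experimented to find a learning rate that made it possible to see what is going on.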
</p>
<!--<p>You might object that this makes the graphs above meaningless.  Who
cares how fast the neuron learns, when our choice of learning rate was
arbitrary?!  That objection misses the point.  The point of the graphs
isn't about the absolute speed of learning.  It's about how the speed
of learning changes.  In particular, when we use the quadratic cost
learning is <em>slower</em> when the neuron is unambiguously wrong than
it is later on, as the neuron gets closer to the correct output; while
with the cross-entropy learning is faster when the neuron is
unambiguously wrong.  Those statements don't depend on how the
learning rate is set.</p>-->
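<p>You might object that this makes the graphs above meaningless. Who cares how fast the neuron learns, when our choice of learning rate was arbitrary?! That objection misses the point. The point of the graphs isn't about the absolute speed of learning. It's about how the speed of learning changes. In particular, when we use the quadratic cost, learning is <em>slower</em> when the neuron is unambiguously wrong than it is later on, as the neuron gets closer to the correct output; while with the cross-entropy, learning is faster when the neuron is unambiguously wrong. Those statements don't depend on how the learning rate is set.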
</p>
<!--<p>We've been studying the cross-entropy for a single neuron.  However,
it's easy to generalize the cross-entropy to many-neuron multi-layer
networks.  In particular, suppose $y = y_1, y_2, \ldots$ are the
desired values at the output neurons, i.e., the neurons in the final
layer, while $a^L_1, a^L_2, \ldots$ are the actual output values.
Then we define the cross-entropy by
<a class="displaced_anchor" name="eqtn63"></a>\begin{eqnarray}  C = -\frac{1}{n} \sum_x
  \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right].
\tag{63}\end{eqnarray}
This is the same as our earlier expression,
Equation <span id="margin_529736937390_reveal" class="equation_link">(57)</span><span id="margin_529736937390" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span><script>$('#margin_529736937390_reveal').click(function() {$('#margin_529736937390').toggle('slow', function() {});});</script>, except now we've got the
$\sum_j$ summing over all the output neurons.  I won't explicitly work
through a derivation, but it should be plausible that using the
expression <span id="margin_932302704635_reveal" class="equation_link">(63)</span><span id="margin_932302704635" class="marginequation" style="display: none;"><a href="chap3.html#eqtn63" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C = -\frac{1}{n} \sum_x
  \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right] \nonumber\end{eqnarray}</a></span><script>$('#margin_932302704635_reveal').click(function() {$('#margin_932302704635').toggle('slow', function() {});});</script> avoids a learning slowdown in
many-neuron networks.  If you're interested, you can work through the
derivation in the problem below.  </p>-->
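<p>We've been studying the cross-entropy for a single neuron. However, it's easy to generalize the cross-entropy to many-neuron multi-layer networks. In particular, suppose $y = y_1, y_2, \ldots$ are the desired values at the output neurons, i.e., the neurons in the final layer, while $a^L_1, a^L_2, \ldots$ are the actual output values. Then we define the cross-entropy by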
<a class="displaced_anchor" name="eqtn63"></a>\begin{eqnarray}  C = -\frac{1}{n} \sum_x
  \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right].
\tag{63}\end{eqnarray}
This is the same as our earlier expression, Equation <span id="margin_529736937390_reveal" class="equation_link">(57)</span><span id="margin_529736937390" class="marginequation" style="display: none;"><a href="chap3.html#eqtn57" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right] \nonumber\end{eqnarray}</a></span><script>$('#margin_529736937390_reveal').click(function() {$('#margin_529736937390').toggle('slow', function() {});});</script>, except now we've got the $\sum_j$ summing over all the output neurons. I won't explicitly work through a derivation, but it should be plausible that using the expression <span id="margin_932302704635_reveal" class="equation_link">(63)</span><span id="margin_932302704635" class="marginequation" style="display: none;"><a href="chap3.html#eqtn63" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C = -\frac{1}{n} \sum_x
  \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right] \nonumber\end{eqnarray}</a></span><script>$('#margin_932302704635_reveal').click(function() {$('#margin_932302704635').toggle('slow', function() {});});</script> avoids a learning slowdown in many-neuron networks. If you're interested, you can work through the derivation in the problem below.
</p>
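<p>Equation (63) translates almost directly into code. The sketch below assumes the output activations $a^L_j$ have already been computed by some network and are passed in as plain lists. The function name and the sample numbers are hypothetical.</p>

```python
import math

def cross_entropy_cost(outputs, targets):
    """Equation (63).

    outputs[i][j] is the activation a^L_j of output neuron j on training
    input i, and targets[i][j] is the corresponding desired value y_j.
    """
    total = 0.0
    for a, y in zip(outputs, targets):
        total += sum(y_j * math.log(a_j) + (1 - y_j) * math.log(1 - a_j)
                     for a_j, y_j in zip(a, y))
    return -total / len(outputs)

# Hypothetical activations of a three-neuron output layer on two training inputs.
outputs = [[0.9, 0.1, 0.2], [0.3, 0.8, 0.1]]
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
cost = cross_entropy_cost(outputs, targets)
```

<p>With a single output neuron the inner sum has one term, and the function reduces to Equation (57).</p>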
<!--<p>When should we use the cross-entropy instead of the quadratic cost?
In fact, the cross-entropy is nearly always the better choice,
provided the output neurons are sigmoid neurons.  To see why, consider
that when we're setting up the network we usually initialize the
weights and biases using some sort of randomization.  It may happen
that those initial choices result in the network being decisively
wrong for some training input - that is, an output neuron will have
saturated near $1$, when it should be $0$, or vice versa.  If we're
using the quadratic cost that will slow down learning.  It won't stop
learning completely, since the weights will continue learning from
other training inputs, but it's obviously undesirable.</p>-->
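<p>When should we use the cross-entropy instead of the quadratic cost? In fact, the cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons. To see why, consider that when we're setting up the network we usually initialize the weights and biases using some sort of randomization. It may happen that those initial choices result in the network being decisively wrong for some training input: an output neuron will have saturated near $1$ when it should be $0$, or vice versa. If we're using the quadratic cost, that will slow down learning. It won't stop learning completely, since the weights will continue learning from other training inputs, but it's obviously undesirable.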
</p>
<!--<p><h4><a name="exercises_824189"></a><a href="#exercises_824189">Exercises</a></h4><ul>
<li> One gotcha with the cross-entropy is that it can be difficult at
  first to remember the respective roles of the $y$s and the $a$s.
  It's easy to get confused about whether the right form is $-[y \ln a
  + (1-y) \ln (1-a)]$ or $-[a \ln y + (1-a) \ln (1-y)]$.  What happens
  to the second of these expressions when $y = 0$ or $1$?  Does this
  problem afflict the first expression?  Why or why not? </p>-->
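<p><h4><a name="exercises_824189"></a><a href="#exercises_824189">Exercises</a></h4><ul>
<li> One gotcha with the cross-entropy is that it can be difficult at first to remember the respective roles of the $y$s and the $a$s. It's easy to get confused about whether the right form is $-[y \ln a
  + (1-y) \ln (1-a)]$ or $-[a \ln y + (1-a) \ln (1-y)]$. What happens to the second of these expressions when $y = 0$ or $1$? Does this problem afflict the first expression? Why or why not?</p>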
<!--<p><li> In the single-neuron discussion at the start of this section, I
  argued that the cross-entropy is small if $\sigma(z) \approx y$ for
  all training inputs.  The argument relied on $y$ being equal to
  either $0$ or $1$.  This is usually true in classification problems,
  but for other problems (e.g., regression problems) $y$ can sometimes
  take values intermediate between $0$ and $1$.  Show that the
  cross-entropy is still minimized when $\sigma(z) = y$ for all
  training inputs.  When this is the case the cross-entropy has the
  value:
  <a class="displaced_anchor" name="eqtn64"></a>\begin{eqnarray}
    C = -\frac{1}{n} \sum_x [y \ln y+(1-y) \ln(1-y)].
  \tag{64}\end{eqnarray}
  The quantity $-[y \ln y+(1-y)\ln(1-y)]$ is sometimes known as the
  <a href="http://en.wikipedia.org/wiki/Binary_entropy_function">binary
    entropy</a>.</p>-->
<p><li>
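In the single-neuron discussion at the start of this section, I argued that the cross-entropy is small if $\sigma(z) \approx y$ for all training inputs. The argument relied on $y$ being equal to either $0$ or $1$. This is usually true in classification problems, but for other problems (e.g., regression problems) $y$ can sometimes take values intermediate between $0$ and $1$. Show that the cross-entropy is still minimized when $\sigma(z) = y$ for all training inputs. When this is the case the cross-entropy has the value: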
  <a class="displaced_anchor" name="eqtn64"></a>\begin{eqnarray}
    C = -\frac{1}{n} \sum_x [y \ln y+(1-y) \ln(1-y)].
  \tag{64}\end{eqnarray}
  The quantity $-[y \ln y+(1-y)\ln(1-y)]$ is sometimes known as the
  <a href="http://en.wikipedia.org/wiki/Binary_entropy_function">binary entropy</a>.</p>
<!--<p></ul></p><p><h4><a name="problems_933474"></a><a href="#problems_933474">Problems</a></h4><ul>
<li><strong>Many-layer multi-neuron networks</strong> In the notation introduced in
  the <a href="chap2.html">last chapter</a>, show that for the quadratic
  cost the partial derivative with respect to weights in the output
  layer is
  <a class="displaced_anchor" name="eqtn65"></a>\begin{eqnarray}
      \frac{\partial C}{\partial w^L_{jk}} & = & \frac{1}{n}
      \sum_x a^{L-1}_k  (a^L_j-y_j) \sigma'(z^L_j).
  \tag{65}\end{eqnarray}
  The term $\sigma'(z^L_j)$ causes a learning slowdown whenever an
  output neuron saturates on the wrong value.  Show that for the
  cross-entropy cost the output error $\delta^L$ for a single training
  example $x$ is given by
  <a class="displaced_anchor" name="eqtn66"></a>\begin{eqnarray}
    \delta^L = a^L-y.
  \tag{66}\end{eqnarray}
  Use this expression to show that the partial derivative with respect
  to the weights in the output layer is given by
  <a class="displaced_anchor" name="eqtn67"></a>\begin{eqnarray}
      \frac{\partial C}{\partial w^L_{jk}} & = & \frac{1}{n} \sum_x
      a^{L-1}_k  (a^L_j-y_j).
  \tag{67}\end{eqnarray}
  The $\sigma'(z^L_j)$ term has vanished, and so the cross-entropy
  avoids the problem of learning slowdown, not just when used with a
  single neuron, as we saw earlier, but also in many-layer
  multi-neuron networks.  A simple variation on this analysis holds
  also for the biases.  If this is not obvious to you, then you should
  work through that analysis as well.</p>-->
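<p></ul></p><p><h4><a name="problems_933474"></a><a href="#problems_933474">Problems</a></h4><ul>
<li><strong>Many-layer multi-neuron networks</strong>
In the notation introduced in the <a href="chap2.html">last chapter</a>, show that for the quadratic cost the partial derivative with respect to weights in the output layer is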
  <a class="displaced_anchor" name="eqtn65"></a>\begin{eqnarray}
      \frac{\partial C}{\partial w^L_{jk}} & = & \frac{1}{n}
      \sum_x a^{L-1}_k  (a^L_j-y_j) \sigma'(z^L_j).
  \tag{65}\end{eqnarray}
  The term $\sigma'(z^L_j)$ causes a learning slowdown whenever an output neuron saturates on the wrong value. Show that for the cross-entropy cost the output error $\delta^L$ for a single training example $x$ is given by
  <a class="displaced_anchor" name="eqtn66"></a>\begin{eqnarray}
    \delta^L = a^L-y.
  \tag{66}\end{eqnarray}
  Use this expression to show that the partial derivative with respect to the weights in the output layer is given by
  <a class="displaced_anchor" name="eqtn67"></a>\begin{eqnarray}
      \frac{\partial C}{\partial w^L_{jk}} & = & \frac{1}{n} \sum_x
      a^{L-1}_k  (a^L_j-y_j).
  \tag{67}\end{eqnarray}
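  The $\sigma'(z^L_j)$ term has vanished, and so the cross-entropy avoids the problem of learning slowdown, not just when used with a single neuron, as we saw earlier, but also in many-layer multi-neuron networks. A simple variation on this analysis holds also for the biases. If this is not obvious to you, then you should work through that analysis as well.</p>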
<!--<p><li><strong>Using the quadratic cost when we have linear neurons in the
  output layer</strong> Suppose that we have a many-layer multi-neuron
  network.  Suppose all the neurons in the final layer are
  <em>linear neurons</em>, meaning that the sigmoid activation function
  is not applied, and the outputs are simply $a^L_j = z^L_j$.  Show
  that if we use the quadratic cost function then the output error
  $\delta^L$ for a single training example $x$ is given by
  <a class="displaced_anchor" name="eqtn68"></a>\begin{eqnarray}
    \delta^L = a^L-y.
  \tag{68}\end{eqnarray}
  Similarly to the previous problem, use this expression to show that
  the partial derivatives with respect to the weights and biases in
  the output layer are given by
  <a class="displaced_anchor" name="eqtn69"></a><a class="displaced_anchor" name="eqtn70"></a>\begin{eqnarray}
      \frac{\partial C}{\partial w^L_{jk}} & = & \frac{1}{n} \sum_x
      a^{L-1}_k  (a^L_j-y_j) \tag{69}\\
      \frac{\partial C}{\partial b^L_{j}} & = & \frac{1}{n} \sum_x
      (a^L_j-y_j).
  \tag{70}\end{eqnarray}
  This shows that if the output neurons are linear neurons then the
  quadratic cost will not give rise to any problems with a learning
  slowdown.  In this case the quadratic cost is, in fact, an
  appropriate cost function to use.
</ul></p>-->
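<p><li><strong>Using the quadratic cost when we have linear neurons in the output layer</strong>
Suppose that we have a many-layer multi-neuron network. Suppose all the neurons in the final layer are <em>linear neurons</em>, meaning that the sigmoid activation function is not applied, and the outputs are simply $a^L_j = z^L_j$. Show that if we use the quadratic cost function then the output error $\delta^L$ for a single training example $x$ is given by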
  <a class="displaced_anchor" name="eqtn68"></a>\begin{eqnarray}
    \delta^L = a^L-y.
  \tag{68}\end{eqnarray}
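Similarly to the previous problem, use this expression to show that the partial derivatives with respect to the weights and biases in the output layer are given by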
  <a class="displaced_anchor" name="eqtn69"></a><a class="displaced_anchor" name="eqtn70"></a>\begin{eqnarray}
      \frac{\partial C}{\partial w^L_{jk}} & = & \frac{1}{n} \sum_x
      a^{L-1}_k  (a^L_j-y_j) \tag{69}\\
      \frac{\partial C}{\partial b^L_{j}} & = & \frac{1}{n} \sum_x
      (a^L_j-y_j).
  \tag{70}\end{eqnarray}
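This shows that if the output neurons are linear neurons then the quadratic cost will not give rise to any problems with a learning slowdown. In this case the quadratic cost is, in fact, an appropriate cost function to use.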
</ul></p>
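<p>The three output-error formulas from these problems can be compared numerically. The following is a minimal sketch for a single output neuron, assuming an illustrative weighted input $z = 5$ and target $y = 0$ (a badly saturated neuron). The function names are hypothetical and are not part of <tt>network2.py</tt>.</p>

```python
import math


def sigmoid(z):
    """The sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid, sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def output_errors(z, y):
    """Output error delta for one output neuron under three setups.

    z is the weighted input and y the desired output. The keys name the
    activation and cost used; these labels are illustrative only.
    """
    a = sigmoid(z)
    return {
        "sigmoid + quadratic": (a - y) * sigmoid_prime(z),
        "sigmoid + cross-entropy": a - y,
        "linear + quadratic": z - y,  # linear neuron: a = z
    }


# Assumed example values: the neuron outputs nearly 1 but should output 0.
for name, delta in sorted(output_errors(5.0, 0.0).items()):
    print("{0}: {1:.4f}".format(name, delta))
```

<p>With the sigmoid and the quadratic cost the error is suppressed by the small factor $\sigma'(z)$, while in the other two cases it stays proportional to how wrong the output is.</p>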
<!--<p><h4><a name="using_the_cross-entropy_to_classify_mnist_digits"></a><a href="#using_the_cross-entropy_to_classify_mnist_digits">Using the cross-entropy to classify MNIST digits</a></h4></p><p></p><p>The cross-entropy is easy to implement as part of a program which
learns using gradient descent and backpropagation.  We'll do that
<a href="#handwriting_recognition_revisited_the_code">later in the
  chapter</a>, developing an improved version of our
<a href="chap1.html#implementing_our_network_to_classify_digits">earlier
  program</a> for classifying the MNIST handwritten digits,
<tt>network.py</tt>.  The new program is called <tt>network2.py</tt>, and
incorporates not just the cross-entropy, but also several other
techniques developed in this chapter*<span class="marginnote">
*The code is available
  <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/code/network2.py">on
    GitHub</a>.</span>.-->
<p><h4><a name="using_the_cross-entropy_to_classify_mnist_digits"></a><a href="#using_the_cross-entropy_to_classify_mnist_digits">Using the cross-entropy to classify MNIST digits</a></h4></p><p></p><p>The cross-entropy is easy to implement as part of a program that learns using gradient descent and backpropagation. We'll do that <a href="#handwriting_recognition_revisited_the_code">later in the chapter</a>, developing an improved version of our <a href="chap1.html#implementing_our_network_to_classify_digits">earlier program</a> for classifying the MNIST handwritten digits, <tt>network.py</tt>.
The new program is called <tt>network2.py</tt>, and incorporates not just the cross-entropy, but also several other techniques developed in this chapter*<span class="marginnote">
    *The code is available <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network2.py">on GitHub</a>.</span>.
<!-- For now, let's look at how well our new program
classifies MNIST digits.  As was the case in Chapter 1, we'll use a
network with $30$ hidden neurons, and we'll use a mini-batch size of
$10$.  We set the learning rate to $\eta = 0.5$*<span class="marginnote">
*In Chapter 1
  we used the quadratic cost and a learning rate of $\eta = 3.0$.  As
  discussed above, it's not possible to say precisely what it means to
  use the "same" learning rate when the cost function is changed.
  For both cost functions I experimented to find a learning rate that
  provides near-optimal performance, given the other hyper-parameter
  choices. <br/> <br/> There is, incidentally, a very rough
  general heuristic for relating the learning rate for the
  cross-entropy and the quadratic cost.  As we saw earlier, the
  gradient terms for the quadratic cost have an extra $\sigma' =
  \sigma(1-\sigma)$ term in them.  Suppose we average this over values
  for $\sigma$, $\int_0^1 d\sigma \sigma(1-\sigma) = 1/6$.  We see
  that (very roughly) the quadratic cost learns an average of $6$
  times slower, for the same learning rate.  This suggests that a
  reasonable starting point is to divide the learning rate for the
  quadratic cost by $6$.  Of course, this argument is far from
  rigourous, and shouldn't be taken too seriously.  Still, it can
  sometimes be a useful starting point.</span> and we train for $30$ epochs.
The interface to <tt>network2.py</tt> is slightly different than
<tt>network.py</tt>, but it should still be clear what is going on.  You
can, by the way, get documentation about <tt>network2.py</tt>'s
interface by using commands such as <tt>help(network2.Network.SGD)</tt>
in a Python shell.</p> -->
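For now, let's look at how well our new program classifies MNIST digits. As was the case in Chapter 1, we'll use a network with $30$ hidden neurons, and we'll use a mini-batch size of $10$. We set the learning rate to $\eta = 0.5$*<span class="marginnote">
  *In Chapter 1 we used the quadratic cost and a learning rate of $\eta = 3.0$. As discussed above, it's not possible to say precisely what it means to use the "same" learning rate when the cost function is changed. For both cost functions I experimented to find a learning rate that provides near-optimal performance, given the other hyper-parameter choices.<br/> <br/>There is, incidentally, a very rough general heuristic for relating the learning rate for the cross-entropy and the quadratic cost. As we saw earlier, the gradient terms for the quadratic cost have an extra $\sigma' = \sigma(1-\sigma)$ term in them. Suppose we average this over values for $\sigma$, $\int_0^1 d\sigma \sigma(1-\sigma) = 1/6$. We see that (very roughly) the quadratic cost learns an average of $6$ times slower, for the same learning rate. This suggests that a reasonable starting point is to divide the learning rate for the quadratic cost by $6$. Of course, this argument is far from rigorous, and shouldn't be taken too seriously. Still, it can sometimes be a useful starting point.</span> and we train for $30$ epochs.
The interface to <tt>network2.py</tt> is slightly different from that of <tt>network.py</tt>, but it should still be clear what is going on. You can, by the way, get documentation about <tt>network2.py</tt>'s interface by using commands such as <tt>help(network2.Network.SGD)</tt> in a Python shell.</p>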
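<p>The $1/6$ figure in the margin note above follows from an elementary integral. The step below is a worked check that the text does not spell out. Treating $\sigma$ as uniformly spread over $[0,1]$ is only the rough assumption the heuristic itself makes.</p>
\begin{eqnarray}
  \int_0^1 d\sigma \, \sigma(1-\sigma)
  = \left[ \frac{\sigma^2}{2} - \frac{\sigma^3}{3} \right]_0^1
  = \frac{1}{2} - \frac{1}{3} = \frac{1}{6}. \nonumber
\end{eqnarray}
<p>Dividing the Chapter 1 learning rate $\eta = 3.0$ by $6$ gives $0.5$, which is consistent with the value used here, although the rule of thumb is too crude to read much into that agreement.</p>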
<p><div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">mnist_loader</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> \
<span class="o">...</span> <span class="n">mnist_loader</span><span class="o">.</span><span class="n">load_data_wrapper</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">network2</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">],</span> <span class="n">cost</span><span class="o">=</span><span class="n">network2</span><span class="o">.</span><span class="n">CrossEntropyCost</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">large_weight_initializer</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">test_data</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
<!--</p><p>Note, by the way, that the <tt>net.large_weight_initializer()</tt>
command is used to initialize the weights and biases in the same way
as described in Chapter 1.  We need to run this command because later
in this chapter we'll change the default weight initialization in our
networks.  The result from running the above sequence of commands is a
network with $95.49$ percent accuracy.  This is pretty close to the
result we obtained in Chapter 1, $95.42$ percent, using the quadratic
cost.</p><p>Let's look also at the case where we use $100$ hidden neurons, the
cross-entropy, and otherwise keep the parameters the same.  In this
case we obtain an accuracy of $96.82$ percent.  That's a substantial
improvement over the results from Chapter 1, where we obtained a
classification accuracy of $96.59$ percent, using the quadratic cost.
That may look like a small change, but consider that the error rate
has dropped from $3.41$ percent to $3.18$ percent.  That is, we've
eliminated about one in fourteen of the original errors.  That's quite
  a handy improvement.</p>-->
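</p><p>Note, by the way, that the <tt>net.large_weight_initializer()</tt> command is used to initialize the weights and biases in the same way as described in Chapter 1. We need to run this command because later in this chapter we'll change the default weight initialization in our networks. The result from running the above sequence of commands is a network with $95.49$ percent accuracy. This is pretty close to the result we obtained in Chapter 1, $95.42$ percent, using the quadratic cost.</p>
<p>Let's look also at the case where we use $100$ hidden neurons, the cross-entropy, and otherwise keep the parameters the same. In this case we obtain an accuracy of $96.82$ percent. That's a substantial improvement over the results from Chapter 1, where we obtained a classification accuracy of $96.59$ percent, using the quadratic cost. That may look like a small change, but consider that the error rate has dropped from $3.41$ percent to $3.18$ percent. That is, we've eliminated about one in fourteen of the original errors. That's quite a handy improvement.</p>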
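<p>The arithmetic behind the "one in fourteen" remark can be checked directly from the two accuracies quoted above. This is a small sketch, and the helper name <tt>error_reduction</tt> is hypothetical.</p>

```python
def error_reduction(old_accuracy, new_accuracy):
    """Return (old error, new error, fraction of old errors eliminated).

    Accuracies and error rates are in percent.
    """
    old_error = 100.0 - old_accuracy
    new_error = 100.0 - new_accuracy
    fraction = (old_error - new_error) / old_error
    return old_error, new_error, fraction


# Accuracies quoted in the text: quadratic cost, then cross-entropy.
old_error, new_error, fraction = error_reduction(96.59, 96.82)
print("error rate: {0:.2f}% to {1:.2f}%".format(old_error, new_error))
print("fraction of errors removed: {0:.3f}".format(fraction))
print("that is about 1 error in {0:.1f}".format(1.0 / fraction))
```

<p>The ratio comes out between one in fourteen and one in fifteen, in line with the rough figure given in the text.</p>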
<!--<p>It's encouraging that the cross-entropy cost gives us similar or
better results than the quadratic cost.  However, these results don't
conclusively prove that the cross-entropy is a better choice.  The
reason is that I've put only a little effort into choosing
hyper-parameters such as learning rate, mini-batch size, and so on.
For the improvement to be really convincing we'd need to do a thorough
job optimizing such hyper-parameters.  Still, the results are
encouraging, and reinforce our earlier theoretical argument that the
cross-entropy is a better choice than the quadratic cost.</p>-->
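<p>It's encouraging that the cross-entropy cost gives us similar or better results than the quadratic cost. However, these results don't conclusively prove that the cross-entropy is a better choice. The reason is that I've put only a little effort into choosing hyper-parameters such as learning rate, mini-batch size, and so on. For the improvement to be really convincing we'd need to do a thorough job optimizing such hyper-parameters. Still, the results are encouraging, and reinforce our earlier theoretical argument that the cross-entropy is a better choice than the quadratic cost.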
</p>
<!--<p>This, by the way, is part of a general pattern that we'll see through
this chapter and, indeed, through much of the rest of the book.  We'll
develop a new technique, we'll try it out, and we'll get "improved"
results.  It is, of course, nice that we see such improvements.  But
the interpretation of such improvements is always problematic.
They're only truly convincing if we see an improvement after putting
tremendous effort into optimizing all the other hyper-parameters.
That's a great deal of work, requiring lots of computing power, and
we're not usually going to do such an exhaustive investigation.
Instead, we'll proceed on the basis of informal tests like those done
above.  Still, you should keep in mind that such tests fall short of
definitive proof, and remain alert to signs that the arguments are
breaking down.</p>-->
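<p>This, by the way, is part of a general pattern that we'll see through this chapter and, indeed, through much of the rest of the book. We'll develop a new technique, we'll try it out, and we'll get "improved" results. It is, of course, nice that we see such improvements. But the interpretation of such improvements is always problematic. They're only truly convincing if we see an improvement after putting tremendous effort into optimizing all the other hyper-parameters. That's a great deal of work, requiring lots of computing power, and we're not usually going to do such an exhaustive investigation. Instead, we'll proceed on the basis of informal tests like those done above. Still, you should keep in mind that such tests fall short of definitive proof, and remain alert to signs that the arguments are breaking down.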
</p>
<!--<p>By now, we've discussed the cross-entropy at great length.  Why go to
so much effort when it gives only a small improvement to our MNIST
results?  Later in the chapter we'll see other techniques - notably,
<a href="#overfitting_and_regularization">regularization</a> - which
give much bigger improvements.  So why so much focus on cross-entropy?
Part of the reason is that the cross-entropy is a widely-used cost
function, and so is worth understanding well.  But the more important
reason is that neuron saturation is an important problem in neural
nets, a problem we'll return to repeatedly throughout the book.  And
so I've discussed the cross-entropy at length because it's a good
laboratory to begin understanding neuron saturation and how it may be
addressed.</p>-->
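<p>By now, we've discussed the cross-entropy at great length. Why go to so much effort when it gives only a small improvement to our MNIST results? Later in the chapter we'll see other techniques, notably <a href="#overfitting_and_regularization">regularization</a>, which give much bigger improvements. So why so much focus on cross-entropy? Part of the reason is that the cross-entropy is a widely-used cost function, and so is worth understanding well. But the more important reason is that neuron saturation is an important problem in neural nets, a problem we'll return to repeatedly throughout the book. And so I've discussed the cross-entropy at length because it's a good laboratory to begin understanding neuron saturation and how it may be addressed.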
</p>
<!--<p><h4><a name="what_does_the_cross-entropy_mean_where_does_it_come_from"></a><a href="#what_does_the_cross-entropy_mean_where_does_it_come_from">What does the cross-entropy mean?  Where does it come from?</a></h4></p>-->
<p><h4><a name="what_does_the_cross-entropy_mean_where_does_it_come_from"></a><a href="#what_does_the_cross-entropy_mean_where_does_it_come_from">What does the cross-entropy mean?  Where does it come from?</a></h4></p>
<!--<p>Our discussion of the cross-entropy has focused on algebraic analysis
and practical implementation.  That's useful, but it leaves unanswered
broader conceptual questions, like: what does the cross-entropy mean?
Is there some intuitive way of thinking about the cross-entropy?  And
how could we have dreamed up the cross-entropy in the first place?</p>-->
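<p>Our discussion of the cross-entropy has focused on algebraic analysis and practical implementation. That's useful, but it leaves unanswered broader conceptual questions, like: what does the cross-entropy mean? Is there some intuitive way of thinking about the cross-entropy? And how could we have dreamed up the cross-entropy in the first place?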
</p>
<!--<p>Let's begin with the last of these questions: what could have
motivated us to think up the cross-entropy in the first place?
Suppose we'd discovered the learning slowdown described earlier, and
understood that the origin was the $\sigma'(z)$ terms in
Equations <span id="margin_957799962109_reveal" class="equation_link">(55)</span><span id="margin_957799962109" class="marginequation" style="display: none;"><a href="chap3.html#eqtn55" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z)  \nonumber\end{eqnarray}</a></span><script>$('#margin_957799962109_reveal').click(function() {$('#margin_957799962109').toggle('slow', function() {});});</script> and <span id="margin_397648571817_reveal" class="equation_link">(56)</span><span id="margin_397648571817" class="marginequation" style="display: none;"><a href="chap3.html#eqtn56" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}</a></span><script>$('#margin_397648571817_reveal').click(function() {$('#margin_397648571817').toggle('slow', function() {});});</script>.  After staring at
those equations for a bit, we might wonder if it's possible to choose
a cost function so that the $\sigma'(z)$ term disappeared.  In that
case, the cost $C = C_x$ for a single training example $x$ would
satisfy
<a class="displaced_anchor" name="eqtn71"></a><a class="displaced_anchor" name="eqtn72"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w_j} & = & x_j(a-y) \tag{71}\\

  \frac{\partial C}{\partial b } & = & (a-y).
\tag{72}\end{eqnarray}-->
Let's begin with the last of these questions: what could have motivated us to think up the cross-entropy in the first place? Suppose we'd discovered the learning slowdown described earlier, and understood that the origin was the $\sigma'(z)$ terms in Equations <span id="margin_957799962109_reveal" class="equation_link">(55)</span><span id="margin_957799962109" class="marginequation" style="display: none;"><a href="chap3.html#eqtn55" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w} & = & (a-y)\sigma'(z) x = a \sigma'(z)  \nonumber\end{eqnarray}</a></span><script>$('#margin_957799962109_reveal').click(function() {$('#margin_957799962109').toggle('slow', function() {});});</script> and <span id="margin_397648571817_reveal" class="equation_link">(56)</span><span id="margin_397648571817" class="marginequation" style="display: none;"><a href="chap3.html#eqtn56" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
    \frac{\partial C}{\partial b} & = & (a-y)\sigma'(z) = a \sigma'(z) \nonumber\end{eqnarray}</a></span><script>$('#margin_397648571817_reveal').click(function() {$('#margin_397648571817').toggle('slow', function() {});});</script>. After staring at those equations for a bit, we might wonder if it's possible to choose a cost function so that the $\sigma'(z)$ term disappeared. In that case, the cost $C = C_x$ for a single training example $x$ would satisfy
<a class="displaced_anchor" name="eqtn71"></a><a class="displaced_anchor" name="eqtn72"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w_j} & = & x_j(a-y) \tag{71}\\

  \frac{\partial C}{\partial b } & = & (a-y).
\tag{72}\end{eqnarray}
<!--If we could choose the cost function to make these equations true,
then they would capture in a simple way the intuition that the greater
the initial error, the faster the neuron learns.  They'd also
eliminate the problem of a learning slowdown.  In fact, starting from
these equations we'll now show that it's possible to derive the form
of the cross-entropy, simply by following our mathematical noses.  To
see this, note that from the chain rule we have
<a class="displaced_anchor" name="eqtn73"></a>\begin{eqnarray}
  \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}
  \sigma'(z).
\tag{73}\end{eqnarray}
Using $\sigma'(z) = \sigma(z)(1-\sigma(z)) = a(1-a)$ the last equation
becomes
<a class="displaced_anchor" name="eqtn74"></a>\begin{eqnarray}
  \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}
  a(1-a).
\tag{74}\end{eqnarray}
Comparing to Equation <span id="margin_645820283975_reveal" class="equation_link">(72)</span><span id="margin_645820283975" class="marginequation" style="display: none;"><a href="chap3.html#eqtn72" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial b } & = & (a-y) \nonumber\end{eqnarray}</a></span><script>$('#margin_645820283975_reveal').click(function() {$('#margin_645820283975').toggle('slow', function() {});});</script> we obtain
<a class="displaced_anchor" name="eqtn75"></a>\begin{eqnarray}
  \frac{\partial C}{\partial a} = \frac{a-y}{a(1-a)}.
\tag{75}\end{eqnarray}
Integrating this expression with respect to $a$ gives
<a class="displaced_anchor" name="eqtn76"></a>\begin{eqnarray}
  C = -[y \ln a + (1-y) \ln (1-a)]+ {\rm constant},
\tag{76}\end{eqnarray}
for some constant of integration.  This is the contribution to the
cost from a single training example, $x$.  To get the full cost
function we must average over training examples, obtaining
<a class="displaced_anchor" name="eqtn77"></a>\begin{eqnarray}
  C = -\frac{1}{n} \sum_x [y \ln a +(1-y) \ln(1-a)] + {\rm constant},
\tag{77}\end{eqnarray}-->
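If we could choose the cost function to make these equations true, then they would capture in a simple way the intuition that the greater the initial error, the faster the neuron learns. They'd also eliminate the problem of a learning slowdown. In fact, starting from these equations we'll now show that it's possible to derive the form of the cross-entropy, simply by following our mathematical noses. To see this, note that from the chain rule we have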
<a class="displaced_anchor" name="eqtn73"></a>\begin{eqnarray}
  \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}
  \sigma'(z).
\tag{73}\end{eqnarray}
Using $\sigma'(z) = \sigma(z)(1-\sigma(z)) = a(1-a)$ the last equation becomes
<a class="displaced_anchor" name="eqtn74"></a>\begin{eqnarray}
  \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}
  a(1-a).
\tag{74}\end{eqnarray}
Comparing to Equation <span id="margin_645820283975_reveal" class="equation_link">(72)</span><span id="margin_645820283975" class="marginequation" style="display: none;"><a href="chap3.html#eqtn72" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial b } & = & (a-y) \nonumber\end{eqnarray}</a></span><script>$('#margin_645820283975_reveal').click(function() {$('#margin_645820283975').toggle('slow', function() {});});</script> we obtain
<a class="displaced_anchor" name="eqtn75"></a>\begin{eqnarray}
  \frac{\partial C}{\partial a} = \frac{a-y}{a(1-a)}.
\tag{75}\end{eqnarray}
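Integrating this expression with respect to $a$ gives
<a class="displaced_anchor" name="eqtn76"></a>\begin{eqnarray}
  C = -[y \ln a + (1-y) \ln (1-a)]+ {\rm constant},
\tag{76}\end{eqnarray}
for some constant of integration. This is the contribution to the cost from a single training example, $x$. To get the full cost function we must average over training examples, obtaining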
<a class="displaced_anchor" name="eqtn77"></a>\begin{eqnarray}
  C = -\frac{1}{n} \sum_x [y \ln a +(1-y) \ln(1-a)] + {\rm constant}.
\tag{77}\end{eqnarray}
<!--where the constant here is the average of the individual constants for
each training example.  And so we see that
Equations <span id="margin_172115178513_reveal" class="equation_link">(71)</span><span id="margin_172115178513" class="marginequation" style="display: none;"><a href="chap3.html#eqtn71" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w_j} & = & x_j(a-y)  \nonumber\end{eqnarray}</a></span><script>$('#margin_172115178513_reveal').click(function() {$('#margin_172115178513').toggle('slow', function() {});});</script>
and <span id="margin_419838980882_reveal" class="equation_link">(72)</span><span id="margin_419838980882" class="marginequation" style="display: none;"><a href="chap3.html#eqtn72" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial b } & = & (a-y) \nonumber\end{eqnarray}</a></span><script>$('#margin_419838980882_reveal').click(function() {$('#margin_419838980882').toggle('slow', function() {});});</script> uniquely determine the form
of the cross-entropy, up to an overall constant term.  The
cross-entropy isn't something that was miraculously pulled out of thin
air.  Rather, it's something that we could have discovered in a simple
and natural way.</p>-->
<p>Here the constant is the average of the individual constants for each training example.
And so we see that Equations <span id="margin_172115178513_reveal" class="equation_link">(71)</span><span id="margin_172115178513" class="marginequation" style="display: none;"><a href="chap3.html#eqtn71" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w_j} & = & x_j(a-y)  \nonumber\end{eqnarray}</a></span><script>$('#margin_172115178513_reveal').click(function() {$('#margin_172115178513').toggle('slow', function() {});});</script> and <span id="margin_419838980882_reveal" class="equation_link">(72)</span><span id="margin_419838980882" class="marginequation" style="display: none;"><a href="chap3.html#eqtn72" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
      \frac{\partial C}{\partial b } & = & (a-y) \nonumber\end{eqnarray}</a></span><script>$('#margin_419838980882_reveal').click(function() {$('#margin_419838980882').toggle('slow', function() {});});</script> uniquely determine the form of the cross-entropy, up to an overall constant term. The cross-entropy isn't something that was miraculously pulled out of thin air. Rather, it's something that we could have discovered in a simple and natural way.</p>
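<p>The integration step in the derivation above rests on the partial-fraction identity $\frac{a-y}{a(1-a)} = -\frac{y}{a} + \frac{1-y}{1-a}$, whose two terms integrate to $-y \ln a$ and $-(1-y)\ln(1-a)$. The result can also be sanity-checked numerically. The sketch below uses a finite difference to estimate $\partial C / \partial b$ for a single sigmoid neuron with the cross-entropy cost and compares it with $a - y$. The weight, bias, input and target values are arbitrary illustrative choices, and the function names are hypothetical.</p>

```python
import math


def sigmoid(z):
    """The sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(a, y):
    """Cross-entropy cost for one training example (constant dropped)."""
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))


def dC_db_numeric(w, b, x, y, eps=1e-6):
    """Central finite-difference estimate of dC/db for one neuron."""
    c_plus = cross_entropy(sigmoid(w * x + b + eps), y)
    c_minus = cross_entropy(sigmoid(w * x + b - eps), y)
    return (c_plus - c_minus) / (2.0 * eps)


# Assumed example values, chosen so the neuron starts badly wrong.
w, b, x, y = 2.0, 2.0, 1.0, 0.0
a = sigmoid(w * x + b)
print("numeric dC/db:", dC_db_numeric(w, b, x, y))
print("a - y        :", a - y)
```

<p>The two printed values should agree to within the accuracy of the finite-difference estimate, with no $\sigma'(z)$ factor appearing.</p>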
<!--<p>What about the intuitive meaning of the cross-entropy?  How should we
think about it?  Explaining this in depth would take us further afield
than I want to go.  However, it is worth mentioning that there is a
standard way of interpreting the cross-entropy that comes from the
field of information theory.  Roughly speaking, the idea is that the
cross-entropy is a measure of surprise.  In particular, our neuron is
trying to compute the function $x \rightarrow y = y(x)$.  But instead
it computes the function $x \rightarrow a = a(x)$.  Suppose we think
of $a$ as our neuron's estimated probability that $y$ is $1$, and
$1-a$ is the estimated probability that the right value for $y$ is
$0$.  Then the cross-entropy measures how "surprised" we are, on
average, when we learn the true value for $y$.  We get low surprise if
the output is what we expect, and high surprise if the output is
unexpected.  Of course, I haven't said exactly what "surprise"
means, and so this perhaps seems like empty verbiage.  But in fact
there is a precise information-theoretic way of saying what is meant
by surprise.  Unfortunately, I don't know of a good, short,
self-contained discussion of this subject that's available online.
But if you want to dig deeper, then Wikipedia contains a
<a href="http://en.wikipedia.org/wiki/Cross_entropy#Motivation">brief
  summary</a> that will get you started down the right track.  And the
details can be filled in by working through the materials about the
Kraft inequality in chapter 5 of the book about information theory by
<a href="http://books.google.ca/books?id=VWq5GG6ycxMC">Cover and Thomas</a>.</p>-->
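<p>What about the intuitive meaning of the cross-entropy? How should we think about it? Explaining this in depth would take us further afield than I want to go. However, it is worth mentioning that there is a standard way of interpreting the cross-entropy that comes from the field of information theory. Roughly speaking, the idea is that the cross-entropy is a measure of surprise. In particular, our neuron is trying to compute the function $x \rightarrow y = y(x)$. But instead it computes the function $x \rightarrow a = a(x)$. Suppose we think of $a$ as our neuron's estimated probability that $y$ is $1$, and $1-a$ as the estimated probability that the right value for $y$ is $0$. Then the cross-entropy measures how "surprised" we are, on average, when we find out the true value for $y$. We get low surprise if the output is what we expect, and high surprise if the output is unexpected. Of course, I haven't said exactly what "surprise" means, and so this perhaps seems like empty verbiage. But in fact there is a precise information-theoretic way of saying what is meant by surprise. Unfortunately, I don't know of a good, short, self-contained discussion of this subject that's available online. But if you want to dig deeper, then Wikipedia contains a <a href="http://en.wikipedia.org/wiki/Cross_entropy#Motivation">brief summary</a> that will get you started down the right track. And the details can be filled in by working through the materials about the Kraft inequality in chapter 5 of the book about information theory by <a href="http://books.google.ca/books?id=VWq5GG6ycxMC">Cover and Thomas</a>.</p>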
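<p>To make the "surprise" reading a little more concrete, here is a small sketch that evaluates the single-example cross-entropy for a few outputs $a$. It assumes the interpretation given above, with $a$ as the estimated probability that $y = 1$. The function name <tt>surprise</tt> is illustrative only, and the values are measured in nats because the natural logarithm is used.</p>

```python
import math


def surprise(a, y):
    """Single-example cross-entropy.

    Equals minus the log of the probability the neuron assigned to the
    true value y, where a is the estimated probability that y is 1.
    """
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))


# A confident, an undecided and a confidently wrong-for-y=1 output.
for a in (0.99, 0.5, 0.01):
    print("a = {0:.2f}: y=1 gives {1:.3f}, y=0 gives {2:.3f}".format(
        a, surprise(a, 1.0), surprise(a, 0.0)))
```

<p>A confident output that turns out to be right costs almost nothing, an undecided output costs $\ln 2$ either way, and a confident output that turns out to be wrong is penalized heavily.</p>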
<!--<p><h4><a name="problem_386289"></a><a href="#problem_386289">Problem</a></h4><ul>
<li> We've discussed at length the learning slowdown that can occur
  when output neurons saturate, in networks using the quadratic cost
  to train.  Another factor that may inhibit learning is the presence
  of the $x_j$ term in Equation <span id="margin_113861306005_reveal" class="equation_link">(61)</span><span id="margin_113861306005" class="marginequation" style="display: none;"><a href="chap3.html#eqtn61" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w_j} =  \frac{1}{n} \sum_x x_j(\sigma(z)-y) \nonumber\end{eqnarray}</a></span><script>$('#margin_113861306005_reveal').click(function() {$('#margin_113861306005').toggle('slow', function() {});});</script>.  Because of
  this term, when an input $x_j$ is near to zero, the corresponding
  weight $w_j$ will learn slowly.  Explain why it is not possible to
  eliminate the $x_j$ term through a clever choice of cost function.
</ul></p>-->
<p><h4><a name="problem_386289"></a><a href="#problem_386289">Problem</a></h4><ul>
    <li>We've discussed at length the learning slowdown that can occur when output neurons saturate, in networks using the quadratic cost to train. Another factor that may inhibit learning is the presence of the $x_j$ term in Equation <span id="margin_113861306005_reveal" class="equation_link">(61)</span><span id="margin_113861306005" class="marginequation" style="display: none;"><a href="chap3.html#eqtn61" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
          \frac{\partial C}{\partial w_j} =  \frac{1}{n} \sum_x x_j(\sigma(z)-y) \nonumber\end{eqnarray}</a></span><script>$('#margin_113861306005_reveal').click(function() {$('#margin_113861306005').toggle('slow', function() {});});</script>. Because of this term, when an input $x_j$ is near to zero, the corresponding weight $w_j$ will learn slowly. Explain why it is not possible to eliminate the $x_j$ term through a clever choice of cost function.
</ul></p>
<p>
<!--  <h3><a name="overfitting_and_regularization"></a><a href="#overfitting_and_regularization">Overfitting and regularization</a></h3></p>-->
<h3><a name="overfitting_and_regularization"></a><a href="#overfitting_and_regularization">Overfitting and regularization</a></h3></p>

<!--
<p>The Nobel prizewinning physicist Enrico Fermi was once asked his
opinion of a mathematical model some colleagues had proposed as the
solution to an important unsolved physics problem.  The model gave
excellent agreement with experiment, but Fermi was skeptical.  He
asked how many free parameters could be set in the model.  "Four"
  was the answer.  Fermi replied*
-->

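<p>The Nobel prizewinning physicist Enrico Fermi was once asked his
  opinion of a mathematical model some colleagues had proposed as the
  solution to an important unsolved physics problem.
  The model gave excellent agreement with experiment, but Fermi was skeptical.
  He asked how many free parameters could be set in the model.
  "Four" was the answer.
  Fermi replied*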


  <span class="marginnote">
*The quote comes from a charming article by
  <a href="http://www.nature.com/nature/journal/v427/n6972/full/427297a.html">Freeman
    Dyson</a>, who is one of the people who proposed the flawed model.
  A four-parameter elephant may be found
  <a href="http://www.johndcook.com/blog/2011/06/21/how-to-fit-an-elephant/">here</a>. </span>:

  <!--
  <span class="marginnote">
*The quote comes from a
  charming article by
  <a href="http://www.nature.com/nature/journal/v427/n6972/full/427297a.html">Freeman
    Dyson</a>, who is one of the people who proposed the flawed model. A
  four-parameter elephant may be found
  <a href="http://www.johndcook.com/blog/2011/06/21/how-to-fit-an-elephant/">here</a>. </span>:
-->

  <!--
"I remember my friend Johnny von Neumann used to say, with four
parameters I can fit an elephant, and with five I can make him wiggle
his trunk.".
    -->
"I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
</p><p>

  <!--
The point, of course, is that models with a large number of free
parameters can describe an amazingly wide range of phenomena.  Even if
such a model agrees well with the available data, that doesn't make it
a good model.  It may just mean there's enough freedom in the model
that it can describe almost any data set of the given size, without
capturing any genuine insights into the underlying phenomenon.  When
that happens the model will work well for the existing data, but will
fail to generalize to new situations.  The true test of a model is its
ability to make predictions in situations it hasn't been exposed to
before.
    -->

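The point, of course, is that models with a large number of free parameters can describe an amazingly wide range of phenomena.
Even if such a model agrees well with the available data, that doesn't make it a good model.
  It may just mean there's enough freedom in the model
  that it can describe almost any data set of the given size,
  without capturing any genuine insight
  into the underlying phenomenon.
  When that happens the model will work well for the existing data,
  but will fail to generalize to new situations.
  The true test of a model is its ability to make predictions in situations it hasn't been exposed to before.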
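</p><p>
A small experiment makes the point concrete. The sketch below is a made-up illustration and does not come from Dyson's article: it invents five noisy measurements of the simple law $y = x$, then fits them with the unique degree-4 polynomial through all five points, which is a model with five free parameters. The data values and the helper name <tt>lagrange_predict</tt> are assumptions chosen purely for illustration.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate at x the unique polynomial of degree len(xs) - 1
    that passes through every point (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Hypothetical measurements: the underlying law is y = x, plus some noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
noise = [0.5, -0.5, 0.5, -0.5, 0.5]
train_y = [x + e for x, e in zip(train_x, noise)]

# The five-parameter model agrees with every measurement it was fitted to.
train_error = max(abs(lagrange_predict(train_x, train_y, x) - y)
                  for x, y in zip(train_x, train_y))

# Ask it about a situation it has not seen: the underlying law says y = 6 here.
prediction_at_6 = lagrange_predict(train_x, train_y, 6.0)

print("largest error on the fitted data:", train_error)
print("prediction at x = 6:", prediction_at_6)
```

The five-parameter model reproduces every measurement, yet at $x = 6$ it predicts roughly $70.5$ where the underlying law gives $6$. Agreement with the existing data says very little about how the model behaves in new situations.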


</p><p>
<!--
  Fermi and von Neumann were suspicious of models with four parameters.
Our 30 hidden neuron network for classifying MNIST digits has nearly
24,000 parameters!  That's a lot of parameters.  Our 100 hidden neuron
network has nearly 80,000 parameters, and state-of-the-art deep neural
nets sometimes contain millions or even billions of parameters.
  Should we trust the results? -->

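  Fermi and von Neumann were suspicious of models with just four parameters.
  Our 30 hidden neuron network for classifying MNIST digits has nearly 24,000 parameters!
  That's a lot of parameters. Our 100 hidden neuron network has nearly 80,000 parameters,
  and state-of-the-art deep neural nets sometimes contain millions or even billions of parameters.
  Should we trust the results?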
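</p><p>
Those counts are easy to check. The short sketch below assumes the fully connected layout used throughout the book, with one weight for each connection between adjacent layers and one bias for each non-input neuron. The helper name <tt>count_parameters</tt> is hypothetical and is not part of <tt>network2</tt>.

```python
def count_parameters(sizes):
    """Number of weights and biases in a fully connected network
    whose layer sizes are given by the list `sizes`."""
    weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])  # the input layer has no biases
    return weights + biases


print(count_parameters([784, 30, 10]))   # 23860
print(count_parameters([784, 100, 10]))  # 79510
```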


</p><p>

<!--
  Let's sharpen this problem up by constructing a situation where our
network does a bad job generalizing to new situations.  We'll use our
30 hidden neuron network, with its 23,860 parameters.  But we won't
train the network using all 50,000 MNIST training images.  Instead,
we'll use just the first 1,000 training images.  Using that restricted
set will make the problem with generalization much more evident.
We'll train in a similar way to before, using the cross-entropy cost
function, with a learning rate of $\eta = 0.5$ and a mini-batch size
of $10$.  However, we'll train for 400 epochs, a somewhat larger
number than before, because we're not using as many training examples.
Let's use <tt>network2</tt> to look at the way the cost function
  changes: -->

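  Let's sharpen this problem up by constructing a situation where our network does a bad job
  of generalizing to new situations. We'll use our 30 hidden neuron network, with its 23,860 parameters. But
  we won't train the network using all 50,000 MNIST training images. Instead, we'll use just the first 1,000 training images. Restricting the data set makes the problem with generalization much more evident.
  We'll train in a similar way to before, using the cross-entropy cost function, with a learning rate of $\eta = 0.5$ and a mini-batch size of $10$. However, we'll train for 400 epochs, somewhat more than before, because we're not using as many training examples.
  Let's use <tt>network2</tt> to look at the way the cost function changes: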


</p><p><div class="highlight"><pre>
<span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">mnist_loader</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> \
<span class="o">...</span> <span class="n">mnist_loader</span><span class="o">.</span><span class="n">load_data_wrapper</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">network2</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">],</span> <span class="n">cost</span><span class="o">=</span><span class="n">network2</span><span class="o">.</span><span class="n">CrossEntropyCost</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">large_weight_initializer</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">[:</span><span class="mi">1000</span><span class="p">],</span> <span class="mi">400</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">test_data</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">monitor_training_cost</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p><p>

<!--
  Using the results we can plot the way the cost changes as the network
learns*<span class="marginnote">
*This and the next four graphs were generated by the
  program
  <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/fig/overfitting.py">overfitting.py</a>.</span>
  :
-->


Using the results, we can plot the way the cost changes as the network learns*
<span class="marginnote">
*This and the next four graphs were generated by the program
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/fig/overfitting.py">overfitting.py</a>.
</span>
  :


</p><p><center><img src="images/overfitting1.png" width="520px"></center></p><p>

  <!--
  This looks encouraging, showing a smooth decrease in the cost, just as
we expect.  Note that I've only shown training epochs 200 through 399.
This gives us a nice up-close view of the later stages of learning,
  which, as we'll see, turns out to be where the interesting action is. -->

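  This looks encouraging, showing a smooth decrease in the cost, just as we expect.
  Note that I've only shown training epochs 200 through 399.
  This gives us a nice up-close view of the later stages of learning,
  which, as we'll see, turns out to be where the interesting action is.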
</p><p>

<!--
  Let's now look at how the classification accuracy on the test data
  changes over time:-->

  Let's now look at how the classification accuracy on the test data changes over time:

</p><p><center><img src="images/overfitting2.png" width="520px"></center></p><p>

  <!--
  Again, I've zoomed in quite a bit.  In the first 200 epochs (not
shown) the accuracy rises to just under 82 percent.  The learning then
gradually slows down.  Finally, at around epoch 280 the classification
accuracy pretty much stops improving.  Later epochs merely see small
stochastic fluctuations near the value of the accuracy at epoch 280.
Contrast this with the earlier graph, where the cost associated to the
training data continues to smoothly drop.  If we just look at that
cost, it appears that our model is still getting "better".  But the
test accuracy results show the improvement is an illusion.  Just like
the model that Fermi disliked, what our network learns after epoch 280
no longer generalizes to the test data.  And so it's not useful
learning.  We say the network is <em>overfitting</em> or
  <em>overtraining</em> beyond epoch 280. -->

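  Again, I've zoomed in quite a bit. In the first 200 epochs (not shown)
  the accuracy rises to just under 82 percent. The learning then gradually slows down. Finally, at around epoch 280, the classification accuracy pretty much stops improving.
  Later epochs merely see small stochastic fluctuations near the value of the accuracy at epoch 280.
  Contrast this with the earlier graph, where the cost associated with the training data continues to drop smoothly.
  If we just look at that cost, it appears that our model is still getting "better". But the test accuracy results
  show the improvement is an illusion.
  Just like the model that Fermi disliked, what our network learns after epoch 280
  no longer generalizes to the test data.
  And so it's not useful learning. We say the network is
 <em>overfitting</em> or
  <em>overtraining</em> beyond epoch 280.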


</p><p>

  <!--
  You might wonder if the problem here is that I'm looking at the
<em>cost</em> on the training data, as opposed to the
<em>classification accuracy</em> on the test data.  In other words,
maybe the problem is that we're making an apples and organges
comparison.  What would happen if we compared the cost on the training
data with the cost on the test data, so we're comparing similar
measures?  Or perhaps we could compare the classification accuracy on
both the training data and the test data?  In fact, essentially the
same phenomenon shows up no matter how we do the comparison.  The
details do change, however. For instance, let's look at the cost on
the test data: -->

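You might wonder if the problem here is that I'm looking at the <em>cost</em> on the training data, as opposed to the
<em>classification accuracy</em> on the test data. In other words, maybe the problem is that we're making an apples and oranges comparison.
What would happen if we compared the cost on the training data with the cost on the test data, so we're comparing similar measures?
Or perhaps we could compare the classification accuracy on both the training data and the test data? In fact,
essentially the same phenomenon shows up no matter how we do the comparison. The
details do change, however. For instance, let's look at the cost on the test data: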

</p><p>
  <center><img src="images/overfitting3.png" width="520px"></center></p><p>

  <!--
  We can see that the cost on the test data improves until around epoch
15, but after that it actually starts to get worse, even though the
cost on the training data is continuing to get better.  This is
another sign that our model is overfitting.  It poses a puzzle,
though, which is whether we should regard epoch 15 or epoch 280 as the
point at which overfitting is coming to dominate learning?  From a
practical point of view, what we really care about is improving
classification accuracy on the test data, while the cost on the test
data is no more than a proxy for classification accuracy.  And so it
makes most sense to regard epoch 280 as the point beyond which
  overfitting is dominating learning in our neural network. -->

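  We can see that the cost on the test data improves until around epoch 15, but after that it actually
  starts to get worse, even though the cost on the training data is continuing to get better.
  This is another sign that our model is overfitting. It poses a puzzle, though:
  should we regard epoch 15 or epoch 280
  as the point at which overfitting comes to dominate learning?

  From a practical point of view, what we really care about is improving classification accuracy on the test data,
  while the cost on the test data is no more than a proxy for classification accuracy. And so it makes most sense to regard
  epoch 280 as the point beyond which overfitting is dominating learning in our neural network.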
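</p><p>
It may seem odd that the cost on the test data can get worse while the accuracy on the same data is still improving. The toy sketch below shows how that can happen. It assumes a simplified setting, a single-output binary classifier whose cost per example is $-\ln p$, where $p$ is the probability assigned to the correct class, and it uses made-up probabilities for three test examples at an earlier and a later epoch. The function names and numbers are hypothetical.

```python
import math


def mean_cross_entropy(p_correct):
    """Average of -ln(p) over examples, where p is the probability
    the classifier assigns to the correct class."""
    return -sum(math.log(p) for p in p_correct) / len(p_correct)


def accuracy(p_correct):
    """Fraction of examples whose correct class gets probability above 0.5."""
    return sum(1 for p in p_correct if p > 0.5) / len(p_correct)


# Made-up probabilities of the correct class for three test examples.
earlier_epoch = [0.6, 0.6, 0.4]     # two right, one wrong
later_epoch = [0.51, 0.51, 0.51]    # all right, but only barely

for name, probs in [("earlier", earlier_epoch), ("later", later_epoch)]:
    print(name, "cost:", mean_cross_entropy(probs), "accuracy:", accuracy(probs))
```

The later epoch classifies more examples correctly but has the higher cost, because the cost depends on how confident each output is and the accuracy only on which side of the threshold it falls. This is the sense in which the cost is only a proxy for the accuracy.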


</p><p>

<!--
  Another sign of overfitting may be seen in the classification accuracy
  on the training data:-->

  Another sign of overfitting may be seen in the classification accuracy on the training data:

</p><p><center><img src="images/overfitting4.png" width="520px"></center></p><p>

  <!--
  The accuracy rises all the way up to $100$ percent.  That is, our
network correctly classifies all $1,000$ training images!  Meanwhile,
our test accuracy tops out at just $82.27$ percent.  So our network
really is learning about peculiarities of the training set, not just
recognizing digits in general.  It's almost as though our network is
merely memorizing the training set, without understanding digits well
  enough to generalize to the test set. -->

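  The accuracy rises all the way up to 100 percent. That is, our network
  correctly classifies all 1,000 training images! Meanwhile, our test accuracy
  tops out at just 82.27 percent. So our network really is learning about
  peculiarities of the training set, not just recognizing digits in general.
  It's almost as though our network is merely memorizing the training set,
  without understanding digits well enough to generalize to the test set.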

</p><p>

  Overfitting is a major problem in neural networks.  This is especially
true in modern networks, which often have very large numbers of
weights and biases.  To train effectively, we need a way of detecting
when overfitting is going on, so we don't overtrain.  And we'd like to
  have techniques for reducing the effects of overfitting.

</p><p>

  The obvious way to detect overfitting is to use the approach above,
keeping track of accuracy on the test data as our network trains.  If
we see that the accuracy on the test data is no longer improving, then
we should stop training.  Of course, strictly speaking, this is not
necessarily a sign of overfitting.  It might be that accuracy on the
test data and the training data both stop improving at the same time.
  Still, adopting this strategy will prevent overfitting.

</p><p>

  In fact, we'll use a variation on this strategy.  Recall that when we
  load in the MNIST data we load in three data sets:

<div class="highlight"><pre>
<span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">mnist_loader</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> \
<span class="o">...</span> <span class="n">mnist_loader</span><span class="o">.</span><span class="n">load_data_wrapper</span><span class="p">()</span>
</pre></div>

Up to now we've been using the <tt>training_data</tt> and
<tt>test_data</tt>, and ignoring the <tt>validation_data</tt>.  The
<tt>validation_data</tt> contains $10,000$ images of digits, images
which are different from the $50,000$ images in the MNIST training
set, and the $10,000$ images in the MNIST test set.  Instead of using
the <tt>test_data</tt> to prevent overfitting, we will use the
<tt>validation_data</tt>.  To do this, we'll use much the same strategy
as was described above for the <tt>test_data</tt>.  That is, we'll
compute the classification accuracy on the <tt>validation_data</tt> at
the end of each epoch.  Once the classification accuracy on the
<tt>validation_data</tt> has saturated, we stop training.  This strategy
is called <em>early stopping</em>.  Of course, in practice we won't
immediately know when the accuracy has saturated.  Instead, we
continue training until we're confident that the accuracy has
saturated*<span class="marginnote">
*It requires some judgement to determine when to
  stop.  In my earlier graphs I identified epoch 280 as the place at
  which accuracy saturated.  It's possible that was too pessimistic.
  Neural networks sometimes plateau for a while in training, before
  continuing to improve.  I wouldn't be surprised if more learning
  could have occurred even after epoch 400, although the magnitude of
  any further improvement would likely be small.  So it's possible to
adopt more or less aggressive strategies for early stopping.</span>.
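
</p><p>
One simple way to make that judgement mechanical is a "patience" rule: stop once the validation accuracy has gone a fixed number of epochs without beating its best value so far. The sketch below is an assumption about how such a rule might look, not the procedure used to produce the graphs above. The function name <tt>early_stopping_epoch</tt>, the <tt>patience</tt> parameter and the sample accuracies are all hypothetical.

```python
def early_stopping_epoch(validation_accuracies, patience=10):
    """Scan per-epoch validation accuracies and return (stop_epoch, best_epoch).

    stop_epoch is the first epoch at which `patience` epochs have passed
    without a new best accuracy, or None if that never happens.
    best_epoch is the epoch with the best accuracy seen up to that point.
    """
    best_accuracy = float("-inf")
    best_epoch = None
    for epoch, accuracy in enumerate(validation_accuracies):
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return None, best_epoch


# Made-up validation accuracies, one per epoch.
accuracies = [0.70, 0.75, 0.80, 0.81, 0.81, 0.80, 0.81, 0.79]
print(early_stopping_epoch(accuracies, patience=3))  # (6, 3)
```

A small <tt>patience</tt> gives an aggressive rule that may stop during a temporary plateau; a large one is more cautious but costs extra training time.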

</p><p>Why use the <tt>validation_data</tt> to prevent overfitting, rather than
the <tt>test_data</tt>?  In fact, this is part of a more general
strategy, which is to use the <tt>validation_data</tt> to evaluate
different trial choices of hyper-parameters such as the number of
epochs to train for, the learning rate, the best network architecture,
and so on.  We use such evaluations to find and set good values for
the hyper-parameters.  Indeed, although I haven't mentioned it until
now, that is, in part, how I arrived at the hyper-parameter choices
made earlier in this book. (More on this
<a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters">later</a>.)</p><p>Of course, that doesn't in any way answer the question of why we're
using the <tt>validation_data</tt> to prevent overfitting, rather than
the <tt>test_data</tt>.  Instead, it replaces it with a more general
question, which is why we're using the <tt>validation_data</tt> rather
than the <tt>test_data</tt> to set good hyper-parameters?  To understand
why, consider that when setting hyper-parameters we're likely to try
many different choices for the hyper-parameters.  If we set the
hyper-parameters based on evaluations of the <tt>test_data</tt> it's
possible we'll end up overfitting our hyper-parameters to the
<tt>test_data</tt>.  That is, we may end up finding hyper-parameters
which fit particular peculiarities of the <tt>test_data</tt>, but where
the performance of the network won't generalize to other data sets.
We guard against that by figuring out the hyper-parameters using the
<tt>validation_data</tt>.  Then, once we've got the hyper-parameters we
want, we do a final evaluation of accuracy using the <tt>test_data</tt>.
That gives us confidence that our results on the <tt>test_data</tt> are
a true measure of how well our neural network generalizes.  To put it
another way, you can think of the validation data as a type of
training data that helps us learn good hyper-parameters.  This
approach to finding good hyper-parameters is sometimes known as the
<em>hold out</em> method, since the <tt>validation_data</tt> is kept apart
or "held out" from the <tt>training_data</tt>.</p><p>Now, in practice, even after evaluating performance on the
<tt>test_data</tt> we may change our minds and want to try another
approach - perhaps a different network architecture - which will
involve finding a new set of hyper-parameters.  If we do this, isn't
there a danger we'll end up overfitting to the <tt>test_data</tt> as
well?  Do we need a potentially infinite regress of data sets, so we
can be confident our results will generalize?  Addressing this concern
fully is a deep and difficult problem.  But for our practical
purposes, we're not going to worry too much about this question.
Instead, we'll plunge ahead, using the basic hold out method, based on
the <tt>training_data</tt>, <tt>validation_data</tt>, and
<tt>test_data</tt>, as described above.</p><p>We've been looking so far at overfitting when we're just using 1,000
training images.  What happens when we use the full training set of
50,000 images?  We'll keep all the other parameters the same (30
hidden neurons, learning rate 0.5, mini-batch size of 10), but train
using all 50,000 images for 30 epochs.  Here's a graph showing the
results for the classification accuracy on both the training data and
the test data.  Note that I've used the test data here, rather than
the validation data, in order to make the results more directly
comparable with the earlier graphs.</p><p><center><img src="images/overfitting_full.png" width="520px"></center></p><p>As you can see, the accuracy on the test and training data remain much
closer together than when we were using 1,000 training examples.  In
particular, the best classification accuracy of $97.86$ percent on the
training data is only $2.53$ percent higher than the $95.33$ percent
on the test data.  That's compared to the $17.73$ percent gap we had
earlier!  Overfitting is still going on, but it's been greatly
reduced.  Our network is generalizing much better from the training
data to the test data.  In general, one of the best ways of reducing
overfitting is to increase the size of the training data.  With enough
training data it is difficult for even a very large network to
overfit.  Unfortunately, training data can be expensive or difficult
to acquire, so this is not always a practical option.</p><p><h4><a name="regularization"></a><a href="#regularization">Regularization</a></h4></p><p>Increasing the amount of training data is one way of reducing
overfitting.  Are there other ways we can reduce the extent to which
overfitting occurs?  One possible approach is to reduce the size of
our network. However, large networks have the potential to be more
powerful than small networks, and so this is an option we'd only adopt
reluctantly.</p><p>Fortunately, there are other techniques which can reduce overfitting,
even when we have a fixed network and fixed training data.  These are
known as <em>regularization</em> techniques.  In this section I describe
one of the most commonly used regularization techniques, a technique
sometimes known as <em>weight decay</em> or <em>L2 regularization</em>.
The idea of L2 regularization is to add an extra term to the cost
function, a term called the <em>regularization term</em>.  Here's the
regularized cross-entropy:</p><p><a class="displaced_anchor" name="eqtn78"></a>\begin{eqnarray} C = -\frac{1}{n} \sum_{xj} \left[ y_j \ln a^L_j+(1-y_j) \ln
(1-a^L_j)\right] + \frac{\lambda}{2n} \sum_w w^2.
\tag{78}\end{eqnarray}</p><p>The first term is just the usual expression for the cross-entropy.
But we've added a second term, namely the sum of the squares of all
the weights in the network.  This is scaled by a factor $\lambda /
2n$, where $\lambda > 0$ is known as the <em>regularization
  parameter</em>, and $n$ is, as usual, the size of our training set.
I'll discuss later how $\lambda$ is chosen.  It's also worth noting
that the regularization term <em>doesn't</em> include the biases.  I'll
also come back to that below.</p><p>Of course, it's possible to regularize other cost functions, such as
the quadratic cost.  This can be done in a similar way:</p><p><a class="displaced_anchor" name="eqtn79"></a>\begin{eqnarray} C = \frac{1}{2n} \sum_x \|y-a^L\|^2 +
  \frac{\lambda}{2n} \sum_w w^2.
\tag{79}\end{eqnarray}</p><p>In both cases we can write the regularized cost function as
<a class="displaced_anchor" name="eqtn80"></a>\begin{eqnarray}  C = C_0 + \frac{\lambda}{2n}
\sum_w w^2,
\tag{80}\end{eqnarray} where $C_0$ is the original, unregularized cost
function.</p><p>Intuitively, the effect of regularization is to make it so the network
prefers to learn small weights, all other things being equal.  Large
weights will only be allowed if they considerably improve the first
part of the cost function.  Put another way, regularization can be
viewed as a way of compromising between finding small weights and
minimizing the original cost function.  The relative importance of the
two elements of the compromise depends on the value of $\lambda$: when
$\lambda$ is small we prefer to minimize the original cost function,
but when $\lambda$ is large we prefer small weights.</p><p>Now, it's really not at all obvious why making this kind of compromise
should help reduce overfitting!  But it turns out that it does. We'll
address the question of why it helps in the next section.  But first,
let's work through an example showing that regularization really does
reduce overfitting.</p><p>To construct such an example, we first need to figure out how to apply
our stochastic gradient descent learning algorithm in a regularized
neural network.  In particular, we need to know how to compute the
partial derivatives $\partial C / \partial w$ and $\partial C
/ \partial b$ for all the weights and biases in the network.  Taking
the partial derivatives of Equation <span id="margin_87298027313_reveal" class="equation_link">(80)</span><span id="margin_87298027313" class="marginequation" style="display: none;"><a href="chap3.html#eqtn80" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C = C_0 + \frac{\lambda}{2n}
\sum_w w^2 \nonumber\end{eqnarray}</a></span><script>$('#margin_87298027313_reveal').click(function() {$('#margin_87298027313').toggle('slow', function() {});});</script> gives</p><p><a class="displaced_anchor" name="eqtn81"></a><a class="displaced_anchor" name="eqtn82"></a>\begin{eqnarray}
  \frac{\partial C}{\partial w} & = & \frac{\partial C_0}{\partial w} +
  \frac{\lambda}{n} w \tag{81}\\
  \frac{\partial C}{\partial b} & = & \frac{\partial C_0}{\partial b}.
\tag{82}\end{eqnarray} </p><p>The $\partial C_0 / \partial w$ and $\partial C_0 / \partial b$ terms
can be computed using backpropagation, as described in
<a href="chap2.html">the last chapter</a>.  And so we see that it's easy to
compute the gradient of the regularized cost function: just use
backpropagation, as usual, and then add $\frac{\lambda}{n} w$ to the
partial derivative of all the weight terms.  The partial derivatives
with respect to the biases are unchanged, and so the gradient descent
learning rule for the biases doesn't change from the usual rule:</p><p><a class="displaced_anchor" name="eqtn83"></a>\begin{eqnarray}
b & \rightarrow & b -\eta \frac{\partial C_0}{\partial b}.
\tag{83}\end{eqnarray}</p><p>The learning rule for the weights becomes:</p><p><a class="displaced_anchor" name="eqtn84"></a><a class="displaced_anchor" name="eqtn85"></a>\begin{eqnarray}
  w & \rightarrow & w-\eta \frac{\partial C_0}{\partial
    w}-\frac{\eta \lambda}{n} w \tag{84}\\
  & = & \left(1-\frac{\eta \lambda}{n}\right) w -\eta \frac{\partial
    C_0}{\partial w}.
\tag{85}\end{eqnarray}</p><p>This is exactly the same as the usual gradient descent learning rule,
except we first rescale the weight $w$ by a factor $1-\frac{\eta
  \lambda}{n}$.  This rescaling is sometimes referred to as
<em>weight decay</em>, since it makes the weights smaller.  At first
glance it looks as though this means the weights are being driven
unstoppably toward zero.  But that's not right, since the other term
may lead the weights to increase, if so doing causes a decrease in the
unregularized cost function.</p><p>Okay, that's how gradient descent works.  What about stochastic
gradient descent?  Well, just as in unregularized stochastic gradient
descent, we can estimate $\partial C_0 / \partial w$ by averaging over
a mini-batch of $m$ training examples.  Thus the regularized learning
rule for stochastic gradient descent becomes
(c.f. Equation <span id="margin_637276429741_reveal" class="equation_link">(20)</span><span id="margin_637276429741" class="marginequation" style="display: none;"><a href="chap1.html#eqtn20" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m}
  \sum_j \frac{\partial C_{X_j}}{\partial w_k}  \nonumber\end{eqnarray}</a></span><script>$('#margin_637276429741_reveal').click(function() {$('#margin_637276429741').toggle('slow', function() {});});</script>)</p><p><a class="displaced_anchor" name="eqtn86"></a>\begin{eqnarray}
  w \rightarrow \left(1-\frac{\eta \lambda}{n}\right) w -\frac{\eta}{m}
  \sum_x \frac{\partial C_x}{\partial w},
\tag{86}\end{eqnarray}</p><p>where the sum is over training examples $x$ in the mini-batch, and
$C_x$ is the (unregularized) cost for each training example.  This is
exactly the same as the usual rule for stochastic gradient descent,
except for the $1-\frac{\eta \lambda}{n}$ weight decay factor.
Finally, and for completeness, let me state the regularized learning
rule for the biases.  This is, of course, exactly the same as in the
unregularized case (c.f. Equation <span id="margin_891921584746_reveal" class="equation_link">(21)</span><span id="margin_891921584746" class="marginequation" style="display: none;"><a href="chap1.html#eqtn21" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m}
  \sum_j \frac{\partial C_{X_j}}{\partial b_l} \nonumber\end{eqnarray}</a></span><script>$('#margin_891921584746_reveal').click(function() {$('#margin_891921584746').toggle('slow', function() {});});</script>),</p><p><a class="displaced_anchor" name="eqtn87"></a>\begin{eqnarray}
  b \rightarrow b - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial b},
\tag{87}\end{eqnarray}
where the sum is over training examples $x$ in the mini-batch.</p><p>Let's see how regularization changes the performance of our neural
network. We'll use a network with $30$ hidden neurons, a mini-batch
size of $10$, a learning rate of $0.5$, and the cross-entropy cost
function.  However, this time we'll use a regularization parameter of
$\lambda = 0.1$.  Note that in the code, we use the variable name
<tt>lmbda</tt>, because <tt>lambda</tt> is a reserved word in Python, with
an unrelated meaning.  I've also used the <tt>test_data</tt> again, not
the <tt>validation_data</tt>.  Strictly speaking, we should use the
<tt>validation_data</tt>, for all the reasons we discussed earlier.  But
I decided to use the <tt>test_data</tt> because it makes the results
more directly comparable with our earlier, unregularized results.  You
can easily change the code to use the <tt>validation_data</tt> instead,
and you'll find that it gives similar results.
<div class="highlight"><pre>
<span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">mnist_loader</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> \
<span class="o">...</span> <span class="n">mnist_loader</span><span class="o">.</span><span class="n">load_data_wrapper</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">network2</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">],</span> <span class="n">cost</span><span class="o">=</span><span class="n">network2</span><span class="o">.</span><span class="n">CrossEntropyCost</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">large_weight_initializer</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">[:</span><span class="mi">1000</span><span class="p">],</span> <span class="mi">400</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">test_data</span><span class="p">,</span> <span class="n">lmbda</span> <span class="o">=</span> <span class="mf">0.1</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_evaluation_cost</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_training_cost</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">monitor_training_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
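
Before looking at the results, it may help to see what a nonzero <tt>lmbda</tt> changes in each update. The sketch below follows Equations (80), (86) and (87), but it is a simplified stand-in and not the code in <tt>network2</tt>: weights and biases are flat Python lists, the gradients of the unregularized cost are assumed to have been averaged over the mini-batch already, and the function names are hypothetical.

```python
def l2_penalty(weights, lmbda, n):
    """Regularization term (lmbda / 2n) * sum of w^2 from Equation (80).
    The biases are deliberately left out."""
    return 0.5 * (lmbda / n) * sum(w * w for w in weights)


def regularized_step(weights, biases, grad_w, grad_b, eta, lmbda, n):
    """One stochastic gradient descent update with L2 regularization.

    grad_w and grad_b hold mini-batch averages of the gradient of the
    unregularized cost; n is the size of the whole training set.
    """
    decay = 1.0 - eta * lmbda / n  # the weight decay factor
    new_weights = [decay * w - eta * gw for w, gw in zip(weights, grad_w)]
    new_biases = [b - eta * gb for b, gb in zip(biases, grad_b)]  # unchanged rule
    return new_weights, new_biases


# With zero gradient the weights shrink slightly and the biases stay put.
w, b = regularized_step([1.0, -2.0], [0.5], [0.0, 0.0], [0.0],
                        eta=0.5, lmbda=0.1, n=1000)
print(w, b)
```

Setting <tt>lmbda</tt> to zero makes the decay factor equal to one and recovers the ordinary update rule. Now back to the training run above.
</p><p>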

The cost on the training data decreases over the whole time, much as
it did in the earlier, unregularized case*<span class="marginnote">
*This and the next
  two graphs were produced with the program
  <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/fig/overfitting.py">overfitting.py</a>.</span>:</p><p><center><img src="images/regularized1.png" width="520px"></center></p><p>But this time the accuracy on the <tt>test_data</tt> continues to
increase for the entire 400 epochs:</p><p><center><img src="images/regularized2.png" width="520px"></center></p><p>Clearly, the use of regularization has suppressed overfitting.  What's
more, the accuracy is considerably higher, with a peak classification
accuracy of $87.1$ percent, compared to the peak of $82.27$ percent
obtained in the unregularized case.  Indeed, we could almost certainly
get considerably better results by continuing to train past 400
epochs. It seems that, empirically, regularization is causing our
network to generalize better, and considerably reducing the effects of
overfitting.</p><p>What happens if we move out of the artificial environment of just
having 1,000 training images, and return to the full 50,000 image
training set?  Of course, we've seen already that overfitting is much
less of a problem with the full 50,000 images.  Does regularization
help any further?  Let's keep the hyper-parameters the same as before
- $30$ epochs, learning rate $0.5$, mini-batch size of $10$.
However, we need to modify the regularization parameter.  The reason
is that the size $n$ of the training set has changed from $n =
1,000$ to $n = 50,000$, and this changes the weight decay factor $1 -
\frac{\eta \lambda}{n}$.  If we continued to use $\lambda = 0.1$ that
would mean much less weight decay, and thus much less of a
regularization effect.  We compensate by changing to $\lambda = 5.0$.</p><p>Okay, let's train our network, stopping first to re-initialize the
weights:
<div class="highlight"><pre>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">large_weight_initializer</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">test_data</span><span class="p">,</span> <span class="n">lmbda</span> <span class="o">=</span> <span class="mf">5.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">monitor_training_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
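
Quick arithmetic confirms the choice of $\lambda = 5.0$. The sketch below evaluates the weight decay factor $1 - \frac{\eta \lambda}{n}$ for the settings discussed above, with $\eta = 0.5$ throughout. The helper name <tt>weight_decay_factor</tt> is hypothetical.

```python
def weight_decay_factor(eta, lmbda, n):
    """Factor (1 - eta * lmbda / n) that multiplies each weight in one update."""
    return 1.0 - eta * lmbda / n


small_set = weight_decay_factor(0.5, 0.1, 1000)      # 1,000 images, lambda = 0.1
full_set_old = weight_decay_factor(0.5, 0.1, 50000)  # 50,000 images, lambda unchanged
full_set_new = weight_decay_factor(0.5, 5.0, 50000)  # 50,000 images, lambda = 5.0

# Print the fractional shrinkage per update, which is easier to read
# than factors that are all very close to one.
for label, factor in [("n = 1,000,  lambda = 0.1", small_set),
                      ("n = 50,000, lambda = 0.1", full_set_old),
                      ("n = 50,000, lambda = 5.0", full_set_new)]:
    print(label, "shrinks weights by", 1.0 - factor)
```

With $\lambda = 0.1$ on the full training set, each update would shrink the weights by only about $10^{-6}$, fifty times less than in the 1,000 image experiment. Choosing $\lambda = 5.0$ restores the earlier shrinkage of about $5 \times 10^{-5}$ per update. Now back to the training run above.
</p><p>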

We obtain the results:</p><p><center><img src="images/regularized_full.png" width="520px"></center></p><p>There's lots of good news here.  First, our classification accuracy on
the test data is up, from $95.49$ percent when running unregularized,
to $96.49$ percent.  That's a big improvement.  Second, we can see
that the gap between results on the training and test data is much
narrower than before, running at under a percent.  That's still a
significant gap, but we've obviously made substantial progress
reducing overfitting.</p><p>Finally, let's see what test classification accuracy we get when we
use 100 hidden neurons and a regularization parameter of $\lambda =
5.0$. I won't go through a detailed analysis of overfitting here; this
is purely for fun, just to see how high an accuracy we can get when we
use our new tricks: the cross-entropy cost function and L2
regularization.</p><p><div class="highlight"><pre>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">10</span><span class="p">],</span> <span class="n">cost</span><span class="o">=</span><span class="n">network2</span><span class="o">.</span><span class="n">CrossEntropyCost</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">large_weight_initializer</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p><p>The final result is a classification accuracy of $97.92$ percent on
the validation data.  That's a big jump from the 30 hidden neuron
case.  In fact, tuning just a little more, to run for 60 epochs at
$\eta = 0.1$ and $\lambda = 5.0$ we break the $98$ percent barrier,
achieving $98.04$ percent classification accuracy on the validation
data.  Not bad for what turns out to be 152 lines of code!</p><p>I've described regularization as a way to reduce overfitting and to
increase classification accuracies.  In fact, that's not the only
benefit.  Empirically, when doing multiple runs of our MNIST networks,
but with different (random) weight initializations, I've found that
the unregularized runs will occasionally get "stuck", apparently
caught in local minima of the cost function.  The result is that
different runs sometimes provide quite different results.  By
contrast, the regularized runs have provided much more easily
replicable results.</p><p>Why is this going on?  Heuristically, if the cost function is
unregularized, then the length of the weight vector is likely to grow,
all other things being equal.  Over time this can lead to the weight
vector being very large indeed.  This can cause the weight vector to
get stuck pointing in more or less the same direction, since changes
due to gradient descent only make tiny changes to the direction, when
the length is long.  I believe this phenomenon is making it hard for
our learning algorithm to properly explore the weight space, and
consequently harder to find good minima of the cost function.</p><p>
<h4><a name="why_does_regularization_help_reduce_overfitting"></a><a href="#why_does_regularization_help_reduce_overfitting">Why does regularization help reduce overfitting?</a></h4></p><p>We've seen empirically that regularization helps reduce overfitting.
That's encouraging but, unfortunately, it's not obvious why
regularization helps!  A standard story people tell to explain what's
going on is along the following lines: smaller weights are, in some
sense, lower complexity, and so provide a simpler and more powerful
explanation for the data, and should thus be preferred.  That's a
pretty terse story, though, and contains several elements that perhaps
seem dubious or mystifying.  Let's unpack the story and examine it
critically.  To do that, let's suppose we have a simple data set for
which we wish to build a model:</p><p><div
id="simple_model"></div> <script type="text/javascript" src="js/simple_data.js">
 </script> </p><p>Implicitly, we're studying some real-world phenomenon here, with $x$
and $y$ representing real-world data.  Our goal is to build a model
which lets us predict $y$ as a function of $x$.  We could try using
neural networks to build such a model, but I'm going to do something
even simpler: I'll try to model $y$ as a polynomial in $x$.  I'm doing
this instead of using neural nets because using polynomials will make
things particularly transparent.  Once we've understood the polynomial
case, we'll translate to neural networks.  Now, there are ten points
in the graph above, which means we can find a unique $9$th-order
polynomial $y = a_0 x^9 + a_1 x^8 + \ldots + a_9$ which fits the data
exactly.  Here's the graph of that polynomial*<span class="marginnote">
*I won't show
  the coefficients explicitly, although they are easy to find using a
  routine such as Numpy's <tt>polyfit</tt>.  You can view the exact form
  of the polynomial in the <a href="js/polynomial_model.js">source code
    for the graph</a> if you're curious.  It's the function <tt>p(x)</tt>
  defined starting on line 14 of the program which produces the
  graph.</span>:</p><p><div id="polynomial_fit"></div> <script
type="text/javascript" src="js/polynomial_model.js"></script></p><p>That provides an exact fit.  But we can also get a good fit using the
linear model $y = 2x$:</p><p><div id="linear_fit"></div> <script
type="text/javascript" src="js/linear_model.js"></script> </p><p>Which of these is the better model?  Which is more likely to be true?
And which model is more likely to generalize well to other examples of
the same underlying real-world phenomenon?</p><p>These are difficult questions.  In fact, we can't determine with
certainty the answer to any of the above questions, without much more
information about the underlying real-world phenomenon.  But let's
consider two possibilities: (1) the $9$th order polynomial is, in
fact, the model which truly describes the real-world phenomenon, and
the model will therefore generalize perfectly; (2) the correct model
is $y = 2x$, but there's a little additional noise due to, say,
measurement error, and that's why the model isn't an exact fit.</p><p>It's not <em>a priori</em> possible to say which of these two
possibilities is correct.  (Or, indeed, if some third possibility
holds).  Logically, either could be true.  And it's not a trivial
difference.  It's true that on the data provided there's only a small
difference between the two models.  But suppose we want to predict the
value of $y$ corresponding to some large value of $x$, much larger
than any shown on the graph above.  If we try to do that there will be
a dramatic difference between the predictions of the two models, as
the $9$th order polynomial model comes to be dominated by the $x^9$
term, while the linear model remains, well, linear.</p><p>One point of view is to say that in science we should go with the
simpler explanation, unless compelled not to.  When we find a simple
model that seems to explain many data points we are tempted to shout
"Eureka!"  After all, it seems unlikely that a simple explanation
should occur merely by coincidence.  Rather, we suspect that the model
must be expressing some underlying truth about the phenomenon.  In the
case at hand, the model $y = 2x+{\rm noise}$ seems much simpler than
$y = a_0 x^9 + a_1 x^8 + \ldots$.  It would be surprising if that
simplicity had occurred by chance, and so we suspect that $y = 2x+{\rm
  noise}$ expresses some underlying truth.  In this point of view, the
9th order model is really just learning the effects of local
noise. And so while the 9th order model works perfectly for these
particular data points, the model will fail to generalize to other
data points, and the noisy linear model will have greater predictive
power.</p><p>Let's see what this point of view means for neural networks.  Suppose
our network mostly has small weights, as will tend to happen in a
regularized network.  The smallness of the weights means that the
behaviour of the network won't change too much if we change a few
random inputs here and there.  That makes it difficult for a
regularized network to learn the effects of local noise in the data.
Think of it as a way of making it so single pieces of evidence don't
matter too much to the output of the network.  Instead, a regularized
network learns to respond to types of evidence which are seen often
across the training set.  By contrast, a network with large weights
may change its behaviour quite a bit in response to small changes in
the input.  And so an unregularized network can use large weights to
learn a complex model that carries a lot of information about the
noise in the training data.  In a nutshell, regularized networks are
constrained to build relatively simple models based on patterns seen
often in the training data, and are resistant to learning
peculiarities of the noise in the training data.  The hope is that
this will force our networks to do real learning about the phenomenon
at hand, and to generalize better from what they learn.</p><p>With that said, this idea of preferring simpler explanations should
make you nervous.  People sometimes refer to this idea as "Occam's
Razor", and will zealously apply it as though it has the status of
some general scientific principle.  But, of course, it's not a general
scientific principle.  There is no <em>a priori</em> logical reason to
prefer simple explanations over more complex explanations.  Indeed,
sometimes the more complex explanation turns out to be correct.</p><p>Let me describe two examples where more complex explanations have
turned out to be correct.  In the 1940s the physicist Marcel Schein
announced the discovery of a new particle of nature.  The company he
worked for, General Electric, was ecstatic, and publicised the
discovery widely.  But the physicist Hans Bethe was skeptical.  Bethe
visited Schein, and looked at the plates showing the tracks of
Schein's new particle.  Schein showed Bethe plate after plate, but on
each plate Bethe identified some problem that suggested the data
should be discarded.  Finally, Schein showed Bethe a plate that looked
good.  Bethe said it might just be a statistical fluke.  Schein:
"Yes, but the chance that this would be statistics, even according to
your own formula, is one in five."  Bethe: "But we have already
looked at five plates."  Finally, Schein said: "But on my plates,
each one of the good plates, each one of the good pictures, you
explain by a different theory, whereas I have one hypothesis that
explains all the plates, that they are [the new particle]."  Bethe
replied: "The sole difference between your and my explanations is
that yours is wrong and all of mine are right.  Your single
explanation is wrong, and all of my multiple explanations are right."
Subsequent work confirmed that Nature agreed with Bethe, and Schein's
particle is no more*<span class="marginnote">
*The story is related by the physicist
  Richard Feynman in an
  <a href="http://www.aip.org/history/ohilist/5020_4.html">interview</a>
  with the historian Charles Weiner.</span>.</p><p>As a second example, in 1859 the astronomer Urbain Le Verrier observed
that the orbit of the planet Mercury doesn't have quite the shape that
Newton's theory of gravitation says it should have.  It was a tiny,
tiny deviation from Newton's theory, and several of the explanations
proffered at the time boiled down to saying that Newton's theory was
more or less right, but needed a tiny alteration.  In 1916, Einstein
showed that the deviation could be explained very well using his
general theory of relativity, a theory radically different to
Newtonian gravitation, and based on much more complex mathematics.
Despite that additional complexity, today it's accepted that
Einstein's explanation is correct, and Newtonian gravity, even in its
modified forms, is wrong.  This is in part because we now know that
Einstein's theory explains many other phenomena which Newton's theory
has difficulty with.  Furthermore, and even more impressively,
Einstein's theory accurately predicts several phenomena which aren't
predicted by Newtonian gravity at all. But these impressive qualities
weren't entirely obvious in the early days.  If one had judged merely
on the grounds of simplicity, then some modified form of Newton's
theory would arguably have been more attractive.</p><p>There are three morals to draw from these stories.  First, it can be
quite a subtle business deciding which of two explanations is truly
"simpler".  Second, even if we can make such a judgement, simplicity
is a guide that must be used with great caution!  Third, the true test
of a model is not simplicity, but rather how well it does in
predicting new phenomena, in new regimes of behaviour.</p><p>With that said, and keeping the need for caution in mind, it's an
empirical fact that regularized neural networks usually generalize
better than unregularized networks.  And so through the remainder of
the book we will make frequent use of regularization.  I've included
the stories above merely to help convey why no-one has yet developed
an entirely convincing theoretical explanation for why regularization
helps networks generalize.  Indeed, researchers continue to write
papers where they try different approaches to regularization, compare
them to see which works better, and attempt to understand why different
approaches work better or worse.  And so you can view regularization
as something of a kludge.  While it often helps, we don't have an
entirely satisfactory systematic understanding of what's going on,
merely incomplete heuristics and rules of thumb.</p><p>There's a deeper set of issues here, issues which go to the heart of
science.  It's the question of how we generalize.  Regularization may
give us a computational magic wand that helps our networks generalize
better, but it doesn't give us a principled understanding of how
generalization works, nor of what the best approach is*<span class="marginnote">
*These
  issues go back to the
  <a href="http://en.wikipedia.org/wiki/Problem_of_induction">problem
    of induction</a>, famously discussed by the Scottish philosopher
  David Hume in <a href="http://www.gutenberg.org/ebooks/9662">"An
    Enquiry Concerning Human Understanding"</a> (1748).  The problem of
  induction has been given a modern machine learning form in the
  no-free lunch theorem
  (<a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=585893">link</a>)
  of David Wolpert and William Macready (1997).</span>.</p><p>This is particularly galling because in everyday life, we humans
generalize phenomenally well.  Shown just a few images of an elephant
a child will quickly learn to recognize other elephants.  Of course,
they may occasionally make mistakes, perhaps confusing a rhinoceros
for an elephant, but in general this process works remarkably
accurately.  So we have a system - the human brain - with a huge
number of free parameters.  And after being shown just one or a few
training images that system learns to generalize to other images.  Our
brains are, in some sense, regularizing amazingly well!  How do we do
it?  At this point we don't know.  I expect that in years to come we
will develop more powerful techniques for regularization in artificial
neural networks, techniques that will ultimately enable neural nets to
generalize well even from small data sets.</p><p>In fact, our networks already generalize better than one might <em>a
  priori</em> expect.  A network with 100 hidden neurons has nearly 80,000
parameters.  We have only 50,000 images in our training data.  It's
like trying to fit an 80,000th degree polynomial to 50,000 data
points.  By all rights, our network should overfit terribly.  And yet,
as we saw earlier, such a network actually does a pretty good job
generalizing.  Why is that the case?  It's not well understood.  It
has been conjectured*<span class="marginnote">
*In
  <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf">Gradient-Based
    Learning Applied to Document Recognition</a>, by Yann LeCun,
  Léon Bottou, Yoshua Bengio, and Patrick Haffner
  (1998).</span>  that "the dynamics of gradient descent learning in
multilayer nets has a `self-regularization' effect".  This is
exceptionally fortunate, but it's also somewhat disquieting that we
don't understand why it's the case.  In the meantime, we will adopt
the pragmatic approach and use regularization whenever we can.  Our
neural networks will be the better for it.</p><p>Let me conclude this section by returning to a detail which I left
unexplained earlier: the fact that L2 regularization <em>doesn't</em>
constrain the biases.  Of course, it would be easy to modify the
regularization procedure to regularize the biases.  Empirically, doing
this often doesn't change the results very much, so to some extent
it's merely a convention whether to regularize the biases or not.
However, it's worth noting that having a large bias doesn't make a
neuron sensitive to its inputs in the same way as having large
weights.  And so we don't need to worry about large biases enabling
our network to learn the noise in our training data.  At the same
time, allowing large biases gives our networks more flexibility in
behaviour - in particular, large biases make it easier for neurons
to saturate, which is sometimes desirable.  For these reasons we don't
usually include bias terms when regularizing.</p><p><h4><a name="other_techniques_for_regularization"></a><a href="#other_techniques_for_regularization">Other techniques for regularization</a></h4></p><p>There are many regularization techniques other than L2 regularization.
In fact, so many techniques have been developed that I can't possibly
summarize them all.  In this section I briefly describe three other
approaches to reducing overfitting: L1 regularization, dropout, and
artificially increasing the training set size.  We won't go into
nearly as much depth studying these techniques as we did earlier.
Instead, the purpose is to get familiar with the main ideas, and to
appreciate something of the diversity of regularization techniques
available.</p><p><strong>L1 regularization:</strong> In this approach we modify the
unregularized cost function by adding the sum of the absolute values
of the weights:</p><p><a class="displaced_anchor" name="eqtn88"></a>\begin{eqnarray}  C = C_0 + \frac{\lambda}{n} \sum_w |w|.
\tag{88}\end{eqnarray} </p><p>Intuitively, this is similar to L2 regularization, penalizing large
weights, and tending to make the network prefer small weights.  Of
course, the L1 regularization term isn't the same as the L2
regularization term, and so we shouldn't expect to get exactly the
same behaviour.  Let's try to understand how the behaviour of a
network trained using L1 regularization differs from a network trained
using L2 regularization.</p><p>To do that, we'll look at the partial derivatives of the cost
function.  Differentiating <span id="margin_81306900706_reveal" class="equation_link">(88)</span><span id="margin_81306900706" class="marginequation" style="display: none;"><a href="chap3.html#eqtn88" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  C = C_0 + \frac{\lambda}{n} \sum_w |w| \nonumber\end{eqnarray}</a></span><script>$('#margin_81306900706_reveal').click(function() {$('#margin_81306900706').toggle('slow', function() {});});</script> we obtain:
<a class="displaced_anchor" name="eqtn89"></a>\begin{eqnarray}  \frac{\partial C}{\partial
    w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \, {\rm
    sgn}(w),
\tag{89}\end{eqnarray}</p><p>where ${\rm sgn}(w)$ is the sign of $w$, that is, $+1$ if $w$ is
positive, and $-1$ if $w$ is negative.  Using this expression, we can
easily modify backpropagation to do stochastic gradient descent using
L1 regularization.  The resulting update rule for an L1 regularized
network is
<a class="displaced_anchor" name="eqtn90"></a>\begin{eqnarray}  w \rightarrow w' =
  w-\frac{\eta \lambda}{n} \mbox{sgn}(w) - \eta \frac{\partial
    C_0}{\partial w},
\tag{90}\end{eqnarray} </p><p>where, as per usual, we can estimate $\partial C_0 / \partial w$ using
a mini-batch average, if we wish.  Compare that to the update rule for
L2 regularization (cf. Equation <span id="margin_398765654565_reveal" class="equation_link">(86)</span><span id="margin_398765654565" class="marginequation" style="display: none;"><a href="chap3.html#eqtn86" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  w \rightarrow \left(1-\frac{\eta \lambda}{n}\right) w -\frac{\eta}{m}
  \sum_x \frac{\partial C_x}{\partial w},  \nonumber\end{eqnarray}</a></span><script>$('#margin_398765654565_reveal').click(function() {$('#margin_398765654565').toggle('slow', function() {});});</script>),
<a class="displaced_anchor" name="eqtn91"></a>\begin{eqnarray}
  w \rightarrow w' = w\left(1 - \frac{\eta \lambda}{n} \right)
  - \eta \frac{\partial C_0}{\partial w}.
\tag{91}\end{eqnarray}
In both expressions the effect of regularization is to shrink the
weights.  This accords with our intuition that both kinds of
regularization penalize large weights.  But the way the weights shrink
is different.  In L1 regularization, the weights shrink by a constant
amount toward $0$.  In L2 regularization, the weights shrink by an
amount which is proportional to $w$.  And so when a particular weight
has a large magnitude, $|w|$, L1 regularization shrinks the weight
much less than L2 regularization does.  By contrast, when $|w|$ is
small, L1 regularization shrinks the weight much more than L2
regularization.  The net result is that L1 regularization tends to
concentrate the weight of the network in a relatively small number of
high-importance connections, while the other weights are driven toward
zero.</p><p>I've glossed over an issue in the above discussion, which is that the
partial derivative $\partial C / \partial w$ isn't defined when $w =
0$.  The reason is that the function $|w|$ has a sharp "corner" at
$w = 0$, and so isn't differentiable at that point.  That's okay,
though.  What we'll do is just apply the usual (unregularized) rule
for stochastic gradient descent when $w = 0$.  That should be okay -
intuitively, the effect of regularization is to shrink weights, and
obviously it can't shrink a weight which is already $0$.  To put it
more precisely, we'll use Equations <span id="margin_253754174974_reveal" class="equation_link">(89)</span><span id="margin_253754174974" class="marginequation" style="display: none;"><a href="chap3.html#eqtn89" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  \frac{\partial C}{\partial
    w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \, {\rm
    sgn}(w) \nonumber\end{eqnarray}</a></span><script>$('#margin_253754174974_reveal').click(function() {$('#margin_253754174974').toggle('slow', function() {});});</script>
and <span id="margin_321890282168_reveal" class="equation_link">(90)</span><span id="margin_321890282168" class="marginequation" style="display: none;"><a href="chap3.html#eqtn90" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}  w \rightarrow w' =
  w-\frac{\eta \lambda}{n} \mbox{sgn}(w) - \eta \frac{\partial
    C_0}{\partial w} \nonumber\end{eqnarray}</a></span><script>$('#margin_321890282168_reveal').click(function() {$('#margin_321890282168').toggle('slow', function() {});});</script> with the convention that $\mbox{sgn}(0) = 0$.
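</p><p>To make the comparison with L2 regularization concrete, here is a minimal sketch of both update rules for a single weight, using the convention $\mbox{sgn}(0) = 0$.  It is plain Python rather than code from <tt>network2.py</tt>; the function names <tt>sgn</tt>, <tt>l1_update</tt> and <tt>l2_update</tt> and the numerical values are illustrative assumptions, and the gradient $\partial C_0 / \partial w$ is set to zero so that only the shrinkage from regularization is visible:</p>

```python
import math


def sgn(w):
    """Sign of w, with the convention sgn(0) = 0."""
    return 0.0 if w == 0 else math.copysign(1.0, w)


def l1_update(w, grad_c0, eta, lmbda, n):
    """One step of the L1 rule: w -> w - (eta*lmbda/n)*sgn(w) - eta*dC0/dw."""
    return w - (eta * lmbda / n) * sgn(w) - eta * grad_c0


def l2_update(w, grad_c0, eta, lmbda, n):
    """One step of the L2 rule: w -> (1 - eta*lmbda/n)*w - eta*dC0/dw."""
    return (1.0 - eta * lmbda / n) * w - eta * grad_c0


# Illustrative values only; grad_c0 = 0 isolates the effect of regularization.
eta, lmbda, n = 0.5, 5.0, 50000
for w in (10.0, 0.01, 0.0, -10.0):
    l1_shrink = w - l1_update(w, 0.0, eta, lmbda, n)
    l2_shrink = w - l2_update(w, 0.0, eta, lmbda, n)
    print(w, l1_shrink, l2_shrink)
```

<p>With these values the L1 step has the same size, $\eta \lambda / n$, for every non-zero weight, while the L2 step is proportional to $w$.  So the weight of magnitude $10$ is shrunk less by L1 than by L2, the weight of magnitude $0.01$ is shrunk more, and a weight that is exactly $0$ is left alone by both.</p><p>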
That gives a nice, compact rule for doing stochastic gradient descent
with L1 regularization.</p><p><strong>Dropout:</strong> Dropout is a radically different technique for
regularization.  Unlike L1 and L2 regularization, dropout doesn't rely
on modifying the cost function.  Instead, in dropout we modify the
network itself.  Let me describe the basic mechanics of how dropout
works, before getting into why it works, and what the results are.</p><p>Suppose we're trying to train a network:</p><p><center>
<img src="images/tikz30.png"/>
</center></p><p>In particular, suppose we have a training input $x$ and corresponding
desired output $y$.  Ordinarily, we'd train by forward-propagating $x$
through the network, and then backpropagating to determine the
contribution to the gradient.  With dropout, this process is modified.
We start by randomly (and temporarily) deleting half the hidden
neurons in the network, while leaving the input and output neurons
untouched.  After doing this, we'll end up with a network along the
following lines.  Note that the dropout neurons, i.e., the neurons
which have been temporarily deleted, are still ghosted in:</p><p><center>
<img src="images/tikz31.png"/>
</center></p><p>We forward-propagate the input $x$ through the modified network, and
then backpropagate the result, also through the modified network.
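</p><p>The forward pass through the modified network can be sketched in a few lines of plain Python.  This is an illustration under stated assumptions, not code from <tt>network2.py</tt>: the tiny network with 3 inputs, 4 hidden neurons and 2 outputs, the random weights, and the names <tt>choose_dropout_mask</tt> and <tt>forward_with_dropout</tt> are all made up, and the sketch takes "deleting half the hidden neurons" literally by dropping exactly half of them.  Deleting a neuron is modelled by multiplying its activation by $0$:</p>

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def choose_dropout_mask(num_hidden, rng):
    """Return a 0/1 list with exactly half of the hidden neurons marked as dropped (0)."""
    dropped = set(rng.sample(range(num_hidden), num_hidden // 2))
    return [0.0 if j in dropped else 1.0 for j in range(num_hidden)]


def forward_with_dropout(x, w_hidden, b_hidden, w_out, b_out, mask):
    """Forward pass through a one-hidden-layer net with masked hidden activations."""
    hidden = [
        m * sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b, m in zip(w_hidden, b_hidden, mask)
    ]
    output = [
        sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(w_out, b_out)
    ]
    return hidden, output


rng = random.Random(0)  # fixed seed so the example is repeatable
num_inputs, num_hidden, num_outputs = 3, 4, 2
w_hidden = [[rng.gauss(0, 1) for _ in range(num_inputs)] for _ in range(num_hidden)]
b_hidden = [rng.gauss(0, 1) for _ in range(num_hidden)]
w_out = [[rng.gauss(0, 1) for _ in range(num_hidden)] for _ in range(num_outputs)]
b_out = [rng.gauss(0, 1) for _ in range(num_outputs)]

x = [0.2, 0.7, 0.1]
mask = choose_dropout_mask(num_hidden, rng)
hidden, output = forward_with_dropout(x, w_hidden, b_hidden, w_out, b_out, mask)
print(mask)    # two entries are 0.0 (dropped), two are 1.0 (kept)
print(hidden)  # dropped neurons have activation exactly 0.0
```

<p>Because a dropped neuron's activation is exactly $0$, it sends nothing forward, and backpropagating through the same masked network gives zero gradient of the unregularized cost with respect to that neuron's incoming and outgoing weights.</p><p>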
After doing this over a mini-batch of examples, we update the
appropriate weights and biases.  We then repeat the process, first
restoring the dropout neurons, then choosing a new random subset of
hidden neurons to delete, estimating the gradient for a different
mini-batch, and updating the weights and biases in the network.</p><p>By repeating this process over and over, our network will learn a set
of weights and biases.  Of course, those weights and biases will have
been learnt under conditions in which half the hidden neurons were
dropped out.  When we actually run the full network that means that
twice as many hidden neurons will be active.  To compensate for that,
we halve the weights outgoing from the hidden neurons.</p><p>This dropout procedure may seem strange and <em>ad hoc</em>.  Why would
we expect it to help with regularization?  To explain what's going on,
I'd like you to briefly stop thinking about dropout, and instead
imagine training neural networks in the standard way (no dropout).  In
particular, imagine we train several different neural networks, all
using the same training data.  Of course, the networks may not start
out identical, and as a result after training they may sometimes give
different results.  When that happens we could use some kind of
averaging or voting scheme to decide which output to accept.  For
instance, if we have trained five networks, and three of them are
classifying a digit as a "3", then it probably really is a "3".
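</p><p>A voting scheme of this kind takes only a few lines.  The sketch below is a hypothetical illustration: the function name <tt>majority_vote</tt> and the two dissenting predictions are made up, and ties are broken arbitrarily in favour of whichever label was seen first:</p>

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the networks' predictions."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label


# Five networks classify the same image; three of them say "3".
votes = [3, 3, 8, 3, 5]
print(majority_vote(votes))
```

<p>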
The other two networks are probably just making a mistake.  This kind
of averaging scheme is often found to be a powerful (though expensive)
way of reducing overfitting.  The reason is that the different
networks may overfit in different ways, and averaging may help
eliminate that kind of overfitting.</p><p>What's this got to do with dropout?  Heuristically, when we drop out
different sets of neurons, it's rather like we're training different
neural networks.  And so the dropout procedure is like averaging the
effects of a very large number of different networks.  The different
networks will overfit in different ways, and so, hopefully, the net
effect of dropout will be to reduce overfitting.</p><p><a name="dropout_explanation"></a></p><p>A related heuristic explanation for dropout is given in one of the
earliest papers to use the
technique*<span class="marginnote">
*<a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">ImageNet
    Classification with Deep Convolutional Neural Networks</a>, by Alex
  Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).</span>: "This
technique reduces complex co-adaptations of neurons, since a neuron
cannot rely on the presence of particular other neurons. It is,
therefore, forced to learn more robust features that are useful in
conjunction with many different random subsets of the other neurons."
In other words, if we think of our network as a model which is making
predictions, then we can think of dropout as a way of making sure that
the model is robust to the loss of any individual piece of evidence.
In this, it's somewhat similar to L1 and L2 regularization, which tend
to reduce weights, and thus make the network more robust to losing any
individual connection in the network.</p><p>Of course, the true measure of dropout is that it has been very
successful in improving the performance of neural networks.  The
original
paper*<span class="marginnote">
*<a href="http://arxiv.org/pdf/1207.0580.pdf">Improving
    neural networks by preventing co-adaptation of feature detectors</a>
  by Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya
  Sutskever, and Ruslan Salakhutdinov (2012).  Note that the paper
  discusses a number of subtleties that I have glossed over in this
  brief introduction.</span> introducing the technique applied it to many
different tasks. For us, it's of particular interest that they applied
dropout to MNIST digit classification, using a vanilla feedforward
neural network along lines similar to those we've been considering.
The paper noted that the best result anyone had achieved up to that
point using such an architecture was $98.4$ percent classification
accuracy on the test set.  They improved that to $98.7$ percent
accuracy using a combination of dropout and a modified form of L2
regularization.  Similarly impressive results have been obtained for
many other tasks, including problems in image and speech recognition,
and natural language processing.  Dropout has been especially useful
in training large, deep networks, where the problem of overfitting is
often acute.</p><p><strong>Artificially expanding the training data:</strong> We saw earlier that
our MNIST classification accuracy dropped down to percentages in the
mid-80s when we used only 1,000 training images.  It's not surprising
that this is the case, since less training data means our network will
be exposed to fewer variations in the way human beings write digits.
Let's try training our 30 hidden neuron network with a variety of
different training data set sizes, to see how performance varies.  We
train using a mini-batch size of 10, a learning rate $\eta = 0.5$, a
regularization parameter $\lambda = 5.0$, and the cross-entropy cost
function.  We will train for 30 epochs when the full training data set
is used, and scale up the number of epochs proportionally when smaller
training sets are used.  To ensure the weight decay factor remains the
same across training sets, we will use a regularization parameter of
$\lambda = 5.0$ when the full training data set is used, and scale
down $\lambda$ proportionally when smaller training sets are
used*<span class="marginnote">
*This and the next two graphs are produced with the
  program
  <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/fig/more_data.py">more_data.py</a>.</span>.</p><p><center><img src="images/more_data.png" width="520px"></center></p><p>As you can see, the classification accuracies improve considerably as
we use more training data.  Presumably this improvement would continue
still further if more data was available.  Of course, looking at the
graph above it does appear that we're getting near saturation.
Suppose, however, that we redo the graph with the training set size
plotted logarithmically:</p><p><center><img src="images/more_data_log.png" width="520px"></center></p><p>It seems clear that the graph is still going up toward the end.  This
suggests that if we used vastly more training data - say, millions
or even billions of handwriting samples, instead of just 50,000 -
then we'd likely get considerably better performance, even from this
very small network.</p><p>Obtaining more training data is a great idea. Unfortunately, it can be
expensive, and so is not always possible in practice.  However,
there's another idea which can work nearly as well, and that's to
artificially expand the training data.  Suppose, for example, that we
take an MNIST training image of a five,</p><p></p><p><center><img src="images/more_data_5.png" width="120px"></center></p><p>and rotate it by a small amount, let's say 15 degrees:</p><p><center><img src="images/more_data_rotated_5.png" width="120px"></center></p><p>It's still recognizably the same digit.  And yet at the pixel level
it's quite different to any image currently in the MNIST training
data.  It's conceivable that adding this image to the training data
might help our network learn more about how to classify digits.
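</p><p>A rotation of this kind takes only a few lines to compute.  The following is a minimal sketch, not the method used to produce the image above.  It assumes the image is a square list of rows of grayscale values (the loader used elsewhere in this book stores each image as a flat vector of 784 values, so it would need reshaping to 28 by 28 first), and the helper name <tt>rotate_image</tt> is hypothetical.  It uses only the standard library; in practice a library routine such as <tt>scipy.ndimage.rotate</tt> would be the more usual choice.</p>

```python
import math


def rotate_image(image, degrees):
    """Rotate a square grayscale image (a list of rows) about its centre.

    Hypothetical helper: uses inverse mapping with nearest-neighbour
    sampling, and fills pixels with no source inside the frame with 0.
    With this convention positive angles tilt the image clockwise on screen.
    """
    n = len(image)
    centre = (n - 1) / 2.0
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = [[0.0] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            # Offset of this output pixel from the centre of the image.
            dy, dx = row - centre, col - centre
            # Rotate the offset back to find which input pixel lands here.
            src_row = int(round(centre + cos_t * dy - sin_t * dx))
            src_col = int(round(centre + sin_t * dy + cos_t * dx))
            if 0 <= src_row < n and 0 <= src_col < n:
                rotated[row][col] = image[src_row][src_col]
    return rotated


# A toy 5x5 "image" with a single vertical stroke down the middle column.
stroke = [[1.0 if col == 2 else 0.0 for col in range(5)] for _ in range(5)]
tilted = rotate_image(stroke, 15)
```

<p>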
What's more, obviously we're not limited to adding just this one
image.  We can expand our training data by making <em>many</em> small
rotations of <em>all</em> the MNIST training images, and then using the
expanded training data to improve our network's performance.</p><p>This idea is very powerful and has been widely used.  Let's look at
some of the results from a
paper*<span class="marginnote">
*<a href="http://dx.doi.org/10.1109/ICDAR.2003.1227801">Best
    Practices for Convolutional Neural Networks Applied to Visual
    Document Analysis</a>, by Patrice Simard, Dave Steinkraus, and John
  Platt (2003).</span> which applied several variations of the idea to
MNIST.  One of the neural network architectures they considered was
along similar lines to what we've been using, a feedforward network
with 800 hidden neurons and using the cross-entropy cost function.
Running the network with the standard MNIST training data they
achieved a classification accuracy of 98.4 percent on their test set.
But then they expanded the training data, using not just rotations, as
I described above, but also translating and skewing the images.  By
training on the expanded data set they increased their network's
accuracy to 98.9 percent.  They also experimented with what they
called "elastic distortions", a special type of image distortion
intended to emulate the random oscillations found in hand muscles.  By
using the elastic distortions to expand the data they achieved an even
higher accuracy, 99.3 percent.  Effectively, they were broadening the
experience of their network by exposing it to the sort of variations
that are found in real handwriting.</p><p>Variations on this idea can be used to improve performance on many
learning tasks, not just handwriting recognition.  The general
principle is to expand the training data by applying operations that
reflect real-world variation.  It's not difficult to think of ways of
doing this.  Suppose, for example, that you're building a neural
network to do speech recognition.  We humans can recognize speech even
in the presence of distortions such as background noise.  And so you
can expand your data by adding background noise.  We can also
recognize speech if it's sped up or slowed down. So that's another way
we can expand the training data.  These techniques are not always used
- for instance, instead of expanding the training data by adding
noise, it may well be more efficient to clean up the input to the
network by first applying a noise reduction filter.  Still, it's worth
keeping the idea of expanding the training data in mind, and looking
for opportunities to apply it.</p><p><h4><a name="exercise_195778"></a><a href="#exercise_195778">Exercise</a></h4><ul>
<li> As discussed above, one way of expanding the MNIST training data
  is to use small rotations of training images.  What's a problem that
  might occur if we allow arbitrarily large rotations of training
  images?
</ul></p><p><strong>An aside on big data and what it means to compare
  classification accuracies:</strong> Let's look again at how our neural
network's accuracy varies with training set size:</p><p><center><img src="images/more_data_log.png" width="520px"></center></p><p>Suppose that instead of using a neural network we use some other
machine learning technique to classify digits.  For instance, let's
try using the support vector machines (SVM) which we met briefly back
in <a href="chap1.html#SVM">Chapter 1</a>.  As was the case in Chapter 1,
don't worry if you're not familiar with SVMs; we don't need to
understand their details.  Instead, we'll use the SVM supplied by the
<a href="http://scikit-learn.org/stable/">scikit-learn library</a>.  Here's
how SVM performance varies as a function of training set size.  I've
plotted the neural net results as well, to make comparison
easy*<span class="marginnote">
*This graph was produced with the program
  <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/fig/more_data.py">more_data.py</a>
  (as were the last few graphs).</span>:</p><p><center><img src="images/more_data_comparison.png" width="520px"></center></p><p>Probably the first thing that strikes you about this graph is that our
neural network outperforms the SVM for every training set size.
That's nice, although you shouldn't read too much into it, since I
just used the out-of-the-box settings from scikit-learn's SVM, while
we've done a fair bit of work improving our neural network.  A more
subtle but more interesting fact about the graph is that if we train
our SVM using 50,000 images then it actually has better performance
(94.48 percent accuracy) than our neural network does when trained
using 5,000 images (93.24 percent accuracy).  In other words, more
training data can sometimes compensate for differences in the machine
learning algorithm used.</p><p>Something even more interesting can occur.  Suppose we're trying to
solve a problem using two machine learning algorithms, algorithm A and
algorithm B.  It sometimes happens that algorithm A will outperform
algorithm B with one set of training data, while algorithm B will
outperform algorithm A with a different set of training data.  We
don't see that above - it would require the two graphs to cross -
but it does happen*<span class="marginnote">
*Striking examples may be found in
  <a href="http://dx.doi.org/10.3115/1073012.1073017">Scaling to very
    very large corpora for natural language disambiguation</a>, by
  Michele Banko and Eric Brill (2001).</span>.  The correct response to the
question "Is algorithm A better than algorithm B?" is really: "What
training data set are you using?"</p><p>All this is a caution to keep in mind, both when doing development,
and when reading research papers.  Many papers focus on finding new
tricks to wring out improved performance on standard benchmark data
sets.  "Our whiz-bang technique gave us an improvement of X percent
on standard benchmark Y" is a canonical form of research claim.  Such
claims are often genuinely interesting, but they must be understood as
applying only in the context of the specific training data set used.
Imagine an alternate history in which the people who originally
created the benchmark data set had a larger research grant.  They
might have used the extra money to collect more training data.  It's
entirely possible that the "improvement" due to the whiz-bang
technique would disappear on a larger data set.  In other words, the
purported improvement might be just an accident of history.  The
message to take away, especially in practical applications, is that
what we want is both better algorithms <em>and</em> better training
data.  It's fine to look for better algorithms, but make sure you're
not focusing on better algorithms to the exclusion of easy wins from
getting more or better training data.</p><p><h4><a name="problem_953063"></a><a href="#problem_953063">Problem</a></h4><ul>
<li><strong>(Research problem)</strong> How do our machine learning algorithms
  perform in the limit of very large data sets?  For any given
  algorithm it's natural to attempt to define a notion of asymptotic
  performance in the limit of truly big data. A quick-and-dirty
  approach to this problem is to simply try fitting curves to graphs
  like those shown above, and then to extrapolate the fitted curves
  out to infinity.  An objection to this approach is that different
  approaches to curve fitting will give different notions of asymptotic
  performance.  Can you find a principled justification for fitting to
  some particular class of curves?  If so, compare the asymptotic
  performance of several different machine learning algorithms.
</ul></p><p><strong>Summing up:</strong> We've now completed our dive into overfitting and
regularization.  Of course, we'll return again to the issue.  As I've
mentioned several times, overfitting is a major problem in neural
networks, especially as computers get more powerful, and we have the
ability to train larger networks.  As a result there's a pressing need
to develop powerful regularization techniques to reduce overfitting,
and this is an extremely active area of current work.</p><p><h3><a name="weight_initialization"></a><a href="#weight_initialization">Weight initialization</a></h3></p><p>When we create our neural networks, we have to make choices for the
initial weights and biases.  Up to now, we've been choosing them
according to a prescription which I discussed only briefly
<a href="chap1.html#weight_initialization">back in Chapter 1</a>.  Just to
remind you, that prescription was to choose both the weights and
biases using independent Gaussian random variables, normalized to have
mean $0$ and standard deviation $1$.  While this approach has worked
well, it was quite <em>ad hoc</em>, and it's worth revisiting to see if
we can find a better way of setting our initial weights and biases,
and perhaps help our neural networks learn faster.</p><p>It turns out that we can do quite a bit better than initializing with
normalized Gaussians.  To see why, suppose we're working with a
network with a large number - say $1{,}000$ - of input neurons.  And
let's suppose we've used normalized Gaussians to initialize the
weights connecting to the first hidden layer.  For now I'm going to
concentrate specifically on the weights connecting the input neurons
to the first neuron in the hidden layer, and ignore the rest of the
network:</p><p><center>
<img src="images/tikz32.png"/>
</center></p><p>We'll suppose for simplicity that we're trying to train using a
training input $x$ in which half the input neurons are on, i.e., set
to $1$, and half the input neurons are off, i.e., set to $0$.  The
argument which follows applies more generally, but you'll get the gist
from this special case.  Let's consider the weighted sum $z = \sum_j
w_j x_j+b$ of inputs to our hidden neuron.  $500$ terms in this sum
vanish, because the corresponding input $x_j$ is zero.  And so $z$ is
a sum over a total of $501$ normalized Gaussian random variables,
accounting for the $500$ weight terms and the $1$ extra bias term.
Thus $z$ is itself distributed as a Gaussian with mean zero and
standard deviation $\sqrt{501} \approx 22.4$.  That is, $z$ has a very
broad Gaussian distribution, not sharply peaked at all:</p><p><div id="wide_gaussian"></div>
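<p>The standard deviation of $\sqrt{501}$ used for this distribution follows from a short variance calculation.  As a sketch, assuming (as in the text) that the weights and the bias are independent and each has variance $1$, and using the fact that the variance of a sum of independent random variables is the sum of their variances:</p>
<p>$$ \mathrm{Var}(z) = \sum_{j \,:\, x_j = 1} \mathrm{Var}(w_j) + \mathrm{Var}(b) = 500 \cdot 1 + 1 = 501, \qquad \sqrt{\mathrm{Var}(z)} = \sqrt{501} \approx 22.4. $$</p>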
<script type="text/javascript" src="js/wide_gaussian.js"></script></p><p>In particular, we can see from this graph that it's quite likely that
$|z|$ will be pretty large, i.e., either $z \gg 1$ or $z \ll -1$.  If
that's the case then the output $\sigma(z)$ from the hidden neuron
will be very close to either $1$ or $0$.  That means our hidden neuron
will have saturated.  And when that happens, as we know, making small
changes in the weights will make only absolutely minuscule changes in
the activation of our hidden neuron.  That minuscule change in the
activation of the hidden neuron will, in turn, barely affect the rest
of the neurons in the network at all, and we'll see a correspondingly
minuscule change in the cost function.  As a result, those weights
will only learn very slowly when we use the gradient descent
algorithm*<span class="marginnote">
*We discussed this in more detail in Chapter 2,
  where we used the
  <a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">equations
    of backpropagation</a> to show that weights input to saturated
  neurons learned slowly.</span>.  It's similar to the problem we discussed
earlier in this chapter, in which output neurons which saturated on
the wrong value caused learning to slow down.  We addressed that
earlier problem with a clever choice of cost function.  Unfortunately,
while that helped with saturated output neurons, it does nothing at
all for the problem with saturated hidden neurons.</p><p>I've been talking about the weights input to the first hidden layer.
Of course, similar arguments apply also to later hidden layers: if the
weights in later hidden layers are initialized using normalized
Gaussians, then activations will often be very close to $0$ or $1$,
and learning will proceed very slowly.</p><p>Is there some way we can choose better initializations for the weights
and biases, so that we don't get this kind of saturation, and so avoid
a learning slowdown?  Suppose we have a neuron with $n_{\rm in}$ input
weights.  Then we shall initialize those weights as Gaussian random
variables with mean $0$ and standard deviation $1/\sqrt{n_{\rm in}}$.
That is, we'll squash the Gaussians down, making it less likely that
our neuron will saturate.  We'll continue to choose the bias as a
Gaussian with mean $0$ and standard deviation $1$, for reasons I'll
return to in a moment.  With these choices, the weighted sum $z =
\sum_j w_j x_j + b$ will again be a Gaussian random variable with mean
$0$, but it'll be much more sharply peaked than it was before.
Suppose, as we did earlier, that $500$ of the inputs are zero and
$500$ are $1$.  Then it's easy to show (see the exercise below) that
$z$ has a Gaussian distribution with mean $0$ and standard deviation
$\sqrt{3/2} = 1.22\ldots$.  This is much more sharply peaked than
before, so much so that even the graph below understates the
situation, since I've had to rescale the vertical axis, when compared
to the earlier graph:</p><p><div id="narrow_gaussian"></div> <script
  type="text/javascript" src="js/narrow_gaussian.js"></script>  </p><p>Such a neuron is much less likely to saturate, and correspondingly
much less likely to have problems with a learning slowdown.</p><p><h4><a name="exercise_319349"></a><a href="#exercise_319349">Exercise</a></h4><ul>
<li> Verify that the standard deviation of $z = \sum_j w_j x_j + b$
  in the paragraph above is $\sqrt{3/2}$.  It may help to know that:
  (a) the variance of a sum of independent random variables is the sum
  of the variances of the individual random variables; and (b) the
  variance is the square of the standard deviation.
</ul></p><p>I stated above that we'll continue to initialize the biases as before,
as Gaussian random variables with a mean of $0$ and a standard
deviation of $1$.  This is okay, because it doesn't make it too much
more likely that our neurons will saturate.  In fact, it doesn't much
matter how we initialize the biases, provided we avoid the problem
with saturation.  Some people go so far as to initialize all the
biases to $0$, and rely on gradient descent to learn appropriate
biases.  But since it's unlikely to make much difference, we'll
continue with the same initialization procedure as before.</p><p>Let's compare the results for both our old and new approaches to
weight initialization, using the MNIST digit classification task.  As
before, we'll use $30$ hidden neurons, a mini-batch size of $10$, a
regularization parameter $\lambda = 5.0$, and the cross-entropy cost
function.  We will decrease the learning rate slightly from $\eta =
0.5$ to $0.1$, since that makes the results a little more easily
visible in the graphs.  We can train using the old method of weight
initialization:
<div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">mnist_loader</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> \
<span class="o">...</span> <span class="n">mnist_loader</span><span class="o">.</span><span class="n">load_data_wrapper</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">network2</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">],</span> <span class="n">cost</span><span class="o">=</span><span class="n">network2</span><span class="o">.</span><span class="n">CrossEntropyCost</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">large_weight_initializer</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">lmbda</span> <span class="o">=</span> <span class="mf">5.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>

We can also train using the new approach to weight initialization.
This is actually even easier, since <tt>network2</tt>'s default way of
initializing the weights is using this new approach.  That means we
can omit the <tt>net.large_weight_initializer()</tt> call above:
<div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">],</span> <span class="n">cost</span><span class="o">=</span><span class="n">network2</span><span class="o">.</span><span class="n">CrossEntropyCost</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">lmbda</span> <span class="o">=</span> <span class="mf">5.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>

Plotting the results*<span class="marginnote">
*The program used to generate this and
  the next graph is
  <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/fig/weight_initialization.py">weight_initialization.py</a>.</span>,
we obtain:</p><p><center><img src="images/weight_initialization_30.png" width="520px"></center></p><p>In both cases, we end up with a classification accuracy somewhat over
96 percent.  The final classification accuracy is almost exactly the
same in the two cases.  But the new initialization technique brings us
there much, much faster.  At the end of the first epoch of training
the old approach to weight initialization has a classification
accuracy under 87 percent, while the new approach is already almost 93
percent.  What appears to be going on is that our new approach to
weight initialization starts us off in a much better regime, which
lets us get good results much more quickly.  The same phenomenon is
also seen if we plot results with $100$ hidden neurons:</p><p><center><img src="images/weight_initialization_100.png" width="520px"></center></p><p>In this case, the two curves don't quite meet.  However, my
experiments suggest that with just a few more epochs of training (not
shown) the accuracies become almost exactly the same.  So on the basis
of these experiments it looks as though the improved weight
initialization only speeds up learning; it doesn't change the final
performance of our networks.  However, in Chapter 4 we'll see examples
of neural networks where the long-run behaviour is significantly
better with the $1/\sqrt{n_{\rm in}}$ weight initialization.  Thus
it's not only the speed of learning which is improved, it's sometimes
also the final performance.</p><p>The $1/\sqrt{n_{\rm in}}$ approach to weight initialization helps
improve the way our neural nets learn.  Other techniques for weight
initialization have also been proposed, many building on this basic
idea.  I won't review the other approaches here, since $1/\sqrt{n_{\rm
    in}}$ works well enough for our purposes.  If you're interested in
looking further, I recommend looking at the discussion on pages 14 and
15 of a 2012 paper by Yoshua
Bengio*<span class="marginnote">
*<a href="http://arxiv.org/pdf/1206.5533v2.pdf">Practical
    Recommendations for Gradient-Based Training of Deep
    Architectures</a>, by Yoshua Bengio (2012).</span>, as well as the
references therein.</p><p><h4><a name="problem_735589"></a><a href="#problem_735589">Problem</a></h4><ul>
<li><strong>Connecting regularization and the improved method of weight
  initialization</strong> L2 regularization sometimes automatically gives us
  something similar to the new approach to weight initialization.
  Suppose we are using the old approach to weight initialization.
  Sketch a heuristic argument that: (1) supposing $\lambda$ is not too
  small, the first epochs of training will be dominated almost
  entirely by weight decay; (2) provided $\eta \lambda \ll n$, where
  $n$ is the size of the training set, the weights will decay by a
  factor of $\exp(-\eta \lambda / m)$ per epoch, where $m$ is the
  mini-batch size; and (3) supposing $\lambda$ is not too large, the
  weight decay will tail off when the weights are down to a size around
  $1/\sqrt{n}$, where $n$ is now the total number of weights in the
  network.  Argue that these conditions are all satisfied in the
  examples graphed in this section.
</ul></p><p><h3><a name="handwriting_recognition_revisited_the_code"></a><a href="#handwriting_recognition_revisited_the_code">Handwriting recognition revisited: the code</a></h3></p><p>Let's implement the ideas we've discussed in this chapter.  We'll
develop a new program,
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network2.py"><tt>network2.py</tt></a>,
which is an improved version of the program
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network.py"><tt>network.py</tt></a>
we developed in
<a href="chap1.html#implementing_our_network_to_classify_digits">Chapter
  1</a>.  If you haven't looked at <tt>network.py</tt> in a while then you
may find it helpful to spend a few minutes quickly reading over the
earlier discussion.  It's only 74 lines of code, and is easily
understood. </p><p>As was the case in <tt>network.py</tt>, the star of <tt>network2.py</tt>
is the <tt>Network</tt> class, which we use to represent our neural
networks.  We initialize an instance of <tt>Network</tt> with a list of
<tt>sizes</tt> for the respective layers in the network, and a choice
for the <tt>cost</tt> to use, defaulting to the cross-entropy:</p><p><div class="highlight"><pre><span class="k">class</span> <span class="nc">Network</span><span class="p">():</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">sizes</span><span class="p">,</span> <span class="n">cost</span><span class="o">=</span><span class="n">CrossEntropyCost</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">num_layers</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">sizes</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">sizes</span> <span class="o">=</span> <span class="n">sizes</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">default_weight_initializer</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="o">=</span><span class="n">cost</span>
</pre></div>
</p><p>The first couple of lines of the <tt>__init__</tt> method are the same
as in <tt>network.py</tt>, and are pretty self-explanatory.  But the
next two lines are new, and we need to understand what they're doing
in detail.</p><p>Let's start by examining the <tt>default_weight_initializer</tt>
method. This makes use of our <a href="#weight_initialization">new and
  improved approach</a> to weight initialization.  As we've seen, in that
approach the weights input to a neuron are initialized as Gaussian
random variables with mean $0$ and standard deviation $1$ divided by the
square root of the number of connections input to the neuron.  Also in
this method we'll initialize the biases, using Gaussian random
variables with mean $0$ and standard deviation $1$.  Here's the code:</p><p><div class="highlight"><pre>    <span class="k">def</span> <span class="nf">default_weight_initializer</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:]]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
                        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">sizes</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:])]</span>
</pre></div>
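<p>To see concretely what the two list comprehensions produce, here is a small sketch for a <tt>[784, 30, 10]</tt> network.  It is an illustration rather than part of <tt>network2.py</tt>: it uses only the standard library, and records the array shapes and the standard deviation used for each layer's weights without drawing any random numbers.</p>

```python
import math

sizes = [784, 30, 10]

# One column vector of biases per non-input layer: shape (y, 1).
bias_shapes = [(y, 1) for y in sizes[1:]]

# One weight matrix per pair of adjacent layers: shape (y, x), where x is
# the number of neurons feeding in and y the number of neurons being fed.
weight_shapes = [(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

# Dividing randn(y, x) by sqrt(x) gives each weight a standard deviation
# of 1/sqrt(x), i.e. 1/sqrt(n_in) for the neurons in that layer.
weight_stds = [1.0 / math.sqrt(x) for x in sizes[:-1]]

print(bias_shapes)    # [(30, 1), (10, 1)]
print(weight_shapes)  # [(30, 784), (10, 30)]
print([round(s, 4) for s in weight_stds])  # [0.0357, 0.1826]
```

<p>Note that the weights feeding the hidden layer are scaled down by a factor of $28$, while those feeding the output layer are scaled down only by $\sqrt{30} \approx 5.5$, since each output neuron has far fewer inputs.</p>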
</p><p>To understand the code, it may help to recall that <tt>np</tt> is the
Numpy library for doing linear algebra.  We'll <tt>import</tt> Numpy at
the beginning of our program.  Also, notice that we don't initialize
any biases for the first layer of neurons.  We avoid doing this
because the first layer is an input layer, and so any biases would not
be used.  We did exactly the same thing in <tt>network.py</tt>.</p><p>Complementing the <tt>default_weight_initializer</tt> we'll also include
a <tt>large_weight_initializer</tt> method.  This method initializes the
weights and biases using the old approach from Chapter 1, with both
weights and biases initialized as Gaussian random variables with mean
$0$ and standard deviation $1$.  The code is, of course, only a tiny
bit different from the <tt>default_weight_initializer</tt>:</p><p><div class="highlight"><pre>    <span class="k">def</span> <span class="nf">large_weight_initializer</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:]]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
                        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">sizes</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:])]</span>
</pre></div>
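<p>The difference between the two initializers can be checked numerically.  The sketch below is an illustration under the assumptions of the earlier example (1,000 inputs, half of them set to $1$ and half to $0$).  The function names <tt>sample_z</tt> and <tt>summarize</tt>, the sample count, and the saturation threshold of $0.01$ are arbitrary choices made for this illustration, and it uses Python's <tt>random</tt> module rather than Numpy so that it stands alone.</p>

```python
import math
import random


def sample_z(weight_std, n_in=1000, n_on=500, rng=random):
    """Draw one weighted input z = sum_j w_j x_j + b for a single neuron.

    Only the n_on inputs equal to 1 contribute to the sum, so only their
    weights are sampled.  The bias always has standard deviation 1.
    """
    assert n_on <= n_in
    weighted_sum = sum(rng.gauss(0.0, weight_std) for _ in range(n_on))
    return weighted_sum + rng.gauss(0.0, 1.0)


def summarize(weight_std, n_in=1000, samples=1000, seed=0):
    """Return (standard deviation of z, fraction of saturated neurons)."""
    rng = random.Random(seed)
    zs = [sample_z(weight_std, n_in=n_in, rng=rng) for _ in range(samples)]
    mean = sum(zs) / len(zs)
    std = math.sqrt(sum((z - mean) ** 2 for z in zs) / len(zs))
    # Call a neuron saturated if sigma(z) is within 0.01 of 0 or 1,
    # which is the same as |z| > ln(99).
    saturated = sum(1 for z in zs if abs(z) > math.log(99.0)) / len(zs)
    return std, saturated


n_in = 1000
# Old approach: weights with standard deviation 1.
old_std, old_saturated = summarize(1.0, n_in)
# New approach: weights with standard deviation 1/sqrt(n_in).
new_std, new_saturated = summarize(1.0 / math.sqrt(n_in), n_in)

print("old: std %.2f, saturated %.1f%%" % (old_std, 100 * old_saturated))
print("new: std %.2f, saturated %.1f%%" % (new_std, 100 * new_saturated))
```

<p>One should expect the first standard deviation to come out near $\sqrt{501} \approx 22.4$ and the second near $\sqrt{3/2} \approx 1.22$, with most neurons saturated in the first case and very few in the second.</p>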
</p><p>I've included the <tt>large_weight_initializer</tt> method mostly as a
convenience to make it easier to compare the results in this chapter
to those in Chapter 1. I can't think of many practical situations
where I would recommend using it!</p><p>The second new thing in <tt>Network</tt>'s <tt>__init__</tt> method is
that we now initialize a <tt>cost</tt> attribute.  To understand how
that works, let's look at the class we use to represent the
cross-entropy cost*<span class="marginnote">
*If you're not familiar with Python's
  static methods you can ignore the <tt>@staticmethod</tt> decorators,
  and just treat <tt>fn</tt> and <tt>delta</tt> as ordinary methods.  If
  you're curious about details, all <tt>@staticmethod</tt> does is tell
  the Python interpreter that the method which follows doesn't depend
  on the object in any way.  That's why <tt>self</tt> isn't passed as a
  parameter to the <tt>fn</tt> and <tt>delta</tt> methods.</span>:</p><p><div class="highlight"><pre><span class="k">class</span> <span class="nc">CrossEntropyCost</span><span class="p">:</span>

    <span class="nd">@staticmethod</span>
    <span class="k">def</span> <span class="nf">fn</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">nan_to_num</span><span class="p">(</span><span class="o">-</span><span class="n">y</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="o">-</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">y</span><span class="p">)</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">a</span><span class="p">)))</span>

    <span class="nd">@staticmethod</span>
    <span class="k">def</span> <span class="nf">delta</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">a</span><span class="o">-</span><span class="n">y</span><span class="p">)</span>
</pre></div>
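<p>As a quick numerical check on the <tt>fn</tt> method, here is a small standalone sketch.  The function name <tt>cross_entropy</tt> and the test vectors are assumptions made for illustration.  The sketch applies <tt>np.nan_to_num</tt> to each term before summing.  Applied to an already-summed value, it would turn the whole cost into 0.0 whenever any single slot produced <tt>nan</tt>, which would throw away the contributions from the other slots.</p>

```python
import numpy as np

def cross_entropy(a, y):
    """Cross-entropy summed over the output neurons.  Terms that come out
    as nan (from 0*log(0)) are replaced by 0.0 before the sum is taken."""
    # Silence the warnings NumPy raises for log(0) and 0*inf.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = -y*np.log(a) - (1-y)*np.log(1-a)
    return np.sum(np.nan_to_num(terms))

y = np.array([[1.0], [0.0]])        # desired output
a_exact = np.array([[1.0], [0.0]])  # output matches y exactly
a_close = np.array([[1.0], [0.5]])  # second neuron is undecided

print(cross_entropy(a_exact, y))    # every term is nan, so the cost is 0.0
print(cross_entropy(a_close, y))    # only the second neuron contributes
```

<p>In the second case the first slot still produces <tt>nan</tt>, but the second slot contributes $-\ln 0.5 \approx 0.693$, and that contribution survives because the conversion is done term by term.</p>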
</p><p>Let's break this down.  The first thing to observe is that even though
the cross-entropy is, mathematically speaking, a function, we've
implemented it as a Python class, not a Python function.  Why have I
made that choice?  The reason is that the cost plays two different
roles in our network.  The obvious role is that it's a measure of how
well an output activation, <tt>a</tt>, matches the desired output,
<tt>y</tt>.  This role is captured by the <tt>CrossEntropyCost.fn</tt>
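method.  (Note, by the way, that the <tt>np.nan_to_num</tt> call inside
<tt>CrossEntropyCost.fn</tt> guards against a numerical edge case: when
<tt>a</tt> and <tt>y</tt> are both exactly 1.0 in the same slot, the term
<tt>(1-y)*np.log(1-a)</tt> evaluates to <tt>nan</tt>, and
<tt>np.nan_to_num</tt> converts it to the correct value, 0.0.)  But there's also a second way the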
cost function enters our network.  Recall from
<a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">Chapter
  2</a> that when running the backpropagation algorithm we need to
compute the network's output error, $\delta^L$. The form of the output
error depends on the choice of cost function: different cost function,
different form for the output error.  For the cross-entropy the output
error is, as we saw in Equation <span id="margin_33201621799_reveal" class="equation_link">(66)</span><span id="margin_33201621799" class="marginequation" style="display: none;"><a href="chap3.html#eqtn66" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
    \delta^L = a^L-y.
   \nonumber\end{eqnarray}</a></span><script>$('#margin_33201621799_reveal').click(function() {$('#margin_33201621799').toggle('slow', function() {});});</script>,</p><p><a class="displaced_anchor" name="eqtn92"></a>\begin{eqnarray}
  \delta^L = a^L-y.
\tag{92}\end{eqnarray}
For this reason we define a second method,
<tt>CrossEntropyCost.delta</tt>, whose purpose is to tell our network
how to compute the output error.  And then we bundle these two methods
up into a single class containing everything our networks need to know
about the cost function.</p><p>In a similar way, <tt>network2.py</tt> also contains a class to
represent the quadratic cost function.  This is included for
comparison with the results of Chapter 1, since going forward we'll
mostly use the cross-entropy.  The code is just below.  The
<tt>QuadraticCost.fn</tt> method is a straightforward computation of the
quadratic cost associated with the actual output, <tt>a</tt>, and the
desired output, <tt>y</tt>.  The value returned by
<tt>QuadraticCost.delta</tt> is based on the
expression <span id="margin_667945012404_reveal" class="equation_link">(30)</span><span id="margin_667945012404" class="marginequation" style="display: none;"><a href="chap2.html#eqtn30" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L = (a^L-y) \odot \sigma'(z^L) \nonumber\end{eqnarray}</a></span><script>$('#margin_667945012404_reveal').click(function() {$('#margin_667945012404').toggle('slow', function() {});});</script> for the output error for the
quadratic cost, which we derived back in Chapter 2.</p><p><div class="highlight"><pre><span class="k">class</span> <span class="nc">QuadraticCost</span><span class="p">:</span>

    <span class="nd">@staticmethod</span>
    <span class="k">def</span> <span class="nf">fn</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="k">return</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">a</span><span class="o">-</span><span class="n">y</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span>

    <span class="nd">@staticmethod</span>
    <span class="k">def</span> <span class="nf">delta</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">a</span><span class="o">-</span><span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">sigmoid_prime_vec</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
</pre></div>
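<p>To see why the choice of <tt>delta</tt> matters, here is a minimal sketch comparing the two output errors for a single badly wrong, saturated output neuron.  The functions <tt>sigmoid</tt> and <tt>sigmoid_prime</tt> are stand-ins written for this example (assumed to behave like the vectorized versions used by <tt>network2.py</tt>), and the values of <tt>z</tt> and <tt>y</tt> are hypothetical.</p>

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

def quadratic_delta(z, a, y):
    """Output error for the quadratic cost: (a-y) * sigma'(z)."""
    return (a-y) * sigmoid_prime(z)

def cross_entropy_delta(z, a, y):
    """Output error for the cross-entropy cost: a-y.  z is unused."""
    return (a-y)

z = np.array([[5.0]])   # hypothetical weighted input; the neuron is saturated
a = sigmoid(z)          # activation is close to 1
y = np.array([[0.0]])   # but the desired output is 0

print("quadratic delta:     {0:.4f}".format(quadratic_delta(z, a, y)[0, 0]))
print("cross-entropy delta: {0:.4f}".format(cross_entropy_delta(z, a, y)[0, 0]))
```

<p>The quadratic error is scaled down by the factor $\sigma'(z^L)$, which is tiny for a saturated neuron, while the cross-entropy error stays close to the full discrepancy $a^L-y$.</p>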
</p><p>We've now understood the main differences between <tt>network2.py</tt>
and <tt>network.py</tt>.  It's all pretty simple stuff.  There are a
number of smaller changes, which I'll discuss below, including the
implementation of L2 regularization.  Before getting to that, let's
look at the complete code for <tt>network2.py</tt>.  You don't need to
read all the code in detail, but it is worth understanding the broad
structure, and in particular reading the documentation strings, so you
understand what each piece of the program is doing.  Of course, you're
also welcome to delve as deeply as you wish!  If you get lost, you may
wish to continue reading the prose below, and return to the code
later.  Anyway, here's the code:</p><p><div class="highlight"><pre><span class="sd">&quot;&quot;&quot;network2.py</span>
<span class="sd">~~~~~~~~~~~~~~</span>

<span class="sd">An improved version of network.py, implementing the stochastic</span>
<span class="sd">gradient descent learning algorithm for a feedforward neural network.</span>
<span class="sd">Improvements include the addition of the cross-entropy cost function,</span>
<span class="sd">regularization, and better initialization of network weights.  Note</span>
<span class="sd">that I have focused on making the code simple, easily readable, and</span>
<span class="sd">easily modifiable.  It is not optimized, and omits many desirable</span>
<span class="sd">features.</span>

<span class="sd">&quot;&quot;&quot;</span>

<span class="c">#### Libraries</span>
<span class="c"># Standard library</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">sys</span>

<span class="c"># Third-party libraries</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>


<span class="c">#### Define the quadratic and cross-entropy cost functions</span>

<span class="k">class</span> <span class="nc">QuadraticCost</span><span class="p">:</span>

    <span class="nd">@staticmethod</span>
    <span class="k">def</span> <span class="nf">fn</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return the cost associated with an output ``a`` and desired output</span>
<span class="sd">        ``y``.</span>

<span class="sd">        &quot;&quot;&quot;</span>
        <span class="k">return</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">a</span><span class="o">-</span><span class="n">y</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span>

    <span class="nd">@staticmethod</span>
    <span class="k">def</span> <span class="nf">delta</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return the error delta from the output layer.&quot;&quot;&quot;</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">a</span><span class="o">-</span><span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">sigmoid_prime_vec</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">CrossEntropyCost</span><span class="p">:</span>

    <span class="nd">@staticmethod</span>
    <span class="k">def</span> <span class="nf">fn</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return the cost associated with an output ``a`` and desired output</span>
<span class="sd">        ``y``.  Note that np.nan_to_num is used to ensure numerical</span>
<span class="sd">        stability.  In particular, if both ``a`` and ``y`` have a 1.0</span>
<span class="sd">        in the same slot, then the expression (1-y)*np.log(1-a)</span>
<span class="sd">        returns nan.  The np.nan_to_num ensures that that is converted</span>
<span class="sd">        to the correct value (0.0).</span>

<span class="sd">        &quot;&quot;&quot;</span>
        <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">nan_to_num</span><span class="p">(</span><span class="o">-</span><span class="n">y</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="o">-</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">y</span><span class="p">)</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">a</span><span class="p">)))</span>

    <span class="nd">@staticmethod</span>
    <span class="k">def</span> <span class="nf">delta</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return the error delta from the output layer.  Note that the</span>
<span class="sd">        parameter ``z`` is not used by the method.  It is included in</span>
<span class="sd">        the method&#39;s parameters in order to make the interface</span>
<span class="sd">        consistent with the delta method for other cost classes.</span>

<span class="sd">        &quot;&quot;&quot;</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">a</span><span class="o">-</span><span class="n">y</span><span class="p">)</span>


<span class="c">#### Main Network class</span>
<span class="k">class</span> <span class="nc">Network</span><span class="p">():</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">sizes</span><span class="p">,</span> <span class="n">cost</span><span class="o">=</span><span class="n">CrossEntropyCost</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;The list ``sizes`` contains the number of neurons in the respective</span>
<span class="sd">        layers of the network.  For example, if the list was [2, 3, 1]</span>
<span class="sd">        then it would be a three-layer network, with the first layer</span>
<span class="sd">        containing 2 neurons, the second layer 3 neurons, and the</span>
<span class="sd">        third layer 1 neuron.  The biases and weights for the network</span>
<span class="sd">        are initialized randomly, using</span>
<span class="sd">        ``self.default_weight_initializer`` (see docstring for that</span>
<span class="sd">        method).</span>

<span class="sd">        &quot;&quot;&quot;</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">num_layers</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">sizes</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">sizes</span> <span class="o">=</span> <span class="n">sizes</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">default_weight_initializer</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">cost</span> <span class="o">=</span> <span class="n">cost</span>

    <span class="k">def</span> <span class="nf">default_weight_initializer</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Initialize each weight using a Gaussian distribution with mean 0</span>
<span class="sd">        and standard deviation 1 over the square root of the number of</span>
<span class="sd">        weights connecting to the same neuron.  Initialize the biases</span>
<span class="sd">        using a Gaussian distribution with mean 0 and standard</span>
<span class="sd">        deviation 1.</span>

<span class="sd">        Note that the first layer is assumed to be an input layer, and</span>
<span class="sd">        by convention we won&#39;t set any biases for those neurons, since</span>
<span class="sd">        biases are only ever used in computing the outputs from later</span>
<span class="sd">        layers.</span>

<span class="sd">        &quot;&quot;&quot;</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:]]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
                        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">sizes</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:])]</span>

    <span class="k">def</span> <span class="nf">large_weight_initializer</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Initialize the weights using a Gaussian distribution with mean 0</span>
<span class="sd">        and standard deviation 1.  Initialize the biases using a</span>
<span class="sd">        Gaussian distribution with mean 0 and standard deviation 1.</span>

<span class="sd">        Note that the first layer is assumed to be an input layer, and</span>
<span class="sd">        by convention we won&#39;t set any biases for those neurons, since</span>
<span class="sd">        biases are only ever used in computing the outputs from later</span>
<span class="sd">        layers.</span>

<span class="sd">        This weight and bias initializer uses the same approach as in</span>
<span class="sd">        Chapter 1, and is included for purposes of comparison.  It</span>
<span class="sd">        will usually be better to use the default weight initializer</span>
<span class="sd">        instead.</span>

<span class="sd">        &quot;&quot;&quot;</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:]]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
                        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">sizes</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:])]</span>

    <span class="k">def</span> <span class="nf">feedforward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">a</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return the output of the network if ``a`` is input.&quot;&quot;&quot;</span>
        <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">):</span>
            <span class="n">a</span> <span class="o">=</span> <span class="n">sigmoid_vec</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">a</span><span class="p">)</span><span class="o">+</span><span class="n">b</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">a</span>

    <span class="k">def</span> <span class="nf">SGD</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">epochs</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="n">eta</span><span class="p">,</span>
            <span class="n">lmbda</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span>
            <span class="n">evaluation_data</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
            <span class="n">monitor_evaluation_cost</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
            <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
            <span class="n">monitor_training_cost</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
            <span class="n">monitor_training_accuracy</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Train the neural network using mini-batch stochastic gradient</span>
<span class="sd">        descent.  The ``training_data`` is a list of tuples ``(x, y)``</span>
<span class="sd">        representing the training inputs and the desired outputs.  The</span>
<span class="sd">        other non-optional parameters are self-explanatory, as is the</span>
<span class="sd">        regularization parameter ``lmbda``.  The method also accepts</span>
<span class="sd">        ``evaluation_data``, usually either the validation or test</span>
<span class="sd">        data.  We can monitor the cost and accuracy on either the</span>
<span class="sd">        evaluation data or the training data, by setting the</span>
<span class="sd">        appropriate flags.  The method returns a tuple containing four</span>
<span class="sd">        lists: the (per-epoch) costs on the evaluation data, the</span>
<span class="sd">        accuracies on the evaluation data, the costs on the training</span>
<span class="sd">        data, and the accuracies on the training data.  All values are</span>
<span class="sd">        evaluated at the end of each training epoch.  So, for example,</span>
<span class="sd">        if we train for 30 epochs, then the first element of the tuple</span>
<span class="sd">        will be a 30-element list containing the cost on the</span>
<span class="sd">        evaluation data at the end of each epoch. Note that the lists</span>
<span class="sd">        are empty if the corresponding flag is not set.</span>

<span class="sd">        &quot;&quot;&quot;</span>
        <span class="k">if</span> <span class="n">evaluation_data</span><span class="p">:</span> <span class="n">n_data</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">evaluation_data</span><span class="p">)</span>
        <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">training_data</span><span class="p">)</span>
        <span class="n">evaluation_cost</span><span class="p">,</span> <span class="n">evaluation_accuracy</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[]</span>
        <span class="n">training_cost</span><span class="p">,</span> <span class="n">training_accuracy</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
            <span class="n">random</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">training_data</span><span class="p">)</span>
            <span class="n">mini_batches</span> <span class="o">=</span> <span class="p">[</span>
                <span class="n">training_data</span><span class="p">[</span><span class="n">k</span><span class="p">:</span><span class="n">k</span><span class="o">+</span><span class="n">mini_batch_size</span><span class="p">]</span>
                <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">)]</span>
            <span class="k">for</span> <span class="n">mini_batch</span> <span class="ow">in</span> <span class="n">mini_batches</span><span class="p">:</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">update_mini_batch</span><span class="p">(</span>
                    <span class="n">mini_batch</span><span class="p">,</span> <span class="n">eta</span><span class="p">,</span> <span class="n">lmbda</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">training_data</span><span class="p">))</span>
            <span class="k">print</span> <span class="s">&quot;Epoch </span><span class="si">%s</span><span class="s"> training complete&quot;</span> <span class="o">%</span> <span class="n">j</span>
            <span class="k">if</span> <span class="n">monitor_training_cost</span><span class="p">:</span>
                <span class="n">cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">total_cost</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="n">lmbda</span><span class="p">)</span>
                <span class="n">training_cost</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">cost</span><span class="p">)</span>
                <span class="k">print</span> <span class="s">&quot;Cost on training data: {}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">cost</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">monitor_training_accuracy</span><span class="p">:</span>
                <span class="n">accuracy</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">accuracy</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="n">convert</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
                <span class="n">training_accuracy</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">accuracy</span><span class="p">)</span>
                <span class="k">print</span> <span class="s">&quot;Accuracy on training data: {} / {}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
                    <span class="n">accuracy</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">monitor_evaluation_cost</span><span class="p">:</span>
                <span class="n">cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">total_cost</span><span class="p">(</span><span class="n">evaluation_data</span><span class="p">,</span> <span class="n">lmbda</span><span class="p">,</span> <span class="n">convert</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
                <span class="n">evaluation_cost</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">cost</span><span class="p">)</span>
                <span class="k">print</span> <span class="s">&quot;Cost on evaluation data: {}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">cost</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">monitor_evaluation_accuracy</span><span class="p">:</span>
                <span class="n">accuracy</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">accuracy</span><span class="p">(</span><span class="n">evaluation_data</span><span class="p">)</span>
                <span class="n">evaluation_accuracy</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">accuracy</span><span class="p">)</span>
                <span class="k">print</span> <span class="s">&quot;Accuracy on evaluation data: {} / {}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
                    <span class="n">accuracy</span><span class="p">,</span> <span class="n">n_data</span><span class="p">)</span>
            <span class="k">print</span>
        <span class="k">return</span> <span class="n">evaluation_cost</span><span class="p">,</span> <span class="n">evaluation_accuracy</span><span class="p">,</span> \
            <span class="n">training_cost</span><span class="p">,</span> <span class="n">training_accuracy</span>

    <span class="k">def</span> <span class="nf">update_mini_batch</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">mini_batch</span><span class="p">,</span> <span class="n">eta</span><span class="p">,</span> <span class="n">lmbda</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Update the network&#39;s weights and biases by applying gradient</span>
<span class="sd">        descent using backpropagation to a single mini batch.  The</span>
<span class="sd">        ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the</span>
<span class="sd">        learning rate, ``lmbda`` is the regularization parameter, and</span>
<span class="sd">        ``n`` is the total size of the training data set.</span>

<span class="sd">        &quot;&quot;&quot;</span>
        <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">b</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">]</span>
        <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">]</span>
        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">mini_batch</span><span class="p">:</span>
            <span class="n">delta_nabla_b</span><span class="p">,</span> <span class="n">delta_nabla_w</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">backprop</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
            <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">nb</span><span class="o">+</span><span class="n">dnb</span> <span class="k">for</span> <span class="n">nb</span><span class="p">,</span> <span class="n">dnb</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">nabla_b</span><span class="p">,</span> <span class="n">delta_nabla_b</span><span class="p">)]</span>
            <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">nw</span><span class="o">+</span><span class="n">dnw</span> <span class="k">for</span> <span class="n">nw</span><span class="p">,</span> <span class="n">dnw</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">nabla_w</span><span class="p">,</span> <span class="n">delta_nabla_w</span><span class="p">)]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">1</span><span class="o">-</span><span class="n">eta</span><span class="o">*</span><span class="p">(</span><span class="n">lmbda</span><span class="o">/</span><span class="n">n</span><span class="p">))</span><span class="o">*</span><span class="n">w</span><span class="o">-</span><span class="p">(</span><span class="n">eta</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">mini_batch</span><span class="p">))</span><span class="o">*</span><span class="n">nw</span>
                        <span class="k">for</span> <span class="n">w</span><span class="p">,</span> <span class="n">nw</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">,</span> <span class="n">nabla_w</span><span class="p">)]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">b</span><span class="o">-</span><span class="p">(</span><span class="n">eta</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">mini_batch</span><span class="p">))</span><span class="o">*</span><span class="n">nb</span>
                       <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">nb</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">,</span> <span class="n">nabla_b</span><span class="p">)]</span>

    <span class="k">def</span> <span class="nf">backprop</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return a tuple ``(nabla_b, nabla_w)`` representing the</span>
<span class="sd">        gradient for the cost function C_x.  ``nabla_b`` and</span>
<span class="sd">        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar</span>
<span class="sd">        to ``self.biases`` and ``self.weights``.&quot;&quot;&quot;</span>
        <span class="n">nabla_b</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">b</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">]</span>
        <span class="n">nabla_w</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">]</span>
        <span class="c"># feedforward</span>
        <span class="n">activation</span> <span class="o">=</span> <span class="n">x</span>
        <span class="n">activations</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="c"># list to store all the activations, layer by layer</span>
        <span class="n">zs</span> <span class="o">=</span> <span class="p">[]</span> <span class="c"># list to store all the z vectors, layer by layer</span>
        <span class="k">for</span> <span class="n">b</span><span class="p">,</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">):</span>
            <span class="n">z</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">activation</span><span class="p">)</span><span class="o">+</span><span class="n">b</span>
            <span class="n">zs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">activation</span> <span class="o">=</span> <span class="n">sigmoid_vec</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">activations</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">activation</span><span class="p">)</span>
        <span class="c"># backward pass</span>
        <span class="n">delta</span> <span class="o">=</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">)</span><span class="o">.</span><span class="n">delta</span><span class="p">(</span><span class="n">zs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">y</span><span class="p">)</span>
        <span class="n">nabla_b</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">delta</span>
        <span class="n">nabla_w</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
        <span class="c"># Note that the variable l in the loop below is used a little</span>
        <span class="c"># differently to the notation in Chapter 2 of the book.  Here,</span>
        <span class="c"># l = 1 means the last layer of neurons, l = 2 is the</span>
        <span class="c"># second-last layer, and so on.  It&#39;s a renumbering of the</span>
        <span class="c"># scheme in the book, used here to take advantage of the fact</span>
        <span class="c"># that Python can use negative indices in lists.</span>
        <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">num_layers</span><span class="p">):</span>
            <span class="n">z</span> <span class="o">=</span> <span class="n">zs</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span>
            <span class="n">spv</span> <span class="o">=</span> <span class="n">sigmoid_prime_vec</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
            <span class="n">delta</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">(),</span> <span class="n">delta</span><span class="p">)</span> <span class="o">*</span> <span class="n">spv</span>
            <span class="n">nabla_b</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span> <span class="o">=</span> <span class="n">delta</span>
            <span class="n">nabla_w</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">delta</span><span class="p">,</span> <span class="n">activations</span><span class="p">[</span><span class="o">-</span><span class="n">l</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">nabla_b</span><span class="p">,</span> <span class="n">nabla_w</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">accuracy</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">convert</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return the number of inputs in ``data`` for which the neural</span>
<span class="sd">        network outputs the correct result. The neural network&#39;s</span>
<span class="sd">        output is assumed to be the index of whichever neuron in the</span>
<span class="sd">        final layer has the highest activation.  </span>

<span class="sd">        The flag ``convert`` should be set to False if the data set is</span>
<span class="sd">        validation or test data (the usual case), and to True if the</span>
<span class="sd">        data set is the training data. The need for this flag arises</span>
<span class="sd">        due to differences in the way the results ``y`` are</span>
<span class="sd">        represented in the different data sets.  In particular, it</span>
<span class="sd">        flags whether we need to convert between the different</span>
<span class="sd">        representations.  It may seem strange to use different</span>
<span class="sd">        representations for the different data sets.  Why not use the</span>
<span class="sd">        same representation for all three data sets?  It&#39;s done for</span>
<span class="sd">        efficiency reasons -- the program usually evaluates the cost</span>
<span class="sd">        on the training data and the accuracy on other data sets.</span>
<span class="sd">        These are different types of computations, and using different</span>
<span class="sd">        representations speeds things up.  More details on the</span>
<span class="sd">        representations can be found in</span>
<span class="sd">        mnist_loader.load_data_wrapper.</span>

<span class="sd">        &quot;&quot;&quot;</span>
        <span class="k">if</span> <span class="n">convert</span><span class="p">:</span>
            <span class="n">results</span> <span class="o">=</span> <span class="p">[(</span><span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">feedforward</span><span class="p">(</span><span class="n">x</span><span class="p">)),</span> <span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">y</span><span class="p">))</span>
                       <span class="k">for</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="ow">in</span> <span class="n">data</span><span class="p">]</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">results</span> <span class="o">=</span> <span class="p">[(</span><span class="n">np</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">feedforward</span><span class="p">(</span><span class="n">x</span><span class="p">)),</span> <span class="n">y</span><span class="p">)</span>
                        <span class="k">for</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="ow">in</span> <span class="n">data</span><span class="p">]</span>
        <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">x</span> <span class="o">==</span> <span class="n">y</span><span class="p">)</span> <span class="k">for</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="ow">in</span> <span class="n">results</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">total_cost</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">lmbda</span><span class="p">,</span> <span class="n">convert</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Return the total cost for the data set ``data``.  The flag</span>
<span class="sd">        ``convert`` should be set to False if the data set is the</span>
<span class="sd">        training data (the usual case), and to True if the data set is</span>
<span class="sd">        the validation or test data.  See comments on the similar (but</span>
<span class="sd">        reversed) convention for the ``accuracy`` method, above.</span>
<span class="sd">        &quot;&quot;&quot;</span>
        <span class="n">cost</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
            <span class="n">a</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">feedforward</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">convert</span><span class="p">:</span> <span class="n">y</span> <span class="o">=</span> <span class="n">vectorized_result</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
            <span class="n">cost</span> <span class="o">+=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="o">.</span><span class="n">fn</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
        <span class="n">cost</span> <span class="o">+=</span> <span class="mf">0.5</span><span class="o">*</span><span class="p">(</span><span class="n">lmbda</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">))</span><span class="o">*</span><span class="nb">sum</span><span class="p">(</span>
            <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">w</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">cost</span>

    <span class="k">def</span> <span class="nf">save</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">filename</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Save the neural network to the file ``filename``.&quot;&quot;&quot;</span>
        <span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s">&quot;sizes&quot;</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">sizes</span><span class="p">,</span>
                <span class="s">&quot;weights&quot;</span><span class="p">:</span> <span class="p">[</span><span class="n">w</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">],</span>
                <span class="s">&quot;biases&quot;</span><span class="p">:</span> <span class="p">[</span><span class="n">b</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="p">],</span>
                <span class="s">&quot;cost&quot;</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="o">.</span><span class="n">__name__</span><span class="p">)}</span>
        <span class="n">f</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s">&quot;w&quot;</span><span class="p">)</span>
        <span class="n">json</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
        <span class="n">f</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>

<span class="c">#### Loading a Network</span>
<span class="k">def</span> <span class="nf">load</span><span class="p">(</span><span class="n">filename</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Load a neural network from the file ``filename``.  Returns an</span>
<span class="sd">    instance of Network.</span>

<span class="sd">    &quot;&quot;&quot;</span>
    <span class="n">f</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s">&quot;r&quot;</span><span class="p">)</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
    <span class="n">f</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
    <span class="n">cost</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">sys</span><span class="o">.</span><span class="n">modules</span><span class="p">[</span><span class="n">__name__</span><span class="p">],</span> <span class="n">data</span><span class="p">[</span><span class="s">&quot;cost&quot;</span><span class="p">])</span>
    <span class="n">net</span> <span class="o">=</span> <span class="n">Network</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">&quot;sizes&quot;</span><span class="p">],</span> <span class="n">cost</span><span class="o">=</span><span class="n">cost</span><span class="p">)</span>
    <span class="n">net</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">w</span><span class="p">)</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">data</span><span class="p">[</span><span class="s">&quot;weights&quot;</span><span class="p">]]</span>
    <span class="n">net</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="k">for</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">data</span><span class="p">[</span><span class="s">&quot;biases&quot;</span><span class="p">]]</span>
    <span class="k">return</span> <span class="n">net</span>

<span class="c">#### Miscellaneous functions</span>
<span class="k">def</span> <span class="nf">vectorized_result</span><span class="p">(</span><span class="n">j</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Return a 10-dimensional unit vector with a 1.0 in the j&#39;th position</span>
<span class="sd">    and zeroes elsewhere.  This is used to convert a digit (0...9)</span>
<span class="sd">    into a corresponding desired output from the neural network.</span>

<span class="sd">    &quot;&quot;&quot;</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="n">e</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mf">1.0</span>
    <span class="k">return</span> <span class="n">e</span>

<span class="k">def</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;The sigmoid function.&quot;&quot;&quot;</span>
    <span class="k">return</span> <span class="mf">1.0</span><span class="o">/</span><span class="p">(</span><span class="mf">1.0</span><span class="o">+</span><span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">z</span><span class="p">))</span>

<span class="n">sigmoid_vec</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">sigmoid</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">sigmoid_prime</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Derivative of the sigmoid function.&quot;&quot;&quot;</span>
    <span class="k">return</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">z</span><span class="p">))</span>

<span class="n">sigmoid_prime_vec</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">sigmoid_prime</span><span class="p">)</span>
</pre></div>
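</p><p>To see in isolation what the weight update at the end of <tt>update_mini_batch</tt> does, here's a minimal sketch of that single step.  The function name <tt>l2_regularized_step</tt> and the numbers are hypothetical, chosen only for illustration; they are not part of <tt>network2.py</tt>.</p>

```python
import numpy as np

def l2_regularized_step(w, grad_sum, eta, lmbda, n, batch_size):
    """One weight update: shrink w by (1 - eta*lmbda/n), then take the
    usual averaged gradient step.  grad_sum is the sum of per-example
    gradients over the mini-batch (nabla_w in the listing above)."""
    decay = 1.0 - eta * (lmbda / n)
    return decay * w - (eta / batch_size) * grad_sum

# Illustrative values only (not taken from any experiment in the book).
w = np.array([[1.0, -2.0], [0.5, 4.0]])
zero_grad = np.zeros(w.shape)

# With a zero gradient only the decay factor acts: 1 - 0.5*5.0/50000.
decayed = l2_regularized_step(w, zero_grad, eta=0.5, lmbda=5.0,
                              n=50000, batch_size=10)

# With lmbda = 0 the rule reduces to plain mini-batch gradient descent.
plain = l2_regularized_step(w, np.ones(w.shape), eta=0.5, lmbda=0.0,
                            n=50000, batch_size=10)
```

<p>With a zero gradient the weights simply shrink by the factor $1-\eta\lambda/n$, which is the weight decay.  Setting <tt>lmbda</tt> to zero recovers the unregularized update.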
</p><p>One of the more interesting changes in the code is to include L2
regularization.  Although this is a major conceptual change, it's so
trivial to implement that it's easy to miss in the code.  For the most
part it just involves passing the parameter <tt>lmbda</tt> to various
methods, notably the <tt>Network.SGD</tt> method.  The real work is done
in a single line of the program, the fourth-last line of the
<tt>Network.update_mini_batch</tt> method.  That's where we modify the
gradient descent update rule to include weight decay.  But although
the modification is tiny, it has a big impact on results!</p><p>This is, by the way, common when implementing new techniques in neural
networks.  We've spent thousands of words discussing regularization.
It's conceptually quite subtle and difficult to understand.  And yet
it was trivial to add to our program!  Surprisingly often,
sophisticated techniques can be implemented with small changes to
code.</p><p>Another small but important change to our code is the addition of
several optional flags to the stochastic gradient descent method,
<tt>Network.SGD</tt>.  These flags make it possible to monitor the cost
and accuracy either on the <tt>training_data</tt> or on a set of
<tt>evaluation_data</tt> which can be passed to <tt>Network.SGD</tt>.
We've used these flags often earlier in the chapter, but let me give
an example of how they work, just to remind you:</p><p><div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">mnist_loader</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> \
<span class="o">...</span> <span class="n">mnist_loader</span><span class="o">.</span><span class="n">load_data_wrapper</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">network2</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">],</span> <span class="n">cost</span><span class="o">=</span><span class="n">network2</span><span class="o">.</span><span class="n">CrossEntropyCost</span><span class="p">())</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span>
<span class="o">...</span> <span class="n">lmbda</span> <span class="o">=</span> <span class="mf">5.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_evaluation_cost</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_training_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_training_cost</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p><p>Here, we're setting the <tt>evaluation_data</tt> to be the
<tt>validation_data</tt>.  But we could also have monitored performance
on the <tt>test_data</tt> or any other data set.  We also have four
flags telling us to monitor the cost and accuracy on both the
<tt>evaluation_data</tt> and the <tt>training_data</tt>.  Those flags are
<tt>False</tt> by default, but they've been turned on here in order to
monitor our <tt>Network</tt>'s performance.  Furthermore,
<tt>network2.py</tt>'s <tt>Network.SGD</tt> method returns a four-element
tuple representing the results of the monitoring.  We can use this as
follows:</p><p><div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">evaluation_cost</span><span class="p">,</span> <span class="n">evaluation_accuracy</span><span class="p">,</span> \
<span class="o">...</span> <span class="n">training_cost</span><span class="p">,</span> <span class="n">training_accuracy</span> <span class="o">=</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span>
<span class="o">...</span> <span class="n">lmbda</span> <span class="o">=</span> <span class="mf">5.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_evaluation_cost</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_training_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="o">...</span> <span class="n">monitor_training_cost</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
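</p><p>The returned lists are ordinary Python lists with one entry per epoch, so they can be inspected directly.  Here's a minimal sketch, assuming the accuracy lists hold counts of correct results (as returned by <tt>accuracy</tt>) and an evaluation set of 10,000 examples.  The numbers and the helper <tt>best_epoch</tt> are hypothetical placeholders, not output from a real run.</p>

```python
# Hypothetical per-epoch values standing in for two of the lists
# returned by net.SGD; placeholders, not results from a real run.
evaluation_cost = [0.92, 0.85, 0.81, 0.83]
evaluation_accuracy = [9012, 9240, 9388, 9371]  # counts of correct results
n_evaluation = 10000  # assumed size of the evaluation data set

def best_epoch(accuracies):
    """Return (epoch_index, count) for the highest accuracy count."""
    index = max(range(len(accuracies)), key=lambda i: accuracies[i])
    return index, accuracies[index]

epoch, correct = best_epoch(evaluation_accuracy)
fraction = correct / float(n_evaluation)
print("best epoch: %d, accuracy: %.4f" % (epoch, fraction))
```

<p>The same indexing works for any of the four lists.  A list whose monitoring flag was left off is empty, so check its length before calling a helper like this one.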
</p><p>So, for example, <tt>evaluation_cost</tt> will be a 30-element list
containing the cost on the evaluation data at the end of each epoch.
This sort of information is extremely useful in understanding a
network's behaviour.  It can, for example, be used to draw graphs
showing how the network learns over time.  Indeed, that's exactly how
I constructed all the graphs earlier in the chapter.  Note, however,
that if any of the monitoring flags are not set, then the
corresponding element in the tuple will be the empty list.</p><p>Other additions to the code include a <tt>Network.save</tt> method, to
save <tt>Network</tt> objects to disk, and a function to <tt>load</tt>
them back in again later.  Note that the saving and loading is done
using JSON, not Python's <tt>pickle</tt> or <tt>cPickle</tt> modules,
which are the usual way we save and load objects to and from disk in
Python.  Using JSON requires more code than <tt>pickle</tt> or
<tt>cPickle</tt> would.  To understand why I've used JSON, imagine that
at some time in the future we decided to change our <tt>Network</tt>
class to allow neurons other than sigmoid neurons.  To implement that
change we'd most likely change the attributes defined in the
<tt>Network.__init__</tt> method.  If we've simply pickled the objects,
that would cause our <tt>load</tt> function to fail.  Using JSON to do
the serialization explicitly makes it easy to ensure that old
<tt>Network</tt>s will still <tt>load</tt>.</p><p>There are many other minor changes in the code for <tt>network2.py</tt>,
but they're all simple variations on <tt>network.py</tt>.  The net
result is to expand our 74-line program to a far more capable 152
lines.</p><p><h4><a name="problems_236423"></a><a href="#problems_236423">Problems</a></h4><ul>
<li> Modify the code above to implement L1 regularization, and use L1
  regularization to classify MNIST digits using a $30$ hidden neuron
  network.  Can you find a regularization parameter that enables you
  to do better than running unregularized?</p><p><li> Take a look at the <tt>Network.cost_derivative</tt> method in
  <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network.py"><tt>network.py</tt></a>.
  That method was written for the quadratic cost.  How would you
  rewrite the method for the cross-entropy cost?  Can you think of a
  problem that might arise in the cross-entropy version?  In
  <tt>network2.py</tt> we've eliminated the
  <tt>Network.cost_derivative</tt> method entirely, instead
  incorporating its functionality into the
  <tt>CrossEntropyCost.delta</tt> method.  How does this solve the
  problem you've just identified?</p><p></ul></p><p><h3><a name="how_to_choose_a_neural_network's_hyper-parameters"></a><a href="#how_to_choose_a_neural_network's_hyper-parameters">How to choose a neural network's hyper-parameters?</a></h3></p><p>Up until now I haven't explained how I've been choosing values for
hyper-parameters such as the learning rate, $\eta$, the regularization
parameter, $\lambda$, and so on.  I've just been supplying values
which work pretty well.  In practice, when you're using neural nets to
attack a problem, it can be difficult to find good hyper-parameters.
Imagine, for example, that we've just been introduced to the MNIST
problem, and have begun working on it, knowing nothing at all about
what hyper-parameters to use.  Let's suppose that by good fortune in
our first experiments we choose many of the hyper-parameters in the
same way as was done earlier in this chapter: 30 hidden neurons, a
mini-batch size of 10, training for 30 epochs using the cross-entropy cost.
But we choose a learning rate $\eta = 10.0$ and regularization
parameter $\lambda = 1000.0$.  Here's what I saw on one such run:</p><p><div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">mnist_loader</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> \
<span class="o">...</span> <span class="n">mnist_loader</span><span class="o">.</span><span class="n">load_data_wrapper</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">network2</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">,</span> <span class="n">lmbda</span> <span class="o">=</span> <span class="mf">1000.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">Epoch</span> <span class="mi">0</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">1030</span> <span class="o">/</span> <span class="mi">10000</span>

<span class="n">Epoch</span> <span class="mi">1</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">990</span> <span class="o">/</span> <span class="mi">10000</span>

<span class="n">Epoch</span> <span class="mi">2</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">1009</span> <span class="o">/</span> <span class="mi">10000</span>

<span class="o">...</span>

<span class="n">Epoch</span> <span class="mi">27</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">1009</span> <span class="o">/</span> <span class="mi">10000</span>

<span class="n">Epoch</span> <span class="mi">28</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">983</span> <span class="o">/</span> <span class="mi">10000</span>

<span class="n">Epoch</span> <span class="mi">29</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">967</span> <span class="o">/</span> <span class="mi">10000</span>
</pre></div>
</p><p>Our classification accuracies are no better than chance!  Our network
is acting as a random noise generator!</p><p>"Well, that's easy to fix," you might say, "just decrease the
learning rate and regularization hyper-parameters".  Unfortunately,
you don't <em>a priori</em> know those are the hyper-parameters you need
to adjust.  Maybe the real problem is that our 30 hidden neuron
network will never work well, no matter how the other hyper-parameters
are chosen?  Maybe we really need at least 100 hidden neurons?  Or 300
hidden neurons?  Or multiple hidden layers?  Or a different approach
to encoding the output?  Maybe our network is learning, but we need to
train for more epochs?  Maybe the mini-batches are too small?  Maybe
we'd do better switching back to the quadratic cost function?  Maybe
we need to try a different approach to weight initialization?  And so
on, on and on and on.  It's easy to feel lost in hyper-parameter
space.  This can be particularly frustrating if your network is very
large, or uses a lot of training data, since you may train for hours
or days or weeks, only to get no result.  If the situation persists,
it damages your confidence.  Maybe neural networks are the wrong
approach to your problem?  Maybe you should quit your job and take up
beekeeping?</p><p>In this section I explain some heuristics which can be used to set the
hyper-parameters in a neural network.  The goal is to help you develop
a workflow that enables you to do a pretty good job setting
hyper-parameters.  Of course, I won't cover everything about
hyper-parameter optimization.  That's a huge subject, and it's not, in
any case, a problem that is ever completely solved, nor is there
universal agreement amongst practitioners on the right strategies to
use.  There's always one more trick you can try to eke out a bit more
performance from your network.  But the heuristics in this section
should get you started.</p><p><strong>Broad strategy:</strong> When using neural networks to attack a new
problem the first challenge is to get <em>any</em> non-trivial learning,
i.e., for the network to achieve results better than chance.  This can
be surprisingly difficult, especially when confronting a new class of
problem.  Let's look at some strategies you can use if you're having
this kind of trouble.</p><p>Suppose, for example, that you're attacking MNIST for the first time.
You start out enthusiastic, but are a little discouraged when your
first network fails completely, as in the example above.  The way to
go is to strip the problem down.  Get rid of all the training and
validation images except images which are 0s or 1s.  Then try to train
a network to distinguish 0s from 1s.  Not only is that an inherently
easier problem than distinguishing all ten digits, it also reduces the
amount of training data by 80 percent, speeding up training by a
factor of 5.  That enables much more rapid experimentation, and so
gives you more rapid insight into how to build a good network.</p><p>You can further speed up experimentation by stripping your network
down to the simplest network likely to do meaningful learning.  If you
believe a <tt>[784, 10]</tt> network can likely do better-than-chance
classification of MNIST digits, then begin your experimentation with
such a network.  It'll be much faster than training a
<tt>[784, 30, 10]</tt> network, and you can build back up to the latter.</p><p>You can get another speed up in experimentation by increasing the
frequency of monitoring.  In <tt>network2.py</tt> we monitor performance
at the end of each training epoch.  With 50,000 images per epoch, that
means waiting quite a while - about a minute, on my laptop, when
training a <tt>[784, 30, 10]</tt> network - before getting feedback on
how well the network is learning.  Of course, a minute isn't really
very long, but if you want to trial dozens of hyper-parameter choices
it's annoying, and if you want to trial hundreds or thousands of
choices it starts to get debilitating.  We can get feedback more
quickly by monitoring the validation accuracy more often, say, after
every 1,000 training images.  Furthermore, instead of using the full
10,000 image validation set to monitor performance, we can get a much
faster estimate using just 100 validation images.  All that matters is
that the network sees enough images to do real learning, and to get a
pretty good rough estimate of performance. Of course, our program
<tt>network2.py</tt> doesn't currently do this kind of monitoring.  But
as a kludge to achieve a similar effect for the purposes of
illustration, we'll strip down our training data to just the first
1,000 MNIST training images.  Let's try it and see what happens.  (To
keep the code below simple I haven't implemented the idea of using
only 0 and 1 images.  Of course, that can be done with just a little
more work.)</p><p><div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">[:</span><span class="mi">1000</span><span class="p">],</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">,</span> <span class="n">lmbda</span> <span class="o">=</span> <span class="mf">1000.0</span><span class="p">,</span> \
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">[:</span><span class="mi">100</span><span class="p">],</span> \
<span class="o">...</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">Epoch</span> <span class="mi">0</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">10</span> <span class="o">/</span> <span class="mi">100</span>

<span class="n">Epoch</span> <span class="mi">1</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">10</span> <span class="o">/</span> <span class="mi">100</span>

<span class="n">Epoch</span> <span class="mi">2</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">10</span> <span class="o">/</span> <span class="mi">100</span>
<span class="o">...</span>
</pre></div>
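</p><p>For reference, the restriction to 0 and 1 images mentioned earlier might be sketched as follows.  The helpers <tt>label_of</tt> and <tt>keep_digits</tt> are hypothetical names, not part of <tt>network2.py</tt> or <tt>mnist_loader</tt>.  The sketch assumes each data item is an <tt>(x, y)</tt> pair whose label <tt>y</tt> is either a plain digit or a one-hot style vector, and it uses a tiny made-up dataset in place of MNIST:</p>
<div class="highlight"><pre>
```python
def label_of(y):
    """Return the digit encoded by y: a plain integer label, or the
    position of the largest entry of a one-hot style vector."""
    try:
        size = len(y)
    except TypeError:
        return int(y)  # scalar label, nothing to decode
    return max(range(size), key=lambda j: y[j])


def keep_digits(data, digits=(0, 1)):
    """Keep only the (x, y) pairs whose label is one of `digits`."""
    return [(x, y) for x, y in data if label_of(y) in digits]


# Tiny made-up dataset: strings stand in for the 784-pixel images.
toy_training = [("img_a", [1, 0, 0]), ("img_b", [0, 0, 1]), ("img_c", [0, 1, 0])]
toy_validation = [("img_d", 2), ("img_e", 0)]
print(keep_digits(toy_training))    # pairs labelled 0 or 1
print(keep_digits(toy_validation))
```
</pre></div>
<p>With real data one would pass the filtered lists to <tt>net.SGD</tt> in place of <tt>training_data</tt> and <tt>validation_data</tt>.  The run shown above does not use this filter; it still involves all ten digits.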
</p><p>We're still getting pure noise!  But there's a big win: we're now
getting feedback every second, rather than once a minute or so.  That
means you can more quickly experiment with other choices of
hyper-parameter, or even conduct experiments trialling many different
choices of hyper-parameter nearly simultaneously.</p><p>In the above example I left $\lambda$ as $\lambda = 1000.0$, as we
used earlier.  But since we changed the number of training examples we
should really change $\lambda$ to keep the weight decay the same.
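As a quick check of the arithmetic, recall that with L2 regularization each
gradient descent step rescales the weights by the weight decay factor
$1-\frac{\eta \lambda}{n}$, where $n$ is the size of the training set.  With
$\eta$ held fixed, the decay is unchanged when the ratio $\lambda / n$ is
unchanged, so going from $n = 50,000$ to $n = 1,000$ training images gives
$$\lambda_{\rm new} = 1000.0 \times \frac{1,000}{50,000} = 20.0.$$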
That means changing $\lambda$ to $20.0$.  If we do that then this is
what happens:</p><p><div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">[:</span><span class="mi">1000</span><span class="p">],</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">,</span> <span class="n">lmbda</span> <span class="o">=</span> <span class="mf">20.0</span><span class="p">,</span> \
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">[:</span><span class="mi">100</span><span class="p">],</span> \
<span class="o">...</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">Epoch</span> <span class="mi">0</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">12</span> <span class="o">/</span> <span class="mi">100</span>

<span class="n">Epoch</span> <span class="mi">1</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">14</span> <span class="o">/</span> <span class="mi">100</span>

<span class="n">Epoch</span> <span class="mi">2</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">25</span> <span class="o">/</span> <span class="mi">100</span>

<span class="n">Epoch</span> <span class="mi">3</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">18</span> <span class="o">/</span> <span class="mi">100</span>
<span class="o">...</span>
</pre></div>
</p><p>Ahah!  We have a signal.  Not a terribly good signal, but a signal
nonetheless.  That's something we can build on, modifying the
hyper-parameters to try to get further improvement.  Maybe we guess
that our learning rate needs to be higher.  (As you perhaps realize,
that's a silly guess, for reasons we'll discuss shortly, but please
bear with me.)  So to test our guess we try dialling $\eta$ up to
$100.0$:</p><p><div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">[:</span><span class="mi">1000</span><span class="p">],</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">100.0</span><span class="p">,</span> <span class="n">lmbda</span> <span class="o">=</span> <span class="mf">20.0</span><span class="p">,</span> \
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">[:</span><span class="mi">100</span><span class="p">],</span> \
<span class="o">...</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">Epoch</span> <span class="mi">0</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">10</span> <span class="o">/</span> <span class="mi">100</span>

<span class="n">Epoch</span> <span class="mi">1</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">10</span> <span class="o">/</span> <span class="mi">100</span>

<span class="n">Epoch</span> <span class="mi">2</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">10</span> <span class="o">/</span> <span class="mi">100</span>

<span class="n">Epoch</span> <span class="mi">3</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">10</span> <span class="o">/</span> <span class="mi">100</span>

<span class="o">...</span>
</pre></div>
</p><p>That's no good!  It suggests that our guess was wrong, and the problem
wasn't that the learning rate was too low.  So instead we try dialling
$\eta$ down to $\eta = 1.0$:</p><p><div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">[:</span><span class="mi">1000</span><span class="p">],</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="n">lmbda</span> <span class="o">=</span> <span class="mf">20.0</span><span class="p">,</span> \
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">[:</span><span class="mi">100</span><span class="p">],</span> \
<span class="o">...</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">Epoch</span> <span class="mi">0</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">62</span> <span class="o">/</span> <span class="mi">100</span>

<span class="n">Epoch</span> <span class="mi">1</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">42</span> <span class="o">/</span> <span class="mi">100</span>

<span class="n">Epoch</span> <span class="mi">2</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">43</span> <span class="o">/</span> <span class="mi">100</span>

<span class="n">Epoch</span> <span class="mi">3</span> <span class="n">training</span> <span class="n">complete</span>
<span class="n">Accuracy</span> <span class="n">on</span> <span class="n">evaluation</span> <span class="n">data</span><span class="p">:</span> <span class="mi">61</span> <span class="o">/</span> <span class="mi">100</span>

<span class="o">...</span>
</pre></div>
</p><p>That's better!  And so we can continue, individually adjusting each
hyper-parameter, gradually improving performance.  Once we've explored
to find an improved value for $\eta$, then we move on to find a good
value for $\lambda$.  Then experiment with a more complex
architecture, say a network with 10 hidden neurons.  Then adjust the
values for $\eta$ and $\lambda$ again.  Then increase to 20 hidden
neurons.  And then adjust other hyper-parameters some more.  And so
on, at each stage evaluating performance using our held-out validation
data, and using those evaluations to find better and better
hyper-parameters.  As we do so, it typically takes longer to witness
the impact due to modifications of the hyper-parameters, and so we can
gradually decrease the frequency of monitoring.</p><p>This all looks very promising as a broad strategy.  However, I want to
return to that initial stage of finding hyper-parameters that enable a
network to learn anything at all.  In fact, even the above discussion
conveys too positive an outlook.  It can be immensely frustrating to
work with a network that's learning nothing.  You can tweak
hyper-parameters for days, and still get no meaningful response.  And
so I'd like to re-emphasize that during the early stages you should
make sure you can get quick feedback from experiments.  Intuitively,
it may seem as though simplifying the problem and the architecture
will merely slow you down.  In fact, it speeds things up, since you
much more quickly find a network with a meaningful signal.  Once
you've got such a signal, you can often get rapid improvements by
tweaking the hyper-parameters.  As with many things in life, getting
started can be the hardest thing to do.</p><p>Okay, that's the broad strategy.  Let's now look at some specific
recommendations for setting hyper-parameters.  I will focus on the
learning rate, $\eta$, the L2 regularization parameter, $\lambda$, and
the mini-batch size.  However, many of the remarks apply also to other
hyper-parameters, including those associated with network architecture,
other forms of regularization, and some hyper-parameters we'll meet
later in the book, such as the momentum coefficient.</p><p><strong>Learning rate:</strong> Suppose we run three MNIST networks with three
different learning rates, $\eta = 0.025$, $\eta = 0.25$ and $\eta =
2.5$, respectively.  We'll set the other hyper-parameters as for the
experiments in earlier sections, running over 30 epochs, with a
mini-batch size of 10, and with $\lambda = 5.0$.  We'll also return to
using the full $50,000$ training images.  Here's a graph showing the
behaviour of the training cost as we train*<span class="marginnote">
*The graph was
  generated by
  <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/fig/multiple_eta.py">multiple_eta.py</a>.</span>:</p><p><center><img src="images/multiple_eta.png" width="520px"></center></p><p>With $\eta = 0.025$ the cost decreases smoothly until the final epoch.
With $\eta = 0.25$ the cost initially decreases, but after about $20$
epochs it is near saturation, and thereafter most of the changes are
merely small and apparently random oscillations.  Finally, with $\eta
= 2.5$ the cost makes large oscillations right from the start.  To
understand the reason for the oscillations, recall that stochastic
gradient descent is supposed to step us gradually down into a valley
of the cost function,</p><p><center>
<img src="images/tikz33.png"/>
</center></p><p>However, if $\eta$ is too large then the steps will be so large that
they may actually overshoot the minimum, causing the algorithm to
climb up out of the valley instead.  That's likely*<span class="marginnote">
*This
  picture is helpful, but it's intended as an intuition-building
  illustration of what may go on, not as a complete, exhaustive
  explanation.  Briefly, a more complete explanation is as follows:
  gradient descent uses a first-order approximation to the cost
  function as a guide to how to decrease the cost.  For large $\eta$,
  higher-order terms in the cost function become more important, and
  may dominate the behaviour, causing gradient descent to break down.
  This is especially likely as we approach minima and quasi-minima of
  the cost function, since near such points the gradient becomes
  small, making it easier for higher-order terms to dominate
  behaviour.</span>  what's causing the cost to oscillate when $\eta = 2.5$.
When we choose $\eta = 0.25$ the initial steps do take us toward a
minimum of the cost function, and it's only once we get near that
minimum that we start to suffer from the overshooting problem.  And
when we choose $\eta = 0.025$ we don't suffer from this problem at all
during the first $30$ epochs.  Of course, choosing $\eta$ so small
creates another problem, namely, that it slows down stochastic
gradient descent.  An even better approach would be to start with
$\eta = 0.25$, train for $20$ epochs, and then switch to $\eta =
0.025$.  We'll discuss such variable learning rate schedules later.
For now, though, let's stick to figuring out how to find a single good
value for the learning rate, $\eta$.</p><p></p><p>With this picture in mind, we can set $\eta$ as follows.  First, we
estimate the threshold value for $\eta$ at which the cost on the
training data immediately begins decreasing, instead of oscillating or
increasing.  This estimate doesn't need to be too accurate.  You can
estimate the order of magnitude by starting with $\eta = 0.01$.  If
the cost decreases during the first few epochs, then you should
successively try $\eta = 0.1, 1.0, \ldots$ until you find a value for
$\eta$ where the cost oscillates or increases during the first few
epochs.  Alternatively, if the cost oscillates or increases during the
first few epochs when $\eta = 0.01$, then try $\eta = 0.001, 0.0001,
\ldots$ until you find a value for $\eta$ where the cost decreases
during the first few epochs.  Following this procedure will give us an
order of magnitude estimate for the threshold value of $\eta$.  You
may optionally refine your estimate, to pick out the largest value of
$\eta$ at which the cost decreases during the first few epochs, say
$\eta = 0.5$ or $\eta = 0.2$ (there's no need for this to be
super-accurate).  This gives us an estimate for the threshold value of
$\eta$.</p><p></p><p>Obviously, the actual value of $\eta$ that you use should be no larger
than the threshold value.  In fact, if the value of $\eta$ is to
remain useable over many epochs then you likely want to use a value
for $\eta$ that is smaller, say, a factor of two below the threshold.
Such a choice will typically allow you to train for many epochs,
without causing too much of a slowdown in learning.</p><p></p><p>In the case of the MNIST data, following this strategy leads to an
estimate of $0.1$ for the order of magnitude of the threshold value of
$\eta$.  After some more refinement, we obtain a threshold value $\eta
= 0.5$.  Following the prescription above, this suggests using $\eta =
0.25$ as our value for the learning rate.  In fact, I found that using
$\eta = 0.5$ worked well enough over $30$ epochs that for the most
part I didn't worry about using a lower value of $\eta$.</p><p>This all seems quite straightforward.  However, using the training
cost to pick $\eta$ appears to contradict what I said earlier in this
section, namely, that we'd pick hyper-parameters by evaluating
performance using our held-out validation data.  In fact, we'll use
validation accuracy to pick the regularization hyper-parameter, the
mini-batch size, and network parameters such as the number of layers
and hidden neurons, and so on.  Why do things differently for the
learning rate?  Frankly, this choice is my personal aesthetic
preference, and is perhaps somewhat idiosyncratic.  The reasoning is
that the other hyper-parameters are intended to improve the final
classification accuracy on the test set, and so it makes sense to
select them on the basis of validation accuracy.  However, the
learning rate is only incidentally meant to impact the final
classification accuracy.  Its primary purpose is really to control
the step size in gradient descent, and monitoring the training cost is
the best way to detect if the step size is too big.  With that said,
this is a personal aesthetic preference.  Early on during learning the
training cost usually only decreases if the validation accuracy
improves, and so in practice it's unlikely to make much difference
which criterion you use.</p><p><strong>Use early stopping to determine the number of training
  epochs:</strong> As we discussed earlier in the chapter, early stopping means
that at the end of each epoch we should compute the classification
accuracy on the validation data.  When that stops improving,
terminate.  This makes setting the number of epochs very simple.  In
particular, it means that we don't need to worry about explicitly
figuring out how the number of epochs depends on the other
hyper-parameters.  Instead, that's taken care of automatically.
Furthermore, early stopping also automatically prevents us from
overfitting.  This is, of course, a good thing, although in the early
stages of experimentation it can be helpful to turn off early
stopping, so you can see any signs of overfitting, and use it to
inform your approach to regularization.</p><p>To implement early stopping we need to say more precisely what it
means that the classification accuracy has stopped improving.  As
we've seen, the accuracy can jump around quite a bit, even when the
overall trend is to improve.  If we stop the first time the accuracy
decreases then we'll almost certainly stop when there are more
improvements to be had.  A better rule is to terminate if the best
classification accuracy doesn't improve for quite some time.  Suppose,
for example, that we're doing MNIST.  Then we might elect to terminate
if the classification accuracy hasn't improved during the last ten
epochs.  This ensures that we don't stop too soon, in response to bad
luck in training, but also that we're not waiting around forever for
an improvement that never comes.</p><p>This no-improvement-in-ten rule is good for initial exploration of
MNIST.  However, networks can sometimes plateau near a particular
classification accuracy for quite some time, only to then begin
improving again.  If you're trying to get really good performance, the
no-improvement-in-ten rule may be too aggressive about stopping.  In
that case, I suggest using the no-improvement-in-ten rule for initial
experimentation, and gradually adopting more lenient rules, as you
better understand the way your network trains:
no-improvement-in-twenty, no-improvement-in-fifty, and so on.  Of
course, this introduces a new hyper-parameter to optimize!  In
practice, however, it's usually easy to set this hyper-parameter to
get pretty good results.  Similarly, for problems other than MNIST,
the no-improvement-in-ten rule may be much too aggressive or not
nearly aggressive enough, depending on the details of the problem.
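</p><p>The no-improvement-in-$n$ rule can be sketched as a small standalone function.  This is only an illustration, not a modification of <tt>network2.py</tt>: the name <tt>early_stopping_epoch</tt>, the argument <tt>patience</tt> (playing the role of $n$) and the sample accuracies are all assumptions made for the example.</p>

```python
def early_stopping_epoch(accuracies, patience=10):
    """Return the epoch at which a no-improvement-in-`patience` rule
    would stop training, or None if the rule never triggers.

    accuracies[i] is the validation accuracy measured after epoch i.
    """
    best = float("-inf")   # best validation accuracy seen so far
    best_epoch = -1        # epoch at which `best` was reached
    for epoch, accuracy in enumerate(accuracies):
        if accuracy > best:
            best, best_epoch = accuracy, epoch
        elif epoch - best_epoch >= patience:
            # `patience` epochs have gone by without beating `best`
            return epoch
    return None


# Hypothetical validation accuracies, one per epoch.
history = [0.90, 0.92, 0.91, 0.93, 0.93, 0.92, 0.93]
print(early_stopping_epoch(history, patience=3))
```

<p>Note that the single dip at epoch $2$ does not stop training, because the rule tracks the best accuracy so far rather than the most recent change.  The right value of <tt>patience</tt> is itself something to tune.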
However, with a little experimentation it's usually easy to find a
pretty good strategy for early stopping.</p><p>We haven't used early stopping in our MNIST experiments to date.  The
reason is that we've been doing a lot of comparisons between different
approaches to learning.  For such comparisons it's helpful to use the
same number of epochs in each case.  However, it's well worth
modifying <tt>network2.py</tt> to implement early stopping:</p><p><h4><a name="problem_831601"></a><a href="#problem_831601">Problem</a></h4><ul>
<li> Modify <tt>network2.py</tt> so that it implements early stopping
  using a no-improvement-in-$n$ epochs strategy, where $n$ is a
  parameter that can be set.</p><p><li> Can you think of a rule for early stopping <em>other</em> than
  no-improvement-in-$n$?  Ideally, the rule should compromise between
  getting high validation accuracies and not training too long.  Add
  your rule to <tt>network2.py</tt>, and run three experiments comparing
  the validation accuracies and number of epochs of training to
  no-improvement-in-$10$.</p><p></ul></p><p><strong>Learning rate schedule:</strong> We've been holding the learning rate
$\eta$ constant.  However, it's often advantageous to vary the
learning rate.  Early on during the learning process it's likely that
the weights are badly wrong.  And so it's best to use a large learning
rate that causes the weights to change quickly.  Later, we can reduce
the learning rate as we make more fine-tuned adjustments to our
weights.</p><p>How should we set our learning rate schedule?  Many approaches are
possible.  One natural approach is to use the same basic idea as early
stopping.  The idea is to hold the learning rate constant until the
validation accuracy starts to get worse.  Then decrease the learning
rate by some amount, say a factor of two or ten.  We repeat this many
times, until, say, the learning rate is a factor of 1,024 (or 1,000)
times lower than the initial value.  Then we terminate.</p><p></p><p>A variable learning schedule can improve performance, but it also
opens up a world of possible choices for the learning schedule.  Those
choices can be a headache: you can spend forever trying to optimize
your learning schedule.  For first experiments my suggestion is to use
a single, constant value for the learning rate.  That'll get you a
good first approximation.  Later, if you want to obtain the best
performance from your network, it's worth experimenting with a
learning schedule, along the lines I've described*<span class="marginnote">
*A readable
  recent paper which demonstrates the benefits of variable learning
  rates in attacking MNIST is
  <a href="http://arxiv.org/abs/1003.0358">Deep, Big, Simple Neural Nets
    Excel on Handwritten Digit Recognition</a>, by Dan Claudiu
  Cireșan, Ueli Meier, Luca Maria Gambardella, and
  Jürgen Schmidhuber (2010).</span>.</p><p><h4><a name="exercise_336628"></a><a href="#exercise_336628">Exercise</a></h4><ul>
<li> Modify <tt>network2.py</tt> so that it implements a learning
  schedule that: halves the learning rate each time the validation
  accuracy satisfies the no-improvement-in-$10$ rule; and terminates
  when the learning rate has dropped to $1/128$ of its original value.
</ul></p><p><strong>The regularization parameter, $\lambda$:</strong> I suggest starting
initially with no regularization ($\lambda = 0.0$), and determining a
value for $\eta$, as above.  Using that choice of $\eta$, we can then
use the validation data to select a good value for $\lambda$.  Start
by trialling $\lambda = 1.0$*<span class="marginnote">
*I don't have a good principled
  justification for using this as a starting value.  If anyone knows
  of a good principled discussion of where to start with $\lambda$,
  I'd appreciate hearing it (mn@michaelnielsen.org).</span>, and then
increase or decrease by factors of $10$, as needed to improve
performance on the validation data.  Once you've found a good order of
magnitude, you can fine tune your value of $\lambda$.  That done, you
should return and re-optimize $\eta$ again.</p><p><h4><a name="exercise_30392"></a><a href="#exercise_30392">Exercise</a></h4><ul>
<li> It's tempting to use gradient descent to try to learn good
  values for hyper-parameters such as $\lambda$ and $\eta$.  Can you
  think of an obstacle to using gradient descent to determine
  $\lambda$?  Can you think of an obstacle to using gradient
  descent to determine $\eta$?
</ul></p><p><strong>How I selected hyper-parameters earlier in this book:</strong> If you
use the recommendations in this section you'll find that you get
values for $\eta$ and $\lambda$ which don't always exactly match the
values I've used earlier in the book.  The reason is that the book has
narrative constraints that have sometimes made it impractical to
optimize the hyper-parameters.  Think of all the comparisons we've
made of different approaches to learning, e.g., comparing the
quadratic and cross-entropy cost functions, comparing the old and new
methods of weight initialization, running with and without
regularization, and so on.  To make such comparisons meaningful, I've
usually tried to keep hyper-parameters constant across the approaches
being compared (or to scale them in an appropriate way).  Of course,
there's no reason for the same hyper-parameters to be optimal for all
the different approaches to learning, so the hyper-parameters I've
used are something of a compromise.</p><p>As an alternative to this compromise, I could have tried to optimize
the heck out of the hyper-parameters for every single approach to
learning.  In principle that'd be a better, fairer approach, since
then we'd see the best from every approach to learning.  However,
we've made dozens of comparisons along these lines, and in practice I
found it too computationally expensive.  That's why I've adopted the
compromise of using pretty good (but not necessarily optimal) choices
for the hyper-parameters.</p><p><strong>Mini-batch size:</strong> How should we set the mini-batch size?  To
answer this question, let's first suppose that we're doing online
learning, i.e., that we're using a mini-batch size of $1$.</p><p>The obvious worry about online learning is that using mini-batches
which contain just a single training example will cause significant
errors in our estimate of the gradient.  In fact, though, the errors
turn out to not be such a problem.  The reason is that the individual
gradient estimates don't need to be super-accurate.  All we need is an
estimate accurate enough that our cost function tends to keep
decreasing.  It's as though you are trying to get to the North
Magnetic Pole, but have a wonky compass that's 10-20 degrees off each
time you look at it.  Provided you stop to check the compass
frequently, and the compass gets the direction right on average,
you'll end up at the North Magnetic Pole just fine.</p><p>Based on this argument, it sounds as though we should use online
learning.  In fact, the situation turns out to be more complicated
than that.  In a <a href="chap2.html#backprop_over_minibatch">problem
  in the last chapter</a> I pointed out that it's possible to use matrix
techniques to compute the gradient update for <em>all</em> examples in a
mini-batch simultaneously, rather than looping over them.  Depending
on the details of your hardware and linear algebra library this can
make it quite a bit faster to compute the gradient estimate for a
mini-batch of (for example) size $100$, rather than computing the
mini-batch gradient estimate by looping over the $100$ training
examples separately.  It might take (say) only $50$ times as long,
rather than $100$ times as long.</p><p>Now, at first it seems as though this doesn't help us that much.  With
our mini-batch of size $100$ the learning rule for the weights looks
like:
<a class="displaced_anchor" name="eqtn93"></a>\begin{eqnarray}
  w \rightarrow w' = w-\eta \frac{1}{100} \sum_x \nabla C_x,
\tag{93}\end{eqnarray}
where the sum is over training examples in the mini-batch.  This is
versus
<a class="displaced_anchor" name="eqtn94"></a>\begin{eqnarray}
  w \rightarrow w' = w-\eta \nabla C_x
\tag{94}\end{eqnarray}
for online learning.  Even if it only takes $50$ times as long to do
the mini-batch update, it still seems likely to be better to do online
learning, because we'd be updating so much more frequently.  Suppose,
however, that in the mini-batch case we increase the learning rate by
a factor $100$, so the update rule becomes
<a class="displaced_anchor" name="eqtn95"></a>\begin{eqnarray}
  w \rightarrow w' = w-\eta \sum_x \nabla C_x.
\tag{95}\end{eqnarray}
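</p><p>To make the three update rules concrete, here is a minimal sketch on a toy problem.  The one-dimensional per-example cost $C_x(w) = (w-x)^2/2$, the function names and the numerical values are assumptions chosen purely for illustration; nothing here is taken from <tt>network2.py</tt>.</p>

```python
import math


def grad(w, x):
    """Gradient of the toy per-example cost C_x(w) = (w - x)**2 / 2."""
    return w - x


def minibatch_update(w, batch, eta):
    """Rule (93): step along the average gradient over the mini-batch."""
    return w - eta * sum(grad(w, x) for x in batch) / len(batch)


def online_update(w, x, eta):
    """Rule (94): step along the gradient of a single example."""
    return w - eta * grad(w, x)


def summed_update(w, batch, eta):
    """Rule (95): step along the summed gradient over the mini-batch."""
    return w - eta * sum(grad(w, x) for x in batch)


batch = [i / 100 for i in range(100)]   # 100 toy training examples
w, eta = 0.0, 0.001

# Rule (95) with learning rate eta matches rule (93) with 100 * eta.
same = math.isclose(summed_update(w, batch, eta),
                    minibatch_update(w, batch, 100 * eta))
print(same)
```

<p>In other words, rule (95) simply adds up the $100$ per-example steps of size $\eta$.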
That's a lot like doing $100$ separate instances of online learning
with a learning rate of $\eta$.  But it only takes $50$ times as long
as doing a single instance of online learning.  Of course, it's not
truly the same as $100$ instances of online learning, since in the
mini-batch the $\nabla C_x$'s are all evaluated for the same set of
weights, as opposed to the cumulative learning that occurs in the
online case.  Still, it seems distinctly possible that using the
larger mini-batch would speed things up.</p><p>With these factors in mind, choosing the best mini-batch size is a
compromise.  Too small, and you don't get to take full advantage of
the benefits of good matrix libraries optimized for fast hardware.
Too large and you're simply not updating your weights often enough.
What you need is to choose a compromise value which maximizes the
speed of learning.  Fortunately, the choice of mini-batch size at
which the speed is maximized is relatively independent of the other
hyper-parameters (apart from the overall architecture), so you don't
need to have optimized those hyper-parameters in order to find a good
mini-batch size.  The way to go is therefore to use some acceptable
(but not necessarily optimal) values for the other hyper-parameters,
and then trial a number of different mini-batch sizes, scaling $\eta$
as above.  Plot the validation accuracy versus <em>time</em> (as in,
real elapsed time, not epoch!), and choose whichever mini-batch size
gives you the most rapid improvement in performance.  With the
mini-batch size chosen you can then proceed to optimize the other
hyper-parameters.</p><p>Of course, as you've no doubt realized, I haven't done this
optimization in our work.  Indeed, our implementation doesn't use the
faster approach to mini-batch updates at all.  I've simply used a
mini-batch size of $10$ without comment or explanation in nearly all
examples.  Because of this, we could have sped up learning by reducing
the mini-batch size.  I haven't done this, in part because I wanted to
illustrate the use of mini-batches beyond size $1$, and in part
because my preliminary experiments suggested the speedup would be
rather modest.  In practical implementations, however, we would most
certainly implement the faster approach to mini-batch updates, and
then make an effort to optimize the mini-batch size, in order to
maximize our overall speed.</p><p></p><p><strong>Automated techniques:</strong> I've been describing these heuristics
as though you're optimizing your hyper-parameters by hand.
Hand-optimization is a good way to build up a feel for how neural
networks behave.  However, and unsurprisingly, a great deal of work
has been done on automating the process. A common technique is
<em>grid search</em>, which systematically searches through a grid in
hyper-parameter space.  A review of both the achievements and the
limitations of grid search (with suggestions for easily-implemented
alternatives) may be found in a 2012
paper*<span class="marginnote">
*<a href="http://dl.acm.org/citation.cfm?id=2188395">Random
    search for hyper-parameter optimization</a>, by James Bergstra and
  Yoshua Bengio (2012).</span> by James Bergstra and Yoshua Bengio.  Many
more sophisticated approaches have also been proposed.  I won't review
all that work here, but do want to mention a particularly promising
2012 paper which used a Bayesian approach to automatically optimize
hyper-parameters*<span class="marginnote">
*<a href="http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf">Practical
    Bayesian optimization of machine learning algorithms</a>, by Jasper
  Snoek, Hugo Larochelle, and Ryan Adams.</span>.  The code from the paper
is <a href="https://github.com/jaberg/hyperopt">publicly available</a>, and
has been used with some success by other researchers.</p><p><strong>Summing up:</strong> Following the rules-of-thumb I've described won't
give you the absolute best possible results from your neural network.
But it will likely give you a good start and a basis for further
improvements.  In particular, I've discussed the hyper-parameters
largely independently.  In practice, there are relationships between
the hyper-parameters.  You may experiment with $\eta$, feel that
you've got it just right, then start to optimize for $\lambda$, only
to find that it's messing up your optimization for $\eta$.  In
practice, it helps to bounce backward and forward, gradually closing
in on good values.  Above all, keep in mind that the heuristics I've
described are rules of thumb, not rules cast in stone.  You should be
on the lookout for signs that things aren't working, and be willing to
experiment.  In particular, this means carefully monitoring your
network's behaviour, especially the validation accuracy.</p><p>The difficulty of choosing hyper-parameters is exacerbated by the fact
that the lore about how to choose hyper-parameters is widely spread,
across many research papers and software programs, and often is only
available inside the heads of individual practitioners.  There are
many, many papers setting out (sometimes contradictory)
recommendations for how to proceed.  However, there are a few
particularly useful papers that synthesize and distill out much of
this lore.  Yoshua Bengio has a 2012
paper*<span class="marginnote">
*<a href="http://arxiv.org/abs/1206.5533">Practical
    recommendations for gradient-based training of deep
    architectures</a>, by Yoshua Bengio (2012).</span> that gives some
practical recommendations for using backpropagation and gradient
descent to train neural networks, including deep neural nets.  Bengio
discusses many issues in much more detail than I have, including how
to do more systematic hyper-parameter searches.  Another good paper is
a 1998
paper*<span class="marginnote">
*<a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">Efficient
    BackProp</a>, by Yann LeCun, Léon Bottou,
  Genevieve Orr and Klaus-Robert Müller (1998)</span> by
Yann LeCun, Léon Bottou, Genevieve Orr and
Klaus-Robert Müller.  Both these papers appear in
an extremely useful 2012 book that collects many tricks commonly used
in neural
nets*<span class="marginnote">
*<a href="http://www.springer.com/computer/theoretical+computer+science/book/978-3-642-35288-1">Neural
    Networks: Tricks of the Trade</a>, edited by
  Grégoire Montavon, Geneviève Orr, and Klaus-Robert
    Müller.</span>.  The book is expensive, but many of the articles have
been placed online by their respective authors with, one presumes, the
blessing of the publisher, and may be located using a search engine.</p><p>One thing that becomes clear as you read these articles and,
especially, as you engage in your own experiments, is that
hyper-parameter optimization is not a problem that is ever completely
solved.  There's always another trick you can try to improve
performance.  There is a saying common among writers that books are
never finished, only abandoned.  The same is also true of neural
network optimization: the space of hyper-parameters is so large that
one never really finishes optimizing, one only abandons the network to
posterity.  So your goal should be to develop a workflow that enables
you to quickly do a pretty good job on the optimization, while leaving
you the flexibility to try more detailed optimizations, if that's
important.</p><p>The challenge of setting hyper-parameters has led some people to
complain that neural networks require a lot of work when compared with
other machine learning techniques.  I've heard many variations on the
following complaint: "Yes, a well-tuned neural network may get the
best performance on the problem.  On the other hand, I can try a
random forest [or SVM or$\ldots$ insert your own favourite technique]
and it just works.  I don't have time to figure out just the right
neural network."  Of course, from a practical point of view it's good
to have easy-to-apply techniques.  This is particularly true when
you're just getting started on a problem, and it may not be obvious
whether machine learning can help solve the problem at all.  On the
other hand, if getting optimal performance is important, then you may
need to try approaches that require more specialist knowledge.  While
it would be nice if machine learning were always easy, there is no
  <em>a priori</em> reason it should be trivially simple.</p><p>

<!--  <h3><a name="other_techniques"></a><a href="#other_techniques">Other techniques</a></h3>-->
  <h3><a name="other_techniques"></a><a href="#other_techniques">Other techniques</a></h3>

<!--
</p><p>Each technique developed in this chapter is valuable to know in its
own right, but that's not the only reason I've explained them.  The
larger point is to familiarize you with some of the problems which can
occur in neural networks, and with a style of analysis which can help
overcome those problems.  In a sense, we've been learning how to think
about neural nets.  Over the remainder of this chapter I briefly
sketch a handful of other techniques.  These sketches are less
in-depth than the earlier discussions, but should convey some feeling
  for the diversity of techniques available for use in neural networks.</p>
-->

</p><p>
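Each technique developed in this chapter is valuable to know in its
own right, but that's not the only reason I've explained them.  The
larger point is to familiarize you with some of the problems which can
occur in neural networks, and with a style of analysis which can help
overcome those problems.  In a sense, we've been learning how to think
about neural nets.  Over the remainder of this chapter I briefly
sketch a handful of other techniques.  These sketches are less
in-depth than the earlier discussions, but should convey some feeling
for the diversity of techniques available for use in neural networks.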
</p>


<p><h4><a name="variations_on_stochastic_gradient_descent"></a><a href="#variations_on_stochastic_gradient_descent">Variations on stochastic gradient descent</a></h4></p><p>Stochastic gradient descent by backpropagation has served us well in
attacking the MNIST digit classification problem.  However, there are
many other approaches to optimizing the cost function, and sometimes
those other approaches offer performance superior to mini-batch
stochastic gradient descent.  In this section I sketch two such
approaches, the Hessian and momentum techniques.</p><p><strong>Hessian technique:</strong> To begin our discussion it helps to put
neural networks aside for a bit.  Instead, we're just going to
consider the abstract problem of minimizing a cost function $C$ which
is a function of many variables, $w = w_1, w_2, \ldots$, so $C =
C(w)$.  By Taylor's theorem, the cost function can be approximated
near a point $w$ by
<a class="displaced_anchor" name="eqtn96"></a>\begin{eqnarray}
  C(w+\Delta w) & = & C(w) + \sum_j \frac{\partial C}{\partial w_j} \Delta w_j
  \nonumber \\ & & + \frac{1}{2} \sum_{jk} \Delta w_j \frac{\partial^2 C}{\partial w_j
    \partial w_k} \Delta w_k + \ldots
\tag{96}\end{eqnarray}
We can rewrite this more compactly as
<a class="displaced_anchor" name="eqtn97"></a>\begin{eqnarray}
  C(w+\Delta w) = C(w) + \nabla C \cdot \Delta w +
  \frac{1}{2} \Delta w^T H \Delta w + \ldots,
\tag{97}\end{eqnarray}
where $\nabla C$ is the usual gradient vector, and $H$ is a matrix
known as the <em>Hessian matrix</em>, whose $jk$th entry is $\partial^2
C / \partial w_j \partial w_k$.  Suppose we approximate $C$ by
discarding the higher-order terms represented by $\ldots$ above,
<a class="displaced_anchor" name="eqtn98"></a>\begin{eqnarray}
  C(w+\Delta w) \approx C(w) + \nabla C \cdot \Delta w +
  \frac{1}{2} \Delta w^T H \Delta w.
\tag{98}\end{eqnarray}
Using calculus we can show that the expression on the right-hand side
can be minimized*<span class="marginnote">
*Strictly speaking, for this to be a minimum,
  and not merely an extremum, we need to assume that the Hessian
  matrix is positive definite.  Intuitively, this means that the
  function $C$ looks like a valley locally, not a mountain or a
  saddle.</span> by choosing
<a class="displaced_anchor" name="eqtn99"></a>\begin{eqnarray}
  \Delta w = -H^{-1} \nabla C.
\tag{99}\end{eqnarray}
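Here is a brief sketch of that calculation.  Write the right-hand side
of (98) as a function of the step, $f(\Delta w) = C(w) + \nabla C \cdot
\Delta w + \frac{1}{2} \Delta w^T H \Delta w$, and assume that the
second partial derivatives of $C$ are continuous, so that $H$ is
symmetric.  Differentiating with respect to $\Delta w_j$ gives
\begin{eqnarray}
  \frac{\partial f}{\partial \Delta w_j} = \frac{\partial C}{\partial w_j}
  + \sum_k \frac{\partial^2 C}{\partial w_j \partial w_k} \Delta w_k,
  \nonumber\end{eqnarray}
which in vector form is $\nabla C + H \Delta w$.  Setting this to zero
gives $H \Delta w = -\nabla C$, and, provided $H$ is invertible,
multiplying both sides by $H^{-1}$ recovers (99).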
Provided <span id="margin_341200104540_reveal" class="equation_link">(98)</span><span id="margin_341200104540" class="marginequation" style="display: none;"><a href="chap3.html#eqtn98" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  C(w+\Delta w) \approx C(w) + \nabla C \cdot \Delta w +
  \frac{1}{2} \Delta w^T H \Delta w \nonumber\end{eqnarray}</a></span><script>$('#margin_341200104540_reveal').click(function() {$('#margin_341200104540').toggle('slow', function() {});});</script> is a good approximate expression for the
cost function, then we'd expect that moving from the point $w$ to
$w+\Delta w = w-H^{-1} \nabla C$ should significantly decrease the
cost function.  That suggests a possible algorithm for minimizing the
cost:</p><p><ul>
<li> Choose a starting point, $w$.</p><p><li> Update $w$ to a new point $w' = w-H^{-1} \nabla C$, where the
  Hessian $H$ and $\nabla C$ are computed at $w$.</p><p><li> Update $w'$ to a new point $w{'}{'} = w'-H'^{-1} \nabla' C$,
  where the Hessian $H'$ and $\nabla' C$ are computed at $w'$.</p><p><li> $\ldots$
</ul>
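</p><p>The steps above can be sketched in a few lines of Python.  This is a minimal illustration for a cost with just two variables, so that the linear system can be solved by hand; the helper names <tt>solve_2x2</tt> and <tt>hessian_steps</tt> and the quadratic toy cost are assumptions made for the example, not part of the book's code.</p>

```python
def solve_2x2(H, g):
    """Solve H x = g for a 2x2 matrix H using Cramer's rule."""
    (a, b), (c, d) = H
    det = a * d - b * c
    if det == 0:
        raise ValueError("Hessian is singular; the step is undefined")
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]


def hessian_steps(grad, hess, w, steps):
    """Repeat the update w -> w - H^{-1} grad C, evaluating the Hessian
    and the gradient at the current point each time."""
    for _ in range(steps):
        # Solving H x = grad C gives x = H^{-1} grad C without inverting H.
        step = solve_2x2(hess(w), grad(w))
        w = [w_j - s_j for w_j, s_j in zip(w, step)]
    return w


# Toy cost (an assumption): C(w) = 2 w1^2 + w1 w2 + w2^2 - 3 w1,
# whose minimum is at w = (6/7, -3/7).
def grad(w):
    return [4.0 * w[0] + w[1] - 3.0, w[0] + 2.0 * w[1]]


def hess(w):
    return [[4.0, 1.0], [1.0, 2.0]]


w_min = hessian_steps(grad, hess, [5.0, -4.0], steps=1)
print(w_min)
```

<p>For a quadratic cost like this one the approximation (98) is exact, which is why a single step lands on the minimum.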
In practice, <span id="margin_681684129137_reveal" class="equation_link">(98)</span><span id="margin_681684129137" class="marginequation" style="display: none;"><a href="chap3.html#eqtn98" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  C(w+\Delta w) \approx C(w) + \nabla C \cdot \Delta w +
  \frac{1}{2} \Delta w^T H \Delta w \nonumber\end{eqnarray}</a></span><script>$('#margin_681684129137_reveal').click(function() {$('#margin_681684129137').toggle('slow', function() {});});</script> is only an approximation, and it's
better to take smaller steps.  We do this by repeatedly changing $w$
by an amount $\Delta w = -\eta H^{-1} \nabla C$, where $\eta$ is known
as the <em>learning rate</em>.</p><p>This approach to minimizing a cost function is known as the
<em>Hessian technique</em> or <em>Hessian optimization</em>.  There are
theoretical and empirical results showing that Hessian methods
converge on a minimum in fewer steps than standard gradient descent.
In particular, by incorporating information about second-order changes
in the cost function it's possible for the Hessian approach to avoid
many pathologies that can occur in gradient descent.  Furthermore,
there are versions of the backpropagation algorithm which can be used
to compute the Hessian.</p><p>If Hessian optimization is so great, why aren't we using it in our
neural networks?  Unfortunately, while it has many desirable
properties, it has one very undesirable property: the Hessian matrix
is very big.  Suppose you have a neural network with $10^7$ weights
and biases.  Then the corresponding Hessian matrix will contain $10^7
\times 10^7 = 10^{14}$ entries.  That's a lot of entries to compute,
especially when you're going to need to invert the matrix as well!
That makes Hessian optimization difficult to apply in practice.
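</p><p>A quick back-of-the-envelope calculation shows the scale of the problem.  The figure of $4$ bytes per entry is an assumption (32-bit floating point numbers); the parameter count is the one used in the text.</p>

```python
n_params = 10**7                  # weights and biases, as in the text
hessian_entries = n_params ** 2   # the Hessian is n_params x n_params
bytes_per_entry = 4               # assumption: 32-bit floats
hessian_bytes = hessian_entries * bytes_per_entry

print(f"{hessian_entries:.0e} entries in the Hessian")
print(f"{hessian_bytes / 10**12:.0f} TB to store the Hessian")
print(f"{n_params * bytes_per_entry / 10**6:.0f} MB to store the gradient")
```

<p>Under this assumption, storing the matrix alone would take roughly $400$ terabytes, compared with about $40$ megabytes for the gradient, and that is before any work is spent inverting it.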
However, that doesn't mean that it's not useful to understand.  In
fact, there are many variations on gradient descent which are inspired
by Hessian optimization, but which avoid the problem with overly-large
matrices.  Let's take a look at one such technique, momentum-based
gradient descent.</p><p><a name="momentum"></a></p><p><strong>Momentum-based gradient descent:</strong> Intuitively, the advantage
Hessian optimization has is that it incorporates not just information
about the gradient, but also information about how the gradient is
changing.  Momentum-based gradient descent is based on a similar
intuition, but avoids large matrices of second derivatives.  To
understand the momentum technique, think back to our
<a href="chap1.html#gradient_descent">original picture</a> of gradient
descent, in which we considered a ball rolling down into a valley.  At
the time, we observed that gradient descent is, despite its name, only
loosely similar to a ball falling to the bottom of a valley.  The
momentum technique modifies gradient descent in two ways that make it
more similar to the physical picture.  First, it introduces a notion
of "velocity" for the parameters we're trying to optimize.  The
gradient acts to change the velocity, not (directly) the "position",
in much the same way as physical forces change the velocity, and only
indirectly affect position.  Second, the momentum method introduces a
kind of friction term, which tends to gradually reduce the velocity.</p><p>Let's give a more precise mathematical description.  We introduce
velocity variables $v = v_1, v_2, \ldots$, one for each corresponding
$w_j$ variable*<span class="marginnote">
*In a neural net the $w_j$ variables would, of
  course, include all weights and biases.</span>. Then we replace the
gradient descent update rule $w \rightarrow w'= w-\eta \nabla C$ by
<a class="displaced_anchor" name="eqtn100"></a><a class="displaced_anchor" name="eqtn101"></a>\begin{eqnarray}
  v & \rightarrow  & v' = \mu v - \eta \nabla C \tag{100}\\

  w & \rightarrow & w' = w+v'.
\tag{101}\end{eqnarray}
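</p><p>As a rough sketch, the two update rules translate almost directly into code.  The function name <tt>momentum_step</tt>, the toy cost $C(w) = (w_1^2 + 10 w_2^2)/2$ and the values $\eta = 0.05$, $\mu = 0.8$ are assumptions chosen for illustration; this is not the modification of <tt>network2.py</tt>.</p>

```python
def momentum_step(w, v, grad_C, eta, mu):
    """Apply one update of Equations (100) and (101).

    w, v and grad_C are lists of equal length; returns the new (w, v).
    """
    v_new = [mu * v_j - eta * g_j for v_j, g_j in zip(v, grad_C)]   # (100)
    w_new = [w_j + v_j for w_j, v_j in zip(w, v_new)]               # (101)
    return w_new, v_new


def grad(w):
    """Gradient of the toy cost C(w) = (w_1^2 + 10 w_2^2) / 2."""
    return [w[0], 10.0 * w[1]]


w, v = [1.0, 1.0], [0.0, 0.0]     # start at rest, away from the minimum at 0
for _ in range(200):
    w, v = momentum_step(w, v, grad(w), eta=0.05, mu=0.8)
print(w)                          # both components end up very close to 0
```

<p>The two commented lines correspond term by term to Equations (100) and (101).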
In these equations, $\mu$ is a hyper-parameter which controls the
amount of damping or friction in the system.  To understand the
meaning of the equations it's helpful to first consider the case where
$\mu = 1$, which corresponds to no friction.  When that's the case,
inspection of the equations shows that the "force" $\nabla C$ is now
modifying the velocity, $v$, and the velocity is controlling the rate
of change of $w$.  Intuitively, we build up the velocity by repeatedly
adding gradient terms to it.  That means that if the gradient is in
(roughly) the same direction through several rounds of learning, we
can build up quite a bit of steam moving in that direction.  Think,
for example, of what happens if we're moving straight down a slope:</p><p><center>
<img src="images/tikz34.png"/>
</center></p><p>With each step the velocity gets larger down the slope, so we move
more and more quickly to the bottom of the valley.  This can enable
the momentum technique to work much faster than standard gradient
descent.  Of course, a problem is that once we reach the bottom of the
valley we will overshoot.  Or, if the gradient should change rapidly,
then we could find ourselves moving in the wrong direction.  That's
the reason for the $\mu$ hyper-parameter in <span id="margin_821276191177_reveal" class="equation_link">(100)</span><span id="margin_821276191177" class="marginequation" style="display: none;"><a href="chap3.html#eqtn100" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  v & \rightarrow  & v' = \mu v - \eta \nabla C  \nonumber\end{eqnarray}</a></span><script>$('#margin_821276191177_reveal').click(function() {$('#margin_821276191177').toggle('slow', function() {});});</script>.  I
said earlier that $\mu$ controls the amount of friction in the system;
to be a little more precise, you should think of $1-\mu$ as the amount
of friction in the system.  When $\mu = 1$, as we've seen, there is no
friction, and the velocity is completely driven by the gradient
$\nabla C$.  By contrast, when $\mu = 0$ there's a lot of friction,
the velocity can't build up, and Equations <span id="margin_1375636495_reveal" class="equation_link">(100)</span><span id="margin_1375636495" class="marginequation" style="display: none;"><a href="chap3.html#eqtn100" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  v & \rightarrow  & v' = \mu v - \eta \nabla C  \nonumber\end{eqnarray}</a></span><script>$('#margin_1375636495_reveal').click(function() {$('#margin_1375636495').toggle('slow', function() {});});</script>
and <span id="margin_834321992917_reveal" class="equation_link">(101)</span><span id="margin_834321992917" class="marginequation" style="display: none;"><a href="chap3.html#eqtn101" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  w & \rightarrow & w' = w+v' \nonumber\end{eqnarray}</a></span><script>$('#margin_834321992917_reveal').click(function() {$('#margin_834321992917').toggle('slow', function() {});});</script> reduce to the usual equation for gradient
descent, $w \rightarrow w'=w-\eta \nabla C$.  In practice, using a
value of $\mu$ intermediate between $0$ and $1$ can give us much of the
benefit of being able to build up speed, but without causing
overshooting.  We can choose such a value for $\mu$ using the held-out
validation data, in much the same way as we select $\eta$ and
$\lambda$.</p><p></p><p></p><p>I've avoided naming the hyper-parameter $\mu$ up to now.  The reason
is that the standard name for $\mu$ is badly chosen: it's called the
<em>momentum co-efficient</em>.  This is potentially confusing, since
$\mu$ is not at all the same as the notion of momentum from
physics. Rather, it is much more closely related to friction.
However, the term momentum co-efficient is widely used, so we will
continue to use it.</p><p>A nice thing about the momentum technique is that it takes almost no
work to modify an implementation of gradient descent to incorporate
momentum.  We can still use backpropagation to compute the gradients,
just as before, and use ideas such as sampling stochastically chosen
mini-batches.  In this way, we can get some of the advantages of the
Hessian technique, using information about how the gradient is
changing.  But it's done without the disadvantages, and with only
minor modifications to our code.  In practice, the momentum technique
is commonly used, and often speeds up learning.</p><p><h4><a name="exercise_603875"></a><a href="#exercise_603875">Exercise</a></h4><ul>
<li> What would go wrong if we used $\mu > 1$ in the momentum
  technique?</p><p><li> What would go wrong if we used $\mu < 0$ in the momentum
  technique?
</ul></p><p><h4><a name="problem_713937"></a><a href="#problem_713937">Problem</a></h4><ul>
<li> Add momentum-based stochastic gradient descent to
  <tt>network2.py</tt>.
</ul></p><p><strong>Other approaches to minimizing the cost function:</strong> Many other
approaches to minimizing the cost function have been developed, and
there isn't universal agreement on which is the best approach.  As
you go deeper into neural networks it's worth digging into the other
techniques, understanding how they work, their strengths and
weaknesses, and how to apply them in practice.  A paper I mentioned
earlier*<span class="marginnote">
*<a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">Efficient
    BackProp</a>, by Yann LeCun, Léon Bottou,
  Genevieve Orr and Klaus-Robert Müller (1998).</span>
introduces and compares several of these techniques, including
conjugate gradient descent and the BFGS method (see also the closely
related limited-memory BFGS method, known as
<a href="http://en.wikipedia.org/wiki/Limited-memory_BFGS">L-BFGS</a>).
Another technique which has recently shown promising
results*<span class="marginnote">
*See, for example,
  <a href="http://www.cs.toronto.edu/&#126;hinton/absps/momentum.pdf">On the
    importance of initialization and momentum in deep learning</a>, by
  Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton
  (2012).</span> is Nesterov's accelerated gradient technique, which
improves on the momentum technique.  However, for many problems, plain
stochastic gradient descent works well, especially if momentum is
used, and so we'll stick to stochastic gradient descent through the
remainder of this book.</p><p><h4><a name="other_models_of_artificial_neuron"></a><a href="#other_models_of_artificial_neuron">Other models of artificial neuron</a></h4></p><p>Up to now we've built our neural networks using sigmoid neurons.  In
principle, a network built from sigmoid neurons can compute any
function.  In practice, however, networks built using other model
neurons sometimes outperform sigmoid networks.  Depending on the
application, networks based on such alternate models may learn faster,
generalize better to test data, or perhaps do both.  Let me mention a
couple of alternate model neurons, to give you the flavour of some
variations in common use.</p><p>Perhaps the simplest variation is the tanh (pronounced "tanch")
neuron, which replaces the sigmoid function by the hyperbolic tangent
function.  The output of a tanh neuron with input $x$, weight vector
$w$, and bias $b$ is given by
<a class="displaced_anchor" name="eqtn102"></a>\begin{eqnarray}
 \tanh(w \cdot x+b),
\tag{102}\end{eqnarray}
where $\tanh$ is, of course, the hyperbolic tangent function.  It
turns out that this is very closely related to the sigmoid neuron.  To
see this, recall that the $\tanh$ function is defined by
<a class="displaced_anchor" name="eqtn103"></a>\begin{eqnarray}
  \tanh(z) \equiv \frac{e^z-e^{-z}}{e^z+e^{-z}}.
\tag{103}\end{eqnarray}
With a little algebra it can easily be verified that
<a class="displaced_anchor" name="eqtn104"></a>\begin{eqnarray}
  \sigma(z) = \frac{1+\tanh(z/2)}{2},
\tag{104}\end{eqnarray}
that is, $\tanh$ is just a shifted and rescaled version of the sigmoid function: $\tanh(z) = 2\sigma(2z)-1$.
We can also see graphically that the $\tanh$ function has the same
shape as the sigmoid function,</p><p>
<div id="tanh"></div>
<script>
function s(x) {return (Math.exp(x)-Math.exp(-x))/(Math.exp(x)+Math.exp(-x));}
var m = [40, 120, 50, 120];
var height = 290 - m[0] - m[2];
var width = 600 - m[1] - m[3];
var xmin = -5;
var xmax = 5;
var sample = 400;
var x1 = d3.scale.linear().domain([0, sample]).range([xmin, xmax]);
var data = d3.range(sample + 1).map(function(d){ return { // sample + 1 points, so the curve reaches xmax
        x: x1(d),
        y: s(x1(d))};
    });
var x = d3.scale.linear().domain([xmin, xmax]).range([0, width]);
var y = d3.scale.linear()
                .domain([-1,1])
                .range([height, 0]);
var line = d3.svg.line()
    .x(function(d) { return x(d.x); })
    .y(function(d) { return y(d.y); })
var graph = d3.select("#tanh")
    .append("svg")
    .attr("width", width + m[1] + m[3])
    .attr("height", height + m[0] + m[2])
    .append("g")
    .attr("transform", "translate(" + m[3] + "," + m[0] + ")");
var xAxis = d3.svg.axis()
                  .scale(x)
                  .tickValues(d3.range(-4, 5, 1))
                  .orient("bottom")
graph.append("g")
    .attr("class", "x axis")
    .attr("transform", "translate(0, " + height/2 + ")")
    .call(xAxis);
var yAxis = d3.svg.axis()
                  .scale(y)
                  .tickValues(d3.range(-1, 1.01, 0.5))
                  .orient("left")
graph.append("g")
    .attr("class", "y axis")
    .call(yAxis);
graph.append("path").attr("d", line(data));
graph.append("text")
     .attr("class", "x label")
     .attr("text-anchor", "end")
     .attr("x", width+20)
     .attr("y", height/2+8)
     .text("z");
graph.append("text")
        .attr("x", (width / 2))
        .attr("y", -10)
        .attr("text-anchor", "middle")
        .style("font-size", "16px")
        .text("tanh function");
</script>
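Equation (104) is also easy to check numerically.  The following is a minimal Python sketch, not code from <tt>network2.py</tt>; the function names <tt>sigmoid</tt> and <tt>sigmoid_via_tanh</tt> and the sample points are assumptions chosen purely for illustration.  It compares the two sides of the identity and prints the sigmoid and tanh outputs side by side.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_via_tanh(z):
    """Right-hand side of Equation (104): (1 + tanh(z/2)) / 2."""
    return (1.0 + math.tanh(z / 2.0)) / 2.0

# Sample points picked arbitrarily for illustration (an assumption).
zs = [-5.0, -1.0, 0.0, 0.5, 3.0]

# Largest disagreement between the two sides of the identity.
max_gap = max(abs(sigmoid(z) - sigmoid_via_tanh(z)) for z in zs)
print("largest difference:", max_gap)

# Print z, sigmoid(z) and tanh(z) so the two output ranges can be compared.
for z in zs:
    print(z, round(sigmoid(z), 4), round(math.tanh(z), 4))
```

A numerical check like this is not a proof, but it's a handy sanity test.  The printed values also show the sigmoid outputs staying between 0 and 1, while the tanh outputs range between -1 and 1.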
</p><p>One difference between tanh neurons and sigmoid neurons is that the
output from tanh neurons ranges from -1 to 1, not 0 to 1.  This means
that if you're going to build a network based on tanh neurons you may
need to normalize your outputs (and, depending on the details of the
application, possibly your inputs) a little differently than in
sigmoid networks.</p><p>Similar to sigmoid neurons, a network of tanh neurons can, in
principle, compute any function*<span class="marginnote">
*There are some technical
  caveats to this statement for both tanh and sigmoid neurons, as well
  as for the rectified linear neurons discussed below.  However,
  informally it's usually fine to think of neural networks as being
  able to approximate any function to arbitrary accuracy.</span> mapping
inputs to the range -1 to 1.  Furthermore, ideas such as
backpropagation and stochastic gradient descent are as easily applied
to a network of tanh neurons as to a network of sigmoid neurons.</p><p><h4><a name="exercise_935719"></a><a href="#exercise_935719">Exercise</a></h4><ul>
<li> Prove the identity in Equation <span id="margin_620778500926_reveal" class="equation_link">(104)</span><span id="margin_620778500926" class="marginequation" style="display: none;"><a href="chap3.html#eqtn104" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \sigma(z) = \frac{1+\tanh(z/2)}{2} \nonumber\end{eqnarray}</a></span><script>$('#margin_620778500926_reveal').click(function() {$('#margin_620778500926').toggle('slow', function() {});});</script>.
</ul></p><p>Which type of neuron should you use in your networks, the tanh or
sigmoid?  <em>A priori</em> the answer is not obvious, to put it mildly!
However, there are theoretical arguments and some empirical evidence
to suggest that the tanh sometimes performs better*<span class="marginnote">
*See, for
  example,
  <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">Efficient
    BackProp</a>, by Yann LeCun, Léon Bottou,
  Genevieve Orr and Klaus-Robert Müller (1998), and
  <a href="http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf">Understanding
    the difficulty of training deep feedforward networks</a>, by Xavier
  Glorot and Yoshua Bengio (2010).</span>.  Let me briefly give you the
flavour of one of the theoretical arguments for tanh neurons.  Suppose
we're using sigmoid neurons, so all activations in our network are
positive.  Let's consider the weights $w^{l+1}_{jk}$ input to the
$j$th neuron in the $(l+1)$th layer.  The rules for backpropagation (see
<a href="chap2.html#eqtnBP4">here</a>) tell us that the associated gradient
$\partial C / \partial w^{l+1}_{jk}$ will be $a^l_k \delta^{l+1}_j$.  Because the activations are positive
the sign of this gradient will be the same as the sign of
$\delta^{l+1}_j$.  What this means is that if $\delta^{l+1}_j$ is
positive then <em>all</em> the weights $w^{l+1}_{jk}$ will decrease
during gradient descent, while if $\delta^{l+1}_j$ is negative then
<em>all</em> the weights $w^{l+1}_{jk}$ will increase during gradient
descent.  In other words, all weights to the same neuron must either
increase together or decrease together.  That's a problem, since some
of the weights may need to increase while others need to decrease.
That can only happen if some of the input activations have different
signs.  That suggests replacing the sigmoid by an activation function,
such as $\tanh$, which allows both positive and negative activations.
Indeed, because $\tanh$ is symmetric about zero, $\tanh(-z) =
-\tanh(z)$, we might even expect that, roughly speaking, the
activations in hidden layers would be equally balanced between
positive and negative.  That would help ensure that there is no
systematic bias for the weight updates to be one way or the other.</p><p>How seriously should we take this argument?  While the argument is
suggestive, it's a heuristic, not a rigorous proof that tanh neurons
outperform sigmoid neurons.  Perhaps there are other properties of the
sigmoid neuron which compensate for this problem?  Indeed, for many
tasks the tanh is found empirically to provide only a small or no
improvement in performance over sigmoid neurons.  Unfortunately, we
don't yet have hard-and-fast rules to know which neuron types will
learn fastest, or give the best generalization performance, for any
particular application.</p><p>Another variation on the sigmoid neuron is the <em>rectified linear
  neuron</em> or <em>rectified linear unit</em>.  The output of a rectified
linear unit with input $x$, weight vector $w$, and bias $b$ is given
by
<a class="displaced_anchor" name="eqtn105"></a>\begin{eqnarray}
  \max(0, w \cdot x+b).
\tag{105}\end{eqnarray}
Graphically, the rectifying function $\max(0, z)$ looks like this:</p><p>
<div id="relu"></div>
<script type="text/javascript">
function s(x) {return Math.max(0, x);}
var m = [40, 120, 50, 120];
var height = 340 - m[0] - m[2];
var width = 550 - m[1] - m[3];
var xmin = -5;
var xmax = 5;
var sample = 100;
var x1 = d3.scale.linear().domain([0, sample]).range([xmin, xmax]);
var data = d3.range(sample + 1).map(function(d){ return { // sample + 1 points, so the curve reaches xmax
        x: x1(d),
        y: s(x1(d))};
    });
var x = d3.scale.linear().domain([xmin, xmax]).range([0, width]);
var y = d3.scale.linear()
                .domain([-5, 5])
                .range([height, 0]);
var line = d3.svg.line()
    .x(function(d) { return x(d.x); })
    .y(function(d) { return y(d.y); })
var graph = d3.select("#relu")
    .append("svg")
    .attr("width", width + m[1] + m[3])
    .attr("height", height + m[0] + m[2])
    .append("g")
    .attr("transform", "translate(" + m[3] + "," + m[0] + ")");
var xAxis = d3.svg.axis()
                  .scale(x)
                  .tickValues(d3.range(-4, 5.01, 1))
                  .orient("bottom")
graph.append("g")
    .attr("class", "x axis")
    .attr("transform", "translate(0, " + height/2 + ")")
    .call(xAxis);
var yAxis = d3.svg.axis()
                  .scale(y)
                  .tickValues(d3.range(-4, 5.01, 1))
                  .orient("left")
graph.append("g")
    .attr("class", "y axis")
    .call(yAxis);
graph.append("path").attr("d", line(data));
graph.append("text")
     .attr("class", "x label")
     .attr("text-anchor", "end")
     .attr("x", width+20)
     .attr("y", height/2+8)
     .text("z");
graph.append("text")
        .attr("x", (width / 2))
        .attr("y", -10)
        .attr("text-anchor", "middle")
        .style("font-size", "16px")
        .text("max(0, z)");
</script></p><p>Obviously such neurons are quite different from both sigmoid and tanh
neurons.  However, like the sigmoid and tanh neurons, rectified linear
units can be used to compute any function, and they can be trained
using ideas such as backpropagation and stochastic gradient descent.</p><p>When should you use rectified linear units instead of sigmoid or tanh
neurons?  Some recent work on image recognition*<span class="marginnote">
*See, for
  example,
  <a href="http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf">What
    is the Best Multi-Stage Architecture for Object Recognition?</a>, by
  Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann
  LeCun (2009),
  <a href="http://eprints.pascal-network.org/archive/00008596/01/glorot11a.pdf">Deep
    Sparse Rectifier Neural Networks</a>, by Xavier Glorot, Antoine
  Bordes, and Yoshua Bengio (2011), and
  <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">ImageNet
    Classification with Deep Convolutional Neural Networks</a>, by Alex
  Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).  Note that
  these papers fill in important details about how to set up the
  output layer, cost function, and regularization in networks using
  rectified linear units. I've glossed over all these details in this
  brief account. The papers also discuss in more detail the benefits
  and drawbacks of using rectified linear units.  Another informative
  paper is
  <a href="https://www.cs.toronto.edu/&#126;hinton/absps/reluICML.pdf">Rectified
    Linear Units Improve Restricted Boltzmann Machines</a>, by Vinod Nair
  and Geoffrey Hinton (2010), which demonstrates the benefits of using
  rectified linear units in a somewhat different approach to neural
  networks.</span>  has found considerable benefit in using rectified linear
units through much of the network.  However, as with tanh neurons, we
do not yet have a really deep understanding of when, exactly,
rectified linear units are preferable, nor why.  To give you the
flavour of some of the issues, recall that sigmoid neurons stop
learning when they saturate, i.e., when their output is near either
$0$ or $1$.  As we've seen repeatedly in this chapter, the problem is
that $\sigma'$ terms reduce the gradient, and that slows down
learning.  Tanh neurons suffer from a similar problem when they
saturate.  By contrast, increasing the weighted input to a rectified
linear unit will never cause it to saturate, and so there is no
corresponding learning slowdown.  On the other hand, when the weighted
input to a rectified linear unit is negative, the gradient vanishes,
and so the neuron stops learning entirely.  These are just two of the
many issues that make it non-trivial to understand when and why
rectified linear units perform better than sigmoid or tanh neurons.</p><p>I've painted a picture of uncertainty here, stressing that we do not
yet have a solid theory of how activation functions should be chosen.
Indeed, the problem is harder even than I have described, for there
are infinitely many possible activation functions.  Which is the best
for any given problem?  Which will result in a network which learns
fastest?  Which will give the highest test accuracies? I am surprised
how little really deep and systematic investigation has been done of
these questions.  Ideally, we'd have a theory which tells us, in
detail, how to choose (and perhaps modify-on-the-fly) our activation
functions.  On the other hand, we shouldn't let the lack of a full
theory stop us!  We have powerful tools already at hand, and can make
a lot of progress with those tools.  Through the remainder of this
book I'll continue to use sigmoid neurons as our go-to neuron, since
they're powerful and provide concrete illustrations of the core ideas
about neural nets.  But keep in the back of your mind that these same
ideas can be applied to other types of neuron, and that there are
sometimes advantages in doing so.</p><p><h4><a name="on_stories_in_neural_networks"></a><a href="#on_stories_in_neural_networks">On stories in neural networks</a></h4> </p><p><blockquote><em><strong>Question:</strong> How do you
  approach utilizing and researching machine learning techniques that
  are supported almost entirely empirically, as opposed to
  mathematically? Also in what situations have you noticed some of
  these techniques fail?</p><p>  <strong>Answer:</strong> You have to realize that our theoretical
  tools are very weak. Sometimes, we have good mathematical intuitions
  for why a particular technique should work. Sometimes our intuition
  ends up being wrong [...] The questions become: how well does my
  method work on this particular problem, and how large is the set of
  problems on which it works well.</p><p>  -
  <a href="http://www.reddit.com/r/MachineLearning/comments/25lnbt/ama_yann_lecun/chivdv7">Question
    and answer</a> with neural networks researcher Yann LeCun</em></blockquote></p><p>Once, attending a conference on the foundations of quantum mechanics,
I noticed what seemed to me a most curious verbal habit: when talks
finished, questions from the audience often began with "I'm very
sympathetic to your point of view, but [...]".  Quantum foundations
was not my usual field, and I noticed this style of questioning
because at other scientific conferences I'd rarely or never heard a
questioner express their sympathy for the point of view of the
speaker.  At the time, I thought the prevalence of the question
suggested that little genuine progress was being made in quantum
foundations, and people were merely spinning their wheels.  Later, I
realized that assessment was too harsh.  The speakers were wrestling
with some of the hardest problems human minds have ever confronted.
Of course progress was slow!  But there was still value in hearing
updates on how people were thinking, even if they didn't always have
unarguable new progress to report.</p><p>You may have noticed a verbal tic similar to "I'm very sympathetic
[...]" in the current book.  To explain what we're seeing I've often
fallen back on saying "Heuristically, [...]", or "Roughly speaking,
[...]", following up with a story to explain some phenomenon or
other.  These stories are plausible, but the empirical evidence I've
presented has often been pretty thin.  If you look through the
research literature you'll see that stories in a similar style appear
in many research papers on neural nets, often with thin supporting
evidence.  What should we think about such stories?</p><p>In many parts of science - especially those parts that deal with
simple phenomena - it's possible to obtain very solid, very reliable
evidence for quite general hypotheses.  But in neural networks there
are large numbers of parameters and hyper-parameters, and extremely
complex interactions between them.  In such extraordinarily complex
systems it's exceedingly difficult to establish reliable general
statements.  Understanding neural networks in their full generality is
a problem that, like quantum foundations, tests the limits of the
human mind.  Instead, we often make do with evidence for or against a
few specific instances of a general statement.  As a result those
statements sometimes later need to be modified or abandoned, when new
evidence comes to light.</p><p>One way of viewing this situation is that any heuristic story about
neural networks carries with it an implied challenge.  For example,
consider the statement I <a href="#dropout_explanation">quoted
  earlier</a>, explaining why dropout works*<span class="marginnote">
*From
  <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">ImageNet
    Classification with Deep Convolutional Neural Networks</a> by Alex
  Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).</span>: "This
technique reduces complex co-adaptations of neurons, since a neuron
cannot rely on the presence of particular other neurons. It is,
therefore, forced to learn more robust features that are useful in
conjunction with many different random subsets of the other neurons."
This is a rich, provocative statement, and one could build a fruitful
research program entirely around unpacking the statement, figuring out
what in it is true, what is false, what needs variation and
refinement.  Indeed, there is now a small industry of researchers who
are investigating dropout (and many variations), trying to understand
how it works, and what its limits are.  And so it goes with many of
the heuristics we've discussed.  Each heuristic is not just a
(potential) explanation, it's also a challenge to investigate and
understand in more detail.</p><p>Of course, there is not time for any single person to investigate all
these heuristic explanations in depth.  It's going to take decades (or
longer) for the community of neural networks researchers to develop a
really powerful, evidence-based theory of how neural networks learn.
Does this mean you should reject heuristic explanations as unrigorous,
and not sufficiently evidence-based?  No!  In fact, we need such
heuristics to inspire and guide our thinking.  It's like the great age
of exploration: the early explorers sometimes explored (and made new
discoveries) on the basis of beliefs which were wrong in important
ways.  Later, those mistakes were corrected as we filled in our
knowledge of geography.  When you understand something poorly - as
the explorers understood geography, and as we understand neural nets
today - it's more important to explore boldly than it is to be
rigorously correct in every step of your thinking.  And so you should
view these stories as a useful guide to how to think about neural
nets, while retaining a healthy awareness of the limitations of such
stories, and carefully keeping track of just how strong the evidence
is for any given line of reasoning.  Put another way, we need good
stories to help motivate and inspire us, and rigorous in-depth
investigation in order to uncover the real facts of the matter.</p><p></div><div class="footer"> <span class="left_footer"> In academic work,
please cite this book as: Michael A. Nielsen, "Neural Networks and
Deep Learning", Determination Press, 2014

<br/>
<br/>

This work is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"
style="color: #eee;">Creative Commons Attribution-NonCommercial 3.0
Unported License</a>.  This means you're free to copy, share, and
build on this book, but not to sell it.  If you're interested in
commercial use, please <a
href="mailto:mn@michaelnielsen.org">contact me</a>.
</span>
<span class="right_footer">
Last update: Tue Sep  2 09:19:44 2014
<br/>
<br/>
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"><img alt="Creative Commons Licence" style="border-width:0" src="http://i.creativecommons.org/l/by-nc/3.0/88x31.png" /></a>
</span>
</div>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-44208967-1', 'neuralnetworksanddeeplearning.com');
  ga('send', 'pageview');

</script>
</body>
</html>