<!DOCTYPE html>
<html lang="en">
<!-- Produced from a LaTeX source file.  Note that the production is done -->
<!-- by a very rough-and-ready (and buggy) script, so the HTML and other  -->
<!-- code is quite ugly!  Later versions should be better.                -->
  <head>
    <meta charset="utf-8">
    <meta name="citation_title" content="Neural Networks and Deep Learning">
    <meta name="citation_author" content="Nielsen, Michael A.">
    <meta name="citation_publication_date" content="2015">
    <meta name="citation_fulltext_html_url" content="http://neuralnetworksanddeeplearning.com">
    <meta name="citation_publisher" content="Determination Press">
    <link rel="icon" href="nnadl_favicon.ICO" />
    <title>Neural Networks and Deep Learning</title>
    <script src="assets/jquery.min.js"></script>
    <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: {inlineMath: [['$','$']]},
        "HTML-CSS":
          {scale: 92},
        TeX: { equationNumbers: { autoNumber: "AMS" }}});
    </script>
    <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>


    <link href="assets/style.css" rel="stylesheet">
    <link href="assets/pygments.css" rel="stylesheet">
    <link rel="stylesheet" href="http://code.jquery.com/ui/1.11.2/themes/smoothness/jquery-ui.css">

<style>
/* Adapted from */
/* https://groups.google.com/d/msg/mathjax-users/jqQxrmeG48o/oAaivLgLN90J, */
/* by David Cervone */

@font-face {
    font-family: 'MJX_Math';
    src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); /* IE9 Compat Modes */
    src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot?iefix') format('eot'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff')  format('woff'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf')  format('opentype'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Math-Italic.svg#MathJax_Math-Italic') format('svg');
}

@font-face {
    font-family: 'MJX_Main';
    src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); /* IE9 Compat Modes */
    src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot?iefix') format('eot'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff')  format('woff'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf')  format('opentype'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Main-Regular.svg#MathJax_Main-Regular') format('svg');
}
</style>

  </head>
  <body><div class="header"><h1 class="chapter_number">
  <a href="">CHAPTER 5</a></h1>
  <h1 class="chapter_title"><a href="">Why are deep neural networks hard to train?</a></h1></div><div class="section"><div id="toc">
<p class="toc_title"><a href="index.html">Neural Networks and Deep Learning</a></p><p class="toc_not_mainchapter"><a href="about.html">What this book is about</a></p><p class="toc_not_mainchapter"><a href="exercises_and_problems.html">On the exercises and problems</a></p><p class='toc_mainchapter'><a id="toc_using_neural_nets_to_recognize_handwritten_digits_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_using_neural_nets_to_recognize_handwritten_digits" src="images/arrow.png" width="15px"></a><a href="chap1.html">Using neural nets to recognize handwritten digits</a><div id="toc_using_neural_nets_to_recognize_handwritten_digits" style="display: none;"><p class="toc_section"><ul><a href="chap1.html#perceptrons"><li>Perceptrons</li></a><a href="chap1.html#sigmoid_neurons"><li>Sigmoid neurons</li></a><a href="chap1.html#the_architecture_of_neural_networks"><li>The architecture of neural networks</li></a><a href="chap1.html#a_simple_network_to_classify_handwritten_digits"><li>A simple network to classify handwritten digits</li></a><a href="chap1.html#learning_with_gradient_descent"><li>Learning with gradient descent</li></a><a href="chap1.html#implementing_our_network_to_classify_digits"><li>Implementing our network to classify digits</li></a><a href="chap1.html#toward_deep_learning"><li>Toward deep learning</li></a></ul></p></div>
<script>
$('#toc_using_neural_nets_to_recognize_handwritten_digits_reveal').click(function() {
   var src = $('#toc_img_using_neural_nets_to_recognize_handwritten_digits').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow.png');
   };
   $('#toc_using_neural_nets_to_recognize_handwritten_digits').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_how_the_backpropagation_algorithm_works_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_how_the_backpropagation_algorithm_works" src="images/arrow.png" width="15px"></a><a href="chap2.html">How the backpropagation algorithm works</a><div id="toc_how_the_backpropagation_algorithm_works" style="display: none;"><p class="toc_section"><ul><a href="chap2.html#warm_up_a_fast_matrix-based_approach_to_computing_the_output_from_a_neural_network"><li>Warm up: a fast matrix-based approach to computing the output  from a neural network</li></a><a href="chap2.html#the_two_assumptions_we_need_about_the_cost_function"><li>The two assumptions we need about the cost function</li></a><a href="chap2.html#the_hadamard_product_$s_\odot_t$"><li>The Hadamard product, $s \odot t$</li></a><a href="chap2.html#the_four_fundamental_equations_behind_backpropagation"><li>The four fundamental equations behind backpropagation</li></a><a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)"><li>Proof of the four fundamental equations (optional)</li></a><a href="chap2.html#the_backpropagation_algorithm"><li>The backpropagation algorithm</li></a><a href="chap2.html#the_code_for_backpropagation"><li>The code for backpropagation</li></a><a href="chap2.html#in_what_sense_is_backpropagation_a_fast_algorithm"><li>In what sense is backpropagation a fast algorithm?</li></a><a href="chap2.html#backpropagation_the_big_picture"><li>Backpropagation: the big picture</li></a></ul></p></div>
<script>
$('#toc_how_the_backpropagation_algorithm_works_reveal').click(function() {
   var src = $('#toc_img_how_the_backpropagation_algorithm_works').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow.png');
   };
   $('#toc_how_the_backpropagation_algorithm_works').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_improving_the_way_neural_networks_learn_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_improving_the_way_neural_networks_learn" src="images/arrow.png" width="15px"></a><a href="chap3.html">Improving the way neural networks learn</a><div id="toc_improving_the_way_neural_networks_learn" style="display: none;"><p class="toc_section"><ul><a href="chap3.html#the_cross-entropy_cost_function"><li>The cross-entropy cost function</li></a><a href="chap3.html#overfitting_and_regularization"><li>Overfitting and regularization</li></a><a href="chap3.html#weight_initialization"><li>Weight initialization</li></a><a href="chap3.html#handwriting_recognition_revisited_the_code"><li>Handwriting recognition revisited: the code</li></a><a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters"><li>How to choose a neural network's hyper-parameters?</li></a><a href="chap3.html#other_techniques"><li>Other techniques</li></a></ul></p></div>
<script>
$('#toc_improving_the_way_neural_networks_learn_reveal').click(function() {
   var src = $('#toc_img_improving_the_way_neural_networks_learn').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow.png');
   };
   $('#toc_improving_the_way_neural_networks_learn').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_a_visual_proof_that_neural_nets_can_compute_any_function" src="images/arrow.png" width="15px"></a><a href="chap4.html">A visual proof that neural nets can compute any function</a><div id="toc_a_visual_proof_that_neural_nets_can_compute_any_function" style="display: none;"><p class="toc_section"><ul><a href="chap4.html#two_caveats"><li>Two caveats</li></a><a href="chap4.html#universality_with_one_input_and_one_output"><li>Universality with one input and one output</li></a><a href="chap4.html#many_input_variables"><li>Many input variables</li></a><a href="chap4.html#extension_beyond_sigmoid_neurons"><li>Extension beyond sigmoid neurons</li></a><a href="chap4.html#fixing_up_the_step_functions"><li>Fixing up the step functions</li></a><a href="chap4.html#conclusion"><li>Conclusion</li></a></ul></p></div>
<script>
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal').click(function() {
   var src = $('#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow.png');
   };
   $('#toc_a_visual_proof_that_neural_nets_can_compute_any_function').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_why_are_deep_neural_networks_hard_to_train_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_why_are_deep_neural_networks_hard_to_train" src="images/arrow.png" width="15px"></a><a href="chap5.html">Why are deep neural networks hard to train?</a><div id="toc_why_are_deep_neural_networks_hard_to_train" style="display: none;"><p class="toc_section"><ul><a href="chap5.html#the_vanishing_gradient_problem"><li>The vanishing gradient problem</li></a><a href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets"><li>What's causing the vanishing gradient problem?  Unstable gradients in deep neural nets</li></a><a href="chap5.html#unstable_gradients_in_more_complex_networks"><li>Unstable gradients in more complex networks</li></a><a href="chap5.html#other_obstacles_to_deep_learning"><li>Other obstacles to deep learning</li></a></ul></p></div>
<script>
$('#toc_why_are_deep_neural_networks_hard_to_train_reveal').click(function() {
   var src = $('#toc_img_why_are_deep_neural_networks_hard_to_train').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow.png');
   };
   $('#toc_why_are_deep_neural_networks_hard_to_train').toggle('fast', function() {});
});</script>

<p class='toc_mainchapter'><a id="toc_deep_learning_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_deep_learning" src="images/arrow.png" width="15px"></a><a href="chap6.html">Deep learning</a><div id="toc_deep_learning" style="display: none;"><p class="toc_section"><ul><a href="chap6.html#introducing_convolutional_networks"><li>Introducing convolutional networks</li></a><a href="chap6.html#convolutional_neural_networks_in_practice"><li>Convolutional neural networks in practice</li></a><a href="chap6.html#the_code_for_our_convolutional_networks"><li>The code for our convolutional networks</li></a><a href="chap6.html#recent_progress_in_image_recognition"><li>Recent progress in image recognition</li></a><a href="chap6.html#other_approaches_to_deep_neural_nets"><li>Other approaches to deep neural nets</li></a><a href="chap6.html#on_the_future_of_neural_networks"><li>On the future of neural networks</li></a></ul></p></div>
<script>
$('#toc_deep_learning_reveal').click(function() {
   var src = $('#toc_img_deep_learning').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_deep_learning").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_deep_learning").attr('src', 'images/arrow.png');
   };
   $('#toc_deep_learning').toggle('fast', function() {});
});</script>


<p class="toc_not_mainchapter"><a href="sai.html">
Appendix: Is there a <i>simple</i> algorithm for intelligence?</a></p>
<p class="toc_not_mainchapter"><a href="acknowledgements.html">Acknowledgements</a></p><p class="toc_not_mainchapter"><a href="faq.html">Frequently Asked Questions</a></p>
<hr>
<span class="sidebar_title">Sponsors</span>
<br/>
<a href='http://www.ersatz1.com/'><img src='assets/ersatz.png' width='140px' style="padding: 0px 0px 10px 8px; border-style: none;"></a>

<a href='http://gsquaredcapital.com/'><img src='assets/gsquared.png' width='150px' style="padding: 0px 0px 10px 10px; border-style: none;"></a>

<a href='http://www.tineye.com'><img src='assets/tineye.png' width='150px'
style="padding: 0px 0px 10px 8px; border-style: none;"></a>

<a href='http://www.visionsmarts.com'><img
src='assets/visionsmarts.png' width='160px' style="padding: 0px 0px
0px 0px; border-style: none;"></a> <br/>


<!--
<p class="sidebar">Thanks to all the <a
href="supporters.html">supporters</a> who made the book possible.
Thanks also to all the contributors to the <a
href="bugfinder.html">Bugfinder Hall of Fame</a>.  </p>

<p class="sidebar">The book is currently a beta release, and is still
under active development.  Please send error reports to
mn@michaelnielsen.org.  For other enquiries, please see the <a
href="faq.html">FAQ</a> first.</p>
-->

<p class="sidebar">Thanks to all the <a
href="supporters.html">supporters</a> who made the book possible.
Thanks also to all the contributors to the <a
        href="bugfinder.html">Bugfinder Hall of Fame</a>.
For the Japanese edition, many thanks also go to the <a
href="translators.html">translators</a>.

</p>


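<p class="sidebar">The book is currently a beta release, and is still under active development.
Please send error reports to mn@michaelnielsen.org, and questions about the Japanese edition to muranushi@gmail.com.
For other enquiries, please see the <a
href="faq.html">FAQ</a> first.</p>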


<hr>
<span class="sidebar_title">Resources</span>

<p class="sidebar">
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning">Code repository</a></p>

<p class="sidebar">
<a href="http://eepurl.com/BYr9L">Mailing list for book announcements</a>
</p>

<p class="sidebar">
<a href="http://eepurl.com/0Xxjb">Michael Nielsen's project announcement mailing list</a>
</p>

<hr>
<a href="http://michaelnielsen.org"><img src="assets/Michael_Nielsen_Web_Small.jpg" width="160px" style="border-style: none;"/></a>

<p class="sidebar">
  By <a href="http://michaelnielsen.org">Michael Nielsen</a> / Sep-Dec 2014 <br >  Translated by the <a href="https://github.com/nnadl-ja/nnadl_site_ja">"Neural Networks and Deep Learning" translation project</a>
</p>
</div>
</p><p><!--Imagine you're an engineer who has been asked to design a computerfrom scratch.  One day you're working away in your office, designinglogical circuits, setting out <CODE>AND</CODE> gates, <CODE>OR</CODE> gates,and so on, when your boss walks in with bad news. The customer hasjust added a surprising design requirement: the circuit for the entirecomputer must be just two layers deep:-->自身が、コンピュータを一から設計するよう求められているエンジニアであると想像してみてください。あなたはある日、オフィスの外にいます。論理回路を設計するため、<CODE>AND</CODE>ゲートや<CODE>OR</CODE>ゲートから手を付け始めています。そこへ、あなたの上司が悪い知らせを持ってきました。顧客が驚くべき設計要件を追加したのです。それは、コンピュータ全体をたった2層の深さの回路で作れという要件でした。</p><p><center><img src="images/shallow_circuit.png" width="500px"></center></p><p><!--You're dumbfounded, and tell your boss: "The customer is crazy!"-->あなたは慌てて、上司に言いました。「この顧客は狂っています！」</p><p><!--Your boss replies: "I think they're crazy, too.  But what thecustomer wants, they get."-->「彼らは狂ってると私も思う。でも、顧客が望むものは提供するしかないんだ」と上司は答えます。</p><p><!--In fact, there's a limited sense in which the customer isn't crazy.Suppose you're allowed to use a special logical gate which lets you<CODE>AND</CODE> together as many inputs as you want.  And you're alsoallowed a many-input <CODE>NAND</CODE> gate, that is, a gate which can<CODE>AND</CODE> multiple inputs and then negate the output.  With thesespecial gates it turns out to be possible to compute any function atall using a circuit that's just two layers deep.-->実際には、顧客が狂ってるわけではありません。どんなに入力が多くても<CODE>AND</CODE>を取得できる、特別な論理ゲートを使えると想定してみてください。加えて、多入力の<CODE>NAND</CODE>を取得できる論理ゲートも使えるとします。この<CODE>NAND</CODE>ゲートは、複数入力の<CODE>AND</CODE>を取得し、その結果を反転させて出力するゲートです。これらの特別な論理ゲートを使えば、ちょうど2層の回路で、どんな関数でも計算できます。</p><p><!--But just because something is possible doesn't make it a good idea.In practice, when solving circuit design problems (or most any kind ofalgorithmic problem), we usually start by figuring out how to solvesub-problems, and then gradually integrate the solutions.  
In otherwords, we build up to a solution through multiple layers ofabstraction.-->でも、実現可能だからと言って、良いアイデアとは限りません。実際、回路設計問題（もしくはアルゴリズムのあらゆる問題）を解くときには、私たちはまず、部分問題から解き始めます。そして、徐々に部分問題の解を組み合わせていくのです。すなわち、多数の層の解を抽象化させて、全体の解を作り上げていきます。</p><p><!--For instance, suppose we're designing a logical circuit to multiplytwo numbers.  Chances are we want to build it up out of sub-circuitsdoing operations like adding two numbers.  The sub-circuits for addingtwo numbers will, in turn, be built up out of sub-sub-circuits foradding two bits.  Very roughly speaking our circuit will look like:-->たとえば、2つの数の掛け算を行う論理回路を設計するとしましょう。きっとあなたは、2つの数の足し算を行う部分回路を組み合わせたいと思います。そして2つの数の足し算を行う部分回路を作るときには、今度は、2つのビットの足し算を行う、部分-部分回路を組み合わせるでしょう。大ざっぱに言うと、設計した論理回路はこんな形をしています。</p><p><center><img src="images/circuit_multiplication.png" width="500px"></center></p><p><!--That is, our final circuit contains at least three layers of circuitelements.  In fact, it'll probably contain more than three layers, aswe break the sub-tasks down into smaller units than I've described.But you get the general idea.-->最終的な論理回路は、少なくとも3層の回路要素から構成されます。私が述べたものよりも、さらに部分問題にブレークダウンしていけば、3層以上含むことになるでしょう。ただ、これまでの話の中で、あなたは汎用的に使えるアイデアを得たはずです。</p><p><!--So deep circuits make the process of design easier.  But they're notjust helpful for design.  There are, in fact, mathematical proofsshowing that for some functions very shallow circuits requireexponentially more circuit elements to compute than do deep circuits.For instance, a famous 1984 paper by Furst, Saxe andSipser*-->このように、回路を深くすれば、設計は簡単になっていきます。でも設計が楽になるだけではありません。実際、同じ計算を行う際、深い回路に比べると、浅い回路を使う場合には指数関数的に多くの回路要素が必要となることが、数学的に証明されています。たとえば、1984年のFurst、Saxe、そしてSipserの有名な論文*<span class="marginnote">*  <a href="http://link.springer.com/article/10.1007/BF01744431">Parity,    Circuits, and the Polynomial-Time Hierarchy</a>, by Merrick Furst,  James B. 
Saxe, and Michael Sipser (1984)  を確認してください。</span><!-- showed that computing theparity of a set of bits requires exponentially many gates, if donewith a shallow circuit.  On the other hand, if you use deeper circuitsit's easy to compute the parity using a small circuit: you justcompute the parity of pairs of bits, then use those results to computethe parity of pairs of pairs of bits, and so on, building up quicklyto the overall parity.  Deep circuits thus can be intrinsically muchmore powerful than shallow circuits.-->では、ビットのパリティを計算する際に、浅い回路を用いると、指数関数的に多くのゲートが必要となることが示されています。一方で、深い回路を使うとなれば、小さな回路要素でパリティ計算できます。そのときには、まずビットの組のパリティを計算します。その結果を使って、ビットの組の組のパリティを計算します。それを繰り返して、全体のパリティを素早く計算できます。深い回路は浅い回路よりも、本質的に遥かに強力なのです。</p><p><!--Up to now, this book has approached neural networks like the crazycustomer.  Almost all the networks we've worked with have just asingle hidden layer of neurons (plus the input and output layers):-->これまで、この本はニューラルネットワークに対して、先ほど述べた狂った顧客のような取り組み方をしてきました。すなわち、これまで扱ってきたほぼ全てのネットワークは、たった一つの隠れ層（に加えて入力層と出力層）しか持っていませんでした。</p><p><center><img src="images/tikz35.png"/></center></p><p><!--These simple networks have been remarkably useful: in earlier chapterswe used networks like this to classify handwritten digits with betterthan 98 percent accuracy!  Nonetheless, intuitively we'd expectnetworks with many more hidden layers to be more powerful:-->これらのシンプルなネットワークは、驚くほど有用です。前章までに、このネットワークを使った手書き数字の分類で、98％以上の精度を出しました！それにも関わらず、多くの隠れ層を備えるネットワークの方が強力なのではないか、という直感的な期待があるのです。</p><p><center><img src="images/tikz36.png"/></center></p><p><!--Such networks could use the intermediate layers to build up multiplelayers of abstraction, just as we do in Boolean circuits.  Forinstance, if we're doing visual pattern recognition, then the neuronsin the first layer might learn to recognize edges, the neurons in thesecond layer could learn to recognize more complex shapes, saytriangle or rectangles, built up from edges.  
The third layer wouldthen recognize still more complex shapes.  And so on.  These multiplelayers of abstraction seem likely to give deep networks a compellingadvantage in learning to solve complex pattern recognition problems.Moreover, just as in the case of circuits, there are theoreticalresults suggesting that deep networks are instrinsically more powerfulthan shallow networks*<span class="marginnote">*For certain problems and network  architectures this is proved in  <a href="http://arxiv.org/pdf/1312.6098.pdf">On the number of response    regions of deep feed forward networks with piece-wise linear    activations</a>, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio  (2014). See also the more informal discussion in section 2 of  <a href="http://www.iro.umontreal.ca/&#126;bengioy/papers/ftml_book.pdf">Learning    deep architectures for AI</a>, by Yoshua Bengio (2009).</span>-->多層のネットワークは、先ほどの論理回路の例と同じように、中間層を抽象化のために組み合わせて使います。たとえば、私たちが視覚パターン認識をするとしましょう。そのときには、第一層のニューロンはエッジを認識するようになります。第二層のニューロンは、三角形や四角形などの複雑な形状を、エッジの情報から認識します。第三層のニューロンはさらに複雑な形状を認識します。以降はこれの繰り返しです。どうやら、これらの抽象化層のおかげで、ニューラルネットワークは複雑なパターン認識問題を解く方法を学習できるようです。加えて、回路の例と同じように、深いネットワークは、浅いネットワークよりも元来強力であるという理論的な結果があります*<span class="marginnote">*特定の問題とネットワーク構造については、このことが以下で証明されています。  <a href="http://arxiv.org/pdf/1312.6098.pdf">On the number of response    regions of deep feed forward networks with piece-wise linear    activations</a>, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio  (2014). さらなる非公式な議論については、以下のセクション2を参照してください。  <a href="http://www.iro.umontreal.ca/&#126;bengioy/papers/ftml_book.pdf">Learning    deep architectures for AI</a>, by Yoshua Bengio (2009).</span>。</p><p><!--How can we train such deep networks?  In this chapter, we'll trytraining deep networks using our workhorse learning algorithm -<a href="chap1.html#learning_with_gradient_descent">stochastic  gradient descent</a> by <a href="chap2.html">backpropagation</a>.  
But we'llrun into trouble, with our deep networks not performing much (if atall) better than shallow networks.-->このような深いネットワークを訓練するにはどうすればよいのでしょうか？この章では、働き者の学習アルゴリズムである<a href="chap2.html">逆伝播による</a><a href="chap1.html#learning_with_gradient_descent">確率的勾配降下法</a>を使って、深いネットワークを訓練していこうと思います。しかしその過程で、深いネットワークを使っていても、浅いネットワークのときよりもパフォーマンスが良くならないという問題にぶつかります。</p><p><!--That failure seems surprising in the light of the discussion above.Rather than give up on deep networks, we'll dig down and try tounderstand what's making our deep networks hard to train.  When welook closely, we'll discover that the different layers in our deepnetwork are learning at vastly different speeds.  In particular, whenlater layers in the network are learning well, early layers often getstuck during training, learning almost nothing at all.  This stucknessisn't simply due to bad luck.  Rather, we'll discover there arefundamental reasons the learning slowdown occurs, connected to our useof gradient-based learning techniques.-->上記の議論の後では、そのような失敗が起きることが、不思議に思えるでしょう。失敗するからと言って、深いネットワークに見切りをつけずに、深いネットワークを訓練する難しさの原因を掘り下げて、理解していきましょう。よく観察すると、深いネットワーク中の異なる層は、まったく異なるスピードで学習していることがわかります。特に、後ろの方の層は大きく学び、前の方の層はしばしば訓練中にもたついて、全く何も学ばなかったりします。この前方の層のどん詰まりの状況は、単に運が悪かったというわけではありません。むしろ、勾配を利用した学習テクニックによく見られる、学習が遅くなってしまう本質的な理由がそこにはあります。</p><p><!--As we delve into the problem more deeply, we'll learn that theopposite phenomenon can also occur: the early layers may be learningwell, but later layers can become stuck.  In fact, we'll find thatthere's an intrinsic instability associated to learning by gradientdescent in deep, many-layer neural networks.  This instability tendsto result in either the early or the later layers getting stuck duringtraining.-->問題を深く掘り下げていくと、対照的な現象の発生も目撃します。前の方の層が大きく学び、後ろの方の層が学ばないという現象です。実際、深く多層からなるニューラルネットワークの場合、勾配降下法による学習には不安定性が伴うことを学びます。この不安定性により、前方の層か後方の層のどちらかが、訓練時に学習しなくなる傾向があります。</p><p><!--This all sounds like bad news.  
But by delving into thesedifficulties, we can begin to gain insight into what's required totrain deep networks effectively.  And so these investigations are goodpreparation for the next chapter, where we'll use deep learning toattack image recognition problems.-->これは悪い知らせに聞こえます。しかし、これらの問題を掘り下げていくと、深いネットワークを効果的に訓練するための必要事項に関して洞察を得ることができます。そして、この調査が次の章への良い準備になるのです。次の章では、深層学習を使って画像認識問題に取り組みます。</p><p><h3><a name="the_vanishing_gradient_problem"></a><a href="#the_vanishing_gradient_problem"><!--The vanishing gradient problem-->勾配消失問題</a></h3></p><p><!--So, what goes wrong when we try to train a deep network?-->さて、深いネットワークを訓練しようとするときに、何が上手く行かないのでしょうか？</p><p><!--To answer that question, let's first revisit the case of a networkwith just a single hidden layer.  As per usual, we'll use the MNISTdigit classification problem as our playground for learning andexperimentation*<span class="marginnote">*I introduced the MNIST problem and data  <a href="chap1.html#learning_with_gradient_descent">here</a> and  <a href="chap1.html#implementing_our_network_to_classify_digits">here</a>.</span>.-->この問いに答えるために、隠れ層1つのみのネットワークの例を再び見てみましょう。いつも通り、MNISTの手書き文字に対する分類問題を、学習と実験の場として使います*<span class="marginnote">*MNIST の問題とデータは  <a href="chap1.html#learning_with_gradient_descent">ここ</a> と  <a href="chap1.html#implementing_our_network_to_classify_digits">ここ</a>で紹介しました</span>。</p><p><!--If you wish, you can follow along by training networks on yourcomputer.  It is also, of course, fine to just read along.  If you dowish to follow live, then you'll need Python 2.7, Numpy, and a copy ofthe code, which you can get by cloning the relevant repository fromthe command line:-->お望みなら、自身のコンピュータでネットワークを訓練して追うことができます。もちろん、ただ読み進めるだけでも構いません。自身で実験をしつつ内容を追いたい場合、Python 2.7、Numpy、そしてコードが必要です。コードはコマンドラインを使ってレポジトリから複製できます。<div class="highlight"><pre>git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git</pre></div>
If you don't use <tt>git</tt> then you can download the data and code <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/archive/master.zip">here</a>.
You'll need to change into the <tt>src</tt> subdirectory.
</p>
<p>
Then, from a Python shell we load the MNIST data:
</p>
<p>
<div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">mnist_loader</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> \
<span class="o">...</span> <span class="n">mnist_loader</span><span class="o">.</span><span class="n">load_data_wrapper</span><span class="p">()</span>
</pre></div>
</p>
<p>
We set up our network:
</p>
<p>
<div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">network2</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
</pre></div>
</p><p><!--This network has 784 neurons in the input layer, corresponding to the$28 \times 28 = 784$ pixels in the input image.  We use 30 hiddenneurons, as well as 10 output neurons, corresponding to the 10possible classifications for the MNIST digits ('0', '1', '2', $\ldots$,'9').-->このネットワークは入力層に784のニューロンを持ち、これらが入力画像の $28 \times 28 = 784$ ピクセルに対応します。30の隠れニューロンと、10の出力ニューロンを使います。この出力ニューロンがMNISTの文字の ('0', '1', '2', $\ldots$, '9') の10分類に対応します。</p><p><!--Let's try training our network for 30 complete epochs, usingmini-batches of 10 training examples at a time, a learning rate $\eta= 0.1$, and regularization parameter $\lambda = 5.0$.  As we trainwe'll monitor the classification accuracy on the<tt>validation_data</tt>*<span class="marginnote">*Note that the networks take quite some  time to train - up to a few minutes per training epoch, depending  on the speed of your machine.  So if you're running the code it's  best to continue reading and return later, not to wait for the code  to finish executing.</span>:-->
さあ、10個の訓練画像を含むミニバッチ、学習率は $\eta
= 0.1$ 、正規化パラメータ $\lambda = 5.0$ の条件下で、ネットワークを30エポック分、訓練してみましょう。
<tt>validation data</tt>*を分類する正確さの変化を、訓練の進行にしたがい観測します*
<span class="marginnote">
  *ネットワークの訓練にはかなり時間がかかることに注意してください。
  コンピュータの性能によりますが、エポックごとに最大2,3分程度かかります。
  なので、コードを実行しているなら、終了を待たずにそのまま読み進めて、後で振り返るとよいでしょう。
</span>
</p>
<p>
<div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
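<p>As a rough illustration of what "classification accuracy" measures, here is a minimal sketch. It assumes the network's answer is taken to be the digit whose output neuron has the largest activation, as in the book's earlier chapters. The names <tt>predicted_digit</tt>, <tt>accuracy_percent</tt> and <tt>toy_output</tt> are hypothetical helpers, not part of <tt>network2</tt>, and the activations are made-up toy values rather than real network outputs.</p>

```python
def predicted_digit(output_activations):
    """Return the index of the largest of the 10 output activations."""
    return max(range(len(output_activations)),
               key=lambda k: output_activations[k])


def accuracy_percent(results):
    """Percentage of (output_activations, true_digit) pairs classified correctly."""
    correct = sum(1 for out, digit in results if predicted_digit(out) == digit)
    return 100.0 * correct / len(results)


def toy_output(winner):
    """Made-up activations: 0.9 for the `winner` neuron, 0.1 for the other nine."""
    return [0.9 if k == winner else 0.1 for k in range(10)]


# Four toy examples; the last one is deliberately misclassified (4 instead of 9).
toy_results = [(toy_output(7), 7), (toy_output(2), 2),
               (toy_output(1), 1), (toy_output(4), 9)]

print("%.2f percent" % accuracy_percent(toy_results))
```

<p>Three of the four toy examples are classified correctly, so the sketch reports 75 percent. The accuracy figures quoted in this chapter are the same kind of ratio, computed over the whole of <tt>validation_data</tt>.</p>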
</p><p><!--We get a classification accuracy of 96.48 percent (or thereabouts -it'll vary a bit from run to run), comparable to our earlier resultswith a similar configuration.-->分類精度は96.48%（もしくは、実行するごとにちょっとだけ変化しますが大体その程度）です。この結果は、同じ設定で実施した先の結果とほぼ同じ数値です。<!-- ここは意訳しています --></p><p><!--Now, let's add another hidden layer, also with 30 neurons in it, andtry training with the same hyper-parameters:-->
さて、同じ30個のニューロンを持つ隠れ層を追加してみましょう。
このネットワークを、同じハイパーパラメータのもとで訓練します。
</p>
<p>
<div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p><p><!--This gives an improved classification accuracy, 96.90 percent.  That'sencouraging: a little more depth is helping.  Let's add another30-neuron hidden layer:
-->
分類精度が向上し、96.90%となりました。
この結果は励みになります。少し深くしたことで、良い結果が得られたようです。
さあ、さらに30個のニューロンを持つ隠れ層を追加しましょう。
</p>
<p>
<div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p><p><!--That doesn't help at all.  In fact, the result drops back down to96.57 percent, close to our original shallow network.  And suppose weinsert one further hidden layer:-->
今回は結果が良くなりませんでした。
精度が96.57%へ落ちています。
一番最初の浅いネットワークの結果に近づきました。
もう1つ隠れ層を追加しましょう。
</p>
<p>
<div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">network2</span><span class="o">.</span><span class="n">Network</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span>
<span class="o">...</span> <span class="n">evaluation_data</span><span class="o">=</span><span class="n">validation_data</span><span class="p">,</span> <span class="n">monitor_evaluation_accuracy</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</p><p><!--The classification accuracy drops again, to 96.53 percent.  That'sprobably not a statistically significant drop, but it's notencouraging, either.-->分類精度は再び悪化し、96.53%となりました。統計的に有意な悪化ではおそらくないですが、良い結果ではありません。</p><p><!--This behaviour seems strange.  Intuitively, extra hidden layers oughtto make the network able to learn more complex classificationfunctions, and thus do a better job classifying.  Certainly, thingsshouldn't get worse, since the extra layers can, in the worst case,simply do nothing*<span class="marginnote">*See <a href="#identity_neuron">this later    problem</a> to understand how to build a hidden layer that does  nothing.</span>.  But that's not what's going on.-->この振る舞いは奇妙に思えます。直感的には、隠れ層を追加したことで、ネットワークは複雑な分類関数を学習でき、分類タスクもうまくいくはずです。最悪の場合でも、隠れ層に単に何もさせないようにすれば、悪化しないはずなのです*<span class="marginnote">*何もしない隠れ層がどう生まれるかを理解するための<a href="#identity_neuron">後に紹介する問題</a>を確認してください。</span>。しかし、実際には悪化が起きています。</p><p><!--So whatb is going on?  Let's assume that the extra hidden layers reallycould help in principle, and the problem is that our learningalgorithm isn't finding the right weights and biases.  We'd like tofigure out what's going wrong in our learning algorithm, and how to dobetter.-->では一体、何が起きているのでしょう？隠れ層を追加することで、原理的には上手くいくとすると、問題は学習アルゴリズムが適切な重みとバイアスを探せていないことです。学習アルゴリズムの上手くいかない点とその対処法を見つけたいです。</p><p><!--To get some insight into what's going wrong, let's visualize how thenetwork learns.  Below, I've plotted part of a $[784, 30, 30, 10]$network, i.e., a network with two hidden layers, each containing $30$hidden neurons.  Each neuron in the diagram has a little bar on it,representing how quickly that neuron is changing as the networklearns.  A big bar means the neuron's weights and bias are changingrapidly, while a small bar means the weights and bias are changingslowly.  
-->To get some insight into what's going wrong, let's visualize how the network learns. Below, I've plotted part of a $[784, 30, 30, 10]$ network, i.e., a network with two hidden layers, each containing $30$ hidden neurons. Each neuron in the diagram has a little bar on it, representing how quickly that neuron is changing as the network learns. A big bar means the neuron's weights and bias are changing rapidly, while a small bar means the weights and bias are changing slowly. More precisely, the bars denote the gradient $\partial C / \partial b$ for each neuron, i.e., the rate of change of the cost with respect to the neuron's bias. Back in <a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">Chapter 2</a> we saw that this gradient quantity controls not just how rapidly the bias changes during learning, but also how rapidly the weights input to the neuron change. Don't worry if you don't recall the details: the thing to keep in mind is simply that these bars show how quickly each neuron's weights and bias are changing as the network learns.</p><p><!--
-->To keep the diagram simple, I've shown just the top six neurons in the two hidden layers. I've omitted the input neurons, since they've got no weights or biases to learn. I've also omitted the output neurons, since we're doing layer-wise comparisons, and it makes most sense to compare layers with the same number of neurons. The results are plotted at the very beginning of training, i.e., immediately after the network is initialized. Here they are*<span class="marginnote">*The data plotted is generated using the program <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/fig/generate_gradient.py">generate_gradient.py</a>. The same program is also used to generate the results quoted later in this section.</span>:</p><p><canvas id="initial_gradient" width=470 height=620></canvas></p><p>The network was initialized randomly, and so it's not surprising that there's a lot of variation in how rapidly the neurons learn. Still, one thing that jumps out is that the bars in the second hidden layer are mostly much larger than the bars in the first hidden layer. As a result, the neurons in the second hidden layer will learn quite a bit faster than the neurons in the first hidden layer. Is this merely a coincidence, or are the neurons in the second hidden layer likely to learn faster than neurons in the first hidden layer in general?</p><p><!--
-->To determine whether this is the case, it helps to have a global way of comparing the speed of learning in the first and second hidden layers. To do this, let's denote the gradient as $\delta^l_j = \partial C / \partial b^l_j$, i.e., the gradient for the $j$th neuron in the $l$th layer*<span class="marginnote">*Back in <a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">Chapter 2</a> we referred to this as the error, but here we'll adopt the informal term "gradient". I say "informal" because this quantity doesn't explicitly include the partial derivatives of the cost with respect to the weights, $\partial C / \partial w$.</span>. We can think of the gradient $\delta^1$ as a vector whose entries determine how quickly the first hidden layer learns, and $\delta^2$ as a vector whose entries determine how quickly the second hidden layer learns. We'll then use the lengths of these vectors as (rough!) global measures of the speed at which the layers are learning. So, for instance, the length $\| \delta^1 \|$ measures the speed at which the first hidden layer is learning, while the length $\| \delta^2 \|$ measures the speed at which the second hidden layer is learning.</p><p>With these definitions, and in the same configuration as was plotted above, we find $\| \delta^1 \| = 0.07\ldots$ and $\| \delta^2 \| = 0.31\ldots$. So this confirms our earlier suspicion: the neurons in the second hidden layer really are learning much faster than the neurons in the first hidden layer.</p><p><!--
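-->The layer-wise speeds $\| \delta^l \|$ can be reproduced in miniature. What follows is a minimal pure-Python sketch, not the <tt>generate_gradient.py</tt> program used for the figures: the function names, the small $[20, 8, 8, 8, 4]$ network, the random input, the quadratic cost, and the Gaussian initialization with standard deviation $1$ are all illustrative assumptions. It backpropagates a single example and reports the length of the bias-gradient vector in each layer.</p>

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def init_network(sizes, rng):
    """One (W, b) pair per non-input layer, entries drawn from N(0, 1)."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
        layers.append((W, b))
    return layers


def forward(layers, x):
    """Return the weighted inputs z^l and the activations a^l of every layer."""
    zs, activations = [], [x]
    for W, b in layers:
        z = [sum(w * a for w, a in zip(row, activations[-1])) + bj
             for row, bj in zip(W, b)]
        zs.append(z)
        activations.append([sigmoid(zj) for zj in z])
    return zs, activations


def layer_deltas(layers, x, y):
    """delta^l_j = dC/db^l_j for the assumed cost C = 0.5 * ||a - y||^2."""
    zs, activations = forward(layers, x)
    delta = [(a - t) * sigmoid_prime(z)
             for a, t, z in zip(activations[-1], y, zs[-1])]
    deltas = [delta]
    # Walk backward: delta^l = ((W^{l+1})^T delta^{l+1}) * sigma'(z^l).
    for l in range(len(layers) - 2, -1, -1):
        W_next = layers[l + 1][0]
        delta = [sum(W_next[k][j] * delta[k] for k in range(len(delta)))
                 * sigmoid_prime(zs[l][j])
                 for j in range(len(zs[l]))]
        deltas.append(delta)
    deltas.reverse()
    return deltas


def norm(v):
    return math.sqrt(sum(c * c for c in v))


rng = random.Random(0)
sizes = [20, 8, 8, 8, 4]  # small stand-in for [784, 30, 30, 30, 10]
net = init_network(sizes, rng)
x = [rng.random() for _ in range(sizes[0])]
y = [1.0] + [0.0] * (sizes[-1] - 1)

deltas = layer_deltas(net, x, y)
norms = [norm(d) for d in deltas]
for l, n in enumerate(norms, start=1):
    print("layer %d: ||delta|| = %.3g" % (l, n))
```

<p>The last entry of <tt>norms</tt> belongs to the output layer; the earlier entries are the hidden layers. The exact values depend on the seed and on the assumptions above, so treat them as illustrative rather than as a reproduction of the numbers quoted in the text.</p><p><!--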
If we have three hiddenlayers, in a $[784, 30, 30, 30, 10]$ network, then the respectivespeeds of learning turn out to be 0.012, 0.060, and 0.283.  Again,earlier hidden layers are learning much slower than later hiddenlayers.  Suppose we add yet another layer with $30$ hidden neurons.In that case, the respective speeds of learning are 0.003, 0.017,0.070, and 0.285.  The pattern holds: early layers learn slower thanlater layers.-->隠れ層をさらに追加したとしたら、何が起こるでしょうか？もし、隠れ層が3個となったときには、各層での学習速度は 0.012、0.060、そして 0.283となります。やはり、前方の隠れ層の学習は、後方の隠れ層よりかなり遅くなっています。さらに30個のニューロンを持つ隠れ層を追加してみます。この場合には、各層での学習速度は、0.003、0.017、0.070、そして0.285となります。パターンとしては同じです。前方の層は後方の層よりも遅く学習しています。</p><p><!--We've been looking at the speed of learning at the start of training,that is, just after the networks are initialized.  How does the speedof learning change as we train our networks?  Let's return to look atthe network with just two hidden layers.  The speed of learningchanges as follows:-->これまでは訓練開始頃、すなわちネットワーク初期化直後の学習速度を見てきました。ネットワークの訓練が進むと、学習率はどのように変化するのでしょうか？2個の隠れ層のみ持つネットワークに立ち戻って確認してみましょう。学習速度は次のように変化します。</p><p><center><img src="images/training_speed_2_layers.png" width="500px"></center></p><p><!--To generate these results, I used batch gradient descent with just1,000 training images, trained over 500 epochs.  This is a bitdifferent than the way we usually train - I've used no mini-batches,and just 1,000 training images, rather than the full 50,000 imagetraining set.  I'm not trying to do anything sneaky, or pull the woolover your eyes, but it turns out that using mini-batch stochasticgradient descent gives much noisier (albeit very similar, when youaverage away the noise) results.  
-->To generate these results, I used batch gradient descent with just 1,000 training images, trained over 500 epochs. This is a bit different from the way we usually train: I've used no mini-batches, and just 1,000 training images rather than the full 50,000-image training set. I'm not trying to do anything sneaky, or pull the wool over your eyes, but it turns out that using mini-batch stochastic gradient descent gives much noisier (albeit very similar, when you average away the noise) results. Using the parameters I've chosen is an easy way of smoothing the results out, so we can see what's going on.</p><p>In any case, as you can see, the two layers start out learning at very different speeds (as we already know). The speed in both layers then drops very quickly, before rebounding. But through it all, the first hidden layer learns much more slowly than the second hidden layer.</p><p>What about more complex networks? Here are the results of a similar experiment, but this time with three hidden layers (a $[784, 30, 30, 30, 10]$ network):</p><p><center><img src="images/training_speed_3_layers.png" width="500px"></center></p><p>Again, early hidden layers learn much more slowly than later hidden layers. Finally, let's add a fourth hidden layer (a $[784, 30, 30, 30, 30, 10]$ network), and see what happens when we train:</p><p><center><img src="images/training_speed_4_layers.png" width="500px"></center></p><p><!--
No wonder we were havingtrouble training these networks earlier!-->またもや、前方の隠れ層は後方のものよりも遅くなりました。今回、最初の隠れ層は、最後の隠れ層よりも約100倍遅いという結果になっています。前方のネットワークを訓練する難しさがわかったことでしょう！</p><p><!--We have here an important observation: in at least some deep neuralnetworks, the gradient tends to get smaller as we move backwardthrough the hidden layers.  This means that neurons in the earlierlayers learn much more slowly than neurons in later layers.  And whilewe've seen this in just a single network, there are fundamentalreasons why this happens in many neural networks.  The phenomenon isknown as the <em>vanishing gradient problem</em>*<span class="marginnote">*See  <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.7321">Gradient    flow in recurrent nets: the difficulty of learning long-term    dependencies</a>, by Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi,  and J&uuml;rgen Schmidhuber (2001).  This paper studied recurrent  neural nets, but the essential phenomenon is the same as in the  feedforward networks we are studying.  See also Sepp Hochreiter's  earlier Diploma Thesis,  <a href="http://www.idsia.ch/&#126;juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf">Untersuchungen    zu dynamischen neuronalen Netzen</a> (1991, in German).</span>.
-->We have here an important observation: in at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers. And while we've seen this in just a single network, there are fundamental reasons why this happens in many neural networks. The phenomenon is known as the <em>vanishing gradient problem</em>*<span class="marginnote">*See <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.7321">Gradient flow in recurrent nets: the difficulty of learning long-term dependencies</a>, by Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and J&uuml;rgen Schmidhuber (2001). This paper studied recurrent neural nets, but the essential phenomenon is the same as in the feedforward networks we are studying. See also Sepp Hochreiter's earlier Diploma Thesis, <a href="http://www.idsia.ch/&#126;juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf">Untersuchungen zu dynamischen neuronalen Netzen</a> (1991, in German).</span>.</p><p><!--
-->Why does the vanishing gradient problem occur? Are there ways we can avoid it? And how should we deal with it in training deep neural networks? In fact, we'll learn shortly that it's not inevitable, although the alternative is not very attractive, either: sometimes the gradient gets much larger in earlier layers! This is the <em>exploding gradient problem</em>, and it's not much better news than the vanishing gradient problem. More generally, it turns out that the gradient in deep neural networks is <em>unstable</em>, tending to either explode or vanish in earlier layers. This instability is a fundamental problem for gradient-based learning in deep neural networks. It's something we need to understand, and, if possible, take steps to address.</p><p>One response to vanishing (or unstable) gradients is to wonder if they're really such a problem. Momentarily stepping away from neural nets, imagine we were trying to numerically minimize a function $f(x)$ of a single variable. Wouldn't it be good news if the derivative $f'(x)$ was small? Wouldn't that mean we were already near an extremum? In a similar way, might the small gradient in early layers of a deep network mean that we don't need to do much adjustment of the weights and biases?</p><p><!--
If we're going to traindeep networks, we need to figure out how to address the vanishinggradient problem.-->もちろん、これらの考えは間違っています。ネットワークの重みとバイアスをランダムに初期化したことを思い出してください。どんなネットワークであっても、重みとバイアスの初期値がいい感じになっているなんて、ありえないことです。具体的に確認するために、MNIST問題用の $[784, 30, 30, 30, 10]$ のネットワークにおける最初の層の重みを考えてみます。ランダムに初期化するということは、最初の層は、入力画像についての情報を捨てているということです。たとえ、後方の層が素晴らしく訓練されていたとしても、後方の層からは入力画像を同定することができない状況となるのです。なので、最初の層での学習は必要です。深いネットワークを訓練する場合には、この勾配消失問題にどう取り組むべきかを明らかにする必要があります。</p><p><h3><a name="what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets"></a><a href="#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets"><!--What's causing the vanishing gradient problem?  Unstable gradients in deep neural nets-->勾配消失問題の原因とは？ 深いニューラルネットにおける不安定な勾配</a></h3></p><p><!--To get insight into why the vanishing gradient problem occurs, let'sconsider the simplest deep neural network: one with just a singleneuron in each layer.  Here's a network with three hidden layers:-->なぜ勾配消失問題が起きるかを考察するために、一番シンプルな深層ニューラルネットワークを考えます。それは各層にただ1つニューロンをもつネットワークです。下図のネットワークは3つの隠れ層を持ちます。<center>  <img src="images/tikz37.png"/></center></p><p><!--Here, $w_1, w_2, \ldots$ are the weights, $b_1, b_2, \ldots$ are thebiases, and $C$ is some cost function.  Just to remind you how thisworks, the output $a_j$ from the $j$th neuron is $\sigma(z_j)$, where$\sigma$ is the usual <a href="chap1.html#sigmoid_neurons">sigmoid  activation function</a>, and $z_j = w_{j} a_{j-1}+b_j$ is the weightedinput to the neuron.  
-->Here, $w_1, w_2, \ldots$ are the weights, $b_1, b_2, \ldots$ are the biases, and $C$ is some cost function. Just to remind you how this works, the output $a_j$ from the $j$th neuron is $\sigma(z_j)$, where $\sigma$ is the usual <a href="chap1.html#sigmoid_neurons">sigmoid activation function</a>, and $z_j = w_{j} a_{j-1}+b_j$ is the weighted input to the neuron. I've drawn the cost $C$ at the end to emphasize that the cost is a function of the network's output, $a_4$: if the actual output from the network is close to the desired output, then the cost will be low, while if it's far away, the cost will be high.</p><p>We're going to study the gradient $\partial C / \partial b_1$ associated to the first hidden neuron. We'll figure out an expression for $\partial C / \partial b_1$, and by studying that expression we'll understand why the vanishing gradient problem occurs.</p><p>I'll start by simply showing you the expression for $\partial C / \partial b_1$. It looks forbidding, but it actually has a simple structure, which I'll describe in a moment. Here's the expression (ignore the network, for now, and note that $\sigma'$ is just the derivative of the $\sigma$ function):</p><p><center>  <img src="images/tikz38.png"/></center></p><p><!--
So the network itself is amnemonic for the expression.-->式の構造は次のようになっています。$\sigma'(z_j)$ の項が各ニューロンに相当する積の中に現れています。重み $w_j$ の項はネットワークの各重みに対応します。そして、最後の項 $\partial C / \partial a_4$ はコスト関数に対応します。上に挙げた項が、それぞれ対応するネットワーク部分の上に置かれていることを確認してください。つまり、ネットワークはそれ自体が、数式の疑似表現になっているのです。<!-- mnemonicを疑似表現と訳しました．なにか代案をください --></p><p><!--You're welcome to take this expression for granted, and skip to the<a href="#discussion_why">discussion of how it relates to the vanishing  gradient problem</a>.  There's no harm in doing this, since theexpression is a special case of our<a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">earlier  discussion of backpropagation</a>.  But there's also a simpleexplanation of why the expression is true, and so it's fun (andperhaps enlightening) to take a look at that explanation.-->この数式を当たり前にとらえて、<a href="#discussion_why">勾配消失問題とどう関係するのかの議論</a>へ進んでも構いません。そうして問題はありません。なぜならこの数式は、<a href="chap2.html#the_four_fundamental_equations_behind_backpropagation">先の逆伝播の議論</a>における特別な場合であるからです。でも、なぜこの数式が成り立つのかには簡潔な説明が可能であって、その説明を理解するのも面白い（ですし、加えてたぶん知恵もつくはず）です。</p><p><!--Imagine we make a small change $\Delta b_1$ in the bias $b_1$.  Thatwill set off a cascading series of changes in the rest of the network.First, it causes a change $\Delta a_1$ in the output from the firsthidden neuron.  That, in turn, will cause a change $\Delta z_2$ in theweighted input to the second hidden neuron.  Then a change $\Deltaa_2$ in the output from the second hidden neuron.  And so on, all theway through to a change $\Delta C$ in the cost at the output.  We have<a class="displaced_anchor" name="eqtn114"></a>\begin{eqnarray}  \frac{\partial C}{\partial b_1} \approx \frac{\Delta C}{\Delta b_1}.
\tag{114}\end{eqnarray}This suggests that we can figure out an expression for the gradient$\partial C / \partial b_1$ by carefully tracking the effect of eachstep in this cascade.-->バイアス $b_1$ の微小な変化 $\Delta b_1$ を想像してみてください。その変化が、残りのネットワークへカスケード的な変化を引き起こしていくでしょう。まず第一に、1個目の隠れ層のニューロンの出力の変化 $\Delta a_1$ を起こします。それが今度は、2個目の隠れ層への重み付き入力の変化 $\Delta z_2$ を起こします。さらに、2個目の隠れ層からの出力 $\Delta a_2$ も変化します。そうして、出力でのコスト $\Delta C$ まではるばる変化していくのです。<a class="displaced_anchor" name="eqtn114"></a>\begin{eqnarray}  \frac{\partial C}{\partial b_1} \approx \frac{\Delta C}{\Delta b_1}.\tag{114}\end{eqnarray}このことは、カスケードの各ステップの影響を注意深く辿っていくことで、最終的に $\partial C / \partial b_1$ がわかることを示しています。</p><p><!--To do this, let's think about how $\Delta b_1$ causes the output $a_1$from the first hidden neuron to change.  We have $a_1 = \sigma(z_1) =\sigma(w_1 a_0 + b_1)$, so<a class="displaced_anchor" name="eqtn115"></a><a class="displaced_anchor" name="eqtn116"></a>\begin{eqnarray}  \Delta a_1 & \approx &  \frac{\partial \sigma(w_1 a_0+b_1)}{\partial b_1} \Delta b_1 \tag{115}\\  & = & \sigma'(z_1) \Delta b_1.\tag{116}\end{eqnarray}-->これを行うために、 $\Delta b_1$ がどのように1個目の隠れ層の出力 $a_1$ を変化させるのか考えてみましょう。$a_1 = \sigma(z_1) =\sigma(w_1 a_0 + b_1)$ という式があるので、以下のようになります。<a class="displaced_anchor" name="eqtn115"></a><a class="displaced_anchor" name="eqtn116"></a>\begin{eqnarray}  \Delta a_1 & \approx &  \frac{\partial \sigma(w_1 a_0+b_1)}{\partial b_1} \Delta b_1 \tag{115}\\  & = & \sigma'(z_1) \Delta b_1.\tag{116}\end{eqnarray}<!--That $\sigma'(z_1)$ term should look familiar: it's the first term inour claimed expression for the gradient $\partial C / \partial b_1$.Intuitively, this term converts a change $\Delta b_1$ in the bias intoa change $\Delta a_1$ in the output activation.  
That change $\Deltaa_1$ in turn causes a change in the weighted input $z_2 = w_2 a_1 +b_2$ to the second hidden neuron:<a class="displaced_anchor" name="eqtn117"></a><a class="displaced_anchor" name="eqtn118"></a>\begin{eqnarray}  \Delta z_2 & \approx &  \frac{\partial z_2}{\partial a_1} \Delta a_1 \tag{117}\\  & = & w_2 \Delta a_1.\tag{118}\end{eqnarray}-->$\sigma'(z_1)$ の項は馴染みがあるはずです。勾配 $\partial C / \partial b_1$ の式中の最初の項でした。直感的には、この項はバイアスの変化 $\Delta b_1$ を出力の活性の変化 $\Delta a_1$ へと変換します。$\Delta a_1$ の変化は、今度は重みの付与された入力 $z_2 = w_2 a_1 +b_2$ として2個目の隠れ層へ伝わります。<a class="displaced_anchor" name="eqtn117"></a><a class="displaced_anchor" name="eqtn118"></a>\begin{eqnarray}  \Delta z_2 & \approx &  \frac{\partial z_2}{\partial a_1} \Delta a_1 \tag{117}\\  & = & w_2 \Delta a_1.\tag{118}\end{eqnarray}<!--Combining our expressions for $\Delta z_2$ and $\Delta a_1$, we seehow the change in the bias $b_1$ propagates along the network toaffect $z_2$:<a class="displaced_anchor" name="eqtn119"></a>\begin{eqnarray}\Delta z_2 & \approx & \sigma'(z_1) w_2 \Delta b_1.\tag{119}\end{eqnarray}Again, that should look familiar: we've now got the first two terms inour claimed expression for the gradient $\partial C / \partial b_1$.-->$\Delta z_2$ と $\Delta a_1$ の式を組み合わせると、バイアス $b_1$ の変化がネットワークをどう伝わって $z_2$ へ影響を及ぼすのかがわかります。<a class="displaced_anchor" name="eqtn119"></a>\begin{eqnarray}\Delta z_2 & \approx & \sigma'(z_1) w_2 \Delta b_1.\tag{119}\end{eqnarray}再び馴染みのある式が現れました。勾配 $\partial C / \partial b_1$ の式の最初の2つの項です。</p><p><!--We can keep going in this fashion, tracking the way changes propagatethrough the rest of the network.  At each neuron we pick up a$\sigma'(z_j)$ term, and through each weight we pick up a $w_j$ term.
-->We can keep going in this fashion, tracking the way changes propagate through the rest of the network. At each neuron we pick up a $\sigma'(z_j)$ term, and through each weight we pick up a $w_j$ term. The end result is an expression relating the final change $\Delta C$ in cost to the initial change $\Delta b_1$ in the bias:<a class="displaced_anchor" name="eqtn120"></a>\begin{eqnarray}\Delta C & \approx & \sigma'(z_1) w_2 \sigma'(z_2) \ldots \sigma'(z_4)\frac{\partial C}{\partial a_4} \Delta b_1.\tag{120}\end{eqnarray}Dividing by $\Delta b_1$ and letting $\Delta b_1 \to 0$, we do indeed get the desired expression for the gradient:<a class="displaced_anchor" name="eqtn121"></a>\begin{eqnarray}\frac{\partial C}{\partial b_1} = \sigma'(z_1) w_2 \sigma'(z_2) \ldots\sigma'(z_4) \frac{\partial C}{\partial a_4}.\tag{121}\end{eqnarray}</p><p><a id="discussion_why"></a></p><p><!--
--><strong>Why the vanishing gradient problem occurs:</strong> To understand why the vanishing gradient problem occurs, let's explicitly write out the entire expression for the gradient:<a class="displaced_anchor" name="eqtn122"></a>\begin{eqnarray}\frac{\partial C}{\partial b_1} = \sigma'(z_1) \, w_2 \sigma'(z_2) \, w_3 \sigma'(z_3) \, w_4 \sigma'(z_4) \, \frac{\partial C}{\partial a_4}.\tag{122}\end{eqnarray}Apart from the leading $\sigma'(z_1)$ and the very last term, this expression is a product of terms of the form $w_j \sigma'(z_j)$. To understand how each of those terms behaves, let's look at a plot of the function $\sigma'$:<div  id="sigmoid_prime_graph"><a  name="sigmoid_prime_graph"></a></div><script type="text/javascript"  src="js/d3.v3.min.js"></script></p><p><!--
-->The derivative reaches a maximum at $\sigma'(0) = 1/4$. Now, if we use our <a href="chap3.html#weight_initialization">standard approach</a> to initializing the weights in the network, then we'll choose the weights using a Gaussian with mean $0$ and standard deviation $1$. So the weights will usually satisfy $|w_j| < 1$. Putting these observations together, we see that the terms $w_j \sigma'(z_j)$ will usually satisfy $|w_j \sigma'(z_j)| < 1/4$. And when we take a product of many such terms, the product will tend to exponentially decrease: the more terms, the smaller the product will be. This is starting to smell like a possible explanation for the vanishing gradient problem.</p><p>To make this all a bit more explicit, let's compare the expression for $\partial C / \partial b_1$ to an expression for the gradient with respect to a later bias, say $\partial C / \partial b_3$. Of course, we haven't explicitly worked out an expression for $\partial C / \partial b_3$, but it follows the same pattern described above for $\partial C / \partial b_1$. Here's the comparison of the two expressions:<center>  <img src="images/tikz39.png"/></center>The two expressions share many terms. But the gradient $\partial C / \partial b_1$ includes two extra terms each of the form $w_j \sigma'(z_j)$. As we've seen, such terms are typically less than $1/4$ in magnitude. And so the gradient $\partial C / \partial b_1$ will usually be a factor of $16$ (or more) smaller than $\partial C / \partial b_3$. This is the essential origin of the vanishing gradient problem.</p><p><!--
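-->The comparison between $\partial C / \partial b_1$ and $\partial C / \partial b_3$ can be checked numerically on the one-neuron-per-layer chain. The sketch below is only an illustration: the quadratic cost $C = (a_4 - y)^2 / 2$, the particular weights and biases, and all function names are assumptions chosen for the example. It evaluates the product formula for both gradients and compares the first against a finite-difference estimate.</p>

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def chain_forward(weights, biases, a0):
    """Weighted inputs z_1..z_n and activations a_0..a_n of the chain."""
    zs, acts = [], [a0]
    for w, b in zip(weights, biases):
        zs.append(w * acts[-1] + b)
        acts.append(sigmoid(zs[-1]))
    return zs, acts


def bias_gradient(weights, biases, a0, target, j):
    """dC/db_j (j counted from 1) for the assumed cost 0.5 * (a_n - target)^2."""
    zs, acts = chain_forward(weights, biases, a0)
    grad = acts[-1] - target  # dC/da_n
    # Move backward from the output to neuron j, picking up one
    # w_k * sigma'(z_k) factor for every neuron k after neuron j.
    for k in range(len(zs), j, -1):
        grad *= weights[k - 1] * sigmoid_prime(zs[k - 1])
    return grad * sigmoid_prime(zs[j - 1])


def numerical_bias_gradient(weights, biases, a0, target, j, eps=1e-6):
    """Central-difference estimate of dC/db_j, used only as a sanity check."""
    def cost(bs):
        return 0.5 * (chain_forward(weights, bs, a0)[1][-1] - target) ** 2
    up = list(biases)
    down = list(biases)
    up[j - 1] += eps
    down[j - 1] -= eps
    return (cost(up) - cost(down)) / (2 * eps)


weights = [0.8, -0.6, 0.9, 0.7]  # hypothetical w_1..w_4, all below 1 in magnitude
biases = [0.1, -0.2, 0.3, 0.0]   # hypothetical b_1..b_4
a0, target = 0.5, 1.0

g1 = bias_gradient(weights, biases, a0, target, 1)
g3 = bias_gradient(weights, biases, a0, target, 3)
print("dC/db_1 =", g1)
print("finite-difference check =", numerical_bias_gradient(weights, biases, a0, target, 1))
print("dC/db_3 =", g3)
print("|dC/db_3| / |dC/db_1| =", abs(g3) / abs(g1))
```

<p>Because the two gradients differ exactly by the factor $\sigma'(z_1) \, w_2 \sigma'(z_2) \, w_3$ and each $\sigma'$ is at most $1/4$, the printed ratio must be at least $16 / (0.6 \times 0.9) \approx 29.6$ for these weights, whatever the biases are.</p><p><!--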
-->Of course, this is an informal argument, not a rigorous proof that the vanishing gradient problem will occur. There are several possible escape clauses. In particular, we might wonder whether the weights $w_j$ could grow during training. If they do, it's possible the terms $w_j \sigma'(z_j)$ in the product will no longer satisfy $|w_j \sigma'(z_j)| < 1/4$. Indeed, if the terms get large enough, greater than $1$ in magnitude, then we will no longer have a vanishing gradient problem. Instead, the gradient will actually grow exponentially as we move backward through the layers. Instead of a vanishing gradient problem, we'll have an exploding gradient problem.</p><p><strong>The exploding gradient problem:</strong> Let's look at an explicit example where exploding gradients occur. The example is somewhat contrived: I'm going to fix parameters in the network in just the right way to ensure we get an exploding gradient. But even though the example is contrived, it has the virtue of firmly establishing that exploding gradients aren't merely a hypothetical possibility; they really can happen.</p><p><!--
With thesechoices we get an exploding gradient.-->勾配爆発を引き起こすまでには、ステップが2つあります。1つ目は、ネットワークの中の全ての重みを大きくしておくことです。たとえば、 $w_1 = w_2 = w_3 = w_4 = 100$ とします。2つ目のステップは、$\sigma'(z_j)$ の項があまり小さくなりすぎないように、バイアスを選ぶことです。実際の操作は簡単です。私たちが行うべきのは、各ニューロンへの重み付けされた入力が $z_j = 0$ を、(そして $\sigma'(z_j) = 1/4$ を) 満たすようにすることです。したがって、たとえば、$z_1 = w_1 a_0 + b_1 = 0$ を得たいとしましょう。私たちは $b_1 = -100 * a_0$ と設定することで上記を実現できます。同じアイデアを、他のバイアスを選ぶときにも使えます。このとき、全ての $w_j \sigma'(z_j)$ の項は $100 * \frac{1}{4} = 25$ に等しいことを確認しておきます。これらの選択により、勾配爆発を起こせるのです。</p><p><!--<strong>The unstable gradient problem:</strong> The fundamental problem hereisn't so much the vanishing gradient problem or the exploding gradientproblem.  It's that the gradient in early layers is the product ofterms from all the later layers.  When there are many layers, that'san intrinsically unstable situation.  The only way all layers canlearn at close to the same speed is if all those products of termscome close to balancing out.  Without some mechanism or underlyingreason for that balancing to occur, it's highly unlikely to happensimply by chance.  In short, the real problem here is that neuralnetworks suffer from an <em>unstable gradient problem</em>.  As aresult, if we use standard gradient-based learning techniques,different layers in the network will tend to learn at wildly differentspeeds.--><strong>不安定勾配問題：</strong>ここでの根本的な問題は、勾配消失問題でも、勾配爆発問題でもありません。前方の層の勾配が、それ以降の層の勾配の積となっていることが問題なのです。多層の場合、本質的に不安定な状況となります。全部の層がほぼ同じ速度で学ぶ唯一の方法は、それらの項の積をほぼバランスさせることです。バランスさせるための仕組みや要因が何もないとき、偶然には上手く行きようがありません。つまり本当の問題は、ニューラルネットワークが <em>不安定勾配問題</em> を伴っていることです。結果的に、勾配を利用する標準の学習テクニックを使うときには、ネットワーク中の異なる層は、かなり異なる速度で学習する傾向があります。</p><p><!--<h4><a name="exercise_255808"></a><a href="#exercise_255808">Exercise</a></h4><ul><li> In our discussion of the vanishing gradient problem, we made use  of the fact that $|\sigma'(z)| < 1/4$.  Suppose we used a different  activation function, one whose derivative could be much larger. 
--><h4><a name="exercise_255808"></a><a href="#exercise_255808">Exercise</a></h4><ul><li> In our discussion of the vanishing gradient problem, we made use of the fact that $|\sigma'(z)| \leq 1/4$. Suppose we used a different activation function, one whose derivative could be much larger. Would that help us avoid the unstable gradient problem?</ul></p><p><!--
--><strong>The prevalence of the vanishing gradient problem:</strong> We've seen that the gradient can either vanish or explode in the early layers of a deep network. In fact, when using sigmoid neurons the gradient will usually vanish. To see why, consider again the expression $|w \sigma'(z)|$. To avoid the vanishing gradient problem we need $|w \sigma'(z)| \geq 1$. You might think this could happen easily if $w$ is very large. However, it's more difficult than it looks. The reason is that the $\sigma'(z)$ term also depends on $w$: $\sigma'(z) = \sigma'(wa +b)$, where $a$ is the input activation. So when we make $w$ large, we need to be careful that we're not simultaneously making $\sigma'(wa+b)$ small. That turns out to be a considerable constraint. The reason is that when we make $w$ large we tend to make $wa+b$ very large in magnitude. Looking at the graph of $\sigma'$ you can see that this puts us off in the "wings" of the $\sigma'$ function, where it takes very small values. The only way to avoid this is if the input activation falls within a fairly narrow range of values (this qualitative explanation is made quantitative in the first problem below). Sometimes that will chance to happen. More often, though, it does not happen. And so in the generic case we have vanishing gradients.</p><p><!--
--><h4><a name="problems_778071"></a><a href="#problems_778071">Problems</a></h4><ul><li> Consider the product $|w \sigma'(wa+b)|$. Suppose $|w \sigma'(wa+b)| \geq 1$.  (1) Argue that this can only ever occur if $|w| \geq 4$.  (2) Supposing that $|w| \geq 4$, consider the set of input activations $a$ for which $|w \sigma'(wa+b)| \geq 1$. Show that the set of $a$ satisfying that constraint can range over an interval no greater in width than <a class="displaced_anchor" name="eqtn123"></a>\begin{eqnarray}    \frac{2}{|w|} \ln\left( \frac{|w|(1+\sqrt{1-4/|w|})}{2}-1\right).  \tag{123}\end{eqnarray}  (3) Show numerically that the above expression bounding the width of the range is greatest at $|w| \approx 6.9$, where it takes a value $\approx 0.45$. And so even given that everything lines up just perfectly, we still have a fairly narrow range of input activations which can avoid the vanishing gradient problem.</p><p><!--
--><li><strong>Identity neuron:</strong> <a id="identity_neuron"></a> Consider a neuron with a single input, $x$, a corresponding weight, $w_1$, a bias $b$, and a weight $w_2$ on the output. Show that by choosing the weights and bias appropriately, we can ensure $w_2 \sigma(w_1 x+b) \approx x$ for $x \in [0, 1]$. Such a neuron can thus be used as a kind of identity neuron, that is, a neuron whose output is the same (up to rescaling by a weight factor) as its input. <em>Hint: It helps to rewrite $x = 1/2+\Delta$, to assume $w_1$ is small, and to use a Taylor series expansion in $w_1 \Delta$.</em></ul></p><p><h3><a name="unstable_gradients_in_more_complex_networks"></a><a href="#unstable_gradients_in_more_complex_networks">Unstable gradients in more complex networks</a></h3></p><p>We've been studying toy networks, with just one neuron in each hidden layer. What about more complex deep networks, with many neurons in each hidden layer?</p><p><center><img src="images/tikz40.png"/></center></p><p>In fact, much the same behaviour occurs in such networks. In the earlier chapter on backpropagation we saw that the gradient in the $l$th layer of an $L$ layer network <a href="chap2.html#alternative_backprop">is given by</a>:</p><p><a class="displaced_anchor" name="eqtn124"></a>\begin{eqnarray}  \delta^l = \Sigma'(z^l) (w^{l+1})^T \Sigma'(z^{l+1}) (w^{l+2})^T \ldots  \Sigma'(z^L) \nabla_a C\tag{124}\end{eqnarray}</p><p><!--
And $\nabla_aC$ is the vector of partial derivatives of $C$ with respect to theoutput activations.-->ここで、 $\Sigma'(z^l)$ は対角行列で、各成分は、$l$ 層目に対する重み付き入力への $\sigma'(z)$ となります。$w^l$ は異なる層に対する重み行列です。$\nabla_a C$ は活性化された出力に対する $C$ の偏微分ベクトルです。</p><p><!--This is a much more complicated expression than in the single-neuroncase.  Still, if you look closely, the essential form is very similar,with lots of pairs of the form $(w^j)^T \Sigma'(z^j)$.  What's more,the matrices $\Sigma'(z^j)$ have small entries on the diagonal, nonelarger than $\frac{1}{4}$. Provided the weight matrices $w^j$ aren'ttoo large, each additional term $(w^j)^T \Sigma'(z^l)$ tends to makethe gradient vector smaller, leading to a vanishing gradient.  Moregenerally, the large number of terms in the product tends to lead toan unstable gradient, just as in our earlier example.  In practice,empirically it is typically found in sigmoid networks that gradientsvanish exponentially quickly in earlier layers.  As a result, learningslows down in those layers.  This slowdown isn't merely an accident oran inconvenience: it's a fundamental consequence of the approach we'retaking to learning.-->これは各層に1ニューロンの場合よりも遥かに複雑です。でも、よく見ると、本質的な形状はとてもよく似ています。$(w^j)^T \Sigma'(z^j)$ の項の組がたくさんあります。加えて、行列 $\Sigma'(z^j)$ の対角成分は $\frac{1}{4}$ よりも小さくなります.
もし、重み行列 $w^j$ がそこまで大きくない場合、 $(w^j)^T \Sigma'(z^l)$ の各項は勾配ベクトルを小さくする働きがあり、勾配消失問題が発生します。一般的に言うと、先の例のように積の中に項が多くなると、勾配が不安定になります。経験的には、シグモイドのネットワークにおいて、前方の層で勾配が急速に消失する現象が顕著に見られます。結果的に、前方の層での学習は遅くなります。学習が遅いことは、偶然でも単なる不都合でもありません。学習のために私たちが採用した方法のもつ根本的な問題なのです。</p><p><!--  <h3><a name="other_obstacles_to_deep_learning"></a><a href="#other_obstacles_to_deep_learning">Other obstacles to deep learning</a></h3>-->  <h3><a name="other_obstacles_to_deep_learning"></a><a href="#other_obstacles_to_deep_learning">ディープラーニングに関する他の問題</a></h3></p><p><!--In this chapter we've focused on vanishing gradients - and, moregenerally, unstable gradients - as an obstacle to deep learning.  Infact, unstable gradients are just one obstacle to deep learning,albeit an important fundamental obstacle.  Much ongoing research aimsto better understand the challenges that can occur when training deepnetworks.  I won't comprehensively summarize that work here, but justwant to briefly mention a couple of papers, to give you the flavour ofsome of the questions people are asking.-->この章では、ディープラーニングの問題として勾配消失（や、もっと一般的には勾配の不安定性）に焦点を当ててきました。勾配の不安定性は重要で本質的な問題ではありますが、実際にはディープラーニングの一つの問題に過ぎません。現在も多くの研究が、深層ネットワークの訓練時に発生する難問を理解しようと試みています。ここで包括的にまとめることはしませんが、いくつかの論文を簡単に紹介します。人々が解決したいと思っている問題の雰囲気をあなたに届けたいと思います。</p><p><!--As a first example, in 2010 Glorot andBengio*<span class="marginnote">*<a href="http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf">Understanding    the difficulty of training deep feedforward neural networks</a>, by  Xavier Glorot and Yoshua Bengio (2010). See also the earlier  discussion of the use of sigmoids in  <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">Efficient    BackProp</a>, by Yann LeCun, Léon Bottou,  Genevieve Orr and Klaus-Robert Müller (1998).</span>found evidence suggesting that the use of sigmoid activation functionscan cause problems training deep networks.  
-->As a first example, in 2010 Glorot and Bengio*<span class="marginnote">*<a href="http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf">Understanding the difficulty of training deep feedforward neural networks</a>, by Xavier Glorot and Yoshua Bengio (2010). See also the earlier discussion of the use of sigmoids in <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">Efficient BackProp</a>, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).</span> found evidence suggesting that the use of sigmoid activation functions can cause problems training deep networks. In particular, they found evidence that the use of sigmoids will cause the activations in the final hidden layer to saturate near $0$ early in training, substantially slowing down learning. They suggested some alternative activation functions, which appear not to suffer as much from this saturation problem.</p><p><!--
-->As a second example, in 2013 Sutskever, Martens, Dahl and Hinton*<span class="marginnote">*<a href="http://www.cs.toronto.edu/&#126;hinton/absps/momentum.pdf">On the importance of initialization and momentum in deep learning</a>, by Ilya Sutskever, James Martens, George Dahl and Geoffrey Hinton (2013).</span> studied the impact on deep learning of both the random weight initialization and the momentum schedule in momentum-based stochastic gradient descent. In both cases, making good choices made a substantial difference in the ability to train deep networks.</p><p><!--
-->These examples suggest that "What makes deep networks hard to train?" is a complex question. In this chapter, we've focused on the instabilities associated with gradient-based learning in deep networks. The results in the last two paragraphs suggest that there is also a role played by the choice of activation function, the way weights are initialized, and even details of how learning by gradient descent is implemented. And, of course, choice of network architecture and other hyper-parameters is also important. Thus, many factors can play a role in making deep networks hard to train, and understanding all those factors is still a subject of ongoing research. This all seems rather downbeat and pessimism-inducing. But the good news is that in the next chapter we'll turn that around, and develop several approaches to deep learning that to some extent manage to overcome or route around all these challenges.</p><p><script src="js/misc.js" type="text/javascript"></script><script src="js/canvas.js" type="text/javascript"></script><script src="js/neuron.js" type="text/javascript"></script><script src="js/chap5.js" type="text/javascript"></script></p><p></div><div class="footer"> <span class="left_footer"> In academic work,
please cite this book as: Michael A. Nielsen, "Neural Networks and
Deep Learning", Determination Press, 2015

<br/>
<br/>

This work is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"
style="color: #eee;">Creative Commons Attribution-NonCommercial 3.0
Unported License</a>.  This means you're free to copy, share, and
build on this book, but not to sell it.  If you're interested in
commercial use, please <a
href="mailto:mn@michaelnielsen.org">contact me</a>.
</span>
<span class="right_footer">
Last update: Sun Dec 21 14:49:05 2014
<br/>
<br/>
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"><img alt="Creative Commons Licence" style="border-width:0" src="http://i.creativecommons.org/l/by-nc/3.0/88x31.png" /></a>
</span>
</div>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-44208967-1', 'neuralnetworksanddeeplearning.com');
  ga('send', 'pageview');

</script>
</body>
</html>