diff --git a/Use-Caffe.ipynb b/Use-Caffe.ipynb new file mode 100644 index 0000000..1b5ba37 --- /dev/null +++ b/Use-Caffe.ipynb @@ -0,0 +1,575 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Use Caffe On TH-2 DEMO\n", + "\n", + "> 演示如何在天河上使用Caffe 2016/5/20 by : lijiang\n", + "\n", + "*Caffe 是目前流行的深度学习框架,前几天才在天河上进行了部署,今天因有用户询问使用方法做此说明文档。*\n", + "\n", + "*Caffe 目前的代码没有针对 CPU 进行优化,更适合使用 GPU 来进行计算。但有客户在对CAFFE 进行 CPU 的优化,以后会在TH-2 上进行发布 *" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [] + } + ], + "source": [ + "alias module='source ~/.module.sh'\n", + "export MODULEPATH=/NSFCGZ/app/modulefiles" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> 这两行代码只是为了我配置演示环境,一般用户并不需要" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 加载环境\n", + "使用 module load 加载 caffe 环境 ; 首先,在不了解系统时可以通过 module avail 查看有哪些可用版本 " + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\r\n", + "--------------------------- /NSFCGZ/app/modulefiles ----------------------------\r\n", + "caffe/v20160510-cpu3 caffe/v20160511-gpu-icc\r\n" + ] + } + ], + "source": [ + "module avail caffe" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "这里可以看到,在 基金用户的分区 (/NSFCGZ) 目前部署了两个版本的 caffe , 分别对应使用 CPU 和 GPU .本例演示使用CPU 的版本。\n", + "\n", + "> 如果是在前面的WORK 节点,可能能看到更多的, 但有一些并不推荐使用。\n", + "\n", + "使用module load 加载环境." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/NSFCGZ/app/caffe/v20160510-cpu3/bin/caffe\r\n" + ] + } + ], + "source": [ + "module load caffe/v20160510-cpu3\n", + "which caffe " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "这里可以看到,caffe 命令及其环境已经成功加载了。" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 数据处理\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bin caffe-master include lib python share\r\n" + ] + } + ], + "source": [ + "ls /NSFCGZ/app/caffe/v20160510-cpu3" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "用户可以自己上传 caffe-master 文件夹,也可以直接从 /NSFCGZ/app/caffe/v20160510-cpu3/caffe-master 复制一份" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CMakeLists.txt\t Makefile\t\t caffe.cloc examples\tscripts\r\n", + "CONTRIBUTING.md Makefile.config\t cmake include\tsrc\r\n", + "CONTRIBUTORS.md Makefile.config.example data\t matlab\ttools\r\n", + "INSTALL.md\t README.md\t\t docker models\r\n", + "LICENSE\t\t build\t\t\t docs\t python\r\n" + ] + } + ], + "source": [ + "cp -r /NSFCGZ/app/caffe/v20160510-cpu3/caffe-master ~/.\n", + "cd ~/caffe-master \n", + "ls" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "我原来已经下载了 算例 cifar10 的数据; 但为了完整演示,这里使用算例 mnist" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "get_mnist.sh\r\n" + ] + } + ], + "source": [ + "ls data/mnist " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "可以看到: 这个文件夹里面只有一个脚本文件;如果按在自己电脑上的方法去执行则会出错:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Downloading...\r\n", + "--2016-05-20 15:58:24-- http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz\r\n", + "Resolving yann.lecun.com... failed: Temporary failure in name resolution.\r\n", + "wget: unable to resolve host address `yann.lecun.com'\r\n", + "gzip: train-images-idx3-ubyte.gz: No such file or directory\r\n", + "--2016-05-20 15:58:24-- http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz\r\n", + "Resolving yann.lecun.com... failed: Temporary failure in name resolution.\r\n", + "wget: unable to resolve host address `yann.lecun.com'\r\n", + "gzip: train-labels-idx1-ubyte.gz: No such file or directory\r\n", + "--2016-05-20 15:58:24-- http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz\r\n", + "Resolving yann.lecun.com... failed: Temporary failure in name resolution.\r\n", + "wget: unable to resolve host address `yann.lecun.com'\r\n", + "gzip: t10k-images-idx3-ubyte.gz: No such file or directory\r\n", + "--2016-05-20 15:58:24-- http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz\r\n", + "Resolving yann.lecun.com... failed: Temporary failure in name resolution.\r\n", + "wget: unable to resolve host address `yann.lecun.com'\r\n", + "gzip: t10k-labels-idx1-ubyte.gz: No such file or directory\r\n" + ] + } + ], + "source": [ + "data/mnist/get_mnist.sh" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "原因很简单,机器是不联网的。。。我们需要手工将压缩包或者数据文件放过来。\n", + "这里可以从 /NSFCGZ/app/share 拷贝一份过来。" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [] + } + ], + "source": [ + "cp /NSFCGZ/app/share/mnist_data.tar.gz data/mnist/." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "然后修改一下 data/mnist/get_mnist.sh 这个脚本 \n", + "(这里没法演示修改的过程;大家自己修改 , 这里用 diff 查看修改前后的对比 , 后面如有文件修改同此):" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "8c8\r\n", + "< \r\n", + "---\r\n", + "> tar xvf *.tar.gz\r\n", + "12c12\r\n", + "< wget --no-check-certificate http://yann.lecun.com/exdb/mnist/${fname}.gz\r\n", + "---\r\n", + "> #wget --no-check-certificate http://yann.lecun.com/exdb/mnist/${fname}.gz\r\n" + ] + } + ], + "source": [ + "diff data/mnist/get_mnist.sh data/mnist/get_mnist-th.sh " + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Downloading...\r\n", + "train-images-idx3-ubyte.gz\r\n", + "train-labels-idx1-ubyte.gz\r\n", + "t10k-images-idx3-ubyte.gz\r\n", + "t10k-labels-idx1-ubyte.gz\r\n", + "get_mnist-th.sh mnist_data.tar.gz\t train-images-idx3-ubyte\r\n", + "get_mnist.sh\t t10k-images-idx3-ubyte train-labels-idx1-ubyte\r\n", + "get_mnist.sh-org t10k-labels-idx1-ubyte\r\n" + ] + } + ], + "source": [ + "data/mnist/get_mnist-th.sh \n", + "ls data/mnist/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "执行完脚本后可以发现数据已经就绪了。进行下一步操作。" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Creating lmdb...\r\n", + "examples/mnist/create_mnist.sh: line 16: build/examples/mnist/convert_mnist_data.bin: Permission denied\r\n", + "examples/mnist/create_mnist.sh: line 18: build/examples/mnist/convert_mnist_data.bin: Permission denied\r\n", + "Done.\r\n" + ] + } + ], + "source": [ + "examples/mnist/create_mnist.sh" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "这里执行会出错,同样需要稍微修改下文件 :\n" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "16c16,17\r\n", + "< $BUILD/convert_mnist_data.bin $DATA/train-images-idx3-ubyte \\\r\n", + "---\r\n", + "> #$BUILD/\r\n", + "> convert_mnist_data $DATA/train-images-idx3-ubyte \\\r\n", + "18c19,20\r\n", + "< $BUILD/convert_mnist_data.bin $DATA/t10k-images-idx3-ubyte \\\r\n", + "---\r\n", + "> #$BUILD/\r\n", + "> convert_mnist_data $DATA/t10k-images-idx3-ubyte \\\r\n" + ] + } + ], + "source": [ + "diff examples/mnist/create_mnist.sh examples/mnist/create_mnist_th.sh " + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Creating lmdb...\r\n", + "I0520 16:10:11.905948 10730 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_train_lmdb\r\n", + "I0520 16:10:11.914786 10730 convert_mnist_data.cpp:88] A total of 60000 items.\r\n", + "I0520 16:10:11.914836 10730 convert_mnist_data.cpp:89] Rows: 28 Cols: 28\r\n", + "I0520 16:10:13.172996 10730 convert_mnist_data.cpp:108] Processed 60000 files.\r\n", + "I0520 16:10:13.472939 10737 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_test_lmdb\r\n", + "I0520 16:10:13.474231 10737 convert_mnist_data.cpp:88] A total of 10000 items.\r\n", + "I0520 16:10:13.474262 10737 convert_mnist_data.cpp:89] Rows: 28 Cols: 28\r\n", + "I0520 16:10:13.711577 10737 convert_mnist_data.cpp:108] Processed 10000 files.\r\n", + "Done.\r\n" + ] + } + ], + "source": [ + "examples/mnist/create_mnist_th.sh" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "执行修改后的文件,一切顺利" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 计算\n", + "\n", + "本例要演示的计算文件为 examples/mnist/train_lenet.sh ; 同样需要略作修改 :" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "3c3,4\r\n", + "< ./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt\r\n", + "---\r\n", + "> #./build/tools/\r\n", + "> yhrun -p nsfc2 -n 1 caffe train --solver=examples/mnist/lenet_solver.prototxt &> examples/mnist/lenet.log\r\n" + ] + } + ], + "source": [ + "diff examples/mnist/train_lenet.sh examples/mnist/train_lenet_th.sh " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "这个文件的修改是为了使用 SLURM 作业系统,通过 yhrun 来提交计算任务,并且创建了日志文件 examples/mnist/lenet.log 。\n", + "\n", + "**另外, 由于是使用 CPU 来计算 , 需要修改 examples/mnist/lenet_solver.prototxt 中的solver_mode: GPU 为 solver_mode: CPU; 这里不再展示。**\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Submitted batch job 383308\r\n" + ] + } + ], + "source": [ + "yhbatch -p nsfc2 examples/mnist/train_lenet_th.sh " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "这里可以直接执行此脚本,不过最好还是用yhbatch 提交作业" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)\r\n", + " 383308 nsfc2 train_le nscc-gz_jian R 0:17 1 cn8035\r\n" + ] + } + ], + "source": [ + "yhqueue" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "作业已经在运行了,可以用tail 查看log (加上 -f 参数可以连续查看,但演示环境不允许)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "I0520 16:29:50.440232 19425 sgd_solver.cpp:106] Iteration 1400, lr = 0.00906403\r\n", + "I0520 16:29:55.268719 19425 solver.cpp:337] Iteration 1500, Testing net (#0)\r\n", + "I0520 16:29:58.517141 19425 solver.cpp:404] Test net output #0: accuracy = 0.9833\r\n", + "I0520 16:29:58.517194 19425 solver.cpp:404] Test net output #1: loss = 0.0494627 (* 1 = 0.0494627 loss)\r\n", + "I0520 16:29:58.576706 19425 solver.cpp:228] Iteration 1500, loss = 0.0795374\r\n", + "I0520 16:29:58.576748 19425 solver.cpp:244] Train net output #0: loss = 0.0795374 (* 1 = 0.0795374 loss)\r\n", + "I0520 16:29:58.576759 19425 sgd_solver.cpp:106] Iteration 1500, lr = 0.00900485\r\n", + "I0520 16:30:03.441340 19425 solver.cpp:228] Iteration 1600, loss = 0.115878\r\n", + "I0520 16:30:03.441392 19425 solver.cpp:244] Train net output #0: loss = 0.115878 (* 1 = 0.115878 loss)\r\n", + "I0520 16:30:03.441401 19425 sgd_solver.cpp:106] Iteration 1600, lr = 0.00894657\r\n" + ] + } + ], + "source": [ + "tail examples/mnist/lenet.log " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "运算一切正常,本演示结束 。 \n", + "\n", + "笔者目前对ML/DL 颇感兴趣 , 欢迎合作交流。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Bash", + "language": "bash", + "name": "bash" + }, + "language_info": { + "codemirror_mode": "shell", + "file_extension": ".sh", + "mimetype": "text/x-sh", + "name": "bash" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}