diff --git a/2019-spring/Assignment-02.ipynb b/2019-spring/Assignment-02.ipynb new file mode 100644 index 0000000..d50138b --- /dev/null +++ b/2019-spring/Assignment-02.ipynb @@ -0,0 +1,335 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Assignment-02, Probability Model, A First Look: An Introduction to Language Models" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Assignment\n", + "\n", + "1. Review the course's online programming code; \n", + "2. Review the main questions; \n", + "3. Use the Wikipedia corpus to build a language model. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Review the course's online programming code. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*In this part, you should re-code the programming task in our online course.*\n", + "\n", + "> \n", + "> \n", + "\n", + "> \n", + "> \n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Review the main points of this lesson. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 1. How do we use GitHub, and why do we use Jupyter and PyCharm? " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ans: {*Put your answer here*}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 2. What is a probability model?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ans:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 3. Can you come up with some scenarios in which we could use a probability model?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ans:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 4. Why do we use probability, and what are the difficulties of programming based on parsing and pattern matching? 
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ans:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 5. What is a language model?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ans: " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 6. Can you come up with some scenarios in which we could use a language model?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ans:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 7. What is the 1-gram language model?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ans:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 8. What are the advantages and disadvantages of the 1-gram language model?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ans:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 9. What is the 2-gram model? " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ans:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 10. What is a web crawler, and can you implement a simple crawler? " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ans:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 11. Some issues may make programming our crawler difficult. What are they, and how do we solve them?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ans:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 12. What is a regular expression, and how do you use one?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ans:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. 
Use the Wikipedia dataset to build the language model. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Step 1: Download the corpus from Wikipedia:\n", + "> https://dumps.wikimedia.org/zhwiki/20190401/\n", + "\n", + "Step 2: You may need the help of wiki-extractor:\n", + "\n", + "> https://github.com/attardi/wikiextractor\n", + "\n", + "Step 3: Use the techniques and methods you have learned to build the language model. \n", + "> \n", + "\n", + "Step 4: Try some interesting sentence pairs, and check whether your model fits them\n", + "\n", + "> \n", + "\n", + "Step 5: If we need to solve the following problems, how can a language model help us? \n", + "\n", + "+ Speech recognition.\n", + "+ Sogou *pinyin* input.\n", + "+ Auto-correction in search engines. \n", + "+ Anomaly detection." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Compared with the parsing and pattern-matching approaches learned previously, what are the advantages and disadvantages of probability-based methods? " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ans: " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## (Optional) How to solve the *OOV* problem?\n", + "\n", + "Some words may not be in our dictionary or corpus. When we use a language model, we need to overcome this `out-of-vocabulary` (OOV) problem. Many researchers have worked on solving it. 
\n", + "\n", + "-- \n", + "\n", + "The first question is: \n", + "\n", + "**Q1: How did you solve this problem in your programming task?**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ans: " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then, the second question is: " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Q2: Read about the 'Good-Turing estimator', explain the main points of this method, and, if you wish, implement it in your programming task**\n", + "\n", + "Reference: \n", + "+ https://www.wikiwand.com/en/Good%E2%80%93Turing_frequency_estimation\n", + "+ https://github.com/Computing-Intelligence/References/blob/master/NLP/Natural-Language-Processing.pdf, Page-46" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> Write your code here" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}