GitHub - chimney37/github-python-term-standardizer: Input a list of keywords to scrape wikipedia to get their canonical form. We can get ja => en or en => ja, using "ja" lang setting of wiki to search.


MIT license

Files (8 commits):

    PythonExperiment.xcodeproj
    IO.py
    IO.pyc
    LICENSE
    README
    main.py
    wikiscraper.py
    wikiscraper.pyc


This program looks up a set of keywords on Wikipedia to get their canonical forms. Lookup works in both directions, ja => en and en => ja, so the input can be either Japanese or English. The Wikipedia language is set to "ja"; if a search for a keyword returns any results, the program fetches the page of the top-ranked result. Note: this software depends on the wikipedia package for Python.
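
The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in wikiscraper.py: the function name canonical_form and its search / fetch_page parameters are hypothetical and exist only so the sketch can be tried without network access. The calls wikipedia.set_lang, wikipedia.search and wikipedia.page come from the wikipedia package. The sketch assumes Python 3.

```python
def canonical_form(keyword, lang="ja", search=None, fetch_page=None):
    """Return (title, summary) for the top-ranked search hit, or None.

    search and fetch_page default to the wikipedia package's functions.
    They are parameters (an assumption of this sketch) so that the logic
    can be exercised offline with stand-ins.
    """
    if search is None or fetch_page is None:
        import wikipedia  # pip install wikipedia

        wikipedia.set_lang(lang)  # search the "ja" edition by default
        search = search or wikipedia.search
        fetch_page = fetch_page or wikipedia.page

    results = search(keyword)  # list of page titles, best match first
    if not results:
        return None  # no search result for this keyword

    page = fetch_page(results[0])  # fetch the top-ranked result
    return page.title, page.summary
```

With the real package, wikipedia.page can raise wikipedia.exceptions.DisambiguationError or wikipedia.exceptions.PageError; a batch run would probably want to catch both and skip the keyword.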


Installation:

    1) Install Python >= 2.6
    2) Get the wikipedia package for Python, version >= 1.4.0:
        pip install wikipedia

Running the software:

    python main.py --batch <path-to-input-file> --out <path-to-output-file>
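
For illustration, the two flags could be parsed with argparse as below. This is a sketch, not the parsing code in main.py: the function name build_parser, the help strings, and the choice to make both flags required are assumptions. Note that argparse ships with Python 2.7 and later, so on Python 2.6 it would have to be installed separately.

```python
import argparse


def build_parser():
    """Build a parser for the --batch and --out flags shown in the README."""
    parser = argparse.ArgumentParser(
        description="Look up keywords on Wikipedia and write their canonical forms."
    )
    # Both flags are marked required here; that is an assumption of this sketch.
    parser.add_argument("--batch", required=True, metavar="INPUT",
                        help="path to the input file, one keyword per row")
    parser.add_argument("--out", required=True, metavar="OUTPUT",
                        help="path to the output TSV file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("reading from", args.batch, "and writing to", args.out)
```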

Input file format (TSV), per row:

    <Input Keyword>
    Note: One keyword per row, in Japanese or English; the file may contain multiple rows.
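
A reader for this format might look like the sketch below. The function name parse_keywords is hypothetical, and two details are assumptions rather than documented behaviour: the file is treated as UTF-8, and blank rows are skipped.

```python
import io


def parse_keywords(lines):
    """Return the non-empty keywords from an iterable of text lines."""
    keywords = []
    for line in lines:
        keyword = line.strip()  # drop the newline and surrounding whitespace
        if keyword:             # skipping blank rows is an assumption
            keywords.append(keyword)
    return keywords


def read_keywords(path):
    """Read keywords from a file, assuming UTF-8 encoding."""
    with io.open(path, encoding="utf-8") as handle:
        return parse_keywords(handle)
```

Keeping the parsing separate from the file handling makes it easy to try the logic on an in-memory list of lines.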

Output file format (TSV), per row:

    <Input Keyword><tab><Output Keyword><tab><Summary>
    Note: The output keyword is the title of the wiki page, and the summary is the page summary of that article.
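
One way to produce such a row is sketched below. The function name format_row is hypothetical. Page summaries can contain newlines or tabs, which would break the one-row-per-keyword layout, so this sketch collapses all whitespace in the summary to single spaces; whether the program itself does this is not stated here, so treat it as an assumption.

```python
def format_row(keyword, title, summary):
    """Join the three fields into one tab-separated output row."""
    # Collapse newlines, tabs and repeated spaces so the summary stays on
    # one line and cannot be mistaken for extra columns (an assumption).
    flat_summary = " ".join(summary.split())
    return "\t".join([keyword, title, flat_summary])
```

Each row would then be written to the output file followed by a newline, with the file opened as UTF-8 so Japanese text survives intact.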

TODO:

    Use a more versatile wiki scraping API to get the translation equivalent (ja => en and en => ja) by traversing the language links of an article, rather than relying on a single language setting and fetching the page by its title.
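
One possible direction for this item is the MediaWiki action API, which can return an article's language links through action=query with prop=langlinks. The sketch below is only an illustration under stated assumptions: the function names and the User-Agent string are hypothetical, the code targets Python 3, and it expects the API's default JSON layout, in which the linked title is stored under the "*" key.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def langlinks_url(title, source_lang="ja", target_lang="en"):
    """Build a query URL asking for the target-language link of a page."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllang": target_lang,  # only return links to this language
        "redirects": 1,         # follow redirects to the canonical page
        "format": "json",
    }
    return "https://{0}.wikipedia.org/w/api.php?{1}".format(
        source_lang, urlencode(params))


def extract_langlink(response, target_lang="en"):
    """Pull the linked title out of a decoded API response, or None."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for link in page.get("langlinks", []):
            if link.get("lang") == target_lang:
                return link.get("*")  # title in the default JSON layout
    return None


def translate_title(title, source_lang="ja", target_lang="en"):
    """Fetch the target-language title of a page (needs network access)."""
    request = Request(
        langlinks_url(title, source_lang, target_lang),
        headers={"User-Agent": "term-standardizer-sketch/0.1"},  # hypothetical
    )
    with urlopen(request) as reply:
        return extract_langlink(json.load(reply), target_lang)
```

Swapping source_lang and target_lang gives the en => ja direction, so a single code path would cover both cases without depending on one language setting.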


Changelog:

    2016/5/1 - Initial version of program.