-
Notifications
You must be signed in to change notification settings - Fork 0
Input a list of keywords to scrape wikipedia to get their canonical form. We can get ja => en or en => ja, using "ja" lang setting of wiki to search.
License
chimney37/github-python-term-standardizer
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
The purpose of the program is to scrape wikipedia using a set of keywords to get their canonical form. We can go ja => en or en => ja. Therefore the input can either be Japanese or English. If a search result exist on wikipedia (language is set to "ja"), we get the top ranked result, by fetching the page of the search result. Note: this software is dependent on wikipedia API for python. Installation: 1) install python >= 2.6 2) get wikipedia API for python >= 1.4.0 pip install wikipedia Running the software: python main.py --batch <path-to-input-file> --out <path-to-output-file> Input file format (TSV) Per row: <Input Keyword> Note: Multiple rows of keywords in Japanese or English Output file format (TSV) Per row: <Input Keyword><tab><Output Keyword><tab>Summary Note: Summary refers to the page summary of the wiki article, specified by the output keyword (title of wiki page) TODO: Try to use a more versatile wiki scraping API, to get the translation equivalent of ja => en and en => ja by traversing the language link of a page article rather than relying it on a single language setting and getting page based on its title. Changelog: 2016/5/1 - Initial version of program.
About
Input a list of keywords to scrape wikipedia to get their canonical form. We can get ja => en or en => ja, using "ja" lang setting of wiki to search.
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published