url: https://dumps.wikimedia.org/enwiki/latest/
Note: I trained the model on enwiki-latest-pages-articles.xml.bz2, but many other dumps are available at the URL above. After extraction, the data file is around 11 GB.
Command: python wikidata_normalize.py enwiki-latest-pages-articles.xml.bz2 wiki.text
Note: This step took me 7 to 8 hours, running on an AWS EC2 instance with 2 GB of memory.
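The repository's wikidata_normalize.py is not reproduced here; below is a minimal sketch of what such a normalization step can look like, assuming gensim's WikiCorpus is used to strip the wiki markup and write one plain-text article per line. The file name and the choice of gensim are assumptions, not the author's confirmed implementation.

    # normalize_wiki_sketch.py -- hypothetical equivalent of wikidata_normalize.py
    # Streams the compressed XML dump and writes one tokenized article per line.
    import sys
    from gensim.corpora import WikiCorpus

    def normalize(input_bz2, output_txt):
        # dictionary={} skips building a vocabulary; we only want the token stream
        wiki = WikiCorpus(input_bz2, dictionary={})
        with open(output_txt, "w", encoding="utf-8") as out:
            for i, tokens in enumerate(wiki.get_texts()):
                out.write(" ".join(tokens) + "\n")
                if (i + 1) % 10000 == 0:
                    print(f"processed {i + 1} articles", file=sys.stderr)

    if __name__ == "__main__":
        normalize(sys.argv[1], sys.argv[2])

Usage would mirror the command above: python normalize_wiki_sketch.py enwiki-latest-pages-articles.xml.bz2 wiki.text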
Command: python word2vec_model.py wiki.text wiki.model
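word2vec_model.py is likewise not shown in this section; the following sketch assumes gensim's Word2Vec trained over the normalized one-article-per-line file via LineSentence. The hyperparameters are illustrative defaults, not the values used for the published model.

    # train_word2vec_sketch.py -- hypothetical equivalent of word2vec_model.py
    # Trains a Word2Vec model on the normalized plain-text corpus and saves it to disk.
    import sys
    import multiprocessing
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    def train(input_txt, output_model):
        model = Word2Vec(
            LineSentence(input_txt),  # streams the corpus line by line
            vector_size=200,          # embedding dimensionality (gensim >= 4.0; use size= on 3.x)
            window=5,                 # context window
            min_count=5,              # drop rare tokens
            workers=multiprocessing.cpu_count(),
        )
        model.save(output_model)      # reload later with Word2Vec.load(output_model)

    if __name__ == "__main__":
        train(sys.argv[1], sys.argv[2])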
It goes without saying that if you have data more closely targeted to your use case, training on it will give much better results.