Repository to showcase and explore different classification approaches with the Reuters-21578 collection.
The collection used in this example is Reuters-21578, a historical text classification collection that was commonly used in academic work through the 2010s. It is tiny by modern standards, but it has a number of characteristics that make it great for educational purposes, as it resembles many of the (albeit small) datasets we might encounter in industrial applications. In particular, it is a clearly unbalanced dataset, with classes ranging from a handful of examples to thousands.
Reuters-21578 contains structured information about newswire articles, each of which can be assigned to several classes, making this a multi-label classification problem. The “ModApte” split is used, considering only classes with at least one training and one test document. This leaves 7770 training and 3019 test documents across 90 classes with a highly skewed class distribution.
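For reference, the sketch below shows one way to load the ModApte split and inspect the class imbalance. It assumes the NLTK copy of Reuters-21578 (which ships with the ModApte split); the approaches in this repository may load the data differently.

```python
from collections import Counter

import nltk
from nltk.corpus import reuters

# Fetch the Reuters-21578 corpus bundled with NLTK (ModApte split).
nltk.download("reuters", quiet=True)

# File ids are prefixed with the split they belong to.
train_ids = [f for f in reuters.fileids() if f.startswith("training/")]
test_ids = [f for f in reuters.fileids() if f.startswith("test/")]
print(f"train: {len(train_ids)}, test: {len(test_ids)}, classes: {len(reuters.categories())}")

# Per-class document counts illustrate the skew: the largest classes
# (e.g. "earn") cover thousands of documents, while many classes have
# only a handful of examples.
counts = Counter(cat for f in reuters.fileids() for cat in reuters.categories(f))
print("most frequent:", counts.most_common(5))
print("least frequent:", counts.most_common()[-5:])
```

The document and class counts printed here should roughly match the figures above, and the frequency listing makes the long tail of rare classes explicit.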