Dataset format expected #9

Open
shashankg7 opened this issue May 22, 2020 · 2 comments

@shashankg7

Hi,

What is the dataset format expected for multi-label classification?

@mwydmuch
Owner

Hi @shashankg7, the dataset format is the fastText data format with a few extensions:

__label__<label 1 name> __label__<label 2 name> __label__<label 3 name...> <word 1> <word2> <word3...>
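As a rough illustration of the line layout above, here is a minimal Python sketch that builds one such line from a list of labels and a list of words. The helper name `to_fasttext_line` and the sample labels and words are hypothetical and not part of this project. The sketch assumes that label names and words contain no whitespace, since the format is space-separated.

```python
def to_fasttext_line(labels, words):
    """Build one line: __label__<name> ... followed by the words.

    Hypothetical helper for illustration only; assumes no label or word
    contains whitespace, because the format separates tokens by spaces.
    """
    label_part = " ".join(f"__label__{label}" for label in labels)
    word_part = " ".join(words)
    return f"{label_part} {word_part}"


# Made-up example with two labels and three words.
example = to_fasttext_line(["sports", "football"], ["world", "cup", "final"])
print(example)  # __label__sports __label__football world cup final
```

Writing one such line per example to a text file should give a file in the layout described above.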

It is possible to add a weight for each word by adding the -wordsWeights option and using the following format:

__label__<label 1 name> __label__<label 2 name> __label__<label 3 name...> <word 1>:<word 1 weight> <word 2>:<word 2 weight> <word 3...>:<word 3 weight...>
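A minimal Python sketch of the weighted layout, pairing each word with its weight via a colon. The helper name `to_weighted_line` and the sample data are hypothetical. How the weights should be formatted (for example, the number of decimal places) is an assumption here, not something confirmed in this thread.

```python
def to_weighted_line(labels, word_weights):
    """Build one line: __label__<name> ... followed by <word>:<weight> pairs.

    Hypothetical helper for illustration only. `word_weights` is a list of
    (word, weight) tuples; weights are written with Python's default
    float formatting, which is an assumption about what is accepted.
    """
    label_part = " ".join(f"__label__{label}" for label in labels)
    word_part = " ".join(f"{word}:{weight}" for word, weight in word_weights)
    return f"{label_part} {word_part}"


# Made-up example with one label and two weighted words.
weighted = to_weighted_line(["sports"], [("world", 0.5), ("cup", 1.0)])
print(weighted)  # __label__sports world:0.5 cup:1.0
```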

See the xml_experiments directory for some examples. run_EURLex-4K.sh covers the smallest of the datasets.

@shashankg7
Author

Thanks a lot @mwydmuch for your reply.

I am able to run the code with the format you have described. Thanks!

I have one question. I am trying out your model on a custom multi-label short-text classification task (average word length of ~4). The number of labels is on the order of 3.5K.

I am using the 'plt' loss function with the number of dimensions in [200, 300, 500]. I have tried different numbers of epochs and varying char n-gram sizes.

But I am not able to get good results compared to fastText.

Any suggestions as to where I might be going wrong, or what else I could try?

Thanks
