Dataset format expected #9

shashankg7 · 2020-05-22T10:22:29Z

Hi,

What is the dataset format expected for multi-label classification?

mwydmuch · 2020-05-23T11:40:46Z

Hi @shashankg7, the dataset format is the fastText data format with a few extensions:

__label__<label 1 name> __label__<label 2 name> __label__<label 3 name...> <word 1> <word2> <word3...>
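As a minimal sketch (not code from this repository), one line in this format could be produced in Python as shown here. The helper name `format_example` and the sample labels and words are hypothetical, and the sketch assumes that label names and words contain no whitespace.

```python
def format_example(labels, words):
    """Build one training line: __label__-prefixed labels, then the words."""
    # Each label gets the __label__ prefix; tokens are separated by single spaces.
    label_part = " ".join(f"__label__{label}" for label in labels)
    word_part = " ".join(words)
    return f"{label_part} {word_part}"


# Hypothetical document with two labels and four words.
line = format_example(["sports", "football"], ["match", "ended", "in", "draw"])
print(line)  # __label__sports __label__football match ended in draw
```

Writing one such line per document to a text file gives a file in the layout described above.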

It is possible to add a weight for each word by adding the -wordsWeights option and using the following format:

__label__<label 1 name> __label__<label 2 name> __label__<label 3 name...> <word 1>:<word 1 weight> <word 2>:<word 2 weight> <word 3...>:<word 3 weight...>
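For illustration only, the weighted variant could be generated along these lines. The function name `format_weighted_example` and the sample values are hypothetical, and the sketch assumes that weights are written as plain decimal numbers and that words contain neither whitespace nor a colon.

```python
def format_weighted_example(labels, word_weights):
    """Build one training line with <word>:<weight> tokens after the labels."""
    label_part = " ".join(f"__label__{label}" for label in labels)
    # Each (word, weight) pair becomes a single "word:weight" token.
    word_part = " ".join(f"{word}:{weight}" for word, weight in word_weights)
    return f"{label_part} {word_part}"


# Hypothetical document with one label and two weighted words.
weighted_line = format_weighted_example(["sports"], [("match", 0.5), ("draw", 1.25)])
print(weighted_line)  # __label__sports match:0.5 draw:1.25
```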

See the xml_experiments directory for some examples. run_EURLex-4K.sh uses the smallest of the datasets.

shashankg7 · 2020-05-25T20:23:52Z

Thanks a lot @mwydmuch for your reply.

I am able to run the code with the format you have described. Thanks!

I have one question. I am trying out your model on a custom multi-label short-text classification task (average length of ~4 words). The number of labels is on the order of 3.5K.

I am using the 'plt' loss function with the number of dimensions in [200, 300, 500]. I have tried different numbers of epochs and have also varied the char n-gram sizes.

But I am not able to get good results compared to fastText.

Any suggestions as to where I might be going wrong, or what else I could try?

Thanks
