The text in dataset seems no meaning #26

MaYB2333 · 2020-03-29T10:54:30Z

Hi, Lanco!
I downloaded the dataset you provided on Google Drive https://drive.google.com/drive/folders/1lBt2MjEoh4CG2jjt4nMgHro2h3k2gwUH
and viewed the src text. But the text seems meaningless and strange, e.g. "june bean myr smc smc board cocoa cocoa malays aver aver ton pric high high prev prev prev tawau lahad datu sabak low low bernam".
Could you let me know whether something is wrong with the dataset? Thank you very much.

monk1337 · 2020-07-26T21:43:40Z

@MaYB2333 Exactly, the dataset seems strange. Maybe it was preprocessed under some conditions. I just printed two lines from train.src, and they look like this:

['won million sery moody moody net cost ud percent bond texa system cape sew tax revenu interest waterwork juran royal\n',
 'purchas stock complet issu million corp fsb hold shar shar shar tuesday back newsdesk compan compan result fsf fsf fsf chicag buy feder common outstand total financ financ financ repurchas\n']

@ypengc7512 Can you please take a look at this?

Thanks!

monk1337 · 2020-07-26T21:48:42Z

I think the words have been preprocessed and stemmed?

monk1337 · 2020-07-26T22:31:03Z

Because of the agreement covering the RCV1 CD-ROMs from Reuters, the original data should not be reconstructed. So the dataset removes the stop words, replaces the remaining words with stems, and scrambles the order of the stems.

Please refer to the official site for details.
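
To make the three steps described above concrete (stop-word removal, stemming, scrambling), here is a minimal sketch. It is only an illustration, not the actual RCV1 preprocessing: the stop-word list, the suffix list, and the names `toy_stem` and `obfuscate` are all hypothetical, and the real dataset was built with its own stop list and stemmer.

```python
import random
import re

# Hypothetical, tiny stop-word list (assumption); the real preprocessing
# uses its own, much larger list.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "for", "at", "by"}

# Hypothetical suffixes for a toy stemmer (assumption); this is NOT the
# stemmer that was used to build the dataset.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def obfuscate(text, seed=0):
    """Lowercase, drop stop words, stem the rest, then shuffle the stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [toy_stem(t) for t in tokens if t not in STOP_WORDS]
    # Shuffling discards word order, so the sentence cannot be read back.
    random.Random(seed).shuffle(stems)
    return " ".join(stems)


if __name__ == "__main__":
    print(obfuscate("The company purchased shares of the bank"))
```

The output is a bag of stems such as `purchas` and `shar` in arbitrary order, which is why lines in train.src look meaningless to a human reader while still carrying the vocabulary of the document.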
