The text in dataset seems no meaning #26

Open
MaYB2333 opened this issue Mar 29, 2020 · 3 comments

Comments

@MaYB2333

Hi, Lanco!
I downloaded the dataset you provided on Google Drive (https://drive.google.com/drive/folders/1lBt2MjEoh4CG2jjt4nMgHro2h3k2gwUH)
and looked at the src text. The text seems meaningless and strange, e.g. "june bean myr smc smc board cocoa cocoa malays aver aver ton pric high high prev prev prev tawau lahad datu sabak low low bernam".
Could you let me know whether something is wrong with the dataset? Thank you very much.

@monk1337

monk1337 commented Jul 26, 2020

@MaYB2333 Exactly, the dataset looks strange. Maybe it was preprocessed under some conditions. I printed two lines from train.src and they look like this:

['won million sery moody moody net cost ud percent bond texa system cape sew tax revenu interest waterwork juran royal\n',
 'purchas stock complet issu million corp fsb hold shar shar shar tuesday back newsdesk compan compan result fsf fsf fsf chicag buy feder common outstand total financ financ financ repurchas\n']

@ypengc7512 Can you please take a look into this?

Thanks!

@monk1337

I think the words have been preprocessed and stemmed?

@monk1337

Because of the agreement covering the RCV1 CD-ROMs from Reuters, the original text should not be reconstructable from the released data. So the dataset removes stop words, replaces the remaining words with their stems, and scrambles the order of the stems.

Please refer to the official site for details.
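
For anyone who wants to see why the lines look the way they do, here is a rough sketch of the three steps described above (drop stop words, stem, shuffle). It is not the actual RCV1 preprocessing: the stop-word list, `toy_stem`, and `obfuscate` are made-up names for illustration, and the suffix stripping is only a crude stand-in for a real stemmer.

```python
import random

# Hypothetical, tiny stop-word list for illustration only;
# the real preprocessing would use a much larger list.
STOP_WORDS = {"the", "a", "of", "in", "on", "and", "to", "was", "is", "at"}


def toy_stem(word):
    """Crude suffix stripping; an assumed stand-in for a real stemmer."""
    for suffix in ("ing", "es", "ed", "e", "s", "y"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def obfuscate(text, seed=0):
    """Drop stop words, stem what is left, and shuffle the stems."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    stems = [toy_stem(t) for t in tokens if t not in STOP_WORDS]
    random.Random(seed).shuffle(stems)  # seeded so the result is repeatable
    return " ".join(stems)


if __name__ == "__main__":
    print(obfuscate("the company purchased shares of the company"))
```

The output is a bag of stems such as "compan", "purchas" and "shar" in arbitrary order, which is the same kind of text as the train.src lines quoted earlier in this thread. Repeated stems survive, so word counts are preserved even though the sentence cannot be read.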
