Hi, Lanco!
I downloaded the dataset you provided on Google Drive https://drive.google.com/drive/folders/1lBt2MjEoh4CG2jjt4nMgHro2h3k2gwUH
and looked at the src text. But the text seems meaningless and strange, e.g. "june bean myr smc smc board cocoa cocoa malays aver aver ton pric high high prev prev prev tawau lahad datu sabak low low bernam".
Could you let me know whether there is something wrong with the dataset? Thank you very much.
@MaYB2333 Exactly, the dataset seems strange. Maybe it was preprocessed in some way. I just printed two lines from train.src and they look like this:
['won million sery moody moody net cost ud percent bond texa system cape sew tax revenu interest waterwork juran royal\n',
'purchas stock complet issu million corp fsb hold shar shar shar tuesday back newsdesk compan compan result fsf fsf fsf chicag buy feder common outstand total financ financ financ repurchas\n']
Because of the license agreement for the RCV1 CD-ROMs from Reuters, the original text must not be reconstructable. So the dataset removes the stop words, replaces the remaining words with their stems, and scrambles the order of the stems.
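For anyone wondering what that kind of preprocessing looks like, here is a rough sketch in Python. It is only an illustration, not the actual script used to build the dataset: the stop-word list, the `toy_stem` suffix stripper, and the `obfuscate` function are all made up for this example, and the real pipeline presumably used a proper stemmer and a much larger stop-word list.

```python
import random

# Hypothetical, tiny stop-word list; the list actually used for the dataset is not known.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "was", "for"}


def toy_stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "ed", "s", "e"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def obfuscate(text, seed=0):
    """Drop stop words, stem the remaining words, then shuffle the stems."""
    tokens = text.lower().split()
    stems = [toy_stem(t) for t in tokens if t not in STOP_WORDS]
    random.Random(seed).shuffle(stems)  # seeded so the result is repeatable
    return " ".join(stems)


if __name__ == "__main__":
    print(obfuscate("The average price of cocoa beans was high"))
```

The output keeps the same bag of stems as the input sentence but loses the word order, which is why lines in train.src read like word salad while still being usable for bag-of-words style models.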