
How to get the pre-processed data of 2Wiki (in more detail)? #2

Open · canghongjian opened this issue Jul 11, 2023 · 11 comments

@canghongjian

Hi @xanhho. You said the data processing is based on HGN, but I see some new attributes in the Example class, such as evidence_ids. I also found it difficult to follow HGN's entire pipeline. If I want to replace the paragraph selection step with a more precise selection of my own, how can I get the pre-processed data quickly?

@xanhho (Contributor) commented Jul 12, 2023

Hi @canghongjian, thank you for your interest in the work.

"You said data processing is based on HGN. But I see there are some new attributes in Example class such as evidence_ids."
=> Yes, we base our data preprocessing on HGN, but as we also said, we updated it slightly to work with our dataset. HGN is not designed for the evidence generation task, so we changed some parts of it to process data for that task. The new attributes in the Example class are for the evidence generation task.

"If I want to replace the paragraph selection part with more precise selection, how can I get the pre-processed data quickly?"
=> One quick option is to re-use the data I uploaded (https://www.dropbox.com/s/dcrr5m0sxhexr84/2wiki.zip?dl=0). However, that data already includes the paragraph selection step, because I followed the HGN scripts here: https://github.com/yuwfan/HGN/tree/master/scripts.

I think that if you want to replace the paragraph selection part, you will also need to follow the HGN scripts for data preprocessing.
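
For reference, fetching and unpacking that archive can be done with a few lines (a sketch; it assumes the Dropbox link is still live, and switches dl=0 to dl=1 so Dropbox serves the file directly instead of a preview page):

```python
# Minimal sketch for fetching the pre-processed 2Wiki data linked above.
# Assumes the Dropbox link is still live; dl=1 requests a direct download.
import urllib.request
import zipfile

URL = "https://www.dropbox.com/s/dcrr5m0sxhexr84/2wiki.zip?dl=1"
urllib.request.urlretrieve(URL, "2wiki.zip")
with zipfile.ZipFile("2wiki.zip") as zf:
    zf.extractall("2wiki")
```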

@canghongjian (Author)

Thanks for your kind reply. Could you please describe how to add the evidence generation part to the Example class? In other words, for a sample in 2Wiki (id 8813f87c0bdd11eba7f7acde48001122) with

'evidences': [['Polish-Russian War', 'director', 'Xawery Żuławski'], ['Xawery Żuławski', 'mother', 'Małgorzata Braunek']], 'answer': 'Małgorzata Braunek',

how do I obtain these attributes:

self.relations=relations, self.evidences=evidences, self.evidence_ids=evidence_ids, self.q_ner_labels=q_ner_labels, self.ctx_ner_labels=ctx_ner_labels

Is there a snippet of code? Wish you a good day : )

@xanhho (Contributor) commented Jul 12, 2023

When I followed the scripts in HGN, I edited the 5_dump_features.py file a little to add these new attributes. Here is an example:
https://www.dropbox.com/s/xy65blwp9z3e55b/5_dump_features.py?dl=0

You can search for these new attributes (e.g., relations) in the file above.
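
In spirit, the change looks roughly like the sketch below. It is not the actual code from the file: the relation vocabulary (RELATION2ID), the class layout, and build_example are made up to illustrate how the gold evidences triples turn into relations and evidence_ids (q_ner_labels and ctx_ner_labels come from the NER extraction step and are not shown):

```python
# Rough sketch of the idea behind the 5_dump_features.py edit: carry the
# gold evidence triples from the raw 2Wiki JSON into each Example.
# RELATION2ID, the class layout, and build_example are illustrative only.
from dataclasses import dataclass
from typing import List

RELATION2ID = {"director": 0, "mother": 1}  # hypothetical relation vocabulary

@dataclass
class Example:
    qas_id: str
    evidences: List[List[str]]   # gold [subject, relation, object] triples
    relations: List[str]         # relation string of each triple
    evidence_ids: List[int]      # relation ids under the vocabulary

def build_example(case: dict) -> Example:
    evidences = case.get("evidences", [])
    relations = [rel for _, rel, _ in evidences]
    return Example(
        qas_id=case["_id"],
        evidences=evidences,
        relations=relations,
        evidence_ids=[RELATION2ID[r] for r in relations],
    )

case = {
    "_id": "8813f87c0bdd11eba7f7acde48001122",
    "evidences": [["Polish-Russian War", "director", "Xawery Żuławski"],
                  ["Xawery Żuławski", "mother", "Małgorzata Braunek"]],
}
print(build_example(case).evidence_ids)  # -> [0, 1]
```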

@canghongjian (Author)

> https://www.dropbox.com/home/public/2023_multi-hop-analysis?preview=5_dump_features.py This is an example.

I clicked the URL, but it said “public/2023_multi-hop-analysis” does not exist. Maybe it is a private link? It does not behave like https://www.dropbox.com/s/dcrr5m0sxhexr84/2wiki.zip?dl=0 in the README, which can be accessed directly.

@xanhho (Contributor) commented Jul 12, 2023

Oh, I'm sorry. I have updated the link above; could you check it again?

@canghongjian (Author)

It worked. Thanks a lot.

@canghongjian (Author)

Hello @xanhho, I'm sorry to bother you, but I ran into a problem when following the HGN data processing and need your help. It happens in step "2. Extract NER for Question and Context" in https://github.com/yuwfan/HGN/blob/master/run.sh. My db file 'enwiki_ner.db' seems to be incomplete, as the following screenshot illustrates:

[screenshot showing titles missing from enwiki_ner.db]

It lacks some wiki passages from the 2WikiMultiHop dataset ('Alice Washburn' is a passage title in the 2Wiki dev set; in fact, 17,375 other titles are missing), which causes step "2. Extract NER for Question and Context" to fail.
I ran https://github.com/yuwfan/HGN/blob/master/scripts/0_build_db.py to build the db file; is it out of date? How can I get a complete db file that covers the whole 2Wiki dataset?
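
A check along these lines can surface the missing titles (a sketch; the documents table and id column are my assumption of the DrQA-style schema that 0_build_db.py follows, and dev.json stands in for a local copy of the 2Wiki dev set):

```python
# Sketch of the coverage check described above. Assumptions: the db has a
# DrQA-style `documents` table keyed by an `id` (title) column, and the
# 2Wiki dev file's "context" field is a list of [title, sentences] pairs.
import json
import sqlite3

conn = sqlite3.connect("enwiki_ner.db")
db_titles = {row[0] for row in conn.execute("SELECT id FROM documents")}
conn.close()

with open("dev.json", encoding="utf-8") as f:
    dev = json.load(f)

wanted = {title for case in dev for title, _ in case["context"]}
missing = sorted(wanted - db_titles)
print(f"{len(missing)} dev-set titles are not in the db, e.g. {missing[:5]}")
```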

@xanhho (Contributor) commented Jul 14, 2023

Sorry for the issue; as I remember, I also faced these issues in the past.
To resolve it quickly, I have put all the data I obtained before the paragraph selection step here:
https://www.dropbox.com/scl/fo/naoi4a0929vi2vcb99ad8/h?rlkey=7l4zq7g9svtfaxouhv0pugwak&dl=0

I think you can start from there and run the rest of the pipeline to the end.

@canghongjian (Author)

Oh, thanks for your generous help!!

@canghongjian (Author) commented Aug 1, 2023

Hey @xanhho. I have emailed you my test set result on 2WikiMultihopQA, named 'Beam Retrieval'. Could you please evaluate it and tell me the performance? I'm writing a paper to meet a conference deadline. Thanks for your great work and generous help!

@xanhho (Contributor) commented Aug 2, 2023

I'm so sorry. I just checked my email and found that your message was in the spam folder (I don't know why). I have just replied to your email.
