
How to get the pre-processed data of 2Wiki (in more detail)? #2

Open · canghongjian opened this issue Jul 11, 2023 · 11 comments

@canghongjian

Hi @xanhho. You said the data processing is based on HGN, but I see some new attributes in the Example class, such as evidence_ids. I also found it difficult to follow HGN's entire pipeline. If I want to replace the paragraph selection step with a more precise selection of my own, how can I get the pre-processed data quickly?

@xanhho (Contributor) commented Jul 12, 2023

Hi @canghongjian, thank you for your interest in the work.

"You said data processing is based on HGN. But I see there are some new attributes in Example class such as evidence_ids."
=> Yes, we base our data preprocessing on HGN, but as we also said, we updated it slightly to work with our dataset. HGN is not designed for the evidence generation task, so we changed some parts of it to process data for that task. The new attributes in the Example class are for the evidence generation task.

"If I want to replace the paragraph selection part with more precise selection, how can I get the pre-processed data quickly?"
=> One quick option is to re-use the data I uploaded (https://www.dropbox.com/s/dcrr5m0sxhexr84/2wiki.zip?dl=0). However, that data already includes the paragraph selection step, because I followed the HGN scripts here: https://github.com/yuwfan/HGN/tree/master/scripts.

I think that if you want to replace the paragraph selection part, you will also need to follow the HGN scripts for data preprocessing.
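
For reference, fetching and unpacking that archive can be done with a few lines (a sketch; it assumes the Dropbox link is still live, and switches dl=0 to dl=1 so Dropbox serves the file directly instead of a preview page):

```python
# Minimal sketch for fetching the pre-processed 2Wiki data linked above.
# Assumes the Dropbox link is still live; dl=1 requests a direct download.
import urllib.request
import zipfile

URL = "https://www.dropbox.com/s/dcrr5m0sxhexr84/2wiki.zip?dl=1"
urllib.request.urlretrieve(URL, "2wiki.zip")
with zipfile.ZipFile("2wiki.zip") as zf:
    zf.extractall("2wiki")
```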

@canghongjian (Author)

Thanks for your kind reply. Could you please describe how to add the evidence generation part to the Example class? In other words, for a sample in 2Wiki (id 8813f87c0bdd11eba7f7acde48001122) with

'evidences': [['Polish-Russian War', 'director', 'Xawery Żuławski'], ['Xawery Żuławski', 'mother', 'Małgorzata Braunek']], 'answer': 'Małgorzata Braunek',

how do I obtain these attributes:

self.relations=relations, self.evidences=evidences, self.evidence_ids=evidence_ids, self.q_ner_labels=q_ner_labels, self.ctx_ner_labels=ctx_ner_labels

Is there a snippet of code? Wish you a good day : )

@xanhho (Contributor) commented Jul 12, 2023

When I followed the scripts in HGN, I edited the 5_dump_features.py file a little to add these new attributes. Here is an example:
https://www.dropbox.com/s/xy65blwp9z3e55b/5_dump_features.py?dl=0

You can search for these new attributes (e.g., relations) in the file above.
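
In spirit, the change looks roughly like the sketch below. It is not the actual code from the file: the relation vocabulary (RELATION2ID), the class layout, and build_example are made up to illustrate how the gold evidences triples turn into relations and evidence_ids (q_ner_labels and ctx_ner_labels come from the NER extraction step and are not shown):

```python
# Rough sketch of the idea behind the 5_dump_features.py edit: carry the
# gold evidence triples from the raw 2Wiki JSON into each Example.
# RELATION2ID, the class layout, and build_example are illustrative only.
from dataclasses import dataclass
from typing import List

RELATION2ID = {"director": 0, "mother": 1}  # hypothetical relation vocabulary

@dataclass
class Example:
    qas_id: str
    evidences: List[List[str]]   # gold [subject, relation, object] triples
    relations: List[str]         # relation string of each triple
    evidence_ids: List[int]      # relation ids under the vocabulary

def build_example(case: dict) -> Example:
    evidences = case.get("evidences", [])
    relations = [rel for _, rel, _ in evidences]
    return Example(
        qas_id=case["_id"],
        evidences=evidences,
        relations=relations,
        evidence_ids=[RELATION2ID[r] for r in relations],
    )

case = {
    "_id": "8813f87c0bdd11eba7f7acde48001122",
    "evidences": [["Polish-Russian War", "director", "Xawery Żuławski"],
                  ["Xawery Żuławski", "mother", "Małgorzata Braunek"]],
}
print(build_example(case).evidence_ids)  # -> [0, 1]
```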

@canghongjian (Author)

> https://www.dropbox.com/home/public/2023_multi-hop-analysis?preview=5_dump_features.py This is an example.

I clicked the URL, but it said “public/2023_multi-hop-analysis” does not exist. Maybe it is a private link? It does not behave like https://www.dropbox.com/s/dcrr5m0sxhexr84/2wiki.zip?dl=0 in the README, which can be accessed directly.

@xanhho (Contributor) commented Jul 12, 2023

Oh, I'm sorry. I have updated the link above; could you check it again?

@canghongjian (Author)

It worked. Thanks a lot.

@canghongjian (Author)

Hello @xanhho, I'm sorry to bother you, but I ran into a problem when following the HGN data processing and need your help. It happens in step "2. Extract NER for Question and Context" in https://github.com/yuwfan/HGN/blob/master/run.sh. My db file 'enwiki_ner.db' seems to be incomplete, as the following screenshot illustrates:

[screenshot showing titles missing from enwiki_ner.db]

It lacks some wiki passages from the 2WikiMultiHop dataset ('Alice Washburn' is a passage title in the 2Wiki dev set; in fact, 17,375 other titles are missing), which causes step "2. Extract NER for Question and Context" to fail.
I ran https://github.com/yuwfan/HGN/blob/master/scripts/0_build_db.py to build the db file; is it out of date? How can I get a complete db file that covers the whole 2Wiki dataset?
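
A check along these lines can surface the missing titles (a sketch; the documents table and id column are my assumption of the DrQA-style schema that 0_build_db.py follows, and dev.json stands in for a local copy of the 2Wiki dev set):

```python
# Sketch of the coverage check described above. Assumptions: the db has a
# DrQA-style `documents` table keyed by an `id` (title) column, and the
# 2Wiki dev file's "context" field is a list of [title, sentences] pairs.
import json
import sqlite3

conn = sqlite3.connect("enwiki_ner.db")
db_titles = {row[0] for row in conn.execute("SELECT id FROM documents")}
conn.close()

with open("dev.json", encoding="utf-8") as f:
    dev = json.load(f)

wanted = {title for case in dev for title, _ in case["context"]}
missing = sorted(wanted - db_titles)
print(f"{len(missing)} dev-set titles are not in the db, e.g. {missing[:5]}")
```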

@xanhho (Contributor) commented Jul 14, 2023

Sorry for the issue; as I remember, I also faced these issues in the past.
To resolve it quickly, I have put all the data I obtained before the paragraph selection step here:
https://www.dropbox.com/scl/fo/naoi4a0929vi2vcb99ad8/h?rlkey=7l4zq7g9svtfaxouhv0pugwak&dl=0

I think you can start from there and run the rest of the pipeline to the end.

@canghongjian (Author)

Oh, thanks for your generous help!!

@canghongjian (Author) commented Aug 1, 2023

Hey @xanhho. I have emailed you my test set result on 2WikiMultihopQA, named 'Beam Retrieval'. Could you please evaluate it and tell me the performance? I'm writing a paper to meet a conference deadline. Thanks for your great work and generous help!

@xanhho (Contributor) commented Aug 2, 2023

I'm so sorry. I just checked my email and found that your message was in the spam folder (I don't know why). I have just replied to your email.
