Clueweb22 #213

janheinrichmerker · 2022-10-19T22:08:53Z

I'd like to keep this PR as a way of tracking progress of the ir_datasets integration for ClueWeb22.
Of course, the implementation is far from finished (as you can see by the numerous TODOs 😆).
But I figure that keeping the process open to other contributors might encourage valuable feedback and discussion.

And of course, this PR would close #210 😉

seanmacavaney · 2022-10-19T22:36:35Z

Wow-- thanks! Seems to be coming along nicely. The vdom structure is a bit complicated, but I guess it needs to be in order to properly represent the data.

janheinrichmerker · 2022-10-19T22:38:26Z

Yep, I haven't started with the VDOM type yet, but will as soon as the documentation is up.

seanmacavaney · 2022-11-30T14:45:06Z

Thanks!

Looks like there are still some py36 incompatibilities: `ImportError: cannot import name 'Final' from 'typing'`.

My main hesitation remains that in my experience so far with the package, it seems that most users just care about having an easy way to get the text, even when loads of other nice structured data are available. So I'd like to make that case as easy and optimised as possible for folks. You make some reasonable counter-points, though, and I think I'm inclined to agree on the current path forward. But maybe it's worth getting some additional input before committing to it.

janheinrichmerker · 2022-12-01T12:23:32Z

> My main hesitation remains that in my experience so far with the package, it seems that most users just care about having an easy way to get the text, even when loads of other nice structured data are available. So I'd like to make that case as easy and optimised as possible for folks.

Well, with the current approach it is already easy (just use clueweb22/b/text instead of clueweb22/b) and optimized (for clueweb22/b/text we would only look at the text files, no WARC is touched).

So why is it a problem to have users explicitly choose clueweb22/b/text if they only care about the text?

I'm now going to test everything with a Python 3.6 interpreter, just to be sure.

janheinrichmerker · 2022-12-01T12:42:18Z

I'd like to add that there are also datasets in ir_datasets where the derived datasets are a suffix to the original dataset:

argsme/2020-04-01/processed is derived from argsme/2020-04-01
clueweb12/touche-2022-task-2/expanded-doc-t5-query is derived from clueweb12/touche-2022-task-2
cord19 is derived from cord19/fulltext

So I don't see a general pattern for preferring shorter IDs for the text-only version.

janheinrichmerker · 2022-12-01T14:36:22Z

That should have been the last few 3.7-incompatible things.
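
For context, `Final` and `Protocol` were only added to `typing` in Python 3.8, which is why importing them fails on older interpreters. The commits on this branch remove those usages. An alternative, sketched here purely as an illustration and not what the PR does, is a guarded import with a no-op fallback. The `MAX_DOCS` constant is hypothetical.

```python
try:
    from typing import Final  # available from Python 3.8 onwards
except ImportError:
    class _FinalFallback:
        """Stand-in for typing.Final on Python 3.6/3.7 (no type checking)."""

        def __getitem__(self, item):
            # Final[int] simply evaluates to int at runtime.
            return item

    Final = _FinalFallback()

# Hypothetical constant, annotated the same way on every Python version.
MAX_DOCS: Final[int] = 100_000
print(MAX_DOCS)
```

The fallback only keeps the annotation from failing at import time; a type checker running on Python 3.8 or newer would still see the real `Final`.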

seanmacavaney · 2022-12-01T17:51:17Z

Awesome, thanks!

seanmacavaney · 2022-12-02T13:44:17Z

Maybe I'd feel a bit more comfortable if we had some performance benchmarks. E.g., how fast is it to iterate the first 100k documents for the combined vs text-only versions?
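
A benchmark of this kind could be sketched as follows. This is a minimal, self-contained harness, not the script used later in this thread: `time_first_n` and `fake_docs` are hypothetical names, and `fake_docs` stands in for a real document iterator such as the one returned by a dataset's `docs_iter()`.

```python
import time
from itertools import count, islice
from typing import Iterable, Iterator, Tuple


def time_first_n(docs: Iterable, n: int = 100_000) -> Tuple[int, float]:
    """Consume at most n documents and return (documents seen, seconds elapsed)."""
    start = time.perf_counter()
    seen = sum(1 for _ in islice(docs, n))
    elapsed = time.perf_counter() - start
    return seen, elapsed


def fake_docs() -> Iterator[str]:
    """Hypothetical endless document stream used only for this demonstration."""
    for i in count():
        yield f"doc-{i}"


seen, elapsed = time_first_n(fake_docs(), n=1_000)
print(f"{seen} docs in {elapsed:.4f}s")
```

Running the same function once on the text-only iterator and once on the combined iterator gives directly comparable numbers.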

janheinrichmerker · 2022-12-08T17:15:54Z

These might not be too accurate, as I'm accessing the files remotely via CephFS, but here you go:

[INFO] [starting] first 100k docs, just text
100000it [00:07, 12524.57it/s]
[INFO] [finished] first 100k docs, just text [8.06s]
[INFO] [starting] first 100k docs, with html, txt, vdom, inlink, outlink
[WARNING] URL hash mismatch for clueweb22-de0000-00-13406: txt URL hash was 9D5A53C6ACCB07B2C2319A4E5E44AB76 but html URL hash was B6956297B5EBBDFEAABF458F2FA5EADC
[WARNING] URL mismatch for clueweb22-de0000-00-13406: outlink URL was https://www.jovanovic.com/quotidien.htm but html URL was https://www.jovanna.de/
[WARNING] URL hash mismatch for clueweb22-de0000-00-13406: outlink URL hash was 9D5A53C6ACCB07B2C2319A4E5E44AB76 but html URL hash was B6956297B5EBBDFEAABF458F2FA5EADC
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: txt URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
[WARNING] URL mismatch for clueweb22-de0000-01-14834: inlink URL was https://simon.ccbcmd.edu/pls/PROD/bwskalog.p_disploginnew?in_id=&cpbl=&newid= but html URL was https://simon-transporte.com/
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: inlink URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
[WARNING] URL mismatch for clueweb22-de0000-01-14834: outlink URL was https://simon.ccbcmd.edu/pls/PROD/bwskalog.p_disploginnew?in_id=&cpbl=&newid= but html URL was https://simon-transporte.com/
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: outlink URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
100000it [03:04, 541.70it/s]
[INFO] [finished] first 100k docs, with html, txt, vdom, inlink, outlink [03:05]
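
The warnings in this log come from comparing the URL and URL hash of each per-format record against the HTML record. A check along those lines might look like the sketch below. It is an assumption-laden illustration, not the PR's code: `StreamRecord`, `check_consistency`, and the field names are hypothetical, and only the message wording is taken from the log.

```python
from typing import Dict, List, NamedTuple


class StreamRecord(NamedTuple):
    """Hypothetical minimal record: one per format (html, txt, outlink, ...)."""
    url: str
    url_hash: str


def check_consistency(
    doc_id: str,
    records: Dict[str, StreamRecord],
    reference: str = "html",
) -> List[str]:
    """Compare every record against the reference format and collect warnings."""
    ref = records[reference]
    warnings = []
    for name, rec in records.items():
        if name == reference:
            continue
        if rec.url != ref.url:
            warnings.append(
                f"URL mismatch for {doc_id}: {name} URL was {rec.url} "
                f"but {reference} URL was {ref.url}"
            )
        if rec.url_hash != ref.url_hash:
            warnings.append(
                f"URL hash mismatch for {doc_id}: {name} URL hash was "
                f"{rec.url_hash} but {reference} URL hash was {ref.url_hash}"
            )
    return warnings


# Values copied from the first mismatch reported in the log.
found = check_consistency(
    "clueweb22-de0000-00-13406",
    {
        "html": StreamRecord(
            "https://www.jovanna.de/", "B6956297B5EBBDFEAABF458F2FA5EADC"
        ),
        "outlink": StreamRecord(
            "https://www.jovanovic.com/quotidien.htm",
            "9D5A53C6ACCB07B2C2319A4E5E44AB76",
        ),
    },
)
for line in found:
    print("[WARNING]", line)
```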

As expected, parsing the WARC files is roughly 23x slower than just reading the JSONL file (about 12.5k vs. 540 docs/s).
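
The slowdown factor can be derived from the figures in the log, either from the reported throughput or from the wall-clock times. A short check, using only numbers that appear in the log output:

```python
# Throughput reported by the progress bars (documents per second).
text_only_rate = 12524.57
combined_rate = 541.70

# Wall-clock times reported in the [finished] lines (seconds).
text_only_seconds = 8.06
combined_seconds = 3 * 60 + 5  # 03:05

slowdown_by_rate = text_only_rate / combined_rate
slowdown_by_time = combined_seconds / text_only_seconds

print(f"by throughput: {slowdown_by_rate:.1f}x")
print(f"by wall clock: {slowdown_by_time:.1f}x")
```

Both ways of computing it land at about the same factor, so the difference is not an artifact of how the progress bar averages its rate.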

seanmacavaney · 2023-01-11T11:07:01Z

Great news, my copy of the CW22 drive arrived.

janheinrichmerker · 2023-01-11T11:54:59Z

Great to hear that!

janheinrichmerker · 2023-03-14T10:51:21Z

I've updated the branch to reflect upstream changes and added default_text() implementations.
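
For readers unfamiliar with it: in ir_datasets, document records are named tuples, and `default_text()` returns the text a downstream tool should use when no field is selected explicitly. A minimal sketch of the idea follows. The `TextDoc` class and its fields are hypothetical and do not reproduce the record types defined in this branch.

```python
from typing import NamedTuple


class TextDoc(NamedTuple):
    """Hypothetical text-only document record."""
    doc_id: str
    url: str
    text: str

    def default_text(self) -> str:
        # For a text-only record, the plain text is the natural default.
        return self.text


doc = TextDoc("example-doc-0", "https://example.org/", "Example page text")
print(doc.default_text())
```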

janheinrichmerker · 2023-05-02T11:05:12Z

Is anything still blocking the merge?

seanmacavaney · 2023-05-02T11:06:16Z

Sorry -- the only thing blocking is finding the time to run through the tests on my end.

janheinrichmerker · 2024-02-19T09:40:02Z

Hey @seanmacavaney, have you found time to run the tests? Now that ClueWeb22 is used in a number of research papers, I really think it would be worth adding it to ir_datasets. If there is anything I can help with, please let me know.

janheinrichmerker · 2024-04-19T13:37:39Z

Closing this PR in favor of the new ir-datasets-clueweb22 extension.

janheinrichmerker added 16 commits October 19, 2022 15:29

Add IO utils

a842f8e

Fix IO utils

879301f

Add CW22 base records

05d710d

Add CW22 doc records

f052f0f

Improve IO utils typing

f08d029

Fix concat IO util

2f15e36

Add documentation

bc08d5e

Add CW22 config types

30bba6e

Add CW22 ID type

b34f395

Add CW22 record readers

dbee8fd

Add CW22 record readers

56456cf

Add CW22 combining doc iterators

af074a1

Re-structure CW22 format

bef1678

Re-structure CW22 format

7770ea4

Configure readers

80b4265

Configure CW22 document combiners

288b481

janheinrichmerker added 12 commits October 20, 2022 00:42

Fix generics

d4b1861

Fix IO wrapper

e5bfe0b

Add CW22 iterator

1842ef6

Add CW22 docs

b292548

Add CW22 docstore

17d347a

Rename CW22 classes

210becb

Rename CW22 classes

b54ac03

Re-export CW22 classes

e5e0d0c

Prepare CW22 download instructions

0aadd43

Add CW22 bib

6fa307b

Add CW22 docs

6cb7af6

Add CW22 docs

fefd059

janheinrichmerker added 6 commits December 1, 2022 14:37

Remove Final usages

8a9608a

Improve CW22 Python backwards compatibility

2ba133c

Move CW22 docstrings to their public interfaces

668edf7

Prepare CW22 screenshots

e6c1ea0

(only need to uncomment lines once they are released)

Replace Protocol usages with Callable

6e60492

Fix parameter name

2b7b063

Add CW22 record ID and payload digest

a2e1d10

janheinrichmerker added 5 commits December 13, 2022 16:56

Fix ClueWeb22 date format

df9c20d

Fix ClueWeb22 path bug

3e17a30

(outlink stream path have `zh` instead of `zh_chs`)

Add slice test for fixed paths

9e9819a

Fix language tag

27ba440

Add bug fix for missing line break in CW22 offset file

72bd72e

janheinrichmerker added 3 commits March 14, 2023 11:32

Merge remote-tracking branch 'origin/master' into clueweb22

2dd9581

# Conflicts: # ir_datasets/etc/downloads.json

Update requirement

b1f9836

Add CW22 default text

845f344

janheinrichmerker closed this Apr 19, 2024
