-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clueweb22 #213
Clueweb22 #213
Conversation
Wow-- thanks! Seems to be coming along nicely. The vdom structure is a bit complicated, but I guess it needs to be in order to properly represent the data. |
Yep, I haven't started with the VDOM type yet, but will as soon as the documentation is up. |
Thanks! Looks like there are still some py36 incompatibilities: My main hesitation remains that in my experience so far with the package, it seems that most users just care about having an easy way to get the text, even when loads of other nice structured data are available. So I'd like to make that case as easy and optimised as possible for folks. You make some reasonable counter-points, though, and I think I'm inclined to agree on the current path forward. But maybe it's worth getting some additional input before committing to it. |
Well, with the current approach it is already easy (just use So why is it a problem to have users explicitly choose I'm now going to test everything with a Python 3.6 interpreter, just to be sure. |
I'd like to add that there are also datasets in
So I don't see a general pattern for preferring shorter IDs for the "only text"-version. |
(only need to uncomment lines once they are released)
That should have been the last few 3.7-incompatible things. |
Awesome, thanks! |
Maybe I'd feel a bit more comfortable if we had some performance benchmarks. E.g., how fast is it to iterate the first 100k documents for the combined vs text-only versions? |
These might not be too accurate as I'm accessing the files remotely via CephFS but here you go:
As expected parsing the WARC files is 22x slower than just reading the JSONL file. |
(outlink stream path have `zh` instead of `zh_chs`)
Great news, my copy of the CW22 drive arrived. |
Great to hear that! |
# Conflicts: # ir_datasets/etc/downloads.json
I've updated the branch to reflect upstream changes and added |
Is anything still blocking the merge? |
Sorry -- the only thing blocking is finding the time to run through the tests on my end. |
Hey @seanmacavaney, have you found time to run the tests? Now that the ClueWeb22 is used in a number of research papers, I really think it would be worth it to add it to |
Closing this PR in favor of the new |
I'd like to keep this PR as a way of tracking progress of the ir_datasets integration for ClueWeb22.
Of course, the implementation is far from finished (as you can see by the numerous todo's 😆).
But I figure that keeping the process open to other contributors might encourage valuable feedback and discussion.
And of course, this PR would close #210 😉