TIPSTER corpus #241

Open
8 tasks
breuert opened this issue Jul 5, 2023 · 0 comments

Hi @seanmacavaney,
I am considering reproducing an experiment based on the TIPSTER corpus, which was used in TREC-3 and earlier tracks (https://catalog.ldc.upenn.edu/LDC93T3A). As far as I can tell, the ir-datasets catalog does not include TIPSTER or any of these earlier tracks. Did you already try to integrate them, and did that cause any problems? Or is there another reason that makes it impossible to integrate them?

As far as I know, the disks were distributed with different naming schemes. For instance, my copy of disks 4 and 5 has lower-cased file names, which differs from the format ir-datasets expects for "disks45/nocr/trec-robust-2004" (I copied the data as is from TREC's CD-ROMs). I remember that this issue was also discussed as part of OSIRRC back in 2019: osirrc/jig#28

Similarly, the file names in my copies of TIPSTER Vol. 1-3 are also lower-cased. Do you have any recommendations on which format to use if I try to add these datasets?
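
One way to sidestep the naming difference would be to resolve paths without regard to case. The following is only a minimal sketch of that idea: `resolve_case_insensitive` is a hypothetical helper, not part of ir-datasets, and the path in the usage note is purely illustrative.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_case_insensitive(root: Path, relative: str) -> Optional[Path]:
    """Resolve `relative` under `root`, ignoring case in every path component.

    Hypothetical helper for illustration only; not an ir-datasets API.
    Returns None if any component cannot be matched.
    """
    current = Path(root)
    for part in Path(relative).parts:
        if not current.is_dir():
            return None
        wanted = part.lower()
        # Collect all directory entries that match when case is ignored.
        matches = sorted(e for e in os.listdir(current) if e.lower() == wanted)
        if not matches:
            return None
        # Prefer an exact match; otherwise take the first case-insensitive one.
        current = current / (part if part in matches else matches[0])
    return current
```

With such a helper, a loader that expects an upper-cased path such as `FT/FT911/FT911_1` could still find `ft/ft911/ft911_1` in a lower-cased copy, and vice versa.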

Many thanks,
Timo

Dataset Information:

The TREC conferences emerged from the TIPSTER Text Program, and this corpus is one of the first large-scale datasets curated for system evaluations. More information can be found here: https://www-nlpir.nist.gov/related_projects/tipster/trec.htm

Links to Resources:

https://trec.nist.gov/data/topics_eng/index.html
https://trec.nist.gov/data/qrels_eng/index.html
https://www-nlpir.nist.gov/related_projects/tipster/trec.htm
https://catalog.ldc.upenn.edu/LDC93T3A

Dataset ID(s) & supported entities:

  • tipster/trec1/adhoc: docs, queries, qrels
  • tipster/trec1/routing: docs, queries, qrels
  • tipster/trec2/adhoc: docs, queries, qrels
  • tipster/trec2/routing: docs, queries, qrels
  • tipster/trec3/adhoc: docs, queries, qrels
  • tipster/trec3/routing: docs, queries, qrels
  • tipster/trec4/adhoc: docs, queries, qrels
  • tipster/trec4/routing: docs, queries, qrels
  • tipster/trec5/adhoc: docs, queries, qrels
  • tipster/trec5/routing: docs, queries, qrels
  • tipster/trec6/adhoc: docs, queries, qrels
  • tipster/trec6/routing: docs, queries, qrels
  • ...

(I think other iterations and tracks based on TIPSTER could be added in a similar fashion.)
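
The IDs listed above follow a regular pattern, `tipster/trec{N}/{task}`. As a rough sketch, they could be enumerated as shown below; the names `TREC_ITERATIONS`, `TASKS`, `ENTITIES`, and `proposed_dataset_ids` are made up for illustration, and none of these IDs is registered in ir-datasets yet.

```python
from itertools import product

# Only the iterations and tasks explicitly listed in this issue (assumption:
# further iterations would be added by hand once they are confirmed).
TREC_ITERATIONS = range(1, 7)          # TREC-1 through TREC-6
TASKS = ("adhoc", "routing")
ENTITIES = ("docs", "queries", "qrels")


def proposed_dataset_ids():
    """Map each proposed dataset ID to the entity types it should support."""
    return {
        f"tipster/trec{n}/{task}": ENTITIES
        for n, task in product(TREC_ITERATIONS, TASKS)
    }
```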

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

  • Dataset definition (in ir_datasets/datasets/[topid].py)
  • Tests (in tests/integration/[topid].py)
  • Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
  • Documentation (in ir_datasets/etc/[topid].yaml)
  • Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
  • Downloadable content (in ir_datasets/etc/downloads.json)
  • Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
  • Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.