TIPSTER corpus #241

breuert · 2023-07-05T10:45:26Z

Hi @seanmacavaney
I am considering reproducing an experiment that uses the TIPSTER corpus, which was used in TREC-3 and earlier iterations (https://catalog.ldc.upenn.edu/LDC93T3A). As far as I can tell, the ir-datasets catalog does not feature TIPSTER or any of these earlier iterations. Did you already try to integrate them, and did that cause any problems? Or is there another reason that makes it impossible to integrate them?

As far as I know, the disks were distributed with different naming schemes. For instance, my copy of disks 4 and 5 has lower-cased file names, which differs from the format ir-datasets expects for "disks45/nocr/trec-robust-2004" (I copied the data as is from TREC's CD-ROMs). I remember that this issue was also discussed as part of OSIRRC back in 2019: osirrc/jig#28

Similarly, my TIPSTER Vol. 1 - 3 copies are also lower-cased. Do you have any recommendations on which format to use if I try to add these datasets?
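
To make the naming question more concrete, here is a minimal sketch of a case-insensitive path lookup that would accept both upper- and lower-cased copies of the disks. It is not taken from ir-datasets; the helper name `resolve_case_insensitive` and the example path in the usage note are my own assumptions, for illustration only.

```python
from pathlib import Path
from typing import Optional


def resolve_case_insensitive(root: Path, relative: str) -> Optional[Path]:
    """Return the on-disk path under `root` that matches `relative`, ignoring case.

    Hypothetical helper (not an ir-datasets API). Returns None when a path
    component is missing or when several entries differ only by case.
    """
    current = root
    for part in Path(relative).parts:
        # Prefer an exact match so copies that already use the expected
        # naming scheme are resolved without scanning the directory.
        exact = current / part
        if exact.exists():
            current = exact
            continue
        if not current.is_dir():
            return None
        # Fall back to comparing lower-cased names of the directory entries.
        matches = [
            child for child in current.iterdir()
            if child.name.lower() == part.lower()
        ]
        if len(matches) != 1:
            return None  # missing or ambiguous
        current = matches[0]
    return current
```

With a lower-cased copy, a call such as `resolve_case_insensitive(Path("/data/disks45"), "FT/FT911/FT911_1")` would return the lower-cased file if it exists (both the root and the relative path here are placeholders). The alternative would be to rename or symlink the files once, which keeps the loader simple but requires every user to prepare their copy.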

Many thanks,
Timo

Dataset Information:

The TREC conferences emerged from the TIPSTER Text Program, and this corpus is one of the first large-scale datasets curated for system evaluation. More information can be found here: https://www-nlpir.nist.gov/related_projects/tipster/trec.htm

Links to Resources:

https://trec.nist.gov/data/topics_eng/index.html
https://trec.nist.gov/data/qrels_eng/index.html
https://www-nlpir.nist.gov/related_projects/tipster/trec.htm
https://catalog.ldc.upenn.edu/LDC93T3A

Dataset ID(s) & supported entities:

tipster/trec1/adhoc: docs, queries, qrels
tipster/trec1/routing: docs, queries, qrels
tipster/trec2/adhoc: docs, queries, qrels
tipster/trec2/routing: docs, queries, qrels
tipster/trec3/adhoc: docs, queries, qrels
tipster/trec3/routing: docs, queries, qrels
tipster/trec4/adhoc: docs, queries, qrels
tipster/trec4/routing: docs, queries, qrels
tipster/trec5/adhoc: docs, queries, qrels
tipster/trec5/routing: docs, queries, qrels
tipster/trec6/adhoc: docs, queries, qrels
tipster/trec6/routing: docs, queries, qrels
...

(I think other iterations and tracks based on TIPSTER could be added in a similar fashion.)
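
The naming pattern of the IDs listed above can be sketched as follows. This only enumerates the proposed identifiers for TREC-1 through TREC-6. None of them are registered in ir-datasets yet, and the function name `proposed_dataset_ids` is hypothetical.

```python
from itertools import product

ROUNDS = range(1, 7)          # TREC-1 through TREC-6, as listed above
TASKS = ("adhoc", "routing")  # the two tasks proposed per round


def proposed_dataset_ids():
    """Build the proposed (not yet existing) dataset IDs, round by round."""
    return [f"tipster/trec{n}/{task}" for n, task in product(ROUNDS, TASKS)]


if __name__ == "__main__":
    for dataset_id in proposed_dataset_ids():
        print(dataset_id)
```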

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

Dataset definition (in ir_datasets/datasets/[topid].py)
Tests (in tests/integration/[topid].py)
Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
Documentation (in ir_datasets/etc/[topid].yaml)
Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
Downloadable content (in ir_datasets/etc/downloads.json)
Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

breuert added the add-dataset label Jul 5, 2023
