Hi @seanmacavaney
I am considering reproducing an experiment that uses the TIPSTER corpus used in TREC-3 and earlier tracks (https://catalog.ldc.upenn.edu/LDC93T3A). As far as I can tell, the ir_datasets catalog does not feature TIPSTER or any of the earlier tracks. Did you already try to integrate them, and did it cause any problems? Or is there another reason that makes it impossible to integrate them?
As far as I know, the disks were distributed with different naming schemes. For instance, my copy of disks 4 and 5 has lowercase file names, which differs from the format ir-datasets expects for "disks45/nocr/trec-robust-2004" (I copied the data as-is from TREC's CD-ROMs). I remember that this issue was also discussed as part of OSIRRC back in 2019: osirrc/jig#28
My copies of TIPSTER Vol. 1-3 also have lowercase file names. Do you have any recommendations on which format to use if I try to add these datasets?
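One option I could imagine for the naming mismatch is to resolve expected paths without regard to case, so that both upper-case and lowercase copies of a disk are accepted. Below is only a rough sketch of that idea: `resolve_case_insensitive` is a hypothetical helper and not part of ir_datasets, and the directory and file names in the demo are made up for illustration.

```python
import tempfile
from pathlib import Path


def resolve_case_insensitive(root, relative_path):
    """Hypothetical helper (not an ir_datasets API): locate relative_path
    under root, matching every path component without regard to case."""
    current = Path(root)
    for part in Path(relative_path).parts:
        # Compare lower-cased names so that "FBIS" matches "fbis" and vice versa.
        matches = [p for p in current.iterdir() if p.name.lower() == part.lower()]
        if not matches:
            raise FileNotFoundError(f"{part!r} not found under {current}")
        if len(matches) > 1:
            # Possible on case-sensitive file systems, e.g. "FBIS" next to "fbis".
            raise ValueError(f"ambiguous match for {part!r} under {current}")
        current = matches[0]
    return current


# Demo on a throwaway directory that mimics a lowercase copy of a disk.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "fbis").mkdir()
    (Path(tmp) / "fbis" / "example01").write_text("<DOC></DOC>")
    found = resolve_case_insensitive(tmp, "FBIS/EXAMPLE01")
    print(found.relative_to(tmp))
```

The sketch walks the directory tree one component at a time instead of lower-casing the whole path, so it would also cope with copies that mix cases between directories and files.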
Many thanks,
Timo
Dataset Information:
The TREC conferences emerged from the TIPSTER Text Program and this corpus is one of the first large-scale datasets that was curated for system evaluations. More information can be found here: https://www-nlpir.nist.gov/related_projects/tipster/trec.htm
Links to Resources:
https://trec.nist.gov/data/topics_eng/index.html
https://trec.nist.gov/data/qrels_eng/index.html
https://www-nlpir.nist.gov/related_projects/tipster/trec.htm
https://catalog.ldc.upenn.edu/LDC93T3A
Dataset ID(s) & supported entities:
tipster/trec1/adhoc: docs, queries, qrels
tipster/trec1/routing: docs, queries, qrels
tipster/trec2/adhoc: docs, queries, qrels
tipster/trec2/routing: docs, queries, qrels
tipster/trec3/adhoc: docs, queries, qrels
tipster/trec3/routing: docs, queries, qrels
tipster/trec4/adhoc: docs, queries, qrels
tipster/trec4/routing: docs, queries, qrels
tipster/trec5/adhoc: docs, queries, qrels
tipster/trec5/routing: docs, queries, qrels
tipster/trec6/adhoc: docs, queries, qrels
tipster/trec6/routing: docs, queries, qrels
(I think other iterations and tracks based on TIPSTER could be added in a similar fashion.)
Checklist
Mark each task once completed. All should be checked prior to merging a new dataset.
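ir_datasets/datasets/[topid].py
tests/integration/[topid].py
ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json
ir_datasets/etc/[topid].yaml
Downloadable content (in ir_datasets/etc/downloads.json)
Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.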