CSIRO Enterprise Research Collection
TREC Enterprise track 2007
Access to the collection
Note: to access the new corpus data, you must first sign and return the Organisation Agreement and be allocated a username and password. Instructions for where to return the signed agreement are in the agreement itself.
If you only fax the form, please also email us, so that we have an email address in case we need to contact you and can confirm that the fax got through.
The data may be downloaded either as a single tar file (CSIRO_Enterprise_Research_Collection.tar - 357MB in size) or as individual bundles and extras.
Quick stats on the corpus:
| 4493545213 bytes | = 4.1849401 gigabytes (binary, 2^30 bytes each) |
| 370715 docs | |
| 267 bundles - CSIRO0lmn.gz | |
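The gigabyte figure in the table matches the byte count only if a gigabyte is taken as 2^30 bytes. A minimal sketch of that arithmetic follows; the function name `bytes_to_binary_gigabytes` is illustrative and not part of the collection's tooling.

```python
# Convert the corpus size in bytes to binary gigabytes (2**30 bytes each).
# The function name is hypothetical; only the byte count comes from the table.

def bytes_to_binary_gigabytes(num_bytes):
    """Return num_bytes expressed in units of 2**30 bytes."""
    return num_bytes / 2**30


corpus_bytes = 4493545213  # total corpus size quoted in the table
print(f"{bytes_to_binary_gigabytes(corpus_bytes):.7f}")
```

Dividing by 10^9 instead would give roughly 4.49, so the table's figure is in binary units.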
Extras:
- md5sum.txt.gz - md5 checksums for each of the 267 bundles
- redirects.txt.gz - information from the crawler about URL redirects encountered
- url2id.txt.gz - mapping of URLs to DOCIDs
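After downloading, the bundles can be checked against md5sum.txt.gz. The sketch below is one possible way to do that, not an official tool. It assumes the checksum file uses the usual `md5sum` output layout (a hex digest, whitespace, then a file name per line), which the collection notes do not confirm. The function names and the directory argument are illustrative.

```python
import gzip
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_listing(text):
    """Parse '<hex digest>  <file name>' lines into {name: digest}.

    ASSUMPTION: the listing follows the md5sum output layout.
    Lines that do not split into two fields are skipped.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        digest, name = parts
        # md5sum marks binary-mode entries with a leading '*'.
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify_bundles(listing_path, bundle_dir):
    """Return the listed names that are missing or fail the MD5 check."""
    with gzip.open(listing_path, "rt", encoding="utf-8") as handle:
        expected = parse_md5_listing(handle.read())
    failures = []
    for name, digest in sorted(expected.items()):
        # Look for each bundle by its base name inside bundle_dir.
        path = Path(bundle_dir) / Path(name).name
        if not path.is_file() or md5_of_file(path) != digest:
            failures.append(name)
    return failures
```

Calling `verify_bundles("md5sum.txt.gz", ".")` from the download directory would return an empty list if every listed bundle is present and matches its checksum. Any names it returns are candidates for re-downloading.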
Crawl of *.csiro.au websites carried out using Funnelback 6.0 by Peter Thew and Peter Bailey. Data preparation by Nick Craswell and Peter Bailey.

