About the size of TREC-TS-2015F?


姚应哲

Jul 2, 2015, 11:06:23 AM
to tre...@googlegroups.com
I have downloaded the corpus for TREC-TS-2015F topic by topic with my own Python script.
The download is now complete, but its total size is about 37 GB. Is this right?
Could you give us participants a detailed description of TREC-TS-2015F, as was done last year?

Richard McCreadie

Jul 6, 2015, 8:50:58 AM
to tre...@googlegroups.com
Hi,

You should expect that the corpus size will be smaller than last year, since we were more stringent with the filtering.

I will post the total size per topic later on so you can check that you have it all.

RichardM

Richard McCreadie

Jul 6, 2015, 11:38:02 AM
to tre...@googlegroups.com
4.9G    26
1.9G    27
1.7G    28
1.2G    29
136M    30
4.1G    31
4.9G    32
4.7G    33
321M    34
2.3G    35
895M    36
1.6G    37
867M    38
488M    39
313M    40
979M    41
703M    42
1.4G    43
318M    44
241M    45
4.2G    46
38G    total
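
A listing like the one above can be reproduced for a local copy with a short script. The following is only a minimal sketch, not the organizers' tooling: it assumes the corpus is stored with one subdirectory per topic ID and that the corpus files end in `.gpg`. The names `topic_stats` and `human_size` are hypothetical.

```python
import os


def topic_stats(corpus_root, suffix=".gpg"):
    """Map each topic directory name to (file_count, total_bytes).

    Assumes one subdirectory per topic directly under corpus_root;
    this layout is an assumption, adjust it to how you stored the files.
    """
    stats = {}
    for entry in sorted(os.listdir(corpus_root)):
        topic_dir = os.path.join(corpus_root, entry)
        if not os.path.isdir(topic_dir):
            continue
        count, total = 0, 0
        # Include files in any nested subfolders below the topic directory.
        for dirpath, _dirnames, filenames in os.walk(topic_dir):
            for name in filenames:
                if name.endswith(suffix):
                    count += 1
                    total += os.path.getsize(os.path.join(dirpath, name))
        stats[entry] = (count, total)
    return stats


def human_size(num_bytes):
    """Format a byte count with 1024-based units and one decimal place."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024:
            return f"{size:.1f}{unit}"
        size /= 1024
    return f"{size:.1f}T"
```

Calling `topic_stats("/path/to/corpus")` and printing `human_size(total)` for each topic gives numbers that can be set against the listing. Note that `os.path.getsize` returns the apparent file size in bytes, whereas `du` reports disk usage by default, so small differences between the two are to be expected.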

Bilel Moulahi

Jul 7, 2015, 8:49:00 AM
to tre...@googlegroups.com
Hi, 

We downloaded the corpus using the same method as last year, but the sizes we obtained differ from those reported by Richard.
Here are the details per topic:

topic  #gpg_docs  size
26     11101      7.4G
27     5377       3.0G
28     12432      3.7G
29     7567       2.4G
30     2111       421M
31     6206       5.6G
32     22807      9.4G
33     15398      8.0G
34     2941       769M
35     13265      4.8G
36     3097       1.6G
37     12350      3.7G
38     2437       1.3G
39     3607       1.1G
40     3866       837M
41     4586       1.9G
42     4593       1.5G
43     5461       2.5G
44     3415       786M
45     2159       632M
46     7590       6.1G


Is there anything wrong with our collection?

Thanks, 
Regards, 
Bilel

姚应哲

Jul 7, 2015, 12:33:24 PM
to tre...@googlegroups.com
Hi RichardM,
Thanks for your answers!
The reported size of a file can differ from one operating system to another.
I downloaded and stored my raw data on Windows 7, and the sizes (space occupied) of the topics are as follows:
TOPICID     SIZE (space occupied)   #GPG FILES
26          4.79G                11,101             
27          1.83G                 5,377              
28          1.63G                12,432              
29          1.16G                 7,567              
30          135M                  2,111               
31          4.01G                 6,206               
32          4.87G                22,807            
33          4.69G                15,398           
34          320M                  2,941              
35          2.24G                13,265             
36          894M                  3,097              
37          1.54G                12,350              
38          865M                  2,437                
39          487M                  3,607               
40          311M                  3,866                
41          978M                  4,586                
42          701M                  4,593            
43          1.37G                 5,461              
44          316M                  3,415     
45          239M                  2,159
46          4.18G                 7,590              
TOTAL      37.4G 

I'm not sure whether this is right. Could you tell me what is wrong, if anything?
For the details of my Python script, please see the attachment.

Thanks,
Yingzhe




On Thursday, July 2, 2015 at 11:06:23 PM UTC+8, 姚应哲 wrote:
newDownLoads.py

Richard McCreadie

Jul 7, 2015, 1:23:28 PM
to tre...@googlegroups.com
That looks correct to me:

Here are my file counts (the size differences are due to rounding):

26 4.9G 11101
27 1.9G 5377
28 1.7G 12432
29 1.2G 7567
30 136M 2111
31 4.1G 6206
32 4.9G 22807
33 4.7G 15398
34 321M 2941
35 2.3G 13265
36 895M 3097
37 1.6G 12350
38 867M 2437
39 488M 3607
40 313M 3866
41 979M 4586
42 703M 4593
43 1.4G 5461
44 318M 3415
45 241M 2159
46 4.2G 7590
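
The file counts above make a convenient reference for checking a local download. The following is a minimal sketch of such a check: the counts are copied from the listing above, while the name `compare_counts` and the idea of passing in a `{topic_id: file_count}` dictionary are assumptions made for illustration.

```python
# Per-topic .gpg file counts, copied from the listing in this thread.
EXPECTED_COUNTS = {
    26: 11101, 27: 5377, 28: 12432, 29: 7567, 30: 2111, 31: 6206,
    32: 22807, 33: 15398, 34: 2941, 35: 13265, 36: 3097, 37: 12350,
    38: 2437, 39: 3607, 40: 3866, 41: 4586, 42: 4593, 43: 5461,
    44: 3415, 45: 2159, 46: 7590,
}


def compare_counts(local_counts, expected=EXPECTED_COUNTS):
    """Return {topic: (local, expected)} for every topic that does not match.

    A topic missing from local_counts is reported with a local value of None.
    """
    mismatches = {}
    for topic, want in expected.items():
        got = local_counts.get(topic)
        if got != want:
            mismatches[topic] = (got, want)
    return mismatches
```

An empty result means every topic has the expected number of files. A matching count does not rule out truncated files or a download from a different source folder, so the sizes are still worth comparing as well.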

Richard McCreadie

Jul 7, 2015, 1:27:58 PM
to tre...@googlegroups.com
Your file counts look correct, but the sizes are larger than expected.

Were you downloading from the s3://aws-publicdatasets/trec/ts/streamcorpus-2015-v1-ts-filtered/ directory, or using the original KBA 2014 folder on AWS?

RichardM

Bilel Moulahi

Aug 24, 2015, 10:33:12 AM
to temporalsummarization
Hi,

Sorry for the late reply. 
We used these URLs:

...

Regards,
Bilel