The corpus cannot be downloaded?

Henan Wang

Mar 9, 2014, 10:01:06 PM
to web-data...@googlegroups.com
Hi all,
    I am a newcomer to Amazon S3. I have installed s3cmd on my Mac, but I cannot download the corpus (Web Tables from English TLDs) using the command below:
  s3cmd get --recursive --add-header=x-amz-request-payer:requester s3://SearchJoin-tables/englishTar/
    I am currently in China, but I have also tried it on a server located in the USA. The terminal reported the following messages:
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1346823846176_1346845399014_1626.arc.gz4809554595361666879.tar.gz.tar&prefix=englishTar/ ('') 
  WARNING: Waiting 3 sec...
    Has anyone else encountered this problem, and how can I fix it?
    Thank you for your attention!
Best wishes,
Henan

Robert Meusel

Mar 10, 2014, 3:22:48 AM
to web-data...@googlegroups.com
Hi Henan,

Did you use the s3cmd 1.5.0 alpha (http://s3tools.org/s3cmd-150a1-released)? Unfortunately, with the last official release of s3cmd it is not possible to add the requester-pays header; it is simply ignored and you will get an exception.
In addition, S3 is sometimes not entirely stable and a request may not go through; s3cmd then automatically retries the request. Did your run stop completely, or did it proceed after a retry?
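
A minimal sketch of the check and the retry, assuming the 1.5.0 alpha (or a newer build) is the s3cmd actually on the PATH; the download command itself is the one documented for the corpus:

  # confirm which s3cmd version is being invoked (needs to be 1.5.0 alpha or newer)
  s3cmd --version
  # re-run the documented requester-pays download; the header is only honored from 1.5.0 alpha onwards
  s3cmd get --recursive --add-header=x-amz-request-payer:requester s3://SearchJoin-tables/englishTar/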

Cheers,
Robert

Henan Wang

Mar 10, 2014, 11:39:28 PM
to web-data...@googlegroups.com
Hi Robert,

Thank you for your reply!

I have tried the s3cmd 1.5.0 alpha, but I got the same error. The complete error log is listed below. I wonder whether my command "s3cmd get --recursive --add-header=x-amz-request-payer:requester s3://SearchJoin-tables/englishTar/" is correct?

  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1346876860840_1346954269374_785.arc.gz8249912047288259083.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1346981172184_1347017068724_4418.arc.gz5821797457276788945.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1346981172186_1346995390893_2198.arc.gz8251947706194437996.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1346981172186_1346995568916_2451.arc.gz7586952024027287519.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1346981172250_1346981395010_331.arc.gz6561457986707759630.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1346981172250_1346982076924_1215.arc.gz7717498467062295474.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1346981172250_1346984164875_3348.arc.gz5310147906482169920.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1346981172250_1346984767696_2703.arc.gz7544490827845143852.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1346981172250_1346985206022_2843.arc.gz2097386507538636765.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1346981172258_1346986276917_991.arc.gz776621164516825690.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1350433107000_1350481199932_670.arc.gz2495333143518084095.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1350433107021_1350474841001_743.arc.gz5722850201412949829.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1350433107023_1350446763626_747.arc.gz1435570456824324007.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1350433107032_1350499812100_815.arc.gz5840860091087872685.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1350433107033_1350506811133_371.arc.gz8471749399007167670.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Retrying failed request: /?marker=englishTar/common-crawl_parse-output_segment_1350433107038_1350461200797_926.arc.gz2633837436638829617.tar.gz.tar&prefix=englishTar/ ('')
  WARNING: Waiting 3 sec...
  WARNING: Empty object name on S3 found, ignoring.
  ERROR: S3 error: 403 (Forbidden):


On Monday, March 10, 2014 at 3:22:48 PM UTC+8, Robert Meusel wrote:

Oliver Lehmberg

Mar 17, 2014, 12:53:28 PM
to web-data...@googlegroups.com
Hi Henan,

I changed the permissions of all files; you should now be able to download them. Please try again.

Cheers,

Oliver

王鹤男 (Henan Wang)

Mar 17, 2014, 1:06:52 PM
to web-data...@googlegroups.com
Hi Oliver,

Thank you very much for your reply!

I tried to download the dataset on Windows using CloudBerry S3 Explorer PRO, and I could download about half of the data.

I got some XML parse errors, and I guess the reason may be a failure of my network.

Anyway, I will keep trying to download the whole dataset.



Best Wishes,
Henan



Bhavana Dalvi Mishra

Mar 24, 2014, 11:11:31 AM
to web-data...@googlegroups.com
Hello all,

I faced similar problems and could not download the English corpus completely.

When I used the get command given on the website:
$s3cmd get --recursive --add-header=x-amz-request-payer:requester s3://SearchJoin-tables/englishTar/
The last lines of the output trace were:
 s3://SearchJoin-tables/englishTar/common-crawl_parse-output_segment_1346981172142_1347012421405_1486.arc.gz2263847230256533962.tar.gz.tar -> ./common-crawl_parse-output_segment_1346981172142_1347012421405_1486.arc.gz2263847230256533962.tar.gz.tar  [470067 of 773884]
  35840 of 35840   100% in    0s   397.34 kB/s  done
  s3://SearchJoin-tables/englishTar/common-crawl_parse-output_segment_1346981172142_1347012421687_2006.arc.gz6574617375064786259.tar.gz.tar -> ./common-crawl_parse-output_segment_1346981172142_1347012421687_2006.arc.gz6574617375064786259.tar.gz.tar  [470068 of 773884]
  ERROR: S3 error: 403 (Forbidden):

So only about 470K of the 773,884 files could be downloaded. I have made sure that the disk I am downloading to has enough space, so it is not a disk-space issue.

Then I tried the cp command to copy the data into my own Amazon bucket:
$ s3cmd cp --recursive --add-header=x-amz-request-payer:requester s3://SearchJoin-tables/englishTar/ <My bucket name>
File s3://SearchJoin-tables/englishTar/ copied to <my-bucket-name>
ERROR: S3 error: 400 (InvalidRequest): The specified copy source is larger than the maximum allowable size for a copy source: 5368709120

Does anyone know a way out of this?

Regards,
Bhavana

Oliver Lehmberg

Mar 25, 2014, 4:40:30 AM
to web-data...@googlegroups.com
Hi Bhavana,

Thank you for your feedback. We are currently investigating the 403 issues. Concerning s3cmd, you need to use at least version 1.5.0 alpha (the newest version is 1.5.0 beta 1, http://s3tools.org/download).

Regards,
Oliver

Oliver Lehmberg

Mar 27, 2014, 11:57:52 AM
to web-data...@googlegroups.com
Hi all,

Can you please try to download the missing files again? The "ERROR: S3 error: 403 (Forbidden)" should no longer occur.


Regards,
Oliver

On Monday, March 24, 2014 at 16:11:31 UTC+1, Bhavana Dalvi Mishra wrote:

Bhavana Dalvi Mishra

Mar 27, 2014, 1:52:59 PM
to web-data...@googlegroups.com
Hello Oliver,

Thanks for looking into this problem. Yes, I am using the newer version of s3cmd, 1.5.0.
I have just started copying the English subset of the tables and will let you know whether it succeeds or fails.

Regards,
Bhavana

Bhavana Dalvi Mishra

Mar 28, 2014, 2:10:48 PM
to web-data...@googlegroups.com
Hello Oliver,

I tried two commands: one to download the data onto servers at my school and one to copy the content into my own bucket. Both failed with the following errors.

1) Downloading the data onto our server
   $s3cmd get --recursive --add-header=x-amz-request-payer:requester s3://SearchJoin-tables/englishTar/
   ......
   .......
   s3://SearchJoin-tables/englishTar/common-crawl_parse-output_segment_1346876860567_1346913407116_2034.arc.gz6028615114515978013.tar.gz.tar -> ./common-crawl_parse-output_segment_1346876860567_1346913407116_2034.arc.gz6028615114515978013.tar.gz.tar  [80916 of 773884]
   ERROR: S3 error: 500 (Internal Server Error):

2) Copying to a bucket
  $s3cmd cp --recursive --add-header=x-amz-request-payer:requester s3://SearchJoin-tables/englishTar/ <MY BUCKET NAME>
  WARNING: Retrying failed request: / ('')
  WARNING: Waiting 3 sec...
  ERROR: S3 error: 400 (InvalidArgument): You can only specify a copy source header for copy requests.

Regards,
Bhavana

Oliver Lehmberg

Apr 2, 2014, 11:09:00 AM
to web-data...@googlegroups.com
Hi,

Concerning your second problem, I found something similar here: https://github.com/s3tools/s3cmd/issues/142#issuecomment-14973262. It could be that for the cp command the header is incorrectly applied to the list request that is made before copying, but I am not sure about that. Can you try to download the files using another tool? I would also recommend that for the first problem, which can most probably be resolved by simply retrying the download.
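
One possible workaround to sketch here (untested; <my-bucket-name> is a placeholder): a single S3 copy operation is capped at 5 GB (5,368,709,120 bytes, exactly the limit reported in the earlier error), so instead of a server-side copy you could download the prefix locally and then re-upload it into your own bucket. The AWS CLI (aws s3 sync with --request-payer requester) would be another tool worth trying.

  # 1) download the prefix locally with the requester-pays header (s3cmd >= 1.5.0 alpha)
  s3cmd get --recursive --add-header=x-amz-request-payer:requester s3://SearchJoin-tables/englishTar/ ./englishTar/
  # 2) re-upload the local copy into your own bucket
  s3cmd put --recursive ./englishTar/ s3://<my-bucket-name>/englishTar/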

Regards,
Oliver

王鹤男 (Henan Wang)

Apr 2, 2014, 12:07:08 PM
to web-data...@googlegroups.com
Hi all,

I have tried to download the dataset on Windows, and it worked!

I can recommend the CloudBerry tool; you can give it a try.

Best Wishes,
Henan

eteng...@gmail.com

Jun 26, 2014, 12:51:50 PM
to web-data...@googlegroups.com
Dear all,

I wonder whether the web table data in S3 is still open for download. When I used s3cmd to download the data today (either s3://SearchJoin-tables/englishTar/ or s3://WebTablesExtraction/c*), I got an "ERROR: S3 error: 403 (AccessDenied): Access Denied" error. Thanks a lot for your reply.

Best,
Eteng

Petar Ristoski

Jun 29, 2014, 8:34:04 AM
to web-data...@googlegroups.com
Hi Eteng,

We extracted a new corpus of web tables with higher quality than the previous one with respect to precision and recall. The corpus is available for free download here. There are 900 tar files with a total size of 820 GB. Each tar file is around 1 GB and contains 1,000 gz files. Each gz file contains the original HTML file, a JSON file with metadata, and a CSV file for each extracted table.
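
To inspect a downloaded part, something like the following sketch should work (the file name is a placeholder, and it is an assumption that the inner .gz members are gzipped tar archives, which is not stated explicitly):

  # unpack one downloaded part into its ~1,000 inner .gz files (file name is a placeholder)
  mkdir part-00 && tar -xf corpus-part-00.tar -C part-00
  # check what the inner members actually are before extracting further
  file part-00/* | head
  # if they are gzipped tar archives, list the contents of one of them
  tar -tzf "$(ls part-00/*.gz | head -n 1)"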

In the following days we will publish complete statistics for the corpus on our web page. Also, we will publish a separate corpus of English tables only.

Regards,

Petar

Petar Ristoski

Jul 10, 2014, 3:36:54 PM
to web-data...@googlegroups.com
Hi Eteng,

You can download the English corpus here. There are 28,552,388 tables in total, packed in 180 tar files with a total size of 237 GB. Each tar file is around 1.4 GB and contains 5,000 gz files. Each gz file contains a couple of hundred tables, including the original HTML file, a JSON file with metadata, and a CSV file for each extracted table.
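
As a rough sanity check once the download has finished (a sketch only, assuming all tar files sit in the current directory):

  # the English corpus is packed into 180 tar files totalling roughly 237 GB
  ls *.tar | wc -l   # expect 180
  du -sh .           # expect something in the region of 237 GB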
 
Regards,
 
Petar

Shuo ZHANG

Jul 19, 2014, 2:23:18 AM
to web-data...@googlegroups.com
Dear Petar,

Thank you so much! I appreciate your help and effort!

Best,
Eteng

