Access Denied

Mariano Rodriguez Muro

May 19, 2014, 4:35:15 PM
to web-data...@googlegroups.com
Hi, 

I'm getting Access Denied errors when trying to download the English subset of the data.


s3cmd get --recursive --add-header=x-amz-request-payer:requester s3://SearchJoin-tables/englishTar/

ERROR: S3 error: 403 (AccessDenied): Access Denied

[root@v525400d2d5d7 05192014]# s3cmd --version

s3cmd version 1.5.0-beta1

I searched for support on this, and the advice indicates that AWS policies have to be in place for the bucket, but I imagine the bucket already has its policy set up.
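One thing I may still need to double-check on my side: even on a requester-pays bucket, every request must be signed with valid AWS credentials, so the 403 could also come from my key setup rather than from the bucket policy. A minimal re-check (assuming s3cmd 1.5.x):

s3cmd --configure    # (re)enter the AWS access key and secret key
# then retry the recursive get above with the same requester-pays header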

Are there any pointers you could give me to solve this issue?

Thank you in advance,

Best regards,
Mariano

Petar Ristoski

May 20, 2014, 8:41:51 AM
to web-data...@googlegroups.com
Hi Mariano,

Please follow the instructions discussed in the previous thread. If you are still not able to download the corpus from Amazon S3, you can download the corpus here using standard command-line tools. Note that it may take a couple of days to download the complete corpus.
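A sketch of both routes, assuming valid AWS credentials are already configured; the HTTP URL below is only a placeholder for the actual link above:

# S3 route: recent AWS CLI versions support requester-pays directly,
# equivalent to the x-amz-request-payer header used with s3cmd
aws s3 cp s3://SearchJoin-tables/englishTar/ ./englishTar/ --recursive --request-payer requester

# HTTP route: -c resumes a partial file if the connection drops
# during a multi-day transfer (substitute the real URL from the link)
wget -c http://example.org/corpus/englishTar-000.tar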

However, we are running a new extraction that will produce a corpus of tables with higher quality than the current one. The corpus should be released by the end of the month, and it will be available for free download. Depending on your schedule, you might want to wait for the new release.

Regards,
Petar   

Mariano Rodriguez Muro

May 22, 2014, 5:57:02 PM
to web-data...@googlegroups.com
Thank you, Petar, we really appreciate it.

We will download the current corpus now and do it again once the new release is out.

Cheers,
Mariano

Petar Ristoski

Jun 29, 2014, 8:33:24 AM
to web-data...@googlegroups.com
Hi Mariano,

The new extraction is finished, and the files are available for free download here. There are 900 tar files, 820 GB in total. Each tar file is around 1 GB in size and contains 1,000 gz files. Each gz file contains the original HTML file, a JSON file with metadata, and a CSV file for each extracted table.
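A short unpacking sketch, assuming each gz member is itself a gzipped tar bundling the HTML, JSON, and CSV files (archive names here are hypothetical):

# Unpack one of the 900 corpus archives
mkdir -p corpus
tar -xf webtables_000.tar -C corpus/

# Extract the HTML, JSON, and CSV files from each gz bundle
for f in corpus/*.gz; do tar -xzf "$f" -C corpus/; done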

In the coming days we will publish complete statistics for the corpus on our web page. We will also publish a separate corpus containing only English tables.

Regards,

Petar

Petar Ristoski

Jul 10, 2014, 3:35:56 PM
to web-data...@googlegroups.com
Hi Mariano,

You can download the English corpus here. There are 28,552,388 tables in total, packed in 180 tar files with a total size of 237 GB. Each tar file is around 1.4 GB and contains 5,000 gz files. Each gz file contains a couple of hundred tables, including the original HTML file, a JSON file with metadata, and a CSV file for each extracted table.
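A quick sanity check after downloading, under the same gzipped-tar assumption as above (directory name hypothetical); the CSV count across all bundles should approach the 28,552,388 tables quoted:

# Count extracted CSV tables across all downloaded bundles
total=0
for f in english/*.gz; do
    total=$((total + $(tar -tzf "$f" | grep -c '\.csv$')))
done
echo "$total CSV tables found"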
 
Regards,
 
Petar