How to list .gz files in commoncrawl-crawl-002?


s3cmdo

May 24, 2012, 8:52:04 PM
to Common Crawl
I am trying to get a listing of all the files in the commoncrawl-crawl-002
bucket from EC2. I have configured s3cmd with my key and secret and am
trying to execute this:

s3cmd ls -r --add-header="x-amz-request-payer: requester" s3://commoncrawl-crawl-002

Result:
ERROR: Access to bucket 'commoncrawl-crawl-002' was denied

My motivation is to find the name of a .gz file I can test with
BasicArcFileReaderSample.java.

Executing this command from the GitHub page:

bin/launcher.sh org.commoncrawl.samples.BasicArcFileReaderSample {AWS ACCESS KEY} {AWS SECRET KEY} commoncrawl-crawl-002 2010/01/07/18/1262876244253_18.arc.gz

Gives:

2012-05-25 00:17:47,335 ERROR org.commoncrawl.samples.BasicArcFileReaderSample: java.io.IOException: No input to process
    at org.commoncrawl.hadoop.io.ARCInputFormat.getSplits(ARCInputFormat.java:171)
    at org.commoncrawl.samples.BasicArcFileReaderSample.main(BasicArcFileReaderSample.java:64)

Mat Kelcey

May 24, 2012, 9:01:32 PM
to common...@googlegroups.com
It's likely easier to use the copy hosted under the Amazon public datasets:

http://commoncrawl.org/common-crawl-on-aws-public-data-sets/

give this one a try ...
s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-002/
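
To pull just the .arc.gz names out of such a listing, a rough sketch along these lines might help. It assumes that `s3cmd ls` prints the object URI as the last whitespace-separated field of each line, and that keys contain no spaces. The `arc_gz_keys` helper is a hypothetical name used only for illustration; it is not part of s3cmd or the Common Crawl code.

```shell
# Hypothetical helper: read `s3cmd ls`-style output on stdin and print only
# the URIs that end in .arc.gz.
# Assumption: the URI is the last whitespace-separated field on each line.
arc_gz_keys() {
  awk '$NF ~ /\.arc\.gz$/ { print $NF }'
}

# Possible usage (needs configured s3cmd credentials; not run here):
#   s3cmd ls -r s3://aws-publicdatasets/common-crawl/crawl-002/ | arc_gz_keys | head -n 1
```

The first URI printed could then be used as the file argument when testing BasicArcFileReaderSample.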

cheers
mat

s3cmdo

May 25, 2012, 5:09:16 AM
to common...@googlegroups.com
That works, but I would like to view content from 2012. I only see 2009 and 2010 content here.

s3cmdo

May 25, 2012, 7:46:21 AM
to common...@googlegroups.com
Never mind, I see the blog post. I think the Readme.md in the GitHub repo, where an example usage of BasicArcFileReaderSample is given, should be updated with this new info.

Hsiao Su

May 25, 2012, 2:24:11 PM
to common...@googlegroups.com

Which blog post are you referring to?

s3cmdo

May 25, 2012, 5:29:04 PM
to common...@googlegroups.com

s3cmdo

May 25, 2012, 5:42:05 PM
to common...@googlegroups.com
However, that won't help you run the example BasicArcFileReaderSample, because that code expects gzipped ARC files (.arc.gz). If you read the post, they have changed the format a bit and haven't updated their sample apps yet.

Hsiao Su

May 29, 2012, 7:53:04 PM
to common...@googlegroups.com

Does that mean that I cannot use the code here:


to process the data here:

 s3://aws-publicdatasets/common-crawl/crawl-002/

?

I'm just thinking about what to do next: try out the code on GitHub, or start writing my own code for parsing ARC files.

Hsiao



--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To view this discussion on the web visit https://groups.google.com/d/msg/common-crawl/-/tyh5TCZbxdcJ.