Access to crawler's results for 2008-2012.


Dmytro Rakovskyi

Mar 29, 2016, 5:06:42 AM
to Common Crawl
Hello,
How can I access the crawler's results for 2008-2012?
The S3 public data set page, http://commoncrawl.org/the-data/get-started/, does not list file paths for those years.
If anybody has a file with the paths, I would be very thankful.
Best regards,
Rakovskyi Dmytro

Tom Morris

Mar 29, 2016, 9:33:52 AM
to common...@googlegroups.com
On Tue, Mar 29, 2016 at 5:06 AM, Dmytro Rakovskyi <rakov...@gmail.com> wrote:
Hello,
How can I access the crawler's results for 2008-2012?
The S3 public data set page, http://commoncrawl.org/the-data/get-started/, does not list file paths for those years.
If anybody has a file with the paths, I would be very thankful.

There's probably more information in some of the old posts announcing them, but see:

2008/9 s3://aws-publicdatasets/common-crawl/crawl-001/
2009/10 s3://aws-publicdatasets/common-crawl/crawl-002/
2012 s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt

Don't forget that these all pre-date the switch to the WARC format, so you'll need different tooling to deal with the ARC files.

Tom
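
To illustrate why ARC needs different tooling than WARC, here is a rough sketch. It assumes the version 1 ARC layout documented by the Internet Archive, in which each record begins with a single space-separated header line (URL, IP address, archive date, content type, length) followed by that many bytes of content. The names `ArcHeader` and `parse_arc_header` are hypothetical, the sample line is made up, and real files from these crawls may contain irregular headers that need extra handling.

```python
from typing import NamedTuple


class ArcHeader(NamedTuple):
    """Fields of an ARC v1 record header line (assumed layout)."""
    url: str
    ip: str
    archive_date: str   # typically YYYYMMDDhhmmss
    content_type: str
    length: int         # number of content bytes that follow the header


def parse_arc_header(line: str) -> ArcHeader:
    """Parse one ARC v1 record header line into its five fields."""
    parts = line.strip().split(" ")
    if len(parts) < 5:
        raise ValueError("not an ARC v1 record header: %r" % line)
    # The first three fields and the last one are fixed; anything in
    # between is treated as the content type, in case it contains spaces.
    url, ip, archive_date = parts[0], parts[1], parts[2]
    content_type = " ".join(parts[3:-1])
    length = int(parts[-1])
    return ArcHeader(url, ip, archive_date, content_type, length)


# Made-up header line, for illustration only.
sample = "http://example.com/ 192.0.2.1 20090101120000 text/html 1234\n"
header = parse_arc_header(sample)
print(header.url, header.length)
```

A WARC record, by contrast, starts with a block of named header fields, so a WARC reader cannot simply be pointed at these older files.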

Tom Morris

Mar 29, 2016, 12:41:54 PM
to common...@googlegroups.com
The blog post describing the 2012 crawl is here:

Dmytro Rakovskyi

Mar 30, 2016, 10:52:52 AM
to Common Crawl
Thank you, but I have already found a solution. I collected all the file paths with the AWS CLI (aws s3 ls s3://path).

On Tuesday, March 29, 2016 at 19:41:54 UTC+3, Tom Morris wrote:

Tom Morris

Mar 30, 2016, 10:59:34 AM
to common...@googlegroups.com
On Wed, Mar 30, 2016 at 10:52 AM, Dmytro Rakovskyi <rakov...@gmail.com> wrote:
Thank you, but I have already found a solution. I collected all the file paths with the AWS CLI (aws s3 ls s3://path).

You shouldn't use that technique for 2012 and later crawls because not all the files you find will be valid. That's why there's a file with the list of valid segments.

Tom
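
As a minimal sketch of the point about valid segments: a listing obtained with `aws s3 ls` can be filtered against the contents of valid_segments.txt. This assumes the file holds one segment ID per line and that each object key contains a `/segment/<id>/` component. Neither assumption is confirmed in this thread, and the function names and sample keys below are hypothetical.

```python
from typing import Iterable, List, Set


def load_valid_segments(text: str) -> Set[str]:
    """Return the set of segment IDs, assuming one ID per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def filter_valid_keys(keys: Iterable[str],
                      valid_segments: Set[str],
                      marker: str = "/segment/") -> List[str]:
    """Keep only keys whose segment ID appears in valid_segments.

    The '/segment/<id>/' key layout is an assumption for illustration.
    """
    kept = []
    for key in keys:
        _, found, rest = key.partition(marker)
        if not found:
            continue  # key is not under a segment directory
        segment_id = rest.split("/", 1)[0]
        if segment_id in valid_segments:
            kept.append(key)
    return kept


# Made-up segment IDs and keys, for illustration only.
valid = load_valid_segments("111\n222\n")
listing = [
    "common-crawl/parse-output/segment/111/a.arc.gz",
    "common-crawl/parse-output/segment/333/b.arc.gz",
]
print(filter_valid_keys(listing, valid))
```

Filtering this way keeps the convenience of a bulk listing while still excluding files from segments that are not on the valid list.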
 


Dmytro Rakovskyi

Mar 30, 2016, 11:16:35 AM
to Common Crawl
I don't think it's possible to get the paths any other way. And yes, I only took results from valid segments (those listed in valid_segments.txt).

On Wednesday, March 30, 2016 at 17:59:34 UTC+3, Tom Morris wrote: