How to access blekko hosts extract?


OneSpeedFast

Jan 21, 2016, 4:40:36 AM
to common...@googlegroups.com
I haven't had any trouble accessing other files, but I'm doing something wrong here. How do I access the extracted hosts from blekko?


For example: /blekko/blekko-extracthosts-20130317/3921038735.seq

Andrew Berezovskyi

Mar 14, 2016, 9:10:53 AM
to Common Crawl
Hi, did you have any success with this?

Tom Morris

Mar 14, 2016, 2:35:06 PM
to common...@googlegroups.com
That file's protected:

 $ aws s3 --no-sign-request cp s3://aws-publicdatasets/common-crawl/blekko/blekko-extracthosts-20130317/3921038735.seq .
A client error (403) occurred when calling the HeadObject operation: Forbidden

so you're not going to be able to access it.  What is it that you're looking for?

Tom

OneSpeedFast

Mar 14, 2016, 2:50:07 PM
to common...@googlegroups.com
I ended up extracting hosts directly from the body of all documents, and got many millions more than are in the hosts file.
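
For anyone wanting to try the same workaround, here is a rough sketch of pulling hostnames out of a document body in Python. It is not the code used above: the function name extract_hosts and the URL pattern are made up for illustration, and the pattern is deliberately loose, so expect some noise (trailing punctuation, malformed links) in real crawl data.

```python
import re
from urllib.parse import urlsplit

# Loose pattern for absolute http(s) URLs (an assumption for this sketch,
# not a full URL grammar). It stops at whitespace, quotes and angle brackets.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""", re.IGNORECASE)


def extract_hosts(body):
    """Return the set of hostnames found in absolute URLs inside body."""
    hosts = set()
    for match in URL_RE.finditer(body):
        try:
            host = urlsplit(match.group(0)).hostname
        except ValueError:
            # urlsplit rejects some malformed URLs (e.g. a broken IPv6 literal).
            continue
        if host:
            # hostname is already lowercased; drop a trailing dot left over
            # from sentence punctuation or a fully qualified name.
            hosts.add(host.rstrip("."))
    return hosts


if __name__ == "__main__":
    sample = '<a href="http://Example.com/page">x</a> https://blog.example.org:8080/a?b=c'
    print(sorted(extract_hosts(sample)))
```

Run over every document body, the per-document sets can be unioned to get a host list. Note that this picks up every host that is merely linked to, not only hosts that were crawled.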