--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.
Visit this group at https://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.
I just discovered this project and wanted to ask a question, if I may, of course.
Is the URL list of all crawled pages available for download?
I will not be able to handle all the data with my computing resources.
I have never seen it available for download anywhere, and I was interested in it too. Instead, I ran all 150 TB of the latest crawl through a few of my servers and extracted the hostnames.
Thanks for the reply. I eventually paid a freelancer to try to access the bucket, but he told me that the bucket was not public and that I would need a secret key from you. Is this true? Or does the guy just not know what he is doing?
Repeat for each of the other 299 files and you'll have the full list.
Tom
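For anyone who wants to script the "repeat for each file" step, here is a minimal sketch. It assumes the 300 files are the gzip-compressed shards of the Common Crawl URL index, which is an assumption, since the message that named them is not shown here. The path layout, the crawl ID `CC-MAIN-2016-18`, and the helper names `shard_paths` and `url_from_cdx_line` are illustrative, not confirmed by this thread.

```python
import json

# Assumption: "the other 299 files" plus the first one gives 300 shards.
NUM_SHARDS = 300


def shard_paths(crawl_id, num_shards=NUM_SHARDS):
    """Return the relative paths of all index shards for one crawl.

    The directory layout used here is an assumption; check it against
    the listing of the crawl you actually want.
    """
    prefix = f"cc-index/collections/{crawl_id}/indexes"
    return [f"{prefix}/cdx-{i:05d}.gz" for i in range(num_shards)]


def url_from_cdx_line(line):
    """Extract the page URL from one index line, or None if malformed.

    Assumes each line is "<key> <timestamp> <JSON object>" and that the
    JSON object carries the page address in a "url" field.
    """
    parts = line.rstrip("\n").split(" ", 2)
    if len(parts) != 3:
        return None
    try:
        record = json.loads(parts[2])
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    return record.get("url")
```

Each shard can then be streamed with `gzip.open(path, "rt")` and passed line by line to `url_from_cdx_line`, so only one shard's worth of URLs needs to be held at a time.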
On Friday, May 6, 2016 at 8:26:50 AM UTC+1, Ivan Habernal wrote:
Hi Juli,
Unfortunately not, due to the transfer costs mentioned by Andrew. But you might have a look at our documentation for C4Corpus, which also describes how to run a simple free-tier AWS server and access or download any data publicly available on S3:
https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/
Beware of the transfer costs: you must run your instance in us-east-1 (N. Virginia), because that is where the Common Crawl and C4Corpus data are located; otherwise the standard fees for transfer between AWS regions apply.
Hope it helps,
Ivan
Hi Ivan, do you have an HTTP version of this to download? I'm pretty new to Common Crawl and Amazon S3. I have just tried for hours to download the public bucket, but with no luck, and I don't see any good tutorials anywhere.
Thanks in advance,
Julie
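On the HTTP question, one possible approach is to turn an `s3://` address into a plain HTTPS URL that a browser, `wget`, or `urllib` can fetch without AWS credentials. The sketch below only builds the URL. The base address `https://data.commoncrawl.org/` is an assumption and should be verified before relying on it, and the helper name `s3_to_https` and the example path are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: a public HTTPS front end for the bucket; verify before use.
DEFAULT_BASE = "https://data.commoncrawl.org/"


def s3_to_https(s3_uri, base=DEFAULT_BASE):
    """Map s3://<bucket>/<key> to <base>/<key>.

    The bucket name is dropped because the base URL is assumed to
    point at that bucket already.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {s3_uri!r}")
    return base.rstrip("/") + "/" + parsed.path.lstrip("/")
```

The resulting URL can be passed to `urllib.request.urlopen` or pasted into a download tool. Transfers made this way come from outside AWS, so the cost and speed considerations Ivan mentions still matter for large files.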