how to read all the data of Common Crawl from AWS with Java?


Pierre Therrode

Jul 8, 2015, 8:29:33 AM
to common...@googlegroups.com
Hello,

How can I read all of the Common Crawl data from AWS with Java?
I've already posted my question on Stack Overflow, but I haven't received a response yet (http://stackoverflow.com/questions/31287956/how-to-read-all-the-data-of-common-crawl-from-aws-with-java).

If someone has an idea, I'm interested.

Thanks in advance.

Mat Kelcey

Jul 8, 2015, 10:36:04 PM
to common...@googlegroups.com
Here's some code that's kind of similar; you might get some ideas from it.
I wrote it a while ago, though, so it might be a bit stale, sorry.

Robert Meusel

Jul 10, 2015, 3:19:17 AM
to common...@googlegroups.com
Hi,

We are maintaining a framework that runs on AWS, which you might want to look at:


Cheers,
Robert

sohail ahmed

Jul 10, 2015, 5:20:53 AM
to common...@googlegroups.com
Here is the Java code that I used to get the CC data. It is built using https://github.com/commoncrawl/cc-warc-examples

neededJars.txt
src.zip

Pierre Therrode

Jul 10, 2015, 5:08:22 PM
to common...@googlegroups.com
Hi,
I wrote a small program to read WAT files remotely. In case anyone has run into the same problem, here's my GitHub repository: https://github.com/pi-2r/myReaderS3Bucket
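
For anyone who wants to see the general idea without cloning anything, here is a minimal sketch of reading record headers from a gzip-compressed WAT/WARC stream using only the JDK. It is not the code from the repository above. The class name WatHeaderReader and its methods are hypothetical names made up for this illustration, and the file URL is whatever you pass on the command line (it assumes the file is reachable over plain HTTP/HTTPS). It also assumes the usual WARC layout: a "WARC/..." version line, "Name: value" header lines, a blank line, then Content-Length bytes of body.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration only; not part of any Common Crawl library.
class WatHeaderReader {

    // Reads one LF-terminated line (dropping a trailing CR); returns null at end of stream.
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            buf.write(b);
        }
        if (b == -1 && buf.size() == 0) {
            return null;
        }
        String line = new String(buf.toByteArray(), StandardCharsets.UTF_8);
        return line.endsWith("\r") ? line.substring(0, line.length() - 1) : line;
    }

    // Discards exactly n bytes, failing if the stream ends early.
    static void skipFully(InputStream in, long n) throws IOException {
        byte[] scratch = new byte[8192];
        while (n > 0) {
            int read = in.read(scratch, 0, (int) Math.min(scratch.length, n));
            if (read == -1) {
                throw new EOFException("record body truncated");
            }
            n -= read;
        }
    }

    // Returns the header fields of up to maxRecords records; record bodies are skipped.
    static List<Map<String, String>> readHeaders(InputStream gzipped, int maxRecords)
            throws IOException {
        InputStream in = new BufferedInputStream(new GZIPInputStream(gzipped));
        List<Map<String, String>> records = new ArrayList<>();
        String line;
        while (records.size() < maxRecords && (line = readLine(in)) != null) {
            if (!line.startsWith("WARC/")) {
                continue; // blank separator lines between records
            }
            Map<String, String> headers = new LinkedHashMap<>();
            while ((line = readLine(in)) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    headers.put(line.substring(0, colon).trim(),
                                line.substring(colon + 1).trim());
                }
            }
            skipFully(in, Long.parseLong(headers.getOrDefault("Content-Length", "0")));
            records.add(headers);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: URL of a .wat.gz or .warc.gz file (supplied by you; none is assumed here).
        try (InputStream remote = URI.create(args[0]).toURL().openStream()) {
            for (Map<String, String> headers : readHeaders(remote, 5)) {
                System.out.println(headers);
            }
        }
    }
}
```

One caveat: these files are normally written as many gzip members concatenated together. GZIPInputStream can continue into the next member, but it relies on available() of the underlying stream, so over a network connection it may stop early. If that happens, a dedicated WARC reader library is probably the safer choice.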

Pierre Therrode

Jul 10, 2015, 5:14:45 PM
to common...@googlegroups.com

Yes,

I found this good example, but you need to change:

 Configuration conf = getConf();
 //
 Job job = new Job(conf);

to

Configuration conf = new Configuration();
conf.set("fs.s3n.awsAccessKeyId", "your_access_key");
conf.set("fs.s3n.awsSecretAccessKey", "your_secret_key");
Job job = new Job(conf);

Otherwise you will run into connection errors when accessing AWS.

Jaffer Wilson

Dec 13, 2016, 6:29:12 AM
to Common Crawl
Hello Pierre,
May I ask whether you are interested in the program for extracting the data, or just in the data extracted from the Common Crawl corpus?
Kindly let me know.
Regards,
Jaffer Wilson