list of example open source apps that work with common crawl?


Jason

Jan 12, 2014, 9:06:59 PM
to common...@googlegroups.com
Is there a list anywhere that collects example apps for these projects? I want to try working with the Common Crawl data and build off some of the examples.

It would be especially useful to see some example apps that work with the new data format!
Looking forward to playing with this!

Robert Meusel

Feb 18, 2014, 7:42:23 AM
to common...@googlegroups.com
Jason, maybe you'll find what you're looking for here: https://commoncrawl.atlassian.net/wiki/display/CRWL/Code+Examples

Cheers,
Robert

Rob Witoff

Feb 18, 2014, 5:38:28 PM
to common...@googlegroups.com
It's my understanding that all of these examples rely on the old ARC file format. If CC is really moving to WARC, then the recommended examples should move too. Are you aware of any Hadoop (preferably EMR) examples using the new WARC/WAT/WET formats?

Thanks,
-Rob

Ross Fairbanks

Feb 19, 2014, 3:43:33 AM
to common...@googlegroups.com
Hi Rob,
I've been working with the 2013 data. Here is an example of a Hadoop word count that uses the WET files.
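Stripped of the Hadoop wiring, the core of a word count over a WET record's extracted text is just tokenize-and-tally. A minimal plain-Java sketch of that core (class and method names are my own, no Hadoop dependencies):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {
    // Tally whitespace-separated, lower-cased tokens from one record's text.
    // In the real job each (token, 1) pair would be emitted from the mapper
    // and summed in the reducer; here we just fold everything into one map.
    static Map<String, Integer> countWords(String pageText) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : pageText.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Common Crawl common data"));
    }
}
```

The mapper/reducer split matters only for distributing this over the full crawl; the per-record logic stays this simple.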


If you're using EMR you might also want to look at the elasticrawl tool I've developed. It's a command-line tool for launching EMR jobs against the crawl.


Hope this helps.

Cheers

Ross

Rob Witoff

Feb 20, 2014, 3:02:53 AM
to common...@googlegroups.com
Awesome!  Thanks Ross, this is a big help to have as a reference.

I'll be releasing an EMR Streaming / Python example shortly as well.

Thanks,

-Rob


--
You received this message because you are subscribed to a topic in the Google Groups "Common Crawl" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/common-crawl/sTK7aFfpRRU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to common-crawl...@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.
Visit this group at http://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/groups/opt_out.

shlomi...@gmail.com

Feb 20, 2014, 11:19:51 AM
to common...@googlegroups.com
Hey Ross, 
Thanks for the examples! However, they use the mapred API, which is pseudo-deprecated. I found an implementation of edu.cmu.lemurproject that uses mapreduce and tried to use it in one of our simple jobs, but it didn't seem to work properly. Can anyone verify that they can work with WET files and mapreduce?

Thanks,
Shlomi

Mat Kelcey

Feb 20, 2014, 11:23:27 AM
to common...@googlegroups.com

Given that mapred has been 'deprecated' for, literally, years, I wouldn't worry too much about it.


shlomi...@gmail.com

Feb 20, 2014, 1:26:27 PM
to common...@googlegroups.com
It's not that I'm worried, just that I have a large codebase that uses mapreduce, and I really don't want to change it just for a little bit of WARC support :)

Ross Fairbanks

Feb 20, 2014, 1:50:35 PM
to common...@googlegroups.com
Hi Shlomi,
Yes, I had a problem processing the WET files with the Lemur code. It has some logic that always expects 2-character line endings. But some of the text extractions in the WET files have single-character endings, and these cause null pointer exceptions.

I had to make a small change to the WarcRecord class to handle this.  It looks like the newer version of the Lemur code you're using has the same issue.  Hope this helps!
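For anyone hitting the same exceptions: the gist of such a change is to treat both a two-character CRLF and a bare LF as a valid line terminator when reading record lines. A rough plain-Java sketch of the idea (my own names, not the actual Lemur patch):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class LineEndingSketch {
    // Read up to the next '\n', then drop a trailing '\r' if present, so
    // "\r\n" and bare "\n" endings both yield the same line content and
    // downstream parsing never sees a stray carriage return.
    static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            sb.append((char) b);
        }
        int len = sb.length();
        if (len > 0 && sb.charAt(len - 1) == '\r') {
            sb.setLength(len - 1); // tolerate the 2-char CRLF ending
        }
        return sb.toString();
    }
}
```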

Cheers

Ross

shlomi...@gmail.com

Feb 20, 2014, 2:25:12 PM
to common...@googlegroups.com
Hey Ross, 
Thank you for your answer. You're right, I did see this error on my previous attempt and made a fix similar to the one you linked, thanks!
Although it got rid of the error, the job still didn't work as it should.

I added a println to both my mapper and reducer. When I don't use WarcFileInputFormat, I actually get all the lines printed, each one in a separate map call (meaning I don't get the entire page in one map call); here is an example of what I am getting.

If, however, I do use the mapreduce version of Lemur with your fix, I get strange behavior. Here it is:

21:19:19,004  INFO FileInputFormat:237 - Total input paths to process : 1
21:19:19,172  INFO ProcessTree:63 - setsid exited with exit code 0
21:19:19,176  INFO Task:534 -  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@189945c1
21:19:19,185  INFO MapTask:944 - io.sort.mb = 100
21:19:19,206  INFO MapTask:956 - data buffer = 79691776/99614720
21:19:19,208  INFO MapTask:957 - record buffer = 262144/327680
21:19:19,220  INFO WarcFileRecordReader:117 - file:/home/shlomi/test/CC-MAIN-20130516131833-00097-ip-10-60-113-184.ec2.internal.warc.wet.gz
21:19:19,223  INFO WarcFileRecordReader:122 - Compression enabled
21:19:27,648  INFO MapTask:1284 - Starting flush of map output
21:19:27,653  INFO Task:858 - Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
21:19:27,655  INFO LocalJobRunner:323 - 
21:19:27,656  INFO Task:970 - Task 'attempt_local_0001_m_000000_0' done.
21:19:27,660  INFO Task:534 -  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5945d890
21:19:27,661  INFO LocalJobRunner:323 - 
21:19:27,663  INFO Merger:390 - Merging 1 sorted segments
21:19:27,667  INFO Merger:473 - Down to the last merge-pass, with 0 segments left of total size: 0 bytes
21:19:27,667  INFO LocalJobRunner:323 - 
21:19:27,670  INFO Task:858 - Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
21:19:27,671  INFO LocalJobRunner:323 - 
21:19:27,671  INFO Task:1011 - Task attempt_local_0001_r_000000_0 is allowed to commit now
21:19:27,672  INFO FileOutputCommitter:173 - Saved output of task 'attempt_local_0001_r_000000_0' to /tmp/wet-ass
21:19:27,672  INFO LocalJobRunner:323 - reduce > reduce
21:19:27,673  INFO Task:970 - Task 'attempt_local_0001_r_000000_0' done.

Please notice the time jump between "Compression enabled" and the next line; this is where I expect to see the mapper and reducer prints.
Nothing prints out, and I get an empty result file with no errors.

Any ideas?

Thanks,
Shlomi

Ross Fairbanks

Feb 20, 2014, 3:02:42 PM
to common...@googlegroups.com
I'm not sure, but I think the issue could be in your mapper.

When using the WarcFileRecordReader, the value passed to the mapper is a WritableWarcRecord. But this is just a wrapper around the WarcRecord object that has the data. The code below gets the URL and the text contents.

// Unwrap the underlying WarcRecord that actually holds the data
WarcRecord record = value.getRecord();

// Header fields are looked up by name; the body is the extracted text
String pageUrl = record.getHeaderMetadataItem("WARC-Target-URI");
String pageText = record.getContentUTF8();
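As a side note for anyone poking at records by hand: WARC headers are just "Name: value" lines terminated by a blank line, so a header like WARC-Target-URI can be pulled out of a record's text without the Lemur classes at all. A rough sketch (my own helper, not part of Lemur; real parsers do more validation):

```java
public class WarcHeaderSketch {
    // Scan the "Name: value" header lines of one record; the header block
    // ends at the first blank line, after which the record content begins.
    static String headerValue(String record, String name) {
        for (String line : record.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // blank line: end of headers
            }
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).equalsIgnoreCase(name)) {
                return line.substring(colon + 1).trim();
            }
        }
        return null; // header not present
    }
}
```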

shlomi...@gmail.com

Feb 20, 2014, 3:11:21 PM
to common...@googlegroups.com
Thank you for replying, 
I tried that, and even more: I left nothing in the mapper, just a println with a constant string, and it doesn't seem like the mapper gets called at all (when I use WarcFileInputFormat; if I don't, it gets printed).

I suspect it's a problem with the mapreduce version of Lemur (it doesn't seem to be official). I would appreciate it if anyone could try this out and let me and the group know.

Thanks,
Shlomi

shlomi...@gmail.com

Feb 20, 2014, 4:07:34 PM
to common...@googlegroups.com
OK, indeed there is a bug in the mapreduce version of WarcFileRecordReader. It now works fine. I will upload a working version of Lemur (with mapreduce) to GitHub and post it here.

Thanks!

shlomi...@gmail.com

Feb 21, 2014, 5:55:04 AM
to common...@googlegroups.com
Hey,

I made a git repo that contains a fixed version of Lemur that works with the mapreduce API ( https://github.com/vadali/warc-mapreduce/tree/master/java/edu/cmu/lemurproject )

and there's also a word-count example in Clojure, using the hadoop-clojure library and a sample WARC file from Common Crawl ( https://github.com/vadali/warc-mapreduce/blob/master/src/warc_mapreduce/example.clj )

Hope this helps someone!