Example code for working with WARC files


Lisa Green

Apr 2, 2014, 3:36:36 PM
to common...@googlegroups.com
Stephen Merity has produced three introductory Java examples for working with WARC files!

If you want to go straight to the code, you can find it here: https://github.com/Smerity/cc-warc-examples
For more detail on how to work with these files, check out Stephen's excellent blog post: http://commoncrawl.org/navigating-the-warc-file-format/

shlomi...@gmail.com

Apr 3, 2014, 6:24:21 AM
to common...@googlegroups.com
Thank you!

I read the code and it's very clean and understandable. It could have saved me lots of work a few months ago, but better late than never :)
Would you consider making the WARC reader a Maven artifact?

I will test it today and update my Clojure binding accordingly.

Lisa Green

Apr 3, 2014, 2:33:43 PM
to common...@googlegroups.com
Making the WARC reader a Maven artifact is a great idea! I am not sure Stephen has time right now, but we will find someone to do it.

It would be great to hear about how you are using Common Crawl data. Is your code in an open repo that you would be willing to share?

Lisa

shlomi...@gmail.com

Apr 6, 2014, 10:27:45 AM
to common...@googlegroups.com
Hey,

Unfortunately, although I really want to, I am not allowed to share our code base.

Generally, we have large Hadoop jobs running on top of Amazon's EMR that generate lots of NLP stats from the extracted text (all written in Clojure). We used to do this whole operation in-house (crawling, extracting, storing the data, etc.), so finding out about Common Crawl was a real blessing. However, we still haven't had a chance to make a full run on top of Common Crawl; so far we are just adapting our algorithms to the CC layout and making small test runs.

Thanks for a wonderful job,
Shlomi