Example code for working with WARC files

Lisa Green

Apr 2, 2014, 3:36:36 PM

to common...@googlegroups.com

Stephen Merity has produced three introductory Java examples for working with WARC files!

If you want to go straight to the code, you can find it here: https://github.com/Smerity/cc-warc-examples

For more detail on how to work with these files, check out Stephen's excellent blog post: http://commoncrawl.org/navigating-the-warc-file-format/

shlomi...@gmail.com

Apr 3, 2014, 6:24:21 AM

to common...@googlegroups.com

Thank you!

I read the code and it's very clean and understandable. It could have saved me lots of work a few months ago, but better late than never :)

Would you consider making the WARC reader a Maven artifact?

I will test it today and update my Clojure binding accordingly.

Lisa Green

Apr 3, 2014, 2:33:43 PM

to common...@googlegroups.com

Making the WARC reader a Maven artifact is a great idea! I am not sure Stephen has time right now, but we will find someone to do it.

It would be great to hear about how you are using Common Crawl data. Is your code in an open repo that you would be willing to share?

Lisa

shlomi...@gmail.com

Apr 6, 2014, 10:27:45 AM

to common...@googlegroups.com

Hey,

Unfortunately, although I really want to, I am not allowed to share our code base.

Generally, we have large Hadoop jobs running on top of Amazon's EMR that generate lots of NLP stats from the extracted text (all written in Clojure). We used to do this whole operation in-house (crawling, extracting, storing the data, etc.), so finding out about Common Crawl was a real blessing. However, we still haven't had a chance to make a full run on top of Common Crawl; so far we are just adapting our algorithms to the CC layout and making small test runs.