Create CommonCrawl files using Java

tot...@di.uniroma1.it

Mar 5, 2015, 12:51:13 PM
to common...@googlegroups.com
Hi all,
my name is Giuseppe and I am new to this group. I have just started working with the CommonCrawl data format.
I need to implement a Java utility that exports web crawl data from a specific binary format to the CommonCrawl data format.
I am looking for a Java API for writing CommonCrawl files (WARC, WAT, WET) from my crawled data. On the CommonCrawl website I found only libraries and examples that parse or read CommonCrawl files, none that write them. I would really appreciate it if you could suggest some libraries, resources, or writers.
Thanks a lot,
Giuseppe

Mat Kelcey

Mar 7, 2015, 10:17:09 AM
to common...@googlegroups.com

The actual Common Crawl crawler is publicly available.

In particular, https://github.com/commoncrawl/commoncrawl-crawler/blob/master/src/org/commoncrawl/mapred/ec2/parser/ParserOutputFormat.java looks like a good place to start.
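
For anyone who only needs the container format rather than the whole crawler, here is a minimal sketch of writing one WARC/1.0 "response" record using only the Java standard library. It is not taken from the Common Crawl code base: the class name MinimalWarcWriter and the method writeResponseRecord are hypothetical, and the sketch assumes you already hold the raw HTTP response (status line, headers, body) as bytes. Each record is written as its own gzip member, which is the usual convention for .warc.gz files.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.UUID;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only (not part of any Common Crawl library).
final class MinimalWarcWriter {
    private static final String CRLF = "\r\n";

    // Appends one gzip-compressed WARC "response" record to `out`.
    // `httpResponse` is assumed to hold the raw HTTP status line, headers and body.
    static void writeResponseRecord(OutputStream out, String targetUri,
                                    Instant fetchTime, byte[] httpResponse)
            throws IOException {
        String header = "WARC/1.0" + CRLF
                + "WARC-Type: response" + CRLF
                + "WARC-Target-URI: " + targetUri + CRLF
                // WARC-Date is a UTC timestamp with second precision.
                + "WARC-Date: " + fetchTime.truncatedTo(ChronoUnit.SECONDS) + CRLF
                + "WARC-Record-ID: <urn:uuid:" + UUID.randomUUID() + ">" + CRLF
                + "Content-Type: application/http; msgtype=response" + CRLF
                // Content-Length counts only the block that follows the blank line.
                + "Content-Length: " + httpResponse.length + CRLF
                + CRLF;

        // Compress the record into a buffer so that closing the gzip stream
        // does not close the caller's output stream.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(header.getBytes(StandardCharsets.UTF_8));
            gzip.write(httpResponse);
            // Every WARC record ends with two CRLF pairs.
            gzip.write((CRLF + CRLF).getBytes(StandardCharsets.UTF_8));
        }
        out.write(buffer.toByteArray());
    }
}
```

Calling writeResponseRecord repeatedly on the same FileOutputStream appends one record after another. A complete file would normally also start with a "warcinfo" record, and WAT and WET files would need their own record types and payloads, which this sketch does not cover.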



tot...@di.uniroma1.it

Mar 9, 2015, 1:19:57 PM
to common...@googlegroups.com
Thanks, Mat. I will try it and then report back with my feedback.
Thanks a lot,
Giuseppe