WARC and WAT examples

shlomi...@gmail.com

Nov 28, 2013, 4:56:04 AM
to common...@googlegroups.com
Hey, 

I am new to Common Crawl, and this is the first time I am hearing or reading about WARC and WAT.
Could you please post a canonical Hadoop example (in Java) that uses them, such as word count?

Thanks!

Winicius Siqueira

Nov 30, 2013, 2:01:15 AM
to common...@googlegroups.com
I'm new to all this data mining business. I'm trying to extract email addresses from this crawl. Can anyone point me in the direction I should take?

--regards

shlomi...@gmail.com

Dec 18, 2013, 12:20:40 PM
to common...@googlegroups.com
Hey all,

I am trying to write a simple Hadoop job that takes in a WET file and prints out each key and value (in the mapper).
I started by simply using a TextInputFormat, and I am getting the key as a LongWritable (representing the byte offset in the file) and the value as a Text containing one line (naturally).

Here is an example output:

MAPPER> 568 : 'WARC/1.0'
MAPPER> 578 : 'WARC-Type: conversion'
MAPPER> 601 : 'WARC-Target-URI: http://02varvara.wordpress.com/2009/12/'
MAPPER> 659 : 'WARC-Date: 2013-06-20T05:31:31Z'
MAPPER> 692 : 'WARC-Record-ID: <urn:uuid:c7e0c36c-d5a2-48e5-9a13-f401504e8ff9>'
MAPPER> 757 : 'WARC-Refers-To: <urn:uuid:c3285eac-bd57-4098-bba2-c144d7755ffe>'
MAPPER> 822 : 'WARC-Block-Digest: sha1:KTCM2H2G3SQCLNW7B4I3DXOM2F5NEB73'
MAPPER> 880 : 'Content-Type: text/plain'
MAPPER> 906 : 'Content-Length: 18894'
MAPPER> 929 : ''
MAPPER> 931 : 'December | 2009 | Voices from Russia'
MAPPER> 968 : 'Voices from Russia'
MAPPER> 987 : 'Wednesday, 30 December 2009'
MAPPER> 1015 : 'The ‘Pooter is in the Lair of the Geeks'
MAPPER> 1057 : 'Filed under: humour/wry/"people are funny" — 01varvara @ 00.00 Well… it’s time for annual cleaning and there’s a minor glitch or two to be repaired. So, it’s off to the lair of the local geeks, where a consilium of certified and actual techies will put it everything right… God willing, without breaking the bank account.'
MAPPER> 1391 : 'God willing, it won’t be long. Yes… I’ll admit it… I’m a cyberspace junkie, but, I do have a real life… which is why posts have been down recently… I heard rumours about a holiday or two…'
MAPPER> 1595 : 'Barbara-Marie Drezhlo'
MAPPER> 1617 : 'Wednesday 30 December 2009'
MAPPER> 1644 : 'Albany NY'

This naive approach gives me access to all the lines in the file, but it's rather inconvenient. I'd much rather have a <url, all-text> tuple, as the previous crawls provided.
Is there a better InputFormat I should be using? Do I have to roll my own?

jor...@commoncrawl.org

Jan 13, 2014, 3:46:09 PM
to common...@googlegroups.com
A WET file is identical to a WARC file except that it contains only conversion records, which means any WARC file reader should be able to extract everything. I've been using the Internet Archive's ia-web-commons from Java to read them: https://github.com/internetarchive/ia-web-commons. I think their Hadoop tools provide streaming and Pig access as well, though I haven't played around with them much: https://github.com/internetarchive/ia-hadoop-tools/. There are other tools built for processing ClueWeb data which should also work.
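
To illustrate what such a reader has to do, here is a minimal sketch in plain Java (standard library only) that turns an uncompressed WET stream into <url, all-text> pairs. It is not the ia-web-commons API and not a Hadoop InputFormat; the names WetRecordSketch, Record, readAll and sampleRecord are made up for this example. It assumes each record looks like the mapper output quoted earlier in the thread: a "WARC/1.0" line, header lines, a blank line, then exactly Content-Length bytes of payload. Gzipped files would need to be decompressed first (for example with java.util.zip.GZIPInputStream).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WetRecordSketch {

    /** One parsed record: its WARC headers plus the payload decoded as UTF-8 (hypothetical type). */
    public static final class Record {
        public final Map<String, String> headers;
        public final String text;

        Record(Map<String, String> headers, String text) {
            this.headers = headers;
            this.text = text;
        }

        public String url() {
            return headers.get("WARC-Target-URI");
        }
    }

    // Reads bytes up to the next '\n', drops a trailing '\r'; returns null at end of stream.
    private static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (b == '\n') break;
            buf.write(b);
        }
        if (!sawAny) return null;
        String s = new String(buf.toByteArray(), StandardCharsets.UTF_8);
        return s.endsWith("\r") ? s.substring(0, s.length() - 1) : s;
    }

    /** Parses every record in the stream. The payload is read by byte count, not line by line. */
    public static List<Record> readAll(InputStream in) throws IOException {
        List<Record> records = new ArrayList<>();
        String line;
        while ((line = readLine(in)) != null) {
            if (!line.startsWith("WARC/")) continue;   // skip blank lines between records
            Map<String, String> headers = new LinkedHashMap<>();
            while ((line = readLine(in)) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');         // first colon separates name from value
                if (colon > 0) {
                    headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
                }
            }
            int length = Integer.parseInt(headers.getOrDefault("Content-Length", "0"));
            byte[] body = new byte[length];
            int off = 0;
            while (off < length) {
                int n = in.read(body, off, length - off);
                if (n < 0) break;                      // truncated file: keep what we got
                off += n;
            }
            records.add(new Record(headers, new String(body, 0, off, StandardCharsets.UTF_8)));
        }
        return records;
    }

    /** Builds a synthetic conversion record for testing (not real crawl data). */
    public static String sampleRecord(String url, String text) {
        int length = text.getBytes(StandardCharsets.UTF_8).length;
        return "WARC/1.0\r\n"
                + "WARC-Type: conversion\r\n"
                + "WARC-Target-URI: " + url + "\r\n"
                + "Content-Type: text/plain\r\n"
                + "Content-Length: " + length + "\r\n"
                + "\r\n"
                + text + "\r\n\r\n";
    }

    public static void main(String[] args) throws IOException {
        String data = sampleRecord("http://example.com/a", "hello world")
                + sampleRecord("http://example.com/b", "foo\nbar");
        InputStream in = new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8));
        for (Record r : readAll(in)) {
            System.out.println(r.url() + " : " + r.text.length() + " chars");
        }
    }
}
```

Inside a Hadoop job the same logic would live in a custom RecordReader rather than a mapper over TextInputFormat, since line-based input splits can cut a record in half. Real WET files also begin with a warcinfo record that has no WARC-Target-URI, so url() can return null and callers should check for that.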