I am trying to write a simple Hadoop job that takes in a WET file and prints out its key and value (in the mapper).
I started by simply using TextInputFormat, and I am getting the key as a LongWritable (representing the byte offset in the file) and the value as a Text containing a single line (naturally).
MAPPER> 568 : 'WARC/1.0'
MAPPER> 578 : 'WARC-Type: conversion'
MAPPER> 659 : 'WARC-Date: 2013-06-20T05:31:31Z'
MAPPER> 692 : 'WARC-Record-ID: <urn:uuid:c7e0c36c-d5a2-48e5-9a13-f401504e8ff9>'
MAPPER> 757 : 'WARC-Refers-To: <urn:uuid:c3285eac-bd57-4098-bba2-c144d7755ffe>'
MAPPER> 822 : 'WARC-Block-Digest: sha1:KTCM2H2G3SQCLNW7B4I3DXOM2F5NEB73'
MAPPER> 880 : 'Content-Type: text/plain'
MAPPER> 906 : 'Content-Length: 18894'
MAPPER> 929 : ''
MAPPER> 931 : 'December | 2009 | Voices from Russia'
MAPPER> 968 : 'Voices from Russia'
MAPPER> 987 : 'Wednesday, 30 December 2009'
MAPPER> 1015 : 'The ‘Pooter is in the Lair of the Geeks'
MAPPER> 1057 : 'Filed under: humour/wry/"people are funny" — 01varvara @ 00.00 Well… it’s time for annual cleaning and there’s a minor glitch or two to be repaired. So, it’s off to the lair of the local geeks, where a consilium of certified and actual techies will put it everything right… God willing, without breaking the bank account.'
MAPPER> 1391 : 'God willing, it won’t be long. Yes… I’ll admit it… I’m a cyberspace junkie, but, I do have a real life… which is why posts have been down recently… I heard rumours about a holiday or two…'
MAPPER> 1595 : 'Barbara-Marie Drezhlo'
MAPPER> 1617 : 'Wednesday 30 December 2009'
MAPPER> 1644 : 'Albany NY'
This naive approach gives me access to all the lines in the file, but it's rather inconvenient. I'd much rather have a <url, all-text> tuple, as the previous crawls provided.
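
To make the desired <url, all-text> shape concrete, here is a minimal sketch in plain Java (no Hadoop types) that groups the line-oriented output above into records. It rests on several assumptions that are not confirmed by the output shown: that each record starts with a line equal to 'WARC/1.0', that the URL is carried in a 'WARC-Target-URI' header (that header does not appear in the sample above), that a blank line ends the headers, and that all lines of a record arrive in order at the same place. The class name WetRecordGrouper and its group method are hypothetical names used only for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns a flat list of lines into (url, text) pairs.
class WetRecordGrouper {
    // Assumption: every record begins with this exact line.
    static final String RECORD_START = "WARC/1.0";
    // Assumption: the URL lives in this header; it is not visible in the sample output.
    static final String URL_HEADER = "WARC-Target-URI: ";

    static List<Map.Entry<String, String>> group(List<String> lines) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        String url = null;
        StringBuilder body = null;   // stays null until the first record starts
        boolean inHeaders = false;

        for (String line : lines) {
            if (line.equals(RECORD_START)) {
                // A new record begins: emit the previous one, then reset state.
                flush(records, url, body);
                url = null;
                body = new StringBuilder();
                inHeaders = true;
            } else if (body == null) {
                // Lines before the first record marker are ignored.
            } else if (inHeaders) {
                if (line.isEmpty()) {
                    inHeaders = false;           // blank line ends the header block
                } else if (line.startsWith(URL_HEADER)) {
                    url = line.substring(URL_HEADER.length()).trim();
                }
            } else {
                body.append(line).append('\n');  // accumulate the record's text
            }
        }
        flush(records, url, body);               // emit the last record
        return records;
    }

    private static void flush(List<Map.Entry<String, String>> records,
                              String url, StringBuilder body) {
        // Records with no URL header are skipped.
        if (body == null || url == null) {
            return;
        }
        records.add(new SimpleEntry<>(url, body.toString().trim()));
    }
}
```

Splitting on the 'WARC/1.0' marker is a simplification: a page whose text contains that exact line would be split incorrectly, and honouring Content-Length would be more robust. Inside a real job the same grouping would have to live in a custom RecordReader (or be applied to a whole file at once), since a plain mapper sees only one line per call.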