Hi Ben,
Thanks for getting back to me.
Yes, I probably missed something when I entered it. I thought I'd
followed it byte for byte... but who knows... syntax is a funny old
thing ;)
Rather than creating a bootstrap script and running it the same way as
the Ruby example, what I was getting at is this: I already have Linux
instances running, so I wondered if I could run the Perl on there and
access the S3 buckets straight from a PuTTY SSH session. The way I
would do this locally is to have screen running and then leave the
process running as a background task.
I understand I'm out of the ark... so my apologies... I promise I'm
googling and experimenting with Ruby and Hadoop too!!
You mention telling it which ARC files to use, but I'm not sure where
to get a definitive list of the 300K-plus ARC file locations. Where
would I find that information?
Oh, and lastly, I notice on the page
"http://api.commoncrawl.org/blogpost.html" that the data I REALLY want
to access (the IP address of the hosting for the URL) is in the Hadoop
sequence file.
Are those sequence files available? That document sort of suggests
that they are, but then says they haven't decided what compression to
use (I can use a command line tool to uncompress the Snappy codec).
Thank you so much for responding to my post. I wish you all the best
with your projects,
Rich.
On May 18, 10:55 am, Ben Nagy <b...@iagu.net> wrote: