perl s3 buckets

Richard E

May 18, 2012, 5:31:38 AM

to Common Crawl

Hi,

My first post - I've only just found Common Crawl, and I'm rather
loving the potential. I have my own research database (Sphinx) with a
handful of millions of websites spidered. I didn't know this existed
until two days ago.

I've tried the Ruby example, which ran for 27 mins and said "status:
failure" - nothing in the logging or output folders.
Anyhoo - it got me thinking: whilst I'm in the dark with Java, Hadoop
and Ruby, I know Perl rather well. Couldn't I just create a Linux
instance and then access the S3 buckets with Perl, missing out the
Elastic MapReduce stage, which my poor brain is failing to completely
understand?
I understand it would be slower... but I have time, and I already have
the Linux instance, so it's no problem to have a background task
chugging away at the data. I can write something to deal with the ARC
part easily enough - although I'm not so sure about the newer sequence
files.

Has anyone done this - or can anyone tell me why I'm barking up the
wrong proverbial tree?

all the best to you, and I look forward to seeing the new data that
Lisa teased us with in her blog entry of Nov 2011 (domain index etc).

Rich.

Ben Nagy

May 18, 2012, 5:55:22 AM

to common...@googlegroups.com

On Fri, May 18, 2012 at 3:16 PM, Richard E <sup...@tabguitarlessons.com> wrote:
> I've tried the Ruby example, which ran for 27 mins and said "status:
> failure" - nothing in the logging or output folders.
> Anyhoo - it got me thinking: whilst I'm in the dark with Java, Hadoop
> and Ruby, I know Perl rather well

The Ruby example "works for me" ;), you do need to do _everything_
though, like run the bootstrap script etc.

EMR Streaming does support Perl, so you should be fine. Basically you need to:

0. Make a bootstrap script that makes any changes to the EC2 instance
that you need for your Perl stuff to work
1. Have your input file as a manifest - one arc.gz FILENAME per line -
and put this in your S3 somewhere
- What happens is that Hadoop will split this up and pipe one line
at a time to your mappers
2. Have your mapper.pl read from STDIN (ARGF in Ruby terms), where it
expects one filename per line, get the file, do the ARC stuff, and
output your mapped values to STDOUT
3. Have your reducer.pl read from STDIN, do whatever (count results,
etc.) and output to STDOUT

That's pretty much all my Ruby example does, but, you know, in Ruby.
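
For anyone following along, the streaming contract in steps 1-3 can be sketched in a few lines. This is a rough illustration under stated assumptions, not the actual Ruby example: `records_for`, the sample filenames and the URLs are hypothetical stand-ins for fetching an arc.gz file from S3 and parsing it, counting file extensions is just a toy job, and the tab-separated "key, then count" output is the usual Hadoop Streaming convention rather than anything required by the steps above.

```ruby
require "stringio"

# Hypothetical stand-in for "get the file, do the ARC stuff": a real mapper
# would download the named arc.gz from S3 and yield each record's URL.
def records_for(filename)
  {
    "example/1.arc.gz" => ["http://a.example/index.html", "http://b.example/logo.png"],
    "example/2.arc.gz" => ["http://c.example/about.html"]
  }.fetch(filename, [])
end

# Mapper: one ARC filename per input line, one "key\t1" line out per record.
def run_mapper(input, output)
  input.each_line do |line|
    filename = line.strip
    next if filename.empty?
    records_for(filename).each do |url|
      ext = File.extname(url).delete(".").downcase
      ext = "none" if ext.empty?
      output.puts "#{ext}\t1"
    end
  end
end

# Reducer: sum the counts per key and print one total per key.
def run_reducer(input, output)
  totals = Hash.new(0)
  input.each_line do |line|
    next if line.strip.empty?
    key, count = line.chomp.split("\t", 2)
    totals[key] += count.to_i
  end
  totals.sort.each { |key, total| output.puts "#{key}\t#{total}" }
end

# Demo with in-memory IO; on EMR these would be STDIN and STDOUT.
manifest = StringIO.new("example/1.arc.gz\nexample/2.arc.gz\n")
mapped = StringIO.new
run_mapper(manifest, mapped)
reduced = StringIO.new
run_reducer(StringIO.new(mapped.string), reduced)
puts reduced.string
```

Because both functions only touch the IO objects they are given, the same code can be pointed at STDIN and STDOUT for a streaming job or at in-memory strings for a quick check. A Perl version would follow the same shape.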

Cheers,

ben

Richard E

May 18, 2012, 11:33:54 AM

to Common Crawl

Hi Ben,

Thanks for getting back to me.

Yes, I probably missed something when I entered it. I thought I'd
followed it byte for byte... but who knows... syntax is a funny old
thing ;)

Rather than creating a bootstrap script and running in the same way as
the Ruby example, what I was getting at is that I already have Linux
boxes running, so I wondered if I could run the Perl on there and
access the S3 buckets straight from a PuTTY SSH session. The way I
would do this locally is to have screen running and then leave the
process running as a background task.

I understand I'm out of the ark...so my apologies... I promise I'm
googling and experimenting with ruby and hadoop too!!

You mention telling it which ARC files to use, but I'm not sure where
to get a definitive list of the 300K plus arc file locations. Where
would I find that information?

Oh, and lastly, I notice on the page
"http://api.commoncrawl.org/blogpost.html" that the data I REALLY want
to access (the IP address of the hosting for the URL) is in the Hadoop
sequence file.

Are those sequence files available? That document sort of suggests
that they are, but then says they haven't decided what compression to
use (I can use a command line tool to uncompress Snappy Codec).

Thank you so much for responding to my post, I wish you all the best
with your projects,

Rich.

Pete Warden

May 18, 2012, 4:07:55 PM

to common...@googlegroups.com

Hi Richard,

sorry to hear you are hitting problems with my Ruby example; it is all a bit opaque when things go wrong with Hadoop, and especially with Elastic MapReduce. There's probably a whole article waiting to be written about the best practices for debugging through all the logs and return codes!

One thing that can really help (and which it sounds like you're looking for below) is running the scripts manually. If you have a machine with all the dependencies in setup.sh completed, you should just be able to run this line to run it outside of Hadoop:

ruby extension_map.rb < example_input.txt | sort | ruby extension_reduce.rb

This in fact works with all Hadoop or Elastic MapReduce streaming jobs, and is one of the key ways I debug problems. Does that help?
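
The `sort` in the middle of that pipeline stands in for Hadoop's shuffle step, which hands a reducer all the lines for a given key next to each other. A minimal sketch of a reducer that relies on this is below; the tab-separated "key, then count" line format and the sample data are assumptions for illustration, not taken from extension_reduce.rb.

```ruby
# Streaming reducer: keeps only the current key in memory and emits a total
# whenever the key changes, so it is only correct if equal keys are adjacent.
def reduce_sorted(lines)
  results = []
  current_key = nil
  total = 0
  lines.each do |line|
    key, count = line.chomp.split("\t", 2)
    next if key.nil? # skip blank lines
    if key != current_key
      results << "#{current_key}\t#{total}" unless current_key.nil?
      current_key = key
      total = 0
    end
    total += count.to_i
  end
  results << "#{current_key}\t#{total}" unless current_key.nil?
  results
end

mapper_output = ["html\t1", "png\t1", "html\t1"]

# Without sorting, "html" is seen twice and reported as two separate totals.
puts reduce_sorted(mapper_output)

# Sorting first (what `| sort |` does on the command line) groups the keys.
puts reduce_sorted(mapper_output.sort)
```

If a reducer gives odd results under Hadoop but looks fine when fed hand-made input, checking whether it assumes sorted keys like this is a reasonable first step.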

cheers,

Pete

--

Jetpac

You should get the app for your iPad!

Richard E

May 18, 2012, 4:51:37 PM

to Common Crawl

It most certainly does help, THANKS! I look forward to getting stuck
into that next week. Not enough hours in the day... is caffeine in an
IV bag too much to ask?

Have a great weekend.

Rich.
PS - No offence intended about your example, I'm absolutely positive
it was me... but getting this command line version with slightly more
verbose output should set me on the correct path.

Pete Warden

May 18, 2012, 4:54:03 PM

to common...@googlegroups.com

No offence taken! I wouldn't be at all shocked if there were some bugs in there :)

Jason Duke

May 18, 2012, 4:55:56 PM

to common...@googlegroups.com

> is caffeine in an IV bag too much to ask?

i see it as a prerequisite :D

---
Jason Duke

Email: ja...@strangelogic.com
Mob: +44 (0)7595 924 934

Twitter: @JasonD

LinkedIn: http://uk.linkedin.com/in/jasonduke1

The information contained within this email along with any attachments are confidential, may be legally privileged and/or protected by copyright. If you are not the intended recipient of this email then further dissemination, copying or printing is prohibited. If you have received this email in error then you should notify the sender by replying to this email and thereafter permanently deleting the email from your systems.

Any views or opinions in this email are solely those of the sender. This email is not intended to form a binding contract and as such all communications are “subject to contract” unless it is expressly indicated to the contrary and is properly authorised. You should not rely on any information contain within this email, and any actions taken are at the recipient’s own risk.

Jason Duke

May 18, 2012, 4:59:18 PM

to common...@googlegroups.com

But it has to be good coffee, preferably from Monmouth Coffee in Covent Garden, London :D

OK, back to the keyboard for me, maybe I've had a touch too much of the coffee today....

sup...@tabguitarlessons.com

May 21, 2012, 4:36:54 PM

to common...@googlegroups.com

Hi Pete,

Thought you might be interested to know that if I paste your shell script in a line at a time, it's the Ruby installer line that's falling over with:

sudo apt-get -y -t universe install ruby rubygems
Reading package lists... Done
ERROR: The value 'universe' is invalid for APT::Default-Release as such a release is not available in the sources

I've just installed with:

sudo apt-get install ruby-full build-essential (overkill!?)
sudo wget http://production.cf.rubygems.org/rubygems/rubygems-1.8.24.tgz
sudo tar -zxvf rubygems-1.8.24.tgz
cd rubygems-1.8.24
sudo ruby setup.rb
sudo gem1.8 install aws-sdk --source http://rubygems.org

Then everything worked a treat. Pesky Linux, always changing. I feel that DOS 3.1 was a step too far...

All the best to you, thanks for the help, you got me on the right track,

Rich.

Mat Kelcey

May 21, 2012, 6:36:49 PM

to common...@googlegroups.com

i reckon this happened when we moved EMR from Debian 5.0 (lenny) to
6.0 (squeeze)

if you _really_ need to you can run the old AMI, but it's recommended
to move forward to AMI 2.0
see http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/EnvironmentConfig_AMIVersion.html?r=5757
for more details

it's really hard to keep everything backwards compatible with this
sort of change, sorry about that :(

mat
