Article on screen scraping w HTree+REXML, RubyfulSoup, WWW::Mechanize
From: Peter Szinek <pe...@rubyrailways.com>
Date: Wed, 14 Jun 2006 19:07:39 +0900
Local: Wed, Jun 14 2006 6:07 am
Subject: Article on screen scraping w HTree+REXML, RubyfulSoup, WWW::Mechanize
Hello all,

I am investigating the possibilities of screen scraping/web extraction/
automated web navigation/wrapper generation in Ruby. I have been working
with these technologies for several years, (unfortunately) only in Java
and partially C/C++. I came to know Ruby a few months ago and I am
currently investigating the existing tools for the above tasks. Since I
have the feeling that I am not alone (this topic is brought up regularly
here, maybe not as often as the "how to create an Object from its
name" question, but it is close to that ;-) I have summarized my findings
(the tools I have found, descriptions, examples, a comparison etc.) in an
article; maybe it can help someone.

http://www.rubyrailways.com/data-extraction-for-web-20-screen-scrapin...

There you can find simple example solutions to the same problem (scraping
links from a Google result page) using regular expressions, HTree+REXML,
RubyfulSoup and WWW::Mechanize.
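
To give a feel for the simplest of these approaches, here is a minimal
sketch of regular-expression link extraction. It is not taken from the
article: the sample markup, the `extract_links` helper and the pattern
are assumptions made up for illustration, and it works on an inline
string rather than a fetched result page. A regex like this is fragile
against real-world HTML, which is presumably why the parser-based tools
are worth comparing.

```ruby
# Hypothetical sample markup standing in for a downloaded result page.
html = <<~HTML
  <html><body>
    <a href="http://www.ruby-lang.org/">Ruby</a>
    <a class="l" href='http://www.rubyonrails.org/'>Rails</a>
    <p>No link in this paragraph</p>
  </body></html>
HTML

# Collect the href value of every <a> tag.
# String#scan returns one array per match when the pattern has a
# capture group, so the result is flattened into a list of strings.
def extract_links(html)
  html.scan(/<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/i).flatten
end

links = extract_links(html)
puts links
```
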

I am planning to write more entries on this topic, involving screen
scraping from Rails, Gecko to Ruby GTK widget embedding, wrapper
generation etc. Please note that I am new to Ruby so it is possible that
my code snippets are not the most optimal yet (suggestions welcome), but
they are all tested and working.

Feedback/corrections/suggestions would be very much appreciated!

If you liked the story, you can digg it here:

http://www.digg.com/programming/Data_extraction_for_Web_2.0:_Screen_s...

Cheers,
Peter


From: James Edward Gray II <ja...@grayproductions.net>
Date: Wed, 14 Jun 2006 21:13:16 +0900
Local: Wed, Jun 14 2006 8:13 am
Subject: Re: Article on screen scraping w HTree+REXML, RubyfulSoup, WWW::Mechanize
On Jun 14, 2006, at 5:07 AM, Peter Szinek wrote:

This was a very good article.  Thank you for sharing it with us.

> Please note that I am new to Ruby so it is possible that
> my code snippets are not the most optimal yet (suggestions welcome),

Well, you sometimes declare variables inThisStyle, but Rubyists use  
this_style_here.
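
As a rough illustration of that convention (all names below are
hypothetical and not taken from the article), local variables and
methods use snake_case, classes use CamelCase, and constants use
upper-case names:

```ruby
# Java-style camelCase is legal Ruby, but reads as foreign.
pageLinks = ["http://example.com/a", "http://example.com/b"]

# Idiomatic Ruby: an upper-case constant, a CamelCase class,
# and snake_case methods, parameters and instance variables.
MAX_LINKS = 10

class LinkExtractor
  def initialize(page_links)
    @page_links = page_links
  end

  # Return at most `limit` links from the front of the list.
  def first_links(limit = MAX_LINKS)
    @page_links.first(limit)
  end
end

extractor = LinkExtractor.new(pageLinks)
puts extractor.first_links(1)
```
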

James Edward Gray II


From: Peter Szinek <pe...@rubyrailways.com>
Date: Wed, 14 Jun 2006 23:00:52 +0900
Local: Wed, Jun 14 2006 10:00 am
Subject: Re: Article on screen scraping w HTree+REXML, RubyfulSoup, WWW::Mechanize
James Edward Gray II wrote:

> On Jun 14, 2006, at 5:07 AM, Peter Szinek wrote:

>> http://www.rubyrailways.com/data-extraction-for-web-20-screen-scrapin...

> This was a very good article.  Thank you for sharing it with us.

Thx!

> Well, you sometimes declare variables inThisStyle, but Rubyists use
> this_style_here.

Thanks for the suggestion, I'll update it ASAP. (I come from the Java
camp, which is why the camelsAreStillHaunting ;-)

Peter

