
Need help with parsing data


Shan

Aug 9, 2006, 6:06:57 PM
So I need code that will go through a list of URLs (formatted as
http://www.google.com) and for each url get the following information:

1. The url after the href= within the following tags <link
rel="alternate" and />

So if there is <link rel="alternate" type="application/atom+xml"
title="Atom" href="http://hello.typepad.com/hello/atom.xml" /> I want
the http://hello.typepad.com/hello/atom.xml


2. everything between the following tags <title> and </title>
so if there is <title>hello, typepad</title> I want hello, typepad

3. everything between the tags <h2 id="banner-description"> and </h2>


4. Finally I would like the results to be saved to a delimited file in
the following format:

column 1: original url
column 2: data obtained from step 1
column 3: data obtained from step 2
column 4: data obtained from step 3

If there is no result for any one of the steps, a null should be saved.


I would like to thank whoever can provide me with the code in advance,
Thank you.

DJ Stunks

Aug 9, 2006, 6:19:45 PM

It is highly unlikely that anyone will do so for a simple "thanks".
Check out jobs.perl.org for someone willing to follow orders in return
for compensation.

-jp

A. Sinan Unur

Aug 9, 2006, 6:41:38 PM

> So I need code that will go through a list of URLs (formatted as
> http://www.google.com) and for each url get the following information:
>
> 1. The url after the href= within the following tags <link
> rel="alternate" and />

That's just one tag.


> So if there is <link rel="alternate" type="application/atom+xml"
> title="Atom" href="http://hello.typepad.com/hello/atom.xml" /> I want
> the http://hello.typepad.com/hello/atom.xml

use HTML::TokeParser::Simple;

http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/
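A minimal sketch of how item 1 might look with this module, assuming the page has already been fetched into a string. The helper name alternate_href is hypothetical, and the string => ... constructor form is assumed to be available in the installed version:

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Return the href of the first <link rel="alternate" ...> tag, or undef.
# alternate_href is a hypothetical helper name, not part of the module.
sub alternate_href {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new( string => $html );
    while ( my $token = $parser->get_token ) {
        next unless $token->is_start_tag('link');
        my $rel = $token->get_attr('rel');
        next unless defined $rel && lc $rel eq 'alternate';
        return $token->get_attr('href');
    }
    return;
}

my $sample = '<link rel="alternate" type="application/atom+xml" '
           . 'title="Atom" href="http://hello.typepad.com/hello/atom.xml" />';
print alternate_href($sample), "\n";
```

Walking the token stream avoids regular expressions on markup, so attribute order and extra whitespace inside the tag should not matter.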

> 2. everything between the following tags <title> and </title>
> so if there is <title>hello, typepad</title> I want hello, typepad

Ditto.

> 3. everything between the tags <h2 id="banner-description"> and </h2>

Ditto.

> 4. Finally I would like the results to be saved to a delimited file in
> the following format:
>
> column 1: original url
> column 2: data obtained from step 1
> column 3: data obtained from step 2
> column 4: data obtained from step 3

Trivial.
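
As a rough sketch of item 4 using only core Perl: the original post names neither the delimiter nor the form of a "null", so a tab and the literal string null are assumptions here, and format_row, results.tsv, and the sample row are hypothetical.

```perl
use strict;
use warnings;

# Build one tab-delimited line; undefined or empty fields become "null".
# format_row is a hypothetical helper name.
sub format_row {
    my @fields = @_;
    for my $field (@fields) {
        $field = 'null' unless defined $field && length $field;
        $field =~ s/[\t\r\n]+/ /g;    # keep stray tabs and newlines out of a field
    }
    return join( "\t", @fields ) . "\n";
}

# Hypothetical row: original URL, feed URL, title, banner description.
my @rows = (
    [
        'http://hello.typepad.com/',
        'http://hello.typepad.com/hello/atom.xml',
        'hello, typepad',
        undef,
    ],
);

open my $out, '>', 'results.tsv' or die "Cannot open results.tsv: $!";
print {$out} format_row(@$_) for @rows;
close $out or die "Cannot close results.tsv: $!";
```

If the extracted text could itself contain the delimiter in a way that must be preserved, a CSV module with proper quoting would likely be a better fit than this plain join.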

> if there is no result for any one of the steps a null should be saved.
>
>
> I would like to thank whoever can provide me with the code in advance,

That's not how it works here. Feel free to compose a proper post showing us
what you have tried - after having read and followed the posting guidelines
- and help will flow.

If you don't want to be bothered with that, you might be able to generate
enough "warm glow" to motivate me to help by making a donation of $1000 or
more to the Perl foundation:

http://donate.perl-foundation.org/index.pl?node=Fund+Drive+Details&selfund=2

If you don't want to bother with either of those, then try:

http://jobs.perl.org/

Sinan

PS: For the record, I am in no way affiliated with the Perl Foundation.
I have not yet donated. I should. Oh, the guilt.

--
A. Sinan Unur <1u...@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

John Bokma

Aug 9, 2006, 7:17:50 PM
"Shan" <Shan...@gmail.com> wrote:

> So I need code that will go through a list of URLs (formatted as
> http://www.google.com) and for each url get the following information:
>
> 1. The url after the href= within the following tags <link
> rel="alternate" and />
>
> So if there is <link rel="alternate" type="application/atom+xml"
> title="Atom" href="http://hello.typepad.com/hello/atom.xml" /> I want
> the http://hello.typepad.com/hello/atom.xml
>
>
> 2. everything between the following tags <title> and </title>
> so if there is <title>hello, typepad</title> I want hello, typepad
>
> 3. everything between the tags <h2 id="banner-description"> and </h2>


I use HTML::TreeBuilder for this, since it makes life really easy. See
http://johnbokma.com/perl/ for several examples (Web automation).

For example 3. can be done as:

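use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_content( $content );

:
:

# Collect the trimmed text of every <h2 id="banner-description"> element.
my @column4;
push @column4, $_->as_trimmed_text
    for $root->look_down( _tag => 'h2', id => 'banner-description' );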
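
The same look_down approach could plausibly cover items 1 and 2 as well. The following is only a sketch under that assumption; feed_and_title is a hypothetical helper name, and it expects the page content to be in a string already:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return (feed href, page title); either may be undef when not found.
# feed_and_title is a hypothetical helper name.
sub feed_and_title {
    my ($content) = @_;
    my $root = HTML::TreeBuilder->new_from_content($content);

    # In scalar context look_down returns the first matching element.
    my $link  = $root->look_down( _tag => 'link', rel => 'alternate' );
    my $title = $root->look_down( _tag => 'title' );

    my @result = (
        defined $link  ? $link->attr('href')     : undef,
        defined $title ? $title->as_trimmed_text : undef,
    );
    $root->delete;    # free the tree explicitly
    return @result;
}
```

A caller would use it as my ($href, $title) = feed_and_title($content); and substitute a null marker for any undefined value when writing the output.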

> I would like to thank whoever can provide me with the code in advance,
> Thank you.

I can provide the code, and forms to thank me are here:
http://johnbokma.com/wish-list.html

Either Object Oriented Perl or Perl Best Practices would be fine with me
since directly and indirectly you will contribute back to the Perl
community.

--
John Bokma Freelance software developer
&
Experienced Perl programmer: http://castleamber.com/

Tad McClellan

Aug 10, 2006, 12:49:28 AM
Shan <Shan...@gmail.com> wrote:

> Subject: Need help with parsing data


What part is it that you need help with?


(You should use a module that understands XHTML data if you need
to process XHTML data.)


> I would like to thank whoever can provide me with the code in advance,


What makes you think that someone will write your program for you?


--
Tad McClellan SGML consulting
ta...@augustmail.com Perl programming
Fort Worth, Texas

Shan

Aug 10, 2006, 9:48:23 AM
Thanks for your advice. I will work on writing a script today and see
what kind of results I get.
