
HTML Code - Line Number


SMac...@comcast.net

Apr 27, 2012, 1:09:57 PM
Hello,

For scraping purposes, I am having a bit of trouble writing a block
of code to define, and find, the relative position (line number) of a
string of HTML code. I can pull out one string that I want, and then
there is always a line of code, directly beneath the one I can pull
out, that begins with the following:
<td align="left" valign="top" class="body_cols_middle">

However, because this string of HTML code above is not unique to just
the information I need (which I cannot currently pull out), I was
hoping there is a way to effectively say "if you find the html string
_____ in the line of HTML code above, and the string <td align="left"
valign="top" class="body_cols_middle"> in the line immediately
following, then pull everything that follows this second string."

Any thoughts as to how to define a function to do this, or do this
some other way? All insight is much appreciated! Thanks.
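
[Editor's sketch] The line-pair check described in the question could look roughly like the following. This is only a minimal sketch under stated assumptions: the marker string ">Name<", the function name text_after_marker, and the sample HTML are all hypothetical, since the post leaves the first string blank and does not show the page source. It also assumes the two strings really do sit on adjacent lines.

```python
# Prefix quoted in the question; everything after it on the line is returned.
TD_PREFIX = '<td align="left" valign="top" class="body_cols_middle">'


def text_after_marker(html, marker, td_prefix=TD_PREFIX):
    """Return the text after td_prefix on each line that directly
    follows a line containing marker (a hypothetical placeholder)."""
    results = []
    lines = html.splitlines()
    # Walk over consecutive (previous, current) line pairs.
    for prev, cur in zip(lines, lines[1:]):
        stripped = cur.strip()
        if marker in prev and stripped.startswith(td_prefix):
            results.append(stripped[len(td_prefix):])
    return results


# Made-up sample input, not taken from the real page.
sample = """\
<td align="left" valign="top" class="body_cols_left">Name</td>
<td align="left" valign="top" class="body_cols_middle">Jane Doe</td>
<td align="left" valign="top" class="body_cols_left">Office</td>
<td align="left" valign="top" class="body_cols_middle">New York</td>
"""

print(text_after_marker(sample, ">Name<"))
```

Note that this depends on the exact line layout and attribute order of the page, so any change to the markup will break it.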

Tim Roberts

Apr 28, 2012, 1:59:31 AM
SMac...@comcast.net wrote:
>
>For scraping purposes, I am having a bit of trouble writing a block
>of code to define, and find, the relative position (line number) of a
>string of HTML code. I can pull out one string that I want, and then
>there is always a line of code, directly beneath the one I can pull
>out, that begins with the following:
><td align="left" valign="top" class="body_cols_middle">
>
>However, because this string of HTML code above is not unique to just
>the information I need (which I cannot currently pull out), I was
>hoping there is a way to effectively say "if you find the html string
>_____ in the line of HTML code above, and the string <td align="left"
>valign="top" class="body_cols_middle"> in the line immediately
>following, then pull everything that follows this second string."

Regular expression-based screen scraping is extremely delicate. All it
takes is one tweak to the HTML, and your scraping fails although the page
continues to look the same.

A much better plan is to use sgmllib to write yourself a mini HTML parser.
You can handle "td" tags with the attributes you want, and count down until
you get to the "td" tag you want.
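
[Editor's sketch] A mini parser along these lines might look as follows. Note the swap: sgmllib exists only in Python 2, so this sketch uses the standard library's html.parser.HTMLParser instead, which offers the same style of tag callbacks. The class name MiddleCellParser and the sample HTML are hypothetical, and the sketch assumes the wanted cells are not nested inside one another.

```python
from html.parser import HTMLParser


class MiddleCellParser(HTMLParser):
    """Collect the text of every <td class="body_cols_middle"> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []        # text of each matching cell, in document order
        self._in_cell = False  # True while inside a matching <td>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; match on the class only.
        if tag == "td" and dict(attrs).get("class") == "body_cols_middle":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


# Made-up sample input, not taken from the real page.
sample = (
    '<table><tr>'
    '<td align="left" valign="top" class="body_cols_left">Name</td>'
    '<td align="left" valign="top" class="body_cols_middle">Jane Doe</td>'
    '</tr></table>'
)

parser = MiddleCellParser()
parser.feed(sample)
parser.close()
print(parser.cells)
```

Because the parser works on tags rather than lines, it keeps working if the page's whitespace or attribute order changes.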
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.

Jon Clements

Apr 28, 2012, 2:45:57 AM
On Friday, 27 April 2012 18:09:57 UTC+1, SMac...@comcast.net wrote:
> Hello,
>
> For scraping purposes, I am having a bit of trouble writing a block
> of code to define, and find, the relative position (line number) of a
> string of HTML code. I can pull out one string that I want, and then
> there is always a line of code, directly beneath the one I can pull
> out, that begins with the following:
> <td align="left" valign="top" class="body_cols_middle">
>
> However, because this string of HTML code above is not unique to just
> the information I need (which I cannot currently pull out), I was
> hoping there is a way to effectively say "if you find the html string
> _____ in the line of HTML code above, and the string <td align="left"
> valign="top" class="body_co

<SMac2347 <at> comcast.net> writes:

>
> Hello,
>
> I am having some difficulty generating the output I want from web
> scraping. Specifically, the script I wrote, while it runs without any
> errors, is not writing to the output file correctly. It runs, and
> creates the output .txt file; however, the file is blank (ideally it
> should be populated with a list of names).
>
> I took the base of a program that I had before for a different data
> gathering task, which worked beautifully, and edited it for my
> purposes here. Any insight as to what I might be doing wrong would be
> highly appreciated. Code is included below. Thanks!

[quoting reply to first thread]
I would approach it like this...

import lxml.html

QUERY = '//tr[@bgcolor="#F1F3F4"][td[starts-with(@class, "body_cols")]]'

url = 'http://www.skadden.com/Index.cfm?contentID=44&alphaSearch=A'


tree = lxml.html.parse(url).getroot()
trs = tree.xpath(QUERY)
for tr in trs:
    # Collect the text of every <td> cell in the matching row.
    tds = [el.text_content() for el in tr.iterfind('td')]
    print(tds)


hth

Jon.
[/quote]





> following, then pull everything that follows this second string."
>
> Any thoughts as to how to define a function to do this, or do this
> some other way? All insight is much appreciated! Thanks.

<SMac2347 <at> comcast.net> writes:

>
> Hello,
>
[snip]
> Any thoughts as to how to define a function to do this, or do this
> some other way? All insight is much appreciated! Thanks.
>

[quote in reply to second thread]
Did you not see my reply to your previous thread?

And why do you want the line number?
[/quote]

I'm trying this on GG, as the mailing list gateway, one or t'other, does not seem to work (mea culpa, no doubt).

So I may have obscured the issue more with my quoting and snipping, or whatnot.

Jon.








