Getting line numbers


Ramdas S

Dec 11, 2007, 11:38:50 AM
to beautifulsoup
Hi,

I am a newbie to BS and to HTML parsing.

I am writing a program to parse HTML reports. Basically, it is a program
to scrape HTML page(s) and then generate reports for a specific
application.

I need a way to refer back to the original files, maybe the source code
of the HTML files, by line number.

I see no way to do this using BS. Can you give me a hint?

For example, I want to get the line numbers of all img tags within
a file.

Thanks

Ramdas

Leonard Richardson

Dec 11, 2007, 8:38:57 PM
to beauti...@googlegroups.com
> I need a way to refer back to the original files, maybe the source code
> of the HTML files, by line number.

Beautiful Soup doesn't track this information, so you can't use
Beautiful Soup for what you want. The information is tracked by the
underlying sgmllib parser, so you can use sgmllib, and I might put
this feature in a future version of Beautiful Soup.
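
For anyone trying the parser route on a current Python, here is a rough sketch of the idea. Note the substitution: sgmllib was removed in Python 3, so this uses the standard library's html.parser.HTMLParser instead, which exposes the same getpos() method. The names ImgLineFinder and img_lines are made up for illustration, and this only covers the "line numbers of all img tags" case from the question.

```python
from html.parser import HTMLParser


class ImgLineFinder(HTMLParser):
    """Record the (line, column) position of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.img_positions = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />, because the
        # default handle_startendtag() delegates to handle_starttag().
        if tag == "img":
            # getpos() returns (lineno, offset): lineno is 1-based,
            # offset is the 0-based column within that line.
            self.img_positions.append(self.getpos())


def img_lines(html):
    """Return the source line number of each <img> tag in `html`."""
    finder = ImgLineFinder()
    finder.feed(html)
    finder.close()
    return [line for line, _column in finder.img_positions]


sample = """<html>
<body>
<img src="a.png">
<p>text</p>
<img src="b.png" />
</body>
</html>"""

print(img_lines(sample))  # prints [3, 5]
```

The positions refer to the text passed to feed(), so they only match the file on disk if the file is fed in unmodified.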

Leonard

Ramdas S

Dec 11, 2007, 11:39:16 PM
to beautifulsoup
Thanks,

I think it would be a good idea to do so, as many developers will use BS
to scrape information from web pages and would like some reference
back to where the information originally came from.

For now, I had to write my own crude, ugly brute-force parser, which
is resource-hungry.

Great work!

Ramdas




Leonard Richardson

Dec 21, 2007, 7:19:33 PM
to beauti...@googlegroups.com
Ramdas,

> I think it would be a good idea to do so, as many developers will use BS
> to scrape information from web pages and would like some reference
> back to where the information originally came from.

I was unable to implement this because sgmllib doesn't actually update
its line number information as it parses. This is a known bug in
Python:

http://bugs.python.org/issue849097

That bug has a patch you can use on sgmllib, and I've attached a patch
that changes Beautiful Soup to store the getpos() information for all
PageElement objects. If you apply them both it should work, but I
won't put this in the official release.
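
The attached patch is not reproduced in this thread, so the following is not that patch. It is only a rough, self-contained sketch of the same idea: record getpos() on every element at the moment the parser creates it. It assumes Python 3's html.parser in place of sgmllib, and Element, PositionTreeBuilder, find_all and VOID_TAGS are hypothetical names, not Beautiful Soup classes.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (a deliberately incomplete list).
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class Element:
    """A tree node that remembers where its start tag appeared."""

    def __init__(self, name, attrs, line, column):
        self.name = name
        self.attrs = dict(attrs)
        self.line = line        # 1-based line of the start tag
        self.column = column    # 0-based offset within that line
        self.children = []


class PositionTreeBuilder(HTMLParser):
    """Build a simple element tree, storing getpos() on each node."""

    def __init__(self):
        super().__init__()
        self.root = Element("[document]", [], 0, 0)
        self.stack = [self.root]

    def _add(self, tag, attrs):
        line, column = self.getpos()
        element = Element(tag, attrs, line, column)
        self.stack[-1].children.append(element)
        return element

    def handle_starttag(self, tag, attrs):
        element = self._add(tag, attrs)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag: attach it but do not descend into it.
        self._add(tag, attrs)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == tag:
                del self.stack[i:]
                break


def find_all(element, name):
    """Yield every descendant of `element` called `name`, in document order."""
    for child in element.children:
        if child.name == name:
            yield child
        yield from find_all(child, name)


builder = PositionTreeBuilder()
builder.feed("<div>\n  <p>hi</p>\n  <img src='x.png'>\n</div>")
for img in find_all(builder.root, "img"):
    print(img.name, img.line, img.column)
```

Storing the position on the node at creation time means any later search over the tree can report where a match came from without re-reading the source.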

Leonard

LineNumber.patch


