Re: How to get the source code of an url?

donarb

Nov 27, 2012, 1:17:17 PM
to django...@googlegroups.com
You're not parsing XML. It's HTML, and it's not well formed: for example, your title and author tags have closing tags that don't match. Your markup needs to be valid XHTML before you try to use an XML parser on it. You might want to try something else to parse this, like Scrapy or Beautiful Soup.
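
To see why a strict XML parser rejects this markup, here is a minimal sketch in Python 3 using only the standard library. The helper name is_well_formed and the two sample strings are assumptions for illustration: BROKEN is modeled on the book entry from the question, and FIXED is a hypothetical corrected version.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Modeled on the markup in the question: <title> is closed by </nomclient>.
BROKEN = "<book><id>747</id><title>L'alchimiste</nomclient></book>"

# Hypothetical corrected version, shown only for comparison.
FIXED = "<book><id>747</id><title>L'alchimiste</title></book>"


def is_well_formed(markup):
    """Return True if minidom can parse the markup, False otherwise."""
    try:
        parseString(markup)
    except ExpatError:
        # expat reports mismatched tags and other well-formedness errors here
        return False
    return True


if __name__ == "__main__":
    print("broken parses:", is_well_formed(BROKEN))
    print("fixed parses:", is_well_formed(FIXED))
```

A check like this only tells you whether the input is well-formed XML. It does not repair anything, which is why a lenient HTML parser is the better fit for this page.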

On Tuesday, November 27, 2012 3:32:16 AM UTC-8, wbc wrote:

I'm trying to parse XML from a URL with minidom. The URL returns my XML data.

This is my code:

import urllib2
from xml.dom.minidom import parse

url = "http://myurl.com/wsname.asp"
datasource = urllib2.urlopen(url)

dom = parse(datasource)
handleElements(dom)

My handleElements function, which parses the XML:

def handleElements(dom):
    Elements = dom.getElementsByTagName("book")
    for item in Elements:
        getText(item.getElementsByTagName("id")[0].childNodes)
        ....

My XML:

<html><head><style type="text/css"></style></head>
<body>
<bibliothque>
 <book>
 <id>747</id>
 <title>L'alchimiste</nomclient>
 <author>Paulo Cohelo </nomposte>
 </book> 
 ...
 </bibliothque>  
</body>

I get no error, but no result!

My handleElements() works fine: when I copy the same data from my URL into a string and use parseString instead of parse, everything works and I get my results.

But when I try to open the URL, Elements is empty and the loop never even starts.

It seems that I need to get the source code of the URL (not its content), like view-source in Chrome. How can I do that?

Thanks

Tom Evans

Nov 28, 2012, 5:09:51 AM
to django...@googlegroups.com
On Tue, Nov 27, 2012 at 6:17 PM, donarb <don...@nwlink.com> wrote:
> You're not parsing XML, it's HTML and it's not well formed, for example your
> title and author tags have closing tags that don't match. Your HTML needs to
> be valid XHTML before trying to use an XML parser on it. You might want to
> try something else to parse this, like Scrapy or Beautiful Soup.
>

For parsing arbitrary HTML, I find that the combination of html5lib
and lxml is hard to beat:


import html5lib
from html5lib import treebuilders

parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder('lxml'))
doc = parser.parse(html_str)
ns = { 'h': 'http://www.w3.org/1999/xhtml' }
li_tables = doc.xpath('//h:ul[@class="table_list"]', namespaces=ns)
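
If installing html5lib and lxml is not an option, the same lenient-parsing idea can be sketched with Python 3's standard-library html.parser. This is a swapped-in technique, not the html5lib approach above. The class name BookIdCollector is hypothetical, and SAMPLE is copied from the markup in the question (without the elided part), so treat it as an illustration rather than a drop-in solution.

```python
from html.parser import HTMLParser


class BookIdCollector(HTMLParser):
    """Collect the text inside every <id> element, tolerating broken markup."""

    def __init__(self):
        super().__init__()
        self.in_id = False  # True while we are between <id> and </id>
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "id":
            self.in_id = True

    def handle_endtag(self, tag):
        if tag == "id":
            self.in_id = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_id and text:
            self.ids.append(text)


# Markup from the question, mismatched closing tags included.
SAMPLE = """<html><head><style type="text/css"></style></head>
<body>
<bibliothque>
 <book>
 <id>747</id>
 <title>L'alchimiste</nomclient>
 <author>Paulo Cohelo </nomposte>
 </book>
 </bibliothque>
</body>"""

collector = BookIdCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.ids)
```

The event-based parser never builds a tree, so the mismatched closing tags are simply reported as they appear and do not stop the parse.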

Cheers

Tom