Parsing no-break spaces


Luis Miguel Morillas

Nov 11, 2011, 12:02:14 PM
to akar...@googlegroups.com
If this HTML is parsed: <html><head/><body><h1>Coca&nbsp;Cola</h1></body></html>

Amara gives this result:

>>> from amara.bindery import html
>>> doc = html.parse("<html><head/><body><h1>Coca&nbsp;Cola</h1></body></html>")

>>> doc.xml_write()
<html><head/><body><h1>Coca Cola</h1></body></html>

>>> doc.html.body.h1.xml_children
(<text at 0x8ff80ac: u'Coca'>,
<text at 0x8ff8c6c: u'\xa0'>,
<text at 0x8ff8f6c: u'Cola'>)

>>> len(doc.html.body.h1.xml_children)
3

Is this OK? Should h1 have three children?

Greetings,


-- luismiguel

Uche Ogbuji

Nov 11, 2011, 1:12:55 PM
to akar...@googlegroups.com
bindery.html must not be normalizing text, so that's a bug. For now, you should be able to call normalize() on the h1 node to merge the adjacent text nodes.
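
To illustrate what normalizing does here, below is a minimal sketch that uses the standard library's xml.dom.minidom as a stand-in, not Amara's bindery. It assumes the same situation as in the report: an h1 element holding three adjacent text nodes (u'Coca', u'\xa0', u'Cola'). The exact method name and behavior on Amara nodes may differ, so treat this only as a picture of the idea.

```python
# Stand-in example with the standard library DOM, NOT Amara's API.
from xml.dom.minidom import Document

doc = Document()
h1 = doc.createElement("h1")

# Rebuild the reported state: three adjacent text nodes under h1.
for piece in (u"Coca", u"\xa0", u"Cola"):
    h1.appendChild(doc.createTextNode(piece))

count_before = len(h1.childNodes)

# normalize() merges adjacent text nodes into a single text node.
h1.normalize()

count_after = len(h1.childNodes)
merged_text = h1.firstChild.data  # the no-break space is kept in the text
```

After the call, h1 has a single text child whose data still contains the no-break space between the two words.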

--Uche

