
preserving entities with lxml


Robin Becker

Jan 12, 2022, 5:22:41 AM
I have a puzzle over how lxml & entities should be 'preserved'; the code below illustrates. To preserve them I change & --> &amp;
in the source and add resolve_entities=False to the parser definition. The escaping means we only have one kind of
entity, &amp;, which means lxml will preserve it. For whatever reason lxml won't preserve character entities, e.g. &#33;.

The simple parse from string and conversion tostring shows that the parsing at least took notice of it.

However, I want to create a tuple tree so have to use tree.text, tree.getchildren() and tree.tail for access.

When I use those I expected to have to undo the escaping to get back the original entities, but it seems that has
already been done.

Good for me, but if the tree knows how it was created (tostring shows that) why is it ignored with attribute access?

if __name__=='__main__':
    from lxml import etree as ET
    #initial xml
    xml = b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>'
    #escaped xml
    xxml = xml.replace(b'&',b'&amp;')

    myparser = ET.XMLParser(resolve_entities=False)
    tree = ET.fromstring(xxml,parser=myparser)

    #use tostring
    print(f'using tostring\n{xxml=!r}\n{ET.tostring(tree)=!r}\n')

    #now access the items using text & children & tail
    print(f'using attributes\n{tree.text=!r}\n{tree.getchildren()=!r}\n{tree.tail=!r}')

when run I see this

$ python tmp/tlp.py
using tostring
xxml=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'
ET.tostring(tree)=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'

using attributes
tree.text='aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA'
tree.getchildren()=[]
tree.tail=None
--
Robin Becker

Dieter Maurer

Jan 12, 2022, 3:49:51 PM
Apparently, the `resolve_entities=False` was not effective: otherwise,
your tree content should have more structure (especially some
entity reference children).

`&#<value>` is not an entity reference but a character reference.
It may rightfully be treated differently from entity references.
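
For comparison, a minimal sketch of what `resolve_entities=False` looks like when the parser actually meets an entity
reference. It assumes the unescaped source is parsed and that `mysym` is declared in an internal DTD subset; the
DOCTYPE and the replacement text "SYM" are invented for illustration and are not part of the original example.

```python
from lxml import etree as ET

# Hypothetical input: the entity is declared so the unescaped document is well-formed.
xml = b'<!DOCTYPE a [<!ENTITY mysym "SYM">]><a>aaaaa &mysym; &#33; AAAAA</a>'

parser = ET.XMLParser(resolve_entities=False)
tree = ET.fromstring(xml, parser=parser)

# The unresolved reference is kept as a child node, not as text.
children = list(tree)
print(f'{tree.text=!r}')
for child in children:
    is_entity = child.tag == ET.Entity
    print(f'{is_entity=!r} {child.name=!r} {child.tail=!r}')
print(ET.tostring(tree))
```

In this sketch the entity reference appears as a child of `a`, while the character reference `&#33;` is still
expanded into the surrounding text, which matches the distinction drawn above.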

Robin Becker

Jan 13, 2022, 4:14:07 AM
On 12/01/2022 20:49, Dieter Maurer wrote:
.......
>>
>> when run I see this
>>
>> $ python tmp/tlp.py
>> using tostring
>> xxml=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'
>> ET.tostring(tree)=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'
>>
>> using attributes
>> tree.text='aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA'
>> tree.getchildren()=[]
>> tree.tail=None
>
> Apparently, the `resolve_entities=False` was not effective: otherwise,
> your tree content should have more structure (especially some
> entity reference children).
>
except that the tree knows not to expand the entities using ET.tostring so in some circumstances resolve_entities=False
does work.

I expected that the tree would contain the parsed (unexpanded) values, but referencing the actual tree.text/tail/attrib
doesn't give the expected results. There's no criticism here, it makes my life a bit easier. If I had wanted the
unexpanded values in the attrib/text/tail it would be more of a problem.


> `&#<value>` is not an entity reference but a character reference.
> It may rightfully be treated differently from entity references.
I understand the difference, but lxml (and perhaps libxml2) doesn't provide a way to turn off character reference
expansion. This makes using lxml for source transformation a bit harder since the original text is not preserved.

Dieter Maurer

Jan 13, 2022, 4:30:14 AM
Robin Becker wrote at 2022-1-13 09:13 +0000:
>On 12/01/2022 20:49, Dieter Maurer wrote:
> ...
>> Apparently, the `resolve_entities=False` was not effective: otherwise,
>> your tree content should have more structure (especially some
>> entity reference children).
>>
>except that the tree knows not to expand the entities using ET.tostring so in some circumstances resolve_entities=False
>does work.

I think this is a misunderstanding: `tostring` will represent the text character `&` as `&amp;`.
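
A minimal sketch of that point, with no parser involved at all: the text is set directly on a freshly built
element (the text value is chosen just for illustration), and the serializer still writes `&` as `&amp;`.

```python
from lxml import etree as ET

# Build the element by hand, so no parser option can be responsible for the output.
el = ET.Element('a')
el.text = 'aaaaa &mysym; AAAAA'

serialized = ET.tostring(el)
print(serialized)  # the literal '&' in the text is escaped on output
```

So the `&amp;` seen in the `tostring` output is ordinary output escaping of text, not a remembered entity reference.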

Robin Becker

Jan 13, 2022, 4:58:37 AM
aaahhhh,

thanks I see now. So tostring is actually restoring some of the entities which on input are normally expanded. If that
means resolve_entities=False does not work at all then I guess there's no need to use it at all. The initial transform

& --> &amp;

does what I need as it is reversed on output of the tree fragments.
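
A minimal sketch of that approach with the default parser, assuming a tuple tree shaped as (tag, attributes,
content); the helper name `to_tuple_tree`, the tuple layout, and the extra `<b>` child in the sample input are
hypothetical and only there for illustration.

```python
from lxml import etree as ET

def to_tuple_tree(node):
    """Return (tag, attrib dict, content) where content mixes strings and child tuples."""
    content = []
    if node.text:
        content.append(node.text)
    for child in node:
        content.append(to_tuple_tree(child))
        if child.tail:
            content.append(child.tail)
    return (node.tag, dict(node.attrib), content)

xml = b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; <b>x &#33;</b> &lt; AAAAA</a>'
# Escape every '&' so the parser only ever sees &amp; and hands back the original spelling.
xxml = xml.replace(b'&', b'&amp;')

tuple_tree = to_tuple_tree(ET.fromstring(xxml))
print(tuple_tree)
```

The strings in the resulting tuples carry the original `&mysym;`, `&lt;` and `&#33;` spellings, because parsing
turns each `&amp;` back into a plain `&`.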

Wonder what resolve_entities is actually used for then? All the docs seem to say

> resolve_entities - replace entities by their text value (default: True)

I assumed False would mean that they would pass through the parse
--
Robin Becker
