The reason I think it's a bug is that it seems unreasonable that the
test case XML is parsable by the DOM parser but not the e4x parser.
Before filing in bugzilla I thought I would post here to see if anyone
has another explanation for the behaviour.
An email describing the problem with a test case is attached.
Regards -
Leni.
Test-case xml is attached.
I also have a question about a workaround I was considering using:
var serializer = new XMLSerializer();
var str = serializer.serializeToString(req.responseXML);
var xml = new XML(str);
By running the DOM's XML through XMLSerializer to make a string and then
giving that string to the e4x parser, at least it parses.
But XMLSerializer turns that three-byte UTF-8 sequence into a '('
character. So two more questions:
a) can someone offer a pointer to how XMLSerializer is supposed
to behave when there is a 3-byte UTF-8 sequence in the content
of an element?
b) can anyone suggest any other workaround?
The real-world thing I am trying to do is get a UTF-8 encoded Atom feed
coming from Google into an e4x XML object.
Leni.
Ok, it looks like the mailing list software is removing the attachment,
so here is a URL:
http://www.zindus.com/tmp/1.xml.zip
Leni.
>> Can you post the XML you are trying to parse?
>
> Test-case xml is attached.
I can't reproduce the issue with Firefox 3.0 (Mozilla/5.0 (Windows; U;
Windows NT 5.1; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6). Test
case is at
http://home.arcor.de/martin.honnen/javascript/2009/02/test2009021501.html
and loads XML document from
http://home.arcor.de/martin.honnen/javascript/2009/02/test2009021501Test.xml
which is the file you sent.
I don't get any script or XML parsing errors.
Yes, you are right.
The extension I am working on is for Thunderbird2 and Thunderbird3, and
I can only reproduce the problem under Thunderbird2, not Thunderbird3.
Sorry for not making this clear in the original posting (I didn't test tb3).
If you are curious to reproduce this problem in Thunderbird using
Martin's test case, install the ThunderBrowse extension:
https://addons.mozilla.org/en-US/thunderbird/addon/5373
Then visit the link in ThunderBrowse:
http://home.arcor.de/martin.honnen/javascript/2009/02/test2009021501.html
In Thunderbird3, the page is served correctly - the XML is shown.
In Thunderbird2, the page is not served correctly - the javascript error
console reports:
Error: e.target.parentNode.hasAttribute is not a function
Source File: chrome://tbrowse/content/tburlclk.js
Line: 377
I won't file a bug report for this tb2-only problem then because I doubt
it would get much attention.
About a workaround for Thunderbird 2, the DOM ==> XMLSerializer ==> e4x
technique does parse the XML but converts that 3-byte UTF-8 sequence
into a '(' which makes it lossy. If someone can shed any light on what
is going on here and in particular, what class of UTF-8 byte sequences
might be affected by such lossy conversion, it would help me evaluate
whether this technique is acceptable.
Or if anyone can think of a better workaround for tb2 it will be welcome!
Thanks -
Leni.
Actually, it's a bug that deals with javascript link handling in
ThunderBrowse. 3.2.3 fixes the bug.
After the posting from shows.G...@gmail.com I did some more testing
and found that I can't reproduce it when the e4x parsing happens
inside a <browser> element.
So ... here is another test case along the same lines.
To run the test:
- copy and paste the code below into a text editor and remove
all the newlines - all the code should be on one line
- copy and paste into the javascript error console and click evaluate
The error console reports:
Error: missing = in XML attribute
Source File:
Line: 3, Column: 2
Source Code:
le><content>Alice, Kerry
I can reproduce this in tb2, tb3beta1 and Firefox 3.0.6. It's the \u2028
character in the code below which causes the problem.
var str = "<?xml version='1.0' encoding='UTF-8'?><feed
xmlns='http://www.w3.org/2005/Atom'
xmlns:openSearch='http://a9.com/-/spec/opensearch/1.1/'><id>exa...@gdomain.example.com</id><updated>2009-02-11T05:58:32.673Z</updated><category
scheme='http://schemas.google.com/g/2005#kind'
term='http://schemas.google.com/contact/2008#contact'/><generator
version='1.0'
uri='http://www.google.com/m8/feeds'>Contacts</generator><entry><app:edited
xmlns:app='http://www.w3.org/2007/app'>2009-02-11T05:48:11.672Z</app:edited><title>Alice
Midxxxxxx</title><content>Alice, Kerry \u2028Ex:
Jones</content></entry></feed>";var xml = new XML(str.replace(/\<\?xml
version=.*?\?\>/,""));
Regards -
Leni.
Yes - thanks for that. With ThunderBrowse 3.2.3 Martin's test case now
works for me too.
Leni.
Just for good measure, I can now reproduce the problem using a test case
similar to the one you used.
Test case:
http://www.zindus.com/tmp/test-case-2009-02-17-1.html
The xml:
http://www.zindus.com/tmp/test-case-2009-02-17-1.xml
Firefox 3.0.6 error console reports:
Error: illegal XML character
The .xml is different to the one provided earlier, but the problem is
the same - related to that unicode character, in this example it is just
before the string "Jones".
I hope I am not making a big noise over something that has a simple
explanation.
Leni.
I see that problem too with Firefox 3.0.6.
Now to move the problem into a bug report it would be best to have a
minimal test case, preferably, as the E4X XML constructor is implemented
by the JavaScript engine itself, a test case not even needing to load an
XML document with XMLHttpRequest, but rather a script test case doing
new XML(string) and causing the error.
I am however struggling to identify the character causing the problem.
According to your earlier post, it is encoded in UTF-8 as 0xe2 0x80 0xa8
which would be the Unicode character U+2028 I think.
However doing
var el = new XML('<foo>Line 1.\u2028Line 2.</foo>');
in Firefox 3.0.6 does not cause any error, so that way the character is
parsed fine. So either it is not that character causing the error or
that error only occurs with longer strings.
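As a quick check of the byte sequence mentioned above, here is a minimal sketch that decodes a three-byte UTF-8 sequence by hand. The helper name decodeUtf8Triple is made up for illustration. The bytes 0xe2 0x80 0xa8 do come out as U+2028 (LINE SEPARATOR), a character that ECMAScript treats as a line terminator in source text, which may or may not be relevant to the error.

```javascript
// Decode one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)
// into its Unicode code point.
// decodeUtf8Triple is a hypothetical helper, not part of any library.
function decodeUtf8Triple(b0, b1, b2) {
  if ((b0 & 0xf0) !== 0xe0 || (b1 & 0xc0) !== 0x80 || (b2 & 0xc0) !== 0x80) {
    throw new Error("not a three-byte UTF-8 sequence");
  }
  // 4 payload bits from the lead byte, 6 from each continuation byte.
  return ((b0 & 0x0f) << 12) | ((b1 & 0x3f) << 6) | (b2 & 0x3f);
}

var codePoint = decodeUtf8Triple(0xe2, 0x80, 0xa8);
var hex = codePoint.toString(16); // "2028"
```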
I get that too in trunk Gecko.
However, if I start reducing it (and it's possible to reduce it a good
bit while still getting that error), I eventually get to a point where
the error starts changing (e.g. complaining about there being a missing
'=' in an attribute).
If I breakpoint on the "illegal XML character" error, I see that it
happens when we get a '<' while we think we're in the process of parsing
an open tag.
In particular, it thinks it's looking at a string that looks something like:
<author/www.google.com/m8/feeds/contacts/a.b%40gdomain.example.com/thin?start-index=2681&max-results=10'
Which is pretty clearly bogus.
-Boris
var xmlEl = new XML("<feed " +
  "xmlns:gContact='http://schemas.google.com/contact/2' " +
  "xmlns:batch='http://schemas.google.com/gdata/batch' " +
  "xmlns:gd='http://schemas.google.com/g/2005' " +
  "gd:etag='W/&quot;xxxxxxxxxxxxxxxxxxxxxxw.&quot;'><updated>2009-02-1</updated><e><c>\u2028</c></e></feed>");
var pre = document.createElement('pre');
pre.appendChild(document.createTextNode(xmlEl.toXMLString()));
document.body.appendChild(pre);
with no XMLHttpRequest required. Deleting chars from the string
sometimes changes the error, and sometimes makes it go away entirely,
but I bet it can be minimized some more. If someone wants to take a
shot at that, great.
This doesn't look like an XML issue, though, but a JS engine one.
-Boris
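For anyone who wants to take a shot at the further minimization, the delete-a-character-and-retry process can be automated. The following is only a sketch of a greedy reduction. The names reduceString and demoPredicate are hypothetical, and the predicate here is a stand-in so the code runs anywhere. In an E4X-capable engine the predicate would instead wrap new XML(s) in try/catch and return true only when the same error message is thrown, since the error is reported above to change as the string shrinks.

```javascript
// Greedy reduction: repeatedly try deleting one character at a time and
// keep the deletion whenever the caller-supplied predicate still reports
// the failure. reduceString and stillFails are hypothetical names.
function reduceString(input, stillFails) {
  var current = input;
  var changed = true;
  while (changed) {
    changed = false;
    for (var i = 0; i < current.length; i++) {
      var candidate = current.slice(0, i) + current.slice(i + 1);
      if (stillFails(candidate)) {
        current = candidate;
        changed = true;
        i--; // the next character has shifted into position i
      }
    }
  }
  return current;
}

// Stand-in predicate (an assumption, NOT the real E4X check): "fails"
// whenever the string still contains both "<c>" and U+2028.
function demoPredicate(s) {
  return s.indexOf("<c>") !== -1 && s.indexOf("\u2028") !== -1;
}

var reduced = reduceString("<e><c>abc\u2028def</c></e>", demoPredicate);
```

The result is only a local minimum: no single further deletion keeps the failure, but a different deletion order might find a shorter string.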
Thanks for the reduction. I agree it is a JavaScript engine issue, I
have filed https://bugzilla.mozilla.org/show_bug.cgi?id=478905 on this.
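On the earlier question about a workaround: one thing that might be worth trying is to replace the raw U+2028 (and its sibling U+2029) with numeric character references before the string reaches the XML constructor. This is a sketch only. The helper name escapeLineSeparators is made up, and it is an assumption, not something tested on tb2, that the e4x parser accepts the references where it rejects the raw characters.

```javascript
// Replace raw U+2028 / U+2029 with XML numeric character references.
// escapeLineSeparators is a hypothetical helper name.
function escapeLineSeparators(xmlText) {
  return xmlText.replace(/[\u2028\u2029]/g, function (ch) {
    return "&#x" + ch.charCodeAt(0).toString(16) + ";";
  });
}

var raw = "<content>Alice, Kerry \u2028Ex: Jones</content>";
var safe = escapeLineSeparators(raw);
// In the setting of this thread the result would then be handed on as:
//   var xml = new XML(safe);
// (assumption: the references are resolved back to the same characters)
```

A character reference is only equivalent to the raw character in element content and attribute values, so this would alter the data if the character ever appeared inside a CDATA section or a comment.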