Error parsing internationalised strings


Cefn

Jan 10, 2010, 8:10:31 AM
to FireFox POW - Users Group
As part of an Ajax application serving XML through POW I had this text
in an XML file.

'Fédération Internationale des Échecs'

Unfortunately, there seems to be a problem with Unicode support: POW
miscalculates the length of the file when internationalised characters
are included and does not serve all the bytes, since it assumes that
the number of characters and the number of bytes are the same (which
holds only for US-ASCII).
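
To make the character/byte mismatch concrete, here is a minimal sketch that counts the UTF-8 bytes of a JavaScript string by code point range. The helper name utf8ByteLength is hypothetical and is not part of POW; the accented letters are written as \u escapes so the example does not depend on the encoding of this page.

```javascript
// Hypothetical helper (not part of POW): number of bytes a string
// occupies once it is encoded as UTF-8.
function utf8ByteLength(str) {
  var bytes = 0;
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;                       // US-ASCII
    } else if (code < 0x800) {
      bytes += 2;                       // e.g. accented Latin letters
    } else if (code >= 0xD800 && code <= 0xDBFF && i + 1 < str.length &&
               str.charCodeAt(i + 1) >= 0xDC00 &&
               str.charCodeAt(i + 1) <= 0xDFFF) {
      bytes += 4;                       // surrogate pair, outside the BMP
      i++;                              // skip the low surrogate
    } else {
      bytes += 3;                       // rest of the BMP
    }
  }
  return bytes;
}

// "Fédération Internationale des Échecs"
var body = 'F\u00E9d\u00E9ration Internationale des \u00C9checs';
console.log(body.length);           // 36 characters
console.log(utf8ByteLength(body));  // 39 bytes: 3 accented letters, 2 bytes each
```

A Content-Length based on body.length would therefore be three bytes short for this string.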

The original file...

<?xml version="1.0" encoding="UTF-8"?>
<player name="Viswanathan Anand" position="Grand Master" >
<championship body="Fédération Internationale des Échecs"
title="World Champion" />
</player>

...is served as (copy paste from Firefox)...

<?xml version="1.0" encoding="UTF-8"?>
<player name="Viswanathan Anand" position="Grand Master" >
<championship body="Fédération Internationale des Échecs"
title="World Champion" />
</pl

...cutting off the end of the file whenever the internationalised
characters are included.

Swap the accented Es for normal Es and it serves all the bytes
correctly again.
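
The truncation can be reproduced outside POW. The sketch below assumes a runtime that provides TextEncoder and TextDecoder (current browsers and Node.js; these were not available in Firefox at the time of this thread). It encodes a fragment of the file as UTF-8, keeps only as many bytes as the string has characters, and decodes what is left. The variable names are illustrative only.

```javascript
// A fragment of the XML file, with the accented letters as \u escapes.
var xml = '<championship body="F\u00E9d\u00E9ration Internationale des ' +
          '\u00C9checs" />\n</player>';

// What actually goes over the wire: UTF-8 bytes.
var bytes = new TextEncoder().encode(xml);

// Simulate a server that sends only xml.length bytes,
// i.e. treats the character count as the byte count.
var sent = bytes.slice(0, xml.length);
var received = new TextDecoder().decode(sent);

console.log(bytes.length - xml.length); // 3 extra bytes, one per accented letter
console.log(received);                  // the last 3 characters of "</player>" are lost
```

Each two-byte character pushes one byte of the tail past the advertised length, so the closing tag is what gets lost.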

Here is a patch for server.js which seems to fix the file length
miscalculation on my local POW for the Basic Multilingual Plane of
Unicode.

// END BAD

var cefn_length = webpage.length;
var re2b = /[\u0080-\u07FF]/g; //2 byte characters
var re3b = /[\u0800-\uFFFF]/g; //3 byte characters
if(re2b.test(webpage)){
var match_2b = webpage.match(re2b);
var match_3b = webpage.match(re3b);
cefn_length += match_2b + match_3b;
}

// FINALIZE HEADERS AND CLOSE
extra_headers += "Server: POW/"+pow_version+"\r\n";
extra_headers += "Content-Length: "+(cefn_length -2)+"\r\n";


This doesn't fix the fact that something in the outgoing character
encoding causes the characters to be junked when displayed in Firefox,
but it at least sends the full file.
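
The cause of the junked characters is not established in this thread. One common way this kind of junk arises, offered here only as an assumption and not as a diagnosis of POW, is that UTF-8 bytes get interpreted one byte per character (as Latin-1). A minimal sketch of that effect:

```javascript
// UTF-8 bytes for "Féd": F = 0x46, é = 0xC3 0xA9, d = 0x64.
var utf8Bytes = [0x46, 0xC3, 0xA9, 0x64];

// Treat every byte as its own character, as a Latin-1 reader would.
var misread = String.fromCharCode.apply(null, utf8Bytes);

console.log(misread);        // "FÃ©d": the single é shows up as two junk characters
console.log(misread.length); // 4 characters instead of 3
```

If the junk seen in Firefox has this two-characters-per-accent shape, the declared or assumed charset of the response would be the first thing to check.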

In general, is there a good way to run a local copy of POW that you
can debug? (I was doing all this with alerts, reinserting the
server.js file into the archive again and again.)

Also, how should we commit back to the project? Is there an SVN or
GitHub repository we can follow?

Cefn

Jan 10, 2010, 9:26:10 AM
to FireFox POW - Users Group
There's a bug in the patch I provided in the last post.

Last-minute changes I made introduced a bug that went unnoticed,
because the invalid header Content-Length: NaN also seems to work
fine: the browser ignores it.

Sorry for any confusion. Here is improved code to patch server.js,
which I've checked in Firebug does indeed provide a sensible number in
the Content-Length: HTTP header. It's also a little more efficient.

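// END BAD

// Start from the character count, then add the extra bytes that
// UTF-8 needs for non-ASCII characters (Basic Multilingual Plane only).
var cefn_length = webpage.length;
var re2b = /[\u0080-\u07FF]/g; // 2-byte characters: 1 extra byte each
var re3b = /[\u0800-\uFFFF]/g; // 3-byte characters: 2 extra bytes each
var match_2b = webpage.match(re2b); // null when nothing matches
var match_3b = webpage.match(re3b);
if (match_2b) {
    cefn_length += match_2b.length;
}
if (match_3b) {
    cefn_length += 2 * match_3b.length;
}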
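
As an alternative to counting with regular expressions, the byte length can be obtained by letting the engine do the UTF-8 encoding. The sketch below uses the long-standing unescape(encodeURIComponent(...)) idiom, which also handles characters outside the Basic Multilingual Plane. The function name contentLengthHeader is hypothetical, and the "- 2" adjustment from the patch is left out because its purpose is not stated in this thread.

```javascript
// Hypothetical helper (not part of POW): build a Content-Length header
// from the UTF-8 byte length of the response body.
function contentLengthHeader(body) {
  // encodeURIComponent percent-encodes each UTF-8 byte as %XX;
  // unescape turns every %XX back into a single character,
  // so the resulting string has one character per byte.
  // Note: encodeURIComponent throws a URIError on unpaired surrogates.
  var byteCount = unescape(encodeURIComponent(body)).length;
  return 'Content-Length: ' + byteCount + '\r\n';
}

console.log(contentLengthHeader('\u00C9checs')); // "Content-Length: 7\r\n"
```

unescape is deprecated but still widely implemented; in a current runtime, new TextEncoder().encode(body).length gives the same count.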

Dave

Jan 11, 2010, 10:21:59 PM
to FireFox POW - Users Group
Thanks for the code. I'll look into integrating it. This is funny,
though, since I noticed for years that Mozilla and JS are good at
guessing non-7bit-ascii string length. JS also detects the width of
characters (2 or 3 bytes) with little problem.

Dave
