Cannot read 4-byte characters correctly in UTF-8


ant21

May 10, 2011, 11:59:28 AM
to nodejs
Hi,

I am doing some text processing with nodejs 0.2.3. What is beating me
is that it treats a character made up of 4 bytes as an illegal
character and replaces it with the \uFFFD replacement character when I
read part of a UTF-8 encoded text file through fs.createReadStream()
with the start and end options set. (Example: U+024B62 = F0 A4 AD A2.)

Reading the text piece by piece instead of the whole file at once means
I have to keep track of the current read position in the file.
Unfortunately, I cannot figure out an easy way to get that position, so
as a workaround I accumulate the byte lengths reported by
Buffer.byteLength() and use the running total as the start offset for
the next read.

Everything works like a charm except for characters that take 4 bytes
instead of 3 in UTF-8. When nodejs meets a 4-byte character it returns
\uFFFD, which is only 3 bytes long in UTF-8 (EF BF BD). Tragedy
strikes: Buffer.byteLength() reports 3 while the real character is 4
bytes, so using that length as the start offset for the next read lands
in the middle of the next character, which gets destroyed and becomes
another \uFFFD. The text ends up a mess.
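
To illustrate the mismatch described above (the U+FFFD substitution is
the behaviour reported here for nodejs 0.2.3; the byte counts are plain
Buffer.byteLength() arithmetic):

    // U+24B62 occupies four bytes on disk, but once the decoder has replaced
    // it with U+FFFD, Buffer.byteLength() only sees the three bytes of U+FFFD.
    var onDisk  = [0xF0, 0xA4, 0xAD, 0xA2];            // U+24B62 as stored in the file
    var decoded = '\uFFFD';                            // what the utf8 decoder hands back

    console.log(onDisk.length);                        // 4 - bytes actually consumed from the file
    console.log(Buffer.byteLength(decoded, 'utf8'));   // 3 - bytes of U+FFFD (EF BF BD)
    // Using 3 as the next start offset lands one byte short, so the next read
    // begins in the middle of a character and produces another U+FFFD.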

If handling UTF-8 correctly is not easy right now, is there some other
way to learn the real number of bytes that were read, instead of
relying on Buffer.byteLength()?

And given the lack of built-in conversion between concrete encodings
and Unicode, is it possible to get the hexadecimal value of each byte,
or the raw binary that was read, and leave the conversion job to a
third-party library?

Thank you.

Some information about UTF-8: http://en.wikipedia.org/wiki/UTF-8

mscdex

May 10, 2011, 1:31:37 PM
to nodejs
On May 10, 11:59 am, ant21 <libs...@gmail.com> wrote:
> Hi,
>
> I am doing some text processing with nodejs 0.2.3. What is beating me
> is that it treats a character made up of 4 bytes as an illegal
> character and replaces it with the \uFFFD replacement character when I
> read part of a UTF-8 encoded text file through fs.createReadStream()
> with the start and end options set. (Example: U+024B62 = F0 A4 AD A2.)

Is it possible that you're starting at a position in the file that is
in the middle of a multi-byte utf8 character?

Also, you can read raw bytes using fs.createReadStream by not
specifying an encoding in the options object. You can then use those
bytes with some other encoder/decoder module if you wish, such as
node-iconv.
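
A rough sketch of that suggestion, assuming the node-iconv module is
installed and the (hypothetical) file holds exactly the four bytes of
U+24B62; node-iconv's Iconv(from, to) / convert() interface does the
decoding:

    var fs = require('fs');
    var Iconv = require('iconv').Iconv;              // from the node-iconv module

    // No 'encoding' option, so the 'data' event delivers raw Buffer objects
    // and the bytes on disk arrive untouched.
    fs.createReadStream('/path/to/file.txt', { start: 0, end: 3 })
      .on('data', function(chunk) {
        console.log(chunk);                          // <Buffer f0 a4 ad a2>
        // Decode with an external converter instead of the built-in utf8 decoder.
        var utf16 = new Iconv('UTF-8', 'UTF-16LE').convert(chunk);
        console.log(utf16);                          // <Buffer 52 d8 62 df> -- the surrogate pair for U+24B62
      });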

Koichi Kobayashi

May 10, 2011, 1:51:00 PM
to nod...@googlegroups.com
Hi,

V8 may support only the BMP (U+0000 - U+FFFF).
4-byte characters are outside the BMP.

http://code.google.com/p/v8/issues/detail?id=761
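
As a rough illustration of why such a character cannot fit in a single
16-bit unit: U+24B62 lies above U+FFFF, so a UTF-16 string has to carry
it as a surrogate pair, which can be computed with plain arithmetic:

    var cp    = 0x24B62;                             // the example character from this thread
    var v     = cp - 0x10000;                        // 0x14B62
    var lead  = 0xD800 + (v >> 10);                  // 0xD852 (high surrogate)
    var trail = 0xDC00 + (v & 0x3FF);                // 0xDF62 (low surrogate)
    console.log(lead.toString(16) + ' ' + trail.toString(16));  // d852 df62
    // A decoder limited to the BMP cannot emit this pair and falls back
    // to the replacement character U+FFFD instead.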




--
{
name: "Koichi Kobayashi",
mail: "koi...@improvement.jp",
blog: "http://d.hatena.ne.jp",
twitter: "@koichik"
}

ant21

May 11, 2011, 7:27:11 AM
to nodejs
Thanks, all.

Well, does this mean that in nodejs I can only get a length of three
for a character even when it actually consists of four bytes, since V8
only supports the BMP? In other words, I can't determine the right
offset while reading a file in parts if it contains four-byte
characters?


On May 11, 1:51 am, Koichi Kobayashi <koic...@improvement.jp> wrote:
> Hi,
>
> V8 may support only the BMP (U+0000 - U+FFFF).
> 4-byte characters are outside the BMP.
>
> http://code.google.com/p/v8/issues/detail?id=761

Laurie Harper

May 11, 2011, 9:24:26 AM
to nod...@googlegroups.com
It means you can't rely on internal UTF-8 decoding functions to decode byte streams with characters outside the BMP range. You can read the file in binary format (don't supply an encoding option to fs.createReadStream() or call setEncoding() on the returned stream) and handle the decoding yourself, though. The stream's 'data' events will pass a raw Buffer object if there's no encoding set.

L.

--
Laurie Harper
http://laurie.holoweb.net/

ant21

May 12, 2011, 10:43:40 AM
to nodejs
I'm sorry, but all I can get is EF BF BD.

var fs = require('fs');
var sys = require('sys');   // 'sys' is the old name of the util module on node 0.2.x
var buff = '';

fs.createReadStream('/file/only/contains/one/4-bytes-character')
  .on('error', function(error) {
    console.error('error occurred: ' + error);
  }).on('data', function(chunk) {
    buff += chunk;
    sys.print(chunk); // outputs ef bf bd
  }).on('end', function() {
    sys.print(buff);  // outputs ef bf bd
  });

I think it is just as Koichi Kobayashi said: V8 only supports the BMP,
so we have no chance of getting the real bytes of the character this
way.

Tim Caswell

May 12, 2011, 5:04:51 PM
to nod...@googlegroups.com
You're forcing the buffer to a string by doing string concatenation.  Don't += the raw buffer chunks unless you want to use v8 strings.  Better to keep an array of Buffer objects and then, optionally, combine them into a single Buffer once the final size is known.
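
A minimal sketch of that pattern (the file path is hypothetical; the
chunks are combined by hand with Buffer#copy(), since a built-in concat
helper can't be assumed on a node this old):

    var fs = require('fs');

    var parts = [];
    var total = 0;

    fs.createReadStream('/path/to/file.txt')
      .on('data', function(chunk) {
        parts.push(chunk);                           // keep the raw Buffer, no string conversion
        total += chunk.length;
      })
      .on('end', function() {
        // Combine into one Buffer now that the final size is known.
        var combined = new Buffer(total);
        var offset = 0;
        parts.forEach(function(part) {
          part.copy(combined, offset);
          offset += part.length;
        });
        console.log(combined);                       // e.g. <Buffer f0 a4 ad a2> for a file holding U+24B62
      });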

Laurie Harper

May 12, 2011, 6:34:12 PM
to nod...@googlegroups.com
Same goes for calling sys.print() on the buffer: you're converting the buffer to a string. When I said "handle the decoding yourself" I meant do the buffer-to-string conversion yourself, e.g. by looping through the bytes and handling sequences outside of the BMP range correctly. As soon as you turn the buffer into a string, you're subject to the constraints and limitations of the platform's UTF-8 support.

L.

ant21

May 14, 2011, 2:38:34 PM
to nodejs
Thanks for your help.

I can finally get the raw bytes without any implicit encoding
conversion:

var fs = require('fs');
var bufferArray = [];

fs.createReadStream('/file/only/contains/one/4-bytes-character')
  .on('error', function(error) {
    console.error('error occurred: ' + error);
  }).on('data', function(chunk) {
    var buffer = new Buffer(chunk.length);
    for (var i = 0; i < chunk.length; i++) {
      buffer[i] = chunk[i];
    }
    bufferArray.push(buffer);
  }).on('end', function() {
    callback(bufferArray);   // callback is defined elsewhere in my program
  });

Well, the next challenge is doing the buffer-to-string conversion
manually so that supplementary code points are handled correctly.
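
One possible sketch of that manual conversion, assuming well-formed
UTF-8 input (no error handling for truncated or invalid sequences);
4-byte sequences are turned into UTF-16 surrogate pairs by hand:

    // Decode a Buffer of UTF-8 bytes into a JS string, emitting a surrogate
    // pair for every code point above U+FFFF instead of U+FFFD.
    function utf8ToString(buf) {
      var out = '';
      var i = 0;
      while (i < buf.length) {
        var b = buf[i], cp, extra;
        if (b < 0x80)      { cp = b;        extra = 0; }   // 0xxxxxxx: ASCII
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }   // 110xxxxx: 2-byte sequence
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }   // 1110xxxx: 3-byte sequence
        else               { cp = b & 0x07; extra = 3; }   // 11110xxx: 4-byte sequence
        for (var j = 1; j <= extra; j++) {
          cp = (cp << 6) | (buf[i + j] & 0x3F);            // continuation bytes: 10xxxxxx
        }
        i += extra + 1;
        if (cp > 0xFFFF) {
          // Supplementary plane: encode as a UTF-16 surrogate pair.
          cp -= 0x10000;
          out += String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        } else {
          out += String.fromCharCode(cp);
        }
      }
      return out;
    }

    // The 4-byte sequence for U+24B62 from earlier in the thread:
    var s = utf8ToString(new Buffer([0xF0, 0xA4, 0xAD, 0xA2]));
    console.log(s.length);                                 // 2 -- one surrogate pair, not U+FFFD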

On May 13, 6:34 am, Laurie Harper <lau...@holoweb.net> wrote:
> Same goes for calling sys.print() on the buffer, you're converting the buffer to a string. When I said "handle the decoding yourself" I meant do the buffer-to-string conversion yourself, e.g. by looping through the bytes and handling sequences outside of the BMP range correctly. As soon as you turn the buffer into a string, you're subject to the constraints and limitations of the platform's UTF-8 support.
>
> L.

mscdex

May 14, 2011, 4:43:15 PM
to nodejs
On May 14, 2:38 pm, ant21 <libs...@gmail.com> wrote:
> fs.createReadStream('/file/only/contains/one/4-bytes-character')
>   .on('error', function(error) {
>     console.error('error occurred: ' + error);
>   }).on('data', function(chunk) {
>     var buffer = new Buffer(chunk.length);
>     for (var i = 0; i < chunk.length; i++) {
>       buffer[i] = chunk[i];
>     }
>     bufferArray.push(buffer);
>   }).on('end', function() {
>     callback(bufferArray);
>   });

For the 'data' handler, why not just push the incoming chunk onto
bufferArray instead of pushing a copy? Also as a side note, using
Buffer's copy() method is more efficient (it uses memcpy) than doing
the copy manually with a loop.
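
A brief sketch of both points, with a hard-coded stand-in for the
incoming chunk:

    var bufferArray = [];
    var chunk = new Buffer([0xF0, 0xA4, 0xAD, 0xA2]);  // stand-in for a 'data' chunk

    // The chunk is already a Buffer, so it can be pushed as-is:
    bufferArray.push(chunk);

    // If an independent copy really is wanted, Buffer#copy() does it in one
    // memcpy instead of a byte-by-byte loop:
    var copy = new Buffer(chunk.length);
    chunk.copy(copy);            // copy(target[, targetStart[, sourceStart[, sourceEnd]]])
    bufferArray.push(copy);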

Felix Geisendörfer

May 14, 2011, 6:59:37 PM
to nod...@googlegroups.com
FWIW, calling setEncoding() on a fs.ReadStream passes all buffers through the string_decoder module which makes sure utf-8 multibyte characters are not broken apart. That still won't fix characters outside the BMP range, but for everything else it takes care of things.

--fg
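
For reference, a small sketch of the string_decoder module Felix
mentions (assuming a node version that exposes it directly); it holds
back a trailing partial character until the bytes that complete it
arrive:

    var StringDecoder = require('string_decoder').StringDecoder;
    var decoder = new StringDecoder('utf8');

    // A 3-byte character (U+20AC, the euro sign) split across two chunks:
    var first  = new Buffer([0xE2, 0x82]);           // incomplete sequence
    var second = new Buffer([0xAC]);                 // final byte

    console.log(decoder.write(first));               // ''  -- the partial character is held back
    console.log(decoder.write(second));              // '€' -- emitted once the sequence is complete
    // As noted above, characters outside the BMP are still subject to the
    // platform's UTF-8 handling; this only prevents splitting inside a character.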

ant21

May 16, 2011, 12:33:07 PM
to nodejs
You are absolutely right: the chunk is already a Buffer object and can
be pushed onto the buffer array directly.

Thank you.