buffer toString with partial utf8 character?

Mark Hahn

Sep 3, 2014, 4:28:48 AM

to nod...@googlegroups.com

What happens if I convert an entire buffer to a string using toString and the end of the buffer only has a partial character encoding? Will it just ignore the extra byte(s)? (This is what I want). Or will it have garbage at the end of the string or throw an exception?

Jimb Esser

Sep 3, 2014, 7:23:15 PM

to nod...@googlegroups.com

I seem to remember it converts any unknown codes to the Unicode character 65533 (�). Giving it a try is easy!

node -e 'console.log(new Buffer([255]).toString().charCodeAt(0))'

Ben Noordhuis

Sep 4, 2014, 1:50:32 AM

to nod...@googlegroups.com

It will have garbage at the end. More precisely, the partial
character gets replaced with the replacement character, U+FFFD.
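
A quick way to check this, as a sketch assuming a current Node.js where `Buffer.from` replaces the `new Buffer(...)` constructor used earlier in the thread. The euro sign takes three bytes in UTF-8, so dropping the last byte leaves a partial character at the end:

```javascript
// '€' encodes to the three bytes E2 82 AC in UTF-8.
const whole = Buffer.from('abc€', 'utf8');
// Drop the final byte so the buffer ends mid-character.
const truncated = whole.subarray(0, whole.length - 1);
const str = truncated.toString('utf8');

// No exception is thrown; the partial character shows up as U+FFFD.
console.log(str.startsWith('abc'));   // true
console.log(str.endsWith('\uFFFD'));  // true
```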

Mark Hahn

Sep 5, 2014, 4:08:53 AM

to nod...@googlegroups.com

So if I find \uFFFD as the last character of a valid but truncated utf8 buffer and I strip it, I should always end up with a valid string, right?

That was an awkward sentence. Let me try in code. If buf is the first 512 bytes of a long utf8 file will the following always produce a valid string?

str = buf.toString();

if (str[str.length-1] is '\uFFFD') str = str.slice(0, -1);

Jimb Esser

Sep 5, 2014, 12:31:01 PM

to nod...@googlegroups.com

\uFFFD is a valid character, so you'll always have a valid string. If you do as you suggest, you will have both a valid string and an actual prefix of your file, which is probably what you want, representing the first 508 to 512 bytes (with some chance of chopping off a character that was actually in the file if the last utf8 character really was \uFFFD).
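
The caveat in parentheses can be seen in a small sketch. `stripTrailingReplacement` is a hypothetical helper name, and `Buffer.from` assumes a current Node.js:

```javascript
// Hypothetical helper: drop one trailing U+FFFD, if present.
function stripTrailingReplacement(str) {
  return str.endsWith('\uFFFD') ? str.slice(0, -1) : str;
}

// Case 1: buffer cut in the middle of 'é' (bytes C3 A9), leaving a lone C3.
const cut = Buffer.from('caf\u00e9', 'utf8').subarray(0, 4);
const fromCut = stripTrailingReplacement(cut.toString('utf8'));

// Case 2: the data really does end with U+FFFD, and stripping loses it.
const genuine = Buffer.from('caf\uFFFD', 'utf8');
const fromGenuine = stripTrailingReplacement(genuine.toString('utf8'));

console.log(fromCut);      // prints caf
console.log(fromGenuine);  // prints caf, although the input had one more character
```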

Alex Kocharin

Sep 8, 2014, 11:35:30 AM

to nod...@googlegroups.com

05.09.2014, 13:32, "Mark Hahn" <ma...@hahnca.com>:

Nope, that's a syntax error. It should be either:

if (str[str.length-1] === '\uFFFD') str = str.slice(0, -1)

Or:

if (str[str.length-1] is '\uFFFD') then str = str.slice(0, -1)

:)

Other than that... If you're converting a buffer to utf8, you *always* get valid utf8 in the output.

But it could contain special characters that you don't want. They can appear in the middle of the string as well if your input isn't a valid utf8:

```
> Buffer([32,32,32,255,32,32,32]).toString('utf8').charCodeAt(3)
65533
```

Also BOM in the beginning, though that's rare now.

If you want to stream buffers, there is some tool in node.js core that takes care of ending characters... don't remember which one
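
The core module being recalled here is presumably `string_decoder`. A minimal sketch of how its documented `write()` and `end()` methods handle a character split across chunks; the chunk boundary is made up for illustration:

```javascript
const { StringDecoder } = require('string_decoder');

const decoder = new StringDecoder('utf8');
const bytes = Buffer.from('a€b', 'utf8'); // 61 E2 82 AC 62

// Split in the middle of '€': the first chunk ends with E2 82.
const first = decoder.write(bytes.subarray(0, 3)); // partial character is held back
const second = decoder.write(bytes.subarray(3));   // completed by the next chunk
const rest = decoder.end();                        // flushes anything still buffered

console.log(JSON.stringify(first));  // "a"
console.log(JSON.stringify(second)); // "€b"
console.log(JSON.stringify(rest));   // ""
```

Calling `write()` once on a truncated buffer and never calling `end()` returns only the complete characters, which is the behaviour asked for at the top of the thread.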

Ben Noordhuis

Sep 8, 2014, 11:35:30 AM

to nod...@googlegroups.com

Yes, unless there already was a replacement character in the input.
If you want to be sure, you can use
StringDecoder#detectIncompleteChar(). It's not documented but it
takes a buffer as its argument:

var buffer = /* ... */;
var StringDecoder = require('string_decoder').StringDecoder;
var dec = new StringDecoder('utf8');
dec.detectIncompleteChar(buffer);
if (dec.charReceived < dec.charLength) {
  // Partial character sequence.
}

You can also implement the algorithm yourself if you don't want to
depend on an undocumented function. UTF-8 is a self-synchronizing
variable-length encoding; you can figure out the length of the character by
looking at the last one to three bytes. In a nutshell:

1. If (c & 0xC0) < 0x80, then it's a single-byte character.

2. If (c & 0xC0) == 0xC0, then it's the start of a multi-byte character.
You can figure out its length by looking at the other bits.

3. If (c & 0xC0) == 0x80, then it's part of a multi-byte character.
Backtrack until you find a byte that satisfies criterion 2 (but don't
backtrack more than three bytes.)
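
The three criteria above can be sketched roughly as follows. `incompleteTailLength` and `toStringDroppingPartialChar` are hypothetical names, not Node.js APIs, and the sketch assumes the input is otherwise valid UTF-8 (it does not reject bad lead bytes such as 0xC0):

```javascript
// Returns how many bytes at the end of `buf` belong to a UTF-8 character
// that was cut off, or 0 if the buffer ends on a character boundary.
function incompleteTailLength(buf) {
  // Backtrack at most three bytes from the end.
  const stop = Math.max(0, buf.length - 3);
  for (let i = buf.length - 1; i >= stop; i--) {
    const c = buf[i];
    if ((c & 0xC0) === 0x80) continue; // criterion 3: continuation byte
    if ((c & 0xC0) !== 0xC0) return 0; // criterion 1: single-byte character
    // Criterion 2: lead byte. The count of leading 1 bits is the length.
    const needed = (c & 0xE0) === 0xC0 ? 2
                 : (c & 0xF0) === 0xE0 ? 3
                 : (c & 0xF8) === 0xF0 ? 4
                 : 0; // 0xF8..0xFF never start a valid sequence
    const have = buf.length - i;
    return needed > have ? have : 0;
  }
  // No lead byte within three bytes: a complete 4-byte character,
  // an empty buffer, or invalid input.
  return 0;
}

// Decode `buf`, silently dropping a partial character at the end.
function toStringDroppingPartialChar(buf) {
  const n = incompleteTailLength(buf);
  return buf.subarray(0, buf.length - n).toString('utf8');
}

const sample = Buffer.from('ab€', 'utf8'); // 61 62 E2 82 AC
console.log(toStringDroppingPartialChar(sample.subarray(0, 4))); // prints ab
```

Unlike stripping a trailing \uFFFD from the decoded string, this works on the bytes, so a replacement character that was genuinely in the input is left alone.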

Mark Hahn

Sep 9, 2014, 2:09:27 AM

to nod...@googlegroups.com

Thanks everyone for the detailed answers. I almost feel like I understand this stuff now.

I wonder if there could have been an easier way to handle universal characters than unicode. It seems overly complex.
