Encoding of data from ChildProcess output / the right way to convert binary into unicode


Will Conant

Feb 24, 2010, 1:30:07 AM
to nodejs
First, is it possible to control the encoding of data passed to the
"output" event on process.ChildProcess objects?

Second, what is the "right" way to convert from a binary string to a
unicode string? I've found a couple of hacks and one decent function
out there, but it seems like I ought to have access to something built-
in and fast.

Thanks in advance!

--
Will Conant

Ryan Dahl

Feb 24, 2010, 1:46:33 AM
to nod...@googlegroups.com
On Tue, Feb 23, 2010 at 10:30 PM, Will Conant <will....@gmail.com> wrote:
> First, is it possible to control the encoding of data passed to the
> "output" event on process.ChildProcess objects?

Not at the moment. It could be easily thrown in.

The enterprising hacker could add setOutputEncoding() and
setErrorEncoding() rather easily by following these hints:
http://github.com/ry/node/blob/df1c1e593f8aac17e6edd8aa1fe278893b2e5a39/src/node_net.cc#L361-388
http://github.com/ry/node/blob/df1c1e593f8aac17e6edd8aa1fe278893b2e5a39/src/node_net.cc#L109
http://github.com/ry/node/blob/df1c1e593f8aac17e6edd8aa1fe278893b2e5a39/src/node_child_process.cc#L232
http://github.com/ry/node/blob/df1c1e593f8aac17e6edd8aa1fe278893b2e5a39/src/node_child_process.cc#L36
And submit a patch to me.
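
From the JS side, the result might look something like this (a rough
sketch: it assumes the current 0.1.x-era process.createChildProcess()
and its "output" event, and setOutputEncoding() is the hypothetical
addition, mirroring setEncoding() on net streams):

var sys = require("sys");

var child = process.createChildProcess("ls", ["-l"]);
child.setOutputEncoding("utf8"); // hypothetical; today "output" gets binary strings
child.addListener("output", function (data) {
  if (data !== null) { // null signals that the stream has closed
    sys.puts(data);
  }
});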

> Second, what is the "right" way to convert from a binary string to a
> unicode string? I've found a couple of hacks and one decent function
> out there, but it seems like I ought to have access to something built-
> in and fast.

There isn't a good way right now.

The reason for these problems is that Node will likely be changing how
it does binary in the near future - to a Blob, Buffer-like thing -
instead of using strings.

Will Conant

Feb 24, 2010, 2:22:56 AM
to nodejs
On Feb 23, 11:46 pm, Ryan Dahl <coldredle...@gmail.com> wrote:

> > Second, what is the "right" way to convert from a binary string to a
> > unicode string? I've found a couple of hacks and one decent function
> > out there, but it seems like I ought to have access to something built-
> > in and fast.
>
> There isn't a good way right now.
>
> The reason for these problems is that Node will likely be changing how
> it does binary in the near future - to a Blob, Buffer-like thing -
> instead of using strings.

That makes sense. In the meantime, for anyone who runs into this
question, here's the script I found for handling it:
http://www.webtoolkit.info/javascript-utf8.html
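
The core of that conversion, decoding a node "binary" string (one byte
per char code) into a proper JavaScript string, looks roughly like this
(a minimal sketch along the same lines as that script; it handles 1- to
3-byte UTF-8 sequences only, with no surrogate pairs and no error
handling):

// Decode a byte-per-charCode "binary" string as UTF-8.
function utf8Decode(bytes) {
  var out = "", i = 0, c, c2, c3;
  while (i < bytes.length) {
    c = bytes.charCodeAt(i);
    if (c < 0x80) {
      // 0xxxxxxx: plain ASCII
      out += String.fromCharCode(c);
      i += 1;
    } else if (c < 0xE0) {
      // 110xxxxx 10xxxxxx: two-byte sequence
      c2 = bytes.charCodeAt(i + 1);
      out += String.fromCharCode(((c & 0x1F) << 6) | (c2 & 0x3F));
      i += 2;
    } else {
      // 1110xxxx 10xxxxxx 10xxxxxx: three-byte sequence
      c2 = bytes.charCodeAt(i + 1);
      c3 = bytes.charCodeAt(i + 2);
      out += String.fromCharCode(
        ((c & 0x0F) << 12) | ((c2 & 0x3F) << 6) | (c3 & 0x3F));
      i += 3;
    }
  }
  return out;
}

utf8Decode(String.fromCharCode(0xC3, 0xA5)); // => "å"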

Rasmus Andersson

Feb 24, 2010, 12:50:07 PM
to nod...@googlegroups.com
On Wed, Feb 24, 2010 at 07:46, Ryan Dahl <coldre...@gmail.com> wrote:
> On Tue, Feb 23, 2010 at 10:30 PM, Will Conant <will....@gmail.com> wrote:
>> First, is it possible to control the encoding of data passed to the
>> "output" event on process.ChildProcess objects?
>
> Not at the moment. It could be easily thrown in.

How about a binary type? In my experience, fiddling with strings of
bytes is more common than fiddling with unicode (unicode is normally
just passed around in an application). A binary type would also use
half the memory in most situations and be faster to process (no UTF-8
checks).

var s = new Bytes("regular unicode string input")
(new Bytes("åbc")).length => 4 // "å" is 2 bytes
var s = new Bytes([12, 0, 194])
s += "abc"

Or maybe it's just overkill and few people would understand when to use it.

What I've understood...

- V8 represents a character as a 16-bit unsigned integer, i.e. UTF-16
(not UTF-8).
- For the "ascii" and "binary" encodings in node, only the low 8 bits
of each 16-bit character are used.
- If you specify in node that a string is encoded as UTF-8, node does
_not_ convert the string nor interpret it as UTF-8, but instead as
UTF-16.

This could lead to false assumptions by programmers who think they
are actually dealing with UTF-8 (e.g. put in a beyond-BMP sequence and
try to operate on it inside node, which would yield "weird" results).

So maybe "utf-8" should be renamed to "utf-16" (or "ucs-2") everywhere
in node where no UTF-8-specific encoding/decoding/interpretation is
done.

Further note, cited from the v8-users list:

"[...] you can't generally tell whether a program will behave
correctly under UCS-2. For instance, consider this program:

var dci = String.fromCharCode(0xD801) + String.fromCharCode(0xDC00);
var dli = dci.toLowerCase();
print(dci == dli);

(dci is a deseret capital I, represented by a surrogate pair). Under
UCS-2 this program prints true, under UTF-16 it prints false.
Programs like this cannot be detected reliably."
http://www.mail-archive.com/v8-u...@googlegroups.com/msg00355.html

inimino

Feb 24, 2010, 1:14:46 PM
to nod...@googlegroups.com
On 2010-02-24 10:50, Rasmus Andersson wrote:
> How about a binary type? In my experience, fiddling with strings of
> bytes are more common than unicode (as unicode is normally just passed
> around in an application). A binary type would also use half the
> amount of memory in most situations and is faster to process (no UTF-8
> checks).

Sure, this can be built on the Buffers in the net2 branch, once that
lands (or if they are backported). See the CommonJS list and wiki for
the many Binary API proposals (Binary/B has a few implementations
already in other SSJS projects).

The nicest and cleanest API for node users would be to have binary
buffers and all I/O just deals with buffers of bytes, and then a
collection of conversion routines between all the various encodings.

The "binary" and "ascii" and "utf-8" encoding arguments can just go
away at that point and everything that does I/O will just deal with
raw bytes.
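
In other words, something shaped like this (names invented here purely
for illustration; no such API exists in node yet):

// All I/O traffics in raw byte buffers; encodings become explicit
// conversion calls rather than per-stream modes.
var buf = stream.read(); // a buffer of raw bytes
var text = decode(buf, "utf-8"); // bytes -> JS string
var bytes = encode(text, "iso-8859-1"); // JS string -> bytes
stream.write(bytes);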

> - V8 represents a character as a 16-bit unsigned integer in UTF-16
> (not UTF-8).

...as mandated by ECMAScript. Every engine does this, not just V8.

> - If you in node specify a string as being encoded in UTF-8 node does
> _not_ convert the string nor interpret it as UTF-8, but instead as
> UTF-16.

Not quite. If you are reading data, it will take UTF-8 input
and give you an ordinary JavaScript string containing those
characters. If you are writing data it will take an ordinary
JavaScript string, and convert those characters to UTF-8 and
write those bytes. If you are using the "utf-8" encoding, you
are going to be dealing with UTF-8 data coming in or going out
of node and there is a conversion happening.
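
Concretely, "å" (U+00E5) is a single JavaScript character but two bytes
in UTF-8, and the "utf-8" encoding converts between the two:

var s = "å";
s.length; // => 1 (one UTF-16 code unit)
s.charCodeAt(0); // => 229 (0xE5)
// written with encoding "utf-8", the bytes on the wire are 0xC3 0xA5;
// reading those two bytes back with "utf-8" yields the same one-character string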

> So maybe renaming "utf-8" to "utf-16" (or "ucs-2") everywhere in node
> where no utf-8 specific encoding/decoding/interpretation is done.

There isn't any such place where "utf-8" is used that I'm aware
of. The only encoding in node that is completely free from
charset-related considerations is "binary".
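
With "binary", each byte maps straight to one char code 0-255 and back,
with nothing decoded. For example:

var raw = String.fromCharCode(0xC3, 0xA5); // the two UTF-8 bytes of "å", held raw
raw.length; // => 2 (two "characters", i.e. two bytes)
raw.charCodeAt(0); // => 195 (0xC3, untouched)
// writing raw with encoding "binary" emits exactly those two bytes again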

--
http://inimino.org/~inimino/blog/
