Unicode

Rasmus Andersson

Feb 24, 2010, 3:48:01 PM
to nod...@googlegroups.com
On Wed, Feb 24, 2010 at 19:14, inimino <ini...@inimino.org> wrote:
> On 2010-02-24 10:50, Rasmus Andersson wrote:
>> How about a binary type? In my experience, fiddling with strings of
>> bytes is more common than working with unicode (unicode is normally
>> just passed around in an application). A binary type would also use
>> half the amount of memory in most situations and is faster to process
>> (no UTF-8 checks).
>
> Sure, this can be built on the Buffers in the net2 branch, once that
> lands (or if they are backported).  See the CommonJS list and wiki for
> the many Binary API proposals (Binary/B has a few implementations
> already in other SSJS projects).
>

> The nicest and cleanest API for node users would be to have binary
> buffers and all I/O just deals with buffers of bytes, and then a
> collection of conversion routines between all the various encodings.

Agreed. Still, such a solution would introduce complexity, when one of
the reasons for using node is that it's dead-simple.
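
Just to sketch what that could look like from user code -- the decode()
and encode() names below are made up for illustration, not anything node
actually has:

  // Hypothetical design: every read gives raw bytes, every write takes raw bytes.
  var bytes = readSomeInput();             // made-up helper returning a byte buffer
  var text  = decode(bytes, 'utf-8');      // bytes -> JS string, only when you want text
  writeSomeOutput(encode(text, 'utf-8'));  // JS string -> bytes again before any I/O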

>
> The "binary" and "ascii" and "utf-8" encoding arguments can just go
> away at that point and everything that does I/O will just deal with
> raw bytes.
>
>> - V8 represents a character as a 16-bit unsigned integer in UTF-16
>> (not UTF-8).
>
> ...as mandated by ECMAScript.  Every engine does this, not just V8.

Correct. However, the ECMA-262, 5th ed. is a bit fuzzy on the details
around unicode text. This is what I could find:

"[...] the phrase “code unit” and the word “character” will be used to
refer to a 16-bit unsigned value used to represent a single 16-bit
unit of text. The phrase “Unicode character” will be used to refer to
the abstract linguistic or typographical unit represented by a single
Unicode scalar value (which may be longer than 16 bits and thus may be
represented by more than one code unit). The phrase “code point”
refers to such a Unicode scalar value. “Unicode character” only refers
to entities represented by single Unicode scalar values: the
components of a combining character sequence are still individual
“Unicode characters,” even though a user might think of the whole
sequence as a single character."
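
In other words, a character outside the Basic Multilingual Plane is a
single "Unicode character" but two code units:

  var s = "\uD834\uDD1E";  // U+1D11E MUSICAL SYMBOL G CLEF, stored as a surrogate pair
  s.length;                // 2     -- two code units
  s.charCodeAt(0);         // 55348 -- 0xD834, the high surrogate
  s.charCodeAt(1);         // 56606 -- 0xDD1E, the low surrogate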

>
>> - If you in node specify a string as being encoded in UTF-8 node does
>> _not_ convert the string nor interpret it as UTF-8, but instead as
>> UTF-16.
>
> Not quite.  If you are reading data, it will take UTF-8 input
> and give you an ordinary JavaScript string containing those
> characters.  If you are writing data it will take an ordinary
> JavaScript string, and convert those characters to UTF-8 and
> write those bytes.  If you are using the "utf-8" encoding, you
> are going to be dealing with UTF-8 data coming in or going out
> of node and there is a conversion happening.

Oh, my bad then. I have not investigated that part of the source.

>
>> So maybe renaming "utf-8" to "utf-16" (or "ucs-2") everywhere in node
>> where no utf-8 specific encoding/decoding/interpretation is done.
>
> There isn't any such place where "utf-8" is used that I'm aware
> of.  The only encoding in node that is completely free from
> charset-related considerations is "binary".

As you mention above, some functions take an "encoding" argument which
can, among other things, be the string "utf-8", in which case (if I
understand you correctly) node decomposes the (source) UTF-16 string,
producing an actual UTF-8 sequence.
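
So e.g. the euro sign U+20AC is a single 16-bit code unit inside the
JavaScript string, but comes out as three bytes (0xE2 0x82 0xAC) once
node has encoded it as UTF-8. Roughly, sketched against a net2-style
Buffer (the exact constructor spelling may differ):

  var s = "\u20AC";                 // one code unit in the UTF-16 string
  var buf = new Buffer(s, 'utf8');  // the UTF-16 -> UTF-8 conversion happens here
  buf.length;                       // 3 -- the bytes 0xE2 0x82 0xAC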

inimino

Feb 24, 2010, 5:24:05 PM
to nod...@googlegroups.com
On 2010-02-24 13:48, Rasmus Andersson wrote:
> On Wed, Feb 24, 2010 at 19:14, inimino <ini...@inimino.org> wrote:
>> The nicest and cleanest API for node users would be to have binary
>> buffers and all I/O just deals with buffers of bytes, and then a
>> collection of conversion routines between all the various encodings.
>
> Agreed. Still, such a solution would introduce complexity, when one of
> the reasons for using node is that it's dead-simple.

Simple and fast. Using UTF-8 everywhere by default would be simpler
for users (until they have to deal with something other than UTF-8,
like serving static files, at which point encoding issues come up
anyway) but would come with a performance penalty over binary (or
ASCII, which is special-cased in V8 and apparently is the fastest).

> Correct. However, the ECMA-262, 5th ed. is a bit fuzzy on the details
> around unicode text. This is what I could find:

Yes, there's some historical UCS-2 cruft there, and it's reflected
in the spec language which is not always crystal clear on what is
a character, what is a code unit, etc. This is cleaned up a lot in
the fifth edition from what was there in ES3. It's also reflected
in the APIs, which are mostly surrogate-agnostic and treat strings
as sequences of unsigned 16-bit values.
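
For example, the String methods index by code unit rather than by
character, so they will happily hand you half a surrogate pair, and a
combining sequence counts as several "characters":

  "\uD834\uDD1E".substring(0, 1);  // "\uD834" -- the high surrogate on its own
  "e\u0301".length;                // 2 -- 'e' plus COMBINING ACUTE ACCENT, one perceived glyph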

>>> So maybe renaming "utf-8" to "utf-16" (or "ucs-2") everywhere in node
>>> where no utf-8 specific encoding/decoding/interpretation is done.
>>
>> There isn't any such place where "utf-8" is used that I'm aware
>> of. The only encoding in node that is completely free from
>> charset-related considerations is "binary".
>
> As you mention above, some functions take an "encoding" argument which
> can, among other things, be the string "utf-8", in which case (if I
> understand you correctly) node decomposes the (source) UTF-16 string,
> producing an actual UTF-8 sequence.

Yes.

--
http://inimino.org/~inimino/blog/
