> The nicest and cleanest API for node users would be to have binary
> buffers and all I/O just deals with buffers of bytes, and then a
> collection of conversion routines between all the various encodings.
Agreed. Still, such a solution would add complexity, and one of the
reasons for using node is that it's dead-simple.
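Concretely, the buffer-centric model would look something like this.
(A rough sketch only; Buffer.from and buf.toString are the spellings
in current node, older releases used new Buffer(...) instead.)

    // All I/O hands you raw bytes; no encoding argument on the call.
    var bytes = Buffer.from([0xe2, 0x82, 0xac]);      // three bytes

    // Explicit conversion routines between bytes and strings:
    var text = bytes.toString('utf8');       // "€", one JS character
    var back = Buffer.from(text, 'utf8');    // the same three bytes

    console.log(bytes.length, text.length, back.length);  // 3 1 3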
>
> The "binary" and "ascii" and "utf-8" encoding arguments can just go
> away at that point and everything that does I/O will just deal with
> raw bytes.
>
>> - V8 represents a character as a 16-bit unsigned integer in UTF-16
>> (not UTF-8).
>
> ...as mandated by ECMAScript. Every engine does this, not just V8.
Correct. However, ECMA-262, 5th ed. is a bit fuzzy on the details
around Unicode text. This is what I could find:
"[...] the phrase “code unit” and the word “character” will be used to
refer to a 16-bit unsigned value used to represent a single 16-bit
unit of text. The phrase “Unicode character” will be used to refer to
the abstract linguistic or typographical unit represented by a single
Unicode scalar value (which may be longer than 16 bits and thus may be
represented by more than one code unit). The phrase “code point”
refers to such a Unicode scalar value. “Unicode character” only refers
to entities represented by single Unicode scalar values: the
components of a combining character sequence are still individual
“Unicode characters,” even though a user might think of the whole
sequence as a single character."
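To make the distinction concrete: a character outside the Basic
Multilingual Plane occupies two code units, and the ES5 string
functions count and index those units, not Unicode characters
(plain ES5, nothing node-specific):

    var s = "\uD834\uDF06";        // U+1D306, a single Unicode character
    s.length;                      // 2 -- two 16-bit code units (a surrogate pair)
    s.charCodeAt(0).toString(16);  // "d834" -- high surrogate, not a scalar value
    s.charCodeAt(1).toString(16);  // "df06" -- low surrogate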
>
>> - If you in node specify a string as being encoded in UTF-8 node does
>> _not_ convert the string nor interpret it as UTF-8, but instead as
>> UTF-16.
>
> Not quite. If you are reading data, it will take UTF-8 input
> and give you an ordinary JavaScript string containing those
> characters. If you are writing data it will take an ordinary
> JavaScript string, and convert those characters to UTF-8 and
> write those bytes. If you are using the "utf-8" encoding, you
> are going to be dealing with UTF-8 data coming in or going out
> of node and there is a conversion happening.
Oh, my bad then. I have not investigated that part of the source.
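For reference, the conversion sits right at the I/O boundary. A sketch
against the current fs API (the filename here is made up):

    var fs = require('fs');

    // With an encoding, node decodes the UTF-8 bytes on disk into an
    // ordinary JavaScript (UTF-16) string for you:
    fs.readFile('example.txt', 'utf8', function (err, str) {
      if (err) throw err;
      // str is a normal JS string; UTF-8 -> UTF-16 already happened.
    });

    // Without an encoding you get the raw bytes and convert yourself:
    fs.readFile('example.txt', function (err, buf) {
      if (err) throw err;
      var str = buf.toString('utf8');   // explicit conversion
    });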
>
>> So maybe renaming "utf-8" to "utf-16" (or "ucs-2") everywhere in node
>> where no utf-8 specific encoding/decoding/interpretation is done.
>
> There isn't any such place where "utf-8" is used that I'm aware
> of. The only encoding in node that is completely free from
> charset-related considerations is "binary".
As you mention above, some functions take an "encoding" argument which
can, among others, be the string "utf-8", in which case (if I understand
you correctly) node re-encodes the (source) UTF-16 string, producing an
actual UTF-8 byte sequence.
Simple and fast. Using UTF-8 everywhere by default would be simpler
for users (until they have to deal with something other than UTF-8,
like serving static files, at which point encoding issues come up
anyway) but would come with a performance penalty over binary (or
ASCII, which is special-cased in V8 and apparently is the fastest).
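To put numbers on the difference between the encodings (the sample
string is made up; Buffer.byteLength and Buffer.from are current node
API, and 'binary' here means a plain one-byte-per-code-unit copy):

    var s = 'héllo wörld';

    Buffer.byteLength(s, 'utf8');    // 13 -- é and ö take two bytes each
    Buffer.byteLength(s, 'binary');  // 11 -- one byte per code unit
    Buffer.byteLength(s, 'ascii');   // 11 -- same, only safe for 7-bit data

    // The conversion itself: 'utf8' has to inspect every code unit,
    // while 'binary'/'ascii' can copy the low byte straight through.
    var utf8Bytes = Buffer.from(s, 'utf8');
    var rawBytes  = Buffer.from(s, 'binary');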
> Correct. However, ECMA-262, 5th ed. is a bit fuzzy on the details
> around Unicode text. This is what I could find:
Yes, there's some historical UCS-2 cruft there, and it's reflected
in the spec language which is not always crystal clear on what is
a character, what is a code unit, etc. This is cleaned up a lot in
the fifth edition from what was there in ES3. It's also reflected
in the APIs which are mostly surrogate-agnostic and treat strings
as sequences of unsigned 16-bit values.
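To make "surrogate-agnostic" concrete: the string functions neither
validate nor pair up surrogates, they just pass 16-bit units through
(plain ES5):

    var lone = String.fromCharCode(0xd834);  // a lone high surrogate; a legal JS string
    lone.length;                             // 1
    lone.charCodeAt(0).toString(16);         // "d834"

    // Pairing it with a low surrogate yields a valid character again,
    // but length still counts code units, not Unicode characters:
    (lone + String.fromCharCode(0xdf06)).length;   // 2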
>>> So maybe renaming "utf-8" to "utf-16" (or "ucs-2") everywhere in node
>>> where no utf-8 specific encoding/decoding/interpretation is done.
>>
>> There isn't any such place where "utf-8" is used that I'm aware
>> of. The only encoding in node that is completely free from
>> charset-related considerations is "binary".
>
> As you mention above, some functions take an "encoding" argument which
> can, among others, be the string "utf-8", in which case (if I understand
> you correctly) node re-encodes the (source) UTF-16 string, producing an
> actual UTF-8 byte sequence.
Yes.