Binary/E


Kris Kowal

Feb 21, 2010, 7:56:58 AM
to comm...@googlegroups.com
I sure hope you guys like this one. This is what I perceive to be
close to truly minimal, without being crippled. It's based on
Binary/E (which is based on Binary/D (which is based on Binary/B)),
but makes some accommodations for what I saw in NodeJS's net2 branch
(buffers that share allocation), renames ByteArray to Buffer, and
makes the new Buffer type have an immutable length. Unlike
Binary/Lite, it retains the charset encoding and decoding behaviors; I
think these are important, and I think we should retain the overloaded
constructors.

http://wiki.commonjs.org/wiki/Binary/F

Kris Kowal

Ash Berlin

Feb 21, 2010, 9:46:11 AM
to comm...@googlegroups.com

(Subject updated to match link)

Couple of small niggles:

toSource returns something not accepted by the constructor - specifically an array of numbers

One other thing is what about negative values for range, slice etc. Some of the Array methods accept negative values to count backwards from the end. For example:

> [4,5,6,7].slice(-3)
[5, 6, 7]

We just need to decide and explicitly state whether these are supported or not. From what I recall, most Array methods support them, at least on SpiderMonkey.
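As a minimal sketch of what "supporting negative values" could mean, the following assumes the spec simply mirrors Array.prototype.slice: a negative index counts back from the end and the result is clamped to the valid range. The helper names `normalizeIndex` and `slice` are hypothetical, and plain Arrays stand in for the proposed Buffer type.

```javascript
// Hypothetical helper: resolve an optional, possibly negative index
// against a length, the way Array.prototype.slice does.
function normalizeIndex(index, length, fallback) {
    if (index === undefined) return fallback;
    if (index < 0) index += length;              // count back from the end
    return Math.min(Math.max(index, 0), length); // clamp into [0, length]
}

// Hypothetical slice over a plain Array of byte values.
function slice(bytes, start, stop) {
    var from = normalizeIndex(start, bytes.length, 0);
    var to = normalizeIndex(stop, bytes.length, bytes.length);
    return bytes.slice(from, Math.max(from, to));
}

slice([4, 5, 6, 7], -3); // [5, 6, 7], same as the Array example above
```

Whether out-of-range values should be clamped like this or rejected is exactly the kind of thing the spec would have to state.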

[[Put]] should probably throw a RangeError instead of ValueError?

From what I remember from many months ago, this is fairly similar to the first Blob class we had in flusspferd. Using this proposal, if you do need to grow a blob or concatenate two blobs together, I guess you create a new blob and use copy/copyFrom to do it. I might be happy with this -- will ponder.

-ash

Kris Kowal

Feb 21, 2010, 3:05:14 PM
to comm...@googlegroups.com
On Sun, Feb 21, 2010 at 6:46 AM, Ash Berlin
<ash_flu...@firemirror.com> wrote:

> Couple of small niggles:
> toSource returns something not accepted by the constructor - specifically an array of numbers

This poses a fascinating philosophical question: whether to put the
array constructor form back or to change the source representation to:

require("binary").Buffer(3).copyFrom([1, 2, 3])

I think I'll put the Array constructor back.

> One other thing is what about negative values for range, slice etc. Some of the Array methods accept negative values to count backwards from the end. For example:
>
>> [4,5,6,7].slice(-3)
> [5, 6, 7]

I'll add verbiage for this.

> [[Put]] should probably throw a RangeError instead of ValueError?

Sure.

> From what i remember many months ago, this is fairly similar to the first Blob class we had in flusspferd. Using this proposal if you do need to grow or concatenate two blobs together i guess you create a new blob and use copy/copyFrom to do it. I might be happy with this -- will ponder.

One nice thing about this proposal is that we can build much nicer UI
on top of it mostly in pure JavaScript, while having this relatively
easily implemented subset at the embedding layer.

Kris Kowal

Kris Kowal

Feb 21, 2010, 4:36:47 PM
to comm...@googlegroups.com
On Sun, Feb 21, 2010 at 12:05 PM, Kris Kowal <cowber...@gmail.com> wrote:
> On Sun, Feb 21, 2010 at 6:46 AM, Ash Berlin
>> Couple of small niggles:
>> toSource returns something not accepted by the constructor - specifically an array of numbers
> I think I'll put the Array constructor back.

I have put the Array constructor back.

>> One other thing is what about negative values for range, slice etc. Some of the Array methods accept negative values to count backwards from the end. For example:
>>> [4,5,6,7].slice(-3)
>> [5, 6, 7]
>
> I'll add verbiage for this.

Also, this is now done.

>> [[Put]] should probably throw a RangeError instead of ValueError?
> Sure.

Also done.

I also did a pass on copy editing, formatting, and made a bunch of
things more explicit.

http://wiki.commonjs.org/wiki/Binary/F

Kris Kowal

Ondřej Žára

Feb 22, 2010, 4:04:40 AM
to comm...@googlegroups.com
Hi Kris,

good job on this proposal! It is my current personal favorite.

Three remarks:

1) what is the deal with "views" (buffers that share allocation with another buffer)? Is there some truly frequent usage scenario that I am just missing?

2) the copy constructor form is a bit problematic if you want a 1:1 view on another buffer:

var clone = new Buffer(source, 0, source.length, false); // second and third arguments are redundant...

Maybe it would be a good idea to completely *remove* the 4th argument ( => always copy) and use only the "source.range()" instead?

3) the .Content property... what is its purpose? Or, more specifically, can you please show some usage scenario?



Thanks,
Ondrej






--
You received this message because you are subscribed to the Google Groups "CommonJS" group.
To post to this group, send email to comm...@googlegroups.com.
To unsubscribe from this group, send email to commonjs+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/commonjs?hl=en.


Kris Kowal

Feb 22, 2010, 10:06:30 PM
to comm...@googlegroups.com
On Mon, Feb 22, 2010 at 1:04 AM, Ondřej Žára <ondre...@gmail.com> wrote:
> 1) what is the deal with "views" (buffers that share allocation with other
> buffer)? Is there some truly frequent usage scenario that I am just missing?

This is an interesting one. One usage pattern I've definitely already
seen with Buffer is:

var buffer = Buffer(1024);
var actual = write(buffer, 0, 1024);
return buffer.range(0, actual);

This returns a buffer trimmed to the size of its actual content
without having to do an expensive reallocation and copy.

I'm not sure this is even the best example of using "range". I think
the general idea is that "range" avoids reallocation. This is
probably more important with this proposal than others since it does
not abstract copy-on-write semantics. This is very much a low-level
binary API that puts a lot of responsibility in pure JavaScript.
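To make the "trim without reallocating" pattern concrete, here is a sketch that uses a standard Uint8Array as a stand-in for the proposed Buffer, with `subarray` playing the role of "range". The `write` function is a hypothetical placeholder for whatever low-level call fills the buffer; it is not part of the proposal.

```javascript
// Hypothetical low-level write: fills `target` between start and stop
// and returns the number of bytes it actually wrote.
function write(target, start, stop) {
    var payload = [104, 105]; // the bytes of "hi"
    var count = Math.min(payload.length, stop - start);
    for (var i = 0; i < count; i++) {
        target[start + i] = payload[i];
    }
    return count;
}

function readTrimmed() {
    var buffer = new Uint8Array(1024);
    var actual = write(buffer, 0, 1024);
    // subarray returns a view: no new allocation, no copy
    return buffer.subarray(0, actual);
}

var trimmed = readTrimmed();
// trimmed.length is the written size, while trimmed.buffer is still
// the original 1024-byte allocation shared with `buffer`.
```

The cost of this design is that the full allocation stays alive for as long as any view on it does.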

> 2) the copy constructor form is a bit problematic if you want a 1:1 view on
> another buffer:
>
> var clone = new Buffer(source, 0, source.length, false); // second and third
> arguments are redundant...
>
> Maybe it would be a good idea to completely *remove* the 4th argument ( =>
> always copy) and use only the "source.range()" instead?

I don't think this is a problem. It's pretty clear that you can throw
"undefined" in for "start" and "stop" in which case they're inferred.
I think most people will use "range" and "slice", but the inspiring
implementation (Ryan Dahl's Buffer) uses the Buffer constructor as the
basis of these operations. We could leave this as an implementation
detail, but I want to encourage consistency in practice.

> 3) the .Content property... what is its purpose? Or, more specifically, can
> you please show some usage scenario?

I'll leave this one to Daniel Friesen to defend. In his proposal it
was called .contentConstructor and it had something to do with
duck-typing collections. Its usage with a single type is not obvious,
but with a variety of types that have different content types, it
could be useful. I'm in favor of the idea in any case, just in
principle. This feature opens the possibility of content-type
agnostic algorithms that would not be possible to discern from an
empty collection. That is, (typeof someCollection[0]) won't work with
an empty buffer. Since Buffer would be the first typed collection in
JavaScript, it's difficult to cite precedent. However, with WebGL,
there promise to be many different types of collection. Some metadata
can't hurt.

Kris Kowal

Ondřej Žára

Feb 23, 2010, 1:47:43 AM
to comm...@googlegroups.com
Okay, thanks for explanation.

I have one more question/feature request, that is closely related to Binary stuff. I recently wrote a JS EXIF parser (feel free to comment, <http://github.com/seznam/JAK/blob/master/util/exif.js>) for a non-commonjs environment. This is a good usage scenario for a Binary/Buffer data type; I realized that there was a very frequent need to read Short, Long etc. value from the buffer.

What is your opinion on adding a very generic reader method:

Buffer.prototype.toNumber = function (index, length, endianness) {};

?


O.

Kris Kowal

Feb 23, 2010, 2:03:21 AM
to comm...@googlegroups.com

I think this is a "cascading virtualization of unpack" feature. I
think we should postpone this; we can implement libraries for these
things and come back and talk about how to augment the Buffer type.
There are so many ways we could do this, I would rather see us hugging
and buying each other beer than protract this spec.

But, having said that, I've got some ideas. The problem we'll have
with this is that there are a lot of kinds of things you might want to
grab out of a buffer, lots of places to get them, lots of endianness,
lots of native/network byte order, lots of native alignment variation
and non-aligned, lots of widths, signed/unsigned, lots of formats,
lots of efficient array indexing and opaque struct dereferencing, lots
of everything. It would be super clumsy to have a whole bunch of
methods for this. One option is to have an unpacking DSL like
"unpack" from Perl, PHP, Ruby, Python and so on.

unpack(buffer, "Hb*")
buffer.unpack("Hb*")

We could have a "record" DSL for unpacking built on top of that.

buffer.unpack("H<a>b*<b>") => {
    "a": …,
    "b": …,
};

We could also do something elegant with orthogonal range selection and
unpacking. So, for your example:

buffer.slice(index, length).valueOf(endianness);

Buffer([0, 1, 2, 3, 4, 5]).slice(2, 6).valueOf(">");

In Python pack notation, "@" means native endianness, ">" and "<" are
big and little, and "!" is network just in case you forget that it's
the same as ">". I could also buy constants or "BE", and "LE" to
match the IANA charset suffixes.
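For illustration only, here is a sketch of the kind of reader being discussed, written as a free function over a plain Array of byte values. The name `toNumber` and its argument order come from Ondřej's suggestion; handling only unsigned integers, and only the ">" and "<" markers, is an assumption made to keep the sketch short.

```javascript
// Reads an unsigned integer of `length` bytes starting at `index`.
// ">" means big-endian, "<" means little-endian (Python pack notation).
// Results are exact for up to 6 bytes (below 2^53).
function toNumber(bytes, index, length, endianness) {
    var value = 0;
    for (var i = 0; i < length; i++) {
        // Always consume the most significant byte first.
        var offset = endianness === "<"
            ? index + length - 1 - i
            : index + i;
        value = value * 256 + bytes[offset];
    }
    return value;
}

toNumber([0, 1, 2, 3, 4, 5], 2, 4, ">"); // 0x02030405
toNumber([0, 1, 2, 3, 4, 5], 2, 4, "<"); // 0x05040302
```

Signed values, native byte order and alignment are left out, which is part of why a single method does not cover the design space.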

Note that "valueOf" is the idiomatic variant of "toNumber"; the
Number() constructor defers to "valueOf" internally. This is
symmetric to:

100..toString(2)

BUT…I think we should talk about this again for Binary/1.1 if we can
agree on something smaller first.

Kris Kowal

Kris Kowal

Feb 23, 2010, 2:04:23 AM
to comm...@googlegroups.com
On Mon, Feb 22, 2010 at 11:03 PM, Kris Kowal <cowber...@gmail.com> wrote:
>    buffer.slice(index, length).valueOf(endianness);

I should have used "range" instead of "slice" to illustrate two points
at once. Oops.

Kris Kowal

Daniel Friesen

Feb 23, 2010, 7:09:10 AM
to comm...@googlegroups.com
Kris Kowal wrote:
> On Mon, Feb 22, 2010 at 1:04 AM, Ondřej Žára <ondre...@gmail.com> wrote:
>
>> 1) what is the deal with "views" (buffers that share allocation with other
>> buffer)? Is there some truly frequent usage scenario that I am just missing?
>>
>
> This is an interesting one. One usage pattern I've definitely already
> seen with Buffer is:
>
> var buffer = Buffer(1024);
> var actual = write(buffer, 0, 1024);
> return buffer.range(0, actual);
>
> This returns a buffer trimmed to the size of its actual content
> without having to do an expensive reallocation and copy.
>
> I'm not sure this is even the best example of using "range". I think
> the general idea is that "range" avoids reallocation. This is
> probably more important with this proposal than others since it does
> not abstract copy-on-write semantics. This is very much a low-level
> binary API that puts a lot of responsibility in pure JavaScript.
>
For the record, when I came up with .range in Binary/C and IO/B/Buffer
(I'll just use Binary/C to refer to them together here) the problem
I was trying to solve was actually different from the problem .range is
solving in Binary/F (though it does work in the same use case, even
though that use case is almost gone).

In Binary/F it basically creates a buffer that operates on a specific
part of another already allocated buffer.

In Binary/C the issue being solved wasn't buffer allocation (the return
from .range wasn't even a Buffer, it was an OpaqueRange which had
implementation-specific semantics to let implementations choose what
technique worked best for them). The issue was the API of memcopy.
Both .splice and a .memcopy/.copy function have one issue: the argument
list. It's an unsightly list of mostly numbers that makes code tough to
understand as you read it. (I for one don't bother memorizing argument
lists consisting of a bunch of numbers and get tripped up when scanning
APIs that use them, and I expect other target programmers are the same; this
is JavaScript after all.)
.splice is alright with only two confusing numbers, but memcopy is
.copy(data, offset, length, [dataOffset]); in other words,
bufB.copy(bufA, 5, 10, 15);
The .splice API was already solved. You never had to touch
.splice to modify a *Buffer in an intuitive way; .append, .insert,
.replace, .remove, .fill, and .clear basically let you modify a buffer
in any way you need without resorting to things like
.splice(7, 0, [0,255]); /* .insert([0,255], 7); */ or
b.splice(b.length, 0, [0,255]); /* .append([0,255]); */, etc.
The problem was that copying a section of one buffer to another with
said API gave you two choices: A) use the nice and readable API,
at the cost of allocating a new Blob each time and discarding that
allocated memory right after; or B) use the memcopy API and its long
list of args, which aren't easy to read.
So I ended up thinking of .range. It returns an OpaqueRange which refers
to a part of another buffer temporarily and is intended to be discarded right
away (it doesn't really share the data in any way, it just knows where the
data is, and what portion of the data it points to).
So now you get the benefits of both A and B without the problems of
either; bufB.insert(bufA.range(5, 10), 7); /* Take the range of data
from index 5-10 (or that might be 5-15, we didn't get enough responses
to the show of hands to pick the api for .range) of buffer A, and insert
it into buffer B at index 7 using memcopy instead of allocating any
extra blobs. */
bufB.replace(bufA.range(5, 10), 7); would roughly be: bufB.copy(bufA, 7, 5, 5);
bufB.append(bufA.range(5, 10)); would roughly be: var l = bufB.length; bufB.length += 5; bufB.copy(bufA, l, 5, 5);
bufB.insert(bufA.range(5, 10), 7); would roughly be: var l = bufB.length; bufB.length += 5; bufB.copy(bufB, 7, l-7, 7+5); bufB.copy(bufA, 7, 5, 5);
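A rough sketch of the OpaqueRange idea described above, under stated assumptions: `range` and `replace` are hypothetical free functions rather than the Binary/C methods, Uint8Array stands in for the buffer type, and range(5, 10) is read as start and stop (the thread notes that start/stop versus start/length was never settled).

```javascript
// Hypothetical descriptor: it records where the data lives and which
// portion it covers, but allocates and copies nothing.
function range(source, start, stop) {
    return { source: source, start: start, stop: stop };
}

// Overwrites bytes of `target` starting at `offset` with the bytes
// the range points to, reading straight from the source buffer.
function replace(target, r, offset) {
    for (var i = 0; i < r.stop - r.start; i++) {
        target[offset + i] = r.source[r.start + i];
    }
    return target;
}

var bufA = new Uint8Array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]);
var bufB = new Uint8Array(16);
replace(bufB, range(bufA, 5, 10), 7); // readable call, no temporary blob
```

The call site keeps the readability of option A while doing only the single copy of option B.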

I won't be defending it in this instance. I've noted it before,
.contentConstructor's use cases almost completely disappear (at least
every single use case I can come up with; as well as any faint idea on
how it could be useful) when you remove it from Binary/C's abstract
text/binary symmetric API. And the use of .contentConstructor === Number
makes it even less useful.

And trying to mix .contentConstructor into all the different binary API
will likely not help. Extra binary API with .contentConstructor require
much more thought on how they would interact and what unexpected things
might happen. I put the overactive part of my brain to work when it came
to figuring out how .contentConstructor and other abstract parts of the
api would react to certain situations and how they would likely be
expected to react. I haven't put that into play trying to figure out the
theory of how various binary api would interact with .contentConstructor;
> Kris Kowal
>
Considering the various binary api efforts that are going on, the
existing binary objects on various platforms, how we bikeshed and come
up with varying permutations of a binary API which only varying portions
of the group take to, and the varying use cases we have; I'm taking more
to the idea of accepting that we will likely end up with more than one
binary api to deal with and instead promoting the use of patterns that
will be resistant to various binary systems being used through an app
and its libraries (ie: Being sure to use constructors to cast any data
passed to your library from outside the library to your binary system),
and instead collecting our use cases or goals into separate targets and
standardizing multiple (not that many) binary API that fit our separate
target cases.
I don't mind a universal lite api that'll work interoperably,
independent of the capabilities of the platform... But I also don't mind
the extra work of implementing a near-primitive (not as hard in Rhino as
in other engines) from a forward-thinking spec written with the hope
that it would become a future part of ES and implemented natively in
future versions of JavaScript engines.

~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]

Ryan Dahl

Feb 24, 2010, 1:56:48 PM
to comm...@googlegroups.com

What about writing a string into the buffer? I definitely need the
ability to take a string and write it, with a chosen encoding into the
buffer, at a given location - with a maximum length that it could
occupy. It should return the number of bytes written. It would be nice
if it didn't split characters...

Wes Garland

Feb 24, 2010, 2:03:59 PM
to comm...@googlegroups.com

> What about writing a string into the buffer? I definitely need the
> ability to take a string and write it, with a chosen encoding into the
> buffer, at a given location - with a maximum length that it could
> occupy. It should return the number of bytes written. It would be nice
> if it didn't split characters...

You could copy a new/temp Buffer into a view, although this would not allow the efficiency that, say, letting iconv operate directly on the output buffer would offer.


--
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102

Hannes Wallnoefer

Feb 24, 2010, 2:13:02 PM
to comm...@googlegroups.com
2010/2/24 Ryan Dahl <coldre...@gmail.com>:

Just wondering: what's the use case for this when you don't know how
much of the string (in characters, not bytes) was written to the
buffer?

Hannes

Ryan Dahl

Feb 24, 2010, 2:24:51 PM
to comm...@googlegroups.com
On Wed, Feb 24, 2010 at 11:13 AM, Hannes Wallnoefer <han...@gmail.com> wrote:
> 2010/2/24 Ryan Dahl <coldre...@gmail.com>:
>> On Sun, Feb 21, 2010 at 4:56 AM, Kris Kowal <kris....@cixar.com> wrote:
>>> I sure hope you guys like this one.  This is what I perceive to be
>>> close to truly minimal, without being crippled.  It's based on
>>> Binary/E (which is based on Binary/D (which is based on Binary/B)),
>>> but makes some accommodations for what I saw in NodeJS's net2 branch
>>> (buffers that share allocation), renames ByteArray to Buffer, and
>>> makes the new Buffer type have an immutable length.  Unlike
>>> Binary/Lite, it retains the charset encoding and decoding behaviors; I
>>> think these are important, and I think we should retain the overloaded
>>> constructors).
>>>
>>> http://wiki.commonjs.org/wiki/Binary/F
>>
>> What about writing a string into the buffer? I definitely need the
>> ability to take a string and write it, with a chosen encoding into the
>> buffer, at a given location - with a maximum length that it could
>> occupy. It should return the number of bytes written. It would be nice
>> if it didn't split characters...
>
> Just wondering: what's the use case for this when you don't know how
> much of the string (in characters, not bytes) was written to the
> buffer?

Yeah, I guess it should provide that too.

Kris Kowal

Feb 24, 2010, 2:51:49 PM
to comm...@googlegroups.com
On Wed, Feb 24, 2010 at 11:24 AM, Ryan Dahl <coldre...@gmail.com> wrote:
> On Wed, Feb 24, 2010 at 11:13 AM, Hannes Wallnoefer <han...@gmail.com> wrote:
>> 2010/2/24 Ryan Dahl <coldre...@gmail.com>:
>>> What about writing a string into the buffer? I definitely need the
>>> ability to take a string and write it, with a chosen encoding into the
>>> buffer, at a given location - with a maximum length that it could
>>> occupy. It should return the number of bytes written. It would be nice
>>> if it didn't split characters...
>>
>> Just wondering: what's the use case for this when you don't know how
>> much of the string (in characters, not bytes) was written to the
>> buffer?
>
> Yeah, I guess it should provide that too.

Alright, here's some potential verbiage:

; copyString(source String, String(charset), Number(start_opt),
Number(stop_opt), Number(sourceStart_opt), Number(sourceStop_opt))
[sourceStop Number, targetStop Number]
# Encodes as much as possible of a String in a given charset into this
buffer from "start" to "stop", using the source string from
"sourceStart" to "sourceStop", and returns the actual stop index of
the source string and of this, the target buffer.
# "start" is 0 if undefined or omitted.
# "stop" is this buffer's length if undefined or omitted.
# "sourceStart" is 0 if undefined or omitted.
# "sourceStop" is the source string's length if undefined or omitted.
# "charset" must be an IANA charset name.
## "copyString" must throw a ValueError if the given "charset" is not supported.
## The charsets "ascii", "utf-8", and "utf-16" must be supported.
## ''Note: the charset is not optional, there is no default.''
# Returns a duple Array with the actual "sourceStop" and "targetStop".
## The actual "sourceStop" is an index one past the last character
actually read.
## The actual "targetStop" is an index one past the last byte actually written.
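A sketch of how the verbiage above could behave, under several assumptions: it is a free function taking the target as its first argument rather than a Buffer method, it supports UTF-8 only (so the charset argument is dropped), and a Uint8Array stands in for the Buffer. It leans on the standard `TextEncoder.prototype.encodeInto`, which stops before a character that would not fit and reports how much was read and written.

```javascript
// Hypothetical UTF-8-only variant of copyString.
// start/stop count bytes in the target; sourceStart/sourceStop count
// UTF-16 code units in the source string.
function copyString(target, source, start, stop, sourceStart, sourceStop) {
    if (start === undefined) start = 0;
    if (stop === undefined) stop = target.length;
    if (sourceStart === undefined) sourceStart = 0;
    if (sourceStop === undefined) sourceStop = source.length;
    // encodeInto never writes a partial character into the destination.
    var result = new TextEncoder().encodeInto(
        source.slice(sourceStart, sourceStop),
        target.subarray(start, stop)
    );
    // Duple of the actual sourceStop and the actual targetStop.
    return [sourceStart + result.read, start + result.written];
}

var target = new Uint8Array(3);
var stops = copyString(target, "h\u00e9llo"); // "h" is 1 byte, "é" is 2
// To continue into another buffer, pass stops[0] as sourceStart.
```

Returning both indices is what allows the caller to resume from the right character in the next buffer.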

Kris Kowal

Ryan Dahl

Feb 24, 2010, 3:02:24 PM
to comm...@googlegroups.com

Yeah that seems reasonable. So, for clarity - `start`, `stop` are of
the octet unit. `sourceStart` and `sourceStop` are of the
character unit?

If it just returned the total number of octets written, that would be
sufficient?

Wes Garland

Feb 24, 2010, 3:04:36 PM
to comm...@googlegroups.com
"if the output is to be truncated by the buffer's size, truncation will happen on a character boundary, as defined by the target encoding"

This will avoid, for example, writing part of a 3-byte UTF8 sequence into a 2-byte buffer.

Now, the return value won't be able to tell you how to get the rest of the string.  In order to do that, you need a lot of nastiness, which is well expressed in Aristid's Encodings specification, IIRC.

Wes


Ryan Dahl

Feb 24, 2010, 3:05:02 PM
to comm...@googlegroups.com
On Sun, Feb 21, 2010 at 4:56 AM, Kris Kowal <kris....@cixar.com> wrote:

I'm not particularly thrilled that I would need to create a range to
slice out a string. If I've got a buffer containing a HTTP request, I
want to rip out a bunch of ascii encoded strings really quickly,
creating a range object for each field and value before
.toString('ascii') them is a bit of overhead.

Kris Kowal

Feb 24, 2010, 3:10:07 PM
to comm...@googlegroups.com
On Wed, Feb 24, 2010 at 12:02 PM, Ryan Dahl <coldre...@gmail.com> wrote:
> Yeah that seems reasonable. So, for clarity - `start`, `stop` are of
> the octet unit. `sourceStart` and and `sourceStop` are of the
> character unit?

Yeah, I'll make that more clear:

# "start" is 0 if undefined or omitted and counts bytes.
# "stop" is this buffer's length if undefined or omitted, counting in bytes.
# "sourceStart" is 0 if undefined or omitted, counting in characters.
# "sourceStop" is the source string's length if undefined or omitted,
counting in characters.

> If it just returned the total number of octets written, that would be
> sufficient?

In order to resume writing on another buffer, you would need the
string offset because with mixed-width encodings like UTF-*, the width
is not computable in terms of the bytes written. It would be possible
to compute in terms of UCS-*, but we're targeting the general case.

Kris Kowal

Wes Garland

Feb 24, 2010, 3:13:41 PM
to comm...@googlegroups.com
> If I've got a buffer containing a HTTP request, I
> want to rip out a bunch of ascii encoded strings really quickly,

FWIW, if *I* had a buffer containing HTTP requests, I might seriously want a type that could return an array of Strings or Buffers based on a separator -- similar to the BSD strsep() call, or something like this pseudo code:

var header = [];

header.backingStore = buf;  // provide GC root
// strtok and strlen stand in for the C library calls; this is pseudo code
for (s = strtok(buf, "\r\n"); s; s = strtok(NULL, "\r\n"))
  header.push(Buffer(s, strlen(s)));
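The pseudo code above could be approximated in plain JavaScript roughly as follows. This is a sketch with assumptions: a Uint8Array stands in for the buffer, `splitLines` is a hypothetical name, and unlike strtok it splits only on the exact CR LF pair and keeps empty lines. Each returned element is a view that shares the original allocation, so holding a view is what keeps the backing store alive.

```javascript
// Splits a byte buffer on CR LF and returns views, not copies.
function splitLines(bytes) {
    var lines = [];
    var start = 0;
    for (var i = 0; i + 1 < bytes.length; i++) {
        if (bytes[i] === 13 && bytes[i + 1] === 10) { // "\r\n"
            lines.push(bytes.subarray(start, i));
            start = i + 2;
            i++; // skip the "\n" that was just consumed
        }
    }
    if (start < bytes.length) {
        lines.push(bytes.subarray(start)); // trailing data without CR LF
    }
    return lines;
}

var request = new TextEncoder().encode("GET / HTTP/1.1\r\nHost: a\r\n\r\n");
var header = splitLines(request);
```

The empty view produced by the final blank line marks the end of the header section.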

Wes

Kris Kowal

Feb 24, 2010, 3:30:21 PM
to comm...@googlegroups.com

There are others here who would not be thrilled to have to provide a
bunch of positional arguments, but I think we can entertain both the
range().toString() case and yours by providing optional start,stop
range args to toString. Would that suffice?
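As a sketch of the positional form for the ASCII case, the following decodes a byte range directly without creating an intermediate range object. `asciiToString` is a hypothetical free function over any indexable sequence of bytes, and it assumes the bytes really are ASCII; it does not validate them.

```javascript
// Decodes bytes[start..stop) as ASCII with no intermediate view.
function asciiToString(bytes, start, stop) {
    if (start === undefined) start = 0;
    if (stop === undefined) stop = bytes.length;
    var text = "";
    for (var i = start; i < stop; i++) {
        text += String.fromCharCode(bytes[i]);
    }
    return text;
}

var line = [72, 111, 115, 116, 58, 32, 97]; // "Host: a"
var name = asciiToString(line, 0, 4);
var value = asciiToString(line, 6);
```

With both forms available, range().toString() stays the readable option and the positional arguments serve the hot path.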

Kris Kowal

Ryan Dahl

Feb 24, 2010, 3:31:36 PM
to comm...@googlegroups.com

That works.

Kris Kowal

Feb 24, 2010, 3:46:16 PM
to comm...@googlegroups.com

Ondřej Žára

Feb 24, 2010, 4:34:07 PM
to comm...@googlegroups.com


2010/2/24 Kris Kowal <cowber...@gmail.com>


Thanks for keeping this up-to-date; I will adjust my implementation tomorrow.

By the way, is there something like ValueError in standard JavaScript?



Ondrej



 

Kris Kowal

Feb 24, 2010, 5:07:48 PM
to comm...@googlegroups.com
On Wed, Feb 24, 2010 at 1:34 PM, Ondřej Žára <ondre...@gmail.com> wrote:
> 2010/2/24 Kris Kowal <cowber...@gmail.com>

> By the way, there is something like ValueError in standard javascript?

Oh, apparently there isn't (I've been assuming it was from my
experience elsewhere). Should we change the three references to
ValueError to RangeError or just Error?

Kris Kowal

Ondřej Žára

Feb 25, 2010, 2:18:29 AM
to comm...@googlegroups.com


2010/2/24 Kris Kowal <cowber...@gmail.com>


Error is very generic, but I do not see a better alternative here.


O.

 

Ryan Dahl

Feb 25, 2010, 2:59:50 PM
to comm...@googlegroups.com
Another thought, Kris. Suppose I have a Buffer which contains a UTF-8
string starting at position 0, but does not end inside the buffer.
(say, the next packet will contain the rest of the string) and suppose
that the buffer ends in the middle of a character. You say the
toString call

> Throws a RangeError if the buffer is malformed for the given character set,
> if a multi-byte character would be split across the stop boundary, or if any
> of the code points are out of the implementation's supported range.

This is difficult to recover from - I'll have to wait for the
next packet, concatenate them, and then call toString() again? Or I
suppose I could try again with .toString('utf8', 0, buffer.length -
1)? It would be nice to get the longest string possible, and then be
notified of how many octets remain.
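One way to get "the longest string possible" for UTF-8 specifically is sketched below. It assumes the buffer is otherwise well-formed UTF-8 and only its tail may be cut mid-character; `completeUtf8Length` and `readUtf8` are hypothetical names, and a Uint8Array stands in for the Buffer.

```javascript
// Returns how many leading bytes form complete UTF-8 characters,
// assuming only the final character may be truncated.
function completeUtf8Length(bytes) {
    // The lead byte of the last character is at most 3 bytes from the
    // end if that character is incomplete.
    for (var back = 1; back <= 3 && back <= bytes.length; back++) {
        var byte = bytes[bytes.length - back];
        if ((byte & 0xC0) === 0x80) continue; // continuation byte
        var needed = byte >= 0xF0 ? 4 : byte >= 0xE0 ? 3 : byte >= 0xC0 ? 2 : 1;
        return needed > back ? bytes.length - back : bytes.length;
    }
    return bytes.length;
}

// Decodes the complete prefix and reports how many octets remain.
function readUtf8(bytes) {
    var stop = completeUtf8Length(bytes);
    var text = new TextDecoder("utf-8").decode(bytes.subarray(0, stop));
    return [text, bytes.length - stop];
}

// "a" followed by the first two of the three bytes of "€"
var partial = readUtf8(new Uint8Array([0x61, 0xE2, 0x82]));
```

The remaining octets can then be prepended to the next packet before decoding again.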

Kris Kowal

Feb 25, 2010, 3:06:18 PM
to comm...@googlegroups.com
On Thu, Feb 25, 2010 at 11:59 AM, Ryan Dahl <coldre...@gmail.com> wrote:
>> Throws a RangeError if the buffer is malformed for the given character set,
>> if a multi-byte character would be split across the stop boundary, or if any
>> of the code points are out of the implementation's supported range.
>
> This isn't is difficult to recover from - I'll have to wait for the
> next packet, concatenate them, and then call toString() again? Or I
> suppose I could try again with .toString('utf8', 0, buffer.length -
> 1)? It would be nice to get the longest string possible, and then be
> notified of how many octets remain.

Aye aye. I'll look again.

Kris Kowal

Ash Berlin

Feb 25, 2010, 3:27:02 PM
to comm...@googlegroups.com

This is starting to get into the character of proper character set encoding. What happens for more complex charsets?

Kris Kowal

Feb 25, 2010, 3:35:35 PM
to comm...@googlegroups.com
On Thu, Feb 25, 2010 at 12:27 PM, Ash Berlin
<ash_flu...@firemirror.com> wrote:
> This is starting to get into the character of proper character set encoding. What happens for more complex charsets?

Can you describe some of these cases?

It's not my intent that this API should replace the encodings API; it
ought to be possible to provide the Buffer abstraction using the
encodings API. Please point out any ways that would not be possible.
Do BOMs complicate matters? Is there state that would need to be
passed into Buffer routines to properly decode or encode ranges of
certain character sets?

Kris Kowal

Kris Kowal

Feb 25, 2010, 3:44:42 PM
to comm...@googlegroups.com
On Thu, Feb 25, 2010 at 11:59 AM, Ryan Dahl <coldre...@gmail.com> wrote:
>> Throws a RangeError if the buffer is malformed for the given character set,
>> if a multi-byte character would be split across the stop boundary, or if any
>> of the code points are out of the implementation's supported range.
>
> This isn't is difficult to recover from - I'll have to wait for the
> next packet, concatenate them, and then call toString() again? Or I
> suppose I could try again with .toString('utf8', 0, buffer.length -
> 1)? It would be nice to get the longest string possible, and then be
> notified of how many octets remain.

Does this change address the issue adequately:

http://wiki.commonjs.org/index.php?title=Binary%2FF&diff=2135&oldid=2132

I changed "copyFromString" to "write" and created a corresponding
"read" that supports partial reads.

Kris Kowal

Ash Berlin

Feb 25, 2010, 3:51:49 PM
to comm...@googlegroups.com

First thing that comes to mind is ISO-2022-JP, which is a stateful encoding: there are two modes, ASCII and a katakana mode (I think it's called that anyway), and escape sequences to switch from one to the other.

The upshot of that is that you need stateful decoding if you want to deal with anything more than 'this is the entire string, please decode it' case.

I can see the need for a toString("utf-8") case, but I worry that we're asking for trouble/edge cases/special behaviour for anything other than convert all/or nothing.

Hannes Wallnoefer

Feb 25, 2010, 4:20:27 PM
to CommonJS
On Feb 25, 9:44 pm, Kris Kowal <cowbertvon...@gmail.com> wrote:

My feeling is that this whole partial decoding support is getting out
of hand and does not really belong in a binary API (it may make sense
as part of a dedicated encoding API). For example, the read and write
methods returning duple arrays are pretty specialized and low level
and will rarely be needed for application level programming.

I think we should keep the string encoding/decoding support in binary
simple. Just provide methods to create a binary from a string and vice
versa, and say that the result for invalid character codes is
undefined. Everything else should be supported for those who need it,
but in a specialized encoding API.

Hannes

> Kris Kowal

Ryan Dahl

Feb 25, 2010, 4:43:53 PM
to comm...@googlegroups.com
> On Feb 25, 9:44 pm, Kris Kowal <cowbertvon...@gmail.com> wrote:
> Does this change address the issue adequately:
>
> http://wiki.commonjs.org/index.php?title=Binary%2FF&diff=2135&oldid=2132
>
> I change "copyFromString" to "write" and created a corresponding
> "read" that supports partial reads.

I don't like that it returns an array, an extra, otherwise
unnecessary, object. That said, I can't think of a better alternative.

On Thu, Feb 25, 2010 at 1:20 PM, Hannes Wallnoefer <han...@gmail.com> wrote:
> I think we should keep the string encoding/decoding support in binary
> simple. Just provide methods to create a binary from a string and vice
> versa, and say that the result for invalid character codes is
> undefined. Everything else should be supported for those who need it,
> but in a specialized encoding API.

How would I decode a partial string then?

Hannes Wallnoefer

Feb 25, 2010, 4:54:12 PM
to comm...@googlegroups.com
2010/2/25 Ryan Dahl <coldre...@gmail.com>:

>
> On Thu, Feb 25, 2010 at 1:20 PM, Hannes Wallnoefer <han...@gmail.com> wrote:
>> I think we should keep the string encoding/decoding support in binary
>> simple. Just provide methods to create a binary from a string and vice
>> versa, and say that the result for invalid character codes is
>> undefined. Everything else should be supported for those who need it,
>> but in a specialized encoding API.
>
> How would I decode a partial string then?

There should be a method for decoding part of a buffer to a string
(like toString(charset, start, stop)). But IMO the responsibility for
proper character alignment and boundaries should be with the
developer, i.e. I'd prefer garbage in-garbage out behaviour to
throwing an Error.

Anything more fancy like support for character boundary detection
should go in a dedicated encoding API.

Hannes

Ash Berlin

Feb 25, 2010, 5:01:38 PM2/25/10
to comm...@googlegroups.com

On 25 Feb 2010, at 21:54, Hannes Wallnoefer wrote:

> 2010/2/25 Ryan Dahl <coldre...@gmail.com>:
>>
>> On Thu, Feb 25, 2010 at 1:20 PM, Hannes Wallnoefer <han...@gmail.com> wrote:
>>> I think we should keep the string encoding/decoding support in binary
>>> simple. Just provide methods to create a binary from a string and vice
>>> versa, and say that the result for invalid character codes is
>>> undefined. Everything else should be supported for those who need it,
>>> but in a specialized encoding API.
>>
>> How would I decode a partial string then?
>
> There should be a method for decoding part of a buffer to a string
> (like toString(charset, start, stop)). But IMO the responsibility for
> proper character alignment and boundaries should be with the
> developer, i.e. I'd prefer garbage in-garbage out behaviour to
> throwing an Error.

You can't get garbage with a partial character in a MBCS - you can't get anything for that character - it has to be an error condition. Is there a need for start,stop given the range method? Is .toString(cs, start, stop) not the same as .range(start,stop).toString(cs) ?
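
A minimal sketch of the two behaviours being debated, written in present-day JavaScript. Uint8Array stands in for the proposed Buffer, subarray for range, and TextDecoder for toString(charset); none of these are part of Binary/F, and toStringRange is a hypothetical helper. A lenient decoder substitutes U+FFFD for the cut-off character, while a strict one throws.

```javascript
// Stand-ins, not Binary/F: Uint8Array plays the Buffer, subarray plays range(),
// and TextDecoder plays toString(charset).
const bytes = new Uint8Array([0x61, 0xc3, 0xa9, 0x62]); // "a", "é" (2 bytes), "b"

const lenient = new TextDecoder("utf-8");                 // garbage in, garbage out
const strict = new TextDecoder("utf-8", { fatal: true }); // partial character is an error

// Hypothetical helper with the toString(charset, start, stop) shape.
function toStringRange(buffer, decoder, start, stop) {
  // subarray creates a view on the same allocation; no bytes are copied.
  return decoder.decode(buffer.subarray(start, stop));
}

// A range that ends on a character boundary decodes cleanly either way.
const whole = toStringRange(bytes, strict, 0, 3);

// A range that cuts "é" in half: the lenient decoder substitutes U+FFFD.
const garbage = toStringRange(bytes, lenient, 0, 2);

// The strict decoder throws on the same range.
let threw = false;
try {
  toStringRange(bytes, strict, 0, 2);
} catch (e) {
  threw = true;
}
```

Because subarray returns a view rather than a copy, the range-then-decode form costs one small wrapper object per call, which is the overhead a three-argument toString would avoid.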

>
> Anything more fancy like support for character boundary detection
> should go in a dedicated encoding API.
>
> Hannes

That's my view too, though.

Kris Kowal

Feb 25, 2010, 5:04:15 PM2/25/10
to comm...@googlegroups.com
On Thu, Feb 25, 2010 at 2:01 PM, Ash Berlin
<ash_flu...@firemirror.com> wrote:

> You can't get garbage with a partial character in a MBCS - you can't get anything for that character - it has to be an error condition. Is there a need for start,stop given the range method? Is .toString(cs, start, stop) not the same as .range(start,stop).toString(cs) ?

It's the same but faster. Ryan's all about optimization.

Kris Kowal

Ryan Dahl

Feb 25, 2010, 5:28:32 PM2/25/10
to comm...@googlegroups.com
On Thu, Feb 25, 2010 at 2:04 PM, Kris Kowal <cowber...@gmail.com> wrote:
> It's the same but faster.  Ryan's all about optimization.

This binary stuff is really hot code - something like toString() will
be called thousands, if not millions of times per second. It cannot
cut corners. If you can simplify it by, for example, not returning a
tuple and thus causing the GC to run less frequently - that can give
several percentage points in a simple web server benchmark.

Basically, we shouldn't prescribe this API. It needs to be
implemented, benched, and tinkered with. (That's not to say that the
proposal isn't useful - thinking through these things collectively
saves a lot of time.)

Ash Berlin

Feb 25, 2010, 7:18:57 PM2/25/10
to comm...@googlegroups.com

On 25 Feb 2010, at 22:28, Ryan Dahl wrote:

> On Thu, Feb 25, 2010 at 2:04 PM, Kris Kowal <cowber...@gmail.com> wrote:
>> It's the same but faster. Ryan's all about optimization.
>
> This binary stuff is really hot code - something like toString() will
> be called thousands, if not millions of times per second. It cannot
> cut corners. If you can simplify it by, for example, not returning a
> tuple and thus causing the GC to run less frequently - that can give
> several percentage points in a simple web server benchmark.


I can agree with that - but what's the case for split MBCS reads? Headers are ASCII only. Perhaps an IO/encoding layer could do this with a nicer API and still match your speed needs.

Basically something you pass an IO handle/emitter and it slurps in, manages what state is needed to do encoding, and emits the events with the data in the right charset? (It would only have to keep hold of at most a few bytes between events in any case that comes to mind right now). And what I have in mind wouldn't need to go out of C space before the final event if done right (I think).
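
A rough sketch of such a stateful layer, in present-day JavaScript. The class name ChunkDecoder and its push/end methods are hypothetical; the sketch relies on the streaming mode of TextDecoder, which keeps the bytes of an incomplete trailing character between calls.

```javascript
// Hypothetical wrapper: holds a streaming TextDecoder and emits only whole characters.
class ChunkDecoder {
  constructor(charset, onText) {
    this.decoder = new TextDecoder(charset);
    this.onText = onText;
  }
  // Feed one chunk; bytes of an incomplete trailing character stay buffered
  // inside the decoder until the next chunk arrives.
  push(chunk) {
    const text = this.decoder.decode(chunk, { stream: true });
    if (text.length > 0) this.onText(text);
  }
  // Flush at end of input; any leftover partial character becomes U+FFFD.
  end() {
    const text = this.decoder.decode();
    if (text.length > 0) this.onText(text);
  }
}

const pieces = [];
const chunkDecoder = new ChunkDecoder("utf-8", (text) => pieces.push(text));
chunkDecoder.push(new Uint8Array([0x61, 0xe2, 0x82])); // "a" plus the first two bytes of "€"
chunkDecoder.push(new Uint8Array([0xac, 0x62]));       // last byte of "€", then "b"
chunkDecoder.end();
```

The state carried between events is at most a few bytes, as suggested above. Switching character sets mid-stream would mean replacing the inner decoder, which this sketch does not attempt.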

Tho you'd need to be able to switch between ascii and utf8 or $char_set at almost any point. hmmm

This kind of interface might also be nice for transparently (un)compressing streams?

>
> Basically, we shouldn't prescribe this API. It needs to be
> implemented, benched, and tinkered with. (That's not to say that the
> proposal isn't useful - thinking through these things collectively
> saves a lot of time.)

-ash

Wes Garland

Feb 25, 2010, 7:22:05 PM2/25/10
to comm...@googlegroups.com
I am actually in complete agreement with both camps in this discussion, which I think shows that we really need *two* APIs.

The two use-cases are straightforward, in my mind:
1. Generic character set manipulation
2. High-speed Unicode processing

#1 has many pitfalls if we allow for incomplete decoding.
#2, not so many, as Unicode is stateless and you can observe from any leading position in the stream whether you can decode the next character or need more data.
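
For UTF-8, the observation in #2 can be sketched as below. Both function names are hypothetical. The sketch inspects lead bytes only: it does not validate continuation bytes and does not reject overlong or out-of-range lead bytes such as 0xC0.

```javascript
// Number of bytes in the UTF-8 sequence that starts with this lead byte,
// or 0 if the byte cannot start a sequence (for example a continuation byte).
function utf8SequenceLength(leadByte) {
  if (leadByte < 0x80) return 1;            // 0xxxxxxx
  if ((leadByte & 0xe0) === 0xc0) return 2; // 110xxxxx
  if ((leadByte & 0xf0) === 0xe0) return 3; // 1110xxxx
  if ((leadByte & 0xf8) === 0xf0) return 4; // 11110xxx
  return 0;
}

// Length in bytes of the longest prefix of `bytes` that ends on a character
// boundary. Only lead bytes are inspected.
function completePrefixLength(bytes) {
  let i = 0;
  while (i < bytes.length) {
    const n = utf8SequenceLength(bytes[i]);
    if (n === 0) throw new RangeError("not a UTF-8 lead byte at offset " + i);
    if (i + n > bytes.length) break; // need more data for this character
    i += n;
  }
  return i;
}

const truncated = new Uint8Array([0x61, 0xe2, 0x82]); // "a" plus 2 of the 3 bytes of "€"
const usable = completePrefixLength(truncated);
// How many more bytes the trailing character still needs.
const missing = utf8SequenceLength(truncated[usable]) - (truncated.length - usable);
```

No table or decoder state is involved: the lead byte alone says whether the next character is complete or more data is needed.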

Ryan is absolutely right about this being "Hot Code" -- at least #2 -- it needs to be fast as hell for his use-case.  I am pretty familiar with some of the mechanics going between UTF-16 and UTF-8 as well; I spent about 10 days last summer writing optimal UTF-8 to UTF-16 limited-state inline translation for SpiderMonkey's lexer.

What if we had a series of to/from Unicode routines that operate on partial strings (suitable for asynchronous processing) and another set of routines that support complete Strings only?

If I were implementing these, I would do #1 with iconv (like I did in Binary/B) and #2 I would implement directly. Inter-Unicode encodings are relatively easy (although detailed) and require very little overhead -- no look-up tables, incompatible characters, etc.

Kris Kowal

Feb 26, 2010, 5:45:48 PM2/26/10
to comm...@googlegroups.com
On Thu, Feb 25, 2010 at 4:22 PM, Wes Garland <w...@page.ca> wrote:
> What if we had a series of to/from Unicode routines that operate on
> partial strings (suitable for asynchronous processing) and another set of
> routines that support complete Strings only?

Could you make this recommendation more concrete? I would like to see
the signatures you are proposing.

Kris Kowal

Wes Garland

Mar 1, 2010, 2:39:08 PM3/1/10
to comm...@googlegroups.com
Hi, Kris!

Wow, this was significantly trickier than anticipated.

Design Notes
  • The basic idea is that the binary API needs
    • fast conversion between Unicode buffers and Strings, without forcing intermediary object allocation
    • Simple line-oriented character set encoding/decoding (iconv charsets)
    • More complicated charset work (non-Unicode byte streams) should be pushed into Encodings API
  • Observations:
    • All Unicode encodings can represent 100% of Unicode
    • It is possible to transcode "bad" Unicode from one encoding to another
    • Correctly transcoding between UTF-8, -16 and -32 can be implemented by an average programmer without difficulty or the need for libiconv
    • All Unicode encodings "know" how many code units are required to represent an entire character by looking at only the first code unit. This makes handling truncated sequences possible.
  • The reason I have added the optional Object o to some function signatures is to allow multiple out parameters without incurring new object construction overhead.
  • The way JavaScript Strings are treated in this specification fragment makes it possible and reasonable for implementations that have underlying UTF-8 Strings (like v8) to implement utf8-buffer -> String without any actual conversion.
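
The out-parameter idea can be sketched as follows. The function measure is hypothetical and exists only to show the calling convention; the property names mirror the o.encoded and o.used names used in the method descriptions.

```javascript
// Hypothetical function with two results, reported through a caller-supplied object.
// It counts how many UTF-16 code units of `string` fit in `limit` UTF-8 bytes.
function measure(string, limit, o) {
  if (o === undefined) o = {}; // allocate only when the caller did not pass one
  let used = 0;
  let encoded = 0;
  while (encoded < string.length) {
    const cp = string.codePointAt(encoded);
    const bytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    if (used + bytes > limit) break; // whole characters only
    used += bytes;
    encoded += cp > 0xffff ? 2 : 1;
  }
  o.encoded = encoded;
  o.used = used;
  return o;
}

// One result object is reused across calls, so measure() itself creates
// no result object per call.
const result = {};
const totals = [];
for (const s of ["abc", "h\u00e9", "\u20ac\u20ac"]) {
  measure(s, 4, result);
  totals.push(result.encoded + "/" + result.used);
}
```

Compared with returning a two-element array, the caller decides whether any allocation happens, which is the property the hot-path argument is after.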

The Unicode character sets, for the purposes of this specification are
 - UTF-8
 - UTF-16
 - UTF-32

UCS-4 will be accepted as an alias for UTF-32. UCS-2 and UTF-7 are not supported by the "Unicode" functions, although they may be recognized as non-special character sets.

In this specification, Strings will be considered to be UTF-16, encoded with the native byte order, with no leading BOM. This is equivalent to the iconv encoding UTF-16BE on Big-Endian machines and UTF-16LE on Little Endian machines. This consideration does not reflect actual implementation detail in the underlying engine, but rather the view offered to script.

For performance reasons, this specification does not require implementations to perform transcoding validation when converting between Unicode character sets; instead, it is acceptable to transcode invalid code points from one Unicode encoding to another. Implementations are, however, encouraged to provide a method for performing transcoding validation for at least debugging builds of the underlying platform.


Definitions

Encode:    to transform a String to a byte-oriented buffer
Decode:    to transform a byte-oriented buffer into a String
Transcode: to transform one byte-oriented buffer into another


Constructor Methods

Object Buffer(String string, String charset, [Number length])
  • creates and returns a new Object having a buffer of length bytes. If length is unspecified, the buffer will be exactly big enough to hold the encoded data
  • string is encoded to the character set identified by charset in the new buffer
  • throws an Error if encode fails, even if failure is due to under-sized buffer
  • if buffer is undersized and charset is a Unicode character set, the Error object will be augmented with a "commonjs.binary.encode_buffer_underrun" property indicating how many more bytes would have been required to succeed
  • if buffer is undersized and charset is not a Unicode character set (but is a valid iconv character set), the Error object will be augmented with a "commonjs.binary.encode_buffer_underrun" property having an undefined value
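
A sketch of the constructor's underrun behaviour for the Unicode case. The function createBuffer is a hypothetical stand-in for Buffer(string, charset, [length]); it handles UTF-8 only, uses TextEncoder for the encoding, and returns a Uint8Array. What the unused tail of an over-sized buffer holds is not stated in the text above; here it is zero because Uint8Array is zero-filled.

```javascript
// Hypothetical stand-in for Buffer(string, charset, [length]); UTF-8 only.
function createBuffer(string, charset, length) {
  if (charset.toUpperCase() !== "UTF-8") {
    throw new Error("this sketch only handles UTF-8");
  }
  const encoded = new TextEncoder().encode(string);
  if (length === undefined) return encoded; // exactly big enough
  if (encoded.length > length) {
    const err = new Error("buffer too small for encoded string");
    // Unicode charset: report how many more bytes would have been needed.
    err["commonjs.binary.encode_buffer_underrun"] = encoded.length - length;
    throw err;
  }
  const buffer = new Uint8Array(length); // zero-filled; the tail stays 0
  buffer.set(encoded);
  return buffer;
}

let shortBy;
try {
  createBuffer("h\u00e9llo", "UTF-8", 4); // "héllo" needs 6 bytes
} catch (e) {
  shortBy = e["commonjs.binary.encode_buffer_underrun"];
}
```

A caller could use the reported shortfall to retry once with a buffer of length + shortBy bytes instead of guessing.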

Static Methods

Object unicodeTranscode({Object buffer, [String charset, [Number offset, [Number length]]]} target,
                        {Object buffer, [String charset, [Number offset, [Number length]]]} source, [Object o])
  • Copies from source's buffer to target's buffer, transcoding from one Unicode character set to another, returning an object o
  • The default values for charset, offset, and length are "UTF-8", 0, and buffer.length for both source and target
  • Transcode behaviour is unspecified if source === target and the ranges [source.offset...source.offset + length] and [target.offset...target.offset + length] overlap
  • Transcode input is source.buffer[offset] through source.buffer[offset + length - 1]; only entire characters are transcoded; a trailing partial character is not an error
  • o.encoded holds the number of bytes from source.buffer which were transcoded by this operation
  • o.used holds the number of bytes written into the target.buffer by this operation
  • transcoding errors will throw an exception, with the Error object augmented with a "commonjs.binary.transcode_error_offset" property containing the position in the source buffer which held the leading byte of the unencodeable character
  • The contents of o.encoded and o.used are unaffected if this function throws an exception
  • No properties other than o.encoded and o.used will be affected by this method

Instance Methods

String toString([String charset, [Number offset, [Number length]]])
  • decodes this buffer, starting at byte offset for length bytes, as though this buffer was encoded with the character set identified by charset, returning a new String
  • The default values for charset, offset, and length are "UTF-8", 0, and this.length respectively.
  • This routine operates only on whole strings; truncated Unicode characters are to be treated as transcoding errors
  • A transcoding error will throw an exception, with the Error object augmented with a "commonjs.binary.transcode_error_offset" property
    • If charset identifies a Unicode character set, Error["commonjs.binary.transcode_error_offset"] will contain the number of bytes between the first byte to be decoded and the leading byte of the encoded character which could not be decoded. For example, a buffer containing a three-byte UTF-8 sequence with a corrupted third byte would set Error["commonjs.binary.transcode_error_offset"] to 0.
    • If charset does not identify a Unicode character set, Error["commonjs.binary.transcode_error_offset"] will have an undefined value.
String unicodeToString([String charset, [Number offset, [Number length, [Object o]]]])
  • decodes this buffer, starting at byte offset for length bytes, as though this buffer was encoded with the character set identified by charset, returning a new String
  • The default values for charset, offset, and length are "UTF-8", 0, and this.length respectively.
  • Specifying a non-Unicode character set name in charset will cause this function to throw an Error
  • Only entire characters are decoded - a trailing partial character is not an error
  • o.encoded holds the number of bytes from this buffer which were decoded by this operation
  • A decoding error will throw an exception, with the Error object augmented with a "commonjs.binary.transcode_error_offset" property containing the position in this buffer which held the leading byte of the undecodable character
  • The content of o.encoded is unaffected if this function throws an exception
  • No properties other than o.encoded will be affected by this method
Object unicodeFromString(String string, [String charset, [Number offset, [Number length, [Object o]]]])
  • encodes string into this buffer, starting at byte offset and writing at most length bytes, using the character set identified by charset, returning the object o
  • The default values for charset, offset, length, and o are "UTF-8", 0, this.length, and {} respectively
  • Specifying a non-Unicode character set in charset will cause this function to throw an Error
  • o.encoded holds the number of characters encoded from String
  • o.used holds the number of bytes written into this buffer by this operation
  • Only complete characters will be encoded: a string whose last position contains the first half of a surrogate pair will have o.encoded === string.length - 1. This is not an error condition.
  • transcoding error will throw an exception, with the Error object augmented with a "commonjs.binary.transcode_error_offset" property containing the position in string which held the unencodeable character
  • The contents of o.encoded and o.used are unaffected if this function throws an exception
  • No properties other than o.encoded and o.used will be affected by this method
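
A sketch of how the two partial-string methods might behave, under several assumptions: they are written as free functions taking the buffer as first argument rather than as instance methods, the charset argument is dropped and UTF-8 is assumed, Uint8Array stands in for Buffer, and the thrown errors are not augmented with "commonjs.binary.transcode_error_offset". A lone surrogate in the middle of the string is encoded as U+FFFD by TextEncoder rather than reported.

```javascript
// Byte length of the UTF-8 sequence starting at `lead`; bytes that cannot
// start a sequence count as 1 so that the strict decoder below rejects them.
function utf8Length(lead) {
  if (lead < 0xc0) return 1;
  if (lead < 0xe0) return 2;
  if (lead < 0xf0) return 3;
  if (lead < 0xf8) return 4;
  return 1;
}

// Decode only whole characters from buffer[offset, offset + length);
// o.encoded receives the number of bytes consumed.
function unicodeToString(buffer, offset, length, o) {
  const end = offset + length;
  let i = offset;
  while (i < end) {
    const n = utf8Length(buffer[i]);
    if (i + n > end) break; // trailing partial character: not an error
    i += n;
  }
  // fatal: true makes malformed input throw instead of yielding U+FFFD.
  const text = new TextDecoder("utf-8", { fatal: true }).decode(buffer.subarray(offset, i));
  if (o !== undefined) o.encoded = i - offset; // set only after decoding succeeded
  return text;
}

// Encode as much of `string` as fits in buffer[offset, offset + length).
// o.encoded counts UTF-16 code units read; o.used counts bytes written.
function unicodeFromString(buffer, string, offset, length, o) {
  if (o === undefined) o = {};
  // Hold back a trailing high surrogate: its other half may arrive later.
  const last = string.charCodeAt(string.length - 1);
  const src = last >= 0xd800 && last <= 0xdbff ? string.slice(0, -1) : string;
  // encodeInto writes only whole characters and reports both counts.
  const r = new TextEncoder().encodeInto(src, buffer.subarray(offset, offset + length));
  o.encoded = r.read;
  o.used = r.written;
  return o;
}

const readInfo = {};
const text = unicodeToString(new Uint8Array([0x61, 0xe2, 0x82]), 0, 3, readInfo);

const writeInfo = {};
unicodeFromString(new Uint8Array(4), "a\u20acb", 0, 4, writeInfo); // "a€b" needs 5 bytes

const splitInfo = unicodeFromString(new Uint8Array(8), "x\ud83d", 0, 8);
```

In the first call only "a" is returned and readInfo.encoded tells the caller where to resume; in the last call the dangling high surrogate is left for the next write, so splitInfo.encoded is string.length - 1.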

What do you and Ryan think of this approach?

Kris Kowal

Mar 1, 2010, 5:28:30 PM3/1/10
to comm...@googlegroups.com
On Mon, Mar 1, 2010 at 11:39 AM, Wes Garland <w...@page.ca> wrote:
> What do you and Ryan think of this approach?

First pass over it and I have a good impression. I need to do another
detailed read. Just wanted to say thanks for doing the footwork. It
looks good.

Kris Kowal

Donny Viszneki

Mar 1, 2010, 6:07:21 PM3/1/10
to comm...@googlegroups.com

Can we get Wes' proposal on the Wiki somewhere?

--
http://codebad.com/

Kris Kowal

Mar 6, 2010, 5:03:30 AM3/6/10
to comm...@googlegroups.com
On Mon, Mar 1, 2010 at 3:07 PM, Donny Viszneki <donny.v...@gmail.com> wrote:
> Can we get Wes' proposal on the Wiki somewhere?

Until I can find a better home for this, I've wikified Wes's proposal
for posterity. I personally still have this in my inbox for a more
thorough review and probably integration.

http://wiki.commonjs.org/wiki/Binary/F/Wes

Kris Kowal
