http://wiki.commonjs.org/wiki/Binary/F
Kris Kowal
(Subject updated to match link)
Couple of small niggles:
toSource returns something not accepted by the constructor - specifically an array of numbers
One other thing is what about negative values for range, slice etc. Some of the Array methods accept negative values to count backwards from the end. For example:
> [4,5,6,7].slice(-3)
[5, 6, 7]
We just need to decide and explicitly state whether these are supported. From what I recall, most Array methods support them, at least on SpiderMonkey.
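For reference, a minimal sketch of how Array-style relative indices could be resolved for range and slice, assuming the same clamping rule that Array.prototype.slice uses. The helper names resolveIndex and sliceBytes are hypothetical and not part of Binary/F.

```javascript
// Hypothetical helper: resolve a possibly negative index against a length,
// following the clamping rule of Array.prototype.slice.
function resolveIndex(index, length, fallback) {
  if (index === undefined) return fallback;
  const n = Math.trunc(index);
  return n < 0 ? Math.max(length + n, 0) : Math.min(n, length);
}

// Hypothetical slice built on the helper; works on any array-like with slice().
function sliceBytes(bytes, start, stop) {
  const from = resolveIndex(start, bytes.length, 0);
  const to = resolveIndex(stop, bytes.length, bytes.length);
  return bytes.slice(from, Math.max(from, to));
}

console.log(sliceBytes([4, 5, 6, 7], -3)); // [ 5, 6, 7 ]
```

Whatever the spec decides, stating the rule once in a shared helper would keep range, slice, copy and friends consistent with each other.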
[[Put]] should probably throw a RangeError instead of ValueError?
From what I remember from many months ago, this is fairly similar to the first Blob class we had in flusspferd. With this proposal, if you do need to grow a blob or concatenate two blobs together, I guess you create a new blob and use copy/copyFrom to do it. I might be happy with this -- will ponder.
-ash
> Couple of small niggles:
> toSource returns something not accepted by the constructor - specifically an array of numbers
This poses a fascinating philosophical question: whether to put the
array constructor form back or to change the source representation to:
require("binary").Buffer(3).copyFrom([1, 2, 3])
I think I'll put the Array constructor back.
> One other thing is what about negative values for range, slice etc. Some of the Array methods accept negative values to count backwards from the end. For example:
>
>> [4,5,6,7].slice(-3)
> [5, 6, 7]
I'll add verbiage for this.
> [[Put]] should probably throw a RangeError instead of ValueError?
Sure.
> From what i remember many months ago, this is fairly similar to the first Blob class we had in flusspferd. Using this proposal if you do need to grow or concatenate two blobs together i guess you create a new blob and use copy/copyFrom to do it. I might be happy with this -- will ponder.
One nice thing about this proposal is that we can build much nicer UI
on top of it mostly in pure JavaScript, while having this relatively
easily implemented subset at the embedding layer.
Kris Kowal
I have put the Array constructor back.
>> One other thing is what about negative values for range, slice etc. Some of the Array methods accept negative values to count backwards from the end. For example:
>>> [4,5,6,7].slice(-3)
>> [5, 6, 7]
>
> I'll add verbiage for this.
Also, this is now done.
>> [[Put]] should probably throw a RangeError instead of ValueError?
> Sure.
Also done.
I also did a pass on copy editing, formatting, and made a bunch of
things more explicit.
http://wiki.commonjs.org/wiki/Binary/F
Kris Kowal
--
You received this message because you are subscribed to the Google Groups "CommonJS" group.
To post to this group, send email to comm...@googlegroups.com.
To unsubscribe from this group, send email to commonjs+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/commonjs?hl=en.
This is an interesting one. One usage pattern I've definitely already
seen with Buffer is:
var buffer = Buffer(1024);
var actual = write(buffer, 0, 1024);
return buffer.range(0, actual);
This returns a buffer trimmed to the size of its actual content
without having to do an expensive reallocation and copy.
I'm not sure this is even the best example of using "range". I think
the general idea is that "range" avoids reallocation. This is
probably more important with this proposal than others since it does
not abstract copy-on-write semantics. This is very much a low-level
binary API that puts a lot of responsibility in pure JavaScript.
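A minimal sketch of the trimming pattern, using a standard Uint8Array and its subarray method as a stand-in for the proposed Buffer and "range" (an assumption; Binary/F itself may differ). The write function here is hypothetical, and its fourth parameter exists only to make the example self-contained.

```javascript
// Hypothetical stand-in for a low-level write: copies `bytes` into `target`
// starting at `start`, never past `stop`, and returns the count written.
function write(target, start, stop, bytes) {
  const count = Math.min(bytes.length, stop - start);
  target.set(bytes.subarray(0, count), start);
  return count;
}

const buffer = new Uint8Array(1024);
const actual = write(buffer, 0, 1024, Uint8Array.of(104, 105, 33));

// A view on the same memory: no reallocation and no copy.
const trimmed = buffer.subarray(0, actual);

buffer[0] = 72; // the view observes writes made through the parent
console.log(trimmed.length, trimmed[0]); // 3 72
```

Because the view shares storage with its parent, there is no copy-on-write protection; that is the responsibility this proposal leaves to JavaScript code.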
> 2) the copy constructor form is a bit problematic if you want a 1:1 view on
> another buffer:
>
> var clone = new Buffer(source, 0, source.length, false); // second and third
> arguments are redundant...
>
> Maybe it would be a good idea to completely *remove* the 4th argument ( =>
> always copy) and use only the "source.range()" instead?
I don't think this is a problem. It's pretty clear that you can throw
"undefined" in for "start" and "stop" in which case they're inferred.
I think most people will use "range" and "slice", but the inspiring
implementation (Ryan Dahl's Buffer) uses the Buffer constructor as the
basis of these operations. We could leave this as an implementation
detail, but I want to encourage consistency in practice.
> 3) the .Content property... what is its purpose? Or, more specifically, can
> you please show some usage scenario?
I'll leave this one to Daniel Friesen to defend. In his proposal it
was called .contentConstructor and it had something to do with
duck-typing collections. Its usage with a single type is not obvious,
but with a variety of types that have different content types, it
could be useful. I'm in favor of the idea in any case, just in
principle. This feature opens the possibility of content-type
agnostic algorithms that would not be possible to discern from an
empty collection. That is, (typeof someCollection[0]) won't work with
an empty buffer. Since Buffer would be the first typed collection in
JavaScript, it's difficult to cite precedent. However, with WebGL,
there promise to be many different types of collection. Some metadata
can't hurt.
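To illustrate the empty-collection point, here is a rough sketch. ByteCollection and contentTypeOf are hypothetical names invented for this example; only the idea of a Content property comes from the proposal.

```javascript
// Hypothetical typed collection; `Content` is modelled on the property
// discussed in this thread, not on any shipped API.
class ByteCollection {
  constructor(values = []) {
    this.values = Array.from(values);
  }
  get length() {
    return this.values.length;
  }
}
ByteCollection.prototype.Content = Number;

// Content-type agnostic lookup: works even when the collection is empty.
function contentTypeOf(collection) {
  return collection.Content;
}

const empty = new ByteCollection();
console.log(typeof empty.values[0]);          // "undefined": nothing to inspect
console.log(contentTypeOf(empty) === Number); // true
```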
Kris Kowal
I think this is a "cascading virtualization of unpack" feature. I
think we should postpone this; we can implement libraries for these
things and come back and talk about how to augment the Buffer type.
There are so many ways we could do this, I would rather see us hugging
and buying each other beer than protract this spec.
But, having said that, I've got some ideas. The problem we'll have
with this is that there are a lot of kinds of things you might want to
grab out of a buffer, lots of places to get them, lots of endianness,
lots of native/network byte order, lots of native alignment variation
and non-aligned, lots of widths, signed/unsigned, lots of formats,
lots of efficient array indexing and opaque struct dereferencing, lots
of everything. It would be super clumsy to have a whole bunch of
methods for this. One option is to have an unpacking DSL like
"unpack" from Perl, PHP, Ruby, Python and so on.
unpack(buffer, "Hb*")
buffer.unpack("Hb*")
We could have a "record" DSL for unpacking built on top of that.
buffer.unpack("H<a>b*<b>") => {
  "a": …,
  "b": …,
};
We could also do something elegant with orthogonal range selection and
unpacking. So, for your example:
buffer.slice(index, length).valueOf(endianness);
Buffer([0, 1, 2, 3, 4, 5]).slice(2, 6).valueOf(">");
In Python pack notation, "@" means native endianness, ">" and "<" are
big and little, and "!" is network just in case you forget that it's
the same as ">". I could also buy constants or "BE", and "LE" to
match the IANA charset suffixes.
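As a rough sketch of the range-then-valueOf idea, the standard DataView can stand in for the endianness handling. The toNumber helper is hypothetical, handles only 4-byte unsigned values, and does not attempt "@" (native order).

```javascript
// Hypothetical helper: read four bytes as an unsigned integer.
// ">" and "!" mean big-endian, "<" means little-endian (Python pack notation).
function toNumber(bytes, order) {
  if (bytes.length !== 4) {
    throw new RangeError("this sketch only handles 4-byte values");
  }
  if (order !== ">" && order !== "<" && order !== "!") {
    throw new RangeError("unsupported byte order: " + order);
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return view.getUint32(0, order === "<");
}

// Bytes at indices 2..5 of [0, 1, 2, 3, 4, 5], as a view rather than a copy.
const bytes = Uint8Array.of(0, 1, 2, 3, 4, 5).subarray(2, 6);
console.log(toNumber(bytes, ">")); // 33752069 (0x02030405)
console.log(toNumber(bytes, "<")); // 84148994 (0x05040302)
```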
Note that "valueOf" is the idiomatic variant of "toNumber"; the
Number() constructor defers to "valueOf" internally. This is
symmetric to:
100..toString(2)
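A small illustration of that symmetry, using a plain object with a valueOf method in place of a buffer range (the object is hypothetical):

```javascript
// A hypothetical object that defines valueOf, as a buffer range might.
const fourBytes = {
  valueOf() {
    return 0x02030405;
  }
};

console.log(Number(fourBytes));  // 33752069
console.log(fourBytes + 1);      // 33752070
console.log((100).toString(2));  // "1100100"
```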
BUT…I think we should talk about this again for Binary/1.1 if we can
agree on something smaller first.
Kris Kowal
I should have used "range" instead of "slice" to illustrate two points
at once. Oops.
Kris Kowal
In Binary/F it basically creates a buffer that operates on a specific
part of another already allocated buffer.
In Binary/C the issue being solved wasn't buffer allocation (the return
value of .range wasn't even a Buffer; it was an OpaqueRange, which had
implementation-specific semantics to let implementations choose whatever
technique worked best for them). The issue was the API of memcopy.
Both .splice and a .memcopy/.copy function have one issue: the argument
list. It's an unsightly list of mostly numbers that makes code tough to
understand as you read it. (I for one don't bother memorizing argument
lists consisting of a bunch of numbers, and I get tripped up when scanning
APIs that use them. I expect other target programmers are the same; this
is JavaScript after all.)
.splice is alright with only two confusing numbers, but memcopy is
.copy(data, offset, length, [dataOffset]), in other words:
bufB.copy(bufA, 5, 10, 15);
The .splice API problem was already solved. You never had to touch
.splice to modify a *Buffer in an intuitive way; .append, .insert,
.replace, .remove, .fill, and .clear let you modify a buffer in any way
you need without doing things like
.splice(7, 0, [0,255]); /* .insert([0,255], 7); */,
b.splice(b.length, 0, [0,255]); /* .append([0,255]); */, etc...
The problem was that, with said API, copying a section of a buffer to
another buffer gave you two choices: A) use the nice and readable API,
at the cost of allocating a new Blob each time and discarding that
allocated memory right after; or B) use the memcopy API and its long
list of args, which aren't easy to read.
So I ended up thinking of .range, it returns an OpaqueRange which refers
to a part of another buffer temporarily, intended to be discarded right
away (it doesn't really share it in any way, it just knows where the
data is, and what portion of the data it points to).
So now you get the benefits of both A and B without the problems of
either: bufB.insert(bufA.range(5, 10), 7); /* Take the range of data
from index 5-10 (or that might be 5-15, we didn't get enough responses
to the show of hands to pick the api for .range) of buffer A, and insert
it into buffer B at index 7 using memcopy instead of allocating any
extra blobs. */
bufB.replace(bufA.range(5, 10), 7); would roughly be:
    bufB.copy(bufA, 7, 5, 5);
bufB.append(bufA.range(5, 10)); would roughly be:
    var l = bufB.length; bufB.length += 5; bufB.copy(bufA, l, 5, 5);
bufB.insert(bufA.range(5, 10), 7); would roughly be:
    var l = bufB.length; bufB.length += 5; bufB.copy(bufB, 7+5, l-7, 7);
    bufB.copy(bufA, 7, 5, 5);
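For comparison, the same "copy from a view without an intermediate blob" idea can be sketched with standard typed arrays, where subarray plays the role of .range and set plays the role of memcopy. This is only an analogue under that assumption, not the Binary/C API; insertInto is a hypothetical helper and assumes the target has spare capacity, since typed arrays cannot grow.

```javascript
const bufA = Uint8Array.from({ length: 16 }, (_, i) => i); // bytes 0..15
const bufB = new Uint8Array(16).fill(255);

// Rough analogue of bufB.replace(bufA.range(5, 10), 7):
// subarray() is a view on bufA, so only the five bytes themselves are copied.
bufB.set(bufA.subarray(5, 10), 7);
console.log(Array.from(bufB.subarray(7, 12))); // [ 5, 6, 7, 8, 9 ]

// Rough analogue of insert: shift the tail right, then copy the source in.
// `used` is how many bytes of `target` currently hold data.
function insertInto(target, used, source, at) {
  if (used + source.length > target.length) {
    throw new RangeError("not enough spare capacity");
  }
  target.copyWithin(at + source.length, at, used);
  target.set(source, at);
  return used + source.length;
}
```

The call site stays readable (a range plus one position) while the work underneath is still a plain memory copy.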
I won't be defending it in this instance. As I've noted before,
.contentConstructor's use cases almost completely disappear (at least
every single use case I can come up with, as well as any faint idea of
how it could be useful) when you remove it from Binary/C's abstract
text/binary symmetric API. And the use of .contentConstructor === Number
makes it even less useful.
And trying to mix .contentConstructor into all the different binary APIs
will likely not help. Extra binary APIs with .contentConstructor require
much more thought on how they would interact and what unexpected things
might happen. I put the overactive part of my brain to work when it came
to figuring out how .contentConstructor and other abstract parts of the
API would react to certain situations and how they would likely be
expected to react. I haven't put that into play trying to figure out the
theory of how various binary APIs would interact with .contentConstructor.
> Kris Kowal
>
Considering the various binary API efforts that are going on, the
existing binary objects on various platforms, how we bikeshed and come
up with varying permutations of a binary API which only varying portions
of the group take to, and the varying use cases we have, I'm taking more
to the idea of accepting that we will likely end up with more than one
binary API to deal with. Instead, we could promote patterns that are
resilient to various binary systems being used throughout an app and
its libraries (i.e. being sure to use constructors to cast any data
passed to your library from outside the library to your binary system),
collect our use cases or goals into separate targets, and standardize
multiple (not that many) binary APIs that fit our separate target cases.
I don't mind a universal lite API that'll work interoperably,
independent of the capabilities of the platform... But I also don't mind
the extra work of implementing a near-primitive (not as hard in Rhino as
in other engines) from a forward-thinking spec written with the hope
that it would become a future part of ES and be implemented natively in
future versions of JavaScript engines.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
What about writing a string into the buffer? I definitely need the
ability to take a string and write it, with a chosen encoding into the
buffer, at a given location - with a maximum length that it could
occupy. It should return the number of bytes written. It would be nice
if it didn't split characters...
Just wondering: what's the use case for this when you don't know how
much of the string (in characters, not bytes) was written to the
buffer?
Hannes
Yeah, I guess it should provide that too.
Alright, here's some potential verbiage:
; copyString(source String, String(charset), Number(start_opt),
Number(stop_opt), Number(sourceStart_opt), Number(sourceStop_opt))
[sourceStop Number, targetStop Number]
# Encodes as much as possible of a String in a given charset into this
buffer from "start" to "stop", using the source string from
"sourceStart" to "sourceStop", and returns the actual stop index of
the source string and of this, the target buffer.
# "start" is 0 if undefined or omitted.
# "stop" is this buffer's length if undefined or omitted.
# "sourceStart" is 0 if undefined or omitted.
# "sourceStop" is the source string's length if undefined or omitted.
# "charset" must be an IANA charset name.
## "copyString" must throw a ValueError if the given "charset" is not supported.
## The charsets "ascii", "utf-8", and "utf-16" must be supported.
## ''Note: the charset is not optional, there is no default.''
# Returns a duple Array with the actual "sourceStop" and "targetStop".
## The actual "sourceStop" is an index one past the last character
actually read.
## The actual "targetStop" is an index one past the last byte actually written.
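A minimal sketch of how such a method might behave, written as a free function over a Uint8Array and limited to "utf-8" by leaning on the standard TextEncoder.encodeInto. Assumptions: "characters" are counted as UTF-16 code units, and a RangeError stands in for the ValueError named above. This is not the proposal's implementation.

```javascript
// Hypothetical sketch of copyString for "utf-8" only.
// Returns [sourceStop, targetStop]: one past the last character read and
// one past the last byte written. Characters are never split.
function copyString(target, source, charset, start = 0, stop = target.length,
                    sourceStart = 0, sourceStop = source.length) {
  if (String(charset).toLowerCase() !== "utf-8") {
    throw new RangeError("unsupported charset in this sketch: " + charset);
  }
  const { read, written } = new TextEncoder().encodeInto(
    source.slice(sourceStart, sourceStop),
    target.subarray(start, stop)
  );
  return [sourceStart + read, start + written];
}

const target = new Uint8Array(5);
// "h\u00e9llo" is "héllo": 6 bytes in UTF-8, so the final "o" does not fit.
console.log(copyString(target, "h\u00e9llo", "utf-8")); // [ 4, 5 ]
```

The returned sourceStop can be passed back in as sourceStart to continue writing into the next buffer.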
Kris Kowal
Yeah, that seems reasonable. So, for clarity - `start` and `stop` are in
octet units. `sourceStart` and `sourceStop` are in character units?
If it just returned the total number of octets written, that would be
sufficient?
I'm not particularly thrilled that I would need to create a range to
slice out a string. If I've got a buffer containing a HTTP request, I
want to rip out a bunch of ascii encoded strings really quickly,
creating a range object for each field and value before
.toString('ascii') them is a bit of overhead.
Yeah, I'll make that more clear:
# "start" is 0 if undefined or omitted and counts bytes.
# "stop" is this buffer's length if undefined or omitted, counting in bytes.
# "sourceStart" is 0 if undefined or omitted, counting in characters.
# "sourceStop" is the source string's length if undefined or omitted,
counting in characters.
> If it just returned the total number of octets written, that would be
> sufficient?
In order to resume writing on another buffer, you would need the
string offset because with mixed-width encodings like UTF-*, the width
is not computable in terms of the bytes written. It would be possible
to compute in terms of UCS-*, but we're targeting the general case.
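A quick demonstration of why the byte count alone is not enough, using the standard TextEncoder (which always encodes UTF-8):

```javascript
const encoder = new TextEncoder();

const plain = "abcd";
const accented = "\u00e9\u00e9"; // "éé"

// Both strings occupy four bytes, but they consume different numbers of
// characters from the source string.
console.log(encoder.encode(plain).length, plain.length);       // 4 4
console.log(encoder.encode(accented).length, accented.length); // 4 2
```

Knowing only "4 bytes written" does not tell the caller whether to resume at character 4 or character 2.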
Kris Kowal
There are others here who would not be thrilled to have to provide a
bunch of positional arguments, but I think we can entertain both the
range().toString() case and yours by providing optional start,stop
range args to toString. Would that suffice?
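A sketch of what a toString with optional start and stop might look like for the ASCII case, written as a free function over a Uint8Array. The name asciiToString is hypothetical, and masking to seven bits is a choice made for this example only.

```javascript
// Hypothetical helper: decode bytes[start, stop) as ASCII without creating
// an intermediate range object.
function asciiToString(bytes, start = 0, stop = bytes.length) {
  let text = "";
  for (let i = start; i < stop; i++) {
    text += String.fromCharCode(bytes[i] & 0x7f); // keep the low seven bits
  }
  return text;
}

const request = new TextEncoder().encode("GET /index.html HTTP/1.1\r\n");
console.log(asciiToString(request, 0, 3));  // "GET"
console.log(asciiToString(request, 4, 15)); // "/index.html"
```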
Kris Kowal
That works.
http://wiki.commonjs.org/index.php?title=Binary%2FF&diff=2130&oldid=2128
Kris Kowal
Oh, apparently there isn't (I've been assuming it was from my
experience elsewhere). Should we change the three references to
ValueError to RangeError or just Error?
Kris Kowal
> Throws a RangeError if the buffer is malformed for the given character set,
> if a multi-byte character would be split across the stop boundary, or if any
> of the code points are out of the implementation's supported range.
This is difficult to recover from - I'll have to wait for the
next packet, concatenate them, and then call toString() again? Or I
suppose I could try again with .toString('utf8', 0, buffer.length -
1)? It would be nice to get the longest string possible, and then be
notified of how many octets remain.
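One way to get "the longest string possible, plus the number of octets left over" is sketched below for UTF-8 only. Both function names are hypothetical; the standard TextDecoder does the actual decoding, and malformed input in the middle of the buffer is left for the decoder to handle.

```javascript
// Hypothetical helper: length of the longest prefix of `bytes` that does not
// end in the middle of a UTF-8 sequence.
function completeUtf8Length(bytes) {
  let i = bytes.length - 1;
  let continuation = 0;
  // Walk back over at most three continuation bytes (10xxxxxx).
  while (i >= 0 && continuation < 3 && (bytes[i] & 0xc0) === 0x80) {
    i--;
    continuation++;
  }
  if (i < 0) return bytes.length; // no lead byte found; nothing to trim
  const lead = bytes[i];
  let needed;
  if ((lead & 0x80) === 0x00) needed = 1;
  else if ((lead & 0xe0) === 0xc0) needed = 2;
  else if ((lead & 0xf0) === 0xe0) needed = 3;
  else if ((lead & 0xf8) === 0xf0) needed = 4;
  else return bytes.length; // not a lead byte; leave it to the decoder
  return bytes.length - i < needed ? i : bytes.length;
}

// Decode the complete prefix and report how many octets remain undecoded.
function decodeLongest(bytes) {
  const end = completeUtf8Length(bytes);
  return {
    text: new TextDecoder("utf-8").decode(bytes.subarray(0, end)),
    remaining: bytes.length - end
  };
}

// "h" followed by only the first byte of "é" (0xC3 0xA9).
console.log(decodeLongest(Uint8Array.of(0x68, 0xc3))); // { text: 'h', remaining: 1 }
```

The caller can carry the remaining octets over to the front of the next packet.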
Aye aye. I'll look again.
Kris Kowal
This is starting to get into the character of proper character set encoding. What happens for more complex charsets?
Can you describe some of these cases?
It's not my intent that this API should replace the encodings API; it
ought to be possible to provide the Buffer abstraction using the
encodings API. Please point out any ways that would not be possible.
Do BOMs complicate matters? Is there state that would need to be
passed into Buffer routines to properly decode or encode ranges of
certain character sets?
Kris Kowal
Does this change address the issue adequately:
http://wiki.commonjs.org/index.php?title=Binary%2FF&diff=2135&oldid=2132
I changed "copyFromString" to "write" and created a corresponding
"read" that supports partial reads.
Kris Kowal
The first thing that comes to mind is ISO-2022-JP, which is a stateful encoding - there are two modes, ASCII and a katakana mode (I think it's called that anyway), and escape sequences to switch from one to the other.
The upshot of that is that you need stateful decoding if you want to deal with anything more than the 'this is the entire string, please decode it' case.
I can see the need for a toString("utf-8") case, but I worry that we're asking for trouble/edge cases/special behaviour for anything other than all-or-nothing conversion.
My feeling is that this whole partial decoding support is getting out
of hand and does not really belong in a binary API (it may make sense
as part of a dedicated encoding API). For example, the read and write
methods returning duple arrays are pretty specialized and low level
and will rarely be needed for application level programming.
I think we should keep the string encoding/decoding support in binary
simple. Just provide methods to create a binary from a string and vice
versa, and say that the result for invalid character codes is
undefined. Everything else should be supported for those who need it,
but in a specialized encoding API.
Hannes
> Kris Kowal
I don't like that it returns an array, an extra, otherwise
unnecessary, object. That said, I can't think of a better alternative.
On Thu, Feb 25, 2010 at 1:20 PM, Hannes Wallnoefer <han...@gmail.com> wrote:
> I think we should keep the string encoding/decoding support in binary
> simple. Just provide methods to create a binary from a string and vice
> versa, and say that the result for invalid character codes is
> undefined. Everything else should be supported for those who need it,
> but in a specialized encoding API.
How would I decode a partial string then?
There should be a method for decoding part of a buffer to a string
(like toString(charset, start, stop)). But IMO the responsibility for
proper character alignment and boundaries should be with the
developer, i.e. I'd prefer garbage in-garbage out behaviour to
throwing an Error.
Anything more fancy like support for character boundary detection
should go in a dedicated encoding API.
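The two behaviours under discussion can be compared with the standard TextDecoder, which offers both: by default it substitutes U+FFFD for malformed input, and with the fatal option it throws. This only illustrates the trade-off; it is not what Binary/F specifies.

```javascript
// "h" followed by only the first byte of "é" (0xC3 0xA9).
const truncated = Uint8Array.of(0x68, 0xc3);

// Lenient decoding: the incomplete sequence becomes U+FFFD.
const lenient = new TextDecoder("utf-8").decode(truncated);
console.log(lenient === "h\ufffd"); // true

// Strict decoding: the same input is an error.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(truncated);
} catch (error) {
  threw = error instanceof TypeError;
}
console.log(threw); // true
```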
Hannes
> 2010/2/25 Ryan Dahl <coldre...@gmail.com>:
>>
>> On Thu, Feb 25, 2010 at 1:20 PM, Hannes Wallnoefer <han...@gmail.com> wrote:
>>> I think we should keep the string encoding/decoding support in binary
>>> simple. Just provide methods to create a binary from a string and vice
>>> versa, and say that the result for invalid character codes is
>>> undefined. Everything else should be supported for those who need it,
>>> but in a specialized encoding API.
>>
>> How would I decode a partial string then?
>
> There should be a method for decoding part of a buffer to a string
> (like toString(charset, start, stop)). But IMO the responsability for
> proper character alignment and boundaries should be with the
> developer, i.e. I'd prefer garbage in-garbage out behaviour to
> throwing an Error.
You can't get garbage with a partial character in a MBCS - you can't get anything for that character - it has to be an error condition. Is there a need for start,stop given the range method? Is .toString(cs, start, stop) not the same as .range(start,stop).toString(cs) ?
>
> Anything more fancy like support for character boundary detection
> should go in a dedicated encoding API.
>
> Hannes
That's my view too, though.
> You can't get garbage with a partial character in a MBCS - you can't get anything for that character - it has to be an error condition. Is there a need for start,stop given the range method? Is .toString(cs, start, stop) not the same as .range(start,stop).toString(cs) ?
It's the same but faster. Ryan's all about optimization.
Kris Kowal
This binary stuff is really hot code - something like toString() will
be called thousands, if not millions of times per second. It cannot
cut corners. If you can simplify it by, for example, not returning a
tuple and thus causing the GC to run less frequently - that can give
several percentage points in a simple web server benchmark.
Basically, we shouldn't prescribe this API. It needs to be
implemented, benched, and tinkered with. (That's not to say that the
proposal isn't useful - thinking through these things collectively
saves a lot of time.)
> On Thu, Feb 25, 2010 at 2:04 PM, Kris Kowal <cowber...@gmail.com> wrote:
>> It's the same but faster. Ryan's all about optimization.
>
> This binary stuff is really hot code - something like toString() will
> be called thousands, if not millions of times per second. It cannot
> cut corners. If you can simplify it by, for example, not returning an
> tuple and thus causing the GC to run less frequently - that can give
> several percentage points in a simple web server benchmark.
I can agree with that - but what's the case for split MBCS reads? Headers are ASCII only. Perhaps an IO/encoding layer could do this with a nicer API and still match your speed needs.
Basically something you pass an IO handle/emitter and it slurps in, manages what state is needed to do encoding, and emits the events with the data in the right charset? (It would only have to keep hold of at most a few bytes between events in any case that comes to mind right now). And what I have in mind wouldn't need to go out of C space before the final event if done right (I think).
Tho you'd need to be able to switch between ascii and utf8 or $char_set at almost any point. hmmm
This kind of interface might also be nice for transparently (un)compressing streams?
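A rough sketch of such a layer, assuming the standard TextDecoder in streaming mode as the piece that holds the few leftover bytes between events. The wrapper name createChunkDecoder and its push/end methods are hypothetical, and switching charsets mid-stream is not addressed.

```javascript
// Hypothetical wrapper: feed it chunks as they arrive and get back only
// complete characters; incomplete trailing bytes are held until the next chunk.
function createChunkDecoder(charset) {
  const decoder = new TextDecoder(charset);
  return {
    push(chunk) {
      return decoder.decode(chunk, { stream: true });
    },
    end() {
      return decoder.decode(); // flush whatever state is left
    }
  };
}

const chunks = createChunkDecoder("utf-8");
// "hé" split across two packets, in the middle of "é" (0xC3 0xA9).
console.log(JSON.stringify(chunks.push(Uint8Array.of(0x68, 0xc3)))); // "h"
console.log(JSON.stringify(chunks.push(Uint8Array.of(0xa9))));       // "é"
console.log(JSON.stringify(chunks.end()));                           // ""
```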
>
> Basically, we shouldn't prescribe this API. It needs to be
> implemented, benched, and tinkered with. (That's not to say that the
> proposal isn't useful - thinking through these things collectively
> save a lot of time.)
-ash
Could you make this recommendation more concrete? I would like to see
the signatures you are proposing.
Kris Kowal
First pass over it and I have a good impression. I need to do another
detailed read. Just wanted to say thanks for doing the footwork. It
looks good.
Kris Kowal
Until I can find a better home for this, I've wikified Wes's proposal
for posterity. I personally still have this in my inbox for a more
thorough review and probably integration.
http://wiki.commonjs.org/wiki/Binary/F/Wes
Kris Kowal