[binary] ByteArray and ByteString proposal


Kris Kowal

Apr 8, 2009, 4:08:25 AM
to serv...@googlegroups.com
I've taken the liberty to update the binary effort's front page,
including references to the mailing list and links to prior art
mentioned in that discourse. I've also reread those threads,
formulated some new ideas, and made a proposal for ByteArray and
ByteString types.

https://wiki.mozilla.org/ServerJS/Binary

https://wiki.mozilla.org/ServerJS/Binary/B

Kris Kowal

Ash Berlin

Apr 8, 2009, 1:38:16 PM
to serv...@googlegroups.com

As a general rule, I think having a toString method on any binary
class that tries to deal with encoding is asking for trouble, unless
it just returns something like "[Binary len: x]" when called without
any parameters. Think of the case when using a REPL or print to debug
and you do print(thisObjectThatIJustGot); you don't really want that
doing anything smarter than telling you that it's a blob.

Onto the proposals themselves:

Binary A:

* getLength: Why is this a function rather than a (RO?) property?
Both String and Array have a length property.

* toString: See above comment about toString and the comment at
the end of this mail about encodings.

* getByte/setByte: nothing seriously wrong with these, but if we
could get binary.byte[3] working or similar it'd be 'nicer'. Perhaps
just binary[3] again following Array and String.

* base64*, md5, sha1: I don't really think these belong on a Binary
class.

* String.proto.toBinary: what about if you want it as UTF-16?

Binary B:

'All platforms support two types for interacting with binary
data: ByteArray and ByteString.' Is this currently the case, or is it
'it is proposed that they do'?

ByteString.encode: Generally yes, I like this approach. What
format would codecModuleId take? I would suggest http://www.iana.org/assignments/character-sets
(aka the same names you see in MIME types/Content-Type headers etc).
Okay, so you allude to this at the end of your proposal.

ByteString.proto.toString: Again, see above comment about toString.

ByteArray.proto.compress: Does this compress in place, or return a
BA or a BS? The same question goes for the method on ByteString, but
there it is more obvious that it returns a new ByteString. Should be
documented though.

ByteArray.proto.push: What's the behaviour when you push a number
>= 256? Dies? If so it should say so. (It's the only sane behaviour I
can think of, fwiw.)

Array.proto.toByte*: Yes, so long as those are [[dontenum]].

Compression modules should also support a decompress.


Now onto Encoding.

Dealing with character sets is hard. Lots of languages get it subtly
or completely wrong. Spidermonkey gets it wrong in that it can't (to
my knowledge) support Unicode characters with a code point greater
than 65535, i.e. outside of the Basic Multilingual Plane. This
royally screws over CJK speakers (completely ignoring the whole Han
unification issue), but that's kind of a side issue and nothing to do
with this proposal.

So basically I started writing this email 10 hours ago, and now I've
forgotten what I wanted to say in this section. Hmm.

Well, your comment about conventions implies the need for an Encoding
module/namespace. Can I suggest that we use 'utf-8' instead of 'utf8'
so that it matches the IANA list <http://www.iana.org/assignments/character-sets>?

I'm sure there was some more important point I wanted to raise here,
but I've completely forgotten what it might have been. I will make it
later if I remember it.

Big first post eh? Hi everyone, well done for caring enough to read
this far, I've just found out about ServerJS project and I like its
aims. I'm involved in http://flusspferd.org.

-ash

Aristid Breitkreuz

Apr 9, 2009, 10:42:55 AM
to serverjs
Hi,

On Apr 8, 10:08 am, Kris Kowal <cowbertvon...@gmail.com> wrote:
> I've taken the liberty to update the binary effort's front page,
> including references to the mailing list and links to prior art
> mentioned in that discourse.  I've also reread those threads,
> formulated some new ideas, and made a proposal for ByteArray and
> ByteString types.

I pretty much like the Binary/B proposal, so I'll comment on the
details there.

> ByteString

Splitting it up into ByteString and ByteArray, while more work for the
implementation, generally seems like a good idea. It possibly allows
some optimisations and generally the use cases for ByteString and
ByteArray might diverge a little, so that's probably good. If we'd
settle for a single class though, I'd want it to be ByteArray.

> A ByteString is an immutable, fixed-width representation of a C unsigned char (byte) array. ByteString supports the String API, and indexing returns a byte substring of length 1.

Sure.

> The ByteString constructor accepts:
>
> * ByteString()
> * ByteString(byteString)
> * ByteString(byteArray)
> * ByteString(array)

Array of numbers? So far so good.

> * ByteString(string, codecModuleId)
>
> The ByteString object has the following methods:
>
> * encode(string, codecModuleId)

I'm not sure if that really belongs in there. I think this really
should be the job of the Encodings library. But if this would be a
convenience wrapper around Encodings, well, OK.

> ByteString instances support the following:
>
> * immutable length property
> * toByteArray()
> * toArray()

Sure.

> * toString(codecModuleId)

Please not! I don't like toString() taking parameters. I think
toString() should simply return something like "[ByteString 4096]",
where 4096 is the length.

> * decode(codecModuleId)
> * hash(digestModuleId)
> * compress(compressionModuleId)

Again, I'm not sure if this belongs in here.

> * indexOf(Number or ByteString)
> * lastIndexOf(Number or ByteString)

Yeah. But also an indexOf(Number or ByteString, lowestBoundForIndex)
please.

> * charAt(offset) -> ByteString
> * charCodeAt(offset) -> Number
> * byteAt(offset) -> Number (same as charCodeAt)

Is "char" really an appropriate identifier there? I mean it's a _byte_
string and not a _text_ string.

> * split(Number or ByteString) -> Array of ByteStrings

This would split around a boundary right? Okay, some additions:

* add an option for specifying whether the boundary itself should be
included in the result array,
* it should be possible to specify multiple boundaries.

So I'd propose this:

split(Number or ByteString or Array, {include_boundary: false/true}),
where include_boundary's default would be false.
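
A minimal sketch of how that signature could behave, under simplifying assumptions that are not part of the proposal: bytes are a plain Array of Numbers, every boundary is a single byte value, and splitBytes is a hypothetical free function rather than a ByteString method.

```javascript
// Hypothetical helper: split an Array of byte values around one or more
// single-byte boundaries, optionally keeping each boundary in the result.
function splitBytes(bytes, boundaries, options) {
    var includeBoundary = !!(options && options.include_boundary);
    if (typeof boundaries === 'number') boundaries = [boundaries];
    var parts = [];
    var current = [];
    for (var i = 0; i < bytes.length; i++) {
        if (boundaries.indexOf(bytes[i]) !== -1) {
            parts.push(current);
            if (includeBoundary) parts.push([bytes[i]]);
            current = [];
        } else {
            current.push(bytes[i]);
        }
    }
    parts.push(current);
    return parts;
}

var withoutBoundaries = splitBytes([1, 0, 2, 10, 3], [0, 10]);
var withBoundaries = splitBytes([1, 0, 2, 10, 3], [0, 10], {include_boundary: true});
```

Multi-byte boundaries would need a sequence search instead of the per-byte indexOf test used here.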

> * substring(first, last) or substring(first) to the end
> * substr(first, length) or substr(length)

Slightly confusing.

String.prototype specifies:

substring(indexA, [indexB])
substr(start[, length]);

We should go with that, then.

> * The + operator returning new ByteStrings

Drop that. Can't be implemented everywhere.

> * The immutable [] operator returning ByteStrings
> * toSource() which would return "ByteString([])" for a null byte string

OK.

> * valueOf() returns itself

I don't understand.

> ByteString does not implement toUpperCase() or toLowerCase().

Wouldn't make sense anyways.

> ByteArray
>
> A ByteArray is a mutable, flexible representation of a C unsigned char (byte) array.
>
> The ByteArray constructor has the following forms:
>
> * ByteArray()
> * ByteArray(length)
> * ByteArray(byteArray)
> * ByteArray(byteString)
> * ByteArray(array)

OK.

> * ByteArray(string, codecModuleId)

See above (ByteString).

> Unlike the Array, the ByteArray is not variadic so that its initial length constructor is not ambiguous with its copy constructor.

Fair enough.

> The ByteArray object has the following methods:
>
> * encode(string, codecModuleId)

See above (ByteString).

> ByteArray instances support the following:
>
> * mutable length property
> o extending a byte array fills the new entries with 0.
> * toByteString()
> * toArray()

Good.

> * toString(codecModuleId)
> * decode(codecModuleId) returns String
> * hash(digestModuleId)
> * compress(compressionModuleId)

See above (ByteString).

> * concat(iterable)
> * join(byteString byteArray or Number)
> * pop()
> * push(…variadic Numbers…)

OK. Push could also take an Array or a ByteArray or a ByteString,
because that is not ambiguous...

> * shift()
> * unshift(…variadic Numbers…)
> * reverse() in place reversal
> * slice()
> * sort()
> * splice()

OK. Slice and splice maybe need more clarification, but I think I get
the idea.

> * toSource() returns a string like "ByteArray([])" for a null byte-array.

OK.

> * valueOf() returns itself
> * The + operator returning new ByteArrays

See above (ByteString).

> * The mutable [] operator for numbers

OK.

> String
>
> The String prototype will be extended with the following members:
>
> * toByteArray(codecModuleId)
> * toByteString(codecModuleId)

For the objections to the encodings stuff, see above (ByteString).

> Array
>
> The Array prototype will be extended with the following members:
>
> * toByteArray(codecModuleId)
> * toByteString(codecModuleId)

I don't think this one should take a codec parameter. I think it
should simply wrap the constructors of ByteArray and ByteString. Or it
can be dropped, because it doesn't really add anything that has to be
there.

Kind regards and thanks for taking the time to read my mail,

Aristid Breitkreuz

Aristid Breitkreuz

Apr 9, 2009, 10:57:25 AM
to serv...@googlegroups.com
Aristid Breitkreuz wrote:

>> * ByteString(string, codecModuleId)
>>
>> The ByteString object has the following methods:
>>
>> * encode(string, codecModuleId)
>>
>
> I'm not sure if that really belongs in there. I think this really
> should be the job of the Encodings library. But if this would be a
> convenience wrapper around Encodings, well, OK.
>

Now that I think about it, support for UTF-8 and UTF-16 strings should
probably be there.

I'd propose the following API:

Byte{String,Array}.fromUtf{8,16}(string)
Byte{String,Array}.prototype.asUtf{8,16}()

Well, or even... maybe the whole encodings thing is not so bad, because
it takes strings, while Encodings would work on ByteStrings / ByteArrays
exclusively, probably.

Kris Kowal

Apr 10, 2009, 3:36:38 AM
to serv...@googlegroups.com
On Thu, Apr 9, 2009 at 7:57 AM, Aristid Breitkreuz
<aristid.b...@gmx.de> wrote:
>
> Aristid Breitkreuz wrote:
>>>    * ByteString(string, codecModuleId)
>>>
>>> The ByteString object has the following methods:
>>>
>>>    * encode(string, codecModuleId)
>>>
>>
>> I'm not sure if that really belongs in there. I think this really
>> should be the job of the Encodings library. But if this would be a
>> convenience wrapper around Encodings, well, OK.

The intent is certainly for it to be a convenience wrapper for another
module, whether that is an "encodings" module, a "codec" module, or a
variety of extensible codec modules. I recommend "codec" for the module
name. George Moschovitis has already started using the codec/*
namespace for encoder/decoder modules.

>>
>
> Now that I think about it, support for UTF-8 and UTF-16 strings should
> probably be there.
>
> I'd propose the following API:
>
> Byte{String,Array}.fromUtf{8,16}(string)
> Byte{String,Array}.prototype.asUtf{8,16}()
>

Rather than favor any particular encoding, I envisioned this being:

Byte{String,Array}.decode('UTF-{8,16}')
Byte{String,Array}.prototype.encode('UTF-{8,16}')

But we definitely need to favor some canonical form. In light of Ash
Berlin's observation that Strings in Spidermonkey follow the erroneous
assumption that all unicode characters can be expressed in 16 bits (he
mentioned 21 bits actually?), I think we should consider using Arrays
of Numbers for the canonical form for pure-js. Converting a String to
this form is trivial with forEach and charCodeAt or fromCharCode.

String.prototype.charCodes() --> Array of Numbers
String.fromCharCodes(Array of Numbers) --> String

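// Note: charCodeAt returns 16-bit UTF-16 code units, so characters
// outside the BMP show up here as two surrogate values.
String.prototype.charCodes = function () {
    var codes = [];
    for (var i = 0; i < this.length; i++) {
        codes[i] = this.charCodeAt(i);
    }
    return codes;
};

// Inverse of charCodes: build a String from an Array of Numbers.
String.fromCharCodes = function (codes) {
    var chars = [];
    for (var i = 0; i < codes.length; i++) {
        chars[i] = String.fromCharCode(codes[i]);
    }
    return chars.join('');
};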

An array of numbers would be an acceptable internal buffer for a
pure-js transcoder using orthogonal encoders and decoders from
separate modules, in the case that a lower-level, C implementation for
that (source, target) encoding pair is not available. The useful
interface for getting a transcoder in the API should make the choice.

exports.transcoder = function (source, target) {
    if (lowLevel.available(source, target)) {
        return lowLevel.Transcoder(source, target);
    } else {
        // Fall back to pure-js codec modules for this pair.
        var decode = require('codec/' + source).decode;
        var encode = require('codec/' + target).encode;
        return TranscoderAdapter(decode, encode);
    }
};

This presumes that encode and decode in the given modules translate
between ByteStrings and ByteArrays and the canonical Array of Number
form. So, encoding.decode(encoded:ByteString) --> decoded:Array of
Number and encoding.encode(decoded:Array of Number) -->
encoded:ByteString.
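
To make the adapter idea concrete, here is a sketch under loud assumptions: bytes are plain Arrays of Numbers standing in for ByteString, the two codec objects below are hypothetical stand-ins for codec/* modules, and TranscoderAdapter is only one possible shape for that name.

```javascript
// Hypothetical stand-in for a "codec/iso-8859-1" module: each byte maps
// directly to the code point with the same value.
var latin1 = {
    decode: function (bytes) { return bytes.slice(); }
};

// Hypothetical stand-in for a "codec/utf-8" module: code points -> bytes.
var utf8 = {
    encode: function (codes) {
        var bytes = [];
        codes.forEach(function (c) {
            if (c < 0x80) {
                bytes.push(c);
            } else if (c < 0x800) {
                bytes.push(0xC0 | (c >> 6), 0x80 | (c & 0x3F));
            } else if (c < 0x10000) {
                bytes.push(0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F),
                           0x80 | (c & 0x3F));
            } else {
                bytes.push(0xF0 | (c >> 18), 0x80 | ((c >> 12) & 0x3F),
                           0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F));
            }
        });
        return bytes;
    }
};

// Compose a decoder and an encoder through the canonical
// Array-of-code-points form.
function TranscoderAdapter(decode, encode) {
    return function (bytes) {
        return encode(decode(bytes));
    };
}

var latin1ToUtf8 = TranscoderAdapter(latin1.decode, utf8.encode);
var transcoded = latin1ToUtf8([0x63, 0xE9]); // "c" followed by e-acute
```

The intermediate array is the cost of the pure-js path; a native transcoder for the same pair could skip it.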

The only trouble here is that decode is usually expected to return a
String, the accepted Unicode canonical form. Perhaps Array of Numbers
could be a special array type that has a .toString() that attempts to
shove all the Numbers into characters, and throws an error if
they're outside of the 16-bit limitation of the platform.
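
A rough sketch of such a type, assuming a plain wrapper object rather than a true Array subtype; the name CodePoints is hypothetical.

```javascript
// Hypothetical wrapper around an Array of code point Numbers.
function CodePoints(codes) {
    this.codes = codes.slice();
}

// Convert to a String, refusing any value that does not fit in 16 bits.
CodePoints.prototype.toString = function () {
    return this.codes.map(function (code) {
        if (code < 0 || code > 0xFFFF) {
            throw new RangeError('code point out of 16-bit range: ' + code);
        }
        return String.fromCharCode(code);
    }).join('');
};

var greeting = String(new CodePoints([72, 105]));
```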

By the way, I recommend the term "Transcoder" instead of "Converter".
Using this name helped clarify the intent of the type, particularly
that it was for translating bytes to bytes, not bytes to strings and
strings to bytes. Could you illuminate the trade-offs between the two
approaches you've proposed (stream-like and not)? If one is more
performant than the other and the less performant can be implemented
in terms of it, I think it would be acceptable to use both as long as
the more useful albeit less performant one had the more convenient
name.

Thanks for bringing this up! Wes Garland, in particular, brought up
the issue of transcoding early on and we're ecstatic you've joined the
discussion, especially since the discussion comes on the heels of the
file IO proposal with which there probably could be synergy. I'd like
to see these abstractions:

* EncoderInputStream(encodingName:String, source:InputStream)
* DecoderInputStream(encodingName:String, source:ByteInputStream)
* EncoderOutputStream(encodingName:String, target:ByteOutputStream)
* DecoderOutputStream(encodingName:String, target:OutputStream)


Kris Kowal

Aristid Breitkreuz

Apr 10, 2009, 5:14:36 AM
to serv...@googlegroups.com
Maybe a base class for ByteArray and ByteString would be good, it's a
nuisance to always write "ByteString or ByteArray". :-D

Kris Kowal wrote:


> The intent is certainly for it to be a convenience wrapper for another
> module, albeit an "encodings" module, "codec" module, or a variety of
> extensible codec modules. I recommend "codec" for the module name.
> George Moschovitis has already started using the codec/* namespace for
> encoder/decoder modules.
>

I must admit that I don't understand how this would work. But I have
some ideas how multiple encodings modules could be made to work
together, if that's necessary.

>> Now that I think about it, support for UTF-8 and UTF-16 strings should
>> probably be there.
>>
>> I'd propose the following API:
>>
>> Byte{String,Array}.fromUtf{8,16}(string)
>> Byte{String,Array}.prototype.asUtf{8,16}()
>>
>>
>
> Rather than favor any particular encoding, I envisioned this being:
>

There is a reason for favoring particular encodings: UTF-8 and UTF-16
(well, at least the base plane) are natively supported by Spidermonkey.
Though I don't know about other engines.

> Byte{String,Array}.decode('UTF-{8,16}')
> Byte{String,Array}.prototype.encode('UTF-{8,16}')
>
> But we definitely need to favor some canonical form. In light of Ash
> Berlin's observation that Strings in Spidermonkey follow the erroneous
> assumption that all unicode characters can be expressed in 16 bits (he
> mentioned 21 bits actually?),

Yes, 21 bits for full Unicode. But I must add that also Windows and Java
make the same error.

Decode and encode as names confuse me. Decode means from arbitrary
encoding ByteString to UTF-16 Javascript string and Encode means from
UTF-16 Javascript string to arbitrary encoding ByteString?

> I think we should consider using Arrays
> of Numbers for the canonical form for pure-js. Converting a String to
> this form is trivial with forEach and charCodeAt or fromCharCode.
>

You mean Array of Unicode Codepoints?

> String.prototype.charCodes() --> Array of Numbers
> String.fromCharCodes(Array of Numbers) --> String
>
> String.prototype.charCodes = function () {
> var codes = [];
> for (var i = 0; i < this.length; i++) {
> codes[i] = this.charCodeAt(i);
> }
> return codes;
> };
>
> String.fromCharCodes = function (codes) {
> var chars = [];
> for (var i = 0; i < codes.length; i++)
> chars[i] = String.fromCharCode(codes[i]);
> return chars.join('');
> };
>
> An array of numbers would be an acceptable internal buffer for a
> pure-js transcoder using orthogonal encoders and decoders from
> separate modules, in the case that a lower-level, C implementation for
> that (source, target) encoding pair is not available. The useful
> interface for getting a transcoder in the API should make the choice.
>
> exports.transcoder = function (source, target) {
> if (lowLevel.available(source, target)) {
> return lowLevel.Transcoder(source, target);
> } else {
> var decode = require('codec/' + source).decode;
> var encode = require('codec/' + target).encode;
> return TranscoderAdapter(decode, encode);
> }
> };
>

I agree that having a way for pure-js to implement the whole scheme is a
good thing.

We should actively design the "searching for encodings" scheme. I'll
think a little about it.

> This presumes that encode and decode in the given modules translate
> between ByteStrings and ByteArrays and the canonical Array of Number
> form. So, encoding.decode(encoded:ByteString) --> decoded:Array of
> Number and encoding.encode(decoded:Array of Number) -->
> encoded:ByteString.
>
> The only trouble here is that decode is usually expected to return a
> String, the accepted unicode canonical form. Perhaps Array of Numbers
> could be a special array type that has a .toString() that attempts to
> shove all the Numbers into a characters, and throws an error if
> they're outside of the 16 bit limitation of the platform.
>

Sounds hacky.

> By the way, I recommend the term "Transcoder" instead of "Converter".
> Using this name helped clarify the intent of the type, particularly
> that it was for translating bytes to bytes, not bytes to strings and
> strings to bytes. Could you illuminate the trade-offs between the two
> approaches you've proposed (stream-like and not)? If one is more
> performant than the other and the less performant can be implemented
> in terms of it, I think it would be acceptable to use both as long as
> the more useful albeit less performant one had the more convenient
> name.
>

OK, I changed the name.

Push-only is more performant because it needs less copying and it also
gets rid of an ancillary buffer (it only needs a buffer for bytes that
could not be transcoded, which are usually only 1 to 3 bytes) and I
think ByteStringStreams (with encodings support) fully replace the other
form.
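
The leftover-bytes buffer can be illustrated with a small push-style UTF-8 decoder. This is only a sketch under assumptions: Utf8PushDecoder and its sink callback are hypothetical names, bytes are plain Arrays of Numbers, continuation bytes are not validated, and the concat call copies for brevity where a real implementation would avoid it.

```javascript
// Number of bytes in a UTF-8 sequence, judged from its lead byte.
function sequenceLength(lead) {
    if (lead < 0x80) return 1;
    if (lead >= 0xC0 && lead < 0xE0) return 2;
    if (lead >= 0xE0 && lead < 0xF0) return 3;
    if (lead >= 0xF0 && lead < 0xF8) return 4;
    throw new RangeError('invalid UTF-8 lead byte: ' + lead);
}

// Hypothetical push-only decoder: decoded code points go to sink, and
// only the bytes of an incomplete trailing sequence are retained.
function Utf8PushDecoder(sink) {
    this.sink = sink;
    this.pending = []; // at most 3 bytes
}

Utf8PushDecoder.prototype.push = function (chunk) {
    var bytes = this.pending.concat(chunk);
    var codes = [];
    var i = 0;
    while (i < bytes.length) {
        var n = sequenceLength(bytes[i]);
        if (i + n > bytes.length) break; // incomplete: wait for more input
        var code = n === 1 ? bytes[i] : bytes[i] & (0xFF >> (n + 1));
        for (var k = 1; k < n; k++) {
            code = (code << 6) | (bytes[i + k] & 0x3F);
        }
        codes.push(code);
        i += n;
    }
    this.pending = bytes.slice(i);
    if (codes.length) this.sink(codes);
};

var received = [];
var decoder = new Utf8PushDecoder(function (codes) {
    received = received.concat(codes);
});
decoder.push([0xE2, 0x82]); // first two bytes of U+20AC
var pendingAfterFirstPush = decoder.pending.length;
decoder.push([0xAC, 0x41]); // final byte of U+20AC, then "A"
```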

> Thanks for bringing this up! Wes Garland, in particular, brought up
> the issue of transcoding early on and we're ecstatic you've joined the
> discussion, especially since the discussion comes on the heels of the
> file IO proposal with which there probably could be synergy. I'd like
> to see these abstractions:
>
> * EncoderInputStream(encodingName:String, source:InputStream)
> * DecoderInputStream(encodingName:String, source:ByteInputStream)
> * EncoderOutputStream(encodingName:String, target:ByteOutputStream)
> * DecoderOutputStream(encodingName:String, target:OutputStream)
>

What about "Transcoding" streams?

Aristid Breitkreuz

Hannes Wallnoefer

Apr 10, 2009, 3:14:02 PM
to serverjs
On Apr 8, 10:08 am, Kris Kowal <cowbertvon...@gmail.com> wrote:
I generally like the Binary/B proposal. A few notes:

- I'm not convinced we need two classes, one mutable and one
immutable. IMO a single ByteArray class would do the trick, mutable
and possibly growable. An option for an immutable byte array would be
to introduce a freeze() method. We'll have Object.freeze() in ES 3.1
anyway.

- codecModuleId should not default to UTF-8. Instead it should default
to the platform's default/native encoding. This is because when you
read a file on your local platform, you should be able to convert it
to a string without having to explicitly ask the system what encoding
that may be.

- I find your use of decode/encode confusing. To me, the decoded state
is the byte array state, and the encoded state is the String state, so
the static method would be decode and the instance method encode. Is
this just me?

- substring(), substr() - two methods that do virtually the same thing
with slightly different semantics. Too confusing IMO. Related to that,
there's no slice() in ByteString, although both String and Array have
it and it's arguably more powerful than substring for the way it
handles negative arguments.
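
For reference, the difference on ordinary Strings looks like this (standard String behaviour, shown only to illustrate the point about negative arguments):

```javascript
var text = 'abcdef';
var results = {
    // slice counts negative indexes from the end.
    sliceTail: text.slice(-2),          // 'ef'
    sliceMiddle: text.slice(1, -1),     // 'bcde'
    // substring treats negative indexes as 0...
    substringNegative: text.substring(-2),  // 'abcdef'
    // ...and swaps its arguments when the first is larger.
    substringSwapped: text.substring(4, 1)  // 'bcd'
};
```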

Hannes

> Kris Kowal

Kris Kowal

Apr 11, 2009, 12:57:54 AM
to serv...@googlegroups.com
On Wed, Apr 8, 2009 at 10:38 AM, Ash Berlin <ash_g...@firemirror.com> wrote:
> On 8 Apr 2009, at 09:08, Kris Kowal wrote:

> As a general rule, i think having a toString method on any binary
> class that tries to deal with encoding is asking for trouble, unless
> it just returns something like "[Binary len: x]" when called without
> any parameters. Think of the case when using a REPL or print to debug
> and you do print(thisObjectThatIJustGot); you don't really want that
> doing anything smarter than telling you that its a blob.

We could make it opaque and I'd be fine with that.

> Binary B:
>
>     'All platforms support two types for interacting with binary
> data: ByteArray and ByteString.' Is this currently or 'its is proposed
> that they do'?

I got lazy with the verbiage. I hate having to change the voice and
tense of a document as its status changes. Waste of time, or I'm just
lazy. None of the proposals are final or frozen. Some are getting
sticky, but that's about it.

>     ByteString.encode: Generally yes, I like this approach. what
> format would codecModuleId take. I would suggest http://www.iana.org/assignments/character-sets
>  (aka the same type you see in mimitypes/content-type headers etc).
> Okay so you allude to this at the end of your proposal.

Yeah, sounds fine.

>     ByteArray.proto.compress: Does this compress inp-lace, return a
> BA or a BS. This same question goes for the method on ByteString, but
> then it more obvious that it returns a new ByteString. Should be
> documented tho.

I'm not attached to this one. I borrowed it from one of the prior
arts, but am not sure which one. I'm not sure what it means. If
someone has an idea, I'll keep it around, but it'll probably get
culled in the next draft. I'm thinking that the encode/decode (or
alternate names thereof) functions could pass their variadic arguments
to the underlying encoder/decoder system, so they could effectively be
used for a wider variety of transforms like compression and encryption
without having to supply all sorts of names.

>     ByteArray.proto.push: Whats the behaviour when you push a number
>  >= 256. Dies? If so it should say so. (its the only sane behaviour i
> can think of fwiw)

Yeah, let's have these constructors and members throw errors if you
try to set a value out of the type's bounds. That'll help us find
errors early in general. I could see an argument for 0xFF masking,
but I'd like someone else to make that argument if that's in their
heart. I also think they should be unsigned, so negative would be out
of bounds, not wrapped, if we go with the throw behavior, or just
masked as a signed negative otherwise.
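
The two policies side by side, as a sketch; checkByte and maskByte are hypothetical helper names, not part of the proposal.

```javascript
// Throwing policy: accept only integers in the unsigned range 0..255.
function checkByte(value) {
    if (typeof value !== 'number' || value % 1 !== 0 ||
            value < 0 || value > 255) {
        throw new RangeError('not a byte: ' + value);
    }
    return value;
}

// Masking policy: silently keep the low 8 bits.
function maskByte(value) {
    return value & 0xFF;
}

var maskedOverflow = maskByte(256); // wraps to 0
var maskedNegative = maskByte(-1);  // wraps to 255
```

Masking hides the kind of mistake that the throwing policy would surface at the call site.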

>     Array.proto.toByte*: Yes, so long as those are [[dontenum]].

Yes. Dontenum would be required.

> Compression modules should also support a decompress.

I'm inclined to leave that out now.

> Now onto Encoding.
>
> Dealing with character sets is hard. Lots of languages get it subtly
> or completely wrong. Spidermonkey gets it wrong in that it can't (to
> my knowledge) support unicode characters with a code point greater
> than 65535 - i.e. outside of the Basic Multi Lingual Plane. This
> royally screws over CJK speakers (completely ignoring the whole Han
> unification issue.) but thats kinda a side issue and nothing to do
> with this proposal.

Well, it could be. We have a choice of whether to use String objects
as canonical Unicode and require Spidermonkey to get its act together.
This is the idealistic option. With minor inconvenience, we could
use Arrays of codepoint Numbers instead.

> Well your comment about conventions implies the need for an Encoding
> module/namespace. Can I suggest that we use 'utf-8' instead of 'utf8'
> so that it matches the IANA list <http://www.iana.org/assignments/character-sets

Sounds good to me.

> Big first post eh? Hi everyone, well done for caring enough to read
> this far, I've just found out about ServerJS project and I like its
> aims. I'm involved in http://flusspferd.org.

I'm guilty of longer. Keep them coming :-)

Thanks,
Kris Kowal

Kris Kowal

Apr 11, 2009, 1:07:47 AM
to serv...@googlegroups.com
On Fri, Apr 10, 2009 at 12:14 PM, Hannes Wallnoefer <han...@gmail.com> wrote:
> - I'm not convinced we need two classes, one mutable and one
> immutable. IMO a single ByteArray class would do the trick, mutable
> and possibly growable. An option for immutable byte array be to
> introduce a freeze() method. We'll have Object.freeze() in ES 3.1
> anyway.

The purpose of a ByteString is to produce a type that is duck-type
compatible with existing algorithms, just as ByteArray can be an Array
duck-type. If that's not of value, we can ditch it. I personally
like the lucid dichotomy as it's instructive as to what I can expect
from each. If I have a question about ByteArray, I can ask myself
what an Array would do in its place. For example, "Are ByteArrays
growable?" Well, Arrays are, so sure they are. Can they be
initialized with a particular size? Well, Arrays can, so sure they
can. There are ways in which they definitely need to be different,
like for performance they should only store numbers that fit in a
byte, so default values within the allocation range should be 0
initially instead of undefined, but these are minor deviations.
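
A small sketch of that deviation, using a plain Array as a stand-in for ByteArray; resizeBytes is a hypothetical helper.

```javascript
// Resize a byte array; any newly exposed entries are filled with 0.
function resizeBytes(bytes, newLength) {
    var result = bytes.slice(0, newLength);
    while (result.length < newLength) {
        result.push(0);
    }
    return result;
}

var plain = new Array(3);             // entries read as undefined
var grown = resizeBytes([1, 2], 4);   // [1, 2, 0, 0]
var shrunk = resizeBytes([1, 2, 3], 2); // [1, 2]
```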

> - codecModuleId should not default to UTF-8. Instead it should default
> to the platform's default/native encoding. This is because when you
> read a file on your local platform, you should be able to convert it
> to a string without having to explicitly ask the system what encoding
> that may be.

Sounds good to me.

> - I find your use of decode/encode confusing. To me, the decoded state
> is the byte array state, and the encoded state is the String state, so
> the static method would be decode and the instance method encode. Is
> this just me?

Hmmm. It could be that I've got them mixed up in my head. You're not
the only one to comment that encode/decode are confusing, and I've had
to scratch my head when using them in Python too. This is probably a
point where better names, albeit more verbose names, will be
appropriate. I think toString(encoding), toByteArray(encoding), and
toByteString(encoding) would be acceptable, presuming that toString's
default encoding, as you suggest, is the system's default encoding or
is required. Alternately calling with no argument could throw an
error; there has been some support for that.

To potentially address Ash's concern with toString operating on
arbitrary blobs with a default encoding, a REPL really ought to use
toSource instead of toString. That being said, he's probably right
that toString should always produce a meaningful representation of the
content of the string, so imposing an encoding might not be
appropriate.

Perhaps Byte*.toString() would return "[Byte* length=n]" and
.toString(encoding) would return something encoding appropriate.


> - substring(), substr() - two methods that do virtually the same with
> slightly different semantics. Too confusing IMO. Related to that,
> there's no slice() in ByteString, although both String and Array have
> it and it's more arguably more powerful than substring for the way it
> handles negative arguments.

I'm inclined to include substring() and substr() in ByteString since
they're in String, even though I never use them and find them obtuse
and confusing, so that they may be passed as String duck-types to
algorithms expecting Strings with those functions. slice() should
definitely be in there; that was an accidental omission.

I'm going to redraft the proposal at some point to account for these ideas.

Kris Kowal

Kris Kowal

Apr 11, 2009, 1:31:34 AM
to serv...@googlegroups.com
On Thu, Apr 9, 2009 at 7:42 AM, Aristid Breitkreuz
<aristid.b...@gmx.de> wrote:
> On Apr 8, 10:08 am, Kris Kowal <cowbertvon...@gmail.com> wrote:

>>    * ByteString(string, codecModuleId)
>>
>> The ByteString object has the following methods:
>>
>>    * encode(string, codecModuleId)
>
> I'm not sure if that really belongs in there. I think this really
> should be the job of the Encodings library. But if this would be a
> convenience wrapper around Encodings, well, OK.

Yes, it would definitely be a shorthand for corresponding operations
provided by an "encoding" or "codecs" module.

>>    * toString(codecModuleId)
>
> Please not! I don't like toString() taking parameters. I think
> toString() should simply return something like "[ByteString 4096]",
> where 4096 is the length.

Right, what do you think of the idea:

'' + ByteString([0, 0]) == "[ByteString 2]"
ByteString([0, 0]).toString("UTF-8") == "\0\0"
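
A sketch of how that overload could work. The assumptions are substantial: this ByteString is a toy wrapper over an Array of Numbers, and the standard TextDecoder global stands in for whatever codec module the proposal ends up delegating to.

```javascript
// Toy ByteString: callable with or without "new".
function ByteString(bytes) {
    if (!(this instanceof ByteString)) return new ByteString(bytes);
    this.bytes = bytes.slice();
    this.length = this.bytes.length;
}

ByteString.prototype.toString = function (codec) {
    // No argument: opaque description, safe for print() and REPLs.
    if (codec === undefined) {
        return '[ByteString ' + this.length + ']';
    }
    // With a codec name: decode the bytes (TextDecoder is an assumption
    // standing in for the proposal's codec modules).
    return new TextDecoder(codec).decode(Uint8Array.from(this.bytes));
};

var opaque = '' + ByteString([0, 0]);
var decoded = ByteString([104, 105]).toString('UTF-8');
```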

>>    * decode(codecModuleId)
>>    * hash(digestModuleId)
>>    * compress(compressionModuleId)
>
> Again, I'm not sure if this belongs in here.

Yeah, compress in particular seems out of place. Hash I could buy,
but again, all of these would definitely have to be shorthands for
behavior in "compress" and "hash" modules as yet undefined.

>>    * indexOf(Number or ByteString)
>>    * lastIndexOf(Number or ByteString)
>
> Yeah. But also an indexOf(Number or ByteString, lowestBoundForIndex)
> please.

Sure. How about an exclusive upper bound too?

ByteString.prototype.indexOf(byte:ByteString|ByteArray|Number, first, last)
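
As a sketch of those bounds, assuming bytes in a plain Array and a search for a single Number only (the ByteString and ByteArray needle cases are left out); indexOfByte is a hypothetical name.

```javascript
// Search bytes[first..last) for a byte value; last is exclusive.
function indexOfByte(bytes, byte, first, last) {
    var start = first === undefined ? 0 : first;
    var end = last === undefined ? bytes.length : Math.min(last, bytes.length);
    for (var i = start; i < end; i++) {
        if (bytes[i] === byte) return i;
    }
    return -1;
}

var data = [1, 2, 3, 2];
var firstMatch = indexOfByte(data, 2);        // 1
var fromIndexTwo = indexOfByte(data, 2, 2);   // 3
var boundedMiss = indexOfByte(data, 2, 2, 3); // -1
```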

>>    * charAt(offset) -> ByteString
>>    * charCodeAt(offset) -> Number
>>    * byteAt(offset) -> Number (same as charCodeAt)
>
> Is "char" really an appropriate identifier there? I mean it's a _byte_
> string and not a _text_ string.

No, it's not, but in the interest of duck-typing, I'd recommend
keeping it around.

>>    * split(Number or ByteString) -> Array of ByteStrings
>
> This would split around a boundary right? Okay, some additions:
>
> * add an option for specifying whether the boundary itself should be
> included in the result array,
> * it should be possible to specify multiple boundaries.

If any additional argument should be accepted for split, I think it
should be the same interface as Python, where it specifies the maximum
number of delimiters to break on. So, "a, b, c".split(', ', 1) would
return ["a", "b, c"]. Boundary inclusive operations would be better
as the purview for streams. Multiple boundaries are usually served by
regular expressions for Strings. I agree that such things should be
possible, but we should probably defer adding something like byte
regexes.
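
Note that the second argument of JavaScript's own String.prototype.split is a limit on the number of results, so "a, b, c".split(', ', 1) returns ["a"]. The Python-style behaviour would need something like this sketch, where splitMax is a hypothetical name and the delimiter is assumed to be non-empty:

```javascript
// Split on at most maxSplits delimiters; the remainder stays intact.
function splitMax(text, delimiter, maxSplits) {
    var parts = [];
    var start = 0;
    var index;
    while (parts.length < maxSplits &&
            (index = text.indexOf(delimiter, start)) !== -1) {
        parts.push(text.slice(start, index));
        start = index + delimiter.length;
    }
    parts.push(text.slice(start));
    return parts;
}

var pythonStyle = splitMax('a, b, c', ', ', 1); // ["a", "b, c"]
var builtIn = 'a, b, c'.split(', ', 1);         // ["a"]
```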

>>    * substring(first, last) or substring(first) to the end
>>    * substr(first, length) or substr(length)
>
> Slightly confusing.
>
> String.prototype specifies:
>
> substring(indexA, [indexB])
> substr(start[, length]);
>
> We should go with that, then.

I'll make sure that the definitions for ByteString match those for
String. That's the intent.
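For reference, the difference between the two String methods, followed by a sketch of byte equivalents over plain arrays. The names byteSubstring and byteSubstr are hypothetical, and the sketch covers only integer arguments:

```javascript
const sample = "abcdef";
const viaSubstring = sample.substring(1, 3); // from index 1 up to, not including, 3
const viaSubstr = sample.substr(1, 3);       // from index 1, three characters
const toEnd = sample.substring(2);           // second argument omitted: to the end

// Hypothetical byte versions. substring clamps both indices to the valid
// range and swaps them if they are given in the wrong order.
function byteSubstring(bytes, indexA, indexB = bytes.length) {
  const clamp = (n) => Math.min(Math.max(n, 0), bytes.length);
  const a = clamp(indexA);
  const b = clamp(indexB);
  return bytes.slice(Math.min(a, b), Math.max(a, b));
}

// substr takes a start and a length; a negative start counts from the end.
function byteSubstr(bytes, start, length = bytes.length) {
  const from = start < 0 ? Math.max(bytes.length + start, 0) : start;
  return bytes.slice(from, from + Math.max(length, 0));
}

const six = [1, 2, 3, 4, 5, 6];
const swapped = byteSubstring(six, 3, 1); // same as byteSubstring(six, 1, 3)
const tail = byteSubstr(six, -2);         // last two bytes
```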

>>    * The + operator returning new ByteStrings
>
> Drop that. Can't be implemented everywhere.

As long as we have to dive down deep enough to replicate Array and
String behaviors for indexing, I think we should take up the challenge
of implementing the other operators that are supported by the
analogous types.

I personally favor creating abstractions that can be created with
pure-js and abandoning the unsubclassable, special, underlying base
types whenever possible, but the list favored the approach of
providing a low level interface since all serverjs platforms support
properties and such.

>>    * valueOf() returns itself
>
> I don't understand.

The value of a primitive is usually the primitive. 1..valueOf() == 1,
for example.

> OK. Push could also take an Array or a ByteArray or a ByteString,
> because that is not ambiguous...

Array.push and unshift are variadic. That at least should be
supported. We should consider adding an "extend" operation instead of
conflating the meaning of "push".
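A small sketch of the distinction, using a hypothetical SketchByteArray: push stays variadic and treats every argument as a single byte, while extend takes one sequence and appends its elements.

```javascript
// Hypothetical growable byte buffer; the names are for illustration only.
class SketchByteArray {
  constructor() {
    this.bytes = [];
  }

  // Variadic like Array.prototype.push: every argument is one byte.
  push(...values) {
    for (const value of values) {
      if (!Number.isInteger(value) || value < 0 || value > 255) {
        throw new RangeError("not a byte: " + value);
      }
      this.bytes.push(value);
    }
    return this.bytes.length; // Array.prototype.push returns the new length
  }

  // Takes one sequence of bytes and appends each of its elements.
  extend(sequence) {
    for (const value of sequence) {
      this.push(value);
    }
    return this.bytes.length;
  }
}

const buffer = new SketchByteArray();
buffer.push(1, 2);
buffer.extend([3, 4]);
```

With this split, push([3, 4]) is rejected as "not a byte" rather than being silently flattened, so the meaning of push stays the same as for Array.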

> Kind regards and thanks for taking the time to read my mail,
> Aristid Breitkreuz

Naturally,
Kris Kowal

Kris Kowal

Apr 11, 2009, 2:25:27 AM4/11/09
to serv...@googlegroups.com
I've updated Binary/B to reflect the comments made so far. The
salient points are: there is no default encoding, and the "encode" and
"decode" functions have been replaced with argument overloads for
toString, toArray, toByteString, and toByteArray, taking (), (codec),
or (sourceCodec, targetCodec) as appropriate for byte-for-byte
translation, decoding, encoding, and transcoding. It's explicitly
mentioned that these functions are shorthands for behavior provided by
a "codec" or "encodings" module, as Aristid Breitkreuz has begun
building. Encodings are defined to be IANA encoding names, which I
hope will coincide with modules in the codec/* module name-space, so
that the capabilities of the codec module can be extended by pluggable
encoding modules, even if it's at the expense of some performance when
those encodings are not hosted by a lower-level transcoder.
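One plausible reading of those overloads, sketched over plain arrays of bytes. Everything here is an assumption for illustration: the registry object stands in for codec/* modules, the function names toBytes and toText are hypothetical, and only two charsets are hand-written.

```javascript
// Hypothetical registry keyed by IANA charset name. In the proposal these
// would be pluggable modules; here they are plain objects.
const codecs = {
  "ISO-8859-1": {
    decode(bytes) {
      let text = "";
      for (const byte of bytes) text += String.fromCharCode(byte);
      return text;
    },
    encode(text) {
      const bytes = [];
      for (let i = 0; i < text.length; i++) {
        const unit = text.charCodeAt(i);
        if (unit > 0xff) throw new RangeError("not representable in ISO-8859-1");
        bytes.push(unit);
      }
      return bytes;
    },
  },
  "UTF-16BE": {
    decode(bytes) {
      if (bytes.length % 2 !== 0) throw new RangeError("odd number of bytes");
      let text = "";
      for (let i = 0; i < bytes.length; i += 2) {
        text += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
      }
      return text;
    },
    encode(text) {
      const bytes = [];
      for (let i = 0; i < text.length; i++) {
        const unit = text.charCodeAt(i);
        bytes.push(unit >> 8, unit & 0xff);
      }
      return bytes;
    },
  },
};

function lookup(name) {
  const codec = codecs[name];
  if (!codec) throw new Error("unknown charset: " + name);
  return codec;
}

// Bytes to bytes: () copies byte for byte, (sourceCodec, targetCodec) transcodes.
function toBytes(bytes, sourceCodec, targetCodec) {
  if (sourceCodec === undefined) return bytes.slice();
  if (targetCodec === undefined) throw new Error("transcoding needs two charsets");
  return lookup(targetCodec).encode(lookup(sourceCodec).decode(bytes));
}

// Bytes to text: (codec) decodes. There is no default encoding.
function toText(bytes, codec) {
  if (codec === undefined) throw new Error("a charset is required");
  return lookup(codec).decode(bytes);
}

const latin1 = [0x48, 0xe9];                                // "Hé" in ISO-8859-1
const copied = toBytes(latin1);                             // byte-for-byte
const utf16 = toBytes(latin1, "ISO-8859-1", "UTF-16BE");    // transcoding
const decodedText = toText(latin1, "ISO-8859-1");           // decoding
```

Adding a charset in this sketch means adding one entry to the registry, which mirrors the idea of extending the codec module with pluggable encoding modules.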

https://wiki.mozilla.org/ServerJS/Binary/B

Also, I would appreciate it if someone would take ownership of
proposal B or its successor, perhaps Aristid. This week, my zeal for
this project finally began to impact my effectiveness at my day-job,
so I'm planning to pull back to just keeping my finger on the pulse of
modules and the securability and usability dialectic of the File API.

Kris Kowal

Aristid Breitkreuz

Apr 11, 2009, 7:51:28 AM4/11/09
to serv...@googlegroups.com
Hi,

Kris Kowal wrote:


> On Thu, Apr 9, 2009 at 7:42 AM, Aristid Breitkreuz
> <aristid.b...@gmx.de> wrote:
>
>> On Apr 8, 10:08 am, Kris Kowal <cowbertvon...@gmail.com> wrote:
>>
>
>
> Yes, it would definitely be a shorthand for corresponding operations
> provided by an "encoding" or "codecs" module.
>

Codecs? Does that name imply that more than just character encodings
will be supported?

>>> * toString(codecModuleId)
>>>
>> Please not! I don't like toString() taking parameters. I think
>> toString() should simply return something like "[ByteString 4096]",
>> where 4096 is the length.
>>
>
> Right, what do you think of the idea:
>
> '' + ByteString([0, 0]) == "[ByteString 2]"
> ByteString([0, 0]).toString("UTF-8") == "\0\0"
>

The problem here is that one name (toString) is used for two completely
different functions.


>>> * indexOf(Number or ByteString)
>>> * lastIndexOf(Number or ByteString)
>>>
>> Yeah. But also an indexOf(Number or ByteString, lowestBoundForIndex)
>> please.
>>
>
> Sure. How about an exclusive upper bound too?
>
> ByteString.prototype.indexOf(byte:ByteString|ByteArray|Number, first, last)
>

Well, I don't really object to that, but it isn't really necessary OTOH.

>>> * charAt(offset) -> ByteString
>>> * charCodeAt(offset) -> Number
>>> * byteAt(offset) -> Number (same as charCodeAt)
>>>
>> Is "char" really an appropriate identifier there? I mean it's a _byte_
>> string and not a _text_ string.
>>
>
> No, it's not, but in the interest of duck-typing, I'd recommend
> keeping it around.
>

Should byte strings really be, or need to be, 100% duck-type compatible
with Strings? Calling charAt on ByteString doesn't seem particularly
safe to me.

>>> * split(Number or ByteString) -> Array of ByteStrings
>>>
>> This would split around a boundary right? Okay, some additions:
>>
>> * add an option for specifying whether the boundary itself should be
>> included in the result array,
>> * it should be possible to specify multiple boundaries.
>>
>
> If any additional argument should be accepted for split, I think it
> should be the same interface as Python, where it specifies the maximum
> number of delimiters to break on. So, "a, b, c".split(', ', 1) would
> return ["a", "b, c"]. Boundary-inclusive operations would be better
> left as the purview of streams. Multiple boundaries are usually served by
> regular expressions for Strings. I agree that such things should be
> possible, but we should probably defer adding something like byte
> regexes.
>

Additional parameters should be keyword parameters, for clarity. So,
byteString.split(boundary, {maxSplit: 1}), that is easier to understand.

Multiple boundaries without regular expressions are a very valid
use-case IMHO, and they can be more efficiently implemented than regular
expressions. There can be no ambiguity either (a JS-array of ByteStrings
or numbers is always a safe indicator that multiple boundaries are
desired), so I don't see a problem there.

Boundary inclusions as the purview of streams? To be honest I don't
understand how that could be implemented in terms of streams. What I
imagine is the following

ByteString([-1, 0, 1, 2, 3, 4, 5]).split([2, 5], {boundaryInclusion: true})
// => [ByteString([-1, 0, 1]), ByteString([2]), ByteString([3, 4]),
//     ByteString([5])]

How would that be implemented with streams? And why not implement it in
split additionally? It's not difficult to implement...
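A sketch of that behavior over plain arrays of numbers, to show it is indeed short. The helper name splitOnAny is hypothetical, it handles only single-byte boundaries, and it assumes empty pieces are dropped, which agrees with the example result above but is not stated there:

```javascript
// Hypothetical helper: split an array of bytes on any of several single-byte
// boundaries. Empty pieces are dropped (an assumption, not part of the proposal).
function splitOnAny(bytes, boundaries, options = {}) {
  const isBoundary = new Set(boundaries);
  const parts = [];
  let current = [];
  const flush = () => {
    if (current.length > 0) parts.push(current);
    current = [];
  };
  for (const byte of bytes) {
    if (isBoundary.has(byte)) {
      flush();
      // Optionally keep the boundary itself as its own piece.
      if (options.boundaryInclusion) parts.push([byte]);
    } else {
      current.push(byte);
    }
  }
  flush();
  return parts;
}

const input = [0, 1, 2, 3, 4, 5];
const withBoundaries = splitOnAny(input, [2, 5], { boundaryInclusion: true });
const withoutBoundaries = splitOnAny(input, [2, 5]);
```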

>>> * The + operator returning new ByteStrings
>>>
>> Drop that. Can't be implemented everywhere.
>>
>
> As long as we have to dive down deep enough to replicate Array and
> String behaviors for indexing, I think we should take up the challenge
> of implementing the other operators that are supported by the
> analogous types.
>
> I personally favor creating abstractions that can be created with
> pure-js and abandoning the unsubclassable, special, underlying base
> types whenever possible, but the list favored the approach of
> providing a low level interface since all serverjs platforms support
> properties and such.
>

I actually think that the + operator cannot be overloaded with
SpiderMonkey, which I use. At least not with public APIs. So _I_ cannot
implement this, which leads me to think that it should not be
standardised, and I hope you agree.
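To illustrate the limitation in plain JavaScript: the language converts both operands of + to primitives first, so script code cannot make + return a new object. A sketch with a hypothetical SketchBytes type, where an explicit concat method does the job instead:

```javascript
// Hypothetical stand-in, for illustration only.
class SketchBytes {
  constructor(bytes) {
    this.bytes = Array.from(bytes);
  }
  toString() {
    return "[ByteString " + this.bytes.length + "]";
  }
  // An ordinary method can return a new object; the + operator cannot.
  concat(other) {
    return new SketchBytes(this.bytes.concat(other.bytes));
  }
}

const left = new SketchBytes([1, 2]);
const right = new SketchBytes([3]);
const viaPlus = left + right;          // both sides become strings first
const viaConcat = left.concat(right);  // a new SketchBytes with three bytes
```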

>> OK. Push could also take an Array or a ByteArray or a ByteString,
>> because that is not ambiguous...
>>
>
> Array.push and unshift are variadic. That at least should be
> supported. We should consider adding an "extend" operation instead of
> conflating the meaning of "push".
>

I vote for the name "append". That method should definitely be there.
You are right that the name "push" should not be conflated. (The same
goes for "toString". :-P)

Aristid Breitkreuz

Apr 11, 2009, 8:25:48 AM4/11/09
to serv...@googlegroups.com
Kris Kowal wrote:

> Also, I would appreciate it if someone would take ownership of
> proposal B or its successor, perhaps Aristid. This week, my zeal for
> this project finally began to impact my effectiveness at my day-job,
> so I'm planning to pull back to just keeping my finger on the pulse of
> modules and the securability and usability dialectic of the File API.
>

OK, I have already begun improving the non-controversial parts of the
wiki page. Too bad that this means fewer eyes ensuring a good API.

I hope that dangoor will start the github specs repo soon, because that
would be more convenient than the MozWiki IMHO.
