Comments on Binary object

Cameron McCormack

Feb 6, 2009, 2:53:04 AM

to serv...@googlegroups.com

Hi.

Here are some comments on the Binary object proposal at
https://wiki.mozilla.org/ServerJS/Binary.

I’m assuming that this is the interface for host objects, is that
right?

I’d suggest using a property for length instead of a getLength()
function, for consistency with Array.

I’m not sure that toString() is the best function to use for
interpreting the bytes as UTF-8 and decoding them to a string. Not all
sequences of bytes are valid UTF-8, and it might be an inconvenience for
some Binary objects to throw an exception when stringifying. (I’m
assuming that an exception would be thrown if the bytes aren’t valid
UTF-8.)
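
To make the concern concrete: the standard TextDecoder found in present-day JavaScript runtimes (it postdates this thread) can be asked to reject malformed input. The sketch below assumes such a runtime; `decodeUtf8Strict` is a hypothetical name and not part of the Binary proposal.

```javascript
// Hypothetical helper, not part of the Binary proposal.
function decodeUtf8Strict(bytes) {
  // With fatal: true, decode() throws on malformed UTF-8 instead of
  // substituting U+FFFD replacement characters.
  return new TextDecoder("utf-8", { fatal: true }).decode(Uint8Array.from(bytes));
}

const greeting = decodeUtf8Strict([0x48, 0x69]); // valid UTF-8

let failed = false;
try {
  decodeUtf8Strict([0xff, 0xfe]); // 0xff can never appear in UTF-8
} catch (e) {
  failed = true;
}
```

If toString() behaved like this, any implicit string coercion of such a Binary would raise an exception.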

I think [[Get]] and [[Put]] should be used to access the bytes in the
object, so that it can be used similarly to Array objects.

Basically I see this object as a specialisation of Array, with some
extra functions on it.

I think Binary objects should come in a couple of different flavours:

* Completely immutable. Assignments to length or numeric properties
would be ignored. This would be useful for representing, say, the
contents of a file that has been opened for reading but not writing.

* Constant length, but with contents modifiable. Assignment to length
would be ignored. Certain Binary objects could represent fixed
amounts of data, like the CanvasPixelArray type in HTML 5.

* Length and contents modifiable. For other cases, like contents of
files that have been opened r/w.
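
A rough sketch of how the three flavours might behave, assuming a modern runtime with Proxy (not available when this thread was written). `makeBinary` and the flavour names are hypothetical, and only plain assignments are handled here; methods such as pop() would need further traps.

```javascript
// Hypothetical factory: wraps an array of byte values in a Proxy whose
// "set" trap decides which assignments are honoured.
function makeBinary(bytes, flavour) {
  const data = bytes.map((b) => b & 0xff);
  return new Proxy(data, {
    set(target, key, value) {
      const isIndex = typeof key === "string" && /^(0|[1-9]\d*)$/.test(key);
      if (flavour === "immutable") {
        return true; // ignore every assignment silently
      }
      if (flavour === "fixed") {
        // contents are writable, but only inside the existing length
        if (isIndex && Number(key) < target.length) target[key] = value & 0xff;
        return true;
      }
      // "growable": both length and contents may change
      target[key] = isIndex ? value & 0xff : value;
      return true;
    },
  });
}

const readOnly = makeBinary([1, 2, 3], "immutable");
readOnly[0] = 9;      // ignored
readOnly.length = 0;  // ignored

const fixed = makeBinary([1, 2, 3], "fixed");
fixed[1] = 300;       // stored as 300 & 0xff
fixed.length = 10;    // ignored

const growable = makeBinary([1, 2, 3], "growable");
growable[3] = 7;      // extends the data
```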

I don’t think toBinary() is named precisely enough. toUTF8()?

Not sure if base64encode, md5, sha1, base64encode belong on Binary.

Since you want pop(), push(), etc., I think you could consider making
Binary.prototype.prototype == Array.prototype. This should work if
[[Get]], [[Put]] length-as-property were used.

Thanks,

Cameron

--
Cameron McCormack ≝ http://mcc.id.au/

Ondrej Zara

Feb 6, 2009, 9:55:06 AM

to serv...@googlegroups.com

> I'd suggest using a property for length instead of a getLength()
> function, for consistency with Array.
>

Yes, several people suggested this and I completely agree.

> I don't think toBinary() is named precisely enough. toUTF8()?
>

It depends on what this function does :) In my sample interface, it
converts a "string" object to "binary" object, which is why I believe
the name is chosen properly. An argument can be used to specify a way
in which this should be done (UTF-8, ISO, win-1250, ...).
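
As an illustration of a conversion that takes an encoding argument, here is a minimal sketch written as a free function rather than a String.prototype method. It assumes Node.js, whose Buffer supports only a handful of encodings (utf8, latin1, utf16le and a few others, but not win-1250); the function name and default are assumptions.

```javascript
// Hypothetical stand-in for String.prototype.toBinary(encoding).
// Relies on Node.js Buffer, so only Buffer's encodings are available.
function toBinary(str, encoding = "utf8") {
  return Uint8Array.from(Buffer.from(str, encoding));
}

const asUtf8 = Array.from(toBinary("\u00e9"));              // "é" as UTF-8
const asLatin1 = Array.from(toBinary("\u00e9", "latin1"));  // "é" as ISO-8859-1
const asUtf16 = Array.from(toBinary("\u00e9", "utf16le"));  // "é" as UTF-16LE
```

The same one-character string yields two bytes, one byte, or two different bytes depending on the argument.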

> Not sure if base64encode, md5, sha1, base64encode belong on Binary.
>

I have already expressed the reason here:
http://groups.google.com/group/serverjs/msg/0243f943fffb542e

> Since you want pop(), push(), etc., I think you could consider making
> Binary.prototype.prototype == Array.prototype. This should work if
> [[Get]], [[Put]] length-as-property were used.
>

This is a nice idea. Not sure if subclassing an Array is possible in
all JS VMs though.

Thanks for comments,
Ondrej

Wes Garland

Feb 6, 2009, 11:42:50 AM

to serv...@googlegroups.com

> I'm assuming that this is the interface for host objects, is that
> right?

The immediate need is for data read from disk with a File object, but
could also be for host objects. IMO file data is best off returned as
binary and coerced into strings (or something else) as needed.

> I'm not sure that toString() is the best function to use for
> interpreting the bytes as UTF-8 and decoding them to a string.

toString() is the logical function to use for decode from Binary to a
String. If it can't be done, an exception should be thrown. toString
is logical because the js engine will automatically call it whenever
coercion to a string is needed.

That said, I don't recall the suggested spec saying that the
conversion would always be from UTF-8. I would logically expect them
to be just shoved into a widechar string with the high byte set zero.
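
A minimal sketch of that "high byte set to zero" conversion, with a hypothetical function name. Each byte becomes one UTF-16 code unit in the range 0 to 255, and no decoding takes place.

```javascript
// Hypothetical helper: widen each byte to a 16-bit code unit.
function bytesToWideString(bytes) {
  let out = "";
  for (const b of bytes) {
    out += String.fromCharCode(b & 0xff);
  }
  return out;
}

const ascii = bytesToWideString([0x48, 0x69]);    // "Hi"
const twoUnits = bytesToWideString([0xc3, 0xa9]); // the UTF-8 bytes of "é"
```

ASCII survives unchanged, while the two UTF-8 bytes of "é" come out as two separate characters (U+00C3 and U+00A9).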

That said, it would be interesting if the Binary constructor could
specify the underlying "binary type" so that toString could do
automatic conversions. When File.read returns a Binary object, it
could construct it such that the underlying type indicates that it's
utf-8 or whatever the file actually is. This would get the "right"
behaviour all the time, provided that the file opener got the file
type right in the first place.

> Basically I see this object as a specialisation of Array, with some
> extra functions on it.
>
> I think Binary objects should come in a couple of different flavours:
>
> * Completely immutable. Assignments to length or numeric properties
> would be ignored. This would be useful for representing, say, the
> contents of a file that has been opened for reading but not writing.
>
> * Constant length, but with contents modifiable. Assignment to length
> would be ignored. Certain Binary objects could represent fixed
> amounts of data, like the CanvasPixelArray type in HTML 5.
>
> * Length and contents modifiable. For other cases, like contents of
> files that have been opened r/w.

Your proposal that Binary behave like Array conflicts with the
immutability proposed above.

That said, I like immutable Binary, just like immutable String. But I
think the [] operator on Binary should be meaningful, just like it
used to be on String. On the other hand, writing Binary data to disk
would be a pain with immutable Binary. Can you imagine having to
create your Binary data as a String, concatenating immutable bits
together, converting to binary, and then writing?

Hmm. Okay, I don't like immutable binary. You should be able to
specify length and have it behave like Array. So, your third option
above.

> I don't think toBinary() is named precisely enough. toUTF8()?

In my mind, binary is raw, not UTF8. Otherwise, you need yet ANOTHER
type to represent raw data. We already have a native Unicode type
(String), I'm not sure adding a UTF8 data type really has value.

> Not sure if base64encode, md5, sha1, base64encode belong on Binary.
>
> Since you want pop(), push(), etc., I think you could consider making
> Binary.prototype.prototype == Array.prototype. This should work if
> [[Get]], [[Put]] length-as-property were used.

Can that actually be implemented?

My initial implementation thought for Binary is to hook the resolver,
when it comes time to look up a numbered property, index into a C char
array and return the right byte, wrapped in a new String.

length -> realloc
push -> length+=1, set into
pop -> length -= 1, return popped byte as new String

etc
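
The outline above could be approximated in plain JavaScript as follows. This is only a sketch under assumptions: a Uint8Array stands in for the C char array, an explicit get(i) method stands in for the hooked property resolver, and the class name is made up.

```javascript
// Hypothetical approximation of the outlined implementation.
class ByteBuffer {
  #buf = new Uint8Array(0);

  get length() {
    return this.#buf.length;
  }

  // Assigning length reallocates, keeping as many old bytes as fit.
  set length(n) {
    const next = new Uint8Array(n);
    next.set(this.#buf.subarray(0, Math.min(n, this.#buf.length)));
    this.#buf = next;
  }

  // Stand-in for numbered property lookup: the byte wrapped in a String.
  get(i) {
    if (i < 0 || i >= this.#buf.length) return undefined;
    return String.fromCharCode(this.#buf[i]);
  }

  push(byte) {
    this.length += 1;
    this.#buf[this.length - 1] = byte;
    return this.length;
  }

  pop() {
    if (this.length === 0) return undefined;
    const last = this.get(this.length - 1);
    this.length -= 1;
    return last;
  }
}

const buf = new ByteBuffer();
buf.push(0x41);
buf.push(0x42);
const popped = buf.pop(); // "B"
```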

--
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102

Kris Kowal

Feb 6, 2009, 1:19:29 PM

to serv...@googlegroups.com

On Fri, Feb 6, 2009 at 6:55 AM, Ondrej Zara <ondre...@gmail.com> wrote:
>> Since you want pop(), push(), etc., I think you could consider making
>> Binary.prototype.prototype == Array.prototype. This should work if
>> [[Get]], [[Put]] length-as-property were used.
>
> This is a nice idea. Not sure if subclassing an Array is possible in
> all JS VMs though.

Most (all but two, "concat" and another) of the array prototype
members are not native code, but rather generic members for array-like
objects, so you could ride that wave by copying them to your prototype
if prototypical inheritance doesn't work out.
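
A small sketch of that copying approach. The `Binary` constructor here is a bare array-like placeholder (an assumption, not the proposed host object), and the copied methods do no byte-range checking.

```javascript
// Placeholder array-like: numeric properties plus a length.
function Binary() {
  this.length = 0;
}

// Borrow generic Array.prototype methods; they only need "length"
// and indexed properties on the receiver.
for (const name of ["push", "pop", "shift", "unshift", "join", "indexOf"]) {
  Binary.prototype[name] = Array.prototype[name];
}

const b = new Binary();
b.push(1, 2, 3);
const last = b.pop();        // 3
const joined = b.join(",");  // "1,2"
const first = b.shift();     // 1
```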

Kris Kowal

Tom Robinson

Feb 6, 2009, 3:42:03 PM

to serv...@googlegroups.com

On Feb 6, 2009, at 8:42 AM, Wes Garland wrote:
>
>> I'm assuming that this is the interface for host objects, is that
>> right?
>
> The immediate need is for data read from disk with a File object, but
> could also be for host objects. IMO file data is best off returned as
> binary and coerced into strings (or something else) as needed.
>
>> I'm not sure that toString() is the best function to use for
>> interpreting the bytes as UTF-8 and decoding them to a string.
>
> toString() is the logical function to use for decode from Binary to a
> String. If it can't be done, an exception should be thrown. toString
> is logical because the js engine will automatically call it whenever
> coercion to a string is needed.

Except if the user thinks it's a string and does .length they'll get
the number of bytes, not the number of characters (which isn't always
the same)
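
A quick illustration of the mismatch, assuming a runtime that provides the standard TextEncoder:

```javascript
const text = "na\u00efve"; // "naïve"
const utf8 = new TextEncoder().encode(text);

const charCount = text.length; // UTF-16 code units in the string
const byteCount = utf8.length; // bytes in the UTF-8 encoding
```

Here charCount is 5 while byteCount is 6, because "ï" takes two bytes in UTF-8.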

> That said, I don't recall the suggested spec saying that the
> conversion would always be from UTF-8. I would logically expect them
> to be just shoved into a widechar string with the high byte set zero.

So anything except ASCII would be garbled?

Why not intelligently try to figure out the encoding if it's not
explicitly provided?

> Hmm. Okay, I don't like immutable binary. You should be able to
> specify length and have it behave like Array. So, your third option
> above.

I think we would also have mutable binary objects available.

>> I don't think toBinary() is named precisely enough. toUTF8()?
>
> In my mind, binary is raw, not UTF8. Otherwise, you need yet ANOTHER
> type to represent raw data. We already have a native Unicode type
> (String), I'm not sure adding a UTF8 data type really has value.

I presume toUTF8() would just return a binary object containing the
string encoded as UTF8, not a special data type.

>> Not sure if base64encode, md5, sha1, base64encode belong on Binary.
>>
>> Since you want pop(), push(), etc., I think you could consider making
>> Binary.prototype.prototype == Array.prototype. This should work if
>> [[Get]], [[Put]] length-as-property were used.

I think it would probably be better to make the Binary object
compatible with Array, but not rely on Array's methods. Those methods
are easy enough to implement.

Peter Michaux

Feb 6, 2009, 3:44:17 PM

to serv...@googlegroups.com

On Fri, Feb 6, 2009 at 12:42 PM, Tom Robinson <tlrob...@gmail.com> wrote:

> Why not intelligently try to figure out the encoding if it's not
> explicitly provided?

I am told by the resident encoding expert where I work that this is
not possible to do reliably.

Peter

Ondrej Zara

Feb 6, 2009, 3:57:17 PM

to serv...@googlegroups.com

>> Why not intelligently try to figure out the encoding if it's not
>> explicitly provided?
>
> I am told by the resident encoding expert where I work that this is
> not possible to do reliably.

+1

Actually, I live in a country which uses (*used*, UTF-8 is now leading
the way) about 5 different one-byte encodings for bytes > 128.

O.


Robert Koberg

Feb 6, 2009, 4:07:55 PM

to serv...@googlegroups.com

If an XML parser is involved somewhere along the way, it will use the
xml declaration to determine the encoding, e.g.

<?xml version="1.0" encoding="UTF-16"?>

So, to intelligently determine the encoding you would need to parse it
first :)

-Rob

Wes Garland

Feb 6, 2009, 4:37:48 PM

to serv...@googlegroups.com

> Except if the user thinks it's a string and does .length they'll get
> the number of bytes, not the number of characters (which isn't always
> the same)

The same argument could be made for Array. I don't think it's a reason
to ditch toString. (Array.prototype.toString returns this.join(','))

>> That said, I don't recall the suggested spec saying that the
>> conversion would always be from UTF-8. I would logically expect them
>> to be just shoved into a widechar string with the high byte set zero.
>
> So anything except ASCII would be garbled?

If it wasn't treated intelligently by the programmer, then yes. But
we can't stop programmers from shooting themselves in the foot, only
help them.

Remember, if you impose high-level functionality like understanding
UTF-8 on your lowest level data type (Binary), then you're backing
yourself into a corner where you cannot _possibly_ support anything
else. For example, what if I wanted to be able to copy an executable
from one directory to another? In a UTF-8-only proposal, I can't
possibly do that. That's silly.
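
A sketch of why forcing UTF-8 on raw data loses information, assuming a runtime with the standard TextDecoder and TextEncoder. The sample bytes are made up: an executable-style header followed by two bytes that are not valid UTF-8.

```javascript
// Made-up sample: 0x7f "ELF" followed by two bytes that are invalid in UTF-8.
const original = Uint8Array.of(0x7f, 0x45, 0x4c, 0x46, 0xff, 0xfe);

// Non-fatal decoding replaces each invalid byte with U+FFFD...
const asText = new TextDecoder("utf-8").decode(original);

// ...and re-encoding writes U+FFFD as three bytes each, so the
// result no longer matches the input.
const roundTripped = new TextEncoder().encode(asText);

const sameLength = roundTripped.length === original.length; // false
```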

> Why not intelligently try to figure out the encoding if it's not
> explicitly provided?

As an option during construction? I don't really have a problem with
that, although I can't see it being all that great. All the time? Forget
it. What if I am reading a file which is UTF-8, but contains very
few escape sequences? So, I'm reading it line by line, and
File.read() (or whatever) is returning me "plain binary", or "ascii",
or whatever it detects it as, then suddenly there is a UTF-8 escape
sequence. Now it starts returning UTF-8-decoded Binary data. But it
turns out that what I'm *really* reading is actually just a large BLOB
from something else, and next it has utf-7 escapes...

I think a reasonable way to handle this is to have whatever
instantiates the Binary object indicate the underlying representation,
with "raw" as one option.

Then, in the example of File.read() returning a Binary, File.open()
will have to have a parameter indicating the encoding type.
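
One way this could look, as a sketch only. The class name, the constructor signature, and the use of TextDecoder labels as the "representation" values are all assumptions; "raw" falls back to one character per byte.

```javascript
// Hypothetical Binary that remembers its underlying representation.
class TaggedBinary {
  constructor(bytes, encoding = "raw") {
    this.bytes = Uint8Array.from(bytes);
    this.encoding = encoding; // "raw" or a TextDecoder label such as "utf-8"
  }

  get length() {
    return this.bytes.length;
  }

  toString() {
    if (this.encoding === "raw") {
      // no decoding: one code unit per byte
      return Array.from(this.bytes, (b) => String.fromCharCode(b)).join("");
    }
    // throws if the bytes do not match the declared encoding
    return new TextDecoder(this.encoding, { fatal: true }).decode(this.bytes);
  }
}

const decoded = String(new TaggedBinary([0xc3, 0xa9], "utf-8")); // "é"
const raw = String(new TaggedBinary([0xc3, 0xa9]));              // two characters
```

With this shape, File.open() would pass its encoding parameter through to whatever constructs the object returned by File.read().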

Also, I foresee the day when File.read() and .write() might be reading
and writing pascal-style records (or raw C structs) as a legacy shim
to some other application. We should allow that. That means that we
need Binary.slice, but that's pretty trivial. Hmm, I suppose that
also implies that Binary.slice should return a Binary, with an
appropriate toString. That could be really nice and easy to use.
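
To show what slicing fixed-size records might involve, here is a sketch using typed arrays and DataView in place of the proposed Binary. The 8-byte record layout (16-bit little-endian id, one flags byte, five ASCII name bytes) is entirely made up for illustration.

```javascript
// Hypothetical record layout: uint16 id (LE), uint8 flags, char name[5].
const RECORD_SIZE = 8;

function readRecord(bytes, index) {
  // subarray plays the role of Binary.slice: a view onto one record
  const rec = bytes.subarray(index * RECORD_SIZE, (index + 1) * RECORD_SIZE);
  const view = new DataView(rec.buffer, rec.byteOffset, rec.byteLength);
  return {
    id: view.getUint16(0, true),
    flags: view.getUint8(2),
    name: String.fromCharCode(...rec.subarray(3, 8)),
  };
}

const file = Uint8Array.of(
  0x01, 0x00, 0x80, 0x61, 0x6c, 0x69, 0x63, 0x65, // id 1, "alice"
  0x02, 0x01, 0x00, 0x62, 0x6f, 0x62, 0x20, 0x20  // id 258, "bob  "
);

const second = readRecord(file, 1);
```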

Other things I can see happening with Binary are decoding protocol
data units from a socket, being hunks of shared memory, etc.

> I think we would also have mutable binary objects available.

I agree, I've been thinking about this and I don't think there are any
real benefits to immutable binaries, including engine-level
optimizations. And there are definitely drawbacks to an immutable
binary type.

> I presume toUTF8() would just return a binary object containing the
> string encoded as UTF8, not a special data type.

Yes

> I think it would probably be better to make the Binary object
> compatible with Array, but not rely on Array's methods. Those methods
> are easy enough to implement.

Agreed.

Wes

Ates Goral

Feb 8, 2009, 5:00:33 PM

to serv...@googlegroups.com

On Fri, Feb 6, 2009 at 2:53 AM, Cameron McCormack <hey...@gmail.com> wrote:
>
> I think [[Get]] and [[Put]] should be used to access the bytes in the
> object, so that it can be used similarly to Array objects.
>

The CanvasPixelArray[1] object of <canvas> follows this approach for
accessing raw image data.

[1]: http://www.whatwg.org/specs/web-apps/current-work/multipage/the-canvas-element.html#canvaspixelarray
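
For a feel of that style of access: CanvasPixelArray was later replaced in the HTML specification by Uint8ClampedArray, which can be tried outside a canvas in current runtimes. A small sketch:

```javascript
// One RGBA pixel; the length is fixed at construction.
const pixel = new Uint8ClampedArray(4);

pixel[0] = 300; // clamped to 255
pixel[1] = -20; // clamped to 0
pixel[2] = 128;
pixel[3] = 255;
```

Bytes are read and written with ordinary index syntax, matching the "constant length, contents modifiable" flavour discussed earlier in the thread.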

Ates

ry

Mar 4, 2009, 7:29:06 AM

to serverjs

> Here are some comments on the Binary object proposal at https://wiki.mozilla.org/ServerJS/Binary.

Google Gears has proposed a simple binary string API
http://code.google.com/apis/gears/api_blob.html

Ates Goral

Mar 4, 2009, 9:05:53 PM

to serv...@googlegroups.com

Some comments on the proposed API:

> /**
> * Encodes the data in Base64
> * @returns {string}
> */
> Binary.prototype.base64encode = function() {};
>
> /**
> * Calculates MD5 hash
> * @returns {string}
> */
> Binary.prototype.md5 = function() {};
>
> /**
> * Calculates SHA1 hash
> * @returns {string}
> */
> Binary.prototype.sha1 = function() {};

I'd call these toBase64, toMD5, toSHA1 to be consistent with toString,
toBinary etc.

> /**
> * Decodes the Base64 encoded string
> * @returns {Binary}
> */
> String.prototype.base64decode = function() {};

I'd make this a static method of Binary and call it fromBase64:

Binary.fromBase64 = function (string) {};

From an implementation perspective, it would make sense to include
both the encoding and the decoding in the same object. The String
object shouldn't have to know about base64 encoding.
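
A sketch of keeping both directions on Binary, assuming Node.js and its Buffer for the actual base64 work. The class body is a placeholder, not the proposed host object.

```javascript
// Placeholder Binary; base64 handling is delegated to Node.js Buffer.
class Binary {
  constructor(bytes) {
    this.bytes = Uint8Array.from(bytes);
  }

  // Decoding: a static method, so String needs no knowledge of base64.
  static fromBase64(string) {
    return new Binary(Buffer.from(string, "base64"));
  }

  // Encoding: an instance method, named to match toString.
  toBase64() {
    return Buffer.from(this.bytes).toString("base64");
  }
}

const decodedBytes = Array.from(Binary.fromBase64("SGk=").bytes); // bytes of "Hi"
const encoded = new Binary([0x48, 0x69]).toBase64();
```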

> /**
> * Converts an UTF-8 encoded string into its binary (one byte per item) variant
> * @returns {Binary}
> */
> String.prototype.toBinary = function() {};

Again, I'd make this part of the Binary object: pass in a String to
the Binary constructor:

function Binary(string) { }

My last two suggestions have the advantage of making the String object
completely Binary-agnostic. It will be the Binary object that knows
about String and not vice versa. It would be up to the implementation
to extract the internal binary representation from a given String
object.

Also, while we have push, pop, shift and unshift, could they take an
integer count argument so that they can be used for "blitting"?
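
One possible reading of that suggestion, sketched as free functions over a plain array of byte values. The names and the default count of 1 are assumptions.

```javascript
// Hypothetical count-taking variants of pop and shift.
function popBytes(bytes, count = 1) {
  // remove and return the last `count` bytes, in their original order
  return bytes.splice(Math.max(bytes.length - count, 0), count);
}

function shiftBytes(bytes, count = 1) {
  // remove and return the first `count` bytes
  return bytes.splice(0, count);
}

const data = [1, 2, 3, 4, 5];
const tail = popBytes(data, 2);   // [4, 5]
const head = shiftBytes(data, 2); // [1, 2]
```

After both calls only the middle byte remains in data.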

Ates
