Something that has come up a couple of times with content authors
lately has been the desire to convert an ArrayBuffer (or part thereof)
into a decoded string. Similarly being able to encode a string into an
ArrayBuffer (or part thereof).
Something as simple as
DOMString decode(ArrayBufferView source, DOMString encoding);
ArrayBufferView encode(DOMString source, DOMString encoding,
[optional] ArrayBufferView destination);
would go a very long way. The question is where to stick these
functions. Internationalization doesn't have an obvious object we can
hang functions off of (unlike, for example, crypto), and the above
names are much too generic to turn into global functions.
Ideas/opinions/bikesheds?
/ Jonas
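
For readers following along today, the two signatures can be approximated with the TextDecoder and TextEncoder objects that current browsers and Node.js provide. This is only a sketch: the names decode and encode come from the message above, while the bodies, the UTF-8 default, and the error on a too-small destination are assumptions, not anything agreed on in this thread.

```javascript
// Sketch of the two proposed functions on top of TextDecoder/TextEncoder.
// Assumption: only UTF-8 can be encoded, because TextEncoder supports nothing else.
function decode(source, encoding = "utf-8") {
  // source is any ArrayBufferView; its offset and length are respected.
  return new TextDecoder(encoding).decode(source);
}

function encode(source, encoding = "utf-8", destination) {
  if (encoding.toLowerCase() !== "utf-8") {
    throw new RangeError("this sketch only encodes UTF-8");
  }
  const encoder = new TextEncoder();
  if (destination === undefined) {
    return encoder.encode(source); // freshly allocated Uint8Array
  }
  // View the destination's bytes regardless of its element type.
  const target = new Uint8Array(
    destination.buffer, destination.byteOffset, destination.byteLength);
  const { read, written } = encoder.encodeInto(source, target);
  if (read < source.length) {
    throw new RangeError("destination is too small");
  }
  return target.subarray(0, written);
}

const helloBytes = encode("héllo");
console.log(helloBytes.length);   // 6 (é takes two bytes in UTF-8)
console.log(decode(helloBytes));  // "héllo"
```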
Python3 just defines str.encode and bytes.decode. Can we not do this
with String.encode and ArrayBuffer.decode?
~TJ
Shouldn't this just be another ArrayBufferView type with special
semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a
getString()/setString() method pair on DataView?
Incidentally I _strongly_ suggest we only support UTF-8 here.
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
http://wiki.whatwg.org/wiki/StringEncoding
This is the direction I prefer. String encoding and decoding seems to
be a complex enough problem that it should be expressed separately
from the typed array spec itself.
-Ken
Unfortunately I suspect getting anything added on the String object
will take a few years given that it's too late to get into ES6 (and in
any case I suspect adding ArrayBuffer dependencies to ES6 would be
controversial).
/ Jonas
Very cool. Where do I provide feedback to this? Here?
/ Jonas
> Something that has come up a couple of times with content authors
> lately has been the desire to convert an ArrayBuffer (or part thereof)
> into a decoded string. Similarly being able to encode a string into an
> ArrayBuffer (or part thereof).
>
There was discussion about this before:
https://www.khronos.org/webgl/public-mailing-list/archives/1111/msg00017.html
http://wiki.whatwg.org/wiki/StringEncoding
(I don't know why it was on the WebGL list; typed arrays are becoming
infrastructural and this doesn't seem like it belongs there, even though
ArrayBuffer was started there.)
The API on that wiki page is a reasonable start. For the same reasons that
we discussed in a recent thread (
http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html),
conversion errors should use replacement (eg. U+FFFD), not throw
exceptions. The "any" arguments should be fixed. Encoding to UTF-16
should definitely not prefix a BOM, and UTF-16 having unspecified
endianness is obviously bad.
I'd also suggest that, unless there's serious, substantiated demand for
it--which I doubt--only major Unicode encodings be supported. Don't make
it easier for people to keep using legacy encodings.
> Shouldn't this just be another ArrayBufferView type with special
> semantics, like Uint8ClampedArray? DOMStringArray or some such? And/or a
> getString()/setString() method pair on DataView?
I don't think so, because retrieving the N'th decoded/reencoded character
isn't a constant-time operation.
--
Glenn Maynard
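
To make the replace-versus-throw distinction concrete, here is a small sketch using the TextDecoder and TextEncoder found in current runtimes (not the wiki API under discussion). It shows replacement with U+FFFD as the default and throwing as an opt-in; the sample bytes are made up for illustration.

```javascript
// 0xFF can never appear in well-formed UTF-8.
const input = new Uint8Array([0x61, 0xff, 0x62]);

// Default mode: the bad byte becomes U+FFFD and decoding continues.
const lenient = new TextDecoder("utf-8").decode(input);
console.log(lenient === "a\uFFFDb"); // true

// Opt-in strict mode: the same input throws a TypeError.
let strictFailed = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(input);
} catch (e) {
  strictFailed = e instanceof TypeError;
}
console.log(strictFailed); // true

// Encoding a lone surrogate also substitutes U+FFFD (EF BF BD in UTF-8),
// and no BOM is prefixed to the output.
const encoded = new TextEncoder().encode("\uD800");
console.log(Array.from(encoded)); // [239, 191, 189]
```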
This list seems like a good place to discuss it.
-Ken
We can just define it outside the ES spec.
> On Tue, Mar 13, 2012 at 5:49 PM, Jonas Sicking <jo...@sicking.cc> wrote:
>
> > Something that has come up a couple of times with content authors
> > lately has been the desire to convert an ArrayBuffer (or part thereof)
> > into a decoded string. Similarly being able to encode a string into an
> > ArrayBuffer (or part thereof).
> >
>
> There was discussion about this before:
>
>
> https://www.khronos.org/webgl/public-mailing-list/archives/1111/msg00017.html
> http://wiki.whatwg.org/wiki/StringEncoding
>
> (I don't know why it was on the WebGL list; typed arrays are becoming
> infrastructural and this doesn't seem like it belongs there, even though
> ArrayBuffer was started there.)
>
Purely historical; early adopters of Typed Arrays were folks prototyping
with WebGL who wanted to parse data files containing strings.
WHATWG makes sense, I just hadn't gotten around to shopping for a home.
(Administrivia: Is there need to propose a charter addition?)
> The API on that wiki page is a reasonable start. For the same reasons that
> we discussed in a recent thread (
> http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1589.html),
> conversion errors should use replacement (eg. U+FFFD), not throw
> exceptions. The "any" arguments should be fixed. Encoding to UTF-16
> should definitely not prefix a BOM, and UTF-16 having unspecified
> endianness is obviously bad.
>
> I'd also suggest that, unless there's serious, substantiated demand for
> it--which I doubt--only major Unicode encodings be supported. Don't make
> it easier for people to keep using legacy encodings.
>
>
Two other pieces of feedback I received from Adam Barth off list:
* take ArrayBufferView as input which both fixes "any" and simplifies the
API to eliminate byteOffset and byteLength
* support two versions of encode, one which takes a target ArrayBufferView,
and one which allocates/returns a new Uint8Array of the appropriate length.
Some quick feedback:
- [OmitConstructor] doesn't seem to be WebIDL
- please don't allow UAs to implement other encodings. You should list
the exact set of supported encodings and the exact labels that should
be recognised as meaning those encodings, and disallow all others.
Otherwise, we'll be in a never-ending game of reverse-engineering each
others' lists of supported encodings and it'll keep growing.
- What's the use case for supporting anything but UTF-8?
- Having a mechanism that lets you encode the string and get a length
separate from the mechanism that lets you encode the string and get the
encoded string seems like it would encourage very inefficient code. Can
we instead have a mechanism that returns both at once? Or is the idea
that for some encodings getting the encoded length is much quicker than
getting the actual string?
- Seems weird that integers and strings would have such different APIs
for doing the same thing. Why can't we handle them equivalently? As in:
   len = view.setString(strings[i],
                        offset + Uint32Array.BYTES_PER_ELEMENT,
                        "UTF-8");
   view.setUint32(offset, len);
   offset += Uint32Array.BYTES_PER_ELEMENT + len;
HTH,
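
The length-prefixed pattern in that snippet can be tried out today with free functions standing in for the suggested DataView methods. A minimal sketch, assuming UTF-8 only: setString and getString here are hypothetical helpers, not real DataView members, and the 64-byte buffer and sample strings are arbitrary.

```javascript
// Hypothetical helper mirroring the suggested view.setString(); returns the
// number of UTF-8 bytes written.
function setString(view, byteOffset, value) {
  const target = new Uint8Array(
    view.buffer, view.byteOffset + byteOffset, view.byteLength - byteOffset);
  const { read, written } = new TextEncoder().encodeInto(value, target);
  if (read < value.length) throw new RangeError("string does not fit");
  return written;
}

// Hypothetical counterpart for reading a string of known byte length.
function getString(view, byteOffset, byteLength) {
  const source = new Uint8Array(
    view.buffer, view.byteOffset + byteOffset, byteLength);
  return new TextDecoder("utf-8").decode(source);
}

const strings = ["mesh", "naïve", ""];
const view = new DataView(new ArrayBuffer(64));

// Write each string as [uint32 length][UTF-8 bytes], as in the snippet above.
let offset = 0;
for (const s of strings) {
  const len = setString(view, offset + Uint32Array.BYTES_PER_ELEMENT, s);
  view.setUint32(offset, len);
  offset += Uint32Array.BYTES_PER_ELEMENT + len;
}

// Read them back the same way.
const decoded = [];
let readOffset = 0;
while (readOffset < offset) {
  const len = view.getUint32(readOffset);
  decoded.push(getString(view, readOffset + Uint32Array.BYTES_PER_ELEMENT, len));
  readOffset += Uint32Array.BYTES_PER_ELEMENT + len;
}
console.log(decoded); // [ 'mesh', 'naïve', '' ]
```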
You're welcome to use the WHATWG list for this. Charters are pointless and
there's no need to worry about them here.
Like Ian said, I don't see anything particularly bad about having the spec
that defines ArrayBuffers also define an ArrayBuffer-related method on
String. There's no reason it has to be in the ES spec.
~TJ
Python throws errors by default, but both functions have an additional
argument specifying an alternate strategy. In particular,
bytes.decode can either drop the invalid bytes, replace them with a
replacement char (which I agree should be U+FFFD), or replace them
with XML entities; str.encode can choose to drop characters the
encoding doesn't support.
~TJ
> On Tue, 13 Mar 2012, Joshua Bell wrote:
> > On Tue, Mar 13, 2012 at 4:10 PM, Jonas Sicking <jo...@sicking.cc> wrote:
> > > On Tue, Mar 13, 2012 at 4:08 PM, Kenneth Russell <k...@google.com>
> > > wrote:
> > > > Joshua Bell has been working on a string encoding and decoding API
> > > > that supports the needed encodings, and which is separable from the
> > > > core typed array API:
> > > >
> > > > http://wiki.whatwg.org/wiki/StringEncoding
> > > >
> > > > This is the direction I prefer. String encoding and decoding seems
> > > > to be a complex enough problem that it should be expressed
> > > > separately from the typed array spec itself.
>
> Some quick feedback:
>
> - [OmitConstructor] doesn't seem to be WebIDL
>
Historically, the spec started off as an addition to the Typed Array spec
that splintered off; cleanup is definitely needed, thanks.
> - please don't allow UAs to implement other encodings. You should list
> the exact set of supported encodings and the exact labels that should
> be recognised as meaning those encodings, and disallow all others.
> Otherwise, we'll be in a never-ending game of reverse-engineering each
> others' lists of supported encodings and it'll keep growing.
>
> - What's the use case for supporting anything but UTF-8?
>
For both of the above: initially suggested use cases included parsing data
as esoteric as ID3 tags in MP3 files, where the encoding is unspecified and is
guessed at by decoders, and includes non-Unicode encodings. It was
suggested that the encoding sniffing capabilities of browsers be leveraged.
(Cue a strong "nooooooo!" from Anne.)
I completely agree that we should explicitly list the set of encodings
supported and should remove the "other encodings" allowance.
Whether we should restrict it as far as UTF-8 depends on whether we
envision this API only used for parsing/serializing newly defined data
formats, or whether there is consideration for interop with previously
existing data formats and code. For example, "BINARY" would be used
to bridge the existing atob()/btoa() methods with Typed Arrays (although
base64 directly in/out of Typed Arrays would be preferable).
Jonas, since you started this thread - did your content authors mention
encodings?
> - Having a mechanism that lets you encode the string and get a length
> separate from the mechanism that lets you encode the string and get the
> encoded string seems like it would encourage very inefficient code. Can
> we instead have a mechanism that returns both at once? Or is the idea
> that for some encodings getting the encoded length is much quicker than
> getting the actual string?
>
The use case was to compute the size necessary to allocate a single buffer
into which may be encoded multiple strings and other data, rather than
allocating multiple small buffers and then copying strings into a larger
buffer.
Ignoring the issue of invalid code points, the length calculations for
non-UTF-8 encodings are trivial. (And with the suggestion that UTF-16 not
be sanitized, that case is trivially 2x the JS string length.)
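
As an illustration of the sizing pass described above, the UTF-8 byte count can be computed from the UTF-16 code units without producing any output. This is a sketch under one assumption: lone surrogates are counted as U+FFFD (three bytes), which matches replacement-style encoding. The name utf8Length is made up here.

```javascript
// Count the UTF-8 bytes a JS string needs, without encoding it.
// Assumption: lone surrogates are replaced by U+FFFD, which is 3 bytes.
function utf8Length(str) {
  let bytes = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1;
    } else if (unit < 0x800) {
      bytes += 2;
    } else if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4; // valid surrogate pair: one 4-byte code point
        i++;
      } else {
        bytes += 3; // lone high surrogate
      }
    } else {
      bytes += 3; // other BMP code point, or a lone surrogate
    }
  }
  return bytes;
}

// UTF-16 needs no scan at all: two bytes per code unit.
function utf16Length(str) {
  return 2 * str.length;
}

// Size one buffer for several strings before encoding any of them.
const fields = ["name", "größe", "\u20AC"];
const totalBytes = fields.reduce((sum, s) => sum + utf8Length(s), 0);
const shared = new Uint8Array(totalBytes);
```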
> - Seems weird that integers and strings would have such different APIs
> for doing the same thing. Why can't we handle them equivalently? As in:
>
> len = view.setString(strings[i],
> offset + Uint32Array.BYTES_PER_ELEMENT,
> "UTF-8");
> view.setUint32(offset, len);
> offset += Uint32Array.BYTES_PER_ELEMENT + len;
>
Heh, that's where the discussion started, actually. We wanted to keep the
DataView interface simple, and potentially support encoding into plain JS
arrays and/or non-TypedArray support that appeared to be on the horizon for
JS.
Seems reasonable. If we have specific use cases for non-UTF-8 encodings, I
agree we should support them; if that's the case, we should survey those
use cases to work out what the set of encodings we need is, and add just
those.
> > - Having a mechanism that lets you encode the string and get a length
> > separate from the mechanism that lets you encode the string and get the
> > encoded string seems like it would encourage very inefficient code. Can
> > we instead have a mechanism that returns both at once? Or is the idea
> > that for some encodings getting the encoded length is much quicker than
> > getting the actual string?
> >
>
> The use case was to compute the size necessary to allocate a single buffer
> into which may be encoded multiple strings and other data, rather than
> allocating multiple small buffers and then copying strings into a larger
> buffer.
>
> Ignoring the issue of invalid code points, the length calculations for
> non-UTF-8 encodings are trivial. (And with the suggestion that UTF-16 not
> be sanitized, that case is trivially 2x the JS string length.)
Yeah, but surely we'll mainly be doing stuff with UTF-8...
One option is to return an opaque object of the form:
   interface EncodedString {
     readonly attribute unsigned long length;
     // internally has a copy of the encoded string
   }
...and then have view.setString take this EncodedString object. At least
then you get it down to an extraneous copy, rather than an extraneous
encode. Still not ideal though.
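
The trade-off (one extra copy instead of one extra encode) can be sketched with today's TextEncoder, where a plain Uint8Array plays the role of the opaque EncodedString. The sample strings are arbitrary, and nothing here is the proposed interface itself.

```javascript
const encoder = new TextEncoder();
const labels = ["alpha", "beta", "gamma"];

// First pass: encode each string exactly once and keep the bytes around.
// Each Uint8Array stands in for the opaque EncodedString object.
const encodedLabels = labels.map((s) => encoder.encode(s));

// The total size is now known without a second encode.
const total = encodedLabels.reduce((sum, e) => sum + e.length, 0);

// Second pass: copy (not re-encode) everything into one shared buffer.
const packed = new Uint8Array(total);
let writeOffset = 0;
for (const e of encodedLabels) {
  packed.set(e, writeOffset);
  writeOffset += e.length;
}
```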
> > - Seems weird that integers and strings would have such different APIs
> > for doing the same thing. Why can't we handle them equivalently? As in:
> >
> > len = view.setString(strings[i],
> > offset + Uint32Array.BYTES_PER_ELEMENT,
> > "UTF-8");
> > view.setUint32(offset, len);
> > offset += Uint32Array.BYTES_PER_ELEMENT + len;
>
> Heh, that's where the discussion started, actually. We wanted to keep
> the DataView interface simple, and potentially support encoding into
> plain JS arrays and/or non-TypedArray support that appeared to be on the
> horizon for JS.
I see where you're coming from, but I think we should look at the platform
as a whole, not just one API. It doesn't help the platform as a whole if
we just have the same features split across two interfaces; the complexity
is then even slightly higher than with one consistent API that does ints
and strings equivalently.
On Tue, Mar 13, 2012 at 6:28 PM, Ian Hickson <i...@hixie.ch> wrote:
> - What's the use case for supporting anything but UTF-8?
>
Other Unicode encodings may be useful, to decode existing file formats
containing (most likely at a minimum) UTF-16. I don't feel strongly about
that, though; we're stuck with UTF-16 as an internal representation in the
platform, but that doesn't necessarily mean we need to support it as a
transfer encoding.
For non-Unicode legacy encodings, I think that even if use cases exist,
they should be given more than the usual amount of scrutiny before being
supported.
On Tue, Mar 13, 2012 at 6:38 PM, Tab Atkins Jr. <jacka...@gmail.com> wrote:
> Python throws errors by default, but both functions have an additional
> argument specifying an alternate strategy. In particular,
> bytes.decode can either drop the invalid bytes, replace them with a
> replacement char (which I agree should be U+FFFD), or replace them
> with XML entities; str.encode can choose to drop characters the
> encoding doesn't support.
>
Supporting throwing is okay if it's really wanted, but the default should
be replacement. It reduces fatal errors to (usually) non-fatal
replacement, for obscure cases that people generally don't test. It's a
much more sane default failure mode.
As another option, never throw, but allow returning the number of
conversion errors:
results = encode("abc\uD800def", outputView, "UTF-8");
where results.inputConsumed is the number of 16-bit code units consumed from
the input string, results.outputWritten is the number of UTF-8 bytes written,
and results.errors is 1.
That also allows block-by-block conversion; for example, to convert as many
complete characters as possible into a fixed-size buffer for transmission,
then starting again at the next unencoded character.
One more idea, while I'm brainstorming: if outputView is null, allocate an
ArrayBuffer of the necessary size, storing it in results.output. That
eliminates the need for a separate length pass, without bloating the API
with another overload.
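
Block-by-block conversion of this kind can be sketched with TextEncoder.encodeInto, which exists in current runtimes and reports code units read and bytes written (it does not report an error count, so that part of the idea is not shown). The generator name encodeInChunks and the minimum chunk size are assumptions of this sketch.

```javascript
// Encode `text` into consecutive chunks of at most `chunkSize` bytes,
// never splitting a code point across two chunks.
function* encodeInChunks(text, chunkSize) {
  // Any single code point needs at most 4 UTF-8 bytes, so this guarantees progress.
  if (chunkSize < 4) throw new RangeError("chunkSize must be >= 4");
  const encoder = new TextEncoder();
  let position = 0; // index into text, in UTF-16 code units
  while (position < text.length) {
    const chunk = new Uint8Array(chunkSize);
    // read: code units consumed; written: bytes produced.
    const { read, written } = encoder.encodeInto(text.slice(position), chunk);
    position += read;
    yield chunk.subarray(0, written);
  }
}

const chunkLengths = [];
for (const chunk of encodeInChunks("a\u20AC\uD83D\uDE00b", 4)) {
  chunkLengths.push(chunk.length);
}
console.log(chunkLengths); // [ 4, 4, 1 ]
```

Each chunk ends on a whole code point, so the receiver can decode chunks independently or concatenate them.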
On Tue, Mar 13, 2012 at 6:50 PM, Joshua Bell <jsb...@chromium.org> wrote:
> (Cue a strong "nooooooo!" from Anne.)
>
(Count me in on that, too. Heuristics bad.)
> Ignoring the issue of invalid code points, the length calculations for
> non-UTF-8 encodings are trivial. (And with the suggestion that UTF-16 not
> be sanitized, that case is trivially 2x the JS string length.)
>
UTF-16 "sanitization" (replacing mismatched surrogates with U+FFFD) doesn't
change the size of the output, actually.
--
Glenn Maynard
> Using Views instead of specifying the offset and length sounds good.
>
> On Tue, Mar 13, 2012 at 6:28 PM, Ian Hickson <i...@hixie.ch> wrote:
>
> > - What's the use case for supporting anything but UTF-8?
> >
>
> Other Unicode encodings may be useful, to decode existing file formats
> containing (most likely at a minimum) UTF-16. I don't feel strongly about
> that, though; we're stuck with UTF-16 as an internal representation in the
> platform, but that doesn't necessarily mean we need to support it as a
> transfer encoding.
>
> For non-Unicode legacy encodings, I think that even if use cases exist,
> they should be given more than the usual amount of scrutiny before being
> supported.
>
The whole idea is to be able to extract textual data out of some packed
binary format. If you don't support the character sets people want to use,
they will simply do as they have to do now and hand-code the character
set conversion, where it will be slow and inaccurate.
In particular, I think you have to include various ISO-8859-* character
sets (especially Latin1) and the non-Unicode character sets still
frequently used by Japanese and Chinese users.
I am fine with strongly suggesting that only UTF8 be used for new things,
but leaving out legacy support will severely limit the utility of this
library.
--
John A. Tamplin
Software Engineer (GWT), Google
And not go beyond what is defined/allowed in:
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html
--
Anne van Kesteren
http://annevankesteren.nl/
On Wed, Mar 14, 2012 at 06:49, Jonas Sicking <jo...@sicking.cc> wrote:
> Something that has come up a couple of times with content authors
> lately has been the desire to convert an ArrayBuffer (or part thereof)
> into a decoded string. Similarly being able to encode a string into an
> ArrayBuffer (or part thereof).
What are the 'late' use cases for this?
The question might sound naive, but to me the encoding/decoding would
have been really great to have during the time when we didn't have
support for ArrayBuffers in general input/output APIs like we have now
(XHR, WebSockets, File API, ...) - which sounds like the mainstream
use cases to me.
However, there is one use case that is not yet supported and that seems
worth not overlooking, imho: embedding of binary data
(typed arrays) into textual formats such as XML or JSON.
For this, base64 encoding/decoding is typically used (so that it
doesn't conflict with the XML or JSON container) and thus more or less
efficiently implemented in JavaScript (just like we had to
encode/decode strings in JS to/from XHR a while ago).
Would it make sense to support encoding="base64" in this API?
> Something as simple as
>
> DOMString decode(ArrayBufferView source, DOMString encoding);
> ArrayBufferView encode(DOMString source, DOMString encoding,
> [optional] ArrayBufferView destination);
This API proposal looks lean and mean.
I hope we can move the current StringEncoding proposal to something
closer to this.
Regards,
For completeness I note that python also allows user-provided custom
error handling. I'm not suggesting we want this, but I would strongly
prefer it to providing an XML-entity-encode option :)
If we can make it a deterministic, unchanging, and defined algorithm, I
think that would actually be acceptable. And ideally we do define that
algorithm at some point so new browsers can enter the existing market more
easily and existing browsers interpret existing content in the same way.
>
> What are the 'late' use cases for this?
> The question might sound naive, but to me the encoding/decoding would
> have been really great to have during the time when we didn't have
> support for ArrayBuffers in general input/output APIs like we have now
> (XHR, WebSockets, File API, ...) - which sounds like the mainstream
> use cases to me.
>
I brought up a use case with Mozilla during their Games Work Week. When
designing formats for games there's a desire to make the content small and
parse fast to keep load times down. When talking about something like a 3D
mesh format, it's very convenient to deliver the mesh to the browser
with responseType = "arraybuffer", as this allows us to push views of the
resulting array directly into WebGL buffers.
There's a lot of content within a model that doesn't cleanly map to a
binary array, however. A good example is if you want to include shader code
as part of the model. It's also very popular currently to use JSON to
describe model metadata. In these cases currently developers have two
choices: embed the string data in the binary buffer, which requires
cumbersome byte-by-byte extraction, or store it in a secondary file that is
requested as string data to begin with. This is the route I've seen taken
most often: Three.js uses it currently, for example. In either case,
however, you are being slowed down by either the string parsing overhead or
the second request.
With the proposed API it would be practical and fast to store string data
and binary mesh data in the same ArrayBuffer, which would be a boon to game
developers seeking to make HTML5 into a first-class gaming platform.
--Brandon
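
A rough sketch of that use case, using the TextDecoder available in current runtimes rather than the proposed API. The file layout below is invented for illustration (it is not Three.js's or any other real format), and the function names are hypothetical.

```javascript
// Assumed layout (hypothetical, for illustration only):
//   bytes 0..3     uint32, little-endian: byte length N of the JSON metadata
//   bytes 4..4+N   UTF-8 JSON text
//   zero padding up to a multiple of 4, then float32 vertex data
function buildModel(metadata, vertices) {
  const jsonBytes = new TextEncoder().encode(JSON.stringify(metadata));
  const dataStart = Math.ceil((4 + jsonBytes.length) / 4) * 4;
  const buffer = new ArrayBuffer(dataStart + vertices.length * 4);
  new DataView(buffer).setUint32(0, jsonBytes.length, true);
  new Uint8Array(buffer, 4).set(jsonBytes);
  new Float32Array(buffer, dataStart).set(vertices);
  return buffer;
}

function parseModel(buffer) {
  const jsonLength = new DataView(buffer).getUint32(0, true);
  const jsonBytes = new Uint8Array(buffer, 4, jsonLength);
  const metadata = JSON.parse(new TextDecoder("utf-8").decode(jsonBytes));
  const dataStart = Math.ceil((4 + jsonLength) / 4) * 4;
  // A view on the same memory, not a copy, so it could be handed to WebGL as-is.
  const vertices = new Float32Array(buffer, dataStart);
  return { metadata, vertices };
}

const modelBuffer = buildModel(
  { name: "cube", shader: "void main() {}" },
  [0, 1, 0.5]);
const model = parseModel(modelBuffer);
console.log(model.metadata.name);   // "cube"
console.log(model.vertices.length); // 3
```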
>
> For this, base64 encoding/decoding is typically used (so that it
> doesn't conflict with the XML or JSON container) and thus more or less
> efficiently implemented in JavaScript (just like we had to
> encode/decode strings in JS to/from XHR a while ago).
>
> Would it make sense to support encoding="base64" in this API?
Having implemented a library that handled both text encodings and
base16/base64 encoding, I can offer the opinion that the nomenclature gets
very confusing since the encode/decode semantics are reversed.
binary_buffer = encode(text_content)
text_content = decode(binary_buffer)
vs.
binary_buffer = decode(base64_data)
base64_data = encode(binary_buffer)
When you try to unify these and have the same API accept "UTF-8" and
"BASE64" encodings by name it's difficult to keep track of which of
encode/decode you want; one will seem backwards. (This is one advantage of
the atob()/btoa() API naming approach.)
And extending that thought, such confusion is lessened if you can avoid the
loaded words encode/decode and associate the operations with either one of
the target types, e.g. buffer.writeString/readString, or
string.toBuffer/fromBuffer
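
The reversed semantics are easy to see side by side. A small sketch using TextEncoder/TextDecoder plus atob()/btoa(); the helper names bytesToBase64 and base64ToBytes are made up here and simply name each operation after its source and target types.

```javascript
// Text encodings: string -> bytes is "encode", bytes -> string is "decode".
const textBytes = new TextEncoder().encode("hi");   // bytes 0x68 0x69
const text = new TextDecoder().decode(textBytes);   // "hi"

// Base64: bytes -> string is "encode", string -> bytes is "decode".
function bytesToBase64(bytes) {
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

function base64ToBytes(base64) {
  const binary = atob(base64);
  return Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
}

const base64 = bytesToBase64(textBytes);
console.log(base64);                            // "aGk="
console.log(Array.from(base64ToBytes(base64))); // [ 104, 105 ]
```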
* Rewritten in terms of Anne's Encoding spec and WebIDL, for algorithms,
encodings, and encoding selection, which greatly simplifies the spec. This
implicitly adds support for all of the other encodings defined therein - we
may still want to dictate a subset of encodings. A few minor issues noted
throughout the spec.
* Define a "binary" encoding, since that support was already in this spec.
We may decide to kill this but I didn't want to remove it just yet.
* Simplify methods to take ArrayBufferView instead of
any/byteOffset/byteLength. The implication is that you may need to use
temporary DataViews, and this is reflected in the examples.
* Call out more of the big open issues raised on this thread (e.g. where
should we hang this API)
Nothing controversial added, or (alas) resolved.
Not all limitations are bad, and I'd disagree with "seriously".
At a minimum, the set of encodings should be very carefully selected.
Limit it to Unicode to begin with, and if we're really going to put legacy
encodings on yet more life support, only add an encoding where there's a
clear, justified need for it. (There are many encodings that browsers need
to support for text/html because they're used in legacy content, but which
nobody is still using today in new content--those should not be supported
here.)
But stick with Unicode for now. Once an encoding is added, it's hard to
ever remove it.
On Wed, Mar 14, 2012 at 6:52 AM, Anne van Kesteren <ann...@opera.com> wrote:
> If we can make it a deterministic, unchanging, and defined algorithm, I
> think that would actually be acceptable. And ideally we do define that
> algorithm at some point so new browsers can enter the existing market more
> easily and existing browsers interpret existing content in the same way.
We don't have any untagged content to support yet, so let's not create an
API that guarantees it'll come into existence. The heuristics you need
depend heavily on the content, anyway (for example, heuristics that work
for HTML probably won't for ID3 tags, which are generally very short).
On Wed, Mar 14, 2012 at 11:14 AM, Joshua Bell <jsb...@chromium.org> wrote:
> Having implemented a library that handled both text encodings and
> base16/base64 encoding, I can offer the opinion that the nomenclature gets
> very confusing since the encode/decode semantics are reversed.
>
> binary_buffer = encode(text_content)
> text_content = decode(binary_buffer)
>
> vs.
>
> binary_buffer = decode(base64_data)
> base64_data = encode(binary_buffer)
>
It's more than a naming problem. With this string API, one side of the
conversion is always a DOMString. Base64 conversion wants
ArrayBuffer<->ArrayBuffer conversions, so it would belong in a separate API.
--
Glenn Maynard
>
> It's more than a naming problem. With this string API, one side of the
> conversion is always a DOMString. Base64 conversion wants
> ArrayBuffer<->ArrayBuffer conversions, so it would belong in a separate API.
>
Huh. The scenarios I've run across are Base64-encoded binary data islands
embedded in textual container formats like XML or JSON, which yield a
DOMString I want to decode into an ArrayBuffer.
> FYI, I've updated http://wiki.whatwg.org/wiki/StringEncoding
>
> * Rewritten in terms of Anne's Encoding spec and WebIDL, for algorithms,
> encodings, and encoding selection, which greatly simplifies the spec.
> This
> implicitly adds support for all of the other encodings defined therein -
> we
> may still want to dictate a subset of encodings. A few minor issues noted
> throughout the spec.
> * Define a "binary" encoding, since that support was already in this
> spec.
Maybe atob() and btoa() could be extended to work with ArrayBuffers?
> We may decide to kill this but I didn't want to remove it just yet.
> * Simplify methods to take ArrayBufferView instead of
> any/byteOffset/byteLength. The implication is that you may need to use
> temporary DataViews, and this is reflected in the examples.
> * Call out more of the big open issues raised on this thread (e.g. where
> should we hang this API)
>
> Nothing controversial added, or (alas) resolved.
--
Simon Pieters
Opera Software
What I replied to suggested reusing an existing undocumented code path
which is definitely used to support existing content. From what I remember
reading about the detector in Gecko it can be quite useful regardless of
context.
>> We don't have any untagged content to support yet, so let's not create an
>> API that guarantees it'll come into existence. The heuristics you need
>> depend heavily on the content, anyway (for example, heuristics that work
>> for HTML probably won't for ID3 tags, which are generally very short).
>>
>
> What I replied to suggested reusing an existing undocumented code path
> which is definitely used to support existing content. From what I remember
> reading about the detector in Gecko it can be quite useful regardless of
> context.
It's used to support text/html content. There's no content using *this
API* that exists yet, because this API doesn't exist yet.
If we really wanted to expose heuristics, it should be a separate method,
eg. guessEncoding(view) == "Shift-JIS".
--
Glenn Maynard
A few comments:
What's the use-case for the "stringLength" function? You can't decode
into an existing datastructure anyway, so you're ultimately forced to
call "decode" at which point the "stringLength" function hasn't helped
you.
Currently the use-case of simply wanting to convert a string to a
binary buffer is a bit cumbersome. You first have to call the
"encodedLength" function, then allocate a buffer of the right size,
then call the "encode" function. Could we add a function with
something like the following signature:
ArrayBufferView encode(DOMString value, optional DOMString encoding);
It doesn't seem possible to implement the 'encode' function without
doing multiple scans over the string. The implementation seems
required both to check that the data can be decoded using the
specified encoding, as well as check that the data will fit in the
passed in buffer. Only then can the implementation start decoding the
data. This seems problematic.
I also don't think it's a good idea to throw an exception for encoding
errors. Better to convert characters to the unicode replacement
character. I believe we made a similar change to the WebSockets
specification recently.
/ Jonas
> What's the use-case for the "stringLength" function? You can't decode
> into an existing datastructure anyway, so you're ultimately forced to
> call "decode" at which point the "stringLength" function hasn't helped
> you.
>
stringLength doesn't return the length of the decoded string. It returns
the byte offset of the first \0 (or the length of the whole buffer, if
none), for decoding null-terminated strings. For multibyte encodings (eg.
everything except UTF-16 and friends), it's just memchr(), so it's much
faster than actually decoding the string.
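
In today's terms that behaviour can be approximated as below. This is a sketch for byte-oriented encodings only: the stringLength function here is a hypothetical stand-in for the wiki's function, implemented with Uint8Array.prototype.indexOf, and readCString is an invented helper.

```javascript
// Hypothetical stand-in for the proposed stringLength(): byte offset of the
// first 0 byte, or the view's length if there is none.
function stringLength(bytes) {
  const end = bytes.indexOf(0);
  return end === -1 ? bytes.length : end;
}

// Decode a null-terminated UTF-8 field without decoding past the terminator.
function readCString(bytes) {
  const length = stringLength(bytes);
  return new TextDecoder("utf-8").decode(bytes.subarray(0, length));
}

const field = new Uint8Array([0x6f, 0x6b, 0x00, 0x78, 0x78]); // "ok\0xx"
console.log(stringLength(field)); // 2
console.log(readCString(field));  // "ok"
```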
> Currently the use-case of simply wanting to convert a string to a
> binary buffer is a bit cumbersome. You first have to call the
> "encodedLength" function, then allocate a buffer of the right size,
> then call the "encode" function.
I suggested eg.
result = encode("string", "utf-8", null).output;
which would create an ArrayBuffer of the required size. Presumably the
null ArrayBufferView argument would be optional, so you could just say
encode("string", "utf-8").
> It doesn't seem possible to implement the 'encode' function without
> doing multiple scans over the string. The implementation seems
> required both to check that the data can be decoded using the
> specified encoding, as well as check that the data will fit in the
> passed in buffer. Only then can the implementation start decoding the
> data. This seems problematic.
>
Only if it guarantees that it doesn't write anything to the output buffer
unless the entire result will fit. I don't think we need to do that; just
guarantee that it'll be truncated on a whole codepoint.
> I also don't think it's a good idea to throw an exception for encoding
> errors. Better to convert characters to the unicode replacement
> character. I believe we made a similar change to the WebSockets
> specification recently.
>
Was that change made? I filed
https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems to
be undecided.
--
Glenn Maynard
> On Thu, Mar 15, 2012 at 6:51 PM, Jonas Sicking <jo...@sicking.cc> wrote:
>
>> What's the use-case for the "stringLength" function? You can't decode
>> into an existing datastructure anyway, so you're ultimately forced to
>> call "decode" at which point the "stringLength" function hasn't helped
>> you.
>>
>
> stringLength doesn't return the length of the decoded string. It returns
> the byte offset of the first \0 (or the length of the whole buffer, if
> none), for decoding null-terminated strings. For multibyte encodings (eg.
> everything except UTF-16 and friends), it's just memchr(), so it's much
> faster than actually decoding the string.
>
And just to be clear, the use case is decoding data formats where string
fields are variable length null terminated.
>> Currently the use-case of simply wanting to convert a string to a
>> binary buffer is a bit cumbersome. You first have to call the
>> "encodedLength" function, then allocate a buffer of the right size,
>> then call the "encode" function.
>
>
> I suggested eg.
>
> result = encode("string", "utf-8", null).output;
>
> which would create an ArrayBuffer of the required size. Presumably the
> null ArrayBufferView argument would be optional, so you could just say
> encode("string", "utf-8").
>
I think we want both encoding and destination to be optional. That leads us
to an API like:
out_dict = stringEncoding.encode("string", opt_dict);
.. where both out_dict and opt_dict are WebIDL Dictionaries:
opt_dict keys: view, encoding
out_dict keys: charactersWritten, byteWritten, output
... where output === view if view is supplied, otherwise a new Uint8Array
(or Uint8ClampedArray??)
If this instead is attached to String, it would look like:
out_dict = my_string.encode(opt_dict);
If it were attached to ArrayBufferView, having a right-size buffer
allocated for the caller gets uglier unless we include a static version.
>> It doesn't seem possible to implement the 'encode' function without
>> doing multiple scans over the string. The implementation seems
>> required both to check that the data can be decoded using the
>> specified encoding, as well as check that the data will fit in the
>> passed in buffer. Only then can the implementation start decoding the
>> data. This seems problematic.
>>
>
> Only if it guarantees that it doesn't write anything to the output buffer
> unless the entire result will fit. I don't think we need to do that; just
> guarantee that it'll be truncated on a whole codepoint.
>
Agreed. Input/output dicts mean the API documentation a caller needs to
read to understand the usage is more complex than a function signature
which is why I resisted them, but it does seem like the best approach.
Thanks for pushing, Glenn!
In the create-a-buffer-on-the-fly case there will be some memory juggling
going on, either by initially over allocating or reallocating/moving.
>> I also don't think it's a good idea to throw an exception for encoding
>> errors. Better to convert characters to the unicode replacement
>> character. I believe we made a similar change to the WebSockets
>> specification recently.
>>
>
> Was that change made? I filed
> https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems
> to be undecided.
>
Settling on an options dict means adding a flag to control this behavior
(throws: true ?) doesn't extend the API surface significantly.
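A flag of this kind can be sketched with the TextDecoder interface available today, where the option is called fatal rather than throws. That interface is not the draft under discussion, and the example shows the decoding direction only.

```javascript
// 0xFF can never appear in well-formed UTF-8.
const bad = new Uint8Array([0x61, 0xff, 0x62]);

// Default: invalid input becomes U+FFFD, the replacement character.
const lenient = new TextDecoder("utf-8").decode(bad);

// Opt-in strictness: the same input throws a TypeError.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(bad);
} catch (e) {
  threw = e instanceof TypeError;
}
```

The non-throwing behaviour is the default, so data-dependent bugs degrade the output instead of breaking the page.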
>
> And just to be clear, the use case is decoding data formats where string
> fields are variable length null terminated.
>
... and the spec should include normative guidance that length-prefixing is
strongly recommended for new data formats.
> And just to be clear, the use case is decoding data formats where string
> fields are variable length null terminated.
>
A concrete example is ZIP central directories.
> I think we want both encoding and destination to be optional. That leads us
> to an API like:
>
> out_dict = stringEncoding.encode("string", opt_dict);
>
> .. where both out_dict and opt_dict are WebIDL Dictionaries:
>
> opt_dict keys: view, encoding
>
> out_dict keys: charactersWritten, byteWritten, output
>
The return value should just be a [NoInterfaceObject] interface.
Dictionaries are used for input fields.
Something that came up on IRC that we should spend some time thinking
about, though: Is it actually important to be able to encode into an
existing buffer? This may be a premature optimization. You can always
encode into a new buffer, and--if needed--copy the result where you need it.
If we don't support that, most of this extra stuff in encode() goes away.
> ... where output === view if view is supplied, otherwise a new Uint8Array
> (or Uint8ClampedArray??)
>
Uint8Array is correct. (Uint8ClampedArray is for image color data.)
If UTF-16 or UTF-32 are supported, decoding to them should return
Uint16Array and Uint32Array, respectively (with the return value being
typed just to ArrayBufferView).
> If this instead is attached to String, it would look like:
>
> out_dict = my_string.encode(opt_dict);
>
> If it were attached to ArrayBufferView, having a right-size buffer
> allocated for the caller gets uglier unless we include a static version.
>
If in-place decoding isn't really needed, we could have:
newView = str.encode("utf-8"); // or {encoding: "utf-8"}
str2 = newView.decode("utf-8");
len = newView.find(0); // replaces stringLength, searching for 0 in the
                       // view's type; you'd use Uint16Array for UTF-16
and encodedLength() would go away.
newView.find(val) would live on subclasses of TypedArray.
> In the create-a-buffer-on-the-fly case there will be some memory juggling
> going on, either by initially over allocating or reallocating/moving.
>
But since that's all behind the scenes, the implementation can do it
whichever way is most efficient for the particular encoding. In many
cases, it may be possible to eliminate any reallocation, by making an
educated guess about how big the buffer is likely to be.
On Fri, Mar 16, 2012 at 11:21 AM, Joshua Bell <jsb...@chromium.org> wrote:
> ... and the spec should include normative guidance that length-prefixing is
> strongly recommended for new data formats.
>
I think this would be a bit off-topic.
--
Glenn Maynard
On Fri, 16 Mar 2012, Glenn Maynard wrote:
> On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell <jsb...@chromium.org> wrote:
>
>> And just to be clear, the use case is decoding data formats where string
>> fields are variable length null terminated.
>>
>
> A concrete example is ZIP central directories.
>
>> I think we want both encoding and destination to be optional. That leads us
>> to an API like:
>>
>> out_dict = stringEncoding.encode("string", opt_dict);
>>
>> .. where both out_dict and opt_dict are WebIDL Dictionaries:
>>
>> opt_dict keys: view, encoding
>>
>
>
>> out_dict keys: charactersWritten, byteWritten, output
>>
>
> The return value should just be a [NoInterfaceObject] interface.
> Dictionaries are used for input fields.
>
> Something that came up on IRC that we should spend some time thinking
> about, though: Is it actually important to be able to encode into an
> existing buffer? This may be a premature optimization. You can always
> encode into a new buffer, and--if needed--copy the result where you need it.
>
> If we don't support that, most of this extra stuff in encode() goes away.
Yes, I think we should focus on getting feature parity with e.g. python
first -- i.e. not worry about decoding into existing buffers -- and add
extra fancy stuff later if we find that there are actually usecases where
avoiding the copy is critical. This should allow us to focus on getting
the right API for the common case.
> If in-place decoding isn't really needed, we could have:
>
> newView = str.encode("utf-8"); // or {encoding: "utf-8"}
> str2 = newView.decode("utf-8");
> len = newView.find(0); // replaces stringLength, searching for 0 in the
> view's type; you'd use Uint16Array for UTF-16
>
> and encodedLength() would go away.
This looks like a big win to me.
> On Fri, Mar 16, 2012 at 11:19 AM, Joshua Bell <jsb...@chromium.org> wrote:
>
>>
>> ... where output === view if view is supplied, otherwise a new Uint8Array
>> (or Uint8ClampedArray??)
>>
>
> Uint8Array is correct. (Uint8ClampedArray is for image color data.)
>
> If UTF-16 or UTF-32 are supported, decoding to them should return
> Uint16Array and Uint32Array, respectively (with the return value being
> typed just to ArrayBufferView).
>
FYI, there was some follow up IRC conversation on this. With Typed Arrays
as currently specified - that is, that Uint16Array has platform endianness
- the above would imply that either platform endianness dictated the output
byte sequence (and le/be was ignored), or that encode("\uFFFD",
"utf-16").view[0] might != 0xFFFD on some platforms.
There was consensus (among the two of us) that the output view's underlying
buffer's byte order would be le/be depending on the selected encoding.
There is not consensus over what the return view type should be -
Uint8Array, or pursue BE/LE variants of Uint16Array to conceal platform
endianness.
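The endianness concern can be seen with nothing but the existing typed array and DataView interfaces. The sketch below stores U+FFFD as little-endian UTF-16 bytes and reads it back two ways; the variable names are illustrative only.

```javascript
// U+FFFD as UTF-16LE: low byte first.
const buf = new ArrayBuffer(2);
new Uint8Array(buf).set([0xfd, 0xff]);

// DataView takes an explicit byte order, so this is the same everywhere.
const viaDataView = new DataView(buf).getUint16(0, true);

// Uint16Array uses the platform byte order, so this depends on the host.
const viaTypedArray = new Uint16Array(buf)[0];

// Detect the host byte order for comparison.
const platformIsLittleEndian =
  new Uint8Array(new Uint16Array([1]).buffer)[0] === 1;
```

On a little-endian host both reads give 0xFFFD. On a big-endian host the Uint16Array read gives 0xFDFF, which is the mismatch described above.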
For what it's worth, it seems like this is something we should seriously
consider changing so as to make the web-visible endianness of typed
arrays always be little-endian. Authors are actively writing code (and
being encouraged to do so by technology evangelists) that makes that
assumption anyway....
-Boris
As a WebGL developer, such a change would make my life far easier. Everyone
knows that typed arrays *can* be Big Endian, but I'm not aware of any
devices available right now that support WebGL that are. As a result we all
happily ignore that scenario and treat the entire world as Little Endian.
If and when a BE/WebGL capable device lands I'm not aware of ANY of the
existing apps that would work on it.
If we could normalize on LE, even if it meant some overhead in the browser
to compensate, a lot of devs would be grateful for it.
--Brandon
The DataView set of methods already does this work. The raw arrays are
supposed to have platform endianness.
If you see some evangelists skipping the endian check, send them an
e-mail and let them know.
-Charles
Not going to work.
You can't evangelise people into making their code work on architectures
that they don't own. It's hard enough to get people to work around
differences between browsers when all the browsers are available for free
and run on the platforms that they develop on.
The reality is that on devices where typed arrays don't appear LE, content
will break.
> The DataView set of methods already does this work. The raw arrays are
> supposed to have platform endianness.
>
That's wrong. This is web API design 101; everyone should know better than
this by now. Exposing platform endianness is setting the platform up for
massive incompatibilities down the road.
In reality, the spec is moot here: if anyone does implement typed arrays on
a production big-endian system, they're going to make these views
little-endian, because doing otherwise would break countless applications,
essentially all of which are tested only on little-endian systems. Web
compatibility is a top priority to browser implementations.
(DataView isn't relevant here; it's used for different access patterns. To
access arrays of data embedded in an ArrayBuffer, you use views, not
DataView. Use DataView if you have a packed data structure with
variable-size fields, such as the metadata in a ZIP local file header.)
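As a sketch of that access pattern, here is a DataView reading the two length fields of a ZIP local file header. The offsets (signature at 0, file name length at 26, extra field length at 28, file name at 30) follow the ZIP format; the function name and return shape are made up for illustration.

```javascript
// Sketch only: read the variable-size part of a ZIP local file header.
function readLocalFileHeader(buffer, offset = 0) {
  const dv = new DataView(buffer, offset);
  // All ZIP header fields are little-endian, regardless of the host.
  if (dv.getUint32(0, true) !== 0x04034b50) {
    throw new Error("not a local file header");
  }
  const nameLength = dv.getUint16(26, true);
  const extraLength = dv.getUint16(28, true);
  // The file name is a plain byte run, so a Uint8Array view fits here.
  const nameBytes = new Uint8Array(buffer, offset + 30, nameLength);
  return { nameLength, extraLength, nameBytes };
}

// Build a minimal header for a file called "a.txt".
const header = new ArrayBuffer(30 + 5);
const hv = new DataView(header);
hv.setUint32(0, 0x04034b50, true);
hv.setUint16(26, 5, true);
new Uint8Array(header, 30).set([0x61, 0x2e, 0x74, 0x78, 0x74]);

const parsed = readLocalFileHeader(header);
```

The name bytes are returned undecoded because their encoding depends on a flag in the header.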
--
Glenn Maynard
I make mistakes all the time with UTF8 and raw string arrays. I make
mistakes all the time with endianness.
Low level API design 101; everyone working with low level APIs makes
mistakes.
> In reality, the spec is moot here: if anyone does implement typed
> arrays on a production big-endian system, they're going to make these
> views little-endian, because doing otherwise would break countless
> applications, essentially all of which are tested only on
> little-endian systems. Web compatibility is a top priority to browser
> implementations.
It's up to programmers to code defensively. More-so with multi-platform
multi-vendor deployments than walled gardens.
Authors should be using the spec as written, it only takes one target
system to use big-endian.
It doesn't harm anything for a vendor to implement as little-endian, as
most authors assume and test on little endian.
It may cause some harm to alter the spec so as to remove the requirement
that coders account for both.
>
> (DataView isn't relevant here; it's used for different access
> patterns. To access arrays of data embedded in an ArrayBuffer, you
> use views, not DataView. Use DataView if you have a packed data
> structure with variable-size fields, such as the metadata in a ZIP
> local file header.)
I use the subarray pattern frequently. DataView is not much different
than using subarray.
Use DataView when it's easier than ArrayBufferView and available.
I haven't seen anyone actually using the DataView stuff in practice, or
presenting it to developers much...
> If you see some evangelists skipping the endian check, send them an
> e-mail and let them know.
I've done that... then I stopped because it just wasn't worth the
effort. Every single WebGL demo I've seen recently was doing this.
People were being told that typed arrays are a good way to load binary
(integer and float) data from servers using the arraybuffer facilities
of XHR at SXSW last week, with no mention of endianness.
I think that trying to get web developers to do this right is a lost
cause, esp. because none of them (to a good approximation) have any
big-endian systems to test on.
-Boris
I believe that recent Firefox on a SPARC processor would fit that
description. Of course the number of web developers that have a
SPARC-based machine is 0 to a very good approximation....
-Boris
Using input and output dictionaries is definitely messy, but I can't
see a better way either. And I think ES6 is adding some syntax here
that will make developers' lives better (destructuring assignment).
>>> It doesn't seem possible to implement the 'encode' function without
>>> doing multiple scans over the string. The implementation seems
>>> required both to check that the data can be decoded using the
>>> specified encoding, as well as check that the data will fit in the
>>> passed in buffer. Only then can the implementation start decoding the
>>> data. This seems problematic.
>>>
>>
>> Only if it guarantees that it doesn't write anything to the output buffer
>> unless the entire result will fit. I don't think we need to do that; just
>> guarantee that it'll be truncated on a whole codepoint.
>>
>
> Agreed. Input/output dicts mean the API documentation a caller needs to
> read to understand the usage is more complex than a function signature
> which is why I resisted them, but it does seem like the best approach.
> Thanks for pushing, Glenn!
>
> In the create-a-buffer-on-the-fly case there will be some memory juggling
> going on, either by initially over allocating or reallocating/moving.
The implementation can always figure out what strategy fits its own
requirements best with regards to memory allocation. I suspect that
right now in Firefox the fastest implementation would be to scan
through the string once to measure the desired buffer size, then
allocate and write into the allocated buffer.
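The measure-then-allocate strategy can be sketched in script for UTF-8. Both function names are hypothetical, and the write step borrows today's TextEncoder.encodeInto, which is not part of the draft being discussed.

```javascript
// Pass 1: count the UTF-8 bytes needed, one code point at a time.
function utf8Length(str) {
  let len = 0;
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    // Lone surrogates land in the 3-byte range, the same size as the
    // U+FFFD that the encoder substitutes for them.
    len += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
  }
  return len;
}

// Pass 2: allocate exactly that much and write into it.
function encodeTwoPass(str) {
  const out = new Uint8Array(utf8Length(str));
  new TextEncoder().encodeInto(str, out);
  return out;
}

const sample = "h\u00e9\u20ac\u{1F600}"; // 1 + 2 + 3 + 4 bytes
const encoded = encodeTwoPass(sample);
```

The measuring pass needs no error checks because it never writes anything.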
The problem is that the way that the encoding function is defined
right now, you are not allowed to write any data if you are throwing
for whatever reason, which means that you have to do a scan first to
see if you need to throw, and then do a separate pass to actually
encode the data. I think we need to change that such that when an
exception is thrown that data should be written up to the point that
causes the exception.
>>> I also don't think it's a good idea to throw an exception for encoding
>>> errors. Better to convert characters to the unicode replacement
>>> character. I believe we made a similar change to the WebSockets
>>> specification recently.
>>>
>>
>> Was that change made? I filed
>> https://www.w3.org/Bugs/Public/show_bug.cgi?id=16157, but it still seems
>> to be undecided.
>>
>
> Settling on an options dict means adding a flag to control this behavior
> (throws: true ?) doesn't extend the API surface significantly.
Sounds good to me. Though I would still strongly prefer the default to
be non-throwing so as to minimize the risk of website breakage in the
case of bugs. Especially since these bugs are so data dependent and
are likely to not happen on a developer's computer.
/ Jonas
I've written some hash/encryption methods that could very well
fail on Firefox on SPARC; many things fail on machines I've never tested
with.
Flip the implementation on SPARC, and it wouldn't harm anything. Cut it
out of the spec, so that the behavior is undocumented, implementations
break.
DataView is more complex than ArrayBufferView, so implementers started
with the easy option.
The coders using Float32Array are cowboys (web app gaming and
encryption). We're talking about a few hundred people out of many millions.
You can s/web developers/users/ and the statement would still apply,
wouldn't it?
- James
>
>
> -Boris
>
Sure, but so what?
The upshot is that people are writing code that assumes little-endian
hardware all over. We should just clearly make the spec say that that's
what typed arrays are so that an implementor can actually implement the
spec and be web compatible.
The value of a spec which can't be implemented as written is arguably
lower than not having a spec at all... At least then you _know_ you
have to reverse-engineer.
-Boris
Isn't that an issue for TC39?
It saddens me that this allows non-UTF-8 encodings. However, since use
cases for non-UTF-8 encodings were mentioned in this thread, I suggest
that the set of supported encodings be an enumerated set of encodings
stated in a spec and browsers MUST NOT support other encodings. The
set should probably be the set offered in the encoding popup at
http://validator.nu/?charset or a subset thereof (containing at least
UTF-8 of course). (That set was derived by researching the
intersection of the encodings supported by browsers, Python and the
JDK.)
> would go a very long way.
Are you sure that it's not necessary to support streaming conversion?
The suggested API design assumes you always have the entire data
sequence in a single DOMString or ArrayBufferView.
> The question is where to stick these
> functions. Internationalization doesn't have a obvious object we can
> hang functions off of (unlike, for example crypto), and the above
> names are much too generic to turn into global functions.
If we deem streaming conversion unnecessary, I'd put the methods on
DOMString and ArrayBufferView. It would be terribly sad to let the
schedules of various working groups affect the API design.
--
Henri Sivonen
hsiv...@iki.fi
http://hsivonen.iki.fi/
> On Wed, Mar 14, 2012 at 12:49 AM, Jonas Sicking <jo...@sicking.cc> wrote:
> > Something that has come up a couple of times with content authors
> > lately has been the desire to convert an ArrayBuffer (or part thereof)
> > into a decoded string. Similarly being able to encode a string into an
> > ArrayBuffer (or part thereof).
> >
> > Something as simple as
> >
> > DOMString decode(ArrayBufferView source, DOMString encoding);
> > ArrayBufferView encode(DOMString source, DOMString encoding,
> > [optional] ArrayBufferView destination);
>
> It saddens me that this allows non-UTF-8 encodings. However, since use
> cases for non-UTF-8 encodings were mentioned in this thread, I suggest
> that the set of supported encodings be an enumerated set of encodings
> stated in a spec and browsers MUST NOT support other encodings.
I believe we have consensus on the above.
> The
> set should probably be the set offered in the encoding popup at
> http://validator.nu/?charset or a subset thereof (containing at least
> UTF-8 of course). (That set was derived by researching the
> intersection of the encodings supported by browsers, Python and the
> JDK.)
I have edited the proposal to base the list of encodings on
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html - is there any
reason that would not be sufficient or appropriate? (this appears to be a
superset of the validator.nu/?charset list, with only a small number of
additional encodings)
> > would go a very long way.
>
> Are you sure that it's not necessary to support streaming conversion?
> The suggested API design assumes you always have the entire data
> sequence in a single DOMString or ArrayBufferView.
I agree that this is a sticking point, and I'm not sure how to resolve it.
Some are advocating for a simpler API with the above assumption, others for
a broader solution that allows streaming conversion. The draft text as
written now is in the middle - supporting writing into an existing buffer,
but simply failing on overflow - and is thus not satisfactory to either
group.
Yes, I think we should enumerate the set of encodings supported.
Ideally we'd for simplicity support the same set of enumerated
encodings everywhere in the platform and over time try to shrink that
set.
>> would go a very long way.
>
> Are you sure that it's not necessary to support streaming conversion?
> The suggested API design assumes you always have the entire data
> sequence in a single DOMString or ArrayBufferView.
>
>> The question is where to stick these
>> functions. Internationalization doesn't have a obvious object we can
>> hang functions off of (unlike, for example crypto), and the above
>> names are much too generic to turn into global functions.
>
> If we deem streaming conversion unnecessary, I'd put the methods on
> DOMString and ArrayBufferView. It would be terribly sad to let the
> schedules of various working groups affect the API design.
Streaming is a very good question. I hadn't thought about that.
Especially now that we have chunked ArrayBuffer support in XHR
streaming would seem like a much more interesting request.
/ Jonas
> I have edited the proposal to base the list of encodings on
> http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html - is there any
> reason that would not be sufficient or appropriate? (this appears to be a
> superset of the validator.nu/?charset list, with only a small number of
> additional encodings)
>
There are lots of encodings in that list which browsers need to support for
legacy text/html content, which are probably completely unnecessary here.
People may be storing Shift-JIS text in ID3 tags, but I doubt they're
doing that with ISO-2022-JP.
I'm undecided about legacy encodings in general, but that aside, I'd start
from just ["UTF-8"], and add to the list based on concrete use cases.
Don't start from the whole list and try to pare it down.
I wonder if we can't limit the damage of extending more support to legacy
encodings. We have a use case for decoding legacy charsets (ID3 tags), but
do we have any use cases for encoding to them? If you're writing back
changed ID3 tags, you should be writing them back as ID3v2 (which is all
most tagging software writes now), which supports UTF-8 as of ID3v2.4.
On Mon, Mar 19, 2012 at 5:54 PM, Jonas Sicking <jo...@sicking.cc> wrote:
> Yes, I think we should enumerate the set of encodings supported.
> Ideally we'd for simplicity support the same set of enumerated
> encodings everywhere in the platform and over time try to shrink that
> set.
>
Shrinking the set supported for HTML will be much harder than keeping this
set small to begin with.
--
Glenn Maynard
What value are we adding, and to whom, by keeping the list the
smallest it can be, even when that means keeping the lists of
supported encodings different between different APIs?
The concrete costs are that authors will have to learn which encodings
work where, and that implementations need to keep separate lists of
supported encodings in different APIs.
/ Jonas
> What value are we adding, and to whom, by keeping the list the
> smallest it can be, even when that means keeping the lists of
> supported encodings different between different APIs?
>
Not needlessly extending support for legacy encodings means there's no
chance of this API inadvertently causing proliferation of those encodings.
That benefits everyone who might come in contact with that data, and
increases the odds of being able to remove some of those encodings from the
platform entirely.
> The concrete costs are that authors will have to learn which encodings
> work where, and that implementations need to keep separate lists of
> supported encodings in different APIs.
>
Authors don't need to learn that; all they care about is if the encoding
they're trying to use works. Nobody memorizes lists of encodings.
Keeping a list of supported encodings is a trivial cost.
It also means that browsers need to be able to encode to each of these
encodings, and encoding for all of them needs to be specified, which I
think is currently unneeded. (Unless we go the asymmetric
encoding/decoding route, supporting only decoders for legacy charsets. If
this is the only reason that'd all have to be specified, that's probably
another reason to consider it...)
Supporting streaming decoding for modal encodings, such as ISO-2022-CN,
might also be a burden: it means implementations would be required to
support stateful, incremental decoding for that charset, which is more
complicated than most encodings (which are stateless). Many
implementations probably do support that, but I don't think it's currently
mandatory, and it would complicate any streaming API. Stateful encodings
need to die even more than other legacy encodings; I hope this API doesn't
have to support any of them.
--
Glenn Maynard
> If this is the only reason that'd all have to be specified, that's
> probably another reason to consider it...
(Well, there's form data either way. At least encoding is probably easier
to spec, since it only has to deal with UTF-16 error handling...)
--
Glenn Maynard
It seems unlikely to me that adding support for an encoding here will
make it harder to eradicate the encoding from the web.
>> The concrete costs are that authors will have to learn which encodings
>> work where, and that implementations need to keep separate lists of
>> supported encodings in different APIs.
>
>
> Authors don't need to learn that; all they care about is if the encoding
> they're trying to use works. Nobody memorizes lists of encodings.
Why are encodings different than other parts of the API where you
indeed have to know what works and what doesn't?
> It also means that browsers need to be able to encode to each of these
> encodings, and encoding for all of them needs to be specified, which I think
> is currently unneeded. (Unless we go the asymmetric encoding/decoding
> route, supporting only decoders for legacy charsets. If this is the only
> reason that'd all have to be specified, that's probably another reason to
> consider it...)
>
> Supporting streaming decoding for modal encodings, such as ISO-2022-CN,
> might also be a burden: it means implementations would be required to
> support stateful, incremental decoding for that charset, which is more
> complicated than most encodings (which are stateless). Many implementations
> probably do support that, but I don't think it's currently mandatory, and it
> would complicate any streaming API. Stateful encodings need to die even
> more than other legacy encodings; I hope this API doesn't have to support
> any of them.
UTF8 is stateful, so I disagree.
/ Jonas
> Why are encodings different than other parts of the API where you
> indeed have to know what works and what doesn't.
>
Do you memorize lists of encodings? I certainly don't. I look them up as
needed.
> UTF8 is stateful, so I disagree.
>
No, UTF-8 doesn't require a stateful decoder to support streaming. You
decode up to the last codepoint that you can decode completely. The return
values are the output data, the number of bytes output, and the number of
bytes consumed; that's all you need to restart decoding later. That's the
iconv(3) approach that we're probably all familiar with, which works with
almost all encodings.
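The "decode up to the last complete codepoint" step can be sketched as below. The helper name is hypothetical, it assumes the input is otherwise well-formed UTF-8, and the decode itself uses today's TextDecoder purely for convenience.

```javascript
// Sketch only: how many leading bytes of `bytes` form complete UTF-8
// sequences. Anything after that is an unfinished trailing sequence.
function completeUtf8PrefixLength(bytes) {
  const n = bytes.length;
  // A sequence is at most 4 bytes, so its lead byte is in the last 4.
  for (let i = n - 1; i >= 0 && i >= n - 4; i--) {
    const b = bytes[i];
    if ((b & 0xc0) === 0x80) continue; // continuation byte, keep looking
    let need = 1;
    if (b >= 0xf0) need = 4;
    else if (b >= 0xe0) need = 3;
    else if (b >= 0xc0) need = 2;
    // If the last sequence runs past the end, stop just before it.
    return i + need > n ? i : n;
  }
  return n;
}

// U+3042 (3 bytes) followed by the first byte of a 2-byte sequence.
const chunk = new Uint8Array([0xe3, 0x81, 0x82, 0xc2]);
const consumed = completeUtf8PrefixLength(chunk);
const text = new TextDecoder("utf-8").decode(chunk.subarray(0, consumed));
```

The caller keeps the bytes from consumed onward and prepends them to the next chunk, so the function itself holds no state between calls.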
ISO-2022 encodings are stateful: you have to persistently remember the
character subsets activated by earlier escape sequences. An iconv-like
streaming API is impossible; to support streamed decoding, you'd need to
have a decoder object that the user keeps around in order to store that
state. http://en.wikipedia.org/wiki/ISO/IEC_2022#Code_structure
--
Glenn Maynard
Which seems like it leaves us with these options:
1. Only support encodings with stateless coding (possibly down to a minimum
of UTF-8)
2. Only provide an API supporting non-streaming coding (i.e. whole
strings/whole buffers)
3. Expand the API to return encoder/decoder objects that capture state
Any others?
Trying to simplify the problem but take on both (1) and (2) without (3)
would lead to an API that could not encompass (3) in the future, which
would be a mistake.
I'll throw out that the in-progress design of a Globalization API for
ECMAScript -
http://norbertlindenberg.com/2012/02/ecmascript-internationalization-api/ -
is currently spec'd to both build on the existing locale-aware methods on
String/Number/Date prototypes as conveniences, as well as introducing the
Collator and *Format objects.
Should we start with UTF-8-only/non-streaming methods on
DOMString/ArrayBufferView, and avoid constraining a future API supporting
multiple, possibly stateful encodings and streaming?
> 1. Only support encodings with stateless coding (possibly down to a minimum
> of UTF-8)
> 2. Only provide an API supporting non-streaming coding (i.e. whole
> strings/whole buffers)
> 3. Expand the API to return encoder/decoder objects that capture state
>
> Any others?
>
> Trying to do simplify the problem but take on both (1) and (2) without (3)
> would lead to an API that could not encompass (3) in the future, which
> would be a mistake.
>
I don't think that's obviously a mistake. Only the nastiest, wartiest of
legacy encodings require it.
That said, it's fairly simple to later return an additional state object
from the previously proposed streaming APIs, eg.
result = decode(str, 0, outputView)
// result.outputBytes == 15
// result.nextInputByte == 5
// result.state == opaque object
result2 = decode(str, result.nextInputByte, outputView, {state:
result.state});
--
Glenn Maynard
Regards
-Mark
I'm pretty sure there is consensus for supporting UTF8. UTF8 is
stateful, though it can be made stateless by not consuming all
characters and instead forcing the caller to keep the state (in the
form of unconsumed text).
So I would rephrase your 3 options above as:
1) Create an API which forces consumers to do state handling. Probably
leading to people creating wrappers which essentially implement option
3
2) Don't support streaming
3) Have encoder/decoder objects which hold state
I personally don't think 1 is a good option since it's basically the
same as 3 but just with libraries doing some of the work. We might as
well do that work so that libraries aren't needed.
This leaves us with 2 or 3. So the question is if we should support
streaming or not. I suspect doing so would be worth it.
/ Jonas
The categories feel strange.
If the conversion is not streaming (whole strings/whole buffers), its
implementation should simply be a wrapper around the browser's
conversion functions.
There is no need for a state object to save the state, because the conversion
is complete when the function returns, even for a stateful encoding.
Streaming conversion needs state even if the encoding is stateless.
When the given partial input ends in the middle of a character,
like "\xE3\x81\x82\xC2", the conversion consumes 4 bytes, outputs one character
"\u3042", and remembers the partial byte "\xC2". This byte is the state.
> That said, it's fairly simple to later return an additional state object
> from the previously proposed streaming APIs, eg.
>
> result = decode(str, 0, outputView)
> // result.outputBytes == 15
> // result.nextInputByte == 5
> // result.state == opaque object
>
> result2 = decode(str, result.nextInputByte, outputView, {state:
> result.state});
You can refer to mbsrtowcs(3), which converts a character string to a wide-character
string (restartable). It uses an opaque state.
size_t mbsnrtowcs(wchar_t *restrict dst, const char **restrict src,
size_t nmc, size_t len, mbstate_t *restrict ps);
http://pubs.opengroup.org/onlinepubs/9699919799/functions/mbsrtowcs.html
Anyway, they need to report an error if the byte sequence is invalid for the encoding.
--
NARUSE, Yui <nar...@airemix.jp>
Your use of the word "stateful" involves a misunderstanding.
Usually the term "stateful encoding" means that the encoding keeps state
between characters, not bytes.
What you mean is usually expressed by the word "multibyte".
UTF-8 is a multibyte encoding, and it needs to keep state when streaming.
> So I would rephrase your 3 options above as:
>
> 1) Create an API which forces consumers to do state handling. Probably
> leading to people creating wrappers which essentially implement option
> 3
> 2) Don't support streaming
> 3) Have encoder/decoder objects which hold state
>
> I personally don't think 1 is a good option since it's basically the
> same as 3 but just with libraries doing some of the work. We might as
> well do that work so that libraries aren't needed.
>
> This leaves us with 2 or 3. So the question is if we should support
> streaming or not. I suspect doing so would be worth it.
I think it should provide a non-streaming API.
And if there are concrete use cases, provide a streaming API as a separate one.
--
NARUSE, Yui <nar...@airemix.jp>
For XMLHttpRequest it might be, yes.
I think we should expose the same encoding set throughout the platform.
One reason to limit the encoding set initially might be because we have
not all converged yet on our encoding sets. Gecko, Safari, and Internet
Explorer expose a lot more encodings than Opera and Chrome.
As for the API, how about:
enc = new Encoder("euc-kr")
string1 = enc.encode(bytes1)
string2 = enc.encode(bytes2)
string3 = enc.eof() // might return empty string if all is fine
And similarly you would have
dec = new Decoder("shift_jis")
bytes = dec.decode(string)
Or alternatively you could have a single object that exposes both encode()
and decode() and tracks state for both:
enc = new Encoding("gb18030")
bytes1 = enc.decode(string1)
string2 = enc.encode(bytes2)
--
Anne van Kesteren
http://annevankesteren.nl/
Delegating to http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html
seems appropriate.
> 1) Create an API which forces consumers to do state handling. Probably
> leading to people creating wrappers which essentially implement option
> 3
>
It's not the same. Please look at how ISO-2022 works: the stream has
*long-lived* state, with escape sequences that change the meaning of later
code sequences in the stream. For example, you have to remember whether GR
is encoding G1, G2 or G3. This can't be stored merely by remembering the
next input byte you have to start at.
As Yui said, the sort of state UTF-8 has isn't what people mean when we
talk about "stateful encodings".
On Wed, Mar 21, 2012 at 3:34 AM, NARUSE, Yui <nar...@airemix.jp> wrote:
> For streaming conversion, it needs state even if the encoding is stateless.
> When the given partial input is finished at the middle of a character
> like "\xE3\x81\x82\xC2", the conversion consumes 4 bytes, output one
> character
> "\u3042", and remember the partial bytes "\xC2". This bytes is the state.
>
You don't need to do that. You can simply convert as many output
codepoints as can be *completely* converted. In this example, you'd
consume 3 bytes and output one codepoint. You don't consume data that you
can't immediately convert, so you don't have to buffer anything.
(We don't have to do it that way, of course; just pointing out that you
don't *need* special state for streaming encodings like UTF-8.)
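A minimal sketch of that "only consume what can be completely converted" approach for UTF-8. The function name completeUtf8PrefixLength is hypothetical, and the sketch only decides where to cut; it assumes the actual conversion (and any error handling for invalid bytes) is done by whatever decoder is called afterwards.

```javascript
// Return how many leading bytes form only complete UTF-8 sequences.
// The caller keeps bytes.subarray(n) and prepends them to the next chunk.
function completeUtf8PrefixLength(bytes) {
  // Walk back over at most 3 trailing continuation bytes (10xxxxxx).
  let i = bytes.length - 1;
  let seen = 0;
  while (i >= 0 && seen < 3 && (bytes[i] & 0xC0) === 0x80) {
    i--;
    seen++;
  }
  if (i < 0) return bytes.length; // empty, or only continuation bytes

  // Work out how long the last sequence claims to be from its lead byte.
  const lead = bytes[i];
  let needed;
  if (lead < 0x80) needed = 1;
  else if ((lead & 0xE0) === 0xC0) needed = 2;
  else if ((lead & 0xF0) === 0xE0) needed = 3;
  else if ((lead & 0xF8) === 0xF0) needed = 4;
  else return bytes.length; // not a valid lead byte; leave it to the decoder

  // If the last sequence is truncated, stop just before its lead byte.
  const have = bytes.length - i;
  return have < needed ? i : bytes.length;
}

// The example from this thread: three bytes are convertible, "\xC2" is not yet.
console.log(completeUtf8PrefixLength(new Uint8Array([0xE3, 0x81, 0x82, 0xC2])));
```

With this shape the caller holds the unconsumed tail, which is exactly the "state" being argued about; it only works because UTF-8's state never extends past the current character.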
> Anyway, they need error if the byte sequence is invalid for the encoding.
>
Errors were discussed previously: by default errors output U+FFFD (or
another replacement character, for encoding unsupported characters to
non-Unicode encodings), and we may have an option to turn it into an
exception.
--
Glenn Maynard
> On Wed, 21 Mar 2012 01:27:47 -0700, Jonas Sicking <jo...@sicking.cc>
> wrote:
>
>> This leaves us with 2 or 3. So the question is if we should support
>> streaming or not. I suspect doing so would be worth it.
>>
>
> For XMLHttpRequest it might be, yes.
>
> I think we should expose the same encoding set throughout the platform.
> One reason to limit the encoding set initially might be because we have not
> all converged yet on our encoding sets. Gecko, Safari, and Internet
> Explorer expose a lot more encodings than Opera and Chrome.
>
Just to throw it out there - does anyone feel we can/should offer
asymmetric encode/decode support, i.e. supporting more encodings for decode
operations than for encode operations?
> As for the API, how about:
>
> enc = new Encoder("euc-kr")
> string1 = enc.encode(bytes1)
> string2 = enc.encode(bytes2)
> string3 = enc.eof() // might return empty string if all is fine
>
> And similarly you would have
>
> dec = new Decoder("shift_jis")
> bytes = dec.decode(string)
>
> Or alternatively you could have a single object that exposes both encode()
> and decode() and tracks state for both:
>
> enc = new Encoding("gb18030")
> bytes1 = enc.decode(string1)
> string2 = enc.encode(bytes2)
That's the direction my thinking was headed. Glenn pointed out that the
state that's implicitly captured in the above objects could instead be
returned as an explicit but opaque state object that's passed in and out of
stateless functions. As a potential user of the API, I find the above
"object-oriented" style easier to understand.
Re: Encoding object vs. an Encoder/Decoder pair - I'd prefer the latter as
it makes the state being captured and any methods/attributes to interrogate
the state clearer.
Bikeshedding on the name - we'd have to put "String" or "Text" in there
somewhere, since audio/video/image codecs will likely want to use similar
terms.
In the past decade I've never had to encode into something other than
UTF-8. I have had to decode many encoding sets.
If I did need to do a special encoding, given the state of typed arrays,
I'd probably just implement the encoding in JS.
+1 for asymmetric from my experience.
-Charles
Yeah, I suspect we'll get it right once put in a draft :-)
XMLHttpRequest has that. You can only send (encode) UTF-8, receive
(decode) "everything". Forms can send "everything". URL query parameters
can encode everything (though the page itself has to be encoded in the
encoding of choice).
If we have no use cases just supporting encoding UTF-8 seems fine to me,
but I think the design should allow for other encodings in the future.
> Bikeshedding on the name - we'd have to put "String" or "Text" in there
> somewhere, since audio/video/image codecs will likely want to use similar
> terms.
They can use the prefixed variants :-) If we have to use a prefix "String"
seems better, as Text is a node object in the platform.
Simon pointed out Text as prefix is probably better (it is used elsewhere
in the platform unrelated to nodes (e.g. TextTrack)), though I'd
personally prefer simply Decoder/Encoder.
I don't mind this API for complex usecases e.g. streaming, but it is
massive overkill for the simple common case of "I have a list of bytes
that I want to decode to a string" or "I have a string that I want to
encode to bytes". For those cases I strongly prefer the earlier API
along the lines of
String.prototype.encode(encoding)
ArrayBufferView.prototype.decode(encoding)
> As for the API, how about:
>
> enc = new Encoder("euc-kr")
> string1 = enc.encode(bytes1)
> string2 = enc.encode(bytes2)
> string3 = enc.eof() // might return empty string if all is fine
>
A problem with this is that the bugs resulting from not calling eof() are
subtle. The only thing eof() would ever do, I think, is return U+FFFD
characters if there are leftover characters in the internal buffer; if you
never call eof(), you'll never get incorrect results unless you test with
invalid inputs.
It's minor, as subtle-edge-cases-that-people-won't-get-right go, but it's
at least worth a mention. Maybe people who would use this API instead of
the simpler non-streaming version (which could be a thin wrapper on this)
in the first place are also more likely to get this right.
I'm guessing a common, incorrect pattern would be:
string = new Encoder("euc-kr").encode(bytes);
which would *not* be equivalent to bytes.encode("euc-kr").
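To illustrate how quietly this goes wrong, here is a small sketch using the TextDecoder interface that browsers and Node.js provide today. It postdates this message and uses the opposite naming from the example above (bytes to string is decode there), so treat the mapping onto the proposed Encoder/eof() design as an assumption.

```javascript
// "A" followed by a truncated two-byte UTF-8 sequence.
const danglingBytes = new Uint8Array([0x41, 0xC2]);

// One-shot conversion: the truncated byte shows up as U+FFFD right away.
const oneShotText = new TextDecoder("utf-8").decode(danglingBytes);

// Streaming conversion without the final flush: the truncated byte is
// silently held back, so the result looks fine.
const forgetfulDecoder = new TextDecoder("utf-8");
const unflushedText = forgetfulDecoder.decode(danglingBytes, { stream: true });

// Only the flushing call (the eof() equivalent) reveals the problem.
const flushedText = forgetfulDecoder.decode();

console.log(JSON.stringify([oneShotText, unflushedText, flushedText]));
```

The difference only appears with invalid or truncated input, which is why code that skips the final call tends to pass casual testing.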
--
Glenn Maynard
Another way would be to have a second optional argument that indicates
whether more bytes are coming (defaults to false), but I'm not sure of the
chances that would be used correctly. The reasons you outline are probably
why many browser implementations deal with EOF poorly too.
> Another way would be to have a second optional argument that indicates
> whether more bytes are coming (defaults to false), but I'm not sure of the
> chances that would be used correctly. The reasons you outline are probably
> why many browser implementations deal with EOF poorly too.
It might not improve it, but I don't think it'd be worse. If you didn't
use it correctly for an encoding where it matters, the breakage would be
obvious.
Also, the previous "automatically-streaming" API has another possible
misuse: constructing a single encoder, then calling it repeatedly for
unrelated strings, without calling eof() between them (trailing bytes would
become U+FFFD in the next string). That'd be a less likely mistake with
this, too.
Here's a suggestion, working from that:
encoder = Encoder("euc-kr");
view = encoder.encode(str1, {continues: true});
view = encoder.encode(str2, {continues: true});
view = encoder.encode(str3, {continues: false});
An alternative way to end the stream:
encoder = Encoder("euc-kr");
view = encoder.encode(str1, {continues: true});
view = encoder.encode(str2, {continues: true});
view = encoder.encode(str3, {continues: true});
view = encoder.encode("", {continues: false});
// or view = encoder.encode(""); // equivalent; continues defaults to false
// or view = encoder.encode(); // maybe equivalent, if the first parameter
// is optional
The simplest usage is concise enough that we don't really need a separate
str.encode() method:
view = Encoder("euc-kr").encode(str);
If it has an eof() method, it'd just be a literal wrapper for
encoder.encode(), but it can probably be omitted.
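To make the |continues| idea concrete, here is a rough sketch. The class name StreamingUtf8Encoder is hypothetical, and it handles UTF-8 only because it leans on the TextEncoder interface available in current browsers and Node.js (which always produces UTF-8); the only state it needs is a trailing high surrogate whose partner has not arrived yet.

```javascript
// Hypothetical encoder with a {continues} option, UTF-8 only.
class StreamingUtf8Encoder {
  constructor() {
    this.pending = "";              // held-back lone high surrogate, if any
    this.utf8 = new TextEncoder();  // existing API; always encodes UTF-8
  }

  // continues defaults to false, so a plain encode(str) ends the stream.
  encode(str = "", { continues = false } = {}) {
    let input = this.pending + str;
    this.pending = "";
    if (continues && input.length > 0) {
      const last = input.charCodeAt(input.length - 1);
      // A high surrogate at the end may be completed by the next call.
      if (last >= 0xD800 && last <= 0xDBFF) {
        this.pending = input.slice(-1);
        input = input.slice(0, -1);
      }
    }
    // With continues false, anything left unpaired is encoded as U+FFFD.
    return this.utf8.encode(input);
  }
}

// U+1F600 split across two calls still encodes as one 4-byte sequence.
const splitEncoder = new StreamingUtf8Encoder();
const headBytes = splitEncoder.encode("\uD83D", { continues: true });
const tailBytes = splitEncoder.encode("\uDE00");
console.log(headBytes.length, Array.from(tailBytes));
```

After a call with continues false the object holds no state, so it behaves like a newly created one.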
--
Glenn Maynard
All major mobile OSes use LE on ARM; I believe we currently don't ship
anything on BE ARM. (We do, however, currently ship on BE MIPS, though
MIPS too is mostly LE nowadays.)
--
Geoffrey Sneddon - Opera Software
<http://gsnedders.com>
<http://opera.com>
> On Thu, Mar 22, 2012 at 8:58 AM, Anne van Kesteren <ann...@opera.com>
> wrote:
>
> > Another way would be to have a second optional argument that indicates
> > whether more bytes are coming (defaults to false), but I'm not sure of
> the
> > chances that would be used correctly. The reasons you outline are
> probably
> > why many browser implementations deal with EOF poorly too.
>
>
> It might not improve it, but I don't think it'd be worse. If you didn't
> use it correctly for an encoding where it matters, the breakage would be
> obvious.
>
> Also, the previous "automatically-streaming" API has another possible
> misuse: constructing a single encoder, then calling it repeatedly for
> unrelated strings, without calling eof() between them (trailing bytes would
> become U+FFFD in the next string). That'd be a less likely mistake with
> this, too.
>
Agreed. Simple things should be simple.
> Here's a suggestion, working from that:
>
> encoder = Encoder("euc-kr");
> view = encoder.encode(str1, {continues: true});
> view = encoder.encode(str2, {continues: true});
> view = encoder.encode(str3, {continues: false});
>
> An alternative way to end the stream:
>
> encoder = Encoder("euc-kr");
> view = encoder.encode(str1, {continues: true});
> view = encoder.encode(str2, {continues: true});
> view = encoder.encode(str3, {continues: true});
> view = encoder.encode("", {continues: false});
> // or view = encoder.encode(""); // equivalent; continues defaults to false
> // or view = encoder.encode(); // maybe equivalent, if the first parameter
> is optional
>
> The simplest usage is concise enough that we don't really need a separate
> str.encode() method:
>
> view = Encoder("euc-kr").encode(str);
>
> If it has an eof() method, it'd just be a literal wrapper for
> encoder.encode(), but it can probably be omitted.
Agreed, I'd omit it.
Bikeshed: The |continues| term doesn't completely thrill me; it's clear in
context, but not necessarily what someone might go searching for.
{eof:true} would be lovely except we want the default to be yes-EOF but a
falsy JS value. |noEOF| ?
If there aren't immediate objections, I'll update my wiki draft with this
style of API, and see about updating my JS polyfill as well.
Opinions on one object type (Encoding) vs. two (Encoder, Decoder) ?
One object type is simpler for the non-streaming case, e.g.:
// somewhere globally
g_codec = Encoding("euc-kr");
// elsewhere...
str = g_codec.decode(view); // okay
view = g_codec.encode(str); // fine, no state captured
str = g_codec.decode(view); // still okay
but IMHO someone unfamiliar with the internals of encodings might extend
the above into:
// somewhere globally
g_codec = Encoding("euc-kr");
// elsewhere in some stream handling code...
str = g_codec.decode(view, {continues: true}); // okay..
view = g_codec.encode(str, {continues: true}); // sure, now both an encode
// and decode state are captured by codec
str = g_codec.decode(view, {continues: true}); // okay only if this is more
// of the same stream; if there are two incoming streams, this is wrong
The same mistake is possible with Encoder / Decoder objects, of course (you
just need two globals). But something about separating them makes it
clearer to me that the |continues| flag is affecting state in the object
rather than just affecting the output of the call.
Peter Beverloo suggests "stream" on IRC. I like it.
> Opinions on one object type (Encoding) vs. two (Encoder, Decoder) ?
Two seems cleaner.
> On Mon, 26 Mar 2012 17:56:41 +0100, Joshua Bell <jsb...@chromium.org>
> wrote:
>
>> Bikeshed: The |continues| term doesn't completely thrill me; it's clear
>> in context, but not necessarily what someone might go searching for.
>> {eof:true} would be lovely except we want the default to be yes-EOF but a
>> falsy JS value. |noEOF| ?
>>
>
> Peter Beverloo suggests "stream" on IRC. I like it.
+1
> Opinions on one object type (Encoding) vs. two (Encoder, Decoder) ?
>>
>
> Two seems cleaner.
I've gone ahead and updated the wiki/draft:
http://wiki.whatwg.org/wiki/StringEncoding
This includes:
* TextEncoder / TextDecoder objects, with |encode| and |decode| methods
that take option dicts
* A |stream| option, per the above
* A |nullTerminator| option eliminates the need for a stringLength method
(hasta la vista, baby!)
* |encodedLength| method is dropped since you can't in-place encode anyway
* decoding errors yield fallback code points by default, but setting a
|fatal| option causes a DOMException to be thrown instead
* specified exceptions as DOMException of type "EncodingError", as a
placeholder
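A minimal sketch of the fallback-versus-fatal behaviour, written against the TextDecoder interface as it exists in current browsers and Node.js rather than against this draft. Two things are assumptions about that later interface and not part of the draft described here: `fatal` is passed to the constructor instead of per call, and the sketch catches any error rather than relying on a particular exception type.

```javascript
// 0xFF never appears in well-formed UTF-8.
const invalidBytes = new Uint8Array([0x61, 0xFF, 0x62]);

// Default: the invalid byte is replaced by the fallback code point U+FFFD.
const lenientText = new TextDecoder("utf-8").decode(invalidBytes);

// With fatal set, the same input raises an error instead.
let fatalThrew = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(invalidBytes);
} catch (err) {
  fatalThrew = true;
}

console.log(JSON.stringify(lenientText), fatalThrew);
```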
New issues resulting from this refactor:
* You can change the options (stream, nullTerminator, fatal) midway through
decoding a stream. This would be silly to do, but as written I don't think
this makes the implementation more difficult. Alternately, the non-stream
options could be set on the TextDecoder object itself.
* BOM handling needs to be resolved. The Encoding spec makes the encoding
label secondary to the BOM. With this API it's unclear if that should be
the case. Options include having a mismatching BOM throw, treating a
mismatching BOM as a decoding error (i.e. fallback or throw, depending on
options), or allow the BOM to actually switch the decoder used for this
"stream" - possibly if-and-only-if the default encoding was specified.
I've also partially updated the JS "polyfill" proof-of-concept
implementation, tests, and examples as well, but it does not implement
streaming yet (i.e. a "stream" option is ignored, state is always lost); I
need to do a tiny bit more refactoring first.
> * A |stream| option, per the above
>
Does this make sense when you're using stream: false to flush the stream?
It's still a streaming operation. I guess it's "close enough".
> * A |nullTerminator| option eliminates the need for a stringLength method
> (hasta la vista, baby!)
>
I strongly disagree with this change. It's much cleaner and more generic
for the decoding algorithm to not know anything about null terminators, and
to have separate general-purpose methods to determine the length of the
string (memchr/wmemchr analogs, which we should have anyway). We made this
simplification a long time ago--why did you resurrect this?
array = new Int8Array(myArrayBuffer);
length = array.indexOf(0); // same semantics as String.indexOf
if(length != -1)
array = array.subarray(0, length);
new TextDecoder('utf-8').decode(array);
> * BOM handling needs to be resolved. The Encoding spec makes the encoding
> label secondary to the BOM. With this API it's unclear if that should be
> the case. Options include having a mismatching BOM throw, treating a
> mismatching BOM as a decoding error (i.e. fallback or throw, depending on
> options), or allow the BOM to actually switch the decoder used for this
> "stream" - possibly if-and-only-if the default encoding was specified.
>
The path of fewest errors is probably to have a BOM override the specified
UTF-16 endianness, so saying "UTF-16BE" just changes the default.
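A sketch of that rule, for illustration only. The helper name pickUtf16Label is hypothetical, and it only chooses which decoder label to use; it says nothing about how any shipped decoder actually treats a mismatching BOM.

```javascript
// Let a BOM override the requested UTF-16 endianness; otherwise keep the default.
function pickUtf16Label(bytes, defaultLabel) {
  if (bytes.length >= 2) {
    if (bytes[0] === 0xFE && bytes[1] === 0xFF) return "utf-16be";
    if (bytes[0] === 0xFF && bytes[1] === 0xFE) return "utf-16le";
  }
  return defaultLabel; // no BOM: the caller's label is just the default
}

// A little-endian BOM wins even though the caller asked for big-endian.
console.log(pickUtf16Label(new Uint8Array([0xFF, 0xFE, 0x41, 0x00]), "utf-16be"));
```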
An aside:
The TypedArray constructors have a depressing design bug: new
Int8Array(someOtherView) makes a copy of the data. It's nonsensical that
view constructors create a view when passed an ArrayBuffer, but a copy when
passed another view. This doesn't make any kind of sense; creating a view
should create a *view* if it's passed an object that already has
ArrayBuffer-based storage, and making a copy should have been its own
operation.
This means we can't say "creating a view is cheap"; we have to qualify it:
"creating a view is cheap, as long as you're careful not to call a
constructor that makes a copy".
It's frustrating that we're now stuck with a confusing, inconsistent API
like this. I'm sure it's much too late to fix this properly, but hopefully
an option can be added to fix it, so a new TypedArray(TypedArray, {view:
true}) call actually creates a view.
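For anyone who hasn't run into this, a short demonstration of the difference. The variable names are just for illustration; the three-argument constructor form and subarray() are the existing ways to get a real view.

```javascript
const backingBuffer = new ArrayBuffer(4);
const originalView = new Uint8Array(backingBuffer);

// Passing a view to the constructor copies the elements into new storage.
const copiedView = new Int8Array(originalView);

// Passing the underlying buffer (plus offset and length) shares the storage.
const sharedView = new Int8Array(originalView.buffer,
                                 originalView.byteOffset,
                                 originalView.length);

originalView[0] = 7;

// The copy does not see the write; the shared view does.
console.log(copiedView[0], sharedView[0]);
console.log(copiedView.buffer === backingBuffer, sharedView.buffer === backingBuffer);
```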
--
Glenn Maynard
> On Mon, Mar 26, 2012 at 4:49 PM, Joshua Bell <jsb...@chromium.org> wrote:
>
>> * A |stream| option, per the above
>>
>
> Does this make sense when you're using stream: false to flush the stream?
> It's still a streaming operation. I guess it's "close enough".
>
> * A |nullTerminator| option eliminates the need for a stringLength method
>> (hasta la vista, baby!)
>>
>
> I strongly disagree with this change. It's much cleaner and more generic
> for the decoding algorithm to not know anything about null terminators, and
> to have separate general-purpose methods to determine the length of the
> string (memchr/wmemchr analogs, which we should have anyway). We made this
> simplification a long time ago--why did you resurrect this?
>
Ah, I'd forgotten that there was consensus that doing this outside the API
was preferable. I'll remove the option when I touch the spec again.
>> * BOM handling needs to be resolved. The Encoding spec makes the encoding
>> label secondary to the BOM. With this API it's unclear if that should be
>> the case. Options include having a mismatching BOM throw, treating a
>> mismatching BOM as a decoding error (i.e. fallback or throw, depending on
>> options), or allow the BOM to actually switch the decoder used for this
>> "stream" - possibly if-and-only-if the default encoding was specified.
>>
>
> The path of fewest errors is probably to have a BOM override the specified
> UTF-16 endianness, so saying "UTF-16BE" just changes the default.
>
This would apply only if the previous call had {stream: false} (implicitly or
explicitly). Calling with {stream:false} would reset for the next call.
Would it apply only to UTF-16 or UTF-8 as well? Should there be any special
behavior when not specifying an encoding in the constructor?
On Mon, Mar 26, 2012 at 4:27 PM, Jonas Sicking <jo...@sicking.cc> wrote:
> A few comments:
>
> * It appears that we lost the ability to measure how long a resulting
> buffer was going to be and then decode into the buffer. I don't know
> if this is an issue.
>
True. On the plus side, the examples in the page (encode/decode
array-of-strings) didn't change size or IMHO readability at all.
> * It might be a performance problem to have to check for the
> fatal/nullTerminator options on each call.
>
No comment here. Moving the "fatal" and other options to the TextDecoder
object rather than the decode() call is a possibility. I'm not sure which I
prefer.
> * We lost the ability to decode from a arraybuffer and see how many
> bytes were consumed before a null-terminator was hit. One not terribly
> elegant solution would be to add a TextDecoder.decodeWithLength method
> which return a DOMString+length tuple.
Agreed, but of course see above - there was consensus earlier in the thread
that searching for null terminators should be done outside the API,
therefore the caller will have the length handy already. Yes, this would be
a big flaw since decoding a tightly packed data structure (e.g. array of
null terminated strings w/o length) would be impossible with just the
nullTerminator flag.
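For the packed case, a sketch of doing the terminator search outside the decoder, so the caller always knows how many bytes each string occupied. The function name decodePackedCStrings is hypothetical, and the sketch assumes UTF-8 data and the TextDecoder interface as available in current browsers and Node.js.

```javascript
// Decode a run of back-to-back null-terminated UTF-8 strings.
function decodePackedCStrings(bytes) {
  const utf8 = new TextDecoder("utf-8");
  const strings = [];
  let offset = 0;
  while (offset < bytes.length) {
    // memchr-style search for the next terminator, starting at offset.
    let end = bytes.indexOf(0, offset);
    if (end === -1) end = bytes.length; // unterminated tail
    strings.push(utf8.decode(bytes.subarray(offset, end)));
    offset = end + 1; // skip the terminator
  }
  return strings;
}

// "ab\0c\0"
console.log(decodePackedCStrings(new Uint8Array([0x61, 0x62, 0x00, 0x63, 0x00])));
```

Because the search is separate from the decode, swapping in a different terminator (0xFF, say) only changes the indexOf call.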
> * It appears that we lost the ability to measure how long a resulting
> buffer was going to be and then decode into the buffer. I don't know
> if this is an issue.
>
The theory is that it probably isn't a real performance issue to decode
into a new buffer, then copy it where you want it. If you think there are
any cases where it matters, we should look at it, though.
The extra GC might matter if you're doing a lot of large conversions, but
that's easily fixed by adding ArrayBuffer.close().
> * It might be a performance problem to have to check for the
> fatal/nullTerminator options on each call.
>
Are you thinking of people, say, feeding in a single byte at a time? That
seems like it'll be slow no matter what.
On Mon, Mar 26, 2012 at 6:40 PM, Joshua Bell <jsb...@chromium.org> wrote:
> > The path of fewest errors is probably to have a BOM override the
> specified
> > UTF-16 endianness, so saying "UTF-16BE" just changes the default.
>
> This would apply on if the previous call had {stream: false} (implicitly or
> explicitly).
Right. The following two operations should be exactly identical, for every
possible value of str and combination of options, and resulting in a
decoder in the same state:
// Operation 1: feed the input in two pieces
view1 = decoder.decode(str.substr(0, 8), {stream: true});
view2 = decoder.decode(str.substr(8));
finalView = new Int8Array(view1.length + view2.length);
finalView.set(view1);
finalView.set(view2, view1.length);
return finalView;
// Operation 2: convert the whole input in one call
return decoder.decode(str);
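The same invariant can be written as a quick check. This sketch runs in the bytes-to-string direction and uses the TextDecoder interface as found in current browsers and Node.js, both of which are assumptions relative to the draft under discussion; chunkedEqualsOneShot is a hypothetical name.

```javascript
// For every split point, streaming in two pieces must match a single call.
function chunkedEqualsOneShot(bytes) {
  const expected = new TextDecoder("utf-8").decode(bytes);
  for (let split = 0; split <= bytes.length; split++) {
    const d = new TextDecoder("utf-8");
    const joined =
      d.decode(bytes.subarray(0, split), { stream: true }) +
      d.decode(bytes.subarray(split)); // stream defaults to false: flushes
    if (joined !== expected) return false;
  }
  return true;
}

// Sample covering 1-, 2-, 3- and 4-byte sequences, so some splits land mid-character.
const sampleBytes = new TextEncoder().encode("a\u00E9\u3042\u{1F600}");
console.log(chunkedEqualsOneShot(sampleBytes));
```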
> Calling with {stream:false} would reset for the next call.
>
Right: after a {stream:false} call, a decoder or encoder should be
equivalent to a newly-created one.
> Would it apply only to UTF-16 or UTF-8 as well? Should there be any special
> behavior when not specifying an encoding in the constructor?
>
Do you mean, should decoding UTF-8 switch to UTF-16 if it starts with a
UTF-16 BOM? I think that would be confusing. If people want to autodetect
UTF-16 like that, they should probably do it themselves. I think browsers
do this with text/html, but that's just a web-compatibility wart, not a
feature...
--
Glenn Maynard
Requiring callers to find the null character first, and then use that
will require one additional pass over the encoded binary data though.
Also, if we put the API for finding the null character on the Decoder
object it doesn't seem like we're creating an API which is easier to
use, just one that has moved some of the logic from the API to every
caller.
Though I guess the best solution would be to add methods to DataView
which allows consuming an ArrayBuffer up to a null terminated point
and returns the decoded string. Potentially such a method could take a
Decoder object as argument.
/ Jonas
The rationale for specifying the string encoding and decoding
functionality outside the typed array specification is to keep the
typed array spec small and easily implementable. The indexed property
getters and setters on the typed array views, and methods on DataView,
are designed to be implementable with a small amount of assembly code
in JavaScript engines. I'd strongly prefer to continue to design the
encoding/decoding functionality separately from the typed array views.
-Ken
> The rationale for specifying the string encoding and decoding
> functionality outside the typed array specification is to keep the
> typed array spec small and easily implementable. The indexed property
> getters and setters on the typed array views, and methods on DataView,
> are designed to be implementable with a small amount of assembly code
> in JavaScript engines. I'd strongly prefer to continue to design the
> encoding/decoding functionality separately from the typed array views.
>
However, if the browsers don't all implement this, then you can't rely on
it being there. In apps where you compile separately for each browser, you
only pay the cost where the browser doesn't implement it (for example, in
GWT we emulate DataView and Uint8ClampedArray where it is missing). Even
then, you may have to include both versions and do runtime detection, such
as when later versions of the browser include the functionality -- that may
be worse than simply not using the API at all if you care more about code
size than execution speed of encoding/decoding text.
So, personally I think whatever gets the most browsers to completely
implement it is better, whether that is being part of the typed arrays spec
or separate. Logically, it seems to fit most directly in DataView.
--
John A. Tamplin
Software Engineer (GWT), Google
> I guess. It doesn't seem that important, since it's just a few lines of
> code. If this is done, I'd suggest that this helper API *not* have any
> special support for streaming (not to disallow it, but not to have any
> special handling for it, either). I think streaming has little overlap
> with null-terminated fields, since null-termination is typically used with
> fixed-size buffers. It would complicate things; for example, you'd need
> some way to signal to the caller that a null terminator was encountered.
>
Agreed.
Also worth relaying to this thread is that in addition to null termination
there have been requests for other terminators, such as 0xFF which is an
invalid byte in a UTF-8 stream and thus a lovely terminator. Other byte
sequences were mentioned. (This was over in the Khronos WebGL list for
anyone who wants to dig it up. It was tracked as an unresolved ISSUE in the
spec.)
This supports the assertion that we should not special case null
terminators, but instead provide general (and highly optimizable) utilities
like memchr operating on buffers, since we can't anticipate every usage in
higher-level APIs like the one under discussion.
Is there a reason you couldn't keep the current set of functions on
DataView implemented using a small amount of assembly code, and let
the new functions fall back to slower C++ functions?
/ Jonas
The memchr is purely overhead, i.e. we are comparing memchr+decoding
to decoding. So I don't see what's backing up the "probably the
fastest thing" claim.
> Unless there's a concrete benchmark showing that it's slower, and slower
> enough to actually matter, this shouldn't be a consideration. It's a
> premature optimization.
My argument is that it's both faster and more author friendly.
I admit I missed the previous discussion which led to the agreement to
keep the length measuring outside, so I don't know what arguments were
presented. Any pointers would be appreciated.
>> Also, if we put the API for finding the null character on the Decoder
>> object it doesn't seem like we're creating an API which is easier to
>> use, just one that has moved some of the logic from the API to every
>> caller.
>
> It doesn't seem materially harder (a little more code, yes, but that's not
> the same thing), and it's more general-purpose.
I agree it doesn't seem materially harder. I also agree that I don't
have data to show that it's materially slower. But it sounds like
we're in agreement that keeping the logic outside is both harder and
slower which honestly doesn't speak strongly in its favor.
I don't understand the argument that the alternative is more
"general-purpose". The API is already generic in that you can use
whatever delimiter you want since you pass in a view. The only
functionality which is not available is finding a null-terminator in
an arraybuffer which you are arguing below shouldn't be part of the
decoder (which I agree with).
/ Jonas
> Scanning over the buffer twice will cause a lot more memory IO and
> will definitely be slower.
>
That's what cache is for. But: benchmarks...
> We can argue whether it's meaningfully slower or harder. But it seems
> like we agree that it's slower and harder.
>
What? Are you really arguing that we should do something because of
*meaningless* differences?
> I still don't understand what that benefit you are seeing is. You
> hinted at some "more generic" argument, but I still don't understand
> it. So far the only reason that has been brought up is that it
> provides an API for simply finding null terminators which could be
> useful if you are doing things other than decoding. Is that what you
> are talking about when you are saying that it's "more generic"?
>
Yes, I've said that repeatedly. It also avoids bloating the API with
something that's merely a helper for something you can do in a couple lines
of code, and allows you to tell how many bytes/words were consumed (eg. for
packed string arrays).
It can always be added later, but it feels unnecessary.
--
Glenn Maynard
I'm saying that if an API is better in every way then it doesn't seem
like an interesting discussion how much better, we should clearly go
with that API.
/ Jonas