String encoding semantics

stephen judkins

Oct 26, 2009, 12:41:43 AM
to BERT-RPC
I did a quick implementation of the encoder in Scala, a language I'm
learning; it's on GitHub at http://github.com/stephenjudkins/scala-bert.
In doing so, I realized that there may be issues with storing strings
as arrays of bytes with no encoding-related metadata.

As far as I'm aware, in Ruby, a String is simply a string of bytes.
In Ruby 1.9 this string may have some encoding metadata attached to it
to allow proper conversion to other encodings/character sets. However,
if you'd like to treat a Ruby String simply as a string of bytes, it's
quite reasonable to do so, since a String can contain arbitrary
bytes.

However, in Java and Scala, a String is a string of Unicode
characters. Converting an array of bytes into a String requires
knowledge of the specific encoding. As far as I know, there is no
encoding in Java that guarantees the ability to take an arbitrary
string of bytes, convert it into a String object, then get that same
arbitrary string of bytes out later. I'm not sure how other languages
behave in this regard.
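
The round-trip concern above can be illustrated with a small Java sketch. The class and method names (RoundTripDemo, roundTrip) are made up for illustration and are not part of scala-bert. It decodes bytes that are not valid UTF-8 and re-encodes them, showing that the original bytes are not recovered. One caveat worth noting: a single-byte charset such as ISO-8859-1 does map every byte value to a character and back, but the resulting String only carries the bytes and is not meaningful text unless the data really was Latin-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of any BERT library.
final class RoundTripDemo {

    // Decode the bytes into a String with the given charset,
    // then encode that String back into bytes with the same charset.
    static byte[] roundTrip(byte[] input, Charset charset) {
        return new String(input, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // 0xFF and 0xFE can never appear in well-formed UTF-8.
        byte[] raw = {(byte) 0xFF, (byte) 0xFE, 0x41};

        // Each invalid byte is replaced with U+FFFD on decoding,
        // so the re-encoded bytes differ from the input.
        byte[] viaUtf8 = roundTrip(raw, StandardCharsets.UTF_8);
        System.out.println("UTF-8 preserved bytes: " + Arrays.equals(raw, viaUtf8));

        // ISO-8859-1 maps each byte to exactly one char, so the bytes survive.
        byte[] viaLatin1 = roundTrip(raw, StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 preserved bytes: " + Arrays.equals(raw, viaLatin1));
    }
}
```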

Thus, I've written the library to deal with arrays of bytes instead of
Java strings. This seems reasonable, though it also seems likely to
make RPC much less "simple" and DRY than it would otherwise. An
option might be to change the behavior of the encoder/decoders in
affected languages to return either byte arrays or decode them into
character strings.
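
One way the option described above could look is sketched below, assuming the caller tells the decoder which representation it wants. BinaryDecoder, Mode, and decode are hypothetical names, not an existing BERT API. Text mode decodes strictly, so bytes that are not valid UTF-8 are rejected rather than silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical decoder option; names are assumptions for illustration.
final class BinaryDecoder {

    enum Mode { BYTES, UTF8_TEXT }

    // Returns a byte[] copy in BYTES mode, or a String in UTF8_TEXT mode.
    // Throws IllegalArgumentException if text mode is requested for bytes
    // that are not well-formed UTF-8.
    static Object decode(byte[] payload, Mode mode) {
        if (mode == Mode.BYTES) {
            return payload.clone();
        }
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(payload))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("payload is not valid UTF-8", e);
        }
    }
}
```

The drawback is that the receiver still has to know out of band which mode to use for each value, which is the information the wire format does not carry.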

Ideally, it would be nice to be able to pass UTF-8 encoded strings as
well as arrays of bytes over the wire, transparently serializing
and deserializing them into their proper representations. This would
require adding another complex type to the specification, however.
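
To make the idea concrete, here is a minimal sketch of a value that carries its encoding alongside its bytes. EncodedString and its methods are hypothetical and do not reflect the layout of any actual BERT complex type; the point is only that a receiver holding both pieces can choose between the raw bytes and the decoded text without guessing.

```java
import java.nio.charset.Charset;

// Hypothetical "string plus encoding" value; not the BERT wire format.
final class EncodedString {
    private final String encoding; // charset name, e.g. "UTF-8"
    private final byte[] data;     // the encoded bytes as sent on the wire

    EncodedString(String encoding, byte[] data) {
        this.encoding = encoding;
        this.data = data.clone();
    }

    // Build a value from text by encoding it with the named charset.
    static EncodedString of(String text, String encoding) {
        return new EncodedString(encoding, text.getBytes(Charset.forName(encoding)));
    }

    String encoding() {
        return encoding;
    }

    // Raw bytes, for callers that want to treat the value as binary.
    byte[] bytes() {
        return data.clone();
    }

    // Decoded text, using the encoding that travelled with the bytes.
    String text() {
        return new String(data, Charset.forName(encoding));
    }
}
```

A plain binary value would still be needed for data that is not text at all.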

Having dealt with some very painful character set issues, and having
seen the virtue of being explicit about these things whenever
possible, I'd like to make the suggestion that a complex string type
be added to the protocol in version 2.0.

I appreciate the great work done on this!

Tom Preston-Werner

Oct 27, 2009, 2:23:10 PM
to BERT-RPC
On Oct 25, 9:41 pm, stephen judkins <stephen.judk...@gmail.com> wrote:
> I did a quick implementation of the encoder on github using Scala, a
> language I'm learning, at http://github.com/stephenjudkins/scala-bert.
> In doing so, I realized that there may be issues with storing strings
> as arrays as bytes with no encoding-related metadata.

I've added a proposal for a complex string type outlined here:

http://groups.google.com/group/bert-rpc/browse_thread/thread/b3ccda7b76a3a631

Tom