What's the current state of Unicode support in Ruby? My recollection is
that Unicode support was somewhat lacking.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
Brian K. Reid: "In computer science, we stand on each other's feet."
Good grief, this was *just* covered ad nauseam recently. Search the
archives.
- Dan
Good grief, you're a prick. Thanks for the help. Has everyone in the
world been on this mailing list since it was started?
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
"Real ugliness is not harsh-looking syntax, but having to
build programs out of the wrong concepts." - Paul Graham
Whoops. I offer my apologies to ruby-talk: that was meant to be an
off-list email.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
"It's just incredible that a trillion-synapse computer could actually
spend Saturday afternoon watching a football game." - Marvin Minsky
> -----Original Message-----
> From: Chad Perrin [mailto:per...@apotheon.com]
> Sent: Friday, July 28, 2006 9:27 AM
> To: ruby-talk ML
> Subject: Re: state of unicode support
>
>
> On Sat, Jul 29, 2006 at 12:05:26AM +0900, Berger, Daniel wrote:
> > >
> > > I've heard rumors that "oniguruma fixes everything", and the
> > > like. I'm sure that's a touch of hyperbole, but in any case:
> > >
> > > What's the current state of Unicode support in Ruby? My
> > > recollection is of Unicode support somewhat lacking.
> >
> > Good grief, this was *just* covered ad nauseam recently. Search the
> > archives.
>
> Good grief, you're a prick. Thanks for the help. Has
> everyone in the world been on this mailing list since it was started?
Somehow you missed this thread of 300+ posts, which started June 13 and was
last replied to on June 28:
Maybe that has something to do with the fact that I've been an
intermittent member of this list, not a constant member, since I first
discovered it -- and my most recent membership (this one) started on the
25th of this month.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
This sig for rent: a Signify v1.14 production from http://www.debian.org/
. . and holy crap. Would someone please provide a one-sentence
summary so I can get back to my life? Something akin to "Everything's
awesome now with 1.9!" or "It's not quite there yet, but it's close," or
even "It's as broken as ever, but regex support is better," would
suffice.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
"The ability to quote is a serviceable
substitute for wit." - W. Somerset Maugham
It shouldn't matter if one is constantly subscribed. One should search the
archive to see if a question has already been discussed before posting a dup.
Given that sometimes one might not have a search string that finds a
match, there are cases where there will be dups. But at least try.
News flash: I used Google and found a grand total of two posts from that
thread before I posted the question -- two posts that didn't help. I'm
not a complete moron, thanks.
Drop the friggin' subject. Forget I asked. It's not worth an entire
thread devoted entirely to a defense of my decision to ask rather than
rely on incomplete addressing of a simple question asking for a summary
response that spanned more than fifty emails, requiring hours of reading
just to get to an answer that's something like "Regex support good,
strict Unicode support good, localization via Unicode needs work." WTF?
Is your time so damned precious that you can't spend thirty seconds
posting a one-sentence summary response rather than twenty minutes
chastising me for not wasting hours finding out almost nothing?
I recently saw a couple posts lamenting the downhill trend of Ruby
community friendliness. At the time, I'd been on this list for about a
day and a half this time around, and couldn't really comment. On a
previous occasion when I followed this list for a bit, my impression was
"Those guys at ruby-talk are a great bunch of guys, unless you ask about
the nature of :symbols, in which case you'll get useless answers and
someone trying to give useful answers will get his head bitten off.
Otherwise, friendliest programming community I've ever seen." Now,
judging mostly by reactions others have received for innocent questions
and comments, and a couple of reactions I myself have received, it looks
like things have degenerated.
The old guard I recall (D. Black, Matz, et cetera) seem mostly to still
be a great bunch of guys from the posts I've witnessed "realtime" and
the recent archives I've read, but some others here need an attitude
check. Seriously.
"It shouldn't matter if one is constantly subscribed." Yeah. 'Cause at
a rate of hundreds of posts a day (wild guesstimate -- I don't have time
to actually count 'em), I really have time to read the last six months
of archives to see if I've missed something relevant. Even a search
engine won't solve that problem effectively. Maybe next time you should
say "Have you checked the archives? There's something relevant in there
from last month. Does it answer your question?" rather than the
equivalent of "Someone already said Unicode this year! Out, heathen!"
<snip rant>
Cookie?
Is that a joking bit of peace-offering, or should I confine my comments
to an off-list response? Your response in particular has been less than
friendly in this case, and I'm less than optimistic in regards to your
motive in saying that. Specifically, I get the impression that you're
being a sarcastic <censored for length and inappropriateness to the
list>.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
"The measure on a man's real character is what he would do
if he knew he would never be found out." - Thomas McCauley
I think the real problem here is not that you happened to ask a
question that comes up repeatedly, but that the last thread on
that question basically ended with everyone too beat up to
talk anymore.
You're hitting a sore nerve for some people. That is not your
fault, of course, but that nerve still is a bit sore... I, for one,
was not particularly impressed with the way the earlier thread
ended (stumbling to a close), but I'd really hate to see it start
up again!
--
Garance Alistair Drosehn = dro...@gmail.com
Senior Systems Programmer
Rensselaer Polytechnic Institute; Troy, NY; USA
Thanks for the explanation. It's nice to occasionally get a civil
response.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
print substr("Just another Perl hacker", 0, -2);
Lest we start another flamewar on Unicode, can we please, please,
pretty please remember MINASWAN?
I'm really getting tired of seeing all these sarcastic posts and also
personal attacks on this list. I barely read RubyTalk anymore because
of it, and it makes me sad because I really do love this list, and
many of the people on it.
But I'm seriously considering unsubscribing because of how no one
(except the folks who have been around for quite some time) seems to
respect the very thing that attracted me to Ruby, the friendliness of
its community.
As far as unicode support goes, it's a complicated topic. People who
want to sum it up in one line probably don't really care about the
tough design decisions behind it.
. . or already have some vague idea of what Ruby Unicode support was
like a year ago, and just want a brief update for purposes of tool
evaluation for a project, or want to know the truth behind something
someone said in another venue, or . . .
. . or maybe you should assume good faith rather than jumping to
conclusions once in a while.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
Ben Franklin: "As we enjoy great Advantages from the Inventions of
others we should be glad of an Opportunity to serve others by any
Invention of ours, and this we should do freely and generously."
This isn't a complete answer, but it's the best I can do to help Chad out.
If you really want to solve the question now, Chad, I'd read Julian Tarkhanov's
UNICODE_PRIMER[1].
First, Oniguruma[2] is a regular expression engine. It supports Unicode regular
expressions under many encodings, and it's very handy. If all you want to do is
search strings for Unicode text, then great, use it.
Ruby's strings are not Unicode-aware. There is a library called 'jcode' that
comes with Ruby and tries to help out, but it's very simple, only good for a
few things like counting characters and iterating through characters. Again,
UTF-8 only.
Ruby itself also understands UTF-8 regular expressions to a degree, using the
'u' modifier. Many Ruby-based UTF-8 hacks are based on the idea of:
str.scan(/./u), which returns an array of strings, each string containing a
multibyte character. (Also: str.unpack('U*').)
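
A minimal sketch of the two idioms just mentioned. It is written against a
modern Ruby (1.9 or later) and has not been checked on 1.8; the helper names
split_chars and code_points are made up for illustration.

```ruby
# Hypothetical helper: split a UTF-8 string into one-character strings
# using the 'u' regex modifier.
def split_chars(str)
  str.scan(/./u)
end

# Hypothetical helper: turn a UTF-8 string into an array of integer
# code points.
def code_points(str)
  str.unpack('U*')
end

word = "na\u00efve"   # "naive" with a diaeresis on the i: 5 characters, 6 bytes

p split_chars(word)                      # five one-character strings
p code_points(word)                      # => [110, 97, 239, 118, 101]
p code_points(word).pack('U*') == word   # pack('U*') undoes unpack('U*')
```

Note that /./ does not match newlines unless the 'm' modifier is added as well.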
If you are using Unicode strings in Rails, check out Julian's unicode_hacks
plugin: <http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/>
They have a channel on irc.freenode.net: #multibyte_rails.
The unicode_hacks plugin is interesting in that it tries to load one of several
Ruby unicode extensions before falling back to str.unpack('U*') mode.
Here are the extensions it prefers, in order:
* icu4r: a Ruby extension to IBM's ICU library. Adds UString, URegexp, etc.
classes for containing Unicode stuffs.
(project page[3] and docs[4])
* utf8proc: a small library for iterating through characters and converting
ints to code points. Adds String#utf8map and Integer#utf8, for example.
(download[5])
* unicode: a little extension by Yoshida Masato which adds Unicode class
methods for `strcmp`, `[de]compose`, normalization and case conversion for
utf-8.
(download[6] and readme[7])
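
To make "[de]compose" and normalization concrete, here is a small sketch. It
does not use any of the extensions listed above (their method names are not
shown in this thread); as a stand-in it assumes String#unicode_normalize, which
is built into much later Rubies (2.2 and up). The helper names are hypothetical.

```ruby
# Hypothetical helpers wrapping the built-in normalizer of later Rubies.
def decompose(str)
  str.unicode_normalize(:nfd)   # split into base letters + combining marks
end

def compose(str)
  str.unicode_normalize(:nfc)   # recombine into precomposed characters
end

composed   = "\u00e9"    # e-acute as a single code point
decomposed = "e\u0301"   # plain e followed by a combining acute accent

p composed == decomposed              # false: different code point sequences
p decompose(composed) == decomposed   # true
p compose(decomposed) == composed     # true
```

The two strings look identical on screen, which is why comparing
un-normalized text byte by byte can go wrong.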
So, many options, some massive, but most only partial and in their infancy.
The most recent entrant into this race, though, is Nikolai Weibull's
ruby-character-encoding library, which aims to get complete multibyte support
into Ruby 1.8's string class. If you use it, it will probably break a lot of
libraries which are used to strings acting the way they do now.
He is trying to emulate the Ruby 2.0 Unicode plans outlined by Matz.[8]
Nevertheless, it is a very promising library and Nikolai is working at
break-neck pace to appease the nations, all tongues and peoples.[9] And
discussion is here[10] with links to the mailing list and all that.
This might be a landslide of information, but it's better than spending all day
Googling and extracting tarballs and poring through READMEs just to get a
picture of what's happening these days.
Signed in elaborate calligraphy with a picture of grapes at the end,
_why
[1] http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/UNICODE_PRIMER
[2] http://www.geocities.jp/kosako3/oniguruma/
[3] http://rubyforge.org/projects/icu4r/
[4] http://icu4r.rubyforge.org/
[5] http://www.flexiguided.de/publications.utf8proc.en.html
[6] http://www.yoshidam.net/Ruby.html
[7] http://www.yoshidam.net/unicode.txt
[8] http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html
[9] http://git.bitwi.se/?p=ruby-character-encodings.git;a=summary
[10] http://redhanded.hobix.com/inspect/nikolaiSUtf8LibIsAllReady.html
That was most excellent. Thank you for your kind assistance: it answers
my question quite well, and I appreciate your effort.
>
> Signed in elaborate calligraphy with a picture of grapes at the end,
. . and as always, you manage to entertain in the process.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
"The first rule of magic is simple. Don't waste your time waving your
hands and hopping when a rock or a club will do." - McCloctnick the Lucid
Or is my understanding warped and wrong?
_Why, et al, if you could break down the actual difficulties with
implementing Unicode support into Ruby 1.8, I think that might clear
up the questions we have as to whether a library eradicates all
problems (obviously, some problems can't be fixed, but merely hacked
or worked around).
Cheers, folks; remember to be nice. We're on the same team.
M.T.
why the lucky stiff wrote:
> First, Oniguruma[2] is a regular expression engine. It supports
> Unicode regular
> expressions under many encodings, it's very handy. If all you want
> to do is
> search strings for Unicode text, then great, use it.
Er uh well it doesn't do unicode properties so you can't use things
like \p{L} which, once you've found them, quickly come to feel
essential. Anytime you write [a-zA-Z] in a regex, you've probably
just uttered a bug. So I would say that Oniguruma has holes.
Otherwise, a very useful landslide indeed. -Tim
> The old guard I recall (D. Black, Matz, et cetera) seem mostly to still
> be a great bunch of guys from the posts I've witnessed "realtime" and
> the recent archives I've read, but some others here need an attitude
> check. Seriously.
In all politeness, I think you should count yourself in.
--
Christian Neukirchen <chneuk...@gmail.com> http://chneukirchen.org
Perhaps I should. I let my frustration at rudeness and similar poor
manners get the better of me on occasion.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
"There comes a time in the history of any project when it becomes necessary
to shoot the engineers and begin production." - MacUser, November 1990
Regexes in 1.8 can do utf-8.
>
> _Why, et al, if you could break down the actual difficulties with
> implementing Unicode support into Ruby 1.8, I think that might clear
> up the questions we have as to whether a library eradicates all
> problems (obviously, some problems can't be fixed, but merely hacked
> or worked around).
The problem is with compatibility. In 1.8 it is expected that strings
are arrays of bytes. You can split them to characters with a regex or
convert into a sequence of codepoints. But no standard library or
function would understand that (except the single one that is there
for undoing the transformation).
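
A rough illustration of the byte view described above. Ruby 1.8 is not assumed
here: the sketch runs on a modern Ruby and uses bytesize and byteslice to mimic
what 1.8's String#length and indexing would have operated on. The helper names
are hypothetical.

```ruby
# Hypothetical helpers contrasting the byte view with the character view.
def byte_length(str)
  str.bytesize               # what a bytes-only String counts
end

def char_length(str)
  str.scan(/./u).length      # split into characters first, then count
end

def naive_truncate(str, n)
  str.byteslice(0, n)        # cut after n bytes, ignoring character boundaries
end

s = "\u00fcber"   # u-umlaut followed by "ber": 4 characters, 5 bytes in UTF-8

p byte_length(s)                          # => 5
p char_length(s)                          # => 4
p naive_truncate(s, 1).valid_encoding?    # => false, cut mid-character
```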
So you have the choice to work with utf-8 strings and regexes, and
whenever you want characters convert the strings so that you get to
characters.
Or you can use a special unicode string class (such as from icu4r)
that no standard functions understand. Some may be able to do to_s but
you get a normal string then.
Or you can change the strings to handle utf-8 (or any other multibyte)
characters, and probably break most of the standard functions.
None of these is completely satisfactory because it is far from
_transparent_ unicode support in the standard string class. That is
planned for 2.0.
Thanks
Michal
Off topic, what does/would that do? Match a lower-case symbol?
--
Alex
>>> First, Oniguruma[2] is a regular expression engine. It supports
>>> Unicode regular
>>> expressions under many encodings, it's very handy. If all you
>>> want to do is
>>> search strings for Unicode text, then great, use it.
>> Er uh well it doesn't do unicode properties so you can't use
>> things like \p{L}
>
> Off topic, what does/would that do? Match a lower-case symbol?
Unicode characters have named properties. "L" means it's a letter.
There are sub-properties like Lu and Ll for upper and lower case.
There are lots more properties for things like being numbers, being
white-space, combining forms and particular properties of Asian
characters and so on. Tremendously useful in regexes, particularly
for those of us round-eye gringos who are prone to write [a-zA-Z] and
think we're matching letters, which we're not. If you don't support
properties, you don't support Unicode. -Tim
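
For readers who want to try the property classes: the regex engine in Ruby 1.9
and later does accept them, so the following sketch runs there (it will not
run on the 1.8 discussed in this thread). The helper names are made up.

```ruby
# Hypothetical helpers comparing an ASCII range with Unicode properties.
def ascii_letters(str)
  str.scan(/[a-zA-Z]+/)   # the "uttered a bug" version
end

def unicode_letters(str)
  str.scan(/\p{L}+/)      # L = any letter
end

def uppercase_letters(str)
  str.scan(/\p{Lu}/)      # Lu = uppercase letter
end

text = "Z\u00fcrich caf\u00e9"   # "Zurich cafe" with u-umlaut and e-acute

p ascii_letters(text)       # => ["Z", "rich", "caf"]
p unicode_letters(text)     # two whole words
p uppercase_letters(text)   # => ["Z"]
```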
> There's something relevant in there
> from last month. Does it answer your question?" rather than the
> equivalent of "Someone already said Unicode this year! Out, heathen!"
This is not quite true - the problem is that when the subject comes
up it spurs threads of obnoxious length, with no single answer given.
We heard that Ruby will have _some_ multibyte support and Oniguruma
by Christmas 2007. This is unbearably far away. Nobody knows how
it is going to work, and any suggestions summon enormous flame wars
of different origin. Moreover, the threads about Unicode in Ruby (the
best one dating back to 2002) have all ended up with nothing. They were
all about the same length, by the way.
I, for one, am very saddened every time the topic comes up because I'm
sick of the brokenness (I actually start looking at these Other
Languages and Other Frameworks that take l10n and i18n seriously). I
believe I'm not alone. Might be the case that many people don't want
to read about the Big Sad Topic again as long as no promises and
explanations are given that outline the future. Which as of now
doesn't look bright, at least for the coming 18 months.
--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl
> Ruby itself also understands UTF-8 regular expressions to a
> degree. Using the
> 'u' modifier. Many Ruby-based UTF-8 hacks are based on the idea of:
> str.scan(/./u), which returns an array of strings, each string
> containing a
> multibyte character. (Also: str.unpack('U*').)
Which is actually useless because this breaks your string between
codepoints, not between characters. ICU4R currently resolves this, as
well as a library posted on ruby-talk a while ago (with proper text
boundary handling).
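
The difference between codepoints and user-visible characters can be shown
with a short sketch. It does not use ICU4R (its API is not shown in this
thread); instead it assumes the \X grapheme-cluster escape that the regex
engine in later Rubies provides. The helper names are hypothetical.

```ruby
# Hypothetical helpers: two ways of "splitting into characters".
def by_code_point(str)
  str.scan(/./u)    # one element per code point
end

def by_grapheme(str)
  str.scan(/\X/)    # one element per grapheme cluster
end

s = "e\u0301a"   # e + combining acute accent, then a: displays as two characters

p by_code_point(s).length   # => 3, the accent is split from its base letter
p by_grapheme(s).length     # => 2
```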
--
Alex
> Unicode characters have named properties. "L" means it's a
> letter. There are sub-properties like Lu and Ll for upper and
> lower case. There are lots more properties for things like being
> numbers, being white-space, combining forms and particular
> properties of Asian characters and so on. Tremendously useful in
> regexes, particularly for those of us round-eye gringos who are
> prone to write [a-zA-Z] and think we're matching letters, which
> we're not. If you don't support properties, you don't support
> Unicode.
That's one of the reasons why you _need_ tables when working with
Unicode, and you _will_ spend memory on them. What Ruby does now is
nowhere near, and Matz wrote that he didn't include complete tables
for Oniguruma in 1.9 yet.
With proper regex support other funky things become possible, for
instance {all_cyrillic_letters} in a regex etc.
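
As a sketch of what such a class can look like: {all_cyrillic_letters} above
is Julian's placeholder, not real syntax. Later Rubies (1.9 and up) spell the
idea as a script property, \p{Cyrillic}, which is what this example assumes.
The helper name is made up.

```ruby
# Hypothetical helper: pull runs of Cyrillic letters out of mixed text.
def cyrillic_words(str)
  str.scan(/\p{Cyrillic}+/)
end

# A Russian greeting written with escapes, followed by an ASCII word.
text = "\u041f\u0440\u0438\u0432\u0435\u0442 world"

p cyrillic_words(text)   # one element: the Cyrillic word only
```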
Whilst it's certainly useless for a lot of tasks, I'm not sure that
Ruby is any worse than other languages in this regard. As far as I'm
aware, most languages that 'support' Unicode don't handle grapheme
clusters without using additional libraries.
> I, for one, am very saddened every time the topic comes up because I'm
> sick of the brokenness (I actually start looking at these Other
> Languages and Other Frameworks that take l10n and i18n seriously).
Actually, that's a really good idea. Which languages/frameworks have
you found that actually do it right? We could learn from their
example.
Paul.
AFAIK Python regexps do that properly, and ICU does for sure (both as
free iterators and regexps).
> Actually, that's a really good idea. Which languages/frameworks have
> you found that actually do it right? We could learn from their
> example.
To my knowledge you are intimately familiar with the subject so I
take it as sarcasm.
But if you really feel like being constructive you can update the
Unicode gem (which you promised about a month ago) :-))
That's what I mean: ICU is a separate library, not part of a language
core. We can use ICU in Ruby too - it's still pre-alpha and not
seamless, but the possibility exists. From what I've read, Python
doesn't do the heavyweight stuff natively, either. (Please tell me if
I'm wrong - I don't use Python.)
> > Actually, that's a really good idea. Which languages/frameworks have
> > you found that actually do it right? We could learn from their
> > example.
>
> To my knowledge you are intimately familiar with the subject so I
> take it as sarcasm.
I'm not being sarcastic at all, though perhaps I could have phrased it
better. It's just that all Unicode discussions in Ruby end up going
round and round in circles; if we as a community could identify some
first-class examples of Doing It Right, I think we'd have some useful
yardsticks. You are someone with particularly high expectations
(rightly so) of Unicode support in a language: have you found anything
that really impressed you?
> But if you really feel like being constructive you can update the
> Unicode gem (which you promised about a month ago) :-))
I promised I'd try :-) Thanks for the reminder, though! I'll get on with it.
Paul.
> On 31/07/06, Julian 'Julik' Tarkhanov <lis...@julik.nl> wrote:
>>
>> On 31-jul-2006, at 17:48, Paul Battley wrote:
>> > Whilst it's certainly useless for a lot of tasks, I'm not sure that
>> > Ruby is any worse than other languages in this regard. As far as
>> I'm
>> > aware, most languages that 'support' Unicode don't handle grapheme
>> > clusters without using additional libraries.
>>
>> AFAIK Python regexps do that properly, and ICU does for sure (both as
>> free iterators and regexps).
>
> That's what I mean: ICU is a separate library, not part of a language
> core.
PHP took the best of both - they are integrating ICU into the core.
Although I always hated their tendency to bloat the core, this is one
of the cases of bloat that I would want to applaud as a gesture of
sanity and common sense.
> We can use ICU in Ruby too - it's still pre-alpha and not
> seamless, but the possibility exists.
Except for the fact that the maintainer has abandoned it and nobody
stepped in. I don't do C.
> From what I've read, Python
> doesn't do the heavyweight stuff natively, either. (Please tell me if
> I'm wrong - I don't use Python.)
It depends on what you call "heavyweight". For the purists out there,
I gather, even including a complete Unicode table with
codepoint properties might be "heavyweight".
>>
>> To my knowledge you are intimately familiar with the subject so I
>> take it as sarcasm.
>
> I'm not being sarcastic at all, though perhaps I could have phrased it
> better. It's just that all Unicode discussions in Ruby end up going
> round and round in circles; if we as a community could identify some
> first-class examples of Doing It Right, I think we'd have some useful
> yardsticks.
The problem being, my "Right Examples" are nowhere near others'
"Right Examples", which in turn spurs flamewars.
My "right example" is simple - Unicode on no terms, no encoding
choice, characters only - but most already are dissatisfied with such
an attitude and the issue has been discussed in detail, with no
solution satisfying all parties being devised. Too much compromise.
> You are someone with particularly high expectations
> (rightly so) of Unicode support in a language: have you found anything
> that really impressed you?
ICU in all its incarnations (Java and C), compulsory
character-oriented Strings without choice of encoding in Java and the upcoming
Unicode support in Python (again - compulsory Unicode for all
strings, byte arrays for everything else). Perl's regex support. I
know everyone will disagree (how do I match a PNG header in a
string???) but that's what I consider good.
As to localization - resource bundles are good, and of course I
count every language that _did_ bother to print localized dates.
Shame on Ruby.
>
>> But if you really feel like being constructive you can update the
>> Unicode gem (which you promised about a month ago) :-))
>
> I promised I'd try :-) Thanks for the reminder, though! I'll get on
> with it.
Gotcha :-)
. . except that Why answered me beyond the expectations of the
question, quite satisfactorily. My question is answered and then some.
I wasn't asking "When will it be perfect?" I was only asking "What does
it do now?"
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
> . . . except that Why answered me beyond the expectations of the
> question, quite satisfactorily. My question is answered and then
> some.
> I wasn't asking "When will it be perfect?" I was only asking "What
> does
> it do now?"
It's always more entertaining to think globally and draw grand
schemes :-) _why's answer indeed is a good
summary.
I would have thought so, but the general consensus seems to be that
considering the issue is "bad" right now, for some definition of "bad".
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
I second that. I see a lot of people asking for "transparent" unicode support
but I don't see how that is possible. To me it's like asking for a language that
has transparent bug recovery. I know that ruby has weaknesses when it comes to
multibyte encodings, but the main problem is human in nature; too many people
assume that char==byte, which results in bugs when someone unexpectedly uses
"weird" characters. IMHO no amount of "transparent support" will change that.
But I would love to be shown otherwise with examples of languages that "do it
right".
Daniel
In ruby 1.8 working with anything but bytes is like scratching your
right ear with your left hand .. or leg.
Thanks
Michal
Last time I looked ICU was in C++. Requiring a C++ compiler and
runtime is quite a bit of bloat :)
>
> > We can use ICU in Ruby too - it's still pre-alpha and not
> > seamless, but the possibility exists.
>
> Except from the fact that the maintainer has abandoned it and nobody
> stepped in. I don't do C.
>
> > From what I've read, Python
> > doesn't do the heavyweight stuff natively, either. (Please tell me if
> > I'm wrong - I don't use Python.)
>
> It depends on what you call "heavyweight". For the purists out there,
> I gather, even including a complete Unicode table with
> codepoint properties might be "heavyweight".
I am not sure how large that might be. But if it is about the size of
the interpreter including the rest of the standard libraries I would
consider it "heavyweight". It would be a reason to start "optional
standard libraries" I guess :)
>
> >>
> >> To my knowledge you are intimately familiar with the subject so I
> >> take it as sarcasm.
> >
> > I'm not being sarcastic at all, though perhaps I could have phrased it
> > better. It's just that all Unicode discussions in Ruby end up going
> > round and round in circles; if we as a community could identify some
> > first-class examples of Doing It Right, I think we'd have some useful
> > yardsticks.
>
> The problem being, my "Right Examples" are nowhere near others'
> "Right Examples", which in turn spurs flamewars.
> My "right example" is simple - Unicode on no terms, no encoding
> choice, characters only - but most already are dissatisfied with such
> an attitude and the issue has been discussed in detail, with no
> solution satisfying all parties being devised. Too much compromise.
It's been also said that giving more options does not stop you from
using only unicode. If your "right example" is only about restricting
choice then there is really not much to it.
The "right examples" people were interested in are probably more like
the libraries/languages that implement enough functionality to give
you full unicode support for your definition of "full".
Thanks
Michal
> Last time I looked ICU was in C++. Requiring a C++ compiler and
> runtime is quite a bit of bloat :)
It still is. And it's huge and takes ages to build. If only I knew
something much lighter and better I would have dismissed it.
>
> I am not sure how large that might be. But if it is about the size of
> the interpreter including the rest of the standard libraries I would
> consider it "heavyweight". It would be a reason to start "optional
> standard libraries" I guess :)
I'm stopping right here. Unicode is not an option.
> It's been also said that giving more options does not stop you from
> using only unicode.
In 90% of the cases giving more options means programmers ignore
Unicode, for reasons ranging from speed to ignorance. My user
experience over the years has proven it.
But then again, I stop right here. And I urge you to do the same :-)
> I second that. I see a lot of people asking for "transparent" unicode
> support but I don't see how that is possible. To me it's like asking for
> a language that has transparent bug recovery. I know that ruby has
> weaknesses when it comes to multibyte encodings, but the main problem is
> human in nature; too many people assume that char==byte, which results
> in bugs when someone unexpectedly uses "weird" characters. IMHO no
> amount of "transparent support" will change that. But I would love to be
> shown otherwise with examples of languages that "do it right".
It can be done. Java gets it almost right, and in such a way that most
people will never stub their toes on the flaws. Python, it seems, is
going to get it right next time around. It's clearly possible to do
Unicode correctly. What Matz wants is much harder; a String type that
can contain strings of characters from arbitrary character sets in
arbitrary encodings, Unicode being just one special case, and also serve
as a byte buffer.
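
For a feel of what that design means in practice, here is a sketch against the
String class that Ruby 1.9 and later ended up shipping, where each string
carries its own encoding. It is an illustration written with hindsight, not
something available to the 1.8 discussed in this thread, and the helper name
describe is hypothetical.

```ruby
# Hypothetical helper: report a string's encoding, character count and
# byte count.
def describe(str)
  [str.encoding, str.length, str.bytesize]
end

utf8   = "caf\u00e9"                 # "cafe" with e-acute, in UTF-8
latin1 = utf8.encode("ISO-8859-1")   # same characters, different bytes
raw    = utf8.b                      # same bytes, treated as a byte buffer

p describe(utf8)     # UTF-8, 4 characters, 5 bytes
p describe(latin1)   # ISO-8859-1, 4 characters, 4 bytes
p describe(raw)      # binary, 5 "characters", 5 bytes
```

One String class thus covers Unicode text, text in other character sets, and
raw bytes, at the price of every operation having to consult the encoding.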
-Tim