state of unicode support


Chad Perrin

Jul 28, 2006, 11:01:02 AM
I've heard rumors that "oniguruma fixes everything", and the like. I'm
sure that's a touch of hyperbole, but in any case:

What's the current state of Unicode support in Ruby? My recollection is
of Unicode support somewhat lacking.

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
Brian K. Reid: "In computer science, we stand on each other's feet."

Berger, Daniel

Jul 28, 2006, 11:05:26 AM
> -----Original Message-----
> From: Chad Perrin [mailto:per...@apotheon.com]
> Sent: Friday, July 28, 2006 9:01 AM
> To: ruby-talk ML
> Subject: state of unicode support
>
>
> I've heard rumors that "oniguruma fixes everything", and the
> like. I'm sure that's a touch of hyperbole, but in any case:
>
> What's the current state of Unicode support in Ruby? My
> recollection is of Unicode support somewhat lacking.

Good grief, this was *just* covered ad nauseam recently. Search the
archives.

- Dan



Chad Perrin

Jul 28, 2006, 11:26:44 AM
On Sat, Jul 29, 2006 at 12:05:26AM +0900, Berger, Daniel wrote:
> >
> > I've heard rumors that "oniguruma fixes everything", and the
> > like. I'm sure that's a touch of hyperbole, but in any case:
> >
> > What's the current state of Unicode support in Ruby? My
> > recollection is of Unicode support somewhat lacking.
>
> Good grief, this was *just* covered ad nauseam recently. Search the
> archives.

Good grief, you're a prick. Thanks for the help. Has everyone in the
world been on this mailing list since it was started?

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]

"Real ugliness is not harsh-looking syntax, but having to
build programs out of the wrong concepts." - Paul Graham

Chad Perrin

Jul 28, 2006, 11:28:52 AM
On Sat, Jul 29, 2006 at 12:26:44AM +0900, Chad Perrin wrote:
> On Sat, Jul 29, 2006 at 12:05:26AM +0900, Berger, Daniel wrote:
> > >
> > > I've heard rumors that "oniguruma fixes everything", and the
> > > like. I'm sure that's a touch of hyperbole, but in any case:
> > >
> > > What's the current state of Unicode support in Ruby? My
> > > recollection is of Unicode support somewhat lacking.
> >
> > Good grief, this was *just* covered ad nauseam recently. Search the
> > archives.
>
> Good grief, you're a prick. Thanks for the help. Has everyone in the
> world been on this mailing list since it was started?

Whoops. I offer my apologies to ruby-talk: that was meant to be an
off-list email.

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]

"It's just incredible that a trillion-synapse computer could actually
spend Saturday afternoon watching a football game." - Marvin Minsky

Berger, Daniel

Jul 28, 2006, 11:46:25 AM

> -----Original Message-----
> From: Chad Perrin [mailto:per...@apotheon.com]
> Sent: Friday, July 28, 2006 9:27 AM
> To: ruby-talk ML

> Subject: Re: state of unicode support
>
>
> On Sat, Jul 29, 2006 at 12:05:26AM +0900, Berger, Daniel wrote:
> > >
> > > I've heard rumors that "oniguruma fixes everything", and the
> > > like. I'm sure that's a touch of hyperbole, but in any case:
> > >
> > > What's the current state of Unicode support in Ruby? My
> > > recollection is of Unicode support somewhat lacking.
> >
> > Good grief, this was *just* covered ad nauseam recently. Search the
> > archives.
>
> Good grief, you're a prick. Thanks for the help. Has
> everyone in the world been on this mailing list since it was started?

Somehow you missed this thread of 300+ messages, started June 13 and last
replied to on June 28:

http://tinyurl.com/ge2kp

Chad Perrin

Jul 28, 2006, 11:51:31 AM
On Sat, Jul 29, 2006 at 12:46:25AM +0900, Berger, Daniel wrote:
>
> Somehow you missed this 300+ long thread started June 13 and lasted
> replied to on June 28:

Maybe that has something to do with the fact that I've been an
intermittent member of this list, not a constant member, since I first
discovered it -- and my most recent membership (this one) started on the
25th of this month.

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]

This sig for rent: a Signify v1.14 production from http://www.debian.org/

Chad Perrin

Jul 28, 2006, 11:53:24 AM
On Sat, Jul 29, 2006 at 12:46:25AM +0900, Berger, Daniel wrote:
>
> http://tinyurl.com/ge2kp

. . . and holy crap. Would someone please provide a one-sentence
summary so I can get back to my life? Something akin to "Everything's
awesome now with 1.9!" or "It's not quite there yet, but it's close," or
even "It's as broken as ever, but regex support is better," would
suffice.

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]

"The ability to quote is a serviceable
substitute for wit." - W. Somerset Maugham

Cliff Cyphers

Jul 28, 2006, 12:39:34 PM
Chad Perrin wrote:
> Maybe that has something to do with the fact that I've been an
> intermittent member of this list, not a constant member, since I first
> discovered it -- and my most recent membership (this one) started on the
> 25th of this month.
>

It shouldn't matter if one is constantly subscribed. One should search the
archive to see if a question has already been discussed before posting a dup.
Given that a search string sometimes fails to find a match, there are cases
where there will be dups. But at least try.

Chad Perrin

Jul 28, 2006, 12:54:16 PM

News flash: I used Google and found a grand total of two posts from that
thread before I posted the question -- two posts that didn't help. I'm
not a complete moron, thanks.

Drop the friggin' subject. Forget I asked. It's not worth an entire
thread devoted entirely to a defense of my decision to ask rather than
rely on incomplete addressing of a simple question asking for a summary
response that spanned more than fifty emails, requiring hours of reading
just to get to an answer that's something like "Regex support good,
strict Unicode support good, localization via Unicode needs work." WTF?
Is your time so damned precious that you can't spend thirty seconds
posting a one-sentence summary response rather than twenty minutes
chastising me for not wasting hours finding out almost nothing?

I recently saw a couple posts lamenting the downhill trend of Ruby
community friendliness. At the time, I'd been on this list for about a
day and a half this time around, and couldn't really comment. On a
previous occasion when I followed this list for a bit, my impression was
"Those guys at ruby-talk are a great bunch of guys, unless you ask about
the nature of :symbols, in which case you'll get useless answers and
someone trying to give useful answers will get his head bitten off.
Otherwise, friendliest programming community I've ever seen." Now,
judging mostly by reactions others have received for innocent questions
and comments, and a couple of reactions I myself have received, it looks
like things have degenerated.

The old guard I recall (D. Black, Matz, et cetera) seem mostly to still
be a great bunch of guys from the posts I've witnessed "realtime" and
the recent archives I've read, but some others here need an attitude
check. Seriously.

"It shouldn't matter if one is constantly subscribed." Yeah. 'Cause at
a rate of hundreds of posts a day (wild guesstimate -- I don't have time
to actually count 'em), I really have time to read the last six months
of archives to see if I've missed something relevant. Even a search
engine won't solve that problem effectively. Maybe next time you should
say "Have you checked the archives? There's something relevant in there
from last month. Does it answer your question?" rather than the
equivalent of "Someone already said Unicode this year! Out, heathen!"

Daniel Berger

Jul 28, 2006, 12:58:07 PM
Chad Perrin wrote:
> On Sat, Jul 29, 2006 at 01:39:34AM +0900, Cliff Cyphers wrote:
>> Chad Perrin wrote:
>>> Maybe that has something to do with the fact that I've been an
>>> intermittent member of this list, not a constant member, since I first
>>> discovered it -- and my most recent membership (this one) started on the
>>> 25th of this month.
>> It shouldn't matter if one is constantly subscribed. One should search
>> archive to see if question already discussed before posting dup. Given
>> that sometimes one might not have a search string that doesn't find a
>> match there are cases where there will be dups. But at least try.
>
> News flash: I used Google and found a grand total of two posts from that
> thread before I posted the question -- two posts that didn't help. I'm
> not a complete moron, thanks.

<snip rant>

Cookie?

Chad Perrin

Jul 28, 2006, 1:03:11 PM
On Sat, Jul 29, 2006 at 01:58:07AM +0900, Daniel Berger wrote:
> Chad Perrin wrote:
> >On Sat, Jul 29, 2006 at 01:39:34AM +0900, Cliff Cyphers wrote:
> >>Chad Perrin wrote:
> >>>Maybe that has something to do with the fact that I've been an
> >>>intermittent member of this list, not a constant member, since I first
> >>>discovered it -- and my most recent membership (this one) started on the
> >>>25th of this month.
> >>It shouldn't matter if one is constantly subscribed. One should search
> >>archive to see if question already discussed before posting dup. Given
> >>that sometimes one might not have a search string that doesn't find a
> >>match there are cases where there will be dups. But at least try.
> >
> >News flash: I used Google and found a grand total of two posts from that
> >thread before I posted the question -- two posts that didn't help. I'm
> >not a complete moron, thanks.
>
> <snip rant>
>
> Cookie?

Is that a joking bit of peace-offering, or should I confine my comments
to an off-list response? Your response in particular has been less than
friendly in this case, and I'm less than optimistic in regards to your
motive in saying that. Specifically, I get the impression that you're
being a sarcastic <censored for length and inappropriateness to the
list>.

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]

"The measure on a man's real character is what he would do
if he knew he would never be found out." - Thomas McCauley

Garance A Drosehn

Jul 28, 2006, 1:06:53 PM
On 7/28/06, Chad Perrin <per...@apotheon.com> wrote:
>
> Drop the friggin' subject. Forget I asked. It's not worth an entire
> thread devoted entirely to a defense of my decision to ask

I think the real problem here is not that you happened to ask a
question that comes up repeatedly, but that the last thread on
that question basically ended with everyone too beat up to
talk anymore.

You're hitting a sore nerve for some people. That is not your
fault, of course, but that nerve still is a bit sore... I, for one,
was not particularly impressed with the way the earlier thread
ended (stumbling to a close), but I'd really hate to see it start
up again!

--
Garance Alistair Drosehn = dro...@gmail.com
Senior Systems Programmer
Rensselaer Polytechnic Institute; Troy, NY; USA

Chad Perrin

Jul 28, 2006, 1:13:14 PM
On Sat, Jul 29, 2006 at 02:06:53AM +0900, Garance A Drosehn wrote:
> On 7/28/06, Chad Perrin <per...@apotheon.com> wrote:
> >
> >Drop the friggin' subject. Forget I asked. It's not worth an entire
> >thread devoted entirely to a defense of my decision to ask
>
> I think the real problem here is not that you happened to ask a
> question that comes up repeatedly, but that the last thread on
> that question basically ended with everyone too beat up to
> talk anymore.
>
> You're hitting a sore nerve for some people. That is not your
> fault, of course, but that nerve still is a bit sore... I, for one,
> was not particularly impressed with the way the earlier thread
> ended (stumbling to a close), but I'd really hate to see it start
> up again!

Thanks for the explanation. It's nice to occasionally get a civil
response.

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]

print substr("Just another Perl hacker", 0, -2);

Gregory Brown

Jul 28, 2006, 1:51:10 PM
On 7/28/06, Charles O Nutter <hea...@headius.com> wrote:
> On 7/28/06, Chad Perrin <per...@apotheon.com> wrote:
> >
> > On Sat, Jul 29, 2006 at 12:46:25AM +0900, Berger, Daniel wrote:
> > >
> > > http://tinyurl.com/ge2kp
> >
> > . . . and holy crap. Would someone please provide a one-sentence

> > summary so I can get back to my life? Something akin to "Everything's
> > awesome now with 1.9!" or "It's not quite there yet, but it's close," or
> > even "It's as broken as ever, but regex support is better," would
> > suffice.
> >
>
> Oh come on, you don't want to read all those bazillion emails to get the
> bottom line? :)
>
> Seriously though y'all, isn't there a nice, short FAQ out there somewhere? I
> haven't been able to find one. There must be something, right?

Lest we start another flamewar on Unicode, can we please, please,
pretty please remember MINASWAN?

I'm really getting tired of seeing all these sarcastic posts and also
personal attacks on this list. I barely read RubyTalk anymore because
of it, and it makes me sad because I really do love this list, and
many of the people on it.

But I'm seriously considering unsubscribing because of how no one
(except the folks who have been around for quite some time) seems to
respect the very thing that attracted me to Ruby, the friendliness of
its community.

As far as unicode support goes, it's a complicated topic. People who
want to sum it up in one line probably don't really care about the
tough design decisions behind it.

Chad Perrin

Jul 28, 2006, 2:23:46 PM
On Sat, Jul 29, 2006 at 02:51:10AM +0900, Gregory Brown wrote:
>
> As far as unicode support goes, it's a complicated topic. People who
> want to sum it up in one line probably don't really care about the
> tough design decisions behind it.

. . . or already have some vague idea of what Ruby Unicode support was
like a year ago, and just want a brief update for purposes of tool
evaluation for a project, or want to know the truth behind something
someone said in another venue, or . . .

. . . or maybe you should assume good faith rather than jumping to
conclusions once in a while.

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]

Ben Franklin: "As we enjoy great Advantages from the Inventions of
others we should be glad of an Opportunity to serve others by any
Invention of ours, and this we should do freely and generously."

why the lucky stiff

Jul 28, 2006, 3:13:04 PM
On Sat, Jul 29, 2006 at 01:08:06AM +0900, Charles O Nutter wrote:
> Oh man, I really don't have the energy for this thread again :) Chad: if you
> get a straight answer about this, let me know. Others: Is there a simple,
> straightforward FAQ entry somewhere that says "to use Unicode you have the
> following choices"? This keeps coming up.

This isn't a complete answer, but it's the best I can do to help Chad out.
If you really want to solve the question now, Chad, I'd read Julian Tarkhanov's
UNICODE_PRIMER[1].

First, Oniguruma[2] is a regular expression engine. It supports Unicode regular
expressions under many encodings, which is very handy. If all you want to do is
search strings for Unicode text, then great, use it.

Ruby's strings are not Unicode-aware. A library called 'jcode' comes with Ruby
and tries to help out, but it's very simple, only good for a few things like
counting characters and iterating through characters. Again, UTF-8 only.
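
To make the byte/character distinction concrete, here is a minimal sketch. It
assumes a modern Ruby (1.9 or later, which postdates this thread), where
strings carry an encoding; on the 1.8 strings described above, `length` would
report the byte count instead. The sample string is made up for illustration.

```ruby
# Assumes Ruby 1.9+ (not the 1.8 behavior discussed in this thread).
s = "caf\u00E9"            # "café": 4 characters, 5 bytes in UTF-8

char_count = s.length      # counts characters in an encoding-aware Ruby
byte_count = s.bytesize    # counts raw bytes, which is what 1.8's length reported

chars = []
s.each_char { |c| chars << c }   # iterate character by character

puts "#{char_count} characters, #{byte_count} bytes"
```

The gap between the two counts is the kind of thing 'jcode' tried to paper
over for 1.8 strings.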

Ruby itself also understands UTF-8 regular expressions to a degree, using the
'u' modifier. Many Ruby-based UTF-8 hacks are based on the idea of
str.scan(/./u), which returns an array of strings, each string containing one
multibyte character. (Also: str.unpack('U*'), which returns the code points as
integers.)
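
The two idioms just mentioned can be sketched as follows. The sample string is
an assumption for illustration, and the snippet is written for a modern Ruby,
although the `/u` flag and the `'U*'` directive are the ones named in the text.

```ruby
# Sample data is hypothetical; the calls are the ones named in the message.
str = "\u65E5\u672C\u8A9E"          # three Japanese characters, nine UTF-8 bytes

chars      = str.scan(/./u)         # array of one-character strings
codepoints = str.unpack('U*')       # array of integers, one per code point
rebuilt    = codepoints.pack('U*')  # pack('U*') reverses unpack('U*')

puts chars.length
puts codepoints.inspect
```

Note that both idioms work per code point, so anything built from several code
points (a base letter plus a combining accent, say) comes back in pieces.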

If you are using Unicode strings in Rails, check out Julian's unicode_hacks
plugin: <http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/>
They have a channel on irc.freenode.net: #multibyte_rails.

The unicode_hacks plugin is interesting in that it tries to load one of several
Ruby unicode extensions before falling back to str.unpack('U*') mode.

Here are the extensions it prefers, in order:

* icu4r: a Ruby extension to IBM's ICU library. Adds UString, URegexp, etc.
  classes for containing Unicode stuffs.
  (project page[3] and docs[4])
* utf8proc: a small library for iterating through characters and converting
  ints to code points. Adds String#utf8map and Integer#utf8, for example.
  (download[5])
* unicode: a little extension by Yoshida Masato which adds Unicode class
  methods for `strcmp`, `[de]compose`, normalization and case conversion for
  utf-8.
  (download[6] and readme[7])
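
As a rough illustration of what "[de]compose" and normalization mean, the
sketch below uses `String#unicode_normalize`, which ships with modern Ruby
(2.2 and later). It is a stand-in chosen for this example and is not the API
of the `unicode` extension listed above.

```ruby
# Assumes Ruby 2.2+; unicode_normalize is core Ruby, not the 'unicode' extension.
decomposed  = "e\u0301"     # "e" followed by a combining acute accent (2 code points)
precomposed = "\u00E9"      # the same letter as a single code point

naive_equal = (decomposed == precomposed)          # false: different code points
composed    = decomposed.unicode_normalize(:nfc)   # compose into one code point
split_up    = precomposed.unicode_normalize(:nfd)  # decompose into base + accent

puts composed == precomposed
puts split_up == decomposed
```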

So, many options, some massive, but most only partial and in their infancy.

The most recent entrant into this race, though, is Nikolai Weibull's
ruby-character-encoding library, which aims to get complete multibyte support
into Ruby 1.8's string class. If you use it, it will probably break a lot of
libraries which are used to strings acting the way they do now.
He is trying to emulate the Ruby 2.0 Unicode plans outlined by Matz.[8]

Nevertheless, it is a very promising library and Nikolai is working at
break-neck pace to appease the nations, all tongues and peoples.[9] And
discussion is here[10] with links to the mailing list and all that.

This might be a landslide of information, but it's better than spending all day
Googling and extracting tarballs and poring over READMEs just to get a
picture of what's happening these days.

Signed in elaborate calligraphy with a picture of grapes at the end,

_why

[1] http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/UNICODE_PRIMER
[2] http://www.geocities.jp/kosako3/oniguruma/
[3] http://rubyforge.org/projects/icu4r/
[4] http://icu4r.rubyforge.org/
[5] http://www.flexiguided.de/publications.utf8proc.en.html
[6] http://www.yoshidam.net/Ruby.html
[7] http://www.yoshidam.net/unicode.txt
[8] http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html
[9] http://git.bitwi.se/?p=ruby-character-encodings.git;a=summary
[10] http://redhanded.hobix.com/inspect/nikolaiSUtf8LibIsAllReady.html

Chad Perrin

Jul 28, 2006, 3:23:17 PM
On Sat, Jul 29, 2006 at 04:13:04AM +0900, why the lucky stiff wrote:
>
> This might be a landslide of information, but it's better than spending all day
> Googling and extracting tarballs and poring over READMEs just to get a
> picture of what's happening these days.

That was most excellent. Thank you for your kind assistance: it answers
my question quite well, and I appreciate your effort.


>
> Signed in elaborate calligraphy with a picture of grapes at the end,

. . . and as always, you manage to entertain in the process.

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]

"The first rule of magic is simple. Don't waste your time waving your
hands and hopping when a rock or a club will do." - McCloctnick the Lucid

Matt Todd

Jul 28, 2006, 3:34:35 PM
So, the problem with Unicode support in Ruby is that the code
currently assumes that each letter is one byte, instead of multiple?
This includes presumably search algorithms (for Regexs, et al), then?

Or is my understanding warped and wrong?

_Why, et al, if you could break down the actual difficulties with
implementing Unicode support into Ruby 1.8, I think that might clear
up the questions we have as to whether a library eradicates all
problems (obviously, some problems can't be fixed, but merely hacked
or worked around).

Cheers, folks; remember to be nice. We're on the same team.

M.T.

Eric Armstrong

Jul 28, 2006, 4:59:38 PM
Spectacular summary. As a lurker on this thread,
I greatly appreciate it.

why the lucky stiff wrote:

Tim Bray

Jul 29, 2006, 4:33:54 AM
On Jul 28, 2006, at 12:13 PM, why the lucky stiff wrote:

> First, Oniguruma[2] is a regular expression engine. It supports
> Unicode regular
> expressions under many encodings, it's very handy. If all you want
> to do is
> search strings for Unicode text, then great, use it.

Er uh well it doesn't do unicode properties so you can't use things
like \p{L} which, once you've found them, quickly come to feel
essential. Anytime you write [a-zA-Z] in a regex, you've probably
just uttered a bug. So I would say that Oniguruma has holes.

Otherwise, a very useful landslide indeed. -Tim
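
The [a-zA-Z] pitfall is easy to reproduce. This is a minimal sketch, assuming
a modern Ruby (1.9+) whose regex engine does accept `\p{L}`; that is exactly
the feature the message above reports as missing at the time. The sample text
is made up.

```ruby
# Assumes Ruby 1.9+, where \p{L} is available; per the thread, it was not in 2006.
text = "na\u00EFve caf\u00E9"          # "naïve café"

ascii_words  = text.scan(/[a-zA-Z]+/)  # splits words at every non-ASCII letter
letter_words = text.scan(/\p{L}+/)     # treats any Unicode letter as a letter

puts ascii_words.inspect
puts letter_words.inspect
```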

Christian Neukirchen

Jul 29, 2006, 2:41:26 PM
Chad Perrin <per...@apotheon.com> writes:

> The old guard I recall (D. Black, Matz, et cetera) seem mostly to still
> be a great bunch of guys from the posts I've witnessed "realtime" and
> the recent archives I've read, but some others here need an attitude
> check. Seriously.

In all politeness, I think you should count yourself in.

--
Christian Neukirchen <chneuk...@gmail.com> http://chneukirchen.org

Chad Perrin

Jul 29, 2006, 3:50:24 PM
On Sun, Jul 30, 2006 at 03:41:26AM +0900, Christian Neukirchen wrote:
> Chad Perrin <per...@apotheon.com> writes:
>
> > The old guard I recall (D. Black, Matz, et cetera) seem mostly to still
> > be a great bunch of guys from the posts I've witnessed "realtime" and
> > the recent archives I've read, but some others here need an attitude
> > check. Seriously.
>
> In all politeness, I think you should count yourself in.

Perhaps I should. I let my frustration at rudeness and similar poor
manners get the better of me on occasion.

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]

"There comes a time in the history of any project when it becomes necessary
to shoot the engineers and begin production." - MacUser, November 1990

Michal Suchanek

Jul 31, 2006, 6:28:35 AM
On 7/28/06, Matt Todd <chio...@gmail.com> wrote:
> So, the problem with Unicode support in Ruby is that the code
> currently assumes that each letter is one byte, instead of multiple?
> This includes presumably search algorithms (for Regexs, et al), then?
>
> Or is my understanding warped and wrong?

Regexes in 1.8 can do utf-8.

>
> _Why, et al, if you could break down the actual difficulties with
> implementing Unicode support into Ruby 1.8, I think that might clear
> up the questions we have as to whether a library eradicates all
> problems (obviously, some problems can't be fixed, but merely hacked
> or worked around).

The problem is with compatibility. In 1.8 it is expected that strings
are arrays of bytes. You can split them to characters with a regex or
convert into a sequence of codepoints. But no standard library or
function would understand that (except the single one that is there
for undoing the transformation).

So you have the choice to work with utf-8 strings and regexes, and
whenever you want characters convert the strings so that you get to
characters.

Or you can use a special unicode string class (such as from icu4r)
that no standard functions understand. Some may be able to do to_s but
you get a normal string then.

Or you can change the strings to handle utf-8 (or any other multibyte)
characters, and probably break most of the standard functions.

None of these is completely satisfactory because it is far from
_transparent_ unicode support in the standard string class. That is
planned for 2.0.
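
A small sketch of why "strings are arrays of bytes" bites: cutting at a byte
offset can land in the middle of a multibyte character. It assumes Ruby 1.9.3
or later (`byteslice` and `valid_encoding?` do not exist in 1.8), so it shows
the problem from the vantage point of a later, encoding-aware Ruby rather than
the 1.8 behavior itself.

```ruby
# Assumes Ruby 1.9.3+; these methods are not available in 1.8.
s = "\u00E9t\u00E9"                 # "été": 3 characters, 5 bytes in UTF-8

by_bytes = s.byteslice(0, 1)        # first byte only: half of the first character
by_chars = s[0, 1]                  # first character, both of its bytes

puts by_bytes.valid_encoding?       # the byte-level cut is not valid UTF-8
puts by_chars.valid_encoding?
```

Any 1.8-era library that indexed or truncated strings by position was
effectively doing the `byteslice` variant.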

Thanks

Michal

Alex Young

Jul 31, 2006, 10:52:11 AM
Tim Bray wrote:
> On Jul 28, 2006, at 12:13 PM, why the lucky stiff wrote:
>
>> First, Oniguruma[2] is a regular expression engine. It supports
>> Unicode regular
>> expressions under many encodings, it's very handy. If all you want
>> to do is
>> search strings for Unicode text, then great, use it.
>
>
> Er uh well it doesn't do unicode properties so you can't use things
> like \p{L}

Off topic, what does/would that do? Match a lower-case symbol?

--
Alex

Tim Bray

Jul 31, 2006, 11:10:49 AM
On Jul 31, 2006, at 7:52 AM, Alex Young wrote:

>>> First, Oniguruma[2] is a regular expression engine. It supports
>>> Unicode regular
>>> expressions under many encodings, it's very handy. If all you
>>> want to do is
>>> search strings for Unicode text, then great, use it.
>> Er uh well it doesn't do unicode properties so you can't use
>> things like \p{L}
>
> Off topic, what does/would that do? Match a lower-case symbol?

Unicode characters have named properties. "L" means it's a letter.
There are sub-properties like Lu and Ll for upper and lower case.
There are lots more properties for things like being numbers, being
white-space, combining forms and particular properties of Asian
characters and so on. Tremendously useful in regexes, particularly
for those of us round-eye gringos who are prone to write [a-zA-Z] and
think we're matching letters, which we're not. If you don't support
properties, you don't support Unicode. -Tim
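
The sub-properties can be tried out directly in a current Ruby. The sketch
below assumes Ruby 1.9+ and uses the general-category names Lu, Ll and Nd
(decimal digit); the sample text is hypothetical.

```ruby
# Assumes Ruby 1.9+; property names follow the Unicode general categories.
text = "\u00C9cole 42 \u00E9t\u00E9"   # "École 42 été"

upper  = text.scan(/\p{Lu}/)   # uppercase letters
lower  = text.scan(/\p{Ll}/)   # lowercase letters
digits = text.scan(/\p{Nd}/)   # decimal digits

puts upper.inspect
puts lower.join
puts digits.inspect
```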


Julian 'Julik' Tarkhanov

Jul 31, 2006, 11:20:18 AM

On 28-jul-2006, at 18:54, Chad Perrin wrote:

> There's something relevant in there
> from last month. Does it answer your question?" rather than the
> equivalent of "Someone already said Unicode this year! Out, heathen!"

This is not quite true - the problem is that when the subject comes
up it spurs threads of obnoxious length, with no single answer given.

We heard that Ruby will have _some_ multibyte support and Oniguruma
by Christmas 2007. This is unbearably far away. Nobody knows how
it is going to work, and any suggestions summon enormous flame wars
of different origin. Moreover, the threads about Unicode in Ruby (the
best one dating back to 2002) have all ended up with nothing. They were
all about the same length, by the way.

I, for one, am very saddened every time the topic comes up because I'm
sick of the brokenness (I actually start looking at these Other
Languages and Other Frameworks that take l10n and i18n seriously). I
believe I'm not alone. Might be the case that many people don't want
to read about the Big Sad Topic again as long as no promises and
explanations are given that outline the future. Which as of now
doesn't look bright, at least for the coming 18 months.
--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

Julian 'Julik' Tarkhanov

Jul 31, 2006, 11:24:43 AM

On 28-jul-2006, at 21:13, why the lucky stiff wrote:

> Ruby itself also understands UTF-8 regular expressions to a
> degree. Using the
> 'u' modifier. Many Ruby-based UTF-8 hacks are based on the idea of:
> str.scan(/./u), which returns an array of strings, each string
> containing a
> multibyte character. (Also: str.unpack('U*').)

Which is actually useless because this breaks your string between
codepoints, not between characters. ICU4R currently resolves this, as
well as a library posted on ruby-talk a while ago (with proper text
boundary handling).
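
The codepoint-versus-character distinction can be seen with a combining
accent. This sketch assumes Ruby 2.0 or later, where `\X` in a regex matches
an extended grapheme cluster; it illustrates the concept and is not how ICU4R
or the ruby-talk library mentioned above work.

```ruby
# Assumes Ruby 2.0+, where \X matches an extended grapheme cluster.
s = "e\u0301a"                      # "e" + combining acute accent, then "a"

codepoint_pieces = s.scan(/./u)     # splits the accent away from its base letter
cluster_pieces   = s.scan(/\X/)     # keeps base letter and accent together

puts codepoint_pieces.length
puts cluster_pieces.length
```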

Alex Young

Jul 31, 2006, 11:25:23 AM
Gotcha. Thanks for that.

--
Alex

Julian 'Julik' Tarkhanov

Jul 31, 2006, 11:35:13 AM

On 31-jul-2006, at 17:10, Tim Bray wrote:

> Unicode characters have named properties. "L" means it's a
> letter. There are sub-properties like Lu and Ll for upper and
> lower case. There are lots more properties for things like being
> numbers, being white-space, combining forms and particular
> properties of Asian characters and so on. Tremendously useful in
> regexes, particularly for those of us round-eye gringos who are
> prone to write [a-zA-Z] and think we're matching letters, which
> we're not. If you don't support properties, you don't support
> Unicode.

That's one of the reasons why you _need_ tables when working with
Unicode, and you _will_ spend memory on them. What Ruby does now is
nowhere near, and Matz wrote that he didn't include complete tables
for Oniguruma in 1.9 yet.

With proper regex support other funky things become possible, for
instance {all_cyrillic_letters} in a regex etc.
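
For what it's worth, script properties give roughly this in a current Ruby.
The sketch assumes Ruby 1.9+ and uses `\p{Cyrillic}` as a stand-in for the
{all_cyrillic_letters} idea; the placeholder name in the message is not real
syntax, and the sample text is made up.

```ruby
# Assumes Ruby 1.9+; \p{Cyrillic} and \p{Latin} are Unicode script properties.
text = "\u041F\u0440\u0438\u0432\u0435\u0442, world"   # "Привет, world"

cyrillic = text.scan(/\p{Cyrillic}+/)   # runs of Cyrillic-script characters
latin    = text.scan(/\p{Latin}+/)      # runs of Latin-script characters

puts cyrillic.inspect
puts latin.inspect
```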

Paul Battley

Jul 31, 2006, 11:48:23 AM
On 31/07/06, Julian 'Julik' Tarkhanov <lis...@julik.nl> wrote:
> > Ruby itself also understands UTF-8 regular expressions to a
> > degree. Using the
> > 'u' modifier. Many Ruby-based UTF-8 hacks are based on the idea of:
> > str.scan(/./u), which returns an array of strings, each string
> > containing a
> > multibyte character. (Also: str.unpack('U*').)
>
> Which is actually useless because this breaks your string between
> codepoints, not between characters. ICU4R currently resolves this, as
> well as a library posted
> on ruby-talk a while ago (with proper text boundary handling).

Whilst it's certainly useless for a lot of tasks, I'm not sure that
Ruby is any worse than other languages in this regard. As far as I'm
aware, most languages that 'support' Unicode don't handle grapheme
clusters without using additional libraries.

> I, for one, am very saddened every time the topic comes up because I'm
> sick of the brokenness (I actually start looking at these Other
> Languages and Other Frameworks that take l10n and i18n seriously).

Actually, that's a really good idea. Which languages/frameworks have
you found that actually do it right? We could learn from their
example.

Paul.

Julian 'Julik' Tarkhanov

Jul 31, 2006, 12:02:33 PM

On 31-jul-2006, at 17:48, Paul Battley wrote:
> Whilst it's certainly useless for a lot of tasks, I'm not sure that
> Ruby is any worse than other languages in this regard. As far as I'm
> aware, most languages that 'support' Unicode don't handle grapheme
> clusters without using additional libraries.

AFAIK Python regexps do that properly, and ICU does for sure (both as
free iterators and regexps).

> Actually, that's a really good idea. Which languages/frameworks have
> you found that actually do it right? We could learn from their
> example.

To my knowledge you are intimately familiar with the subject so I
take it as sarcasm.

But if you really feel like being constructive you can update the
Unicode gem (which you promised about a month ago) :-))

Paul Battley

Jul 31, 2006, 12:51:47 PM
On 31/07/06, Julian 'Julik' Tarkhanov <lis...@julik.nl> wrote:
>
> On 31-jul-2006, at 17:48, Paul Battley wrote:
> > Whilst it's certainly useless for a lot of tasks, I'm not sure that
> > Ruby is any worse than other languages in this regard. As far as I'm
> > aware, most languages that 'support' Unicode don't handle grapheme
> > clusters without using additional libraries.
>
> AFAIK Python regexps do that properly, and ICU does for sure (both as
> free iterators and regexps).

That's what I mean: ICU is a separate library, not part of a language
core. We can use ICU in Ruby too - it's still pre-alpha and not
seamless, but the possibility exists. From what I've read, Python
doesn't do the heavyweight stuff natively, either. (Please tell me if
I'm wrong - I don't use Python.)

> > Actually, that's a really good idea. Which languages/frameworks have
> > you found that actually do it right? We could learn from their
> > example.
>
> To my knowledge you are intimately familiar with the subject so I
> take it as sarcasm.

I'm not being sarcastic at all, though perhaps I could have phrased it
better. It's just that all Unicode discussions in Ruby end up going
round and round in circles; if we as a community could identify some
first-class examples of Doing It Right, I think we'd have some useful
yardsticks. You are someone with particularly high expectations
(rightly so) of Unicode support in a language: have you found anything
that really impressed you?

> But if you really feel like being constructive you can update the
> Unicode gem (which you promised about a month ago) :-))

I promised I'd try :-) Thanks for the reminder, though! I'll get on with it.

Paul.

Julian 'Julik' Tarkhanov

Jul 31, 2006, 1:15:03 PM

On 31-jul-2006, at 18:51, Paul Battley wrote:

> On 31/07/06, Julian 'Julik' Tarkhanov <lis...@julik.nl> wrote:
>>
>> On 31-jul-2006, at 17:48, Paul Battley wrote:
>> > Whilst it's certainly useless for a lot of tasks, I'm not sure that
>> > Ruby is any worse than other languages in this regard. As far as I'm
>> > aware, most languages that 'support' Unicode don't handle grapheme
>> > clusters without using additional libraries.
>>
>> AFAIK Python regexps do that properly, and ICU does for sure (both as
>> free iterators and regexps).
>
> That's what I mean: ICU is a separate library, not part of a language
> core.

PHP took the best of both - they are integrating ICU into the core.
Although I always hated their tendency to bloat the core, this is one of
the cases of bloat that I would want to applaud as a gesture of sanity
and common sense.

> We can use ICU in Ruby too - it's still pre-alpha and not
> seamless, but the possibility exists.

Except for the fact that the maintainer has abandoned it and nobody
has stepped in. I don't do C.

> From what I've read, Python
> doesn't do the heavyweight stuff natively, either. (Please tell me if
> I'm wrong - I don't use Python.)

It depends on what you call "heavyweight". For the purists out there,
I gather, even including a complete Unicode table with
codepoint properties might be "heavyweight".

>>
>> To my knowledge you are intimately familiar with the subject so I
>> take it as sarcasm.
>
> I'm not being sarcastic at all, though perhaps I could have phrased it
> better. It's just that all Unicode discussions in Ruby end up going
> round and round in circles; if we as a community could identify some
> first-class examples of Doing It Right, I think we'd have some useful
> yardsticks.

The problem being, my "Right Examples" are nowhere near others'
"Right Examples", which in turn spurs flamewars.
My "right example" is simple - Unicode on no terms, no encoding
choice, characters only - but most already are dissatisfied with such
an attitude and the issue has been discussed in detail, with no
solution satisfying all parties being devised. Too much compromise.

> You are someone with particularly high expectations
> (rightly so) of Unicode support in a language: have you found anything
> that really impressed you?

ICU in all its incarnations (Java and C), compulsory character-
oriented Strings without choice of encoding in Java and the upcoming
Unicode support in Python (again - compulsory Unicode for all
strings, byte arrays for everything else). Perl's regex support. I
know everyone will disagree (how do I match a PNG header in a
string???) but that's what I consider good.

As to localization - resource bundles are good, and of course I
give credit to all languages that _did_ bother to print localized dates.
Shame on Ruby.


>
>> But if you really feel like being constructive you can update the
>> Unicode gem (which you promised about a month ago) :-))
>
> I promised I'd try :-) Thanks for the reminder, though! I'll get on
> with it.

Gotcha :-)

Chad Perrin

Jul 31, 2006, 1:40:47 PM
On Tue, Aug 01, 2006 at 12:20:18AM +0900, Julian 'Julik' Tarkhanov wrote:
> On 28-jul-2006, at 18:54, Chad Perrin wrote:
>
> >There's something relevant in there
> >from last month. Does it answer your question?" rather than the
> >equivalent of "Someone already said Unicode this year! Out, heathen!"
>
> This is not quite true - the problem is that when the subject comes
> up it spurs threads of obnoxious length, with no single answer given.

. . . except that Why answered me beyond the expectations of the
question, quite satisfactorily. My question is answered and then some.
I wasn't asking "When will it be perfect?" I was only asking "What does
it do now?"

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]

Julian 'Julik' Tarkhanov

Jul 31, 2006, 2:52:37 PM

On 31-jul-2006, at 19:40, Chad Perrin wrote:

> . . . except that Why answered me beyond the expectations of the
> question, quite satisfactorily. My question is answered and then some.
> I wasn't asking "When will it be perfect?" I was only asking "What does
> it do now?"

It's always more entertaining to think globally and draw grand
schemes :-) _why's answer is indeed a good summary.

Chad Perrin

Jul 31, 2006, 3:18:31 PM
On Tue, Aug 01, 2006 at 03:52:37AM +0900, Julian 'Julik' Tarkhanov wrote:
>
> It's always more entertaining to think globally and draw grand
> schemes :-)

I would have thought so, but the general consensus seems to be that
considering the issue is "bad" right now, for some definition of "bad".

--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]

Daniel DeLorme

Aug 1, 2006, 5:05:59 AM
Paul Battley wrote:
>> > Actually, that's a really good idea. Which languages/frameworks have
>> > you found that actually do it right? We could learn from their
>> > example.
>>
>> To my knowledge you are intimately familiar with the subject so I
>> take it as sarcasm.
>
> I'm not being sarcastic at all, though perhaps I could have phrased it
> better. It's just that all Unicode discussions in Ruby end up going
> round and round in circles; if we as a community could identify some
> first-class examples of Doing It Right, I think we'd have some useful
> yardsticks. You are someone with particularly high expectations
> (rightly so) of Unicode support in a language: have you found anything
> that really impressed you?

I second that. I see a lot of people asking for "transparent" unicode support
but I don't see how that is possible. To me it's like asking for a language that
has transparent bug recovery. I know that ruby has weaknesses when it comes to
multibyte encodings, but the main problem is human in nature; too many people
assume that char==byte, which results in bugs when someone unexpectedly uses
"weird" characters. IMHO no amount of "transparent support" will change that.
But I would love to be shown otherwise with examples of languages that "do it
right".
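
To make the char==byte trap concrete, here is a minimal sketch. It assumes a
much later Ruby (1.9.3 or newer, where strings carry an encoding) rather than
the 1.8 discussed in this thread, and the helper names truncate_bytes and
truncate_chars are made up for illustration.

```ruby
# Hypothetical helper: naive truncation that treats bytes as characters.
def truncate_bytes(text, limit)
  text.byteslice(0, limit)
end

# Hypothetical helper: truncation that counts characters (code points).
def truncate_chars(text, limit)
  text[0, limit]
end

word = "r\u00e9sum\u00e9"  # "résumé": 6 characters, 8 bytes in UTF-8

puts word.length    # number of characters
puts word.bytesize  # number of bytes

# The byte-based cut lands in the middle of the two-byte "é",
# so the result is no longer valid UTF-8.
puts truncate_bytes(word, 2).valid_encoding?

# The character-based cut keeps the "é" whole.
puts truncate_chars(word, 2).valid_encoding?
```

The bug only shows up once a "weird" character crosses the cut point, which
is why code that assumes char==byte can pass every ASCII-only test.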

Daniel

Michal Suchanek

Aug 1, 2006, 5:37:50 AM
By transparent I mean that I can iterate, compare, match, index, ...
not only bytes but also at least code points (and grapheme clusters if
somebody is so nice as to implement that - but for me it is not very
important now), all using the standard string class that all standard
functions accept.

In ruby 1.8 working with anything but bytes is like scratching your
right ear with your left hand .. or leg.
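
A rough sketch of those three levels (bytes, code points, grapheme clusters)
on one standard String follows. It assumes Ruby 2.5 or later, where
String#grapheme_clusters exists; none of this is available in 1.8, and
unit_counts is a hypothetical helper name.

```ruby
# Hypothetical helper: count the same string at three levels of granularity.
def unit_counts(text)
  {
    bytes: text.bytes.length,
    code_points: text.codepoints.length,
    grapheme_clusters: text.grapheme_clusters.length
  }
end

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character,
# two code points, three bytes in UTF-8.
accented = "e\u0301"

p unit_counts(accented)

# Iterating at each level uses the ordinary String class throughout.
accented.each_byte { |b| puts format("byte 0x%02X", b) }
accented.each_codepoint { |cp| puts format("code point U+%04X", cp) }
accented.grapheme_clusters.each { |g| puts "cluster of #{g.codepoints.length} code points" }
```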

Thanks

Michal

Michal Suchanek

Aug 1, 2006, 6:05:03 AM
On 7/31/06, Julian 'Julik' Tarkhanov <lis...@julik.nl> wrote:
>
> On 31-jul-2006, at 18:51, Paul Battley wrote:
>
> > On 31/07/06, Julian 'Julik' Tarkhanov <lis...@julik.nl> wrote:
> >>
> >> On 31-jul-2006, at 17:48, Paul Battley wrote:
> >> > Whilst it's certainly useless for a lot of tasks, I'm not sure that
> >> > Ruby is any worse than other languages in this regard. As far as I'm
> >> > aware, most languages that 'support' Unicode don't handle grapheme
> >> > clusters without using additional libraries.
> >>
> >> AFAIK Python regexps do that properly, and ICU does for sure (both as
> >> free iterators and regexps).
> >
> > That's what I mean: ICU is a separate library, not part of a language
> > core.
>
> PHP took the best of both - they are integrating ICU into the core.
> Although I always hated their tendency to bloat the core, this is one of
> the cases of bloat that I would want to applaud as a gesture of sanity
> and common sense.

Last time I looked ICU was in C++. Requiring a C++ compiler and
runtime is quite a bit of bloat :)

>
> > We can use ICU in Ruby too - it's still pre-alpha and not
> > seamless, but the possibility exists.
>
> Except for the fact that the maintainer has abandoned it and nobody
> stepped in. I don't do C.
>
> > From what I've read, Python
> > doesn't do the heavyweight stuff natively, either. (Please tell me if
> > I'm wrong - I don't use Python.)
>
> It depends on what you call "heavyweight". For the purists out there,
> I gather, even including a complete Unicode table with
> codepoint properties might be "heavyweight".

I am not sure how large that might be. But if it is about the size of
the interpreter including the rest of the standard libraries I would
consider it "heavyweight". It would be a reason to start "optional
standard libraries" I guess :)

>
> >>
> >> To my knowledge you are intimately familiar with the subject so I
> >> take it as sarcasm.
> >
> > I'm not being sarcastic at all, though perhaps I could have phrased it
> > better. It's just that all Unicode discussions in Ruby end up going
> > round and round in circles; if we as a community could identify some
> > first-class examples of Doing It Right, I think we'd have some useful
> > yardsticks.
>
> The problem being, my "Right Examples" are nowhere near others'
> "Right Examples", which in turn spurs flamewars.
> My "right example" is simple - Unicode on no terms, no encoding
> choice, characters only - but most already are dissatisfied with such
> an attitude and the issue has been discussed in detail, with no
> solution satisfying all parties being devised. Too much compromise.

It's also been said that giving more options does not stop you from
using only unicode. If your "right example" is only about restricting
choice then there is really not much to it.

The "right examples" people were interested in are probably more like
the libraries/languages that implement enough functionality to give
you full unicode support for your definition of "full".

Thanks

Michal

Julian 'Julik' Tarkhanov

Aug 1, 2006, 7:43:47 AM

On 1-aug-2006, at 12:05, Michal Suchanek wrote:

> Last time I looked ICU was in C++. Requiring a C++ compiler and
> runtime is quite a bit of bloat :)

It still is. And it's huge and takes ages to build. If only I knew
something much lighter and better I would have dismissed it.


>
> I am not sure how large that might be. But if it is about the size of
> the interpreter including the rest of the standard libraries I would
> consider it "heavyweight". It would be a reason to start "optional
> standard libraries" I guess :)

I'm stopping right here. Unicode is not an option.

> It's also been said that giving more options does not stop you from
> using only unicode.

In 90% of the cases giving more options means programmers ignore
Unicode, for reasons ranging from speed to ignorance. My user
experience over the years has proven it.

But then again, I stop right here. And I urge you to do the same :-)

Tim Bray

Aug 1, 2006, 2:41:25 PM
Daniel DeLorme wrote:

> I second that. I see a lot of people asking for "transparent" unicode
> support but I don't see how that is possible. To me it's like asking for
> a language that has transparent bug recovery. I know that ruby has
> weaknesses when it comes to multibyte encodings, but the main problem is
> human in nature; too many people assume that char==byte, which results
> in bugs when someone unexpectedly uses "weird" characters. IMHO no
> amount of "transparent support" will change that. But I would love to be
> shown otherwise with examples of languages that "do it right".

It can be done. Java gets it almost right, and in such a way that most
people will never stub their toes on the flaws. Python, it seems, is
going to get it right next time around. It's clearly possible to do
Unicode correctly. What Matz wants is much harder; a String type that
can contain strings of characters from arbitrary character sets in
arbitrary encodings, Unicode being just one special case, and also serve
as a byte buffer.
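
For what it is worth, this is roughly the shape Ruby strings took from 1.9
onward: each String carries its own encoding tag, and a binary-tagged String
doubles as a byte buffer. A small sketch, assuming Ruby 1.9 or later (not the
1.8 of this thread); describe is a hypothetical helper.

```ruby
# Hypothetical helper: report a string's encoding tag, its length in
# characters, and its length in bytes.
def describe(str)
  [str.encoding, str.length, str.bytesize]
end

utf8 = "\u65e5\u672c\u8a9e"              # "日本語" as UTF-8
sjis = utf8.encode(Encoding::Shift_JIS)  # same characters, transcoded
raw  = utf8.dup.force_encoding(Encoding::BINARY)  # same bytes, retagged as binary

p describe(utf8)  # 3 characters, 9 bytes
p describe(sjis)  # 3 characters, 6 bytes
p describe(raw)   # 9 "characters", 9 bytes: every byte counts as one unit
```

Note the difference between encode, which converts the bytes, and
force_encoding, which only changes the tag on the same bytes.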

-Tim
