Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

Message has been deleted

Yukihiro Matsumoto

May 19, 2008, 7:53:04 PM
Hi,

In message "Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it."
on Sat, 17 May 2008 06:10:05 +0900, DJ Jazzy Linefeed <john.d....@gmail.com> writes:
|
|def prep_file(path)
|
| ret = ''
|
| x = File.open(path)
|
| x.lines.each do |l|
| l.gsub!('\n', ' ')
| ret << l
| end
|
| puts ret
|
|end
|...
|compare.rb:64:in `gsub': broken UTF-8 string (ArgumentError)
| from compare.rb:64:in `block in prep_file'
| from compare.rb:63:in `each_line'
| from compare.rb:63:in `call'
| from compare.rb:63:in `each'
| from compare.rb:63:in `prep_file'
| from compare.rb:144:in `<main>'

Regular expression operations do not work on broken strings. It
seems that your locale specifies UTF-8, yet the content of the file
you are reading is not UTF-8. If you know the encoding of the content,
say iso-8859-1, you can open it with an explicit encoding:

x = File.open(path, "r:iso-8859-1")

If not, you can open it as binary:

x = File.open(path, "r:ascii-8bit")

unless the file content is in a non-ASCII-compatible encoding such as UTF-16.

matz.
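
A minimal sketch of the explicit-encoding approach above, assuming the file really is ISO-8859-1. The helper name prep_file is taken from the quoted code; the temporary file and its contents are invented for illustration. Note that the sketch uses "\n" in double quotes: the single-quoted '\n' in the quoted code is a backslash followed by an n, not a newline.

```ruby
require "tempfile"

# Join the lines of a file into one string, replacing newlines with spaces.
# `encoding` is the encoding the file is assumed to be written in.
def prep_file(path, encoding = "iso-8859-1")
  File.open(path, "r:#{encoding}") do |f|
    f.each_line.map { |l| l.gsub("\n", " ") }.join
  end
end

# Illustration only: two Latin-1 lines (0xE9 is "é", 0xEF is "ï").
sample = Tempfile.new("latin1")
sample.binmode
sample.write("caf\xE9\nna\xEFve\n".b)
sample.close

text = prep_file(sample.path)
puts text.encoding         # the encoding the string is tagged with
puts text.valid_encoding?  # whether the bytes are valid in that encoding
```

With a single-byte encoding such as ISO-8859-1 every byte is valid, so a wrong guess does not raise; the characters are simply misinterpreted.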

7stud --

May 19, 2008, 8:40:18 PM
DJ Jazzy Linefeed wrote:
>
> I'm gonna go get a gallon of milk and I'll be back soon. You wait
> right there. (grumbles)
>

Just shut your eyes and hum the mantra, "Ruby doesn't get in your way.
Ruby doesn't get in your way." You obviously need to get "cleared".
Please set up an appointment with your nearest Church of Scientology.

--
Posted via http://www.ruby-forum.com/.

DJ Jazzy Linefeed

May 20, 2008, 6:45:28 AM
On May 19, 1:53 pm, Yukihiro Matsumoto <m...@ruby-lang.org> wrote:
> Hi,
>
> In message "Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it."

It makes no sense, Matz.

I don't get to know what the encoding is beforehand, that's just it -
there could be any encoding. I just deal with a pile of files, I
think...

case encoding
when ascii-8bit
# ... 12 lines of boilerplate encoder extraction
when iso-8859
# ... 12 more?
end

is a case to avoid

should move forward or die, yes, but not
and die.

Ruby is accessible to the masses because you don't have to understand
encodings, not in spite of the fact.
Food for thought.
-DJ J\N

Yukihiro Matsumoto

May 20, 2008, 9:51:38 AM
Hi,

In message "Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it."

on Tue, 20 May 2008 19:50:08 +0900, DJ Jazzy Linefeed <john.d....@gmail.com> writes:

|> Regular expression operations do not work on broken strings. It
|> seems that your locale specifies UTF-8, yet the content of the file
|> you are reading is not UTF-8. If you know the encoding of the content,
|> say iso-8859-1, you can open it with an explicit encoding:
|>
|> x = File.open(path, "r:iso-8859-1")
|>
|> If not, you can open it as binary:
|>
|> x = File.open(path, "r:ascii-8bit")
|>
|> unless the file content is in a non-ASCII-compatible encoding such as UTF-16.

|It makes no sense, Matz.


|
|I don't get to know what the encoding is beforehand, that's just it -
|there could be any encoding. I just deal with a pile of files, I
|think...

Since today's OSes do not provide encoding information for files, you
HAVE TO know the encoding of the files if you want to handle them
correctly, unfortunately. That's life, whatever you might expect.

If you don't need exact encoding handling, and know the file is mostly
ASCII, use ASCII-8BIT as the encoding. It works in most cases.

matz.
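
A rough sketch of one way to act on this advice when the encoding is not known in advance: read the raw bytes, keep them as UTF-8 if they happen to be valid UTF-8, and otherwise fall back to ASCII-8BIT. The helper names read_text_or_bytes and write_sample, and the sample data, are made up for illustration; the check only separates valid UTF-8 from everything else and does not identify any other encoding.

```ruby
require "tempfile"

# Return the file content tagged as UTF-8 when the bytes are valid UTF-8,
# otherwise return the same bytes tagged as ASCII-8BIT (binary).
def read_text_or_bytes(path)
  data = File.binread(path)                        # raw bytes, ASCII-8BIT
  as_utf8 = data.dup.force_encoding(Encoding::UTF_8)
  as_utf8.valid_encoding? ? as_utf8 : data
end

# Illustration only: write the given bytes to a temp file and return it.
def write_sample(bytes)
  f = Tempfile.new("sample")
  f.binmode
  f.write(bytes)
  f.close
  f
end

utf8_file   = write_sample("caf\u00E9\n".b)  # "é" encoded as UTF-8 (2 bytes)
latin1_file = write_sample("caf\xE9\n".b)    # "é" encoded as ISO-8859-1 (1 byte)

good = read_text_or_bytes(utf8_file.path)
raw  = read_text_or_bytes(latin1_file.path)

puts good.encoding
puts raw.encoding
# ASCII-only patterns still work on the binary string.
puts raw.gsub("\n", " ").bytes.inspect
```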

M. Edward (Ed) Borasky

May 20, 2008, 10:24:29 AM
Yukihiro Matsumoto wrote:
> Hi,
>
> In message "Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it."
> on Tue, 20 May 2008 19:50:08 +0900, DJ Jazzy Linefeed <john.d....@gmail.com> writes:
>
> |> Regular expression operations do not work on broken strings. It
> |> seems that your locale specifies UTF-8, yet the content of the file
> |> you are reading is not UTF-8. If you know the encoding of the content,
> |> say iso-8859-1, you can open it with an explicit encoding:
> |>
> |> x = File.open(path, "r:iso-8859-1")
> |>
> |> If not, you can open it as binary:
> |>
> |> x = File.open(path, "r:ascii-8bit")
> |>
> |> unless the file content is in a non-ASCII-compatible encoding such as UTF-16.
>
> |It makes no sense, Matz.
> |
> |I don't get to know what the encoding is beforehand, that's just it -
> |there could be any encoding. I just deal with a pile of files, I
> |think...
>
> Since today's OSes do not provide encoding information for files, you
> HAVE TO know the encoding of the files if you want to handle them
> correctly, unfortunately. That's life, whatever you might expect.

I ran across this in Perl last week on a Windows machine. It seems the
Perl "Encode" library has a "guess" option. It will look at a file and
attempt to guess what the encoding is. Unfortunately, it could only
determine that the files were "UTF-16", not which of (at least) two
variants. The solution turned out to be to open the files in Wordpad and
save them as ASCII.

You're absolutely right ... if you don't know what encoding the writer
of the file used, your first action should be to ask!

P.S.: I suppose I should look at how Perl attempts to guess the encoding
and why it couldn't pick one of two UTF-16 variants. :)


>
> If you don't need exact encoding handling, and know the file is mostly
> ASCII, use ASCII-8BIT as the encoding. It works in most cases.

Well ... it didn't work on my UTF-16 files last week. :)
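
One common way to tell the two UTF-16 variants apart is the byte order mark (BOM) at the start of the data. The following is a hedged sketch in Ruby, not a description of what Perl's Encode does; the helper names utf16_variant and utf16_to_utf8 are made up for illustration. It returns nil when there is no BOM, and it does not try to distinguish a UTF-16LE BOM from a UTF-32LE one, which starts with the same two bytes.

```ruby
# Guess the UTF-16 variant from a leading byte order mark.
# Returns an Encoding, or nil when the data has no UTF-16 BOM.
def utf16_variant(bytes)
  head = bytes.byteslice(0, 2).b
  case head
  when "\xFF\xFE".b then Encoding::UTF_16LE
  when "\xFE\xFF".b then Encoding::UTF_16BE
  end
end

# Decode BOM-prefixed UTF-16 data to UTF-8, dropping the BOM itself.
def utf16_to_utf8(bytes)
  enc = utf16_variant(bytes)
  return nil unless enc
  bytes.byteslice(2..-1).force_encoding(enc).encode(Encoding::UTF_8)
end

# Illustration only: the same text in both byte orders, each with its BOM.
le = "\xFF\xFE".b + "hi\n".encode("UTF-16LE").b
be = "\xFE\xFF".b + "hi\n".encode("UTF-16BE").b

puts utf16_variant(le)
puts utf16_variant(be)
puts utf16_to_utf8(le) == utf16_to_utf8(be)
```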


Todd Benson

May 20, 2008, 10:54:25 AM
On Fri, May 16, 2008 at 4:10 PM, DJ Jazzy Linefeed
<john.d....@gmail.com> wrote:
> def prep_file(path)
>
> ret = ''
>
> x = File.open(path)
>
> x.lines.each do |l|
> l.gsub!('\n', ' ')
> ret << l
> end
>
> puts ret
>
> end
> ...
> compare.rb:64:in `gsub': broken UTF-8 string (ArgumentError)
> from compare.rb:64:in `block in prep_file'
> from compare.rb:63:in `each_line'
> from compare.rb:63:in `call'
> from compare.rb:63:in `each'
> from compare.rb:63:in `prep_file'
> from compare.rb:144:in `<main>'
>
> Hm. Okay, I love you ruby, we can just talk this thing out and I can
> get back to...
>
> x.lines.each do |l|
> ret << l
> end
>
> # (I love you too)
>
> Alright baby, Daddy gets confused and angry sometimes... do you wanna
> make a little string love...?

>
> ret = ''
>
> x = File.open(path)
>
> x.lines.each do |l|
> ret << l
> end
>
> puts ret.class
>
> # String
>
> Mhm, it smells like you do. Why don't we take this off...
>
> x.lines.each do |l|
> ret << l
> end
>
> puts ret.gsub!('a', 'test')
>
> end
> ...
> compare.rb:69:in `gsub!': broken UTF-8 string (ArgumentError)
> from compare.rb:69:in `prep_file'
> from compare.rb:145:in `<main>'
>
> Hey, Ruby, if it's that week of the month we can just cuddle. Here,
> try this...
>
>
> x.lines.each do |l|

> ret << l
> end
>
> puts ret
> ...
> # (big string)
>
> See, thats good. Thats a string and that's something we have in
> common, maybe we were just talking about different encodings. Let's
> see what it's made of.
>
> puts ret.encoding
>
> # UTF-8

>
> I'm gonna go get a gallon of milk and I'll be back soon. You wait
> right there. (grumbles)

I love it. Another person that wants a babel fish. The irony is in
the language demonstrating as much. In other words, I need Jazzy
Linefeed encoding (I left off the DJ because there might be other
types of linefeeds).

Todd

Gary Watson

Dec 27, 2009, 5:04:18 AM
Thanks for the suggestion of using ascii-8bit. This solved my problem.

The line of code that was giving me fits was the following line. Worked
in 1.8 but didn't work in 1.9.0

puts Dir["**/*"].select {|x| x.match(/(jpg)$/)}

when I changed it to this

puts Dir["**/*"].select {|x|
x.force_encoding("ascii-8bit").match(/(jpg)$/)}

all was well.

Regards,
Gary

--
Posted via http://www.ruby-forum.com/.
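
A small sketch of the same workaround that leaves the original strings untouched. It assumes a Ruby new enough to have String#b (2.0 or later), which returns an ASCII-8BIT copy, whereas force_encoding changes the tag on the string itself. The file names are invented; the second one contains a byte that is not valid UTF-8.

```ruby
# File names as they might come back from Dir[] (made up for illustration).
names = ["holiday.jpg", "caf\xE9.jpg", "notes.txt"]

# Match against a binary copy of each name, so the bytes before the
# extension are never interpreted as UTF-8 characters.
jpgs = names.select { |x| x.b.match(/(jpg)$/) }

puts jpgs.length
puts names[1].encoding   # the original string keeps its own encoding tag
```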

Albert Schlef

Dec 27, 2009, 5:36:06 AM
Yukihiro Matsumoto wrote:
> on Sat, 17 May 2008 06:10:05 +0900, DJ Jazzy Linefeed
> <john.d....@gmail.com> writes:
> > l.gsub!('\n', ' ')
[snip]

> Regular expression operations do not work on broken strings. It

An off-topic question:

So String#gsub always uses the regexp engine (even if the pattern is a
plain string).

Now,

Is there a way, in Ruby, to do search/replace that doesn't involve the
regexp engine?

I'm asking this because I figure that not using the regexp engine would
be faster (but maybe it'll be only marginally faster, I don't know).

I know one can do...

s[ 'find' ] = 'replace'

...but it replaces only one occurrence of the substring (and does it
skip the regexp engine?).
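
One possible answer, as a sketch only: a replace-all loop built on String#index with a plain string argument and ordinary slicing, so no Regexp is constructed by the code itself. The name replace_all is made up for illustration. Whether this is any faster than gsub is not established here and would need measuring.

```ruby
# Replace every non-overlapping occurrence of `find` in `str` with `repl`,
# scanning left to right with String#index.
def replace_all(str, find, repl)
  return str.dup if find.empty?   # nothing sensible to search for
  out = +""
  pos = 0
  while (hit = str.index(find, pos))
    out << str[pos...hit] << repl  # text before the match, then the replacement
    pos = hit + find.length        # continue after the match
  end
  out << str[pos..-1]              # whatever is left after the last match
end

puts replace_all("find and find again", "find", "replace")
```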

Brian Candler

Dec 27, 2009, 6:05:44 AM
DJ Jazzy Linefeed wrote:
> compare.rb:64:in `gsub': broken UTF-8 string (ArgumentError)

Yep. Ruby 1.9 will raise exceptions in all sorts of odd places,
dependent on both the tagged encoding of the string *and* its content at
that point in time.

I got as far as recording 200 behaviours of String in ruby 1.9 before I
gave up:
http://github.com/candlerb/string19/blob/master/string19.rb

The solution I use is simple: stick to ruby 1.8.x. When that branch
dies, perhaps reia will be ready. If not I'll move to something else.

IMO, both python 3 and erlang have got the right idea when it comes to
handling UTF8.

Benoit Daloze

Dec 27, 2009, 6:48:51 AM
2009/12/27 Brian Candler <b.ca...@pobox.com>

Hi,

I got this kind of problem yesterday too.

While taking some file names with Dir#[], I got some special results.

I was searching for "bad" file names, I mean file names with é,ê or
whatever. When I print the String given in the block directly, no problem.

But then I come with things like:
/Users/benoitdaloze/Library/GlestGame/data/lang/espan>̃<ol.lng

(The ~ is separated from the n, so it is no longer ñ.) The Regexp is acting as
if these were 2 different characters. How can I handle that easily? I tried to
change the script encoding to MacRoman, but that produced an error about a bad
encoding not matching UTF-8.

That was the output of this script (which is then not able to rename any wrong
file, because tr! does not seem to work on name either):

path = ARGV[0] || "/"

ALLOWED_CHARS = "A-Za-z0-9 %#:$@?!=+~&|'()\\[\\]{}.,\r_-"

Dir["#{File.expand_path(path)}/**/*"].each { |f|
  name = File.basename(f)
  unless name =~ /^[#{ALLOWED_CHARS}]+$/
    puts File.dirname(f) + '/' + name.gsub(/([^#{ALLOWED_CHARS}]+)/, ">\\1<")

    # Here it is not complete, it is just a test, but it doesn't work even
    # for 'filéname'
    if name.tr!('éèê', 'e') =~ /^[#{ALLOWED_CHARS}]+$/
      File.rename(f, File.dirname(f) + '/' + name)
      puts "\trenamed in #{name}"
      break
    end
  end
}

Edward Middleton

Dec 27, 2009, 8:30:31 PM
Brian Candler wrote:
> DJ Jazzy Linefeed wrote:
>
>> compare.rb:64:in `gsub': broken UTF-8 string (ArgumentError)
>>
>
> Yep. Ruby 1.9 will raise exceptions in all sorts of odd places,
> dependent on both the tagged encoding of the string *and* its content at
> that point in time.
>

If you don't arbitrarily set the encoding, when will this be a problem?

Edward

Brian Candler

Dec 28, 2009, 5:36:11 AM
Benoit Daloze wrote:
> But then I come with things like:
> /Users/benoitdaloze/Library/GlestGame/data/lang/espan>̃<ol.lng
>
> (The ~ is separated from the n and then is not ñ). The Regexp is acting
> like
> it is 2 different characters. How to handle that easily? I tried to
> change
> the script encoding in MacRoman, but it produced an error of bad
> encoding
> not matching UTF-8.

I don't know what you mean. If Dir.[] tells you that the file name is
<e> <s> <p> <a> <n> <~> <o> <l> <.> <l> <n> <g>, is that not the true
filename?

I suggest you try something like this:

puts "Source encoding: #{"".encoding}"
puts "External encoding: #{Encoding.default_external}"
Dir["*.lng"].each do |fn|
  puts "Name: #{fn.inspect}"
  puts "Encoding: #{fn.encoding}"
  puts "Chars: #{fn.chars.to_a.inspect}"
  puts "Codepoints: #{fn.codepoints.to_a.inspect}"
  puts "Bytes: #{fn.bytes.to_a.inspect}"
  puts
end

then post the results for this file here. Then also post what you think
the true filename is.

Then you can see whether: (1) Dir.[] is returning the correct sequence
of bytes for the filename or not; and (2) Dir.[] is tagging the string
with the correct encoding or not.

(This is one of the thousands of cases I did *not* document in
string19.rb; I did some of the core methods on String, but of course
every method in every class which either returns a string or accepts a
string argument needs to document how it handles encodings)

> as output of this script (which is then not able to rename any wrong
> file,
> because tr! seem to not work either on name) :
>
> path = ARGV[0] || "/"
>
> ALLOWED_CHARS = "A-Za-z0-9 %#:$@?!=+~&|'()\\[\\]{}.,\r_-"
>
> Dir["#{File.expand_path(path)}/**/*"].each { |f|
> name = File.basename(f)
> unless name =~ /^[#{ALLOWED_CHARS}]+$/
> puts File.dirname(f) + '/' + name.gsub(/([^#{ALLOWED_CHARS}]+)/,
> ">\\1<")
>
> if name.tr!('éèê', 'e') =~ /^[#{ALLOWED_CHARS}]+$/ # Here it is
> not
> complete, it is just a test, but it doesn't work even for 'filéname'
> File.rename(f, File.dirname(f) + '/' + name)
> puts "\trenamed in #{name}"
> break
> end
> end
> }

What error do you get? Is it failing to match the é at all (tr! returns
nil), or is an encoding error raised in tr!, or is an error raised by
File.rename ?

Benoit Daloze

Dec 28, 2009, 8:23:18 AM
2009/12/28 Brian Candler <b.ca...@pobox.com>

The true filename is (from the Finder and Terminal):
-rw-r--r--@ 1 benoitdaloze staff 3758 Jul 17 2008 español.lng
So, with the 'ñ'.

I don't know what the encoding of filenames on HFS+ is. Wikipedia says it
is UTF-16, with decomposition:
"names which are also character encoded in UTF-16
<http://en.wikipedia.org/wiki/UTF-16> and normalized to a form very nearly
the same as Unicode Normalization Form D (NFD)
<http://en.wikipedia.org/wiki/Unicode_normalization> (which means that
precomposed characters like é are decomposed in the HFS+ filename and
therefore count as two characters)"
So, that's probably a problem of encoding for Dir.[]

I changed the script a little, to compare with a String hard-coded inside
the script (rn = "español.lng")

ruby 1.9.2dev (2009-12-11 trunk 26067) [x86_64-darwin10.2.0]

Source encoding: UTF-8
External encoding: UTF-8

Format:
String in the code
filename from Dir[]

String equality: false

Name:
"español.lng"
"español.lng"
Encoding:
UTF-8
UTF-8
Chars:
["e", "s", "p", "a", "ñ", "o", "l", ".", "l", "n", "g"]
["e", "s", "p", "a", "n", "̃", "o", "l", ".", "l", "n", "g"]
Codepoints:
[101, 115, 112, 97, 241, 111, 108, 46, 108, 110, 103]
[101, 115, 112, 97, 110, 771, 111, 108, 46, 108, 110, 103]
Bytes:
[101, 115, 112, 97, 195, 177, 111, 108, 46, 108, 110, 103]
[101, 115, 112, 97, 110, 204, 131, 111, 108, 46, 108, 110, 103]


> Then you can see whether: (1) Dir.[] is returning the correct sequence
> of bytes for the filename or not; and (2) Dir.[] is tagging the string
> with the correct encoding or not.
>

(1) Dir[] seems to return a correct String in UTF-8, while being different
(!!) from a String inside in UTF-8
But looking at the codepoints and bytes, it's very different ...

(2) That's probably the case, let's look by forcing the encoding to
MacRoman:
Or not ... making crazy results like: "espan\xCC\x83ol.lng" or
"espan\u0303ol.lng"

Well, this is out of my poor knowledge of encoding I'm afraid :(

The most frustrating part is that they print the same...

P.S.: I also got filenames with "\r", quite weird, no? ("Target
Application Alias\r", and the "\r" is shown as "?" in the Terminal)

Yes, tr! returns nil on name.tr!('ñ', 'n'), but it would work on a String
inside the script (e.g. "eño".tr!('ñ', 'n'))

Marnen Laibow-Koser

Dec 28, 2009, 2:22:13 PM
Benoit Daloze wrote:
> 2009/12/27 Brian Candler <b.ca...@pobox.com>

>
>>
>> The solution I use is simple: stick to ruby 1.8.x. When that branch
>> dies, perhaps reia will be ready. If not I'll move to something else.
>>
>> IMO, both python 3 and erlang have got the right idea when it comes to
>> handling UTF8.
>> --
>> Posted via http://www.ruby-forum.com/.
>>
>>
> Hi,
>
> I got this kind of problem yesterday too.
>
> While taking some file names with Dir#[], I got some special results.
>
> I was searching for "bad" file names, I mean file names with é,ê or
> whatever. When I print the String given in the block directly, no
> problem.
>
> But then I come with things like:
> /Users/benoitdaloze/Library/GlestGame/data/lang/espan>̃<ol.lng
>
> (The ~ is separated from the n and then is not ñ). The Regexp is acting
> like
> it is 2 different characters.

And so it is. If memory serves, Mac OS X stores filenames in normal
form D.

> How to handle that easily?

Normalize to normal form C instead.

Best,
--
Marnen Laibow-Koser
http://www.marnen.org
mar...@marnen.org
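
A short sketch of what normalizing to form C looks like in Ruby 2.2 and later, which provide String#unicode_normalize (this method did not exist in the 1.9 versions discussed in this thread). The two strings reproduce the code points posted above: one with a precomposed ñ, one with n followed by a combining tilde.

```ruby
composed   = "espa\u00F1ol.lng"    # ñ as a single code point (form C)
decomposed = "espan\u0303ol.lng"   # n + U+0303 combining tilde (form D)

# The two strings look the same when printed but are not equal.
puts composed == decomposed

# After normalizing to NFC they compare equal, and tr can find the ñ.
nfc = decomposed.unicode_normalize(:nfc)
puts nfc == composed
puts nfc.tr("\u00F1", "n")

# On the decomposed string there is no ñ code point to translate,
# so tr! reports that nothing changed by returning nil.
puts decomposed.dup.tr!("\u00F1", "n").inspect
```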

Bill Kelly

Dec 28, 2009, 7:46:52 PM
Brian Candler wrote:
>
> I got as far as recording 200 behaviours of String in ruby 1.9 before I
> gave up:
> http://github.com/candlerb/string19/blob/master/string19.rb
>
> The solution I use is simple: stick to ruby 1.8.x. When that branch
> dies, perhaps reia will be ready. If not I'll move to something else.
>
> IMO, both python 3 and erlang have got the right idea when it comes to
> handling UTF8.

Could you summarize what you feel the key difference of
the python 3 / erlang approach is, compared to ruby19 ?

I'm a relative newbie in dealing with character encodings,
but I do recall a few lengthy discussions on this list when
ruby19's M17N was being developed, where the "UTF-8 only"
approaches of some other languages were deemed insufficient
for various reasons.

However, my understanding is that one is supposed to be
able to effectively make ruby behave as a "UTF-8 only"
language if one makes sure external data is transcoded to
UTF-8 at I/O boundaries.

I realize there may be some caveats with regard to locale,
although I invoke my ruby19 scripts with -EUTF-8:UTF-8.

So far, my experience with ruby19 M17N has _not_ been
problematic. The only difficulties I've encountered have
been when dealing with external data in some unknown
encoding, where I've had to do some programmatic guesswork
and finagling to make sort of a best-effort conversion of
the external data to UTF-8 at the I/O boundary.

But that is something I can't imagine python or erlang
helping me much with either.

* * *

Reflecting some more, I do recall that James Gray had
remarked on the difficulty of modifying one of his libraries
so that it would be effectively encoding agnostic, and be
able to handle data in whatever encoding was thrown at it.

So from that perspective I can see how a "UTF-8 only"
approach at the language level should simplify things.

But from my current perspective as an application developer
who is taking the approach of ensuring all data read into
my program is converted to UTF-8, I'm wondering if my
experience is essentially similar to what it would be in
a "UTF-8 only" language.


Regards,

Bill
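
A minimal sketch of the "transcode at the I/O boundary" approach described above, assuming a current Ruby. The first read assumes the source encoding is known (ISO-8859-1 here); the second shows a best-effort fallback for unknown input using String#scrub (Ruby 2.1 or later), which replaces invalid bytes rather than guessing what they meant. The temporary file and its contents are made up for illustration.

```ruby
require "tempfile"

# Illustration only: a file written in ISO-8859-1 (0xE9 is "é").
src = Tempfile.new("legacy")
src.binmode
src.write("caf\xE9\n".b)
src.close

# "r:iso-8859-1:utf-8" sets the external encoding to ISO-8859-1 and the
# internal encoding to UTF-8, so the data is transcoded as it is read.
text = File.open(src.path, "r:iso-8859-1:utf-8") { |f| f.read }
puts text.encoding
puts text.valid_encoding?

# Unknown source encoding: treat the bytes as UTF-8 and replace whatever
# is invalid. Information is lost, but later string operations will not
# fail on broken byte sequences.
best_effort = File.binread(src.path).force_encoding(Encoding::UTF_8).scrub
puts best_effort.valid_encoding?
```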

Edward Middleton

Dec 28, 2009, 8:37:10 PM
Bill Kelly wrote:
> Brian Candler wrote:
>
>> I got as far as recording 200 behaviours of String in ruby 1.9 before I
>> gave up:
>> http://github.com/candlerb/string19/blob/master/string19.rb
>>
>> The solution I use is simple: stick to ruby 1.8.x. When that branch
>> dies, perhaps reia will be ready. If not I'll move to something else.
>>
>> IMO, both python 3 and erlang have got the right idea when it comes to
>> handling UTF8.
>>
>
> Could you summarize what you feel the key difference of
> the python 3 / erlang approach is, compared to ruby19 ?
>

Taking a UTF-8 approach is easier to implement because you enforce all
strings to be UTF-8 and ignore when this doesn't work. Kind of like
saying everything will be ASCII or converted to it ;)

> I'm a relative newbie in dealing with character encodings,
> but I do recall a few lengthy discussions on this list when
> ruby19's M17N was being developed, where the "UTF-8 only"
> approaches of some other languages were deemed insufficient
> for various reasons.
>

Not everything maps one-to-one to UTF-8.

> However, my understanding is that one is supposed to be
> able to effectively make ruby behave as a "UTF-8 only"
> language if one makes sure external data is transcoded to
> UTF-8 at I/O boundaries.
>

That is pretty much it. The problem is that a lot of libraries still
don't handle encodings. This results in some spurious errors when a
function requiring compatible encoding operates on them[1]. The
solution is to add support for handling encodings.

Edward

1. As opposed to ruby 1.8 which would silently ignore actual errors
caused by the use of incompatible encodings.

Benoit Daloze

Dec 29, 2009, 6:44:26 AM
2009/12/28 Brian Candler <b.ca...@pobox.com>

" And so it is. If memory serves, Mac OS X stores filenames in normal
form D.

> How to handle that easily?

Normalize to normal form C instead.

Best,
--
Marnen Laibow-Koser "

So that solved it, converting with Iconv.
The "UTF-8-MAC" encoding will probably only work on Mac, but it is for
working on HFS+, so that's not really a problem.

I found the documentation (in 1.9.2) of Iconv a little messy ...
For example, typing 'ri Iconv#iconv'
------------------------------------------------------------ Iconv#iconv
Iconv.iconv(to, from, *strs)

and in 1.8.7
------------------------------------------------------------ Iconv#iconv
iconv(str, start=0, length=-1)

The result of ri (1.9.2) is the same as for 'ri Iconv::iconv', which is
quite different.

Anyway, converting every filename using this works :)

fn = Iconv.open("UTF-8", "UTF-8-MAC") { |iconv|
iconv.iconv(fn)
}
or
fn = Iconv.iconv("UTF-8", "UTF-8-MAC", fn).shift
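
Iconv was removed from the standard library in Ruby 2.0, so on current versions the lines above need the separate iconv gem. A hedged sketch of the same conversion with String#encode, assuming the UTF8-MAC transcoder that ships with Ruby; the file name is the one from this thread, written with explicit escapes.

```ruby
# Decomposed name, as returned for the HFS+ file discussed above.
fn = "espan\u0303ol.lng"

# Interpret the bytes as UTF8-MAC and transcode to plain UTF-8,
# which composes n + combining tilde into a single ñ.
fixed = fn.encode("UTF-8", "UTF8-MAC")

puts fixed.codepoints.length
puts fixed == "espa\u00F1ol.lng"
```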

Gary Watson

Dec 29, 2009, 10:27:00 AM
I would like to chime in here and point out that sometimes you really
want to ignore the errors caused by mis-matched encodings, (as was the
case in my script where I just wanted to match filenames ending in *.mpg
and really didn't care if the characters occurring before had funkiness
going on with them.)

1.8 had this kind of behavior by default, and I'm assuming python3 and
erlang do too based on the descriptions given in this thread.

As Matz pointed out, you can force ruby1.9 to have this behavior simply
by using the ASCII-8BIT encoding rather than the default encoding taken
from your locale. This basically causes the regular expression engine to
look at the string as a series of bytes again, like it used to, rather
than freaking out when it sees a byte sequence it doesn't expect.

I'm by no means knowledgeable about encodings, so take what I'm about to
say with a grain of salt. It seems like the old way of handling
encodings was permissive but imprecise, and the new way is precise but
not always permissive. I like the ability to be precise because before
that ability simply wasn't an option, however, since a lot of people
seem to be confused by the default behavior why not make the default
behavior permissive and set it up so that IF YOU WANT to be precise you
can enable the proper encodings that ensure that behavior? To me this
seems to fall in with the principle of least surprise. (Sorry for
quoting it, I know it's over-quoted).

What do people think?

Regards
Gary


Edward Middleton wrote:


> Bill Kelly wrote:
>>> handling UTF8.
>>>
>>
>> Could you summarize what you feel the key difference of
>> the python 3 / erlang approach is, compared to ruby19 ?
>>
>
> Taking a UTF-8 approach is easier to implement because you enforce all
> strings to be UTF-8 and ignore when this doesn't work. Kind of like
> saying everything will be ASCII or converted to it ;)
>
>> I'm a relative newbie in dealing with character encodings,
>> but I do recall a few lengthy discussions on this list when
>> ruby19's M17N was being developed, where the "UTF-8 only"
>> approaches of some other languages were deemed insufficient
>> for various reasons.
>>
>
> Not everything maps one-to-one to UTF-8.
>
>> However, my understanding is that one is supposed to be
>> able to effectively make ruby behave as a "UTF-8 only"
>> language if one makes sure external data is transcoded to
>> UTF-8 at I/O boundaries.
>>
>
> That is pretty much it. The problem is that a lot of libraries still
> don't handle encodings. This results in some spurious errors when a
> function requiring compatible encoding operates on them[1]. The
> solution is to add support for handling encodings.
>
> Edward
>
> 1. As opposed to ruby 1.8 which would silently ignore actual errors
> caused by the use of incompatible encodings.

--
Posted via http://www.ruby-forum.com/.

Edward Middleton

Dec 29, 2009, 11:05:46 AM
Gary Watson wrote:
> I'm by no means knowledgeable about encodings, so take what I'm about to
> say with a grain of salt. It seems like the old way of handling
> encodings was permissive but imprecise, and the new way is precise but
> not always permissive. I like the ability to be precise because before
> that ability simply wasn't an option, however, since a lot of people
> seem to be confused by the default behavior why not make the default
> behavior permissive and set it up so that IF YOU WANT to be precise you
> can enable the proper encodings that ensure that behavior? To me this
> seems to fall in with the principle of least surprise. (Sorry for
> quoting it, I know it's over-quoted).

I guess the problem is that if you do this no libraries will make an
effort to support encodings and you will lose all the advantages of
proper encoding handling. I have to say, I cringed when the idea of
handling encodings properly came up, because it is different from ruby
1.8 and the transition was going to be difficult. Having said that, if
you are going to support encodings this is probably the best way to do
it, and in reality it is not that hard to get it right.

Edward

Brian Candler

Dec 29, 2009, 1:06:27 PM
Bill Kelly wrote:
>> IMO, both python 3 and erlang have got the right idea when it comes to
>> handling UTF8.
>
> Could you summarize what you feel the key difference of
> the python 3 / erlang approach is, compared to ruby19 ?

As far as I can tell, both have two distinct data structures. One
represents a binary object: a string of bytes. The other represents a
textual string, a string of UTF-8 codepoints. (In the case of erlang,
these are "binaries" and "lists" respectively).

ruby 1.9 has one String which tries to do both jobs. I commonly deal
with binary data: ASN1 encodings, PDFs, JPGs, firmware images, ZIP
files, and so on. And yet ruby 1.9 has it now deeply embedded that all
data is text (which is not clearly true: rather the converse, all text
is data). At best you can get ruby 1.9 to tell you that your data is
"ASCII-8BIT", even when it has nothing to do with ASCII whatsoever.

I really miss having an object which simply represents a "sequence of
bytes". Of course ruby 1.9 can do it, if you jump through the right
hoops.

I really miss being able to look at a simple expression such as
a = b + c
when I know that both b and c are String objects, and being able to say
for definite whether or not it will raise an exception.

> However, my understanding is that one is supposed to be
> able to effectively make ruby behave as a "UTF-8 only"
> language if one makes sure external data is transcoded to
> UTF-8 at I/O boundaries.

If you jump through the right hoops, you can do this. If you omit any of
the hoops, your program may work on some systems but not on others. ruby
1.9's behaviour is environment-sensitive.

But the worst part of all this is that it's totally undocumented. Look
into the 'ri' pages for most of ruby core, for any method which either
takes a string, returns a string, or acts on a string, and you are
unlikely to find any definition of its encoding-related behaviour,
including under what circumstances it may raise an exception.

By tagging every string with its own encoding, ruby 1.9 is solving a
problem that does not exist: that is, how do you write a program which
juggles multiple strings in different encodings all at the same time?

And as the OP has discovered, the built-in support is often incomplete
so that you have to use libraries like Iconv anyway.
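
To illustrate the point about a = b + c, here is a small sketch, assuming a current Ruby, of a concatenation that raises only because of the operands' encoding tags and content. The values are made up for illustration.

```ruby
b = "caf\u00E9"   # UTF-8 text containing a non-ASCII character
c = "\xE9".b      # one raw byte, tagged ASCII-8BIT

# Both operands contain non-ASCII bytes and carry different encodings,
# so String#+ refuses to combine them.
begin
  b + c
rescue Encoding::CompatibilityError => e
  puts "b + c raised #{e.class}"
end

# With an ASCII-only left operand the same call succeeds; the result
# takes the encoding of the operand that has non-ASCII content.
ok = "cafe" + c
puts ok.encoding

# Treating both sides as bytes explicitly also works.
bytes = b.b + c
puts bytes.bytesize
```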

Tony Arcieri

Dec 29, 2009, 3:50:31 PM
[Note: parts of this message were removed to make it a legal post.]

On Tue, Dec 29, 2009 at 11:06 AM, Brian Candler <b.ca...@pobox.com> wrote:

> By tagging every string with its own encoding, ruby 1.9 is solving a
> problem that does not exist: that is, how do you write a program which
> juggles multiple strings in different encodings all at the same time?
>

To play devil's advocate here, Japanese users do routinely have to deal with
multiple different encodings... Shift JIS on Windows/Mac, EUC-JP on *IX, and
ISO-2022-JP for email (if I even got that correct, it's somewhat hard to
keep track). And then on top of all of that there's Unicode in all its
various forms...

While I would personally never choose M17n as the solution for my own
language I can see why it makes sense for a language which originated in and
is popular in Japan. The encoding situation over there is something of a
mess.

--
Tony Arcieri
Medioh! A Kudelski Brand

Brian Candler

Dec 29, 2009, 4:34:37 PM
Tony Arcieri wrote:
> To play devil's advocate here, Japanese users do routinely have to deal
> with
> multiple different encodings... Shift JIS on Windows/Mac, EUC-JP on *IX,
> and
> ISO-2022-JP for email

Sure; and maybe they even want to process these formats without a
round-trip to UTF8. (By the way, ruby 1.9 *can't* handle Shift JIS
natively)

I want a programming language which (a) handles strings of bytes, and
(b) does so with simple, understandable, and predictable semantics: for
example, concat string 1 with string 2 to make string 3. Is that too
much to ask?

Anyway, I'll shut up now.

Benoit Daloze

Dec 30, 2009, 5:38:01 AM
Hi,

I think you're being a little pessimistic here :)

Until my post on this subject, I had never had anything to complain about,
far from it, and I enjoyed playing with ∑, ∆ and so on.

And I was not complaining, just asking how to solve that (the fact that it
didn't handle normalization form C is quite logical I think; no language
would do that easily).

I think having Unicode support is something very useful. Look, for
example (even if it is a bad one), at PHP with its mb_* functions and all
its encoding functions: scary, no? Well, I think how it is at the moment is
quite intuitive, and most of the time doing concatenation is not a problem
at all.

So, overall I think good encoding support is really important, even if
it is not useful every day.

Regards,

B.D.

2009/12/29 Brian Candler <b.ca...@pobox.com>

Marnen Laibow-Koser

Dec 30, 2009, 6:04:42 AM
Benoit Daloze wrote:
> Hi,
>
> I think you're quite a little pessimist here :)
>
> Until my post on this subject, I have never been complaining far from
> that,
> and enjoyed to play with ∑, ∆ and so on.
>
> And I was not complaining, jsut asking how to solve that (The fact it
> didn't
> handle the normalization form C is quite logical I think, no language
> would
> do that easily).

Huh? Normalization transformations should be pretty easy to implement.
(FWIW, the Unicode Consortium recommends KC for identifiers, although
I'm not sure I agree with that recommendation.)

Best,
--
Marnen Laibow-Koser
http://www.marnen.org
mar...@marnen.org

Brian Candler

Dec 30, 2009, 7:20:03 AM
Marnen Laibow-Koser wrote:
> Huh? Normalization transformations should be pretty easy to implement.

But the point is, you can't do anything useful with this until you
*transcode* it anyway, which you can do using Iconv (in either 1.8 or
1.9).

ruby 1.9's big flag feature of being able to store a string in its
original form tagged with the encoding doesn't help the OP much, even if
it had been tagged correctly.

I mean, to a degree ruby 1.9 already supports this UTF-8-MAC as an
'encoding'. For example:

>> decomp = [101, 115, 112, 97, 110, 771, 111, 108, 46, 108, 110, 103].map { |x| x.chr("UTF-8-MAC") }.join
=> "español.lng"
>> decomp.codepoints.to_a
=> [101, 115, 112, 97, 110, 771, 111, 108, 46, 108, 110, 103]
>> decomp.encoding
=> #<Encoding:UTF8-MAC>

Notice that the n-accent is displayed as a single character by the
terminal, even though it is two codepoints (110, 771).
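
To make the difference concrete, here is a small sketch comparing the decomposed string with a precomposed one. It uses \u escapes so that it does not depend on how an editor saves the accent; the variable names are only for illustration:

```ruby
decomposed = "espan\u0303ol.lng"   # "n" followed by U+0303 COMBINING TILDE
composed   = "espa\u00F1ol.lng"    # precomposed U+00F1

puts decomposed.codepoints.to_a.length   # 12
puts composed.codepoints.to_a.length     # 11
puts decomposed == composed              # false
```

Both print the same on most terminals, but they are different strings as far as ruby is concerned.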

So you could argue that Dir[] on the Mac is at fault here, for tagging
the string as UTF-8 when it should be UTF-8-MAC.

But you still need to transcode to UTF-8 before doing anything useful
with this string. Consider a string containing decomposed characters
tagged as UTF-8-MAC:

(1) The regexp /./ should match a series of decomposed codepoints as a
single 'character'; str[n] should fetch the nth 'character'; and so on.
I don't think this would be easy to implement, since finding a character
boundary is no longer a codepoint boundary.

What you actually get is this:

>> decomp.split(//)
=> ["e", "s", "p", "a", "n", "̃", "o", "l", ".", "l", "n", "g"]

Aside: note that "̃ is actually a single character, a double quote with
the accent applied!
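
As a sketch of the other behaviour, and assuming a later Ruby (2.0 or newer) whose regexp engine understands \X, a base character can be matched together with its combining marks:

```ruby
decomposed = "espan\u0303ol"

by_codepoint = decomposed.split(//)    # one element per codepoint
by_cluster   = decomposed.scan(/\X/)   # base character plus combining marks

puts by_codepoint.length               # 8
puts by_cluster.length                 # 7
puts by_cluster[4].codepoints.to_a.inspect   # [110, 771]
```

So /./ still steps by codepoint, and grouping a combining sequence has to be asked for explicitly. Newer versions (2.5 and up) also have String#grapheme_clusters for the same purpose.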

(2) The OP wanted to match the regexp containing a single codepoint /ñ/
against the decomposed representation, which isn't going to work anyway.
That is, ruby 1.9 does not automatically transcode strings so they are
compatible; it just raises an exception if they are not.

>> /ñ/ =~ decomp
Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8
regexp with UTF8-MAC string)
from (irb):5
from /usr/local/bin/irb19:12:in `<main>'
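
For comparison, a sketch of the explicit route, assuming a later Ruby (2.2 or newer) where String#unicode_normalize exists. The string here is tagged plain UTF-8, so there is no exception, just a failed match until it is composed:

```ruby
decomposed = "espan\u0303ol.lng"
pattern    = /\u00F1/   # the single precomposed codepoint

before = pattern =~ decomposed                          # nil
after  = pattern =~ decomposed.unicode_normalize(:nfc)  # 4

puts before.inspect
puts after.inspect
```

Nothing is normalized automatically; the caller has to do it before matching.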

(3) Since ruby 1.9 has a UTF-8-MAC encoding, it *should* be able to
transcode it to UTF-8 without using Iconv. However this is simply
broken, at least in the version I'm trying here.

>> /ñ/ =~ decomp.encode("UTF-8")
=> nil
>> decomp.encode("UTF-8")
=> "espa\xB1\x00ol.lng"
>> decomp.encode("UTF-8").codepoints.to_a
ArgumentError: invalid byte sequence in UTF-8
from (irb):10:in `codepoints'
from (irb):10:in `each'
from (irb):10:in `to_a'
from (irb):10
from /usr/local/bin/irb19:12:in `<main>'
>> RUBY_VERSION
=> "1.9.2"
>> RUBY_PATCHLEVEL
=> -1
>> RUBY_REVISION
=> 24186
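
What the conversion is presumably meant to do can be sketched like this, assuming a release in which the UTF8-MAC converter works (the revision shown above evidently is not one). Note that UTF8-MAC is Apple's variant of decomposition and not exactly the same as NFD:

```ruby
decomposed = "espan\u0303ol.lng"

# Treat the bytes as UTF8-MAC and convert to ordinary (composed) UTF-8.
composed = decomposed.encode("UTF-8", "UTF8-MAC")

puts composed.valid_encoding?
puts composed.codepoints.to_a.inspect
```

With a working converter the result should contain the single codepoint 241 in place of 110, 771.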

(4) If general support for decomposed form were added as further
'Encodings', there would be an explosion of encodings: UTF-8-D,
UTF-16LE-D, UTF-16BE-D etc, and that's ignoring the "compatible" versus
"canonical" composed and decomposed forms.

(5) It is going to be very hard (if not impossible) to make a source
code string or regexp literal containing decomposed "n" and "̃" to be
distinct from a literal containing a composed "ñ". Try it and see.

(In the above paragraph, the decomposed accent is applied to the
double-quote; that is, "̃ is actually a single 'character'). Most
editors are going to display both the composed and decomposed forms
identically.

I think this just shows that ruby 1.9's complexity is not helping in the
slightest. If you have to transcode to UTF-8 composed form, then ruby
1.8 does this just as well (and then you only need to tag the regexp as
UTF-8 using //u)

Marnen Laibow-Koser

Dec 30, 2009, 7:48:59 AM
Brian Candler wrote:
> Marnen Laibow-Koser wrote:
>> Huh? Normalization transformations should be pretty easy to implement.
>
> But the point is, you can't do anything useful with this until you
> *transcode* it anyway, which you can do using Iconv (in either 1.8 or
> 1.9).
>

Wrong. Normalization transformations are useful within one Unicode
encoding.  In fact, they have little use in transcoding as I understand it.

[...]


> Notice that the n-accent is displayed as a single character by the
> terminal, even though it is two codepoints (110,771)

I don't think it's meaningful to say that something is displayed as a
single character. You can't see characters -- they're abstract ideas.
All you can see is the glyphs that represent those characters.

>
> So you could argue that Dir[] on the Mac is at fault here, for tagging
> the string as UTF-8 when it should be UTF-8-MAC.

But you'd be wrong, because UTF-8-MAC is valid UTF-8.

>
> But you still need to transcode to UTF-8 before doing anything useful
> with this string. Consider a string containing decomposed characters
> tagged as UTF-8-MAC:
>
> (1) The regexp /./ should match a series of decomposed codepoints as a
> single 'character'

I am not sure I agree with you.

> str[n] should fetch the nth 'character';

Yes, but a combining sequence is not conceptually a character in many
cases.

> and so on.
> I don't think this would be easy to implement, since finding a character
> boundary is no longer a codepoint boundary.

Sure it is. You are confusing characters and combining sequences.

>
> What you actually get is this:
>
>>> decomp.split(//)
> => ["e", "s", "p", "a", "n", "̃", "o", "l", ".", "l", "n", "g"]
>
> Aside: note that "̃ is actually a single character,

It is nothing of the kind. It is a single combining sequence composed
of two characters. I would expect it to be matched by /../ .

> a double quote with
> the accent applied!

Right.

>
> (2) The OP wanted to match the regexp containing a single codepoint /ñ/
> against the decomposed representation, which isn't going to work anyway.
> That is, ruby 1.9 does not automatically transcode strings so they are
> compatible; it just raises an exception if they are not.

But UTF-8 NFC and UTF-8 NFD *are* compatible -- they're not even really
separate encodings. At this point I strongly suggest that you read the
article (I think it's UAX #15) on Unicode normalization.

>
>>> /ñ/ =~ decomp
> Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8
> regexp with UTF8-MAC string)

If the only difference between UTF-8 and UTF-8-MAC is normalization,
then this is brain-dead.

> from (irb):5
> from /usr/local/bin/irb19:12:in `<main>'
>
> (3) Since ruby 1.9 has a UTF-8-MAC encoding, it *should* be able to
> transcode it to UTF-8 without using Iconv. However this is simply
> broken, at least in the version I'm trying here.
>
>>> /ñ/ =~ decomp.encode("UTF-8")
> => nil
>>> decomp.encode("UTF-8")
> => "espa\xB1\x00ol.lng"
>>> decomp.encode("UTF-8").codepoints.to_a
> ArgumentError: invalid byte sequence in UTF-8
> from (irb):10:in `codepoints'
> from (irb):10:in `each'
> from (irb):10:in `to_a'
> from (irb):10
> from /usr/local/bin/irb19:12:in `<main>'
>>> RUBY_VERSION
> => "1.9.2"
>>> RUBY_PATCHLEVEL
> => -1
>>> RUBY_REVISION
> => 24186

Yikes! That's bad.

>
> (4) If general support for decomposed form were added as further
> 'Encodings', there would be an explosion of encodings: UTF-8-D,
> UTF-16LE-D, UTF-16BE-D etc, and that's ignoring the "compatible" versus
> "canonical" composed and decomposed forms.
>

Right. Different normal forms really aren't separate encodings in the
usual sense.

> (5) It is going to be very hard (if not impossible) to make a source
> code string or regexp literal containing decomposed "n" and "̃" to be
> distinct from a literal containing a composed "ñ". Try it and see.

And that's probably a good thing. In fact, that's the point of
normalization.

>
> (In the above paragraph, the decomposed accent is applied to the
> double-quote; that is, "̃ is actually a single 'character').

Combining sequence.

> Most
> editors are going to display both the composed and decomposed forms
> identically.

And at least in the case of ñ versus n + combining ~, they normalize to
the same thing in all normal forms (precomposed ñ in C and KC; a 2-char
combining sequence in D and KD). Thus, under any normalization, they
are *equivalent* and should be treated as such.
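
A quick sketch of that claim, assuming a later Ruby (2.2 or newer) with String#unicode_normalize; the results hash is only there for illustration:

```ruby
precomposed = "\u00F1"    # ñ as one codepoint
sequence    = "n\u0303"   # n + combining tilde

results = {}
[:nfc, :nfd, :nfkc, :nfkd].each do |form|
  a = precomposed.unicode_normalize(form)
  b = sequence.unicode_normalize(form)
  # Record whether both inputs agree, and how many codepoints result.
  results[form] = [a == b, a.codepoints.to_a.length]
end

results.each { |form, (equal, count)| puts "#{form}: equal=#{equal} codepoints=#{count}" }
# nfc:  equal=true codepoints=1
# nfd:  equal=true codepoints=2
# nfkc: equal=true codepoints=1
# nfkd: equal=true codepoints=2
```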

>
> I think this just shows that ruby 1.9's complexity is not helping in the
> slightest. If you have to transcode to UTF-8 composed form, then ruby
> 1.8 does this just as well (and then you only need to tag the regexp as
> UTF-8 using //u)

Normalization really isn't transcoding in the usual sense.

Best,
--
Marnen Laibow-Koser
http://www.marnen.org
mar...@marnen.org

Marc Heiler

Dec 30, 2009, 12:26:57 PM
If I had one wish, it would be a compile-time option for ruby 1.9 that
keeps the old ruby 1.8 behaviour. Ruby 1.8 simply gave me fewer problems
here.

I am using loads of umlauts like "äöü" in my comments and ruby 1.8 is
totally happy with it. Ruby 1.9, however, hates it and refuses to run the
script, and I don't think I want to add something like
"Encoding: ISO-8859-1" to all my .rb scripts.

I wish there were more than one way to treat encodings, and one of them
should be the ruby 1.8 behaviour, because ruby 1.9 forces me to make all
kinds of changes before my old .rb scripts run again, simply because of
the encoding issue. (There seem to be some other minor changes as well; I
have had problems in case/when code too, but the encoding issue seems
larger.)

This is not really a rant - I am using ruby 1.8.x without any problem,
and I actually LOVE that ruby 1.8.x is not feature frozen. It is also
good that a language can keep evolving.

Personally, however, I don't need UTF or another exotic encoding, so the
encoding add-on is of no real advantage to me and rather a burden, as I
have to modify .rb files. I can see that other people have different
needs, though.

Benoit Daloze

Dec 31, 2009, 6:20:12 AM
First, thanks for your long and detailed answers about UTF8-MAC.

2009/12/30 Marc Heiler <shev...@linuxmail.org>

> If I had one wish, it would be a compile-time option for ruby 1.9 that
> keeps the old ruby 1.8 behaviour. Ruby 1.8 simply gave me fewer problems
> here.
>
> I am using loads of umlauts like "äöü" in my comments and ruby 1.8 is
> totally happy with it. Ruby 1.9, however, hates it and refuses to run the
> script, and I don't think I want to add something like
> "Encoding: ISO-8859-1" to all my .rb scripts.
>

Well, I don't think it's so hard to add
# encoding: ISO-8859-1
to your scripts (how about writing a small Ruby script for that? :p).

I think the magic comment is actually a really good thing. It is not easy to
know a file's encoding if it is not specified somewhere. Think about somebody
working with you on another platform: he will surely run into encoding
problems.

So yes, I think it is quite useful and good for compatibility.
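
The small script suggested above could look something like this. It is only a sketch: add_magic_comment and MAGIC_COMMENT are made-up names, the pattern is a simplified guess at what counts as a magic comment, and it assumes a Ruby with String#b (2.0 or newer):

```ruby
# Simplified pattern for an existing magic comment such as
# "# encoding: utf-8" or "# -*- coding: utf-8 -*-".
MAGIC_COMMENT = /\A#.*coding\s*[:=]\s*[\w.-]+/

# Returns the source with "# encoding: ..." added at the top (after a
# shebang line, if any), or unchanged if the first two lines already
# carry a magic comment. Works on raw bytes so Latin-1 umlauts are safe.
def add_magic_comment(source, encoding = "ISO-8859-1")
  lines = source.b.lines
  return source if lines.first(2).any? { |line| line =~ MAGIC_COMMENT }

  comment = "# encoding: #{encoding}\n".b
  if lines.first && lines.first.start_with?("#!")
    lines.insert(1, comment)   # keep the shebang on line 1
  else
    lines.unshift(comment)
  end
  lines.join
end

puts add_magic_comment("#!/usr/bin/env ruby\nputs 'hi'\n")
```

To apply it to real files, one would read and write them in binary mode (for example with File.binread and File.binwrite) so that nothing gets transcoded on the way, and keep a backup first.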
