Multibyte Chars in Ruby

Tim Morgan

Dec 3, 2009, 3:50:18 PM
to ok...@googlegroups.com
OK, someone, anyone (James?), can you tell me how to convert a file with stuff like this in it:

Voc\xC3\xAA tem certeza?

...into this:

Você tem certeza?

I don't get how to convert those multibyte escape sequences into a real, actual character shown in my text editor.

Any help would be appreciated. Thanks.

-Tim

Tim Morgan

Dec 3, 2009, 4:01:41 PM
to ok...@googlegroups.com
I should probably provide more of the picture...

I have a Rails locale file, in Portuguese, with all these fancy little characters in it. In my zeal for organizing this stuff a little more, I wrote a Ruby app to load in the YAML, sort it, and write it back out again. Unfortunately, in the process, the Ruby YAML library so kindly escaped all the fancy characters. But I want them back.

So, basically when I say "multibyte escape sequences", I mean the file literally contains backslash (\) followed by some numbers and/or letters.

Do I need the Ruby 1.8 -Ku switch, or do I need to fire up Ruby 1.9, or what?

Tim Morgan

Dec 3, 2009, 4:22:40 PM
to ok...@googlegroups.com
I came up with this: http://gist.github.com/248536

Maybe not the right way, but seems to have worked.
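
The gist isn't reproduced here. As a rough sketch of one way to do this kind of conversion (not necessarily what the gist does), assuming Ruby 2.0 or later and a hypothetical helper named unescape_hex:

```ruby
# Hypothetical helper: turn literal "\xNN" escapes (a backslash, an x and two
# hex digits) into the raw bytes they name, then label the result as UTF-8.
def unescape_hex(text)
  # Work on a binary copy so half-built multibyte characters never trip
  # an encoding compatibility check while the bytes are being assembled.
  bytes = text.b.gsub(/\\x(\h\h)/) { [$1].pack("H2") }
  bytes.force_encoding("UTF-8")
end

# Single quotes keep the backslashes literal, just as they sit in the file.
escaped = 'Voc\xC3\xAA tem certeza?'
puts unescape_hex(escaped)
```

This only handles the \xNN form shown in the thread. Other YAML escapes (\n, \", \\ and so on) are left untouched, so it is not a general YAML unescaper.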

Let me know if you have a better solution.

Thanks.

James Edward Gray II

Dec 3, 2009, 4:47:04 PM
to ok...@googlegroups.com
On Dec 3, 2009, at 2:50 PM, Tim Morgan wrote:

> OK, someone, anyone (James?),

How long should I wait before offering an answer? ;)

> can you tell me how to convert a file with stuff like this in it:
>
> Voc\xC3\xAA tem certeza?
>
> ...into this:
>
> Você tem certeza?

I wrote quite a bit about this once:

http://blog.grayproductions.net/articles/understanding_m17n

I've been told that's boring though, so I'll try to give you a shorter version.

> I don't get how to convert those multibyte escape sequences into a real, actual character shown in my text editor.
>
> Any help would be appreciated. Thanks.

The data you showed is encoded as UTF-8. I figured that out by dumping it to a file, and just printing it out in my terminal. I have my terminal set to use UTF-8, so the fact that I saw the proper character pretty much told me what it was.

I also opened the file in TextMate though and peeked inside the File → Re-Open With Encoding menu to see what was checked. Again, UTF-8.
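
The same kind of check can be sketched in code. This is a minimal example, assuming Ruby 2.0 or later (where strings carry an encoding), and it is not what was actually run for this thread. It only shows that the bytes are consistent with UTF-8, which is strong evidence rather than proof:

```ruby
# The raw bytes from the example, with no encoding attached yet.
raw = "Voc\xC3\xAA tem certeza?".b

# Label the bytes as UTF-8 and ask Ruby whether they form valid UTF-8.
as_utf8 = raw.dup.force_encoding("UTF-8")
puts as_utf8.valid_encoding?

# Two bytes collapse into one character once the encoding is known.
puts "#{as_utf8.bytesize} bytes, #{as_utf8.length} characters"

# A lone lead byte is not valid UTF-8, so the same check rejects it.
truncated = "Voc\xC3".b.force_encoding("UTF-8")
puts truncated.valid_encoding?
```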

Knowing that, putting Ruby in UTF-8 mode when you originally worked with the data may have been all that was needed. You could do that by using the -KU switch, or setting $KCODE = "U". Modern versions of Rails do this for you, if you load that environment.

I can probably be more specific if you show me the script that got you into trouble in the first place… :)

James Edward Gray II

Tim Morgan

Dec 3, 2009, 4:54:29 PM
to ok...@googlegroups.com
On Thu, Dec 3, 2009 at 3:47 PM, James Edward Gray II
<ja...@graysoftinc.com> wrote:
>
> How long should I wait before offering an answer?  ;)

Well, at least half an hour apparently, as that is how long it takes
me to get impatient and come up with some half-baked solution myself.
:-)

> I wrote quite a bit about this once:
>
> http://blog.grayproductions.net/articles/understanding_m17n
>
> I've been told that's boring though, so I'll try to give you a shorter version.

James, rest assured that I consumed every bit of your series on Ruby
character encodings, with great enthusiasm I might add. I even
understood some of it.

> > I don't get how to convert those multibyte escape sequences into a real, actual character shown in my text editor.
> >
> > Any help would be appreciated. Thanks.
>
> The data you showed is encoded as UTF-8.  I figured that out by dumping it to a file, and just printing it out in my terminal.  I have my terminal set to use UTF-8, so the fact that I saw the proper character pretty much told me what it was.

This shows my complete misunderstanding of the topic I believe. I'm
ashamed of it, for sure.

The file in question actually, physically, literally had backslash
followed by numbers for every multibyte character. They weren't
escaped in my *view* -- they were literally escaped in the yaml
document. This is where my trouble came in.

>
> I also opened the file in TextMate though and peeked inside the File → Re-Open With Encoding menu to see what was checked.  Again, UTF-8.
>
> Knowing that, putting Ruby in UTF-8 mode when you originally worked with the data may have been all that was needed.  You could do that by using the -KU switch, or setting $KCODE = "U".  Modern versions of Rails do this for you, if you load that environment.
>
> I can probably be more specific if you show me the script that got you into trouble in the first place…  :)

Here's the file:
http://github.com/seven1m/onebody/blob/master/config/locales/pt.yml

Here's what the file used to look like, before I screwed it up:
http://github.com/seven1m/onebody/commit/42a72259c849fb33119f2391d2b01bc8d0de1e2b#diff-7

Again, I don't pretend to understand this stuff, but I did
sledgehammer my way to the solution I believe. Thanks for your help!

-Tim

James Edward Gray II

Dec 3, 2009, 5:35:40 PM
to ok...@googlegroups.com
On Dec 3, 2009, at 3:54 PM, Tim Morgan wrote:

> On Thu, Dec 3, 2009 at 3:47 PM, James Edward Gray II
> <ja...@graysoftinc.com> wrote:
>>
>> I wrote quite a bit about this once:
>>
>> http://blog.grayproductions.net/articles/understanding_m17n
>>
>> I've been told that's boring though, so I'll try to give you a shorter version.
>
> James, rest assured that I consumed every bit of your series on Ruby
> character encodings, with great enthusiasm I might add. I even
> understood some of it.

Yeah, it's kind of dense, for sure.

> The file in question actually, physically, literally had backslash
> followed by numbers for every multibyte character. They weren't
> escaped in my *view* -- they were literally escaped in the yaml
> document. This is where my trouble came in.

That's just how YAML safely encoded the data.
We can read that just fine. Watch:

$ curl -O http://github.com/seven1m/onebody/raw/master/config/locales/pt.yml
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10789  100 10789    0     0  30782      0 --:--:-- --:--:-- --:--:-- 47528
$ head -n 3 pt.yml
pt:

  are_you_sure: "Voc\xC3\xAA tem certeza?"
$ ruby -KU -r yaml -e 'p YAML.load(ARGF)["pt"]["are_you_sure"]' pt.yml
"Você tem certeza?"

The escapes are YAML's way of saving the data correctly, but it reads it back just fine as the character data it is. No transform required. YAML has to escape encoded data so weird encodings don't mess with its ability to find YAML-specific bytes, like : or ".

Does that help?

James Edward Gray II

Tim Morgan

Dec 3, 2009, 5:43:07 PM
to ok...@googlegroups.com
That helps. And that's fine and dandy that the YAML lib escapes its
strings. No problemo.

But I want anyone to be able to write these translation files, and
since the yaml lib seems to work fine reading the normal, non-escaped
chars, I wanted to put the file back the way I found it (along with,
of course, the modifications and additions I had made to the file in
the meantime).

As an aside, the ARGF thing is new and magical to me. Off to Google I go...

James Edward Gray II

Dec 3, 2009, 5:57:04 PM
to ok...@googlegroups.com
On Dec 3, 2009, at 4:43 PM, Tim Morgan wrote:

> That helps. And that's fine and dandy that the YAML lib escapes its
> strings. No problemo.
>
> But I want anyone to be able to write these translation files, and
> since the yaml lib seems to work fine reading the normal, non-escaped
> chars, I wanted to put the file back the way I found it (along with,
> of course, the modifications and additions I had made to the file in
> the meantime).

Gotcha. I assume that's possible in this case because the data is UTF-8. I think YAML expects to be UTF-8 encoded, so that works.

Here's the relevant part of the spec:

"2.2.2 Encoding

A YAML processor is required to support both UTF-16 and UTF-8 character encodings. If an input stream begins with a byte order mark, then the initial character encoding shall be UTF-16. Otherwise, the initial encoding shall be UTF-8."

Ruby's emitter is probably just being overcautious.
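
For comparison, here is a minimal sketch of the load, sort, and write-back round trip on a newer Ruby. It assumes Ruby 2.1 or later, where Psych is the default YAML library (the thread itself is about the older Ruby 1.8 emitter), and deep_sort is a hypothetical helper, not part of any library. In that setup UTF-8 strings should be written out as readable characters rather than escapes:

```ruby
require "yaml"

# Hypothetical helper: copy a nested hash with its keys sorted at every level.
def deep_sort(value)
  return value unless value.is_a?(Hash)
  value.sort_by { |key, _| key.to_s }
       .map { |key, val| [key, deep_sort(val)] }
       .to_h
end

# Stand-in for the parsed locale file; the "welcome" entry is made up.
locale = {
  "pt" => {
    "welcome"      => "Bem-vindo",
    "are_you_sure" => "Você tem certeza?"
  }
}

yaml = YAML.dump(deep_sort(locale))
puts yaml
```

Reading the result back with YAML.load gives the same strings, so translators can keep editing the file by hand.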

> As an aside, the ARGF thing is new and magical to me. Off to Google I go...

ARGF is an IO-like object that reads the concatenated contents of all the files named in ARGV (or standard input if ARGV is empty). Handy stuff indeed.
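
A small sketch of that behavior. The temporary files and the ARGV.replace call are only there to make the example self-contained; normally the file names would come from the command line:

```ruby
require "tempfile"

# Write two small files so the example has something to read.
files = ["first\n", "second\n"].map do |text|
  file = Tempfile.new("argf_demo")
  file.write(text)
  file.close
  file
end

# ARGF reads the files named in ARGV, one after another, as a single stream.
ARGV.replace(files.map(&:path))
combined = ARGF.read
puts combined
```

This is why YAML.load(ARGF) in the one-liner above parses whatever file is named on the command line.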

James Edward Gray II