Unicode in Ruby - Converting to UTF 16


lewdsilver

Jun 2, 2008, 2:04:02 PM
to Boston Ruby Group
Hola,
I am messing around with trying to convert a ruby string to UTF-16.

I need to do this in one of my RoR apps, b/c I have integrated with a
3rd party app that converts audio voice recordings to text.

I take the text and send via an HTTP request to an SMSC which can take
the text and deliver it via SMS to mobile phones.

This was working great until I had to start supporting French and the
French character set.

Now I need to take the text I get and convert it to UTF-16 so that it
can be delivered properly.

I have just discovered 1st hand that working with Unicode is not so
easy in Ruby.

I am able to play with a sample web-based converter from the
SMSC (www.clickatell.com), so I can compare what my converted strings
should look like.

When I convert this text:
"Bonjour Bertrand, c'est Alex. Je t'appelle a propos de la journée
VocalExpo du 17 juin qui va être reportée a cause de grèves. Ce jour
la les transports en commun seront pratiquement tous en grève donc
plutôt que de prendre des risques on a repoussée la date. A bientôt.
Au revoir."


I get this converted text:
0042006F006E006A006F007500720020004200650072007400720061006E0064002C00200063002700650073007400200041006C00650078002E0020004A00650020007400270061007000700065006C006C006500200061002000700072006F0070006F00730020006400650020006C00610020006A006F00750072006E00E9006500200056006F00630061006C004500780070006F0020006400750020003100370020006A00750069006E0020007100750069002000760061002000EA0074007200650020007200650070006F0072007400E900650020006100200063006100750073006500200064006500200067007200E8007600650073002E0020004300650020006A006F007500720020006C00610020006C006500730020007400720061006E00730070006F00720074007300200065006E00200063006F006D006D0075000D000A006E0020007300650072006F006E0074002000700072006100740069007100750065006D0065006E007400200074006F0075007300200065006E00200067007200E80076006500200064006F006E006300200070006C0075007400F4007400200071007500650020006400650020007000720065006E0064007200650020006400650073002000720069007300710075006500730020006F006E002000610020007200650070006F00750073007300E900650020006C006100200064006100740065002E002000410020006200690065006E007400F40074002E0020004100750020007200650076006F00690072002E0020


And this "converted text" is what I am supposed to put into my URL for
HTTP Request.
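
One way to sanity-check a dump like the one above is to decode it back into readable text. The sketch below assumes Ruby 1.9 or later (where strings carry an encoding) and assumes the dump is UTF-16BE code units written as hex, which matches the sample since it starts with 0042 for "B". decode_utf16_hex is a made-up helper name, not something from Clickatell or a library.

```ruby
# Hypothetical helper: turn a hex dump of UTF-16BE code units back into
# a UTF-8 Ruby string so it can be compared with the original text.
def decode_utf16_hex(hex)
  [hex].pack('H*')                 # hex digits -> raw bytes
       .force_encoding('UTF-16BE') # label the bytes as big-endian UTF-16
       .encode('UTF-8')            # transcode to UTF-8 for display
end

# Two fragments copied from the sample dump in the post.
puts decode_utf16_hex('0042006F006E006A006F00750072')  # Bonjour
puts decode_utf16_hex('006A006F00750072006E00E90065')  # journée
```

If the decoded text matches what went in, the hex string is at least internally consistent before it goes anywhere near the URL.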

Has anyone had to play with converted Ruby strings to UTF-16 in a
similar fashion?
I am finding a lot of random info on forums and other sites, but
nothing super concrete yet.

Thanks



Robert Mela

Jun 7, 2008, 10:10:48 AM
to boston-r...@googlegroups.com
If this is just for building a URL, why not just URL-encode the
string? Western European languages are represented in single-byte
character sets, and byte values 128 thru 255 will be encoded as %80 thru
%FF.
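
A rough sketch of that byte-by-byte escaping, assuming Ruby 1.9 or later. percent_escape_bytes and UNRESERVED are names invented for this illustration; the set of characters left unescaped follows the "unreserved" set from RFC 3986, which may not be exactly what a given SMS gateway expects.

```ruby
# Bytes that may appear unescaped in a URL component (RFC 3986 "unreserved").
UNRESERVED = [*'A'..'Z', *'a'..'z', *'0'..'9', '-', '_', '.', '~'].join.bytes.freeze

# Hypothetical helper: escapes byte by byte, so the output depends
# entirely on which encoding the string's bytes are in.
def percent_escape_bytes(str)
  str.bytes.map { |b| UNRESERVED.include?(b) ? b.chr : format('%%%02X', b) }.join
end

latin1 = "gr\xE8ve".dup.force_encoding('ISO-8859-1')  # "grève" as ISO-8859-1 bytes
puts percent_escape_bytes(latin1)  # gr%E8ve
```

The receiving end has to know the bytes are ISO-8859-1; the escaped URL itself carries no charset information.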

If you need to work further with it, UTF-8 offers an easier migration
than UTF-16.

UTF-16 uses two bytes per character for everything in the Basic
Multilingual Plane (rarer characters take four). That means charsets for
Western languages get expanded to two bytes, one of which is always 0x00
for ASCII and Latin-1 characters ( end of string is two consecutive 0x00
bytes ). That additional NULL byte in each char makes a lot of tools
interpret the string as binary data, not a character string.
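
To see those zero bytes, here is a small sketch assuming Ruby 1.9 or later and big-endian UTF-16 without a byte order mark (an assumption; little-endian would simply swap each pair).

```ruby
ascii = 'Au revoir'
utf16 = ascii.encode('UTF-16BE')

p ascii.bytesize         # 9
p utf16.bytesize         # 18
p utf16.bytes.first(4)   # [0, 65, 0, 117]  ("A" and "u", each with a leading 0x00)
p utf16.bytes.count(0)   # 9  (one zero byte per ASCII character)
```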

I usually use UTF-8 in C++ and am currently using it in PHP. Not sure
how well it works in Ruby. The command line environments on my Mac and
Linux work well with UTF-8.

Everything that fits in 7-bit ASCII is represented as a single byte in
UTF-8. In C-style strings, end-of-string is still a single 0x00 byte. So
any string in 7-bit ASCII has an identical representation in UTF-8.

Even when UTF-8 needs to get fancy it's more backward compatible than
UTF-16. Characters > 0x7F are represented as two or more non-null
bytes, and 0x00 only appears as end-of-string.

To get those multi-byte chars into a URL you can %-escape each byte (
see 1st sentence in this note ).
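
A quick sketch of the UTF-8 properties described above, assuming Ruby 1.9 or later. URI.encode_www_form_component is from Ruby's standard uri library; note that it is meant for form data, so it would turn a space into "+" rather than "%20".

```ruby
require 'uri'

word = "gr\u00E8ve"   # "grève", UTF-8 encoded

p word.bytes                # [103, 114, 195, 168, 118, 101]  ("è" becomes two bytes)
p word.bytes.include?(0)    # false, no NULL bytes anywhere
p 'Alex'.encode('US-ASCII').bytes == 'Alex'.encode('UTF-8').bytes  # true

puts URI.encode_www_form_component(word)  # gr%C3%A8ve
```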

UTF-8 strings of the US 7-bit ASCII will always display properly ( much
harder with UTF-16 ). If your environment is Mac then UTF-8 strings
should display nicely with most tools. Brain dead tools will at least
recognize the sequence as a string, even if they can't display it properly.

You can use iconv to convert your input encoding to UTF-8.

So...

- Find out what charset encoding you're being passed
- Invoke sth like Iconv.conv( 'UTF-8', 'ISO-8859-1', str ) ( target encoding first, then source )

For a full listing of what charsets iconv supports on your system run
iconv --list at the command line.
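
For readers on Ruby 1.9 or later, where Iconv is deprecated (and gone from the standard library as of 2.0), the two steps above might be sketched with String#encode instead. This is a swapped-in technique, not the Iconv call itself. to_utf8 is a hypothetical helper, and the "valid UTF-8, else assume ISO-8859-1" rule is only a heuristic: it cannot tell ISO-8859-1 apart from other single-byte charsets.

```ruby
# Hypothetical helper: guess the input charset and return UTF-8.
# Assumption: input is either already UTF-8 or in the fallback charset.
def to_utf8(raw, fallback = 'ISO-8859-1')
  candidate = raw.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?  # bytes already form valid UTF-8

  raw.dup.force_encoding(fallback).encode('UTF-8')
end

latin1_bytes = "gr\xE8ve".b   # "grève" as raw ISO-8859-1 bytes
puts to_utf8(latin1_bytes)    # grève
```

If the upstream system can simply tell you its charset (an HTTP Content-Type header, for instance), that beats guessing.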

http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/index.html

But again, if URLs are all you need, and the app at the other end
knows how to handle the encoding, why not just URL-encode the string and
avoid the conversion?

Sincerely,

Rob 'Öll er innri maður' Mela


Robert Mela

Jun 7, 2008, 10:21:57 AM
to boston-r...@googlegroups.com
While Ruby can handle UTF-8 without choking, String#length counts bytes
rather than characters ( as of Ruby 1.8 ):

puts "Öll".length

prints "4" rather than 3.
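
For comparison, a small sketch of how this looks on Ruby 1.9 and later (an assumption about the reader's Ruby version, not the 1.8 behaviour described above), where strings know their encoding and the byte count moves to String#bytesize:

```ruby
name = "\u00D6ll"   # "Öll", UTF-8 encoded

p name.length     # 3  (characters)
p name.bytesize   # 4  ("Ö" takes two bytes in UTF-8)
puts name.reverse # llÖ
```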

There's an add-on UTF-8 library that might be useful (
http://www.rubyinside.com/ruby-gets-a-new-and-good-utf-8-library-157.html )

require 'encoding/character/utf-8'
str = u"hëllö"
str.length
#=> 5
str.reverse.length
#=> 5
str[/ël/]
#=> "ël"


lewdsilver

Jun 9, 2008, 8:22:32 AM
to Boston Ruby Group
Hi,
Thanks for the feedback.

To answer the 1st question - we are using UTF16 encoded strings b/c
this seems to work best with the clickatell API.
We tried URL encoded strings, but did not have as good results when
sending an array of accented characters (éçñ, etc.)

It turned out that we were receiving ISO-8859-1 text from the remote
system which was feeding us the transcribed voicemails.

When I tried to convert this string to a UTF-16 string I would get
incorrect character encodings, b/c Rails was expecting UTF-8.

I created a lot of test cases, test frameworks, and even hardcoded
strings into the controller to conduct tests.
All of these passed with flying colors, but each time I interfaced
with the remote server via HTTP - I would get bad string encoding
results.

Thus, running the following code first was critical to being able to
properly encode strings in any format(URL encoded, UTF-16, etc).

require 'iconv'
txt = data_from_remote_server()
txt = Iconv.conv('utf-8', 'ISO-8859-1', txt)  # keep the converted string


Afterwards, creating a UTF16 string was done by this line of code:

require 'kconv'
converted_text = Kconv.kconv(txt, NKF::UTF16, NKF::UTF8).unpack('H*')[0]
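
On Ruby 1.9 and later the same two steps can probably be done with String#encode alone. The sketch below is a swapped-in technique, not the Iconv/Kconv code from this post. It assumes the input really is ISO-8859-1 and that the gateway wants big-endian UTF-16 with no byte order mark, which is what the sample dump at the top of the thread looks like. utf16_hex_from_latin1 is a made-up name.

```ruby
# Hypothetical helper: ISO-8859-1 bytes in, uppercase UTF-16BE hex out.
def utf16_hex_from_latin1(raw)
  utf8 = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')  # step 1: normalise to UTF-8
  utf8.encode('UTF-16BE').unpack('H*').first.upcase            # step 2: UTF-16BE, then hex
end

raw = "journ\xE9e".b   # "journée" as raw ISO-8859-1 bytes
puts utf16_hex_from_latin1(raw)   # 006A006F00750072006E00E90065
```

The output matches the "journée" fragment of the sample dump, which is a reasonable smoke test before sending anything to the gateway.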


At first I was a bit put out b/c my first instinct was to want to look
at bytes in memory like I do with C/C++ and work with them in that
fashion, but Ruby does have some nice libs for managing these
conversions (i.e. easy 1-liners).

thanks for the feedback!
-c

