If you need to work further with it, UTF-8 offers an easier migration
than UTF-16.
UTF-16 uses at least two bytes per character ( four for anything
outside the Basic Multilingual Plane ). That means ASCII and Latin-1
text gets expanded to two bytes per character, one of which is always
0x00 ( end of string is two consecutive 0x00 bytes ). That additional
NUL byte in each char makes a lot of tools interpret the string as
binary data, not a character string.
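A quick way to see those 0x00 bytes is to dump the bytes of a short
ASCII string in both encodings. This is only a sketch: it assumes Ruby
1.9 or later ( where String#encode is built in ), and byte_list is a
helper name made up for the example.

```ruby
# Assumes Ruby 1.9 or later, where String#encode is built in.
# byte_list is a hypothetical helper, not part of any library.
def byte_list(str, encoding)
  str.encode(encoding).bytes.to_a
end

# UTF-16LE is little-endian UTF-16 without a byte-order mark.
p byte_list("abc", "UTF-16LE")  # every second byte is 0
p byte_list("abc", "UTF-8")     # one byte per character, no zero bytes
```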
I usually use UTF-8 in C++ and am currently using it in PHP. Not sure
how well it works in Ruby. The command line environments on my Mac and
Linux work well with UTF-8.
Everything that fits in 7-bit ASCII is represented as a single byte in
UTF-8. In a C-style string, end-of-string is always a single 0x00 byte.
So any string in 7-bit ASCII will have identical representation in UTF-8.
Even when UTF-8 needs to get fancy it's more backward compatible than
UTF-16. Characters > 0x7F are represented as two or more non-null
bytes, and 0x00 only appears as end-of-string.
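Both points can be checked by printing the UTF-8 bytes in hex. Again a
sketch assuming Ruby 1.9 or later; utf8_hex is a made-up helper name,
and "\u00D6" is the escape for the first letter of "Öll".

```ruby
# Assumes Ruby 1.9 or later. utf8_hex is a hypothetical helper.
def utf8_hex(str)
  str.encode("UTF-8").bytes.map { |b| format("0x%02X", b) }
end

# Pure ASCII text: the UTF-8 bytes are exactly the ASCII bytes.
p utf8_hex("Rob")
p "Rob".encode("UTF-8").bytes.to_a == "Rob".encode("US-ASCII").bytes.to_a

# A non-ASCII first character: it takes two bytes, both above 0x7F,
# so neither can be 0x00 or be mistaken for an ASCII character.
p utf8_hex("\u00D6ll")
```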
To get those multi-byte chars into a URL you can %-escape each byte (
see 1st sentence in this note ).
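For what it's worth, escaping each byte could look roughly like the
sketch below. percent_escape is a helper name I made up; it leaves only
the URL "unreserved" characters alone and escapes every other byte. The
last line calls ERB::Util.url_encode from the standard library for
comparison. Assumes Ruby 1.9 or later.

```ruby
require "erb"

# Hypothetical helper: percent-escape every byte that is not a letter,
# a digit, or one of "-", ".", "_", "~".
def percent_escape(str)
  str.bytes.map { |b|
    ch = b.chr
    ch =~ /\A[A-Za-z0-9\-._~]\z/ ? ch : format("%%%02X", b)
  }.join
end

name = "\u00D6ll"                # UTF-8 string, 4 bytes
puts percent_escape(name)        # the two bytes of the first letter get escaped
puts ERB::Util.url_encode(name)  # standard-library equivalent
```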
UTF-8 strings that contain only US 7-bit ASCII will always display
properly ( much harder with UTF-16 ). If your environment is Mac then
UTF-8 strings should display nicely with most tools. Brain dead tools
will at least recognize the sequence as a string, even if they can't
display it properly.
You can use iconv to convert your input encoding to UTF-8.
So...
- Find out what charset encoding you're being passed
- Invoke something like Iconv.conv( 'UTF-8', 'ISO-8859-1', str )
  ( target charset first, then source charset )
For a full listing of what charsets iconv supports on your system run
iconv --list at the command line.
http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/index.html
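Note that Iconv was dropped from Ruby's standard library in 2.0. On
Ruby 1.9 and later the same two steps can be done with the built-in
String#encode. A minimal sketch, assuming the input really is
ISO-8859-1; to_utf8 is a helper name made up for the example.

```ruby
# Assumes Ruby 1.9 or later. to_utf8 is a hypothetical helper.
def to_utf8(raw, source_charset)
  # Label the raw bytes with the charset they arrived in, then transcode.
  raw.dup.force_encoding(source_charset).encode("UTF-8")
end

latin1 = "\xD6ll"      # ISO-8859-1 bytes: the first letter is the single byte 0xD6
utf8 = to_utf8(latin1, "ISO-8859-1")

p utf8.bytes.to_a      # the 0xD6 byte became the pair 0xC3, 0x96
p utf8.valid_encoding?
```

If the source charset is guessed wrong the result is still valid UTF-8
but shows the wrong characters, so step one ( finding out the real
charset ) matters more than the conversion call itself.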
But again, if URLs are all you need, and the app at the other end
knows how to handle the encoding, why not just url-encode the string and
avoid the conversion?
Sincerely,
Rob 'Öll er innri maður' Mela
puts "Öll".length
prints "4" rather than 3, because length counts bytes here and "Ö"
takes two bytes in UTF-8.
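On Ruby 1.9 and later ( an assumption; the output above matches the
1.8 behaviour ) strings know their encoding, so the byte count and the
character count are available separately. counts is a made-up helper
name.

```ruby
# Assumes Ruby 1.9 or later. counts is a hypothetical helper.
def counts(str)
  [str.bytesize, str.length]
end

bytes, chars = counts("\u00D6ll")
puts bytes  # number of bytes in the UTF-8 representation
puts chars  # number of characters
```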
There's an add-on UTF-8 library that might be useful (
http://www.rubyinside.com/ruby-gets-a-new-and-good-utf-8-library-157.html )
require 'encoding/character/utf-8'
str = u"hëllö"
str.length
#=> 5
str.reverse.length
#=> 5
str[/ël/]
#=> "ël"