All this talk of Unicode handling has brought up a problem I run into every so often.
Is there any way to use the Iconv library to lossily convert between partially incompatible encodings? In other words, if, for example, I've got a UTF-8 string that I need to convert down to 7-bit ASCII, and I don't especially care what happens to the extended characters (short of a single character being mapped to a single character - ideally one I can specify), is there any way of forcing the recode?
I realise I could use the failure message to figure out which characters in the input string are incompatible and blank them out case-by-case in a loop, but that seems awfully wasteful.
On 27/06/06, Alex Young <a...@blackkettle.org> wrote:
> Is there any way to use the Iconv library to lossily convert between partially incompatible encodings? In other words, if, for example, I've got a UTF-8 string that I need to convert down to 7-bit ASCII, and I don't especially care what happens to the extended characters (short of a single character being mapped to a single character - ideally one I can specify), is there any way of forcing the recode?

Yes, there is. Add //IGNORE to the destination encoding to ignore unavailable characters, or //TRANSLIT to transliterate them into combinations of ASCII characters (e.g. `e for è).
//TRANSLIT will raise an exception on characters it can't transliterate, however; this can be solved by using '//IGNORE//TRANSLIT' together (in that order).
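For readers on a Ruby without Iconv (it was dropped from the standard library in Ruby 2.0), the "one character in, one chosen character out" behaviour asked for above can be sketched with the built-in String#encode instead. This is a swapped-in technique rather than the Iconv call discussed in this thread, and the helper name to_ascii_lossy is made up for illustration:

```ruby
# Hypothetical helper (not part of Iconv): convert a string to US-ASCII,
# substituting one caller-chosen character for every character that has
# no ASCII equivalent.
def to_ascii_lossy(str, placeholder = "?")
  str.encode("US-ASCII", invalid: :replace, undef: :replace, replace: placeholder)
end

puts to_ascii_lossy("caff\u00E8")       # prints "caff?"
puts to_ascii_lossy("caff\u00E8", "_")  # prints "caff_"
puts to_ascii_lossy("caff\u00E8", "")   # prints "caff" (drops the character)
```

Passing an empty placeholder behaves roughly like //IGNORE; there is no direct String#encode counterpart to //TRANSLIT.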
Paul Battley wrote:
> On 27/06/06, Alex Young <a...@blackkettle.org> wrote:
>> Is there any way to use the Iconv library to lossily convert between partially incompatible encodings? In other words, if, for example, I've got a UTF-8 string that I need to convert down to 7-bit ASCII, and I don't especially care what happens to the extended characters (short of a single character being mapped to a single character - ideally one I can specify), is there any way of forcing the recode?
> Yes, there is. Add //IGNORE to the destination encoding to ignore unavailable characters, or //TRANSLIT to transliterate them into combinations of ASCII characters (e.g. `e for è).
It works by decomposing characters and rejecting everything above position 127, which includes all the now-separated accents.
This does require the slightly-flaky unicode library (work is underway to update it). You can get that from http://www.yoshidam.net/Ruby.html or via gems.
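A minimal sketch of the decompose-and-strip idea described above, assuming a Ruby new enough (2.2 or later) to have the built-in String#unicode_normalize, which is used here in place of the third-party unicode library. The helper name strip_accents is hypothetical:

```ruby
# Sketch: decompose characters (NFKD), then drop everything above 127.
# Uses Ruby's built-in String#unicode_normalize rather than the
# third-party unicode library mentioned in the thread.
def strip_accents(str)
  decomposed = str.unicode_normalize(:nfkd)  # "e with grave" becomes "e" + U+0300
  decomposed.gsub(/[^\x00-\x7F]/, "")         # remove the now-separate accents
end

puts strip_accents("caff\u00E8")                # prints "caffe"
puts strip_accents("\u00E9\u00E8\u00EA\u00EB")  # prints "eeee"
```

Characters that have no decomposition into an ASCII base letter (for example U+00DF, the German sharp s) are simply removed by this approach.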
Paul Battley wrote:
> On 27/06/06, Alex Young <a...@blackkettle.org> wrote:
>> Ooh, that's nice. Thanks for that. I guess it's wishful thinking to hope for:
> It works by decomposing characters and rejecting everything above position 127, which includes all the now-separated accents.
> This does require the slightly-flaky unicode library (work is underway to update it). You can get that from http://www.yoshidam.net/Ruby.html or via gems.

Paul Battley wrote:
> Yes, there is. Add //IGNORE to the destination encoding to ignore unavailable characters, or //TRANSLIT to transliterate them into combinations of ASCII characters (e.g. `e for è).
> //TRANSLIT will raise an exception on characters it can't transliterate, however; this can be solved by using '//IGNORE//TRANSLIT' together (in that order).
Can anyone else get this to work? Instead of "caff`e" I just get "caff?"
Paul Battley wrote:
> On 10/07/06, Daniel DeLorme <dan...@dan42.com> wrote:
>> Can anyone else get this to work? Instead of "caff`e" I just get "caff?"

I also have a problem with iconv. I'm on Linux (configured with UTF-8, as usual), and in irb I get:
irb(main):016:0> Iconv.conv("US-ASCII//TRANSLIT","UTF-8",'éèêë')
=> "eeee"

But when I try the same in ruby or mod_ruby I get '????', for example:
$ ruby -e "require 'iconv'; puts Iconv.conv('US-ASCII//TRANSLIT','UTF-8','éèêë')"
????
I already checked with str.each_byte {|x| puts x} and the strings are exactly the same. Does anyone have any idea why I get two different answers from Iconv?
> I have also a problem with iconv. I'm under linux (configured with utf-8 as usual) and under irb I get:
> irb(main):016:0> Iconv.conv("US-ASCII//TRANSLIT","UTF-8",'éèêë')
> => "eeee"
> But when I try the same in ruby or mod_ruby I get '????', for example:
> $ ruby -e "require 'iconv'; puts Iconv.conv('US-ASCII//TRANSLIT','UTF-8','éèêë')"
> ????
> I already checked with str.each_byte {|x| puts x} and the strings are exactly the same. Does anyone have any idea why I get two different answers from Iconv?
And finally, the weirdest part: irb doesn't work either if I pipe the input into it:
$ echo "require 'iconv'; puts Iconv.conv('US-ASCII//TRANSLIT','UTF-8','é'); 'é'.each_byte{|x| puts x}" | irb
require 'iconv'; puts Iconv.conv('US-ASCII//TRANSLIT','UTF-8','é'); 'é'.each_byte{|x| puts x}
?
195
169
"\303\251"
--
Posted via http://www.ruby-forum.com/.
Hello, I was going crazy with this problem. I searched a lot and found some people with the same problem: Iconv works in irb but not in a ruby script. The solution was to take a different approach. For example, Daniel Lucraft (http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/306663) made 3 suggestions. The first one is to use the Ruby-GNOME2 library and:

Not only did this work for me, but Iconv itself started to work as expected! For instance:
require 'iconv'
require 'gtk2'
puts Iconv.conv("ASCII//translit","UTF-8","áàâä")
gives 'aaaa'.
The second solution:
ascii = %x{echo "#{str}" | iconv -f "ISO-8859-1" -t "US-ASCII//TRANSLIT"}
also worked here.

The problem is that I'm not using a ruby script; I'm making a web page with mod_ruby. So %x{} gives an 'Insecure operation' error, and "require 'gtk2'" gives:
/var/www/dev/q/test.rbx:12: Cannot open display: /usr/lib/ruby/1.8/gtk2.rb:12 ./lib.rb:31:in `require'

His last suggestion is to write your own wrapper. I haven't tried that, of course. Finally, I used the hack:
Unicode.normalize_KD(string).gsub(/[^\x00-\x7F]/n,'')
as described here: http://www.ruby-forum.com/topic/70827, and this seems to work fine for removing accents (but I'm not sure if the result is an ASCII string).
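On the open question of whether the stripped result is really an ASCII string: a rough way to check, assuming Ruby 1.9 or later (where strings carry an encoding) and using the built-in String#unicode_normalize (Ruby 2.2+) as a stand-in for Unicode.normalize_KD, might look like this:

```ruby
# Sketch: strip accents, then inspect what kind of string is left.
# String#unicode_normalize stands in for Unicode.normalize_KD here.
cleaned = "\u00E1\u00E0\u00E2\u00E4".unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]/, "")

puts cleaned              # prints "aaaa"
puts cleaned.ascii_only?  # prints "true": only 7-bit characters remain
puts cleaned.encoding     # prints "UTF-8": the encoding label is unchanged

# Relabel it as US-ASCII if a consumer insists on that encoding.
ascii = cleaned.encode("US-ASCII")
puts ascii.encoding       # prints "US-ASCII"
```

So the content is pure ASCII after the gsub, even though the string object is still tagged as UTF-8 until it is explicitly re-encoded.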
> This not only worked for me, as the Iconv started to work as expected! For instance:
> require 'iconv'
> require 'gtk2'
> puts Iconv.conv("ASCII//translit","UTF-8","áàâä")
> gives 'aaaa'.
GNU libiconv seems to need the locale set. The issue would be fixed by the following patch.
VALUE method_setlocale(VALUE self, VALUE category, VALUE locale)
{
    int c = NUM2INT(category);
    char *r;
    if (locale == Qnil) {
        /* a NULL locale queries the current setting */
        r = setlocale(c, NULL);
    } else {
        Check_Type(locale, T_STRING);
        r = setlocale(c, RSTRING_PTR(locale));
    }
    return r == NULL ? Qnil : rb_str_new2(r);
}
static VALUE
rb_setlocale(int category, VALUE locale)
{
    char *r = setlocale(category, StringValueCStr(locale));
    return r ? rb_str_new2(r) : Qnil;
}

static VALUE
rb_getlocale(int category)
{
    char *r = setlocale(category, NULL);
    return r ? rb_str_new2(r) : Qnil;
}

#define funcs(n, c) \
    static VALUE rb_getlocale_##n(VALUE self) {return rb_getlocale(c);} \
    static VALUE rb_setlocale_##n(VALUE self, VALUE val) {return rb_setlocale(c, val);} \
    /* end of funcs */

foreach_categories(funcs)

void
Init_locale(void)
{
    VALUE locale = rb_define_module("Locale");
#define methods(n, c) \
    rb_define_singleton_method(locale, #n, rb_getlocale_##n, 0); \
    rb_define_singleton_method(locale, #n"=", rb_setlocale_##n, 1); \
    /* end of methods */
Thanks, your solution is really more Ruby-way. I just wonder why "setlocale" isn't a part of Ruby standard library. Since Ruby maps/wraps most of the standard (POSIX) functions (especially those available on Windows too), this one should be also taken into consideration.
> First of all, option after // is GNU iconv local extension.
Sure I know that, but it doesn't mean it is EVIL, is it? Still it is very useful for creating permalinks and removing accented characters simply, w/o using any third party libraries and so, but unusable until we call POSIX setlocale, which isn't present in Ruby API.
> Second, extconf.rb must check for the necessary header, and whether each categories are defined.
Still it should be present on every system (AFAIK it is), since quoting man: "The setlocale() function conforms to ISO/IEC 9899:1999 (``ISO C99'').". Is there anyone who checks whether <stdio.h> exists?
> And it feels too redundant. I guess Locale.ctype = '' would be easy.
Sure yours is better. Mine didn't consider fact that some of constants may have different values on different systems.
If it were to be included in the standard library, I'd keep the Locale::setlocale method as well, since you can combine types there and also check the returned value: nil means the association failed and a String means it succeeded, and the documentation doesn't explicitly say that the returned string is exactly the one that was passed. So with a simple Locale::ctype= we may miss some important feedback.
At Sun, 7 Jun 2009 00:24:55 +0900, Adam Strzelecki wrote in [ruby-talk:338586]:
> Thanks, your solution is really more Ruby-way. I just wonder why "setlocale" isn't a part of Ruby standard library. Since Ruby maps/wraps most of the standard (POSIX) functions (especially those available on Windows too), this one should be also taken into consideration.
It affects library functions, such as printf(), and can cause problems.
> > First of all, option after // is GNU iconv local extension.
> Sure I know that, but it doesn't mean it is EVIL, is it? Still it is very useful for creating permalinks and removing accented characters simply, w/o using any third party libraries and so, but unusable until we call POSIX setlocale, which isn't present in Ruby API.

You can't rely on it if you want to write a portable script.
> > Second, extconf.rb must check for the necessary header, and whether each categories are defined.
> Still it should be present on every system (AFAIK it is), since quoting man: "The setlocale() function conforms to ISO/IEC 9899:1999 (``ISO C99'').". Is there anyone who checks whether <stdio.h> exists?
Not all systems conform to C99, and not all categories are available on all systems. For instance, mingw and perhaps mswin don't have LC_MESSAGES.
> > And it feels too redundant. I guess Locale.ctype = '' would be easy.
> Sure yours is better. Mine didn't consider fact that some of constants may have different values on different systems.

Of course; otherwise, why would they need the macros?

> If it were to be included into standard library I'd leave Locale::setlocale method as well, as you may combine types there and also check returned value, where nil means failed association and String on successful one, where documentation doesn't explicitly say that returned string is exactly the one that was passed. So with simple Locale::ctype= we may miss some important feedback.
You mean setters should raise an exception on error?
#include <locale.h>
#include "ruby.h"

static VALUE
rb_setlocale(int category, const char *locale)
{
    char *r = setlocale(category, locale);
    if (!r) rb_raise(rb_eRuntimeError, "setlocale");
    return rb_str_new2(r);
}

static inline VALUE
locale_set(int category, VALUE locale)
{
    return rb_setlocale(category, StringValueCStr(locale));
}

static inline VALUE
locale_get(int category)
{
    return rb_setlocale(category, NULL);
}

#define funcs(n, c) \
    static VALUE rb_getlocale_##n(VALUE self) {return locale_get(c);} \
    static VALUE rb_setlocale_##n(VALUE self, VALUE val) {return locale_set(c, val);} \
    /* end of funcs */

foreach_categories(funcs)

void
Init_locale(void)
{
    VALUE locale = rb_define_module("Locale");
#define methods(n, c) \
    rb_define_singleton_method(locale, #n, rb_getlocale_##n, 0); \
    rb_define_singleton_method(locale, #n"=", rb_setlocale_##n, 1); \
    /* end of methods */
Nobuyoshi Nakada wrote:
> It affects library functions, such as printf(), and can cause problems.

Of course, but FileUtils.rm_rf '/' can do harm as well, and it is still included in Ruby. So I don't really get why I can't hook into setlocale from Ruby. This function is present on every recent OS, and accessible to every C, C++, Perl or Python programmer. Sure, I know it changes the way printf works and so on, but Ruby is for sober developers, isn't it? If setlocale causes trouble for someone, my answer is "don't use it", not "prohibit it".
>> we call POSIX setlocale, which isn't present in Ruby API.
> You can't rely on it if you want to write a portable script.

I don't care. ascii//translit//IGNORE is available on Linux and Mac OS X, and maybe many others; that is enough for me. Kernel#fork AFAIK doesn't work on Windows, but would you tell me not to use it because my script won't be portable?
> Not all systems conform to C99, and not all categories are available on all systems. For instance, mingw and perhaps mswin don't have LC_MESSAGES.

Right. It's just that checking for the existence of <locale.h> seems a bit paranoid to me :), but that's just my point of view.

>> Sure yours is better. Mine didn't consider fact that some of constants may have different values on different systems.

> Of course; otherwise, why would they need the macros?
:)
> You mean setters should raise an exception on error?
IMHO yes they should if setlocale returns NULL.
Anyway, it is rather a lot for someone who just wants to call:
setlocale LC_CTYPE, ''
But if we want to get a real interface for setlocale, yours is the perfect one to have it included in official stdlib.