I was wondering if I could get some feedback on a patch I created for
ActiveSupport's `tidy_bytes` method.
Right now `tidy_bytes` doesn't work with 1.9.x, since it relies on a
Unicode regexp that always fails for strings with invalid UTF-8
characters. You can see the essence of the problem easily by firing up
any 1.9.x irb and doing this:
ruby-1.9.2-preview1 > "\x93".split(//u)
ArgumentError: invalid byte sequence in UTF-8
from (irb):2:in `split'
from (irb):2
from /Users/norman/.rvm/rubies/ruby-1.9.2-preview1/bin/irb:17:in `<main>'
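For anyone who wants to poke at the failing input without triggering the exception, here is a small sketch. It uses only core `String` methods available on 1.9 and later; the variable name `s` is just for illustration and nothing here comes from the patch itself.

```ruby
# Illustrative only: 0x93 is a lone continuation byte, so on its own it
# is not well-formed UTF-8.
s = "\x93".dup.force_encoding("UTF-8")

s.valid_encoding?   # false: the string is tagged UTF-8 but is not valid
s.bytes.to_a        # [147]: byte-level access works on invalid strings
```

Because byte-level access never has to decode the string, it keeps working where the Unicode regexp raises.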
This patch resolves the issue by traversing the string as bytes rather
than codepoints, and is about twice as fast as the current
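implementation. Rather than using the current implementation's regular
expression, it looks at where each byte's first 0 bit falls, which in
UTF-8 distinguishes single-byte characters, continuation bytes and the
lead bytes of multibyte sequences, and uses that to decide whether the
byte is valid in context. This Wikipedia article was a useful reference
while working on the patch: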
http://en.wikipedia.org/wiki/UTF-8#Description
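As a rough illustration of the first-0-bit idea (a minimal sketch, not the code from the patch), a byte can be classified by the range it falls into. The method name `classify_byte` and the returned symbols are hypothetical, and the sketch deliberately ignores stricter rules such as overlong encodings (0xC0, 0xC1) and lead bytes above 0xF4.

```ruby
# Hypothetical helper, not taken from the patch: classify a byte by the
# position of its first 0 bit, counting from the most significant bit.
def classify_byte(byte)
  case byte
  when 0x00..0x7F then :ascii         # 0xxxxxxx: single-byte character
  when 0x80..0xBF then :continuation  # 10xxxxxx: continuation byte
  when 0xC0..0xDF then :lead_2        # 110xxxxx: starts a 2-byte sequence
  when 0xE0..0xEF then :lead_3        # 1110xxxx: starts a 3-byte sequence
  when 0xF0..0xF7 then :lead_4        # 11110xxx: starts a 4-byte sequence
  else :invalid                       # 11111xxx: never appears in UTF-8
  end
end

# "\xC3\xA9" is the UTF-8 encoding of U+00E9: a lead byte and one continuation.
"\xC3\xA9".bytes.to_a.map { |b| classify_byte(b) }  # [:lead_2, :continuation]

# 0x93 is a continuation byte with no lead byte before it, hence invalid
# in context.
classify_byte(0x93)                                 # :continuation
```

A real cleanup routine would also need to track how many continuation bytes each lead byte expects, which this sketch leaves out.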
It also adds a `force` option to allow cleanup of byte sequences that
are valid as both CP-1252 / ISO-8859-1 and UTF-8. This can be used when
the developer knows that their input is encoded in CP-1252 or
ISO-8859-1 and wants to recode it to UTF-8. (Again, the presence of
invalid characters will prevent doing this by simply using #encode or
#force_encoding on 1.9.)
* The patch: http://gist.github.com/361115
* LH Ticket: https://rails.lighthouseapp.com/projects/8994/tickets/4350-tidy_bytes-fails-on-19x
There is also a library where you can see this code in isolation:
http://github.com/norman/utf8_utils
Regards,
Norman
This is great. Thanks, Norman!
jeremy