[patch] fix tidy_bytes for 1.9.x, improve performance


Norman Clarke

Apr 9, 2010, 9:05:39 AM
to rubyonra...@googlegroups.com
Hi all,

I was wondering if I could get some feedback on a patch I created for
ActiveSupport's `tidy_bytes` method.

Right now `tidy_bytes` doesn't work with 1.9.x, since it relies on a
Unicode regexp that raises an ArgumentError for any string containing
an invalid UTF-8 byte sequence. You can see the essence of the problem
easily by firing up any 1.9.x irb and doing this:

    ruby-1.9.2-preview1 > "\x93".split(//u)
    ArgumentError: invalid byte sequence in UTF-8
            from (irb):2:in `split'
            from (irb):2
            from /Users/norman/.rvm/rubies/ruby-1.9.2-preview1/bin/irb:17:in `<main>'

This patch resolves the issue by traversing the string as bytes rather
than codepoints, and is about twice as fast as the current
implementation. Instead of the current implementation's regular
expression, it looks at the position of each byte's first 0 bit to tell
whether the byte is ASCII, a leading byte, or a continuation byte, and
from that whether the byte is valid where it appears. This Wikipedia
article was a useful reference while working on the patch:

http://en.wikipedia.org/wiki/UTF-8#Description
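
To make the byte-level check concrete, here is a minimal sketch of classifying a byte by its leading bits. It is an illustration only, not the code from the patch; the method name `utf8_byte_kind` and the symbols it returns are hypothetical.

```ruby
# Hypothetical helper (not from the patch): classify one byte by the
# position of its first 0 bit, following the UTF-8 layout.
def utf8_byte_kind(byte)
  if    byte < 0x80 then :ascii         # 0xxxxxxx
  elsif byte < 0xC0 then :continuation  # 10xxxxxx
  elsif byte < 0xE0 then :lead2         # 110xxxxx, starts a 2-byte sequence
  elsif byte < 0xF0 then :lead3         # 1110xxxx, starts a 3-byte sequence
  elsif byte < 0xF8 then :lead4         # 11110xxx, starts a 4-byte sequence
  else                   :invalid       # 11111xxx never appears in UTF-8
  end
end

# "\xC3\xA9" is a valid two-byte sequence; the "\x93" after it is a
# continuation byte with no leading byte in front of it.
kinds = "\xC3\xA9\x93a".bytes.map { |b| utf8_byte_kind(b) }
p kinds
```

A stricter check would also reject 0xC0, 0xC1 and 0xF5 and above as leading bytes; the sketch leaves that out to keep the bit pattern visible.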

It also adds a `force` option to allow cleanup of byte sequences that
are valid as both CP-1252 / ISO-8859-1 and UTF-8. This can be used when
the developer knows that their input is encoded in CP-1252 or
ISO-8859-1 and wants to recode it to UTF-8. (Again, the presence of
invalid characters will prevent doing this by simply using `#encode` or
`#force_encoding` on 1.9.)
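
The effect of the `force` option can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the patch itself: the names `tidy_bytes_sketch`, `seq_length` and `cp1252_to_utf8` are hypothetical, validity is delegated to `String#valid_encoding?` instead of the bit checks, and bytes that CP-1252 leaves undefined become U+FFFD.

```ruby
# Hypothetical sketch, not the patch: keep valid UTF-8 sequences and
# reinterpret every other byte as CP-1252.

# Sequence length announced by a leading byte (1 for anything else).
def seq_length(byte)
  if    byte < 0xC0 then 1
  elsif byte < 0xE0 then 2
  elsif byte < 0xF0 then 3
  else                   4
  end
end

# Transcode a single CP-1252 byte to a UTF-8 string.
def cp1252_to_utf8(byte)
  [byte].pack("C").force_encoding("Windows-1252").
    encode("UTF-8", :invalid => :replace, :undef => :replace)
end

def tidy_bytes_sketch(string, force = false)
  bytes = string.bytes.to_a
  out = String.new.force_encoding("UTF-8")
  i = 0
  while i < bytes.length
    n = seq_length(bytes[i])
    candidate = bytes[i, n].pack("C*").force_encoding("UTF-8")
    if !force && candidate.valid_encoding?
      out << candidate                  # already valid UTF-8, keep as-is
      i += n
    else
      out << cp1252_to_utf8(bytes[i])   # stray byte, or force was requested
      i += 1
    end
  end
  out
end

puts tidy_bytes_sketch("\x93quoted\x94")   # CP-1252 curly quotes become UTF-8
puts tidy_bytes_sketch("\xC3\xA9")         # valid UTF-8 is left alone
puts tidy_bytes_sketch("\xC3\xA9", true)   # forced: each byte read as CP-1252
```

Without `force`, the two bytes C3 A9 stay as a single "é"; with `force`, they are recoded one byte at a time, which is only the right choice when the input is known to be CP-1252 or ISO-8859-1.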

* The patch: http://gist.github.com/361115
* LH Ticket:  https://rails.lighthouseapp.com/projects/8994/tickets/4350-tidy_bytes-fails-on-19x

The same code is also available in isolation as a library:

http://github.com/norman/utf8_utils

Regards,

Norman

Jeremy Kemper

Apr 9, 2010, 12:47:50 PM
to rubyonra...@googlegroups.com

This is great. Thanks, Norman!

jeremy
