testing stdin for bad encoding, ruby 1.9

Ben Crowell

May 11, 2008, 11:49:37 AM
I have some existing ruby 1.9 code that broke recently with a new build
of ruby. It looks like the problem was that my preexisting text input
files, which I'd been reading from stdin, contained some characters that
were not valid UTF-8 or US-ASCII. The latest version throws an error in
this situation:

$ ruby --version
ruby 1.9.0 (2008-04-26 revision 0) [x86_64-linux]
$ cat a.rb
#!/usr/bin/ruby

t = $stdin.gets(nil)

t.gsub!(/a/) {'b'}

$ ruby -e 'print "\332"' | a.rb
./a.rb:5:in `gsub!': broken UTF-8 string (ArgumentError)
from ./a.rb:5:in `<main>'

I'm happy to change the input files, because it is an error that they
aren't properly encoded. However, I'd also like to find some way to test
for this type of error more gracefully, and I can't seem to figure out
how to do it. I was originally thinking of something like this:

#!/usr/bin/ruby

t = $stdin.gets(nil)
if t =~ /([^\n]*[^\000-\177][^\n]*)/ then
  $stderr.print "Bad ASCII character detected in this line:\n#{$1}\n"
end

(In my application, the string t may be thousands of lines long.)
However, this doesn't work, because the attempt to test t against a
regex fails with an ArgumentError.

Googling turns up some references to magic comments, but I haven't
been able to find any information on what magic comments are.

Thanks in advance!

Alex Fenton

May 11, 2008, 2:05:06 PM
Ben Crowell wrote:
> I have some existing ruby 1.9 code that broke recently with a new build
> of ruby. It looks like the problem was that my preexisting text input
> files, which I'd been reading from stdin, contained some characters that
> were not valid UTF-8 or US-ASCII.

...

> I'm happy to change the input files, because it is an error that they
> aren't properly encoded. However, I'd also like to find some way to test
> for this type of error more gracefully, and I can't seem to figure out
> how to do it.

I use Iconv from the standard library to convert from UTF-8 to UTF-8 to
test whether files being imported by a user are in fact in the right
encoding. This otherwise redundant recoding will raise an
Iconv::IllegalSequence error if there's a problem. This can be caught
and reported.
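
As a rough sketch of the same kind of up-front check, without the recoding step: this uses String#valid_encoding? rather than Iconv, so it is a swapped-in technique, and it assumes a Ruby build that provides that method. The helper name check_utf8! is hypothetical.

```ruby
# Hypothetical helper (not part of any library): returns a UTF-8 tagged
# copy of the text, or raises ArgumentError if its bytes are not valid UTF-8.
def check_utf8!(text)
  # Tag a copy as UTF-8 without changing any bytes.
  copy = text.dup.force_encoding(Encoding::UTF_8)
  # valid_encoding? only inspects the bytes; it does not raise by itself.
  raise ArgumentError, "input is not valid UTF-8" unless copy.valid_encoding?
  copy
end
```

It could be called as check_utf8!($stdin.read) inside a begin/rescue block, so the failure is reported once at import time instead of at the first gsub! or regex match.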

a


Ben Crowell

May 11, 2008, 6:14:31 PM

Thanks for the suggestion. However, I already have an error that I can
catch and report. The problem is that it's not very helpful to the user
to say, "hey, somewhere in your 100-page text file, there are illegal
characters." That's why I was trying to do this:


if t =~ /([^\n]*[^\000-\177][^\n]*)/ then
  $stderr.print "Bad ASCII character detected in this line:\n#{$1}\n"
end

It seems to me that I need some way to convince Ruby that the string t
is in an encoding where all characters are a single byte, and it's ok
to have the high bit set. Then I could go ahead and use regexes to test
whether it contains any characters with the high bit set, and report
them properly. It just seems like the string, once I read it in, is
like the Medusa -- my program doesn't even dare take a peek at it for
fear of being turned to stone.
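
A minimal sketch of that idea, assuming a Ruby build where String#force_encoding and String#ascii_only? are available: retag a copy of the string as binary, so every byte counts as one character, then report each line that contains a byte with the high bit set. The method name lines_with_high_bytes is hypothetical.

```ruby
# Hypothetical helper: returns [line_number, line] pairs for every line
# that contains at least one byte outside the 7-bit ASCII range.
def lines_with_high_bytes(text)
  # Retag a copy as binary; the bytes themselves are left untouched.
  raw = text.dup.force_encoding(Encoding::BINARY)
  bad = []
  raw.each_line.with_index(1) do |line, number|
    # ascii_only? is false as soon as any byte has the high bit set.
    bad << [number, line.chomp] unless line.ascii_only?
  end
  bad
end
```

Something like lines_with_high_bytes($stdin.read).each { |n, l| $stderr.puts "line #{n}: #{l.inspect}" } would then name the offending lines rather than only saying that the file is bad somewhere.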
