incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)

Alex (Maxious) Sadleir

Oct 12, 2011, 12:13:06 PM
to scrap...@googlegroups.com
Time to fix my scrapers after the great Ruby 1.9 upgrade. The first
issue was if statements ending with unnecessary colons, which now raise
"syntax error, unexpected ':', expecting keyword_then or ';' or '\n'"
(Python on the brain?).

Now I'm getting this "incompatible encoding regexp match (ASCII-8BIT
regexp with UTF-8 string)" error on this line: "clientName =
client.content.gsub(/\302\240/, '').strip". That makes sense given the
Ruby 1.9 encoding support:
http://blog.grayproductions.net/articles/ruby_19s_string

I ended up adding an encoding change with
.force_encoding("ASCII-8BIT"), which is okay for my purposes, e.g.
"clientName = client.content.force_encoding("ASCII-8BIT").gsub(/\302\240/,
'').strip"

... but I'm sure some people need UTF-8.
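A minimal sketch of the error and the workaround, for anyone who wants to reproduce it outside ScraperWiki. The sample string is made up, and the regexp is built from a binary string so that it is ASCII-8BIT regardless of the source file's encoding; that stands in for the Ruby 1.9 default-source-encoding case and is an assumption about how the original literal was compiled.

```ruby
# Build the pattern from binary source text so the regexp is ASCII-8BIT,
# as in the error message quoted above.
binary_nbsp = Regexp.new('\302\240'.b)

# Hypothetical scraped text: UTF-8 with a non-breaking space in the middle.
content = "Acme\u00a0Corp"

# Matching a binary regexp against a non-ASCII UTF-8 string raises.
error = nil
begin
  content.gsub(binary_nbsp, '')
rescue Encoding::CompatibilityError => e
  error = e
end
puts error.class

# The workaround: retag the string as binary before matching.
# dup keeps the original string's encoding untouched.
cleaned_binary = content.dup.force_encoding("ASCII-8BIT").gsub(binary_nbsp, '')
puts cleaned_binary.encoding

# If UTF-8 is needed afterwards, the bytes can be retagged again.
restored = cleaned_binary.dup.force_encoding("UTF-8")
puts restored
```

Note that force_encoding only changes the label on the bytes; it does not convert them, which is why the result has to be retagged before it is treated as UTF-8 text again.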

Is this because my scraper's code is stored in ASCII-8BIT format
internally to ScraperWiki? Is this an issue that could be addressed
more broadly by converting all the scraper code files to UTF-8? Or did
I miss some new best practice for specifying regexp in Ruby 1.9?

Ross Jones

Oct 12, 2011, 12:27:20 PM
to scrap...@googlegroups.com
Hi Alex,

If it helps, I ran into the : problem too :( One way to specify the encoding of the source file (which defaults to US-ASCII) is to use the magic comment

#encoding: UTF-8

at the top of the file. The other is that we could set $LC_CTYPE in the LXC for Ruby scripts.
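A rough way to see what the magic comment changes, assuming nothing about ScraperWiki's setup: the sketch below uses eval on two strings tagged US-ASCII to stand in for two source files, one with the comment on its first line and one without. The variable names are only for illustration.

```ruby
# __ENCODING__ reports the source encoding of the code being evaluated.
body = "__ENCODING__"

# Simulate a file with no magic comment, read as US-ASCII.
plain_file = body.dup.force_encoding("US-ASCII")

# Simulate the same file with the magic comment on its first line.
comment_file = ("# encoding: UTF-8\n" + body).force_encoding("US-ASCII")

without_comment = eval(plain_file)
with_comment    = eval(comment_file)

puts without_comment
puts with_comment
```

In a real script the comment has to be on the first line, or on the second line if the first is a shebang.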

I'll try the latter on staging and see if it helps fix the problem for everyone. Can I use your scraper as a test case?

Ross

Alex (Maxious) Sadleir

Oct 12, 2011, 12:39:26 PM
to scrap...@googlegroups.com
Looks like PEBKAC: my regexp is written as raw byte escapes rather than
as characters, and those bytes still came out as ASCII-8BIT for me even
with the file encoding set. I tried the magic comment on ScraperWiki and
LC_CTYPE locally, and then noticed what I was actually asking Ruby to do.

Specifically, they're the representation of a non-breaking space, or
&nbsp;, which in Unicode speak is \u00a0

So clientName = client.content.gsub(/\302\240/, '').strip
becomes clientName = client.content.gsub(/\u00a0/, '').strip
This fixes the errors on all my scrapers - the last migration issue
was SSL certificates which I worked around with "http.verify_mode =
OpenSSL::SSL::VERIFY_NONE"
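For reference, the \u00a0 version of the gsub can be checked in isolation like this. The sample text is invented; in the scraper it would come from client.content.

```ruby
# Hypothetical scraped text: non-breaking spaces at both ends,
# plus ordinary trailing whitespace.
content = "\u00a0Acme Corp\u00a0 \n"

# A \u escape makes the regexp a UTF-8 regexp, so it matches UTF-8 strings
# without any force_encoding.
nbsp = /\u00a0/
puts nbsp.encoding

# gsub removes the non-breaking spaces; strip then removes the ordinary
# whitespace that is left at the ends.
client_name = content.gsub(nbsp, '').strip
puts client_name
```

The gsub is still needed alongside strip, because strip does not treat U+00A0 as whitespace.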

For future reference, if you really are working with UTF-8 strings and
regexps, you can also override the regexp encoding, as described here:
http://www.ruby-forum.com/topic/184136
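The linked thread is not reproduced here. As one illustration of overriding a regexp's encoding, and not necessarily the approach that thread takes, the /u flag marks a literal as UTF-8, so the original octal escapes are read as the two bytes of a single UTF-8 character. The sample string is made up.

```ruby
# Same byte escapes as the original pattern, with the /u flag added
# so the regexp is compiled as UTF-8.
nbsp_u = /\302\240/u
puts nbsp_u.encoding

# Hypothetical UTF-8 text with a non-breaking space in the middle.
content = "Acme\u00a0Corp"

# The UTF-8 regexp matches the UTF-8 string directly.
client_name = content.gsub(nbsp_u, '')
puts client_name
```

Writing the pattern as /\u00a0/ is easier to read; the flag is mostly useful when a byte-escaped pattern has to be kept as it is.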
