Now I'm getting an "incompatible encoding regexp match (ASCII-8BIT
regexp with UTF-8 string)" error from this line: "clientName =
client.content.gsub(/\302\240/, '').strip", which makes sense given
Ruby 1.9's encoding support:
http://blog.grayproductions.net/articles/ruby_19s_string
I ended up adding an encoding change with
.force_encoding("ASCII-8BIT"), which is okay for my purposes, e.g.
"clientName = client.content.force_encoding("ASCII-8BIT").gsub(/\302\240/,
'').strip"
... but I'm sure some people need UTF-8.
Is this because my scraper's code is stored in ASCII-8BIT format
internally to ScraperWiki? Is this an issue that could be addressed
more broadly by converting all the scraper code files to UTF-8? Or did
I miss some new best practice for specifying regexp in Ruby 1.9?
If it helps, I ran into the problem too :( One way to specify the encoding of the source file (which defaults to US-ASCII in Ruby 1.9) is to use the magic comment
# encoding: UTF-8
at the top of the file; the other is that we could set $LC_CTYPE in the LXC container for Ruby scripts.
I'll try the latter on staging and see if it helps fix the problem for everyone. Can I use your scraper as a test case?
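To make the magic-comment route concrete, here's a hedged sketch (note that from Ruby 2.0 onward the source encoding already defaults to UTF-8, so this mainly matters on 1.9):

```ruby
# encoding: UTF-8
# With the comment above on the first line of the file, string and regexp
# literals in this file are UTF-8, and __ENCODING__ reports the source
# encoding in effect.
puts __ENCODING__
```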
Ross
Specifically, those octal escapes are the representation of a non-breaking
space, which in Unicode speak is \u00a0.
So clientName = client.content.gsub(/\302\240/, '').strip
becomes clientName = client.content.gsub(/\u00a0/, '').strip
This fixes the errors on all my scrapers - the last migration issue
was SSL certificates, which I worked around with "http.verify_mode =
OpenSSL::SSL::VERIFY_NONE".
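As a self-contained sketch of the \u00a0 fix (again using a stand-in string for what client.content returns):

```ruby
# \302\240 is just U+00A0 (non-breaking space) spelled as two UTF-8 bytes.
# Writing the escape as \u00a0 keeps the regexp literal in UTF-8, so it
# matches UTF-8 strings directly and no force_encoding is needed.
content = "Acme Ltd\u00a0"
clientName = content.gsub(/\u00a0/, '').strip
```

And unlike the ASCII-8BIT workaround, the result keeps its UTF-8 encoding.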
For future reference, if you were really working with a UTF-8 string
regexp, you can also override the regexp encoding like this:
http://www.ruby-forum.com/topic/184136
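One way to do that (a sketch, not taken from that thread): Regexp.new inherits the encoding of the pattern string it's given, so you can build the regexp from a string in whichever encoding your data uses:

```ruby
# Build the pattern from a UTF-8 string, so the regexp itself is UTF-8
# and matches UTF-8 input without any force_encoding on the data.
nbsp = Regexp.new("\u00a0")
name = "Acme Ltd\u00a0".gsub(nbsp, '').strip
```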