I'm having trouble with URI.encode and UTF-8 characters


Ron Newman

Mar 28, 2013, 11:36:33 PM
to boston-r...@googlegroups.com
I'm trying to use URI.encode to properly escape a query parameter for an HTTP GET request. The string that I am encoding is UTF-8 and contains a non-ASCII accented character.

     s = "é"                # lowercase e with acute accent
     s.encoding       #  => #<Encoding:UTF-8>
     s.length             #  => 1
     s.chars.to_a     #  => ["é"]
     s.chars.to_a[0].ord.to_s(16)    # => "e9" , which agrees with http://www.unicode.org/charts/PDF/U0080.pdf

but then

     require 'uri'
     URI.encode(s)      #   =>  "%C3%A9" 
     URI.escape(s)      #   =>   "%C3%A9" 
     URI.encode_www_form_component(s)   # => "%C3%A9"
     URI.encode_www_form(:foo => s)            # => "foo=%C3%A9" 

Why is it doing this, when the proper encoding should be "%E9" ?  How do I use Ruby to encode the string correctly?


Daniel Choi

Mar 29, 2013, 11:04:00 AM
to boston-r...@googlegroups.com


Hi Ron

I may be wrong, but I think URI.escape splits up the character data on byte boundaries, not on character boundaries, so:

s.bytes.map{|x| x.to_s(16)}.inspect # => ["c3", "a9"]

URI.unescape goes through these hex codes a pair at a time to reconstitute the original unicode characters.
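
A minimal sketch of that byte-by-byte behaviour, not the library's actual implementation. It assumes a UTF-8 string and uses URI.encode_www_form_component, because URI.escape and URI.encode were removed in Ruby 3.0. The variable names are only for illustration.

```ruby
require 'uri'

s = "\u00E9"   # "é", written as an escape so the example does not depend on the file's encoding

# Percent-encode by hand: one %XX group per byte, not per character.
manual = s.bytes.map { |b| format("%%%02X", b) }.join

# The library call escapes the same two bytes.
library = URI.encode_www_form_component(s)

puts manual    # %C3%A9
puts library   # %C3%A9
```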

Dan

Ron Newman

Mar 29, 2013, 11:07:22 AM
to boston-r...@googlegroups.com

On Mar 29, 2013, at 11:04 AM, Daniel Choi wrote:

>
> I may be wrong, but I think URI.escape splits up the character data on byte boundaries, not on character boundaries, so:

So what do I need to do instead, to properly URL encode a string that contains UTF-8 characters?


Daniel Choi

Mar 29, 2013, 11:10:31 AM
to boston-r...@googlegroups.com



I think you can run it through

s.force_encoding("utf-8")
URI.unescape(s)

Daniel Choi

Mar 29, 2013, 11:12:33 AM
to boston-r...@googlegroups.com

I think I misread your question.

To URL-encode the utf-8 string, you just have to do URI.escape(s)

Just be sure to do #force_encoding("utf-8") before you decode the URI string on the other end.
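
As a rough sketch of the decoding side: URI.decode_www_form_component takes the target encoding as an optional second argument (UTF-8 by default), which tags the decoded bytes without a separate #force_encoding call. The variable names here are made up for the example.

```ruby
require 'uri'

# Bytes C3 A9, interpreted as UTF-8 (the default), decode to "é".
from_utf8 = URI.decode_www_form_component("%C3%A9")

# Byte E9 only means "é" if the receiver treats it as ISO-8859-1.
from_latin1 = URI.decode_www_form_component("%E9", Encoding::ISO_8859_1)

puts from_utf8.encoding      # UTF-8
puts from_latin1.encoding    # ISO-8859-1
puts from_latin1.encode("UTF-8") == from_utf8   # true
```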

Ron Newman

Mar 29, 2013, 11:20:32 AM
to boston-r...@googlegroups.com

On Mar 29, 2013, at 11:12 AM, Daniel Choi wrote:

>
> To URL-encode the utf-8 string, you just have to do URI.escape(s)

That's exactly what I did, and it is not working.

The string is "é". It should URL encode to "%E9"; that's what my browser does if I type such a string into a query form (such as the "Search Artists" form at http://SomervilleOpenStudios.org , which is the web site that I am scraping with Nokogiri).

Instead, Ruby URI.escape is encoding it to "%C3%A9" .

Daniel Choi

Mar 29, 2013, 11:37:20 AM
to boston-r...@googlegroups.com, dhc...@gmail.com
I think you'll have to patch the Ruby library's URI module to achieve
this. Maybe something along the lines of:

https://gist.github.com/danchoi/5271585




Travis Briggs

Mar 29, 2013, 12:16:41 PM
to boston-r...@googlegroups.com, Dan Choi
Actually I think you have misinterpreted the URI encoding of Unicode characters.

Unicode != UTF-8

Although the Unicode code point of 'é' is E9, the UTF-8 encoding of that code point is two bytes, namely C3A9.

Any code point higher than 7F gets encoded as multiple bytes in UTF-8.

Here is a utility that I've found useful in cases like this: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=e9&mode=hex

So the conclusion is that the URI library is doing the right thing and producing the correct result.
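
The difference between the code point and its encoded bytes can be checked from Ruby itself. This is just an illustrative snippet with made-up variable names:

```ruby
s = "\u00E9"   # "é" as a UTF-8 string

# The Unicode code point is a single number, independent of any encoding.
code_point_hex = s.ord.to_s(16)

# The UTF-8 encoding of that code point takes two bytes.
utf8_hex = s.bytes.map { |b| b.to_s(16) }

# ISO-8859-1 stores the same character as one byte equal to the code point.
latin1_hex = s.encode("ISO-8859-1").bytes.map { |b| b.to_s(16) }

puts code_point_hex        # e9
puts utf8_hex.inspect      # ["c3", "a9"]
puts latin1_hex.inspect    # ["e9"]
```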

Hope this helps,
-Travis



--
You received this message because you are subscribed to the Google Groups "Boston Ruby Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to boston-rubygro...@googlegroups.com.

Daniel Choi

Mar 29, 2013, 12:41:24 PM
to boston-r...@googlegroups.com, Dan Choi


I think Travis is right.

The underlying problem may be that your code is using UTF-8, but the website you're scraping expects strings in ISO-8859-1.

So maybe try to encode your search query strings into ISO-8859-1 before you URI.encode() them and send the request via your web scraper:

# encoding: utf-8
require 'uri'
s = "café"
puts s
s.encode!("iso-8859-1")
puts URI.encode(s)
#=> caf%E9
s.encode!("utf-8")
puts URI.encode(s)
#=> caf%C3%A9

Ron Newman

Mar 29, 2013, 3:18:23 PM
to boston-r...@googlegroups.com
Thanks, everyone. Turns out that the real problem was that Ruby 1.9.3 was giving me UTF-8 strings, but the website that I'm sending the GET request to is written in PHP and expects ISO-8859-1. This is ugly, but it solved my problem:

   word = "Forêt"  # Ruby string, in the default encoding which is utf-8

   url = SearchURL + URI.encode(word.encode("iso-8859-1"))

   doc = Nokogiri::HTML(Net::HTTP::get(URI(url)))  
        # it appears that Nokogiri will always parse into utf-8 strings, even though the web page is in iso-8859-1
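
For anyone on a newer Ruby: URI.encode and URI.escape were removed in Ruby 3.0, so the same idea might be sketched with URI.encode_www_form_component instead. The helper name latin1_query_value is hypothetical, and this assumes the target site really does expect ISO-8859-1 query values.

```ruby
require 'uri'

# Hypothetical helper: transcode to ISO-8859-1 first, then percent-encode the bytes.
def latin1_query_value(word)
  URI.encode_www_form_component(word.encode("ISO-8859-1"))
end

encoded = latin1_query_value("For\u00EAt")   # "Forêt"
puts encoded   # For%EAt
```

Two differences from URI.encode are worth keeping in mind: encode_www_form_component writes a space as "+" rather than "%20", and String#encode raises Encoding::UndefinedConversionError for characters that ISO-8859-1 cannot represent.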

