Decoding and converting to UTF-8


Cristi Balan

Dec 17, 2009, 12:42:57 AM
to Ruby's Mail Discussion Group
Hi,

I'm trying to create a crude mail viewing app using the
mail gem.

One of the issues I'm running into is that .decoded returns field
values as bytes in the encoding they arrived in, and the name of that
encoding is then lost, as far as I can tell.

Here's an example of what I mean:

require 'iconv'
require 'rubygems'
require 'mail'

s = Mail::SubjectField.new("From", 'Subject: =?ISO-8859-1?Q?Re=3A_ol=E1?=')
s.decoded # => "Re: ol\341" (LATIN1 bytes)
Iconv.conv("UTF8", "LATIN1", s.decoded) # => "Re: ol\303\241" (UTF8 bytes)

Mail::Encodings.unquote_and_convert_to(s.value, 'UTF8') # => "Re: ol\303\241" (UTF8 bytes)

Also as a gist here: http://gist.github.com/258544

So manually converting works for this example, but shouldn't it
happen automatically on decode?
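
To make the idea concrete, here is a minimal sketch of decoding a single Q-encoded word while keeping its charset, so the result can be converted to UTF-8. It assumes Ruby 1.9 or later (it relies on String#force_encoding and String#encode rather than Iconv), and decode_q_word is a hypothetical helper written for illustration, not something the mail gem provides.

```ruby
# Hypothetical helper (not part of the mail gem): decode one RFC 2047
# Q-encoded word such as "=?ISO-8859-1?Q?Re=3A_ol=E1?=" and return UTF-8.
def decode_q_word(word)
  match = /\A=\?([^?]+)\?[Qq]\?([^?]*)\?=\z/.match(word)
  return word unless match # not an encoded word, hand it back untouched

  charset = match[1]
  text    = match[2]

  # In Q encoding "_" stands for a space and "=XX" for the byte 0xXX,
  # so after the tr the text can be read as quoted-printable.
  bytes = text.tr('_', ' ').unpack('M').first

  # Tag the raw bytes with the declared charset, then transcode.
  # force_encoding raises ArgumentError if Ruby does not know the charset.
  bytes.force_encoding(charset).encode('UTF-8')
end

subject = decode_q_word('=?ISO-8859-1?Q?Re=3A_ol=E1?=')
puts subject.encoding             # UTF-8
puts subject.bytes.last(2).inspect # [195, 161]
```

The point of the sketch is only that the charset name is still available at the moment the bytes are decoded, which is the natural place to do the conversion.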

For example, address.rb uses decode internally and this means that if
one uses any of the handy accessor methods, information about the
encoding is lost and the accessors are useless.

Opinions, ideas?

Thanks,
Cristi

Mikel Lindsaar

Dec 17, 2009, 1:38:05 AM
to mail...@googlegroups.com
Yes, it is a bug. Thanks for finding it; I just fixed it.

Get the latest copy of mail by cloning the repository from GitHub.

>> require 'lib/mail'
=> true


>> s = Mail::SubjectField.new("From", 'Subject: =?ISO-8859-1?Q?Re=3A_ol=E1?=')

=> #<Mail::SubjectField:0x102184d38 @name="Subject", @tree=nil,
@length=nil, @value="=?ISO-8859-1?Q?Re=3A_ol=E1?=", @element=nil>
?> s.decoded
=> "Re: ol\341"

Mikel

--
http://lindsaar.net/
Rails, RSpec and Life blog....

Cristi Balan

Dec 17, 2009, 2:05:12 AM
to Ruby's Mail Discussion Group
Hi,

Thanks for the amazingly fast reply and fix. However, I'm a bit
confused now because the fix fixes something I didn't intend to
report :).

What I wanted to ask was whether .decoded should in fact use
Mail::Encodings.unquote_and_convert_to to avoid losing the encoding on
the text?

Right now, if you use .decoded, there's no way to safely convert the
resulting bytes to UTF8 because their encoding is not known anymore.

Actually, this also goes for decoding the body of a message. I have to do
this to get the proper body content as UTF8. Shouldn't this
automagically happen in .decoded?

class Message
  # Decode the body, then convert it from the message charset
  # to the requested encoding.
  def decoded_and_converted_to(encoding='UTF8')
    Iconv.conv(encoding, charset, body.decoded)
  end
end
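
For comparison, a rough sketch of the same conversion on Ruby 1.9 or later, using String#encode instead of Iconv. The to_utf8 helper and its fallback argument are assumptions made for illustration; the sketch takes the raw bytes and the charset name as plain arguments so that it does not depend on any particular mail gem API.

```ruby
# Hypothetical helper: convert raw body bytes to UTF-8 using the charset
# declared by the message, falling back to a default when the charset
# name is missing or unknown to Ruby.
def to_utf8(raw, charset, fallback = 'ISO-8859-1')
  encoding = begin
    Encoding.find(charset.to_s)
  rescue ArgumentError
    Encoding.find(fallback)
  end

  # Work on a copy so the caller's string keeps its original tag, and
  # ask encode to substitute a replacement character instead of raising
  # when a byte cannot be converted.
  raw.dup.force_encoding(encoding).encode('UTF-8', invalid: :replace, undef: :replace)
end

body = to_utf8("ol\xE1".b, 'ISO-8859-1')
puts body.bytes.inspect # [111, 108, 195, 161]
```

Choosing ISO-8859-1 as the fallback is a guess, not a rule: every byte sequence is valid in it, so the conversion never fails, but the text may come out wrong if the sender used something else.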

Cristi


Mikel Lindsaar

Dec 17, 2009, 7:54:13 AM
to mail...@googlegroups.com
On Thu, Dec 17, 2009 at 6:05 PM, Cristi Balan <ev...@che.lu> wrote:
> What I wanted to ask was whether .decoded should in fact use
> Mail::Encodings.unquote_and_convert_to to avoid losing the encoding on
> the text?
> Right now, if you use .decoded, there's no way to safely convert the
> resulting bytes to UTF8 because their encoding is not known anymore.

Ok, I see your problem

In Ruby 1.9 this is a moot problem because the encoding is embedded in the text.
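
As a small illustration of what "embedded in the text" means, the sketch below shows a Ruby 1.9+ String carrying its encoding as a property of the object. It assumes the library has tagged the decoded string with the charset declared in the message; here that step is simulated by hand with force_encoding.

```ruby
# Bytes for "olá" in ISO-8859-1, starting out as untagged binary data.
raw = "ol\xE1".b

latin1 = raw.dup.force_encoding('ISO-8859-1') # relabel only, bytes unchanged
utf8   = latin1.encode('UTF-8')               # transcode, bytes change

puts latin1.encoding      # ISO-8859-1
puts latin1.bytes.inspect # [111, 108, 225]
puts utf8.encoding        # UTF-8
puts utf8.bytes.inspect   # [111, 108, 195, 161]
```

Because the tag travels with the string, a caller can convert to UTF-8 later without having to remember where the bytes came from.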

For 1.8.x, maybe I could add another method that tells you which
encoding the string is in...

Let me think about it :)

Mikel

W. Andrew Loe III

Apr 6, 2011, 7:59:12 PM
to mail...@googlegroups.com
This is a pretty old thread, but it shows up high in search results. On 1.8, what is the way to parse a message and always extract the body as UTF-8?