find encoding (charset) of an uploaded file

deegee

Aug 14, 2008, 6:07:00 AM

to Ruby on Rails: Talk

Hi,
I'm having users upload CSV files, but find a problem with files
having different encodings. My default is UTF-8, so when users upload
an ISO-8859-1 encoded file, some characters get munged.

I have a standard file upload field in my view (the form is for an
object "import"):

<%= f.file_field :file %>

And in my controller I pass the file param to the model CsvFile:

@csv_file = CsvFile.new(params[:import][:file])
@csv_file.import

In my model, I first write the file to disk and then read it using
FasterCSV:

def initialize(file)
  @file = file
end

def import
  write_file
  FasterCSV.foreach(tmp_file) do |row|
    ....
  end
end

def write_file
  self.tmp_file = "#{RAILS_ROOT}/tmp/csv_files/" +
    rand(9999999999).to_s

  if file
    File.open(tmp_file, "w") do |f|
      f.write(file.read)
    end
  end
end

I would like to convert it to UTF-8 before saving, changing the
write_file like this:

# was: f.write(file.read)
# now becomes (Iconv.conv returns a String; Iconv.iconv returns an Array):
f.write(Iconv.conv("UTF-8", charset, file.read))

where charset is the charset of the uploaded file. But I need to find
out what it is for the conversion to work. It works when I use
explicit "ISO-8859-1", but then only if the uploaded file is actually
an ISO-8859-1 file. I need to make it variable, because in many cases,
the file will just be UTF-8. Any ideas?

Frederick Cheung

Aug 14, 2008, 6:08:41 AM

to rubyonra...@googlegroups.com

On 14 Aug 2008, at 11:07, deegee wrote:

>
> Hi,
> I'm having users upload CSV files, but find a problem with files
> having different encodings. My default is UTF-8, so when users upload
> an ISO-8859-1 encoded file, some characters get munged.
>

> where charset is the charset of the uploaded file. But I need to find
> out what it is for the conversion to work. It works when I use
> explicit "ISO-8859-1", but then only if the uploaded file is actually
> an ISO-8859-1 file. I need to make it variable, because in many cases,
> the file will just be UTF-8. Any ideas?

Check the Content-Type header?

Fred
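
For reference, a minimal sketch of what checking the header could look like. The helper name charset_from_content_type is hypothetical and not part of Rails; it only parses a Content-Type string you already have in hand. Browsers frequently omit the charset parameter for uploaded files, so a nil result is to be expected.

```ruby
# Hypothetical helper (not a Rails API): extract the charset parameter
# from a Content-Type value such as "text/csv; charset=iso-8859-1".
# Returns nil when no charset is declared.
def charset_from_content_type(content_type)
  return nil if content_type.nil?

  match = content_type.match(/;\s*charset\s*=\s*"?([^";\s]+)"?/i)
  match && match[1].upcase
end

charset_from_content_type('text/csv; charset=iso-8859-1')  # => "ISO-8859-1"
charset_from_content_type('text/csv')                       # => nil
```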

Peter De Berdt

Aug 14, 2008, 7:39:34 AM

to rubyonra...@googlegroups.com

That isn't going to help you determine the encoding of a file's content. To be honest, you can only make an educated guess, and there are several ways to go about it:

- The easiest way is to let the user decide. You don't have to ask them for the encoding as such; you can just ask: which application did you create this file with?

- Some UTF-8 files have a BOM at the beginning of the file. Mind you, I said "some", not "all". However, if the UTF-8 BOM (the bytes EF BB BF) is there, you can assume the file is UTF-8 encoded (http://en.wikipedia.org/wiki/Byte-order_mark).

- Making your own guess: since most characters are encoded identically across the different encodings and only some special characters differ, you can scan the strings you're importing. Dutch, for example, will only use ë, é, è, … Find the most common ones for your language and save them in a file with WindowsLatin encoding. Then import that file into your Rails app and see how those characters appear in Rails' UTF-8 context. You then have the choice of either prescanning the files (if they're not huge) or doing a streaming import and restarting it if you find out along the way that you're using the wrong encoding.
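
A rough sketch of the BOM check combined with a very simple guess, assuming Ruby 1.9 or later (for String#force_encoding and String#valid_encoding?) and assuming the only realistic alternative to UTF-8 is a single fallback charset. The name guess_charset and the ISO-8859-1 default are illustrative choices. It cannot tell ISO-8859-1 apart from Windows-1252 or other single-byte encodings.

```ruby
# The three bytes that make up the UTF-8 byte-order mark.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: returns 'UTF-8' when the bytes start with a UTF-8 BOM
# or form valid UTF-8; otherwise returns the fallback charset. This is a
# heuristic, not a reliable detector.
def guess_charset(bytes, fallback = 'ISO-8859-1')
  raw = bytes.dup.force_encoding(Encoding::BINARY)

  return 'UTF-8' if raw.start_with?(UTF8_BOM)
  return 'UTF-8' if raw.dup.force_encoding(Encoding::UTF_8).valid_encoding?

  fallback
end

guess_charset("caf\xC3\xA9".b)  # => "UTF-8"       (é as two UTF-8 bytes)
guess_charset("caf\xE9".b)      # => "ISO-8859-1"  (lone 0xE9 is not valid UTF-8)
```

Because the check reads the whole string, this corresponds to the prescanning option and suits files that fit comfortably in memory.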

Best regards

Peter De Berdt

Jens Wille

Aug 14, 2008, 8:37:19 AM

to rubyonra...@googlegroups.com

hi peter! (and dirk)

Peter De Berdt [2008-08-14 13:39]:

> - Making your own guess: since most character overlap the
> different encodings and only some special characters differ, you
> can scan the strings you're importing. Dutch for example will
> only use ë, é, è, … Try and find the most common ones for your
> language and save them with WindowsLatin encoding. Then import
> them in your rails app and look how they are seen within rails'
> UTF-8 context. You then have the choice of either prescanning the
> files (if they're not huge) or doing a streaming import and
> restart the import if you find out along the way you're using the
> wrong encoding.

we're doing a similar thing (including BOM detection) in our
automatic encoding guesser. see the documentation [1] or install the
'cmess' gem in case you're interested. unfortunately, the actual
heuristics are buried in the source [2] so you'd have to look there
for details.

dirk's example might then look like this:

require 'iconv'
require 'cmess/guess_encoding'

File.open(tmp_file, 'w') do |f|
  input = file.read
  charset = CMess::GuessEncoding::Automatic.guess(input)

  # Iconv.conv returns a String (Iconv.iconv returns an Array)
  f.write(Iconv.conv('UTF-8', charset, input))
end
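
On Ruby 1.9 and later the conversion step can also be done without Iconv, which stopped shipping with Ruby as of 2.0. A minimal sketch, assuming the charset name has already been determined by whatever guesser you use; the helper name to_utf8 is made up for illustration.

```ruby
# Hypothetical helper: label the raw bytes with their source charset,
# then transcode to UTF-8. Raises if the charset name is unknown or a
# byte has no mapping in the source charset.
def to_utf8(bytes, charset)
  bytes.dup.force_encoding(charset).encode('UTF-8')
end

to_utf8("caf\xE9".b, 'ISO-8859-1')  # => "café"
```

When charset is already 'UTF-8', encode does not transcode, so invalid bytes would pass through unchanged.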

[1]
<http://prometheus.rubyforge.org/cmess/classes/CMess/GuessEncoding/Automatic.html>
[2]
<http://prometheus.khi.uni-koeln.de/svn/scratch/cmess/lib/cmess/guess_encoding.rb>

cheers
jens

deegee

Aug 14, 2008, 10:36:26 AM

to Ruby on Rails: Talk

Peter, Jens,
Thanks for these great answers. I already knew it wouldn't be as
simple as checking the content-type, since the charset usually isn't
passed properly there. Your tactic of looking for common
characters should work fine for my purposes; we don't need to be 100%
accurate, 99.99% is good enough :-)

I'll have a look at your gem, Jens; at first glance it seems to be
exactly what I need.

cheers,
dirk.
