>
> Hi,
> I'm having users upload CSV files, but I've run into a problem with
> files in different encodings. My default is UTF-8, so when users upload
> an ISO-8859-1 encoded file, some characters get munged.
>
> where charset is the charset of the uploaded file. But I need to find
> out what it is for the conversion to work. It works when I use
> explicit "ISO-8859-1", but then only if the uploaded file is actually
> an ISO-8859-1 file. I need to make it variable, because in many cases,
> the file will just be UTF-8. Any ideas?
Check the Content-Type header?
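
A rough sketch of what that check could look like, assuming you can get at the raw Content-Type value of the uploaded part as a string. The helper name charset_from_content_type is made up for illustration and is not a Rails API. Many clients leave the charset parameter out for file uploads, so a nil result only means "unknown".

```ruby
# hypothetical helper: pull the charset parameter out of a Content-Type value
def charset_from_content_type(header)
  return nil if header.nil?
  # matches e.g. 'text/csv; charset=ISO-8859-1' or 'text/csv; charset="utf-8"'
  match = header.match(/;\s*charset\s*=\s*"?([^\s";]+)"?/i)
  match && match[1]
end
```

If this returns nil you still have to guess the encoding from the file contents.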
Fred
Best regards
Peter De Berdt
Peter De Berdt [2008-08-14 13:39]:
> - Making your own guess: since most characters are the same across
> the different encodings and only some special characters differ, you
> can scan the strings you're importing. Dutch, for example, will
> only use ë, é, è, … Try to find the most common ones for your
> language and save them with WindowsLatin encoding. Then import
> them into your Rails app and look at how they appear within Rails'
> UTF-8 context. You then have the choice of either prescanning the
> files (if they're not huge) or doing a streaming import and
> restarting the import if you find out along the way you're using
> the wrong encoding.
we're doing a similar thing (including BOM detection) in our
automatic encoding guesser. see the documentation [1] or install the
'cmess' gem in case you're interested. unfortunately, the actual
heuristics are buried in the source [2] so you'd have to look there
for details.
dirk's example might then look like this:
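    require 'iconv'
    require 'cmess/guess_encoding'

    File.open(tmp_file, 'w') do |f|
      input   = file.read
      # guess the charset of the uploaded data
      charset = CMess::GuessEncoding::Automatic.guess(input)
      # Iconv.conv returns a String (Iconv.iconv returns an Array)
      f.write(Iconv.conv('UTF-8', charset, input))
    end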
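
if you'd rather not pull in a gem, the simplest form of peter's "own guess" idea can be sketched with the standard library alone. this is not what cmess does internally. it is a minimal sketch, assuming ruby 1.9 or later (where strings carry an encoding) and assuming uploads are only ever UTF-8 or ISO-8859-1. the helper name to_utf8 is made up for illustration.

```ruby
# hypothetical helper, not part of cmess: decide between UTF-8 and a fallback
UTF8_BOM = "\xEF\xBB\xBF".b

def to_utf8(raw, fallback = 'ISO-8859-1')
  # work on a binary copy so the caller's string is left untouched
  data = raw.dup.force_encoding(Encoding::BINARY)
  # strip a leading UTF-8 byte order mark if there is one
  data = data.byteslice(3..-1) if data.start_with?(UTF8_BOM)
  utf8 = data.dup.force_encoding(Encoding::UTF_8)
  # bytes that already form valid UTF-8 are kept as they are
  return utf8 if utf8.valid_encoding?
  # otherwise treat the bytes as the fallback charset and transcode
  data.force_encoding(fallback).encode(Encoding::UTF_8)
end
```

this only separates valid UTF-8 from "everything else": any byte sequence is acceptable as ISO-8859-1, so a file in some other 8-bit charset would be converted without an error but with the wrong characters.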
[1] <http://prometheus.rubyforge.org/cmess/classes/CMess/GuessEncoding/Automatic.html>
[2] <http://prometheus.khi.uni-koeln.de/svn/scratch/cmess/lib/cmess/guess_encoding.rb>
cheers
jens