Having trouble with utf-8 csv to xlsx conversion

Christina Plummer

Apr 28, 2015, 8:24:38 PM

to spreadsheet...@googlegroups.com

I have written a very basic script (similar to one of your examples) to convert .csv files to .xlsx using Excel::Writer::XLSX. However, I've noticed that when I read in a UTF-8 text/csv file and write it out, the special characters (Spanish accented letters, in this case) aren't getting written properly.

Judging from the docs, it seems like UTF-8 should be used automatically. But when I open the resulting file in Excel, or even if I unzip it and inspect the sharedStrings.xml file, the characters are wrong.

E.g. (hopefully these will paste correctly)

Dirección

becomes

DirecciÃ³n

If I explicitly open the input file using '<:encoding(utf-8)', then it seems to do the right thing.

What encoding does it assume by default? ASCII? Since UTF-8 is backwards compatible with ISO-8859-1 and ASCII, why not assume UTF-8?

I'm running 0.84 (the latest I saw on CPAN) on RHEL. The 'file' utility detects it as UTF-8 so I had been hoping it would "just work" without me having to specify the encoding for every file. I am thinking that for my purposes specifying the input encoding as UTF-8 should be safe, but wanted to ask the question.

Thanks.

jmcnamara

Apr 29, 2015, 3:27:21 AM

to spreadsheet...@googlegroups.com

On Wednesday, 29 April 2015 01:24:38 UTC+1, Christina Plummer wrote:

If I explicitly open the input file using '<:encoding(utf-8)', then it seems to do the right thing.

Hi,

That is the right thing to do.

Excel and Excel::Writer::XLSX expect any string data to be UTF-8.
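
For readers following along, here is a minimal sketch of that approach. It assumes Text::CSV for parsing (the original script is not shown in this thread), and the `csv_to_xlsx` helper name and the command-line file arguments are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;
use Text::CSV;
use Excel::Writer::XLSX;

# Hypothetical helper: convert a UTF-8 CSV file to XLSX.
# Returns the number of rows written.
sub csv_to_xlsx {
    my ($csv_file, $xlsx_file) = @_;

    # The :encoding(UTF-8) layer decodes the file's bytes into Perl
    # characters as they are read.
    open my $fh, '<:encoding(UTF-8)', $csv_file
        or die "Cannot open $csv_file: $!";

    my $csv = Text::CSV->new({ binary => 1 })
        or die "Cannot create CSV parser: " . Text::CSV->error_diag;

    my $workbook = Excel::Writer::XLSX->new($xlsx_file)
        or die "Cannot create $xlsx_file: $!";
    my $worksheet = $workbook->add_worksheet();

    # Write each parsed CSV record as one worksheet row, starting at column 0.
    my $row = 0;
    while (my $fields = $csv->getline($fh)) {
        $worksheet->write_row($row++, 0, $fields);
    }

    close $fh;
    $workbook->close();
    return $row;
}

# Run as: perl csv2xlsx.pl input.csv output.xlsx
csv_to_xlsx(@ARGV) if @ARGV == 2;
```

The only line that matters for the accented characters is the `open` call; the rest is the usual read-and-write loop.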

John

Christina Plummer

Apr 29, 2015, 11:01:55 AM

to spreadsheet...@googlegroups.com

Excel and Excel::Writer::XLSX expect any string data to be UTF-8.

Thanks for your response. So, just trying to understand better - if it "expects" it to be UTF-8, why do I need to specify that? What is it doing when I don't specify the encoding? I've been looking through the "perluniintro" reference but haven't quite got a handle on it yet. Are you doing anything special in Excel::Writer::XLSX in regards to the encoding, or just relying on Perl?

Also, I wanted to make a correction to something I said earlier, since the internet is forever:

UTF-8 is backwards compatible with ISO-8859-1 and ASCII

This statement is incorrect - UTF-8 is backwards compatible with **ASCII**, but not with ISO-8859-1.

jmcnamara

Apr 29, 2015, 4:29:33 PM

to spreadsheet...@googlegroups.com

On Wednesday, 29 April 2015 16:01:55 UTC+1, Christina Plummer wrote:

Excel and Excel::Writer::XLSX expect any string data to be UTF-8.

Thanks for your response. So, just trying to understand better - if it "expects" it to be UTF-8, why do I need to specify that?

Hi,

Otherwise perl doesn't know that it isn't ASCII.

What is it doing when I don't specify the encoding?

It will just treat it as a stream of bytes.
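
The effect can be reproduced with the core Encode module alone. The sketch below is an illustration of the general mechanism, not of Excel::Writer::XLSX internals: it assumes that undecoded bytes end up being encoded to UTF-8 a second time on output, which matches the "DirecciÃ³n" symptom reported above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 bytes for "Dirección", as they arrive from a filehandle
# opened without an encoding layer.
my $bytes = "Direcci\xC3\xB3n";

# Undecoded, Perl sees ten one-byte characters ("ó" counts as two).
my $undecoded_length = length $bytes;

# Decoding tells Perl the bytes are UTF-8 and yields nine characters.
my $chars          = decode('UTF-8', $bytes);
my $decoded_length = length $chars;

# Writing UTF-8 output means encoding the string. Encoding the undecoded
# bytes treats 0xC3 and 0xB3 as two separate characters and encodes each
# one again (double encoding).
my $double_encoded = encode('UTF-8', $bytes);
my $correct        = encode('UTF-8', $chars);

# Render a byte string as space-separated hex for inspection.
sub hex_bytes { return join ' ', map { sprintf '%02X', ord } split //, shift }

printf "undecoded length: %d, decoded length: %d\n",
    $undecoded_length, $decoded_length;
printf "double-encoded: %s\n", hex_bytes($double_encoded);
printf "correct:        %s\n", hex_bytes($correct);
```

In the double-encoded output the two bytes `C3 B3` become `C3 83 C2 B3`, which a UTF-8 viewer displays as "Ã³".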

I've been looking through the "perluniintro" reference but haven't quite got a handle on it yet. Are you doing anything special in Excel::Writer::XLSX in regards to the encoding, or just relying on Perl?

Unicode and UTF-8 handling can be tricky, but in terms of Excel::Writer::XLSX it is reasonably straightforward: just make sure that perl knows that it is dealing with UTF-8 and everything else will work.