Having trouble with utf-8 csv to xlsx conversion

Christina Plummer

Apr 28, 2015, 8:24:38 PM
to spreadsheet...@googlegroups.com
I have written a very basic script (similar to one of your examples) to convert .csv files to .xlsx using Excel::Writer::XLSX.  However, I've noticed that when I read in a UTF-8 text/csv file and write it out, the special characters (Spanish accented letters, in this case) aren't getting written properly. 

Judging from the docs, it seems like UTF-8 should be used automatically.  But when I open the resulting file in Excel, or even if I unzip it and inspect the sharedStrings.xml file, the characters are wrong.

E.g. (hopefully these will paste correctly)

Dirección

becomes

Dirección

If I explicitly open the input file using '<:encoding(utf-8)', then it seems to do the right thing. 

What encoding does it assume by default?  ASCII?  Since UTF-8 is backwards compatible with ISO-8859-1 and ASCII, why not assume UTF-8? 

I'm running 0.84 (the latest I saw on CPAN) on RHEL.  The 'file' utility detects it as UTF-8 so I had been hoping it would "just work" without me having to specify the encoding for every file.  I am thinking that for my purposes specifying the input encoding as UTF-8 should be safe, but wanted to ask the question.

Thanks.

jmcnamara

Apr 29, 2015, 3:27:21 AM
to spreadsheet...@googlegroups.com


On Wednesday, 29 April 2015 01:24:38 UTC+1, Christina Plummer wrote:

If I explicitly open the input file using '<:encoding(utf-8)', then it seems to do the right thing. 



Hi,

That is the right thing to do.

Excel and Excel::Writer::XLSX expect any string data to be UTF-8.
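
For readers following along, here is a minimal sketch of the kind of conversion script under discussion, with the input opened through the `:encoding(UTF-8)` layer. It is not the original poster's script: the use of Text::CSV for parsing, the sub name `csv_to_xlsx`, and taking the file paths from the command line are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Text::CSV;              # assumed CSV parser; the original script is not shown
use Excel::Writer::XLSX;

# Hypothetical helper: convert a UTF-8 encoded CSV file to XLSX.
sub csv_to_xlsx {
    my ($csv_path, $xlsx_path) = @_;

    # Decode on input so perl holds characters rather than raw bytes.
    open my $in, '<:encoding(UTF-8)', $csv_path
        or die "Cannot open $csv_path: $!";

    my $csv = Text::CSV->new({ binary => 1 })
        or die "Cannot create CSV parser";

    my $workbook = Excel::Writer::XLSX->new($xlsx_path)
        or die "Cannot create $xlsx_path: $!";
    my $worksheet = $workbook->add_worksheet();

    my $row = 0;
    while (my $fields = $csv->getline($in)) {
        # Write the parsed fields across one worksheet row.
        $worksheet->write_row($row++, 0, $fields);
    }
    $csv->eof or die "CSV parse error: " . $csv->error_diag();

    close $in;
    $workbook->close();
    return $row;    # number of rows written
}

# Run only when both paths are given on the command line.
my ($csv_path, $xlsx_path) = @ARGV;
csv_to_xlsx($csv_path, $xlsx_path) if defined $xlsx_path;
```

Changing the open mode to a plain `'<'` is the only difference needed to reproduce the garbled output described in the first message.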

John

Christina Plummer

Apr 29, 2015, 11:01:55 AM
to spreadsheet...@googlegroups.com
Excel and Excel::Writer::XLSX expect any string data to be UTF-8.

Thanks for your response.  So, just trying to understand better - if it "expects" it to be UTF-8, why do I need to specify that?  What is it doing when I don't specify the encoding?  I've been looking through the "perluniintro" reference but haven't quite got a handle on it yet.  Are you doing anything special in Excel::Writer::XLSX in regards to the encoding, or just relying on Perl?

Also, I wanted to make a correction to something I said earlier, since the internet is forever:
"UTF-8 is backwards compatible with ISO-8859-1 and ASCII"
This statement is incorrect: UTF-8 is backwards compatible with **ASCII**, but not with ISO-8859-1.

jmcnamara

Apr 29, 2015, 4:29:33 PM
to spreadsheet...@googlegroups.com


On Wednesday, 29 April 2015 16:01:55 UTC+1, Christina Plummer wrote:
Excel and Excel::Writer::XLSX expect any string data to be UTF-8.

Thanks for your response.  So, just trying to understand better - if it "expects" it to be UTF-8, why do I need to specify that?

Hi,

Otherwise perl doesn't know that it isn't ASCII.
 
  What is it doing when I don't specify the encoding?

It will just treat it as a stream of bytes.
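
The byte-stream behaviour can be seen with core Perl alone. The sketch below writes the UTF-8 bytes for "Dirección" to a scratch file, reads them back with and without a decoding layer, and then encodes both results as UTF-8. The final encoding step stands in for a writer that emits UTF-8 output; that is an assumption about how the module writes its XML, not something confirmed in this thread. The helper name `read_line` is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Write the UTF-8 bytes for "Dirección" to a scratch file.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "Direcci\xC3\xB3n\n";
close $tmp_fh or die "close failed: $!";

# Hypothetical helper: read one line from $path using the given open mode.
sub read_line {
    my ($mode) = @_;
    open my $fh, $mode, $path or die "Cannot open $path: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

my $raw     = read_line('<:raw');              # undecoded bytes
my $decoded = read_line('<:encoding(UTF-8)');  # decoded characters

printf "raw length:     %d\n", length $raw;
printf "decoded length: %d\n", length $decoded;

# Stand-in for a UTF-8 output step (an assumption, see the text above).
# Each undecoded byte is treated as its own character and encoded again.
my $double = encode('UTF-8', $raw);
my $single = encode('UTF-8', $decoded);
printf "double-encoded bytes: %d\n", length $double;
printf "single-encoded bytes: %d\n", length $single;
```

The raw string is 10 units long and the decoded one is 9, because the two bytes of "ó" are counted separately when nothing decodes them. Encoding the raw string again turns those two bytes into four, which a UTF-8 viewer displays as "Ã³".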
 
  I've been looking through the "perluniintro" reference but haven't quite got a handle on it yet.  Are you doing anything special in Excel::Writer::XLSX in regards to the encoding, or just relying on Perl?

Unicode and UTF-8 handling can be tricky, but in terms of Excel::Writer::XLSX it is reasonably straightforward: just make sure that perl knows that it is dealing with UTF-8 and everything else will work.

 Regards,

John