Encoding Confusion When Writing (e.g. Czech text) to an XLSX File


Marc Schwartz

Jun 22, 2015, 4:32:51 PM
to spreadsheet...@googlegroups.com
Hi,

I have created a package for R (http://www.r-project.org), which is managed on Github here:


The package uses R code to create a CSV file containing data from an R object (a data frame), which is a rectangular data set. The CSV file is then read using a Perl script to create the resultant XLS or XLSX file. The Perl scripts that are used are located here:


and are called from within R, passed to a shell to be executed.

A user is having issues when the source data contains Czech characters. As an example (note the Czech Š in ŠKODA):

Make Bodystyle Satisfaction
1 ŠKODA Coupé 4
2 ŠKODA Coupé 5
3 ŠKODA Coupé 6
4 Citroën Coupé 7
5 Citroën Coupé 5
6 Citroën Coupé 3


The attached CSV file is an example of the source file generated in R as the interim step in the process. 

If I attempt to open the CSV file using TextEdit on OS X, I get errors about the file not being UTF-8. However, I can open it using Emacs without issue.

If I open the CSV file in Excel (on OS X, using an en_US.UTF-8 locale), the file is parsed correctly and the cell contents are as above, showing the Czech characters.

However, if I use my R package to write the XLS file using the UTF-8 encoding, when I open the resulting file I get the following (note the � replacement symbols):

Make Bodystyle Satisfaction
�KODA Coup� 4
�KODA Coup� 5
�KODA Coup� 6
Citro�n Coup� 7
Citro�n Coup� 5
Citro�n Coup� 3


I have attempted various incantations of Perl encoding code, including running the Perl script from the CLI to avoid any possible issues when running from within R itself. The encoding incantations are based upon Google searches of Perl encoding issues, but I have yet to come up with something that works.

I would appreciate any insights that anyone can offer. I am trying to avoid having to offer multiple encodings as user options, since users can be anywhere. I had thought that this character set should be fine in UTF-8, but perhaps I am missing something.

If you need more information, please let me know.

Thanks,

Marc


data.csv

jmcnamara

Jun 23, 2015, 3:39:46 AM
to spreadsheet...@googlegroups.com, wdwg...@gmail.com


On Monday, 22 June 2015 21:32:51 UTC+1, Marc Schwartz wrote:


A user is having issues when the source data contains Czech characters. As an example (note the Czech Š in ŠKODA):


Hi,

The rule in Excel::Writer::XLSX is simple to state but easy to get wrong: data that perl recognises as UTF-8 will be written to an Excel file correctly.
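As a side illustration of that rule (this is not from the thread, and uses only the core Encode module): the same bytes behave very differently depending on whether perl has decoded them into a character string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Raw octets, e.g. as read from a file opened without an :encoding() layer.
my $octets = "\xC5\xA0KODA";    # the UTF-8 byte sequence for "ŠKODA"

# decode() turns the octets into a Perl character string; this is the
# form that Excel::Writer::XLSX writes out correctly.
my $chars = decode( 'UTF-8', $octets );

printf "octets flagged as characters?  %s\n", utf8::is_utf8($octets) ? 'yes' : 'no';  # no
printf "decoded flagged as characters? %s\n", utf8::is_utf8($chars)  ? 'yes' : 'no';  # yes
printf "first character: U+%04X\n", ord $chars;    # U+0160, LATIN CAPITAL LETTER S WITH CARON
```

Note that the six raw octets decode to five characters; writing the undecoded octets is what produces mojibake in the spreadsheet.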

The trick is to inform perl that the data you are using is in UTF-8. This is usually done with a directive to open(). Here is a small working example:


#!/usr/bin/perl

use strict;
use warnings;
use Excel::Writer::XLSX;


my $workbook  = Excel::Writer::XLSX->new( 'data.xlsx' );
my $worksheet = $workbook->add_worksheet();


# The :encoding(utf8) layer tells perl to decode the file as UTF-8.
my $file = 'data.csv';
open my $fh, '<:encoding(utf8)', $file or die "Couldn't open $file: $!\n";

my $row = 0;

while ( my $line = <$fh> ) {
    chomp $line;

    my @items = split /,/, $line;

    $worksheet->write_row( $row++, 0, \@items );
}

close $fh;
$workbook->close();

__END__


Note, you should use a real CSV parser since this doesn't handle quoted data.
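To sketch why that matters (Text::CSV is the usual recommendation, but as a core-only illustration Text::ParseWords works too; the sample line is made up): a quoted field containing a comma defeats the bare split /,/ above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $line = '"Citroen, s.r.o.",Coupe,7';    # a quoted field containing a comma

my @naive  = split /,/, $line;             # breaks the quoted field apart
my @parsed = parse_line( ',', 0, $line );  # 0 = strip the surrounding quotes

print scalar(@naive),  " fields from split\n";       # 4
print scalar(@parsed), " fields from parse_line\n";  # 3
print $parsed[0], "\n";                              # Citroen, s.r.o.
```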

John

 

Marc Schwartz

Jun 23, 2015, 2:37:38 PM
to spreadsheet...@googlegroups.com, wdwg...@gmail.com
Hi John,

Thanks kindly for your reply.

I ran the example script that you have above on a version of the CSV file without quotes. I do use Text::CSV_PP for CSV file parsing in my full Perl script in the R package.

There is some interesting behavior that results and I was not sure if you had observed it.

When running your script, which I copied and pasted into a text file, I get the following warnings:

utf8 "\x8A" does not map to Unicode at test.pl line XX.
utf8 "\xE9" does not map to Unicode at test.pl line XX.
utf8 "\x8A" does not map to Unicode at test.pl line XX.
utf8 "\xE9" does not map to Unicode at test.pl line XX.
utf8 "\x8A" does not map to Unicode at test.pl line XX.
utf8 "\xE9" does not map to Unicode at test.pl line XX.
utf8 "\xEB" does not map to Unicode at test.pl line XX.
utf8 "\xE9" does not map to Unicode at test.pl line XX.
utf8 "\xEB" does not map to Unicode at test.pl line XX.
utf8 "\xE9" does not map to Unicode at test.pl line XX.
utf8 "\xEB" does not map to Unicode at test.pl line XX.
utf8 "\xE9" does not map to Unicode at test.pl line XX.

where 'XX' is the line that begins with 'while...'.

However, the resultant XLSX file, rather than containing the ? symbols or spaces in place of the Czech characters, contains:


Make Bodystyle Satisfaction
1 \x8AKODA Coup\xE9 4
2 \x8AKODA Coup\xE9 5
3 \x8AKODA Coup\xE9 6
4 Citro\xEBn Coup\xE9 7
5 Citro\xEBn Coup\xE9 5
6 Citro\xEBn Coup\xE9 3


So essentially, the literal bytes from the source CSV file are being written 'as is' to the XLSX file, rather than being translated to the correct Czech characters.

I have tried various other incantations, including the use of:

  open FH, '<:encoding(iso-8859-2)', $file or die "Couldn't open $file: $!\n";

where iso-8859-2 is supposedly one correct encoding for the Czech character set. With that, there are no warnings as above regarding Unicode mapping; however, I get spaces where the Czech 'Š' characters should be, while the 'e' with the diaeresis and the accented 'e' are correctly written to the XLSX file:

Make Bodystyle Satisfaction
1 Š KODA Coupé 4
2 Š KODA Coupé 5
3 Š KODA Coupé 6
4 Citroën Coupé 7
5 Citroën Coupé 5
6 Citroën Coupé 3


I modified your script to write one cell at a time, rather than a full row, and included a call to the Encode function decode(), as I now have in my Perl code, for utf-8 and latin1. However, that seemed to be ineffectual, whether I used utf-8 or iso-8859-2.

After spending some time searching further via Google, I came across some new references regarding Czech characters specifically. It would seem that the encoding in the source CSV file that I have is Windows CP1252, not UTF-8 or ISO-8859-2. So the user is apparently on an older Windows-based computer, or at least has that locale set.

Thus, after some further testing, the following incantation on the open file line of:

  open FH, '<:encoding(cp1252)', $file or die "Couldn't open $file: $!\n";

results in a correct XLSX file (attached)!

In addition, the use of:

  $worksheet->write($Row, $Column, decode("cp1252", $item));

in lieu of the file open encoding incantation also works, even when the file open encoding is not explicitly stated.
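As a sanity check on those byte values (a standalone sketch using only the core Encode module, not part of the thread's scripts): the bytes from the warnings above are exactly the Czech characters under Windows-1252.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The raw bytes seen earlier in the XLSX output, decoded as Windows-1252.
my $make = decode( 'cp1252', "\x8AKODA" );    # "ŠKODA"
my $body = decode( 'cp1252', "Coup\xE9" );    # "Coupé"

printf "U+%04X => %s\n", ord($make), $make;                # U+0160 => ŠKODA
printf "U+%04X => %s\n", ord( substr $body, -1 ), $body;   # U+00E9 => Coupé
```

In cp1252, 0x8A is Š (U+0160), 0xE9 is é (U+00E9), and 0xEB is ë (U+00EB), which matches the garbled cells exactly.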

So it would seem that, when opening the CSV file directly in Excel, Excel is silently selecting the proper encoding, resulting in the correct display of the character set.

I am not sure if the behavior that I am seeing may be unique to my OS X environment or if it is indeed portable. The key, presumably, is that the content in the XLSX file is correct for the user's locale if they define an encoding other than UTF-8 (e.g. latin1 or now cp1252) when running my code in R.
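One possible way to cope with unknown user encodings (not something from this thread, and heuristic by nature, so treat it as a sketch) is Encode::Guess from the core Encode distribution:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode::Guess;

# "\x8A" is not valid UTF-8, so with these suspects the guess should
# settle on cp1252. Guessing is heuristic: overlapping 8-bit encodings
# can yield an ambiguous result, returned as an error string.
my $octets  = "\x8AKODA Coup\xE9";
my $decoder = guess_encoding( $octets, qw(cp1252) );

if ( ref $decoder ) {
    printf "guessed %s: %s\n", $decoder->name, $decoder->decode($octets);
}
else {
    print "could not guess: $decoder\n";
}
```

In practice this only helps when the candidate encodings are distinguishable; cp1252 and iso-8859-2 both accept most byte sequences, so listing them together will often come back ambiguous.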

Is there a preferred approach for this issue, in terms of explicitly indicating the encoding at file open versus using Encode::decode() when writing the cell contents?

Thanks again John!

Marc
dataNOQUOTES.xlsx

Marc Schwartz

Jun 23, 2015, 3:38:40 PM
to spreadsheet...@googlegroups.com
On Tuesday, June 23, 2015 at 1:37:38 PM UTC-5, Marc Schwartz wrote:

<snip>
 
Is there a preferred approach for this issue, in terms of explicitly indicating the encoding at file open versus using Encode::decode() when writing the cell contents?



From some additional searching, it would appear that the preferred approach is indeed to define the encoding on CSV file open, rather than using Encode to 'translate' the cell content prior to writing the cells to the spreadsheet file.
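That open-time approach can be sketched self-contained with core modules only (the file and data below are made up for the round trip; the real scripts read the user's CSV instead):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write raw cp1252 bytes to a scratch file (stand-in for the user's CSV).
my ( $out, $file ) = tempfile();
binmode $out, ':raw';
print {$out} "\x8AKODA,Coup\xE9,4\n";    # cp1252 bytes for "ŠKODA,Coupé,4"
close $out;

# Declaring the encoding on open means every line read is already a
# character string; no per-cell Encode::decode() is needed afterwards.
open my $in, '<:encoding(cp1252)', $file or die "Couldn't open $file: $!\n";
my $line = <$in>;
close $in;

chomp $line;
printf "first character: U+%04X\n", ord $line;    # U+0160
```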

There still seems to be a level of confusion on some of these finer points with respect to encoding in Perl, with the ultra-cautious suggesting that perhaps both may be apropos. Thus my own confusion here.

I am going to modify my Perl scripts to declare the file encoding upon open and remove the use of Encode for now. Hopefully that will avoid some of the additional issues that pop up now and then.

Thanks again John. If anything here does not make sense, please let me know.

Regards,

Marc
