Encoding Confusion When Writing (e.g. Czech text) to an XLSX File


Marc Schwartz

Jun 22, 2015, 4:32:51 PM
to spreadsheet...@googlegroups.com
Hi,

I have created a package for R (http://www.r-project.org), which is managed on Github here:


The package uses R code to create a CSV file containing data from an R object (a data frame), which is a rectangular data set. The CSV file is then read using a Perl script to create the resultant XLS or XLSX file. The Perl scripts that are used are located here:


and are called from within R, passed to a shell to be executed.

A user is having issues when the source data contains Czech characters. As an example (note the Czech Š in ŠKODA):

Make Bodystyle Satisfaction
1 ŠKODA Coupé 4
2 ŠKODA Coupé 5
3 ŠKODA Coupé 6
4 Citroën Coupé 7
5 Citroën Coupé 5
6 Citroën Coupé 3


The attached CSV file is an example of the source file generated in R as the interim step in the process. 

If I attempt to open the CSV file using TextEdit on OS X, I get errors about the file not being UTF-8. However, I can open it using Emacs without issue.

If I open the CSV file in Excel (on OS X, using an en_US.UTF-8 locale), the file is parsed correctly and the cell contents are as above, showing the Czech characters.

However, if I use my R package to write the XLS file using the UTF-8 encoding, when I open the resulting file I get the following (note the � replacement symbols):

Make Bodystyle Satisfaction
�KODA Coup� 4
�KODA Coup� 5
�KODA Coup� 6
Citro�n Coup� 7
Citro�n Coup� 5
Citro�n Coup� 3


I have attempted various incantations of Perl encoding code, including running the Perl script from the CLI to avoid any possible issues when running from within R itself. The encoding incantations are based upon Google searches of Perl encoding issues, but I have yet to come up with something that works.

I would appreciate any insights that anyone can offer. I am trying to avoid having to offer multiple encodings as user options, since users can be anywhere. I had thought that this character set should be fine in UTF-8, but perhaps I am missing something.

If you need more information, please let me know.

Thanks,

Marc


data.csv

jmcnamara

Jun 23, 2015, 3:39:46 AM
to spreadsheet...@googlegroups.com, wdwg...@gmail.com


On Monday, 22 June 2015 21:32:51 UTC+1, Marc Schwartz wrote:


A user is having issues when the source data contains Czech characters. As an example (note the Czech Š in ŠKODA):


Hi,

The rule in Excel::Writer::XLSX is simple to state but easy to get wrong: data that perl recognises as UTF-8 will be written to an Excel file correctly.
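As a side illustration of that rule (this is not from the thread, and uses only the core Encode module): the same bytes behave very differently depending on whether perl has decoded them into a character string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Raw octets, e.g. as read from a file opened without an :encoding() layer.
my $octets = "\xC5\xA0KODA";    # the UTF-8 byte sequence for "ŠKODA"

# decode() turns the octets into a Perl character string; this is the
# form that Excel::Writer::XLSX writes out correctly.
my $chars = decode( 'UTF-8', $octets );

printf "octets flagged as characters?  %s\n", utf8::is_utf8($octets) ? 'yes' : 'no';  # no
printf "decoded flagged as characters? %s\n", utf8::is_utf8($chars)  ? 'yes' : 'no';  # yes
printf "first character: U+%04X\n", ord $chars;    # U+0160, LATIN CAPITAL LETTER S WITH CARON
```

Note that the six raw octets decode to five characters; writing the undecoded octets is what produces mojibake in the spreadsheet.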

The trick is to inform perl that the data you are using is in UTF-8. This is usually done with a directive to open(). Here is a small working example:


#!/usr/bin/perl

use strict;
use warnings;
use Excel::Writer::XLSX;


my $workbook  = Excel::Writer::XLSX->new( 'data.xlsx' );
my $worksheet = $workbook->add_worksheet();


# The :encoding(utf8) layer tells perl to decode the file as UTF-8.
my $file = 'data.csv';
open my $fh, '<:encoding(utf8)', $file or die "Couldn't open $file: $!\n";

my $row = 0;

while ( my $line = <$fh> ) {
    chomp $line;

    my @items = split /,/, $line;

    $worksheet->write_row( $row++, 0, \@items );
}

close $fh;
$workbook->close();

__END__


Note, you should use a real CSV parser since this doesn't handle quoted data.
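To sketch why that matters (Text::CSV is the usual recommendation, but as a core-only illustration Text::ParseWords works too; the sample line is made up): a quoted field containing a comma defeats the bare split /,/ above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $line = '"Citroen, s.r.o.",Coupe,7';    # a quoted field containing a comma

my @naive  = split /,/, $line;             # breaks the quoted field apart
my @parsed = parse_line( ',', 0, $line );  # 0 = strip the surrounding quotes

print scalar(@naive),  " fields from split\n";       # 4
print scalar(@parsed), " fields from parse_line\n";  # 3
print $parsed[0], "\n";                              # Citroen, s.r.o.
```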

John

 

Marc Schwartz

Jun 23, 2015, 2:37:38 PM
to spreadsheet...@googlegroups.com, wdwg...@gmail.com
Hi John,

Thanks kindly for your reply.

I ran the example script that you have above on a version of the CSV file without quotes. I do use Text::CSV_PP for CSV file parsing in my full Perl script in the R package.

There is some interesting behavior that results and I was not sure if you had observed it.

When running your script, which I copied and pasted into a text file, I get the following warnings:

utf8 "\x8A" does not map to Unicode at test.pl line XX.
utf8 "\xE9" does not map to Unicode at test.pl line XX.
utf8 "\x8A" does not map to Unicode at test.pl line XX.
utf8 "\xE9" does not map to Unicode at test.pl line XX.
utf8 "\x8A" does not map to Unicode at test.pl line XX.
utf8 "\xE9" does not map to Unicode at test.pl line XX.
utf8 "\xEB" does not map to Unicode at test.pl line XX.
utf8 "\xE9" does not map to Unicode at test.pl line XX.
utf8 "\xEB" does not map to Unicode at test.pl line XX.
utf8 "\xE9" does not map to Unicode at test.pl line XX.
utf8 "\xEB" does not map to Unicode at test.pl line XX.
utf8 "\xE9" does not map to Unicode at test.pl line XX.

where 'XX' is the line that begins with 'while...'.

However, the resultant XLSX file, rather than containing the ? symbols or spaces in place of the Czech characters, contains:


Make Bodystyle Satisfaction
1 \x8AKODA Coup\xE9 4
2 \x8AKODA Coup\xE9 5
3 \x8AKODA Coup\xE9 6
4 Citro\xEBn Coup\xE9 7
5 Citro\xEBn Coup\xE9 5
6 Citro\xEBn Coup\xE9 3


So essentially, the literal bytes from the source CSV file are being written 'as is' to the XLSX file, rather than being translated to the correct Czech characters.

I have tried various other incantations, including the use of:

  open FH, '<:encoding(iso-8859-2)', $file or die "Couldn't open $file: $!\n";

where iso-8859-2 is supposedly one correct encoding for the Czech character set. With that, there are no warnings as above regarding Unicode mapping; however, I get spaces where the Czech 'Š' characters should be, while the 'e' with the diaeresis and the accented 'e' are correctly written to the XLSX file:

Make Bodystyle Satisfaction
1 Š KODA Coupé 4
2 Š KODA Coupé 5
3 Š KODA Coupé 6
4 Citroën Coupé 7
5 Citroën Coupé 5
6 Citroën Coupé 3


I modified your script to write one cell at a time, rather than a full row, and included a call to the Encode function decode(), as I now have in my Perl code, for utf-8 and latin1. However, that seemed to be ineffectual, whether I used utf-8 or iso-8859-2.

After spending some time searching further via Google, I came across some new references regarding Czech characters specifically. It would seem that the encoding in the source CSV file that I have is Windows CP1252, not UTF-8 or ISO-8859-2. So the user is apparently on an older Windows-based computer, or at least has that locale set.

Thus, after some further testing, the following incantation on the open file line of:

  open FH, '<:encoding(cp1252)', $file or die "Couldn't open $file: $!\n";

results in a correct XLSX file (attached)!

In addition, the use of:

  $worksheet->write($Row, $Column, decode("cp1252", $item));

in lieu of the file open encoding incantation also works, even when the file open encoding is not explicitly stated.
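As a sanity check on those byte values (a standalone sketch using only the core Encode module, not part of the thread's scripts): the bytes from the warnings above are exactly the Czech characters under Windows-1252.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The raw bytes seen earlier in the XLSX output, decoded as Windows-1252.
my $make = decode( 'cp1252', "\x8AKODA" );    # "ŠKODA"
my $body = decode( 'cp1252', "Coup\xE9" );    # "Coupé"

printf "U+%04X => %s\n", ord($make), $make;                # U+0160 => ŠKODA
printf "U+%04X => %s\n", ord( substr $body, -1 ), $body;   # U+00E9 => Coupé
```

In cp1252, 0x8A is Š (U+0160), 0xE9 is é (U+00E9), and 0xEB is ë (U+00EB), which matches the garbled cells exactly.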

So it would seem that, when opening the CSV file directly in Excel, Excel is silently selecting the proper encoding, resulting in the correct display of the character set.

I am not sure if the behavior that I am seeing may be unique to my OS X environment or if it is indeed portable. The key, presumably, is that the content in the XLSX file is correct for the user's locale if they define an encoding other than UTF-8 (e.g. latin1 or now cp1252) when running my code in R.
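One possible way to cope with unknown user encodings (not something from this thread, and heuristic by nature, so treat it as a sketch) is Encode::Guess from the core Encode distribution:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode::Guess;

# "\x8A" is not valid UTF-8, so with these suspects the guess should
# settle on cp1252. Guessing is heuristic: overlapping 8-bit encodings
# can yield an ambiguous result, returned as an error string.
my $octets  = "\x8AKODA Coup\xE9";
my $decoder = guess_encoding( $octets, qw(cp1252) );

if ( ref $decoder ) {
    printf "guessed %s: %s\n", $decoder->name, $decoder->decode($octets);
}
else {
    print "could not guess: $decoder\n";
}
```

In practice this only helps when the candidate encodings are distinguishable; cp1252 and iso-8859-2 both accept most byte sequences, so listing them together will often come back ambiguous.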

Is there a preferred approach for this issue, in terms of explicitly indicating the encoding at file open versus using Encode::decode() when writing the cell contents?

Thanks again John!

Marc
dataNOQUOTES.xlsx

Marc Schwartz

Jun 23, 2015, 3:38:40 PM
to spreadsheet...@googlegroups.com
On Tuesday, June 23, 2015 at 1:37:38 PM UTC-5, Marc Schwartz wrote:

<snip>
 
Is there a preferred approach for this issue, in terms of explicitly indicating the encoding at file open versus using Encode::decode() when writing the cell contents?



From some additional searching, it would appear that the preferred approach is indeed to define the encoding on CSV file open, rather than using Encode to 'translate' the cell content prior to writing the cells to the spreadsheet file.
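That open-time approach can be sketched self-contained with core modules only (the file and data below are made up for the round trip; the real scripts read the user's CSV instead):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write raw cp1252 bytes to a scratch file (stand-in for the user's CSV).
my ( $out, $file ) = tempfile();
binmode $out, ':raw';
print {$out} "\x8AKODA,Coup\xE9,4\n";    # cp1252 bytes for "ŠKODA,Coupé,4"
close $out;

# Declaring the encoding on open means every line read is already a
# character string; no per-cell Encode::decode() is needed afterwards.
open my $in, '<:encoding(cp1252)', $file or die "Couldn't open $file: $!\n";
my $line = <$in>;
close $in;

chomp $line;
printf "first character: U+%04X\n", ord $line;    # U+0160
```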

There still seems to be a level of confusion on some of these finer points with respect to encoding in Perl, with the ultra-cautious suggesting that perhaps both may be apropos. Thus my own confusion here.

I am going to modify my Perl scripts to declare the file encoding upon open and remove the use of Encode for now. Hopefully that will avoid some of the additional issues that pop up now and then.

Thanks again John. If anything here does not make sense, please let me know.

Regards,

Marc
