[R] Problem with writing a file in UTF-8

tpklein

Feb 17, 2011, 4:54:53 PM

to r-h...@r-project.org

Hello,

I am working with a data frame containing character strings with many special
symbols from various European languages. When writing such character
strings to a file using the UTF-8 encoding, some of them are converted in a
strange way. See the following example, run in R 2.12.1 on Windows 7:

out <- file( description="out.txt", open="w", encoding="UTF-8")
write( x="äöüßæűŁ", file=out )
close( con=out )

The last two symbols in the character string are converted to "uL", while all
other characters are left unchanged (which is what I want). What explains
this? Does it have something to do with my locale? And is there a way to
work around this problem? Any help would be greatly appreciated.

Thomas
--
View this message in context: http://r.789695.n4.nabble.com/Problem-with-writing-a-file-in-UTF-8-tp3311721p3311721.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-h...@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Matt Shotwell

Feb 21, 2011, 11:47:52 AM

to r-h...@r-project.org, tpklein

Thomas,

I wasn't able to reproduce your finding. The last two characters in my
'out.txt' file were just as expected. But I'm in a UTF-8 locale. Your
locale affects the encoding of characters on your platform. If you're
not in a UTF-8 locale, then characters are converted from your native
encoding to UTF-8 (when you specify encoding="UTF-8"). In the process of
conversion, it's possible to lose information. You can test whether
there is a loss (or a change rather) when R writes these characters like
so:

# what does űŁ look like in binary (hex)?
raw_before <- charToRaw("űŁ")

# write 'out.txt' as before
out <- file(description="out.txt", open="w", encoding="UTF-8")
write(x="űŁ", file=out)
close(con=out)

# read in the two characters
out <- file(description="out.txt", open="r", encoding="UTF-8")
raw_after <- charToRaw(readChar(con=out, nchars=2))
close(con=out)

# compare the raw representations
identical(raw_before, raw_after)

This test passes on my machine. But there's also the question of
whether these characters made it onto the R-help list unaltered. Also,
please include the result of sessionInfo() in your subsequent messages.
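
A byte-level variant of this check avoids depending on how a console or
mail client renders the characters. The sketch below assumes the file
should contain the UTF-8 bytes of U+0171 and U+0141 (c5 b1 c5 81). The
helper name file_bytes and the temporary file are illustrative and not
part of the original example.

```r
# Hypothetical helper: return a file's contents as raw bytes, without CR/LF
file_bytes <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  bytes[!bytes %in% as.raw(c(0x0d, 0x0a))]
}

# Expected UTF-8 encoding of U+0171 followed by U+0141
expected <- as.raw(c(0xc5, 0xb1, 0xc5, 0x81))

# Write through an encoding-converting connection, as in the example above.
# Unicode escapes keep the script itself independent of its file encoding.
path <- tempfile(fileext = ".txt")
out <- file(description = path, open = "w", encoding = "UTF-8")
write(x = "\u0171\u0141", file = out)
close(con = out)

# Show what actually landed on disk and whether it matches the expectation
written <- file_bytes(path)
survived <- identical(written, expected)
print(written)
print(survived)
```

If survived is FALSE, the printed bytes show what the characters were
turned into on the way to the file.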

Best,
Matt

> sessionInfo()
R version 2.11.1 (2010-05-31)
i686-pc-linux-gnu

locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=C LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base


Prof Brian Ripley

Feb 21, 2011, 12:29:08 PM

to Matt Shotwell, r-h...@r-project.org, tpklein

This is asking FAR too much under Windows, which has no UTF-8 locales.
In particular, cat() (on which write() is based) will convert to the
native locale, even if you manage to input the string as an R UTF-8
string.

And conversion is an OS service, so you are getting the conversion
Windows sees as appropriate.
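
To see which situation a given session is in, one can inspect the locale
and the string's declared encoding. This is only a diagnostic sketch
using base R functions; the variable names are illustrative.

```r
# Is the session running in a UTF-8 locale? Per the discussion above,
# this is not the case on the Windows setup described in the thread.
info <- l10n_info()
utf8_locale <- isTRUE(info[["UTF-8"]])
print(Sys.getlocale("LC_CTYPE"))
print(utf8_locale)

# A string built from Unicode escapes is stored as UTF-8 regardless of locale
x <- "\u0171\u0141"
print(Encoding(x))

# Its code points: 369 (0x171) and 321 (0x141)
code_points <- utf8ToInt(x)
print(code_points)
```

When utf8_locale is FALSE, any function that converts to the native
encoding before writing can alter characters the native encoding lacks.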

The best way around this is to use a more capable OS. But you can do
e.g.

> x <- '\u0171\u0141' # ensure this really is "űŁ"
> writeLines(x, 'foo', useBytes=TRUE) # ensure no conversion
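
The same approach applied to the full string from the original question
might look like the sketch below. It assumes the goal is a file holding
exactly the UTF-8 bytes of the string. The temporary path and the
read-back check are additions for illustration.

```r
# Build the string from Unicode escapes: "äöüßæűŁ"
x <- "\u00e4\u00f6\u00fc\u00df\u00e6\u0171\u0141"

path <- tempfile(fileext = ".txt")  # hypothetical output location

# enc2utf8() makes sure the bytes are UTF-8;
# useBytes = TRUE writes them without converting to the native encoding
writeLines(enc2utf8(x), path, useBytes = TRUE)

# Read the file back as raw bytes and drop the line terminator (LF or CR LF)
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
bytes <- bytes[!bytes %in% as.raw(c(0x0d, 0x0a))]

# Compare against the UTF-8 bytes of the original string
round_trip_ok <- identical(bytes, charToRaw(enc2utf8(x)))
print(round_trip_ok)
```

Because no connection-level re-encoding is involved, the file should
contain the same bytes whatever the session's locale is.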

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595

David Heffernan

Feb 21, 2011, 2:39:54 PM

to r-h...@r-project.org

Windows is perfectly capable of handling UTF-8, but its native
encoding is UTF-16LE. Applications on Windows are meant to work with
text data in the UTF-16LE encoding. If it needs to be converted to or
from another encoding then there are services that do this (which
work). There are countless programs on Windows that are 100% Unicode
compliant.

I don't know how R holds text data, but perhaps it holds it as char*
as do Python, Perl etc. and all the other such languages that have
problems doing Unicode properly on Windows.

Basically I don't buy the idea that Windows can't do Unicode. It's
supported Unicode since NT was released back in 1993. That's nearly 20
years ago now. It's just too easy to blame it on Windows, but it doesn't
ring true.

David Heffernan.
