I am working with a data frame containing character strings with many special
symbols from various European languages. When writing such character
strings to a file using the UTF-8 encoding, some of them are converted in a
strange way. See the following example, run in R 2.12.1 on Windows 7:
out <- file( description="out.txt", open="w", encoding="UTF-8")
write( x="äöüßæűŁ", file=out )
close( con=out )
The last two symbols in the character string are converted to "uL", while all
other characters are left unchanged (which is what I want). How can this be
explained? Does it have something to do with my locale? And is there a way to
work around this problem? Any help would be greatly appreciated.
Thomas
--
View this message in context: http://r.789695.n4.nabble.com/Problem-with-writing-a-file-in-UTF-8-tp3311721p3311721.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________
R-h...@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
I wasn't able to reproduce your finding. The last two characters in my
'out.txt' file were just as expected. But I'm in a UTF-8 locale. Your
locale affects the encoding of characters on your platform. If you're
not in a UTF-8 locale, then characters are converted from your native
encoding to UTF-8 (when you specify encoding="UTF-8"). In the process of
conversion, it's possible to lose information. You can test whether
there is a loss (or rather a change) when R writes these characters like
so:
# what does űŁ look like in binary (hex)?
raw_before <- charToRaw("űŁ")
# write 'out.txt' as before
out <- file(description="out.txt", open="w", encoding="UTF-8")
write(x="űŁ", file=out)
close(con=out)
# read in the two characters
out <- file(description="out.txt", open="r", encoding="UTF-8")
raw_after <- charToRaw(readChar(con=out, nchars=2))
close(con=out)
# compare the raw representations
identical(raw_before, raw_after)
This test passes on my machine. But there's also the question of
whether these characters made it onto the R-help list unaltered. Also,
please include the result of sessionInfo() in your subsequent messages.
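To see which side of the conversion you are on, it may help to look at the locale and at the bytes R holds for the string. The following is only a sketch: it builds the string from \u escapes so that it does not depend on how this message was transmitted, and the bytes in the comment are simply the UTF-8 encodings of U+0171 and U+0141.

```r
# Which character set does the session use? On the Windows setup described
# in the original post this is likely a code page rather than UTF-8.
Sys.getlocale("LC_CTYPE")
l10n_info()

# Build the string from Unicode escapes so the source encoding is irrelevant.
x <- "\u0171\u0141"
Encoding(x)      # declared encoding of the string
charToRaw(x)     # UTF-8 bytes: c5 b1 c5 81

# Converting to an encoding that lacks these characters cannot be lossless;
# the exact result (NA, a substitution, ...) is platform-dependent.
iconv(x, from = "UTF-8", to = "latin1")
```

If the two characters are not representable in the native encoding (I am assuming something like Windows-1252 here, which I cannot verify from the post), a best-fit substitution such as "uL" would be consistent with what was reported.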
Best,
Matt
> sessionInfo()
R version 2.11.1 (2010-05-31)
i686-pc-linux-gnu
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=C LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
On Thu, 2011-02-17 at 13:54 -0800, tpklein wrote:
> Hello,
>
> I am working with a data frame containing character strings with many special
> symbols from various European languages. When writing such character
> strings to a file using the UTF-8 encoding, some of them are converted in a
> strange way. See the following example, run in R 2.12.1 on Windows 7:
>
> out <- file( description="out.txt", open="w", encoding="UTF-8")
> write( x="äöüßæűŁ", file=out )
> close( con=out )
>
> The last two symbols in the character string are converted to "uL" while all
> other characters are not changed (which is what I want). How to explain
> this? Does it have something to do with my locale? And is there a way to
> work around this problem? -- Any help would be greatly appreciated.
>
> Thomas
______________________________________________
And conversion is an OS service, so you are getting the conversion
that Windows sees as appropriate.
The best way around this is to use a more capable OS. But you can do,
e.g.,
> x <- '\u0171\u0141' # ensure this really is "űŁ"
> writeLines(x, 'foo', useBytes=TRUE) # ensure no conversion
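A slightly longer sketch of the same idea, which also checks the bytes that end up on disk. The temporary file, the binary-mode connection, and the call to enc2utf8() are my own assumptions for the illustration and are not part of the two lines above; enc2utf8() is there only to make sure the string is held as UTF-8 before it is written byte for byte.

```r
x <- "\u0171\u0141"                  # U+0171 and U+0141, given as escapes
path <- tempfile(fileext = ".txt")   # hypothetical output file

# Binary mode avoids any newline translation; useBytes = TRUE asks
# writeLines() to write the string's bytes without re-encoding them.
con <- file(path, open = "wb")
writeLines(enc2utf8(x), con, useBytes = TRUE)
close(con)

# Read the file back as raw bytes and compare with the expected UTF-8
# sequence (c5 b1 c5 81) followed by the newline written by writeLines().
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
expected <- as.raw(c(0xc5, 0xb1, 0xc5, 0x81, 0x0a))
identical(bytes, expected)
```

Reading the file back as raw bytes sidesteps the locale entirely, so the comparison should not depend on whether the session can display the characters.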
--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
I don't know how R holds text data, but perhaps it holds it as char*
as do Python, Perl etc. and all the other such languages that have
problems doing Unicode properly on Windows.
Basically I don't buy the idea that Windows can't do Unicode. It has
supported Unicode since NT was released back in 1993. That's nearly 20 years
ago now. It's just too easy to blame it on Windows, but it doesn't
ring true.
David Heffernan.