Chinese encoding


Alvin Chen

Nov 19, 2010, 6:24:10 AM
to CorpLing with R
I'm totally confused by the text encoding now.

I'm a newbie to Mac OS. Recently I started to process data in R for
Mac, and here's the problem.
I have two text files, in UTF-8 and Big5 encoding respectively (I did
the encoding conversion in Windows). When I try to process these
files in R for Mac, the command scan(, sep="\n") only works for the
Big5 text, not the UTF-8 one. What confuses me more is that when I
process the Big5-encoded file, everything works fine except that the
Chinese characters do not appear properly in the R console; I can't
read the Chinese characters there.

Also, in Mac OS, I can read the Chinese characters properly only in
the UTF-8 file. In other words, I can't read the Chinese in the Big5
file either.

Plus, when I process the Big5 file on the Mac, I always get warning
messages like:

***
In grep(paste("\"", name, "\"", sep = ""), x) :
input string 315 is invalid in this locale
***

I thought R for Mac supported UTF-8? Then why can I process only the
Big5 version and not the UTF-8 one? I really need someone to fill me
in on this encoding puzzle. Many thanks in advance.


##Additional info:
##OS: Mac OS X 10.6.5 (Language: English, Format: Taiwan)
##I've made this setting in the Terminal: defaults write org.R-project.R force.LANG en_US.UTF-8
> Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8"


-----
Alvin Chen
Ph.D. Candidate
Taiwan International Graduate Program,
Computational Linguistics and Chinese Language Processing (TIGP-CLCLP)
Academia Sinica, Taiwan

Stefan Th. Gries

Nov 19, 2010, 11:04:25 AM
to corplin...@googlegroups.com
Have you tried reading in the files with a file connection in which
you specify the encoding?

temp.connection <- file("dontusemacs.txt", open="r", encoding="ENCODING") # ;-)
x <- readLines(temp.connection)
close(temp.connection)

STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

Alvin Chen

Nov 19, 2010, 11:54:08 AM
to corplin...@googlegroups.com
Hi Stefan,

What's the difference between setting the encoding in scan() vs. in file()?

I just tried your suggestion. For the UTF-8 encoded txt file, I still cannot read it into a vector; I got warning messages when reading the lines:

> temp.connection <- file("di_001.unicode", open="r", encoding="UTF-8");
> x<- readLines(temp.connection)
Warning messages:
1: In readLines(temp.connection) :
  invalid input found on input connection 'di_001.unicode'
2: In readLines(temp.connection) :
  incomplete final line found on 'di_001.unicode'

But for the Big5 one, it still worked, and now I can read the Chinese characters in the R console (so this is better than using scan(), I guess?).

But why doesn't R support UTF-8? I remember that in Windows I also had to convert all the Unicode texts to Big5 for processing in R... However, a lot of programs only process Unicode... That bugs me...

Probably I should've stopped where the Mac is more powerful, image editing, and left the "business" to Ubuntu... LOL~

Alvin



--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To post to this group, send email to corplin...@googlegroups.com.
To unsubscribe from this group, send email to corpling-with...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/corpling-with-r?hl=en.




--
Alvin Chen

Stefan Th. Gries

Nov 19, 2010, 12:09:13 PM
to corplin...@googlegroups.com
see the help page for scan:

encoding: encoding to be assumed for input strings. If the value is
"latin1" or "UTF-8" it is used to mark character strings as known to
be in Latin-1 or UTF-8: ***it is not used to re-encode the input (see
fileEncoding). See also ‘Details’.***
and
If file is a character string and fileEncoding is non-default, or if
it is a not-already-open connection with a non-default encoding
argument, the text is converted to UTF-8 and declared as such (and the
encoding argument to scan is ignored). See the examples of readLines.

R *does* support Unicode; many people (and I) have used it with UTF-8
(e.g., Mandarin or Cyrillic) and UTF-16 (Japanese). Are you sure it's
not just an R console font issue or something?
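In practice, the difference is that fileEncoding (or an encoding= on the connection) converts the bytes to UTF-8 on the way in, while scan's own encoding= only labels them. A minimal round-trip sketch (the file is just a temporary placeholder):

```r
# Write one line through a Big5-encoded connection, then read it back
# with fileEncoding: the text is re-encoded to UTF-8 as it is read in.
tmp <- tempfile()
con <- file(tmp, open = "w", encoding = "big5")
writeLines("test line", con)   # plain ASCII survives any conversion
close(con)
x <- scan(tmp, what = "character", sep = "\n", fileEncoding = "big5")
x  # "test line"
```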

> Probably I should've stopped where the Mac is more powerful, image editing, and left the "business" to Ubuntu... LOL~

Indeed ... ;-)))

Cheers,

Alvin Chen

Nov 19, 2010, 12:31:11 PM
to corplin...@googlegroups.com
On Sat, Nov 20, 2010 at 1:09 AM, Stefan Th. Gries <stg...@gmail.com> wrote:
see the help page for scan:

encoding: encoding to be assumed for input strings. If the value is
"latin1" or "UTF-8" it is used to mark character strings as known to
be in Latin-1 or UTF-8: ***it is not used to re-encode the input (see
fileEncoding). See also ‘Details’.***
  and
If file is a character string and fileEncoding is non-default, or if
it is a not-already-open connection with a non-default encoding
argument, the text is converted to UTF-8 and declared as such (and the
encoding argument to scan is ignored). See the examples of readLines.

I didn't check the help pages before. It's just... I don't quite understand them, really. I also tried this:

 x <- scan(file(turn.list[current.turn],open="r",encoding="big5"), what="c",sep="\n");

And I got the same result. Is it the same?
 
R *does* support Unicode, many people and I have used it with UTF-8
(e.g., Mandarin or Cyrillic) and UTF-16 (Japanese). You're sure it's
not just a R console font issue or something?

I tried changing the font as well; it doesn't work. The question now is that R can't even read the UTF-8 file, only the Big5 one. I'm attaching a UTF-8 file here. Ideally I'd like to read in the file and do further processing. Can you successfully read it in, either by scan() or readLines()?

As I said, no matter whether the OS is Windows or Mac, I use UTF-8 encoding for Praat and Java. And yet whenever I have to do further processing of the texts in R, I always fail to get R to handle those UTF-8 files properly. But what's funny is that if I convert those text files into Big5, then R handles them well! That's what I don't understand.


Alvin
> Probably I should've stopped where the Mac is more powerful, image editing, and left the "business" to Ubuntu... LOL~
Indeed ... ;-)))

Cheers,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------





--
Alvin Chen
di_001.unicode

Alvin Chen

Nov 19, 2010, 12:31:55 PM
to corplin...@googlegroups.com
On Sat, Nov 20, 2010 at 1:31 AM, Alvin Chen <alvin...@gmail.com> wrote:
> I didn't check the help pages before. It's just... I don't quite understand them, really. I also tried this:

Sorry, I meant to say I DID check those pages before... = ="



--
Alvin Chen

Hardie, Andrew

Nov 19, 2010, 12:35:29 PM
to corplin...@googlegroups.com

The “invalid input found on input connection” error message is triggered if R’s call to iconv() results in the EILSEQ error code, which means an invalid byte sequence (one which can’t be parsed) has been encountered. So your file contains dodgy UTF-8 sequences.

 

Looking at the file you sent, it’s easy to see why: the encoding is big-endian UTF-16 (UTF-16BE), not UTF-8, and telling R to treat the former as the latter leads to errors.
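You can check for such sequences from within R itself. A minimal sketch: since iconv() returns NA for anything it cannot convert, a UTF-8-to-UTF-8 "conversion" flags exactly the elements with invalid byte sequences.

```r
# iconv() returns NA for any element it cannot convert, so converting
# from UTF-8 to UTF-8 is a cheap per-line validity check.
lines <- c("valid ASCII",
           rawToChar(as.raw(c(0xff, 0xfe, 0x41))))  # bytes that are not valid UTF-8
bad <- which(is.na(iconv(lines, from = "UTF-8", to = "UTF-8")))
bad  # 2: the second element contains an invalid sequence
```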

 

best

 

Andrew.

Stefan Th. Gries

Nov 19, 2010, 12:36:47 PM
to corplin...@googlegroups.com
Ahem, this file is not UTF-8!!!!! The following works perfectly:

> temp.connection <- file("/home/stgries/Desktop/di_001.unicode", open="r", encoding="UTF-16BE")
> x <- readLines(temp.connection)
> close(temp.connection)
> head(x)

[1] "File type = \"ooTextFile\"" "Object class = \"TextGrid\""
[3] "" "xmin = 0 "
[5] "xmax = 18.788004535147394 " "tiers? <exists> "

Alvin Chen

Nov 19, 2010, 12:38:14 PM
to corplin...@googlegroups.com
Hi Andrew,
Is there a quick way in R to get the encoding of the file?

Alvin

Alvin Chen

Nov 19, 2010, 12:43:03 PM
to corplin...@googlegroups.com
uhm....thank god. I'm totally confused with the UTF-8, UTF-16, and the term unicode in java....





--
Alvin Chen

Hardie, Andrew

Nov 19, 2010, 12:55:11 PM
to corplin...@googlegroups.com

What Java means by “Unicode”, internally at least, is UTF-16BE. What Windows or MS Office means by “Unicode” is UTF-16LE (unless it is specified as big-endian).

 

The quick way to get the encoding of a file is to open it *as if it were Latin-1* or another 8-bit encoding, and load the first three bytes (characters).

 

If the first byte is 0xfe and the second byte is 0xff, it’s UTF-16BE.

If the first byte is 0xff and the second byte is 0xfe, it’s UTF-16LE.

If the first byte is 0xef and the second byte is 0xbb and the third byte is 0xbf, it’s UTF-8.

 

If it doesn’t have any of those at the start, either (a) it’s a pre-Unicode character encoding or (b) it’s a Unicode file to which the leading bytes haven’t been added (in which case you can try to guess by attempting to parse the data as one of the three, and seeing if you get an error – but this quickly gets messy).
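The three checks above can be sketched in R with readBin() (the function name detect_bom is just an illustration):

```r
# Peek at the first three bytes of a file and report which Unicode
# byte-order mark (BOM), if any, it starts with.
detect_bom <- function(path) {
  bytes <- readBin(path, what = "raw", n = 3)
  if (length(bytes) >= 2 && identical(bytes[1:2], as.raw(c(0xfe, 0xff))))
    return("UTF-16BE")
  if (length(bytes) >= 2 && identical(bytes[1:2], as.raw(c(0xff, 0xfe))))
    return("UTF-16LE")
  if (length(bytes) >= 3 && identical(bytes[1:3], as.raw(c(0xef, 0xbb, 0xbf))))
    return("UTF-8")
  "no BOM"  # pre-Unicode encoding, or Unicode without the leading bytes
}
```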

 

best

 

Andrew.

 

From: corplin...@googlegroups.com [mailto:corplin...@googlegroups.com] On Behalf Of Alvin Chen
Sent: 19 November 2010 17:43
To: corplin...@googlegroups.com
Subject: Re: [CorpLing with R] Chinese encoding

 

uhm....thank god. I'm totally confused with the UTF-8, UTF-16, and the term unicode in java....

Alvin Chen

Nov 19, 2010, 1:00:41 PM
to corplin...@googlegroups.com
Hi Andrew,

You are my savior from this long-standing nightmare of encoding!!! Thank you so much! This REALLY helps! Now I can sleep :-) (It's 2 a.m. in Taipei now... sigh...)
Best,
Alvin

Alvin Chen

Nov 19, 2010, 11:23:59 PM
to corplin...@googlegroups.com
The encoding situation in R has become clearer to me.

Now I have an output question. The original text files are in UTF-16BE, and I can read them in Office 2011 for Mac (somehow, in the Mac version of Word, the encoding is called Unicode 5.1). Yet once I process those files in R and output the result to a csv file, the default encoding of R's output seems to be UTF-8 (?), and Office 2011 for Mac does not seem to handle UTF-8 properly.

So, how can I output a data.frame from R as a csv file in UTF-16BE encoding?

Alvin
--
Alvin Chen

Jacobo Myerston

Nov 20, 2010, 11:23:29 AM
to corplin...@googlegroups.com
You can output the file in UTF-8 and convert it with TextEdit to any
format the Mac allows.

Maybe it is not a good idea to use Word to process text files. TextEdit
should be enough, and it handles Unicode properly. OpenOffice may have
better Unicode support. Also, Apple Numbers handles Unicode better
than MS Excel. If what you want is to open a csv, which is a plain
text file separated by commas, try OpenOffice Calc, which is free, or
Numbers, which is not free but has native Unicode support.

Alvin Chen

Nov 21, 2010, 6:47:45 AM
to corplin...@googlegroups.com
Hi Jacobo,

Thanks for the recommendation.

I tried to open a UTF-8 csv file in Mac Numbers and the speed is pretty slow. That's why I liked using MS Excel (plus I'm more familiar with it).

But I just installed OpenOffice on the Mac and it worked perfectly well! What's appealing is that when I open a csv file, OpenOffice even shows a dialog asking which encoding the file is expected to be in!!! Love this function! I knew OpenOffice was good because it's free, but I didn't know it was so friendly. LOL~

But still, is there any way to output a text file in UTF-16 from R???

Alvin Chen

Hardie, Andrew

Nov 24, 2010, 9:09:19 AM
to corplin...@googlegroups.com

I would suggest: use write.table() with the file argument set not to a filename string, but to a connection that you have opened for writing with encoding="UTF-16BE"!

Alvin Chen

Nov 24, 2010, 9:47:14 AM
to corplin...@googlegroups.com
Hi Andrew,

I tried the following, but all I have is an empty file.

> file.output<-file("FILENAME.csv",open="w",encoding="UTF16BE");
> write.table(DATAFRAME, file=file.output);
> close(file.output);

Any clues?
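One possible culprit: iconv recognizes the hyphenated name "UTF-16BE", but usually not "UTF16BE", so nothing can be converted and the file stays empty. A minimal sketch with the hyphenated name (the data frame and the temporary file name are placeholders):

```r
# Note the hyphen in "UTF-16BE": with the unhyphenated "UTF16BE" the
# connection typically cannot convert anything and the file ends up empty.
df <- data.frame(word = "\u5b57", freq = 1)          # placeholder data frame
out.csv <- tempfile(fileext = ".csv")                # placeholder file name
file.output <- file(out.csv, open = "w", encoding = "UTF-16BE")
write.table(df, file = file.output, sep = ",", row.names = FALSE)
close(file.output)
file.size(out.csv) > 0                               # the file is no longer empty
```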

Best,

Alvin Chen