temp.connection <- file("dontusemacs.txt", open="r", encoding="ENCODING") # ;-)
x <- readLines(temp.connection)
close(temp.connection)
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------
--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To post to this group, send email to corplin...@googlegroups.com.
To unsubscribe from this group, send email to corpling-with...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/corpling-with-r?hl=en.
see the help page for scan:
encoding: encoding to be assumed for input strings. If the value is
"latin1" or "UTF-8" it is used to mark character strings as known to
be in Latin-1 or UTF-8: ***it is not used to re-encode the input (see
fileEncoding). See also ‘Details’.***
and
If file is a character string and fileEncoding is non-default, or it
is a not-already-open connection with a non-default encoding
argument, the text is converted to UTF-8 and declared as such (and the
encoding argument to scan is ignored). See the examples of readLines.
R *does* support Unicode; many people (myself included) have used it
with UTF-8 (e.g., Mandarin or Cyrillic) and UTF-16 (Japanese). Are you
sure it's not just an R console font issue or something?
> Probably I should've stopped at where mac is more powerful, image-editing, and leave the "business" to the ubuntu...LOL~
Indeed ... ;-)))
Cheers,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------
On Sat, Nov 20, 2010 at 1:09 AM, Stefan Th. Gries <stg...@gmail.com> wrote:
> see the help page for scan:
> encoding: encoding to be assumed for input strings. If the value is
> "latin1" or "UTF-8" it is used to mark character strings as known to
> be in Latin-1 or UTF-8: ***it is not used to re-encode the input (see
> fileEncoding). See also ‘Details’.***
> and
> If file is a character string and fileEncoding is non-default, or it
> is a not-already-open connection with a non-default encoding
> argument, the text is converted to UTF-8 and declared as such (and the
> encoding argument to scan is ignored). See the examples of readLines.
I didn't check the help pages before. It's just... I don't quite understand them, really. I also tried this:
The “invalid input found on input connection” error message is triggered when R’s call to iconv() returns the EILSEQ error code, which means an invalid byte sequence (one which can’t be parsed) has been encountered. So your file contains dodgy UTF-8 sequences.
Looking at the file you sent, it’s easy to see why: the encoding is big-endian UTF-16BE, not UTF-8, and telling R to treat the former as the latter leads to errors.
best
Andrew.
> temp.connection <- file("/home/stgries/Desktop/di_001.unicode", open="r", encoding="UTF-16BE")
> x <- readLines(temp.connection)
> close(temp.connection)
> head(x)
[1] "File type = \"ooTextFile\"" "Object class = \"TextGrid\""
[3] "" "xmin = 0 "
[5] "xmax = 18.788004535147394 " "tiers? <exists> "
What Java means by “Unicode”, internally at least, is UTF-16BE. What Windows or MS Office means by “Unicode” is UTF-16LE (unless it is specified as big-endian).
The quick way to get the encoding of a file is to open it *as if it were Latin-1* or another 8-bit encoding, and load the first three bytes (characters).
If the first byte is 0xfe and the second byte is 0xff, it’s UTF-16BE.
If the first byte is 0xff and the second byte is 0xfe, it’s UTF-16LE.
If the first byte is 0xef and the second byte is 0xbb and the third byte is 0xbf, it’s UTF-8.
If it doesn’t have any of those at the start, either (a) it’s a pre-Unicode character encoding or (b) it’s a Unicode file to which the leading bytes haven’t been added (in which case you can try to guess by attempting to parse the data as each of the three and seeing whether you get an error, but this quickly gets messy).
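The byte-order-mark checks above can be sketched in R with readBin(). This is only a minimal illustration; detect_bom is a made-up helper name, not a built-in function:

```r
# Sketch: peek at the first bytes of a file and guess its Unicode
# encoding from the byte order mark (BOM), following the rules above.
detect_bom <- function(path) {
  bytes <- readBin(path, what = "raw", n = 3)   # read up to 3 raw bytes
  if (length(bytes) >= 2 && bytes[1] == as.raw(0xfe) && bytes[2] == as.raw(0xff))
    return("UTF-16BE")
  if (length(bytes) >= 2 && bytes[1] == as.raw(0xff) && bytes[2] == as.raw(0xfe))
    return("UTF-16LE")
  if (length(bytes) >= 3 && bytes[1] == as.raw(0xef) && bytes[2] == as.raw(0xbb) && bytes[3] == as.raw(0xbf))
    return("UTF-8")
  "no BOM (pre-Unicode encoding, or Unicode without a BOM)"
}
```

As noted, the last branch is where it gets messy: a missing BOM tells you nothing, and you are back to trial-and-error parsing.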
best
Andrew.
From: corplin...@googlegroups.com
[mailto:corplin...@googlegroups.com] On Behalf Of Alvin Chen
Sent: 19 November 2010 17:43
To: corplin...@googlegroups.com
Subject: Re: [CorpLing with R] Chinese encoding
Uhm... thank god. I'm totally confused by UTF-8, UTF-16, and the term “Unicode” in Java...
Maybe it is not a good idea to use Word to process text files. TextEdit
should be enough, and it handles Unicode properly. OpenOffice may have
better Unicode support. Apple Numbers also handles Unicode better
than MS Excel. If what you want is to open a CSV, which is a plain
text file separated by commas, try OpenOffice Calc, which is free, or
Numbers, which is not free but has native Unicode support.
I would suggest: use write.table() with the file argument set not to a filename string, but to a connection that you have opened for writing with encoding="UTF-16BE"!
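That suggestion can be sketched as follows; the data frame and temporary file here are invented for illustration:

```r
# Sketch: pass write.table() an already-open connection instead of a
# filename, so the output is re-encoded as UTF-16BE on the way out.
df <- data.frame(word = c("one", "two"), freq = c(10, 7))
tf <- tempfile(fileext = ".txt")               # stand-in for your real output file
out <- file(tf, open = "w", encoding = "UTF-16BE")
write.table(df, file = out, sep = "\t", quote = FALSE, row.names = FALSE)
close(out)
```

To read the result back, open the file the same way, i.e. with a connection declaring encoding = "UTF-16BE", as in the readLines() examples earlier in the thread.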