error in tolower

Kevin Parent

Dec 19, 2010, 8:39:30 PM

to corpling-with-r

I'm working with a new corpus that was compiled by our university, a corpus of maritime English. The problem is that when I use the command tolower (or, of course, toupper), I get an 'invalid multibyte string' error. What causes this? We haven't gotten all the kinks out yet, so part of this is debugging the corpus itself.

--

Kevin Parent, Ph.D, ACS, ALB
Korea Maritime University

Lt Governor of Education & Training, Korea Territorial Council (Toastmasters)
Schoolmasters: http://grou.ps/schoolmasters/
National Korea Toastmasters: http://grou.ps/koreatoastmasters

Stefan Th. Gries

Dec 19, 2010, 8:56:05 PM

to corplin...@googlegroups.com

It seems the corpus files contain characters that are not supported by
the encoding toupper/tolower works with ... I usually get that
when I load Western ISO files into my UTF-8 R on Linux.
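
The situation described above can be reproduced with a single made-up string. The sketch below assumes a UTF-8 session and a hypothetical word whose accented letter is stored as one Latin-1 byte; neither the word nor the choice of Latin-1 comes from the corpus itself. Declaring the source encoding to iconv before changing case is one way around the error.

```r
# Hypothetical sample: "café" with é stored as the single Latin-1 byte 0xE9,
# which on its own is not a valid UTF-8 sequence.
x <- "caf\xe9"

# In a UTF-8 session this call is expected to fail with an
# "invalid multibyte string" error; tryCatch captures the message
# instead of stopping. In a single-byte or C locale it may succeed,
# so no particular outcome is assumed here.
attempt <- tryCatch(tolower(x), error = function(e) conditionMessage(e))
print(attempt)

# Declare the actual source encoding and convert to UTF-8 first.
fixed <- iconv(x, from = "latin1", to = "UTF-8")
print(Encoding(fixed))

# Case conversion now has a well-formed string to work with.
print(toupper(fixed))
```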

STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

Kevin Parent

Dec 19, 2010, 9:06:23 PM

to corplin...@googlegroups.com

You're right, but they weren't obviously appearing as multibyte characters until I loaded the output into gedit, where it became pretty clear. This also explains why strsplit wasn't strsplitting certain lines.
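
The offending lines can also be located from inside R rather than in an editor. This is a minimal sketch, assuming R 3.3.0 or later (for validUTF8) and a hypothetical vector `corpus_lines` standing in for the lines read from a corpus file.

```r
# Hypothetical corpus lines; the second one holds the raw bytes 0xA1 0xAF.
corpus_lines <- c("all hands on deck",
                  "you\xa1\xafre on watch",
                  "steady as she goes")

# validUTF8() is TRUE for elements that are well-formed UTF-8,
# so negating it flags the lines that need attention.
bad <- which(!validUTF8(corpus_lines))
print(bad)

# Dump the first offending line as hex bytes so the stray
# sequence can be identified.
print(charToRaw(corpus_lines[bad[1]]))
```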

Thanks.

--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To post to this group, send email to corplin...@googlegroups.com.
To unsubscribe from this group, send email to corpling-with...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/corpling-with-r?hl=en.

Kevin Parent

Dec 19, 2010, 9:38:43 PM

to corplin...@googlegroups.com

Okay, the issue isn't foreign characters per se but corrupted ASCII characters that resulted from the transcribers using Word .doc or a similar format and the compiler saving them as .txt files. I asked about this before and know I need to paste the files into a Unicode file and not an ASCII one, but this time we're talking about 150 such files. Is there a way that R can automate this? I'm looking at iconv, but it's not helping much. For example, one problematic instance is:
"you\xa1\xafre" (appearing as you¡¯re in a text viewer)
which obviously is supposed to be 'you're', but iconv simply replaces it with NA or whatever I put in the sub argument.
Any help?
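
The NA behaviour described above is what iconv does when the bytes are not valid in the encoding named by `from`. The sketch below uses UTF-8 as a stand-in for `from`, since the thread does not say which value was actually passed; exact behaviour can also vary with the platform's iconv implementation.

```r
x <- "you\xa1\xafre"

# Claiming the input is UTF-8 makes the conversion fail for this
# element, so the result is NA.
as_na <- iconv(x, from = "UTF-8", to = "UTF-8")
print(as_na)

# sub = "byte" keeps the unconvertible bytes visible as <xx> hex codes,
# which at least shows which sequences need fixing.
marked <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")
print(marked)
```

The `sub` argument only substitutes for bytes that cannot be converted; it does not recover the intended character, which requires the correct `from` encoding.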

Stefan Th. Gries

Dec 19, 2010, 9:41:35 PM

to corplin...@googlegroups.com

Well, if you want to do it in R, you can loop over the files and just read
them in with a file connection and readLines with the right
encoding. If that's too much of a pain (and I can imagine it would be
for me), why not just use Notepad++ for this?
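
The loop suggested above could look roughly like the following. This is only a sketch: the function name `convert_dir`, the directory names, and the Latin-1 source encoding in the demo are all hypothetical, and it converts with iconv after reading rather than setting the `encoding` argument of the connection, so that the result does not depend on the session's locale.

```r
# Convert every .txt file in in_dir from the encoding `from` to UTF-8,
# writing the results under the same file names in out_dir.
convert_dir <- function(in_dir, out_dir, from) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  files <- list.files(in_dir, pattern = "\\.txt$", full.names = TRUE)
  for (f in files) {
    con <- file(f, open = "r")            # plain connection, bytes left as-is
    lines <- readLines(con, warn = FALSE)
    close(con)
    utf8 <- iconv(lines, from = from, to = "UTF-8")
    if (anyNA(utf8)) warning("unconvertible text in ", basename(f))
    # useBytes = TRUE writes the UTF-8 bytes without re-encoding them.
    writeLines(utf8, file.path(out_dir, basename(f)), useBytes = TRUE)
  }
  invisible(files)
}

# Demo on one made-up file containing "café" in Latin-1.
in_dir  <- file.path(tempdir(), "corpus_raw")
out_dir <- file.path(tempdir(), "corpus_utf8")
dir.create(in_dir, showWarnings = FALSE)
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)),
         file.path(in_dir, "sample.txt"))

convert_dir(in_dir, out_dir, from = "latin1")
result <- readLines(file.path(out_dir, "sample.txt"), encoding = "UTF-8")
print(result)
```

The approach only works if `from` names the encoding the files were really saved in; a wrong guess yields NA lines or silently wrong characters.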

Kevin Parent

Dec 19, 2010, 9:54:12 PM

to corplin...@googlegroups.com

Sorry, but why would Notepad++ be better for such a large number of files?


Stefan Th. Gries

Dec 19, 2010, 9:58:18 PM

to corplin...@googlegroups.com

You don't have to write a script and would just use 'Find in Files ...'

Earl Brown

Dec 20, 2010, 12:19:22 AM

to CorpLing with R

This may or may not help with your problem, but did you see Andrew
Hardie's response to a similar question I asked about a multibyte
string error? It is in this post:
http://groups.google.com/group/corpling-with-r/browse_thread/thread/5123b9694f389940/

Hardie, Andrew

Dec 20, 2010, 12:48:24 AM

to corplin...@googlegroups.com

Your problem is that \xa1\xaf is not a valid UTF-8 sequence. If it resulted from "save as text", then whoever did it must have saved it to a non-Unicode encoding. I thought it might be Big5, but the Big5 for curvy-apostrophe is \xa1\xa6 (close but not quite the same). Without knowing what encoding it actually is (and therefore what to tell iconv to change it from), you probably have no option except to work out a list of global search-and-replace operations for each of the dodgy sequences.
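
A list of global replacements like the one described above can be applied in R byte by byte. The sketch below is an illustration under assumptions: the helper `fix_bytes` and the one-entry table are hypothetical, and a real table would be built by inspecting the corpus for each stray sequence.

```r
# Parallel vectors: each stray byte sequence and the text to put in its place.
bad  <- c("\xa1\xaf")
good <- c("'")

# Apply every replacement in turn. fixed = TRUE treats the pattern as a
# literal string, and useBytes = TRUE matches raw bytes, so the invalid
# sequences never have to be interpreted as characters.
fix_bytes <- function(x, bad, good) {
  for (i in seq_along(bad)) {
    x <- gsub(bad[i], good[i], x, fixed = TRUE, useBytes = TRUE)
  }
  x
}

cleaned <- fix_bytes(c("you\xa1\xafre", "all clear"), bad, good)
print(cleaned)
```

If the source encoding is identified later, a single iconv call with the right `from` value would make the table unnecessary. One candidate that may be worth testing is EUC-KR (CP949), where the bytes A1 AF appear to encode the right single quotation mark, but nothing in this thread confirms that the files were saved that way.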

best

Andrew.

From: corplin...@googlegroups.com [mailto:corplin...@googlegroups.com] On Behalf Of Kevin Parent
Sent: 20 December 2010 02:39
To: corplin...@googlegroups.com
Subject: Re: [CorpLing with R] error in tolower
