"Error in substr(lines, starts, stops) :
invalid multibyte string at '<97>..."
I assume this error came up while the function "exact.matches" was
running, since I don't call substr() anywhere in my own script and
exact.matches does. I've received this error message before, when my
text files had "smart" quotes rather than straight ones.
My question for you R enthusiasts is: How can I get around this
multibyte string error, if I can?
Thanks. Earl Brown
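The solution is to recode the strings to UTF-8 before doing anything
else with them. The iconv() function will do this for you, and
iconvlist() tells you which encodings are available on your system. The
<97> in your error message is the byte 0x97, which is the em dash in
Windows-1252, so that is most likely the "from" encoding you need.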
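As a minimal sketch of that recoding step, assuming the text really is
Windows-1252 (the sample bytes and the variable names are made up for
illustration):

```r
# Hypothetical input: the bytes of "a", space, 0x97, space, "b", as they
# would sit in a Windows-1252 file (0x97 is the em dash in that encoding).
raw_line <- rawToChar(as.raw(c(0x61, 0x20, 0x97, 0x20, 0x62)))

# Recode to UTF-8 before any other string function touches the text.
fixed <- iconv(raw_line, from = "WINDOWS-1252", to = "UTF-8")

# The dash is now one valid UTF-8 character (U+2014), so substr()
# can index it by character position.
dash <- substr(fixed, 3, 3)
```

iconv() is vectorised, so the same call should work on the whole
character vector returned by readLines().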
If Windows-1252 is not available as a supported encoding, then you can
use ISO-8859-1 instead as the "from" encoding. In that case make sure to
set the sub argument of iconv() to something like "?": the bytes from
0x80 to 0x9f are Microsoft-specific additions in Windows-1252 and have
no printable characters assigned in ISO-8859-1, so they may fail to
convert, or may not come out as the punctuation you expect, if the
"from" encoding is ISO-8859-1.
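One way this fallback could be wired up is sketched below. The helper
name to_utf8() and the list of candidate encoding names are my own
assumptions, not part of any package, and the names reported by
iconvlist() vary between platforms:

```r
# Hypothetical helper: convert to UTF-8, preferring Windows-1252 and
# falling back to ISO-8859-1 when no Windows-1252 name is listed.
to_utf8 <- function(x) {
  # iconvlist() may be unavailable on some systems; treat that as "nothing listed".
  available <- tryCatch(toupper(iconvlist()), error = function(e) character(0))
  candidates <- c("WINDOWS-1252", "CP1252", "ISO-8859-1")
  hit <- candidates[candidates %in% available]
  from <- if (length(hit) > 0) hit[1] else "ISO-8859-1"
  # sub = "?" substitutes any byte that cannot be converted, instead of
  # letting the whole string come back as NA.
  iconv(x, from = from, to = "UTF-8", sub = "?")
}

# Same made-up sample bytes as a Windows-1252 file would contain.
converted <- to_utf8(rawToChar(as.raw(c(0x61, 0x20, 0x97, 0x20, 0x62))))
```

On some iconv implementations ISO-8859-1 maps these bytes to invisible
control characters rather than rejecting them, so it is worth inspecting
the converted text either way.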
An alternative is to write your own function to globally replace each of
the "smart" punctuation marks with an ASCII equivalent (e.g. straight
quotes for the smart quotes, a plain hyphen for the en dash and em
dash). This has the advantage of making searches simpler later on -- if
you are ever going to be in the position of needing to search for
punctuation!
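A rough sketch of such a function follows. The name smart_to_ascii() and
the particular set of characters it covers are my own choices for
illustration, and it assumes the text has already been converted to
UTF-8:

```r
# Hypothetical helper: swap common "smart" punctuation for ASCII.
smart_to_ascii <- function(x) {
  # Curly single quotes, curly double quotes, en dash, em dash.
  from <- c("\u2018", "\u2019", "\u201c", "\u201d", "\u2013", "\u2014")
  to   <- c("'",      "'",      "\"",     "\"",     "-",      "-")
  for (i in seq_along(from)) {
    # fixed = TRUE treats each character literally, not as a regex.
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

cleaned <- smart_to_ascii("\u201cdon\u2019t\u201d \u2013 ok \u2014")
# cleaned is now: "don't" - ok -
```

Extending the two vectors is enough to cover other marks, such as the
ellipsis character, if your texts contain them.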
Hope this is useful.
best
Andrew.
Andrew Hardie
Linguistics and English Language
Bowland College, Lancaster University
Lancaster LA1 4YT
United Kingdom
http://www.ling.lancs.ac.uk/staff/hardie