--
You received this message because you are subscribed to the Google Groups "CorpLing with R" group.
To unsubscribe from this group and stop receiving emails from it, send an email to corpling-with-r+unsubscribe@googlegroups.com.
To post to this group, send email to corpling-with-r@googlegroups.com.
Visit this group at https://groups.google.com/group/corpling-with-r.
For more options, visit https://groups.google.com/d/optout.
file1<-scan(file="_nInMar2MichaelDay.txt",what="char",sep="\n")
file.cleaned<-gsub("[”“]",'"',file1,perl=T) #straighten out typographic quotation marks
file.cleaned2<-gsub("[’‘]","'",file.cleaned,perl=T) #straighten out typographic apostrophes
cat(" # ",file.cleaned2,file="file_cleaned2.txt",sep="\n")
file.utf8<-scan(file="_nInMar2MichaelDay.txt",what="char",sep="\n",encoding = "UTF-8")
file.utf8.cleaned<-gsub("[“”]",'"',file.utf8,perl=T) #straighten out typographic quotation marks
file.utf8.cleaned2<-gsub("[’]","'",file.utf8.cleaned,perl=T) #straighten out typographic apostrophes
cat(" # ",file.utf8.cleaned2,file="file_cleaned2.txt",sep="\n")
<_nInMar2MichaelDay.txt>
"LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> Sys.setlocale("LC_ALL","en_US.UTF-8")
[1] ""
Warning message: In Sys.setlocale("LC_ALL", "en_US.UTF-8") : OS reports request to set locale to "en_US.UTF-8" cannot be honored
Not sure what to try next. Anyone know a convenient way to change the encoding of 2887 textfiles in one go?
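If the files really are all in one known encoding, a loop over iconv() is probably the simplest route. This is only a sketch: the folder name "corpus" and the assumption that the source files are Windows-1252 are illustrative, not taken from the thread.

```r
# Hypothetical batch re-encoder: convert every .txt file in a folder
# from Windows-1252 (an assumed source encoding) to UTF-8, in place.
files <- list.files("corpus", pattern = "\\.txt$", full.names = TRUE)
for (f in files) {
  txt <- readLines(f, encoding = "latin1", warn = FALSE)  # read the 8-bit lines
  txt <- iconv(txt, from = "CP1252", to = "UTF-8")        # re-encode
  writeLines(txt, f, useBytes = TRUE)                     # write UTF-8 bytes back
}
```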
All this sounds very much like the problem is with the character encoding of the script. That is, if the script is coded in Win 1252 (which most Windows editors will generate by default), doing replacements of literal quote marks will not produce the correct results (since the literals are represented as single bytes in the range x80 to x9f in 1252, which will match partial UTF-8 characters, thus scrambling the content). The script and the target files need to have the same encoding for replacements of this kind to work.
I suggest examining the script in an editor that makes the encoding transparent (e.g. Notepad++) and converting to UTF-8 if necessary…
best
Andrew.
Yeah, nearly all Windows text editors insert the BOM in UTF-8 by default; this is generally not what you want, especially if you’re going to be concatenating files, as you then get spurious zero-width no-break spaces mid-file. When reading UTF-8 files, unless they are from a known-safe source, I normally check for and discard the BOM first of all. (Which is a pretty simple regex: “^\x{feff}”, or “^\xef\xbb\xbf” as bytes.)
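In R, that check-and-discard step might look like this (a sketch only; the toy vector stands in for real scan() output):

```r
# Discard a leading BOM (U+FEFF) from the first element of a scanned file.
lines <- c("\ufeffFirst line", "Second line")  # toy stand-in for scan() output
lines[1] <- sub("^\ufeff", "", lines[1])       # character-level version
# A byte-level equivalent would target the sequence 0xEF 0xBB 0xBF instead.
```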
Andrew.
From: corplin...@googlegroups.com [mailto:corplin...@googlegroups.com]
On Behalf Of Alvin Chen
Sent: 18 December 2016 23:41
To: corplin...@googlegroups.com
Subject: Re: [CorpLing with R] R's apparent inability to handle UTF-8 encoding when cleaning up corpus files in R and outputting them into .txt
Another thing.
In my output text file, while all the curly quotes are taken care of (converted successfully), there is a <U+FEFF> at the beginning of the text file. (P.S. The output is in UTF-8.) Is it because the original file from Juho is UTF-8-BOM? But I can't scan() the input file with that encoding successfully. Processing UTF-8 files in Windows is really discouraging.
Alvin
From: corplin...@googlegroups.com [mailto:corplin...@googlegroups.com]
On Behalf Of Alvin Chen
Sent: 18 December 2016 22:38
To: corplin...@googlegroups.com
Subject: Re: [CorpLing with R] R's apparent inability to handle UTF-8 encoding when cleaning up corpus files in R and outputting them into .txt
Hi,
Non-English locale users here as well. Here's my print-screen.
Something weird, though: at first I tried Juho's script and ran into the same problem. What I did was print the lines containing the curly quotes to the console, then copy those curly quotes (from the console) back into the regular expressions in the script. After that, everything worked fine. Still no idea why...
Alvin
*****
Alvin C.-H. Chen, Ph.D. (陳正賢)
Assistant Professor
Department of English
National Taiwan Normal University
國立台灣師範大學英語系
Taipei City, 106, Taiwan
"utf-8-sig"
) for its Notepad program: Before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: 0xef
, 0xbb
, 0xbf
) is written."On Dec 18, 2016, at 6:53 PM, Hardie, Andrew <a.ha...@lancaster.ac.uk> wrote:
the output file becomes tainted with myriad junk characters such as � etc., and not only where the "exotic" characters are, but also at the beginning of each appended vector. The junk only appears if I perform gsub replacements before outputting, or add extra input with cat when outputting the vectors back out. If I simply scan the text files and then cat them back out using sep="\n", they stay clean. But then I cannot do the cleanup. Both Word and OpenOffice Writer badly choke when doing search and replace on a text file this large, and they don't even support backreferences, so they are no substitute for R.
The main point I was getting at was the encoding of the script, rather than the corpus text file. The issue of the BOM in the input is a side point really (although it does, I think, explain the appearance of “ï»¿” in JKR’s “junk”-filled output). – AH.
"LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252".
Trying to change it into anything that is specifically UTF-8 always returns
Warning message: In Sys.setlocale("LC_ALL", "en_US.UTF-8") : OS reports request to set locale to "en_US.UTF-8" cannot be honored
>> All my code in RStudio (when pasted into Notepad++) is
If you copy paste, then the encoding will change to whatever N++ is currently set to. (To the best of my understanding, internally in Windows operations, everything is UTF-16LE, with translations at either end as needed.)
>> write my gsub commands in Notepad++ instead of R, encode those commands in UTF-8-BOM, paste them into the R console and execute
This won’t work for the same reason. The encoding will be adjusted by Windows upon paste into whatever encoding your R console is set for. And it seems pretty clear your console encoding is 1252 – witness all the 1252s in your locale string.
Anyway: My apologies for giving unhelpful advice based on a misunderstanding, I had thought you were running the script from a file whose encoding could be checked, rather than a pane in RStudio. Here’s some advice that might be more helpful, fingers crossed:
· I believe the reason that you get an error upon changing to “en_US.UTF-8” is that this is a Linux-style locale. “English_United States.65001” would be, I think, the Windows equivalent. (65001 = Windows-speak for UTF-8.) However, I have seen some notes online that say you can’t use any locale that includes 65001 – notably the MSDN documentation on setlocale() to be found here: https://msdn.microsoft.com/en-us/library/x99tb11d.aspx . So this may well also fail.
· If it fails, I can suggest two possibilities.
· One. Save your script in a file and invoke it with command-line R rather than from R studio. Then, you can make sure you have the correct encoding of the script by examining that file as it is on disk, using N++ or other tool, rather than trying to set the console encoding.
· Two, change the literal “, ”, ‘, ’ etc. in your regexes to hexadecimal character codes, which will not be vulnerable to encoding problems.
>> RStudio does not seem to have the option of changing the encoding of one's code to "UTF-8-BOM". Even when I check "select all encodings", "UTF-8-BOM" (or even "UTF-8-SIG") does not figure on the list of options.
That’s because these are not generally recognised encodings. They are somebody’s attempt to treat the versions with/without BOM as two separate versions of Unicode, but an attempt that has not had wide adoption.
best
Andrew.
From: corplin...@googlegroups.com [mailto:corplin...@googlegroups.com]
On Behalf Of Juho Kristian Ruohonen
Sent: 19 December 2016 09:40
To: CorpLing with R
Subject: Re: [CorpLing with R] R's apparent inability to handle UTF-8 encoding when cleaning up corpus files in R and outputting them into .txt
According to Notepad++, the encoding of the corpus files (including the sample one) is "UTF-8-BOM". Hence the BOMs at the beginning of each file. But while annoying, the BOMs are only a minor annoyance compared to the mess created by smart quotes and apostrophes.
--
file.utf8<-scan(file="n_e&s2may16.txt",what="char",sep="\n",encoding = "UTF-8") #Scan the file and specify the encoding as utf-8 so that it gets displayed correctly in R. The problem, of course, is that the file is not truly encoded in UTF-8 but UTF-8-BOM (courtesy of Notepad), so there is a mismatch of encodings between R and the scanned vector. Matching characters literally will therefore not work.
file.utf8.cleaned<-gsub("[\u201C\u201D]","\u0022",file.utf8,perl=T) #straighten out typographic quotation marks
file.utf8.cleaned2<-gsub("[\u2018\u2019]","\u0027",file.utf8.cleaned,perl=T) #straighten out typographic apostrophes
cat(" # ",file.utf8.cleaned2,file="file_utf8_cleaned2.txt",sep="\n") #output the file.
Hmm, Juho, from your comments in this code snippet, I see I explained the issue poorly once again.
The problem is *not* that “the file is not truly encoded in UTF-8 but UTF-8-BOM”. BOM does cause the problems in concatenating, of course, but that’s a separate issue to the punctuation searching.
Rather, the problem is that your console (and, therefore, your code) is using Windows-1252, which mismatches your data files encoded as UTF-8.
Because your code is in 1252, when you include in your regex, say, a literal curly quote “, its byte representation is 0x93. But in your files, which are UTF-8, the same character (U+201C) has the byte representation 0xE2, 0x80, 0x9C. The problem is then that searching for 0x93 does not match 0xE2, 0x80, 0x9C!
However, searching for 0x93 may match one of the bytes of some other UTF-8 sequence(s). Once 0x93 is replaced, any such sequence is corrupted, and the overall string may not be recognisable to R and/or the OS as UTF-8 any longer (thus it is treated as an 8-bit encoding, with many many things rendering incorrectly).
Using a \u escape instead works because (to oversimplify a bit) it causes R to insert the correct UTF-8 byte sequence into the string, regardless of the encoding of the file or console that contains the code.
I hope this fuller explanation clarifies the issue.
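The byte-level claims above can be checked directly in R (a quick sketch; the output comments assume a session where \u escapes produce UTF-8-marked strings, which is R's normal behaviour):

```r
# U+201C entered via a \u escape is stored as the three UTF-8 bytes described above:
charToRaw("\u201c")                                               # e2 80 9c
# Re-encoded to Windows-1252, the same character is the single byte 0x93:
iconv("\u201c", from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]] # 93
```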
>> I dunno how to fix that using gsub(), that's some next-level sh*t
Does search-and-replace on the regex ^\ufeff (with empty replacement string) not work to remove BOM? It ought to.
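A sketch of that removal with gsub() and an empty replacement string (the pattern is the one suggested above; the input line is made up):

```r
# Remove a leading BOM from a line read from a UTF-8-BOM file.
x <- "\ufeff # some header text"                # toy example, not the real corpus data
x.clean <- gsub("^\ufeff", "", x, perl = TRUE)  # empty replacement deletes the BOM
```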
best
Andrew.
From: corplin...@googlegroups.com [mailto:corplin...@googlegroups.com]
On Behalf Of Juho Kristian Ruohonen
Sent: 20 December 2016 13:46
To: CorpLing with R
Subject: Re: [CorpLing with R] R's apparent inability to handle UTF-8 encoding when cleaning up corpus files in R and outputting them into .txt
Hmm. My bad. I tried different character codes until I found the ones that matched. I think they're actually Unicode code points rather than hexadecimal byte values (I didn't know the difference at the time). So here's what I did that works:
--