Scanning text file warning message: "embedded nul(s) found in input"

Viola Wiegand

Jul 15, 2015, 9:41:34 AM

to corplin...@googlegroups.com

Dear all,

I am trying to generate a frequency list of my own corpus. When I applied the QCLWR instructions for generating a frequency list of an unannotated corpus (p. 106 ff.) to my own data, the following warning message was returned:

> textfile<-scan(file.choose(), what="char", sep="\n", quote="", comment.char="")

Read 69745 items

Warning message:

In scan(file.choose(), what = "char", sep = "\n", quote = "", comment.char = "") :

embedded nul(s) found in input

Does anyone know what the 'embedded nul(s)' mean and whether they imply that there's a serious problem with my data?

I had merged all text files into one file using the Mac command line (cat [file names] -> all-merged.txt). Is it possible that something went wrong and corrupted the data?

Thanks!

Best,

Viola

Stefan Th. Gries

Jul 15, 2015, 9:49:16 AM

to CorpLing with R

I have never had this error before but this site provides a few
suggestions: <http://stackoverflow.com/questions/23209464/get-embedded-nuls-found-in-input-when-reading-a-csv-using-read-cv>.
I'd try the skipNul tweak first; if that doesn't work, check the line
breaks by looking at the file in a good text editor that shows you whether
you have \r\n or \n; and finally try the encoding specification.
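
For anyone who would rather check from within R than in a text editor, here is a minimal sketch of how the raw bytes of a file could be inspected. A nul is a byte with the value 0x00, which R character strings cannot contain, hence the warning. The helper name `inspect_bytes` and the small demo file are made up for illustration; the function reads the whole file into memory, which should be fine for corpus files of a few MB.

```r
# Hypothetical helper (not from QCLWR): count nul bytes and line-ending types.
inspect_bytes <- function(path) {
  # Read the file as raw bytes so that nothing is decoded or dropped.
  bytes <- readBin(path, what = "raw", n = file.size(path))
  n <- length(bytes)
  is_lf <- bytes == as.raw(0x0a)  # \n
  is_cr <- bytes == as.raw(0x0d)  # \r
  # A Windows line break is a \r immediately followed by a \n.
  crlf <- if (n > 1) sum(is_cr[-n] & is_lf[-1]) else 0L
  list(
    nul_bytes = sum(bytes == as.raw(0x00)),
    crlf      = crlf,
    lf_only   = sum(is_lf) - crlf
  )
}

# Demo on a temporary file with one nul byte and mixed line endings.
demo_file <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("one\r\ntw"), as.raw(0x00), charToRaw("o\nthree\n")),
         demo_file)
info <- inspect_bytes(demo_file)
str(info)
```

If the nul count is very high and roughly every other byte is a nul, the file may be in a two-byte encoding such as UTF-16 rather than corrupt, in which case the encoding specification (the `fileEncoding` argument of `scan`) is probably the more appropriate fix than `skipNul`.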

HTH,
STG
--
Stefan Th. Gries
----------------------------------
Univ. of California, Santa Barbara
http://tinyurl.com/stgries
----------------------------------

Viola Wiegand

Jul 15, 2015, 5:29:12 PM

to corplin...@googlegroups.com

Thank you for this, it seems to have worked (at least there is no warning message) and it now says it has read more items:

> textfile<-scan(file.choose(), what="char", sep="\n", quote="", comment.char="", skipNul=TRUE)

Read 70992 items

In future work I might try just working with the separate files though.

Viola

Stefan Th. Gries

Jul 15, 2015, 5:33:40 PM

to CorpLing with R

> Thank you for this, it seems to have worked (at least there is no warning message) and it now says it has read more items:
>> textfile<-scan(file.choose(), what="char", sep="\n", quote="", comment.char="", skipNul=TRUE)
> Read 70992 items

If that's the right number of lines (as determined by, say, a text
editor), great!
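
The same comparison can be sketched in R itself by counting the newline bytes in the file and comparing that count with the number of items `scan` reports. The helper name `count_newlines` and the demo file are assumptions for illustration only.

```r
# Hypothetical helper: count newline bytes (0x0a) in a file.
count_newlines <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  sum(bytes == as.raw(0x0a))
}

# Demo file: three lines, one of which contains an embedded nul.
demo_file <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("alpha\nbe"), as.raw(0x00), charToRaw("ta\ngamma\n")),
         demo_file)

# Same call as in the thread, plus quiet=TRUE to suppress the "Read ..." message.
items <- scan(demo_file, what = "char", sep = "\n", quote = "",
              comment.char = "", skipNul = TRUE, quiet = TRUE)

c(items_read = length(items), newline_bytes = count_newlines(demo_file))
```

Note that the two numbers need not match exactly on real data: by default `scan` skips blank lines (`blank.lines.skip = TRUE`), and a last line without a final newline is still read as an item, so a small difference is not necessarily a sign of trouble.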

> In future work I might try just working with the separate files though.

Might slow down things but probably makes debugging easier ...

Viola Wiegand

Jul 15, 2015, 7:18:30 PM

to corplin...@googlegroups.com

Many thanks for your prompt replies!

I just found out that the mistake was in a previous step: I had converted Word files to text files. One single text file (out of 94 items) was corrupt because the original Word file contained many photographs. This produced so much gibberish that the merged corpus file was over 18 MB (over 70,000 lines), and after fixing that one file it was just 4 MB (11,000 lines)...
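
A problem like this can be located before merging by checking each file separately. Below is a minimal sketch, assuming the converted files sit in one folder and end in .txt; the helper name `find_suspect_files` and the demo folder are hypothetical. It reports the size and the number of nul bytes per file, sorted so that the most suspicious files come first.

```r
# Hypothetical helper: size and nul-byte count for every .txt file in a folder.
find_suspect_files <- function(dir) {
  paths <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  report <- data.frame(
    file       = basename(paths),
    size_bytes = file.size(paths),
    nul_bytes  = vapply(paths, function(p) {
      bytes <- readBin(p, what = "raw", n = file.size(p))
      sum(bytes == as.raw(0x00))
    }, integer(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
  # Most nul bytes first; ties are broken by file size (largest first).
  report[order(-report$nul_bytes, -report$size_bytes), ]
}

# Demo: a temporary folder with one clean file and one file containing nuls.
corpus_dir <- tempfile("corpus_demo")
dir.create(corpus_dir)
writeLines(c("a clean line", "another clean line"),
           file.path(corpus_dir, "clean.txt"))
writeBin(c(charToRaw("text"), as.raw(c(0x00, 0x00, 0x00)), charToRaw("\n")),
         file.path(corpus_dir, "broken.txt"))

report <- find_suspect_files(corpus_dir)
print(report)
```

Nul bytes are only one possible symptom of a bad conversion, which is why the file size is reported as well: a file that is far larger than the others, as in the case described above, is worth opening in a text editor even if its nul count is zero.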