Type Errors in drRead.table


Aleksander Eskilson

Jun 1, 2015, 10:25:49 AM
to tesser...@googlegroups.com
I've been attempting to pull in a sizable data set with the drRead.table family of commands. The data set is still small enough to load into the local session with the regular read.table API and then convert to a ddf. However, loading the data with drRead.table using the same parameters I pass to read.table results in column class errors. The commands I'm comparing are:

lab.data <- drRead.table("/path/to/file", output="/path/to/results", sep="|", header=TRUE, stringsAsFactors=FALSE)
# fails with: scan() expected a 'real' got 'E.U./DL'

lab.data <- read.table("/path/to/file", sep="|", header=TRUE, stringsAsFactors=FALSE)
# succeeds; I then create a ddf with lab.ddf <- ddf(lab.data)

I should also note that for other identically structured data sets, the errors will include things like scan() expected 'an integer', got '0.4'. 

Any thoughts on this apparent inconsistency in the API? 

Regards,
Alek

Ryan Hafen

Jun 1, 2015, 12:08:01 PM
to Aleksander Eskilson, tesser...@googlegroups.com
Hi Alek,

The most likely cause is that drRead.table cannot correctly guess the column classes for your particular data set. I’m guessing there are a lot of NAs for some variables, or some variables are numeric except for a few cases? By default it reads the first 1000 lines and guesses the column classes for the entire data set based on them. This is done for speed and to keep classes consistent across subsets. You can turn it off by setting autoColClasses=FALSE (see ?drRead.table), but the recommended thing to do in this case is to set colClasses explicitly, as you can with read.csv. In your case, since you can read the whole thing locally, you can cheat and get the column classes from that result and you’ll be safe. Let me know if this does/doesn’t clear up your issue.

Ryan
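
The behavior described above can be sketched in base R alone. This is only an illustration under stated assumptions: the file contents, the two-row sample (standing in for the 1000-line default), and the names tmp, guessed, and lab.classes are all hypothetical, and read.table with nrows is used here to imitate the sampling step rather than to show how drRead.table does it internally.

```r
# Hypothetical stand-in for the pipe-delimited file: 'value' looks numeric
# at the top but has a non-numeric entry further down.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id|value", "1|0.5", "2|1.2", "3|E.U./DL", "4|0.4"), tmp)

# Guess column classes from a leading sample (2 rows here).
head.rows <- read.table(tmp, sep = "|", header = TRUE, nrows = 2,
                        stringsAsFactors = FALSE)
guessed <- vapply(head.rows, function(x) class(x)[1], character(1))

# Forcing the guessed classes onto the whole file fails inside scan(),
# because 'value' was guessed as numeric. Capture the message instead of
# stopping.
res <- tryCatch(
  read.table(tmp, sep = "|", header = TRUE, colClasses = guessed),
  error = function(e) conditionMessage(e)
)
print(res)

# Classes taken from a full local read cover every row, so they are safe
# to pass explicitly.
full <- read.table(tmp, sep = "|", header = TRUE, stringsAsFactors = FALSE)
lab.classes <- vapply(full, function(x) class(x)[1], character(1))
print(lab.classes)

# The same vector could then be supplied to drRead.table (not run here;
# paths are the placeholders from the original post):
# lab.ddf <- drRead.table("/path/to/file", output="/path/to/results",
#                         sep="|", header=TRUE, stringsAsFactors=FALSE,
#                         colClasses=lab.classes)
```

Deriving colClasses from a full read is only practical when the data fits in the local session, as it does in this thread. For larger data the classes would have to be written out by hand.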



Aleksander Eskilson

Jun 1, 2015, 1:30:11 PM
to tesser...@googlegroups.com, aleksa...@gmail.com
Hi Ryan,

That implementation detail is quite interesting; I wasn't aware of it. The load works when I set autoColClasses=FALSE. Thanks for the advice on colClasses. In the future we'll be sure to set the classes explicitly.

Regards,
Alek