cforest predict errors


Nathaniel Roth

Oct 15, 2015, 10:02:38 AM
to Davis R Users' Group
Any thoughts on how to get this to work are appreciated.

I can build regression models successfully using the party::cforest function with either of the two formulas below. The first is the one that I should be using, because cforest and randomForest don't appear to recognize factors in a data frame. I can reapply the original data to both sets of models and calculate R-squared based on the OOB sample.

formula <- "dist_sum ~ factor(clkmeans) + hhsize + hhveh + hhemp + hhstu + hhlic + incom + factor(resty)"
formula2 <- "dist_sum ~ clkmeans + hhsize + hhveh + hhemp + hhstu + hhlic + incom + resty"


What I can't do is select a new subset of the original data and do a predict on it without getting the following error from the model that uses the original formula with factors. I get a predicted response for formula2.

Error in checkData(oldData, RET) : 
  Classes of new data do not match original data


Any ideas on how to submit newdata to a predict function when the model's formula uses "factor()" terms? That's the only difference between the two formulas.

Thank you,
Nate



Noam Ross

Oct 15, 2015, 12:12:06 PM
to Davis R Users' Group

While I’m not familiar with the party package (sadly), I have a guess. The model object created with formula has an internal representation of the data it was passed, and this includes a vector of data classes. In the model with formula, it expects clkmeans and resty to be factors, and in the model with formula2 it expects them to be whatever their original classes were. So when subsetting the data, you need to convert these variables like this:

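# take a subset of the original data (olddata stands in for your data frame)
predict_data = olddata[1:100,]
# convert the two categorical columns so their class matches what the model expects
predict_data$clkmeans = as.factor(predict_data$clkmeans)
predict_data$resty = as.factor(predict_data$resty)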

You might also run into an issue with column names, in which case you’ll have to convert the names of clkmeans and resty to "factor(clkmeans)" and "factor(resty)".
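
A minimal sketch of the conversion idea above, using only base R and made-up data (the values and the 20-row data frame are assumptions for illustration; only the column names come from this thread). It converts the categorical columns once in the data frame itself, then checks that a row subset carries the same classes and levels. Whether cforest then accepts the subset is not tested here.

```r
# Toy stand-in for the original data; values are invented for illustration.
set.seed(1)
olddata <- data.frame(
  dist_sum = runif(20, 0, 50),
  clkmeans = sample(1:3, 20, replace = TRUE),
  hhsize   = sample(1:5, 20, replace = TRUE),
  resty    = sample(1:2, 20, replace = TRUE)
)

# Convert the categorical columns once, before fitting, so the formula
# would not need factor() wrappers.
olddata$clkmeans <- factor(olddata$clkmeans)
olddata$resty    <- factor(olddata$resty)

# A row subset taken afterwards keeps the same classes and level sets.
predict_data <- olddata[1:10, ]

classes_match <- identical(sapply(olddata, class), sapply(predict_data, class))
levels_match  <- identical(levels(olddata$clkmeans), levels(predict_data$clkmeans))
print(classes_match)
print(levels_match)
```

Comparing `sapply(..., class)` on the training data and the new data is a quick way to see which column triggers a class mismatch.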



Nathaniel Roth

Oct 17, 2015, 8:25:14 PM
to Davis R Users' Group
The predict function for the party package (and cforest specifically) appears to be unstable. 

For example, once I've built my random forest, I can pull a subset of the training data for use as a testing data set to test the mechanics of the predict function.

I can add new rows to this data set and run predict on them as long as I'm careful not to include any factor levels that aren't in the training data set (that's as it should be). However, if I remove rows from the working testing data set, I then get the following error:

 Error in checkData(oldData, RET) : 
  Levels in factors of new data do not match original data

This doesn't make sense to me. Why would removing records from the testing data cause this error, which I believe indicates that a factor level present in the testing data set was not in the training one? The factors in the testing data frame have the same levels as those in the training data frame even though not all levels are represented in the reduced data set, but they weren't all represented in the working testing data set either.
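
One possible explanation, offered as a guess and not verified against party's source: if a formula term such as factor(clkmeans) is re-evaluated on the new data at predict time, the levels are rebuilt from whatever values remain after rows are removed. The base R sketch below (made-up data, only the column name comes from this thread) shows that behaviour and one way to pin the levels to the training set.

```r
# Made-up training column with three levels.
train <- data.frame(clkmeans = factor(c(1, 2, 3, 1, 2, 3)))

# A test subset that happens to contain only two of the three values.
test <- train[train$clkmeans != "3", , drop = FALSE]

# Plain row subsetting keeps the full level set ...
kept <- levels(test$clkmeans)

# ... but re-applying factor() rebuilds the levels from the values present.
rebuilt <- levels(factor(test$clkmeans))

print(kept)
print(rebuilt)

# Pinning the levels to the training data makes the two agree again.
test$clkmeans <- factor(test$clkmeans, levels = levels(train$clkmeans))
same_levels <- identical(levels(test$clkmeans), levels(train$clkmeans))
print(same_levels)
```

If this is what is happening, converting the columns in the data frame before fitting and dropping factor() from the formula would avoid the re-evaluation.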

Regardless, by adding the synthetic records that I need to test, I've been able to accomplish what I need to. It's just not very elegant.

Thank you Noam, your suggestions were helpful, though not a true solution.
Nate