Re: Error in predict randomForest(fit, test): Type of predictors in new data do not match that of the training data

Tomislav Hengl

Mar 30, 2016, 9:20:44 AM

to Issoufou Ouedraogo, global-soil...@googlegroups.com

Hi Issoufou,

I think the standard randomForest::randomForest function does, in fact,
have problems handling factor-type covariates, for example factors with
many levels
(http://stats.stackexchange.com/questions/49243/rs-randomforest-can-not-handle-more-than-32-levels-what-is-workaround).
One way around it is to convert the factor covariates in the original
data frame to Principal Components (first to 0/1 indicators, one per
factor level, then to PCs). Here is an example:

http://gsif.r-forge.r-project.org/spc.html
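
The indicator-then-PC idea can be sketched in base R as follows. This is
only a minimal illustration, not the GSIF implementation linked above: the
toy data frame and its values are made up (only the column names are
borrowed from the question), model.matrix() and prcomp() are used here as
stand-ins, and the 1% variance cutoff is an arbitrary assumption.

```r
# Hypothetical toy data: column names follow the question, values are invented.
train <- data.frame(
  Aquifer.media = factor(c("Crystalline rocks", "Crystalline rocks",
                           "Unconsolidated sediments", "Crystalline rocks",
                           "Unconsolidated sediments", "Crystalline rocks")),
  Climat.Class  = factor(c("Dry sub-Humid", "Humid", "Arid",
                           "Humid", "Arid", "Dry sub-Humid")),
  Recharge      = c(120, 300, 15, 280, 22, 140)
)

# Step 1: expand every factor into 0/1 indicator columns, one per level.
# contrasts = FALSE requests the full (identity) coding instead of dropping
# a reference level.
is_fac      <- vapply(train, is.factor, logical(1))
full_coding <- lapply(train[is_fac], contrasts, contrasts = FALSE)
ind <- model.matrix(~ . - 1, data = train, contrasts.arg = full_coding)

# Step 2: principal components of the indicator + numeric matrix.
pca <- prcomp(ind, center = TRUE, scale. = TRUE)

# Step 3: keep only components that carry a non-negligible share of the
# variance (the 1% cutoff is an assumption, tune it for real data).
var_share <- pca$sdev^2 / sum(pca$sdev^2)
pcs <- as.data.frame(pca$x[, var_share > 0.01, drop = FALSE])

colnames(ind)
head(pcs)
```

The columns of pcs are all numeric, so they can be used as predictors in
place of the original factors. New data would have to be expanded with the
same factor levels and projected with predict(pca, newdata = ...) before
prediction.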

You can almost always drop the last 2-3 components (if the number of PCs
is >>10) because they only contain noise.

Alternatively, you could clean up the levels of each factor variable
before you use it for further model building and prediction. Here is an
example of how you can drop all levels that have fewer than 5
observations:

https://github.com/ISRICWorldSoil/GSIF_tutorials/blob/master/eberg/soilmaps_MLA.R#L30
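
A minimal sketch of this kind of clean-up for a single factor. The toy
vector and the threshold variable min_obs are hypothetical, and this is
not the code from the linked script:

```r
# Hypothetical factor with one rare level ("Volcanic rocks", 2 observations).
x <- factor(c(rep("Crystalline rocks", 7),
              rep("Unconsolidated sediments", 6),
              rep("Volcanic rocks", 2)))

min_obs <- 5                                # assumed threshold
counts  <- table(x)                         # observations per level
rare    <- names(counts)[counts < min_obs]  # levels that are too small

x_clean <- x
x_clean[x_clean %in% rare] <- NA            # mask observations of rare levels
x_clean <- droplevels(x_clean)              # remove the now-unused levels

table(x_clean, useNA = "ifany")
```

Rows that end up as NA still need to be removed (or the rare levels merged
into a broader class instead), because randomForest() stops on missing
predictor values with its default na.action.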

Also take a look at the function:

https://stat.ethz.ch/R-manual/R-devel/library/base/html/droplevels.html
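
droplevels() removes levels that no longer occur in a factor. For
prediction the opposite step is often needed as well: one common cause of
the error in the subject line is a factor in the new data whose set of
levels differs from the one in the training data. A minimal sketch,
assuming toy data frames (values made up, column name borrowed from the
question) and a hypothetical helper align_levels():

```r
# Hypothetical data: the regional training set covers more climate classes
# than the local test set.
train <- data.frame(Climat.Class = factor(c("Arid", "Humid", "Dry sub-Humid",
                                            "Semi-arid", "Humid")))
test  <- data.frame(Climat.Class = factor(c("Semi-arid", "Arid", "Semi-arid")))

nlevels(train$Climat.Class)  # 4 levels in the training data
nlevels(test$Climat.Class)   # only 2 levels in the test data

# Re-encode every factor column of `new` with the levels of the matching
# column in `ref`. Values that never occurred in `ref` become NA.
align_levels <- function(new, ref) {
  for (nm in intersect(names(new), names(ref))) {
    if (is.factor(ref[[nm]])) {
      new[[nm]] <- factor(as.character(new[[nm]]), levels = levels(ref[[nm]]))
    }
  }
  new
}

test_aligned <- align_levels(test, train)
identical(levels(test_aligned$Climat.Class), levels(train$Climat.Class))
```

Note that since R 4.0.0 read.table() no longer converts strings to factors
by default, so with a current R version the columns may also need an
explicit factor() call before any levels can be compared.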

PS: You might also want to check the caret package for fine-tuning
randomForest modeling. Here are some examples:

http://www.r-bloggers.com/predictive-modelling-fun-with-the-caret-package/

HTH,

T. (Tom) Hengl
Researcher @ ISRIC - World Soil Information
Team member Africa Soil Information Services http://africasoils.net
Url: http://www.wageningenur.nl/en/Persons/dr.-T-Tom-Hengl.htm
Network: http://profiles.google.com/tom.hengl
Publications: http://scholar.google.com/citations?user=2oYU7S8AAAAJ

On 30-3-2016 12:00, Issoufou Ouedraogo wrote:
> Dear Mr. Hengl,
>
> Hello!
>
> I am a PhD student in Belgium. I am contacting you because I read your paper on random forest regression, "Mapping Soil Properties of Africa at 250 m Resolution: Random Forests Significantly Improve Current Predictions".
>
> I tried a random forest model in my research, but I ran into a problem during the validation phase (using independent data).
>
> Here is my short code for the random forest model:
>
> library(randomForest)
> Traindata <- read.table("C:/Users/iouedraogo/Desktop/Tester/MoyData_correction final.txt", header=TRUE, sep="\t", na.strings="NA", dec=",", strip.white=TRUE)
> # Fit the regression forest on the training data frame read above
> rf <- randomForest(Ln.NO3._mean ~ Aquifer.media + Recharge + Climat.Class + Population.density..people.km2. + Rainfall.Class, mtry=4, ntree=1000, data=Traindata, importance=TRUE)
>
> rf
> predict(rf)
>
> Testdata <- read.table("C:/Users/iouedraogo/Desktop/Tester/Random_Forest_Factors_Fin.txt", header=TRUE, sep="\t", na.strings="NA", dec=",", strip.white=TRUE)
>
> predict(rf, Testdata)
>
> When I run the step predict(rf, Testdata), this message appears: "Type of predictors in new data do not match that of the training data".
>
> I used levels() to list the categories of my factor variables. For example, for these two variables:
> levels(Traindata$Aquifer.media), levels(Testdata$Aquifer.media)
> levels(Traindata$Climat.Class), levels(Testdata$Climat.Class)
>
> The first 5 rows of each data set are:
>
> head(Traindata, 5)
>
> For Traindata:
> Aquifer.media                    Climat.Class
> Crystalline rocks                Dry sub-Humid
> Crystalline rocks                Humid
> Crystalline rocks                Humid
> Crystalline rocks                Dry sub-Humid
> Unconsolisated sediments rocks   Arid
>
> head(Testdata, 5)
>
> For Testdata:
> Aquifer.media                    Climat.Class
> Crystalline rocks                Semi-arid
> Crystalline rocks                Semi-arid
> Crystalline rocks                Semi-arid
> Crystalline rocks                Arid
> Crystalline rocks                Semi-arid
>
> We observe that the values do not match between Traindata and Testdata.
> I think it would perhaps be a mistake to combine Traindata and Testdata, because Traindata are observations at the regional scale and Testdata are observations at the local (small) scale. My objective was to develop a regional-scale model using Traindata and then use the independent data (here Testdata) to validate the random forest model.
>
> If you look closely at Climat.Class in Traindata, you can see that it covers several climatic conditions because of the large study area, whereas Climat.Class in Testdata covers only the local scale. I think this difference in scale is the fundamental reason for the mismatch between the two data sets.
>
> How can I solve this problem? If you have any ideas, please let me know.
>
>
>
> Best regards.
>
>
>

Issoufou Ouedraogo

Mar 30, 2016, 9:37:12 AM

to Tomislav Hengl, global-soil...@googlegroups.com

Dear Mr Hengl,

Thank you very much for your explanations.

I am going to try again with your great suggestions.

Best regards

Issoufou Ouedraogo

PhD Student

Earth and Life Institute/ Environmental Sciences

Université Catholique de Louvain/Belgique

Croix du sud 2, bte 1, B-1348, Louvain-la-Neuve, Belgium

Tel: +32 (0)10 47 37 19

Fax: +32 (0)10 47 47 45
