classify new data

cory

Dec 21, 2011, 9:45:19 AM
to rtexttools-help
I can't seem to figure out how to classify new data. So, after a
model has been built, I'd like to use it to classify new data as it
comes in. Something along the lines of this...

matrix <- create_matrix(df$text, language="english", removeNumbers=TRUE,
                        stemWords=TRUE, weighting=weightTfIdf)
corpus <- create_corpus(matrix, as.numeric(df$hand_coded), trainSize=1:75,
                        testSize=76:100, virgin=FALSE)
models <- train_models(corpus,
                       algorithms=c("BAGGING","BOOSTING","GLMNET","MAXENT","NNET","TREE"))
result <- classify_models(corpus, models)

newdata <- create_matrix(newdf$text, language="english", removeNumbers=TRUE,
                         stemWords=FALSE, weighting=weightTfIdf)
newcorpus <- create_corpus(newdata, newdf$hand_coded, trainSize=1:2,
                           testSize=3:4, virgin=FALSE)
newresult <- classify_models(newcorpus, models)

The new data wouldn't have been hand coded, hence my desire to run the
model on them. Is this possible?

Thanks

Cory

Tim Jurka

Jan 2, 2012, 7:03:19 PM
to rtextto...@googlegroups.com
Hi Cory,

Sorry for the delayed reply; I've been out of town for the past few weeks.

The package documentation for RTextTools is available here: http://cran.r-project.org/web/packages/RTextTools/RTextTools.pdf . You'll want to set the "virgin" flag to TRUE in the create_corpus() function.

Best,
Tim


--
Timothy P. Jurka
Ph.D. Student
Department of Political Science
University of California, Davis
www.timjurka.com

cory

Jan 3, 2012, 10:43:16 AM
to rtexttools-help
Sure, of course. I've tried that. The issue I have is that
apparently, trainSize cannot be missing or equal to 0. In my case,
with "new" data, I don't want to do any training with it, just
classification.

So, if I set trainSize to 1, and try to classify the rest of my "new"
data, I get an error...

newcorpus <- create_corpus(matrix2, as.numeric(df2$hand_coded), trainSize=1,
                           testSize=2:5, virgin=TRUE)
results <- classify_models(newcorpus, models)
Error in cbind2(1, newx) %*% (nbeta[[i]]) :
  Cholmod error 'A and B inner dimensions must match' at file ../MatrixOps/cholmod_ssmult.c, line 82

Any ideas?

cn


Tim Jurka

Jan 3, 2012, 4:14:09 PM
to rtextto...@googlegroups.com
Hi Cory,

This is a bug that I've fixed in the latest version of RTextTools (v1.3.3) which is awaiting approval on CRAN. I'm attaching the package here if you want to install it manually.

The update allows you to define overlapping ranges, so for trainSize you can use 1:2 (it doesn't matter what range you use, but it has to be a range), and for testSize you can use 1:100. When using the classify command, RTextTools will only use the documents defined in the testSize parameter.

Best,
Tim

RTextTools_1.3.3.tar.gz

cory

Jan 4, 2012, 2:33:31 PM
to rtexttools-help
Okay, I really appreciate your help. I'm getting there. I can get
the MAXENT model to work, but none of the other ones. Here's an
example with some fake data. These are the same results that I get
with my "real" data.

strings <- c("holiday scene was nice", "airplane was delayed",
             "santa for the holidays", "ugh, more family time for the holidays",
             "airplane was ontime, woot", "saw an airplane today",
             "beautiful holiday night")
codes <- c(1,2,1,1,2,2,1)
matrix <- create_matrix(strings, language="english", removeNumbers=TRUE,
                        stemWords=TRUE, weighting=weightTfIdf)
corpus <- create_corpus(matrix, codes, trainSize=1:7, testSize=1:7, virgin=FALSE)
model1 <- train_models(corpus, algorithms=c("MAXENT"))
model2 <- train_models(corpus, algorithms=c("GLMNET"))
new_strings <- c("saw another airplane today", "santa was at the holiday show")
new_matrix <- create_matrix(new_strings, language="english", removeNumbers=TRUE,
                            stemWords=TRUE, weighting=weightTfIdf)
new_corpus <- create_corpus(new_matrix, c(NA, NA), trainSize=1:2, testSize=1:2,
                            virgin=FALSE)
results <- classify_models(new_corpus, model1)
results <- classify_models(new_corpus, model2)
Error in cbind2(1, newx) %*% (nbeta[[i]]) :
  Cholmod error 'A and B inner dimensions must match' at file ../MatrixOps/cholmod_ssmult.c, line 82

So the glmnet model prediction bombs out, but with some cryptic error
message. Any ideas on this?

Thanks

Cory
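
The cryptic error above appears to come down to a shape mismatch: a linear model fitted on the training matrix holds one coefficient per training term plus an intercept, so a new matrix built independently, with a different set of term columns, cannot be multiplied against it. A minimal base R sketch of that situation follows. The term lists and the coefficient vector are made up for illustration, and nothing here is RTextTools or glmnet code.

```r
# Hypothetical vocabularies: five terms seen in training, three in the new data.
train_terms <- c("airplane", "delayed", "holiday", "santa", "scene")
new_terms   <- c("airplane", "saw", "today")

# One made-up coefficient per training term, plus one for the intercept.
beta <- matrix(seq_len(length(train_terms) + 1), ncol = 1)

# A new 2-document matrix with only the new-data terms as columns.
new_x <- matrix(1, nrow = 2, ncol = length(new_terms),
                dimnames = list(NULL, new_terms))

# Same shape of expression as the failing cbind2(1, newx) %*% nbeta[[i]]:
# a 2 x 4 design matrix cannot be multiplied by a 6 x 1 coefficient vector.
design <- cbind(1, new_x)
outcome <- tryCatch(design %*% beta,
                    error = function(e) conditionMessage(e))

print(dim(design))
print(dim(beta))
print(outcome)
```

The product only works when the number of columns in the new matrix (plus the intercept column) equals the number of coefficients, which is why the training and classification matrices need the same terms.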

Tim Jurka

Jan 4, 2012, 3:02:01 PM
to rtextto...@googlegroups.com
Hi Cory,

I'll look into the problem. It seems that the glmnet package ( http://cran.r-project.org/web/packages/glmnet/index.html ) requires that the classification matrix contain all terms that were in the original training matrix, even if they are not used.

Best,
Tim
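
The requirement described above (the classification matrix must contain every term from the training matrix, even unused ones) can be sketched in base R. The helper `align_to_vocabulary` and the toy data are hypothetical and not part of RTextTools or glmnet; the sketch only shows the idea of padding missing training terms with zeros and dropping terms the model never saw.

```r
# Hypothetical helper: reshape a new document-term matrix so that its columns
# are exactly the training terms, in the training order.
align_to_vocabulary <- function(new_dtm, train_terms) {
  # Start from all zeros: a training term absent from the new data stays 0.
  aligned <- matrix(0, nrow = nrow(new_dtm), ncol = length(train_terms),
                    dimnames = list(rownames(new_dtm), train_terms))
  # Copy over only the terms both matrices share; unseen terms are dropped.
  shared <- intersect(colnames(new_dtm), train_terms)
  aligned[, shared] <- new_dtm[, shared, drop = FALSE]
  aligned
}

train_terms <- c("airplane", "delayed", "holiday", "santa", "scene")

# Toy new data: doc1 contains "airplane" and "today", doc2 "holiday" and "show".
new_dtm <- matrix(c(1, 0,
                    1, 0,
                    0, 1,
                    0, 1),
                  nrow = 2,
                  dimnames = list(c("doc1", "doc2"),
                                  c("airplane", "today", "holiday", "show")))

aligned <- align_to_vocabulary(new_dtm, train_terms)
print(aligned)
```

After alignment the new matrix has the same column count and order as the training matrix, so a model that indexes terms by position sees each word where it expects it.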


cory

Jan 4, 2012, 4:10:38 PM
to rtexttools-help
Yeah, that was my guess. If I manage to hack something together to
make it work, I'll let you know.

cn


Mark Rogan

Mar 5, 2012, 7:36:35 AM
to rtextto...@googlegroups.com

Hi,

Just wondering if any progress has been made on this post? I've built a model with a similar code structure to Cory's, but only using MAXENT and SVM. The model builds fine and has a recall accuracy of 89%. When I try to classify new data I get poor or non-working results. The MAXENT model gives 2% recall accuracy, and SVM doesn't work at all, giving the following error:

Error in predict.svm(model, corpus@classification_matrix, prob = TRUE,  :
  test data does not match model !

My conclusion is that the problem stems from the model DTM having a different column structure (the list of words) to the new DTM. Do these column structures really need to be the same? This seems a bit restrictive if they do.

Does anyone have a working example of using RTextTools to classify a completely new set of data after having built a model?

Thanks for your help,

Mark

Tim Jurka

Mar 5, 2012, 12:56:55 PM
to rtextto...@googlegroups.com
Hi Mark,

A working demo of using a saved model is available at http://www.rtexttools.com/documentation.html (the "Example Scripts" link). The trick is to use the originalMatrix parameter in create_matrix().

Best,
Tim



Mark Rogan

Mar 6, 2012, 5:00:03 AM
to rtextto...@googlegroups.com
Ah, there's my problem: I didn't have originalMatrix set. Thanks very much, those are useful examples.

Just a comment: there's possibly an error in the file "saved_model_demo.R". The comments describe a trainSize of 2000 and a testSize of 1000, but they are followed by the line:
corpus <- create_corpus(matrix,data$Topic.Code,trainSize=1:3000,virgin=FALSE)

Thanks again.

Tim Jurka

Mar 6, 2012, 2:21:06 PM
to rtextto...@googlegroups.com
I need to update the code comments. Thanks for the heads up!

Best,
Tim


Antoine Thibaud

Aug 18, 2013, 7:13:44 PM
to rtextto...@googlegroups.com
Hello Tim,
Thank you for your package!
Aside from the comments issue, the saved model example has the following line in the new data section:

container <- create_container(new_matrix,NYTimes$Topic.Code,testSize=1:100,virgin=FALSE)

If this is new data, shouldn't virgin be set to TRUE?
Thank you!
Antoine

Tim Jurka

Aug 18, 2013, 8:57:56 PM
to rtextto...@googlegroups.com
Hi Antoine,

Yes, that is correct, thank you for catching the mistake. However, if you have some validation data for the new data, you can set virgin to FALSE and see some analytics for how the classification performed against the validation set.

Best,
Tim
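
When validation codes do exist for the new data, comparing them with the predicted codes is simple arithmetic. The base R sketch below computes accuracy and per-class recall and precision by hand. The two label vectors are invented for illustration, and this is not the analytics output that RTextTools itself produces.

```r
# Hypothetical data: hand-coded validation codes and the model's predictions.
truth     <- c(1, 2, 1, 1, 2, 2, 1)
predicted <- c(1, 2, 1, 2, 2, 1, 1)

# Share one set of levels so the table stays square even if a class
# is never predicted.
classes <- sort(unique(c(truth, predicted)))
confusion <- table(truth     = factor(truth, levels = classes),
                   predicted = factor(predicted, levels = classes))

accuracy  <- sum(diag(confusion)) / sum(confusion)
recall    <- diag(confusion) / rowSums(confusion)  # correct / actual, per class
precision <- diag(confusion) / colSums(confusion)  # correct / predicted, per class

print(confusion)
print(accuracy)
print(recall)
print(precision)
```

With virgin data there is no `truth` vector, so none of these numbers can be computed; only the predicted codes are available.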



Mi Zhou

Jun 2, 2014, 10:29:25 PM
to rtextto...@googlegroups.com
I don't think that link works anymore. In case anyone needs it, the script is now at https://github.com/timjurka/RTextTools/blob/master/RTextTools/inst/examples/saved_model_demo.R
Thanks,

Greg L

Jul 16, 2014, 6:04:40 PM
to rtextto...@googlegroups.com
Hi,

I think I have found the issue that I asked about earlier. It appears that the create_matrix function in the RTextTools package has a spelling mistake on line 31.

if (attr(weighting,"Acronym")=="tf-idf") weight <- 0.000000001

should read:

if (attr(weighting,"acronym")=="tf-idf") weight <- 0.000000001

It works fine once that is done.

Kind regards, 
Greg
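
Why a single capital letter matters here: attribute names in R are case-sensitive, so asking for an attribute under the wrong case returns NULL, and comparing NULL to a string inside `if` stops with an error. A small base R sketch of that behaviour follows. The `weighting` object is a stand-in built for illustration, with attribute names chosen to mirror the fix above, and is not the real weighting function.

```r
# Stand-in weighting function carrying a lower-case "acronym" attribute.
weighting <- structure(function(m) m, name = "toy weighting", acronym = "tf-idf")

wrong <- attr(weighting, "Acronym")  # wrong case: no such attribute, so NULL
right <- attr(weighting, "acronym")  # correct case: the stored string

# NULL == "tf-idf" has length zero, which `if` rejects with an error.
outcome <- tryCatch(
  if (wrong == "tf-idf") "matched" else "not matched",
  error = function(e) "error"
)

print(is.null(wrong))
print(right)
print(outcome)
```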

Phuc Quang Tran

May 17, 2019, 4:05:22 AM
to rtexttools-help
> # CREATE A container THAT IS SPLIT INTO A TRAINING SET AND A TESTING SET
> # WE WILL BE USING Topic.Code AS THE CODE COLUMN. WE DEFINE A 2000 
> # ARTICLE TRAINING SET AND A 1000 ARTICLE TESTING SET.
> container <- create_container(matrix,NYTimes$Topic.Code,trainSize=1:3000,virgin=FALSE) 

 Result: Error: $ operator is invalid for atomic vectors 

Please help me. Thanks

On Tuesday, June 3, 2014 at 09:29:25 UTC+7, Mi Zhou wrote: