Errors so far


Thomas Levi

Sep 6, 2012, 7:07:54 PM
to rtextto...@googlegroups.com
I've just installed RTextTools on Windows 7 (R v. 2.15.1) and am trying to work through the introductory example. I have encountered several errors just trying to load and create a corpus:

1. matrix <- create_matrix(cbind(data$text), language="english", removeNumbers=T, removeSparseTerms=.998)
Error in slam:::`[.simple_triplet_matrix`(x, i, j, ...) : 
  Invalid subscript type: NULL.
In addition: Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

I will note that taking out removeNumbers=T makes it work, but obviously I'd like to be able to remove numbers from real data at times.

2. Apparently create_corpus cannot be found:

corpus <- create_corpus(matrix, as.numeric(data$major), trainSize=1:750, testSize=751:1000, virgin=F)
Error: could not find function "create_corpus"

even though the library is loaded and in my namespace.

Any explanation or fixes?

Timothy Jurka

Sep 6, 2012, 7:11:10 PM
to rtextto...@googlegroups.com
Hi Thomas,

1. That error indicates that there are no words in your matrix;
perhaps we should make that error more informative. With
removeNumbers=F you have more "words" because RTextTools also counts
numbers as words, so you don't get the error. You'll probably want to
remove the removeSparseTerms=.998 parameter to make sure you're not
removing too many words.
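
To see how sparsity filtering can thin out a small matrix, here is a toy base-R sketch of the idea behind tm-style sparsity removal. This is an illustration of the concept only, not RTextTools' actual code; the variable names and thresholds are made up, and tm operates on sparse slam matrices rather than dense ones.

```r
# Toy document-term matrix: 4 documents, 3 terms.
dtm <- matrix(c(1, 1, 1,
                0, 1, 1,
                0, 1, 1,
                0, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("rare", "common", "mid")))

# Sparsity of a term = fraction of documents it does NOT appear in.
sparsity <- colMeans(dtm == 0)   # rare: 0.75, common: 0, mid: 0.25

# Keep only terms below a sparsity threshold. A strict threshold
# discards most terms and can leave the matrix nearly (or fully) empty.
strict  <- dtm[, sparsity < 0.1, drop = FALSE]  # only "common" survives
lenient <- dtm[, sparsity < 0.9, drop = FALSE]  # all three survive
```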

2. The latest version of RTextTools replaced create_corpus with
create_container to avoid confusion with package tm.

Best,
Tim

Thomas Levi

Sep 6, 2012, 7:14:49 PM
to rtextto...@googlegroups.com
But when loading data(USCongress), shouldn't there be words in there?

Timothy Jurka

Sep 6, 2012, 7:22:00 PM
to rtextto...@googlegroups.com
Yes, there are words in the USCongress dataset. This works fine for me:

library(RTextTools)
data(USCongress)

# CREATE THE DOCUMENT-TERM MATRIX
doc_matrix <- create_matrix(USCongress$text, language="english",
                            removeNumbers=TRUE, stemWords=TRUE,
                            removeSparseTerms=.998)
container <- create_container(doc_matrix, USCongress$major,
                              trainSize=1:4000, testSize=4001:4449,
                              virgin=FALSE)

Best,
Tim

Thomas Levi

Sep 6, 2012, 7:30:11 PM
to rtextto...@googlegroups.com
Apparently, I can reproduce the error if I include cbind(USCongress$text) as in the startup PDF online. Is it out of date?

Timothy Jurka

Sep 6, 2012, 7:34:00 PM
to rtextto...@googlegroups.com
Hi Thomas,

Yes, we need to update that. Thank you for pointing that out.

Best,
Tim

Thomas Levi

Sep 6, 2012, 7:40:49 PM
to rtextto...@googlegroups.com
Thanks, the quick replies are much appreciated. Can you explain the cbind issue and why it's no longer necessary?

Timothy Jurka

Sep 6, 2012, 7:49:23 PM
to rtextto...@googlegroups.com
cbind() binds two or more vectors together side by side as columns. We
decided that if you're passing in only one object, it doesn't make
sense to cbind() it, because there is nothing to bind it with.
Therefore, cbind() is only necessary when you want to pass in more
than one field for training.
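
For readers following along, the cbind() point can be seen with plain base R. The toy vectors below are made up for illustration; they are not the USCongress data.

```r
text  <- c("estate tax repeal", "highway funding")
title <- c("H.R. 1", "H.R. 2")

one <- cbind(text)         # single vector: just a 2 x 1 matrix wrapper
two <- cbind(title, text)  # two fields: a genuine 2 x 2 column bind

dim(one)  # 2 1
dim(two)  # 2 2
```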

Best,
Tim

Christine Talbot

Nov 18, 2012, 10:56:43 AM
to rtextto...@googlegroups.com
I'm having issues doing anything other than a 1-gram with cbind'ed data in the create_matrix command. It gives me the same error as this thread. If I use only the single text column, I can do 2-grams and 3-grams; as soon as I cbind some other features into the matrix, it no longer works.

How do I get it to do the n-gram bag of words and still add additional features for classification?

Thanks.

Timothy P. Jurka

Nov 19, 2012, 12:24:40 AM
to rtextto...@googlegroups.com
Hi Christine,

Thank you for pointing me to this error. It seems the n-gram feature broke in the last version of RTextTools. I've added it to the bug tracker and will fix it in the next release.

Best,
Tim

--
Timothy P. Jurka
Ph.D. Student
Department of Political Science
University of California, Davis
www.timjurka.com

Josh Varty

Sep 23, 2013, 11:55:35 AM
to rtextto...@googlegroups.com
I just thought I'd post a different solution to this error (as I encountered it as well).

You will receive this error if you have stemWords=TRUE and your dataset contains text in another language (for me it was Japanese). Hopefully that helps someone else who comes across this issue.

Mi Zhou

May 26, 2014, 2:07:57 PM
to rtextto...@googlegroups.com
I have stemWords=FALSE but still had the same issue.
After hours of testing, I found that referring to the columns as cbind(USCongress["firmName"], USCongress["text"]) worked, instead of cbind(USCongress$firmName, USCongress$text). Hope this helps.
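
A small base-R illustration of the difference Mi describes, using a toy data frame as a stand-in for USCongress: `[` with a column name keeps the data.frame class and the column name, while `$` returns a bare vector, and cbind() treats the two quite differently.

```r
df <- data.frame(firmName = c("A", "B"),
                 text = c("foo", "bar"),
                 stringsAsFactors = FALSE)

class(df["text"])  # "data.frame"
class(df$text)     # "character"

# cbind() of two one-column data frames keeps both column names:
colnames(cbind(df["firmName"], df["text"]))  # "firmName" "text"

# cbind() of two bare vectors yields an unnamed character matrix:
colnames(cbind(df$firmName, df$text))        # NULL
```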

Tim Jurka

May 26, 2014, 2:09:51 PM
to rtextto...@googlegroups.com
Hi Mi,

I am no longer actively supporting RTextTools, but there are other available resources online (see StackOverflow) for help using R packages. Please feel free to contribute any bug fixes to the open source project and I will re-publish to CRAN!

Thank you!
Tim



-- 
You received this message because you are subscribed to the Google Groups "rtexttools-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rtexttools-he...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mi Zhou

May 28, 2014, 7:37:48 PM
to rtextto...@googlegroups.com
Hello Tim, 
Thank you for your quick response; I have just one more question for you. Why are common text classification algorithms such as naive Bayes and KNN not included in RTextTools?

Thanks a lot!

Tim Jurka

May 28, 2014, 8:15:11 PM
to rtextto...@googlegroups.com
Hi Mi,

At the time RTextTools was written, there were no packages available on CRAN for multinomial text classification using naive Bayes. Anecdotally, KNN was found to perform far worse than SVM for text classification, both in accuracy and in runtime (i.e., it was slow).

If I were to update RTextTools, I would likely change the core set of algorithms to include naive Bayes, SVM, and maxent. However, I still wouldn't consider KNN simply due to performance reasons. On large datasets KNN quickly freezes with the typical tf-idf matrices.
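
For readers unfamiliar with the tf-idf matrices mentioned above, here is a minimal base-R sketch of the weighting. It is illustrative only; tm/RTextTools build these as sparse slam matrices, and the toy counts below are invented.

```r
# Term frequencies: 3 documents, 2 terms.
tf <- matrix(c(2, 0,
               1, 0,
               0, 3),
             nrow = 3, byrow = TRUE,
             dimnames = list(paste0("doc", 1:3), c("tax", "road")))

# Inverse document frequency: log(N / number of docs containing term).
idf <- log(nrow(tf) / colSums(tf > 0))

# Scale each column of term counts by that term's idf.
tfidf <- sweep(tf, 2, idf, `*`)
```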

Best,
Tim