Adding additional features to be used with classification of text

Christine Talbot

Nov 6, 2012, 13:40:03

To: rtextto...@googlegroups.com

I have some additional features that I want to use alongside the features generated from the text itself. For example, I have a set of phrases/sentences in one column, another column for the current position of the speaker, another for how long it has been since someone moved, etc. I want to be able to use the text AND the other features when training on or classifying my data. How can I do that with RTextTools?

Thanks!

Timothy P. Jurka

Nov 6, 2012, 17:57:16

To: rtextto...@googlegroups.com

Hi Christine,

You can pass multiple columns to the create_matrix() function, for example:

create_matrix(cbind(mydata$text, mydata$speakerPosition, mydata$lengthSinceMove), ...)

For other examples, see the sample scripts on the RTextTools website ( http://www.rtexttools.com/documentation.html ).
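
A minimal sketch of one way to prepare such columns, assuming (this is not confirmed in the thread) that create_matrix() joins the columns of each row into a single string before tokenizing. Under that assumption, a categorical or numeric feature only acts as a feature if it shows up as its own word, so the sketch turns each one into a word-like token first. The data, the cutoff of 10, and the token names are all hypothetical.

```r
# Hypothetical data; the column names follow the example above.
mydata <- data.frame(
  text            = c("please move to the door", "stay where you are"),
  speakerPosition = c("left", "center"),
  lengthSinceMove = c(2.5, 40),
  stringsAsFactors = FALSE
)

# Encode each non-text feature as a single word-like token.
# Tokens avoid punctuation and digits so that text cleaning options
# are less likely to alter or drop them.
pos_token  <- paste0("position", mydata$speakerPosition)
move_token <- ifelse(mydata$lengthSinceMove > 10, "movelong", "moveshort")

features <- cbind(mydata$text, pos_token, move_token)

# What each row looks like if its columns are joined with spaces
# (assumed to mirror what happens inside create_matrix()).
combined <- apply(features, 1, paste, collapse = " ")
print(combined)

# With RTextTools installed, the same matrix would be passed on as:
#   doc_matrix <- create_matrix(features, language = "english")
```

Binning lengthSinceMove into two tokens is only one possible choice; finer bins give the classifier more detail at the cost of sparser features.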

Best,

Tim

--

Timothy P. Jurka
Ph.D. Student

Department of Political Science
University of California, Davis
www.timjurka.com

Christine Talbot

Nov 9, 2012, 16:28:37

To: rtextto...@googlegroups.com

I'm trying to figure out how to interpret the results that appear in the analytics - ideally, I'm looking to try to understand tp, fp, fn, tn, precision, recall, total classified as positive, total classified as negative, and ultimately create some sort of ROC curve to try different thresholds. How can I find/get that information out of what is in the analytics stuff?

Thanks!

Christine

Timothy P. Jurka

Nov 9, 2012, 20:13:35

To: rtextto...@googlegroups.com

Hi Christine,

Google can provide you with several excellent resources for understanding the precision, recall, and F-score metrics. The raw classifications are stored in the "document_summary" slot of the analytics container (e.g. analytics@document_summary). With a few lines of R code you can tally the distribution of responses; if you're testing against known codes, the "label_summary" slot shows how many observations were classified under each category.
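
As a rough illustration of the "few lines of R code", the sketch below tallies tp, fp, fn, and tn for a two-category problem. It uses a small stand-in data frame rather than a real analytics object, and the column names MANUAL_CODE and SVM_LABEL are assumptions about what document_summary contains; check names(analytics@document_summary) for the actual columns.

```r
# Stand-in for analytics@document_summary (values are made up;
# column names are assumed, not taken from this thread).
doc_summary <- data.frame(
  MANUAL_CODE = c(1, 1, 1, 0, 0, 0, 1, 0),
  SVM_LABEL   = c(1, 1, 0, 0, 1, 0, 1, 0)
)

positive  <- 1  # which category counts as "positive"
actual    <- doc_summary$MANUAL_CODE == positive
predicted <- doc_summary$SVM_LABEL == positive

# Confusion-matrix cells
tp <- sum(predicted & actual)
fp <- sum(predicted & !actual)
fn <- sum(!predicted & actual)
tn <- sum(!predicted & !actual)

# Standard definitions
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Totals classified as positive and as negative
total_positive <- tp + fp
total_negative <- tn + fn

print(c(tp = tp, fp = fp, fn = fn, tn = tn))
print(c(precision = precision, recall = recall, f1 = f1))
```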

Best,

Tim

Christine Talbot

Nov 9, 2012, 20:44:34

To: rtextto...@googlegroups.com

I understand the precision, recall, and F-score metrics. What I don't understand is what each item in the analytics object represents. What do the *_PROB columns in the document_summary represent? Is it the output of the machine learning itself, or something else? What is represented by the algorithm_summary (since it's split into the two separate categories I was classifying with)? I'm familiar with precision, recall, and F-score for the entire dataset, but I'm not clear on how they should be interpreted for the partial sets the algorithm_summary is split into. And for the label_summary, what calculations do those values represent?

Again, I'm just unclear on what formulas you're using for the values I get in the analytics object, so I don't know how to get the things that I'm used to working with when classifying things.

Thank you

Christine

Timothy P. Jurka

Nov 9, 2012, 20:49:19

To: rtextto...@googlegroups.com

Hi Christine,

The statistics are documented in the Getting Started guide available on the RTextTools website ( http://install.rtexttools.com/files/RTextTools_GettingStarted.pdf ). The label summary shows the statistics broken down by label, so you can see which labels are problematic and need more or better data to improve accuracy.
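
To make "broken down by label" concrete, here is a sketch of per-label precision, recall, and F-score computed with the standard definitions from a confusion table. Whether RTextTools uses exactly these formulas is an assumption; the Getting Started guide is the authority. The manual and predicted vectors are made-up data.

```r
# Made-up manual codes and classifier output for three labels.
manual    <- c("a", "a", "a", "b", "b", "c", "c", "c", "c", "a")
predicted <- c("a", "a", "b", "b", "b", "c", "c", "a", "c", "a")

labels <- sort(unique(manual))

# Rows are predicted labels, columns are manual labels.
confusion <- table(
  factor(predicted, levels = labels),
  factor(manual, levels = labels),
  dnn = c("predicted", "manual")
)

tp        <- diag(confusion)           # correct predictions per label
precision <- tp / rowSums(confusion)   # correct / all predicted as label
recall    <- tp / colSums(confusion)   # correct / all truly in label
fscore    <- 2 * precision * recall / (precision + recall)

label_stats <- data.frame(
  label     = labels,
  precision = as.numeric(precision),
  recall    = as.numeric(recall),
  fscore    = as.numeric(fscore)
)
print(confusion)
print(label_stats)
```

A label with low recall is one the classifier often misses, and a label with low precision is one it assigns too freely; either can point to a category that needs more or better training data.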

Overall statistics will be available in the next version of RTextTools. We haven't needed them so far because when we have consistently high recall, precision, and F1 scores across our labels, that applies across the entire dataset.
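
Until overall statistics are built in, they can be approximated from per-label numbers. The sketch below is one possible approach, not the package's method: it takes hypothetical per-label precision and recall plus the number of manually coded documents per label (n), then computes unweighted (macro) averages and a size-weighted recall, which equals overall accuracy when every document has exactly one label.

```r
# Hypothetical per-label results; n is the number of documents
# manually coded with each label.
label_stats <- data.frame(
  label     = c("a", "b", "c"),
  n         = c(40, 20, 40),
  precision = c(0.90, 0.80, 0.95),
  recall    = c(0.85, 0.90, 0.95)
)

# Macro averages treat every label equally, regardless of size.
macro_precision <- mean(label_stats$precision)
macro_recall    <- mean(label_stats$recall)

# Recall weighted by label size: sum(recall * n) is the total number of
# correctly classified documents, so this is the overall accuracy.
accuracy <- weighted.mean(label_stats$recall, label_stats$n)

print(c(macro_precision = macro_precision,
        macro_recall    = macro_recall,
        accuracy        = accuracy))
```

Because each average lies between the smallest and largest per-label value, consistently high per-label scores imply a high overall score, which is the point made above.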

Best,

Tim
