Re: r text tools


Tim Jurka

Jul 23, 2011, 2:18:05 PM
to rtextto...@googlegroups.com
Alright. I was able to add sparse matrix support for GLMNET; performance on a 3,000-article training set is middling (not great, but not terrible either). A lot has been simplified UI-wise, and there will be a few more refinements before the v1.1 release.

Change log:
- Maximum entropy now supports probability output.
- GLMNET now supports sparse matrices for low-memory operation.
- SLDA now replaces Naive Bayes. Naive Bayes is postponed until a suitable multinomial implementation can be found.
- Cross validation now works with all algorithms.
- Analytics have been reduced to two summaries @score_summary and @topic_summary.
- Analytics for virgin text have been improved; only relevant information will be shown.
- Classifying virgin text is easier: just specify the virgin=TRUE flag in the create_corpus function (no additional data prep needed).
- The function calls for training models and classifying data have been cleaned up and simplified.
- train_models() and classify_models() now work for all algorithms.
- extraColumns parameter removed from create_matrix. You can now pass one or many columns in the trainingColumns parameter.
- Naming convention changed: matrix_container is now referred to as a "corpus" and the type parameter is referred to as an "algorithm."
- Various bug fixes and code cleaning.
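
To illustrate, the simplified v1.1 workflow now looks roughly like this (a sketch only: the create_matrix parameters and function names come from this thread, but the trainSize/testSize arguments and column names are illustrative assumptions, not final signatures):

matrix <- create_matrix(data$Description, language="english", stemWords=TRUE, weighting=weightTfIdf)
corpus <- create_corpus(matrix, data$Topic.Code, trainSize=1:2700, testSize=2701:3000, virgin=FALSE)
models <- train_models(corpus, algorithms=c("MAXENT", "GLMNET", "SLDA"))
results <- classify_models(corpus, models)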

Best,
Tim

On Jul 23, 2011, at 10:16 AM, Loren Collingwood wrote:

Yeah I think so, let me look into that and get something to you today or tomorrow.
-LC

On Fri, Jul 22, 2011 at 5:16 PM, Tim Jurka <timj...@gmail.com> wrote:
Do you know how to calculate precision, recall, and the F-measure using what we have in RTextTools so far? Or would that require a lot more work?
- maxent probabilities DONE
- SLDA implementation DONE
- better analytics for virgin text  IN PROGRESS
- re-write crossvalidate using existing train/classify function calls  IN PROGRESS
- fix wizard functions to support all algorithms  IN PROGRESS
- combine score summary/document summary DONE
- combine topic summary/algorithm summary? they're both displayed by topic code DONE
- add a total average summary report / precision / recall  ?????
- functions to truncate documents by num words/num sentences   DONE
- remove extraColumns parameter from create_matrix, let user pass in cbind of columns to use (one or many) DONE
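
On the precision/recall question, base R can get most of the way there from predicted vs. actual labels via a confusion table. A toy sketch (not tied to RTextTools internals; assumes every class appears in both vectors so the table is square):

pred <- c("A", "A", "B", "B", "C")   # predicted topic codes
true <- c("A", "B", "B", "B", "C")   # actual topic codes
tab  <- table(true, pred)            # rows = actual, columns = predicted
precision <- diag(tab) / colSums(tab)  # per-class precision
recall    <- diag(tab) / rowSums(tab)  # per-class recall
f1 <- 2 * precision * recall / (precision + recall)
round(cbind(precision, recall, f1), 2)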

Tim


On Jul 21, 2011, at 9:32 PM, Loren Collingwood wrote:

Nahaaa!

On Jul 21, 2011 3:18 PM, "Tim Jurka" <timj...@gmail.com> wrote:
> I'll be making additions and revising the draft over the next couple days. How about we collaborate next week on Tuesday or Wednesday to wrap things up?
>
> Tim
>
> On Jul 21, 2011, at 9:43 AM, Loren Collingwood wrote:
>
>> Hey Tim,
>> Let me know when we can set aside a day/afternoon in the next week to finish up this RJournal article.
>> -Loren
>>
>>
>> On Wed, Jul 20, 2011 at 2:16 PM, Tim Jurka <timj...@gmail.com> wrote:
>> Hi JoBeth,
>>
>> Try using these parameters for your create_matrix() function and let me know if you get higher accuracy across topic codes.
>>
>> create_matrix(data$Description, language="english", removeNumbers=FALSE, stemWords=TRUE, weighting=weightTfIdf)
>>
>> Best,
>> Tim
>>
>> --
>> Timothy P. Jurka
>> Graduate Student
>> Department of Political Science
>> University of California, Davis
>> www.timjurka.com
>>
>> On Jul 20, 2011, at 12:34 PM, JoBeth Surface Shafran wrote:
>>
>>> Tim, Loren, and Amber,
>>>
>>> I just wanted to thank you all for your help. I was able to exclude the words I needed to through a combination of Excel and R. RTextTools is doing a good job (70-90% accuracy) on about a quarter of the major topic codes, but not so well on the rest. I only used a training set of ~3,000, though. Hopefully when I use the larger training set and play around a bit more with word exclusion, I can increase accuracy across the board.
>>>
>>> I will keep you guys updated. Thanks again.
>>>
>>> JoBeth
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jul 20, 2011 at 3:13 AM, Tim Jurka <timj...@gmail.com> wrote:
>>> Hey Loren, thanks for handling this!
>>>
>>> Tim
>>>
>>> On Jul 19, 2011, at 8:49 PM, Loren Collingwood wrote:
>>>
>>>> JoBeth,
>>>> Here's some code that works. tolower was an issue, and I also took out the non-Senator names because they were messing some stuff up (e.g., removing "amend" also strips the "amend" from "amendment", leaving you with fragments like "ment"). So perhaps experiment with these words (i.e., work them into the words_extract vector one by one to make sure they aren't clobbering one another), and leave common root words alone. Also, in the create_matrix function you were using data$text_test_extract; just use text_test_extract at this point, because it's now a standalone vector rather than part of a dataframe. I've attached the code, and I was able to run the create_matrix function. Let us know if it works.
>>>>
>>>> -Loren
>>>>
>>>>
>>>> On Tue, Jul 19, 2011 at 8:02 PM, JoBeth Surface Shafran <jsurfac...@gmail.com> wrote:
>>>> Loren,
>>>>
>>>> I tried your code, but wasn't able to get it to work. Some of the years of data are in all caps so I think I need to make them all lower case before I remove words. I wasn't able to get that to work. I think I need to run the "matrix" line of code before this extract line, but I'm not sure how to do it. Any suggestions? I have attached a smaller version of the dataset and my code if that helps at all.
>>>>
>>>> Thanks everyone for your help!
>>>> JoBeth
>>>>
>>>>
>>>>
>>>> On Tue, Jul 19, 2011 at 6:43 PM, Loren Collingwood <loren.co...@gmail.com> wrote:
>>>> Glad to hear you have had success with RTextTools! Ah yes, the good old find-and-replace trick. In R, one probably inefficient but doable way (I think) is the following code, where tp is your dataframe and text is your vector of textual documents: put all your words into a vector, then loop over it, replacing each occurrence with an empty string. You could then use text_test_extract as your "text" column for RTextTools.
>>>>
>>>> text_test_extract <- tp$text
>>>>
>>>> words_extract <- c("market","all","movement","Read","here")
>>>> for (i in seq_along(words_extract)) {
>>>>   text_test_extract <- gsub(pattern=words_extract[i], replacement="", x=text_test_extract, fixed=TRUE)
>>>> }
>>>> head(text_test_extract)
>>>>
>>>> It's inefficient because if your list of words is long you'll have to enter that by hand, but I guess you'd probably have to do that anyway.
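>>>>
>>>> (A single-pass alternative, sketched with base R only: collapse the list into one regular expression. The \\b word boundaries also avoid clipping "amend" out of "amendment", and ignore.case handles text that is in all caps.)
>>>>
>>>> pattern <- paste0("\\b(", paste(words_extract, collapse="|"), ")\\b")
>>>> text_test_extract <- gsub(pattern, "", tp$text, ignore.case=TRUE)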
>>>>
>>>> -Loren
>>>>
>>>> On Tue, Jul 19, 2011 at 3:40 PM, Tim Jurka <timj...@gmail.com> wrote:
>>>> Hi JoBeth,
>>>>
>>>> If it's just a few words, I'd encourage you to do it using the Microsoft Excel or Microsoft Access replace function (under Edit > Replace). If you've got a list with several tens to hundreds of words, there is a way to do it faster with a few lines of R code. I can show you how to do this if the Excel method isn't what you're looking for.
>>>>
>>>> I may include a feature to exclude a list of words in the v1.1 release due early August; if not then it'll be included in the v1.2 release.
>>>>
>>>> Best,
>>>> Tim
>>>>
>>>> P.S. Amber, could you forward the attachment or put it in our DropBox? Thank you!
>>>>
>>>> On Jul 19, 2011, at 2:42 PM, Amber Boydstun wrote:
>>>>
>>>>> Hi JoBeth,
>>>>>
>>>>> Great to hear from you, and glad you got the program to run!
>>>>>
>>>>> I'm cc'ing Tim and Loren here because I think they'll have more efficient answers for this question than I will. Guys, any thoughts?
>>>>>
>>>>> Cheers,
>>>>> Amber
>>>>>
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>> Amber E. Boydstun
>>>>> Assistant Professor
>>>>> Department of Political Science
>>>>> University of California, Davis
>>>>> One Shields Ave
>>>>> Davis, CA 95616
>>>>>
>>>>> http://psfaculty.ucdavis.edu/boydstun/
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>
>>>>>
>>>>> On Tue, Jul 19, 2011 at 10:23 PM, JoBeth Surface Shafran <jsurfac...@gmail.com> wrote:
>>>>> Amber,
>>>>>
>>>>> I was able to get RTextTools to run on the PAP roll call data. I used a training set of ~18,000 to code ~1,500. It did really well in a few policy areas, but not very well in many of the others. I think the problem I am running into is the language used in the roll call descriptions. I need a way to exclude words like resolution, amendment, SCONRES, individual senator names, etc. We had discussed this in Catania. I was wondering how I would go about excluding these words/phrases. I assume the first thing I should do is make a list of words I would like to exclude, but I am not sure where to go after that. Thanks for your help!
>>>>>
>>>>> Also, I attached the summary tables in case you would like to look at them.
>>>>>
>>>>> Thanks!
>>>>> JoBeth
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Loren Collingwood
>>>> loren.co...@gmail.com
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Loren Collingwood
>>>> loren.co...@gmail.com
>>>> <shafran_texttools.r>
>>>
>>>
>>
>>
>>
>>
>> --
>> Loren Collingwood
>> loren.co...@gmail.com
>




--
Loren Collingwood
loren.co...@gmail.com
