Yeah I think so, let me look into that and get something to you today or tomorrow.

-LC
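P.S. In the meantime, here's a rough sketch of the kind of thing I have in mind -- plain base R, untested, and the function and argument names are just placeholders, not anything in RTextTools yet:

prf_by_code <- function(true_codes, pred_codes) {
    codes <- sort(unique(true_codes))
    out <- t(sapply(codes, function(k) {
        tp <- sum(pred_codes == k & true_codes == k)  # true positives for code k
        fp <- sum(pred_codes == k & true_codes != k)  # false positives
        fn <- sum(pred_codes != k & true_codes == k)  # false negatives
        precision <- if (tp + fp == 0) NA else tp / (tp + fp)
        recall <- if (tp + fn == 0) NA else tp / (tp + fn)
        f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
            NA
        } else {
            2 * precision * recall / (precision + recall)
        }
        c(precision = precision, recall = recall, f1 = f1)
    }))
    rownames(out) <- codes
    out
}

You'd pass in the hand-coded labels and whatever labels classification returns; colMeans(out, na.rm = TRUE) would give a crude overall average for the summary report.

On Fri, Jul 22, 2011 at 5:16 PM, Tim Jurka <timj...@gmail.com> wrote: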
Do you know how to calculate precision, recall, and the F-measure using what we have in RTextTools so far? Or would that require a lot more work?

- maxent probabilities DONE
- SLDA implementation DONE
- better analytics for virgin text IN PROGRESS
- re-write crossvalidate using existing train/classify function calls IN PROGRESS
- fix wizard functions to support all algorithms IN PROGRESS
- combine score summary/document summary DONE
- combine topic summary/algorithm summary? they're both displayed by topic code DONE
- add a total average summary report / precision / recall ?????
- functions to truncate documents by num words/num sentences DONE
- remove extraColumns parameter from create_matrix, let user pass in cbind of columns to use (one or many) DONE

Tim

On Jul 21, 2011, at 9:32 PM, Loren Collingwood wrote:

Nahaaa!
On Jul 21, 2011 3:18 PM, "Tim Jurka" <timj...@gmail.com> wrote:
> I'll be making additions and revising the draft over the next couple days. How about we collaborate next week on Tuesday or Wednesday to wrap things up?
>
> Tim
>
> On Jul 21, 2011, at 9:43 AM, Loren Collingwood wrote:
>
>> Hey Tim,
>> Let me know when we can set aside a day/afternoon in the next week to finish up this R Journal article.
>> -Loren
>>
>>
>> On Wed, Jul 20, 2011 at 2:16 PM, Tim Jurka <timj...@gmail.com> wrote:
>> Hi JoBeth,
>>
>> Try using these parameters for your create_matrix() function and let me know if you get higher accuracy across topic codes.
>>
>> create_matrix(data$Description, language="english", removeNumbers=FALSE, stemWords=TRUE, weighting=weightTfIdf)
>>
>> Best,
>> Tim
>>
>> --
>> Timothy P. Jurka
>> Graduate Student
>> Department of Political Science
>> University of California, Davis
>> www.timjurka.com
>>
>> On Jul 20, 2011, at 12:34 PM, JoBeth Surface Shafran wrote:
>>
>>> Tim, Loren, and Amber,
>>>
>>> I just wanted to thank you all for your help. I was able to exclude the words I needed to through a combination of Excel and R. RTextTools is doing a good job (70-90% accuracy) on about a quarter of the major topic codes, but not so well on the rest. I only used a training set of ~3,000, though. Hopefully when I use the larger training set and play around a bit more with word exclusion, I can increase accuracy across the board.
>>>
>>> I will keep you guys updated. Thanks again.
>>>
>>> JoBeth
>>>
>>> On Wed, Jul 20, 2011 at 3:13 AM, Tim Jurka <timj...@gmail.com> wrote:
>>> Hey Loren, thanks for handling this!
>>>
>>> Tim
>>>
>>> On Jul 19, 2011, at 8:49 PM, Loren Collingwood wrote:
>>>
>>>> JoBeth,
>>>> Here's some code that works. Lowercasing was an issue, and I also took the non-senator-name words out of the exclusion list because they were mangling other terms (e.g., removing "amend" cuts the "amend" out of "amendment", so you're stuck with fragments like "ment"). So perhaps experiment with those words, i.e., work them into the words_extract vector one by one to make sure they're not clobbering one another, and for words that are common roots you may just want to leave them alone. Also, in the create_matrix function you were using data$text_test_extract; just use text_test_extract at this point, because it's now a standalone vector rather than a column of a dataframe. I've attached the code, and I was able to run the create_matrix function. Let us know if it works.
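>>>> One way around the root-word problem is to anchor each pattern to word boundaries, so removing "amend" can't clip "amendment". A rough, untested sketch along the lines of the earlier loop (same object names, lowercasing first since some years are all caps):
>>>>
>>>> text_test_extract <- tolower(tp$text)  # normalize case before matching
>>>> words_extract <- c("market", "all", "movement", "read", "here")
>>>> for (w in words_extract) {
>>>>     # \b makes the pattern match whole words only
>>>>     text_test_extract <- gsub(pattern = paste("\\b", w, "\\b", sep = ""), replacement = "", x = text_test_extract)
>>>> }
>>>> head(text_test_extract)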
>>>>
>>>> -Loren
>>>>
>>>>
>>>> On Tue, Jul 19, 2011 at 8:02 PM, JoBeth Surface Shafran <jsurfac...@gmail.com> wrote:
>>>> Loren,
>>>>
>>>> I tried your code, but wasn't able to get it to work. Some of the years of data are in all caps, so I think I need to make them all lowercase before I remove words, but I wasn't able to get that to work either. I think I need to run the "matrix" line of code before this extract line, but I'm not sure how to do it. Any suggestions? I have attached a smaller version of the dataset and my code if that helps at all.
>>>>
>>>> Thanks, everyone, for your help!
>>>> JoBeth
>>>>
>>>> On Tue, Jul 19, 2011 at 6:43 PM, Loren Collingwood <loren.co...@gmail.com> wrote:
>>>> Glad to hear you have had success with RTextTools! Ah yes, the good old find-and-replace trick. In R, one probably inefficient but doable way (I think) is the following code, where tp is your dataframe and text is your vector of textual documents. You put all your words into a character vector, then loop over it, replacing each word with an empty string. You could then use text_test_extract as your "text" column for RTextTools.
>>>>
>>>> # Pull the text column out of the dataframe so we can edit it directly.
>>>> text_test_extract <- tp$text
>>>>
>>>> # Words to strip from every document.
>>>> words_extract <- c("market", "all", "movement", "Read", "here")
>>>> for (i in seq_along(words_extract)) {
>>>>     # Replace each occurrence of the word with an empty string.
>>>>     text_test_extract <- gsub(pattern = words_extract[i], replacement = "", x = text_test_extract)
>>>> }
>>>> head(text_test_extract)  # spot-check the result
>>>>
>>>> It's inefficient in that if your list of words is long you'll have to type it in by hand, but I guess you'd probably have to do that anyway.
>>>>
>>>> -Loren
>>>>
>>>> On Tue, Jul 19, 2011 at 3:40 PM, Tim Jurka <timj...@gmail.com> wrote:
>>>> Hi JoBeth,
>>>>
>>>> If it's just a few words, I'd encourage you to do it using the Microsoft Excel or Microsoft Access replace function (under Edit > Replace). If you've got a list with several tens to hundreds of words, there is a way to do it faster with a few lines of R code. I can show you how to do this if the Excel method isn't what you're looking for.
>>>>
>>>> I may include a feature to exclude a list of words in the v1.1 release due in early August; if not, it'll be included in the v1.2 release.
>>>>
>>>> Best,
>>>> Tim
>>>>
>>>> P.S. Amber, could you forward the attachment or put it in our DropBox? Thank you!
>>>>
>>>> On Jul 19, 2011, at 2:42 PM, Amber Boydstun wrote:
>>>>
>>>>> Hi JoBeth,
>>>>>
>>>>> Great to hear from you, and glad you got the program to run!
>>>>>
>>>>> I'm cc'ing Tim and Loren here because I think they'll have more efficient answers for this question than I will. Guys, any thoughts?
>>>>>
>>>>> Cheers,
>>>>> Amber
>>>>>
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>> Amber E. Boydstun
>>>>> Assistant Professor
>>>>> Department of Political Science
>>>>> University of California, Davis
>>>>> One Shields Ave
>>>>> Davis, CA 95616
>>>>>
>>>>> http://psfaculty.ucdavis.edu/boydstun/
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>
>>>>>
>>>>> On Tue, Jul 19, 2011 at 10:23 PM, JoBeth Surface Shafran <jsurfac...@gmail.com> wrote:
>>>>> Amber,
>>>>>
>>>>> I was able to get RTextTools to run on the PAP roll call data. I used a training set of ~18,000 to code ~1,500. It did really well in a few policy areas, but not very well in many of the others. I think the problem I am running into is the language used in the roll call descriptions. I need a way to exclude words like resolution, amendment, SCONRES, individual senator names, etc. We had discussed this in Catania. I was wondering how I would go about excluding these words/phrases. I assume the first thing I should do is make a list of words I would like to exclude, but I am not sure where to go after that. Thanks for your help!
>>>>>
>>>>> Also, I attached the summary tables in case you would like to look at them.
>>>>>
>>>>> Thanks!
>>>>> JoBeth
>>>>>
>>>>
>>>> --
>>>> Loren Collingwood
>>>> loren.co...@gmail.com
>>>>
>>>> --
>>>> Loren Collingwood
>>>> loren.co...@gmail.com
>>>> <shafran_texttools.r>
>>>
>>>
>>
>> --
>> Loren Collingwood
>> loren.co...@gmail.com
>
--
Loren Collingwood
loren.co...@gmail.com