maxDocFreq in create_matrix

Enrico Borghetto

Mar 20, 2014, 2:36:41 PM
to rtextto...@googlegroups.com
Hi, 

How does the maxDocFreq argument of create_matrix work? I tried to eliminate recurring words in bill titles (act, bill, amend, etc.) by setting a high value for maxDocFreq, but the resulting DocumentTermMatrix seems unaffected.

Example using the USCongress data:

# LOAD THE PACKAGE, SET THE SEED AND LOAD THE DATA
library(RTextTools)
set.seed(95616)

data(USCongress)

# CREATE THE DOCUMENT-TERM MATRIX WITHOUT maxDocFreq
doc_matrix <- create_matrix(USCongress$text, language="english", removeNumbers=TRUE, stemWords=TRUE, removeSparseTerms=.998)

doc_matrix

# LIST THE TERMS WITH A TOTAL FREQUENCY OF AT LEAST 1 AND AT LEAST 800
tm::findFreqTerms(doc_matrix, 1)
tm::findFreqTerms(doc_matrix, 800)

# CREATE THE SAME MATRIX WITH maxDocFreq=800 FOR COMPARISON
doc_matrix1 <- create_matrix(USCongress$text, maxDocFreq=800, language="english", removeNumbers=TRUE, stemWords=TRUE, removeSparseTerms=.998)

doc_matrix1
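
For comparison, here is a minimal sketch of the filtering I expected, done by hand in base R: count how many documents contain each term and drop the terms above a threshold. The helper name drop_frequent_terms and the toy matrix are my own illustrative assumptions, not part of RTextTools or tm. I am also assuming that as.matrix(doc_matrix) gives a documents-by-terms count matrix.

```r
# Hypothetical helper (not part of RTextTools or tm): keep only the terms
# that appear in at most max_docs documents.
drop_frequent_terms <- function(dtm, max_docs) {
  m <- as.matrix(dtm)            # rows = documents, columns = terms
  doc_freq <- colSums(m > 0)     # number of documents containing each term
  m[, doc_freq <= max_docs, drop = FALSE]
}

# Toy counts standing in for as.matrix(doc_matrix)
toy <- matrix(c(1, 0, 2,
                1, 1, 0,
                3, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("bill", "amend", "tax")))

# "bill" occurs in all 3 documents, so it is dropped when max_docs = 2
filtered <- drop_frequent_terms(toy, max_docs = 2)
colnames(filtered)
```

Applied to the real data, the call would be something like drop_frequent_terms(doc_matrix, max_docs = 800), with the result compared against doc_matrix1.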