Greetings:
I'm experimenting with using RTextTools to classify posts to a developer discussion forum.
My training set has about 1,100 entries, totalling about 900 KB of data in the source CSV file.
When I try to run certain algorithms (BAGGING, NNET, SLDA, and TREE), train_models() fails with "Error: cannot allocate vector of size 320.0 Mb".
I am running the latest version of RTextTools under R 2.15.2 on Ubuntu Linux 12.04 ("Precise"), on a 32-bit Intel machine with 4 GB of RAM.
A bit of Googling turned up some suggestions, mostly of two kinds:
- Get a 64-bit machine with more RAM.
- Break the problem into smaller pieces.
I'm looking for a bigger platform, but in the meantime I would welcome any suggestions for other approaches.
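One concrete way to "break the problem into smaller pieces" that I am considering is pruning sparse terms when building the document-term matrix, so the matrix has far fewer columns before training ever starts. This is only a sketch: removeSparseTerms is a create_matrix() argument (passed through to the tm package), and the 0.998 cutoff below is a guessed starting point, not a tuned value.

```r
library(RTextTools)

data <- read.csv("even-coded.csv", sep="|", header=TRUE, quote="")

# Sketch: drop terms that appear in almost no documents.
# removeSparseTerms=0.998 removes terms absent from >99.8% of documents;
# the threshold is illustrative and would need tuning.
matrix <- create_matrix(cbind(data["seq"], data["body"]),
                        language="english",
                        removeNumbers=TRUE,
                        stemWords=FALSE,
                        removeSparseTerms=0.998,
                        weighting=weightTfIdf)

dim(matrix)  # fewer columns => smaller dense matrix for BAGGING et al.
```

Whether this loses too much signal for classification is an open question, but it directly shrinks the allocation that is failing.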
Following are the steps I performed (the gc() invocations were an attempt to free memory):
library(RTextTools)
set.seed(95616)
data <- read.csv("even-coded.csv", sep="|", header=TRUE, quote="")
gc()
# The 'seq' column is just a record number; the text to be coded is in the 'body' column.
matrix <- create_matrix(cbind(data["seq"], data["body"]), language="english", removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)
gc()
container <- create_container(matrix, as.numeric(data$type), trainSize=1:966, testSize=967:1066, virgin=FALSE)
gc()
models <- train_models(container, c("BAGGING"))
gc()
results <- classify_models(container, models)
The process fails during train_models(...).
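For what it's worth, the 320 MB figure looks consistent with the tree/nnet/bagging/slda wrappers coercing the sparse document-term matrix to a dense one, which may be why the other algorithms succeed. A rough back-of-the-envelope check (the 40,000-term count below is illustrative, not my measured number; ncol(matrix) would give the real value):

```r
# Sketch: estimate the dense-matrix footprint if the sparse DTM is
# coerced with as.matrix(). Assumes 8 bytes per double entry.
n_docs  <- 1066
n_terms <- 40000                 # illustrative; use ncol(matrix) in practice
bytes   <- n_docs * n_terms * 8
bytes / 2^20                     # ~325 MB, roughly the failing allocation
```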
Thanks,
John Noll
Research Fellow
Lero - The Irish Software Engineering Research Centre
University of Limerick, Ireland