Please see my inline comments below.
> I hope this email finds you well. I am reaching out to seek your advice or assistance regarding a challenge I'm facing with a memory issue.
>
> I have implemented a REPTree classifier function to work with a dataset comprising approximately 20 attributes and 190,000 entries. To improve the model's performance, I performed some feature engineering on the original dataset, resulting in the creation of two matrices: x_values for the features and y_values for the original labels of the dataset.
>
> However, I've encountered a significant issue where the memory usage of my program keeps increasing over time, eventually leading to a crash. This escalating memory consumption occurs during the model training phase, and despite my efforts to mitigate it, the problem persists.
>
> To diagnose the issue, I attempted to use tracemalloc to track the memory allocation and identify potential leaks or areas of inefficient memory use. Unfortunately, this approach has not been effective in providing insights or solutions to the problem.
>
> I am reaching out to you hoping you could offer guidance on potential strategies to address this memory issue.
>
> My code is as follows:
>
> def REPTree(x_values, y_values, options=["-L", "10"]):
>     flted = filtered = cls = evl = None
>     try:
>         dataset = create_instances_from_matrices(x_values, y_values, name="new_feature")
>         dataset.class_is_last()
>
>         flted = Filter(classname="weka.filters.unsupervised.attribute.NumericToNominal", options=["-R", "last"])
>         flted.inputformat(dataset)
>         filtered = flted.filter(dataset)
>
>         cls = Classifier(classname="weka.classifiers.trees.REPTree", options=options)
Since you're only calculating statistics, you don't need a trained
classifier at all. The Evaluation class will train the model
automatically before collecting statistics. If you hand it an already
trained classifier instead of just a configured template, the
Evaluation class has to make a deep copy of that trained classifier
for each fold pair, which wastes memory and burns a lot of compute
cycles!
So, simply remove the following line, where you're training the classifier:
>         cls.build_classifier(filtered)
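To see why those copies matter, here is a toy illustration in plain Python (this is NOT the Weka API, just a stand-in): deep-copying a configured-but-untrained template is essentially free, whereas deep-copying a trained model duplicates its entire learned structure once per fold.

```python
import copy

# Toy stand-in for a classifier (NOT the Weka API): "training" attaches
# a large model structure, much as building a REPTree on 190k rows would.
class ToyClassifier:
    def __init__(self, options):
        self.options = options
        self.model = None                      # empty until trained

    def build(self, n_nodes=100_000):
        self.model = list(range(n_nodes))      # stand-in for a big trained tree

# Untrained template: ten deep copies carry no model data at all.
template = ToyClassifier(["-L", "10"])
fold_copies = [copy.deepcopy(template) for _ in range(10)]
print(all(c.model is None for c in fold_copies))    # nothing duplicated

# Pre-trained classifier: every fold copy duplicates the whole model.
trained = ToyClassifier(["-L", "10"])
trained.build()
trained_copies = [copy.deepcopy(trained) for _ in range(10)]
print(sum(len(c.model) for c in trained_copies))    # 10 x 100000 elements copied
```

In the real Weka case the duplicated state is the fitted tree, and it gets thrown away anyway because cross-validation retrains on each training fold.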
>
>         evl = Evaluation(filtered)
Add the following line before calling "crossvalidate_model" to conserve
some more memory by not holding on to the per-instance predictions:
evl.discard_predictions = True
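For a sense of scale, here is a rough plain-Python sketch (not the actual Weka internals): without discarding, Evaluation keeps one prediction record per instance, and at 190,000 instances even the containing list is already on the order of megabytes before counting the records themselves.

```python
import sys

n_instances = 190_000

# Toy model of what gets retained per instance when predictions are kept
# (roughly: actual value, predicted value, instance weight).
retained = [{"actual": 0.0, "predicted": 0.0, "weight": 1.0}
            for _ in range(n_instances)]

# With discard_predictions = True, nothing is accumulated at all.
discarded = []

# Size of the list object alone, excluding the dicts it points to:
print(sys.getsizeof(retained))
```

The aggregate statistics (correct, num_instances, etc.) are unaffected, so discarding costs you nothing for your use case.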
>         evl.crossvalidate_model(cls, filtered, 10, Random(1))
>         score = evl.correct / evl.num_instances
>     finally:
>         if flted is not None:
>             del flted
>         if filtered is not None:
>             del filtered
>         if cls is not None:
>             del cls
>         if evl is not None:
>             del evl
>         gc.collect()
>
>     return score
>
> I look forward to any advice you can provide.
Hope this helps with the memory issue.
Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, Hamilton, NZ
Mobile +64 22 190 2375
https://www.cs.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/