Please see my inline comments below.
> I hope this email finds you well. I am reaching out to seek your advice or assistance regarding a challenge I'm facing with a memory issue.
>
> I have implemented a REPTree classifier function to work with a dataset comprising approximately 20 attributes and 190,000 entries. To improve the model's performance, I performed some feature engineering on the original dataset, resulting in the creation of two matrices: x_values for the features and y_values for the original labels of the dataset.
>
> However, I've encountered a significant issue where the memory usage of my program keeps increasing over time, eventually leading to a crash. This escalating memory consumption occurs during the model training phase, and despite my efforts to mitigate it, the problem persists.
>
> To diagnose the issue, I attempted to use tracemalloc to track the memory allocation and identify potential leaks or areas of inefficient memory use. Unfortunately, this approach has not been effective in providing insights or solutions to the problem.
>
> I am reaching out to you hoping you could offer guidance on potential strategies to address this memory issue.
>
> My code is as follows:
>
> def REPTree(x_values, y_values, options=["-L", "10"]):
>     flted = filtered = cls = evl = None
>     try:
>         dataset = create_instances_from_matrices(x_values, y_values, name="new_feature")
>         dataset.class_is_last()
>
>         flted = Filter(classname="weka.filters.unsupervised.attribute.NumericToNominal", options=["-R", "last"])
>         flted.inputformat(dataset)
>         filtered = flted.filter(dataset)
>
>         cls = Classifier(classname="weka.classifiers.trees.REPTree", options=options)
Since you're only calculating statistics, you don't need a trained
classifier at all. The Evaluation class will train the model
automatically before collecting statistics. If you hand it an already
trained classifier instead of just a configured template, the
Evaluation class has to make a deep copy of that trained classifier
for each fold pair, which wastes memory and burns a lot of compute
cycles!
So, simply remove the following line, where you're training the classifier:
>         cls.build_classifier(filtered)
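To see why those copies matter, here is a toy illustration in plain Python (this is NOT the Weka API, just a stand-in): deep-copying a configured-but-untrained template is essentially free, whereas deep-copying a trained model duplicates its entire learned structure once per fold.

```python
import copy

# Toy stand-in for a classifier (NOT the Weka API): "training" attaches
# a large model structure, much as building a REPTree on 190k rows would.
class ToyClassifier:
    def __init__(self, options):
        self.options = options
        self.model = None                      # empty until trained

    def build(self, n_nodes=100_000):
        self.model = list(range(n_nodes))      # stand-in for a big trained tree

# Untrained template: ten deep copies carry no model data at all.
template = ToyClassifier(["-L", "10"])
fold_copies = [copy.deepcopy(template) for _ in range(10)]
print(all(c.model is None for c in fold_copies))    # nothing duplicated

# Pre-trained classifier: every fold copy duplicates the whole model.
trained = ToyClassifier(["-L", "10"])
trained.build()
trained_copies = [copy.deepcopy(trained) for _ in range(10)]
print(sum(len(c.model) for c in trained_copies))    # 10 x 100000 elements copied
```

In the real Weka case the duplicated state is the fitted tree, and it gets thrown away anyway because cross-validation retrains on each training fold.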
>
>         evl = Evaluation(filtered)
Add the following line before calling "crossvalidate_model" to conserve
some more memory by not holding on to the per-instance predictions:
evl.discard_predictions = True
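For a sense of scale, here is a rough plain-Python sketch (not the actual Weka internals): without discarding, Evaluation keeps one prediction record per instance, and at 190,000 instances even the containing list is already on the order of megabytes before counting the records themselves.

```python
import sys

n_instances = 190_000

# Toy model of what gets retained per instance when predictions are kept
# (roughly: actual value, predicted value, instance weight).
retained = [{"actual": 0.0, "predicted": 0.0, "weight": 1.0}
            for _ in range(n_instances)]

# With discard_predictions = True, nothing is accumulated at all.
discarded = []

# Size of the list object alone, excluding the dicts it points to:
print(sys.getsizeof(retained))
```

The aggregate statistics (correct, num_instances, etc.) are unaffected, so discarding costs you nothing for your use case.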
>         evl.crossvalidate_model(cls, filtered, 10, Random(1))
>         score = evl.correct / evl.num_instances
>     finally:
>         if flted is not None:
>             del flted
>         if filtered is not None:
>             del filtered
>         if cls is not None:
>             del cls
>         if evl is not None:
>             del evl
>         gc.collect()
>
>     return score
>
> I look forward to any advice you can provide.
Hope this helps with the memory issue.
Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, Hamilton, NZ
Mobile +64 22 190 2375
https://www.cs.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/