Memory issue with create_instances_from_matrices

Yoki

Mar 5, 2024, 9:11:36 PM
to python-weka-wrapper
Hi Peter,

I hope this email finds you well. I am reaching out to seek your advice or assistance regarding a challenge I'm facing with a memory issue.

I have implemented a REPTree classifier function to work with a dataset comprising approximately 20 attributes and 190,000 entries. To improve the model's performance, I performed some feature engineering on the original dataset, resulting in the creation of two matrices: x_values for the features and y_values for the original labels of the dataset.

However, I've encountered a significant issue where the memory usage of my program keeps increasing over time, eventually leading to a crash. This escalating memory consumption occurs during the model training phase, and despite my efforts to mitigate it, the problem persists.

To diagnose the issue, I attempted to use tracemalloc to track the memory allocation and identify potential leaks or areas of inefficient memory use. Unfortunately, this approach has not been effective in providing insights or solutions to the problem.

I am reaching out to you hoping you could offer guidance on potential strategies to address this memory issue.

My code is as follows:

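import gc

from weka.classifiers import Classifier, Evaluation
from weka.core.classes import Random
from weka.core.dataset import create_instances_from_matrices
from weka.filters import Filter


def REPTree(x_values, y_values, options=["-L", "10"]):
    flted = filtered = cls = evl = None
    try:
        # Build a Weka dataset from the feature matrix and the label vector
        dataset = create_instances_from_matrices(x_values, y_values, name="new_feature")
        dataset.class_is_last()

        # Convert the numeric class attribute (last column) to nominal
        flted = Filter(classname="weka.filters.unsupervised.attribute.NumericToNominal", options=["-R", "last"])
        flted.inputformat(dataset)
        filtered = flted.filter(dataset)

        cls = Classifier(classname="weka.classifiers.trees.REPTree", options=options)
        cls.build_classifier(filtered)

        # 10-fold cross-validation; the score is the fraction classified correctly
        evl = Evaluation(filtered)
        evl.crossvalidate_model(cls, filtered, 10, Random(1))
        score = evl.correct / evl.num_instances
    finally:
        # Attempt to release the Weka wrapper objects
        if flted is not None:
            del flted
        if filtered is not None:
            del filtered
        if cls is not None:
            del cls
        if evl is not None:
            del evl
        gc.collect()

    return score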

I look forward to any advice you can provide.

Regards,
Yoki
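
A note on the tracemalloc attempt above: tracemalloc only records allocations made through Python's own memory allocators, so memory held by a JVM embedded in the same process would likely not show up in its reports. The sketch below illustrates this blind spot with the standard library only. It assumes that an anonymous mmap region is a fair stand-in for JVM-managed memory; it does not touch Weka at all.

```python
import mmap
import tracemalloc

SIZE = 50 * 1024 * 1024  # 50 MiB

tracemalloc.start()

# Memory requested straight from the operating system. This is an assumed
# stand-in for a JVM heap, which also lives outside Python's allocator.
native_block = mmap.mmap(-1, SIZE)
traced_after_native, _ = tracemalloc.get_traced_memory()

# Memory requested through Python's allocator, which tracemalloc does see.
python_block = bytearray(SIZE)
traced_after_python, _ = tracemalloc.get_traced_memory()

tracemalloc.stop()
native_block.close()

print(f"traced after native allocation: {traced_after_native / 2**20:.1f} MiB")
print(f"traced after Python allocation: {traced_after_python / 2**20:.1f} MiB")
```

If the traced figure stays flat while the process size reported by the operating system keeps climbing, the growth is probably happening outside Python's allocator.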

Peter Reutemann

Mar 5, 2024, 9:18:57 PM
to python-we...@googlegroups.com
Please see my inline comments below.

> I hope this email finds you well. I am reaching out to seek your advice or assistance regarding a challenge I'm facing with a memory issue.
>
> I have implemented a REPTree classifier function to work with a dataset comprising approximately 20 attributes and 190,000 entries. To improve the model's performance, I performed some feature engineering on the original dataset, resulting in the creation of two matrices: x_values for the features and y_values for the original labels of the dataset.
>
> However, I've encountered a significant issue where the memory usage of my program keeps increasing over time, eventually leading to a crash. This escalating memory consumption occurs during the model training phase, and despite my efforts to mitigate it, the problem persists.
>
> To diagnose the issue, I attempted to use tracemalloc to track the memory allocation and identify potential leaks or areas of inefficient memory use. Unfortunately, this approach has not been effective in providing insights or solutions to the problem.
>
> I am reaching out to you hoping you could offer guidance on potential strategies to address this memory issue.
>
> My code is as follows:
>
> def REPTree(x_values, y_values, options=["-L", "10"]):
>     flted = filtered = cls = evl = None
>     try:
>         dataset = create_instances_from_matrices(x_values, y_values, name="new_feature")
>         dataset.class_is_last()
>
>         flted = Filter(classname="weka.filters.unsupervised.attribute.NumericToNominal", options=["-R", "last"])
>         flted.inputformat(dataset)
>         filtered = flted.filter(dataset)
>
>         cls = Classifier(classname="weka.classifiers.trees.REPTree", options=options)

Since you're only calculating statistics, you don't need a trained
classifier at all. The Evaluation class will train the model
automatically before collecting statistics. If you use a trained
classifier instead of just a configured template, the Evaluation class
will have to make deep copies of the trained classifier for each fold
pair, which will use up unnecessary memory and consume lots of compute
cycles!
So, just remove the following line where you're training the classifier:

>         cls.build_classifier(filtered)
>
>         evl = Evaluation(filtered)

To conserve some more memory, add the following line before calling
"crossvalidate_model" so that the Evaluation object does not hold on to
the predictions:
evl.discard_predictions = True

>         evl.crossvalidate_model(cls, filtered, 10, Random(1))
>         score = evl.correct / evl.num_instances
>     finally:
>         if flted is not None:
>             del flted
>         if filtered is not None:
>             del filtered
>         if cls is not None:
>             del cls
>         if evl is not None:
>             del evl
>         gc.collect()
>
>     return score
>
> I look forward to any advice you can provide.

Hope this helps with the memory issue.

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, Hamilton, NZ
Mobile +64 22 190 2375
https://www.cs.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

Yoki

Mar 5, 2024, 9:31:23 PM
to python-weka-wrapper
Thank you for the reply. I have removed:

> cls.build_classifier(filtered)
>
> evl = Evaluation(filtered)

and added evl.discard_predictions = True

but I get the following error:

Traceback (most recent call last):
  File "c:/FC/feature-construction-weka.py", line 130, in <module>    
    scores_base = REPTree(x_values, y_values)
  File "c:/GP/feature-construction-weka.py", line 63, in REPTree      
    evl.discard_predictions = True
AttributeError: 'NoneType' object has no attribute 'discard_predictions'


Regards,
Yoki

Peter Reutemann

Mar 5, 2024, 9:34:17 PM
to python-we...@googlegroups.com
> Thank you for the reply. I have removed:
>
> > cls.build_classifier(filtered)
> >
> > evl = Evaluation(filtered)
>
> and added evl.discard_predictions = True
>
> but I get the following error:
>
> Traceback (most recent call last):
>   File "c:/FC/feature-construction-weka.py", line 130, in <module>
>     scores_base = REPTree(x_values, y_values)
>   File "c:/GP/feature-construction-weka.py", line 63, in REPTree
>     evl.discard_predictions = True
> AttributeError: 'NoneType' object has no attribute 'discard_predictions'

Just to confirm: does your code look like this?

evl = Evaluation(filtered)
evl.discard_predictions = True
evl.crossvalidate_model(cls, filtered, 10, Random(1))

Yoki

Mar 5, 2024, 9:37:32 PM
to python-weka-wrapper
It looks as follows:

def REPTree(x_values, y_values, options=["-L", "10"]):
    flted = filtered = cls = evl = None
    try:
        dataset = create_instances_from_matrices(x_values, y_values, name="new_feature")
        dataset.class_is_last()
       
        flted = Filter(classname="weka.filters.unsupervised.attribute.NumericToNominal", options=["-R", "last"])
        flted.inputformat(dataset)
        filtered = flted.filter(dataset)

        cls = Classifier(classname="weka.classifiers.trees.REPTree", options=options)

        evl.discard_predictions = True
        evl.crossvalidate_model(cls, filtered, 10, Random(1))
        score = evl.correct / evl.num_instances
    finally:
        if flted is not None:
            del flted
        if filtered is not None:
            del filtered
        if cls is not None:
            del cls
        if evl is not None:
            del evl
        gc.collect()

    return score

Peter Reutemann

Mar 5, 2024, 9:41:13 PM
to python-we...@googlegroups.com
You also removed another line, the one that instantiates the Evaluation object:
evl = Evaluation(filtered)

Without it, evl still holds the None it was initialised with at the top of the function, which results in this error:
AttributeError: 'NoneType' object has no attribute 'discard_predictions'

Cheers, Peter
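
Putting the suggestions from this thread together, the function could look roughly like the sketch below. The name evaluate_reptree and the folds and seed parameters are made up for illustration, the import paths are assumed to match python-weka-wrapper3, and the JVM is assumed to have been started beforehand (for example with weka.core.jvm.start()). The manual del/gc.collect() bookkeeping is left out as a simplification, not because anyone in the thread asked for its removal.

```python
def evaluate_reptree(x_values, y_values, options=None, folds=10, seed=1):
    """Cross-validate REPTree and return the fraction of correct predictions.

    Hypothetical helper; assumes a running JVM and python-weka-wrapper3.
    """
    # Imported inside the function so that merely loading this module does
    # not pull in the Weka bindings (import paths are an assumption).
    from weka.classifiers import Classifier, Evaluation
    from weka.core.classes import Random
    from weka.core.dataset import create_instances_from_matrices
    from weka.filters import Filter

    if options is None:
        options = ["-L", "10"]  # avoids a shared mutable default argument

    dataset = create_instances_from_matrices(x_values, y_values, name="new_feature")
    dataset.class_is_last()

    # Convert the numeric class attribute (last column) to nominal.
    to_nominal = Filter(
        classname="weka.filters.unsupervised.attribute.NumericToNominal",
        options=["-R", "last"],
    )
    to_nominal.inputformat(dataset)
    filtered = to_nominal.filter(dataset)

    # Configured but never trained: no build_classifier() call here, because
    # the cross-validation trains its own copy for every fold.
    template = Classifier(classname="weka.classifiers.trees.REPTree", options=options)

    evaluation = Evaluation(filtered)      # must exist before it is configured
    evaluation.discard_predictions = True  # do not keep per-instance predictions
    evaluation.crossvalidate_model(template, filtered, folds, Random(seed))
    return evaluation.correct / evaluation.num_instances
```

Called as evaluate_reptree(x_values, y_values), this should return the same kind of score as the original function, without training a model up front.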

Yoki

Mar 6, 2024, 5:54:33 PM
to python-weka-wrapper
The change brought only a very limited improvement; the program still runs out of memory and crashes.

Is there anywhere else that memory usage can be optimised? Since I only calculate statistics, is there any way to make Weka release all memory after the evaluation? Python's garbage collection doesn't seem to have any effect on the JVM.

Regards,
Yoki

Peter Reutemann

Mar 6, 2024, 7:56:50 PM
to python-we...@googlegroups.com
> The change brought only a very limited improvement; the program still runs out of memory and crashes.

Bummer...

> Is there anywhere else that memory usage can be optimised? Since I only calculate statistics, is there any way to make Weka release all memory after the evaluation? Python's garbage collection doesn't seem to have any effect on the JVM.

You should be able to call the JVM's garbage collector with something
like this:

import javabridge
javabridge.static_call("java/lang/System", "gc", "()V")

Yoki

Mar 7, 2024, 7:09:39 PM
to python-weka-wrapper
It seems that there is not much that javabridge can do. I now call the Weka function in sub-processes, which avoids the memory growth. Anyway, thank you for the help!

Peter Reutemann

Mar 7, 2024, 7:14:00 PM
to python-we...@googlegroups.com
> It seems that there is not much that javabridge can do. I now call the Weka function in sub-processes, which avoids the memory growth. Anyway, thank you for the help!

Not sure how objects are being referenced and therefore kept from
being garbage collected (Python, JNI, JVM). But at least you found a
workaround!
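
The subprocess workaround mentioned above can be sketched as follows. This is not Yoki's actual code, which was not posted; run_isolated, the JSON-over-stdin convention and DEMO_WORKER are hypothetical names and choices. The idea is only that each evaluation runs in a fresh Python interpreter, so whatever memory it (and any JVM it starts) accumulates goes back to the operating system when that process exits.

```python
import json
import subprocess
import sys


def run_isolated(worker_source, payload):
    """Run worker_source in a fresh Python interpreter and return its result.

    The payload is sent to the child as JSON on stdin. By the convention
    assumed here, the child prints its result as JSON on its last stdout line.
    """
    completed = subprocess.run(
        [sys.executable, "-c", worker_source],
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child fails
    )
    # Only the last line is parsed, in case the child prints other output first.
    return json.loads(completed.stdout.strip().splitlines()[-1])


# Hypothetical stand-in worker: it only averages numbers. A real worker would
# start the JVM, run the cross-validation and print the score instead.
DEMO_WORKER = """
import json, sys
payload = json.load(sys.stdin)
values = payload["values"]
print(json.dumps({"score": sum(values) / len(values)}))
"""

result = run_isolated(DEMO_WORKER, {"values": [1, 0, 1, 1]})
print(result)
```

For matrices of the size discussed in this thread, passing a file path in the payload instead of the data itself would probably be more practical than sending everything through stdin.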