How to use data from Python with Weka


Anton

Oct 2, 2015, 11:37:55 AM
to python-weka-wrapper

I've installed weka and the python-weka-wrapper.

I got as far as


from weka.classifiers import Classifier
clf=Classifier(classname="weka.classifiers.rules.JRip")

from random import randint
X = [[randint(1,10) for _ in range(5)] for _ in range(100)]
y = [randint(0,1) for _ in range(100)]

but now I don't know how to load my data which is available as a Python data structure.

How can I load my data matrices, output the rules (in some parsable format) and test the classifier on new data?

Peter Reutemann

Oct 2, 2015, 4:42:32 PM
to python-weka-wrapper
I've just added a convenience method to the "weka.core.dataset"
module, called "create_instances_from_lists". However, in the
meantime, you have to do something like this to create a dataset for
your classifier:

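# imports needed for the snippet
from weka.core.dataset import Attribute, Instance, Instances

# create header: one numeric attribute per column of X, plus the class "y"
name = "data"  # relation name; placeholder, any string will do
atts = []
for i in xrange(len(X[0])):
    atts.append(Attribute.create_numeric("x" + str(i+1)))
atts.append(Attribute.create_numeric("y"))
dataset = Instances.create_instances(name, atts, len(y))
# add rows: the feature values followed by the class value
for i in xrange(len(X)):
    values = X[i][:]
    values.append(y[i])
    dataset.add_instance(Instance.create_instance(values))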

As for outputting the rules, you can simply turn the built JRip model
into a string and parse the model output. If you want to have more
control, then you have to dig into JRip's methods (getRuleset() and
getRuleStats()). You can use the classifier's "jwrapper" property to
access the underlying Java API in an easy manner:

jrip = Classifier(classname="weka.classifiers.rules.JRip")
jrip.build_classifier(somedata)
rset = jrip.jwrapper.getRuleset()
for i in xrange(rset.size()):
    r = rset.get(i)
    print(str(r.toString(somedata.class_attribute.jobject)))

For making predictions, simply use the classifier's
"classify_instance" or "distribution_for_instance" method.

Please note, JRip cannot handle numeric classes. You must use the
weka.filters.unsupervised.attribute.NumericToNominal filter to convert
the class attribute to a nominal one.

In order to keep things simple, you should use the FilteredClassifier,
with NumericToNominal as its filter and JRip as its base
classifier. Then you don't have to worry about converting the data when
making predictions.

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 858-5174
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/

Anton Suchaneck

Oct 5, 2015, 4:23:48 AM
to python-weka-wrapper, frac...@waikato.ac.nz
Thanks for the detailed explanation, Peter!

I've tried to reproduce the method.
I suppose Attribute and Instances have to be imported from weka.core.dataset?

Unfortunately, Instances.create_instances("name", atts, 10) failed with
JavaException: <init>
at
weka/core/dataset.pyc, create_instances()
420    javabridge.make_instance("weka/core/Instances", "(Ljava/lang/String;Ljava/util/ArrayList;I)V", name, javabridge.make_list(attributes), capacity))

jutil.pyc
1691   raise JavaException(jexception)

What should I do from here?

If I display atts it gives [@attribute x1 numeric, @attribute x2 numeric, @attribute y numeric]

(copying my exact code is a bit difficult, since my environment is not connected to a network :/)

Peter Reutemann

Oct 5, 2015, 5:36:40 AM
to python-weka-wrapper
> I've tried to reproduce the method.
> I suppose Attribute and Instances have to be imported from
> weka.core.dataset?

Correct.

> Unfortunately, Instances.create_instances("name", atts, 10) failed with
> JavaException: <init>
> at
> weka/core/dataset.pyc, create_instances()
> 420 javabridge.make_instance("weka/core/Instances",
> "(Ljava/lang/String;Ljava/util/ArrayList;I)V", name,
> javabridge.make_list(attributes), capacity))
>
> jutil.pyc
> 1691 raise JavaException(jexception)
>
> What should I do from here?

Hmm... I cannot reproduce this error. Here is a code snippet that
creates a simple dataset, filling it with some dummy data:

import weka.core.jvm as jvm
from weka.core.dataset import Attribute, Instance, Instances

jvm.start()

atts = []
for i in xrange(5):
    atts.append(Attribute.create_numeric("x" + str(i)))

data = Instances.create_instances("data", atts, 10)

for n in xrange(10):
    values = []
    for i in xrange(5):
        values.append(n*100 + i)
    inst = Instance.create_instance(values)
    data.add_instance(inst)

print(data)

jvm.stop()


> If I display atts it gives [@attribute x1 numeric, @attribute x2 numeric,
> @attribute y numeric]
>
> (copying my exact code is a bit difficult, since my environment is not
> connected to a network :/)

What versions of python-weka-wrapper and javabridge do you use?

You can check this using "pip list", for instance:

$ pip list | grep "weka\|bridge"
javabridge (X.Y.Z)
python-weka-wrapper (X.Y.Z)

javabridge should be 1.0.11 and pww 0.3.2 or 0.3.3.

Also, are you using an Oracle JRE? OpenJDK is still problematic.

Anton Suchaneck

Oct 5, 2015, 8:36:44 AM
to python-weka-wrapper, frac...@waikato.ac.nz
I have javabridge 1.0.11 but only pww 0.3.1. It's kind of a closed system where I work, but I hope I can get an update on that.
Also I have OpenJDK and probably I'm bound to that. Currently I do not know how to proceed.

I suppose for now I will try calling JRip directly from the command line.
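
For the command-line route, the Python lists first have to be written to a file that Weka can read. The following is only a minimal sketch, assuming Weka's ARFF text format and all-numeric features; the helper name to_arff and the relation name "data" are made up for illustration. The class is declared as a nominal attribute because JRip cannot handle a numeric class:

```python
def to_arff(X, y, relation="data"):
    """Return ARFF text for numeric feature rows X and class labels y.

    Hypothetical helper, not part of Weka or python-weka-wrapper.
    """
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    labels = sorted(set(y))
    lines = ["@relation " + relation, ""]
    # one numeric attribute per feature column: x1, x2, ...
    for i in range(len(X[0])):
        lines.append("@attribute x%d numeric" % (i + 1))
    # nominal class attribute: the distinct labels listed in curly braces
    lines.append("@attribute y {" + ",".join(str(lab) for lab in labels) + "}")
    lines.append("")
    lines.append("@data")
    # one comma-separated line per row, class value last
    for row, label in zip(X, y):
        lines.append(",".join(str(v) for v in list(row) + [label]))
    return "\n".join(lines) + "\n"


X = [[1, 2], [3, 4], [5, 6]]
y = [0, 1, 0]
arff_text = to_arff(X, y)
print(arff_text)
```

The class attribute is written last, which is the position Weka's command-line tools use as the class by default. Writing arff_text to a file such as train.arff is left out of the sketch.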

There are many sklearn users who might be interested in Weka algorithms that cannot be found elsewhere.
Maybe a small, full example (like the one you wrote) would be good for the documentation. Sklearn users are used to the pattern
X=[[...],[...],...]
y=[...]
clf=Clf(...)
clf.fit(X, y)
y_pred=clf.predict(X)

Thanks for the help!

Peter Reutemann

Oct 5, 2015, 11:19:38 PM
to python-weka-wrapper
> I have javabridge 1.0.11 but only pww 0.3.1. It's kind of a closed system
> where I work, but I hope I can get an update on that.

OK.

> Also I have OpenJDK and probably I'm bound to that. Currently I do not know
> how to proceed.

Hmm... Weka requires an Oracle JRE/JDK, and therefore so does
python-weka-wrapper. We had weird effects in the past using OpenJDK.

Can you tell me more about the environment that is causing this issue?
OS? 32/64bit? My problem is that I cannot reproduce your error.

> I suppose for now I will try calling JRip directly from the command line.

Hopefully we can sort out that problem.

> There are many sklearn users who might be interested in Weka algorithms
> that cannot be found elsewhere.
> Maybe a small, full example (as you write it) would be good for the
> documentation. Sklearn users are used to the pattern
> X=[[...],[...],...]
> y=[...]
> clf=Clf(...)
> clf.fit(X, y)
> y_pred=clf.predict(X)

The next release enables you to create a dataset from x and y (as long
as it is all numeric data), as I mentioned in my other post.

Anton Suchaneck

Oct 9, 2015, 3:42:07 AM
to python-weka-wrapper, frac...@waikato.ac.nz
Here is my configuration:
python-weka-wrapper 0.3.1
javabridge 1.0.11
python 2.7.8
(Redhat) Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP x86_64 GNU/Linux
java-1.7.0-openjdk-1.7.0.85-2.6.1.3.el6_7.x86_64


>> I suppose for now I will try calling JRip directly from the command line.
> Hopefully we can sort out that problem.
 
But I cannot test my trained model on new data by calling it from command line only, can I? :(
Can I store the trained model from command line for now or do I have to parse the string output and recreate the rules?
 
>> There are many sklearn users who might be interested in Weka algorithms
>> that cannot be found elsewhere.
>> Maybe a small, full example (as you write it) would be good for the
>> documentation. Sklearn users are used to the pattern
>> X=[[...],[...],...]
>> y=[...]
>> clf=Clf(...)
>> clf.fit(X, y)
>> y_pred=clf.predict(X)
> The next release enables you to create a dataset from x and y (as long
> as it is all numeric data), as I mentioned in my other post.
 
It's a useful option, indeed. I was actually aiming for a documentation section where a full working example for sklearn users could be shown.
Multiple times I tried finding alternatives to Weka after getting frustrated about not finding my case in the documentation. (even mentioning how to set the CLASSPATH could help)

Thanks!
 

Peter Reutemann

Oct 9, 2015, 4:12:09 AM
to python-weka-wrapper


> Here is my configuration:
> python-weka-wrapper 0.3.1
> javabridge 1.0.11
> python 2.7.8
> (Redhat) Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP x86_64 GNU/Linux
> java-1.7.0-openjdk-1.7.0.85-2.6.1.3.el6_7.x86_64

What Fedora or RHEL version is that?

>> > I suppose for now I will try calling JRip directly from the command line.
>> Hopefully we can sort out that problem.
>
>  
> But I cannot test my trained model on new data by calling it from command line only, can I? :(

Yes, you can. See below.

> Can I store the trained model from command line for now or do I have to parse the string output and recreate the rules?

Yes, you can. Use the -d option to save the model and the -l option to load a saved model. To test a saved model on new data, use -l in conjunction with the -T option.

Use -h for help. Also check the Weka manual.

>> > There are many sklearn users who might be interested in Weka algorithms
>> > that cannot be found elsewhere.
>> > Maybe a small, full example (as you write it) would be good for the
>> > documentation. Sklearn users are used to the pattern
>> > X=[[...],[...],...]
>> > y=[...]
>> > clf=Clf(...)
>> > clf.fit(X, y)
>> > y_pred=clf.predict(X)
>> The next release enables you to create a dataset from x and y (as long
>> as it is all numeric data), as I mentioned in my other post.
>
>  
> It's a useful option, indeed. I was actually aiming for a documentation section where a full working example for sklearn users could be shown.
> Multiple times I tried finding alternatives to Weka after getting frustrated about not finding my case in the documentation. (even mentioning how to set the CLASSPATH could help)

CLASSPATH is not specific to Weka; it's a general Java question. My advice: don't use the CLASSPATH environment variable. It usually just gives you a headache with differing versions. I recommend using an explicit -cp option when firing up the JVM.

Cheers, Peter

Anton

Oct 12, 2015, 9:50:19 AM
to python-weka-wrapper, frac...@waikato.ac.nz
The Red Hat version is:

Red Hat Enterprise Linux Server release 6.6 (Santiago)

Updating the python-weka-wrapper to 0.3.3 did not remove the error.

Thanks for the hint for the command line options. I looked at the JRip documentation only.

Is there also a way to output which exact rule matched a test instance (from the command line by calling java weka.classifiers.rules.JRip)?

Peter Reutemann

Oct 13, 2015, 12:08:53 AM
to python-weka-wrapper
> The Red Hat version is:
>
> Red Hat Enterprise Linux Server release 6.6 (Santiago)
>
> Updating the python-weka-wrapper to 0.3.3 did not remove the error.

OK, I just did the following:
- installed a CentOS 6.6 version inside VirtualBox (64bit)
- openjdk (+ -devel)
- python (+ -devel)
- pip
  - manually downloaded using https://bootstrap.pypa.io/get-pip.py
  - installed using python get-pip.py
- installed dev tools (yum groupinstall "Development tools")
- installed numpy
- installed python-imaging (shouldn't be necessary)
- installed javabridge using pip
- installed python-weka-wrapper using pip

After that, I was able to run the attached script using:
python create_dataset.py

So, not quite sure what the problem at your end is... :-(

> Thanks for the hint for the command line options. I looked at the JRip
> documentation only.
>
> Is there also a way to output which exact rule matched a test instance (from
> the command line by calling java weka.classifiers.rules.JRip)?

No, classifiers are black boxes (quite a simple interface only). If
you want to do something like this, you'd have to write your own Java
code. Take a look at JRip's "distributionForInstance" method for
determining which rule covers the data and the "toString()" method if
you want to output the rules.
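
To illustrate the idea of determining the covering rule: a rule learner like JRip produces an ordered list of rules, the first rule that covers an instance decides the prediction, and a default rule at the end catches everything else. The sketch below is plain Python and not Weka's API; the rule representation and the names covers and first_matching_rule are invented for illustration:

```python
import operator

# Each rule is (conditions, predicted label). A condition is
# (attribute index, operator symbol, threshold). An empty condition
# list acts as the default rule and covers every instance.
OPS = {"<=": operator.le, ">=": operator.ge}


def covers(conditions, instance):
    """True if the instance satisfies every condition of the rule."""
    return all(OPS[op](instance[idx], thr) for idx, op, thr in conditions)


def first_matching_rule(rules, instance):
    """Return (rule position, predicted label) of the first covering rule."""
    for pos, (conditions, label) in enumerate(rules):
        if covers(conditions, instance):
            return pos, label
    raise ValueError("rule list has no default rule")


rules = [
    ([(0, ">=", 7), (2, "<=", 3)], 1),
    ([(1, ">=", 9)], 1),
    ([], 0),  # default rule
]
print(first_matching_rule(rules, [8, 2, 1]))  # (0, 1)
print(first_matching_rule(rules, [1, 9, 5]))  # (1, 1)
print(first_matching_rule(rules, [1, 1, 1]))  # (2, 0)
```

A real implementation would obtain the rules from the trained model instead of hard-coding them.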

HTH
create_dataset.py

Anton

Oct 13, 2015, 12:34:13 PM
to python-weka-wrapper, frac...@waikato.ac.nz
Thanks a lot for testing.
I suppose it works on most Linuxes. Maybe this Red Hat version is funny or has a weird configuration. I hope it magically resolves itself with some update of the Linux system or so. Meanwhile I'm using the jar call and my colleagues are using RWeka, both of which work.

A last question here for JRip:
The jar says there is an option -m for a cost matrix. Does JRip really support that and what is the format of the cost file?

Peter Reutemann

Oct 13, 2015, 2:08:11 PM
to Anton, python-weka-wrapper


> Thanks a lot for testing.
> I suppose it works on most Linuxes. Maybe this Red Hat version is funny or has a weird configuration. I hope it magically resolves itself with some update of the Linux system or so. Meanwhile I'm using the jar call and my colleagues are using RWeka, both of which work.

OK. Bit weird though. Can you test, e.g. using VirtualBox, whether you can set up a system by following the steps that I described in my previous email? Maybe it is just this particular setup that is a bit strange.

> A last question here for JRip:
> The jar says there is an option -m for a cost matrix. Does JRip really support that and what is the format of the cost file?

This is general functionality provided by the Evaluation class, not JRip.

See the following wiki article:
http://weka.wikispaces.com/CostMatrix

BTW, for general Weka questions, it is best to use the Weka mailing list. See the Weka homepage for details.

Cheers, Peter
