Hi everyone,
I am trying to automate a process in which I extract rules from a dataset using different seeds (starting at 1 and increasing sequentially). The steps I follow in Python are:
- Load the data and remove features in certain columns (1, 3-7). The code snippet for this is:
remove = Filter(
    classname="weka.filters.unsupervised.attribute.Remove",
    options=["-R", ",".join(features_positions)],
)
- Then I remove all the attributes whose correlation with the class (LABEL in my dataset) exceeds a threshold.
So far, so good: I get the same results as in Weka, with the same number of attributes in the same order.
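As a minimal sketch of how the -R argument in the first step is assembled: this assumes features_positions is a list of strings holding 1-based indices or ranges (the values below simply mirror the columns mentioned above and are illustrative, not copied from the actual script).

```python
# Assumption: positions are stored as strings; Weka's Remove filter
# expects 1-based indices and ranges in its -R option.
features_positions = ["1", "3-7"]

# Same expression as in the Filter(...) call above.
remove_options = ["-R", ",".join(features_positions)]
print(remove_options)  # ['-R', '1,3-7']
```

If the positions were stored as integers instead, ",".join would raise a TypeError, so they would need converting with str first.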
Then I use JRip to extract the rules. Here is the whole code:
train, _ = data.train_test_split(percentage=train_pct, rnd=None)  # type: ignore
for seed in tqdm(  # type: ignore
    range(num_iterations), desc="Extracting rules...", unit="iteration"
):
    options = f"-F 3 -N 2.0 -O {optimizations} -S {seeds[seed]}".split()  # type: ignore
    jrip = Classifier(classname="weka.classifiers.rules.JRip", options=options)
    jrip.build_classifier(train)
I use 2 for the optimization parameter, and the seed starts at 1 and increases by 1 on each iteration (I ask for several passes to be executed). For example, if I request 50 iterations, the seed takes the values 1 through 50. Before the loop, I split the dataset into train and test sets, but I only use the train set here (I will evaluate at a later stage).
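To show exactly which option lists end up being passed to JRip, here is a small sketch that builds them without calling Weka. The helper name jrip_options is hypothetical and only used for illustration; the option string is the one from the loop above, with 2 optimizations and seeds 1 through 50 as described.

```python
def jrip_options(seed, optimizations=2):
    """Build the JRip option list for one seed (hypothetical helper)."""
    # Same option string as in the loop: -F folds, -N minimum weight,
    # -O optimization runs, -S seed.
    return f"-F 3 -N 2.0 -O {optimizations} -S {seed}".split()


num_iterations = 50
seeds = list(range(1, num_iterations + 1))  # 1, 2, ..., 50

all_options = [jrip_options(s) for s in seeds]
print(all_options[0])   # ['-F', '3', '-N', '2.0', '-O', '2', '-S', '1']
print(all_options[-1])  # ['-F', '3', '-N', '2.0', '-O', '2', '-S', '50']
```

Printing these lists next to the option string shown in the Weka GUI is a quick way to check that both runs really use the same parameters.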
The rules I get with Python are very different from the rules I get with JRip in Weka using the same seed. The options I use in Weka are:
I have highlighted the options I change or fine-tune in Weka. In Weka I get two rules (which work as expected when tested in another application), but with Python, using the same seed (2 in this case), I get zero rules.
I have been assuming the results should be the same given the same dataset and steps, so I must be missing something.
I have searched the online documentation for a similar topic, but haven't found anything I can use.
Thanks in advance for your support.
JJ