MOA vs Weka on processing streaming data

Xavier Daull

May 7, 2014, 1:27:38 PM
to moa-...@googlegroups.com
Hi,

I use the MOA GUI to analyze algorithm performance, but my main code uses MOA algorithms through the Weka code. I know that MOA and Weka share the same core instances, but when dealing with streams with evolving labels the code is apparently slightly different and seems improved in MOA. I haven't tested or benchmarked this to compare. Do you think the overhead of this approach (using MOA classifiers through the Weka API) versus using the MOA API directly is large enough that I'd better move to the native MOA API?

Thank you,
Xavier

Albert Bifet

May 7, 2014, 9:28:32 PM
to moa-...@googlegroups.com
Hi Xavier,

   I would recommend moving to the native MOA API, since the next release of MOA is going to use instances from SAMOA, which are not going to be Weka ones:

https://github.com/abifet/moa

The new instances will be able to be multi-target and multi-label, which Weka instances are not. The SAMOA instances have the same API as the Weka instances.
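
For reference, training a MOA classifier natively looks roughly like this (untested sketch against the current MOA API, where classifiers still consume weka.core instances; the HoeffdingTree and the attributes are just placeholders for your own setup):

import java.util.ArrayList;
import java.util.Arrays;

import moa.classifiers.trees.HoeffdingTree;
import moa.core.InstancesHeader;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

public class NativeMoaSketch {
    public static void main(String[] args) {
        // Build the stream header once: two numeric inputs and a nominal class.
        ArrayList<Attribute> attrs = new ArrayList<Attribute>();
        attrs.add(new Attribute("x1"));
        attrs.add(new Attribute("x2"));
        attrs.add(new Attribute("class", Arrays.asList("yes", "no")));
        Instances header = new Instances("stream", attrs, 0);
        header.setClassIndex(header.numAttributes() - 1);

        // Native MOA classifier: set the context once, then train instance by instance.
        HoeffdingTree learner = new HoeffdingTree();
        learner.setModelContext(new InstancesHeader(header));
        learner.prepareForUse();

        double[] values = {1.0, 2.5, 0.0};                   // one incoming record, class "yes"
        Instance inst = new DenseInstance(1.0, values);
        inst.setDataset(header);                             // attach the shared header
        double[] votes = learner.getVotesForInstance(inst);  // predict first (prequential style)
        learner.trainOnInstance(inst);                       // then update the model
    }
}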

Cheers, Albert

Xavier Daull

May 8, 2014, 4:39:51 AM
to moa-...@googlegroups.com
Hi Albert,

I assume Weka classifiers will still be supported through the MOA API?

Is creating MOA/SAMOA instances from new incoming data faster than with Weka, especially when dealing with very frequent small itemsets, each of which requires creating/updating every attribute, creating the instances, and creating a dataset from the attributes and instances?

Regards,
Xavier

Albert Bifet

May 8, 2014, 4:53:19 AM
to moa-...@googlegroups.com
Hi Xavier,

Yes, Weka classifiers will still be supported, as there is a conversion between Weka and SAMOA instances.

Can you give more details or an example of how you create these frequent small itemset instances? I didn't quite get it.

Thanks,

Albert

Xavier Daull

May 8, 2014, 6:31:14 AM
to moa-...@googlegroups.com
Hi,

I need to feed around 1000 models, each with its own continuous stream of data. The data arrive either as small CSV files (processed with the Weka loader) or directly through a Thrift connection providing arrays that are processed by the code using Weka/MOA. If I want to train the models more frequently, I will create smaller chunks of data and increase the frequency at which new data are processed. In that case, the need to create a proper dataset for each batch of incoming data adds significant overhead.

As an extreme example with an updateable classifier, imagine the gap between processing one file (as a stream) of 1,000,000 instances versus processing 1 instance 1,000,000 times: both update the classifier once per instance, but in the second case an important part of the time is spent creating datasets (even if some objects are cached and just updated). I currently can't move to a SAMOA architecture, but I was wondering how the Weka API compares to the MOA API in processing speed while dealing with streams, especially for the dataset creation/conversion.
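
To make it concrete, for each incoming mini-batch I currently do something along these lines (simplified, untested sketch; NaiveBayesUpdateable and the attributes just stand in for my real models and fields):

import java.util.ArrayList;
import java.util.Arrays;

import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class PerBatchRebuildSketch {
    public static void main(String[] args) throws Exception {
        NaiveBayesUpdateable learner = new NaiveBayesUpdateable();
        boolean initialised = false;

        // Pretend these arrays arrive over Thrift, one tiny batch at a time.
        double[][][] batches = { { {1.0, 0.5, 0.0} }, { {0.2, 1.3, 1.0} } };
        for (double[][] incoming : batches) {
            // The costly part: attributes and dataset rebuilt for every mini-batch.
            ArrayList<Attribute> attrs = new ArrayList<Attribute>();
            attrs.add(new Attribute("x1"));
            attrs.add(new Attribute("x2"));
            attrs.add(new Attribute("class", Arrays.asList("yes", "no")));
            Instances batch = new Instances("batch", attrs, incoming.length);
            batch.setClassIndex(batch.numAttributes() - 1);
            for (double[] row : incoming) {
                batch.add(new DenseInstance(1.0, row));
            }
            if (!initialised) {
                learner.buildClassifier(new Instances(batch, 0));  // initialise from structure only
                initialised = true;
            }
            for (int i = 0; i < batch.numInstances(); i++) {
                learner.updateClassifier(batch.instance(i));       // incremental update per instance
            }
        }
    }
}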

Regards,
Xavier

Albert Bifet

May 8, 2014, 9:07:32 PM
to moa-...@googlegroups.com
Hi Xavier,

Thanks! I don't understand why you cannot use the same empty dataset for all instances. In fact, in MOA the dataset only contains the attribute information.
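
What I have in mind is something like this (untested sketch; the updateable classifier and the attributes are just placeholders):

import java.util.ArrayList;
import java.util.Arrays;

import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

public class SharedHeaderSketch {
    public static void main(String[] args) throws Exception {
        // Build the dataset (header) once; it only carries the attribute information.
        ArrayList<Attribute> attrs = new ArrayList<Attribute>();
        attrs.add(new Attribute("x1"));
        attrs.add(new Attribute("x2"));
        attrs.add(new Attribute("class", Arrays.asList("yes", "no")));
        Instances header = new Instances("stream", attrs, 0);
        header.setClassIndex(header.numAttributes() - 1);

        NaiveBayesUpdateable learner = new NaiveBayesUpdateable();
        learner.buildClassifier(header);        // initialise from the (empty) structure

        // Every incoming record is just wrapped and pointed at the shared, empty header.
        double[][] stream = { {1.0, 0.5, 0.0}, {0.2, 1.3, 1.0} };
        for (double[] row : stream) {
            Instance inst = new DenseInstance(1.0, row);
            inst.setDataset(header);            // no new Instances object per record
            learner.updateClassifier(inst);
        }
    }
}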

Are your attributes only numeric?

Regards,

Albert

Xavier Daull

May 9, 2014, 6:08:10 AM
to moa-...@googlegroups.com
Hi Albert,

I don't re-use the same dataset (Instances), but I do re-use the list of Attribute(s). The attribute ranges are updated along the stream. I assume re-using the dataset would improve the process; in that case I suppose I should do a removeAll on the instance list as soon as possible, before any update or after each training, to avoid keeping the dataset's instances in memory between updates.

When re-using the dataset across classifier updates, I still need to check and update the attribute ranges/mappings (i.e. nominal or string to numeric) and extend them if necessary. It "seems" like an important overhead. Do you think it would be worth trying to avoid nominal data, or converting all data to numeric beforehand, to skip this attribute-update step? Or is it negligible?

Thank you,
Xavier

Albert Bifet

May 9, 2014, 6:26:39 AM
to moa-...@googlegroups.com
Hi Xavier,

  I see. I never checked whether using only numeric attributes would be faster. It could be, but I'm not sure.. :)

Thanks, Albert

Xavier Daull

May 14, 2014, 4:18:37 AM
to moa-...@googlegroups.com, abi...@cs.waikato.ac.nz
Hi,

Now I am wondering how the command-line tools ensure that the CSV-to-doubles conversion stays consistent across different training sets for the same model. In other words, how do the MOA (and Weka) command-line tools handle the conversion of the same attributes (nominal/string => doubles) when dealing with CSV files where the nominal values may not appear in the same order? How do I avoid ending up with: training1.csv => cat:1.0 dog:2.0 camel:3.0, training2.csv => dog:1.0 camel:2.0 cat:3.0?

Thank you,
Xavier

Albert Bifet

May 14, 2014, 6:46:20 AM
to moa-...@googlegroups.com
Hi Xavier,

   MOA only reads ARFF files. You may want to ask on the Weka user list, or look at the code of the CSV reader in Weka.

Cheers, Albert

Xavier Daull

May 14, 2014, 7:43:16 AM
to moa-...@googlegroups.com
You're right; I got confused because I use MOA through Weka. I looked into the code of the CSVLoader and there is no option to re-use a template Instances object or list of attributes. So it seems it cannot guarantee that the nominal/string-to-doubles mapping stays consistent across multiple trainings on CSV files, and I doubt the Weka command-line tool provides more than that. So it appears the Weka command-line tool is not designed to handle CSV files with updateable classifiers.

For MOA, I understand ARFF is the only file-based solution "out of the box" (I am aware of the coding approach, which I already use; rough sketch below). It seems weird to me, because a file-based approach is quite standard and ARFF requires real post-processing to create the headers, which I think does not really fit a stream-processing approach. Where is the flaw in my reasoning?
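
For the record, by "coding approach" I mean fixing the nominal value list when the Attribute is created, so the mapping to doubles never depends on the order in which values appear in a file (untested sketch, made-up attribute names):

import java.util.ArrayList;
import java.util.Arrays;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

public class FixedNominalMappingSketch {
    public static void main(String[] args) {
        // Declare the value list once: "cat" is always index 0, "dog" 1, "camel" 2,
        // no matter which value shows up first in a given training file.
        Attribute animal = new Attribute("animal", Arrays.asList("cat", "dog", "camel"));
        ArrayList<Attribute> attrs = new ArrayList<Attribute>();
        attrs.add(animal);
        attrs.add(new Attribute("class", Arrays.asList("yes", "no")));
        Instances header = new Instances("stream", attrs, 0);
        header.setClassIndex(1);

        Instance inst = new DenseInstance(header.numAttributes());
        inst.setDataset(header);
        inst.setValue(animal, "camel");   // always stored internally as 2.0
        System.out.println(inst);
    }
}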

Thanks, Xavier

Albert Bifet

May 14, 2014, 9:00:59 AM
to moa-...@googlegroups.com
Hi Xavier,

  I agree with you. We are working on providing better readers and instances.

Cheers, Albert

Xavier Daull

May 14, 2014, 1:18:05 PM
to moa-...@googlegroups.com, abi...@cs.waikato.ac.nz
Hi Albert,

Sorry for so many questions on this subject. Now I want to decide between different file-based solutions (CSV with nominal values, CSV with all nominal values replaced by numerics, ARFF, JSON...) for my streaming data. If I convert all my nominal data to doubles before MOA processing (it is easier in my case), it might bias the data, but apparently it is already done like this internally in MOA/Weka/SAMOA (dense or sparse instances are the final core data?), and it is a common practice in ML. So: do you think I should not expect bad side effects?

Thanks a lot,
Xavier

Albert Bifet

May 14, 2014, 9:36:12 PM
to moa-...@googlegroups.com
Hi Xavier,

   I think that it depends on your data. If you convert a discrete attribute with n values into n binary attributes, you should get similar performance, but larger instances. MOA/Weka/SAMOA have discrete attributes that are treated differently from numeric attributes.
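
If you want to try the binary encoding, Weka's NominalToBinary filter does that conversion; roughly (untested sketch):

import java.util.ArrayList;
import java.util.Arrays;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToBinary;

public class NominalToBinarySketch {
    public static void main(String[] args) throws Exception {
        // A tiny dataset with one 3-valued nominal attribute and one numeric attribute.
        ArrayList<Attribute> attrs = new ArrayList<Attribute>();
        attrs.add(new Attribute("animal", Arrays.asList("cat", "dog", "camel")));
        attrs.add(new Attribute("weight"));
        Instances data = new Instances("example", attrs, 2);
        data.add(new DenseInstance(1.0, new double[] {0.0, 4.2}));    // cat
        data.add(new DenseInstance(1.0, new double[] {2.0, 500.0}));  // camel

        // The 3-valued "animal" attribute is expanded into binary indicator attributes,
        // so the instances get wider but every value is numeric.
        NominalToBinary filter = new NominalToBinary();
        filter.setInputFormat(data);
        Instances binarised = Filter.useFilter(data, filter);
        System.out.println(binarised);
    }
}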

Cheers, Albert

Xavier Daull

May 15, 2014, 2:10:33 AM
to moa-...@googlegroups.com

Hi Albert, it makes perfect sense then! Thanks again. Xavier
