[ANN] General ML and ETL libraries


Chris Nuernberger

Feb 27, 2019, 12:17:33 PM2/27/19
to clo...@googlegroups.com
Clojurians,


Good morning from (again) snowy Boulder!


Following lots of discussion and interaction with many people around the Clojure and ML worlds, TechAscent has built a foundation intended to let the average Clojurist do high-quality machine learning of the kind they are likely to encounter in their day-to-day work.


This isn't a deep learning framework; I already tried that in a bespoke fashion, and I think the mxnet bindings are great.


This is specifically for the use case where you have data coming in from multiple sources and you need to do the cleaning, processing, and feature augmentation before running some set of simple models.  Then you gridsearch across a range of models and go about your business from there.  Think small- to medium-sized Datomic databases and the like.  Everyone has a little data before they have a lot, and I think this scale captures a far wider range of possible use cases.


The foundation comes in two parts.  


The first is the ETL library:


This library is a column-store-based design sitting on top of Tablesaw.  The Clojure ML group profiled lots of different libraries, and we found that Tablesaw works great.
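To make the column-store idea concrete, here is a small interop sketch against the underlying Tablesaw Java library (not the ETL library's own API; the file name is just a placeholder): a table is a collection of typed columns, and you work column-wise rather than row-wise.

(import '[tech.tablesaw.api Table])

;; Load a CSV into a Tablesaw table; each column is stored as one typed array.
(def table (-> (Table/read) (.csv "data.csv")))

{:columns (.columnNames table)   ;; one typed column per name
 :rows    (.rowCount table)}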

The ETL language is composed of three sub-languages: first, a set-invariant column-selection language; second, a minimal functional math language along the lines of APL or J; and finally, a pipeline concept that lets you describe an ETL pipeline as data.  You create the pipeline and run it on training data, and it records context; then, during inference, you just use the saved pipeline from that first operation.
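To give a flavor of what that looks like, here is a minimal sketch.  The operation and function names below are hypothetical, not the library's actual API; the point is the shape: the pipeline is plain data, it is fit once against training data (recording context such as column means and category mappings), and the saved pipeline plus context is replayed unchanged at inference time.

;; A hypothetical pipeline described as data -- each step pairs a column
;; selector with an operation.
(def pipeline
  [[:remove-columns [:id]]
   [:replace-missing :numeric :mean]   ;; records per-column means as context
   [:string->number :categorical]      ;; records the string->index mapping
   [:std-scale :numeric]])

;; fit-pipeline and apply-pipeline are hypothetical names for the two phases.
(comment
  ;; Training: run the pipeline, capturing the recorded context.
  (let [{:keys [dataset context]} (fit-pipeline pipeline training-data)]
    ;; ... train models against dataset ...
    ;; Inference: replay the saved pipeline + context on new data, no re-fitting.
    (apply-pipeline pipeline context new-data)))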

This is the second large ETL system I have worked on; the first was Alteryx.


The next library is a general ML framework:


The library has bindings to xgboost, smile, and libsvm.  Libsvm doesn't get the credit it deserves, btw; it works extremely well on small-n problems.  xgboost works well on everything, and smile contains lots of different model types that may or may not work well depending on the problem, as well as clustering and a lot of other machine-learny things.
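As a hedged sketch of the intended workflow (again with hypothetical function names, not the library's actual calls): take the dataset produced by the ETL pipeline, train a handful of model families across the xgboost, smile, and libsvm backends, and keep whichever cross-validates best.

;; Hypothetical names: train-cv returns a map with the fitted model and a
;; cross-validated metric; predict applies a fitted model to new rows.
(comment
  (let [candidates [{:model-type :xgboost/classification}
                    {:model-type :smile/random-forest}
                    {:model-type :libsvm/classification}]
        results    (map #(train-cv % processed-dataset {:k-folds 5}) candidates)
        best       (apply max-key :accuracy results)]
    (predict (:model best) (apply-pipeline pipeline context new-data))))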

For this case, my interest wasn't so much a clear exposition of all the different things smile can do as getting a wide enough range of model generators to be effective.  For a more thorough binding to smile, check out:



I built a Clojure version of a very involved Kaggle problem example, using clojupyter and oz, as a proof of concept:




Enjoy :-).

Compliments of the TechAscent Crew & Clojure ML Working Group

Didier

Mar 14, 2019, 1:38:15 AM3/14/19
to Clojure
Awesome!