ML in presto

Arshak Navruzyan

Apr 4, 2014, 7:37:26 PM

to presto...@googlegroups.com

I just saw this pull request by Christopher Berner regarding ML support in Prestodb (https://github.com/facebook/presto/pull/1175)

Any documentation/README available about how to use it? Specifically I'd like to know:

how do you invoke training/scoring from SQL
where are models stored
does it train in parallel or is training done during a "reduce" phase
are there any limitations beyond the known 3.1.7 issues

christop...@gmail.com

May 7, 2014, 3:49:44 PM

to presto...@googlegroups.com

On Friday, April 4, 2014 4:37:26 PM UTC-7, Arshak Navruzyan wrote:
> I just saw this pull request by Christopher Berner regarding ML support in Prestodb (https://github.com/facebook/presto/pull/1175)
>
>
> Any documentation/README available about how to use it? Specifically I'd like to know:

> how do you invoke training/scoring from SQL
> where are models stored
> does it train in parallel or is training done during a "reduce" phase
> are there any limitations beyond the known 3.1.7 issues

Nope, no documentation yet. It's nowhere near ready for production use. That pull request wasn't merged, but there's a new one you can follow here (https://github.com/facebook/presto/pull/1275).

The plugin adds a couple of new functions, and you can train and use models like this:
SELECT evaluate_classifier_predictions(label, classify(features, model))
FROM (
    SELECT learn_classifier(label, features) AS model
    FROM my_training_data
)
CROSS JOIN my_validation_data;

Your data will need to contain labels as BIGINTs, and features as json.

Models are a new type, and can be stored in any connector that supports that type (Classifier/Regressor). However, none of our connectors support that yet, so you can't store them.

Training is done in a reduce phase. I plan to make parts of it parallel in the future, so that you can do hyperparameter selection.

Presto has some internal limitations, so models larger than a dozen or so megabytes may not work. I tested the code with ~100k features, though, and it worked fine, so in practice this shouldn't be an issue. Also, all of your training data must fit in memory in a single operator, so you may need to increase the task memory limit if you have a lot of training data.

Arshak Navruzyan

May 7, 2014, 8:47:47 PM

to presto...@googlegroups.com

Thanks for the clarifications.

Not sure I get the "features as json". Is there some easy way to transform arbitrary SQL results to JSON and pass them into your learn_classifier function?


christop...@gmail.com

May 8, 2014, 1:03:05 PM

to presto...@googlegroups.com

Yes, there's also a scalar function, features(), which you can use to transform bigints and doubles. However, it only handles up to 10 features.
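
As a rough sketch of how that could fit into the earlier query: this assumes features() takes the numeric columns directly as arguments and that its result is accepted by learn_classifier and classify, neither of which is confirmed in this thread. The columns age and income are hypothetical placeholders for your own bigint or double columns.

```sql
-- Hypothetical schema: my_training_data and my_validation_data each have
-- a BIGINT label column plus the numeric columns age and income.
-- Assumption: features() accepts the columns as arguments (at most 10 of them).
SELECT evaluate_classifier_predictions(
           v.label,
           classify(features(v.age, v.income), m.model))
FROM (
    -- Train a single model over the whole training set.
    SELECT learn_classifier(label, features(age, income)) AS model
    FROM my_training_data
) AS m
CROSS JOIN my_validation_data AS v;
```

The feature columns should be passed in the same order for training and for classification, since the model presumably identifies each feature by its position.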
