ML in presto


Arshak Navruzyan

Apr 4, 2014, 7:37:26 PM
to presto...@googlegroups.com
I just saw this pull request by Christopher Berner regarding ML support in Prestodb (https://github.com/facebook/presto/pull/1175)

Any documentation/README available about how to use it?  Specifically I'd like to know:
  • how do you invoke training/scoring from SQL
  • where are models stored
  • does it train in parallel or is training done during a "reduce" phase
  • are there any limitations beyond the known 3.1.7 issues 

christop...@gmail.com

May 7, 2014, 3:49:44 PM
to presto...@googlegroups.com
On Friday, April 4, 2014 4:37:26 PM UTC-7, Arshak Navruzyan wrote:
> I just saw this pull request by Christopher Berner regarding ML support in Prestodb (https://github.com/facebook/presto/pull/1175)
>
>
> Any documentation/README available about how to use it?  Specifically I'd like to know:
>   • how do you invoke training/scoring from SQL
>   • where are models stored
>   • does it train in parallel or is training done during a "reduce" phase
>   • are there any limitations beyond the known 3.1.7 issues

Nope, no documentation yet. It's nowhere near ready for production use. That pull request wasn't merged, but there's a new one you can follow here (https://github.com/facebook/presto/pull/1275).

The plugin adds a couple of new functions, and you can train and use models like this:
SELECT evaluate_classifier_predictions(label, classify(features, model))
FROM (
    SELECT learn_classifier(label, features) AS model
    FROM my_training_data
)
CROSS JOIN my_validation_data;

Your data needs to contain labels as BIGINTs and features as JSON.

Models are a new type, and can be stored in any connector that supports that type (Classifier/Regressor). However, none of our connectors support that yet, so you can't store them.

Training is done in a reduce phase. I plan to make parts of it parallel in the future, so that you can do hyperparameter selection.

Presto has some internal limitations, so models that exceed a dozen or so megabytes may not work. I tested the code with ~100k features, though, and it worked fine, so in practice this shouldn't be an issue. Also, all your training data must fit in memory in a single operator, so you may need to increase the task memory limit if you have a lot of training data.

Arshak Navruzyan

May 7, 2014, 8:47:47 PM
to presto...@googlegroups.com
Thanks for the clarifications. 

Not sure I get the "features as JSON" part. Is there some easy way to transform arbitrary SQL results to JSON and pass them into your learn_classifier function?


christop...@gmail.com

May 8, 2014, 1:03:05 PM
to presto...@googlegroups.com
Yes, there's also a scalar function, features(), which you can use to transform bigints and doubles. However, it only handles up to 10 features.
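
As a rough sketch of how features() might slot into the earlier query: the table names are the ones from the example above, while the columns age, income and visits are made up for illustration. It also assumes features() takes the numeric columns as positional arguments, which the thread does not spell out.

```sql
-- Hypothetical columns (age, income, visits); label is assumed to be a BIGINT.
-- features() packs the numeric columns into the feature value that
-- learn_classifier() and classify() expect (at most 10 columns per call).
SELECT evaluate_classifier_predictions(
           label,
           classify(features(age, income, visits), model))
FROM (
    -- Train one model over the whole training table.
    SELECT learn_classifier(label, features(age, income, visits)) AS model
    FROM my_training_data
)
CROSS JOIN my_validation_data;
```

The same columns, in the same order, are passed to features() for both training and scoring, so that each feature position means the same thing in both places.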