On Friday, April 4, 2014 4:37:26 PM UTC-7, Arshak Navruzyan wrote:
> I just saw this pull request by Christopher Berner regarding ML support in Prestodb (https://github.com/facebook/presto/pull/1175)
>
>
> Any documentation/README available about how to use it? Specifically I'd like to know:
> - how do you invoke training/scoring from SQL
> - where are models stored
> - does it train in parallel or is training done during a "reduce" phase
> - are there any limitations beyond the known 3.1.7 issues
Nope, no documentation yet. It's nowhere near ready for production use. That pull request wasn't merged, but there's a new one you can follow here (https://github.com/facebook/presto/pull/1275).
The plugin adds a couple of new functions, and you can train and use models like this:
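-- evaluate the trained model against every row of the validation set
SELECT evaluate_classifier_predictions(label, classify(features, model))
FROM (
    -- train a single classifier on the labeled training rows
    SELECT learn_classifier(label, features) AS model
    FROM my_training_data
)
CROSS JOIN my_validation_data;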
Your data will need to contain labels as BIGINTs and features as JSON.
Models are a new type, and can be stored in any connector that supports that type (Classifier/Regressor). However, none of our connectors support that yet, so you can't store them.
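Since a model can't be persisted yet, scoring new rows means training the model inside the same query. A rough sketch of that pattern, reusing only the functions shown above: `my_unlabeled_data` and its `id` column are hypothetical names, and it assumes `classify` returns the predicted label (which is how it is used in the evaluation query).

```sql
-- Sketch only: table my_unlabeled_data and column id are made up for illustration.
SELECT id, classify(features, model) AS predicted_label
FROM (
    -- the model exists only for the lifetime of this query
    SELECT learn_classifier(label, features) AS model
    FROM my_training_data
)
CROSS JOIN my_unlabeled_data;
```

The subquery produces a single row holding the model, so the cross join simply attaches that one model to every row being scored.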
Training is done in a reduce phase. I plan to make parts of it parallel in the future, so that you can do hyperparameter selection.
Presto has some internal limitations, so models that exceed a dozen or so megabytes may not work. I tested the code with ~100k features, though, and it worked fine, so in practice this shouldn't be an issue. Also, all of your training data must fit in memory in a single operator, so if you have a lot of training data you may need to increase the task memory limit.