Automatminer predicting on unknown data


Emily

Jul 9, 2019, 12:49:57 PM
to matminer
Hi,
I have a model for binary classification from automatminer that I'm pretty happy with - I now want to try to run the "predict" function on some unknown compounds. The predict function still needs a target column - can I just fill the target column randomly with 1s and 0s or will that bias the model somehow?
Thanks!
Emily S

Alex Dunn

Jul 9, 2019, 1:13:00 PM
to Emily, matminer
Hey Emily,

The target column doesn’t have to exist in the data frame (if it is in the data frame, it is ignored during “predict”). The target arg is included in predict simply for consistency of syntax and as a sanity check to make sure you are trying to predict the same quantity the pipeline was trained on (this has saved me many times). It also defines the output column name. Regardless, the target should be the same name as the one you fit on!

So the following should work:

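# Fit on a data frame that contains the target column.
pipe.fit(df_containing_target, target)
# Predict on new data; the target column may be absent here.
predictions = pipe.predict(df_not_containing_target, target)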
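
To make the behaviour described above concrete, here is a minimal, self-contained sketch. It is not automatminer code: `ToyPipe`, the 1-nearest-neighbour "model", the `is_metal` column, and the `"<target> predicted"` output column name are all assumptions made up for illustration. It only shows the three points from the explanation: a target column in the prediction data is never read, the target name is checked against the fitted one, and the target name decides what the output column is called.

```python
class ToyPipe:
    """Hypothetical stand-in for a fitted pipeline (not automatminer's MatPipe)."""

    def fit(self, rows, target):
        # Remember which column was learned, and which columns are features.
        self.fitted_target = target
        self.feature_names = sorted(k for k in rows[0] if k != target)
        self.train = [([r[k] for k in self.feature_names], r[target]) for r in rows]
        return self

    def predict(self, rows, target):
        # Sanity check: refuse to predict a quantity other than the fitted one.
        if target != self.fitted_target:
            raise ValueError(f"fitted on {self.fitted_target!r}, asked for {target!r}")
        results = []
        for row in rows:
            # Only feature columns are read, so a target column in `row` is ignored.
            x = [row[k] for k in self.feature_names]
            # 1-nearest-neighbour lookup over the training rows.
            _, label = min(
                self.train,
                key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
            )
            out = {k: row[k] for k in self.feature_names}
            # The output column name below is an assumption for this sketch.
            out[f"{target} predicted"] = label
            results.append(out)
        return results


train = [
    {"x": 0.1, "is_metal": 1},
    {"x": 0.2, "is_metal": 1},
    {"x": 0.9, "is_metal": 0},
]
pipe = ToyPipe().fit(train, "is_metal")

# Same unknown compounds, once without a target column and once with filler labels.
no_target = pipe.predict([{"x": 0.15}, {"x": 0.8}], "is_metal")
random_target = pipe.predict(
    [{"x": 0.15, "is_metal": 0}, {"x": 0.8, "is_metal": 1}], "is_metal"
)
```

In this sketch `no_target` and `random_target` come out identical, which is the point of the answer: filler labels in the target column cannot bias predictions when that column is never read.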

Thanks,
Alex

--
You received this message because you are subscribed to the Google Groups "matminer" group.
To unsubscribe from this group and stop receiving emails from it, send an email to matminer+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/matminer/7b4f2153-2ae1-4221-b44d-3af1fd3085cf%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

emm...@gmail.com

Jul 9, 2019, 1:59:09 PM
to matminer
Hi,
Okay, that makes sense. It would be pretty lame if the target values you input affected your model, I suppose haha
Thanks!
Emily

Anubhav Jain

Jul 9, 2019, 2:00:06 PM
to matminer
I would vote for removing "target" from predict; this seems very confusing.

>>  simply for consistency of syntax

Why do fit and predict need the same syntax? I'd rather have consistency of syntax with a normal scikit-learn model, which doesn't ask you for a target column in predict (since none is needed!)

>> to make sure you are trying to predict the same quantity the pipeline was trained on, this has saved me many times

This seems more like a problem with the way you must be organizing your code? If you are having trouble keeping track of which pipeline was trained on what, you could use descriptive variable names like

pipe_bandgap

or comments?


I'd like to hear a good argument as to why "target" is needed for predict



Alexander Dunn

Jul 9, 2019, 5:31:28 PM
to matminer
I'm certainly not wedded to the idea of having target in predict as a required argument.

The current implementation is this way because the underlying classes of MatPipe (AutoFeaturizer, DataCleaner, FeatureReducer, all AutoMLAdaptors) use the same .transform operations in MatPipe fit and MatPipe predict, as many of the underlying operations are the same during fitting and prediction. For example, AutoFeaturizer.transform creates descriptors in mostly the same way whether the operand is the df being featurized for fitting or some other df during prediction. I suppose the underlying classes could just read their own .fitted_target to get the target during .transform, with the consequence that a bit of code complexity would be added.
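
A rough sketch of that alternative, under stated assumptions: the classes below are hypothetical toys, not the automatminer implementation. `ToyFeaturizer`, `ToyPipeline`, the `formula_length` descriptor, the mean-value "model", and the `"<target> predicted"` column name are invented for illustration. The only idea taken from the paragraph above is that a step stores the target name during fit and reads its own `fitted_target` in transform, so predict no longer needs a target argument.

```python
class ToyFeaturizer:
    """Hypothetical step that remembers its target at fit time (not automatminer code)."""

    def __init__(self):
        self.fitted_target = None

    def fit(self, rows, target):
        # Record the target name once, during fitting.
        self.fitted_target = target
        return self

    def transform(self, rows):
        # No target argument: the step relies on the name it stored during fit.
        if self.fitted_target is None:
            raise RuntimeError("call fit before transform")
        out = []
        for row in rows:
            new_row = dict(row)
            # Same featurization for training and prediction data.
            new_row["formula_length"] = len(row["formula"])
            out.append(new_row)
        return out


class ToyPipeline:
    """Hypothetical pipeline whose predict() takes no target argument."""

    def __init__(self, step):
        self.step = step

    def fit(self, rows, target):
        self.step.fit(rows, target)
        self.fitted_rows = self.step.transform(rows)
        return self

    def predict(self, rows):
        # The target name comes from the fitted step, not from the caller.
        target = self.step.fitted_target
        featurized = self.step.transform(rows)
        # Placeholder "model": predict the mean of the training targets.
        mean = sum(r[target] for r in self.fitted_rows) / len(self.fitted_rows)
        return [{**r, f"{target} predicted": mean} for r in featurized]


train = [{"formula": "NaCl", "gap": 5.0}, {"formula": "Si", "gap": 1.0}]
pipe = ToyPipeline(ToyFeaturizer()).fit(train, "gap")
preds = pipe.predict([{"formula": "GaAs"}])
```

The extra complexity mentioned above shows up as the `fitted_target` bookkeeping and the fitted-or-not check inside each step; in exchange, the caller's predict signature matches the scikit-learn convention of taking only the data.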

Thoughts?

Thanks,
Alex



Anubhav Jain

Aug 6, 2019, 12:48:41 PM
to matminer
Note - I have added this as a GitHub issue:

Alexander Dunn

Aug 7, 2019, 7:58:23 PM
to matminer
Hey Emily,

You can now use MatPipe predict without target. Pull the latest commits if you'd like this capability!

Thanks,
Alex