automatminer crashes when using a 'saved' pipeline for prediction

Arnab Kabiraj

Oct 4, 2019, 8:40:18 AM
to matminer
Dear developers,

I'm trying to save the best automatminer pipeline after optimization for a particular dataset using MatPipe.save(), since the optimization takes quite some time. The pipeline is dumped without problems. However, when I load it with MatPipe.load() and use it to predict on unseen data, it throws the error 'Pipeline' object has no attribute 'fitted_pipeline_'. I understand this has something to do with the backend being removed and replaced by the best pipeline on saving, which predict() then cannot handle, but I was unable to solve the problem myself. The commands and the screen output are pasted below.

>>> pipe = MatPipe.load('pipe.pickle')
2019-10-04 17:44:28 INFO     Loaded MatPipe from file pipe.pickle.
2019-10-04 17:44:28 WARNING  Only use this model to make predictions (do not retrain!). Backend was serialzed as only the top model, not the full automl backend.
>>> pipe.predict(df)
2019-10-04 17:44:38 INFO     Beginning MatPipe prediction using fitted pipeline.
2019-10-04 17:44:38 INFO     AutoFeaturizer: Starting transforming.
2019-10-04 17:44:38 INFO     AutoFeaturizer: composition column already exists, overwriting with composition from structure.
2019-10-04 17:44:38 INFO     AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.
StructureToOxidStructure: 100%|███████████████████| 2/2 [00:00<00:00, 11.49it/s]
StructureToComposition: 100%|█████████████████████| 2/2 [00:00<00:00, 12.21it/s]
2019-10-04 17:44:39 INFO     AutoFeaturizer: Guessing oxidation states of compositions, as they were not present in input.
CompositionToOxidComposition: 100%|███████████████| 2/2 [00:00<00:00, 12.08it/s]
2019-10-04 17:44:39 INFO     AutoFeaturizer: Featurizing with ElementProperty.
ElementProperty: 100%|████████████████████████████| 2/2 [00:00<00:00, 11.96it/s]
2019-10-04 17:44:39 INFO     AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.
StructureToOxidStructure: 100%|███████████████████| 2/2 [00:00<00:00, 13.03it/s]
2019-10-04 17:44:40 INFO     AutoFeaturizer: Featurizing with SineCoulombMatrix.
SineCoulombMatrix: 100%|██████████████████████████| 2/2 [00:00<00:00, 12.05it/s]
2019-10-04 17:44:40 INFO     AutoFeaturizer: Featurizer type bandstructure not in the dataframe. Skipping...
2019-10-04 17:44:40 INFO     AutoFeaturizer: Featurizer type dos not in the dataframe. Skipping...
2019-10-04 17:44:40 INFO     AutoFeaturizer: Finished transforming.
2019-10-04 17:44:40 INFO     DataCleaner: Starting transforming.
2019-10-04 17:44:40 INFO     DataCleaner: Cleaning with respect to samples with sample na_method 'fill'
2019-10-04 17:44:40 INFO     DataCleaner: Replacing infinite values with nan for easier screening.
2019-10-04 17:44:40 INFO     DataCleaner: One-hot encoding used for columns ['material', 'dir', 'XY', 'E']
2019-10-04 17:44:40 INFO     DataCleaner: Before handling na: 2 samples, 162 features
2019-10-04 17:44:40 INFO     DataCleaner: 0 samples did not have target values. They were dropped.
2019-10-04 17:44:40 WARNING  DataCleaner: Mismatched columns found in dataframe used for fitting and argument dataframe.
2019-10-04 17:44:40 WARNING  DataCleaner: Coercing mismatched columns...
2019-10-04 17:44:40 INFO     DataCleaner: After handling na: 2 samples, 143 features
2019-10-04 17:44:40 INFO     DataCleaner: Reordering columns...
2019-10-04 17:44:40 INFO     DataCleaner: Finished transforming.
2019-10-04 17:44:40 INFO     FeatureReducer: Starting transforming.
2019-10-04 17:44:40 INFO     FeatureReducer: Finished transforming.
2019-10-04 17:44:40 INFO     TPOTAdaptor: Starting predicting.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mag1/atomate/atomate_env/lib/python3.6/site-packages/automatminer/utils/pkg.py", line 65, in wrapper
    return func(*args, **kwargs)
  File "/home/mag1/atomate/atomate_env/lib/python3.6/site-packages/automatminer/pipeline.py", line 170, in predict
    predictions = self.learner.predict(df, self.target)
  File "/home/mag1/atomate/atomate_env/lib/python3.6/site-packages/automatminer/utils/pkg.py", line 65, in wrapper
    return func(*args, **kwargs)
  File "/home/mag1/atomate/atomate_env/lib/python3.6/site-packages/automatminer/utils/log.py", line 94, in wrapper
    result = meth(*args, **kwargs)
  File "/home/mag1/atomate/atomate_env/lib/python3.6/site-packages/automatminer/automl/base.py", line 115, in predict
    y_pred = self.best_pipeline.predict(X)
  File "/home/mag1/atomate/atomate_env/lib/python3.6/site-packages/automatminer/automl/adaptors.py", line 197, in best_pipeline
    return self._backend.fitted_pipeline_
AttributeError: 'Pipeline' object has no attribute 'fitted_pipeline_'


Please have a look and let me know how I can save the best pipeline to a file and use it later to make predictions on unseen data.

Regards,
Arnab

Alexander Dunn

Oct 4, 2019, 2:07:24 PM
to matminer
Hey Arnab,

Debugging save/load

Thanks for pointing this out. Upon running, I am also getting this issue. I've opened an issue on GitHub and am working on fixing it presently. I'll update this thread when it is fixed.
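
For anyone following along, the failure mode in the traceback can be reproduced with a few stand-in classes. This is only a sketch of the mechanism, not automatminer's actual code and not necessarily what the fix looks like: FittedPipeline, SearchBackend, Adaptor, strip_backend and the from_serialized flag are all hypothetical names chosen for illustration. The idea is that once the backend has been swapped for its best model on save, any accessor that still reaches for backend.fitted_pipeline_ will fail, while an accessor that knows about the swap can return the backend directly.

```python
class FittedPipeline:
    """Stand-in for the best fitted model (hypothetical)."""

    def predict(self, X):
        return [2 * x for x in X]


class SearchBackend:
    """Stand-in for a full AutoML search object that keeps its best
    model under the attribute ``fitted_pipeline_`` (hypothetical)."""

    def __init__(self):
        self.fitted_pipeline_ = FittedPipeline()


class Adaptor:
    """Stand-in adaptor that can be saved as 'top model only'."""

    def __init__(self, backend):
        self._backend = backend
        self.from_serialized = False  # hypothetical flag

    def strip_backend(self):
        # Mimic saving: replace the search object with its best model.
        self._backend = self._backend.fitted_pipeline_
        self.from_serialized = True

    @property
    def best_pipeline_unguarded(self):
        # Always assumes the full search backend is still present.
        return self._backend.fitted_pipeline_

    @property
    def best_pipeline(self):
        # Guarded: after stripping, the backend already is the best model.
        if self.from_serialized:
            return self._backend
        return self._backend.fitted_pipeline_


adaptor = Adaptor(SearchBackend())
adaptor.strip_backend()

try:
    adaptor.best_pipeline_unguarded
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
predictions = adaptor.best_pipeline.predict([1, 2, 3])
print(predictions)
```

The unguarded accessor raises the same kind of AttributeError as in the traceback, while the guarded one still predicts after the backend has been stripped.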

Just to get some more info though, which version of automatminer and matminer are you using?


Other issues

I noticed in your log that DataCleaner is one-hot encoding some suspicious columns: ['material', 'dir', 'XY', 'E']. By default, automatminer presets include "extra" columns in the learning process, so that if you have features of your own that you want to learn from, you don't have to do any extra work.


I am guessing the first two columns are a material id string and the directory the files are in? If so, it's unlikely you want to keep these as features. When one-hot encoded, the features the ML algorithms see will include some troublesome data:

If your original df is something like

"material" | "dir"
"mat-1"    | "my_dir_1"
"mat-2"    | "my_dir_2"
...

Then what the ML algorithm sees is:

"mat-1" | "mat-2" | ... | "my_dir_1" | "my_dir_2" | ...
   1    |    0    | ... |     1      |     0      | ...
   0    |    1    | ... |     0      |     1      | ...

This can add thousands of extra features that are irrelevant to your problem, which will (1) add considerable noise to the learning problem and (2) make automatminer pipelines much slower and larger in size.
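
To make the growth concrete, here is a small pure-Python sketch of one-hot encoding on id-like string columns. It is not how DataCleaner is implemented; the helper one_hot_encode and the 1000-row toy data are assumptions made up for illustration. Because every row has its own unique id and directory, each column contributes one new feature per row.

```python
def one_hot_encode(rows, columns):
    """One-hot encode the given string columns of a list of row dicts."""
    # Collect the distinct values of each column in first-seen order.
    categories = {
        col: list(dict.fromkeys(row[col] for row in rows)) for col in columns
    }
    encoded = []
    for row in rows:
        new_row = {}
        for col in columns:
            for value in categories[col]:
                # 1 where this row holds the value, 0 everywhere else.
                new_row[f"{col}_{value}"] = int(row[col] == value)
        encoded.append(new_row)
    return encoded


# Toy data: every sample has a unique material id and a unique directory.
rows = [{"material": f"mat-{i}", "dir": f"my_dir_{i}"} for i in range(1, 1001)]

encoded = one_hot_encode(rows, ["material", "dir"])
n_features = len(encoded[0])
print(f"{len(rows)} samples, {n_features} one-hot features")
```

Each encoded row contains exactly two ones and otherwise zeros, so the extra features carry no information beyond "this is sample i".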

If you didn't mean to include them, the easiest way to remove them is just by dropping them from the training data frames. I think the pipeline will drop them on prediction automatically. If you can, this is the way I'd recommend.
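
A minimal sketch of dropping the columns with pandas before fitting follows. The toy frame and the "composition" and "target" column names are assumptions for illustration only; substitute your own dataframe, and decide separately whether 'XY' and 'E' are real features for your problem.

```python
import pandas as pd

# Hypothetical training frame: two bookkeeping columns plus real inputs.
train_df = pd.DataFrame(
    {
        "material": ["mat-1", "mat-2"],
        "dir": ["my_dir_1", "my_dir_2"],
        "composition": ["Fe2O3", "NaCl"],
        "target": [1.2, 3.4],
    }
)

bookkeeping_columns = ["material", "dir"]

# errors="ignore" skips any listed column that is not in the frame,
# so the same line is safe to reuse on frames that lack some of them.
train_df = train_df.drop(columns=bookkeeping_columns, errors="ignore")

remaining = list(train_df.columns)
print(remaining)
```

The cleaned frame is then what you would pass to the pipeline for fitting.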

Alternatively, if you are familiar with defining your own pipelines, you can ignore columns in each automatminer class: AutoFeaturizer, DataCleaner, FeatureReducer, and AutoMLAdaptor. We are currently working on a way to easily ignore columns throughout the entire pipeline (https://github.com/hackingmaterials/automatminer/issues/228), but this is not quite done yet.


Thanks,
Alex

Alexander Dunn

Oct 4, 2019, 2:47:45 PM
to matminer
Hey Arnab,

The issue has been fixed as of commit db8e940b328dd1e29a2a9206788caaa99b130a96

Pull the latest commits from the GitHub repo for the fix. I'll be releasing a new version soon (within the next 2 weeks or so) but if you need it quicker than that go ahead and do a git pull.

Thanks,
Alex

Arnab Kabiraj

Oct 5, 2019, 4:16:24 AM
to matminer
Thanks a lot, Alex, for the swift response. I can confirm that the problem has been resolved.

Alexander Dunn

Oct 5, 2019, 8:43:27 AM
to matminer
Sure thing! And if you are still having issues with the columns (see my previous response), and they are indeed unintended, I can help troubleshoot.