How to add an example pipeline using xgboost?


Vincent N

Jun 22, 2021, 5:38:00 PM
to sig-tfx-addons, Daniel Kim, Gerard Casas Saez
Hi Team!

As a soft dependency for the XGBoost Evaluator project, we want to add an example pipeline. How do we start?

A. PR to tfx repo, adding penguin_pipeline_xgb_*.py to this folder https://github.com/tensorflow/tfx/tree/master/tfx/examples/penguin/experimental


C. Piggyback on the XGBoost Evaluator project: need to wait for XGBoost Evaluator project proposal review

Also, for the input data, does this model have to use the same penguin dataset as the existing example pipelines? The alternative would be to use data from the BQ table `bigquery-public-data.chicago_taxi_trips.taxi_trips`, which would make it possible to train the xgboost model with BQML.

Thanks,
Vincent

Gerard Casas Saez

Jun 22, 2021, 5:50:12 PM
to Vincent N, sig-tfx-addons, Daniel Kim
I would say let's do C: extend the XGBoost Evaluator project to include an example in the examples folder of the tfx-addons repo.




Gerard Casas Saez
Twitter | Cortex | @casassaez

Robert Crowe

Jun 22, 2021, 7:25:44 PM
to Gerard Casas Saez, Vincent N, sig-tfx-addons, Daniel Kim
I don't think that we have an XGBoost proposal yet, but when we do it seems like either B or C would work.  For C, the example could live under the component hierarchy since it would be specific to the component.

Is the current direction for the XGBoost project to include a dependency on BQML?

Robert Crowe | TensorFlow Developer Engineer | rober...@google.com  | @robert_crowe



--
You received this message because you are subscribed to the Google Groups "sig-tfx-addons" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sig-tfx-addon...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/sig-tfx-addons/CAOnQ39fB_4fTLz1hUmDZV_FKcc3ytqj3hfeCdhryqfwaTYxiUg%40mail.gmail.com.

Michael Hu

Jun 22, 2021, 10:00:55 PM
to sig-tfx-addons, Robert Crowe, Vincent N, sig-tfx-addons, dan...@twitter.com, gcasa...@twitter.com
Regarding Option A, we are actually planning to remove that example because it has been migrated to the TFX-Addons repo: https://github.com/tensorflow/tfx-addons/tree/main/projects/examples/sklearn_penguins

One of the items that's been sitting in my backlog is extending that example to use an XGBoost model, if it's as simple as swapping out the trainer module. All the logic around converting the tf.Examples to a numpy array and training on CAIP should remain the same. Is the example XGBoost pipeline you plan to add essentially that, or will it be different (i.e. will the training be done on BQML instead of CAIP)?
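The tf.Example-to-numpy conversion mentioned above could be sketched roughly as follows. The feature names and shapes here are assumptions for illustration (loosely modeled on the penguin dataset), not the actual module code from the sklearn_penguins example:

```python
import numpy as np
import tensorflow as tf

# Hypothetical feature names, loosely modeled on the penguin dataset.
FEATURES = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]

def examples_to_numpy(serialized_examples):
    """Parse a batch of serialized tf.Example protos into a feature matrix."""
    spec = {name: tf.io.FixedLenFeature([], tf.float32) for name in FEATURES}
    parsed = tf.io.parse_example(serialized_examples, spec)
    # Stack the per-feature vectors into an (n_examples, n_features) array.
    return np.stack([parsed[name].numpy() for name in FEATURES], axis=1)
```

The resulting array can then be fed to any sklearn-style or xgboost trainer, which is why swapping the trainer module should be the only change required.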

The sklearn_penguins example was based on the simplest TFX example available, which is the only reason that dataset was used.

Vincent N

Jul 6, 2021, 3:20:04 AM
to sig-tfx-addons, Michael Hu, rober...@google.com, Vincent N, sig-tfx-addons, Daniel Kim, Gerard Casas Saez
> Is the current direction for the XGBoost project to include a dependency on BQML?
I don't think so. Although if the evaluator is built on xgboost's native model loader instead of a pickled sklearn Pipeline, it could work with exported BQML XGBoost models.

For the local pipeline, here are 2 example changes that follow different approaches; let me know which one you prefer:

1. https://github.com/cent5/tfx-addons/commit/2f0594fcbf080f7fa3e76676162fd3ee40000ee6 - This uses xgboost's sklearn API (xgboost.XGBClassifier) within the same sklearn Pipeline, and is pretty much what you described.

2. https://github.com/cent5/tfx-addons/commit/4f95a51d067ed0274745082459ed5b96c7b74e86 - This uses the xgboost native API, which would require a different evaluator (so I think it might make more sense to have this pipeline in a separate dir altogether). Instead of pickling the sklearn Pipeline object, we use xgboost's own functions to save and load the model. This serialization format is also compatible with CAIP, plus other XGBoost interfaces such as JVM, C++, etc. So it provides more flexibility when building serving infra.

Michael Hu

Jul 7, 2021, 11:21:43 AM
to Vincent N, Jiayi Zhao, sig-tfx-addons, rober...@google.com, Daniel Kim, Gerard Casas Saez
+Jiayi Zhao 

I think your example with the native xgb API is much more appropriate, for the reasons you described. Happy to help if needed. Regardless of whether option B or C is chosen, as Gerard said, I'd also like to see this xgb example live under the examples dir instead of the components dir, so it can be extended beyond just demonstrating the use of the XGBoost Evaluator component.



Gerard Casas Saez

Jul 7, 2021, 2:45:10 PM
to Michael Hu, Vincent N, Jiayi Zhao, sig-tfx-addons, rober...@google.com, Daniel Kim
We can actually do both here:

- Add a custom Evaluator that defaults to using the custom xgboost extractor (for ease of use by the end user). This should live under the tfx-addons library folder (I still need to submit the PR restructuring things, sorry).
- Add the example so it can be extended, as Michael says.

Also, let's continue the conversation on GitHub as well - 

Gerard Casas Saez
Twitter | Cortex | @casassaez
