smaller base images and other general questions on approach


Daniel Manson

Aug 14, 2023, 7:48:50 AM
to google-dl-platform
Hi,

I've managed to build a Dockerfile starting from one of the Deep Learning Containers GPU base images (the cu113 one mentioned below) and then pip-installing a bunch of other stuff on top, and that seems to run OK in GKE in a node pool with GPUs.
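
Roughly, the Dockerfile looks like this (the image tag and package list here are placeholders rather than my actual ones):

```dockerfile
# Placeholder tag -- in reality one of the Deep Learning Containers GPU images
FROM gcr.io/deeplearning-platform-release/tf-gpu:latest

# Illustrative extras; the real list is longer
RUN pip install --no-cache-dir sentence-transformers prefect
```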

However...
1. I'd like a bit more control over the numpy version (I ran into an issue with pre- vs post-v1.20 numpy that I had to work around) -- see the sketch after this list.
2. I do need to include both TF and PyTorch, because PyTorch is required by the sentence-transformers package which my data science team uses prior to the model-inference step done with Keras.
3. It's generally a bit disconcerting not having dependencies in source control, especially when it comes to installing additional dependencies on top of the base image.
4. The image is huge, taking a long time to build and a long time to pull in GKE. If there's any way to reduce the size that would be handy, though I appreciate that ultimately I need to focus mainly on getting caching set up right in both CI and GKE.
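
For points 1 and 3, the direction I'm leaning in is to keep a pinned requirements.txt in the repo and have the Dockerfile just install that (the base tag and file name below are illustrative):

```dockerfile
# Placeholder base tag, as above
FROM gcr.io/deeplearning-platform-release/tf-gpu:latest

# requirements.txt lives in source control and pins numpy, sentence-transformers, etc.
COPY requirements.txt /tmp/requirements.txt
# --no-cache-dir keeps pip's download cache out of the image layer
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```

That at least makes the layer we add reproducible, even if the base image's own dependencies still aren't pinned by us.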

I have had a look at the base cu113:94 image's dependencies, and there still seems to be a lot of opinionated stuff in there. If I just want an image that will correctly run TensorFlow and PyTorch on GKE's GPUs, what would be the minimum requirements (aside from TF and PyTorch themselves, which I would ideally want to install myself so as to pick versions)? And what's the best way of setting up such an image? (I can't really uninstall stuff already in a base image, as that doesn't reduce the size of the earlier layers in Docker.)
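
My current guess at what "minimal" might look like is below; I'm not at all sure the CUDA/cuDNN runtime tag actually matches what the TF and torch wheels I'd pin expect, so please treat it as a sketch rather than something I know works:

```dockerfile
# Sketch of a minimal GPU image: CUDA/cuDNN runtime libraries only (no driver,
# since GKE exposes the node's NVIDIA driver to the container), plus Python
# and our pinned frameworks. Tags and package versions are guesses.
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# requirements.txt would pin tensorflow, torch, numpy, sentence-transformers, ...
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install --no-cache-dir -r /tmp/requirements.txt
```

Is that roughly the right shape, or does the cu113 base image provide something that a plain CUDA runtime image wouldn't?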

Any pointers here would be much appreciated - I hope this is the right forum to discuss such things.

For a bit more context on what I'm doing (and do tell me if this is not a great approach): we already run loads of Prefect ELT jobs in GKE (with auto-provisioning and autoscaling) as our main/only data-processing stack (with a lot of BigQuery and dbt involved), and I was hoping to just provide a special ML version of the execution config to Prefect so that it can use a dedicated image and node affinities for the jobs that need it. This would mean I can treat the ML stuff provided by our data science team as not being particularly special relative to our other data engineering.

Thanks,

Daniel

Daniel Manson

Aug 14, 2023, 1:30:10 PM
to google-dl-platform
I have found the `-slim` version of the tf-gpu image, but it's rather strange: it seems to be missing TF itself, yet it does contain other non-critical things like IPython, Jupyter and Node.js (!).
