Development setup


Tobias Kaymak

May 29, 2018, 4:06:00 AM
to cloud-composer-discuss
Hello,

First of all, I want to say how awesome it is that a managed Airflow environment is now available on Google Cloud - I really like it!

My question is about the general development workflow:

I have prepared a Docker container image, based on the popular one by puckel ([0]), for my team to develop DAGs locally. I've adjusted it so that it runs with Python 2.7 and the packages needed for GCP.

I'm now facing a version conflict, since my custom PythonOperator relies on the "new" google-cloud-python packages, whereas the GCP operators still rely on an older version:

ContextualVersionConflict: (google-cloud-core 0.25.0 (/usr/local/lib/python2.7/site-packages), Requirement.parse('google-cloud-core<0.29dev,>=0.28.0'), set(['google-cloud-storage']))

The docs say that in the Cloud Composer environment I can simply rely on the google-cloud packages being available (since the container image is managed by the Cloud Composer team) - but that means I could face these version conflicts there too, and I would not be able to detect them before deploying the DAGs to the bucket (i.e. to production).
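
For now, the only local safeguard I can think of is pip's own dependency check inside my dev image - a rough sketch at best, since it only reflects the packages in my image, not Composer's (the image name below is just a placeholder):

$ docker run --rm my-local-airflow-image pip check   # reports packages with broken or conflicting requirements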

So my question is: is the underlying Cloud Composer image available somewhere, to make the dev cycle easier - so that I get exactly the environment my DAGs will face when deployed to Cloud Composer?

Best,
Tobias

[0] https://github.com/puckel/docker-airflow

eliseop...@gmail.com

May 29, 2018, 6:51:42 AM
to cloud-composer-discuss
+1
This would simplify things enormously for local development.

Feng Lu

May 29, 2018, 6:58:18 PM
to eliseop...@gmail.com, Tim Swast, cloud-composer-discuss


Tim Swast

May 29, 2018, 8:21:09 PM
to Feng Lu, eliseop...@gmail.com, cloud-composer-discuss
Great idea. Docker could be a very good way to get a local environment that matches the Composer environment quite closely.

As shown in https://cloud.google.com/composer/docs/how-to/managing/deploy-webserver, it is possible to extract the Docker image name from an existing Composer environment. The key steps for my environment were:

$ kubectl get pod airflow-scheduler-5978cbfd95-l489t  -o yaml --export > airflow-webserver.yaml
$ cat airflow-webserver.yaml | grep gcr.io
    image: gcr.io/cloud-airflow-releaser/airflow-worker-scheduler-1.9.0:cloud_composer_service_2018-04-29-RC3
    image: gcr.io/cloud-airflow-releaser/gcs-syncd:cloud_composer_service_2018-04-29-RC3

Then, running the "airflow-worker-scheduler..." image with `docker run gcr.io/cloud-airflow-releaser/airflow-worker-scheduler-1.9.0:cloud_composer_service_2018-04-29-RC3` did in fact download the image, but it requires some additional steps to set up GCP authorization, plus probably a connection to a MySQL server.
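
Something along these lines might be a starting point for those extra steps (an untested sketch - the key path and the MySQL connection string are placeholders, and the image's entrypoint may expect more configuration than this):

$ docker run -it \
    -v $HOME/keys/my-sa-key.json:/var/secrets/google/key.json \
    -e GOOGLE_APPLICATION_CREDENTIALS=/var/secrets/google/key.json \
    -e AIRFLOW__CORE__SQL_ALCHEMY_CONN="mysql+mysqldb://airflow:airflow@127.0.0.1/airflow" \
    gcr.io/cloud-airflow-releaser/airflow-worker-scheduler-1.9.0:cloud_composer_service_2018-04-29-RC3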

  •  Tim Swast
  •  Software Friendliness Engineer
  •  Google Cloud Developer Relations
  •  Seattle, WA, USA

Louis Vines

Jun 26, 2019, 5:03:13 AM
to cloud-composer-discuss
Hi All,

Has anyone made progress on this? The way the Cloud Composer docs suggest developing DAGs (using a test Composer environment and copying the DAGs into a data/my_name folder) seems sub-optimal to me. I would really like a local version of the environment that can be iterated on quickly and that matches the environment and folder structure I'm going to deploy the DAGs to.

I've tried following the steps you outlined, @Tim, but when running the image I get the error:

gcsfuse takes exactly two arguments. Run `gcsfuse --help` for more info.


I'm assuming this error comes from the container attempting to use gcsfuse to mount a GCS bucket, which I haven't configured. However, without being able to view and cherry-pick from the Dockerfile used to create this image, I'm unsure how to change the way the container is built to avoid this step.
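
The only workaround I can think of is overriding the entrypoint so the gcsfuse mount is skipped and I can at least poke around inside the image (untested, and it assumes bash is present in the image):

$ docker run -it --entrypoint /bin/bash \
    gcr.io/cloud-airflow-releaser/airflow-worker-scheduler-1.9.0:cloud_composer_service_2018-04-29-RC3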

Any thoughts on how to proceed with this, or advice on alternative ways to go about it?


Bob Muscovite

Aug 13, 2020, 7:26:50 AM
to cloud-composer-discuss
Greetings all,

I am surprised that this topic has not received much attention. I would like my team to be able to develop and test their DAGs locally, because doing so manually on a development instance of Composer is rather slow.

I have noticed that it is also very difficult to reproduce a pipenv environment that matches the Cloud Composer Python environment, because the open source Airflow version installable via pipenv appears to be incompatible with some of the dependency versions that Cloud Composer uses, as the OP mentions. There also seems to be no publicly available GCR repository that stores Cloud Composer images - as mentioned above in this thread, it is possible to retrieve the image used by the scheduler in your particular deployment (which appears to be uploaded to your project's GCR at the time of writing), but I am not sure whether it can be run or relied upon.

Without such an image, or Cloud Composer being available as an installable Python package, it seems it is not even possible to run reliable basic local tests such as loading a DAG into the DagBag on a developer machine - it might work locally, but different dependencies on the Composer instance mean that a locally working DAG might still break.
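
(The kind of basic check I mean is something like the following - a rough sketch, assuming Airflow is installed locally and the DAGs sit in ./dags; it only validates against local dependencies, which is exactly the gap:)

$ python -c "
from airflow.models import DagBag
bag = DagBag(dag_folder='dags', include_examples=False)
assert not bag.import_errors, bag.import_errors        # fail on any DAG import error
print('%d DAGs imported without errors' % len(bag.dags))
"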

I would appreciate hearing about how other people ensure the correctness of their Composer-destined DAGs locally.

Best regards & thanks,

Boris

Bob Muscovite

Aug 13, 2020, 7:33:46 AM
to cloud-composer-discuss
It looks like there is more information on this in a related thread as well: https://groups.google.com/u/2/g/cloud-composer-discuss/c/167kTOlaWQg/m/SFPoWodfDAAJ

Jarek Potiuk

Aug 13, 2020, 8:47:51 AM
to Bob Muscovite, cloud-composer-discuss
Hey Bob (and others),

This is a great thread. I have a comment and a request about it from the point of view of an Airflow committer and PMC member - and of someone with an interest in both the development environment and Composer. I would love to hear more people comment on this.

Over the last few years I created the Breeze development environment for Apache Airflow. Its tagline was "It's a Breeze to contribute to Apache Airflow". But from your comment, I have a feeling we might need either something similar with the tagline "It's a Breeze to test DAGs for Composer's version of Apache Airflow" or, even better, something more general: "It's a Breeze to replicate ANY deployment of Airflow locally to test DAGs in it".

Breeze is aimed much more at development of Apache Airflow itself than at DAG development and testing; however, over time we have used it internally for automated system tests (mainly for the Google Cloud operators). Many other people have used it for testing their own DAGs as well, and it already contains a lot of the useful things you mentioned - for example, running it with a MySQL backend is literally a matter of adding `--backend mysql` and optionally `--mysql-version`. We also make extensive use of parts of Breeze in Apache Airflow's automated CI tests. It is intended to run fully locally (so none of the Cloud SQL, GCS, or other external components that Composer uses are there). However, I think (please correct me if I am wrong) that what you are actually missing is mostly PIP/APT dependency synchronization - not the other Composer/GCP-specific "external" components?
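
Concretely, entering the environment looks roughly like this (a sketch - the exact flags can change between versions, so the BREEZE.rst linked below is the authoritative reference):

$ git clone https://github.com/apache/airflow.git
$ cd airflow
$ ./breeze --backend mysql --mysql-version 5.7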

You can see a video I recorded about Breeze: https://www.youtube.com/watch?v=4MCTXq-oF68  and read more about it here: https://github.com/apache/airflow/blob/master/BREEZE.rst 
 
I wonder if you and others could take a look and see whether Breeze is a good starting point for something like this: what you would miss if you wanted to turn Breeze into a "DAG testing environment for Composer", and your general feeling about such a tool - which features you would like to see, what you currently miss, etc. I would love to hear more and gather some feedback here - and possibly bring it into discussion with the Airflow community and the Composer team.

J.






--

Jarek Potiuk
Polidea | Principal Software Engineer

M: +48 660 796 129

