One python file (.py) per DAG or one python file for all DAGs?


Vivien Morlet

Jun 7, 2020, 3:40:00 PM
to cloud-composer-discuss
Hi,

I am using Airflow and Cloud Composer, and since I have some issues with the Airflow scheduler (it is slow or stops), I am looking for ideas to optimize my Airflow architecture.

The question here is: is it better to have one Python file per DAG, or Python files that each handle several DAGs?

For example, I have one Python file that dynamically generates 25 DAGs (~10,000 tasks) for me. When all the DAGs are running at the same time, I run into issues and my tasks are not scheduled. When only a few DAGs are running, it works well. That's why I am wondering what the best way to generate DAGs is.

Don't hesitate to reach out if you can help or if you have insights about best practices on Composer and Airflow.

Thanks a lot.


Jarek Potiuk

Jun 8, 2020, 1:19:31 AM
to Vivien Morlet, cloud-composer-discuss
I am one of the committers and a PMC member of Apache Airflow. This issue is actually a limitation of Airflow 1.10.
In the currently released versions of Airflow (1.10.*), Airflow does not cope well with many DAGs in one file. This has been heavily improved in Airflow 2.0 (the current development version). We are aiming to bring some of those improvements to Airflow 1.10.11 (and possibly 1.10.12) in the coming weeks, so things are likely to improve quite a bit once that version is released.

But until this happens, your best bet is to split your dynamic DAGs into smaller, separate files (one DAG per file).
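One way to split without duplicating code is a small generator script that writes one tiny .py stub per JSON config, each stub delegating to a shared factory. A minimal sketch (the `dag_factory`/`build_dag` names and the config layout are hypothetical, not something from this thread):

```python
# generate_dag_stubs.py - sketch of "one DAG per file" without duplicating code.
# The dag_factory/build_dag names and the JSON config layout are assumptions.
import json
import os

STUB = '''from dag_factory import build_dag  # shared factory module (hypothetical)

dag = build_dag(dag_id={dag_id!r}, schedule={schedule!r})
'''


def write_dag_stubs(config_dir, dags_dir):
    """Write one tiny .py file per JSON config so the scheduler parses
    each DAG independently instead of one huge multi-DAG module."""
    os.makedirs(dags_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(config_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(config_dir, name)) as f:
            cfg = json.load(f)
        stub_path = os.path.join(dags_dir, cfg["dag_id"] + ".py")
        with open(stub_path, "w") as f:
            f.write(STUB.format(dag_id=cfg["dag_id"],
                                schedule=cfg.get("schedule", "@daily")))
        written.append(stub_path)
    return written
```

Each generated stub is a trivially fast file to parse, so the scheduler's per-file parsing work stays small even with many DAGs.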

J.


--
You received this message because you are subscribed to the Google Groups "cloud-composer-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-composer-di...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cloud-composer-discuss/cc086497-1ab9-42a8-8c13-7a557ec04c7fo%40googlegroups.com.


--

Jarek Potiuk
Polidea | Principal Software Engineer

M: +48 660 796 129
Polidea


Vivien Morlet

Jun 8, 2020, 5:01:23 AM
to cloud-composer-discuss
Thank you for the advice, I will split my DAGs while waiting for the Airflow 2.0 release.

Can you give more detail on why it is currently not recommended to dynamically generate DAGs in a single Python file?

Thank you

 

Jarek Potiuk

Jun 8, 2020, 5:07:04 AM
to Vivien Morlet, cloud-composer-discuss
Precisely because of performance. What happens under the hood is that many, many (many thousands of) database queries are generated in every loop of the scheduler in this case, and that is what kills the scheduler process (or rather the underlying database).

J.



Vivien Morlet

Jun 11, 2020, 12:59:50 PM
to cloud-composer-discuss
Hello again,

Thank you for the idea of generating one DAG per .py file, it works better.

I have another question for you: I launched a huge collection run (because I will sometimes need it) with Airflow to test the scalability of my pipelines.
The result is that my scheduler apparently only schedules the DAGs with few tasks, and the tasks of the big DAGs are not scheduled. Do you have insights or advice about that?

Here is my current configuration:
Cluster config:
- 10 cluster nodes, 20 vCPUs, 160 GB memory

airflow config:

[core]
store_serialized_dags = True
dag_concurrency = 160
store_dag_code = True
min_file_process_interval = 30
max_active_runs_per_dag = 1
dagbag_import_timeout = 900
min_serialized_dag_update_interval = 30
parallelism = 160

[scheduler]
processor_poll_interval = 1
max_threads = 8
dag_dir_list_interval = 30

[celery]
worker_concurrency = 16

[webserver]
default_dag_run_display_number = 5
workers = 2
worker_refresh_interval = 120

airflow scheduler DagBag parsing (airflow list_dags -r):
-------------------------------------------------------------------
DagBag loading stats for /home/airflow/gcs/dags
-------------------------------------------------------------------
Number of DAGs: 27
Total task number: 32229
DagBag parsing time: 22.468404
--------------------+----------+---------+----------+--------------------
file                | duration | dag_num | task_num | dags
--------------------+----------+---------+----------+--------------------
/folder__dags/dag1  | 1.835470 |       1 |     1554 | dag1
/folder__dags/dag2  | 1.717692 |       1 |     3872 | dag2
/folder__dags/dag3  | 1.530000 |       1 |     3872 | dag3
/folder__dags/dag4  | 1.391314 |       1 |      210 | dag4
/folder__dags/dag5  | 1.267788 |       1 |     3872 | dag5
/folder__dags/dag6  | 1.250022 |       1 |     1554 | dag6
/folder__dags/dag7  | 1.097342 |       1 |     2904 | dag7
/folder__dags/dag8  | 1.081566 |       1 |     3146 | dag8
/folder__dags/dag9  | 1.019032 |       1 |     3872 | dag9
/folder__dags/dag10 | 0.985410 |       1 |     1554 | dag10
/folder__dags/dag11 | 0.959722 |       1 |      160 | dag11
/folder__dags/dag12 | 0.868756 |       1 |     2904 | dag12
/folder__dags/dag13 | 0.815130 |       1 |      160 | dag13
/folder__dags/dag14 | 0.695780 |       1 |       14 | dag14
/folder__dags/dag15 | 0.617646 |       1 |      294 | dag15
/folder__dags/dag16 | 0.588876 |       1 |      210 | dag16
/folder__dags/dag17 | 0.563712 |       1 |      160 | dag17
/folder__dags/dag18 | 0.556150 |       1 |      726 | dag18
/folder__dags/dag19 | 0.553248 |       1 |      140 | dag19
/folder__dags/dag20 | 0.551490 |       1 |      168 | dag20
/folder__dags/dag21 | 0.543682 |       1 |      168 | dag21
/folder__dags/dag22 | 0.530684 |       1 |      168 | dag22
/folder__dags/dag23 | 0.498442 |       1 |      484 | dag23
/folder__dags/dag24 | 0.465740 |       1 |       14 | dag24
/folder__dags/dag25 | 0.454656 |       1 |       28 | dag25
/create_conf        | 0.022272 |       1 |       20 | create_conf
/airflow_monitoring | 0.006782 |       1 |        1 | airflow_monitoring
--------------------+----------+---------+----------+--------------------
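As a rough cross-check of these settings against the DagBag stats above (assuming one Celery worker per node, which is an assumption about this deployment, not a stated fact):

```python
# Numbers taken from the configuration and DagBag stats in this message.
parallelism = 160        # [core] parallelism: global cap on running task instances
worker_concurrency = 16  # [celery] worker_concurrency: task slots per worker
workers = 10             # assumption: one Celery worker per cluster node
total_tasks = 32229      # "Total task number" from the DagBag stats

worker_slots = workers * worker_concurrency   # 160 execution slots in total
running_cap = min(parallelism, worker_slots)  # hard ceiling on running tasks
print(running_cap, total_tasks / running_cap)
```

Whatever the scheduler does, at most 160 tasks can run at once here, so with ~32k defined tasks (about 200 per slot) the order in which runs get queued matters a lot.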

Thank you for the help again

V


Ferdinan

Nov 27, 2020, 5:00:17 PM
to cloud-composer-discuss
Hello Jarek Potiuk,

I am using Composer in a similar way as morlet...@gmail.com --> I have many Python files that generate DAGs by reading JSON config files (which are often modified) in many folders. Each folder has one Python file and many configuration files. I have ~100 Python files which generate ~500 DAGs. Most of those DAGs have either no schedule_interval or @daily. The rest, let's say 25 DAGs, are scheduled either @hourly or "half-hourly". The tasks are only API requests to create Dataproc jobs, Cloud SQL queries, BigQuery dataset creation... No greedy computation, only "long" poll processes.

I have 3 questions:

  1. I see that you wrote in your first answer: """We are aiming to bring some of the improvements to Airflow 1.10.11 (and possibly 1.10.12) in the coming weeks so it is likely when this version is released, this might improve quite a bit."""
     Do you know if some improvements were added in the 1.10.11 or 1.10.12 versions? If yes, and if you have the patience :-), can you provide some links to the fixes (issues or notes)?

  2. In my use case of many Python files generating many DAGs from configuration files, do you think it is a good idea to have a "maintenance" DAG, running on a daily basis, in charge of deleting from the Airflow database (located in Cloud SQL) the DagRuns and other info of DAGs whose configuration files have been removed (the Python file is never removed, only the configuration files can be removed)?

  3. Do you have any recommendations/best practices in terms of parameter tuning for this use case?
     I noticed that in a Composer cluster, one node can end up hosting both the airflow-scheduler pod and an airflow-worker pod.
     Is this normal? Because I see a bottleneck in choosing either
     - a low worker_concurrency value, which affects all airflow-workers, to guarantee that the scheduler or worker doesn't get evicted from the node hosting both pods, or
     - a low max_threads value, to guarantee the same as above.
     I saw an article mentioning that you can manually modify the YAML config of airflow-worker to improve the distribution of workers across nodes; is this still accurate? If yes, do you recommend doing that?
     (attached screenshot: Screenshot from 2020-11-27 22-42-40.png)
Last question: do you know if Cloud Composer plans to add an autoscaling process to add/remove nodes? It could be based on the cluster workload, for example with regard to the DAGs' schedule_interval parameters.

Thanks a lot,

Ferdinand

Marcel S. Gongora

Dec 18, 2020, 2:46:56 PM
to cloud-composer-discuss
2: Without knowing the details of your requirements, I would say you should evaluate the KubernetesPodOperator to leverage Composer through GKE and avoid those dynamic DAGs; they are not recommended.
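On question 2, if you do build such a maintenance DAG, its core is a cleanup query against the Airflow metadata database. A minimal sketch, using sqlite3 purely as a stand-in for the Cloud SQL connection (the table names dag_run and task_instance follow Airflow's metadata schema; purge_removed_dags is a hypothetical name, and a real run should also spare system DAGs like airflow_monitoring):

```python
import sqlite3  # stand-in here for the real Cloud SQL (MySQL/Postgres) connection


def purge_removed_dags(conn, active_dag_ids):
    """Delete metadata rows for DAG ids that no longer have a config file.

    active_dag_ids: the dag_ids still generated from the remaining configs.
    Returns the number of rows deleted across the touched tables.
    """
    placeholders = ",".join("?" * len(active_dag_ids))
    removed = 0
    for table in ("dag_run", "task_instance"):
        cur = conn.execute(
            "DELETE FROM %s WHERE dag_id NOT IN (%s)" % (table, placeholders),
            list(active_dag_ids),
        )
        removed += cur.rowcount
    conn.commit()
    return removed
```

Wrapping this in a PythonOperator on a @daily schedule gives the maintenance DAG described in the question; run it against a staging database first, since these deletes are irreversible.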
3: The scheduler should be isolated from the workers; this is a hack on top of Composer, but it will help you. You have two ways to go. A: keep the scheduler and worker deployments apart through Kubernetes affinity (anti-affinity), or B: create an additional node pool and use a node selector to move the scheduler over there. Both alternatives have pros and cons.
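Option B can be sketched as a nodeSelector added to the scheduler Deployment (untested config fragment; "scheduler-pool" is a hypothetical node-pool name, and Composer may overwrite manual edits during environment updates):

```yaml
# Fragment of the airflow-scheduler Deployment spec (assumed deployment name)
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: scheduler-pool  # dedicated scheduler pool
```

The cloud.google.com/gke-nodepool label is set automatically by GKE on every node, so the selector pins the scheduler pod to that pool without extra labeling.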