Every Airflow worker needs to parse the DAG file before it can run a task, and parsing means executing all of the module's top-level code, including the loop that builds your DAGs. As you've discovered, every single task instance therefore pays that startup overhead.
There's no getting around that if you want a single Python module that defines all of your DAGs.
You can, however, reduce the per-file parsing overhead by sharding. If the number of DAGs is fixed, you could use a script to generate 1.py, 2.py, ..., n.py and have each one import and call a DAG factory module; each file then only builds its own DAG when it is parsed, rather than all n of them. For example:
1.py:
from dag_builder import build_dag

# Airflow picks this DAG up because it is assigned to a module-level variable.
dag = build_dag(1)
dag_builder.py:
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator on Airflow < 2.3

def build_dag(i):
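    # Illustrative body only: the dag_id pattern, schedule, start_date, and the
    # single placeholder task are assumptions; replace them with whatever your
    # real DAGs need.
    dag = DAG(
        dag_id=f"generated_dag_{i}",
        schedule="@daily",  # "schedule_interval" on Airflow < 2.4
        start_date=datetime(2024, 1, 1),
        catchup=False,
    )
    EmptyOperator(task_id="placeholder", dag=dag)
    return dag

The shard files themselves can be produced by a small throwaway script. A sketch, assuming n is known up front (the generate_shards.py name, the value of n, and the filename pattern are just placeholders):

generate_shards.py:
n = 100  # total number of DAGs / shard files to generate
for i in range(1, n + 1):
    with open(f"{i}.py", "w") as f:
        f.write("from dag_builder import build_dag\n")
        f.write(f"dag = build_dag({i})\n")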