How to use Airflow for time-sensitive tasks, like web scraping?


Devin Jacobs

Mar 29, 2016, 11:26:17 AM
to Airflow
I have some web scraping to do and I was considering using Airflow to manage and schedule the scraping. Airflow would also be helpful for analyzing and ETLing the results of the scraping into my database.

I'm having an issue with Airflow's default behavior of trying to "backfill" past DAG instances. This doesn't really make sense for web scraping. If I miss a week of scraping a site, then scraping it 8 times on the same day won't make up for that.

Any suggestions on how I should design my workflow? Is Airflow a good fit for this task?
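
To make the backfill arithmetic concrete, here is a minimal sketch in plain Python (it does not use Airflow itself). It assumes a daily schedule and assumes that each run is stamped with the start of its interval and only becomes due once that interval has ended. The function name `pending_execution_dates` and the dates are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def pending_execution_dates(start_date, interval, now):
    """List the start of every interval that has fully elapsed by `now`.

    Assumption: a run becomes due only once its whole interval has passed,
    so a scheduler that was down would find all of these waiting at once.
    """
    dates = []
    current = start_date
    while current + interval <= now:
        dates.append(current)
        current += interval
    return dates


# Hypothetical example: daily scrape, nothing has run since Mar 21.
missed = pending_execution_dates(
    start_date=datetime(2016, 3, 21),
    interval=timedelta(days=1),
    now=datetime(2016, 3, 29, 11, 26),
)
print(len(missed))  # 8 runs would be due on the same day
```

All eight runs would hit the site at "now", so each would scrape the same current page rather than the page as it was on the missed day.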

Jack Golding

Apr 5, 2016, 12:51:55 AM
to Airflow
I'm also interested in this. Personally, Devin, I ignore all the date-time parts of Airflow, but this is probably the wrong solution.

Jack Golding

Apr 5, 2016, 4:37:31 AM
to Airflow
For more info, my current issue is that I have a DAG which had 2 Python Operators that update different parts of a dashboard every 5 minutes.

I've now added another Python Operator and refreshed my scheduler. The problem is that the scheduler is now only backfilling the new Python Operator! (The execution date is 5 days behind today's date.)

If you are using Airflow as a glorified cron, what is the best design pattern regarding start_date?
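
One possible pattern for the "glorified cron" case, sketched here in plain Python rather than with any Airflow API: have the task's callable check whether its execution date belongs to the most recent interval and return early otherwise. The names `is_latest_run` and `refresh_dashboard` are hypothetical, and the sketch assumes a run stamped with execution date D is launched once D + interval has passed.

```python
from datetime import datetime, timedelta


def is_latest_run(execution_date, interval, now):
    """True only if no newer scheduled run is already due.

    Assumption: the run stamped `execution_date` is due at
    execution_date + interval, and the next one at execution_date + 2 * interval.
    """
    return execution_date + interval <= now < execution_date + 2 * interval


def refresh_dashboard(execution_date, interval, now=None):
    """Hypothetical task body: do the work only for the current interval."""
    if now is None:
        now = datetime.now()
    if not is_latest_run(execution_date, interval, now):
        return "skipped"  # stale backfill run, nothing useful to do
    # ... the real dashboard update would go here ...
    return "refreshed"


five_minutes = timedelta(minutes=5)
now = datetime(2016, 4, 5, 5, 12)
print(refresh_dashboard(datetime(2016, 4, 5, 5, 5), five_minutes, now))   # refreshed
print(refresh_dashboard(datetime(2016, 3, 31, 5, 5), five_minutes, now))  # skipped
```

With a guard like this the backfilled runs still get scheduled, but they finish immediately instead of repeating the work; it does not change what the scheduler decides to run.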

Jack Golding

Apr 5, 2016, 5:12:47 AM
to Airflow
Sorry for the spam, but it looks like https://github.com/airbnb/airflow/issues/59 and https://github.com/airbnb/airflow/issues/262 are the conversation so far on these issues.