insert_rows

1,131 views
Skip to first unread message

Jack Golding

unread,
Oct 20, 2015, 4:32:47 AM10/20/15
to Airflow
Howdy,

Trying a simple ETL from Oracle to SQL Server

def sales_task(**kwargs):
mssql = mssql_hook.MsSqlHook()
oracle = oracle_hook.OracleHook()
df = oracle.get_pandas_df(sales_qry)
mssql.insert_rows('SALES',???)

But I'm sure what I should have instead of ???. I've tried list(df.itertuples()) and the same with iterrows(). How do you guys do writes using hooks?

I've tried using get_conn() as such but I appear to get errors relating to mssql and pandas (apparently its advised to not use pyodbc with mssql and instead use sql server for pandas > 0.15)

The only way I've gotten this to do is using pandas' to_sql with an inline SQL Alchemy engine. If there is a way to create a SQLAlchemy engine from a Hook I could use that.

I'm using Python 2.7 and Pandas 0.15.2 for what its worth.

Cheers,

Jack

Maxime Beauchemin

unread,
Oct 20, 2015, 12:09:41 PM10/20/15
to Airflow
I just realized that DbApiHook (base class for most other DB hooks) wasn't in the docs yet so I just added it here:
http://pythonhosted.org/airflow/code.html#airflow.hooks.DbApiHook.insert_rows

You may want to:
records = oracle.get_records(sales_qry)

At Airbnb we do a lot of ELT in Hive, so the data transfers aren't very optimized. Someone just added a bulk import option to MySql. If you want to do any type of massive data transfer you should go more native (bulk inserts) or use something like Sqoop, which is just a matter of time before we have a good SqoopOperator in Airflow

Max

Jack Golding

unread,
Oct 20, 2015, 10:24:14 PM10/20/15
to Airflow
Thanks Maxime,

This makes me think that airflow might not be a good alternative to SSIS for my purposes. What I'm looking for is a library to run scheduled python jobs to get data from APIs/databases into another database or to plot.ly. I don't use any Hadoop at all, but a lot of these ETL tools seem to be focused on ETL. Do you think I should write my own or continue to use airflow?

Maxime Beauchemin

unread,
Oct 21, 2015, 3:38:46 PM10/21/15
to Airflow
Airflow should work well for this, you probably just want to write a bulk insert method for the Airflow hook you are using if you want better performance and more concurrency. You can look at the MySqlHook to see how bulk inserts was implemented there.

Max
Reply all
Reply to author
Forward
0 new messages