Re: mysqlhook /dbapihook insert_rows IGNORE statement ?

383 views
Skip to first unread message

Wilson Lian

unread,
Jun 25, 2018, 2:12:56 PM6/25/18
to Hugo Epicier, cloud-composer-discuss
Hi Hugo,

Are you writing your own custom operator using the mysqlhook? Could you use the MySqlOperator and construct your query as "INSERT ... ON DUPLICATE KEY UPDATE"?

best,
Wilson

On Mon, Jun 25, 2018 at 7:36 AM, 'Hugo Epicier' via cloud-composer-discuss <cloud-compo...@googlegroups.com> wrote:
Hi there,
I'm using cloud sql to storage data from rest api into a mysql db. Actually doing this inside python callable with a mysqlhook but had a _mysql_exceptions.IntegrityError due to primary key duplicate. It is not supposed to happen in production but it did on testing purpose and would like to prevent from this in the future as reality is always different from theory...
Is there a way to use insert_rows method with ignore statement ? If no, do you have any workaround or should I query my db first, check if overlap with api result and load only the difference ?
Thanks

--
You received this message because you are subscribed to the Google Groups "cloud-composer-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-composer-discuss+unsub...@googlegroups.com.
To post to this group, send email to cloud-composer-discuss@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cloud-composer-discuss/5c664551-32f0-4ddb-a3c6-2cd0e518827a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hugo Epicier

unread,
Jun 26, 2018, 4:03:56 AM6/26/18
to cloud-composer-discuss
Hi Wilson,

Thanks for answer. For the moment I'm not going into implementing my own operators, trying to completely understand the concepts of Airflow before that :)
Actually I was using a hook so that I can fetch data with a python lib and insert into mysql within a python callable. Which I find more efficient comparing to transmitting the rows to insert to the mysqloperator. From my understanding of airflow, the way to do this is to use xcom but I don't think it would be very relevant if there are a lot of rows ? What would you recommend ? May be there is an easier way to do this with the composer packaging (using gcs as a staging area ?)

Best,
Hugo

Le lundi 25 juin 2018 20:12:56 UTC+2, Wilson Lian a écrit :
Hi Hugo,

Are you writing your own custom operator using the mysqlhook? Could you use the MySqlOperator and construct your query as "INSERT ... ON DUPLICATE KEY UPDATE"?

best,
Wilson
On Mon, Jun 25, 2018 at 7:36 AM, 'Hugo Epicier' via cloud-composer-discuss <cloud-compo...@googlegroups.com> wrote:
Hi there,
I'm using cloud sql to storage data from rest api into a mysql db. Actually doing this inside python callable with a mysqlhook but had a _mysql_exceptions.IntegrityError due to primary key duplicate. It is not supposed to happen in production but it did on testing purpose and would like to prevent from this in the future as reality is always different from theory...
Is there a way to use insert_rows method with ignore statement ? If no, do you have any workaround or should I query my db first, check if overlap with api result and load only the difference ?
Thanks

--
You received this message because you are subscribed to the Google Groups "cloud-composer-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-composer-discuss+unsub...@googlegroups.com.
To post to this group, send email to cloud-compo...@googlegroups.com.

Wilson Lian

unread,
Jun 27, 2018, 4:50:05 PM6/27/18
to Hugo Epicier, cloud-composer-discuss
Hi Hugo,

My recommendations, in order would be the following:
1) If there's a pre-built operator that can move data from your source to MySQL, use that. At the moment, I'm only seeing operators from presto, hive, and vertica, so I understand this is unlikely to meet your needs.
2) Delegate the data processing/movement to something designed to scale with your data (i.e., start a Dataflow, Dataproc, etc. job that reads from your data source, transforms it as needed, and writes to MySQL). This approach follows our general advice against doing data processing in the Airflow DAG code itself. 
3) In one task, delegate to something that can dump rows in bulk to the Cloud Storage bucket (as CSV for example), then in the next task, use the MySQLOperator to write the dump to MySQL.

best,
Wilson

On Tue, Jun 26, 2018 at 1:03 AM, 'Hugo Epicier' via cloud-composer-discuss <cloud-compo...@googlegroups.com> wrote:
Hi Wilson,

Thanks for answer. For the moment I'm not going into implementing my own operators, trying to completely understand the concepts of Airflow before that :)
Actually I was using a hook so that I can fetch data with a python lib and insert into mysql within a python callable. Which I find more efficient comparing to transmitting the rows to insert to the mysqloperator. From my understanding of airflow, the way to do this is to use xcom but I don't think it would be very relevant if there are a lot of rows ? What would you recommend ? May be there is an easier way to do this with the composer packaging (using gcs as a staging area ?)

Best,
Hugo

Le lundi 25 juin 2018 20:12:56 UTC+2, Wilson Lian a écrit :
Hi Hugo,

Are you writing your own custom operator using the mysqlhook? Could you use the MySqlOperator and construct your query as "INSERT ... ON DUPLICATE KEY UPDATE"?

best,
Wilson
On Mon, Jun 25, 2018 at 7:36 AM, 'Hugo Epicier' via cloud-composer-discuss <cloud-compo...@googlegroups.com> wrote:
Hi there,
I'm using cloud sql to storage data from rest api into a mysql db. Actually doing this inside python callable with a mysqlhook but had a _mysql_exceptions.IntegrityError due to primary key duplicate. It is not supposed to happen in production but it did on testing purpose and would like to prevent from this in the future as reality is always different from theory...
Is there a way to use insert_rows method with ignore statement ? If no, do you have any workaround or should I query my db first, check if overlap with api result and load only the difference ?
Thanks

--
You received this message because you are subscribed to the Google Groups "cloud-composer-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-composer-discuss+unsubscri...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "cloud-composer-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-composer-discuss+unsub...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages