working with pandas dataframes

1,006 views
Skip to first unread message

mic...@gogotech.hk

unread,
May 27, 2015, 3:24:57 AM5/27/15
to luigi...@googlegroups.com
Hi,

I am new to Luigi and I am working a lot with pandas dataframes. Historically I have written my own mini framework that did inserts and updates using pandas read and write sql functions and it all just worked very smoothly. I did not need to worry about creating table if table did not exist or about specifying column definitions etc.

Since I joined in new company I would like to start using Luigi and I would like to leverage outputs and targets.

1. How do you work with pandas dataframes? Should I be using my own insert/update functions outside output definition or is there some way to automatically write/read sql ?

Currently what I am doing is something like this :

def run(self):
df=pd.read_sql_query('select country,date,requests,orders from agg_country_date limit 10',con=self.engine)
with self.output().open('w') as out_file:
print (df.to_csv(sep='\t', header=False, index=False), file=out_file)

which seems super hacky. Basically I am reading data from postgresql table to dataframe and then printing it to local file. In next job I am reading this file and inserting to db using luigi.postgres.CopyToTable.

1a. Can I just export csv directly from dataframe using df.to_csv(), then read this csv and export to sql using df.to_sql() ?

I can do this easily outside luigi framework but then would need to mock the outputs and probably would not get benefits of luigi.

Best regards,
Michal










Erik Bernhardsson

unread,
May 27, 2015, 8:25:21 PM5/27/15
to mic...@gogotech.hk, luigi...@googlegroups.com
I don't think this is super hacky. But is there a reason why you need to  convert the sql result to a data frame and then to csv instead of just writing directly to csv?

Separately it would be great if our sql support would include querying as well, we should add that











--
You received this message because you are subscribed to the Google Groups "Luigi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Thomas Yager-Madden

unread,
Jun 30, 2015, 11:21:48 AM6/30/15
to luigi...@googlegroups.com, mic...@gogotech.hk, er...@malfunction.org
I'm using data frames in Luigi as well. In my case it's largely because I'm working with CSV downloads from an external source (Google DFP API) that have a few quirks (gzipped externally, and include total rows at the end), and the pandas from_csv reader is more flexible/powerful than standard csv library in ways that help with that. Also, once the data frame is loaded, it can be handy to be able to perform some data transformations with pandas methods before loading the result to a Postgres table. I generally treat the csv.gz file as Luigi Task input, then my run() method loads a data frame with pd.from_csv(), does some manipulation, and writes to Postgres Target output. I suppose I could do a lot of this outside of Luigi, but the framework helps link the download requests to DFP reporting with the Postgres load jobs that depend on them, so adding Luigi to the mix is worth it, to me. Including pandas logic in the run method doesn't strike me as hacky. 

Hope that helps.
Thomas
Reply all
Reply to author
Forward
0 new messages