Running out of disk space while creating and truncating a large table iteratively in plpython

46 views
Skip to first unread message

Nikhil Kak

unread,
Jan 28, 2020, 8:39:34 PM1/28/20
to Greenplum Developers, Greenplum Users
Hi ,

We noticed that gpdb5/6 does not clear out the disk space for a table that is created and truncated+dropped iteratively in a plpython function.

Setup
Local mac with gpdb5/gpdb6 compiled from source (also tested this on a gcp instance with gppkg)
1. create table foo as select 1 as i;
2. create table bar(i int, k bytea);
3. insert into bar values(1, <insert a big bytea string ~ 600 MB>); -- in our code, we have a plpython function that compiles a keras/tensorflow model, serializes it and then writes it to the table as bytea

Repro

create or replace function run() returns void as
$$
create_table_query="create table baaz as select * from foo join bar using (i);"
for i in range(1,10):
    plpy.execute(create_table_query)
    ## In our production code, this table baaz is used to run some sql queries but for the repro we don't need to do anything with it.
    plpy.execute('truncate baaz')
    plpy.execute('drop table baaz')
    plpy.info('done with i = {}'.format(i))
$$ language plpythonu;
select run();

Contrary to our assumption, the truncate + drop statement does not free up the disk space. We noticed that for each iteration, the disk size goes up by  ~ 600MB (size of the large table baaz). This happens because truncate + drop does not clear out the disk on file and instead creates a new file in the base folder.

We then modified the run function to run all the sql statements in the same query so that they get run in the same subtransaction.  This solved the issue on gpdb6 but not on gpdb5. 

create or replace function run2() returns void as
$$
q="""
create table baaz as select * from foo join bar using (i);
truncate baaz;
drop table baaz;
"""

for i in range(1,10):
    plpy.execute(q)
    plpy.info('done with i = {}'.format(i))
$$ language plpythonu;
select run2();
 
We are now blocked on making it work for gpdb5, any suggestions ?

Thanks

Jon Roberts

unread,
Jan 28, 2020, 9:04:01 PM1/28/20
to Nikhil Kak, Greenplum Developers, Greenplum Users
A function is executed in a single transaction so each loop you are basically creating a new table.  That new table doesn't remove the old table because it might be needed for a rollback.

Why don't you use a temporary table instead?  Also, be sure to set the distribution of your tables.

It also seems like you should put the loop outside of the function so that the (create temporary table, work on temporary table, drop temporary table) is done in a single transaction and then commits.


Jon Roberts



--
You received this message because you are subscribed to the Google Groups "Greenplum Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gpdb-users+...@greenplum.org.
To view this discussion on the web visit https://groups.google.com/a/greenplum.org/d/msgid/gpdb-users/CAJ4uRcSshLzE7ea%2B9NJUG9GNDK-qKbgys3hNeF9RscDbiuv41Q%40mail.gmail.com.

Nikhil Kak

unread,
Jan 28, 2020, 9:11:26 PM1/28/20
to Jon Roberts, Greenplum Developers, Greenplum Users
Why don't you use a temporary table instead?  Also, be sure to set the distribution of your tables.

We did try temporary tables but it doesn't help.

It also seems like you should put the loop outside of the function so that the (create temporary table, work on temporary table, drop temporary table) is done in a single transaction and then commits.

This is similar to the run2 function where all the queries are run in the same statement. This does work for gpdb6 but not for gpdb5.


--
Thanks,
Nikhil Kak

Jon Roberts

unread,
Jan 28, 2020, 9:26:37 PM1/28/20
to Nikhil Kak, Greenplum Developers, Greenplum Users
Try something like this.  You could also try run() without "execute" too which would be cleaner.

create or replace function run() returns void as
$$
declare
    v_sql text;
begin
    v_sql := 'drop table if exists baaz';
    raise info '%', v_sql;
    execute v_sql;
    v_sql := 'create table baaz as select * from foo join bar using (i) distributed by (i)';
    raise info '%', v_sql;
    execute v_sql;
    --do work here
    v_sql := 'drop table if exists baaz';
    raise info '%', v_sql;
    execute v_sql;
end;
$$
language plpgsql;


create or replace function run2() returns void as
$$
begin
    for i in 1..10 loop
        perform run();
    end loop;
end;
$$
language plpgsql;

select run2();


Jon Roberts

Nikhil Kak

unread,
Jan 28, 2020, 9:26:38 PM1/28/20
to Jon Roberts, Greenplum Developers, Greenplum Users
A function is executed in a single transaction so each loop you are basically creating a new table.  That new table doesn't remove the old table because it might be needed for a rollback.
 Yes we are aware of this but are trying to work around it. Also we are dependent on having to create a table because we need to join this table with the output of a UDA (the UDA returns 1 row per segment)

Why don't you use a temporary table instead?  Also, be sure to set the distribution of your tables.

We did try temporary tables but it doesn't help.
Also we do have distribution rules set up on all the tables. I didn't include it here for simplicity.

--
Thanks,
Nikhil Kak

Luis Macedo

unread,
Jan 28, 2020, 9:30:11 PM1/28/20
to Nikhil Kak, Jon Roberts, Greenplum Developers, Greenplum Users
I would not recommend building your ETL with functions... It will bring you several headaches in the future.

Can you use an ETL tool instead? Like Talend for example.


Rgds,


Luis F R Macedo

Advisory Data Engineer & Business Development for Latam

Call Me @ +55 11 97616-6438

Take care of the customers and the rest takes care of itself



Nikhil Kak

unread,
Jan 28, 2020, 9:42:25 PM1/28/20
to Jon Roberts, Greenplum Developers, Greenplum Users
Thanks for the suggestion but unfortunately even this doesn't work either on gpdb5 or gpdb6. I modified your run() function by adding a truncate statement before the drop statement and then it worked on gpdb6 but not on gpdb5. 
--
Thanks,
Nikhil Kak

Nikhil Kak

unread,
Jan 28, 2020, 9:45:44 PM1/28/20
to Luis Macedo, Jon Roberts, Greenplum Developers, Greenplum Users
I would not recommend building your ETL with functions... It will bring you several headaches in the future.

Can you use an ETL tool instead? Like Talend for example.

This along with other plpython functions are used in madlib for supporting deep learning in greenplum/postgres. Unfortunately we can't not use user defined functions.
--
Thanks,
Nikhil Kak

Luis Macedo

unread,
Jan 29, 2020, 8:19:03 AM1/29/20
to Nikhil Kak, Jon Roberts, Greenplum Developers, Greenplum Users
Got it... Then you have no other option...

I think the transactions management on 6 changed this behavior that is why you are seen this difference. I will let others comment on that have more knowledge on the subject.


Thanks,


Luis F R Macedo

Advisory Data Engineer & Business Development for Latam

Call Me @ +55 11 97616-6438

Take care of the customers and the rest takes care of itself


Luis Macedo

unread,
Jan 29, 2020, 8:52:00 AM1/29/20
to Nikhil Kak, Jon Roberts, Greenplum Developers, Greenplum Users
I did some research on this as I also have a situation I am interested in transaction control.

PostgreSQL 11 added stored procedure syntax which allows for transactions control. See link below:


Looks like we will have to wait for GPDB 7 to get it...

I will keep digging into this with a couple friends from the PG community, maybe they know a workaround.

Rgds,


Luis F R Macedo

Advisory Data Engineer & Business Development for Latam

Call Me @ +55 11 97616-6438

Take care of the customers and the rest takes care of itself


Luis Macedo

unread,
Jan 29, 2020, 9:07:03 AM1/29/20
to Nikhil Kak, Jon Roberts, Greenplum Developers, Greenplum Users
Found something that might work.

The hack is to execute from the function an external executor. Anything that is an external executor will run as an indepe dent transaction.


Let me know if that works for you.

Rgds,


Luis F R Macedo

Advisory Data Engineer & Business Development for Latam

Call Me @ +55 11 97616-6438

Take care of the customers and the rest takes care of itself


Nikhil Kak

unread,
Jan 29, 2020, 2:13:37 PM1/29/20
to Luis Macedo, Jon Roberts, Greenplum Developers, Greenplum Users
I tried using psycopg2 from inside the plpython function and it did work but it requires us to have access to the connection string so it's not really an option for us. We would have to ask the user to pass the connection string which might contain the password as well.  (unless there is a way to access the connection string from inside a plpython function)


--
Thanks,
Nikhil Kak
Reply all
Reply to author
Forward
0 new messages