Performance issues


Juan Emilio Gabito Decuadra

unread,
May 24, 2018, 6:47:33 PM5/24/18
to cstore users
I have the following table:

CREATE FOREIGN TABLE public.t_censo (
    prefecture text,
    gender text,
    population integer
)
SERVER cstore_server
OPTIONS (compression 'pglz');

ALTER FOREIGN TABLE public.t_censo
    OWNER TO colap;

And I loaded about 125,000,000 records (you can find the CSV I used here: https://drive.google.com/file/d/1pQC8UHPd2tdxcZCzsVn8BLL49-LKl99q/view?usp=sharing)

I also have the same table created as a regular Postgres table.

When I run this query against the cstore table, it takes about 43 minutes:

SELECT gender, prefecture, SUM(population), COUNT(*) AS _meta_count
FROM t_censo
GROUP BY gender, prefecture
ORDER BY gender, prefecture;

Here's the explain plan:

"GroupAggregate  (cost=38515902.32..40076329.92 rows=40000 width=80)"
"  Group Key: genero, prefectura"
"  ->  Sort  (cost=38515902.32..38827907.84 rows=124802208 width=68)"
"        Sort Key: genero, prefectura"
"        ->  Foreign Scan on t_censo  (cost=0.00..1257718.83 rows=124802208 width=68)"
"              CStore File: /var/lib/postgresql/9.6/main/cstore_fdw/17180/2496283"
"              CStore File Size: 105907261"

The same query run against the regular Postgres table took about 3 minutes to execute.

The data is ordered by both prefecture and gender.

So there's something really strange here, because this should be a fast case for column-oriented storage.

Regards

Murat Tuncer

unread,
May 25, 2018, 2:36:23 AM5/25/18
to Juan Emilio Gabito Decuadra, cstore users
Hello Juan Emilio,

This is not the best query for cstore_fdw, since
1 - it retrieves all columns in the table, and
2 - there is no filter on the query, so it returns every row in the table.


It looks like most of the time is spent on sorting. Could you compare this plan with the plan for the regular PostgreSQL table? Also try running ANALYZE on the table, so the planner might switch to a hash aggregate instead of group/sort.
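Concretely, the suggestion above amounts to something like the following (table and column names taken from your first message; the EXPLAIN is just to check whether the planner switched strategies):

```sql
-- Refresh the planner's statistics for the foreign table
ANALYZE t_censo;

-- Re-check the chosen plan; after ANALYZE the planner may prefer
-- a HashAggregate over the Sort + GroupAggregate path
EXPLAIN
SELECT gender, prefecture, SUM(population), COUNT(*) AS _meta_count
FROM t_censo
GROUP BY gender, prefecture
ORDER BY gender, prefecture;
```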
Murat

 

--
You received this message because you are subscribed to the Google Groups "cstore users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cstore-users+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Murat Tuncer
Software Engineer | Citus Data
mtu...@citusdata.com

Juan Emilio Gabito Decuadra

unread,
May 25, 2018, 8:52:03 AM5/25/18
to cstore users
Hello Murat, thanks for the answer.

This is what the explain plan for the same query in Postgres looks like:

"Sort  (cost=3290968.62..3290968.86 rows=94 width=30)"
"  Sort Key: genero, prefectura"
"  ->  HashAggregate  (cost=3290964.60..3290965.54 rows=94 width=30)"
"        Group Key: genero, prefectura"
"        ->  Seq Scan on t_censojp  (cost=0.00..2042941.80 rows=124802280 width=18)"

And this is the plan for the cstore after I ran analyze:

"Sort  (cost=2505744.93..2505745.17 rows=94 width=30)"
"  Sort Key: genero, prefectura"
"  ->  HashAggregate  (cost=2505740.91..2505741.85 rows=94 width=30)"
"        Group Key: genero, prefectura"
"        ->  Foreign Scan on t_censo  (cost=0.00..1257718.83 rows=124802208 width=18)"
"              CStore File: /var/lib/postgresql/9.6/main/cstore_fdw/17180/2496283"
"              CStore File Size: 105907261"

And now it runs in 56 seconds; that's a hell of a win. So now I must remember to run ANALYZE after every data load.

Can you explain what happened? Why is this HashAggregate faster than the previous plan?

Regards!

Murat Tuncer

unread,
May 25, 2018, 9:00:48 AM5/25/18
to Juan Emilio Gabito Decuadra, cstore users
Glad to hear that you got the performance gain.

It comes down to how Postgres plans aggregates. In the previous case, it selected the group/sort path (sort first, then apply a group aggregate). This means all rows are returned to Postgres, sorted, and the group aggregate is run over them. That requires a lot of internal memory and CPU.

In the hashagg case, however, no sort of the input is needed. It iterates over the incoming row stream once, keeping only one hash-table entry per group rather than all the rows in memory. As you can see in the plan, the sort happens after the aggregate, on a much smaller data set.
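If you want to see the difference for yourself, PostgreSQL lets you disable the hash aggregate path per session with the standard enable_hashagg planner setting, so you can compare both plans on the same table (a diagnostic sketch, not something to leave set in production):

```sql
-- Force the planner back to the Sort + GroupAggregate path
SET enable_hashagg = off;
EXPLAIN
SELECT gender, prefecture, SUM(population), COUNT(*) AS _meta_count
FROM t_censo
GROUP BY gender, prefecture
ORDER BY gender, prefecture;

-- Restore the default so the planner can choose HashAggregate again
RESET enable_hashagg;
```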

Running ANALYZE on the table helps Postgres gather statistics (data size, value distribution, etc.) about the data set, and the planner then picks the appropriate plan based on that information.
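As a rough sketch of what ANALYZE collects, you can inspect the gathered statistics through the standard PostgreSQL catalogs (the table name below is the one from this thread):

```sql
-- Row-count and page-count estimates the planner relies on
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 't_censo';

-- Per-column statistics: distinct-value estimates, most common values, etc.
SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 't_censo';
```

Before ANALYZE these numbers are defaults or stale, which is why the planner over-estimated the number of groups and fell back to the sort-based plan.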


