Filesystem and OS choice


Vineeth Kuruvath

May 20, 2016, 5:17:16 AM
to gpdb-...@greenplum.org
Hello everyone,

I will be setting up a new greenplum 4.3.8 installation and wanted to know the best choice of filesystem and OS.

The documentation recommends XFS but warns against using CentOS 7.

Hence, the following are the options I'm considering:
1. CentOS 6 with ext4
2. openSUSE Leap 42.1 with XFS
3. openSUSE 13.1 with XFS
4. Oracle Linux 7 with XFS

Does anyone have recommendations with regard to stability and performance?

Are there any performance limitations or issues with ext4? Also, is there any way to track whether the bugs mentioned for CentOS 7 have been resolved? If none of these seem like a good option, please do recommend any free/open alternative.

Thanks and Regards
Vineeth

Ivan Novick

May 20, 2016, 8:53:13 AM
to Vineeth Kuruvath, Greenplum Users
How about CentOS 6 with XFS? That is what most users have.

Red Hat is almost ready with a kernel fix for RHEL 7, and if you need the kernel patch in the meantime, there are a number of unofficial patch packages that can be obtained.
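For reference, the XFS side of a CentOS 6 segment host usually comes down to something like the sketch below. The device name and mount point are placeholders, and the mount options shown are the ones usually cited for Greenplum data volumes; confirm them against the install guide for your GPDB release rather than taking them from here.

# Sketch: create and mount an XFS data filesystem on CentOS 6 (run as root).
# /dev/sdb1 and /data1 are placeholders -- substitute your own device and mount point.
yum install -y xfsprogs        # the XFS userland tools are not installed by default
mkfs.xfs -f /dev/sdb1
mkdir -p /data1
# /etc/fstab entry with the commonly recommended options:
# /dev/sdb1  /data1  xfs  rw,nodev,noatime,nobarrier,inode64,allocsize=16m  0 0
mount /data1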

Cheers,
Ivan




--
Ivan Novick
Product Manager Pivotal Greenplum & HDB

Vineeth Kuruvath

May 23, 2016, 4:48:42 AM
to Ivan Novick, Greenplum Users
Hi Ivan

My IT team told me that XFS is not available with CentOS 6. I'll discuss with them again and see what issue they are facing.

Thanks.
Vineeth

Vineeth Kuruvath

May 30, 2016, 5:16:07 AM
to Greenplum Users
Hi guys, I've gone ahead with CentOS 6 and XFS. Thanks a lot for helping out. Following are the gpcheckperf results (I think it looks good, except perhaps for the network speed, no?):

Disk I/O and memory

 disk write avg time (sec): 98.93
 disk write tot bytes: 134654525440
 disk write tot bandwidth (MB/s): 1298.18
 disk write min bandwidth (MB/s): 642.79 [app2055]
 disk write max bandwidth (MB/s): 655.39 [app2056]

 disk read avg time (sec): 17.15
 disk read tot bytes: 134654525440
 disk read tot bandwidth (MB/s): 7488.86
 disk read min bandwidth (MB/s): 3700.77 [app2055]
 disk read max bandwidth (MB/s): 3788.10 [app2056]

 stream tot bandwidth (MB/s): 11598.22
 stream min bandwidth (MB/s): 5488.80 [app2055]
 stream max bandwidth (MB/s): 6109.41 [app2056]


Full matrix netperf bandwidth test

 Per host transfer rates
 app2056 Tx rate: 179.51
 app2055 Tx rate: 222.47

 Per host receive rates
 app2056 Rx rate: 222.47
 app2055 Rx rate: 179.51

 Summary:
 sum = 401.98 MB/sec
 min = 179.51 MB/sec
 max = 222.47 MB/sec
 avg = 200.99 MB/sec
 median = 222.47 MB/sec

Ivan Novick

May 31, 2016, 9:41:38 AM
to Vineeth Kuruvath, Greenplum Users
Your disk read is faster than I normally see and your disk write is slower than I normally see.

Can you remind us what the hardware behind the disks is, and whether you are using virtualization?

Cheers,
Ivan

Tim tim

Jun 1, 2016, 2:56:53 AM
to Greenplum Users, vineeth....@gmail.com
The read speed is that fast because the OS cache is being used.


On Tuesday, May 31, 2016 at 16:41:38 UTC+3, inovick wrote:

Ivan Novick

Jun 1, 2016, 9:04:33 AM
to Tim tim, Greenplum Users, Vineeth Kuruvath
Normally we run gpcheckcat with a data size of double the memory, to avoid the OS read cache.
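In other words, size the test file so the read pass cannot be served from the page cache. A minimal gpcheckperf sketch, assuming 64GB-RAM hosts (the hostfile and data directory are placeholders):

# Sketch: with 64GB of RAM per host, write ~2x that per host so reads must come from disk.
gpcheckperf -f hostfile_segonly -d /data/primary -r d -S 128GB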

Vineeth Kuruvath

Jun 1, 2016, 9:44:45 AM
to Ivan Novick, Tim tim, Greenplum Users
Hi Ivan

I have a RAID5 SAN storage connected with 8Gbps FC. I have asked my IT team for its benchmark test results.

And, yes, they are running on VMs.

I did not get how gpcheckcat can be used for checking read speed. Could you please elaborate?

Thanks
Vineeth

Tim tim

Jun 1, 2016, 10:33:33 AM
to Greenplum Users, oska...@gmail.com, vineeth....@gmail.com
Ivan, with (memory size) * 2 of data, part of the pages can still end up in the memory cache.
Flushing the cache after the file is written can give a more accurate result.
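A minimal way to do that on each host between the write and read phases, assuming root access (this is a generic Linux step, not a gpcheckperf option):

# Flush dirty pages to disk, then drop the clean page cache before the read test.
sync
echo 3 > /proc/sys/vm/drop_caches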

On Wednesday, June 1, 2016 at 16:04:33 UTC+3, inovick wrote:

Ivan Novick

Jun 1, 2016, 11:03:47 AM
to Vineeth Kuruvath, Tim tim, Greenplum Users
I meant gpcheckperf not gpcheckcat, sorry :)

Vineeth Kuruvath

Jun 15, 2016, 10:11:22 AM
to Ivan Novick, Tim tim, Greenplum Users
Hi Ivan, Tim,

The SAN storage has SAS 10krpm disks in RAID5 and SSD caching, and is rated for 40k IOPS. Since we are up and running, I took a look at the performance we are getting from gpperfmon: the read rate seems to peak around 800MB/s and the write rate around 600MB/s. I've quickly box-plotted them; the plots are attached to this mail.

I ran gpcheckperf, but I am getting a weirdly low read speed:

 disk write avg time (sec): 624.77
 disk write tot bytes: 944892805120
 disk write tot bandwidth (MB/s): 1442.73
 disk write min bandwidth (MB/s): 709.09 [app2055]
 disk write max bandwidth (MB/s): 733.64 [app2056]

 disk read avg time (sec): 2152.64
 disk read tot bytes: 944892805120
 disk read tot bandwidth (MB/s): 418.62
 disk read min bandwidth (MB/s): 208.44 [app2055]
 disk read max bandwidth (MB/s): 210.18 [app2056]


Regards
Vineeth
 
20160615_DiskReadRate.png
20160615_DiskWriteRate.png

Luis Macedo

Jun 15, 2016, 10:33:25 AM
to Vineeth Kuruvath, Greenplum Users, Tim tim, Ivan Novick

Did you run it with the DB offline?

--- Sent from my Nexus 5x

Keaton Adams

Jun 15, 2016, 1:57:13 PM
to Greenplum Users
How many RAID volumes running XFS do you have configured on each segment server?  More than one?

What is the exact gpcheckperf command (with all options) that is being used for the test?

Vineeth Kuruvath

Jun 16, 2016, 2:45:43 AM
to Keaton Adams, Luis Macedo, Greenplum Users
The DB was online but there were no queries running. Anywho, I have run it again after taking the DB down. Following are the command and output:

[gpadmin@app2054 ~]$ gpcheckperf -f hostfile_gpssh_segonly -d /home/gpadmin/data/ -r d -S 440GB
/usr/local/greenplum-db/./bin/gpcheckperf -f hostfile_gpssh_segonly -d /home/gpadmin/data/ -r d -S 440GB

--------------------
--  DISK WRITE TEST
--------------------

--------------------
--  DISK READ TEST
--------------------

====================
==  RESULT
====================

 disk write avg time (sec): 593.07

 disk write tot bytes: 944892805120
 disk write tot bandwidth (MB/s): 1533.56
 disk write min bandwidth (MB/s): 693.18 [app2055]
 disk write max bandwidth (MB/s): 840.38 [app2056]


 disk read avg time (sec): 2187.04

 disk read tot bytes: 944892805120
 disk read tot bandwidth (MB/s): 412.04
 disk read min bandwidth (MB/s): 205.09 [app2055]
 disk read max bandwidth (MB/s): 206.95 [app2056]

In the command center though, I am seeing up to ~450MB/s of disk read rate simultaneously on both machines.

I have not configured any RAID volumes for the segment servers. I'm actually not sure whether that would make sense, since we have a storage area network connected through Fibre Channel, and what I finally get to see is a single large virtual disk.

Regards
Vineeth


Luis Macedo

Jun 16, 2016, 9:42:14 AM
to Vineeth Kuruvath, Keaton Adams, Greenplum Users
How much memory do your nodes have?

Keep in mind that the OS cache will influence the GPCC readings. If you have a repeating query, or queries that go after the same data, that might be the reason you see better performance in GPCC than in gpcheckperf.

Anyway, I think the difference between read and write is too large... it does not make much sense to me.


Rgds,


Luis Macedo | Sr Platform Architect | Pivotal Inc 

Mobile: +55 11 97616-6438

Take care of the customers and the rest takes care of itself

Vineeth Kuruvath

Jun 16, 2016, 10:23:44 AM
to Luis Macedo, Keaton Adams, Greenplum Users
Each segment server has 220GB of memory, hence I ran gpcheckperf with 440GB as Ivan suggested.

Keaton Adams

Jun 16, 2016, 10:44:43 AM
to Greenplum Users, lma...@pivotal.io, kad...@pivotal.io
"I have not configured any RAID volume for a segment server. I'm actually not sure whether it makes sense or not since we have a storage area network connected through a fibre channel and what I finally get to see is a single large virtual disk."

This is a really good article on how hardware should be configured for the Greenplum Database, a shared-nothing, MPP platform:


Specifically, reference the section that discusses segment server disk configuration: "The Segment hosts should have 24 disks that are 10K RPM or faster. We recommend enterprise SAS drives because SATA drives are less performant, and SSD still runs high on the price per TB scale.  It’s becoming common to see SAS drives at 1.2TB and larger, which is starting to push user data capacity near the 1PB per rack. When grouping disks, a set of 24 disks will usually be split into two RAID5 sets with a set of two hot spares."

I am not certain if we have any customers running production GPDB clusters with a SAN over fiber.  Generally the nodes are completely separate CPU, memory and disk configurations to allow true horizontal scale.  What is the LUN configuration on your SAN?  The disk pool you have: are the physical disks shared by the logical volumes that are presented across the segment servers, or does each segment server map to its own set of underlying HDDs to avoid contention?  In addition, we typically have more than one RAID group configured (each with its own set of physical drives) so we can spread out the disk I/O by the database segment servers across more than one XFS volume.

The read/write rates you are receiving are not great, especially reads.  Here is an example of what gpcheckperf should return from the disk configuration.  This was a Dell platform with the following drive configuration on each segment server:

The four Segment Hosts are configured with the following:
 
OS volume: 2x1.8TB in RAID1
Data Volume 1: 10x1.8TB in RAID 10 + 1 hot swap
Data Volume 2: 10x1.8TB in RAID 10 + 1 hot swap

I would suggest working with your SAN administrator to explain how Greenplum is an MPP database where all servers work together as one "logical database": each segment server really needs to not be in contention at the SAN / disk level with the others, so the underlying disk volumes really need to come from separate drive pools. See if that helps boost disk volume performance across the cluster.


[gpadmin@utlxa450 ~]$ gpcheckperf -f /tmp/segs -r ds -D -d /data1/primary -d /data2/primary
/usr/local/greenplum-db/./bin/gpcheckperf -f /tmp/segs -r ds -D -d /data1/primary -d /data2/primary
 
--------------------
--  DISK WRITE TEST
--------------------
 
--------------------
--  DISK READ TEST
--------------------
 
--------------------
--  STREAM TEST
--------------------
 
====================
==  RESULT
====================
 
disk write avg time (sec): 412.89
disk write tot bytes: 2166084075520
disk write tot bandwidth (MB/s): 5003.17
disk write min bandwidth (MB/s): 1247.88 [sdw4]
disk write max bandwidth (MB/s): 1253.33 [sdw1]
-- per host bandwidth --
    disk write bandwidth (MB/s): 1247.88 [sdw4]
    disk write bandwidth (MB/s): 1253.33 [sdw1]
    disk write bandwidth (MB/s): 1253.18 [sdw2]
    disk write bandwidth (MB/s): 1248.78 [sdw3]
 
 
disk read avg time (sec): 408.44
disk read tot bytes: 2166084075520
disk read tot bandwidth (MB/s): 5057.62
disk read min bandwidth (MB/s): 1261.29 [sdw4]
disk read max bandwidth (MB/s): 1268.66 [sdw2]
-- per host bandwidth --
    disk read bandwidth (MB/s): 1261.29 [sdw4]
    disk read bandwidth (MB/s): 1263.70 [sdw1]
    disk read bandwidth (MB/s): 1268.66 [sdw2]
    disk read bandwidth (MB/s): 1263.97 [sdw3]
 
 
stream tot bandwidth (MB/s): 49402.29
stream min bandwidth (MB/s): 11882.93 [sdw4]
stream max bandwidth (MB/s): 12774.12 [sdw1]
-- per host bandwidth --
    stream bandwidth (MB/s): 11882.93 [sdw4]
    stream bandwidth (MB/s): 12774.12 [sdw1]
    stream bandwidth (MB/s): 12394.29 [sdw2]
    stream bandwidth (MB/s): 12350.95 [sdw3]
 
 



Luis Macedo

Jun 16, 2016, 11:12:22 AM
to Keaton Adams, Greenplum Users
Keaton,

We do have customers running on VMware and AWS.

Virtualized setups with SAN infrastructure for disks are still challenged by our scan-rate needs. IOPS are not an important metric for GPDB; scan rates are.

With a SAN you will need to look not only at the LUNs and underlying disks but also at your SAN network, which can bottleneck your scan speeds.

This article is old but can shed some light to the discussion:


Thanks,


Luis Macedo | Sr Platform Architect | Pivotal Inc 

Mobile: +55 11 97616-6438

Take care of the customers and the rest takes care of itself


Scott Kahler

Jun 16, 2016, 11:37:26 AM
to Luis Macedo, Keaton Adams, Greenplum Users
There are a whole lot of variables in how the SAN could be configured, so it'll be tough to say exactly what the problem is. The main thing to keep in mind is that the SAN is a shared resource, so everything is going to hit it at once. Depending on the SAN, there is probably some write caching at the node level before data is pushed down to the SAN, and the SAN probably has a decent-sized buffer that lets it pool writes together and lay them down in decent-sized chunks in parallel. On the read side it has to reach all the way through: you will blow past the file buffer cache on the local server and past whatever the controller can cache, so you are pretty much pulling from disk.

The average read speed you are seeing is about what one of those disks can do for sequential throughput. So my guess is that when reads reach the SAN there isn't any sort of parallel access to the disks. One way to test this would be to run gpcheckperf using only half the nodes in the cluster. If you run at twice the speed, you are probably hitting a network or controller bottleneck that is capping the whole pipe. If the speeds stay the same, then you are looking at an issue where a single read or write thread is hitting a cap, which means that running with multiple -d locations (-d /data/1 -d /data/2 -d /data/3 -d /data/4, to get 4 threads in gpcheckperf) might give you better performance. In the second case you might get (relatively) horrible single-query performance, but concurrency might not be a problem.

My suggestion would be to run the perf checks with 1/2 the nodes and also with 2x and 4x the -d locations, to see if any interesting performance characteristics come out of that.
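Concretely, the two comparison runs might look like the sketch below; the hostfiles, directories and -S size are placeholders to adapt, and the extra test directories must already exist and be writable by gpadmin.

# Run 1: half of the segment hosts, one test directory -- if per-host throughput roughly
# doubles, a shared network/controller cap is the likely bottleneck.
gpcheckperf -f hostfile_half -d /home/gpadmin/data/ -r d -S 440GB

# Run 2: all hosts, four test directories per host (four I/O threads) -- if total
# throughput rises, a per-thread cap is the likely bottleneck. Check whether -S applies
# per directory on your version and scale it down accordingly.
gpcheckperf -f hostfile_gpssh_segonly -r d -S 440GB \
    -d /home/gpadmin/data/d1 -d /home/gpadmin/data/d2 \
    -d /home/gpadmin/data/d3 -d /home/gpadmin/data/d4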

Scott Kahler | Pivotal, R&D, Platform Engineering  | ska...@pivotal.io | 816.237.0610

Vineeth Kuruvath

Jun 28, 2016, 8:18:31 AM
to Greenplum Users, Luis Macedo, Keaton Adams, Scott Kahler
Hi everyone

Thought I'd give an update to everyone following this, and also put out a question regarding memory.

I am in discussion with the data center manager, but getting direct-attached storage immediately will not be easy due to cost and effort constraints.

The Fibre Channel link creates a bottleneck of about 1GB/s in total for the two VMs. This matches the ~900MB/s seen in Command Center. Scott is correct with respect to multiple -d locations: when I run it on a single VM with multiple -d threads, the read rate increases until it plateaus at around 900MB/s with six threads.
It hence appears there are two levels of capping: a single thread gets about 200MB/s, and the total can reach 900MB/s.
Incidentally, 'hdparm -t' also gives about 200MB/s, but running it on the OS hosting the VM gives about 450MB/s. I am working with the data center team to figure out why this is happening.
Since it is a VM, I also experimented with elevator=noop, but there was no difference.
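For reference, the scheduler can be inspected and switched at runtime per block device (sdb is a placeholder; on a VM the hypervisor's own scheduling usually matters more than the guest's):

# Show the current I/O scheduler -- the active one appears in square brackets.
cat /sys/block/sdb/queue/scheduler
# Switch to noop at runtime; add elevator=noop to the kernel boot line to make it persistent.
echo noop > /sys/block/sdb/queue/scheduler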


My new question is regarding memory (RAM) consumption. No load, however heavy, increases the RAM usage, whereas I have read that GP generally uses so much memory that the eager-free memory policy was invented.
The usage on our system varies between 6GB and 8GB per segment host. Each segment host has 220GB of RAM, ~1.5TB of data and 8 segments.
Following is the memory-related config:
 gp_vmem_protect_limit=26360
 max_statement_mem=25000MB
 statement_mem=24000MB
In sysctl.conf
 vm.overcommit_memory = 2
 vm.overcommit_ratio = 95
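Those values can be confirmed cluster-wide with gpconfig; a small sketch, run as gpadmin on the master:

# Sketch: show the memory GUCs on the master and all segments.
gpconfig -s gp_vmem_protect_limit
gpconfig -s statement_mem
gpconfig -s max_statement_mem
# Note: 8 primaries x 26360MB gp_vmem_protect_limit is roughly a 206GB ceiling per host,
# but it is only a ceiling -- memory is not reserved up front, so light queries leave RAM idle.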

Any ideas on why the RAM usage is low?

Thanks
Vineeth

Luis Macedo

Jun 28, 2016, 10:44:13 AM
to Vineeth Kuruvath, Greenplum Users, Keaton Adams, Scott Kahler
Vineeth,

GPDB will only use a lot of RAM if you have a large aggregation that needs sorting in memory. If you only have joins it will use very little RAM.

If you use explain analyze, GPDB will tell you how much RAM the query actually used and whether it spilled to disk. There are also views you can monitor to see, live, whether you are spilling.
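For the live view, the gp_toolkit workfile views are the usual place to look; a sketch, assuming gp_toolkit is installed (it is by default) and a placeholder database name:

# Sketch: queries currently spilling workfiles to disk, per query and per segment.
psql -d yourdb -c "SELECT * FROM gp_toolkit.gp_workfile_usage_per_query;"
psql -d yourdb -c "SELECT * FROM gp_toolkit.gp_workfile_usage_per_segment;"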


See if this helps.


Thanks,


Luis Macedo | Sr Platform Architect | Pivotal Inc 

Mobile: +55 11 97616-6438

Take care of the customers and the rest takes care of itself

Vineeth Kuruvath

Jun 29, 2016, 10:01:22 AM
to Luis Macedo, Greenplum Users, Keaton Adams, Scott Kahler
Thanks Luis, I used explain analyze and was able to see that spilling is occurring (e.g. 32 spilling, 0 reused). However, I am not able to figure out why this is happening; there certainly is more than enough RAM. Following is the explain analyze result:

"Gather Motion 32:1  (slice3; segments: 32)  (cost=0.00..43409.98 rows=95818087 width=40)"
"  Rows out:  76612214 rows at destination with 1105957 ms to first row, 1563817 ms to end, start offset by 531 ms."
"  ->  Window  (cost=0.00..34429.43 rows=2994316 width=40)"
"        Partition By: bank"
"        Order By: txn, amt"
"        Rows out:  Avg 5893247.2 rows x 13 workers.  Max 46125287 rows (seg27) with 1112866 ms to first row, 1516440 ms to end, start offset by -5144 ms."
"        ->  Sort  (cost=0.00..34249.77 rows=2994316 width=30)"
"              Sort Key: bank, txn, amt"
"              Rows out:  Avg 5893247.2 rows x 13 workers.  Max 46125287 rows (seg27) with 1112866 ms to first row, 1334118 ms to end, start offset by -5144 ms."
"              Executor memory:  47029K bytes avg, 180811K bytes max (seg25)."
"              Work_mem used:  47029K bytes avg, 180811K bytes max (seg25). Workfile: (9 spilling, 0 reused)"
"              Work_mem wanted: 1130312K bytes avg, 6078157K bytes max (seg27) to lessen workfile I/O affecting 9 workers."
"              ->  Redistribute Motion 32:32  (slice2; segments: 32)  (cost=0.00..23292.08 rows=2994316 width=30)"
"                    Hash Key: bank"
"                    Rows out:  Avg 5893247.2 rows x 13 workers at destination.  Max 46125287 rows (seg27) with 535842 ms to first row, 611606 ms to end, start offset by -5144 ms."
"                    ->  Result  (cost=0.00..23010.92 rows=2994316 width=30)"
"                          Rows out:  Avg 2394131.7 rows x 32 workers.  Max 2397570 rows (seg25) with 535992 ms to first row, 557344 ms to end, start offset by -5133 ms."
"                          ->  HashAggregate  (cost=0.00..23010.92 rows=2994316 width=30)"
"                                Group By: bank, membershipno"
"                                Rows out:  Avg 2394131.7 rows x 32 workers.  Max 2397570 rows (seg25) with 535992 ms to first row, 550446 ms to end, start offset by -5133 ms."
"                                Executor memory:  104640K bytes avg, 104640K bytes max (seg0)."
"                                Work_mem used:  102151K bytes avg, 102151K bytes max (seg12). Workfile: (32 spilling, 0 reused)"
"                                Work_mem wanted: 249485K bytes avg, 249923K bytes max (seg25) to lessen workfile I/O affecting 32 workers."
"                                (seg25)  2397570 groups total in 32 batches; 1 overflows; 7413375 spill groups."
"                                (seg25)  Hash chain length 1.4 avg, 15 max, using 2341215 of 8650752 buckets."
"                                ->  Redistribute Motion 32:32  (slice1; segments: 32)  (cost=0.00..22221.76 rows=2994316 width=30)"
"                                      Hash Key: bank, membershipno"
"                                      Rows out:  Avg 40770024.7 rows x 32 workers at destination.  Max 40874260 rows (seg11) with 5137 ms to first row, 363390 ms to end, start offset by 571 ms."
"                                      ->  Result  (cost=0.00..21940.60 rows=2994316 width=30)"
"                                            Rows out:  Avg 40770024.7 rows x 32 workers.  Max 40776481 rows (seg20) with 5902 ms to first row, 406496 ms to end, start offset by -5141 ms."
"                                            ->  HashAggregate  (cost=0.00..21940.60 rows=2994316 width=30)"
"                                                  Group By: bank, membershipno"
"                                                  Rows out:  Avg 40770024.7 rows x 32 workers.  Max 40776481 rows (seg20) with 5902 ms to first row, 324800 ms to end, start offset by -5141 ms."
"                                                  Executor memory:  111295K bytes avg, 111295K bytes max (seg0)."
"                                                  ->  Sequence  (cost=0.00..7813.26 rows=44740992 width=21)"
"                                                        Rows out:  Avg 44738533.9 rows x 32 workers.  Max 44739505 rows (seg3) with 4.132 ms to first row, 151776 ms to end, start offset by 575 ms."
"                                                        ->  Partition Selector for accrualtransmaster_part (dynamic scan id: 1)  (cost=10.00..100.00 rows=4 width=4)"
"                                                              Partitions selected:  94 (out of 94)"
"                                                              Rows out:  0 rows (seg0) with 0.191 ms to end, start offset by 584 ms."
"                                                        ->  Dynamic Table Scan on accrualtransmaster_part (dynamic scan id: 1)  (cost=0.00..7813.26 rows=44740992 width=21)"
"                                                              Rows out:  Avg 44738533.9 rows x 32 workers.  Max 44739505 rows (seg3) with 3.965 ms to first row, 69300 ms to end, start offset by 575 ms."
"                                                              Partitions scanned:  Avg 94.0 (out of 94) x 32 workers.  Max 94 parts (seg0)."
"Slice statistics:"
"  (slice0)    Executor memory: 386K bytes."
"  (slice1)    Executor memory: 112466K bytes avg x 32 workers, 112469K bytes max (seg0)."
"  (slice2)  * Executor memory: 105408K bytes avg x 32 workers, 105408K bytes max (seg0).  Work_mem: 102151K bytes max, 249923K bytes wanted."
"  (slice3)  * Executor memory: 47430K bytes avg x 32 workers, 181254K bytes max (seg25).  Work_mem: 180811K bytes max, 6078157K bytes wanted."
"Statement statistics:"
"  Memory used: 204800K bytes"
"  Memory wanted: 24313424K bytes"
"Settings:  optimizer=on"
"Optimizer status: PQO version 1.632"
"Total runtime: 1655946.903 ms"

Thanks and Regards
Vineeth

Jon Ernster

Jun 29, 2016, 11:09:17 AM
to Vineeth Kuruvath, Greenplum Users, Luis Macedo, Keaton Adams, Scott Kahler
Vineeth,

Your statement_mem and max_statement_mem are way too high. The default is 128MB, so if you want to increase it, do so gradually and test 256MB, 512MB, 1024MB.
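Since statement_mem can be overridden per session, each increment can be tested without touching the cluster configuration. A sketch with a placeholder database name; the table and column names are borrowed from the plan above purely for illustration:

# Sketch: raise statement_mem for one session only, then re-check memory use and spilling.
psql -d yourdb <<'SQL'
SET statement_mem = '256MB';
-- illustrative stand-in for the real workload query
EXPLAIN ANALYZE
SELECT bank, membershipno, sum(amt)
FROM   accrualtransmaster_part
GROUP  BY bank, membershipno;
SQL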


___________________________________

Jon Ernster

Senior Systems Architect



Luis Macedo

Jun 29, 2016, 12:39:27 PM
to Vineeth Kuruvath, Keaton Adams, Greenplum Users, Scott Kahler

Vineeth,

Your query is using a lot of memory... this is quite unusual... can you analyse the tables and then try again?

Also, are you using columnar storage? Your initial scan is taking over a minute...

Rgds

--- Sent from my Nexus 5x

Vineeth Kuruvath

Jun 30, 2016, 10:32:52 AM
to Luis Macedo, jern...@zdatainc.com, Keaton Adams, Greenplum Users, Scott Kahler
Hi all,

I tried varying the statement memory as Jon suggested and it is giving an interesting result. Up to 8GB it works as expected, that is, the query uses as much memory as the configured limit allows. However, if I set it higher, the query throws the error 'Invalid Memory Alloc Request Size'.
Anyone got any idea why this might be happening?

Luis, I have just dumped a table which gets updated daily, so it is in row store for now; it will be shifted to polymorphic storage in a while. As you had suggested, I wrote a query intended to use a lot of memory through a large aggregation and sort. It needs to hold ~2B records, each with 1 text, 2 integer and 1 numeric field, in memory to sort.

Thanks and Regards
Vineeth

Luis Macedo

Jun 30, 2016, 10:45:48 AM
to Vineeth Kuruvath, jern...@zdatainc.com, Keaton Adams, Greenplum Users, Scott Kahler
Can you share what the SQL looks like? 


Luis Macedo | Sr Platform Architect | Pivotal Inc 

Mobile: +55 11 97616-6438

Take care of the customers and the rest takes care of itself

Stephen

Jun 30, 2016, 11:34:19 AM
to Luis Macedo, Vineeth Kuruvath, jern...@zdatainc.com, Keaton Adams, Greenplum Users, Scott Kahler
Hi Vineeth,

I ran the explain through a new planchecker tool that we've been working on:
Looks like the SORT node is requesting most of the memory and accounts for 46% of the query runtime.
Is the sort required?

Based on "Memory used" output of EXPLAIN it looks like statement_mem is set to 204800K (204MB).
Can you verify what statement_mem is set to:
show statement_mem;
You stated earlier that you set statement_mem=24000MB.
Where exactly did you set that?

Also there is a lot of skew on seg27 during redistribute motion:
Avg  5893247 rows
Max 46125287 rows on seg27
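A quick way to see that distribution skew directly is to count rows per value of the redistribution key; a sketch using the table and column names from the plan (database name is a placeholder):

# Sketch: if one bank owns most of the rows, everything PARTITIONed BY bank will pile
# up on the single segment that bank hashes to.
psql -d yourdb -c "SELECT bank, count(*) AS n
                   FROM accrualtransmaster_part
                   GROUP BY bank ORDER BY n DESC LIMIT 10;"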


Kind regards,
Stephen

Luis Macedo

Jun 30, 2016, 11:56:12 AM
to Stephen, Vineeth Kuruvath, jern...@zdatainc.com, Keaton Adams, Greenplum Users, Scott Kahler
Stephen,

Awesome work! 

I copied and pasted Vineeth's plan and the app is telling me that I have an indentation error... What am I doing wrong?


Thanks,


Luis Macedo | Sr Platform Architect | Pivotal Inc 

Mobile: +55 11 97616-6438

Take care of the customers and the rest takes care of itself

Luis Macedo

Jun 30, 2016, 12:06:23 PM
to Stephen, Vineeth Kuruvath, Jon Ernster, Keaton Adams, Greenplum Users, Scott Kahler
Too soon... Nervous fingers...

The tool makes it very easy to spot skew!

The query is skewing because is has to redistribute on the bank column to apply the sort and windowing function that comes next. 

You probably have one bank that has most of the data. (~90%).

You will probably need to rewrite the query to add more columns to the redistribution phase, not sure how to adapt that in your business case...

To be more clear you will need to add another column to the partition by clause on the windowing function. 

Also you query is returning a lot of rows, this looks like more an extraction... try "create table as select" but evaluate if you really need all those results in one query (as is business question).
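A sketch of that idea follows, with the caveat that the real SQL is not shown in this thread: the aggregate functions and window function below are purely illustrative, and adding membershipno to the PARTITION BY changes the window's meaning, so it has to fit the business question.

# Sketch: materialise the aggregate first, then window over the much smaller result.
psql -d yourdb <<'SQL'
CREATE TABLE bank_member_agg AS
SELECT   bank, membershipno, sum(amt) AS amt, max(txn) AS txn
FROM     accrualtransmaster_part
GROUP BY bank, membershipno
DISTRIBUTED BY (bank, membershipno);

-- Windowing over the aggregate; membershipno in the PARTITION BY lets rows spread
-- across segments instead of piling onto the segment the big bank hashes to.
SELECT bank, membershipno, txn, amt,
       row_number() OVER (PARTITION BY bank, membershipno ORDER BY txn, amt) AS rn
FROM   bank_member_agg;
SQL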


Rgds,



Luis Macedo | Sr Platform Architect | Pivotal Inc 

Mobile: +55 11 97616-6438

Take care of the customers and the rest takes care of itself

Stephen

Jun 30, 2016, 12:34:38 PM
to Luis Macedo, Vineeth Kuruvath, Jon Ernster, Keaton Adams, Greenplum Users, Scott Kahler
Hi Luis,

Just to answer your previous question.
Vineeth's plan was enclosed in double quotes and was not indented as expected.
I manually adjusted Vineeth's plan to make it work.

The correct whitespace indentation is required for the planchecker to match up the nodes/slices.
Any explain generated using the "Recommended Steps" on the page should not have this problem.


Kind regards,
Stephen

Vineeth Kuruvath

Jul 1, 2016, 3:37:41 AM
to Stephen, Luis Macedo, Jon Ernster, Keaton Adams, Greenplum Users, Scott Kahler
Hi Stephen, Luis,

This is a query I wrote precisely to max out the memory usage by sorting a large aggregation, as Stephen said. It also returns a lot of rows, but it is unlikely such a query would come up during regular work, so there is no need to optimize this particular query.

The skew on seg27 during the redistribute motion, which results from one bank having most of the data, is inevitable; that is simply the business scenario. The query cannot be rewritten to eliminate it, and such situations will crop up often in regular work. I would really appreciate suggestions on how to work around this. The best I could think of is executing separate queries for that particular bank.

I have been playing around with statement_mem. I always use 'gpconfig -c' to set the values. Here is the current state -

- show statement_mem; returns "8000MB"
- the planchecker result for a slightly modified query http://planchecker.cfapps.io/plan/CYtoM2Jr
- its complete explain analyse result (I really, really need to take the 8GB limit to at least 12GB, but Invalid Memory Alloc Request Size is thrown) -

 Gather Motion 32:1  (slice3; segments: 32)  (cost=0.00..43409.98 rows=95818087 width=40)
   Rows out:  78197726 rows at destination with 630206 ms to first row, 1284607 ms to end, start offset by 495 ms.

   ->  Window  (cost=0.00..34429.43 rows=2994316 width=40)
         Partition By: bank
         Order By: txn, amt
         Rows out:  Avg 6015209.7 rows x 13 workers.  Max 46787954 rows (seg27) with 951272 ms to first row, 1275786 ms to end, start offset by -6849 ms.

         ->  Sort  (cost=0.00..34249.77 rows=2994316 width=30)
               Sort Key: bank, txn, amt
               Rows out:  Avg 6015209.7 rows x 13 workers.  Max 46787954 rows (seg27) with 951272 ms to first row, 1109221 ms to end, start offset by -6849 ms.
               Executor memory:  351911K bytes avg, 6096234K bytes max (seg27).
               Work_mem used:  351911K bytes avg, 6096234K bytes max (seg27). Workfile: (1 spilling, 0 reused)
               Work_mem wanted: 5851960K bytes avg, 5851960K bytes max (seg27) to lessen workfile I/O affecting 1 workers.

               ->  Redistribute Motion 32:32  (slice2; segments: 32)  (cost=0.00..23292.08 rows=2994316 width=30)
                     Hash Key: bank
                     Rows out:  Avg 6015209.7 rows x 13 workers at destination.  Max 46787954 rows (seg27) with 380861 ms to first row, 457861 ms to end, start offset by -6849 ms.

                     ->  Result  (cost=0.00..23010.92 rows=2994316 width=30)
                           Rows out:  Avg 2443678.9 rows x 32 workers.  Max 2447416 rows (seg25) with 380906 ms to first row, 393823 ms to end, start offset by -6835 ms.

                           ->  HashAggregate  (cost=0.00..23010.92 rows=2994316 width=30)
                                 Group By: bank, membershipno
                                 Rows out:  Avg 2443678.9 rows x 32 workers.  Max 2447416 rows (seg25) with 380906 ms to first row, 386403 ms to end, start offset by -6835 ms.
                                 Executor memory:  526578K bytes avg, 526874K bytes max (seg25).

                                 ->  Redistribute Motion 32:32  (slice1; segments: 32)  (cost=0.00..22221.76 rows=2994316 width=30)
                                       Hash Key: bank, membershipno
                                       Rows out:  Avg 20548757.8 rows x 32 workers at destination.  Max 20576520 rows (seg31) with 231136 ms to first row, 320602 ms to end, start offset by -6470 ms.

                                       ->  Result  (cost=0.00..21940.60 rows=2994316 width=30)
                                             Rows out:  Avg 20548757.8 rows x 32 workers.  Max 20553536 rows (seg1) with 259886 ms to first row, 335570 ms to end, start offset by 517 ms.

                                             ->  HashAggregate  (cost=0.00..21940.60 rows=2994316 width=30)
                                                   Group By: bank, membershipno
                                                   Rows out:  Avg 20548757.8 rows x 32 workers.  Max 20553536 rows (seg1) with 259886 ms to first row, 300160 ms to end, start offset by 517 ms.
                                                   Executor memory:  2370328K bytes avg, 2370646K bytes max (seg1).

                                                   ->  Sequence  (cost=0.00..7813.26 rows=44740992 width=21)
                                                         Rows out:  Avg 46421479.4 rows x 32 workers.  Max 46422453 rows (seg3) with 1.469 ms to first row, 131086 ms to end, start offset by 945 ms.

                                                         ->  Partition Selector for accrualtransmaster_part (dynamic scan id: 1)  (cost=10.00..100.00 rows=4 width=4)
                                                               Partitions selected:  94 (out of 94)
                                                               Rows out:  0 rows (seg0) with 0.221 ms to end, start offset by -6477 ms.

                                                         ->  Dynamic Table Scan on accrualtransmaster_part (dynamic scan id: 1)  (cost=0.00..7813.26 rows=44740992 width=21)
                                                               Rows out:  Avg 46421479.4 rows x 32 workers.  Max 46422453 rows (seg3) with 1.345 ms to first row, 59282 ms to end, start offset by 945 ms.

                                                               Partitions scanned:  Avg 94.0 (out of 94) x 32 workers.  Max 94 parts (seg0).
 Slice statistics:
  (slice0)    Executor memory: 386K bytes.
  (slice1)    Executor memory: 2371607K bytes avg x 32 workers, 2371936K bytes max (seg16).
  (slice2)    Executor memory: 527626K bytes avg x 32 workers, 527933K bytes max (seg25).
  (slice3)  * Executor memory: 352312K bytes avg x 32 workers, 6096677K bytes max (seg27).  Work_mem: 6096234K bytes max, 5851960K bytes wanted.
 Statement statistics:
  Memory used: 8192000K bytes
  Memory wanted: 23408636K bytes

 Settings:  optimizer=on
 Optimizer status: PQO version 1.632
 Total runtime: 1378565.595 ms


Thanks and Regards
Vineeth

Luis Macedo

Jul 1, 2016, 5:13:12 PM
to Vineeth Kuruvath, Stephen, Jon Ernster, Keaton Adams, Greenplum Users, Scott Kahler
Aha! Now it's clear! :)

This kind of problem in your query is THE problem we have with MPP architecture, and you will need to do some gymnastics to solve it. One solution is to do as you said and run it separately for that particular bank.

There is another GUC called max_statement_mem, and you cannot allocate more statement_mem than is defined in max_statement_mem.
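So the two have to be raised together; a sketch with example values, applied cluster-wide and picked up with a configuration reload:

# Sketch: raise the ceiling first, then statement_mem, then reload the configuration.
gpconfig -c max_statement_mem -v 12000MB
gpconfig -c statement_mem -v 8000MB
gpstop -u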

If you are doing such big sorts and aggregations as a production workload, it may be best to live with the spilling and the performance loss rather than use huge amounts of memory, which might bring some instability... I'm not sure if people have used GP that way before... If you really need to run the process faster, you may need to add more servers or create a pre-aggregated table.

But it's an interesting investigation, pushing the limits on statement_mem... I would like to know how far it can go...


Thanks,


Luis Macedo | Sr Platform Architect | Pivotal Inc 

Mobile: +55 11 97616-6438

Take care of the customers and the rest takes care of itself

Paul Johnson

Aug 21, 2016, 6:09:17 AM
to Greenplum Users
gpcc shows the *real* read performance at the database level across the entire cluster. It is a function of the random read ability of the disks/volumes/file system/OS setup. This is all that matters.

dd gives misleadingly high sequential IO read rates per disk, and cannot be used to extrapolate expected DBMS performance at the cluster level.

You seem to have 900 MB/s read throughput at the database level. I would aim for this as an absolute minimum *per node*. Ideally I'd aim for 2GB/s per node so the CPUs can be kept busy.

Marshall

Aug 23, 2016, 8:44:47 AM
to Greenplum Users
In an OLTP system there are many random reads and writes, and thus IOPS is a relevant measure of IO capacity. Index-based operations do random reads and writes all the time. Greenplum, by contrast, tends to read data sequentially: most of the explain plans I've seen do sequential scans and their kin, not random reads. For this reason, wouldn't the throughput rate be more relevant to performance than IOPS?

Luis Macedo

Aug 25, 2016, 1:34:24 PM
to Marshall, Greenplum Users
As we have many segments and also many queries going after data, I bet the OS tries to give a little IO to each process, which makes the workload look more like random IO than a sequential scan, and that is why we recommend fast disks (SAS 10K, 15K) over SATA drives.

With that said, I would still expect GPDB's IO requirements to look more like sequential scans than random IO.

As a thought, other databases implement a limit on the number of processes going to disk for IO, so as not to stress the disks too much (while still allowing queries that are running purely in memory at that time to avoid waiting).




Luis Macedo | Sr Platform Architect | Pivotal Inc 

Mobile: +55 11 97616-6438

Take care of the customers and the rest takes care of itself
