dd vs fallocate for I/O performance check


Douglas Spadotto

Jul 22, 2016, 1:04:31 PM
to Greenplum Users
Hello everyone,

I think this is a more Linux/storage question, but it relates to Greenplum, so here it goes.

I've installed GPDB on virtual machines (8 cores, 64 GB RAM), and the filesystems that will hold the data are on a SAN served by EMC's ScaleIO. Each volume group has 4 disks behind it.

While running gpcheckperf, I got awful results. So one of the things I did was pick gpcheckperf apart and run parts of it on their own, and I noticed a big difference between generating a file with dd from /dev/zero (as gpcheckperf does) and using fallocate (the last row of the MANUAL results below).
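Roughly, the commands being compared were of this form (block size and exact flags are approximate, not copied verbatim from gpcheckperf):

    # write a 64GB file from /dev/zero (roughly what gpcheckperf does)
    dd if=/dev/zero of=/data1/primary/ddfile bs=32k count=2097152

    # read the same file back
    dd if=/data1/primary/ddfile of=/dev/null bs=32k

    # pre-allocate a 64GB file instead of writing it out
    fallocate -l 64G /data1/primary/ddfile.64GB_fallocate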

GPCHECKPERF  (dates/times approx.; file size 64GB in every run)

  22/07 - 10:30   /dev/zero -> /data1/primary/../ddfile
                  srv48: 255.16 MB/s   srv49: 309.58 MB/s   srv50: 315.74 MB/s

  22/07 - 10:45   /data1/primary/../ddfile -> /dev/null
                  srv48: 69.41 MB/s    srv49: 70.24 MB/s    srv48: 73.69 MB/s

MANUAL

  22/07 - 11:15   /dev/zero -> /data1/primary/ddfile
                  srv48: 934 MB/s      srv49: 917 MB/s      srv50: 917 MB/s

  22/07 - 11:50   /data1/primary/ddfile -> /dev/null
                  srv48: 64 MB/s       srv49: 60.8 MB/s     srv50: 60.9 MB/s

  22/07 - 11:50   /data1/primary/ddfile.64GB_fallocate -> /dev/null
                  srv48: 2.5 GB/s      srv49: 2.4 GB/s      srv50: 2.5 GB/s


Does anyone know whether the results I got with fallocate could be taken as acceptable for the GPDB install?

I found this on the Internet: "dd command is normally limited by the device speed, which is about 80MB/s for most disks. fallocate solves the problem by preallocating blocks instantly." (http://www.networknuts.net/tag/dd-vs-fallocate/). That seems to indicate that dd is more likely to reproduce what GPDB would actually be doing, but I still don't understand why the file generated with fallocate is so much faster to transfer.

Thanks in advance,

Douglas

-----
Frodo: "I wish none of this had happened." 
Gandalf: "So do all who live to see such times, but that is not for them to decide. All we have to decide is what to do with the time that is given to us."
-- Lord of the Rings: The Fellowship of the Ring (2001)

Kyle Petersen

Jul 22, 2016, 1:21:19 PM
to Douglas Spadotto, Greenplum Users
It is interesting that your manual run of dd to write the file was much faster. Did you run those in parallel or serially? (gpcheckperf runs the dd in parallel across the segment hosts.)

Regarding fallocate, it is my understanding that most of the work is not physically writing to the disk (unlike dd) but just reserving blocks. Since the blocks are all pre-allocated, they are also likely to be sequential on the disk, which again would not be guaranteed with dd. From the fallocate man page:

"This approach means that the specified range will not be physically zeroed out on the device (except for partial blocks at either end of the range)"

So while I have some issues with how gpcheckperf uses dd, it is likely closer to a gauge of GPDB performance than fallocate. Also, in my experience watching disk I/O on GPDB, it doesn't exceed the numbers reported by gpcheckperf. However (and it's a big however), that is against physically attached storage. Running against a SAN can throw a big variable into things, depending on a number of factors.
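If you do re-run the manual dd, it may also be worth forcing the data to actually reach the device before trusting the write number; something along these lines (flags are a suggestion, not necessarily what gpcheckperf itself uses):

    # have dd fdatasync the file before it reports a rate
    dd if=/dev/zero of=/data1/primary/ddfile bs=32k count=2097152 conv=fdatasync

    # or bypass the page cache altogether with direct I/O
    dd if=/dev/zero of=/data1/primary/ddfile bs=1M count=65536 oflag=direct

Otherwise a manual run can largely be measuring the page cache, which could explain part of the gap against the gpcheckperf write numbers.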

Kyle


Kyle Petersen | E9 Data | www.e9data.com | C: 503-438-5102 | E: kyle.p...@e9data.com


Scott Kahler

Jul 22, 2016, 1:30:23 PM
to Greenplum Users
I'll agree with Kyle here and say I don't think fallocate is a good measure, as it doesn't actually generate much real data, so what it shows can't be used as a gauge of the throughput a system can handle.
--

Scott Kahler | Pivotal, R&D, Platform Engineering  | ska...@pivotal.io | 816.237.0610

Jesse Zhang

Jul 23, 2016, 6:18:17 PM
to Greenplum Users
Hi Douglas,
fallocate(1) (and the underlying system call fallocate(2)) doesn't test I/O performance at all! It merely changes the filesystem metadata to pre-allocate extents (and hence blocks) to a file. No actual file data is written to disk.
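You can see this on the file itself; assuming filefrag from e2fsprogs is available (it works on XFS and ext4 via FIEMAP), something like this should show the pre-allocated extents flagged as "unwritten":

    # list extents; pre-allocated ranges carry the "unwritten" flag
    filefrag -v /data1/primary/ddfile.64GB_fallocate

Reading back a file whose extents are unwritten just returns zeroes without ever touching the disks, which is why it appears to "transfer" at 2.5 GB/s.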

Jesse

Keaton Adams

Jul 25, 2016, 9:43:56 AM
to Greenplum Users
I have questions.

So your company has EMC ScaleIO serving as the disk layer. Would you mind providing more details about that setup?


Is it the software-only deployment, which uses local x86 commodity server storage in a pool? If so, how many nodes? Are you running VMware? Or is this possibly an EMC VxRack System? Are all the underlying disks HDDs, or is it a mix of HDDs, SSDs and PCIe flash cards?

EMC states that this software-defined block storage solution is extremely fast compared to other SAN options. The more nodes and related storage devices in the cluster, the faster the throughput and the higher the IOPS achieved. I'm just curious what architecture you are carving these four-disk volumes from.

EMC ships a fully configured GPDB cluster on its own hardware, known as the Greenplum Database Data Computing Appliance (DCA). In that configuration, each segment server ships with 24 x 1.8 TB 2.5" 10K HDDs. They are configured into two RAID-5 (10+1) arrays, each with a hot spare. These are then formatted as XFS with the parameters required by the GPDB Installation Guide for optimum performance. The primary and mirror segments are evenly divided across the two disk volumes (/data1 and /data2). The gpcheckperf results come back much higher when both RAID volumes are included in the check.
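For reference, running the check against both volumes at once looks roughly like this (the host file name here is just an example):

    # -d can be given once per data directory / RAID volume
    gpcheckperf -f hostfile_segments -r ds -D -d /data1/primary -d /data2/primary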

With 8 CPU cores and 64 GB of RAM, how many primary data segments were you planning to run on each node?  Are you using mirror segments as well?

Thanks.

Paul Johnson

Aug 21, 2016, 5:46:10 AM
to Greenplum Users
dd is not a great way to test disk performance/suitability for a GPDB cluster.

The use of dd goes back to the *very* early (Metapa?) days and should have been ditched by now.

For DW databases we need disks that can sustain high *random* read rates. dd measures *sequential* read rates, which is not what we want. Gpcheckperf gives a totally misleading disk IO reading by using dd.

We settled on fio instead of dd to test disk performance. It allows random read IOPS with a 32 KB block size to be tested, which is what is needed for good GPDB performance. No amount of CPU/RAM will overcome poor disk IOPS.
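A starting point looks something like this (the job parameters are illustrative and need tuning for the environment under test):

    fio --name=randread --directory=/data1/primary --rw=randread --bs=32k \
        --size=8G --numjobs=4 --iodepth=16 --ioengine=libaio --direct=1 \
        --runtime=60 --time_based --group_reporting

The IOPS figure fio reports for this kind of job is a better predictor of GPDB behaviour than the sequential MB/s numbers that dd produces.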

High random read IOPS are much harder to deliver than high sequential read IOPS, unfortunately.
