dd vs fallocate for I/O performance check


Douglas Spadotto

Jul 22, 2016, 1:04:31 PM
to Greenplum Users
Hello everyone,

I think this is a more Linux/storage question, but it relates to Greenplum, so here it goes.

I've installed GPDB on virtual machines (8 cores, 64 GB RAM), and the filesystems that will hold the data are on a SAN served by EMC's ScaleIO. Each volume group has 4 disks behind it.

While running gpcheckperf, I got awful results. So one of the things I did was pick gpcheckperf apart and run parts of it on their own, and I noticed a big difference between generating a file with dd from /dev/zero (as gpcheckperf does) and using fallocate (the last row of the MANUAL results below).
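Roughly, the commands being compared were of this form (block size and exact flags are approximate, not copied verbatim from gpcheckperf):

    # write a 64GB file from /dev/zero (roughly what gpcheckperf does)
    dd if=/dev/zero of=/data1/primary/ddfile bs=32k count=2097152

    # read the same file back
    dd if=/data1/primary/ddfile of=/dev/null bs=32k

    # pre-allocate a 64GB file instead of writing it out
    fallocate -l 64G /data1/primary/ddfile.64GB_fallocate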

GPCHECKPERF  (dates/times approx.; file size 64GB in every run)

  22/07 - 10:30   /dev/zero -> /data1/primary/../ddfile
                  srv48: 255.16 MB/s   srv49: 309.58 MB/s   srv50: 315.74 MB/s

  22/07 - 10:45   /data1/primary/../ddfile -> /dev/null
                  srv48: 69.41 MB/s    srv49: 70.24 MB/s    srv48: 73.69 MB/s

MANUAL

  22/07 - 11:15   /dev/zero -> /data1/primary/ddfile
                  srv48: 934 MB/s      srv49: 917 MB/s      srv50: 917 MB/s

  22/07 - 11:50   /data1/primary/ddfile -> /dev/null
                  srv48: 64 MB/s       srv49: 60.8 MB/s     srv50: 60.9 MB/s

  22/07 - 11:50   /data1/primary/ddfile.64GB_fallocate -> /dev/null
                  srv48: 2.5 GB/s      srv49: 2.4 GB/s      srv50: 2.5 GB/s


Does anyone know whether the results I got with fallocate could be taken as acceptable for the GPDB install?

I found this on the Internet: "dd command is normally limited by the device speed, which is about 80MB/s for most disks. fallocate solves the problem by preallocating blocks instantly." (http://www.networknuts.net/tag/dd-vs-fallocate/). That seems to indicate that dd is more likely to reproduce what GPDB would actually be doing, but I still don't understand why the file generated with fallocate is so much faster to transfer.

Thanks in advance,

Douglas

-----
Frodo: "I wish none of this had happened." 
Gandalf: "So do all who live to see such times, but that is not for them to decide. All we have to decide is what to do with the time that is given to us."
-- Lord of the Rings: The Fellowship of the Ring (2001)

Kyle Petersen

Jul 22, 2016, 1:21:19 PM
to Douglas Spadotto, Greenplum Users
It is interesting that your manual run of dd to write the file was much faster. Did you run those in parallel or serially? (gpcheckperf runs the dd in parallel across the segment hosts.)

Regarding fallocate, it is my understanding that most of the work is not physically writing to the disk (unlike dd) but just reserving blocks. Since the blocks are all pre-allocated, they are also likely to be sequential on the disk, which again would not be guaranteed with dd. From the fallocate man page:

"This approach means that the specified range will not be physically zeroed out on the device (except for partial blocks at either end of the range)"

So while I have some issues with how gpcheckperf uses dd, it is likely closer to a gauge of GPDB performance than fallocate. Also, in my experience watching disk I/O on GPDB, it doesn't exceed the numbers reported by gpcheckperf. However (and it's a big however), that is against physically attached storage. Running against a SAN can throw a big variable into things, depending on a number of factors.
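If you do re-run the manual dd, it may also be worth forcing the data to actually reach the device before trusting the write number; something along these lines (flags are a suggestion, not necessarily what gpcheckperf itself uses):

    # have dd fdatasync the file before it reports a rate
    dd if=/dev/zero of=/data1/primary/ddfile bs=32k count=2097152 conv=fdatasync

    # or bypass the page cache altogether with direct I/O
    dd if=/dev/zero of=/data1/primary/ddfile bs=1M count=65536 oflag=direct

Otherwise a manual run can largely be measuring the page cache, which could explain part of the gap against the gpcheckperf write numbers.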

Kyle


Kyle Petersen | E9 Data | www.e9data.com | C: 503-438-5102 | E: kyle.p...@e9data.com


Scott Kahler

Jul 22, 2016, 1:30:23 PM
to Greenplum Users
I'll agree with Kyle here and say I don't think fallocate is a good measure, as it doesn't actually generate much real data, so what it shows can't be used as a gauge of the throughput a system can handle.
--

Scott Kahler | Pivotal, R&D, Platform Engineering  | ska...@pivotal.io | 816.237.0610

Jesse Zhang

Jul 23, 2016, 6:18:17 PM
to Greenplum Users
Hi Douglas,
fallocate(1) (and the underlying system call fallocate(2)) doesn't test I/O performance at all! It merely changes the filesystem metadata to pre-allocate extents (and hence blocks) to a file. No actual file data is written to disk.
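You can see this on the file itself; assuming filefrag from e2fsprogs is available (it works on XFS and ext4 via FIEMAP), something like this should show the pre-allocated extents flagged as "unwritten":

    # list extents; pre-allocated ranges carry the "unwritten" flag
    filefrag -v /data1/primary/ddfile.64GB_fallocate

Reading back a file whose extents are unwritten just returns zeroes without ever touching the disks, which is why it appears to "transfer" at 2.5 GB/s.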

Jesse

Keaton Adams

Jul 25, 2016, 9:43:56 AM
to Greenplum Users
I have questions.

So your company has EMC ScaleIO serving as the disk layer. Would you mind providing more details about that setup?


Is it the software-only deployment, which uses local x86 commodity server storage in a pool? If so, how many nodes? Are you running VMware? Or is this possibly an EMC VxRack System? Are all the underlying disks HDDs, or is it a mix of HDDs, SSDs and PCIe flash cards?

EMC states that this software-defined block storage solution is extremely fast compared to other SAN options. The more nodes and related storage devices in the cluster, the faster the throughput and the higher the IOPS achieved. I'm just curious what architecture you are carving these four-disk volumes from.

EMC ships a fully configured GPDB cluster on its own hardware, known as the Greenplum Database Data Computing Appliance (DCA). In that configuration, each segment server ships with 24 x 1.8 TB 2.5" 10K HDDs. They are configured into two RAID-5 (10+1) arrays, each with a hot spare. These are then formatted as XFS with the parameters required by the GPDB Installation Guide for optimum performance. The primary and mirror segments are evenly divided across the two disk volumes (/data1 and /data2). The gpcheckperf results come back much higher when both RAID volumes are included in the check.
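For reference, running the check against both volumes at once looks roughly like this (the host file name here is just an example):

    # -d can be given once per data directory / RAID volume
    gpcheckperf -f hostfile_segments -r ds -D -d /data1/primary -d /data2/primary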

With 8 CPU cores and 64 GB of RAM, how many primary data segments were you planning to run on each node?  Are you using mirror segments as well?

Thanks.

Paul Johnson

Aug 21, 2016, 5:46:10 AM
to Greenplum Users
dd is not a great way to test disk performance/suitability for a GPDB cluster.

The use of dd goes back to the *very* early (Metapa?) days and should have been ditched by now.

For DW databases we need disks that can sustain high *random* read rates. dd measures *sequential* read rates, which is not what we want. Gpcheckperf gives a totally misleading disk IO reading by using dd.

We settled on fio instead of dd to test disk performance. It allows random read IOPS with a 32 KB block size to be tested, which is what is needed for good GPDB performance. No amount of CPU/RAM will overcome poor disk IOPS.
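A starting point looks something like this (the job parameters are illustrative and need tuning for the environment under test):

    fio --name=randread --directory=/data1/primary --rw=randread --bs=32k \
        --size=8G --numjobs=4 --iodepth=16 --ioengine=libaio --direct=1 \
        --runtime=60 --time_based --group_reporting

The IOPS figure fio reports for this kind of job is a better predictor of GPDB behaviour than the sequential MB/s numbers that dd produces.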

High random read IOPS are much harder to deliver than high sequential read IOPS, unfortunately.
