15x faster write throughput on Linux ext4 filesystems

Axel Morgner

Dec 10, 2013, 3:55:10 PM
to ne...@googlegroups.com, str...@googlegroups.com
Hi,

Maybe some of you have experienced poor write performance on your Linux boxes, as I did, especially with small transactions.

In my tests I was able to increase the throughput by a factor of 15! Here's a blog post about my findings:

http://structr.org/blog/neo4j-performance-on-ext4

Comments?

Best
Axel

--

Axel Morgner
CEO Structr (c/o Morgner UG) · Hanauer Landstr. 291a · 60314 Frankfurt · Germany
Twitter: @amorgner
Phone: +49 151 40522060
Skype: axel.morgner

Structr - Award-Winning Open Source CMS and Web Framework based on Neo4j
Structr Mailing List and Forum
Graph Database Usergroup "graphdb-frankfurt"

Alex Frieden

Dec 10, 2013, 4:02:05 PM
to ne...@googlegroups.com
This is really great! I definitely want to try it!


--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



--
Alexander Frieden

Chris Vest

Dec 10, 2013, 5:16:23 PM
to ne...@googlegroups.com
This makes me want to dive into the ext4 code and see how they do barriers and journalling.

I wonder if we need journalling, since Neo4j has its own transaction log that it will replay upon recovery.

Disabling barriers is a bit more worrying. I don’t have a deep understanding of ext4 internals, but I’m guessing that they prevent NCQ from reordering commands across the barrier. Specifically, they prevent store file writes from reordering before the transaction log commit command. In other words, they make sure that store files don’t contain data that never got committed. Furthermore, in the case of a power failure, the drive might be receiving junk data. If the junk goes to the store file, then we’d like our transaction log to repair the store file, and if the junk goes to the transaction log, then we would like to simply disregard it and consider any orphaned commands as uncommitted.

I’m a little surprised that the effect of disabling barriers is so pronounced. I wonder how the SSD compares to the HDD when barriers are enabled.
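
To make the recovery idea above concrete, here is a minimal sketch in Python. It is not Neo4j's log format or recovery code: the record layout (length plus CRC32), the file name `tx.log` and the function names are all made up for illustration. Each commit is appended and fsync'ed, and recovery stops at the first record that fails its checksum, so junk at the tail of the log is treated as uncommitted.

```python
import os
import struct
import tempfile
import zlib

# Hypothetical record header: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def append_commit(log_path, payload):
    """Append one record and force it to disk before returning."""
    record = HEADER.pack(len(payload), zlib.crc32(payload)) + payload
    with open(log_path, "ab") as log:
        log.write(record)
        log.flush()             # drain Python's userspace buffer
        os.fsync(log.fileno())  # ask the kernel to push the data to the device


def recover(log_path):
    """Return the payloads of all intact records, stopping at the first bad one."""
    with open(log_path, "rb") as log:
        data = log.read()
    committed = []
    offset = 0
    while offset + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        payload = data[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break  # torn or junk tail: treat everything from here as uncommitted
        committed.append(payload)
        offset = start + length
    return committed


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "tx.log")
        append_commit(log_path, b"create node 1")
        append_commit(log_path, b"create node 2")
        # Simulate a power failure that left junk after the last commit.
        with open(log_path, "ab") as log:
            log.write(b"\xff\xff\x00\x00junk")
        return recover(log_path)
```

The sketch only holds if fsync really means "on stable media". As far as I understand it, that is the part that mounting with barrier=0 can weaken when the drive has a volatile write cache.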

--
Chris Vest
[ skype: mr.chrisvest, twitter: chvest ]


Axel Morgner

Dec 10, 2013, 6:16:36 PM
to ne...@googlegroups.com
HDD with barriers or journaling off is still 3x faster than my SSD with barriers/journaling on (Samsung 840 Pro, not the slowest drive on the market).

I ran just a quick test with one sample per configuration:

HDD (has_journal,  barrier=1)     20.13    16.87
HDD (has_journal,  barrier=0)    293.68   340.13
HDD (^has_journal, barrier=1)    377.64   443.46
HDD (^has_journal, barrier=0)    373.83   441.89

SSD (has_journal,  barrier=1)    104.81   108.04
SSD (has_journal,  barrier=0)    580.04   815.00
SSD (^has_journal, barrier=1)    664.01   989.12
SSD (^has_journal, barrier=0)    660.94   998.00

With journaling off, barrier=0|1 doesn't seem to have any effect.
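
To put the figures above in relation to each other, the following sketch computes the speedup of each configuration over the default (journal and barriers on) for the same device. It uses only the first value of each pair. The thread does not say what the two values measure or in which unit, so they are treated as opaque "higher is better" numbers, and the names `RESULTS` and `speedups` are just for illustration.

```python
# First value of each pair from the figures above. The unit is not stated in
# the thread, so the numbers are treated as opaque "higher is better" scores.
RESULTS = {
    ("HDD", "has_journal", "barrier=1"): 20.13,
    ("HDD", "has_journal", "barrier=0"): 293.68,
    ("HDD", "^has_journal", "barrier=1"): 377.64,
    ("HDD", "^has_journal", "barrier=0"): 373.83,
    ("SSD", "has_journal", "barrier=1"): 104.81,
    ("SSD", "has_journal", "barrier=0"): 580.04,
    ("SSD", "^has_journal", "barrier=1"): 664.01,
    ("SSD", "^has_journal", "barrier=0"): 660.94,
}


def speedups(results, device):
    """Speedup of each configuration over journal + barriers on the same device."""
    baseline = results[(device, "has_journal", "barrier=1")]
    return {
        (journal, barrier): round(value / baseline, 1)
        for (dev, journal, barrier), value in results.items()
        if dev == device
    }


if __name__ == "__main__":
    for device in ("HDD", "SSD"):
        for config, factor in speedups(RESULTS, device).items():
            print(device, *config, f"{factor}x")
```

For the HDD, disabling barriers alone comes out at roughly 14.6x on this sample, which is in line with the factor of 15 from the blog post.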

Is it safe to run Neo4j on an ext4 fs without journaling, say on a partition dedicated to Neo4j storage?

Would it even be possible to use a raw device?





Chris Vest

Dec 10, 2013, 6:52:10 PM
to ne...@googlegroups.com
In principle, Neo4j should not cause metadata changes, e.g. it should use fdatasync instead of fsync and similar, so that the inode tree doesn’t need updating during normal operation. In practice I would be hesitant to rely on this. For instance, I don’t know at this point whether the force() calls we do on the MappedByteBuffers translate into msync() calls, which do update metadata, or whether there are other metadata-updating file operations hiding in there. But what’s worse is that Lucene probably does a lot of fiddling about with files, causing metadata updates. Actually, the fact that such large gains can be had by turning journalling off might indicate that we do a lot of metadata updates.
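
For readers who want to see the two write paths side by side, here is a rough Python sketch. It says nothing about what Neo4j or the JVM actually do; the file name `store.db`, the 4096-byte preallocation and the helper `sync_data_only` are assumptions made for the example. It writes once through a file descriptor followed by fdatasync (falling back to fsync where fdatasync is unavailable), and once through a memory mapping followed by a flush, which CPython implements with msync() on Unix.

```python
import mmap
import os
import tempfile


def sync_data_only(fd):
    """Flush file data, skipping non-essential metadata where the platform allows."""
    if hasattr(os, "fdatasync"):
        os.fdatasync(fd)  # data, plus only the metadata needed to read it back
    else:
        os.fsync(fd)      # fallback where fdatasync is unavailable


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "store.db")
        # Preallocate so the later writes do not change the file size.
        with open(path, "wb") as f:
            f.write(b"\0" * 4096)
            f.flush()
            os.fsync(f.fileno())

        fd = os.open(path, os.O_RDWR)
        try:
            # Path 1: positional write followed by a data-only sync.
            os.pwrite(fd, b"record-A", 0)
            sync_data_only(fd)

            # Path 2: write through a shared memory mapping, then flush it.
            with mmap.mmap(fd, 4096) as m:
                m[8:16] = b"record-B"
                m.flush()
        finally:
            os.close(fd)

        with open(path, "rb") as f:
            return f.read(16)
```

The preallocation step is there because a write that grows the file changes its size, and fdatasync then has to flush that metadata too. `os.pwrite` is Unix-only, so this is a Linux-oriented sketch.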

I guess you can argue that a UPS-backed cluster of machines would be safe, since it gives you time to rebuild (rather than restart) any machine that might crash. It would certainly lower the probability of data loss either way.

I haven’t looked into raw devices. The trouble is the file and IO APIs we’re given. I’ve briefly looked at the Linux block device APIs. They look more in tune with the nature of NCQ and ATA, which is nice, but I’m not sure what’s available in user space. Also, we need to support OS X and Windows.


--
Chris Vest
[ skype: mr.chrisvest, twitter: chvest ]



Axel Morgner

Dec 10, 2013, 7:17:27 PM
to ne...@googlegroups.com
Interesting!

Maybe someone on this list has already implemented a driver for raw devices to run Neo4j on? :-)

BTW: One of the best resources on this topic I found is: http://serverfault.com/questions/486677/should-we-mount-with-data-writeback-and-barrier-0-on-ext3

It's more about ext3, but most of it applies to ext4, too. There, the author's conclusion is: "We're going with disk write cache on, barrier=0, and data=ordered"

So the best setup I can think of is a RAID array of SSDs with battery-buffered (or better, capacitor-buffered) disks and controllers [1], plus the above parameters.


[1] https://communities.intel.com/thread/44083
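
Before changing anything, it may help to check which of these options a mount actually lists. Below is a small sketch that parses text in the format of /proc/mounts. The device names and the mount point /var/lib/neo4j in `SAMPLE` are made up, and the spellings it looks for (`nobarrier`, `barrier=0`, `barrier`, `barrier=1`, `data=...`) are the ones I know from the ext3/ext4 mount options. An option that is not listed simply means the kernel default applies.

```python
def ext_mount_options(mounts_text, mount_point):
    """Return the option list of an ext3/ext4 mount, or None if it is not found."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if (len(fields) >= 4 and fields[1] == mount_point
                and fields[2] in ("ext3", "ext4")):
            return fields[3].split(",")
    return None


def summarize(options):
    """Report the barrier and data-mode settings that are explicitly listed."""
    barrier = "not listed"
    if "nobarrier" in options or "barrier=0" in options:
        barrier = "off"
    elif "barrier" in options or "barrier=1" in options:
        barrier = "on"
    data_mode = next(
        (opt.split("=", 1)[1] for opt in options if opt.startswith("data=")),
        "not listed",
    )
    return {"barrier": barrier, "data": data_mode}


# Made-up sample in /proc/mounts format; devices and paths are hypothetical.
SAMPLE = (
    "/dev/sda1 / ext4 rw,relatime,data=ordered 0 0\n"
    "/dev/sdb1 /var/lib/neo4j ext4 rw,noatime,nobarrier,data=ordered 0 0\n"
)

if __name__ == "__main__":
    print(summarize(ext_mount_options(SAMPLE, "/var/lib/neo4j")))
```

On a real Linux box you would pass `open("/proc/mounts").read()` instead of `SAMPLE`. Whether the journal exists at all (has_journal) is a filesystem feature rather than a mount option, so it does not show up here.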

