Is 2K ACID TPS fast for a disk based (scala) database?


William la Forge

Oct 18, 2011, 8:26:09 AM
to scala-user
I'm running my old laptop (my Vaio with SSD has a bad fan right now). I just did a very simple performance test and I'm getting 2.2K ACID transactions per second. (VERY simple transactions.) Is this exceptional? Average? I really don't have anything to compare to except what I've done before--and it is a pretty good improvement over my past efforts at least.

This is an in-memory system, but it logs transactions to disk and updates the database on disk when a change occurs. It is robust--you can crash the system at any time without data loss. All of this slows it down of course. I'm calling this the "Swift" datastore and I'll be releasing it soon.

Hardware:
lenovo 4151/200.
Intel(R) Core(TM)2 Duo CPU T6400 @ 2.00GHz

Fujitsu MHZ2320BH-G2 320GB 5400 RPM 8MB 2.5" SATA 3.0Gb/s


Thanks!

Bill La Forge

Edmondo Porcu

Oct 18, 2011, 8:30:53 AM
to William la Forge, scala-user
Is that an in memory database such as Ehcache, Gigaspaces, Memcached, and so on?

Best Regards


Razvan Cojocaru

Oct 18, 2011, 9:32:24 AM
to William la Forge, scala-user

Interesting…

Assuming you’re not using a transactional file system, how do you handle the file system’s lack of guarantees? How certain are you that a kill -9 or unplugging the box will not lead to losing transactions already “committed”?

William la Forge

Oct 18, 2011, 11:28:27 AM
to Edmondo Porcu, scala-user
Edmondo,

A bit of a hybrid actually. The database must all fit in memory, but there is a file backing store. Makes for very fast queries and much slower updates. I still need to do the timings for queries, but simple updates are at 2222 per second. Which is fast for file-based ACID transactions, yes?

Bill

William la Forge

Oct 18, 2011, 11:43:04 AM
to Razvan Cojocaru, scala-user
Razvan,

Easy, at least for the small records datastore, which swift builds on. I have two dedicated areas on disk which are written to alternately for each transaction. Each area contains both a timestamp and a checksum. On startup you read both and use the latest valid data.
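
A minimal sketch of the two-area scheme described above, under stated assumptions: the byte layout (timestamp, payload length, payload, CRC32), the `Slot` type, and the names `encode`, `decode`, and `chooseLatest` are all hypothetical and are not taken from the Swift code. It only illustrates how a checksum lets startup reject a half-written area and fall back to the other one.

```scala
import java.nio.ByteBuffer
import java.util.zip.CRC32

// Hypothetical layout for one of the two disk areas:
// [timestamp: 8 bytes][payload length: 4 bytes][payload][CRC32: 8 bytes]
object SlotRecovery {
  final case class Slot(timestamp: Long, payload: Array[Byte])

  private def crcOf(bytes: Array[Byte], length: Int): Long = {
    val crc = new CRC32()
    crc.update(bytes, 0, length)
    crc.getValue
  }

  /** Serialize a slot, appending a checksum over everything before it. */
  def encode(slot: Slot): Array[Byte] = {
    val bodyLength = 8 + 4 + slot.payload.length
    val buf = ByteBuffer.allocate(bodyLength + 8)
    buf.putLong(slot.timestamp).putInt(slot.payload.length).put(slot.payload)
    buf.putLong(crcOf(buf.array(), bodyLength))
    buf.array()
  }

  /** Return the slot only if the bytes are complete and the checksum matches. */
  def decode(bytes: Array[Byte]): Option[Slot] = {
    if (bytes.length < 20) None
    else {
      val buf = ByteBuffer.wrap(bytes)
      val timestamp = buf.getLong()
      val length = buf.getInt()
      if (length < 0 || length > bytes.length - 20) None
      else {
        val payload = new Array[Byte](length)
        buf.get(payload)
        val stored = buf.getLong()
        if (stored == crcOf(bytes, 12 + length)) Some(Slot(timestamp, payload))
        else None
      }
    }
  }

  /** On startup: read both areas and keep the newest one that validates. */
  def chooseLatest(areaA: Array[Byte], areaB: Array[Byte]): Option[Slot] =
    List(decode(areaA), decode(areaB)).flatten.sortBy(_.timestamp).lastOption
}
```

Because each transaction overwrites only the older area, a crash in mid-write can damage at most one of the two, so the other area should still validate.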

Swift adds a layer of complexity to that. The datastore is only updated once every 100 transactions, but each transaction is logged and flushed. The datastore now holds the name of the log file and the position of the end of the last transaction at the time of the write to disk. On startup, after choosing the latest valid disk area, you read from the old log file starting at the position given and reprocess the new transactions which were not captured in the datastore. (Recovering from an aborted transaction just means you reread the last disk area written and then reprocess the subsequent transactions that are held in memory.)

So you can turn off the computer at any time, restart, and it will rebuild what it needs to reflect all the transactions previously processed. And if you lose the datastore you can rebuild everything from the log files. Or from an old datastore and subsequent log files. Tested and working. (All utilities provided, though it is all a bit rough.)
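
The recovery path described above can be sketched roughly as follows. Everything here is an assumption for illustration: the `Put` transaction, the `Checkpoint` type, and the record format (`writeUTF` key plus `writeInt` value) are hypothetical, and the log is held in memory to keep the example short. It also assumes the log ends on a record boundary; a real log would need to detect a torn final record.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object LogReplay {
  // Hypothetical transaction: set a key to a value.
  final case class Put(key: String, value: Int)

  // Hypothetical checkpoint: the state as of `logPosition` bytes into the log.
  final case class Checkpoint(state: Map[String, Int], logPosition: Int)

  /** Append one transaction to the log and return the new end position. */
  def append(log: ByteArrayOutputStream, tx: Put): Int = {
    val out = new DataOutputStream(log)
    out.writeUTF(tx.key)
    out.writeInt(tx.value)
    out.flush()
    log.size()
  }

  /** Rebuild current state: start from the checkpoint, then reapply every
    * transaction logged after the checkpoint's recorded position. */
  def recover(checkpoint: Checkpoint, logBytes: Array[Byte]): Map[String, Int] = {
    val remaining = logBytes.length - checkpoint.logPosition
    val in = new DataInputStream(
      new ByteArrayInputStream(logBytes, checkpoint.logPosition, remaining))
    var state = checkpoint.state
    while (in.available() > 0) {
      val key = in.readUTF()
      state = state.updated(key, in.readInt())
    }
    state
  }
}
```

Recovering from an empty checkpoint at position 0 replays the whole log, which corresponds to rebuilding everything from the log files alone.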

Now, is there a future in this??? (My wife, Rupali, would really like an answer to this question!) I keep telling her this is the cat's meow, but she's a bit out of her comfort zone. :-)

Bill

Razvan Cojocaru

Oct 18, 2011, 2:55:08 PM
to William la Forge, scala-user

Cool - I was under the impression that a flush() not only kills performance, but also does not guarantee that the contents are physically on disk and will survive pulling the plug… It may be that I haven’t looked at this in a long time, but the topic is really interesting to me. Do you have some more info? Which Java class are you using to write to disk?

In fact, does anyone know of a safe and transactional log file-based implementation? It has been quite some time since I looked into this…

Can’t say about the future of this – you’d certainly have to compare to a MySQL with an in-memory engine, a hashtable, etc… but my motto is that evolution requires diversity, hence there is value in just having an alternative; otherwise I would turn off the laptop and go biking :-)

Cheers,

Razie

Martin S. Weber

Oct 18, 2011, 4:31:44 PM
to scala...@googlegroups.com
On 10/18/11 14:55, Razvan Cojocaru wrote:
>
>
> Cool - I was under the impression that a flush() not only kills
> performance, but does not guarantee the contents to be physically on
> disk, in order to survive pulling a plug…


It doesn't.

And then there's the hardware, which has its own buffers, some of which it may
and some of which it may not advertise to the outside world (i.e., the OS and
thus you), while any subset of the advertised buffers may or may not be
switched off.

I'm also interested in the steps taken to ensure that the data gets physically
on the disk, DESPITE all claims of the OS AND the disk of having done so, I
mean, did it REALLY get to the disk.....

-Martin

ScottC

Oct 18, 2011, 5:16:40 PM
to scala-user


On Oct 18, 8:28 am, William la Forge <laforg...@gmail.com> wrote:
> Edmondo,
>
> A bit of a hybred actually. The database must all fit in memory, but there
> is a file backing store. Makes for very fast queries and much slower
> updates. I still need to do the timings for queries, but simple updates are
> at 2222 per second. Which is fast for file-based ACID transactions, yes?
>
> Bill

Sure, that's fast. How fast? It's proof that your code, the OS, or the
disk is not actually flushing contents to disk on each transaction.
That's how fast.
(Even if this is a consumer SSD, most of the fast ones LIE and are not
data safe unless the drive explicitly has a power-safe cache -- e.g.
Intel 320 series.) Most hard drives don't lie about disk flush, but a
few do (esp. laptop ones). MacOS lies about fsync (flush) as well.

A rotating disk in a laptop can't do much more than 100 transactions
(disk rotations) per second.

High end server disks can do up to 300. A battery backed raid card
with volatile cache can do several thousand.
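
The arithmetic behind those figures, as a rough sketch. It assumes, as a simplification, that each truly synchronous commit has to wait for about one full platter rotation; the function name is made up for illustration. A 5400 RPM drive turns 5400 / 60 = 90 times per second, which is where "not much more than 100" comes from.

```scala
object RotationBudget {
  /** Upper bound on synchronous commits per second if each commit must wait
    * for one full platter rotation (a simplifying assumption). */
  def maxSyncCommitsPerSecond(rpm: Int): Double = rpm / 60.0

  def main(args: Array[String]): Unit =
    for (rpm <- List(5400, 7200, 15000))
      println(f"$rpm%5d RPM -> about ${maxSyncCommitsPerSecond(rpm)}%.0f commits/s")
}
```

By this estimate, 2222 commits per second on a 5400 RPM laptop drive is about 25 times more than the platter could deliver if every commit really reached it.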


Alan Burlison

Oct 18, 2011, 6:45:00 PM
to Martin S. Weber, scala...@googlegroups.com
On 18/10/2011 21:31, Martin S. Weber wrote:

> And then there's the hardware, which has its own buffers, some of which
> it may and some of which it may not advertise to the outside world
> (i.e., the OS and thus you), while any subset of the advertised buffers
> may or may not be switched off.

Hardware RAID5 is the poster child for that. They appear to the host
system as a small number of 'virtual' drives even when they have many
tens of physical drives and the controllers usually have on-board cache
and batteries to either hold the contents in RAM or (better) to power
the array for long enough for the data to be flushed to disk if the
power goes off. Then of course there's also the caches on the disks
themselves - the IDE, SATA or SCSI bus transaction may have completed
but the disk itself may still be caching the data, and if the power goes
off at that point your data is gone. Here's a random selection of links
from google explaining the issue in more detail:

http://www.jasonbrome.com/blog/archives/2004/04/03/writecache_enabled.html

http://milek.blogspot.com/2010/12/linux-osync-and-write-barriers.html

http://phoronix.com/forums/showthread.php?36507-Large-HDD-SSD-Linux-2.6.38-File-System-Comparison&p=181904#post181904

http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/writebarrieronoff.html
http://linux.die.net/man/2/sync

> I'm also interested in the steps taken to ensure that the data gets
> physically on the disk, DESPITE all claims of the OS AND the disk of
> having done so, I mean, did it REALLY get to the disk.....

It's possible to ensure that e.g. by disabling the caches on disk drives
(plus lots of other necessary steps) but it nearly always has a
significant performance impact unless specific (and usually expensive)
steps are taken to circumvent it.

There's a pretty good description of how one filesystem (ZFS) handles
this at
http://constantin.glez.de/blog/2010/07/solaris-zfs-synchronous-writes-and-zil-explained.
Note in particular the use of Flash memory (SSDs in effect) to store
the filesystem intent logs - Flash is both fast and doesn't need power
to retain its content, so it is ideal for an intent log.

As the old saying goes "Fast, safe or cheap, pick any two". Solving
this sort of issue is why top-end hardware is so expensive, and why
large corporates are prepared to pay for it.

--
Alan Burlison
--

Razvan Cojocaru

Oct 18, 2011, 7:53:16 PM
to Alan Burlison, Martin S. Weber, scala...@googlegroups.com
How about this little train ride (and a beer): https://github.com/razie/razbase/blob/master/base/src/main/scala/razie/base/data/SafeFile.scala

Do you guys figure the OS is messing with me? If you debug it, it seems that it's not messing with me, contents actually appear in file before reader reads...

It writes 1 million 50 char records at 52 thousand per second. Laptop i7+SSD


William la Forge

Oct 18, 2011, 9:45:51 PM
to Razvan Cojocaru, scala-user
Future, in my mind, depends on building a community of developers. My hope then is to do enough interesting things that some of you guys will get involved. I find that a difficult path.

Bill

William la Forge

Oct 18, 2011, 9:47:14 PM
to Razvan Cojocaru, scala-user
Oh yes, I'm just using DataOutputStream over FileOutputStream for writing the log.

William la Forge

Oct 18, 2011, 9:50:18 PM
to Martin S. Weber, scala...@googlegroups.com
It would definitely be nice to have some options here. Supporting alternate sets of code with this library is easy enough and user needs will vary. So as I said in an earlier response this is more a kit with components for assembling a customized datastore.

William la Forge

Oct 18, 2011, 9:55:03 PM
to ScottC, scala-user
Good point. But for most transactions I am simply writing to the log file. The disk is 5400 rpm and I am only appending to the file. So 2222 tps doesn't seem that absurd.

Bill

Razvan Cojocaru

Oct 18, 2011, 10:15:05 PM
to Alan Burlison, Martin S. Weber, scala...@googlegroups.com
:) I figured it's too easy to work.... but, knowing I'd eventually fail hasn't stopped me before :)

So... can anyone here answer this? How is this done in a reasonably serious DB: Postgres, MySQL, etc.? How do they do it so that the rollback logs are fail-safe after commit() returns? Do they really need some deep hooks into the BIOS of the HDD (not likely) or is there some simple dime-sized algorithm?

Interestingly enough - flush() does have an effect: using it after each write() drops the "performance" significantly, from 57kps to 35kps... so it does appear to be doing something... as to what exactly it does, that is probably a question about the actual semantics of the OS/FS, disk, etc... in this particular case, the spectacular drop in performance makes me think it ended up straight on the disk, bypassing at least a few buffers...

William la Forge

Oct 18, 2011, 10:23:40 PM
to Martin S. Weber, scala...@googlegroups.com
Also, I suspect that I'm not doing a real flush, just calling flush on DataOutputStream, which just flushes its buffers and does a real write. I see that I'm going to need to actually flush the log file.
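
A hedged sketch of the difference between the two kinds of flush. `flush()` on the stream only hands the bytes to the operating system; `FileDescriptor.sync()` additionally asks the OS to write its cached data to the device. The object and method names are hypothetical, the length-prefixed record format is an assumption, and reopening the file for every record is done only to keep the example short. Whether the data truly reaches the platter still depends on the OS and the drive's own write cache.

```scala
import java.io.{DataOutputStream, File, FileOutputStream}

object DurableAppend {
  /** Append one length-prefixed record and return only after the OS has
    * been asked to push it to the storage device. */
  def appendDurably(file: File, record: Array[Byte]): Unit = {
    val fileOut = new FileOutputStream(file, true) // true = append mode
    val out = new DataOutputStream(fileOut)
    try {
      out.writeInt(record.length)
      out.write(record)
      out.flush()            // hands any Java-side buffered bytes to the OS
      fileOut.getFD().sync() // asks the OS to write its cached data to the device
    } finally out.close()
  }
}
```

`FileChannel.force(true)` is the NIO counterpart of the `sync()` call used here.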

Bill


William la Forge

Oct 18, 2011, 10:25:11 PM
to scala...@googlegroups.com
oops--forgot the list

---------- Forwarded message ----------
From: William la Forge <lafo...@gmail.com>
Date: Tue, Oct 18, 2011 at 7:21 PM
Subject: Re: [scala-user] Is 2K ACID TPS fast for a disk based (scala) database?
To: "Martin S. Weber" <martin...@nist.gov>


OK, I turned off the write cache on the disk. Performance dropped from 2222/sec to 2058.

Bill


On Tue, Oct 18, 2011 at 8:52 AM, Martin S. Weber <martin...@nist.gov> wrote:
> On 10/18/11 11:43, William la Forge wrote:
> > Razvan,
> >
> > Easy, at least for the small records datastore, which swift builds on. I
> > have two dedicated areas on disk which are written to alternately for
> > each transaction. Each area contains both a timestamp and a checksum. On
> > startup you read both and use the latest valid data.
>
> So you've turned off the disk's hardware cache, too?
>
> -Martin


William la Forge

Oct 18, 2011, 10:30:42 PM
to Alan Burlison, Martin S. Weber, scala...@googlegroups.com
I've added those links to the pearltree for the project: http://www.pearltrees.com/laforge49/links-interest/id3471040

Very helpful. I've turned off the write cache. Performance dropped to 2058. I also need to do a real flush on the file, not just have DataOutputStream flush its buffers. Ah, duh!

Bill

William la Forge

Oct 18, 2011, 11:16:06 PM
to ScottC, scala-user
Down now to 30 tps. I wasn't flushing the log file and the disk write cache was on.

Thanks everyone!

Bill


Razvan Cojocaru

Oct 18, 2011, 11:34:15 PM
to William la Forge, ScottC, scala-user

So your thing is now 70 times slower than when you started asking our opinion… you somehow seem happy though…?

Did we finally find what this forum is really good for? :-)


William la Forge

Oct 18, 2011, 11:48:20 PM
to Razvan Cojocaru, ScottC, scala-user
Happy yes. Very happy with this forum and with everyone who participated.

Also, the software is quite flexible. So it would be easy enough to have a second server to duplicate the logging and not do any flushing. A second server is important in any case for added robustness, though it needs its own independent UPS. Indeed, it would be reasonable to set up that second server as a hot backup.

The key here is that writes to the datastore DO NOT need to be flushed, as the logic does not depend on the datastore being up-to-date--it uses the latest log file for that. So it is then easy to deploy a reliable, high-performance datastore on fairly inexpensive hardware... You just need TWO laptops. :-)

Bill

Ray Racine

Oct 19, 2011, 12:11:04 AM
to William la Forge, Razvan Cojocaru, ScottC, scala-user
Bill,

I have an in-memory object db/cache which applies updates using zlib compressed Protocol Buffers messages via SCTP.  The journal isn't just a constant append, but is a journal of Paxos consensus commands.  Paxos guarantees that the three nodes a) remain in sync b) there is no distinguished node c) a failed node can always catch up etc.

I've been wanting to dump this out open source for a really long time.  Maybe we could merge given mutual interest on the topic.

FWIW this kind of thing has been around the block before.  See http://prevayler.org/ for example, winner of Dr Dobbs Jolt Award waaaaay back in 2004 or something like that.

Ray

William la Forge

Oct 19, 2011, 3:07:46 AM
to Ray Racine, Razvan Cojocaru, ScottC, scala-user
Ray,

That journal sounds like the perfect complement to the logic for working with outdated datastores.

I expect that there will not be much difficulty in journal retrieval either? On startup we need to access the next journal entry after the last one which is reflected in the contents of the datastore.

Shall we take this off-list? :-)

Bill

Scott Carey

Oct 19, 2011, 1:52:47 PM
to William la Forge, Razvan Cojocaru, scala-user
On Tue, Oct 18, 2011 at 8:48 PM, William la Forge <lafo...@gmail.com> wrote:
> Happy yes. Very happy with this forum and with everyone who participated.
>
> Also, the software is quite flexible. So it would be easy enough to have a
> second server to duplicate the logging and not do any flushing. A second
> server is important in any case for added robustness, though it needs its
> own independent UPS. Indeed, it would be reasonable to set up that second
> server as a hot backup.
>

Postgres and some other databases have a mode where transaction
commits are not synchronous, which increases performance quite a bit
without data integrity loss -- but transactions may be lost in the
event of power failure or abrupt process or OS termination.

If you write transactions to a log in an appending fashion, and do not
overwrite data, it is possible to maintain data integrity without
synchronous writes to disk.

If you must guarantee that data is persisted at the end of a
transaction -- that the transaction will not be 'forgotten' -- then
you must do synchronous writes -- even with duplicate hardware. For
example, if you need to commit a financial transaction and confirm it
to a user, there is only one option: highly reliable storage and a
full flush to that storage.

If you can tolerate loss of some of the most recent transactions, you
can get much higher performance and use many tricks to reduce the
likelihood and volume of data loss, such as logging to multiple
servers or storage pools.
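
One such trick, sketched under assumptions: sync the log once every N appends rather than on every commit, so that at most the last N - 1 transactions are exposed to a power failure. The class `BatchedLog` and its members are hypothetical names, not an API from Postgres or from any project mentioned in this thread.

```scala
import java.io.{File, FileOutputStream}

/** Hypothetical log writer that syncs once every `syncEvery` appends. */
final class BatchedLog(file: File, syncEvery: Int) {
  require(syncEvery > 0, "syncEvery must be positive")
  private val out = new FileOutputStream(file, true) // append mode
  private var unsynced = 0
  private var syncs = 0

  /** Number of sync calls issued so far, exposed for illustration. */
  def syncCount: Int = syncs

  /** Appends that could still be lost if power failed right now. */
  def atRisk: Int = unsynced

  def append(record: Array[Byte]): Unit = {
    out.write(record)
    unsynced += 1
    if (unsynced >= syncEvery) sync()
  }

  def sync(): Unit = {
    out.getFD().sync() // asks the OS to write its cached data to the device
    unsynced = 0
    syncs += 1
  }

  def close(): Unit = {
    if (unsynced > 0) sync()
    out.close()
  }
}
```

With `syncEvery = 1` this degenerates to a fully synchronous log. Larger values trade a bounded window of possibly lost transactions for fewer waits on the disk.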

Razvan Cojocaru

Oct 19, 2011, 2:13:46 PM
to scala...@googlegroups.com
One last update - I updated it from just flush() to also sync() and performance dropped from insane to under 2kps, and you can clearly see a difference in behavior and at the HDD light... in fact, so much so that I had to introduce a backoff for writers, to allow the reader to catch up... this with an SSD: with one of those mechanical drives the performance drop should be even more dramatic...

So - using flush() gives you protection against nullpointers or outofmemory killing your app, which is useful enough, while sync() gives you much better protection against pulling the plug...?

Neither is obviously guaranteed to be razie-proof 100%.

I found this beauty though: http://www.jboss.org/jbosstm/fileio

This was fun! Learned a lot! There are massive differences in performance between FileOutputStream and FileWriter for instance.

William la Forge

Oct 19, 2011, 10:19:19 PM
to Scott Carey, Razvan Cojocaru, scala-user
Scott,

Makes a lot of sense, though requiring synchronous writes even with multiple servers to guarantee no loss of data seems a bit too conservative. Or not conservative enough. Multiple servers decrease the likelihood of failure significantly, especially with independent UPS. On the other hand, synchronous writes are no proof against disk failure. It all comes down to determining the acceptable probability of failure.

Bill