Real-time Storage for orders and trades


Vero K.

Jan 1, 2016, 10:05:53 AM
to mechanical-sympathy
Hi, I read this article about LMAX (http://martinfowler.com/articles/lmax.html) and I'm not clear on how LMAX, or any similar exchange, stores orders and trades. Is it just a file? If so, I have concerns about reliability in case of a disk crash. The problem we currently face as an exchange is that we need to store orders and trades in a database. We can't purchase KX KDB because it is very expensive, and all the open-source alternatives do not look reliable. Basically we need to keep open orders and trades in memory, and if we bounce our application it should save its state somehow. Apart from that, we need to generate reports for settlement, etc., so we need to query our data model (orders, trades, etc.) very quickly, almost in real time as the data arrives. What would your suggestion be in this case? How do exchanges do this? In previous topics we discussed Cassandra, but I don't want to touch that solution now for a few reasons. Can we do everything with files, or with a well-known, reliable and cheap time-series database?

Vero K.

Jan 1, 2016, 10:08:40 AM
to mechanical-sympathy
By the way, for performance we currently use a relational database to store trades and orders and to generate reports, and on top of that we have a distributed data grid, which makes everything very complicated.

Martin Thompson

Jan 1, 2016, 10:25:45 AM
to mechanica...@googlegroups.com
Most exchanges rely on synchronous replication of incoming messages to provide reliability. This is usually combined with writing the messages to a journal on each of the replicas in parallel. This log is normally a file. Some exchanges fsync this file as they write, others let the OS choose when to flush. Trades that result from order matches are often written to a relational database for reporting. The volume of trades is typically orders of magnitude less than orders.
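As a rough illustration only (not how any particular exchange does it), the journalling part might look something like the following in Java. The Journal class and its length-prefixed format are invented for the example, but the fsync choice is just the channel.force() call:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Appends incoming messages to a journal file; optionally forces each
// write to stable storage (fsync) before returning.
public final class Journal implements AutoCloseable {
    private final FileChannel channel;
    private final boolean fsyncOnWrite;

    public Journal(Path file, boolean fsyncOnWrite) throws IOException {
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        this.fsyncOnWrite = fsyncOnWrite;
    }

    // Length-prefixed append so the journal can later be replayed message by message.
    public void append(byte[] message) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(4 + message.length);
        buffer.putInt(message.length).put(message).flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        if (fsyncOnWrite) {
            channel.force(false); // fsync: flush the data (not metadata) to disk now
        }
        // If fsyncOnWrite is false, the OS page cache decides when to flush.
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}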

What is your concern about "disk crash"? Can you be more specific? Databases can also have issues when a disk fails or whole server fails.

Vero K.

Jan 1, 2016, 10:47:58 AM
to mechanical-sympathy
Thanks, Martin. I was just thinking about open orders: because they can be modified, we would need to either update them in the file (which is almost impossible) or create a new record in the file that overrides the previous one. If we bounce our matching engine, it will then read the open orders back from the file, right? I also wanted to clarify what you mean by "writing the messages to a journal on each of the replicas in parallel". What is a replica? Is it just an order, with all orders published to a log in parallel?

By the way, in your practice have you ever seen distributed caches/data grids used for low-latency apps like this?

Currently we use a Sybase cluster and it is very stable. My concern is that if we just start to use files, how do we make that reliable, and how do we ensure that we can recover without data loss in any situation?

Thanks.

Greg Young

Jan 1, 2016, 10:51:59 AM
to mechanica...@googlegroups.com
The way to make it reliable is to use many of them, for example use 3, and on the gateway respond only when 2 have given the same message.
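As an illustrative sketch only (the Replica interface and all the names here are made up), the gateway-side wait for acknowledgements from a majority could look roughly like this:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical gateway-side quorum wait: forward the message to all replicas
// and only acknowledge the client once a majority (2 of 3) have confirmed it.
public final class QuorumGateway {

    public interface Replica {
        CompletableFuture<Boolean> replicate(byte[] message); // assumed async replica API
    }

    private final List<Replica> replicas;
    private final int quorum; // e.g. 2 when there are 3 replicas

    public QuorumGateway(List<Replica> replicas) {
        this.replicas = replicas;
        this.quorum = replicas.size() / 2 + 1;
    }

    public boolean send(byte[] message, long timeoutMillis) throws InterruptedException {
        CountDownLatch acks = new CountDownLatch(quorum);
        for (Replica replica : replicas) {
            replica.replicate(message).thenAccept(ok -> {
                if (ok) {
                    acks.countDown();
                }
            });
        }
        // Respond to the client only when the quorum of replicas has the message.
        return acks.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}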

You probably don't want to be building much of this yourself though. Much of this is non-trivial in terms of implementation. Keep in mind quality metrics: for, say, six sigma you are looking at 3.4 failures per million opportunities. Translate that to opportunities per second and it often gets scary.

That said

--
Studying for the Turing test

Martin Thompson

Jan 1, 2016, 11:02:11 AM
to mechanica...@googlegroups.com
Don't go back and update orders in the file. The file is a history of input messages. If the messages are replayed in order into a deterministic system then the same state can be recreated. Look back at the history of replicated state machines, i.e. the work of Lamport.

A change to a working order is a new input message that changes the state of the system.
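A minimal sketch of that idea, with invented message and field names: recovery is just replaying the journal, in order, through the same deterministic code.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative-only recovery by replay: the journal is the history of input
// messages, and feeding them in order into deterministic code recreates the
// working-order state. Message and field names here are invented.
final class OrderBookState {

    interface InputMessage {}
    record NewOrder(long orderId, String symbol, long price, long quantity) implements InputMessage {}
    record AmendOrder(long orderId, long newPrice, long newQuantity) implements InputMessage {}
    record CancelOrder(long orderId) implements InputMessage {}

    record WorkingOrder(long orderId, String symbol, long price, long quantity) {}

    private final Map<Long, WorkingOrder> workingOrders = new HashMap<>();

    // Applying a message never rewrites history; amendments and cancels are
    // simply further input messages, as described above.
    void apply(InputMessage message) {
        if (message instanceof NewOrder m) {
            workingOrders.put(m.orderId(),
                    new WorkingOrder(m.orderId(), m.symbol(), m.price(), m.quantity()));
        } else if (message instanceof AmendOrder m) {
            WorkingOrder existing = workingOrders.get(m.orderId());
            if (existing != null) {
                workingOrders.put(m.orderId(),
                        new WorkingOrder(m.orderId(), existing.symbol(), m.newPrice(), m.newQuantity()));
            }
        } else if (message instanceof CancelOrder m) {
            workingOrders.remove(m.orderId());
        }
    }

    // After a bounce, the same journal replayed in the same order yields the same state.
    static OrderBookState replay(List<InputMessage> journal) {
        OrderBookState state = new OrderBookState();
        journal.forEach(state::apply);
        return state;
    }
}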

Building a system like this from scratch, when you have not studied the work of Leslie Lamport, Barbara Liskov, Ken Birman, and others who advanced this field, is likely to result in unreliable designs. If this is not your bag then stick with a database :-)

Martin Thompson

Jan 1, 2016, 11:25:51 AM
to mechanica...@googlegroups.com
BTW I'm not trying to discourage you from studying this fascinating subject. I'm just trying to give you a heads-up that it is a non-trivial one. If you get good at it then really cool things can come out of it. It does, however, take a significant investment in time and resources. That investment needs to be worth it compared with the cost of licensing a good database, or of consulting experts to help build it. Using a good database in the manner it was designed for can also deliver some really nice and appropriate solutions.

It is important to define the requirements for response time and throughput, then evaluate technology and test whether it can meet those requirements. I'm often surprised by what people consider "low latency" or "high throughput". Put numbers with units on the requirements and document what the consequences are if they are not achieved.

Vero K.

Jan 1, 2016, 11:36:22 AM
to mechanical-sympathy
Thanks, Martin. Very good input, going to study this.

Greg Young

Jan 1, 2016, 11:49:18 AM
to mechanica...@googlegroups.com
Just to add to Martin's answer a bit on some of the pros of doing it internally (yes, I know I said it's hard).

As an exchange you need to be able to respond to any issues quickly. What is the value of having internal knowledge of how the underlying system works vs. having a support ticket?

It is a fairly complex decision; I have ended up on both sides of it.

Greg

ymo

Jan 7, 2016, 9:35:31 AM
to mechanical-sympathy
The most elegant and simple distributed fault-tolerant consensus protocol I have seen so far is Raft: https://raft.github.io/. Not sure about anyone using it for an exchange (yet).

Gary Mulder

Jan 7, 2016, 10:21:08 AM
to mechanica...@googlegroups.com
On 7 January 2016 at 14:35, ymo <ymol...@gmail.com> wrote:
> The most elegant and simple distributed fault-tolerant consensus protocol I have seen so far is Raft: https://raft.github.io/. Not sure about anyone using it for an exchange (yet).

I've both performance-tested and chaos-monkey failure-tested a Raft-based cluster of three Disruptors and had excellent results with Raft failover. It was very fast to fail over, and I never managed to split-brain it.

It took significant artificial packet loss (> 30%) on the 10GbE point-to-point connections between the Disruptors, while running a background load of 40K requests/sec, to delay Raft consensus by tens of seconds, and even then it still eventually achieved consensus.

Regards,
Gary

ymo

Jan 8, 2016, 3:46:15 AM
to mechanical-sympathy
Very... impressive! Did you use your own library or one of the available ones? What was the transport?

Karlis Zigurs

Jan 9, 2016, 2:27:56 AM
to mechanica...@googlegroups.com
Guess I can provide more details on the project Gary is referring to.

The Raft library was implemented from scratch in Java (at the time none of the available libraries seemed robust enough; most were synchronous academic exercises), with sub-500 µs matching times and occasional GC pauses on the order of 1-5 ms at most.
The inter-machine transport was off-the-shelf serialization over Netty (over 10GbE point-to-point links across three machines), and the Raft logs went to PCIe SSDs.

Notably, we implemented Raft so that it operated in 'async' mode (things like transport and serialization to disk were handled in batches that inherently scaled with the load on the system); this is where the Disruptor's endOfBatch flag was a perfect match for our requirements.

And yes, it was pretty robust. One lesson learned was that in a distributed system like this, running 48-hour-long chaos monkey tests (dropping interfaces, dropping packets, packet loss, delays, simulating hardware failures) in an automated fashion was a godsend. A couple of really tricky edge conditions were caught thanks to that.

K

Greg Young

Jan 9, 2016, 2:57:18 AM
to mechanica...@googlegroups.com
Just to add one here: getting a remote-controlled power system is also a good thing for durability testing.

Gary Mulder

Jan 9, 2016, 1:17:43 PM
to mechanica...@googlegroups.com
On 9 January 2016 at 07:57, Greg Young <gregor...@gmail.com> wrote:
> Just to add one here: getting a remote-controlled power system is also a good thing for durability testing.

Agreed. We had the convenience of walking into the lab next door to randomly flick switches and unplug cables.

However, in my experience it is the soft failures and flapping failures that tend to cause the most problems in production. Intermittent failures often cause cluster nodes to get into an ambiguous state, in which unexpected system modes tend to occur. As Karlis mentioned, sometimes it would take 48 hours to smoke out these states.

We compared md5 sums of the on-disk Disruptor journals when the cluster was halted to verify we had the same data replicated to all three nodes.
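Something along these lines (illustrative only) is all that check needs:

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

// Simple sketch of the post-halt check described above: hash each replica's
// on-disk journal and confirm all digests match.
public final class JournalDigest {

    public static String md5Of(Path journal) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(Files.readAllBytes(journal)); // fine for test-sized journals
        return HexFormat.of().formatHex(digest);
    }

    public static boolean allMatch(Path... journals) throws Exception {
        String expected = md5Of(journals[0]);
        for (Path journal : journals) {
            if (!expected.equals(md5Of(journal))) {
                return false;
            }
        }
        return true;
    }
}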

Regards,
Gary