Re: Eventsourced Data Sizing For Nosql DB

Greg Young

Aug 16, 2014, 9:15:14 AM

to events...@googlegroups.com

What is your "write model"?

What are your latency requirements?

5-10 TB is NOT huge data, btw. It might have been 10 years ago. Today a single node can serve this.

On Saturday, August 16, 2014, Prakhyat <prakh...@gmail.com> wrote:

Greg,

Your observations are perfectly correct.

We are in the early stages of adopting event sourcing and CQRS.

Currently the read and write models are the same and are maintained in an in-memory data grid. As a later enhancement we plan to replicate state to other data sources for the read model and querying. We are still at the discussion stage.

The objective is to maintain ready state for business-specific querying, searches and complex reporting.

We are expecting 5 to 10 terabytes of data/state. Just imagine, from a querying/searching/complex reporting perspective, if we depended on events to recreate state. It would be complex.

But still, one read/write source plus the additional events is a huge amount of data.

-prakhyat m m
Sent from my iPhone

On 16-Aug-2014, at 17:15, Greg Young <gregor...@gmail.com> wrote:

But you mentioned read models separately from domain state, as if they were two different things. For just events plus a read model, isn't this roughly the same storage requirement as data plus an audit table?

On Saturday, August 16, 2014, Prakhyat <prakh...@gmail.com> wrote:

Hi Greg,

Thanks.

How do we create state?
The write side of the event-sourced application will receive the events from the client. These events will be handled and converted to domain state.

Consider a domain object "Bank". We have designed 3 events: fromcreated, fromedited and fromdeleted. The fromcreated event will create the first domain instance, i.e. a "bank" object with some id, and the fromedited event will edit the "bank" domain object for the given id.
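
The event handling described above (created/edited/deleted events turned into a "bank" object) could be sketched roughly as follows. This is only an illustration in Python, not the poster's actual code: the event dictionaries, the field names and the `apply_event` / `replay` functions are assumptions; only the three event names come from the post.

```python
from typing import Optional

# Hypothetical event shape: {"type": ..., "id": ..., "data": {...}}.
# Only the three event names are taken from the post.


def apply_event(state: Optional[dict], event: dict) -> Optional[dict]:
    """Return the new state of one "bank" object after applying one event."""
    kind = event["type"]
    if kind == "fromcreated":
        return {"id": event["id"], **event["data"]}
    if kind == "fromedited":
        if state is None:
            raise ValueError("cannot edit a bank that does not exist")
        return {**state, **event["data"]}
    if kind == "fromdeleted":
        return None
    raise ValueError(f"unknown event type: {kind}")


def replay(events: list) -> Optional[dict]:
    """Rebuild the current state by folding every event, oldest first."""
    state = None
    for event in events:
        state = apply_event(state, event)
    return state


history = [
    {"type": "fromcreated", "id": "bank-1", "data": {"name": "First Bank"}},
    {"type": "fromedited", "id": "bank-1", "data": {"name": "First National"}},
]
current = replay(history)
```

Replaying the whole history on every read is what the poster wants to avoid; keeping `current` somewhere and applying only new events is the "maintain state" option.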

Why do we always store state?
We always maintain state to make queries faster. The query side will always have the state ready for gets, reporting and searches.

We are a highly OLTP application. On the read side we don't want to recreate state by querying events and rebuilding state every time. Our reporting queries will involve searching data across a huge number of interconnected domain objects.

As I understand it, reconstructing state from a large set of events will take time. Complexity will also increase if a query involves a huge number of domain objects, or for business-specific reporting queries over date ranges, so we felt maintaining state is the right choice.

-prakhyat m m

Sent from my iPhone

On 16-Aug-2014, at 1:59, Greg Young <gregor...@gmail.com> wrote:

"If the application is huge and highly OLTP with millions of transactions....data will grow in no time. Millions of transaction's means million of events and these needs to be saved. This storing will take up major disk space and will occupy space faster.

eventsourced/cqrs/DDD will lead to mammoth of data being saved. Planning on data sizing will end up requiring lot of disk space(including data and multiple copies for durability). Huge data means big big clusters."

Millions of events * say 500 bytes/event = ?

How many events/second are you looking at? If it's under a few thousand, stop thinking about it.
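
The arithmetic being hinted at can be worked through with a small back-of-the-envelope sketch. The 500 bytes/event figure is from the message; the 1,000 events/second rate, the one-year horizon and the `event_log_bytes` helper are assumptions chosen only for illustration, and compression, indexes and replication are ignored.

```python
def event_log_bytes(events_per_second: int, bytes_per_event: int, days: int) -> int:
    """Raw (uncompressed, unreplicated) size of the event log after `days`."""
    seconds = days * 24 * 60 * 60
    return events_per_second * bytes_per_event * seconds


MB = 10 ** 6   # decimal megabyte
TB = 10 ** 12  # decimal terabyte

# "Millions of events * say 500 bytes/event": one million events.
one_million = 1_000_000 * 500

# Assumed sustained rate of 1,000 events/second for one year.
per_year = event_log_bytes(1_000, 500, 365)

print(f"1M events at 500 B  : {one_million / MB:.0f} MB")
print(f"1k events/s, 1 year : {per_year / TB:.2f} TB")
```

Under these assumptions a million events is about 500 MB, and a full year at a sustained 1,000 events/second is roughly 15.8 TB of raw events on disk.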

"eventsourced/cqrs/DDD will lead to mammoth of data being saved."

Where you are relative to Moore's law, in terms of how fast you are acquiring data, is more important: http://en.wikipedia.org/wiki/Mark_Kryder

"store domian state"

Why are you storing your domain state?

On Fri, Aug 15, 2014 at 2:36 AM, Prakhyat Mallikarjun <prakh...@gmail.com> wrote:

Hi Team,

I am working on a solution involving event sourcing and DDD/CQRS. The app is configured with the Cassandra journal plugin to store the events. Snapshots will also be stored in Cassandra. The application is designed to have sharded single writers. These single writers will eventually write state to an in-memory data grid. The state of the application is always maintained in the in-memory data grid, to make reads faster.

The app has the following layers:

Front End
|
|
Processing Layer
|
|
Persistence Layer
|
|
In memory Datagrid Layer
|
|
Cassandra Durable DB

Front End --> Takes the command requests from the web
Processing Layer --> Processes the commands and can also source the commands

Persistence Layer --> A sharded single-writer PersistentActor will first persist the event into Cassandra, then eventually update the domain state in the in-memory data grid.
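
The ordering described for the persistence layer (journal first, derived state second) could be sketched as below. This is a plain-Python illustration, not Akka code: `SingleWriter`, the list standing in for the Cassandra journal, the dict standing in for the data grid and the command fields are all hypothetical.

```python
class SingleWriter:
    """Handles every command for one shard, so writes are serialized."""

    def __init__(self) -> None:
        self.journal: list = []  # stand-in for the durable, append-only journal
        self.grid: dict = {}     # stand-in for the in-memory data grid

    def _apply(self, event: dict) -> None:
        # Derived state: latest name per bank plus the event sequence number.
        self.grid[event["bank_id"]] = {
            "name": event["name"],
            "version": event["seq"],
        }

    def handle(self, command: dict) -> dict:
        # 1. Turn the command into an event (validation omitted).
        event = {
            "seq": len(self.journal) + 1,
            "bank_id": command["bank_id"],
            "name": command["name"],
        }
        # 2. Persist the event first: the journal is the source of truth.
        self.journal.append(event)
        # 3. Only then update the state that serves reads.
        self._apply(event)
        return event

    def rebuild_grid(self) -> None:
        """The grid is disposable: it can be rebuilt from the journal."""
        self.grid = {}
        for event in self.journal:
            self._apply(event)


writer = SingleWriter()
writer.handle({"bank_id": "bank-1", "name": "First Bank"})
writer.handle({"bank_id": "bank-1", "name": "First National"})
```

Because the grid is always derivable from the journal, losing the in-memory copy costs a replay rather than data.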

I accept that disks are very cheap. But an eventsourced/CQRS/DDD design requires us to store commands (if required), events, snapshots and domain state (read data, write data, etc.). Don't you think we will end up storing lots and lots of objects?

Tuning the data grid and Cassandra for durability means choosing replication, distribution, multiple copies, etc. That is further overhead of storing data and maintaining multiple copies.

If the application is huge and highly OLTP with millions of transactions, data will grow in no time. Millions of transactions means millions of events, and these need to be saved. This storage will take up major disk space and will fill it quickly.

eventsourced/CQRS/DDD will lead to a mammoth amount of data being saved. Data sizing will end up requiring a lot of disk space (including data and multiple copies for durability). Huge data means big, big clusters.

What are your thoughts? Correct me if I am wrong.

-Prakhyat M M

--
You received this message because you are subscribed to the Google Groups "Eventsourced User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to eventsourced...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
Studying for the Turing test

--
You received this message because you are subscribed to a topic in the Google Groups "Eventsourced User List" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/eventsourced/Cx3-kZmKnf4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to eventsourced...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Prakhyat Mallikarjun

Aug 16, 2014, 11:56:27 AM

to events...@googlegroups.com

Hi Greg,

We are working on a financial application.

I am yet to get latency requirements from the business. But like any financial application we have to achieve a UI that responds in less than 3 seconds, plus critical end-of-day jobs.

I am not in sync with you on 5 TB of data in one node.

Just have a look at the article below:
http://www.globallogic.com/wp-content/uploads/2013/04/Elastic-Java-Heap.pdf

If we use an in-memory data grid like Infinispan, we can accommodate at most 16 GB per JVM heap. To accommodate terabytes of data, a huge cluster has to be set up. The cluster will become even bigger if durability of data on more than one node is required. Am I wrong in my analysis?

-prakhyat m m

Greg Young

Aug 16, 2014, 12:43:03 PM

to events...@googlegroups.com

3 seconds is a massive amount of time. Why do you need everything in an in-memory grid to support that? Even on a spindle you could easily support this; on SSDs it would be trivial to support this and keep data persistent as opposed to in memory.

Also, yes, you are wrong in your analysis. Not all of your data is "live" at any point in time, so you don't just say "put it all in an in-memory grid". Most likely 99% of your data is not live at a given time (especially for things like events?!?!).

It sounds to me like you need to think a lot more about things before moving forward, as it sounds like you don't understand your requirements.


Prakhyat

Aug 16, 2014, 12:55:57 PM

to events...@googlegroups.com, events...@googlegroups.com

Greg,

Thanks again.

That is my sole problem: sizing of data with event sourcing/CQRS.

You have understood the whole purpose of this post.

Then how should we maintain data in the in-memory data grid? What kind of approach should we take? The approach should solve our querying/searching/complex reporting business use cases.

-prakhyat m m
Sent from my iPhone


Vaughn Vernon

Aug 16, 2014, 2:56:42 PM

to events...@googlegroups.com

It seems to me that the whole panic over size of data is not due to persisting on disk but to persisting in memory with a data grid. If you are using ES/CQRS then store everything on disk first, then determine if there is any reason at all to maintain certain hot data in memory. Think of what it's going to take to store 5-10 terabytes in memory. If you need to replicate regions/caches so you don't lose anything, think of how much network overhead there will be in replicating parts of 5-10 terabytes all the time. In practical terms you are talking about around 500 nodes to hold only one copy of the 10 terabytes in memory. After that you want to, what, have two more copies of everything for redundancy purposes? Yikes. You had better have a really awesome network.
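
The node count above can be reproduced with a rough sketch. The 20 GB of usable memory per node is an assumption chosen because it yields the "around 500 nodes" figure for 10 TB; the 16 GB heap limit is the number quoted earlier in the thread. The `grid_nodes` helper is hypothetical, and decimal units are used (10 TB = 10,000 GB).

```python
import math


def grid_nodes(data_gb: float, usable_gb_per_node: float, copies: int = 1) -> int:
    """Nodes needed to hold `copies` full copies of the data in memory."""
    return math.ceil(data_gb * copies / usable_gb_per_node)


DATA_GB = 10_000  # 10 TB in decimal gigabytes

one_copy = grid_nodes(DATA_GB, 20)                 # assumed 20 GB usable per node
three_copies = grid_nodes(DATA_GB, 20, copies=3)   # original plus two replicas
with_16gb_heap = grid_nodes(DATA_GB, 16)           # 16 GB heap figure from the thread

print(one_copy, three_copies, with_16gb_heap)
```

Under these assumptions one copy needs 500 nodes, three copies need 1,500, and a 16 GB heap pushes a single copy to 625 nodes, before counting any per-entry overhead in the grid.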

Greg Young

Aug 16, 2014, 4:30:18 PM

to events...@googlegroups.com

You have to answer your first question. Why would you put archive data etc. in an in-memory grid when you have 3-second SLAs? For searching, complex business reporting, etc., why not use a projection into a reporting model: Cassandra, SQL, FoundationDB, Riak, etc.? There are many very good reporting models out there, some in memory, some persistent.

Storing all the events of your system solely in memory when you have 5 TB of them is not an economically good option, especially when you only need to access them within three seconds.

Greg

Prakhyat

Aug 17, 2014, 1:40:16 AM

to events...@googlegroups.com, events...@googlegroups.com

Hi Vaughn,

The worry is not storing data in memory. We will persist state on disk.

But to be more precise:

We have 5 TB of just the state. I want to understand how much additional space is required for storing events/snapshots.

Also, we are an OLTP app. My concern is: will events/snapshots consume much additional space and, the app being OLTP, consume it faster?

In short, what additional disk sizing is required for an app designed with the event sourcing/CQRS approach?

-prakhyat m m

Sent from my iPhone

Prakhyat

Aug 17, 2014, 1:45:09 AM

to events...@googlegroups.com, events...@googlegroups.com

Greg,

Thanks for your inputs.

I will look into the ideas you mentioned and try to incorporate them into my design. Projection is a new concept to me.

-prakhyat m m

Sent from my iPhone

Greg Young

Aug 17, 2014, 10:29:50 AM

to events...@googlegroups.com

The big question is how much of your state is "live" at a given time. Take, for example, the concept of a position in the back office. There is currently some set of positions in the process of settlement (say the last 3 days' worth in most markets). Then there are all of the positions that have been historically settled. Once a position is settled it should not be updated afterwards (it's read-only). It makes total sense to keep the positions in the process of settlement inside an in-memory grid; it doesn't make sense to store six-year-old positions in the grid. This distinction is very important in such systems, and different bits of data get different SLAs.
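
The live/settled split described above might look something like the following sketch. The 3-day window comes from the message; the position fields, the `is_live` rule and the two dictionaries standing in for the grid and the archive store are assumptions made purely for illustration.

```python
from datetime import date, timedelta

# "say the last 3 days worth in most markets"
SETTLEMENT_WINDOW = timedelta(days=3)


def is_live(position: dict, today: date) -> bool:
    """A position is live while it is unsettled and inside the window."""
    age = today - position["trade_date"]
    return (not position["settled"]) and age <= SETTLEMENT_WINDOW


def split(positions: list, today: date) -> tuple:
    """Route live positions to the grid and everything else to the archive."""
    grid, archive = {}, {}
    for position in positions:
        target = grid if is_live(position, today) else archive
        target[position["id"]] = position
    return grid, archive


today = date(2014, 8, 17)
positions = [
    {"id": "p1", "trade_date": date(2014, 8, 15), "settled": False},
    {"id": "p2", "trade_date": date(2008, 8, 15), "settled": True},
]
grid, archive = split(positions, today)
```

Only `p1` ends up in the grid here; the six-year-old settled position goes to the archive, which can live on disk behind a slower SLA.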

One would then make a projection off the event streams to produce a read model for querying this kind of state historically (and in near real time, into something like an OLAP cube or Hadoop, etc.). These are fairly core concepts of CQRS/ES-based systems, and there is a lot of info available on the web about how to get this going.
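
As a minimal sketch of what such a projection could look like, the example below folds events into a SQL table and then answers a reporting query from that table instead of from the events. SQLite from the Python standard library stands in for whatever reporting store is chosen; the event types, the table layout and the `project` function are hypothetical.

```python
import sqlite3


def project(events: list, db: sqlite3.Connection) -> None:
    """Fold position events into a queryable reporting table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS position_report ("
        "position_id TEXT PRIMARY KEY, account TEXT, "
        "quantity INTEGER, status TEXT)"
    )
    for event in events:
        if event["type"] == "position_opened":
            db.execute(
                "INSERT INTO position_report VALUES (?, ?, ?, 'open')",
                (event["position_id"], event["account"], event["quantity"]),
            )
        elif event["type"] == "position_settled":
            db.execute(
                "UPDATE position_report SET status = 'settled' "
                "WHERE position_id = ?",
                (event["position_id"],),
            )
    db.commit()


events = [
    {"type": "position_opened", "position_id": "p1", "account": "A", "quantity": 100},
    {"type": "position_opened", "position_id": "p2", "account": "A", "quantity": 50},
    {"type": "position_settled", "position_id": "p1"},
]

db = sqlite3.connect(":memory:")
project(events, db)

# The report is answered by the read model, not by replaying events.
rows = db.execute(
    "SELECT account, SUM(quantity) FROM position_report "
    "WHERE status = 'settled' GROUP BY account"
).fetchall()
```

The read model can be dropped and rebuilt from the event stream at any time, which is why it does not need the same durability guarantees as the events themselves.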

Cheers,

Greg

Prakhyat

Aug 17, 2014, 11:53:53 AM

to events...@googlegroups.com, events...@googlegroups.com

Greg,

Thanks a lot for giving a totally different perspective.

I was thinking only from a sizing perspective and never considered the points you shared. I will incorporate these into my design.

The discussion with you helped me a lot. Thanks again.

-prakhyat m m
Sent from my iPhone

Prakhyat

Aug 17, 2014, 12:41:22 PM

to events...@googlegroups.com, events...@googlegroups.com

Greg,

I am clear on your thoughts about using projections for querying.

But it is still bothering me: how do we handle a large set of events in event sourcing?

Do we need to throw more hardware at storing the millions of events received every day, or are there any architectural changes you would recommend?

-prakhyat m m

Sent from my iPhone

Greg Young

Aug 17, 2014, 4:17:32 PM

to events...@googlegroups.com

Events are generally kept on persistent storage not in memory. Millions of events/day is not that much.
