One table per aggregate type worth the effort?


Werner Clausen

14 Jan 2013, 03:36:57
to ddd...@googlegroups.com
Hi,
 
Our event store is implemented with SQL Server as storage and defines one stream table per context. But I can see that, in theory, a few people recommend one table per aggregate type, so to speak. We are still in the startup phase, so changing this would be alright. But when you think about it, it would take some effort in terms of configuration and usage: the logic needs to know which tables hold which types of aggregates, etc. So I'd like to know from those with practical experience: is it worth the effort to implement this from day 1? I know it depends on the usage, but if I could hear from someone who has been in that situation thinking "it was good we implemented this", that would make my choice easier...
 
--
Werner

Greg Young

14 Jan 2013, 03:39:37
to ddd...@googlegroups.com
I would run as far away as possible from one table per aggregate type.
--
Doubt is not a pleasant condition, but certainty is absurd.

Bennie Kloosteman

14 Jan 2013, 04:36:37
to ddd...@googlegroups.com
Why one table per aggregate type? What do you get for the hassle of
changing the schema etc. every time you add a new aggregate?

Ben

Colin Yates

14 Jan 2013, 05:04:53
to ddd...@googlegroups.com
I am in the same position as you, so have your "pinch of salt" ready :).

I did consider the schema-per-aggregate for the following reasons:

 - enforcement of some invariants in the database (not null, uniqueness etc.) solves some set-based issues (a unique name, for example)... moving swiftly on :)
 - some of the read models are exactly the same as the aggregate, so querying on the write schema, even if it is through a different service, seemed efficient (low-hanging fruit though)
 - this made it easier for the "traditional DB" people to swallow - those who immediately start querying the database to find out what is going on
 - no need for custom serialisation - persistent classes could map straight onto the tables

However, a significant technical requirement was the ability for views to rebuild themselves from a stream of events.  Having those events across multiple tables meant either a rather large union or in-memory sorting (which posed scalability challenges).

In the end we decided to go with a single "event" table - we aren't going to have billions of events so scalability on the DB isn't an issue, it gives us the "global event stream", and serialisation to/from JSON (for example) is trivial.
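Purely as a sketch (illustrative column names and types; nothing here is settled), the single event table could be as simple as:

create table t_event (
  event_id     bigint identity primary key,  -- global ordering, i.e. the "global event stream"
  aggregate_id varchar(100) not null,        -- which stream the event belongs to
  event_type   varchar(200) not null,
  data         nvarchar(max) not null,       -- the event serialised as JSON
  created_utc  datetime not null
)

Appends are plain inserts, rebuilding a view is a scan ordered by event_id, and hydrating one aggregate is a lookup on aggregate_id.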

Hope this helps.

Col

Colin Yates

14 Jan 2013, 06:28:52
to ddd...@googlegroups.com
The other natural benefit is that refactoring events becomes simpler than if the events are serialised. Renaming a column, for example, as opposed to executing an update statement with some horrific string concatenation, or hydrating and re-serialising all relevant events.
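To make the trade-off concrete (hypothetical table and property names), a rename against serialised JSON tends to end up as something like

update t_event
set    data = replace(data, '"customerName":', '"clientName":')
where  data like '%"customerName":%'

whereas with one column per field it would be a plain ALTER TABLE / sp_rename.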

Bennie Kloosteman

14 Jan 2013, 21:49:08
to ddd...@googlegroups.com
> - this made it easier for the "traditional DB" people to swallow - those
> who immediately start querying the database to find out what is going on
> - no need for custom serialisation - persistent classes could map straight
> onto the tables

Point them at the read model and say the event source is like a
replayable command log... once I did that the DBA was pretty keen
on event sourcing; in fact he pushed for it, not me. That said, my
DBA hates GUIDs. I worked out a nice and simple system for ids (just
fetch a range and allocate them in memory), but I kind of regret it
now, as the main benefit of smaller indexes mostly disappears with a
denormalized read model.
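For the curious, the range trick is just something along these lines (hypothetical table and column names): reserve a block of ids in one round trip, then hand them out from memory.

create table id_ranges (sequence_name varchar(50) primary key, next_id bigint not null)

update id_ranges
set    next_id = next_id + 1000
output deleted.next_id   -- first id of the freshly reserved block of 1000
where  sequence_name = 'aggregate'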
>
> However, a significant technical requirement was the ability for views to
> rebuild themselves from a stream of events. Having those events across
> multiple tables meant either a rather large union or in-memory sorting
> (which posed scalability challenges).

Very yuck.
>
> In the end we decided to go with a single "event" table - we aren't going to
> have billions of events so scalability on the DB isn't an issue, it gives us
> the "global event stream" stream and serialisation to/from JSON (for
> example) is trivial.

Scalability is not an issue - when you hit it, if you have a single
table it's easy enough to move it to its own segmented database, NoSQL,
or even a file.

Ben

Werner Clausen

15 Jan 2013, 04:12:54
to ddd...@googlegroups.com
Hi Benny (and others)
 
Thanks for your replies.
 
The reason for having one table per aggregate type is that we expect around 100 million events (maximum; most customers would just have 20+). For this reason we have implemented the EventStore as one table per context. And while this seemed architecturally correct for a while, trouble came when someone wanted to read an aggregate from another context. This wasn't possible, as we have no configuration recording that AggregateType1 belongs to Table1, etc. But this should be possible, right?
 
From there, it didn't take long for someone to suggest that we split everything. That way, we would be fairly sure not to hit any performance barrier due to large event tables. Searching the matter, it turned out that quite a few people have done so - at least in theory. This is the cause of my concerns, and of this thread.
 
But I will take your advice as my next pointer. This means, however, that I'm caught between a rock and a hard place. Either I go with one event table and hope it doesn't perform terribly with 100M+ records, or I keep the current "one BC = one table" and make some ugly configuration to support reading from other contexts.
 
The thing is, if at some point in time I wanted to split the event table, this would be a hard job because we do not store metadata (yet). This means that we have no way to determine the event namespace/type just by looking at the database. Should we store this metadata in a separate table at all costs?
 
--
Werner

@yreynhout

15 Jan 2013, 05:34:48
to ddd...@googlegroups.com
Why would you ever wanna read from another context?

Bennie Kloosteman

15 Jan 2013, 05:56:17
to ddd...@googlegroups.com
On Tue, Jan 15, 2013 at 5:12 PM, Werner Clausen <item...@hotmail.com> wrote:
> Hi Benny (and others)
>
> Thanks for your replies.
>
> The reason for having one table per aggregate type, is that we expect around
> 100 million events (maximum, most customers would just have 20+). For this
> reason we have implemented the EventStore as one table per context. And
> while this seemed architectural correct for a while, trouble came when
> someone wanted to read an aggregate from another context. This wasn't
> possible as we have no configuration to support that AggregateType1 belongs
> to Table1 etc. But this should be possible right?

I don't think you should be able to do this... I'd use the read model
if I absolutely have to. To me, the idea of a context is isolation, and
in most samples they are a DLL, with only the commands and
externally available events being public.
>
> From there, it didn't take too long for someone to suggest that we split
> everything. That way, we would be fairly sure not to hit any performance
> barrier due to large event tables. Searching the matter and it turned out,
> that quite a few ppl have done so - at least in theory. This is the cause
> for my concerns, and this thread.

SQL likes deep tables with few fields; it's really no more work for
the DB than multiple tables, since you should not be doing any SELECT *
FROM table... If you were going to hit performance issues, surely
caching is a better first step (and then partitioning etc.),
though further steps should rarely be needed, since the entire
write domain is often only one "client".

>
> But I will take your advices as my next pointer. This means however, that
> I'm caught between a rock and a hard place. Either I go with 1 event table
> and hope this doesn't perform crazy with 100+ records. Or I keep the current
> "one bc = one table" and make some ugly crappy configuration to support
> reading from other contexts.

In the majority of cases it will perform fine with millions of
records (and you can get to 100M+ easily enough). If you expect
lots of reads, e.g. lots of commands to the same aggregates rather than
new aggregates, and don't have a distributed environment, then cache:
if you find poor performance, caching should remove most (often 90%+)
of the read traffic.
>
> The thing is, if at some point in time I wanted to split the eventtable,
> this would be a hard job because we do not store metadata (yet). This means
> that we have no way to determine the event namespace/type just by looking at
> the database. Should we store these metadata in a sep. table at all costs?

I think the real problem here is that you are getting aggregates from
other contexts. The whole point of a context is isolation and
different views of the same aggregate, e.g. a CarSale is different for the
salesman, service and leasing. Contexts then communicate via
events (or you can push via commands), and the event and command
should carry all the data they need, not just an id. If you're fetching
aggregates from other contexts, then you may as well have a single
context.

Why split the table? To me the biggest danger of splitting the
event table is focus... it may lead to developers doing read-type
queries on the event store. The focus should be on the query side.
All you should be doing is query by type and id, maybe occasionally
filtered by date (but this can be done in code), and inserts. So you
are looking at an indexed lookup and an insert; DBs do this
very fast - no table scans, little contention, and repeats of
the same queries, so heavily cached.
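In SQL terms (illustrative table and column names only), that whole write-side workload is roughly:

-- an index that makes stream loads cheap
create index ix_event_stream on t_event (aggregate_type, aggregate_id, event_id)

-- load one aggregate's events, in order
select data from t_event where aggregate_type = ? and aggregate_id = ? order by event_id

-- append a new event
insert into t_event (aggregate_type, aggregate_id, data) values (?, ?, ?)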

Ben

rmacdonaldsmith

15 Jan 2013, 09:17:29
to ddd...@googlegroups.com
Hi,

You may get some insight by looking at Jonathan Oliver's EventStore implementation (or other open-source event store implementations). It supports many persistence technologies (SQL Server, MongoDB, etc.). All of these have a similar design around two "tables" (quotes to recognize the fact that some persistence technologies are schemaless): one for the event stream and one for snapshots. In the SQL Server implementation the actual event payload is a serialized blob field, but in your implementation you could make this field text and store your event payload as serialized JSON / YAML / XML / other if you need the contents of this field to be human-readable in SQL Management Studio. Indeed, depending on your SQL Server version, you get some mileage from XML, as it has better support in more recent releases.
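Very roughly (this is only a sketch of that general two-"table" shape, not the actual schema of any of those implementations), it looks like:

create table Commits (
  StreamId       uniqueidentifier not null,
  CommitSequence int not null,
  Payload        varbinary(max) not null,  -- serialized events; could be text/JSON/XML instead
  CommitStamp    datetime not null,
  primary key (StreamId, CommitSequence)
)

create table Snapshots (
  StreamId       uniqueidentifier not null,
  StreamRevision int not null,
  Payload        varbinary(max) not null,  -- serialized aggregate state
  primary key (StreamId, StreamRevision)
)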

Anyway, just a thought that may help you in your decision.

Rob.

Werner Clausen

15 Jan 2013, 10:30:58
to ddd...@googlegroups.com
Bennie,
 
I agree with your view on context boundaries. My co-worker read some forum where "they allowed reading other aggregates" - perhaps they were specifically talking about aggregates in the same context.
If we go with one table, some other questions arise... I will discuss with my co-workers and write them down, and probably get back in a new thread so that topics don't get mixed up.
 
Thanks for your time and in-depth explanations!
 
--
Werner

Greg Young

15 Jan 2013, 11:10:42
to ddd...@googlegroups.com
Geteventstore.com

We support re-partitioning (indexing). What you are trying to do is have multiple partitioning strategies (aggregate id and aggregate type). My guess is you will want more as well (say userid and correlationid)

Cheers,

Greg

Wajid Poernomo

15 Jan 2013, 21:16:34
to ddd...@googlegroups.com
Hi Greg,

I'm trying to understand the single-table model as implemented in your event store. If you store the event as a blob, which fields can you index/project over - only the aggregate ID, or some other generic column in the table? I need to make flexible, event-specific temporal queries over the event store, but I can't see how this can be done with a single table and a blob for the events. I realize you can rehydrate to a particular point in time, but I need to make temporal queries over the complete audit trail in the event store. Can you elaborate on how you can do this through projections in your implementation?

Regards,

Wajid

Bennie Kloosteman

15 Jan 2013, 21:35:31
to ddd...@googlegroups.com
On Tue, Jan 15, 2013 at 11:30 PM, Werner Clausen <item...@hotmail.com> wrote:
> Bennie,
>
> I agree with your view on context boundaries. My co-worker read some forum
> were "they allowed reading other aggregates" - perhaps they were
> specifically talking about aggregates in the same context.

You should try to keep the aggregate's methods as much as possible of the form
void MethodName(params); this prevents any leaking of internal state
and is important for testability (you just test the events produced,
not the internal workings, which means tests don't need to change so
often - and are simpler and more comprehensive, which gives more motivation
for testing your logic - i.e. a few tests per command and you have very
good coverage of mutations on your domain) and ensures your command
side really is commands, not a traditional domain with mixed queries/
reads.

Pretty sure the forum post is about the same context, and even then it's
not ideal... these are the more common options, and it's more a case of
which is the lesser evil, with different trade-offs:

Get a handler to bring in 2 aggregates (same context) so it can hook
the second one for events
Create a saga (which hooks the events and fires the related commands)
Pass in a ReadModelService
Get a handler to bring in 2 aggregates (same context) and apply
logic from one to the other (this is the quick and dirty option)

All of these should be used only if you have to, with putting all the data
needed in the command and events being preferred. In most cases you can
trigger the behavior needed via just commands and events, and this is a
design skill those new to CQRS need to get better at. Note I'm still
learning this as well; sagas do seem the best way, though there is a
development cost (especially if you're rolling your own
infrastructure).

My rough rule: if the handler brings in 2 aggregates of the same type,
it's OK (since in most implementations the code has full access to the
private members anyway); if there is no failure requirement I sometimes
use other aggregate types in the context; otherwise I try to use a saga.

Anyway, there should be no "golden rules", as IMHO CQRS is too new and
there are too few public "large" examples with say 30+ aggregates
(though there are some people with the experience). But certainly I would
not have the infrastructure/plumbing dictated by unusual choices -
the plumbing is normally carefully designed. Stick to standard
infrastructure and plumbing styles unless you have no choice (e.g.
politics); basically, make your logic/code fit the infrastructure
(excepting very unusual requirements), not the other way around, unless
of course you're experienced in CQRS.

Regards,

Ben

Werner Clausen

16 Jan 2013, 03:39:22
to ddd...@googlegroups.com
Interesting. I have downloaded it and will ask my questions over at the EventStore group (I guess that is preferable?)
 
Thanks.

Werner Clausen

16 Jan 2013, 03:43:01
to ddd...@googlegroups.com
I would like that too :)
Wajid, perhaps ask your ES-specific questions in the ES group: https://groups.google.com/forum/#!forum/event-store. Probably a better chance of getting in-depth answers...
 
--
Werner

Ramin

16 Jan 2013, 06:08:43
to ddd...@googlegroups.com
What if the Aggregate ID is not simply a GUID? Or is that supported?

Bennie Kloosteman

16 Jan 2013, 07:47:33
to ddd...@googlegroups.com
Ids can be trivially converted to a GUID, e.g. (type as int) << 32 + id
will be a unique GUID.
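In T-SQL terms, one way to realise that idea (purely illustrative) is to pack a type discriminator and a locally allocated id into the 16 bytes of a GUID:

declare @typeId int = 7, @localId bigint = 42
select convert(uniqueidentifier,
       cast(@typeId as binary(4)) + cast(@localId as binary(8)) + 0x00000000)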

Greg Young

16 Jan 2013, 14:04:12
to ddd...@googlegroups.com
Aggregate Id is a string

James Nugent

16 Jan 2013, 15:05:45
to ddd...@googlegroups.com
Hi Wajid,

You can do this using projections in the Event Store, provided your events are serialized as JSON. Your projections would be of the form:

fromAll().when({
    'SomeMessage': function(state, e) {
        linkTo('stream-' + e.Whatever, e);
    }
});

which will end up with a stream per unique e.Whatever encountered, with links to the SomeMessage events.

Probably the event store group is a better place for this question as Werner suggests

Cheers,


James

Werner Clausen

17 Jan 2013, 05:10:04
to ddd...@googlegroups.com
Hi Colin,
 
Thanks for your answer. I wonder, how do you get events to replay? Say you have a new read side online that needs events from ContextA. Are you pulling all 2 million records from the event store and inspecting them for "ContextA"? In our case, with say 100 million records, replay could take 1-2 days :(
 
--
Werner

Werner Clausen

17 Jan 2013, 05:16:36
to ddd...@googlegroups.com
Thanks for your answer.
 
I have browsed his implementation, and I think no metadata is indexed or in any way searchable. Which means that you'd have to fetch all 100 million records to inspect the metadata. That is of course not an option. I guess there is the option of putting the data as XML, but that would probably mean I'd need to index it... In any case, my current solution has been very strict about NOT "searching" the event store. Deviating from this seems wrong, doesn't it?
 
--
Werner

Colin Yates

17 Jan 2013, 05:28:58
to ddd...@googlegroups.com
I would be amazed if 100 million replays took 1 to 2 days - are you doing them by hand :). Recall that applying an event only updates local state, with no business logic. But no, I imagine (recall this is all a thought experiment at the moment) that I will persist metadata with the row. I also want to tie in the (original) command that raised it as well.

So, a starting condition is that each command is uniquely identifiable.  

My schema then would look something like:

t_event {
  event_id varchar or int... PK
  event_date datetime
  command_id varchar or int...
  aggregate_type varchar
  aggregate_id varchar or int...
  data varchar...
}

Some thoughts:
 - event_id may be a global integer or a UUID.
 - event_date is always useful....
 - command_id may be a global integer
 - aggregate_type would be "user", "product" etc.
 - aggregate_id might be an integer scoped within the aggregate_type, a global integer or a UUID
 - data is the serialised event 
 - aggregate_id cannot be an auto-incrementing ID because event consumers might want to retain a relationship to the aggregate and are processed (in my app) before persistence.

I did consider my aggregate id encoding the aggregate type (and some other things like creation date), "PRODUCT_1" or "PRODUCT_1_20130117" for example.  If this were the case then the aggregate_type and aggregate_id columns become a single "aggregate_id varchar" column.

Querying:
 - rebuilding the system one event at a time: select * from t_event order by event_id
 - hydrating an aggregate: select data from t_event where aggregate_id=? order by event_id (maybe with an additional "and aggregate_type=?")
 - identifying how command 8 broke the system: select * from t_event where command_id=?

All pretty straightforward I think.  I am undecided whether to have keys that encode more than one thing (i.e. PRODUCT_1) rather than separate columns...

I would be more than happy to get some thoughts - this is all up in the air at the moment and I only really thought about the detail as I typed this :)

@yreynhout

17 Jan 2013, 06:03:29
to ddd...@googlegroups.com
How hard would it be to do the indexing asynchronously? Remember, the stuff in there is IMMUTABLE. No need to do it at query time. Think out of the box.
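Concretely, that could be as small as a side table filled by a catch-up reader over the (immutable) event table - hypothetical names:

create table t_event_index (
  event_id       bigint primary key,   -- points back at the row in t_event
  aggregate_type varchar(100) not null,
  user_id        varchar(50) null
)
create index ix_event_index_type on t_event_index (aggregate_type, event_id)

-- an async job reads forward from a checkpoint, extracts the metadata from the
-- serialized payload in application code and inserts rows here; because the
-- events never change, this index can be rebuilt or extended at any time.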

Werner Clausen

18 Jan 2013, 06:02:33
to ddd...@googlegroups.com
Colin,
 
Thanks for your answer. Your solution here saves the aggregate_type in the event store itself. Someone suggested the same in another thread. Still, I'm struggling a bit with this solution, as it would mean querying metadata in the event store. Today it is "aggregate_type", tomorrow perhaps "userid" or something else... that could get messy.
 
On the 1-2 days of replay: that seems realistic. I'm not just talking about publishing, but also the read-side serialization (weird word) that must happen in order. In a busy environment we calculate around 1000 transactional messages/sec. That adds up to 28 hours of replay for 100 million events.
 
--
Werner

Colin Yates

18 Jan 2013, 06:26:53
to ddd...@googlegroups.com

But serialisation from events should be much, much simpler - literally streaming the events into a single aggregate which hydrates itself and nothing else, right? There are no transactions or business logic involved. Maybe we are discussing different things.

Regarding metadata, maybe this is flippant, but sure, why not? If it is 1-to-1 with the event data, then what does a separate table buy you? Simplest thing possible, right :)
