Importing historic data into an Event Store

iProgrammer

Sep 7, 2010, 9:53:31 AM
to DDD/CQRS
Hi

I've been stealthily following the group for a short while, with
particular interest in the BI/analytical capabilities promised by the
event-sourced architecture, and have a question that hopefully does
not display my ignorance too much.

What are you guys doing when importing historic data from an old
system into your shiny new domain models and event stores, e.g. a
client's existing customer base?

Presumably, it's incredibly difficult to convert existing (relational)
data to its aggregate counterpart by retrospectively representing it
as a stream of domain events/commands. Yet if we cannot do this, how do
we build up the event store?

My first inclination is to project the historic data to a snapshot. I
then guess some kind of 'fake event' needs to flow to the read side to
trigger the handlers that build the appropriate views?

Are people building these processes as part of the domain model (I'm
thinking ongoing integration with other systems to import data)
through commands, and/or as an external streamlined bulk operation
(for initial data load)?

It doesn't feel right for me to decorate the domain with such
infrastructure issues, yet it seems to make commercial sense to try
and reuse the validation pipeline and handlers where practical to
ensure I don't pollute the system with invalid aggregates.

Any thoughts/advice would be greatly appreciated.

Thanks

Richard

Greg Young

Sep 7, 2010, 10:03:10 AM
to ddd...@googlegroups.com
I tend to go aggregate by aggregate through a brown field system.

First, make an actual aggregate (likely none exist yet). Make it using
events, but have handlers write back to the database.

As part of this I will snapshot the aggregate (assuming I can't build
up a history for it), likely with an originating "create" event.

I will save all future events into a log.


At this point I have all the benefits of event sourcing (I have an
event log) but still have the relational database.
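
A minimal sketch of that sequence, assuming a single legacy customer row and with a plain dict standing in for the relational table. The names `Customer`, `import_legacy_row` and `write_back` are hypothetical and only illustrate the idea: one originating "create" event per imported row, every later change appended to the log, and a handler that keeps the old table current.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    aggregate_id: str
    name: str
    data: dict


@dataclass
class Customer:
    """Aggregate whose state is only ever changed by applying events."""
    id: str = ""
    email: str = ""
    changes: list = field(default_factory=list)

    def apply(self, event: Event) -> None:
        if event.name == "CustomerCreated":
            self.id = event.aggregate_id
        self.email = event.data.get("email", self.email)

    def raise_event(self, name: str, data: dict) -> None:
        event = Event(self.id, name, data)
        self.apply(event)
        self.changes.append(event)

    def change_email(self, email: str) -> None:
        self.raise_event("CustomerEmailChanged", {"email": email})


def import_legacy_row(row: dict) -> Event:
    # No history is available, so the current row becomes one originating event.
    return Event(str(row["id"]), "CustomerCreated", {"email": row["email"]})


def write_back(table: dict, event: Event) -> None:
    # Handler that keeps the old relational table (a dict here) up to date.
    table.setdefault(event.aggregate_id, {}).update(event.data)


legacy_table = {"42": {"email": "old@example.com"}}
event_log = [import_legacy_row({"id": 42, "email": "old@example.com"})]

# Rebuild the aggregate from the log, then handle a new change.
customer = Customer()
for past in event_log:
    customer.apply(past)

customer.change_email("new@example.com")
for new_event in customer.changes:
    event_log.append(new_event)
    write_back(legacy_table, new_event)
```

The relational table and the event log stay in step, so dropping the table later becomes a cost decision rather than a migration project.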

I will at this point talk about the cost of keeping the relational
database and what benefits it provides as often it is none :) but the
reasons for getting rid of it are usually cost based.

Make sense?

Greg

--
The grammar and syntax errors were included to make sure I have
your attention

Rinat Abdullin

Sep 7, 2010, 10:13:58 AM
to ddd...@googlegroups.com
Hi Richard,

That's a problem similar to the one I'm currently facing in one spike.

The core project is CQRS/DDD/ES and aims to provide integration with an existing cloud service (an online task planner).

Data originates in the external source, from which it comes in the form of incremental updates (the first update being the current snapshot of the entire system). More than that, we need to ship our changes back in the form of incremental updates as well.

The AR is rather simple and works well on its own, processing commands and sending out events in response, etc. Yet when we need to accept incoming "snapshot updates" and produce "snapshot updates" to update the external system with our changes, it all gets messy (just like when integrating with any system that does not track changes).

I'm currently thinking about adding two methods to the AR to handle the integration scenario:

1. DecomposeIncomingIncrementalSnapshot(snapshot1, snapshot2) - AR should know how to derive deltas from two states, applying them afterwards.
2. ComposeOutgoingIncrementalSnapshot(snapshot)
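
One possible shape for the first method, written as a free function rather than an AR method. This is only a sketch, assuming each snapshot is a dict keyed by task id whose values are dicts of fields; the delta names (`TaskAdded`, `TaskRemoved`, `TaskChanged`) are hypothetical.

```python
def decompose_snapshots(previous: dict, current: dict) -> list:
    """Derive deltas between two snapshots keyed by task id."""
    deltas = []
    # Records that only exist in the newer snapshot were added.
    for task_id in sorted(current.keys() - previous.keys()):
        deltas.append(("TaskAdded", task_id, current[task_id]))
    # Records that disappeared were removed.
    for task_id in sorted(previous.keys() - current.keys()):
        deltas.append(("TaskRemoved", task_id, {}))
    # Records present in both are compared field by field.
    for task_id in sorted(previous.keys() & current.keys()):
        changed = {
            name: value
            for name, value in current[task_id].items()
            if previous[task_id].get(name) != value
        }
        if changed:
            deltas.append(("TaskChanged", task_id, changed))
    return deltas


snapshot1 = {"t1": {"title": "Write spec", "done": False}, "t2": {"title": "Old"}}
snapshot2 = {"t1": {"title": "Write spec", "done": True}, "t3": {"title": "New"}}
deltas = decompose_snapshots(snapshot1, snapshot2)
```

The sketch does not report fields that were dropped from a record, and it says nothing about intent; both would need domain knowledge.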

It feels better to have these methods outside of the main AR, but I just can't figure out how.

Best regards,
Rinat Abdullin

Technology Leader at Lokad.com | Writer at Abdullin.com

iProgrammer

Sep 7, 2010, 10:33:58 AM
to DDD/CQRS
Hi Greg

Thanks for the quick response.

There isn't a relational store to worry about in the new solution,
it's purely the expectation that new tenants have their historic
information stored this way and we need to consider our ETL strategy
to generate the initial aggregate sets.

If I understand you correctly, you tend to find you can build your
aggregates from the existing data by generating 'historic' events
direct into the aggregate, not by issuing commands to the domain? Am
I safe to read into this that you find your domain model naturally
exposes events you can use and is not 'polluted' with otherwise
unnecessary 'imported' events? If so, that's encouraging to note.

As a quick aside ... how do you prefer to store your aggregate
snapshots?

Thanks

Richard

Nuno Lopes

Sep 7, 2010, 10:43:31 AM
to ddd...@googlegroups.com
> It feels better to have these methods outside of the main AR, but I just can't figure out how.


Domain Service.

Nuno

JAmes Atwill

Sep 7, 2010, 11:52:30 AM
to ddd...@googlegroups.com
On Tue, Sep 7, 2010 at 7:13 AM, Rinat Abdullin <rinat.a...@gmail.com> wrote:
> Hi Richard,
> That's the problem similar to what I'm currently facing in one spike.

FWIW, I'm working through something quite similar too right now.

Our external system uses OpenJPA and I've added a listener to the
persistence layer; from there I'm able to query the Entity about which
fields are dirty. I then collect the class name, PK and dirty fields
and commit an AnnounceChangesCommand() into my command bus within the
scope of the other system's transaction.

From there I have a command handler which can map classname+PK to
AR+UUID in my system, and depending on which fields have changed, I
generate appropriate events and commit them onto my event bus.

The events populate my read-side and build my event store.
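
A rough sketch of what such a command handler might look like, leaving the OpenJPA listener itself out. Everything here is an assumption made for illustration: the `FIELD_EVENTS` table, the event names, and the in-memory `id_map` that stands in for the classname+PK to AR+UUID mapping.

```python
import uuid

# Hypothetical mapping from dirty fields of a legacy entity to domain event names.
FIELD_EVENTS = {
    ("Customer", "street"): "CustomerAddressChanged",
    ("Customer", "city"): "CustomerAddressChanged",
    ("Customer", "email"): "CustomerEmailChanged",
}

id_map = {}  # (class name, legacy PK) -> aggregate UUID


def aggregate_id_for(class_name: str, pk: int) -> uuid.UUID:
    """Return the UUID for a legacy record, allocating one on first sight."""
    key = (class_name, pk)
    if key not in id_map:
        id_map[key] = uuid.uuid4()
    return id_map[key]


def handle_announce_changes(class_name: str, pk: int, dirty: dict) -> list:
    """Turn one change announcement into (aggregate id, event name, data) tuples."""
    aggregate_id = aggregate_id_for(class_name, pk)
    grouped = {}
    for field_name, value in dirty.items():
        event_name = FIELD_EVENTS.get((class_name, field_name))
        if event_name is None:
            continue  # this field is of no interest to the new system
        # Fields that map to the same event are merged into one event.
        grouped.setdefault(event_name, {})[field_name] = value
    return [(aggregate_id, name, data) for name, data in grouped.items()]


events = handle_announce_changes(
    "Customer", 7, {"street": "1 Main St", "city": "Leeds", "notes": "n/a"}
)
```

Here the two address fields collapse into a single event and the unmapped `notes` field is ignored.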

JAmes

seagile

Sep 7, 2010, 12:37:06 PM
to DDD/CQRS
For this particular issue (initial data load), I've gone through the
trouble of having explicit "MigrateXXX" commands on my aggregate roots
which emit "MigratedXXX" events. This way I can distinguish between
"something that was migrated into the system" and "something that was
created in the system". I'll admit it's a lot of work, but it also has
its merits.
The most notable problem is that "old" systems either didn't track any
changes or they only tracked changes for entities "for which it made
sense". Which leaves you with the task of reverse engineering what the
intent was, if at all possible - might also be a waste of time. So if
there is no sequence of changes, then basically you have - as others
have mentioned - a "snapshot" within a special context (migration in
my case). The act of modelling this "migration process" as a set of
commands/events instead of the more technical snapshot, makes it
explicit and in line with "other" ways of creating aggregates.
Additionally, there's no need for fake events. As the system evolves
and all customers have moved to the new system, the commands can
eventually be removed.
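
To make the distinction concrete, here is a small sketch with one hypothetical aggregate. The class and field names (`MigrateCustomer`, `CustomerMigrated`, `legacy_id`) are invented for the example; the point is only that a migrated customer and a created customer leave different events behind while sharing the same validation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrateCustomer:      # command: "bring this legacy record in"
    customer_id: str
    name: str
    legacy_id: int


@dataclass(frozen=True)
class CustomerMigrated:     # event: "this customer came from the old system"
    customer_id: str
    name: str
    legacy_id: int


@dataclass(frozen=True)
class CustomerCreated:      # event: "this customer was created in the new system"
    customer_id: str
    name: str


class Customer:
    def __init__(self):
        self.name = None
        self.migrated = False

    @staticmethod
    def migrate(command: MigrateCustomer) -> CustomerMigrated:
        # The usual validation applies, so invalid legacy data is rejected here.
        if not command.name.strip():
            raise ValueError("customer name must not be empty")
        return CustomerMigrated(command.customer_id, command.name, command.legacy_id)

    def apply(self, event) -> None:
        self.name = event.name
        self.migrated = isinstance(event, CustomerMigrated)


event = Customer.migrate(MigrateCustomer("c-1", "Acme Ltd", legacy_id=1001))
customer = Customer()
customer.apply(event)
```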
This whole "migration process" is also very instructive: the *nitty
gritty* details of the old system surface just by looking at and
reasoning about the data (possibly providing deeper insight into the
process the new product is supposed to support). Of course not
everybody has this luxury called "time and money".
Before migrating/importing things, careful analysis of what one is
going to do with that old data is of the utmost importance. What's the
cost of keeping the old system around for read-only purposes? Maybe
only a crucial subset of that data needs to be migrated, or maybe you
can build a view on that "old" relational data, serving as input or a
read-only view? I think these options should be discussed with
management before taking a blind leap of faith into the void that is
"let's import everything and turn off the old system".
If you have justified reasons to do the "import", then there are
several ways to go about keeping downtime to a minimum. For example,
you could migrate the more recent data first (I hope there's at least
a timestamp in the old model), working your way to the beginning of
time. Most businesses have both an active and inactive set of data in
their systems anyway. Things that happened in the past are for reading/
verifying, usually not for changing.

You're welcome,
Yves.

Roy

Sep 8, 2010, 9:38:34 AM
to DDD/CQRS
It's too much trouble to reverse engineer what the intent might have
been from the old system. You could create snapshots from the legacy
system, but it is not needed. The go-live date of your new system will
be documented. The fact that your old system didn't explicitly capture
meaningful events will be documented. That being the case, you don't have
to go crazy thinking about how to represent all the (non-existent)
legacy events.

Simply move forward with capturing new events and take snapshots
according to your needs. If the old system used numeric identifiers
instead of GUIDs, create a GUID field for those entities and
populate it. If you need both IDs, fine. If not, fine.
Unfortunately, some business requirements force us to keep the numeric
identifier. I label those as surrogate keys, but rely on the GUID as
the core entity identifier.

When it comes to processing commands for an aggregate root that's
present in the reporting schema/database, but doesn't exist in the new
event store, just generate events using the supplied GUID versus
throwing a NotFoundException.
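
A sketch of that last point, under the assumption that the reporting database already holds the legacy entity under its new GUID (with the old numeric id kept as `legacy_id`). The `EventStore` class and the handler are hypothetical and kept in memory for brevity.

```python
import uuid


class EventStore:
    """Tiny in-memory stand-in for an event store."""

    def __init__(self):
        self.streams = {}

    def load(self, aggregate_id):
        return list(self.streams.get(aggregate_id, []))

    def append(self, aggregate_id, event):
        self.streams.setdefault(aggregate_id, []).append(event)


def handle_rename_customer(store, reporting_db, customer_id, new_name):
    history = store.load(customer_id)
    if not history and customer_id not in reporting_db:
        # Unknown everywhere: this really is a missing aggregate.
        raise KeyError(customer_id)
    # A legacy customer has no stream yet, so the first command simply starts one.
    store.append(customer_id, ("CustomerRenamed", {"name": new_name}))


customer_id = uuid.uuid4()
reporting_db = {customer_id: {"legacy_id": 1001, "name": "Acme"}}
store = EventStore()
handle_rename_customer(store, reporting_db, customer_id, "Acme Ltd")
```

The stream for a legacy entity therefore begins with its first post-go-live event rather than with any reconstructed history.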

Rinat Abdullin

Sep 8, 2010, 1:37:48 PM
to ddd...@googlegroups.com
Dear Community,

// I'm sorry if this hijacks thread.

What if we are not evolving a single system, but need to continuously
integrate 2 distinct systems A and B (which tends to happen
embarrassingly frequently these days)?

System A is under our full control and features full event sourcing.
System B does not have event sourcing and is outside of our control.
At most we can query it for "what records were
created/updated/deleted" since time X.

Let's say B is some proprietary CRM that the sales department just loves
using. Whenever they create a new lead, we need to bring it into
system A.
If a manager or an autopilot modifies this contact within system A, the
changes should be synced back to B.

Now, if we assume that deciphering user intent from B's CRUD changes
(by diffing) and turning it into events in A is not a problem, how
would you approach the integration?

Logically, if we treat A as the single source of truth (which it
should be, since it has the event store which is more complete than
state DB), then B should be treated just like some remote UI client,
that sends commands (which are deciphered from the CRUD changes and
could be rejected) and gets synced back by overwriting its state with
the state built from the events.

Am I even remotely close to the right track from the logical
perspective? How would you approach the situation?

Best regards,
Rinat

Roy

Sep 9, 2010, 1:32:44 AM
to DDD/CQRS
It should be a matter of using message passing protocols. System A
changes state, so it passes the message/event to System B. System B
changes state, so it passes the message/event to System A. Similar to
a SOA. However you want to pass these messages is up to you. If
System B isn't under your control, then you will need to tell the
controller what you need. Perhaps all that's needed is for System B to
create an XML event storage or event queue that can be accessed
somehow by System A.

Rinat Abdullin

Sep 9, 2010, 1:54:40 AM
to ddd...@googlegroups.com
Roy,

Let's say we are talking about a situation where it is impossible
(infeasible due to political reasons or resource constraints of
the project) to be messing with the internals of B (e.g. too complex,
closed source and black-box behavior, different platform, CTO does not
want to allocate resources to dealing with B internals, etc.).

All we can do is to poll it for diff changes and push CRUD commands
back. So, essentially, there is no clear way to message from B to A.
We can only query for changes and write our changes back. Any conflict
resolution should, obviously, happen in A.

Any ideas?

Best regards,
Rinat

Nuno Lopes

Sep 9, 2010, 2:58:41 AM
to ddd...@googlegroups.com
Anti corruption layer.


Rinat Abdullin

Sep 9, 2010, 3:10:29 AM
to ddd...@googlegroups.com
Yes, that's probably the name.

But how do you keep two entities synchronized, if one of them is behind the "anti-corruption" layer?

Should changes coming from entity-b (deciphered by the "anti-corruption" layer by diffing the states) be treated as events to be applied to entity-a (they already happened) or as commands to the entity-b (giving the chance to cancel, but how do you cancel the past)?

How do you assemble changes for the entity-b? Should "anti-corruption" layer subscribe to all events, filter out ones that it produced and use the rest to figure out CRUD deltas for the last known state of entity-b? Or do something else?

Or am I just going completely off-track here? 

Best regards,
Rinat 

Scott Reynolds

Sep 9, 2010, 3:35:37 AM
to ddd...@googlegroups.com
It doesn't really make sense to me trying to fit CQRS in. I mean, System A is done and won't be changing. All you are really adding is an audit trail. You can't get intent (although you can guess).

This isn't a brownfield app, as brownfield would assume you are changing A. I'm not sure what value you're trying to add, unless it's just that you want to be able to add new functionality via System B, and in that case you're only implementing behaviours for internals (no editing).

Nuno Lopes

Sep 9, 2010, 4:18:21 AM
to ddd...@googlegroups.com
Rinat,

We face the same challenge as you here all the time. The company I work for decided to offload some non-core business processes to Sales Force. I don't know the precise details of how they are doing it (different project), but I'll try to have a look.

Last time I heard, they do periodic bulk updates from system A to system B through some kind of DTS. A mess IMHO, but again not my decision. Sales Force, for instance, doesn't publish events.

Whatever the solution is, we need to dive into understanding the business processes that system A and system B automate and the business concepts they use. By understanding the processes we can capture how the business clock works in each domain (domain time). Only then can we understand how the domain clocks of systems A and B can be synchronized, and whether that is viable.

I think if you are striving for an automated generic solution, you will fail and the integration processes will not scale well.

> How do you assemble changes for the entity-b? Should "anti-corruption" layer subscribe to all events, filter out ones that it produced and use the rest to figure out CRUD deltas for the last known state of entity-b? Or do something else?


"Ask the domain, Luke"

Hope it helps,

Nuno
PS: Sorry for being so abstract. Maybe you can share more detail about the domains that A and B automate.

Rinat Abdullin

Sep 9, 2010, 4:49:27 AM
to ddd...@googlegroups.com
Scott. I'm not trying to fit CQRS in. It's already in the system A.
I'm just trying to figure out how to efficiently and logically keep A
integrated and synced with various external systems that are out of
our control (but happen to generate immense value through the
integration itself). "System B" will be the source of truth for certain
changes and "System A" will be the source of truth for some other changes
upon the same entities. The problem now becomes how to keep this truth
consistent and available in both systems.

Just trying to get my head around and formalize approaches.

Rinat Abdullin

Sep 9, 2010, 5:07:20 AM
to ddd...@googlegroups.com
Nuno,

I'm not trying to find a generic automated solution. Such thing is
unfortunately impossible (or prohibitively expensive) and I have been
bitten hard enough in the past to learn the lesson))

I'm just trying to arrive at a methodology that describes the logic
by which these changes should be derived and handled. Once you have the
logic stable, architectures and implementations are simple to build
around it, tailored to each specific case.

So, for example, although there is a logic (rather a set of additional
constraints to follow) for building almost-infinitely scalable
solutions with CQRS/DDD/ES, or handling versioning and concurrent
changes, I can't get the logic for integrating state-based and event
based solutions. It's as if there were absolutely no generic
rules/guidelines to follow and you had to learn the domain to an end
and then come up with completely custom solution (without any
similarities between solutions in every case).

Examples of CQRS-A and External-B:

B: Web-based CRM used by the Sales department for historical reasons
A: Digital Nervous System of the company, that handles customers,
billing and subscriptions.

We need to detect changes in B (i.e.: changing customer details,
adding additional contacts to the company) and propagate into A,
applying all changes in the reverse direction as well.

Another one:
B: Closed source task management solution with a certain and unique value
B: Task management solution on another platform with another certain
and unique value
A: Integration platform, that detects changes in these systems, and
syncs them, also keeping track of the history, adding additional
notifications etc.

In each case I'm assuming that the challenge of figuring out the
answer to "Given state 1 and state 2, what were the business changes?"
is not the problem (or rather it is a problem that could and would be
solved by the domain).

I'm sorry if this sounds confusing; just can't seem to find better words.

Best regards,
Rinat

Nuno Lopes

Sep 9, 2010, 5:37:12 AM
to ddd...@googlegroups.com
Hi,

> I can't get the logic for integrating state-based and event
> based solutions.

I think you can, through Aggregates.

What you lose in passing from one to the other is the domain events that provoke the state changes of either A (only preserves the latest state) or B (preserves all state changes).

So if we want to sync between both systems, all we can do is compare the lowest common denominator, which is the latest state of an Aggregate (snapshot).

So we need to stop thinking of events and look at the latest state as usual. The Aggregate state.

> It's as if there were absolutely no generic
> rules/guidelines to follow and you had to learn the domain to an end
> and then come up with completely custom solution (without any
> similarities between solutions in every case).


Within this, it is of advantage to consider the latest state as a value object that can be transacted between A and B.

But for that you need to find what is in B that is of value to A and vice versa.

For that matter I would advise studying what is of value to A in B. And what is of value to B in A. Within the scope of each business feature of A then B.

The result is the information that needs to be compared and kept in sync, and when.

Architecturally the synch rules are then implemented in the Anti-corruption Layer. You can use scripts if events cannot be passed on from one to another. Or if they can, probably you can use Sagas :)

Does this help?

Nuno

Nuno Lopes

Sep 9, 2010, 5:48:07 AM
to ddd...@googlegroups.com
A last remark.

If the term "business feature" does not ring many technical bells, what about "slice" (as in an SOA slice)?

Nuno

Roy

Sep 9, 2010, 11:10:59 AM
to DDD/CQRS
So it's a sucky situation. Since System A uses event sourcing you
definitely need to capture state changes to aggregates using that
architecture. System B, however, does not produce events and you have
no control over it (for whatever reason) yet you can poll it and send
commands to it. Integration seems to become a problem for System A
because it captures state changes through events; all for the value of
capturing the intent of the change.

There is no way System A can figure out the intentions of System B
(i.e., Sales System). It appears that the aggregates in System A's
domain model may have to look slightly ugly. Along with meaningful
state-change methods on the aggregates, you will need to implement
bland methods for changing properties. In other words, you would have
a meaningful event like CustomerMovedEvent and a non-meaningful event
like ChangedAddressEvent.
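
As a sketch of that trade-off, here is a hypothetical `Customer` aggregate exposing both kinds of method. The method names and the single-string address are assumptions for the example; only the two event names come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerMovedEvent:       # captures intent: the customer relocated
    customer_id: str
    new_address: str


@dataclass(frozen=True)
class ChangedAddressEvent:      # intent unknown: the field simply differs
    customer_id: str
    new_address: str


class Customer:
    def __init__(self, customer_id, address):
        self.customer_id = customer_id
        self.address = address

    def move(self, new_address):
        """Meaningful method, called from System A's own UI."""
        return self._record(CustomerMovedEvent(self.customer_id, new_address))

    def change_address(self, new_address):
        """Bland method, called when a diff from System B shows a new address."""
        return self._record(ChangedAddressEvent(self.customer_id, new_address))

    def _record(self, event):
        # Both events change the same state; they differ only in what they say.
        self.address = event.new_address
        return event


customer = Customer("c-1", "1 Old Road")
from_ui = customer.move("2 New Street")
from_sync = customer.change_address("3 Other Lane")
```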

How does that feel?

Rinat Abdullin

Sep 9, 2010, 11:36:48 AM
to ddd...@googlegroups.com
Roy,

Yes, System A in all these situations captures events for the purpose
of added business value as well as the scalability and collaboration.

// NB: figuring intentions out of diffs is never an easy task, but
it's something doable (with a certain degree of precision).

So I'm ending with something like AntiCorruption Layer + Domain
Service, as Nuno has kindly mentioned earlier.

Basically this IntegrationProcess is a process (overgrown saga) that
keeps the last known state of the system B and is capable of figuring
out "what has changed" and sending this into system A as commands,
that make sense in our CQRS/ES/DDD world. All id mappings and logical
conversions happen within this process and stay there.

Sync happens on a schedule.
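
The inbound half might look roughly like the sketch below. It assumes System B can be polled for its current records as a dict keyed by B's own ids; the command names (`RegisterContact`, `CorrectContact`) and the `send_command` callback are hypothetical, and deletions are left out.

```python
import uuid


class IntegrationProcess:
    """Remembers the last known state of System B and turns changes into commands."""

    def __init__(self, send_command):
        self.send_command = send_command   # callable taking (name, payload)
        self.last_known = {}               # B record id -> record fields
        self.id_map = {}                   # B record id -> aggregate id in A

    def sync(self, current: dict) -> None:
        for b_id, record in current.items():
            previous = self.last_known.get(b_id)
            if previous is None:
                # First sighting: allocate an id in A and register the record.
                self.id_map[b_id] = str(uuid.uuid4())
                self.send_command("RegisterContact", {"id": self.id_map[b_id], **record})
            elif previous != record:
                # Known record: send only the fields that differ.
                changed = {k: v for k, v in record.items() if previous.get(k) != v}
                self.send_command("CorrectContact", {"id": self.id_map[b_id], **changed})
        # Keep a copy so later mutations of the input do not affect the baseline.
        self.last_known = {b_id: dict(record) for b_id, record in current.items()}


sent = []
process = IntegrationProcess(lambda name, payload: sent.append((name, payload)))
process.sync({"b-1": {"name": "Ann", "phone": "111"}})   # first scheduled run
process.sync({"b-1": {"name": "Ann", "phone": "222"}})   # next scheduled run
```

Because these are commands, System A is still free to reject them; what to do with B's state in that case is left open here.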

Getting data from B to A is the easy part. Getting our changes back to
B is still where it is fuzzy.

Although this specific project is not a real one (it does not have any
specific goals aside from practicing CQRS/DDD/ES, integration with the
legacy systems and scalability in the cloud), we did some
brainstorming today. So for getting the data out to B, it seems the
approach would be to let IntegrationProcess track all domain events
that originated within the system A (these events will be mixed with
the domain events originated from the changes that came in through the
same sync process). Then, based on the changes it will build snapshots
and compose diffs between two syncs (the "magic happens here" part
that is still unclear). These diffs will be sent back to the System B
(we can't send commands, we can just send these diffs, similar to the
ones we've got).
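
One guess at the outbound half, purely as a sketch. It assumes every event carries an `origin` tag and the id of the matching record in B, and that the events since the last sync can be folded into a per-record field diff; none of this is settled.

```python
def outgoing_diff(events: list) -> dict:
    """Collapse System A's own events since the last sync into per-record field changes."""
    diff = {}
    for event in events:
        if event["origin"] == "sync":
            continue  # came from System B in the first place; do not echo it back
        # Later changes to the same field overwrite earlier ones.
        diff.setdefault(event["b_id"], {}).update(event["changes"])
    return diff


events_since_last_sync = [
    {"origin": "sync", "b_id": "b-1", "changes": {"phone": "222"}},
    {"origin": "ui", "b_id": "b-1", "changes": {"email": "ann@example.com"}},
    {"origin": "ui", "b_id": "b-2", "changes": {"name": "Bob"}},
    {"origin": "ui", "b_id": "b-2", "changes": {"name": "Robert"}},
]
diff = outgoing_diff(events_since_last_sync)
```

Conflicts, where both systems changed the same field between two syncs, are not handled by this sketch.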

We'll see if the approach justifies its right to live after being
applied to the working system.

Best regards,
Rinat

Roy

Sep 9, 2010, 12:10:25 PM
to DDD/CQRS
Rinat,

Keep us posted. Let us know how things are behaving once it's been
running for a while.

Good luck
