Saga design in the face of failure

@seagile

Dec 23, 2010, 4:26:24 PM

to DDD/CQRS

After discussing this topic in the Twitterverse - with excellent people
such as @thinkb4coding, @craigcav, and @abdullin - I decided to repeat
it here, so everybody can have a say in it.

This is a reprise of the saga mentioned in the "Saga design questions"
topic:
The end user would like to book an appointment (on a timeslot) in the
schedule of a cardiology surgeon, the schedule of an anesthetist, the
schedule of a traction table and the schedule of an operating room.
Because he'd be affecting multiple schedules/time slots (aggregates)
the work needs to be decomposed: first reserve the time slot in each
schedule, and if that works out, continue with the actual booking.
The message flow would be something like this:
1. The client sends a RequestBookAppointmentCommand, providing a time
slot in each schedule (picked from a search result by the end user).
2. The command handler "commands" the appointment (or appointment
request) aggregate. It produces a BookAppointmentRequestedEvent.
3. The BookAppointmentSaga intercepts the
BookAppointmentRequestedEvent.
4. The saga issues a command ReserveTimeslotCommand for each
schedule's time slot.
5. The command handler "commands" the time slot aggregate to become
reserved. It produces a TimeslotReservedEvent.
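
Steps 3 and 4 above can be sketched roughly as follows. This is a minimal Python illustration, not the poster's code: only the message and saga names come from the post, while the field names (`appointment_id`, `schedule_id`, `slot_id`, `slots`), the `status` dictionary and the `send` callback are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class BookAppointmentRequestedEvent:
    appointment_id: str
    # One (schedule_id, slot_id) pair per schedule involved (assumed shape).
    slots: Tuple[Tuple[str, str], ...]


@dataclass(frozen=True)
class ReserveTimeslotCommand:
    appointment_id: str
    schedule_id: str
    slot_id: str


@dataclass
class BookAppointmentSaga:
    # `send` stands in for whatever bus dispatches commands (hypothetical).
    send: Callable[[ReserveTimeslotCommand], None]
    # slot_id -> "pending"; later steps would move this to reserved/failed.
    status: Dict[str, str] = field(default_factory=dict)

    def on_book_appointment_requested(self, event: BookAppointmentRequestedEvent) -> None:
        # Step 4: one ReserveTimeslotCommand per schedule's time slot.
        for schedule_id, slot_id in event.slots:
            self.status[slot_id] = "pending"
            self.send(ReserveTimeslotCommand(event.appointment_id, schedule_id, slot_id))


# Example: four schedules are involved, so four commands go out.
sent = []
saga = BookAppointmentSaga(send=sent.append)
saga.on_book_appointment_requested(
    BookAppointmentRequestedEvent(
        "appt-1",
        (
            ("surgeon", "surgeon-0900"),
            ("anesthetist", "anesthetist-0900"),
            ("traction-table", "table-0900"),
            ("operating-room", "room-0900"),
        ),
    )
)
```

The saga keeps one "pending" entry per slot so that it can later tell which reservations still have to be confirmed or compensated.
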
Let's stop for a moment. Suppose I'm not the only one interested in
that time slot of the traction table, and somebody beat me to it. This
could be seen as a concurrency exception, but in reality it's
innate to the business. Time slots are a precious resource and from
time to time somebody will beat you to it, even given our best effort
not to offer up time slots that have already been taken, and even given
the fact that we check their availability before submitting the
command. When you take this up with the business, they usually respond
in one of two ways:
A. "The chance of that happening is very slim, the effort it would
take to fix that programmatically is just not worth it, let it happen,
and if you detect it, just send them an email saying we couldn't book
it."
B. "The chance of that happening is real, and if you detect it, you
should prevent it from ever happening."
You guessed it, I'm in the business of B.
So, here we are, the ReserveTimeslotCommand succeeded for the
cardiology surgeon and the anesthetist, but the ReserveTimeslotCommand
for the traction table fails (because the time slot aggregate enforces
that it can only be reserved once). Somehow we need to communicate the
failure back to the saga. We could choose not to communicate the
failure and let some timeout kick in on the saga side, but that's a
bit smelly. We require a TimeslotReservationFailedEvent (a more
specialized form of the CommandFailedEvent).

My question: Who produces that event? Why? Motivate :)

School of thought 1:
The command handler catches the SlotCouldNotBeReservedException thrown
by the Timeslot aggregate and publishes a
TimeslotReservationFailedEvent.

School of thought 2:
The command handler "commands" the Timeslot to Reserve(). If the
reservation fails, the Timeslot registers a
TimeslotReservationFailedEvent internally, which gets published after
persistence.
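
To make the two schools concrete, here is a small Python sketch of both. It is an illustration under assumptions, not anyone's production code: the event and exception names come from the post, but the class names `ThrowingTimeslot` and `RecordingTimeslot`, the `uncommitted` list, the `publish` callback and all field names are made up for the example, and persistence is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class SlotCouldNotBeReservedException(Exception):
    pass


@dataclass(frozen=True)
class TimeslotReservedEvent:
    slot_id: str
    appointment_id: str


@dataclass(frozen=True)
class TimeslotReservationFailedEvent:
    slot_id: str
    appointment_id: str


@dataclass
class ThrowingTimeslot:
    """School 1: the aggregate throws when the invariant would be broken."""
    slot_id: str
    reserved_by: Optional[str] = None
    uncommitted: List[object] = field(default_factory=list)

    def reserve(self, appointment_id: str) -> None:
        if self.reserved_by is not None:
            raise SlotCouldNotBeReservedException(self.slot_id)
        self.reserved_by = appointment_id
        self.uncommitted.append(TimeslotReservedEvent(self.slot_id, appointment_id))


@dataclass
class RecordingTimeslot:
    """School 2: the aggregate registers the failure as one of its own events."""
    slot_id: str
    reserved_by: Optional[str] = None
    uncommitted: List[object] = field(default_factory=list)

    def reserve(self, appointment_id: str) -> None:
        if self.reserved_by is not None:
            self.uncommitted.append(TimeslotReservationFailedEvent(self.slot_id, appointment_id))
            return
        self.reserved_by = appointment_id
        self.uncommitted.append(TimeslotReservedEvent(self.slot_id, appointment_id))


def flush(slot, publish: Callable[[object], None]) -> None:
    # Stand-in for "save the aggregate, then publish its events".
    for event in slot.uncommitted:
        publish(event)
    slot.uncommitted.clear()


def handle_school_1(slot: ThrowingTimeslot, appointment_id: str, publish) -> None:
    try:
        slot.reserve(appointment_id)
    except SlotCouldNotBeReservedException:
        # The handler, not the aggregate, turns the failure into a message.
        publish(TimeslotReservationFailedEvent(slot.slot_id, appointment_id))
        return
    flush(slot, publish)


def handle_school_2(slot: RecordingTimeslot, appointment_id: str, publish) -> None:
    slot.reserve(appointment_id)
    flush(slot, publish)


published_1, published_2 = [], []
slot_1, slot_2 = ThrowingTimeslot("table-0900"), RecordingTimeslot("table-0900")
for appointment in ("appt-A", "appt-B"):
    handle_school_1(slot_1, appointment, published_1.append)
    handle_school_2(slot_2, appointment, published_2.append)
```

In this sketch the saga sees the same two messages either way. The difference is who decides to emit the failure, and whether it ends up in the aggregate's own event stream.
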

stacy

Dec 23, 2010, 5:46:34 PM

to DDD/CQRS

@seagile

I think there is a better approach. Basically this is a "group
reservation" problem. Udi presented a brilliant approach which you
might find helpful.

The user really doesn't want to click through all those schedules to
arrive at a cohesive set of times for a "procedure." It's troublesome
for them, and for you.

Therefore, let the UI ask ONLY for a procedure and a daypart (say
mornings, or afternoons, etc.), click. Next screen: "Please wait while I
find the best available times for you ...". Run code and reserve what
you can. Have JavaScript poll the read model by Id for results. Then
display the reserved time slots for all the schedules. Ask the user to
confirm this set of schedules, then run your booking commands.

Here are Udi's slides and video that explain this approach:
http://skillsmatter.com/podcast/open-source-dot-net/udi-dahan-command-query-responsibility-segregation/zx-489

The group reservation idea begins on slide 39.

Hope this helps,

Stacy

Think Before Coding

Dec 23, 2010, 5:51:58 PM

to DDD/CQRS

I would go for option 2...

The main reason is that you could convert this failure into a success
later, given time. Sometimes appointments are cancelled, so the
TimeslotReservationFailedEvent could just tell the saga that it's not
OK for now. If the time slot is far in the future, you can wait some
time to see whether the previous reservation gets cancelled. If it
doesn't, go back to the user and tell him you could not handle his
request. But if the previous reservation is cancelled, your user won't
even notice he was on a waiting list.

Of course, you may not be able to apply this business solution in your
domain, but the fact that you can imagine it shows that it's not an
exception, just a standard event from the domain.

Anyway, there's no such thing as a 'business error':
http://thinkbeforecoding.com/post/2009/12/10/Business-Errors-are-Just-Ordinary-Events

jeremie / thinkbeforecoding

@seagile

Dec 24, 2010, 6:20:00 AM

to DDD/CQRS

Thanks, I'll look into it.

FYI, the end user only picks the slot for the main procedure; the
other slots are autoselected. The end user specifies search criteria
(procedure, preferred schedules, preferred sites, preferred days of
the week, as of a date) and gets a pageable stream of search results.
By reserving everything I can (the entire stream or at least a page),
I'd be doing a form of pessimistic locking (he could be looking at
that screen for a long time), which is not what the business is
looking for. But questioning when the reservation should be done is
indeed a good thing. Instead of a saga, I could issue a group of slot
reservation commands and wait for those to succeed (come to think of
it, that's probably what you meant, right?). Compensation would still
need to be part of the deal, though. The only thing that bothers me is
the fact that what used to be issuing one command to get the job done
now becomes a "procedure" by convention for each client ("client" in
the sense of consuming component) to follow. The plot also thickens
when you start thinking about "moving" an appointment, where new slots
have to be reserved and old slot reservations need to be cancelled.

On 23 dec, 23:46, stacy <stak...@gmail.com> wrote:
> @seagile
>
> I think there is a better approach. Basically this is a "group
> reservation" problem. Udi presented a brilliant approach which you
> might find helpful.

@seagile

Dec 24, 2010, 6:58:28 AM

to DDD/CQRS

Although the approach feels - in a sense - "weird" to me, there's the
added bonus that the event gets stored alongside the aggregate. What's
weird is that it's not really a state change, it's a violation of an
invariant I'd be generating an event for. I guess it's a psychological
thing I have to get over.

On 23 dec, 23:51, Think Before Coding <jeremie.chassa...@gmail.com>
wrote:

> I would go for option 2...

@seagile

Dec 25, 2010, 11:18:28 AM

to DDD/CQRS

After reading @thinkb4coding's article and letting everything sink in
(yeah, on Christmas Day) ... I am inclined to agree that the best
choice for a rejection of a request is having the aggregate generate
the rejection event. I'll go into some more detail. Why is the command
handler not the best choice to generate the event? It would be
making, or at least translating, a business decision into an event.
Add to that the fact that storing the event becomes a little weird
compared to how it usually happens (querying the aggregate for its
events, storing them into its stream).
So how come, at first, it feels weird? Probably because
enforcing invariants is usually associated with throwing an exception.
Generating an event for a broken invariant feels somewhat
dirty. The event communicates a form of command failure, not really a
state change. Nevertheless, something happened and we want to keep
track of it. Get over it. I don't hear anybody complaining when a
RenameCommand is handled by an aggregate that does nothing with the name
internally (not even storing it in a field).
So, should each and every command that violates an aggregate invariant
be modelled as an event? Honestly, I don't know. But for this
particular saga an explicit SlotReservationRejected/ApprovedEvent
seems to make the most sense. More generally, if a command can fail and
is used in conjunction with a saga you have to communicate back to, an
event makes the most sense. And because a record that something
happened is important in such a case, having the event generated by
the aggregate is the easiest/right way.

Jonathan Oliver

Dec 25, 2010, 10:35:28 PM

to DDD/CQRS

(I've cross-posted this to my blog because it's an interesting problem
that more and more people are starting to encounter and we've found a
great little solution a while back that we wanted to make people aware
of: http://jonathan-oliver.blogspot.com/2010/12/sagas-event-sourcing-and-failed.html)

We've run into almost this exact situation and considered a number of
ways to solve the problem.

To sum up the problem, it's that we need to communicate a failure or
rejection of a command/instruction by the domain back to a saga such
that the saga can take the appropriate, compensating action.

As previously mentioned in this thread, an aggregate must enforce its
invariants and is not allowed to enter an invalid state. This
presents somewhat of an issue because we want to communicate the
failure back to the saga but this failure cannot be communicated
*through* the domain.

One potential solution (although not the one that we use) was raised
during Greg's course when he was discussing sagas. He hinted at the
idea that a call from the saga to the domain could be made via RPC.
In essence the command message is sent synchronously via some kind of
RPC-like call or web service instead of asynchronously using a message
bus or message queue. In this way the failure can easily be
communicated back in the form of a fault that is understood by the
saga. Ultimately we went with something a little bit different
because we were exposing our domain behavior through message handlers
that listened to a message bus and we wanted to maintain asynchronous
messaging throughout our system.

Our solution was the following (and it's worked quite well):
1. The saga asynchronously dispatches a command message to the domain.
2. The message handler receives the command message, loads the
aggregate from the repository, and calls the appropriate method on the
aggregate.
3. The domain object checks its invariants and ultimately decides to
throw an intention-revealing, well-named, domain-specific exception.
4. The command handler catches the well-named exception. Because the
command handler knows the type of message received along with its
intention and the type of exception that was thrown and caught, it's
in a position to relay that exception in the form of a message back to
the saga via a bus.Reply(). But this time, it's not an event message,
which represents something that happened. Instead it’s a
message that describes something that didn’t happen. We didn’t have a
name for this kind of message but the terms notification and alert
kept coming back. Ultimately we decided to call these kinds of
messages alerts. These messages could also potentially be called
faults or something else, but they should be considered separate from
events.
5. The saga receives the alert/fault message and applies the
appropriate action.

The one question that still remains is that we lose the alert message
once it’s consumed by the saga, and the concern is that we want to keep
track of everything that has happened. I couldn’t agree more.
Keeping track of what’s happened is extremely important. But a
better question is: who’s responsible for keeping track of this
fault/alert? Should the domain keep track of something that didn’t happen
or something that it refused to do? Isn’t it the saga that cares
about the command and associated failure? Shouldn’t the saga (or its
infrastructure) be responsible for tracking all messages relative to
the saga?

In our implementation, we actually implement our sagas using event
sourcing. What this means is that all messages addressed to the saga
are replayed in order to re-build the state of the saga. In addition
(and as an auditing benefit), we use an event store to dispatch
outbound messages from the saga. This means that a saga is completely
autonomous and separate from the domain and is only coupled by the
message contracts. It also means that we can replay incoming messages
against a new implementation of a saga if necessary to come up with an
alternate model as our sagas and domain evolve to changing business
requirements.
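
As a rough illustration of the idea of an event-sourced saga, the Python sketch below rebuilds a saga purely by replaying the messages addressed to it, including the alert. It is not the implementation described in this post: the message class names, the `ReleaseTimeslot` compensating command and the in-memory `outbound` list are assumptions made for the example, and a real system would persist both the history and the outbound messages and would not re-send already dispatched messages on replay.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class TimeslotReserved:  # event published by the domain
    slot_id: str


@dataclass(frozen=True)
class TimeslotReservationFailed:  # "alert"/"fault" sent back by the command handler
    slot_id: str


@dataclass(frozen=True)
class ReleaseTimeslot:  # hypothetical compensating command
    slot_id: str


class BookAppointmentSaga:
    def __init__(self) -> None:
        self.reserved: Set[str] = set()
        self.failed = False
        self.outbound: List[ReleaseTimeslot] = []

    def apply(self, message: object) -> None:
        if isinstance(message, TimeslotReserved):
            if self.failed:
                # A success arriving after a failure is released straight away.
                self.outbound.append(ReleaseTimeslot(message.slot_id))
            else:
                self.reserved.add(message.slot_id)
        elif isinstance(message, TimeslotReservationFailed) and not self.failed:
            self.failed = True
            # Compensate: release every slot reserved so far.
            for slot_id in sorted(self.reserved):
                self.outbound.append(ReleaseTimeslot(slot_id))
            self.reserved.clear()

    @classmethod
    def replay(cls, history: Iterable[object]) -> "BookAppointmentSaga":
        # The saga's state is nothing more than the fold of its inbound messages.
        saga = cls()
        for message in history:
            saga.apply(message)
        return saga


history = [
    TimeslotReserved("surgeon-0900"),
    TimeslotReserved("anesthetist-0900"),
    TimeslotReservationFailed("table-0900"),
]
saga = BookAppointmentSaga.replay(history)
```

Because the alert is part of the saga's own message history, it is not lost even though the domain never stored it. Replaying the same history against a changed `apply` would give the alternate model mentioned above.
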

To sum things up:
1. When the domain refuses to do something, it doesn’t generate an
event. Instead, it throws an exception.
2. The command handler handles the exception and does a bus.Reply()
with an “alert” or “fault” message.
3. Because the message is not generated by the domain, but by the
layer just on top of the domain, we don’t track this alert/fault
message inside of the domain.
4. The saga receives the message and takes appropriate, compensating
action.
5. Because the saga is implemented using event sourcing, nothing is
ever lost. We have a complete business/audit history of what happened
and we can evolve our saga model and rebuild it with full confidence
in our message history.

Jonathan

@seagile

Dec 26, 2010, 5:11:23 AM

to DDD/CQRS

Nice write-up. I guess both schools of thought have their merits,
and proponents. The only place - in my head - the "track failure via
domain/aggregate generated event" falls apart is when a saga emits a
command that is supposed to "create" something. In such a case
tracking the failure to create something via an aggregate generated
event is plain weird.
Maybe the fallback to an aggregate generated event stems from the
notion that the failure message should be tracked, or the fact that an
aggregate is usually responsible for generating an event, or the
confusion that a command failure happened and should be seen as an
event, or the fact that a rejection is a business response ...
I'll try out your approach, because it's most in line with how I felt
about it originally (a "stick with your first idea" kind of thing).

On 26 dec, 04:35, Jonathan Oliver <jonathan.s.olive...@gmail.com>
wrote:

> (I've cross-posted this to my blog because it's an interesting problem
> that more and more people are starting to encounter and we've found a
> great little solution a while back that we wanted to make people aware

> of:http://jonathan-oliver.blogspot.com/2010/12/sagas-event-sourcing-and-...)


Daniel Yokomizo

Dec 26, 2010, 6:07:33 AM

to ddd...@googlegroups.com

On Sun, Dec 26, 2010 at 8:11 AM, @seagile <yves.r...@gmail.com> wrote:
> Nice write-up. I guess both schools of thought have have their merits,
> and proponents. The only place - in my head - the "track failure via
> domain/aggregate generated event" falls apart is when a saga emits a
> command that is supposed to "create" something. In such a case
> tracking the failure to create something via an aggregate generated
> event is plain weird.

We can always follow Udi's advice:
http://www.udidahan.com/2009/06/29/dont-create-aggregate-roots/

IME, not creating something because there's some error leads to bad
workflows: the user is stuck with the problem right now and
can't try again later, because otherwise the context and the information
they are trying to send will be lost.

Instead I think it's better to design these kinds of operations as
explicit requests/proposals and let an inconsistent/incorrect request
lie around until the user eventually corrects and submits it. In this
scenario, every event has an AR and we can even get value from these
partial requests (e.g. which is the most common error).

Best regards,
Daniel Yokomizo

@seagile

Dec 26, 2010, 8:55:38 AM

to DDD/CQRS

I agree with you on the explicit modelling as proposals/requests. But
I fail to see how that's going to help. I mean, the process step that
fails and needs to communicate its failure back to the saga is not the
request, it's one of the many time slot reservations, which have been
kicked off by the saga in the first place as it received the
BookAppointmentRequested event. As for the "creational" part, sure
having a "functional" parent gets you off the hook, as you can
register the event ON the "functional" parent.

The confusing thing about this whole discussion is that "events
interesting to the business" get mixed with "events generated by an
aggregate". I associate "events interesting to the business" with both
failure and success to complete a command, basically anything that
"happened" in the system regardless of invariants. While "events used
to rebuild aggregate state" are things that were allowed to happen
according to aggregate invariants, even if we don't track that state
internally. Not things that weren't allowed to happen. So you could
say I see one as a subset of the other.

Whichever approach is taken, we all tend to agree that failure should
be communicated (either as an event or a fault message) because
knowing the "failure" happened is interesting in many ways (steering
the saga, doing business analysis).

On 26 dec, 12:07, Daniel Yokomizo <daniel.yokom...@gmail.com> wrote:

Daniel Yokomizo

Dec 26, 2010, 10:12:43 AM

to ddd...@googlegroups.com

On Sun, Dec 26, 2010 at 11:55 AM, @seagile <yves.r...@gmail.com> wrote:
> I agree with you on the explicit modelling as proposals/requests. But
> I fail to see how that's going to help. I mean, the process step that
> fails and needs to communicate its failure back to the saga is not the
> request, it's one of the many time slot reservations, which have been
> kicked off by the saga in the first place as it received the
> BookAppointmentRequested event. As for the "creational" part, sure
> having a "functional" parent gets you off the hook, as you can
> register the event ON the "functional" parent.

Which aggregate reserves the time slot? Why are reservations handled
by a first-come, first-served strategy? Why leave conflict handling off
the table?

Suppose we have this call:

Schedule.ReserveTimeSlotForAppointment(AppointmentId, TimeSlot)

The Schedule AR (responsible to handle the time of a single
"resource") then checks if it's possible to create the
TimeSlotReservation. Maybe there's a conflict, but let's always create
a TimeSlotReservation. Either one of these is raised

TimeSlotSuccessfullyReserved
ConflictingTimeSlotReserveRequested

The first is straightforward, it creates a TimeSlotReservation. The
second is raised when there's a reservation conflict (i.e. the time
slot or part of it was already reserved, perhaps different events for
the entire or part?) but a TimeSlotRequest is created nonetheless.
Some TimeSlotRequests "become" TimeSlotReservations later.

Now what can we do? Possibly we can show a screen to a user with the
time slot conflicts, perhaps use a bin packing-like algorithm to make
a resolution proposal, or create a time slot reservation queue (e.g. if
an appointment is cancelled the second in the list automatically gets
the time slot).

The cancellation scenario can go like this:

TimeSlotReservation.Cancel() --> TimeSlotReservationCancelled

ScheduleConflictSaga (created to monitor
ConflictingTimeSlotReserveRequested of a single Schedule) listens to
TimeSlotReservationCancelled and checks to see if there're any
TimeSlotRequests that can be fulfilled by the time slot cancelled,
possibly more than one. Some sort of priority is used to decide
between candidates.
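
The reserve/conflict/cancel flow described above might look roughly like this in Python. It is a deliberately simplified sketch: the three event names come from this message, but everything else is assumed for illustration, including keeping reservations and requests as dictionaries inside a single `Schedule` object, handling only whole-slot conflicts, and using plain first-in-first-out order as the "priority".

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class TimeSlotSuccessfullyReserved:
    appointment_id: str
    time_slot: str


@dataclass(frozen=True)
class ConflictingTimeSlotReserveRequested:
    appointment_id: str
    time_slot: str


@dataclass(frozen=True)
class TimeSlotReservationCancelled:
    appointment_id: str
    time_slot: str


@dataclass
class Schedule:
    reservations: Dict[str, str] = field(default_factory=dict)    # time_slot -> appointment_id
    requests: Dict[str, List[str]] = field(default_factory=dict)  # time_slot -> waiting appointment ids
    events: List[object] = field(default_factory=list)

    def reserve_time_slot_for_appointment(self, appointment_id: str, time_slot: str) -> None:
        if time_slot in self.reservations:
            # Conflict: keep the request around instead of rejecting the command.
            self.requests.setdefault(time_slot, []).append(appointment_id)
            self.events.append(ConflictingTimeSlotReserveRequested(appointment_id, time_slot))
        else:
            self.reservations[time_slot] = appointment_id
            self.events.append(TimeSlotSuccessfullyReserved(appointment_id, time_slot))

    def cancel(self, time_slot: str) -> TimeSlotReservationCancelled:
        appointment_id = self.reservations.pop(time_slot)
        event = TimeSlotReservationCancelled(appointment_id, time_slot)
        self.events.append(event)
        return event


def on_reservation_cancelled(schedule: Schedule, event: TimeSlotReservationCancelled) -> None:
    """What a ScheduleConflictSaga could do: promote the oldest waiting request."""
    waiting = schedule.requests.get(event.time_slot, [])
    if waiting:
        schedule.reserve_time_slot_for_appointment(waiting.pop(0), event.time_slot)


schedule = Schedule()
schedule.reserve_time_slot_for_appointment("appt-A", "mon-0900")
schedule.reserve_time_slot_for_appointment("appt-B", "mon-0900")  # conflict, queued
on_reservation_cancelled(schedule, schedule.cancel("mon-0900"))   # appt-B is promoted
```
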

We can always create an object using an existing parent, and the
creation command can always succeed, resulting in something of value
to the business. In a sense a command is always a request, tentative.
If the command can't be accepted by the aggregate root, we can create a
permanent expression of the command and let the user submit the
request again.

Another way to express this idea is to think that it's requests all
the way down. Some are transient, because they're immediately accepted
(i.e. successful commands), others are more permanent (i.e. request
entities created when a command "failed"). It (almost) never matters
whether the command is a request from a user or a saga. Actually, it's
very important not to make distinctions using this criterion, and to
try instead to find the real actor behind the command (i.e. who is
interested in the result of the command): whose interests are these
sagas trying to take care of?

Best regards,
Daniel Yokomizo

@seagile

Dec 26, 2010, 3:38:11 PM

to DDD/CQRS

As much as I appreciate your relentless efforts, I'm afraid the
business of healthcare e-scheduling just doesn't work that way.
Time slot conflicts are very rare, but once a slot is taken, it's no
use sticking around waiting for a slot to become free; one has to
select a new set of slots. When you're on the phone with a patient or
he's standing in front of you, getting back via e-mail is not always
an option: they need confirmation, referral letters, and instructions at
that moment in time, because that's how things work. I'm sorry if I
have given too little contextual information about the business
problem. Conflict resolution is just not part of the problem space.
The other thing I'm afraid of with the "request" approach is that
people will get overeager and start modelling everything as a
"request".

On 26 dec, 16:12, Daniel Yokomizo <daniel.yokom...@gmail.com> wrote:

stacy

Dec 26, 2010, 5:01:02 PM

to DDD/CQRS

"The other thing I'm afraid of with the "request" approach is that
people will get over eager and start modelling everything as a
"request". "

Is this a bad thing? I have customers in the healthcare space also -
small medical practices. They think it's an improvement to turn many
screens, wizard-style, into a single request. I get that all the time
from users. It's simpler and faster for them. They really don't like
these complex EHR systems that my app must interface with. As a
result, I'm converting my UIs to use a CQRS back end. Although I'm not
done yet, I plan to model everyday recurring tasks as a simple
"request ... please wait." CQRS-ES gives us a credible opportunity to
get these medical systems out of the dark ages!

@seagile

Dec 26, 2010, 5:56:07 PM

to DDD/CQRS

I was not referring to task-based UIs, but rather to how things get
modelled in the domain. As Daniel said, a command is as good as a
request, not a guarantee.

Think Before Coding

Dec 29, 2010, 8:44:56 AM

to DDD/CQRS

> - when you're on the phone with a patient or
> he's standing in front of you, getting back via e-mail is not always
> an option, they need confimation, referral letters, instructions at
> that moment in time because that's how things work

This seems a good indication that this is more a Query than a
Command...
When you issue a command, it should not fail.

Except in extremely rare cases like:
> Timeslot conflicts are very rare.

So you should use a query before emitting the command, to check that
rejection is unlikely. Then emit the command.
Treat failure with an exception (because it should not happen:
you checked a few milliseconds before, and there is no such pressure
on appointments).
And signal the saga with an exception to 'rollback' the process.
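
A minimal Python sketch of this "query first, then command, roll back on the rare exception" flow is shown below. All names are hypothetical: `Slots` is an in-memory stand-in for the write side, `read_model_free` stands for a possibly stale read model, and `book` plays the role of the saga that releases what it already reserved.

```python
from typing import Iterable, List, Set


class SlotAlreadyTaken(Exception):
    pass


class Slots:
    """In-memory stand-in for the write side (hypothetical)."""

    def __init__(self, taken: Iterable[str] = ()) -> None:
        self.taken: Set[str] = set(taken)

    def reserve(self, slot_id: str) -> None:  # the command
        if slot_id in self.taken:
            raise SlotAlreadyTaken(slot_id)
        self.taken.add(slot_id)

    def release(self, slot_id: str) -> None:  # the compensating action
        self.taken.discard(slot_id)


def book(read_model_free: Set[str], slots: Slots, wanted: List[str]) -> bool:
    # Query first: if the read model already shows a conflict, do not send commands.
    if not all(slot_id in read_model_free for slot_id in wanted):
        return False
    reserved: List[str] = []
    try:
        for slot_id in wanted:
            slots.reserve(slot_id)
            reserved.append(slot_id)
    except SlotAlreadyTaken:
        # The read model was stale: roll back what was reserved so far.
        for slot_id in reversed(reserved):
            slots.release(slot_id)
        return False
    return True


wanted = ["surgeon-0900", "anesthetist-0900", "table-0900"]

# Somebody took the table a moment ago, but the read model has not caught up yet.
racy_slots = Slots(taken={"table-0900"})
racy_result = book(set(wanted), racy_slots, wanted)

# The normal case: the query was right and every command succeeds.
calm_slots = Slots()
calm_result = book(set(wanted), calm_slots, wanted)
```

In the racy case the surgeon and anesthetist slots are reserved and then released again, so only the other party's table reservation remains.
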

jeremie / thinkbeforecoding


@seagile

Dec 29, 2010, 9:33:44 AM

to DDD/CQRS

I thought this was exactly what I was suggesting/doing ...

On 29 dec, 14:44, Think Before Coding <jeremie.chassa...@gmail.com>
wrote:

Jonathan Oliver

Dec 29, 2010, 12:23:39 PM

to DDD/CQRS

Much of what has been mentioned in this thread has been formalized
into a pattern, the Reservation Pattern:
http://www.rgoarchitects.com/nblog/2009/09/08/SOAPatternsReservations.aspx

It seems to apply well here and should complement what I've said
previously.

Think Before Coding

Dec 30, 2010, 4:18:55 AM

to DDD/CQRS

> I thought this was exactly what I was suggesting/doing ...

Right. This kind of decision depends highly on your domain.
In the domain I'm working on, I'd do it the other way, so I was also
suggesting other possibilities, in case they fit your domain.

jeremie / thinkbeforecoding

