Only tracing gives you the whole picture. Of course we cannot run with
tracing enabled all of the time, but with a persistent problem enabling it
is a no-brainer. Printing some additional info, like "one node had a
bad_alloc", will not add much either; you will see those bad_allocs in
other places anyway. And in many more cases the timeout will be a real
timeout, yet that real timeout may still be caused by a bad_alloc on
another node, and you will not get that information.
> You may use it when debugging locally, but even there it's better if I
> don't have to enable trace-level logging on storage proxy to see the error,
> especially if the problem occurs only at heavy load.
>
> but in this case it is not even needed,
> > yes bad_alloc caused batchlog_manager error,
>
>
> How do you know that this bad_alloc belongs to batchlog, and not to an
> unrelated operation which happened at the same time?
It may have belonged to batchlog, but it was not caused by batchlog.
Batchlog is only an unintended victim. Printing "batchlog failed because of
bad_alloc" does not give you any additional information about why the error
happened beyond what we have now.
>
> > but that is not very helpful knowledge at all because it does not mean
> > that batchlog_manager caused the low memory condition, and this is what
> > is interesting here. With bad_alloc there is a chance that it will not
> > even reach storage_proxy if it happens in a very low rpc or io layer,
> > and sp will see either a connection closure or a real timeout because
> > we failed to read from a local disk.
> >
>
> Even if we agreed that bad_alloc is not interesting, that doesn't matter
> for my point, because it's not the only exception which can happen. We may get
> other errors, e.g. "Attempted to mutate using not synced schema". In
> general, the set of exceptions is open. Lower layers can add new kinds of
> failures in the future. A middle layer should not make assumptions about
> all of them not being interesting. It should default to considering all
> failures as interesting, and only special case those that we know are not.
>
The mid layer should log them with enough information about why the error
happened. I believe we already do that for that particular error (which
usually indicates a bug, not something higher layers should deal with).
> bad_alloc is a bit of a weak case. But still. You're right that there are
> conditions leading to bad_alloc which affect the whole system, and in such
> conditions it doesn't matter which operations are affected. Those are not
> the only conditions though. There are also localized conditions like an
> operation trying to allocate too much. Knowing which operations fail allows
> us to distinguish between those. If you see batchlog replay timing out, that
> could point to a different problem, not related to other bad_allocs you see
> on that node. Lower layers are supposed to propagate bad_alloc to the
> caller, so it shouldn't timeout because of bad_alloc at layers below
> storage_proxy.
If you have bad_allocs and batchlog replay timing out simultaneously, it
would not be wise to look at the batchlog error without resolving the
bad_allocs, because 99.9999% of the time the former is caused by the
latter, even if the former may report some other weird kind of error
(because that weird kind of error is likely caused by the bad_alloc).
Batchlog with bad_alloc is simply a bad example to make your point.
>
>
>
> > > Also, the failure may be reported in a completely different place, like
> > > cqlsh console. The one who sees the report may not have easy access to
> > > logs. Even when he has, it's extra burden to correlate the errors.
> > >
> > The point is exactly that such kinds of errors should not reach the cqlsh
> > console, because they cause application failures (it causes c-s
> > to go completely crazy, claiming no host is reachable because, I
> > guess, receiving such an error is considered a node failure).
> >
>
> If an operation fails with an error on all nodes, then it's right that the
> application fails the operation. There is no reason for it to believe that
> retrying will help. It can recognize non-fatal errors, like a timeout, and
> handle them specially, but the default should be to propagate the failure
> up to the application layer.
>
It should be propagated as a CQL write error: either write_timeout or
write_failure (for protocol v4 and later), not as a random internal error,
which is what we do now.
> Cassandra propagates failures to cqlsh, it has this write_failure response.
> It only responds with a timeout when the protocol doesn't support it.
>
I have the impression you are completely ignoring what I am writing, so I'll
summarize what I have already written many times in this thread. Yes, we
should implement write_failure/read_failure. They are not implemented
because they are a later addition and were not present in Origin. They
are only available if protocol v4 is negotiated. If an older protocol is
negotiated, write_failure is translated to write_timeout, so since we do
not yet have write_failure/read_failure, it is legal to send write_timeout
until write_failure (and the compatibility layer) is implemented. Note
though that even for write_failure no additional info is returned to
cqlsh; the message encoding does not include it.
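For illustration only, here is a rough sketch of the protocol downgrade
described above; the names (cql_error, encode_write_error) are hypothetical
and not the actual Scylla code:

enum class cql_error { write_timeout, write_failure };

// Hypothetical sketch: protocol v4 clients can receive WRITE_FAILURE, while
// older clients only understand WRITE_TIMEOUT, so the failure is reported as
// a timeout and the distinction is lost. Even in v4 the wire encoding carries
// no information about the underlying cause.
cql_error encode_write_error(int negotiated_protocol_version) {
    if (negotiated_protocol_version >= 4) {
        return cql_error::write_failure;
    }
    return cql_error::write_timeout;
}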
>
> >
> > > There is also a semantic problem with always responding with a timeout
> > > on failure. The algorithm using storage_proxy may care if the operation
> > > failed or timed out. It can attempt to retry on timeout, but it should
> > > not retry if there is no chance of succeeding, e.g. if the operation
> > > fails on replica side due to incorrect request.
> > >
> > There should not be such failures, any such failure is a bug, so the
> > example is somewhat contrived
>
>
> It could be a bug somewhere, but we still should propagate information that
> we hit it, and which operation hit it.
>
>
We should log that we hit it, with all the surrounding context. Propagating
it up will only lose some of that context, and the caller cannot deal with
it anyway since this is a bug.
> > and besides, returning only one error out of the multiple that may happen
> > will make it not work for that case either.
> >
>
> Why not? It's enough that we distinguish a timeout failure from a
> non-timeout failure.
>
>
And what if the failure we return is the timeout one? And what if the
timeout is caused by a bug that the caller will hit again and again forever?
> >
> > > > This is a general talk without any real use case
> > > > which is hard to argue about.
> > >
> > >
> > > Error propagation is a very general problem.
> > >
> > And it should not be taken to an extreme at that. Let's look at the
> > bad_alloc error. There is so much more info it could provide besides the
> > simple fact that an allocation failed: did it fail because there was
> > really no more memory, or because of fragmentation; what was the
> > allocation size that failed; and how hard did LSA try to shrink before
> > giving up. So much info can be included with bad_alloc, and it is useful
> > for debugging real problems; that is why we have loggers to trace
> > bad_alloc events, but no one proposes to add all this to bad_alloc
> > error propagation.
> >
>
> Propagating the exception which is the direct cause of the failure could
> hardly be called an extreme. In many cases it is enough to significantly
> narrow down investigation efforts. We want to improve on the information
> which is attached to exceptions, not reduce it.
>
Storage proxy is a complicated subsystem with complex internal state. When
an error happens and the requested operation cannot be done, the subsystem
returns an error indicating that the operation cannot be done. The internal
reason for the failure is logged, not reported.
Memory is a complicated subsystem with complex internal state. When an
error happens and the requested operation cannot be done, the subsystem
returns an error indicating that the operation cannot be done. The internal
reason for the failure is logged, not reported.
Do you notice any similarities?
>
> > >
> > > > Storage proxy has a lot of logging, and at trace level it allows one
> > > > to see the entire lifetime of a query request
> > > >
> > >
> > > Trace-level logging doesn't solve the problem of correlating errors
> > > across layers (e.g. with cqlsh output)
> > >
> > > Also, it's not enabled by default, and we may not be allowed to enable it
> > > in production. Failures could be hard to reproduce.
> > >
> > Trace level gives you much more info than one error message.
> > Critical errors are logged at higher levels.
>
> This patch turns exceptions into timeouts, where would the original errors
> get logged at higher levels?
It is logged at that level by the follow-up patch I sent. I do not
understand what you are suggesting. Leave things as they are, because the
real error is propagated and this is how you like it? But this breaks the
Cassandra driver, so it is not a viable option. A proper CQL write error
has to be returned.
>
>
> >
> > >
> > > > which until now was enough for debugging any issue.
> > >
> > >
> > > It doesn't mean it always will be, and that we shouldn't improve on this.
> > >
> > > Recently we had a user who reported seeing in cqlsh an error with message
> > > "Not implemented: INDEXES". If that was a timeout error instead, that user
> > > and we would waste more time before we got to the root cause.
> > >
> > This is not an intermittent error and should be reported as such. It will
> > cause the application to fail, and this is desirable behaviour. Reporting
> > anything but write_timeout (or write_failure with v4) on a temporary
> > communication error kills an application for no reason. If a user hits
> > an unimplemented path in storage_proxy, of course it should not be
> > reported as a timeout, and this is not what the patch is doing.
> >
>
> This patch is converting all uncaught exceptions to timeouts, treating them
> as intermittent failures. It's not right, because not all such errors are
> intermittent. It should assume they're not, and special-case the few
> which we recognize as expected.
>
So it would be OK from your POV to catch only some exceptions? What is
the list? I actually agree with you here, but with a little caveat: we
should filter the exceptions known to be non-intermittent, because the list
of those is finite. In this place it is empty as far as I can tell, but I
am open to suggestions.
As for why distinguishing a timeout from a non-timeout failure is not
enough: because a timeout can be as permanent as any other failure in case
of a bug, and because if we report only one error it may be the timeout and
not something that tells you it is permanent.
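To make the filtering idea concrete, here is a minimal sketch, assuming
hypothetical names (classify_mutation_failure, write_result); it only
illustrates the shape of the catch clauses, not the actual storage_proxy
code:

#include <exception>
#include <new>        // std::bad_alloc

enum class write_result { report_timeout, report_failure };

// Hypothetical sketch: special-case the exceptions we recognize and decide
// per type whether to report a timeout-like error or a real failure.
write_result classify_mutation_failure(std::exception_ptr eptr) {
    try {
        std::rethrow_exception(eptr);
    } catch (const std::bad_alloc&) {
        // Known condition that is usually node-wide and transient; reporting
        // it as a timeout is the behaviour argued for in this thread.
        return write_result::report_timeout;
    } catch (...) {
        // Anything not on the known list: assume it may be permanent, report
        // a failure (write_failure on v4+) and log the cause with context.
        return write_result::report_failure;
    }
}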
>
>
> >
> > > even better, but returns on doing that are diminishing. For CL=1
> > > operation,
> > The difference in our views is that IMO propagating even one already
> > has diminishing returns.
> >
> > > it is as good. For CL>1, in many cases if there is one failure, chances
> > > are all will be the same (e.g. bad request, serialization error). Each
> > > failure
> > Those are not error conditions, those are bugs (I do not remember them
> > ever happening in anything but development). They are logged at a very
> > high level.
>
>
> Not sure what your point is here. The case of bugs is where the error is
> most interesting. We want to see the error and correlate it with the
> affected operations.
>
My point is that higher levels are not supposed to handle them in any
special way, and logging them in the context in which they happen, with as
much available info as possible, is a perfectly good solution. It is even
preferable, since error propagation tends to lose context, so logging at
higher levels may have much less info about the actual error.
>
> > Much more common reasons are a closed connection and bad_alloc, and they
> > can happen simultaneously (both may be a result of memory
> > shortage/fragmentation).
> >
>
> Just because there are errors which are not interesting, doesn't mean we
> should silence all errors. We should silence the uninteresting errors,
> after careful consideration of each case.
>
We should not. We should log them. There is a patch on top that does that.
>
> >
> > > is an abnormal condition, so even if the failures are different, knowing
> > > one of the causes already points at some problem which should probably be
> > > fixed. We have a similar problem in the read path, which uses
> > > parallel_for_each().
> > >
> > > But regardless of whether we propagate one cause of failure, all of them,
> > > or none, just distinguishing a timeout from a non-timeout failure is
> > > already an improvement.
> > >
> > For that we need to implement write_failure error which was added in v4.
>
> > It allows storage_proxy to give up on a request earlier if it sees that
> > there is no chance to fulfill the request because of errors. It is still
> > a high-level CQL error; it does not list all the reasons why it is thrown.
> >
>
> We can still have the cause attached to write_failure exception (e.g. using
> throw_with_nested). This way internal users will have access to it. CQL
> server can ignore it.
If anything, I would prefer to add it directly to cassandra_exception (by
storing an exception_ptr there). That is much easier than trying to unfold
a nested exception, especially with multiple causes. But this is not
relevant to this patch.
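For illustration, a sketch of the two alternatives being compared; the
cassandra_exception layout shown here is assumed for the example and is not
the real class:

#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Alternative from the quoted text: nest the cause. The consumer has to
// unfold it with std::rethrow_if_nested(), which gets awkward when there
// are multiple causes.
[[noreturn]] void rethrow_as_write_failure_nested(const std::string& msg) {
    std::throw_with_nested(std::runtime_error("write_failure: " + msg));
}

// Preference stated above: store the original exception_ptr directly, so it
// can be inspected on demand without unfolding nested exceptions.
struct cassandra_exception_sketch : std::runtime_error {
    std::exception_ptr cause;
    cassandra_exception_sketch(const std::string& msg, std::exception_ptr c)
        : std::runtime_error(msg), cause(std::move(c)) {}
};

[[noreturn]] void rethrow_as_write_failure_direct(const std::string& msg) {
    // current_exception() captures the in-flight exception when called from
    // within a catch block.
    throw cassandra_exception_sketch(msg, std::current_exception());
}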
>
>
> > This is orthogonal to the fix here, since for compatibility with v3 it
> > is translated to a timeout error, so replying with a timeout is valid.
> >
>
> It's only valid when responding to a CQL v3 request, where there is no
> choice. We shouldn't do this to internal users.
Internal users do not handle this exception either, so what specific
problem are you trying to solve here?
I would much prefer not to have the write_failure implementation be a
prerequisite for this fix, because the fix also has to be backported to
1.7, and implementing write_failure would be a much larger and riskier
backport.
Yes, as I said in some other email, we can make CQL errors carry some more
state about the error if you really insist. But first, I do not see any
point in doing it without a user, and second, this is orthogonal to this
patch because the additional info will not reach clients anyway, and we
have a problem with them right now.
>
>
> >
> > >
> > > >
> > > > But let's say there is a caller that benefits from knowing all the
> > > > internal reasons. We can extend cassandra_exception to hold an array
> > > > of errors, and whoever wants to may explore it. But the churn is not
> > > > worth it without a real benefit, which I simply do not see.
> > > >
> > > > Anyway, the discussion should not hold up the fix for the current
> > > > issue from going in, since it fits the current behaviour.
> > >
> > >
> > > Does Cassandra have the same problem which this patch aims to fix? What's
> > > their solution?
> > Cassandra has the same interface towards upper layers as we do. It
> > throws a CQL exception. Look at the mutate() definition there:
> > public static void mutate(Collection<? extends IMutation> mutations,
> > ConsistencyLevel consistency_level, long queryStartNanoTime)
> > throws UnavailableException, OverloadedException,
> > WriteTimeoutException, WriteFailureException
> >
> > Looking at the counters code there, I see that they use the usual
> > AbstractWriteResponseHandler mechanism to send the counter mutation, so
> > they should get WriteTimeoutException in case of a timeout or
> > WriteFailureException in case an error is detected earlier. The latter
> > will be translated to WriteTimeoutException for clients that negotiated
> > protocol v3 or earlier.
> >
>
> I think we should also distinguish between timeout and failure at the
> storage proxy level, and in addition to that, attach the original exception
> to write_failure inside the catch(...) clause.
As I have said numerous times in this thread, yes, we need to implement
write_failure eventually. It is not implemented because it did not exist
in Origin, so this is the usual catch-up that we have to do. It should not
block the fix for this very real problem.
--
Gleb.