[PATCH 1/2] storage_proxy: prevent counter mutation from returning unrecognizable error to a client


Gleb Natapov

<gleb@scylladb.com>
Oct 3, 2017, 4:27:33 AM10/3/17
to scylladb-dev@googlegroups.com
If sending a counter mutation to a chosen leader fails with an exception
other than a timeout, the error escapes to the client and confuses it greatly
(cassandra-stress claims that it cannot connect to any node and fails).
Fix it by replying with a timeout in that case.

Fixed #2859
---
service/storage_proxy.cc | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/service/storage_proxy.cc b/service/storage_proxy.cc
index 35f6077..32b76b0 100644
--- a/service/storage_proxy.cc
+++ b/service/storage_proxy.cc
@@ -1232,9 +1232,7 @@ future<> storage_proxy::mutate_counters(Range&& mutations, db::consistency_level
auto& ks = _db.local().find_keyspace(s->ks_name());
try {
std::rethrow_exception(std::move(exp));
- } catch (rpc::timeout_error&) {
- return make_exception_future<>(mutation_write_timeout_exception(s->ks_name(), s->cf_name(), cl, 0, db::block_for(ks, cl), db::write_type::COUNTER));
- } catch (timed_out_error&) {
+ } catch (...) {
return make_exception_future<>(mutation_write_timeout_exception(s->ks_name(), s->cf_name(), cl, 0, db::block_for(ks, cl), db::write_type::COUNTER));
}
};
--
2.9.5

Gleb Natapov

<gleb@scylladb.com>
Oct 3, 2017, 4:27:33 AM10/3/17
to scylladb-dev@googlegroups.com
If the local node is a replica for the counter mutation (which will always be
the case with token-aware drivers), choosing it to be the leader saves an
additional hop and reduces the chance of an unrecoverable error in case
sending the mutation to the leader fails.
---
service/storage_proxy.cc | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/service/storage_proxy.cc b/service/storage_proxy.cc
index 32b76b0..a9fab4a 100644
--- a/service/storage_proxy.cc
+++ b/service/storage_proxy.cc
@@ -1183,6 +1183,10 @@ gms::inet_address storage_proxy::find_leader_for_counter_update(const mutation&
throw exceptions::unavailable_exception(cl, block_for(ks, cl), 0);
}

+ if (boost::range::find(live_endpoints, utils::fb_utilities::get_broadcast_address()) != live_endpoints.end()) {
+ return utils::fb_utilities::get_broadcast_address();
+ }
+
auto local_endpoints = boost::copy_range<std::vector<gms::inet_address>>(live_endpoints | boost::adaptors::filtered([&] (auto&& ep) {
return db::is_local(ep);
}));
@@ -1192,7 +1196,6 @@ gms::inet_address storage_proxy::find_leader_for_counter_update(const mutation&
snitch->sort_by_proximity(utils::fb_utilities::get_broadcast_address(), live_endpoints);
return live_endpoints[0];
} else {
- // FIXME: favour ourselves to avoid additional hop?
static thread_local std::random_device rd;
static thread_local std::default_random_engine re(rd());
std::uniform_int_distribution<> dist(0, local_endpoints.size() - 1);
--
2.9.5

Tomasz Grabiec

<tgrabiec@scylladb.com>
Oct 3, 2017, 5:12:25 AM10/3/17
to Gleb Natapov, scylladb-dev
On Tue, Oct 3, 2017 at 10:27 AM, Gleb Natapov <gl...@scylladb.com> wrote:
If sending a counter mutation to a chosen leader fails with an exception
other than a timeout the error escapes to a client and confuses it greatly
(cassandra-stress claim that it cannot connect to any node and fails).
Fix it by replying with timeout in that case.

Should this be done at the CQL binary protocol level? It is that protocol's limitation that we can't propagate errors. Other clients, or later protocol versions, may benefit from a more informative error.
 

Fixed #2859
---
 service/storage_proxy.cc | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/service/storage_proxy.cc b/service/storage_proxy.cc
index 35f6077..32b76b0 100644
--- a/service/storage_proxy.cc
+++ b/service/storage_proxy.cc
@@ -1232,9 +1232,7 @@ future<> storage_proxy::mutate_counters(Range&& mutations, db::consistency_level
             auto& ks = _db.local().find_keyspace(s->ks_name());
             try {
                 std::rethrow_exception(std::move(exp));
-            } catch (rpc::timeout_error&) {
-                return make_exception_future<>(mutation_write_timeout_exception(s->ks_name(), s->cf_name(), cl, 0, db::block_for(ks, cl), db::write_type::COUNTER));
-            } catch (timed_out_error&) {
+            } catch (...) {
                 return make_exception_future<>(mutation_write_timeout_exception(s->ks_name(), s->cf_name(), cl, 0, db::block_for(ks, cl), db::write_type::COUNTER));
             }
         };
--
2.9.5


Gleb Natapov

<gleb@scylladb.com>
Oct 3, 2017, 5:19:32 AM10/3/17
to Tomasz Grabiec, scylladb-dev
On Tue, Oct 03, 2017 at 11:12:24AM +0200, Tomasz Grabiec wrote:
> On Tue, Oct 3, 2017 at 10:27 AM, Gleb Natapov <gl...@scylladb.com> wrote:
>
> > If sending a counter mutation to a chosen leader fails with an exception
> > other than a timeout the error escapes to a client and confuses it greatly
> > (cassandra-stress claim that it cannot connect to any node and fails).
> > Fix it by replying with timeout in that case.
> >
>
> Should this be done at the CQL binary protocol level? It is that protocol's
> limitation that we can't propagate errors. Other clients, or later protocol
> versions, may benefit from a more informative error.
>
How does the CQL protocol level know how to translate a random exception into
a somewhat meaningful CQL one? Even if it can guess that the exception is
about a write, what should it put into the write timeout exception data fields? I'd
rather not let internal exceptions escape into generic code.
Note that the CQL level already translates unknown exception types into a
generic CQL protocol error, and I think that is the best it can do, but at
least c-s (or maybe it is a driver problem) does not expect a generic error
as a reply to a write.
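
For context, the kind of catch-all translation described above might look roughly like the sketch below; the names (cql_error, translate_unknown) and the bare 0x0000 server-error code are illustrative stand-ins, not Scylla's actual transport code:

  #include <cstdint>
  #include <exception>
  #include <string>
  #include <utility>

  // Illustrative only: a generic "server error" is the best a protocol layer
  // can produce for an exception type it does not recognize, because it has
  // no write/read context (CL, received, block_for) to fill a richer error.
  struct cql_error {
      int32_t code;        // CQL error code; 0x0000 is the generic server error
      std::string message;
  };

  cql_error translate_unknown(std::exception_ptr ep) {
      try {
          std::rethrow_exception(std::move(ep));
      } catch (const std::exception& e) {
          return {0x0000, e.what()};
      } catch (...) {
          return {0x0000, "unknown exception"};
      }
  }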

--
Gleb.

Tomasz Grabiec

<tgrabiec@scylladb.com>
Oct 3, 2017, 5:41:59 AM10/3/17
to Gleb Natapov, scylladb-dev
On Tue, Oct 3, 2017 at 11:19 AM, Gleb Natapov <gl...@scylladb.com> wrote:
On Tue, Oct 03, 2017 at 11:12:24AM +0200, Tomasz Grabiec wrote:
> On Tue, Oct 3, 2017 at 10:27 AM, Gleb Natapov <gl...@scylladb.com> wrote:
>
> > If sending a counter mutation to a chosen leader fails with an exception
> > other than a timeout the error escapes to a client and confuses it greatly
> > (cassandra-stress claim that it cannot connect to any node and fails).
> > Fix it by replying with timeout in that case.
> >
>
> Should this be done at the CQL binary protocol level? It is that protocol's
> limitation that we can't propagate errors. Other clients, or later protocol
> versions, may benefit from a more informative error.
>
How CQL protocol level knows how to translate a random exception into
somewhat meaningful CQL one? Even if it can guess that exception should
be about write what to put into write timeout exception data fields? 
I'd rather prefer not letting internal exceptions to escape into generic code.

We can throw write_failed_exception here, which has all needed fields, and wraps the original cause.
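
A rough sketch of the kind of exception suggested here; the type below and its fields (modeled on the timeout variant, plus a failure count and the wrapped cause) are assumptions for illustration, since Scylla had no write-failure exception at the time of this thread:

  #include <cstdint>
  #include <exception>
  #include <string>

  // Hypothetical: the same data as the write timeout exception, plus a count
  // of failed replicas and the original cause kept for internal consumers.
  struct mutation_write_failure_error : std::exception {
      std::string ks_name;
      std::string cf_name;
      int32_t received = 0;
      int32_t failures = 0;
      int32_t block_for = 0;
      std::exception_ptr cause;   // original error; the CQL encoding may drop it

      const char* what() const noexcept override {
          return "counter write failed on leader";
      }
  };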


--
                        Gleb.

Gleb Natapov

<gleb@scylladb.com>
Oct 3, 2017, 5:49:17 AM10/3/17
to Tomasz Grabiec, scylladb-dev
On Tue, Oct 03, 2017 at 11:41:58AM +0200, Tomasz Grabiec wrote:
> On Tue, Oct 3, 2017 at 11:19 AM, Gleb Natapov <gl...@scylladb.com> wrote:
>
> > On Tue, Oct 03, 2017 at 11:12:24AM +0200, Tomasz Grabiec wrote:
> > > On Tue, Oct 3, 2017 at 10:27 AM, Gleb Natapov <gl...@scylladb.com> wrote:
> > >
> > > > If sending a counter mutation to a chosen leader fails with an
> > exception
> > > > other than a timeout the error escapes to a client and confuses it
> > greatly
> > > > (cassandra-stress claim that it cannot connect to any node and fails).
> > > > Fix it by replying with timeout in that case.
> > > >
> > >
> > > Should this be done at the CQL binary protocol level? It is that
> > protocol's
> > > limitation that we can't propagate errors. Other clients, or later
> > protocol
> > > versions, may benefit from a more informative error.
> > >
> > How CQL protocol level knows how to translate a random exception into
> > somewhat meaningful CQL one? Even if it can guess that exception should
> > be about write what to put into write timeout exception data fields?
>
> I'd rather prefer not letting internal exceptions to escape into generic
> > code.
> >
>
> We can throw write_failed_exception here, which has all needed fields, and
> wraps the original cause.
>
>
That is what I discussed with Pawel on the previous patch. The
exception exists only in protocol v4; a client may negotiate an
older protocol. Cassandra has a layer that will translate
write_failed_exception to write_timeout_exception in that case, so
returning write_timeout_exception does not violate anything. I do think
that we need to add support for write_failed_exception eventually, but
that is orthogonal to the issue here.
> >
> > --
> > Gleb.
> >

--
Gleb.

Tomasz Grabiec

<tgrabiec@scylladb.com>
Oct 3, 2017, 6:48:00 AM10/3/17
to Gleb Natapov, scylladb-dev
I am concerned that this patch will make it harder to debug those errors. Perhaps the original cause should be logged on translation?

Not all users of storage_proxy are CQL users; some may be internal. Since the translation is losing information about the cause, we should do it as late as possible to allow such consumers to report the original cause instead of a timeout.

Gleb Natapov

<gleb@scylladb.com>
Oct 3, 2017, 6:51:15 AM10/3/17
to Tomasz Grabiec, scylladb-dev
Logged - absolutely! My previous (incorrect) patch added a bunch of
logging, but when I redid it as this simple one I dropped all of it.
I will send it as a separate patch.

> Not all users of storage_proxy are CQL users, some may be internal. Since
> the translation is loosing information about the cause, we should do it as
> late as possible to allow such consumers to report the original cause
> instead of a timeout.
This is no different from a regular mutation, though.

--
Gleb.

Tomasz Grabiec

<tgrabiec@scylladb.com>
Oct 3, 2017, 7:06:00 AM10/3/17
to Gleb Natapov, scylladb-dev
Yes, they should behave the same. It's out of scope of this patch, but do we agree that it should be changed? If so, let's open a github issue.

Gleb Natapov

<gleb@scylladb.com>
Oct 3, 2017, 7:13:46 AM10/3/17
to Tomasz Grabiec, scylladb-dev
I actually do not. There may be more than one original cause.
Communication with one node may time out and another may get a bad_alloc.
What should a caller get? And to make it worse, depending on the operation
the call may still succeed even with those two failures, so those errors
cannot be reported, only logged anyway. The storage_proxy interface
towards its callers is a well-defined one: it is either the result of the
operation or a CQL error.

--
Gleb.

Tomasz Grabiec

<tgrabiec@scylladb.com>
Oct 3, 2017, 7:22:03 AM10/3/17
to Gleb Natapov, scylladb-dev
If the failure happens on coordinator side, for instance before we even contact any replica, we should propagate that error.

If we time out waiting for some replica, it's fair to return a timeout, and only log the errors we ignored. If all replicas return an error, we can return an error saying this ("all replicas failed"), quoting one of the errors, or all of them. 

Gleb Natapov

<gleb@scylladb.com>
Oct 3, 2017, 7:29:31 AM10/3/17
to Tomasz Grabiec, scylladb-dev
What kind of error on the coordinator side are you thinking about here?
And how will the caller handle it specially?

> If we time out waiting for some replica, it's fair to return a timeout, and
> only log the errors we ignored. If all replicas return an error, we can
> return an error saying this ("all replicas failed"), quoting one of the
> errors, or all of them.
What if some replicas returned an error and some timed out (and it is possible
that the operation still succeeded, so no error is returned to the caller at
all)? How is it useful for a caller to get an error that quotes one of the
errors, or all of them? How will the CQL server translate such an error into
a CQL protocol error, or should we have a separate interface for it?

--
Gleb.

Tomasz Grabiec

<tgrabiec@scylladb.com>
Oct 4, 2017, 7:15:52 AM10/4/17
to Gleb Natapov, scylladb-dev
Any kind of unhandled exception

And how caller will handle it specially?

The caller can attach the cause to its own exception, adding more context, and propagate it further up. When the exception is finally logged (maybe at the direct caller), we have full correlation between what failed and why. If all failures are turned into a timeout, information about the cause of failure of the high level operation is lost. Logging it earlier at lower level is not enough, because there is no clear correlation between the two failures.
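
A small self-contained illustration of the pattern described here, using only the standard library (std::throw_with_nested / std::rethrow_if_nested); the batch-replay naming is made up for the example:

  #include <exception>
  #include <iostream>
  #include <stdexcept>
  #include <string>

  // Lower layer fails for some internal reason.
  void send_to_replica() {
      throw std::runtime_error("connection closed by peer");
  }

  // Middle layer adds its own context and keeps the cause attached.
  void replay_batch(int id) {
      try {
          send_to_replica();
      } catch (...) {
          std::throw_with_nested(
              std::runtime_error("batch replay failed for batch " + std::to_string(id)));
      }
  }

  // Top-level logging walks the nested chain, so one log entry correlates
  // "what failed" with "why".
  void log_chain(const std::exception& e, int depth = 0) {
      std::cerr << std::string(depth * 2, ' ') << e.what() << '\n';
      try {
          std::rethrow_if_nested(e);
      } catch (const std::exception& nested) {
          log_chain(nested, depth + 1);
      } catch (...) {}
  }

  int main() {
      try {
          replay_batch(42);
      } catch (const std::exception& e) {
          log_chain(e);
      }
  }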


> If we time out waiting for some replica, it's fair to return a timeout, and
> only log the errors we ignored. If all replicas return an error, we can
> return an error saying this ("all replicas failed"), quoting one of the
> errors, or all of them.
If some replicas returned an error and some timed out (and it is possible
that operation still succeeded, so not error is returned to a caller at
all)? How is it useful for a caller to get an error that is quoting one
of the errors, or all of them?

If operation succeeds, we don't return an error. If it doesn't succeed, we should fail with the error which best describes the cause. Propagating one of the failures gives more information than propagating none.

If none of the causes is a timeout, returning a timeout will not only not add information about the cause, it will confuse the one trying to understand what happened.

How CQL server will translate such an
error to CQL protocol error, or should we have separate interface for it?

CQL binary protocol has a "Write_failure" error response. But I think what we respond with there is orthogonal to whether we should only ever throw timeout from mutate() or not.

Gleb Natapov

<gleb@scylladb.com>
Oct 4, 2017, 8:36:50 AM10/4/17
to Tomasz Grabiec, scylladb-dev
This is a non-answer. I am trying to understand the use case here. I do not
see it.

> And how caller will handle it specially?
> >
>
> The caller can attach the cause to its own exception, adding more context,
> and propagate it further up. When the exception is finally logged (maybe at
> the direct caller), we have full correlation between what failed and why.
> If all failures are turned into a timeout, information about the cause of
> failure of the high level operation is lost. Logging it earlier at lower
> level is not enough, because there is no clear correlation between the two
> failures.
>
What kind of logging is missing right now? What kind of correlation
are you missing now? This is general talk without any real use case,
which is hard to argue about. Storage proxy has a lot of logging, and
at trace level it allows one to see the entire lifetime of a query request,
which until now was enough for debugging any issue.

>
> > > If we time out waiting for some replica, it's fair to return a timeout,
> > and
> > > only log the errors we ignored. If all replicas return an error, we can
> > > return an error saying this ("all replicas failed"), quoting one of the
> > > errors, or all of them.
> > If some replicas returned an error and some timed out (and it is possible
> > that operation still succeeded, so not error is returned to a caller at
> > all)? How is it useful for a caller to get an error that is quoting one
> > of the errors, or all of them?
>
>
> If operation succeeds, we don't return an error. If it doesn't succeed, we
> should fail with the error which best describes the cause. Propagating one
> of the failures gives more information than propagating none.
>
> If none of the causes is a timeout, returning a timeout will not only not
> add information about the cause, it will confuse the one trying to
> understand what happened.
You again ignored the fact that there may be more than a single cause of
failure. How may a caller handle this? But let's be more specific. Who is
this one that will be confused? Can you give a concrete example that I can
work with? Who is this caller that can take an exception with multiple
causes and do something intelligent with it except simple logging? If
logging is the only use case, then logging at the storage level has proved
to be sufficient. If you have another use case, describe it in less generic
terms and then we can find a solution.

>
> How CQL server will translate such an
> > error to CQL protocol error, or should we have separate interface for it?
> >
>
> CQL binary protocol has a "Write_failure" error response. But I think what
> we respond with there is orthogonal to whether we should only ever throw
> timeout from mutate() or not.
This is a good counterargument to what you are saying. Only storage proxy knows
when to answer with write_timeout and when with write_failure. write_failure
is a recent addition to the protocol to allow storage proxy to fail
earlier, but only storage proxy knows what "earlier" is.

But let's say there is a caller that benefits from knowing all the internal
reasons; we can extend cassandra_exception to hold an array of errors
and whoever wants may explore it. But the churn is not worth it
without a real benefit, which I simply do not see.

Anyway, the discussion should not hold up the fix for the current issue,
since it fits the current behaviour.

--
Gleb.

Tomasz Grabiec

<tgrabiec@scylladb.com>
Oct 9, 2017, 12:00:37 PM10/9/17
to Gleb Natapov, scylladb-dev
We're missing causal correlation between failures at different levels. Example:

..
ERROR 2017-10-04 14:48:12,347 [shard 0] storage_proxy - failed to communicate with <this node>: std::bad_alloc
...
ERROR 2017-10-04 14:48:16,542 [shard 0] batchlog_manager - Exception in batch replay: exceptions::read_timeout_exception (Operation timed out for system.batchlog - received only 0 responses from 1 CL=ONE.)

We can't tell what the cause of the batch replay failure was. It could be a timeout, but it could be something else.

Also, the failure may be reported in a completely different place, like the cqlsh console. The one who sees the report may not have easy access to the logs. Even when they do, it's an extra burden to correlate the errors.

There is also a semantic problem with always responding with a timeout on failure. The algorithm using storage_proxy may care whether the operation failed or timed out. It can attempt to retry on a timeout, but it should not retry if there is no chance of succeeding, e.g. if the operation fails on the replica side due to an incorrect request.
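
A minimal sketch of the caller-side policy described here, with made-up exception type names standing in for the timeout/failure distinction; the point is only that retries make sense for timeouts and not for hard failures:

  #include <iostream>
  #include <stdexcept>

  // Hypothetical error types: only the distinction matters to the caller.
  struct write_timeout_error : std::runtime_error { using std::runtime_error::runtime_error; };
  struct write_failure_error : std::runtime_error { using std::runtime_error::runtime_error; };

  // Stub standing in for a storage_proxy call: first attempt times out.
  bool do_write(int attempt) {
      if (attempt == 0) { throw write_timeout_error("request timed out"); }
      return true;
  }

  // Retry a few times on timeout; give up immediately on a hard failure,
  // because retrying an operation that is rejected outright cannot help.
  bool write_with_retry(int max_attempts) {
      for (int attempt = 0; attempt < max_attempts; ++attempt) {
          try {
              return do_write(attempt);
          } catch (const write_timeout_error&) {
              continue;               // intermittent: worth another try
          } catch (const write_failure_error& e) {
              std::cerr << "giving up: " << e.what() << '\n';
              return false;           // permanent: retrying cannot succeed
          }
      }
      return false;
  }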

This is a  general talk without any real use case
which is hard to argue about.

Error propagation is a very general problem.
 
Storage proxy has a lot of logging and
at trace level it allows to see the entire lifetime of a query request

Trace-level logging doesn't solve the problem of correlating errors across layers (e.g. with cqlsh output)

Also, it's not enabled by default, and we may not be allowed to enable it in production. Failures could be hard to reproduce.
 
which until now was enough for debugging any issue.

It doesn't mean it always will be, and that we shouldn't improve on this.

Recently we had a user who reported seeing in cqlsh an error with message "Not implemented: INDEXES". If that was a timeout error instead, that user and we would waste more time before we got to the root cause.

 
 
 

>
> > > If we time out waiting for some replica, it's fair to return a timeout,
> > and
> > > only log the errors we ignored. If all replicas return an error, we can
> > > return an error saying this ("all replicas failed"), quoting one of the
> > > errors, or all of them.
> > If some replicas returned an error and some timed out (and it is possible
> > that operation still succeeded, so not error is returned to a caller at
> > all)? How is it useful for a caller to get an error that is quoting one
> > of the errors, or all of them?
>
>
> If operation succeeds, we don't return an error. If it doesn't succeed, we
> should fail with the error which best describes the cause. Propagating one
> of the failures gives more information than propagating none.
>
> If none of the causes is a timeout, returning a timeout will not only not
> add information about the cause, it will confuse the one trying to
> understand what happened.
You again ignored the fact that there may be more than a single cause of
failure.

I don't think I did, but maybe there is some point you wanted to get addressed that I missed. I wrote "Propagating one of the failures gives more information than propagating none". Propagating all failures would be even better, but the returns on doing that are diminishing. For a CL=1 operation, it is just as good. For CL>1, in many cases if there is one failure, chances are all of them will be the same (e.g. bad request, serialization error). Each failure is an abnormal condition, so even if the failures are different, knowing one of the causes already points at some problem which should probably be fixed. We have a similar problem in the read path, which uses parallel_for_each().

But regardless of whether we propagate one cause of failure, all of them, or none, just distinguishing a timeout from a non-timeout failure is already an improvement.

How caller may handle this? But lets be more specific. Who is
this one that will be confused? Can you give a concrete example that I can
work with? Who is this caller that can take an exception with multiple
causes and to something intelligent about it except simple logging? 
If the logging is the only use case then it logging on storage level was
proved to be sufficient enough. If you have other use case describe in
less generic terms and then we can find a solution.

Examples given earlier.
 

>
> How CQL server will translate such an
> > error to CQL protocol error, or should we have separate interface for it?
> >
>
> CQL binary protocol has a "Write_failure" error response. But I think what
> we respond with there is orthogonal to whether we should only ever throw
> timeout from mutate() or not.
This is a good counterargument to what you are saying. Only storage proxy knows
when to answer with write_timeout and when with write_failure. write_failure
is a recent addition to the protocol to allow storage proxy to fail
earlier, but only storage proxy knows what "earlier" is.

I don't understand. Is your point that we should wait before responding with a timeout until we reach a time point at which the request would normally time out, and only storage proxy knows how long to wait, so it has to make the decision?
 

But lest say there is a caller that benefits from knowing all the internal
reasons we can extend cassandra_exception to hold an array of errors
and whoever wants is may explore it. But the churn is not worth it
without a real benefit which I simple do not see.

Anyway the discussion should not hold the fix for the current issue to
go in since it fits current behaviour.

Does Cassandra have the same problem which this patch aims to fix? What's their solution?


Gleb Natapov

<gleb@scylladb.com>
Oct 9, 2017, 1:38:24 PM10/9/17
to Tomasz Grabiec, scylladb-dev
Trace will show you exactly, but in this case it is not even needed:
yes, bad_alloc caused the batchlog_manager error, but that is not very
helpful knowledge at all, because it does not mean that batchlog_manager
caused the low memory condition, and that is what is interesting here. With
bad_alloc there is a chance that it will not even reach storage_proxy
if it happens in a very low rpc or io layer, and sp will see either a
connection closure or a real timeout because we failed to read from
a local disk.

> Also, the failure may be reported in a completely different place, like
> cqlsh console. The one who sees the report may not have easy access to
> logs. Even when he has, it's extra burden to correlate the errors.
>
The point is exactly that such kinds of errors should not reach the cqlsh
console, because it causes application failures (it causes c-s
to go completely crazy by claiming no host is reachable because, I
guess, receiving such an error is considered a node failure).

> There is also a semantic problem with always responding with a timeout on
> failure. The algorithm using storage_proxy may care if the operation failed
> or timed out. It can attempt to retry on timeout, but it should not retry
> if there is no chance of succeeding, e.g. if the operation fails on replica
> side due to incorrect request.
>
There should not be such failures; any such failure is a bug, so the
example is somewhat contrived, and besides, returning only one error out of
the multiple that may happen will not work for that case either.

> This is a general talk without any real use case
> > which is hard to argue about.
>
>
> Error propagation is a very general problem.
>
And it should not be taken to the extreme either. Let's look at the bad_alloc
error. There is so much more info it could provide besides the simple fact
that the allocation failed: did it fail because there was really no more
memory or because of fragmentation, what was the allocation size
that failed, and how hard did LSA try to shrink before giving up. So much
info can be included with bad_alloc, and it is useful for debugging real
problems; that is why we have loggers to trace bad_alloc events, but
no one proposes adding all this to bad_alloc error propagation.

>
> > Storage proxy has a lot of logging and
> > at trace level it allows to see the entire lifetime of a query request
> >
>
> Trace-level logging doesn't solve the problem of correlating errors across
> layers (e.g. with cqlsh output)
>
> Also, it's not enabled by default, and we may not be allowed to enable it
> in production. Failures could be hard to reproduce.
>
Trace level gives you much more info than one error message. Critical
errors are logged at higher levels.

>
> > which until now was enough for debugging any issue.
>
>
> It doesn't mean it always will be, and that we shouldn't improve on this.
>
> Recently we had a user who reported seeing in cqlsh an error with message
> "Not implemented: INDEXES". If that was a timeout error instead, that user
> and we would waste more time before we got to the root cause.
>
This is not an intermittent error and should be reported as such. It will
cause the application to fail, and this is desirable behaviour. Replying
with anything but write_timeout (or write_failure with v4) on a temporary
communication error kills an application for no reason. If a user hits
an unimplemented path in storage_proxy, of course it should not be
reported as a timeout, and this is not what the patch is doing.
Then it cannot be used for retrying a request (but as I said, that example
is somewhat contrived).

> even better, but returns on doing that are diminishing. For CL=1 operation,
The difference in our views is that IMO propagating even one already
has diminishing returns.

> it is as good. For CL>1, in many cases if there is one failure, chances are
> all will be the same (e.g. bad request, serialization error). Each failure
Those are not error conditions, those are bugs (I do not remember them
ever happening in anything but development). They are logged at a very
high level. Much more common reasons are a closed connection and bad_alloc,
and they can happen simultaneously (both may be a result of memory
shortage/fragmentation).

> is an abnormal condition, so even if the failures are different, knowing
> one of the causes already points at some problem which should probably be
> fixed. We have a similar problem in the read path, which uses
> parallel_for_each().
>
> But regardless of whether we propagate one cause of failure, all of them,
> or none, just distinguishing a timeout from a non-timeout failure is
> already an improvement.
>
For that we need to implement the write_failure error, which was added in v4.
It allows storage_proxy to give up on a request earlier if it sees that
there is no chance to fulfill the request because of errors. It is still a
high level CQL error; it does not list all the reasons why it is thrown.
This is orthogonal to the fix here since for compatibility with v3 it
is translated to a timeout error, so replying with a timeout is valid.

> How caller may handle this? But lets be more specific. Who is
>
> this one that will be confused? Can you give a concrete example that I can
>
> work with? Who is this caller that can take an exception with multiple
>
> causes and to something intelligent about it except simple logging?
>
> If the logging is the only use case then it logging on storage level was
> > proved to be sufficient enough. If you have other use case describe in
>
> less generic terms and then we can find a solution.
> >
>
> Examples given earlier.
I addressed them above.

>
>
> >
> > >
> > > How CQL server will translate such an
> > > > error to CQL protocol error, or should we have separate interface for
> > it?
> > > >
> > >
> > > CQL binary protocol has a "Write_failure" error response. But I think
> > what
> > > we respond with there is orthogonal to whether we should only ever throw
> > > timeout from mutate() or not.
> > This is a good counterargument to what you are saying. Only storage proxy
> > knows
> > when to answer with write_timeout and when with write_failure.
> > write_failure
> > is a recent addition to the protocol to allow storage proxy to fail
> > earlier, but only storage proxy knows what "earlier" is.
> >
>
> I don't understand. Is your point that we should wait before responding
> with a timeout until we reach a time point at which the request would
> normally timeout, and only storage proxy knows for how long to wait, so it
> has to make a decision?
>
No, my point is that layers above storage proxy cannot translate a non-CQL
exception that contains a bunch of reasons why the write failed into a proper
write_timeout/write_failure exception (without redoing a lot of the work
storage proxy did). Only storage proxy knows whether the operation timed out
(in which case write_timeout should be returned) or enough errors were
received before the timeout was reached that sp knows in advance that the
operation will fail (in which case write_failure should be returned, but
this is a protocol v4 feature only and will be mapped to write_timeout if v3
is negotiated). Thus this is what storage_proxy should return for any
intermittent failure.

>
> >
> > But lest say there is a caller that benefits from knowing all the internal
> > reasons we can extend cassandra_exception to hold an array of errors
> > and whoever wants is may explore it. But the churn is not worth it
> > without a real benefit which I simple do not see.
> >
> > Anyway the discussion should not hold the fix for the current issue to
> > go in since it fits current behaviour.
>
>
> Does Cassandra have the same problem which this patch aims to fix? What's
> their solution?
Cassandra has the same interface towards upper layers as we do. It
throws CQL exceptions. Look at the mutate() definition there:
public static void mutate(Collection<? extends IMutation> mutations, ConsistencyLevel consistency_level, long queryStartNanoTime)
throws UnavailableException, OverloadedException, WriteTimeoutException, WriteFailureException

Looking at the counters code there I see that they
use the usual AbstractWriteResponseHandler mechanism to send the counter
mutation, so they should get WriteTimeoutException in case of a timeout or
WriteFailureException in case an error is detected earlier. The latter
will be translated to WriteTimeoutException for clients that negotiated
protocol v3 or earlier.

--
Gleb.

Tomasz Grabiec

<tgrabiec@scylladb.com>
Oct 17, 2017, 7:18:43 AM10/17/17
to Gleb Natapov, scylladb-dev
We can't run with trace-level logging enabled in prod, so it's not a replacement. 

You may use it when debugging locally, but even there it's better if I don't have to enable trace-level logging on storage proxy to see the error, especially if the problem occurs only at heavy load.

but in this case it does not even needed,
yes bad_alloc caused batchlog_manager error,

How do you know that this bad_alloc belongs to batchlog, and not to an unrelated operation which happened at the same time?

but that is not very
helpful knowledge at all because it does not mean that batchlog_manager
caused low memory condition and this is what is interesting here. With
bad_alloc there is a chance that it will not even reach storage_proxy
if it happens in very low rpc or io layer and sp will see either
connection closure or a real timeout because we failed to read from
a local disk.

Even if we agreed that bad_alloc is not interesting, that doesn't matter for my point, because it's not the only exception which can happen. We may get other errors, e.g. "Attempted to mutate using not synced schema". In general, the set of exceptions is open. Lower layers can add new kinds of failures in the future. A middle layer should not make assumptions about all of them not being interesting. It should default to considering all failures as interesting, and only special-case those that we know are not.

bad_alloc is a bit of a weak case. But still. You're right that there are conditions leading to bad_alloc which affect the whole system, and in such conditions it doesn't matter which operations are affected. Those are not the only conditions, though. There are also localized conditions, like an operation trying to allocate too much. Knowing which operations fail allows us to distinguish between those. If you see batchlog replay timing out, that could point to a different problem, not related to other bad_allocs you see on that node. Lower layers are supposed to propagate bad_alloc to the caller, so it shouldn't time out because of a bad_alloc at layers below storage_proxy.



> Also, the failure may be reported in a completely different place, like
> cqlsh console. The one who sees the report may not have easy access to
> logs. Even when he has, it's extra burden to correlate the errors.
>
The point is exactly that such kind of errors should not reach cqlsh
console because it causes application failures (it causes c-s
to go completely crazy by claiming no host is reachable because, I
guess, receiving such en rror is considered as node failure).

If an operation fails with an error on all nodes, then it's right that the application fails the operation. There is no reason for it to believe that retrying will help. It can recognize non-fatal errors, like a timeout, and handle them specially, but the default should be to propagate the failure up to the application layer.

Cassandra propagates failures to cqlsh, it has this write_failure response. It only responds with a timeout when the protocol doesn't support it.
 

> There is also a semantic problem with always responding with a timeout on
> failure. The algorithm using storage_proxy may care if the operation failed
> or timed out. It can attempt to retry on timeout, but it should not retry
> if there is no chance of succeeding, e.g. if the operation fails on replica
> side due to incorrect request.
>
There should not be such failures, any such failure is a bug, so the
example is somewhat contrived

It could be a bug somewhere, but we still should propagate information that we hit it, and which operation hit it.
 
and besides returning only one error from
multiple that may happen will make it not work for the case either.

Why not? It's enough that we distinguish a timeout failure from a non-timeout failure.
 

> This is a  general talk without any real use case
> > which is hard to argue about.
>
>
> Error propagation is a very general problem.
>
And should not be taken to extreme at that. Lets look at bad_alloc
error. There is so much more info it can provide besides simple fact
that allocation failed: did it fail because there was really no more
memory or did it fail because fragmentation, what was allocation size
that failed and how hard LSA try to shrink before giving up. So much
info that can be included with bad_alloc and it is useful to debug real
problems, that is why we have loggers to trace bad_alloc events, but
no one propose to add all this to bad_alloc error propagation.

Propagating the exception which is a direct cause of the failure could hardly be called an extreme. In many cases it is enough to significantly narrow down investigation efforts. We want to improve on information which is attached to exceptions, not reduce it.


>
> > Storage proxy has a lot of logging and
> > at trace level it allows to see the entire lifetime of a query request
> >
>
> Trace-level logging doesn't solve the problem of correlating errors across
> layers (e.g. with cqlsh output)
>
> Also, it's not enabled by default, and we may not be allowed to enable it
> in production. Failures could be hard to reproduce.
>
Trace level gives you much more info then one error message. 
Critical errors are logged at higher levels.

This patch turns exceptions into timeouts; where would the original errors get logged at higher levels?
 

>
> > which until now was enough for debugging any issue.
>
>
> It doesn't mean it always will be, and that we shouldn't improve on this.
>
> Recently we had a user who reported seeing in cqlsh an error with message
> "Not implemented: INDEXES". If that was a timeout error instead, that user
> and we would waste more time before we got to the root cause.
>
This is not intermittent error and should be reported as such. It will
cause application to fail and this is desirable behaviour. Reporting
with anything but write_timeout (or write_failed with v4) on a temporary
communication error kills an application for no reason. If user hits
an unimplemented path in a storage_proxy of cause it should not be
reported as timeout and this is not what the patch is doing.

This patch is converting all uncaught exceptions to timeouts, treating them as intermittent failures. It's not right, because not all such errors are intermittent. It should assume they're not, and special-case the few which we recognize as expected.

Why not? We only need to know whether there was a timeout or a hard failure. On the latter, we exit regardless of the cause.
 

> even better, but returns on doing that are diminishing. For CL=1 operation,
The difference in or views that that IMO propagating even one already
has diminishing returns.

> it is as good. For CL>1, in many cases if there is one failure, chances are
> all will be the same (e.g. bad request, serialization error). Each failure
Those are not error conditions, those are bugs (do not remember them
even happening in anything but development). They are logged at very
high level.

Not sure what your point is here. The case of bugs is where the error is most interesting. We want to see the error and correlate it with the affected operations.
 
Much more common reasons are connection closed and bad_alloc
and they can happen simultaneously (both may be a result of memory
shortage/fragmentation).

Just because there are errors which are not interesting, doesn't mean we should silence all errors. We should silence the uninteresting errors, after careful consideration of each case.
 

> is an abnormal condition, so even if the failures are different, knowing
> one of the causes already points at some problem which should probably be
> fixed. We have a similar problem in the read path, which uses
> parallel_for_each().
>
> But regardless of whether we propagate one cause of failure, all of them,
> or none, just distinguishing a timeout from a non-timeout failure is
> already an improvement.
>
For that we need to implement write_failure error which was added in v4. 
It allows storage_proxy to give up on a request earlier if it sees that
there is no chance to fulfill a request because of errors. It is still
high level CQL error, it does not list all the reasons why it is thrown.

We can still have the cause attached to write_failure exception (e.g. using throw_with_nested). This way internal users will have access to it. CQL server can ignore it.
 
This is orthogonal for the fix here since for compatibility with v3 it
is translated to timeout error, so replying with timeout is valid.

It's only valid when responding to a CQL v3 request, where there is no choice. We shouldn't do this to internal users.
I don't see a conflict between attaching causes and throwing a CQL exception like write_failure from storage_proxy. We can attach a cause to write_failure.
 

>
> >
> > But lest say there is a caller that benefits from knowing all the internal
> > reasons we can extend cassandra_exception to hold an array of errors
> > and whoever wants is may explore it. But the churn is not worth it
> > without a real benefit which I simple do not see.
> >
> > Anyway the discussion should not hold the fix for the current issue to
> > go in since it fits current behaviour.
>
>
> Does Cassandra have the same problem which this patch aims to fix? What's
> their solution?
Cassandra has the same interface towards upper layers as we do. It
throws CQL exception. Look at mutate() definition there:
  public static void mutate(Collection<? extends IMutation> mutations, ConsistencyLevel consistency_level, long queryStartNanoTime)
    throws UnavailableException, OverloadedException, WriteTimeoutException, WriteFailureException

Looking at counters code there I see that they
use usual AbstractWriteResponseHandler mechanism to send counter
mutation, so they should get WriteTimeoutException in case of a timeout or
WriteFailureException in case an error is detected earlier. The later
will be translated to WriteTimeoutException for clients that negotiated
protocol v3 or earlier.

I think we should also distinguish between a timeout and a failure at the storage proxy level, and in addition to that, attach the original exception to write_failure inside the catch(...) clause.

Gleb Natapov

<gleb@scylladb.com>
Oct 17, 2017, 8:31:43 AM10/17/17
to Tomasz Grabiec, scylladb-dev
Only tracing gives you the whole picture. Of course we cannot run with
tracing all of the time, but with a persistent problem enabling it is a
no-brainer. Printing some additional info, like "one node had a
bad_alloc", will not add much either; you will see those bad_allocs
in other places anyway. And in many more cases the timeout will be
because of a real timeout, but that real timeout may still be caused by
a bad_alloc on another node, and you will not get that information.

> You may use it when debugging locally, but even there it's better if I
> don't have to enable trace-level logging on storage proxy to see the error,
> especially if the problem occurs only at heavy load.
>
> but in this case it does not even needed,
> > yes bad_alloc caused batchlog_manager error,
>
>
> How do you know that this bad_alloc belongs to batchlog, and not to an
> unrelated operation which happened at the same time?
It may have belonged to batchlog, but it was not caused by batchlog.
Batchlog is only an unintended victim. Printing "batchlog failed because of
bad_alloc" does not give you any additional information about why the error
happened beyond what we have now.

>
> but that is not very
> > helpful knowledge at all because it does not mean that batchlog_manager
> > caused low memory condition and this is what is interesting here. With
>
> bad_alloc there is a chance that it will not even reach storage_proxy
> > if it happens in very low rpc or io layer and sp will see either
> > connection closure or a real timeout because we failed to read from
> > a local disk.
> >
>
> Even if we agreed that bad_alloc is not interesting, that doesn't matter
> for point, because it's not the only exception which can happen. We may get
> other errors, e.g. "Attempted to mutate using not synced schema". In
> general, the set of exceptions is open. Lower layers can add new kinds of
> failures in the future. A middle layer should not make assumptions about
> all of them not being interesting. It should default to considering all
> failures as interesting, and only special case those that we know are not.
>
A mid layer should log them with enough information about why the error
happened. I believe we already do that for that particular error (which
usually indicates a bug, not something higher layers should deal with).

> bad_alloc is a bit of a weak case. But still. You're right that there are
> conditions leading to bad_alloc which affect the whole system, and in such
> conditions it doesn't matter which operations are affected. Those are not
> the only conditions though. There are also localized conditions like an
> operation trying to allocate too much. Knowing which operations fail allows
> to distinguish between those. If you see batchlog replay timing out, that
> could point to a different problem, not related to other bad_allocs you see
> on that node. Lower layers are supposed to propagate bad_alloc to the
> caller, so it shouldn't timeout because of bad_alloc at layers below
> storage_proxy.
If you have bad_allocs and batchlog replay timing out simultaneously, it
would not be wise to look at the batchlog error without resolving the
bad_allocs, because 99.9999% of the time the former is caused by the latter,
even if the former may report some other weird kind of error (because that
weird kind of error is likely caused by bad_alloc).

Batchlog with bad_alloc is simply a bad example to make your point.

>
>
>
> > > Also, the failure may be reported in a completely different place, like
> > > cqlsh console. The one who sees the report may not have easy access to
> > > logs. Even when he has, it's extra burden to correlate the errors.
> > >
> > The point is exactly that such kind of errors should not reach cqlsh
> > console because it causes application failures (it causes c-s
> > to go completely crazy by claiming no host is reachable because, I
> > guess, receiving such en rror is considered as node failure).
> >
>
> If an operation fails with an error no all nodes, then it's right that the
> application fails the operation. There is no reason for it to believe that
> retrying will help. It can recognize non-fatal errors, like a timeout, and
> handle them specially, but the default should be to propagate the failure
> up to the application layer.
>
It should be propagated as a CQL write error, either write_timeout or
write_failure (for protocol v4 or later), not as a random internal error,
which is what we do now.

> Cassandra propagates failures to cqlsh, it has this write_failure response.
> It only responds with a timeout when the protocol doesn't support it.
>
I have the impression you are completely ignoring what I am writing, so I'll
summarize what I already wrote many times in this thread. Yes, we
should implement write_failure/read_failure. They are not implemented
because they are a later addition and were not present in Origin. They
are only available if protocol v4 is negotiated. If an older protocol is
negotiated, write_failure is translated to write_timeout, so since we do not
yet have write_failure/read_failure it is legal to send write_timeout
until write_failure (and the compatibility layer) is implemented. Note
though that even for write_failure no additional info is returned to
cqlsh; the message encoding does not include it.

>
> >
> > > There is also a semantic problem with always responding with a timeout on
> > > failure. The algorithm using storage_proxy may care if the operation
> > failed
> > > or timed out. It can attempt to retry on timeout, but it should not retry
> > > if there is no chance of succeeding, e.g. if the operation fails on
> > replica
> > > side due to incorrect request.
> > >
> > There should not be such failures, any such failure is a bug, so the
> > example is somewhat contrived
>
>
> It could be a bug somewhere, but we still should propagate information that
> we hit it, and which operation hit it.
>
>
We should log what we hit with all the context around it. Propagating it up
will only lose some context, and the caller cannot deal with it anyway since
this is a bug.

> > and besides returning only one error from
> > multiple that may happen will make it not work for the case either.
> >
>
> Why not? It's enough that we distinguish a timeout failure from a
> non-timeout failure.
>
>
And what if the failure we returned is the timeout one? And what if the
timeout is caused by a bug that the caller will hit again and again forever?

> >
> > > This is a general talk without any real use case
> > > > which is hard to argue about.
> > >
> > >
> > > Error propagation is a very general problem.
> > >
> > And should not be taken to extreme at that. Lets look at bad_alloc
> > error. There is so much more info it can provide besides simple fact
> > that allocation failed: did it fail because there was really no more
> > memory or did it fail because fragmentation, what was allocation size
> > that failed and how hard LSA try to shrink before giving up. So much
> > info that can be included with bad_alloc and it is useful to debug real
> > problems, that is why we have loggers to trace bad_alloc events, but
> > no one propose to add all this to bad_alloc error propagation.
> >
>
> Propagating the exception which is a direct cause of the failure could
> hardly be called an extreme. In many cases it is enough to significantly
> narrow down investigation efforts. We want to improve on information which
> is attached to exceptions, not reduce it.
>
Storage proxy is a complicated subsystem that has complex internal
state. When an error happens and the requested operation cannot be done, the
subsystem returns an error indicating that the operation cannot be done.
The internal reason for the failure is logged and not reported.

Memory is a complicated subsystem that has complex internal
state. When an error happens and the requested operation cannot be done, the
subsystem returns an error indicating that the operation cannot be done.
The internal reason for the failure is logged and not reported.

Do you notice any similarities?

>
> > >
> > > > Storage proxy has a lot of logging and
> > > > at trace level it allows to see the entire lifetime of a query request
> > > >
> > >
> > > Trace-level logging doesn't solve the problem of correlating errors
> > across
> > > layers (e.g. with cqlsh output)
> > >
> > > Also, it's not enabled by default, and we may not be allowed to enable it
> > > in production. Failures could be hard to reproduce.
> > >
> > Trace level gives you much more info then one error message.
>
> Critical errors are logged at higher levels.
> >
>
> This patch turns exceptions into timeouts, where would the original errors
> get logged at higher levels?
It is logged at that level by the follow-up patch I sent. I do not
understand what you suggest. Leave things as-is because the real error is
propagated and this is how you like it? But this breaks the Cassandra
driver, so it is not a viable option. A proper CQL write error has to
be returned.

>
>
> >
> > >
> > > > which until now was enough for debugging any issue.
> > >
> > >
> > > It doesn't mean it always will be, and that we shouldn't improve on this.
> > >
> > > Recently we had a user who reported seeing in cqlsh an error with message
> > > "Not implemented: INDEXES". If that was a timeout error instead, that
> > user
> > > and we would waste more time before we got to the root cause.
> > >
> > This is not intermittent error and should be reported as such. It will
> > cause application to fail and this is desirable behaviour. Reporting
> > with anything but write_timeout (or write_failed with v4) on a temporary
> > communication error kills an application for no reason. If user hits
> > an unimplemented path in a storage_proxy of cause it should not be
> > reported as timeout and this is not what the patch is doing.
> >
>
> This patch is converting all uncaught exceptions to timeouts, treating them
> as intermittent failures. It's not right, because not all such errors are
> intermittent. It should assume they're not, and special case for the few
> which we recognize as expected.
>
So it will be OK from your POV to catch only some exceptions? What is
the list? I actually agree with you here, but with a little caveat: we
should filter out exceptions known to be non-intermittent, because the list
of those is finite. In this place it is empty though, as far as I can
tell, but I am open to suggestions.
Because a timeout can be as permanent as any other failure in case of a
bug, and because if we report only one error it may be a timeout and not
something that tells you it is permanent.

>
>
> >
> > > even better, but returns on doing that are diminishing. For CL=1
> > operation,
> > The difference in or views that that IMO propagating even one already
> > has diminishing returns.
> >
> > > it is as good. For CL>1, in many cases if there is one failure, chances
> > are
> > > all will be the same (e.g. bad request, serialization error). Each
> > failure
> > Those are not error conditions, those are bugs (do not remember them
> > even happening in anything but development). They are logged at very
> > high level.
>
>
> Not sure what's you point here. The case of bugs is where the error is most
> interesting. We want to see the error and correlate it with affected
> operations.
>
My point is that higher levels are not supposed to handle them in any
special way, and logging them in the context they happen in, with as much
available info as possible, is a perfectly good solution. It is even
preferable, since error propagation tends to lose context, so logging at
higher levels may have much less info about the actual error.

>
> > Much more common reasons are connection closed and bad_alloc
> > and they can happen simultaneously (both may be a result of memory
> > shortage/fragmentation).
> >
>
> Just because there are errors which are not interesting, doesn't mean we
> should silence all errors. We should silence the uninteresting errors,
> after careful consideration of each case.
>
We should not. We should log them. There is a patch on top that does that.

>
> >
> > > is an abnormal condition, so even if the failures are different, knowing
> > > one of the causes already points at some problem which should probably be
> > > fixed. We have a similar problem in the read path, which uses
> > > parallel_for_each().
> > >
> > > But regardless of whether we propagate one cause of failure, all of them,
> > > or none, just distinguishing a timeout from a non-timeout failure is
> > > already an improvement.
> > >
> > For that we need to implement write_failure error which was added in v4.
>
> It allows storage_proxy to give up on a request earlier if it sees that
> > there is no chance to fulfill a request because of errors. It is still
> > high level CQL error, it does not list all the reasons why it is thrown.
> >
>
> We can still have the cause attached to write_failure exception (e.g. using
> throw_with_nested). This way internal users will have access to it. CQL
> server can ignore it.
If anything I prefer to add it directly to cassandra_exception (by
storing an exception_ptr there). That is much easier than trying to unfold
nested exceptions, especially with multiple causes. But this is not relevant
to this patch.
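
A sketch of the alternative preferred here, with a made-up class name; storing causes directly in the CQL-level exception as shown is a hypothetical extension, not existing code:

  #include <exception>
  #include <string>
  #include <utility>
  #include <vector>

  // Hypothetical: a CQL-level exception that carries the internal failures
  // as plain exception_ptrs, so an internal consumer can walk them directly
  // instead of unfolding nested exceptions.
  class cql_write_error : public std::exception {
      std::string _msg;
      std::vector<std::exception_ptr> _causes;
  public:
      cql_write_error(std::string msg, std::vector<std::exception_ptr> causes)
          : _msg(std::move(msg)), _causes(std::move(causes)) {}

      const char* what() const noexcept override { return _msg.c_str(); }

      // The CQL server can ignore this; internal callers may iterate over the
      // entries and rethrow each one in a try/catch to log it.
      const std::vector<std::exception_ptr>& causes() const noexcept { return _causes; }
  };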

>
>
> > This is orthogonal for the fix here since for compatibility with v3 it
> > is translated to timeout error, so replying with timeout is valid.
> >
>
> It's only valid when responding to a CQL v3 request, where there is no
> choice. We shouldn't do this to internal users.
Internal users do not handle this exception either, so what specific
problem are you trying to solve here?

I much prefer not to have the write_failure implementation be a prerequisite
for this fix, because the fix also has to be backported to 1.7 and
implementing write_failure would be a much larger and riskier backport.
Yes, as I said in some other email, we can make CQL errors carry some more
state about the error if you really insist. But first, I do not see any point
in doing it without a user, and second, this is orthogonal to this patch
because this additional info will not reach clients anyway, and we have a
problem with them right now.

>
>
> >
> > >
> > > >
> > > > But lest say there is a caller that benefits from knowing all the
> > internal
> > > > reasons we can extend cassandra_exception to hold an array of errors
> > > > and whoever wants is may explore it. But the churn is not worth it
> > > > without a real benefit which I simple do not see.
> > > >
> > > > Anyway the discussion should not hold the fix for the current issue to
> > > > go in since it fits current behaviour.
> > >
> > >
> > > Does Cassandra have the same problem which this patch aims to fix? What's
> > > their solution?
> > Cassandra has the same interface towards upper layers as we do. It
> > throws CQL exception. Look at mutate() definition there:
> > public static void mutate(Collection<? extends IMutation> mutations,
> > ConsistencyLevel consistency_level, long queryStartNanoTime)
> > throws UnavailableException, OverloadedException,
> > WriteTimeoutException, WriteFailureException
> >
> > Looking at counters code there I see that they
> > use usual AbstractWriteResponseHandler mechanism to send counter
> > mutation, so they should get WriteTimeoutException in case of a timeout or
> > WriteFailureException in case an error is detected earlier. The later
> > will be translated to WriteTimeoutException for clients that negotiated
> > protocol v3 or earlier.
> >
>
> I think we should also distinguish from timeout and failure at storage
> proxy level, and in addition to that, attach the original exception to
> write_failure inside the catch(...) clause.
As I said numerous times in this thread, yes, we need to implement
write_failure eventually. It is not implemented because it did not exist
in Origin, so this is the usual catch-up that we have to do. It should not
block the fix for a very real problem.

--
Gleb.