Fwd: Heads up: KAFKA-1697 - remove code related to ack>1 on the broker

54 views
Skip to first unread message

Joe Stein

unread,
Jan 14, 2015, 7:04:24 PM1/14/15
to kafka-...@googlegroups.com
FYI

---------- Forwarded message ----------
From: Gwen Shapira <gsha...@cloudera.com>
Date: Wed, Jan 14, 2015 at 6:51 PM
Subject: Heads up: KAFKA-1697 - remove code related to ack>1 on the broker
To: "d...@kafka.apache.org" <d...@kafka.apache.org>


Hi Kafka Devs,

We are working on KAFKA-1697 - remove code related to ack>1 on the
broker. Per Neha's suggestion, I'd like to give everyone a heads up on
what these changes mean.

Once this patch is included, any produce requests that include
request.required.acks > 1 will result in an exception. This will be
InvalidRequiredAcks in new versions (0.8.3 and up, I assume) and
UnknownException in existing versions (sorry, but I can't add error
codes retroactively).

This behavior is already enforced by 0.8.2 producers (sync and new),
but we expect impact on users with older producers that relied on acks
> 1 and external clients (i.e python, go, etc).

Users who relied on acks > 1 are expected to switch to using acks = -1
and a min.isr parameter than matches their user case.

This change was discussed in the past in the context of KAFKA-1555
(min.isr), but let us know if you have any questions or concerns
regarding this change.

Gwen

Joe Stein

unread,
Jan 15, 2015, 1:10:08 PM1/15/15
to d...@kafka.apache.org, kafka-...@googlegroups.com
Looping in the mailing list that the client developers live on because they are all not on dev (though they should be if they want to be helping to build the best client libraries they can).

I whole hardily believe that we need to not break existing functionality of the client protocol, ever.

There are many reasons for this and we have other threads on the mailing list where we are discussing that topic (no pun intended) that I don't want to re-hash here.

If we change wire protocol functionality OR the binary format (either) we must bump version AND treat version as a feature flag with backward compatibility support until it is deprecated for some time for folks to deal with it.

match version = {
case 0: keepDoingWhatWeWereDoing()
case 1: doNewStuff()
case 2: doEvenMoreNewStuff()
}

has to be a practice we adopt imho ... I know feature flags can be construed as "messy code" but I am eager to hear another (better? different?) solution to this.

If we don't do a feature flag like this specifically with this change then what happens is that someone upgrades their brokers with a rolling restart in 0.8.3 and every single one of their producer requests start to fail and they have a major production outage. eeeek!!!!

I do 100% agree that > 1 makes no sense and we *REALLY* need people to start using 0,1,-1 but we need to-do that in a way that is going to work for everyone.

Old producers and consumers must keep working with new brokers and if we are not going to support that then I am unclear what the use of "version" is based on our original intentions of having it because of the 0.7=>-0.8. We said no more breaking changes when we did that.

- Joe Stein

On Thu, Jan 15, 2015 at 12:38 PM, Ewen Cheslack-Postava <ew...@confluent.io> wrote:
Right, so this looks like it could create an issue similar to what's
currently being discussed in
https://issues.apache.org/jira/browse/KAFKA-1649 where users now get errors
under conditions when they previously wouldn't. Old clients won't even know
about the error code, so besides failing they won't even be able to log any
meaningful error messages.

I think there are two options for compatibility:

1. An alternative change is to remove the ack > 1 code, but silently
"upgrade" requests with acks > 1 to acks = -1. This isn't the same as other
changes to behavior since the interaction between the client and server
remains the same, no error codes change, etc. The client might just see
some increased latency since the message might need to be replicated to
more brokers than they requested.
2. Split this into two patches, one that bumps the protocol version on that
message to include the new error code but maintains both old (now
deprecated) and new behavior, then a second that would be applied in a
later release that removes the old protocol + code for handling acks > 1.

2 is probably the right thing to do. If we specify the release when we'll
remove the deprecated protocol at the time of deprecation it makes things a
lot easier for people writing non-java clients and could give users better
predictability (e.g. if clients are at most 1 major release behind brokers,
they'll remain compatible but possibly use deprecated features).



On Wed, Jan 14, 2015 at 3:51 PM, Gwen Shapira <gsha...@cloudera.com> wrote:

> Hi Kafka Devs,
>
> We are working on KAFKA-1697 - remove code related to ack>1 on the
> broker. Per Neha's suggestion, I'd like to give everyone a heads up on
> what these changes mean.
>
> Once this patch is included, any produce requests that include
> request.required.acks > 1 will result in an exception. This will be
> InvalidRequiredAcks in new versions (0.8.3 and up, I assume) and
> UnknownException in existing versions (sorry, but I can't add error
> codes retroactively).
>
> This behavior is already enforced by 0.8.2 producers (sync and new),
> but we expect impact on users with older producers that relied on acks
> > 1 and external clients (i.e python, go, etc).
>
> Users who relied on acks > 1 are expected to switch to using acks = -1
> and a min.isr parameter than matches their user case.
>
> This change was discussed in the past in the context of KAFKA-1555
> (min.isr), but let us know if you have any questions or concerns
> regarding this change.
>
> Gwen
>



--
Thanks,
Ewen

Mark Roberts

unread,
Jan 15, 2015, 1:24:52 PM1/15/15
to Joe Stein, d...@kafka.apache.org, kafka-...@googlegroups.com
This would sting a whole lot less if there was a programmatic way to get what server version is in use. Also, how will this work in mixed version clusters (during an upgrade, for example)?
--
You received this message because you are subscribed to the Google Groups "kafka-clients" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kafka-client...@googlegroups.com.
To post to this group, send email to kafka-...@googlegroups.com.
Visit this group at http://groups.google.com/group/kafka-clients.
To view this discussion on the web visit https://groups.google.com/d/msgid/kafka-clients/CAA7ooCBtH2JjyQsArdx_%3DV25B4O1QJk0YvOu9U6kYt9sB4aqng%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Magnus Edenhill

unread,
Jan 15, 2015, 1:40:30 PM1/15/15
to d...@kafka.apache.org, kafka-...@googlegroups.com

I very much agree on what Joe is saying, let's use the version field as intended
and be very strict with not removing nor altering existing behaviour without bumping the version.
Old API versions could be deprecated (documentation only?) immediately and removed completely in the next minor version bump (0.8->0.9).

An API to query supported API versions would be a good addition in the long run but doesn't help current clients
much as such a request to an older broker version will kill the connection without any error reporting to the client, thus making it rather useless in the short term.

Regards,
Magnus

Felix GV

unread,
Jan 15, 2015, 2:10:31 PM1/15/15
to Magnus Edenhill, d...@kafka.apache.org, kafka-...@googlegroups.com
Next time the protocol is evolved and new error codes can be introduced, would it make sense to add a new one called "Deprecated" (or "Deprecation" or "DeprecatedOperation" or whatever sounds best)? This would act as a more precise form of "Unknown" error. It could help identify what the problem is more easily when debugging clients.

Of course, this is the kind of lever one would prefer never pulling, but when you need it, you're better off having it than not, and if you end up having it and never using it, it does not do much harm either.


--
 
Felix GV
Data Infrastructure Engineer
Distributed Data Systems
LinkedIn
 
f...@linkedin.com
linkedin.com/in/felixgv

From: kafka-...@googlegroups.com [kafka-...@googlegroups.com] on behalf of Magnus Edenhill [mag...@edenhill.se]
Sent: Thursday, January 15, 2015 10:40 AM
To: d...@kafka.apache.org
Cc: kafka-...@googlegroups.com
Subject: [kafka-clients] Re: Heads up: KAFKA-1697 - remove code related to ack>1 on the broker

--
You received this message because you are subscribed to the Google Groups "kafka-clients" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kafka-client...@googlegroups.com.
To post to this group, send email to kafka-...@googlegroups.com.
Visit this group at http://groups.google.com/group/kafka-clients.

Gwen Shapira

unread,
Jan 15, 2015, 3:07:28 PM1/15/15
to Joe Stein, d...@kafka.apache.org, kafka-...@googlegroups.com
Is the protocol bump caused by the behavior change or the new error code?

1) IMO, error_codes are data, and clients can expect to receive errors
that they don't understand (i.e. unknown errors). AFAIK, clients don't
break on unknown errors, they are simple more challenging to debug. If
we document the new behavior, then its definitely debuggable and
fixable.

2) The behavior change is basically a deprecation - i.e. acks > 1 were
never documented, and are not supported by Kafka clients starting with
version 0.8.2. I'm not sure this requires a protocol bump either,
although its a better case than new error codes.

Thanks,
Gwen
> --
> You received this message because you are subscribed to the Google Groups
> "kafka-clients" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to kafka-client...@googlegroups.com.
> To post to this group, send email to kafka-...@googlegroups.com.
> Visit this group at http://groups.google.com/group/kafka-clients.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/kafka-clients/CAA7ooCBtH2JjyQsArdx_%3DV25B4O1QJk0YvOu9U6kYt9sB4aqng%40mail.gmail.com.

Dana Powers

unread,
Jan 15, 2015, 3:52:52 PM1/15/15
to Gwen Shapira, Joe Stein, d...@kafka.apache.org, kafka-...@googlegroups.com
> clients don't break on unknown errors

maybe true for the official java clients, but I dont think the assumption holds true for community-maintained clients and users of those clients.  kafka-python generally follows the fail-fast philosophy and raises an exception on any unrecognized error code in any server response.  in this case, kafka-python allows users to set their own required-acks policy when creating a producer instance.  It is possible that users of kafka-python have deployed producer code that uses ack>1 -- perhaps in production environments -- and for those users the new error code will crash their producer code.  I would not be surprised if the same were true of other community clients.

*one reason for the fail-fast approach is that there isn't great documentation on what errors to expect for each request / response -- so we use failures to alert that some error case is not handled properly.  and because of that, introducing new error cases without bumping the api version is likely to cause those errors to get raised/thrown all the way back up to the user.  of course we (client maintainers) can fix the issues in the client libraries and suggest users upgrade, but it's not the ideal situation.


long-winded way of saying: I agree w/ Joe.

-Dana


Gwen Shapira

unread,
Jan 15, 2015, 7:59:33 PM1/15/15
to Ewen Cheslack-Postava, d...@kafka.apache.org, Joe Stein, kafka-...@googlegroups.com
Hi,

I got convinced by Joe and Dana that errors are indeed part of the
protocol and can't be randomly added.

So, it looks like we need to bump version of ProduceRequest in the
following way:
Version 0 -> accept acks >1. I think we should keep the existing
behavior too (i.e. not replace it with -1) to avoid surprising
clients, but I'm willing to hear other opinions.
Version 1 -> do not accept acks >1 and return an error.
Are we ok with the error I added in KAFKA-1697? We can use something
less specific like InvalidRequestParameter. This error can be reused
in the future and reduce the need to add errors, but will also be less
clear to the client and its users. Maybe even add the error message
string to the protocol in addition to the error code? (since we are
bumping versions....)

I think maintaining the old version throughout 0.8.X makes sense. IMO
dropping it for 0.9 is feasible, but I'll let client owners help make
that call.

Am I missing anything? Should I start a KIP? It seems like a KIP-type
discussion :)

Gwen


On Thu, Jan 15, 2015 at 2:31 PM, Ewen Cheslack-Postava
<ew...@confluent.io> wrote:
> Gwen,
>
> I think the only option that wouldn't require a protocol version change is
> the one where acks > 1 is converted to acks = -1 since it's the only one
> that doesn't potentially break older clients. The protocol guide says that
> the expected upgrade path is servers first, then clients, so old clients,
> including non-java clients, that may be using acks > 1 should be able to
> work with a new broker version.
>
> It's more work, but I think dealing with the protocol change is the right
> thing to do since it eventually gets us to the behavior I think is better --
> the broker should reject requests with invalid values. I think Joe and I
> were basically in agreement. In my mind the major piece missing from his
> description is how long we're going to maintain his "case 0" behavior. It's
> impractical to maintain old versions forever, but it sounds like there
> hasn't been a decision on how long to maintain them. Maybe that's another
> item to add to KIPs -- protocol versions and behavior need to be listed as
> deprecated and the earliest version in which they'll be removed should be
> specified so users can understand which versions are guaranteed to be
> compatible, even if they're using (well-written) non-java clients.
>
> -Ewen
> Thanks,
> Ewen

Gwen Shapira

unread,
Jan 15, 2015, 8:19:29 PM1/15/15
to d...@kafka.apache.org, Ewen Cheslack-Postava, Joe Stein, kafka-...@googlegroups.com
The errors are part of the KIP process now, so I think the clients are safe :)

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

On Thu, Jan 15, 2015 at 5:12 PM, Steve Morin <steve...@gmail.com> wrote:
> Agree errors should be part of the protocol

Gwen Shapira

unread,
Jan 15, 2015, 9:14:48 PM1/15/15
to d...@kafka.apache.org, Ewen Cheslack-Postava, Joe Stein, kafka-...@googlegroups.com
I created a KIP for this suggestion:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-1+-+Remove+support+of+request.required.acks

Basically documenting what was already discussed here. Comments will
be awesome!

Gwen

Jay Kreps

unread,
Jan 15, 2015, 11:27:28 PM1/15/15
to d...@kafka.apache.org, Ewen Cheslack-Postava, Joe Stein, kafka-...@googlegroups.com
This is a good case to discuss.

Let's figure the general case of how we want to handle errors and get that documented in the protocol. The problem right now is that we give no guidance on this. I actually thought Gwen's suggestion made sense on the guidance we should have given which is that we will enumerate a set of errors and their meaning for each API but it is possible that other errors will occur and they should be handled (maybe poorly) in the same way UNKNOWN_ERROR is handled which is our normal escape hatch for things like OOMException.

I really do think we shouldn't be dogmatic here: In considering a change to errors we should consider the potential ill-effect vs the complexity of yet another protocol version.

In this case I actually am not sure we need to bump the protocol because the whole point of the change was to make a setting we think doesn't make sense break, right? Well this will break it. It seems like the only downside is that older clients will get a bad error message instead of a good one. But it isn't like we will have rendered a client unusable, it is just that they will need to change their config.

-Jay

Ewen Cheslack-Postava

unread,
Jan 16, 2015, 12:37:03 PM1/16/15
to Jay Kreps, d...@kafka.apache.org, Joe Stein, kafka-...@googlegroups.com
Gwen -- KIP write up looks good. Deprecation schedule probably needs to be more specific, but I think that discussion probably needs to happen after a solution is agreed upon.

Jay -- I think "older clients will get a bad error message instead of a good one" isn't what would be happening with this change. Previously they wouldn't have received an error and they would have been able to produce messages. After the change they'll just receive this new error message which their clients can't possibly handle gracefully since it didn't exist when the client was written. Whether the acks > 1 setting was actually accomplishing what they thought doesn't matter. Someone could have reasonably read the docs on 0.8.1.1, thought acks = 2 is an ok setting for their applications, set it as a default across a bunch of apps, then follow the recommended upgrade path of updating brokers to 0.8.2 and all their apps will start failing on produce requests.

--
Thanks,
Ewen

Gwen Shapira

unread,
Jan 16, 2015, 1:40:45 PM1/16/15
to d...@kafka.apache.org, Jay Kreps, Joe Stein, kafka-...@googlegroups.com
How about we continue the discussion on this thread, so we won't lose
the context of this discussion, and put it up for VOTE when this has
been finalized?

On Fri, Jan 16, 2015 at 10:22 AM, Neha Narkhede <ne...@confluent.io> wrote:
> Gwen,
>
> KIP write-up looks good. According to the rest of the KIP process proposal,
> would you like to start a DISCUSS/VOTE thread for it?
>
> Thanks,
> Neha
> --
> Thanks,
> Neha

Gwen Shapira

unread,
Jan 16, 2015, 2:55:27 PM1/16/15
to d...@kafka.apache.org, Jay Kreps, Joe Stein, kafka-...@googlegroups.com
I updated the KIP: Using acks > 1 in version 0 will log a WARN message
in the broker about client using deprecated behavior (suggested by Joe
in the JIRA, and I think it makes sense).

Gwen

Jay Kreps

unread,
Jan 18, 2015, 2:03:05 PM1/18/15
to Gwen Shapira, d...@kafka.apache.org, Joe Stein, kafka-...@googlegroups.com
Hey guys,

I really think we are discussing two things here:
  1. How should we generally handle changes to the set of errors? Should introducing new errors be considered a protocol change or should we reserve the right to introduce new error codes?
  2. Given that this particular change is possibly incompatible, how should we handle it?
I think it would be good for people who are responding here to be specific about which they are addressing.

Here is what I think:

1. Errors should be extensible within a protocol version.

We should change the protocol documentation to list the errors that can be given back from each api, their meaning, and how to handle them, BUT we should explicitly state that the set of errors are open ended. That is we should reserve the right to introduce new errors and explicitly state that clients need a blanket "unknown error" handling mechanism. The error can link to the protocol definition (something like "Unknown error 42, see protocol definition at http://link"). We could make this work really well by instructing all the clients to report the error in a very googlable way as Oracle does with their error format (e.g. "ORA-32") so that if you ever get the raw error google will take you to the definition.

I agree that a more rigid definition seems like "right thing", but having just implemented two clients and spent a bunch of time on the server side, I think, it will work out poorly in practice. Here is why:
  • I think we will make a lot of mistakes in nailing down the set of error codes up front and we will end up going through 3-4 churns of the protocol definition just realizing the set of errors that can be thrown. I think this churn will actually make life worse for clients that now have to figure out 7 identical versions of the protocol and will be a mess in terms of testing on the server side. I actually know this to be true because while implementing the clients I tried to guess the errors that could be thrown, then checked my guess by close code inspection. It turned out that I always missed things in my belief about errors, but more importantly even after close code inspection I found tons of other errors in my stress testing.
  • In practice error handling always involves calling out one or two meaningful failures that have special recovery and then a blanket case that just handles everything else. It's true that some clients may not have done this well, but I think it is for the best if they fix that.
  • Reserving the right to add errors doesn't mean we will do it without care. We will think through each change and decide whether giving a little more precision in the error is worth the overhead and churn of a protocol version bump.
2. In this case in particular we should not introduce a new protocol version

In this particular case we are saying that acks > 1 doesn't make sense and we want to give an error to people specifying this so that they change their configuration. This is a configuration that few people use and we want to just make it an error. The bad behavior will just be that the error will not be as good as it could be. I think that is a better tradeoff than introducing a separate protocol version (this may be true of the java clients too).

We will have lots of cases like this in the future and we aren't going to want to churn the protocol for each of them. For example we previously had to get more precise about which characters were legal and which illegal in topic names.

-Jay

Jun Rao

unread,
Jan 18, 2015, 6:50:45 PM1/18/15
to Jay Kreps, Gwen Shapira, d...@kafka.apache.org, Joe Stein, kafka-...@googlegroups.com
Overall, I agree with Jay on both points.

1. I think it's reasonable to add new error codes w/o bumping up the protocol version. In most cases, by adding new error codes, we are just refining the categorization of those unknown errors. So, a client shouldn't behave worse than before as long as unknown errors have been properly handled.

2. I think it's reasonable to just document that 0.8.2 will be the last release that will support ack > 1 and remove the support completely in trunk w/o bumping up the protocol. This is because (a) we never included ack > 1 explicitly in the documentation and so the usage should be limited; (2) ack > 1 doesn't provide the guarantee that people really want and so it shouldn't really be used.

Thanks,

Jun 


Gwen Shapira

unread,
Jan 18, 2015, 11:24:47 PM1/18/15
to Jun Rao, Jay Kreps, d...@kafka.apache.org, Joe Stein, kafka-...@googlegroups.com
Overall, agree on point #1, less sure on point #2.

1. Some protocols never ever add new errors, while others add errors
without bumping versions. HTTP is a good example of the second type.
HTTP-451 was added fairly recently, there are some errors specific to
NGINX, etc. No one cares. I think we should properly document in the
wire-protocol doc that new errors can be added, and I think we should
strongly suggest (and implement ourselves) that unknown error codes
should be shown to users (or at least logged), so they can be googled
and understood through our documentation.
In addition, hierarchy of error codes, so clients will know if an
error is retry-able just by looking at the code could be nice. Same
for adding an error string to the protocol. These are future
enhancements that should be discussed separately.

2. I think we want to allow admins to upgrade their Kafka brokers
without having to chase down clients in their organization and without
getting blamed if clients break. I think it makes sense to have one
version that will support existing behavior, but log warnings, so
admins will know about misbehaving clients and can track them down
before an upgrade that breaks them (or before the broken config causes
them to lose data!). Hopefully this is indeed a very rare behavior and
we are taking extra precaution for nothing, but I have customers where
one traumatic upgrade means they will never upgrade a Kafka again, so
I'm being conservative.

Gwen

Jun Rao

unread,
Jan 19, 2015, 12:35:55 PM1/19/15
to Gwen Shapira, Jay Kreps, d...@kafka.apache.org, Joe Stein, kafka-...@googlegroups.com
For 2, how about we make a change to log a warning for ack > 1 in 0.8.2 and then drop the ack > 1 support in trunk (w/o bumping up the protocol version)? Thanks,

Jun

Joe Stein

unread,
Jan 19, 2015, 12:46:28 PM1/19/15
to Jun Rao, Gwen Shapira, Jay Kreps, d...@kafka.apache.org, kafka-...@googlegroups.com
<< For 2, how about we make a change to log a warning for ack > 1 in 0.8.2 and then drop the ack > 1 support in trunk (w/o bumping up the protocol version)?

+1

Gwen Shapira

unread,
Jan 19, 2015, 1:00:42 PM1/19/15
to Joe Stein, Jun Rao, Jay Kreps, d...@kafka.apache.org, kafka-...@googlegroups.com
Sounds good to me.
I'll open a new JIRA for 0.8.2 with just an extra log warning, to
avoid making KAFKA-1697 any more confusing.

Gwen Shapira

unread,
Jan 21, 2015, 8:01:22 PM1/21/15
to Joe Stein, Jun Rao, Jay Kreps, d...@kafka.apache.org, kafka-...@googlegroups.com
We have the new warning in 0.8.2.

I updated KIP-1 with the new plan:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1+-+Remove+support+of+request.required.acks

I'm waiting a day for additional discussion and if there are no
replies, I'll send the [VOTE] email.

Gwen
Reply all
Reply to author
Forward
0 new messages