Consul ignores service registration ACLs on /agent endpoints

sj...@squareup.com

Feb 25, 2015, 9:02:25 PM
to consu...@googlegroups.com, Anthony Bishopric
We recently upgraded a Consul cluster to v0.5.0 to test the new service ACL feature. Our code uses the /agent endpoints instead of the /catalog endpoints (this is what the documentation recommends), and we believe we've found some bugs relating to ACL enforcement on those endpoints.

Steps to repro:

Launch a consul server with this configuration:

{
  "server": true,
  "datacenter": "dc1",
  "acl_datacenter": "dc1",
  "acl_default_policy": "deny",
  "acl_master_token": "superuser",
  "bootstrap": true,
  "log_level": "trace"
}

Attempt to register a service using the /agent/service/register endpoint, without passing an ACL token:
curl 0.0.0.0:8500/v1/agent/service/register -X PUT -v -d '{"Name":"foo"}'
Consul returns 200 to this request. We would have expected a 403 because no ACL token was provided.
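
For anyone reproducing this from code rather than curl, the same request can be sketched in Python. This is only an illustration: the helper names `build_register_request` and `register` are hypothetical, the base URL `http://127.0.0.1:8500` is an assumption about where the local agent listens, and the token is passed as a `?token=` query parameter exactly as in the curl commands in this thread.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_register_request(name, token=None, base="http://127.0.0.1:8500"):
    """Build (but do not send) a PUT to /v1/agent/service/register.

    `base` is an assumed address for the local agent; adjust as needed.
    """
    url = base + "/v1/agent/service/register"
    if token is not None:
        # Same convention as the curl examples: token as a query parameter.
        url += "?" + urlencode({"token": token})
    body = json.dumps({"Name": name}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="PUT")


def register(name, token=None):
    """Send the request and return the HTTP status code.

    Needs a reachable agent, so it is defined here but not called.
    """
    with urllib.request.urlopen(build_register_request(name, token)) as resp:
        return resp.status
```

Calling `register("foo")` against the server configured above should correspond to the unauthenticated curl request, and `register("bar", token="superuser")` to the authenticated one shown later.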

Now look at the server logs. There are some warnings:

2015/02/26 01:19:29 [DEBUG] http: Request /v1/agent/service/register (267.75µs)
2015/02/26 01:19:29 [WARN] consul.catalog: Register of service 'foo' on 'example.com' denied due to ACLs
2015/02/26 01:19:29 [WARN] agent: Service 'foo' registration blocked by ACLs
2015/02/26 01:20:35 [DEBUG] agent: Service 'foo' in sync

List the services on this node using the /agent/services endpoint, with no ACL token. Instead of returning 403, this also returns a 200. The foo service is listed.

$ curl 0.0.0.0:8500/v1/agent/services | python -m json.tool
{
    "consul": {
        "Address": "",
        "ID": "consul",
        "Port": 8300,
        "Service": "consul",
        "Tags": []
    },
    "foo": {
        "Address": "",
        "ID": "foo",
        "Port": 0,
        "Service": "foo",
        "Tags": null
    }
}

Let's try the /catalog/services endpoint instead. With or without an ACL token, the result is the same: no foo service.

$ curl 0.0.0.0:8500/v1/catalog/services | python -m json.tool
{
    "consul": []
}
$ curl 0.0.0.0:8500/v1/catalog/services?token=superuser | python -m json.tool
{
    "consul": []
}

Bug #1: the /agent endpoints should have returned 403 forbidden, since I was not passing an ACL token.

Now let's try registering the service again using the agent endpoint, but passing an ACL token this time.
curl 0.0.0.0:8500/v1/agent/service/register?token=superuser -X PUT -v -d '{"Name":"bar"}'

This gives an HTTP 200, which is what we expected (because this time the ACL token was passed). But the server logs show the same warnings as before:

2015/02/26 01:27:40 [DEBUG] http: Request /v1/agent/service/register?token=superuser (203.665µs)
2015/02/26 01:27:40 [WARN] consul.catalog: Register of service 'bar' on 'example.com' denied due to ACLs
2015/02/26 01:27:40 [WARN] agent: Service 'bar' registration blocked by ACLs
2015/02/26 01:28:11 [DEBUG] agent: Service 'bar' in sync

And once again, the service is not listed in the catalog:

$ curl 0.0.0.0:8500/v1/catalog/services | python -m json.tool
{
    "consul": []
}
$ curl 0.0.0.0:8500/v1/catalog/services?token=superuser | python -m json.tool
{
    "consul": []
}

Bug #2: even if the /agent endpoint is passed an ACL token, the token is not getting passed along to the catalog RPC endpoint, so the service is still not getting registered properly.

We also have a followup question - what is the functional difference between the catalog and agent endpoints? The docs say to prefer the agent endpoint, but they aren't really clear on what the catalog is or why we should use the agent endpoint instead. Thanks.

Ryan Uber

Feb 25, 2015, 9:37:42 PM
to sj...@squareup.com, consu...@googlegroups.com, Anthony Bishopric
Thanks for reporting this issue. There are a few things in play here, let me explain:

Bug #1: The 200 from the agent endpoint is actually not erroneous, because the service is in fact registered with the agent. The main problem you are facing is that the agent is unable to sync the service into the catalog, which is why you see the warnings in the server log. A 200 is returned because the service registration to the catalog is done asynchronously, and could eventually be completed by the anti-entropy mechanism between the agent and server.

Bug #2: This is a real problem. We actually have two separate but related tickets opened for this very symptom. You can find these tickets here:


This bug will be addressed very soon, and we are planning a 0.5.1 bugfix release in the near future which will include this fix.

WRT the /v1/agent vs /v1/catalog endpoints, the catalog endpoints are mainly for internal use, and represent service state and metadata moreso than configuration. The catalog can be thought of as a cluster-wide view of services, while the agent endpoints are scoped to the local node. One other thing to note is that the agents are authoritative over the catalog, meaning that whenever an agent successfully runs anti-entropy or syncs services and checks to the catalog, the agent state will always win. By registering a check directly with the catalog endpoint, you are not only greatly limited in the configuration options you can pass in, but your registration will likely be corrected by the agent during its next anti-entropy sync.

Hope this helps.
- Ryan.
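
The "agent state will always win" point can be illustrated with a toy model. This is a hedged sketch and not Consul's implementation: `anti_entropy_sync` is a hypothetical function, services are plain dicts keyed by service ID, and it assumes a single node whose agent is the only source of truth for that node's catalog entries.

```python
def anti_entropy_sync(agent, catalog):
    """Toy model of one successful sync for a single node.

    agent / catalog: dicts mapping service ID -> service definition.
    Returns (new_catalog, changes). The agent's view is treated as
    authoritative: missing entries are added, differing entries are
    overwritten, and catalog-only entries are removed.
    """
    changes = []
    synced = dict(catalog)
    for sid, svc in agent.items():
        if sid not in synced:
            changes.append(("add", sid))
        elif synced[sid] != svc:
            changes.append(("overwrite", sid))
        synced[sid] = dict(svc)
    for sid in list(synced):
        if sid not in agent:
            # Registered directly in the catalog, unknown to the agent.
            changes.append(("remove", sid))
            del synced[sid]
    return synced, changes
```

In this model, an entry written straight to the catalog (or edited there) does not survive the next sync, which is the "corrected by the agent" behavior described above.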

Stephen Jung

Feb 25, 2015, 9:49:05 PM
to Ryan Uber, consu...@googlegroups.com, Anthony Bishopric
Ah thanks, #734 is definitely what we're seeing. The clarification about the catalog also makes sense.

As for bug #1, shouldn't the agent be rejecting the service registration because no ACL token was passed? The 200 makes sense in that the local registration succeeded, but it seems to me that it shouldn't have succeeded at all due to a lack of authorization.
--
Stephen

Michael Fischer

Mar 2, 2015, 7:40:50 PM
to Ryan Uber, sj...@squareup.com, consu...@googlegroups.com, Anthony Bishopric
On Wed, Feb 25, 2015 at 6:37 PM, Ryan Uber <ry...@hashicorp.com> wrote:

WRT the /v1/agent vs /v1/catalog endpoints, the catalog endpoints are mainly for internal use, and represent service state and metadata moreso than configuration. The catalog can be thought of as a cluster-wide view of services, while the agent endpoints are scoped to the local node. One other thing to note is that the agents are authoritative over the catalog, meaning that whenever an agent successfully runs anti-entropy or syncs services and checks to the catalog, the agent state will always win. By registering a check directly with the catalog endpoint, you are not only greatly limited in the configuration options you can pass in, but your registration will likely be corrected by the agent during its next anti-entropy sync.

This would be extremely useful to note in the API documentation.  Would you consider adding it?  Last I looked, there was a brief mention of "anti-entropy" but no discussion of what that meant that's as concise and clear as what you wrote above.

Thanks,

--Michael

Anthony Bishopric

Mar 2, 2015, 8:11:19 PM
to Michael Fischer, Ryan Uber, Stephen Jung, consu...@googlegroups.com
Hi Ryan, thanks for your reply. Do you happen to have an answer for the second question Stephen had?

As for bug #1, shouldn't the agent be rejecting the service registration because no ACL token was passed? The 200 makes sense in that the local registration succeeded, but it seems to me that it shouldn't have succeeded at all due to a lack of authorization.

Also +1 for a documented anti-entropy explanation.

Thanks for your help,
Anthony

Ryan Uber

Mar 2, 2015, 8:51:35 PM
to Anthony Bishopric, Michael Fischer, Stephen Jung, consu...@googlegroups.com
Hi,

Sorry for missing the last response on this thread! I’ll pull together some more in-depth documentation to clarify the agent and how anti-entropy works. This will be a new section which we can link to from the API docs where necessary.

Regarding the HTTP status code, I think it might make more sense after a documentation update, but the short answer is that the service actually *is* registered with the agent successfully, which is why we return a 200. ACLs are not checked by the agent, but by the consul server. This process is done in the background after the service has been registered with the agent. I’ll think more on how we can better handle the case where an invalid or untrusted token is provided, since in this case the situation may not ever be resolved without more manual intervention. Unfortunately it’s not just a return code that we can swap out to remedy the situation.

- Ryan
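
The sequence described above (local registration returns 200, the ACL check happens later on the server) can be modeled in a few lines. This is a simplified sketch under stated assumptions, not Consul code: the `Server` and `Agent` classes are hypothetical, the ACL check is reduced to "is the token in a set of valid tokens", and the agent is assumed to forward the token it was given, which is the intended behavior rather than what 0.5.0 did per bug #2.

```python
class Server:
    """Toy stand-in for a Consul server with acl_default_policy = deny."""

    def __init__(self, valid_tokens):
        self.valid_tokens = set(valid_tokens)
        self.catalog = {}

    def catalog_register(self, service, token):
        # The ACL decision is made here, on the server side.
        if token not in self.valid_tokens:
            return False
        self.catalog[service] = {}
        return True


class Agent:
    """Toy stand-in for a local agent."""

    def __init__(self, server):
        self.server = server
        self.local = {}  # service name -> token supplied at registration
        self.log = []

    def register(self, service, token=None):
        # Local registration always succeeds: no ACL check at this point.
        self.local[service] = token
        return 200

    def sync(self):
        # Background step: push local state to the catalog.
        for service, token in self.local.items():
            if self.server.catalog_register(service, token):
                self.log.append(f"Service '{service}' in sync")
            else:
                self.log.append(f"Service '{service}' registration blocked by ACLs")
```

Registering `foo` without a token returns 200 in this model, yet after `sync()` it exists only in `agent.local` and never in `server.catalog`, which mirrors the state reported at the top of the thread.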

Michael Fischer

Mar 2, 2015, 9:59:24 PM
to Ryan Uber, Anthony Bishopric, Stephen Jung, consu...@googlegroups.com
Can the request be made synchronous so that the agent waits for a response from the servers before returning to the caller?  That way you could get useful HTTP responses and not have to resort to callbacks (which aren't a good fit here IMO).

Ryan Uber

Mar 3, 2015, 8:06:35 PM
to Michael Fischer, Anthony Bishopric, Stephen Jung, consu...@googlegroups.com
I spent some time today to create some documentation to help clarify anti-entropy and its inner mechanics. You can find the doc on the live consul website, at https://consul.io/docs/internals/anti-entropy.html.

I think reading the page above will help clarify why a simple synchronous call to the catalog registration endpoint is a rather tricky operation. What it boils down to is that the anti-entropy sync is a multi-step operation which may or may not include unrelated services and checks, and which cannot be performed transactionally with the agent registration. That limits our ability to recover or return useful response codes in partial error scenarios.

- Ryan

Michael Fischer

Mar 3, 2015, 8:13:18 PM
to Ryan Uber, Anthony Bishopric, Stephen Jung, consu...@googlegroups.com
Is /v1/agent/service/register affected by this?  Or just /v1/catalog/register?

If only catalog registration is impacted, maybe it's not such a big deal.  

Maybe direct catalog registrations should be deprecated?  Or is there a valid use case for it?

--Michael

Stephen Jung

Mar 3, 2015, 8:18:00 PM
to Michael Fischer, Ryan Uber, Anthony Bishopric, consu...@googlegroups.com
/v1/agent/service/register is the affected endpoint.

It does seem difficult to resolve the error when the catalog is updated asynchronously, but at the same time, the current API is really difficult to use correctly. We basically can't use /v1/agent/service/register with ACLs because, if the cluster is denying access and someone makes an unauthenticated request, the affected agent will essentially have a zombie service registered forever, that never reaches the catalog. The error will never be realized until a human operator checks the logs of the agent and realizes that the service is not synchronized. (This would be true even after #734 was fixed.)

In addition, if the service registers a malicious check, wouldn't the agent actually run the check because the service was registered locally? The results of the check would never be updated to the catalog, but the malicious check would still be executed. This is something that ACLs should be capable of preventing.
--
Stephen
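
Until the agent can fail fast, one way to surface such zombie services without reading logs is to compare the two listings shown earlier in the thread. The sketch below is only an illustration: `unsynced_services` is a hypothetical helper that takes the already-parsed JSON from /v1/agent/services and /v1/catalog/services. Note that /v1/catalog/services is cluster-wide, so a service with the same name on another node would hide the problem, and a service registered moments ago may simply not have synced yet.

```python
def unsynced_services(agent_services, catalog_services):
    """Return names of services the agent holds that the catalog lacks.

    agent_services: parsed JSON from /v1/agent/services
                    (service ID -> definition containing "Service").
    catalog_services: parsed JSON from /v1/catalog/services
                      (service name -> list of tags).
    """
    agent_names = {svc["Service"] for svc in agent_services.values()}
    return sorted(agent_names - set(catalog_services))


# Data copied from the outputs posted earlier in this thread.
agent_view = {
    "consul": {"Address": "", "ID": "consul", "Port": 8300,
               "Service": "consul", "Tags": []},
    "foo": {"Address": "", "ID": "foo", "Port": 0,
            "Service": "foo", "Tags": None},
}
catalog_view = {"consul": []}

print(unsynced_services(agent_view, catalog_view))
```

With the data from the original report, the only name returned is `foo`.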

Michael Fischer

Mar 3, 2015, 8:20:22 PM
to Stephen Jung, Ryan Uber, Anthony Bishopric, consu...@googlegroups.com
I agree, from a usability point of view there simply has to be some sort of fail-fast mechanism if the input cannot be validated (due to ACL violation or any other reason).

Ryan Uber

Mar 3, 2015, 8:26:52 PM
to Michael Fischer, Stephen Jung, Anthony Bishopric, consu...@googlegroups.com
I’ve opened https://github.com/hashicorp/consul/issues/752 about this after a conversation with Armon. I think what we can do is eventually support a pre-flight check of the token for service registrations from the agent, which would allow us to test whether or not we could successfully register the service or check. In this case, we could abort both registrations if the pre-flight check failed. This would basically be another round-trip to the server and back over RPC to validate the token.
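
A rough sketch of what such a pre-flight check could look like, purely as an illustration of the idea and not of any planned implementation: `register_with_preflight` is a hypothetical function, `token_allows` stands in for the extra RPC round trip to the server, and returning 403 on failure is an assumption based on what the original reporters expected.

```python
def register_with_preflight(local_services, service, token, token_allows):
    """Register a service locally only if the token passes a pre-flight check.

    local_services: dict acting as the agent's local state.
    token_allows: callable(token, service) -> bool; a placeholder for the
                  round trip to the server that validates the token.
    Returns an HTTP-style status code (403 on denial is an assumption).
    """
    if not token_allows(token, service):
        # Abort before touching local state, so no zombie service is left
        # behind and no check attached to it would ever be run.
        return 403
    local_services[service] = token
    return 200
```

For example, with `token_allows = lambda token, service: token == "superuser"`, an unauthenticated registration of `foo` is rejected and leaves `local_services` empty, while `bar` with the `superuser` token is accepted.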