ACL system in legacy mode

Rafael Sierra

Nov 6, 2018, 8:24:51 AM
to Consul
Hi, I am playing with Consul 1.4.0 RC1 using Docker (swarm) with 3 nodes/containers.

My command line is `consul agent -config-file /server.json`, and my `server.json` is:

```
{
  "primary_datacenter": "dc",
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "down_policy": "extend-cache"
  },
  "advertise_addr" : "{{ GetInterfaceIP \"eth0\" }}",
  "autopilot": {"cleanup_dead_servers": true},
  "bind_addr": "{{ GetInterfaceIP \"eth0\" }}",
  "bootstrap_expect": 3,
  "client_addr": "0.0.0.0",
  "data_dir": "/consul/data",
  "datacenter": "viasto",
  "disable_update_check": true,
  "encrypt": "redacted",
  "leave_on_terminate" : true,
  "log_level": "debug",
  "retry_join" : ["consul.server"],
  "server" : true,
  "server_name" : "server",
  "skip_leave_on_interrupt" : true,
  "ui" : true
}
```

I can deploy my 3 nodes locally and everything seems to be running fine, but when I try to call `docker run -e CONSUL_HTTP_ADDR=http://consul_server:8500 --rm --network consul_network consul:1.4.0-rc1 acl bootstrap` I get an error with the following message:

```
Failed ACL bootstrapping: Unexpected response code: 500 (The ACL system is currently in legacy mode.)
```

I could not find any doc on how to disable or upgrade my ACL to "non-legacy" mode.

Paul Banks

Nov 6, 2018, 12:31:28 PM
to consu...@googlegroups.com
Hi Rafael

We are currently working on the upgrade docs for the final RC (should be ready today or tomorrow).

Did you "upgrade" from an older version of Consul with the same state, or are these all fresh servers?

Basically, when 1.4.0 servers start up they begin in "Legacy ACL" mode for compatibility during the upgrade, and they advertise via gossip that they are capable of the new ACLs. When all servers in a DC have advertised that they are ready for 1.4.0 ACLs, the leader transitions the cluster and writes that to the state store so future startups don't need to go through the same process.

So the answer is that it should be automatic once all your servers are running 1.4.0 and are up and healthy. Can you give some more details on how you created them (e.g. do they have old pre-1.4.0 state), and, judging from the logs/UI etc., are all three up and healthy?

Thanks for the feedback, it's really useful.

Paul

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/consul/issues
Community chat: https://gitter.im/hashicorp-consul/Lobby
---
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/consul-tool/20d142a8-dcf0-4dab-9d33-884188aa02c1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Paul Banks

Nov 6, 2018, 12:33:44 PM
to consu...@googlegroups.com
Also, do you get that message consistently if you try to bootstrap a few seconds after the cluster is up? If you were running it in a script, for example, you might be making the call before the gossip has converged and the leader has upgraded ACLs.

Matthew Keeler

Nov 6, 2018, 12:33:50 PM
to consu...@googlegroups.com

Consul 1.4 will auto-transition out of legacy mode on its own. For servers in the primary DC, the requirements are that all of the servers must be running version 1.4.0 or above and all servers must have ACLs enabled. Looking at your setup, it sounds like you meet those criteria. The check for whether the transition can take place happens often when first starting, but the interval increases over time, with a cap of checking once every minute. Did you try to bootstrap immediately after bringing up the servers? If so, what happens if you try to bootstrap after waiting a minute?

Matt
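
The check schedule described above (frequent at first, then backing off to at most once a minute) can be pictured as a capped exponential backoff. The sketch below is only an illustration: the 1-second starting interval and the doubling factor are assumptions, not values stated in this thread or taken from Consul's source. Only the one-minute cap comes from the message above.

```python
from itertools import islice


def check_intervals(initial=1.0, factor=2.0, cap=60.0):
    """Yield successive wait times that grow by `factor` and never exceed `cap`.

    `initial` and `factor` are hypothetical; only `cap` mirrors the
    "once every minute" limit mentioned in the thread.
    """
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# Print the first eight waits under the assumed parameters.
print(list(islice(check_intervals(), 8)))
```

Under a schedule like this, a bootstrap attempted right after startup can land between two checks, which is one reason a retry is more robust than a single fixed sleep.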

Rafael Sierra

Nov 7, 2018, 4:56:50 AM
to Consul
On Tuesday, November 6, 2018 at 6:31:28 PM UTC+1, Paul Banks wrote:
> Hi Rafael
>
> We are currently working on the upgrade docs for the final RC (should be ready today or tomorrow).
>
> Did you "upgrade" from an older version on Consul with the same state or are these all fresh servers?

I am starting a new (docker swarm) stack from scratch using the image consul:1.4.0-rc1:

```
$ docker image ls |grep consul
consul                                                          1.4.0-rc1           3056d529950e        13 days ago         105MB
```

 

> Basically when 1.4.0 servers startup they advertise that they start in "Legacy ACL" mode for compatibility during the upgrade. They advertise that they are now capable of new ACLs via gossip. When all servers in a DC have advertised that they are ready for 1.4.0 ACLs, the leader transitions the cluster and writes that to the state store so future startups don't need to go through the same process.

Ok, I will wait a little longer before trying to call any API and follow the logs for any ACL version upgrade notification. 


> So the answer is it should be automatic once all your servers are running 1.4.0 and are up and healthy. Can you give some more details on how you created them (e.g. do they have old pre-1.4.0 state) and from the logs/UI etc are all three up and healthy?

I am creating all of them by issuing `docker stack deploy -c consul.yml my_consul`, where consul.yml is as follows:

```
version: '3.4'

services:
  server:
    image: consul:1.4.0-rc1
    networks:
        network:
          aliases:
            - consul.server
    configs:
      - server.json
    deploy:
      replicas: 3
      restart_policy:
        delay: 5s
        window: 15s
      update_config:
        failure_action: rollback
        monitor: 10s
        order: start-first
        parallelism: 1
      placement:
        constraints:
          - node.labels.consul == true

    command: "consul agent -config-file /server.json"

networks:
  network:
    driver: overlay
    attachable: true

configs:
  server.json:
    file: server.json
```

And `server.json` is the one from my previous email.

> Also, do you get that message consistently if you try to bootstrap a few seconds after the cluster is up? If you were trying to run it in a script for example it might be that you make the call before the gossip has converged and the leader upgraded ACLs.

I get it consistently every time I start a new cluster. I do use a script to bootstrap, but it runs only after I see in the logs that the agents are up. At the moment the script only makes a call to join the nodes and then tries to bootstrap them. Should I introduce a waiting time between joining and bootstrapping (so all agents can agree on the new ACL version)?

Rafael Sierra

Nov 7, 2018, 4:58:27 AM
to Consul


On Tuesday, November 6, 2018 at 6:33:50 PM UTC+1, Matthew Keeler wrote:

> Consul 1.4 will auto-transition out of legacy mode on its own. For servers in the primary dc the requirements are that all of the servers must be running version 1.4.0 or above and all servers must have ACLs enabled. Looking at your setup it sounds like you meet that criteria. The check for whether transitioning can take place will happen often when first starting but the interval gets increased over time with a cap of checking once every minute. Did you try to bootstrap immediately after bringing up the servers? If so what happens if you try and bootstrap after waiting a minute.

I am indeed trying to bootstrap right after joining all agents. I will add some waiting time between joining and bootstrapping.

Matthew Keeler

Nov 7, 2018, 8:19:48 AM
to consu...@googlegroups.com
You could either add the arbitrary sleep or simply retry the bootstrap so long as the error reported says it’s in legacy mode. I usually prefer the retry with some very short sleep between tries. 

Matt
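
A minimal sketch of the retry approach suggested above, in Python. It assumes the `consul` binary is on the PATH and that `CONSUL_HTTP_ADDR` is already set in the environment. The function names, the 2-second pause, and the limit of 60 tries are illustrative choices, not anything prescribed in this thread. The loop keeps going only while the failure text mentions legacy mode, and gives up immediately on any other error.

```python
import subprocess
import time

# Substring of the error text quoted earlier in this thread.
LEGACY_MARKER = "legacy mode"


def run_bootstrap():
    """Run `consul acl bootstrap` and return (succeeded, combined output)."""
    proc = subprocess.run(
        ["consul", "acl", "bootstrap"],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


def bootstrap_with_retry(attempt=run_bootstrap, max_tries=60, pause=2.0,
                         sleep=time.sleep):
    """Retry while the failure mentions legacy mode.

    Returns the command output on success, raises RuntimeError on any
    other failure, and TimeoutError once `max_tries` is exhausted.
    """
    for _ in range(max_tries):
        ok, output = attempt()
        if ok:
            return output
        if LEGACY_MARKER not in output:
            raise RuntimeError(f"bootstrap failed: {output.strip()}")
        sleep(pause)
    raise TimeoutError(f"still in legacy mode after {max_tries} tries")
```

Calling `bootstrap_with_retry()` with no arguments shells out to the real CLI. The `attempt` and `sleep` parameters exist so the loop can be exercised without a running cluster.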

Rafael Sierra

Nov 7, 2018, 11:18:10 AM
to consu...@googlegroups.com
> Should I introduce a waiting time between joining and bootstraping? (so all agents could agree into new ACL version)

I just realized that this is not possible. I can't join the servers because there are no ACLs in place:

```
> $HUGE_DOCKER_PREFIX consul:1.4.0-rc1 join 10.0.16.5 10.0.16.4 10.0.16.6
Error joining address '10.0.16.5': Unexpected response code: 403 (ACL not found)
Error joining address '10.0.16.4': Unexpected response code: 403 (ACL not found)
Error joining address '10.0.16.6': Unexpected response code: 403 (ACL not found)
```

I am trying to bootstrap with ACLs enabled and down_policy=extend-cache

--
Rafael Sierra

Matthew Keeler

Nov 7, 2018, 12:00:01 PM
to consu...@googlegroups.com
So the "ACL not found" error there comes from using the agent API to request that the agent join the others; it does not mean that the agent itself would be unable to join. You can specify the nodes to join in the config or on the CLI. Alternatively, if you set acl.tokens.agent_master, you can use that token for any operations that use the /agent APIs, including joining/leaving the cluster.

As for when to bootstrap, the best thing to do is to detect the legacy mode error and then retry until it succeeds, with a short wait in between tries. If you have no time constraints, you certainly could wait for 64ish seconds before bootstrapping.

With the final 1.4.0 you will also be able to detect the ACL mode using the /agent/self endpoint (which would still require using the agent master token prior to ACL bootstrapping).

Matt
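
To make the `acl.tokens.agent_master` suggestion concrete, here is a small Python sketch that adds the key to an agent config and prints the resulting JSON. The `base` dict copies the `acl` block from the `server.json` posted earlier. The helper name and the placeholder token value are hypothetical, and generating a UUID when no token is passed is just one possible way to pick a secret.

```python
import json
import uuid


def with_agent_master(config, token=None):
    """Return a copy of an agent config dict with acl.tokens.agent_master set.

    The input dict is left unmodified. If no token is given, a random
    UUID string is used (an illustrative choice).
    """
    token = token or str(uuid.uuid4())
    acl = dict(config.get("acl", {}))
    tokens = dict(acl.get("tokens", {}))
    tokens["agent_master"] = token
    acl["tokens"] = tokens
    return {**config, "acl": acl}


# The acl block from the server.json shown at the top of the thread.
base = {
    "acl": {
        "enabled": True,
        "default_policy": "deny",
        "down_policy": "extend-cache",
    }
}
print(json.dumps(with_agent_master(base, token="REPLACE-WITH-A-SECRET"), indent=2))
```

The same token value would then be passed with `-token` on the command line for calls that go through the /agent APIs.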

Rafael Sierra

Nov 7, 2018, 12:30:23 PM
to consu...@googlegroups.com
On Wed, Nov 7, 2018 at 6:00 PM Matthew Keeler <mke...@hashicorp.com> wrote:
> So the ACL not found error there is in using the agent api to request that the agent join the others and not actually that the agent itself would be unable to join. You can specify the nodes to join in the config or on the cli. Alternatively if you set acl.tokens.agent_master then you can use that token for any operations that utilize the /agent apis including joining/leaving the cluster.

Setting acl.tokens.agent_master did work to join and bootstrap them (setting -token to the same value on the command line), but the problem is that now I have a static master token that cannot be revoked. Is there a way around this?

Also, it has been around 5 minutes now and the agents still have no leader. The highest-level log message I am getting is a repeating warning like this:

```
2018/11/07 17:28:12 [WARN] agent: Node info update blocked by ACLs
2018/11/07 17:28:17 [WARN] agent: Coordinate update blocked by ACLs
```

Since all servers have an agent master token, I was expecting that they would all just work out.
 
> As for when to bootstrap the best thing to do is to detect the legacy mode error and then retry until it succeeds with a short wait in between tries. If you have no time constraints then you certainly could wait for 64ish seconds before bootstrapping.

I will loop waiting for the "transitioning out of legacy ACL mode" message.
 
> With the final 1.4.0 you will also be able to detect the ACL mode using the /agent/self endpoint (which would still require using the agent master token prior to ACL bootstrapping)

That will be nice

--
Rafael Sierra

Matthew Keeler

Nov 7, 2018, 12:33:42 PM
to consu...@googlegroups.com
The agent master token is special in that it never makes its way into the raft store and is local to the agent. Removing it from the config once everything is set up and restarting is enough to remove the token.

Matt

Matthew Keeler

Nov 7, 2018, 12:42:13 PM
to consu...@googlegroups.com

Also, the agent master token, being local to the agent, isn’t used for agent registration and updates. Either the agent or default tokens are used for those, and the warnings will persist until one of those is set, either in the config or via the /agent API.

Additionally, the agent master token isn’t a full master token. It only grants access to the agent APIs and nothing else. I believe you should be able to revoke it via the /agent/token endpoint without restarting Consul, but I haven’t tested this out.

Having no leader is definitely a problem. Do you ever see any logs about successful joins? You could also use that agent master token to list the cluster members to see if things joined up properly.

Matt
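
Listing the cluster members with the agent master token, as suggested above, could look roughly like this in Python. It is a sketch under assumptions: the default address `http://127.0.0.1:8500` and the function names are illustrative, and the token is sent in the `X-Consul-Token` header to the `/v1/agent/members` endpoint of the local agent.

```python
import json
import urllib.request


def members_request(agent_master_token, addr="http://127.0.0.1:8500"):
    """Build a GET request for /v1/agent/members carrying the token."""
    return urllib.request.Request(
        f"{addr}/v1/agent/members",
        headers={"X-Consul-Token": agent_master_token},
    )


def list_members(agent_master_token, addr="http://127.0.0.1:8500"):
    """Return (name, address, status) for each member the agent knows about."""
    request = members_request(agent_master_token, addr)
    with urllib.request.urlopen(request, timeout=5) as resp:
        members = json.load(resp)
    return [(m["Name"], m["Addr"], m["Status"]) for m in members]
```

If fewer than three servers show up in the result, the no-leader state would point to a join problem rather than an ACL one.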
