Data Transfer between my Consul Servers is insane


Fishstick Kitty

Feb 11, 2016, 10:52:30 AM
to Consul
Hello Consul Peeps...I have a small environment with a 5-server Consul cluster and about 15 nodes, on the order of 10 or so services, and some KV data (under 100 KV pairs).  I am also using Vault, so its data is stored in Consul too.

All of this is in AWS.  

I have noticed that data transfer between AZs month to date is over 33 TB (yes that's a "T") and we are only 11 days into the month.  We have traced that down to the Consul servers.

Questions:  1) does that seem crazy? 2) Our consul raft.db file is about 33MB in size...does that seem huge or is that normal?

Thanks in advance!!

James Phillips

Feb 11, 2016, 12:07:46 PM
to consu...@googlegroups.com
Hi,

The Raft db size seems pretty reasonable, but the transfer between DCs seems way out of line with what Consul would use on its own for gossip on the WAN. If you make requests to a Consul agent in one DC targeted at a different DC, Consul will forward that RPC request over its server RPC TCP connection, so it sounds like you've got something making lots of cross-DC requests against Consul.

Not every RPC endpoint has telemetry, but dumping the telemetry described at https://www.consul.io/docs/agent/telemetry.html might give you an idea of the breakdown across categories like health, catalog, or DNS queries. Which versions of Consul and Vault are you using?

-- James

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/consul/issues
IRC: #consul on Freenode
---
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/consul-tool/766b1a36-9b14-4dd4-885a-8c57d9c33c4b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

David Adams

Feb 11, 2016, 12:10:21 PM
to consu...@googlegroups.com
I can't definitively speak for Fishstick Kitty, but I'm guessing this is a LAN cluster, not WAN traffic. AWS charges for cross-AZ network traffic, but of course you would want your consul servers in a single cluster spread across multiple AZs.

Fishstick Kitty

Feb 11, 2016, 12:58:48 PM
to Consul
Right, what David said is accurate...this is 5 consul servers spread out among 3 AZs in 1 DC :).

Fishstick Kitty

Feb 11, 2016, 3:40:50 PM
to Consul
Does anybody else think 33TB of data volume in 11 days (3TB/day) between consul servers is unusually high?

David Adams

Feb 11, 2016, 3:52:36 PM
to consu...@googlegroups.com
Yes, I agree that's a crazy amount. We aren't using our Consul clusters all that much yet, but I'm seeing about 50GB/month (so <2GB/day) of inter-AZ LAN cluster traffic on our busiest VPC/datacenter (a dozen or so clients).


On Thu, Feb 11, 2016 at 2:40 PM, Fishstick Kitty <samp...@gmail.com> wrote:
Does anybody else think 33TB of data volume in 11 days (3TB/day) between consul servers is unusually high?


Michael Fischer

Feb 11, 2016, 4:38:25 PM
to consu...@googlegroups.com
That's 35MB/sec, which is a lot (over a quarter of a 1Gb pipe). What does your query load look like?  Are you doing a lot of K/V store reads/updates?
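For anyone checking the arithmetic, the rate works out like this (decimal TB/MB assumed):

```python
# Back-of-envelope check of the figures in the thread:
# 33 TB transferred over the first 11 days of the month.
TB = 10**12          # decimal terabyte, in bytes
total_bytes = 33 * TB
days = 11

per_day_tb = total_bytes / days / TB
per_sec_mb = total_bytes / (days * 86_400) / 10**6

print(f"{per_day_tb:.0f} TB/day, {per_sec_mb:.1f} MB/sec")  # → 3 TB/day, 34.7 MB/sec
```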

Consul devs, it may make sense not to send read queries across AZ boundaries if stale reads are permitted. 

Fishstick Kitty

Feb 11, 2016, 4:41:58 PM
to Consul
Hi Michael, no KV writes.  Minimal activity.  Also the volume persists evenly throughout the day and night when nobody is using the system...an odd job might run now and then.  I am going to do a KV dump and see what is in there.  I have a feeling Vault might be doing something nutty (we are using Vault for secret storage/retrieval).

Fishstick Kitty

Feb 11, 2016, 4:54:23 PM
to Consul
So I did a KV export and, first of all, it was about 60MB of JSON, which is huge. Looking at the contents, it all looks good until I start seeing the Vault data: about a million (total guess) entries like this:

{
    "LockIndex": 0,
    "Key": "vault/sys/token/id/4c8088ff6a62b2aadbd560446c043cb0660294dd",
    "Flags": 0,
    "Value": "AAAAAQIDHkILPyleloT+yQ3rjbzxLbCR3ElPy1w0BjghtVmyWRiThOLwlO1Yo4yRKRtVAjp7jyXEVAQiYGRi28FHLzuRiKzGRmaAQSli0rMwAtCoPMhl2I4xhJOZsQiTMXSDriGNc/aHZP8l7lE93S66cXF95Qxk06ANCcOm/vcEFXyiqFvtC+k9b8ktzSdflfbZmTlokqPNcGxmHBC9BbETldcVLx6KMQ0W4LL0KGEUZmRmmpIk4m89tkIoO+vqYoCUIB+GS8YhyqLduTKJyvpZCGblNIAopAlaLC/IVRlUcvBvhaRYlBhnc9SbuciQREsDTGrA1G8HQKtiRFaAWGJXU1ZyV5XSS3Wj/jsMdmaVLdgo67uK2eODcNQqawzZeaezU+RxJzmU7w1SyiE7RFZIdyzmhthGYEcXUX2jeLsCdzPQ+JYhWt2aRphLQRKXfqTbOA==",
    "CreateIndex": 432575,
    "ModifyIndex": 432575
  },

So...question:  Does all of the KV data constantly get sent around between servers?  Or is Vault just writing so much new data that it is in a constant state of syncing?

Thanks

Michael Fischer

Feb 11, 2016, 5:05:17 PM
to consu...@googlegroups.com
Probably a question for the vault mailing list. I suspect something in your environment is generating tokens very frequently and/or pulling data out of the vault frequently. 

James Phillips

Feb 11, 2016, 6:28:26 PM
to consu...@googlegroups.com
Yeah this volume of data is nowhere close to what a normal, idle cluster would produce with gossip, even inside the LAN (sorry I was thinking WAN earlier). 

It looks like you may be running into this issue, which was fixed in Vault 0.4:


The cross-AZ thing is interesting from a cost perspective but seems like it could make things more complicated to configure, and could load stale reads unevenly across servers if the clients weren't balanced well - I'd probably avoid adding that complexity unless it was a huge cost win. Gossip should be a reasonable steady-state load, and the rest should all be actual application-related traffic in Consul.

-- James

James Phillips

Feb 12, 2016, 10:50:01 AM
to consu...@googlegroups.com
Hi Fishstick Kitty,

The servers don't constantly exchange the entire KV contents, but they do replicate each change to the KV store: adding or deleting a key gets sent from the leader server to all the follower servers. It looks from your Vault thread https://groups.google.com/forum/#!msg/vault-tool/c6_QVmsVhG4/4Kw3vTzdDwAJ like there might be a ton of short-lived items being churned by Vault, so the overall size of the KV store might not tell the whole story.
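As a toy illustration of that fan-out (illustrative numbers, not Consul internals): each committed write travels from the leader to every follower, so write churn is multiplied by the follower count.

```python
# Toy model: every committed KV write is replicated from the Raft
# leader to each follower, so cross-server bytes scale with the
# follower count. The sizes below are illustrative, not measured.
def replicated_bytes(write_bytes: int, servers: int) -> int:
    followers = servers - 1
    return write_bytes * followers

# e.g. a 1 KB entry written to a 5-server cluster is shipped
# to 4 followers:
print(replicated_bytes(1024, 5))  # → 4096
```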

If you set debug logging (https://www.consul.io/docs/agent/options.html#_log_level), or use consul monitor as mentioned in the same link, on the Consul agents that Vault uses, you'll be able to see the activity it's doing against Consul, which might shed some light on what's going on.

-- James

Fishstick Kitty

Feb 12, 2016, 1:01:04 PM
to Consul
Hi James, I ran consul monitor -log-level debug for about 1 hour and 40 mins...nothing remarkable (see below for a typical sample).  It does call /v1/kv/?keys fairly often (975 times in 1:40 hours = 9.75 times per minute), and that response contains 105921 items and is about 2.5MB.

Ed
---------

Typical output from consul monitor -log-level debug

2016/02/12 17:40:23 [DEBUG] agent: check 'service:vault' is passing
2016/02/12 17:40:23 [DEBUG] http: Request GET /v1/kv/?keys (57.642447ms) from=127.0.0.1:44816
2016/02/12 17:40:28 [DEBUG] http: Request GET /v1/catalog/services (100.102µs) from=127.0.0.1:44878
2016/02/12 17:40:28 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth-metrics (73.902µs) from=127.0.0.1:44879
2016/02/12 17:40:28 [DEBUG] http: Request GET /v1/catalog/service/omaha-api (60.811µs) from=127.0.0.1:44880
2016/02/12 17:40:28 [DEBUG] http: Request GET /v1/catalog/service/omaha-api-metrics (65.605µs) from=127.0.0.1:44881
2016/02/12 17:40:28 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth (52.718µs) from=127.0.0.1:44882
2016/02/12 17:40:28 [DEBUG] http: Request GET /v1/catalog/service/omaha-web-metrics (72.803µs) from=127.0.0.1:44883
2016/02/12 17:40:28 [DEBUG] http: Request GET /v1/catalog/service/mongodb (64.97µs) from=127.0.0.1:44884
2016/02/12 17:40:28 [DEBUG] http: Request GET /v1/catalog/service/consul (65.428µs) from=127.0.0.1:44885
2016/02/12 17:40:28 [DEBUG] http: Request GET /v1/catalog/service/saltmaster (45.465µs) from=127.0.0.1:44886
2016/02/12 17:40:28 [DEBUG] http: Request GET /v1/catalog/service/omaha-job (67.77µs) from=127.0.0.1:44887
2016/02/12 17:40:28 [DEBUG] http: Request GET /v1/catalog/service/omaha-web (70.557µs) from=127.0.0.1:44888
2016/02/12 17:40:28 [DEBUG] http: Request GET /v1/catalog/service/vault (63.687µs) from=127.0.0.1:44889
2016/02/12 17:40:28 [DEBUG] http: Request GET /v1/kv/?keys (56.668669ms) from=127.0.0.1:44890
2016/02/12 17:40:31 [DEBUG] http: Request PUT /v1/session/renew/27b39d9d-aafe-add0-17ee-a871416efb9a (63.954µs) from=127.0.0.1:46518
2016/02/12 17:40:32 [DEBUG] http: Request GET /v1/catalog/services (110.479µs) from=127.0.0.1:44951
2016/02/12 17:40:32 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth-metrics (63.26µs) from=127.0.0.1:44952
2016/02/12 17:40:32 [DEBUG] http: Request GET /v1/catalog/service/omaha-api (66.634µs) from=127.0.0.1:44953
2016/02/12 17:40:32 [DEBUG] http: Request GET /v1/catalog/service/omaha-api-metrics (73.199µs) from=127.0.0.1:44954
2016/02/12 17:40:32 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth (51.648µs) from=127.0.0.1:44955
2016/02/12 17:40:32 [DEBUG] http: Request GET /v1/catalog/service/omaha-web-metrics (61.568µs) from=127.0.0.1:44956
2016/02/12 17:40:32 [DEBUG] http: Request GET /v1/catalog/service/mongodb (60.103µs) from=127.0.0.1:44957
2016/02/12 17:40:32 [DEBUG] http: Request GET /v1/catalog/service/consul (66.525µs) from=127.0.0.1:44958
2016/02/12 17:40:32 [DEBUG] http: Request GET /v1/catalog/service/saltmaster (47.453µs) from=127.0.0.1:44959
2016/02/12 17:40:32 [DEBUG] http: Request GET /v1/catalog/service/omaha-job (73.582µs) from=127.0.0.1:44960
2016/02/12 17:40:32 [DEBUG] http: Request GET /v1/catalog/service/omaha-web (69.406µs) from=127.0.0.1:44961
2016/02/12 17:40:33 [DEBUG] http: Request GET /v1/catalog/service/vault (79.317µs) from=127.0.0.1:44962
2016/02/12 17:40:33 [DEBUG] http: Request GET /v1/kv/?keys (60.011327ms) from=127.0.0.1:44963
2016/02/12 17:40:33 [DEBUG] agent: Service 'consul' in sync
2016/02/12 17:40:33 [DEBUG] agent: Service 'vault' in sync
2016/02/12 17:40:33 [DEBUG] agent: Check 'service:vault' in sync
2016/02/12 17:40:34 [DEBUG] serf: forgoing reconnect for random throttling
2016/02/12 17:40:35 [DEBUG] http: Request GET /v1/catalog/services (97.049µs) from=127.0.0.1:45024
2016/02/12 17:40:35 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth-metrics (59.076µs) from=127.0.0.1:45025
2016/02/12 17:40:35 [DEBUG] http: Request GET /v1/catalog/service/omaha-api (59.48µs) from=127.0.0.1:45026
2016/02/12 17:40:35 [DEBUG] http: Request GET /v1/catalog/service/omaha-api-metrics (58.577µs) from=127.0.0.1:45027
2016/02/12 17:40:35 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth (52.363µs) from=127.0.0.1:45028
2016/02/12 17:40:35 [DEBUG] http: Request GET /v1/catalog/service/omaha-web-metrics (57.748µs) from=127.0.0.1:45029
2016/02/12 17:40:35 [DEBUG] http: Request GET /v1/catalog/service/mongodb (56.397µs) from=127.0.0.1:45030
2016/02/12 17:40:35 [DEBUG] http: Request GET /v1/catalog/service/consul (65.482µs) from=127.0.0.1:45031
2016/02/12 17:40:35 [DEBUG] http: Request GET /v1/catalog/service/saltmaster (46.848µs) from=127.0.0.1:45032
2016/02/12 17:40:35 [DEBUG] http: Request GET /v1/catalog/service/omaha-job (57.723µs) from=127.0.0.1:45033
2016/02/12 17:40:35 [DEBUG] http: Request GET /v1/catalog/service/omaha-web (59.332µs) from=127.0.0.1:45034
2016/02/12 17:40:35 [DEBUG] http: Request GET /v1/catalog/service/vault (64.687µs) from=127.0.0.1:45035
2016/02/12 17:40:35 [DEBUG] http: Request GET /v1/kv/?keys (57.156772ms) from=127.0.0.1:45036
2016/02/12 17:40:36 [DEBUG] memberlist: TCP connection from=10.4.5.122:50525
2016/02/12 17:40:38 [DEBUG] http: Request PUT /v1/session/renew/27b39d9d-aafe-add0-17ee-a871416efb9a (50.465µs) from=127.0.0.1:46518
2016/02/12 17:40:38 [DEBUG] agent: check 'service:vault' is passing
2016/02/12 17:40:44 [DEBUG] http: Request GET /v1/catalog/services (118.666µs) from=127.0.0.1:45098
2016/02/12 17:40:44 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth-metrics (61.603µs) from=127.0.0.1:45099
2016/02/12 17:40:44 [DEBUG] http: Request GET /v1/catalog/service/omaha-api (88.149µs) from=127.0.0.1:45100
2016/02/12 17:40:44 [DEBUG] http: Request GET /v1/catalog/service/omaha-api-metrics (81.782µs) from=127.0.0.1:45101
2016/02/12 17:40:44 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth (79.078µs) from=127.0.0.1:45102
2016/02/12 17:40:44 [DEBUG] http: Request GET /v1/catalog/service/omaha-web-metrics (83.409µs) from=127.0.0.1:45103
2016/02/12 17:40:44 [DEBUG] http: Request GET /v1/catalog/service/mongodb (68.045µs) from=127.0.0.1:45104
2016/02/12 17:40:44 [DEBUG] http: Request GET /v1/catalog/service/consul (90.061µs) from=127.0.0.1:45105
2016/02/12 17:40:44 [DEBUG] http: Request GET /v1/catalog/service/saltmaster (51.757µs) from=127.0.0.1:45106
2016/02/12 17:40:44 [DEBUG] http: Request GET /v1/catalog/service/omaha-job (75.149µs) from=127.0.0.1:45107
2016/02/12 17:40:44 [DEBUG] http: Request GET /v1/catalog/service/omaha-web (73.378µs) from=127.0.0.1:45108
2016/02/12 17:40:44 [DEBUG] http: Request GET /v1/catalog/service/vault (72.455µs) from=127.0.0.1:45109
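One quick way to turn a capture like that into per-endpoint counts; the sample data below is re-created inline so the pipeline is self-contained, but on a real box you'd save `consul monitor -log-level=debug > monitor.log` first:

```shell
# Re-create a tiny slice of the monitor output above, then tally
# HTTP requests per endpoint ($7 is the request path field).
cat > monitor.log <<'EOF'
2016/02/12 17:40:23 [DEBUG] http: Request GET /v1/kv/?keys (57.642447ms) from=127.0.0.1:44816
2016/02/12 17:40:28 [DEBUG] http: Request GET /v1/catalog/services (100.102µs) from=127.0.0.1:44878
2016/02/12 17:40:33 [DEBUG] http: Request GET /v1/kv/?keys (60.011327ms) from=127.0.0.1:44963
EOF
awk '/http: Request/ {print $7}' monitor.log | sort | uniq -c | sort -rn
```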

James Phillips

Feb 12, 2016, 1:26:25 PM
to consu...@googlegroups.com
Interesting - if your Vault node happened to pick a non-leader Consul server in one AZ, and those requests don't allow stale reads (which they won't by default), then that server could forward to the leader in another AZ, so you'd get a 2X multiplier. Just for that key scan you'd get:

2 * 2.5 MB * 9.75/min * 60 min/hr * 24 hr/day * 11 day = 770 GB
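Spelled out as a quick computation (same figures as the line above; the 2x assumes every scan is forwarded from a follower to the leader):

```python
# James's key-scan estimate: a 2.5 MB /v1/kv/?keys response fetched
# 9.75 times per minute, doubled by non-leader -> leader forwarding,
# over the 11 days measured so far.
forward_multiplier = 2
payload_mb = 2.5
per_minute = 9.75

gb = forward_multiplier * payload_mb * per_minute * 60 * 24 * 11 / 1000
print(f"{gb:.0f} GB")  # → 772 GB
```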

That's a non-trivial amount of traffic but doesn't explain the 33 TB. I noticed you have something doing a bunch of catalog queries every few seconds on those boxes as well - that could add up. Is that a process that's doing polling of some kind?

-- James


Fishstick Kitty

Feb 12, 2016, 1:42:01 PM
to Consul
Hmm...our applications will monitor other apps...but that would issue a GET /v1/health/service/<service name> (which I don't see any of those)...our app definitely isn't doing catalog lookups...the one for saltmaster is a good example because our app would absolutely not care about that.  The consul agents on the nodes might..so perhaps those catalog lookups are coming from the consul agents?  I dunno.

James Phillips

Feb 12, 2016, 1:45:52 PM
to consu...@googlegroups.com
The Consul agents will query the catalog when they sync, but they use internal RPC calls, so they don't show up as HTTP requests in the log like this. This would be something external to Consul using the Consul HTTP API: Vault (though that doesn't look at the catalog), consul-template, or some custom app.

-- James

On Fri, Feb 12, 2016 at 10:42 AM, Fishstick Kitty <samp...@gmail.com> wrote:
Hmm...our applications will monitor other apps...but that would issue a GET /v1/health/service/<service name> (which I don't see any of those)...our app definitely isn't doing catalog lookups...the one for saltmaster is a good example because our app would absolutely not care about that.  The consul agents on the nodes might..so perhaps those catalog lookups are coming from the consul agents?  I dunno.


Fishstick Kitty

Feb 12, 2016, 2:01:19 PM
to Consul
Ah...there you go...we are running consul-template on 3 web servers with a retry-interval of 10 seconds.  That doesn't really explain the high consul server volume though right?

James Phillips

Feb 12, 2016, 2:06:34 PM
to consu...@googlegroups.com
We should take this off list into a GitHub issue since things are getting pretty specific to your setup.

The log was from the Consul agent on your Vault server, though, right? It looked like something on there was doing a bunch of catalog queries every few seconds. It's probably not going to explain the full 33TB but it looked strange.

Do you have any way to audit network traffic between the different Consul agents, to see whether it can be pinned to specific nodes making tons of requests, or whether it truly was just among the Consul servers?

On Fri, Feb 12, 2016 at 11:01 AM, Fishstick Kitty <samp...@gmail.com> wrote:
Ah...there you go...we are running consul-template on 3 web servers with a retry-interval of 10 seconds.  That doesn't really explain the high consul server volume though right?


Darron Froese

Feb 12, 2016, 2:07:21 PM
to Consul
That could explain the volume - Consul Template is awesome - but unless you're using the de-duplication feature it can generate a high volume of queries:


It's worth a look.
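For reference, dedup is enabled in the consul-template config file; a minimal sketch (the prefix shown is the documented default):

```hcl
# consul-template config sketch: with dedup enabled, one instance per
# template is elected leader, renders the query results once into
# Consul's KV under `prefix`, and the other instances watch that single
# compressed entry instead of each re-running every query.
deduplicate {
  enabled = true
  prefix  = "consul-template/dedup/"
}
```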

On Fri, Feb 12, 2016 at 12:01 PM Fishstick Kitty <samp...@gmail.com> wrote:
Ah...there you go...we are running consul-template on 3 web servers with a retry-interval of 10 seconds.  That doesn't really explain the high consul server volume though right?

Fishstick Kitty

Feb 12, 2016, 2:09:37 PM
to Consul
James...the consul monitor was done on the vault "leader" server...which is also a consul server...it is one of the 5 server nodes.  We have vault running on all the consul servers.

Fishstick Kitty

Feb 23, 2016, 1:39:01 PM
to Consul
Just to close the loop, the issue has been resolved.  We were using Vault in a wonky manner and it was causing the KV store in Consul to be gigantic.  Read here:  https://groups.google.com/forum/#!topic/vault-tool/c6_QVmsVhG4