Hey Roger,
> I would like more detail on the small I/O from the job cache, and
> details on how to disable it.
The salt-master caches all published jobs and their replies locally in
its job-cache. That's quite useful if, for example, you publish a job and
a minion (not all the minions, just one of the many that were targeted)
does not answer in time (meaning the salt-cli times out waiting for
returns before all minions have returned).
Even though you don't see it on the cli, the salt-master still receives
the delayed minion's return and writes the reply into its job-cache.
You can then query the job-cache with a specific job-id (jid) to find
out which minions have returned (jobs.lookup_jid, I believe).
Now, if a minion does not reply in time, the salt-master wants to know
whether the job might still be executing on that minion, so it publishes
a job asking the minions whether job (jid) xxxxxxxxxxxxxxxxx is still
running (saltutil.find_job).
That job is also put into the job-cache.
If you look at the job-cache (/var/cache/salt/master/jobs/), it's just
some directories with files in them, with each file representing a
minion's return to a specific jid.
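To get a feeling for how big your own cache already is, something like the following could be run on the master. It is only a rough sketch: it assumes nothing about the cache layout beyond "directories with files in them" and simply counts everything below the path mentioned above. The function name count_cache_files is made up for this example and is not part of salt.

```python
import os


def count_cache_files(cache_root):
    """Return (directories, files) found anywhere below cache_root."""
    n_dirs = 0
    n_files = 0
    # os.walk visits every directory below cache_root exactly once
    for _dirpath, dirnames, filenames in os.walk(cache_root):
        n_dirs += len(dirnames)
        n_files += len(filenames)
    return n_dirs, n_files


if __name__ == "__main__":
    # Path taken from the mail above; adjust it if your cachedir differs.
    root = "/var/cache/salt/master/jobs/"
    dirs, files = count_cache_files(root)
    print("%d directories, %d files under %s" % (dirs, files, root))
```

A path that does not exist (or cannot be read) just yields zero for both counts, so run it as a user that may read the cache.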
Now imagine publishing jobs for 5000 minions with 5000 returns,
writing 5000 small files, and even more files if saltutil.find_job has
to be used (which is very likely with many minions). Over time you will
find many, many files in the cache, which take quite a while to search
if you're looking for something specific.
Depending on your environment, how many jobs you plan to publish
hourly/daily/weekly and how long you want/have to keep the job-history
(that's what the job-cache essentially is), the cache can become a
burden for the salt-master with millions of files in it.
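A back-of-the-envelope estimate shows how quickly that adds up. This is a sketch under simplified assumptions, not a measurement: every job targets all minions, every return is one file, and for a given fraction of jobs saltutil.find_job adds one more return per minion. The jobs-per-day, retention and find_job numbers are invented for illustration; only the 5000 minions come from the text above.

```python
def estimate_cache_files(minions, jobs_per_day, days_kept, find_job_fraction):
    """Rough number of return files in the job-cache.

    Assumes one file per minion return, plus one extra return per minion
    for the fraction of jobs that trigger saltutil.find_job.
    """
    returns_per_job = minions
    find_job_returns = minions * find_job_fraction
    files_per_job = returns_per_job + find_job_returns
    return int(round(files_per_job * jobs_per_day * days_kept))


if __name__ == "__main__":
    # Hypothetical workload: 10 jobs a day, kept for 30 days,
    # half of the jobs followed by a find_job.
    print(estimate_cache_files(5000, 10, 30, 0.5))  # prints 2250000
```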
For example, in my environment we have about 14,000 minions with about
200,000 customers who can configure their webspace through a gui.
The gui publishes jobs via https through salt-api. I have to keep a
history for the last 6 months, in case some customer fucks up his
webspace configuration, so we can prove he did it himself.
There are ways around this dilemma, for example disabling the
job-cache and using returners, but with that you lose salt's encryption.
Another alternative might be salt-eventsd (sorry Dave, I don't mean to
advertise, but the question seems to come up more frequently lately :-) ):
https://github.com/felskrone/salt-eventsd
which is (in its default setup) a mysql job-cache for salt with the
possibility to push the returned data anywhere you want (postgres,
graphite, etc.). I'll happily answer questions on how to use it, but
in a different thread please.
> How many minions are you running from one master?
Currently we're in the process of setting up around 4500 minions per
salt-master (4 masters in total) on Dell PowerEdge 610, 2.1 GHz dual
hexacore, 16 GB RAM. The limit here is the number of connections per
second our master can handle without being syn-flooded.
Another factor for us is how many minions we can have report data
within a 10 minute time frame without overloading the master, which
turned out to be around 4500 minions using the scheduler with the
aforementioned settings.
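To put that number into perspective, here is the average return rate it implies. The sketch assumes the returns are spread evenly over the window, which the scheduler does not guarantee, so real peaks will be higher. The function name is made up for this example.

```python
def returns_per_second(minions, window_seconds):
    """Average returns per second if every minion reports once per window."""
    return minions / float(window_seconds)


if __name__ == "__main__":
    # 4500 minions reporting once within a 10 minute (600 second) window
    print(returns_per_second(4500, 10 * 60))  # prints 7.5
```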
I've not set up any monitoring yet, but usually the limit is CPU, not RAM.
Now that I mention it, CPU power is what the salt-master likes to use
a lot. Read this:
https://github.com/saltstack/salt/pull/9235
You should consider using a 2048 bit keysize.
I think there have been commits concerning the caching of keys on the
master, but I'm not sure. Anyone know?
> How many minions total?
We have a total of around 15,000 minions, with 14,000 being virtual
servers.
Hm, maybe I should write a documentation section for larger setups,
or maybe even a concrete example of how we do it.
What do you think, Dave, something you'd like? :-)
- felskrone