Hi John - thanks for offering to help! I'll take a look at Vogeler
over the weekend and I'll report back here.
Thanks!
Grig
As Miquel and I also discussed privately, in terms of Config. Mgmt I
think we should start with supporting plugins that know how to output
configuration files for different CM tools. As I mentioned before, I
think kokki and chef-solo are 2 good starting points.
To me, the central notion in CM is that of roles, so we should
definitely be able to tie roles to nodes in Overmind. Another
important notion is that of attributes (to use the Puppet/Chef
nomenclature): variables such as IP addresses, host names, directory
names, etc. that can be customized before running the CM tool on a
given client node. These will be harder to support, but I propose
that, for starters at least, we allow users to specify both an
attribute name and its associated value via the Overmind UI.
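To make that concrete, a CM-output plugin could take the roles and
attributes stored in Overmind and render them into whatever the target
tool expects; for chef-solo that would be a node JSON file. A rough
sketch only - the node fields and the plugin interface here are
invented for illustration, not existing code:

import json

class ChefSoloOutputPlugin(object):
    """Render Overmind roles/attributes into a chef-solo node file."""

    def render(self, node):
        # `node` is assumed to expose a dict of attributes and a list
        # of role names, both edited through the Overmind UI.
        data = dict(node.attributes)      # e.g. {"ipaddress": "10.0.0.5"}
        data["run_list"] = ["role[%s]" % r for r in node.roles]
        return json.dumps(data, indent=2)

    def write(self, node, path="/etc/chef/node.json"):
        with open(path, "w") as f:
            f.write(self.render(node))

chef-solo would then be invoked on the client with that file
(chef-solo -j /etc/chef/node.json); a kokki or puppet plugin would
implement the same interface with a different render().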
Grig
>- Provide clear documentation on the format of messages put on the
>queue for clients to read.
>- Create a persistence backend for Vogeler that, when used, uses the
>Overmind REST api to store the results from the client.
That sounds good.
Could you elaborate on the second point? I originally thought that
Overmind's backend would consume the result/response queue, which
should scale better than having thousands of nodes store results
against a server (REST API).
Another point is the monitoring component, which could also report its
results either to a REST API or to a queue. I am talking to the author
of eyes (http://bitbucket.org/heckj/eyes/wiki/detailed_goals) to see
whether we can collaborate on that.
2010/10/16 lusis <lusi...@gmail.com>:
I have been working on Eyes slowly and quietly, intending to have
something functional before making much in the way of noise about it.
The wiki pages for Eyes have a lot of detail around what I'm trying to
do with it - I see it very clearly as a simple and focused subcomponent
for a larger system. Fundamentally it was created out of my frustration
that existing monitoring solutions, even from large vendors, didn't
have decent APIs that allowed you to create, update, delete, and check
the status of monitors - exactly the kind of thing you need to make
integration with larger systems like Overmind, Chef, Puppet, etc.
effective at keeping track of monitoring.
But this list isn't about Eyes - it's about Overmind, so I'll get back to our Overmind discussions...
- joe
Sent from my iPad
The way I see it is this:
Overmind already has server provisioning. To be a complete solution it
needs a scalable CM system and a monitoring system.
You and John have already put a lot of thought and code into Eyes and
Vogeler, respectively. If we can somehow fit the pieces of the puzzle
together, we would have a very cool solution sooner than we thought.
If not, well, I hope it helps us get on the right track, at least.
Even though monitoring was not planned at this stage, I think that the
final architecture must take into account that both CM and monitoring
data need to be handled, so that we arrive at the best, simplest
design possible.
For Vogeler, you seem to have thought about having a queue only for
issuing commands, with results posted back via REST. That would be
compatible with the Eyes approach, though I am not sure exactly how
Eyes gets the data back from a node (there is an agent, but also
polling?).
While we could go for such an approach, do you think it would be
feasible to have the agent report both CM run results and monitoring
data back to a queue? The advantage would be even more scalability and
a simpler design, by reusing the queue system for more tasks and
avoiding polling. Maybe it is a dumb idea, but it sounds attractive.
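Just to illustrate what I mean by reusing the queue for both kinds of
data: a minimal sketch of an agent publishing to a results queue with
pika. The queue name and message fields are made up, and RabbitMQ is
only one possible broker:

import json
import socket
import time

import pika  # AMQP client library for RabbitMQ

def publish_result(kind, payload, queue="overmind.results"):
    # One helper for both CM run results and monitoring readings; the
    # "kind" field lets the consumer route them differently.
    message = {
        "node": socket.getfqdn(),
        "kind": kind,                  # e.g. "cm_run" or "monitoring"
        "timestamp": time.time(),
        "payload": payload,
    }
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=queue, durable=True)
    channel.basic_publish(exchange="", routing_key=queue,
                          body=json.dumps(message))
    connection.close()

# publish_result("monitoring", {"disk_used_pct": 83})
# publish_result("cm_run", {"status": "success", "changed_resources": 4})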
2010/10/16 Joseph Heck <josep...@gmail.com>:
The architecture of Eyes today is set up with a central "point" that's
REST based for posting in the results of whatever is being monitored -
expecting that I have at least one agent running (more if some scale is
needed) that reads against a REST blackboard of what is pending to be
updated, and reports back through a REST interface. That same REST
interface is heading toward being extended to be able to ask questions
about the state of a monitor, update the monitor (to change
thresholds), etc.
Generally, that core is my "where I stash state for the monitors".
I seriously thought about queues, but backed away from them for a
start. That all said, it could easily be modified for something to
read out of a queue/pipe of event data and update the state. The agent
I originally envisioned was intended to run on the same server/VM
image as the state engine to start - knowing I might want multiple of
those, scaled out to their own little VM "poller" engines if need be.
That's all for a polling model... which I will be the first to agree
doesn't scale at extreme levels.
The whole thing could be pushed out even further to agents running on
the target machines and just piping event data back into the
combination of systems - some of which gets routed to the monitoring
system to provide updates. I made provisions for that in my design
with the idea of a "passive monitor" (stealing the term from Nagios).
I started with a polling system because I thought that might be more
palatable to folks wanting to install and use something quickly and
easily.
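In rough terms the polling loop is just: ask the blackboard what's
pending, run the checks, post the results back. A sketch of that loop
follows; the endpoint paths, fields and base URL are invented for
illustration, they are not the real API:

import time

import requests  # used here just for illustration

BLACKBOARD = "http://monitor.example.com/api"   # hypothetical base URL

def run_check(monitor):
    # Placeholder: a real poller dispatches on the monitor type
    # (ping, disk, HTTP, ...).
    return "ok"

def poll_once():
    pending = requests.get(BLACKBOARD + "/monitors/pending").json()
    for monitor in pending:
        status = run_check(monitor)
        requests.post(BLACKBOARD + "/monitors/%s/results" % monitor["id"],
                      json={"status": status, "checked_at": time.time()})

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(60)   # the "passive monitor" variant replaces this
                         # loop with events pushed in from the targets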
Here's a diagram I drew up:
http://github.com/lusis/vogeler/blob/gh-pages/vogeler.jpg
To answer Joe's question, originally it was going to be a full blown
Configuration Management database but I've scaled back to focus on
Command and Control for now.
I forgot to mention. Using a queue based system for monitoring actually is a bit risky because you never know if the node will ever respond. Most monitors are "I need to know right now how full your disk is"
On Sat, Oct 16, 2010 at 12:41 PM, John Vincent <lusi...@gmail.com> wrote:
> I forgot to mention. Using a queue based system for monitoring actually is a
> bit risky because you never know if the node will ever respond. Most
> monitors are "I need to know right now how full your disk is"
I recently added Last Error reporting functionality to the client as
well. Essentially any exception will, assuming the exception isn't in
the communication with the queue itself, send a return message with
the details of the error.
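The pattern is roughly the following (a sketch, not the actual client
code; publish_response stands in for whatever the client normally uses
to put a message on the response queue):

import subprocess
import traceback

def run_command(command, publish_response):
    # publish_response is a placeholder for the client's normal way of
    # putting a message on the response queue.
    try:
        output = subprocess.check_output(command, shell=True)
        publish_response({"status": "ok", "output": output})
    except Exception:
        # Anything except a failure to reach the queue itself is sent
        # back as a "last error" message with the exception details.
        publish_response({"status": "error",
                          "last_error": traceback.format_exc()})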
While our small/medium iterations approach to development means we
won't get both features from the beginning, the design must *allow*
us to add them later.
High availability:
There are three points of failure in a queue-based system:
- The central server that does the work. You need to solve this
problem regardless of the architecture.
- The queue. You can replicate, make messages persistent and cluster
the queue (see http://www.rabbitmq.com/pacemaker.html); a small sketch
of the persistence part follows after this list.
- Agents. A system based on a queue and agents needs to transform a
weakness (agent failure) into a strength. How? Having an agent
checking a queue is already a kind of basic monitoring. If you "miss a
heartbeat", the system would immediately mark the node as having
problems. Then we can devise ways to handle those cases, for example
active "emergency" checks, where Overmind *calls* that particular
client to see what's wrong. And the agent could be designed in an
extra secure way. For example, an agent (with an efficient C core)
that does nothing (simplicity) but communicate with the queue and
execute local commands (call chef-solo, run monitoring plugins,
etc...) in an independent process. There might even already be
something similar that we can reuse.
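As promised above, a small pika sketch of the persistence part:
declare the queue durable and mark messages persistent so they survive
a broker restart (clustering/pacemaker is configured on the broker
side; the queue name here is made up):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A durable queue survives a broker restart (the name is illustrative).
channel.queue_declare(queue="overmind.commands", durable=True)

# delivery_mode=2 additionally marks the message itself as persistent.
channel.basic_publish(exchange="", routing_key="overmind.commands",
                      body="run chef-solo",
                      properties=pika.BasicProperties(delivery_mode=2))
connection.close()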
But agent security against failure is a topic in itself. John, what
else have you investigated in this regard?
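To make the "missed heartbeat" idea from the last point concrete: the
consumer on the Overmind side only has to remember when each node was
last heard from. A sketch with made-up names and thresholds:

import time

HEARTBEAT_INTERVAL = 60     # agents are expected to report every minute
MISSED_LIMIT = 3            # tolerate a couple of late messages

last_seen = {}              # node name -> timestamp of the last message

def on_message(node_name):
    # Called for every message (result or heartbeat) consumed from the queue.
    last_seen[node_name] = time.time()

def nodes_with_problems():
    # Nodes we have not heard from recently: candidates for the active
    # "emergency" checks mentioned above.
    cutoff = time.time() - HEARTBEAT_INTERVAL * MISSED_LIMIT
    return [name for name, seen in last_seen.items() if seen < cutoff]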
Scalability:
Agents that report results or info back to a central REST API scale
only up to a point (even though there are ways around that).
Queue systems, in contrast, scale tremendously.
Some performance numbers for RabbitMQ:
- from 1M msg/s down to 100 msg/s (i.e. 60,000,000 msg/minute down to 6,000 msg/minute):
http://groups.google.com/group/rabbitmq-discuss/browse_thread/thread/f376d885a8c478ac/e55ae8c821405044
Worst case:
http://www.sheysrebellion.net/blog/2009/06/24/amqp-kool-aid-part-15-rabbitmq-benchmark/
While having multiple pollers (as Joseph has envisioned) improves the
situation, a REST API cannot easily scale to those levels.
In a queue-based system, you remove the information-exchange limit and
only a single bottleneck remains: the server that processes the
information, saves it to the DB and performs the automatic tasks, and
that can be solved in different ways.
That said, I am not obsessed with an all-queue design; you can
convince me otherwise. It is just that, in principle, polling seems...
I don't know, just a worse option than reusing a component (the queue)
we will already have in place for config management, and that scales
extremely well out of the box.
2010/10/16 John Vincent <lusi...@gmail.com>:
>I forgot to mention. Using a queue based system for monitoring actually is a bit risky because you never know
> if the node will ever respond. Most monitors are "I need to know right now how full your disk is"
I partially responded to that, but I'd like to add something else.
You do want the agent to respond periodically, and a queue allows that
to happen more often than in other monitoring systems: either by
having the agent send Overmind a periodic broadcast message, or by
expecting to receive a message from it every minute.
Another thing. You say in Vogeler's FAQ: "I don’t think Vogeler will
ever be a replacement for Puppet or Chef"
But a Configuration Management database will surely replace *some*
part of Chef? In any case, Overmind's goal is to completely replace
the role of a Chef server, though probably using chef-solo on the
client side.
2010/10/18 Miquel Torres <tob...@googlemail.com>:
Complexity depends on the whole system. Currently, you need to
configure something for server provisioning, another system for CM,
another for monitoring...
Overmind will have a huge simplicity advantage.
Besides, it all depends on ease of installation. An agent sounds more
complicated, but in the end, you will have an overmind-server package,
and an overmind-agent package. The agent will be automatically
installed for new nodes, and can be manually installed for existing
nodes. You would need a package for the eyes client anyway.
2010/10/18 Miquel Torres <tob...@googlemail.com>:
Inspired by John's nice diagram :-), I have drawn and added a very
simplified diagram of a would-be complete Overmind.
I made it so that you can see what my overall idea is. I left out
some details on purpose. For example the DB question, which bears
another discussion. But it should suffice to get my idea. I am a huge
proponent of simplicity, and when you take into consideration all
areas together, a system can become very complex, with dozens of
moving parts (we use several solutions with lots of moving parts
nowadays!). This design eliminates much of that complexity, and allows
for a modular system, where I could imagine someone using it only for
monitoring, for example.
The vogeler-runner would be a task on the job queue, as would the
emergency poller (or even the regular pollers, in the case of a
pure-polling design!). The vogeler client would be like an
overmind-agent. Response messages could be consumed by the jobs
themselves, or by the core (I am not sure about that). The messaging
structure outlined by John's diagram seems like a good solution. The
eyes event processor could also be implemented as a celery periodic
task (job).
Although django-celery is (yet again) something I have never used
before, it is used at my company, for example, to resize millions of
user-submitted images, so it would be a really nice solution, while
allowing us to develop all parts independently.
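To give an idea of the shape, the runner, the emergency poller or the
eyes event processor would just be celery tasks. A rough sketch using
the decorator API of current celery/django-celery; the task names, the
schedule and the bodies are placeholders, not a real proposal:

from datetime import timedelta

from celery.task import task, periodic_task

@task
def vogeler_runner(node_id, command):
    # Placeholder: publish `command` on the command queue for `node_id`.
    pass

@periodic_task(run_every=timedelta(minutes=1))
def eyes_event_processor():
    # Placeholder: drain the monitoring results queue and update
    # monitor state, like the eyes event processor would.
    pass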
There are a lot more implications and advantages, like DB
consolidation/reuse, but I want to hear your opinions first.
What do you think about it? Please feel free to modify the page and
add other architecture possibilities!
Miquel
2010/10/18 Miquel Torres <tob...@googlemail.com>:
> 1 - My choice of CouchDB was simply because I was building a freeform
> system. Obviously I'm not expecting anyone else to use that.
Yeah, there is not really a need to determine the DB design right now.
> 3 - I had no intention of creating any sort of listening daemon on the
> client. I figured func would be a much better solution in that case
> but it has the whole certmaster component.
I think I am not following you here. In your design
(http://github.com/lusis/vogeler/raw/gh-pages/vogeler.jpg), you
clearly define a vogeler-client at each node that listens to the
queue. That does not seem to match using func.
Besides, func looks good for what it does. However, for a system that
needs to scale to thousands of nodes, managing the nodes with ssh
connections, no matter how parallel, is not a good option. That's what
the queue and agent are for. (See the classic paper on the matter:
http://www.infrastructures.org/papers/bootstrap/bootstrap.html, in
particular the "Push vs. Pull" section.)
Note, however, that using something like func or fabric could be
useful for certain special "push" tasks: for example initial
bootstrapping (which we have left out for the moment), "backup"
monitoring checks, even agent repairs/restarts...
Just not as the normal way of communicating with all nodes every X minutes.
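For those occasional push tasks, a fabric task would be enough; a
sketch only, with the host names, service name and package name all
made up:

# fabfile.py - occasional "push" jobs run with the `fab` tool, outside
# the normal queue-based communication.
from fabric.api import env, sudo

env.hosts = ["node1.example.com", "node2.example.com"]

def restart_agent():
    # Agent repair/restart over ssh for a misbehaving node.
    sudo("/etc/init.d/overmind-agent restart")

def bootstrap():
    # Initial bootstrapping: install the agent package on a fresh node.
    sudo("apt-get -y install overmind-agent")

# usage: fab restart_agent   (or fab -H node3.example.com restart_agent)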
Are you thinking about something completely different, a push system?
Or what is func for?
> 4 - Right now Vogeler has no concept of any sort of special commands.
> I had thought about providing special handlers for ohai and facter
> (puppets fact tool) though. I've got some experience with both if
> that's helpful.
Certainly, Ohai and Facter are another interesting area to handle. In
the ideal, rocking case, we would have plugins both for configuration
management (so that you can use chef-solo or puppet) and for gathering
system info with ohai or facter. I think that makes it even more
important to design a robust agent core into which you can plug this
sort of functionality.
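Sketching what I have in mind for such an agent core (all the names
here are invented): the core only knows how to talk to the queue and to
call plugins, and chef-solo, puppet, ohai or facter support each
becomes a small plugin.

import json
import subprocess

class AgentPlugin(object):
    # Base class: the agent core only calls run() and ships whatever
    # it returns back on the response queue.
    name = None

    def run(self):
        raise NotImplementedError

class ChefSoloPlugin(AgentPlugin):
    name = "chef-solo"

    def run(self):
        subprocess.check_call(["chef-solo", "-j", "/etc/chef/node.json"])
        return {"status": "ok"}

class OhaiPlugin(AgentPlugin):
    name = "ohai"

    def run(self):
        # ohai prints the system facts as JSON on stdout
        return json.loads(subprocess.check_output(["ohai"]))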
Btw., that would also be an awesome way to bring Python and Ruby
devops together.
I think you are the right man here, John ;-)
Miquel
2010/10/20 lusis <lusi...@gmail.com>:
Yes, that would basically make sense.
Of course a daemon is not needed at all for that. Once the central
system notices a missing response and decides there is a problem, a
job can be started that directly uses ssh to make the checks, as you
describe pretty well.
If everything fails and, as you say, manual intervention is required,
effort should be taken to create a meaningful alert with all the
necessary info (logs) for the user. The user could also be given the
option of a "rebirth" action, which would destroy the node and create
a new one with the exact same configuration.
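Roughly, the escalation could look like this; the callables passed in
are placeholders for the real ssh check job, the alerting code and
Overmind's provisioning layer, so this is only a sketch:

def handle_unresponsive_node(node, ssh_check, alert, destroy, provision):
    if ssh_check(node):
        # The host is still reachable over ssh: agent problem only.
        alert(node, "agent down but host reachable")
        return
    alert(node, "node unreachable: manual intervention or rebirth")
    if getattr(node, "rebirth_requested", False):
        config = node.configuration   # roles, attributes, provider settings
        destroy(node)
        provision(config)             # recreate with the exact same configuration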
Miquel
2010/10/26 lusis <lusi...@gmail.com>: