Development ideas for milestones 2 and 3

Miquel Torres

unread,

Oct 15, 2010, 6:31:22 AM10/15/10

to Overmind development

Hi,

after the 0.1 release, we are getting interesting feedback. There are
people with other ideas, with projects that could be integrated, and
so on.Let's discuss what goals the next two milestones can have, and
what would be the optimal architecture.

We already have the substrate, the foundation on which we can do
actual useful things: the nodes. What should come next? What are the
long-term plans?

Well, for one, The Overmind wants to *know*. It will not rest until it
controls all your servers, knows everything about them and can model
them at will. The end goal is of course self-awareness. Overmind will
bind them all and control the whole Internet.

But looking at the next steps, we can choose to give the next release
a particular focus. Here are some features that we can address?

- Making provisioning end user ready
- Save images for each provider to reduce waiting time
- Perform actions (like node creation) in the background. Either
in a thread or implementing the job queue straight away with Celery
- Logs. There should be a history of events, like creation or
destructions of nodes by user X. Every user could see the messages and
see what's new since his/her last login
- Design a pretty UI. That can be done in two ways: making a nice
css desing, clean and modern, or using jQuery UI's styles, easier,
more immediate and cool, but maybe not such a great end result.
- The end goal would be to have a great user interface

- Introducing Configuration management
- The first priority is configuration management. The most
important decision will be whether to go for a first quick fabric-push
version or straight away to the queue based system. Some ideas are
outlined in the Roadmap wiki page.
- Apart from an scalable architecture, it would be amazing if we
could support CMS as plugins, so that the Ruby world would also be
willing to use Overmind. Chef is getting a lot of traction, so being
able to use that would be a huge bonus. Let's unite all devops!

- Any ideas are welcome. But development should be highly iterative.

Let's get the ball rolling!
Miquel

lusis

unread,

Oct 15, 2010, 3:55:00 PM10/15/10

to Overmind development

I hit Grig up on the githubs and he suggested I subscribe so I did.

As far as the CM side goes, I've already got some working queue code
in Vogeler that you guys are free to poach. It's AMQP specific right
now because Stomp in RabbitMQ is pretty crappy.

The thing about Vogeler is that, while I originally wanted it to be a
CM database it's scaled back a bit to being almost pure C&C at this
point. Mainly because we all start with grandiose plans.

One thing that I DO have working/designed is abstracted persistence.
To that end, I mentioned to Grig that it might be worth creating an
Overmind persistence backend for Vogeler. If I did that, it might
help?

Other than that, I'm not a Django guy and pretty much a python newb so
I'll help where I can. I'm actually working on Fog (a ruby project
similar to libcloud) so if that would benefit, I'm game.

Grig Gheorghiu

unread,

Oct 15, 2010, 6:35:57 PM10/15/10

to overmi...@googlegroups.com

On Fri, Oct 15, 2010 at 12:55 PM, lusis <lusi...@gmail.com> wrote:
>
> As far as the CM side goes, I've already got some working queue code
> in Vogeler that you guys are free to poach. It's AMQP specific right
> now because Stomp in RabbitMQ is pretty crappy.
>
> The thing about Vogeler is that, while I originally wanted it to be a
> CM database it's scaled back a bit to being almost pure C&C at this
> point. Mainly because we all start with grandiose plans.
>
> One thing that I DO have working/designed is abstracted persistence.
> To that end, I mentioned to Grig that it might be worth creating an
> Overmind persistence backend for Vogeler. If I did that, it might
> help?
>
> Other than that, I'm not a Django guy and pretty much a python newb so
> I'll help where I can. I'm actually working on Fog (a ruby project
> similar to libcloud) so if that would benefit, I'm game.

Hi John - thanks for offering to help! I'll take a look at Vogeler
over the weekend and I'll report back here.

Thanks!

Grig

Grig Gheorghiu

unread,

Oct 15, 2010, 6:34:23 PM10/15/10

to overmi...@googlegroups.com

On Fri, Oct 15, 2010 at 3:31 AM, Miquel Torres <tob...@googlemail.com> wrote:
>
> - Introducing Configuration management
> - The first priority is configuration management. The most
> important decision will be whether to go for a first quick fabric-push
> version or straight away to the queue based system. Some ideas are
> outlined in the Roadmap wiki page.
> - Apart from an scalable architecture, it would be amazing if we
> could support CMS as plugins, so that the Ruby world would also be
> willing to use Overmind. Chef is getting a lot of traction, so being
> able to use that would be a huge bonus. Let's unite all devops!
>

As Miquel and I also discussed privately, in terms of Config. Mgmt I
think we should start with supporting plugins that know how to output
configuration files for different CM tools. As I mentioned before, I
think kokki and chef-solo are 2 good starting points.

To me the central notion in CM is the one of roles. So we should
definitely be able to tie roles to nodes in Overmind. Another
important notion is that of attributes (to use the puppet/chef
nomenclature), which are variables such as IP addresses, host names,
directory names etc. that can be customized before running the CM tool
on a given client node. These will be harder to support, but I propose
that we allow users to specify both an attribute name and its
associated value via the Overmind UI, for starters at least.

Grig

Miquel Torres

unread,

Oct 16, 2010, 4:48:00 AM10/16/10

to Overmind development

Hi lusis,

great to see you around.

(in case anyone is not following, Vogeler is hosted here
http://github.com/lusis/vogeler).

Now, sorry if I get on a tangent, but I think grandiose plans are
needed to actually get something useful done. The secret to success is
to divide work in small, realistic sized bits, so that you don't loose
focus. That's what we did with Overmind. You can see in the Roadmap
all the Milestones talk. That is not fancy talking, but a way to
ensure we don't get carried away, over engineer, and in general don't
get things done.

Every iteration while reasonable sized, should bring added value.

With Overmind, I actually envision much more than "just" CM. In the
end, operations should be semantic. A user should be able to say: "I
want a deployment of type X". And all the needed nodes, with the pre-
configured sizes, setup as load balancers, webservers, DBs, etc,
should be automatically launched *and* monitored. Now, if I had set to
implement something like that from the beginning, I would still be in
the planning phase. Now instead, just two months later, there is
already something limited but useful, that people can actually use.
But enough of preaching ;-)

So back on topic, ideally all the CM and in general anything related
to server configuration should be stored in the Node model, in the
Overmind DB. The reason is that it must be integrated into the system,
so that you can access it via REST or the UI. And there is no need for
an extra DB. We may need a nosql DB later, when we store monitoring
data, but that is another story.

So, how can you help us?
First, not knowing Django is no problem. Many things have been a first
for me in this project. I didn't know anything about Django forms or
authentication myself, for example.

We will need to figure out:
- How Overmind can send orders to a queue
- How to implement a node agent that consumes the queue, and when
appropriate executes the CM tool with attached parameters (roles,
attributes). We are leaning towards Chef-solo for CM at the node.
- Design the message structure proper

So I bet you know how to do much of this, and even can copy or adapt
code from Vogeler to get us jump started.

Another comment is that we are not interested in supporting several
different message queue implementations. RabbitMQ would do for
example. We want end-user flexibility, not library flexibility/
independence.

Btw., Fog looks quite nice. However, it does not benefit us because we
are already using libcloud. While we could relatively easily switch to
another cloud library if the need arouse, switching to a Ruby library
is just not possible from Python/django!

Ah! and the reason we are going for a queue at all is to allow for
extreme scalability, which a pull system only provides up to a point.

So, what do you think?

lusis

unread,

Oct 16, 2010, 10:46:42 AM10/16/10

to Overmind development

See comments inline.

On Oct 16, 4:48 am, Miquel Torres <tob...@googlemail.com> wrote:
> Hi lusis,
>
> great to see you around.
>
> (in case anyone is not following, Vogeler is hosted herehttp://github.com/lusis/vogeler).
>

Yeah I guess I should have mentioned where it was. I assume to much
these days that people will just hit up github ;)

> Now, sorry if I get on a tangent, but I think grandiose plans are
> needed to actually get something useful done. The secret to success is
> to divide work in small, realistic sized bits, so that you don't loose
> focus. That's what we did with Overmind. You can see in the Roadmap
> all the Milestones talk. That is not fancy talking, but a way to
> ensure we don't get carried away, over engineer, and in general don't
> get things done.
>
> Every iteration while reasonable sized, should bring added value.
>

I agree. In my case, not being very deep into Python, I found myself
going off on distracting tangents that prevented me from actually
making progress. I have some big plans for Vogeler and other projects
as well.

> With Overmind, I actually envision much more than "just" CM. In the
> end, operations should be semantic. A user should be able to say: "I
> want a deployment of type X". And all the needed nodes, with the pre-
> configured sizes, setup as load balancers, webservers, DBs, etc,
> should be automatically launched *and* monitored. Now, if I had set to
> implement something like that from the beginning, I would still be in
> the planning phase. Now instead, just two months later, there is
> already something limited but useful, that people can actually use.
> But enough of preaching ;-)
>
> So back on topic, ideally all the CM and in general anything related
> to server configuration should be stored in the Node model, in the
> Overmind DB. The reason is that it must be integrated into the system,
> so that you can access it via REST or the UI. And there is no need for
> an extra DB. We may need a nosql DB later, when we store monitoring
> data, but that is another story.
>
> So, how can you help us?
> First, not knowing Django is no problem. Many things have been a first
> for me in this project. I didn't know anything about Django forms or
> authentication myself, for example.
>

Right. I guess probably a better case is that I personally have very
little interest in front-end design/coding. I don't do it very well.
I'm much more of a backend design guy. I also personally don't think
that getting distracted with Django is good for me at this point of
learning Python.

> We will need to figure out:
> - How Overmind can send orders to a queue
> - How to implement a node agent that consumes the queue, and when
> appropriate executes the CM tool with attached parameters (roles,
> attributes). We are leaning towards Chef-solo for CM at the node.
> - Design the message structure proper
>
> So I bet you know how to do much of this, and even can copy or adapt
> code from Vogeler to get us jump started.
>

Well my initial thoughts were that I could work on providing a way for
Overmind to use Vogeler and vice-versa as a separate track. i.e.

- Provide clear documentation on the format of messages put on the
queue for clients to read.
- Create a persistence backend for Vogeler that, when used, uses the
Overmind REST api to store the results from the client.

Essentially, you could use Vogeler's already built queuing system to
communicate/execute commands on remote nodes. In that regard, you
would be operating in the same spaces as 'vogeler-runner'. Once the
commands are run and results dropped for return by the clients,
'vogeler-server' can persist those results to Overmind via REST.

> Another comment is that we are not interested in supporting several
> different message queue implementations. RabbitMQ would do for
> example. We want end-user flexibility, not library flexibility/
> independence.
>

Honestly using any AMQP messaging system is pretty transparent. It's
all about the setup. Some expect you to preconfigure exchanges and
queues (qpid) while other's let that happen at connection time
(RabbitMQ). I'm getting the same feeling myself about Vogeler.

> Btw., Fog looks quite nice. However, it does not benefit us because we
> are already using libcloud. While we could relatively easily switch to
> another cloud library if the need arouse, switching to a Ruby library
> is just not possible from Python/django!
>

I guess I was thinking more along the lines of my experience with Fog
would allow me to port some of the concepts (where appropriate) to
Overmind as it relates specifically to cloud interaction. Of course
libcloud is much more mature than fog at this point. We really only
have solid support for 3 or 4 providers.

> Ah! and the reason we are going for a queue at all is to allow for
> extreme scalability, which a pull system only provides up to a point.
>
> So, what do you think?
>

I agree wholeheartedly on the queue system design. It's proven
technology. You remove the need for listening daemons and, as you
said, it really does scale very well. I like asynch as a concept a
lot.

Miquel Torres

unread,

Oct 16, 2010, 11:15:15 AM10/16/10

to overmi...@googlegroups.com

>Well my initial thoughts were that I could work on providing a way for
>Overmind to use Vogeler and vice-versa as a separate track. i.e.

>- Provide clear documentation on the format of messages put on the
>queue for clients to read.
>- Create a persistence backend for Vogeler that, when used, uses the
>Overmind REST api to store the results from the client.

That sounds good.

Could you elaborate on the second point?. I originally thought that
Overmind's backend would consume the result/response queue, which
should scale better than thousands of nodes storing results against a
server (REST API).

Another point is the monitoring component, which could also store
results either against a REST API or a queue. I am talking to the eyes
author (http://bitbucket.org/heckj/eyes/wiki/detailed_goals) to see
whether we can collaborate on that.

2010/10/16 lusis <lusi...@gmail.com>:

Joseph Heck

unread,

Oct 16, 2010, 12:19:00 PM10/16/10

to overmi...@googlegroups.com, overmi...@googlegroups.com

Since I'm the author of Eyes, i should at least mention that I am lurking on the list.

I have been working on Eyes slowly and quietly, intending to have something functional before making much in the way of noise about it. The wiki pages at Eyes has a lot of detail around what im trying to do with it - i see it very clearly as a simple and focused subcomponent for a larger system. fundamentally it was created from my frustration that existing monitoring solutions, even from large vendors, didnt have decent APIs that allowed you to create, update, delete, and check status of monitors - exactly the kind of thing you need to have to make integration with larger systems like Overmind, Chef, Puppet, etc effective at keeping track on monitoring.

But this list isnt about eyes - its about Overmind, so ill get back to our overmind discussions...

- joe

Sent from my iPad

Miquel Torres

unread,

Oct 16, 2010, 2:39:50 PM10/16/10

to overmi...@googlegroups.com

Hi Joe,

the way I see it is this:
Overmind already has server provisioning. To be a complete solution it
needs an scalable CMS and monitoring system.

you and John (vogeler) have already put a lot of thought and code into
eyes and vogeler, respectively. If we can somehow fit the pieces of
the puzzle, we would have a very cool solution sooner than we thought.
If not, well I hope it helps us get on the right track, at least.

Even though monitoring was not planned at this stage, I think that the
final architecture must take into account that both cms and monitoring
data need to be handled, so that we arrive at the best, simplest
design possible.

For Vogeler, you seem to have thought about having a queue only for
issuing commands, but results are posted back via REST. That would be
compatible with eyes's approach, though I am not sure exactly how eyes
gets the data back from a node (there is an agent, but also polling?).

While we could go for such an approach, do you think it would be
feasible to have the agent report both cms run results and monitoring
data back to a queue? the advantage is even more scalability and
simplifying the design by reusing the queue system for more tasks and
avoiding polling. Maybe it is a dumb idea, but it sounds attractive.

2010/10/16 Joseph Heck <josep...@gmail.com>:

Joseph Heck

unread,

Oct 16, 2010, 3:00:01 PM10/16/10

to overmi...@googlegroups.com

Just to be certain - CMS = Configuration Management System? I read up
on Vogeler for a few seconds (Hi John!) and say that he was aiming at
a CMDB (configuration management database), but haven't dived deeper
to see what's there in code or architecture.

The architecture of eyes today is set up with a central "point" that's
REST based to post in results of something monitoring - expecting that
I have at least one (more if some scale is needed) agents running that
read against a REST blackboard of what is pending to be updated, and
reporting back through a REST interface. That same REST interface is
heading toward being extended to be able to ask questions about the
state of monitor, update the monitor (to change thresholds), etc.
Generally, that core is my "where I stash state for the monitors".

I seriously thought about queues, but backed away from them for a
start. That all said, it could easily be modified for something to
read out of a queue/pipe of event data and update the state. The agent
I originally envisioned was intended to run on the same server/VM
image as the state engine to start - and I knew I might want multiple
of those and scaling out to their own little VM "poller" engines if
need be. That's all for a polling model... which I will be the first
to agree doesn't scale at extreme levels.

The whole thing could be pushed out even further to agents running on
the target machines and just piping event data back into the
combination of systems - some of which gets routed to the monitoring
system to provide updates. I made provisions for that in my design
with the idea of a "passive monitor" (stealing the term from Nagios).
I started with a polling system because I thought that might be more
palatable to folks wanting to install and use something quickly and
easily.

John Vincent

unread,

Oct 16, 2010, 3:34:52 PM10/16/10

to overmi...@googlegroups.com

Actually Vogeler is ENTIRELY queue based. Client, server and runner
never talk to each other directly. The REST stuff in Vogeler is simply
because CouchDBkit talks to CouchDB via REST.

Here's a diagram I drew up:

http://github.com/lusis/vogeler/blob/gh-pages/vogeler.jpg

To answer Joe's question, originally it was going to be a full blown
Configuration Management database but I've scaled back to focus on
Command and Control for now.

John Vincent

unread,

Oct 16, 2010, 3:41:17 PM10/16/10

to overmi...@googlegroups.com

I forgot to mention. Using a queue based system for monitoring actually is a bit risky because you never know if the node will ever respond. Most monitors are "I need to know right now how full your disk is"

Joseph Heck

unread,

Oct 16, 2010, 3:53:56 PM10/16/10

to overmi...@googlegroups.com

agent failure in the queue system absolutely needs to be taken into
account if you're using that mechanism. Yeah, I short circuited that
by expecting a poller to do that work, and am planning on having some
components of the stateful core be able to respond to questions like
"how long has it been since you last interacted with this poller" to
keep an awareness in the distributed system that it might be failing
or in partial failure.

On Sat, Oct 16, 2010 at 12:41 PM, John Vincent <lusi...@gmail.com> wrote:
> I forgot to mention. Using a queue based system for monitoring actually is a
> bit risky because you never know if the node will ever respond. Most
> monitors are "I need to know right now how full your disk is"
>

>>>>> 2010/10/16 lusis <lusi...@gmail.com>:

John Vincent

unread,

Oct 16, 2010, 4:06:14 PM10/16/10

to overmi...@googlegroups.com

Yeah I actually planned on adding a background thread in the server
that would have it optionally ask clients to check in after a
configurable amount of time.

I recently added populating a Last Error functionality in the client
as well. Essentially any exception will, assuming the exception isn't
communication with the queue itself, send a return message with the
details of the error.

Miquel Torres

unread,

Oct 18, 2010, 7:08:10 AM10/18/10

to overmi...@googlegroups.com

Overmind aims would make it *the* most critical system of any IT
infrastructure. That means that we both need high availability and
scalability.

While our small/medium iterations approach to development will mean we
don't get both features from the beginning, the design must *allow* to
have them later.

High availability:
there are three points of failure in a queue based system:
- Central server that does the work. You need to solve this problem
regardless of the architecture.
- Queue. You can replicate, make messages persistent and cluster the queue:
http://www.rabbitmq.com/pacemaker.html
- Agents. A system based on a queue and agents needs to transform a
weakness (agent failure) into a strength. How? having an agent
checking a queue is already a kind of basic monitoring. If you "miss a
heartbeat", the system would immediately mark the node as having
problems. Then, we can devise ways to handle those cases. For example
having active "emergency" checks, whereas overmind *calls* that
particular client to see what's wrong. And the agent could be designed
in an extra secure way. Fox example, having an agent (with an
efficient C core) that does nothing (simplicity) but communicating
with the queue and executing local commands (call chef-solo, execute
monitoring plugins, etc...) in an independent process. There might
even already be something similar that we can reuse.
But agent security against failure is a topic in itself. John, what
else have you investigated in this regard?

Scalability:
Agents that report results or info back to a central REST API scale
only up to a point (even though there are ways around that).

Queue systems, in contrast scale tremendously.

Some performance numbers for RabbitMQ:
- from 1M msg/s - 100 msg/s = 60,000,000 msg/minute - 6,000 msg/minute
http://groups.google.com/group/rabbitmq-discuss/browse_thread/thread/f376d885a8c478ac/e55ae8c821405044

Worst case:
http://www.sheysrebellion.net/blog/2009/06/24/amqp-kool-aid-part-15-rabbitmq-benchmark/

While having multiple pollers (like Joseph has envisioned) improves
the situation, a REST API can not scale to those levels easily.
In a queue based system, you remove the info exchange limit and there
remains only a single bottleneck: The server that processes the
information and saves to the DB + performs automatic tasks, and that
can be solved in different ways.

That said, I am not obsessed with an all queue design. You can
convince me otherwise. It is just that in principle polling seems...
don't know, just a worse option that resusing a component (queue) we
will already have in place for config management, and that can scale
to infinity out of the box.

2010/10/16 John Vincent <lusi...@gmail.com>:

Miquel Torres

unread,

Oct 18, 2010, 7:16:39 AM10/18/10

to overmi...@googlegroups.com

@John

>I forgot to mention. Using a queue based system for monitoring actually is a bit risky because you never know
> if the node will ever respond. Most monitors are "I need to know right now how full your disk is"

I partially responded to that, but I'd like to add something else.
You do want the agent to responde periodically. And a queue allows
that to be more often than other monitoring systems. Either by
sending overmind a periodic broadcast message, or by expecting to
recieve a message every minute.

Another thing. You say in Vogeler's FAQ: "I don’t think Vogeler will
ever be a replacement for Puppet or Chef"
But a Configuration Management database will surely substitute *some*
part of Chef? In any case Overmind's goal it to completely replace the
role of a Chef server, but probably using chef-solo at the client
side.

2010/10/18 Miquel Torres <tob...@googlemail.com>:

Miquel Torres

unread,

Oct 18, 2010, 8:31:33 AM10/18/10

to overmi...@googlegroups.com

@Joseph:

>I started with a polling system because I thought that might be more
> palatable to folks wanting to install and use something quickly and easily.

Complexity depends on the whole system. Currently, you need to
configure something for server provisioning, another system for CM,
another for monitoring...
Overmind will have a huge simplicity advantage.

Besides, it all depends on installation easiness. An agent sounds more
complicated, but in the end, you will have an overmind-server package,
and an overmind-agent package. The agent will be automatically
installed for new nodes, and can be manually installed for existing
nodes. You would need a package for the eyes client anyway.

2010/10/18 Miquel Torres <tob...@googlemail.com>:

Miquel Torres

unread,

Oct 18, 2010, 9:46:27 AM10/18/10

to overmi...@googlegroups.com

Ok, I created a wiki page to start writing down the different
possibilities: http://github.com/tobami/overmind/wiki/Architecture-explorations

Inspired by John's nice diagram, :-) I have drawn and added a very
simplified diagram of a would-be complete Overmind.

I made it so that you understand what my global idea is. I left out
some details on purpose. For example the DB question, which bears
another discussion. But it should suffice to get my idea. I am a huge
proponent of simplicity, and when you take into consideration all
areas together, a system can become very complex, with dozens of
moving parts (we use several solutions with lots of moving parts
nowadays!). This design eliminates much of that complexity, and allows
for a modular system, where I could imagine someone using it only for
monitoring, for example.

Vogeler-runner would be a task of the job queue, as would be the
emergency poller (or even in the case of a pure-polling design!).
Vogeler client would be like an overmind-agent. Response messages
could be consumed by the jobs themselves, or by the core (not sure
about that). The messaging structure outlined by John's diagram seems
like a good solution. The eyes event processor could also be
implemented as a celery periodic task (job).

Although django-celery is (yet again) something I've never done
before, it is used at my company for example to resize millions of
user submitted images, so it would be a really nice solution, while
allowing to develop all parts independently.

There are a lot more implications and advantages, like DB
consolidation/reuse, but I want to hear your opinions first.

What do you think about it? Please feel free to modify the page and
add other architecture possibilities!

Miquel

2010/10/18 Miquel Torres <tob...@googlemail.com>:

lusis

unread,

Oct 19, 2010, 9:24:02 PM10/19/10

to Overmind development

I like the layout. It appears to be very scalable. A few things:

1 - My choice of CouchDB was simply because I was building a freeform
system. Obviously I'm not expecting anyone else to use that.
2 - I appreciate the kind words on the Vogeler setup. I don't know if
it's the BEST model but I like it ;)
3 - I had no intention of creating any sort of listening daemon on the
client. I figured func would be a much better solution in that case
but it has the whole certmaster component.
4 - Right now Vogeler has no concept of any sort of special commands.
I had thought about providing special handlers for ohai and facter
(puppets fact tool) though. I've got some experience with both if
that's helpful.

Sorry for the late response. I've been out of commission the past day
with work and a 24 hour bug. Back in the game now though.

> >>>http://groups.google.com/group/rabbitmq-discuss/browse_thread/thread/...
>
> >>> Worst case:
> >>>http://www.sheysrebellion.net/blog/2009/06/24/amqp-kool-aid-part-15-r...

>
> >>> While having multiple pollers (like Joseph has envisioned) improves
> >>> the situation, a REST API can not scale to those levels easily.
> >>> In a queue based system, you remove the info exchange limit and there
> >>> remains only a single bottleneck: The server that processes the
> >>> information and saves to the DB + performs automatic tasks, and that
> >>> can be solved in different ways.
>
> >>> That said, I am not obsessed with an all queue design. You can
> >>> convince me otherwise. It is just that in principle polling seems...
> >>> don't know, just a worse option that resusing a component (queue) we
> >>> will already have in place for config management, and that can scale
> >>> to infinity out of the box.
>

> >>> 2010/10/16 John Vincent <lusis....@gmail.com>:

> >>>> Yeah I actually planned on adding a background thread in the server
> >>>> that would have it optionally ask clients to check in after a
> >>>> configurable amount of time.
>
> >>>> I recently added populating a Last Error functionality in the client
> >>>> as well. Essentially any exception will, assuming the exception isn't
> >>>> communication with the queue itself, send a return message with the
> >>>> details of the error.
>

> >>>> On Sat, Oct 16, 2010 at 3:53 PM, Joseph Heck <joseph.h...@gmail.com> wrote:
> >>>>> agent failure in the queue system absolutely needs to be taken into
> >>>>> account if you're using that mechanism. Yeah, I short circuited that
> >>>>> by expecting a poller to do that work, and am planning on having some
> >>>>> components of the stateful core be able to respond to questions like
> >>>>> "how long has it been since you last interacted with this poller" to
> >>>>> keep an awareness in the distributed system that it might be failing
> >>>>> or in partial failure.
>

> >>>>> On Sat, Oct 16, 2010 at 12:41 PM, John Vincent <lusis....@gmail.com> wrote:
> >>>>>> I forgot to mention. Using a queue based system for monitoring actually is a
> >>>>>> bit risky because you never know if the node will ever respond. Most
> >>>>>> monitors are "I need to know right now how full your disk is"
>

> >>>>>>>> avoiding polling. Maybe it is a dumb idea, but it sounds...
>
> read more »

Miquel Torres

unread,

Oct 20, 2010, 4:07:53 AM10/20/10

to overmi...@googlegroups.com

Glad you like it.

> 1 - My choice of CouchDB was simply because I was building a freeform
> system. Obviously I'm not expecting anyone else to use that.

Yeah, there is not really a need to determine the DB design right now.

> 3 - I had no intention of creating any sort of listening daemon on the
> client. I figured func would be a much better solution in that case
> but it has the whole certmaster component.

I think I am not following you here. In your design
(http://github.com/lusis/vogeler/raw/gh-pages/vogeler.jpg), you
clearly define a vogeler-client at each node that listens to the
queue. That does not seem to match using func.
Besides, func looks good for what it does. However, for a system that
needs to scale to thousands of nodes, managing the nodes with ssh
connections, no matter how parallel, is not a good option. That's what
the queue and agent are for. (the classical paper on that matter:
http://www.infrastructures.org/papers/bootstrap/bootstrap.html go to
the "Push vs. Pull" section).

Note however, that using something like func or fabric could be useful
for certain special "push" tasks. For example initial bootstrapping
(which we have left out for the moment), "backup" monitoring checks,
even agent repairs/restarts...
Just not for the normal way to communicate every X minutes to all nodes.

Are you thinking about a completely different thing, about a push
system? or what is func for?

> 4 - Right now Vogeler has no concept of any sort of special commands.
> I had thought about providing special handlers for ohai and facter
> (puppets fact tool) though. I've got some experience with both if
> that's helpful.

Certainly, Ohai and facter are another interesting area to handle. In
the ideal, rocking case, we would have plugins for configuration
management, so that you can use chef-solo or puppet, and also for
gathering system info with ohai or facter. I think that makes it even
more important to design a robust agent core to which you can plug in
this sort of functionality.

Btw., that would also be an awesome way to bring Python and Ruby
devops together.
I think you are the right man here, John ;-)

Miquel

2010/10/20 lusis <lusi...@gmail.com>:

lusis

unread,

Oct 26, 2010, 12:21:57 PM10/26/10

to Overmind development

So i was thinking more about the out-of-band emergency check and I
think that might be best accomplished via standard SSH as opposed to
any new daemon. You can secure the access via an overmind-user who has
limited sudo functionality. That way, you can execute a basic set of
health checks:

- Are you alive? This would be yes if ssh lets us log in
- Why aren't you responding to queue messages? Maybe a quick log parse
of the queue client log?
- Can you repair yourself? Start up the client, give me the output
from startup.

Optimally the client would be using the standard queue process to
service requests so the goal of the health check is to get it back to
that base state - a running queue client. If all of that fails, manual
intervention is required.

Make sense or was there another idea for out-of-band emergency checks?

On Oct 20, 4:07 am, Miquel Torres <tob...@googlemail.com> wrote:
> Glad you like it.
>
> > 1 - My choice of CouchDB was simply because I was building a freeform
> > system. Obviously I'm not expecting anyone else to use that.
>
> Yeah, there is not really a need to determine the DB design right now.
>
> > 3 - I had no intention of creating any sort of listening daemon on the
> > client. I figured func would be a much better solution in that case
> > but it has the whole certmaster component.
>
> I think I am not following you here. In your design
> (http://github.com/lusis/vogeler/raw/gh-pages/vogeler.jpg), you
> clearly define a vogeler-client at each node that listens to the
> queue. That does not seem to match using func.
> Besides, func looks good for what it does. However, for a system that
> needs to scale to thousands of nodes, managing the nodes with ssh
> connections, no matter how parallel, is not a good option. That's what

> the queue and agent are for. (the classical paper on that matter:http://www.infrastructures.org/papers/bootstrap/bootstrap.htmlgo to

> the "Push vs. Pull" section).
>
> Note however, that using something like func or fabric could be useful
> for certain special "push" tasks. For example initial bootstrapping
> (which we have left out for the moment), "backup" monitoring checks,
> even agent repairs/restarts...
> Just not for the normal way to communicate every X minutes to all nodes.
>
> Are you thinking about a completely different thing, about a push
> system? or what is func for?
>
> > 4 - Right now Vogeler has no concept of any sort of special commands.
> > I had thought about providing special handlers for ohai and facter
> > (puppets fact tool) though. I've got some experience with both if
> > that's helpful.
>
> Certainly, Ohai and facter are another interesting area to handle. In
> the ideal, rocking case, we would have plugins for configuration
> management, so that you can use chef-solo or puppet, and also for
> gathering system info with ohai or facter. I think that makes it even
> more important to design a robust agent core to which you can plug in
> this sort of functionality.
>
> Btw., that would also be an awesome way to bring Python and Ruby
> devops together.
> I think you are the right man here, John ;-)
>
> Miquel
>

> 2010/10/20 lusis <lusis....@gmail.com>:

> ...
>
> read more »

Miquel Torres

unread,

Oct 27, 2010, 4:46:50 AM10/27/10

to overmi...@googlegroups.com

Hi,

yes, that would basically make sense.

Of course a daemon is not needed at all for that. Once the central
system notices a missing response, and decides there is a problem, a
job can be started that directly uses ssh to make the checks, which
you describe pretty well.

If everything fails and, like you say, manual intervention is
required, effort should be taken to create a meaningful alert with all
necessary info (logs) for the user. And the possibility could be given
to take "rebirth" action, which would mean to destroy the node and
create a new one with the exact same configuration.

Miquel

2010/10/26 lusis <lusi...@gmail.com>:

Reply all

Reply to author

Forward