common metadata for describing infra components


John E. Vincent

Nov 3, 2010, 9:29:51 AM
to devops-toolchain
So there's been a bit of random sniping back and forth on Twitter
about facter vs. ohai vs. everything else. I know we mentioned it in a
previous thread, but would it maybe make sense to try to come to an
agreement on some sort of "open" metadata format to describe the bits
in our infrastructure? That way some headway can be made on the
"silos" that are starting to arise between the tools.

I'm not going to posit what I think the basics are, but I wouldn't mind
a few examples in, say, JSON that people might have.

thoughts?

Grig Gheorghiu

Nov 3, 2010, 11:10:08 AM
to devops-t...@googlegroups.com

+1 for this. I believe I initiated some of that back-and-forth
yesterday on Twitter ;-)

I was asking if people use ohai (or facter as an alternative) as
standalone tools and not part of chef or puppet, in order to glean
information about their servers.

Grig

Isaac Finnegan

Nov 3, 2010, 11:21:51 AM
to devops-t...@googlegroups.com
I am currently using facter with a custom agent (we may just consolidate with puppet) in addition to puppet to report facts about systems up into a central inventory system.

-Isaac

Scott McCarty

Nov 3, 2010, 11:23:37 AM
to devops-t...@googlegroups.com
+1, I would also love to see a standard form

On Wed, Nov 3, 2010 at 11:21 AM, Isaac Finnegan <isaacf...@gmail.com> wrote:
I am currently using facter with a custom agent (we may just consolidate with puppet) in addition to puppet to report facts about systems up in to a central inventory system.

-Isaac

On Nov 3, 2010, at 8:10 AM, Grig Gheorghiu wrote:

Vladimir Vuksan

Nov 3, 2010, 11:33:40 AM
to devops-t...@googlegroups.com

There was sniping? I missed a hell of an argument.

I am also unsure what you mean by silos in tools? Both facter and ohai
are great tools. Some have pointed out that ohai provides more detail, which
is a fair point. Some have also pointed out that ohai outputs JSON and
facter doesn't, but that is more of a presentation issue than anything else.
It would be trivial to write a wrapper to massage facter output into
something that looks very much like ohai output.

Call me confused.

Vladimir
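As a rough illustration of how small such a wrapper could be, here is a hypothetical sketch that parses facter's flat "key => value" output into JSON. The sample input is made up; in practice you would feed it the output of running `facter`:

```ruby
require 'json'

# Hypothetical wrapper: turn facter's flat "key => value" lines
# into a JSON document, closer in spirit to ohai's output.
def facter_to_json(raw)
  facts = raw.each_line.with_object({}) do |line, h|
    key, sep, value = line.chomp.partition(' => ')
    h[key] = value unless sep.empty?
  end
  JSON.pretty_generate(facts)
end

sample = "hostname => web01\nipaddress => 192.0.2.10\n"
puts facter_to_json(sample)
```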

Grig Gheorghiu

Nov 3, 2010, 11:36:45 AM
to devops-t...@googlegroups.com
On Wed, Nov 3, 2010 at 8:33 AM, Vladimir Vuksan <vli...@veus.hr> wrote:
>
> There was sniping ? I missed a hell of an argument.
>
> I am also unsure what you mean by silos in tools ? Both facter and ohai
> are great tools. Some have pointed out that ohai provides more detail which
> is a fair point. Some have also pointed out that ohai outputs JSON and
> facter doesn't but that is more of a presentation issue then anything else.
> It would be trivial to write a wrapper to massage facter output into
> something that looks very much like ohai output.
>
> Call me confused.
>

I take 'silos' to mean that they're both doing essentially the same
thing, but there is no common way to interpret their output (which is
much more terse for facter than it is for ohai).

Ideally there would be some JSON structure with well-defined key names
(such as "machine", "os", etc.) and a tool which would run the plugin
of your choice (facter, ohai, maybe others) and output the
correctly formatted JSON. Then other tools could consume that output
and store it in the NoSQL engine of your choice ;-)
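A sketch of that idea, assuming a hand-maintained mapping table; the key names and mappings below are illustrative guesses, not a proposed standard:

```ruby
require 'json'

# Map tool-specific fact names onto a shared vocabulary ("machine",
# "os", ...). The mappings are illustrative, not a standard.
MAPPINGS = {
  'facter' => { 'hostname' => 'machine', 'operatingsystem' => 'os' },
  'ohai'   => { 'hostname' => 'machine', 'platform'        => 'os' }
}

def normalize(tool, raw_facts)
  MAPPINGS[tool].each_with_object({}) do |(src, dst), out|
    out[dst] = raw_facts[src] if raw_facts.key?(src)
  end
end

puts JSON.generate(normalize('facter',
  'hostname' => 'web01', 'operatingsystem' => 'Ubuntu'))
```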

Grig

James Turnbull

Nov 3, 2010, 12:25:57 PM
to devops-t...@googlegroups.com, Nigel Kersten

Scott McCarty wrote:
> +1, I would also love to see a standard form
>

We (Puppet Labs) would happily support a standard format. It's
something Luke and I have discussed a bit in the past.

Regards

James Turnbull

--
Author of:
* Pro Linux Systems Administration
(http://www.amazon.com/gp/product/1430219122/)
* Pulling Strings with Puppet
(http://www.amazon.com/gp/product/1590599780/)
* Pro Nagios 2.0
(http://www.amazon.com/gp/product/1590596099/)
* Hardening Linux
(http://www.amazon.com/gp/product/1590594444/)


John E. Vincent

Nov 3, 2010, 12:36:06 PM
to devops-toolchain


On Nov 3, 11:33 am, Vladimir Vuksan <vli...@veus.hr> wrote:
> There was sniping ? I missed a hell of an argument.
>
> I am also unsure what you mean by silos in tools ? Both facter and ohai
> are great tools. Some have pointed out that ohai provides more detail which
> is a fair point. Some have also pointed out that ohai outputs JSON and
> facter doesn't but that is more of a presentation issue then anything else.
> It would be trivial to write a wrapper to massage facter output into
> something that looks very much like ohai output.
>
> Call me confused.
>
> Vladimir
>
Grig beat me to it, but look at it from the perspective of someone
who's writing tools to interact with the data. Admittedly, it's not
too hard now to support both, but as newer products come on the scene
it risks getting unwieldy. I could see a world where Chef gets its
information from Facter and Puppet speaks to Ohai.

Personally I have no concern about which format tools use internally, but I'd
love to simply have a way to get information from both systems in a
standard, accepted format. It could be as simple as a command-line arg
to facter and ohai, or exposing it over a REST interface.

What I don't want to do right now is implementation details. I'm
really just trying to gauge interest in the idea and possibly draft a
first round for describing one component - i.e. system - and what the
community thinks are the basic bits of information needed to describe
'system'. From there, I could see other objects like
'network' (possibly too abstract) or better yet 'application' (i.e.
has a version, a name, a path, whatever).
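For instance, an 'application' object might be no more than this; the field names and values are purely illustrative, not a proposal:

```ruby
require 'json'

# A minimal 'application' component as described above: a name, a
# version, a path. Illustrative field names and values only.
app = {
  'name'    => 'apache2',
  'version' => '2.2.14',
  'path'    => '/usr/sbin/apache2'
}
puts JSON.generate(app)
```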

Maybe it's pie in the sky. Most attempts at a standardized language
fail miserably.

James Turnbull

Nov 3, 2010, 12:51:42 PM
to devops-t...@googlegroups.com

John E. Vincent wrote:
> Grig beat me to it but look at it from the perspective of someone
> who's writing tools to interact with the data. Admittedly, it's not
> too hard now to support both but as newer products come on the scene
> it risks getting unweildy. I could see a world where Chef gets it's
> information from Facter and Puppet speaks to Ohai.

I don't see that as an infeasible goal. I'd break the requirements into:

1. Data interchange
2. Data emission

2. is a solved problem. If you want Facter facts in YAML, for example,
you can get them. It would be easy to add JSON output too (if someone
wants to send us a patch supporting JSON, that'd be awesome, BTW -
http://projects.puppetlabs.com/issues/5193). Ditto for Ohai and others.

1. is trickier but still doable, I think. Rather than thinking about it
just as data formats, it might be better to consider it as an API and a
format, perhaps:

http://datasource/data/network/interface

Then you don't care how the data is stored internally as long as you can
query it.
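The query side of that API could be as thin as a path lookup over a nested fact hash - a minimal sketch, with made-up data and no HTTP layer:

```ruby
require 'json'

# Resolve an HTTP-style path such as "network/interface/eth0" against
# a nested fact hash; callers never see how the data is stored.
FACTS = {
  'network' => { 'interface' => { 'eth0' => { 'ipaddress' => '10.0.0.5' } } },
  'os'      => { 'name' => 'Ubuntu', 'release' => '10.04' }
}

def lookup(path, facts = FACTS)
  path.split('/').reject(&:empty?)
      .reduce(facts) { |node, key| node.is_a?(Hash) ? node[key] : nil }
end

puts JSON.generate(lookup('network/interface'))
```

Fronting a function like this with a small HTTP server would give exactly the http://datasource/data/network/interface style of query.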

>
> Personally I have no concern which format tools use internally but I'd
> love to simply have a way to get information from both systems in a
> standard accepted format. It could be as simple as a command line arg
> to facter and ohai or exposing it over a REST interface.

+1

>
> What I don't want to do right now is implementation details. I'm
> really just trying to guage interest in the idea and possibly draft a
> first round for describing one component - i.e. system - and what the
> community thinks are the basic bits of information needed to describe
> 'system'. From there, i could see other objects like
> 'network' (possibly too abstract) or better yet 'application' (i.e.
> has a version, a name, a path, whatever).

Count us in - myself and Nigel Kersten would be happy to be involved.

>
> Maybe it's pie in the sky. Most attempts and a standardized language
> fail miserably.

True, but it's guaranteed to fail if no one starts it. :)

Regards

James


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEVAwUBTNGTHiFa/lDkFHAyAQJvqQgA4QkfoSdb51RraZNihcC3gWO+s1Qfs7Pg
3XqQgT53Q4iMqSRqfFAysCHfg33nH++1iruH8nlLkVerkJv6FB8i+4GJTBOowau8
kzkc4uYhxNlr/vMq//mQa5zbBHlNUNnhcziMdv32V0v0fYxV0DZnO3sd7BA1GGOu
Cc65+s+kY2KSdRZRV7nuY2GxO4GFmZjXVzZgOml7XsfB1naHoMrNVZI7tpdEdY8i
d5as6XDLv3beXS9384zGDWbwoRMk1NHV440awvWEUz0IHQpeWScVIO1QPfwDqrSX
n98y37xShyZZB+PhOSsz9yT8JMkrDmr85d2TU7Je9PGKrm7W6wQESA==
=+qDA
-----END PGP SIGNATURE-----

Adam Jacob

Nov 3, 2010, 1:06:08 PM
to devops-toolchain
On Nov 3, 9:51 am, James Turnbull <ja...@lovedthanlost.net> wrote:
> Count us in - myself and Nigel Kersten would be happy to be involved.
> > Maybe it's pie in the sky. Most attempts and a standardized language
> > fail miserably.
>
> True but will guaranteed fail if no one starts it. :)

We would be into it as well. I would have used facter if it weren't
for the licensing issue (GPL/Apache conflicts). When should the first
working group meeting be? I'm slammed for the rest of the week, but
things look a little better early next week, or the week of the 15th.

Adam

John E. Vincent

Nov 3, 2010, 1:09:37 PM
to devops-toolchain
I'm taking two days off next week before starting at my new company.
One day has been reserved by the spouse but I can't think of a better
way to use the other day than spending part of it on something I
consider important.

Nigel Kersten

Nov 3, 2010, 2:05:45 PM
to devops-t...@googlegroups.com

I'd love to see this happen.

There are better areas for us all to innovate in than data collection.

I should also note that we've started the early stages of organizing a
FOSDEM[1] devroom around config management, and so far the projects
that have expressed interest are Puppet, Chef, cfengine, and bcfg2.

More involvement is always welcome from other config management
projects, and I'd be overjoyed if we had something concrete to
coalesce around at FOSDEM with regard to common metadata.

Nigel


[1] - http://fosdem.org/2011/

Luke Kanies

Nov 3, 2010, 4:40:28 PM
to devops-t...@googlegroups.com
On Nov 3, 2010, at 8:10 AM, Grig Gheorghiu wrote:

Note that you could pretty easily replace Facter with Ohai or whatever else in most of Puppet today - it's accessed through a plugin interface that's pretty trivial to replace:

https://github.com/puppetlabs/puppet/blob/master/lib/puppet/indirector/facts/facter.rb

Just write an 'ohai.rb' or equivalent, and set 'facts_terminus = ohai' and it should work.

Today you'd still need Facter for some framework-level pieces (e.g., seeing if providers are suitable) but at least the data being sent to the server would now be from Ohai.

Of course, the whole point of Facter was that people wouldn't have to write another one of these tools, but since they were written anyway, we'd like to do what we can to make it easier to swap them out. That does feel like going the wrong direction, though.

--
I don't want any yes-men around me. I want everybody to tell me the
truth even if it costs them their jobs. -- Samuel Goldwyn
---------------------------------------------------------------------
Luke Kanies -|- http://puppetlabs.com -|- +1(615)594-8199


John Vincent

Nov 3, 2010, 4:50:39 PM
to devops-t...@googlegroups.com

Apologies for top posting in advance.

Let me clarify that that sort of swapping isn't my intention. My goal is simply to decide on a possible standard format for getting that information from the system.

I just used that as an example. For instance, in vogeler I'm working on a way to get baseline information from a system. Right now that consists of shelling out, running either ohai or facter, and parsing the output. I'd love to have an API (it doesn't need to be full-blown RPC) that provides an agreed-upon subset of configuration data in an agreed-upon format.

Sent from my Droid. Please excuse any spelling or grammar mistakes.

On Nov 3, 2010 4:40 PM, "Luke Kanies" <lu...@puppetlabs.com> wrote:
> On Nov 3, 2010, at 8:10 AM, Grig Gheorghiu wrote:
>

Luke Kanies

Nov 3, 2010, 7:43:02 PM
to devops-t...@googlegroups.com
On Nov 3, 2010, at 1:50 PM, John Vincent wrote:

Apologies for top posting in advance.

Let me clarify that that sort of swapping isn't my intention. My goal is simply to decide on a possible standard format for getting that information from the system.

I just used that as an example. For instance, in vogeler I'm working on a way to get baseline information from a system. Right now that consists of shelling out and either running ohai or facter and parsing the output. I'd love to have an api (don't need full blown rpc) that provides an agreed upon subset of configuration data in an agreed upon format.

Yep, I know that's the case, and sorry for not being clearer; I just wanted to note that we're relatively agnostic (even if we have some assumptions), and thus are amenable to the basic concept.

I'm a touch skeptical that a common format is needed, although common naming schemes might - maybe more of a common schema than a common format?  A hash, or hash of hashes, should suffice for nearly all cases, right?  Or am I foolishly conflating or deconflating format and schema?


--
Of the thirty-six ways of avoiding disaster, running away is best.
-- Chinese Proverb

John E. Vincent

Nov 3, 2010, 9:59:46 PM
to devops-toolchain


On Nov 3, 7:43 pm, Luke Kanies <l...@puppetlabs.com> wrote:
> Yep, I know that's the case, and sorry for not making it clear; I just wanted to make it clear that we're relatively agnostic (even if we have some assumptions), and thus are amenable to the basic concept.
>
> I'm a touch skeptical that a common format is needed, although common naming schemes might - maybe more of a common schema than a common format?  A hash, or hash of hashes, should suffice for nearly all cases, right?  Or am I foolishly conflating or deconflating format and schema?
>
> --
> Of the thirty-six ways of avoiding disaster, running away is best.
>                                              -- Chinese Proverb
> ---------------------------------------------------------------------
> Luke Kanies  -|-  http://puppetlabs.com  -|-   +1(615)594-8199

Well that's really part of the question I was posing. Is it needed?
I'm going to put on my end-user hat for a minute. My intention isn't
to offend or to repeat something people already know.

As an end-user, the decision around which CM tool to use has almost
never been about technical capability. At least in my experience. I
mentioned this at the Atlanta DevOps meeting last time.

Some people prefer Chef because "it's Ruby". Other people hate it for
the same reason. Some people would rather use Puppet because of the
DSL which abstracts that Ruby layer out. Others dislike Puppet for
that same reason (too far abstracted from Ruby). Some people will use
NEITHER tool because they're written in Ruby. Java guys at the company
I'm leaving are seriously considering Control Tier because it's
written in Java. Others have been around a long while and are happy
with cfengine. Maybe they have such an investment in it that it's too
painful to switch even for valid technical reasons. There are also
shops that rely solely on Cobbler and abuse ks_meta like a red-headed
stepchild.

Having been in all of those situations myself at various companies,
just having something common between the tools besides generic CM
concepts would have been a godsend. I'm a big believer in the concept
of "solved problems". I would think at this point, much of CM is a
solved problem and yet we still have these "interchange issues" where
I can't easily port between the various tools to determine which one
is the best fit based on technical merit. I would think that the
description of what composes a "system", "application" or whatever at
the most basic level would be a solved problem but it isn't to me. As
an end-user knowing that my machine database that I built up in chef
can be ported easily to puppet or cfengine is a big thing.

So to answer the last question/statement, maybe we're getting caught
up in semantics. Or maybe I'm projecting my personal wants and it's
really NOT an issue for most people. That was part of what I wanted to
ask originally.

So maybe a common "dictionary" is what I'm wanting. Everyone agrees
that a "host" is made up of a hostname, an IP address and an operating
system. Or maybe we throw out operating system. I don't want to
overcomplicate it.
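As a concrete strawman, that minimal "host" entry would just be the following; the key names are illustrative, not a proposal:

```ruby
require 'json'

# The bare-bones "host" dictionary entry floated above. Whether
# 'operatingsystem' stays in is exactly the open question.
host = {
  'hostname'        => 'web01.example.com',
  'ipaddress'       => '192.0.2.10',
  'operatingsystem' => 'Ubuntu'
}
puts JSON.generate(host)
```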

Honestly, the OS is becoming a commodity at this point anyway. I should
really be creating SHA1 fingerprints of my infrastructure components
based on capabilities. Is this Ubuntu host capable of running Apache
2.2 with these given DSOs, and does it have two NICs? Yes? Then its
fingerprint is the same as this RHEL5 host over here. When we're
operating at the scale that virtualization allows, I don't have time
to be concerned with some of the lower-level stuff. It's the same as
the difference between troubleshooting an OS problem or saying "Screw
it. I don't have time for this. Kick the blade and get it back in
service."

As I said, please don't take anything I've said as an indictment
against any particular tool or vendor. I've used almost all the tools
out there over the last 15 years and each one has a special place in
my heart ;)

Adam Jacob

Nov 4, 2010, 10:01:58 AM
to devops-t...@googlegroups.com
On Wed, Nov 3, 2010 at 9:59 PM, John E. Vincent <lusi...@gmail.com> wrote:
> As an end-user, the decision around which CM tool to use has almost
> never been about technical capability. At least in my experience. I
> mentioned this at the Atlanta DevOps meeting last time.

Not at the level of a single system, if you're really paying
attention. Puppet, Chef, Cfengine, Bcfg2 - they all install packages on
Red Hat the same way, with roughly the same number of characters
involved in the process.

> So to answer the last question/statement, maybe we're getting caught
> up in semantics. Or maybe I'm projecting my personal wants and it's
> really NOT an issue for most people. That was part of what I wanted to
> ask originally.
>
> So maybe a common "dictionary" is what I'm wanting. Everyone agrees
> that a "host" is made up of a hostname, an ip address and an operating
> system. Or maybe we throw out operating system. I don't want to over-
> complicate it.

What I hear you asking for is a common data format for automatically
discovered data about systems. For example, what does "ipaddress"
mean in Ohai? What does it mean in Facter? In Ohai, it means the IP
address of the interface that has the default route configured on it.
(That works most of the time, except when it doesn't, and you can
override it if you have to.) It also means "the one I want to use most
often if I have more than one", but that's a human meaning.

I think getting agreement on "how you determine" will be much harder
than "how I find it". For example:

{ "ipaddress": "127.0.0.1" }

Makes sense to everyone. If we all agree on that, poof, puff the
magic standard.

We can knock a large number of these out really easily - many of the
top-level Ohai keys are the same as top-level Facter keys, and I think
you could basically call a v1 of something like this done simply by
identifying where they overlap.
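That v1-by-intersection idea can be sketched in a few lines; the two key lists below are small illustrative samples, not the real full keyspaces:

```ruby
require 'json'

# Draft a "v1" common schema as the intersection of the flat Facter
# keyspace and the top-level Ohai keyspace. Sample keys only.
facter_keys = %w[hostname ipaddress kernel operatingsystem macaddress]
ohai_keys   = %w[hostname ipaddress kernel platform macaddress languages]

v1_schema = facter_keys & ohai_keys
puts JSON.generate(v1_schema)
```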

Now things get more complicated as you get deeper into the system.
For example, Ohai displays lots of data about your file systems:

"filesystem": {
  "/dev/disk0s2": {
    "block_size": 512,
    "kb_size": 211812352,
    "kb_used": 194387604,
    "kb_available": 17168748,
    "percent_used": "92%",
    "mount": "/",
    "fs_type": "hfs",
    "mount_options": [
      "local"
    ]
  },
  ...
}

On my laptop, Facter collects none of that data. Assuming it would be
useful for them to have it, would they use my hash? :) Did we mess it
up and need to change a key?

Similarly, for data that does overlap, take network interfaces:

Facter:
interfaces => lo0,gif0,stf0,en1,fw0,en0,cscotun0,en2,en3,vboxnet0,vmnet1,vmnet8
ipaddress_en2 => 10.37.129.2

Chef:

"network": {
  "interfaces": {
    "lo0": {
      ...
    }
  }
}

If you wanted the equivalent of Facter's interfaces key in Ohai,
you would do:

ruby-1.9.2-p0 > o[:network][:interfaces].keys
=> ["lo0", "gif0", "stf0", "en1", "fw0", "en0", "cscotun0", "en2",
"en3", "vboxnet0", "vmnet1", "vmnet8"]

So outside of solving this problem on this thread (and I really don't
want to try), you can see the kind of detail involved in what you're
asking. Getting to a place where we all agree on how to discover the
data would be the hardest thing - getting to a place where we agree on
what the data structure looks like under the hood at the basic level
("feels like a hash of hashes to me, bob") is easy, and in the middle
is negotiation about keyspaces.

I think there is value in the keyspace negotiation.

I think there may be one "easy" path, which would be for us to patch
Ohai to have a Facter-compatible mode that spits out the flat keyspace
- but that's a compatibility thing, not an "easier for the end user"
thing.
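Such a Facter-compatible mode might amount to a flattening shim like this sketch; the nested hash is a made-up subset of ohai-style network data, and the flat key names follow the interfaces / ipaddress_<iface> convention shown earlier:

```ruby
# Flatten ohai-style nested network data into facter-style keys.
# Illustrative data; real ohai output has many more fields.
network = {
  'interfaces' => {
    'lo0' => { 'addresses' => { '127.0.0.1'   => { 'family' => 'inet' } } },
    'en2' => { 'addresses' => { '10.37.129.2' => { 'family' => 'inet' } } }
  }
}

flat = { 'interfaces' => network['interfaces'].keys.join(',') }
network['interfaces'].each do |name, data|
  pair = data['addresses'].find { |_, info| info['family'] == 'inet' }
  flat["ipaddress_#{name}"] = pair.first if pair
end

puts flat.inspect
```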

> Honestly the OS is becoming a commodity at this point anyway. I should
> really be creating SHA1 fingerprints of my infrastructure components
> based on capabilities. Is this Ubuntu host capable of running Apache
> 2.2 with these given DSOs and does it have two nics? Yes? Then its
> fingerprint is the same as this RHEL5 host over here. When we're
> operating at the scale that virtualization allows, I don't have time
> to be concerned with some of the lower level stuff. It's the same as
> the difference between troubleshooting an OS problem or saying "Screw
> it. I don't have time for this. Kick the blade and get it back in
> service"

Neat idea.

> As I said, please don't take anything I've said as an indictment
> against any particular tool or vendor. I've used almost all the tools
> out there over the last 15 years and each one has a special place in
> my heart ;)

You're clearly totally evil and partisan, man.

Adam

--
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: ad...@opscode.com

Scott McCarty

Nov 4, 2010, 10:07:46 AM
to devops-t...@googlegroups.com
Am I missing something, or was much of this solved by SNMP? I understand that things like package lists and deeper OS stuff are not solved, but wouldn't it be prudent to try to expand upon the UC Davis MIBs for this?

Scott M

On Thu, Nov 4, 2010 at 10:01 AM, Adam Jacob <ad...@opscode.com> wrote:

John E. Vincent

Nov 4, 2010, 10:33:45 AM
to devops-toolchain
On Nov 4, 10:07 am, Scott McCarty <scott.mcca...@gmail.com> wrote:
> Am I missing something or was much of this solved by SNMP? I understand that
> things like package lists, and deeper OS stuff are not solved, but wouldn't
> it be prudent to try and expand upon the UC Davis MIBS for this?
>
> Scott M
>

Correlation of SNMP indexes with actual important data is a pain in
the ass. I have to do three lookups to correlate an interface with an
IP address at minimum. Maybe two. I'm kind of over SNMP at this point
except where I HAVE to use it. Interestingly enough, installed software
IS enumerated in a MIB already. I don't remember which one it is
offhand, though, and it doesn't address unmanaged software for obvious
reasons.

Specifically to what Adam said, I'm most definitely not wanting to
dictate how people discover information, but, yes, negotiation about
keyspaces and a compatible data structure is achievable. Does the
method for discovery really play much into that structure? If the OS
tells you the interface at index0 is this IP address, I would just take
its word for it.

Adam Jacob

Nov 4, 2010, 10:43:12 AM
to devops-t...@googlegroups.com
Well, SNMP certainly has a standard for getting some of this data over the network.  There are a few reasons none of the tools use it for discovery:

A. They don't need the protocol/service overhead - they just wanted a library.

B. Publishing extensions properly would require an OID assignment from IANA, and a tracking process, making adding custom data cumbersome.

C. For deeply nested or complex data, an SNMP walk is pretty inefficient.

It's more likely you would see a plugin for SNMP to publish facter/ohai data than you would see facter/ohai replaced with snmp discovery.

Best,
Adam

Sent from my iPhone

Grig Gheorghiu

Nov 4, 2010, 11:34:27 AM
to devops-t...@googlegroups.com
On Thu, Nov 4, 2010 at 7:01 AM, Adam Jacob <ad...@opscode.com> wrote:
>
> We can knock a large number of these out really easily - many of the
> top-level Ohai keys are the same as top-level Facter keys, and I think
> you could basically call a v1 of something like this done simply by
> identifying where they overlap.

Enthusiastic +1 on this v1 ;-)

> So outside of solving this problem on this thread (and I really don't
> want to try), you see kind of the details about what you're asking.
> Getting to a place where we all agree on how to discover the data
> would be the hardest thing - getting to a place where we agree on what
> the data structure looks like under the hood at the basic level
> ("feels like a hash of hashes to me, bob") is easy, and in the middle
> is negotiation about keyspaces.
>
> I think there is value in the keyspace negotiation.
>

So let's start the negotiation!

It would be very good if we got to the point where asking for a
certain key value was as easy as doing a query with curl on an EC2
instance against http://169.254.169.254/latest/meta-data/

Here are the keys that get spit out:

# curl -s http://169.254.169.254/latest/meta-data/
ami-id
ami-launch-index
ami-manifest-path
block-device-mapping/
hostname
instance-action
instance-id
instance-type
kernel-id
local-hostname
local-ipv4
placement/
public-hostname
public-ipv4
public-keys/
ramdisk-id
reservation-id

If you query for a specific key you get its value:

# curl -s http://169.254.169.254/latest/meta-data/local-ipv4
10.209.177.142

Grig

ahowchin

Nov 5, 2010, 2:20:44 AM
to devops-toolchain
I might be way off the mark here (since I read the postings so far
with interest but with more than a little confusion), but I went
trawling through the IETF website and came across this:
Expressing SNMP SMI Datatypes in XML Schema Definition Language
http://datatracker.ietf.org/doc/rfc5935/

In my simple mind, it sounded like a way to represent what SNMP had
discovered in XML. I tried reading it, but gave up as it's Friday
afternoon and time to go home. HTH.

Regards,
Adrian Howchin

On Nov 5, 1:34 am, Grig Gheorghiu <grig.gheorg...@gmail.com> wrote:
> On Thu, Nov 4, 2010 at 7:01 AM, Adam Jacob <a...@opscode.com> wrote:
>
> > We can knock a large number of these out really easily - many of the
> > top-level Ohai keys are the same as top-level Facter keys, and I think
> > you could basically call a v1 of something like this done simply by
> > identifying where they overlap.
>
> Enthusiastic +1 on this v1 ;-)
>
> > So outside of solving this problem on this thread (and I really don't
> > want to try), you see kind of the details about what you're asking.
> > Getting to a place where we all agree on how to discover the data
> > would be the hardest thing - getting to a place where we agree on what
> > the data structure looks like under the hood at the basic level
> > ("feels like a hash of hashes to me, bob") is easy, and in the middle
> > is negotiation about keyspaces.
>
> > I think there is value in the keyspace negotiation.
>
> So let's start the negotiation!
>
> It would be very good if we got to the point where asking for a
> certain key value was as easy as doing a query with curl on an EC2
> instance against http://169.254.169.254/latest/meta-data/
>
> Here's the keys that are spit out:
>
> # curl -s http://169.254.169.254/latest/meta-data/

Scott McCarty

Nov 5, 2010, 3:58:02 AM
to devops-t...@googlegroups.com
On Thu, Nov 4, 2010 at 10:43 AM, Adam Jacob <ad...@opscode.com> wrote:
Well, SNMP certainly has a standard for getting some of this data over the network.  There are a few reasons none of the tools use it for discovery:

A. They don't need the protocol/service overhead - they just wanted a library.

I understand the "feeling", but I suspect there is no real system impact from using SNMP. I use it like crazy and I have never, ever noticed it impact performance, except when I have done something stupid, like assigning a UC Davis MIB to a script which then uses ssh to gather data from 6 other systems. I have always solved this by caching the data and firing off the next request.
 

B. Publishing extensions properly would require an OID assignment from IANA, and a tracking process, making adding custom data cumbersome.

There are a bunch of private ones under UC Davis, which I use all of the time for custom scripts. I do kind of understand, though, because that is exactly what the UC Davis MIBs are for. Some are standard, some subtrees can be customized privately, and there would be a struggle to add new standard ones.
 

C. For deeply nested or complex data, an SNMP walk is pretty inefficient.

For real-time stuff like knife or MCollective, I am sympathetic.
 

It's more likely you would see a plugin for SNMP to publish facter/ohai data than you would see facter/ohai replaced with snmp discovery.

Best,
Adam

I just hear a lot of the same problems coming up as with SNMP, and it feels like reinventing a lot of the wheel again because SNMP is too esoteric (it annoys me too, but it might be wrapped).

The other thing is, when dealing with network devices, none of this stuff is going to be on a Cisco router, right?

Scott M

Scott McCarty

Nov 5, 2010, 4:04:41 AM
to devops-t...@googlegroups.com
On Thu, Nov 4, 2010 at 10:33 AM, John E. Vincent <lusis.org@gmail.com> wrote:
On Nov 4, 10:07 am, Scott McCarty <scott.mcca...@gmail.com> wrote:
> Am I missing something or was much of this solved by SNMP? I understand that
> things like package lists, and deeper OS stuff are not solved, but wouldn't
> it be prudent to try and expand upon the UC Davis MIBS for this?
>
> Scott M
>

Correlation of SNMP indexes with actual important data is a pain in
the ass. I have to do three lookups to correlate an interface with an
ip address at minimum. Maybe two. I'm kind of over SNMP at this point
except where I HAVE to use it. Interestingly enough installed software
IS enumerated in a MIB already. I don't remember which one it is
offhand though and it doesn't address unmanaged software for obvious
reasons.
 
Yeah, I think two: the first one discovers the interfaces, the second query maps them, and it is two walks. Cacti uses this beautifully for interface discovery. Nagios BGP checks use it flawlessly. There is a Cacti check that enumerates domains in a BIND server and graphs each one. The tree structure of SNMP just isn't as easy to use as key/value pairs, but it has massive enumeration advantages.

I fear this is a problem stemming from comfort, not any real technical advantage. SNMP is essentially a document store with a separate file which defines the entries, instead of key/value.

I am kind of only playing devil's advocate here because I understand the frustration with SNMP, but I wonder why it can't be wrangled better? Maybe at least used as the definition for this type of data. Check out the UC Davis MIBs; they solve all of the standard interface stuff.

Scott M
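The index correlation being debated above is easy to make concrete. Ignoring the SNMP transport entirely, the join of the two walks (IF-MIB::ifDescr keyed by ifIndex, and RFC1213's ipAdEntIfIndex mapping each IP back to an ifIndex) can be sketched as follows; the walk results here are made-up stand-ins, not real device data:

```python
# Sketch of the two-walk correlation discussed above. The dicts below are
# made-up stand-ins for SNMP walk results; the transport itself is omitted.

# Walk 1: IF-MIB::ifDescr, keyed by ifIndex -> interface name
if_descr = {1: "lo", 2: "eth0", 3: "eth1"}

# Walk 2: RFC1213-MIB::ipAdEntIfIndex, IP address -> ifIndex
ip_to_index = {"127.0.0.1": 1, "192.168.1.10": 2, "10.0.0.5": 2}

def addresses_by_interface(if_descr, ip_to_index):
    """Join the two walk results: interface name -> sorted list of IPs."""
    result = {name: [] for name in if_descr.values()}
    for ip, index in sorted(ip_to_index.items()):
        result[if_descr[index]].append(ip)
    return result

print(addresses_by_interface(if_descr, ip_to_index))
```

The annoyance John describes is that neither walk alone answers "which IPs does eth0 have"; you always need the join.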

Adam Jacob

Nov 5, 2010, 9:57:56 AM
to devops-t...@googlegroups.com
On Fri, Nov 5, 2010 at 3:58 AM, Scott McCarty <scott....@gmail.com> wrote:
> I understand the "feeling", but I suspect there is no real system impact for
> using SNMP. I use it like crazy and I have never, ever noticed it impact
> performance, except when I have done something stupid, like assigning a UC
> Davis MIB to a script which then uses ssh to gather data from 6 other
> systems. I have always solved this by caching the data and firing off the
> next request.

You can absolutely make SNMP work just fine. But it's fundamentally a
network discovery protocol, and we're not doing network discovery.
I'm not saying it's not nice to have, but really, the right
relationship between tools like Ohai and SNMP is as a data source for
exposure via SNMP. (Or, if you want, gathering data *from* SNMP and
presenting it)

> For real time stuff like knife or Mcollective, I am sympathetic.

Good! :) Because there is a reason we keep writing it.

> I just hear a lot of the same problems coming up as with SNMP and it feels like
> reinventing a lot of the wheel again because SNMP is too esoteric (it annoys
> me too, but it might be wrapped).

It's not because SNMP is too esoteric, it's because SNMP solves a
different problem.

> The other thing is, when dealing with network devices, none of this stuff is
> going to be on a Cisco router, is it?

Then of course we will do discovery with the means at hand.

Gonéri Le Bouder

Nov 9, 2010, 4:36:46 AM
to devops-toolchain, fusioninventory-devel
On Nov 5, 2:57 pm, Adam Jacob <a...@opscode.com> wrote:
> On Fri, Nov 5, 2010 at 3:58 AM, Scott McCarty <scott.mcca...@gmail.com> wrote:

Hi there,

http://groups.google.com/group/devops-toolchain/browse_thread/thread/b1d63d46e37a76dc

I'm Gonéri Le Bouder from the FusionInventory project and we are
interested in joining this topic.
Let me first introduce our project.

FusionInventory is basically an agent which talks to different
servers:
- OCS Inventory (an inventory and deployment project, from which
FusionInventory originated)
- the GLPI asset management software with a plugin called
FusionInventory for GLPI
- Uranos (formerly unattended-gui)
We are also in touch with the GOsa² and OPSI dev teams.

Our agent collects computer hardware and software inventory, but can
also scan the network (nmap, SNMP, NetBIOS) and collect data from
printers and network devices (switches & routers).

FusionInventory Agent is based on OCS Inventory Agent for UNIX. I was
the author of the OCS unix agent and I decided to leave the project
during FOSDEM 2010.

Today FusionInventory is an active project and we support most
operating systems:
http://forge.fusioninventory.org/projects/fusioninventory-agent/wiki/Agent_supportedplateforms

FusionInventory uses the XML format inherited from OCS Inventory.
Since the fork, we have extended it a bit and written documentation
for the format:
http://search.cpan.org/dist/FusionInventory-Agent/lib/FusionInventory/Agent/XML/Query/Inventory.pm
A Mac OS X inventory example: http://wawax.info/public-fusioninventory/gamma.lan-2010-10-21-12-30-00.ocs

We are very interested in this topic because like you, we think
interoperability is very important.
Our XML inventory is simple and can easily be converted to a hash
structure. Of course, we are open to changes and improvement.

We will be at FOSDEM 2011 too and would be glad to present our
work.

Best regards,

Gonéri

Nigel Kersten

Nov 9, 2010, 10:57:58 PM
to devops-t...@googlegroups.com, fusioninventory-devel

Gonéri, I'm putting together a combined configuration management
devroom at FOSDEM where it looks like we may have input from Puppet,
Chef, cfengine, OpenQRM and bcfg2.

I think devoting some of that time to interoperability between
inventory systems could be awesomely productive for us all.

Gonéri Le Bouder

Nov 10, 2010, 8:07:58 AM
to devops-t...@googlegroups.com, fusioninventory-devel
2010/11/10 Nigel Kersten <ni...@explanatorygap.net>:

This is a great idea, thank you for the invitation. We can also do a
little presentation of GLPI
and FusionInventory if you want.

--
     Gonéri Le Bouder

John E. Vincent

Nov 23, 2010, 4:43:09 PM
to devops-toolchain
Okay. I've taken a swipe at what I think makes sense as a first draft
of some basic system data:

https://gist.github.com/712574

I want to clarify some things going through my head when I crafted
that:

1) Out with the old
This means making some assumptions about modern hardware (by modern I
mean the last 5 or so years). This means that memory is specified in
MB and physical disk is specified in GB.
2) Try NOT to nest too deep
This was REALLY hard to do when it came to disks and network but I
think I got it to something useful. You might wonder about the format
but I tried to make it easy to get to the relevant information. It's a
bit more work to ensure that you preserve order but it keeps the
structure rather shallow. I can easily grab the count of disks/
interfaces and use that as the positional index to get specifics about
a given interface.
3) Minimal facts/Avoid transient data
I tried to avoid any transient data at this first swipe. Remember that
this is an "interchange" format of sorts. I need enough information to
build my basic CM database or enumerate my systems so I can populate
more detailed information. For instance, physical disks change rarely
while filesystem layout might change more frequently.
4) Handle virtualization and physical in the same format.
This actually plays into a bigger design. I want to avoid "optional"
data that are only displayed based on some other "fact". I would
rather not have to go reparse my json once I determine that the
hardware platform is virtual vs. physical.
5) Parse quickly and once
I wanted to be able to parse the data in a single pass and get
everything I need without having to reparse based on other
information. Obviously this doesn't work as well for nics and disks.

I know this isn't enough information on its own to get a full fledged
inventory (number of DIMMS, partition scheme and the like) but it's
pretty opaque and non-vendor specific. You should be able to populate
every field regardless of os (solaris, aix, linux, windows, hpux).

Thoughts?
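Not the actual gist, but to make point 2 concrete: a hypothetical fragment of the shallow count-plus-parallel-arrays layout, and how the count doubles as the positional index, might look like this (field names are invented for illustration):

```python
import json

# Hypothetical fragment illustrating the shallow, count-plus-parallel-arrays
# layout described in point 2 -- NOT the actual contents of the gist.
doc = json.loads("""
{
  "memory_mb": 4096,
  "disk_count": 2,
  "disk_names": ["sda", "sdb"],
  "disk_sizes_gb": [120, 500]
}
""")

# The count doubles as the range of positional indexes into the
# parallel arrays.
for i in range(doc["disk_count"]):
    print(doc["disk_names"][i], doc["disk_sizes_gb"][i])
```

The trade-off is exactly the one John names: the structure stays shallow, but the consumer has to preserve array ordering to keep the facts about one disk together.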

John E. Vincent

Nov 23, 2010, 4:44:34 PM
to devops-toolchain
On Nov 23, 4:43 pm, "John E. Vincent" <lusis....@gmail.com> wrote:
> Okay. I've taken a swipe at what I think makes sense as a first draft
> of some basic system data:
>
> https://gist.github.com/712574
>
A few quick followups:

- Sorry for the formatting. I forgot to space things out.
- The "id", "name" and "provider" fields are all determined by the data
provider.

Daniel Pittman

Nov 23, 2010, 7:18:03 PM
to devops-t...@googlegroups.com
"John E. Vincent" <lusi...@gmail.com> writes:

> Okay. I've taken a swipe at what I think makes sense as a first draft
> of some basic system data:
>
> https://gist.github.com/712574
>
> I want to clarify some things going through my head when I crafted
> that:
>
> 1) Out with the old
> This means making some assumptions about modern hardware (by modern I
> mean the last 5 or so years). This means that memory is specified in
> MB and physical disk is specified in GB.
> 2) Try NOT to nest too deep
> This was REALLY hard to do when it came to disks and network but I
> think I got it to something useful. You might wonder about the format
> but I tried to make it easy to get to the relevant information. It's a
> bit more work to ensure that you preserve order but it keeps the
> structure rather shallow. I can easily grab the count of disks/
> interfaces and use that as the positional index to get specifics about
> a given interface.

...I would probably lean toward an array-of-hash data for disks and
interfaces, because it keeps facts about the same object grouped in a single
data structure.

That makes it easier to iterate over them without needing to pass a
potentially large number of data structures around, or pack them myself.


Also, your network facts are ... interesting. How do you represent this
machine in your proposed structure? (Also, keep in mind that this is the
*simplified* version of this, because we had to make an emergency rollout
without actually putting in the HA links to the second system, or the support
for multiple paths up to our data center. Those should land in the next few
days. :)

Oh, and please note eth2 - multiple IP addresses, but none of the usual label
aliasing for it. Plus, I omitted a bunch of essentially duplicate interfaces
that had already been seen.

root@fitz-fw01:~# ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
link/ether 00:15:17:f4:2e:69 brd ff:ff:ff:ff:ff:ff
3: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 100
link/ether 00:15:17:f4:2e:68 brd ff:ff:ff:ff:ff:ff
inet 192.168.2.1/24 brd 192.168.2.255 scope global eth4
inet6 fe80::215:17ff:fef4:2e68/64 scope link
valid_lft forever preferred_lft forever
[...]
5: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 100
link/ether 00:15:17:f4:2e:6a brd ff:ff:ff:ff:ff:ff
inet 203.214.67.82/29 brd 203.214.67.87 scope global eth2
inet 203.214.67.83/29 brd 203.214.67.87 scope global secondary eth2
inet 203.214.67.84/29 brd 203.214.67.87 scope global secondary eth2
inet 203.214.67.85/29 brd 203.214.67.87 scope global secondary eth2
inet 203.214.67.86/29 brd 203.214.67.87 scope global secondary eth2
inet6 fe80::215:17ff:fef4:2e6a/64 scope link
valid_lft forever preferred_lft forever
[...]
8: vlan1@eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 00:15:17:f4:2e:6b brd ff:ff:ff:ff:ff:ff
inet 192.168.1.1/24 brd 192.168.1.255 scope global vlan1
inet6 fe80::215:17ff:fef4:2e6b/64 scope link
valid_lft forever preferred_lft forever
9: vlan101@eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 00:15:17:f4:2e:6b brd ff:ff:ff:ff:ff:ff
inet 192.168.254.4/24 brd 192.168.254.255 scope global vlan101
inet6 fe80::215:17ff:fef4:2e6b/64 scope link
valid_lft forever preferred_lft forever
10: vlan201@eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 00:15:17:f4:2e:6b brd ff:ff:ff:ff:ff:ff
inet 192.168.201.1/24 brd 192.168.201.255 scope global vlan201
inet6 fe80::215:17ff:fef4:2e6b/64 scope link
valid_lft forever preferred_lft forever
[...]
13: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 100
link/[65534]
inet 192.168.20.1 peer 192.168.20.2/32 scope global tun0
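For what it's worth, an interface like eth2 above stops being awkward once each interface is a hash whose addresses are themselves a list. A hypothetical sketch (invented field names, not any existing tool's schema):

```python
# Hypothetical representation of eth2 above as a hash whose addresses are
# a list -- invented field names, not any existing tool's schema.
eth2 = {
    "name": "eth2",
    "mac": "00:15:17:f4:2e:6a",
    "mtu": 1500,
    "addresses": [
        {"ip": "203.214.67.%d" % n, "prefix": 29} for n in range(82, 87)
    ],
}

# The first address is "primary" only by position; nothing structural
# distinguishes it from the secondaries.
primary = eth2["addresses"][0]["ip"]
secondaries = [a["ip"] for a in eth2["addresses"][1:]]
print(primary, secondaries)
```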


> 3) Minimal facts/Avoid transient data
> I tried to avoid any transient data at this first swipe.

The network details contain a whole lot of transient data: on many of our
system we have duplication of IP addresses across multiple machines, or HA
pool addresses that are present on one machine or another only if they are the
active master, or similar complexities.

[...]

> Thoughts?

The network side falls short in representing even medium-complexity machines
like mine above - which is hardly uncommon. Puppet/facter currently fall
pretty far short in representing that, because they made the same sort of odd
assumptions about how networks look.

Otherwise, it is good to see someone pushing this sort of standardization.

Regards,
Daniel
--
✣ Daniel Pittman ✉ dan...@rimspace.net  +61 401 155 707
♽ made with 100 percent post-consumer electrons

John E. Vincent

Nov 23, 2010, 8:06:48 PM
to devops-toolchain
On Nov 23, 7:18 pm, Daniel Pittman <dan...@rimspace.net> wrote:
> "John E. Vincent" <lusis....@gmail.com> writes:
>
> ...I would probably lean toward an array-of-hash data for disks and
> interfaces, because it keeps facts about the same object grouped in a single
> data structure.
>

Makes sense. I might have gone overboard in my attempt to reduce
nesting.

> That makes it easier to iterate over them without needing to pass a
> potentially large number of data structures around, or pack them myself.
>
> Also, your network facts are ... interesting.  How do you represent this
> machine in your proposed structure?  (Also, keep in mind that this is the
> *simplified* version of this, because we had to make an emergency rollout
> without actually putting in the HA links to the second system, or the support
> for multiple paths up to our data center.  Those should land in the next few
> days. :)
< snip>
> The network side falls short in representing even medium-complexity machines
> like mine above - which is hardly uncommon.  Puppet/facter currently fall
> pretty far short in representing that, because they made the same sort of odd
> assumptions about how networks look.
>

I snipped out all the text above but I'm not ignoring it.
I realized I left out a few things (netmasks/routing anyone? heh) and
was pondering the best way to represent aliases or if to include them
at all.

Let me explain that.

I would personally consider HA addressing and MOST secondary
addressing to be "application level" configuration. The data I'm
trying to represent is basic information about a system. Notice that I
don't have any packages listed either. That's the job of a puppet or a
chef to manage the application level configuration. This information
is the basic information you'll need to stand up a system. From there,
a puppetrun for chef-client will apply an application configuration to
it because it knows client 192.168.1.2 is my firewall/router box or
that apache configs on 10.10.1.1 require secondary addresses for SSL
certs.

I'm not opposed to representing THAT data as well but that's a MUCH
bigger fish to fry.

If I'm wrong on the aliasing/HA addresses please let me know but
that's my personal opinion ;)

I'll get back to you with a revised network block.

Daniel Pittman

Nov 23, 2010, 9:35:22 PM
to devops-t...@googlegroups.com
"John E. Vincent" <lusi...@gmail.com> writes:
> On Nov 23, 7:18 pm, Daniel Pittman <dan...@rimspace.net> wrote:
>> "John E. Vincent" <lusis....@gmail.com> writes:
>>
>> ...I would probably lean toward an array-of-hash data for disks and
>> interfaces, because it keeps facts about the same object grouped in a single
>> data structure.
>
> Makes sense. I might have gone overboard in my attempt to reduce nesting.

I thought about when I use information like that, and it is pretty much always
"walk over the set of disks, filter on some field, act on some others", so to
me that makes more sense. I can see your argument, though, too. :)

>> That makes it easier to iterate over them without needing to pass a
>> potentially large number of data structures around, or pack them myself.
>>
>> Also, your network facts are ... interesting.  How do you represent this
>> machine in your proposed structure?  (Also, keep in mind that this is the
>> *simplified* version of this, because we had to make an emergency rollout
>> without actually putting in the HA links to the second system, or the support
>> for multiple paths up to our data center.  Those should land in the next few
>> days. :)
> < snip>
>> The network side falls short in representing even medium-complexity machines
>> like mine above - which is hardly uncommon.  Puppet/facter currently fall
>> pretty far short in representing that, because they made the same sort of odd
>> assumptions about how networks look.
>
> I snipped out all the text above but I'm not ignoring it. I realized I left
> out a few things (netmasks/routing anyone? heh) and was pondering the best
> way to represent aliases or if to include them at all.

*nod*

> Let me explain that.
>
> I would personally consider HA addressing and MOST secondary addressing to
> be "application level" configuration. The data I'm trying to represent is
> basic information about a system.

Well, for the systems in question it literally does have, and respond to, a
range of services on all those different addresses. (Also, most large web
servers are going to have multiple IP addresses - because HTTPS support still
needs them, darn it all.)

So, those are not really "application level" in any meaningful way that
whatever random address gets picked as the first one is - and are often *less*
"application level", since they are what the host actually communicates using.


It is probably important to think about how this information is going to be
used, also: in pretty much every case I have needed to know about an IP
address (as opposed to the hostname, or some other identifier) it is because
we need to do something meaningful with it.

In which case having partial information is going to eventually fail. It
could be binding services to an interface, or building firewall rules, or
ensuring that the on-disk network configuration matches the running
configuration, or any number of things - but the missing information is
eventually going to be the information that I actually need this time.


I can kind of see an effort to distinguish these addresses as somehow less
important than the underlying management address - but on at least some
servers we have a separate management network, and the only parts we care
about configuring (or reporting) are the front-end parts.

[...]

> If I'm wrong on the aliasing/HA addresses please let me know but that's my
> personal opinion ;)

I think that it would be a fundamental mistake to bake in the utterly wrong
idea that there is any distinction between those addresses, frankly. Which,
obviously, is my opinion - but is informed by the frequency at which people do
get this wrong.

(The same fault comes up with "gateway" facts and other network related things
every time, too, because they look like they are simple, one-value items, but
they turn out to be full of lurking complexity in modern large-scale
networks.)

Walid Nouh

Nov 24, 2010, 10:10:25 AM
to devops-t...@googlegroups.com
Hello,
I'm Walid Nouh, I work with Goneri on the FusionInventory project, and also on GLPI (http://glpi-project.org), which is an asset management software.

Okay. I've taken a swipe at what I think makes sense as a first draft
of some basic system data:

https://gist.github.com/712574

  
FusionInventory already generates inventories on a large base of operating systems (AIX, BSD, Solaris, Linux, HP-UX and Windows).
Here is an inventory example of the current agent on a Debian Sid machine: https://gist.github.com/713620
This inventory is in YAML format instead of the previous XML example. It includes the following sections:
PROCESSES: the running processes
BIOS: BIOS or firmware information, including serial numbers (SSN, MSN)
PRINTERS: the printers installed on the machine, if possible with a serial number
VIDEOS: information regarding the video card
INPUTS: the mice, keyboards, ...
HARDWARE: in this section we can find information about the machine
SOUNDS: the sound card
ENVS: the declared environment variables
SOFTWARES: the installed software. In this case, it's the list of the Debian packages, but it can also be BSD packages, RPMs, Windows software, etc. The list can be extended easily with a Perl module.
MEMORIES: information about the physical memory modules, with serial numbers where possible.
BATTERIES: the battery (it's a laptop computer). Again we try to get the serial number.
CONTROLLERS: more or less lspci output
VERSIONCLIENT: the software used to generate the inventory
USBDEVICES: the USB devices found on the machine (à la lsusb).
STORAGES: hard drives
NETWORKS: network interfaces (physical or virtual)
USERS: the users logged on the system
DRIVES: partitions
MONITORS: screens configuration
PORTS: the machine's physical ports; not very accurate, because dmidecode can give wrong information and there is no way to know if a port is really connected to an internal socket
SLOTS: it's more or less the same here. In this example, you can find the PCMCIA slot
CPUS: an entry per CPU, with serial number and thread and core counts

Most of the section fields are documented here :
 
http://search.cpan.org/dist/FusionInventory-Agent/lib/FusionInventory/Agent/XML/Query/Inventory.pm


Our format is already compatible with some software: OCS Inventory, GLPI, Uranos, Pulse2 and more.
John's draft seems interesting for us, as some of the information is already present in our XML.
Since you're still at the beginning of the standardization process, we really hope that we will be able to converge on something similar.

Regarding Daniel's point: our schema is limited too. The major difference here is the NETWORKS/SLAVES field, which is a list of slave network interfaces. For example, a bond0 interface will have SLAVE=eth1/eth2.
For the moment, this SLAVES field is only supported on Linux.

Best regards,
Walid.

John E. Vincent

Nov 24, 2010, 10:45:33 AM
to devops-toolchain

On Nov 23, 9:35 pm, Daniel Pittman <dan...@rimspace.net> wrote:
> "John E. Vincent" <lusis....@gmail.com> writes:
> I thought about when I use information like that, and it is pretty much always
> "walk over the set of disks, filter on some field, act on some others", so to
> me that makes more sense.  I can see your argument, though, too. :)

I'll back burner that one for a minute because it plays into how much
information we need below.

> Well, for the systems in question it literally does have, and respond to, a
> range of services on all those different addresses.  (Also, most large web
> servers are going to have multiple IP addresses - because HTTPS support still
> needs them, darn it all.)
>
> So, those are not really "application level" in any meaningful way that
> whatever random address gets picked as the first one is - and are often *less*
> "application level", since they are what the host actually communicates using.
>
> It is probably important to think about how this information is going to be
> used, also: in pretty much every case I have needed to know about an IP
> address (as opposed to the hostname, or some other identifier) it is because
> we need to do something meaningful with it.
>
> In which case having partial information is going to eventually fail.  It
> could be binding services to an interface, or building firewall rules, or
> ensuring that the on-disk network configuration matches the running
> configuration, or any number of things - but the missing information is
> eventually going to be the information that I actually need this time.
>
> I can kind of see an effort to distinguish these addresses as somehow less
> important than the underlying management address - but on at least some
> servers we have a separate management network, and the only parts we care
> about configuring (or reporting) are the front-end parts.
>

Here's where we might differ in philosophy. I treat the hardware that
something runs on as transient for lack of a better word (and despite
my previous usage).

Yes, there are basic firewall rules that exist for hosts but I
separate those outside of the firewall rules that my apache server or
database might need.
When I apply a theoretical role of ApacheServer to a box the following
happens:

- Install Apache
- Apply basic apache config bound to, say, the secondary NIC's IP
address
- Apply firewall rules on the box itself, upstream devices (if
applicable - i.e. upstream proxy or upstream firewall)
- Start apache

At this point I have an apache server. Now I want to apply the
theoretical ApacheServer::MySSLEnabledSite role:

- Bind additional IP address (for SSL)
- Apply new firewall rule allowing HTTPS on the box itself
- Apply new firewall rule downstream to DB server to allow traffic
from this box
- Apply apache config for this particular site
- Restart apache
- Apply new firewall rule/proxy rule upstream to bring the box into
service

In my mind the secondary IP doesn't belong to the box, it doesn't
belong to being an apache server. It belongs to the role of serving my
SSL enabled site.

This is something of a contrived example but I think it makes the
point. The philosophical difference is that the role of the box could
change at any time. The base platform itself is transient because
hardware sucks.
The same would apply to firewalls or proxy servers or whatever.

This is exactly how I treat my monitoring as well. I don't tie
anything other than base services to the box itself (i.e. in Nagios).
For my business-level stuff (like a website), the metrics are tied to
that site, NOT to that box. If I move the website to another box, I
don't want to lose all the trending history associated with it.

All I REALLY care about at a larger scale is:
- Do I have a box attached to network X and network Y (say management
network and external/cluster network). I don't care if the secondary
NIC is bound or not yet but a management iface needs to be there.
- Does it have the appropriate amount of free disk space and memory to
serve the top most role I want for it (ApacheServer::MySSLEnabledSite)
- Does it have connectivity to any shared resources if appropriate
(SAN/NAS/whatever)

From there, pxe or virt I can bring it either to my base state for all
systems or go ahead and have it brought up to the service state it's
intended.

Especially in a dense datacenter, the guys racking and stacking should
be able to cable it, power it on and have it PXE boot to a state where
the CM system can take over and allocate it (whether it's premarked
for a role or just added to the resource pool for later allocation).

> I think that it would be a fundamental mistake to bake in the utterly wrong
> idea that there is any distinction between those addresses, frankly.  Which,
> obviously, is my opinion - but is informed by the frequency at which people do
> get this wrong.
>
> (The same fault comes up with "gateway" facts and other network related things
>  every time, too, because they look like they are simple, one-value items, but
>  they turn out to be full of lurking complexity in modern large-scale
>  networks.)
>
>         Daniel

I can totally appreciate that perspective. With regards to routing
specifically, I left that out too by mistake. As I sit here and think
about how to represent it, I honestly have no idea. This goes back to
the first thing - data structure. I think you're right in that an
array of hashes makes more sense for disk/network but I still want to
keep it as skinny as possible. Remember that the first round of this
is simply basic facts about hardware and OS that can be used to
determine if a box is appropriate for a higher level role.

John E. Vincent

Nov 24, 2010, 11:59:15 AM
to devops-toolchain
Okay. I've updated the gist: https://gist.github.com/712574 with a new
network block. I merged what I considered relevant L2 and L3 facts
into one section. I also added secondary addresses in there. I left
out the default route for now, but added in the default gateway for
each interface.

I also added a few top level identifiers. Timestamp right now would be
the timestamp when the information was last gathered as opposed to
generated. Basically a freshness check? The role and provisioned
fields I'm not sure about but I added them just the same. The thought
is that if you were using this information from Chef/Puppet/Whatever
to populate another database somewhere would you want it? I think so.

I'm trying to keep what would be considered monitoring data (% free on
a given disk or memory in use) out of the format for now, but it's
obviously up for discussion.
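The freshness check hinted at above can be sketched in a few lines; the field names (gathered_at, role, provisioned) and the 24-hour threshold are placeholders, not the gist's actual keys:

```python
# Hypothetical top-level identifiers and the freshness check described
# above. Field names and the 24-hour threshold are placeholders.
snapshot = {
    "id": "web01",
    "role": "unassigned",
    "provisioned": False,
    "gathered_at": 1290618000,  # epoch seconds when facts were last gathered
}

def is_fresh(doc, now, max_age_seconds=86400):
    """Facts older than max_age_seconds should be re-gathered."""
    return (now - doc["gathered_at"]) <= max_age_seconds

print(is_fresh(snapshot, snapshot["gathered_at"] + 3600))
```

Using "last gathered" rather than "generated" is what makes this a freshness check: a stale timestamp tells the consumer to re-run discovery, not to distrust the whole record.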

Joe Williams

Nov 24, 2010, 2:45:30 PM
to devops-t...@googlegroups.com
John,

I like how in general you have kept the names of devices, IP addresses, etc. from being keys. While having device names as keys is perfectly legitimate and usable, in my opinion it's a sticking point in ohai (chef) data.

One issue I see in your gist is that disks rely on array ordering to determine size. In my fork I made a change for this. Devices is a list, each device in the list is a hash, and each hash has a name field and anything else, like "size" in the case of disks. This has a few benefits in my mind: we can explicitly grab the name of the device using the "name" key rather than implicitly relying on the structure to provide it as a key or array index. Second, it allows for expansion: we can arbitrarily add key/value pairs to each device hash later on. Lastly, the "count" field is now implicit in the length of the devices array. I imagine that code creating the json would likely be checking the array length to populate that field anyway.

I also did the same thing for network devices. I think arrays of hashes work well in situations like these where one has many objects in the same category that have possibly varying attributes. Disks and network interfaces are the most obvious cases I can think of.
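The array-of-hashes layout described above, with invented field names rather than the actual contents of the fork:

```python
# Sketch of the array-of-hashes layout: each device is a hash with an
# explicit "name" field. Field names here are invented for illustration.
disks = [
    {"name": "sda", "size_gb": 120},
    {"name": "sdb", "size_gb": 500, "vendor": "ACME"},  # extra keys are fine
]

# The "count" field becomes implicit in the array length...
count = len(disks)

# ...and devices can be looked up by name without relying on key
# structure or array position.
by_name = {d["name"]: d for d in disks}
print(count, by_name["sdb"]["size_gb"])
```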


Thanks.
-Joe



Grig Gheorghiu

Nov 24, 2010, 3:30:04 PM
to devops-t...@googlegroups.com
On Wed, Nov 24, 2010 at 11:45 AM, Joe Williams <j...@joetify.com> wrote:
> I like how in general you have kept the names of devices, ip addresses, etc
> from being keys. While having device names as keys is perfectly legitimate
> and usable in my opinion it's a sticking point in ohai (chef) data.
> One issue I see in your gist is that disks rely on array ordering to
> determine size. In my fork I made a change for this. Devices is a list, each
> device in the list is a hash and each hash has a name field and anything
> else, like "size" in the case of disks. This has few benefits in my mind, we
> can explicitly grab the name of the device using the "name" key rather than
> implicitly relying on the structure to provide them as keys or as array
> index. Second, it allows for expansion, we can arbitrarily add key/value
> pairs to each device hash later on. Lastly, the "count" field is now
> implicit by length of the devices array. I imagine that code that would
> create the json would likely be checking the array length to populate that
> field anyway.
> I also did the same thing for network devices. I think arrays of hashes work
> well in situations like these where one has many objects in the same
> category that have possibly varying attributes. Disks and network interfaces
> are the most obvious cases I can think of.
> https://gist.github.com/714207


I am +1 on Joe's modifications which use arrays of hashes. Those also
map very well to document-oriented DBs like CouchDB and MongoDB.

Grig

Paul Nasrat

Nov 24, 2010, 3:42:50 PM
to devops-t...@googlegroups.com
On 24 November 2010 19:45, Joe Williams <j...@joetify.com> wrote:
>
> On Nov 24, 2010, at 8:59 AM, John E. Vincent wrote:

I've not been on this list too long so missed the start of this
thread. Clearly this discussion resonates with my interest in facter.

> Okay. I've updated the gist: https://gist.github.com/712574 with a new
> network block. I merged what I considered relevant L2 and L3 facts
> into one section. I also added secondary addresses in there. I left
> out default route because for now but added in the default gateway for
> each interface.

Previously I was quite heavily involved in systems with OpenFirmware
and I always liked the representation from IEEE 1275 and as exported
in /proc/device-tree.

I guess it depends on whether you want to represent buses as in
sysfs/OF or merely logical entities such as disks.

One feature I like about OF is the ability to have aliases, and the
fact that they are simply mapped as part of the tree, so they are easy to
discover.

> I'm trying to keep what would be considered monitoring data (% free on
> a given disk or mem in use) out of the format for now but it's
> obviously up for discussion.

Thinking about data about devices is an interesting thought. Disks
don't really have free space - that's a property of an FS which is
elsewhere, and can even span multiple disks. Same with memory usage -
that's really a property of the kernel/running system not of the
hardware map.

I think it'd be important to distinguish between these. I quite like
what I've seen in the json representation so far but I'd like to play
with it a bit more.

Paul

Joe Williams

unread,
Nov 24, 2010, 4:09:28 PM11/24/10
to devops-t...@googlegroups.com
On Nov 24, 2010, at 12:42 PM, Paul Nasrat wrote:

>> Okay. I've updated the gist: https://gist.github.com/712574 with a new
>> network block. I merged what I considered relevant L2 and L3 facts
>> into one section. I also added secondary addresses in there. I left
>> out default route for now but added in the default gateway for
>> each interface.
>
> Previously I was quite heavily involved in systems with OpenFirmware
> and I always liked the representation from IEEE 1275 and as exported
> in /proc/device-tree.
>
> I guess it depends on whether you want to represent buses as in
> sysfs/OF or merely logical entities such as disks.
>
> One feature I like about OF is the ability to have aliases and the
> fact they are simply mapped as part of the tree so it is easy to
> discover.

My gut says that the json representation should in many cases be an abstraction above what might be included in device-tree. That said, parsing something like device-tree and sysfs to produce it makes sense.

>> I'm trying to keep what would be considered monitoring data (% free on
>> a given disk or mem in use) out of the format for now but it's
>> obviously up for discussion.
>
> Thinking about data about devices is an interesting thought. Disks
> don't really have free space - that's a property of an FS which is
> elsewhere, and can even span multiple disks. Same with memory usage -
> that's really a property of the kernel/running system not of the
> hardware map.
>
> I think it'd be important to distinguish between these. I quite like
> what I've seen in the json representation so far but I'd like to play
> with it a bit more.

I agree, we should keep running state metrics out of the hardware map.

John E. Vincent

unread,
Nov 24, 2010, 4:22:40 PM11/24/10
to devops-toolchain
On Nov 24, 2:45 pm, Joe Williams <j...@joetify.com> wrote:
> John,
>
> I like how in general you have kept the names of devices, ip addresses, etc from being keys. While having device names as keys is perfectly legitimate and usable, in my opinion it's a sticking point in ohai (chef) data.
>

Yeah I would prefer that for all keys as well. The main offender is
trying to get keys that are IP addresses into MongoDB. Mongo
"dislikes" keys with dots in the name and from my last contact with
the ML on the issue, they're not keen on changing it ;)

> One issue I see in your gist is that disks rely on array ordering to determine size. In my fork I made a change for this. Devices is a list, each device in the list is a hash and each hash has a name field and anything else, like "size" in the case of disks. This has a few benefits in my mind, we can explicitly grab the name of the device using the "name" key rather than implicitly relying on the structure to provide them as keys or as array index. Second, it allows for expansion, we can arbitrarily add key/value pairs to each device hash later on. Lastly, the "count" field is now implicit by length of the devices array. I imagine that code that would create the json would likely be checking the array length to populate that field anyway.
>

Yeah I didn't get around to modifying disk but I want it to follow the
same model as network. In fact, anything similar to disk/network in
scope should be the same nested format.

> I also did the same thing for network devices. I think arrays of hashes work well in situations like these where one has many objects in the same category that have possibly varying attributes. Disks and network interfaces are the most obvious cases I can think of.
>
> https://gist.github.com/714207
>
> Thanks.
> -Joe

I like what you've done with the place! Seriously that's much more
elegant than what I did. I'm going to merge your changes into the main
gist for now since it honors the same general key structure I was
using elsewhere.

Adam Jacob

unread,
Nov 24, 2010, 4:39:29 PM11/24/10
to devops-t...@googlegroups.com
On Wed, Nov 24, 2010 at 01:22:40PM -0800, John E. Vincent wrote:
> Yeah I would prefer that for all keys as well. The main offender is
> trying to get keys that are IP addresses into MongoDB. Mongo
> "dislikes" keys with dots in the name and from my last contact with
> the ML on the issue, they're not keen on changing it ;)

Does this really bother folks? From a data structure point of view,
these really are hashes - they have a unique identifier. While I'm
certainly sympathetic to MongoDB being kind of lame about key
identifiers, I'm not that sympathetic. :)

The difference here can be significant - think about how you look the
data up:

{
  "disks": {
    "/dev/sda1": {
      "size": "100"
    }
  }
}

If you wanted to know if /dev/sda1 exists:

data["disks"].exists?("/dev/sda1")

Will do the job, in constant time. Whereas:

{
  "disks": [
    {
      "name": "/dev/sda1",
      "size": "100"
    }
  ]
}

data["disks"].find { |d| d["name"] == "/dev/sda1" }

Does it in linear time.

This will happen to you every time you want to do this kind of lookup,
which is pretty frequently.

Take this into another language without Ruby's block syntax, and it gets
even stranger.

exists($data{"disks"}{"/dev/sda1"})
grep { $_->{"name"} eq "/dev/sda1" } @{$data{"disks"}}

I care a lot more about that than I do Mongo's key choices.
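Spelled out as a runnable Ruby snippet, with illustrative disk data rather than real ohai or facter output, the two access patterns look like this:

```ruby
require 'json'

# Hash-keyed form: the device name is the key itself.
by_name = JSON.parse('{"disks": {"/dev/sda1": {"size": "100"}}}')
by_name["disks"].key?("/dev/sda1")   # constant-time membership test

# Array-of-hashes form: the device name is a "name" attribute.
as_array = JSON.parse('{"disks": [{"name": "/dev/sda1", "size": "100"}]}')
# Linear scan: every element may be inspected before a match is found.
disk = as_array["disks"].find { |d| d["name"] == "/dev/sda1" }
```

Both return the same device record; the difference is only in how much work the lookup does as the list grows.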

Best,
Adam

Daniel Pittman

unread,
Nov 24, 2010, 6:16:12 PM11/24/10
to devops-t...@googlegroups.com
"John E. Vincent" <lusi...@gmail.com> writes:

> Okay. I've updated the gist: https://gist.github.com/712574 with a new
> network block. I merged what I considered relevant L2 and L3 facts into one
> section. I also added secondary addresses in there. I left out default route
> for now but added in the default gateway for each interface.

That looks pretty good to me, compared to the last lot. The structure makes a
lot more sense, I think.

You should probably update your example to show what an interface looks like
with two unequal priority default gateways on an interface, though. (Which
is a real example: we have a leased line, and a VPN fallback, for our
connection up to our other data center, so two gateways, same interface, same
network, different metrics. :)

> I also added a few top level identifiers. Timestamp right now would be the
> timestamp when the information was last gathered as opposed to
> generated. Basically a freshness check?

I would suggest you name it explicitly for what it contains, because otherwise
someone won't read the spec right, make an assumption, and get upset. (Not
that I would ever do that or anything. ;)

Maybe 'collected_time' or 'collected_at'? Anyway, something that makes it
harder to guess what the timestamp is from would be good.
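For example, the top level might carry both timestamps explicitly (field names hypothetical, just to illustrate the naming suggestion):

```json
{
  "collected_at": "2010-11-24T19:45:00Z",
  "generated_at": "2010-11-24T19:44:58Z"
}
```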

> The role and provisioned fields I'm not sure about but I added them just the
> same. The thought is that if you were using this information from
> Chef/Puppet/Whatever to populate another database somewhere would you want
> it? I think so.

We have servers that have multiple "roles", just to annoy. (...or, perhaps,
we have some roles that are no broader than a single server, so were not
individually named. :)

Daniel Pittman

unread,
Nov 24, 2010, 6:26:33 PM11/24/10
to devops-t...@googlegroups.com
"John E. Vincent" <lusi...@gmail.com> writes:
> On Nov 23, 9:35 pm, Daniel Pittman <dan...@rimspace.net> wrote:
>> "John E. Vincent" <lusis....@gmail.com> writes:

[...]

>> Well, for the systems in question it literally does have, and respond to, a
>> range of services on all those different addresses.  (Also, most large web
>> servers are going to have multiple IP addresses - because HTTPS support
>> still needs them, darn it all.)
>>
>> So, those are not really "application level" in any meaningful way that
>> whatever random address gets picked as the first one is - and are often *less*
>> "application level", since they are what the host actually communicates using.
>>
>> It is probably important to think about how this information is going to be
>> used, also: in pretty much every case I have needed to know about an IP
>> address (as opposed to the hostname, or some other identifier) it is because
>> we need to do something meaningful with it.
>>
>> In which case having partial information is going to eventually fail.  It
>> could be binding services to an interface, or building firewall rules, or
>> ensuring that the on-disk network configuration matches the running
>> configuration, or any number of things - but the missing information is
>> eventually going to be the information that I actually need this time.
>>
>> I can kind of see an effort to distinguish these addresses as somehow less
>> important than the underlying management address - but on at least some
>> servers we have a separate management network, and the only parts we care
>> about configuring (or reporting) are the front-end parts.
>
> Here's where we might differ in philosophy. I treat the hardware that
> something runs on as transient for lack of a better word (and despite my
> previous usage).

So do I, frankly. I pretty much ignored the "hwaddress" parts of the data,
for example, because they are transient. That routing, and the associated
firewall stuff? That is actually part of the role of those machines, not
something added on.

I think our mismatch is at the level of what is "hardware" and what isn't,
rather than over the basic concepts. :)

> Yes, there are basic firewall rules that exist for hosts but I separate
> those outside of the firewall rules that my apache server or database might
> need. When I apply a theoretical role of ApacheServer to a box the
> following happens:
>
> - Install Apache
> - Apply basic apache config bound to, say, the secondary NICs IP address

We differ here, because "secondary" and "primary" are not particularly
meaningful in our environment. We might have "unclassified",
"medical-in-confidence", and "management" interfaces attached to a machine,
though.

[...]

> In my mind the secondary IP doesn't belong to the box, it doesn't belong to
> being an apache server. It belongs to the role of serving my SSL enabled
> site.

*nod* I absolutely agree. If I was going to express things the same way you
do, though, the primary IP of the Apache server would be the *service*
address, and the secondary IP would be the management one.

> This is something of a contrived example but I think it makes the point. The
> philosophical difference is that the role of the box could change at any
> time. The base platform itself is transient because hardware sucks. The
> same would apply to firewalls or proxy servers or whatever.

...but would you expect the management address to change when the hardware
did? If so, why?

[...]

> All I REALLY care about at a larger scale is:
> - Do I have a box attached to network X and network Y (say management
> network and external/cluster network). I don't care if the secondary
> NIC is bound or not yet but a management iface needs to be there.

*nod*

> - Does it have the appropriate amount of free disk space and memory to
> serve the top most role I want for it (ApacheServer::MySSLEnabledSite)
> - Does it have connectivity to any shared resources if appropriate
> (SAN/NAS/whatever)

*nod* Me either. Which is why I think a flat list of addresses, rather than
the primary/secondary distinction, is the right one.

(Incidentally, are you sure you want to continue the illusion that the IP
address is tied to the interface, rather than being a property of the
machine? I probably would, but I figure I may as well ask. :)

[...]

> I can totally appreciate that perspective. With regards to routing
> specifically, I left that out too by mistake. As I sit here and think about
> how to represent it, I honestly have no idea. This goes back to the first
> thing - data structure.

*nod* For what it is worth: the *only* way I can possibly imagine
representing it is a logical view of the routing table. You can't really
simplify that, and even that is going to hurt. (Hello, source-based routing
on Linux, I love you and your multiple routing tables. :)

> I think you're right in that an array of hashes makes more sense for
> disk/network but I still want to keep it as skinny as possible. Remember
> that the first round of this is simply basic facts about hardware and OS
> that can be used to determine if a box is appropriate for a higher level
> role.

*nod* I think that ditching the primary/secondary distinction for the
addresses on an interface makes sense. (Include a "default source address" if
you really want to be able to reconstruct that.)[1]

Something like this gives the same information, but without imposing local
administrative distinctions on them, I think:

"network":{
"devices": [
{
"name": "eth0",
"address": ["192.168.1.1/24", "192.168.1.3/24", "10.0.0.0/24"],
"sourceip": "192.168.1.1",
"hwaddress": "01:01:01:01:01:01",
"speed": 1000,
"mtu": 9000,
}
]
}

Regards,
Daniel

Footnotes:
[1] Technically, Linux does have a primary/secondary distinction, but that
only dictates what addresses stick around or go away when an interface is
brought down, and is mostly irrelevant to this management layer.

Adam Jacob

unread,
Nov 24, 2010, 6:29:23 PM11/24/10
to devops-t...@googlegroups.com
On Thu, Nov 25, 2010 at 10:26:33AM +1100, Daniel Pittman wrote:
> Something like this gives the same information, but without imposing local
> administrative distinctions on them, I think:
>
> "network":{
> "devices": [
> {
> "name": "eth0",
> "address": ["192.168.1.1/24", "192.168.1.3/24", "10.0.0.0/24"],
> "sourceip": "192.168.1.1",
> "hwaddress": "01:01:01:01:01:01",
> "speed": 1000,
> "mtu": 9000,
> }
> ]
> }

Ew - that's a hash:

"network":{
"devices": {
"eth0": {


"address": ["192.168.1.1/24", "192.168.1.3/24", "10.0.0.0/24"],
"sourceip": "192.168.1.1",
"hwaddress": "01:01:01:01:01:01",
"speed": 1000,
"mtu": 9000
}
}
}
}

--

Opscode, Inc.
Adam Jacob, CTO

T: (206) 619-7151 E: ad...@opscode.com

Joe Williams

unread,
Nov 24, 2010, 6:33:46 PM11/24/10
to devops-t...@googlegroups.com


Constant vs linear time is perfectly reasonable and something I didn't take into account. My thought is that in practice these lists will be short, rarely more than 100 or 1000 elements, so it's unlikely we will spend much wall clock time iterating to find an ethernet interface. Realistically, parsing the JSON will probably be more of an efficiency killer than iterating through lists, but I don't think a binary format is realistic because it's not human readable. To that end my only concern is that we would be compromising aspects like readability, compatibility (like with MongoDB) and usability for what will likely be a small albeit non-zero efficiency gain.


> Take this into another language without Ruby's block syntax, and it gets
> even stranger.
>
> exists($data{"disks"}{"/dev/sda1"})
> grep { $_->{"name"} eq "/dev/sda1" } @{$data{"disks"}}

Fair enough, that ain't pretty.

Adam Jacob

unread,
Nov 24, 2010, 6:39:47 PM11/24/10
to devops-t...@googlegroups.com
On Wed, Nov 24, 2010 at 03:33:46PM -0800, Joe Williams wrote:
> Constant vs linear time is perfectly reasonable and something I didn't take into account. My thought is that in practice these lists will be short, rarely more than 100 or 1000 elements, so it's unlikely we will spend much wall clock time iterating to find an ethernet interface. Realistically, parsing the JSON will probably be more of an efficiency killer than iterating through lists, but I don't think a binary format is realistic because it's not human readable. To that end my only concern is that we would be compromising aspects like readability, compatibility (like with MongoDB) and usability for what will likely be a small albeit non-zero efficiency gain.

I don't think you're really compromising on readability - with JSON in
particular, you'll get no promised key order, so things like "name" as
an attribute in the array-of-hashes will never be in the same place
visually. I'm the king of just pretty-printing JSON and calling it
usable, and take it from me: if you want it readable, you're going to do
it in a custom format. :)

As for compatibility, I would say that MongoDBs issues with dots in
keyspace is for folks who want to use this data in MongoDB to handle. If
there is a compatibility issue, it's with MongoDB not accepting all
valid JSON. (Which they are perfectly clear about - it's "JSON-style",
not JSON)

Adam

Joe Williams

unread,
Nov 24, 2010, 7:20:17 PM11/24/10
to devops-t...@googlegroups.com

On Nov 24, 2010, at 3:39 PM, Adam Jacob wrote:

> On Wed, Nov 24, 2010 at 03:33:46PM -0800, Joe Williams wrote:
>> Constant vs linear time is perfectly reasonable and something I didn't take into account. My thought is that in practice these lists will be short, rarely more than 100 or 1000 elements, so it's unlikely we will spend much wall clock time iterating to find an ethernet interface. Realistically, parsing the JSON will probably be more of an efficiency killer than iterating through lists, but I don't think a binary format is realistic because it's not human readable. To that end my only concern is that we would be compromising aspects like readability, compatibility (like with MongoDB) and usability for what will likely be a small albeit non-zero efficiency gain.
>
> I don't think you're really compromising on readability - with JSON in
> particular, you'll get no promised key order, so things like "name" as
> an attribute in the array-of-hashes will never be in the same place
> visually. I'm the king of just pretty-printing JSON and calling it
> usable, and take it from me: if you want it readable, you're going to do
> it in a custom format. :)

Sure, I certainly see your point although I think I personally still prefer an array-of-hashes.

> As for compatibility, I would say that MongoDBs issues with dots in
> keyspace is for folks who want to use this data in MongoDB to handle. If
> there is a compatibility issue, it's with MongoDB not accepting all
> valid JSON. (Which they are perfectly clear about - it's "JSON-style",
> not JSON)

Heh, "JSON-style", there's a MongoDB /dev/null joke in there some where.

Really though, I agree, valid JSON is valid JSON and if a system isn't compatible with the standard that's their fault. Regardless, we shouldn't knowingly shut them out if we can help it. Although it is most certainly impossible to make everyone happy.

Adam Jacob

unread,
Nov 24, 2010, 7:45:55 PM11/24/10
to devops-t...@googlegroups.com
On Wed, Nov 24, 2010 at 04:20:17PM -0800, Joe Williams wrote:
> > I don't think you're really compromising on readability - with JSON in
> > particular, you'll get no promised key order, so things like "name" as
> > an attribute in the array-of-hashes will never be in the same place
> > visually. I'm the king of just pretty-printing JSON and calling it
> > usable, and take it from me: if you want it readable, you're going to do
> > it in a custom format. :)
>
> Sure, I certainly see your point although I think I personally still prefer an array-of-hashes.

Why?

> Really though, I agree, valid JSON is valid JSON and if a system isn't compatible with the standard that's their fault. Regardless, we shouldn't knowingly shut them out if we can help it. Although it is most certainly impossible to make everyone happy.

Right - I'm just advocating that if we're going to have an extensible
system, limiting the valid key name space to omit dots is bad mojo,
since dots often appear in valid places.

Joe Williams

unread,
Nov 24, 2010, 8:49:05 PM11/24/10
to devops-t...@googlegroups.com

On Nov 24, 2010, at 4:45 PM, Adam Jacob wrote:

> On Wed, Nov 24, 2010 at 04:20:17PM -0800, Joe Williams wrote:
>>> I don't think you're really compromising on readability - with JSON in
>>> particular, you'll get no promised key order, so things like "name" as
>>> an attribute in the array-of-hashes will never be in the same place
>>> visually. I'm the king of just pretty-printing JSON and calling it
>>> usable, and take it from me: if you want it readable, you're going to do
>>> it in a custom format. :)
>>
>> Sure, I certainly see your point although I think I personally still prefer an array-of-hashes.
>
> Why?

The aforementioned reasons but of those probably most importantly, I like that idea of each attribute having an explicit name. Programmatically this allows one to know ahead of time how to get what they are looking for, i.e. the "name" of the device or the "size" or "mtu". The key describes what the attribute is, so if you want a device name you get data["devices"][1...n]["name"] not data["devices"].keys[1..n]. In both cases if one doesn't know what device they want they still have to iterate through them until they find it, except in my example one can bank on the "name" being the name of the device not the name implicitly being the key.

Additionally, as Paul Nasrat alluded to earlier, in the end these are all PCI buses and the like; eth0 is just the alias the operating system gave pci0000:02. I'm not sure it deserves to be treated any differently than the MTU or IP address which also describe that PCI bus.

Adam Jacob

unread,
Nov 24, 2010, 8:58:16 PM11/24/10
to devops-t...@googlegroups.com
On Wed, Nov 24, 2010 at 05:49:05PM -0800, Joe Williams wrote:
> The aforementioned reasons but of those probably most importantly, I
> like that idea of each attribute having an explicit name.
> Programmatically this allows one to know ahead of time how to get what
> they are looking for, i.e. the "name" of the device or the "size" or
> "mtu". The key describes what the attribute is, so if you want a
> device name you get data["devices"][1...n]["name"] not
> data["devices"].keys[1..n]. In both cases if one doesn't know what
> device they want they still have to iterate through them until they
> find it, except in my example one can bank on the "name" being the
> name of the device not the name implicitly being the key.

Even in your example it sucks - you want the name of "what"? :) The
zeroth device? You're basically saying you want to iterate all the
time, which I don't think you actually do.

If the issue is that you want the key to be flexible, feel free to dupe
the data into a "name" field.

There is a semantic value to these data structures - if you want to know
the "size" of something, that something has a name. If you want to get
the answer to the question "what is the size of /dev/sda1", in your
world we need to iterate - in mine you just ask. You still have to make
a choice (you only get one key, after all), but you'll make life easier
for people most of the time.
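The "dupe the data into a name field" compromise can be sketched in Ruby as a hypothetical transform (names and data illustrative, not from either gist): re-key the array-of-hashes by device name while keeping the duplicated "name" attribute inside each value, so direct lookup and self-describing iteration coexist.

```ruby
# Array-of-hashes form, as proposed for the gist (data illustrative).
devices = [
  { "name" => "/dev/sda1", "size" => "100" },
  { "name" => "/dev/sdb1", "size" => "250" }
]

# Re-key by name, leaving "name" duplicated inside each entry.
by_name = devices.each_with_object({}) { |dev, acc| acc[dev["name"]] = dev }

by_name["/dev/sda1"]["size"]               # direct, constant-time lookup
by_name.each { |name, dev| dev["size"] }   # iteration is still trivial
```

The cost is one redundant string per device; the gain is that both access patterns stay cheap and explicit.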

By sticking an array in front of the data you are explicitly creating a
new semantic, and rather than having people understand what you are
using as a key, you're forcing them to understand how you determine a
consistent order for the resulting hash. You can't opt-out of there
being a key in the middle, you can just make it something that has
discoverable semantic meaning or something that has opaque semi-random
meaning.

> Additionally as Paul Nasrat alluded to earlier in the the end these
> are all PCI busses and etc, eth0 is just the alias the operating
> system gave pci0000:02. I'm not sure it deserves to be treated any
> differently than the MTU or IP address which also describe that PCI
> bus.

There are certainly other options for valid keys, the pci-bus identifier
being one of them. I would argue it's a bad one, as it is very rarely of
any value to the end-user. The number of use-cases where you look
something up mentally by PCI bus identifier is mind-bogglingly low, in
comparison to how often you look something up by interface identifier.

Joe Williams

unread,
Nov 24, 2010, 9:50:23 PM11/24/10
to devops-t...@googlegroups.com

On Nov 24, 2010, at 5:58 PM, Adam Jacob wrote:

> On Wed, Nov 24, 2010 at 05:49:05PM -0800, Joe Williams wrote:
>> The aforementioned reasons but of those probably most importantly, I
>> like that idea of each attribute having an explicit name.
>> Programmatically this allows one to know ahead of time how to get what
>> they are looking for, i.e. the "name" of the device or the "size" or
>> "mtu". The key describes what the attribute is, so if you want a
>> device name you get data["devices"][1...n]["name"] not
>> data["devices"].keys[1..n]. In both cases if one doesn't know what
>> device they want they still have to iterate through them until they
>> find it, except in my example one can bank on the "name" being the
>> name of the device not the name implicitly being the key.
>
> Even in your example it sucks - you want the name of "what"? :) The
> zeroth device? You're basically saying you want to iterate all the
> time, which I don't think you actually do.
>
> If the issue is that you want the key to be flexible, feel free to dupe
> the data into a "name" field.

That's a good compromise but duping data sucks. :P

> There is a semantic value to these data structures - if you want to know
> the "size" of something, that something has a name. If you want to get
> the answer to the question "what is the size of /dev/sda1", in your
> world we need to iterate - in mine you just ask. You still have to make
> a choice (you only get one key, after all), but you'll make life easier
> for people most of the time.

Right, if you know what you want ahead of time it works great. If you don't, you still have to iterate, and having explicit keys for each attribute gives you more to work with. In my mind the "name" of the device is an attribute of the device, not the device itself. I look at the array in this case as an unordered list of device descriptions, not devices.

> By sticking an array in front of the data you are explicitly creating a
> new semantic, and rather than having people understand what you are
> using as a key, you're forcing them to understand how you determine a
> consistent order for the resulting hash. You can't opt-out of there
> being a key in the middle, you can just make it something that has
> discoverable semantic meaning or something that has opaque semi-random
> meaning.

I'll argue that I'm not creating a new semantic because "devices" signifies that it is the key for something that is iterable, regardless if it's a hash, array, etc.

>> Additionally as Paul Nasrat alluded to earlier in the the end these
>> are all PCI busses and etc, eth0 is just the alias the operating
>> system gave pci0000:02. I'm not sure it deserves to be treated any
>> differently than the MTU or IP address which also describe that PCI
>> bus.
>
> There are certainly other options for valid keys, the pci-bus identifier
> being one of them. I would argue it's a bad one, as it is very rarely of
> any value to the end-user. The number of use-cases where you look
> something up mentally by PCI bus identifier is mind-bogglingly low, in
> comparison to how often you look something up by interface identifier.

Certainly, I was just using it as an example.

Adam Jacob

unread,
Nov 25, 2010, 12:02:13 AM11/25/10
to devops-t...@googlegroups.com
On Wed, Nov 24, 2010 at 06:50:23PM -0800, Joe Williams wrote:
> Right, if you know what you want ahead of time it works great. If you
> don't, you still have to iterate, and having explicit keys for each
> attribute gives you more to work with.

How so?

data[:devices].each do |name, device_data|
...
end

while(my($name, $device_data) = each %{$data{'devices'}}) {
...
}

Or:

data[:devices].each do |device|
...
end

foreach my $device (@{$data{'devices'}}) {
...
}

I agree that I like the second form better when I'm iterating (kind of
obviously) but it's not because any data is missing.

> In my mind the "name" of the device is an attribute of the device not
> the device itself. I look at the array in this case as an unordered
> list of device descriptions not devices.

So do I - I'm just saying the data has more than one use case.
Iteration is one, but direct access is another. In the hash form, you
get both: iterating is easy, and direct access is at least possible. In
the array form, you lose the second use case entirely.

> I'll argue that I'm not creating a new semantic because "devices"
> signifies that it is the key for something that is iterable,
> regardless if it's a hash, array, etc.

It's both a key for something iterable and a key for a direct lookup for
a device. There is a reason you called it "name", after all. :)

Joe Williams

unread,
Nov 25, 2010, 2:00:42 AM11/25/10
to devops-t...@googlegroups.com

On Nov 24, 2010, at 9:02 PM, Adam Jacob wrote:

>> In my mind the "name" of the device is an attribute of the device not
>> the device itself. I look at the array in this case as an unordered
>> list of device descriptions not devices.
>
> So do I - I'm just saying the data has more than one use case.
> Iteration is one, but direct access is another. In the hash form, you
> get both: iterating is easy, and direct access is at least possible. In
> the array form, you lose the second use case entirely.
>
>> I'll argue that I'm not creating a new semantic because "devices"
>> signifies that it is the key for something that is iterable,
>> regardless if it's a hash, array, etc.
>
> It's both a key for something iterable and a key for a direct lookup for
> a device. There is a reason you called it "name", after all. :)

You have convinced me, locking out direct access would be suboptimal. To that end direct access is probably worth more than having explicit keys for each attribute, "name" or otherwise.

John Vincent

unread,
Nov 26, 2010, 1:21:06 AM11/26/10
to devops-t...@googlegroups.com
On Wed, Nov 24, 2010 at 4:39 PM, Adam Jacob <ad...@opscode.com> wrote:
> Does this really bother folks?  From a data structure point of view,
> these really are hashes - they have a unique identifier. While I'm
> certainly sympathetic to MongoDB being kind of lame about key
> identifiers, I'm not that sympathetic. :)
>

I'm not overly sympathetic either, especially with the attitude that
10gen has in general, but since this is an "interchange" format of
sorts, I'd like for it to be as flexible as possible.

> The difference here can be significant - think about how you look the
> data up:
>
>  {
>    "disks": {
>      "/dev/sda1": {
>        "size": "100"
>      }
>    }
>  }
>
> If you wanted to know if /dev/sda1 exists:
>
> data["disks"].exists?("/dev/sda1")
>
> Will do the job, in constant time.  Whereas:
>
>  {
>    "disks": [
>      {
>        "name": "/dev/sda1",
>        "size": "100"
>      }
>    ]
>  }
>
> data["disks"].find { |d| d["name"] == "/dev/sda1" }
>
> Does it in linear time.
>

Excluding other languages, I had to give this a test. As far as ruby
goes, I've always been "semi-smart" about coding practices (using ''
instead of "" to avoid the interpolation pass, using << instead of +=
because it's faster). I was really curious about basic lookup speed
between the two so I gave it a go:

(gist and json files are here - https://gist.github.com/716336)

jvincent@jvx64:~/development/json-tests$ ruby bm.rb

Testing without cleanup
                     user     system      total        real
{}.has_key? 3    0.000000   0.000000   0.000000 (  0.000087)
[].find 3        0.000000   0.000000   0.000000 (  0.000067)
{}.has_key? 10   0.000000   0.000000   0.000000 (  0.000376)
[].find 10       0.000000   0.000000   0.000000 (  0.000129)

Testing with cleanup
                     user     system      total        real
{}.has_key? 3    0.000000   0.000000   0.000000 (  0.000065)
[].find 3        0.000000   0.000000   0.000000 (  0.000058)
{}.has_key? 10   0.000000   0.000000   0.000000 (  0.000061)
[].find 10       0.000000   0.000000   0.000000 (  0.000085)

That was with 1.9.2 (which ships with JSON support OOB).

I'm no benchmark wizard so I might have screwed something up. Short of
forcing GC to run, it APPEARS that array.find is faster. I haven't
delved into WHY that is. Maybe an Array is a faster native data
structure than a Hash? I wouldn't say there are any GROSS time
differences between them.

As for which way to go? Like I said, I'm no fan of 10gen and MongoDB,
so I won't let that be the massive deciding factor, but my personal
preference is not to use transient values as key names if possible.

John

Adam Jacob

unread,
Nov 27, 2010, 2:57:03 PM11/27/10
to devops-t...@googlegroups.com
On Fri, Nov 26, 2010 at 01:21:06AM -0500, John Vincent wrote:
> i'm not overly sympathetic either especially with the attitude that
> 10gen has in general but since this is an "interchange" format of
> sorts, I'd like for it to be as flexible as possible.

Yep - which means to me not restricting the values in the keyspace
beyond what is already in JSON.

> Excluding other languages, I had to give this a test. As far as ruby
> goes, I've always been "semi-smart" about coding practices (using ''
> instead of "" to avoid the interpolation pass, using << instead of +=
> because it's faster. I was really curious about basic lookup speed
> between the two so I gave it a go:

.. snip ..

> I'm no benchmark wizard so I might have screwed something up. Short of
> forcing GC to run, it APPEARS that array.find is faster. I haven't
> delved into WHY that is. Maybe an Array is a faster native data
> structure than a Hash? I wouldn't say there are any GROSS time
> differences between them.

There isn't at small scale. Check out:

http://en.wikipedia.org/wiki/Time_complexity

A hash lookup happens in constant time - the time it takes to hash the
key, essentially. An array traversal (like find) happens in linear time
- as you add more elements to the array, it takes longer to find. (Now,
if the item you are looking for is first in the array, it may always
be short, because the find method might quit after the first item is
found, for example.)

This was Joe's point from earlier in the thread - given the size of the
array input thats likely, the differential for walking the array or
looking up an item in a hash is likely minimal.

require 'benchmark'

hash_buddy = Hash.new
array_buddy = Array.new
0.upto(10000000) do |number|
  hash_buddy[number] = true
  array_buddy << number
end

Benchmark.bm do |x|
  x.report("hash lookup 10m:") { hash_buddy.has_key?(10000000) }
  x.report("array lookup 10m:") { array_buddy.find { |i| i == 10000000 } }
  x.report("hash lookup 1:") { hash_buddy.has_key?(1) }
  x.report("array lookup 1:") { array_buddy.find { |i| i == 1 } }
end

And you'll get this:

                        user     system      total        real
hash lookup 10m:    0.000000   0.000000   0.000000 (  0.000017)
array lookup 10m:   0.900000   0.010000   0.910000 (  0.905898)
hash lookup 1:      0.000000   0.000000   0.000000 (  0.000008)
array lookup 1:     0.000000   0.000000   0.000000 (  0.000010)

Notice the array lookup takes almost a second for the 10m case, while
the hash lookup remains constant.

Best,

claymation

unread,
Mar 26, 2011, 11:54:51 PM3/26/11
to devops-t...@googlegroups.com, Adam Jacob
Has this initiative died over the difference between lists and dicts? I longed for such a standard a few years ago when I wrote a host discovery and CMDB application at Ning, and still think it would be valuable, not least because it could be the basis for a configuration query language that I'm working on.

From what I've seen of ohai and facter, ohai's approach is closer to what I'd like to see, minus the ephemeral data. There are better tools for collecting and storing ephemeral data, and, as I understand it, the goals of this standard are more along the lines of identifying a host and its fundamental properties than with fully characterizing the host.

On the lists-vs-dicts question, I agree with Adam: you lose nothing by using keys with semantic value, but you gain constant-time lookup and simplicity, and one goal of this exchange format ought to be simplicity. I agree with Joe that it's nice to have the device name included among its attributes, and not sequestered away in the key, but that's easily accomplished by simply duplicating the value:

    interfaces['eth0'] = {
        'name': 'eth0',
        ...
    }
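A minimal Ruby sketch of that duplication (illustrative only; the field names are assumed, not part of any agreed format). Converting between the list form and the keyed form is lossless precisely because the name is carried in both places:

```ruby
# Convert a list of device hashes into a hash keyed by name,
# duplicating the name inside each value so neither form loses it.
def index_by_name(devices)
  devices.each_with_object({}) do |dev, acc|
    acc[dev["name"]] = dev
  end
end

# And back again: the values already carry "name", so nothing is lost.
def to_list(indexed)
  indexed.values
end

interfaces = [{ "name" => "eth0", "mtu" => 1500 }]
indexed = index_by_name(interfaces)
indexed["eth0"]["mtu"]   # constant-time lookup by name
```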

You also gain the ability to represent the data structure as URIs when using semantic keys:


Issues with MongoDB are largely irrelevant, I think, because we're talking about an exchange format, not necessarily a storage format. Users of MongoDB (like myself) have a little more work to do to escape keys so they won't interfere with queries, but I don't think the specification should cater to the requirements of one database or another.

Cheers,

Clay

John Vincent

unread,
Mar 27, 2011, 12:00:19 AM3/27/11
to devops-t...@googlegroups.com, claymation, Adam Jacob
I wouldn't say it's dead. I have it as an outstanding task and I see
it every day but I haven't gotten around to revisiting it. As I do
work on Noah, this issue keeps nagging at the back of my mind.

So in an effort to revive it, I'll take a look at the last state of my
gists and see where we left off. I did submit a basic patch to facter
that did nothing more than convert the fact output to YAML. My plan
was, once a final format plus set of facts was decided, to actually
submit patches to both facter and ohai that would accept a flag to
dump the information in the format we all agreed upon - something like
'facter --format common' or 'ohai --format common'
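As a sketch of what such a normalizing wrapper might look like (the "common" key names below are hypothetical placeholders, not an agreed spec; the fact names are a small sample of real facter/ohai keys):

```ruby
require 'json'

# Hypothetical normalizer: map tool-specific fact names onto a shared
# vocabulary so facter and ohai output become interchangeable.
COMMON_KEYS = {
  "fqdn"            => "fqdn",   # facter
  "operatingsystem" => "os",     # facter
  "platform"        => "os",     # ohai
  "ipaddress"       => "ip",     # facter
}

def to_common(facts)
  facts.each_with_object({}) do |(key, value), common|
    common_key = COMMON_KEYS[key]
    common[common_key] = value if common_key
  end
end

puts JSON.pretty_generate(to_common("fqdn" => "db1.example.com",
                                    "operatingsystem" => "CentOS"))
```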

--
John E. Vincent
http://about.me/lusis

Matthew Macdonald-Wallace

unread,
Mar 27, 2011, 3:36:22 AM3/27/11
to devops-t...@googlegroups.com

OK, I missed this thread originally. If someone can send me a link to the discussion, I'd be happy to see if I could integrate it into Edison.

Matt

Clay McClure

unread,
Mar 27, 2011, 12:22:07 PM3/27/11
to devops-t...@googlegroups.com, Matthew Macdonald-Wallace
Noah? Edison?

Matthew Macdonald-Wallace

unread,
Mar 27, 2011, 12:41:03 PM3/27/11
to Clay McClure, devops-t...@googlegroups.com
Edison: https://github.com/proffalken/edison

It aims to centralise a Configuration Management DataBase,
Configuration Deployment (using puppet at the moment but contributions
for other systems are welcome!) and Change Management in one place.

The main idea behind it was that I'd be able to say "What has changed
on server X in the past 8 hours" and it would tell me - life-saving in
the middle of a major incident!

Cheers,

Matt (ProfFalken)

On 27 March 2011 17:22, Clay McClure <cl...@daemons.net> wrote:
> Noah? Edison?
>
> On Sun, Mar 27, 2011 at 3:36 AM, Matthew Macdonald-Wallace
> <mattm...@gmail.com> wrote:
>>
>> Ok, I missed this thread originally, if someone can send me a link to the
>> discussion is be happy to see if I could integrate it into Edison.
>>
>> Matt
>>

John Vincent

unread,
Mar 27, 2011, 12:54:31 PM3/27/11
to devops-t...@googlegroups.com, Matthew Macdonald-Wallace

Noah is a service registry plus distributed coordination system, similar to (and inspired by) Apache ZooKeeper:

https://github.com/lusis/Noah

Vogeler is a stalled project of mine that was going to be a framework for a command and control + CMDB. I started this thread when I was still working on it.

https://github.com/lusis/vogeler

I have every intention of picking it back up, and I'm actually going to refactor quite a bit based on my experiences developing Noah.

Edison is proffalken's baby. I don't have the URL on me.

On Mar 27, 2011 12:22 PM, "Clay McClure" <cl...@daemons.net> wrote:

John Vincent

unread,
Mar 27, 2011, 1:13:23 PM3/27/11
to devops-t...@googlegroups.com

Matt,

Best bet is to hit groups.google.com and find the mailing list page. This thread is right up top right now ;) I warn you that the previous discussion was 3 or so pages long.

Clay McClure

unread,
Mar 28, 2011, 4:23:14 PM3/28/11
to devops-t...@googlegroups.com
John,

Noah looks pretty cool. I like the stack you're using with that: sinatra, ohm, redis.

Tell me more about vogeler. It sounds interesting, and perhaps related to a project that's been floating around in my head for a while now.

Whereabouts do you work?

Cheers,

Clay

John Vincent

unread,
Mar 28, 2011, 4:42:14 PM3/28/11
to devops-t...@googlegroups.com, Clay McClure
Clay,

I'll hit you off list on the work stuff.

As for Vogeler, it morphed over the time I was working on it, but the
original use case was a system of record for a company I worked at.
Cobbler didn't really work as a model for our environment and puppet
was using Cobbler for lookups. We were going to hack on Cobbler to
support a CouchDB backend or wrap all our cobbler calls in lookups to
couchdb. So I started hacking on something in my spare time - Vogeler.

As it evolved, it really became more of a generic command-and-control
system. That was born of the desire not to rewrite facter or
Cobbler. When I first "announced" it, Patrick (Debois) asked me some
questions so I wrote this blog post:

http://lusislog.blogspot.com/2010/09/follow-up-to-vogeler-post.html

So while it's in a stalled state now, I actually want to refactor it
and remove the rabbitmq + couchdb stuff and replace them both with
Redis (since it actually works well for both storage and queuing -
pubsub replacing fanout and LISTs replacing direct exchanges).
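A minimal sketch of the LIST-as-queue pattern described above (it assumes the redis gem's lpush/rpop command names; the FakeRedis stub is purely illustrative, there so the sketch runs without a server):

```ruby
# Replace a direct-exchange queue with a Redis LIST: producers LPUSH,
# consumers pop from the other end, giving FIFO delivery. The client
# is injected so the same code runs against the real redis gem or a stub.
class WorkQueue
  def initialize(client, key)
    @client = client
    @key = key
  end

  def enqueue(payload)
    @client.lpush(@key, payload)
  end

  def dequeue
    # Against a real server this would be a blocking BRPOP; a plain
    # rpop keeps the sketch self-contained.
    @client.rpop(@key)
  end
end

# Minimal in-memory stand-in for the two Redis commands used above.
class FakeRedis
  def initialize
    @lists = Hash.new { |h, k| h[k] = [] }
  end

  def lpush(key, value)
    @lists[key].unshift(value)
  end

  def rpop(key)
    @lists[key].pop
  end
end
```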

You should also take a look at what Miquel Torres and Grig Gheorghiu
are doing with Overmind - https://github.com/tobami/overmind

We had talked about using some form of the Vogeler C&C capabilities in
it but I've not had time (surprise) to check back in with it.

On Mon, Mar 28, 2011 at 4:23 PM, Clay McClure <cl...@daemons.net> wrote:
> John,
> Noah looks pretty cool. I like the stack you're using with that: sinatra,
> ohm, redis.
> Tell me more about vogeler. It sounds interesting, and perhaps related to a
> project that's been floating around in my head for a while now.
> Whereabouts do you work?
> Cheers,
> Clay
>

> On Sun, Mar 27, 2011 at 12:54 PM, John Vincent <lusi...@gmail.com> wrote:
>>
>> Noah is a service registry plus distributed coordination system
>> similar/inspired by Apache zookeeper:
>>
>> https://github.com/lusis/Noah
>>
>> Vogeler is a stalled project of mine that was going to be a framework for
>> a command and control + cmdb. I started this thread when I was still working
>> on it.
>>
>> https://github.com/lusis/vogeler
>>
>> I have every intention of picking it back up and I'm actually going to
>> refactor quite a bit based on my experiences developing Noah.
>>
>> Edison is proffalken's baby. I don't have the URL on me.
>>
>> On Mar 27, 2011 12:22 PM, "Clay McClure" <cl...@daemons.net> wrote:
>
>
