RFC: Database-backed inventory service plan


Nick Lewis

Feb 23, 2011, 5:04:03 PM
to puppet...@googlegroups.com, puppe...@googlegroups.com
Our current plan for the inventory service is to provide active_record termini for the "facts" and "inventory" indirections. This is to support fast look-up of facts, and search of nodes based on their facts. However, there are already tables for facts, used for storeconfigs, along with an active_record terminus for facts. We want to avoid unnecessarily duplicating this behavior, by reusing the existing tables and terminus. This would result in the same fact data being used by both the inventory service and storeconfigs.

The only potential concern we can see with this is users wanting different fact expiration policies for inventory service and storeconfigs. Given the usage scenarios for storeconfigs that we are aware of, this seems unlikely (it sounds like storeconfig fact data is mostly being used as a stand-in for an inventory service). This proposal would have no other effect on storeconfigs.

Please share any other comments or concerns you may have related to this proposal, particularly if it would interfere with your current use of storeconfigs. Thanks.

Markus Roberts

Feb 24, 2011, 3:58:18 PM
to puppe...@googlegroups.com
N --

Do we want to cross post this on the user list?

-- M

--
You received this message because you are subscribed to the Google Groups "Puppet Developers" group.
To post to this group, send email to puppe...@googlegroups.com.
To unsubscribe from this group, send email to puppet-dev+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/puppet-dev?hl=en.



--
-----------------------------------------------------------
When in trouble or in doubt, run in circles,
scream and shout. -- 1920's parody of the 
maritime general prudential rule
------------------------------------------------------------


Matt Robinson

Feb 25, 2011, 4:55:33 PM
to puppe...@googlegroups.com, Nick Lewis, puppet...@googlegroups.com
On Wed, Feb 23, 2011 at 2:04 PM, Nick Lewis <ni...@puppetlabs.com> wrote:
> Our current plan for the inventory service is to provide active_record
> termini for the "facts" and "inventory" indirections. This is to support
> fast look-up of facts, and search of nodes based on their facts. However,
> there are already tables for facts, used for storeconfigs, along with an
> active_record terminus for facts. We want to avoid unnecessarily duplicating
> this behavior, by reusing the existing tables and terminus. This would
> result in the same fact data being used by both the inventory service and
> storeconfigs.

In principle I don't like the idea of tying the backends of
storeconfigs and inventory together by sharing tables, especially
since I'm not clear on the future of storeconfigs or on many of the
details of how it's currently used; sharing makes it harder to change
implementation details. As a specific example, I don't like the
schema storeconfigs uses for storing fact data (explained in more
detail below) and would prefer a different one. If we share tables,
that becomes awkward.

I propose that we don't share tables, and the inventory service (and
any other future service that needs a database backend) has its own
set of namespaced tables (servicename_tablename). Ideally I would
like to use separate database schemas entirely, but that would be a
bigger, harder to manage change with the current code that relies on
the active_record terminus.

Currently the storeconfigs tables dealing with facts look something
like this (I've removed the columns that are irrelevant to the
inventory service):

create_table :hosts do |t|
  t.column :name, :string, :null => false
end

create_table :fact_names do |t|
  t.column :name, :string, :null => false
end

create_table :fact_values do |t|
  t.column :value, :text, :null => false
  t.column :fact_name_id, :integer, :null => false
  t.column :host_id, :integer, :null => false
end

I propose something more like:

create_table :nodes do |t|
  t.column :name, :string, :null => false
  t.column :timestamp, :datetime
end

create_table :facts do |t|
  t.column :name, :string, :null => false
  t.column :value, :text, :null => false
  t.column :node_id, :integer, :null => false
end

It's less normalized than the storeconfigs schema, since fact names
will be duplicated per node, but it's easier to understand and work
with, and I think it better satisfies the kinds of queries we will be
doing, which are of the form "select nodes where fact is equal to
value". The more normalized schema would be better for queries of the
form "select all values for fact", but I don't think that's something
we'll be doing. Correct me if I'm wrong.
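
To make the trade-off concrete, here is a plain-Ruby sketch (no
database; the fact data, node names, and ids below are made up for
illustration) of that "select nodes where fact is equal to value"
query against both shapes:

```ruby
# Denormalized shape: one row per (fact name, value, node), as in the
# proposed facts table.
facts = [
  { :name => "puppetversion", :value => "2.6.5", :node => "web01" },
  { :name => "puppetversion", :value => "2.6.4", :node => "web02" },
  { :name => "kernel",        :value => "Linux", :node => "web01" },
]

# One pass over a single "table" answers the query.
matches = facts.select { |f|
  f[:name] == "puppetversion" && f[:value] == "2.6.5"
}.map { |f| f[:node] }

# Normalized shape: fact names live in a separate "table"
# (fact_names), so the same query first resolves the name to an id,
# then filters the values table.
fact_names  = { 1 => "puppetversion", 2 => "kernel" }
fact_values = [
  { :fact_name_id => 1, :value => "2.6.5", :node => "web01" },
  { :fact_name_id => 1, :value => "2.6.4", :node => "web02" },
  { :fact_name_id => 2, :value => "Linux", :node => "web01" },
]

name_id = fact_names.key("puppetversion")
normalized_matches = fact_values.select { |f|
  f[:fact_name_id] == name_id && f[:value] == "2.6.5"
}.map { |f| f[:node] }

puts matches.inspect             # ["web01"]
puts normalized_matches.inspect  # ["web01"]
```

Both shapes answer the query; the denormalized one just does it
without the extra name lookup or join.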

Other benefits of the proposed schema include the "metadata" about
each fact set being columns on the node table (Nick has also proposed
that table be called fact_sets and have a column called node_name)
instead of being stored as a fact. Also, we tend to use the word
"host" all over our code (in both Puppet and Dashboard) when we
really ought to use the word "node", since "host" confuses people
into thinking the hostname is what identifies a node, when by default
it's the fqdn and could be anything.

> Please share any other comments or concerns you may have related to this
> proposal, particularly if it would interfere with your current use of
> storeconfigs. Thanks.

Questions:
Do or will we want historical fact sets? My current understanding is
no, that we only store the most recent fact set per node. This makes
the database smaller, and I can't think of a motivator for wanting
historical fact sets, but maybe someone else can.
What other "metadata" do we want to store about facts? Currently the
only metadata we're storing is the timestamp.

Markus Roberts

Feb 25, 2011, 6:27:30 PM
to puppe...@googlegroups.com
M --


I'm not saying that you're wrong, but I'm not seeing how the
denormalized table would be easier to work with; and if anything I
would expect it to be slower to query (and slower to update) in the
denormalized form. If you index on [fact_id, value] the query should
be blazingly fast, and that index would be easier/cheaper to maintain
than a key on [name, value]... or am I missing something?
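
As a rough plain-Ruby analogue of that suggestion (the ids and values
here are hypothetical), a lookup structure keyed on the
[fact_id, value] pair behaves like the proposed index: the query
becomes a direct seek rather than a scan:

```ruby
# A hash keyed on [fact_name_id, value] stands in for a B-tree index
# on those two columns; each entry lists the matching host ids.
index = Hash.new { |h, k| h[k] = [] }

rows = [
  [1, "2.6.5", 10],  # fact_name_id, value, host_id
  [1, "2.6.4", 11],
  [2, "Linux", 10],
]
rows.each do |fact_name_id, value, host_id|
  index[[fact_name_id, value]] << host_id
end

# "Which hosts have fact 1 = 2.6.5?" is a single keyed lookup.
hosts = index[[1, "2.6.5"]]
puts hosts.inspect  # [10]
```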

 
> Other benefits of the proposed schema include the "metadata" about
> each fact set being columns on the node table (Nick has also proposed
> that table be called fact_sets and have a column called node_name)
> instead of being stored as a fact.

So completely denormalized?  Why?

-- M

Ohad Levy

Feb 28, 2011, 8:43:29 AM
to puppe...@googlegroups.com
-1
I don't see the benefit in changing the schema, besides adding
duplicate data to the database. Granted, there is more than one way
to design a schema, but the current design is a valid one.

I'm more than happy to supply the required finder methods for querying the facts, as I use them within Foreman.

Ohad

Matt Robinson

Feb 28, 2011, 1:18:16 PM
to puppe...@googlegroups.com, Markus Roberts
I'm less concerned with the performance ramifications or the specific
schema design of my proposal than with the fact that I don't want to
tie the architecture of the inventory service to storeconfigs, but
performance concerns are addressed inline farther down this email.

On Mon, Feb 28, 2011 at 5:43 AM, Ohad Levy <ohad...@gmail.com> wrote:
> I don't see the benefit in changing the schema,  besides adding duplicate
> data into the database.
> granted, there are more than one way to design a schema, but the current
> design is  a valid one.
> I'm more than happy to supply the required finder methods for querying the
> facts, as I use them within Foreman.

I agree it's a valid schema, and I don't want to bikeshed over schema
details. What concerns me more is that I'm not sure how the
storeconfigs tables are currently used in the wild, and at the moment
the active record models for those tables are essentially the API
that people may have built inventory-like services upon (like the
Foreman). We want to design the inventory service in such a way that
the API is at the indirector level (essentially RESTful), so it
shouldn't care about implementation details of the schema, but it
does need to care if we build on top of the tables storeconfigs
already uses.

I may be making too big a deal of this, since the data model for
storing facts is relatively simple and may not need that much
architectural flexibility; I'm just hesitant to build on top of
existing code when I'm not sure how it's depended upon and where it
might introduce issues with running two puppet "services" at the same
time.

In general I find fewer tables easier to work with, and the active
record syntax stays a little nicer too (I know you can pretty up the
syntax by wrapping some methods around things, but if you don't have
to, working with the data is more intuitive in my opinion):

Fact.where(:name => 'puppetversion', :value => '2.6.5')
    .map {|fact| fact.node.timestamp }

NormalizedFact.where(:fact_name_id => FactName.find_by_name('puppetversion').id,
                     :value => '2.6.5')
    .map {|fact| fact.node.metadata.find_by_name('timestamp') }

I don't see that storing the fact names in a separate table would
help much. The maintenance of an extra index for fact_name_id, and
the need to do a join or an extra ActiveRecord lookup, would probably
make query speed about equivalent to just storing the fact names
denormalized. In fact, it should be fairly trivial to test that out.

Here's a little repo that sets up both schemas and some test data (100
nodes with 100 facts each):
https://github.com/mmrobins/Denormalization-Permance-Test

The performance of inserting data into the normalized tables is
actually a bit slower, I think because you need to do a separate
database query to get back the id of the fact name before you can
insert the fact:

== PopulateTestData: migrating ===========
denormalized
26.950000 1.080000 28.030000 ( 39.207524)
normalized
34.860000 1.390000 36.250000 ( 49.772006)
== PopulateTestData: migrated (89.2669s) ===

Searching for facts looks pretty much identical if I do a separate
query for the fact name:

Benchmark.measure { 100.times { |i| Fact.where("name = ? AND value
= ?", "fact_#{i}", "fact_value_#{i}").map {|f| f.node.name } } }
=> #<Benchmark::Tms:0x103b27a38 @cutime=0.0, @total=5.46,
@label="", @stime=0.33, @real=6.23462510108948, @utime=5.13,
@cstime=0.0>

Benchmark.measure { 100.times {|i|
NormalizedFact.where("fact_name_id = ? AND value = ?",
FactName.find_by_name("fact_#{i}"), "fact_value_#{i}").map {|f|
f.node.name } } }
=> #<Benchmark::Tms:0x10374d228 @cutime=0.0,
@total=5.41999999999999, @label="", @stime=0.319999999999999,
@real=6.22925186157227, @utime=5.09999999999999, @cstime=0.0>

and, surprising to me, the normalized table is a little slower if I
do a join and one query:

Benchmark.measure { 100.times {|i| NormalizedFact.find(:all, :include
=> :fact_name, :conditions => ["fact_names.name = ? AND value = ?",
"fact_#{i}", "fact_value_#{i}"]).map {|f| f.node.name } } }
=> #<Benchmark::Tms:0x103954828 @cutime=0.0, @total=6.59000000000001,
@label="", @stime=0.33, @real=7.36611413955688,
@utime=6.26000000000001, @cstime=0.0>

>> Other benefits of the proposed schema include the "metadata" about
>> each fact set being columns on the node table (Nick has also proposed
>> that table be called fact_sets and have a column called node_name)
>> instead of being stored as a fact.
>
> So completely denormalized?  Why?

You mean as opposed to having a separate table for metadata? (Note
that the storeconfigs timestamp metadata is stored as a fact whose
name is a serialized symbol in yaml.) Mainly because we currently
only have one piece of metadata to store, and, as I showed in the
example above, the syntax is a bit nicer. If there's more metadata
about fact sets that people can think of, I'd be happy to normalize
the data. We could do it ahead of time, but why bother? It should be
just as easy to do at a later date.

Matt

Markus Roberts

Feb 28, 2011, 1:37:24 PM
to puppe...@googlegroups.com, Matt Robinson
I am not a DBA, but I've seen the following pattern play out many
times: the DBA says "normalize, normalize!", the developers say
"normalizing is a pain, and in this particular case it doesn't buy us
much", etc., and down the road the DBA turns out to have been right.
Then the developers rediscover that "normalized --> denormalized" is
a much easier migration than "denormalized --> normalized", and there
is much gnashing of teeth.

Note also that the fully denormalized form of the facts may well be
one to two orders of magnitude larger than the fully normalized form.

But, as I said, I'm not a DBA.

Matt Robinson

Mar 1, 2011, 4:32:43 PM
to puppe...@googlegroups.com, puppet...@googlegroups.com
On Wed, Feb 23, 2011 at 2:04 PM, Nick Lewis <ni...@puppetlabs.com> wrote:
> Our current plan for the inventory service is to provide active_record
> termini for the "facts" and "inventory" indirections. This is to support
> fast look-up of facts, and search of nodes based on their facts. However,
> there are already tables for facts, used for storeconfigs, along with an
> active_record terminus for facts.

On Fri, Feb 25, 2011 at 1:55 PM, Matt Robinson <ma...@puppetlabs.com> wrote:
> I propose that we don't share tables, and the inventory service (and
> any other future service that needs a database backend) has its own
> set of namespaced tables (servicename_tablename).

Thanks to those who gave feedback. The general consensus I've reached
after talking offline with other devs (Jacob, Nick, Paul) is that the
inventory service should use separate tables from the ones that
storeconfigs currently uses.

The question of whether to normalize or denormalize (which I didn't
mean to be the focus of this discussion at all) can be left up to the
devs who end up working on the implementation, taking the discussion
from this thread into account.

Matt
