Has anyone already developed an Elasticsearch backend to Hiera?


nick....@countersight.co

Mar 28, 2018, 1:19:58 AM
to Puppet Developers
Hi, 

I'd like to be able to store (and, more importantly, retrieve) Hiera hash data in Elasticsearch for my Puppet runs. Does anyone know if this has already been done?

Can you think of any particular reason why this might not work? From a layman's perspective, ES seems like an ideal place to be storing this data. 

Cheers, 
Nick George

John Bollinger

Mar 29, 2018, 9:52:55 AM
to Puppet Developers


On Wednesday, March 28, 2018 at 12:19:58 AM UTC-5, nick....@countersight.co wrote:
Hi, 

I'd like to be able to store (and, more importantly, retrieve) Hiera hash data in Elasticsearch for my Puppet runs. Does anyone know if this has already been done?


I haven't heard of such a project, and a bit of web searching didn't turn one up.  If someone has made such a thing, then they don't seem to be saying much about it.

 

Can you think of any particular reason why this might not work?


In principle, one ought to be able to build an Hiera back end that draws data from ES, just as there is one based on Postgres.  I don't see any special barrier.  At its most basic level, the data store aspect of an Hiera back end just needs to provide for looking up values by key, and ES is certainly capable of that.

 
From a layman's perspective, ES seems like an ideal place to be storing this data. 


Why?  Seriously, if there is more to that opinion than just "ES is a popular tool for searching and retrieving data" then that would be something we could talk about.

In any case, ES's focus is on fast and flexible search, but Hiera's basic search needs are very simple -- just value lookup by exact key, possibly in multiple runtime-selected tables.  Now, if you could actually implement Hiera-style priority lookups and its various merge behaviors directly in ES queries, then maybe you would be onto something, but I don't think that's possible, especially when you start considering how Hiera features such as interpolations affect that.
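
For illustration, the two lookup behaviors mentioned above can be sketched in a few lines of Ruby. This is a toy model with hypothetical helper names, not Hiera's actual implementation: `levels` stands in for an ordered hierarchy, highest priority first.

```ruby
# Toy sketch of Hiera-style lookup semantics (hypothetical helpers,
# not Hiera's real code). 'levels' is an ordered array of data hashes,
# highest priority first.

# 'first' strategy: return the value from the highest-priority level
# that defines the key.
def priority_lookup(levels, key)
  levels.each { |data| return data[key] if data.key?(key) }
  nil
end

# 'hash' merge strategy: merge all levels, lowest priority first, so
# higher-priority keys win.
def hash_merge_lookup(levels, key)
  levels.reverse.reduce({}) do |merged, data|
    data.key?(key) ? merged.merge(data[key]) : merged
  end
end

node   = { 'ntp::servers' => { 'a' => 'node-ntp' } }
common = { 'ntp::servers' => { 'a' => 'pool-ntp', 'b' => 'backup' } }

priority_lookup([node, common], 'ntp::servers')
# => { 'a' => 'node-ntp' }
hash_merge_lookup([node, common], 'ntp::servers')
# => { 'a' => 'node-ntp', 'b' => 'backup' }
```

Pushing even just these two strategies into ES queries, before interpolations enter the picture, is the hard part John describes.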

Moreover, although ES is pretty fast as that's judged in its target application space, Hiera's use is, again, not really in that space.  From an Hiera perspective, ES would have enormous memory overhead, and, I estimate, non-trivial performance overhead arising at minimum from IPC.  Hiera's default YAML back end is tiny in comparison, and can run entirely inside the catalog builder process.

That's by no means a reliable analysis of relative efficiency, but the point is that from my perspective, ES for an Hiera back end isn't something I would consider at all.  If you want to put your data in a full-fledged DB then have Hiera go directly to that DB -- I see no advantage to putting ES in between.  With that said, however, I don't hear much about people doing that in practice.


So anyway, was this just idle speculation on your part, or are you genuinely trying to design a Puppet infrastructure?  If the latter, do you have specific requirements that the default YAML back end does not meet?  Otherwise, I'd certainly recommend starting there.


John

nick....@countersight.co

Mar 31, 2018, 6:59:12 AM
to Puppet Developers
Thanks for your response John, 

I appreciate you taking a quick look around to see if anyone else has already done this. I had come to the same conclusion: if someone already has, they most likely haven't shared it.

You raise valid points about ES being generally pretty unsuitable as a Hiera backend. However, the project I am working on already has an Elasticsearch instance running in it, so there would be next to no performance overhead for me. It uses a web interface to write out YAML files that are fed into Hiera for a 'puppet apply' run, which configures various aspects of the system. Using Elastic instead of YAML files would eliminate some of the issues surrounding concurrent access; it would also simplify backups, as I'd just need to back up ES.

This arrangement would work well in a master-less, distributed setup where a centralised Elasticsearch holds the Hiera config for a number of nodes. Of course, another database would work just as well, but given that we're already using Elastic, it seems like a natural fit.

Is writing a proof-of-concept Hiera backend something that someone with reasonable coding skills could knock out in a few hours?

Cheers, 
Nick

John Bollinger

Apr 2, 2018, 10:47:37 AM
to Puppet Developers


On Saturday, March 31, 2018 at 5:59:12 AM UTC-5, nick....@countersight.co wrote:
Thanks for your response John, 

I appreciate you taking a quick look around to see if anyone else has already done this. I had come to the same conclusion: if someone already has, they most likely haven't shared it.

You raise valid points about ES being generally pretty unsuitable as a Hiera backend. However, the project I am working on already has an Elasticsearch instance running in it, so there would be next to no performance overhead for me. It uses a web interface to write out YAML files that are fed into Hiera for a 'puppet apply' run, which configures various aspects of the system. Using Elastic instead of YAML files would eliminate some of the issues surrounding concurrent access; it would also simplify backups, as I'd just need to back up ES.


With an ES instance already running, I agree that you have negligible additional memory overhead to consider, but that doesn't do anything about performance overhead.  Nevertheless, the (speculative) performance impact is not necessarily big; you might well find it entirely tolerable, especially for the kind of usage you describe.  It will depend in part on how, exactly, you implement the details.


Is writing a proof-of-concept Hiera backend something that someone with reasonable coding skills could knock out in a few hours?


It depends on what degree of integration you want to achieve.  If you start with the existing YAML back end, and simply hack it to retrieve its target YAML objects from ES instead of from the file system, then yes, I think that could be done in a few hours.  It would mean ES offering up relatively few, relatively large chunks of YAML, which I am supposing would be stored as whole objects in the database.  I think that would meet your concurrency and backup objectives.
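
To make the "whole YAML objects in the database" idea concrete, here is a hypothetical Ruby sketch. It assumes each node's Hiera data is stored in ES as one document whose `yaml` field holds the entire YAML blob; the actual HTTP GET (e.g. against `http://localhost:9200/hieradata/_doc/<certname>`) is elided, and a canned response stands in for it. The index name, field name, and URL shape are all assumptions.

```ruby
require 'json'
require 'yaml'

# Canned stand-in for what an ES document-GET might return, assuming
# the node's whole Hiera YAML is stored in a single 'yaml' field.
es_response = <<~JSON
  { "_id": "node1.example.com",
    "_source": { "yaml": "ntp::servers:\\n  - 0.pool.ntp.org\\n" } }
JSON

# Parse the ES envelope, then parse the YAML blob it carries --
# this is essentially the YAML back end with its file read swapped
# for a document fetch.
doc  = JSON.parse(es_response)
data = YAML.safe_load(doc['_source']['yaml'])

data  # => { "ntp::servers" => ["0.pool.ntp.org"] }
```

Because ES serves relatively few, relatively large YAML chunks this way, the concurrency and backup goals are met without teaching ES anything about Hiera semantics.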

If you want a deeper integration, such as having your back end performing individual key lookups in ES, then you might hack up an initial implementation in a few hours, but I would want a lot longer to test it out. I would want someone with detailed knowledge of Hiera and its capabilities to oversee the testing, too, or at least to review it.  Even more so to whatever extent you have in mind to implement Hiera prioritization, merging behavior, interpolations, and / or other operations affecting what data Hiera presents to callers.  If there is an actual budget for this then I believe Puppet, Inc. offers consulting services, or I'm sure you could find a third-party consultant if you prefer.


John

Reid Vandewiele

Apr 2, 2018, 12:32:25 PM
to Puppet Developers
Hey Nick,

A particular phrase you used caught my attention: "Elasticsearch holds the Hiera config for a number of nodes."

There's a lot about putting together the words "elasticsearch" and "hiera backend" that can sound scary if it's done wrong. That said, I have seen backends built to solve the "config for individual nodes" problem in a way that complements Hiera's default yaml backend system, without noticeably sacrificing performance, by using a carefully limited number of calls to the external backend per catalog compile. Most generalized data that doesn't need to change frequently or programmatically is still stored in yaml files alongside the code.

When that's done, the implementing hiera.yaml file may look something like this:

hierarchy:
  - name: 'Per-node data'
    data_hash: elasticsearch_data
    uri: 'http://localhost:9200'
    path: "%{trusted.certname}"
  
  - name: 'Yaml data'
    data_hash: yaml_data
    paths:
      - "role/%{trusted.extensions.pp_role}"
      - "datacenter/%{trusted.extensions.pp_datacenter}"
      - "common"

The most important bit showcased here is that for performance, the data_hash backend type is used. Hiera can make thousands of lookup calls per catalog compile, so something like lookup_key can get expensive over an API. data_hash front-loads all the work, returning a batch of data from one operation which is then cached and consulted for the numerous lookups that'll come from automatic parameter lookup.
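
A toy Ruby model (not real Hiera internals) makes the cost difference plain: count simulated round-trips to the external store for 1000 lookups under each style.

```ruby
# Simulate the cost difference between lookup_key-style (one remote
# call per key) and data_hash-style (one remote call, then in-memory
# lookups). 'fetch_document' stands in for a call to ES.
api_calls = 0
fetch_document = lambda do
  api_calls += 1
  { 'key1' => 'v1', 'key2' => 'v2' }  # pretend this came from ES
end

keys = (1..1000).map { |i| "key#{i % 2 + 1}" }

# lookup_key style: every lookup goes over the wire.
api_calls = 0
keys.each { |k| fetch_document.call[k] }
per_key_calls = api_calls   # => 1000

# data_hash style: one fetch up front, then cached lookups.
api_calls = 0
cached = fetch_document.call
keys.each { |k| cached[k] }
batched_calls = api_calls   # => 1
```

Hiera's own caching of a data_hash result per hierarchy level per compile is what makes the second pattern cheap in practice.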

There's an example of how to do that in https://github.com/uphillian/http_data_hash.

To John's point, I wouldn't hesitate to run your use case by an expert if you have the option.

Cheers,
~Reid

nick....@countersight.co

Apr 4, 2018, 11:30:26 PM
to Puppet Developers
Thanks for that John, 

If I go down this road, I'll post any code that I produce on GitHub and see if anyone else is interested in trying/testing it. 

Cheers, 
Nick

nick....@countersight.co

Apr 4, 2018, 11:34:06 PM
to Puppet Developers
Thanks for your tips Reid, especially the bit about "data_hash". I'll be sure to keep that in mind if I end up writing such a backend. Unfortunately there's no budget for this, so it would definitely be an 'in-house' job. It's possible that I might be able to use the http_data_hash plugin you mentioned with Elasticsearch, since it talks HTTP.
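
If the plugin can fetch an arbitrary URL and treat the JSON it returns as the data hash, a hierarchy level along these lines might work. This is purely a guess at the wiring: the function name, the option names, the index name `hieradata`, and the `_source` URL form are all assumptions to check against the plugin's README and the ES version in use.

```yaml
hierarchy:
  - name: 'Per-node data from Elasticsearch'
    data_hash: http_data_hash
    uri: "http://localhost:9200/hieradata/doc/%{trusted.certname}/_source"
```

The `_source` endpoint would return just the stored document body, without the ES response envelope, which is closest to the flat key/value hash a data_hash function is expected to produce.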

Cheers, 
Nick