Switching from YAML and PSON to JSON


Andy Parker

Oct 23, 2014, 8:04:13 PM
to puppe...@googlegroups.com
A while ago we removed support for puppet to *send* YAML on the network. At the same time we switched to safe_yaml for *receiving* YAML, in order to keep compatibility with existing agents. Instead of YAML, all of the communication was done with PSON, a variant of JSON that has been in use in puppet since at least 2010. As far as I understand, PSON started out as simply a vendored copy of json_pure. The name PSON was apparently chosen because rails would try to patch anything named JSON, so it needed a different name to stop that from happening (that is all hearsay, so I don't know how truthful it is).

Over time PSON started to evolve; little changes were made to it here and there. The largest change came about because of http://projects.puppetlabs.com/issues/5261. The changes for that ticket removed the restriction that only valid UTF-8 could be sent in PSON, which opened the door to a) binary data as file contents and b) absolutely no control over what encodings puppet was using. Since then, a large number of issues have stemmed from puppet not keeping track of what encoding it is dealing with.

I'd like to move us away from PSON and onto a standard format. YAML is out of the question because it is either slow and unsafe (all of the YAML vulnerabilities) or extremely slow and safe (safe_yaml). MessagePack might be nice: it is pretty well specified and has a fairly large number of libraries written for it, but it doesn't do much to help us solve the wild west of encoding in puppet, since MessagePack doesn't really enforce string encodings and everything is treated as an array of bytes.

In order to keep consistency across the various puppet projects we'll be going with JSON. JSON requires that everything is valid UTF-8, which forces us to be deliberate about how we handle data. JSON is pretty fast (not as fast as MessagePack), and there are plenty of alternative libraries if it turns out that the built-in json isn't fast enough (puppet-server could use jrjackson, for instance).

So what all would be changing?

  1. Network communication that is currently PSON would move to JSON.
  2. YAML files that the master and agent write would move to JSON (node, facts, last_run_summary, state, etc.).
  3. A new exec node terminus would be written to handle JSON, or the existing one would be updated to sniff the format (check whether the first byte is '{'; see the sketch below).
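
A rough sketch of the sniffing in (3); the method name and surrounding plumbing are illustrative only, not the real terminus API:

    require 'json'
    require 'yaml'

    # New exec scripts emit JSON (first non-space byte is '{');
    # older ones keep emitting YAML, which we load safely.
    def parse_node_output(output)
      if output.lstrip.start_with?('{')
        JSON.parse(output)
      else
        YAML.safe_load(output)
      end
    end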

That is just some of the changes that will need to happen. There will also be a ripple of other changes driven by the fact that JSON has to be UTF-8.

  1. A new "encoding" parameter on File and a base64() function. This will allow transferring non-UTF-8 data as file content until we can get a new catalog structure that tracks data types, along with language changes to differentiate Strings from Blobs.
  2. Reports will have to strip invalid UTF-8 sequences; nothing would be worse than a single bad byte stopping a report from being sent. This is what PuppetDB does right now with facts, catalogs, and reports (a sketch follows this list).
  3. Facts can't contain non-UTF-8 data. Facter 2 already enforces this.
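
A minimal sketch of the stripping in (2), assuming Ruby 2.1+'s String#scrub (older Rubies would need an encode-based equivalent):

    require 'json'

    # Drop invalid UTF-8 byte sequences from report text before
    # serializing, so a single bad byte can't block the whole report.
    line  = "exec output: \xC3\x28".force_encoding(Encoding::UTF_8)
    clean = line.scrub('')              # => "exec output: ("
    JSON.generate('message' => clean)   # guaranteed valid UTF-8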

As I start entering tickets and we work through them, other things will probably come up. I've created https://tickets.puppetlabs.com/browse/PUP-3524 to track this work; tickets will be created as children of that epic.

What I don't know right now is how much of an impact this change will actually have. It isn't really clear how often non-UTF-8 data actually ends up in the catalog. Is it common enough that making this change without better support in the language would be a huge burden? Or is it rare, showing up in only a few specific situations? How can we find out?

--
Andrew Parker
Freenode: zaphod42
Twitter: @aparker42
Software Developer

Join us at PuppetConf 2015, October 5-9 in Portland, OR - http://2015.puppetconf.com 
Register early to save 40%!

Spencer Krum

Oct 23, 2014, 9:01:42 PM
to puppe...@googlegroups.com
Awesome work, Andy. I will be pleased to not see encoding bugs any more. I also did not know that anecdote about PSON; good stuff.

As to your question about usage: I use the hiera-file type pretty frequently, so some of my catalogs have binary data in the 'content' parameter of the file resource. Or at least I think that's what's going on. Can you describe how to check my catalogs for the things you're asking about? I'd be happy to generate some results and share them with you.

Thanks,
Spencer

--
Spencer Krum
(619)-980-7820

Eric Shamow

Oct 23, 2014, 9:07:28 PM
to puppe...@googlegroups.com
+1 on hiera-file. I suspect this is where you are going to find this: edge cases around create_resources and other places where people don't realize they're serializing data into another format.

On the plus side, the people doing this are also the ones most likely to understand the impact and be able to work around it. I'm not (consciously) using anything that isn't UTF-8, and if I find something, I expect the hassle of going through a transform or getting the data another way to be much lower than the silliness of putting that much binary data into the catalog in the first place.

Overall, very +1 on this change.

-Eric

-- 
Eric Shamow
Sent with Airmail

Henrik Lindberg

Oct 23, 2014, 9:24:35 PM
to puppe...@googlegroups.com
> 1. A new "encoding" parameter on File and a base64() function. This
> will allow transferring non-UTF-8 data as file content until we can get
> a new catalog structure that allows tracking data types and more changes
> to the language to differentiate Strings from Blobs.

I would like us to add a Binary datatype up front instead of doing the
base64 encoding in puppet code. It should be the serialization format's
responsibility to transform a Binary into a form that can be
transported: JSON in text form would base64 encode it, while MsgPack
(or a binary JSON variant) could carry the bytes directly.

Even if our first cut always performs a base64 encoding, the user
logic does not have to change.

Thus, instead of calling base64(content) and setting the encoding on
the File resource, a Binary is created directly with a
binary(encoding, content) function.
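
A quick sketch of that division of responsibility (the class and
method names are mine, purely illustrative):

    require 'base64'

    # User logic only ever constructs a Binary; it never base64-encodes.
    Binary = Struct.new(:bytes)

    # Each serialization format decides how to carry the bytes.
    def serialize_value(value, format)
      return value unless value.is_a?(Binary)
      case format
      when :json    then Base64.strict_encode64(value.bytes) # text transport
      when :msgpack then value.bytes                          # raw bytes are fine
      end
    end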

- henrik

--

Visit my Blog "Puppet on the Edge"
http://puppet-on-the-edge.blogspot.se/

Erik Dalén

Oct 24, 2014, 5:47:33 AM
to Puppet Developers
How do you differentiate between an encoded binary string and a regular string in the JSON, though?
You would need some sort of annotation, and if that annotation lives inside the string (which it already does in the content parameter of files, btw) you would also need a way to escape it, so that a regular string containing the annotation can still be expressed.
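
To make the ambiguity concrete (the values here are illustrative only):

    # Is this a user-supplied hash that happens to look like an
    # annotation, or an encoded binary? Without a reserved marker
    # (and an escape for it) the receiver cannot tell.
    payload = { "content" => { "encoding" => "base64", "data" => "SGVsbG8=" } }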
 
--
Erik Dalén

Wil Cooley

Oct 24, 2014, 12:30:10 PM
to puppet-dev group


On Oct 23, 2014 5:04 PM, "Andy Parker" <an...@puppetlabs.com> wrote:
>
> So what all would be changing?

...


>   2. YAML files that the master and agent write would move to JSON (node, facts, last_run_summary, state, etc.).

Please store these pretty-printed rather than minimized; one long, unindented line of JSON is pretty intolerable for a human to read. Having to run them through a pretty-printer manually is feasible, but cumbersome and not user-friendly.

It might even make sense to have a flag to always pretty-print, even in the other cases, in case a human needs to troubleshoot.

Wil

Joshua Hoblitt

Oct 24, 2014, 12:40:34 PM
to puppe...@googlegroups.com
On 10/23/2014 05:04 PM, Andy Parker wrote:
> MessagePack might be nice. It is pretty well specified, has a fairly
> large number of libraries written for it, but it doesn't do much to
> help us solve the wild west of encoding in puppet. In MessagePack
> there aren't really any enforcements of string encodings and
> everything is treated as an array of bytes.
AFAIK this is no longer true of the spec (I do not know what the
state of the various implementations is). There was a bunch of
discussion around type encoding about a year ago as part of a push to
prepare for an eventual IETF submission.

https://github.com/msgpack/msgpack/blob/master/spec.md#type-system

-Josh


Trevor Vaughan

Oct 24, 2014, 12:44:59 PM
to puppe...@googlegroups.com
I would like to ask for a puppet subcommand that pretty prints (with highlighting?!) the catalog.

For large catalogs, those newlines can add 50k or more, and the smaller the better IMO.

Trevor

--
Trevor Vaughan
Vice President, Onyx Point, Inc
(410) 541-6699
tvau...@onyxpoint.com

-- This account not approved for unencrypted proprietary information --

Andy Parker

Oct 24, 2014, 12:59:21 PM
to puppe...@googlegroups.com
I talked to Henrik about this, and his idea is that we make file content a special case. We write a binary() function that takes a String and produces, in the serialized form, a hash like { "encoding" => ..., "data" => ... }. The file content parameter is then changed to accept either a string or a hash of that structure. We could even implement this as a type in the puppet language and update the serializer to handle it. Perhaps we should also create a new binary_file() function so that non-UTF-8 values don't leak in via file().
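
A rough sketch of what that could look like on the wire; every name here is provisional, not a final design:

    require 'base64'
    require 'json'

    # Hypothetical helper: wrap raw bytes so they survive a
    # UTF-8-only transport, mirroring the hash shape above.
    def binary(content)
      { 'encoding' => 'base64', 'data' => Base64.strict_encode64(content) }
    end

    raw = File.binread('/tmp/logo.png')   # any non-UTF-8 bytes
    resource = {
      'type'       => 'File',
      'title'      => '/opt/app/logo.png',
      'parameters' => { 'content' => binary(raw) }
    }
    JSON.generate(resource)               # valid UTF-8 end to end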
 
 

Andy Parker

Oct 24, 2014, 1:25:46 PM
to puppe...@googlegroups.com
Pretty printing introduces significant overhead. On a test catalog that we have (the one produced by our many_modules benchmark) it increases the space needed by 52% (JSON.pretty_generate(p).size == 141065 vs JSON.generate(p).size == 92596) and increases the time to serialize by 24%.
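
(In the IRB snippets below, p is a local variable holding the parsed many_modules catalog as a Ruby Hash.)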

1.9.3-p484 :017 > Benchmark.measure { 1000.times { JSON.generate(p) } }
 =>   3.650000   0.010000   3.660000 (  3.664184)

1.9.3-p484 :018 > Benchmark.measure { 1000.times { JSON.pretty_generate(p) } }
 =>   4.390000   0.130000   4.520000 (  4.527144)

It doesn't seem like a good idea to incur those overheads at all times on the off chance that a user will want to look at the data.

A flag might be a reasonable way of introducing this, but using a tool like jsonpp or jq seems just as easy, and it doesn't require turning on a flag and restarting the system.


Henrik Lindberg

Oct 24, 2014, 1:30:00 PM
to puppe...@googlegroups.com
On 2014-24-10 18:44, Trevor Vaughan wrote:
> I would like to ask for a puppet subcommand that pretty prints (with
> highlighting?!) the catalog.
>
> For large catalogs, those newlines can add 50k or more and the less the
> better IMO.
>

Totally agree. In fact, storing and viewing should really be
separate: say you want efficient storage and fast parsing and are
using MsgPack; you then need a command to view and pretty print
anyway.

It would also be useful to convert: read what is stored (in MsgPack,
say) and output it as pretty-printed JSON, etc.

Henrik Lindberg

Oct 24, 2014, 1:35:10 PM
to puppe...@googlegroups.com
The current MsgPack spec version does not specify embedded string
encodings, but they are working on it.

We would simply layer our own rules on top of MsgPack; say, JSON
semantics.

Luke Kanies

Oct 24, 2014, 1:48:43 PM
to puppe...@googlegroups.com
Can’t we switch file serving to just do raw downloads?  Why do they even need encoding at all?

Especially if we focus on getting the static catalog to work, all file serving turns into a plain HTTP get, and it should skip all of the Puppet transfer, encoding, etc.

Andy Parker

Oct 24, 2014, 2:06:20 PM
to puppe...@googlegroups.com
File serving is already done that way. We switched file buckets to that system a few releases ago as well, IIRC. The problem isn't the file server or the file bucket, but file resources in manifests that have a "content" parameter with non-UTF-8 data.
 
> Especially if we focus on getting the static catalog to work, all file serving turns into a plain HTTP get, and it should skip all of the Puppet transfer, encoding, etc.


The static compiler deals with the source parameter, not the content parameter (although I suppose it could). The current implementation also has the problem that it overloads the content parameter with another meaning, which has caught out several people (try to save a file whose literal content is "{md5}abdefabcdef").
 


Wil Cooley

Oct 24, 2014, 2:49:27 PM
to puppet-dev group
On Fri, Oct 24, 2014 at 10:25 AM, Andy Parker <an...@puppetlabs.com> wrote:
 
> Pretty printing introduces a significant overhead. On a test catalog that we have (the one produced by our many_modules benchmark) it increases the space needed by 52% (JSON.pretty_generate(p).size == 141065 and JSON.generate(p).size == 92596) and increases the time to serialize by 24%.

Yow! That sounds like a lot! For the master, storing stuff for all of the agents, it certainly could add a non-negligible amount of overhead.

But for the agent, which is going to write 3 state files and its catalog at the end of each run, it seems insignificant.
 
On the other hand, I'm more likely to try to run grep on the master, which does not work well with minimized JSON (I'm clever enough that I can do it, but I would curse under my breath at having to do so). (Yes, I could probably query PuppetDB for these cases, but...)

I guess it doesn't matter much; the matter is on my mind because I was recently wrangling with Cobbler, which stores host data in JSON and nicely commits changes made through the web UI into a local Git repo, but that's mostly useless for diff'ing/blame'ing without pretty-printed data. (Until I found the option to store pretty-printed data.)

Wil

Joshua Hoblitt

Oct 24, 2014, 4:59:05 PM
to puppe...@googlegroups.com
On 10/24/2014 11:49 AM, Wil Cooley wrote:
> On the other hand, I'm more likely to try to run grep on the master, which
> does not work well with minimized JSON (I'm clever enough that I can do it,
> but I would curse under my breath at having to do so). (Yes, I could
> probably query PuppetDB for these cases, but...)

If you haven't already tried it, jgrep (http://jgrep.org/) can be helpful.

-Josh


Andy Parker

Oct 24, 2014, 5:59:12 PM
to puppe...@googlegroups.com
And if you want a mind-warping, but incredibly powerful, tool, try out jq (http://stedolan.github.io/jq/). I used it to process a directory full of JSON files into a single file that contained an analysis.

echo *.analysis | xargs cat | jq -s 'map(.[] | select(has("Puppet::Pops::Model::ResourceExpression")) | to_entries) | add | group_by(.key) | map({ "key": .[0] | .key, "value": map(.value) | add }) | from_entries as $expressions | { "expressions": $expressions, "expressions_per_resource": (($expressions | to_entries | map(.value) | add) / ($expressions | .["Puppet::Pops::Model::ResourceExpression"])), "most_common_expressions": $expressions | to_entries | sort_by(.value) | reverse | .[0:9] | map(.key) }'
 
Had to echo and cat because there were too many files for the command line.


Charlie Sharpsteen

Oct 26, 2014, 12:51:24 AM
to puppe...@googlegroups.com



Also, for the simple case of pretty-printing condensed JSON, any system with a Python interpreter can do it with a simple shell command:

    cat fugly.json | python -m json.tool

I agree that looking at condensed JSON is a pain, but pretty-printing by default may not be worth the performance tradeoff, given that there are quick options for doing it on demand.

James Turnbull

Oct 26, 2014, 3:08:37 AM
to puppe...@googlegroups.com
Andy Parker wrote:
> the communication was done with PSON, which is a variant of JSON that
> has been in use in puppet since at least 2010. As far as I understand
> PSON started out as simply a vendored version of json_pure. The name
> PSON was apparently because rails would try to patch anything named
> JSON, and so they needed to name it something different to stop that
> from happening (that is all hearsay, so I don't know how truthful it is).
>

Ah... History.

https://github.com/puppetlabs/puppet/commit/bca3b70437666a8b840af032cab20fc1ea4f18a2

Regards

James

--
* The Docker Book (http://dockerbook.com)
* The LogStash Book (http://logstashbook.com)
* Pro Puppet (http://tinyurl.com/ppuppet2 )
* Pro Linux System Administration (http://tinyurl.com/linuxadmin)
* Pro Nagios 2.0 (http://tinyurl.com/pronagios)
* Hardening Linux (http://tinyurl.com/hardeninglinux)

Luke Kanies

Oct 26, 2014, 8:59:22 AM
to puppe...@googlegroups.com
> On Oct 26, 2014, at 12:08 AM, James Turnbull <ja...@lovedthanlost.net> wrote:
>
>
> Ah... History.
>
> https://github.com/puppetlabs/puppet/commit/bca3b70437666a8b840af032cab20fc1ea4f18a2

In other words, exactly right.

We initially just force-loaded rails first and then overrode its
monkey patches, but then rails started force-loading all of the json
libs so it could guarantee that its (incompatible) monkey patches won.

At that point our only choice was to use different names. Yay rails.

James Turnbull

Oct 26, 2014, 1:06:30 PM
to puppe...@googlegroups.com
Luke Kanies wrote:
>
> In other words, exactly right.

There had to be a first time for an historical event to be reported
correctly. :)

>
> We initially just force-loaded rails first and then over-rode its
> monkey patches, but then it started force-loaded all of the json libs
> so it could guarantee that its (incompatible) monkey patches won.
>
> At that point our only choice was to use different names. Yay rails.
>

All I remember now is that commit and Markus stroking his beard and
shaking his head a lot. :)

Cheers

markus

Oct 26, 2014, 4:58:02 PM
to puppe...@googlegroups.com

> All I remember now is that commit and Markus stroking his beard and
> shaking his head a lot. :)

That's funny. I remember it more as pulling on my beard, banging my
head on the table, and whimpering "Why, why, why?!"

-- M

