Jira (PDB-5594) SPIKE - can we use clojure/data.generators

Austin Blatt (Jira)

Feb 13, 2023, 12:28:02 PM
to puppe...@googlegroups.com
Austin Blatt created an issue
 
PuppetDB / Task PDB-5594
SPIKE - can we use clojure/data.generators
Issue Type: Task
Assignee: Unassigned
Created: 2023/02/13 9:27 AM
Priority: Normal
Reporter: Austin Blatt

The existing tool in Clojure that I know of is https://github.com/clojure/data.generators/ but I do not have any experience with it.

Can we use it to generate data that matches the command wireformats that we need?

Should we?


Cas Donoghue (Jira)

Feb 15, 2023, 1:35:03 PM
to puppe...@googlegroups.com
Cas Donoghue updated an issue
Change By: Cas Donoghue
Sprint: Skeletor 03/01/2024

Cas Donoghue (Jira)

Feb 15, 2023, 1:45:02 PM
to puppe...@googlegroups.com

Joshua Partlow (Jira)

Feb 16, 2023, 1:02:02 PM
to puppe...@googlegroups.com
Joshua Partlow assigned an issue to Joshua Partlow
Change By: Joshua Partlow
Assignee: Joshua Partlow

Joshua Partlow (Jira)

Feb 28, 2023, 2:02:05 AM
to puppe...@googlegroups.com
Joshua Partlow commented on Task PDB-5594
 
Re: SPIKE - can we use clojure/data.generators

I've looked at clojure.data.generators some, and also at the existing pdb functions for generating resources and data elements. I wrote some small catalog generator functions in cli.benchmark to feel my way around. Catalogs should have internal consistency, though, with a set of resources that are linked into a graph by a set of edge references. For the most part, our existing functions seem sufficient to get us started. I expect the data.generators library can help with any additional helper functions that might be needed. Here's a simple and incomplete example that I will try to flesh out for PDB-5592: https://github.com/jpartlow/puppetdb/tree/tmp/pdb-5592.
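To make the internal-consistency point concrete, here is a minimal sketch in plain Clojure of building resources first and deriving edges from them. The function names (make-resources, make-edges, make-catalog) and the reduced :type/:title resource shape are assumptions for illustration only; they are not taken from the linked branch or from the full catalog wire format.

```clojure
;; Hypothetical helpers; the shapes below are a simplification of a catalog.
(defn make-resources
  "Return n resource maps with unique titles."
  [n]
  (mapv (fn [i] {:type "File" :title (str "/tmp/file-" i)})
        (range n)))

(defn make-edges
  "Chain each resource to the next one, so every edge refers to a
  resource that actually exists in the catalog."
  [resources]
  (mapv (fn [[source target]]
          {:source source :target target :relationship "before"})
        (partition 2 1 resources)))

(defn make-catalog
  "Build a small catalog map for certname with resource-count resources."
  [certname resource-count]
  (let [resources (make-resources resource-count)]
    {:certname  certname
     :resources resources
     :edges     (make-edges resources)}))
```

Calling (make-catalog "host-1" 4) gives four resources and three edges, and each edge endpoint is one of the generated resources. A real generator would vary types, parameters and graph shape, but the ordering of steps (resources first, edges second) is the part that matters.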

Rob pointed me to spec.alpha and schema.generators. Since pdb has some schema related to the wire formats for facts, catalogs and reports, I worked with schema.generators for a while to get a feel for how it would work. Since it would potentially generate directly from our existing schema, I thought it might keep things simpler. Under the hood, schema.generators relies on the test.check library for generation (pdb has a dev dependency on an older test.check 0.9.0, just for the test suite atm). However, given the need for internally consistent, self-referential resources/events/edges in catalogs and reports, data generation isn't as straightforward as just blowing some random strings into a set of leaf properties. I think test.check is sufficiently capable to deal with that by using functions like gen/bind and gen/let, so that building a catalog in phases would allow edges to be built from earlier generated resources, for example. But the real problem I had was that test.check is just too random.
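As a rough illustration of the phased approach with gen/let, the sketch below generates resources first and then draws edge endpoints only from those resources. It assumes test.check is on the classpath and aliased as gen; resource-gen, catalog-gen and the reduced resource shape are hypothetical names for this example, not the pdb schema.

```clojure
(require '[clojure.test.check.generators :as gen])

;; Hypothetical, simplified resource shape (not the real wire format).
(def resource-gen
  (gen/hash-map :type  (gen/elements ["File" "Package" "Service"])
                :title (gen/not-empty gen/string-alphanumeric)))

(def catalog-gen
  ;; Phase 1: generate a distinct set of resources.
  ;; Phase 2: pick edge endpoints from the resources made in phase 1.
  (gen/let [resources (gen/vector-distinct resource-gen
                                           {:min-elements 2 :max-elements 5})
            pairs     (gen/vector (gen/tuple (gen/elements resources)
                                             (gen/elements resources))
                                  1 5)]
    {:resources resources
     :edges     (vec (for [[source target] pairs
                           :when (not= source target)] ; drop self-edges
                       {:source source
                        :target target
                        :relationship "before"}))}))

;; Example use at the REPL:
;; (gen/generate catalog-gen)
```

This keeps the edges consistent with the resources, but the titles are still random strings, which is the readability problem described above.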

For example, as a test of generating fact values:

local-test=> (clojure.pprint/pprint
               (tc/sample
                 (tc/map tc/string-alphanumeric
                         (tc/recursive-gen
                           (fn [inner]
                             (tc/one-of [(tc/vector inner)
                                         (tc/map tc/string-alphanumeric inner)]))
                           tc/string-alphanumeric)
                         {:min-elements 5 :max-elements 10})
                 2))
({"" [],
  "2" ["G" "4x"],
  "A" [["T"]],
  "QP" [{}],
  "W2" {"" {}},
  "963" {},
  "up" [{"K" [[]]}],
  "bx" []}
 {"" [{"g" {"S" {"V" []}}}],
  "42" {"" [[{}]]},
  "93" {},
  "9Eu" [],
  "595" [[]],
  "P2" {"" []},
  "5T7u" [],
  "K" ["M5nu" "p7g1" "" ""]})

Now, this could be refined to better constrain and generate reasonable fact data, but I think it's working at the problem from the wrong direction, and we're better off starting with intelligible fact, catalog and report data and permuting that.

I didn't look deeply into spec.alpha, but it does also use test.check under the hood for generation. It's more complicated than I could pick up quickly, and I didn't want to start creating a duplicate set of spec-based schema for wire formats, especially given that we'd just be generating again with test.check. However, there may well be other reasons for going this direction, and someone else on the team with more Clojure experience may weigh in here.

One other thing Rob mentioned was an old Puppet Data Platform branch that included code to generate fake_data in such a way that it can be reproduced with the same seed. It's not set up to generate pdb data in the formats we need, but being able to regenerate the exact same data set might be useful for benchmarking. Or it may be sufficient to just be able to specify an equal set of starting parameters (node count, resource count, fact size, etc). I'm not certain what level of reproducibility we're looking for.
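For reference, seed-based reproducibility could look something like the sketch below, which uses java.util.Random directly. This is not the code from the Puppet Data Platform branch; random-word, make-facts and the flat fact_N naming are assumptions made up for this example.

```clojure
(import 'java.util.Random)

(defn random-word
  "Return a lowercase word of the given length drawn from rng."
  [^Random rng length]
  (apply str (repeatedly length #(char (+ (int \a) (.nextInt rng 26))))))

(defn make-facts
  "Build a flat fact map from a seed and a fact count. The same seed and
  count always produce the same map."
  [seed fact-count]
  (let [rng (Random. seed)]
    (into {}
          (map (fn [i] [(str "fact_" i) (random-word rng 8)]))
          (range fact-count))))

;; Same starting parameters, same data:
;; (= (make-facts 42 5) (make-facts 42 5)) is true
```

If only the starting parameters need to match (node count, resource count, fact size), the seed argument can be dropped and the sizes alone passed in.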

Austin Blatt (Jira)

Feb 28, 2023, 3:20:03 PM
to puppe...@googlegroups.com
Austin Blatt commented on Task PDB-5594

I think the simplicity of generating the catalog with some helper functions looks good. It gives us flexibility over structuring the content of each field.

I agree that the test.check output looks too random, and given the simplicity of the example where we generate things ourselves, it's probably not worth trying to cajole that library into producing the output we want.

My thinking was that this tool doesn't need any level of reproducibility. Benchmark can load example commands from JSON sample data. By default it uses some that are checked in to the puppetdb repo: https://github.com/puppetlabs/puppetdb/tree/main/resources/puppetlabs/puppetdb/benchmark/samples. So I was envisioning running these generators to create sample files, and then the tests could re-use those sample files until we feel the need to change some variable of the generated data.

Austin Blatt (Jira)

Feb 28, 2023, 3:22:02 PM
to puppe...@googlegroups.com
Austin Blatt updated an issue
 
Change By: Austin Blatt
Fix Version/s: PDB n/a
Release Notes: Not Needed