| I've looked at clojure.data.generators some, and also at the existing pdb functions for generating resources, and data elements. I wrote some little catalog generator functions in the cli.benchmark to sort of feel my way around. Catalogs should have internal consistency, though, with a set of resources that are linked into a graph by a set of edge references. For the most part, our existing functions seem sufficient to get us started. I expect the data.generators library can help with an additional helper functions that might be needed. Here's a simple and incomplete example that I will try to flesh out for PDB-5592: https://github.com/jpartlow/puppetdb/tree/tmp/pdb-5592. Rob pointed me to spec.alpha, and schema.generators. Since pdb has some schema related to the wire formats for facts, catalogs and reports, I worked with schema.generator for a while to try and get a feel for how it would work. Since it would potentially generate directly from our existing schema I thought it might keep things simpler. Under the hood, schema.generators relies on the test.check library for generation (pdb has a dev dependency on an older test.check 0.9.0, just for the test suite atm). However given the need for internally consistent self-referential resource/events/edges in catalogs and reports, data generation isn't as straight forward as just blowing some random strings into a set of leaf properties. I think test.check is sufficiently complex to deal with that by using functions like gen/bind and gen/let so that building a catalog in phases would allow edges to be built from earlier generated resources, for example. But the real problem I had was that test-check is just too random. For example, as a test of generating fact values:
local-test=> (clojure.pprint/pprint (tc/sample (tc/map tc/string-alphanumeric (tc/recursive-gen (fn [inner] (tc/one-of [(tc/vector inner) (tc/map tc/string-alphanumeric inner)])) tc/string-alphanumeric) {:min-elements 5 :max-elements 10}) 2)) |
({"" [], |
"2" ["G" "4x"], |
"A" [["T"]], |
"QP" [{}], |
"W2" {"" {}}, |
"963" {}, |
"up" [{"K" [[]]}], |
"bx" []} |
{"" [{"g" {"S" {"V" []}}}], |
"42" {"" [[{}]]}, |
"93" {}, |
"9Eu" [], |
"595" [[]], |
"P2" {"" []}, |
"5T7u" [], |
"K" ["M5nu" "p7g1" "" ""]})
|
Now, this could be refined to better constrain and generate reasonable fact data, but I think it's working at the problem from the wrong direction, and we're better of starting with intelligible fact, catalog and report data and permuting that. I didn't look deeply into spec.alpha, but it does also use test.check under the hood for generation. It's more complicated than I could pick up quickly, and I didn't want to start creating a duplicate set of spec based schema for wireformats, especially given that we'd just be generating again with test.check. However, there may well be other reasons for going this direction, and someone else on the team with more clojure experience may weigh in here. One other thing Rob mentioned was an old Puppet Data Platform branch that included code to generate fake_data in such a way that it can be reproduced with the same seed. It's not set up to generate pdb data in the formats we need, but being able to regenerate the exact same data set might be useful for benchmarking. Or it maybe sufficient to just be able to specify an equal set of starting parameters (node count, resource count, fact size, etc). I'm not certain what level of reproducibility we're looking for. |