version controlling worldmodels in safeguarded AI: why not git and lockfiles?

Quinn Dougherty

unread,

May 28, 2024, 3:16:05 PM5/28/24

to safe-by-design

There's remark in TA1 of safeguarded AI that the worldmodels need version control.

TA1.2 Backend shall develop a professional-grade implementation of the Theory, yielding a distributed version control system for knowledge represented as mathematical world-models, as well as for specifications.

Given dependencies (which may have their own dependencies), a lockfile is a transitive closure that marks particular commits or tags (with hashes) for each version and each version in the transitive closure.

So `conda-lock.yml` (from python) is generated by SMT, but others (like `yarn.lock` from javascript) use simpler graph algorithms. The problem with these lockfiles is if package A depends on package C >15 and package B depends on package C <14, it's unclear which version of C you want in your project. In some cases you can just install two versions of C and use them in the two different contexts, but this can be finicky and doesn't always work.

`flake.lock` (from nixpkgs/nixos) doesn't have this problem, because dependencies are fully isolated. What's more, the specification file `flake.nix` allows granular control over the transitive closure, with "follows":

```

{

Inputs = {

Nixpkgs.url = "github:nixos/nixpkgs"

Otherdep.url = "github:me/somerepo"

Otherdep.inputs.nixpkgs.follows = "Nixpkgs"

...

```

Since every flake has a lockfile, by default Otherdep will have its own commit of nixpkgs. However, I can patch a dep's lockfile in the spec to force agreement. Furthermore, I tend not to actually need to, since when nix arranges symlinks to dependencies it keeps track enough to not have collisions.

I don't think `package.json` or `Pipfile` or `anaconda.yml` has this.

I think a flake-like lockfile system may be sufficient for a system like safeguarded ai.

Ozzie of QURI's counterargument is that commits are not granular enough for worldmodels that depend on continuous time data streams. If i make up an example that i think captures this concern, consider if one input to a worldmodel is inflation rate from a web socket stream, and another is a GET request to some wikidata repo that fires monthly. So unless you checkpoint in auxiliary code, you dont know what time t your inflation rate is from. it's unclear how an analyst can accomplish a reproducible model (like a harder version of fixing random seeds in data science). If you think there was some corruption at wikidata at time t-1, you can roll back to time t-2 given any lockfile setup, but it's unclear what a principled approach to the interest rate stream is if you think the interest rate stream was corrupted recently.

Track the development of squigglehub https://squigglehub.org/ which I think is making a version control system soon ish.

One further reason git may not be great is that there may not be a natural notion of diff for worldmodels, or even specs. Squiggle code could serve as a bridge between the pedestrian's git diff and some more quantitative idea of diff, but this may be an unnecessary limitation. What are some ideas for quantitative diffs?

Owen Lynch

unread,

May 28, 2024, 8:52:08 PM5/28/24

to Quinn Dougherty, safe-by-design

From a practical perspective, I'm hoping that ARIA bootstraps via nix
and precisely this method.

However, there are two problems, related to what you said.

1. While we can store an arbitrary model as a text file in a git
repository, the line-based diff algorithm is not necessarily
meaningful; we want diffs that are meaningful to the structure of the
model, i.e. "creation and deletion of entities/laws."
2. Ideally we can go much, much more granular than git repositories.
https://www.unison-lang.org/ version controls individual functions; we
could version control individual parts of a model. Sure, we can
approximate a hashcons datastructure via git repositories and lock
files, but it's really not ergonomic.

Happy to hear more about your thoughts on this,
-Owen

> --
> You received this message because you are subscribed to the Google Groups "safe-by-design" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to safe-by-desig...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/safe-by-design/CAGaOgARp%3Dq%2BogiesLPiOgepZxHXWDdoS%2BMd58ZqTrgiJFckB%2BQ%40mail.gmail.com.
> For more options, visit https://groups.google.com/d/optout.

Quinn Dougherty

unread,

May 29, 2024, 3:09:59 PM5/29/24

to safe-by-design

> bootstraps via nix and precisely this method.

you mean that you'd like to see a flake.lock derived method for MVP, but think that in the grand scheme of things something else would be needed?

There's a compelling vision in unison re granularity and version controlling individual functions! At that point I certainly agree that forcing everything into git and lockfiles would be ergonomically sad. Still, this seems like the less crucial half of the problem-- "quantitative diffs", diffs that capture the higher level things than the model, is a much larger question mark in my brain. Again, squiggle is a great start at reconciling these issues, but I wonder how much further we could go.

It seems conceptually straightforward to make AST-diff that shows you a node based view instead of a line based view. This may be worth developing

> "creation and deletion of entities/laws."

For example, in SLEEC rules, concerns, and purposes are simply nodes in the language's AST! so you could have a structured diff that draws your attention to the changes you care about

Overall I think I agree with you about room for improvement over git and lockfiles, I just think for unambitious and tractable MVP, PoC, we don't want to worry too much.

Quinn Dougherty

unread,

Jun 7, 2024, 7:15:50 PM6/7/24

to Owen Lynch, safe-by-design

There's more progress on syntactic diffs than I thought https://github.com/afnanenayet/diffsitter upon thinking about it I realized it was technically feasible, at the same time someone at manifest pointed me toward this list: https://difftastic.wilfred.me.uk/tree_diffing.html

The next question is semantic diffs or diffs over equivalence classes of ASTs

On Tue, Jun 4, 2024, 11:21 AM Owen Lynch <ow...@topos.institute> wrote:

> Overall I think I agree with you about room for improvement over git and
> lockfiles, I just think for unambitious and tractable MVP, PoC, we don't
> want to worry too much.

I agree... but I am of the opinion that the time to start thinking and
working on theory that doesn't go into the first MVP is actually quite a
bit earlier than one might expect, and I think that if we don't get this
right in time for the second or third MVP, it's going to be a *huge*
albatross for the entire program. So that's why I want to work on it now!

Another argument for it not going into the first MVP is that we should
get a good baseline of "what does this look like *without* a new theory
of version control," which will provide justification for why we need a
new theory of version control XD

davidad (David A. Dalrymple)

unread,

Jun 7, 2024, 9:39:14 PM6/7/24

to safe-by-design

I agree with Owen's and Ozzie's points!

I can already see a couple paths to doing much better than ASTs: doing rewriting in ACSets. Very roughly, each item of an ACSet can be annotated with a subset of versions in which that item appears (with some coherence conditions, e.g. it should be causally path-connected). A "diff" is then roughly a span of ACSets whose apex is the overlap between versions, with the first leg denoting deletions and the second leg denoting additions — but this span can be annotated with various kinds of information (all encoded as other ACSets and ACSet morphisms) that elaborate "what rewrite" is intended here. These annotations should enable somewhat semantically-aware rebasing and merging.

The references here are also relevant.

ACSet rewriting is still syntactic in some deeper way, even though the syntax is much more richly structurable than lines of text. Computation with semantic equivalence classes is best handled by egraphs. A highly advanced form of merging would search for specific representatives from each of the two leaves being merged such that applying the other leaf's path of rewrite rules from the least common ancestor result in members of the same equivalence class. I don't think this level of cleverness in the merge algorithm is going to be important for my programme, but the underlying data structures problem (unifying egglog and ACSets) might be worth solving anyway because it's likely very useful for partial evaluation (generating highly optimized executable versions of models).

Incidentally, representing models as ACSets could be very good for incremental compilation because they essentially fit into the ontology of Datalog and Datalog is suited for incremental type-checking. Also of interest (and new since I wrote that github README!): Ascent/BYODS, "Wrapping Rings in Lattices", hash chronicles.

Onward,

davidad

--

You received this message because you are subscribed to the Google Groups "safe-by-design" group.
To unsubscribe from this group and stop receiving emails from it, send an email to safe-by-desig...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/safe-by-design/CAGaOgAQoJDHJYBPOWd0PXY0tbtiCqHtR5S1hu8Ume32pkN5kPQ%40mail.gmail.com.

Reply all

Reply to author

Forward