Incremental ologs


Yuri de Wit

Feb 13, 2025, 12:19:39 PM
to Categorical Data
My worldview has always been very “operational,” but category theory is upending that perspective and opening new doors of understanding. At the same time, I’m deepening my knowledge of category theory while exploring topics such as ologs and CQL, which sometimes lead to interesting dead ends.

For example, I’m very interested in incremental computation and would like to understand what an incremental version of ologs/CQL would look like. Would the same category-theoretic machinery apply, or are we dealing with an entirely different set of constructions?

My intuition is that the finitely presented category for schemas (and the related machinery for attributes) would change very little, if at all, whereas the C-set instance might change significantly. Since I’m considering a stream of deltas with delta and integration operations, I suspect that we might be dealing with something closer to an Abelian group (i.e., C-Ab) rather than just sets. Do the existing notions of (co)limits and pullbacks work the same way at the abstract categorical level?
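To make the C-Ab intuition concrete, here is a minimal sketch (hypothetical, not CQL) of deltas over a single entity as elements of the free abelian group on row identifiers: a `Counter` with integer multiplicities, where +1 means "insert row" and -1 means "delete row".

```python
from collections import Counter

# Hypothetical sketch, not CQL: a delta is a Counter mapping row ids to
# integer multiplicities (+1 = insert, -1 = delete).

def compose(d1, d2):
    """Group operation: pointwise sum of multiplicities (dropping zeros)."""
    out = Counter(d1)
    for row, n in d2.items():
        out[row] += n
    return Counter({row: n for row, n in out.items() if n != 0})

def inverse(d):
    """Group inverse: an undo delta."""
    return Counter({row: -n for row, n in d.items()})

def integrate(deltas):
    """A snapshot is the sum of all deltas seen so far."""
    snapshot = Counter()
    for d in deltas:
        snapshot = compose(snapshot, d)
    return snapshot

log = [Counter({"alice": 1}), Counter({"bob": 1}), Counter({"alice": -1})]
print(integrate(log))  # Counter({'bob': 1})
```

This is exactly the abelian-group structure the question gestures at: every delta has an inverse, composition is commutative, and integration is just summation.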

I apologize if my question is a bit abstract—how should I start thinking about this from a category theory perspective?

Yuri de Wit

Feb 13, 2025, 12:41:48 PM
to Categorical Data
To elaborate on my idea, I envision the schema instance as a stream of transactions, where each transaction is a delta applied to one or more entities (i.e., a collection of deltas across multiple entities). At any given time, a database snapshot is the cumulative result of all deltas up to that point, although for now I’m primarily interested in regenerating only the latest snapshot.

Could we model each transaction delta as an object, with morphisms (perhaps in an Abelian group context) representing their sequencing (linear for now, but potentially tree-shaped)? Furthermore, how might data migration work over this structure, which is one level removed from the typical C-set functor representing a snapshot of the db at a certain point in time?

In essence, this is an append-only database where queries and migrations are also incremental and can process transactions sequentially.
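The append-only picture described above can be sketched as a left fold over the transaction log (a toy illustration under the multiplicity encoding; all names are hypothetical):

```python
# Hypothetical sketch: an append-only log of transactions, where each
# transaction carries deltas for one or more entities, and the latest
# snapshot is the left fold of the whole log.

def apply_txn(snapshot, txn):
    """Apply one transaction: txn maps entity -> {row: +1 insert / -1 delete}."""
    out = {e: dict(rows) for e, rows in snapshot.items()}
    for entity, delta in txn.items():
        rows = out.setdefault(entity, {})
        for row, n in delta.items():
            m = rows.get(row, 0) + n
            if m == 0:
                rows.pop(row, None)
            else:
                rows[row] = m
    return out

def replay(log):
    """Regenerate only the latest snapshot from the full transaction log."""
    snapshot = {}
    for txn in log:
        snapshot = apply_txn(snapshot, txn)
    return snapshot

log = [
    {"Person": {"p1": 1}, "Dept": {"d1": 1}},
    {"Person": {"p2": 1}},
    {"Person": {"p1": -1}},
]
print(replay(log))  # {'Person': {'p2': 1}, 'Dept': {'d1': 1}}
```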

Ryan Wisnesky

Feb 13, 2025, 12:58:49 PM
to categor...@googlegroups.com
There’s good news and bad news: the good news is you can work all this out formally. The bad news is, the semantics of algebraic databases are exactly that: entire databases.

It is possible to define a situation where a single row change to an input CQL database results in an unbounded, or even infinite, number of row changes to an output database. To operate, CQL’s delta and pi need “all the data,” in the same sense that an SQL join does. And indeed, we have worked out the “drift” that occurs when you try to simulate a Pi on an entire database using a Pi on a sub-database and a differential database, in much the same way that joins on stream data have a quantifiable drift depending on their window size. But we can still compose: Pi_F(Pi_G(I)) = Pi_{G∘F}(I), etc., which might be an alternative depending on what you’re doing.
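The point that a single input row change can force many output changes shows up already with a plain cross join (toy Python, no CQL involved): the size of the output delta depends on "all the data" on the other side of the join.

```python
# Toy illustration (not CQL): adding one row to A changes |B| rows of
# the product A x B, so the output delta size depends on all of B.

def product(a, b):
    return [(x, y) for x in a for y in b]

a, b = ["a1", "a2"], ["b1", "b2", "b3"]
before = product(a, b)
after = product(a + ["a3"], b)
new_rows = [row for row in after if row not in before]
print(len(new_rows))  # 3 -- one input row produced |B| = 3 output rows
```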

My primary thought is that, following tradition, to really handle streams, especially infinite ones, a person would want co-algebraic databases. And in fact, David Spivak has a starting point for this line of work: in the category “Poly”, comonoids are exactly categories, where you think of them as infinite tree-shaped streams of morphisms.

Yuri de Wit

Feb 13, 2025, 6:42:27 PM
to Categorical Data
One caveat regarding the term “stream” is that it isn’t truly infinite—it’s a sequence of transactions containing deltas from time zero up to now (or until the last transaction). Therefore, computing a join would always yield a valid result at each point in the sequence, rather than an approximation. The idea is that the query or transformation never ends; it simply pauses while waiting for the next transaction. Some orthogonal caching might be needed to reduce cold starts, but that is a secondary concern.

The schema would function exactly as it does today in ologs/CQL: with entities, relationships/foreign keys, and attributes. The instance would be a sequence of transactions, where each transaction contains a set of changes for one or more entities in the schema. This sequence of deltas is append-only. The traditional C-set as the schema instance would still be useful and necessary for queries that interface with clients seeking an integrated or materialized view. For data migrations between different databases, transforms would take an input stream of deltas from the source schema and output another stream of deltas in the target schema.
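The migration-as-delta-stream idea above can be sketched as a transaction-by-transaction transform (hypothetical names; a real CQL migration is given by a schema mapping, whereas here a plain entity-renaming map stands in for it):

```python
# Hypothetical sketch: a "migration" given by a map from source entities
# to target entities, lifted pointwise to the delta stream -- each
# incoming transaction yields one outgoing transaction in the target schema.

def migrate_txn(entity_map, txn):
    out = {}
    for src_entity, delta in txn.items():
        tgt = entity_map[src_entity]
        merged = out.setdefault(tgt, {})
        for row, n in delta.items():
            merged[row] = merged.get(row, 0) + n
    return out

def migrate_stream(entity_map, log):
    for txn in log:          # never "ends": it just waits for the next txn
        yield migrate_txn(entity_map, txn)

entity_map = {"Employee": "Person", "Contractor": "Person"}
log = [{"Employee": {"e1": 1}}, {"Contractor": {"c1": 1}}]
print(list(migrate_stream(entity_map, log)))
# [{'Person': {'e1': 1}}, {'Person': {'c1': 1}}]
```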

How could this sequence of deltas (transaction log) be represented as a category? I imagine there would need to be a map (or functor) from a schema-based delta to a transaction log category.

Or do you think that this should still be handled co-algebraically (Poly)?

Ryan Wisnesky

Feb 14, 2025, 2:44:39 AM
to categor...@googlegroups.com
Mathematically, even CQL’s databases can be infinite; for example, on the schema with an entity person and foreign key person -> person, any non-empty database will have infinitely many people.  So infinite instances can be tough to avoid.
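The rows of that free instance can be enumerated lazily; with one generator and no equations, the foreign key keeps minting fresh people forever (a toy sketch; the generator and foreign-key names are made up):

```python
from itertools import islice

def free_rows(generator="Bob", fk="boss"):
    """Lazily enumerate rows of the free instance on the schema
    Person --boss--> Person with one generator and no equations:
    Bob, boss(Bob), boss(boss(Bob)), ..."""
    term = generator
    while True:
        yield term
        term = f"{fk}({term})"

print(list(islice(free_rows(), 3)))
# ['Bob', 'boss(Bob)', 'boss(boss(Bob))']
```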

But anyway, perhaps the situation you are describing could be modeled using morphisms of databases; an incremental computation of a database X starting from a database Y might be modeled as a series of morphisms Y -> … -> A -> … -> X through some intermediate database A.  Certainly a “chase sequence” has this property, as do other operations.  

David Spivak

Feb 14, 2025, 4:59:01 PM
to categor...@googlegroups.com
> any non-empty database will have infinitely many people
That's not quite right, is it? Couldn't I have 
Bob : Person
f(Bob) = Bob
?

Ryan Wisnesky

Feb 14, 2025, 7:17:47 PM
to categor...@googlegroups.com
That’s a good point; I meant a free database (no equations): any free database on the circular schema with at least one generator will have infinitely many rows.
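The distinction is easy to see operationally: imposing David's equation as a rewrite rule collapses the infinite free instance to a single row (a toy sketch; the term syntax and names are made up, and naive string rewriting stands in for real congruence closure):

```python
# Toy sketch: the equation boss(Bob) = Bob, oriented left-to-right,
# normalizes every term of the free instance down to "Bob".

def normalize(term, rules):
    """Rewrite to a normal form using the given oriented equations."""
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules.items():
            if lhs in term:
                term = term.replace(lhs, rhs)
                changed = True
    return term

rules = {"boss(Bob)": "Bob"}
terms = ["Bob", "boss(Bob)", "boss(boss(Bob))"]
print({normalize(t, rules) for t in terms})  # {'Bob'}
```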

Yuri de Wit

Feb 24, 2025, 7:51:29 PM
to Categorical Data
Aside from the incremental part, I would like to give special emphasis to the evolution of the same schema, in addition to data integration/migration across separate schemas (other 'points of view'). My understanding is that this just uses the same capabilities of CQL but with a different focus. The C-set functor F : C -> Set would become an indexed functor F_{i,j} : C_{i} -> Set_{i,j}, where i is the schema version and j is the data version. These indexed functors would be objects in another category D, and the morphisms would be the version changes from one functor to another.

Another important aspect is the composition of queries, where a query is a data migration from zero or more source schemas and zero or more source queries. Queries should also be indexed, since they have their own schema and data: Q_{i,j}. Queries depend on other queries in addition to other indexed functors F_{i,j}. This composition of queries is very important, especially tracking the dependency tree among them, since it brings the notion of spreadsheets to databases. I am not sure whether this should be tracked in the same category D (two types of objects and two types of morphisms) or in a separate category for dependency tracking.
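The spreadsheet-like behaviour sketched above amounts to recomputing only the queries downstream of a change, in dependency order. A minimal sketch with a hypothetical dependency graph (node names invented), using the standard-library `graphlib`:

```python
# Hypothetical sketch: queries as nodes in a dependency DAG; when an
# input instance changes, only its downstream queries are recomputed,
# in topological order -- the "spreadsheet" behaviour.

from graphlib import TopologicalSorter

deps = {              # node -> set of nodes it depends on
    "Q1": {"F"},
    "Q2": {"F", "Q1"},
    "Q3": {"G"},
}

def affected(changed, deps):
    """All queries downstream of a changed node, in recomputation order."""
    order = list(TopologicalSorter(deps).static_order())
    dirty = {changed}
    for node in order:
        if deps.get(node, set()) & dirty:
            dirty.add(node)
    return [n for n in order if n in dirty and n != changed]

print(affected("F", deps))  # ['Q1', 'Q2'] -- Q3 only depends on G
```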

This tracking of schema/data evolution plus query dependencies is what seems to be extra here.

Ryan Wisnesky

Feb 25, 2025, 1:42:56 AM
to categor...@googlegroups.com
I feel like maybe you are describing the idea of a CQL program itself? It is a directed acyclic graph in which the nodes are schemas, instances, queries, etc., the edges are dependencies, and within which people incrementally construct the desired outputs through a series of intermediate steps. You can visualize this graph on the “Summary” slide of the CQL viewer by clicking the “dependency graph” button, and there are a couple of other graphs besides.

What is the best way to formalize a CQL program in CQL so that it can be “meta-programmed”? I’m not sure, but categories, functors, etc., can be axiomatized in CQL (see the ‘Meta’ example), and doing so as a CQL typeside is open issue #3 in the CQL repo. You can even program CQL using lambda-calculus notation; see the extra slides of https://categoricaldata.net/cql/lambdaconf.pdf

On a side note, there’s also a few different ways to import spreadsheets into CQL while preserving all the formulae; one way is here: https://arxiv.org/abs/2209.14457

Yuri de Wit

Feb 25, 2025, 11:26:36 AM
to categor...@googlegroups.com
After reading your reply, it seems that I need to stop reading papers :) and get my hands dirty by using CQL and digging into the source code.

Thanks for your help so far!
