Languages

Yuri de Wit

Feb 3, 2025, 2:59:59 PM
to Categorical Data
I’m finally taking a closer look at the CQL GitHub repository, and I see that it supports multiple “languages,” namely “EASIK,” “Sketch,” “MPL,” “CQL” (the default), and “CQL_ALT.”

Should I primarily focus on the “CQL” language when considering the research presented on the website?

Also, are these other languages mainly experimental or historical artifacts that I can disregard? It seems they are distinct and do not interact with each other.

Thanks,



Ryan Wisnesky

Feb 3, 2025, 6:04:34 PM
to categor...@googlegroups.com
That’s correct. CQL is the only language a user would really care about in the IDE. There was a time the IDE supported older versions of CQL, which have since been spun off into a separate GitHub project (“old FQL”), and each such version was a “language” (FQL, FQL++, FPQL, and OPL are all mentioned in the research literature). There’s a rich space of language design trade-offs in functorial data migration.

EASIK isn’t so much a language as a separate, interoperable tool accessible from the Tools menu (it’s based on Prof. Rosebrugh’s theory of “sketches”). It has its own file format, which is “registered” to the IDE via the “language” mechanism. Within EASIK there are buttons to convert EASIK to CQL and vice versa. “Sketch” is a technical artifact of EASIK having two different file formats.

CQL_ALT is for experiments, and MPL is a very minimal language for writing “monoidal” categories, which are like the wiring diagrams that come up occasionally at the edges of what CQL can express. Mostly it is used to generate figures/images of complex diagrams.

Yuri de Wit

Feb 5, 2025, 2:07:05 PM
to Categorical Data
Thanks!


> That’s correct. CQL is the only language that a user would really care about in the IDE. There was a time the IDE supported older versions of CQL, which have now
> been spun off into a separate project in GitHub (“old FQL”), and each such version was a “language” (FQL, FQL++, FPQL, OPL, are all mentioned in the research
> literature).

Regarding the apg_* syntax in the IDE (correct me if I’m wrong): after reviewing many of the papers listed on the website in chronological order, it seems there is a later algebraic perspective that extends and refines the functorial one. I assume the apg_* syntax implements that perspective?

> There’s a rich space of language design trade-offs in functorial data migration.

One aspect I find particularly interesting, beyond reasoning and inference, is the ability to handle “invalid” or “incomplete” data.

Another (perhaps niche) aspect that interests me—and which I did not see explicitly mentioned in the papers—is the ability to infer a schema from an instance. Representing “invalid” instances seems to be a prerequisite for that.

Yuri de Wit

Feb 5, 2025, 6:01:54 PM
to Categorical Data
I was too hasty ... I just came across what I mentioned above in the Algebraic Data Integration paper: "(co-)pivoting (converting instances into schemas)".

Ryan Wisnesky

Feb 5, 2025, 7:15:27 PM
to categor...@googlegroups.com
Hi Yuri,

I forgot about APG, which is indeed a dialect of CQL that we worked on with Uber: https://arxiv.org/abs/1909.04881 . You can think of it as what happens when you extend CQL with coproducts: you can model more things, but as a result you have only one way to move data along a mapping, rather than three. This project was picked up by Microsoft, which continues to work with us on it here: https://github.com/categoricalData/hydra

The literature is large, but I see you found the “category of elements”, or “Grothendieck” construction to convert instances to schemas. In the built-in Petri example, an instance on a simple schema for Petri nets turns into a schema representing the possible valid executions of the net; that’s probably the coolest example I've seen. We also often take RDF triples and turn them into a schema.
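To make the “instances into schemas” idea concrete, here is a toy sketch of my own (not CQL’s actual code, and the triple data is made up): each element appearing in a set of RDF-style triples becomes an object of the induced schema, and each triple becomes an arrow, in the spirit of the category-of-elements construction.

```python
# Toy "pivot" in the spirit of the category-of-elements construction:
# instance elements become schema objects, triples become schema arrows.
# This is an illustrative sketch, not CQL's implementation.

def pivot_triples(triples):
    """triples: iterable of (subject, predicate, object) ground triples.
       Returns (objects, arrows) of the induced schema."""
    objects = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    # each triple (s, p, o) induces one arrow, named per source element
    arrows = [(f"{p}_{s}", s, o) for s, p, o in triples]
    return objects, arrows

objs, arrs = pivot_triples([
    ("alice", "knows", "bob"),
    ("bob", "worksAt", "acme"),
])
# objs == ["acme", "alice", "bob"]
# arrs == [("knows_alice", "alice", "bob"), ("worksAt_bob", "bob", "acme")]
```

In the Petri-net example mentioned above, the same idea is what turns the rows of a net’s instance into the objects of a schema of executions.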

The usual process for dealing with dirty data is to consider it on a schema without data integrity, write down the data integrity you want it to have as CQL constraints, and then use the CQL chase algorithm to repair the data. Probabilities can be attached to attributes representing confidences, and conditions on those probabilities used in the rules (if X and Y have the same name with 90% probability, put X and Y into the Match table, etc).
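The rule-then-repair pattern described above can be sketched in a few lines of Python (my own illustration, not CQL’s chase; the records, confidences, and the 0.9 threshold are all hypothetical): a rule populates a Match table when the confidence clears the threshold, and an EGD-style step then equates the matched records.

```python
# Hand-rolled sketch of "constrain, match, then repair" (not CQL's chase).
# Hypothetical dirty data: record ids with names and pairwise confidences.

records = {1: "J. Smith", 2: "John Smith", 3: "A. Jones"}
confidence = {(1, 2): 0.93, (1, 3): 0.10, (2, 3): 0.05}

# Rule firing: same-name confidence >= 0.9 puts the pair into Match.
match = {pair for pair, p in confidence.items() if p >= 0.9}

# EGD-style repair: matched ids are equated via union-find.
parent = {r: r for r in records}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

for a, b in match:
    parent[find(a)] = find(b)

canonical = {r: find(r) for r in records}
# records 1 and 2 collapse to a single entity; record 3 stays separate
```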

Ryan


Yuri de Wit

Feb 5, 2025, 8:52:07 PM
to Categorical Data
> I forgot about APG, which is indeed a dialect of CQL that we worked on with Uber: https://arxiv.org/abs/1909.04881 . You can think of it as what happens when you extend CQL with co-products: you can model more things, but as a result you only have one way to move data along a mapping, rather than three. This project was picked up by Microsoft who continues to work with us on it here: https://github.com/categoricalData/hydra

Ok. I think I understand what you mean. With coproducts we have a richer language for describing schemas (see the ride-sharing schema), at the expense of a more constrained migration path.
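A small sketch of that trade-off in my own (hypothetical) terms, not the paper’s actual ride-sharing schema: once a field is a tagged sum, every migration of the data has to case-split on the tag.

```python
# Illustrative only: a coproduct field in a schema, modeled as a tagged
# union. The Driver/Rider shapes are made up for this example.

from dataclasses import dataclass
from typing import Union

@dataclass
class Driver:
    license_no: str

@dataclass
class Rider:
    payment_id: str

# A trip participant is Driver + Rider: a coproduct of the two entities.
Participant = Union[Driver, Rider]

def describe(p: Participant) -> str:
    # moving data along a mapping must now branch on the injection tag,
    # which is one intuition for why fewer migration operations survive
    if isinstance(p, Driver):
        return f"driver {p.license_no}"
    return f"rider {p.payment_id}"
```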

I also see that more is coming from the categorical perspective (praqueries/proqueries, profunctors, polynomial functors, ...), but I am still digesting papers from 2017 :-). Is the research basically focused on finding richer categorical structures to represent richer schema languages, while trying to preserve the same properties or more?

> In the built-in Petri example, an instance on a simple schema for Petri nets turns into a schema representing the possible valid executions of the net; that’s probably the coolest example I've seen

This is indeed a very cool example!

Thanks for your time answering my questions.

Yuri de Wit

Feb 5, 2025, 9:06:50 PM
to Categorical Data

FYI:

One interesting connection mentioned in the 'Algebraic Data Integration' paper is between the Σ operation and the chase: "there is a semantic similarity between our Σ operation and the chase".

This reminded me of a paper I came across recently, 'Semantic foundations of equality saturation': "In this paper, we define a fixpoint semantics of equality saturation based on tree automata and uncover deep connections between equality saturation and the chase".

Is there a semantic similarity between e-graphs and the Σ operation? My impression, though, was that the Knuth-Bendix algorithm is enough for the current expressivity of schemas in the functorial data model, and that something like e-graphs is not needed (maybe not yet?).

Ryan Wisnesky

Feb 6, 2025, 3:14:00 PM
to categor...@googlegroups.com
Hi Yuri,

Thank you for bringing that paper to my attention. That paper, and our paper ‘Fast Left Kan extensions using the chase’ (https://arxiv.org/abs/2205.02425, which shows how to implement Σ using the chase, thereby resolving the open question from the 2017 paper), both give algorithms to compute algebraic databases (databases defined by equational rules). They are very closely related.

As that paper shows, equality saturation is weaker than the chase and weaker than equational theorem proving (equality saturation only solves the “ground” word problem). I believe the folks at AlgebraicJulia have implemented equality saturation to compute colimits.

In CQL we use theorem provers (such as Knuth-Bendix completion) for colimits and functoriality checks (which require the non-ground word problem) and expose an “EGD-fair” chase algorithm for everything else; both techniques go beyond equality saturation.

I am hopeful, as is that paper’s author, that chase-based techniques can be imported into equality saturation to make it go faster. Under the hood, for computing limits (not colimits), CQL actually exploits SQL, because SQL is so good at indexing, optimizing, etc. In fact, I suspect the chase will eventually power many algorithms, not just equality saturation and left Kan extensions.
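To illustrate the “ground word problem” distinction above, here is a toy congruence-closure procedure of my own (not CQL’s prover): it decides equality between ground terms, closing asserted equations under congruence, which is essentially what an e-graph maintains. It cannot handle equations with variables, which is what Knuth-Bendix-style completion addresses.

```python
# Naive congruence closure over a finite set of ground terms (tuples like
# ('a',) or ('f', sub)). Illustrative sketch only; real e-graph
# implementations are far more efficient.

def ground_equal(terms, eqs, s, t):
    """terms: all ground subterms in play; eqs: asserted ground equations.
       Returns True iff s = t follows by congruence closure."""
    parent = {u: u for u in terms}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for l, r in eqs:
        union(l, r)
    changed = True
    while changed:  # close under congruence: x = y implies f(x) = f(y)
        changed = False
        for u in terms:
            for v in terms:
                if (len(u) == 2 and len(v) == 2 and u[0] == v[0]
                        and find(u[1]) == find(v[1]) and find(u) != find(v)):
                    union(u, v)
                    changed = True
    return find(s) == find(t)

a, b = ("a",), ("b",)
fa, fb = ("f", a), ("f", b)
# from the ground equation a = b, congruence gives f(a) = f(b)
# ground_equal([a, b, fa, fb], [(a, b)], fa, fb) -> True
```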

Ryan

Yuri de Wit

Feb 13, 2025, 12:05:32 PM
to Categorical Data
I am glad it was helpful! Thanks for the other insights.