It may be overwhelming to approach CQL without first having used SQL or similar (e.g. data frames), but I’ve tried to respond inline below:
> On Aug 16, 2024, at 4:26 PM, Julius Hamilton <
juliusha...@gmail.com> wrote:
>
> Thanks.
>
> I am working on processing the information in the slides and PDF.
>
> I have a lot of reading to do, and I might need to discuss what I'm learning along the way to sufficiently reinforce it.
>
> I am currently thinking about CQL in the following way:
>
> 1. The "typeside" is where you just declare the fundamental data types you will be using, like Integer and String. But I don't understand if these are technically arbitrary names, or, if you actually need to supply definitions in Java, for example, for CQL to actually carry out various computations.
CQL lets users write any finite equational theory for the typeside. Or, if that isn’t expressive enough, or there are pragmatic considerations, users can instead supply java definitions that implicitly define an infinite equational theory for the type side.
>
> 2. A schema is a category. The objects can be thought of as types, and are called entities. The morphisms can be thought of as relationships, and are called attributes.
A schema is a category whose objects can be divided into two classes: the types, inherited from the type side, whose intended denotation is usually infinite - Integer, String, etc, and the entities, not inherited form the type side, whose intended denotation is usually finite - Employee, Student, etc. A morphism is a function, either from an entity to an entity (called a foreign key) or an entity to a type (called an attribute).
> 3. A "finitely presented category" may have a sophisticated definition, but can be simply understood as a "finite category". That is, you will only be dealing with a finite number of objects or morphisms. However, it could be the case that one object actually represents an infinite number of isomorphic objects, for example. So that category is technically not finite, but it has a "finite presentation", an equivalent form we can map it to.
A finitely presented category need not have a finite number of morphisms, which is why computational category theory is hard. Consider the schema with one object, Person, and one generating morphism, father : Person -> Person. That category has infinitely many morphisms: id, father(id), father(father(id)), and so on.
Finite category theory is computationally intractable, but easy.
>
> 4. "Foreign keys" are just attributes where the target is not a fundamental data type. Thus,
person.name is an attribute since name is of type String, but employee.department is a foreign key, since a department is an entity in the schema, and not a data type in the typeside.
Yes
>
> 5. The path equations are necessary, not "safety checks". They literally specify the composition of morphisms in the category.
Yes, a finite presentation of category contains a set of equations as part of its given structure, and the category denoted by the presentation defines its composition operation in terms of these equations.
There are many easy proofs of this; for example, the word problem for groups has been known to be undecidable for a long time, and you can convert a word problem for a group into a word problem for a category - a group is a category with a single object and every morphism invertible. In fact, these undecidable categories may have small presentations (smallest I’ve seen is 3 equations).
> 7. On a more abstract level, Spivak showed that categorical data has three main operations, \Sigma which is union, \Pi which is product or join, and \Delta which is projection.
https://arxiv.org/pdf/1009.1166
>
To a first approximation; however, Sigma goes beyond union, and requires what database theorists call a “chase algorithm” to compute.
> 8. One of the main uses of representing a schema as a category is that functors become ways to "migrate data between formats".
Yes
>
> So, does that mean, CQL could translate data from SQL, to CSV, to JSON, to Neo4j, etc.? (Depending on if such a translation is mathematically valid.)
Yes, many uses of CQL in practice involve integrating different ‘meta models’, such as SQL, CSV, JSON, XML, Graph, all at the same time.
>
> Apart from being able to translate data between different formats, does categorical data science suggest that there is a "most universal" way to represent any kind of data at all? For example, if someone wanted to create a data model about a person: name, schools attended, age, address, and so on - CQL would represent this info in a way that emphasizes "universal properties", more than SQL would, even though they are mathematical equivalent?
The category theory and database theory textbooks (and I also) say that as a general rule, the best way to represent a data model is to use what CQL calls “constraints” and what database theorists call “embedded, implicational dependences” (
https://dbucsd.github.io/paperpdfs/2008_8.pdf) and what category theorists call “regular logic / lifting problems”. This logic is basically Horn clauses with existential quantifiers added, and is not supported by SQL. It is very well studied - see for example the textbook by Arenas et al, “Data Exchange”.
The reason is that this is the strongest logic that admits the delta/sigma/pi data migration functors; any stronger, and at least one of them disappears. Their disappearance isn’t necessarily bad - our work with Uber uses a logic for which they disappear
https://arxiv.org/pdf/1909.04881 - but this logic is a good starting point to balance ability to migrate vs expressive power within the model.
-Ryan
>
> Thank you,
> Julius
>
>
> On Sunday, August 11, 2024 at 10:58:12 PM UTC-6
wisn...@gmail.com wrote:
> Hi Julius,
>
> Responses inline.
>
> > On Aug 11, 2024, at 8:53 AM, Julius Hamilton <
juliusha...@gmail.com> wrote:
> >
> > I must have been really busy when I wrote the previous emails since I somehow never even read the response thoroughly.
> >
> > I looked over the tutorial and I have some new questions.
> >
> > - It appears that Ryan describes CQL as a “theorem prover”, and I’m wondering why it isn’t billed as such more prominently. It describes itself as a “query language”. Initially, I would have thought that a “query” language could only read, and not write, data, but that doesn’t seem to be the case, as I think I read a key feature of query languages is also the ability to transform data. So I guess a “query language” might be thought of as a “data language”, whose primary function is to manipulate entire data sets at once, rather than single pieces of data?
>
> CQL contains several theorem provers, but they are too specialized to either data migration or category theory to really be considered automated theorem provers in the sense of E, Vampire, Coq, etc, or enter automated proving competitions.
>
> In retrospect calling CQL a query language was a misnomer. The reason is that, in practice, we’ve come to discover that when most people say “query” they mean “compute a limit” whereas what is new in CQL is “computing colimits”. So CQL is really “a logic for data migration”, not query. As for whether or not "query languages" can write in addition to read, that didn’t enter into our thinking.
>
> Pedantically, SQL is distinguished from DDL (data definition language).
>
> >
> > - Does the formal language that is CQL correspond well to a familiar mathematical formal system? Is there a good analogy like “typesides are like ____”, “schemas are like _____” (categories, functors)?
> >
>
> To a first approximation, it is best to think of “CQL without attributes/types”, whose slogan is quite literally, “schemas are finitely presented categories” and “databases are finitely presented functors”.
>
>
> > - Why is the language presented primarily through an IDE repo, rather than the language itself? Could there be a repo just for a compiler/interpreter for the language, just like installing any other programming language, which includes a REPL?
>
> I’m not entirely sure I understand this question. CQL is written is java, and so requires java to be installed wherever it runs. The java jarfile can be run either as a stand alone graphical IDE, or on the command line without the graphical component. It is also possible to access CQL’s underlying API via the java jar - for example, people who are interested in theorem provers can access theorem provers via the java API. If the question is, “why isn’t CQL available inside of Microsoft visual studio code / <IDE X Y Z>”, the answer is simply, no one has tried to do it.
>
> But perhaps your question is a lot like asking, “why don’t developers interact with Oracle through its API rather than SQL”? Then answer is, the SQL language is a better abstraction to get work done in Oracle than the Oracle API is. Similarly for CQL; although all of CQL’s functions are available directly in java, scripting them together in java is a lot more verbose, and a lot less pleasant, than writing CQL code.
>
> >
> > > In fact we are ramping up; let me know if you might be interested in collaborating and/or a job at Conexus AI.
> >
> > It pains me to admit I have been getting by doing various jobs while trying to cram in as much study of mathematics in my free time as possible while also trying to collaborate with any software development projects. I should have seen this email earlier. That said, maybe I’m not yet knowledgeable enough to assist with the project, except in a low-level role. My only good language is Python, and I know some JavaScript. I’m working on building a portfolio of projects on my GitHub. I’m pretty good at writing though, and I sort of was a participant in an AI startup kind of interacting with people in a Discord server. I might be able to help with writing technical documentation, web design or community outreach. I actually came here today to ask a few questions about the tutorial but to also mention that there doesn’t appear to be a lot of explanation of what CQL actually does on the homepage, which could be of benefit.
> >
> > > CQL will emit Coq code from olog definitions. (And TPTP code).
> > Can you provide sample code that does this?
>
> I’ve attached a picture of Coq generation; TPTP code is only generated for Mappings.
>
>
> --
> You received this message because you are subscribed to the Google Groups "Categorical Data" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
categoricalda...@googlegroups.com.
> To view this discussion on the web visit
https://groups.google.com/d/msgid/categoricaldata/42e162c8-b67c-4d9e-a4a9-9b08ac2409a7n%40googlegroups.com.