Hello

Julius Hamilton

Mar 28, 2024, 12:37:15 PM
to Categorical Data
Hello everyone,

I am curious to begin exploratively using CQL.

I noticed the CQL IDE GitHub repo has not been updated for about 3 years. I was curious if it is considered to be in a "stable" long-term state, or if the developers have moved on to other projects?

I am also curious about the choice of Java as the development language - I'm sure it's an excellent choice, just curious to learn if the developers chose it for a specific reason.

I am also curious if people think Coq is already well set up to do similar things to CQL - I am considering trying to write my own ologs and similar in Coq.

Thanks,
Julius Hamilton

Ryan Wisnesky

Mar 28, 2024, 1:23:16 PM
to categor...@googlegroups.com
Hi Julius,

CQL is indeed stable, and development work has shifted to commercialization. Which is to say, the choice of available operators in CQL has been sufficient for commercial needs, and now the focus is on creating fast, production-quality implementations of certain parts of CQL, which we keep proprietary and license. In fact we are ramping up; let me know if you might be interested in collaborating and/or a job at Conexus AI.

CQL will emit Coq code from olog definitions. (And TPTP code). Or at least, the version we have internally does, and it is updated much more frequently. I can provide access; lmk. (For example, our internal version of CQL supported this paper: https://arxiv.org/abs/2209.14457). As a proof assistant, certainly Coq can express ologs and prove things about them; however, *computing with* ologs in Coq is much trickier. For example, take a look at Jason Gross’s category theory library in Coq; you’ll find that while it defines left Kan extensions, it does not let you enumerate their elements, as CQL does. This distinction manifests when ologs are “shallowly embedded” into Coq, rather than “deeply embedded”. CQL also contains theorem provers not available in Coq, for example, it can decide equality for any monoidal theory that admits a non-length increasing equivalent system.

Re: Java, this was mostly for compatibility reasons, as well as speed reasons. Using JDBC, CQL can connect to pretty much any SQL system. And Java’s optimization and profiling tools were necessary to make CQL performant (as well as algorithmic improvements, such as https://arxiv.org/abs/2205.02425).

Happy to chat further,
Ryan
> --
> You received this message because you are subscribed to the Google Groups "Categorical Data" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to categoricalda...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/categoricaldata/1c5070be-d2f2-4a79-bab1-306bddcec554n%40googlegroups.com.

Rich Hilliard

Mar 29, 2024, 8:35:50 PM
to categor...@googlegroups.com
I’d be interested in producing Coq code from CQL (not necessarily ologs).
I assume this is not available in the CQL version (May 2021) which can be downloaded from categoricaldata.net. Will newer versions of CQL be made publicly available?

— Rich

Ryan Wisnesky

Mar 31, 2024, 1:55:43 AM
to categor...@googlegroups.com
Hi All,

The CQL IDE jar file at categoricaldata.net has been updated.  The primary changes:

 - Coq and DOT output for schemas (new tab in viewer)

 - Coq and TPTP output for mappings (new tab in viewer)

 - Right Kan extensions of queries (rext <query> <query> keyword)

 - Pseudo-colimits of schemas (see NewDemoPsuedo example)

 - Learning of queries from pairs of instances (learn <inst> <inst> keyword)

 - Warehousing UI that displays all the formulaic steps in a warehousing scenario (tools->warehousing menu item)

 - Requires java 17 (up from 15)


Left out of this release but available by request:

 - Excel import/export capability (for sheets with formulas)

 - Text/markdown import using NLP

 - Bitsy Tinkerpop support (java class file version issue)

 - SQL query re-writing (“view unfolding”) and validation (proving constraint preservation)


As always, I am eager to talk to anyone about potential applications of CQL or other categorical data technology (such as Hydra, https://github.com/CategoricalData/hydra, an open-source collaboration with LinkedIn/Microsoft currently under heavy development).  I am also eager to talk to functional programmers with a category theory background who are looking for jobs.

-Ryan

Julius Hamilton

Apr 2, 2024, 8:45:21 AM
to Categorical Data
Thanks. I am not sure why I did not see your response. Perhaps I was not subscribed to the mailing list.

I'll follow up soon with more of a response. Thanks.

Julius

Rich Hilliard

Apr 2, 2024, 9:04:27 PM
to categor...@googlegroups.com
Thanks, Ryan!

Sent from the Rh phone. 


Julius Hamilton

Aug 11, 2024, 11:53:09 AM
to Categorical Data
I must have been really busy when I wrote the previous emails since I somehow never even read the response thoroughly.

I looked over the tutorial and I have some new questions.

- It appears that Ryan describes CQL as a “theorem prover”, and I’m wondering why it isn’t billed as such more prominently. It describes itself as a “query language”. Initially, I would have thought that a “query” language could only read, and not write, data, but that doesn’t seem to be the case, as I think I read a key feature of query languages is also the ability to transform data. So I guess a “query language” might be thought of as a “data language”, whose primary function is to manipulate entire data sets at once, rather than single pieces of data?

- Does the formal language that is CQL correspond well to a familiar mathematical formal system? Is there a good analogy like “typesides are like ____”, “schemas are like _____” (categories, functors)?

- Why is the language presented primarily through an IDE repo, rather than the language itself? Could there be a repo just for a compiler/interpreter for the language, just like installing any other programming language, which includes a REPL?

> In fact we are ramping up; let me know if you might be interested in collaborating and/or a job at Conexus AI.

It pains me to admit I have been getting by doing various jobs while trying to cram in as much study of mathematics in my free time as possible while also trying to collaborate with any software development projects. I should have seen this email earlier. That said, maybe I’m not yet knowledgeable enough to assist with the project, except in a low-level role. My only good language is Python, and I know some JavaScript. I’m working on building a portfolio of projects on my GitHub. I’m pretty good at writing though, and I sort of was a participant in an AI startup kind of interacting with people in a Discord server. I might be able to help with writing technical documentation, web design or community outreach. I actually came here today to ask a few questions about the tutorial but to also mention that there doesn’t appear to be a lot of explanation of what CQL actually does on the homepage, which could be of benefit.

> CQL will emit Coq code from olog definitions. (And TPTP code).

Can you provide sample code that does this?

> Or at least, the version we have internally does, and it is updated much more frequently. I can provide access- lmk.

Sure, that’s generous, please do.

> This distinction manifests when ologs are “shallowly embedded” into Coq, rather than “deeply embedded”. CQL also contains theorem provers not available in Coq, for example, it can decide equality for any monoidal theory that admits a non-length increasing equivalent system.

Would be interested to understand this better.

As for my questions about the tutorial which is why I originally came back here:

- If I understand correctly, a “typeside” is just a simple aliasing process. In the tutorial example case, they establish convenient identifiers for strings of Java code. Does this mean that CQL is commonly used as a “meta”-programming language, which could interface with any programming language? Is there a way to check that the types you declare in CQL, like `plus : Integer \times Integer`, are ‘valid’ for the literal translation into Java’s “+”, and so on? Or is a typeside always meant to be written in Java?

- As for a schema: “entities” makes sense to me - the declaration of fundamental types, something I have become familiar with in my own stabs at formal modeling. It seems like “foreign keys” is just the declaration of function types - so why is it called “foreign keys”? It seems like “path equations” are more like “checks” for “type-safety” - not intrinsically necessary to the functioning of the program, but I’m not sure. Perhaps CQL requires them?

That’s all for now. Thanks,

Julius

Ryan Wisnesky

Aug 12, 2024, 12:58:12 AM
to categor...@googlegroups.com
Hi Julius,

Responses inline.

> On Aug 11, 2024, at 8:53 AM, Julius Hamilton <juliusha...@gmail.com> wrote:
>
> I must have been really busy when I wrote the previous emails since I somehow never even read the response thoroughly.
>
> I looked over the tutorial and I have some new questions.
>
> - It appears that Ryan describes CQL as a “theorem prover”, and I’m wondering why it isn’t billed as such more prominently. It describes itself as a “query language”. Initially, I would have thought that a “query” language could only read, and not write, data, but that doesn’t seem to be the case, as I think I read a key feature of query languages is also the ability to transform data. So I guess a “query language” might be thought of as a “data language”, whose primary function is to manipulate entire data sets at once, rather than single pieces of data?

CQL contains several theorem provers, but they are too specialized to either data migration or category theory to really be considered automated theorem provers in the sense of E, Vampire, Coq, etc, or enter automated proving competitions.

In retrospect calling CQL a query language was a misnomer. The reason is that, in practice, we’ve come to discover that when most people say “query” they mean “compute a limit” whereas what is new in CQL is “computing colimits”. So CQL is really “a logic for data migration”, not query. As for whether or not "query languages" can write in addition to read, that didn’t enter into our thinking.

Pedantically, SQL is distinguished from DDL (data definition language).

>
> - Does the formal language that is CQL correspond well to a familiar mathematical formal system? Is there a good analogy like “typesides are like ____”, “schemas are like _____” (categories, functors)?
>

To a first approximation, it is best to think of “CQL without attributes/types”, whose slogan is quite literally, “schemas are finitely presented categories” and “databases are finitely presented functors”.
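
The slogan can be sketched in a few lines of Python. This is only an illustration under assumptions: the toy schema (Employee, Department, worksIn, manager), the rows, and the helper names are all invented here and are not CQL's own data structures or API. A schema is a set of generators plus path equations; a database assigns a finite set to each entity and a function to each foreign key, and it is a functor exactly when typing and the path equations are respected.

```python
# Toy schema presentation (all names are hypothetical, chosen for illustration).
entities = {"Employee", "Department"}
foreign_keys = {
    "worksIn": ("Employee", "Department"),
    "manager": ("Employee", "Employee"),
}
# Each equation says two paths starting at the same entity are equal:
# an employee's manager works in the same department as the employee.
path_equations = [
    ("Employee", ["manager", "worksIn"], ["worksIn"]),
]

# A database (instance): one finite set per entity, one function per foreign key.
rows = {"Employee": {"alice", "bob"}, "Department": {"math"}}
functions = {
    "worksIn": {"alice": "math", "bob": "math"},
    "manager": {"alice": "alice", "bob": "alice"},
}

def follow(path, x):
    """Apply the foreign keys in `path` left to right, starting at row x."""
    for fk in path:
        x = functions[fk][x]
    return x

def is_functor():
    """Check that the instance respects typing and every path equation."""
    for fk, (src, dst) in foreign_keys.items():
        table = functions[fk]
        # Each foreign key must be a total function from src rows to dst rows.
        if set(table) != rows[src] or not set(table.values()) <= rows[dst]:
            return False
    return all(
        follow(lhs, x) == follow(rhs, x)
        for ent, lhs, rhs in path_equations
        for x in rows[ent]
    )

print(is_functor())
```

Moving bob to a different department than his manager would make the check fail, which is the sense in which the equations are part of the schema rather than optional annotations.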


> - Why is the language presented primarily through an IDE repo, rather than the language itself? Could there be a repo just for a compiler/interpreter for the language, just like installing any other programming language, which includes a REPL?

I’m not entirely sure I understand this question. CQL is written in Java, and so requires Java to be installed wherever it runs. The Java jarfile can be run either as a standalone graphical IDE, or on the command line without the graphical component. It is also possible to access CQL’s underlying API via the Java jar - for example, people who are interested in theorem provers can access theorem provers via the Java API. If the question is, “why isn’t CQL available inside of Microsoft Visual Studio Code / <IDE X Y Z>”, the answer is simply, no one has tried to do it.

But perhaps your question is a lot like asking, “why don’t developers interact with Oracle through its API rather than SQL”? The answer is that the SQL language is a better abstraction for getting work done in Oracle than the Oracle API is. Similarly for CQL: although all of CQL’s functions are available directly in Java, scripting them together in Java is a lot more verbose, and a lot less pleasant, than writing CQL code.

>
> > In fact we are ramping up; let me know if you might be interested in collaborating and/or a job at Conexus AI.
>
> It pains me to admit I have been getting by doing various jobs while trying to cram in as much study of mathematics in my free time as possible while also trying to collaborate with any software development projects. I should have seen this email earlier. That said, maybe I’m not yet knowledgeable enough to assist with the project, except in a low-level role. My only good language is Python, and I know some JavaScript. I’m working on building a portfolio of projects on my GitHub. I’m pretty good at writing though, and I sort of was a participant in an AI startup kind of interacting with people in a Discord server. I might be able to help with writing technical documentation, web design or community outreach. I actually came here today to ask a few questions about the tutorial but to also mention that there doesn’t appear to be a lot of explanation of what CQL actually does on the homepage, which could be of benefit.
>
> > CQL will emit Coq code from olog definitions. (And TPTP code).
> Can you provide sample code that does this?

I’ve attached a picture of Coq generation; TPTP code is only generated for Mappings.

Screenshot 2024-08-11 at 9.44.42 PM.png

Julius Hamilton

Aug 16, 2024, 7:26:27 PM
to Categorical Data
Thanks.

I am working on processing the information in the slides and PDF.

I have a lot of reading to do, and I might need to discuss what I'm learning along the way to sufficiently reinforce it.

I am currently thinking about CQL in the following way:

1. The "typeside" is where you just declare the fundamental data types you will be using, like Integer and String. But I don't understand if these are technically arbitrary names, or, if you actually need to supply definitions in Java, for example, for CQL to actually carry out various computations.

2. A schema is a category. The objects can be thought of as types, and are called entities. The morphisms can be thought of as relationships, and are called attributes.

3. A "finitely presented category" may have a sophisticated definition, but can be simply understood as a "finite category". That is, you will only be dealing with a finite number of objects or morphisms. However, it could be the case that one object actually represents an infinite number of isomorphic objects, for example. So that category is technically not finite, but it has a "finite presentation", an equivalent form we can map it to.

4. "Foreign keys" are just attributes where the target is not a fundamental data type. Thus, person.name is an attribute since name is of type String, but employee.department is a foreign key, since a department is an entity in the schema, and not a data type in the typeside.

5. The path equations are necessary, not "safety checks". They literally specify the composition of morphisms in the category.

6. I had never heard of the "word problem" before, and I'm surprised that determining path equivalence in a finitely presented category would be undecidable. https://en.wikipedia.org/wiki/Word_problem_(mathematics)

7. On a more abstract level, Spivak showed that categorical data has three main operations, \Sigma which is union, \Pi which is product or join, and \Delta which is projection. https://arxiv.org/pdf/1009.1166

8. One of the main uses of representing a schema as a category is that functors become ways to "migrate data between formats".

So, does that mean, CQL could translate data from SQL, to CSV, to JSON, to Neo4j, etc.? (Depending on if such a translation is mathematically valid.)

Apart from being able to translate data between different formats, does categorical data science suggest that there is a "most universal" way to represent any kind of data at all? For example, if someone wanted to create a data model about a person: name, schools attended, age, address, and so on - CQL would represent this info in a way that emphasizes "universal properties", more than SQL would, even though they are mathematically equivalent?

Thank you,
Julius

Ryan Wisnesky

Aug 16, 2024, 8:55:47 PM
to categor...@googlegroups.com
It may be overwhelming to approach CQL without first having used SQL or similar (e.g. data frames), but I’ve tried to respond inline below:

> On Aug 16, 2024, at 4:26 PM, Julius Hamilton <juliusha...@gmail.com> wrote:
>
> Thanks.
>
> I am working on processing the information in the slides and PDF.
>
> I have a lot of reading to do, and I might need to discuss what I'm learning along the way to sufficiently reinforce it.
>
> I am currently thinking about CQL in the following way:
>
> 1. The "typeside" is where you just declare the fundamental data types you will be using, like Integer and String. But I don't understand if these are technically arbitrary names, or, if you actually need to supply definitions in Java, for example, for CQL to actually carry out various computations.

CQL lets users write any finite equational theory for the typeside. Or, if that isn’t expressive enough, or there are pragmatic considerations, users can instead supply java definitions that implicitly define an infinite equational theory for the type side.

>
> 2. A schema is a category. The objects can be thought of as types, and are called entities. The morphisms can be thought of as relationships, and are called attributes.

A schema is a category whose objects can be divided into two classes: the types, inherited from the type side, whose intended denotation is usually infinite - Integer, String, etc, and the entities, not inherited from the type side, whose intended denotation is usually finite - Employee, Student, etc. A morphism is a function, either from an entity to an entity (called a foreign key) or an entity to a type (called an attribute).
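
The entity/type split can be made concrete with a small Python sketch. The schema below is hypothetical (its names are invented for illustration and it is not CQL syntax); the only point is that a generating morphism is classified by where its target lives.

```python
# Hypothetical schema for illustration; not CQL syntax or CQL's API.
TYPES = {"String", "Integer"}          # inherited from the typeside
ENTITIES = {"Employee", "Department"}  # declared by the schema itself

# Generating morphisms: name -> (source, target).
MORPHISMS = {
    "worksIn": ("Employee", "Department"),
    "name": ("Employee", "String"),
    "age": ("Employee", "Integer"),
}

def classify(name):
    """Return 'foreign key' or 'attribute' depending on the morphism's target."""
    source, target = MORPHISMS[name]
    if source not in ENTITIES:
        raise ValueError(f"{name}: source must be an entity")
    if target in ENTITIES:
        return "foreign key"
    if target in TYPES:
        return "attribute"
    raise ValueError(f"{name}: unknown target {target}")

for name in MORPHISMS:
    print(name, "is a", classify(name))
```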


> 3. A "finitely presented category" may have a sophisticated definition, but can be simply understood as a "finite category". That is, you will only be dealing with a finite number of objects or morphisms. However, it could be the case that one object actually represents an infinite number of isomorphic objects, for example. So that category is technically not finite, but it has a "finite presentation", an equivalent form we can map it to.

A finitely presented category need not have a finite number of morphisms, which is why computational category theory is hard. Consider the schema with one object, Person, and one generating morphism, father : Person -> Person. That category has infinitely many morphisms: id, father(id), father(father(id)), and so on.
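
A short Python sketch of the father example, with paths written as lists of generators (the empty list standing for id). The second half adds an equation, father.father = father, purely as an assumption for illustration; it is not part of the example above and would not make sense for real fathers. It only shows how one equation can collapse infinitely many paths to finitely many morphisms.

```python
from itertools import islice

def paths_from_person():
    """Yield the paths Person -> Person: [], [father], [father, father], ..."""
    path = []
    while True:
        yield list(path)
        path.append("father")

# With no equations, every path length gives a distinct morphism.
first = list(islice(paths_from_person(), 4))
print([len(p) for p in first])

def normalize(path):
    """Normal form under the ASSUMED equation father.father = father."""
    return path[:1]

# Under that assumed equation only two morphisms remain: id and father.
distinct = {tuple(normalize(p)) for p in islice(paths_from_person(), 100)}
print(sorted(distinct))
```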

Finite category theory is computationally intractable, but easy.

>
> 4. "Foreign keys" are just attributes where the target is not a fundamental data type. Thus, person.name is an attribute since name is of type String, but employee.department is a foreign key, since a department is an entity in the schema, and not a data type in the typeside.

Yes

>
> 5. The path equations are necessary, not "safety checks". They literally specify the composition of morphisms in the category.

Yes, a finite presentation of a category contains a set of equations as part of its given structure, and the category denoted by the presentation defines its composition operation in terms of these equations.
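
One way to picture this in Python: composition is path concatenation followed by rewriting with the equations. The schema is hypothetical (worksIn : Employee -> Department, secretary : Department -> Employee, with the assumed equation secretary.worksIn = id), and type checking of paths is omitted. The sketch works only because this single rule shortens paths and does not overlap with itself; in general, as discussed below, equality of paths is undecidable and no such simple normalizer exists.

```python
# Assumed equation, oriented left to right as a rewrite rule:
# secretary followed by worksIn is the identity on Department.
RULES = [(("secretary", "worksIn"), ())]

def normalize(path):
    """Rewrite with RULES until no left-hand side occurs in the path."""
    path = tuple(path)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            for i in range(len(path) - len(lhs) + 1):
                if path[i:i + len(lhs)] == lhs:
                    path = path[:i] + rhs + path[i + len(lhs):]
                    changed = True
                    break
            if changed:
                break
    return path

def compose(p, q):
    """Composite 'p then q', reduced to normal form."""
    return normalize(tuple(p) + tuple(q))

print(compose(("worksIn", "secretary"), ("worksIn",)))
print(compose(("secretary",), ("worksIn",)))
```

Two paths denote the same morphism here exactly when their normal forms agree, so equality in this particular category is decided by comparing the results of `normalize`.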

>
> 6. I had never heard of the "word problem" before, and I'm surprised that determining path equivalence in a finitely presented category would be undecidable. https://en.wikipedia.org/wiki/Word_problem_(mathematics)
>

There are many easy proofs of this; for example, the word problem for groups has been known to be undecidable for a long time, and you can convert a word problem for a group into a word problem for a category - a group is a category with a single object and every morphism invertible. In fact, these undecidable categories may have small presentations (smallest I’ve seen is 3 equations).
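
The conversion itself is mechanical, as this Python sketch suggests. The dictionary format and the "^-1" naming convention for inverses are assumptions made here for illustration: each group generator becomes an arrow on the single object, its inverse becomes a second arrow, and the relators become path equations with the identity (empty path) on the right.

```python
def group_to_category(generators, relators):
    """Turn a group presentation into a one-object category presentation.

    relators: words (lists of arrow names) that equal the identity.
    The '^-1' suffix for inverses is a naming convention assumed here.
    """
    obj = "*"
    arrows = {}
    equations = []
    for g in generators:
        inv = g + "^-1"
        arrows[g] = (obj, obj)
        arrows[inv] = (obj, obj)
        # Invertibility of every generator, stated as two path equations.
        equations.append(([g, inv], []))
        equations.append(([inv, g], []))
    for word in relators:
        equations.append((list(word), []))
    return {"objects": [obj], "arrows": arrows, "equations": equations}

# The free abelian group on a and b: one relator, the commutator.
cat = group_to_category(["a", "b"], [["a", "b", "a^-1", "b^-1"]])
print(len(cat["arrows"]), "arrows,", len(cat["equations"]), "equations")
```

A decision procedure for path equality in every finitely presented category would, through this translation, decide the word problem for every finitely presented group, which is why no such procedure can exist.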

> 7. On a more abstract level, Spivak showed that categorical data has three main operations, \Sigma which is union, \Pi which is product or join, and \Delta which is projection. https://arxiv.org/pdf/1009.1166
>

To a first approximation; however, Sigma goes beyond union, and requires what database theorists call a “chase algorithm” to compute.
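
A heavily simplified Python sketch of why a chase is needed. It is not CQL's algorithm: it handles only the step that makes every foreign key total by inventing fresh labelled nulls, it omits the merging step that path equations would require, and the schema, names, and round limit are assumptions for illustration. When data is pushed into a target schema that has a foreign key the source lacked, new rows must be created, which a plain union never does.

```python
import itertools

def chase_foreign_keys(rows, foreign_keys, functions, max_rounds=10):
    """Make every foreign key total by inventing fresh labelled nulls.

    Returns new (rows, functions); raises if no fixpoint is reached
    within max_rounds, since the chase need not terminate in general.
    """
    rows = {e: set(xs) for e, xs in rows.items()}
    functions = {fk: dict(functions.get(fk, {})) for fk in foreign_keys}
    fresh = itertools.count()
    for _ in range(max_rounds):
        changed = False
        for fk, (src, dst) in foreign_keys.items():
            for x in sorted(rows[src]):
                if x not in functions[fk]:
                    null = f"null{next(fresh)}"
                    rows[dst].add(null)
                    functions[fk][x] = null
                    changed = True
        if not changed:
            return rows, functions
    raise RuntimeError("chase did not terminate within max_rounds")

# Source data has only people; the target schema also demands a residence.
new_rows, new_functions = chase_foreign_keys(
    {"Person": {"ann", "bob"}, "Residence": set()},
    {"lives": ("Person", "Residence")},
    {},
)
print(sorted(new_rows["Residence"]), new_functions["lives"])
```

With a cyclic foreign key such as father : Person -> Person and no equations, each round invents another null, so this loop hits the round limit; that is the same infinite family of father paths described earlier.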

> 8. One of the main uses of representing a schema as a category is that functors become ways to "migrate data between formats".

Yes

>
> So, does that mean, CQL could translate data from SQL, to CSV, to JSON, to Neo4j, etc.? (Depending on if such a translation is mathematically valid.)

Yes, many uses of CQL in practice involve integrating different ‘meta models’, such as SQL, CSV, JSON, XML, Graph, all at the same time.

>
> Apart from being able to translate data between different formats, does categorical data science suggest that there is a "most universal" way to represent any kind of data at all? For example, if someone wanted to create a data model about a person: name, schools attended, age, address, and so on - CQL would represent this info in a way that emphasizes "universal properties", more than SQL would, even though they are mathematically equivalent?

The category theory and database theory textbooks (and I also) say that as a general rule, the best way to represent a data model is to use what CQL calls “constraints” and what database theorists call “embedded, implicational dependencies” (https://dbucsd.github.io/paperpdfs/2008_8.pdf) and what category theorists call “regular logic / lifting problems”. This logic is basically Horn clauses with existential quantifiers added, and is not supported by SQL. It is very well studied - see for example the textbook by Arenas et al, “Data Exchange”.

The reason is that this is the strongest logic that admits the delta/sigma/pi data migration functors; any stronger, and at least one of them disappears. Their disappearance isn’t necessarily bad - our work with Uber uses a logic for which they disappear https://arxiv.org/pdf/1909.04881 - but this logic is a good starting point to balance ability to migrate vs expressive power within the model.
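
To make the shape of such a constraint concrete, here is a Python sketch that checks one dependency of the form "for every Employee row e there exists a Department row d with d.name = e.dept". The tables, column names, and the constraint itself are invented for illustration; this is not CQL's constraint syntax, only the "for all ... there exists ..." pattern described above.

```python
# Hypothetical data; each table is a list of rows (dicts).
employees = [
    {"name": "alice", "dept": "math"},
    {"name": "bob", "dept": "cs"},
]
departments = [{"name": "math"}]

def violations(employees, departments):
    """Employee rows for which no witnessing Department row exists."""
    known = {d["name"] for d in departments}
    return [e for e in employees if e["dept"] not in known]

# bob's department has no witness, so the constraint fails on his row.
print(violations(employees, departments))
```

Checking a constraint only reports the failing rows; repairing them by inventing the missing Department rows is the job of a chase.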

-Ryan


Julius Hamilton

Aug 23, 2024, 1:18:26 PM
to Categorical Data
Thanks. That looks really interesting.

So I am thinking about translating between JSON and SQL using CQL. I am wondering if there is a mathematical proof of how these data structures compare to each other. I think of a JSON schema as a tree structure. And I guess an SQL schema can be seen to have the structure, as discussed above, as a category. Does this suggest that every JSON schema can be readily converted to SQL, but not vice versa? I'm imagining an SQL schema having structural properties a tree cannot represent, like cycles of edges. On the other hand, perhaps an SQL database can be mapped to a *collection* of JSON tree-structures?

Let's assume all the data in some JSON will be of type String, Integer or List. JSON is basically nested dictionaries.

typeside JSON = literal {
    String, Integer, List
}

schema S = literal : JSON {
    entities
        Person Residence
    foreign_keys
        owns : Person -> Residence
    path_equations
    
    attributes
        name: Person -> String
        age: Person -> Integer
        residences: Person -> List
        address: Residence -> String
        ownedBy : Residence -> Person
        
I know this is incorrect. Where would you go from here?

One of the biggest reasons I got into this in 2023, when I first discovered the work of Spivak, is an exploration of creating knowledge graphs with a hypergraph structure. The idea is to have a flexible schema which can be programmatically modified at any time. Imagine that I have tiny units of data, like a particular URL. In any given formal schema, there may be a specific field I should fill out, to give metadata to a datum. For example, for a URL, I might have fields / headers like webpage type (social network, web app, article, wikipedia article), article title, date accessed, etc. However, I do not want to decide rigidly and in advance on a schema. In the real world, a datum has many, many different kinds of attributes and relations to other things. On the most basic level, I have been exploring creating relations between things that are not necessarily labeled. This adds a kind of cognitive freedom to organizing some data. So for example, for this URL:


I can confidently say this piece of data has "some relation" to many things: Wikipedia, articles, Bitcoin, the Bitcoin protocol, URLs, Strings, etc. It is much easier to establish a relation to anything I can think of, without being committed to a particular "data format".

But for some of those relations, we realize that a certain label expresses the relation well. For example, "https://en.wikipedia.org/wiki/Bitcoin_protocol" "is a" "URL". And "https://en.wikipedia.org/wiki/Bitcoin_protocol" "links to" "the Wikipedia page about the Bitcoin protocol". And "https://en.wikipedia.org/wiki/Bitcoin_protocol" "is about" "the Bitcoin protocol". And "https://en.wikipedia.org/wiki/Bitcoin_protocol" "is an" "article". And so on.

Now, similar to how in set theory, a function is identified with a set of values, we can see the emergence of concepts in such a hypergraph structure, because instead of trying to define "is" via certain properties, we can instead characterize "is" by every mapping/relation in our knowledge graph which uses the "is" label. But this means that there has to be the possibilities for relations between elements on any "level" of the hypergraph: a relation between "is", which is a "level 1" entity, and a level 0 entity like "verb". Or a relation between "is" and "equals", two level 1 concepts, which represents the concept of "is similar to" or "are synonyms".

Do you have any advice on how to mathematically structure that kind of system? Can CQL reflect hypergraph structures, where attributes themselves can have attributes?

Thanks very much,
Julius

Ryan Wisnesky

Aug 23, 2024, 3:50:12 PM
to categor...@googlegroups.com
Responses below.

> On Aug 23, 2024, at 10:18 AM, Julius Hamilton <juliusha...@gmail.com> wrote:
>
> Thanks. That looks really interesting.
>
> So I am thinking about translating between JSON and SQL using CQL. I am wondering if there is a mathematical proof of how these data structures compare to each other. I think of a JSON schema as a tree structure. And I guess an SQL schema can be seen to have the structure, as discussed above, as a category. Does this suggest that every JSON schema can be readily converted to SQL, but not vice versa? I'm imagining an SQL schema having structural properties a tree cannot represent, like cycles of edges. On the other hand, perhaps an SQL database can be mapped to a *collection* of JSON tree-structures?

This is pretty straightforward: trees/graphs can be directly encoded as relations, in particular, as an “edge” relation and a “node” relation. If you look at the TinkerPop schema for graph data in CQL (schema keyword ‘tinkerpop’), you can literally see the node and edge relations. To encode a relation as a tree/graph, you can use degenerate, unbalanced trees that encode ‘lists/sets/bags of tuples’.
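As a rough illustration of both directions (written in Python rather than CQL; the function names and the row layouts are made up for this sketch and are not the TinkerPop schema), a parsed JSON tree can be flattened into a node relation and an edge relation, and a relation can be turned into a degenerate tree that is just a list of flat objects:

```python
import json
from itertools import count


def tree_to_relations(doc):
    """Flatten a parsed JSON value into a node relation and an edge relation.

    Hypothetical row layouts chosen for this sketch:
      nodes: (node_id, kind, value)
      edges: (parent_id, label, child_id)
    """
    nodes = []
    edges = []
    ids = count()

    def visit(value):
        node_id = next(ids)
        if isinstance(value, dict):
            nodes.append((node_id, "object", None))
            for key, child in value.items():
                edges.append((node_id, key, visit(child)))
        elif isinstance(value, list):
            nodes.append((node_id, "array", None))
            for index, child in enumerate(value):
                # Array positions become edge labels.
                edges.append((node_id, str(index), visit(child)))
        else:
            nodes.append((node_id, "scalar", value))
        return node_id

    visit(doc)
    return nodes, edges


def relation_to_tree(columns, rows):
    """Encode a relation as a degenerate tree: one flat object per tuple."""
    return [dict(zip(columns, row)) for row in rows]


doc = json.loads('{"name": "Ada", "residences": [{"address": "1 Main St"}]}')
nodes, edges = tree_to_relations(doc)
people = relation_to_tree(("name", "age"), [("Ada", 36), ("Bob", 41)])
```

The tree-to-relations direction keeps all of the nesting in the edge relation, which is why cycles and shared children pose no problem once the data is relational.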

It’s probably best to think in terms of first order logic. Pretty much every data model can be expressed in terms of first order logic; that’s a big reason FOL is studied so much. CQL, SQL, JSON, RDF, XML, etc. all have clear definitions in terms of FOL. In all cases, hierarchy and foreign keys turn out to be very similar concepts.


>
> Let's assume all the data in some JSON will be of type String, Integer or List. JSON is basically nested dictionaries.
>
> typeside JSON = literal {
> String, Integer, List
> }
>
> schema S = literal : JSON {
> entities
> Person Residence
> foreign_keys
> owns : Person -> Residence
> path_equations
> attributes
> name: Person -> String
> age: Person -> Integer
> residences: Person -> List
> address: Residence -> String
> ownedBy : Residence -> Person
> I know this is incorrect. Where would you go from here?

The built-in “JSON” example illustrates the typical way we import and export JSON data, which actually passes through RDF because that community already took care of the graph-to-relation encoding mentioned above. That example creates a JSON file programmatically, then imports it into CQL, then exports it as JSON again. In doing so, you can clearly see that translating from JSON and then back to JSON is a monad, not an isomorphism: the JSON that comes out of CQL is equivalent to, but not the same as, the JSON that went into CQL.
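The "equivalent but not the same" behaviour can be seen in miniature with a toy round trip. This is a Python sketch under stated assumptions, not what CQL or its RDF encoding actually does: it only handles nested JSON objects with scalar leaves, and `to_facts`, `from_facts`, and `round_trip` are names invented for the example. The JSON is flattened into an unordered set of facts and rebuilt in a canonical key order, so the output text can differ from the input while a second round trip changes nothing further:

```python
import json


def to_facts(value, path=()):
    """Encode nested JSON objects as a set of (path, scalar) facts.

    A set carries no ordering, so the original key order is forgotten here.
    """
    if isinstance(value, dict):
        facts = set()
        for key, child in value.items():
            facts |= to_facts(child, path + (key,))
        return facts
    return {(path, value)}


def from_facts(facts):
    """Rebuild nested objects from facts, inserting keys in sorted path order."""
    root = {}
    for path, scalar in sorted(facts):
        cursor = root
        for key in path[:-1]:
            cursor = cursor.setdefault(key, {})
        cursor[path[-1]] = scalar
    return root


def round_trip(text):
    """JSON text -> facts -> JSON text."""
    return json.dumps(from_facts(to_facts(json.loads(text))))


original = '{"name": "Ada", "address": {"street": "1 Main St", "city": "London"}}'
once = round_trip(original)
twice = round_trip(once)
```

Here `once` differs from `original` as text but parses to the same object, and `twice` equals `once`.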


>
> One of the biggest reasons I got into this in 2023, when I first discovered the work of Spivak, is an exploration of creating knowledge graphs with a hypergraph structure. The idea is to have a flexible schema which can be programmatically modified at any time. Imagine that I have tiny units of data, like a particular URL. In any given formal schema, there may be a specific field I should fill out, to give metadata to a datum. For example, for a URL, I might have fields / headers like webpage type (social network, web app, article, wikipedia article), article title, date accessed, etc. However, I do not want to decide rigidly and in advance on a schema. In the real world, a datum has many, many different kinds of attributes and relations to other things. On the most basic level, I have been exploring creating relations between things that are not necessarily labeled. This adds a kind of cognitive freedom to organizing some data. So for example, for this URL:
>
> https://en.wikipedia.org/wiki/Bitcoin_protocol
>
> I can confidently say this piece of data has "some relation" to many things: Wikipedia, articles, Bitcoin, the Bitcoin protocol, URLs, Strings, etc. It is much easier to establish a relation to anything I can think of, without being committed to a particular "data format".
>
> But for some of those relations, we realize that a certain label expresses the relation well. For example, "https://en.wikipedia.org/wiki/Bitcoin_protocol" "is a" "URL". And "https://en.wikipedia.org/wiki/Bitcoin_protocol" "links to" "the Wikipedia page about the Bitcoin protocol". And "https://en.wikipedia.org/wiki/Bitcoin_protocol" "is about" "the Bitcoin protocol". And "https://en.wikipedia.org/wiki/Bitcoin_protocol" "is an" "article". And so on.
>
> Now, similar to how in set theory a function is identified with a set of values, we can see the emergence of concepts in such a hypergraph structure, because instead of trying to define "is" via certain properties, we can instead characterize "is" by every mapping/relation in our knowledge graph which uses the "is" label. But this means that relations have to be possible between elements on any "level" of the hypergraph: a relation between "is", which is a "level 1" entity, and a level 0 entity like "verb". Or a relation between "is" and "equals", two level 1 concepts, which represents the concept of "is similar to" or "are synonyms".
>
> Do you have any advice on how to mathematically structure that kind of system? Can CQL reflect hypergraph structures, where attributes themselves can have attributes?

My advice in these situations is almost always to first write it down on paper using math, and then implement it on a computer only once you know exactly what you should be doing. (“If you want sense, you have to make it yourself.”) Like, CQL and other data models can encode hypergraphs, sure, but whether or not hypergraphs are an appropriate data model for your problem is not a question of CQL or software. There are a number of methodologies for modeling data, almost all of which begin with characterizing exactly what you want to do with the data and what would go wrong, practically speaking, if you didn’t model it that way in that circumstance.

For example, if your data is “People”, then you would model that differently if your goal is analytics (perhaps you’d use a star schema to support fast queries at scale) vs transaction processing (perhaps you’d use a schema with lots of indices to support small, frequent updates). Or, if your goal is to exchange such data among untrusted peers, perhaps you would build a lot of redundancy into the model, etc. All of these can be formalized and related in FOL / CQL.

Here are some slides that go through a particular data modeling approach called “E/R”: http://www.csbio.unc.edu/mcmillan/Media/Comp521F12Lecture02.pdf . Similarly, if you were modeling an airplane, you’d model it differently if you were trying to study how it flies, vs how much it costs to build, vs how long it would take to wear out, etc, because “all models are wrong, but some are useful.” And in the end, you may model it many ways :-)
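On the narrower encoding question (relations whose labels can themselves take part in relations), one common relational trick is to store every statement as a row and to allow labels and statement ids to appear as subjects. The following is only a Python sketch of that idea; the class and method names are invented for illustration, it is not CQL, and whether this is the right model still depends on what the data is for:

```python
class KnowledgeGraph:
    """Statements are rows (subject, label, object); labels may also be subjects."""

    def __init__(self):
        self.statements = {}  # statement id -> (subject, label, object)

    def relate(self, subject, label, obj):
        """Record a statement and return its id; label may be None if unlabeled."""
        statement_id = f"s{len(self.statements)}"
        self.statements[statement_id] = (subject, label, obj)
        return statement_id

    def extension(self, label):
        """Characterize a label by every (subject, object) pair that uses it."""
        return {(s, o) for s, l, o in self.statements.values() if l == label}

    def about(self, thing):
        """Statements whose subject is `thing` (a datum, a label, or a statement id)."""
        return {(l, o) for s, l, o in self.statements.values() if s == thing}


kg = KnowledgeGraph()
url = "https://en.wikipedia.org/wiki/Bitcoin_protocol"
kg.relate(url, "is a", "URL")
kg.relate(url, "is about", "the Bitcoin protocol")
kg.relate(url, None, "Wikipedia")            # "some relation", not yet labeled
kg.relate("is a", None, "verb")              # a label used as a subject
kg.relate("is a", "is similar to", "equals")
```

Because a label is just another value in the subject column, no separate "level" machinery is needed; `kg.extension("is a")` and `kg.about("is a")` answer the two kinds of question about the same label.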

Julius Hamilton

Aug 27, 2024, 3:55:13 PM
to Categorical Data
Thanks. It will take me more time to absorb enough foundational knowledge to fully understand a lot of what you just said.

I am now trying to use CQL for the first time. Here's one use case:

Wikipedia offers a data dump download in XML format.

I want to convert that to SQL.

The manual gives this example for XML:

command createXmlData = exec_js {
	"Java.type(\"catdata.Util\").writeFile(\" <data xmlns=\\\"http://example.org\\\">\\n <item>\\n <name>Hello</name>\\n </item>\\n <item>\\n <name>Hello</name>\\n </item>\\n</data> \", \"tempX.xml\")"
}

instance I = import_xml_all "tempX.xml"

command exportXmlData = export_rdf_instance_xml I "fileO.xml" {
	external_types
		Integer -> "http://www.w3.org/2001/XMLSchema#Integer" "x => x.toString()"
}


Essentially, this appears to create a dummy .xml file, and imports it into an instance with the command "import_xml_all".
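For anyone who wants to look at the same dummy document outside CQL, here is a small Python sketch that parses equivalent XML and flattens it into (subject, predicate, object) triples. The triple vocabulary ("tag", "text", "child") and the node ids are assumptions made up for this example; it is not a description of what import_xml_all produces:

```python
import xml.etree.ElementTree as ET

# Same content as the dummy file written by the manual's example.
XML_TEXT = """<data xmlns="http://example.org">
  <item>
    <name>Hello</name>
  </item>
  <item>
    <name>Hello</name>
  </item>
</data>"""


def xml_to_triples(text):
    """Flatten an XML document into (subject, predicate, object) triples."""
    triples = []
    counter = 0

    def visit(element):
        nonlocal counter
        subject = f"n{counter}"
        counter += 1
        # ElementTree reports namespaced tags as "{namespace}localname".
        triples.append((subject, "tag", element.tag))
        if element.text and element.text.strip():
            triples.append((subject, "text", element.text.strip()))
        for child in element:
            triples.append((subject, "child", visit(child)))
        return subject

    visit(ET.fromstring(text))
    return triples


triples = xml_to_triples(XML_TEXT)
```

Note that the two identical `<item>` elements stay distinct here because each element gets its own id, which is one of the choices any XML-to-relational encoding has to make.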

That command is defined as:

instance (import_xml_all "md_uri") : schemaOf import_xml_all "md_uri"

Is this referencing a schema that isn't documented?

Or is the point that we have to write the schema anew for different XML data, since all data has its own schema particular to the data?

> If you look at the TinkerPop schema for graph data in CQL (schema keyword ‘tinkerpop’)

I actually don't see this. Is there an error where the schema isn't displaying in the manual? (Please see attached photo).

Thanks,
Julius
Screenshot 2024-08-27 at 1.51.01 PM.png

Ryan Wisnesky

Aug 27, 2024, 4:51:00 PM
to categor...@googlegroups.com
Hi Julius,

Re: converting the Wikipedia XML data dump (which uncompressed is over
100GB) to SQL, this is already a solved problem, and a priori not
necessarily a data migration or even a query problem: the MediaWiki
software itself ingests the XML file and creates an SQL database as
part of its initialization process, which undoubtedly does not run
in-memory as CQL does. Here's a handy reference:
https://stackoverflow.com/questions/62663042/import-english-wikipedia-dump-into-sql-server
. With a specification of what MediaWiki is doing, one might prove
that that data transformation is expressible in relational algebra
(and so easy to do in SQL) or existential Horn clauses (and so easy to
do in CQL). But implicit in your desire to "convert that to SQL" is
*how*: there are many ways to represent any given data structure in
another; I'm assuming you'd like to use the SQL structure that
MediaWiki itself uses.
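To make the "not in-memory" point concrete, here is a minimal Python sketch of a streaming load: it walks the XML one element at a time and inserts rows into SQLite as it goes, instead of building the whole document first. Everything specific in it is an assumption for illustration: the two-column `page` table is hypothetical and is not MediaWiki's schema, and the tiny sample only loosely imitates the dump's page/title/revision/text layout.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump file; the real dump is far too large to hold in memory.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Bitcoin protocol</title>
    <revision><text>First sample text</text></revision>
  </page>
  <page>
    <title>Category theory</title>
    <revision><text>Second sample text</text></revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the "{namespace}" prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def load_pages(stream, connection):
    """Stream <page> elements into a hypothetical page(title, body) table."""
    connection.execute("CREATE TABLE IF NOT EXISTS page (title TEXT, body TEXT)")
    inserted = 0
    for _event, element in ET.iterparse(stream, events=("end",)):
        if local(element.tag) != "page":
            continue
        title = body = None
        for node in element.iter():
            if local(node.tag) == "title":
                title = node.text
            elif local(node.tag) == "text":
                body = node.text
        connection.execute(
            "INSERT INTO page (title, body) VALUES (?, ?)", (title, body)
        )
        inserted += 1
        # Drop the finished page's children so their text is not kept around.
        element.clear()
    connection.commit()
    return inserted


connection = sqlite3.connect(":memory:")
page_count = load_pages(io.BytesIO(SAMPLE), connection)
rows = connection.execute("SELECT title, body FROM page ORDER BY title").fetchall()
```

For a real dump the `stream` argument would be an open file (possibly a decompressing one) and the database would live on disk; the choice of target tables is exactly the *how* question raised above.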

Re: schemaOf, this is indeed a command that takes the schema of an
instance, and is useful when you don't know the schema at compile time
- say, because you are converting an instance to a schema. But in the
manual's example and the built-in examples, there's not actually any
reason to use it; its use is apparently a historical artifact of the
fact that we added RDF support after XML rather than before and so for
a time the "RDF" schema keyword was not available to land XML data on.
Anyway, I've changed CQL to use the RDF schema keyword directly
instead of through schemaOf. So XML, JSON, and RDF will all land onto
the RDF schema keyword directly in CQL; right now, they all land on
the RDF schema, but "indirectly", through evaluation.

I've attached a screenshot showing what the tinkerpop schema and
constraints evaluate into, as well as a screenshot showing the RDF
schema associated with an XML import.

-Ryan

On Tue, Aug 27, 2024 at 12:55 PM Julius Hamilton
Screenshot 2024-08-27 at 1.26.35 PM.png
Screenshot 2024-08-27 at 1.17.54 PM.png