Wondering how to use CQL to perform pi (right pushforward)

James Hester

Jun 11, 2019, 10:59:14 PM
to Categorical Data
Greetings all,

I am exploring the possibility of using CQL to perform migration of data from data files of varying formats into an instance conforming to a canonical schema. 

The idea is that any data file can be represented by a format-specific schema, and our scientific ontology is also representable as a schema. If a mapping (functor) is defined between these schemas, it should be possible to migrate the data from a collection of data files into the canonical ontology automatically using CQL. However, it seems that the appropriate data migration functor is pi, the right pushforward functor. The CQL tutorial advice is to reformulate pi as a query. If I have understood the papers and tutorial correctly, for this to be possible every entity in the ontology should have a source entity in the file format ontology, which is almost never the case in my project as the ontology is very general. Furthermore, it appears that the foreign key section of the query requires that a foreign key in the target schema be defined using a path in the source schema, whereas the functor takes a foreign key in the source to a path in the target, i.e. the opposite direction.

I'm wondering if I have understood the situation correctly. It seems to me that I can do the pi migration in my head, because there are often only singleton values for many attributes, which then propagate through the canonical ontology. Is there any advice you could give me as to how I might go about organising a pi migration in the above scenario? Perhaps I just need to add a few options in CQL and the built-in pi will work? Or add an initial object to the file schema?

thanks,
James.

Ryan Wisnesky

Jun 11, 2019, 11:07:55 PM
to categor...@googlegroups.com
Hi James,

In general we recommend using queries instead of pi because queries tend to be a lot faster in practice, but CQL does provide pi:

instance I = pi F J

Does that answer your question?

-Ryan

David Spivak

Jun 12, 2019, 8:25:24 AM
to categor...@googlegroups.com
Hi James,

Also note that the queries for any given entity could be empty. Does that help?

If you want to send a minimal example, I'd be happy to take a look.

David


James Hester

Jun 12, 2019, 9:19:05 PM
to Categorical Data
Hi David and Ryan,

Thanks for your help. I had tried to run pi previously, and when presented with an error assumed that I was hitting the limits of what the pi implementation was designed to do. Here is a minimal CQL example showing the problem that I have with doing a pi transform:

//We handle only strings
typeside TyJava = literal {
    java_types
        string = "java.lang.String"
    java_constants
        string = "return input[0]"
}

//A collection of data files, each containing a wavelength
schema ADSC = literal : TyJava {
    entities
        adsc_file
    attributes
        adsc_file_to__diffrn_radiation_wavelength_dot_wavelength : adsc_file -> string
}


instance testdata = literal : ADSC {
    generators
        adsc_file_3 adsc_file_1 adsc_file_2 : adsc_file
    multi_equations
        adsc_file_to__diffrn_radiation_wavelength_dot_wavelength -> {
            adsc_file_1 "0.710894" , adsc_file_2 "0.710894" , adsc_file_3 "0.710894"
        }
}

//An ontology for image data, images are collected into diffraction experiments
//and one aspect of a diffraction experiment is the wavelength that it was run
//at. diffrn_radiation_wavelength is a separate entity because in more complex
//experiments we might describe the incident radiation as a table of weighted
//contributing wavelengths.
schema imgCIF = literal : TyJava {
    entities
        diffrn
        diffrn_data_frame
        diffrn_radiation_wavelength
    foreign_keys
        diffrn_data_frame_to_diffrn : diffrn_data_frame -> diffrn
        diffrn_to_diffrn_radiation_wavelength : diffrn -> diffrn_radiation_wavelength
    attributes
        _diffrn_radiation_wavelength_dot_wavelength : diffrn_radiation_wavelength -> string
}

mapping ADSC_to_imgCIF = literal : ADSC -> imgCIF {
    entity
        adsc_file -> diffrn_data_frame
    attributes
        adsc_file_to__diffrn_radiation_wavelength_dot_wavelength -> diffrn_data_frame_to_diffrn.diffrn_to_diffrn_radiation_wavelength._diffrn_radiation_wavelength_dot_wavelength
}


instance Result = pi ADSC_to_imgCIF testdata


When running this using the (very cool) CQL IDE I get:

Error in  Result: java.util.concurrent.ExecutionException: java.lang.RuntimeException: In transform for foreign key diffrn_data_frame_to_diffrn, In sknull, the type for labelled null sknull is not defined.

I think this is telling me that there is no type for the generator of diffrn_data_frame, even though if I run "sigma" instead of "pi" everything is fine. Is there a simple way to fix this?

Just as some background to this example, and how I think it should work, "imgCIF" is an ontology designed for describing raw data images from crystallographic experiments. In the imgCIF ontology, every diffraction experiment (identified by "diffrn") is run at a particular wavelength (attribute _diffrn_radiation_wavelength_dot_wavelength). A diffraction experiment consists of many image frames ("diffrn_data_frame"), each of which belongs to a particular value of "diffrn". Mapping into this imgCIF ontology is the source ontology for data files of type "ADSC", which describes a collection of single frame files, each of which includes a wavelength.

How I think the right pushforward should work is that "diffrn_radiation_wavelength" is populated with a single value, as there is only one distinct wavelength, and then likewise for "diffrn". The diffrn_data_frame_to_diffrn morphism maps each file to that single "diffrn" value.  If there had been more than one value for wavelength, the right pushforward functor would create multiple values for "diffrn" and the diffrn_data_frame_to_diffrn morphism would be mapped appropriately.

From my point of view this is neat, as essentially the collection of attributes of "diffrn" self-organise data frames into diffraction experiments.
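The grouping described above can be sketched in a few lines of Python. This is only an illustration of the expected outcome (one "diffrn" per distinct wavelength that actually occurs in the files), not of how CQL computes pi; the function name and the `diffrn_N` identifiers are hypothetical.

```python
from collections import defaultdict


def group_frames_by_wavelength(file_wavelengths):
    """Group frame files into experiments, one experiment per distinct wavelength.

    file_wavelengths maps a file name to the wavelength string found in it.
    Returns (diffrn, frame_to_diffrn): the wavelength of each experiment and
    the experiment each frame belongs to.
    """
    frames_by_wavelength = defaultdict(list)
    for file_name, wavelength in file_wavelengths.items():
        frames_by_wavelength[wavelength].append(file_name)

    diffrn = {}
    frame_to_diffrn = {}
    # Hypothetical ids: diffrn_1, diffrn_2, ... in sorted wavelength order.
    for index, wavelength in enumerate(sorted(frames_by_wavelength), start=1):
        diffrn_id = f"diffrn_{index}"
        diffrn[diffrn_id] = wavelength
        for file_name in frames_by_wavelength[wavelength]:
            frame_to_diffrn[file_name] = diffrn_id
    return diffrn, frame_to_diffrn


testdata = {
    "adsc_file_1": "0.710894",
    "adsc_file_2": "0.710894",
    "adsc_file_3": "0.710894",
}
diffrn, frame_to_diffrn = group_frames_by_wavelength(testdata)
print(diffrn)  # {'diffrn_1': '0.710894'}
print(frame_to_diffrn)
```

With a second distinct wavelength in the input, the same function would produce a second "diffrn" and split the frames between the two.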

My long-term concern is that in real life such file collections could include up to about a thousand files, and if the pi data migration naively creates the product of all contributing object elements before discarding based on source category morphisms, memory usage could be rather large as there are many more attributes in each file than in the example.  There do seem to be more efficient algorithms around, but I'm not sure if they are implemented in CQL. 

Thanks for any insight you can provide!

James.


Ryan Wisnesky

Jun 12, 2019, 9:24:26 PM
to categor...@googlegroups.com
Hi James,

There are situations where sigma will work but pi won’t (mostly having to do with infinite types such as String or Integer and unmapped attributes), but given the complexity of your example and the error message, you may very well have found a bug in pi. I’ll work through the example and get back to you soon.

Ryan

James Hester

Jun 12, 2019, 9:35:39 PM
to Categorical Data
Just regarding empty entities in queries, that sounds promising if I can still describe the foreign key mappings from the entity, as I think these would be needed in order to populate the entity. However, I don't think I can come up with a sequence of paths in the source category if the endpoint or starting point is not mapped into the target category. I'm by no means completely across section 4.3.4 of the 2017 "Algebraic Data Integration" paper, but it seems to be saying that there should always be a mapped entity for the query equivalent to pi migration to exist?


Ryan Wisnesky

Jun 12, 2019, 9:42:03 PM
to categor...@googlegroups.com
Hi James,

You can convert a mapping to a query in two different ways; the one you want here is:

instance Result = eval (toCoQuery ADSC_to_imgCIF) testdata

Unfortunately, that also causes an exception. So it looks like your example found a bug, the only question is where. Updates to follow.

Ryan

Ryan Wisnesky

Jun 12, 2019, 10:00:03 PM
to categor...@googlegroups.com
Hi James:

Quick follow-up: the situations where pi doesn’t work but sigma does are those where the schema mapping is not ‘surjective on attributes’. Is your intended semantics such that the value of every attribute in the target instance comes (somehow) from the source? Or are you attempting to create new nulls using pi?

Thanks,
Ryan

David Spivak

Jun 12, 2019, 11:43:09 PM
to categor...@googlegroups.com
I'll also chime in.

> How I think the right pushforward should work is that "diffrn_radiation_wavelength" is populated with a single value, as there is only one distinct wavelength, and then likewise for "diffrn".
I would agree if there were no attributes. But because your schema is expecting a string, and there is no canonical choice, it actually attempts to duplicate your single value, once for each string. This is the universal solution for Pi (the one for Sigma just creates a labeled null).

What string do you think should be there for your single value? Is it really supposed to have a string? I think this whole thing would work fine if your schema looked like this:
schema imgCIF = literal : TyJava {
    entities
        diffrn
        diffrn_data_frame
        diffrn_radiation_wavelength
    foreign_keys
        diffrn_data_frame_to_diffrn : diffrn_data_frame -> diffrn
        diffrn_to_diffrn_radiation_wavelength : diffrn -> diffrn_radiation_wavelength
    attributes
        _diffrn_radiation_wavelength_dot_wavelength : diffrn_data_frame -> string
}

Ryan Wisnesky

Jun 13, 2019, 12:09:35 AM
to categor...@googlegroups.com
Hi James,

Your program was indeed causing CQL to fail un-gracefully, and I pushed a new jar to address that.  However, the underlying issue is that your mapping is not ‘surjective on attributes’.  That means that doing a pi along it will often lead to infinite instances; in relational terms, the corresponding query is not “domain independent”.  In fact, on the empty instance, pi along your mapping will contain all strings.  You can even run it in CQL, if you use a finite, non-java typeside, e.g.:

typeside TyJava = literal {
    types
        string
    constants
        aStr bStr cStr : string
}

Pi on the empty instance has one row per string in two tables - see attached.  I would suggest trying to re-vamp your schema along David’s lines.  I’d also suggest giving queries a try - both their syntax and semantics are close to SQL, making it easier to get started.  Also, it’s impossible to accidentally write domain dependent queries, as you may have noticed in attempting to translate pi :-)
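The "one row per string in two tables" behaviour can be reproduced by hand in Python for this particular mapping. The sketch below is not CQL's algorithm; it assumes the textbook reading of pi as a limit, worked out only for ADSC_to_imgCIF: the entities "diffrn" and "diffrn_radiation_wavelength" are reached from the source schema only through the type `string`, so each gets one row per string, while "diffrn_data_frame" gets one row per source file. The row ids (`diffrn_aStr`, `drw_aStr`, ...) are hypothetical.

```python
def pi_adsc_to_imgcif(file_wavelengths, strings):
    """Tables of pi along ADSC_to_imgCIF, derived by hand for this one mapping.

    file_wavelengths maps each source file name to its wavelength string.
    strings is the finite set of values that make up the `string` type.
    """
    strings = sorted(set(strings))
    missing = sorted(set(file_wavelengths.values()) - set(strings))
    if missing:
        raise ValueError(f"wavelengths not in the typeside: {missing}")

    # diffrn_radiation_wavelength: one row per string, carrying that string
    # as its wavelength attribute.
    wavelength_rows = {f"drw_{s}": s for s in strings}
    # diffrn: also one row per string, pointing at the matching wavelength row.
    diffrn_rows = {f"diffrn_{s}": f"drw_{s}" for s in strings}
    # diffrn_data_frame: one row per source file, pointing at the diffrn row
    # for the wavelength recorded in that file.
    frame_rows = {name: f"diffrn_{w}" for name, w in file_wavelengths.items()}

    return {
        "diffrn_radiation_wavelength": wavelength_rows,
        "diffrn": diffrn_rows,
        "diffrn_data_frame": frame_rows,
    }


typeside_strings = {"aStr", "bStr", "cStr"}
empty = pi_adsc_to_imgcif({}, typeside_strings)
print({table: len(rows) for table, rows in empty.items()})
# {'diffrn_radiation_wavelength': 3, 'diffrn': 3, 'diffrn_data_frame': 0}
```

With `java.lang.String` in place of the three constants, the same two tables would need one row for every possible string, which is why the original program cannot produce a finite result.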

This is a cool example and you found an obscure but important bug; kudos and thanks for sharing.



James Hester

Jun 14, 2019, 1:11:00 AM
to Categorical Data
Hi Ryan and David,

Thanks for your explanations. I have downloaded the updated jar and can confirm that things now work according to my naive expectations after making the change to a finite typeside suggested by Ryan. 

My confusion appears to have arisen by assuming that the values of any attribute in the computed target instance would be limited to those present for the equivalent attribute in the source instance, which I think was how I understood the original "Functorial Data Migration" paper to be operating. But it seems that the type sides (which I am not really across) instead work with infinite types even if a particular source instance could only have a countable number of instantiations of that type.  In any case, I can work from Ryan's example of a finite typeside to recreate my original assumption - as I am generating the CQL programmatically, it is easy for me to define a type for each attribute in the source instance, populated by only those values that actually appear.  Of course, it would be a nice option to have in CQL that infinite types can be restricted to only the finite set of values actually appearing in the source instance, but I don't know how easy that would be to implement and how useful it would be to anybody else.
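A minimal sketch of that programmatic step, in Python: it emits a finite typeside in the shape of Ryan's example, with one constant per distinct value found in the source files. The function name is hypothetical, and writing each value as a double-quoted constant is an assumption about CQL's identifier syntax that should be checked against the CQL documentation before relying on it.

```python
def finite_typeside(name, type_name, values):
    """Return CQL text for a typeside whose single type holds only `values`.

    Assumption: CQL accepts double-quoted constants such as "0.710894".
    """
    distinct = sorted(set(values))
    for value in distinct:
        if '"' in value:
            raise ValueError(f"cannot quote a value containing a double quote: {value!r}")

    lines = [f"typeside {name} = literal {{", "    types", f"        {type_name}"]
    if distinct:
        # One constant per distinct value, all of the same type.
        quoted = " ".join(f'"{value}"' for value in distinct)
        lines += ["    constants", f"        {quoted} : {type_name}"]
    lines.append("}")
    return "\n".join(lines) + "\n"


wavelengths = ["0.710894", "0.710894", "0.710894"]
print(finite_typeside("TyWavelength", "string", wavelengths))
```

Duplicates are collapsed, so three files with the same wavelength yield a single constant.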

David - unfortunately I can't rewrite the schema as you suggest, because the schema is essentially "given" by the community. The community has defined that wavelength is a property of the data collection as a whole, not a property of every frame, and indeed one nice side-effect of the functorial migration approach is that these definitions group the data frames into experiments automatically. The way the system I envisage is supposed to work is that each data file format is associated with a simple definition file listing the locations of attributes from the community ontology within files of that format. Programmatically I then create a source schema and a functor onto the predefined community ontology.  If you're interested, there is a PDF of a recent presentation of mine on this at http://www.c3dis.com/wp-content/uploads/2019/05/2019-C3DIS_Hester_Interoperable-heterogeneous-data-repositories.pdf - some fairly loose ideas in there but I think the basic thrust has great potential (the category diagrams were clearly generated before finding Easik in the CQL GUI).

One more question - I would like to put cql.jar in a processing chain, so take the output back into other software. Is there any way to run cql.jar from the command line and obtain the computed target instance as straight machine-readable text (e.g. using the syntax of the instance definition in CQL)?  HTML would be annoying to scrape data out of.

Anyway, I'll let you know how I go with the full-scale ontology.

thanks again,
James.

Ryan Wisnesky

Jun 14, 2019, 1:31:44 AM
to categor...@googlegroups.com
Hi James,

Your point about a non-HTML command line driver is well taken; I’ll add it asap.

Regarding your question:

> it would be a nice option to have in CQL that infinite types can be restricted to only the finite set of values actually appearing in the source instance, but I don't know how easy that would be to implement and how useful it would be to anybody else.

This is actually an interesting and open research question.  We looked at it briefly but concluded it wasn’t functorial or wasn’t well-behaved somehow.  It’s the categorical analog of “active domain semantics”, which is usually regarded as bad for non ‘domain independent queries’ such as pi along mappings that are not surjective on attributes, such as yours.  My guess is that your use of pi is actually a red herring, and that your modeling problem just needs more mappings and queries and intermediate schemas and possibly non-equational constraints, rather than pi with active domain semantics.  


David Spivak

Jun 14, 2019, 7:56:33 AM
to categor...@googlegroups.com
> My guess is that your use of pi is actually a red herring, and that your modeling problem just needs more mappings and queries and intermediate schemas and possibly non-equational constraints, rather than pi with active domain semantics.
That's my intuition as well. Perhaps it's a pi followed by a sigma, for example.

If you could somehow translate your question from physics into one about "persons and feet and toes" or something (every toe is on a foot, every foot is on a person): 
  • what the two schemas are supposed to "mean" in that context, 
  • what the mapping represents,
  • what the intended input instance is meant to represent and a simple example, 
  • what the intended output instance is supposed to mean and the intended output instance
that would help me help you get there. 

Ryan Wisnesky

Jun 20, 2019, 1:43:29 AM
to categor...@googlegroups.com
The latest jar file can now be used from the command line by passing it a filename as argument:

java -Dnashorn.args="--no-deprecation-warning" -cp cql.jar catdata.aql.AqlCmdLine ~/Documents/GitHub/cql/resources/examples/cql/Delta.cql

Still, it’s recommended to emit data of interest as e.g., CSV or SQL rather than attempt to parse the output, since the textual output may not contain all information that would be exported (it is meant to be human readable, rather than complete.)
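For a processing chain, the command above could be wrapped from Python roughly as follows. This is a sketch: the helper names are hypothetical, the jar is assumed to sit in the working directory as `cql.jar`, and it is an assumption that the human-readable report arrives on standard output. In line with the advice above, the captured text is best treated as a log, with the data itself exported separately from within the CQL program.

```python
import os
import subprocess


def cql_command(cql_file, jar="cql.jar"):
    """Build the argument list for the command-line driver shown above."""
    return [
        "java",
        "-Dnashorn.args=--no-deprecation-warning",
        "-cp",
        jar,
        "catdata.aql.AqlCmdLine",
        # No shell is involved, so expand a leading "~" ourselves.
        os.path.expanduser(cql_file),
    ]


def run_cql(cql_file, jar="cql.jar"):
    """Run a CQL file and return whatever the driver printed (assumed stdout)."""
    completed = subprocess.run(
        cql_command(cql_file, jar),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout


# Only print the command here; call run_cql(...) where Java and the jar exist.
print(cql_command("Delta.cql"))
```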