CQL operational use

86 views
Skip to first unread message

Tim Pierson

unread,
Apr 5, 2019, 2:49:32 PM4/5/19
to categoricaldata
Hi, 
  This is quite a cool project.  I've been sorting through the various papers and example code in the `IDE`, and while I'm beginning to understand various database operations in terms of (my rather limited) category theory, I remain in the dark about some aspects of the more mundane devops procedures:  

  If we have a warehousing integration problem whereby we have n schemata (currently online in some flavor of SQL) to integrate, is there a way to generate the CQL expressions that would define the schema and import mappings?  

  Algebraic Data Integration lists some sort of quotient syntax for computing pullbacks but I can't find that in the `PharmaColim_` sample:  what's the best way to compute pullbacks at scale?

  I understand that the engine can emit at lease some sql code; is it possible to emit the sql that would produce the database alterations specified in the pullback or does the CQL engine itself need to process the data?  If the latter, how's performance on semi-big-data scales (ie, 500gb)?

  In the above scenario, what's the best way to do iterative integration?  Say each month we ingested a new dataset: would it be best to keep all the source schemata and data separate and then, each month, compute a comprehensive pullback and export to a canonical representation? Or would it be better to integrate each dataset with the output of the previous pullback?  Of concern here is that our backend is Haskell and the datatypes by which we deal with database access are derived from the names in the schema; so we would want to minimize unnecessary breakage.

  Is there any news on publicly accessible artifacts of the Statebox collaboration?  A CQL dsl in Haskell would be cool . . .

Thanks for any info!

  
  

Ryan Wisnesky

unread,
Apr 5, 2019, 5:21:34 PM4/5/19
to categor...@googlegroups.com, Erik Post, Fred Eisele
Hi Tim,

Thank you for your kind words. I’ve answered your questions inline below.

> On Apr 5, 2019, at 2:49 PM, Tim Pierson <tim.p...@gmail.com> wrote:
>
> Hi,
> This is quite a cool project. I've been sorting through the various papers and example code in the `IDE`, and while I'm beginning to understand various database operations in terms of (my rather limited) category theory, I remain in the dark about some aspects of the more mundane devops procedures:
>
> If we have a warehousing integration problem whereby we have n schemata (currently online in some flavor of SQL) to integrate, is there a way to generate the CQL expressions that would define the schema and import mappings?

If you’re ok with importing all of the data in a SQL database, rather than landing just a part of it, you can import an entire SQL db at once with:

instance I = import_jdbc_all “jdbc://mysql/...etc”

That command also works at the schema level, if you just want the whole SQL DB’s schema.

> Algebraic Data Integration lists some sort of quotient syntax for computing pullbacks but I can't find that in the `PharmaColim_` sample: what's the best way to compute pullbacks at scale?

A pullback of sets in best done by evaluating a query - see the built-in ‘Pullback’ example. There is also syntax for quotienting an instance by an equivalence relation which is itself defined by evaluating a query - see the built-in Linkage example. Finally, a pushout/quotient of sets can also be done by co-evaluating a query - see the built-in Pushout example.

> I understand that the engine can emit at lease some sql code; is it possible to emit the sql that would produce the database alterations specified in the pullback or does the CQL engine itself need to process the data? If the latter, how's performance on semi-big-data scales (ie, 500gb)?

When possible, the CQL IDE displays the SQL code to evaluate a query in the viewer; this code can also be emitted programmatically and stored as a set of SQL views using

command c = export_jdbc_query Q “jdbc_string” "prefix1” “prefix2"

where the two prefixes are used to control the strings in the emitted SQL query. Fred (cc’d) may know more about this than I do, because his primary use case goes through SQL.

The open-source version of CQL processes all data internally - some CQL operations cannot be implemented in SQL (or rather, relational algebra). To run CQL at multi-node scale you would either have to switch to the commercial product (which can target e.g., Spark; happy to chat about that), or switch to the old archived version of FQL, which can be used as a command line compiler targeting SQL but does not support operations such as instance co-limits. CQL also includes EASIK (see the tools menu), an orphaned project we adopted when its primary author retired, which can be used to generate triggers to implement pullbacks and other constraints.

> In the above scenario, what's the best way to do iterative integration? Say each month we ingested a new dataset: would it be best to keep all the source schemata and data separate and then, each month, compute a comprehensive pullback and export to a canonical representation? Or would it be better to integrate each dataset with the output of the previous pullback? Of concern here is that our backend is Haskell and the datatypes by which we deal with database access are derived from the names in the schema; so we would want to minimize unnecessary breakage.

Depending on what computation you are doing incrementally several different approaches may be appropriate. Sigma, for example, can be approximated incrementally but not actually computed incrementally, except in certain cases, and the approximation algorithm is in our commercial product. Alternatively, you may be in a situation where incremental computation just means running the same CQL program each month and adding the output to the previous month. I’d be happy to chat offline about your scenario.

> Is there any news on publicly accessible artifacts of the Statebox collaboration? A CQL dsl in Haskell would be cool . . .

I’ve cc’d Erik Post, who contributed to CQL and is now at Statebox. Last I heard, CQL-in-Haskell (which is a fragment of CQL-in-java) was scheduled to be released through the state box foundation under a GPL license. In fact, I owe them some comments and code clean up...

> Thanks for any info!

Regards,
Ryan

>
>
> --
> You received this message because you are subscribed to the Google Groups "categoricaldata" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to categoricalda...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Tim Pierson

unread,
Apr 11, 2019, 1:23:40 PM4/11/19
to categoricaldata
Hey Ryan,
  Thanks for the info, that was quite helpful.  We're going to proceed on our side and I'll reach out privately when we're a little closer to working out the details of our use case.
Thanks!
> To unsubscribe from this group and stop receiving emails from it, send an email to categor...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages