feature wishlist/proposal: a database representation for collections of snapshots

Yarden Katz

Aug 15, 2019, 10:11:12 AM
to kappa-users
Hi all,

This is a follow-up to this GitHub issue (https://github.com/Kappa-Dev/KaSim/issues/598). I was hoping to start a discussion about the right representation for storing large collections of snapshots. I'm curious to hear what others think about this.

Some background: many analyses rely on examining the states of a Kappa simulation through time. This can be done by dumping snapshots, e.g. in JSON format. As noted in that issue, the JSON representation (and this applies to the alternative .ka format too) is wasteful: even with the simplest compression, a snapshot file can be reduced ~100x in size. One feature that would be helpful is dumping snapshots in a compressed format.
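
To be concrete about what I mean by "the simplest compression", the Python sketch below just gzips an already-dumped snapshot after the fact (the file name is only an example of what $SNAPSHOT might have produced):

# Rough sketch: gzip a dumped snapshot after the fact.
# The file name is illustrative.
import gzip
import shutil

with open("snap_1000.json", "rb") as src, gzip.open("snap_1000.json.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)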

The current way of getting snapshots in Kappa is flexible and terrific for many purposes, and compression would make it even better. But for analyses that require a large number of snapshots, even compressed snapshots are unwieldy. For example, if one takes a snapshot at each of the first million events of a Kappa program,

// get a snapshot for the first 1 million events
%mod: do $SNAPSHOT "snap".[E].".json"; repeat [E] < 1e6

this would require generating 1 million files.

I work regularly with such programs and they're challenging for a couple of reasons. First, having so many distinct files around is not a good idea, especially when they are logically connected (they're all part of the same "run" and should be organized as such). Second, JSON is best for small data that you can load into memory, but loading 1 million JSON files into memory is not viable. Instead, these snapshots should be accessed from disk using an index based on their metadata, like the snapshot's event number or time stamp.

To work with large numbers of snapshots, I've been (1) compressing them first (in the future this shouldn't be necessary), (2) using a hand-crafted "streaming" JSON reader to avoid loading everything into memory, and (3) making a first pass through the snapshots to collect metadata and then writing wrappers that load snapshots on demand from disk. In short, I found myself badly reinventing something like a relational database for a collection of snapshots.
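
Concretely, the wrappers in (2)-(3) look roughly like the sketch below: one pass builds an index from event number to file, and snapshots are then loaded from disk on demand. The file names and the "event" field are illustrative; the exact JSON layout depends on the KaSim version.

# Rough sketch: index compressed snapshot files by event number, load on demand.
import glob
import gzip
import json

def build_index(pattern="snap_*.json.gz"):
    index = {}
    for path in glob.glob(pattern):
        with gzip.open(path, "rt") as f:
            snap = json.load(f)      # in practice, stream only the metadata fields
        index[snap["event"]] = path
    return index

def load_snapshot(index, event):
    with gzip.open(index[event], "rt") as f:
        return json.load(f)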

In principle, the right solution would be to store the snapshots in a database. Consider something like a $COLLECT operator:

// collect snapshots into a database
%mod: do $COLLECT "snap".[E].".json" "my_snapshots.db"; repeat [E] < 1e6

KaSim would write to the database "my_snapshots.db" through time, avoiding the proliferation of files. The database would be indexed by all the relevant features - snapshot ID, event number, time, etc. - and could be queried from disk as usual, avoiding memory bottlenecks.

It would be great to do this with sqlite3 because it's portable, free, uses familiar SQL, and some languages (like Python) even have built-in libraries for reading/writing sqlite3 files. I've found working with sqlite3 very convenient, both as a user and as a programmer. However, I think storing snapshots in sqlite3 would quickly get out of hand, particularly for fairly large snapshots - unless there's an elegant schema I'm overlooking. It's just not meant for storing arbitrary graphs.
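
For concreteness, the kind of relational schema I have in mind is sketched below (table and column names are purely illustrative, not a proposed format). Every site of every agent of every complex of every snapshot becomes a row, which is where I suspect it gets out of hand:

# Purely illustrative sqlite3 schema sketch -- not a proposal for the actual format.
import sqlite3

conn = sqlite3.connect("my_snapshots.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS snapshots (snapshot_id INTEGER PRIMARY KEY,
                                      event INTEGER, time REAL);
CREATE TABLE IF NOT EXISTS complexes (complex_id INTEGER PRIMARY KEY,
                                      snapshot_id INTEGER REFERENCES snapshots,
                                      abundance INTEGER);
CREATE TABLE IF NOT EXISTS agents    (agent_id INTEGER PRIMARY KEY,
                                      complex_id INTEGER REFERENCES complexes,
                                      agent_type TEXT);
CREATE TABLE IF NOT EXISTS sites     (agent_id INTEGER REFERENCES agents,
                                      site_name TEXT,
                                      internal_state TEXT,
                                      bound_to INTEGER);  -- agent_id of the bond partner
CREATE INDEX IF NOT EXISTS idx_snapshots_event ON snapshots(event);
""")
conn.close()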

It seems like the right way to do this is to use a document database, like MongoDB, or a more specialized graph database such as Cayley (https://github.com/cayleygraph/cayley) - which implements a GraphQL-inspired query language (https://github.com/cayleygraph/cayley/blob/master/docs/GraphQL.md) - or Neo4j (https://neo4j.com/). Conceptually, a snapshot is essentially a "document": it contains a collection of graphs (complexes) and is associated with some metadata (event number, time stamp, etc.). A Kappa run produces a collection of documents (snapshots) that the user can later query with graph-aware queries. I don't have hands-on experience with these databases, and I'm curious to hear people's thoughts.
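
To make the "snapshot as document" idea concrete, here is a rough sketch of one snapshot as one document, using pymongo against a local MongoDB. The document layout is just one possibility I made up for illustration, not a proposed format:

# Sketch only: one snapshot = one document, with metadata fields plus the
# complexes stored as small graphs (agents + bonds). Layout is illustrative.
from pymongo import MongoClient

snapshots = MongoClient("mongodb://localhost:27017")["kappa_run"]["snapshots"]

doc = {
    "event": 1000,
    "time": 12.5,
    "complexes": [
        {"abundance": 3,
         "agents": [{"id": 0, "type": "A", "sites": {"x": {"state": "u"}}},
                    {"id": 1, "type": "B", "sites": {"y": {"state": "p"}}}],
         "bonds": [[[0, "x"], [1, "y"]]]},
    ],
}
snapshots.insert_one(doc)
snapshots.create_index("event")   # so queries by event number / time hit an index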

In theory, one could avoid snapshots altogether and use the trace, which should be the minimal object from which every query about the stochastic simulation can be answered. But in practice, the trace is also a very large file and requires elaborate machinery to access (e.g., the Trace Query Language, which is not part of Kappa itself). The trace also contains much more information than is needed for many queries that can easily be answered at the snapshot/state level (e.g., collecting statistics about the sizes of various complexes). I think that storing a collection of snapshots in a database would be a very reasonable intermediate between getting a handful of snapshots (which Kappa already supports) and getting a total description of the run (the trace).
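
As an example of a statistic that is trivial at the snapshot level and doesn't need the trace, here is a sketch of a complex-size distribution. The snapshot is represented as a list of (abundance, agents) pairs, which is illustrative rather than the exact KaSim JSON layout:

# Sketch: complex-size distribution from a single snapshot.
# "snapshot" is an illustrative structure: a list of (abundance, agent_list) pairs.
from collections import Counter

def complex_size_distribution(snapshot):
    sizes = Counter()
    for abundance, agents in snapshot:
        sizes[len(agents)] += abundance
    return sizes

# e.g. complex_size_distribution([(3, ["A", "B"]), (10, ["A"])]) -> {2: 3, 1: 10}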

Of course, implementing this snapshot database would be a non-trivial addition to KaSim. But I think that once the database format is pinned down, it shouldn't require much further change unless the Kappa language itself changes. And I think it would enable analyses that are currently impractical.

Best,
Yarden

Héctor F

Aug 15, 2019, 5:41:01 PM
to kappa-users
My understanding of graph databases like Google's Knowledge Graph is that they serve to crosslink nodes between graphs of different types (e.g. the "Vatican" entry maps to nodes in the "Countries", "Locations", and "Religious Centers" graphs, all contributing different types of data). What do you envision the use being in Kappa?

Querying the KaDB™ for, say, some agent type, and getting all the complexes it was ever part of? Maybe with some constraints on time?

If Kappa exposed a parser as part of an API, say in Python, one would be able to run the simulation in Python, get the snapshot object, do some quick metadata extraction (say UUID, time, event number, API version), and add that to a document database, all without ever writing a file to disk. The document might just as well contain a binary object that can be parsed/read by the Kappa API itself. I could even port KaSaAn to this hypothetical API, thus sidestepping the need to go through any ASCII representation.
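
Something like the sketch below is what I have in mind. Everything on the kappa_api side is hypothetical and only meant to show the shape of the workflow, with pymongo standing in for whichever document database one ends up using:

# Hypothetical workflow sketch. "kappa_api" and its methods do not exist; they
# stand in for a Python API that exposes the simulator and a snapshot parser.
import uuid
from pymongo import MongoClient

import kappa_api  # hypothetical binding

sim = kappa_api.load_model("model.ka")              # hypothetical
snapshots = MongoClient()["kappa_run"]["snapshots"]

while sim.event < 1_000_000:
    sim.advance(events=1)                           # hypothetical
    snap = sim.snapshot()                           # in-memory object, no file on disk
    snapshots.insert_one({
        "uuid": str(uuid.uuid4()),
        "time": sim.time,
        "event": sim.event,
        "api_version": kappa_api.__version__,
        "snapshot": snap.to_binary(),               # opaque blob, readable by the same API
    })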

Best,
Hector

Yarden Katz

Aug 15, 2019, 8:13:40 PM
to kappa...@googlegroups.com
On 8/15/19 5:41 PM, Héctor F wrote:
My understanding of graph databases like Google's Knowledge Graph is that they serve to crosslink nodes between graphs of different types (e.g. the "Vatican" entry maps to nodes in the "Countries", "Locations", and "Religious Centers" graphs, all contributing different types of data). What do you envision the use being in Kappa?

Querying the KaDB™ for, say, some agent type, and getting all the complexes it was ever part of? Maybe with some constraints on time?

Every snapshot is a set of complexes, and each complex is a graph. Once you have that stored, you'd run all the same queries that you do on any snapshot - e.g., all the things you do in your KaSaAn package - such as: get all agents of type X that have their site z bound to an agent of type Y, get all complexes containing a node of type C that has its site x free, and so on. And sure, you can run all those queries indexed by snapshot time, event number, etc.
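
For example, continuing the illustrative pymongo layout from my first message, the metadata filtering would happen in the database, while the actual graph-pattern matching (site z of X bound to Y, and so on) would run KaSaAn-style on the returned complexes:

# Sketch: metadata filter in the database, graph-pattern matching in application code.
# Field names follow the illustrative document layout from the earlier sketch.
from pymongo import MongoClient

snapshots = MongoClient()["kappa_run"]["snapshots"]

# all snapshots up to time 50 that contain at least one agent of type "X"
for snap in snapshots.find({"time": {"$lte": 50.0}, "complexes.agents.type": "X"}):
    for complex_ in snap["complexes"]:
        pass  # KaSaAn-style pattern matching over complex_["agents"] / complex_["bonds"]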


If Kappa exposed a parser as part of an API, say in Python, one would be able to run the simulation in Python, get the snapshot object, do some quick metadata extraction (say UUID, time, event number, API version), and add that to a document database, all without ever writing a file to disk. The document might just as well contain a binary object that can be parsed/read by the Kappa API itself. I could even port KaSaAn to this hypothetical API, thus sidestepping the need to go through any ASCII representation.

I think it'd be more versatile to just serialize a database in a widely readable format rather than tie these features to a specific language API.


