Dynamic schemas


Jonathan Shapiro

Sep 17, 2023, 6:04:58 PM
to Cap'n Proto
We're working on multiple applications with dynamic schemas: schemas where customers need to be able to add or delete fields to meet their particular requirements. This sort of thing impacts both the database layer and the wire schema. I can think of an implementation that doesn't suck (see below), but I'm wondering what the best way would be to handle this sort of thing in Cap'n Proto.

The doesn't-suck approach:

In this application we can restrict in advance what the valid types are for these new columns, which would allow us to handle the dynamic fields as a list of struct/union values, each carrying its field name as a string or something similar. It's not an especially elegant way of describing the protocol, but it covers the use cases. It isn't necessary that the bits on the wire be "pretty"; it's reasonably easy to decode into JavaScript objects on the client, and it doesn't seem like a huge lift on the server to validate that all expected extension fields are present and have values of the expected type.
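As a rough sketch of what that could look like in schema form (this is a fragment, and all type and field names here are illustrative, not from the actual application):

```capnp
# Hypothetical sketch of the tagged-value approach; the allowed value
# types would be whatever the application restricts them to in advance.
struct ExtensionField {
  name @0 :Text;            # customer-defined field name
  value :union {
    boolValue @1 :Bool;
    intValue @2 :Int64;
    floatValue @3 :Float64;
    textValue @4 :Text;
  }
}

struct Item {
  # ...statically-known fields would go here...
  extensions @0 :List(ExtensionField);
}
```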

Is there a better way to think about this?


Jonathan

Kenton Varda

Sep 18, 2023, 5:01:09 PM
to Jonathan Shapiro, Cap'n Proto
Hi Jonathan,

I suppose the first question is: Does your database actually need to understand the dynamic fields at all? Or does it just need to be able to echo them back to the client later on, with only the client actually understanding them?

If only the client really needs to know about them, then I'd suggest using an `AnyPointer` field which the client fills in however they want. (You could also use generics for the same effect wrapped in nicer PL theory.)
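Sketched in schema form (field and type names here are just illustrative), that might look like:

```capnp
# The server stores and echoes clientData without interpreting it.
struct Record {
  # ...fields the server understands...
  clientData @0 :AnyPointer;   # opaque to the server, echoed back as-is
}

# Or, with generics, the client chooses the payload type:
struct RecordOf(ClientData) {
  clientData @0 :ClientData;
}
```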

If the server needs to understand the fields -- perhaps, to index them, or something -- then one possibility is the client could upload its schema to the server in capnp schema format (schema.capnp). On the server, you could use the "dynamic API" (in C++, capnp/dynamic.h; some other implementations support it as well but not all) to load this schema dynamically and introspect the messages based on it.
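A hedged sketch of what that could look like on the server side, assuming the client has shipped its schema as a `schema.capnp` `Node` and the data message separately (function and variable names here are illustrative, not a definitive implementation):

```cpp
#include <capnp/dynamic.h>
#include <capnp/message.h>
#include <capnp/schema-loader.h>

// Illustrative sketch: load a client-supplied schema at runtime and
// walk a message's fields via the dynamic API.
void introspectMessage(capnp::schema::Node::Reader clientSchemaNode,
                       capnp::MessageReader& message) {
  capnp::SchemaLoader loader;
  capnp::StructSchema schema = loader.load(clientSchemaNode).asStruct();

  auto root = message.getRoot<capnp::DynamicStruct>(schema);
  for (capnp::StructSchema::Field field : schema.getFields()) {
    if (!root.has(field)) continue;
    capnp::DynamicValue::Reader value = root.get(field);
    // Dispatch on value.getType() to index or validate the field,
    // keyed by field.getProto().getName().
    (void)value;
  }
}
```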

Alternatively, you could of course just have the client send names and values, as you suggest. Though at that point you might even consider just using JSON (or one of the myriad binary alternative encodings of JSON) as the Cap'n Proto encoding may actually be adding more bloat than it saves.

It's hard to say what's the best approach without really understanding the details of the application, though.

-Kenton

--
You received this message because you are subscribed to the Google Groups "Cap'n Proto" group.

Jonathan Shapiro

Sep 19, 2023, 2:44:39 PM
to Cap'n Proto
Hi, Kenton!
 
> I suppose the first question is: Does your database actually need to understand the dynamic fields at all? Or does it just need to be able to echo them back to the client later on, with only the client actually understanding them?

Yes. It needs to be able to index them and to process queries that reference them as (e.g.) select criteria. While this can (mostly) be done with a single JSON column, the virtual columns required effectively amount to creating new DB columns anyway, at which point it isn't clear why the JSON column is providing benefit when used for this purpose.
 
> If the server needs to understand the fields -- perhaps, to index them, or something -- then one possibility is the client could upload its schema to the server in capnp schema format (schema.capnp). On the server, you could use the "dynamic API" (in C++, capnp/dynamic.h; some other implementations support it as well but not all) to load this schema dynamically and introspect the messages based on it.

I had not known about the dynamic API, and I will look into that. In the application at hand, the mechanism for adding fields is via the API. Both the server and the client know about the resulting change, so I would think that either side could send the revised schema to the other. I'm also having disgusting thoughts involving wasm, on-the-fly codegen, and orphaned baby ducks. :-)

An advantage of the list-of-tagged-value-types approach I mentioned might be that all current implementations already know how to handle it. If the schema is changing dynamically, we're going to have to interpret things in any case (at least at some level of abstraction). These fields are important, but generally few in number.
 
> Alternatively, you could of course just have the client send names and values, as you suggest. Though at that point you might even consider just using JSON (or one of the myriad binary alternative encodings of JSON) as the Cap'n Proto encoding may actually be adding more bloat than it saves.

Every time that comes up I think unpleasant thoughts about JavaScript and integers. Brendan's decision to represent all numbers as floats got him an implementation in three days, but every one of us has been paying for it every day since then. Even Brendan recognizes that this was an unfortunate call in hindsight.
 
> It's hard to say what's the best approach without really understanding the details of the application, though.

Here's an attempt at expansion.

Suppose you are building a production management application. Part of that is prioritizing what to build. The prioritization depends in part on physical aspects of the products - both for the build itself and for efficiency in gathering components. In consequence, fields are needed in the item descriptions that would be relegated to a text description in a more conventional inventory system. For tee shirts, for example, size, color, and type would usually be implicit in the SKU and spelled out in the description, but for production we need to sort and filter on those for inventory picking. Since you can't anticipate all of the products that will come along, it becomes necessary to let the customer deploying the management system figure out what their additional fields need to be.

And just to make matters fun, those fields vary from product to product, which has a way of turning the entire thing into an open union join. Which in turn has unpleasant consequences for implementing updates. But that's a whole different issue. The saving grace, such as it is, is that a given production flow is generally going to sort/select using the same criteria for all of the items it produces.


Jonathan