Hi,
As secondary indexes have their limitations, people usually have to create separate index tables to support different queries on the same entity, holding either denormalized values or just the primary key. It could save a lot of effort and avoid a lot of boilerplate code if the driver supported this kind of use case directly, such as the "Scalable Equals Only Index" pattern in PlayOrm.
When dealing with denormalized data, it is not always trivial to update or delete the denormalized copies when the entity data changes. In cases where the rows of the index tables that need to be updated cannot be derived from the entity data alone, such as the 'like' relation use case at eBay, a reverse-index table is needed for the lookup anyway, and life would be much easier if the driver did all the work of managing that reverse-index table instead.
For example, the driver can maintain a reverse-index table for item:
CREATE TABLE item_reverse_idx (
    id text,
    "index" text,
    field_mapping map<text, text>,
    PRIMARY KEY (id, "index")
);
Here 'index' is a serialized string (such as JSON) of the table name plus the primary key for each row, in any index table, that contains denormalized values of a given item. (Note that index is a reserved word in CQL, so the column name needs to be quoted.) Whenever the values of an entity change, the driver can look up the reverse-index table by item id and generate a set of update statements, which we can then use to update the index tables, either synchronously or, if the number of writes is large and we want to keep response time low, asynchronously in the background in a sort of eventually consistent manner. The 'field_mapping' column stores the mapping from entity table column names to index table column names, in case they differ.
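To make the idea concrete, here is a minimal sketch (in Python, as an illustration of the logic rather than a real driver API) of how the driver could turn reverse-index rows into update statements. The in-memory row shape and the JSON layout of the 'index' column are my own assumptions:

```python
import json

def build_updates(changed_fields, reverse_rows):
    """changed_fields: {entity_column: new_value}.
    reverse_rows: rows read from item_reverse_idx, where row["index"] is a
    JSON string like '{"table": "user_by_item", "key": {"user_id": "u1"}}'
    and row["field_mapping"] maps entity columns to index-table columns."""
    statements = []
    for row in reverse_rows:
        target = json.loads(row["index"])   # index table name + primary key
        mapping = row["field_mapping"]      # entity column -> index column
        # Keep only the changed columns this index table actually denormalizes.
        assignments = {mapping[c]: v for c, v in changed_fields.items()
                       if c in mapping}
        if not assignments:
            continue
        set_clause = ", ".join("%s = ?" % col for col in sorted(assignments))
        where_clause = " AND ".join("%s = ?" % k for k in sorted(target["key"]))
        cql = "UPDATE %s SET %s WHERE %s" % (target["table"], set_clause,
                                             where_clause)
        values = ([assignments[c] for c in sorted(assignments)] +
                  [target["key"][k] for k in sorted(target["key"])])
        statements.append((cql, values))
    return statements
```

Given one reverse-index row for user_by_item, a price change would produce a single bound UPDATE against that table, which the driver could then execute synchronously or queue for background execution.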
There could even be a reverse-index table per field of item, so that update statements are generated only when absolutely necessary. For example, in the eBay 'like' relation example, if the user_by_item table also contains the price of each item while other index tables only contain item names, it is reasonable performance-wise to track name and price in different reverse-index tables, or in different partitions of the same table, since price may be updated much more frequently than name. We could create the following table:
CREATE TABLE item_reverse_idx_byfield (
    id text,
    field text,
    "index" text,
    field_mapping text,
    PRIMARY KEY ((id, field), "index")
);
When we need to update the price of a given item, we can run "SELECT * FROM item_reverse_idx_byfield WHERE id=? AND field='price'", which may return far fewer rows than "SELECT * FROM item_reverse_idx_byfield WHERE id=? AND field='name'" or "SELECT * FROM item_reverse_idx WHERE id=?", and thus minimize the number of writes.
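The driver could pick the narrowest reverse-index partition automatically. A small sketch of that decision (the table and column names come from the DDL above; everything else is illustrative):

```python
def reverse_index_query(item_id, changed_fields):
    """Return (cql, bind_values) for the cheapest reverse-index lookup.
    A single-field change reads only that field's partition; a multi-field
    change falls back to the whole-item reverse index."""
    if len(changed_fields) == 1:
        field = next(iter(changed_fields))
        return ("SELECT * FROM item_reverse_idx_byfield WHERE id = ? AND field = ?",
                [item_id, field])
    return ("SELECT * FROM item_reverse_idx WHERE id = ?", [item_id])
```

So a lone price change touches only the (id, 'price') partition, while a bulk edit of several fields reads the full reverse index once.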
The reverse-index table can be maintained and updated alongside the index tables. When an entity's values need to be added to an index table, we can generate a row in the reverse-index table and batch the two writes together. Likewise, we can delete a row from the reverse-index table when an entity's values are removed from an index table. Since we always know (and need to know) the table name and primary key of the index-table row being inserted or updated, we can always insert or delete the corresponding record in the reverse-index table.
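A sketch of that pairing, again in Python for illustration: the function below builds the index-table insert together with its reverse-index bookkeeping row, so the two statements can be sent as one logged batch and either both succeed or both fail. The statement shapes and the JSON serialization are assumptions:

```python
import json

def index_insert_batch(index_table, index_key, denormalized,
                       entity_id, field_mapping):
    """index_key: primary-key columns of the index-table row.
    denormalized: denormalized value columns for that row.
    Returns a list of (cql, bind_values) pairs to run in one logged batch."""
    cols = sorted(index_key) + sorted(denormalized)
    insert_idx = "INSERT INTO %s (%s) VALUES (%s)" % (
        index_table, ", ".join(cols), ", ".join("?" for _ in cols))
    # The reverse-index row records which index-table row to touch later.
    serialized = json.dumps({"table": index_table, "key": index_key},
                            sort_keys=True)
    insert_rev = ('INSERT INTO item_reverse_idx (id, "index", field_mapping) '
                  'VALUES (?, ?, ?)')
    return [(insert_idx,
             [index_key[c] for c in sorted(index_key)] +
             [denormalized[c] for c in sorted(denormalized)]),
            (insert_rev, [entity_id, serialized, field_mapping])]
```

Deletion would be symmetric: a DELETE from the index table batched with a DELETE of the matching reverse-index row.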
For the normalized case, where only the primary keys of entities are kept in the index tables, it would be helpful if the accessor interface could perform eager fetching, based either on annotations or on the return type of the method, so that we don't need to issue separate select statements in our code. It would also help us avoid the wrong multi-get mechanism (I suppose sending multiple select-by-id requests in parallel is better practice in Cassandra than a single SELECT ... WHERE id IN (...)?).
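The multi-get the accessor could perform under the hood might look like this sketch: one single-partition select per primary key, issued in parallel, instead of one IN query that forces a single coordinator to fan out. fetch_one here is a stand-in for whatever select-by-id call the driver exposes:

```python
from concurrent.futures import ThreadPoolExecutor

def multi_get(fetch_one, ids, max_workers=8):
    """Fetch each id with its own single-partition request, in parallel.
    Results come back in the same order as ids."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, ids))
```

Each request can then be routed to a replica that owns that partition, which is the usual argument for parallel point reads over a large IN clause.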
With this feature added, I think using Cassandra as a general-purpose database and as a replacement for an RDBMS could be dead simple in most cases.
Thanks,
Ziju