Feature request: Adding support to simplify reading and updating index tables

90 views
Skip to first unread message

Ziju Feng

unread,
Oct 1, 2014, 8:41:19 PM10/1/14
to java-dri...@lists.datastax.com
Hi,

As secondary index has its limitation, people usually have to create separate index tables to support different queries on the same entity, either with denormalized values or just the primary key. I think it can save a lot of effort and avoid many boilerplate codes if the driver has support for this kind of use case, such as the "Scalable Equals Only Index" pattern in PlayOrm

When dealing with denormalized data, it is always not trivial to update or delete the denormalized data when there is change to the entity data. In the cases that the rows of index tables that need to be updated cannot be derived from entity data on its own, such as the like relation use case in ebay, there must be a reverse-index table to look up anyway, and the life will be much easier if the driver can do all the work of managing the reverse-index table instead.

For example, the driver can maintain a reverse-index table for item:
CREATE TABLE item_reverse_idx(
  id text,
  index text,
  field_mapping map<text,text>,
  PRIMARY KEY (id, index)
);
where 'index' is a serialized string (such as JSON) of table name + primary key for each row of those index tables that contains denormalized values of a given item, so that whenever the values of an entity change, the driver can look up the reverse index table by item id and generate a set of update statements which we can then use to update the index tables, either synchronously or asynchronously in the background if the number of writes are large and we want to keep response time low, in a sort of eventual consistence manner. The 'field_mapping' column stores the mapping of names from entity table columns to index table columns in case they have different names. 

There could even be a reverse-index table for each field of item so that update statements are generated only when it is absolutely necessary. For example, in the ebay like relation example, if the user_by_item table also contains the price of each item and there are other index tables that only contain the name of items, it will be reasonable performance-wise to track name and price in different reverse-index tables or in different shards of the table as price may be updated much more frequently than name. We can create the following table:
CREATE TABLE item_reverse_idx_byfield(
  id text,
  field text,
  index text,
  field_mapping text,
  PRIMARY KEY ((id, field), index)
);
When we need to update the price of a given item, we can do "SELECT * FROM item_reverse_idx_byfield WHERE id=? and field='price'", which may return much fewer results than "SELECT * FROM item_reverse_idx_byfield WHERE id=? and field='name'" or "SELECT * FROM item_reverse_idx where id=?", and thus we can minimize the number of writes. 

The reverse-index table can be maintained and updated along with index tables. When values of an entity need to be added to a index table, we can generate a row in the reverse-index table and batch update them. We can also delete a row from reverse-index table when an entity's values need to be removed from a index table. Since we always know and need to know the table name and primary key of the row of updated/inserted index table, we can always insert/delete the record into/from the reverse-index table. 

For the normalization case where only the primary key of entities are kept in the index tables, it will be helpful if the accessor interface can perform eager fetching, either based on annotations or the return type of the method, so that we don't need to issue separate select statements in our codes and it can also help avoid using the wrong multi-get mechanism (I suppose sending multiple select by id requests in parallel is a better practice than one select in request in Cassandra?).

With this feature added, I think using Cassandra as a general purpose database and as a replacement of RDBMS can be dead simple in most cases.

Thanks,

Ziju

Alex Popescu

unread,
Oct 1, 2014, 9:06:57 PM10/1/14
to java-dri...@lists.datastax.com
Thanks a lot for sharing your ideas with us Ziju!

I've been thinking on and off about something in this direction. My concerns are around the following questions:

1. what would be the best way, from an API pov, to describe the index tables? (explicit, meaning the user has full control over it; or based on the accessors defined by the user to work with given entities?)
2. what kind of consistency and isolation guarantees should such index tables operate under? (should updates to the main entity and index tables be part of a batch or can they be applied async, etc.)

As you can see there are quite a few open questions. And there are others that I've probably forgotten to mention. As a matter of personal taste, I really like powerful tools, but I also think that performance and too much magic don't mix well :-).

To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-us...@lists.datastax.com.



--

[:>-a)


Alex Popescu
Sen. Product Manager @ DataStax
@al3xandru

Ziju Feng

unread,
Oct 2, 2014, 8:25:30 PM10/2/14
to java-dri...@lists.datastax.com
Hi Alex, here are my two cents regarding your concerns:

If we skip those relational concepts like foreign key, oneToMany/ManyToMany relations, which Cassandra is not supposed to support anyway, each row of a table can be treated as an entity(well... sort of). Then the only difference between an index table and an entity table is that an index table row contains some extra index information other than the primary key of the entity it is mapped to (while it may lack some other descriptive columns). Therefore we can treat the main entity table and index tables almost the same from mapper pov.

To allow mapping multiple index tables to the same entity, we can introduce annotations like @IndexTable to entity class and have information of each index table, such as table name, column alias, annotated on the same class, so that both the user and the driver know which class a row of an index table should be mapped to and which index tables the entity values are denormalized to. 

Accessors can use the same syntax for read from both the main entity table and index tables, i.e using the entity class or Result as return type, as their rows are all mapped to the same entity, though index tables may not use all of their primary keys for mapping underneath.

To handle updates to entity table, the driver can add another return type class, say DenormalizedStatements, which contains both the statement of updating entity table and a list of statements of updating index tables, so that user can decide when to use batch and when to send them async. The actual BoundStatements of updating index tables can be generated based on the query that updates the main entity table and annotations of how entity fields correspond to index table columns. For example, for an accessor method annotated with query "UPDATE ks.posts SET content=? WHERE post_id=?", if table ks.posts is associated with entity class 'Post' and the 'Post' class is also associated with index table 'ks.featured_posts', the driver can generate statement "UPDATE ks.featured_posts SET content=? WHERE type=? AND rank=?" to update the index table, assuming (type,rank) is its primary key.

For accessor methods that insert entity instances into or delete them from index table, the driver can update the reverse index based on the given primary key parameters. The driver may choose to expose the existence of reverse index table by allowing return type of a pair of statements, one for updating index table and one for updating reverse index table, though users may live with the driver updating both tables in a batch by default as otherwise they still need to write to a reverse index kind of table themselves anyway in order to be able to update the denormalized values in the future. 

The driver can use the same consistency level of queries to the main entity table for index/reverse index tables by default and can also allow users to choose consistency levels, either through annotations on entity class, the same way as readConsistency element of @Table, or through QueryParameters on accessor methods. For isolation, the driver can do all the work in a batch if the return type is Result, entity class, ResultSet, ListenableFuture etc., or allow user to have full control by returning Statement kind of type, like the "DenormalizedStatements" mentioned above.

All the ideas above are made under the assumption that entities are stored as a row of either entity table or index table. In case of nested entity where one entity can have another entity as field, I think using UDT is the way to go and we can allow entity class to also be annotated with UDT so that the driver can use this information to maintain reverse-index table and update index tables.

Andrew Fitzgerald

unread,
Jun 26, 2016, 12:12:06 AM6/26/16
to DataStax Java Driver for Apache Cassandra User Mailing List
Got directed to this thread from some relevant jira issues: https://datastax-oss.atlassian.net/browse/JAVA-549 and https://datastax-oss.atlassian.net/browse/JAVA-524

Does the addition of materialized views change this conversation at all?
We're currently using a lookup table for each "indexed" field, which means an additional pojo/mapper for each "indexed" field, and then using batch queries to update them transactionally-ish.

My understanding of materialized views is that this will allow us to simplify our schemas, and we can update everything by just updating the base table, but we would still require index-pojo's for querying by an attribute, and then continue using a bean mapping util to convert it to the primary domain class?

Alexandre Dutra

unread,
Jul 11, 2016, 1:05:47 PM7/11/16
to DataStax Java Driver for Apache Cassandra User Mailing List
Hello Andrew,

The addition of materialized views doesn't fundamentally change the landscape.

But even if the mapper doesn't have full support for index tables / materialized views, it doesn't mean you need to create one different entity for each one of your index tables, or materialized views. 

There is indeed one smart trick: define a @Table entity that maps to your base table, and then use @Accessor queries to query your index tables / materialized views. The accessor will be smart enough to map results from your index tables / materialized views into the primary entity, as long as your index tables / materialized views have similar columns.

The driver test suite has an example of that.

Thanks,

Alexandre


--
You received this message because you are subscribed to the Google Groups "DataStax Java Driver for Apache Cassandra User Mailing List" group.

To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-us...@lists.datastax.com.
--
Alexandre Dutra
Driver & Tools Engineer @ DataStax
Reply all
Reply to author
Forward
0 new messages