Dbmaint.q

0 views

Skip to first unread message

Venice Sassone

unread,

Aug 4, 2024, 9:31:49 PM8/4/24

to tiotinkrovern

Thispaper illustrates the flexibility of kdb+ and how instances can be customized to meet the practical needs of business applications across different financial asset classes. Given the vast extent to which kdb+ can be customized, this document focuses on some specific cases relating to the capture and storage of intra-day and historical time-series data. Many of the cases described should resonate with those charged with managing large-scale kdb+ installations. These complex systems can involve hundreds of kdb+ processes and/or databases supporting risk, trading and compliance analytics powered by petabytes of data, including tick, transaction, pricing and reference data.

The goal of the paper is to share implementation options to those tasked with overseeing kdb+ setups, in the hope that the options will allow them to scale their deployments in a straightforward manner, maximize performance and enhance the end-user experience. Seven scenarios are presented focused on tuning the technology using some common and uncommon techniques that allow for a significantly more efficient and performant system. The context where these techniques would be adopted is discussed, as well as the gains and potential drawbacks of adopting them.

Where appropriate, sample code snippets are provided to illustrate how the technique might be implemented. The code examples are there for illustrative purposes only and we recommend that you always check the use and effect of any of the samples on a test database before implementing changes to a large dataset or production environment.

Code examples are intended for readers with at least a beginner-level knowledge of kdb+ and the q language. However we encourage any interested reader to contact us at lectur...@firstderivatives.com with questions regarding code syntax or any other aspect of the paper.

In some kdb+ installations, table schemas change frequently. This can be difficult to manage if the installation consists of a splayed partitioned database, since each time the schema changes you have to manually change all the historical data to comply with the new schema. This is time-consuming and prone to error. dbmaint.q is a useful tool for performing these manual changes.

To reduce the administrative overhead, the historical database schema can be updated programmatically as intra-day data is persisted. This can be achieved by invoking function updateHistoricalSchema (below), which connects to the historical database and runs locally-defined functions to add and remove tables, add and remove columns, reorder columns and change their types. It does this by comparing the table schemas in the last partition with those of the other partitions, and making the necessary adjustments to the other partitions.

The function below iterates over each table-column pair in all partitions except for the latest one. It makes sure all columns are present and if not, creates the column with the default value of the column type in the latest partition.

The following function deletes columns in earlier partitions that are not in the last partition. It does this by iterating over each of the tables in all partitions, except for the last one, getting the columns that are in that partition but not in the last partition, and deletes them.

The function below re-orders the columns by iterating over each of the partitions except for the last partition. It checks that the column order matches that of the latest partition by looking at the .d file. If there is a mismatch, it modifies the .d file to the column list in the last partition.

The following function iterates over every table-column pair in all partitions except for the last. It checks that the type of the column matches that of the last partition and if not, it casts it to the correct type.

To apply the parted attribute to a list, all occurrences in the list of each value \(n\) must be adjacent to one another. For example, the following list is not of the required structure and attempting to apply the parted attribute to it will fail:

Likewise, to apply the unique attribute, the items of the list must be unique and to apply the sorted attribute, the items of a list must be in ascending order. The grouped attribute does not dictate any constraints on the list and the attribute can be applied to any list of atoms.

When the unique, parted or grouped attribute is set on a list, q creates a hash table alongside the list. For a list with the grouped attribute, the hash table maps the items to the indexes where they occur; for a parted list it maps to the index at which the item starts; and for a list with the unique attribute applied, it maps to the index of the item. The sorted attribute merely marks the list as sorted, so that q knows to use binary search on functions such as =, in , ?, and within.

Now we explore the use of attributes on splayed partitioned tables. The most common approach with regard to attributes in splayed partitioned tables is to set the parted attribute on the security identifier column of the table, as this is usually the most commonly queried column. However, we are not limited to just one attribute on splayed partitioned tables. Given the constraint required for the parted and sorted attributes, as explained above, it is rarely possible to apply more than one parted, more than one sorted or a combination of the two on a given table. There would have to be a specific functional relationship between the two data sets for this to be feasible. That leaves grouped as a viable additional attribute to exist alongside parted or sorted.

To illustrate the performance gain of using attributes, the following query was executed with the various attributes applied, where the number of records in t in partition x was 10 million, the number of distinct values of c in partition x is 6700 and c was of type enumerated symbol. Note that the query itself is not very useful but does provide a helpful benchmark.

However, there is a trade-off with disk space. The column with the grouped attribute applied consumed three times as much space as the column with no attribute. The parted attribute applied consumed only 1% more than no attribute. Parted and sorted attributes are significantly more space-efficient than grouped and unique.

Suppose the data contains IDs which for a given date are unique, then the unique attribute can be applied regardless of any other attributes on the table. Taking the same dataset as in the example above with c (of type long instead of enumerated symbol) unique for each date and the unique attribute applied, the same query is 42 times faster with the attribute compared with no attribute. However, the column consumes five times the amount of disk space.

Sorting the table by that same column and setting the sorted attribute results in the same query performance as with the unique attribute, and the same disk space consumed as with no attribute. However, the drawback of using sorted, as with parted for reasons explained above, is not being able to use the sorted or parted attributes on any other columns.

It is often the case that a kdb+ application receives an ID field which does not repeat and contains characters that are not numbers. The question is which datatype should be used for such a field. The problem with using enumerated symbols in this case is that the enumeration file becomes too large, slowing down searches and resulting in too much memory being consumed by the enumeration list.

The problem with using char vector is that it is not an atom, so attributes cannot be applied, and searches on char vectors are very slow. In this case, it is worth considering GUID, which can be used for storing 16-byte values.

As the data is persisted to disk, or written to an intra-day in-memory instance, the ID field can be converted to GUID and persisted along with the original ID. The following code shows how this would be done in memory. It assumes that the original ID is a char vector of no more than 16 characters in length.

If the number of characters in the ID is not known in advance, one should consider converting the rightmost 16 characters to GUID (or leftmost depending on which section of the ID changes the most), and keeping the remaining 16 character sections of the ID in a column consisting of GUID lists. An alternative is to just convert the rightmost 16 characters to GUID and use the original ID in the query as follows.

Moreover, by having separate processes for intra-day and historical data, it is often necessary to introduce a gateway as a single point of contact for querying intra-day and historical data, thereby running three processes for what is essentially a single data set.

However, one of the main arguments against combining intra-day and historical data is that large I/O-bound queries on the historical data prevent updates being sent from the feeding process. This results in the TCP/IP buffer filling up, or blocking the feeding process entirely if it is making synchronous calls.

The code snippet below provides a function which simply writes the contents of the in-memory tables to disk in splayed partitioned form, with different names to their in-memory equivalent. The function then memory maps the splayed partitioned tables.

The above will result in an additional (splayed partitioned) table being created for each non-keyed in-memory table. When combining intra-day and historical data into one process, it is best to provide functions to users and external applications which handle the intricacies of executing the query across the two tables (as a gateway process would), rather than allow them to run raw queries.

Let us now assume there are memory limitations and you want the intra-day data to be persisted to disk multiple times throughout the day. The following example assumes the intra-day and historical data reside in the same process, as outlined in the previous section. However in most cases large volumes of data dictate that separate historical and intra-day processes are required. Accordingly, minor modifications can be made to the code below to handle the case of separate intra-day and historical processes.