An RDB organises data efficiently for events occurring at irregular
times (i.e. the n'th event occurs at time t(n), where t(n) is a random
variable and t(n)-t(n-1) can vary considerably).
For example, we may have the price of the n'th transaction, P(n), the
associated volume, and a large number of other characteristics C of the
stock at that instant - say of the order of C=100. In addition to data
describing the characteristics of irregularly spaced events, we also
have I entities (e.g. many stocks, say of the order of I=100). The
timings of events are of course not the same across entities, which
means that if N(i) is the number of events for the i'th entity (trades
of the i'th stock), then the number of all times of interest M (the
times at which any one of the I stocks trades) is such that M>>N(i) for
any typical i.
For example, M might be 10*N(i) for a liquid stock, and N(i) may be
around 20,000 per year for such a stock. Just to make sure you are
still with me, think of the price of the last transaction in each stock
just before each transaction in any stock. That is a much larger number
of prices than the number of actual transaction prices across all
stocks.
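To make that lookup concrete, here is a rough Matlab sketch for a
single stock, with toy numbers standing in for what would come out of
the RDB (all of the variable names are purely illustrative):

    % Toy data: event times and prices for stock i, sorted by time,
    % plus the union of event times across all I stocks.
    t_i = [1.0; 3.5; 7.2];
    P_i = [100.1; 100.4; 100.2];
    allTimes = [0.5; 1.0; 2.0; 3.5; 5.0; 7.2; 9.0];

    % Index of stock i's last event at or before each of the M times
    % (0 if the stock has not traded yet at that time).
    [~, lastIdx] = histc(allTimes, [t_i; Inf]);

    lastPrice = nan(numel(allTimes), 1);   % NaN before the first trade
    lastPrice(lastIdx > 0) = P_i(lastIdx(lastIdx > 0));

The same lookup works for any of the C characteristics, not just the
price.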
If we wish to study statistical relationships across time and entities
simultaneously (e.g. correlations in returns across stocks as well as
each stock's own return autocorrelations), we enter the field of panel
data analysis. Such analyses typically require that at a particular
point in time of interest (e.g. the time at which the n'th trade in the
i'th stock occurs), we know the characteristics of interest for both
the i'th stock and all the other stocks. This means we need the values
of the C characteristics of stock i at M points in time rather than at
the N(i) points contained in the RDB, which is all we would need to
study each stock in isolation (a pure time-series analysis rather than
a panel data analysis).
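In other words, for each characteristic I effectively need an M-by-I
matrix. The sketch below shows roughly how such a panel can be built,
extending the single-stock lookup above to all I stocks (again with toy
data and illustrative names rather than my actual schema):

    % eventTimes{i}, eventPrice{i}: times and one characteristic (here
    % the price) for stock i, sorted by time, as pulled from the RDB.
    eventTimes = {[1.0; 3.5; 7.2], [2.0; 5.0]};
    eventPrice = {[100.1; 100.4; 100.2], [50.3; 50.1]};

    I = numel(eventTimes);
    allTimes = unique(vertcat(eventTimes{:}));  % sorted union of times
    M = numel(allTimes);

    panelPrice = nan(M, I);        % one M-by-I matrix per characteristic
    for i = 1:I
        [~, lastIdx] = histc(allTimes, [eventTimes{i}; Inf]);
        ok = lastIdx > 0;          % NaN until stock i's first event
        panelPrice(ok, i) = eventPrice{i}(lastIdx(ok));
    end

With C characteristics this becomes C such matrices (or one M-by-I-by-C
array).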
This panel data format can inflate the size of the database by about
100 times, making it expensive even to store the panel data
persistently, let alone hold significant chunks of it in memory. On the
other hand, determining the values of the C characteristics across the
I stocks at a set of points in time of interest via queries to the RDB
can be an extremely slow process.
My current approach is to severely limit the number of characteristics
C analysed, generate the panel data for those few characteristics at
the M points in time, store it persistently, then bring as much of this
panel data into memory as I can at any one time (8GB), and finally
analyse the data chunk by chunk. My RDB is an SQL database and the data
analysis is done in Matlab on a 64-bit machine.
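For what it is worth, the chunk-by-chunk pass looks roughly like the
sketch below. It assumes a Matlab recent enough to have matfile for
partial loading of -v7.3 .mat files; the file name, the variable name
and the X'*X accumulation are just placeholders for my actual store and
analysis:

    % Chunked pass over a persistently stored M-by-I panel matrix,
    % reading only rowsPerChunk rows at a time so the working set
    % stays well under the 8GB of memory.
    m = matfile('panel.mat');            % no data is read at this point
    [M, I] = size(m, 'panelPrice');

    rowsPerChunk = 1e6;                  % tune to available memory
    acc = zeros(I, I);                   % accumulate X'*X across chunks
    for r0 = 1:rowsPerChunk:M
        r1 = min(r0 + rowsPerChunk - 1, M);
        chunk = m.panelPrice(r0:r1, :);  % reads only these rows from disk
        chunk(isnan(chunk)) = 0;         % crude handling of pre-first-trade NaNs
        acc = acc + chunk' * chunk;      % stand-in for the real analysis
    end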
Has anyone found a more satisfying solution to this problem? Are other
people struggling with this?
Thanks,
~S.