multidimensional db concepts and hypertable

20 views
Skip to first unread message

Mateusz Berezecki

unread,
May 1, 2008, 4:17:44 AM5/1/08
to Hypertable User
Hello list readers,

I'd like to know how do the concepts of multidimensional db apply to
hypertable?

Is it the case that dimensions are mapped to column families?
If yes, than how do the columns within a family map to the mddb
concepts?

What I'm specifically asking about is how would you model a n-
dimensional cube
for e.g. Sales measure with dimensions of Item, Price, Region as
dimensions
in a hypertable.

I'm relatively new to this field, hence the question.

best regards,
Mateusz Berezecki

Doug Judd

unread,
May 1, 2008, 6:12:57 PM5/1/08
to hyperta...@googlegroups.com
Hello Mateusz,

The multidimensional database technologies commonly used in Online Analytical Processing (OLAP) generally provide efficient means for aggregation on one or more dimensions of data.  A classic example, which you mention, is storing and doing analytic calculations over point of sale (POS) data.

With the Google stack (i.e. GFS, Map-reduce, Bigtable), the most efficient way to do OLAP-style aggregation queries would be to express them as Map-reduce programs.  Having to write a map-reduce program to express each query is a bit of a hassle.  However, with a Sawzall-like language (see http://labs.google.com/papers/sawzall.html) these queries can be much easier to express.

It is possible to build an OLAP system on top of Hypertable.  Here's how you could do it for the POS example that you mention.  Create two tables, POSRaw and POSRollup.  The POSRaw table would contain all of the raw point of sale transaction information.  The row key would be the timestamp of the transaction and would include the following column families:

item
price
region
store_id
[...]

For the the sparse information about each product, you could have a column family called "attr" and use the column qualifiers (word after the colon) to specify each attribute.  Since Hypertable stores information in a sparse representation, it is well-suited for this kind of data.  So, the qualified "attr" columsn might include the following:

attr:color
attr:size
attr:count
attr:weight
attr:length
attr:units
[...]

For pre-canned queries or reports, you can have a "cron job" kick off every night or every week that selects the most previous day or weeks worth of data, computes the summaries that you're interested in and stores them in the POSRollup table.  This "rollup" table can be easily queried for reports.

Also, for ad-hoc queries, since the data is kept sorted by timestamp, you can efficiently select a historical segment of the data and build a summary of just that data on the fly.

With the 1.0 release, the query language will be limited to simple SELECT with a predicate.  The next major release after 1.0 will contain a richer query language that will include a full suite of basic stats functions for generating summaries.

- Doug

Mateusz Berezecki

unread,
May 2, 2008, 7:01:00 AM5/2/08
to hyperta...@googlegroups.com
On Fri, May 2, 2008 at 12:12 AM, Doug Judd <do...@zvents.com> wrote:
Helo Doug,

Thank you for your time to answer this question,

>
> The multidimensional database technologies commonly used in Online
> Analytical Processing (OLAP) generally provide efficient means for
> aggregation on one or more dimensions of data. A classic example, which you
> mention, is storing and doing analytic calculations over point of sale (POS)
> data.

>


> It is possible to build an OLAP system on top of Hypertable. Here's how you
> could do it for the POS example that you mention. Create two tables, POSRaw
> and POSRollup. The POSRaw table would contain all of the raw point of sale
> transaction information. The row key would be the timestamp of the
> transaction and would include the following column families:
>
> item
> price
> region
> store_id
> [...]

I'm assuming that you meant that the row key would be the timestamp
and the POSRaw
would contain the above column families?

> For the the sparse information about each product, you could have a column
> family called "attr" and use the column qualifiers (word after the colon) to
> specify each attribute. Since Hypertable stores information in a sparse
> representation, it is well-suited for this kind of data. So, the qualified
> "attr" columsn might include the following:
>
> attr:color
> attr:size
> attr:count
> attr:weight
> attr:length
> attr:units
> [...]

Ok, so this is a completely separate table with products only, which
is modelled using
the star schema and the proper record would be selected by looking up
the item using
'item' column family (possibly the 'id' column within that family)
from the POSRaw table.
Is that correct?

> With the 1.0 release, the query language will be limited to simple SELECT
> with a predicate. The next major release after 1.0 will contain a richer
> query language that will include a full suite of basic stats functions for
> generating summaries.

Is hypertable OK with building Index tables which contain ORDER BY indexes
for some subsets of columns from some subset of column families?

If not how would you do the ORDER BY selects at the current state of hypertable?

Mateusz

Doug Judd

unread,
May 5, 2008, 1:17:12 PM5/5/08
to hyperta...@googlegroups.com
Hello Mateusz,

Sorry for the delayed response.  I don't have direct experience with data wherehousing and OLAP technology, so I'm having to review some of the literature.  Comments inline ...

On Fri, May 2, 2008 at 4:01 AM, Mateusz Berezecki <mate...@gmail.com> wrote:
[...]

>
> It is possible to build an OLAP system on top of Hypertable.  Here's how you
> could do it for the POS example that you mention.  Create two tables, POSRaw
> and POSRollup.  The POSRaw table would contain all of the raw point of sale
> transaction information.  The row key would be the timestamp of the
> transaction and would include the following column families:
>
> item
> price
> region
> store_id
> [...]

I'm assuming that you meant that the row key would be the timestamp
and the POSRaw
would contain the above column families?

Yes, the row key would be the timestamp and POSRaw would contain all of the above column families.

> For the the sparse information about each product, you could have a column
> family called "attr" and use the column qualifiers (word after the colon) to
> specify each attribute.  Since Hypertable stores information in a sparse
> representation, it is well-suited for this kind of data.  So, the qualified
> "attr" columsn might include the following:
>
> attr:color
> attr:size
> attr:count
> attr:weight
> attr:length
> attr:units
> [...]

Ok, so this is a completely separate table with products only, which
is modelled using
the star schema and the proper record would be selected by looking up
the item using
'item'  column family (possibly the 'id' column within that family)
from the POSRaw table.
Is that correct?

Actually, my thought was that the attr; column would get stored along with the item, price, and region columns in the POSRaw table in essentially a de-normalized manner.  Since Hypertable has a number of good compression algorithms including BMZ (long common string compression), the storage cost wouldn't be too bad.  Plus, storage is relatively inexpensive and the system scales easily.  [NOTE: we should implement a specialized compression algorithm that efficiently collapses data for column families whose values are of low cardinality]

Once you have this denormalized POSRaw table, you could create materialized views using map-reduce.  The map-reduce system would have to be able to read from a Hypertable and write to a Hypertable (easily implementable extensions).  So for example, let's say you want to answer questions about total sales per product.  You could write a map-reduce program that operated on the POSRaw table and mapped each transaction into the following key/value pair

key=(product, month) value=1

The reduce() function would aggregate (sum together) the monthly sales volume for each product.  The output of the reduce() function could be written directly into Hypertable.  This new table, TotalSalesByProduct, would be a materialized view of the raw data, where the row key would be (product, month) and the one column would be sales volume.  For example, here's a sample of what SELECT * output might look like:

[...]
Levi's Jeans 2007-01  567
Levi's Jeans 2007-02  476
Levi's Jeans 2007-03  456
Levi's Jeans 2007-04  344
Levi's Jeans 2007-05  289
Levi's Jeans 2007-06  134
[...]

This table essentially represents the output of the ORDER BY select statement that you mentioned.

So to answer the question "What were the total sales of Levi's jeans for the year 2007?", you would issue the following query and roll-up the results by adding the SalesVolume numbers together:

SELECT SalesVolume FROM TotalSalesByProduct WHERE ROW STARTS WITH "Levi's Jeans 2007";

This solution is one that involves map-reduce.  A simpler, yet less efficient, approach would be to support the construction of materialized views directly with the SELECT statement.  For example, you could imagine creating the TotalSalesByProduct table with a SELECT statement with the following form:

SELECT ... FROM POSRaw ... INTO TABLE TotalSalesByProduct;

Does this sound like a solution that would work for you?

- Doug

Mateusz Berezecki

unread,
May 6, 2008, 6:22:34 AM5/6/08
to hyperta...@googlegroups.com
On Mon, May 5, 2008 at 7:17 PM, Doug Judd <do...@zvents.com> wrote:
> Hello Mateusz,


Hello Doug !

> Sorry for the delayed response. I don't have direct experience with data
> wherehousing and OLAP technology, so I'm having to review some of the
> literature. Comments inline ...

Again, thank you very much for your time spent on this answer.

Ah, so that's the way. I'm definitely going to try this out and share some
information about my experience wrt to compression efficiency.

> Once you have this denormalized POSRaw table, you could create materialized
> views using map-reduce. The map-reduce system would have to be able to read
> from a Hypertable and write to a Hypertable (easily implementable
> extensions). So for example, let's say you want to answer questions about
> total sales per product. You could write a map-reduce program that operated
> on the POSRaw table and mapped each transaction into the following key/value
> pair
>
> key=(product, month) value=1
>
> The reduce() function would aggregate (sum together) the monthly sales
> volume for each product. The output of the reduce() function could be
> written directly into Hypertable. This new table, TotalSalesByProduct,
> would be a materialized view of the raw data, where the row key would be
> (product, month) and the one column would be sales volume.

I'm still trying to connect the dots and this seems to me as the most
appropriate solution to the problem. I'm definitely going to try this out
very very soon.


> Does this sound like a solution that would work for you?


Definitely! Thank you very much for the comprehensive explanation.

As soon as implement this in practice I'm going to share my experiences
with the list.

best regards,
Mateusz Berezecki

Reply all
Reply to author
Forward
0 new messages