On Non/semi-additive aggregates and PTDs (expreimentals)

77 views
Skip to first unread message

Stefan Urbanek

unread,
Mar 29, 2015, 10:59:53 AM3/29/15
to cubes-...@googlegroups.com
Hello,


I’ve spent quite a lot of hours in last few nights appropriating few features that were experimental in 1.0 such as the non-additive aggregates ([1] #265). Here are some thoughts:


1. NONADDITIVENESS – #265

The semi-additive aggregates as implemented in 1.0 were very limited in their use, but required quite a lot of internal complexity.

* It worked only for time-like dimensions (this would be most of the use cases, so whether this is limitation negative is disputable)
* there was expectation that the ‘nonadditive’ property would contain only role, where dimension list might be more useful
* the ‘time’-role dimension picked was very arbitrary (one used in drilldown, if more then it failed; if none then one from the cube; if there are more then just the first one - which is arbitrary, …)
* used additional mapping ‘dimension.__key__’ which didn’t fit with the current mapping scheme and was decided during query time. Star schema requires all mappings to be known up-in-front and currently there is no mechanism for “extra” attributes/columns to be included

If someone can help with drafting a proper specification, I would be happy. The specification should consider:

* how the non/semi additiveness is specified in the model
* should be aware of existence of expressions where an expression migh but might not take care of the nonadditiveness
* make sure that the specification is simple – no more complex than an array
* should consider dimensions and hierarchies
* should include how the query builder knows about the columns to be used in the query when a nonadditive measure is present

The pay-off for the result vs. costs is not worth for now. I don’t want to invest in this feature more time before 1.1.


2. PTD – #261

Other feature dropped in 1.1 is "periods to date” (PTD, [2] #261). The mechanism on which this feature relied on has been removed from the query builder as part of the SQL refactoring. In fact, the underlying mechanism of PTD was one of the major reasons of 1.0’s SQL complexity: the feature relied on ‘condition’ column mapping which included a Python expression evaluated using eval() within a table/query context. The query builder was not able to know which columns would be used in the query, so it had to have special arrangements. One of those arrangements was two-stage statement composition (master/detail) which rendered the SQL query generator quite obscure (see Image [3]). Introduction of outer-joins were major reason that such composition was needed. While the implementation did its job nicely, it’s implementation was very narrow and not portable. The requirements were:

* special date dimension (usually duplicate for every date dimension) with a `condition` mapping for it’s columns
* a custom database function for computing time difference
* extra columns in the date dimension with number of calendar units from certain period start

I’m not saying that the feature can be implemented without the later two, but the specification has to be very clear, understandable and easy to deploy. The best case would be to have the feature without imposing requirements to the database schema. Might require more work at the SQLAlchemy side, might yield slower statements, but it should work. The later two requirements should be just a way to optimise the already working solution.

As mentioned, since the outer-joins were one of the major causes that this feature caused complexity, I would rather drop the complexity by dropping an under-specified, limited and low-use feature in favour of more useful and well defined one, such as outer-joins [4]. 

Same as above, if someone can help with drafting a proper specification, I would be happy too. The specification should consider:

* how the PTD is specified the model
* how the PTD is specified in the queries (Python and HTTP)
* make sure that the specification is simple
* should have an implementation that does not require any modifications to the database (additional function, additional columns, …) – usability goes before performance
* has to state how the query builder/star schema knows about all required attributes up-in-front if any extra attributes/mappings should be required

As with the previous feature, the general usability vs. the internal costs (understandability, maintainability, extensibility, requirements for users,…) was too high. I’m dropping this feature for 1.1 as well.


CONCLUSION

Since I would like 1.1 with expressions and cleaned-up SQL query generator to be released, I can’t spend any more time on the features that don’t have proper specification yet, have limited use and are not critical. I’l rather postpone them and will go back to the drawing board later.

Many thanks to Robin Thomas for prototyping the preliminary and clever implementation as a proof of concept in pre-1.0 cubes. Nice starting point and reference for further implementation of the mentioned features.

My sincere apologies to those who were using those two features in 1.0. They will have a come-back later, since they both are useful features.


Thoughts, comments, suggestions?

Cheers,

Stefan


[1] https://github.com/DataBrewery/cubes/issues/265

Robin Thomas

unread,
Mar 30, 2015, 1:14:10 AM3/30/15
to cubes-...@googlegroups.com
It would, I agree, be preferable to state the PTD feature in terms that are clear and understandable before we attempt to add the feature back into the 1.1 codebase. The existing implementation worked, but only for a particular database implementation of PTD, and you're correct that it complicated the code in the sql backend.

We perhaps could express the PTD feature, independent of database implementation and at the logical layer of cubes instead of the physical/mapping layer, in terms of an extra, implicit "cut" on the time dimension:

1) A model defines a time dimension, call it 'date', with several "cubes standard" dimension levels with standard meanings. That is, 'date.year' is the four-digit common calendar year, 'date.dow' is the day of week expressed as 1..7, 'date.week' is the YYYY-MM-DD format of the day ending that week, etc.

2) The model defines a second dimension, call it 'date_ptd', which somehow indicates that the first time dimension 'date' is a *template* for 'date_ptd' dimension. Also, 'date_ptd' indicates that it wishes to be a PTD dimension adding an implicit cut for the underlying template dimension, 'date'.

3) In order for the PTD implicit cut to work, the underlying 'date' dimension needs to define extra hierarchies that re-use existing levels in new arrangements, or alternately, cubes allows cut specifications that don't specify every level of a hierarchy, but instead can specify a cut starting at a depth lower than the top level.

Example:

On March 18, 2015, a user wishes to aggregate a cube 'sales' with drilldown 'date_ptd@ymd:month'. The user wants aggregates, segmented by (year, month), but only aggregating facts whose time falls in the period-to-date of its month -- the first 18 days of each month.

Although the user merely requests /cube/sales/aggregate?drilldown=date_ptd@ymd:month, the cubes server sees that date_ptd is a PTD dimension using 'date' dimension as a template. So it first translates date_ptd@ymd:month into date@ymd:month. Then, the server determines the current datetime, sees that the next level below 'month' in the date_ptd@ymd: (and date@ymd) hierarchy is 'day', determines that the current day is day 18. Then it can compute an extra implicit cut filtering out any date after the 18th of the month.

Now, how to define such a cut? If the 'date' dimension helpfully includes a date@dayhour hierarchy, with 'date.day' level at the top of the hierarchy, then cubes server -- still working at the logical layer -- can say, "hey, there's a hierarchy in the dimension where the level we want a range cut for is at the top level. Let's use it!" Then the implicit cut becomes cut=date@dayhour:1-18 and the server can execute the equivalent of

/cube/sales/aggregate?drilldown=date@ymd:month&cut=date@dayhour:1-18

but when returning results provide the metadata as for ?drilldown=date_ptd@ymd:month .

This approach stays entirely at the logical layer, so it imposes no extra complexity on any particular cubes backend. Only the logical layer's cube browser needs to interpret the PTD-ness of the ate_ptd dimension, calculate the implicit cut, and proceed with the backend's aggregation.

However, this approach requires that the source/template dimension, 'date', define a whole lot of hierarchies solely for the purpose of accomplishing implicit PTD cuts. If you want to drilldown by (year, month), then some other hierarchy must exist with 'date.day' at the top. If you want to drilldown by (year), then some hierarchy must exist with 'date.month' at the top. Any level on which a user might want a drilldown requires that another hierarchy exist in the dimension with the next largest level at the top. I would hate to require users to create all these hierarchies manually; perhaps we could include them in our standard time dimension so that modelers could use that standard time dimension as a template and get all these extra hierarchies. 

The only other alternative to the "lots of extra hierarchies stamped out from a template" would be to allow cuts that start below the top level of a hierarchy, e.g. ?cut=date@ymd:?,?,1-?,?,18 where "?" is a placeholder for some character sequence meaning "ignore this level". Such an allowance seems to violate the idea of a drilldown hierarchy, and perhaps might cause performance problems for some backends or some models, so we wouldn't want to add this feature to the cut specification that every cube and backend would have to support.

But if we allow ?cut=date@ymd:?,?,1-?,?,18 then there is no need at all for extra hierarchies to accomplish PTD behavior. It's also possible for us to forbid "downlevel cuts" like this by default, and a cube or dimension must explicitly allow this with a setting in its model.

An extra concern: The PTD dimension I have in my current models don't want to do implicit cuts just on the level below the drilldown level, e.g. "cut for day 1-18 for a month drilldown, cut for hour 0-13 for a day drilldown". Instead, I want *all* the implicit cuts for a particular time dimension to be down to the hour regardless of the drilldown: "cut for day 1-18 AND hour 0-13". Well, that works fine given current cubes specification, because multiple cuts for the same dimension are allowed, so the implicit cut would be ?cut=date@ymdh:?,1-?,18;date@ymdh:?,?,?,0-?,?,?,13 and I'm very happy with that for my current PTD needs. At the logical layer, I could perhaps define this with { "ptd": "hour" } somewhere in the model for the PTD dimension, and that would be used to determine the depth of the implicit cuts.

I have separate thoughts about the non/semi-additive feature, which I'll post separately...

Stefan Urbanek

unread,
May 6, 2015, 1:12:06 AM5/6/15
to cubes-...@googlegroups.com
Thanks robin for detailed description of an interesting solution. One thing that I somehow can’t relate to is the requirement of additional dimension, otherwise sounds quite reasonable.

I was thinking about this alternative:

OVERVIEW

Period to Date or PTD is an advanced feature that imposes a condition on a date/time during aggregation. PTD operates on a single date dimension of user’s choice. The condition selects only those facts which have the same distance from the beginning of the period of the date dimension member as current (server?) time. For example: PTD condition for date dimension at level ‘day’ with reference (current) date ‘2015-10-05’ would yield only facts where date dimension day ≤ 5.

Requirements:
* new aggregate() argument
* backend-specific model properties

Limitations:
* only one dimension to be used
* not all backends provide the feature
* first implementation uses only system/server time as the reference time

USER

Specify a date dimension that will be used for “ptd” condition at the query time through an additional argument “ptd” containing a reference to the dimension which has to be of role “time” or “date”. It can be any existing dimension, no additional modelling necessary.

Example:
GET /cube/foo/aggregate?drilldown=date&ptd=date:day

result = browser.aggregate(drilldown=“date”, ptd=“date:day”)

The ptd argument includes dimension, hierarchy and level where the hierarchy and level are optional with implicit defaults.

BACKEND

If “ptd” argument is provided it should be respected, otherwise backend should raise an exception about not providing the feature. Implementation is up to the backend.


SQL BACKEND

Since the SQL is going to be the primary backend of Cubes, it should also provide this advanced feature. The proposed implementation is slightly different from the v1.0. Requirements for the implementation are:

* retain data provenance – we should be able to determine exact star/snowflake columns and their dependencies (everything that is used in a query) just from the model metadata
* should impose as little requirement on the physical schema as possible
* should impose as little requirement on the model as possible
* should not require any stored procedures


SQL PROPOSAL

SYSTEM ATTRIBUTES

Introduce special – system – attributes to dimensions. In this case it would be “__date__” and “__time__”. [1]  System attributes are used only during dimension attribute mapping, they have no logical representation. System attributes are referenced in the mappings in the same ways as other attributes with “dimension.attribute” pattern as in: “start_date.__date__”. The referenced physical column should be of type date or time. Since system attributes are defined only as mappings they are considered “base attributes”. There is no way how to define them as derived – to have an expression for such attribute. This satisfies the first requirement “retain data provenance”.

QUERY

Query builder will be aware of possible existence of the system attributes, therefore it can include them among standard attributes.

If the “ptd” argument is present, then the query builder will construct the PTD query condition using SQL expression composition (using postgres-like dialect for illustration):

WHERE (__DATE__ - date_trunc(PERIOD, __DATE__)) ≤ (now() - date_trunct(PERIOD, __DATE__))

__DATE__ is the actual mapped date column, PERIOD is name of the period such as “month” or “year"

Query builder with cooperation of SQL alchemy will take care about the proper dialect version of the statement.

The tricky condition is the condition for the week, as some dialects don’t have truncation on the week level. Since we are using a dimension of role date/time we know which level is the “week” level. We know which day is the first day of the week from the Calendar (Workspace.calendar). This information can be used to construct dialect-specific condition as above.

The solution is inspired by the original one developed by Robin for 1.0, just promotes the expression (“cond”) from mapping into the query builder, moves the time distance computation to the query time.

SUMMARY

Advantages:

* The proposed solution does not require special pre-computed columns to exist in the date/time dimension tables.
* It does not require a stored function (in the database).
* Solution maintains data provenance.
* Does not require additional dimension.
* The impact on the model is minimal: one or two straightforward mapping attributes per relevant dimension.

Disadvantage:

* might not be easy to implement in other backends
* adds an argument to already argument rich aggregate() query

What do you think?


Cheers,

Stefan


[1] In the future there might be added attribute “__key__” for mapping the lowest level but unique key for a dimension hierarchy member for some other features.


--
You received this message because you are subscribed to the Google Groups "Cubes" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cubes-discus...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply all
Reply to author
Forward
0 new messages