Cubes 2.0 Goals

Stefan Urbanek

Apr 2, 2017, 5:21:55 PM

to cubes-...@googlegroups.com

# Cubes 2.0

(Rendered markdown of this message: https://gist.github.com/Stiivi/3dd87f0ba920d2cae49e45d210bb0265)

Hi there. After almost two years of little or no activity due to my life and career situation, I’m committing myself back to the Cubes project. It will take some time to ramp up, but we will eventually get there. I apologize for not meeting expectations lately and for letting the framework, mailing list and discussions go stale.

I got quite a lot of useful feedback and recommendations from users and people in the domain, and that revived my motivation to spend more of my spare time making Cubes a better and more modern OLAP toolkit.

Now, let’s move forward. Before any improvements or changes can be made, Cubes needs quite a lot of housekeeping. The whole 2.0 release addresses that. Only when we have a consistent, well-defined interface, and when we have goals and, equally importantly, non-goals set, can we start growing Cubes again.

Comments, questions, concerns are more than welcome.

## Objectives of the 2.0 Release

- Maintainability

- Type consistency

- Correctness before feature richness

- Transparency of the query generator and execution process

- Extensions API clarification

- Better decoupling of components

- Preserved existing model compatibility if possible and sensible

- Preserved existing HTTP interface compatibility

- Multiple physical representations

Summary of 2.0 issues can be found [here](https://github.com/DataBrewery/cubes/milestone/5).

## Maintainability and Contributions

I have been the major Cubes developer most of the time and I admit I was not good at communicating the ideas frequently, clearly or up front in time for discussion. I apologize for that. As a result, the Cubes codebase is not easy for other or new people to understand or maintain.

Therefore, during the refactoring process, I will focus on making the codebase more understandable and more maintainable. I will try to lower the barrier to contributing to the library wherever possible.

I already started to categorize issues based on size:

- [size-small](https://github.com/DataBrewery/cubes/labels/size-small) – not many changes needed, either minor but repeated changes in multiple files or a small change within one file. Understanding of the broader context is not really necessary.

- [size-medium](https://github.com/DataBrewery/cubes/labels/size-medium) – might span across files/modules, might require some refactoring, understanding a bit more than the changed piece of code is needed. Should be a change with quite well defined boundaries.

- [size-large](https://github.com/DataBrewery/cubes/labels/size-large) – deeper understanding of the library is needed and change is expected to affect a lot of modules or start a change dependency chain.

The small issues are a good way to get familiar with the code.

Other tags:

- [help-wanted](https://github.com/DataBrewery/cubes/labels/help-wanted) – anyone with at least some knowledge of the library might be able to implement it and I am willing to assist

- [easy](https://github.com/DataBrewery/cubes/labels/easy) – should be easy to implement, good for those who want to get familiar with the code

If you would like to contribute, but don’t know where to start, start with `easy`, `size-small` or `help-wanted` issues. Or you can also:

- comment on existing issues with implementation proposals

- challenge proposed implementations

- propose project/module/class reorganization

- write unit-tests

- make sure that the documentation reflects code, add missing documentation or remove obsolete documentation

The best contribution is still a [pull-request](https://guides.github.com/introduction/flow/).

I will be available on the [Cubes Gitter channel](https://gitter.im/DataBrewery/cubes), mostly on evenings or weekends in the PST (San Francisco) time zone.

Side note: I understand that BI and data warehousing is quite a lucrative domain and that people usually don’t spend their spare time contributing to open source. However, if your organisation or company uses Cubes, please let the community know. It helps keep the project going by keeping the people involved motivated.

## Type Consistency and Correctness

**Background:** The Cubes code base is approximately seven years old and has grown in complexity quite organically. Most of the time the focus was on feature richness and growth. This resulted in many compromises in the interface and in coupling out of convenience, such as mixing functional metadata with human-oriented metadata in the model, or loosely typed arguments that can be anything from strings through tuples to complex dictionaries. The code became non-trivial to understand and navigate, which made it difficult to add new features and to maintain or debug existing ones. Some good advanced features, such as Periods-to-Date or semi-additive measure-dimension relationships, had to be removed because they had a negative impact on the rest of the code base.

Static type annotations and type checking are a basic but powerful way to reveal potential inconsistencies and catch type mismatch errors. They also help us understand what is actually being passed around and see whether the design is optimal, has holes or can be improved.
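To make the idea concrete, here is a minimal sketch of how annotations can turn a loosely typed argument into one explicit, checkable type. The names `DrilldownItem`, `LooseDrilldown` and `normalize_drilldown` are hypothetical and made up for this illustration; they are not part of the Cubes API, and the accepted input shapes are assumptions.

```python
from typing import Dict, NamedTuple, Optional, Tuple, Union


class DrilldownItem(NamedTuple):
    """One consistent shape for a drilldown request (hypothetical type)."""
    dimension: str
    level: Optional[str] = None


# The shapes a caller might pass in this sketch: "date", "date:month",
# ("date", "month") or {"dimension": "date", "level": "month"}.
LooseDrilldown = Union[str, Tuple[str, str], Dict[str, str], DrilldownItem]


def normalize_drilldown(item: LooseDrilldown) -> DrilldownItem:
    """Convert any accepted shape into a DrilldownItem, or fail loudly."""
    # DrilldownItem is itself a tuple, so it has to be checked first.
    if isinstance(item, DrilldownItem):
        return item
    if isinstance(item, str):
        dimension, _, level = item.partition(":")
        return DrilldownItem(dimension, level or None)
    if isinstance(item, tuple):
        return DrilldownItem(item[0], item[1])
    if isinstance(item, dict):
        return DrilldownItem(item["dimension"], item.get("level"))
    raise TypeError("Unsupported drilldown: {!r}".format(item))
```

Once the `Union` is written down, a static checker such as mypy can flag callers that pass anything else, and code behind the normalizer only has to handle `DrilldownItem`.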

The 2.0 release will be focusing mostly on type correctness. There are no radical feature changes planned, mostly refactoring of the existing ones.

Open issues:

- [Type Annotations (Python 3.6)](https://github.com/DataBrewery/cubes/issues/393) (master task)

- [“Typing” related issues](https://github.com/DataBrewery/cubes/labels/topic-typing)

## Query Generator and Execution Transparency

Current state: The SQL query generator is a complex, tightly and awkwardly coupled collection of interacting objects: Browser, StarSchema, QueryContext, Store, etc. In the 1.x release the query generator was rewritten in an attempt to make it more straightforward; however, many loosely defined or anonymous data types were introduced and it was still not properly decoupled from the rest of the library. One reason, not mentioned before, was the idea of having a reusable SQL denormalizer that would in the future reside outside of Cubes. That turned out not to be a great idea, because such a component would lack the rich metadata that Cubes already provides.

A secondary problem is the concept of the Browser itself. It looked like a great idea at the very beginning of the library, when it was the only object responsible for everything. It has since overgrown to a state where it is not clear which object should be responsible for which functionality. Another problem with the Browser is that it tries to accept loosely typed arguments (names, model objects, differently shaped drilldowns, etc.), while the backend query generator needs them as well-known, consistent data types. This was addressed by having sets of methods such as `aggregate()` with backend-customizable arguments, `prepare_aggregates()` and `provide_aggregate()`, but no strict rules around those functions were proposed.
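The split between a permissive public method and a strictly typed backend hook could look roughly like the sketch below. It borrows the method names mentioned above, but the class `SketchBrowser`, the signatures and the validation rule are simplified assumptions for illustration, not the actual Cubes implementation.

```python
from typing import Dict, List, Sequence, Union


class SketchBrowser:
    """Hypothetical base class; not the actual Cubes Browser."""

    def __init__(self, known_aggregates: Sequence[str]) -> None:
        self.known_aggregates = list(known_aggregates)

    def aggregate(
        self, aggregates: Union[None, str, Sequence[str]] = None
    ) -> Dict[str, float]:
        """Public entry point: accepts loose input, then delegates."""
        prepared = self.prepare_aggregates(aggregates)
        return self.provide_aggregate(prepared)

    def prepare_aggregates(
        self, aggregates: Union[None, str, Sequence[str]]
    ) -> List[str]:
        """Normalize loose input into a validated list of names."""
        if aggregates is None:
            names = list(self.known_aggregates)
        elif isinstance(aggregates, str):
            names = [aggregates]
        else:
            names = list(aggregates)
        unknown = [n for n in names if n not in self.known_aggregates]
        if unknown:
            raise ValueError("Unknown aggregates: {}".format(unknown))
        return names

    def provide_aggregate(self, aggregates: List[str]) -> Dict[str, float]:
        """Backend hook: only ever receives a validated list of names."""
        raise NotImplementedError


class ConstantBrowser(SketchBrowser):
    """Toy backend that returns 0.0 for every requested aggregate."""

    def provide_aggregate(self, aggregates: List[str]) -> Dict[str, float]:
        return {name: 0.0 for name in aggregates}
```

The rule this sketch encodes is that only `aggregate()` and `prepare_aggregates()` ever see loose input, so a backend author overrides `provide_aggregate()` and can rely on a single argument type.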

Cubes 2.0 should bring more transparency to the query generator and properly decouple the components involved in the process. Notable changes:

- have a well-defined multi-dimensional query object that can be preserved, shared, published, etc.

- have a ‘prepared query’ object that can be used for execution, inspection and storage of denormalized or aggregated queries generated by Cubes

- provide first-class interface for getting a compiled SQL query without execution for further viewing, processing or executing by another system

- have greater involvement of the Store for query preparation, execution and materialization

The above functionality will greatly extend the usability of the library as an embedded component that can generate multi-dimensional ROLAP or other queries and feed them to other systems.
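As a rough illustration of the separation listed above, the following sketch keeps the query description, the compiled SQL and the execution step in three distinct places. `Query`, `PreparedQuery`, `prepare()` and `execute()` are hypothetical names invented for this example, the SQL generation is deliberately naive, and `sqlite3` merely stands in for a Store.

```python
import sqlite3
from typing import List, NamedTuple, Tuple


class Query(NamedTuple):
    """A plain, storable description of a multi-dimensional query."""
    fact_table: str
    drilldown: Tuple[str, ...]
    measure: str
    cuts: Tuple[Tuple[str, str], ...] = ()  # (column, value) pairs


class PreparedQuery(NamedTuple):
    """Compiled SQL plus bound parameters; can be inspected or stored."""
    sql: str
    parameters: Tuple[str, ...]


def prepare(query: Query) -> PreparedQuery:
    """Compile a Query to SQL text without touching a database.

    Identifiers are interpolated as-is, so this sketch assumes they
    come from a trusted model; cut values are passed as parameters.
    """
    columns = ", ".join(query.drilldown)
    select = columns + ", " if columns else ""
    sql = "SELECT {}SUM({}) AS total FROM {}".format(
        select, query.measure, query.fact_table
    )
    if query.cuts:
        sql += " WHERE " + " AND ".join(
            "{} = ?".format(column) for column, _ in query.cuts
        )
    if columns:
        sql += " GROUP BY " + columns
    return PreparedQuery(sql, tuple(value for _, value in query.cuts))


def execute(connection: sqlite3.Connection, prepared: PreparedQuery) -> List[tuple]:
    """Run a prepared query; the only step that needs a connection."""
    return connection.execute(prepared.sql, prepared.parameters).fetchall()


query = Query("sales", ("year",), "amount", cuts=(("region", "EU"),))
prepared = prepare(query)
print(prepared.sql)
# SELECT year, SUM(amount) AS total FROM sales WHERE region = ? GROUP BY year
```

Because `prepare()` never opens a connection, the resulting SQL can be logged, reviewed or handed to another system, and it can be unit-tested as a plain string.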

Proper separation of components will also make debugging and testing easier. Today, preparing good unit tests for the browser is not trivial.