You probably noticed that I always used Votes in
the previous screenshots, and there’s a reason for it: while votes are a
cumulative concept, it doesn’t make any sense at all to show a sum of
ranks. But up to this point, we have no information that allows us to
know the business meaning of these fields; all we know is whether
they’re strings, numbers, binaries, dates, etc. In order to get insights
from fields like rank, we need more semantics: what the fields mean from a business perspective.
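To make the difference concrete, here is a minimal sketch in Python. The rows and their values are made up for illustration; only the Votes and rank field names come from the example above.

```python
from statistics import mean

# Hypothetical rows: the values are invented for illustration only.
rows = [
    {"title": "A", "votes": 120, "rank": 1},
    {"title": "B", "votes": 80, "rank": 2},
    {"title": "C", "votes": 40, "rank": 3},
]

# Votes are cumulative, so a total is meaningful.
total_votes = sum(r["votes"] for r in rows)

# Summing ranks runs without error, but the result answers no business question.
summed_rank = sum(r["rank"] for r in rows)

# An average rank is at least interpretable.
average_rank = mean(r["rank"] for r in rows)
```

Both fields are plain numbers, so nothing in their type says which aggregation applies. That is exactly the piece of semantics that is missing.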
And how do we get this information? Classically, it is
appended in a separate stage of the process. We are used to calling it
the modelling stage, and it’s usually done after the data integration
stage is complete. On our stack, we do this by writing Mondrian schemas
(if we want to use OLAP) or Pentaho Metadata models for interactive reporting.
But this is incredibly stupid! From the moment we get a field called rank,
we already know it should be treated as an average. As soon as we see a
date field, most likely it will feed a date dimension. If we get
country, state and city fields, they will most likely be attributes of a
territory dimension. It makes no sense at all to wait until the end of
the data preparation stage and resort to a different tool to append
information we have had from the start.
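As a rough sketch of that idea, the rules of thumb above could be written as a small Python function. This is not how the product does it; `infer_role`, the role labels and the dimension names are all hypothetical, chosen only to mirror the rank, date and territory examples.

```python
from datetime import date

# Hypothetical rule set: names and labels are illustrative assumptions.
TERRITORY_FIELDS = ("country", "state", "city")


def infer_role(name, sample_value):
    """Guess a modelling role from a field's name and one sample value."""
    lowered = name.lower()
    if isinstance(sample_value, date):
        # A date field most likely feeds a date dimension.
        return {"role": "dimension", "dimension": "Date"}
    if lowered in TERRITORY_FIELDS:
        # Country, state and city are likely attributes of one dimension.
        return {"role": "attribute", "dimension": "Territory"}
    if isinstance(sample_value, (int, float)):
        # Numbers become measures; rank is averaged, the rest are summed.
        aggregation = "avg" if lowered == "rank" else "sum"
        return {"role": "measure", "aggregation": aggregation}
    # Anything else is kept as an attribute of its own dimension.
    return {"role": "attribute", "dimension": name}
```

Calling `infer_role("rank", 3)` yields a measure aggregated with `"avg"`, while `infer_role("Votes", 10)` yields one aggregated with `"sum"`. A real implementation would need far richer rules and a way for the user to override each guess.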
In this new way of analyzing data as part of the data integration process,
we started with the following assumption: there are two different
lenses that we can apply to look at a data set:
- Stream view:
This is the two-dimensional representation of the physical data that
we’re working with; a view over the fields and their primary types
- Model view:
The semantic meaning of those fields: dimensions, attributes, measures,
basically the real business meaning of the stream underneath
In the example I’ve been using, these are the two views:
Figure 24: Stream view and Model view
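For readers who prefer code to screenshots, the two lenses can be sketched as two small data structures over the same fields. This is only an illustration: `StreamField`, `ModelView` and the sample fields are hypothetical names, not part of any Pentaho API.

```python
from dataclasses import dataclass, field


@dataclass
class StreamField:
    """One column of the physical stream: a name and a basic type."""
    name: str
    primitive_type: str


@dataclass
class ModelView:
    """Business meaning layered on top of the same fields."""
    measures: dict = field(default_factory=dict)    # field name -> aggregation
    dimensions: dict = field(default_factory=dict)  # dimension -> attribute fields


# Stream view: just fields and types (sample fields are assumptions).
stream_view = [
    StreamField("country", "string"),
    StreamField("city", "string"),
    StreamField("votes", "number"),
    StreamField("rank", "number"),
]

# Model view: the same fields, described by what they mean.
model_view = ModelView(
    measures={"votes": "sum", "rank": "avg"},
    dimensions={"Territory": ["country", "city"]},
)


def modelled_fields(model):
    """Return the set of stream field names the model view refers to."""
    names = set(model.measures)
    for attributes in model.dimensions.values():
        names.update(attributes)
    return names
```

In this sketch `modelled_fields(model_view)` covers exactly the names in `stream_view`: both structures describe the same data, and only the model view carries the aggregation and dimension information.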
As mentioned before, these are two views over the same domain. If we’re
interested in looking at the physical stream, the one on the left will
be used. If we’re looking from a business perspective, it’s the model
view that has the added information. Our current thinking is that only
the model view will be available to end users (once we get this
data exploration experience there).