Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

basics question on star schema design: sum and distinct-count measures

5 views
Skip to first unread message

elkarel

unread,
Apr 23, 2012, 7:01:09 AM4/23/12
to
I am beginner in Data Warehouse Design. I have red some theory, but
recently met a practical problem with a design of a OLAP cube. I use
star schema.
Lets say I have 2 dimension tables and 1 fact table:

Dimension Gazetteer:
- dimension_id
- country_name
- province_name
- district_name

Dimension Device:
- dimension_id
- device_category
- device_subcategory

Fact table:
- gazetteer_id
- device_dimension_id
- hazard_id (column for measure called "number of hazards")
- area_m2 (column for measure called "total area occupied")

A "business object" (which is a mine field actually) can have multiple
devices, is located in a single location (Gazetteer) and ocuppies X
square meters.
So in order to know which device categories there are, I created a
fact per each device in hazard like this:

+--------------+---------------------+-----------------------
+-----------+
| gazetteer_id | device_dimension_id | hazard_id |
area_m2 |
+--------------+---------------------+-----------------------
+-----------+
| 123 | 321 | 0a0a-502c-11aa1331e98 |
6000 |
+--------------+---------------------+-----------------------
+-----------+
| 123 | 654 | 0a0a-502c-11aa1331e98 |
6000 |
+--------------+---------------------+-----------------------
+-----------+
| 123 | 987 | 0a0a-502c-11aa1331e98 |
6000 |
+--------------+---------------------+-----------------------
+-----------+
I defined a measure "number of hazards" as distinct-count of
hazard_id.
I also defined a "total area occupied" measure as a sum of area_m2.
Now I can use the dimension gazetteer and device and know how many
hazards there are with given dimension members.
But the problem is the "total area occupied" measure: because it is
defined as a sum, it gives a value n-times higher than the actual
area, where n is th number of devices of the particular hazard object.
For example, with the data above would give 18000m2.
How would you solve this problem?

I am using the Pentaho stack but I believe it is a generic question.

Thanks in advance for any suggestion
0 new messages