can partitioning harm perforemance

2 views
Skip to first unread message

Rea

unread,
Jul 18, 2005, 5:06:02 AM7/18/05
to
Hey eb
In my cube time domension's leaf level is months and a new partition
is dedicated to each month.

Calculated measures then use data from a number of months(partitions)
,for example to sum up data from begining of year, or to calculate
avarage values per month etc.

In this case can partitioning data to months partitions,harm performance
instead of improoving??

Has any one had a case where partitions impared perforemance?
TIA

pras

unread,
Jul 19, 2005, 7:35:53 AM7/19/05
to
Basically partitioning improves the query performance.Setting up the
data slice for each partition infact boosts the query performance in
AS.

Regards,
Prasanna

SQL McOLAP

unread,
Jul 19, 2005, 9:37:04 AM7/19/05
to
Pras is absolutely correct. Let me just say that the only way partitioning
will HARM query performance will be if you create partitions, but DON'T set
the slice, that could disasterous in a large cube!

If you're going to partition a cube, a practice to follow is to be sure it's
partitioned by what people will query by often (time is usually a good
candidate) and that the data is load balanced by what you've partitioned by
as well, again time usually is load balanced, unless the data is very
seasonal. Partitioning with slices really boosts query performance
(especially in larger cubes) because now the engine will go directly just to
the partition your query sliced on (if you did slice on it in the query)
instead of having to scan the entire cube.

Partitioning really aids in processing too because now you can process a
cube in parallel with threads processing different partitions at the same
time.

The BIGGEST rule of them when partitioning --> ALWAYS SET THE SLICE!

Good luck.

- Phil

RuiDC

unread,
Jul 20, 2005, 3:37:55 AM7/20/05
to
although querying performance has definately improved for us, routing
administrative tasks, from processing through to cloning are
substantially slower given that we have around 10 partitions for each
of our cube.

Regards,
Rui

Dave Wickert [MSFT]

unread,
Jul 20, 2005, 11:27:20 AM7/20/05
to
The primary assumption with partitioning and queries is that you can
subdivide a problem into small pieces and then only have to do a subset of
the pieces. For example, if you partition by month and have 60 months (and
thus 60 partitions), AND if you queries are restricted by month, such as
just the current month, then we only need to scan 1/60th of the amount of
data if you have an aggregation miss.

HOWEVER, if your query pattern does not match the partitioning scheme, then
partitioning can add overhead. Suppose you partitioned by month, and then
queried for a given product line across all months, what happens? The system
cannot determine which partition has what product line, so it has to scan
them all. It does this in parallel (one thread per partition) and then
merges the results. Suppose you were querying on a product line which was
brand new -- i.e. it only had records in the current month. The system
doesn't know that so it has to scan 60 partitions in parallel, of which 59
are empty and 1 has data in it.

Notice that I said an "aggregation miss" in the first paragraph. For the
most part, if you have a good aggregation design and/or a nice data cache in
memory, then hitting an aggregation with partitioning is just about as good
as if it was 1 partition or 60 partitions. As a worst case, the system has
to add 60 things together. The REAL difference is if you have an aggregation
miss and the system has to scan the fact table. In this case, there is a
huge difference if it has to scan all of the data (even if done in parallel)
and only having to scan 1/60th of the data.

This is why setting the data slice is so important (with SQL2K). It is kind
of like an optimizer hint. If the data slice is set, then the system knows
which partition subset to scan. Without it, it must scan all of them. Thus
without it you defeat the whole purpose of partitioning (as a way of
improving query time -- there are other good reasons to partition such as
being able to do a massive delete of data -- or only have to process a small
subset at a time -- but the heart of any partitioning scheme from a querying
point of view is:

1 - the partitioning schema must be useful for the query pattern -- if the
two don't align then you will have more overhead and doubtful if queries
will run faster.

2 - setting the data slice so the system knows how to subdivide the problem
so it can do a smaller piece then it otherwise would have done

Note that with SQL2K5 there are some major improvements -- particularly with
#2 above. In SQL2K5, with MOLAP structures, we now build a historgam-like
structure which tells us which members in the various dimensions, that a
partition uses. You don't have to set the data slice, we figure it out
already for you. You can just process the partitions and off you go. You
still have to set the data slice for a partition in ROLAP storage mode or if
you are using proactive caching with a ROLAP phase. This is because a basic
tenet with ROLAP is that we never move the core fact table. Since this
automatic detection of members is done when the MOLAP structures are being
created from the fact data -- and the data isn't copied with ROLAP -- that
means that the histogram-like structures are never built.

This automatic building of the data slice is important because it allows a
new type of partitioning technique which wasn't possible in SQL2K. Suppose
you have the case where one member represents 50% of the data and you would
like to partition based on that member. You would like 1 partition with that
member as the data slice; and a second partition which contains everything
else. With SQL2K, this would be very difficult to setup. This is because
there is no nature of "everything else" as a data slice. You would have to
actually enumerate each of the "everything else" members on the data slice.
Since SQL2K5 automatically detects what members are in a partition, you just
process the partitions. One partition would have a WHERE clause in the query
binding including "AND fact.dim_fk = member_key" and the other partition
would have "AND fact.dim_fk <> member_key". And bingo -- it just works.

Hope this helps.
--
Dave Wickert [MSFT]
dwic...@online.microsoft.com
Program Manager
BI SystemsTeam
SQL BI Product Unit (Analysis Services)
--
This posting is provided "AS IS" with no warranties, and confers no rights.


"Rea" <R...@discussions.microsoft.com> wrote in message
news:C5177A93-4CA6-47F9...@microsoft.com...

Reply all
Reply to author
Forward
0 new messages