Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

FOR ALL INDEXED COLUMNS SIZE 254

657 views
Skip to first unread message

Mladen Gogala

unread,
Sep 12, 2011, 5:15:21 PM9/12/11
to
I have frequently used the method_opt from the title to analyze my tables
and have been castigated for it with amazing regularity. I still don't
quite understand why is that method bad. My logic is quite simple:
The main decision that an optimizer has to make about the particular
column condition is whether to use index to resolve it or not. That
decision is based on the column statistics, as well as the the index
statistics. The things that influence the decision are the condition
itself, the existence of the histograms which helps to estimate the
number of rows that need to be read, the clustering factor which will
help estimate the number of blocks to read, the values from SYS.AUX_STAT$
to estimate the I/O cost and the statistics from the table itself. There
are two possible decisions that optimizer can make with respect to the
column: whether to use an index or not.
There is no decision to be made if the index is not there, except with
respect to join method, whether to use sort or hash. That means that
using "for all indexed columns size 254" actually makes a lot of sense
and it doesn't waste as much space as "FOR ALL COLUMNS SIZE 254". There
is an option "FOR ALL COLUMNS SIZE SKEWONLY" which would be nice, except
that I have no idea of the histogram size.
This method_opt was called "a bane":
http://structureddata.org/2008/10/14/dbms_stats-method_opt-and-for-all-
indexed-columns/

So, why is this a bane? The explanation on the page is less than clear.




--
http://mgogala.byethost5.com

Mladen Gogala

unread,
Sep 12, 2011, 6:32:39 PM9/12/11
to
On Mon, 12 Sep 2011 21:15:21 +0000, Mladen Gogala wrote:


> So, why is this a bane? The explanation on the page is less than clear.
>

BTW, there is something extremely stupid about this method: it collects
histograms even for the the single column primary and unique keys, which
is completely pointless because the distribution is given by the very
definition of the primary/unique key:


SQL> begin
2 dbms_stats.gather_schema_stats(
3 ownname=>'SCOTT',
4 method_opt => 'FOR TABLE FOR ALL INDEXED COLUMNS SIZE 254',
5 estimate_percent => NULL,
6 cascade => true);
7 end;
8 /

PL/SQL procedure successfully completed.


SQL> select table_name,column_name,count(*)
2 from dba_histograms
3 where owner='SCOTT'
4 group by table_name,column_name;

TABLE_NAME COLUMN_NAME COUNT(*)
------------------------------ -------------------- ----------
DEPT DEPTNO 4
EMP EMPNO 14

Those histograms are completely unnecessary. Fortunately, there is a
method called SET_TABLE_PREFS which can do something like this:

SQL> begin
2 dbms_stats.set_table_prefs(
3 ownname=>'SCOTT',
4 tabname => 'EMP',
5 pname => 'METHOD_OPT',
6 pvalue => 'FOR ALL COLUMNS SIZE 1');
7 end;
8 /

That tells the DBMS_STATS not to collect histograms for any of the
columns of the table EMP:

SQL> exec dbms_stats.delete_schema_stats(ownname=>'SCOTT');

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.13

SQL> begin
2 dbms_stats.gather_schema_stats(
3 ownname=>'SCOTT',
4 cascade => true);
5 end;
6 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:01.65
SQL> select table_name,column_name,count(*)
2 from dba_histograms
3 where owner='SCOTT'
4 group by table_name,column_name;

TABLE_NAME COLUMN_NAME COUNT(*)
------------------------------ -------------------- ----------
EMP ENAME 2
EMP SAL 2
EMP DEPTNO 2
DEPT DEPTNO 4
EMP COMM 2
EMP JOB 2
EMP EMPNO 2
EMP MGR 2
EMP HIREDATE 2

9 rows selected.

Now all columns have histograms with size 2, which means that only
minimum and maximum is collected. Observe that there is still a histogram
of size 4 for the DEPTNO column. There should be an automatic preference
not to gather histograms for the single column primary/unique keys.
--
http://mgogala.byethost5.com

Charles Hooper

unread,
Sep 12, 2011, 6:45:27 PM9/12/11
to
On Sep 12, 5:15 pm, Mladen Gogala <gogala.mla...@gmail.com> wrote:
(snip)
> So, why is this a bane? The explanation on the page is less than clear.

Mladen,

I took a look at the link that you provided and found the following
comment:
http://structureddata.org/2008/10/14/dbms_stats-method_opt-and-for-all-indexed-columns/#comment-304
"Wolfgang Breitling on December 3, 2008 at 12:57 am
In response to Mladen:
“The columns that really do need histograms, where it really does
matter, are the indexed columns. Those are the only columns where the
optimizer can choose the access path.”
That is a common misconception: that histograms only matter for
indexed columns. Choosing an access method ( index or tablescan ) for
a row source is only part of the task of the CBO when putting together
an access plan. Other, at least equally important, tasks are the types
and especially order of joins – unless none of your sql contain joins.
And it is there where column selectivities, derived from column
statistics, play a crucial role. If you deprive the optimizer of this
vital information it is bound to produce suboptimal, even very bad
plans. If you have not yet encountered that count yourself lucky."

I do not have the full answer for your question, but I think that
there is at least one more criteria that must be determined - table
join order. Consider this, you have 3 tables (T1, T2, T3). In which
of the 6 join orders makes the most sense for the query optimizer:
T1 -> T2 -> T3
T1 -> T3 -> T2
T2 -> T1 -> T3
T2 -> T3 -> T1
T3 -> T1 -> T2
T3 -> T2 -> T1

Assume that you only create histograms on the indexed columns, and the
current statistics indicate that each table is estimated to return the
following number of rows:
T1: 1,000,000
T2: 500,000
T3: 100,000

What would be the join order selected in the above case? Very likely
T3 -> T2 -> T1 (smallest cardinality estimate to largest)

Now consider that table T1 has an unindexed column named STATUS - the
status is one of RELEASED, CLOSED, and CANCELLED. Your particular
query is looking for the few rows in table T1 with a STATUS of
CANCELLED, with just 200 of the 1,000,000 rows matching that
criteria. If a histogram were created on that column, the optimizer
might change the cardinality estimates like this (note that it likely
will not be exactly correct due to the selectivity estimated being
multiplied together when multiple predicates are present with AND
specified between the predicates:
T1: 5
T2: 500,000
T3: 100,000

In the above case, the selected join order might be T1 -> T3 -> T2

I think that the above is a case where just collecting histograms on
indexed columns might be limiting.

In comment 38 in the following blog article:
http://richardfoote.wordpress.com/2008/01/04/dbms_stats-method_opt-default-behaviour-changed-in-10g-be-careful/
Richard Foote stated:
"I wouldn’t recommend ALL INDEXED COLUMNS. Non-indexed columns can
benefit from stats and histograms as well as indexed columns as the
stats can provide important cardinality estimates of steps within an
execution plan that can influence join orders and when steps are
performed"

Why not index the STATUS column in my example? You could, but then
what if you are looking for something different, like sales in the
state of Alaska (probably much fewer than for California) - would you
necessarily index a column that indicates the state of a customer?
(On second thought, I probably would not have thought to manually
build a histogram on this column either, but the statistics collection
process might).

Charles Hooper
Co-author of "Expert Oracle Practices: Oracle Database Administration
from the Oak Table"
http://hoopercharles.wordpress.com/
IT Manager/Oracle DBA
K&M Machine-Fabricating, Inc.

Noons

unread,
Sep 13, 2011, 1:09:28 AM9/13/11
to
On Sep 13, 8:45 am, Charles Hooper <hooperc2...@gmail.com> wrote:



>  “The columns that really do need histograms, where it really does
> matter, are the indexed columns. Those are the only columns where the
> optimizer can choose the access path.”
>  That is a common misconception: that histograms only matter for
> indexed columns. Choosing an access method ( index or tablescan ) for
> a row source is only part of the task of the CBO when putting together
> an access plan. Other, at least equally important, tasks are the types
> and especially order of joins – unless none of your sql contain joins.

Which, strangely enough, is the precise case of the examples provided
by Greg in that entry.
Sorry. but trying to convince anyone that joins can be affected with
an example that doesn't use joins is, IMHO, a bit of a dream.

Yes, we can see the difference in cardinality. Yes, the query in the
examples uses no indexes in ALL cases exposed. As it should, given
there are no indexes in those columns.
What exactly is the problem?

What, you telling me that in a join somehow magically the CBO would
create an index on the fly and use it if there were histograms for all
columns?

You don't use Peoplesoft HR much, do you? I can provide some concrete
examples where using histograms in all columns, indexed or not, can be
deadly in 10gr2.
In fact, there is now (FINALLY!!!) a documented process that
specifically eliminates histograms from certain tables used by PS for
paycalc runs and as a result gets reasonable performance instead of
the horrible pot-luck that is histograms combined with bind variable
peeking in 10gr2.

So, please: before we start crucifying those who by choice wouldn't
give a fig about being an ACE or OCP, wouldn't it be better to
actually provide proper examples if at all possible?
Particularly where relevant to a specific release?
It's not like the CBO behaves consistently across releases or has EVER
done so in recent history!...


> And it is there where column selectivities, derived from column
> statistics, play a crucial role. If you deprive the optimizer of this
> vital information it is bound to produce suboptimal, even very bad
> plans. If you have not yet encountered that count yourself lucky."

I've never encountered that situation and I am not particularly
lucky.
Given lack of indexes and partition pruning, the CBO can only produce
full table scans to get data, end of story and period.
That may or may not be bad, depending on a huge set of factors and
conditions.
Nothing to do with histograms on indexed or non-indexed columns.

> snippage...
> Now consider that table T1 has an unindexed column named STATUS - the
> status is one of RELEASED, CLOSED, and CANCELLED.  Your particular
> query is looking for the few rows in table T1 with a STATUS of
> CANCELLED, with just 200 of the 1,000,000 rows matching that
> criteria.  If a histogram were created on that column, the optimizer
> might change the cardinality estimates like this (note that it likely
> will not be exactly correct due to the selectivity estimated being
> multiplied together when multiple predicates are present with AND
> specified between the predicates:
> T1: 5
> T2: 500,000
> T3: 100,000
>
> In the above case, the selected join order might be T1 -> T3 -> T2
>
> I think that the above is a case where just collecting histograms on
> indexed columns might be limiting.

"might", "likely to change", and so on. See the problem?
Yes, there may be a particular case, once in a blue moon, when the
wind is blowing in the right direction, where this might be
necessary. In the other 99.999999% of the cases, it will never
happen. Production dbas deal with the 0.000001% case when/if needed,
otherwise they just set one size fits all and go with it. Nothing
wrong with that and certainly not an appropriate subject for
"crucification".


> Richard Foote stated:
> "I wouldn’t recommend ALL INDEXED COLUMNS. Non-indexed columns can
> benefit from stats and histograms as well as indexed columns as the
> stats can provide important cardinality estimates of steps within an
> execution plan that can influence join orders and when steps are
> performed"

Like I said: "can" is not a synonym for "must" or "will"

> Why not index the STATUS column in my example?  You could, but then
> what if you are looking for something different, like sales in the
> state of Alaska (probably much fewer than for California) - would you
> necessarily index a column that indicates the state of a customer?

Yes, of course I would! In the case where we are after a high
cardinality value, the indexes would work perfectly and as desired.
In the case of low cardinality, hopefully the histogram accompanying
that index - the case of "FOR ALL INDEXED COLUMNS" - would allow the
CBO to drop the index and FTS. And in a join the same would happen.
Isn't that what one would hope for (abstracting the usual CBO bugs)?

I've said it many times before: a specific example is not proof of a
much wider case's validity. I think it applies here more than
anything else. Although I'm not sure I'd follow Mladen's idea of "256
always" without further scrutiny, I can't see what the "crucification"
problem is with histograms on indexed columns.

> (On second thought, I probably would not have thought to manually
> build a histogram on this column either, but the statistics collection
> process might).

The lesser said about the "auto" stats gathering process, the better...

Randolf Geist

unread,
Sep 13, 2011, 5:03:14 AM9/13/11
to
On Sep 12, 5:15 pm, Mladen Gogala <gogala.mla...@gmail.com> wrote:
There are several things to consider:

1. If you use the INDEXED keyword in the METHOD_OPT clause then you
effectively do not gather any statistics for columns that are not
indexed. If you know that all columns that are used as filter / join
predicates are indexed, then this is fine. If you however have columns
that lack basic column statistics but are relevant then the CBO is in
trouble. Dynamic sampling won't be used for such cases and Oracle
falls back to hard-coded built-in defaults for selectivity/cardinality
estimates based on those columns. Hence cardinality estimates are
potentially way off leading to bad join and access orders. This is
what Charles referred to as common misunderstanding - bad cardinality
estimates are your enemy no. 1 for good performing execution plans.

Bottom line here is: You should ensure to have at least basic column
statistics on all columns that are relevant as join / filter
predicates. Note that I'm not talking about histograms here but merely
basic column statistics

2. Using the clause SIZE 254 Oracle will unconditionally generate
histograms on those columns. You don't need histograms on all
(indexed) columns, for most cases basic column statistics are more
than sufficient.

The point here is that histograms can be counter-productive for many
cases. If you end up with height-balanced histograms (more than 254
distinct column values) the usefulness of the histogram is quite
questionable and only helpful in rare cases.

But even if you have a frequency histogram there are little
implementation details and side-effects that can introduce surprising
problems, joins included.

Bottom line here is: Only generate histograms where necessary - and I
believe that this means even less histograms than are generated when
using the SIZE AUTO option. Histograms should rarely be applied, and
in some cases you would actually need to craft a histogram manually in
order to be beneficial rather than gather the histogram.

For more details I recommend reading Richard Foote's recent blog post
on the SIZE AUTO option including the comments:

http://richardfoote.wordpress.com/2011/09/01/method_opt-size-auto-quiz-solution-the-trickster

Some of the reasons why histograms can be bad I've summarized in a
comment to the post:

http://richardfoote.wordpress.com/2011/09/01/method_opt-size-auto-quiz-solution-the-trickster/#comment-31285

I would summarize a general recommendation like this:

- Use FOR ALL COLUMNS SIZE 1 as default. This way you ensure that you
end up with basic column statistics on all columns.
- Generate histograms only on columns where you can prove that it will
be beneficial - do this explicitly or you could use an approach of
SIZE REPEAT instead of SIZE 1 and generate the histogram once
explicitly
- There are cases where you will only benefit from a histogram when
crafting it manually

Hope this helps,
Randolf

Mladen Gogala

unread,
Sep 13, 2011, 10:05:11 AM9/13/11
to
On Mon, 12 Sep 2011 15:45:27 -0700, Charles Hooper wrote:

> Why not index the STATUS column in my example? You could, but then what
> if you are looking for something different, like sales in the state of
> Alaska (probably much fewer than for California) - would you necessarily
> index a column that indicates the state of a customer? (On second
> thought, I probably would not have thought to manually build a histogram
> on this column either, but the statistics collection process might).

Charles, the issue here is how to collect statistics, the regular job of
collecting statistics. In Oracle11g, it's not a problem any longer
because I have SET_TABLE_PREFS routine which can help me with setting the
preferences for the individual table. However, collecting stats for all
columns will consume a fairly large amount of space, that is something
that I don't necessarily want to do. In oracle10g, I solved the problem
by disabling the built-in database collecting stats and writing my own
script, which would run every Saturday at 9PM. This new procedure would
run "GATHER_DATABASE_STATS" followed by special treatment for 14 tables.
The "special treatment" included changing the method to
"FOR ALL COLUMNS SIZE 254" for some tables, as well as dividing the
clustering factor by 100 for some indexes. The problem is what to collect
by default, without wasting too much space.



--
http://mgogala.byethost5.com

Mladen Gogala

unread,
Sep 13, 2011, 10:18:16 AM9/13/11
to
On Tue, 13 Sep 2011 02:03:14 -0700, Randolf Geist wrote:

> - Use FOR ALL COLUMNS SIZE 1 as default. This way you ensure that you
> end up with basic column statistics on all columns. - Generate
> histograms only on columns where you can prove that it will be
> beneficial - do this explicitly or you could use an approach of SIZE
> REPEAT instead of SIZE 1 and generate the histogram once explicitly
> - There are cases where you will only benefit from a histogram when
> crafting it manually
>
> Hope this helps,
> Randolf

This is an interesting suggestion which doesn't waste too much space. I
will try it out and will let you know of the outcome.



--
http://mgogala.byethost5.com
0 new messages