Can anyone tell me (or point me at information) about how statistics are generated/used for multi-column indexes?
We have a 2 column status model (major and minor status, if you like). This is key to our application.
"Naturally" we have put a 2 column index on these two columns.
But we have "doubts" about whether the CBO is always doing as good as job as we'd like.
The key question seems to be:
Are the columns "treated as a pair" so that the frequency estimate for a pair of statuses depends on both of them or...
Are the columns treated in isolation so that the frequency estimate is simply obtained by getting the estimate for each column, and combining them mathematically.
This would make a big difference in our app, since our major/minor status (in practice) have interesting and complex correlations.
jg -- @home.com is bogus. "...Network hardware includes Dell servers, Dell workstations, IBM RS6000 and P570 RISK based systems, Cisco routers, and HP ProCurve switch’s. MINIMUM REQUIREMENTS Qualified candidates must have a Bachelor’s degree in Computer Science or a related degree with 4-6 years programming experience with HTML/ CGI/Java, ASP.Net and an understanding of internet working and knowledge of applications available on the WEB, Business Objects XIR2, Crystal Reports XIR2, VB6 Programming Language, Visual Basic and Studio, in an Oracle/AIX/2003 environment, or..." - metrolink job ad.
On Fri, 04 Dec 2009 16:48:43 +0000, bugbear wrote: > Can anyone tell me (or point me at information) about how statistics are > generated/used for multi-column indexes?
> We have a 2 column status model (major and minor status, if you like). > This is key to our application.
> "Naturally" we have put a 2 column index on these two columns.
> But we have "doubts" about whether the CBO is always doing as good as > job as we'd like.
> The key question seems to be:
> Are the columns "treated as a pair" so that the frequency estimate for a > pair of statuses depends on both of them or...
> Are the columns treated in isolation > so that the frequency estimate > is simply obtained by getting the estimate for each column, and > combining them mathematically.
> This would make a big difference in our app, since our major/minor > status (in practice) have interesting and complex correlations.
> BugBear
10g doesn't do multi-column histograms. 10g will create histograms for every single column of the index and will estimate the number of entries retrieved for each column by using those histograms. After that, it will calculate the selectivity of the conditions by multiplying the selectivities for every single column. Selectivity is the estimated number of retrieved values divided by the total number of values in that column. When the CBO gets the selectivity, it will multiply it by the index clustering factor to estimate the number of blocks that need to be retrieved. The correlation between the number of blocks and the price of the query is known only to the God and Jonathan, but there definitely is one.
> On Fri, 04 Dec 2009 16:48:43 +0000, bugbear wrote:
>> Can anyone tell me (or point me at information) about how statistics are >> generated/used for multi-column indexes?
>> We have a 2 column status model (major and minor status, if you like). >> This is key to our application.
>> "Naturally" we have put a 2 column index on these two columns.
>> But we have "doubts" about whether the CBO is always doing as good as >> job as we'd like.
>> The key question seems to be:
>> Are the columns "treated as a pair" so that the frequency estimate for a >> pair of statuses depends on both of them or...
>> Are the columns treated in isolation >> so that the frequency estimate >> is simply obtained by getting the estimate for each column, and >> combining them mathematically.
>> This would make a big difference in our app, since our major/minor >> status (in practice) have interesting and complex correlations.
> 10g doesn't do multi-column histograms. 10g will create histograms for > every single column of the index and will estimate the number of entries > retrieved for each column by using those histograms. After that, it will > calculate the selectivity of the conditions by multiplying the > selectivities for every single column. Selectivity is the estimated > number of retrieved values divided by the total number of values in that > column. When the CBO gets the selectivity, it will multiply it by the > index clustering factor to estimate the number of blocks that need to be > retrieved. The correlation between the number of blocks and the price of > the query is known only to the God and Jonathan, but there definitely is > one.
Are there cases where it makes sense to use a FBI to get "multi column" histogram information? Of course, queries then would also have to use that concatenated value as query criteria which makes usage of this quite nasty (especially if the SQL is generated by some kind of persistence container).
> Can anyone tell me (or point me at information) about > how statistics are generated/used for multi-column indexes?
> We have a 2 column status model (major and minor status, if you like). > This is key to our application.
> "Naturally" we have put a 2 column index on these two columns.
> But we have "doubts" about whether the CBO is always > doing as good as job as we'd like.
> The key question seems to be:
> Are the columns "treated as a pair" so that the > frequency estimate for a pair of statuses depends > on both of them or...
> Are the columns treated in isolation > so that the frequency estimate > is simply obtained by getting the estimate > for each column, and combining them mathematically.
> This would make a big difference in our app, > since our major/minor status (in practice) have interesting > and complex correlations.
> BugBear
To add to what others have said, how the CBO treats the two columns is going to in part depend on how the sql statement is written. If the query uses two bind variables, one for each indexed column, in the where clause conditions then the CBO (ignoring bind variable peeking) uses the default selectivity (.05) for the first column times the default selectivity of the second column in its calculations. If the data is not skewed then the presence or absence of histograms will probably not have a major mpact on the plan.
My question to you though is major/minor status of what? Will you be seeking all occurrences of whatever with a specific major status or all rows with a specific minor status? What I am trying to bring into consideration is if the status columns need to be associated with another key. That is you might need the index to be major/minor/ part_no or something like this depending on it major/minor is the PK of one of your tables or jsut really an attribute set of some of your tables. The three part key would allow more selective determination of the target rows before the table has to be accessed.
On Sat, 05 Dec 2009 14:18:38 +0100, Robert Klemme wrote: > Are there cases where it makes sense to use a FBI to get "multi column" > histogram information?
Hmmm, let me quote from the 11g book: if the columns are related in the way that breaks the 3rd normal form, such statistics would make sense. The famous example by Jonatan was about zodiac signs and months of birth. The 10g CBO would produce an incorrect estimate for the number of people with the sign of Capricorn born in June. There are, of course, none, but you can't know that without a multi-column histogram.
> On Sat, 05 Dec 2009 14:18:38 +0100, Robert Klemme wrote:
>> Are there cases where it makes sense to use a FBI to get "multi column" >> histogram information?
> Hmmm, let me quote from the 11g book: if the columns are related in the > way that breaks the 3rd normal form, such statistics would make sense. > The famous example by Jonatan was about zodiac signs and months of birth. > The 10g CBO would produce an incorrect estimate for the number of people > with the sign of Capricorn born in June. There are, of course, none, but > you can't know that without a multi-column histogram.
On Sun, 06 Dec 2009 13:33:06 +0100, Robert Klemme wrote: > On 05.12.2009 22:35, Mladen Gogala wrote: >> On Sat, 05 Dec 2009 14:18:38 +0100, Robert Klemme wrote:
>>> Are there cases where it makes sense to use a FBI to get "multi >>> column" histogram information?
>> Hmmm, let me quote from the 11g book: if the columns are related in the >> way that breaks the 3rd normal form, such statistics would make sense. >> The famous example by Jonatan was about zodiac signs and months of >> birth. The 10g CBO would produce an incorrect estimate for the number >> of people with the sign of Capricorn born in June. There are, of >> course, none, but you can't know that without a multi-column histogram.
> That makes a lot of sense! Thank you, Mladen!
> Kind regards
> robert
Give the credit where the credit is due. The original source is Jonathan Lewis, not me. I do confess to reading his articles vewy, vewy cawefully. I was unable to find the example quickly so here is where Tom Kyte also invokes the same famous example: http://tinyurl.com/yjw5gbe
Robert Klemme wrote: > Are there cases where it makes sense to use a FBI to get "multi column" > histogram information? Of course, queries then would also have to use > that concatenated value as query criteria which makes usage of this > quite nasty (especially if the SQL is generated by some kind of > persistence container).
I though about this rather carefully.
It is possible to do this (in my case, at least)
However, the "accurate" statistics associated with the FBI make it quite important that there are not "too many" other indexes that the CBO might consider.
Because due to the inapropriate estimation of selectivity the CBO is quite determined to use almost-anything-other that YOUR FBI.
It doesn't "know" that the apparently low selectivity of your FBI is because it's the only index with accurate stats!
> We have a 2 column status model (major and minor status, if you like). > This is key to our application.
> "Naturally" we have put a 2 column index on these two columns.
> But we have "doubts" about whether the CBO is always > doing as good as job as we'd like.
> The key question seems to be:
> Are the columns "treated as a pair" so that the > frequency estimate for a pair of statuses depends > on both of them or...
> Are the columns treated in isolation > so that the frequency estimate > is simply obtained by getting the estimate > for each column, and combining them mathematically.
If you create histograms on the two columns separately and use literals in the query, then Oracle can use the histograms to calculate the selectivity of the two columns separately and multiply to get the overall selectivity. If you are using bind variables, then the values used for this optimisation step will be the peeked bind values on the first call - and that may not be good for subsequent calls).
For a pair of status columns with (we assume) a small number of distinct values, it makes sense to use literals and histograms. (And then it's a good idea to construct histograms, rather than letting the automatic stats collection gather them - see: http://jonathanlewis.wordpress.com/2009/05/28/frequency-histograms/
If you don't have histograms, then there are various possibilities depending on version of Oracle. The optimizer may multiply the column selectivities, but it may base its arithmetic on the number of distinct keys in the index. The behaviour varies between 10.2.0.1, 10.2.0.3 and 10.2.0.4.
To add to what others have said, how the CBO treats the two columns is going to in part depend on how the sql statement is written. If the query uses two bind variables, one for each indexed column, in the where clause conditions then the CBO (ignoring bind variable peeking) uses the default selectivity (.05) for the first column times the default selectivity of the second column in its calculations.
Mark
That 5% figure does appear in various places in the optimizer's arithmetic, but in this case is would be using the product of 1/num_distinct for the two columns, or 1/distinct_keys for the index (depending on version)