Statistics Gathering

Karen Morton just posted a great paper on her blog about statistics gathering. The paper is titled “Managing Statistics for Optimal Query Performance“. I was excited to see it because I think gathering stats is one of the most important and least well understood aspects of managing an Oracle environment. I must admit that I was expecting a recommended method or framework for gathering stats, but actually the paper is really more about how the statistics are used along with general guidelines for gathering them, rather than a direct recommendation on how to gather them. Nevertheless, it is one of the best papers I’ve seen on the subject. She’s going to present the paper at the Hotsos Symposium to be held in Dallas the week of March 9th. I’m going to be there and am really looking forward to hearing what she has to say on the subject.

By the way, I can’t recommend this conference highly enough. If you really want to understand how Oracle works, this is the place to be. You should know that I am not generally a fan of formal training classes. I have often been disappointed because I felt like my time would have been better spent researching the subject matter myself. On the other hand, I have found a lot of value in working closely on a project with someone who knows the subject matter well, kind of like a mentor. But generally speaking, the formal classes have been less satisfying, except in the rare case where you get the great instructor that wrote the class. This symposium format on the other hand allows you to listen to a collection of really knowledgeable Oracle people packed into a short period of time. I have been to the Hotsos Symposiums for several years in a row and I always come away with pages of notes on things I want to investigate further. And the participants are, generally speaking, a collection of very bright Oracle people. So even the conversations between the presentations are often very interesting. Finally, they run two presentations at a time which allows you to pick the presentation that is most interesting. I have often found it hard to choose (don’t tell anyone, but I have on more than one occasion listened to the first half of one and then the second half of the other). So like I said, I find it to be a very productive few days.

But I digress, Karen’s paper is pretty long (24 pages) but it covers a vast amount of ground. There are a number of one liners that could be expanded into full papers. In the paper she discusses a number of topics including dealing with short comings of the optimizer in 10g. One of those issues is bind variable peeking (probably my least favorite optimizer feature, quirk, bug, … what ever you want to call it). I must say that I think it has caused far more problems than it solved, and I frankly don’t know what they were thinking when they put that feature in. I wrote a little about a way to get around it using outlines here. By the way, this reminds me of a cartoon I drew 20 years ago that looked very similar to this one (that I lifted off of Steve Karam’s blog)

Of course as Karen points out, the right way to deal with bind variable peeking issues it is to understand your data and use literals where they are appropriate, keeping in mind the number of additional statements that will need to be parsed and dealt with in the shared pool. She also points out that code could be written to selectively use literals for specific values, giving you a mix of literals and bind variables for the same statement. This approach should allow you to minimize the impact on the shared pool while still providing the optimizer with the data it needs to make good decisions (this is a great idea but I’ve never seen anyone actually do it). And of course she points out that 11g has a much better mechanism for dealing with this whole issue.

Another idea that really got me thinking was the use of dynamic sampling . Karen clearly shows one of the advantages of dynamic sampling in the case of correlated predicates (i.e. the optimizer assumes a query where car_model = ‘Mustang’ and car_make = ‘Ford’ are independent, when clearly they are not). She shows how dynamic sampling can be very useful in conjunction with normal statistics in this situation. (rats, now I’m going to have to go play around with that a bit – so much to do, so little time)

Finally, she discusses some of the statistics gathering options and differences in 9i, 10g, and 11g. The automatic creation of histograms is one of the main differences between 9i and 10g and she discusses this issue, but doesn’t go into to much detail on it. I must admit that I think 10g’s default setup does a very poor job when it comes to histograms. This is the one area I would have liked to see address a little more fully, but at 24 pages already I can understand why she had to draw the line somewhere. Anyway, by default 10g creates histograms on columns based on several factors including their use in where clauses. Unfortunately, histograms often get created on columns where their usefulness is questionable at best and they regularly get created with a very small sample sizes. The small sample sizes often result in significant inaccuracies. I personally think that allowing the gather stats job to automatically create histograms in 10g is really bad idea.

Anyway, this is a paper that is well worth reading in my opinion. Typical Hotsos Symposium fare!

4 Comments

  1. Josip Voyager says:

    Hi Kerry,

    Let say I have one large table in my data warehouse that is partitioned by range (on date column).

    My ETL procedure is executing on a daily basis and only last three or four partition have been changed between two ETL batch procedures.

    I can maintain partition level statistics for changed partitions and local indexes easily (it’s very fast on my system), but the problem is global statistics maintenance.

    Question is how can I incremental update/maintain global statistics for large, range partitioned table on 10g release 2 database running on IBM AIX server?

    Please don’t tell me to upgrade to 11g because feature of incremental updating global statistics is implemented there, while my database is about 20 Tb, and Oracle 11g release 2 has not been released for IBM AIX platform yet.

    I can send you the way I collect statistics right now if you like.

    I’d like If you could provide an example.

    Regards

    Josip

  2. osborne says:

    Hi Josip,

    10g has a couple of options that can be used to simulate the 11g functionality. Namely the ‘APPROX_GLOBAL AND PARTITION’ ganularity setting on the gather_table_stats procedure and the dbms_stats.copy_table_stats procedure. Both are available with 10.2.0.4 but require a one off patch (8413548). It’s not exactly the same, but close.

    You could do something like this:

    EXEC DBMS_STATS.GATHER_TABLE_STATS (‘&owner’, ‘&table_name’, ‘&partition_name’, GRANULARITY => ‘APPROX_GLOBAL AND PARTITION’);

    Do a google search and you should find a couple of hits on this. For example, the Optimizer Development Group Blog has this post:

    http://optimizermagic.blogspot.com/2009/02/maintaining-statistics-on-large.html

    Kerry

  3. Josip Voyager says:

    Hi Kerry,

    I’ve found the article on Optimizer Development Group Blog, but there are no patch 8413548 to enable APPROX_GLOBAL AND PARTITION option for statistics gathering on large partitioned tables in Oracle 10g release 2.

    I’ve search the Metalink without success. I assume that you mistype the patch number (and the guys behind Optimizer Development Group Blog) because patch number 8413548 doesn’t exist. Could you please help me to speed up my global statistics gathering.

    Josip

  4. osborne says:

    Josip,

    I can’t find that patch any more either (possibly due to it being rolled into 10.2.0.5). Check out this document on Metalink: 7116357.8.

    Kerry

Leave a Reply