Statistics Gathering
Karen Morton just posted a great paper on her blog about statistics gathering. The paper is titled “Managing Statistics for Optimal Query Performance”. I was excited to see it because I think gathering stats is one of the most important and least well understood aspects of managing an Oracle environment. I must admit I was expecting a recommended method or framework for gathering stats, but the paper is really more about how the statistics are used, along with general guidelines for gathering them, than a direct recommendation on how to gather them. Nevertheless, it is one of the best papers I’ve seen on the subject. She’s going to present the paper at the Hotsos Symposium to be held in Dallas the week of March 9th. I’m going to be there and am really looking forward to hearing what she has to say on the subject.
By the way, I can’t recommend this conference highly enough. If you really want to understand how Oracle works, this is the place to be. You should know that I am not generally a fan of formal training classes. I have often been disappointed because I felt like my time would have been better spent researching the subject matter myself. On the other hand, I have found a lot of value in working closely on a project with someone who knows the subject matter well, kind of like a mentor. But generally speaking, the formal classes have been less satisfying, except in the rare case where you get a great instructor who wrote the class. The symposium format, on the other hand, lets you listen to a collection of really knowledgeable Oracle people packed into a short period of time. I have been to the Hotsos Symposium several years in a row and I always come away with pages of notes on things I want to investigate further. And the participants are, generally speaking, a collection of very bright Oracle people, so even the conversations between the presentations are often very interesting. Finally, they run two presentations at a time, which lets you pick whichever one is most interesting. I have often found it hard to choose (don’t tell anyone, but I have on more than one occasion listened to the first half of one and then the second half of the other). So like I said, I find it to be a very productive few days.
But I digress. Karen’s paper is pretty long (24 pages) but it covers a vast amount of ground. There are a number of one-liners that could be expanded into full papers. In the paper she discusses a number of topics, including dealing with shortcomings of the optimizer in 10g. One of those issues is bind variable peeking (probably my least favorite optimizer feature, quirk, bug, … whatever you want to call it). I must say that I think it has caused far more problems than it solved, and I frankly don’t know what they were thinking when they put that feature in. I wrote a little about a way to get around it using outlines here. By the way, this reminds me of a cartoon I drew 20 years ago that looked very similar to this one (which I lifted off of Steve Karam’s blog).
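To make the peeking problem concrete, here’s roughly what it looks like in SQL*Plus (a completely made-up table and some hypothetical skewed data, just to sketch the behavior):

-- STATUS is heavily skewed: nearly every row is 'PROCESSED', only a handful are 'PENDING'
create table orders (order_id number, status varchar2(20));
create index orders_status_ix on orders (status);
exec dbms_stats.gather_table_stats(user, 'ORDERS', method_opt => 'FOR COLUMNS STATUS SIZE 254');

variable b1 varchar2(20)

-- if the first (hard-parsing) execution peeks at 'PENDING', the index plan looks great...
exec :b1 := 'PENDING'
select count(*) from orders where status = :b1;

-- ...but every later execution with 'PROCESSED' reuses that same plan, because the
-- optimizer only peeks at the bind value at hard parse time
exec :b1 := 'PROCESSED'
select count(*) from orders where status = :b1;

The plan you end up with depends entirely on which value happens to be in the bind variable when the statement gets hard parsed, which is exactly why the behavior feels so random in production.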
Of course, as Karen points out, the right way to deal with bind variable peeking issues is to understand your data and use literals where they are appropriate, keeping in mind the number of additional statements that will need to be parsed and dealt with in the shared pool. She also points out that code could be written to selectively use literals for specific values, giving you a mix of literals and bind variables for the same statement. This approach should allow you to minimize the impact on the shared pool while still providing the optimizer with the data it needs to make good decisions (this is a great idea but I’ve never seen anyone actually do it). And of course she points out that 11g has a much better mechanism for dealing with this whole issue.
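She doesn’t show code for that idea, but something along these lines is what I imagine (hypothetical table, column, and values, of course):

create or replace function order_count (p_status in varchar2) return number is
  l_count number;
begin
  if p_status in ('PROCESSED', 'SHIPPED') then
    -- the few known, popular (skewed) values get literals, so each one gets its own
    -- cursor and the optimizer can see the actual value at parse time
    execute immediate
      'select count(*) from orders where status = ''' || p_status || ''''
      into l_count;
  else
    -- everything else shares a single cursor via a bind variable
    select count(*) into l_count from orders where status = p_status;
  end if;
  return l_count;
end;
/

Since the literal branch only ever sees a short, hard-coded list of values, the number of extra cursors in the shared pool stays bounded.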
Another idea that really got me thinking was the use of dynamic sampling. Karen clearly shows one of the advantages of dynamic sampling in the case of correlated predicates (i.e. the optimizer assumes that predicates like car_model = ‘Mustang’ and car_make = ‘Ford’ are independent, when clearly they are not). She shows how dynamic sampling can be very useful in conjunction with normal statistics in this situation. (Rats, now I’m going to have to go play around with that a bit – so much to do, so little time.)
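If you want to play along at home, a quick way to experiment is with the dynamic_sampling hint (the CARS table here is made up, but the Mustang/Ford example is hers):

-- with ordinary statistics the optimizer multiplies the two selectivities together,
-- as if make and model were independent, and underestimates the row count
select count(*) from cars where car_make = 'Ford' and car_model = 'Mustang';

-- a dynamic_sampling hint at level 4 or above samples the table at parse time and can
-- notice that the two single-table predicates are correlated, correcting the estimate
select /*+ dynamic_sampling(c 4) */ count(*)
from   cars c
where  c.car_make  = 'Ford'
and    c.car_model = 'Mustang';

-- the same thing can be tried session-wide:
-- alter session set optimizer_dynamic_sampling = 4;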
Finally, she discusses some of the statistics gathering options and differences in 9i, 10g, and 11g. The automatic creation of histograms is one of the main differences between 9i and 10g, and she discusses this issue but doesn’t go into too much detail on it. I must admit that I think 10g’s default setup does a very poor job when it comes to histograms. This is the one area I would have liked to see addressed a little more fully, but at 24 pages already I can understand why she had to draw the line somewhere. Anyway, by default 10g creates histograms on columns based on several factors, including their use in where clauses. Unfortunately, histograms often get created on columns where their usefulness is questionable at best, and they regularly get created with very small sample sizes. The small sample sizes often result in significant inaccuracies. I personally think that allowing the gather stats job to automatically create histograms in 10g is a really bad idea.
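For what it’s worth, the 10g default comes from method_opt => 'FOR ALL COLUMNS SIZE AUTO', which lets Oracle decide which columns get histograms based partly on recorded column usage. My own preference (not a recommendation from the paper, and the table and column here are hypothetical) is to turn the automatic histograms off and add them back deliberately:

-- change the database-wide default that the automatic gather job picks up (10g)
exec dbms_stats.set_param('METHOD_OPT', 'FOR ALL COLUMNS SIZE 1');

-- gather a table's stats with no histograms...
exec dbms_stats.gather_table_stats(user, 'ORDERS', method_opt => 'FOR ALL COLUMNS SIZE 1');

-- ...then build a histogram only on a column you know is skewed and actually used in
-- where clauses, with a sample size large enough to be accurate
exec dbms_stats.gather_table_stats(user, 'ORDERS', method_opt => 'FOR COLUMNS STATUS SIZE 254', estimate_percent => 100);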
Anyway, this is a paper that is well worth reading in my opinion. Typical Hotsos Symposium fare!