Bind Variable Peeking – Drives Me Nuts!

In the constant battle to provide consistent performance, Oracle took a giant step backwards with the 9i version by introducing an “Enhancement” called Bind Variable Peeking. I’ll explain what I mean in a minute, but first a bit of history.

When Oracle introduced histograms in 8i, they provided a mechanism for the optimizer to recognize that the values in a column were not distributed evenly. That is, in a table with 100 rows and 10 distinct values, the default assumption the optimizer would make, in the absence of a histogram, would be that no matter which value you picked – you would always get 100/10 or 10 rows back. Histograms let the optimizer know if that was not the case. The classic example would be 100 records with 2 distinct values where one value, say “Y”, occurred 99 times and the other value, say “N”, occurred only 1 time.  So without a histogram the optimizer would always assume that whether you requested records with a “Y” or an “N”, you would get half the records back (100/2 = 50). Therefore you always want to do a full table scan as opposed to using an index on the column. A histogram, assuming it was accurate (we’ll come back to that later), would let the optimizer know that the distribution was not normal (i.e. not spread out evenly – also commonly called skewed) and that a “Y” would get basically the whole table, while an “N” would get only 1%. This would allow the optimizer to pick an appropriate plan regardless of which value was specified in the Where Clause.

So let’s consider the implications of that. Would that improve the response time for the query where the value was “Y”. The answer is no. In this simple case, the default costing algorithm is close enough and produces the same plan that the histogram produces. The full table scan takes just as long whether the optimizer thought it was getting 50 rows or 99 rows. But what about the case where we specified the value of “N”. In this case, with a histogram we would pick up the index on that column and presumably get a much better response time than the full table scan. This is an important point. Generally speaking it is only the outliers, the exceptional cases if you will, where the histogram really makes a difference.

So at first glance, all appeared well with the world. But there was a fly in the ointment. You had to use literals in your SQL statements for the optimizer to be able use the histograms. So you had to write your statements like this:

SELECT XYZ FROM TABLE1 WHERE COLUMN1 = ‘Y’;

SELECT XYZ FROM TABLE1 WHERE COLUMN1 = ‘N’;

Not a problem in our simple example, because you only have two possibilities. But consider a statement with 2 or 3 skewed columns, each with a couple of hundred distinct values. The possible combinations could quickly grow into the millions. Not a good thing for the shared pool.

Enter our star: Bind Variable Peeking, a new feature introduced in 9i that was added to allow the optimizer to peek at the value of bind variables and then use a histogram to pick an appropriate plan, just like it would do with literals. The problem with the new feature was that it only looked at the variables once, when the statement was parsed. So let’s make our simple example a little more realistic by assuming we have a 10 million row table where 99% have a value of “Y” and 1% have a value of “N”. So in our example, if the first time the statement was executed it was passed a “Y”, the full table scan plan would be locked in and it would be used until the statement had to be re-parsed, even if the value “N” was passed to it in subsequent executions.

So let’s consider the implication of that. When you get the full table scan plan (because you passed a “Y” the first time) it behaves the same way no matter what which value you pass subsequently. Always a full table scan, always the same amount of work and the same basic elapsed time. From a user standpoint that seems reasonable. The performance is consistent. (this is the way it would work without a histogram by the way) On the other hand, if the index plan gets picked because the parse occurs with a value of “N”, the executions where the value is “N” will be even faster than they were before, but the execution with a value of “Y” will be incredibly slow. This is not at all what the user expects. They expect the response time to be about the same every time they execute a piece of code. And this is the problem with bind variable peeking. It’s basically just Russian Roulette. It just depends on what value you happen to pass the statement when it’s parsed (which could be any execution by the way).

So is Bind Variable Peeking a feature or a bug? Well technically it’s not a bug because it works the way it’s designed. I just happen to believe that it was not a good decision to implement it that way. But what other choices did the optimizer development group have?

  • They could have evaluated the bind variables and re-parsed  for every execution of every statement using bind variables. This would eliminate the advantage of having bind variables in the first place and would never work for high transaction systems. So it was basically not an option.
  • They could have just said no, and made us use literals in order to get the benefit of histograms (probably not a bad option in retrospect – the fact that they added _optim_peek_user_binds probably means that they decided later to give us that option via setting this hidden parameter).
  • They could have implemented a system where they could identify statements that might benefit from different plans based on the values of bind variables. Then peek at those variables for every execution of those “bind sensitive” statements (sound familiar? – that’s what they finally did in 11g with Adaptive Cursor Sharing).

So why is it such a pervasive problem? And I do believe it is a pervasive problem with 10g in particular. A couple of reasons come to mind:

  1. We’ve been taught to always use bind variables. It’s a best practice which allows SQL statements to be shared, thus eliminating a great deal of work/contention. Using bind variable is an absolute necessity when building scalable high transaction rate systems. (of course that doesn’t mean that you can’t bend the rule occasionally)
  2. 10g changed it’s default stats gathering method to automatically gather histograms. So in a typical 10g database there are a huge number of histograms, many of them inappropriate (i.e. on columns that don’t have significantly skewed distributions) and many of them created with very small sample sizes causing the histograms to be less than accurate. Note that 11g appears to be better on both counts – that is to say, 11g seems to create fewer inappropriate histograms and seems to create much more accurate histograms with small sample sizes. But the jury is still out on 11g stats gathering as it has not been widely adopted at this point in time.
  3. In my humble opinion, Bind Variable Peeking is not that well understood. When I talk to people about the issue, they usually have heard of it and have a basic idea what the problem is, but their behavior (in terms of the code they write and how they manage their databases) indicates that they don’t really have a good handle on the issue.

So what’s the best way to deal with this issue? Well recognizing that you have a problem is the first step to recovery, so being able to identify that you have a problem with plan stability is an appropriate first step. Direct queries against the Statspack or AWR tables are probably the best way to identify the issue. I’ve posted a couple of scripts that I find useful for this purpose previously – (unstable_plans.sql, awr_plan_stats.sql, awr_plan_change.sql). What you’re looking for is statements that flip flop back and forth between 2 or more plans. Note that there are other reasons for statements to change plans, but Bind Variable Peeking is the number one suspect. Here’s an example of their usage:

SQL> @unstable_plans
SQL> break on plan_hash_value on startup_time skip 1
SQL> select * from (
  2  select sql_id, sum(execs), min(avg_etime) min_etime, max(avg_etime) max_etime, stddev_etime/min(avg_etime) norm_stddev
  3  from (
  4  select sql_id, plan_hash_value, execs, avg_etime,
  5  stddev(avg_etime) over (partition by sql_id) stddev_etime
  6  from (
  7  select sql_id, plan_hash_value,
  8  sum(nvl(executions_delta,0)) execs,
  9  (sum(elapsed_time_delta)/decode(sum(nvl(executions_delta,0)),0,1,sum(executions_delta))/1000000) avg_etime
 10  -- sum((buffer_gets_delta/decode(nvl(buffer_gets_delta,0),0,1,executions_delta))) avg_lio
 11  from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS
 12  where ss.snap_id = S.snap_id
 13  and ss.instance_number = S.instance_number
 14  and executions_delta > 0
 15  and elapsed_time_delta > 0
 16  group by sql_id, plan_hash_value
 17  )
 18  )
 19  group by sql_id, stddev_etime
 20  )
 21  where norm_stddev > nvl(to_number('&min_stddev'),2)
 22  and max_etime > nvl(to_number('&min_etime'),.1)
 23  order by norm_stddev
 24  /
Enter value for min_stddev:
Enter value for min_etime:
 
SQL_ID        SUM(EXECS)   MIN_ETIME   MAX_ETIME   NORM_STDDEV
------------- ---------- ----------- ----------- -------------
1tn90bbpyjshq         20         .06         .24        2.2039
0qa98gcnnza7h         16       20.62      156.72        4.6669
7vgmvmy8vvb9s        170         .04         .39        6.3705
32whwm2babwpt        196         .02         .26        8.1444
5jjx6dhb68d5v         51         .03         .47        9.3888
71y370j6428cb        155         .01         .38       19.7416
66gs90fyynks7        163         .02         .55       21.1603
b0cxc52zmwaxs        197         .02         .68       23.6470
31a13pnjps7j3        196         .02        1.03       35.1301
7k6zct1sya530        197         .53       49.88       65.2909
 
10 rows selected.
 
SQL> @find_sql
SQL> select sql_id, child_number, plan_hash_value plan_hash, executions execs,
  2  (elapsed_time/1000000)/decode(nvl(executions,0),0,1,executions) avg_etime,
  3  buffer_gets/decode(nvl(executions,0),0,1,executions) avg_lio,
  4  sql_text
  5  from v$sql s
  6  where upper(sql_text) like upper(nvl('&sql_text',sql_text))
  7  and sql_text not like '%from v$sql where sql_text like nvl(%'
  8  and sql_id like nvl('&sql_id',sql_id)
  9  order by 1, 2, 3
 10  /
Enter value for sql_text:
Enter value for sql_id: 0qa98gcnnza7h
 
SQL_ID         CHILD  PLAN_HASH        EXECS     AVG_ETIME      AVG_LIO SQL_TEXT
------------- ------ ---------- ------------ ------------- ------------ ------------------------------------------------------------
0qa98gcnnza7h      0  568322376            3          9.02      173,807 select avg(pk_col) from kso.skew where col1 > 0
 
SQL> @awr_plan_stats
SQL> break on plan_hash_value on startup_time skip 1
SQL> select sql_id, plan_hash_value, sum(execs) execs, sum(etime) etime, sum(etime)/sum(execs) avg_etime, sum(lio)/sum(execs) avg_lio
  2  from (
  3  select ss.snap_id, ss.instance_number node, begin_interval_time, sql_id, plan_hash_value,
  4  nvl(executions_delta,0) execs,
  5  elapsed_time_delta/1000000 etime,
  6  (elapsed_time_delta/decode(nvl(executions_delta,0),0,1,executions_delta))/1000000 avg_etime,
  7  buffer_gets_delta lio,
  8  (buffer_gets_delta/decode(nvl(buffer_gets_delta,0),0,1,executions_delta)) avg_lio
  9  from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS
 10  where sql_id = nvl('&sql_id','4dqs2k5tynk61')
 11  and ss.snap_id = S.snap_id
 12  and ss.instance_number = S.instance_number
 13  and executions_delta > 0
 14  )
 15  group by sql_id, plan_hash_value
 16  order by 5
 17  /
Enter value for sql_id: 0qa98gcnnza7h
 
SQL_ID        PLAN_HASH_VALUE        EXECS          ETIME    AVG_ETIME        AVG_LIO
------------- --------------- ------------ -------------- ------------ --------------
0qa98gcnnza7h       568322376           14          288.7       20.620      172,547.4
0qa98gcnnza7h      3723858078            2          313.4      156.715   28,901,466.0
 
SQL> @awr_plan_change
SQL> break on plan_hash_value on startup_time skip 1
SQL> select ss.snap_id, ss.instance_number node, begin_interval_time, sql_id, plan_hash_value,
  2  nvl(executions_delta,0) execs,
  3  (elapsed_time_delta/decode(nvl(executions_delta,0),0,1,executions_delta))/1000000 avg_etime,
  4  (buffer_gets_delta/decode(nvl(buffer_gets_delta,0),0,1,executions_delta)) avg_lio
  5  from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS
  6  where sql_id = nvl('&sql_id','4dqs2k5tynk61')
  7  and ss.snap_id = S.snap_id
  8  and ss.instance_number = S.instance_number
  9  and executions_delta > 0
 10  order by 1, 2, 3
 11  /
Enter value for sql_id: 0qa98gcnnza7h
 
   SNAP_ID   NODE BEGIN_INTERVAL_TIME            SQL_ID        PLAN_HASH_VALUE        EXECS    AVG_ETIME        AVG_LIO
---------- ------ ------------------------------ ------------- --------------- ------------ ------------ --------------
     21857      1 20-MAR-09 04.00.08.872 PM      0qa98gcnnza7h       568322376            1       31.528      173,854.0
     22027      1 27-MAR-09 05.00.08.006 PM      0qa98gcnnza7h                            1      139.141      156,807.0
     22030      1 27-MAR-09 08.00.15.380 PM      0qa98gcnnza7h                            3       12.451      173,731.0
     22031      1 27-MAR-09 08.50.04.757 PM      0qa98gcnnza7h                            2        8.771      173,731.0
     22032      1 27-MAR-09 08.50.47.031 PM      0qa98gcnnza7h      3723858078            1      215.876   28,901,466.0
     22033      1 27-MAR-09 08.57.37.614 PM      0qa98gcnnza7h       568322376            2        9.804      173,731.0
     22034      1 27-MAR-09 08.59.12.432 PM      0qa98gcnnza7h      3723858078            1       97.554   28,901,466.0
     22034      1 27-MAR-09 08.59.12.432 PM      0qa98gcnnza7h       568322376            2        8.222      173,731.5
     22035      1 27-MAR-09 09.12.00.422 PM      0qa98gcnnza7h                            3        9.023      173,807.3
 
9 rows selected.

So back to the question, what’s the best way to deal with the issue. In general, the best way to eliminate Bind Variable Peeking is as follows:

  1. Only create histograms on skewed columns.
  2. Use literals in where clauses on columns where you have histograms and want to use them. Note that it’s not necessary to use literals for every possible value of a skewed column. There may be only a few outlier values that result in significantly different plans. With a little extra code you can use literals for those values and bind variables for the rest of the values that don’t matter.
  3. If you can’t modify the code, consider turning off Bind Variable Peeking by setting the _OPTIM_PEEK_USER_BINDS parameter to false. You won’t get the absolute best performance for every possible statement, but you will get much more consistent performance, which is, in my opinion, more important than getting the absolute best performance. Keep in mind that this is a hidden parameter and so should be carefully tested and probably discussed with Oracle support prior to implementing it in any production system.
  4. You can also consider stronger methods of forcing the optimizer’s hand such as Outlines (see my previous posts on Unstable Plans and on Outlines). This option provides a quick method of locking in a single plan, but it’s not fool proof. Even with outlines, there is some possibility that the plan can change. Also note that this option is only palatable in situations where you have a relatively small number of problem SQL statements.
  5. Upgrade to 11g and let Adaptive Cursor Sharing take care of all your problems for you (don’t bet on it working without a little effort – I’ll try to do a post on that soon).

In summary, using literals with histograms on columns with skewed data distributions are really the only effective way to deal with the issue and still retain the ability for the optimizer to choose the absolute best execution plans. However, if circumstances prevent this approach, there are other techniques that can be applied. These should be considered temporary fixes, but may work well while a longer term solution is contemplated. From a philosophical stand point, I strongly believe that consistency is more important than absolute speed. So when a choice must be made, I would always favor slightly reduced but consistent performance over anything that didn’t provide that consistency.

Your comments are always welcome. Please let me know what you think.

25 Comments

  1. [...] Peeking, un bel post riassuntivo Pubblicato il Giovedì 2 Aprile 2009 di Cristian Cudizio Kerry Osborne, recente ingresso del mio blogroll, nonché del mio feed reader, alcuni giorni fa ha pubblicato sul [...]

  2. Daniel says:

    Excellent article.
    i was reading about histograms on richard foote blog & searched for “bind variable peekings & histograms” and found your article.

    Regards!

  3. osborne says:

    Jonathan Lewis’ post has the best and simplest description of this problem and it’s solution that I’ve ever seen here. And it’s his “Philosophy 1″ – must be pretty high on his list as well.

    Kerry

  4. Michael Fontana says:

    If you consider the fact that Oracle likely began work on the peeking “feature” back in 2000 or so when a VLDB was still considered to be 10G or more, they probably never really considered the full implications of it’s effects on database and table sizes of today.

  5. Marc Travaglini says:

    Thanks for an excellent article. I was able to use it and within 10 minutes find out exactly why one particular sql had started bringing the system to its knees. Interestingly, it was not because of bind variable peeking, but because of new statistics being gathered on some tables which changed the plan for this particluar sql. The above set of queries showed this very clearly, so they are even more useful than you give them credit for!

  6. Ansh Abhishek says:

    Kerry
    best explaination for the “Bind Variable Peeking is not that well understood”
    My personal opinion
    In my opinion, however, that disabling bind peeking in situation(where client is using lot of queries with literals also) is the wrong choice, since
    (a) still run the risk of getting unpredictable results from queries with literals;
    (b) will be doing tons more work during statistics collection, since histogram computation is expensive.
    Regarding
    _OPTIM_PEEK_USER_BINDS
    Running Oracle Database with an underscore parameter makes us different from the rest of the world, and I believe this is not how Oracle Database was tested and intended to be run in a first place .
    What is your views on the same
    Regards
    Abhishek

  7. osborne says:

    Abhishek,

    I agree that setting underscore parameters is not a good idea in general. However, in this case it may be warranted. The combination of bind variable, histograms and skewed data can be deadly. While it’s true that you would not be in the majority if you set _OPTIM_PEEK_USER_BINDS to false, you may be surprised to know that there are a fair number of systems that run with that setting. SAP for example uses this setting when they run on top of Oracle. That said, my preference would be to use literals where necessary to get the plans that I want. This does not mean that all statements need to use literals. In fact, it is even possible to write code that is smart about when to use literals. It is often possible to to use literals for a limited number of values while using bind variables where the optimizer doesn’t really need to know the value to get a decent plan. You’re also correct in that gather stats with histograms is expensive. I do believe that in many cases they are gathered where they are not needed. I wrote a pretty lengthy chapter in the recently released Apress book “Pro Oracle SQL” on Plan Stability and Control, by the way. It covers my thoughts on the subject in some detail.

    Kerry

  8. Chris says:

    This is a great post. We are currently struggling to understand how and why some production outages arise from what appear to be random long running queries that tie up user threads (we’ve noticed an incorrect plan once before). I’m trying to get info related to what plans these queries were running but our dba says the plan data didn’t get captured via ASH. Any thoughts on how we could determine if this is happening to us? It seems to fit the profile, but to date no one can put their finger on it. We see queries that should take at most a second or two take minutes and even hours sometimes..meanwhile other things run fine in the db. Any insight would be much appreciated. Thanks!

  9. osborne says:

    Hi Chris,

    Any statement that runs for hours should be captured by AWR, so the scripts in the this post should help you identify that issue. Occasionally statements that run very quickly are not captured by AWR, but if they are run often (which sounds like they are in your case) they will be captured. If they run very quickly and are not capture by AWR when they are running well you may have to capture the plans directly out of the shared pool while they are running or use the ability of AWR to color a SQL statement (using dbms_workload_repository.add_colored_sql) so that it will be captured by AWR regardless of whether it makes it into the top of any category that AWR captures. Whether the instability is caused by BVP or not, the awr_plan_change.sql statement should allow you to get a pretty good idea as to when a key statement is slowing down and hopefully help you diagnose why. Hope that helps.

    Kerry

  10. Mahesh says:

    Hi Kerry,

    Its a great article, very easy to understand about bind peeking and histograms, thanks for sharing the knowledge.

    I am not good in sql coding, could you please explain the description of the columns in unstable_plans.sql, awr_plan_stats.sql scripts.

    Thanks in advance

  11. Danyc says:

    Hey Kerry,

    Nice post, well done. Quick one: the scripts you used are pulling the info from AWR however you mentioned the info can be collected from statspack as well (it piss me a lot Oracle policy where the diagnostic license can a) be used only on EE and b) u can use AWR for basic things without the expensive license), do you already have scripts for that and if yes do you mind sharing it?

    Cheers,
    dani

  12. osborne says:

    Hi Dani,

    No I don’t have any statspack versions hanging around. Sorry.

    Kerry

  13. DanyC says:

    No worries, thanks for feedback.

    Take care,
    Dani

  14. Sundeep says:

    Hello Kerry

    I am new to oracle, so if some of my assumptions are wrong please excuse them. I just have 4 questions.

    1. If I understood you artical correctly then the bind variable peeking is when the bind variable looks at the value given to it and comes up with an optimal execution plan for that particular value, and it uses the same execution plan for all the subsequent values. Then what is the use of having bind variables if the same execution plan is generated for different values which prove to be sub optimal?
    2. Histograms provide the optimizer with info on data distribution to come up with an optimal execution plan for the respective literals. Shouldn’t the same histogram provide information to the optimizer about the data distribution so that the optimizer can choose a diff execution plan for different values passed to the bind variables?
    3. I read in other blog post that the dropping of histograms will make the queries using bind variables run faster. If the Histograms are dropped than how does the optimizer know which access path to follow based on data distribution to retrieve the results regardless of the use of bind variables?
    4. Does adaptive cursor sharing introduced in 11g take care of the problems caused by bind peeking and histograms combined?
    Thank You

  15. osborne says:

    Hi Sundeep,

    1. Originally, bind variable values were not considered by the optimizer. Later versions provided the ability for the optimizer to peek and see the value that were passed. The idea was that this would allow the optimizer to make a better decision when coming up with a plan. Unfortunately, the way it was implemented was to just peek on the first hard parse, which meant subsequent executions would stick with the plan until the cursor was invalidated for some reason so that the next execution would result in a new hard parse and potentially a new plan depending on the new value of the bind variable(s). The values weren’t peeked for every execution because it was judged to be too expensive. The combination of skewed data distributions, histograms, and bind variables can be problematic.

    2. Yes – see answer to #1 above.

    3. Dropping histograms in no way guarantees that statements with bind variables will run faster. But the combination of bind variables, histograms, and skewed data can cause a lot of unpredictable performance, particularly in versions prior to 11g when Adaptive Cursor Sharing showed up.

    4. 11g is better because they introduced Adaptive Cursor Sharing which basically monitors bind variables for all executions and will create multiple cursors with different plans. But it’s not perfect for a couple of reasons. The first reason is that you basically have to get a bad plan/variable combination (i.e. suboptimal performance) before the feature creates a new plan. The second reason is that the Bind Aware Info is lost when the statement is flushed from the shared pool for whatever reason, and so it has to be re-learned again. I did a post about ACS here: http://kerryosborne.oracle-guy.com/2009/06/oracle-11g-adaptive-cursor-sharing-acs/

  16. Reddy says:

    Hello Kerry,

    This is very excellent article and thank you for this article.

    My Prod 11g db is suddenly effecting with these kind of issues.
    Application fires lots of SQLs’ with bind variables to the database very frequently, out of them three particular SQLs are hanging on the database(these used to run fine in the past) with different PHVs for each time. Though i have given the best specific PHV to take automatically when these fires through SQL Profiles, the problem is still same.
    it would be very helpful if you answer these?

    As these are generating from application we are unable to avoid the bind variables?(is any other way to avoid bind variables?)
    Why specific plan is not using though we are giving through SQL profiles?
    If we presume Histograms are there on the columns, how to find and restrain them?
    Are the frequent changes in statistics effecting in changing the plans? if yes how to avoid this?

    Thank you,
    Reddy

  17. [...] To understand this blog post you have to know what bind variable peeking is. You can found a very good explanation into this Kerry Osborne’s blog post. [...]

  18. Karthik says:

    Hello Kerry,
    I have a question here. You mention that BVP shall be the reason if the statement switches back and forth between more than one execution plan.
    But if BVP happens, then oracle sticks to the plan created by first parsed/peeked value right. So unless there is a hard parse to invalidate the cursor, how it can jump forth and back between 2 execution plans. Thank you.
    My question is in the absence of histogram, we almost have a stable plan as no data about skewedness is available (and)
    even with histogram, we need a hard parse or cache to be flused to switch between execution plans

  19. osborne says:

    Karthik,

    Yes – a statement has to be re-parsed to get a new plan. But that can happen for many reasons, instance restart, new stats, aging out, etc…

    Kerry

Leave a Reply