Unstable Plans (Oracle Plan Stability/Instability)

One of the most frustrating things about the Cost Based Optimizer in Oracle is the sometimes erratic performance. It can seem almost random at times. Resolving these issues can also be challenging. It’s a little like taking your car to the mechanic when it makes a funny noise. It never seems to make the noise when the mechanic is there. Fortunately we have ASH/AWR which captures a ton of information about what our databases are doing.

Here’s a couple of scripts that I find useful.

The first one can be used to show statements that have experienced significant variances in execution time (it can be modified to look for variances in the amount of logical i/o, but I’ll leave it as an exercise for the reader).  I called the script unstable_plans.sql. It uses an analytic function to calculate a standard deviation on the average elapsed time by plan. So the statements that have multiple plans with wild variations in the response time between plans will be returned by the script. The script prompts for a couple of values. The first is minimum number of standard deviations. The second is the minimum elapsed time (I usually don’t care if a statement executes sometimes in .005 seconds and sometimes in .02 seconds, even though this is a large swing statistically). Both these inputs are defaulted by the way.

The second one is called awr_plan_change.sql. It’s shows how the plans for a given statement have changed over time, along with some statistics such as the average elapsed time and the average lio’s.

Anyway, here’s an example of the usage of these two scripts (by the way, the example is on an 11gR1 database, but the scrips work fine on 10g as well).

 

> sqlplus / as sysdba
 
SQL*Plus: Release 11.1.0.6.0 - Production on Tue Oct 7 15:44:20 2008
 
Copyright (c) 1982, 2007, Oracle.  All rights reserved.
 
Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
 
SQL> alter system set optimizer_mode=rule;
 
System altered.
 
SQL> @awr_snap 
 
PL/SQL procedure successfully completed.
 
SQL> @unstable_plans
Enter value for min_stddev: 
Enter value for min_etime: 
 
SQL_ID        SUM(EXECS)   MIN_ETIME   MAX_ETIME   NORM_STDDEV
------------- ---------- ----------- ----------- -------------
c5by9gdw3814r         14         .03         .12        2.1274
848dyu9288c3h         16         .02         .16        2.1309
2am60vd2kw8ux         11         .05         .22        2.4976
frxg53fac2n8j          5         .03         .30        4.2479
0qa98gcnnza7h         62       25.58      314.34        7.9833
g0jvz8csyrtcf          2         .09        1.19        8.2304
2cn0kc8u4b81w        545         .02         .42       12.8022
9wt62290ah0f7          6         .01         .47       38.5857
d8mayxqw0wnpv       1373         .01         .85       48.3874
 
9 rows selected.
 
SQL> /
Enter value for min_stddev: 
Enter value for min_etime: 2
 
SQL_ID        SUM(EXECS)   MIN_ETIME   MAX_ETIME   NORM_STDDEV
------------- ---------- ----------- ----------- -------------
0qa98gcnnza7h         62       25.58      314.34        7.9833
 
SQL> @find_sql
Enter value for sql_text: 
Enter value for address: 
Enter value for sql_id: 0qa98gcnnza7h
 
SQL_ID         CHILD  PLAN_HASH        EXECS         ETIME     AVG_ETIME USERNAME      SQL_TEXT
------------- ------ ---------- ------------ ------------- ------------- ------------- -----------------------------------------
0qa98gcnnza7h      0 3723858078            5        356.53         71.31 SYS           select avg(pk_col) from kso.skew where co
                                                                                       l1 > 0
 
0qa98gcnnza7h      1  568322376            1          7.92          7.92 SYS           select avg(pk_col) from kso.skew where co
                                                                                       l1 > 0
 
0qa98gcnnza7h      2  568322376           10         52.14          5.21 SYS           select avg(pk_col) from kso.skew where co
                                                                                       l1 > 0
 
0qa98gcnnza7h      3  568322376           30      1,064.19         35.47 KSO           select avg(pk_col) from kso.skew where co
                                                                                       l1 > 0
 
0qa98gcnnza7h      4 3723858078           10      4,558.62        455.86 KSO           select avg(pk_col) from kso.skew where co
                                                                                       l1 > 0
 
SQL> @awr_plan_change
Enter value for sql_id: 0qa98gcnnza7h
 
   SNAP_ID   NODE BEGIN_INTERVAL_TIME            SQL_ID        PLAN_HASH_VALUE        EXECS    AVG_ETIME        AVG_LIO
---------- ------ ------------------------------ ------------- --------------- ------------ ------------ --------------
      3206      1 02-OCT-08 08.00.38.743 AM      0qa98gcnnza7h       568322376            4       10.359      121,722.8
      3235      1 03-OCT-08 01.00.44.932 PM      0qa98gcnnza7h                            1       10.865      162,375.0
      3235      1 03-OCT-08 01.00.44.932 PM      0qa98gcnnza7h      3723858078            1      127.664   28,913,271.0
      3236      1 03-OCT-08 01.28.09.000 PM      0qa98gcnnza7h       568322376            1        7.924      162,585.0
      3236      1 03-OCT-08 01.28.09.000 PM      0qa98gcnnza7h      3723858078            1       86.682   27,751,123.0
      3305      1 06-OCT-08 10.00.11.988 AM      0qa98gcnnza7h                            4       64.138   22,616,931.5
      3305      1 06-OCT-08 10.00.11.988 AM      0qa98gcnnza7h       568322376            2        5.710       81,149.0
      3306      1 06-OCT-08 11.00.16.490 AM      0qa98gcnnza7h                            6        5.512      108,198.5
      3307      1 06-OCT-08 12.00.20.716 PM      0qa98gcnnza7h                            2        3.824       81,149.0
      3328      1 07-OCT-08 08.39.20.525 AM      0qa98gcnnza7h                           30       35.473      156,904.7
      3335      1 07-OCT-08 03.00.20.950 PM      0qa98gcnnza7h      3723858078           10      455.862   28,902,128.6
 
11 rows selected.
 
SQL> @dplan_awr
Enter value for sql_id: 0qa98gcnnza7h
Enter value for plan_hash_value: 
 
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 0qa98gcnnza7h
--------------------
select avg(pk_col) from kso.skew where col1 > 0
 
Plan hash value: 568322376
 
---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       | 44497 (100)|          |
|   1 |  SORT AGGREGATE    |      |     1 |    11 |            |          |
|   2 |   TABLE ACCESS FULL| SKEW |    32M|   335M| 44497   (2)| 00:08:54 |
---------------------------------------------------------------------------
 
SQL_ID 0qa98gcnnza7h
--------------------
select avg(pk_col) from kso.skew where col1 > 0
 
Plan hash value: 3723858078
 
--------------------------------------------------
| Id  | Operation                    | Name      |
--------------------------------------------------
|   0 | SELECT STATEMENT             |           |
|   1 |  SORT AGGREGATE              |           |
|   2 |   TABLE ACCESS BY INDEX ROWID| SKEW      |
|   3 |    INDEX RANGE SCAN          | SKEW_COL1 |
--------------------------------------------------
 
Note
-----
   - rule based optimizer used (consider using cbo)
 
33 rows selected.

A couple of quick comments. The unstable_plan.sql script shows several statements that are switching plans. Most of them are sub-second response times though. The one that stands out is sql_id 0qa98gcnnza7h. It has been executed 62 times and one plan averages about 25 seconds while another averages about 314 seconds. The awr_plan_changes.sql script shows that this statement has been switching between two plans (3723858078 and 568322376). 3723858078 is the less efficient plan (doing 25M or so logical i/o’s per execution) while plan 568322376 is much better (doing only about 120K logical i/o’s per execution).

This is a typical pattern we see when bind variable peeking is an issue. Plans that flip flop back and forth between two or three options. This post is not specifically about bind variable peeking and so I won’t go into all the details (if you want more info just search on this site). But I will say that it is quite often the case that a single plan is acceptable regardless of the bind variables that are used (in this case the 568322376 plan). It is pretty easy to get Oracle to use a specific plan using an Outline, SQL Profile, or Baseline (depending on the version of Oracle you are using). And that’s exactly what we did in this case to resolve the issue.

29 Comments

  1. [...] stronger methods of forcing the optimizer’s hand such as Outlines (see my previous posts on Unstable Plans and on Outlines). This option provides a quick method of locking in a single plan [...]

  2. [...] here’s little example: Note: unstable_plans.sql and awr_plan_stats.sql are discussed here: Unstable Plans (Plan Instability) ?View Code SQLSQL> @unstable_plans Enter value for min_stddev: Enter value for min_etime:   [...]

  3. Julio Cesar says:

    Hi Kerry, we are facing this trouble now, the same query can run in 1 minute or in a hour, we disabled the AMM and set all the memory parameters manually, but the problem persists in a dangerous way, do you have some idea in solving it?

    Best regards and I wish you luck if you’re still with this trouble.

    Júlio César

  4. osborne says:

    Julio,

    Not much to go on in you’re question, but I’ll hazard a couple of suggestions any way. Plan instability is often the result of bind variable peeking. The worst cases often happen when you have a histogram on a skewed column in combination with a bind variable in a where clause that references the column. The best way to fix this issue is to use literals instead of bind variables for such columns. Creating a SQL Profile or Outline to lock in one of the plans can often be a pretty quick short term fix as well. Also, you should be aware that 10g’s standard method of gathering stats does not do a very good job (in my opinion) of picking which columns to gather histograms on (it usually creates too many). The auto sample size is often too small to create an accurate histogram as well. So you may want to re-evaluate the stats gathering. I’ve written several posts about bind variable peeking and methods for dealing with the issue. Here are a couple you might want to look at:

    Bind Variable Peeking Drives Me Nuts!
    Outlines – aka Plan Stability
    SQl Profiles

    Or just search the blog for “Plan Instability” or “Bind Variable Peeking”.

    Hope that helps.

    Kerry

  5. Paul says:

    Hi Kerry,
    Nice piece of work. Could you please post the ‘find_sql.sql’ script??
    thx
    Paul

  6. osborne says:

    Hi Paul,

    The find_sql.sql script is here:

    http://www.oracle-guy.com/scripts/find_sql.sql

    Sometimes I don’t post the text for scripts I use regularly (if I’ve already posted them). A quick search for the script name in the blog search box will usually turn them up. I probably should do a post that lists my most used scripts with links to them. I’ll see if I can get that done soon.

    Kerry

  7. Yuvaraj says:

    Could you paste the “@dplan_awr” script?

  8. osborne says:

    SELECT * FROM table(dbms_xplan.display_awr(nvl(‘&sql_id’,’a96b61z6vp3un’),nvl(‘&plan_hash_value’,null),null,’typical +peeked_binds’));

  9. Yuvaraj says:

    Thanks. I am trying to find a procedure to use old hash value instead of a new hash value. There are two hash values for a same SQL_ID. Could you pls provide a link if you have written a doc on this?

  10. osborne says:

    I’m sorry Yuvaraj, but I have not done anything with old_hash_value. Nor do I know of anyone who has written about it. Tanel Poder has written about calculating SQL_ID from hash_value though.

    Kerry

  11. Yuvaraj says:

    I thought of checking with you whether if it is possible to force oracle to use the plan from old hash value then the newer one. There are two hash values for a same SQL_ID. I would like to force the old hash value.

  12. osborne says:

    I think you are confusing plan_hash_value with hash_value (and old_hash_value). A statement can have only one sql_id and only one hash_value (its based on the text of the statement). A statement can have many plan_hash_values. And yes you can influence a statement to use a particular plan. I’ve done a number of posts on using SQL Profiles to do just that. See SQL Profiles for example.

    Kerry

  13. Raja says:

    I noticed two plan_hash_value for a SQL_ID with same execution plan. I thought plan_hash_value cannot be identical. Please let me know if I am wrong. how can there be two plan_hash_value for the same execution plan?I only bind_mismatch Y, but plan_hash_value is different, It on a 10.2.0.4 database using cursor_sharing set to exact.

  14. osborne says:

    Hmmm, are you sure the plans are exactly the same? How are you comparing the plans? You could write a little script to compute your own hash value on all the data for the sql_id and child of interest in v$sql_plan and see if it comes up with the same value for both. I suspect there is something different (maybe it’s not displayed in the xplan output though). I do not know exactly what is included in the the plan_hash_value calculation by the way. There may be other things besides what’s exposed in the v$sql_plan.

  15. Raja says:

    I came to know it’s a bug in 10.2.0.4. It’s bug 10162430 caused same plan with multiple PHVs.

  16. osborne says:

    Ah – interesting. Thanks for following up on that Raja. I had not run into this issue before.

    Just to provide a little more info to casual observers, here’s a bit of the MOS bug report:

    RELEASE NOTES:
    ]]You may see a different plan hash value after every compilation of a SQL stat
    ]]ement even when the execution plan is the same when it involves cursor tempor
    ]]ary tables (name predixed by SYS_TEMP).
    REDISCOVERY INFORMATION:
    If you see a different plan hash value after every compilation of a SQL
    statement even when the execution plan is the same and the plan involves
    cursor temporary tables (name predixed by SYS_TEMP) then you are likely
    to have encountered the same problem.
    WORKAROUND:
    None

    It’s supposed to be fixed in version 12.1 by the way. ;)

  17. antony says:

    How about applications which use bind variables quite a lot where you’ll not get the same SQLID.For an example,in PeopleSoft applications every job is driven by a process instance which is a bind variable.So you will not get the same SQLID for every run of the same job.How do I find unstable plans in those cases using your analytic function approach?

    Thanks

  18. osborne says:

    Antony,

    That’s more difficult but do-able. You’d have to change the group by to use some portion of the sql_text (that presumably matches). I have a rough script that does something like that based on matching the first 80 characters of the SQL text. It is very rough but should give you an idea of how to go about what you are after. Unfortunately I don’t have time right now to work on this, but you can try this as a starting point:

    -- unstable_plans_by_sqltext.sql
    set lines 155
    col execs for 999,999,999
    col min_etime for 999,999.99
    col max_etime for 999,999.99
    col avg_etime for 999,999.999
    col sql_text for a80
    break on sql_text skip 1
    select a.sql_id,
    substr(dbms_lob.substr(sql_text,3999,1),1,80) sql_text, b.execs, min_etime, max_etime, norm_stddev
    from dba_hist_sqltext a, (
    select sql_id, sum(execs) execs, min(avg_etime) min_etime, max(avg_etime) max_etime, stddev_etime/min(avg_etime) norm_stddev
    from (
    select sql_id, plan_hash_value, execs, avg_etime,
    stddev(avg_etime) over (partition by sql_id) stddev_etime
    from (
    select sql_id, plan_hash_value,
    sum(nvl(executions_delta,0)) execs,
    (sum(elapsed_time_delta)/decode(sum(nvl(executions_delta,0)),0,1,sum(executions_delta))/1000000) avg_etime
    -- sum((buffer_gets_delta/decode(nvl(buffer_gets_delta,0),0,1,executions_delta))) avg_lio
    from DBA_HIST_SQLSTAT S, DBA_HIST_SNAPSHOT SS
    where ss.snap_id = S.snap_id
    and ss.instance_number = S.instance_number
    and executions_delta > 0
    and elapsed_time_delta > 0
    group by sql_id, plan_hash_value
    )
    )
    group by sql_id, stddev_etime
    ) b
    where norm_stddev > nvl(to_number('&min_stddev'),2)
    and max_etime > nvl(to_number('&min_etime'),.1)
    and substr(dbms_lob.substr(sql_text,3999,1),1,80) like nvl('&sql_text',substr(dbms_lob.substr(sql_text,3999,1),1,80))
    and a.sql_id = b.sql_id
    order by 2
    /
    
  19. Peter says:

    Peter…

    [...]Kerry Osborne’s Oracle Blog » Blog Archive » Unstable Plans (Oracle Plan Stability/Instability) – Kerry Osborne’s Oracle Blog[...]…

  20. performance tuning…

    [...]Kerry Osborne’s Oracle Blog » Blog Archive » Unstable Plans (Oracle Plan Stability/Instability) – Kerry Osborne’s Oracle Blog[...]…

  21. osborne says:

    No – but here’s a script that looks for statements in the cursor cache with multiple plans. Perhaps you can modify it to look at AWR snaps.

    set lines 155
    set pages 9999
    col sql_text for a80
    !echo
    !echo "*** This script searchs the shared pool for SQL stataments with How_Many (or more) distinct plans."
    -- !echo
    select sql_id,  count(distinct plan_hash_value) distinct_plans, sql_text 
    from v$sql
    group by sql_id, sql_text
    having count(distinct plan_hash_value) >= &how_many
    /
    
  22. Ranjit says:

    Hi,

    What are the situations when query having same sql_id and same plan_hash_value(single child also) but some times varies in Logical I/O? In our case deviation of Logical I/O is in hundreds of magnitude for same plan_hash_value and hence performing very badly and causing Latch:cache buffer chain.

    There had been no change in
    1) SQL(in any form)
    2) Tables and indexes stats
    3) rums of rows processed
    4) Fetch array size

    Could you please let me know any possible cause?

    Regards

  23. osborne says:

    Hi Ranjit,

    I’ve just published a couple of articles on plan changes that aren’t reflected in the plan_hash_value. The main areas of concern are predicates. Changes to order of predicates can have a huge effect on performance. Here’s a link to the first post on the topic. SQL Gone Bad – But Plan Not Changed? – Part 1

    Hopefully that will help.

    By the way, the other more simple explanation would be that the statement sometimes does more work than other times, either because of skewed data (i.e. that the values used in the query sometimes cause the statement to process more data) or that there are temp tables that sometimes have lot’s of rows and other times not so many. It could also happen if you have a system that uses synonyms to point to different physical tables that have different amounts of data. From you’re comment about num rows being the same though, I’m assuming you’ve already eliminated these possibilities?

  24. [...] For given database session, I can run Tanel’s asqlmon, look at historical SQL performance using Kerry’s AWR scripts, and drill down into IO latency with just point and [...]

  25. ted chyn says:

    In the output from awr_plan_change, I noticed some of rows contain null value for PLAN_HASH_VALUE field. my question what is that means and why some rows has plan_hash_value some don’t ? thnx in advance.

  26. osborne says:

    Hi Ted,

    The script uses “set break on plan_hash_value” which is a SQL*Plus command that suppresses output of repeating values in a column, so the nulls are rows where the value hasn;t changed from the previous row, thus highlighting the rows where the plan has changed.

    Kerry

Leave a Reply