SQL Gone Bad – But Plan Not Changed? – Part 1
Last week an interesting issue popped up on a mission critical production app (MCPA). A statement that was run as part of a nightly batch process ran long. In fact, the statement never finished and the job had to be killed and restarted. This particular system is prone to plan stability issues due to various factors outside the scope of this post, so the first thing the guys checked was whether there had been a plan change. Surprisingly, the plan_hash_value was the same as it had been for the past several months. The statement was very simple, and a quick look at the xplan output showed that the plan was indeed the same, with one exception: the predicate section was slightly different.
As a quick diversion, you probably already know that the plan_hash_value is calculated from only partial information about the plan. Arguably it covers the most important parts, but some important parts of the plan are not included (namely the stuff that shows up in the predicate section of the plan). Randolf Geist explained well which parts of the plan are used in calculating the plan_hash_value in a post on How PLAN_HASH_VALUES Are Calculated several years ago. His summary was this:
So in summary the following conclusions can be made:
– The same PLAN_HASH_VALUE is merely an indicator that the same operations on objects of the same name are performed in the same order.
– It tells nothing about the similarity of the expected runtime performance of the execution plan, due to various reasons as demonstrated. The most significant information that is not covered by the PLAN_HASH_VALUE are the filter and access predicates, but there are other attributes, too, that are not part of the hash value calculation.
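In other words, the predicate text never makes it into the hash, so if you want to compare it you have to go look at it directly. A minimal sketch of that against V$SQL_PLAN (the sql_id is just a placeholder):
-- Pull the predicate sections for every child cursor of a statement.
-- These columns are exactly the bits that PLAN_HASH_VALUE ignores.
select child_number, id, operation, options,
       access_predicates, filter_predicates
from   v$sql_plan
where  sql_id = '&sql_id'
order  by child_number, id;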
The production issue got me thinking about several things:
Could I come up with a simple test case to show a severe performance degradation between executions of a statement with the same plan_hash_value because of a change in the predicates? (the answer is it’s pretty easy actually)
What forensics could be used to determine when this situation has occurred?
How do you fix the problem?
So the first thing I did was create a test case consisting of a query with two predicates: one that is fairly CPU intensive and one that does a simple equality comparison (which uses very little CPU). Both filters are very selective (and on my test system the optimizer actually estimates that they return the same number of rows), but the CPU cost of executing them is very different, so the order in which the filters are applied should make a big difference. I also put the low-CPU filter last in the query text so that I could use a rarely used hint (ordered_predicates) to override the order the optimizer wants to use (i.e. force the expensive filter to be applied first). So here's the setup.
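I'm not reproducing the full table build here, but a rough sketch of a table along the lines of KSO.SKEW2 (the column names match the queries that follow; the row count and data distribution are just assumptions chosen to make the two filters behave as described) would look something like this:
-- Rough sketch only: COL4='F' is rare (cheap, very selective equality filter)
-- and COL1 is numeric (it feeds the expensive TRANSLATE(TO_CHAR(SIN(...))) filter).
create table kso.skew2 as
select rownum                                      pk_col,
       mod(rownum, 1000000)                        col1,
       case when rownum = 1 then 'F' else 'N' end  col4
from   dual
connect by level <= 1000000;   -- the real table is considerably larger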
SYS@DEMO1> -- fast filter
SYS@DEMO1> select count(*) from kso.skew2 where col4='F';
COUNT(*)
----------
1
Elapsed: 00:00:00.82 <======== about 1 second with no smart scan tricks
SYS@DEMO1> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID gzvkkhaykkphp, child number 0
-------------------------------------
select count(*) from kso.skew2 where col4='F'
Plan hash value: 4220890033
-----------------------------------------------------
| Id | Operation | Name | E-Rows |
-----------------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
|* 2 | TABLE ACCESS STORAGE FULL| SKEW2 | 3610 | <=== optimizer estimates 3610
-----------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - storage("COL4"='F')
filter("COL4"='F')
Note
-----
- dynamic sampling used for this statement (level=2)
- Warning: basic plan statistics not available. These are only collected when:
* hint 'gather_plan_statistics' is used for the statement or
* parameter 'statistics_level' is set to 'ALL', at session or system level
27 rows selected.
Elapsed: 00:00:00.01
SYS@DEMO1> -- slow filter
SYS@DEMO1> select count(*) from kso.skew2 where TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a';
COUNT(*)
----------
0
Elapsed: 00:05:49.85 <==== almost 6 minutes with no smart scan
SYS@DEMO1> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID gatc45y5cza17, child number 0
-------------------------------------
select count(*) from kso.skew2 where
TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a'
Plan hash value: 4220890033
-----------------------------------------------------
| Id | Operation | Name | E-Rows |
-----------------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
|* 2 | TABLE ACCESS STORAGE FULL| SKEW2 | 3610 | <=== optimizer still expects 3610 rows
-----------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - storage(TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a')
filter(TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a')
Note
-----
- dynamic sampling used for this statement (level=2)
- Warning: basic plan statistics not available. These are only collected when:
* hint 'gather_plan_statistics' is used for the statement or
* parameter 'statistics_level' is set to 'ALL', at session or system level
28 rows selected.
Elapsed: 00:00:00.02
The first section shows how long it takes to read through the table and apply each of the two filters. As you can see, the COL4='F' filter takes much less time (I don't want to get bogged down here, but I hope you'll trust me that no Exadata magic kicked in and the two queries did the same amount of work in terms of lio/pio, etc.). Also note that the STORAGE predicate is present merely because we are on Exadata (as was the original problem query); it only signifies that a predicate could potentially be offloaded, not that it actually has been offloaded. Please take my word for it that no smarts kick in for any of the testing in this post.
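If you'd rather verify that than take my word for it, one simple check (a sketch, and only one of several ways) is to look at the session's smart scan statistics before and after running the queries; if no offload happened they stay at zero:
-- Session-level smart scan statistics; zero deltas mean no offload occurred.
select sn.name, ms.value
from   v$mystat ms, v$statname sn
where  ms.statistic# = sn.statistic#
and    sn.name in ('cell physical IO bytes eligible for predicate offload',
                   'cell physical IO interconnect bytes returned by smart scan');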
The next bit of code shows the actual test case which combines the two filters.
SYS@DEMO1> select /*+ monitor */ avg(pk_col) from kso.skew2
2 where translate(to_char(sin(col1)),'1','a') = 'a'
3 and col4 = 'F'
4 /
AVG(PK_COL)
-----------
Elapsed: 00:00:01.72
SYS@DEMO1> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------
SQL_ID 5p4610dk709pj, child number 0
-------------------------------------
select /*+ monitor */ avg(pk_col) from kso.skew2 where
translate(to_char(sin(col1)),'1','a') = 'a' and col4 = 'F'
Plan hash value: 4220890033
-----------------------------------------------------
| Id | Operation | Name | E-Rows |
-----------------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
|* 2 | TABLE ACCESS STORAGE FULL| SKEW2 | 3784 |
-----------------------------------------------------
Predicate Information (identified by operation id): <==== note that the fast filter is applied first
---------------------------------------------------
2 - storage(("COL4"='F' AND TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='
a'))
filter(("COL4"='F' AND TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a
'))
Note
-----
- dynamic sampling used for this statement (level=2)
- Warning: basic plan statistics not available. These are only collected when:
* hint 'gather_plan_statistics' is used for the statement or
* parameter 'statistics_level' is set to 'ALL', at session or system level
30 rows selected.
Elapsed: 00:00:00.01
SYS@DEMO1> @sm
Enter value for sid:
Enter value for sql_id: 5p4610dk709pj
Enter value for sql_exec_id:
SID SERIAL# SQL_ID SQL_EXEC_ID SQL_EXEC_START PLAN_HASH_VALUE ETIME BUFFER_GETS DISK_READS
---- ---------- ------------- ----------- -------------------- --------------- ------------- ----------- ----------
#### 2853 5p4610dk709pj 16777216 01-Jun-13 22:50:34 4220890033 1.63 325662 0
SYS@DEMO1> @rsm
Enter value for sid:
Enter value for sql_id: 5p4610dk709pj
Enter value for sql_exec_id:
REPORT
------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL Monitoring Report
SQL Text
------------------------------
select /*+ monitor */ avg(pk_col) from kso.skew2 where translate(to_char(sin(col1)),'1','a') = 'a' and col4 = 'F'
Global Information
------------------------------
Status : DONE (ALL ROWS)
Instance ID : 1
Session : SYS (1366:2853)
SQL ID : 5p4610dk709pj
SQL Execution ID : 16777216
Execution Started : 06/01/2013 22:50:34
First Refresh Time : 06/01/2013 22:50:34
Last Refresh Time : 06/01/2013 22:50:36
Duration : 2s
Module/Action : sqlplus@enkdb03.enkitec.com (TNS V1-V3)/-
Service : SYS$USERS
Program : sqlplus@enkdb03.enkitec.com (TNS V1-V3)
Fetch Calls : 1
Global Stats
=================================================
| Elapsed | Cpu | Other | Fetch | Buffer |
| Time(s) | Time(s) | Waits(s) | Calls | Gets |
=================================================
| 1.63 | 1.62 | 0.02 | 1 | 326K |
=================================================
SQL Plan Monitoring Details (Plan Hash Value=4220890033)
====================================================================================================================================
| Id | Operation | Name | Rows | Cost | Time | Start | Execs | Rows | Activity | Activity Detail |
| | | | (Estim) | | Active(s) | Active | | (Actual) | (%) | (# samples) |
====================================================================================================================================
| 0 | SELECT STATEMENT | | | | 1 | +2 | 1 | 1 | | |
| 1 | SORT AGGREGATE | | 1 | | 1 | +2 | 1 | 1 | | |
| 2 | TABLE ACCESS STORAGE FULL | SKEW2 | 3784 | 89309 | 2 | +1 | 1 | 0 | 100.00 | Cpu (2) |
====================================================================================================================================
So the optimizer reordered the filters correctly (fast filter applied first), as we expected, and the statement executed in less than 2 seconds. Note that this happened even though we had no stats on the table (dynamic sampling had to kick in) and the estimated cardinality of the two filters was the same. So the reordering was done based on estimated CPU cost.
Now let's see what happens when we trick the optimizer into applying the filters in the wrong order. I could do this by changing the statistics, but instead I'll use a SQL Patch to apply the ordered_predicates hint. To do this I'll use a little script I wrote called create_sql_patch.sql, which prompts for a hint and applies it to a statement behind the scenes (much like a SQL Profile or Baseline).
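The script itself isn't reproduced here, but on 11.2 a SQL Patch can be created manually with the internal dbms_sqldiag_internal.i_create_patch call, which is presumably what the script does under the covers. A rough sketch (the literals simply mirror the values entered at the prompts in the next listing; on 12.2 and later the supported dbms_sqldiag.create_sql_patch would be used instead):
declare
  l_sql_text clob;
begin
  -- grab the full text of the statement we want to patch
  select sql_fulltext into l_sql_text
  from   v$sqlarea
  where  sql_id = '5p4610dk709pj';

  -- attach the hint to the statement without touching the application code
  sys.dbms_sqldiag_internal.i_create_patch(
    sql_text  => l_sql_text,
    hint_text => 'ordered_predicates',
    name      => 'ordered_predicates');
end;
/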
SYS@DEMO1> @create_sql_patch
Enter value for sql_id: 5p4610dk709pj
Enter value for profile_name (PATCH_sqlid): ordered_predicates
Enter value for category (DEFAULT):
Enter value for hint_text: ordered_predicates
Enter value for validate (false):
SQL Patch ordered_predicates created.
Elapsed: 00:00:00.04
SYS@DEMO1> select /*+ monitor */ avg(pk_col) from kso.skew2
2 where translate(to_char(sin(col1)),'1','a') = 'a'
3 and col4 = 'F'
4 /
AVG(PK_COL)
-----------
Elapsed: 00:05:54.05 <=== clearly something's changed!
SYS@DEMO1> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 5p4610dk709pj, child number 0
-------------------------------------
select /*+ monitor */ avg(pk_col) from kso.skew2 where
translate(to_char(sin(col1)),'1','a') = 'a' and col4 = 'F'
Plan hash value: 4220890033
-----------------------------------------------------
| Id | Operation | Name | E-Rows |
-----------------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1 |
|* 2 | TABLE ACCESS STORAGE FULL| SKEW2 | 3784 |
-----------------------------------------------------
Predicate Information (identified by operation id): <==== filters now in wrong order
---------------------------------------------------
2 - storage((TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a' AND
"COL4"='F'))
filter((TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a' AND
"COL4"='F'))
Note
-----
- dynamic sampling used for this statement (level=2)
- SQL patch "ordered_predicates" used for this statement <==== Patch applied
- Warning: basic plan statistics not available. These are only collected when:
* hint 'gather_plan_statistics' is used for the statement or
* parameter 'statistics_level' is set to 'ALL', at session or system level
32 rows selected.
Elapsed: 00:00:00.01
So now we have the same statement (sql_id: 5p4610dk709pj) and the same plan (plan_hash_value: 4220890033), but the performance is vastly different (~6 minutes vs. less than 2 seconds). So yes, this can and does happen. Fortunately it doesn't happen very often, at least in my experience. Or I should say, it doesn't seem to happen very often in a way that makes the performance vastly different.
I should note that this specific test case is exceedingly simple and not very likely in real life, because the optimizer should always know how expensive a filter is in terms of CPU usage. The more common cause would be a change in the estimated selectivity of a filter due to changes in statistics. Also, this simple test case just reverses two filters on a single line in the plan, but it is also possible for a filter to be moved to another line, for an access predicate to become a filter, or for some internal function to be applied inside a filter or access predicate.
Since this post is getting long, I plan to do a Part 2 that discusses the forensic aspects (i.e. how do you find this situation?). As Tanel says, "the rabbithole gets deeper, as always".
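A small teaser for Part 2: the @sm output above looks like it comes from v$sql_monitor, and a stripped-down query in the same spirit is one easy way to spot executions that share a plan_hash_value but have wildly different run times. This is just a sketch; it assumes SQL Monitoring (and therefore the Tuning Pack licence) is available, and the sql_id is a placeholder.
-- List monitored executions of a statement so that two runs with the same
-- plan_hash_value but very different elapsed times stand out.
select sql_exec_id,
       to_char(sql_exec_start, 'dd-mon-yy hh24:mi:ss') sql_exec_start,
       sql_plan_hash_value plan_hash_value,
       round(elapsed_time/1000000, 2) etime,
       buffer_gets, disk_reads
from   v$sql_monitor
where  sql_id = '&sql_id'
order  by sql_exec_start;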
Hi Kerry,
Coincidentally, there was a question a couple of days ago on the Oracle-L list http://www.freelists.org/post/oracle-l/Very-unstable-execution-plan,7 which deals with the same issue.
You’ve used the /*+ ordered_predicates */ hint to simulate the issue, but the question is how this could happen in real life? Jonathan Lewis has a blog article about a change in the CBO model that led the optimizer, under certain circumstances, to not apply the where clause predicates in the order they appear in the query:
http://jonathanlewis.wordpress.com/2006/12/15/ora-01722-upgrade-error/#more-96
Best regards
Mohamed
Mohamed,
A big difference in the oracle-l example was that the plan HAD changed, even though the plan_hash_value hadn’t changed and a cursory inspection of the tkprof output suggested that the plan itself hadn’t changed. I’ve now written up a blog article about it, with a simplified model demonstrating the issue: http://jonathanlewis.wordpress.com/2013/06/07/same-plan/
Regards
Jonathan Lewis
Hi Jonathan,
That’s an interesting case. Just to clarify, when you say the plan had changed, you mean that the plan steps were operating on different objects. So in reality the plan had changed, even though that wasn’t reflected in the plan text, and thus the plan_hash_value didn’t change. Randolf’s post points out a bunch of things that are not used in calculating the plan_hash_value, and the fact that it only looks at the text of the plan steps (and knows nothing about the underlying objects) means that synonyms pointing to different objects, or multiple objects with the same name, could all trip it up. I personally believe that the calculation of the plan_hash_value should probably be modified to include more information (such as the predicates and the object_ids). Thanks for the link to the post, by the way.
Kerry
Excellent real-life example!!
Hi Mohamed (Houri),
Part 2 will focus on how to diagnose the problem and some of the causes, and I hope that will help. But briefly, there are many ways it can happen. The optimizer does a lot of things to try to minimize execution time, and moving predicates around is one of the main ones; it tries to filter as many rows as possible as early as possible. So statistics are usually at the heart of these kinds of issues, but you can also influence some of them with hints like PUSH_PRED. The specific test case I showed (filters re-ordered on the same line in the plan) is very simple, but it can happen if, for example, the optimizer decides it needs to do an implicit conversion; in that case it may also re-order the filters. Here’s an example from my test system (I didn’t create it, by the way, it’s just a normal thing that happens).
One cursor has this:
The other has this:
Anyway, I hope that helps a little. I’ll try to get the next post out in the near future.
Kerry
[…] note by Kerry Osborne – http://kerryosborne.oracle-guy.com/2013/06/sql-gone-bad-but-plan-not-changed/. Identical PLAN_HASH_VALUEs are not a guarantee that the underlying execution plans are […]
[…] have learned from Kerry Osborne’s recent post on SQL Gone Bad – But Plan Not Changed?, and a prior post from Randolf Geist on PLAN_HASH_VALUE – How equal (and stable?) are your […]
Just….THANKS a LOT!!!!
[…] recently, two interesting blog articles, here and here, have been published to emphasize the possibility of having a same plan hash value for actually two […]
Hi Kerry,
In my case, my sql_id has the same plan_hash_value with different child numbers, and the child cursors are using different plans. Can you please explain why it is creating different child numbers?
Hi Ram,
There are many reasons a new child cursor may be created. Check v$sql_shared_cursor for the reason why the new cursor was created. Here’s a link to a script (a modified version of one Dion Cho wrote) to check why new child cursors were created: mismatch.sql
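If you just want a quick look without the script, querying the view directly is a reasonable start (on 11.2.0.2 and later the REASON column even gives an XML summary of why each child cursor was built); the sql_id is a placeholder:
-- One row per child cursor, with dozens of Y/N mismatch columns plus REASON.
select sql_id, child_number, reason
from   v$sql_shared_cursor
where  sql_id = '&sql_id';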
Kerry
[…] specific parts of the plan are included in the hash value calculation (see this post for more info: SQL Gone Bad – But Plan Not Changed? In this case the operation (TABLE ACCESS) and not the options (INMEMORY FULL or STORAGE FULL) are […]
Also don’t forget that your plan may not have changed in your local database, while it dramatically changed in a remote database.
You’ll have no indication of a change, while the performance could be 100 times worse. Yes, the execution plan will show the remote database, but it’s easy to forget that you have to track down the sql in the remote database and also check the plan or plans there. Where I work there may be a dozen plans in remote databases accessed by scalar subqueries… terrible writing style, but not uncommon.