SQL Gone Bad – But Plan Not Changed? – Part 1

Last week an interesting issue popped up on a mission critical production app (MCPA). A statement that was run as part of a nightly batch process ran long. In fact, the statement never finished and the job had to be killed and restarted. This particular system is prone to plan stability issues due to various factors outside the scope of this post, so the first thing that the guys checked was if there had been a plan change. Surprisingly the plan_hash_value was the same as it had been for the past several months. The statement was very simple and a quick look at the xplan output showed that the plan was indeed the same with one exception. The predicate section was slightly different.

As a quick diversion, you probably already know that the plan_hash_value is calculated based on partial information about the plan. Arguably it’s the most important parts, but there are some important parts of the plan that are not included (namely the stuff that shows up in the predicate section of the plan). Randolf Geist explained which parts of the plan are used in calculating the plan_hash_value well in a post on How PLAN_HASH_VALUES Are Calculated several years ago. His summary was this:

So in summary the following conclusions can be made:

– The same PLAN_HASH_VALUE is merely an indicator that the same operations on objects of the same name are performed in the same order.

– It tells nothing about the similarity of the expected runtime performance of the execution plan, due to various reasons as demonstrated. The most significant information that is not covered by the PLAN_HASH_VALUE are the filter and access predicates, but there are other attributes, too, that are not part of the hash value calculation.

The production issue got me thinking about several things:

    Could I come up with a simple test case to show a severe performance degradation between executions of a statement with the same plan_hash_value because of a change in the predicates? (the answer is it’s pretty easy actually)

    What forensics could be used to determine when this situation has occurred?

    How do you fix the problem?

So the first thing I did was create a test case that consists of a query with two predicates, one that is fairly CPU intensive and the other that does a simple equality comparison (which uses very little CPU). Both of the filters are very selective (and the optimizer actually estimate that the same number of rows will be returned on my test system) but the CPU cost of executing them is very different. So the order in which the filters are applied should make a big difference. I also put the low CPU filter last so that I could use a rarely used hint (ordered_predicates) to rearrange the way the optimizer wants to do it (i.e. apply the expensive one first). So here’s the set up.

 
SYS@DEMO1> -- fast filter 
SYS@DEMO1> select count(*) from kso.skew2 where col4='F';
 
  COUNT(*)
----------
         1
 
Elapsed: 00:00:00.82 <======== about 1 second with no smart scan tricks
 
SYS@DEMO1> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST')); 
 
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  gzvkkhaykkphp, child number 0
-------------------------------------
select count(*) from kso.skew2 where col4='F'
 
Plan hash value: 4220890033
 
-----------------------------------------------------
| Id  | Operation                  | Name  | E-Rows |
-----------------------------------------------------
|   0 | SELECT STATEMENT           |       |        |
|   1 |  SORT AGGREGATE            |       |      1 |
|*  2 |   TABLE ACCESS STORAGE FULL| SKEW2 |   3610 | <=== optimizer estimates 3610
-----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - storage("COL4"='F')
       filter("COL4"='F')
 
Note
-----
   - dynamic sampling used for this statement (level=2)
   - Warning: basic plan statistics not available. These are only collected when:
       * hint 'gather_plan_statistics' is used for the statement or
       * parameter 'statistics_level' is set to 'ALL', at session or system level
 
 
27 rows selected.
 
Elapsed: 00:00:00.01
SYS@DEMO1> -- slow filter
SYS@DEMO1> select count(*) from kso.skew2 where TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a';
 
  COUNT(*)
----------
         0
Elapsed: 00:05:49.85 <==== almost 6 minutes with no smart scan
 
SYS@DEMO1> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST')); 
 
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  gatc45y5cza17, child number 0
-------------------------------------
select count(*) from kso.skew2 where
TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a'
 
Plan hash value: 4220890033
 
-----------------------------------------------------
| Id  | Operation                  | Name  | E-Rows |
-----------------------------------------------------
|   0 | SELECT STATEMENT           |       |        |
|   1 |  SORT AGGREGATE            |       |      1 |
|*  2 |   TABLE ACCESS STORAGE FULL| SKEW2 |   3610 | <=== optimizer still expects 3610 rows
-----------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - storage(TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a')
       filter(TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a')
 
Note
-----
   - dynamic sampling used for this statement (level=2)
   - Warning: basic plan statistics not available. These are only collected when:
       * hint 'gather_plan_statistics' is used for the statement or
       * parameter 'statistics_level' is set to 'ALL', at session or system level
 
 
28 rows selected.
Elapsed: 00:00:00.02

The first section shows how long it takes to read through the table and apply each of the 2 filters. As you can see, the COL4=’F’ filter takes much less time (note that I don’t want to get bogged down here, but I hope you’ll trust me that no Exadata magic kicked in and they did the same amount of work in terms of lio/pio, etc…). Also note that the STORAGE predicate is present merely because we are on an Exadata (as was the original problem query) and it only signifies that a predicate could potentially be offloaded, not that it actually has been offloaded. Please take my word for it that no smarts kick in for any of the testing in this post.

The next bit of code shows the actual test case which combines the two filters.

SYS@DEMO1> select /*+ monitor */ avg(pk_col) from kso.skew2
  2  where translate(to_char(sin(col1)),'1','a') = 'a'   
  3  and col4 = 'F'
  4  /
 
AVG(PK_COL)
-----------
 
 
Elapsed: 00:00:01.72
SYS@DEMO1> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
 
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------
SQL_ID  5p4610dk709pj, child number 0
-------------------------------------
select /*+ monitor */ avg(pk_col) from kso.skew2 where
translate(to_char(sin(col1)),'1','a') = 'a' and col4 = 'F'
 
Plan hash value: 4220890033
 
-----------------------------------------------------
| Id  | Operation                  | Name  | E-Rows |
-----------------------------------------------------
|   0 | SELECT STATEMENT           |       |        |
|   1 |  SORT AGGREGATE            |       |      1 |
|*  2 |   TABLE ACCESS STORAGE FULL| SKEW2 |   3784 |
-----------------------------------------------------
 
Predicate Information (identified by operation id): <==== note that the fast filter is applied first
---------------------------------------------------
 
   2 - storage(("COL4"='F' AND TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='
              a'))
       filter(("COL4"='F' AND TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a
              '))
 
Note
-----
   - dynamic sampling used for this statement (level=2)
   - Warning: basic plan statistics not available. These are only collected when:
       * hint 'gather_plan_statistics' is used for the statement or
       * parameter 'statistics_level' is set to 'ALL', at session or system level
 
 
30 rows selected.
 
Elapsed: 00:00:00.01
SYS@DEMO1> @sm
Enter value for sid: 
Enter value for sql_id: 5p4610dk709pj
Enter value for sql_exec_id: 
 
 SID    SERIAL# SQL_ID        SQL_EXEC_ID SQL_EXEC_START       PLAN_HASH_VALUE         ETIME BUFFER_GETS DISK_READS
---- ---------- ------------- ----------- -------------------- --------------- ------------- ----------- ----------
####       2853 5p4610dk709pj    16777216 01-Jun-13 22:50:34        4220890033          1.63      325662          0
 
SYS@DEMO1> @rsm
Enter value for sid: 
Enter value for sql_id: 5p4610dk709pj
Enter value for sql_exec_id: 
 
REPORT
------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL Monitoring Report
 
SQL Text
------------------------------
select /*+ monitor */ avg(pk_col) from kso.skew2 where translate(to_char(sin(col1)),'1','a') = 'a' and col4 = 'F'
 
Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  SYS (1366:2853)
 SQL ID              :  5p4610dk709pj
 SQL Execution ID    :  16777216
 Execution Started   :  06/01/2013 22:50:34
 First Refresh Time  :  06/01/2013 22:50:34
 Last Refresh Time   :  06/01/2013 22:50:36
 Duration            :  2s
 Module/Action       :  sqlplus@enkdb03.enkitec.com (TNS V1-V3)/-
 Service             :  SYS$USERS
 Program             :  sqlplus@enkdb03.enkitec.com (TNS V1-V3)
 Fetch Calls         :  1
 
Global Stats
=================================================
| Elapsed |   Cpu   |  Other   | Fetch | Buffer |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  |
=================================================
|    1.63 |    1.62 |     0.02 |     1 |   326K |
=================================================
 
SQL Plan Monitoring Details (Plan Hash Value=4220890033)
====================================================================================================================================
| Id |          Operation           | Name  |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Activity | Activity Detail |
|    |                              |       | (Estim) |       | Active(s) | Active |       | (Actual) |   (%)    |   (# samples)   |
====================================================================================================================================
|  0 | SELECT STATEMENT             |       |         |       |         1 |     +2 |     1 |        1 |          |                 |
|  1 |   SORT AGGREGATE             |       |       1 |       |         1 |     +2 |     1 |        1 |          |                 |
|  2 |    TABLE ACCESS STORAGE FULL | SKEW2 |    3784 | 89309 |         2 |     +1 |     1 |        0 |   100.00 | Cpu (2)         |
====================================================================================================================================

So the optimizer reordered the filters correctly (fast filter applied first) as we expected and the statement executed in less than 2 seconds. Note that this happened even though we had no stats on the table and dynamic sampling had to kick in and that the estimated cardinality of the two filters was the same. So this reordering was done based on estimated CPU cost.

Now let’s see what happens when we trick the optimizer into applying the filters in the wrong order. I could do this by changing the statistics but I’ll use a SQL Patch to apply the ordered_predicates hint. To do this I’ll use a little script I wrote called create_sql_patch.sql which prompts for a hint and applies it to a statement behind the scenes (much like a SQL Profile or Baseline).

 
SYS@DEMO1> @create_sql_patch
Enter value for sql_id: 5p4610dk709pj
Enter value for profile_name (PATCH_sqlid): ordered_predicates
Enter value for category (DEFAULT): 
Enter value for hint_text: ordered_predicates
Enter value for validate (false): 
 
SQL Patch ordered_predicates created.
 
Elapsed: 00:00:00.04
SYS@DEMO1> select /*+ monitor */ avg(pk_col) from kso.skew2
  2  where translate(to_char(sin(col1)),'1','a') = 'a'
  3  and col4 = 'F'
  4  /
 
AVG(PK_COL)
-----------
 
 
Elapsed: 00:05:54.05 <=== clearly something's changed!
SYS@DEMO1> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
 
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  5p4610dk709pj, child number 0
-------------------------------------
select /*+ monitor */ avg(pk_col) from kso.skew2 where
translate(to_char(sin(col1)),'1','a') = 'a' and col4 = 'F'
 
Plan hash value: 4220890033
 
-----------------------------------------------------
| Id  | Operation                  | Name  | E-Rows |
-----------------------------------------------------
|   0 | SELECT STATEMENT           |       |        |
|   1 |  SORT AGGREGATE            |       |      1 |
|*  2 |   TABLE ACCESS STORAGE FULL| SKEW2 |   3784 |
-----------------------------------------------------
 
Predicate Information (identified by operation id): <==== filters now in wrong order
---------------------------------------------------
 
   2 - storage((TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a' AND
              "COL4"='F'))
       filter((TRANSLATE(TO_CHAR(SIN("COL1")),'1','a')='a' AND
              "COL4"='F'))
 
Note
-----
   - dynamic sampling used for this statement (level=2)
   - SQL patch "ordered_predicates" used for this statement <==== Patch applied
   - Warning: basic plan statistics not available. These are only collected when:
       * hint 'gather_plan_statistics' is used for the statement or
       * parameter 'statistics_level' is set to 'ALL', at session or system level
 
 
32 rows selected.
 
Elapsed: 00:00:00.01

So now we have the same statement (sql_id: 5p4610dk709pj) and the same plan (plan_hash_value: 4220890033) but the performance is vastly different (~6 minutes vs. < 2 seconds). So yes this can and does happen. Fortunately it doesn't happen very often, at least in my experience. Or at least I should say, it doesn't seem to happen very often where the performance is vastly different. I should note that this specific test case is exceedingly simple and not very likely in real life because the optimizer should always know how expensive a filter is in terms of CPU usage. The more common case would be a change in estimated selectivity of a filter due to changes in statistics. Also, this simple test case is just reversing 2 filters on a single line in a plan but it is also possible that a filter would be moved to another line or that an access path may become a filter or some internal function may be applied inside a filter or access predicate. Since this post is getting long I will plan to do a Part 2 that discusses the forensic aspects (i.e. how do you find this situation?). As Tanel says, “the rabbithole gets deeper, as always”.

Leave a Reply