Those of you who are old enough to remember disco dancing will probably remember this guy. Wimpy (J. Wellington Wimpy, actually) was a lazy, gluttonous character in the old Popeye cartoons.
I was reminded of Wimpy last week by a performance problem on an 11g database I was looking at (I’ll tell you why in a minute). So here’s what happened. Several statements were running much more slowly than usual (tens of seconds instead of less than a second), so I started poking around, and here’s what I saw:
(note that I cleaned this up a bit to protect the innocent)
(also note that many of the scripts I use have been documented in other posts – so if I haven’t shown the SQL for a particular script, try searching for it on this site)
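(one example: the find_sql_acs script used below is essentially a query against v$sql that adds the 11g adaptive cursor sharing columns – the following is just my rough sketch of it, not the exact script)

select sql_id, child_number child_no, plan_hash_value,
       is_obsolete iob, is_bind_sensitive ibs, is_bind_aware iba, is_shareable ish,
       executions execs, rows_processed,
       elapsed_time/decode(nvl(executions,0),0,1,executions)/1000000 avg_etime,
       cpu_time/decode(nvl(executions,0),0,1,executions)/1000000 avg_cpu,
       disk_reads/decode(nvl(executions,0),0,1,executions) avg_pio,
       buffer_gets/decode(nvl(executions,0),0,1,executions) avg_lio,
       sql_text
  from v$sql
 where upper(sql_text) like upper(nvl('&sql_text',sql_text))
   and sql_id like nvl('&sql_id',sql_id)
   and is_bind_aware like nvl('&is_bind_aware',is_bind_aware)
 order by sql_id, child_number;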
SQL> @find_sql_acs
Enter value for sql_text:
Enter value for sql_id: 3dhvsp0duttat
Enter value for is_bind_aware:
SQL_ID CHILD PLAN_HASH_VALUE IOB IBS IBA ISH EXECS ROWS_PROCESSED AVG_ETIME AVG_CPU AVG_PIO AVG_LIO SQL_TEXT
------------- ------ --------------- --- --- --- --- ---------- -------------- ---------- ---------- ---------- ---------- ------------------
3dhvsp0duttat 0 3289235442 N N N Y 9 46 0.02 0.00 1.11 272 SELECT row_order,
3dhvsp0duttat 1 3289235442 N N N Y 13 48 0.08 0.01 12.62 546 SELECT row_order,
3dhvsp0duttat 2 4076066623 N N N Y 15 31 38.07 6.55 27,176.53 2,426,778 SELECT row_order,
SQL> @dplan
Enter value for sql_id: 3dhvsp0duttat
Enter value for child_no: 0
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 3dhvsp0duttat, child number 0
-------------------------------------
SELECT row_order, ...
Plan hash value: 3289235442
--------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 5 (100)| | | |
|* 1 | VIEW | | 1 | 586 | 5 (20)| 00:00:01 | | |
|* 2 | WINDOW SORT PUSHED RANK | | 1 | 258 | 5 (20)| 00:00:01 | | |
|* 3 | FILTER | | | | | | | |
| 4 | PARTITION RANGE ITERATOR | | 1 | 258 | 4 (0)| 00:00:01 | KEY | KEY |
|* 5 | TABLE ACCESS BY LOCAL INDEX ROWID| REALLY_BIG_TABLE | 1 | 258 | 4 (0)| 00:00:01 | KEY | KEY |
|* 6 | INDEX RANGE SCAN | GOOD_INDEX | 1 | | 3 (0)| 00:00:01 | KEY | KEY |
--------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(...)
2 - filter(ROW_NUMBER() OVER ( ORDER BY "DATE_COL")<=:25)
3 - filter((TO_DATE(:1,:2)<=TO_DATE(:3,:4) AND :24<=:25))
5 - filter(...)
6 - access("COL1" LIKE :21 AND "DATE_COL">=TO_DATE(:1,:2) AND "DATE_COL"<=TO_DATE(:3,:4))
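(dplan, by the way, is just a thin wrapper around dbms_xplan.display_cursor – roughly the following, with the format option being my guess)

select * from table(dbms_xplan.display_cursor('&sql_id',&child_no,'typical'));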
SQL> @dplan
Enter value for sql_id: 3dhvsp0duttat
Enter value for child_no: 2
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 3dhvsp0duttat, child number 2
-------------------------------------
SELECT row_order, ...
Plan hash value: 4076066623
--------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 1 (100)| | |
|* 1 | VIEW | | 1 | 586 | 0 (0)| | |
|* 2 | WINDOW NOSORT STOPKEY | | 1 | 574 | 0 (0)| | |
|* 3 | FILTER | | | | | | |
| 4 | PARTITION RANGE ITERATOR | | 1 | 574 | 0 (0)| KEY | KEY |
|* 5 | TABLE ACCESS BY LOCAL INDEX ROWID| REALLY_BIG_TABLE | 1 | 574 | 0 (0)| KEY | KEY |
|* 6 | INDEX RANGE SCAN | BAD_INDEX | 1 | | 0 (0)| KEY | KEY |
--------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(("ROW_ORDER">=:24 AND "ROW_ORDER"<=:25))
2 - filter(ROW_NUMBER() OVER ( ORDER BY "DATE_COL")<=:25)
3 - filter((TO_DATE(:1,:2)<=TO_DATE(:3,:4) AND :24<=:25))
5 - filter(...)
6 - access("DATE_COL">=TO_DATE(:1,:2) AND "DATE_COL"<=TO_DATE(:3,:4))
SQL> -- Uh oh - cost of 1, no records expected
SQL>
SQL> @child_load_time
Enter value for sql_text:
Enter value for sql_id: 3dhvsp0duttat
SQL_ID CHILD PLAN_HASH LAST_LOAD_TIME EXECS AVG_ETIME AVG_PIO AVG_LIO SQL_TEXT
------------- ------ ---------- --------------- ---------- ------------- ---------- ------------ -----------------------------------------
3dhvsp0duttat 0 3289235442 2009-05-01/12:04:20 9 0.02 1 272 SELECT row_order, ...
3dhvsp0duttat 1 3289235442 2009-05-05/19:43:14 13 0.08 13 546
3dhvsp0duttat 2 4076066623 2009-05-07/16:49:36 15 38.07 27,177 2,426,778
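In case you’re wondering, child_load_time is basically another query against v$sql keyed on last_load_time – something along these lines (my approximation, not the exact script):

select sql_id, child_number child, plan_hash_value plan_hash, last_load_time,
       executions execs,
       elapsed_time/decode(nvl(executions,0),0,1,executions)/1000000 avg_etime,
       disk_reads/decode(nvl(executions,0),0,1,executions) avg_pio,
       buffer_gets/decode(nvl(executions,0),0,1,executions) avg_lio,
       sql_text
  from v$sql
 where upper(sql_text) like upper(nvl('&sql_text',sql_text))
   and sql_id like nvl('&sql_id',sql_id)
 order by sql_id, child_number;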
SQL> -- so it looks like maybe something bad happened with stats on 5/7
SQL> -- let's make use of the diff_table_stats procedure to see if we can spot what happened
SQL> !date
Tue May 12 11:59:22 CDT 2009
SQL> !cat diff_table_stats.sql
select report, maxdiffpct from
table(dbms_stats.diff_table_stats_in_history('&owner','&table_name',
systimestamp-&days_ago))
/
SQL> @diff_table_stats.sql
Enter value for owner: XXX
Enter value for table_name: REALLY_BIG_TABLE
Enter value for days_ago: 5
REPORT MAXDIFFPCT
-------------------------------------------------------------------------------- ----------
###############################################################################
STATISTICS DIFFERENCE REPORT FOR:
.................................
TABLE : REALLY_BIG_TABLE
OWNER : XXX
SOURCE A : Statistics as of 07-MAY-09 11.46.36.000000 AM -05:00
SOURCE B : Current Statistics in dictionary
PCTTHRESHOLD : 10
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TABLE / (SUB)PARTITION STATISTICS DIFFERENCE:
.............................................
OBJECTNAME TYP SRC ROWS BLOCKS ROWLEN SAMPSIZE
...............................................................................
P2009W19 P A 0 0 0 0
B 4001801 187810 342 4001801
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
COLUMN STATISTICS DIFFERENCE:
.............................
COLUMN_NAME SRC NDV DENSITY HIST NULLS LEN MIN MAX SAMPSIZ
...............................................................................
PARTITION: P2009W19
...................
COL1 A 0 0 NO 0 0 0
B 105136 .000009511 NO 2565391 4 302E3 6E303 1436410
COL2 A 0 0 NO 0 0 0
B 669696 .000001493 NO 1011264 9 2A2A2 554E4 2990537
COL3 A 0 0 NO 0 0 0
B 988 .001012145 NO 1113316 4 30303 39393 2888485
COL4 A 0 0 NO 0 0 0
B 669696 .000037461 YES 0 11 2A2A2 554E4 5540
...
SQL> -- Sure enough, stats were created when the partition was empty, ouch!
So a few things to note here:
1. the table is partitioned
2. dbms_stats.diff_table_stats_in_history is smart enough to show us partition stats (nice)
3. on May 7th the stats changed dramatically
4. this statement was not re-parsed when the stats changed (see the quick check sketched below)
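A quick way to see why the cursor wasn’t invalidated is to check the NO_INVALIDATE preference being used for stats gathering on the table (the default, DBMS_STATS.AUTO_INVALIDATE, rolls cursor invalidations out over a window rather than doing them immediately). Something like this – just a sketch of the kind of check you could run, using the owner/table/partition names from above:

select dbms_stats.get_prefs('NO_INVALIDATE','XXX','REALLY_BIG_TABLE') no_invalidate
  from dual;

-- one option (not necessarily what we did) is to force immediate invalidation
-- the next time stats are gathered on the partition:
exec dbms_stats.gather_table_stats('XXX','REALLY_BIG_TABLE', -
     partname => 'P2009W19', no_invalidate => FALSE)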
So what’s the fix? Well, since stats had been generated on this partition every night for the last 5 days, the fix was just to get the optimizer to re-parse my query (and any other affected statements). In 11g this is very easy, as there is a built-in (pseudo-documented) function to flush a single SQL statement from the shared pool (see my previous post – Flush A Single SQL Statement – for details). However, I wasn’t sure how many statements were affected (more on that in a minute), but certainly any statement that touched that partitioned table and got parsed on May 7th was almost certain to have a bad plan. So eventually we decided to take the brute force approach and just flush the entire shared pool. (I know, not elegant, but fairly effective.) Note that there is no guarantee that flushing the shared pool will clear all the statements – in fact, on a busy system you’re almost guaranteed that it won’t flush every statement – but it did clear up most of the issues.

So with the pressure now off (and the system back to running pretty well) I had time to write a little script to look for SQL statements with a cost of 1. (I guess I could have looked for statements that were parsed on May 7th and accessed my partitioned table, but I didn’t.)
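By the way, flush_sql (used below) is basically just a wrapper around dbms_shared_pool.purge – something along these lines (a sketch; the real script may differ slightly):

-- flush a single cursor from the shared pool
declare
  l_name varchar2(100);
begin
  select address||','||hash_value
    into l_name
    from v$sqlarea
   where sql_id = '&sql_id';
  sys.dbms_shared_pool.purge(l_name, 'C');
end;
/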
Here’s a bit of the SQL*Plus Session:
SQL> -- initially we flushed a couple directly
SQL>
SQL> @flush_sql
Enter value for sql_id: 3dhvsp0duttat
PL/SQL procedure successfully completed.
SQL> -- so we can find all of these guys and flush them - or just flush the whole shared pool
SQL> -- flushing the shared pool is easier!
SQL>
SQL> alter system flush shared_pool;
System altered.
SQL> -- but there are still no guarantees they are all cleaned up, let's check
SQL>
SQL> !cat find_sql_by_cost.sql
select distinct parsing_schema_name, sa.sql_id, sp.plan_hash_value, cost
from v$sql_plan sp, v$sqlarea sa
where sp.sql_id = sa.sql_id
and parsing_schema_name like nvl('&parsing_user',parsing_schema_name)
and cost like nvl(to_number('&cost'),to_number(cost))
and sp.plan_hash_value like nvl('&plan_hash_value',sp.plan_hash_value)
and parent_id is null
order by 4 desc
/
SQL> @find_sql_by_cost
Enter value for parsing_user:
Enter value for cost: 1
Enter value for plan_hash_value: 407%
PARSING_SCHEMA_NAME SQL_ID PLAN_HASH_VALUE COST
------------------------------ ------------- --------------- ----------
XXXX_PROD 02t2tt28xp72m 4076066623 1
XXXX_PROD 5byphafyx78j6 4076066623 1
SQL> -- note I did a little checking here to make sure they were suffering from the same issue
SQL> -- then I cleaned them up
SQL>
SQL>
SQL> @flush_sql
Enter value for sql_id: 02t2tt28xp72m
PL/SQL procedure successfully completed.
SQL> @find_sql
Enter value for sql_text:
Enter value for sql_id: 02t2tt28xp72m
no rows selected
SQL> @flush_sql
Enter value for sql_id: 5byphafyx78j6
PL/SQL procedure successfully completed.
SQL> @find_sql
Enter value for sql_text:
Enter value for sql_id: 5byphafyx78j6
no rows selected
So the short-term fix worked. The statements picked up the correct plans after re-parsing. Now let’s talk about the long-term fix. Why did the partition not have stats on it? It turns out the partitioning scheme in place was range by date with a new partition for each week, each partition running from Thursday through the following Wednesday. Several partitions had been created in advance but, as you may have guessed by now, nothing was done about the stats, meaning that those partitions had values of 0 for the number of rows. Since the gather stats job was kicking in every evening, a whole day’s worth of data could be loaded into a new partition before the stats would be updated. So roughly 24 hours’ worth of activity would occur with "bad" stats every week. Couple that with the NO_INVALIDATE default on the stats gathering (which doesn’t immediately invalidate dependent cursors) and you get statements that may hang around for extended periods of time, even if the stats change dramatically (0 rows to a few million is dramatic, right?).

So we manually set the stats on the pre-created partitions using a combination of dbms_stats.copy_table_stats, dbms_stats.set_table_stats and dbms_stats.set_column_stats (see Dion Cho’s excellent post for details and example code, and also this post on the Optimizermagic Blog for further info). I should also mention that Greg Rahn has several good posts on 11g Statistics Enhancements here, here, and here.
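For what it’s worth, the copy_table_stats piece of that fix looks roughly like the sketch below – the destination partition name here is made up for illustration, and as mentioned above the real fix also tweaked a few values with set_table_stats/set_column_stats:

begin
  dbms_stats.copy_table_stats(
    ownname      => 'XXX',
    tabname      => 'REALLY_BIG_TABLE',
    srcpartname  => 'P2009W19',    -- a recent, fully populated partition
    dstpartname  => 'P2009W20',    -- a pre-created (still empty) partition - illustrative name
    scale_factor => 1);
end;
/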
So why was I reminded of Wimpy? Because Wimpy’s favorite saying was “I’ll gladly pay you Tuesday for a hamburger today”. And the way this system was set up, every Tuesday (er, Thursday), we were paying the price for our bad/missing statistics. OK, so it’s a stretch, but I like the pictures of Wimpy!
Please feel free to leave me a comment letting me know how you feel …
… about the pictures of Wimpy.