Exadata Offload – The Secret Sauce

The “Secret Sauce” for Exadata is its ability to offload processing to the storage tier. Offloading and Smart Scan are two terms that are used somewhat interchangeably. Offloading is a more generic term that means doing work at the storage tier that would otherwise have to be done on the database tier (this can include work that is not related to executing queries such as optimization of incremental backups). Smart Scans on the other hand are the access mechanism used to offload query processing tasks. For example, storage servers can apply predicate filters at the storage layer, instead of shipping every possible block back to the database server(s). Another thing that happens with Smart Scans is that the volume of data returned can be further reduced by column projection (i.e. if you only select 1 column from a 100 column table, there is no need to return the other 99 columns). Offloading is geared to long running queries that access a large amount of data. Offloading only works if Oracle decides to use its direct path read mechanism. Direct path reads have traditionally been done by parallel query slaves, but can also be done by serial queries. In fact, as of 11g, Oracle has changed the decision making process resulting in more aggressive use of serial direct path reads. I’ve seen this feature described both as “serial direct path reads” and “adaptive direct path reads”.

I’ll digress here a bit to discuss this feature since direct path reads are critical to Exadata Offloading. Direct path reads do not load blocks into Oracle’s buffer cache. Instead, the data is returned directly to the PGA of the process requesting the data. This means that the data does not have to be in Oracle block format. That means no 8K block that is only partially filled, that may only have a record or two that you’re interested in, containing every column including ones you don’t want, and with additional header information – needs to be shipped back up from the storage layer. Instead, a much more compact result set containing only the columns requested and hopefully only the rows you need are returned. As I said, direct path reads are traditionally used by parallel query slaves. They are also used in a few other instances such as LOB access and sorts that spill over into TEMP. So the ability to use direct path reads is very important to the Exadata platform and thus the changes to the make them more attractive in 11g. Here are a few links to info on the subject of serial direct path reads:

  1. Doug Burns has a good post on 11g serial direct path reads.
  2. Alex Fatkulin has a very good post on some of the factors controlling adaptive direct path reads.
  3. There is a note on MOS (793845.1) on changes in 11g in the heuristics to choose between direct path reads or reads through the buffer cache.
  4. You may also find MOS note (50415.1) on misleading nature of “direct path read” wait events of interest.

Also be aware that direct path reads are only available for full scans (tables or indexes). So any statement that uses an index range scan to get to a row in a table via a rowid will not use this mechanism. Also keep in mind that direct path requires extra processing to ensure that all blocks on disk are current – (i.e. an object level check point), so frequently modified tables will suffer some overhead before direct path reads can be initiated.

I must say that I think the changes to the heuristics in 11g may be a little on the aggressive side for non-Exadata platforms (the changes may well be driven by Exadata). And by the way, serial direct path reads are not always faster than the normal reads that go through the buffer cache. Dion Cho has a good post on a performance problem due to serial direct path reads kicking in on one node of an 11g RAC environment (not Exadata). The node doing the direct path reads was running the query much slower than the node using the normal buffer cache reads. He also has a post on turning off serial direct path reads.

But enough about the direct path reads stuff, on to the Offloading. One of the first things I wanted to know when I got my first look at a system running on Exadata was whether a particular query was eligible for offloading and if so, how much of the expected i/o was saved by the feature. So of course I wrote a little script to show me that. Turns out there is plenty of info in V$SQL to see what’s going on. I called the script fsx.sql (short for Find_Sql_eXadata). Here’s a little demo:

SYS@LABRAT1> @test_offload
SYS@LABRAT1> -- ran but turned off output
SYS@LABRAT1> -- select /* test full */ * from kso.skew3; 
SYS@LABRAT1> 
SYS@LABRAT1> select /* test count */ count(col1) from kso.skew3;
 
COUNT(COL1)
-----------
  384000036
 
1 row selected.
 
Elapsed: 00:00:24.04
SYS@LABRAT1> select /* test count(rll) */ count(pk_col), count(col1), count(col2), count(col3), count(col4) from kso.skew3;
 
COUNT(PK_COL) COUNT(COL1) COUNT(COL2) COUNT(COL3) COUNT(COL4)
------------- ----------- ----------- ----------- -----------
    384000048   384000036   384000036   384000036   384000036
 
1 row selected.
 
Elapsed: 00:00:39.09
SYS@LABRAT1> select /* test max */ max(col1) from kso.skew3;
 
 MAX(COL1)
----------
   1000000
 
1 row selected.
 
Elapsed: 00:00:24.51
SYS@LABRAT1> select /* test max all */ max(pk_col), max(col1), max(col2), max(col3), max(col4) from kso.skew3;
 
MAX(PK_COL)  MAX(COL1) MAX(COL2)                      MAX(COL3) M
----------- ---------- ------------------------------ --------- -
 1.0000E+12    1000000 asddsadasd                     01-JAN-09 Y
 
1 row selected.
 
Elapsed: 00:00:57.32
SYS@LABRAT1> select /* test avg */ avg(col1) from kso.skew3;
 
 AVG(COL1)
----------
495000.194
 
1 row selected.
 
Elapsed: 00:00:30.00
SYS@LABRAT1> select /* test avg filter */ avg(col1) from kso.skew3 where col1 between 2 and 10000;
 
 AVG(COL1)
----------
 
 
1 row selected.
 
Elapsed: 00:00:13.97
SYS@LABRAT1> select /* test index */ avg(pk_col) from kso.skew where col1 = 136133;
 
AVG(PK_COL)
-----------
   15636133
 
1 row selected.
 
Elapsed: 00:00:00.00
SYS@LABRAT1> set echo off
SYS@LABRAT1> set timing off
SYS@LABRAT1> @fsx
Enter value for sql_text: % test %
Enter value for sql_id: 
Enter value for inst_id: 
 
 INST SQL_ID         CHILD  PLAN_HASH      EXECS     AVG_ETIME      AVG_LIO    AVG_PIO AVG_PX OFFLOADABLE IO_SAVED_% SQL_TEXT
----- ------------- ------ ---------- ---------- ------------- ------------ ---------- ------ ----------- ---------- ----------------------------------------
    1 1767x3dw3m8pa      0 2684249835          1         57.31    1,956,226  1,956,219      0 Yes               9.35 select /* test max all */ max(pk_col), m
    1 1wp72w4c5d449      0 2684249835          1         29.98    1,956,226  1,956,219      0 Yes              71.84 select /* test avg */ avg(col1) from kso
    1 46vd0hducqms8      0 1667163222          1        422.77    2,032,691  1,956,219      0 No                 .00 select /* test full */ * from kso.skew3
    1 5ry8vd8u0n0bf      0 2684249835          1         24.03    1,956,226  1,956,219      0 Yes              71.85 select /* test count */ count(col1) from
    1 6jtf4rnxu2dq7      0 2684249835          1         24.50    1,956,226  1,956,219      0 Yes              71.85 select /* test max */ max(col1) from kso
    1 8pfxs6mdf62mq      0 3723858078          1           .00           35          0      0 No                 .00 select /* test index */ avg(pk_col) from
    1 abwyftyy395gk      0 2684249835          1         13.96    1,956,226  1,956,219      0 Yes              99.99 select /* test avg filter */ avg(col1) f
    1 fxpkhxt4jwqqm      0 2684249835          1         39.08    1,956,226  1,956,219      0 Yes               9.35 select /* test count(rll) */ count(pk_co
 
8 rows selected.

So I did several queries. Some were off-loadable and some were not. Then I ran the script to see what how the statements had been handled. There are a couple of things to note here:

  1. I used IO_CELL_OFFLOAD_ELIGIBLE_BYTES in V$SQL to indicate whether a statement can be offloaded or not.
    (i.e. decode(IO_CELL_OFFLOAD_ELIGIBLE_BYTES,0,’No’,’Yes’) Offloadable) Note, if you run this script on a non-Exadata 11gR2 database, all the SQL statements should show up as Offloadable=NO.
  2. All the statements using the aggregate functions (max,count,avg…) are marked as off-loadable even though the functions themselves are not off-loadable. Projection accounts for the reduced IO (limited to columns requested) and the marking as off-loadable. Note that the statement commented “test max all” shows almost no reduction in IO (it applies the max function to every column in the table). I’m not yet sure why there was a slight reduction in I/O even with all columns listed. And by the way, V$SQLFN_METADATA has a list of all valid functions and whether they are off-loadable or not (none of the aggregate functions are).
  3. It appears the statements selecting all columns with no where clause are not off-loadable (no filter predicate or column projection).
  4. Statements using index range scans are not off-loadable (the optimizer has to pick a plan with a full scan). Check out the statement commented “test index” for an example.

It is possible to turn offloading off via a parameter, CELL_OFFLOAD_PROCESSING, or via a hint. Here’s a bit of the Oracle Exadata Storage Server Software User’s Guide:

The CELL_OFFLOAD_PROCESSING initialization parameter enables SQL processing offload to Exadata Cell. When the value of the parameter is set to TRUE, predicate evaluation can be offloaded to cells. The default value of the parameter is TRUE. If the parameter is set to FALSE at the session or system level, then the database performs all the predicate evaluation with cells serving blocks. You can set CELL_OFFLOAD_ PROCESSING dynamically with the SQL ALTER SYSTEM or ALTER SESSION command. For example:
SQL> ALTER SESSION SET CELL_OFFLOAD_PROCESSING = TRUE;
The CELL_OFFLOAD_PROCESSING parameter can also be set with the OPT_PARAM optimizer hint to enable or disable predicate filtering for a specific SQL command. For example:
— to disable cell_offload_processing for a SQL command SELECT /*+ OPT_PARAM(‘cell_offload_processing’ ‘false’) */ COUNT(*) FROM EMPLOYEES;
— to enable cell_offload_processing for a SQL command SELECT /*+ OPT_PARAM(‘cell_offload_processing’ ‘true’) */ COUNT(*) FROM EMPLOYEES;

And a quick example:

SYS@LABRAT1> select /* test avg filter hint */
  2  /*+ OPT_PARAM('cell_offload_processing' 'false') */
  3  avg(col1) from kso.skew3 where col1 between 2 and 10000;
 
 AVG(COL1)
----------
 
 
1 row selected.
 
SYS@LABRAT1> @fsx
Enter value for sql_text: % test avg filter %
Enter value for sql_id: 
Enter value for inst_id: 
 
 INST SQL_ID         CHILD  PLAN_HASH      EXECS     AVG_ETIME      AVG_LIO    AVG_PIO AVG_PX OFFLOADABLE IO_SAVED_% SQL_TEXT
----- ------------- ------ ---------- ---------- ------------- ------------ ---------- ------ ----------- ---------- ----------------------------------------
    1 0nb4tjmjuq67t      0 2684249835          1        104.89    1,956,227  1,956,219      0 No                 .00 select /* test avg filter hint */ /*+ OP
    1 abwyftyy395gk      0 2684249835          1         13.96    1,956,226  1,956,219      0 Yes              99.99 select /* test avg filter */ avg(col1) f
 
2 rows selected.

So as you can see, the hint disabled offloading and resulted in the statement taking about 10X the elapsed time of the offloaded version.

Now for one more example to whet your appetite for things to come:

SYS@LABRAT1> select count(*) from kso.skew3 where col1 is null;
 
  COUNT(*)
----------
        12
 
1 row selected.
 
Elapsed: 00:00:00.08
SYS@LABRAT1> @fsx
Enter value for sql_text: %col1 is null%
Enter value for sql_id: 
Enter value for inst_id: 
 
 INST SQL_ID         CHILD  PLAN_HASH      EXECS     AVG_ETIME      AVG_LIO    AVG_PIO AVG_PX OFFLOADABLE IO_SAVED_% SQL_TEXT
----- ------------- ------ ---------- ---------- ------------- ------------ ---------- ------ ----------- ---------- ----------------------------------------
    1 0u1q4b7puqz6g      0 2684249835          1           .08    1,956,226  1,956,219      0 Yes             100.00 select count(*) from kso.skew3 where col
 
1 row selected.
 
Elapsed: 00:00:00.09
SYS@LABRAT1> @dplan
Enter value for sql_id: 0u1q4b7puqz6g                               
Enter value for child_no: 
 
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  0u1q4b7puqz6g, child number 0
-------------------------------------
select count(*) from kso.skew3 where col1 is null
 
Plan hash value: 2684249835
 
------------------------------------------------------------------------------------
| Id  | Operation                  | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |       |       |       |   533K(100)|          |
|   1 |  SORT AGGREGATE            |       |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL| SKEW3 |    12 |    60 |   533K  (1)| 01:46:38 |
------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - storage("COL1" IS NULL)
       filter("COL1" IS NULL)
 
 
20 rows selected.
 
Elapsed: 00:00:00.02

Yes that’s a full scan on a 384M row table that occupies about 2M blocks. And it took 8/100’s of a second. That’s 80 milliseconds – roughly the amount of time necessary to do 10-15 single block disk i/o’s.

Wow! Can you say storage index? More on that in the next installment!

Leave a Reply