Exadata Offload – The Secret Sauce

The "secret sauce" for Exadata is its ability to offload processing to the storage tier. Offloading and Smart Scan are two terms that are often used interchangeably, but they are not quite the same thing. Offloading is the more generic term, meaning any work done at the storage tier that would otherwise have to be done on the database tier (this can include work unrelated to query execution, such as optimization of incremental backups). Smart Scan, on the other hand, is the access mechanism used to offload query processing tasks. For example, the storage servers can apply predicate filters at the storage layer instead of shipping every possible block back to the database server(s). Smart Scans can also reduce the volume of data returned via column projection (i.e. if you only select 1 column from a 100 column table, there is no need to return the other 99 columns).

Offloading is geared toward long running queries that access a large amount of data, and it only works if Oracle decides to use its direct path read mechanism. Direct path reads have traditionally been done by parallel query slaves, but they can also be done by serial queries. In fact, as of 11g, Oracle has changed the decision making process, resulting in more aggressive use of serial direct path reads. I've seen this feature described both as "serial direct path reads" and "adaptive direct path reads".

I'll digress here a bit to discuss this feature, since direct path reads are critical to Exadata Offloading. Direct path reads do not load blocks into Oracle's buffer cache. Instead, the data is returned directly to the PGA of the process requesting it. This means the data does not have to be in Oracle block format. No 8K block needs to be shipped back up from the storage layer: no block that is only partially filled, that may contain only a record or two you're interested in, and that carries every column (including ones you don't want) plus additional header information. Instead, a much more compact result set containing only the columns requested and (hopefully) only the rows you need is returned. As I said, direct path reads are traditionally used by parallel query slaves. They are also used in a few other cases, such as LOB access and sorts that spill over into TEMP. So the ability to use direct path reads is very important to the Exadata platform, hence the changes to make them more attractive in 11g. Here are a few links to info on the subject of serial direct path reads:

  1. Doug Burns has a good post on 11g serial direct path reads.
  2. Alex Fatkulin has a very good post on some of the factors controlling adaptive direct path reads.
  3. There is a note on MOS (793845.1) on changes in 11g in the heuristics to choose between direct path reads or reads through the buffer cache.
  4. You may also find MOS note (50415.1) on the misleading nature of "direct path read" wait events of interest.

Also be aware that direct path reads are only available for full scans (of tables or indexes). So any statement that uses an index range scan to get to a row in a table via a rowid will not use this mechanism. Also keep in mind that direct path reads require extra processing to ensure that all blocks on disk are current (i.e. an object level checkpoint), so frequently modified tables will suffer some overhead before direct path reads can be initiated.
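If you want to verify whether a given statement actually used direct path reads, one easy way is to look at your own session statistics before and after running it. Here's a minimal sketch; the statistic names are standard, but comparing the deltas around your query is left to you:

```sql
-- check direct path read activity for the current session
-- ('physical reads direct' counts reads that bypassed the buffer cache)
select n.name, s.value
from   v$mystat s, v$statname n
where  s.statistic# = n.statistic#
and    n.name in ('physical reads', 'physical reads direct');
```

If 'physical reads direct' accounts for essentially all of the physical reads done by your statement, it was read via the direct path mechanism.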

I must say that I think the changes to the heuristics in 11g may be a little on the aggressive side for non-Exadata platforms (the changes may well be driven by Exadata). And by the way, serial direct path reads are not always faster than the normal reads that go through the buffer cache. Dion Cho has a good post on a performance problem due to serial direct path reads kicking in on one node of an 11g RAC environment (not Exadata). The node doing the direct path reads was running the query much slower than the node using the normal buffer cache reads. He also has a post on turning off serial direct path reads.
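If you want to experiment with this behavior yourself, the posts above describe a couple of knobs. A minimal sketch, assuming the hidden parameter `_serial_direct_read` and event 10949 behave as described in those posts (both are undocumented, unsupported, and may change between versions):

```sql
-- disable serial direct path reads for the current session
-- (undocumented underscore parameter; behavior may vary by version)
alter session set "_serial_direct_read" = never;

-- or via the event Dion Cho describes in his post
alter session set events '10949 trace name context forever, level 1';

-- revert to the default adaptive behavior
alter session set "_serial_direct_read" = auto;
```

As always with underscore parameters, test on a non-production system and check with Oracle Support before relying on them.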

But enough about the direct path reads stuff, on to the Offloading. One of the first things I wanted to know when I got my first look at a system running on Exadata was whether a particular query was eligible for offloading and, if so, how much of the expected I/O was saved by the feature. So of course I wrote a little script to show me that. It turns out there is plenty of info in V$SQL to see what's going on. I called the script fsx.sql (short for Find_Sql_eXadata). Here's a little demo:

SYS@LABRAT1> @test_offload
SYS@LABRAT1> -- ran but turned off output
SYS@LABRAT1> -- select /* test full */ * from kso.skew3; 
SYS@LABRAT1> 
SYS@LABRAT1> select /* test count */ count(col1) from kso.skew3;
 
COUNT(COL1)
-----------
  384000036
 
1 row selected.
 
Elapsed: 00:00:24.04
SYS@LABRAT1> select /* test count(rll) */ count(pk_col), count(col1), count(col2), count(col3), count(col4) from kso.skew3;
 
COUNT(PK_COL) COUNT(COL1) COUNT(COL2) COUNT(COL3) COUNT(COL4)
------------- ----------- ----------- ----------- -----------
    384000048   384000036   384000036   384000036   384000036
 
1 row selected.
 
Elapsed: 00:00:39.09
SYS@LABRAT1> select /* test max */ max(col1) from kso.skew3;
 
 MAX(COL1)
----------
   1000000
 
1 row selected.
 
Elapsed: 00:00:24.51
SYS@LABRAT1> select /* test max all */ max(pk_col), max(col1), max(col2), max(col3), max(col4) from kso.skew3;
 
MAX(PK_COL)  MAX(COL1) MAX(COL2)                      MAX(COL3) M
----------- ---------- ------------------------------ --------- -
 1.0000E+12    1000000 asddsadasd                     01-JAN-09 Y
 
1 row selected.
 
Elapsed: 00:00:57.32
SYS@LABRAT1> select /* test avg */ avg(col1) from kso.skew3;
 
 AVG(COL1)
----------
495000.194
 
1 row selected.
 
Elapsed: 00:00:30.00
SYS@LABRAT1> select /* test avg filter */ avg(col1) from kso.skew3 where col1 between 2 and 10000;
 
 AVG(COL1)
----------
 
 
1 row selected.
 
Elapsed: 00:00:13.97
SYS@LABRAT1> select /* test index */ avg(pk_col) from kso.skew where col1 = 136133;
 
AVG(PK_COL)
-----------
   15636133
 
1 row selected.
 
Elapsed: 00:00:00.00
SYS@LABRAT1> set echo off
SYS@LABRAT1> set timing off
SYS@LABRAT1> @fsx
Enter value for sql_text: % test %
Enter value for sql_id: 
Enter value for inst_id: 
 
 INST SQL_ID         CHILD  PLAN_HASH      EXECS     AVG_ETIME      AVG_LIO    AVG_PIO AVG_PX OFFLOADABLE IO_SAVED_% SQL_TEXT
----- ------------- ------ ---------- ---------- ------------- ------------ ---------- ------ ----------- ---------- ----------------------------------------
    1 1767x3dw3m8pa      0 2684249835          1         57.31    1,956,226  1,956,219      0 Yes               9.35 select /* test max all */ max(pk_col), m
    1 1wp72w4c5d449      0 2684249835          1         29.98    1,956,226  1,956,219      0 Yes              71.84 select /* test avg */ avg(col1) from kso
    1 46vd0hducqms8      0 1667163222          1        422.77    2,032,691  1,956,219      0 No                 .00 select /* test full */ * from kso.skew3
    1 5ry8vd8u0n0bf      0 2684249835          1         24.03    1,956,226  1,956,219      0 Yes              71.85 select /* test count */ count(col1) from
    1 6jtf4rnxu2dq7      0 2684249835          1         24.50    1,956,226  1,956,219      0 Yes              71.85 select /* test max */ max(col1) from kso
    1 8pfxs6mdf62mq      0 3723858078          1           .00           35          0      0 No                 .00 select /* test index */ avg(pk_col) from
    1 abwyftyy395gk      0 2684249835          1         13.96    1,956,226  1,956,219      0 Yes              99.99 select /* test avg filter */ avg(col1) f
    1 fxpkhxt4jwqqm      0 2684249835          1         39.08    1,956,226  1,956,219      0 Yes               9.35 select /* test count(rll) */ count(pk_co
 
8 rows selected.

So I ran several queries. Some were off-loadable and some were not. Then I ran the script to see how the statements had been handled. There are a couple of things to note here:

  1. I used IO_CELL_OFFLOAD_ELIGIBLE_BYTES in V$SQL to indicate whether a statement can be offloaded or not.
    (i.e. decode(IO_CELL_OFFLOAD_ELIGIBLE_BYTES,0,'No','Yes') Offloadable) Note that if you run this script on a non-Exadata 11gR2 database, all the SQL statements should show up as Offloadable=No.
  2. All the statements using aggregate functions (max, count, avg, ...) are marked as off-loadable even though the functions themselves are not. Column projection (limiting the data returned to the columns requested) accounts for the reduced I/O and for the off-loadable marking. Note that the statement commented "test max all" shows almost no reduction in I/O, since it applies the max function to every column in the table. I'm not yet sure why there was a slight reduction in I/O even with all columns listed. And by the way, V$SQLFN_METADATA has a list of all valid functions and whether they are off-loadable or not (none of the aggregate functions are).
  3. It appears the statements selecting all columns with no where clause are not off-loadable (no filter predicate or column projection).
  4. Statements using index range scans are not off-loadable (the optimizer has to pick a plan with a full scan). Check out the statement commented “test index” for an example.

It is possible to turn offloading off via a parameter, CELL_OFFLOAD_PROCESSING, or via a hint. Here’s a bit of the Oracle Exadata Storage Server Software User’s Guide:

The CELL_OFFLOAD_PROCESSING initialization parameter enables SQL processing offload to Exadata Cell. When the value of the parameter is set to TRUE, predicate evaluation can be offloaded to cells. The default value of the parameter is TRUE. If the parameter is set to FALSE at the session or system level, then the database performs all the predicate evaluation with cells serving blocks. You can set CELL_OFFLOAD_PROCESSING dynamically with the SQL ALTER SYSTEM or ALTER SESSION command. For example:
SQL> ALTER SESSION SET CELL_OFFLOAD_PROCESSING = TRUE;
The CELL_OFFLOAD_PROCESSING parameter can also be set with the OPT_PARAM optimizer hint to enable or disable predicate filtering for a specific SQL command. For example:
-- to disable cell_offload_processing for a SQL command
SELECT /*+ OPT_PARAM('cell_offload_processing' 'false') */ COUNT(*) FROM EMPLOYEES;
-- to enable cell_offload_processing for a SQL command
SELECT /*+ OPT_PARAM('cell_offload_processing' 'true') */ COUNT(*) FROM EMPLOYEES;

And a quick example:

SYS@LABRAT1> select /* test avg filter hint */
  2  /*+ OPT_PARAM('cell_offload_processing' 'false') */
  3  avg(col1) from kso.skew3 where col1 between 2 and 10000;
 
 AVG(COL1)
----------
 
 
1 row selected.
 
SYS@LABRAT1> @fsx
Enter value for sql_text: % test avg filter %
Enter value for sql_id: 
Enter value for inst_id: 
 
 INST SQL_ID         CHILD  PLAN_HASH      EXECS     AVG_ETIME      AVG_LIO    AVG_PIO AVG_PX OFFLOADABLE IO_SAVED_% SQL_TEXT
----- ------------- ------ ---------- ---------- ------------- ------------ ---------- ------ ----------- ---------- ----------------------------------------
    1 0nb4tjmjuq67t      0 2684249835          1        104.89    1,956,227  1,956,219      0 No                 .00 select /* test avg filter hint */ /*+ OP
    1 abwyftyy395gk      0 2684249835          1         13.96    1,956,226  1,956,219      0 Yes              99.99 select /* test avg filter */ avg(col1) f
 
2 rows selected.

So as you can see, the hint disabled offloading and resulted in the statement taking about 10X the elapsed time of the offloaded version.

Now for one more example to whet your appetite for things to come:

SYS@LABRAT1> select count(*) from kso.skew3 where col1 is null;
 
  COUNT(*)
----------
        12
 
1 row selected.
 
Elapsed: 00:00:00.08
SYS@LABRAT1> @fsx
Enter value for sql_text: %col1 is null%
Enter value for sql_id: 
Enter value for inst_id: 
 
 INST SQL_ID         CHILD  PLAN_HASH      EXECS     AVG_ETIME      AVG_LIO    AVG_PIO AVG_PX OFFLOADABLE IO_SAVED_% SQL_TEXT
----- ------------- ------ ---------- ---------- ------------- ------------ ---------- ------ ----------- ---------- ----------------------------------------
    1 0u1q4b7puqz6g      0 2684249835          1           .08    1,956,226  1,956,219      0 Yes             100.00 select count(*) from kso.skew3 where col
 
1 row selected.
 
Elapsed: 00:00:00.09
SYS@LABRAT1> @dplan
Enter value for sql_id: 0u1q4b7puqz6g                               
Enter value for child_no: 
 
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  0u1q4b7puqz6g, child number 0
-------------------------------------
select count(*) from kso.skew3 where col1 is null
 
Plan hash value: 2684249835
 
------------------------------------------------------------------------------------
| Id  | Operation                  | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |       |       |       |   533K(100)|          |
|   1 |  SORT AGGREGATE            |       |     1 |     5 |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL| SKEW3 |    12 |    60 |   533K  (1)| 01:46:38 |
------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - storage("COL1" IS NULL)
       filter("COL1" IS NULL)
 
 
20 rows selected.
 
Elapsed: 00:00:00.02

Yes, that's a full scan on a 384M row table that occupies about 2M blocks. And it took 8/100ths of a second. That's 80 milliseconds, roughly the amount of time needed to do 10-15 single block disk I/Os.

Wow! Can you say storage index? More on that in the next installment!
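One way to confirm that a storage index (rather than plain Smart Scan filtering) is responsible for a result like this is to watch the statistic "cell physical IO bytes saved by storage index" for your own session. A quick sketch, assuming an 11.2 Exadata system where this statistic exists:

```sql
-- check how many bytes storage indexes saved for the current session
select n.name, s.value
from   v$mystat s, v$statname n
where  s.statistic# = n.statistic#
and    n.name = 'cell physical IO bytes saved by storage index';
```

If the value climbs after running the query, storage indexes were in play.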

19 Comments

  1. >Yes that's a full scan on a 384M row table that occupies about 2M blocks. And it took 8/100ths of a second. That's 80 milliseconds – roughly the amount of time necessary to do 10-15 single block disk I/Os.
    Fantastic result, very close to throughput of 200 GB/sec (considering you are using standard block size of 8K). Of course it’s not that :)

  2. Nigel Thomas says:

    > I’m not yet sure why there was a slight reduction in I/O even with all columns listed.

    Wouldn’t that be because the storage layer is passing back just the row data – so it’s excluding all the block overhead (PCTFREE, ITL, etc)?

    Regards Nigel

  3. osborne says:

    Timur,

    “Of course it’s not that :)”

    Yes, it’s using the storage index. I was frankly amazed at how fast it was though. I will save the details for a later post, but it appears that the storage index knows precisely where to look for nulls. Dealing with nulls has traditionally been a sore spot, so this optimization is pretty interesting.

    Kerry

  4. osborne says:

    Nigel,

    Of course, I’m sure you’re right about the block overhead and free space. I haven’t yet had a look at the blocks to see if there are any differences in how they are stored, but I expect that in the absence of HCC they are the same as non-Exadata.

    Kerry

  5. Eric Guyer says:

    Thanks for the update! Keep those Exadata-related posts coming :-D

    By the way, we should compare notes on guitar playing sometime…

  6. osborne says:

    You're welcome, I'll try to keep them coming. I've got a whole list of things I'd like to write about on the Exadata front, but there are never enough hours in the day to get it all done. On the music front, I never cease to be amazed at how many techies are also musicians. I think there are a number of similarities in thought processes (particularly between programming and musical improvisation). Some day I'll get around to posting on that topic as well.

    Kerry

  7. Two points if I may…

    > Statements using index access are not off-loadable (the optimizer has to pick a plan with a full scan). Check out the statement commented “test index” for an example.

    …Index Fast Full Scans is also serviced by Smart Scan. Smart Scan has to be access method full and direct (serial direct or PQ direct all the same).

    >Yes, it’s using the storage index. I was frankly amazed at how fast it was though

    …That’s because there is nothing faster than not doing something. Unlike on-disk Indexes, which are optimized to decrease the time it takes to *find* something, Storage Indexes are there only to quickly determine whether or not anything needs to be done. To that end it is like comparing reduced I/O to eliminated I/O.

    …sounds like your Exadata testing is moving right along! Good to hear.

  8. osborne says:

    Thanks for stopping by Kevin.

    1. I guess I should have been more specific on the “full scans” comment. I meant full scans of indexes and tables. That’s why I didn’t say “full table scans”. ;) I edited the post to say “index range scans” instead of “index access” as well.

    2. The storage indexes are awesome!

    By the way, I've been working on a 32 CPU Solaris box with SSD for the past week and I can't believe how slow it feels.

    Kerry

  9. >32 cpu Solaris

    I presume SPARC? No, 32 SPARC anything compared to a Database Machine (112 Nehalem EP cores) is not fair :-)

  10. osborne says:

    Yes SPARC. We’ve loaded up about a 1T database and run a batch process that takes 4-5 hours on the Sun in about 45min on a quarter rack. That’s without doing anything other than loading the data and running it (across both nodes of course). But in general, I was just talking about how it feels. Everything on the Sun feels like it runs at a glacial speed vs. what we’ve gotten used to on the Exadata.

  11. [...] getting a reasonable amount of attention since 11g was released. Doug Burns, Dion Cho (and here), Kerry Osborne and Miladin Modrakovic have all written on the subject. Tanel Põder also covered it in his [...]

  12. joshua says:

    Hi Kerry ,
    Do you have any idea why IO_SAVED_% can be negative even when I am not using compression

    SQL_ID        CHILD PLAN_HASH  EXECS AVG_ETIME AVG_PX OFFLOAD IO_SAVED_% SQL_
    ------------- ----- ---------- ----- --------- ------ ------- ---------- ----
    2uka1yym51jfg     1 3599640982     1      4.06     13 Yes         -55.58 inse

  13. osborne says:

    Joshua – When the IO_SAVED_% column is negative it means that more data was transferred than exists in the object(s) being scanned. There are numerous reasons this can happen. Sorts that spill over into TEMP, for example, often do a lot of I/O, and if the sum is greater than IO_CELL_OFFLOAD_ELIGIBLE_BYTES then the calculation goes negative. Hope that helps.

    Kerry

  14. HarshadMark says:

    I'm running the query on Exadata V2 (Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 – 64bit Production).
    The table is partitioned and the targeted partition contains about 5 million rows.

    Partition : Range by date
    Subpartition : list
    col1 : number data type and it is NOT part of the partition/subpartition scheme.
    Query : select col1 from t1 partition (p_20110101) where col1 < 0 ;

    I see Oracle returning rows where col1 is null.

    When I run the same query after changing the setting:
    alter session set cell_offload_processing=TRUE;

    I see no output.

    I could not understand why Exadata treats nulls differently than a regular Oracle database does.
    I thought the storage index might be the cause of this behaviour. I tried changing "_kcfis_storageidx_disabled"
    but I still get the same output, i.e. Oracle returns nulls for the condition col1 < 0.

    Have you seen this before ?
    Please let me know if you need more info.

    Thanks
    Harshad

    • HarshadMark says:

      Sorry..
      When I change the setting to
      alter session set cell_offload_processing=FALSE;

      Oracle returns no rows ( expected output ).
      The current setting is cell_offload_processing=TRUE, and when query runs under this setting it returns nulls for col1<0.

  15. osborne says:

    11.2.0.1 was the first release on the V2s and it had a few wrong results bugs. Many were related to storage indexes. You didn't mention what cellsrv version you were running, but there were also some issues caused by patching cellsrv and leaving the db unpatched.

    You said you had tried turning off storage indexes. Just verifying you used this syntax.

    alter session set "_kcfis_storageidx_disabled"=true;

    You can also check the stat "cell physical IO bytes saved by storage index" in v$mystat or v$sesstat to make sure storage indexes have not been used. I published a little script called mystats.sql that you can use for that. You can find more info about it here: Oracle Exadata – Storage Indexes

    At any rate, you should definitely consider upgrading, as 11.2.0.1 is several years old and there have been many bug fixes and enhancements since then.

  16. HarshadMark says:

    Thanks a lot for your reply.
    The issue is resolved with the cellsrv software upgrade. The versions are :
    Old : 11.2.2.1.1
    New : 11.2.2.4.2

    I'd even tried alter session set "_kcfis_storageidx_disabled"=true; but the null issue still remained on the old version.
    Now with the new cell sw upgrade the results are as expected.

    Thanks
    Harshad

  17. osborne says:

    Cool. Thanks for the follow up.
