Oracle Exadata V2 - Flash Cache

One of the things I didn’t really talk about in my first post on Exadata was the flash cache component of the storage servers. They are a key component of the “OLTP” claims that Oracle is making for the platform. So let’s talk about the hardware first. The storage servers have 4 of the Sun Flash Accelerator F20 PCIe cards. These cards hold 96G each for a total of 384G on each storage server. That’s well over a terabyte on the smallest quarter rack configuration. Here’s what they look like:

Note that they are only installed in the storage servers and not in the database servers. The cards are usually configured exclusively as Flash Cache, but can optionally have a portion defined as a “ram disk”.

Oracle has a White Paper here:

Exadata Smart Flash Cache and the Sun Oracle Database Machine

This white paper was published in late 2009 and it is specific to V2. It has some good information and is well worth reading. One of the comments I found interesting was the discussion of carving a piece of the Flash Cache out as a “disk”. Here’s the quote:

These high-performance logical flash disks can be used to store frequently accessed data. To use them requires advance planning to ensure adequate space is reserved for the tablespaces stored on them. In addition, backup of the data on the flash disks must be done in case media recovery is required, just as it would be done for data stored on conventional disks. This option is primarily useful for highly write intensive workloads where the disk write rate is higher than the disks can keep up with.

Do not confuse the use of these cards in the storage server with the new 11gR2 feature “Database Flash Cache”. That feature allows an extended SGA (level 2) cache to be created on a database server (if you are using Solaris or Oracle Enterprise Linux) and has nothing to do with the Exadata Smart Flash Cache which resides on the Exadata storage servers. Think of the Database Flash Cache as an extended SGA and the Exadata Smart Flash Cache as large “smart” disk cache. I say smart because it implements some of the same type of Oracle cache management features as the SGA.

Kevin Closson has a couple of good posts outlining the differences between Database Flash Cache and Exadata Smart Flash Cache here:

Pardon Me, Where Is That Flash Cache? Part I.
Pardon Me, Where Is That Flash Cache? Part II.

Note also that Exadata Smart Flash Cache does not affect writes (i.e. it is not a write cache).

So how do we see what’s going on with the Exadata Flash Cache? Well there are a couple of ways.

  1. We can use the cellcli utility on the storage servers themselves.
  2. We can look in v$sesstat (one of the best ways to do that is with Tanel Poder’s snapper script by the way).

Here’s a little output from the system showing method 1 (cellcli):

[root@dm01cel01 ~]# cellcli
CellCLI: Release 11.2.1.2.3 - Production on Fri Apr 30 16:09:29 CDT 2010
 
Copyright (c) 2007, 2009, Oracle.  All rights reserved.
Cell Efficiency Ratio: 38M
 
CellCLI> LIST METRICCURRENT WHERE objectType = 'FLASHCACHE'
         FC_BYKEEP_OVERWR                FLASHCACHE      0.0 MB
         FC_BYKEEP_OVERWR_SEC            FLASHCACHE      0.0 MB/sec
         FC_BYKEEP_USED                  FLASHCACHE      300.6 MB
         FC_BY_USED                      FLASHCACHE      135,533.7 MB
         FC_IO_BYKEEP_R                  FLASHCACHE      10,399.4 MB
         FC_IO_BYKEEP_R_SEC              FLASHCACHE      0.0 MB/sec
         FC_IO_BYKEEP_W                  FLASHCACHE      6,378.3 MB
         FC_IO_BYKEEP_W_SEC              FLASHCACHE      0.0 MB/sec
         FC_IO_BY_R                      FLASHCACHE      480,628.3 MB
         FC_IO_BY_R_MISS                 FLASHCACHE      55,142.4 MB
         FC_IO_BY_R_MISS_SEC             FLASHCACHE      0.0 MB/sec
         FC_IO_BY_R_SEC                  FLASHCACHE      0.1 MB/sec
         FC_IO_BY_R_SKIP                 FLASHCACHE      1,448,220.2 MB
         FC_IO_BY_R_SKIP_SEC             FLASHCACHE      12.8 MB/sec
         FC_IO_BY_W                      FLASHCACHE      178,761.9 MB
         FC_IO_BY_W_SEC                  FLASHCACHE      0.1 MB/sec
         FC_IO_ERRS                      FLASHCACHE      0
         FC_IO_RQKEEP_R                  FLASHCACHE      1051647 IO requests
         FC_IO_RQKEEP_R_MISS             FLASHCACHE      291829 IO requests
         FC_IO_RQKEEP_R_MISS_SEC         FLASHCACHE      0.0 IO/sec
         FC_IO_RQKEEP_R_SEC              FLASHCACHE      0.0 IO/sec
         FC_IO_RQKEEP_R_SKIP             FLASHCACHE      0 IO requests
         FC_IO_RQKEEP_R_SKIP_SEC         FLASHCACHE      0.0 IO/sec
         FC_IO_RQKEEP_W                  FLASHCACHE      176405 IO requests
         FC_IO_RQKEEP_W_SEC              FLASHCACHE      0.0 IO/sec
         FC_IO_RQ_R                      FLASHCACHE      21095663 IO requests
         FC_IO_RQ_R_MISS                 FLASHCACHE      1574404 IO requests
         FC_IO_RQ_R_MISS_SEC             FLASHCACHE      0.6 IO/sec
         FC_IO_RQ_R_SEC                  FLASHCACHE      1.6 IO/sec
         FC_IO_RQ_R_SKIP                 FLASHCACHE      4879720 IO requests
         FC_IO_RQ_R_SKIP_SEC             FLASHCACHE      26.8 IO/sec
         FC_IO_RQ_W                      FLASHCACHE      5665344 IO requests
         FC_IO_RQ_W_SEC                  FLASHCACHE      2.9 IO/sec


The stats I found most interesting are:

FC_IO_RQ_R - The number of read I/O requests satisfied from Flash Cache.
FC_IO_RQ_R_MISS - The number of read I/O requests which did not find all data in Flash Cache.
FC_IO_RQ_R_SKIP - The number of read I/O requests with a hint to bypass Flash Cache.

A quick “hit ratio” calculation can be done like so:

hit ratio = FC_IO_RQ_R/(FC_IO_RQ_R+FC_IO_RQ_R_MISS)
or
hit ratio ignoring skips = (FC_IO_RQ_R+FC_IO_RQ_R_SKIP)/(FC_IO_RQ_R+FC_IO_RQ_R_MISS+FC_IO_RQ_R_SKIP)

So for the system we were just looking at:

SQL> select (21095663)/(21095663+1574404) hit_ratio from dual;
 
 HIT_RATIO
----------
.930551418
 
SQL> select (4879720+21095663)/(21095663+1574404+4879720) hit_ratio_ignoring_skips from dual;
 
HIT_RATIO_IGNORING_SKIPS
------------------------
              .942852408

Looks like 93-94% depending on how you calculate it. Keep in mind this is for one storage server and it is cumulative since the storage server started. (maybe AWR will start pulling that info with snapshots)

Now here’s a little output from the system showing method 2 (v$sessstat/v$mystat). Note that for this demonstration I had a query that retrieved a single row from a very large table via an index (I left the full text of the statement out of the demo though):

 
SQL>  !cat mystats.sql
col name for a70
col value for 99999999999999
select name, value
from v$mystat s, v$statname n
where n.statistic# = s.statistic#
and name like nvl('&name',name)
/
 
SQL> @mystats -- my starting value for this session (obviously I had been playing around in this session already)
Enter value for name: cell flash%
 
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell flash cache read hits                                               404
 
SQL> select a single row by index ... (first time this row has been selected)
 
SQL> @mystats
Enter value for name: cell flash%
 
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell flash cache read hits                                               404
 
SQL> -- no change in stats, because the blocks had not been read into the flash cache yet
SQL>
SQL> select a single row by index ... (same row as before)
 
SQL> @mystats 
Enter value for name: cell flash%
 
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell flash cache read hits                                               404
 
SQL> -- no change, because the block is now in the database buffer cache (SGA) 
SQL> -- so no need to access storage at all (i.e. no physical read was necessary)
SQL> -- let's make it revisit the storage cell
SQL> 
SQL> alter system flush buffer_cache;
 
System altered.
 
SQL> select a single row by index ... (same row as before)
 
SQL> @mystats
Enter value for name: cell flash%
 
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell flash cache read hits                                               408
 
SQL> -- this time the stat was incremented

So that’s cool. We at least have some visibility into what’s happening. Note that I did trace these statements as well to verify that it was doing what I thought, but I will not bore you with that output as this post is already long enough.

So what about the performance? I think we all expect that accessing a block in the flash cache should be considerably faster than reading it off of disk. And of course that is the case. First a bit of output from an AWR report showing single block read times. Note that the there are a couple of new wait events (”cell single block physical read” and “cell multiblock physical read”). These take the place of “db file sequential read” and “db file scattered read” respectively when using Exadata storage.

Event                      Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
cell multiblock physical r 22.6K  61.3  12.1   3.2   8.0   6.6   6.7   2.2
cell single block physical 930.1  94.7   3.7    .3    .4    .6    .1    .0

These are pretty typical values for these two events on the system we have been working with. In this particular time period about 95% of the single block reads were returned in under 1ms (the ones coming from the flash cache presumably). Let’s take a closer look via a 10046 trace.

=====================
PARSING IN CURSOR #9 len=78 dep=0 uid=0 oct=3 lid=0 tim=1272663655301341 hv=1577793954 ad='442beca50' sqlid='9st4pvjg0qfd2'
select * from TABLE_XXX where address_id = 602275
END OF STMT
PARSE #9:c=2000,e=1562,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=2222833759,tim=1272663655301340
EXEC #9:c=0,e=35,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2222833759,tim=1272663655301412
WAIT #9: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655301447
WAIT #9: nam='ges message buffer allocation' ela= 2 pool=0 request=1 allocated=0 obj#=36406 tim=1272663655301544
WAIT #9: nam='gc cr grant 2-way' ela= 276 p1=6 p2=42702323 p3=1 obj#=37362 tim=1272663655301875
WAIT #9: nam='cell single block physical read' ela= 738 cellhash#=2520626383 diskhash#=3243237505 bytes=8192 obj#=37362 tim=1272663655302720
WAIT #9: nam='cell single block physical read' ela= 566 cellhash#=2520626383 diskhash#=3754429853 bytes=8192 obj#=37362 tim=1272663655303388
WAIT #9: nam='cell single block physical read' ela= 652 cellhash#=88802347 diskhash#=26778600 bytes=8192 obj#=37362 tim=1272663655304130
WAIT #9: nam='ges message buffer allocation' ela= 1 pool=0 request=1 allocated=0 obj#=37362 tim=1272663655304194
WAIT #9: nam='gc cr grant 2-way' ela= 116 p1=6 p2=52163917 p3=1 obj#=36406 tim=1272663655304336
WAIT #9: nam='cell single block physical read' ela= 646 cellhash#=398250101 diskhash#=4236948042 bytes=8192 obj#=36406 tim=1272663655305043
FETCH #9:c=0,e=3598,p=4,cr=4,cu=0,mis=0,r=1,dep=0,og=1,plh=2222833759,tim=1272663655305083
WAIT #9: nam='SQL*Net message from client' ela= 654 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655305768
FETCH #9:c=0,e=7,p=0,cr=1,cu=0,mis=0,r=0,dep=0,og=1,plh=2222833759,tim=1272663655305792
STAT #9 id=1 cnt=1 pid=0 pos=1 obj=36406 op='TABLE ACCESS BY INDEX ROWID TABLE_XXX (cr=5 pr=4 pw=0 time=0 us cost=4 size=275 card=1)'
STAT #9 id=2 cnt=1 pid=1 pos=1 obj=37362 op='INDEX RANGE SCAN KSO_INDEX_XXX (cr=4 pr=3 pw=0 time=0 us cost=3 size=0 card=1)'
WAIT #9: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655305844
WAIT #9: nam='SQL*Net message from client' ela= 463 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655306347

So the single block reads took roughly half of a millisecond (ela=646). That’s about an order of magnitude faster than a typical disk read and on par with what we’d expect (SSD devices are often in this ballpark as well, although throughput should be better with flash cache due to not having to run everything through a disk controller).

Just a couple of final points. I think the flash cache is an extremely important component in the Exadata Storage platform, particularly when it comes to running mixed work loads (i.e. OLTP and DW). It provides the capability of having a very large disk cache with very smart algorithms for deciding what to cache and what not to cache. (I guess that’s why they call it “Exadata Smart Flash Cache”) It doesn’t do anything for you on the write side, but part of it could be carved out as a “Flash Disk” for systems where write speed is the major bottle neck. Putting the log files on flash may make sense for very high transaction systems for example.

So that’s it for now. Let me know what you think or if you have anything you want me to try.

One Comment

  1. Twitter Trackbacks for » Exadata - Take 2 - Flash Cache - Kerry Osborne’s Oracle Blog [oracle-guy.com] on Topsy.com:

    [...] » Exadata - Take 2 - Flash Cache - Kerry Osborne’s Oracle Blog kerryosborne.oracle-guy.com/2010/05/exadata-take-2-flash-cache – view page – cached One of the things I didn’t really talk about in my first post on Exadata was the flash cache component of the storage servers. They are a key component of the “OLTP” claims that Oracle is making for the platform. So let’s talk about the hardware first. The storage servers have 4 of the Sun Flash Accelerator F20 PCIe cards. These cards hold 96G each for a total of 384G on each storage… Read moreOne of the things I didn’t really talk about in my first post on Exadata was the flash cache component of the storage servers. They are a key component of the “OLTP” claims that Oracle is making for the platform. So let’s talk about the hardware first. The storage servers have 4 of the Sun Flash Accelerator F20 PCIe cards. These cards hold 96G each for a total of 384G on each storage server. That’s well over a terabyte on the smallest quarter rack configuration. Here’s what they look like: View page Tweets about this link Topsy.Data.Twitter.User['oracleusersph'] = {”photo”:”http://a1.twimg.com/profile_images/398705654/oracle-240-small_normal.png”,”url”:”http://twitter.com/oracleusersph”,”nick”:”oracleusersph”}; oracleusersph: “Oracle Exadata Flash Cache http://goo.gl/2ry7 http://bit.ly/acNFxS ” 1 day ago view tweet retweet Filter tweets [...]

Leave a comment