Oracle Exadata V2 – Flash Cache

One of the things I didn’t really talk about in my first post on Exadata was the flash cache component of the storage servers. It is a key component of the “OLTP” claims that Oracle is making for the platform. So let’s talk about the hardware first. Each storage server has 4 of the Sun Flash Accelerator F20 PCIe cards. These cards hold 96G each for a total of 384G per storage server. That’s well over a terabyte of flash on the smallest quarter rack configuration.

Note that they are installed only in the storage servers, not in the database servers. The cards are usually configured exclusively as Flash Cache, but a portion can optionally be carved out as a “flash disk”.

Oracle has a White Paper here:

Exadata Smart Flash Cache and the Sun Oracle Database Machine

This white paper was published in late 2009 and it is specific to V2. It has some good information and is well worth reading. One of the comments I found interesting was the discussion of carving a piece of the Flash Cache out as a “disk”. Here’s the quote:

These high-performance logical flash disks can be used to store frequently accessed data. To use them requires advance planning to ensure adequate space is reserved for the tablespaces stored on them. In addition, backup of the data on the flash disks must be done in case media recovery is required, just as it would be done for data stored on conventional disks. This option is primarily useful for highly write intensive workloads where the disk write rate is higher than the disks can keep up with.

Do not confuse the use of these cards in the storage server with the new 11gR2 feature “Database Flash Cache”. That feature allows an extended SGA (level 2) cache to be created on a database server (if you are using Solaris or Oracle Enterprise Linux) and has nothing to do with the Exadata Smart Flash Cache, which resides on the Exadata storage servers. Think of the Database Flash Cache as an extended SGA and the Exadata Smart Flash Cache as a large “smart” disk cache. I say smart because it implements some of the same cache management features as the SGA.
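By way of contrast, the database-tier Database Flash Cache is configured with two initialization parameters. Here’s a minimal sketch; the flash device path and size below are made-up values for illustration, not recommendations:

```sql
-- Database Flash Cache (11gR2, Solaris or Oracle Enterprise Linux only).
-- The device path and size are hypothetical.
alter system set db_flash_cache_file = '/dev/fioa1' scope=spfile;
alter system set db_flash_cache_size = 64G scope=spfile;
-- After a restart, blocks aged out of the SGA can be kept in this level 2 cache.
```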

Kevin Closson has a couple of good posts outlining the differences between Database Flash Cache and Exadata Smart Flash Cache here:

Pardon Me, Where Is That Flash Cache? Part I.
Pardon Me, Where Is That Flash Cache? Part II.

Note also that Exadata Smart Flash Cache does not affect writes (i.e. it is not a write cache).

So how do we see what’s going on with the Exadata Smart Flash Cache? Well, there are a couple of ways.

  1. We can use the cellcli utility on the storage servers themselves.
  2. We can look in v$sesstat (one of the best ways to do that is with Tanel Poder’s snapper script by the way).

Here’s a little output from the system showing method 1 (cellcli):

[root@dm01cel01 ~]# cellcli
CellCLI: Release 11.2.1.2.3 - Production on Fri Apr 30 16:09:29 CDT 2010
 
Copyright (c) 2007, 2009, Oracle.  All rights reserved.
Cell Efficiency Ratio: 38M
 
CellCLI> LIST METRICCURRENT WHERE objectType = 'FLASHCACHE'
         FC_BYKEEP_OVERWR                FLASHCACHE      0.0 MB
         FC_BYKEEP_OVERWR_SEC            FLASHCACHE      0.0 MB/sec
         FC_BYKEEP_USED                  FLASHCACHE      300.6 MB
         FC_BY_USED                      FLASHCACHE      135,533.7 MB
         FC_IO_BYKEEP_R                  FLASHCACHE      10,399.4 MB
         FC_IO_BYKEEP_R_SEC              FLASHCACHE      0.0 MB/sec
         FC_IO_BYKEEP_W                  FLASHCACHE      6,378.3 MB
         FC_IO_BYKEEP_W_SEC              FLASHCACHE      0.0 MB/sec
         FC_IO_BY_R                      FLASHCACHE      480,628.3 MB
         FC_IO_BY_R_MISS                 FLASHCACHE      55,142.4 MB
         FC_IO_BY_R_MISS_SEC             FLASHCACHE      0.0 MB/sec
         FC_IO_BY_R_SEC                  FLASHCACHE      0.1 MB/sec
         FC_IO_BY_R_SKIP                 FLASHCACHE      1,448,220.2 MB
         FC_IO_BY_R_SKIP_SEC             FLASHCACHE      12.8 MB/sec
         FC_IO_BY_W                      FLASHCACHE      178,761.9 MB
         FC_IO_BY_W_SEC                  FLASHCACHE      0.1 MB/sec
         FC_IO_ERRS                      FLASHCACHE      0
         FC_IO_RQKEEP_R                  FLASHCACHE      1051647 IO requests
         FC_IO_RQKEEP_R_MISS             FLASHCACHE      291829 IO requests
         FC_IO_RQKEEP_R_MISS_SEC         FLASHCACHE      0.0 IO/sec
         FC_IO_RQKEEP_R_SEC              FLASHCACHE      0.0 IO/sec
         FC_IO_RQKEEP_R_SKIP             FLASHCACHE      0 IO requests
         FC_IO_RQKEEP_R_SKIP_SEC         FLASHCACHE      0.0 IO/sec
         FC_IO_RQKEEP_W                  FLASHCACHE      176405 IO requests
         FC_IO_RQKEEP_W_SEC              FLASHCACHE      0.0 IO/sec
         FC_IO_RQ_R                      FLASHCACHE      21095663 IO requests
         FC_IO_RQ_R_MISS                 FLASHCACHE      1574404 IO requests
         FC_IO_RQ_R_MISS_SEC             FLASHCACHE      0.6 IO/sec
         FC_IO_RQ_R_SEC                  FLASHCACHE      1.6 IO/sec
         FC_IO_RQ_R_SKIP                 FLASHCACHE      4879720 IO requests
         FC_IO_RQ_R_SKIP_SEC             FLASHCACHE      26.8 IO/sec
         FC_IO_RQ_W                      FLASHCACHE      5665344 IO requests
         FC_IO_RQ_W_SEC                  FLASHCACHE      2.9 IO/sec


The stats I found most interesting are:

FC_IO_RQ_R – The number of read I/O requests satisfied from Flash Cache.
FC_IO_RQ_R_MISS – The number of read I/O requests which did not find all data in Flash Cache.
FC_IO_RQ_R_SKIP – The number of read I/O requests with a hint to bypass Flash Cache.
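Incidentally, if you only care about those three metrics, cellcli can list them by name rather than dumping the whole FLASHCACHE object type. A small sketch (run on the storage cell, as above):

```
CellCLI> LIST METRICCURRENT FC_IO_RQ_R, FC_IO_RQ_R_MISS, FC_IO_RQ_R_SKIP ATTRIBUTES name, metricValue
```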

A quick “hit ratio” calculation can be done like so:

hit ratio = FC_IO_RQ_R/(FC_IO_RQ_R+FC_IO_RQ_R_MISS)
or
hit ratio ignoring skips = (FC_IO_RQ_R+FC_IO_RQ_R_SKIP)/(FC_IO_RQ_R+FC_IO_RQ_R_MISS+FC_IO_RQ_R_SKIP)

So for the system we were just looking at:

SQL> select (21095663)/(21095663+1574404) hit_ratio from dual;
 
 HIT_RATIO
----------
.930551418
 
SQL> select (4879720+21095663)/(21095663+1574404+4879720) hit_ratio_ignoring_skips from dual;
 
HIT_RATIO_IGNORING_SKIPS
------------------------
              .942852408

Looks like 93-94%, depending on how you calculate it. Keep in mind that this is for one storage server and that the values are cumulative since the storage server started (perhaps a future release of AWR will start pulling this information with snapshots).

Now here’s a little output from the system showing method 2 (v$sesstat/v$mystat). Note that for this demonstration I had a query that retrieved a single row from a very large table via an index (I left the full text of the statement out of the demo, though):

 
SQL>  !cat mystats.sql
col name for a70
col value for 99999999999999
select name, value
from v$mystat s, v$statname n
where n.statistic# = s.statistic#
and name like nvl('&name',name)
/
 
SQL> @mystats -- my starting value for this session (obviously I had been playing around in this session already)
Enter value for name: cell flash%
 
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell flash cache read hits                                               404
 
SQL> select a single row by index ... (first time this row has been selected)
 
SQL> @mystats
Enter value for name: cell flash%
 
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell flash cache read hits                                               404
 
SQL> -- no change in stats, because the blocks had not been read into the flash cache yet
SQL>
SQL> select a single row by index ... (same row as before)
 
SQL> @mystats 
Enter value for name: cell flash%
 
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell flash cache read hits                                               404
 
SQL> -- no change, because the block is now in the database buffer cache (SGA) 
SQL> -- so no need to access storage at all (i.e. no physical read was necessary)
SQL> -- let's make it revisit the storage cell
SQL> 
SQL> alter system flush buffer_cache;
 
System altered.
 
SQL> select a single row by index ... (same row as before)
 
SQL> @mystats
Enter value for name: cell flash%
 
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell flash cache read hits                                               408
 
SQL> -- this time the stat was incremented

So that’s cool. We at least have some visibility into what’s happening. Note that I did trace these statements as well to verify that it was doing what I thought, but I will not bore you with that output as this post is already long enough.
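Incidentally, the same statistic can also be checked instance-wide (cumulative since instance startup) rather than per session; a minimal sketch:

```sql
-- Instance-wide counterpart of the v$mystat query above
select name, value
from   v$sysstat
where  name like 'cell flash cache read hits';
```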

So what about the performance? I think we all expect that accessing a block in the flash cache should be considerably faster than reading it off of disk. And of course that is the case. First, a bit of output from an AWR report showing single block read times. Note that there are a couple of new wait events (“cell single block physical read” and “cell multiblock physical read”). These take the place of “db file sequential read” and “db file scattered read” respectively when using Exadata storage.

Event                      Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
cell multiblock physical r 22.6K  61.3  12.1   3.2   8.0   6.6   6.7   2.2
cell single block physical 930.1  94.7   3.7    .3    .4    .6    .1    .0

These are pretty typical values for these two events on the system we have been working with. In this particular time period about 95% of the single block reads were returned in under 1ms (the ones coming from the flash cache presumably). Let’s take a closer look via a 10046 trace.

=====================
PARSING IN CURSOR #9 len=78 dep=0 uid=0 oct=3 lid=0 tim=1272663655301341 hv=1577793954 ad='442beca50' sqlid='9st4pvjg0qfd2'
select * from TABLE_XXX where address_id = 602275
END OF STMT
PARSE #9:c=2000,e=1562,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=2222833759,tim=1272663655301340
EXEC #9:c=0,e=35,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2222833759,tim=1272663655301412
WAIT #9: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655301447
WAIT #9: nam='ges message buffer allocation' ela= 2 pool=0 request=1 allocated=0 obj#=36406 tim=1272663655301544
WAIT #9: nam='gc cr grant 2-way' ela= 276 p1=6 p2=42702323 p3=1 obj#=37362 tim=1272663655301875
WAIT #9: nam='cell single block physical read' ela= 738 cellhash#=2520626383 diskhash#=3243237505 bytes=8192 obj#=37362 tim=1272663655302720
WAIT #9: nam='cell single block physical read' ela= 566 cellhash#=2520626383 diskhash#=3754429853 bytes=8192 obj#=37362 tim=1272663655303388
WAIT #9: nam='cell single block physical read' ela= 652 cellhash#=88802347 diskhash#=26778600 bytes=8192 obj#=37362 tim=1272663655304130
WAIT #9: nam='ges message buffer allocation' ela= 1 pool=0 request=1 allocated=0 obj#=37362 tim=1272663655304194
WAIT #9: nam='gc cr grant 2-way' ela= 116 p1=6 p2=52163917 p3=1 obj#=36406 tim=1272663655304336
WAIT #9: nam='cell single block physical read' ela= 646 cellhash#=398250101 diskhash#=4236948042 bytes=8192 obj#=36406 tim=1272663655305043
FETCH #9:c=0,e=3598,p=4,cr=4,cu=0,mis=0,r=1,dep=0,og=1,plh=2222833759,tim=1272663655305083
WAIT #9: nam='SQL*Net message from client' ela= 654 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655305768
FETCH #9:c=0,e=7,p=0,cr=1,cu=0,mis=0,r=0,dep=0,og=1,plh=2222833759,tim=1272663655305792
STAT #9 id=1 cnt=1 pid=0 pos=1 obj=36406 op='TABLE ACCESS BY INDEX ROWID TABLE_XXX (cr=5 pr=4 pw=0 time=0 us cost=4 size=275 card=1)'
STAT #9 id=2 cnt=1 pid=1 pos=1 obj=37362 op='INDEX RANGE SCAN KSO_INDEX_XXX (cr=4 pr=3 pw=0 time=0 us cost=3 size=0 card=1)'
WAIT #9: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655305844
WAIT #9: nam='SQL*Net message from client' ela= 463 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655306347

So the single block reads took roughly half a millisecond each (the ela values above range from 566 to 738 microseconds). That’s about an order of magnitude faster than a typical disk read and on par with what we’d expect (SSD devices are often in this ballpark as well, although throughput should be better with the flash cache since everything doesn’t have to be funneled through a disk controller).

Just a couple of final points. I think the flash cache is an extremely important component of the Exadata storage platform, particularly when it comes to running mixed workloads (i.e. OLTP and DW). It provides the capability of having a very large disk cache with very smart algorithms for deciding what to cache and what not to cache. (I guess that’s why they call it “Exadata Smart Flash Cache”.) It doesn’t do anything for you on the write side, but part of it could be carved out as a “Flash Disk” for systems where write speed is the major bottleneck. Putting the log files on flash may make sense for very high transaction rate systems, for example.
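As a side note on influencing those caching decisions, the segment storage clause exposes a CELL_FLASH_CACHE attribute (DEFAULT, KEEP, or NONE). A sketch, using a hypothetical table name:

```sql
-- BIG_TABLE is a hypothetical name; KEEP asks the cells to cache this
-- object more aggressively (including for full scans and smart scans).
alter table big_table storage (cell_flash_cache keep);

-- Verify the current setting:
select table_name, cell_flash_cache
from   user_tables
where  table_name = 'BIG_TABLE';
```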

So that’s it for now. Let me know what you think or if you have anything you want me to try.

17 Comments


  2. Sam Reddy says:

    Hi, thanks for sharing internals on “Exadata Smart Flash Cache”. So, Exadata’s out-of-the-box advantages are strictly limited to reads; no help for writes unless it’s the typical process of reading lots of data and then updating/inserting (writing) based on it.

    Overall improvements can be seen in the specific scenario mentioned above. No improvements for direct writes unless you carve out a “Flash Disk” for writes and log files. Please comment or confirm.

  3. osborne says:

    Sam,

    Yes – no specific enhancements for writes over Oracle on a non-Exadata platform (i.e. no fancy write cache). The flash memory in the storage servers can be carved off as a “disk” for high-write files if so desired (log files, for example).

    Kerry

    • John Black says:

      Hi Kerry,

      Is putting log files on flash “advisable”? We’re implementing 4 full Exadata machines now in an OLTP environment. We’re planning to use GoldenGate for a period of time to replicate from Exadata to other machines. We’re going to generate ~1.5TB/day of archivelogs…which is a lot for GG to mine, so I brought up putting redo logs in ASM using a portion of the flash cards to improve redo I/O for GoldenGate, but it was discouraged due to redundancy concerns. What’s your opinion?

      Thanks,
      John

  4. osborne says:

    John,

    I’m not sure what the concerns about redundancy would be. Exadata systems are generally set up with diskgroups using normal or high redundancy (2 or 3 way mirroring). The failure groups are usually set up so that each storage cell represents a separate failure group. So if you choose high redundancy you should be able to lose two storage cells without losing any data. If your co-workers don’t trust the software-supplied redundancy provided by ASM, that is another issue.

    I would definitely want to do some testing to prove to myself that putting the log files on flash disks really gave me the performance improvement I was hoping for, as the results of my (admittedly limited) experiments in that area have been somewhat mixed. The downside of pursuing this approach is that any flash you use as disk decreases the amount available for caching data. Not as big a deal if you can get most of it into the SGA, but nevertheless, it’s something to consider.

    It sounds like a very interesting system. Good luck. I’d be very interested in hearing more as your evaluation/testing proceeds.

    Kerry

  5. JoeG says:

    Kerry, we are having an issue with a table with 262 columns. From what I have read, having a table with over 255 columns can cause performance issues. Is there a way to determine if that is what is happening? I am currently tracing a session and we have lots of “cell single block physical read” wait events. It’s an INSERT statement with the following stats:

    call     count     cpu   elapsed      disk   query      current   rows
    -------  -----  ------  --------  --------  ------  -----------  -----
    Parse        0    0.00      0.00         0       0            0      0
    Execute    439  602.53   1399.50   1329033   54696    136314235  43900
    Fetch        0    0.00      0.00         0       0            0      0
    -------  -----  ------  --------  --------  ------  -----------  -----
    total      439  602.53   1399.50   1329033   54696    136314235  43900

    • osborne says:

      Hi Joe,

      I’m not sure what the problem you’re trying to solve is. Are you trying to make your insert go faster? If so, sharing a little more info, like the statement itself, would be helpful. Also, why are you working this issue (is it slower than on another platform or than other similar inserts)?

      Kerry

  6. JoeG says:

    We are on Exadata 11gR2, BP 10. The load table is truncated prior to single row inserts: insert into table1(col1…col262) values(v1…..v262);

    The table we are loading from has the same number of columns, and the process is done by Informatica with a ~10000 row commit frequency. There are 4 indexes on the target table. Also, the target table uses more blocks, per dba_segments, than the source. Both have the same storage parameters.

    Any advice on what to look at would be appreciated.

    Thanks,
    Joe

  7. JoeG says:

    I think we ran into an issue with Informatica processing 1 row at a time. An insert into table1 from table2 completes successfully in about 1 hour for 12.5 million rows. When I add parallel DML and do it in parallel, it completes in about 6 minutes. Without looking, I am making an assumption that the Informatica server slowed down due to high memory usage.

  8. JoeG says:

    I meant to say that if we do an insert into table1 select * from table2, the process runs much faster than processing 1 row at a time from the Informatica load.

  9. osborne says:

    Yes – set processing generally wins over 1 row at a time processing regardless of the platform, but with Exadata the difference can be magnified.

    Kerry

  10. Hi Kerry,

    The whitepaper has moved : http://www.oracle.com/technetwork/database/exadata/exadata-smart-flash-cache-366203.pdf

    One day Oracle will learn how to implement redirects when they move their website content around ….

    • Krishna says:

      Hi Kerry,

      Our database size is 500GB. Can we keep our entire database on flash disks? I mean by creating partitions on flash disk and keeping all our data, control, and redo log files there. Correct me if I am wrong. Is that possible? Any drawbacks in doing that?

      Thanks in advance

  11. osborne says:

    Hi Krishna,

    The answer is yes – you can carve off part of the flash as permanent storage. I would caution you to think about the decision to do this, though. The flash is most useful in most cases as “flash cache”, and it does a very good job of buffering single block reads. You can even designate tables for preferential caching with the KEEP option, which allows full scans and smart scans to use the flash cache on those objects. So my recommendation before doing that would be to do some investigation and experimenting – but it is an option.

    Kerry
