One of the things I didn’t really talk about in my first post on Exadata was the flash cache component of the storage servers. They are a key component of the “OLTP” claims that Oracle is making for the platform. So let’s talk about the hardware first. The storage servers have 4 of the Sun Flash Accelerator F20 PCIe cards. These cards hold 96G each for a total of 384G on each storage server. That’s well over a terabyte on the smallest quarter rack configuration. Here’s what they look like:
Note that they are only installed in the storage servers and not in the database servers. The cards are usually configured exclusively as Flash Cache, but can optionally have a portion defined as a “ram disk”.
Oracle has a White Paper here:
This white paper was published in late 2009 and it is specific to V2. It has some good information and is well worth reading. One of the comments I found interesting was the discussion of carving a piece of the Flash Cache out as a “disk”. Here’s the quote:
These high-performance logical flash disks can be used to store frequently accessed data. To use them requires advance planning to ensure adequate space is reserved for the tablespaces stored on them. In addition, backup of the data on the flash disks must be done in case media recovery is required, just as it would be done for data stored on conventional disks. This option is primarily useful for highly write intensive workloads where the disk write rate is higher than the disks can keep up with.
Do not confuse the use of these cards in the storage server with the new 11gR2 feature “Database Flash Cache”. That feature allows an extended SGA (level 2) cache to be created on a database server (if you are using Solaris or Oracle Enterprise Linux) and has nothing to do with the Exadata Smart Flash Cache which resides on the Exadata storage servers. Think of the Database Flash Cache as an extended SGA and the Exadata Smart Flash Cache as large “smart” disk cache. I say smart because it implements some of the same type of Oracle cache management features as the SGA.
Kevin Closson has a couple of good posts outlining the differences between Database Flash Cache and Exadata Smart Flash Cache here:
Note also that Exadata Smart Flash Cache does not affect writes (i.e. it is not a write cache).
So how do we see what’s going on with the Exadata Flash Cache? Well there are a couple of ways.
- We can use the cellcli utility on the storage servers themselves.
- We can look in v$sesstat (one of the best ways to do that is with Tanel Poder’s snapper script by the way).
Here’s a little output from the system showing method 1 (cellcli):
[root@dm01cel01 ~]# cellcli CellCLI: Release 22.214.171.124.3 - Production on Fri Apr 30 16:09:29 CDT 2010 Copyright (c) 2007, 2009, Oracle. All rights reserved. Cell Efficiency Ratio: 38M CellCLI> LIST METRICCURRENT WHERE objectType = 'FLASHCACHE' FC_BYKEEP_OVERWR FLASHCACHE 0.0 MB FC_BYKEEP_OVERWR_SEC FLASHCACHE 0.0 MB/sec FC_BYKEEP_USED FLASHCACHE 300.6 MB FC_BY_USED FLASHCACHE 135,533.7 MB FC_IO_BYKEEP_R FLASHCACHE 10,399.4 MB FC_IO_BYKEEP_R_SEC FLASHCACHE 0.0 MB/sec FC_IO_BYKEEP_W FLASHCACHE 6,378.3 MB FC_IO_BYKEEP_W_SEC FLASHCACHE 0.0 MB/sec FC_IO_BY_R FLASHCACHE 480,628.3 MB FC_IO_BY_R_MISS FLASHCACHE 55,142.4 MB FC_IO_BY_R_MISS_SEC FLASHCACHE 0.0 MB/sec FC_IO_BY_R_SEC FLASHCACHE 0.1 MB/sec FC_IO_BY_R_SKIP FLASHCACHE 1,448,220.2 MB FC_IO_BY_R_SKIP_SEC FLASHCACHE 12.8 MB/sec FC_IO_BY_W FLASHCACHE 178,761.9 MB FC_IO_BY_W_SEC FLASHCACHE 0.1 MB/sec FC_IO_ERRS FLASHCACHE 0 FC_IO_RQKEEP_R FLASHCACHE 1051647 IO requests FC_IO_RQKEEP_R_MISS FLASHCACHE 291829 IO requests FC_IO_RQKEEP_R_MISS_SEC FLASHCACHE 0.0 IO/sec FC_IO_RQKEEP_R_SEC FLASHCACHE 0.0 IO/sec FC_IO_RQKEEP_R_SKIP FLASHCACHE 0 IO requests FC_IO_RQKEEP_R_SKIP_SEC FLASHCACHE 0.0 IO/sec FC_IO_RQKEEP_W FLASHCACHE 176405 IO requests FC_IO_RQKEEP_W_SEC FLASHCACHE 0.0 IO/sec FC_IO_RQ_R FLASHCACHE 21095663 IO requests FC_IO_RQ_R_MISS FLASHCACHE 1574404 IO requests FC_IO_RQ_R_MISS_SEC FLASHCACHE 0.6 IO/sec FC_IO_RQ_R_SEC FLASHCACHE 1.6 IO/sec FC_IO_RQ_R_SKIP FLASHCACHE 4879720 IO requests FC_IO_RQ_R_SKIP_SEC FLASHCACHE 26.8 IO/sec FC_IO_RQ_W FLASHCACHE 5665344 IO requests FC_IO_RQ_W_SEC FLASHCACHE 2.9 IO/sec
The stats I found most interesting are:
FC_IO_RQ_R – The number of read I/O requests satisfied from Flash Cache.
FC_IO_RQ_R_MISS – The number of read I/O requests which did not find all data in Flash Cache.
FC_IO_RQ_R_SKIP – The number of read I/O requests with a hint to bypass Flash Cache.
A quick “hit ratio” calculation can be done like so:
hit ratio = FC_IO_RQ_R/(FC_IO_RQ_R+FC_IO_RQ_R_MISS)
hit ratio ignoring skips = (FC_IO_RQ_R+FC_IO_RQ_R_SKIP)/(FC_IO_RQ_R+FC_IO_RQ_R_MISS+FC_IO_RQ_R_SKIP)
So for the system we were just looking at:
SQL> select (21095663)/(21095663+1574404) hit_ratio from dual; HIT_RATIO ---------- .930551418 SQL> select (4879720+21095663)/(21095663+1574404+4879720) hit_ratio_ignoring_skips from dual; HIT_RATIO_IGNORING_SKIPS ------------------------ .942852408
Looks like 93-94% depending on how you calculate it. Keep in mind this is for one storage server and it is cumulative since the storage server started. (maybe AWR will start pulling that info with snapshots)
Now here’s a little output from the system showing method 2 (v$sessstat/v$mystat). Note that for this demonstration I had a query that retrieved a single row from a very large table via an index (I left the full text of the statement out of the demo though):
SQL> !cat mystats.sql col name for a70 col value for 99999999999999 select name, value from v$mystat s, v$statname n where n.statistic# = s.statistic# and name like nvl('&name',name) / SQL> @mystats -- my starting value for this session (obviously I had been playing around in this session already) Enter value for name: cell flash% NAME VALUE ---------------------------------------------------------------- ---------- cell flash cache read hits 404 SQL> select a single row by index ... (first time this row has been selected) SQL> @mystats Enter value for name: cell flash% NAME VALUE ---------------------------------------------------------------- ---------- cell flash cache read hits 404 SQL> -- no change in stats, because the blocks had not been read into the flash cache yet SQL> SQL> select a single row by index ... (same row as before) SQL> @mystats Enter value for name: cell flash% NAME VALUE ---------------------------------------------------------------- ---------- cell flash cache read hits 404 SQL> -- no change, because the block is now in the database buffer cache (SGA) SQL> -- so no need to access storage at all (i.e. no physical read was necessary) SQL> -- let's make it revisit the storage cell SQL> SQL> alter system flush buffer_cache; System altered. SQL> select a single row by index ... (same row as before) SQL> @mystats Enter value for name: cell flash% NAME VALUE ---------------------------------------------------------------- ---------- cell flash cache read hits 408 SQL> -- this time the stat was incremented
So that’s cool. We at least have some visibility into what’s happening. Note that I did trace these statements as well to verify that it was doing what I thought, but I will not bore you with that output as this post is already long enough.
So what about the performance? I think we all expect that accessing a block in the flash cache should be considerably faster than reading it off of disk. And of course that is the case. First a bit of output from an AWR report showing single block read times. Note that the there are a couple of new wait events (“cell single block physical read” and “cell multiblock physical read”). These take the place of “db file sequential read” and “db file scattered read” respectively when using Exadata storage.
Event Waits <1ms <2ms <4ms <8ms <16ms <32ms <=1s >1s -------------------------- ----- ----- ----- ----- ----- ----- ----- ----- ----- cell multiblock physical r 22.6K 61.3 12.1 3.2 8.0 6.6 6.7 2.2 cell single block physical 930.1 94.7 3.7 .3 .4 .6 .1 .0
These are pretty typical values for these two events on the system we have been working with. In this particular time period about 95% of the single block reads were returned in under 1ms (the ones coming from the flash cache presumably). Let’s take a closer look via a 10046 trace.
===================== PARSING IN CURSOR #9 len=78 dep=0 uid=0 oct=3 lid=0 tim=1272663655301341 hv=1577793954 ad='442beca50' sqlid='9st4pvjg0qfd2' select * from TABLE_XXX where address_id = 602275 END OF STMT PARSE #9:c=2000,e=1562,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=2222833759,tim=1272663655301340 EXEC #9:c=0,e=35,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2222833759,tim=1272663655301412 WAIT #9: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655301447 WAIT #9: nam='ges message buffer allocation' ela= 2 pool=0 request=1 allocated=0 obj#=36406 tim=1272663655301544 WAIT #9: nam='gc cr grant 2-way' ela= 276 p1=6 p2=42702323 p3=1 obj#=37362 tim=1272663655301875 WAIT #9: nam='cell single block physical read' ela= 738 cellhash#=2520626383 diskhash#=3243237505 bytes=8192 obj#=37362 tim=1272663655302720 WAIT #9: nam='cell single block physical read' ela= 566 cellhash#=2520626383 diskhash#=3754429853 bytes=8192 obj#=37362 tim=1272663655303388 WAIT #9: nam='cell single block physical read' ela= 652 cellhash#=88802347 diskhash#=26778600 bytes=8192 obj#=37362 tim=1272663655304130 WAIT #9: nam='ges message buffer allocation' ela= 1 pool=0 request=1 allocated=0 obj#=37362 tim=1272663655304194 WAIT #9: nam='gc cr grant 2-way' ela= 116 p1=6 p2=52163917 p3=1 obj#=36406 tim=1272663655304336 WAIT #9: nam='cell single block physical read' ela= 646 cellhash#=398250101 diskhash#=4236948042 bytes=8192 obj#=36406 tim=1272663655305043 FETCH #9:c=0,e=3598,p=4,cr=4,cu=0,mis=0,r=1,dep=0,og=1,plh=2222833759,tim=1272663655305083 WAIT #9: nam='SQL*Net message from client' ela= 654 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655305768 FETCH #9:c=0,e=7,p=0,cr=1,cu=0,mis=0,r=0,dep=0,og=1,plh=2222833759,tim=1272663655305792 STAT #9 id=1 cnt=1 pid=0 pos=1 obj=36406 op='TABLE ACCESS BY INDEX ROWID TABLE_XXX (cr=5 pr=4 pw=0 time=0 us cost=4 size=275 card=1)' STAT #9 id=2 cnt=1 pid=1 pos=1 obj=37362 op='INDEX RANGE SCAN KSO_INDEX_XXX (cr=4 pr=3 pw=0 time=0 us cost=3 size=0 card=1)' WAIT #9: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655305844 WAIT #9: nam='SQL*Net message from client' ela= 463 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655306347
So the single block reads took roughly half of a millisecond (ela=646). That’s about an order of magnitude faster than a typical disk read and on par with what we’d expect (SSD devices are often in this ballpark as well, although throughput should be better with flash cache due to not having to run everything through a disk controller).
Just a couple of final points. I think the flash cache is an extremely important component in the Exadata Storage platform, particularly when it comes to running mixed work loads (i.e. OLTP and DW). It provides the capability of having a very large disk cache with very smart algorithms for deciding what to cache and what not to cache. (I guess that’s why they call it “Exadata Smart Flash Cache”) It doesn’t do anything for you on the write side, but part of it could be carved out as a “Flash Disk” for systems where write speed is the major bottle neck. Putting the log files on flash may make sense for very high transaction systems for example.
So that’s it for now. Let me know what you think or if you have anything you want me to try.