Oracle Exadata V2 – Flash Cache
One of the things I didn’t really talk about in my first post on Exadata was the flash cache component of the storage servers. It is a key component of the “OLTP” claims that Oracle is making for the platform. So let’s talk about the hardware first. Each storage server has 4 of the Sun Flash Accelerator F20 PCIe cards. These cards hold 96G each for a total of 384G on each storage server. That’s well over a terabyte on the smallest quarter rack configuration. Here’s what they look like:
Note that they are only installed in the storage servers and not in the database servers. The cards are usually configured exclusively as Flash Cache, but can optionally have a portion carved off as flash “disks”.
Oracle has a White Paper here:
Exadata Smart Flash Cache and the Sun Oracle Database Machine
This white paper was published in late 2009 and it is specific to V2. It has some good information and is well worth reading. One of the comments I found interesting was the discussion of carving a piece of the Flash Cache out as a “disk”. Here’s the quote:
These high-performance logical flash disks can be used to store frequently accessed data. To use them requires advance planning to ensure adequate space is reserved for the tablespaces stored on them. In addition, backup of the data on the flash disks must be done in case media recovery is required, just as it would be done for data stored on conventional disks. This option is primarily useful for highly write intensive workloads where the disk write rate is higher than the disks can keep up with.
Do not confuse the use of these cards in the storage servers with the new 11gR2 feature “Database Flash Cache”. That feature allows an extended SGA (level 2) cache to be created on a database server (if you are using Solaris or Oracle Enterprise Linux) and has nothing to do with the Exadata Smart Flash Cache, which resides on the Exadata storage servers. Think of the Database Flash Cache as an extended SGA and the Exadata Smart Flash Cache as a large “smart” disk cache. I say smart because it implements some of the same types of Oracle cache management features as the SGA.
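For reference, the database-tier feature is driven by a couple of init.ora parameters, while the Exadata Smart Flash Cache needs no database-side setup at all. Here’s a minimal sketch of enabling the Database Flash Cache (the file name and size are made-up placeholders, not values from my system):

-- hypothetical example only: enables the 11gR2 Database Flash Cache on a
-- database server running Solaris or Oracle Enterprise Linux
-- (the path and size below are placeholders)
alter system set db_flash_cache_file = '/flash/oradata/dbfc.dat' scope=spfile;
alter system set db_flash_cache_size = 64G scope=spfile;
-- an instance bounce is required for the flash cache file to be created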
Kevin Closson has a couple of good posts outlining the differences between Database Flash Cache and Exadata Smart Flash Cache here:
Pardon Me, Where Is That Flash Cache? Part I.
Pardon Me, Where Is That Flash Cache? Part II.
Note also that Exadata Smart Flash Cache does not affect writes (i.e. it is not a write cache).
So how do we see what’s going on with the Exadata Smart Flash Cache? Well, there are a couple of ways:
- We can use the cellcli utility on the storage servers themselves.
- We can look in v$sesstat or v$mystat (one of the best ways to do that is with Tanel Poder’s snapper script, by the way).
Here’s a little output from the system showing method 1 (cellcli):
[root@dm01cel01 ~]# cellcli
CellCLI: Release 11.2.1.2.3 - Production on Fri Apr 30 16:09:29 CDT 2010
Copyright (c) 2007, 2009, Oracle. All rights reserved.
Cell Efficiency Ratio: 38M
CellCLI> LIST METRICCURRENT WHERE objectType = 'FLASHCACHE'
FC_BYKEEP_OVERWR FLASHCACHE 0.0 MB
FC_BYKEEP_OVERWR_SEC FLASHCACHE 0.0 MB/sec
FC_BYKEEP_USED FLASHCACHE 300.6 MB
FC_BY_USED FLASHCACHE 135,533.7 MB
FC_IO_BYKEEP_R FLASHCACHE 10,399.4 MB
FC_IO_BYKEEP_R_SEC FLASHCACHE 0.0 MB/sec
FC_IO_BYKEEP_W FLASHCACHE 6,378.3 MB
FC_IO_BYKEEP_W_SEC FLASHCACHE 0.0 MB/sec
FC_IO_BY_R FLASHCACHE 480,628.3 MB
FC_IO_BY_R_MISS FLASHCACHE 55,142.4 MB
FC_IO_BY_R_MISS_SEC FLASHCACHE 0.0 MB/sec
FC_IO_BY_R_SEC FLASHCACHE 0.1 MB/sec
FC_IO_BY_R_SKIP FLASHCACHE 1,448,220.2 MB
FC_IO_BY_R_SKIP_SEC FLASHCACHE 12.8 MB/sec
FC_IO_BY_W FLASHCACHE 178,761.9 MB
FC_IO_BY_W_SEC FLASHCACHE 0.1 MB/sec
FC_IO_ERRS FLASHCACHE 0
FC_IO_RQKEEP_R FLASHCACHE 1051647 IO requests
FC_IO_RQKEEP_R_MISS FLASHCACHE 291829 IO requests
FC_IO_RQKEEP_R_MISS_SEC FLASHCACHE 0.0 IO/sec
FC_IO_RQKEEP_R_SEC FLASHCACHE 0.0 IO/sec
FC_IO_RQKEEP_R_SKIP FLASHCACHE 0 IO requests
FC_IO_RQKEEP_R_SKIP_SEC FLASHCACHE 0.0 IO/sec
FC_IO_RQKEEP_W FLASHCACHE 176405 IO requests
FC_IO_RQKEEP_W_SEC FLASHCACHE 0.0 IO/sec
FC_IO_RQ_R FLASHCACHE 21095663 IO requests
FC_IO_RQ_R_MISS FLASHCACHE 1574404 IO requests
FC_IO_RQ_R_MISS_SEC FLASHCACHE 0.6 IO/sec
FC_IO_RQ_R_SEC FLASHCACHE 1.6 IO/sec
FC_IO_RQ_R_SKIP FLASHCACHE 4879720 IO requests
FC_IO_RQ_R_SKIP_SEC FLASHCACHE 26.8 IO/sec
FC_IO_RQ_W FLASHCACHE 5665344 IO requests
FC_IO_RQ_W_SEC FLASHCACHE 2.9 IO/sec
The stats I found most interesting are:
FC_IO_RQ_R – The number of read I/O requests satisfied from Flash Cache.
FC_IO_RQ_R_MISS – The number of read I/O requests which did not find all data in Flash Cache.
FC_IO_RQ_R_SKIP – The number of read I/O requests with a hint to bypass Flash Cache.
A quick “hit ratio” calculation can be done like so:
hit ratio = FC_IO_RQ_R / (FC_IO_RQ_R + FC_IO_RQ_R_MISS)
or, if you prefer to treat the skipped (bypass) requests as if they had been hits:
hit ratio ignoring skips = (FC_IO_RQ_R + FC_IO_RQ_R_SKIP) / (FC_IO_RQ_R + FC_IO_RQ_R_MISS + FC_IO_RQ_R_SKIP)
So for the system we were just looking at:
SQL> select (21095663)/(21095663+1574404) hit_ratio from dual;
HIT_RATIO
----------
.930551418
SQL> select (4879720+21095663)/(21095663+1574404+4879720) hit_ratio_ignoring_skips from dual;
HIT_RATIO_IGNORING_SKIPS
------------------------
.942852408
Looks like 93-94% depending on how you calculate it. Keep in mind that this is for a single storage server and that the numbers are cumulative since the storage server was started. (Maybe AWR will start pulling that info with snapshots.)
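By the way, if you’d rather get a rough database-wide view without logging on to each cell, the database-tier statistics can be used to approximate the same thing. Here’s a back-of-the-envelope sketch (my own calculation, not an official formula, and like the cellcli numbers it is cumulative since instance startup):

-- rough percentage of read requests satisfied from the flash cache,
-- as seen from the database tier (cumulative since instance startup)
select round(100 * flash_hits / nullif(total_read_requests, 0), 2) pct_from_flash
from  (select sum(case when name = 'cell flash cache read hits'
                       then value end) flash_hits,
              sum(case when name = 'physical read total IO requests'
                       then value end) total_read_requests
       from   v$sysstat
       where  name in ('cell flash cache read hits',
                       'physical read total IO requests'));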
Now here’s a little output from the system showing method 2 (v$sesstat/v$mystat). Note that for this demonstration I used a query that retrieved a single row from a very large table via an index (I left the full text of the statement out of the demo, though):
SQL> !cat mystats.sql
col name for a70
col value for 99999999999999
select name, value
from v$mystat s, v$statname n
where n.statistic# = s.statistic#
and name like nvl('&name',name)
/
SQL> @mystats -- my starting value for this session (obviously I had been playing around in this session already)
Enter value for name: cell flash%
NAME VALUE
---------------------------------------------------------------- ----------
cell flash cache read hits 404
SQL> select a single row by index ... (first time this row has been selected)
SQL> @mystats
Enter value for name: cell flash%
NAME VALUE
---------------------------------------------------------------- ----------
cell flash cache read hits 404
SQL> -- no change in stats, because the blocks had not been read into the flash cache yet
SQL>
SQL> select a single row by index ... (same row as before)
SQL> @mystats
Enter value for name: cell flash%
NAME VALUE
---------------------------------------------------------------- ----------
cell flash cache read hits 404
SQL> -- no change, because the block is now in the database buffer cache (SGA)
SQL> -- so no need to access storage at all (i.e. no physical read was necessary)
SQL> -- let's make it revisit the storage cell
SQL>
SQL> alter system flush buffer_cache;
System altered.
SQL> select a single row by index ... (same row as before)
SQL> @mystats
Enter value for name: cell flash%
NAME VALUE
---------------------------------------------------------------- ----------
cell flash cache read hits 408
SQL> -- this time the stat was incremented
So that’s cool. We at least have some visibility into what’s happening. Note that I did trace these statements as well to verify that it was doing what I thought, but I will not bore you with that output as this post is already long enough.
So what about the performance? I think we all expect that accessing a block in the flash cache should be considerably faster than reading it off of disk. And of course that is the case. First, a bit of output from an AWR report showing single block read times. Note that there are a couple of new wait events (“cell single block physical read” and “cell multiblock physical read”). These take the place of “db file sequential read” and “db file scattered read” respectively when using Exadata storage.
                                                 % of Waits
Event                      Waits   <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
-------------------------- -----  ----- ----- ----- ----- ----- ----- ----- -----
cell multiblock physical r 22.6K   61.3  12.1   3.2   8.0   6.6   6.7   2.2
cell single block physical 930.1   94.7   3.7    .3    .4    .6    .1    .0
These are pretty typical values for these two events on the system we have been working with. In this particular time period about 95% of the single block reads were returned in under 1ms (the ones coming from the flash cache presumably). Let’s take a closer look via a 10046 trace.
=====================
PARSING IN CURSOR #9 len=78 dep=0 uid=0 oct=3 lid=0 tim=1272663655301341 hv=1577793954 ad='442beca50' sqlid='9st4pvjg0qfd2'
select * from TABLE_XXX where address_id = 602275
END OF STMT
PARSE #9:c=2000,e=1562,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=2222833759,tim=1272663655301340
EXEC #9:c=0,e=35,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2222833759,tim=1272663655301412
WAIT #9: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655301447
WAIT #9: nam='ges message buffer allocation' ela= 2 pool=0 request=1 allocated=0 obj#=36406 tim=1272663655301544
WAIT #9: nam='gc cr grant 2-way' ela= 276 p1=6 p2=42702323 p3=1 obj#=37362 tim=1272663655301875
WAIT #9: nam='cell single block physical read' ela= 738 cellhash#=2520626383 diskhash#=3243237505 bytes=8192 obj#=37362 tim=1272663655302720
WAIT #9: nam='cell single block physical read' ela= 566 cellhash#=2520626383 diskhash#=3754429853 bytes=8192 obj#=37362 tim=1272663655303388
WAIT #9: nam='cell single block physical read' ela= 652 cellhash#=88802347 diskhash#=26778600 bytes=8192 obj#=37362 tim=1272663655304130
WAIT #9: nam='ges message buffer allocation' ela= 1 pool=0 request=1 allocated=0 obj#=37362 tim=1272663655304194
WAIT #9: nam='gc cr grant 2-way' ela= 116 p1=6 p2=52163917 p3=1 obj#=36406 tim=1272663655304336
WAIT #9: nam='cell single block physical read' ela= 646 cellhash#=398250101 diskhash#=4236948042 bytes=8192 obj#=36406 tim=1272663655305043
FETCH #9:c=0,e=3598,p=4,cr=4,cu=0,mis=0,r=1,dep=0,og=1,plh=2222833759,tim=1272663655305083
WAIT #9: nam='SQL*Net message from client' ela= 654 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655305768
FETCH #9:c=0,e=7,p=0,cr=1,cu=0,mis=0,r=0,dep=0,og=1,plh=2222833759,tim=1272663655305792
STAT #9 id=1 cnt=1 pid=0 pos=1 obj=36406 op='TABLE ACCESS BY INDEX ROWID TABLE_XXX (cr=5 pr=4 pw=0 time=0 us cost=4 size=275 card=1)'
STAT #9 id=2 cnt=1 pid=1 pos=1 obj=37362 op='INDEX RANGE SCAN KSO_INDEX_XXX (cr=4 pr=3 pw=0 time=0 us cost=3 size=0 card=1)'
WAIT #9: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655305844
WAIT #9: nam='SQL*Net message from client' ela= 463 driver id=1650815232 #bytes=1 p3=0 obj#=36406 tim=1272663655306347
So the single block reads took roughly six or seven tenths of a millisecond each (the ela values range from 566 to 738 microseconds). That’s about an order of magnitude faster than a typical disk read and on par with what we’d expect (SSD devices are often in this ballpark as well, although throughput should be better with the flash cache due to not having to run everything through a disk controller).
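By the way, if you want to check the latency distribution on your own system without pulling an AWR report, v$event_histogram gives essentially the same breakdown. Here’s a quick sketch (again, the numbers are cumulative since instance startup):

-- latency buckets for single block reads (cumulative since instance startup)
select wait_time_milli, wait_count,
       round(100 * ratio_to_report(wait_count) over (), 1) pct_of_waits
from   v$event_histogram
where  event = 'cell single block physical read'
order  by wait_time_milli;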
Just a couple of final points. I think the flash cache is an extremely important component in the Exadata Storage platform, particularly when it comes to running mixed workloads (i.e. OLTP and DW). It provides the capability of having a very large disk cache with very smart algorithms for deciding what to cache and what not to cache. (I guess that’s why they call it “Exadata Smart Flash Cache”.) It doesn’t do anything for you on the write side, but part of it could be carved out as a “Flash Disk” for systems where write speed is the major bottleneck. Putting the log files on flash may make sense for very high transaction rate systems, for example.
So that’s it for now. Let me know what you think or if you have anything you want me to try.
Hi, thanks for sharing internals on “Exadata Smart Flash Cache”. So, Exadata’s out-of-the-box advantages are strictly limited to reads; there is no help for writes unless it’s the typical process of reading lots of data and then updating/inserting (writing) based on it.
Overall improvements can be seen in that specific scenario. There are no improvements for direct writes unless you carve out a “Flash Disk” for writes and log files. Please comment or confirm.
Sam,
Yes, no specific enhancements for writes over Oracle on a non-Exadata platform (i.e. no fancy write cache). The flash memory in the storage servers can be carved off as a “disk” for high-write files if so desired (log files, for example).
Kerry
Hi Kerry,
Is putting log files on flash “advisable”? We’re implementing 4 full Exadata machines now in an OLTP environment. We’re planning to use GoldenGate for replication from Exadata to other servers for a period of time. We’re going to generate ~1.5TB/day of archive logs…which is a lot for GG to mine, so I brought up putting the redo logs in ASM on a portion of the flash cards to improve redo I/O for GoldenGate, but it was discouraged due to redundancy concerns. What’s your opinion?
Thanks,
John
John,
I’m not sure what the concerns about redundancy would be. Exadata systems are generally set up with diskgroups using normal or high redundancy (2- or 3-way mirroring). The failure groups are usually set up so that each storage cell represents a separate failure group, so if you choose high redundancy you should be able to lose two storage cells without losing any data. If your co-workers don’t trust the software-based redundancy provided by ASM, that is another issue.
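If you want to see how the failure groups are actually laid out on your system, a quick query along these lines (just a sketch) will show the layout; on a typical Exadata install you should see one failure group per storage cell:

-- disk group redundancy and failure group layout
select g.name diskgroup, g.type redundancy, d.failgroup, count(*) num_disks
from   v$asm_diskgroup g, v$asm_disk d
where  d.group_number = g.group_number
group  by g.name, g.type, d.failgroup
order  by g.name, d.failgroup;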
I would definitely want to do some testing to prove to myself that putting the log files on flash disks really gave me the performance improvement I was hoping for, as the results of my (admittedly limited) experiments in that area have been somewhat mixed. The downside of pursuing this approach is that any flash you use as disk decreases the amount available for caching data. That’s not as big a deal if you can get most of your hot data into the SGA, but nevertheless, it’s something to consider.
It sounds like a very interesting system. Good luck. I’d be very interested in hearing more as your evaluation/testing proceeds.
Kerry
Kerry, we are having an issue with a table with 262 columns. From what I have read, having a table with over 255 columns causes performance issues. Is there a way to determine if that is what is happening? I am currently tracing a session and we have lots of “cell single block physical read” wait events. It’s an INSERT statement with the following stats:
call     count       cpu    elapsed       disk      query    current       rows
------- ------  -------- ---------- ---------- ---------- ----------  ---------
Parse        0      0.00       0.00          0          0          0          0
Execute    439    602.53    1399.50    1329033      54696  136314235      43900
Fetch        0      0.00       0.00          0          0          0          0
------- ------  -------- ---------- ---------- ---------- ----------  ---------
total      439    602.53    1399.50    1329033      54696  136314235      43900
Hi Joe,
I’m not sure what the problem you’re trying to solve is. Are you trying to make your insert go faster? If so, sharing a little more info, like the statement itself, would be helpful. Also, why are you working this issue (is it slower than on another platform or than other similar inserts)?
Kerry
We are on Exadata 11gR2, BP 10. The load table is truncated prior to the single-row inserts: insert into table1(col1…col262) values(v1…..v262);
The table we are loading from has the same number of columns, and the process is done by Informatica with a ~10000 row commit frequency. There are 4 indexes on the target table. Also, the target table uses more blocks, according to dba_segments, than the source table. Both have the same storage parameters.
Any advice on what to look at would be appreciated.
Thanks,
Joe
Hi Joe,
I’m not sure what you’re asking here. Are you just curious why the table sizes don’t match?
Kerry
I think we ran into an issue with Informatica processing 1 row at a time. An insert into table1 from table2 completes successfully in about 1 hour for 12.5 million rows. When I add parallel DML and do it in parallel, it completes in about 6 minutes. Without looking, I am making the assumption that the Informatica server slowed down due to high memory usage.
I meant to say that if we do an insert into table1 select * from table2, the process runs much faster than processing 1 row at a time from the Informatica load.
Yes – set-based processing generally wins over one-row-at-a-time processing regardless of the platform, but with Exadata the difference can be magnified.
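For what it’s worth, the set-based version of that load would look something like this (the table names are just placeholders based on your description):

-- set-based load with parallel DML (placeholder table names)
alter session enable parallel dml;
insert /*+ append parallel(t1, 8) */ into table1 t1
select * from table2;
commit;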
Kerry
Hi Kerry,
The whitepaper has moved : http://www.oracle.com/technetwork/database/exadata/exadata-smart-flash-cache-366203.pdf
One day Oracle will learn how to implement redirects when they move their website content around ….
Thanks Robin.
Hi Kerry,
Our database size is 500GB. Can we keep our entire database on flash disks? I mean, by carving out flash disks and keeping all of our data, control, and redo log files there. Correct me if I am wrong. Is that possible? Are there any drawbacks to doing that?
Thanks in advance
Hi Krishna,
The answer is yes – you can carve off part of the flash as permanent storage. I would caution you to think carefully about the decision to do this, though. The flash is most useful in most cases as “flash cache”, and it does a very good job of buffering single block reads. You can even designate tables for preferential caching with the KEEP option, which will allow full scans and smart scans to use the flash cache on those objects. So my recommendation before doing that would be to do some investigation and experimenting – but it is an option.
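The syntax is just a segment storage attribute, by the way. Something like this (hypothetical table name):

-- request preferential flash cache treatment for a segment
alter table my_big_table storage (cell_flash_cache keep);
-- and revert to the default behavior later
alter table my_big_table storage (cell_flash_cache default);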
Kerry
Thank you very much for your valuable reply Kerry.