Oracle Exadata V2 – Flash Cache

One of the things I didn’t really talk about in my first post on Exadata was the flash cache component of the storage servers. They are a key component of the “OLTP” claims that Oracle is making for the platform. So let’s talk about the hardware first. The storage servers have 4 of the Sun Flash Accelerator F20 PCIe cards. These cards hold 96G each for a total of 384G on each storage server. That’s well over a terabyte on the smallest quarter rack configuration. Here’s what they look like:

Note that they are only installed in the storage servers and not in the database servers. The cards are usually configured exclusively as Flash Cache, but can optionally have a portion defined as a “ram disk”.

Oracle has a White Paper here:

Exadata Smart Flash Cache and the Sun Oracle Database Machine

This white paper was published in late 2009 and it is specific to V2. It has some good information and is well worth reading. One of the comments I found interesting was the discussion of carving a piece of the Flash Cache out as a “disk”. Here’s the quote:

These high-performance logical flash disks can be used to store frequently accessed data. To use them requires advance planning to ensure adequate space is reserved for the tablespaces stored on them. In addition, backup of the data on the flash disks must be done in case media recovery is required, just as it would be done for data stored on conventional disks. This option is primarily useful for highly write intensive workloads where the disk write rate is higher than the disks can keep up with.

Do not confuse the use of these cards in the storage server with the new 11gR2 feature “Database Flash Cache”. That feature allows an extended SGA (level 2) cache to be created on a database server (if you are using Solaris or Oracle Enterprise Linux) and has nothing to do with the Exadata Smart Flash Cache which resides on the Exadata storage servers. Think of the Database Flash Cache as an extended SGA and the Exadata Smart Flash Cache as large “smart” disk cache. I say smart because it implements some of the same type of Oracle cache management features as the SGA.

Kevin Closson has a couple of good posts outlining the differences between Database Flash Cache and Exadata Smart Flash Cache here:

Pardon Me, Where Is That Flash Cache? Part I.
Pardon Me, Where Is That Flash Cache? Part II.

Note also that Exadata Smart Flash Cache does not affect writes (i.e. it is not a write cache).

So how do we see what’s going on with the Exadata Flash Cache? Well there are a couple of ways.

  1. We can use the cellcli utility on the storage servers themselves.
  2. We can look in v$sesstat (one of the best ways to do that is with Tanel Poder’s snapper script by the way).

Here’s a little output from the system showing method 1 (cellcli):

[root@dm01cel01 ~]# cellcli
CellCLI: Release 11.2.1.2.3 - Production on Fri Apr 30 16:09:29 CDT 2010

Copyright (c) 2007, 2009, Oracle.  All rights reserved.
Cell Efficiency Ratio: 38M

CellCLI> LIST METRICCURRENT WHERE objectType = 'FLASHCACHE'
         FC_BYKEEP_OVERWR                FLASHCACHE      0.0 MB
         FC_BYKEEP_OVERWR_SEC            FLASHCACHE      0.0 MB/sec
         FC_BYKEEP_USED                  FLASHCACHE      300.6 MB
         FC_BY_USED                      FLASHCACHE      135,533.7 MB
         FC_IO_BYKEEP_R                  FLASHCACHE      10,399.4 MB
         FC_IO_BYKEEP_R_SEC              FLASHCACHE      0.0 MB/sec
         FC_IO_BYKEEP_W                  FLASHCACHE      6,378.3 MB
         FC_IO_BYKEEP_W_SEC              FLASHCACHE      0.0 MB/sec
         FC_IO_BY_R                      FLASHCACHE      480,628.3 MB
         FC_IO_BY_R_MISS                 FLASHCACHE      55,142.4 MB
         FC_IO_BY_R_MISS_SEC             FLASHCACHE      0.0 MB/sec
         FC_IO_BY_R_SEC                  FLASHCACHE      0.1 MB/sec
         FC_IO_BY_R_SKIP                 FLASHCACHE      1,448,220.2 MB
         FC_IO_BY_R_SKIP_SEC             FLASHCACHE      12.8 MB/sec
         FC_IO_BY_W                      FLASHCACHE      178,761.9 MB
         FC_IO_BY_W_SEC                  FLASHCACHE      0.1 MB/sec
         FC_IO_ERRS                      FLASHCACHE      0
         FC_IO_RQKEEP_R                  FLASHCACHE      1051647 IO requests
         FC_IO_RQKEEP_R_MISS             FLASHCACHE      291829 IO requests
         FC_IO_RQKEEP_R_MISS_SEC         FLASHCACHE      0.0 IO/sec
         FC_IO_RQKEEP_R_SEC              FLASHCACHE      0.0 IO/sec
         FC_IO_RQKEEP_R_SKIP             FLASHCACHE      0 IO requests
         FC_IO_RQKEEP_R_SKIP_SEC         FLASHCACHE      0.0 IO/sec
         FC_IO_RQKEEP_W                  FLASHCACHE      176405 IO requests
         FC_IO_RQKEEP_W_SEC              FLASHCACHE      0.0 IO/sec
         FC_IO_RQ_R                      FLASHCACHE      21095663 IO requests
         FC_IO_RQ_R_MISS                 FLASHCACHE      1574404 IO requests
         FC_IO_RQ_R_MISS_SEC             FLASHCACHE      0.6 IO/sec
         FC_IO_RQ_R_SEC                  FLASHCACHE      1.6 IO/sec
         FC_IO_RQ_R_SKIP                 FLASHCACHE      4879720 IO requests
         FC_IO_RQ_R_SKIP_SEC             FLASHCACHE      26.8 IO/sec
         FC_IO_RQ_W                      FLASHCACHE      5665344 IO requests
         FC_IO_RQ_W_SEC                  FLASHCACHE      2.9 IO/sec

Continue reading ‘Oracle Exadata V2 – Flash Cache’ »

What is Exadata?

Well I’ve been holed up playing with an Exadata V2 machine for the past several weeks. Wow. Very interesting technology.

I must say that I believe the concept of offloading SQL processing to the storage layer is a game changer and I wouldn’t be at all surprised to see this as a standard feature a few years from now. What that means for other storage vendors is unclear at this point. So for this first post on the topic let me just describe the configuration (and some potential upgrades).

The basic architecture consists of a set of database severs and a set of storage servers.

Database Servers:

  • Sun x4170 (1RU 64x server)
  • 2 – Quad-core Intel Xeon E5540 2.53GHz processors
  • 72G Ram (18x4G Dimms – max of 144G using 8G DIMMs)
  • Dual-Port QDR InfiniBand Host Channel Adapter
  • HBA with 512MB Battery Backed Write Cache (only for internal disks???)
  • 4 – 146G internal drives (SAS 10,000 RPM)
  • dual hot swappable power supplies
  • no spare/empty slots!

Here’s what the Database Servers look like:

Storage Servers:

  • Sun x4275 (2RU 64x server)
  • 2 – Quad-core Intel Xeon E5540 (2.53GHz) processors
  • 24G Ram
  • Dual-Port QDR InfiniBand Host Channel Adapter
  • HBA with 512MB Battery Backed Write Cache (only for internal disks???)
  • dual hot swappable power supplies
  • 4 – 96G Sun Flash PCIe Cards (total of 384 GB)
  • 12 – 600 GB 15,000 RPM SAS or 2 TB 7,200 RPM SATA

Continue reading ‘What is Exadata?’ »

Funny Developer Tricks – first_rows(999999999)

I ran across a funny SQL statement recently (funny strange, not funny ha ha – well actually funny ha ha too I guess). It had a first_rows hint like so:



select /*+ FIRST_ROWS (999999999)  */ 
"MANDT" , "OPBEL" , "OPUPW" , "OPUPK" , "OPUPZ" , "BUKRS" , "GSBER" , "BUPLA" , "SEGMENT" , 
"AUGST" , "GPART" , "VTREF" , "VTRE2" , "VKONT" , "ABWBL" , "ABWTP" , "ABWKT" , "APPLK" , 
"HVORG" , "TVORG" , "KOFIZ" , "SPART" , "HKONT", "MWSKZ" , "MWSZKZ" , "XANZA" , "STAKZ" , 
"BLDAT" , "BUDAT" , "OPTXT" , "WAERS" , "FAEDN", "FAEDS" , "VERKZ" , "STUDT" , "SKTPZ" , 
"XMANL" , "KURSF" , "BETRH" , "BETRW" , "BETR2" , "BETR3" , "SKFBT" , "SBETH" , "SBETW" , 
"SBET2" , "SBET3" , "MWSKO" , "MWVKO" , "TXRUL" , "SPZAH" , "PYMET" , "PYBUK" , "PERNR" , 
"GRKEY" , "PERSL" , "XAESP" , "AUGDT" , "AUGBL" , "AUGBD" , "AUGRD" , "AUGWA" , "AUGBT" , 
"AUGBS" , "AUGSK" , "AUGVD" , "AUGOB" , "WHANG" , "WHGRP" , "XEIPH" , "MAHNV" , "MANSP" , 
"XAUGP" , "ABRZU" , "ABRZO" , "FDGRP" , "FDLEV" , "FDZTG", "FDWBT" , "XTAUS" , "AUGRS" , 
"PYGRP" , "PDTYP" , "SPERZ" , "INFOZ" , "TXJCD" , "TXDAT" ,"VBUND" , "KONTT" , "KONTL" , 
"OPSTA" , "BLART" , "EMGPA" , "EMBVT" , "EMADR" , "IKEY" , "EUROU" , "XRAGL" , "ASTKZ" , 
"ASBLG" , "XBLNR" , "INKPS" , "RNDPS" , "QSSKZ" , "QSSEW" , "QSPTP" , "QSSHB" , "QBSHB" , 
"QSZNR" , "RFUPK" , "STRKZ" , "FITPR" , "XPYOR" , "LANDL" , "INTBU", "EMCRD" , "C4EYE" , 
"C4EYP" , "SCTAX" , "STTAX" , "STZAL" , "ORUPZ" , "NEGBU" , "SUBAP" , "PSWSL" , "PSWBT" , 
"PSWTX" , "PSGRP" , "FINRE" , "RDSTA" , "RDSTB" , "DEAKTIV" , "SGRKEY", "SOLLDAT" , "RECPT" , 
"TOCOLLECT" , "EINMALANF" , "VORAUSZAHL" , "APERIODIC" , "ABRABS" , "GRBBP" , "ASMETH" , 
"INT_CROSSREFNO" , "ETHPPM" , "PAYFREQID" , "INVOICING_PARTY" , "PPMST" , "LOGNO" , "APERIODICT" , 
"ADD_REFOBJ" , "ADD_REFOBJID" , "ADD_SERVICE" , "ZZAGENCY" , "ZZ_EXT_REF" , "ZZ_PAY_AGENT" , 
"ZZFUNDSOURCE" , "ZZINSTALLMENT" , "Z_PROD_ID" , "ZZUSERNAME" , "ZZWF_STAT" , "ZZPAYCHANNEL" 
FROM "DFKKOP" 
WHERE "MANDT" = :A0 -- NDV=1
AND "BUKRS" = :A1 -- NDV=1 
AND "AUGST" = :A2 -- NDV=2 
AND "FAEDN" < :A3 -- less than today probably all records
AND ( "PYMET" = :A4 OR "PYMET" = :A5 ) -- NDV=8
AND ROWNUM <= :A6; -- less than 1B


Yes - that's a first rows hint with about a billion as the number of rows to optimizer for.

The reason I noticed it is that it runs for 15 hours before getting a Snapshot Too Old error. The attempted solution was to restart it the next day (thinking maybe it will run better the second time I guess). The table has roughly 100M rows. There was no index on PYMET which is unfortunate as the two values requested account for only about 0.15% (not 15%, 0.15%). The optimizer chooses an index on MANDT, BURKRS, AUGST, FAEDN and as you might expect, it doesn't work very well (see the NDV comments I added to the statement).

Funny things:

The First_Rows hint is requesting the Oracle optimizer to return the first billion records as fast as possible (even though there are only 100M rows).

The documentation for the First_Rows hint in 11g looks like this:

The FIRST_ROWS hint instructs Oracle to optimize an individual SQL statement for fast response, choosing the plan that returns the first n rows most efficiently. For integer, specify the number of rows to return.

For example, the optimizer uses the query optimization approach to optimize the following statement for best response time:

SELECT /*+ FIRST_ROWS(10) */ employee_id, last_name, salary, job_id
FROM employees
WHERE department_id = 20;

In this example each department contains many employees. The user wants the first 10 employees of department 20 to be displayed as quickly as possible.

So I can see where the developers might have interpreted this as the ever elusive "Go Fast" hint.

The developers also added "and rownum < 999999999" to the where clause which limits the amount of rows that can be returned. I'm not sure whether they knew it or not, but this clause also has the same affect as the hint. That is to say that the clause causes the optimizer to modify it's calculations as if the first_rows_N hint had been applied. Maybe the developers weren't getting the "go fast" behavior they expected from the hint and after doing some research found that the "rownum <" syntax would basically do the same thing. I'm guessing that's the case because I can't see why they would really want to limit the number of rows coming back, but I'm not sure. It's a very odd statement because the First_Rows hint tends to push the optimizer towards index usage, and this statement was behaving badly precisely because it was using an index (a full table scan only took about 1 hour). Regardless of what the developers were trying to do, the fact that they used such a big number caused the optimizer to ignore the hint anyway. Since the table only had 100M rows and the parameter was 1B, the hint was ignored (well at least the "First K Rows" modifications to the optimizer calculations were not used). This happens to the "rownum <" induced behavior as well by the way. Here's a bit of a couple of 10053 trace file showing some details: Continue reading ‘Funny Developer Tricks – first_rows(999999999)’ »

Funny Developer Tricks – upper(number)

I saw a funny one today. I’ll paraphrase:

select * from table_x
where upper(acct_number) = '876876'
or upper(acct_number) = '826531';

Nice huh?

1. Obviously turns off any indexes on acct_number (unless they had a function based index).
2. Looks like they are probably storing numeric data in a character data type (or implicitly converting a number to character string).
3. Not using bind variables so they are not helping themselves from a parsing standpoint.
4. Finally, they’re making darn sure they take care of any mixed case numbers!

Some fun.

(by the way, those lower case numbers cause me problems all the time)

New forcedirection Mount Option in Solaris

I was scanning a manual for a software product that runs on top of Oracle the other day and I ran across some “best practices” for Oracle. I was surprised to see a new mount option that I hadn’t run across before. Apparently it makes sequential reads 10-15% faster. See below for the section of the SERVER PREPARATION GUIDE for IBM Prospect® for Alcatel-Lucent Release 33.

Using forcedirection to Increase Performance

Sun Microsystems best practices for Solaris and Oracle installations recommends using the forcedirection mount option for file systems that store Oracle data files. The forcedirection option forces input/output (I/O) operations to bypass Solaris file buffering and caching. The bypass lets you combine the performance benefits of raw file systems with the manageability and flexibility of traditional file systems. The forcedirection option typically results in 10 percent to 15 percent faster sequential reads.

———————————————————
Caution! Use the forcedirection option on Solaris UNIX file systems. Other third-party file
systems provide similar functionality. If you are using a third-party file system, such as Veritas,
consult the company documentation for instructions on turning off file system buffering.
———————————————————

To set up the forcedirection option on existing IBM Prospect servers

1.Edit the

/etc/vfstab

file to include the forcedirection option using the root account.
$ su root
2.The seventh column in /etc/vfstab is the mount option column. In this column, change the value from – to forcedirection for the file systems containing Oracle data files. For standard installations, these are file systems /u02 through /u06. For example:
/dev/dsk/c2t0d0s6 /dev/rdsk/c2t0d0s6 /u02 ufs 3 yes –
becomes
SERVER PREPARATION GUIDE
IBM Prospect® for Alcatel-Lucent Release 33
Installing and Configuring Solaris
© Copyright IBM Corp 1999, 2009. All rights reserved.
SPG – Page 28
/dev/dsk/c2t0d0s6 /dev/rdsk/c2t0d0s6 /u02 ufs 3 yes forcedirection
Caution! For standard installations, enable forcedirection on file systems /u02 through
/u06 only. Do not enable forcedirection on file system /u01 — the /u01 file system
does not contain Oracle data files.
3.The forcedirection setup changes occur during the next system reboot. These changes are only a performance improvement; if they are the only changes requiring a reboot, schedule the reboot for the next available maintenance period. For information about scheduled maintenance, see the Administration Guide.
4.For a setup using forcedirection, use mount table entries in /etc/vfstab as follows:
/dev/dsk/c0t0d0s7 /dev/rdsk/c0t0d0s7 /u01 ufs 2 yes –
/dev/dsk/c2t0d0s6 /dev/rdsk/c2t0d0s6 /u02 ufs 3 yes forcedirection
/dev/dsk/c2t1d0s6 /dev/rdsk/c2t1d0s6 /u03 ufs 3 yes forcedirection
/dev/dsk/c2t2d0s6 /dev/rdsk/c2t2d0s6 /u04 ufs 3 yes forcedirection
/dev/dsk/c2t3d0s6 /dev/rdsk/c2t3d0s6 /u05 ufs 3 yes forcedirection
/dev/dsk/c2t4d0s6 /dev/rdsk/c2t4d0s6 /u06 ufs 3 yes forcedirection

If you’ve made it this far I have to go ahead and let you know that there is no forcedirection mount option (as far as I know). I think the editor of the manual (or maybe an overactive spell checker) got a little carried away – it should have been forcedirectio. Happy April Fools Day!