Archive for the ‘Oracle’ Category.

Exadata Offload - The Secret Sauce

The “Secret Sauce” for Exadata is it’s ability to offload processing to the storage tier. Offloading means that the storage servers can apply predicate filters at the storage layer, instead of shipping every possible block back to the database server(s). Another thing that happens with offloading is that the volume of data returned can be further reduced by column projection (i.e. if you only select 1 column from a 100 column table, there is no need to return the other 99 columns). Offloading is geared to long running queries that access a large amount of data. Offloading only works if the Oracle decides to use it’s direct path read mechanism. Direct path reads have traditionally been done by parallel query slaves but can also be done by serial queries. In fact, as of 11g, Oracle has changed the decision making process resulting in more aggressive use of serial direct path reads. I’ve seen this feature described as “serial direct path reads” and “adaptive direct path reads”.

I’ll digress here a bit to discuss this feature since direct path reads are critical to Exadata Offloading. Direct path reads do not load blocks into Oracle’s buffer cache. Instead, the data is returned directly to the PGA of the process requesting the data. This means that the data does not have to be in Oracle block format. That means no 8K block that is only partially filled, that may only have a record or two that you’re interested in, containing every column including ones you don’t want, and with additional header information - needs to be shipped back up from the storage layer. Instead, a much more compact result set containing only the columns requested and hopefully only the rows you need are returned. As I said, direct path reads are traditionally used by parallel query slaves. They are also used in a few other instances such as LOB access and sorts that spill over into TEMP. So the ability to use direct path reads is very important to the Exadata platform and thus the changes to the make them more attractive in 11g. Here are a few links to info on the subject of serial direct path reads:

  1. Doug Burns has a good post on 11g serial direct path reads.
  2. Alex Fatkulin has a very good post on some of the factors controlling adaptive direct path reads.
  3. There is a note on MOS (793845.1) on changes in 11g in the heuristics to choose between direct path reads or reads through the buffer cache.
  4. You may also find MOS note (50415.1) on misleading nature of “direct path read” wait events of interest.

Also be aware that direct path reads are only available for full scans (tables or indexes). So any statement that uses an index to get to a row in a table will not use this mechanism. Also keep in mind that direct path requires extra processing to ensure that all blocks on disk are current - (i.e. a check point), so frequently modified tables will suffer some overhead before direct path reads can be initiated.

I must say that I think the changes to the heuristics in 11g may be a little on the aggressive side for non-Exadata platforms (the changes may well be driven by Exadata). And by the way, serial direct path reads are not always faster than the normal reads that go through the buffer cache. Dion Cho has a good post on a performance problem due to serial direct path reads kicking in on one node of an 11g RAC environment (not Exadata). The node doing the direct path reads was running the query much slower than the node using the normal buffer cache reads. He also has a post on turning off serial direct path reads.

But enough about the direct path reads stuff, on to the Offloading. One of the first things I wanted to know when I got my first look at a system running on Exadata was whether a particular query was eligible for offloading and if so, how much of the expected i/o was saved by the feature. So of course I wrote a little script to show me that. Turns out there is plenty of info in V$SQL to see what’s going on. I called the script fsx.sql (short for Find_Sql_eXadata). Here’s a little demo:

Continue reading ‘Exadata Offload - The Secret Sauce’ »

Oracle Exadata Delivery Day

Well our new Exadata showed up this week. We had a pretty nice lab environment already. A bunch of Dell’s, some IBM’s, several Sun’s. We have a couple of EMC Sans as well (we actually threw away a whole EMC rack to make room for the Exadata). And of course we have every version of Oracle from 8i to 11gR2. It’s a good learning environment. It also let’s us try things when clients have a specific set of versions that we want to mimic. So now we have an Exadata V2 as well. We’ve had the delivery date on the calendar for several weeks. For some reason it reminded of the Weird Al Yankovik song Weasel Stomping Day.

It’s probably a sad reflection on how geeky we are that everyone is running around all excited like it’s Christmas or something.

Here’s a few pictures:

It’s a really fast machine by the way. In fact, we had trouble keeping up with it from the moment we got it off the truck.

Really fast, and slippery. Well in a couple of days we can actually turn it on (we’re supposed to let it acclimate to our environment). On Friday afternoon we’re going to have a happy hour to celebrate our newest edition. Wade calls it a sip and see. We’ll probably take a few pictures of ourselves with the little bundle of joy and sing a chorus of a festive Weird Al song, or maybe two. Come on by if you’re in the neighborhood!

Oracle Exadata V2 - Flash Cache

One of the things I didn’t really talk about in my first post on Exadata was the flash cache component of the storage servers. They are a key component of the “OLTP” claims that Oracle is making for the platform. So let’s talk about the hardware first. The storage servers have 4 of the Sun Flash Accelerator F20 PCIe cards. These cards hold 96G each for a total of 384G on each storage server. That’s well over a terabyte on the smallest quarter rack configuration. Here’s what they look like:

Note that they are only installed in the storage servers and not in the database servers. The cards are usually configured exclusively as Flash Cache, but can optionally have a portion defined as a “ram disk”.

Oracle has a White Paper here:

Exadata Smart Flash Cache and the Sun Oracle Database Machine

This white paper was published in late 2009 and it is specific to V2. It has some good information and is well worth reading. One of the comments I found interesting was the discussion of carving a piece of the Flash Cache out as a “disk”. Here’s the quote:

These high-performance logical flash disks can be used to store frequently accessed data. To use them requires advance planning to ensure adequate space is reserved for the tablespaces stored on them. In addition, backup of the data on the flash disks must be done in case media recovery is required, just as it would be done for data stored on conventional disks. This option is primarily useful for highly write intensive workloads where the disk write rate is higher than the disks can keep up with.

Do not confuse the use of these cards in the storage server with the new 11gR2 feature “Database Flash Cache”. That feature allows an extended SGA (level 2) cache to be created on a database server (if you are using Solaris or Oracle Enterprise Linux) and has nothing to do with the Exadata Smart Flash Cache which resides on the Exadata storage servers. Think of the Database Flash Cache as an extended SGA and the Exadata Smart Flash Cache as large “smart” disk cache. I say smart because it implements some of the same type of Oracle cache management features as the SGA.

Kevin Closson has a couple of good posts outlining the differences between Database Flash Cache and Exadata Smart Flash Cache here:

Pardon Me, Where Is That Flash Cache? Part I.
Pardon Me, Where Is That Flash Cache? Part II.

Note also that Exadata Smart Flash Cache does not affect writes (i.e. it is not a write cache).

So how do we see what’s going on with the Exadata Flash Cache? Well there are a couple of ways.

  1. We can use the cellcli utility on the storage servers themselves.
  2. We can look in v$sesstat (one of the best ways to do that is with Tanel Poder’s snapper script by the way).

Here’s a little output from the system showing method 1 (cellcli):

[root@dm01cel01 ~]# cellcli
CellCLI: Release 11.2.1.2.3 - Production on Fri Apr 30 16:09:29 CDT 2010
 
Copyright (c) 2007, 2009, Oracle.  All rights reserved.
Cell Efficiency Ratio: 38M
 
CellCLI> LIST METRICCURRENT WHERE objectType = 'FLASHCACHE'
         FC_BYKEEP_OVERWR                FLASHCACHE      0.0 MB
         FC_BYKEEP_OVERWR_SEC            FLASHCACHE      0.0 MB/sec
         FC_BYKEEP_USED                  FLASHCACHE      300.6 MB
         FC_BY_USED                      FLASHCACHE      135,533.7 MB
         FC_IO_BYKEEP_R                  FLASHCACHE      10,399.4 MB
         FC_IO_BYKEEP_R_SEC              FLASHCACHE      0.0 MB/sec
         FC_IO_BYKEEP_W                  FLASHCACHE      6,378.3 MB
         FC_IO_BYKEEP_W_SEC              FLASHCACHE      0.0 MB/sec
         FC_IO_BY_R                      FLASHCACHE      480,628.3 MB
         FC_IO_BY_R_MISS                 FLASHCACHE      55,142.4 MB
         FC_IO_BY_R_MISS_SEC             FLASHCACHE      0.0 MB/sec
         FC_IO_BY_R_SEC                  FLASHCACHE      0.1 MB/sec
         FC_IO_BY_R_SKIP                 FLASHCACHE      1,448,220.2 MB
         FC_IO_BY_R_SKIP_SEC             FLASHCACHE      12.8 MB/sec
         FC_IO_BY_W                      FLASHCACHE      178,761.9 MB
         FC_IO_BY_W_SEC                  FLASHCACHE      0.1 MB/sec
         FC_IO_ERRS                      FLASHCACHE      0
         FC_IO_RQKEEP_R                  FLASHCACHE      1051647 IO requests
         FC_IO_RQKEEP_R_MISS             FLASHCACHE      291829 IO requests
         FC_IO_RQKEEP_R_MISS_SEC         FLASHCACHE      0.0 IO/sec
         FC_IO_RQKEEP_R_SEC              FLASHCACHE      0.0 IO/sec
         FC_IO_RQKEEP_R_SKIP             FLASHCACHE      0 IO requests
         FC_IO_RQKEEP_R_SKIP_SEC         FLASHCACHE      0.0 IO/sec
         FC_IO_RQKEEP_W                  FLASHCACHE      176405 IO requests
         FC_IO_RQKEEP_W_SEC              FLASHCACHE      0.0 IO/sec
         FC_IO_RQ_R                      FLASHCACHE      21095663 IO requests
         FC_IO_RQ_R_MISS                 FLASHCACHE      1574404 IO requests
         FC_IO_RQ_R_MISS_SEC             FLASHCACHE      0.6 IO/sec
         FC_IO_RQ_R_SEC                  FLASHCACHE      1.6 IO/sec
         FC_IO_RQ_R_SKIP                 FLASHCACHE      4879720 IO requests
         FC_IO_RQ_R_SKIP_SEC             FLASHCACHE      26.8 IO/sec
         FC_IO_RQ_W                      FLASHCACHE      5665344 IO requests
         FC_IO_RQ_W_SEC                  FLASHCACHE      2.9 IO/sec

Continue reading ‘Oracle Exadata V2 - Flash Cache’ »

Fun with Oracle Exadata V2

Well I’ve been holed up playing with an Exadata V2 machine for the past several weeks. Wow. Very interesting technology.

I must say that I believe the concept of offloading SQL processing to the storage layer is a game changer and I wouldn’t be at all surprised to see this as a standard feature a few years from now. What that means for other storage vendors is unclear at this point. So for this first post on the topic let me just describe the configuration (and some potential upgrades).

The basic architecture consists of a set of database severs and a set of storage servers.

Database Servers:

  • Sun x4170 (1RU 64x server)
  • 2 - Quad-core Intel Xeon E5540 2.53GHz processors
  • 72G Ram (18×4G Dimms - max of 144G using 8G DIMMs)
  • Dual-Port QDR InfiniBand Host Channel Adapter
  • HBA with 512MB Battery Backed Write Cache (only for internal disks???)
  • 4 - 146G internal drives (SAS 10,000 RPM)
  • dual hot swappable power supplies
  • no spare/empty slots!

Here’s what the Database Servers look like:

Storage Servers:

  • Sun x4275 (2RU 64x server)
  • 2 - Quad-core Intel Xeon E5540 (2.53GHz) processors
  • 24G Ram
  • Dual-Port QDR InfiniBand Host Channel Adapter
  • HBA with 512MB Battery Backed Write Cache (only for internal disks???)
  • dual hot swappable power supplies
  • 4 - 96G Sun Flash PCIe Cards (total of 384 GB)
  • 12 - 600 GB 15,000 RPM SAS or 2 TB 7,200 RPM SATA

Continue reading ‘Fun with Oracle Exadata V2’ »

Funny Developer Tricks - first_rows(999999999)

I ran across a funny SQL statement recently (funny strange, not funny ha ha - well actually funny ha ha too I guess). It had a first_rows hint like so:

 
 
select /*+ FIRST_ROWS (999999999)  */ 
"MANDT" , "OPBEL" , "OPUPW" , "OPUPK" , "OPUPZ" , "BUKRS" , "GSBER" , "BUPLA" , "SEGMENT" , 
"AUGST" , "GPART" , "VTREF" , "VTRE2" , "VKONT" , "ABWBL" , "ABWTP" , "ABWKT" , "APPLK" , 
"HVORG" , "TVORG" , "KOFIZ" , "SPART" , "HKONT", "MWSKZ" , "MWSZKZ" , "XANZA" , "STAKZ" , 
"BLDAT" , "BUDAT" , "OPTXT" , "WAERS" , "FAEDN", "FAEDS" , "VERKZ" , "STUDT" , "SKTPZ" , 
"XMANL" , "KURSF" , "BETRH" , "BETRW" , "BETR2" , "BETR3" , "SKFBT" , "SBETH" , "SBETW" , 
"SBET2" , "SBET3" , "MWSKO" , "MWVKO" , "TXRUL" , "SPZAH" , "PYMET" , "PYBUK" , "PERNR" , 
"GRKEY" , "PERSL" , "XAESP" , "AUGDT" , "AUGBL" , "AUGBD" , "AUGRD" , "AUGWA" , "AUGBT" , 
"AUGBS" , "AUGSK" , "AUGVD" , "AUGOB" , "WHANG" , "WHGRP" , "XEIPH" , "MAHNV" , "MANSP" , 
"XAUGP" , "ABRZU" , "ABRZO" , "FDGRP" , "FDLEV" , "FDZTG", "FDWBT" , "XTAUS" , "AUGRS" , 
"PYGRP" , "PDTYP" , "SPERZ" , "INFOZ" , "TXJCD" , "TXDAT" ,"VBUND" , "KONTT" , "KONTL" , 
"OPSTA" , "BLART" , "EMGPA" , "EMBVT" , "EMADR" , "IKEY" , "EUROU" , "XRAGL" , "ASTKZ" , 
"ASBLG" , "XBLNR" , "INKPS" , "RNDPS" , "QSSKZ" , "QSSEW" , "QSPTP" , "QSSHB" , "QBSHB" , 
"QSZNR" , "RFUPK" , "STRKZ" , "FITPR" , "XPYOR" , "LANDL" , "INTBU", "EMCRD" , "C4EYE" , 
"C4EYP" , "SCTAX" , "STTAX" , "STZAL" , "ORUPZ" , "NEGBU" , "SUBAP" , "PSWSL" , "PSWBT" , 
"PSWTX" , "PSGRP" , "FINRE" , "RDSTA" , "RDSTB" , "DEAKTIV" , "SGRKEY", "SOLLDAT" , "RECPT" , 
"TOCOLLECT" , "EINMALANF" , "VORAUSZAHL" , "APERIODIC" , "ABRABS" , "GRBBP" , "ASMETH" , 
"INT_CROSSREFNO" , "ETHPPM" , "PAYFREQID" , "INVOICING_PARTY" , "PPMST" , "LOGNO" , "APERIODICT" , 
"ADD_REFOBJ" , "ADD_REFOBJID" , "ADD_SERVICE" , "ZZAGENCY" , "ZZ_EXT_REF" , "ZZ_PAY_AGENT" , 
"ZZFUNDSOURCE" , "ZZINSTALLMENT" , "Z_PROD_ID" , "ZZUSERNAME" , "ZZWF_STAT" , "ZZPAYCHANNEL" 
FROM "DFKKOP" 
WHERE "MANDT" = :A0 -- NDV=1
AND "BUKRS" = :A1 -- NDV=1 
AND "AUGST" = :A2 -- NDV=2 
AND "FAEDN" < :A3 -- less than today probably all records
AND ( "PYMET" = :A4 OR "PYMET" = :A5 ) -- NDV=8
AND ROWNUM <= :A6; -- less than 1B

Yes - that’s a first rows hint with about a billion as the number of rows to optimizer for.

The reason I noticed it is that it runs for 15 hours before getting a Snapshot Too Old error. The attempted solution was to restart it the next day (thinking maybe it will run better the second time I guess). The table has roughly 100M rows. There was no index on PYMET which is unfortunate as the two values requested account for only about 0.15% (not 15%, 0.15%). The optimizer chooses an index on MANDT, BURKRS, AUGST, FAEDN and as you might expect, it doesn’t work very well (see the NDV comments I added to the statement).

Funny things:

The First_Rows hint is requesting the Oracle optimizer to return the first billion records as fast as possible (even though there are only 100M rows).

The documentation for the First_Rows hint in 11g looks like this:

The FIRST_ROWS hint instructs Oracle to optimize an individual SQL statement for fast response, choosing the plan that returns the first n rows most efficiently. For integer, specify the number of rows to return.

For example, the optimizer uses the query optimization approach to optimize the following statement for best response time:

SELECT /*+ FIRST_ROWS(10) */ employee_id, last_name, salary, job_id
FROM employees
WHERE department_id = 20;

In this example each department contains many employees. The user wants the first 10 employees of department 20 to be displayed as quickly as possible.

So I can see where the developers might have interpreted this as the ever elusive “Go Fast” hint.

The developers also added “and rownum < 999999999" to the where clause which limits the amount of rows that can be returned. I'm not sure whether they knew it or not, but this clause also has the same affect as the hint. That is to say that the clause causes the optimizer to modify it's calculations as if the first_rows_N hint had been applied. Maybe the developers weren't getting the "go fast" behavior they expected from the hint and after doing some research found that the "rownum <" syntax would basically do the same thing. I'm guessing that's the case because I can't see why they would really want to limit the number of rows coming back, but I'm not sure.

It's a very odd statement because the First_Rows hint tends to push the optimizer towards index usage, and this statement was behaving badly precisely because it was using an index (a full table scan only took about 1 hour). Regardless of what the developers were trying to do, the fact that they used such a big number caused the optimizer to ignore the hint anyway. Since the table only had 100M rows and the parameter was 1B, the hint was ignored (well at least the "First K Rows" modifications to the optimizer calculations were not used). This happens to the "rownum <" induced behavior as well by the way.

Here's a bit of a couple of 10053 trace file showing some details:
Continue reading ‘Funny Developer Tricks - first_rows(999999999)’ »