Archive for the ‘11gR2’ Category.

Exadata and Parallel Queuing

Over the years Oracle has added many enhancements in order to allow individual SQL statements to take full advantage of multiprocessor computers. A few months ago Cary Millsap did a talk where he recalled the presentation Larry Ellison did when Oracle first announced the Parallel Query feature. During Larry’s demo he had a multiprocessor computer all to himself. I don’t remember how many processors it had, but I remember he had some kind of graphic showing individual CPU utilization on one screen while he fired up a parallel query on another screen. The monitoring screen lit up like a Christmas tree. Every one of the CPU’s was pegged during his demo. When Cary was telling the story he said that he had wondered at the time what would have happened if there had been other users on the system during the demo. Their experience would probably not have been a good one. I remember having the exact same thought.

Oracle’s parallel capabilities have been a great gift but they have also been a curse because controlling the beast in an environment where there are multiple users trying to share the resources is pretty difficult. There have been many attempts at coming up with a reasonable way of throttling big parallel statements along the way. But to date, I think this technology has only been used effectively in batch processing environments and large data warehouses where consuming the whole machine’s resources is acceptable due to the relatively low degree of concurrency required by those environments.

So why did I mention Exadata in the title of this post? Well I believe that one of the most promising aspects of Exadata is it’s potential with regard to running mixed work loads (OLTP and DW) on the same database, without crippling one or the other. In order to do that, Oracle needs some mechanism to separate the workloads. Resource Manager is an option in this area, but it doesn’t go far enough in controlling throughput on parallel queries. This new queuing mechanism should be a great help in that regard. So let’s review the options:

Parallel Adaptive Multi User (the old way)
This ability to automatically downgrade the degree of parallelism based on what’s happening on the system when a query kicks off is actually a powerful mechanism and is the best approach we’ve had prior to 11g Release 2. The downside of this approach is that parallelized statements can have wildly varying execution times. As you can imagine, a statement that gets 32 slaves one time and then gets downgraded to serial execution the next time will probably not make the user very happy. The argument for this type of approach is that stuff is going to run slower if the system is busy regardless of what you do. And that users expect it to run slower when the system is busy. The first part of that statement may be true but I don’t believe the second part is (at least in most cases). The bigger problem with the downgrade mechanism though is that the the decision about how many slaves to use is based on a single point in time (the point when the parallel statement starts). And once the degree of parallelism (DOP) is set for a statement it can not be changed. That execution of the statement will run with the number of slaves it got to start with, even if additional resources become available. So consider the statement that takes a minute with 32 slaves that gets downgraded to serial due to a momentarily high load. And say that 10 seconds after it starts the system load drops back to more normal levels. Unfortunately, the serialized statement will continue to run for nearly 30 minutes with it’s single process, even though on average the system is not busier than usual.

Parallel Queuing (the new way)
Now let’s compare that with the new mechanism introduced in 11gR2 that allows parallel statements to be queued in a First In - First Out fashion. This mechanism separates (presumably) long running parallel queries from the rest of the workload. The mechanics are pretty simple. Turn the feature on. Set a target number of parallel slaves (parallel_servers_target). Run stuff. If a statement tries to start that requires exceeding the target, it will be queued until the required number of slaves become available.

The Parallel Queuing feature is controlled by a hidden parameter called “_parallel_statement_queuing”. A value of TRUE turns it on and FALSE turns it off. FALSE is the default by the way. This parameter is not documented but is set automatically when the PARALLEL_DEGREE_POLICY parameter is set to AUTO. Unfortunately, PARALLEL_DEGREE_POLICY is one of those parameters that controls more than one thing. When set to AUTO it also turns on Automatic DOP calculation. This feature calculates a DOP for each statement regardless of whether any objects have been decorated with a parallel setting. The result is that all kinds of statements are run in parallel, even if no objects have been specifically defined with a parallel degree setting. This is truly automatic parallel processing because the database decides what to run in parallel and with how many slaves. On top of that, by default, the slaves may be spread across multiple nodes in a RAC database (this can be disabled by setting PARALLEL_FORCE_LOCAL to TRUE). Finally, AUTO is supposed to enable “In Memory Parallel Query”. This poorly named feature refers to 11gR2’s ability to make use of the SGA for parallel query, as opposed to using direct reads exclusively. Note: I haven’t actually seen this kick in yet, which is probably good, since Exadata Offloading depends on direct reads. I haven’t seen it kick in on non-Exadata databases either though.

Unfortunately this combination of features is a little like the wild west with things running in parallel all over the place. But the ability to queue parallel statements does provide some semblance of order. And to be fair, there are a number of parameters that can be set to control how the calculations are performed. Anyway, here’s a brief synopsis of parameter changes caused by the various settings PARALLEL_DEGREE_POLICY.

Continue reading ‘Exadata and Parallel Queuing’ »

Exadata Offload - The Secret Sauce

The “Secret Sauce” for Exadata is it’s ability to offload processing to the storage tier. Offloading means that the storage servers can apply predicate filters at the storage layer, instead of shipping every possible block back to the database server(s). Another thing that happens with offloading is that the volume of data returned can be further reduced by column projection (i.e. if you only select 1 column from a 100 column table, there is no need to return the other 99 columns). Offloading is geared to long running queries that access a large amount of data. Offloading only works if the Oracle decides to use it’s direct path read mechanism. Direct path reads have traditionally been done by parallel query slaves but can also be done by serial queries. In fact, as of 11g, Oracle has changed the decision making process resulting in more aggressive use of serial direct path reads. I’ve seen this feature described as “serial direct path reads” and “adaptive direct path reads”.

I’ll digress here a bit to discuss this feature since direct path reads are critical to Exadata Offloading. Direct path reads do not load blocks into Oracle’s buffer cache. Instead, the data is returned directly to the PGA of the process requesting the data. This means that the data does not have to be in Oracle block format. That means no 8K block that is only partially filled, that may only have a record or two that you’re interested in, containing every column including ones you don’t want, and with additional header information - needs to be shipped back up from the storage layer. Instead, a much more compact result set containing only the columns requested and hopefully only the rows you need are returned. As I said, direct path reads are traditionally used by parallel query slaves. They are also used in a few other instances such as LOB access and sorts that spill over into TEMP. So the ability to use direct path reads is very important to the Exadata platform and thus the changes to the make them more attractive in 11g. Here are a few links to info on the subject of serial direct path reads:

  1. Doug Burns has a good post on 11g serial direct path reads.
  2. Alex Fatkulin has a very good post on some of the factors controlling adaptive direct path reads.
  3. There is a note on MOS (793845.1) on changes in 11g in the heuristics to choose between direct path reads or reads through the buffer cache.
  4. You may also find MOS note (50415.1) on misleading nature of “direct path read” wait events of interest.

Also be aware that direct path reads are only available for full scans (tables or indexes). So any statement that uses an index to get to a row in a table will not use this mechanism. Also keep in mind that direct path requires extra processing to ensure that all blocks on disk are current - (i.e. a check point), so frequently modified tables will suffer some overhead before direct path reads can be initiated.

I must say that I think the changes to the heuristics in 11g may be a little on the aggressive side for non-Exadata platforms (the changes may well be driven by Exadata). And by the way, serial direct path reads are not always faster than the normal reads that go through the buffer cache. Dion Cho has a good post on a performance problem due to serial direct path reads kicking in on one node of an 11g RAC environment (not Exadata). The node doing the direct path reads was running the query much slower than the node using the normal buffer cache reads. He also has a post on turning off serial direct path reads.

But enough about the direct path reads stuff, on to the Offloading. One of the first things I wanted to know when I got my first look at a system running on Exadata was whether a particular query was eligible for offloading and if so, how much of the expected i/o was saved by the feature. So of course I wrote a little script to show me that. Turns out there is plenty of info in V$SQL to see what’s going on. I called the script fsx.sql (short for Find_Sql_eXadata). Here’s a little demo:

Continue reading ‘Exadata Offload - The Secret Sauce’ »

Fun with Oracle Exadata V2

Well I’ve been holed up playing with an Exadata V2 machine for the past several weeks. Wow. Very interesting technology.

I must say that I believe the concept of offloading SQL processing to the storage layer is a game changer and I wouldn’t be at all surprised to see this as a standard feature a few years from now. What that means for other storage vendors is unclear at this point. So for this first post on the topic let me just describe the configuration (and some potential upgrades).

The basic architecture consists of a set of database severs and a set of storage servers.

Database Servers:

  • Sun x4170 (1RU 64x server)
  • 2 - Quad-core Intel Xeon E5540 2.53GHz processors
  • 72G Ram (18×4G Dimms - max of 144G using 8G DIMMs)
  • Dual-Port QDR InfiniBand Host Channel Adapter
  • HBA with 512MB Battery Backed Write Cache (only for internal disks???)
  • 4 - 146G internal drives (SAS 10,000 RPM)
  • dual hot swappable power supplies
  • no spare/empty slots!

Here’s what the Database Servers look like:

Storage Servers:

  • Sun x4275 (2RU 64x server)
  • 2 - Quad-core Intel Xeon E5540 (2.53GHz) processors
  • 24G Ram
  • Dual-Port QDR InfiniBand Host Channel Adapter
  • HBA with 512MB Battery Backed Write Cache (only for internal disks???)
  • dual hot swappable power supplies
  • 4 - 96G Sun Flash PCIe Cards (total of 384 GB)
  • 12 - 600 GB 15,000 RPM SAS or 2 TB 7,200 RPM SATA

Continue reading ‘Fun with Oracle Exadata V2’ »

Increasing Priority of lgwr Process using _high_priority_processes

At the Hotsos Symposium last week, Tanel Poder pointed out an interesting hidden parameter (_high_priority_processes). We were discussing the occasional need to increase priority on the LGWR process when dealing with “log file sync” issues. The aforementioned parameter is used to tell Oracle to automatically set background processes to a higher than normal priority when the database is started. On linux this means putting them in the RR class, in Solaris, the RT class is used. The parameter defaults to LMS* in 10.2 and to LMS*||VKTM in 11gR1. (LMS* are the Lock Manager Server processes in a RAC instance and VKTM is the new Time Keeper process in 11g). So it makes sense that these processes would run at a higher priority.

Recently I have been working on a system that was spending a significant amount of time waiting on “log file sync”. Increasing the priority of the LGWR process made a significant improvement. It’s not a silver bullet though. When the system becomes really busy, the “log file sync” times still suck, just not as much. But I digress, this post is just to point out that Oracle has a built in mechanism to increase priority on background processes, not to tell you how to reduce long “log file sync” times. There are numerous posts on that subject already. Here a couple of pretty good ones:


Manly-Men Only Use Solid State Disk for Redo Logging - Kevin Closson

Tuning Log File Sync Wait Events - Riyaj Shamsudeen
Hotsos 2010 - Day 4 - Doug Burns (see the comments section)

Back to my story. Unfortunately, every time the database is bounced, LGWR goes back to his original priority and class and we have to get the Unix guys to reset it. So I got to wondering if the _high_priority_processes parameter could be used to do this automatically. And sure enough it looks like it can. First here’s a look at the class and priority info of the standard high priority processes on a Solaris system and a Linux system.

Continue reading ‘Increasing Priority of lgwr Process using _high_priority_processes’ »

Temp Usage

There was a post on the oracle-l list the other day regarding disk space usage in the temp tablespace. Rich Jesse posted a neat little script for capturing large consumers of temp space along with the user and SQL statement responsible. I started playing around with the script and noticed that it wasn’t reporting the correct SQL Statement. Turns out there is a bug having to do with the definition of v$sort_usage. It’s apparently been around since 10gR1.

There is a listing on Metalink (sorry,… I mean My Oracle Support). I’ve pasted in the interesting bits here:

Bug 7210183: SQL_ID VALUE IS NOT SAME IN V$TEMPSEG_USAGE AND OTHER VIEWS.
~~~~~~~~~~~
The actual problem here looks to be in the definition of
GV$SORT_USAGE thus:

In RDBMS_MAIN_LINUX_080825:
GV$TEMPSEG_USAGE is a synonym resolving to GV_$SORT_USAGE.

GV_$SORT_USAGE is a VIEW which selects various columns from
GV$SORT_USAGE.

GV$SORT_USAGE has a definition like this:
select x$ktsso.inst_id, username, username, ktssoses, ktssosno,
prev_sql_addr, prev_hash_value, prev_sql_id, <<<<< NOTICE HERE
ktssotsn, ....
from x$ktsso, v$session
where ktssoses = v$session.saddr and ktssosno = v$session.serial#

So GV$SORT_USAGE SQL_ID etc.. are just taken from the sessions
current PREV_* columns. This is not correct. The sort usage
in a session could be tied to any of the open cursors for that
session. The above definition just looks plain wrong.
*** 09/09/08 05:45 pm ***
*** 09/10/08 12:06 am *** (DEL: Impact/Symptom->WRONG RESULTS )
*** 09/10/08 12:06 am *** (ADD: Impact/Symptom->WRONG RESULTS )
*** 09/10/08 12:06 am ***
I am not sure that in all cases using V$SESSION.SQL_ADDRESS
etc.. would be correct.
eg: Consider that the session does something like the
following:
Open cursor 1
Parse , execute and fetch one row from a SQL that needs
temp space in part of the plan
Open cursor 2
select * from dual (no sorting)
At this point V$SESSION.SQL_ID would be for the
select from dual

I do not believe that we actually have the information
available to return the correct SQL_ID etc.. this view.

eg: a. Consider this simple test:
create global temporary table foo ( a number );
insert into foo values(10);
b. In a second session do:
select sql_id from v$sort_usage;
c. Now in the first session issue any old select:
select 99 from dual;
d. And check V$SORT_USAGE:
select sql_id from v$sort_usage;
e. Repeat c and d over using a different SQL each time
in c . eg: select 77 from dual A;

In this example V$SORT_USAGE will show us which session
has the temp space (SESSION_ADDR) but the SQL which
created the temp space is not even available as an open
cursor against that session (the insert created the temp
space).

It looks like this really needs a larger change - something
like capturing the SQL_ID etc.. at the time that the temp seg
gets created and then exposing that information through some
new X$ colums in x$ktsso ?

So I modified the script to report the current sql_id instead.
Here’s a quick replay of the investigation starting with the original query:

SYS@LAB1024> -- original query
SYS@LAB1024> l      
  1  SELECT sysdate "TIME_STAMP", vsu.username, vsu.sql_id, vsu.tablespace,
  2  vsu.usage_mb, vst.sql_text, vp.spid
  3          FROM
  4          (
  5                  SELECT username, sqladdr, sqlhash, sql_id, tablespace, session_addr,
  6  sum(blocks)*8192/1024/1024 "USAGE_MB"
  7                  FROM v$sort_usage
  8                  HAVING SUM(blocks)> 10000 -- 80MB
  9                  GROUP BY username, sqladdr, sqlhash, sql_id, tablespace, session_addr
 10          ) "VSU",
 11          v$sqltext vst,
 12          v$session vs,
 13          v$process vp
 14  WHERE vsu.sql_id = vst.sql_id
 15          AND vsu.sqladdr = vst.address
 16          AND vsu.sqlhash = vst.hash_value
 17          AND vsu.session_addr = vs.saddr
 18          AND vs.paddr = vp.addr
 19          AND vst.piece = 0;
 
no rows selected
 
SYS@LAB1024> -- note the filter to only get sorts using more than 10000 blocks of temp space
SYS@LAB1024> -- (that's why no records were returned)
SYS@LAB1024> -- go to another session and start a big sort, then check again
 
SYS@LAB1024> /
 
TIME_STAM USERNAME        SQL_ID        TABLESPACE        USAGE_MB SQL_TEXT                                                         SPID
--------- --------------- ------------- --------------- ---------- ---------------------------------------------------------------- ------------
19-JAN-10 SYS             9babjv8yq8ru3 TEMP                   145 BEGIN DBMS_OUTPUT.GET_LINES(:LINES, :NUMLINES); END;             31514
 
SYS@LAB1024> -- so now we have a big one
SYS@LAB1024> -- but obviously the get_lines procedure is not causing the big sort
SYS@LAB1024> -- hmmm, what's actually active right now?
SYS@LAB1024> @as
 
no rows selected
 
SYS@LAB1024> l
  1  select sid, substr(program,1,19) prog, address, hash_value, b.sql_id, sql_child_number child, plan_hash_value, executions execs,
  2  (elapsed_time/decode(nvl(executions,0),0,1,executions))/1000000 avg_etime,
  3  sql_text
  4  from v$session a, v$sqlarea b
  5  where status = 'ACTIVE'
  6  and username is not null
  7  and a.sql_id = b.sql_id
  8* and audsid != SYS_CONTEXT('userenv','sessionid')
SYS@LAB1024> del
SYS@LAB1024> /
 
 SID PROG       ADDRESS  HASH_VALUE SQL_ID         CHILD PLAN_HASH_VALUE      EXECS   AVG_ETIME SQL_TEXT
---- ---------- -------- ---------- ------------- ------ --------------- ---------- ----------- -----------------------------------------
 410 sqlplus@ho 74AAF3D0 1572343862 1mkjj1tfvh41q      0      1304592819          5       67.42 SELECT COL1,COL2 FROM KSO.SKEW ORDER BY C
 433 sqlplus@ho 74BB6C20 1310159811 bn8b1wp71fwy3      0       187045271          4         .01 select sid, substr(program,1,19) prog, ad
 
SYS@LAB1024> -- yep, SID 410 is running the query I started in the other session
SYS@LAB1024> -- let' check the prev_sql_id as indicated by the Metalink note
SYS@LAB1024>
SYS@LAB1024> select sid, sql_id, prev_sql_id from v$session where sid=410;
 
 SID SQL_ID        PREV_SQL_ID
---- ------------- -------------
 410 1mkjj1tfvh41q 9babjv8yq8ru3
 
SYS@LAB1024> -- aha, it is prev_sql_id that is exposed in v$sort_usage
SYS@LAB1024> -- just to prove it let's look at the fixed view definition
SYS@LAB1024> get fixed_view_def
  1  select * from v$fixed_view_definition
  2* where view_name like upper('&view_name')
SYS@LAB1024> /
Enter value for view_name: GV$SORT_USAGE
 
VIEW_NAME
------------------------------
VIEW_DEFINITION
-----------------------------------------------------------------------------------------------------------------------------------------------------------
GV$SORT_USAGE
select x$ktsso.inst_id, username, username, ktssoses, ktssosno, prev_sql_addr, prev_hash_value, prev_sql_id, ktssotsn, decode(ktssocnt, 0, 'PERMANENT', 1,
'TEMPORARY'), decode(ktssosegt, 1, 'SORT', 2, 'HASH', 3, 'DATA', 4, 'INDEX', 5, 'LOB_DATA', 6, 'LOB_INDEX' , 'UNDEFINED'), ktssofno, ktssobno, ktssoexts, k
tssoblks, ktssorfno from x$ktsso, v$session where ktssoses = v$session.saddr and ktssosno = v$session.serial#
 
SYS@LAB1024> -- notice the prev_sql_id ...
SYS@LAB1024>
SYS@LAB1024> -- so now let's look at a fixed version (only modified slightly)
SYS@LAB1024> -- since we're already joining to v$session, we can just get the sql_id from there
SYS@LAB1024> -- I rearranged the columns a bit as well 
SYS@LAB1024> -- and joined to dba_tablespaces to allow for various block sizes
SYS@LAB1024> 
SYS@LAB1024> @temp_usage
 
no rows selected
 
SYS@LAB1024> l
  1  SELECT sysdate "TIME_STAMP", vsu.username, vs.sid, vp.spid, vs.sql_id, vst.sql_text, vsu.tablespace,
  2         sum_blocks*dt.block_size/1024/1024 usage_mb
  3     FROM
  4     (
  5             SELECT username, sqladdr, sqlhash, sql_id, tablespace, session_addr,
  6  -- sum(blocks)*8192/1024/1024 "USAGE_MB",
  7                  sum(blocks) sum_blocks
  8             FROM v$sort_usage
  9             HAVING SUM(blocks)> 1000
 10             GROUP BY username, sqladdr, sqlhash, sql_id, tablespace, session_addr
 11     ) "VSU",
 12     v$sqltext vst,
 13     v$session vs,
 14     v$process vp,
 15     dba_tablespaces dt
 16  WHERE vs.sql_id = vst.sql_id
 17  -- AND vsu.sqladdr = vst.address
 18  -- AND vsu.sqlhash = vst.hash_value
 19     AND vsu.session_addr = vs.saddr
 20     AND vs.paddr = vp.addr
 21     AND vst.piece = 0
 22     AND dt.tablespace_name = vsu.tablespace
 23* order by usage_mb
 
SYS@LAB1024> -- start the sort again, and recheck
SYS@LAB1024> /
 
TIME_STAM USERNAME         SID SPID         SQL_ID        SQL_TEXT                                           TABLESPACE        USAGE_MB
--------- --------------- ---- ------------ ------------- -------------------------------------------------- --------------- ----------
19-JAN-10 SYS              410 31514        1mkjj1tfvh41q SELECT COL1,COL2 FROM KSO.SKEW ORDER BY COL3, COL2 TEMP                   191
 
SYS@LAB1024> -- that looks better, that's the statement that is causing the big sort and temp usage

As pointed out in the Metalink note, it’s possible that the current sql_id may not have the correct value in some situations. One specific case that was mentioned was when a statement has completed and the session has issued another statement, but the temp space has not yet been cleaned up. It seems to work pretty well in most situations though.

One other note, it appears that this issue (bug) has not been addressed as of 11.2.0.1.