12c In-Memory on RAC

I started looking into In-Memory on RAC this week. Data can be distributed across RAC nodes in a couple of different ways. The default is to spread it across the available nodes in the cluster. So if you had a 2 node cluster, roughly 50% of the data in your table or partition would be loaded into the column store in each of the 2 instances.

SYS@dw1> alter table kso.skew inmemory;
 
Table altered.
 
SYS@dw1> @gen_ddl
Enter value for object_type: 
Enter value for owner: KSO
Enter value for object_name: SKEW
 
DDL
--------------------------------------------------------------------------------
 
  CREATE TABLE "KSO"."SKEW"
   (    "PK_COL" NUMBER,
        "COL1" NUMBER,
        "COL2" VARCHAR2(30),
        "COL3" DATE,
        "COL4" VARCHAR2(1),
         PRIMARY KEY ("PK_COL")
  USING INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255 INVISIBLE COMPUTE STATISTICS
  STORAGE(INITIAL 865075200 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "USERS"  ENABLE
   ) SEGMENT CREATION IMMEDIATE
  PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255
 NOCOMPRESS LOGGING
  STORAGE(INITIAL 1480589312 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "USERS"
  INMEMORY PRIORITY NONE MEMCOMPRESS FOR QUERY LOW
  DISTRIBUTE AUTO NO DUPLICATE                             <--- here's the RAC bit
   CACHE
 
SYS@dw1> @inmem_segs
Enter value for owner: 
Enter value for segment_name: 
 
OWNER                          SEGMENT_NAME                   ORIG_SIZE_MEGS IN_MEM_SIZE_MEGS COMP_RATIO MEGS_NOT_POPULATED
------------------------------ ------------------------------ -------------- ---------------- ---------- ------------------
                                                                             ----------------
sum
 
no rows selected
 
SYS@dw1> select count(*) from kso.skew;
 
  COUNT(*)
----------
  32000004
 
SYS@dw1> @inmem_segs
Enter value for owner: 
Enter value for segment_name: 
 
OWNER                          SEGMENT_NAME                   ORIG_SIZE_MEGS IN_MEM_SIZE_MEGS COMP_RATIO MEGS_NOT_POPULATED
------------------------------ ------------------------------ -------------- ---------------- ---------- ------------------
KSO                            SKEW                                  1,413.0            391.4        1.7              749.4
                                                                             ----------------
sum                                                                                     391.4
SYS@dw1> -- so about half the data is loaded in the local instance column store
SYS@dw1> -- let's see what's in the other instance's cache
SYS@dw1> l
  1  SELECT v.owner, v.segment_name,
  2  v.bytes/(1024*1024) orig_size_megs,
  3  v.inmemory_size/(1024*1024) in_mem_size_megs,
  4  (v.bytes - v.bytes_not_populated) / v.inmemory_size comp_ratio,
  5  v.bytes_not_populated/(1024*1024) megs_not_populated
  6  FROM v$im_segments v
  7  where owner like nvl('&owner',owner)
  8* and segment_name like nvl('&segment_name',segment_name)
SYS@dw1> l6
  6* FROM v$im_segments v
SYS@dw1> c/v$/gv$/
  6* FROM gv$im_segments v
SYS@dw1> /
Enter value for owner: 
Enter value for segment_name: 
 
OWNER                          SEGMENT_NAME                   ORIG_SIZE_MEGS IN_MEM_SIZE_MEGS COMP_RATIO MEGS_NOT_POPULATED
------------------------------ ------------------------------ -------------- ---------------- ---------- ------------------
KSO                            SKEW                                  1,413.0            569.1        1.6              526.6
KSO                            SKEW                                  1,413.0            391.4        1.7              749.4
                                                                             ----------------
sum                                                                                     960.5

So in this example with a non-partitioned table, the data is automatically spread (roughly 50/50) across the 2 instances. That’s pretty cool. It allows the column store to be scaled via adding RAC nodes. For partitioned tables there are a couple of options as to how the data can be distributed. You can distribute by partition or by rowid range. The default for a partitioned table appears to be to distribute partitions to the nodes in a round robin fashion not ordered by number of rows or size (as I would have expected), but then again I haven’t done much testing on it. Anyway, here’s a simple example using a partitioned table.

SYS@dw1> create table kso.skewp partition by range (col1) (partition "P1" values less than (2), partition p2 values less than (100000), partition p3 values less than (maxvalue)) as select * from kso.skew;
 
Table created.
 
SYS@dw1> select count(*) from kso.skewp partition (p1);
 
  COUNT(*)
----------
   3199971
 
SYS@dw1> select count(*) from kso.skewp partition (p2);
 
  COUNT(*)
----------
      8512
 
SYS@dw1> select count(*) from kso.skewp partition (p3);
 
  COUNT(*)
----------
  28791521
 
SYS@dw1> alter table kso.skewp inmemory;
 
Table altered.
 
SYS@dw1> select count(*) from kso.skewp;
 
  COUNT(*)
----------
  32000004
 
SYS@dw1> @inmem_part_segs
Enter value for owner: 
Enter value for segment_name: SKEWP
 
OWNER                          SEGMENT_NAME                   PARTITION_NAME                 ORIG_SIZE_MEGS IN_MEM_SIZE_MEGS COMP_RATIO MEGS_NOT_POPULATED
------------------------------ ------------------------------ ------------------------------ -------------- ---------------- ---------- ------------------
KSO                            SKEWP                          P1                                      128.0             55.4        2.3                 .0
KSO                            SKEWP                          P3                                    1,220.0            349.6        1.5              689.3
                                                                                                            ----------------
sum                                                                                                                    405.1
 
SYS@dw1> @inmem_part_segs_g
Enter value for owner: 
Enter value for segment_name: SKEWP
 
   INST_ID OWNER                          SEGMENT_NAME                   PARTITION_NAME                 ORIG_SIZE_MEGS IN_MEM_SIZE_MEGS COMP_RATIO MEGS_NOT_POPULATED
---------- ------------------------------ ------------------------------ ------------------------------ -------------- ---------------- ---------- ------------------
         1 KSO                            SKEWP                          P1                                      128.0             55.4        2.3                 .0
         1 KSO                            SKEWP                          P3                                    1,220.0            878.1        1.4                 .0
         2 KSO                            SKEWP                          P2                                        8.0              1.1        7.1                 .0
                                                                                                                       ----------------
sum                                                                                                                               934.7
 
SYS@dw1> -- very skewed distribution, P1 and P3 on node 1, and P2 on node 2
SYS@dw1> -- let's try again
SYS@dw1> alter table kso.skewp inmemory distribute by rowid range;
 
Table altered.
 
SYS@dw1> select count(*) from kso.skewp;
 
  COUNT(*)
----------
  32000004
 
SYS@dw1> @inmem_part_segs
Enter value for owner: 
Enter value for segment_name: SKEWP
 
set lines 200
OWNER                          SEGMENT_NAME                   PARTITION_NAME                 ORIG_SIZE_MEGS IN_MEM_SIZE_MEGS COMP_RATIO MEGS_NOT_POPULATED
------------------------------ ------------------------------ ------------------------------ -------------- ---------------- ---------- ------------------
KSO                            SKEWP                          P1                                      128.0             55.4        2.3                 .0
KSO                            SKEWP                          P3                                    1,220.0            605.2        1.4              358.9
                                                                                                            ----------------
sum                                                                                                                    660.6
 
 
SYS@dw1> @inmem_part_segs_g.sql
Enter value for owner: 
Enter value for segment_name: SKEWP
 
   INST_ID OWNER                          SEGMENT_NAME                   PARTITION_NAME                 ORIG_SIZE_MEGS IN_MEM_SIZE_MEGS COMP_RATIO MEGS_NOT_POPULATED
---------- ------------------------------ ------------------------------ ------------------------------ -------------- ---------------- ---------- ------------------
         1 KSO                            SKEWP                          P1                                      128.0             55.4        2.3                 .0
         1 KSO                            SKEWP                          P3                                    1,220.0            605.2        1.4              358.9
         2 KSO                            SKEWP                          P2                                        8.0              1.1        7.1                 .0
         2 KSO                            SKEWP                          P3                                    1,220.0            263.0        1.6              793.5
                                                                                                                       ----------------
sum                                                                                                                               924.8

So the default distribution mode for partitioned tables appears to be round robin by partition. This can result in very skewed data distributions as in this example. Using the DISTRIBUTE BY ROWID RANGE syntax we can get a more even distribution of data across the nodes. In this case it put the 2 smaller partitions on separate nodes and split the largest partition between the two (asymmetrically by the way, so that the data was actually spread pretty much equally).

There is another option which allows all the data to be duplicated across the instances. This is a high availability option. If a node needs to be restarted, query times could be impacted until the data is reloaded. Note that queries would still execute without error, but the data would have to be processed through the normal row oriented process. The DUPLICATE option populates the column store of both nodes with all the data for the object, in order to minimize performance impact if a node goes down. The documentation says this feature is only available on engineered systems by the way. Let’s have a look.

SYS@dw1> alter table kso.skew inmemory duplicate;
 
Table altered.
 
SYS@dw1> @gen_ddl
Enter value for object_type: 
Enter value for owner: KSO
Enter value for object_name: SKEW
 
DDL
--------------------------------------------------------------------------------
 
  CREATE TABLE "KSO"."SKEW"
   (    "PK_COL" NUMBER,
        "COL1" NUMBER,
        "COL2" VARCHAR2(30),
        "COL3" DATE,
        "COL4" VARCHAR2(1),
         PRIMARY KEY ("PK_COL")
  USING INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255 INVISIBLE COMPUTE STATISTICS
  STORAGE(INITIAL 865075200 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "USERS"  ENABLE
   ) SEGMENT CREATION IMMEDIATE
  PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255
 NOCOMPRESS LOGGING
  STORAGE(INITIAL 1480589312 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "USERS"
  INMEMORY PRIORITY NONE MEMCOMPRESS FOR QUERY LOW
  DISTRIBUTE AUTO DUPLICATE
   CACHE
 
SYS@dw1> select count(*) from kso.skew;
 
  COUNT(*)
----------
  32000004
 
SYS@dw1> @inmem_segs
Enter value for owner: 
Enter value for segment_name: 
 
OWNER                          SEGMENT_NAME                   ORIG_SIZE_MEGS IN_MEM_SIZE_MEGS COMP_RATIO MEGS_NOT_POPULATED
------------------------------ ------------------------------ -------------- ---------------- ---------- ------------------
KSO                            SKEW                                  1,413.0            956.4        1.5                 .0
                                                                             ----------------
sum                                                                                     956.4
 
SYS@dw1> @inmem_segs_g.sql
Enter value for owner: 
Enter value for segment_name: 
 
   INST_ID OWNER                          SEGMENT_NAME                   ORIG_SIZE_MEGS IN_MEM_SIZE_MEGS COMP_RATIO MEGS_NOT_POPULATED POPULATE_
---------- ------------------------------ ------------------------------ -------------- ---------------- ---------- ------------------ ---------
         1 KSO                            SKEW                                  1,413.0            956.4        1.5                 .0 COMPLETED
         2 KSO                            SKEW                                  1,413.0            965.4        1.5                 .0 COMPLETED
                                                                                        ----------------
sum                                                                                              1,921.9

So the DUPLICATE keyword caused the entire table to be loaded into both instances (i.e. DUPLICATEd). Note that this test was done on an Exadata (an engineered system) so it should work.

How the data is accessed across the nodes is also interesting. When accessing data spread across two nodes with a serial plan, all the work appears to be done by the server process handling the connection on the driving node. (i.e. there don’t appear to be any processes on the remote that are burning CPU) When accessing data with a parallel plan on the other hand, slave processes are used on both nodes. This makes sense because that capability already existed. In this case though, the slaves appear to automatically be spread across the nodes at least when the data is distributed (i.e. not duplicated). This is in contrast to the normal spreading of slaves across a cluster which tends to want to keep them on the same node if possible. Anyway, here is an example. Note that this example has a very unusual SQL Monitor output.

SYS@dw1> select /*+ parallel */ avg(pk_col+col1) from kso.skewp a;
 
AVG(PK_COL+COL1)
----------------
      16588749.5
 
SYS@dw1> @rsm
Enter value for sql_id: 
Enter value for sql_exec_id: 
 
REPORT
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL Monitoring Report
 
SQL Text
------------------------------
select /*+ parallel */ avg(pk_col+col1) from kso.skewp a
 
Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  SYS (141:54624)
 SQL ID              :  0qmdbnszhb2nh
 SQL Execution ID    :  16777218
 Execution Started   :  09/16/2014 22:01:27
 First Refresh Time  :  09/16/2014 22:01:26
 Last Refresh Time   :  09/16/2014 22:01:28
 Duration            :  1s
 Module/Action       :  sqlplus@enkdb03.enkitec.com (TNS V1-V3)/-
 Service             :  SYS$USERS
 Program             :  sqlplus@enkdb03.enkitec.com (TNS V1-V3)
 Fetch Calls         :  1
 
Global Stats
===================================================================================================
| Elapsed |   Cpu   |    IO    | Application |  Other   | Fetch | Buffer | Read | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Calls |  Gets  | Reqs | Bytes | Offload |
===================================================================================================
|    3.67 |    3.66 |     0.00 |        0.00 |     0.00 |     1 |  81893 |    1 | 352KB |  59.84% |
===================================================================================================
 
Parallel Execution Details (DOP=3 , Servers Allocated=3)
 Instances  : 2
 
=======================================================================================================================================================
| Instance |      Name      | Type  | Server# | Elapsed |   Cpu   |    IO    | Application |  Other   | Buffer | Read | Read  |  Cell   | Wait Events |
|          |                |       |         | Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) |  Gets  | Reqs | Bytes | Offload | (sample #)  |
=======================================================================================================================================================
| 1        | PX Coordinator | QC    |         |    0.01 |    0.01 |          |        0.00 |     0.00 |     13 |      |     . |    NaN% |             |
| 1        | p000           | Set 1 |       1 |    1.05 |    1.05 |          |             |          |     21 |      |     . |    NaN% |             |
| 1        | p001           | Set 1 |       2 |    1.01 |    1.01 |     0.00 |             |          |     69 |    1 | 352KB |  59.84% |             |
| 2        | p000           | Set 1 |       3 |    1.60 |    1.60 |          |             |          |  81790 |      |     . |    NaN% |             |
=======================================================================================================================================================
 
Instance Drill-Down
====================================================================================================================================
| Instance | Process Names | Elapsed |   Cpu   |    IO    | Application |  Other   | Buffer | Read | Read  |  Cell   | Wait Events |
|          |               | Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) |  Gets  | Reqs | Bytes | Offload |             |
====================================================================================================================================
|    1     | QC p000 p001  |    2.07 |    2.06 |     0.00 |        0.00 |     0.00 |    103 |    1 | 352KB |  59.84% |             |
|    2     | p000          |    1.60 |    1.60 |          |             |          |  81790 |      |       |    NaN% |             |
====================================================================================================================================
 
SQL Plan Monitoring Details (Plan Hash Value=2545210427)
============================================================================================================================================================================
| Id |             Operation             |   Name   |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  |  Cell   |  Mem  | Activity | Activity Detail |
|    |                                   |          | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | Offload | (Max) |   (%)    |   (# samples)   |
============================================================================================================================================================================
|  0 | SELECT STATEMENT                  |          |         |      |         1 |     +1 |     1 |        1 |      |       |         |       |          |                 |
|  1 |   SORT AGGREGATE                  |          |       1 |      |         1 |     +1 |     1 |        1 |      |       |         |       |          |                 |
|  2 |    PX COORDINATOR                 |          |         |      |         1 |     +1 |     4 |        3 |      |       |         |       |          |                 |
|  3 |     PX SEND QC (RANDOM)           | :TQ10000 |       1 |      |         1 |     +1 |     3 |        3 |      |       |         |       |          |                 |
|  4 |      SORT AGGREGATE               |          |       1 |      |         1 |     +1 |     3 |        3 |      |       |         |       |    33.33 | Cpu (1)         |
|  5 |       PX BLOCK ITERATOR           |          |     32M |  783 |         1 |     +1 |     3 |      32M |      |       |         |       |          |                 |
|  6 |        TABLE ACCESS INMEMORY FULL | SKEWP    |     32M |  783 |         2 |     +0 |    31 |      32M |    1 | 352KB |  60.00% |    1M |    66.67 | in memory (2)   |
============================================================================================================================================================================

Can you see why this one is so unusual? This plan showed a TABLE ACCESS INMEMORY FULL, which had both inmemory access and offloading via smart scan (probably on different partitions). I’m not at all sure why that happened though. If you know please enlighten me.

12 Comments

  1. kevinclosson says:

    Kerry,

    I want to just take a moment to say thanks for these detailed posts on column store functionality! Very helpful!

  2. Sidney Chen says:

    Hi Kerry,

    Thanks for the sharing.

    If the data scanned from in memory columnar store, seems the buffer gets will not increase, it makes sense since we are not doing traditional logical reads. From the last sql monitor report, the buffer gets for the P000 slave on instance 2 is 81790, which is around 600M, I guess if some partitions of SKEWP on instance 2 is not loaded into in memory columnar store.

  3. osborne says:

    Hi Sidney,

    I haven’t looked too closely at statistics being gathered during execution yet, but v$sql has a column called buffer_gets. It is not incremented when accessing objects in the column_store. There are 3 columns that store data about IM though. There are also a lot of stats being collected. Check out v$mystat for your session and look for %IM% in the stat name. I’ll do a post on that if I can find the time (or maybe Tanel will beat me to it).

    Kerry

  4. Fairlie Rego says:

    Hi Kerry,

    Am unable to get the in memory column store to kick in on an ADG environment.
    Do you know if this is a documented restriction.?

    Ta
    Fairlie

  5. Fairlie Rego says:

    Ok found the documentation :-)

    The IM column store cannot be used on an Active Data Guard standby instance in the current release.
    However it can be used in a Logical Standby instance and in an instance maintained using Oracle
    Golden Gate.

    Cheers
    Fairlie

  6. osborne says:

    Hey Fairlie,

    Thanks for letting me know. I hadn’t had a chance to look into that yet. Lot’s of stuff to investigate these days. :) Are you by any chance going to OOW? It would be good to put a face with a name.

    Kerry

  7. Bertrand says:

    Hi Kerry,

    Regarding the In-Memory data distribution on RAC, I tried to get rid of the default behaviour on non Engineered Systems (“Every row of the test table is stored in the IMCS of either one instance or the other”) for 2 reasons:

    1) Christian Antognini finding: It could lead to bad performance.
    2) In case of service defined as preferred/available, it would make sense to have the IMCS populated with all the DATA on the “active” Instance.

    To get rid of the default behaviour, we can set the “_inmemory_auto_distribute” hidden parameter to false.

    I tested it for:

    RAC (non CDB) here: https://bdrouvot.wordpress.com/2014/11/20/in-memory-instance-distribution-with-rac-databases-i-want-at-least-one-instance-that-contains-all-the-data/
    RAC with PDB here: https://bdrouvot.wordpress.com/2014/11/21/in-memory-instance-distribution-with-rac-and-multitenant-environment/
    RAC with PDB and service defined as preferred/available here (Note that in that case FAN callouts can be enough):https://bdrouvot.wordpress.com/2014/12/04/fan-callouts-for-in-memory-instance-distribution-with-rac-pdb-and-preferredavailable-service/

    Thx
    Bertrand

  8. Tarun says:

    Hi Kerry,
    Here is my scenario. I have 2 node RAC.

    My table size on disk is
    —-
    select owner , segment_name,sum(bytes) “Bytes”
    from dba_segments
    where owner=’XXCCS_O’
    and segment_name=’XXCCS_DS_SL_CVR’
    group by owner, segment_name;
    ————–
    OWNER SEGMENT_NAME Bytes
    XXCCS_O XXCCS_DS_SL_CVR 1,946,157,056
    ————-
    When I load a subset of columns from this table inmemory I get the following output from the following query
    —–
    SELECT v.inst_id,v.owner,
    v.segment_name,
    v.segment_type,
    v.bytes,
    v.inmemory_size,
    v.inmemory_duplicate
    FROM gv$im_segments v
    where owner=’XXCCS_O’
    and segment_name=’XXCCS_DS_SL_CVR’
    —–
    INST_ID OWNER SEGMENT_NAME SEGMENT_TYPE BYTES INMEMORY_SIZE INMEMORY_DUPLICATE
    2 XXCCS_O XXCCS_DS_SL_CVR TABLE 1,946,157,056 356,909,056 NO DUPLICATE
    1 XXCCS_O XXCCS_DS_SL_CVR TABLE 1,946,157,056 356,909,056 NO DUPLICATE
    ——-
    Table Size from dba_segments is – 1,946,157,056 bytes
    Now on both the RAC instances it shows Bytes loaded as – 1,946,157,056
    bytes as shown in the above output.

    This is confusing me because I thought that it would distribute the data across both the nodes as the data is loaded with NO DUPLICATE as shown in the gv$im_segments. But looking at the numbers it shows that the same bytes have been loaded across both the RAC Nodes.

    Can you please explain the numbers here?

    I don’t want to load the same data inmemory across both the Nodes but want to divide and distribute the data across the nodes as I have limited memory space and want to be sure I am not doing anything wrong here.
    Regards
    Tarun

  9. ThePhantomDataWrangler says:

    It’s kind of annoying. The various DISTRIBUTE and REPLICATE options are quietly (or not so quietly) ignored on non Exadata systems. There’s no way to build node affinities by manually specifying the target node for distribution (e.g. by partition), so any cleverness in RAC design has to spill out to the application to perhaps query gv$im_segment metadata and deliberately target one node, unless using PQ.

  10. […] Kerry Osborne show us how we can distribute the data (using the distribute INMEMORY attribute) across the nodes into this blog post. […]

Leave a Reply