Infrastructure at your Service

Franck Pachot

Create constraints in your datawarehouse – why and how

We still see some developers not declaring referential integrity constraints in datawarehouse databases because they think they don’t need it (integrity of data has been validated by the ETL). Here is a small demo I did to show why you need to declare them, and how to do it to avoid any overhead on the ETL.

Test case

I create 3 dimension tables and 1 fact table:

21:01:18 SQL> create table DIM1 (DIM1_ID number, DIM1_ATT1 varchar2(20));
Table DIM1 created.
 
21:01:19 SQL> create table DIM2 (DIM2_ID number, DIM2_ATT1 varchar2(20));
Table DIM2 created.
 
21:01:20 SQL> create table DIM3 (DIM3_ID number, DIM3_ATT1 varchar2(20));
Table DIM3 created.
 
21:01:21 SQL> create table FACT (DIM1_ID number, DIM2_ID number, DIM3_ID number,MEAS1 number);
Table FACT created.

I insert 10 million rows into the fact table:

21:01:22 SQL> insert into FACT select mod(rownum,3),mod(rownum,5),mod(rownum,10),rownum from xmltable('1 to 10000000');
10,000,000 rows inserted.
 
Elapsed: 00:00:18.983

and fill the dimension tables from it:

21:01:42 SQL> insert into DIM1 select distinct DIM1_ID,'...'||DIM1_ID from FACT;
3 rows inserted.
 
Elapsed: 00:00:01.540
 
21:01:52 SQL> insert into DIM2 select distinct DIM2_ID,'...'||DIM2_ID from FACT;
5 rows inserted.
 
Elapsed: 00:00:01.635
 
21:01:57 SQL> insert into DIM3 select distinct DIM3_ID,'...'||DIM3_ID from FACT;
10 rows inserted.
 
Elapsed: 00:00:01.579
 
21:01:58 SQL> commit;
Commit complete.

Query joining fact with one dimension

I’ll run the following query:

21:01:58 SQL> select count(*) from FACT join DIM1 using(DIM1_ID);
 
COUNT(*)
--------
10000000
 
Elapsed: 00:00:01.015

Here is the execution plan:

21:02:12 SQL> select * from dbms_xplan.display_cursor();
 
PLAN_TABLE_OUTPUT
-----------------
SQL_ID 4pqjrjkc7sn17, child number 0
-------------------------------------
select count(*) from FACT join DIM1 using(DIM1_ID)
 
Plan hash value: 1826335751
 
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 7514 (100)| |
| 1 | SORT AGGREGATE | | 1 | 26 | | |
|* 2 | HASH JOIN | | 10M| 253M| 7514 (2)| 00:00:01 |
| 3 | TABLE ACCESS FULL| DIM1 | 3 | 39 | 3 (0)| 00:00:01 |
| 4 | TABLE ACCESS FULL| FACT | 10M| 126M| 7482 (1)| 00:00:01 |
----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
2 - access("FACT"."DIM1_ID"="DIM1"."DIM1_ID")
 
Note
-----
- dynamic statistics used: dynamic sampling (level=2)

Actually, we don’t need that join. A dimension table has two goals:

  • filter facts on the dimension attributes. Example: filter on customer last name
  • add dimension attributes to the result. Example: add customer first name

Here, there is no WHERE clause on DIM1 columns, and no columns from DIM1 selected. We don’t need to join to DIM1. However, we often see those useless joins for two reasons:

  • We query a view that joins the fact with all dimensions
  • The query is generated by a reporting tool which always join to dimensions

Join elimination

The Oracle optimizer is able to remove those kinds of unnecessary joins. But one information is missing here for the optimizer. We know that all rows in the fact table have a matching row in each dimension, but Oracle doesn’t know that. And if there is no mathing row, then the inner join should not return the result. For this reason, the join must be done.

Let’s give this information to the optimizer: declare the foreign key from FACT to DIM1 so that Oracle knows that there is a many-to-one relationship:

21:02:17 SQL> alter table DIM1 add constraint DIM1_PK primary key(DIM1_ID);
Table DIM1 altered.
 
Elapsed: 00:00:00.051
 
21:02:20 SQL> alter table FACT add constraint DIM1_FK foreign key(DIM1_ID) references DIM1;
Table FACT altered.
 
Elapsed: 00:00:03.210

I’ve spent 3 seconds here to create this foreign key (would have been much longer with a real fact table and lot of columns and rows) but now, the optimizer is able to eliminate the join:

21:02:24 SQL> select count(*) from FACT join DIM1 using(DIM1_ID);
 
COUNT(*)
--------
10000000
 
21:02:25 SQL> select * from dbms_xplan.display_cursor();
 
PLAN_TABLE_OUTPUT
-----------------
SQL_ID 4pqjrjkc7sn17, child number 0
-------------------------------------
select count(*) from FACT join DIM1 using(DIM1_ID)
 
Plan hash value: 3735838348
 
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 7488 (100)| |
| 1 | SORT AGGREGATE | | 1 | 13 | | |
|* 2 | TABLE ACCESS FULL| FACT | 10M| 126M| 7488 (1)| 00:00:01 |
---------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
2 - filter("FACT"."DIM1_ID" IS NOT NULL)
 
Note
-----
- dynamic statistics used: dynamic sampling (level=2)

No join needed here, the query is faster. This is exactly the point of this blog post: to show you that declaring constraints improve performance of queries. It adds information to the optimizer, like statistics. Statistics gives estimated cardinalities. Foreign keys are exact cardinality (many-to-one).

No validate

When loading a datawarehouse, you usually don’t need to validate the constraints because data was bulk loaded from a staging area where all data validation has been done. You don’t want to spend time validating constraints (the 3 seconds in my small example above) and this is why some datawarehouse developers do not declare constraints.

However, we can declare constraints without validating them. Let’s do that for the second dimension table:

21:02:34 SQL> alter table DIM2 add constraint DIM2_PK primary key(DIM2_ID) novalidate;
Table DIM2 altered.
 
Elapsed: 00:00:00.018
%nbsp;
21:02:35 SQL> alter table FACT add constraint DIM2_FK foreign key(DIM2_ID) references DIM2 novalidate;
Table FACT altered.
 
Elapsed: 00:00:00.009

That was much faster than the 3 seconds we had for the ‘validate’ constraint which is the default. Creating a constraint in NOVALIDATE is immediate and do not depend on the size of the table.

However this is not sufficient to get the join elimination:

21:02:39 SQL> select count(*) from FACT join DIM2 using(DIM2_ID);
 
COUNT(*)
--------
10000000
 
21:02:40 SQL> select * from dbms_xplan.display_cursor();
 
PLAN_TABLE_OUTPUT
-----------------
SQL_ID 4t9g2n6duw0jf, child number 0
-------------------------------------
select count(*) from FACT join DIM2 using(DIM2_ID)
 
Plan hash value: 3858910383
 
-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 7518 (100)| |
| 1 | SORT AGGREGATE | | 1 | 26 | | |
|* 2 | HASH JOIN | | 10M| 253M| 7518 (2)| 00:00:01 |
| 3 | INDEX FULL SCAN | DIM2_PK | 5 | 65 | 1 (0)| 00:00:01 |
| 4 | TABLE ACCESS FULL| FACT | 10M| 126M| 7488 (1)| 00:00:01 |
-------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
2 - access("FACT"."DIM2_ID"="DIM2"."DIM2_ID")
 
Note
-----
- dynamic statistics used: dynamic sampling (level=2)

The constraint ensures that no rows will be inserted without a matching row in the dimension. However, because Oracle has not validated the result itself, it does not apply the join elimination, just in case a previously existing row has no matching dimension.

Rely novalidate

If you want the optimizer to do the join elimination on a ‘novalidate’ constraint, then it has to trust you and rely on the constraint you have validated.

RELY is an attribute of the constraint that you can set:

21:02:44 SQL> alter table DIM2 modify constraint DIM2_PK rely;
Table DIM2 altered.
 
Elapsed: 00:00:00.016
 
21:02:45 SQL> alter table FACT modify constraint DIM2_FK rely;
Table FACT altered.
 
Elapsed: 00:00:00.010

But this is not sufficient. You told Oracle to rely on your constraint, but Oracle must trust you.

Trusted

The join elimination is a rewrite of the query and, by default, rewrite is enabled but only when integrity is enforced by Oracle:

21:02:50 SQL> show parameter query_rewrite
NAME TYPE VALUE
----------------------- ------ --------
query_rewrite_enabled string TRUE
query_rewrite_integrity string ENFORCED

Let’s allow our session to have rewrite transformations to trust our RELY constraints:

21:02:52 SQL> alter session set query_rewrite_integrity=trusted;
Session altered.

Now, joining to DIM2 without using DIM2 columns outside of the join allows join elimination:

21:02:57 SQL> select count(*) from FACT join DIM2 using(DIM2_ID);
 
COUNT(*)
--------
10000000
 
Elapsed: 00:00:00.185
21:02:58 SQL> select * from dbms_xplan.display_cursor();
 
PLAN_TABLE_OUTPUT
-----------------
SQL_ID 4t9g2n6duw0jf, child number 0
-------------------------------------
select count(*) from FACT join DIM2 using(DIM2_ID)
 
Plan hash value: 3735838348
 
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 7494 (100)| |
| 1 | SORT AGGREGATE | | 1 | 13 | | |
|* 2 | TABLE ACCESS FULL| FACT | 10M| 126M| 7494 (1)| 00:00:01 |
---------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
2 - filter("FACT"."DIM2_ID" IS NOT NULL)
 
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
- rely constraint used for this statement

In 12.2 the execution plan has a note to show that the plan depends on RELY constraint.

From this example, you can see that you can, and should, create RELY NOVALIDATE constraints on tables where you know the existing data is valid. They are immediately created, without any overhead on the load process, and helps to improve queries generated on your dimensional model.

Rely Disable

I said that a NOVALIDATE constraint has no overhead when created, but you may have further inserts or updates in your datawarehouse. And then, those constraints will have to be verified and this may add a little overhead. In this case, you can go further and disable the constraint:

21:03:04 SQL> alter table DIM3 add constraint DIM3_PK primary key(DIM3_ID) rely;
Table DIM3 altered.
 
Elapsed: 00:00:00.059
 
21:03:05 SQL> alter table FACT add constraint DIM3_FK foreign key(DIM3_ID) references DIM3 rely disable novalidate;
Table FACT altered.
 
Elapsed: 00:00:00.014

Note that I had to set the referenced constraint DIM3_PK to RELY here, even if it is enable and validate, or I would get: ORA-25158: Cannot specify RELY for foreign key if the associated primary key is NORELY.

My session still trusts RELY constraints for query rewrite:

21:03:07 SQL> show parameter query_rewrite
 
NAME TYPE VALUE
----------------------- ------ -------
query_rewrite_enabled string TRUE
query_rewrite_integrity string TRUSTED

Now, the join elimination occurs:

21:03:08 SQL> select count(*) from FACT join DIM3 using(DIM3_ID);
 
COUNT(*)
--------
10000000
 
21:03:09 SQL> select * from dbms_xplan.display_cursor();
 
PLAN_TABLE_OUTPUT
-----------------
SQL_ID 3bhs523zyudf0, child number 0
-------------------------------------
select count(*) from FACT join DIM3 using(DIM3_ID)
 
Plan hash value: 3735838348
 
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 7505 (100)| |
| 1 | SORT AGGREGATE | | 1 | 13 | | |
|* 2 | TABLE ACCESS FULL| FACT | 11M| 138M| 7505 (1)| 00:00:01 |
---------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
2 - filter("FACT"."DIM3_ID" IS NOT NULL)
 
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
- rely constraint used for this statement

So, we can still benefit from the query optimization even with the RELY DISABLE NOVALIDATE.

But I would not recommend this. Be careful. Here are my foreign key constraints:

21:03:15 SQL> select table_name,constraint_type,constraint_name,status,validated,rely from all_constraints where owner='
DEMO' and table_name='FACT' order by 4 desc,5 desc,6 nulls last;
 
TABLE_NAME CONSTRAINT_TYPE CONSTRAINT_NAME STATUS VALIDATED RELY
---------- --------------- --------------- ------ --------- ----
FACT R DIM1_FK ENABLED VALIDATED
FACT R DIM2_FK ENABLED NOT VALIDATED RELY
FACT R DIM3_FK DISABLED NOT VALIDATED RELY

For DIM1_FK and DIM2_FK the constraints prevent us from inconsistencies:

21:03:17 SQL> insert into FACT(DIM1_ID)values(666);
 
Error starting at line : 1 in command -
insert into FACT(DIM1_ID)values(666)
Error report -
ORA-02291: integrity constraint (DEMO.DIM1_FK) violated - parent key not found

But the disabled one will allow inconsistencies:

21:03:19 SQL> insert into FACT(DIM3_ID)values(666);
1 row inserted.

That’s bad. I rollback this immediately:

21:03:20 SQL> rollback;
Rollback complete.

Star transformation

Join elimination is not the only transformation that needs to know about the many-to-one relationship between fact tables and dimensions. You usually create a bitmap index on each foreign key to the dimension, to get the higher selectivity when looking at the table rows from the combination of criteria on the dimension attributes.

21:03:24 SQL> create bitmap index FACT_DIM1 on FACT(DIM1_ID);
Index FACT_DIM1 created.
 
21:03:29 SQL> create bitmap index FACT_DIM2 on FACT(DIM2_ID);
Index FACT_DIM2 created.
 
21:03:33 SQL> create bitmap index FACT_DIM3 on FACT(DIM3_ID);
Index FACT_DIM3 created.

Here is the kind of query with predicates on each dimension attributes:

21:03:35 SQL> select count(*) from FACT
2 join DIM1 using(DIM1_ID) join DIM2 using(DIM2_ID) join DIM3 using(DIM3_ID)
3 where dim1_att1='...0' and dim2_att1='...0' and dim3_att1='...0';
 
COUNT(*)
--------
333333

By default, the optimizer applies those predicates on the dimension and do a cartesian join to get all accepted combinations of dimension IDs. Then the rows can be fetched from the table:

21:03:37 SQL> select * from dbms_xplan.display_cursor();
 
PLAN_TABLE_OUTPUT
-----------------
SQL_ID 01jmjv0sz1dpq, child number 0
-------------------------------------
select count(*) from FACT join DIM1 using(DIM1_ID) join DIM2
using(DIM2_ID) join DIM3 using(DIM3_ID) where dim1_att1='...0' and
dim2_att1='...0' and dim3_att1='...0'
 
Plan hash value: 1924236134
 
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 5657 (100)| |
| 1 | SORT AGGREGATE | | 1 | 114 | | |
| 2 | NESTED LOOPS | | 55826 | 6215K| 5657 (1)| 00:00:01 |
| 3 | MERGE JOIN CARTESIAN | | 1 | 75 | 9 (0)| 00:00:01 |
| 4 | MERGE JOIN CARTESIAN | | 1 | 50 | 6 (0)| 00:00:01 |
|* 5 | TABLE ACCESS FULL | DIM1 | 1 | 25 | 3 (0)| 00:00:01 |
| 6 | BUFFER SORT | | 1 | 25 | 3 (0)| 00:00:01 |
|* 7 | TABLE ACCESS FULL | DIM2 | 1 | 25 | 3 (0)| 00:00:01 |
| 8 | BUFFER SORT | | 1 | 25 | 6 (0)| 00:00:01 |
|* 9 | TABLE ACCESS FULL | DIM3 | 1 | 25 | 3 (0)| 00:00:01 |
| 10 | BITMAP CONVERSION COUNT | | 55826 | 2126K| 5657 (1)| 00:00:01 |
| 11 | BITMAP AND | | | | | |
|* 12 | BITMAP INDEX SINGLE VALUE| FACT_DIM3 | | | | |
|* 13 | BITMAP INDEX SINGLE VALUE| FACT_DIM2 | | | | |
|* 14 | BITMAP INDEX SINGLE VALUE| FACT_DIM1 | | | | |
-------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
5 - filter("DIM1"."DIM1_ATT1"='...0')
7 - filter("DIM2"."DIM2_ATT1"='...0')
9 - filter("DIM3"."DIM3_ATT1"='...0')
12 - access("FACT"."DIM3_ID"="DIM3"."DIM3_ID")
13 - access("FACT"."DIM2_ID"="DIM2"."DIM2_ID")
14 - access("FACT"."DIM1_ID"="DIM1"."DIM1_ID")

Here rows are fetched from the fact table through a nested loop from the cartesian join on the dimensions, using the bitmap index access for each loop. If there are lot of rows to fetch, then the optimizer will chose a hash join and then will have to full scan the fact table, which is expensive. To lower that cost, the optimizer can add a ‘IN (SELECT DIM_ID FROM DIM WHERE DIM_ATT)’ for very selective dimensions. This is STAR transformation and relies on the foreign key constraints.

It is not enabled by default:

21:03:43 SQL> show parameter star
NAME TYPE VALUE
---------------------------- ------- -----
star_transformation_enabled string FALSE

We can enable it and then it is a cost based transformation:

21:03:45 SQL> alter session set star_transformation_enabled=true;
Session altered.

Here is my example:

21:03:47 SQL> select count(*) from FACT
2 join DIM1 using(DIM1_ID) join DIM2 using(DIM2_ID) join DIM3 using(DIM3_ID)
3 where dim1_att1='...0' and dim2_att1='...0' and dim3_att1='...0';
 
COUNT(*)
--------
333333

The star transformation, changing a join to an ‘IN()’ is possible only when we know that there is a many-to-one relationship. We have all constraints for that, disabled or not, validated or not, but all in RELY. Then Star Transformation can occur:

21:03:51 SQL> select * from dbms_xplan.display_cursor();
 
PLAN_TABLE_OUTPUT
-----------------
SQL_ID 01jmjv0sz1dpq, child number 1
-------------------------------------
select count(*) from FACT join DIM1 using(DIM1_ID) join DIM2
using(DIM2_ID) join DIM3 using(DIM3_ID) where dim1_att1='...0' and
dim2_att1='...0' and dim3_att1='...0'
 
Plan hash value: 1831539117
 
--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 68 (100)| |
| 1 | SORT AGGREGATE | | 1 | 38 | | |
|* 2 | HASH JOIN | | 2 | 76 | 68 (0)| 00:00:01 |
|* 3 | TABLE ACCESS FULL | DIM2 | 1 | 25 | 3 (0)| 00:00:01 |
| 4 | VIEW | VW_ST_62BA0C91 | 8 | 104 | 65 (0)| 00:00:01 |
| 5 | NESTED LOOPS | | 8 | 608 | 56 (0)| 00:00:01 |
| 6 | BITMAP CONVERSION TO ROWIDS| | 8 | 427 | 22 (5)| 00:00:01 |
| 7 | BITMAP AND | | | | | |
| 8 | BITMAP MERGE | | | | | |
| 9 | BITMAP KEY ITERATION | | | | | |
|* 10 | TABLE ACCESS FULL | DIM1 | 1 | 25 | 3 (0)| 00:00:01 |
|* 11 | BITMAP INDEX RANGE SCAN| FACT_DIM1 | | | | |
| 12 | BITMAP MERGE | | | | | |
| 13 | BITMAP KEY ITERATION | | | | | |
|* 14 | TABLE ACCESS FULL | DIM2 | 1 | 25 | 3 (0)| 00:00:01 |
|* 15 | BITMAP INDEX RANGE SCAN| FACT_DIM2 | | | | |
| 16 | BITMAP MERGE | | | | | |
| 17 | BITMAP KEY ITERATION | | | | | |
|* 18 | TABLE ACCESS FULL | DIM3 | 1 | 25 | 3 (0)| 00:00:01 |
|* 19 | BITMAP INDEX RANGE SCAN| FACT_DIM3 | | | | |
| 20 | TABLE ACCESS BY USER ROWID | FACT | 1 | 25 | 43 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
2 - access("ITEM_1"="DIM2"."DIM2_ID")
3 - filter("DIM2"."DIM2_ATT1"='...0')
10 - filter("DIM1"."DIM1_ATT1"='...0')
11 - access("FACT"."DIM1_ID"="DIM1"."DIM1_ID")
14 - filter("DIM2"."DIM2_ATT1"='...0')
15 - access("FACT"."DIM2_ID"="DIM2"."DIM2_ID")
18 - filter("DIM3"."DIM3_ATT1"='...0')
19 - access("FACT"."DIM3_ID"="DIM3"."DIM3_ID")
 
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
- star transformation used for this statement
- this is an adaptive plan

Here, each dimension drives a range scan on the bitmap index: the predicate on the dimension table returns the dimension ID for the the index lookup on the fact table. The big advantage of bitmap indexes here is that when this access path is used for several dimensions, the bitmap result can be combined before going to the table. This transformation avoids the join and then you must be sure that there is a many-to-one relationship.

In summary

As you should rely on the integrity of data in your datawarehouse, you should find the following parameters to query on fact-dimension schemas:

NAME TYPE VALUE
---------------------------- ------- ------
query_rewrite_enabled string TRUE
query_rewrite_integrity string TRUSTED
star_transformation_enabled string FALSE

And you should define all constraints. When you are sure about the integrity of data, then those constraints can be created RELY ENABLE NOVALIDATE. If some processing must be optimized by not enforcing the constraint verification, then you may create those constraints as RELY DISABLE NOVALIDATE but the gain will probably minimal, and the risk high. But remember that there are not only the well-controlled processes which update data. You may have one day to do a manual update to fix something, and enabled constraint can prevent terrible errors.

I have not covered all optimizer transformations that rely on constraints. When using materialized views you, the rewrite capability also relies on constraints. Relationship cardinality is one of the most important information of database design, this information must be known by the optimizer.

Added 9-OCT-2017

Re-reading this, I realize that I forgot to mention one important thing about disabled constraints. I recommend having the constraints enabled in case there is an update. But when you bulk insert (insert /*+ append */) you will disable it or the insert will not be done in direct-path. So the idea is to disable it before the load and enabled it RELY NOVALIDATE after the load. And while we are there, I can mention that inconsistencies can happen only in NOVALIDATE DISABLE because with VALIDATE DISABLE you cannot insert/update/delete.

 

Leave a Reply


1 × = four

Franck Pachot
Franck Pachot

Technology Leader