By Franck Pachot

.
I know that lot of people are against the ANSI join syntax in Oracle. And this goes beyond the limits when talking about NATURAL JOIN. But I like them and use them quite often.

Why is Natural Join bad?

Natural join is bad because it relies on column names, and, at the time of writing the query, you don’t know which columns will be added or removed later. Here is an example on the SCOTT schema, joining on DEPTNO which has the same name in DEPT and EMP:


SQL> select * from EMP natural join DEPT where DNAME='SALES';
 
    DEPTNO      EMPNO ENAME      JOB              MGR HIREDATE         SAL       COMM DNAME          LOC
---------- ---------- ---------- --------- ---------- --------- ---------- ---------- -------------- -------------
        30       7521 WARD       SALESMAN        7698 22-FEB-81       1250        500 SALES          CHICAGO
        30       7844 TURNER     SALESMAN        7698 08-SEP-81       1500          0 SALES          CHICAGO
        30       7499 ALLEN      SALESMAN        7698 20-FEB-81       1600        300 SALES          CHICAGO
        30       7900 JAMES      CLERK           7698 03-DEC-81        950            SALES          CHICAGO
        30       7698 BLAKE      MANAGER         7839 01-MAY-81       2850            SALES          CHICAGO
        30       7654 MARTIN     SALESMAN        7698 28-SEP-81       1250       1400 SALES          CHICAGO

The DEPT table has a ‘LOC’column for the location of the department. But the data model may evolve and you may add a location for each employee. And we may also call it LOC:


SQL> alter table EMP add (LOC varchar2(10));
Table altered.

But now our Natural Join adds this column to the join predicate and the result is wrong because it shows only rows which have same department location as employee location:


SQL> select * from EMP natural join DEPT where DNAME='SALES';
 
no rows selected

Projection

In my opinion, the problem is not the Natural Join. Column names have a meaning for their tables. But the tables have different roles in our queries. As soon as a table or view participates to our query, we should redefine the column names. If we don’t, the result is completely wrong as:


SQL> select * from EMP join DEPT using(DEPTNO) where DNAME='SALES';
 
    DEPTNO      EMPNO ENAME      JOB              MGR HIREDATE         SAL       COMM LOC        DNAME          LOC
---------- ---------- ---------- --------- ---------- --------- ---------- ---------- ---------- -------------- -------------
        30       7521 WARD       SALESMAN        7698 22-FEB-81       1250        500            SALES          CHICAGO
        30       7844 TURNER     SALESMAN        7698 08-SEP-81       1500          0            SALES          CHICAGO
        30       7499 ALLEN      SALESMAN        7698 20-FEB-81       1600        300            SALES          CHICAGO
        30       7900 JAMES      CLERK           7698 03-DEC-81        950                       SALES          CHICAGO
        30       7698 BLAKE      MANAGER         7839 01-MAY-81       2850                       SALES          CHICAGO
        30       7654 MARTIN     SALESMAN        7698 28-SEP-81       1250       1400            SALES          CHICAGO

Look: the result has two columns with the same name. This is completely wrong for a relational database and I don’t even understand why this parses without raising an error.

The projection is the most important relational operation, often overlooked as if it was just a rename for aesthetic purpose. You need to name the columns of your result set. They are the metadata for the interface between SQL and the host language. ‘select *’ is a shortcut when running an interactive query, to get a glance at the result rows. But a SQL query result is not complete without proper column names. And in most cases, at least when you query more than one table, the name of the query result columns should be different than the name of the underlying table columns. A department may have a location. And an employee may have a location. But the location of the employee department is something completely different than the employee location.

Then, as you need to name each column anyway, why not doing it as soon as possible? Do it for each table involved in the query, so that you are sure that all column names are correct within the query. As soon as you introduce an new table in the FROM clause, you should actually name the columns according to their role in the query. Let’s take an example with an airline data model. Each airport is linked to a city. This can be a CITY column in the AIRPORTS table. But as soon as you join FLIGHTS with AIRPORTS, this table has a different role. You join on destination airport or source airport. Then you alias the AIRPORTS table in the FROM clause, such as DST_AIRPORTS or SRC_AIRPORTS. Within the query, you can reference the columns with the table alias, such as DST_AIRPORTS.CITY or SRC_AIRPORTS.CITY but this cannot be exposed as-is in the query result. You must name them in the SELECT clause with something like SELECT DST_AIRPORTS.CITY as DST_ARP_CITY , SRC_AIRPORTS.CITY as SRC_ARP_CITY.

Then, as I’ll need to rename them anyway, I prefer to do it as soon as I join to a new table in the FROM clause. Instead of joining to AIRPORTS DST_AIRPORTS I can join to (SELECT IATA DST_ARP_IATA, CITY DST_ARP_CITY FROM AIRPORTS) and all column names will relate to the role without table aliases and without further renaming. And when I do that correctly, I can use natural join without risk.

Projection in the FROM clause

Let’s take an example. Here is a query in DEPT where I explicitly mention that LOC is the department location. This is implicit when the column name belongs to the DEPT table. But it will not be implicit anymore once I join this table to another table. Here is the view ready to be included in any query:


SQL> select DEPTNO,DNAME DEPT_DNAME,LOC DEPT_LOC from DEPT where DNAME='SALES';
 
    DEPTNO DEPT_DNAME     DEPT_LOC
---------- -------------- -------------
        30 SALES          CHICAGO

Now, I can join this to the EMP table. I prefix all columns from EMP with “EMP_” and all columns from DEPT with “EMP_DEPT_” because they belong to DEPT when in the role of employee department:


SQL> select EMP_EMPNO,EMP_ENAME,EMP_DEPT_DNAME,EMP_DEPT_LOC,EMP_LOC,EMP_MGR_EMPNO
    from
    (select DEPTNO EMP_DEPTNO,EMPNO EMP_EMPNO,ENAME EMP_ENAME,MGR EMP_MGR_EMPNO,LOC EMP_LOC from EMP)
    natural join
    (select DEPTNO EMP_DEPTNO,DNAME EMP_DEPT_DNAME,LOC EMP_DEPT_LOC from DEPT)
    where EMP_DEPT_DNAME='SALES';
 
 EMP_EMPNO EMP_ENAME  EMP_DEPT_DNAME EMP_DEPT_LOC  EMP_LOC    EMP_MGR_EMPNO
---------- ---------- -------------- ------------- ---------- -------------
      7521 WARD       SALES          CHICAGO                           7698
      7844 TURNER     SALES          CHICAGO                           7698
      7499 ALLEN      SALES          CHICAGO                           7698
      7900 JAMES      SALES          CHICAGO                           7698
      7698 BLAKE      SALES          CHICAGO                           7839
      7654 MARTIN     SALES          CHICAGO                           7698

As you can see, when the names are clearly indicating the column with its role in the join, and how they are correlated with the other tables, there is no need to mention any join predicate. I used Natural Join because the join is on EMP_DEPTNO and I’m sure that it will always be the one and only one column with the same name. By query design.

And the column names in the result are correct, explicitly mentioning what is an Employee attribute or an Employee department attribute. That can be easy to parse and put in an object graph in the host language. You can see there that the MGR column of EMP was named EMP_MGR_EMPNO because this is actually what it is: the EMPNO of the employee manager. It is a foreign key to the EMP table.

And then, adding more information about the manager is easy: join with EMP again but with the proper projection of columns: EMPNO will be EMP_MGR_EMPNO when in the role of the employee manager, ENAME will be EMP_MGR_ENAME, DEPTNO will be EMP_MGR_DEPTNO, and so on:


SQL> select EMP_EMPNO,EMP_ENAME,EMP_DEPT_DNAME,EMP_DEPT_LOC,EMP_LOC,EMP_MGR_DEPTNO,EMP_MGR_ENAME
    from
    (select DEPTNO EMP_DEPTNO,EMPNO EMP_EMPNO,ENAME EMP_ENAME,MGR EMP_MGR_EMPNO,LOC EMP_LOC from EMP)
    natural join
    (select DEPTNO EMP_DEPTNO,DNAME EMP_DEPT_DNAME,LOC EMP_DEPT_LOC from DEPT)
    natural join
    (select DEPTNO EMP_MGR_DEPTNO,EMPNO EMP_MGR_EMPNO,ENAME EMP_MGR_ENAME from EMP)
    where EMP_DEPT_DNAME='SALES';
 
 EMP_EMPNO EMP_ENAME  EMP_DEPT_DNAME EMP_DEPT_LOC  EMP_LOC    EMP_MGR_DEPTNO EMP_MGR_ENAME
---------- ---------- -------------- ------------- ---------- -------------- -------------
      7900 JAMES      SALES          CHICAGO                              30 BLAKE
      7499 ALLEN      SALES          CHICAGO                              30 BLAKE
      7654 MARTIN     SALES          CHICAGO                              30 BLAKE
      7844 TURNER     SALES          CHICAGO                              30 BLAKE
      7521 WARD       SALES          CHICAGO                              30 BLAKE
      7698 BLAKE      SALES          CHICAGO                              10 KING

No need to review the whole query when adding a new table. No need to solve the new ‘column ambiguously defined’. We don’t even need to alias the tables here.

Want to add the department name of the manager? That’s easy: join to DEPT with the right column projection (all prefixed by EMP_MGR_DEPT as the new columns are all about the employee manager’s department):


SQL> select EMP_EMPNO,EMP_ENAME,EMP_DEPT_DNAME,EMP_DEPT_LOC,EMP_LOC,EMP_MGR_DEPTNO,EMP_MGR_ENAME,EMP_MGR_DEPT_DNAME
    from
    (select DEPTNO EMP_DEPTNO,EMPNO EMP_EMPNO,ENAME EMP_ENAME,MGR EMP_MGR_EMPNO,LOC EMP_LOC from EMP)
    natural join
    (select DEPTNO EMP_DEPTNO,DNAME EMP_DEPT_DNAME,LOC EMP_DEPT_LOC from DEPT)
    natural join
    (select DEPTNO EMP_MGR_DEPTNO,EMPNO EMP_MGR_EMPNO,ENAME EMP_MGR_ENAME from EMP)
    natural join
    (select DEPTNO EMP_MGR_DEPTNO,DNAME EMP_MGR_DEPT_DNAME,LOC EMP_MGR_DEPT_LOC from DEPT)
    where EMP_DEPT_DNAME='SALES';
 
 EMP_EMPNO EMP_ENAME  EMP_DEPT_DNAME EMP_DEPT_LOC  EMP_LOC    EMP_MGR_DEPTNO EMP_MGR_EN EMP_MGR_DEPT_D
---------- ---------- -------------- ------------- ---------- -------------- ---------- --------------
      7698 BLAKE      SALES          CHICAGO                              10 KING       ACCOUNTING
      7900 JAMES      SALES          CHICAGO                              30 BLAKE      SALES
      7499 ALLEN      SALES          CHICAGO                              30 BLAKE      SALES
      7654 MARTIN     SALES          CHICAGO                              30 BLAKE      SALES
      7844 TURNER     SALES          CHICAGO                              30 BLAKE      SALES
      7521 WARD       SALES          CHICAGO                              30 BLAKE      SALES

This can be even easier when you generate SQL queries. When adding a new table to join to, you just prefix all columns with their role. Check foreign keys so that the naming is consistent with the referenced tables. Then when parsing the result, the naming convention can help to break on the object hierarchy.

Additional notes

I mentioned that aliasing the subquery is not mandatory because I do not have to prefix the column names. However, when looking at the predicates section of the execution plan, the columns may be prefixed with an internal alias:


Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("from$_subquery$_006"."EMP_MGR_DEPTNO"="from$_subquery$_009"."EMP_MGR_DEPTNO")
   2 - access("from$_subquery$_001"."EMP_MGR_EMPNO"="from$_subquery$_006"."EMP_MGR_EMPNO" AND "from$_subquery$_001"."EMP_DEPTNO"="from$_subquery$_003"."EMP_DEPTNO")

Then it is a good idea to add prefixes, such as EMP, EMP_DEPT, EMP_MGR EMP_MGR_DEPTNO in the query above so that the predicates become:


Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("EMP_MGR"."EMP_MGR_DEPTNO"="EMP_MGR_DEPT"."EMP_MGR_DEPTNO")
   2 - access("EMP"."EMP_MGR_EMPNO"="EMP_MGR"."EMP_MGR_EMPNO" AND "EMP"."EMP_DEPTNO"="EMP_DEPT"."EMP_DEPTNO")
   5 - filter("DNAME"='SALES')

I also like to add a QB_NAME hint so that I can reference easily those subqueries if I have to add some hints there. Finally, this is what I can generate for this query:


SQL> select EMP_EMPNO,EMP_ENAME,EMP_DEPT_DNAME,EMP_DEPT_LOC,EMP_LOC,EMP_MGR_DEPTNO,EMP_MGR_ENAME,EMP_MGR_DEPT_DNAME
    from
    (select /*+qb_name(EMP)*/ DEPTNO EMP_DEPTNO,EMPNO EMP_EMPNO,ENAME EMP_ENAME,MGR EMP_MGR_EMPNO,LOC EMP_LOC from EMP) EMP
    natural join
    (select /*+qb_name(EMP_DEPT)*/ DEPTNO EMP_DEPTNO,DNAME EMP_DEPT_DNAME,LOC EMP_DEPT_LOC from DEPT) EMP_DEPT
    natural join
    (select /*+qb_name(EMP_MGR)*/ DEPTNO EMP_MGR_DEPTNO,EMPNO EMP_MGR_EMPNO,ENAME EMP_MGR_ENAME from EMP) EMP_MGR
    natural join
    (select /*+qb_name(EMP_MGR_DEPT)*/ DEPTNO EMP_MGR_DEPTNO,DNAME EMP_MGR_DEPT_DNAME,LOC EMP_MGR_DEPT_LOC from DEPT) EMP_MGR_DEPT
    where EMP_DEPT_DNAME='SALES';

So what?

My goal here is not to recommend to always use natural joins. This depends on the context (ad-hoc queries, embedded ones in existing code with naming standards,…) and whether con control exactly the columns names. There are also a few bugs with ANSI joins, and natural join is not widely used, so maybe not tested a lot. But when I hear that Natural Join is bad, I want to explain the why/how/when. And one of the good sides of it is that it forces us to do the projection/rename as soon as possible and this makes the query easier to read/maintain/evolve. Of course, using natural join in that way requires that all tables are added to the FROM clause through a subquery which carefully names all columns in the SELECT clause so that the correlation with the other tables is clearly defined.