It is not a surprise that applications change over time. New tables get added, existing columns need to be changed or new ones are required, and sometimes you need to add check constraints or foreign keys to existing tables which already contain data. The issue with adding constraints to tables that already contain data is that PostgreSQL needs to scan the whole table to validate that the existing data does not violate the constraint you are adding. Is that really an issue and, if yes, how can it be avoided?

As always, let’s start with a very simple test setup:

postgres=# create table t1 as select 5 as a from generate_series(1,100000000);
SELECT 100000000
postgres=# \d t1
                 Table "public.t1"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 a      | integer |           |          | 

Nothing fancy here: one table containing one column and 100’000’000 rows. What happens to concurrent sessions touching this table if we add a check constraint?

-- session 1
postgres=# alter table t1 add constraint c1 check ( a > 4 );
-- session 2, started right after session 1 started to add the check constraint
postgres=# insert into t1 (a) values(6);

The second session will block until the first session commits. Why is that? ALTER TABLE … ADD CONSTRAINT takes an ACCESS EXCLUSIVE lock on the table and holds it while the whole table is scanned: PostgreSQL cannot allow the second session to complete until the constraint is fully in place, because otherwise another session could write data that would violate the constraint. The bigger the table, the longer other sessions will be blocked, and this can have a huge impact on your workloads. If your tables only contain a few dozen rows you will probably never notice anything, but for large tables you need to be careful.
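If you want to see the blocking for yourself, pg_blocking_pids() (available since PostgreSQL 9.6) tells you which sessions a given backend is waiting for. A quick sketch, run from a third session while session 2 hangs:

-- session 3
postgres=# select pid, pg_blocking_pids(pid) as blocked_by, query from pg_stat_activity where cardinality(pg_blocking_pids(pid)) > 0;

This should return one row: the pid of session 2, blocked by the pid of session 1 which is running the ALTER TABLE.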

The question now is: Can we avoid the blocking? The good news is: yes, you can. Here is the same test as above, but with the “NOT VALID” keywords:

-- session 1
postgres=# alter table t1 drop constraint c1;
ALTER TABLE
postgres=# alter table t1 add constraint c1 check ( a > 4 ) not valid;
-- session 2, started right after session 1 started to add the check constraint
postgres=# insert into t1 (a) values(6);

Both sessions complete immediately and new data is validated from now on:

postgres=# insert into t1 (a) values(2);
ERROR:  new row for relation "t1" violates check constraint "c1"
DETAIL:  Failing row contains (2).

For already existing data there might be more work to do. Consider this:

postgres=# alter table t1 drop constraint c1;
ALTER TABLE
postgres=# insert into t1 (a) values(2);
INSERT 0 1
postgres=# alter table t1 add constraint c1 check ( a > 4 ) not valid;
ALTER TABLE
postgres=# 

The “NOT VALID” clause tells PostgreSQL not to validate any existing data. If you try to validate the data now, it will fail:

postgres=# alter table t1 validate constraint c1;
ERROR:  check constraint "c1" of relation "t1" is violated by some row
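That the constraint exists but has not been validated is also visible in the catalog: the convalidated flag in pg_constraint stays false until validation succeeds:

postgres=# select conname, convalidated from pg_constraint where conrelid = 't1'::regclass;
 conname | convalidated 
---------+--------------
 c1      | f
(1 row)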

You need to do the cleanup manually.
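Before deleting anything it usually pays off to check what actually violates the constraint. A quick sketch: the negation of the check expression finds exactly the violating rows (NULL satisfies a CHECK constraint and is also not matched by NOT ( a > 4 ), so the two agree):

postgres=# select a, count(*) from t1 where not ( a > 4 ) group by a;
 a | count 
---+-------
 2 |     1
(1 row)

Once you know what is affected, remove or fix those rows: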

postgres=# delete from t1 where not ( a > 4 );
DELETE 1

Now the validation will succeed, and because VALIDATE CONSTRAINT only acquires a SHARE UPDATE EXCLUSIVE lock on the table, it will not block concurrent sessions:

-- session 1
postgres=# alter table t1 validate constraint c1;
-- session 2, started right after session 1 started to validate the constraint
postgres=# insert into t1 (a) values(6);

For best performance and no interruption of other sessions the procedure therefore is (see the consolidated sketch after the list):

  • Create the constraint with “NOT VALID”. This returns immediately and prevents any invalid data from now on
  • Check if there are existing rows that violate the constraint
  • Manually clean up the data
  • Validate the constraint
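
Putting it all together, here is a minimal sketch of the whole procedure for the example table from above (adjust the table, column and constraint names to your setup):

-- 1. create the constraint without validating existing data;
--    this returns immediately and enforces the constraint for new rows
alter table t1 add constraint c1 check ( a > 4 ) not valid;

-- 2. check for existing rows that violate the constraint
select count(*) from t1 where not ( a > 4 );

-- 3. manually clean up (or fix) the violating rows
delete from t1 where not ( a > 4 );

-- 4. validate the constraint; this scans the table but only takes a
--    SHARE UPDATE EXCLUSIVE lock, so concurrent writes are not blocked
alter table t1 validate constraint c1;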