Infrastructure at your Service

Daniel Westermann

PostgreSQL partitioning (2): Range partitioning

Now that the data set is ready we will look at the first partitioning strategy: Range partitioning. Usually range partitioning is used to partition a table by days, months or years although you can partition by other data types as well. Time/date based range partitioning is probably the most common use case so we will partition our traffic violation data by year and later on sub partition that month.

Going back to the materialized we created in the first post, this is the structure:

postgres=# \d mv_traffic_violations 
                 Materialized view "public.mv_traffic_violations"
         Column          |          Type          | Collation | Nullable | Default 
-------------------------+------------------------+-----------+----------+---------
 seqid                   | text                   |           |          | 
 date_of_stop            | date                   |           |          | 
 time_of_stop            | time without time zone |           |          | 
 agency                  | text                   |           |          | 
 subagency               | text                   |           |          | 
 description             | text                   |           |          | 
 location                | text                   |           |          | 
 latitude                | numeric                |           |          | 
 longitude               | numeric                |           |          | 
 accident                | text                   |           |          | 
 belts                   | boolean                |           |          | 
 personal_injury         | boolean                |           |          | 
 property_damage         | boolean                |           |          | 
 fatal                   | boolean                |           |          | 
 commercial_license      | boolean                |           |          | 
 hazmat                  | boolean                |           |          | 
 commercial_vehicle      | boolean                |           |          | 
 alcohol                 | boolean                |           |          | 
 workzone                | boolean                |           |          | 
 state                   | text                   |           |          | 
 vehicletype             | text                   |           |          | 
 year                    | smallint               |           |          | 
 make                    | text                   |           |          | 
 model                   | text                   |           |          | 
 color                   | text                   |           |          | 
 violation_type          | text                   |           |          | 
 charge                  | text                   |           |          | 
 article                 | text                   |           |          | 
 contributed_to_accident | boolean                |           |          | 
 race                    | text                   |           |          | 
 gender                  | text                   |           |          | 
 driver_city             | text                   |           |          | 
 driver_state            | text                   |           |          | 
 dl_state                | text                   |           |          | 
 arrest_type             | text                   |           |          | 
 geolocation             | point                  |           |          | 
 council_districts       | smallint               |           |          | 
 councils                | smallint               |           |          | 
 communities             | smallint               |           |          | 
 zip_codes               | smallint               |           |          | 
 municipalities          | smallint               |           |          | 

Lets assume our most common queries against the data set restrict the data for one or more years. Using partitioned tables we can range partition on the “date_of_stop” column and break down the large data set into smaller chunks per year. How would that work in PostgreSQL?

Currently we have this min and max values for the “date_of_stop” column:

postgres=# select min(date_of_stop), max(date_of_stop) from mv_traffic_violations;
    min     |    max     
------------+------------
 2012-01-01 | 2019-06-01
(1 row)

If we want to partition that by year we need at least 8 partitions (2012 to 2019) and maybe already add a partition for 2010 so we are on the safe side when a more recent data set brings data for next year. The first step is to create a partitioned table:

create table traffic_violations_p
( seqid text
, date_of_stop date not null
, time_of_stop time
, agency text
, subagency text
, description text 
, location text
, latitude numeric
, longitude numeric
, accident text
, belts boolean
, personal_injury boolean 
, property_damage boolean
, fatal boolean
, commercial_license boolean
, hazmat boolean
, commercial_vehicle boolean
, alcohol boolean
, workzone boolean
, state text
, vehicletype text 
, year smallint
, make text
, model text
, color text
, violation_type text 
, charge text
, article text
, contributed_to_accident boolean
, race text
, gender text
, driver_city text
, driver_state text
, dl_state text
, arrest_type text
, geolocation point
, council_districts smallint 
, councils smallint
, communities smallint
, zip_codes smallint
, municipalities smallint)
partition by range (date_of_stop);

Now we have a partitioned table with no partitions:

postgres=# \d traffic_violations_p
                  Partitioned table "public.traffic_violations_p"
         Column          |          Type          | Collation | Nullable | Default 
-------------------------+------------------------+-----------+----------+---------
 seqid                   | text                   |           |          | 
 date_of_stop            | date                   |           | not null | 
 time_of_stop            | time without time zone |           |          | 
 agency                  | text                   |           |          | 
 subagency               | text                   |           |          | 
 description             | text                   |           |          | 
 location                | text                   |           |          | 
 latitude                | numeric                |           |          | 
 longitude               | numeric                |           |          | 
 accident                | text                   |           |          | 
 belts                   | boolean                |           |          | 
 personal_injury         | boolean                |           |          | 
 property_damage         | boolean                |           |          | 
 fatal                   | boolean                |           |          | 
 commercial_license      | boolean                |           |          | 
 hazmat                  | boolean                |           |          | 
 commercial_vehicle      | boolean                |           |          | 
 alcohol                 | boolean                |           |          | 
 workzone                | boolean                |           |          | 
 state                   | text                   |           |          | 
 vehicletype             | text                   |           |          | 
 year                    | smallint               |           |          | 
 make                    | text                   |           |          | 
 model                   | text                   |           |          | 
 color                   | text                   |           |          | 
 violation_type          | text                   |           |          | 
 charge                  | text                   |           |          | 
 article                 | text                   |           |          | 
 contributed_to_accident | boolean                |           |          | 
 race                    | text                   |           |          | 
 gender                  | text                   |           |          | 
 driver_city             | text                   |           |          | 
 driver_state            | text                   |           |          | 
 dl_state                | text                   |           |          | 
 arrest_type             | text                   |           |          | 
 geolocation             | point                  |           |          | 
 council_districts       | smallint               |           |          | 
 councils                | smallint               |           |          | 
 communities             | smallint               |           |          | 
 zip_codes               | smallint               |           |          | 
 municipalities          | smallint               |           |          | 
Partition key: RANGE (date_of_stop)
Number of partitions: 0

If we try to insert data without any partitions this will obviously fail:

postgres=# insert into traffic_violations_p (date_of_Stop) values ( now() );
psql: ERROR:  no partition of relation "traffic_violations_p" found for row
DETAIL:  Partition key of the failing row contains (date_of_stop) = (2019-05-28).

We need partitions and this is how you can add partitions to that partitioned table:

create table traffic_violations_p_2012
partition of traffic_violations_p
for values from ('2012-01-01') to ('2012-12-31');

create table traffic_violations_p_2013
partition of traffic_violations_p
for values from ('2013-01-01') to ('2013-12-31');

create table traffic_violations_p_2014
partition of traffic_violations_p
for values from ('2014-01-01') to ('2014-12-31');

create table traffic_violations_p_2015
partition of traffic_violations_p
for values from ('2015-01-01') to ('2015-12-31');

create table traffic_violations_p_2016
partition of traffic_violations_p
for values from ('2016-01-01') to ('2016-12-31');

create table traffic_violations_p_2017
partition of traffic_violations_p
for values from ('2017-01-01') to ('2017-12-31');

create table traffic_violations_p_2018
partition of traffic_violations_p
for values from ('2018-01-01') to ('2018-12-31');

create table traffic_violations_p_2019
partition of traffic_violations_p
for values from ('2019-01-01') to ('2019-12-31');

create table traffic_violations_p_2020
partition of traffic_violations_p
for values from ('2020-01-01') to ('2020-12-31');

We now have a partition for each year from 2012 to 2020:

postgres=# \d+ traffic_violations_p
                                      Partitioned table "public.traffic_violations_p"
         Column          |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description 
-------------------------+------------------------+-----------+----------+---------+----------+--------------+-------------
 seqid                   | text                   |           |          |         | extended |              | 
 date_of_stop            | date                   |           |          |         | plain    |              | 
 time_of_stop            | time without time zone |           |          |         | plain    |              | 
 agency                  | text                   |           |          |         | extended |              | 
 subagency               | text                   |           |          |         | extended |              | 
 description             | text                   |           |          |         | extended |              | 
 location                | text                   |           |          |         | extended |              | 
 latitude                | numeric                |           |          |         | main     |              | 
 longitude               | numeric                |           |          |         | main     |              | 
 accident                | text                   |           |          |         | extended |              | 
 belts                   | boolean                |           |          |         | plain    |              | 
 personal_injury         | boolean                |           |          |         | plain    |              | 
 property_damage         | boolean                |           |          |         | plain    |              | 
 fatal                   | boolean                |           |          |         | plain    |              | 
 commercial_license      | boolean                |           |          |         | plain    |              | 
 hazmat                  | boolean                |           |          |         | plain    |              | 
 commercial_vehicle      | boolean                |           |          |         | plain    |              | 
 alcohol                 | boolean                |           |          |         | plain    |              | 
 workzone                | boolean                |           |          |         | plain    |              | 
 state                   | text                   |           |          |         | extended |              | 
 vehicletype             | text                   |           |          |         | extended |              | 
 year                    | smallint               |           |          |         | plain    |              | 
 make                    | text                   |           |          |         | extended |              | 
 model                   | text                   |           |          |         | extended |              | 
 color                   | text                   |           |          |         | extended |              | 
 violation_type          | text                   |           |          |         | extended |              | 
 charge                  | text                   |           |          |         | extended |              | 
 article                 | text                   |           |          |         | extended |              | 
 contributed_to_accident | boolean                |           |          |         | plain    |              | 
 race                    | text                   |           |          |         | extended |              | 
 gender                  | text                   |           |          |         | extended |              | 
 driver_city             | text                   |           |          |         | extended |              | 
 driver_state            | text                   |           |          |         | extended |              | 
 dl_state                | text                   |           |          |         | extended |              | 
 arrest_type             | text                   |           |          |         | extended |              | 
 geolocation             | point                  |           |          |         | plain    |              | 
 council_districts       | smallint               |           |          |         | plain    |              | 
 councils                | smallint               |           |          |         | plain    |              | 
 communities             | smallint               |           |          |         | plain    |              | 
 zip_codes               | smallint               |           |          |         | plain    |              | 
 municipalities          | smallint               |           |          |         | plain    |              | 
Partition key: RANGE (date_of_stop)
Partitions: traffic_violations_p_2012 FOR VALUES FROM ('2012-01-01') TO ('2012-12-31'),
            traffic_violations_p_2013 FOR VALUES FROM ('2013-01-01') TO ('2013-12-31'),
            traffic_violations_p_2014 FOR VALUES FROM ('2014-01-01') TO ('2014-12-31'),
            traffic_violations_p_2015 FOR VALUES FROM ('2015-01-01') TO ('2015-12-31'),
            traffic_violations_p_2016 FOR VALUES FROM ('2016-01-01') TO ('2016-12-31'),
            traffic_violations_p_2017 FOR VALUES FROM ('2017-01-01') TO ('2017-12-31'),
            traffic_violations_p_2018 FOR VALUES FROM ('2018-01-01') TO ('2018-12-31'),
            traffic_violations_p_2019 FOR VALUES FROM ('2019-01-01') TO ('2019-12-31'),
            traffic_violations_p_2020 FOR VALUES FROM ('2020-01-01') TO ('2020-12-31')

Doing the same insert as before now succeeds because we have a partition where the row can go to:

postgres=# insert into traffic_violations_p (date_of_Stop) values ( now() );
INSERT 0 1
postgres=# delete from traffic_violations_p;
DELETE 1
postgres=# 

What still would fail is an insert with a date before 2012 or after 2020:

postgres=# insert into traffic_violations_p (date_of_Stop) values ( now() + interval '2 years' );
psql: ERROR:  no partition of relation "traffic_violations_p" found for row
DETAIL:  Partition key of the failing row contains (date_of_stop) = (2021-05-28).

To avoid that we can create a “default” partition:

CREATE TABLE traffic_violations_p_default
    PARTITION OF traffic_violations_p DEFAULT;

A default partition will hold all the rows that do not match any of the existing partition definitions:

postgres=# select (date_of_stop) from traffic_violations_p_default;
 date_of_stop 
--------------
 2021-05-28
(1 row)
postgres=# delete from traffic_violations_p;
DELETE 1

As our partitioned table setup is now complete we can load the data:

postgres=# insert into traffic_violations_p select * from mv_traffic_violations;
INSERT 0 1528078

All rows successfully loaded so we can check the counts for each partition:

postgres=# select count(*) from traffic_violations_p_2012;
 count  
--------
 150989
(1 row)

postgres=# select count(*) from traffic_violations_p_2013;
 count  
--------
 189946
(1 row)

postgres=# select count(*) from traffic_violations_p_2014;
 count  
--------
 223248
(1 row)

postgres=# select count(*) from traffic_violations_p_2015;
 count  
--------
 234468
(1 row)

postgres=# select count(*) from traffic_violations_p_2016;
 count  
--------
 219227
(1 row)

postgres=# select count(*) from traffic_violations_p_2017;
 count  
--------
 198879
(1 row)

postgres=# select count(*) from traffic_violations_p_2018;
 count  
--------
 217911
(1 row)

postgres=# select count(*) from traffic_violations_p_2019;
 count 
-------
 89823
(1 row)

postgres=# select count(*) from traffic_violations_p_2020;
 count 
-------
     0
(1 row)

postgres=# select count(*) from traffic_violations_p_default;
 count 
-------
  3588
(1 row)

Why do we have so many rows in the default partition? We checked the min and max values and there should be no row before 2012 and after 2019. But:

postgres=# select distinct date_of_stop from traffic_violations_p_default order by 1;
 date_of_stop 
--------------
 2012-12-31
 2013-12-31
 2014-12-31
 2015-12-31
 2016-12-31
 2017-12-31
 2018-12-31
(7 rows)

The issue it, that the upper bound is exclusive so we did not correctly setup the partitions. The correct way of doing it is:

drop table traffic_violations_p_2012, traffic_violations_p_2013, traffic_violations_p_2014, traffic_violations_p_2015, traffic_violations_p_2016, traffic_violations_p_2017, traffic_violations_p_2018, traffic_violations_p_2019, traffic_violations_p_2020;

create table traffic_violations_p_2012
partition of traffic_violations_p
for values from ('2012-01-01') to ('2013-01-01');

create table traffic_violations_p_2013
partition of traffic_violations_p
for values from ('2013-01-01') to ('2014-01-01');

create table traffic_violations_p_2014
partition of traffic_violations_p
for values from ('2014-01-01') to ('2015-01-01');

create table traffic_violations_p_2015
partition of traffic_violations_p
for values from ('2015-01-01') to ('2016-01-01');

create table traffic_violations_p_2016
partition of traffic_violations_p
for values from ('2016-01-01') to ('2017-01-01');

create table traffic_violations_p_2017
partition of traffic_violations_p
for values from ('2017-01-01') to ('2018-01-01');

create table traffic_violations_p_2018
partition of traffic_violations_p
for values from ('2018-01-01') to ('2019-01-01');

create table traffic_violations_p_2019
partition of traffic_violations_p
for values from ('2019-01-01') to ('2020-01-01');

create table traffic_violations_p_2020
partition of traffic_violations_p
for values from ('2020-01-01') to ('2021-01-01');

delete from traffic_violations_p;

Once the partitions are set up as required the data is correctly stored:

postgres=# insert into traffic_violations_p select * from mv_traffic_violations;
INSERT 0 1528078
postgres=# select count(*) from traffic_violations_p_default ;
 count 
-------
     0
(1 row)

Instead of having one big table containing all the data we now have 10 partitions splitting the table in smaller chunks. Why does that matter? This is the topic of another post after we had a look at list and hash partitioning in the next posts.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Daniel Westermann
Daniel Westermann

Principal Consultant & Technology Leader Open Infrastructure