Infrastructure at your Service

Daniel Westermann

PostgreSQL partitioning (4): Hash partitioning

The last partitioning strategy we will look at is: hash partitioning. If you missed the first posts in this series here they are:

  1. PostgreSQL partitioning (1): Preparing the data set
  2. PostgreSQL partitioning (2): Range partitioning
  3. PostgreSQL partitioning (3): List partitioning

Usually hash partitioning is used when you do not have a natural way of partitioning your data or you want to evenly distribute the data based on hash.

In PostgreSQL hash partitioning might seem a bit strange in the beginning because you define a modulus and a remainder and that defines where the data is going to. Having a look at our materialized view which contains our data set which columns might be a good candidate for that?

postgres=# \d mv_traffic_violations 
                 Materialized view "public.mv_traffic_violations"
         Column          |          Type          | Collation | Nullable | Default 
-------------------------+------------------------+-----------+----------+---------
 seqid                   | text                   |           |          | 
 date_of_stop            | date                   |           |          | 
 time_of_stop            | time without time zone |           |          | 
 agency                  | text                   |           |          | 
 subagency               | text                   |           |          | 
 description             | text                   |           |          | 
 location                | text                   |           |          | 
 latitude                | numeric                |           |          | 
 longitude               | numeric                |           |          | 
 accident                | text                   |           |          | 
 belts                   | boolean                |           |          | 
 personal_injury         | boolean                |           |          | 
 property_damage         | boolean                |           |          | 
 fatal                   | boolean                |           |          | 
 commercial_license      | boolean                |           |          | 
 hazmat                  | boolean                |           |          | 
 commercial_vehicle      | boolean                |           |          | 
 alcohol                 | boolean                |           |          | 
 workzone                | boolean                |           |          | 
 state                   | text                   |           |          | 
 vehicletype             | text                   |           |          | 
 year                    | smallint               |           |          | 
 make                    | text                   |           |          | 
 model                   | text                   |           |          | 
 color                   | text                   |           |          | 
 violation_type          | text                   |           |          | 
 charge                  | text                   |           |          | 
 article                 | text                   |           |          | 
 contributed_to_accident | boolean                |           |          | 
 race                    | text                   |           |          | 
 gender                  | text                   |           |          | 
 driver_city             | text                   |           |          | 
 driver_state            | text                   |           |          | 
 dl_state                | text                   |           |          | 
 arrest_type             | text                   |           |          | 
 geolocation             | point                  |           |          | 
 council_districts       | smallint               |           |          | 
 councils                | smallint               |           |          | 
 communities             | smallint               |           |          | 
 zip_codes               | smallint               |           |          | 
 municipalities          | smallint               |           |          | 

A good candidate would be the councils column. When we setup the hash partitioning correctly the result would be 5 partitions:

postgres=# select distinct councils from mv_traffic_violations ;
 councils 
----------
         
        1
        3
        5
        2
        4
(6 rows)

The partitioned table becomes:

create table traffic_violations_p_hash
( seqid text
, date_of_stop date  
, time_of_stop time
, agency text
, subagency text
, description text 
, location text
, latitude numeric
, longitude numeric
, accident text
, belts boolean
, personal_injury boolean 
, property_damage boolean
, fatal boolean
, commercial_license boolean
, hazmat boolean
, commercial_vehicle boolean
, alcohol boolean
, workzone boolean
, state text
, vehicletype text 
, year smallint
, make text
, model text
, color text
, violation_type text 
, charge text
, article text
, contributed_to_accident boolean
, race text
, gender text
, driver_city text
, driver_state text
, dl_state text
, arrest_type text
, geolocation point
, council_districts smallint 
, councils smallint
, communities smallint
, zip_codes smallint
, municipalities smallint
)
partition by hash (councils);

As usual psql will report the partitioning strategy for the table as well:

postgres=# \d traffic_violations_p_hash 
               Partitioned table "public.traffic_violations_p_hash"
         Column          |          Type          | Collation | Nullable | Default 
-------------------------+------------------------+-----------+----------+---------
 seqid                   | text                   |           |          | 
 date_of_stop            | date                   |           |          | 
 time_of_stop            | time without time zone |           |          | 
 agency                  | text                   |           |          | 
 subagency               | text                   |           |          | 
 description             | text                   |           |          | 
 location                | text                   |           |          | 
 latitude                | numeric                |           |          | 
 longitude               | numeric                |           |          | 
 accident                | text                   |           |          | 
 belts                   | boolean                |           |          | 
 personal_injury         | boolean                |           |          | 
 property_damage         | boolean                |           |          | 
 fatal                   | boolean                |           |          | 
 commercial_license      | boolean                |           |          | 
 hazmat                  | boolean                |           |          | 
 commercial_vehicle      | boolean                |           |          | 
 alcohol                 | boolean                |           |          | 
 workzone                | boolean                |           |          | 
 state                   | text                   |           |          | 
 vehicletype             | text                   |           |          | 
 year                    | smallint               |           |          | 
 make                    | text                   |           |          | 
 model                   | text                   |           |          | 
 color                   | text                   |           |          | 
 violation_type          | text                   |           |          | 
 charge                  | text                   |           |          | 
 article                 | text                   |           |          | 
 contributed_to_accident | boolean                |           |          | 
 race                    | text                   |           |          | 
 gender                  | text                   |           |          | 
 driver_city             | text                   |           |          | 
 driver_state            | text                   |           |          | 
 dl_state                | text                   |           |          | 
 arrest_type             | text                   |           |          | 
 geolocation             | point                  |           |          | 
 council_districts       | smallint               |           |          | 
 councils                | smallint               |           |          | 
 communities             | smallint               |           |          | 
 zip_codes               | smallint               |           |          | 
 municipalities          | smallint               |           |          | 
Partition key: HASH (councils)
Number of partitions: 0

The setup for the hash partitions will be as follows:

create table traffic_violations_p_hash_p1
partition of traffic_violations_p_hash
for values with (modulus 5, remainder 0);

create table traffic_violations_p_hash_p2
partition of traffic_violations_p_hash
for values with (modulus 5, remainder 1);

create table traffic_violations_p_hash_p3
partition of traffic_violations_p_hash
for values with (modulus 5, remainder 2);

create table traffic_violations_p_hash_p4
partition of traffic_violations_p_hash
for values with (modulus 5, remainder 3);

create table traffic_violations_p_hash_p5
partition of traffic_violations_p_hash
for values with (modulus 5, remainder 4);

Hash partitioning can not have a default partition as that would not make any sense because of the modulus and the remainder. When you try to do that you will get an error:

postgres=# create table traffic_violations_p_hash_default
postgres-# partition of traffic_violations_p_hash default;
psql: ERROR:  a hash-partitioned table may not have a default partition

The final setup of our hash partitioned table is:

postgres=# \d+ traffic_violations_p_hash
                                   Partitioned table "public.traffic_violations_p_hash"
         Column          |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description 
-------------------------+------------------------+-----------+----------+---------+----------+--------------+-------------
 seqid                   | text                   |           |          |         | extended |              | 
 date_of_stop            | date                   |           |          |         | plain    |              | 
 time_of_stop            | time without time zone |           |          |         | plain    |              | 
 agency                  | text                   |           |          |         | extended |              | 
 subagency               | text                   |           |          |         | extended |              | 
 description             | text                   |           |          |         | extended |              | 
 location                | text                   |           |          |         | extended |              | 
 latitude                | numeric                |           |          |         | main     |              | 
 longitude               | numeric                |           |          |         | main     |              | 
 accident                | text                   |           |          |         | extended |              | 
 belts                   | boolean                |           |          |         | plain    |              | 
 personal_injury         | boolean                |           |          |         | plain    |              | 
 property_damage         | boolean                |           |          |         | plain    |              | 
 fatal                   | boolean                |           |          |         | plain    |              | 
 commercial_license      | boolean                |           |          |         | plain    |              | 
 hazmat                  | boolean                |           |          |         | plain    |              | 
 commercial_vehicle      | boolean                |           |          |         | plain    |              | 
 alcohol                 | boolean                |           |          |         | plain    |              | 
 workzone                | boolean                |           |          |         | plain    |              | 
 state                   | text                   |           |          |         | extended |              | 
 vehicletype             | text                   |           |          |         | extended |              | 
 year                    | smallint               |           |          |         | plain    |              | 
 make                    | text                   |           |          |         | extended |              | 
 model                   | text                   |           |          |         | extended |              | 
 color                   | text                   |           |          |         | extended |              | 
 violation_type          | text                   |           |          |         | extended |              | 
 charge                  | text                   |           |          |         | extended |              | 
 article                 | text                   |           |          |         | extended |              | 
 contributed_to_accident | boolean                |           |          |         | plain    |              | 
 race                    | text                   |           |          |         | extended |              | 
 gender                  | text                   |           |          |         | extended |              | 
 driver_city             | text                   |           |          |         | extended |              | 
 driver_state            | text                   |           |          |         | extended |              | 
 dl_state                | text                   |           |          |         | extended |              | 
 arrest_type             | text                   |           |          |         | extended |              | 
 geolocation             | point                  |           |          |         | plain    |              | 
 council_districts       | smallint               |           |          |         | plain    |              | 
 councils                | smallint               |           |          |         | plain    |              | 
 communities             | smallint               |           |          |         | plain    |              | 
 zip_codes               | smallint               |           |          |         | plain    |              | 
 municipalities          | smallint               |           |          |         | plain    |              | 
Partition key: HASH (councils)
Partitions: traffic_violations_p_hash_p1 FOR VALUES WITH (modulus 5, remainder 0),
            traffic_violations_p_hash_p2 FOR VALUES WITH (modulus 5, remainder 1),
            traffic_violations_p_hash_p3 FOR VALUES WITH (modulus 5, remainder 2),
            traffic_violations_p_hash_p4 FOR VALUES WITH (modulus 5, remainder 3),
            traffic_violations_p_hash_p5 FOR VALUES WITH (modulus 5, remainder 4)

Loading the data and checking the partitions:

postgres=# insert into traffic_violations_p_hash
postgres-#        select * from mv_traffic_violations;
INSERT 0 1528078
postgres=# select count(*) from traffic_violations_p_hash_p1;
 count  
--------
 988085
(1 row)

postgres=# select count(*) from traffic_violations_p_hash_p2;
 count  
--------
 539993
(1 row)

postgres=# select count(*) from traffic_violations_p_hash_p3;
 count 
-------
     0
(1 row)

postgres=# select count(*) from traffic_violations_p_hash_p4;
 count 
-------
     0
(1 row)

postgres=# select count(*) from traffic_violations_p_hash_p5;
 count 
-------
     0
(1 row)

Why do we have only data in the first and second partition? Looking more closely at the distribution of the councils in the materialized view:

postgres=# select distinct councils, count(*) from mv_traffic_violations group by councils;
 councils | count  
----------+--------
        1 | 231070
        2 | 211759
        3 | 328234
        4 | 334142
        5 | 308402
          | 114471
(6 rows)

First of all we have a lot of rows which do not have a council. Null values in hash partitioned tables always go to the partition where the remainder is 0:

postgres=# truncate table traffic_violations_p_hash;
TRUNCATE TABLE
postgres=# insert into traffic_violations_p_hash (councils) select null from generate_series(1,100);
INSERT 0 100
postgres=# select count(*) from traffic_violations_p_hash_p1;
 count 
-------
   100
(1 row)

Here is the reason for this behavior: “The calculated hash value for the null value will be zero, therefore, it will fall to the partition having remainder zero.”.

This does still not explain why nothing is in the partitions 3 to 5. Lets do a few tests with our councils:

postgres=# truncate traffic_violations_p_hash;
TRUNCATE TABLE
postgres=# insert into traffic_violations_p_hash (councils) values (1),(2),(3),(4),(5);
INSERT 0 5
postgres=# select councils from traffic_violations_p_hash_p1;
 councils 
----------
        1
        4
        5
(3 rows)

postgres=# select councils from traffic_violations_p_hash_p2;
 councils 
----------
        2
        3
(2 rows)

This confirms the behavior. Councils 1,4 and 5 go to the first partition while 2 and 3 go to the second one. The decision to hash partition the on the councils column clearly was not a good idea. Hash paritioning is great when you have many different values:

postgres=# truncate traffic_violations_p_hash;
TRUNCATE TABLE
postgres=# insert into traffic_violations_p_hash (councils) select * from generate_series(1,10000);
INSERT 0 10000
postgres=# select count(*) from traffic_violations_p_hash_p1;
 count 
-------
  1969
(1 row)

postgres=# select count(*) from traffic_violations_p_hash_p2;
 count 
-------
  2034
(1 row)

postgres=# select count(*) from traffic_violations_p_hash_p3;
 count 
-------
  2058
(1 row)

postgres=# select count(*) from traffic_violations_p_hash_p4;
 count 
-------
  1928
(1 row)

postgres=# select count(*) from traffic_violations_p_hash_p5;
 count 
-------
  2011
(1 row)

This gives much better distribution of the data. A good candidate would be the seqid column:

drop table traffic_violations_p_hash, traffic_violations_p_hash_p1, traffic_violations_p_hash_p2, traffic_violations_p_hash_p3, traffic_violations_p_hash_p4, traffic_violations_p_hash_p5;
create table traffic_violations_p_hash
( seqid text
, date_of_stop date  
, time_of_stop time
, agency text
, subagency text
, description text 
, location text
, latitude numeric
, longitude numeric
, accident text
, belts boolean
, personal_injury boolean 
, property_damage boolean
, fatal boolean
, commercial_license boolean
, hazmat boolean
, commercial_vehicle boolean
, alcohol boolean
, workzone boolean
, state text
, vehicletype text 
, year smallint
, make text
, model text
, color text
, violation_type text 
, charge text
, article text
, contributed_to_accident boolean
, race text
, gender text
, driver_city text
, driver_state text
, dl_state text
, arrest_type text
, geolocation point
, council_districts smallint 
, councils smallint
, communities smallint
, zip_codes smallint
, municipalities smallint
)
partition by hash (seqid);

create table traffic_violations_p_hash_p1
partition of traffic_violations_p_hash
for values with (modulus 5, remainder 0);

create table traffic_violations_p_hash_p2
partition of traffic_violations_p_hash
for values with (modulus 5, remainder 1);

create table traffic_violations_p_hash_p3
partition of traffic_violations_p_hash
for values with (modulus 5, remainder 2);

create table traffic_violations_p_hash_p4
partition of traffic_violations_p_hash
for values with (modulus 5, remainder 3);

create table traffic_violations_p_hash_p5
partition of traffic_violations_p_hash
for values with (modulus 5, remainder 4);

Loading the data again and checking the partitions:

postgres=# insert into traffic_violations_p_hash
postgres-#        select * from mv_traffic_violations;
INSERT 0 1528078
postgres=# select count(*) from traffic_violations_p_hash_p1;
 count  
--------
 305253
(1 row)

postgres=# select count(*) from traffic_violations_p_hash_p2;
 count  
--------
 304999
(1 row)

postgres=# select count(*) from traffic_violations_p_hash_p3;
 count  
--------
 305215
(1 row)

postgres=# select count(*) from traffic_violations_p_hash_p4;
 count  
--------
 305719
(1 row)

postgres=# select count(*) from traffic_violations_p_hash_p5;
 count  
--------
 306892
(1 row)

That was hash partitioning. In the next post we’ll look at partition pruning.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Daniel Westermann
Daniel Westermann

Principal Consultant & Technology Leader Open Infrastructure