Infrastructure at your Service

Daniel Westermann

PostgreSQL partitioning (1): Preparing the data set

This is the start of a series about partitioning in PostgreSQL. For this and the following posts I will use PostgreSQL 12 (which currently is in beta) so some stuff might not work if you are on PostgreSQL 11 or even on PostgreSQL 10 when declarative partitioning was introduced. This very first post is only about preparing the data set and I decided to separate this one from the next post because it shows various things around loading data in PostgreSQL and therefore might be useful on its one.

When you are looking for free data sets “The home of the U.S. Government’s open data” is a great source. It provides free data sets for research, development or just data you can play with for whatever reason. For the scope of this and the following posts we will use the Traffic violations data set. It contains more the 1.5 millions of rows and is a good starting point for diving into PostgreSQL partitioning.

The initial import goes into a table which only contains text columns and from there we can load our partitions later on. The structure of the table is:

create table traffic_violations
( seqid text
, date_of_stop text
, time_of_stop text
, agency text
, subagency text
, description text
, location text
, latitude text
, longitude text 
, accident text
, belts text
, personal_injury text
, property_damage text
, fatal text
, commercial_license text
, hazmat text
, commercial_vehicle text
, alcohol text
, workzone text
, state text
, vehicletype text
, year text
, make text
, model text
, color text
, violation_type text
, charge text
, article text
, contributed_to_accident text
, race text
, gender text
, driver_city text
, driver_state text
, dl_state text
, arrest_type text
, geolocation text
, council_districts text
, councils text
, communities text
, zip_codes text
, municipalities text
);

For loading the table the easiest way is to use PostgreSQL’s copy command:

postgres=# \! ls -l Traffic_Violations.csv
-rw-rw-r--. 1 postgres postgres 634161687 May 28 11:28 Traffic_Violations.csv
postgres=# copy traffic_violations from '/home/postgres/Traffic_Violations.csv' with ( format csv, header true, null 'null', delimiter ',');
COPY 1528078

So we have around 1.5 millions of rows with real data. As this table contains only text columns we will create a materialized view with data types that fit the data. Looking at one row of the table:

postgres=# \x
Expanded display is on.
postgres=# select * from traffic_violations limit 1;
-[ RECORD 1 ]-----------+-------------------------------------------------------------------------
seqid                   | e13d2082-55a7-4a93-8836-173be19d2648
date_of_stop            | 06/01/2019
time_of_stop            | 13:38:00
agency                  | MCP
subagency               | 2nd District, Bethesda
description             | DRIVER FAILURE TO YIELD RIGHT-OF-WAY TO VEH. UPON EMERGING FROM DRIVEWAY
location                | RANDOLPH RD / NEW HAMPSHIRE
latitude                | 39.07592
longitude               | -77.0011316666667
accident                | Yes
belts                   | No
personal_injury         | No
property_damage         | Yes
fatal                   | No
commercial_license      | No
hazmat                  | No
commercial_vehicle      | No
alcohol                 | No
workzone                | No
state                   | MD
vehicletype             | 02 - Automobile
year                    | 2014
make                    | HYUN
model                   | TK
color                   | SILVER
violation_type          | Warning
charge                  | 21-705(c)
article                 | Transportation Article
contributed_to_accident | false
race                    | WHITE
gender                  | F
driver_city             | SILVER SPRING
driver_state            | MD
dl_state                | MD
arrest_type             | A - Marked Patrol
geolocation             | (39.07592, -77.0011316666667)
council_districts       | 5
councils                | 5
communities             | 26
zip_codes               | 12
municipalities          | 1

The materialized view becomes:

create materialized view mv_traffic_violations
( seqid 
, date_of_stop  
, time_of_stop 
, agency  
, subagency  
, description 
, location 
, latitude 
, longitude 
, accident 
, belts 
, personal_injury 
, property_damage 
, fatal 
, commercial_license 
, hazmat 
, commercial_vehicle 
, alcohol 
, workzone 
, state 
, vehicletype 
, year 
, make 
, model 
, color 
, violation_type 
, charge 
, article 
, contributed_to_accident 
, race 
, gender 
, driver_city 
, driver_state 
, dl_state 
, arrest_type 
, geolocation 
, council_districts 
, councils 
, communities 
, zip_codes 
, municipalities 
)
as
select seqid
     , to_date(date_of_stop,'MM/DD/YYYY')
     , time_of_stop::time
     , agency
     , subagency 
     , description
     , location
     , latitude::numeric
     , longitude::numeric
     , accident
     , belts::boolean
     , personal_injury::boolean
     , property_damage::boolean
     , fatal::boolean
     , commercial_license::boolean
     , hazmat::boolean
     , commercial_vehicle::boolean
     , alcohol::boolean
     , workzone::boolean
     , state
     , vehicletype
     , case year
         when '' then null
         else year::smallint
       end
     , make
     , model
     , color
     , violation_type
     , charge
     , article
     , contributed_to_accident::boolean
     , race
     , gender
     , driver_city
     , driver_state
     , dl_state
     , arrest_type
     , geolocation::point
     , case council_districts
         when '' then null
         else council_districts::smallint
       end
     , case councils
         when '' then null
         else councils::smallint
       end
     , case communities
         when '' then null
         else communities::smallint
       end
     , case zip_codes
         when '' then null
         else zip_codes::smallint
       end
     , case municipalities
         when '' then null
         else municipalities::smallint
       end
  from traffic_violations;

The statement contains some “case” expressions to test for empty strings. If you do not do this you get issues like this when you cast into a specific data type:

psql: ERROR:  invalid input syntax for type smallint: ""

The beauty of a materialized view is, that you can refresh whenever the underlying data set changed, e.g.:

postgres=# refresh materialized view mv_traffic_violations WITH data;
REFRESH MATERIALIZED VIEW

Note that this will block access to the materialized view for the time of the refresh. You could avoid that with refreshing concurrently but that only works when you have a unique index on the materialized view. With this data set we can not create a unique index because some of the rows are duplicates, e.g.:

postgres=# select * from mv_traffic_violations where seqid='b87c908c-ce2d-4c10-89fa-ca48735af485' and date_of_stop = '2012-11-07' and time_of_stop = '05:07:00' and description = 'FAILURE OF VEH. DRIVER IN ACCIDENT TO LOCATE AND NOTIFY OWNER OF UNATTENDED VEH. OF DAMAGE' and charge = '20-105(b)' and violation_type = 'Citation';
-[ RECORD 1 ]-----------+-------------------------------------------------------------------------------------------
seqid                   | b87c908c-ce2d-4c10-89fa-ca48735af485
date_of_stop            | 2012-11-07
time_of_stop            | 05:07:00
agency                  | MCP
subagency               | 2nd District, Bethesda
description             | FAILURE OF VEH. DRIVER IN ACCIDENT TO LOCATE AND NOTIFY OWNER OF UNATTENDED VEH. OF DAMAGE
location                | IFO 2531 ROSS RD
latitude                | 39.0222578333333
longitude               | -77.04575825
accident                | Yes
belts                   | f
personal_injury         | f
property_damage         | t
fatal                   | f
commercial_license      | f
hazmat                  | f
commercial_vehicle      | f
alcohol                 | f
workzone                | f
state                   | MD
vehicletype             | 02 - Automobile
year                    | 2002
make                    | SUBARU
model                   | FORESTER
color                   | SILVER
violation_type          | Citation
charge                  | 20-105(b)
article                 | Transportation Article
contributed_to_accident | f
race                    | HISPANIC
gender                  | M
driver_city             | SILVER SPRING
driver_state            | MD
dl_state                | MD
arrest_type             | A - Marked Patrol
geolocation             | (39.022257833333,-77.04575825)
council_districts       | 5
councils                | 5
communities             | 10
zip_codes               | 17
municipalities          | 1
-[ RECORD 2 ]-----------+-------------------------------------------------------------------------------------------
seqid                   | b87c908c-ce2d-4c10-89fa-ca48735af485
date_of_stop            | 2012-11-07
time_of_stop            | 05:07:00
agency                  | MCP
subagency               | 2nd District, Bethesda
description             | FAILURE OF VEH. DRIVER IN ACCIDENT TO LOCATE AND NOTIFY OWNER OF UNATTENDED VEH. OF DAMAGE
location                | IFO 2531 ROSS RD
latitude                | 39.0222578333333
longitude               | -77.04575825
accident                | Yes
belts                   | f
personal_injury         | f
property_damage         | t
fatal                   | f
commercial_license      | f
hazmat                  | f
commercial_vehicle      | f
alcohol                 | f
workzone                | f
state                   | MD
vehicletype             | 02 - Automobile
year                    | 2002
make                    | SUBARU
model                   | FORESTER
color                   | SILVER
violation_type          | Citation
charge                  | 20-105(b)
article                 | Transportation Article
contributed_to_accident | f
race                    | HISPANIC
gender                  | M
driver_city             | SILVER SPRING
driver_state            | MD
dl_state                | MD
arrest_type             | A - Marked Patrol
geolocation             | (39.022257833333,-77.04575825)
council_districts       | 5
councils                | 5
communities             | 10
zip_codes               | 17
municipalities          | 1

So now we have the data set we can use for the partitioning examples in the next posts. To summarize, what did we do:

  • https://www.data.gov/ is a great source for free data sets and we used one of them to load data into PostgreSQL
  • copy is great for loading data into PostgreSQL
  • The “case” statement is great for testing various conditions in a statement
  • materialized view are great when you want to refresh from an underlying data set that changes, but you will need a unique index for refreshing concurrently
  • You might need to cast one data type into another

In the next post we will look at the first partitioning strategy: Range partitioning.

One Comment

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Daniel Westermann
Daniel Westermann

Principal Consultant & Technology Leader Open Infrastructure