Infrastructure at your Service

Daniel Westermann

bulk loading semi structured data in postgresql

The last post took a look at how to efficiently load 1m rows into a table in PostgreSQL. In this post I’ll take a look on how to do the same with semi structured data.
Read More
Again it is the same system as in the last post using the same PostgreSQL parameters.

Using the hstore datatype

For storing key/value pairs there is the hstore extension in PostgreSQL.
Using the same sample data as in the last post this is the script to load 1m of key/value rows:

\timing off
create extension if not exists hstore;
drop table demo_hstore;
create table demo_hstore ( document hstore );
\timing on
DO $$DECLARE
  l_people_array varchar(15)[12] := '{"Marc", "Bill", "George", "Eliot", "Matt", "Trey", "Tracy","Greg", "Steve", "Kristina", "Katie", "Jeff"}';
  n int;
  lv_hstore_string varchar(80);
BEGIN
  for i in 0..1e6 loop
     n:=trunc(random()*1000+1);
     lv_hstore_string := 'id => '||i||', text => "'||l_people_array[floor((random()*11))+1::int]||'", number => '||n;
     insert into DEMO_HSTORE values(lv_hstore_string::hstore);
  end loop;
END$$;

Lets execute it:

postgres=# \i 1mio_rows_no_pk_hstore.sql
Timing is off.
psql:1mio_rows_no_pk_hstore.sql:2: NOTICE:  extension "hstore" already exists, skipping
CREATE EXTENSION
DROP TABLE
CREATE TABLE
Timing is on.
DO
Time: 5833.749 ms
postgres=# \i 1mio_rows_no_pk_hstore.sql
Timing is off.
psql:1mio_rows_no_pk_hstore.sql:2: NOTICE:  extension "hstore" already exists, skipping
CREATE EXTENSION
DROP TABLE
CREATE TABLE
Timing is on.
DO
Time: 5741.429 ms
postgres=# \i 1mio_rows_no_pk_hstore.sql
Timing is off.
psql:1mio_rows_no_pk_hstore.sql:2: NOTICE:  extension "hstore" already exists, skipping
CREATE EXTENSION
DROP TABLE
CREATE TABLE
Timing is on.
DO
Time: 5789.512 ms

Around 6 seconds. Lets unload this with copy and reload it using copy:

postgres=# copy demo_hstore to '/home/postgres/demo_hstore.txt';
COPY 1000001
Time: 285.805 ms

Loading it with copy should be a lot faster:

\timing off
drop table demo_hstore;
create table demo_hstore ( document hstore );
\timing on
copy DEMO_HSTORE from '/home/postgres/demo_hstore.txt';

How much faster?

postgres=# \i 1mio_copy_no_indexes_hstore.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 2736.709 ms
postgres=# \i 1mio_copy_no_indexes_hstore.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 2882.477 ms
postgres=# \i 1mio_copy_no_indexes_hstore.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 3626.356 ms

More than 2 times faster. Your data now looks like this:

postgres=# select * from demo_hstore limit 5;
                   document                   
----------------------------------------------
 "id"=>"0", "text"=>"Bill", "number"=>"424"
 "id"=>"1", "text"=>"George", "number"=>"681"
 "id"=>"2", "text"=>"Eliot", "number"=>"79"
 "id"=>"3", "text"=>"George", "number"=>"472"
 "id"=>"4", "text"=>"Greg", "number"=>"966"
(5 rows)

Time: 15.175 ms

Using the jsonb datatype

PostgreSQL provides two data types for storing json. json and jsonb. jsonb is the binary format data type introduced in 9.4 while json is based on text.
So, lets do the same tests:

\timing off
drop table demo_jsonb;
create table demo_jsonb ( document jsonb );
\timing on
DO $$DECLARE
  l_people_array varchar(15)[12] := '{"Marc", "Bill", "George", "Eliot", "Matt", "Trey", "Tracy","Greg", "Steve", "Kristina", "Katie", "Jeff"}';
  n int;
  lv_json_string varchar(80);
BEGIN
  for i in 0..1e6 loop
     n:=trunc(random()*1000+1);
     lv_json_string := '{"id":"'||i||'", "text":"'||l_people_array[floor((random()*11))+1::int]||'", "number":"'||n||'"}';
     insert into DEMO_JSONB values(lv_json_string::jsonb);
  end loop;
END$$;

How long does it take in a loop? :

postgres=# \i 1mio_rows_no_pk_jsonb.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
DO
Time: 6229.134 ms
postgres=# \i 1mio_rows_no_pk_jsonb.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
DO
Time: 6210.610 ms
postgres=# \i 1mio_rows_no_pk_jsonb.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
DO
Time: 6237.734 ms

Around 6 seconds. And by using copy?

postgres=# copy demo_jsonb to '/home/postgres/demo_jsonb.txt';
COPY 1000001
Time: 501.950 ms

Lets load it:

postgres=# \i 1mio_copy_no_indexes_jsonb.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 3739.681 ms
postgres=# \i 1mio_copy_no_indexes_jsonb.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 4369.161 ms
postgres=# \i 1mio_copy_no_indexes_jsonb.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 5000.136 ms

Between 4 and 5 seconds. And how does the data look like?

postgres=# select * from demo_jsonb limit 5;
                   document                    
-----------------------------------------------
 {"id": "0", "text": "George", "number": "51"}
 {"id": "1", "text": "Marc", "number": "884"}
 {"id": "2", "text": "Bill", "number": "951"}
 {"id": "3", "text": "Steve", "number": "200"}
 {"id": "4", "text": "Eliot", "number": "722"}
(5 rows)

How do you select on this table if you want to limit the results?

postgres=# SELECT document->'id', document->'text', document->'number' FROM demo_jsonb WHERE document @> '{"number": "884"}' limit 5;
 ?column? | ?column? | ?column? 
----------+----------+----------
 "1"      | "Marc"   | "884"
 "64"     | "Katie"  | "884"
 "201"    | "Tracy"  | "884"
 "1041"   | "George" | "884"
 "2074"   | "Greg"   | "884"
(5 rows)

Check the documentation for more examples and on how to index jsonb columns.
Looking forward to see Franck and David to load the same data …

3 Comments

Leave a Reply

Daniel Westermann
Daniel Westermann

Senior Consultant and Technology Leader Open Infrastructure