Infrastructure at your Service

Daniel Westermann

the fastest way to load 1m rows in postgresql

postreslogo.png

There have been several posts on how to load 1m rows into a database in the last days:
Variations on 1M rows insert (1): bulk insert
Variations on 1M rows insert(2): commit write
Variations on 1M rows insert (1): bulk insert – PostgreSQL
Variations on 1M rows insert(2): commit write – PostgreSQL
Variations on 1M rows insert (3): TimesTen

In this post I’ll focus on how to prepare a PostgreSQL database for bulk loading in more detail. The test system is the same which was used in the previous posts.

cat /etc/oracle-release
Oracle Linux Server release 7.1
cat /proc/cpuinfo | grep proc
processor : 0
cat /proc/meminfo | head -1
MemTotal: 502612 kB

I did not tweak any kernel parameters, just the default:

cat /usr/lib/sysctl.d/00-system.conf | grep -v "#"
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
kernel.shmmax = 4294967295
kernel.shmall = 268435456

All is on standard xfs filesystems. No tweaking there, too:

df -h
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/ol-root   48G  4.5G   44G  10% /
devtmpfs             237M     0  237M   0% /dev
tmpfs                246M  4.0K  246M   1% /dev/shm
tmpfs                246M  4.4M  242M   2% /run
tmpfs                246M     0  246M   0% /sys/fs/cgroup
/dev/sda1            497M  255M  243M  52% /boot
mount | grep /dev/mapper/ol-root
/dev/mapper/ol-root on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

The postgresql version still is 9.4.1:

postgres=# select version();
                                                         version                                                          
--------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.4.1dbi services on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit
(1 row)

I’ll adjust several postgresql parameters to get best performance:

fsync='off'
synchronous_commit='off'
full_page_writes='off'
bgwriter_lru_maxpages=0
wal_level='minimal'
archive_mode='off'
work_mem='64MB'
checkpoint_segments=32
max_wal_senders=0
maintenance_work_mem='64MB'
shared_buffers='128MB'

For a detailed description of the parameters head over to the official documentation. Just to be clear again: Turning fsync and synchronous_commit off is not crash safe. This means if the server crashes during the load for some reason the database will be corrupted. See the reference section at the end of this post for more details. The fastest method to load or unload data in PostgreSQL is the copy command. I’ll use the same file to load the data as in the previous post. Some notes about indexes and foreign keys: It is usually faster to create the indexes and foreign keys after the bulk load and to delete or disable them before the load. I’ll do a test case on this below.
Lets prepare the system:

postgres=# alter system set fsync='off';
ALTER SYSTEM
postgres=# alter system set synchronous_commit='off';
ALTER SYSTEM
postgres=# alter system set full_page_writes='off';
ALTER SYSTEM
postgres=# alter system set bgwriter_lru_maxpages=0;
ALTER SYSTEM
postgres=# alter system set wal_level='minimal';
ALTER SYSTEM
postgres=# alter system set archive_mode='off';
ALTER SYSTEM
postgres=# alter system set work_mem='64MB';
ALTER SYSTEM
postgres=# alter system set checkpoint_segments=32;
ALTER SYSTEM
postgres=# alter system set max_wal_senders=0;
ALTER SYSTEM
postgres=# alter system set maintenance_work_mem='64MB';
ALTER SYSTEM
postgres=# alter system set shared_buffers='128MB';
ALTER SYSTEM

As some of the parameters require a restart of the database lets do this:

pg_ctl -D /u02/pgdata/PG1/ restart -m fast

Ready for the test cases. As said earlier I’ll use the same file to load the data (1000001 records):

[email protected]:/home/postgres/ [PG1] ls -la demo.txt 
-rw-r--r--. 1 postgres postgres 16690526 May  4 15:53 demo.txt
[email protected]:/home/postgres/ [PG1] cat demo.txt | wc -l
1000001
[email protected]:/home/postgres/ [PG1] head -2 demo.txt 
0	Marc	564
1	Bill	345

First test case: No indexes

The first test case will be without any indexes and looks like this:

\timing off
drop table demo;
create table demo ( id int
                  , text varchar(15)
                  , number int
                  );
\timing on
copy DEMO from '/home/postgres/demo.txt';

How long does it take? I’ll execute the test several times to get an average:

postgres=# \i 1mio_copy_no_indexes.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 467.219 ms
postgres=# \i 1mio_copy_no_indexes.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 460.730 ms
postgres=# \i 1mio_copy_no_indexes.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 680.538 ms
postgres=# \i 1mio_copy_no_indexes.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 444.932 ms
postgres=# select count(*) from demo;
  count  
---------
 1000001
(1 row)

Time: 108.792 ms

Between 0,4 and and 0,7 seconds for 1000001 rows.

Second test case: Primary key exists prior to the load

This test case creates a primary key (and thus an index) before the load is started:

\timing off
drop table demo;
create table demo ( id int
                  , text varchar(15)
                  , number int
                  , constraint demo_pk primary key ( id )
                  );
\timing on
copy DEMO from '/home/postgres/demo.txt';

The results ( Again several executions to get an average ):

postgres=# \i 1mio_copy_pk_before_load.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 5721.706 ms
postgres=# \i 1mio_copy_pk_before_load.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 6653.489 ms
postgres=# \i 1mio_copy_pk_before_load.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 3979.691 ms

Something between 3,9 and 6,6 seconds.

Third test case: Primary key after the load

In this case the primary key will be added after the load:

\timing off drop table demo;
create table demo ( id int , text varchar(15) , number int );
\timing on copy DEMO from '/home/postgres/demo.txt';
alter table demo add constraint demo_pk primary key (id);

The results ( Again several executions to get an average ):

postgres=# \i 1mio_copy_pk_after_load.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 519.758 ms
ALTER TABLE
Time: 555.851 ms
postgres=# \i 1mio_copy_pk_after_load.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 626.097 ms
ALTER TABLE
Time: 654.467 ms
postgres=# \i 1mio_copy_pk_after_load.sql
Timing is off.
DROP TABLE
CREATE TABLE
Timing is on.
COPY 1000001
Time: 756.139 ms
ALTER TABLE
Time: 694.572 ms

Something between 1,2 and 1,5 seconds.

Conclusion
:
If you need indexes (and who does not) it is definitely faster do drop or disable from before the load and create them afterwards.

References:
Populating a Database
PostgreSQL as an In-Memory Only Database

3 Comments

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Daniel Westermann
Daniel Westermann

Principal Consultant & Technology Leader Open Infrastructure