Coming back to the last post: 8000 databases in one PostgreSQL cluster?. As we’ve seen, this is doable but comes with a lot of pain points. Another solution for this kind of requirement would be to host 8000 PostgreSQL clusters on one host. Again, I am not saying this is something you should do, and again, this is not how it will finally be implemented. This is more about whether you can do it, and what it means for operating such a setup. Eight thousand is a huge number, no matter if it is this number of databases in a single cluster, or this number of clusters on a single node.

The default PostgreSQL configuration on Linux comes with 128MB of shared_buffers. Again, if we do the math, that would be 1’024’000MB for 8000 clusters on a single host. Surprisingly, I do not have such a machine available right now. We’ll use an AWS EC2 t3a.xlarge, which comes with 4 vCPUs and 16GB of memory. If we reduce shared_buffers to 128kB (which is the minimum), we come to 1’024’000kB, which is around 1GB of shared_buffers for all clusters. This should be doable, not counting any other memory related parameters.
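
Just to double check that math with simple shell arithmetic:

# 8000 clusters with the default of 128MB of shared_buffers
echo "$(( 8000 * 128 ))MB"     # 1024000MB, roughly 1TB
# 8000 clusters with the minimum of 128kB of shared_buffers
echo "$(( 8000 * 128 ))kB"     # 1024000kB, roughly 1GB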

The next bit we need to calculate is the required disk space. A freshly initialized PGDATA for PostgreSQL 13.1 consumes around 39MB. For 8000 clusters we will therefore need roughly 312’000MB of disk space. So, here we go:

postgres@pg13:/home/postgres/ [pg131] grep MemTotal /proc/meminfo 
MemTotal:       16284668 kB
postgres@pg13:/home/postgres/ [pg131] grep proc /proc/cpuinfo 
processor       : 0
processor       : 1
processor       : 2
processor       : 3
postgres@pg13:/home/postgres/ [pg131] df -h
Filesystem       Size  Used Avail Use% Mounted on
udev             7.8G     0  7.8G   0% /dev
tmpfs            1.6G  8.4M  1.6G   1% /run
/dev/nvme0n1p1   492G  2.9G  469G   1% /
tmpfs            7.8G     0  7.8G   0% /dev/shm
tmpfs            5.0M     0  5.0M   0% /run/lock
tmpfs            7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/nvme0n1p15  124M  262K  124M   1% /boot/efi
tmpfs            1.6G     0  1.6G   0% /run/user/1000

Initializing 8000 clusters will take some time, but the procedure is quite simple. We need a different directory for each PGDATA and a different port for each cluster. In addition, we need to reduce shared_buffers to 128kB and then start up all the clusters. First, let’s create all the data directories and check if we already run into any issues with this (a parallel variant is sketched right after the loop):

port="5432"
for i in {1..8000}
do
    PGDATA="/u02/pgdata/pgdata${i}"
    initdb -D ${PGDATA} > /dev/null 2>&1
    echo "port=${port}" >> ${PGDATA}/postgresql.auto.conf
    echo "shared_buffers=128kB" >> ${PGDATA}/postgresql.auto.conf
    (( port += 1 ))
done
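
As initializing 8000 clusters serially takes hours (see below), a parallel variant could speed this up considerably. This is not what was used for the numbers in this post, just a sketch assuming bash and GNU xargs:

# hypothetical parallel initialization, one worker per CPU
init_one() {
    i="$1"
    PGDATA="/u02/pgdata/pgdata${i}"
    initdb -D ${PGDATA} > /dev/null 2>&1
    echo "port=$(( 5431 + i ))" >> ${PGDATA}/postgresql.auto.conf
    echo "shared_buffers=128kB" >> ${PGDATA}/postgresql.auto.conf
}
export -f init_one
seq 1 8000 | xargs -P $(nproc) -I {} bash -c 'init_one "$@"' _ {}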

If you want to monitor the progress, you might do something like this:

postgres@pg13:/home/postgres/ [pg131] watch "ls -l /u02/pgdata | grep -v total | wc -l"
Every 2.0s: ls -l /u02/pgdata | grep -v total | wc -l                                                 pg13: Fri Feb 12 10:50:41 2021

146
...
Every 2.0s: ls -l /u02/pgdata | grep -v total | wc -l                                                 pg13: Fri Feb 12 11:06:14 2021

1197
...
Every 2.0s: ls -l /u02/pgdata | grep -v total | wc -l                                                 pg13: Fri Feb 12 11:13:27 2021

1696
Every 2.0s: ls -l /u02/pgdata | grep -v total | wc -l                                                 pg13: Fri Feb 12 11:54:09 2021

4452
Every 2.0s: ls -l /u02/pgdata | grep -v total | wc -l                                                 pg13: Fri Feb 12 12:19:44 2021

6116
Every 2.0s: ls -l /u02/pgdata | grep -v total | wc -l                                                 pg13: Fri Feb 12 13:35:24 2021

8000

That takes quite some time. How much space are we consuming?

postgres@pg13:/home/postgres/ [pg131] du -sh /u02/pgdata/
308G    /u02/pgdata/

Not far from what we calculated above. Before we try to start all the clusters, there is another bit we need to think about: if we start up a PostgreSQL 13.1 instance, how many processes do we get by default?

postgres@debian10pg:/home/postgres/ [pg131] ps -ef | egrep "postgres -D|postgres:"
postgres 28408     1  0 07:42 ?        00:00:00 /u01/app/postgres/product/DEV/db_1/bin/postgres -D /u02/pgdata/DEV
postgres 28409 28408  0 07:42 ?        00:00:00 postgres: logger 
postgres 28411 28408  0 07:42 ?        00:00:00 postgres: checkpointer 
postgres 28412 28408  0 07:42 ?        00:00:00 postgres: background writer 
postgres 28413 28408  0 07:42 ?        00:00:00 postgres: walwriter 
postgres 28414 28408  0 07:42 ?        00:00:00 postgres: autovacuum launcher 
postgres 28415 28408  0 07:42 ?        00:00:00 postgres: stats collector 
postgres 28416 28408  0 07:42 ?        00:00:00 postgres: logical replication launcher 
postgres@debian10pg:/home/postgres/ [pg131] ps -ef | egrep "postgres -D|postgres:" | wc -l
8

That’s eight processes per cluster; multiplied by 8000, that makes 64’000 processes. In the default configuration on Debian we have these limits:

postgres@pg13:/home/postgres/ [pg131] ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63541
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 63541
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
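
Compare that against the limits above, again with simple shell arithmetic:

# 8 processes per cluster, 8000 clusters
echo $(( 8 * 8000 ))     # 64000, which is above the 63541 "max user processes" shown above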

This will for sure not work, as we can easily verify:

port="5432"
for i in {1..8000}
do
    PGDATA="/u02/pgdata/pgdata${i}"
    pg_ctl start -D ${PGDATA}/  -l /dev/null 2>&1 > /dev/null
    (( port += 1 ))
done

This will go fine until around 1500 postmasters have been started:

unable to fork process: Resource temporarily unavailable

… so we clearly hit some limits here. Let’s increase the maximum number of open files and the maximum number of processes for the postgres user (-1 means unlimited; only the hard limits for now):

postgres@pg13:/home/postgres/ [pg131] cat /etc/security/limits.conf  | egrep -v "^#|^$"
postgres         hard     nproc          -1
postgres         hard     nofile         -1

After restarting the session we can check the actual value of the hard limits:

postgres@pg13:/home/postgres/ [pg131] ulimit -H -n
1048576
postgres@pg13:/home/postgres/ [pg131] ulimit -H -u
unlimited
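
The 1048576 is most probably the kernel’s fs.nr_open default, which caps how high the per-process limit for open files can be raised (this could in principle be increased via sysctl as well, but we keep the default here):

# fs.nr_open caps the per-process open files hard limit, default is 1048576
sysctl fs.nr_open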

This means we should be able to create an unlimited number of user processes, and we should be able to open 1048576 files. As these are the hard limits, we need to increase the soft limits as well:

postgres@pg13:/home/postgres/ [pg131] ulimit -S -u unlimited
postgres@pg13:/home/postgres/ [pg131] ulimit -S -n 1048576
postgres@pg13:/home/postgres/ [pg131] ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63540
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Let’s check if that is sufficient for starting the 8000 clusters:

postgres@pg13:/home/postgres/ [pg131] port="5432"
postgres@pg13:/home/postgres/ [pg131] for i in {1..8000}
> do
>     PGDATA="/u02/pgdata/pgdata${i}"
>     pg_ctl start -D ${PGDATA}/  -l /dev/null 2>&1 > /dev/null
>     (( port += 1 ))
> done

No, we’ll still hit limits (again around 1500 clusters):

no data was returned by command ""/u01/app/postgres/product/13/db_1/bin/postgres" -V"
The program "postgres" is needed by pg_ctl but was not found in the
same directory as "/u01/app/postgres/product/13/db_1/bin/pg_ctl".
Check your installation.
sh: 1: Cannot fork
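
Before resizing the instance it makes sense to cleanly stop the clusters that already started. I did not show that step above, but a loop like this (skipping the data directories without a running postmaster) does the job:

for i in {1..8000}
do
    PGDATA="/u02/pgdata/pgdata${i}"
    # only stop clusters that are actually running
    pg_ctl status -D ${PGDATA} > /dev/null 2>&1 && pg_ctl stop -D ${PGDATA} -m fast > /dev/null 2>&1
done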

Let’s try a larger instance type: a c5.9xlarge, which comes with 36 vCPUs and 72GiB of memory. Will that work?

postgres@pg13:/home/postgres/ [pg131] cat /proc/cpuinfo | grep proc | wc -l
36
postgres@pg13:/home/postgres/ [pg131] cat /proc/meminfo | grep Mem
MemTotal:       72033208 kB
MemFree:        71577940 kB
MemAvailable:   71139240 kB
postgres@pg13:/home/postgres/ [pg131] ulimit -S -u unlimited
postgres@pg13:/home/postgres/ [pg131] ulimit -S -n 1048576
postgres@pg13:/home/postgres/ [pg131] for i in {1..8000}
> do
>     PGDATA="/u02/pgdata/pgdata${i}"
>     pg_ctl start -D ${PGDATA}/  -l /dev/null 2>&1 > /dev/null
>     (( port += 1 ))
> done

We can go a bit further and reach approximately 1740 running clusters, but this is far from 8000:

postgres@pg13:/home/postgres/ [pg131] ps -ef | grep postgres | grep "postgres -D" | wc -l
1736

If we scale up even further to a c5a.24xlarge (96 vCPUs and 192GiB of memory) we get:

postgres@pg13:/home/postgres/ [pg131] cat /proc/cpuinfo | grep proc | wc -l
96
postgres@pg13:/home/postgres/ [pg131] cat /proc/meminfo | grep Mem
MemTotal:       195998200 kB
MemFree:        195223608 kB
MemAvailable:   194067016 kB
postgres@pg13:/home/postgres/ [pg131] ulimit -S -u unlimited
postgres@pg13:/home/postgres/ [pg131] ulimit -S -n 1048576
postgres@pg13:/home/postgres/ [pg131] for i in {1..8000}; do     PGDATA="/u02/pgdata/pgdata${i}";     pg_ctl start -D ${PGDATA}/  -l /dev/null 2>&1 > /dev/null;     (( port += 1 )); done

This will give us around 4096 clusters, which is a bit more than half of the 8000 we wanted. But this also means that there is another limit we’re hitting. Four thousand clusters consume around 32’000 processes, and we’re nowhere near the limits we’ve set above. What is it then? Increasing the number of vCPUs clearly gave us more running clusters, so there must be a relation between the number of vCPUs and the number of processes we can start. There is a kernel parameter which limits the maximum number of processes one can start on a Linux system:

postgres@pg13:/home/postgres/ [pg131] cat /proc/sys/kernel/pid_max
98304
postgres@pg13:/home/postgres/ [pg131] man proc 5
       /proc/sys/kernel/pid_max (since Linux 2.5.34)
              This file specifies the value at which PIDs wrap around (i.e., the value in this file is one greater than the maximum PID).  PIDs greater than this value are not allocated; thus, the value in this file also acts as a system-wide limit on  the  total
              number  of  processes and threads.  The default value for this file, 32768, results in the same range of PIDs as on earlier kernels.  On 32-bit platforms, 32768 is the maximum value for pid_max.  On 64-bit systems, pid_max can be set to any value up
              to 2^22 (PID_MAX_LIMIT, approximately 4 million).

We’re already higher than the default, which is 32768, and this value depends on the number of CPUs (more information here). This is not a hard limit in itself: once that number is reached, the PID counter wraps around and starts from 1 again, but if no free PIDs are left, no additional process can be started. If we scale down our instance to c5.large (4vCPUs and 8GiB of memory) we’ll again get the default:

postgres@pg13:/home/postgres/ [pg131] cat /proc/sys/kernel/pid_max
32768
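
The relation comes from the kernel itself: at boot, pid_max is sized to 1024 PIDs per possible CPU, with a floor of 32768 (see kernel/pid.c). A quick check of that assumption (note that the number of possible CPUs can be higher than what nproc reports):

# expected default pid_max: max(32768, 1024 * number of CPUs)
awk -v cpus=$(nproc) 'BEGIN { d = cpus * 1024; if (d < 32768) d = 32768; print d }'
# 98304 on the 96 vCPU instance, 32768 on the small one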

Although we should not hit the limit of 98304, let’s try to increase it anyway and do another run. This time we’ll also have a look at the memory usage:

postgres@pg13:/home/postgres/ [pg131] sudo bash
root@pg13:/home/postgres$ echo "196608" > /proc/sys/kernel/pid_max
root@pg13:/home/postgres$ exit
postgres@pg13:/home/postgres/ [pg131] cat /proc/sys/kernel/pid_max
196608
postgres@pg13:/home/postgres/ [pg131] ulimit -S -u unlimited
postgres@pg13:/home/postgres/ [pg131] ulimit -S -n 1048576
postgres@pg13:/home/postgres/ [pg131] for i in {1..8000}
> do
>     PGDATA="/u02/pgdata/pgdata${i}"
>     pg_ctl start -D ${PGDATA}/  -l /dev/null 2>&1 > /dev/null
>     (( port += 1 ))
> done

In a second session we monitor the memory usage:

postgres@pg13:/home/postgres/ [pg131] watch "free -m"
## with approx.1000 clusters
Every 2.0s: free -m                                                                                   pg13: Tue Feb 16 13:16:50 2021

              total        used        free      shared  buff/cache   available
Mem:         189152        6412      177810        4183        4929      177353
Swap:             0           0           0

## with approx. 2000 clusters
Every 2.0s: free -m                                                                                   pg13: Tue Feb 16 13:18:38 2021

              total        used        free      shared  buff/cache   available
Mem:         189152       12098      167470        8276        9583      167516
Swap:             0           0           0

## with approx. 3000 clusters
Every 2.0s: free -m                                                                                   pg13: Tue Feb 16 13:20:26 2021

              total        used        free      shared  buff/cache   available
Mem:         189152       17803      157119       12369       14229      157664
Swap:             0           0           0

## with approx. 4000 clusters
Every 2.0s: free -m                                                                                   pg13: Tue Feb 16 13:22:16 2021

              total        used        free      shared  buff/cache   available
Mem:         189152       23673      146495       16540       18983      147562
Swap:             0           0           0
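
A rough extrapolation from these snapshots gives an idea of the per-cluster memory footprint:

# used memory grew from 6412MB (at ~1000 clusters) to 23673MB (at ~4000 clusters)
awk 'BEGIN { per = (23673 - 6412) / 3000; print per, per * 8000 }'
# ~5.75MB per additional cluster, roughly 46'000MB for all 8000 clusters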

Even with the increased pid_max it still fails with a bit more than 4000 clusters. The issue is not the overall memory consumption; as the numbers above show, we still have plenty of memory left. The answer is in the PostgreSQL startup log:

2021-02-16 13:43:17.990 UTC [97614] FATAL:  could not create shared memory segment: No space left on device
2021-02-16 13:43:17.990 UTC [97614] DETAIL:  Failed system call was shmget(key=21380032, size=56, 03600).
2021-02-16 13:43:17.990 UTC [97614] HINT:  This error does *not* mean that you have run out of disk space.  It occurs either if all available shared memory IDs have been taken, in which case you need to raise the SHMMNI parameter in your kernel, or because the system's overall limit for shared memory has been reached.
        The PostgreSQL documentation contains more information about shared memory configuration.
2021-02-16 13:43:17.990 UTC [97614] LOG:  database system is shut down
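
Since PostgreSQL 9.3, each postmaster creates only one small System V shared memory segment (the rest is allocated with mmap), so 8000 clusters need at least 8000 segments. The kernel default for kernel.shmmni is 4096, which matches the roughly 4000 clusters we were able to start. This can be checked like this:

# number of System V shared memory segments currently allocated by postgres
ipcs -m | grep postgres | wc -l
# the current limit on the number of segments
sysctl kernel.shmmni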

So we need to increase kernel.shmmni to make enough shared memory segments available. Next try:

postgres@pg13:/home/postgres/ [pg131] sudo sysctl -w kernel.shmmni=8192
kernel.shmmni = 8192
postgres@pg13:/home/postgres/ [pg131] sysctl -p
postgres@pg13:/home/postgres/ [pg131] sudo sysctl -a | grep kernel.shmmni
kernel.shmmni = 8192
postgres@pg13:/home/postgres/ [pg131] for i in {1..8000}
> do
>     PGDATA="/u02/pgdata/pgdata${i}"
>     pg_ctl start -D ${PGDATA}/  -l /dev/null 2>&1 > /dev/null
>     (( port += 1 ))
> done

sh: 1: Cannot fork
no data was returned by command ""/u01/app/postgres/product/13/db_1/bin/postgres" -V"
The program "postgres" is needed by pg_ctl but was not found in the
same directory as "/u01/app/postgres/product/13/db_1/bin/pg_ctl".
Check your installation.

Next limit hit: this time we get to around 4600 clusters, which is better, but still not the 8000. Now we are hitting the number of open files:

postgres@pg13:/home/postgres/ [pg131] sudo lsof -u postgres 2>/dev/null | wc -l
1884863
postgres@pg13:/home/postgres/ [pg131] ulimit -Hn
1048576
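
Dividing the observed number of open files by the roughly 4600 running clusters gives an idea of how many we would need for all 8000 (rough numbers, lsof also counts memory mapped files and duplicates):

awk 'BEGIN { per = 1884863 / 4600; print per, per * 8000 }'
# ~410 open files per cluster, more than 3 million for all 8000 clusters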

But even if we increase the system-wide limit we’ll run into the same issue, because we cannot raise the per-user limit above 1048576:

postgres@pg13:/home/postgres/ [pg131] sudo sysctl -w fs.file-max=100000000
fs.file-max = 100000000
postgres@pg13:/home/postgres/ [pg131] sudo sysctl -a | grep fs.file-max
fs.file-max = 100000000

The reason is in the sshd_config:

postgres@pg13:/home/postgres/ [pg131]  sudo cat /etc/ssh/sshd_config | grep -i UsePam
UsePAM yes

PAM is responsible for applying the user limits. Let’s try to skip that by disabling UsePAM and restarting sshd:

postgres@pg13:/home/postgres/ [pg131] sudo sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config
postgres@pg13:/home/postgres/ [pg131] grep UsePAM /etc/ssh/sshd_config
UsePAM no
postgres@pg13:/home/postgres/ [pg131] sudo systemctl restart sshd

After restarting sshd, let’s open a new ssh session and try again:

dwe@ltdwe:~$ ssh -i /home/dwe/Documents/aws/dwe-key-pair.pem [email protected]
[email protected]: Permission denied (publickey).

Next issue: with UsePAM disabled we are not able to log in anymore. The better way is to revert that change and instead comment out the line in /etc/pam.d/sshd which enforces the limits:

postgres@debian10pg:/home/postgres/ [pg14] grep -A 1 "limits" /etc/pam.d/sshd 
# access limits that are hard to express in sshd_config.
# account  required     pam_access.so
--
# Set up user limits from /etc/security/limits.conf.
# session    required     pam_limits.so

After restarting sshd, let’s try again:

port="5432"
for i in {1..8000}
do
    PGDATA="/u02/pgdata/pgdata${i}"
    pg_ctl start -D ${PGDATA}/  -l /home/postgres/${i}log.log
    (( port += 1 ))
done

… and it still fails:

2021-02-19 15:49:38.457 UTC [109716] LOG:  listening on Unix socket "/tmp/.s.PGSQL.10063"
2021-02-19 15:49:38.461 UTC [109717] LOG:  database system was shut down at 2021-02-19 15:39:59 UTC
2021-02-19 15:49:38.465 UTC [109716] LOG:  could not fork worker process: Resource temporarily unavailable
2021-02-19 15:49:38.465 UTC [109716] LOG:  database system is ready to accept connections
2021-02-19 15:49:50.889 UTC [109716] LOG:  received smart shutdown request

The next test is to check whether we are hitting a per-user/per-session limit or a system-wide limit. For that we will start the first 4000 clusters with the postgres user we already have. For the remaining 4000 clusters we’ll create another user, change the corresponding $PGDATA directories to be owned by that user, and then try to start them. If that works, we can at least start 8000 clusters using two separate OS users:

root@ip-10-0-1-21:/home/admin# useradd -g postgres -m postgres2
root@ip-10-0-1-21:/home/admin# for i in {4001..8000}
> do
>     PGDATA="/u02/pgdata/pgdata${i}"
>     chown -R postgres2:postgres ${PGDATA}
> done

First 4000 clusters:

postgres@ip-10-0-1-21:/home/postgres/ [pg132] for i in {1..4000}; do     PGDATA="/u02/pgdata/pgdata${i}";     pg_ctl start -D ${PGDATA}/  -l /home/postgres/${i}log.log; done
Every 2.0s: ps -ef | egrep 'postgres -D' | wc -l                                                                                                                           ip-10-0-1-21: Tue Feb 23 10:38:49 2021

4000

Let’s check if we can start the remaining 4000 using the other user:

postgres2@ip-10-0-1-21:/home/postgres2/ [pg132] id -a
uid=1002(postgres2) gid=1001(postgres) groups=1001(postgres)
postgres2@ip-10-0-1-21:/home/postgres2/ [pg132] for i in {4001..8000}; do     PGDATA="/u02/pgdata/pgdata${i}";     pg_ctl start -D ${PGDATA}/  -l /home/postgres2/${i}log.log; done
postgres2@ip-10-0-1-21:/home/postgres2/ [pg132] cat 4632log.log
2021-02-23 10:45:38.573 UTC [61580] LOG:  starting PostgreSQL 13.2 on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2021-02-23 10:45:38.573 UTC [61580] LOG:  listening on IPv4 address "127.0.0.1", port 10063
2021-02-23 10:45:38.578 UTC [61580] LOG:  listening on Unix socket "/tmp/.s.PGSQL.10063"
2021-02-23 10:45:38.585 UTC [61581] LOG:  database system was shut down at 2021-02-21 17:53:04 UTC
2021-02-23 10:45:38.597 UTC [61580] LOG:  could not fork worker process: Resource temporarily unavailable
2021-02-23 10:45:38.597 UTC [61580] LOG:  database system is ready to accept connections

This leads to the same issue once we reach around 4600 clusters in total. At this point I decided to stop this experiment, as the conclusion is clear: Don’t do it, for obvious reasons:

  • You’ll run into all kinds of limits
  • Managing 8000 clusters on one host is not doable; you’ll for sure make a mistake at some point, and then one or more clusters are damaged
  • Looking only at the numbers of open files and processes involved should already tell you, again: Don’t do it
  • Many, many other reasons (backup/restore/patching/upgrades)