
Franck Pachot

PostgreSQL Shared Buffers vs free RAM

PostgreSQL, like all other database engines, modifies table and index blocks in shared buffers. People often think that the main goal of these buffers is to act as a cache and avoid reading from disk. But that is not the primary reason, and caching there is not even mandatory: PostgreSQL expects the filesystem cache to be used as well. The primary goal of shared buffers is simply to be shared, because multiple sessions may want to read or write the same blocks, and concurrent access is managed at block level in memory. Without shared buffers, you would need to lock a whole table. Most database engines do use their shared buffers for caching: allocating more memory keeps the frequently used blocks in memory rather than going to disk, and because they manage this cache with methods suited to a database (for performance and reliability), they don’t need another level of cache and recommend direct I/O to the database files. But not Postgres. In order to keep the engine simple and portable, PostgreSQL relies on the filesystem cache. For example, no multiblock read is implemented: reading contiguous blocks is expected to be optimized by the filesystem read-ahead.
But this simplicity of code implies complexity when configuring a PostgreSQL database system. How much should be set for shared_buffers? And how much should be kept free in the OS to be used by the filesystem cache?

I am not giving any answer here. And I think there is, unfortunately, no answer. The documentation is vague: it defines the recommended value as a percentage of the available RAM, which makes no sense to me. The cache is there to keep the frequently used blocks, and that depends on your application workload, not on the available RAM. There is also the idea that, because of double buffering, you should allocate the same size to both caches. But that makes no sense either: if you keep blocks in the shared buffers, they will not be frequently accessed from the filesystem and will not stay in the filesystem cache. Finally, don’t think the defaults are good. The default shared_buffers is always too small, and the reason is that a small value eases installation on Linux without having to think about the SHM size. But you have more memory, need more shared buffers, and will need to configure SHM and huge pages for them.
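As an illustration of that last point, here is a minimal sketch of how sizing huge pages for a larger shared_buffers could look. The 4GB value and the 256MB overhead are assumptions made up for the example, not recommendations:


# On PostgreSQL 15 and later, the server itself can report how many huge pages it needs:
#   postgres -D $PGDATA -C shared_memory_size_in_huge_pages
# Otherwise, approximate from the huge page size (usually 2MB) plus some overhead:
grep Hugepagesize /proc/meminfo                           # e.g. "Hugepagesize: 2048 kB"
sudo sysctl -w vm.nr_hugepages=$(( (4096 + 256) / 2 ))    # 4GB shared_buffers + ~256MB overhead, in 2MB pages
# and in postgresql.conf: shared_buffers = 4GB and huge_pages = on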

My only recommendation is to understand your workload: what is the set of data that is frequently used (lookup tables, index branches, active customers, latest orders, …)? This should fit in the shared buffers. The second recommendation is to understand how it works. For the set of data that is frequently read only, this is not so difficult: you need an idea of the postgres cache algorithm (a clock-sweep approximation of LRU, in shared buffers of static size) and of the filesystem cache (LRU over the variable amount of free memory). For the set of data that is modified, this is much more complex.
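To get an idea of what this frequently used set looks like on a running system, you can check what currently sits in shared buffers. Here is one possible query using the pg_buffercache extension (the extension must be installed, and the column aliases are just names I chose for this example):


psql -c "create extension if not exists pg_buffercache;"
psql -c "
  select c.relname,
         count(*) * 8 / 1024 as buffered_mb,   -- assuming the default 8kB block size
         round(100.0 * count(*) / (select setting::int from pg_settings where name = 'shared_buffers'), 1) as pct_of_shared_buffers
  from pg_buffercache b
  join pg_class c on b.relfilenode = pg_relation_filenode(c.oid)
  where b.reldatabase in (0, (select oid from pg_database where datname = current_database()))
  group by c.relname
  order by buffered_mb desc
  limit 10;"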

Let’s start with the reads. As an example, I have run a workload that reads at random within a small set of 150MB. I used pgio from Kevin Closson for that. I’ve executed multiple runs, varying the shared_buffers from smaller than my working set to twice as large: 50MB, 100MB, 150MB, 200MB, 250MB, 300MB. Then, for each size of shared buffers, I’ve run with variations in the available RAM (which is used by the filesystem cache): 50MB, 100MB, 150MB, 200MB, 250MB, 300MB, 350MB, 400MB. I ensured that I had enough physical memory so that the system did not swap.
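For reference, the pgio configuration for such a run could look like the sketch below. The variable names may differ between pgio versions, so take this as an assumption about the shape of pgio.conf rather than something to copy as-is:


# pgio.conf (sketch, names may vary with the pgio version)
NUM_SCHEMAS=1        # one schema holding the working set
NUM_THREADS=1        # a single session, to measure logical I/O without concurrency effects
SCALE=150M           # the frequently-read working set of about 150MB
PCT_UPDATE=0         # read-only for the first series of runs (set to 50 later)
RUN_TIME=120         # seconds per run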

First, I determined the free memory on my system:


[[email protected] postgresql]$ sync ; sync ; sync ; free -hw
              total        used        free      shared     buffers       cache   available
Mem:           983M         63M        835M        436K          0B         84M        807M
Swap:            0B          0B          0B

I also use https://medium.com/@FranckPachot/proc-meminfo-formatted-for-humans-350c6bebc380


[[email protected] postgresql]$ awk '/Hugepagesize:/{p=$2} / 0 /{next} / kB$/{v[sprintf("%9d MB %-s",int($2/1024),$0)]=$2;next} {h[$0]=$2} /HugePages_Total/{hpt=$2} /HugePages_Free/{hpf=$2} {h["HugePages Used (Total-Free)"]=hpt-hpf} END{for(k in v) print sprintf("%-60s %10d",k,v[k]/p); for (k in h) print sprintf("%9d MB %-s",p*h[k]/1024,k)}' /proc/meminfo|sort -nr|grep --color=auto -iE "^|( HugePage)[^:]*" #awk #meminfo
 33554431 MB VmallocTotal:   34359738367 kB                    16777215
      983 MB MemTotal:        1006964 kB                            491
      946 MB DirectMap2M:      968704 kB                            473
      835 MB MemFree:          855628 kB                            417
      808 MB MemAvailable:     827392 kB                            404
      491 MB CommitLimit:      503480 kB                            245
      200 MB Committed_AS:     205416 kB                            100
       78 MB DirectMap4k:       79872 kB                             39
       69 MB Cached:            71436 kB                             34
       55 MB Active:            56640 kB                             27
       43 MB Inactive:          44036 kB                             21
       42 MB Inactive(file):    43612 kB                             21
       29 MB Slab:              30000 kB                             14
       28 MB AnonPages:         29260 kB                             14
       28 MB Active(anon):      29252 kB                             14
       26 MB Active(file):      27388 kB                             13
       14 MB SUnreclaim:        14876 kB                              7
       14 MB SReclaimable:      15124 kB                              7
       13 MB Mapped:            14212 kB                              6
        4 MB PageTables:         4940 kB                              2
        2 MB Hugepagesize:       2048 kB                              1
        1 MB KernelStack:        1952 kB                              0
        0 MB Shmem:               436 kB                              0
        0 MB Inactive(anon):      424 kB                              0
        0 MB HugePages Used (Total-Free)
        0 MB HugePages_Total:       0
        0 MB HugePages_Surp:        0
        0 MB HugePages_Rsvd:        0
        0 MB HugePages_Free:        0
        0 MB Dirty:                 4 kB                              0

Then, in order to control how much free RAM I want to leave ($fs_MB), I allocate the rest as huge pages:


sudo bash -c "echo 'vm.nr_hugepages = $(( (835 - $fs_MB) / 2 ))' > /etc/sysctl.d/42-hugepages.conf ; sysctl --system"

This limits the RAM available for the filesystem cache, because huge pages cannot be used for it. And the huge pages can be used for the postgres shared buffers:


sed -i -e "/shared_buffers/s/.*/shared_buffers = ${pg_MB}MB/" -e "/huge_pages =/s/.*/huge_pages = on/" $PGDATA/postgresql.conf
grep -E "(shared_buffers|huge_pages).*=" $PGDATA/postgresql.conf
/usr/local/pgsql/bin/pg_ctl -l $PGDATA/logfile restart
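After the restart, it is worth checking that the shared buffers really come from the reserved huge pages. Assuming nothing else on the host uses huge pages, a quick check is:


psql -c "show shared_buffers;" -c "show huge_pages;"
grep -E "HugePages_(Total|Free|Rsvd)" /proc/meminfo    # reserved/used huge pages now that postgres is running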

Note that I actually stopped postgres beforehand, to be sure that no huge pages are in use while resizing them:


for pg_MB in 50 100 150 200 250 300 350 400
do
for fs_MB in 400 350 300 250 200 150 100 50
do
/usr/local/pgsql/bin/pg_ctl -l $PGDATA/logfile stop
# set huge pages for this free-RAM target and shared buffers for this run
sudo bash -c "echo 'vm.nr_hugepages = $(( (835 - $fs_MB) / 2 ))' > /etc/sysctl.d/42-hugepages.conf ; sysctl --system"
sed -i -e "/shared_buffers/s/.*/shared_buffers = ${pg_MB}MB/" -e "/huge_pages =/s/.*/huge_pages = on/" $PGDATA/postgresql.conf
/usr/local/pgsql/bin/pg_ctl -l $PGDATA/logfile restart
# run pgio runit.sh
done
done

100% Read only workload on 150MB

Here is the result. Each slice on the z-axis is a size of shared_buffers allocated by postgres. On the x-axis is the size of the available RAM that can be used for the filesystem cache by the Linux kernel. The y-axis is the number of tuples read during the run.

You will never get optimal performance when the frequently read set doesn’t fit in shared buffers. When the read set is larger than the shared buffers, more free RAM only limits the penalty: performance stays lower than when everything fits. The frequently read set of data should fit in shared buffers.

50% updates on 150MB

Here is the same run, where I only changed PCT_UPDATE to 50 in pgio.conf.

This looks similar, but there are two main differences: one is visible here, and the other is not represented in this graphic because I aggregated several runs.

First, increasing the shared buffers above the set of frequently manipulated data still improves performance, which was not the case with reads. As soon as the shared buffers are above the working set of 150MB, the buffer cache hit ratio is at 100%. But that’s only for reads: updates generate a new version of the data, and both versions will have to be handled by vacuum and checkpoints.

Here is a graph of blks_read/s, which shows that for a read-only workload we do not do any physical reads (I/O calls from the database to the filesystem) as soon as the working set fits in shared buffers. When we write, the number of read calls still decreases when we increase the shared buffers a bit above the modified set. And the physical read efficiency is best when there is as much free RAM as shared buffers.
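The blks_read and blks_hit counters behind this graph are part of the cumulative statistics, so they can be sampled before and after a run. A possible query (the hit-percentage alias is just a name I chose):


psql -c "
  select datname, blks_read, blks_hit,
         round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) as shared_buffer_hit_pct
  from pg_stat_database
  where datname = current_database();"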

Second point about the write workload: performance is not homogeneous at all. Vacuum and checkpoints happen at regular intervals and make the performance unpredictable. When showing the tuples/second for the 50% write workload, I aggregated many runs to display the maximum throughput achieved. Having the modified set of data fit in free RAM helps to lower this variation, as it avoids immediate physical write latency. The balance between shared buffers and free RAM is then a balance between high performance and predictability of performance: keep free RAM as a performance “buffer” between userspace access and disk reads. There are also many parameters to “tune” this, like the frequency of vacuum and checkpoints, and this makes memory tuning very complex. Note that I changed only the shared_buffers here. When increasing shared_buffers for high write throughput, there are other structures to increase as well, like the WAL segments.
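As a minimal sketch of what those “other structures” can mean, here are the kinds of postgresql.conf parameters that usually move together with a larger shared_buffers on a write-heavy workload. The values are purely illustrative assumptions, not recommendations:


shared_buffers = 8GB
max_wal_size = 16GB                   # more WAL room between checkpoints
checkpoint_timeout = 15min            # fewer, larger checkpoints
checkpoint_completion_target = 0.9    # spread checkpoint writes over time
bgwriter_lru_maxpages = 1000          # a more aggressive background writer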

The filesystem cache adds another level of unpredictable performance. For example, you may run a backup that reads the whole database, flooding the Least Recently Used list of the filesystem cache with blocks you will not need again. And I didn’t do any sequential scans here; they benefit from the filesystem cache and its read-ahead. All of this makes any quick recommendation incorrect. Buffer cache hit ratios make no sense for estimating the response time improvement because they are not aware of the filesystem cache hits. But looking at them per relation, table or index, may help to estimate which relations are frequently accessed. Because that’s what matters: not the size of the database, not the size of your server RAM, not the general buffer cache hit ratio, but the size of the data that is read and written frequently by your application.
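For that per-relation view, the cumulative statistics views can help. Here is one possible query on the tables of the current database (ordering and limit are arbitrary choices for the example):


psql -c "
  select relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
  from pg_statio_user_tables
  order by heap_blks_hit + idx_blks_hit desc
  limit 10;"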

3 Comments

  • A 150-400MB set is nothing. It can be held in memory without any problems. On this size you can see a lot of noise. For IO-related benchmark tests your database should be larger than RAM, and tests should be longer than ten minutes. Configuration of shared buffers is not extra magic – Postgres is not too sensitive to this value because it uses other RAM in the form of the filesystem cache. When shared buffers are too low, then there are CPU problems with spin locks. Too high shared buffers can increase latency due to checkpoints (can be fixed by a more aggressive background writer).

    For small databases up to 4GB you can have shared buffers larger than the database. But you still have to have enough memory for query calculation – work_mem per process.

    Simplified formula is

    Shared buffers + 2 * work_mem * max connection + memory for filesystem cache + memory for OS < RAM
    OLTP .. RAM > 1/10 dbsize
    OLAP .. RAM > 1/100 dbsize

  • The ending lines were broken in my comment

    Shared buffers + 2 * work_mem * max connection + memory for filesystem cache + memory for OS < RAM

  • Franck Pachot says:

    Hi Pavel,

    Thanks for your comment. I may not have explained enough in the post because I try to keep posts short (a big challenge for me!) and this is definitely not a benchmark. The idea is to get people thinking in terms of the working set size rather than magic percentages of available RAM. As you raised some points that may need more explanation, I’m adding more detail here following your feedback.

    >> It can be held in memory without any problems. […] For IO-related benchmark tests your database should be larger than RAM

    Exactly. This is a logical I/O test. The goal is, on purpose, to avoid any physical I/O in order to stay away from any other cache in the storage layer, or any other unpredictable latency.
    What I am testing here is: getting a page from the userspace buffer vs. from the kernel cache.

    >> Postgres is not too sensitive to this value because it uses other RAM in the form of the filesystem cache

    That’s a common idea that I want to challenge. People think that, as everything comes from RAM, there’s no difference and shared_buffers is only there for concurrency. But copying a page from the filesystem cache to the shared buffers has a cost, which is clearly visible in the read test above. When the working set fits in shared_buffers, the throughput is optimal. When it does not fit, the size still matters a little.
    My 1st recommendation here is: shared_buffers should fit the whole frequently-read set. Note that this is for random reads (access by index) or very small scans (‘small’ depending on the shared_buffers size itself, which is why I didn’t test those here). Large scans do not stay in shared_buffers anyway and benefit from filesystem prefetching, so it is better to keep free RAM for them. I didn’t mention this because I think that ‘frequently read’ implicitly excludes large scans, or else the problem is in the application design (or the query plan).

    >> When shared buffers are too low, then there are CPU problems with spin locks.

    I think this can also be a consequence of the higher latency. But here my test was with one session only: no concurrent access on the buffers (at least in the read-only test; the one with updates has the background writer and vacuum).

    >> Too high shared buffers can increase latency due to checkpoints (can be fixed by a more aggressive background writer).

    Yes, the predictability of the read-write test is very hard to achieve and depends on other parameters. A more aggressive background writer is required with a larger shared_buffers. However, in order to benefit from the filesystem re-ordering, the filesystem cache should be large as well. My 2nd recommendation is: the frequently modified pages should fit comfortably both in shared buffers and in the filesystem cache. But that size is more difficult to estimate.

    >> For small databases up to 4GB you can have shared buffers larger than the database.

    I don’t really care about the size of the database when it comes to RAM (or maybe a disk-based RDBMS is not the right solution). Only the subset that the application works on frequently matters. Except maybe when a backup runs and messes up the filesystem cache LRU lists – one reason why I prefer to keep pages cached in shared_buffers, to keep predictable performance on them.

    While writing this, I’ve seen good points in Robert Haas’ post: http://rhaas.blogspot.com/2012/03/tuning-sharedbuffers-and-walbuffers.html

    There is also an interesting anonymous comment there saying: “I’m also a bit surprised that all the words we hear about shared_buffers setting is just based on rules of thumb and guesses or feedbacks (which are probably true but that can’t be generalized).” That comment is from 2014 and is still accurate today, in my opinion.

    Franck.

