Infrastructure at your Service

Oracle Team

PostgreSQL on Oracle free tier ARM

By Franck Pachot

.
This follows the previous post about the Oracle Cloud ARM compute shape just announced on May 25th. The processor is ARM v8.2 with LSE (atomic instructions) and PostgreSQL can benefit from it (see Dramatical Effect of LSE Instructions for PostgreSQL on Graviton2 Instances).

I have installed GCC 11 in the previous post, on a Oracle Linux 7.9 image with comes with GCC 7. If you installed the Ubuntu 20.4 image, you have GCC 9 and I also recommand to install the latest version. Anyway, if you use a version before GCC 11 you must set CFLAGS as below to benefit from ARM v8.1 LSE.

I’ll install PostgreSQL 14 which is currently in beta1:


(
sudo yum install -y git gcc readline-devel bison-devel zlib-devel
curl -s https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.gz | tar -zxvf -
sudo rm -rf /usr/local/pgsql ; sudo mkdir -p /usr/local/pgsql
cd postgresql-14beta1
make clean
CFLAGS="-march=armv8.2-a+fp16" ./configure --enable-debug
make
sudo make install
cd contrib
make
sudo make install
)

I initially tried to use CFLAGS=”-mcpu=neoverse-n1″ to compile for this processor only (GCC 11 defaults to -m_ouline_atomics to build binaries compatible with ARM v8 but this adds the very small overhead of runtime detection) but I got “unknown architectural extension `rcpc+dotprod+profile'” and that’s why I used CFLAGS=”-march=armv8.2-a+fp16″. Actually installing GCC11 here is not sufficient, the assemble should also be upgraded for Neoverse-N1 instructions. See below for the right usage of GCC in this distribution (devtoolset-10).


[[email protected] ~]$ nm /usr/local/pgsql/bin/postgres | grep -E "aarch64(_have_lse_atomics)?"                                                     

[[email protected] ~]$ objdump -d /usr/local/pgsql/bin/postgres | awk '/\t(ldxr|ldaxr|stxr|stlxr)/{print $3"\t(load and store exclusives)"}/\t(cas|casp|swp|ldadd|stadd|ldclr|stclr|ldeor|steor|ldset|stset|ldsmax|stsmax|ldsmin|stsmin|ldumax|stumax|ldumin|stumin)/{print $3"\t(large-system extensions)"}' | sort | uniq -c | sort -n

      1 ldclral (large-system extensions)
      3 ldsetal (large-system extensions)
     12 ldaddal (large-system extensions)
     15 casal   (large-system extensions)
     31 swpa    (large-system extensions)

I see no __aarch64_ functions to outline atomics and no load and store for pre-v8.1 ARM because I disabled the default -mno-outline-atomics runtime detection. That’s what I wanted to be optimized for this processor. I’ll compile differently later to check the difference.


/usr/local/pgsql/bin/psql --version
/usr/local/pgsql/bin/initdb -D /var/tmp/pgdata
/usr/local/pgsql/bin/pg_ctl -D /var/tmp/pgdata start

This creates a database and starts the instance

[[email protected] contrib]$ free -h

              total        used        free      shared  buff/cache   available
Mem:            22G        1.0G        4.9G         50M         16G         18G
Swap:          8.0G          0B        8.0G

sed -ie "/huge_pages/s/^.*=.*/huge_pages= true/" /var/tmp/pgdata/postgresql.conf
sed -ie "/shared_buffers/s/^.*=.*/shared_buffers= 4800MB/" /var/tmp/pgdata/postgresql.conf
/usr/local/pgsql/bin/pg_ctl -D /var/tmp/pgdata -l postgres.log stop

I have 24 GB RAM here (which is what Oracle gives you for free – this is quite awesome, isn’t it?)
I’ll use enough shared buffers to fit my working set (4 schemas of 1GB) because I want to measure shared buffer cache access (this is where atimics are used).21

[[email protected] contrib]$ $ grep Hugepagesize /proc/meminfo

Hugepagesize:     524288 kB

The huge pages are 512MB here. This processor is build for large systems and that’s perfect for a database. Of course, I’ll define enough huge pages for the shared buffers.


awk '/Hugepagesize.*kB/{print ( 1 + 1024 * MB / $2 ) }' MB=8192 /proc/meminfo | tee /dev/stderr | sudo bash -c  "cat > /proc/sys/vm/nr_hugepages"

/usr/local/pgsql/bin/pg_ctl -D /var/tmp/pgdata -l postgres.log restart

[[email protected] ~]$ free -h

              total        used        free      shared  buff/cache   available
Mem:            22G        9.1G        4.0G         24M        9.4G         10G
Swap:          8.0G        112M        7.9G

Here we are. 8GB for the postgresql shared buffers. 10GB available for the filesystem cache. You need this double buffering because PostgreSQL has no pre-fetching and write ordering optimizations, but still need its own cache to avoid a system call per page access.

I’ll run PGIO to look at the CPU performance, as I did on AWS Graviton2. Because it measures exactly what I want to measure: access to the shared buffers. Without spending all time on client-server calls (see https://amitdkhan-pg.blogspot.com/2020/05/postgresql-on-arm.html for other ways to do it with a pgbench-like workload).


git clone https://github.com/therealkevinc/pgio
tar -zxf pgio/pgio*tar.gz
cat > pgio/pgio.conf <<CAT
 UPDATE_PCT=0
 RUN_TIME=$(( 60 * 60))
 NUM_SCHEMAS=4
 NUM_THREADS=1
 WORK_UNIT=255
 UPDATE_WORK_UNIT=8
 SCALE=1024M
 DBNAME=pgio
 CONNECT_STRING="pgio"
 CREATE_BASE_TABLE=TRUE
CAT
/usr/local/pgsql/bin/psql postgres <<<'create database pgio;'
time ( PATH=/usr/local/pgsql/bin:$PATH ; cd pgio && bash setup.sh )

This setup a database for a read-only workload from 4 sessions reading in their own schema. If you are interesting by atomics improvement for AWL lightweight locks, you can set the PCT_UPDATE. But I like to show something that is often overlooked: a read-only workload on the database is still a read-write workload on the shared memory. And there is no connection, parsing, vacuuming, analyze… here. I use Kevin Closson PGIO, the “SLOB method” to run only what I want to measure: pin and read shared buffers.


( PATH=/usr/local/pgsql/bin:$PATH ; cd pgio && bash runit.sh | sed -e "s/^/$n|\t/" )

Date: Sun May 23 16:56:49 GMT 2021
Database connect string: "pgio".
Shared buffers: 4800MB.
Testing 4 schemas with 1 thread(s) accessing 1024M (131072 blocks) of each schema.
Running iostat, vmstat and mpstat on current host--in background.
Launching sessions. 4 schema(s) will be accessed by 1 thread(s) each.
pg_stat_database stats:
          datname| blks_hit| blks_read|tup_returned|tup_fetched|tup_updated
BEFORE:  pgio    | 10070507688 |   6920153 |   9870736490 |  9862382424 |      988185
AFTER:   pgio    | 10936832180 |   7446586 |  10723766109 | 10715208260 |      988185
DBNAME:  pgio. 4 schemas, 1 threads(each). Run time: 600 seconds. RIOPS >877< CACHE_HITS/s >1443874<

objdump -d /usr/local/pgsql/bin/postgres | awk '/\t(ldxr|ldaxr|stxr|stlxr)/{print $3"\t(load and store exclusives)"}/\t(cas|casp|swp|ldadd|stadd|ldclr|stclr|ldeor|steor|ldset|stset|ldsmax|stsmax|ldsmin|stsmin|ldumax|stumax|ldumin|stumin)/{print $3"\t(large-system extensions)"}' | sort | uniq -c | sort -n

      1 ldclral (large-system extensions)
      3 ldsetal (large-system extensions)
     12 ldaddal (large-system extensions)
     15 casal   (large-system extensions)
     31 swpa    (large-system extensions)

1443874 page read per second from the 4 sessions. This is 360969 LIOPS/thread, with CFLAGS=”-march=armv8.2-a” which calls the CASAL atomic in order to acquire the share mode LWLock to lookup for the block in the shared buffer partition.

I mentioned that I compiled for ARM v8.2 only rather than relying on the defaults that try to build an ARM v8 compatible binary. Let’s do the same with CFLAGS=”-march=armv8-a -moutline-atomics”.


( PATH=/usr/local/pgsql/bin:$PATH ; cd pgio && bash runit.sh | sed -e "s/^/$n|\t/" )

Date: Sun May 23 16:31:43 GMT 2021
Database connect string: "pgio".
Shared buffers: 4800MB.
Testing 4 schemas with 1 thread(s) accessing 1024M (131072 blocks) of each schema.
Running iostat, vmstat and mpstat on current host--in background.
Launching sessions. 4 schema(s) will be accessed by 1 thread(s) each.
pg_stat_database stats:
          datname| blks_hit| blks_read|tup_returned|tup_fetched|tup_updated
BEFORE:  pgio    | 9196454698 |   6919962 |   9010603761 |  9002477779 |      988150
AFTER:   pgio    | 10070502579 |   6919975 |   9870712088 |  9862380498 |      988183
DBNAME:  pgio. 4 schemas, 1 threads(each). Run time: 600 seconds. RIOPS >0< CACHE_HITS/s >1456746<

objdump -d /usr/local/pgsql/bin/postgres | awk '/\t(ldxr|ldaxr|stxr|stlxr)/{print $3"\t(load and store exclusives)"}/\t(cas|casp|swp|ldadd|stadd|ldclr|stclr|ldeor|steor|ldset|stset|ldsmax|stsmax|ldsmin|stsmin|ldumax|stumax|ldumin|stumin)/{print $3"\t(large-system extensions)"}' | sort | uniq -c | sort -n

      1 ldclral (large-system extensions)
      1 ldsetal (large-system extensions)
      1 stxr    (load and store exclusives)
      1 swpa    (large-system extensions)
      2 casal   (large-system extensions)
      2 ldaddal (large-system extensions)
      6 stlxr   (load and store exclusives)
      7 ldaxr   (load and store exclusives)

1456746 page read per second from the 4 sessions. This is 364187 LIOPS/thread, with CFLAGS=”-march=armv8 -moutline-atomics” but with LSE outlined (which is the default in GCC 11). This uses a a runtime detection to use CASAL is LSE are available or LDAXR/STXX if not. I would expect a little overhead for this detection but my test shows that I was able to read nearly 1% more buffers with there. Surprising, isn’t it?

Here are some screenshots fro perf top when those were running:

You can see that I’ve also run with the ARM v8 without LSE:


( PATH=/usr/local/pgsql/bin:$PATH ; cd pgio && bash runit.sh | sed -e "s/^/$n|\t/" )

Date: Sun May 23 17:17:35 GMT 2021
Database connect string: "pgio".
Shared buffers: 4800MB.
Testing 4 schemas with 1 thread(s) accessing 1024M (131072 blocks) of each schema.
Running iostat, vmstat and mpstat on current host--in background.
Launching sessions. 4 schema(s) will be accessed by 1 thread(s) each.
pg_stat_database stats:
          datname| blks_hit| blks_read|tup_returned|tup_fetched|tup_updated
BEFORE:  pgio    | 10936836909 |   7446768 |  10723783569 | 10715210130 |      988187
AFTER:   pgio    | 11785746100 |   7973262 |  11559676977 | 11550904052 |      988213
DBNAME:  pgio. 4 schemas, 1 threads(each). Run time: 600 seconds. RIOPS >877< CACHE_HITS/s >1414848<

objdump -d /usr/local/pgsql/bin/postgres | awk '/\t(ldxr|ldaxr|stxr|stlxr)/{print $3"\t(load and store exclusives)"}/\t(cas|casp|swp|ldadd|stadd|ldclr|stclr|ldeor|steor|ldset|stset|ldsmax|stsmax|ldsmin|stsmin|ldumax|stumax|ldumin|stumin)/{print $3"\t(large-system extensions)"}' | sort | uniq -c | sort -n
     15 ldaxr   (load and store exclusives)
     31 stlxr   (load and store exclusives)
     31 stxr    (load and store exclusives)
     47 ldxr    (load and store exclusives)

1414848 page read per second from the 4 sessions. This is 353712 LIOPS/thread, with CFLAGS=”-march=armv8 -mno-outline-atomics”. I’ve run 2% less buffer reads when the LWLock is implemented as load store loop.

I show all these to show how you can check if your binaries use all the processor features, and how to measure and understand the difference. But the numbers have no meaning outside of this specific context. Numbers from benchmarks are useless outside of marketing slides. Here, my context is only 4 sessions (because I want my demos to run on free tier) and PostgreSQL has the shared buffer partitioned to 16 latches. My data is equally distributed, there are no page misses, no session reading the same blocks, no update… I have set, on purpose, the situation with the less contention to show that the lightweight lock optimizations are still important here. And you see where the contention will be in case of CPU starvation: non-LSE loop on load store will be a problem and LSE atomics will help, outlined for runtime detection, or not.

Update 27-MAY-2021

I mentioned above that I installed the latest GCC (version 11) and PostgreSQL (14 beta) for this test, but I think it is important to give the correct way to install the latest stable version (here 13.3) with the latest GCC shipped with OL7 (the devtoolset-10 software collection) as this one has all required components to generate Neoverse-N1 instructions:


VER=13.3 CFLAGS="-mcpu=neoverse-n1" scl enable devtoolset-10 bash <<'SCL'
sudo yum install -y git gcc readline-devel zlib-devel bison bison-devel flex && \
curl https://ftp.postgresql.org/pub/source/v$VER/postgresql-$VER.tar.gz | tar -zxf - && \
cd postgresql-$VER &&  ./configure --enable-debug && make && sudo make install
SCL
/usr/local/pgsql/bin/initdb -D /var/tmp/test_pgdata
/usr/local/pgsql/bin/pg_ctl -D /var/tmp/test_pgdata start -l /dev/stdout | grep GCC
/usr/local/pgsql/bin/pg_ctl -D /var/tmp/test_pgdata stop  

waiting for server to start….2021-05-27 19:20:43.024 GMT [30480] LOG: starting PostgreSQL 14beta1 on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 10.2.1 20200804 (Red Hat 10.2.1-2.0.2), 64-bit

4 Comments

  • Excellent post. I’ve always pointed out that with a shared memory concurrent processing system there is no such thing as a read-only workload and you’re showing that here.

    I’m hoping you’ll allow a link to PGIO for your readers:

    https://kevinclosson.net/2019/09/21/announcing-pgio-the-slob-method-for-postgresql-is-released-under-apache-2-0-and-available-at-github/

  • Josue says:

    Excellent post! Thanks for share your knowledge. Please, if is possible, can you share a system unit file to make persistent the installation as service? Thanks for your reply.

  • Josue says:

    In case it is of interest to someone, i share the process to make service unit for this installation of postgresql:

    1. Create Service Unit File
    #Service unit file
    sudo nano /etc/systemd/system/postgresql.service

    [Unit]
    Description=PostgreSQL database server
    After=network.target

    [Service]
    Type=forking

    User=postgres
    Group=postgres

    # Where to send early-startup messages from the server (before the logging
    # options of postgresql.conf take effect)
    # This is normally controlled by the global default set by systemd
    # StandardOutput=syslog

    # Disable OOM kill on the postmaster
    OOMScoreAdjust=-1000
    # … but allow it still to be effective for child processes
    # (note that these settings are ignored by Postgres releases before 9.5)
    Environment=PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
    Environment=PG_OOM_ADJUST_VALUE=0

    # Maximum number of seconds pg_ctl will wait for postgres to start. Note that
    # PGSTARTTIMEOUT should be less than TimeoutSec value.
    Environment=PGSTARTTIMEOUT=270

    Environment=PGDATA=/usr/local/pgsql/data

    ExecStart=/usr/local/pgsql/bin/pg_ctl start -D ${PGDATA} -s -w -t ${PGSTARTTIMEOUT}
    ExecStop=/usr/local/pgsql/bin/pg_ctl stop -D ${PGDATA} -s -m fast
    ExecReload=/usr/local/pgsql/bin/pg_ctl reload -D ${PGDATA} -s

    # Give a reasonable amount of time for the server to start up/shut down.
    # Ideally, the timeout for starting PostgreSQL server should be handled more
    # nicely by pg_ctl in ExecStart, so keep its timeout smaller than this value.
    TimeoutSec=300

    [Install]
    WantedBy=multi-user.target

    #EOF

    $ sudo systemctl daemon-reload # load the updated service file from disk
    $ sudo systemctl enable postgresql
    $ sudo systemctl start postgresql
    $ sudo systemctl status postgresql # check the status of service

    2. Add postgres bin route to path
    #Edit /etc/profile and append to EOF
    sudo nano /etc/profile
    PATH=$PATH:/usr/local/pgsql/bin

    Thanks for sharing!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Oracle Team
Oracle Team