Infrastructure at your Service

Franck Pachot

Full page logging in Postgres and Oracle

By February 14, 2018 Oracle, Postgres One Comment

In my opinion, the volume of logging (aka redo log, aka xlog, aka WAL) is the most important factor for OLTP performance, availability and scalability, for several reasons:

  • This is the only structure where disk latency is a mandatory component of response time
  • This is a big part of the total volume of backups
  • This is sequential by nature, and very difficult to scale by parallelizing

In this post, I look at the volume of logging generated by some DML in Postgres and Oracle. I know Oracle quite well and just start to look at Postgres. The comparison here is not a contest but a way to better understand. For example, the default behavior of Postgres, with full_page_writes=on, is very similar to Oracle ‘begin backup’ mode. The comparison makes no sense for most of Postgres DBAs, but probably helps Oracle DBAs to understand it.

Measure WAL segment writes

Here is how I measured the volume of transaction log written: start the Postgres server with ‘strace -f’ and parse with ‘awk’ the open(), write() and close() calls:

sudo su postgres <<'END'
export PGDATA=/u01/pgdata
/usr/pgsql-10/bin/pg_ctl stop
strace -e trace=open,close,write,recvfrom -f /usr/pgsql-10/bin/pg_ctl start 2>&1 | awk '
/^[^[]/{
$0="[pid MAIN] "$0
}
/strace: Process [0-9][0-9]* attached/{
sub(/^.*strace: /,"strace: ") ; "ps -o cmd -hp " $3 |& getline proc[$3"]"] ; print "" ; print $0,proc[$3"]"] }
/open[(].*pg_wal[/].* = [0-9]*$/{
z=$0 ; gsub(qq," ") ; fd_wal[$2 $NF]=$4
}
/checkpoint;/{
total_written_wal=0
}
/write[(]/{
#pid=$2 ; sub("]","",$2) ; "ps -o cmd -hp " p |& getline proc[p"]"] z=$0 ; gsub("[(,]"," ") ; if ( fd_wal[$2 $4]>0 ) { written_wal[$2 $4]=written_wal[$2 $4]+$NF ; total_written_wal=total_written_wal+$NF } next
}
/close[(]/{
pid=$2 ; sub("[^0-9]","",pid) ;
z=$0 ; gsub("[()]"," ") ; if ( ( fd_wal[$2 $4]!="" ) && ( written_wal[$2 $4] > 0 ) ) {
printf " ( written %d bytes to %s -> total WAL segments: %.2f MB ) cmd=%s\n",written_wal[$2 $4],fd_wal[$2 $4],total_written_wal/1024/1024 , proc[$2] ; fd_wal[$2 $4]=""
} next
}
' qq='"'
END

Do not do that in production. This is experimentation in a lab. Do not attach strace to a critical process in production.

There’s probably an easier way to get the same information, maybe with postgres activity statistics, or through a size counting archive_command, so please don’t hesitate to comment. Anyway, from the ‘write()’ calls I am sure that I’m counting exactly what I want: the volume of logging written to disk. As an Oracle DBA used to LogWriter and its slave threads managing all writes, I started to trace only the WAL writer process but quickly realized that part ot the logging is directly written by my server process.

Postgres: insert

I create a table with some numbers and a 100 bytes character string.

create table demo as select generate_series a,generate_series b,generate_series c,generate_series d,generate_series e,generate_series f,lpad('x',100,'x') g from generate_series(0,0);

The first operation I test is the insert of 1 million rows.

insert into demo select generate_series a,generate_series b,generate_series c,generate_series d,generate_series e,generate_series f,lpad('x',100,'x') g from generate_series(1,1000000);
( written 4349952 bytes to pg_wal/000000010000000A0000005F -> total WAL segments: 4.16 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 8192 bytes to pg_wal/000000010000000A0000005F -> total WAL segments: 9.00 MB ) cmd=postgres: wal writer process
( written 17735680 bytes to pg_wal/000000010000000A00000060 -> total WAL segments: 20.07 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 3309568 bytes to pg_wal/000000010000000A00000060 -> total WAL segments: 31.40 MB ) cmd=postgres: wal writer process
( written 33783808 bytes to pg_wal/000000010000000A00000061 -> total WAL segments: 36.03 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 3997696 bytes to pg_wal/000000010000000A00000061 -> total WAL segments: 39.80 MB ) cmd=postgres: wal writer process
( written 49676288 bytes to pg_wal/000000010000000A00000062 -> total WAL segments: 51.19 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 65273856 bytes to pg_wal/000000010000000A00000063 -> total WAL segments: 66.06 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 79364096 bytes to pg_wal/000000010000000A00000064 -> total WAL segments: 82.04 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 6660096 bytes to pg_wal/000000010000000A00000064 -> total WAL segments: 82.39 MB ) cmd=postgres: wal writer process
( written 88285184 bytes to pg_wal/000000010000000A00000065 -> total WAL segments: 98.02 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 14491648 bytes to pg_wal/000000010000000A00000065 -> total WAL segments: 106.82 MB ) cmd=postgres: wal writer process
( written 101703680 bytes to pg_wal/000000010000000A00000066 -> total WAL segments: 113.99 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 17825792 bytes to pg_wal/000000010000000A00000066 -> total WAL segments: 117.19 MB ) cmd=postgres: wal writer process
( written 115769344 bytes to pg_wal/000000010000000A00000067 -> total WAL segments: 128.20 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 18661376 bytes to pg_wal/000000010000000A00000067 -> total WAL segments: 135.09 MB ) cmd=postgres: wal writer process
( written 19824640 bytes to pg_wal/000000010000000A00000068 -> total WAL segments: 144.17 MB ) cmd=postgres: wal writer process
( written 131350528 bytes to pg_wal/000000010000000A00000068 -> total WAL segments: 148.16 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 27435008 bytes to pg_wal/000000010000000A00000069 -> total WAL segments: 159.80 MB ) cmd=postgres: wal writer process
( written 140132352 bytes to pg_wal/000000010000000A00000069 -> total WAL segments: 159.80 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
INSERT 0 1000000

You can see that my ‘strace|awk’ script is running in the background and has counted about 160 MB of logging, partially from the ‘postgres: wal writer process’ and partly from ‘postgres: demo demo 192.168.56.122(38013)’ serving my connection.

The relation size as stored on disk is about 150 MB;

analyze demo;
ANALYZE
select relname, relnamespace, reltype, reloftype, relowner, relam, relfilenode, reltablespace, relpages, reltuples, relallvisible, relpages*8/1024 MB from pg_class where relname = 'demo';
relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | mb
---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+-----
demo | 2200 | 25157 | 0 | 16385 | 0 | 25155 | 0 | 19231 | 1e+06 | 0 | 150
(1 row)

This makes sense. An insert has to write all new data into the log in order to be able to recover the pages until they are checkpointed.

Note that I have no index on this table for this test.

Postgres: update

I’m now updating one column for all rows.

update demo set b=b+1;
( written 150528000 bytes to pg_wal/000000010000000A0000006A -> total WAL segments: 4.01 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 162693120 bytes to pg_wal/000000010000000A0000006B -> total WAL segments: 17.84 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 29769728 bytes to pg_wal/000000010000000A0000006B -> total WAL segments: 28.44 MB ) cmd=postgres: wal writer process
...
( written 84287488 bytes to pg_wal/000000010000000A00000081 -> total WAL segments: 343.65 MB ) cmd=postgres: wal writer process
( written 453705728 bytes to pg_wal/000000010000000A00000082 -> total WAL segments: 347.36 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
UPDATE 1000001

I touched only a small part of the volume in bytes, but I touched all rows and all pages. An, even if only a few bytes are modified, Postgres logs the whole page to protect from fractured blocks in case of crash (pages partially written). So that’s about 150 MB. But postgres do not update rows in-place. The whole row is inserted in its new version, which means the whole volume again, which is another 150 MB. If we look at the size of the table, we can see 300MB of pages:

analyze demo;
ANALYZE
select relname, relnamespace, reltype, reloftype, relowner, relam, relfilenode, reltablespace, relpages, reltuples, relallvisible, relpages*8/1024 MB from pg_class where relname = 'demo';
relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | mb
---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-------------+---------------+-----
demo | 2200 | 25157 | 0 | 16385 | 0 | 25155 | 0 | 38462 | 1.21882e+06 | 0 | 300
(1 row)

So this update has generated even more logging: 347 MB.

Postgres: sparse update

Now updating only 1 row out of ten, still one column only:

update demo set b=b+1 where mod(a,10)=1;
( written 89923584 bytes to pg_wal/000000010000000A00000083 -> total WAL segments: 13.88 MB ) cmd=postgres: wal writer process
( written 469123072 bytes to pg_wal/000000010000000A00000084 -> total WAL segments: 22.98 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
...
( written 563576832 bytes to pg_wal/000000010000000A0000008D -> total WAL segments: 151.07 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 130940928 bytes to pg_wal/000000010000000A0000008D -> total WAL segments: 151.27 MB ) cmd=postgres: wal writer process
UPDATE 100000
analyze demo;

So, 10% of the rows had to be copied to their new version, which brings the table size to additional 15 MB.

analyze demo;
ANALYZE
select relname, relnamespace, reltype, reloftype, relowner, relam, relfilenode, reltablespace, relpages, reltuples, relallvisible, relpages*8/1024 MB from pg_class where relname = 'demo';
relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | mb
---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-------------+---------------+-----
demo | 2200 | 25157 | 0 | 16385 | 0 | 25155 | 0 | 40385 | 1.07267e+06 | 0 | 315
(1 row)

For these additional 15 MB, half of the table pages had to be modified (the current version having to point to the new version), and the logging generated was 150 MB. Because of MVCC at tuple level, doing something similar to ‘chained rows’ and ‘row migration’ for all updates, and because of full page logging, even sparse updates generate a lot log writes.

Postgres: delete

Here is a delete of those million rows:

delete from demo;
( written 576364544 bytes to pg_wal/000000010000000A0000008E -> total WAL segments: 6.44 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
( written 134930432 bytes to pg_wal/000000010000000A0000008E -> total WAL segments: 6.73 MB ) cmd=postgres: wal writer process
( written 589225984 bytes to pg_wal/000000010000000A0000008F -> total WAL segments: 18.70 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
...
( written 162054144 bytes to pg_wal/000000010000000A00000099 -> total WAL segments: 184.70 MB ) cmd=postgres: wal writer process
( written 740352000 bytes to pg_wal/000000010000000A0000009A -> total WAL segments: 189.80 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
DELETE 1000001
( written 163217408 bytes to pg_wal/000000010000000A0000009A -> total WAL segments: 196.22 MB ) cmd=postgres: wal writer process

Marking tuples as deleted does not increase the table:

analyze demo;
ANALYZE
select relname, relnamespace, reltype, reloftype, relowner, relam, relfilenode, reltablespace, relpages, reltuples, relallvisible, relpages*8/1024 MB from pg_class where relname = 'demo';
relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | mb
---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+-----
demo | 2200 | 25157 | 0 | 16385 | 0 | 25155 | 0 | 40385 | 275837 | 0 | 315
(1 row)

But all current tuples have to be marked as deleted and not visible once the transaction is committed. This touches all pages for the current version, which is more than 150 MB of logging here.

Postgres: vacuum

After two updates and a delete, I have old tuples in this table. It seems that VACUUM does not generate any logging:

vacuum demo;
( written 762445824 bytes to pg_wal/000000010000000A0000009B -> total WAL segments: 14.67 MB ) cmd=postgres: demo demo 192.168.56.122(38013)
VACUUM

My guess (but remember that I am a newbie in Postgres) is that in case of a crash occurring before the next checkpoint we will just have to vacuum again. But this is not what was answered in the postgres-general list a few years ago.

Note that full page logging is not necessary for all changes, but only for the first change after the page was read from disk after a checkpoint. This is sufficient to cover future writes failures because recovery will start from there. Once we have full page logged, change vector is sufficient for further recovery. However, I had the same amount of WAL, 15 MB, when vacuuming after a checkpoint.

Oracle: insert

Let’s do some similar things in Oracle, which MVCC implementation is completely different: at block level, with undo logging.

SQL> create table demo as select rownum a,rownum b,rownum c,rownum d,rownum e,rownum f,lpad('x',100,'x') g from xmltable('0 to 0');
Table created.

I have exposed in a previous post how I get the delta values from V$MYSTAT join V$STATNAME using (STATISTIC#) for ‘redo size’, so no need to strace here. But we can see the same result by measuring the writes to redo log groups (do not double count the multiplexed members).

SQL> insert into demo select rownum a,rownum b,rownum c,rownum d,rownum e,rownum f,lpad('x',100,'x') g from xmltable('1 to 1000000');
1000001 rows inserted.
 
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
141,342 155,218,876 4,380,448 104,411

This is about 150MB, which is the volume of the table:

SQL> exec dbms_stats.gather_table_stats(user,'DEMO');
PL/SQL procedure successfully completed.
SQL> select table_name,num_rows,blocks,(blocks*block_size/1024/1024) MB, avg_row_len from user_tables join dba_tablespaces using(tablespace_name) where table_name='DEMO';
 
TABLE_NAME NUM_ROWS BLOCKS MB AVG_ROW_LEN
------------------------------ ---------- ---------- ---------- -----------
DEMO 1000001 19280 150.625 131

Conclusion for inserts: all databases have to log the whole data inserted in order to be protected from instance crash. Note that Oracle has a way to insert directly into the file, bypassing the buffer cache, and then reduce the logging required for crash recovery. But I’m not doing bulk inserts here.

Oracle: update

The update in Oracle is done in-place. There is no need to copy the whole row (except in the rare cases where the row increases and do not fit into the block). However, the old value of the column must be copied for MVCC, into the UNDO segment. This is why we see 46 MB of ‘undo change vector size’ here.

SQL> update demo set b=b+1;
1000001 rows updated.
SQL> commit;
Commit complete.
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
170,777 105,301,308 48,641,772 82,221

The UNDO is only the change vector, not the full block. If you read about copies of full blocks to rollback segments, it is a confusion either from veterans of Oracle 5, or a misunderstanding of flashback features. The UNDO being stored in segments, written first into buffer cache, it is protected by redo logging, so about 46 MB of redo is actually the redo vector of undo vectors. The other 54 MB of redo is the new value of the update.

Oracle: sparse update

The logging of change vectors rather than full pages is even cheaper with sparse updates:

SQL> update demo set b=b+1 where mod(a,10)=1;
100001 rows updated.
SQL> commit;
Commit complete.
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
56,583 15,414,328 6,111,608 36,921

The volume of undo and redo generated is only 15 MB here, including 6 MB of undo vectors. This is really optimized and this is one reason why you should update only the columns changed (and not use the default non-dynamic update of Hibernate for example).

Oracle: delete

The delete has to mark all rows as deleted and because the space can immediately be reused then whole row must be logged into the UNDO, and this has to be logged into the REDO, so the delete generates lot of logging:

SQL> delete from demo;
1000001 rows deleted.
SQL> commit;
Commit complete.
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
2,124,823 403,755,892 240,302,088 1,093,821

I have no indexes here. With indexes, all index entries have to be marked as deleted, and this generates undo and redo vector because MVCC in Oracle is at block level: each block modification – for table or index – have to be logged.

Deleting a lot of rows is an expensive operation in Oracle. For bulk purges, it is often better to truncate and insert /*+ append */ when possible (as in non-atomic materialized view refresh). Partitioning helps for that for example to purge old data when partitioned on date.

Postgres without full page logging

Given the huge overhead, is full page logging really required? There are plans to avoid it, mentioned in the Postgres ToDo wiki, or at least to keep it only short term for crash recovery and not media recovery. Another possibility is to implement a checksum on the blocks so that fractured blocks can be detected. Then, when detected, the fractured blocks may not need full page logging to recover them if we can restore a previous backup. This takes longer to recover, but can be acceptable given the low probability of this kind of failure. In addition to that, when you have a physical standby synchronized with log-shipping, you have a easy way to recover without having to restore files. But you need a checksum to detect the problem.

Without a checksum, the problem is the detection of partial writes. But if you trust your storage and if you failover to the standby in case of a crash, you may accept to set full_page_writes=off and this is what I did here.


insert into demo select generate_series a,generate_series b,generate_series c,generate_series d,generate_series e,generate_series f,lpad('x',100,'x') g from generate_series(1,1000000);
...
( written 125255680 bytes to pg_wal/000000010000000A000000E3 -> total WAL segments: 140.65 MB ) cmd=postgres: demo demo 192.168.56.122(38109)
INSERT 0 1000000

The insert still have to log all new data: 140 MB.


update demo set b=b+1;
...
( written 72613888 bytes to pg_wal/000000010000000A000000F2 -> total WAL segments: 213.02 MB ) cmd=postgres: wal writer process
UPDATE 1000001

The update has to log only what is modified, but because of Postgres MVCC implementation, the whole row has to be written in its new version, and the old ones have their pointer updated: 210 MB here.


update demo set b=b+1 where mod(a,10)=1;
( written 305709056 bytes to pg_wal/000000010000000A000000F3 -> total WAL segments: 1.96 MB ) cmd=postgres: demo demo 192.168.56.122(38109)
( written 72613888 bytes to pg_wal/000000010000000A000000F3 -> total WAL segments: 5.62 MB ) cmd=postgres: wal writer process
( written 75718656 bytes to pg_wal/000000010000000A000000F4 -> total WAL segments: 9.65 MB ) cmd=postgres: wal writer process
( written 310665216 bytes to pg_wal/000000010000000A000000F4 -> total WAL segments: 9.65 MB ) cmd=postgres: demo demo 192.168.56.122(38109)
UPDATE 100000

The sparse update benefits from logging only the changed rows: 10 MB here. This one is even smaller than with Oracle because there’s no UNDO to write here: the old values stay in-place.


delete from demo;
( written 323256320 bytes to pg_wal/000000010000000A000000F5 -> total WAL segments: 11.27 MB ) cmd=postgres: demo demo 192.168.56.122(38109)
( written 338829312 bytes to pg_wal/000000010000000A000000F6 -> total WAL segments: 26.92 MB ) cmd=postgres: demo demo 192.168.56.122(38109)
( written 76562432 bytes to pg_wal/000000010000000A000000F6 -> total WAL segments: 31.41 MB ) cmd=postgres: wal writer process
( written 345415680 bytes to pg_wal/000000010000000A000000F7 -> total WAL segments: 39.73 MB ) cmd=postgres: demo demo 192.168.56.122(38109)
( written 83410944 bytes to pg_wal/000000010000000A000000F7 -> total WAL segments: 40.41 MB ) cmd=postgres: wal writer process
DELETE 1000001

The delete is cheap when full_page_writes=off because there’s only the visibility is changed but data remains (until committed and vacuumed). If you have a lot of rows to delete, then consider to set full_page_writes=off and be sure to have a backup to restore in case of crash.

Oracle full page logging in backup mode

So, Oracle by default does not need to protect from fractured blocks, because they can be detected. If the storage crashes while a block is partially written, the block is corrupt. Thanks to the checksum, this corruption will be detected during recovery (or even earlier depending on DB_BLOCK_CHECKSUM and DB_LOST_WRITE_PROTECT). The redo is not sufficient, as it contains only change vectors, but you can recover from the last backup and Oracle can do a simple block recover. This recovery can also be done from the standby database.

However, full page logging exists in Oracle. When running backup from a non-Oracle tool, not aware of block checksum, you need to enclose the copy or snapshot between ‘begin backup’ and ‘end backup’. You do this because online backup may read partially updated blocks, and without the checksum, cannot detect it. A corrupt backup is not very useful and this is why this backup mode will generate more redo to be able to recover them. This is very similar to full page logging: the redo generated for the first modification of the buffer will store the whole block. Next modifications, until buffer is checkpointed, will need only the change vectors.

I think the first article I’ve ever written was a description of the Oracle backup mode. And it is still visible thanks to archive.org only because it was published on… Google Knol!

So, here is the same run with Oracle in backup mode.

Insert does not change a lot as it fills full blocks:

SQL> insert into demo select rownum a,rownum b,rownum c,rownum d,rownum e,rownum f,lpad('x',100,'x') g from xmltable('1 to 1000000');
1000000 rows created.
 
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
141,376 156,527,072 4,380,448 124,195

Full update of one column generates same undo, but more than 2x redo because of full page logging:

SQL> update demo set b=b+1;
1000001 rows updated.
 
SQL> commit;
Commit complete.
 
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
170,778 238,317,632 48,641,772 104,640

Sparse update is exactly the same as full update because this 10% touches all pages:

SQL> update demo set b=b+1 where mod(a,10)=1;
100001 rows updated.
 
SQL> commit;
Commit complete.
 
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
319,622 240,502,284 17,832,196 192,815

Delete generates even more because there’s all the the UNDO in addition to all data pages:

SQL> delete from demo;
1000001 rows deleted.
 
SQL> commit;
Commit complete.
 
SQL> @ _mystat_diff.sql
 
db block changes redo size undo change vector size redo entries
---------------- ---------------- ----------------------- ----------------
2,125,285 558,510,928 240,303,768 1,143,131

So what?

Beyond the very different implementation of Postgres and Oracle, we can see that we have flexibility: the large logging generated by Postgres by default may be reduced in some cases, and the minimal logging which is the default for Oracle may be larger in some situations. The most important, as for all technologies, is to understand how it works. Only then you can do the right choice to balance between performance, availability, and cost. Understand how it works means: read the docs (how it is supposed to work) and test (how it actually works). With Oracle there’s additional information from a huge community testing and using it for decades. With Postgres, as with all Open Source projects, the source code with comments is an amazing documentation.

 

One Comment

Leave a Reply


+ 9 = thirteen

Franck Pachot
Franck Pachot

Technology Leader