By Franck Pachot

.

I posted a tweet with a link to a very old document – 20 years old – about ‘internals of recovery’. It’s a gem. All the complexity of the ACID mecanisms of Oracle are explained in a very simple way. It was written for Oracle 7.2 but it’s incredible to see how much the basic things are still relevant today. Of course, there is  a reason for that: the mecanisms of recovery are critical and must be stable. There is one more reason in my opinion: the Oracle RDBMS software was very well designed, then the basic structures designed 20 years ago are still able to cope with new features, and to scale with very large databases, through the versions and the years.

If you check the conversation that followed, a doubt has been raised about the following sentence:

According to write-ahead log protocol, before DBWR can write out a cache buffer containing a modified datablock, LGWR must write out the redo log buffer containing redo records describing changes to that datablock.

There are 2 ways to clear out that kind of doubt: read and test. And we need both of them because:

  • documentation may have bug
  • software may have bug

so you can be sure about a behaviour only when both documentation and test validates your assumption.

Documentation

The first documentation I find about it is another gem describing how Oracle works: Jonathan Lewis ‘Oracle Core (Apress)’. And it’s clearly stated that:

One of the most important features of the Oracle code is that the database writer will not write a changed block to disk before the log writer has written the redo that describes how the block was changed. This write-ahead logging strategy is critical to the whole recovery mechanism.

Then there is of course the Oracle Documentation:

Before DBW can write a dirty buffer, the database must write to disk the redo records associated with changes to the buffer (the write-ahead protocol). If DBW discovers that some redo records have not been written, it signals LGWR to write the records to disk, and waits for LGWR to complete before writing the data buffers to disk.

Test case

Ok, that should be enough. But I want to do a simple testcase in order to see if anything has changed in the latest version (12.1.0.2). My idea is to check two things:

  • whether a checkpoint is requesting so work to be done by logwriter
  • whether a change is written to redo log after a checkpoint, without waiting the usual

I create a table:

19:07:21 SQL> create table DEMO as select '--VAL--1--'||to_char(current_timestamp,'hh24missffff') val from dual;

Table created.

19:07:21 SQL> select * from DEMO;
VAL
 ----------------------------------
 --VAL--1--190721367902367902

I start with a new logfile:

19:07:21 SQL> alter system switch logfile;
System altered.

And I retrieve the log writer process id for future use:

19:07:21 SQL> column spid new_value pid
19:07:21 SQL> select spid,pname from v$process where pname='LGWR';
SPID PNAME
 ------------------------ -----
 12402 LGWR
19:07:21 SQL> host ps -fp &pid
 UID PID PPID C STIME TTY TIME CMD
 oracle 12402 1 0 Jun25 ? 00:00:46 ora_lgwr_DEMO14

update and commit

Here is a scenario where I update and commit:

19:07:21 SQL> update DEMO set val='--VAL--2--'||to_char(current_timestamp,'hh24missffff');
1 row updated.
19:07:21 SQL> select * from DEMO;
VAL
 ----------------------------------
 --VAL--2--190721443102443102
19:07:21 SQL> commit;
Commit complete.

I want to see if a checkpoint has something to wait from the log writer, so I freeze the log writer:

19:07:21 SQL> host kill -sigstop &pid

and I checkpoint:

19:07:21 SQL> alter system checkpoint;
System altered.

No problem. The checkpoint did not require anything from log writer in that case. Note that the dirty buffers related redo has already been written to disk at commit (and log writer was running at that time).

I un-freeze it for the next test:

19:07:21 SQL> host kill -sigcont &pid

update without commit

Now I’m doing the same but without commit. My goal is to see if uncommited dirty blocks need their redo to be written to disk.

19:07:51 SQL> select * from DEMO;
VAL
 ----------------------------------
 --VAL--2--190721443102443102
19:07:51 SQL> host kill -sigstop &pid
19:07:51 SQL> update DEMO set val='--VAL--3--'||to_char(current_timestamp,'hh24missffff');
1 row updated.
19:07:51 SQL> alter system checkpoint;

Here it hangs. Look at the wait events:

b2ap3_thumbnail_CaptureLGWR.JPG

My checkpoint is waiting on ‘rdbms ipc reply’ until the log writer is woken up.

$ kill -sigcont 12402
System altered.
19:09:37 SQL> select * from DEMO;
VAL
 ----------------------------------
 --VAL--3--190751477395477395

The checkpoint is done.

Note that if I run the same but wait 3 seconds after the update (because I know that log writer writes redo at least every 3 seconds even not asked to do it):

21:33:35 SQL> update DEMO set val='--VAL--3--'||to_char(current_timestamp,'hh24missffff');
1 row updated.
21:33:35 SQL> host sleep 3
21:33:38 SQL> host kill -sigstop &pid
21:33:38 SQL> alter system checkpoint;
System altered.
21:33:38 SQL>

checkpoint is not waiting because all the redo that covers the dirty buffers are alerady written.

I’ve also checked that immediately after the checkpoint (without stopping the log writer here) the uncommited change is written to the redo log files:

21:56:38 SQL> select group#,v$log.status,member from v$log join v$logfile using(group#) where v$log.status='CURRENT';
GROUP# STATUS MEMBER
 ---------- ---------------- ----------------------------------------
 2 CURRENT /u01/DEMO/oradata/DEMO14/redo02.log
21:56:38 SQL> update DEMO set val='--VAL--2--'||to_char(current_timestamp,'hh24missffff');
1 row updated.
21:56:38 SQL> select * from DEMO;
VAL
 ----------------------------------
 --VAL--2--215638557183557183
21:56:38 SQL> alter system checkpoint;
System altered.
21:56:38 SQL> host strings &redo | grep "VAL--"
 --VAL--1--215638376899376899
 --VAL--2--2156385571

A simple grep reveals that redo has been written (I’ve no other activity in the database – so no concurrent commits here).

Conclusion

Even if some mecanisms have been improved (see Jonathan lewis book for them) for performance, the fundamentals have not changed.

I’ve said that there are two ways to validated an assumption: documention and test.
But there is a third one: understanding.

When you think about it, if you write uncommited changes to the files, then you must be able to rollback them in case of recovery. Where is the rollback information? In the undo blocks. Are the undo blocks written on disk when the database is written on disk? You don’t know. Then where do you find the undo information in case of recovery? The redo genereated by the transaction contains change vectors for data blocks and for undo blocks. Then if you are sure that all redo is written before the block containing uncomitted changes, then you are sure to be able to rollback those uncommited changes.

Note that this occurs only for modifications through buffer cache. Direct-path insert do not need to be covered by redo to be undone. It’s the change of high water mark that will be undone and this one is done in buffer cache, protected by redo.