By Franck Pachot

.
There are many layers between the Oracle Database pwrite() calls and the physical sector written on disk: filesystem, logical volume, SAN or NAS, with a lot of smart software running for Virtualisation, Compression, Snapshotting, Synchronisation… Are you sure that the changes you made to your data is actually persisted on disk, completely and without any corruption? In case of bug or crash in the storage layer, it may happen that only part of the changes was written. In the case of crash, Oracle ensures that the datafile headers are written at the end, so that recovery can kick-in after the crash. Then, a partially written block can be detected and restored. With different checksum settings, you can also check block integrity while writing or reading. But that protects only for fractured blocks. What if a block write just did not occur? An old version of the block remains and then is perfectly correct for checksum, RMAN, and DBV.

You may be 100% sure that you have never experienced lost writes. But then I’ll ask you: how do you know it? You don’t. Except if you enable Lost Write Protection.

In 11g Oracle introduced this feature for Data Guard configurations. Data Guard is the best protection as the synchronization is done at the highest level: the change vector, generated before any I/O and block modification occurred. Do not use SAN synchronization for your database. Data Guard is less expensive (no option needed), more efficient (only the persistent change information is shipped), and protects over all layers. It protects from lost writes because blocks are written on both sites by a different server, instance, storage. And Data Guard can detect lost writes by shipping the block SCN for each read to compare it with the standby.

However, this has an overhead: redo generation for reads. Oracle 18c comes with a new solution with no need for a standby database: a new LOST WRITE PROTECTION tablespace is created to store the SCN of each block modified.

Do you have lost writes?

First I’ll show what happens without this feature. I create a table filled with number, timestamp and SCN:


SQL> create table DEMO.DEMO pctfree 99 as select rownum id,1 n, current_timestamp ts , (select current_scn from v$database) scn from xmltable('1 to 10');
Table DEMO.DEMO created.
 
SQL> select owner,segment_name,segment_type,block_id,blocks,sum(blocks)over(partition by owner,segment_name,segment_type) from dba_extents where owner='DEMO' and segment_name='DEMO' order by 1,2,3,4;
 
OWNER   SEGMENT_NAME   SEGMENT_TYPE     BLOCK_ID   BLOCKS   SUM(BLOCKS)OVER(PARTITIONBYOWNER,SEGMENT_NAME,SEGMENT_TYPE)
-----   ------------   ------------     --------   ------   -----------------------------------------------------------
DEMO    DEMO           TABLE                3128        8                                                             8
 
SQL> column block_id new_value last_block_id
SQL> select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO;
 
  BLOCK_ID   ID   N TS                                                    SCN
  --------   --   - --                                                    ---
      3131    1   1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH     4619734
      3131    2   1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH     4619734
      3132    3   1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH     4619734
      3132    4   1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH     4619734
      3133    5   1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH     4619734
      3133    6   1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH     4619734
      3134    7   1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH     4619734
      3134    8   1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH     4619734
      3135    9   1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH     4619734
      3135   10   1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH     4619734
 
SQL> column block_id clear

I’ve recorded the las BLOCK_ID in &last_block_id and ensures that all those modifications are written on dosk:


SQL> alter system checkpoint;
System CHECKPOINT altered.
 
SQL> alter system flush buffer_cache;
System FLUSH altered.

I save the block 3133 to keep an old version of it. This will be my way to simulate a lost write.


SQL> host dd if=/u01/oradata/CDB1/PDB1/users01.dbf of=/var/tmp/lwp.blk skip=$(( &last_block_id - 2 )) bs=8k count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000263416 s, 31.1 MB/s

Now, I update all rows, set new timestamp, SCN and increase the number 1 to 2.


SQL> update DEMO.DEMO set n=2+1, ts=current_timestamp, scn=(select current_scn from v$database);
10 rows updated.
SQL> commit;
Commit complete.
 
SQL> alter system checkpoint;
System CHECKPOINT altered.
SQL> alter system flush buffer_cache;
System FLUSH altered.

Here is how I reproduce lost writes. All blocks changed were written to disk, but I restore the old version of block 3133 as if this one was not written:


SQL> host dd if=/var/tmp/lwp.blk of=/u01/oradata/CDB1/PDB1/users01.dbf seek=$(( &last_block_id - 2 )) bs=8k count=1 conv=notrunc
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000111582 s, 73.4 MB/s

This is what you can see if one of your storage layers missed a write and nevertheless acknowledged the I/O call.

s
SQL> select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO;
  BLOCK_ID   ID   N TS                                                    SCN
  --------   --   - ---------------------------------------------     -------
      3131    1   3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH     4619806
      3131    2   3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH     4619806
      3132    3   3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH     4619806
      3132    4   3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH     4619806
      3133    5   1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH     4619734
      3133    6   1 03-MAR-18 04.51.37.838991000 PM EUROPE/ZURICH     4619734
      3134    7   3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH     4619806
      3134    8   3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH     4619806
      3135    9   3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH     4619806
      3135   10   3 03-MAR-18 04.51.39.255582000 PM EUROPE/ZURICH     4619806

No errors. No corruption. Just old values for the rows in this block. And there’s no way to detect it. Physical reads, RMAN, DBV, will all see a perfectly correct block. Only if those changes are inconsistent with other tables (with constraints) or indexes, you may detect a logical corruption.

The probability that this problem occurs is very low (a storage bug on exactly one 8k block or multiple of it). But it is critical because it cannot be detected. To detect it, you need to compare the full block or a checksum, or simply the SCN with another copy, such as in the standby database. Or, with this new feature, store the SCN of each data block in a new structure: the 18c LOST WRITE DETECTION.

Enabling Lost Write Protection

You need to create a tablespace to store those SCN. There’s no choice you must use a bigfile tablespace, but you can create multiple small ones if you want.


SQL> create bigfile tablespace SHADOW
  2   datafile '/u01/oradata/CDB1/PDB1/shadow.dbf'
  3   size 5M
  4   lost write protection;
 
Tablespace SHADOW created.
 
SQL> select tablespace_name,status,bigfile,contents,logging,allocation_type,encrypted,lost_write_protect,chunk_tablespace from dba_tablespaces;
TABLESPACE_NAME   STATUS   BIGFILE   CONTENTS                LOGGING     ALLOCATION_TYPE   ENCRYPTED   LOST_WRITE_PROTECT   CHUNK_TABLESPACE
---------------   ------   -------   --------                -------     ---------------   ---------   ------------------   ----------------
SYSTEM            ONLINE   NO        PERMANENT               LOGGING     SYSTEM            NO          OFF                  N
SYSAUX            ONLINE   NO        PERMANENT               LOGGING     SYSTEM            NO          OFF                  N
UNDOTBS1          ONLINE   NO        UNDO                    LOGGING     SYSTEM            NO          OFF                  N
TEMP              ONLINE   NO        TEMPORARY               NOLOGGING   UNIFORM           NO          OFF                  N
SHADOW            ONLINE   YES       LOST WRITE PROTECTION   LOGGING     SYSTEM            NO          OFF                  N
USERS             ONLINE   NO        PERMANENT               LOGGING     SYSTEM            NO          OFF                  N
 
SQL> select * from dba_data_files;
 
FILE_NAME                                FILE_ID TABLESPACE_NAME         BYTES   BLOCKS STATUS        RELATIVE_FNO AUTOEXTENSIBLE        MAXBYTES   MAXBLOCKS   INCREMENT_BY   USER_BYTES   USER_BLOCKS ONLINE_STATUS   LOST_WRITE_PROTECT
---------                                ------- ---------------         -----   ------ ------        ------------ --------------        --------   ---------   ------------   ----------   ----------- -------------   ------------------
/u01/oradata/CDB1/PDB1/undotbs01.dbf         164 UNDOTBS1            104857600    12800 AVAILABLE                9 YES                34359721984     4194302            640    103809024         12672 ONLINE          OFF
/u01/oradata/CDB1/PDB1/sysaux01.dbf          163 SYSAUX              408944640    49920 AVAILABLE                4 YES                34359721984     4194302           1280    407896064         49792 ONLINE          OFF
/u01/oradata/CDB1/PDB1/system01.dbf          162 SYSTEM              272629760    33280 AVAILABLE                1 YES                34359721984     4194302           1280    271581184         33152 SYSTEM          OFF
/u01/oradata/CDB1/PDB1/users01.dbf           169 USERS               104857600    12800 AVAILABLE              169 NO                           0           0              0    103809024         12672 ONLINE          OFF
/u01/oradata/CDB1/PDB1/shadow.dbf             58 SHADOW                5242880      640 AVAILABLE             1024 NO                           0           0              0      4194304           512 ONLINE          OFF

Then you enable this feature for the database (or pluggable database) and for the tablespaces you want to protect:


SQL> alter pluggable database enable lost write protection;
Pluggable database ENABLE altered.
 
SQL> alter tablespace USERS enable lost write protection;
Tablespace USERS altered.

Here are the new columns in DBA_TABLESPACES and DBA_DATA_FILES:


SQL> select tablespace_name,status,bigfile,contents,logging,allocation_type,encrypted,lost_write_protect,chunk_tablespace from dba_tablespaces;
 
TABLESPACE_NAME   STATUS   BIGFILE   CONTENTS                LOGGING     ALLOCATION_TYPE   ENCRYPTED   LOST_WRITE_PROTECT   CHUNK_TABLESPACE
---------------   ------   -------   --------                -------     ---------------   ---------   ------------------   ----------------
SYSTEM            ONLINE   NO        PERMANENT               LOGGING     SYSTEM            NO          OFF                  N
SYSAUX            ONLINE   NO        PERMANENT               LOGGING     SYSTEM            NO          OFF                  N
UNDOTBS1          ONLINE   NO        UNDO                    LOGGING     SYSTEM            NO          OFF                  N
TEMP              ONLINE   NO        TEMPORARY               NOLOGGING   UNIFORM           NO          OFF                  N
SHADOW            ONLINE   YES       LOST WRITE PROTECTION   LOGGING     SYSTEM            NO          OFF                  N
USERS             ONLINE   NO        PERMANENT               LOGGING     SYSTEM            NO          ENABLED              N
 
SQL> select * from dba_data_files;
 
FILE_NAME                                FILE_ID TABLESPACE_NAME         BYTES   BLOCKS STATUS        RELATIVE_FNO AUTOEXTENSIBLE        MAXBYTES   MAXBLOCKS   INCREMENT_BY   USER_BYTES   USER_BLOCKS ONLINE_STATUS   LOST_WRITE_PROTECT
---------                                ------- ---------------         -----   ------ ------        ------------ --------------        --------   ---------   ------------   ----------   ----------- -------------   ------------------
/u01/oradata/CDB1/PDB1/undotbs01.dbf         164 UNDOTBS1            104857600    12800 AVAILABLE                9 YES                34359721984     4194302            640    103809024         12672 ONLINE          OFF
/u01/oradata/CDB1/PDB1/sysaux01.dbf          163 SYSAUX              408944640    49920 AVAILABLE                4 YES                34359721984     4194302           1280    407896064         49792 ONLINE          OFF
/u01/oradata/CDB1/PDB1/system01.dbf          162 SYSTEM              272629760    33280 AVAILABLE                1 YES                34359721984     4194302           1280    271581184         33152 SYSTEM          OFF
/u01/oradata/CDB1/PDB1/users01.dbf           169 USERS               104857600    12800 AVAILABLE              169 NO                           0           0              0    103809024         12672 ONLINE          ENABLED
/u01/oradata/CDB1/PDB1/shadow.dbf             58 SHADOW                5242880      640 AVAILABLE             1024 NO                           0           0              0      4194304           512 ONLINE          OFF

Note that we are on the Oracle Cloud here and all tablespaces must be encrypted. This is also the case with the LOST WRITE PROTECTION one. Here you can see ENCRYPTION at none only because I decrypted them to look at what is inside the files.

Here are some internal tables giving some information about the storage. Note that tablespace number TSID=7 here is USERS, the one protected, and the TSID=6 one is the LOST WRITE PROTECTION one.


SQL> select * from new_lost_write_datafiles$;
 
  DATAFILE_TSID_TRACKED   DATAFILE_RFN_TRACKED   SHADOW_DATAFILE_TSID   SHADOW_DATAFILE_RFN   SHADOW_DATAFILE_OFFSET   NUMBER_OF_BLOCKS_ALLOCATED DATAFILE_CURRENT_STATUS
  ---------------------   --------------------   --------------------   -------------------   ----------------------   -------------------------- -----------------------
                      7                    169                      6                  1024                      128                          184 enabled
 
SQL> select * from new_lost_write_extents$;
 
  EXTENT_DATAFILE_RFN   EXTENT_DATAFILE_TSID   EXTENT_START   EXTENT_LENGTH_BLOCKS_2K   EXTENT_NEXT_BLOCK
  -------------------   --------------------   ------------   -----------------------   -----------------
                 1024                      6            312                      1312                 641
 
SQL> select * from new_lost_write_shadows$;
 
  SHADOW_DATAFILE_RFN   SHADOW_DATAFILE_TSID   SHADOW_NUMBER_BLOCKS_ALLOC   SHADOW_FIRST_FREE_BLOCK   SHADOW_BLOCK_SIZE_BYTES   SHADOW_RECS_PER_BLOCK
  -------------------   --------------------   --------------------------   -----------------------   -----------------------   ---------------------
                 1024                      6                          640                       128                      8192                     136

Because there was already a ‘lost write’ feature, this one is called ‘new lost write’.

NEW_LOST_WRITE_DATAFILE$ lists all protected (aka tracked) datafiles with the LOST WRITE PROTECTION (aka shadow) tablespace protecting it. The status can be ‘enabled’ or ‘suspended’. The row is deleted if the protection is removed.

NEW_LOST_WRITE_SHADOWS$ lists the LOST WRITE PROTECTION tablespaces. It contains a 1MB bitmap in the first 128 blocks, and extents starts after this.

NEW_LOST_WRITE_EXTENTS$ maps the free extents in the shadow tablespace. Here, the lost write protection for my USERS tablespace (100MB) takes 312 – 128 = 184 blocks (1.4 MB). The extent is 4MB. The 1312 EXTENT_LENGTH_BLOCKS_2K are the remaining free space in the extent, in 2KB blocks. 4MB-1.4MB=2.6MB which is 1312 x 2k blocks.

Simulate Lost Write

As I did before, I create a DEMO table


SQL> create table DEMO.DEMO pctfree 99 as select rownum id,1 n, current_timestamp ts , (select current_scn from v$database) scn from xmltable('1 to 10');
Table DEMO.DEMO created.
 
SQL> column block_id new_value last_block_id
SQL> select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO;
  BLOCK_ID   ID   N TS                                                    SCN
  --------   --   - ---------------------------------------------     -------
      3123    1   1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH     4578079
      3123    2   1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH     4578079
      3124    3   1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH     4578079
      3124    4   1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH     4578079
      3125    5   1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH     4578079
      3125    6   1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH     4578079
      3126    7   1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH     4578079
      3126    8   1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH     4578079
      3127    9   1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH     4578079
      3127   10   1 03-MAR-18 04.21.00.797975000 PM EUROPE/ZURICH     4578079

I save one block:


SQL> column block_id clear
SQL> alter system checkpoint;
System CHECKPOINT altered.
SQL> alter system flush buffer_cache;
System FLUSH altered.
SQL> host dd if=/u01/oradata/CDB1/PDB1/users01.dbf of=/var/tmp/lwp.blk skip=$(( &last_block_id - 2 )) bs=8k count=1
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000325309 s, 25.2 MB/s

Update the table:


SQL> update DEMO.DEMO set n=2+1, ts=current_timestamp, scn=(select current_scn from v$database);
10 rows updated.
SQL> commit;
Commit complete.
SQL> alter system checkpoint;
System CHECKPOINT altered.
SQL> alter system flush buffer_cache;
System FLUSH altered.

Write back the block I saved, to simulate a lost write:


SQL> host dd if=/var/tmp/lwp.blk of=/u01/oradata/CDB1/PDB1/users01.dbf seek=$(( &last_block_id - 2 )) bs=8k count=1 conv=notrunc
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000104103 s, 78.7 MB/s

Lost Write detection

Now, when I query the table, the lost write is detected and an error is raised:


SQL> select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO;
 
Error starting at line : 93 File @ /media/sf_share/18c/lost_write_protection.sql
In command -
select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO
Error report -
ORA-65478: shadow lost write protection - found lost write

In the alert.log I have the mention of the block that failed:


2018-03-03 16:21:06.842000 +01:00
ERROR - I/O type:buffered I/O found lost write in block with file#:169 rdba:0x2a400c35, Expected SCN:0x000000000045db54 SCN in block:0x000000000045db23, approx current SCN:0x000000000045dbbb, RAC instance:1 pdb:5
*****************************************************************
An internal routine has requested a dump of selected redo.
This usually happens following a specific internal error, when
analysis of the redo logs will help Oracle Support with the
diagnosis.
It is recommended that you retain all the redo logs generated (by
all the instances) during the past 12 hours, in case additional
redo dumps are required to help with the diagnosis.
*****************************************************************

The block defined by the RDBA in alert.log is the one I have manually corrupted:


SQL> select dbms_utility.data_block_address_file(to_number('2a400c35','XXXXXXXX'))file#,dbms_utility.data_block_address_block( to_number('2a400c35','XXXXXXXX'))block# from dual;
 
     FILE#     BLOCK#
---------- ----------
       169       3125

As mentioned in the alert.log the session has dumped the redo, as found in the session trace file:


ALTER SYSTEM DUMP REDO DBA MIN 169 3125 DBA MAX 169 3125 SCN MIN 4578132 SCN MAX 4578235 CON_ID 5

This SCN 4578132 is the commit SCN for my update. And the 4578235 is the current one. I can see the change that was lost here:


CHANGE #10 CON_ID:5 TYP:0 CLS:1 AFN:169 DBA:0x2a400c35 OBJ:73527 SCN:0x000000000045db23 SEQ:2 OP:11.4 ENC:0 RBL:0 FLG:0x0000
KTB Redo
op: 0x01  ver: 0x01
compat bit: 4 (post-11) padding: 1
op: F  xid:  0x0008.019.000002d6    uba: 0x02401257.01a2.24
KDO Op code: LKR row dependencies Disabled
  xtype: XA flags: 0x00000000  bdba: 0x2a400c35  hdba: 0x2a400c32
itli: 2  ispac: 0  maxfr: 4858
tabn: 0 slot: 0 to: 2
CHANGE #11 CON_ID:5 TYP:0 CLS:1 AFN:169 DBA:0x2a400c35 OBJ:73527 SCN:0x000000000045db54 SEQ:1 OP:11.5 ENC:0 RBL:0 FLG:0x0000
KTB Redo
op: 0x02  ver: 0x01
compat bit: 4 (post-11) padding: 1
op: C  uba: 0x02401257.01a2.25
KDO Op code: URP row dependencies Disabled
  xtype: XAxtype KDO_KDOM2 flags: 0x00000080  bdba: 0x2a400c35  hdba: 0x2a400c32
itli: 2  ispac: 0  maxfr: 4858
tabn: 0 slot: 0(0x0) flag: 0x2c lock: 2 ckix: 0
ncol: 4 nnew: 3 size: 0
Vector content:
col  1: [ 2]  c1 04
col  2: [13]  78 76 03 03 10 16 02 0d 73 cd 48 86 58
col  3: [ 5]  c4 05 3a 52 20

Then we need to recover…

However, unfortunately, this block is not marked as corrupt:


SQL> select * from  V$DATABASE_BLOCK_CORRUPTION;
no rows selected

This means that I cannot use RMAN block recovery:


SQL> host rman target / <<
Starting recover at 03-MAR-18
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=152 device type=DISK
allocated channel: ORA_DISK_2
 
starting media recovery
media recovery complete, elapsed time: 00:00:00
 
Finished recover at 03-MAR-18

And the RMAN recovery advisor is not aware of the problem:


RMAN> list failure;
 
using target database control file instead of recovery catalog
Database Role: PRIMARY
 
no failures found that match specification

So the solution is to mark it as corrupt or restore the whole datafile (or section). And maybe ensure that the write was lost on the data, and not on the lost write tracking block itself. The redo dump may help for that.

strace: additional I/Os

I traced the server process to see what files are read during my query:


SQL> column spid new_value pid
SQL> select spid from v$process join v$session on v$session.paddr=v$process.addr where sid=sys_context('userenv','sid');
SPID
9360
SQL> column spid clear
SQL> define bg=&:
SQL> host strace -p &pid -o strace.txt &bg

This small awk filters I/O calls on users01.dbf and shadow.dbf and displays the system calls on the file handle


awk '/open[(]["].*(shadow.dbf|users01.dbf).*["],/{l=$0;gsub(/[(,)]/," ");h[$NF]=$2" "$NF;print $l;$0=$l}/^[a-zA-Z0-9]+[(][0-9]+[,)]/{l=$0;gsub(/[(,)]/," "); if ( $1 == "close" ) h[$2]="" ; if ( h[$2]!="" ) printf "%-130s t%80sn",l,h[$2]}/F_DUPFD/{if ( h[$2]!="" ) h[$NF]=h[$2]" "$NF;h[$2]=""}' strace.txt | grep --color=auto -E "^|^.*users01.*"

Here are the open() and pread() calls:


open "/u01/oradata/CDB1/PDB1/shadow.dbf"  O_RDWR|O_DSYNC  = 8
fcntl(8, F_SETFD, FD_CLOEXEC)           = 0                                                             "/u01/oradata/CDB1/PDB1/shadow.dbf" 8
fcntl(8, F_DUPFD, 256)                  = 256                                                           "/u01/oradata/CDB1/PDB1/shadow.dbf" 8
fcntl(256, F_SETFD, FD_CLOEXEC)         = 0                                                         "/u01/oradata/CDB1/PDB1/shadow.dbf" 8 256
pread64(256, "X242226V333E1f372366"..., 8192, 1056768) = 8192                                "/u01/oradata/CDB1/PDB1/shadow.dbf" 8 256
open "/u01/oradata/CDB1/PDB1/users01.dbf"  O_RDWR|O_DSYNC  = 8
fcntl(8, F_SETFD, FD_CLOEXEC)           = 0                                                            "/u01/oradata/CDB1/PDB1/users01.dbf" 8
fcntl(8, F_DUPFD, 256)                  = 257                                                          "/u01/oradata/CDB1/PDB1/users01.dbf" 8
fcntl(257, F_SETFD, FD_CLOEXEC)         = 0                                                        "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257
pread64(257, "#242002f@*%333E14370264"..., 8192, 25575424) = 8192                           "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257
pread64(257, "6242003f@*T333E56317H1007371!333E"..., 40960, 25583616) = 40960           "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257
pread64(257, "6242005f@*#333E24zo1007371!333E"..., 8192, 25600000) = 8192                "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257
pread64(257, "6242003f@*T333E56317H1007371!333E"..., 40960, 25583616) = 40960           "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257
pread64(257, "6242005f@*#333E24zo1007371!333E"..., 8192, 25600000) = 8192                "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257
pread64(257, "6242003f@*T333E56317H1007371!333E"..., 40960, 25583616) = 40960           "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257
pread64(257, "6242005f@*#333E24zo1007371!333E"..., 8192, 25600000) = 8192                "/u01/oradata/CDB1/PDB1/users01.dbf" 8 257

We can see the lost write protection file read first (1 block at offset 1056768 which is block 129, the first one after the 1MB header) and the SCNs for my 5 blocks table are all there. Then the table blocks are read. Note that all those blocks (lost protection and data) goes into the buffer cache, and then do not have to be re-read each time. Here, I’ve run my failing select 3 times and only the first one had to read the shadow datafile.

X$BH: additional buffer gets

As those blocks are read through the buffer cache during the consistent reads, I checked the buffer cache headers for the 3 times I’ve run the queries. I’ve identified them from the function that reads them: kcbr_lost_get_lost_write_scns


SQL> select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO;
ORA-65478: shadow lost write protection - found lost write
 
SQL> select obj,state,tch,fp_whr from x$bh where fp_whr like 'kr_gcur_4: kcbr_lost_get_lost_w%';
 
         OBJ   STATE   TCH FP_WHR
         ---   -----   --- ------
  4294967295       1     1 kr_gcur_4: kcbr_lost_get_lost_w
 
 
SQL> select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO;
ORA-65478: shadow lost write protection - found lost write
 
SQL> select obj,state,tch,fp_whr from x$bh where fp_whr like 'kr_gcur_4: kcbr_lost_get_lost_w%';
 
         OBJ   STATE   TCH FP_WHR
         ---   -----   --- ------
  4294967295       1     2 kr_gcur_4: kcbr_lost_get_lost_w
 
 
SQL> select dbms_rowid.rowid_block_number(rowid) block_id,DEMO.* from DEMO.DEMO;
ORA-65478: shadow lost write protection - found lost write
 
SQL> select obj,state,tch,fp_whr from x$bh where fp_whr like 'kr_gcur_4: kcbr_lost_get_lost_w%';
 
         OBJ   STATE   TCH FP_WHR
         ---   -----   --- ------
  4294967295       1     3 kr_gcur_4: kcbr_lost_get_lost_w

Here we can see the touch count increasing. It seems that for each query the kcbr_lost_get_lost_write_scns is called, even when there was no modification and no new read from disk.

While we’re there, let’s breakpoint on this fonction to see when it is called:


(gdb) break kcbr_lost_get_lost_write_scns
Breakpoint 1 at 0x85a9140
(gdb) c
Continuing.
 
Breakpoint 1, 0x00000000085a9140 in kcbr_lost_get_lost_write_scns ()
(gdb) bt
#0  0x00000000085a9140 in kcbr_lost_get_lost_write_scns ()
#1  0x0000000001cf9c01 in kcbzibmlt ()
#2  0x0000000001ce2f29 in kcbzib ()
#3  0x0000000011e7f6e9 in kcbgtcr ()
#4  0x0000000011e366bd in ktrget2 ()
#5  0x00000000121d5ca7 in kdst_fetch0 ()
#6  0x00000000121e4f5a in kdstf000110100000000km ()
#7  0x00000000121d398e in kdsttgr ()
#8  0x000000001224d28f in qertbFetch ()
#9  0x00000000120340ef in opifch2 ()
#10 0x0000000002d8d033 in kpoal8 ()
#11 0x000000001203af9c in opiodr ()
#12 0x000000001230acf7 in ttcpip ()
#13 0x00000000026a5667 in opitsk ()
#14 0x00000000026aa27d in opiino ()
#15 0x000000001203af9c in opiodr ()
#16 0x00000000026a10a3 in opidrv ()
#17 0x00000000032a58af in sou2o ()
#18 0x0000000000d68047 in opimai_real ()
#19 0x00000000032b2667 in ssthrdmain ()
#20 0x0000000000d67e53 in main ()

Look at Frits Hoogland annotations for the signification and you will see that this is called during consistent reads -> input buffer.

So what?

This feature is interesting. Of course, we need to measure the overhead of this detection, but this additional storage of the SCN being implemented as any data block, benefits from all its efficiency: buffer cache, background writes by dbwr, protection by redo, backups,… These times, I see more and more databases installed on storage with fancy features, and admins playing with snapshot without really knowing whether it is consistent or not. This is the opposite of the ‘reliable’ and ‘keep it simple’ properties that we want for our data. For these environments, when I cannot convince the storage admins to forget about those features and rely on Data Guard on top of the simplest storage, then at least we have a way to protect us from failures in those layers.