By Mouhamadou Diaw

The other day, Franck and I were discussing OCR backups.

Let's take a 2-node RAC. We can see that the OCR backup is done automatically by Oracle on only one node, in a local (non-shared) location, at a fixed frequency (every 4 hours, every day and every week).

[oracle@racsrv2 ~]$ /u01/app/12.1.0.2/grid/bin/ocrconfig -showbackup

racsrv2     2016/02/03 19:58:05     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/backup00.ocr     2528224568
racsrv2     2016/02/03 15:58:03     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/backup01.ocr     2528224568
racsrv2     2016/02/03 11:58:02     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/backup02.ocr     2528224568
racsrv2     2016/02/02 23:57:57     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/day.ocr     2528224568
racsrv2     2016/02/02 23:57:57     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/week.ocr     2528224568
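
Note that ocrconfig can also list the automatic and the manual backups separately. The commands below are just these listing calls (we have not taken any manual backup yet, so the second list would still be empty here):

[oracle@racsrv2 ~]$ /u01/app/12.1.0.2/grid/bin/ocrconfig -showbackup auto
[oracle@racsrv2 ~]$ /u01/app/12.1.0.2/grid/bin/ocrconfig -showbackup manual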

But what determines which node takes the backup? It seems that the backup is always done on the master node.

In our environment we have 2 servers, and the current master is racsrv2, as shown below:

[oracle@racsrv2 ~]$ /u01/app/12.1.0.2/grid/bin/olsnodes -s -n

racsrv1 1       Active
racsrv2 2       Active
[oracle@racsrv1 ~]$ grep MASTER /u01/app/oracle/diag/crs/racsrv1/crs/trace/crsd.trc
2016-02-02 19:46:25.134016 :  OCRMAS:3316590336: proath_master: SUCCESSFULLY CONNECTED TO THE MASTER
2016-02-02 19:46:25.134024 :  OCRMAS:3316590336: th_master: NEW OCR MASTER IS 2
2016-02-02 19:46:26.235447 :   CRSPE:2637657856: {1:21661:2} PE MASTER NAME: racsrv2
2016-02-02 19:52:26.126885 :  CRSOCR:2646062848: {1:21661:2} Registered PE standby with CSS. I AM A STANDBY MASTER.
[oracle@racsrv2 ~]$ grep MASTER /u01/app/oracle/diag/crs/racsrv2/crs/trace/crsd.trc
2016-02-02 19:46:45.355996 : default:3428738816:  LIST_LOGS: 2016-02-02 19:46:45.343: proas_open: I AM THE MASTER NODE, key_name = [DATABASE.NODEAPPS]
2016-02-02 19:46:45.356604 : default:2713634560:  LIST_LOGS: 2016-02-02 19:46:45.346: proas_open: I AM THE MASTER NODE, key_name = [SYSTEM.ASM.DEFAULT_DISKGROUP]
2016-02-02 19:46:45.356668 : default:3456055040:  LIST_LOGS: 2016-02-02 19:46:45.347: proas_open: I AM THE MASTER NODE, key_name = [SYSTEM.ASM.DEFAULT_DISKGROUP]
2016-02-02 19:46:51.407749 : default:3458156288:  LIST_LOGS: 2016-02-02 19:46:51.407: proas_open: I AM THE MASTER NODE, key_name = [SYSTEM.ASM.DEFAULT_DISKGROUP]
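
To avoid running the grep by hand on every node, a small loop can do it in one go. This is only a sketch: it assumes passwordless ssh is configured for the oracle user between the nodes (usually the case on a RAC) and that the trace path is the same on both nodes:

# report, for each node, the last "OCR MASTER" line from its crsd trace
for node in racsrv1 racsrv2; do
  echo "--- $node"
  ssh $node "grep 'OCR MASTER' /u01/app/oracle/diag/crs/$node/crs/trace/crsd.trc | tail -1"
done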

OK, since the backup location is not shared, what happens if the master node crashes and the disks on this node are no longer accessible? Does Oracle immediately take an OCR backup on the new master node?
Let's simulate a crash of the master node (racsrv2) with a power off. We can see that racsrv1 is now the new master:

[oracle@racsrv1 ~]$ grep MASTER /u01/app/oracle/diag/crs/racsrv1/crs/trace/crsd.trc

2016-02-02 19:46:25.134016 :  OCRMAS:3316590336: proath_master: SUCCESSFULLY CONNECTED TO THE MASTER
2016-02-02 19:46:25.134024 :  OCRMAS:3316590336: th_master: NEW OCR MASTER IS 2
2016-02-02 19:46:26.235447 :   CRSPE:2637657856: {1:21661:2} PE MASTER NAME: racsrv2
2016-02-02 19:52:26.126885 :  CRSOCR:2646062848: {1:21661:2} Registered PE standby with CSS. I AM A STANDBY MASTER.
2016-02-03 21:47:34.878814 :  OCRMAS:3316590336: proath_master: GRPMASTER event. Ignored. Waiting for RCFG event for the old OCR Cache Writer:[2]. New Cache Writer:[1]
2016-02-03 21:47:36.155945 :  OCRMAS:3316590336: th_master:13: I AM THE NEW OCR MASTER at incar 3. Node Number 1
2016-02-03 21:47:36.165548 :  OCRSRV:2678048512: proas_amiwriter: ctx is MASTER CHANGING/CONNECTING

But as shown below, no new OCR backup was automatically taken after the racsrv2 crash (not even after a few hours), and the cluster is still referencing the OCR backups located on racsrv2.

[root@racsrv1 tmp]# /u01/app/12.1.0.2/grid/bin/ocrconfig -showbackup
racsrv2     2016/02/03 19:58:05     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/backup00.ocr     2528224568
racsrv2     2016/02/03 15:58:03     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/backup01.ocr     2528224568
racsrv2     2016/02/03 11:58:02     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/backup02.ocr     2528224568
racsrv2     2016/02/02 23:57:57     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/day.ocr     2528224568

This means that if the OCR gets corrupted now, no backup will be available to restore it from, and we will have to rebuild our cluster configuration.
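
In such a situation, a quick mitigation is to trigger a manual OCR backup on the new master right away. This is just a sketch of the idea (the commands are run as root, and the manual backup lands in the default backup location of the current master):

[root@racsrv1 ~]# /u01/app/12.1.0.2/grid/bin/ocrconfig -manualbackup
[root@racsrv1 ~]# /u01/app/12.1.0.2/grid/bin/ocrconfig -showbackup manual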

To avoid ending up in this situation in the first place, we can for example:

1- Back up the default OCR backups to another location (using OS commands, ocrconfig -copy or ocrconfig -export)
2- Choose a shared location for the OCR backups

Both options are sketched below.
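
Here is a minimal sketch of both options. The shared paths used below (/shared/ocrbackup and the ASM disk group +OCRBKP) are hypothetical and must exist and be accessible from all nodes; the commands are run as root:

# 1- copy an existing automatic backup and export the OCR content to a shared location
[root@racsrv1 ~]# /u01/app/12.1.0.2/grid/bin/ocrconfig -copy /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/backup00.ocr /shared/ocrbackup/backup00.ocr
[root@racsrv1 ~]# /u01/app/12.1.0.2/grid/bin/ocrconfig -export /shared/ocrbackup/racsrv-cluster_ocr.exp

# 2- relocate the automatic OCR backup location to a shared place (an ASM disk group is supported in 12c)
[root@racsrv1 ~]# /u01/app/12.1.0.2/grid/bin/ocrconfig -backuploc +OCRBKP

With the backup location on shared storage, the automatic backups stay reachable whichever node happens to be the master.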