By Mouhamadou Diaw
The other day Franck and I were discussing OCR backups.
Let’s take a 2-node RAC. We can see that the OCR backup is taken automatically by Oracle, on the local disk of only one node, at fixed frequencies (every 4 hours, every day and every week).
[oracle@racsrv2 ~]$ /u01/app/12.1.0.2/grid/bin/ocrconfig -showbackup
racsrv2     2016/02/03 19:58:05     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/backup00.ocr     2528224568
racsrv2     2016/02/03 15:58:03     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/backup01.ocr     2528224568
racsrv2     2016/02/03 11:58:02     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/backup02.ocr     2528224568
racsrv2     2016/02/02 23:57:57     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/day.ocr     2528224568
racsrv2     2016/02/02 23:57:57     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/week.ocr     2528224568
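The first column of this output is the node that holds each backup file, so it can be summarized per node. The following is a minimal sketch, assuming the column layout shown above (node, date, time, path); the function names `ocr_backup_nodes` and `node_holds_ocr_backup` are hypothetical helpers, not part of the Oracle tooling.

```shell
#!/bin/sh
# Both helpers read `ocrconfig -showbackup` output on stdin and only
# consider lines whose second field looks like a date (YYYY/MM/DD),
# so blank lines and messages are skipped.

# Print each node that holds OCR backups, followed by how many it holds.
ocr_backup_nodes() {
    awk '$2 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ { count[$1]++ }
         END { for (node in count) print node, count[node] }'
}

# Succeed only if at least one listed backup is held by node $1.
node_holds_ocr_backup() {
    awk -v node="$1" '
        $1 == node && $2 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ { found = 1 }
        END { exit(found ? 0 : 1) }'
}
```

It could be used as `/u01/app/12.1.0.2/grid/bin/ocrconfig -showbackup | ocr_backup_nodes`, or with `node_holds_ocr_backup racsrv1` to test whether a given node holds any of the listed backups.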
But what determines which node takes the backup? It seems that the backup is always taken on the OCR master node.
In our environment we have 2 servers, and the master is currently racsrv2, as shown below.
[oracle@racsrv2 ~]$ /u01/app/12.1.0.2/grid/bin/olsnodes -s -n
racsrv1 1 Active
racsrv2 2 Active
[oracle@racsrv1 ~]$ grep MASTER /u01/app/oracle/diag/crs/racsrv1/crs/trace/crsd.trc
2016-02-02 19:46:25.134016 : OCRMAS:3316590336: proath_master: SUCCESSFULLY CONNECTED TO THE MASTER
2016-02-02 19:46:25.134024 : OCRMAS:3316590336: th_master: NEW OCR MASTER IS 2
2016-02-02 19:46:26.235447 : CRSPE:2637657856: {1:21661:2} PE MASTER NAME: racsrv2
2016-02-02 19:52:26.126885 : CRSOCR:2646062848: {1:21661:2} Registered PE standby with CSS. I AM A STANDBY MASTER.
[oracle@racsrv2 ~]$ grep MASTER /u01/app/oracle/diag/crs/racsrv2/crs/trace/crsd.trc
2016-02-02 19:46:45.355996 : default:3428738816: LIST_LOGS: 2016-02-02 19:46:45.343: proas_open: I AM THE MASTER NODE, key_name = [DATABASE.NODEAPPS]
2016-02-02 19:46:45.356604 : default:2713634560: LIST_LOGS: 2016-02-02 19:46:45.346: proas_open: I AM THE MASTER NODE, key_name = [SYSTEM.ASM.DEFAULT_DISKGROUP]
2016-02-02 19:46:45.356668 : default:3456055040: LIST_LOGS: 2016-02-02 19:46:45.347: proas_open: I AM THE MASTER NODE, key_name = [SYSTEM.ASM.DEFAULT_DISKGROUP]
2016-02-02 19:46:51.407749 : default:3458156288: LIST_LOGS: 2016-02-02 19:46:51.407: proas_open: I AM THE MASTER NODE, key_name = [SYSTEM.ASM.DEFAULT_DISKGROUP]
OK, since the backup location is not shared, what happens if my master node crashes and the disks on this node are no longer accessible? Does Oracle immediately take an OCR backup on the new master node?
Let’s simulate a crash of my master node (racsrv2) with a power off. We can now see that racsrv1 is the new master.
[oracle@racsrv1 ~]$ grep MASTER /u01/app/oracle/diag/crs/racsrv1/crs/trace/crsd.trc
2016-02-02 19:46:25.134016 : OCRMAS:3316590336: proath_master: SUCCESSFULLY CONNECTED TO THE MASTER
2016-02-02 19:46:25.134024 : OCRMAS:3316590336: th_master: NEW OCR MASTER IS 2
2016-02-02 19:46:26.235447 : CRSPE:2637657856: {1:21661:2} PE MASTER NAME: racsrv2
2016-02-02 19:52:26.126885 : CRSOCR:2646062848: {1:21661:2} Registered PE standby with CSS. I AM A STANDBY MASTER.
2016-02-03 21:47:34.878814 : OCRMAS:3316590336: proath_master: GRPMASTER event. Ignored. Waiting for RCFG event for the old OCR Cache Writer:[2]. New Cache Writer:[1]
2016-02-03 21:47:36.155945 : OCRMAS:3316590336: th_master:13: I AM THE NEW OCR MASTER at incar 3. Node Number 1
2016-02-03 21:47:36.165548 : OCRSRV:2678048512: proas_amiwriter: ctx is MASTER CHANGING/CONNECTING
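Rather than reading the whole grep output, the last OCR master recorded in a trace file can be extracted directly. This is a minimal sketch, assuming the two message formats seen in this environment ("NEW OCR MASTER IS n" and "I AM THE NEW OCR MASTER ... Node Number n"); these messages are not a documented interface and may differ in other versions, and `ocr_master_from_trace` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the node number of the most recent OCR master found in a crsd.trc file.
# $1: path to the trace file.
ocr_master_from_trace() {
    # Each sed expression prints only the captured node number when its
    # message matches; tail keeps the last (most recent) match.
    sed -n \
        -e 's/.*NEW OCR MASTER IS \([0-9][0-9]*\).*/\1/p' \
        -e 's/.*I AM THE NEW OCR MASTER.*Node Number \([0-9][0-9]*\).*/\1/p' \
        "$1" | tail -n 1
}
```

For example, `ocr_master_from_trace /u01/app/oracle/diag/crs/racsrv1/crs/trace/crsd.trc` prints the node number, which can then be matched against the `olsnodes -n` output. Trace files rotate, so the most recent master change may no longer be in the current file.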
But no new OCR backup was taken automatically right after the racsrv2 crash, as shown below (even after a few hours), and the cluster is still referencing the OCR backups on racsrv2.
[root@racsrv1 tmp]# /u01/app/12.1.0.2/grid/bin/ocrconfig -showbackup
racsrv2     2016/02/03 19:58:05     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/backup00.ocr     2528224568
racsrv2     2016/02/03 15:58:03     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/backup01.ocr     2528224568
racsrv2     2016/02/03 11:58:02     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/backup02.ocr     2528224568
racsrv2     2016/02/02 23:57:57     /u01/app/12.1.0.2/grid/cdata/racsrv-cluster/day.ocr     2528224568
That means that if our OCR now gets corrupted, no backup will be available to restore it, and we will have to rebuild our cluster.
To prevent this we can, for example:
1- Back up the default OCR backups (using OS commands, ocrconfig -copy or ocrconfig -export)
2- Choose a shared location for the OCR backups (ocrconfig -backuploc)
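Both options can be sketched in shell. This is only an illustration under stated assumptions: the Grid home path is the one used in this post, the function names and the target directories in the comments are hypothetical, and the `ocrconfig` calls must be run as root.

```shell
#!/bin/sh
# Grid home taken from this post; override OCRCONFIG for another installation.
OCRCONFIG=${OCRCONFIG:-/u01/app/12.1.0.2/grid/bin/ocrconfig}

# Option 1 (OS commands): copy the backups listed by -showbackup that exist
# on this node into directory $1. Files held by other nodes are skipped.
copy_local_ocr_backups() {
    dest=$1
    "$OCRCONFIG" -showbackup |
    awk '$4 ~ /\.ocr$/ { print $4 }' |
    while read -r backup; do
        if [ -f "$backup" ]; then
            cp -p "$backup" "$dest"/
        fi
    done
}

# Option 1 (logical export): write an export of the OCR to file $1.
export_ocr() {
    "$OCRCONFIG" -export "$1"
}

# Option 2: make directory $1 (ideally on shared storage) the backup location.
set_shared_backup_location() {
    "$OCRCONFIG" -backuploc "$1"
}

# Example calls (hypothetical paths):
#   copy_local_ocr_backups /backup/ocr
#   export_ocr /backup/ocr/ocr_export.dmp
#   set_shared_backup_location /shared/ocrbackup
```

The copy function only helps when it runs on the node that currently holds the backups, so it would typically be scheduled on every node; a shared backup location avoids that dependency.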