Last February, I performed a rolling operating system upgrade on a four-node RAC cluster (18.104.22.168). I then faced a strange problem when restarting the operating system…
The first step of the procedure was to stop all Grid Infrastructure and Database services running on the first node and to disable Cluster and ASM autostart. The following command is supposed to prevent the Oracle High Availability Service (OHAS) from running at operating system startup:
# crsctl disable crs
Then, I powered off the server and asked the storage administrator to disable all LUNs attached to this server, including the one containing Oracle binaries (/u00).
This step was necessary because the storage bay requires specific drivers which are not shipped with the Linux installation media. Without drivers, all mountpoints attached to LUNs and detected during the upgrade process would have displayed errors.
However, when I started the server again to check that all mountpoints were disabled, the startup procedure hung on the OHASD service, which prevented the server from finishing its boot.
In fact, even if CRS autostart is disabled, a daemon called “ohasd” still runs at server startup. Among other things, it checks for the presence of the CRS binaries and waits indefinitely until they appear. No luck: the LUN attached to the mountpoint containing the CRS binaries was disabled…
We can see this check in the /etc/init.d/ohasd file:
# Wait until it is safe to start CRS daemons
while [ ! -r $CRSCTL ]
do
  $LOGMSG "Waiting for filesystem containing $CRSCTL."
  $SLEEP $DEP_CHECK_WAIT
done
Where $CRSCTL corresponds to /u00/app/11.2.0/grid/bin/crsctl
What is crazy is that the loop runs whether autostart is enabled or not. Only after the loop is done does the script check if autostart is enabled, using the file /etc/oracle/scls_scr/$MY_HOST/root/ohasdstr, which contains “enable” or “disable” depending on the autostart configuration.
Why not check whether autostart is enabled before looking for the CRS binaries? Oracle does not seem to have addressed this question in 22.214.171.124: even though the ohasd startup mechanism was updated, ohasd still waits for the CRS binaries:
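To illustrate the ordering suggested here, the following is a minimal sketch, not Oracle's actual script. The function names autostart_enabled and wait_for_crsctl are hypothetical, and the sketch assumes the flag file contains exactly the word “enable” or “disable”, as described above. It checks the autostart flag first and only waits for the binaries when autostart is enabled:

```shell
#!/bin/sh
# Hypothetical helper (not part of Oracle's script): succeeds only if the
# autostart flag file is readable and contains exactly "enable".
autostart_enabled() {
    flag_file="$1"
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "enable" ]
}

# Hypothetical reordering of the check: look at the autostart flag first,
# and wait for the crsctl binary only when autostart is enabled.
# Arguments: flag file, path to crsctl, seconds between checks (default 60).
wait_for_crsctl() {
    flag_file="$1"
    crsctl_path="$2"
    wait_seconds="${3:-60}"

    if ! autostart_enabled "$flag_file"; then
        echo "Autostart is disabled; not waiting for $crsctl_path."
        return 1
    fi

    while [ ! -r "$crsctl_path" ]; do
        echo "Waiting for filesystem containing $crsctl_path."
        sleep "$wait_seconds"
    done
    return 0
}
```

With the paths from this article, a call would look like: wait_for_crsctl /etc/oracle/scls_scr/$MY_HOST/root/ohasdstr /u00/app/11.2.0/grid/bin/crsctl. With autostart disabled, the function returns immediately instead of blocking the boot.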
# Wait until it is safe to start CRS daemons.
# Wait for 10 minutes for filesystem to mount
# Print message to syslog and console
works=true
for minutes in 10 9 8 7 6 5 4 3 2 1
do
  if [ ! -r $CRSCTL ]
  then
    works=false
    log_console "Waiting $minutes minutes for filesystem containing $CRSCTL."
    $SLEEP $DEP_CHECK_WAIT
  else
    works=true
    break
  fi
done
As you can see, in 126.96.36.199, the server now finishes booting, but only after waiting up to 10 minutes. A message is written to the startup log every minute:
Apr 30 15:31:29 rac1 logger: Waiting 10 minutes for filesystem containing /u00/app/11.2.0/grid/bin/crsctl.
Apr 30 15:32:29 rac1 logger: Waiting 9 minutes for filesystem containing /u00/app/11.2.0/grid/bin/crsctl.
Apr 30 15:33:29 rac1 logger: Waiting 8 minutes for filesystem containing /u00/app/11.2.0/grid/bin/crsctl.
Apr 30 15:34:29 rac1 logger: Waiting 7 minutes for filesystem containing /u00/app/11.2.0/grid/bin/crsctl.
[...]
At this time, the workaround I found to let the server start immediately is to prevent the “ohasd” service from running by renaming or moving the /etc/init.d/ohasd file, or to comment out the loop section in order to skip the infinite loop.
Fortunately, the SSH daemon (if enabled) starts before the OHAS daemon. If the startup procedure is blocked, it is therefore possible to access the machine through an SSH session, apply this workaround, and restart the server.
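The renaming workaround can be sketched as a pair of small shell functions. This is only an illustration under assumptions: the function names and the “.disabled” suffix are hypothetical choices of mine, and the functions must be run as root on the real file. The script is moved aside rather than deleted so that it can be put back once the upgrade is finished:

```shell
#!/bin/sh
# Hypothetical helper: move the ohasd init script aside so that it cannot
# run (and block) at the next boot. Prints the backup path on success.
# Argument: path to the init script (default /etc/init.d/ohasd).
disable_ohasd_init() {
    init_script="${1:-/etc/init.d/ohasd}"
    backup="${init_script}.disabled"   # suffix chosen for this sketch only

    if [ ! -e "$init_script" ]; then
        echo "$init_script not found; nothing to do." >&2
        return 1
    fi
    mv "$init_script" "$backup" && echo "$backup"
}

# Hypothetical helper: restore the init script after the upgrade.
restore_ohasd_init() {
    init_script="${1:-/etc/init.d/ohasd}"
    backup="${init_script}.disabled"

    if [ ! -e "$backup" ]; then
        echo "$backup not found; nothing to restore." >&2
        return 1
    fi
    mv "$backup" "$init_script"
}
```

Calling disable_ohasd_init before powering off the node, and restore_ohasd_init once the LUNs are attached again, keeps the original script untouched.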
I finally added this pre-requisite to the upgrade procedure for the remaining nodes of the cluster.