Infrastructure at your Service

Joël Cattin

When unreachable NFS share mess up your Dbvisit Standby configuration

If I had to rank my favorite Oracle-related tools and software, Dbvisit Standby would likely be at the top of the list.
You are reading this post, so you probably know that Dbvisit Standby is a Disaster/Recovery solution for Oracle Database Standard Edition (aka Data Guard for poor people 😛 ).

The reasons why I like this product are mostly related to the following points (non-exhaustive list) :

  • Ease of installation and configuration
  • Ease of use
  • Lightness
  • Stability
  • Continuous evolution (new features)
  • Documentation quality
  • Technical support efficiency

However, when running NFS, you have to ensure the share is always reachable for Dbvisit Standby to operate correctly.
In this blog post, I will describe a NFS issue I encountered on a customer running Dbvisit Standby version 10.1 and how it is resolved.

Issue

The following error message appeared while running the dbvctl command to transfer and the apply archive logs :

Dbvisit Standby process for preposs still running on odaprep01 (pid=53835).
See trace file 53835_dbvctl_preposs_202112081932.trc for more details.
Exceeded RUNNING_MAX_TIMES_TRIED=1 attempts.
(if Dbvisit Standby process is no longer running, then delete lock file /u01/app/dbvisit/standby/pid/dbvisit_preposs.pid)

This seems to indicate that the archive logs transfer process did not succeed properly and got stuck on the server.

Let’s have a look to the existing processes :

[[email protected] ~]$ ps -ef | grep dbvisit
oracle   33833 14053  0 Dec08 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_3_12003.3.dbvisit.202112081923.sqlplus.dbv 2>/dev/null
oracle   33834 33833  0 Dec08 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_3_12003.3.dbvisit.202112081923.sqlplus.dbv
oracle   34813 34812  0 Dec08 ?        00:00:00 /bin/sh -c /u01/app/dbvisit/standby/dbvctl -d preposs >/tmp/dbvisit_apply_logs_preposs.log 2>&1
oracle   34814 34813  0 Dec08 ?        00:00:00 /u01/app/dbvisit/standby/dbvctl                                             -d preposs
oracle   35239 34814  0 Dec08 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/34814.0.dbvisit.202112081924.sqlplus.dbv 2>/dev/null
oracle   35240 35239  0 Dec08 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/34814.0.dbvisit.202112081924.sqlplus.dbv
oracle   36929     1  0 Sep17 ?        02:54:14 /u01/app/dbvisit/standby/dbvctl                                             -d preposs -D start
oracle   38142 14053  0 Dec08 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_4_1.4.dbvisit.202112082033.sqlplus.dbv 2>/dev/null
oracle   38143 38142  0 Dec08 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_4_1.4.dbvisit.202112082033.sqlplus.dbv
oracle   41184     1  0 Sep03 ?        01:17:16 /u01/app/dbvisit/dbvnet/dbvnet -d start
oracle   44524 14053  0 Dec08 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_5_1.5.dbvisit.202112082143.sqlplus.dbv 2>/dev/null
oracle   44525 44524  0 Dec08 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_5_1.5.dbvisit.202112082143.sqlplus.dbv
oracle   45780 14053  0 Dec08 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_6_1.6.dbvisit.202112082253.sqlplus.dbv 2>/dev/null
oracle   45781 45780  0 Dec08 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_6_1.6.dbvisit.202112082253.sqlplus.dbv
oracle   51450 14053  0 00:03 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_7_1.7.dbvisit.202112090003.sqlplus.dbv 2>/dev/null
oracle   51451 51450  0 00:03 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_7_1.7.dbvisit.202112090003.sqlplus.dbv
oracle   53234 14053  0 01:13 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_8_1.8.dbvisit.202112090113.sqlplus.dbv 2>/dev/null
oracle   53235 53234  0 01:13 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_8_1.8.dbvisit.202112090113.sqlplus.dbv
oracle   54442 14053  0 02:23 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_9_1.9.dbvisit.202112090223.sqlplus.dbv 2>/dev/null
oracle   54443 54442  0 02:23 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_9_1.9.dbvisit.202112090223.sqlplus.dbv
oracle   59799 14053  0 03:33 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_10_1.10.dbvisit.202112090333.sqlplus.dbv 2>/dev/null
oracle   59800 59799  0 03:33 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_10_1.10.dbvisit.202112090333.sqlplus.dbv
oracle   60840 14053  0 04:43 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_11_1.11.dbvisit.202112090443.sqlplus.dbv 2>/dev/null
oracle   60841 60840  0 04:43 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_11_1.11.dbvisit.202112090443.sqlplus.dbv
oracle   61931 14053  0 05:53 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_12_1.12.dbvisit.202112090553.sqlplus.dbv 2>/dev/null
oracle   61932 61931  0 05:53 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_12_1.12.dbvisit.202112090553.sqlplus.dbv
oracle   64979 14053  0 07:03 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_13_1.13.dbvisit.202112090703.sqlplus.dbv 2>/dev/null
oracle   64980 64979  0 07:03 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_13_1.13.dbvisit.202112090703.sqlplus.dbv
oracle   67322 14053  0 08:13 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_14_1.14.dbvisit.202112090813.sqlplus.dbv 2>/dev/null
oracle   67323 67322  0 08:13 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_14_1.14.dbvisit.202112090813.sqlplus.dbv
oracle   68513 14053  0 09:23 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_15_1.15.dbvisit.202112090923.sqlplus.dbv 2>/dev/null
oracle   68514 68513  0 09:23 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_15_1.15.dbvisit.202112090923.sqlplus.dbv
oracle   71517 14053  0 10:33 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_16_1.16.dbvisit.202112091033.sqlplus.dbv 2>/dev/null
oracle   71518 71517  0 10:33 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_16_1.16.dbvisit.202112091033.sqlplus.dbv
oracle   72932 14053  0 11:43 ?        00:00:00 sh -c /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_17_1.17.dbvisit.202112091143.sqlplus.dbv 2>/dev/null
oracle   72933 72932  0 11:43 ?        00:00:00 /usr/sbin/fuser /u01/app/dbvisit/standby/tmp/14053_17_1.17.dbvisit.202112091143.sqlplus.dbv
oracle   78878     1  0 Sep03 ?        00:09:45 /u01/app/dbvisit/dbvagent/dbvagent -d start
[[email protected] ~]$

Mmmh, quite a big mess here, isn’t ?

Troubleshooting

You certainly noticed that there is a lot of blocked processes running the fuser command. Actually, before transferring an archive logs from the primary server to the standby one, fuser is executed to ensure that the concerned archive log is not being used by another process (e.g RMAN backup). This behavior can be confirmed by analyzing the trace file (<dbvisit_home>/traces directory) generated by the dbvctl command :

209 11:08:41 main::UTIL_UNIX_is_file_open: run command: /usr/sbin/fuser /u03/app/oracle/fast_recovery_area/PRODOSS_DC41/archivelog/2021_12_09/o1_mf_1_162040_jv3oc7xb_.arc
20211209 11:08:41 main::UTIL_run_command: ORACLE_HOME: /u01/app/oracle/product/19.0.0.0/dbhome_2
20211209 11:08:41 main::UTIL_run_command: PATH: /usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/sbin:/sbin:/u01/app/oracle/product/19.0.0.0/dbhome_2/bin
20211209 11:08:41 main::UTIL_run_command: LD_LIBRARY_PATH: /u01/app/dbvisit/standby/lib:/u01/app/oracle/product/19.0.0.0/dbhome_2/lib

The first action I took was to disable the Dbvisit jobs from the crontab – on both sides – to avoid some new processes to be generated.
Then I killed all the blocked processes listed above to start from a clean situation, and I deleted the PID file stored in the <dbvisit_home>/pid > directory.
Finally I executed manually the fuser command against an archive log and there I could see that the command remained stuck, without any result to return. Ctrl-C was needed to stop it. This strange behavior was also the same with some other commands like df  or lsof.

Root cause

By analyzing the system logs (/var/log/messages), I could observe that a NFS share mounted on the server was no longer reachable :

Dec  8 19:38:08 odaprep01 kernel: nfs: server 10.84.48.100 not responding, timed out
Dec  8 19:41:08 odaprep01 kernel: nfs: server 10.84.48.100 not responding, timed out
Dec  8 19:41:14 odaprep01 kernel: nfs: server 10.84.48.100 not responding, timed out

And that’s why the fuser command triggered by dbvctl never ended.

Now, you are probably wondering “the archive logs are not stored on the NFS, so why did this have an impact ?“.
Let me explain…
fuser is designed to access all processes at one time to determine if any of their files is stored on the local file system. To discover that, a stat() call is used to identify the attributes of the processes’ executable. If the executable is stored on a NFS, the stat() call result depends on the NFS availability to be successful and can hang if it’s not reachable.

To solve the issue, I had of course to unmount the unreachable NFS mount point (umount -l <mount_point>).
As Dbvisit were not working properly since several hours, the standby database was out of sync because of archive logs gap. Unfortunately, the missing archive logs where not present anymore into the Fast Recovery Area, so I had to restore them from the RMAN backup :

RMAN> restore archivelog from logseq=26160 until logseq=26190;

Starting restore at 09-DEC-2021 12:26:39
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=365 device type=DISK

channel ORA_DISK_1: starting archived log restore to default destination
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=26169
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=26170
channel ORA_DISK_1: reading from backup piece /u03/app/oracle/fast_recovery_area/PREPOSS_DC41/backupset/2021_12_09/o1_mf_annnn_TAG20211209T002732_jv2hv54g_.bkp
channel ORA_DISK_1: piece handle=/u03/app/oracle/fast_recovery_area/PREPOSS_DC41/backupset/2021_12_09/o1_mf_annnn_TAG20211209T002732_jv2hv54g_.bkp tag=TAG20211209T002732
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:15
channel ORA_DISK_1: starting archived log restore to default destination
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=26160
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=26161
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=26162
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=26163
...
...
...
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=26188
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=26189
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=26190
channel ORA_DISK_1: reading from backup piece /u01/app/oracle/admin/preposs/backup/20211209_111002_arc_PREPOSS_533362994_s2246_p1.bck
channel ORA_DISK_1: piece handle=/u01/app/oracle/admin/preposs/backup/20211209_111002_arc_PREPOSS_533362994_s2246_p1.bck tag=ARC_20211209_111002
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
Finished restore at 09-DEC-2021 12:27:17

RMAN>

And finally, I was able to restart the dbvctl command to resolve the gap by transferring and applying the restored archive logs :

[email protected]:/u01/app/dbvisit/standby/trace/ [preposs] dbvctl -d preposs
=============================================================
Dbvisit Standby Database Technology (10.1.0_0_gba3a9e08) (pid 15459)
dbvctl started on odaprep01: Thu Dec  9 12:28:33 2021
=============================================================

>>> Obtaining information from standby database (RUN_INSPECT=Y)... done
    Thread: 1 Archive log gap: 35. Transfer log gap: 35
>>> Transferring Log file(s) from preposs on odaprep01 to odaprep02:

    thread 1 sequence 26160 (o1_mf_1_26160_jv3t012k_.arc)... done
    thread 1 sequence 26161 (o1_mf_1_26161_jv3szzng_.arc)... done
    thread 1 sequence 26162 (o1_mf_1_26162_jv3t00lo_.arc)... done
    thread 1 sequence 26163 (o1_mf_1_26163_jv3szzpx_.arc)... done
    ...
    ...
    ...
    thread 1 sequence 26188 (o1_mf_1_26188_jv3t0jgx_.arc)... done
    thread 1 sequence 26189 (o1_mf_1_26189_jv3t0jpb_.arc)... done
    thread 1 sequence 26190 (o1_mf_1_26190_jv3t0gr6_.arc)... done

>>> Dbvisit Archive Management Module (AMM)

    Config: number of archives to keep      = 0
    Config: number of days to keep archives = 1
    Config: archive backup count            = 1
    Config: diskspace full threshold        = 80%
==========

    Total number of archive logs   : 35
    Current disk percent full (/u03/app/oracle/fast_recovery_area/) = 10%
==========

    Current disk percent full (FRA) = 0%
==========

    Number of archive logs deleted = 0

=============================================================
dbvctl ended on odaprep01: Thu Dec  9 12:30:19 2021
=============================================================

Conclusion

That was not a Dbvisit issue. As I said at the very beginning, Dbvisit is a great tool 😉 .
Hint: if you want to get the fuser, df, lsof commands working even when a mounted NFS is temporarily unreachable, a solution would be to mount it with the soft option :

mount -o rw,soft host.server.com/share /mymountpoint

According to the documentation :
soft
      Generates a soft mount of the NFS file system. If an error occurs, the stat() function returns with an error.
      If the option hard is used, stat() does not return until the file system is available.”

Hope this helps.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Joël Cattin
Joël Cattin

Senior Consultant