A few months ago, as I was writing the documentation for a monitoring probe, I suddenly realized that that probe, along with others I had written during that time to monitor Documentum installations, all had a big, unexpected flaw. Indeed, it struck me that if a probe hung for some reason while running, it could stay there well after the next monitoring cycle had begun, which could in turn be affected by the same problem, and so on, until lots of such processes were hanging and possibly hogging valuable repository sessions, causing their complete exhaustion, with catastrophic consequences for the client applications. How ironic would it be if health checks actually endangered an application?

A true story

I realized all this because it had already happened once, years ago, at a different client’s. It was just after we migrated from Documentum content server v5.3 to v6.x. That new version introduced a big shift: the command-line tools iapi, idql, dmbasic and dmawk went java. More precisely, they switched from the native libdmcl40.so C library to the library libdmcl.so, which calls the DfCs behind the scenes, a sorcery made possible by JNI. The front end is still native code but all the Documentum work is henceforth delegated to the java DfCs.
What was the impact on those tools? It was huge: all those tools that used to start in less than a second now took around 10 seconds or more because of all the bloatware initialization. We vaguely noticed it during the tests, supposed it was caused by a big load in that less powerful environment, and went confidently to production one weekend.
The next Monday morning, panicked calls flooded the Help Desk: users were complaining that part of their applications did not work any more. A closer look at the application’s log showed that it had become impossible to open new sessions to some repositories. The process list on the server machine showed tens of documentum and idql processes running at once. Those idql processes were stuck instances of a monitoring probe that ran once per minute. Its job was just to connect to the target docbase, run a quick query and exit with a status. For some reason, it was probably waiting for a session, or idql was taking a lot more than the expected few seconds to do its job; therefore, the next monitoring cycle started before the previous one had completed, it too hung there, and so on until affected users became vocal. The real root cause was programmatic: a developer had thought it a good idea to connect to the docbases, periodically and too frequently, from within Ajax code in the clients’ home page, without informing the docbases’ administrators of this new resource-hungry feature. This resulted in a saturation of the allowed sessions, stuck idql processes, weblogic threads waiting for a connection and, ultimately, application downtime.
Needless to say, the flashy Ajax feature was quickly removed, the number of allowed concurrent sessions was raised and we decided to keep around a copy of those fast, fully native v5.3 tools for low-level tasks such as our monitoring needs.
So let’s see how to protect the probes from themselves and from changing environments or well-meaning but ingenuous developers.

The requirements

1. If the monitoring cycles are tight, the probes shall obviously do very simple things; complex things can take time, and be fragile and buggy. Simple things complete quickly and are less subject to hazards.
2. As seen, unless the probe is started only once and runs constantly in the background, the probe’s interpreter shall start very quickly, which excludes java code and its JVM; this also avoids issues such as the random number generator entropy shortage that has plagued java programs for some time and, I’m sarcastic but confident, the next ones still lurking around the corner. The interpreter that executes the probe shall be that of some well-known scripting language such as the bash or ksh shells, python or perl, with the needed binding to access the resource to be monitored, e.g. a Documentum repository; or some native binary tool that is part of the product to monitor, such as idql or sqlplus, launched by the shell; or even a custom compiled program.
3. While a probe is running, no other instance of it shall be allowed to start; i.e. the next instance shall not start until after the current one completes.
4. A probe shall only be allowed to execute during an allotted time; once this delay has elapsed, the probe shall be terminated manu militari with a distinct return status.
5. The probe’s cycles too shall be monitored, e.g. missing cycles should be reported.

Point 1 is easy to implement; e.g. to check the availability of a repository or a docbase, just try a connection to it and exit. If a more exhaustive test is required, a quick and simple query can be sent to the server. It all depends on how exhaustive we want to be. A SELECT query won’t be able to detect, say, unusable database indexes or indexes being rebuilt off-line, as long as it still completes within the allowed delay. Some neutral UPDATE could be attempted to detect those kinds of issues or, more straightforwardly yet, just query the state of the indexes, as sketched below. But whatever is monitored, let’s keep it quick and direct. The 3rd and 4th requirements can help detect anomalies such as the preceding index problem (in an Oracle database, unusable indexes cause UPDATEs to hang, so timeout detection and forced termination are mandatory in such cases).
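A minimal sketch of that direct index check, assuming an Oracle back-end, illustrative credentials and an account privileged enough to read dba_indexes; a non-empty output means index trouble:

#!/bin/bash
# list the indexes that are not usable; dba_indexes.status is 'VALID'
# for healthy non-partitioned indexes and 'N/A' for partitioned ones
timeout 15s sqlplus -s monitoring/secret@ORCL <<'EOF'
set heading off feedback off
SELECT index_name || ': ' || status
FROM   dba_indexes
WHERE  status NOT IN ('VALID', 'N/A');
EOF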

Point 2 is quite obvious: if the monitoring is so aggressive that it runs in one-minute cycles, the script that it executes shall complete in less than one minute; i.e. start time + execution time shall be less than the monitoring period, let’s say less than half that time to be safe. If the monitoring tools and script cannot keep up with such stringent timing, a different, more efficient approach shall be considered, unless the timing requirement is relaxed somewhat. For example, a reverse approach could be considered where, instead of pulling the status from a target, it’s the target that publishes its status, like a heartbeat; that would permit very tight monitoring cycles, as sketched below.
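A sketch of that reverse approach (the heartbeat file path and interval are illustrative): the monitored machine runs a trivial publisher, and observers merely check the file’s age, which costs them nearly nothing:

#!/bin/bash
# push-style heartbeat: refresh a timestamp file every 10 seconds;
# observers only compare the file's age against a threshold
HEARTBEAT=/var/dctm/heartbeat_doctest    # illustrative path
while true; do
   date +%s > "$HEARTBEAT"
   sleep 10
done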

Point 3 requires a barrier to prevent the next cycle from starting. This does not need to be a fancy test-and-set semaphore because concurrency is practically nonexistent. A simple test of the existence of a conventional file is enough: if the file exists, it means a cycle is in progress and the next cycle is not allowed in; if the file does not exist, create it and continue. There is a race condition between the test and the creation, but it is unlikely to be hit given that the monitoring cycles are quite widely spread apart, one minute at the minimum if defined in the crontab; for the cautious, an atomic variant is sketched below.
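A sketch of such an atomic variant, with an illustrative lock path: mkdir creates the lock and reports failure in a single call, so two concurrent cycles cannot both believe they acquired the barrier:

#!/bin/bash
# mkdir is atomic: creation and the existence test happen in one step
LOCKDIR=/tmp/monitor_doctest.lock       # illustrative path
if ! mkdir "$LOCKDIR" 2>/dev/null; then
   echo "WARNING: previous monitoring cycle still running" >&2
   exit 100
fi
trap 'rmdir "$LOCKDIR"' EXIT            # release even on abnormal exit
# ... probe body goes here ...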

Point 4 means that a timer shall be set up upon starting the probe. This is easy to do from a shell, e.g. thanks to the “timeout” command, as illustrated below. Some tools may have their own command-line option to run in batch mode within a timeout duration, which is even better. Nonetheless, an external timer offers double protection and is still desirable.
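A quick illustration of the timeout command’s semantics, with sleep standing in for any long-running probe:

# if the command outlives the granted delay, timeout kills it and
# returns the distinctive exit status 124
timeout 5s sleep 30
echo $?    # prints 124 after 5 seconds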

Point 5: obviously, this part is only possible from outside the probe. On some systems (e.g. nagios), the probe’s log file itself is monitored and, if it is not updated within some time interval, an alert is raised. This kind of passive or indirect heartbeat permits detecting disabled or stuck probes, but does not remove them. Resilience shall be applied automatically whenever possible in order to minimize human intervention. This check is useful to detect cases where the probe or the scheduler itself has been suspended abruptly or is no longer available on the file system (it can even happen that the file system itself has been unmounted by mistake, or due to some technical problem or unscheduled intervention).
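Where no such monitoring system is available, a do-it-yourself freshness check can be scheduled independently of the probe; a sketch, with illustrative path, threshold and alerting command:

#!/bin/bash
# alert if the probe's log has not been refreshed within the last
# 10 minutes, i.e. at least two 5-minute cycles were missed; find
# prints the file only if it is recent enough, so an empty result
# means the probe is stale or its log is gone
LOG_FILE=/var/dctm/monitor_docbase_doctest   # the file the probe writes
if [ -z "$(find "$LOG_FILE" -mmin -10 2>/dev/null)" ]; then
   echo "ALERT: doctest probe silent for more than 10 minutes" | \
      mailx -s "monitoring probe stuck or missing" admin@example.com
fi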

An example

Let’s say that we want to monitor the availability of a docbase “doctest”. We propose to attempt a connection with idql as “dmadmin” from the server machine, so trusted-mode authentication is used and no password is needed. A response from the docbase shall arrive within 15s. The probe shall run with a periodicity of 5 minutes, i.e. 12 times per hour. Here is a no-frills attempt:

#!/bin/bash

BARRIER=/tmp/sentinel_file
DOCBASE=doctest
LOG_FILE=/var/dctm/monitor_docbase_${DOCBASE}
TIMEOUT=15s
export DOCUMENTUM=/u01/app/documentum53/product/5.3

if [ -f $BARRIER ]; then
   echo "WARNING: previous $DOCBASE monitoring cycle still running" > $LOG_FILE
   exit 100
fi
touch $BARRIER
if [ $? -ne 0 ]; then
   echo "FATAL: monitoring of $DOCBASE failed while touch-ing barrier $BARRIER" > $LOG_FILE
   exit 101
fi

timeout $TIMEOUT $DOCUMENTUM/bin/idql $DOCBASE -Udmadmin -Pxx > /dev/null 2>&1 <<EoQ
   select * from dm_server_config;
   go
   quit
EoQ
rc=$?
if [ $rc -eq 124 ]; then
   echo "FATAL: monitoring of $DOCBASE failed in timeout of $TIMEOUT" > $LOG_FILE
elif [ $rc -eq 1 ]; then
   echo "FATAL: connection to $DOCBASE was unsuccessful"            > $LOG_FILE
else
   echo "OK: connection to $DOCBASE was successful"                   > $LOG_FILE
fi

rm $BARRIER
exit $rc

Line 3: the barrier is an empty file whose existence or absence represents the state of the barrier; if the file exists, then the barrier is down and access is forbidden; if the file does not exist, then the barrier is up and access is allowed;
Line 7: we use the full native, DMCL-based idql utility for a quick start up;
Line 9: the barrier is tested by checking the file’s existence as written above; if the file already exists, it means that an older monitoring cycle is still running, so the new cycle aborts and returns an error message and an exit code;
Line 13: the barrier has been lowered to prevent the next cycle from executing the probe;
Line 19: the idql command is launched and monitored by the command timeout with a duration of $TIMEOUT;
Line 24: the timeout command’s return status is tested; if it is 124 (line 25), it means a timeout has occurred; the probe aborts with an appropriate error message; otherwise, it’s the command’s error code: if it is 1, idql could not connect; if it is 0, the connection was OK;
Lines 27 and 29: the connection attempt returned within the $TIMEOUT time interval, meaning idql reported a connection status;
Line 33: the barrier is removed so the next monitoring cycle has the green light;
Line 34: the exit code is returned; it should be 124 for timeout, 1 for no connection to the docbase, 0 if connection OK;

The timeout command belongs to the coreutils package, so install that package through your Linux distribution’s package manager if the command is missing.

If cron is used as a scheduler, the crontab entry could look like below (assuming the probe’s name is test-connection.sh):

0,5,10,15,20,25,30,35,40,45,50,55 * * * * /u01/app/documentum/monitoring/test-connection.sh > /dev/null 2>&1

cron is sufficient most of the time, even though its time granularity is 1 minute.
The probe could be enhanced very easily so that, once deployed, it optionally installs itself in dmadmin’s crontab, e.g.:

/u01/app/documentum/monitoring/test-connection.sh --install "0,5,10,15,20,25,30,35,40,45,50,55 * * * *"

for maximum simplicity. But this is a different topic.
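Still, as a teaser, a minimal sketch of what the --install handling could look like; the flag, its parsing and the schedule handling are assumptions, not part of the probe above:

# at the top of test-connection.sh: (re)register this script in the
# current user's crontab under the given schedule
if [ "$1" = "--install" ]; then
   SCHEDULE="${2:-0,5,10,15,20,25,30,35,40,45,50,55 * * * *}"
   ME=$(readlink -f "$0")
   ( crontab -l 2>/dev/null | grep -v -F "$ME"; \
     echo "$SCHEDULE $ME > /dev/null 2>&1" ) | crontab -
   exit 0
fi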

Some comments

On some large infrastructures, centralized scheduling and orchestration software may be in use (CTRL-M, Activeeon, Rundeck, Dkron, etc.; just check the web, there are plenty to shop for) which has its own management of rogue jobs and commands. Still, dedicated probes such as the preceding one have to be called, but the timeout logic could then be removed and externalized into the launcher. Better yet, only externalize the hard-coded timeout parameter so it is passed to the probe as a command-line parameter and the probe can still work independently of the launcher in use.
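A sketch of that externalization, assuming the launcher passes the delay as the probe’s first argument:

# in test-connection.sh, replace the hard-coded value with:
TIMEOUT=${1:-15s}    # delay chosen by the launcher, defaulting to 15s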
Other systems use a centralized monitoring system (e.g. nagios, Icinga, BMC TrueSight, Cacti, etc.; again, search the web) but, whatever the software, be prepared to write probes manually because, unless the system to be monitored is as ubiquitous as apache, tomcat or mysql, it is unlikely to be supported out of the box, particularly for specialized products such as Documentum.
Some of the above software needs an agent process deployed and permanently running on each monitored or enrolled machine. There are pros and cons to this architecture but we won’t go there as the subject is out of scope.

Conclusion

Monitoring a target means interacting with it. Physics tells us that it is impossible to observe a system without disturbing it, but at least we can minimize the probes’ footprint. While it is impossible to anticipate every abnormality on a system, these few very simple guidelines can go a long way in making monitoring probes more robust, resilient and as unobtrusive as possible.