Cesare Cervini

A Ruthless Repository Shutdown Utility, Part II

Stopping the unreachable repositories

Suppose that the docbroker has been stopped prematurely and that we want to shut down the repositories, but the out-of-the-box dm_shutdown_repository script is not effective. Why is that, by the way? A close look inside the shutdown script quickly reveals the reason:

#!/bin/sh
################## DOCUMENTUM SERVER SHUTDOWN FILE ######################
#
# 1994-2018 OpenText Corporation. All rights reserved
# Version 16.4 of the Documentum Server.
#
# A generated server shutdown script for a repository.
# This file was generated on Fri Aug 30 12:15:10 CEST 2019 by user dmadmin.
#
check_connect_status() {
status=$1
if [ ! $status = 0 ] ; then
  cat <<-END
  ***** $0: ERROR
  ***** Unable to complete shutdown - unable to connect
  ***** to the server to issue $2 request.
END
  exit 1
fi
}
...
# Stop the server
echo Stopping Documentum server for repository: [dmtestgr02]
echo ''
DM_DMADMIN_USER=dmadmin
#
# Get the pid for the root process
#
DM_PID=`./iapi dmtestgr02 -U$DM_DMADMIN_USER -P -e << EOF  | grep 'root_pid' | sed -e 's/ .*[: A-Za-z]//'
apply,s0,NULL,LIST_SESSIONS
next,s0,q0
dump,s0,q0
exit
EOF`
status=$?
check_connect_status $status LIST_SESSIONS
...
            kill -9 $child_pid
...
  kill -9 $DM_PID
...
         kill -9 $child_pid
...

The shutdown script first attempts to connect to the repository in order to retrieve the root pid of the server processes (this is the DM_PID assignment, which pipes iapi's output through grep and sed). The result of this attempt is then checked by the function check_connect_status defined at the top of the script. If something went wrong during the connection, the status will be != 0 and check_connect_status will simply leave the script through its exit 1. So, if a repository has gone berserk, or no free sessions are available, or the docbroker is unreachable, the script will not be able to stop it. That logic is quite restrictive and we must fall back to killing the repository’s processes ourselves anyway.
Strangely enough, the script is not scared of killing processes, which it does from several places. It rather looks like it is a bit shy in identifying the right ones and therefore relies on the server itself or, ultimately, on the user, for help in this area.
Admittedly, it is not always easy to pinpoint the right processes in the list returned by the ps command, especially if the repository is running in HA on the same machine, or if several repositories share the same machine, so extra care must be taken not to kill the wrong ones. The dm_shutdown_docbase script avoids this difficulty altogether by asking the content server (aka CS) for its root pid, and that is why it aborts if it cannot contact it.
Despite its name, the “kill” command is not limited to “kill -9” (SIGKILL, forceful, coercive kill): it is a general-purpose signal sender and could just as well have been named “signal” or “send”. So, can a signal be sent to the main executable ${DM_HOME}/bin/documentum to ask it to cleanly shut down the repository? We wish, but this has not been implemented. Signals such as SIGQUIT, SIGTRAP, SIGINT and SIGABRT are indeed trapped but will only kill the server after printing the last executed SQL statement or the call stack trace to the server’s log, e.g. after a SIGINT was sent:

2019-10-11T13:14:14.045467 24429[24429] 0100c35080004101 Error: dm_bear_trap: Unexpected exception, (SIGINT: interrupt: (2) at (Connection Failure)), during new session creation in module dmapply.cxx after line 542. Process exiting.
Last SQL statement executed by DB was:
 
 
Last SQL statement executed by DB was:
Last SQL statement executed by DB was:
 
 
 
 
Last SQL statement executed by DB was:
 
 
(23962) Outer Exception handler caught exception: SIGINT: interrupt: (2) at (RPC MAIN)

Thus, a corruption is theoretically possible while using any of those signals, just as it is when a SIGKILL signal is issued.
According to OpenText Support, a trap handler that cleanly shuts down the repository has not been implemented because it needs a session to invoke the shutdown server method. OK, and what if a hidden session were opened at startup time and kept around just for such administrative cases? How about a handler to immediately force a projection to the available docbrokers instead of waiting for the next checkpoint cycle? As you see, there are ways to make the shutdown more resilient, but my overall feeling is that there is a lack of willingness to improve the content server.
Therefore, if waiting about 5 minutes for the repository to project to a docbroker is not acceptable, there is no alternative but to kill -9 the repository’s processes, start the docbroker(s) and then the repository. Other signals can work, but not always, and are not any safer.
In order to use that command, one needs to know the content server’s root pid and, since the CS does not accept any connection at this point, one must get it from another source. Once the root pid is available, it can be given to the kill command with a slight subtlety: in order to include its child processes, the root pid must be negated, e.g.:

# use the standalone /bin/kill command;
$ /bin/kill --signal SIGKILL -12345
# or use bash's kill builtin:
$ command kill -s SIGKILL -12345

A negative pid designates a process group, so this will kill the process with pid 12345, provided it is the group leader, and all the others in the same group, which are the ones it started itself, directly or indirectly.
If a numeric signal is preferred, the equivalent command is:

$ /bin/kill -9 -12345

I leave it to you to decide which one is more readable.
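The effect of the negated pid can be tried out safely on a throw-away process group before pointing it at a real server. The following is only a sketch, assuming Linux with util-linux's setsid and procps' ps; the helper group_size and the temporary pid file are made up for the illustration and have nothing to do with Documentum:

```shell
#!/bin/bash
# Demo: SIGKILL sent to a negated pid reaches the whole process group.
# Assumption: Linux with util-linux's setsid and procps' ps.

pidfile=$(mktemp)

# setsid makes the new bash a session and process group leader, so its pid
# is also its pgid, just like the content server's root process;
# the leader records its own pid, starts two children and waits for them.
setsid bash -c 'echo $$ > "$0"; sleep 300 & sleep 300 & wait' "$pidfile" &

# give the leader a moment to write its pid
for _ in 1 2 3 4 5 6 7 8 9 10; do
   [ -s "$pidfile" ] && break
   sleep 0.2
done
pgid=$(cat "$pidfile")
rm -f "$pidfile"

# hypothetical helper: counts the live (non-zombie) members of a process group
group_size() {
   ps -e -o pgid= -o stat= | awk -v g="$1" '$1 == g && $2 !~ /^Z/ {n++} END {print n + 0}'
}

sleep 0.5
before=$(group_size "$pgid")

# "--" ends option parsing, so the negative number is read as a process group
kill -s KILL -- -"$pgid"
sleep 0.5
after=$(group_size "$pgid")

echo "group $pgid members before: $before, after: $after"
```

The leader and its two sleep children should all be gone after the single kill, without any of their pids having been listed individually.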
So now, we need to identify the repository’s root process. Once found, we can send its negated value the SIGKILL signal, which will propagate to all the child processes. Let’s see now how to identify this root process.

Identifying the content server’s root process

Ordinarily, the LIST_SESSIONS server method returns a collection containing the root_pid attribute among other valuable information, e.g.:

API> apply,c,NULL,LIST_SESSIONS
...
q0
API> next,c,q0
...
OK
API> dump,c,q0
...
USER ATTRIBUTES
 
  root_start                      : 12/11/2019 22:53:19
  root_pid                        : 25329
  shared_mem_id                   : 2588691
  semaphore_id                    : 0
  session                      [0]: 0100c3508000a11c
                               [1]: 0100c3508000a102
                               [2]: 0100c3508000a101
  db_session_id                [0]: 272
                               [1]: 37
                               [2]: 33
  typelockdb_session_id        [0]: -1
                               [1]: -1
                               [2]: -1
  tempdb_session_ids           [0]: -1
                               [1]: 45
                               [2]: 36
  pid                          [0]: 17686
                               [1]: 26512
                               [2]: 26465
  user_name                    [0]: dmadmin
                               [1]: dmadmin
                               [2]: dmadmin
  user_authentication          [0]: Trusted Client
                               [1]: Password
                               [2]: Trusted Client
  client_host                  [0]: docker
                               [1]: 172.19.0.3
                               [2]: docker
  client_lib_ver               [0]: 16.4.0070.0035
                               [1]: 16.4.0070.0035
                               [2]: 16.4.0070.0035
...
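
When the server is reachable, the root_pid attribute can be isolated from such a dump with a small filter. The sketch below is an illustration only: root_pid_from_dump is a made-up helper name, and the sample text is copied from the dump above rather than read from a live iapi session:

```shell
#!/bin/bash
# Sketch: extract root_pid from the text of a LIST_SESSIONS dump read on stdin.
# root_pid_from_dump is a hypothetical helper name.
root_pid_from_dump() {
   awk -F ':' '$1 ~ /^[ \t]*root_pid[ \t]*$/ {gsub(/[ \t]/, "", $2); print $2; exit}'
}

# sample lines copied from the dump shown above
sample='  root_start                      : 12/11/2019 22:53:19
  root_pid                        : 25329
  shared_mem_id                   : 2588691'

# prints 25329
printf '%s\n' "$sample" | root_pid_from_dump
```

Matching on the attribute name alone, instead of grepping for a substring, avoids picking up another attribute whose name happens to contain root_pid.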

But in our case, the CS is not reachable so it cannot be queried.
An easy alternative is to simply look into the CS’s log:

$ less /app/dctm/dba/log/dmtest.log
 
    OpenText Documentum Content Server (version 16.4.0080.0129  Linux64.Oracle)
    Copyright (c) 2018. OpenText Corporation
    All rights reserved.
 
2019-12-11T22:53:19.757264      25329[25329]    0000000000000000        [DM_SERVER_I_START_SERVER]info:  "Docbase dmtest attempting to open"
 
2019-12-11T22:53:19.757358      25329[25329]    0000000000000000        [DM_SERVER_I_START_KEY_STORAGE_MODE]info:  "Docbase dmtest is using database for cryptographic key storage"
...

The number 25329 is the root_pid. It can be extracted from the log file as shown below:

$ grep "\[DM_SERVER_I_START_SERVER\]info" /app/dctm/dba/log/dmtest.log | gawk '{if (match($2, /\[[0-9]+\]/)) {print substr($2, RSTART + 1, RLENGTH - 2); exit}}'
25329
# or, more compactly (the 3-argument match() is a gawk extension):
gawk '{if (match($0, /\[([0-9]+)\].+\[DM_SERVER_I_START_SERVER\]info/, root_pid)) {print root_pid[1]; exit}}' /app/dctm/dba/log/dmtest.log
25329
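
For scripting, it helps if the extraction also reports a failure when the log contains no start entry. Here is a minimal sketch that only uses POSIX awk features; root_pid_from_log is a hypothetical name, and keeping the last start entry instead of the first one is an assumption made in case a log file ever holds several startups:

```shell
#!/bin/bash
# Sketch: print the root pid found in a repository log, or return 1 if none.
# root_pid_from_log is a hypothetical helper; $1 is the path to the log file.
root_pid_from_log() {
   awk '/\[DM_SERVER_I_START_SERVER\]info/ {
           # the pid is the first bracketed number of the entry
           if (match($0, /\[[0-9]+\]/))
              pid = substr($0, RSTART + 1, RLENGTH - 2)
        }
        END {if (pid == "") exit 1; print pid}' "$1"
}

# try it on a one-line sample copied from the log excerpt above
log=$(mktemp)
printf '%s\n' '2019-12-11T22:53:19.757264      25329[25329]    0000000000000000        [DM_SERVER_I_START_SERVER]info:  "Docbase dmtest attempting to open"' > "$log"

if pid=$(root_pid_from_log "$log"); then
   echo "root pid: $pid"
else
   echo "no start entry found in $log" >&2
fi
rm -f "$log"
```

With the exit status available, a caller can refuse to go any further when the pid is missing instead of handing an empty argument to kill.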

The extracted root_pid can be confirmed by the ps command with options ajxf showing a nice tree-like view of the running processes. E.g.:

$ ps_pgid 25329
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    1 25329 25329 25329 ?           -1 Ss    1001   0:01 ./documentum -docbase_name dmtest -security acl -init_file /app/dctm/dba/config/dmtest/server.ini
25329 25370 25329 25329 ?           -1 S     1001   0:00  \_ /app/dctm/product/16.4/bin/mthdsvr master 0xe901fc83, 0x7f084db15000, 0x223000 50000  5 25329 dmtest /app/dctm/dba/log
25370 25371 25329 25329 ?           -1 Sl    1001   0:05  |   \_ /app/dctm/product/16.4/bin/mthdsvr worker 0xe901fc83, 0x7f084db15000, 0x223000 50000  5 0 dmtest /app/dctm/dba/log
25370 25430 25329 25329 ?           -1 Sl    1001   0:05  |   \_ /app/dctm/product/16.4/bin/mthdsvr worker 0xe901fc83, 0x7f084db15000, 0x223000 50000  5 1 dmtest /app/dctm/dba/log
25370 25451 25329 25329 ?           -1 Sl    1001   0:05  |   \_ /app/dctm/product/16.4/bin/mthdsvr worker 0xe901fc83, 0x7f084db15000, 0x223000 50000  5 2 dmtest /app/dctm/dba/log
25370 25464 25329 25329 ?           -1 Sl    1001   0:05  |   \_ /app/dctm/product/16.4/bin/mthdsvr worker 0xe901fc83, 0x7f084db15000, 0x223000 50000  5 3 dmtest /app/dctm/dba/log
25370 25482 25329 25329 ?           -1 Sl    1001   0:05  |   \_ /app/dctm/product/16.4/bin/mthdsvr worker 0xe901fc83, 0x7f084db15000, 0x223000 50000  5 4 dmtest /app/dctm/dba/log
25329 25431 25329 25329 ?           -1 S     1001   0:00  \_ ./documentum -docbase_name dmtest -security acl -init_file /app/dctm/dba/config/dmtest/server.ini
25329 25432 25329 25329 ?           -1 S     1001   0:00  \_ ./documentum -docbase_name dmtest -security acl -init_file /app/dctm/dba/config/dmtest/server.ini
25329 25453 25329 25329 ?           -1 S     1001   0:00  \_ ./documentum -docbase_name dmtest -security acl -init_file /app/dctm/dba/config/dmtest/server.ini
25329 25465 25329 25329 ?           -1 S     1001   0:00  \_ ./documentum -docbase_name dmtest -security acl -init_file /app/dctm/dba/config/dmtest/server.ini
25329 25489 25329 25329 ?           -1 S     1001   0:00  \_ ./documentum -docbase_name dmtest -security acl -init_file /app/dctm/dba/config/dmtest/server.ini
25329 26439 25329 25329 ?           -1 Sl    1001   0:11  \_ ./dm_agent_exec -docbase_name dmtest.dmtest -docbase_owner dmadmin -sleep_duration 0
25329 26465 25329 25329 ?           -1 S     1001   0:00  \_ ./documentum -docbase_name dmtest -security acl -init_file /app/dctm/dba/config/dmtest/server.ini
    1 10112 25329 25329 ?           -1 Rl    1001   0:03 ./dm_agent_exec -docbase_name dmtest.dmtest -docbase_owner dmadmin -trace_level 0 -job_id 0800c3508000218b -log_directory /app/dctm/dba/log -docbase_id 50000

On the third line of the listing (the first process entry), the CS for docbase dmtest was started with pid 25329, which is also its pgid. This process then started a few child processes, all with the pgid 25329.
ps_pgid, invoked on the first line, is a bash function defined in ~/.bashrc as follows:

# returns the lines from ps -ajxf with the given pgid;
# the ps command's header line is printed only if at least 1 entry is found;
function ps_pgid {
   local pgid=$1
   ps -ajxf | gawk -v pgid="$pgid" 'NR == 1 {header = $0; next} $3 == pgid {if (header != "") {print header; header = ""}; print}'
}

The command shows neither the Java method server nor the docbroker, as they were started separately from the CS.
Thus, if we execute the command below:

$ /bin/kill --signal SIGKILL -25329

the CS will be killed along with all its child processes, which is exactly what we want.

Putting both commands together, we get:

/bin/kill --signal SIGKILL -$(grep "\[DM_SERVER_I_START_SERVER\]info" /app/dctm/dba/log/dmtest.log | gawk '{if (match($2, /\[[0-9]+\]/)) {print substr($2, RSTART + 1, RLENGTH - 2); exit}}')

It may be worth defining a bash function for it too:

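function kill_cs {
   local repo=$1 root_pid
   root_pid=$(grep "\[DM_SERVER_I_START_SERVER\]info" /app/dctm/dba/log/${repo}.log | gawk '{if (match($2, /\[[0-9]+\]/)) {print substr($2, RSTART + 1, RLENGTH - 2); exit}}')
   # do nothing if no root pid could be extracted from the log;
   [[ -n $root_pid ]] || { echo "no root pid found for $repo" >&2; return 1; }
   # bash's kill builtin has no --signal long option, hence the standalone command;
   /bin/kill --signal SIGKILL -$root_pid
}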
 
# source it:
. ~/.bashrc
 
# call it:
kill_cs dmtest

where dmtest is the repository whose content server is to be killed.
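One caveat: a log file can outlive its server, so the pid it contains may no longer belong to a content server at all. A more cautious variant could check the pid against ps before killing anything. The sketch below makes several assumptions: the names is_group_leader_named and kill_cs_checked are made up, the log location is the one used in this article, and the root process is expected to show up in ps with the command name documentum, as in the listing above:

```shell
#!/bin/bash
# Sketch only: is_group_leader_named and kill_cs_checked are hypothetical names.

# returns 0 only if $1 is a live process group leader whose command name is $2
is_group_leader_named() {
   local pid=$1 name=$2 pgid comm
   read -r pgid comm < <(ps -o pgid= -o comm= -p "$pid" 2>/dev/null) || return 1
   [ "$pgid" = "$pid" ] && [ "$comm" = "$name" ]
}

# kills the content server of repository $1, but only after checking its root pid
kill_cs_checked() {
   local repo=$1 log=/app/dctm/dba/log/$1.log pid
   pid=$(grep "\[DM_SERVER_I_START_SERVER\]info" "$log" |
         awk '{if (match($2, /\[[0-9]+\]/)) {print substr($2, RSTART + 1, RLENGTH - 2); exit}}')
   if [ -z "$pid" ]; then
      echo "no root pid found in $log" >&2
      return 1
   fi
   if ! is_group_leader_named "$pid" documentum; then
      echo "pid $pid is not a running documentum group leader, nothing killed" >&2
      return 1
   fi
   # "--" ends option parsing so that the negated pid is read as a process group
   kill -s KILL -- -"$pid"
}
```

It would be called like kill_cs, e.g. kill_cs_checked dmtest, and it returns a non-zero status without sending any signal when the pid found in the log cannot be confirmed.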
The naive way to search for the running content server via the command “ps -ef | grep docbase_name” can be too ambiguous when several content servers serve the same repository (e.g. in a high-availability installation) or when docbase_name is the stem of a family of docbases (e.g. dmtest_1, dmtest_2, …, dmtest_10, etc.). Besides, even if no ambiguity were possible, it would return too many processes to be killed individually. xargs could do it at once, sure, but why risk killing the wrong ones? The above ps_pgid function looks directly for the given group id, which is the root_pid of the content server of interest taken straight out of its log file; no ambiguity here.
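The stem problem is easy to reproduce on canned data. In the sketch below, the three command lines and their pids are invented for the illustration; only the docbase names come from the example above:

```shell
#!/bin/bash
# made-up ps-like lines: a pid followed by the server's command line
sample='101 ./documentum -docbase_name dmtest_1 -security acl
202 ./documentum -docbase_name dmtest_10 -security acl
303 ./documentum -docbase_name dmtest_2 -security acl'

# naive substring match: dmtest_1 also hits dmtest_10
naive=$(printf '%s\n' "$sample" | grep -c 'dmtest_1')

# exact match on the value that follows -docbase_name
exact=$(printf '%s\n' "$sample" |
        awk -v name=dmtest_1 '{for (i = 1; i < NF; i++) if ($i == "-docbase_name" && $(i + 1) == name) n++} END {print n + 0}')

# prints "naive: 2, exact: 1"
echo "naive: $naive, exact: $exact"
```

An exact match only removes the stem ambiguity. It still returns one line per server process and cannot tell apart two content servers of the same repository, which is why going through the process group remains the simpler route.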

Hardening start-stop.sh

This ruthless kill functionality could be added to the start-stop script listed above, either as a command-line option to the stop parameter (say, like -k as in the dm_shutdown_repository script) or as a full parameter on a par with the stop | start | status ones, i.e.:

start-stop.sh stop | start | status | kill ...

or simply by deciding that a stop should always succeed and forcing a kill if needed. In such a variant, the stop_docbase() function becomes:

stop_docbase() {
   # the parameter must be assigned before it is first used;
   docbase=$1
   echo "stopping $docbase"
   ./dm_shutdown_${docbase}
   if [[ $? -eq 1 ]]; then
      echo "killing docbase $docbase"
      kill_cs $docbase
   fi
   echo "docbase $docbase stopped"
}

Conclusion

If the content server were open source, this article would have a different title. Instead, it would be “Forcing impromptu projections to docbrokers through signal handling in content server: an implementation” or “Shutting down a content server by sending a signal: a proposal”. We could send this request to the maintainers and probably receive a positive answer. Or we could implement the changes ourselves and submit them as an RFC. This model does not work so well with closed-source commercial software, which evolves following its own marketing agenda. Nonetheless, this situation gives us the opportunity to rant about it and find work-arounds. Imagine a world where all software were flawless; would it be as fun?
