Infrastructure at your Service

Cesare Cervini

How to stop Documentum processes in a docker container, and more (part I)

How to stop Documentum processes in a docker container, and more

Ideally, but not mandatorily, the management of Documentum processes is performed at the service level, e.g. by systemd. In my blog here, I showed how to configure init files for Documentum under systemd. But containers don’t have systemd, yet. They just run processes, often only one, sometimes more if they are closely related together (e.g. the docbroker, the method server and the content servers), so how to replicate the same functionality with containers ?
The topic of stopping processes in a docker container is abundantly discussed on-line (see for example the excellent article here). O/S signals are the magic solution so much so that I should have entitled this blog “Fun with the Signals” really !
I’ll simply see here if the presented approach can be applied in the particular case of a dockerized Documentum server. However, in order to keep things simple and quick, I won’t test such a real dockerized Documentum installation but rather use a script to simulate the Documentum processes, or any other processes at that since it is so generic.
But first, why bother with this matter ? During all the years that I have been administrating repositories I’ve never noticed anything going wrong after restarting a suddenly stopped server, be it after an intentional kill, a pesky crash or an unplanned database unavailability. Evidently, the content server (CS henceforth) seems quite robust in this respect. Or maybe we were simply lucky so far. Personally, I don’t feel confident if I don’t shut down cleanly a process or service that must be stopped; some data might be still buffered in the CS’ memory and not flushing them properly might introduce inconsistencies or even corruptions. The same goes when an unsuspected multi-step operation is started and aborted abruptly in the middle; ideally, transactions, if they are used, exist for this purpose but anything can go wrong during rollback. Killing a process is like slamming a door, it produces a lot of noise, vibrations in the walls, even damages in the long run and always leaves a bad impression behind. Isn’t it more comforting to clean up and shut the door gently ? Even then something can go wrong but at least it will be through no fault of our own.

A few Reminders

When a “docker container stop” is issued, docker sends the SIGTERM signal to the process with PID == 1 running inside the container. That process, if programmed to do so, can then react to the signal and do anything seen fit, typically shutting the running processes down cleanly. After a 10 seconds grace period, the container is stopped manu militari. In the case of Documentum processes, to put it politely, they don’t give a hoot to signals, except of course to the well-known, unceremonious SIGKILL one. Thus, a proxy process must be introduced which will accept the signal and invoke the proper shutdown scripts to stop the CS processes, usually the dm_shutdown_* and dm_stop_* scripts, or a generic one that takes care of everything, at start up and at shut down time.
Said proxy must run with PID == 1 i.e. it must be the first one started in the container. Sort of, but even if it is not the very first, its PID 1 parent can pass it the control by using one of the exec() family functions; unlike forking a process, those in effect allow a child process to replace its parent under the latter’s PID, kind of like in the Matrix movies the agents Smith inject themselves into someone else’s persona, if you will ;-). The main thing being that at one point the proxy becomes PID 1. Luckily for us, we don’t have to bother with this complexity for the dockerfile’s ENTRYPOINT[] clause takes care of everything.
The proxy also will be the one that starts the CS. In addition, since it must wait for the SIGTERM signal, it must never exit. It can indefinitely wait listening on a fake input (e.g. tail -f /dev/null), or wait for an illusory input in a detached container (e.g. while true; do read; done) or, better yet, do something useful like some light-weight monitoring.
While at it, the proxy process can listen to several conventional signals and react accordingly. For instance, a SIGUSR1 could mean “give me a docbroker docbase map” and a SIGUSR2 “restart the method server”. Admittedly, these actions could be done directly by just executing the relevant commands inside the container or from the outside command-line but the signal way is cheaper and, OK, funnier. So, let’s see how we can set all this up !

The implementation

As said, in order to focus on our topic, i.e. signal trapping, we’ve replaced the CS part with a simple simulation script, dctm.sh, that starts, stops and queries the status of dummy processes. It uses the bash shell and has been written under linux. Here it is:

#!/bin/bash
# launches in the background, or stops or queries the status of, a command with a conventional identification string in the prompt;
# the id is a random number determined during the start;
# it should be passed to enquiry the status of the started process or to stop it;
# Usage:
#   ./dctm.sh stop  | start | status 
# e.g.:
# $ ./dctm.sh start
# | process started with pid 13562 and random value 33699963
# $ psg 33699963
# | docker   13562     1  0 23:39 pts/0    00:00:00 I am number 33699963
# $ ./dctm.sh status 33699963
# $ ./dctm.sh stop 33699963
#
# cec - dbi-services - April 2019
#
trap 'help'         SIGURG
trap 'start_all'    SIGPWR
trap 'start_one'    SIGUSR1
trap 'status_all'   SIGUSR2
trap 'stop_all'     SIGINT SIGABRT
trap 'shutdown_all' SIGHUP SIGQUIT SIGTERM

verb="sleep"
export id_prefix="I am number"

func() {
   cmd="$1"
   case $cmd in
      start)
         # do something that sticks forever-ish, min ca. 20mn;
         (( params = 1111 * $RANDOM ))
         exec -a "$id_prefix" $verb $params &
         echo "process started with pid $! and random value $params"
         ;;
      stop)
         params=" $2"
         pid=$(ps -ajxf | gawk -v params="$params" '{if (match($0, " " ENVIRON["id_prefix"] params "$")) pid = $2} END {print (pid ? pid : "")}')
         if [[ ! -z $pid ]]; then
            kill -9 ${pid} &> /dev/null
            wait ${pid} &> /dev/null
         fi
         ;;
      status)
         params=" $2"
         read pid gid < <(ps -ajxf | gawk -v params="$params" '{if (match($0, " " ENVIRON["id_prefix"] params "$")) pid = $2 " " $3} END {print (pid ? pid : "")}')
         if [[ ! -z $pid ]]; then
            echo "random value${params} is used by process with pid $pid and pgid $gid"
         else
            echo "no such process running"
         fi
         ;;
   esac
}

help() {
   echo
   echo "send signal SIGURG for help"
   echo "send signal SIGPWR to start a few processes"
   echo "send signal SIGUSR1 to start a new process"
   echo "send signal SIGUSR2 for the list of started processes"
   echo "send signal SIGINT | SIGABRT  to stop all the processes"
   echo "send signal SIGHUP | SIGQUIT | SIGTERM to shutdown the processes and exit the container"
}

start_all() {
   echo; echo "starting a few processes at $(date +"%Y/%m/%d %H:%M:%S")"
   for loop in $(seq 5); do
      func start
   done

   # show them;
   echo; echo "started processes"
   ps -ajxf | grep "$id_prefix" | grep -v grep
}

start_one() {
   echo; echo "starting a new process at $(date +"%Y/%m/%d %H:%M:%S")"
   func start
}

status_all() {
   echo; echo "status of running processes at $(date +"%Y/%m/%d %H:%M:%S")"
   for no in $(ps -ef | grep "I am number " | grep -v grep | gawk '{print $NF}'); do
      echo "showing $no"
      func status $no
   done
}

stop_all() {
   echo; echo "shutting down the processes at $(date +"%Y/%m/%d %H:%M:%S")"
   for no in $(ps -ef | grep "I am number " | grep -v grep | gawk '{print $NF}'); do
      echo "stopping $no"
      func stop $no
   done
}

shutdown_all() {
   echo; echo "shutting down the container at $(date +"%Y/%m/%d %H:%M:%S")"
   stop_all
   exit 0
}

# -----------
# main;
# -----------

# starts a few dummy processes;
start_all

# display some usage explanation;
help

# make sure the container stays up and waits for signals;
while true; do read; done

The main part of the script starts a few processes, displays a help screen and then waits for input from stdin.
The script can be first tested outside a container as follows.
Run the script:

./dctm.sh

It will start a few easily distinguishable processes and display a help screen:

starting a few processes at 2019/04/06 16:05:35
process started with pid 17621 and random value 19580264
process started with pid 17622 and random value 19094757
process started with pid 17623 and random value 18211512
process started with pid 17624 and random value 3680743
process started with pid 17625 and random value 18198180
 
started processes
17619 17621 17619 1994 pts/0 17619 S+ 1000 0:00 | \_ I am number 19580264
17619 17622 17619 1994 pts/0 17619 S+ 1000 0:00 | \_ I am number 19094757
17619 17623 17619 1994 pts/0 17619 S+ 1000 0:00 | \_ I am number 18211512
17619 17624 17619 1994 pts/0 17619 S+ 1000 0:00 | \_ I am number 3680743
17619 17625 17619 1994 pts/0 17619 S+ 1000 0:00 | \_ I am number 18198180
 
send signal SIGURG for help
send signal SIGPWR to start a few processes
send signal SIGUSR1 to start a new process
send signal SIGUSR2 for the list of started processes
send signal SIGINT | SIGABRT to stop all the processes
send signal SIGHUP | SIGQUIT | SIGTERM to shutdown the processes and exit the container

Then, it will simply sit there and wait until it is asked to quit.
From another terminal, let’s check the started processes:

ps -ef | grep "I am number " | grep -v grep
docker 17621 17619 0 14:40 pts/0 00:00:00 I am number 19580264
docker 17622 17619 0 14:40 pts/0 00:00:00 I am number 19094757
docker 17623 17619 0 14:40 pts/0 00:00:00 I am number 18211512
docker 17624 17619 0 14:40 pts/0 00:00:00 I am number 3680743
docker 17625 17619 0 14:40 pts/0 00:00:00 I am number 18198180

Those processes could be Documentum ones or anything else, the point here is to control them from the outside, e.g. another terminal session, in or out of a docker container. We will do that though O/S signals. The bash shell lets a script listen and react to signals through the trap command. On top of the script, we have listed all the signals we’d like the script to react upon:

trap 'help'         SIGURG
trap 'start_all'    SIGPWR
trap 'start_one'    SIGUSR1
trap 'status_all'   SIGUSR2
trap 'stop_all'     SIGINT SIGABRT
trap 'shutdown_all' SIGHUP SIGQUIT SIGTERM

It’s really a feast of traps !
The first line for example says that on receiving the SIGURG signal, the script’s function help() should be executed, no matter what the script was doing at that time, which in our case is just waiting for input from stdin.
The SIGPWR signal is interpreted as start in the background another suite of five processes with the same naming convention “I am number ” followed with a random number. The function start_all() is called on receiving this signal.
The SIGUSR1 signal starts one new process in the background. Function start_one() does just this.
The SIGUSR2 signal displays all the started processes so far by invoking function status_all().
The SIGINT and SIGABRT signals shut down all the started processes so far. Function stop_all() is called to this purpose.
Finally, signals SIGHUP, SIGQUIT, or SIGTERM all invokes function shutdown_all() to stop all the processes and exit the script.
Admittedly, those signal’s choice is a bit stretched out but this is for the sake of the demonstration so bear with us. Feel free to remap the signals to the functions any way you prefer.
Now, how to send those signals ? The ill-named kill command or program is here for this. Despite its name, nobody will be killed here fortunately; signals will be sent and processes decide to react opportunely. Here, of course, we do react opportunely.
Here is its syntax (let’s use the −−long-options for clarity):

/bin/kill --signal pid

Since bash has a built-in kill command that behaves differently, make sure to call the right program by specifying its full path name, /bin/kill.
Example of use:

/bin/kill --signal SIGURG $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
# or shorter:
/bin/kill --signal SIGURG $(pgrep ^dctm.sh$)

The signal’s target is our test program dctm.sh, which is identified vis a vis kill through its PID.
Signals can be specified by their full name, e.g. SIGURG, SIGPWR, etc… or without the SIG prefix such as URG, PWR, etc … or even through their numeric value as shown below:

/bin/kill -L
1 HUP 2 INT 3 QUIT 4 ILL 5 TRAP 6 ABRT 7 BUS
8 FPE 9 KILL 10 USR1 11 SEGV 12 USR2 13 PIPE 14 ALRM
15 TERM 16 STKFLT 17 CHLD 18 CONT 19 STOP 20 TSTP 21 TTIN
22 TTOU 23 URG 24 XCPU 25 XFSZ 26 VTALRM 27 PROF 28 WINCH
29 POLL 30 PWR 31 SYS
 
or:
 
kill -L
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5) SIGTRAP
6) SIGABRT 7) SIGBUS 8) SIGFPE 9) SIGKILL 10) SIGUSR1
11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM
16) SIGSTKFLT 17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO 30) SIGPWR
31) SIGSYS 34) SIGRTMIN 35) SIGRTMIN+1 36) SIGRTMIN+2 37) SIGRTMIN+3
38) SIGRTMIN+4 39) SIGRTMIN+5 40) SIGRTMIN+6 41) SIGRTMIN+7 42) SIGRTMIN+8
43) SIGRTMIN+9 44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9 56) SIGRTMAX-8 57) SIGRTMAX-7
58) SIGRTMAX-6 59) SIGRTMAX-5 60) SIGRTMAX-4 61) SIGRTMAX-3 62) SIGRTMAX-2
63) SIGRTMAX-1 64) SIGRTMAX

Thus, the following incantations are equivalent:

/bin/kill --signal SIGURG $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal URG $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal 23 $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')

On receiving the supported signals, the related function is invoked and thereafter the script returns to its former activity, namely the loop that waits for a fake input. The loop is needed otherwise the script would exit on returning from a trap handler. In effect, the trap is processed like a function call and, on returning, the next statement at the point the trap occurred is given control. If there is none, then the script terminates. Hence the loop.
Here is the output after sending a few signals; for clarity, the signals sent from another terminal have been manually inserted as highlighted comments before the output they caused.
Output terminal:

# SIGUSR2:
status of running processes at 2019/04/06 16:12:46
showing 28046084
random value 28046084 is used by process with pid 29248 and pgid 29245
showing 977680
random value 977680 is used by process with pid 29249 and pgid 29245
showing 26299592
random value 26299592 is used by process with pid 29250 and pgid 29245
showing 25982957
random value 25982957 is used by process with pid 29251 and pgid 29245
showing 27830550
random value 27830550 is used by process with pid 29252 and pgid 29245
5 processes found
 
# SIGUSR1:
starting a new process at 2019/04/06 16:18:56
process started with pid 29618 and random value 22120010
 
# SIGUSR2:
status of running processes at 2019/04/06 16:18:56
showing 28046084
random value 28046084 is used by process with pid 29248 and pgid 29245
showing 977680
random value 977680 is used by process with pid 29249 and pgid 29245
showing 26299592
random value 26299592 is used by process with pid 29250 and pgid 29245
showing 25982957
random value 25982957 is used by process with pid 29251 and pgid 29245
showing 27830550
random value 27830550 is used by process with pid 29252 and pgid 29245
showing 22120010
random value 22120010 is used by process with pid 29618 and pgid 29245
6 processes found
 
# SIGURG:
send signal SIGURG for help
send signal SIGPWR to start a few processes
send signal SIGUSR1 to start a new process
send signal SIGUSR2 for the list of started processes
send signal SIGINT | SIGABRT to stop all the processes
send signal SIGHUP | SIGQUIT | SIGTERM to shutdown the processes and exit the container
 
# SIGINT:
shutting down the processes at 2019/04/06 16:20:17
stopping 28046084
stopping 977680
stopping 26299592
stopping 25982957
stopping 27830550
stopping 22120010
6 processes stopped
 
# SIGUSR2:
status of running processes at 2019/04/06 16:20:18
0 processes found
 
# SIGPWR:
starting a few processes at 2019/04/06 16:20:50
process started with pid 29959 and random value 2649735
process started with pid 29960 and random value 14971836
process started with pid 29961 and random value 14339677
process started with pid 29962 and random value 4460665
process started with pid 29963 and random value 12688731
5 processes started
 
started processes:
29245 29959 29245 1994 pts/0 29245 S+ 1000 0:00 | \_ I am number 2649735
29245 29960 29245 1994 pts/0 29245 S+ 1000 0:00 | \_ I am number 14971836
29245 29961 29245 1994 pts/0 29245 S+ 1000 0:00 | \_ I am number 14339677
29245 29962 29245 1994 pts/0 29245 S+ 1000 0:00 | \_ I am number 4460665
29245 29963 29245 1994 pts/0 29245 S+ 1000 0:00 | \_ I am number 12688731
 
# SIGUSR2:
status of running processes at 2019/04/06 16:20:53
showing 2649735
random value 2649735 is used by process with pid 29959 and pgid 29245
showing 14971836
random value 14971836 is used by process with pid 29960 and pgid 29245
showing 14339677
random value 14339677 is used by process with pid 29961 and pgid 29245
showing 4460665
random value 4460665 is used by process with pid 29962 and pgid 29245
showing 12688731
random value 12688731 is used by process with pid 29963 and pgid 29245
5 processes found
 
# SIGTERM:
shutting down the container at 2019/04/06 16:21:42
 
shutting down the processes at 2019/04/06 16:21:42
stopping 2649735
stopping 14971836
stopping 14339677
stopping 4460665
stopping 12688731
5 processes stopped

In the command terminal:

/bin/kill --signal SIGUSR2 $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGUSR1 $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGUSR2 $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGURG $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGINT $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGUSR2 $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGPWR $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGUSR2 $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
/bin/kill --signal SIGTERM $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')

Of course, sending the untrappable SIGKILL signal will abort the process that executes dctm.sh. However, its children processes will survive and be reparented to the root process:

...
status of running processes at 2019/04/10 22:38:25
showing 19996889
random value 19996889 is used by process with pid 24520 and pgid 24398
showing 5022831
random value 5022831 is used by process with pid 24521 and pgid 24398
showing 1363197
random value 1363197 is used by process with pid 24522 and pgid 24398
showing 18185959
random value 18185959 is used by process with pid 24523 and pgid 24398
showing 10996678
random value 10996678 is used by process with pid 24524 and pgid 24398
5 processes found
# /bin/kill --signal SIGKILL $(ps -ef | grep dctm.sh | grep -v grep | gawk '{print $2}')
Killed
 
ps -ef | grep number | grep -v grep
docker 24520 1 0 22:38 pts/1 00:00:00 I am number 19996889
docker 24521 1 0 22:38 pts/1 00:00:00 I am number 5022831
docker 24522 1 0 22:38 pts/1 00:00:00 I am number 1363197
docker 24523 1 0 22:38 pts/1 00:00:00 I am number 18185959
docker 24524 1 0 22:38 pts/1 00:00:00 I am number 10996678
 
# manual killing those processes;
ps -ef | grep number | grep -v grep | gawk '{print $2}' | xargs kill -9

 
ps -ef | grep number | grep -v grep
<empty>
 
# this works too:
kill -9 $(pgrep -f "I am number [0-9]+$")
# or, shorter:
pkill -f "I am number [0-9]+$"

Note that there is a simpler way to kill those related processes: by using their PGID, or process group id:

ps -axjf | grep number | grep -v grep
1 25248 25221 24997 pts/1 24997 S 1000 0:00 I am number 3489651
1 25249 25221 24997 pts/1 24997 S 1000 0:00 I am number 6789321
1 25250 25221 24997 pts/1 24997 S 1000 0:00 I am number 15840638
1 25251 25221 24997 pts/1 24997 S 1000 0:00 I am number 19059205
1 25252 25221 24997 pts/1 24997 S 1000 0:00 I am number 12857603
# processes have been reparented to PPID == 1;
# highlighted columns 3 is the PGID;
# kill them using negative-PGID;
kill -9 -25221
ps -axjf | grep number | grep -v grep
<empty>

This is why the status() commands displays the PGID.
In order to tell kill that the given PID is actually a PGID, it has to be prefixed with a minus sign. Alternatively, the command:

pkill -g pgid

does that too.
All this looks quite promising so far !
Please, join me now to part II of this article for the dockerization of the test script.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Cesare Cervini
Cesare Cervini