Gérard Wisson

What to do when all active Documentum jobs have stopped running?

The application support team informed me that their jobs were no longer running. When I started the analysis, I found that none of the active jobs had started for several weeks.

First of all, I decided to work on a specific job that does not belong to the application team and that I know I can start several times without impacting the business.
Do you know which one? dm_ContentWarning

I checked the job attributes such as start_date, expiration_date, is_inactive, target_server (as we have several Content Servers for high availability), a_last_invocation, a_next_invocation and of course a_current_status.
Once this first check was done, I started the job from Documentum Administrator (DA): I selected "Run now" and saved the job.

  object_name                : dm_ContentWarning
  start_date                 : 5/30/2017 20:00:00
  expiration_date            : 5/30/2025 20:00:00
  max_iterations             : 0
  run_interval               : 1
  run_mode                   : 3
  is_inactive                : F
  inactivate_after_failure   : F
  target_server              : [email protected]
  a_last_invocation          : 9/20/2018 19:05:29
  a_last_completion          : 9/20/2018 19:07:00
  a_current_status           : ContentWarning Tool Completed at
                        9/20/2018 19:06:50.  Total duration was
                        1 minutes.
  a_next_invocation          : 9/21/2018 19:05:00

A few minutes later, I checked the result again: not all the attributes this time, only a_last_completion, a_current_status and a_next_invocation, and of course the content of the job log file. The job ran as expected when I forced it to run.

  a_last_completion          : 10/31/2018 10:41:25
  a_current_status           : ContentWarning Tool Completed at
                        10/31/2018 10:41:14.  Total duration
                        was 2 minutes.
  a_next_invocation          : 10/31/2018 19:05:00
[[email protected] agentexec]$ more job_0801234380000359
Wed Oct 31 10:39:54 2018 [INFORMATION] [LAUNCHER 12071] Detected while preparing job dm_ContentWarning for execution: Agent Exec
connected to server Docbase1:  [DM_SESSION_I_SESSION_START]info:  "Session 01012343807badd5 started for user dmadmin."
...
...

OK, the job ran and a_next_invocation was set according to run_interval and run_mode, in our case once a day. I thought I had found the reason for the issue: the repository had been stopped for a few days and therefore, when it was restarted, the a_next_invocation date was in the past (a_next_invocation: 9/21/2018 19:05:00). So I decided to check the result the following day, once the job had run on its defined schedule (a_next_invocation: 10/31/2018 19:05:00).
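
To make the stale date check concrete, here is a minimal Python sketch. It only reproduces the dates observed in this post by stepping the old a_next_invocation forward one interval at a time; it is an assumption of mine, not the Content Server's actual scheduling algorithm. The function names is_stale and next_slot are hypothetical.

```python
from datetime import datetime, timedelta

# Date format used in the attribute dumps above (e.g. "9/21/2018 19:05:00").
DUMP_FORMAT = "%m/%d/%Y %H:%M:%S"


def is_stale(a_next_invocation: str, now: datetime) -> bool:
    """Return True when the scheduled date is already in the past."""
    return datetime.strptime(a_next_invocation, DUMP_FORMAT) < now


def next_slot(a_next_invocation: str, interval: timedelta, now: datetime) -> datetime:
    """Step the schedule forward by whole intervals until it lies after `now`.

    Assumption: this only mimics the dates seen in this post; it is not
    the server's real implementation.
    """
    slot = datetime.strptime(a_next_invocation, DUMP_FORMAT)
    while slot <= now:
        slot += interval
    return slot


if __name__ == "__main__":
    # a_last_completion of the forced run
    now = datetime(2018, 10, 31, 10, 41, 25)
    print(is_stale("9/21/2018 19:05:00", now))                       # True
    print(next_slot("9/21/2018 19:05:00", timedelta(days=1), now))   # 2018-10-31 19:05:00
```

With a one-day interval (run_mode 3, run_interval 1 in the dump above), the sketch lands on the same 10/31/2018 19:05:00 value that the repository showed after the forced run.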

The next day… the job had not run. Strange!
I decided to dig a bit deeper ;-) and went a step further: I set the a_next_invocation date so that the job would run 5 minutes later.

update dm_job objects set a_next_invocation = date('01.11.2018 11:53:00','dd.mm.yyyy hh:mi:ss') where object_name = 'dm_ContentWarning';
1

select r_object_id, object_name, a_next_invocation from dm_job where object_name = 'dm_ContentWarning';
0801234380000359	dm_ContentWarning	11/01/2018 11:53:00

Result: the job did not start. 🙁 Hmmm, why?

Before continuing to work on the job, I did some other checks, such as analyzing the log files (repository, agent_exec, sysadmin, etc.).
I found that the database had been down a few days earlier, so I restarted the repository and set a_next_invocation again, but unfortunately this did not help.

To be sure the issue was not related to the whole installation, I successfully ran a distributed job (dm_contentWarningvmcs2_Docbase1) on the second Content Server. This meant the issue was located only on my first Content Server.

I searched the OpenText knowledge base (KB9264366, KB8716186 and KB6327280), but none of these articles gave me the solution.

I knew, even though I have not used it often in my last 20 years in the Documentum world, that we can trace the agent_exec, so let's look at this:

  1. add the parameter -trace_level 1 to the dm_agent_method
  2. reinit the server
  3. kill the dm_agent_exec process related to Docbase1; the process will be restarted automatically after a few minutes.
[[email protected] agentexec]$ ps -ef | grep agent | grep Docbase1
dmadmin  27312 26944  0 Oct31 ?        00:00:49 ./dm_agent_exec -enable_ha_setup 1 -docbase_name Docbase1.Docbase1 -docbase_owner dmadmin -sleep_duration 0
[[email protected] agentexec]$ kill -9 27312
[[email protected] agentexec]$ ps -ef | grep agent | grep Docbase1
[[email protected] agentexec]$
[[email protected] agentexec]$ ps -ef | grep agent | grep Docbase1
dmadmin  15440 26944 57 07:48 ?        00:00:06 ./dm_agent_exec -enable_ha_setup 1 -trace_level 1 -docbase_name Docbase1.Docbase1 -docbase_owner dmadmin -sleep_duration 0
[[email protected] agentexec]$

I changed a_next_invocation again and checked the agent_exec log file, where the executed queries had been recorded.
Two of the recorded queries seemed important:

SELECT count(r_object_id) as cnt FROM dm_job WHERE ( (run_now = 1) OR ((is_inactive = 0) AND ( ( a_next_invocation <= DATE('now') AND a_next_invocation IS NOT NULLDATE ) OR ( a_next_continuation <= DATE('now') AND a_next_continuation IS NOT NULLDATE ) ) AND ( (expiration_date > DATE('now')) OR (expiration_date IS NULLDATE)) AND ((max_iterations = 0) OR (a_iterations < max_iterations))) ) AND (i_is_reference = 0 OR i_is_reference is NULL) AND (i_is_replica = 0 OR i_is_replica is NULL) AND UPPER(target_server) = '[email protected]'

SELECT ALL r_object_id, a_next_invocation FROM dm_job WHERE ( (run_now = 1) OR ((is_inactive = 0) AND ( ( a_next_invocation <= DATE('now') AND a_next_invocation IS NOT NULLDATE ) OR ( a_next_continuation <= DATE('now') AND a_next_continuation IS NOT NULLDATE ) ) AND ( (expiration_date > DATE('now')) OR (expiration_date IS NULLDATE)) AND ((max_iterations = 0) OR (a_iterations < max_iterations))) ) AND (i_is_reference = 0 OR i_is_reference is NULL) AND (i_is_replica = 0 OR i_is_replica is NULL) AND UPPER(target_server) = '[email protected]' ORDER BY run_now DESC, a_next_invocation, r_object_id ENABLE (RETURN_TOP 3 )

I executed the second query and it returned three jobs (RETURN_TOP 3), all belonging to the application team. Because these three jobs have an old a_next_invocation value and never run, they are selected again every time the agent_exec looks for jobs to execute. Unfortunately, this means my dm_ContentWarning job never gets selected for automatic execution.
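
The starvation effect can be illustrated with a small Python model of the second query. This is a simplified sketch: it keeps only the run_now, is_inactive and a_next_invocation conditions plus the ORDER BY and RETURN_TOP 3 limit, and ignores expiration_date, max_iterations, a_next_continuation and target_server. The three app_job_* entries, their object IDs and their dates are hypothetical stand-ins for the application team's jobs.

```python
from datetime import datetime

RETURN_TOP = 3  # same limit as ENABLE (RETURN_TOP 3) in the logged query


def select_candidates(jobs, now):
    """Mimic the logged query's filter and ORDER BY on a list of job dicts."""
    due = [
        j for j in jobs
        if j["run_now"] or (not j["is_inactive"] and j["a_next_invocation"] <= now)
    ]
    # ORDER BY run_now DESC, a_next_invocation, r_object_id
    due.sort(key=lambda j: (not j["run_now"], j["a_next_invocation"], j["r_object_id"]))
    return [j["object_name"] for j in due[:RETURN_TOP]]


def make_job(r_object_id, object_name, a_next_invocation, is_inactive=False, run_now=False):
    """Build a job record holding only the attributes this model needs."""
    return {
        "r_object_id": r_object_id,
        "object_name": object_name,
        "a_next_invocation": a_next_invocation,
        "is_inactive": is_inactive,
        "run_now": run_now,
    }


JOBS = [
    make_job("0801234380000359", "dm_ContentWarning", datetime(2018, 11, 1, 11, 53)),
    # Hypothetical application jobs stuck with old a_next_invocation values.
    make_job("hypothetical_id_1", "app_job_1", datetime(2018, 9, 18, 1, 0)),
    make_job("hypothetical_id_2", "app_job_2", datetime(2018, 9, 19, 1, 0)),
    make_job("hypothetical_id_3", "app_job_3", datetime(2018, 9, 20, 1, 0)),
]
NOW = datetime(2018, 11, 1, 12, 0)

if __name__ == "__main__":
    # The three oldest jobs fill every slot, so dm_ContentWarning is cut off.
    print(select_candidates(JOBS, NOW))

    # Keep only dm_ContentWarning active: now it is the one selected.
    only_cw = [dict(j, is_inactive=(j["object_name"] != "dm_ContentWarning")) for j in JOBS]
    print(select_candidates(only_cw, NOW))
```

In this model a job with run_now set sorts ahead of everything else, which would be consistent with the forced run from DA succeeding while the scheduled runs never happened.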

I informed the application team that I would keep only one job active (dm_ContentWarning) to see whether it would run. And guess what, it ran … YES!

Okay, now we have the solution:

  • reactivate all previously deactivated jobs
  • set the a_next_invocation to a future date

And do not forget to deactivate the trace for the dm_agent_exec.
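
When many jobs need a new a_next_invocation, the update statements can be generated rather than typed by hand. The following Python sketch builds one DQL statement per job in the same form as the update I ran earlier for dm_ContentWarning. The helper name reschedule_dql is hypothetical, and the sketch only produces text: the statements still have to be reviewed and executed with your usual DQL client.

```python
from datetime import datetime


def reschedule_dql(job_name: str, when: datetime) -> str:
    """Return a DQL update that sets a_next_invocation for one job."""
    # Double any single quote so the job name stays a valid string literal.
    quoted = job_name.replace("'", "''")
    # Matches the 'dd.mm.yyyy hh:mi:ss' pattern used in the original update.
    stamp = when.strftime("%d.%m.%Y %H:%M:%S")
    return (
        "update dm_job objects set a_next_invocation = "
        f"date('{stamp}','dd.mm.yyyy hh:mi:ss') "
        f"where object_name = '{quoted}';"
    )


if __name__ == "__main__":
    print(reschedule_dql("dm_ContentWarning", datetime(2018, 11, 1, 11, 53, 0)))
```

For dm_ContentWarning and 01.11.2018 11:53:00 this prints exactly the update statement shown earlier in this post.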


Gérard Wisson

Head of Delivery and Principal Consultant