As you already know if you are following our Documentum Story, we are building, working and managing, for some time now, a huge Documentum Platform with more than 115 servers so far (still growing). To be able to manage properly this platform, we need an efficient monitoring tool. In this blog, I will not talk about Documentum but rather I will talk a little bit about the monitoring solution we integrated with Nagios to be able to support all of our WebLogic Servers. For those of you who don’t know, Nagios is a very popular Open Source monitoring tool launched in 1999. By default Nagios doesn’t provide any interface to monitor WebLogic or Documentum and therefore we choose to build our own script package to be able to properly monitor our Platform.

 

At the beginning of the project when we were installing the first WebLogic Servers, we used the monitoring scripts coming from the old Platform (a Documentum 6.7 Platform not managed by us). The idea behind these monitoring scripts was the following one:

  • The Nagios Server needs to perform a check of a service
  • The Nagios Server contacts the Nagios Agent which executes the check
  • The Check is starting its own WLST script to retrieve only the value needed for this check (each check calls a different WLST script)
  • The Nagios Agent returns the value to the Nagios Server which is then happy with it

 

This pretty simple approach was working fine at the beginning when we only had a few WebLogic Servers with not so much to monitor on them… The problem is that the Platform was growing very fast and we quickly started to see a few timeouts on the different checks because Nagios was trying to execute a lot of check at the same time on the same host. For example on a specific environment, we had two WebLogic Domains running with 4 or 5 Managed Servers for each domain that were hosting a Documentum Application (DA, D2, D2-Config, …). We were monitoring the heapSize, the number of threads, the server state, the number of sessions, the different URLs with and without Load Balancer, aso… for each Managed Server and for the AdminServers too. Therefore we quickly reached a point where 5 or 10 WLST scripts were running at the same time for the monitoring and only the monitoring.

 

The problem with the WLST script is that it takes a lot of time to initialize itself and start (several seconds) and during that time, 1 or 2 CPUs are fully used only for that. Now correlate this figure with the fact that there are dozens of checks running every 5 minutes for each domain and that are all starting their own WLST script. In the end, you will get a WebLogic Server highly used with a huge CPU consumption only for the monitoring… So that might be sufficient for a small installation but that’s definitively not the right thing to do for a huge Platform.

 

Therefore we needed to do something else. To solve this particular problem, I developed a new set to scripts that I integrated with Nagios to replace the old ones. The idea behind these new scripts was that it should be able to provide us at least the same thing as the old ones but without starting so much WLST scripts and it should be easily extensible. I worked on this small development and this is what I came with:

  • The Nagios Server needs to perform a check of a service
  • The Nagios Server contacts the Nagios Agent which executes the check
  • The Check is reading a log file to find the value needed for this check
  • The Nagios Agent returns the value to the Nagios Server which is then happy with it

 

Pretty similar isn’t it? Indeed… And yet so different! The main idea behind this new version is that instead of starting a WLST script for each check which will fully use 1 or 2 CPUs and last for 2 to 10 seconds (depending on the type of check and on the load), this new version will only read a very short log file (1 log file per check) that contains one line: the result of the check. Reading a log file like that takes a few milliseconds and it doesn’t consume 2 CPUs for doing that… Now the remaining question is how can we handle the process that will populate the log files? Because yes checking a log file is fast but how can we ensure that this log file will contain the correct data?

 

To manage that, this is what I did:

  • Creation of a shell script that will:
    • Be executed by the Nagios Agent for each check
    • Check if the WebLogic Domain is running and exit if not
    • Check if the WLST script is running and start it if not
    • Ensure the log file has been updated in the last 5 minutes (meaning the monitoring is running and the value that will be read is correct)
    • Read the log file
    • Analyze the information coming from the log file and return that to the Nagios Agent
  • Creation of a WLST script that will:
    • Be started once, do its job, sleep for 2 minutes and then do it again
    • Retrieve the monitoring values and store that in log files
    • Store error messages in the log files if there is any issue

 

It will not describe any longer the shell script because that’s just basic shell commands but I will show you instead an example of a WLST script that can be used to monitor a few things (ThreadPool of all Servers, HeapFree of all Severs, Sessions of all Applications deployed on all Servers):

[nagios@weblogic_server_01 scripts]$ cat DOMAIN_check_weblogic.wls
from java.io import File
from java.io import FileOutputStream

directory='/app/nagios/etc/objects/scripts'
userConfig=directory + '/DOMAIN_configfile.secure'
userKey=directory + '/DOMAIN_keyfile.secure'
address='weblogic_server_01'
port='8443'

connect(userConfigFile=userConfig, userKeyFile=userKey, url='t3s://' + address + ':' + port)

def setOutputToFile(fileName):
  outputFile=File(fileName)
  fos=FileOutputStream(outputFile)
  theInterpreter.setOut(fos)

def setOutputToNull():
  outputFile=File('/dev/null')
  fos=FileOutputStream(outputFile)
  theInterpreter.setOut(fos)

while 1:
  domainRuntime()
  for server in domainRuntimeService.getServerRuntimes():
    setOutputToFile(directory + '/threadpool_' + domainName + '_' + server.getName() + '.out')
    cd('/ServerRuntimes/' + server.getName() + '/ThreadPoolRuntime/ThreadPoolRuntime')
    print 'threadpool_' + domainName + '_' + server.getName() + '_OUT',get('ExecuteThreadTotalCount'),get('HoggingThreadCount'),get('PendingUserRequestCount'),get('CompletedRequestCount'),get('Throughput'),get('HealthState')
    setOutputToNull()
    setOutputToFile(directory + '/heapfree_' + domainName + '_' + server.getName() + '.out')
    cd('/ServerRuntimes/' + server.getName() + '/JVMRuntime/' + server.getName())
    print 'heapfree_' + domainName + '_' + server.getName() + '_OUT',get('HeapFreeCurrent'),get('HeapSizeCurrent'),get('HeapFreePercent')
    setOutputToNull()

  try:
    setOutputToFile(directory + '/sessions_' + domainName + '_console.out')
    cd('/ServerRuntimes/AdminServer/ApplicationRuntimes/consoleapp/ComponentRuntimes/AdminServer_/console')
    print 'sessions_' + domainName + '_console_OUT',get('OpenSessionsCurrentCount'),get('SessionsOpenedTotalCount')
    setOutputToNull()
  except WLSTException,e:
    setOutputToFile(directory + '/sessions_' + domainName + '_console.out')
    print 'CRITICAL - The Server AdminServer or the Administrator Console is not started'
    setOutputToNull()

  domainConfig()
  for app in cmo.getAppDeployments():
    domainConfig()
    cd('/AppDeployments/' + app.getName())
    for appServer in cmo.getTargets():
      domainRuntime()
      try:
        setOutputToFile(directory + '/sessions_' + domainName + '_' + app.getName() + '.out')
        cd('/ServerRuntimes/' + appServer.getName() + '/ApplicationRuntimes/' + app.getName() + '/ComponentRuntimes/' + appServer.getName() + '_/' + app.getName())
        print 'sessions_' + domainName + '_' + app.getName() + '_OUT',get('OpenSessionsCurrentCount'),get('SessionsOpenedTotalCount')
        setOutputToNull()
      except WLSTException,e:
        setOutputToFile(directory + '/sessions_' + domainName + '_' + app.getName() + '.out')
        print 'CRITICAL - The Managed Server ' + appServer.getName() + ' or the Application ' + app.getName() + ' is not started'
        setOutputToNull()

  java.lang.Thread.sleep(120000)

[nagios@weblogic_server_01 scripts]$

 

A few notes related to the above WLST script:

  • userConfig and userKey are two files created previously in WLST that contain the username/password of the current user (at the time of creation of these files) in an encrypted way. This allows you to login to WLST without having to type your username and password and more importantly, without having to put a clear text password in this file…
  • To ensure the security of this environment we are always using t3s to perform the monitoring checks and this requires you to configure the AdminServer to HTTPS.
  • In the script, I’m using the “setOutputToFile” and “setOutputToNull” functions. The first one is to redirect the output to the file mentioned in parameter while the second one is to remove all output. That’s basically to ensure that the log files generated ONLY contain the needed lines and nothing else.
  • There is an infinite loop (while 1) that executes all checks, create/update all log files and then sleep for 120 000 ms (so that’s 2 minutes) before repeating it.

 

As said above, this is easily extendable and therefore you can just add a new paragraph with the new values to retrieve. So have fun with that! 🙂

 

Comparison between the two methods. I will use below real figures coming from one of our WebLogic Server:

  • Old:
    • 40 monitoring checks running every 5 minutes => 40 WLST scripts started
    • each one for a duration of 6 seconds (average)
    • each one using 200% CPU during that time (2 CPUs)
  • New:
    • Shell script:
      • 40 monitoring checks running every 5 minutes => 40 log files read
      • each one for a duration of 0,1s (average)
      • each one using 100% CPU during that time (1 CPU)
    • WLST script:
      • One loop every 2 minutes (so 2.5 loops in 5 minutes)
      • each one for a duration of 0.5s (average)
      • each one using 100% CPU during that time (1 CPU)

 

Period CPU Time (Old) CPU Time (New)
5 minutes 40*6*2 <~> 480 s 40*0.1*1 + 2.5*0.5*1 <~> 5.25 s
1 day 480*(1440/5) <~> 138 240 s
<~> 2 304 min
<~> 38.4 h
4.25*(1440/5) <~> 1 512 s
<~> 25.2 min
<~> 0.42 h

Based on these figures, we can see that our new monitoring solution is almost 100 times more efficient than the old one so that’s a success: instead of spending 38.4 hours using the CPU on a 24 hours period (so that’s 1.6 CPU the whole day), we are now using 1 CPU for only 25 minutes! Here I’m just talking about the CPU Time but of course you can do the same thing for the memory, processes, aso…

 

Note: Starting with WebLogic 12c, Oracle introduced the RESTful services which can now be used to monitor WebLogic too… It has been improved in 12.2 and that can become a pretty good alternative to WLST scripting but for now we are still using this WLST approach with one single execution every 2 minutes and then Nagios reading the log files when needs be.