Infrastructure at your Service

David Hueber

ODA and KVM: Debugging of DBsystem creation failure

Debugging errors when working with ODA is not always the easiest thing do… 😛

It may become a bit tricky and not a straightforward process. In this blog I wanted to show you and example we faced with the debugging of a Dbsystem creation failure and how we found out the real reason it failed.

Before starting let’s do a short reminder about KVM virtualisation on ODA.

Since 19.9, ODA supports hard partitioning for database virtualisation on ODA. This works on a principle based on 2 types of VMs:

  1. Compute instance (more info here)
  2. DB Systems

While the first one is intended for traditional VM hosted any workload except oracle DBs, the second one is dedicated to database virtualisation.
A DB System is then an Oracle Linux with a similar stack than the ODA  BM (GI, DB, …).

Provisioning a new DBSystem is pretty easy and straightforward using the command odaacli create-dbsystem and a JSON file as input…as long as it works…and you don’t do any mistake.

In our case, here the error we got when trying to create a new DB System:

Job details
----------------------------------------------------------------
                     ID:  75115716-4ce3-4eb1-af1a-4d3d8bef441a
            Description:  DB System srvdb01 creation
                 Status:  Failure
                Created:  November 5, 2021 11:37:48 AM CET
                Message:  DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.

Task Name                                Start Time                          End Time                            Status
---------------------------------------- ----------------------------------- ----------------------------------- ----------
Create DB System metadata                November 5, 2021 11:37:48 AM CET    November 5, 2021 11:37:48 AM CET    Success
Persist new DB System                    November 5, 2021 11:37:48 AM CET    November 5, 2021 11:37:48 AM CET    Success
Validate DB System prerequisites         November 5, 2021 11:37:48 AM CET    November 5, 2021 11:37:52 AM CET    Success
Setup DB System environment              November 5, 2021 11:37:52 AM CET    November 5, 2021 11:37:53 AM CET    Success
Create DB System ASM volume              November 5, 2021 11:37:53 AM CET    November 5, 2021 11:38:00 AM CET    Success
Create DB System ACFS filesystem         November 5, 2021 11:38:00 AM CET    November 5, 2021 11:38:09 AM CET    Success
Create DB System VM ACFS snapshots       November 5, 2021 11:38:09 AM CET    November 5, 2021 11:38:39 AM CET    Success
Create temporary SSH key pair            November 5, 2021 11:38:39 AM CET    November 5, 2021 11:38:39 AM CET    Success
Create DB System cloud-init config       November 5, 2021 11:38:39 AM CET    November 5, 2021 11:38:40 AM CET    Success
Provision DB System VM(s)                November 5, 2021 11:38:40 AM CET    November 5, 2021 11:38:41 AM CET    Success
Attach disks to DB System                November 5, 2021 11:38:41 AM CET    November 5, 2021 11:38:41 AM CET    Success
Add DB System to Clusterware             November 5, 2021 11:38:41 AM CET    November 5, 2021 11:38:41 AM CET    Success
Start DB System                          November 5, 2021 11:38:41 AM CET    November 5, 2021 11:38:44 AM CET    Success
Wait DB System VM first boot             November 5, 2021 11:38:44 AM CET    November 5, 2021 11:39:56 AM CET    Success
Setup Mutual TLS (mTLS)                  November 5, 2021 11:39:56 AM CET    November 5, 2021 11:40:15 AM CET    Success
Export clones repository                 November 5, 2021 11:40:15 AM CET    November 5, 2021 11:40:15 AM CET    Success
Setup ASM client cluster config          November 5, 2021 11:40:16 AM CET    November 5, 2021 11:40:18 AM CET    Success
Install DB System                        November 5, 2021 11:40:18 AM CET    November 5, 2021 11:40:26 AM CET    InternalError

So…it failed on installing DB into the newly creaated VM. Error code is: DCS-10001:Internal error

The first we tried is to get more info on this error code using dcserr:

[[email protected] log]# dcserr 10001
10001, Internal_Error, "Internal error encountered: {0}."
// *Cause: An internal error occurred.
// *Action: Contact Oracle Support Services for assistance.
/

Not helping very much… Unfortunately the describe-job doesn’t give much more information about any kind of log file…

The only remaining solution is then to analyse the DCS log file. All operation we run using odacli are going through the dcsagent which generates a log in:

/opt/oracle/dcs/log

There you will find several types of log file such as the dcs-admin one or the dcs-components and obviously the dcs-agent log file

[[email protected] log]# pwd
/opt/oracle/dcs/log
[[email protected] log]# ls -l dcs-agent*
-rw-r--r-- 1 root root 144752279 Nov 3 23:30 dcs-agent-2021-11-03.log
-rw-r--r-- 1 root root 231235959 Nov 4 23:30 dcs-agent-2021-11-04.log
-rw-r--r-- 1 root root 151900 Nov 3 11:59 dcs-agent-requests-2021-11-03-03.log
-rw-r--r-- 1 root root 60331 Nov 3 12:59 dcs-agent-requests-2021-11-03-11.log
-rw-r--r-- 1 root root 122337 Nov 3 13:58 dcs-agent-requests-2021-11-03-13.log
-rw-r--r-- 1 root root 74029 Nov 3 14:59 dcs-agent-requests-2021-11-03-14.log
-rw-r--r-- 1 root root 112641 Nov 3 15:59 dcs-agent-requests-2021-11-03-15.log
-rw-r--r-- 1 root root 154503 Nov 3 16:59 dcs-agent-requests-2021-11-03-16.log
-rw-r--r-- 1 root root 10575 Nov 3 17:03 dcs-agent-requests-2021-11-03-17.log
-rw-r--r-- 1 root root 184 Nov 4 07:53 dcs-agent-requests-2021-11-04-07.log
-rw-r--r-- 1 root root 24097 Nov 4 08:42 dcs-agent-requests-2021-11-04-08.log
-rw-r--r-- 1 root root 6556 Nov 4 09:59 dcs-agent-requests-2021-11-04-09.log
-rw-r--r-- 1 root root 7711 Nov 4 10:56 dcs-agent-requests-2021-11-04-10.log
-rw-r--r-- 1 root root 17646 Nov 4 11:52 dcs-agent-requests-2021-11-04-11.log
-rw-r--r-- 1 root root 1837 Nov 4 12:58 dcs-agent-requests-2021-11-04-12.log
-rw-r--r-- 1 root root 122202 Nov 4 13:59 dcs-agent-requests-2021-11-04-13.log
-rw-r--r-- 1 root root 71837 Nov 4 14:59 dcs-agent-requests-2021-11-04-14.log
-rw-r--r-- 1 root root 215518 Nov 4 15:59 dcs-agent-requests-2021-11-04-15.log
-rw-r--r-- 1 root root 4497 Nov 4 16:24 dcs-agent-requests-2021-11-04-16.log
-rw-r--r-- 1 root root 660 Nov 5 07:56 dcs-agent-requests-2021-11-05-07.log
-rw-r--r-- 1 root root 513 Nov 5 08:00 dcs-agent-requests-2021-11-05-08.log
-rw-r--r-- 1 root root 45592 Nov 5 10:59 dcs-agent-requests-2021-11-05-10.log
-rw-r--r-- 1 root root 126945 Nov 5 11:59 dcs-agent-requests-2021-11-05-11.log
-rw-r--r-- 1 root root 17460 Nov 5 12:21 dcs-agent-requests.log
-rw-r--r-- 1 root root 75603907 Nov 5 12:21 dcs-agent.log

However the challenge is that this log file is pretty verbose and therefore pretty long.
Just to give you and idea, on our test ODA (where there were nothing much running) we had already almost 1 million rows in an half day.

So the option we used was to run a grep command to gather only the lines concerning the DB System we tried to create:

grep srvdb01 dcs-agent.log

…which still represents 850+ lines 😉

Going bottom up, we found first all entries about the DELET DB SYSTEM we run after the failure, such as:

...
2021-11-05 11:47:50,962 INFO [dw-19811 - DELETE /dbsystem/srvdb01] [] c.o.d.a.k.o.l.SingleNodeLockController: Thread 'dw-19811 - DELETE /dbsystem/srvdb01' released READ lock for Resource type 'Metadata' with name 'metadata'
2021-11-05 11:47:50,963 INFO [dw-19811 - DELETE /dbsystem/srvdb01] [] c.o.d.a.k.m.KvmBaseModule: Starting new job 586fce36-8131-4f46-b447-36fab882f060 for taskFlow: seq(id: 586fce36-8131-4f46-b447-36fab882f060, name: 586fce36-8131-4f46-b447-36fab882f060, jobId: 586fce36-8131-4f46-b447-36fab882f060, status: Created,exposeTaskResultToJob: false, result: null, output: , on_failure: FailOnAny):
2021-11-05 11:47:50,963 INFO [dw-19811 - DELETE /dbsystem/srvdb01] [] c.o.d.a.k.m.KvmBaseModule: Job report: ServiceJobReport(jobId=586fce36-8131-4f46-b447-36fab882f060, status=Created, message=null, reports=[], createTimestamp=2021-11-05 11:47:50.957, resourceList=[], description=DB System srvdb01 deletion, updatedTime=2021-11-05 11:47:50.957)
  "description" : "DB System srvdb01 deletion",
  "description" : "DB System srvdb01 deletion",
2021-11-05 11:47:50,973 INFO [DeleteDbSystem_KvmLockContainer_38554 : JobId=586fce36-8131-4f46-b447-36fab882f060] [] c.o.d.a.k.o.l.SingleNodeLockController: Thread 'DeleteDbSystem_KvmLockContainer_38554 : JobId=586fce36-8131-4f46-b447-36fab882f060' trying to acquire WRITE lock for Resource type 'DB System' with name 'srvdb01'
2021-11-05 11:47:50,973 INFO [DeleteDbSystem_KvmLockContainer_38554 : JobId=586fce36-8131-4f46-b447-36fab882f060] [] c.o.d.a.k.o.l.SingleNodeLockController: Thread 'DeleteDbSystem_KvmLockContainer_38554 : JobId=586fce36-8131-4f46-b447-36fab882f060' acquired WRITE lock for Resource type 'DB System' with name 'srvdb01'
	 Mountpath: /u05/app/sharedrepo/srvdb01
...

So we could simply skip all lines containing DELET or Operation Type = Delete.

Then arrive plenty of lines which contains the error message you receive in the odacli describe-job as well as the content of the JSON file used to run the job.

...
2021-11-05 11:46:48,763 DEBUG [Process new DB System] [] c.o.d.a.k.t.KvmBaseTaskBuilder$KvmTaskExecutor: Output request: DbSystemCreateRequest(systemInfo=DbSystemCreateRequest.SystemInfo(dbSystemName=srvdb01, shapeName=odb2, cpuPoolName=cpupool4srv, diskGroup=DATA, systemPassword=*****, provisionType=rhp, timeZone=Europe/Zurich, enableRoleSeparation=true, customRoleSeparationInfo=DbSystemCreateRequest.CustomRoleSeparationInfo(groups=[DbSystemCreateRequest.GroupInfo(id=1001, role=oinstall, name=oinstall), DbSystemCreateRequest.GroupInfo(id=1002, role=dbaoper, name=dbaoper), DbSystemCreateRequest.GroupInfo(id=1003, role=dba, name=dba), DbSystemCreateRequest.GroupInfo(id=1004, role=asmadmin, name=asmadmin), DbSystemCreateRequest.GroupInfo(id=1005, role=asmoper, name=asmoper), DbSystemCreateRequest.GroupInfo(id=1006, role=asmdba, name=asmdba)], users=[DbSystemCreateRequest.UserInfo(id=1000, role=gridUser, name=grid), DbSystemCreateRequest.UserInfo(id=1001, role=oracleUser, name=oracle)])), networkInfo=DbSystemCreateRequest.NetworkInfo(domainName=dbi-lab.ch, ntpServers=[216.239.35.0], dnsServers=[8.8.8.8, 8.8.4.4], scanName=null, scanIps=null, nodes=[DbSystemCreateRequest.NetworkNodeInfo(number=0, name=srvdb01, ipAddress=10.36.0.245, netmask=255.255.255.0, gateway=10.36.0.1, vipName=null, vipAddress=null)], publicVNetwork=pubnet), gridInfo=DbSystemCreateRequest.GridInfo(language=en, enableAfd=false), dbInfo=DbSystemCreateRequest.DbInfo(name=srvTEST, uniqueName=srvTEST, domainName=dbi-lab.ch, adminPassword=**********, version=19.12.0.0.210720, edition=EE, type=SI, dbClass=OLTP, shape=odb2, role=PRIMARY, redundancy=MIRROR, characterSet=DbSystemCreateRequest.DbCharacterSetInfo(characterSet=AL32UTF8, nlsCharacterSet=AL16UTF16, dbTerritory=AMERICA, dbLanguage=ENGLISH), enableDbConsole=false, enableFlashStorage=false, enableFlashCache=false, enableSEHA=false, rmanBackupPassword=*****, level0BackupDay=null, tdePassword=*****, enableTde=false, enableUnifiedAuditing=true, isCdb=false, pdbName=null, pdbAdminUser=null, targetNodeNumber=null), devInfo=null)
2021-11-05 11:46:48,763 DEBUG [CreateDbSystem_KvmLockContainer_38327 : JobId=33793dd8-6704-407a-8dd0-f2b83a9deb10] [] c.o.d.c.t.TaskDetail: set task result as DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.
2021-11-05 11:46:48,763 INFO [CreateDbSystem_KvmLockContainer_38327 : JobId=33793dd8-6704-407a-8dd0-f2b83a9deb10] [] c.o.d.a.k.t.KvmBaseTaskBuilder$KvmLockContainer:  Task[id: CreateDbSystem_KvmLockContainer_38327, TaskName: CreateDbSystem_KvmLockContainer_38327] result: DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.
2021-11-05 11:46:48,763 DEBUG [33793dd8-6704-407a-8dd0-f2b83a9deb10 : JobId=33793dd8-6704-407a-8dd0-f2b83a9deb10] [] c.o.d.c.t.TaskDetail: set task result as DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.
2021-11-05 11:46:48,763 DEBUG [33793dd8-6704-407a-8dd0-f2b83a9deb10 : JobId=33793dd8-6704-407a-8dd0-f2b83a9deb10] [] c.o.d.a.k.m.i.KvmJobHelper$KvmTaskReportRecorder: Recording job report: id: 33793dd8-6704-407a-8dd0-f2b83a9deb10, name: 33793dd8-6704-407a-8dd0-f2b83a9deb10, jobId: 33793dd8-6704-407a-8dd0-f2b83a9deb10, status: Failure,exposeTaskResultToJob: false, result: DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''., output:
  "message" : "DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.",
  "description" : "DB System srvdb01 creation",
  "message" : "DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.",
  "description" : "DB System srvdb01 creation",
...

Still not much useful…so we skipped these too and continue our journey upward. Finally looking for the first (going up) line without any error, we could found in the next one the following message:

2021-11-05 11:46:47,948 INFO [dw-18140 - GET /instances/storage/dgSpace/ALL] [] c.o.i.a.IDMAgentAuthorizer: IDMAgentAuthorizer::user:ODA-srvdb01:role:list-dgstorages
! Causing: com.oracle.dcs.commons.exception.DcsException: DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.
! Causing: com.oracle.dcs.commons.exception.DcsException: DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.
! Causing: com.oracle.dcs.commons.exception.DcsException: DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.
2021-11-05 11:46:48,745 DEBUG [Install DB System : JobId=33793dd8-6704-407a-8dd0-f2b83a9deb10] [] c.o.d.a.k.m.i.KvmJobHelper$KvmTaskReportRecorder: Recording task report: id: CreateDbSystem_KvmTask_38345,name: Install DB System, jobId: 33793dd8-6704-407a-8dd0-f2b83a9deb10, status: InternalError,exposeTaskResultToJob: false, result: DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.,output: DcsException{errorHttpCode=InternalError, msg=Internal error encountered: Error creating job 'Provision DB System 'srvdb01''., msgId=10001,causedBy=com.oracle.pic.commons.client.exceptions.RestClientException: DCS-11002:Password for database admin user does not comply with the password policy.}
  "taskResult" : "DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.",
  "taskResult" : "DCS-10001:Internal error encountered: Error creating job 'Provision DB System 'srvdb01''.",

Look at the 4th line 😉 …yes at the end…scroll a bit more…here we go:

client.exceptions.RestClientException: DCS-11002:Password for database admin user does not comply with the password policy.}

 

So finally the root cause of the failure was “simply” that the password given for the sys/system accounts was not compliant… 😕 😕

However the remaining question is: Why don’t we get this error message back in the odacli describe-job instead of a useless generic error message??

It would have been so easier:

[[email protected] log]# dcserr 11002
11002, Password_too_simple, "Password for {0} does not comply with the password policy."
// *Cause: The user provided password does not satisfy the password policy rules.
// *Action: Refer to the Deployment and User's Guide for the password policy.
//          Provide a password which meets the criteria.
/

I hope that this can help.

Enjoy! 😎

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

David Hueber
David Hueber

Chief Executive Officer (CEO), Principal Consultant