By Franck Pachot

We experienced our first planned outage this weekend, so let’s see how it was announced and what actually happened.

EMEA Commercial 2 – Amsterdam

Outage notification before:

  • subject: Announcement: Upcoming Mandatory Maintenance for Oracle Cloud
  • e-mail date: 16 October 2015 1:49
  • message: Start Time / End Time:
    Saturday, October 17, 2015 9:00:00 PM CEST – Sunday, October 18, 2015 12:00:00 PM CEST
    Instances will be brought down before the start and restarted to match original state post completion.

So the notification was sent a bit less than 2 days before the start of the maintenance.

Actually, I had a session open at that time:


$ date
Sun Oct 18 21:03:27 CEST 2015
$ last -x -t 20151018170000 | head
oracle   pts/0        178.197.234.212  Sun Oct 18 13:17 - 13:35  (00:17)
oracle   pts/0        178.197.234.212  Sun Oct 18 13:12 - 13:14  (00:02)
oracle   pts/0        178.197.234.212  Sun Oct 18 13:09 - 13:10  (00:01)
runlevel (to lvl 3)   2.6.39-400.109.1 Sun Oct 18 10:05 - 21:03  (10:57)
reboot   system boot  2.6.39-400.109.1 Sun Oct 18 10:05 - 21:03  (10:57)
shutdown system down  2.6.39-400.109.1 Sat Oct 17 21:19 - 10:05  (12:46)
runlevel (to lvl 0)   2.6.39-400.109.1 Sat Oct 17 21:19 - 21:19  (00:00)
oracle   pts/1        xdsl-188-154-161 Sat Oct 17 18:02 - down   (03:16)
oracle   pts/1        xdsl-188-154-161 Sat Oct 17 18:02 - 18:02  (00:00)
oracle   pts/1        56.227.197.178.d Wed Oct 14 06:13 - 07:35  (01:21)

Remark: a reboot was not what I expected from the message ‘restarted to match original state’. I expected something like a ‘save state’ that includes the RAM.

Outage notification after:

  • subject: Announcement: Maintenance Was Completed for Oracle Cloud Outage Details
  • e-mail date: 18 October 2015 18:16
  • message: Start Time / End Time:
    Saturday, October 17, 2015 9:00:00 PM CEST – Sunday, October 18, 2015 10:30:00 AM CEST

Remark: the system was up at 10:05, but the notification that it was up came 8 hours later. So if I have something to restart manually, when am I expected to do it? At 10:05, when I see the system reboot? At 12:00, which was the planned end? Or at 18:16, when I receive the notification? When I’m responsible for an outage, I count its duration from the start of the maintenance up to the availability notification.
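
To put numbers on those three candidate end times, here is a minimal sketch that measures each one from the planned start. It assumes GNU date, and the helper name minutes_between is hypothetical. All timestamps are CEST with no DST change in between, so they are parsed as UTC only to keep the arithmetic zone-free.

```shell
#!/usr/bin/env bash
# minutes_between START END: whole minutes from START to END (GNU date assumed).
# Timestamps are parsed as UTC on purpose: only the difference matters here.
minutes_between() {
  echo $(( ( $(date -u -d "$2" +%s) - $(date -u -d "$1" +%s) ) / 60 ))
}

start="2015-10-17 21:00"   # planned start of the maintenance
# Candidate end times: observed reboot, planned end, notification e-mail
for end in "2015-10-18 10:05" "2015-10-18 12:00" "2015-10-18 18:16"; do
  m=$(minutes_between "$start" "$end")
  printf '%s  %dh%02dm\n' "$end" $(( m / 60 )) $(( m % 60 ))
done
# 2015-10-18 10:05  13h05m
# 2015-10-18 12:00  15h00m
# 2015-10-18 18:16  21h16m
```

Counted up to the notification, the outage is more than 21 hours, compared with 13 hours of actual downtime.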

Here is the summary from the ‘CLOUD My Service’:

CaptureOutageEU

US Commercial 2 – North America

The US cloud had its outage later. It was planned for Sunday, October 18, from 6:00:00 AM to 9:00:00 PM CEST, and here is the summary:

CaptureOutageUS

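The problem is that the two maintenance windows overlap: both sites were planned to be down between 06:00 and 12:00 CEST on Sunday. So we can’t rely on a Data Guard setup between the two to ensure High Availability.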
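
The overlap of the two planned windows can be checked with a little date arithmetic. This is only a sketch: it assumes GNU date, the function names to_epoch and overlap_minutes are hypothetical, and the timestamps are the planned start and end times from the two announcements, all in CEST.

```shell
#!/usr/bin/env bash
# to_epoch TIMESTAMP: epoch seconds (GNU date assumed). Parsed as UTC on
# purpose: all times share one zone, so only differences matter.
to_epoch() { date -u -d "$1" +%s; }

# overlap_minutes S1 E1 S2 E2: minutes shared by windows [S1,E1] and [S2,E2].
overlap_minutes() {
  local s1 e1 s2 e2 start end
  s1=$(to_epoch "$1"); e1=$(to_epoch "$2")
  s2=$(to_epoch "$3"); e2=$(to_epoch "$4")
  start=$(( s1 > s2 ? s1 : s2 ))   # later of the two starts
  end=$(( e1 < e2 ? e1 : e2 ))     # earlier of the two ends
  if (( end > start )); then
    echo $(( (end - start) / 60 ))
  else
    echo 0                         # the windows do not intersect
  fi
}

# EMEA planned window, then US planned window
overlap_minutes "2015-10-17 21:00" "2015-10-18 12:00" \
                "2015-10-18 06:00" "2015-10-18 21:00"
# 360
```

That is 6 hours during which neither site was guaranteed to be available.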


$ date
Sun Oct 18 21:05:46 CEST 2015
$ last -x -t 20151018170000 | head
runlevel (to lvl 3)   2.6.39-400.109.1 Sun Oct 18 16:12 - 21:05  (04:53)
reboot   system boot  2.6.39-400.109.1 Sun Oct 18 16:12 - 21:05  (04:53)
shutdown system down  2.6.39-400.109.1 Sun Oct 18 06:10 - 16:12  (10:02)
runlevel (to lvl 0)   2.6.39-400.109.1 Sun Oct 18 06:10 - 06:10  (00:00)
oracle   pts/1        xdsl-188-154-161 Sat Oct 17 21:25 - 00:01  (02:35)
oracle   pts/1        xdsl-188-154-161 Mon Oct  5 07:29 - 11:41  (04:12)
oracle   pts/1        xdsl-188-154-161 Mon Oct  5 07:22 - 07:22  (00:00)
oracle   pts/1        109.132.241.235  Sun Sep 20 10:57 - 10:57  (00:00)
oracle   pts/1        109.132.241.235  Sun Sep 20 10:28 - 10:28  (00:00)
oracle   pts/2        109.132.241.235  Sat Sep 19 18:03 - 19:07  (01:03)

During the 06:00 – 16:08 outage (end was planned for 21:00), the system was stopped from 06:10 to 16:12 and the notification came 2 hours later.

So what?

On any server, you should be confident that all your services restart when the server reboots. Check your init.d scripts. Test them. Take care of dependencies when one service must start after another. In the cloud, you’re not there when the system is brought back up, and you are not notified immediately, so you must be 100% sure that everything restarts by itself.
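
One way to gain that confidence is a small check to run after a test reboot. The following is only a sketch, not a complete monitoring solution: it assumes pgrep (from procps) is available, and the function names is_running and missing_services are hypothetical.

```shell
#!/usr/bin/env bash
# is_running PATTERN: succeeds if a process command line matches PATTERN.
# Assumes pgrep is installed; swap in your own check if it is not.
is_running() { pgrep -f "$1" > /dev/null 2>&1; }

# missing_services PATTERN...: print every pattern that has no matching
# process, one per line. Prints nothing when everything is running.
missing_services() {
  local pattern
  for pattern in "$@"; do
    is_running "$pattern" || echo "$pattern"
  done
}
```

For a database server, a call such as `missing_services 'ora_pmon_' 'tnslsnr'` would list the instance background process or the listener if either did not come back. These patterns are examples to adapt to your own services.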