Having supported ODAs of different generations and versions for several years, we have from time to time faced hardware alerts reported by the ILOM. However, not all of them are related to real hardware issues: some are false positives. To get rid of those, the solution is to reset them manually.

When a hardware error occurs, the first reaction is to open a Service Request and provide an ILOM snapshot to support. This can easily be done from the Maintenance menu in the ILOM web interface.

Support may then confirm that the alert is simply a false positive. If the answer from support is slow to arrive, another option is simply to give the reset a try 😀
However, a server reboot is then needed to make sure the alert has really disappeared.

Here is an example of a fault alarm involving a CPU that we faced:

Date/Time                 Subsystems          Component
------------------------  ------------------  ------------
Tue Feb 13 14:00:26 2018  Power               PS1 (Power Supply 1)
        A loss of AC input power to a power supply has been detected.
        (Probability:100, UUID:84846f3c-036d-6941-eaca-de18c4c236bd,
        Resource:/SYS/PS1, Part Number:7333459, Serial
        Number:465824T+1734D30847, Reference
        Document:http://support.oracle.com/msg/SPX86A-8003-EL)
Thu Feb 15 14:27:04 2018  System              DBP (Disk Backplane)
        ILOM has detected that a PCIE link layer is inactive. (Probability:25,
        UUID:49015767-38b2-6372-9526-c2d2c3885a72, Resource:/SYS/DBP, Part
        Number:7341145, Serial Number:465136N+1739P2009T, Reference
        Document:http://support.oracle.com/msg/SPX86A-8009-3J)
Thu Feb 15 14:27:04 2018  System              MB (Motherboard)
        ILOM has detected that a PCIE link layer is inactive. (Probability:25,
        UUID:49015767-38b2-6372-9526-c2d2c3885a72, Resource:/SYS/MB, Part
        Number:7317636, Serial Number:465136N+1742P500BX, Reference
        Document:http://support.oracle.com/msg/SPX86A-8009-3J)
Thu Feb 15 14:27:04 2018  Processors          P1 (CPU 1)
        ILOM has detected that a PCIE link layer is inactive. (Probability:25,
        UUID:49015767-38b2-6372-9526-c2d2c3885a72, Resource:/SYS/MB/P1, Part
        Number:SR3AX, Serial Number:54-85FED07F672D3DD3, Reference
        Document:http://support.oracle.com/msg/SPX86A-8009-3J)

 

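We can see that there are indeed 3 alerts for this issue: the DBP, MB and P1 entries share the same UUID (49015767-38b2-6372-9526-c2d2c3885a72), while the PS1 power alert is a separate event with its own UUID.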
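When the listing gets longer, grouping the entries by UUID shows at a glance which alerts belong to the same event. The following is only a minimal Python sketch, assuming the Open Problems listing has been saved as text in the layout shown above. The function name `group_by_uuid` and both regular expressions are my own assumptions, not something provided by the ILOM.

```python
import re
from collections import defaultdict

# Entry header, e.g. "Thu Feb 15 14:27:04 2018  Processors          P1 (CPU 1)".
# Assumption: columns are separated by at least two spaces, as in the listing.
HEADER = re.compile(
    r"^\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}\s{2,}(\S.*?)\s{2,}(\S.*)$"
)
UUID = re.compile(r"UUID:([0-9a-f-]{36})")


def group_by_uuid(listing):
    """Map each fault UUID to the components reported under it."""
    groups = defaultdict(list)
    component = None
    for line in listing.splitlines():
        header = HEADER.match(line)
        if header:
            # Third column of the header line is the component name.
            component = header.group(2).strip()
            continue
        found = UUID.search(line)
        if found and component is not None:
            groups[found.group(1)].append(component)
            component = None  # only one UUID is expected per entry
    return dict(groups)


# The four entries of the listing above, copied as-is.
SAMPLE = """\
Tue Feb 13 14:00:26 2018  Power               PS1 (Power Supply 1)
        A loss of AC input power to a power supply has been detected.
        (Probability:100, UUID:84846f3c-036d-6941-eaca-de18c4c236bd,
        Resource:/SYS/PS1, Part Number:7333459, Serial
        Number:465824T+1734D30847, Reference
        Document:http://support.oracle.com/msg/SPX86A-8003-EL)
Thu Feb 15 14:27:04 2018  System              DBP (Disk Backplane)
        ILOM has detected that a PCIE link layer is inactive. (Probability:25,
        UUID:49015767-38b2-6372-9526-c2d2c3885a72, Resource:/SYS/DBP, Part
        Number:7341145, Serial Number:465136N+1739P2009T, Reference
        Document:http://support.oracle.com/msg/SPX86A-8009-3J)
Thu Feb 15 14:27:04 2018  System              MB (Motherboard)
        ILOM has detected that a PCIE link layer is inactive. (Probability:25,
        UUID:49015767-38b2-6372-9526-c2d2c3885a72, Resource:/SYS/MB, Part
        Number:7317636, Serial Number:465136N+1742P500BX, Reference
        Document:http://support.oracle.com/msg/SPX86A-8009-3J)
Thu Feb 15 14:27:04 2018  Processors          P1 (CPU 1)
        ILOM has detected that a PCIE link layer is inactive. (Probability:25,
        UUID:49015767-38b2-6372-9526-c2d2c3885a72, Resource:/SYS/MB/P1, Part
        Number:SR3AX, Serial Number:54-85FED07F672D3DD3, Reference
        Document:http://support.oracle.com/msg/SPX86A-8009-3J)
"""

if __name__ == "__main__":
    for fault_uuid, components in group_by_uuid(SAMPLE).items():
        print(fault_uuid, components)
```

On this listing the sketch puts PS1 in a group of its own and DBP, MB and P1 together under the second UUID, which matches what we read by eye.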

To reset such an alert, first log in to the server as root and open the ILOM command line through ipmitool:

[root@oda-dbi01 ~]# ipmitool -I open sunoem cli
Connected. Use ^D to exit.

Oracle(R) Integrated Lights Out Manager

Version 4.0.0.28 r121827

Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.

Warning: password is set to factory default.

Warning: HTTPS certificate is set to factory default.

Hostname: oda-dbi01-ilom

 

Once connected, you can list the Open Problems to get the same output as above with the following command:

-> ls /System/Open_Problems

In the list of Open Problems we can find the UUID of the affected component (see the third line of the entry):

Thu Feb 15 14:27:04 2018  Processors          P1 (CPU 1)
        ILOM has detected that a PCIE link layer is inactive. (Probability:25,
        UUID:49015767-38b2-6372-9526-c2d2c3885a72, Resource:/SYS/MB/P1, Part
        Number:SR3AX, Serial Number:54-85FED07F672D3DD3, Reference
        Document:http://support.oracle.com/msg/SPX86A-8009-3J)
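To avoid copying the UUID by hand, it can also be looked up from the resource path. This is a rough Python sketch under the assumption that an entry has been saved as text exactly as printed above; `uuid_for_resource` is a hypothetical helper name, not an ILOM feature.

```python
import re


def uuid_for_resource(listing, resource):
    """Return the UUID of the first entry whose Resource matches, else None."""
    entries, current = [], []
    for line in listing.splitlines():
        if line and not line[0].isspace():
            # A non-indented line starts a new entry.
            if current:
                entries.append(" ".join(current))
            current = [line.strip()]
        elif line.strip() and current:
            # Indented lines are the wrapped details of the current entry.
            current.append(line.strip())
    if current:
        entries.append(" ".join(current))

    for entry in entries:
        res = re.search(r"Resource:([^,\s]+)", entry)
        uid = re.search(r"UUID:([0-9a-f-]{36})", entry)
        if res and uid and res.group(1) == resource:
            return uid.group(1)
    return None


# The CPU entry shown above, copied as-is.
ENTRY = """\
Thu Feb 15 14:27:04 2018  Processors          P1 (CPU 1)
        ILOM has detected that a PCIE link layer is inactive. (Probability:25,
        UUID:49015767-38b2-6372-9526-c2d2c3885a72, Resource:/SYS/MB/P1, Part
        Number:SR3AX, Serial Number:54-85FED07F672D3DD3, Reference
        Document:http://support.oracle.com/msg/SPX86A-8009-3J)
"""

if __name__ == "__main__":
    print(uuid_for_resource(ENTRY, "/SYS/MB/P1"))
```

Joining the wrapped lines before searching is a deliberate choice: the details are wrapped at arbitrary word boundaries, so the `UUID:` and `Resource:` fields are not guaranteed to sit on the same line.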

 

Now it is time to access the fault manager shell to reset all alerts related to this UUID:

-> cd SP/faultmgmt/shell/
/SP/faultmgmt/shell

-> start
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

 

The alert is reset with the fmadm acquit command, which takes the UUID found in the Open Problems as its argument:

faultmgmtsp> fmadm acquit 49015767-38b2-6372-9526-c2d2c3885a72
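Since the UUID is usually pasted from the listing, a truncated value or a stray trailing comma is easy to miss. As a small safeguard, here is a Python sketch that checks the UUID is well formed before building the command line to type at the `faultmgmtsp>` prompt. The helper `acquit_command` is hypothetical and only produces a string; it does not talk to the ILOM.

```python
import uuid


def acquit_command(candidate):
    """Return the 'fmadm acquit' line for a well-formed UUID.

    Raises ValueError if the text is not a canonical hyphenated UUID.
    """
    # Tolerate the trailing comma that follows the UUID in the listing.
    text = candidate.strip().rstrip(",")
    parsed = uuid.UUID(text)  # raises ValueError on malformed input
    # uuid.UUID also accepts other spellings (no hyphens, braces, urn: prefix);
    # only accept the form that the Open Problems listing prints.
    if str(parsed) != text.lower():
        raise ValueError(f"not a canonical UUID: {candidate!r}")
    return f"fmadm acquit {parsed}"


if __name__ == "__main__":
    print(acquit_command("49015767-38b2-6372-9526-c2d2c3885a72,"))
```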

Once the fmadm acquit command has run, the alerts are already removed from the Open Problems. However, to make sure the issue is really gone, we need to reboot the ODA and check the Open Problems again afterwards.

Note that I showed here how to check the Open Problems from the IPMI command line, but the same output is also available in the ILOM web interface.

Hope it helps!