By Franck Pachot

The Linux metrics that are usually considered for performance monitoring and troubleshooting have many drawbacks and are frequently misinterpreted. Recent kernels provide PSI (Pressure Stall Information), which is not enabled by default in the Oracle Linux “unbreakable” kernel. However, the kernel is compiled with it. Here is how to enable it on a newly created Autonomous Linux instance (the Oracle Cloud Linux, based on OEL 7.9, with no-downtime patching).

Check PSI availability


[opc@sillymetrics ~]$ tail /proc/pressure/*

==> /proc/pressure/cpu <==
tail: error reading ‘/proc/pressure/cpu’: Operation not supported

==> /proc/pressure/io <==
tail: error reading ‘/proc/pressure/io’: Operation not supported

==> /proc/pressure/memory <==
tail: error reading ‘/proc/pressure/memory’: Operation not supported

PSI (Pressure Stall Information) is not available


[opc@sillymetrics ~]$ uname --kernel-release

5.4.17-2036.101.2.el7uek.x86_64

The kernel is >= 4.20 (where PSI was introduced), so it is ok


[opc@sillymetrics ~]$ grep PSI /boot/config-$(uname --kernel-release)

CONFIG_PSI=y
CONFIG_PSI_DEFAULT_DISABLED=y

The kernel is compiled with PSI, but it is disabled by default


[opc@sillymetrics ~]$ cat /proc/cmdline

BOOT_IMAGE=/boot/vmlinuz-5.4.17-2036.101.2.el7uek.x86_64 root=UUID=53da1572-8973-42e7-965b-fa6c46efb805 ro crashkernel=auto LANG=en_US.UTF-8 console=tty0 console=ttyS0,9600 rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0 rd.iscsi.bypass=1 netroot=iscsi:169.254.0.2:::1:iqn.2015-02.oracle.boot:uefi iscsi_param=node.session.timeo.replacement_timeout=6000 net.ifnames=1 nvme_core.shutdown_timeout=10 ipmi_si.tryacpi=0 ipmi_si.trydmi=0 ipmi_si.trydefaults=0 libiscsi.debug_libiscsi_eh=1 loglevel=4 ip=single-dhcp crash_kexec_post_notifiers

I need to add “psi=1” to the kernel boot command line
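On a system not managed by tuned, a more direct alternative (a sketch, not what I do below) would be to append the parameter with grubby:

# hypothetical alternative to the tuned profile used below: add psi=1 for all installed kernels
sudo grubby --update-kernel=ALL --args="psi=1"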

Add boot command line option


[opc@sillymetrics ~]$ tuned-adm active

Current active profile: oci-rps-xps oci-busy-polling oci-cpu-power oci-nic

The system is managed by tuned, so I’ll add a custom “bootcmdline” profile


[opc@sillymetrics ~]$ sudo su -
[root@sillymetrics ~]# mkdir /etc/tuned/my_bootcmdline && cat > /etc/tuned/my_bootcmdline/tuned.conf <<'TAC'

[main]
summary=Enable Pressure Stall Information as in https://www.dbi-services.com/blog/psi-pressure-stall-information/
[bootloader]
cmdline=psi=1

TAC

I’ve created a new profile for the bootloader command-line options
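As a quick sanity check (the grep pattern is simply my profile name), the new profile should now be listed by tuned:

tuned-adm list | grep my_bootcmdline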


[root@sillymetrics ~]# tuned-adm profile oci-rps-xps oci-busy-polling oci-cpu-power oci-nic my_bootcmdline

I switched to this profile
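Before rebooting, the tuned bootloader plugin should have written the extra parameter for grub. On this RHEL7-based system I would expect it in /etc/tuned/bootcmdline (an assumption about the bootloader plugin, easy to verify):

# should show TUNED_BOOT_CMDLINE="psi=1" if the profile was applied
cat /etc/tuned/bootcmdline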


[root@sillymetrics ~]# systemctl reboot
Connection to 129.213.62.243 closed by remote host.
Connection to 129.213.62.243 closed.

[opc@a pmm]$ ssh 129.213.62.243
Last login: Wed Jan  6 14:07:18 2021
Welcome to Autonomous Linux
Effective kernel version is 5.4.17-2036.101.2.el7uek.x86_64

I restarted the system


[opc@sillymetrics ~]$ cat /proc/cmdline

BOOT_IMAGE=/boot/vmlinuz-5.4.17-2036.101.2.el7uek.x86_64 root=UUID=53da1572-8973-42e7-965b-fa6c46efb805 ro crashkernel=auto LANG=en_US.UTF-8 console=tty0 console=ttyS0,9600 rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0 rd.iscsi.bypass=1 netroot=iscsi:169.254.0.2:::1:iqn.2015-02.oracle.boot:uefi iscsi_param=node.session.timeo.replacement_timeout=6000 net.ifnames=1 nvme_core.shutdown_timeout=10 ipmi_si.tryacpi=0 ipmi_si.trydmi=0 ipmi_si.trydefaults=0 libiscsi.debug_libiscsi_eh=1 loglevel=4 ip=single-dhcp crash_kexec_post_notifiers psi=1

My boot command line option is there (psi=1)


[opc@sillymetrics ~]$ sudo tail /proc/pressure/*

==> /proc/pressure/cpu <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=0

==> /proc/pressure/io <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The Pressure Stall Information is now available
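To keep an eye on it continuously, a simple option is:

# refresh the three pressure files every second
watch -n 1 tail /proc/pressure/cpu /proc/pressure/io /proc/pressure/memory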

Test on CPU resource

PSI measures the “pressure” on a resource. There is no pressure if the resource is immediately available when needed: throughput can reach its maximum with latency at its minimum. Pressure increases when contention on the resource introduces delay, queuing, stalling. IO and RAM access can block all processes: when pressure is at its maximum, PSI labels that as “full” pressure, where all tasks have to wait and the CPU sits idle, wasting all CPU cycles because no runnable task can run. In this blog post I’ll test CPU pressure only, where we never see this “full” state, because the baseline of PSI is the time where the CPU can execute a task. PSI measures the amount of this runnable time that is wasted because of CPU starvation. PSI has a counter for the time where not all, but “some” tasks are delayed because of resource pressure: it is accounted as soon as at least one runnable task is delayed, waiting in the runqueue. The important point is that PSI is about wallclock time where tasks are delayed. However, this time is normalized to account for the surface area of the pressure: one task waiting, on average, in front of many running ones is a smaller pressure than one waiting in front of a single running one. So this wallclock time is divided by the number of CPUs.
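This also means the cumulative “total” counter (in microseconds) can be sampled and divided by the elapsed time to get the average pressure over any interval. Here is a minimal sketch of my own (not part of the test below):

# sample the normalized CPU stall counter twice, 10 seconds apart
t0=$(awk -F'total=' '/^some/{print $2}' /proc/pressure/cpu)
sleep 10
t1=$(awk -F'total=' '/^some/{print $2}' /proc/pressure/cpu)
# total is in microseconds and already divided by the CPU count,
# so pressure% = 100 * delta / elapsed = delta_us / 100000 over 10 seconds
echo "CPU some pressure over 10s: $(( (t1 - t0) / 100000 ))%"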

Here is what I’ve run on a 2 vCPU instance (the free tier one in the Oracle Cloud).


uptime ; cat /proc/pressure/cpu
for i in {1..1} ; do timeout 60 yes & done > /dev/null ; wait ; uptime ; cat /proc/pressure/cpu
for i in {1..2} ; do timeout 60 yes & done > /dev/null ; wait ; uptime ; cat /proc/pressure/cpu
for i in {1..3} ; do timeout 60 yes & done > /dev/null ; wait ; uptime ; cat /proc/pressure/cpu
for i in {1..4} ; do timeout 60 yes & done > /dev/null ; wait ; uptime ; cat /proc/pressure/cpu
for i in {1..5} ; do timeout 60 yes & done > /dev/null ; wait ; uptime ; cat /proc/pressure/cpu
for i in {1..6} ; do timeout 60 yes & done > /dev/null ; wait ; uptime ; cat /proc/pressure/cpu
sleep 10 ; uptime ; cat /proc/pressure/cpu
sleep 10 ; uptime ; cat /proc/pressure/cpu
sleep 10 ; uptime ; cat /proc/pressure/cpu
sleep 10 ; uptime ; cat /proc/pressure/cpu
sleep 10 ; uptime ; cat /proc/pressure/cpu
sleep 10 ; uptime ; cat /proc/pressure/cpu
sleep 10 ; uptime ; cat /proc/pressure/cpu

I’ll detail the result after each line (a 60-second run each time), as I scale up the number of running processes.


 14:53:24 up 46 min,  1 user,  load average: 0.00, 0.10, 0.21
some avg10=0.24 avg60=0.45 avg300=2.01 total=124839977

This is the baseline: after low activity for more than 5 minutes, all averages are close to zero.


 14:54:24 up 47 min,  1 user,  load average: 0.63, 0.27, 0.26
some avg10=0.46 avg60=0.54 avg300=1.74 total=125243785

After 1 process in the R state for 60 seconds, the pressure is negligible, as I have two vCPUs here. PSI is much easier to read than the load average: you don’t have to compare it to the number of vCPUs. As long as you run fewer tasks than the number of CPU threads, there is no pressure.


 14:55:24 up 48 min,  1 user,  load average: 1.70, 0.65, 0.39
some avg10=2.87 avg60=2.05 avg300=1.95 total=127191314

After 2 processes in the R state for 60 seconds, the pressure is still negligible. As my instance is an Oracle Cloud free tier one, it actually gets 1/8th of an OCPU: only 25% of the processor time is available (the remainder, taken by the hypervisor, is seen as %steal). But Pressure Stall Information doesn’t look at physical processors. The baseline is the available clock time of each vCPU. Don’t forget that PSI measures the pressure, not the slowdown: a task on a vCPU that is always available experiences no pressure, as it can always run at the full potential of that vCPU.


 14:56:24 up 49 min,  1 user,  load average: 2.52, 1.08, 0.56
some avg10=51.43 avg60=33.01 avg300=11.04 total=158241805

After 3 processes 100% in the R state for 60 seconds, I have encountered CPU starvation. With 3 runnable tasks and only 2 vCPUs available, there is always 1 task stalled at any second. The wallclock time in the “some” state is 60 seconds. However, PSI divides this pressure, from one starving task, by the number of vCPUs: pressure reaches 100% only when every vCPU has a task waiting for it. So here, 1 task on average was delayed in front of 2 busy threads, and each second adds only 0.5 seconds of pressure. This is why the total stall time (in microseconds) shows (158241805-127191314)/1e6=31 seconds during these 60 seconds of elapsed time. That is 51% of the interval, as displayed by the average on the last 10 seconds. The 60-second average is not updated immediately and shows only 33%. But the “total” is real time, and this is where PSI shines: a fine-grained view of the stall situation. This is a big advantage over the load average, where you have to wait 1 minute to get the full picture.
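Re-doing that arithmetic from the raw counters:

# 31 seconds of normalized stall within a 60-second interval...
echo "scale=1; (158241805-127191314)/1000000" | bc
# ...which is ~51.7% of the interval, consistent with avg10=51.43
echo "scale=1; 100*(158241805-127191314)/60000000" | bc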


 14:57:24 up 50 min,  1 user,  load average: 3.43, 1.60, 0.77
some avg10=95.91 avg60=72.92 avg300=26.86 total=216575324

After 4 processes in the R state for 60 seconds, each was able to run on a CPU only half of the time. The stall time is (216575324-158241805)/1e6=58 seconds (nearly full seconds, as 2 tasks are waiting in front of 2 CPUs) within this minute. The pressure is high.


 14:58:24 up 51 min,  1 user,  load average: 4.47, 2.22, 1.04
some avg10=99.86 avg60=89.92 avg300=40.43 total=276578094

 14:59:24 up 52 min,  1 user,  load average: 5.52, 2.94, 1.36
some avg10=99.93 avg60=96.20 avg300=51.49 total=336583746

After 5 processes in the R state for 60 seconds, and then 6 processes, the pressure is very high. The averages approach but never reach 100%, because you can always add another task willing to run. The pressure keeps increasing, along with the runqueue and the wait time.


 14:59:34 up 52 min,  1 user,  load average: 4.75, 2.86, 1.35
some avg10=39.88 avg60=82.54 avg300=50.00 total=336584509
 14:59:44 up 53 min,  1 user,  load average: 4.02, 2.77, 1.34
some avg10=14.68 avg60=69.89 avg300=48.32 total=336584773
 14:59:54 up 53 min,  1 user,  load average: 3.40, 2.67, 1.32
some avg10=6.34 avg60=59.45 avg300=46.75 total=336782396
 15:00:04 up 53 min,  1 user,  load average: 2.88, 2.59, 1.31
some avg10=2.33 avg60=50.34 avg300=45.17 total=336787339
 15:00:14 up 53 min,  1 user,  load average: 2.44, 2.50, 1.30
some avg10=0.85 avg60=42.63 avg300=43.64 total=336791024
 15:00:24 up 53 min,  1 user,  load average: 2.65, 2.55, 1.32
some avg10=1.90 avg60=36.57 avg300=42.28 total=337114177
 15:00:34 up 53 min,  1 user,  load average: 2.32, 2.48, 1.31
some avg10=0.70 avg60=30.96 avg300=40.85 total=337114942

Here is the cooldown during one minute. Looking at the difference in the cumulative total shows that starvation is gone: (337114942-336583746)/1e6=0.5 second. The 10-second and 60-second windowed averages take some time to decrease. PSI is real time, and can even trigger notifications. That makes it a very interesting metric to monitor for serverless, elastic, auto-scaling services.
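As a crude illustration of such monitoring (the 50% threshold and the 5-second polling interval are arbitrary choices of mine; for true real-time notifications the kernel supports triggers via poll() on /proc/pressure/cpu, which a plain shell loop cannot do):

# print a timestamped alert whenever the 10-second CPU pressure exceeds 50%
while sleep 5 ; do
  avg10=$(awk -F'[= ]' '/^some/{print $3}' /proc/pressure/cpu)
  awk -v p="$avg10" 'BEGIN { if (p+0 > 50) print strftime("%T"), "CPU pressure high:", p "%" }'
done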

I ran the same test on different shapes of VM, with 2, 4 and 8 vCPUs. In the graph below, the X-axis is the number of processes running and the Y-axis is the PSI avg10 after 20 seconds of run. The load average would be linear with the number of running processes (but I would have had to run at least 1 minute to get it). Here we see that PSI is linear when the number of running processes is between one per CPU and two per CPU. When some CPUs are idle, PSI shows nearly 0% pressure. And when every CPU has one task stalled behind it, PSI is nearly 100%.

The test I did above is similar to the red line here, on a 2 vCPU shape:

The yellow line is on 4 vCPUs and the green line on 8 vCPUs. In summary, Pressure Stall Information divides the CPU load into 3 zones:

  • no pressure when some CPUs are idle
  • increasing pressure when all CPUs are busy and some tasks are stalled
  • maximal pressure when there are more stalled tasks than running ones