Why does the sysctl.conf value for swappiness on Oracle Linux 7.x not survive a reboot?
For a project I applied the dbi services best practices for Oracle databases. One of these is to adjust the swappiness parameter. We recommend a very low swappiness value, such as 10 or even lower, to reduce the risk of Linux starting to swap. Swapping on a database server is not a problem per se, but it generates disk activity, which can negatively impact the performance of a database.
At the end of this project I did the handover to our service desk. The service desk has a lot of things to validate for every installation and has developed scripts to check the dbi services best practices against a new system before it gets under contract and/or monitoring. One of these scripts detected that the swappiness value on the system was set to 30. After a few hours of investigation we identified the issue. More about the swappiness part of this later in this blog.
In previous versions of Linux we applied or modified parameters in the /etc/sysctl.conf file to tune the kernel, network, disks, etc. One of these parameters is vm.swappiness.
[root]# grep -A1 "^# dbi" /etc/sysctl.conf
# dbi services reduces the possibility for swapping, 0 = disable, 10 = reduce the paging possibility to 10%
vm.swappiness = 0
To activate this setting we then run:
[root]# sysctl -p | grep "^vm"
vm.swappiness = 0
To verify the setting we can query the current value:
[root]# cat /proc/sys/vm/swappiness
0
After a reboot of the system, we check the value again:
[root]# cat /proc/sys/vm/swappiness
30
What a surprise! My value did not survive a reboot.
Why is the default value of 30 applied?
There are some important changes when it comes to setting the system values for kernel, network, disk, etc. in the recent versions of Red Hat Enterprise Linux 7.
- Since version 7, the tuned.service is enabled per default, even in a minimal installation.
- The tuned.service applies some values after the sysctl.conf values have been loaded.
- The default tuned profile that gets applied is throughput-performance on a physical machine and virtual-guest on a virtual machine.
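These facts are easy to verify on a running system. Below is a minimal sketch, assuming a systemd-based EL7 box; the tuned commands are guarded so the script also runs on a machine where the package is absent:

```shell
#!/bin/sh
# Show whether tuned is enabled and which profile it applies (if installed).
if command -v tuned-adm >/dev/null 2>&1; then
    systemctl is-enabled tuned || true   # "enabled" per default on a minimal EL7 install
    tuned-adm active || true
fi

# The value effectively in force, regardless of what /etc/sysctl.conf says:
cat /proc/sys/vm/swappiness
```

If the value printed last differs from the one in /etc/sysctl.conf, something applied settings after sysctl did.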
Once we were aware of these facts, we looked at the values which are set by default.
The tuned.service profiles are located under /usr/lib/tuned/
[root]# ls -als /usr/lib/tuned/
total 36
 4 drwxr-xr-x. 13 root root  4096 Apr  8 14:13 .
12 dr-xr-xr-x. 41 root root  8192 Mar 11 09:39 ..
 0 drwxr-xr-x.  2 root root    23 Apr  6 14:56 balanced
 0 drwxr-xr-x.  2 root root    23 Apr  6 14:56 desktop
16 -rw-r--r--.  1 root root 12294 Mar 31 18:46 functions
 0 drwxr-xr-x.  2 root root    23 Apr  6 14:56 latency-performance
 0 drwxr-xr-x.  2 root root    23 Apr  6 14:56 network-latency
 0 drwxr-xr-x.  2 root root    23 Apr  6 14:56 network-throughput
 0 drwxr-xr-x.  2 root root    39 Apr  6 14:56 powersave
 4 -rw-r--r--.  1 root root  1288 Jul 31  2015 recommend.conf
 0 drwxr-xr-x.  2 root root    23 Apr  6 14:56 throughput-performance
 0 drwxr-xr-x.  2 root root    23 Apr  6 14:56 virtual-guest   <- default
 0 drwxr-xr-x.  2 root root    23 Apr  6 14:56 virtual-host
There is a list of predefined profiles.
[root]# tuned-adm list
Available profiles:
- balanced
- desktop
- latency-performance
- network-latency
- network-throughput
- powersave
- throughput-performance
- virtual-guest
- virtual-host
The currently active profile is: virtual-guest
To find out more about the tuned profiles:
[root]# man tuned-profiles
TUNED_PROFILES(7)                     tuned                     TUNED_PROFILES(7)

NAME
       tuned-profiles - description of basic tuned profiles

DESCRIPTION
       These are the base profiles which are mostly shipped in the base tuned
       package. They are targeted to various goals. Mostly they provide
       performance optimizations but there are also profiles targeted to low
       power consumption, low latency and others. You can mostly deduce the
       purpose of the profile by its name or you can see full description below.

       The profiles are stored in subdirectories below /usr/lib/tuned. If you
       need to customize the profiles, you can copy them to /etc/tuned and
       modify them as you need. When loading profiles with the same name, the
       /etc/tuned takes precedence. In such case you will not lose your
       customized profiles between tuned updates.

       The power saving profiles contain settings that are typically not
       enabled by default as they will noticeably impact the latency/performance
       of your system as opposed to the power saving mechanisms that are enabled
       by default. On the other hand the performance profiles disable the
       additional power saving mechanisms of tuned as they would negatively
       impact throughput or latency.

PROFILES
       At the moment we're providing the following pre-defined profiles:

       balanced
              It is the default profile. It provides balanced power saving and
              performance. At the moment it enables CPU and disk plugins of
              tuned and it makes sure the ondemand governor is active (if
              supported by the current cpufreq driver). It enables ALPM power
              saving for SATA host adapters and sets the link power management
              policy to medium_power. It also sets the CPU energy performance
              bias to normal. It also enables AC97 audio power saving or (it
              depends on your system) HDA-Intel power savings with 10 seconds
              timeout. In case your system contains supported Radeon graphics
              card (with enabled KMS) it configures it to automatic power
              saving.

       powersave
              Maximal power saving, at the moment it enables USB autosuspend
              (in case environment variable USB_AUTOSUSPEND is set to 1),
              enables ALPM power saving for SATA host adapters and sets the
              link power management policy to min_power. It also enables WiFi
              power saving, enables multi core power savings scheduler for low
              wakeup systems and makes sure the ondemand governor is active
              (if supported by the current cpufreq driver). It sets the CPU
              energy performance bias to powersave. It also enables AC97 audio
              power saving or (it depends on your system) HDA-Intel power
              savings (with 10 seconds timeout). In case your system contains
              supported Radeon graphics card (with enabled KMS) it configures
              it to automatic power saving. On Asus Eee PCs dynamic Super
              Hybrid Engine is enabled.

       throughput-performance
              Profile for typical throughput performance tuning. Disables
              power saving mechanisms and enables sysctl settings that improve
              the throughput performance of your disk and network IO. CPU
              governor is set to performance and CPU energy performance bias
              is set to performance. Disk readahead values are increased.

       latency-performance
              Profile for low latency performance tuning. Disables power
              saving mechanisms. CPU governor is set to performance and locked
              to the low C states (by PM QoS). CPU energy performance bias to
              performance.

       network-throughput
              Profile for throughput network tuning. It is based on the
              throughput-performance profile. It additionally increases kernel
              network buffers.

       network-latency
              Profile for low latency network tuning. It is based on the
              latency-performance profile. It additionally disables
              transparent hugepages, NUMA balancing and tunes several other
              network related sysctl parameters.

       desktop
              Profile optimized for desktops based on balanced profile. It
              additionally enables scheduler autogroups for better response of
              interactive applications.

       virtual-guest
              Profile optimized for virtual guests based on
              throughput-performance profile. It additionally decreases
              virtual memory swappiness and increases dirty_ratio settings.

       virtual-host
              Profile optimized for virtual hosts based on
              throughput-performance profile. It additionally enables more
              aggressive writeback of dirty pages.

FILES
       /etc/tuned/*
       /usr/lib/tuned/*

SEE ALSO
       tuned(8) tuned-adm(8) tuned-profiles-atomic(7) tuned-profiles-sap(7)
       tuned-profiles-sap-hana(7) tuned-profiles-oracle(7)
       tuned-profiles-realtime(7) tuned-profiles-nfv(7)
       tuned-profiles-compat(7)

AUTHOR
       Jaroslav Škarvada <[email protected]>
       Jan Kaluža <[email protected]>
       Jan Včelák <[email protected]>
       Marcela Mašláňová <[email protected]>
       Phil Knirsch <[email protected]>
       Fedora Power Management SIG

23 Sep 2014                                                     TUNED_PROFILES(7)
Let's look at the values inside the default profile (virtual-guest), which includes the throughput-performance profile. We focus on the swappiness value, which is set to 30.
[root]# cat /usr/lib/tuned/virtual-guest/tuned.conf
#
# tuned configuration
#

[main]
include=throughput-performance

[sysctl]
# If a workload mostly uses anonymous memory and it hits this limit, the entire
# working set is buffered for I/O, and any more write buffering would require
# swapping, so it's time to throttle writes until I/O can catch up. Workloads
# that mostly use file mappings may be able to use even higher values.
#
# The generator of dirty data starts writeback at this percentage (system default
# is 20%)
vm.dirty_ratio = 30

# Filesystem I/O is usually much more efficient than swapping, so try to keep
# swapping low. It's usually safe to go even lower than this on systems with
# server-grade storage.
vm.swappiness = 30
An important point: the tuned profile virtual-guest includes the settings from the tuned profile throughput-performance:
[root]# cat /usr/lib/tuned/throughput-performance/tuned.conf
#
# tuned configuration
#

[cpu]
governor=performance
energy_perf_bias=performance
min_perf_pct=100

[disk]
readahead=>4096

[sysctl]
# ktune sysctl settings for rhel6 servers, maximizing i/o throughput
#
# Minimal preemption granularity for CPU-bound tasks:
# (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
kernel.sched_min_granularity_ns = 10000000

# SCHED_OTHER wake-up granularity.
# (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
#
# This option delays the preemption effects of decoupled workloads
# and reduces their over-scheduling. Synchronous workloads will still
# have immediate wakeup/sleep latencies.
kernel.sched_wakeup_granularity_ns = 15000000

# If a workload mostly uses anonymous memory and it hits this limit, the entire
# working set is buffered for I/O, and any more write buffering would require
# swapping, so it's time to throttle writes until I/O can catch up. Workloads
# that mostly use file mappings may be able to use even higher values.
#
# The generator of dirty data starts writeback at this percentage (system default
# is 20%)
vm.dirty_ratio = 40

# Start background writeback (via writeback threads) at this percentage (system
# default is 10%)
vm.dirty_background_ratio = 10

# PID allocation wrap value. When the kernel's next PID value
# reaches this value, it wraps back to a minimum PID value.
# PIDs of value pid_max or larger are not allocated.
#
# A suggested value for pid_max is 1024 * <# of cpu cores/threads in system>
# e.g., a box with 32 cpus, the default of 32768 is reasonable, for 64 cpus,
# 65536, for 4096 cpus, 4194304 (which is the upper limit possible).
#kernel.pid_max = 65536

# The swappiness parameter controls the tendency of the kernel to move
# processes out of physical memory and onto the swap disk.
# 0 tells the kernel to avoid swapping processes out of physical memory
# for as long as possible
# 100 tells the kernel to aggressively swap processes out of physical memory
# and move them to swap cache
vm.swappiness=10
There are various approaches to solve this issue:
- disable the tuned.service to switch back to the /etc/sysctl.conf values
- adapt the values in the virtual-guest profile, but what if they are updated automatically by the OS vendor in future patches or releases?
- create a new tuned profile based on virtual-guest and adapt the values
- use the tuned profile which is deployed by Oracle in the Oracle Linux repository
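For option 3, the man page quoted above points the way: profiles placed under /etc/tuned take precedence over those under /usr/lib/tuned and survive tuned package updates. A sketch of such a profile follows; the profile name dbi-virtual-guest is an illustrative assumption, and the TUNED_DIR variable only exists so the sketch can run without root (on the real system it would be /etc/tuned):

```shell
#!/bin/sh
# Create a local tuned profile that inherits everything from virtual-guest
# and overrides only the swappiness value.
# NOTE: TUNED_DIR is a stand-in for /etc/tuned so this sketch runs unprivileged.
TUNED_DIR="${TUNED_DIR:-$(mktemp -d)}"

mkdir -p "$TUNED_DIR/dbi-virtual-guest"
cat > "$TUNED_DIR/dbi-virtual-guest/tuned.conf" <<'EOF'
#
# dbi services: virtual-guest with a low swappiness
#
[main]
include=virtual-guest

[sysctl]
vm.swappiness = 10
EOF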
I prefer solution 4, which is also the most practical way.
Here is what we need to do:
First of all install the corresponding package from the Oracle Linux 7 repository:
[root]# yum info *tuned-profile*
Loaded plugins: ulninfo
Available Packages
Name        : tuned-profiles-oracle
Arch        : noarch
Version     : 2.5.1
Release     : 4.el7_2.3
Size        : 1.5 k
Repo        : installed
From repo   : ol7_latest
Summary     : Additional tuned profile(s) targeted to Oracle loads
URL         : https://fedorahosted.org/tuned/
License     : GPLv2+
Description : Additional tuned profile(s) targeted to Oracle loads.
Let's look at the values inside this tuned profile:
[root]# cat /usr/lib/tuned/oracle/tuned.conf
#
# tuned configuration
#

[main]
include=throughput-performance

[sysctl]
vm.swappiness = 1
vm.dirty_background_ratio = 3
vm.dirty_ratio = 80
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
kernel.shmmax = 4398046511104
kernel.shmall = 1073741824
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
fs.file-max = 6815744
fs.aio-max-nr = 1048576
net.ipv4.ip_local_port_range = 9000 65500
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
kernel.panic_on_oops = 1

[vm]
transparent_hugepages=never
Activate the profile, check which profile is really active, and then check the current value of the swappiness parameter:
[root]# tuned-adm profile oracle
[root]# tuned-adm active
Current active profile: oracle
[root]# cat /proc/sys/vm/swappiness
1
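The choice made with tuned-adm profile is persistent: tuned records it on disk and the service re-applies it at boot, which is exactly what /etc/sysctl.conf alone could not guarantee here. A small sanity-check sketch; the active_profile path is how tuned on EL7 stores the selection, and the checks are guarded in case tuned is not installed on the machine running this:

```shell
#!/bin/sh
# Where tuned on EL7 records the profile that will be re-applied at boot.
if [ -f /etc/tuned/active_profile ]; then
    echo "profile re-applied at boot: $(cat /etc/tuned/active_profile)"
fi

# The effective swappiness; it should read 1 once the oracle profile is active.
cat /proc/sys/vm/swappiness
```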
Now we have the oracle tuned profile applied, which overrides some values that also come from Oracle with the oracle-rdbms-server-11gR2-preinstall or oracle-rdbms-server-12cR1-preinstall packages. In my case this is the list of duplicated parameters:
/usr/lib/tuned/oracle/tuned.conf:

vm.swappiness = 1
vm.dirty_background_ratio = 3
vm.dirty_ratio = 80
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
kernel.shmmax = 4398046511104
kernel.shmall = 1073741824
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
fs.file-max = 6815744
fs.aio-max-nr = 1048576
net.ipv4.ip_local_port_range = 9000 65500
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
kernel.panic_on_oops = 1

/etc/sysctl.conf from oracle-rdbms-server-11gR2-preinstall:

# oracle-rdbms-server-11gR2-preinstall setting for fs.file-max is 6815744
fs.file-max = 6815744
# oracle-rdbms-server-11gR2-preinstall setting for kernel.sem is '250 32000 100 128'
kernel.sem = 250 32000 100 128
# oracle-rdbms-server-11gR2-preinstall setting for kernel.shmmni is 4096
kernel.shmmni = 4096
# oracle-rdbms-server-11gR2-preinstall setting for kernel.shmall is 1073741824 on x86_64
# oracle-rdbms-server-11gR2-preinstall setting for kernel.shmall is 2097152 on i386
kernel.shmall = 1073741824
# oracle-rdbms-server-11gR2-preinstall setting for kernel.shmmax is 4398046511104 on x86_64
# oracle-rdbms-server-11gR2-preinstall setting for kernel.shmmax is 4294967295 on i386
kernel.shmmax = 4398046511104
# oracle-rdbms-server-11gR2-preinstall setting for kernel.panic_on_oops is 1 per Orabug 19212317
kernel.panic_on_oops = 1
# oracle-rdbms-server-11gR2-preinstall setting for net.core.rmem_default is 262144
net.core.rmem_default = 262144
# oracle-rdbms-server-11gR2-preinstall setting for net.core.rmem_max is 4194304
net.core.rmem_max = 4194304
# oracle-rdbms-server-11gR2-preinstall setting for net.core.wmem_default is 262144
net.core.wmem_default = 262144
# oracle-rdbms-server-11gR2-preinstall setting for net.core.wmem_max is 1048576
net.core.wmem_max = 1048576
# oracle-rdbms-server-11gR2-preinstall setting for net.ipv4.conf.all.rp_filter is 2
net.ipv4.conf.all.rp_filter = 2
# oracle-rdbms-server-11gR2-preinstall setting for net.ipv4.conf.default.rp_filter is 2
net.ipv4.conf.default.rp_filter = 2
# oracle-rdbms-server-11gR2-preinstall setting for fs.aio-max-nr is 1048576
fs.aio-max-nr = 1048576
# oracle-rdbms-server-11gR2-preinstall setting for net.ipv4.ip_local_port_range is 9000 65500
net.ipv4.ip_local_port_range = 9000 65500
When we modify such values, we have to be aware of exactly what we apply and also how we apply it. The most important point is to verify the outcome, as described in this blog, whenever we set values in /etc/sysctl.conf. Tuned profiles are a good solution that lets manufacturers and suppliers distribute optimized values with their Linux distributions.
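That verification step can itself be scripted. The sketch below compares the values a sysctl-style file requests with what the kernel actually reports under /proc/sys (a key's dots map to slashes in the /proc path); any override by tuned shows up as a mismatch. The CONF variable and the output wording are illustrative assumptions:

```shell
#!/bin/sh
# Compare requested sysctl values with the values effectively in force.
# On a real box, point CONF at /etc/sysctl.conf (the default below).
CONF="${CONF:-/etc/sysctl.conf}"

if [ -r "$CONF" ]; then
    # Only lines that look like "key = value" (skip comments and blanks).
    grep -E '^[a-z]' "$CONF" | while IFS='=' read -r key want; do
        key=$(echo "$key" | tr -d '[:space:]')
        want=$(echo "$want" | sed 's/^[[:space:]]*//')
        # vm.swappiness -> /proc/sys/vm/swappiness
        path="/proc/sys/$(echo "$key" | tr '.' '/')"
        if [ -r "$path" ]; then
            have=$(cat "$path")
            [ "$have" = "$want" ] || echo "MISMATCH $key: conf says '$want', kernel has '$have'"
        fi
    done
fi
```

Run after every reboot (or from the service desk validation scripts), this catches exactly the kind of silent override described in this blog.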