Infrastructure at your Service

By Franck Pachot

.
Those parameters, SQLNET.EXPIRE_TIME in sqlnet.ora and ENABLE=BROKEN in a connection description exist for a long time but may have changed in behavior. They are both related to detecting dead TCP connections with keep-alive probes. The former from the server, and the latter from the client.

The change in 12c is described in the following MOS note: Oracle Net 12c: Changes to the Functionality of Dead Connection Detection (DCD) (Doc ID 1591874.1). Basically instead sending a TNS packet for the keep-alive, the server Dead Connection Detection now relies on the TCP keep-alive feature when available. The note mentions that it may be required to set (ENABLE=BROKEN) in the connection string “in some circumstances” - which is not very precise. This “ENABLE=BROKEN” was used in the past for transparent failover when we had no VIP (virtual IP) in order to detect a lost connection to the server.

I don’t like those statements like “on some platform”, “in some circumstances”, “with some drivers”, “it may be necessary”… so there’s only one solution: test it in your context.

My listener is on (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=localhost)(PORT=1521))) and I will connect to it and keep my connection idle (no user call to the server).I trace the server (through the forks of the listener, found by pgrep with the name of listener associated with this TCP address) and color it in green (GREP_COLORS=’ms=01;32′):


pkill strace ; strace -fyye trace=socket,setsockopt -p $(pgrep -f "tnslsnr $(lsnrctl status "(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=localhost)(PORT=1521)))" | awk '/^Alias/{print $2}') ") 2>&1 | GREP_COLORS='ms=01;32' grep --color=auto -E '^|.*sock.*|^=*' &

I trace the client and color it in yellow (GREP_COLORS=’ms=01;32′):


strace -fyye trace=socket,setsockopt sqlplus demo/[email protected]"(DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=PDB1))(ADDRESS=(PROTOCOL=tcp)(HOST=127.0.0.1)(PORT=1521)))" <<<quit 2>&1 | GREP_COLORS='ms=01;33' grep --color=auto -E '^|.*sock.*|^=*'

I’m mainly interested by the setsockopt() here because this is how to enable TCP Keep Alive.

(ENABLE=BROKEN) on the client

My first test is without enabling DCD on the server: I have nothing defined in sqlnet.ora on the server side. I connect from the client without mentioning “ENABLE=BROKEN”:


The server (green) has set SO_KEEPALIVE but not the client.

Now I run the same scenario but adding (ENABLE=BROKEN) in the description:


strace -fyye trace=socket,setsockopt sqlplus demo/[email protected]"(DESCRIPTION=(ENABLE=BROKEN)(CONNECT_DATA=(SERVICE_NAME=PDB1))(ADDRESS=(PROTOCOL=tcp)(HOST=127.0.0.1)(PORT=1521)))" <<<quit 2>&1 | GREP_COLORS='ms=01;33' grep --color=auto -E '^|.*sock.*|^=*'

The client (yellow) has now a call to set keep-alive:


setsockopt(9<TCP:[1810151]>, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0

However, as I’ll show later, this uses the TCP defaults:


[[email protected] tmp]$ tail /proc/sys/net/ipv4/tcp_keepalive*
==> /proc/sys/net/ipv4/tcp_keepalive_intvl <== 
75
==> /proc/sys/net/ipv4/tcp_keepalive_probes <== 
9
==> /proc/sys/net/ipv4/tcp_keepalive_time <== 
7200

After 2 hours (7200 seconds) of idle connection, the client will send a probe 9 times, every 75 seconds. If you want to reduce it you must change it on the client system settings. If you don’t add “(ENABLE=BROKEN)” the dead broken connection will not be detected before then next user call, after the default TCP timeout (15 minutes).

That’s only from the client when its connection to the server is lost.

SQLNET.EXPIRE_TIME on the server

On the server side, we have seen that SO_KEEPALIVE is set - using the TCP defaults. But, there, it may be important to detect dead connections faster because a session may hold some locks. You can (and should) set a lower value in sqlnet.ora with SQLNET.EXPIRE_TIME. Before 12c this parameter was used to send TNS packets as keep-alive probes but now that SO_KEEPALIVE is set, this parameter will control the keep-alive idle time (using TCP_KEEPIDL instead of the default /proc/sys/net/ipv4/tcp_keepalive_time).
Here is the same as my first test (without the client ENABLE=BROKER) but after having set SQLNET.EXPIRE_TIME=42 in $ORACLE_HOME/network/admin/sqlnet.ora

Side note: I’ve got the “do we need to restart the listener?” question very often about changes in sqlnet.ora but the answer is clearly “no”. This file is read for each new connection to the database. The listener forks the server (aka shadow) process and this one reads the sqlnet.ora, as we can see here when I “strace -f” the listener but the forked process is setting-up the socket.

Here is the new setsockopt() from the server process:


[pid  5507] setsockopt(16<TCP:[127.0.0.1:1521->127.0.0.1:31374]>, SOL_TCP, TCP_KEEPIDLE, [2520], 4) = 0
[pid  5507] setsockopt(16<TCP:[127.0.0.1:1521->127.0.0.1:31374]>, SOL_TCP, TCP_KEEPINTVL, [6], 4) = 0
[pid  5507] setsockopt(16<TCP:[127.0.0.1:1521->127.0.0.1:31374]>, SOL_TCP, TCP_KEEPCNT, [10], 4) = 0

This means that the server waits for 42 minutes of inactivity (the EXPIRE_TIME that I’ve set, here TCP_KEEPIDLE=2520 seconds) and then sends a probe. Without answer (ack) it re-probes every 6 seconds during one minute (the 6 seconds interval is defined by TCP_KEEPINTVL and TCP_KEEPCNT sets the retries to 10 times). We control the idle time with SQLNET.EXPIRE_TIME and then can expect that a dead connection is closed after one additional minute of retry.

Here is a combination of SQLNET.EXPIRE_TIME (server detecting dead connection in 42+1 minute) and ENABLE=BROKEN (client detecting dead connection after the default of 2 hours):

tcpdump and iptable drop

The above, with strace, shows the translation of Oracle settings to Linux settings. Now I’ll translate to the actual behavior by tracing the TCP packets exchanged, with tcpdump:


sqlplus demo/[email protected]"(DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=PDB1))(ADDRESS=(PROTOCOL=tcp)(HOST=localhost)(PORT=1521)))"
host cat $ORACLE_HOME/network/admin/sqlnet.ora
host sudo netstat -np  | grep sqlplus
host sudo netstat -np  | grep 36316
set time on escape on
host sudo tcpdump -vvnni lo port 36316 \&

“netstat -np | grep sqlplus” finds the client connection in order to get the port and “netstat -np | grep $port” shows both connections (“sqlplus” for the client and “oracleSID” for the server).

I have set SQLNET.EXPIRE_TIME=3 here and I can see that the server sends a 0-length packets every 3 minutes (connection at 14:43:39, then idle, 1st probe: 14:46:42, 2nd probe: 14:49:42…). And each time the client replied with an ACK and then the server knows that the connection is still alive:

Now I simulate a client that doesn’t answer, by blocking the input packets:


host sudo iptables -I INPUT 1 -p tcp --dport 36316 -j DROP
host sudo netstat -np  | grep 36316

Here I see the next probe 3 minutes after the last one (14:55:42) and then, as there is no reply, the 10 probes every 6 seconds:

At the end, I checked the TCP connections and the server one has disappeared. But the client side remains. That is exactly what DCD does: when a session is idle for a while it tests if the connection is dead and closes it to release all resources.
If I continue from there and try to run a query, the server cannot be reached and I’ll hang for the default TCP timeout of 15 minutes. If I try to cancel, I get “ORA-12152: TNS:unable to send break message” as it tries to send an out-of-bound break. SQLNET.EXPIRE_TIME is only for the server-side. The client detects nothing until it tries to send something.

For the next test, I remove my iptables rule to stop blocking the packets:


host sudo iptables -D INPUT 1

And I’m now running the same but with (ENABLE=BROKEN)


connect demo/[email protected](DESCRIPTION=(ENABLE=BROKEN)(CONNECT_DATA=(SERVICE_NAME=PDB1))(ADDRESS=(PROTOCOL=tcp)(HOST=localhost)(PORT=1521)))
host sudo netstat -np  | grep sqlplus
host sudo netstat -np  | grep 37064
host sudo tcpdump -vvnni lo port 37064 \&
host sudo iptables -I INPUT 1 -p tcp --dport 37064 -j DROP
host sudo netstat -np  | grep 37064
host sudo iptables -D INPUT 1
host sudo netstat -np  | grep 37064

Here is the same as before: DCD after 3 minutes idle, and 10 probes that fail because I’ve blocked again with iptables:

As with the previous test, the server connection (the oracleSID) has been closed and only the client one remains. As I know that SO_KEEPALIVE has been enabled thanks to (ENABLE=BROKEN) the client will detect the closed connection:

17:52:48 is 2 hours after the last activity and probes 9 times every 1’15 according to the system defaults:


[[email protected] tmp]$ tail /proc/sys/net/ipv4/tcp_keepalive*
==> /proc/sys/net/ipv4/tcp_keepalive_intvl <==    TCP_KEEPINTVL
75
==> /proc/sys/net/ipv4/tcp_keepalive_probes <==     TCP_KEEPCNT
9
==> /proc/sys/net/ipv4/tcp_keepalive_time <==      TCP_KEEPIDLE

It was long (but you can change those defaults on the client) but finally, the client connection is cleared up (sqlplus not there in the last netstat).
Now, an attempt to run a user call fails immediately with the famous ORA-03113 because the client knows that the connection is closed:

Just a little additional test to show ORA-03135. If the server detects and closes the dead connection, but before the dead connection is detected on the client, we have seen that we wait for a 15 minutes timeout. But that’s because the iptable rule was still there to drop the packet. If I remove the rule before attempting a user-call, the server can be reached (then no wait and timeout) and detects immediately that there’s no endpoint anymore. This raises “connection lost contact”.

In summary:

  • On the server, the keep-alive is always enabled and SQLNET.EXPIRE_TIME is used to reduce the tcp_keepalive_time defined by the system, because it is probably too long.
  • On the client, the keep-alive is enabled only when (ENABLE=BROKEN) is in the connection description, and uses the tcp_keepalive_time from the system. Without it, the broken connection will be detected only when attempting a user call.

Setting SQLNET.EXPIRE_TIME to a few minutes (like 10) is a good idea because you don’t want to keep resources and locks on the server when a simple ping can ensure that the connection is lost and we have to rollback. If we don’t, then the dead connections may disappear only after 2 hours and 12 minutes (the idle time + the probes). On the client-side, it is also a good idea to add (ENABLE=BROKEN) so that idle sessions that have lost contact have a chance to know it before trying to use them. This is a performance gain if it helps to avoid sending a “select 1 from dual” each time you grab a connection from the pool

And, most important: the documentation is imprecise, which means that the behavior can change without notification. This is a test on specific OS, specific driver, specific version,… Do not take the results from this post, but now you know how to check in your environment.

2 Comments

  • Grzegorz Goryszewski says:

    Thank You for in deepth analyzis, btw You can check keepalive timer via
    ss -nop | grep sqlplus
    Regards.
    G

  • Franck Pachot says:

    Hi Grzegorz,
    Thanks a lot. Yes, that’s way much better to check the keep-alive!

    07:28:22 SQL> host ss -nop | grep -E ‘sqlplus|oracle’
    tcp ESTAB 0 0 127.0.0.1:1521 127.0.0.1:10100 users:((“oracle_52358_cd”,pid=52358,fd=16)) timer:(keepalive,116min,0)
    tcp ESTAB 0 0 127.0.0.1:1521 127.0.0.1:57169 users:((“oracle_72335_cd”,pid=72335,fd=16)) timer:(keepalive,28sec,0)
    tcp ESTAB 0 0 127.0.0.1:10100 127.0.0.1:1521 users:((“sqlplus”,pid=52354,fd=9))
     
    07:28:29 SQL> exec null;
     
    PL/SQL procedure successfully completed.
     
    07:28:34 SQL> host ss -nop | grep -E ‘sqlplus|oracle’
    tcp ESTAB 0 0 127.0.0.1:1521 127.0.0.1:10100 users:((“oracle_52358_cd”,pid=52358,fd=16)) timer:(keepalive,116min,0)
    tcp ESTAB 0 0 127.0.0.1:1521 127.0.0.1:57169 users:((“oracle_72335_cd”,pid=72335,fd=16)) timer:(keepalive,22sec,0)
    tcp ESTAB 0 0 127.0.0.1:10100 127.0.0.1:1521 users:((“sqlplus”,pid=52354,fd=9))
     
    07:28:35 SQL> host ss -nop | grep -E ‘sqlplus|oracle’
    tcp ESTAB 0 0 127.0.0.1:1521 127.0.0.1:10100 users:((“oracle_52358_cd”,pid=52358,fd=16)) timer:(keepalive,116min,0)
    tcp ESTAB 0 0 127.0.0.1:1521 127.0.0.1:57169 users:((“oracle_72335_cd”,pid=72335,fd=16)) timer:(keepalive,19sec,0)
    tcp ESTAB 0 0 127.0.0.1:10100 127.0.0.1:1521 users:((“sqlplus”,pid=52354,fd=9))
     
    07:28:38 SQL> host ss -nop | grep -E ‘sqlplus|oracle’
    tcp ESTAB 0 0 127.0.0.1:1521 127.0.0.1:10100 users:((“oracle_52358_cd”,pid=52358,fd=16)) timer:(keepalive,116min,0)
    tcp ESTAB 0 0 127.0.0.1:1521 127.0.0.1:57169 users:((“oracle_72335_cd”,pid=72335,fd=16)) timer:(keepalive,14sec,0)
    tcp ESTAB 0 0 127.0.0.1:10100 127.0.0.1:1521 users:((“sqlplus”,pid=52354,fd=9))

    Cheers,
    Franck.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Franck Pachot
Franck Pachot

Principal Consultant / Database Evangelist