Infrastructure at your Service

Franck Pachot

TRANSPORT_CONNECT_TIMEOUT and RETRY_COUNT

By Franck Pachot

.
When you have a Data Guard configuration, you want the application to connect to the right server, where the primary is, without taking too much time. The default TCP timeout is 1 minute which is too long. When you don’t want to configure a virtual IP address (VIP) you can simply list all the addresses in the client connection string. But then you need to reduce the timeout. A short duration in 1 to 5 seconds will be ok most of the time, but in case of network issue, you want to give a chance to retry with a longer timeout. This post is about the connection string parameters to define this. Of course, all is documented but the goal of this post is also to show how to quickly test it. Because a reliable understanding of how it works relies on both documentation and test.

Here is a simple client failover configuration where the connection tries 10.10.10.10 and, if it fails, tries 10.10.10.11


DEFAULT=
 (DESCRIPTION=
  (CONNECT_DATA=(SERVICE_NAME=pdb1))
  (ADDRESS_LIST=
   (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.10)(PORT=1521))
   (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.11)(PORT=1521))
  )
 )

The problem with that is when the 10.10.10.10 is down then the 10.10.10.11 will be tried only after 60 seconds, the default TCP timeout. You can completely avoid waiting for the timeout by using a virtual IP that will always be up, started on the failed-over server. But you can also reduce the TCP timeout to a few seconds.

Here is a tnsping with the above tnsnames.ora entry and when both servers are down:


$ time tnsping DESCRIPTION
 
TNS Ping Utility for Linux: Version 18.0.0.0.0 - Production on 10-AUG-2018 15:15:55
 
Copyright (c) 1997, 2018, Oracle.  All rights reserved.
 
Used parameter files:
 
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (CONNECT_DATA=(SERVICE_NAME=pdb1)) (ADDRESS_LIST= (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.10)(PORT=1521)) (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.11)(PORT=1521))))
TNS-12535: TNS:operation timed out
 
real    2m0.051s
user    0m0.005s
sys     0m0.011s

That’s 2 minutes because there is a 1 minute timeout for each address.

TRANSPORT_CONNECT_TIMEOUT

Now, just adding the TRANSPORT_CONNECT_TIMEOUT to the connection string description to reduce the timout to 4 seconds:


DESCRIPTION=
 (DESCRIPTION=
  (CONNECT_DATA=(SERVICE_NAME=pdb1))
  (TRANSPORT_CONNECT_TIMEOUT=4)
  (ADDRESS_LIST=
   (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.10)(PORT=1521))
   (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.11)(PORT=1521))
  )
 )

The total time to get the answer from both addresses is 8 seconds – 4 second for each:


$ time tnsping DESCRIPTION
 
TNS Ping Utility for Linux: Version 18.0.0.0.0 - Production on 10-AUG-2018 15:15:55
 
Copyright (c) 1997, 2018, Oracle.  All rights reserved.
 
Used parameter files:
 
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (CONNECT_DATA=(SERVICE_NAME=pdb1)) (TRANSPORT_CONNECT_TIMEOUT=4) (ADDRESS_LIST= (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.10)(PORT=1521)) (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.11)(PORT=1521))))
TNS-12535: TNS:operation timed out
 
real    0m8.023s
user    0m0.010s
sys     0m0.006s

RETRY_COUNT

If you lower the timeout, you may give a chance to retry a few times with RETRY_COUNT. There, RETRY_COUNT=2 will give 3 attempts ( 1 + 2 retries ) to the address list:


$ time tnsping RETRY_COUNT
 
TNS Ping Utility for Linux: Version 18.0.0.0.0 - Production on 10-AUG-2018 15:49:34
 
Copyright (c) 1997, 2018, Oracle.  All rights reserved.
 
Used parameter files:
 
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (CONNECT_DATA=(SERVICE_NAME=pdb1)) (TRANSPORT_CONNECT_TIMEOUT=4) (RETRY_COUNT=2) (ADDRESS_LIST= (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.10)(PORT=1521)) (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.11)(PORT=1521))))
TNS-12535: TNS:operation timed out
 
real    0m24.049s
user    0m0.011s
sys     0m0.010s

This has tried 10.10.10.10 and then 10.10.10.11 for 4 seconds each, and then retried 2 times wich in total takes 4+4+2*(4+4)=24 seconds

DESCRIPTION_LIST

The TRANSPORT and RETRY_COUNT are used only in the DESCRIPTION. You may want to give several attempts with an increasing timeout. For example: try each address for one second to get a quick connection to the primary, wherever it is, when the network is in good health. Then give two attempts with a 5 seconds timeout for bad network times. And then one final attempt to each with the default timeout to be sure that the servers are down.

You can use a DESCRIPTION_LIST for this:


INCREASING=
 (DESCRIPTION_LIST=
  (LOAD_BALANCE=off)
  (DESCRIPTION=
   (CONNECT_DATA=(SERVICE_NAME=pdb1))
   (TRANSPORT_CONNECT_TIMEOUT=1)
   (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.10)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.11)(PORT=1521))
   )
  )
  (DESCRIPTION=
   (CONNECT_DATA=(SERVICE_NAME=pdb1))
   (TRANSPORT_CONNECT_TIMEOUT=5)
   (RETRY_COUNT=1)
   (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.10)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.11)(PORT=1521))
   )
  )
  (DESCRIPTION=
   (CONNECT_DATA=(SERVICE_NAME=pdb1))
   (TRANSPORT_CONNECT_TIMEOUT=2)
   (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.10)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.11)(PORT=1521))
   )
  )
 )

Rather than just time the total attempts, I’ll strace each connections:


$ strace -tT tnsping INCREASING 2>&1 | grep -C1 --color=auto -E 'poll.*|inet_addr[()".0-9]*'
 
16:15:49 fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000008>
16:15:49 connect(4, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("10.10.10.10")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000087>
16:15:49 times(NULL)                    = 434920117 <0.000011>
16:15:49 mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efce31bc000 <0.000013>
16:15:49 poll([{fd=4, events=POLLOUT}], 1, 1000) = 0 (Timeout) <1.001435>
16:15:50 close(4)                       = 0 <0.000256>
--
16:15:50 fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000060>
16:15:50 connect(4, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("10.10.10.11")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000495>
16:15:50 times(NULL)                    = 434920218 <0.000062>
16:15:50 poll([{fd=4, events=POLLOUT}], 1, 1000) = 0 (Timeout) <1.000768>
16:15:51 close(4)                       = 0 <0.000050>
--
16:15:51 fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000015>
16:15:51 connect(4, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("10.10.10.10")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000060>
16:15:51 times(NULL)                    = 434920318 <0.000010>
16:15:51 poll([{fd=4, events=POLLOUT}], 1, 5000) = 0 (Timeout) <5.005563>
16:15:56 close(4)                       = 0 <0.000027>
--
16:15:56 fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000012>
16:15:56 connect(4, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("10.10.10.11")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000081>
16:15:56 times(NULL)                    = 434920819 <0.000015>
16:15:56 poll([{fd=4, events=POLLOUT}], 1, 5000) = 0 (Timeout) <5.006265>
16:16:01 close(4)                       = 0 <0.000192>
--
16:16:01 fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000079>
16:16:01 connect(4, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("10.10.10.10")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000486>
16:16:01 times(NULL)                    = 434921320 <0.000087>
16:16:01 poll([{fd=4, events=POLLOUT}], 1, 5000) = 0 (Timeout) <5.004660>
16:16:06 close(4)                       = 0 <0.000611>
--
16:16:06 fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000114>
16:16:06 connect(4, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("10.10.10.11")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000536>
16:16:06 times(NULL)                    = 434921822 <0.000097>
16:16:06 poll([{fd=4, events=POLLOUT}], 1, 5000) = 0 (Timeout) <5.008128>
16:16:11 close(4)                       = 0 <0.000135>
--
16:16:11 fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000137>
16:16:11 connect(4, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("10.10.10.10")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000584>
16:16:11 times(NULL)                    = 434922323 <0.000079>
16:16:11 poll([{fd=4, events=POLLOUT}], 1, 60000) = 0 (Timeout) <60.053782>
16:17:11 close(4)                       = 0 <0.000166>
--
16:17:11 fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000195>
16:17:11 connect(4, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("10.10.10.11")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000549>
16:17:11 times(NULL)                    = 434928329 <0.000488>
16:17:11 poll([{fd=4, events=POLLOUT}], 1, 60000) = 0 (Timeout) <60.007246>
16:18:11 close(4)                       = 0 <0.000043>

With ‘-T’ strace shows the duration of the poll() system call between brackets after the return code. You can see here 1-second timeout attempts to each address, then 2 attempts with 5 seconds timeout and then 60 seconds.

Note that I have added (LOAD_BALANCE=OFF) here because the default is ON in a DESCRIPTION_LIST but here I want to take them in the order I specified them.

8 Comments

  • Jan Schnackenberg says:

    Very interesting idea. I’ll definitely put this “multiple DESCRIPTION” schema into my toolbox.

    I’ve had this issue (what happens in case of the complete unavailability of the primary) before and couldn’t think of a viable solution. We had setup a way to distribute changed TNSNAMES.ORA files in case of a failover, because the application was very sensitive to long connection times.

    Thank you very much for this idea.

  • Tarun says:

    RETRY_COUNT:

    Last line 4+4+2*(4+4)=24
    I think might be corrected

  • Bernd says:

    Any idea why the drivers are not remembering which host failed? This way each connection in a pool would have to discover this again. Would some state be hold with FAN active (as it needs to remeber host states anyway)?

  • Piotr says:

    Are you really sure that this time is in seconds? Not in miliseconds. Java doc for methods used to get connection is saying that this should be in ms.

  • Franck Pachot says:

    Hi Piotr,
    TRANSPORT_CONNECT_TIMEOUT can specify the unit (like adding ms, sec, or min) but the default is seconds.
    Which java doc are you referring to?
    Franck.

  • Piotr says:

    Hi Franck,
    That is for newest driver, but in 12.x I’ve end up in java.util.Timer in method:
    /**
    * Schedules the specified task for execution after the specified delay.
    *
    * @param task task to be scheduled.
    * @param delay delay in milliseconds before task is to be executed.
    * @throws IllegalArgumentException if delay is negative, or
    * delay + System.currentTimeMillis() is negative.
    * @throws IllegalStateException if task was already scheduled or
    * cancelled, timer was cancelled, or timer thread terminated.
    * @throws NullPointerException if {@code task} is null
    */
    public void schedule(TimerTask task, long delay) {
    if (delay < 0)
    throw new IllegalArgumentException("Negative delay.");
    sched(task, System.currentTimeMillis()+delay, 0);
    }
    I will try to migrate to the newest one then and verify this.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Franck Pachot
Franck Pachot

Principal Consultant / Database Evangelist
 Oracle ACE Director
 Oracle Database OCM 12c certified
 AWS Database Specialty certified
 Oak Table member

RSS for this blog: feed
Twitter: @FranckPachot
LinkedIn: www.linkedin.com/in/franckpachot
Podcast en français: DBPod