Infrastructure at your Service

David Barbarin

SQL Server on Linux – I/O internal thoughts

Let’s start the story from the beginning with some funny testing I tried to perform with SQL Server on Linux a couple of months ago. At that time, I wanted to get some pictures of syscalls from SQL Server as I already did in a past on Windows side with sysinternal tools as procmon and stack traces. On Linux strace is probably one of the best counterparts.

blog 140 - 0 - 0 -  banner

Firstly, please note this blog is just from my own researches and it doesn’t constitute in any cases an official documentation from Microsoft and may lead likely to some inaccuracies. This kind of write up is definitely not easy especially when you’re not the direct developer of the product and because things change quickly nowadays making at the same time your blog post biased :) Anyway, I was just curious to figure out how SQL Server deals with I/O on Linux side and the main safety point here is certainly to show how you may achieve it on Linux. So let’s start from the beginning with already what we know on the Windows operating system: The SQL Server engine goes through Win32 API and functions like CreateFile(), ReadFile(), WriteFile() to deal with I/O but let’s focus specifically on the CreateFile() function here. CreateFile() is used to create or to open an existing file or an I/O device with some specific flags. Some of them as FILE_FLAG_WRITE_THROUGH are used to meet the Write-Ahead Logging (WAL) Protocol by bypassing all system / disk caches (cf. Microsoft article).

On April 10 2018 I did my first tests on the Linux side with SQL Server 2017 CU5 and here was my first strace output after creating dbi_db database:

blog 140 - 0 - 2 - strace stack twitter

blog 140 - 0 - 1 - strace stack

It was an expected output for me because on Linux SQL Server uses an low-level open() system call – that is the counterpart of createfile() on Windows – but the surprising thing was with O_DIRECT flag only. I’m not a system developer and from my position, I may understand benefits from using O_DIRECT with database systems because it is driving by AIO (asynchronous I/O) and by the fact we may completely bypass any kernel space buffers (by default on Linux). I get the opportunity to thanks @dbaffaleuf with our interesting discussions on this topic

But referring to the documentation we may read the following sample as well about O_DIRECT:

File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred

From my understanding using O_DIRECT that implies durable writes on block devices are not guaranteed and from my trace I noticed the transaction log seemed to be open with O_DIRECT in this case …

2837  20:44:32 open("/u02/sqlserverlog/mssqlserver/dbi_test_log.ldf", O_RDWR|O_CREAT|O_EXCL|O_DIRECT, 0660) = 187
2837  10:13:09 fstat(187, {st_mode=S_IFREG|0660, st_size=0, ...}) = 0

 

… From a WAL protocol perspective using such flag might lead to not meet the requirements because we might experience data loss in case of a system / storage outage unless either implementing another low-level mechanism like fsync() for transaction log and checkpoints or to be sure the storage guarantees writes are O_DIRECT safe. At this stage I expected to find out more fsync() related entries in my trace but no chance as shown below (I put only a sample but in fact no other relevant syscalls in the entire trace that might indicate forcing synchronization stuff)

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 72.59   12.993418        2089      6220      2503 futex
 10.78    1.929811        7509       257           epoll_wait
  9.08    1.625657       47813        34        28 restart_syscall
  3.28    0.587733        8643        68           io_getevents
  1.97    0.351806       70361         5           nanosleep
  1.92    0.344254       34425        10        10 rt_sigtimedwait
  0.32    0.056943           3     22116           clock_gettime
  0.02    0.003530         101        35           munmap
  0.01    0.002149          32        67           io_submit
  0.01    0.001706           6       273           epoll_ctl
  0.01    0.000897           6       154           read
  0.00    0.000765          11        68           writev
  0.00    0.000605           4       136        68 readv
  0.00    0.000591           4       137           write
  0.00    0.000381          10        40        23 open
  …
------ ----------- ----------- --------- --------- ----------------
100.00   17.900633                 29827      2638 total

 

A couple of weeks ago, I wanted to update my test environment with SQL Server 2017 CU8 (on Linux) I noticed the following messages (in the red rectangle):

blog 140 - 2 - forceflush

 

Hmm .. that was pretty new and interesting and in fact, the aforementioned messages were related to this Microsoft article (Thanks @Bobwardms to pointed me out). This new behavior is available since SQL Server 2017 CU6 and the article describes that how Microsoft has introduced a change with a new “forced flush” mechanism for SQL Server on Linux system.

In a nutshell for all scenarios, a new flush mechanism guarantees data is safely written to a stable media for transaction logs and during checkpoints. Let’s dig further into the both methods.

Let’s say first I applied the same pattern for all the tests that follow. The test’s protocol included one dbo.t1 table with only one column (id int). I inserted for each test a bunch of data rows (67 rows to be more precise related to 67 distinct implicit transactions) without any other user concurrent activities. It remains some internal stuff from SQL Server but I guess we may consider them as negligible compared to my tests.

insert dbo.t1 values (1)
go 67

 

  • Default forced flush mechanism behavior

 

In this scenario referring to my strace file output database files are still open with O_DIRECT only as shown below (my database’s name is toto this time)

4745  10:54:34 open("/u01/sqlserverdata/mssqlserver/toto.mdf", O_RDWR|O_CREAT|O_EXCL|O_DIRECT, 0660) = 188

 

I also used the following command to get directly a picture summary of the number of calls per syscall

$ sudo strace -f -c -o sql_strace.txt $(pidof sqlservr |sed 's/\([0-9]*\)/\-p \1/g')

 

Here the sample output I got:

$ cat sql_strace.txt
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.36   24.078231        2685      8969      3502 futex
 12.13    4.093680      120402        34        30 restart_syscall
 10.03    3.384817       12583       269           epoll_wait
  2.67    0.901541       13258        68           io_getevents
  1.77    0.598830       46064        13        13 rt_sigtimedwait
  1.66    0.560893       93482         6           nanosleep
  0.23    0.077873           2     31422           clock_gettime
  0.09    0.030924         462        67           fsync
  0.01    0.003212         321        10           madvise
  0.01    0.003026          22       136           write
  …
------ ----------- ----------- --------- --------- ----------------
100.00   33.742080                 42195      3651 total

 

This time I noticed additional calls of fsync() to guarantee writes on blocks – 67 calls that seem to be related to 67 transactions right? I double checked in my strace output and it seems that is the case but I may be wrong on this point so please feel free to comment. Another interesting point is that I also continued to notice asynchronous IO from io_getevents function that appeared from my trace. That make sense for me. Writes of data are asynchronous while those on transaction logs are synchronous by design. In this mode fsync() is triggered during transaction commits and checkpoints.

 

  • Enabling trace flag 3979

Enabling trace flag 3979 has effect to disable the forced flush behavior replaced by writethrough and alternatewritethrough options. Referring to the Microsoft article the former will translate the well-known FILE_FLAG_WRITE_THROUGH flag requests into O_DSYNC opens but with some performance optimization stuff by using fdatasync() rather than fsync() Indeed, fdatasync() is supposed to generate less I/O activity because it doesn’t require to synchronize file metadata (only the data portion of the file is concerned here).

Anyway, my strace sample file output below confirmed that both data file and transaction log were open with both O_DIRECT and O_DSYNC meaning we are bypassing the kernel buffer space and we are also forcing synchronous I/O for both files.  Does it matter? Well, from my understanding writing dirty data pages is still an asynchronous process when checkpoints occur. How much better it performs in this case? I don’t know and it will require likely another bunch of strong tests on different use cases –

5279  11:07:36 open("/u01/sqlserverdata/mssqlserver/dbi_test.mdf", O_RDWR|O_CREAT|O_EXCL|O_DSYNC|O_DIRECT, 0660) = 185
…
5279  11:07:38 open("/u02/sqlserverlog/mssqlserver/dbi_test_log.ldf", O_RDWR|O_CREAT|O_EXCL|O_DSYNC|O_DIRECT, 0660) = 185

 

As previously I noticed 67 calls of fdatasync() (and not anymore fsync() here) related likely to my 67 implicit transactions and I still continued to notice asynchronous IO driving by io_getevents() or io_submit() syscalls.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 73.90   12.678046        2360      5371      1927 futex
 10.80    1.853571        7788       238           epoll_wait
  9.05    1.551962       45646        34        30 restart_syscall
  1.93    0.331275       47325         7         7 rt_sigtimedwait
  1.84    0.316524      105508         3           nanosleep
  1.61    0.276806        4131        67           io_getevents
  0.71    0.121815           7     18259           clock_gettime
  0.06    0.010184         152        67           fdatasync
  0.02    0.003851          26       146           read
  0.02    0.003217          12       272           epoll_ctl
  0.01    0.002139          32        67           io_submit
  0.01    0.002070          15       136           write
  …
------ ----------- ----------- --------- --------- ----------------
100.00   17.156630                 25170      2054 total

 

 

Finally, at the moment of this write up, let’s say that Microsoft recommends enabling the trace flag 3979 as well as changing the default values of writethrough and alternatewritethrough options to 0 to revert back to old behavior before CU6 illustrated in the first section of this blog post but only in the case your storage guarantees your writes will be “O_DIRECT” safe. I think you may understand why now – if my understanding of Linux I/O is obviously correct- :)

See you!

 

2 Comments

  • Joe Chang says:

    sorry that this is only partially pertaining to SQL on Linux IO, but …
    There are published TPC-H results for SQL Server 2017 on both Windows and Linux and at the same scale factor (SF 3000GB) and processors, the Windows result by HPE and the Linux result by Cisco, and memory at 1.5TB.
    Though the database scale factor is 3000GB, in column store, the database size might be about 1.5TB? So after the initial data load in
    The over all result QphH, power and throughput metrics are similar. But consider that TPC-H is scored a geometric mean. This means that proportionate gains in small (quick running) queries have as much impact on the score as similar proportionate gains on large queries.

    Back when systems have 4-8 processor cores total, this might have been reasonable method. The systems used for this test both have two 28-core processors for 56 cores total, 112 threads (HT enabled).
    TPC-H query 6 ran in 0.4 sec on the Linux system and 0.6 sec on the Windows system at DOP 112 (HT improve performances by about 10% in TPC-H, so DOP 56 would be slightly slower). There are 3 other queries that ran in less than 2.3 sec.
    The point is that scaling to high DOP might be very effective in the big queries that run for several seconds, but not as effective on short run time queries.

    The linear sum of the 22 TPC-H is 282.7 sec for Windows, and 256.8 sec for Linux, as a result of Linux supporting faster run times on several of the large queries. The reason the composite metric is similar is because Q16 ran in 4.3 sec on Windows and 14.3 sec on Linux, negating the gains on the other queries.

Leave a Reply

David Barbarin
David Barbarin

Senior Consultant & Microsoft Technology Leader