Infrastructure at your Service

Cesare Cervini

An SQLite extension for gawk (part I)

Quick: what is the most used database management system on our planet ? Oracle ? Wrong. SQL server ? Wrong again ! MySQL ? You’re almost there. It’s SQLite. Surprised ? I must confess that I was too. Actually, SQLite is special in that it is not the traditional 2-tiers client/server but one-tier and embedded, which means that it works as a library linked to an application. As such, it is used to fulfill the database needs of browsers, portable devices such as the iPods, iPhones, Android, etc… (see a short list of famous users here). Look also here for a succinct intro and here for a list of features. Here you’ll find distinctive features of SQLite and here common uses for it.
Let’s be clear from the start: although light, this is no toy software but a solid, time-proven, globally used, rigorously tested open-source product.
So, what the relation with gawk ? Well, none. Until now. As you may know, gawk has had for some time now an easy way to be extended with useful libraries. I already talked about this in my previous blogs, e.g. here. In particular, it has also a binding for PostgreSQL (see here). So, I told to myself, wouldn’t it be nice to make one for Oracle too ? Or for some key-value xDBMs such as Berkeley DB ? But fate decided otherwise: I already used SQLite in the past as an almost no-installation SQL database and during the present research I landed fortuitously on SQLite’s web pages. I was immediately hooked on. SQLite is petite (500 KiB in one unique source file for the whole RDBMS library and about 1 Mb for the shared library object file, unbelievable !), easy to learn and use, and open-source. Actually, the SQLite creators propose such an unorthodox license agreement that I feel compelled to list it here:

May you do good and not evil
May you find forgiveness for yourself and forgive others
May you share freely, never taking more than you give.

So inspirational and quite a depart from the traditional, indigestible EULAs ! There is enough here to push the religion business to bankruptcy. And the lawyers. And make our sorry planet a paradise again.
In effect, the only data structure known to gawk is the associative array, or hashes in perl parlance, or dictionaries for pythonists. And they are entirely held in memory. I thought it would be a nice addition to gawk to be able to work with on-disk tables for those cases where huge amount of textual data have to be processed in random order.
As this article is rather large, I’ve split it into 3 parts. Part I, the one you’re reading now, presents the requirements and proposes an API. Part II lists the extension code and shows how to compile and use it in gawk. Part III comments the code and describes a stress test for the interface and SQLite from within a gawk script. So, let’s see how I hammered that screw !

The Objectives

At the very minimum, the gawk interface to SQLite (for short sqlite_gawk henceforth) shall be able to open a database file (a SQLite database is entirely self-contained in a single file, I told you it’s light), send DML/DDL statements to it and run SELECT queries against it. The SELECT statement shall be able to present the result in a nice and flexible way to avoid messing up the screen with wrapped-around, illegible lines, i.e. the columns’ width shall be individually configurable and some truncation, with or without an ellipsis, or wrap-around be possible within those limits.
Also, as SQLite supports blobs, sqlite_gawk should be able to deal with them, i.e. insert, retrieve and display them if they are textual. SQLite comes with an interactive shell functionally akin to Oracle’s sqlplus and named sqlite3 which extends the collection of SQL functions (yes, SQLite allows that too). The SQL functions the shell adds are readfile() and writefile() to read the content of a file into a blob, respectively dump a blob into a file. We definitively want that functionality in the interface too.
Another requirement is that, along with displaying the retrieved rows on the screen, sqlite_select shall be able to store them into an integer-indexed gawk array of associative arrays. The goal is to permit further in-memory processing from within gawk.

The API

All the above requirements converge into the following API:

int do_sqlite_open(db_filename)
-- opens the database file db_filename;
-- in case of success, returns a non-negative integer which is a handle to the db;
-- returns -1 and sends to stderr an error message from SQLite if an error occured;

int sqlite_close(db_handle)
-- closes the database whith db_handle handle;
-- returns 0 is success and -1 plus an error message from SQLite to stderr in case of error;

int sqlite_exec(db_handle, sql_stmt)
-- sends the DML/DDL SQL statement sql_stmt to the database with db_handle handle;
-- returns -1 plus an error message from SQLite to stderr in case of error, a non-negative integer from SQLite if success;

int sqlite_select(db_handle, select_stmt [,"" | separator_string | list-of-columns-widths | , dummy, gawk_array])
-- sends the SELECT select_stmt statement with or without formatting options;
-- the output is either displayed or sent into a gawk array of associative arrays;

sqlite_exec takes INSERT, DELETE, UPDATE, i.e. all SQL statements different from SELECT. In addition, as said above, INSERT, UPDATE and SELECT also accept the SQL extensions readfile() and writefile() for blob I/Os from/to file. Here are a few examples of usage from gawk:

rc = sqlite_exec(my_db, "INSERT INTO test_with_blob(n1, my_blob) VALUES(1000, readfile('/home/dmadmin/setup_files/instantclient-basic-linux.x64-12.2.0.1.0.zip'))"

rc = sqlite_select(my_db, "SELECT n1, writefile('blob_2000.dmp', my_blob) FROM test_with_blob where n1 = 2000 limit 1")

rc = sqlite_exec(my_db, "UPDATE test_with_blob set my_blob = readfile('/home/dmadmin/dmgawk/gawk-4.2.1/extension/sqlite_gawk.c') where n1 = 1000")

sqlite_select()

As expected from functions that must produce human-readable output, this is the most feature rich, and complex, function of the interface:

int sqlite_select(db_handle, select_stmt [, "" | , "col_separator_string" | , "list-of-columns-formats" | , dummy, gawk_array])

This compact syntax can be split into the following five acceptations:

int sqlite_select(db_handle, select_stmt)
int sqlite_select(db_handle, select_stmt, "")
int sqlite_select(db_handle, select_stmt, "col_separator_string")
int sqlite_select(db_handle, select_stmt, "list-of-columns-formats")
int sqlite_select(db_handle, select_stmt, dummy, gawk_array)

The function sqlite_select is overloaded and takes from 2 to 4 parameters; when the arities are identical (variants 2 to 4), the format of the 3rd parameter makes the difference. Let’s see what service they provide.

int sqlite_select(db_handle, select_stmt)

The first select outputs its result as a table with fixed column-widths; those widths are from 8 to 15 characters wide. Its purpose is to give a quick overview of a query’s result without messing up the screen with long, wrapped around lines. Here is an example with fake data:

sqlite_select(my_db, 'SELECT * FROM test1')
n1 s1 s2
-------- -------- --------
100 hello1 hello...
200 hello2 hello...
300 hello3 hello...
400 hello4 hello...
400 hello... hello...
400 hello... hello...
6 rows selected
3 columns displayed

We can see that too large columns are truncated and an ellipsis string (…) is appended to them to show this fact.
This quick overview terminates with a table of optimum columns widths so that, if used later, screen width permitting, the columns can be displayed entirely without truncation.

Optimum column widths
=====================
for query: SELECT * FROM test1
n1 3
s1 18
s2 21

This variant is simple to use and handy for having a quick peek at the data and their display needs.
The 4th sqlite_select() variant let us provide column formats. Here is an example of how to do that with previous optimum column widths:

sqlite_select(my_db, 'SELECT * FROM test1', "3 18 21")
n1 s1 s2
--- ------------------ ---------------------
100 hello1 hello0101
200 hello2 hello0102
300 hello3 hello0103
400 hello4 hello0104
400 hello5 with spaces hello0105 with spaces
400 hello6 with spaces hello0106
6 rows selected

The columns are now displayed without truncation.

int sqlite_select(db_handle, select_stmt, "")

The second variant takes en empty string as the 3rd parameters, which means “use | as a column separator”. Here is an example of output:

sqlite_select(my_db, 'SELECT * FROM test1', "")
n1|s1|s2
100|hello1|hello0101
200|hello2|hello0102
300|hello3|hello0103
400|hello4|hello0104
400|hello5 with spaces |hello0105 with spaces
400|hello6 with spaces |hello0106
6 rows selected

If such a default character is not appropriate, a more suitable string can be provided, which is the purpose of sqlite_select() 3rd variant, e.g. ‘||’ as shown below:

sqlite_select(my_db, 'SELECT * FROM test1', "||")
n1 ||s1 ||s2
--------||--------||--------
100 ||hello1 ||hello...
200 ||hello2 ||hello...
300 ||hello3 ||hello...
400 ||hello4 ||hello...
400 ||hello...||hello...
400 ||hello...||hello...
6 rows selected
3 columns displayed

Since this format is easy to parse, it is handy for exporting the data into a file and subsequently import them into a spreadsheet program.

int sqlite_select(db_handle, select_stmt, "list-of-columns-formats")

We’ve already seen the function’s 4th variant but there is more to the column formats.
The parameter “list-of-columns-formats” is a comma- or space-separated ordered list of numeric values, the column widths, one number for each column in the SELECT clause of that statement. If they are too many values, the superfluous ones are ignored. If they are fewer, the last one is extended to cover for the missing values.
They can also end with one of t, e or w characters where t stands for t(runcation), e stands for e(llipsis) suffix if truncation and w stands for w(rap-around).
The minimal width is 3 characters if an ellipsis is requested to accommodate the suffix itself.
Here is an example of invocation and output:

sqlite_select(my_db, 'SELECT * FROM test1', "2e 15e 10w")
n1 s1 s2
--- --------------- ----------
100 hello1 hello0101
200 hello2 hello0102
300 hello3 hello0103
400 hello4 hello0104
400 hello5 with ... hello0105
with space
s
400 hello6 with ... hello0106
400 hello6-with-... hello0106-
12345
7 rows selected

Here, we want the first column to be displayed in a 3-character wide field with truncation allowed and an ellipsis suffix. No truncation occurred in this column.
The second column should be displayed in a 15-character wide column also with ellipsis as suffix if truncation, which is visible here.
The third column is displayed in a 10-character wide column with wrapping-around.
This variant attempts to emulate some of the flexibility of Oracle sqlplus “col XX format YY” command.

int sqlite_select(db_handle, select_stmt, dummy, gawk_array)

Finally, the last acceptation below does not output anything on the screen but fills in a gawk array with the query’s result.
A dummy parameter has been introduced to allow resolving the overloaded function and invoke the expected code. It can be set to any value since this formal parameter is ignored or, as the say, is reserved for future use.
Here is an example of invocation:

sqlite_select(my_db, 'SELECT * FROM test1', 0, a_test)
6 rows selected

If we iterate and print the gawk array:

   for (row in a_test) {
      printf("row %d: ", row)
      for (col in a_test[row])
         printf("  %s = %s", col, a_test[row][col])
      printf "\n" 
   }
   printf "\n"

, we can guess its structure:

row 0: n1 = 100 s1 = hello1 s2 = hello0101
row 1: n1 = 200 s1 = hello2 s2 = hello0102
row 2: n1 = 300 s1 = hello3 s2 = hello0103
row 3: n1 = 400 s1 = hello4 s2 = hello0104
row 4: n1 = 400 s1 = hello5 with spaces s2 = hello0105 with spaces
row 5: n1 = 400 s1 = hello6 with spaces s2 = hello0106

As we can see, the array’s structure is the following (say the query returns a table of count rows and ncolumns):

array[0] = sub-array_0
array[1] = sub-array_1
...
array[count-1] = sub-array_count-1
where array is indexed by an integer and the sub-arrays are associative arrays with following composition:
sub-array0[col0] = value0,0
sub-array0[col1] = value0,1
...
sub-array0[coln-1] = value0,n-1
sub-array1[col0] = value1,0
...
sub-array1[coln-1] = value1,n-1
...
sub-arraycount-1[col0] = valuecount-1,0
...
sub-arraycount-1[coln-1] = valuecount-1,n-1

Said otherwise, the returned array is an integer-indexed array of associative arrays; its first dimension contains the rows and its second dimension contains the columns, i.e. it’s a table of database rows and value columns.
This feature is very interesting but comes with some limitation since all the data are held in memory. Maybe a future release of gawk will be able to transparently paginate gawk arrays to/from disk providing a kind of virtual memory for them. Its implementation could even use SQLite. The problem turns out to map multi-dimensional associative arrays onto relational tables. Some performance degradation is expected, unless an efficient pagination mechanism is introduced. Documentum implemented the mapping partially for repeating attributes, i.e. flat arrays of values, with joins between _s and _r tables, but there is much more to it in gawk’s array. This would be a fascinating subject for another blog on its own as one can imagine.
An implicit and free-of-charge benefit of this implementation would be persistence. Another would be serialization.
For the time being, if a query returns too many rows to be held in memory at once, it may be better to first print them into a file as delimited values (use variants 2 or 3 of sqlite_select for that, as describe above) and later work on them sequentially from there, with as many passes as needed. Or use the SQLite shell and work the data on-the-fly. e.g.:

cat - << EoQ | sqlite3 | gawk -v FS="|" '{
# do something, e.g.:
print "row", NR, $0
}'
.open my_db
.separator |
select * from test1;
.exit
EoQ
row 1 100|hello1|hello01001
row 2 200|hello2|hello01002
row 3 300|hello3|hello01003

How could Documentum benefit of SQLite ?

Whereas it would not be realistic to use SQLite to store documents’ metadata as a replacement for a full 2-tiers RDBMS, it could still find a purpose within Documentum.
Documentum, and lots of other software, stores its configuration files, such as the server.ini and the old dmcl.ini, in ini files, e.g.:

[SERVER_STARTUP]
docbase_id = 1000000
docbase_name = mydocbase
database_name = mydb
database_conn = dbconn
database_owner = dmadmin
database_password_file = /home/dmadmin/dba/mydocbase/dbpasswd.txt

[DOCBROKER_PROJECTION_TARGET]
host = myhost

[DOCBROKER_PROJECTION_TARGET_n]
#n can be 0-49
key=value

[FUNCTION_SPECIFIC_STORAGE]
#Oracle & DB2 only
key=value

[TYPE_SPECIFIC_STORAGE]
key=value
#Oracle & DB2 only

[FUNCTION_EXTENT_SIZE]
key=value
#Oracle only

[TYPE_EXTENT_SIZE]
key=value

There are also key-value files such as the dfc.properties.
By storing these data inside a SQLite table, in its own database for example (databases are so cheap in SQlite, just one file), common typos could be avoided. A classic error with ini files in particular consists in misplacing a setting in the wrong section; such errors are not easy to spot because Documentum silently ignores them.
Consider this snippet from dfcfull.properties for instance:

# ACS configuration
# =================
# Preferences prefixed dfc.acs are used by dfc for distributed content services 
# ACS.                                                                          


# Defines how often dfc verifies acs projection, specified in seconds.          
# min value:  0, max value: 10000000
# 
dfc.acs.avail.refresh_interval = 360


# Indicates whether to verify if ACS server projects.                           
# 
dfc.acs.check_availability = true


# Defines how often dfc verifies that docbase related config objects are 
# modified, specified in seconds.                                               
# min value:  0, max value: 10000000
# 
dfc.acs.config.refresh_interval = 120


# Defines how often dfc verifies that global registry related config objects are 
# modified, specified in seconds.                                               
# min value:  0, max value: 10000000
# 
dfc.acs.gr.refresh_interval = 120
...

Wouldn’t it be nice to move these settings into a SQLite table with the structure dfc_properties(key PRIMARY KEY, value, comment, is_enabled) ? Both could still coexist and be merged at server startup time, with more priority to the file (i.e. if a given setting is present in both the table and in the file, the latter would prevail). We could have a classic file dfc.propertes along with the database file dfc.properties.db.
Common operations on the parameters would be done through SQL statements, e.g.:

SELECT key, value, comment from dfc_properties where is_enable = TRUE;
SELECT 'other_values', key, value from dfc_properties where is_enable = FALSE and key = ... ORDER BY key, value;
SELECT 'all_values', is_enable, key, value from dfc_properties where ORDER BY is_enable DESC, key, value;
UPDATE server_ini SET value = 'mydb.world' WHERE key = 'database_conn';

All the supported parameters would already be present in the table with correct names and default values. Alternative or deactivated values would have their is_enabled value to FALSE.
But the main interest of using SQlite’s tables here would be the CHECK constraints (or FOREIGN keys) to prevent typos in the parameter names. And for a server.ini.db with structure server_ini(section, key, value, comment, is_enabled), the section + key would be a composite foreign key with a domain CHECK constraints in the parent table, to prevent misplaced parameters. This would require a real database schema with some complexity, and would be delivered by Documentum. The point here is that SQLite would be an excellent tool for this kind of data.
But let’s leave it at that. Such an design and implementation could also deserve its own blog.
I hope this interface got your interest. In the next part, I’ll present its gory details, comment on the implementation and its limits and show how to compile and use it within gawk. See you there !

 

Leave a Reply

Cesare Cervini
Cesare Cervini