By Franck Pachot
The development teams often see the database as a ‘persistence layer’ only. And sometimes it is mentioned that the database is always the cause of the problems, especially when implementing continuous integration and trying to be agile. Then cames the idea to have this ‘persistence layer’ in an isolated environment, such as a docker container, with the database for each developer. However, this overlooks the real cause of the problems, which is not the persistence function of the database, but the fact that it is shared. And when you share something, in a multi-user environment, you reach another level of complexity. And if you are not prepared for that, you perceive it as a problem.
This philosophical blog post contains some cool words which, in my opinion, must be carefully considered when dealing database: agile, persistence, stateless, docker, microservices, stateless, NoSQL, containers, lake, cloud,…
When I hear about ‘persistence layer’ it always reminds me a meeting a few years ago where the architects of a new application draw on the whiteboard something that they could have called microservices a few years later. Dedicated web services reading data, processing it, and writing the result to the ‘persistence layer’. And when the data was only a transient interface between two of those services, they’ve written ‘temporary persistence’ on the whiteboard diagram. When you come with two antonyms to define something, there is probably something wrong in the design.
In order to understand what’s beyond persistence, you should think about what happens when you:
SQL> commit work;
First, the commit statement closes a transaction that has been started by the first DML. That’s the first purpose of a database management system: process transactions. This is already something that is not very clear for some development architects. They came with stateless services, just because thinking about transactions is not easy when there is no one-to-one relationship between the database session and the end-user (as in client/server on dedicated terminal). So they designed stateless services. Then they encountered issues such as lost updates as soon as they had more than one user. And then implemented optimistic locking in their application server, not always in the right layer. I recommend this talk from Vlad Mihalcea about transactions.
Second, the commit statement ensures that changes are persistent and durable. The changes are written to disk, to backups, and to standby databases. This could be as simple as a write call, but is a bit more complex for performance reasons: random writes done on cache, written to disk asynchronously, redo transaction log written sequentially, synced on commit. Here, the developer do not need to care about the underlying mechanisms, beyond just trying to change only what is needed and commit only when needed.
Third, the commit marks the changes as publicly visible. Because the data is shared among multiple users. This is why developing and testing on your own personal database is limited to unit tests. As soon as you have to work on implementing a real use-case, you must work on a database shared by different services. I have seen enterprises going early on agile Database as a Service for agile development where each developer was working on his own copy (thin clone) of the database. They are now asking for common environments where multiple developers can work and test their different part of the application. Back to this pre-agile idea, and back to the same problem: the database is shared.
Finally, I think that some developers like to see the database as only a ‘persistence layer’ just because it is easier. You can be lazy and let the database system and the database administrators manage the durability of the data. The non-optimal performance will be compensated by software and hardware. And ignoring the two other properties of a database system is just walking away from the complexity. NoSQL to ignore transactions and consistencies, and containers to ignore the sharing concern.
This is not new. ‘Object Oriented’ was the buzzword before ‘buzzword’ itself was a buzzword. And OO development was ok until it had to manipulate data from the database. They called ‘impedance mismatch’ the problems encountered when trying to match the object-oriented model with the relational model. And they built frameworks to do this matching as if it were simple, such as matching the Object Identity concept with the Primary Key concept. And my experience is that this was nice to build good Proof or Concepts, but failed in production on consistency, performance, and scalability. Object Oriented development is good for non-shared transient objects. A GUI is based on objects, such as a window or a text field, where the object identity is the address in memory. If you restart the application, it is another object, with different address memory. And the GUI on your colleague computer is again another object. It is not persistent but transient. And it is not shared but all different. Applying this model to data doesn’t work. You can simulate persistency with an Object-Relational mapping (ORM) but sharing will be a problem. Those ORM usually work on proxy cached objects in the application server, trying to re-invent the management of concurrent changes, without using the database system which is built for that.
The current shift from ‘database technology’ to ‘developer community’ is probably a good idea, but only if we do not do the same errors such as using the database as a black box to persist objects. We must keep in mind the reasons why Relational Database Management Systems were built for.
Before RDBMS, data was modeled hierarchically. This was good for performance (because data is clustered) but was good for only one use-case. Other use-cases had to replicate the data into another physical design. The relational modeling stores data in a way it can be used by all use-cases. For example, You can look at the orders from one customer, or for the orders on one product. The business entities are in different tables and are joined dynamically when queried. This is a data point of view. The developer builds something different because the developer works on one use-case. This was the trend for XML a few years ago, and JSON now, bringing back the hierarchical model that failed decades ago. Very good to work on your use-case, but very bad when data is shared with other use cases. You have good Proof of Concept and good unit test. But integration will be impossible. I think that we will see the same with microservices: each one designed for its own usage without thinking about other (including future) use-cases.
Before RDBMS, data definition was included in the code for each use-case. A COBOL program had a data division describing the structure of the data that will be accessed by the program. This was not good for shared databases because one structural change had to change all programs. And this is why the relational model was introduced with physical and logical independence. The logical model is designed for the application, and it is the common interface between the developers and the database administrators. Here again, I’ve seen some application architects going backward, using ORM mapping used in the same way the COBOL data division was used in the past.
Today, developers are running their code in containers. This has the same properties as the OO development I described before: isolated and transient. Docker starts a few processes in its own image of the disk. This is perfect. But, like they did with OO, they try to bring this idea to the database. And that will fail again because, except in early stage of development, you need a database that is shared and persistent.
In my opinion, running the Oracle Database in a Docker container gives a wrong message to the developer community. A database involves a lot of resources and is highly coupled with the OS. My lab databases on my laptop are VirtualBox VMs. Easy to build, easy to start, and easy to share. I’ve tested database on docker 3 years ago to see how we can build data virtualization, and thin clones for development, with a standby database on Docker. I’ve written an article about that for DOAG. The technology has evolved but I think that it is not a good idea except for the fun of trying something new.
Today, we have containers on the database itself, with multitenant pluggable databases and application containers. You can provision a pluggable database for each docker container running the application. And you have easy ways to copy, clone, refresh or have common data. This is ok for development.
Application containers have been introduced for SaaS: each application tenant has its own pluggable database. The data in the PDB is still shared by all application use-cases, and metadata (and some reference tables) is shared in application root by several applications.
However, we also see in 12cR2, and probably more in 18c, some cross-PDB DML which allows queries and modification among several pluggable databases. It is not yet stable, using database links for modifications (see here). But I can imagine that it will be fixed in 18c or later being able to update different PDBs within the same transaction.
And then, I foresee how this will be perceived by some developer architects (those ignoring that the database must be shared). They will build microservices, with small application servers running in a Docker container. And they will map one PDB for each service, doing something like a Microservice Database. Some will store data in XML, others in JSON, and some in relational tables. All those will be consolidated into a multi-tenant database to be managed easily. And my fear is that we will see a large bowl of spaghetti to exchange data between those Microservice PDBs using database links, common views, cross-PDB DML, and maybe some logical replication.
In my opinion, microservices can be an interesting design for future applications, but trying to map the same idea to the database will be a failure if developers don’t realize that the database is not there only to persist data but also to share it. But that’s just my own current opinion and I’ve no idea about the future. Maybe all this data modified by microservices will be shared in real time in a big Data Lake universally available on the cloud, and all IT problems about concurrency and latency will be solved.