Friday, February 15, 2013

Building SOLID Databases: Dependency Inversion and Robust DB Interfaces

Dependency inversion is the idea that interfaces should depend on abstractions not on specifics.  According to Wikipedia, the principle states:

A. High-level modules should not depend on low-level modules. Both should depend on abstractions.
B. Abstractions should not depend upon details. Details should depend upon abstractions.

Of course the second part of this principle is impossible if read literally.  You can't have an abstraction until you know what details are to be covered, and so the abstraction and details are both co-dependent.  If the covered details change sufficiently the abstraction will become either leaky or inadequate and so it is worth seeing these as intertwined to some extent.

The focus on abstraction is helpful because it suggests that the interface contract should be designed in such a way that neither side really has to understand any internal details of the other in order to make things work.  Both sides depend on well-encapsulated API's and neither side has to worry about what the other side is really doing.  This is what is meant by details depending on abstractions rather than the other way around.

This concept is quite applicable beyond object oriented programming because it covers a very basic aspect of API contract design, namely how well an API should encapsulate behavior.

This principle is first formulated in its current form in the object oriented programming paradigm but is generally applicable elsewhere.

SQL as an Abstraction Layer, or Why RDBMS are Still King

There are plenty of reasons to dislike SQL, such as the fact that nulls are semantically ambiguous.  As a basic disclaimer I am not holding SQL up to be a paragon of programming languages or even db interfaces, but I think it is important to discuss what SQL does right in this regard.

SQL is generally understood to be a declarative language which approximates relational mathematics for database access purposes.  With SQL, you specify what you want returned, not how to get it, and the planner determines the best way to get it.  SQL is thus an interface language rather than a programming language per se.  With SQL, you can worry about the logical structure, leaving the implementation details to the db engine.

SQL queries are basically very high level specifications of operations, not detailed descriptions of how to do something efficiently.  Even update and insert statements (which are by nature more imperative than select statements) leave the underlying implementation entirely to the database management system.

I think that this, along with many concessions the language has made to real-world requirements (such as bags instead of sets and the addition of ordering to bags) largely account for the success of this language.  SQL, in essence, encapsulates a database behind a mature mathematical, declarative model in the same way that JSON and REST do (in a much less comprehensive way) in many NoSQL db's.  In essence SQL provides encapsulation, interface, and abstraction in a very full-featured way and this is why it has been so successful.

SQL Abstraction as Imperfect

One obvious problem with treating SQL as an abstraction layer in its own right is that one is frequently unable to write details in a way that is clearly separate from the interface.  Often storage tables are hit directly, and therefore there is little separation between logical detail and logical interface, and so this can break down when database complexity reaches a certain size.  Approaches to managing this problem include using stored procedures or user defined functions, and using views to encapsulate storage tables.

Stored Procedures and User Defined Functions Done Wrong

Of the above methods, stored procedures and functional interfaces have bad reputations frequently because of bad experiences that many people have with them.    These include developers pushing too much logic into stored procedures, and the fact that defining functional interfaces in this way usually produces a very tight binding between database code and application code, often leading to maintainability problems.

The first case is quite obvious, and includes the all-too-frequent case of trying to send emails directly from stored procedures (always a bad idea).  This mistake leads to certain types of problems, including the fact that ACID-compliant operations may be mixed with non-ACID-compliant ones, leading to cases where a transaction can only be partially rolled back.  Oops, we didn't actually record the order as shipped, but we told the customer it was.....  MySQL users will also note this is an argument against mixing transactional and nontransactional backend table types in the same db.....  However that problem is outside the scope of this post.  Additionally, MySQL is not well suited for many applications against a single set of db relations.

The second problem, though, is more insidious.  The traditional way stored procedures and user defined functions are typically used, the application has to be deeply aware of the interface to the database, but the rollout for these aspects is different leading to the possibility or service interruptions, and a need to very carefully and closely time rollout of db changes with application changes.  As more applications use the database, this becomes harder and the chance of something being overlooked becomes greater.

For this reason the idea that all operations must go through a set of stored procedures is a decision fraught with hazard as the database and application environment evolves.  Typically it is easier to manage backwards-compatibility in schemas than it is in functions and so a key question is how many opportunities you have to create new bugs when a new column is added.

There are, of course, more hazards which I have dealt with before, but the point is that stored procedures are potentially harmful and a major part of the reason is that they usually form a fairly brittle contract with the application layer.  In a traditional stored procedure, adding a column to be stored will require changing the number of variables in the stored procedure's argument list, the queries to access it, and each application's call to that stored procedure.  In this way, they provide (in the absence of other help) at best a leaky abstraction layer around the database details.  This is the sort of problem that dependency inversion helps to avoid.

Stored Procedures and User Defined Functions Done Right

Not all stored procedures are done wrong.  In the LedgerSMB project we have at least partially solved the abstraction/brittleness issue by looking to web services for inspiration.  Our approach provides an additional mapping layer and dynamic query generation around a stored procedure interface.  By using a service locator pattern, and overloading the system tables in PostgreSQL as the service registry, we solve the problem of brittleness.

Our approach of course is not perfect and it is not the only possibility.  One shortcoming is that our approach is that the invocation of the service locator is relatively spartan.  We intend to allow more options there in the future.  However one thing I have noticed is the fact that there are far fewer places where bugs can hide and therefore faster and more robust development takes place.

Additionally a focus on clarity of code in stored procedures has eliminated a number of important performance bottlenecks, and it limits the number of places where a given change propagates to.

Other Important Options in PostgreSQL

Stored procedures are not the only abstraction mechanisms available from PostgreSQL.  In addition to views, there are also other interesting ways of using functions to accomplish this without insisting that all access goes through stored procedures.  In addition these methods can be freely mixed to produce very powerful, intelligent database systems.

Such options include custom types, written in C, along with custom operators, functions and the like.  These would then be stored in columns and SQL can be used to provide an abstraction layer around the types.  In this way SQL becomes the abstraction and the C programs become the details.  A future post will cover the use of ip4r in network management with PostgreSQL db's as an example of what can be done here.

Additionally, things like triggers and notifications can be used to ensure that appropriate changes trigger other changes in the same transaction or, upon transaction commit, hand off control to other programs in subsequent transactions (allowing for independent processing and error control for things like sending emails).


Rather than specific recommendations, the overall point here is to look at the database itself as a an application running in an application server (the RDBMS) and design it as an application with an appropriate API.  There are many ways to do this, from writing components in C and using SQL as an abstraction mechanism to writing things in SQL and using stored procedures as a mechanism.  One could even write code in SQL and still use SQL as an abstraction mechanism.

The key point however is to be aware of the need for discoverable abstraction, a need which to date things like ORMs and stored procedures often fill very imperfectly.  A well designed db with appropriate abstraction in interfaces, should be able to be seen as an application in its own right, engineered as such, and capable of serving multiple client apps through a robust and discoverable API.

As with all things, it starts by recognizing the problems and putting solutions as priorities from the design stage onward.