Perspectives on LedgerSMB: Building SOLID Databases: Single Responsibility and Normalization

Introduction

This instalment will cover the single responsibility principle in object-relational design, and its relationship both to data normalization and object-oriented application programming. While single responsibility is a fairly easy object-oriented principle to apply here, I think it is critical to explore in depth because it helps provide a clearer framework to address object-relational design.

As in later instalments I will be using snippets of code developed elsewhere for other areas. These will not be full versions of what was written, but versions sufficient to show the basics of data structure and interface.

Relations and Classes: Similarities

Objects and classes, in the surface, look deceptively similar, to the point where one can look at relations as sets of classes, and in fact this equivalence is the basis of object-relational database design.

Objects are data structures used to store state, which have identity and are tightly bound to interface. Relations are data structures which store state, and if they meet second normal form, have identity in the form of a primary key. Object-relational databases then provide interface and thus in an object-relational database, relations are classes, and contain sets of objects of a certain class.

Relations and Classes: Differences

So similar are objects and classes in structure that a very common mistake is to simply use a relational database management system as a simple object store. This tends to result in brittle database leading to a relatively brittle application. Consequently many of us see this approach as something of an anti-pattern.

The reason why this doesn't work terribly well is not that the basic equivalence is bad but that relations and classes are used in very different ways. On the application layer, classes are used to model (and control) behavior, while in the database, relations and tuples are used to model information. Thus tying database structures to application classes in this way essentially overloads the data structures, turning the structures into reporting objects as well as behavior objects.

Relations thus need to be seen not only as classes but as specialized classes used for persistent storage and reporting. Thus they have fundamentally different requirements than the behavior classes in the application and thus they have different reasons to change. An application class typically changes when there is a need for a change in behavior, while a relation should only change when there is a change in data retention and reporting.

Relations have traditionally tended to be divorced from interface and this provides a great deal of power. While classes tend to be fairly opaque, relations tend to be very transparent. The reason here is that while both represent state information whether it is application state or other facts, objects traditionally encapsulate behavior (and thus act as building blocks of behavior), relations always encapsulate information and are building blocks of information. Thus the data structures of relations must be transparent while object-oriented design tends to push for less transparency and more abstraction.

It is worth noting then that because these systems are designed to do different things, there are many DBA's who suggest encapsulating the entire database behind an API, defined by stored procedures. The typical problem with this approach is that loose coupling of the application to the interface is difficult (but see one of the first posts on this blog for a solution). When the db interface is tightly coupled to the application, then typically one ends up with problems on several levels, and it tends to sacrifice good application design for good relational database design.

Single Responsibility Principle in Application Programming

The single responsibility principle holds that every class should have a single responsibility which it should entirely encapsulate. A responsibility is defined as a reason to change. The canonical example of a violation of this principle is a class which might format and print a report. Because both data changes and cosmetic changes may require a change to the class, this would be a violation of the principle at issue. In an ideal world, we'd separate out the printing and the formatting so that cosmetic changes do not require changes when data changes are made and vice versa.

The problem of course with the canonical example is that it is not self-contained. If you change the data in the report, it will almost certainly require cosmetic changes. You can try to automate those changes but only within limits, and you can abstract interfaces (dependency inversion) but in the end if you change the data in the report enough, cosmetic changes will become necessary.

Additionally a "reason to change" is epistemologically problematic. Reasons foreseen are rarely if ever atomic, and so there is a real question as far as how far one pushes this. In terms of formatting a report, do we want to break out the class that adjusts to paper size so that if we want to move from US Letter to A4 we no longer have to change the rest of the cosmetic layout?

Perfect separation of responsibilities in that example is thus impossible, as it probably always is --- you can only change business rules to a certain point before interfaces must change, and when that happens the cascading flow of necessary changes can be quite significant.

The database is, however, quite different in that that responsibility of database-level code (including DDL and DML) is limited to the proposition that we should construct answers from known facts. This makes a huge difference in terms of single responsibility, and it is possible to develop mathematical definitions for single responsibility.

Not only is this possible but it has been done. All of the normal forms from third on up address single responsibility.

The Definition of Third Normal Form

Quoting Wikipedia,

Codd's definition states that a table is in 3NF if and only if both of the following conditions hold:

The relation R (table) is in second normal form (2NF)

Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every superkey of R.

A non-prime attribute is an attribute not part of a superkey. In essence what third normal form states is that every relation must contain a superkey and values functionally and directly dependent on that superkey.

This will become more important as we look at how data anomilies dovetail with single responsibility.

Normalization and Single Responsibility

The process of database normalization is an attempt to create relational databases where data anomalies do not exist. Data anomalies occur where modifying data either requires modifying other data to maintain accuracy (where no independent fact changes are recorded), or where existing data may project current or historical facts not in existence (join anomilies).

This process occurs by breaking down keys and superkeys, and their dependencies, such that data is tracked in smaller, self-contained units. Beginning at third normal form, one can see relations are forming single responsibilities of managing data directly dependent on their superkeys. From this point forward, relations' structures would change (assuming no decision to further decompose a relation into a higher normal form) if and only if a change is made to what data is tracked that is directly dependent on a superkey.

The responsibility of the database layer is the storage of facts and the synthesis of answers. Since the storage relations themselves handle the first, normalization is a prerequisite to good object-relational design.

The one major caveat here however is that first normal form's atomicity requirement must be interpreted slightly differently in object-relational setups because more complex data structures can be atomic compared to a purely relational design. In a purely relational database, the data types that can be used are relatively minor and therefore facts must be maximally decomposed. For example we might store an IP address plus network mask as 4 ints for the address and an int for the network mask, or we might store as a single 32-bit int plus another int for the network mask but the latter poses problems of display that the former does not. In an object-relational database, however, we might store the address as an array of 4 ints for IP v4 or, if we need better performance we might build a custom type. If storage is not a problem but ease of maintenance is, we might even define relations, domains, and such to hold IP addresses, and then store the tuple in a column with appropriate functional interfaces.

None of these approaches necessarily violate first normal form, as long as the data type involved properly and completely encapsulates the required behavior. Where such encapsulation is problematic, however, they would violate 1NF because they can no longer be treated as atomic values. In all cases, the specific value has a 1:1 correspondence to an IP address.

Additionally where the needs are different, storage, application interface, and reporting classes should be different (this can be handled with updateable views, object-relational interfaces, and the like).

Object-Relational Interfaces and Single Responsibility

For purely relational databases, normalization is sufficient to address single responsibility. Object-relational designs bring some additional complexity because some behavior may be encapsulated in the object interfaces. There are two fundamental cases where this may make a difference, namely in terms of compositional patterns and in terms of encapsulated data within columns.

A compositional pattern in PostgreSQL typically would occur when we use table inheritance to manage commonly co-occuring fields which occur in ways which are functionally dependent on many other fields in a database. For example, we might have a notes abstract table, and then have various tables which inherit this, possibly as part of other larger tables. A common case where composition makes a big difference is in managing notes. People may want to attach notes to all kinds of other data in a database, and so one cannot say that the text or subject of a note is mutually dependent,

A typical purely relational approach is to either have many independently managed notes tables or have a single global note table which stores notes for everything, and then have multiple join tables to add join dependencies. The problem with this is that the note data is usually dependent logically, if not mathematically, on the join dependency, and so there is no reasonable way of expressing this without a great deal of complexity in the database design.

An object-relational approach might be to have multiple notes tables, but have them inherit the table structure of a common notes table. This table can then be expanded, interfaces added as needed, and it should fill the single responsibility principle even though we might not be able to say that there is a natural functional dependency within the table itself.

The second case has to do with storing complex information in columns. Here stability and robustness of code is especially important, and traditional approaches of the single responsibility principle apply directly to the contained data type.

Example: Machine Configuration Database and SMTP configuration

One of my current projects is building a network configuration database for a LedgerSMB hosting business I am helping to found (more on this soon). For competitive reasons I cannot display my whole code here. However, what I would like to do is show a very abbreviated version here as I used to solve a very specific issue.

One of the basic challenges in a network configuration database is that the direct functional dependencies for a given machine may become quite complex when we assume that a given piece of network software is not likely to be running more than once on a given machine. Additionally we often want to ensure that certain sorts of software are set to be configured for certain types of machines, and so constraints can exist that force wider tables.

The width and complexity of some configuration tables can possibly pose a management problem over time for the reason that they may not be obviously broken into manageable chunks of columns.

One possible solution is to decompose the storage class into smaller mix-ins, each of which expresses a set of functional dependencies on a specific key, fully encapsulating a single responsibility. The overall storage class then exists to manage cross-mixin constraints and handle the actual physical storage. The data can then be presented as a unified table, or as multiple joined tables (and this works even where views would add significant complexity). In this way the smaller sub-tables can be given the responsibility of managing the configuration of specific pieces of software.

We might therefore have tables like:

-- abstract table, contains no data
CREATE TABLE mi_smtp_config (
    mi_id bigint,
    smtp_hostname text,
    smtp_forward_to text
);

CREATE TABLE machine_instance (
   mi_id bigserial,
   mi_name text not null,
   inservice_date date not null,
   retire_date date.
   ....
) INHERITS (mi_smtp_config, ...);

The major advantage to this approach is that we can easily check and add which fields are set up to configure which software, without looking through a much larger, wider table.   This also provides additional interfaces for related data, and the like.

For example, "select * from mi_smtp_config" is directly equivalent of "select (mi::mi_smtp_config).* from machine_instance mi;

Conclusions

When we think of relations as specialized "fact classes" as opposed to "behavior classes" in the application world, the idea of the single responsibility principle works quite well with relational databases, particularly when paired with other encapsulation processes like stored procedures and views.

In object-relational designs, the principle can be used as a general guide for further decomposing relations into mix-in classes, or creating intelligent data types for attributes, and it becomes possible to solve a number of problems in this regard without breaking normalization rules.

Perspectives on LedgerSMB

Wednesday, January 23, 2013

Building SOLID Databases: Single Responsibility and Normalization