Perspectives on LedgerSMB: Object/Relational modelling Part 7: General Design Considerations

The previous posts have hopefully opened up the way we look at modelling relational data, in a move from thinking about relations and tuples to thinking about sets of objects coupled with derived values and catalogs. These approaches are all viable as long as the actual table storage layout still meets all criteria of a well-designed relational database.

Unfortunately with new techniques and approaches, additional antipatterns are possible. Every feature can be badly used, and every feature can be badly combined with other features. For this reason, the more one adds advanced features which complicate the data model, the more it becomes possible to spectacularly fail. Some of the tools discussed so far can be liberally used, but many must be used sparingly and only as necessary.

I have personally been working with features like table inheritance for some years now. The features can be of immense help if used properly, but much of the current documentation sets users up to be spectacularly disappointed.

I am now at a point where I can summarize the differences between relational thinking and object-relational thinking. The difference is similar to the difference between algebra and calculus. Relationally we want our information arranged so we can select, join, filter, and aggregate to provide useful answers to questions. This is not so different from using algebra to factor and simplify expressions or to solve problems for specific unknowns.

When a student is learning calculus however, the same tools are used but applied in a very different way. The approach to thinking is very different. Calculus applies many of the same techniques to very small numbers and to trends between numbers. For example, if you have a curve and want to find out what the slope is at a certain point, you look at the limit of the slope between two points as those points converge. Similarly an integral is the sum of an infinite series of infinitesimals. Calculus can thus be seen as a level of abstraction above algebra useful primarily for those problems that algebra cannot solve directly, but it builds upon algebraic techniques, and anyone doing calculus will apply algebra wherever possible first.

Here we have a similar process. We are aggregating or decomposing relational information so that we can derive interesting values from it. However, for a whole host of reasons, we do not want to lose sight of the fact that the database (at least in the sections we run day-to-day SELECT queries against) is relationally well-formed.

If the above comparison has merit then it is worth heeding the words of my college Calculus teacher, Mike Lavender. Many of these features are power tools and should be used both sparingly and with care.

A second important point is that object-relational and object-oriented thinking are also quite different. Object-relational thinking ties data structures to their functional dependencies, which may not be stored in the database, and tends to be best when approaching data structures from the perspective of what answers can be derived. Thus, a square is a rectangle with an additional constraint, and this is perfectly consistent with the Liskov Substitution Principle applied to an information model. A square is mathematically substitutable for a rectangle in terms of what we can derive from it.

On the other hand, object-oriented programming is about modelling behavior and what can be done to objects without transforming them across categories. Thus in object-oriented programming, you cannot say that a square is a rectangle because there are things we can do to a rectangle that we cannot do to a square without transforming them across class boundaries. As I will attempt to show in the future, while object-oriented design in application code is very much an art, object-relational design can be expressed in terms of mathematics, namely relational algebra with some extremely modest extensions. This difference in role and approach means that while many of the object-oriented design principles can be found to apply, they apply to object-relational designs in ways very different than they would in the app layer.

I am planning a short (three post) series on my attempt at an object-relational algebra, and a longer series on applying SOLID principles to Object-Relational design.

The Problem: "Relational Decay"

The basic problem can be described as the ability to silently degrade the usefulness of relational data constraints because the design does not allow for clean mapping of constraints. This requires either ever-increasing complexity of data management or it requires degraded data constraints. Both of these are dangerous and can lead eventually to data integrity and accuracy issues.

Anti-Pattern: Mixing Stored and Subclass Data in Same Virtual Table

Having data written to both parent and child tables complicates data modelling in ways that result in functionally ambiguous foreign keys. Functionally ambiguous foreign keys are generally bad and to be avoided. In general, foreign keys should be clear and unambiguous and should reference specific tables. Data inheritance, that is, mixing stored data and data of subclasses together in one virtual table, is a recipe for problems. If a single query pulls data like this together, relational integrity becomes problematic. Don't do it.
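A minimal sketch of the ambiguity, assuming PostgreSQL table inheritance. The party, company, and invoice tables are hypothetical and exist only for illustration.

```sql
-- Hypothetical tables, for illustration only.
CREATE TABLE party (
    id   int PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE company (
    tax_id text
) INHERITS (party);

CREATE TABLE invoice (
    id       int PRIMARY KEY,
    party_id int NOT NULL REFERENCES party (id)
);

-- Data is written to both the parent and the child.
INSERT INTO party   (id, name)         VALUES (1, 'Stored in parent');
INSERT INTO company (id, name, tax_id) VALUES (2, 'Stored in child', 'X-1');

-- Both rows are visible when querying the parent...
SELECT id, name FROM party;

-- ...but the foreign key only checks rows physically stored in party,
-- so this insert is rejected even though id 2 appears in the query above.
INSERT INTO invoice (id, party_id) VALUES (10, 2);
```

The foreign key says "references party", but what a query on party returns and what the constraint accepts are two different sets. That gap is the functional ambiguity.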

Anti-Pattern: Multi-way Sync'ing

One of the solutions for virtually any key issue is to use materialized views. This is particularly helpful when representation of data must be transformed from a form where derived values can be constrained to one where it can be reasonably relationally queried (nested table structures create this problem among others). It's tempting to try to synchronize data both ways, but this is unmanageably complex. Data flow needs to go one way only.

Anti-Pattern: Commonly Retrieving Data Directly from Nested Storage

As we have shown, nested tables present massive problems regarding storage and indexing because the logical storage and the physical storage are not really of the same structure. Nested storage makes a great deal of sense for some problems, as we have seen, but used indiscriminately, the result is invariably a mess.

Anti-Pattern: Casting complex type to text

When prototyping types, perhaps before writing a type in C, it may be tempting to write casts of tuple types to text. If you do this, however, things are likely to break because there are built-in casts to text that are used internally. Instead use an as_text method.
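As a rough sketch of what such a method might look like, here is an ordinary function named as_text over a composite type. The money_amount type and price_list table are hypothetical names made up for this example.

```sql
-- Hypothetical composite type, for illustration only.
CREATE TYPE money_amount AS (
    amount   numeric,
    currency text
);

-- An ordinary function, not a cast: the built-in text conversion is left alone.
CREATE FUNCTION as_text(money_amount) RETURNS text
LANGUAGE sql IMMUTABLE AS $$
    SELECT ($1).amount::text || ' ' || ($1).currency;
$$;

-- Hypothetical table using the type.
CREATE TABLE price_list (
    item  text PRIMARY KEY,
    price money_amount NOT NULL
);

INSERT INTO price_list (item, price)
VALUES ('widget', ROW(10.50, 'USD')::money_amount);

-- Call the method explicitly wherever a display form is wanted.
SELECT item, as_text(price) FROM price_list;
```

Because no CREATE CAST is involved, code that relies on the default text form of the tuple keeps working.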

Solutions: Relational first, Object second

The solutions below have one thing in common. They allow relational integrity to be preserved while putting these sorts of power tools to use. These however are not necessarily to be used willy-nilly either. Each one has something of a complexity cost and this must be weighed against the reductions in complexity that each one provides for a project.

Interface inheritance

Interfaces should be inherited, not data. In this regard, the inheritance design follows from data. This first pattern is to build inheritance trees assuming that the only tables being actively queried will be those at the bottom of the tree, where data is actually stored. In this case, inheritance is used not to provide additional query capabilities but rather to provide consistent interfaces across a domain.

Stock and Leaf Inheritance Tree Design

Stock and leaf is a name I gave to the idea that you can build an inheritance tree which separates tables out into functional units from which no further inheritance occurs (leaves) and infrastructure tables which provide consistent query capability for a subset of table information. These stock tables never have data stored in them but provide general query capabilities.

A way to think of the stock and leaf design might be a table partitioned in a complex way. Maybe we have an initial partition based on some ranges of one primary key column, but one of the first-level tables here (but not all) is partitioned on a second primary key column. Thus we have leaf nodes occurring at several different levels of inheritance. This allows for organic growth of inheritance, but avoids the functionally ambiguous foreign keys that often result.
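The stock and leaf idea might be sketched as follows. All table names here are hypothetical, and the CHECK (false) NO INHERIT constraint is my own assumption about one way to keep a stock table empty; it requires PostgreSQL 9.2 or later.

```sql
-- Hypothetical schema, for illustration only.

-- Stock table: defines the interface and never stores rows itself.
CREATE TABLE note (
    id      bigserial,
    subject text NOT NULL,
    body    text NOT NULL,
    ref_key int  NOT NULL,
    -- Assumption: rejects direct inserts; NO INHERIT keeps it off the leaves.
    CONSTRAINT note_is_stock CHECK (false) NO INHERIT
);

-- Ordinary tables the leaves will reference.
CREATE TABLE customer (
    id   int PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE invoice (
    id          int PRIMARY KEY,
    customer_id int NOT NULL REFERENCES customer (id)
);

-- Leaf tables: hold the data and carry their own unambiguous keys.
CREATE TABLE customer_note (
    PRIMARY KEY (id),
    FOREIGN KEY (ref_key) REFERENCES customer (id)
) INHERITS (note);

CREATE TABLE invoice_note (
    PRIMARY KEY (id),
    FOREIGN KEY (ref_key) REFERENCES invoice (id)
) INHERITS (note);

-- Writes go to leaves only; the stock table gives a uniform read interface.
SELECT tableoid::regclass AS stored_in, id, subject
  FROM note;
```

Each foreign key lives on a leaf and points at exactly one table, so nothing ever needs to reference the stock table itself.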

Simplified Write Model

If the data model requires synchronizing some aspects of it, it is very important to keep the write scenarios to a minimum. Fewer write scenarios means fewer things that can go wrong, easier testing, more understandable failure cases, and easier troubleshooting when things go wrong.

Log-Aggregate-Snapshot Modelling

One way to greatly simplify writes is to move from representing current state in the database to representing cumulative changes in the database. Accounting systems, for example, have a very restricted write model (usually few if any update scenarios, most important tables being append-only, etc.) along with some current state (the balance of the checking account) being calculated based on past changes. This is a pattern which can be used elsewhere, and it enables other approaches (see below).

The typical objection is that representing state as a series of changes means one is storing a lot more data and that calculating state then requires expensive calculations. The solution to this is to use periodic snapshots which can be used as roll-forward points, as well as to constrain when input may be entered. For example we may take year-end snapshots of account balances, plus month-end snapshots of invoice balances, preventing invoices from being entered that are more than a month old, and any financial transactions from being entered into periods in which books are closed. Similarly this allows us to delete old data without destroying our ability to track current state.

Log-Aggregate-Snapshot modelling trades data model complexity for write simplicity. It is itself a power tool but it makes many other power tools safely usable.
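A minimal sketch of the three pieces, using hypothetical account_entry and account_snapshot tables and a hypothetical current_balance function. It is not meant to reflect any particular accounting schema.

```sql
-- Hypothetical tables, for illustration only.

-- The log: append-only, one row per change.
CREATE TABLE account_entry (
    id         bigserial PRIMARY KEY,
    account_id int     NOT NULL,
    entry_date date    NOT NULL,
    amount     numeric NOT NULL
);

-- The snapshot: a balance per account as of a closing date.
CREATE TABLE account_snapshot (
    account_id int     NOT NULL,
    end_date   date    NOT NULL,
    balance    numeric NOT NULL,
    PRIMARY KEY (account_id, end_date)
);

-- The aggregate: roll forward from the latest snapshot, if there is one.
CREATE FUNCTION current_balance(int) RETURNS numeric
LANGUAGE sql STABLE AS $$
    WITH last_snap AS (
        SELECT end_date, balance
          FROM account_snapshot
         WHERE account_id = $1
         ORDER BY end_date DESC
         LIMIT 1
    )
    SELECT coalesce((SELECT balance FROM last_snap), 0)
         + coalesce(sum(e.amount), 0)
      FROM account_entry e
     WHERE e.account_id = $1
       AND e.entry_date > coalesce((SELECT end_date FROM last_snap),
                                   '-infinity'::date);
$$;

SELECT current_balance(1);
```

The sketch only stays correct if entries can never be dated on or before an existing snapshot, which is the "constrain when input may be entered" half of the pattern and is not shown here.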

Separation of Entry Storage and Query Storage for Append-Only Data

Another approach we can borrow from accounting, this time the paper world, is the idea that the format at point of entry can be transformed and re-saved for later use in another form. This works well only when the data is append-only and the write transformations are well understood. However one can use it for materialized views for inheritance trees where needed, or for breaking out data stored in nested tables where this is needed in order to enforce constraints.

For example, we may have nested tables in order to enforce subset constraints. We can break these out on save into conventional tables where row constraints can be enforced and data more flexibly queried and optimized.
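One possible shape for this break-out, sketched with a hypothetical order_line type, order_entry and order_line_flat tables, and a trigger. It assumes the entry table is append-only, so only inserts are handled.

```sql
-- Hypothetical types and tables, for illustration only.
CREATE TYPE order_line AS (
    sku text,
    qty int
);

-- Entry storage: lines are nested so a whole order is written at once.
CREATE TABLE order_entry (
    id    int PRIMARY KEY,
    lines order_line[] NOT NULL
);

-- Query storage: one conventional row per line, with row constraints.
CREATE TABLE order_line_flat (
    order_id int  NOT NULL REFERENCES order_entry (id),
    sku      text NOT NULL,
    qty      int  NOT NULL CHECK (qty > 0),
    PRIMARY KEY (order_id, sku)
);

-- One-way flow: entry storage -> query storage, on insert only.
CREATE FUNCTION order_entry_break_out() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO order_line_flat (order_id, sku, qty)
    SELECT NEW.id, l.sku, l.qty
      FROM unnest(NEW.lines) AS l;
    RETURN NEW;
END;
$$;

CREATE TRIGGER order_entry_break_out
    AFTER INSERT ON order_entry
    FOR EACH ROW EXECUTE PROCEDURE order_entry_break_out();

INSERT INTO order_entry (id, lines)
VALUES (1, ARRAY[ROW('widget', 2)::order_line, ROW('gadget', 1)::order_line]);

-- Day-to-day queries read the flat table, not the nested one.
SELECT order_id, sku, qty FROM order_line_flat;
```

If any line violates a row constraint on the flat table, the whole insert into order_entry fails, so the two forms cannot drift apart.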

object_id inherited field

When using stock and leaf approaches one of the difficulties can be in tracking rows back from the stock table catalogs where this is needed into the actual tables. Inheriting an object_id field poses some problems, but it works relatively well. The solution typically is to do something like:

CREATE TABLE object_root (
    -- the default draws from one sequence, which inheriting tables share
    object_id bigserial
);

And then inherit object_root.
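For instance, assuming object_root exists as defined above, two hypothetical tables (contact and asset, invented for this sketch) could inherit it like this:

```sql
-- Assumes object_root exists as defined above. Table names are hypothetical.
CREATE TABLE contact (
    name text NOT NULL
) INHERITS (object_root);

CREATE TABLE asset (
    tag text NOT NULL
) INHERITS (object_root);

INSERT INTO contact (name) VALUES ('Alice');
INSERT INTO asset   (tag)  VALUES ('LAPTOP-7');

-- From the root catalog, tableoid identifies the table that stores each row.
SELECT object_id, tableoid::regclass AS stored_in
  FROM object_root;
```

Nothing here enforces uniqueness of object_id across tables. The ids only stay distinct as long as every insert takes the inherited default, which is one of the problems alluded to above.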

The O/R "Hand Tools"

The following are in the hand tools category. They can be elegantly used wherever they seem useful. However they are ranked by increasing complexity cost.

Object methods for derived data and storage

The major complexity cost here is that as table structures change, methods can get out of date or break. This can be mitigated by source code control and factoring of the code so that it is easy to understand the codebase.
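As a small illustration of both the technique and its cost, here is a sketch using a hypothetical rectangle table with an area method.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE rectangle (
    id     int PRIMARY KEY,
    width  numeric NOT NULL,
    height numeric NOT NULL
);

-- A "method": a function over the table's row type that derives a value.
CREATE FUNCTION area(rectangle) RETURNS numeric
LANGUAGE sql IMMUTABLE AS $$
    SELECT $1.width * $1.height;
$$;

INSERT INTO rectangle (id, width, height) VALUES (1, 3, 4);

-- Function-call and attribute-style notation are equivalent here.
SELECT r.id, area(r) FROM rectangle r;
SELECT r.id, r.area  FROM rectangle r;
```

The function body is stored as text, so renaming width or height later does not update it, and the method fails the next time it is called. That is the kind of breakage source control and careful factoring are meant to catch.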

Table Inheritance to Model Interface Inheritance

This approach trades increased ramp-up time and knowledge for enforcement of interface consistency. It will slow down the design of the database table layout, but it will speed up the use of the database once designed. On the whole this is a bit of a wash complexity-wise, but I would counsel against over-using it.

The O/R "Power Tools"

Like the hand tools above, these are ranked according to complexity costs. Unlike the hand tools, these are specialized tools, best used rarely. They have high complexity costs and usually require compensating designs to use safely.

Complex Data Types in Views

Complex data types in views are machine, not human, interfaces. These make it harder to access data and can confuse casual observers. Additionally, as types change, the underlying plumbing may have to change as well.

Complex Data Types in Tables

Complex data in tables greatly increases the complexity of table constraints. There are cases where it helps, but in general multiple inheritance is a cleaner alternative with fewer gotchas. Complex data in tables is likely to be useful when moving O/R code over from Oracle or other systems that do not support multiple inheritance, or in the rare case where table inheritance is inadequate to solve the problem at hand.

Nested Table Storage

Currently nested table storage is relatively useless except in cases where it allows otherwise impossible constraints to be modelled. The current approaches to storage make it a poor fit for general-purpose queries, at least where the data sets are likely to get large. Nested tables can, however, be used for small data sets where indexes might not be useful anyway, or as a point of original entry where data is then copied into another table for actual access.

There isn't any reason why these issues can't be solved in future versions, but the fact that this is a tool that works well only with edge cases anyway means it is unlikely to be a very high priority to folks not using it.

Table Inheritance to Model Data Inheritance

This adds tremendous complexity to the schema, and requires a lot of effort to make work correctly. Only use it if you are sure that in your specific case the complexity issues it solves for you are worth the costs.


Wednesday, September 19, 2012
