Thursday, September 27, 2012

O/R Series Epilogue 2: Towards a not-so-simple explanation of Object Relational Database Management Systems

Object-relational databases build on the relational model but focus on answering questions about information that is derived from other information, using a variety of general-purpose tools.  Modelling then shifts from covering only what is known to covering what can be easily (or even not so easily) extrapolated from what is known.  Existing ORDBMS-type systems appear to include Informix (of course, given the PostgreSQL legacy), Oracle, and DB2, but PostgreSQL and Informix have the longest lineage in this area.

The key features of an ORDBMS are:
  1. The ability to model easily extrapolated information within the data model itself, through use of user defined functions
  2. Extensibility with regard to types and functions, written in general purpose programming languages
These features work together in important ways, and it is possible, as we have seen, to build a great deal of intelligence into a data model by moving only slightly beyond the standard relational approach.  Although every step further brings complexity costs, these features are very often useful and allow problems to be solved close to the database which cannot easily be solved otherwise.

Towards an Object-Relational Algebra

One way to think of object-relational modelling is in the language of relational algebra.  I haven't been able to find accepted notation for functions on individual tuples, so here I use f(relation) notation, to show that the function operates over the set of tuples in the relation.

Relational algebra as defined by Dr Codd is extremely important but it cannot solve certain classes of important queries and so SQL has gone beyond it.  Important blind-spots include:
  1. Transitive operations at arbitrary depth.  Transitive closure and "n next highest values" are both examples of this.
  2. Calculated data problems
Most problems fall into one of these two categories if they are outside the relational model.

We already know that elements of a tuple are functionally dependent on others if, and only if, for each value of the dependency, there is only one functionally dependent value.  So if a is functionally dependent on b, for every b there is exactly one valid value of a.  Functional dependency, as in algebra, is value-based, not expression-based.

I am choosing an algebra-inspired notation for this, where f(R) is a function of relation R if, and only if, for every tuple in relation R there is exactly one value of f(R).

f(R) is trivial if, for every tuple (a, b, c) in relation R, f(R) returns a value or a tuple that is a subset of the tuple processed.  So if for every tuple (a, b, c), f(R) = a, or f(R) = (a, c), then the function is trivial.  Every trivial function also represents a trivial functional dependency within the tuple.

A function is relational if it can be expressed solely through relational algebra.  All trivial functions can be expressed relationally (using π operations) and are therefore also relational.  A relational function thus always specifies a functional data dependency in or between relations; in other words, relational functions always denote global functional dependencies.

A function is non-relational if it cannot be expressed solely through relational algebra, for example because it involves processing the actual value of one or more of the tuple elements or their functional dependencies.  If we have a relation Emp containing an attribute salary_per_wk, and annual_salary(Emp) = π_{salary_per_wk * 52}(Emp), then annual_salary is non-relational because it involves actual processing of data inside the tuple.  Relational functions can often be expanded into relational operations, but to relational operations non-relational functions are black boxes and behave very much like attributes of a relation.
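In PostgreSQL terms, such a non-relational function is simply a user-defined function over the row type; to relational operations it then behaves like an extra attribute.  A minimal sketch, assuming a hypothetical emp table (the names are mine, not from the series):

CREATE TABLE emp (
    id serial primary key,
    name text not null,
    salary_per_wk numeric not null
);

-- annual_salary is derived at call time and never stored
CREATE FUNCTION annual_salary(emp) RETURNS numeric
LANGUAGE SQL IMMUTABLE AS $$
    SELECT ($1).salary_per_wk * 52;
$$;

-- usable with attribute notation, as though it were a column:
-- SELECT e.name, e.annual_salary FROM emp e;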

For example, id(R) = π_{id}(R) and c(R) = π_{name}(σ_{id=id(R)}(C)) are both relational functions, but only id(R) is trivial.  c(R) essentially represents a join and subselect.

An example of a general operation in functional notation might be:

π_{age(R)}(R).  Similarly we can write π_{name(R)}(σ_{age(R)=41}(R))

Of course, we must be careful.  Since age(R) is only locally functionally dependent, indexes are out of the question and we must be careful about specifying constraints.  Defining a relation such that age(R) < 65, for example, might prove problematic unless we are re-checking it every day.

This would be similar to the following statement in PostgreSQL:

SELECT r.name
  FROM employee r
 WHERE r.age = 41;

where name and age are table methods.  This allows us to store less information in the database (and hence have less duplication and fewer chances for error) and extrapolate important information from the data that is stored.  It also allows us to store data in less traditional ways (nested structures, etc.) for the sole purpose of writing functions against it in that specific format, and thus to model constraints which cannot be modelled using more traditional structures (though, as we have seen, that poses significant gotchas and complexity costs).
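A minimal sketch of how such table methods might be defined, assuming a hypothetical employee table which stores first_name, last_name, and birth_date rather than name and age directly:

CREATE TABLE employee (
    id serial primary key,
    first_name text not null,
    last_name text not null,
    birth_date date not null
);

-- both methods are derived at call time rather than stored
CREATE FUNCTION name(employee) RETURNS text
LANGUAGE SQL IMMUTABLE AS $$
    SELECT ($1).first_name || ' ' || ($1).last_name;
$$;

CREATE FUNCTION age(employee) RETURNS int
LANGUAGE SQL STABLE AS $$
    SELECT extract(year from age(($1).birth_date))::int;
$$;

With these in place the query above runs as written, and r.age is recomputed on every call rather than read from storage.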

Similarly, recursive queries require recursive operators to work.  I place a capital Greek sigma (Σ) above the join operator to signify recursion.  This is borrowed because Σ is the series designator elsewhere in mathematics.  An optional maximum depth is specified as a subscript to the Σ, so Σ_5 would indicate that the expression or join should be subject to no more than 5 iterations.  In a recursive join, the join is repeated until the θ condition is no longer satisfied.  The functions path(R) and depth(R) are functionally dependent on the output of a recursive join, so Σ_5 is equivalent to σ_{depth(r) ≤ 5}(Σ(...)(r)).  The choice of Σ is also helpful because while Σ always returns an extended superset, σ always returns a subset.

Since path is functionally dependent on the output of a recursive join, we can prove transitive closure over a finite set using recursive self-joins, functions, and boolean operators.  We can also express "next highest N tuples" results.  Alternatively, the Σ can be followed by parentheses to show that an entire expression should be repeated until it brings no further results into the set.  In a Σ expression the set is divided into two subsets: previous and new.  New results are those returned by the last iteration, and are the only ones processed for join conditions.  On each iteration, the "new" tuples are moved into the previous set and the tuples which satisfied the join condition become the "new" set.
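SQL reaches the same results with WITH RECURSIVE.  A sketch of transitive closure with depth and path tracking over a hypothetical part_link table (all names are mine):

CREATE TABLE part_link (
    parent_id int not null,
    child_id  int not null,
    primary key (parent_id, child_id)
);

-- everything reachable from part 1, to a maximum depth of 5
WITH RECURSIVE reachable(child_id, depth, path) AS (
    SELECT child_id, 1, ARRAY[parent_id, child_id]
      FROM part_link
     WHERE parent_id = 1
    UNION ALL
    SELECT pl.child_id, r.depth + 1, r.path || pl.child_id
      FROM part_link pl
      JOIN reachable r ON pl.parent_id = r.child_id
     WHERE r.depth < 5                    -- the subscript on Σ
       AND pl.child_id <> ALL(r.path)     -- cycle protection
)
SELECT * FROM reachable;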

I also use three other symbols for specifying order-dependent information.  ω (omega) denotes a "window" order and partition within which aggregates can be applied to tuples in order; tuples can be removed from the beginning (α, with a subscript for the number of tuples to skip) or truncated from the end (τ, with a subscript for the number of tuples after which the set is truncated).  These allow me to approach problems which SQL can address but relational algebra cannot.  An interesting property of ω is that the window order is only valid for certain specific operations and is lost on any join or select operations.  These operations have interesting properties as well, but they are somewhat outside the scope of this posting.  I will however note that it is usually cleaner to solve "next N results" issues with window ordering, tuple omission, and truncation than with recursion and aggregates.
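In SQL terms, ω, α, and τ correspond roughly to window ordering plus OFFSET and LIMIT.  A sketch of a "next five salaries after the top three" query, reusing the hypothetical emp table from the earlier sketch:

-- ω: order the window; α: skip three tuples; τ: truncate after five
SELECT name, salary_per_wk
  FROM emp
 ORDER BY salary_per_wk DESC
OFFSET 3
 LIMIT 5;

-- the same thing with an explicit window ordering
SELECT name, salary_per_wk
  FROM (SELECT name, salary_per_wk,
               row_number() OVER (ORDER BY salary_per_wk DESC) AS rn
          FROM emp) ranked
 WHERE rn BETWEEN 4 AND 8;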

Next:  Towards a simple explanation of object-relational database systems.

Sunday, September 23, 2012

O/R Series Epilogue 1: 4 Things I Wish PostgreSQL did

In this area there are things I wish PostgreSQL did.  These are listed in order of priority.  Each item also has workarounds listed.

Parameters in Object Methods

In Oracle you can define an object method such that you can do something like:

SELECT li.length(30) from line_segment li;

In PostgreSQL there is no way to pass parameters to an object method called with object.method syntax.  The equivalent in PostgreSQL would be:

SELECT length(li, 30) from line_segment li;

This isn't a huge deal, but it would make it a lot clearer that length is closely tied to the line_segment type structure if it were possible to do things that way.

Inheritable foreign key constraints

A very useful extension to what we have would be to allow foreign keys to be inherited on the referencing side.  This would save one from having to define foreign keys over and over for each child table, and it would make key classes even more useful.

Ideally I would like to be able to do

CREATE TABLE country_ref (
     country_id int references country(id)
);

and have the foreign key properly enforced on all inheriting tables.  This would just involve copying the foreign key constraint, and so would probably not require any deep changes to structure the way the ability to reference child tables and parent tables together might.
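For comparison, the workaround today is to restate the constraint on each child.  A sketch with a hypothetical invoice table inheriting country_ref:

CREATE TABLE invoice (
    id serial primary key,
    amount numeric not null
) INHERITS (country_ref);

-- the country_id column is inherited, but the foreign key is not,
-- so it has to be re-added here (and on every other child)
ALTER TABLE invoice
  ADD FOREIGN KEY (country_id) REFERENCES country(id);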

Inheritable unique constraints 

Related to this, I think another feature would smooth things over and might ultimately lead to more functionality down the road.  For now, I think the first step would be a SCOPE predicate for unique constraints and indexes.  This could be something like:

CREATE TABLE country_ref_u (
    country_id int,
    UNIQUE (country_id) SCOPE PER TABLE,
    FOREIGN KEY (country_id) REFERENCES country(id)
);

Such a unique constraint could then specify that this is unique on each table where it is inherited.  Initially PER TABLE and ONLY HERE (i.e. not inherited) might be supported with an effort to eventually support SCOPE FULL (and DRI against an inheritance tree).

One of the fundamental problems one runs into with unique constraints across tables is that extending the index format to include the tableoid might lead to significantly larger indexes and very likely slower lookups where inheritance is not used.  One option might be to have an additional index page with special pointers to other indexes, so per-table indexes might be useful jumping-off points for full-tree constraints.

Consistent assumptions in composite types used for columns

Composite types in columns are one of those areas where the behavior is remarkably inconsistent and the edge cases poorly documented.  This isn't surprising given that this is relatively new functionality at the edges of relational database usage, and different people have different ideas of how it should work.  It seems very difficult to get everyone on the same set of assumptions, however, and it is impossible to fix the current mish-mash while maintaining backwards compatibility.  Right now there is no consistency, and this makes it a difficult area to use going forward.

Wednesday, September 19, 2012

Object/Relational modelling Part 7: General Design Considerations

The previous posts have hopefully opened up the way we look at modelling relational data, in a move from thinking about relations and tuples to thinking about sets of objects coupled with derived values and catalogs.  These approaches are all viable as long as the actual table storage layout still meets all criteria of a well-designed relational database.

Unfortunately with new techniques and approaches, additional antipatterns are possible.  Every feature can be badly used, and every feature can be badly combined with other features.  For this reason, the more one adds advanced features which complicate the data model, the more it becomes possible to spectacularly fail.  Some of the tools discussed so far can be liberally used, but many must be used sparingly and only as necessary.

I have personally been working with features like table inheritance for some years now.   The features can be of immense help if used properly, but much of the current documentation sets users up to be spectacularly disappointed.

I am now at a point where I can summarize the differences between relational thinking and object-relational thinking.  The difference is similar to the difference between algebra and calculus.  Relationally we want our information arranged so we can select, join, filter, and aggregate to provide useful answers to questions.    This is not so different from using algebra to factor and simplify expressions or to solve problems for specific unknowns.

When a student is learning calculus however, the same tools are used but applied in a very different way.  The approach to thinking is very different.  Calculus applies many of the same techniques to very small numbers and to trends between numbers.  For example, if you have a curve and want to find out what the slope is at a certain point, you look at the limit of the slope between two points as those points converge.  Similarly an integral is the sum of an infinite series of infinitesimals.   Calculus can thus be seen as a level of abstraction above algebra useful primarily for those problems that algebra cannot solve directly, but it builds upon algebraic techniques, and anyone doing calculus will apply algebra wherever possible first.

Here we have a similar process.  We are aggregating or decomposing relational information such that we can derive interesting values from them. However, for a whole host of reasons we do not want to lose sight of the fact that the database --- at least in the sections we are running day to day SELECT queries from --- is relationally well-formed.

If the above comparison has merit then it is worth heeding the words of my college Calculus teacher, Mike Lavender.  Many of these features are power tools and should be used both sparingly and with care.

A second important point is that object-relational and object-oriented thinking is also quite different.  Object-relational thinking ties data structures to their functional dependencies which may not be stored in the database, and tends to be best when approaching data structures from the perspective of what answers can be derived.  Thus, a square is a rectangle with an additional constraint, and this is perfectly consistent with the Liskov Substitution Principle applied to an information model.  A square is mathematically substitutable for a rectangle in terms of what we can derive from it.

On the other hand, object-oriented programming is about modelling behavior and what can be done to objects without transforming them across categories.  Thus in object-oriented programming, you cannot say that a square is a rectangle because there are things we can do to a rectangle that we cannot do to a square without transforming them across class boundaries.  As I will attempt to show in the future, while object-oriented design in application code is very much an art, object-relational design can be expressed in terms of mathematics, namely relational algebra with some extremely modest extensions.  This difference in role and approach means that while many of the object-oriented design principles can be found to apply, they apply to object-relational designs in ways very different than they would in the app layer.

I am planning a short (three post) series on my attempt at an object-relational algebra, and a longer series on applying SOLID principles to object-relational design.


The Problem:  "Relational Decay"

 

The basic problem is the silent degradation of the usefulness of relational data constraints that occurs when the design does not allow constraints to be mapped cleanly.  Working around it requires ever-increasing complexity of data management; accepting it means degraded data constraints.  Both are dangerous and can eventually lead to data integrity and accuracy issues.

Anti-Pattern: Mixing Stored and Subclass Data in Same Virtual Table

Having data written to both parent and child tables complicates data modelling in ways that result in functionally ambiguous foreign keys.  Functionally ambiguous foreign keys are generally bad and to be avoided: foreign keys should be clear, unambiguous, and reference specific tables.  Data inheritance, i.e. mixing stored data and data of subclasses together, is a recipe for problems.  If a single query pulls data together like this, relational integrity becomes problematic.  Don't do it.

Anti-Pattern:  Multi-way sync'ing.

One of the solutions for virtually any key issue is to use materialized views.  This is particularly helpful when the representation of data must be transformed from a form where derived values can be constrained to one where it can reasonably be queried relationally (nested table structures create this problem, among others).  It's tempting to try to synchronize data both ways, but this is unmanageably complex.  Data flow needs to go only one way.

Anti-Pattern: Commonly Retrieving Data Directly from Nested Storage

As we have shown, nested tables present massive problems regarding storage and indexing because the logical storage and the physical storage are not really of the same structure.  Nested storage makes a great deal of sense for some problems, as we have seen, but used indiscriminately, the result is invariably a mess.

Anti-Pattern:  Casting complex type to text

When prototyping types, perhaps before writing a type in C, it may be tempting to write casts of tuple types to text.  If you do this, however, things are likely to break because there are built-in casts to text that are used internally.  Instead use an as_text method.
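A minimal sketch of the as_text approach, using a hypothetical composite type:

CREATE TYPE inventory_item AS (
    sku text,
    qty numeric
);

-- leaves the built-in record-to-text cast alone
CREATE FUNCTION as_text(inventory_item) RETURNS text
LANGUAGE SQL IMMUTABLE AS $$
    SELECT ($1).sku || ' x ' || ($1).qty::text;
$$;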

 

Solutions:  Relational first, Object second


The solutions below have one thing in common: they allow relational integrity to be preserved while putting these sorts of power tools to use.  They are not necessarily to be used willy-nilly either.  Each one has something of a complexity cost, and this must be weighed against the reductions in complexity that each one provides for a project.

Interface inheritance

Interfaces should be inherited, not data.   In this regard, the inheritance design follows from data.  This first pattern is to build inheritance trees assuming that the only tables being actively queried will be those at the bottom of the tree, where data is actually stored.  In this case, inheritance is used not to provide additional query capabilities but rather to provide consistent interfaces across a domain.

Stock and Leaf Inheritance Tree Design

Stock and leaf is a name I gave to the idea that you can build an inheritance tree which separates tables out into functional units from which no further inheritance occurs (leaves) and infrastructure tables which provide consistent query capability for a subset of table information.  These stock tables never have data stored in them but provide general query capabilities.

A way to think of the stock and leaf design might be a table partitioned in a complex way.  Maybe we have an initial partition based on some ranges of one primary key column, but one of the first-level tables (not all of them) is partitioned on a second primary key column.  Thus we have leaf nodes occurring at several different levels of inheritance.  This allows for organic growth of inheritance, but avoids the functionally ambiguous foreign keys that often result.
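A minimal sketch of a stock and leaf layout, with hypothetical names; rows are only ever written to the leaves:

-- stock table: never holds rows itself, but provides the interface
-- and a convenient place to query the whole subtree
CREATE TABLE note (
    id bigserial primary key,
    subject text not null,
    body text not null
);

-- leaf tables: the only tables actually written to; constraints such
-- as the primary key are not inherited and would be restated here
CREATE TABLE customer_note (
    customer_id int not null
) INHERITS (note);

CREATE TABLE invoice_note (
    invoice_id int not null
) INHERITS (note);

-- query through the stock table when the distinction does not matter:
-- SELECT subject FROM note WHERE body ILIKE '%overdue%';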

Simplified Write Model

If the data model requires synchronizing some aspects of it, it is very important to keep the write scenarios to a minimum.  Fewer write scenarios means fewer things that can go wrong, easier testing, more understandable failure cases, and easier troubleshooting when things go wrong.

Log-Aggregate-Snapshot Modelling

One way to greatly simplify writes is to move from representing current state in the database to representing cumulative changes in the database.  Accounting systems, for example, have a very restricted write model (usually few if any update scenarios, most important tables being append-only, etc) along with some current state (balance of the checking account) being calculated based on past changes.  This is a pattern which can be used elsewhere, and it enables other approaches (see below).

The typical objection is that representing state as a series of changes means one is storing a lot more data and that calculating state then requires expensive calculations.  The solution to this is to use periodic snapshots which can be used as roll-forward points, as well as to constrain when input may be entered.  For example we may take year-end snapshots of account balances, plus month-end snapshots of invoice balances, preventing invoices from being entered that are more than a month old, and any financial transactions from being entered into periods in which books are closed.  Similarly this allows us to delete old data without destroying our ability to track current state.

Log-Aggregate-Snapshot modelling trades data model complexity for write simplicity.  It is itself a power tool but it makes many other power tools safely usable.
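A stripped-down sketch of the pattern for account balances, with hypothetical table names:

-- append-only journal: current state is never updated in place
CREATE TABLE acc_journal (
    entry_id bigserial primary key,
    account_id int not null,
    amount numeric not null,
    entered_at timestamptz not null default now()
);

-- periodic snapshot used as a roll-forward point
CREATE TABLE acc_snapshot (
    account_id int not null,
    as_of date not null,
    balance numeric not null,
    primary key (account_id, as_of)
);

-- current balance = latest snapshot + changes since that snapshot
SELECT s.balance + coalesce(sum(j.amount), 0) AS balance
  FROM acc_snapshot s
  LEFT JOIN acc_journal j
    ON j.account_id = s.account_id AND j.entered_at >= s.as_of
 WHERE s.account_id = 1
   AND s.as_of = (SELECT max(as_of) FROM acc_snapshot
                   WHERE account_id = 1)
 GROUP BY s.balance;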

Separation of Entry Storage and Query Storage for Append-Only Data

Another approach we can borrow from accounting, this time from the paper world, is the idea that the format at point of entry can be transformed and re-saved for later use in another form.  This works well only when the data is append-only and the write transformations are well understood.  However, one can use it for materialized views for inheritance trees where needed, or for breaking out data stored in nested tables where this is needed in order to enforce constraints.

For example, we may have nested tables in order to enforce subset constraints.  We can break these out on save into conventional tables where row constraints can be enforced and data more flexibly queried and optimized.
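A sketch of such a breakout, with hypothetical structures; the nested form is the point of entry and a trigger maintains the flat, query-friendly copy:

CREATE TYPE entry_line AS (description text, amount numeric);

-- point of entry, nested so that subset constraints can be enforced
CREATE TABLE invoice_entry (
    id bigserial primary key,
    lines entry_line[] not null
);

-- flat copy used for reporting and row-level constraints
CREATE TABLE invoice_line (
    invoice_id bigint not null,
    description text not null,
    amount numeric not null
);

CREATE FUNCTION break_out_lines() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO invoice_line (invoice_id, description, amount)
    SELECT NEW.id, l.description, l.amount
      FROM unnest(NEW.lines) AS l;
    RETURN NEW;
END;
$$;

CREATE TRIGGER invoice_entry_break_out
AFTER INSERT ON invoice_entry
FOR EACH ROW EXECUTE PROCEDURE break_out_lines();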

object_id inherited field

When using stock and leaf approaches, one of the difficulties can be tracking rows from the stock table catalogs back to the actual tables where they are stored.  Inheriting an object_id field poses some problems, but it works relatively well.  The solution typically is to do something like:

CREATE TABLE object_root (
    object_id bigserial
);

And then inherit object_root.
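For example, a hypothetical leaf table then draws its object_id values from the shared sequence:

CREATE TABLE vendor_note (
    vendor_id int not null,
    note text not null
) INHERITS (object_root);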

 

The O/R "Hand Tools"


The following are in the hand tools category.  They can be elegantly used wherever they seem useful.  However they are ranked by increasing complexity cost.

Object methods for derived data and storage

The major complexity cost here is that as table structures change, methods can get out of date or break.  This can be mitigated by source code control and factoring of the code so that it is easy to understand the codebase.

Table Inheritance to Model Interface Inheritance

This approach trades enforcement of interface consistency for increased ramp-up time and knowledge.  It will slow down designing the database table layout, but it will speed up use of the database once designed.  On the whole this is a bit of a wash complexity-wise, but one should counsel against over-using it.

The O/R "Power Tools"


Like the hand tools above, these are ranked according to complexity costs.  Unlike the hand tools, these are specialized tools, best used rarely.  They have high complexity costs and usually require compensating designs to use safely.

Complex Data Types in Views

Complex data types in views are machine interfaces, not human ones.  They make it harder to access data and can confuse casual observers.  Additionally, as the types change, the underlying plumbing may have to change as well.

Complex Data Types in Tables

Complex data in tables greatly increases the complexity of table constraints.  There are cases where it helps, but in general multiple inheritance is a cleaner alternative with fewer gotchas.  This is likely to be useful when moving O/R code over from Oracle or other systems that do not support multiple inheritance or the rare case where table inheritance is inadequate to solve the problem at hand.

Nested Table Storage

Currently, nested table storage is of limited use except where it allows otherwise impossible constraints to be modelled.  The current approaches to storage make it a poor fit for general purpose queries, at least where the data sets are likely to get large.  Nested tables can, however, be used for small data sets where indexes might not be useful anyway, or as a point of original entry from which data is then copied into another table for actual access.

There isn't any reason why these issues can't be solved in future versions, but the fact that this is a tool that works well only for edge cases means that solving them is unlikely to be a very high priority for those not already using it.

Table Inheritance to Model Data Inheritance

This adds tremendous complexity to the schema, and requires a lot of effort to make work correctly.  Only use it if you are sure that in your specific case the complexity issues it solves for you are worth the costs.

Sunday, September 16, 2012

O/R Modelling in part 6: Polymorphism (and Gotchas) in PostgreSQL

Polymorphism is common in virtually all database management systems.  When we add an integer to a float, we expect this to just work.  Similarly various types of text fields can usually be concatenated together, etc.  In particularly extensible database systems, like PostgreSQL, polymorphism tends to lead to additional problems.  Indeed, implicit casts were cut way back a few years ago specifically because of problems they might cause.  This is an area I expect we will do quite a bit more with in LedgerSMB in the future.

The PostgreSQL Type System

PostgreSQL types can be divided into three broad categories:  primitives (those supplied by PostgreSQL out of the box or by extensions), domains, and object types.  The object types include composite types and table types.  Although these are typically lumped together, we will cover table types and composite types separately because the capabilities are far more developed when tables are created.

Base Primitive Types

Base types are intended to be the basic building blocks of relations.  They include such types as int4 and text.  These cannot have  methods attached but they can be arguments in functions provided that the procedural language understands the type.  Most of these will be passed in using the textual representation if all else fails.

Polymorphic Primitive Types

Polymorphic primitives are primitives which can take the form of other base types.  The primary ones are anyelement (which is any simple base type) and anyarray which is an array filled with any simple base type.  These are mostly used in functions so that the function can apply to array operations generally.  I have not encountered them elsewhere.
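A sketch of how they are typically used in a function signature (first_element is my own example name, not a built-in):

-- returns the first element of any array type, or NULL if it is empty
CREATE FUNCTION first_element(anyarray) RETURNS anyelement
LANGUAGE SQL IMMUTABLE AS $$
    SELECT ($1)[array_lower($1, 1)];
$$;

-- SELECT first_element(ARRAY[3,1,2]);        -- 3
-- SELECT first_element(ARRAY['a','b','c']);  -- a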

Pseudotype Primitives

Pseudotypes include triggers and a few other things.  They are mostly used as keywords, not as actual types.

SQL-Created Domain Types

Domains are the first significant kind of SQL-created or user-defined type.  An SQL domain is a base type restricted to a subset of the base type's values.  Domains can be declared NOT NULL; whereas a NOT NULL constraint on a base-type column applies only on storage, with a domain this is also checked when the domain is used as an attribute of a composite type.
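A minimal example of a domain (the names are mine):

-- a base type restricted to a subset of its values
CREATE DOMAIN us_zip AS text
    NOT NULL
    CHECK (VALUE ~ '^[0-9]{5}$');

CREATE TABLE us_address (
    id serial primary key,
    zip us_zip    -- the NOT NULL and the pattern check come with the domain
);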

SQL-Created Object Composite Types

PostgreSQL allows you to create composite types which can be used as input and output types for functions, elements in other composite types, columns in a table, and so forth.  Composite types have no inheritance and cannot inherit interfaces from each other.  Composite types cannot have any recursive elements, and this is checked deeply.  For example you cannot:

CREATE TYPE a AS (
   a int
);

CREATE TYPE b  AS (
   a a
);

ALTER TYPE a ADD ATTRIBUTE b b;

An attempt to do so will result in an error:

 ERROR:  composite type a cannot be made a member of itself

SQL-Created Object Table Types 

Table types are similar to composite types except that inheritance is supported.  This means in practice that the behavior of tables is quite different than the behavior of composite types when it comes to methods, implicit casting, and the like.  Note that implicit casts are not inherited.

Type Conversion:  CREATE CAST

Values and objects can be converted between types using casts.  Casts can be explicit only, implicit on assignment, and implicit generally.  In some cases casts can be dangerous.

The overall syntax of CREATE CAST is discussed in the PostgreSQL manual.  Almost all user-created casts are likely to use an SQL function to convert one type to the other.  In general, explicit casts are best because they force clarity in code and predictability at run-time.

For example see our previous country table:

or_examples=# select * from country;
 id |     name      | short_name
----+---------------+------------
  1 | France        | FR
  2 | Indonesia     | ID
  3 | Canada        | CA
  4 | United States | US
(4 rows)

Now suppose we create a constructor function:

CREATE FUNCTION country(int) RETURNS country
LANGUAGE SQL AS $$
SELECT * FROM country WHERE id = $1 $$;

We can show how this works:

or_examples=# select country(1);
    country   
---------------
 (1,France,FR)
(1 row)

or_examples=# select * from country(1);
 id |  name  | short_name
----+--------+------------
  1 | France | FR
(1 row)

Now we can:

CREATE CAST (int as country)
WITH FUNCTION country(int);

 And now we can cast an int to country:

or_examples=# select (1::country).name;
  name 
--------
 France
(1 row)

The main use here is that it allows you to pass an int anywhere into a function expecting a country, constructing the country at run-time.  Care must of course be taken, just as with dereferencing foreign keys, if performance is to be maintained.  In theory you could also cast country to int, but in practice casting complex types to base types tends to cause more problems than it solves.

Inheritance-Based Implicit Casts

Inheritance-based implicit casts are an exception to the dangers of implicit casting.  These are not base type to base type casts, and they essentially act as slices of a table.  If your table structures use inheritance in a sane way they will work well.

Defined Implicit Casts

We can generally define implicit casts between tuples.  This is best done sparingly because any implicit cast multiplies execution pathways.  Defined implicit casts are not inherited.

To create an implicit cast, we could do as follows:

DROP CAST (int as country);

CREATE CAST (int as country)
WITH FUNCTION country(int)
AS IMPLICIT;

The problem of course is that as soon as we do this we have the potential for ambiguity.

Suppose I have a hypothetical state table, and I have functions called capital which can take arguments of country or state and return the name of the capital city.  If I make a query of "select capital(1);", am I asking for a country capital or a state capital?  I have always avoided this situation because it can never end well (so much so that I don't actually know how PostgreSQL resolves the issue).

Keep implicit casts for cases where you really need them and avoid them in other cases.

Problems with Implicit Casts to Primitives

The problems with implicit casts are greatly magnified when casting to (and hence between) primitives.  We will discuss just one such problem here:  numbers to timestamps.

Currently PostgreSQL supports casting dates as timestamps, which is probably correct.  Suppose we want to take a number and turn it into a timestamp.  Maybe we are converting UNIX epochs to timestamps regularly and want to do so without an explicit cast.

create cast (double precision as timestamp with time zone) 
with function to_timestamp(double precision) 
as implicit;

And so we try this:

CREATE FUNCTION to_timestamp(int)
returns timestamptz
language sql immutable as
$$ select $1::double precision::timestamptz; $$;

And add an implicit cast for plain integers as well:
create cast (int as timestamp with time zone) 
with function to_timestamp(int) 
as implicit;

This works well enough.  However, now we have a bug factory.

or_examples=# select extract(year from '2011-01-01'::date);
 date_part
-----------
      2011
(1 row)

or_examples=# select extract(year from 2011-01-01);
 date_part
-----------
      1969
(1 row)

Why?  Because the second is an integer math expression: 2011-01-01 evaluates to the integer 2009, which the implicit cast turns into 2009 seconds after midnight UTC on January 1, 1970 (the Unix epoch); displayed in a US time zone, that instant falls in 1969.  This is probably not what we had in mind.  Similarly:

or_examples=# select age('2011-01-01'::date);
      age     
---------------
 1 year 8 mons
(1 row)

or_examples=# select age(2011-01-01);
           age           
--------------------------
 42 years 8 mons 07:26:31
(1 row)

It is important to note here that the computer is doing exactly what we told it to do.  The problem however is that we now have bugs which are non-obvious to someone who is not quite fluent on the system to start with.

Another historical case that used to bite people is implicit casting of tuples to text.  This led to very counterintuitive behavior in some cases, such as the infamous tuple.name or tuple.text issues.   Basically tuple.name would return name(tuple::text), which would be a textual representation of the tuple truncated to 63 characters.  This was fixed by making the cast of tuple to text explicit in PostgreSQL 8.3.

In general, polymorphism is a very important feature of PostgreSQL, but it is one with many gotchas.  Adding explicit casts usually won't get you into too much trouble, but implicit casts are often asking for issues.

Thursday, September 13, 2012

Object/Relational Interlude: Messaging in PostgreSQL

PostgreSQL as an object-relational development platform doesn't end with data modelling.  The notification system is particularly powerful and worthy of mention.  This system however has some challenges and gotchas which are also worth mentioning.  The example below was early sample code bundled with LedgerSMB from 1.1 onward.  The code is not really useful in production for reasons specified but it can be a basis for production systems, and it functions as an example both of how to make things work and the problems you can encounter.

I would like to thank Marc Balmer of Micro Systems for some contributions from slides he had sent to me on this.

Basic Framework:  LISTEN/NOTIFY

PostgreSQL provides a messaging system with two commands, listen and notify.  These commands allow any process to broadcast an event to any listener.  In some older versions, listeners are listed in a system table and this information is presumed public.  In newer versions this information is not in the system catalogs and so this reduces the security exposure from the use of this method, but many areas of exposure remain and must be understood.

This approach is divided into broadcast channels.  Any user can broadcast along any channel.  Anyone can listen on any channel. The actual notifications are public.  The syntax is:

LISTEN channel; -- for the listener

and


NOTIFY channel [, payload]; -- for the broadcaster


Channels are database-specific.

A Brief Example

or_examples=# listen test;
LISTEN

In another session:

or_examples=# notify test, 'testing testing testing';
NOTIFY

In the first session, psql will only check for notifications when executing another query so we:

or_examples=# select 1;
 ?column?
----------
        1
(1 row)

Asynchronous notification "test" with payload "testing testing testing" received from server process with PID 29052.

It is of course possible to send actual meaningful data in the payload but this is probably not a good idea since we can't really verify the sender without a lot of extra code.

In an actual application we can take the notification to be a sign that we need to check something in the application.  In general, I can't find any case where the payload is actually a good idea to use.  It's probably better to use tables to create a message queue.  Then the notification can be used to reduce the actual requests to the tables.

Note further that NOTIFY is raised on transaction commit, making it transaction-safe.  NOTIFY allows you to build extremely interactive systems where many applications actively communicate across the database and trans-transaction logic can be encapsulated from a user application's perspective, within the database layer itself.
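A quick way to see the transactional behavior from two psql sessions:

-- session 1
LISTEN test;

-- session 2
BEGIN;
NOTIFY test, 'never seen';
ROLLBACK;                      -- nothing is delivered

BEGIN;
NOTIFY test, 'delivered at commit';
COMMIT;                        -- session 1 receives the notification only now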

Listen Gotchas

The LISTEN architecture does not provide security by itself.  This must be provided on the underlying tables.  In general it is always a mistake to trust the payload of the notification too much.

Additionally you need a way to extract only the records you want.  I will go over a badly designed case below.  One approach is a message table.  Another approach might be a message view.  Either way the listener is a separate program.

Finally it is worth noting that the application actually has to ask the connection libraries if there is a notification.  This can cause some confusion because, for example, psql only checks when a query is run.  With a Perl script, however, we might wake up to check once every minute or second, or whatever we want to do.

Notify Gotchas

Listeners require no special privileges to listen on a channel or to act on notifications.  The listener is a separate program and could be anywhere else.  You should never trust that the NOTIFY payload has not been overheard somewhere.

Additionally, it is worth remembering that NOTIFY is *always* raised on commit and cannot be removed without rolling back the transaction.  However, if the actual message is in a message table, it could be safely deleted.

Patterns

Message Queue Tables

When we add tables with messages to the system, we will be able to actually use PostgreSQL as a message queue server.  For example, in 9.2 we can do something like:

CREATE TABLE message_queue_test (
    id bigserial primary key, -- appropriate, not natural data
    sender text not null default session_user,
    payload json not null,
    received bool
);

CREATE INDEX msq_test_pending_idx 
ON message_queue_test (id, sender, payload) 
WHERE received IS NOT TRUE;

Of course the cases where this can be used as a covering index are vanishingly small, but over time the index may be useful in eliminating rows already read.  You can then keep old records around for a while for debugging purposes.  You could even have an abstract table, multiple child/"leaf" tables, and unified object-relational interfaces for these.  Indeed this may be an interesting project for the future.

Concurrent Access and Locking

Note that the standard rules for concurrent access and locking apply.  If you want this to be a message queue that several different programs can read, where every message gets sent out only once, then you should make sure you use SELECT ... FOR UPDATE.
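A sketch of a worker claiming one message from the queue table above; the row lock keeps two workers from processing the same message (the literal 42 stands in for the id returned by the SELECT):

BEGIN;

SELECT id, sender, payload
  FROM message_queue_test
 WHERE received IS NOT TRUE
 ORDER BY id
 LIMIT 1
   FOR UPDATE;     -- other workers block on this row until we commit

-- process the message in the application, then mark it done
UPDATE message_queue_test
   SET received = true
 WHERE id = 42;

COMMIT;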

An Example

In LedgerSMB I created a sample script that would show how to send an email out whenever a part ran short, to remind someone to order more.  Unfortunately this was not so well designed, but I guess it was good enough as an example.

The Trigger

CREATE OR REPLACE FUNCTION trigger_parts_short() RETURNS TRIGGER
AS
'
BEGIN
  -- notify when the part has hit or fallen below its reorder point
  IF NEW.onhand <= NEW.rop THEN
    NOTIFY parts_short;
  END IF;
  RETURN NEW;
END;
' LANGUAGE PLPGSQL;

In actual use cases, this isn't enough information to process this properly.  Please see the problems below.  The basic problem is that this notification causes the listener, when used, to email out a full parts short report to the individual(s) specified.  Depending on how it is set up, this might be a bunch of reports in a short time.  This could really use a queue table.

The Listener

I wrote this code quickly when I was still learning Perl. It's definitely quick and dirty code.  I am not proud of the clarity or maintainability, but what it does is actually trivial.

The script starts off with the basic setup, which basically sets up the configuration information and the database connections:

require "config.pl";

use DBI;
my $dsn = "dbi:Pg:dbname=$database";
my $dbh = DBI->connect(
    $dsn, $db_user,
    $db_passwd,
    {
        AutoCommit => 1,
        PrintError => 0,
        RaiseError => 1,
    }
);
$dbh->{pg_enable_utf8} = 1;

my $sth;

$dbh->do("LISTEN parts_short");



The next part is the main loop, which just wakes up, checks for notifications, acts on them if applicable, and if not goes back to sleep:

while (1) {    # loop infinitely
    if ( $dbh->func('pg_notifies') ) {
        &on_notify;
    }
    sleep $cycle_delay;
}

And finally what we do if we got a notification:

sub on_notify {
    open( MAIL, '|-', "$sendmail" );
    $sth = $dbh->prepare( "
        SELECT partnumber, description, onhand, rop FROM parts
        WHERE onhand <= rop
  " );
    $sth->execute;
    print MAIL $template_top;
    while ( ( $partnumber, $description, $avail, $rop ) = $sth->fetchrow_array )
    {
        write MAIL;
    }
    print MAIL $template_foot;
    close MAIL;
}

Problems

Aside from the fact that the code isn't the most maintainable code out there, there are a number of structural problems with the whole solution.  First, what this actually does is probably not what you actually want.  It wakes up every so often (a configurable value) and checks for notifications and if it finds them will send out a parts short report to the designated individual.  Note it won't send just the new parts, but actually all the parts short.  If you have a store with thousands of parts in stock and are nearing ROP on a few hundred, and then the holiday traffic starts before your orders come in, this will be rather annoying and likely to be intentionally filtered out by whoever is supposed to receive it.

Additionally, if you have a case where you are short many parts, a malicious user may be able to find a way to force the system to send out a parts short report every time it wakes up, which may not cause a denial of service attack but in fact may be highly annoying and prevent the actual emails from being reasonably useful.

Solutions

The best solution is to go with a queue table and have the trigger write to it.  Then the reader can read it and email out a notice describing exactly what has happened.  This means more relevant information and a better signal to noise ratio.
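A sketch of that design, with hypothetical table and column names (assuming the parts table has an integer id); the trigger records exactly which part went short, and the listener reads and marks the queue:

CREATE TABLE parts_short_queue (
    id bigserial primary key,
    parts_id int not null,
    onhand numeric not null,
    rop numeric not null,
    noticed_at timestamptz not null default now(),
    handled bool not null default false
);

CREATE OR REPLACE FUNCTION trigger_parts_short() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    IF NEW.onhand <= NEW.rop THEN
        INSERT INTO parts_short_queue (parts_id, onhand, rop)
        VALUES (NEW.id, NEW.onhand, NEW.rop);
        NOTIFY parts_short;
    END IF;
    RETURN NEW;
END;
$$;

-- the listener then reads only the rows it has not handled yet:
-- SELECT * FROM parts_short_queue WHERE handled IS FALSE FOR UPDATE;
-- and sets handled = true once the notice has gone out.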

Next:  Polymorphism in PostgreSQL

Saturday, September 8, 2012

O/R modelling interlude: PostgreSQL vs MySQL

Every time we see people compare PostgreSQL and MySQL on the internet, it falls into a flame war fast.  I think a large part of the problem is that advocates of these databases look at the other database (and advocacy of the other database) through a specific lens and are therefore unable to understand the rhetoric from the other side.  This is an attempt to cut through some of this, offered in the spirit of suggesting that if we can't understand each other's views, then instead of discussion all we will get are flames.

This post is not intended to be about why MySQL sucks (although I will point out some cases where it is inadequate and other cases where it reduces transition costs and times) or about why you should choose PostgreSQL instead.  I am of course biased, having worked with MySQL for some time but switching to PostgreSQL for important work back in 1999.  I personally don't much like MySQL, so it is worth stating my bias up front.  However, I don't think that prevents me from trying to place it constructively in the marketplace or express a sympathetic understanding of what MySQL has to offer developers.

The endless arguments I think are the result of very specific pressures on the RDBMS market, and PostgreSQL in many ways targets what the system is in theory and what it can be, while MySQL targets how an RDBMS is more typically used when developing software for sale.  MySQL is disruptive only by virtue of being open source.  PostgreSQL is disruptive by design and the principles pioneered on Postgres have found their way into Informix, DB2, Oracle, and more.

A simple description of the difference (and this is not intended as a flame) is:

MySQL is what you get when application developers build an RDBMS.
PostgreSQL is what you get when database developers build an application development platform.

The above is not intended as a flame on either side by any means but it does show where people run into pain on both sides (server-side application-style development on PostgreSQL doesn't work, and trying to use MySQL the way you would use Oracle for data centralization really doesn't work either).

In case people are wondering, I intend in the future to look at object-relational modelling potentials in DB2 and Oracle in contrast to PostgreSQL as well.  These products, however, are more similar to PostgreSQL than MySQL is, both in terms of use case and market position.

App-Centric vs Data-Centric

The whole purpose of a database management system is to store data so that it can be re-used.  This is true regardless of whether you are using a NoSQL solution, a light-weight quasi-RDBMS like sqlite, or a heavy duty system like Oracle.    The type and scope of that re-use varies quite a bit between products though.  Larger-scale RDBMS's typically focus on flexible output and rigid, validated input, and this is more important as more applications start to write to the database.  Smaller databases and NoSQL databases tend to place the application in the driver's seat and do less validation.

These approaches exist on a continuum, of course, but in general you need more type checking the more applications may be writing to the database.  With loose checking, the assumption can be made that the database is primarily a store of private state information and therefore idiosyncrasies of the application don't matter that much, but when you have multiple applications writing to the same relations, the data is "public" in a way that it is not otherwise, and therefore there is a lot of value in ensuring that application idiosyncrasies do not cause misbehavior in other software or cause data to be misinterpreted.

So on one side we have big, shared data solutions like Oracle and DB2.  On the other, we have applications which are fundamentally designed to be a data store for a single application, where the developer can be given total freedom regarding data validation.  Many of the NoSQL solutions fit here.  Some SQL-like solutions are also on this side, such as SQLite.

We can look at how MySQL and PostgreSQL fit on this continuum and try to get a feel for why the different points of view seem to be so difficult to bridge.  We will look particularly at solutions given to past data integrity concerns and what those mean for information management and application development.  In PostgreSQL we will look at user-defined function enhancements and replication.  In MySQL we will look at strict mode.

These solutions show radically different views of how the software is expected to be used.  PostgreSQL may be characterized as conservative but innovative, and unwilling to do anything that might prejudice data in multi-application environments.  MySQL, on the other hand, may be characterized as focusing on the needs of the app developer, sometimes to the exclusion of the needs of the DBA.

MySQL on the app vs data spectrum:  SQL Mode Salad

MySQL 4.x and earlier was notoriously loose with data constraints.  Zero dates were seen as valid, as was a date like '2008-02-30'.  Data would be truncated to fit fields, or otherwise transformed.  These transformations were sometimes lossy but were predictable, and in fact make some sense in a content management environment (my first use of any RDBMS was MySQL for light-weight content management in this timeframe).  Typically the data stored was not terribly important (unlike accounting systems or the like), so it was good enough for its use case.  It may be extreme to call data truncation a feature, but given MySQL's initial use case that would not be entirely inaccurate.

The big problem though was that people were trying to build applications to do other things beyond the initial use case, and in many of these cases database transactions were needed and better type checking was needed.

To address these problems, MySQL leveraged its pluggable table system to allow third party vendors to create open source or dual-licensed (with a license from MySQL) transactional tables.  InnoDB, BDB, and a few other tables arose in this way.  Of course if transactions are handled at the table level, and tables are handled by plugins, then data definition language statements can never be transactional.   This isn't the end of the world for many MySQL deployments (for reasons stated below) but it also doesn't handle the type checking issues, which have to be handled before storage.

To handle the type checking issue, MySQL implemented the concept of SQL modes, which allows one to select a number of options which then define a dialect of SQL for use in the software.  In this way one can ease porting from other systems to some extent, and address questions like how strictly types should be checked.

MySQL allows any user to set the SQL mode for the session, and this can have effects ranging from SQL syntax to whether '2008-02-30' is accepted as a valid date.  In essence, in MySQL the application is king and the db a humble servant.

It is worth noting that the original few modes have been expanded into a very large set, allowing applications to tell MySQL to accept the syntactical idiosyncrasies of other RDBMS's.  This sort of thing shortens the time necessary to initially port an application to MySQL, and this is great as long as certain boundaries are maintained.

This sort of assumption only really works in practice where only one application is writing to a given set of relations.  If you have two or ten applications reading and writing the same tables, then each one of them can decide what sort of data assumptions the server should use when validating data prior to storing it.  MySQL thus trades robustness and guarantees on output of data for flexibility of input (this is also the fundamental tradeoff in NoSQL), and this ends up relegating the database to an application's private data store.

The typical response I have received when asking how to manage this is to put an API layer in the application which does the data checking.  In other words, the application, not the database, is expected to be the gatekeeper regarding valid data.

MySQL users think in terms of the "software stack" with MySQL existing just above the operating system.  The sole purpose of MySQL is to store data for the single application owning the relation.  MySQL users generally do not think in terms of shared data between applications using SQL as an API, and the MySQL developers happily provide the software they are looking for.  This software sits  somewhere between traditional RDBMS's and NoSQL systems in terms of flexibility of data in vs out.

MySQL does not provide facilities for DBA's to restrict which SQL modes are available.  This more or less prevents MySQL from outgrowing the single application per table use case, and it prevents MySQL from  being a genuine information management solution.  That isn't to say it is a bad development tool.  However it should be seen as an RDBMS-like application back-end, rather than a classical RDBMS (which has generally been geared towards centralized data management).

PostgreSQL on the app vs data spectrum and New features:  No Inherit Constraints and Named Arguments in SQL Functions

PostgreSQL began life as the Postgres project out of UC Berkeley.  It was initially a research testbed for advanced database concepts, namely those called "object-relational" in terms of data modelling.  The idea is that more complex problems can be modelled when behavior is tied to data structures which can then be relationally manipulated based on that behavior.

The basic promise of object-relational database management is that data structures can be tied to processing routines, so that more advanced models can be built and relationally queried.   This allows more complex applications to be written without overly burdening the database system with large amounts of data transfer or large result sets in memory.  With plain relational processing, we'd have to express what we can relationally and then filter out the excess in our application.  With object-relational modelling we can build more advanced filters into the SQL queries without affecting readability.  In order to do this, PostgreSQL allows for user defined functions to be written in a variety of languages with C, SQL, and PL/PGSQL being available on the default installation.  Other languages can be added using a plugin system (something MySQL has as well, but not well suited there for Object Relational management of data).

As we have been looking at, PostgreSQL provides powerful capabilities of modelling data which goes well beyond what data is stored and includes the ability to model data derived from what is stored.  PostgreSQL in fact pushes the envelope in this area even beyond where Informix, DB2, and Oracle have taken it.  It is thus a platform for building highly intelligent models of data.


All current major object-relational database implementations (including those of DB2 and Oracle) are at least inspired by Michael Stonebraker's work in this area, and by Postgres in particular.  Indeed, Informix's object-relational capabilities started life as a Postgres fork named Illustra, and both DB2 and Oracle more or less implement the O-R bindings to SQL developed on that platform.  PostgreSQL is different SQL-wise in part because Informix split prior to Postgres adopting SQL as a query language. 

Centralizing data modelling code in the database is generally a net win.  Result sets from SQL queries tend to be smaller, there are fewer round trips between the application and the database, and the models themselves can be used to create a public API, similar to the way an ORM might be used, thus allowing centralization of key code across all applications that may read or write to a specific database.  This doesn't mean "all logic belongs in the database."  Rather it provides for the ability to build more advanced (and efficient) data models which can place processing where it is most efficient and avoid the worst of the scalability bottlenecks.

PostgreSQL has always had a data modelling focus rather than the application back-end focus seen in MySQL.  When we look at two recent additions we can see that legacy as well.  In this model, guarantees regarding data are paramount, and applications are expected to use the relations, essentially, as public APIs.  Consequently the database, not the application, is responsible for data consistency and semantic clarity.  Foreign keys, for example, are never ignored by PostgreSQL (in MySQL, handling depends on the table type), and the 30th of February is never a valid date no matter how much the application would like to use it as such.  Dates are always in the Gregorian calendar where this is not valid.  If you care about handling Gregorian-to-Julian conversions you will have to do this for your locale, perhaps as a custom type.

PostgreSQL 9.2 will add the ability to use named parameters in the body of SQL-language functions.  This is a huge win for us with the LedgerSMB project and makes our lives a lot easier.  The change is backwards-compatible so our existing functions will just work, but it allows for clearer SQL code in function bodies for functions that may sometimes be able to be inlined.  In terms of object-relational modelling this is a big deal.  However in our specific case it is a larger issue because we use these to define application API's within the software.  SQL language functions have been a part of PostgreSQL for some time, but have slowly moved towards being extremely useful and powerful.  For example, sometimes they can be in-lined and treated as subqueries and this has important benefits performance-wise.

The addition of named arguments in bodies means one can do functions like:

CREATE OR REPLACE FUNCTION asset_item__search
(in_id int, in_tag text, in_description text, in_department_id int, in_location_id int)
RETURNS SETOF asset_item  AS
$$
 SELECT * FROM asset_item
  WHERE (id = in_id OR in_id IS NULL) AND
        (tag like in_tag || '%' OR in_tag IS NULL) AND
        (description @@ plainto_tsquery(in_description)
              OR in_description IS NULL) AND
        (department_id = in_department_id
              OR in_department_id IS NULL) AND
        (location_id = in_location_id OR in_location_id IS NULL);

$$ LANGUAGE SQL;

Instead of

CREATE OR REPLACE FUNCTION asset_item__search
(in_id int, in_tag text, in_description text, in_department_id int, in_location_id int)
RETURNS SETOF asset_item  AS
$$
 SELECT * FROM asset_item
  WHERE (id = $1 OR $1 IS NULL) AND
        (tag like $2 || '%' OR $2 IS NULL) AND
        (description @@ plainto_tsquery($3) OR $3 IS NULL) AND
        (department_id = $4 OR $4 IS NULL) AND
        (location_id = $5 OR $5 IS NULL);
$$ LANGUAGE SQL;

While we are slowly converting our code base to use ORDBMS features, this will help keep things more maintainable.  Of course a more O/R procedure might be:

CREATE OR REPLACE function similar_to(self asset_item) 
RETURNS SETOF asset_item
LANGUAGE SQL AS $$

 SELECT * FROM asset_item
  WHERE (id = self.id OR self.id IS NULL) AND
        (tag like self.tag || '%' OR self.tag IS NULL) AND
        (description @@ plainto_tsquery(self.description)
              OR self.description IS NULL) AND
        (department_id = self.department_id
              OR self.department_id IS NULL) AND
        (location_id = self.location_id
              OR self.location_id IS NULL);
$$;

You can see the difference in clarity.  In fact, in LedgerSMB, we often find ourselves using pl/pgsql solely for the ability to reference parameters by name in the query bodies.  We find this is more robust for what we do, and easier to read as well.
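
For instance, a minimal pl/pgsql sketch (a hypothetical function, not taken from LedgerSMB) showing a parameter referenced by name inside an embedded query:

CREATE OR REPLACE FUNCTION asset_item__get(in_id int)
RETURNS asset_item
LANGUAGE PLPGSQL AS $$
DECLARE retval asset_item;
BEGIN
    -- in_id is available by name inside the query
    SELECT * INTO retval FROM asset_item WHERE id = in_id;
    RETURN retval;
END;
$$;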

User defined functions have been with PostgreSQL from the very beginning, and in fact are required for doing any significant object-relational modelling.  However they are still evolving because this is where PostgreSQL's focus has always been.  We have seen major improvements here in every major release and the function-handling capabilities of PostgreSQL are among the best of any database I have ever worked with.  With third-party handlers it is possible to use all kinds of languages for functions (including Perl, Python, R, and Java) and these can be incorporated in standard SQL queries.  PostgreSQL's focus is and has always been on doing advanced data modelling where the application is best seen as a consumer of managed data models.
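
As a small illustration, here is a sketch of a PL/Perl function used directly in a query.  This assumes the plperl extension is installed, and the function itself is hypothetical:

CREATE EXTENSION IF NOT EXISTS plperl;

CREATE OR REPLACE FUNCTION trim_whitespace(text)
RETURNS text
LANGUAGE plperl AS $$
    my ($val) = @_;          # arguments arrive in @_
    $val =~ s/^\s+|\s+$//g;  # strip leading and trailing whitespace
    return $val;
$$;

SELECT trim_whitespace(description) FROM asset_item;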

From the perspective of the original Postgres developers and the primary developers today, PostgreSQL is a solution for managing and modelling data, where many applications may write to the same relations, and where the relations function, essentially, as an API.  For this reason, PostgreSQL tends to be far more strict about what it will allow applications to do than engines built around the single-app use case.  SQL, in effect, defines public API's, and it is the job of the RDBMS in such an environment to ensure that those API's behave consistently.

This approach blends the traditional roles of database and middleware because a great deal of business logic can be reduced to questions of the data model itself, what questions are asked of the database and what sorts of responses there are.

The other feature to look at is the addition of non-inherited CHECK constraints in 9.2.  In past versions, as we have covered, all check constraints were inherited by all child tables.  Again this allows one to use table inheritance safely without all of the key issues that have plagued it before, although I still consider it an antipattern to insert both into parent and child tables.
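
For illustration, a sketch of the 9.2 syntax on a pair of hypothetical tables; the CHECK (false) NO INHERIT trick is one way to stop direct inserts into the parent without constraining the children:

CREATE TABLE base_entity (
    id serial primary key,
    label text not null,
    CONSTRAINT no_rows_in_parent CHECK (false) NO INHERIT
);

CREATE TABLE child_entity (
    child_attr text
) INHERITS (base_entity);

-- inserts into child_entity succeed; inserts into base_entity fail the check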

MySQL and PostgreSQL offer different sorts of flexibility.  MySQL offers tremendous flexibility with what sorts of constraints an application wants enforced at the session level, and what sorts of guarantees a type of table can offer.  These are very useful for some problems.

PostgreSQL offers neither of these, but instead offers flexibility in building advanced data models.

MySQL is designed with the idea that applications provide logic and the database provides dumb storage of the application's state.  While this has changed a bit with the addition of user-defined functions and stored procedures, the overall design constrains MySQL primarily to this use case.  This is not necessarily a bad thing as, traditionally, software licensing costs and requirements have often required that even advanced database systems like Oracle are used in this way.  MySQL targets the "my app, my database" world and is usually sufficient for this, particularly when lowest common denominators are used to ensure portability.

PostgreSQL, on the other hand, is designed with the idea that the database itself is a modelling tool, and that the applications interact with it over an API defined in SQL.  Object-relational modelling advocates point out that often getting acceptable performance in complex situations requires an ability to put some forms of logic in the database and even tie this to data structures in the database.  In this model, the database itself is a development platform which exposes API's, and multiple applications may read or write data via these API's.  It is thus best seen as an advanced data modelling, storage, and centralization solution rather than as a simple application back-end.

These differences show, I think, that when PostgreSQL people complain that MySQL is not a "real database management system" and MySQL people dispute this, the real difference is one of definitions, and in this case the definitions are deceptively far apart.  Understanding those differences is, I think, the key to making an informed choice.

Wednesday, September 5, 2012

O/R Modelling part 5: Nested Data Structures, Do's and Don'ts

One of the promises of object-relational modelling is the ability to address information modelling on complex and nested data structures.

Nested data structures bring considerable richness to the database, which is lost in a pure, flat, relational model.  Nested data structures can be used to model tuple constraints in ways that are impossible to do when looking at flat data structures, at least as long as those constraints are limited to the information in a single tuple.  At the same time there are cases where they simplify things and cases where they complicate things.  This is true both in the case of using these for storage and for interfacing with stored procedures.

PostgreSQL allows for nested tuples to be stored in a database, and for arrays of tuples.  Other ORDBMS's allow something similar (Informix, DB2, and Oracle all support nested tables).

Nested tables in PostgreSQL come with a number of gotchas, and additionally exposing the data in them to relational queries takes some extra work.  In this post we will look at modelling general ledger transactions using a nested table approach, along with both the benefits and limitations of this approach.  In general it trades one set of problems for another, and it is important to recognize those problems going in.

The storage example came out of a brainstorming session I had with Marc Balmer of  Micro Systems, though it is worth noting that this is not the solution they use in their products, nor is it the approach currently used by LedgerSMB.

Basic Table Structure:

The basic data schema will end up looking like this:

CREATE TABLE journal_type (
    id serial not null unique,
    label text primary key
);

CREATE TABLE account (
   id serial not null unique,
   control_code text primary key, -- account number
   description text
);

CREATE TYPE journal_line_type AS (
    account_id int,
    amount numeric
); 

CREATE TABLE journal_entry (
    id serial not null unique,
    journal_type int references journal_type(id),
    source_document_id text,-- for example invoice number
    date_posted date not null,
    description text, 
    line_items journal_line_type[],
    PRIMARY KEY (journal_type, source_document_id)
);

This schema has a number of obvious gotchas and cannot, by itself, guarantee the sorts of things we want to do.  However, using object-relational modelling we can fix these in ways that we cannot in a purely relational schema.  The main problems are:

  • First, since this is a double entry model, we need a constraint that says that the sum of the amounts of the lines must always equal zero.  However, if we just add a sum() aggregate, we will end up with it summing every record in the db every time we do an insert, which is not what we want.  We also want to make sure that no account_id's are null and no amounts are null.
  • Additionally it is not possible in the schema above to easily expose the journal line information to purely relational tools.  However we can use a VIEW to do this, though this produces yet more problems.
  • Finally referential integrity enforcement between the account lines and accounts cannot be done declaratively.  We will have to create TRIGGERs to enforce this manually.
These problems are traded off against the fact that the relational model does not allow the first problem to be solved at all: we accept some solutions that are a bit of a pain in exchange for having solutions at all.

Nested Table Constraints

If we simply had a tuple as a column, we could look inside the tuple with check constraints.  Something like check((column).subcolumn is not null).  However in this case we cannot do that because we need to aggregate on a set of tuples attached to the row.  To do this instead we create a set of table methods for managing the constraints:

CREATE OR REPLACE FUNCTION is_balanced(journal_entry) 
RETURNS BOOL
LANGUAGE SQL AS $$

SELECT sum(amount) = 0 FROM unnest($1.line_items);

$$;

CREATE OR REPLACE FUNCTION has_no_null_account_ids(journal_entry)
RETURNS BOOL
LANGUAGE SQL AS $$

SELECT bool_and(account_id is not null) FROM unnest($1.line_items);

$$;

CREATE OR REPLACE FUNCTION has_no_null_amounts(journal_entry)
RETURNS BOOL
LANGUAGE SQL AS $$

select bool_and(amount is not null) from unnest($1.line_items);

$$;

We can then create our constraints.  Note that because the methods must exist before they are referenced, the constraints are added after the functions are defined, which in turn happens after the table is constructed.  I have gone ahead and given these friendly names so that errors are easier for people (and machines) to process and handle.

ALTER TABLE journal_entry 
ADD CONSTRAINT is_balanced 
CHECK ((journal_entry).is_balanced);

ALTER TABLE journal_entry
ADD CONSTRAINT has_no_null_account_ids
CHECK ((journal_entry).has_no_null_account_ids);

ALTER TABLE  journal_entry
ADD CONSTRAINT has_no_null_amounts
CHECK ((journal_entry).has_no_null_amounts);

Now we have integrity constraints reaching into our nested data.

So let's test this out.

 insert into journal_type (label) values ('General');

We will re-use the account data from the previous post:

or_examples=# select * from account;
 id | control_code | description
----+--------------+-------------
  1 | 1500         | Inventory
  2 | 4500         | Sales
  3 | 5500         | Purchase
(3 rows)

Let's try inserting a few meaningless transactions, some of which violate our constraints:

insert into journal_entry
(journal_type, source_document_id, date_posted, description, line_items)
values 
(1, 'ref-10001', now()::date, 'This is a test',
ARRAY[row(1, 100)::journal_line_type]);

ERROR:  new row for relation "journal_entry" violates check constraint "is_balanced"

So far so good.

insert into journal_entry
(journal_type, source_document_id, date_posted, description, line_items)
values 
(1, 'ref-10001', now()::date, 'This is a test',
ARRAY[row(1, 100)::journal_line_type, 
      row(null, -100)::journal_line_type]);

 ERROR:  new row for relation "journal_entry" violates check constraint "has_no_null_account_ids"

Still good.

insert into journal_entry
(journal_type, source_document_id, date_posted, description, line_items)
values 
(1, 'ref-10001', now()::date, 'This is a test',
ARRAY[row(1, 100)::journal_line_type,
      row(2, -100)::journal_line_type,
      row(3, NULL)::journal_line_type]);

ERROR:  new row for relation "journal_entry" violates check constraint "has_no_null_amounts"

Great.  All constraints working properly.  Let's try inserting a valid row:

insert into journal_entry
(journal_type, source_document_id, date_posted, description, line_items)
values
(1, 'ref-10001', now()::date, 'This is a test',
ARRAY[row(1, 100)::journal_line_type,
      row(2, -100)::journal_line_type]);
And it works!

 or_examples=# select * from journal_entry;
  id | journal_type | source_document_id | date_posted |  description   |       line_items       
 ----+--------------+--------------------+-------------+----------------+------------------------
   5 |            1 | ref-10001          | 2012-08-23  | This is a test | {"(1,100)","(2,-100)"}
 (1 row)
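
Incidentally, the same table methods that back the constraints can be called directly in queries, which can be handy when debugging:

SELECT id, (je).is_balanced, (je).has_no_null_amounts
  FROM journal_entry je;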

Break-Out Views

A second major problem that we will face with this schema is that if someone wants to create a report using a reporting tool that only really handles relational data, the financial data will be opaque and unavailable.  This scenario is one of the reasons why I think it is important generally to push the relational model to its breaking point before reaching for object-relational features.  Consequently I think that when doing nested tables it is important to ensure that the data in them is available through a relational interface, in this case a view.

In this case, we may want to model debits and credits in a way which is re-usable, so we will start by creating two type methods:

CREATE OR REPLACE FUNCTION debits(journal_line_type) 
RETURNS NUMERIC
LANGUAGE SQL AS
$$ SELECT CASE WHEN $1.amount < 0 THEN $1.amount * -1 
               ELSE NULL END 
$$;

CREATE OR REPLACE FUNCTION credits(journal_line_type)
RETURNS NUMERIC 
LANGUAGE SQL AS
$$ SELECT CASE WHEN $1.amount > 0 THEN $1.amount 
               ELSE NULL END
$$;

Now we can use these as virtual columns anywhere a journal_line_type is used.

The view definition itself is rather convoluted and this may impact performance.  I am waiting for the LATERAL construct to become available which will make this easier.

CREATE VIEW journal_line_items AS
SELECT id AS journal_entry_id, (li).*, (li).debits, (li).credits
FROM (SELECT je.*, unnest(line_items) li 
        FROM journal_entry je) j;

Testing this out:

SELECT * FROM journal_line_items;

gives us

 journal_entry_id | account_id | amount | debits | credits
------------------+------------+--------+--------+---------
                5 |          1 |    100 |        |     100
                5 |          2 |   -100 |    100 |       
                6 |          1 |    200 |        |     200
                6 |          3 |   -200 |    200 |    

As you can see, this works.  Now people with purely relational tools can access the information in the nested table.

In general it is almost always worth creating break-out views of this sort where nested data is stored.
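
As an aside, once the LATERAL construct mentioned above becomes available, the view above could probably be written more directly, along these lines (an untested sketch, since LATERAL is not in a released version yet):

CREATE VIEW journal_line_items AS
SELECT je.id AS journal_entry_id, (li).*, (li).debits, (li).credits
  FROM journal_entry je,
       LATERAL unnest(je.line_items) li;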

Referential Integrity Controls

The final problem is that referential integrity is not a well-defined concept for nested data.  For this reason, if we value referential integrity and foreign keys are involved, we must find ways of enforcing them.

The simplest solution is a trigger which runs on insert, update, or delete, and maintains another relation which can be used as a proxy for referential integrity checks.

For example, we could:

CREATE TABLE je_account (
   je_id int references journal_entry (id),
   account_id int references account(id),
   primary key (je_id, account_id)
);

This will be a very narrow table and so should be quick to search.  It may also be useful in determining which accounts to look at for transactions if we need to do that.  This table could then be used to optimize queries.
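
For example, a query along these lines (hypothetical, assuming an index on je_account.account_id) could use the narrow table to find every journal entry touching a given account:

SELECT je.*
  FROM journal_entry je
 WHERE je.id IN (SELECT ja.je_id
                   FROM je_account ja
                   JOIN account a ON a.id = ja.account_id
                  WHERE a.control_code = '1500');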

To maintain the table we need to recognize that a journal entry's line items will never, ever be updated or deleted.  This is due to the need to maintain clear audit controls and trails.  We may add other flags to the table to indicate transaction status, but we can handle the insert, update, and delete cases with a trigger, namely:

CREATE FUNCTION je_ri_management()
RETURNS TRIGGER
LANGUAGE PLPGSQL AS $$
BEGIN
    IF TG_OP ILIKE 'INSERT' THEN
         -- record each account used by the new journal entry
         INSERT INTO je_account (je_id, account_id)
         SELECT NEW.id, account_id
         FROM unnest(NEW.line_items)
         GROUP BY account_id;

         RETURN NEW;
    ELSIF TG_OP ILIKE 'UPDATE' THEN
         IF NEW.line_items <> OLD.line_items THEN
             RAISE EXCEPTION 'Cannot update journal entry line items!';
         ELSE RETURN NEW;
         END IF;
    ELSIF TG_OP ILIKE 'DELETE' THEN
         RAISE EXCEPTION 'Cannot delete journal entries!';
    ELSE
         RAISE EXCEPTION 'Invalid TG_OP in trigger';
    END IF;
END; $$;

Then we add the trigger with:

CREATE TRIGGER je_breakout_for_ri
AFTER INSERT OR UPDATE OR DELETE
ON journal_entry
FOR EACH ROW EXECUTE PROCEDURE je_ri_management(); 

The final ELSE branch for an invalid TG_OP could be omitted, but it is not a bad check to have.

Let's try this out:

insert into journal_entry
(journal_type, source_document_id, date_posted, description, line_items)
values
(1, 'ref-10003', now()::date, 'This is a test',
ARRAY[row(1, 200)::journal_line_type,
      row(3, -200)::journal_line_type]);

or_examples=# select * from je_account;
 je_id | account_id
-------+------------
    10 |          3
    10 |          1
(2 rows)

In this way referential integrity can be enforced.

Solution 2.0:  Refactoring the above to eliminate the view.

The above solution will work great for small businesses, but for larger businesses querying this data will become slow for certain kinds of reports.  Storage here is tied to one specific access pattern, and indexing is somewhat problematic.  There are ways we can address this, but they are not always optimal.  At the same time our work is simplified because the actual accounting details are append-only.

One solution to this is to refactor the above solution.  Instead of:
  1. Main table
  2. Relational view
  3. Materialized view for referential integrity checking
we can have:
  1. Main table, with tweaked storage for line items
  2. Materialized view for RI checking and relational access
Unfortunately this sort of refactoring after the fact isn't simple.  Typically you want to convert the journal_line_type type to a journal_line_type table, and inherit this in your materialized view table.  You cannot simply drop and recreate the type, since the column you are storing the data in depends on its structure.

The solution is to rename the type and create a new one in its place.  This must be done manually; there is no current capability to copy a composite type's structure into a table.  You will then need to create a cast and a cast function.  Then, when you can afford the downtime, you will want to convert the table to the new type.  It is quite possible that the downtime will be delayed, leaving you with an extended period during which you are half-way through migrating the structure of your database.  You can, however, decide to create a cast between the table and the type, perhaps an implicit one (though this is not inherited), and use this to centralize your logic.  Unfortunately this leads to duplication-related complexity and in an ideal world would be avoided.
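
A hedged sketch of one possible reading of those steps (all names here are hypothetical; journal_line_type_old stands for the renamed composite type):

-- rename the old type out of the way
ALTER TYPE journal_line_type RENAME TO journal_line_type_old;

-- the replacement is a table, so its row type can back both storage and RI;
-- the structure must be re-declared by hand
CREATE TABLE journal_line_type (
    account_id int,
    amount numeric
);

-- a cast function plus a cast so existing values can be converted later
CREATE OR REPLACE FUNCTION to_journal_line_type(journal_line_type_old)
RETURNS journal_line_type
LANGUAGE SQL AS
$$ SELECT ROW($1.account_id, $1.amount)::journal_line_type $$;

CREATE CAST (journal_line_type_old AS journal_line_type)
WITH FUNCTION to_journal_line_type(journal_line_type_old);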

However, assuming that the downtime ends up being tolerable, the resulting structures will end up such that they can be more readily optimized for a variety of workloads.   In this regard you would have a main table, most likely with line_items moved to extended storage, whose function is to model journal entries as journal entries and apply relevant constraints, and a second table which models journal entry lines as independent lines.  This also simplifies some of the constraint issues on the first table, and makes the modelling easier because we only have to look into the nested storage where we are looking at subset constraints.

This section then provides a warning regarding the use of advanced ORDBMS functionality, namely that it is easy to get tunnel vision and create problems for the future.  The complexity cost here is high enough that the primary model should generally remain relational, with things like nested storage primarily used to create constraints that cannot be effectively modelled otherwise.  This becomes a great deal more complicated where values may be updated or deleted.  Here, however, we have a relatively simple case regarding data writes combined with complex constraints that cannot be effectively expressed in normalized, relational SQL.  Therefore the standard maintenance concerns that counsel against duplicating information may give way to the fact that such duplication allows for richer constraints.

Now, if we had been aware of the problems going in we would have chosen this structure all along.  Our design would have been:

CREATE TABLE journal_line (
    entry_id bigserial primary key, -- only possible key
    je_id int not null,
    account_id int,
    amount numeric
);

After creating the journal_entry table we would add:

ALTER TABLE journal_line ADD FOREIGN KEY (je_id) REFERENCES journal_entry(id);

If we have to handle purging old data we can make that key ON DELETE CASCADE.
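
That is, the foreign key declaration would instead read:

ALTER TABLE journal_line ADD FOREIGN KEY (je_id)
    REFERENCES journal_entry(id) ON DELETE CASCADE;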

And the lines would have been of this table's row type instead.  We can then get rid of all constraints and their supporting functions other than the is_balanced one.  Our debit and credit functions then also reference this type.  Our trigger then looks like:

CREATE FUNCTION je_ri_management()
RETURNS TRIGGER
LANGUAGE PLPGSQL AS $$
BEGIN
    IF TG_OP ILIKE 'INSERT' THEN
         -- break each line out into the journal_line table
         INSERT INTO journal_line (je_id, account_id, amount)
         SELECT NEW.id, account_id, amount
         FROM unnest(NEW.line_items);

         RETURN NEW;
    ELSIF TG_OP ILIKE 'UPDATE' THEN
          RAISE EXCEPTION 'Cannot update journal entry line items!';
    ELSIF TG_OP ILIKE 'DELETE' THEN
         RAISE EXCEPTION 'Cannot delete journal entries!';
    ELSE
         RAISE EXCEPTION 'Invalid TG_OP in trigger';
    END IF;
END; $$;

Approval workflows can be handled with a separate status table with its own constraints.  Deletions of old information (up to a specific snapshot) can be handled by a stored procedure which is unit tested and disables this trigger before purging data.  This system has the advantage of having several small components which are all complete and easily understood, and it is made possible because the data is exclusively append-only.
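
A sketch of what such a status table might look like (hypothetical, not part of the schema above):

CREATE TABLE journal_entry_status (
    je_id int primary key references journal_entry(id),
    approved bool not null default false,
    approved_by text,     -- who approved the entry
    approved_at timestamp -- when it was approved
);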

As you can see from the above examples, nested data structures greatly complicate the data model and create problems with relational math that must be addressed if data logic is to remain meaningful.  This is a complex field, and it adds a lot of complexity to storage.  In general, nested structures are best avoided in actual data storage except where they make formerly insurmountable problems manageable.  Moreover, they add complexity to optimization once data gets large.  Thus while non-atomic fields can make sense as an initial point of entry in some narrow cases, as the primary structures queried they are very rarely the right approach.  It is possible that, at some point, nested storage will gain its own indexes, foreign keys, and so forth, but I cannot imagine this being a high priority, so it isn't clear that this will ever happen.  In general, it usually makes the most sense to store the data in a pseudo-normalized way, with any non-1NF designs serving as the initial point of entry in a linear write model.

Nested Data Structures as Interfaces

Nested data structures as interfaces to stored procedures are a little more manageable.  The main difficulties are in application-side data construction and output parsing.  Some languages handle this more easily than others.

Construction and handling of these structures is relatively straightforward on the database side and poses none of these problems.  However, they do add complexity, and this must be managed carefully.

The biggest issue when interfacing with an application is that ROW types are not usually constructed automatically by application-level frameworks, even when those frameworks handle arrays.  This leaves the programmer to choose between unstructured text arrays, which are fundamentally non-discoverable (and thus brittle), and arrays of tuples, which are discoverable but require a lot of additional application code to handle.  At the same time, as a chicken-and-egg problem, frameworks will not add handling for this sort of structure unless people are already trying to use it.
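
For example, the text representation an application must construct or parse for an array of journal_line_type tuples looks like this (the same format psql displayed for line_items above):

SELECT '{"(1,100)","(2,-100)"}'::journal_line_type[];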

So my general recommendation is to use nested data types in the database sparingly, only where the benefits clearly outweigh the complexity costs.

Complexity costs are certainly lower at the interface level, and there are many more cases where these techniques are net wins there, but that does not mean they should be used routinely even there.