Sunday, September 16, 2012

O/R Modelling in part 6: Polymorphism (and Gotchas) in PostgreSQL

Polymorphism is common in virtually all database management systems.  When we add an integer to a float, we expect this to just work.  Similarly various types of text fields can usually be concatenated together, etc.  In particularly extensible database systems, like PostgreSQL, polymorphism tends to lead to additional problems.  Indeed, implicit casts were cut way back a few years ago specifically because of problems they might cause.  This is an area I expect we will do quite a bit more with in LedgerSMB in the future.

The PostgreSQL Type System

PostgreSQL types can be divided into three broad categories:  primitives (those supplied by PostgreSQL out of the box or by extensions), domains, and object types.  The object types include composite types and table types.  Although typically these are lumped together, we will cover table types and composite types separately because the capabilities are far more developed when tables are created.

Base Primitive Types

Base types are intended to be the basic building blocks of relations.  They include such types as int4 and text.  These cannot have methods attached, but they can be arguments to functions provided that the procedural language understands the type.  If all else fails, most of these can be passed in using their textual representation.

Polymorphic Primitive Types

Polymorphic primitives are primitives which can take the form of other base types.  The primary ones are anyelement (which is any simple base type) and anyarray which is an array filled with any simple base type.  These are mostly used in functions so that the function can apply to array operations generally.  I have not encountered them elsewhere.
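For example, a minimal sketch of a polymorphic function (a hypothetical helper, not used elsewhere in these examples) might look like this:

CREATE OR REPLACE FUNCTION first_element(anyarray) RETURNS anyelement
LANGUAGE SQL IMMUTABLE AS $$
-- return the first element of any array, whatever its element type
SELECT $1[1];
$$;

Here select first_element(ARRAY[3,2,1]) returns 3 as an integer, while select first_element(ARRAY['a','b']) returns 'a' as text; the return type follows the element type of the argument.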

Pseudotype Primitives

Pseudotypes include triggers and a few other things.  They are mostly used as keywords, not as actual types.

SQL-Created Domain Types

Domains are the first significant type of SQL-created or user-defined type.  An SQL domain is a base type, valid for only a subset of the base type's values.  A domain can be declared NOT NULL; unlike a NOT NULL constraint on a base-type column, which is enforced only when a row is stored, a domain's NOT NULL is also checked when the domain is used as an attribute of a composite type.
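For example, a minimal sketch of a domain (hypothetical, not used in the later examples):

CREATE DOMAIN positive_numeric AS numeric
   NOT NULL
   CHECK (VALUE > 0);

Any column or composite attribute declared as positive_numeric will then reject zero and negative values, and NULL as described above.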

SQL-Created Object Composite Types

PostgreSQL allows you to create composite types which can be used as input and output types for functions, elements in other composite types, columns in a table, and so forth.  Composite types have no inheritance and cannot inherit interfaces from each other.  Composite types cannot have any recursive elements, and this is checked deeply.  For example you cannot:

CREATE TYPE a AS (
   a int
);

CREATE TYPE b  AS (
   a a
);

 ALTER TYPE a ADD attribute b b;

An attempt to do so will result in an error:

 ERROR:  composite type a cannot be made a member of itself

SQL-Created Object Table Types 

Table types are similar to composite types except that inheritance is supported.  This means in practice that the behavior of tables is quite different than the behavior of composite types when it comes to methods, implicit casting, and the like.  Note that implicit casts are not inherited.

Type Conversion:  CREATE CAST

Values and objects can be converted between types using casts.  Casts can be explicit only, implicit on assignment, and implicit generally.  In some cases casts can be dangerous.

The overall syntax of CREATE CAST is discussed in the PostgreSQL manual.  Almost all user-created casts are likely to use an SQL function to convert one type to another.  In general explicit casts are best because they force clarity in code and predictability at run-time.

For example see our previous country table:

or_examples=# select * from country;
 id |     name      | short_name
----+---------------+------------
  1 | France        | FR
  2 | Indonesia     | ID
  3 | Canada        | CA
  4 | United States | US
(4 rows)

Now suppose we create a constructor function:

CREATE FUNCTION country(int) RETURNS country
LANGUAGE SQL AS $$
SELECT * FROM country WHERE id = $1 $$;

We can show how this works:

or_examples=# select country(1);
    country   
---------------
 (1,France,FR)
(1 row)

or_examples=# select * from country(1);
 id |  name  | short_name
----+--------+------------
  1 | France | FR
(1 row)

Now we can:

CREATE CAST (int as country)
WITH FUNCTION country(int);

 And now we can cast an int to country:

or_examples=# select (1::country).name;
  name 
--------
 France
(1 row)

The main use here is that it allows you to pass an int into any function expecting a country, with the country constructed at run-time.  Care of course must be used, the same way as with dereferencing foreign keys, if performance is to be maintained.  In theory you could also cast country to int, but in practice casting complex types to base types tends to cause more problems than it solves.

Inheritance-Based Implicit Casts

Inheritance-based implicit casts are an exception to the dangers of implicit casting.  These are not base type to base type casts, and they essentially act as slices of a table.  If your table structures use inheritance in a sane way they will work well.

Defined Implicit Casts

We can generally define implicit casts between tuples.  This is best done sparingly because any implicit cast multiplies execution pathways.  Defined implicit casts are not inherited.

To create an implicit cast, we could do as follows:

DROP CAST (int as country);

CREATE CAST (int as country)
WITH FUNCTION country(int)
AS IMPLICIT;

The problem of course is that as soon as we do this we have the potential for ambiguity.

Suppose I have a hypothetical state table, and I have functions called capital which can take arguments of country or state and return the name of the capital city.  If I make a query of "select capital(1);" then am I asking for the capital of a country or of a state?  I have always avoided this situation because it can never end well (so much so that I don't actually know how PostgreSQL resolves the issue).
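To make the hypothetical concrete, here is a sketch of how the ambiguity could arise; the state table, its constructor, the cast, and both capital bodies below are illustrative assumptions, not code from this series:

CREATE TABLE state (id serial primary key, name text, capital_city text);

CREATE FUNCTION state(int) RETURNS state
LANGUAGE SQL AS $$ SELECT * FROM state WHERE id = $1 $$;

CREATE CAST (int AS state) WITH FUNCTION state(int) AS IMPLICIT;

CREATE FUNCTION capital(country) RETURNS text
LANGUAGE SQL AS $$ SELECT 'Capital of ' || $1.name $$;

CREATE FUNCTION capital(state) RETURNS text
LANGUAGE SQL AS $$ SELECT $1.capital_city $$;

-- With implicit casts from int to both country and state in place,
-- the call below has two equally plausible resolutions and may simply
-- fail as ambiguous:
SELECT capital(1);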

Keep implicit casts for cases where you really need them and avoid them in other cases.

Problems with Implicit Casts to Primitives

The problems with implicit casts are greatly magnified when casting to (and hence between) primitives.  We will discuss just one such problem here:  numbers to timestamps.

Currently PostgreSQL supports casting dates as timestamps, which is probably correct.  Suppose we want to take a number and turn it into a timestamp.  Maybe we are converting UNIX epochs to timestamps regularly and want to do so without an explicit cast.

create cast (double precision as timestamp with time zone) 
with function to_timestamp(double precision) 
as implicit;

And so we try this:

CREATE FUNCTION to_timestamp(int)
returns timestamptz
language sql immutable as
$$ select $1::double precision::timestamptz; $$;

And add an implicit cast for integers as well:
create cast (int as timestamp with time zone) 
with function to_timestamp(int) 
as implicit;

That works well enough.  However, now we have a bug factory.

or_examples=# select extract(year from '2011-01-01'::date);
 date_part
-----------
      2011
(1 row)

or_examples=# select extract(year from 2011-01-01);
 date_part
-----------
      1969
(1 row)

Why?  Because the second is an integer math expression:  2011-01-01 evaluates to the integer 2009, which the implicit cast then interprets as 2009 seconds after the UNIX epoch (midnight UTC, January 1, 1970), which in a US time zone falls late on December 31, 1969, hence the year 1969.  This is probably not what we had in mind.  Similarly:

or_examples=# select age('2011-01-01'::date);
      age     
---------------
 1 year 8 mons
(1 row)

or_examples=# select age(2011-01-01);
           age           
--------------------------
 42 years 8 mons 07:26:31
(1 row)

It is important to note here that the computer is doing exactly what we told it to do.  The problem however is that we now have bugs which are non-obvious to someone who is not quite fluent on the system to start with.

Another historical case that used to bite people is implicit casts of tuples to text.  This led to very counterintuitive behavior in some cases, such as the infamous tuple.name or tuple.text issues.  Basically tuple.name would return name(tuple::text), which would be a textual representation of the tuple truncated to 63 characters.  This was fixed by making the cast of tuple to text explicit in PostgreSQL 8.3.

Polymorphism is a very important feature of PostgreSQL, but it is one with many gotchas.  In general, adding explicit casts won't get you into too much trouble, but implicit casts are often asking for problems.

Thursday, September 13, 2012

Object/Relational Interlude: Messaging in PostgreSQL

PostgreSQL as an object-relational development platform doesn't end with data modelling.  The notification system is particularly powerful and worthy of mention.  This system however has some challenges and gotchas which are also worth mentioning.  The example below was early sample code bundled with LedgerSMB from 1.1 onward.  The code is not really useful in production for reasons specified but it can be a basis for production systems, and it functions as an example both of how to make things work and the problems you can encounter.

I would like to thank Marc Balmer of Micro Systems for some contributions from slides he had sent to me on this.

Basic Framework:  LISTEN/NOTIFY

PostgreSQL provides a messaging system with two commands, listen and notify.  These commands allow any process to broadcast an event to any listener.  In some older versions, listeners are listed in a system table and this information is presumed public.  In newer versions this information is not in the system catalogs and so this reduces the security exposure from the use of this method, but many areas of exposure remain and must be understood.

Notifications are divided into broadcast channels.  Any user can broadcast along any channel, and anyone can listen on any channel.  The actual notifications are public.  The syntax is:

LISTEN channel; -- for the listener

and


NOTIFY channel [, payload]; -- for the broadcaster


Channels are database-specific.

A Brief Example

or_examples=# listen test;
LISTEN

In another session:

or_examples=# notify test, 'testing testing testing';
NOTIFY

In the first session, psql will only check for notifications when executing another query so we:

or_examples=# select 1;
 ?column?
----------
        1
(1 row)

Asynchronous notification "test" with payload "testing testing testing" received from server process with PID 29052.

It is of course possible to send actual meaningful data in the payload but this is probably not a good idea since we can't really verify the sender without a lot of extra code.

In an actual application we can take the notification to be a sign that we need to check something in the application.  In general, I can't find any case where the payload is actually a good idea to use.  It's probably better to use tables to create a message queue.  Then the notification can be used to reduce the actual requests to the tables.

Note further that NOTIFY is raised on transaction commit, making it transaction-safe.  NOTIFY allows you to build extremely interactive systems where many applications actively communicate across the database and cross-transaction logic can be encapsulated, from a user application's perspective, within the database layer itself.
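A quick way to see the commit-time behavior, reusing the test channel from above:

BEGIN;
NOTIFY test, 'sent from an aborted transaction';
ROLLBACK;  -- listeners never receive this notification

BEGIN;
NOTIFY test, 'sent from a committed transaction';
COMMIT;    -- the notification is delivered only now, at commit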

Listen Gotchas

The LISTEN architecture does not provide security by itself.  This must be provided on the underlying tables.  In general it is always a mistake to trust the payload of the notification too much.

Additionally you need a way to extract only the records you want.  I will go over a badly designed case below.  One approach is a message table.  Another approach might be a message view.  Either way the listener is a separate program.

Finally it is worth noting that the application actually has to ask the connection libraries if there is a notification.  This can cause some confusion because, for example, psql only checks when a query is run.  With a Perl script, however, we might wake up to check once every minute or second, or whatever we want to do.

Notify Gotchas

Listeners require no special privileges to listen for notifications or act on them.  The listener is a separate program and could be anywhere else.  You should never trust that the NOTIFY payload is not overheard somewhere.

Additionally, it is worth remembering that NOTIFY is *always* raised on commit and cannot be removed without rolling back the transaction.  However, if the actual message is in a message table, it could be safely deleted.

Patterns

Message Queue Tables

When we add tables with messages to the system we will be able to actually use PostgreSQL as a message queue server.  For example, in 9.2 we can do something like:

CREATE TABLE message_queue_test (
    id bigserial primary key, -- appropriate, not natural data
    sender text not null default session_user,
    payload json not null,
    received bool
);

CREATE INDEX msq_test_pending_idx 
ON message_queue_test (id, sender, payload) 
WHERE received IS NOT TRUE;

Of course the cases where this can be used as a covering index are vanishingly small, but over time the index may be useful in eliminating rows already read.  You can then keep old records around for a while for debugging purposes.  You could even have an abstract table, multiple child/"leaf" tables, and unified object-relational interfaces for these.  Indeed this may be an interesting project for the future.

Concurrent Access and Locking

Note that the standard rules exist with concurrent access and locking.  If you want this to be a message queue that several different programs can read and every message must get sent out only once, then you should make sure you use SELECT ... FOR UPDATE.
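As a minimal sketch of a reader claiming work from the message_queue_test table above (the surrounding transaction is what makes the row lock meaningful):

BEGIN;
-- claim the oldest unprocessed message; FOR UPDATE blocks competing readers
-- on this row until we commit or roll back
SELECT id, sender, payload
  FROM message_queue_test
 WHERE received IS NOT TRUE
 ORDER BY id
 LIMIT 1
 FOR UPDATE;

-- process the message in the application, then mark it as handled,
-- substituting the id returned above (1 here is just an example value)
UPDATE message_queue_test SET received = true WHERE id = 1;
COMMIT;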

An Example

 In LedgerSMB I created a sample script that would show how to send an email out whenever a short part was sold, to remind someone to order.  Unfortunately this was not so well designed, but I guess it was good enough as an example.

The Trigger

CREATE OR REPLACE FUNCTION trigger_parts_short() RETURNS TRIGGER
AS
'
BEGIN
  IF NEW.onhand <= NEW.rop THEN  -- at or below the reorder point, i.e. the part is short
    NOTIFY parts_short;
  END IF;
  RETURN NEW;
END;
' LANGUAGE PLPGSQL;

In actual use cases, this isn't enough information to process this properly.  Please see the problems below.  The basic problem is that this notification causes the listener, when used, to email out a full parts short report to the individual(s) specified.  Depending on how it is set up, this might be a bunch of reports in a short time.  This could really use a queue table.

The Listener

I wrote this code quickly when I was still learning Perl. It's definitely quick and dirty code.  I am not proud of the clarity or maintainability, but what it does is actually trivial.

The script starts off with the basic setup, which basically sets up the configuration information and the database connections:

require "config.pl";

use DBI;
my $dsn = "dbi:Pg:dbname=$database";
my $dbh = DBI->connect(
    $dsn, $db_user,
    $db_passwd,
    {
        AutoCommit => 1,
        PrintError => 0,
        RaiseError => 1,
    }
);
$dbh->{pg_enable_utf8} = 1;

my $sth;

$dbh->do("LISTEN parts_short");



The next part is the main loop, which just wakes up, checks for notifications, acts on them if applicable, and if not goes back to sleep:

while (1) {    # loop infinitely
    if ( $dbh->func('pg_notifies') ) {
        &on_notify;
    }
    sleep $cycle_delay;
}

And finally what we do if we got a notification:

sub on_notify {
    open( MAIL, '|-', "$sendmail" );
    $sth = $dbh->prepare( "
        SELECT partnumber, description, onhand, rop FROM parts
        WHERE onhand <= rop
  " );
    $sth->execute;
    print MAIL $template_top;
    while ( ( $partnumber, $description, $avail, $rop ) = $sth->fetchrow_array )
    {
        write MAIL;
    }
    print MAIL $template_foot;
    close MAIL;
}

Problems

Aside from the fact that the code isn't the most maintainable code out there, there are a number of structural problems with the whole solution.  First, what this actually does is probably not what you actually want.  It wakes up every so often (a configurable value) and checks for notifications and if it finds them will send out a parts short report to the designated individual.  Note it won't send just the new parts, but actually all the parts short.  If you have a store with thousands of parts in stock and are nearing ROP on a few hundred, and then the holiday traffic starts before your orders come in, this will be rather annoying and likely to be intentionally filtered out by whoever is supposed to receive it.

Additionally, if you have a case where you are short many parts, a malicious user may be able to find a way to force the system to send out a parts short report every time it wakes up, which may not cause a denial of service attack but in fact may be highly annoying and prevent the actual emails from being reasonably useful.

Solutions

The best solution is to go with a queue table, and have the trigger write to it.  Then the reader can read it, and email out a notice as to exactly what has happened.  This will mean more relevant information and better signal to noise ratio.
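A sketch of what that might look like follows; the queue table and trigger names here are assumptions for illustration, not LedgerSMB code:

CREATE TABLE parts_short_queue (
    id bigserial primary key,
    partnumber text not null,
    onhand numeric not null,
    rop numeric not null,
    created_at timestamptz not null default now(),
    handled bool not null default false
);

CREATE OR REPLACE FUNCTION parts_short_queue_trigger() RETURNS TRIGGER
LANGUAGE PLPGSQL AS $$
BEGIN
    IF NEW.onhand <= NEW.rop THEN
        -- record exactly which part went short, then wake up any listeners
        INSERT INTO parts_short_queue (partnumber, onhand, rop)
        VALUES (NEW.partnumber, NEW.onhand, NEW.rop);
        NOTIFY parts_short;
    END IF;
    RETURN NEW;
END;
$$;

The listener then selects only the rows where handled is false (with SELECT ... FOR UPDATE if there may be several readers), emails a notice about just those parts, and marks them handled.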

Next:  Polymorphism in PostgreSQL

Saturday, September 8, 2012

O/R modelling interlude: PostgreSQL vs MySQL

Every time we see people look at PostgreSQL and MySQL on the internet it falls into a flame war fast.  I think that a large part of the problem is that advocates of these databases look at the other database (and advocacy of the other database) through a specific lens and therefore are unable to understand the rhetoric from the other side.  This is an attempt to cut through some of this, offered in the spirit of suggesting that if we can't understand each other's views, then instead of discussion all we will get are flames.

This post is not intended to be a post about why MySQL sucks (although I will point out some cases where it is inadequate and other cases where it reduces transition costs and times) or about why you should choose PostgreSQL instead.  I am of course biased, having worked with MySQL for some time but switching to PostgreSQL for important work back in 1999.  I personally don't much like MySQL and so it is worth stating my bias up front.  However, I don't think that prevents me from trying to place it constructively in the marketplace or express a sympathetic understanding for what MySQL has to offer developers.

The endless arguments I think are the result of very specific pressures on the RDBMS market, and PostgreSQL in many ways targets what the system is in theory and what it can be, while MySQL targets how an RDBMS is more typically used when developing software for sale.  MySQL is disruptive only by virtue of being open source.  PostgreSQL is disruptive by design and the principles pioneered on Postgres have found their way into Informix, DB2, Oracle, and more.

A simple description of the difference (and this is not intended as a flame) is:

MySQL is what you get when application developers build an RDBMS.
PostgreSQL is what you get when database developers build an application development platform.

The above is not intended as a flame on either side by any means but it does show where people run into pain on both sides (server-side application-style development on PostgreSQL doesn't work, and trying to use MySQL the way you would use Oracle for data centralization really doesn't work either).

In case people are wondering, I intend in the future to look at object-relational modelling potential in DB2 and Oracle in contrast to PostgreSQL as well.  These products however are more similar than are MySQL and PostgreSQL, both in terms of use case and market position.

App-Centric vs Data-Centric

The whole purpose of a database management system is to store data so that it can be re-used.  This is true regardless of whether you are using a NoSQL solution, a light-weight quasi-RDBMS like sqlite, or a heavy duty system like Oracle.    The type and scope of that re-use varies quite a bit between products though.  Larger-scale RDBMS's typically focus on flexible output and rigid, validated input, and this is more important as more applications start to write to the database.  Smaller databases and NoSQL databases tend to place the application in the driver's seat and do less validation.

These approaches exist on a continuum of course, but in general you need more type checking the more applications may be writing to the database.  With loose checking, the assumption can be made that the database is primarily a store of private state information and therefore idiosyncrasies of the application don't matter that much, but when you have multiple applications writing to the same relations, the data is "public" in a way that it is not otherwise and therefore there is a lot of value in ensuring that application idiosyncrasies do not cause misbehavior in other software or cause data to be misinterpreted.

So on one side we have big, shared data solutions like Oracle and DB2.  On the other, we have applications which are fundamentally designed to be a data store for a single application, where the developer can be given total freedom regarding data validation.  Many of the NoSQL solutions fit here.  Some SQL-like solutions are also on this side, such as SQLite.

We can look at how MySQL and PostgreSQL fit on this continuum and try to get a feel for why the different points of view seem to be so difficult to bridge.  We will look particularly at solutions given to past data integrity concerns and what those mean for information management and application development.  In PostgreSQL we will look at user-defined function enhancements and replication.  In MySQL we will look at strict mode.

These solutions show radically different views of how the software is expected to be used.  PostgreSQL may be characterized as conservative but innovative, and unwilling to do anything that might prejudice data in multi-application environments.  MySQL on the other hand may be characterized as focusing on the needs of the app developer, sometimes to the exclusion of the needs of the DBA.

MySQL on the app vs data spectrum:  SQL Mode Salad

MySQL 4.x and earlier was notoriously loose with data constraints.  Zero dates were seen as valid, as was a date like '2008-02-30.'  Data would be truncated to fit fields, or otherwise transformed.  These transformations were sometimes lossy but were predictable and in fact make some sense in a content management environment (my first use of any RDBMS was MySQL for light-weight content management in this timeframe).  Typically the data stored was not terribly important (unlike accounting systems or the like) and so it was good enough for its use case.  It may be extreme to call data truncation a feature, but for MySQL's initial use case that would not be entirely inaccurate.

The big problem though was that people were trying to build applications to do other things beyond the initial use case, and in many of these cases database transactions were needed and better type checking was needed.

To address these problems, MySQL leveraged its pluggable table system to allow third party vendors to create open source or dual-licensed (with a license from MySQL) transactional tables.  InnoDB, BDB, and a few other tables arose in this way.  Of course if transactions are handled at the table level, and tables are handled by plugins, then data definition language statements can never be transactional.   This isn't the end of the world for many MySQL deployments (for reasons stated below) but it also doesn't handle the type checking issues, which have to be handled before storage.

To handle the type checking issue, MySQL implemented the concept of SQL modes, which allows one to select a number of options which then define a dialect of SQL for use in the software.  In this way one can ease porting from other systems to some extent, and address questions like how strictly types should be checked.

MySQL allows any user to set the SQL mode for the session, and this can have effects ranging from SQL syntax to whether '2008-02-30' is accepted as a valid date.  In essence in MySQL, the application is king and the db a humble servant.
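For example (illustrative MySQL settings; any session is free to pick its own):

-- one session opts into strict checking:
SET SESSION sql_mode = 'STRICT_ALL_TABLES,NO_ZERO_DATE';
-- while another session writing to the same tables runs with loose checking:
SET SESSION sql_mode = '';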

It is worth noting that the original few modes have been expanded into a very large set, allowing applications to tell MySQL to accept syntactical idiosyncrasies of other RDBMS's.  This sort of thing shortens the time necessary to initially port an application to MySQL, and this is great as long as certain boundaries are maintained.

This sort of assumption only really works in practice where only one application is writing to a given set of relations.  If you have two or ten applications reading and writing the same tables, then each one of them can decide what sort of data assumptions the server should use when validating the data prior to storing it.  MySQL thus trades robustness and guarantees on output of data for flexibility of input (this is the fundamental tradeoff in NoSQL as well), and this therefore ends up relegating the database to an application's private data store.

The typical response I have received when asking how to manage this is to put an API layer in the application which does the data checking.  In other words, the application, not the database, is expected to be the gatekeeper regarding valid data.

MySQL users think in terms of the "software stack" with MySQL existing just above the operating system.  The sole purpose of MySQL is to store data for the single application owning the relation.  MySQL users generally do not think in terms of shared data between applications using SQL as an API, and the MySQL developers happily provide the software they are looking for.  This software sits  somewhere between traditional RDBMS's and NoSQL systems in terms of flexibility of data in vs out.

MySQL does not provide facilities for DBA's to restrict which SQL modes are available.  This more or less prevents MySQL from outgrowing the single application per table use case, and it prevents MySQL from  being a genuine information management solution.  That isn't to say it is a bad development tool.  However it should be seen as an RDBMS-like application back-end, rather than a classical RDBMS (which has generally been geared towards centralized data management).

PostgreSQL on the app vs data spectrum and New features:  No Inherit Constraints and Named Arguments in SQL Functions

PostgreSQL began life as the Postgres project out of UC Berkeley.  It was initially a research testbed for advanced database concepts, namely those called "object-relational" in terms of data modelling.  The idea is that more complex problems can be modelled when behavior is tied to data structures which can then be relationally manipulated based on that behavior.

The basic promise of object-relational database management is that data structures can be tied to processing routines, so that more advanced models can be built and relationally queried.   This allows more complex applications to be written without overly burdening the database system with large amounts of data transfer or large result sets in memory.  With plain relational processing, we'd have to express what we can relationally and then filter out the excess in our application.  With object-relational modelling we can build more advanced filters into the SQL queries without affecting readability.  In order to do this, PostgreSQL allows for user defined functions to be written in a variety of languages with C, SQL, and PL/PGSQL being available on the default installation.  Other languages can be added using a plugin system (something MySQL has as well, but not well suited there for Object Relational management of data).

As we have been looking at, PostgreSQL provides powerful capabilities of modelling data which goes well beyond what data is stored and includes the ability to model data derived from what is stored.  PostgreSQL in fact pushes the envelope in this area even beyond where Informix, DB2, and Oracle have taken it.  It is thus a platform for building highly intelligent models of data.


All current major object-relational database implementations (including those of DB2 and Oracle) are at least inspired by Michael Stonebraker's work in this area, and by Postgres in particular.  Indeed, Informix's object-relational capabilities started life as a Postgres fork named Illustra, and both DB2 and Oracle more or less implement the O-R bindings to SQL developed on that platform.  PostgreSQL is different SQL-wise in part because Informix split prior to Postgres adopting SQL as a query language. 

Centralizing data modelling code in the database is generally a net win.  Result sets from SQL queries tend to be smaller, there are fewer round trips between the application and the database, and the models themselves can be used to create a public API, similar to the way an ORM might be used, thus allowing centralization of key code across all applications that may read or write to a specific database.  This doesn't mean "all logic belongs in the database."  Rather it provides for the ability to build more advanced (and efficient) data models which can place processing where it is most efficient and avoid the worst of the scalability bottlenecks.

PostgreSQL has always had a data modelling focus rather than the application back-end focus seen in MySQL.  When we look at two recent additions we can see that legacy as well.  In this model, guarantees regarding data are paramount, and applications are expected to use the relations as, essentially, public API's.  Consequently the database, not the application, is responsible for data consistency and semantic clarity.  Foreign keys, for example, are never ignored by PostgreSQL (in MySQL, handling is dependent on table type), and the 30th of February is never a valid date no matter how much the application would like to use it as such.  Dates are always in the Gregorian calendar where this is not valid.  If you care about handling Gregorian to Julian conversions you will have to do this yourself, perhaps as a custom type for your locale.

PostgreSQL 9.2 will add the ability to use named parameters in the body of SQL-language functions.  This is a huge win for us with the LedgerSMB project and makes our lives a lot easier.  The change is backwards-compatible so our existing functions will just work, but it allows for clearer SQL code in function bodies for functions that may sometimes be able to be inlined.  In terms of object-relational modelling this is a big deal.  However in our specific case it is a larger issue because we use these to define application API's within the software.  SQL language functions have been a part of PostgreSQL for some time, but have slowly moved towards being extremely useful and powerful.  For example, sometimes they can be in-lined and treated as subqueries and this has important benefits performance-wise.

The addition of named arguments in bodies means one can do functions like:

CREATE OR REPLACE FUNCTION asset_item__search
(in_id int, in_tag text, in_description text, in_department_id int, in_location_id int)
RETURNS SETOF asset_item  AS
$$
 SELECT * FROM asset_item
  WHERE (id = in_id OR in_id IS NULL) AND
                (tag like in_tag || '%' OR in_tag IS NULL) AND
                (description @@ plainto_tsquery(in_description)
                 OR in_description IS NULL) AND
               (department_id = in_department_id
                 OR in_department_id IS NULL) AND
               (location_id = in_location_id OR in_location_id IS NULL);

$$ LANGUAGE SQL;

Instead of

CREATE OR REPLACE FUNCTION asset_item__search
(in_id int, in_tag text, in_description text, in_department_id int, in_location_id int)
RETURNS SETOF asset_item  AS
$$
 SELECT * FROM asset_item
  WHERE (id = $1 OR $1 IS NULL) AND
                (tag like $2 || '%' OR $2 IS NULL) AND
                (description @@ plainto_tsquery($3) OR $3 IS NULL) AND
               (department_id = $4  OR $4 IS NULL) AND
               (location_id = $5 OR $5 IS NULL);
$$ LANGUAGE SQL;

While we are slowly converting our code base to use ORDBMS features, this will help keep things more maintainable.  Of course a more O/R procedure might be:

CREATE OR REPLACE function similar_to(self asset_item) 
RETURNS SETOF asset_item
LANGUAGE SQL AS $$

 SELECT * FROM asset_item
  WHERE (id = self.id OR self.id IS NULL) AND
        (tag like self.tag || '%' OR self.tag IS NULL) AND
        (description @@ plainto_tsquery(self.description) 
              OR self.description IS NULL) AND
        (department_id = self.department_id  
              OR self.department_id IS NULL) AND
        (location_id = self.location_id 
              OR self.location_id IS NULL);
$$;

You can see the difference in clarity.  In fact, in LedgerSMB, we often find ourselves using pl/pgsql solely due to the ability to use named arguments in queries within function bodies.  We find this is more robust for what we do, and easier to read as well.

User defined functions have been with PostgreSQL from the very beginning, and in fact are required for doing any significant object-relational modelling.  However they are still evolving because this is where PostgreSQL's focus has always been.  We have seen major improvements here in every major release and the function-handling capabilities of PostgreSQL are among the best of any database I have ever worked with.  With third-party handlers it is possible to use all kinds of languages for functions (including Perl, Python, R, and Java) and these can be incorporated in standard SQL queries.  PostgreSQL's focus is and has always been on doing advanced data modelling where the application is best seen as a consumer of managed data models.

From the perspective of the original Postgres developers and the primary developers today, PostgreSQL is a solution for managing and modelling data, where many applications may write to the same relations, and where the relations function, essentially, as an API.  For this reason, PostgreSQL tends to be far more strict about what it will allow applications to do than engines built on the single-app use case.  SQL, in effect, evokes public API's and it is the job of the RDBMS in such an environment to ensure that the API's behave consistently.

This approach blends the traditional roles of database and middleware because a great deal of business logic can be reduced to questions of the data model itself, what questions are asked of the database and what sorts of responses there are.

The other feature to look at is the addition of non-inherited CHECK constraints in 9.2.  In past versions, as we have covered, all check constraints were inherited by all child tables.  Again this allows one to use table inheritance safely without some of the key issues that have plagued it before, although I still consider it an antipattern to insert both into parent and child tables.
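A minimal sketch of the 9.2 feature, using hypothetical tables; a NO INHERIT check on the parent is one way to enforce that rows only ever land in the child tables:

CREATE TABLE generic_note (
    id serial primary key,
    note text not null,
    CONSTRAINT no_direct_rows CHECK (false) NO INHERIT
);

CREATE TABLE customer_note (
    customer_id int not null
) INHERITS (generic_note);

-- INSERT INTO generic_note (note) VALUES ('x');                    -- fails the check
-- INSERT INTO customer_note (note, customer_id) VALUES ('x', 1);   -- fine; the constraint is not inherited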

MySQL and PostgreSQL offer different sorts of flexibility.  MySQL offers tremendous flexibility with what sorts of constraints an application wants enforced at the session level, and what sorts of guarantees a type of table can offer.  These are very useful for some problems.

PostgreSQL offers neither of these, but instead offers flexibility in building advanced data models.

MySQL is designed with the idea that applications provide logic and the database provides dumb storage of the application's state.  While this has changed a bit with the addition of user-defined functions and stored procedures, the overall design constrains MySQL primarily to this use case.  This is not necessarily a bad thing as, traditionally, software licensing costs and requirements have often required that even advanced database systems like Oracle are used in this way.  MySQL targets the "my app, my database" world and is usually sufficient for this, particularly when lowest common denominators are used to ensure portability.

PostgreSQL, on the other hand, is designed with the idea that the database itself is a modelling tool, and that the applications interact with it over an API defined in SQL.  Object-relational modelling advocates point out that often getting acceptable performance in complex situations requires an ability to put some forms of logic in the database and even tie this to data structures in the database.  In this model, the database itself is a development platform which exposes API's, and multiple applications may read or write data via these API's.  It is thus best seen as an advanced data modelling, storage, and centralization solution rather than as a simple application back-end.

These differences show, I think, that when PostgreSQL people complain that MySQL is not a "real database management system" and MySQL people dispute this that in fact the real difference is in definitions, and in this case the definitions are deceptively far apart.   Understanding those differences is, I think, the key to making an informed choice.

Wednesday, September 5, 2012

O/R Modelling part 5: Nested Data Structures, Do's and Don'ts

One of the promises of object-relational modelling is the ability to address information modelling on complex and nested data structures.

Nested data structures bring considerable richness to the database, which is lost in a pure, flat, relational model.  Nested data structures can be used to model tuple constraints in ways that are impossible to do when looking at flat data structures, at least as long as those constraints are limited to the information in a single tuple.  At the same time there are cases where they simplify things and cases where they complicate things.  This is true both in the case of using these for storage and for interfacing with stored procedures.

PostgreSQL allows for nested tuples to be stored in a database, and for arrays of tuples.  Other ORDBMS's allow something similar (Informix, DB2, and Oracle all support nested tables).

Nested tables in PostgreSQL provide a number of gotchas, and additionally exposing the data in them to relational queries takes some extra work.  In this post we will look at modelling general ledger transactions using a nested table approach, and both the benefits and limitations of this approach.  In general this trades one set of problems for another and it is important to recognize the problems going in.

The storage example came out of a brainstorming session I had with Marc Balmer of  Micro Systems, though it is worth noting that this is not the solution they use in their products, nor is it the approach currently used by LedgerSMB.

Basic Table Structure:

The basic data schema will end up looking like this:

CREATE TABLE journal_type (
    id serial not null unique,
    label text primary key
);

CREATE TABLE account (
   id serial not null unique,
   control_code text primary key, -- account number
   description text
);

CREATE TYPE journal_line_type AS (
    account_id int,
    amount numeric
); 

CREATE TABLE journal_entry (
    id serial not null unique,
    journal_type int references journal_type(id),
    source_document_id text,-- for example invoice number
    date_posted date not null,
    description text, 
    line_items journal_line_type[],
    PRIMARY KEY (journal_type, source_document_id)
);

This schema has a number of obvious gotchas and cannot, by itself, guarantee the sorts of things we want to do.  However, using object-relational modelling we can fix these in ways that we cannot in a purely relational schema.  The main problems are:

  • First, since this is a double entry model, we need a constraint that says that the sum of the amounts of the lines must always equal zero.  However, if we just add a sum() aggregate, we will end up with it summing every record in the db every time we do an insert, which is not what we want.  We also want to make sure that no account_id's are null and no amounts are null.
  • Additionally it is not possible in the schema above to easily expose the journal line information to purely relational tools.  However we can use a VIEW to do this, though this produces yet more problems.
  • Finally referential integrity enforcement between the account lines and accounts cannot be done declaratively.  We will have to create TRIGGERs to enforce this manually.
These problems are traded off against the fact that the relational model does not allow the first problem to be solved at all:  we accept solutions which are a bit of a pain in exchange for having solutions at all.

Nested Table Constraints

If we simply had a tuple as a column, we could look inside the tuple with check constraints.  Something like check((column).subcolumn is not null).  However in this case we cannot do that because we need to aggregate on a set of tuples attached to the row.  To do this instead we create a set of table methods for managing the constraints:

CREATE OR REPLACE FUNCTION is_balanced(journal_entry) 
RETURNS BOOL
LANGUAGE SQL AS $$

SELECT sum(amount) = 0 FROM unnest($1.line_items);

$$;

CREATE OR REPLACE FUNCTION has_no_null_account_ids(journal_entry)
RETURNS BOOL
LANGUAGE SQL AS $$

SELECT bool_and(account_id is not null) FROM unnest($1.line_items);

$$;

CREATE OR REPLACE FUNCTION has_no_null_amounts(journal_entry)
RETURNS BOOL
LANGUAGE SQL AS $$

select bool_and(amount is not null) from unnest($1.line_items);

$$;

We can then create our constraints.  Note that because we have to create the methods first, we have to add our constraints after the functions are defined, and these are added after the table is constructed.  I have gone ahead and given these friendly names so that errors are easier for people (and machines) to process and handle.

ALTER TABLE journal_entry 
ADD CONSTRAINT is_balanced 
CHECK ((journal_entry).is_balanced);

ALTER TABLE journal_entry
ADD CONSTRAINT has_no_null_account_ids
CHECK ((journal_entry).has_no_null_account_ids);

ALTER TABLE  journal_entry
ADD CONSTRAINT has_no_null_amounts
CHECK ((journal_entry).has_no_null_amounts);

Now we have integrity constraints reaching into our nested data.

So let's test this out.

 insert into journal_type (label) values ('General');

We will re-use the account data from the previous post:

or_examples=# select * from account;
 id | control_code | description
----+--------------+-------------
  1 | 1500         | Inventory
  2 | 4500         | Sales
  3 | 5500         | Purchase
(3 rows)

Let's try inserting a few meaningless transactions, some of which violate our constraints:

insert into journal_entry
(journal_type, source_document_id, date_posted, description, line_items)
values 
(1, 'ref-10001', now()::date, 'This is a test',
ARRAY[row(1, 100)::journal_line_type]);

ERROR:  new row for relation "journal_entry" violates check constraint "is_balanced"

So far so good.

insert into journal_entry
(journal_type, source_document_id, date_posted, description, line_items)
values 
(1, 'ref-10001', now()::date, 'This is a test',
ARRAY[row(1, 100)::journal_line_type, 
      row(null, -100)::journal_line_type]);

 ERROR:  new row for relation "journal_entry" violates check constraint "has_no_null_account_ids"

Still good.

insert into journal_entry
(journal_type, source_document_id, date_posted, description, line_items)
values 
(1, 'ref-10001', now()::date, 'This is a test',
ARRAY[row(1, 100)::journal_line_type,
      row(2, -100)::journal_line_type,
      row(3, NULL)::journal_line_type])

ERROR:  new row for relation "journal_entry" violates check constraint "has_no_null_amounts"

Great.  All constraints working properly.  Let's try inserting a valid row:

insert into journal_entry
(journal_type, source_document_id, date_posted, description, line_items)
values
(1, 'ref-10001', now()::date, 'This is a test',
ARRAY[row(1, 100)::journal_line_type,
      row(2, -100)::journal_line_type]);
And it works!

 or_examples=# select * from journal_entry;
 id | journal_type | source_document_id | date_posted |  description   |       line_items       
----+--------------+--------------------+-------------+----------------+------------------------
  5 |            1 | ref-10001          | 2012-08-23  | This is a test | {"(1,100)","(2,-100)"}
(1 row)

Break-Out Views

A second major problem that we will be facing with this schema is that if someone wants to create a report using a reporting tool that only really supports relational data very well, then the financial data will be opaque and not available.  This scenario is one of the reasons why I think it is important generally to push the relational model to its breaking point before looking at object-relational functions.  Consequently I think when doing nested tables it is important to ensure that the data in them is available through a relational interface, in this case, a view.

In this case, we may want to model debits and credits in a way which is re-usable, so we will start by creating two type methods:

CREATE OR REPLACE FUNCTION debits(journal_line_type) 
RETURNS NUMERIC
LANGUAGE SQL AS
$$ SELECT CASE WHEN $1.amount < 0 THEN $1.amount * -1 
               ELSE NULL END 
$$;

CREATE OR REPLACE FUNCTION credits(journal_line_type)
RETURNS NUMERIC 
LANGUAGE SQL AS
$$ SELECT CASE WHEN $1.amount > 0 THEN $1.amount 
               ELSE NULL END
$$;

Now we can use these as virtual columns anywhere a journal_line_type is used.

The view definition itself is rather convoluted and this may impact performance.  I am waiting for the LATERAL construct to become available which will make this easier.

CREATE VIEW journal_line_items AS
SELECT id AS journal_entry_id, (li).*, (li).debits, (li).credits
FROM (SELECT je.*, unnest(line_items) li 
        FROM journal_entry je) j;

Testing this out:

SELECT * FROM journal_line_items;

gives us

 journal_entry_id | account_id | amount | debits | credits
------------------+------------+--------+--------+---------
                5 |          1 |    100 |        |     100
                5 |          2 |   -100 |    100 |       
                6 |          1 |    200 |        |     200
                6 |          3 |   -200 |    200 |    

As you can see, this works.  Now people with purely relational tools can access the information in the nested table.

In general it is almost always worth creating break-out views of this sort where nested data is stored.

Referential Integrity Controls

The final problem is that referential integrity is not a well-defined concept for nested data.  For this reason, if we value referential integrity and foreign keys are involved, we must find ways of enforcing these.

The simplest solution is a trigger which runs on insert, update, or delete, and manages another relation which can be used as a proxy for relational integrity checks.

For example, we could:

CREATE TABLE je_account (
   je_id int references journal_entry (id),
   account_id int references account(id),
   primary key (je_id, account_id)
);

This will be a very narrow table and so should be quick to search.  It may also be useful in determining which accounts to look at for transactions if we need to do that.  This table could then be used to optimize queries.
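For example, a reporting query can find the journal entries touching a given account without unnesting every row:

SELECT je.id, je.date_posted, je.description
  FROM journal_entry je
  JOIN je_account ja ON ja.je_id = je.id
 WHERE ja.account_id = 1;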

To maintain the table we need to recognize that never ever will a journal entry's line items be updated or deleted.  This is due to the need to maintain clear audit controls and trails.  We may add other flags to the table to indicate transactions but we can handle insert, update, and delete conditions with a trigger, namely:

CREATE FUNCTION je_ri_management()
RETURNS TRIGGER
LANGUAGE PLPGSQL AS $$
DECLARE accounts int[];
BEGIN
    IF TG_OP ILIKE 'INSERT' THEN
         INSERT INTO je_account (je_id, account_id)
         SELECT NEW.id, account_id
         FROM unnest(NEW.line_items)
         GROUP BY account_id;

         RETURN NEW;
    ELSIF TG_OP ILIKE 'UPDATE' THEN
         IF NEW.line_items <> OLD.line_items THEN
             RAISE EXCEPTION 'Cannot change journal entry line items!';
         ELSE RETURN NEW;
         END IF;
    ELSIF TG_OP ILIKE 'DELETE' THEN
         RAISE EXCEPTION 'Cannot delete journal entries!';
    ELSE
         RAISE EXCEPTION 'Invalid TG_OP in trigger';
    END IF;
END; $$;

Then we add the trigger with:

CREATE TRIGGER je_breakout_for_ri
AFTER INSERT OR UPDATE OR DELETE
ON journal_entry
FOR EACH ROW EXECUTE PROCEDURE je_ri_management(); 

The final invalid TG_OP branch could be omitted, but this is not a bad check to have.

Let's try this out:

insert into journal_entry
(journal_type, source_document_id, date_posted, description, line_items)
values
(1, 'ref-10003', now()::date, 'This is a test',
ARRAY[row(1, 200)::journal_line_type,
      row(3, -200)::journal_line_type]);

or_examples=# select * from je_account;
 je_id | account_id
-------+------------
    10 |          3
    10 |          1
(2 rows)

In this way referential integrity can be enforced.

Solution 2.0:  Refactoring the above to eliminate the view.

The above solution will work great for small businesses, but for larger businesses querying this data will become slow for certain kinds of reports.  Storage here is tied to specific criteria, and indexing is somewhat problematic.  There are ways we can address this, but they are not always optimal.  At the same time our work is simplified because the actual accounting details are append-only.

One solution to this is to refactor the above solution.  Instead of:
  1. Main table
  2. Relational view
  3. Materialized view for referential integrity checking
we can have:
  1. Main table, with tweaked storage for line items
  2. Materialized view for RI checking and relational access
Unfortunately this sort of refactoring after the fact isn't simple.  Typically you want to convert the journal_line_type type to a journal_line_type table, and inherit this in your materialized view table.  You cannot simply drop and recreate since the column you are storing the data in is dependent on the structure.

The solution is to rename the type, create a new one in its place.  This must be done manually and there is no current capability to copy a composite type's structure into a table.  You will then need to create a cast and a cast function.  Then, when you can afford the downtime, you will want to convert the table to the new type.  It is quite possible that the downtime will be delayed and you will have an extended time period where you are half-way through migrating the structure of your database.  You can, however, decide to create a cast between the table and the type, perhaps an implicit one (though this is not inherited) and use this to centralize your logic.  Unfortunately this leads to duplication-related complexity and in an ideal world would be avoided.

However, assuming that the downtime ends up being tolerable, the resulting structures will end up such that they can be more readily optimized for a variety of workloads.   In this regard you would have a main table, most likely with line_items moved to extended storage, whose function is to model journal entries as journal entries and apply relevant constraints, and a second table which models journal entry lines as independent lines.  This also simplifies some of the constraint issues on the first table, and makes the modelling easier because we only have to look into the nested storage where we are looking at subset constraints.

This section then provides a warning regarding the use of advanced ORDBMS functionality, namely that it is easy to get tunnel vision and create problems for the future.  The complexity cost here is so high that the primary model should generally remain relational, with things like nested storage primarily used to create constraints that cannot be effectively modelled otherwise.  However, this becomes a great deal more complicated where values may be updated or deleted.  Here, however, we have a relatively simple case regarding data writes combined with complex constraints that cannot be effectively expressed in normalized, relational SQL.  Therefore the standard maintenance concerns that counsel against duplicating information may give way to the fact that such duplication allows for richer constraints.

Now, if we had been aware of the problems going in we would have chosen this structure all along.  Our design would have been:

CREATE TABLE journal_line (
    entry_id bigserial primary key, --only possible key
    je_id int not null,
    account_id int,
    amount numeric
);  

After creating the journal entry table we'd:

ALTER TABLE journal_line ADD FOREIGN KEY (je_id) REFERENCES journal_entry(id);

If we have to handle purging old data we can make that key ON DELETE CASCADE.

And the lines would have been of this type instead.  We can then get rid of all constraints and their supporting functions other than the is_balanced one.  Our debit and credit functions then also reference this type.  Our trigger then looks like:

CREATE FUNCTION je_ri_management()
RETURNS TRIGGER
LANGUAGE PLPGSQL AS $$
DECLARE accounts int[];
BEGIN
    IF TG_OP ILIKE 'INSERT' THEN
         INSERT INTO journal_line (je_id, account_id, amount)
         SELECT NEW.id, account_id, amount
         FROM unnest(NEW.line_items);

         RETURN NEW;
    ELSIF TG_OP ILIKE 'UPDATE' THEN
          RAISE EXCEPTION 'Cannot change journal entry line items!';
    ELSIF TG_OP ILIKE 'DELETE' THEN
         RAISE EXCEPTION 'Cannot delete journal entries!';
    ELSE
         RAISE EXCEPTION 'Invalid TG_OP in trigger';
    END IF;
END; $$;

Approval workflows can be handled with a separate status table with its own constraints.  Deletions of old information (up to a specific snapshot) can be handled by a stored procedure which is unit tested and disables this trigger before purging data.  This system has the advantage of having several small components which are all complete and easily understood, and it is made possible because the data is exclusively append-only.
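As a sketch of the approval idea (the table and column names here are assumptions):

CREATE TABLE journal_entry_status (
    je_id int primary key references journal_entry (id),
    approved bool not null default false,
    approved_by text,
    approved_at timestamptz,
    CHECK (approved = false OR (approved_by IS NOT NULL AND approved_at IS NOT NULL))
);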

As you can see from the above examples, nested data structures greatly complicate the data model and create problems with relational math that must be addressed if data logic is to remain meaningful.  This is a complex field, and it adds a lot of complexity to storage.  In general, these are best avoided in actual data storage except where this approach makes formerly insurmountable problems manageable.  Moreover, they add complexity to optimization once data gets large.  Thus while non-atomic fields can make sense as an initial point of entry in some narrow cases, as the structures actually queried they are very rarely the right approach.  It is possible that, at some point, nested storage will be able to have its own indexes, foreign keys, etc., but I cannot imagine this being a high priority and so it isn't clear that this will ever happen.  In general, it usually makes the most sense to simply store the data in a pseudo-normalized way, with any non-1NF designs being the initial point of entry in a linear write model.

Nested Data Structures as Interfaces

Nested data structures as interfaces to stored procedures are a little more manageable.  The main difficulties are in application-side data construction and output parsing.  Some languages handle this more easily than others.

Upper-level construction and handling of these structures is relatively straight-forward on the database-side and poses none of these problems.   However, they do cause additional complexity and this must be managed carefully.

The biggest issue when interfacing with an application is that ROW types are not usually automatically constructed by application-level frameworks even if they have arrays.  This leaves the programmer to choose between unstructured text arrays which are fundamentally non-discoverable (and thus brittle), and arrays of tuples which are discoverable but require a lot of additional application code to handle.  At the same time as a chicken and egg problem, frameworks will not add handling for this sort of problem unless people are already trying to do it.
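For example, the two representations the programmer must choose between look like this on the SQL side, using the journal_line_type defined earlier in this post:

-- opaque text literal, exactly as psql displays it; nothing tells the
-- application what the fields are
SELECT '{"(1,100)","(2,-100)"}'::journal_line_type[];

-- structured construction, discoverable but requiring the application
-- (or its framework) to build ROW values explicitly
SELECT ARRAY[ROW(1,100)::journal_line_type,
             ROW(2,-100)::journal_line_type];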

So my general recommendation is to use nested data types everywhere in the database sparingly, only where the benefits clearly outweigh the complexity costs.

Complexity costs are certainly lower at the interface level and there are many more cases where these techniques are net wins there, but that does not mean that they should be routinely used even there.

Sunday, September 2, 2012

O/R Modelling in PostgreSQL part 4: Composite Columns and Cross-Table Refs

So far, we have looked at tables as classes and various ways of looking at inheritance within that structure.  However, tables can be nested in PostgreSQL and this has a number of uses, particularly in views, and data structures passed to and from functions.  As PostgreSQL support for things like JSON becomes more important, these approaches will become more important as well.  Composite types in columns however opens up a significant number of gotchas and in general I tend to suggest avoiding this for physical storage.

Composite types, also known as structured data types, allow you to further define interfaces for user defined functions, and add additional sub-interfaces to those.  They can also be used to define complex data in tables although often multiple inheritance provides for a cleaner solution when it comes to actual storage.   However, complex types can be useful in a number of cases where existing data types are inadequate, as an alternative to building the type in C.

In some cases the same logic can be helpful here too.  In general, when looking at base storage, my preference is to look at inheritance first, and only failing that to look at in-type storage.  Inheritance can give you all the benefits of in-column storage with fewer gotchas and greater reusability.  However, there are cases where this breaks down, notably in views.  Also, as we noted before, tables are composite types (but not every composite type is a table).

We will look at the differences by considering an inheritable table storing address information and see how it behaves differently when used as a composite type for a column.

Warning:  This is a comparatively new area for PostgreSQL and there are many areas of inconsistent behavior that have not been worked out.  If you are using composite types for column definitions, be prepared to run into unexpected sharp edges and the possibility of behavior changing in the future.  Additionally multiple inheritance (presuming unique column names) provides the same benefits with fewer inconsistencies.  This is really at the edges of the system.

Table Schema:

CREATE TABLE country (
     id serial not null unique,
     name text primary key,
     short_name text not null unique,
     state_nullable bool not null default false,
     requires_mailcode bool not null default true,
     unique(id, state_nullable, requires_mailcode)
);

Interface tables:

CREATE TABLE country_ref (
    country_id int
);

CREATE TABLE country_ref_ext (
    country_state_nullable bool,
    country_requires_mailcode bool,
    CHECK (country_id IS NOT NULL)
) INHERITS (country_ref);

And of course, our traverse function:

CREATE OR REPLACE FUNCTION country(country_ref)
RETURNS country 
LANGUAGE SQL AS $$

SELECT * FROM country WHERE id = $1.country_id;

$$;

Ok, so now we have this part set up.

CREATE TABLE address (
    id serial not null unique,
    street_one text not null,
    street_two text,
    street_three text,
    floor text,
    care_of text,
    city text not null,
    state_province text,
    mail_code text,
    foreign key (country_state_nullable, 
                 country_requires_mailcode, 
                 country_id) 
       references country
               (state_nullable, 
                requires_mailcode, 
               id)
) INHERITS (country_ref_ext); 

Of course, now I can take an address and pull the country info off of it by dereferencing the country_id key, and the nullable-state and mailcode properties are also guaranteed to be equal to what they are in the country table.
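
To see this in action, here is a minimal sketch; the sample rows are hypothetical, and we assume they are the first rows in each table so the serial ids are 1:

INSERT INTO country (name, short_name, state_nullable, requires_mailcode)
VALUES ('France', 'FR', true, true);

-- country_state_nullable and country_requires_mailcode must match the country
-- row, or the multi-column foreign key will reject the insert
INSERT INTO address (street_one, city, mail_code,
                     country_id, country_state_nullable, country_requires_mailcode)
VALUES ('1 Rue Exemple', 'Paris', '75001', 1, true, true);

-- dereference the reference with the traversal function defined above
SELECT (a.country).name FROM address a WHERE a.id = 1;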

In this case, we have inherited a specific country reference.  The country_id is required, so we know there will always be a country.

Any check constraints we have inherited are also applicable, and we could subclass the interface classes however we'd like.

Now, suppose we implement the country ref as a composite column instead.

So our table looks like this instead:

CREATE TABLE address_noinherit (
    id serial not null unique,
    street_one text not null,
    street_two text,
    street_three text,
    floor text,
    care_of text,
    city text not null,
    state_province text,
    mail_code text,
    country_ref country_ref_ext
);
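
A quick sketch of populating that composite column (assuming the country row with id 1 from earlier; note the explicit cast to the composite type):

INSERT INTO address_noinherit (street_one, city, country_ref)
VALUES ('1 Rue Exemple', 'Paris',
        -- field order: (country_id, country_state_nullable, country_requires_mailcode)
        ROW(1, true, true)::country_ref_ext);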

Composite Type Behavior and Gotchas

In the first case we can simply do:

SELECT (a.country).name from address a where id = 1;

In the second case, we would similarly:

SELECT ((a.country_ref).country).name
  FROM address_noinherit a WHERE id = 1;

You can then use this to implement a pointer system in your tables for recurring links.

Except that it doesn't really work as well as you might like.

The NOT NULL check is not enforced on a table which merely uses the type, rather than being of the type where the check is defined.  Consequently we have to declare it separately:

    CHECK ((country_ref).country_id IS NOT NULL)
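
For example, with the address_noinherit table above, the workaround might look like this (a sketch; the constraint name is arbitrary):

ALTER TABLE address_noinherit
  ADD CONSTRAINT country_ref_id_not_null
  CHECK ((country_ref).country_id IS NOT NULL);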

Worse, adding a column to the type table can lead to problems.  You cannot include NOT NULL or DEFAULT clauses when adding the column if the table is being used as the type for a column of another table.  To work around this you must add the column first and attach the NOT NULL and DEFAULT constraints afterwards.
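
A sketch of that workaround on country_ref_ext, following the behavior described above (the new column name is hypothetical):

-- adding the column together with DEFAULT/NOT NULL fails while country_ref_ext
-- is in use as a column type, so add the bare column first
ALTER TABLE country_ref_ext ADD COLUMN country_note text;
-- then attach the default, backfill any existing rows, and add the not-null constraint
ALTER TABLE country_ref_ext ALTER COLUMN country_note SET DEFAULT '';
UPDATE country_ref_ext SET country_note = '' WHERE country_note IS NULL;
ALTER TABLE country_ref_ext ALTER COLUMN country_note SET NOT NULL;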

Except that even this can run you into trouble.  There are exceptions to the above rule, particularly where domains are involved in types.  For example:

A domain for positive, not null integers:
or_examples=# create domain pos_not_null_int as int not null check (value > 0);
CREATE DOMAIN

A table which uses this domain:
or_examples=# create table rel_examples.domaintest (id pos_not_null_int);
CREATE TABLE

Constraints on the domain are enforced (as a test):
or_examples=# insert into rel_examples.domaintest values (-1);
ERROR:  value for domain pos_not_null_int violates check constraint "pos_not_null_int_check"
or_examples=# insert into rel_examples.domaintest values (null);
ERROR:  domain pos_not_null_int does not allow null values

A table which uses this as a column type:
or_examples=# create table comp_domain_test ( test rel_examples.domaintest);
CREATE TABLE

Despite what I said before, the constraints are enforced here (which they would not be if the initial table had its check constraints declared at the table level):
or_examples=# insert into comp_domain_test values (row(null));
ERROR:  domain pos_not_null_int does not allow null values
or_examples=# insert into comp_domain_test values (row(-1));
ERROR:  value for domain pos_not_null_int violates check constraint "pos_not_null_int_check"
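
To illustrate that parenthetical point, here is a sketch of the same rules declared at the table level rather than in a domain (behavior as described above):

-- the same constraints, but as table constraints instead of a domain
create table rel_examples.tabletest (id int not null check (id > 0));
create table comp_table_test (test rel_examples.tabletest);
-- when the row type is used as a column type, table-level constraints are not
-- enforced, so these inserts are not rejected the way the domain versions were
insert into comp_table_test values (row(-1));
insert into comp_table_test values (row(null));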

When using complex types in columns the dba is advised not to make too many assumptions.  Given the inconsistent behavior here, I expect that it will likely change in future versions, but in which way one cannot say.

Additionally, sub-tuple elements cannot be part of foreign keys, so you are left with something that has all the problems of Oracle's REF implementation: a weak foreign key which can still be automatically traversed but which does not actually enforce anything about the underlying storage.  In a future post we will look at solutions to foreign key enforcement.

In general, inheritance is cleaner than composite columns and there are far fewer gotchas.  Nevertheless there are some cases where they are an elegant solution to a problem, particularly when looking at nested data storage to enforce constraints which cannot otherwise be expressed.

Composite types in VIEWs

However, where storage is not at issue, composite types, whether based on tables or not, can be used here without the issues above.  Stored procedures can return composite types created on their own (i.e. without inheritance) or taken directly from the inheritance tree.

For composite types which are returned from stored procedures, these can be presented directly in the view or not.  Methods of previously discussed tables can be called as well (so inventory_item.markup in my previous example might go from a function call to an output column), and methods associated with types can be called here too.  However, it may be useful to expose the types directly as a way of allowing additional functionality to be requested.
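
For instance, using the country() traversal function from the address example above, a method can be exposed as an ordinary output column (a sketch):

CREATE VIEW address_country_view AS
SELECT a.id, a.street_one, a.city, (a.country).name AS country_name
  FROM address a;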

In the initial address case which uses inheritance, we might:

CREATE VIEW address_view AS 
SELECT * FROM address;

But then we cannot:

SELECT (av.country).* from address_view av where ....;

One way around this is:

CREATE VIEW address_view AS
SELECT id, street_one, city, state_province, mail_code,
       a::country_ref AS country_ref
  FROM address a;

Now we can:

SELECT ((country_ref).country).name FROM address_view
 WHERE city = 'Paris'
GROUP BY country_ref;

Table aliases can be similarly captured for function output that creates complex types, and the behavior can be exposed in views this way.

In object-relational design it is critical to consider how a view or interface will be used.  One real problem with including too many heavy O/R features in a table is that it can complicate the use of the data by people running queries.   Nested data structures primarily are helpful when they create interfaces to O/R logic that is needed and would not otherwise be available, or where the application expects the data in this format.  In other words, human and computer interfaces may use varying levels of O/R logic.  This, of course, will be covered by a future item in this series.

Next:  Nested Data Structures and Storage (Uses and Gotchas)
Next Week:  Interlude:  Contrasting to MySQL and Messaging in PostgreSQL