Wednesday, April 24, 2013

A few cases where EAV modelling really works in LedgerSMB.

Entity-attribute-value (EAV) modelling is typically seen as an antipattern, and for good reasons.  Misapplied, it leads to inflexible queries, poor performance, and a number of other bad things.  Nonetheless there are classes of problems where EAV is the right solution for the problem at hand.  Two specific areas of LedgerSMB demonstrate this, namely account/dropdown management and the menu schema.  These actually represent different classes of problems, but they have a number of things in common.  The goal of this post is to help readers better understand where EAV modelling can be of great help.

In general, the areas of LedgerSMB which use EAV modelling involve stable, rarely changing data with simple to non-existent validation routines (the menu is expected to be validated externally).  In some cases the basic expectations of EAV are modified slightly.  In both of these cases (which represent relatively different classes of problems) we have been very happy with EAV, and while our approach will no doubt be refined in the future, it is unlikely that these portions of the software will ever move away from EAV.

Note the LedgerSMB development team does not advocate starting with EAV for most problems.  Usually where EAV is appropriate it will naturally emerge in the schema design process.

What is EAV?


EAV is a way to model relatively free-form data in such a way as to dispense with the need for extremely complex, long tables for specifying sparse data.  A typical EAV approach has three basic tables representing entity data, attributes, and value data.

To make EAV work successfully, as much data as possible should be put in the entity table or parent/child tables.  It's only where this approach breaks down that EAV is a good idea.  The examples below will flesh this out more.  However, a basic design might look like:

CREATE TABLE entity (
    id serial not null unique,
    label text primary key,
    ....
);

CREATE TABLE attribute (
   id serial not null unique,
   att_name text not null,
   att_datatype text not null,
   ....
);

CREATE TABLE value_table (
   id bigserial not null unique,
   entity_id int not null references entity(id),
   attribute_id int not null references attribute(id),
   value text not null
);
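
To illustrate how such a schema is typically consumed, here is a hedged sketch of querying the tables above; the attribute name 'color' is purely hypothetical:

SELECT e.label, v.value
  FROM entity e
  JOIN value_table v ON v.entity_id = e.id
  JOIN attribute a   ON a.id = v.attribute_id
 WHERE a.att_name = 'color';

-- pulling several attributes per entity requires one join per attribute,
-- which is where EAV queries start to become unwieldy
SELECT e.label, c.value AS color, s.value AS size
  FROM entity e
  LEFT JOIN value_table c ON c.entity_id = e.id
       AND c.attribute_id = (SELECT id FROM attribute WHERE att_name = 'color')
  LEFT JOIN value_table s ON s.entity_id = e.id
       AND s.attribute_id = (SELECT id FROM attribute WHERE att_name = 'size');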

In the portions below we will explore what is right and what is wrong with this approach.  This post is hardly a definitive guide to the subject, but it is hoped it is a step towards exploring this area.

Why EAV?


EAV dispenses with the requirement for rigid schemas.  Particularly in cases where data validation is not complex or not demanded, EAV can solve problems that standard row-oriented models cannot readily solve.  Some classes of problems (tags in social media, for example) fall readily into a modified EAV model.

In essence, EAV can sometimes(!) be a useful approach when the data does not fit well into other models.  Unfortunately, when it is reached for too quickly, bad things result, but this does not prevent it from being useful in cases where it is at home.

Why Not EAV?


EAV, improperly applied, leads to a significant number of problems: an inability to declaratively validate data within the database schema beyond very simple rules, performance problems, and basic data management problems.

In general my own view is that EAV is best kept as a tool to solve very specific problems, and that most other problems are better solved by careful table design.

Modified EAV in the Account/Dropdown Management Schema


In LedgerSMB, chart of accounts entries are attached to drop-down boxes.  The schema looks like:

CREATE TABLE account (
  id serial not null unique,
  accno text primary key,
  description text,
  category CHAR(1) NOT NULL,
  gifi_accno text,
  heading int not null references account_heading(id),
  contra bool not null default false,
  tax bool not null default false
);



CREATE TABLE account_link_description (
    description text    primary key,
    summary     boolean not null,
    custom      boolean not null
);


CREATE TABLE account_link (
   account_id int references account(id),
   description text references account_link_description(description),
   primary key (account_id, description)
);


Now what is obviously missing here, from an EAV perspective, are values.  We have entities and attributes, but the presence of the attribute itself is the value.  It is like EAV with merely boolean values.

The architecture poses all the classic issues one has with EAV.  There are rules regarding which combinations of drop-downs are invalid.  These are not complex rules, but they could change over time.  Currently, if an account has a summary drop-down link, that is the only link allowed, so no others can exist.  Custom accounts are not validated and can break these rules.

Currently validation occurs via a stored procedure which encodes the rules in its own logic.
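
As a rough illustration only (not the actual LedgerSMB code, and ignoring the custom-account exemption), the summary rule above could be enforced with a trigger along these lines; the function and trigger names are hypothetical:

CREATE OR REPLACE FUNCTION check_account_link() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    -- if any link for this account is a summary link, it must be the only one
    IF EXISTS (SELECT 1
                 FROM account_link al
                 JOIN account_link_description d USING (description)
                WHERE al.account_id = NEW.account_id AND d.summary)
       AND (SELECT count(*) FROM account_link
             WHERE account_id = NEW.account_id) > 1
    THEN
        RAISE EXCEPTION 'summary accounts may have only one link';
    END IF;
    RETURN NEW;
END;
$$;

CREATE TRIGGER account_link_check
    AFTER INSERT OR UPDATE ON account_link
    FOR EACH ROW EXECUTE PROCEDURE check_account_link();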

Normally the tables involved here are small.  A typical business will have no more than a few dozen account entries, and maybe a hundred or two account_link entries, and there are only 27 account link descriptions.  However larger businesses could have thousands of accounts and tens of thousands of account_link entries.

The approach here scales well because there are only really three typical search patterns. 

The first is a search of accounts for a specific link description (dropdown box).  This will usually pull only a portion of the accounts and can be managed through indexes.

The second is a search of account_links for a given account.  This is trivial and usually no more than about 20 items.

Finally, we may pull all entries from all tables and join/aggregate.  This does not require scanning tables multiple times and comparing results in order to determine the appropriate rows to return.
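
Hedged sketches of these three access patterns follow; the link description in the first query is just a hypothetical example:

-- 1. accounts attached to a specific drop-down
SELECT a.*
  FROM account a
  JOIN account_link al ON al.account_id = a.id
 WHERE al.description = 'AR_paid';

-- 2. links for a given account
SELECT description FROM account_link WHERE account_id = ?;

-- 3. everything, joined and aggregated per account
SELECT a.id, a.accno, a.description, array_agg(al.description) AS links
  FROM account a
  LEFT JOIN account_link al ON al.account_id = a.id
 GROUP BY a.id, a.accno, a.description;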

One thing that is important to note is that chart of accounts data is pretty stable.  A few records may change over time, but radical changes or mass updates are very rare.  Thus we can afford to sacrifice some write performance over the long run.
 

Lightweight EAV in the menu definition schema


Our menu schema looks like this:

CREATE TABLE menu_node (
    id serial NOT NULL,
    label character varying NOT NULL,
    parent integer,
    "position" integer NOT NULL
);


CREATE TABLE menu_attribute (
    node_id integer NOT NULL references menu_node(id),
    attribute character varying NOT NULL,
    value character varying NOT NULL,
    id serial NOT NULL,
    primary key(node_id, attribute)
);


CREATE TABLE menu_acl (
    id serial NOT NULL,
    role_name character varying,
    acl_type character varying,
    node_id integer,
    CONSTRAINT menu_acl_acl_type_check CHECK ((((acl_type)::text = 'allow'::text) OR ((acl_type)::text = 'deny'::text))),
    PRIMARY KEY (node_id, role_name)
);


What is omitted here is the attribute table.  In essence our menu model has no data validation and the data is extremely stable.  It is expected that the developer validates the data against the application.

The tables here are relatively small.  By default there are a couple hundred rows in menu_node and nearly a thousand in menu_attribute.  This structure stretches what relational databases will let you do gracefully, since menu_node represents a tree hierarchy and the rest establishes a very open (unvalidated) EAV model.

Access patterns here are typically once per login, pulling all data from these tables (two scans, one for permissions), and then aggregating the attributes such that we get enough information to create a menu structure (<ul> tags with hyperlinks inside <li> tags).
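
A hedged sketch of that per-login aggregation query (the permissions scan is omitted for brevity):

SELECT n.id, n.label, n.parent, n."position",
       array_agg(ma.attribute) AS attributes,
       array_agg(ma.value)     AS attribute_values
  FROM menu_node n
  LEFT JOIN menu_attribute ma ON ma.node_id = n.id
 GROUP BY n.id, n.label, n.parent, n."position"
 ORDER BY n.parent, n."position";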

Menu items are usually only created once and are extraordinarily stable. 

Conclusions


These use cases have a number of striking things in common.  The access patterns are simple, as are the in-db data validation rules (if they exist at all).  The data is rarely written.

The second point is something higher level that these cases have in common.  They represent a way for the application to store its own API in the database.  In other words, they represent a way of storing application configuration information, or hooks and references to application logic (though not the logic itself).  These areas are semantically very flexible, but typically the data is long-lived once everything is set up.

Tuesday, April 16, 2013

A (Distributist) View on Software Freedom

We are relatively familiar with Stallman's views on software freedom.  These views are deeply based on the idea that we can enumerate rights which should be respected.  However, sometimes the enumeration of rights gets in the way of actually having freedom, because of the assumption that anything beyond those rights is not a right at all.  The idea of four freedoms essentially encourages a lawyer's view of software freedom, where anything outside those freedoms can be restricted in order to achieve the appearance of freedom.

This post provides a Distributist view of software freedom.  It functions in part as an introduction to Distributist economic and political theory for open source software professionals and an introduction to open source software for Distributists.  The approach will be shown to be very different from Stallman's or other groups', but it also shows in greater detail why software freedom is important, in a more general way than Stallman's enumerated-rights approach does.

It's worth noting off the bat that Distributism arose as a critique of both Capitalism and Communism and represents something with some of the ideals of both sides, but altogether different in character from either.  It's my thesis that open source makes the most sense when looked at through a Distributist approach instead of a liberal rights approach.

Economic Activity, Means of Production, and Ownership


Capitalist, Marxist, and Distributist economic theories are all fundamentally based on the recognition of the means of production and their place in regulating economic activity.

We humans no longer live, anywhere in the world, in unimproved conditions (say in trees without tools, clothing, or the like).  For us to live we must improve our environment and make it more conducive to human activity.    Production of goods and delivery of services, however, requires a few specifics:  land, tools, labor, and capital.  Without land, a business has nowhere to work.  Without tools, there is nothing that can be done.  Without labor there is nobody to do the work, and without capital there is no way to pay the costs to get started.

The means of production constitute two parts of this equation, namely land (facilities) and tools used to produce goods and deliver services.  Labor and capital represent two other inputs but they are different in kind.  Capital is expended flexibly, and labor is human input by the actual workers.  Under a system of slavery, slaves might be included but otherwise, free workers are different in that they are not owned.

The term ownership is, however, somewhat problematic.  Ownership itself is socially constructed, and can take different forms in different contexts.  The simplest, perhaps even most satisfying, definition of ownership is simply the right to utilize, and direct the utilization of, an item, particularly in the production of economic goods or services.  This definition largely follows that of Hilaire Belloc's The Servile State, which is one of the fundamental works of Distributist economic and historical theory.

Distributism, Capitalism, and Marxist Communism can be differentiated, though, by who owns the means of production, in the sense of having the right to utilize or direct the utilization of them.  These represent fundamentally different approaches to trying to manage the economic problems of a society (other approaches exist as well).

Capitalism, Communism, and Distributism as Economic Orders


For this discussion it is important to define these terms clearly and their relationship, so that the scope and type of the Distributist critique of Capitalism and Communism are understood.  This is also a key to understanding the economic benefits of open source software.  It is also worth noting that "perfect" examples of Capitalism, Communism, and Distributism are hard to find, so to some extent one has to ask what aspects of these three systems one finds in any given economic order.  In fact one may find aspects of all three in the US today despite the fact that the US is one of the most clearly Capitalist economies around today.

Capitalism is a system where, in a class-based society, the class with capital buys land and tools, and hires labor to start businesses.  Capitalism thus means "ownership by financier."  Capitalism is fundamentally prone to market failures through mechanisms like overproduction, and workers are largely powerless because they have few options other than working for financier-owned businesses.  As Adam Smith noted, wages and salaries are determined less by supply and demand and more by individual bargaining power, so the less empowered a worker is, the less the worker will earn.

Marxist Communism seeks to solve the problems of Capitalism by taking the means of production from the financiers and vesting those in the state as the representative of the worker.  Of course the problems of Capitalism are caused by too few people having too much control, so further concentrating power in fewer hands will not solve the problem.   Further, while financiers may have a primary goal of making money, government officials have complex goals and so the power that comes with control over economic production (which amounts to control over life itself) inevitably leads to great and terrible dictatorships.

Distributism looks back to the progression of the economy of the Middle Ages towards an economy where most people were effectively self-employed and thus seeks to undo the centralization of power that occurs under Capitalism by vesting ownership in the means of production with individual workers, not the state as an intermediary.  An economy dominated by the self-employed would be a Distributist economy.

The Distributist critique of Capitalism at the beginning tracks the Marxist critique of Capitalism.  Capitalism is seen to create problems for workers who are disadvantaged by wealthy business owners, but the mechanism of that disempowerment is different, and so here they begin to diverge.  In essence, because workers are denied the opportunity to participate in the economy as free agents, they are dependent on their employers for basic subsistence.  This leads workers to demand better conditions, which leads to class warfare.  This threatens the liberal capitalist order, and results in the socialization of some things that the wealthy business owners are willing to give up for promises of greater gains in the future.

Unlike Marxists, Distributists usually assume that the wealthy classes, by virtue of their economic control, will usually win class warfare and this is well attested historically.  Therefore Distributism seeks to avoid rather than stoke this warfare.  The goal is to gently cultivate businesses and individuals where labor and capital are not separate and therefore not in conflict.  A handyman who owns his own tools and pays for the start-up expenses of going into business is thus a job creator in a way in which venture capital firms are not.  There is nothing more empowering to a worker than the ability to quit and easily start a business, so a porous border between formal employment and self-employment is believed to increase wages.

Software Ownership, EULA's and Open Source


Typically software is not sold but licensed.  The licensee (end user) receives certain rights to utilize the software for certain purposes decided by the licensor.  For this reason, if we define ownership as the right to utilize, particularly for economic production, software comes with a wide range of degrees of ownership.

Some software, for example, is free for non-commercial use.  This offers a very limited degree of ownership (regarding economic production) unless fees, which are often significant, are paid.  Other software comes with seemingly arbitrary limits which are specifically designed to provide upselling opportunities to the vendor, either regarding software or license rights.

For example, in the past certain versions of Windows have included terms in their End User License Agreement specifying that web servers may not be run on the systems.  The whole purpose is to ensure that people who want to run Apache on their systems must buy versions that also come with IIS.

Similarly client access licenses are typically used as a general upselling opportunity, so that larger users must pay more.

Of course one is only entitled to use what one pays for, and therefore businesses using commercial off-the-shelf software often must spend time and energy ensuring they are compliant with all the various terms of every software license.

Open source is different in that it places very few limits on utilization, and some software (such as that under the MIT license) comes with almost no limitations on utilization at all.

In general open source software licenses allow unlimited utilization within a business.  They provide varying degrees of restriction regarding utilization in products that incorporate the software, particularly in how much ownership can be asserted over the software once it is distributed to a customer.

Open Source Development Communities vs Dual-Licensing Companies


There are two predominant approaches to the development and distribution of open source software.  The first is that of dual-licensing by a single company which holds the copyright and sells proprietary licenses in addition to an open source offering, often one which has fewer features.  The second is the multi-vendor model where several companies come together to jointly offer an open source offering.  The companies may or may not have their own proprietary offerings as well.

The big difference between these two models is in who has what degree of control over what people and businesses do with the software.  In the first model,  the company has the incentive to provide the software under a license with as many restrictions on utilization as possible.  In the second, no one company has that power, and thus the incentive is to provide the software under fewer restrictions.

These differences largely show the Distributist position to be correct, that distribution of control means that the vendors have less control and that the workers and consumers have more.  In essence programs developed and owned by communities are less restricted and more open than those which are developed and owned by single companies (owned meaning having the right to tell you what you can or cannot do with the software).

The typical copyleft critique of BSD- and MIT-licensed software is that it gives too much freedom, to the point where it is not possible to enforce downstream software freedom.  I think to some extent this is both incorrect and it misses the point.  It is true that many permissively licensed projects have seen proprietary spinoffs which no longer contribute code back, but in the long view, these rarely outlive a project with significant vitality.

For example, Solaris began as a proprietary spinoff of UC BSD.  Other UC spinoffs, such as FreeBSD, are still being actively developed, mostly under the same license approved by the University of California.  The only development on Solaris I am aware of is on an open source fork of OpenSolaris.  Illustra was an early proprietary spinoff of PostgreSQL.  It was purchased and some code incorporated into Informix, but otherwise it died.  Mammoth PostgreSQL is now alive only as an open-sourced set of patches.  We can expect that other forks in coming decades which have broken themselves off from community development will be unable to compete in the long run.  History is littered with failed proprietary forks of BSD-licensed software.....

The community, however, continues.

The obvious counter-example is NCSA Mosaic, but it isn't clear that any proprietary forks of the browser survived either, and Mosaic was originally developed by a single entity.  Thus this exception largely proves the rule.

The community endures.  Companies come and go.

From this perspective, corporations are actually not that powerful compared to the flexible network of developers building open source software.  These networks, in the case of many pieces of important software infrastructure, have proven remarkably robust:

  • BSD began in 1977, and while the University of California Berkeley is no longer the one that manages development, several forks exist today (OpenBSD, FreeBSD, NetBSD, etc).
  • PostgreSQL began in 1985 at UC Berkeley and like BSD escaped into the wild.  It is one of the most successful open source database engines around.
  • Apache Web Server was released first in 1995.
  • The Linux Kernel was begun in earnest in 1991.
  • TeX was initially released in 1978.
  • LaTeX was initially released in 1982.
In comparison, commercialization ventures in these areas have tended to be much shorter-lived, and many technologies that seemed like they were here to stay have disappeared or are dying (Flash being an example that comes to mind).

Thoughts on MIT, BSD, GPL, and AGPL Licenses


Careful observers will note I listed the licenses in order from the most permissive to the most restrictive.

Looking at open source from the eyes of a Distributist theory of economic production and ownership, the MIT, BSD, GPL, and AGPL licenses are differentiated regarding the specifics of the right to utilize that is passed downstream.  Of these, the BSD license family provides, IMO, the best balance between the interests of market participants.  Sublicensing is not allowed, but all other utilization is.

The permissive licenses are fundamentally more Distributist than the copyleft licenses are, and the copyleft licenses are more Capitalist than the permissive licenses are.  With a copyleft license there are important rights retained for further licensing, so one can essentially require additional fees be paid upstream for certain uses.  Much of my discussion here is drawn from Larry Rosen's book on the subject.

The MIT license largely only requires attribution.  One is allowed to do pretty much anything else with the software.  This includes, notably, the right to sublicense it, in other words to sell part of the rights one received under the license to downstream parties.  In other words, I can take MIT Kerberos, rename it to GuardDog Authentication Services (a totally hypothetical name; any resemblance to a real product is purely coincidental), and sell it without making any modifications to the code at all.  I can also incorporate the software into my products, and license the code with restrictions not present in the MIT license.  In essence the MIT license differs from the public domain only in that it requires attribution.

The BSD license is significantly more restrictive than the MIT license, in that the license does not mention sublicensing.  I can take PostgreSQL, add important changes, and license the new product with restrictions not present in the original license.  However, I cannot do this unless I add significant, copyright-worthy changes.  The restrictions I add affect changes I make and the work as a whole, not the code I did not modify, and this is a significant difference from MIT-licensed code.  While there are exceptions, the vast majority of IP lawyers I have discussed the interpretation of these licenses with concur with this interpretation.

The permissive licenses above have the advantage in that they pass downstream the right of those who do further development to fully own their work to the extent that society allows (through copyright law and the like).  The copyleft licenses are different in that they pass only limited rights to utilize downstream to further developers.

In the description below I am going to avoid directly addressing hot topics like whether linking has anything to do with derivation.  These arguments, however, further serve to limit the right to utilize the software by providing gray, controversial zones of use, and so I will note them.  The copyleft licenses are efforts to enshrine and protect Stallman's four freedoms (the right to run the software, the right to study the software, the right to copy the software, the right to modify the software).  This is obviously an intentional reference to the rhetorical framework of Roosevelt's New Deal, and it highlights what I think is the flaw in the copyleft approach.

Roosevelt's New Deal was sold to the public as a way out of the Great Depression and a way to help the poor who were suffering from the economic downturn.  However, the New Deal included programs intended to centralize industries (some of which were struck down by the Supreme Court prior to the court packing threat), and programs which supplanted organic, family-centric safety nets with impersonal institutions.  In essence the New Deal was corporatist and paved the way for the neoliberalism which has ruled the West in recent decades.  The same basic problems occur within the FSF's approach, which is why the GPL and AGPL are particularly popular with corporations wanting to reduce open source software to shareware.  Note that regarding institutional control, the FSF advocates giving them (the FSF) control over future licensing changes to the software via an "or at your option any later version" clause in the licensing.  For those of us trying to break up the corporate control over software, however, the New Deal is the wrong model.  Trust-busting and common goods are the right model.

This is not to say that distributist approaches cannot use the GPL for their software.  It is just to say that the BSD license is probably better in such areas.  The fundamental difference is whether one sees the primary method of preventing abuse as centralized regulation (via the FSF in the form of new versions of the GPL family of licenses) or decentralized downstream control (via multi-vendor communities).  While these are not mutually exclusive, I don't think it is possible to rely equally on centralization and decentralization at the same time.  One has to choose one as primary.  If you choose centralization, you have to pretty much follow the GNU model of requiring copyright assignments, etc.  If you choose decentralization and do not require copyright assignments, you have to admit that this dilutes the protections of the GPL, moving it more towards the BSD family of licenses (because it is harder to prove copyright infringement when you first have to prove ownership of the code).

The GNU GPL family of licenses places significant burdens on the right to utilize the software to produce new goods incorporating the software.  This is due to the so-called viral nature of the license.  One cannot close off the software and refuse to give downstream authors the same rights.  However on top of that there are controversies around where this starts and stops which make this restriction remarkably broader than it might seem on the surface.  The first is whether, as a copyright matter, you need permission to link to a library and utilize that library's code.  This affects whether, in the GPL v2, a proprietary licensed library connected to a GPL'd application (or vice versa!) is mere aggregation or a derivative work under the full control of the GPL.  This issue has never been resolved in court anywhere in the world to my knowledge and so people avoid doing this.

The GPL v3 is a very long, complex, legal contract which attempts to address the above issue among others (and it places further restrictions on goods produced regarding cryptographic signatures and the like).  On the whole the license creates more problems than it solves.  For example, everyone I have talked to agrees that the GPL v3 license and the BSD family of licenses (assuming no obnoxious advertising clauses) are compatible, but the view particularly of the FSF on what these licenses require is very different from what most lawyers I have talked to from the Software Freedom Law Center and independent IP lawyers have said.  The question boils down to what section 7, Additional Terms, requires of other licenses, and whether (or, more properly, how) the BSD license can be construed to be compatible with that.

The mainstream view is that the licenses were intended to be compatible and so the GPL should be read to be compatible with the BSD license.  This means that, since the BSD license does not mention sublicensing, the additional downstream permissions granted by the copyright holder take effect no matter what, and the 7(b) reasonable legal notices section should be read broadly enough to allow additional non-removable permissions granted by the copyright holder of certain contributions, provided the permissions only govern those contributions specifically.  It is worth noting that the Software Freedom Law Center has dedicated significant resources towards publishing viewpoints explaining this specific position.

Another view I have heard (from Eben Moglen shortly after the GPL v3 was finalized, but to be honest I don't know if his view has changed) is to read the BSD license as allowing sublicensing and read the GPL as requiring a right to sublicense, and then read removal of permissions not as a sublicense per se but just as a notice that the code cannot be safely treated as BSD-licensed.  But if this is the case and the BSD license does not grant that right, then it raises significant legal issues.  As a software developer I don't feel like I can accept this interpretation of both licenses safely.

These controversies and others conspire to create a situation where there exists significant uncertainty about how one can safely utilize the software in the creation of new products.  In general, one cannot go far wrong by following community norms, which treat the BSD license as compatible and linking to proprietary software and libraries as problematic, but these norms could be destabilized in the future by litigation, and so this is a significant concern.

Of the copyleft licenses, the Affero GPL or AGPL stands alone in placing significant restrictions on the utilization of software to deliver services.  Of the licenses discussed, it is the one which is plainly incompatible with a Distributist approach to economic ownership.  The AGPL requires that modifications to the software be made available to the user of the software when that user is interactively using the software as a service.

The AGPL is specifically intended to close the so-called SaaS loophole, but it is the wrong solution to the wrong problem.  From a customer's perspective the primary concern with SaaS is not functionality but data ownership.  If I take a piece of software and modify it, I can assume users can hire someone to re-implement the modifications easily enough, but if they don't have access to their raw data, they are stuck with me.  This sort of lock-in is the real problem with SaaS from a competitive perspective but it cannot be reasonably addressed through software licenses while maintaining a commitment to distributed ownership and development, and freedom to utilize.

Conclusions


Free and open source software today is split between two models.  The first is what we might call the Capitalist model, exemplified by products like MySQL, which presupposes a single commercial entity holding total copyright ownership over the software.  The second is the Distributist model, where ownership and the right to utilize are distributed.  PostgreSQL is a good example, as is FreeBSD.  The second model is not incompatible with capitalist entities participating, but they do not control the right to utilize the software in the same way.  Similarly, the Capitalist model cannot stop Distributist offshoots (see MariaDB and LedgerSMB), and those offshoots are usually forced into a Distributist model because the founders cannot assert sufficient copyright control to engage in dual licensing.

In general, the Distributist understanding of property and business ownership provides a very different look at free and open source software, and why these are important even for non-programmers.  These programs also provide a very good look at a  Distributist economic order and how it can function in a Capitalist society's knowledge economy.

Thursday, February 28, 2013

The LedgerSMB Service Locator for UDF's: How it Works

One of the questions I got from a post last week was how the LedgerSMB service locator works.  In the past I have covered the database design side, but not how the service locator works.  Part of the reason is that the service locator is still very much a work in progress and over time, we expect to have something quite a bit more full-featured than what we have now which is simple but works.

We currently use a stored procedure API divided into an upper and a lower half.  The upper half generates an API call for the lower half, which can also be called independently when finer control is needed.

The Bottom Half


The bottom half of this mechanism is the stored procedure call API.    The Perl API uses the following syntax:

@results = call_procedure( procname => $procname, args => $args );

@results is an array of hash references each of which represents a row returned.  $procname is naturally the name of the procedure, and $args is an array reference of scalar argument values.

For example, if I want to call a function named company__get which takes a single argument of $id, I might write:

($company) = call_procedure( 
          procname => 'company__get', args => [$id] 
);

This function then generates a query from the arguments of:

SELECT * FROM company__get(?);

And runs it, parameterized, with $id as the argument.

The full source code for the function in Perl is:

sub call_procedure {
    my $self     = shift @_;
    my %args     = @_;
    my $procname = $args{procname};
    my $schema   = $args{schema};
    my @call_args;
    my $dbh = $LedgerSMB::App_State::DBH;
    if (!$dbh){
        $dbh = $self->{dbh};
    }
    @call_args = @{ $args{args} } if defined $args{args};
    my $order_by = $args{order_by};
    my $query_rc;
    my $argstr   = "";
    my @results;

    if (!defined $procname){
        $self->error('Undefined function in call_procedure.');
    }
    $procname = $dbh->quote_identifier($procname);
    # Add the test for whether the schema is something useful.
    $logger->trace("\$procname=$procname");
    $schema = $schema || $LedgerSMB::Sysconfig::db_namespace;

    $schema = $dbh->quote_identifier($schema);

    for ( 1 .. scalar @call_args ) {
        $argstr .= "?, ";
    }
    $argstr =~ s/\, $//;
    my $query = "SELECT * FROM $schema.$procname()";
    if ($order_by){
        $query .= " ORDER BY $order_by";
    }
    $query =~ s/\(\)/($argstr)/;
    my $sth = $dbh->prepare($query);
    my $place = 1;
    # API Change here to support byteas: 
    # If the argument is a hashref, allow it to define its SQL type
    # for example PG_BYTEA, and use that to bind.  The API supports the old
    # syntax (array of scalars and arrayrefs) but extends this so that hashrefs
    # now have special meaning. I expect this to be somewhat recursive in the
    # future if hashrefs to complex types are added, but we will have to put
    # that off for another day. --CT
    foreach my $carg (@call_args){
        if (ref($carg) eq 'HASH'){
            $sth->bind_param($place, $carg->{value},
                       { pg_type => $carg->{type} });
        } else {
            $sth->bind_param($place, $carg);
        }
        ++$place;
    }
    $query_rc = $sth->execute();
    if (!$query_rc){
          if ($args{continue_on_error} and  #  only for plpgsql exceptions
                          ($dbh->state =~ /^P/)){
                $@ = $dbh->errstr;
          } else {
                $self->dberror($dbh->errstr . ": " . $query);
          }
    }

    my @types = @{$sth->{TYPE}};
    my @names = @{$sth->{NAME_lc}};
    while ( my $ref = $sth->fetchrow_hashref('NAME_lc') ) {
        for (0 .. $#names){
            #   numeric            float4/real
            if ($types[$_] == 3 or $types[$_] == 2) {
                $ref->{$names[$_]} = Math::BigFloat->new($ref->{$names[$_]});
            }
        }
        push @results, $ref;
    }
    return @results;
}


In addition to the portions described above, this function also does some basic error handling, delegating to another function which logs full errors and hides some errors (particularly security-sensitive ones) behind more generic user-facing error messages.

There is a modest amount more to the logic in terms of type handling and the like, but that's roughly it for the main logic.

Future Enhancements for the Bottom Half


In the future I would like to add a number of features including window definitions and window function aggregates which can be tacked on to a function's output.  I would in essence like to be able to go from a maximum complexity of something like:

SELECT * FROM my_func(?) order by foo;

to

SELECT *, sum(amount) 
          over (partition by reference order by entry_id)
          AS running_balance
  FROM gl_report(?, ?, ?) order by transdate;

This sort of thing would make reporting functions a lot more flexible.

The Top Half 


The top half serves as a general service with regard to the location of the stored procedure.  The function is located in the DBObject module, and is called "exec_method."  This function provides service location capabilities, provided that function names are unique (this may change in future generations as we experiment with other representations and interfaces).

The top half currently uses either an object-property-to-argument mapping approach or an enumerated argument approach, exclusively.  There is no ability to mix these, which is a current shortcoming.  I almost never use the enumerated argument approach since it is relatively brittle.

Additionally the ordering API in the Perl code is really suboptimal and needs to be redone in future versions.

The Perl code is:

sub exec_method {
    my $self   = shift @_;
    my %args  = (ref($_[0]) eq 'HASH')? %{$_[0]}: @_;
    my $funcname = $args{funcname};

    my $schema   = $args{schema} || $LedgerSMB::Sysconfig::db_namespace;

    $logger->debug("exec_method: \$funcname = $funcname");
    my @in_args;
    @in_args = @{ $args{args} } if $args{args};

    my @call_args;

    my $query = "
        SELECT proname, pronargs, proargnames FROM pg_proc
         WHERE proname = ?
               AND pronamespace =
               coalesce((SELECT oid FROM pg_namespace WHERE nspname = ?),
                        pronamespace)
    ";
    my $sth   = $self->{dbh}->prepare(
                $query
    );
    # Execute the catalog lookup; without this the fetch below returns nothing
    $sth->execute($funcname, $schema);
    my $ref;

    $ref = $sth->fetchrow_hashref('NAME_lc');

    my $pargs = $ref->{proargnames};
    my @proc_args;

    if ( !$ref->{proname} ) {    # no such function
        # If the function doesn't exist, $funcname gets zeroed?
        $self->error( "No such function:  $funcname");
#        die;
    }
    $ref->{pronargs} = 0 unless defined $ref->{pronargs};
    # If the user did not provide args, map object properties to the function's named arguments
    if (!defined $args{args}) {
        @proc_args = $self->_parse_array($pargs);
        if (@proc_args) {
            for my $arg (@proc_args) {
                #print STDERR "User Provided Args: $arg\n";

                if ( $arg =~ s/^in_// ) {
                     if ( defined $self->{$arg} )
                     {
                        $logger->debug("exec_method pushing $arg = $self->{$arg}");
                     }
                     else
                     {
                        $logger->debug("exec_method pushing \$arg defined $arg | \$self->{\$arg} is undefined");
                        #$self->{$arg} = undef; # Why was this being unset? --CT
                     }
                     push ( @call_args, $self->{$arg} );
                }
            }
        }
        for (@in_args) { push @call_args, $_ } ;
        $self->{call_args} = \@call_args;
        $logger->debug("exec_method: \$self = " . Data::Dumper::Dumper($self));

        return $self->call_procedure( procname => $funcname,
                                          args => \@call_args,
                                      order_by => $self->{_order_method}->{"$funcname"},
                                         schema=>$schema,
                             continue_on_error => $args{continue_on_error});
    }
    else {
        return $self->call_procedure( procname => $funcname,
                                          args => \@in_args,
                                      order_by => $self->{_order_method}->{"$funcname"},
                                         schema=>$schema,
                             continue_on_error => $args{continue_on_error});
    }
}


Opportunities for Future Improvements


Initial improvements include replacing the enumerated argument API with one where a hashref can be passed, overriding part or all of the args sent to the database.  This would continue to make the API flexible and dynamic but would allow for richer client code.  The ordering API also needs to be moved into the actual API call, instead of a separate call.

Next Generation Interface Under Development


The next generation interface will support an API call like:

SELECT * FROM save('(,A-12334,"Test, Inc.",232)'::entity);

The major challenge here is recursively building what is essentially a potentially nested CSV structure.  For example, we might have:

SELECT * FROM save('(,JE-12334,Cash Transfer,2013-05-01,f,"{""(,4,-1000)"",""(,7,1200)"",""(,12,200)""}")'::journal_entry);

The escaping is not actually too hard.  The key challenges are questions of performance optimization, such as making sure that we cache data structures properly and the like.
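
The escaping rules can be illustrated with a hypothetical pair of composite types (not the real LedgerSMB types): inside a composite literal, a quoted field may itself contain an array of composites, with the inner double quotes doubled.

CREATE TYPE line_item AS (id int, account_id int, amount numeric);
CREATE TYPE journal_sketch AS (id int, reference text, description text,
                               lines line_item[]);

-- the array of line items is a quoted field of the outer composite,
-- so the quotes around each array element are doubled
SELECT ('(,JE-1,"Cash Transfer","{""(,4,-1000)"",""(,7,1000)""}")'::journal_sketch).*;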

However in addition to that problem I would like to be able to define window functions on result sets over the API so that running balances can be added in the database (where they can be done most efficiently).

A fair bit of work has been done on this already.

Licensing


The code  above is licensed under the GNU General Public License version 2 or at your option any later version.  The code is not the cleanest code we have written on the subject but it is the code which is used by LedgerSMB in production.

If you would like BSD-licensed code to work with, which is also likely cleaner code, please see the PHP implementation that John Locke and I have put together.

Monday, February 18, 2013

A look back on something I wrote 9 years ago

I recently came across a comment I wrote in 2004 on Slashdot.  I read it and was surprised actually both where my approach had changed and where it hadn't.  The entire comment is worth posting with a few very minor edits:

First, I think that stored procs are not only often good but lead to better, more powerful, secure, and flexible applications than would be feasible without them. But on the other hand, they lead to hard-to-maintain applications which are explicitly tied to one database. So they are necessary but often misused.

Triggers are more important and usually use stored procedures to ensure that the information in the database is always meaningful or that some other automated activity happens on an insert or update.

Unless I absolutely have to, I try to avoid having my application call stored procedures directly. A relational database manager *should* be able to hide the stored procedures behind a view allowing a nice standard interface for your data. This means that if you have to move to another RDBMS later, porting is much more simple and mostly confined to the backend.

BTW, I agree with the points about having your business logic in one place. Stored procedures allow you to move business logic which is inherent to the database into the database, thus making it available from all clients regardless of the language they are written in. For a single app/db pair this is not an issue but if you have a large DB with many different clients, it is a larger issue. Maintaining your application in one location is a LOT less work than reimplementing it all in every one of your apps.

Triggers, BTW, as I mentioned before are very powerful mechanisms. They are not called by the app directly but are run every time a row is inserted, updated, or deleted (some RDBMS's also allow select triggers, though some have alternate ways of implementing this). They can be used to implement additional security restrictions, enforce referential integrity, or more advanced stuff such as send email when a new order is added to the database. Again, this is done regardless of whether the order is filed using a web app or a Windows app for the rep in the call center. Since the logic is unknown to the app, it isn't even aware of what happens after the order is placed. Talk about clean interfaces..... This requires stored procedures.

So, these are the great things about stored procedures. But when they are used badly, you end up with the stored procedures reducing the agility of the application because they tie it in too closely to the database. What do you do when your app is tied to Sybase and your customers want MS SQL Server? What if *all* your logic is in that database? Do you rewrite *all* of it for MS SQL Server? Probably Not. You are stuck and out of your investment.

In my opinion, it is important to maintain a clear line between what is the database's job and what is the application's job. If this line is blurred, bad things can result. Stored procedures are a very easy way to blur this line.

I design my apps with the following layers to them:

UI
Application-specific logic
Database Access Layer
Information Presentation (SQL API using views)
Information Management and Data Logic (usually stored procedures)
Information Storage (highly normalized, usually).

The comment was made as part of a discussion on stored procedures. I was discussing at the time an open source CRM solution I was working on, which never gained much traction, in part because it was one of those things I did primarily to learn PHP, and so I made some important errors in architecture on the PHP side, and I made a bunch of other errors in managing the project.  I learned a lot from the failure, both in what not to do and, just as importantly, in what was done right.

In reading the post I am struck by two things: first, that my understanding of the problems has advanced less than my understanding of the solutions, and second, that my approaches to the solutions at the time and today aim squarely at the same problems.

I have said before many times that one of the primary causes of grief with stored procedures is the fact that they very often create very brittle API's which require way too many changes in code to work right.  The choice to go stored procedure-based, then, is one which has a number of possible hazards (and different db's have different ways of addressing this), and rolling out applications against a new revision of stored procedures can be a problem which becomes more difficult to coordinate over time (Oracle's "Edition-Based Redefinition" system helps with this a great deal).

The way I tried to solve this back a decade ago was to wrap functions in views, so that an application would not need to know the semantics of the call, and in fact everything could be done in dynamic SQL.  This has the advantage of entirely hiding the call structure of the stored procedure, but it has the disadvantage of making it somewhat difficult to come up with a view structure sufficiently robust and flexible as to be generally useful.  This also was intended to ensure that the application could just send ANSI SQL and get answers back.
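
A minimal sketch of that view-wrapping approach, with hypothetical table and function names:

-- the function holds the logic...
CREATE FUNCTION customer_summary()
RETURNS TABLE (id int, name text, balance numeric)
LANGUAGE sql AS
$$ SELECT c.id, c.name, c.balance FROM customer c $$;

-- ...and the view hides the call structure
CREATE VIEW customer_summary_v AS SELECT * FROM customer_summary();

-- the application issues plain ANSI SQL with no knowledge of the function
SELECT * FROM customer_summary_v WHERE balance > 0;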

As I look at this solution, I recognize that today there are no perfect solutions, but that this was a failed though noble effort to deal with a real problem (and one that worked surprisingly well for the app I was building).

The approach we take in LedgerSMB is both similar and different from the approach we took in HERMES.  The basic layers are still there, but the tools in each layer are different:

Layer                                 | HERMES                          | LedgerSMB
UI                                    | PHP                             | Template Toolkit
Application-specific logic            | PHP                             | Perl
Database Access Layer                 | Hand written PHP module         | DBI
Information Presentation              | Views                           | Moved up a level, a service locator written in Perl
Information Management and Data Logic | Stored Procedures, pretty basic | Stored procs plus named args
Information Storage                   | Normalized tables               | Normalized tables

The biggest differences between then and now are worth considering.  Instead of insisting that the public API is relational, the stored procedures form a discoverable API, and this allows one to utilize much more sophisticated stored procedures.  I couldn't imagine doing something like a trial balance or income statement in a view-based API.  Having access to function arguments makes a huge difference there.

To accomplish this, the information presentation layer was moved above data access and turned into a service locator rather than a simple wrapper.  More needs to be done on this wrapper, but it is a significant improvement over my previous attempts to approach these problems.  The approach is still a work in progress, but the automation is currently simple enough to add a level of robustness not present before, without frequently introducing opacity-related bugs.

I am much happier with the current approach I am using, but it is interesting to note how little my approach has changed over the last decade or so.

Friday, February 15, 2013

Building SOLID Databases: Dependency Inversion and Robust DB Interfaces

Dependency inversion is the idea that interfaces should depend on abstractions not on specifics.  According to Wikipedia, the principle states:

A. High-level modules should not depend on low-level modules. Both should depend on abstractions.
B. Abstractions should not depend upon details. Details should depend upon abstractions.

Of course the second part of this principle is impossible if read literally.  You can't have an abstraction until you know what details are to be covered, and so the abstraction and details are both co-dependent.  If the covered details change sufficiently the abstraction will become either leaky or inadequate and so it is worth seeing these as intertwined to some extent.

The focus on abstraction is helpful because it suggests that the interface contract should be designed in such a way that neither side really has to understand any internal details of the other in order to make things work.  Both sides depend on well-encapsulated API's and neither side has to worry about what the other side is really doing.  This is what is meant by details depending on abstractions rather than the other way around.

This concept is quite applicable beyond the object oriented programming paradigm in which it was first formulated, because it covers a very basic aspect of API contract design, namely how well an API should encapsulate behavior.

SQL as an Abstraction Layer, or Why RDBMS are Still King


There are plenty of reasons to dislike SQL, such as the fact that nulls are semantically ambiguous.  As a basic disclaimer I am not holding SQL up to be a paragon of programming languages or even db interfaces, but I think it is important to discuss what SQL does right in this regard.

SQL is generally understood to be a declarative language which approximates relational mathematics for database access purposes.  With SQL, you specify what you want returned, not how to get it, and the planner determines the best way to get it.  SQL is thus an interface language rather than a programming language per se.  With SQL, you can worry about the logical structure, leaving the implementation details to the db engine.

SQL queries are basically very high level specifications of operations, not detailed descriptions of how to do something efficiently.  Even update and insert statements (which are by nature more imperative than select statements) leave the underlying implementation entirely to the database management system.

I think that this, along with the many concessions the language has made to real-world requirements (such as bags instead of sets and the addition of ordering to bags), largely accounts for the success of this language.  SQL, in essence, encapsulates a database behind a mature mathematical, declarative model in the same way that JSON and REST do (in a much less comprehensive way) in many NoSQL db's.  In essence SQL provides encapsulation, interface, and abstraction in a very full-featured way, and this is why it has been so successful.


SQL Abstraction as Imperfect


One obvious problem with treating SQL as an abstraction layer in its own right is that one is frequently unable to write details in a way that is clearly separate from the interface.  Often storage tables are hit directly, and therefore there is little separation between logical detail and logical interface, and so this can break down when database complexity reaches a certain size.  Approaches to managing this problem include using stored procedures or user defined functions, and using views to encapsulate storage tables.
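
As a hedged sketch of the latter approach, with hypothetical names, a view can stand between applications and the storage table so the physical layout can change without breaking the query interface:

CREATE TABLE customer_storage (
    id           serial primary key,
    name         text not null,
    credit_limit numeric not null default 0
);

-- applications query the view, not the storage table
CREATE VIEW customer AS
SELECT id, name, credit_limit FROM customer_storage;

If customer_storage is later split or renamed, only the view definition changes.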

Stored Procedures and User Defined Functions Done Wrong


Of the above methods, stored procedures and functional interfaces frequently have bad reputations because of bad experiences that many people have had with them.  These include developers pushing too much logic into stored procedures, and the fact that defining functional interfaces in this way usually produces a very tight binding between database code and application code, often leading to maintainability problems.

The first case is quite obvious, and includes the all-too-frequent case of trying to send emails directly from stored procedures (always a bad idea).  This mistake leads to certain types of problems, including the fact that ACID-compliant operations may be mixed with non-ACID-compliant ones, leading to cases where a transaction can only be partially rolled back.  Oops, we didn't actually record the order as shipped, but we told the customer it was.....  MySQL users will also note this is an argument against mixing transactional and nontransactional backend table types in the same db.....  However that problem is outside the scope of this post.  Additionally, MySQL is not well suited for many applications against a single set of db relations.

The second problem, though, is more insidious.  In the traditional way stored procedures and user defined functions are used, the application has to be deeply aware of the interface to the database, but the rollout for these aspects is different, leading to the possibility of service interruptions and a need to very carefully and closely time the rollout of db changes with application changes.  As more applications use the database, this becomes harder and the chance of something being overlooked becomes greater.

For this reason the idea that all operations must go through a set of stored procedures is a decision fraught with hazard as the database and application environment evolves.  Typically it is easier to manage backwards-compatibility in schemas than it is in functions and so a key question is how many opportunities you have to create new bugs when a new column is added.

There are, of course, more hazards which I have dealt with before, but the point is that stored procedures are potentially harmful and a major part of the reason is that they usually form a fairly brittle contract with the application layer.  In a traditional stored procedure, adding a column to be stored will require changing the number of variables in the stored procedure's argument list, the queries to access it, and each application's call to that stored procedure.  In this way, they provide (in the absence of other help) at best a leaky abstraction layer around the database details.  This is the sort of problem that dependency inversion helps to avoid.

Stored Procedures and User Defined Functions Done Right


Not all stored procedures are done wrong.  In the LedgerSMB project we have at least partially solved the abstraction/brittleness issue by looking to web services for inspiration.  Our approach provides an additional mapping layer and dynamic query generation around a stored procedure interface.  By using a service locator pattern, and overloading the system tables in PostgreSQL as the service registry, we solve the problem of brittleness.
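
The service registry here is just the PostgreSQL system catalogs.  A hedged sketch of the discovery step, using the hypothetical customer_save function from the previous section: the caller looks up the argument names and then maps object properties to the in_* arguments when it builds the call.

SELECT p.proname, p.pronargs, p.proargnames
  FROM pg_proc p
  JOIN pg_namespace n ON n.oid = p.pronamespace
 WHERE p.proname = 'customer_save'
   AND n.nspname = 'public';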

Our approach is of course not perfect, and it is not the only possibility.  One shortcoming is that invocation through the service locator is relatively spartan; we intend to allow more options there in the future.  One thing I have noticed, however, is that there are far fewer places for bugs to hide, which makes development faster and more robust.

Additionally, a focus on clarity of code in stored procedures has eliminated a number of important performance bottlenecks, and it limits the number of places to which a given change propagates.

Other Important Options in PostgreSQL


Stored procedures are not the only abstraction mechanism PostgreSQL offers.  In addition to views, there are other interesting ways of using functions for abstraction without insisting that all access go through stored procedures, and these methods can be freely mixed to produce very powerful, intelligent database systems.

Such options include custom types written in C, along with custom operators, functions and the like.  These are stored in columns, and SQL provides the abstraction layer around them: SQL becomes the abstraction and the C programs become the details.  A future post will cover the use of ip4r for network management in PostgreSQL databases as an example of what can be done here.
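As a small illustration of the pattern, using PostgreSQL's built-in inet and cidr network types (themselves implemented in C) as a stand-in for ip4r, with hypothetical tables:

-- The C-level types and operators are the details; the view is the SQL
-- abstraction over them.
CREATE TABLE subnet (
    id serial primary key,
    block cidr not null
);

CREATE TABLE host_address (
    id serial primary key,
    addr inet not null
);

CREATE VIEW host_subnet AS
SELECT h.id AS host_id, s.id AS subnet_id, h.addr, s.block
  FROM host_address h
  JOIN subnet s ON h.addr <<= s.block;  -- containment test implemented in C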

Additionally, triggers and notifications can be used to ensure that appropriate changes trigger other changes in the same transaction or, upon commit, hand off control to other programs in subsequent transactions (allowing independent processing and error handling for things like sending email).
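A sketch of the notification half of that, assuming a hypothetical email_queue table; a separate worker process LISTENs on the channel and sends the mail in its own transaction:

CREATE TABLE email_queue (
    id bigserial primary key,
    recipient text not null,
    subject text not null,
    created_at timestamptz not null default now()
);

CREATE OR REPLACE FUNCTION email_queue_notify() RETURNS trigger AS $$
BEGIN
    -- The notification is only delivered if the transaction commits,
    -- so the external mailer never sees rolled-back work.
    PERFORM pg_notify('email_queue', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER email_queue_notify
    AFTER INSERT ON email_queue
    FOR EACH ROW EXECUTE PROCEDURE email_queue_notify();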

Recommendations


Rather than specific recommendations, the overall point is to look at the database itself as an application running in an application server (the RDBMS) and to design it as an application with an appropriate API.  There are many ways to do this, from writing components in C and using SQL as the abstraction mechanism, to writing the logic in SQL and exposing it through stored procedures.  One could even write code in SQL and still use SQL as the abstraction mechanism.

The key point, however, is to be aware of the need for discoverable abstraction, a need which things like ORMs and stored procedures have to date often filled very imperfectly.  A well-designed db with appropriate abstraction at its interfaces can be seen as an application in its own right, engineered as such, and capable of serving multiple client apps through a robust and discoverable API.

As with all things, it starts with recognizing the problems and making their solutions priorities from the design stage onward.

Tuesday, February 12, 2013

Building SOLID Databases: Interface Segregation, or Keep Stored Procedures Simple

As far as interfaces go, we are talking specifically about functional interfaces (UDFs, or "User Defined Functions", and stored procedures); raw interfaces like SQL queries have very different profiles of issues.  This is one principle that more or less takes care of itself in most databases thanks to existing concerns about normalization and the like.  Subtle violations are quite possible, however; they can be avoided by careful design, and the signs are relatively easy to spot.  Problems can occur in both purely relational and object-relational designs.

Interface Segregation in Application Programming


In application programming, interface segregation is used to ensure that dissimilar interfaces are not too tightly coupled, which would pose maintenance problems.  The canonical example of a violation is the Xerox print system job class, which did everything from printing to collating and stapling paper.  The problem is simply that the methods become sufficiently entangled that small changes require significant cascading interventions elsewhere in the system.

Interface segregation is a principle which suggests that the basic abstractions should be internally cohesive but externally less so, such that cascading changes are minimized.  It therefore goes hand in hand with the single responsibility principle but focuses on the interface level rather than the class level.

Interface Segregation in the Database


We'd like to think that interface segregation is a solution to a problem that only affects applications.  We really would.  However, many of the bad experiences people have had with stored procedures are in fact violations of this rule.

In general, my experience has been that stored procedure interfaces (or UDFs returning sets) work well when they are a single large query with some minor supporting logic, and become much harder to work with when they are many smaller queries chained together.  When a stored procedure or UDF tries to do too much, that's where you get problems, and counting individual statements gives a better estimate of reducible complexity than counting screen lines.

When a stored procedure or UDF tries to do too much, maintainability problems occur because functionality becomes intertwined in a way which exactly mirrors the problems found in the application layer.  This is what gives people bad experiences.

An Example of a Violation


In LedgerSMB 1.3 we introduced a number of stored procedures for payment management.  Each has some difficulties associated with it, so what I am talking about here is not a condemnation of one interface over another; but one of them specifically violates this principle by combining support for prepayments with support for paying existing invoices, even though the input requirements differ and the two cases are handled dissimilarly in the underlying database.

This single payment interface, which can support either an overpayment or a payment of an invoice, is difficult to maintain because its input complexity is much larger than necessary.  The function requires nearly 20 inputs, versus about 12 for each interface if these were split up, and about 20 statements, versus far fewer if split (additional refactoring might then be possible to consolidate statements further).  This is a piece of code I dread fixing problems in.  It works well currently, but there is always a fear that I am breaking something I am not aware of, and test cases only get you so far.  Fortunately we haven't had to change much there, but all our payment interfaces are up for redesign when we get back around to them.
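A purely hypothetical sketch of the kind of split I mean (these are not the real LedgerSMB signatures, and the bodies are stubs): one function per use case, each taking only the arguments that case actually needs.

-- Hypothetical segregated interfaces; argument lists are illustrative only.
CREATE OR REPLACE FUNCTION payment__pay_invoice
(in_invoice_id int, in_amount numeric, in_date date, in_account_id int)
RETURNS int AS $$
BEGIN
    -- stub: apply a payment against an existing invoice
    RAISE NOTICE 'paying % against invoice %', in_amount, in_invoice_id;
    RETURN in_invoice_id;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION payment__prepay
(in_customer_id int, in_amount numeric, in_date date, in_account_id int)
RETURNS int AS $$
BEGIN
    -- stub: record a prepayment with no invoice attached
    RAISE NOTICE 'recording prepayment of % for customer %', in_amount, in_customer_id;
    RETURN in_customer_id;
END;
$$ LANGUAGE plpgsql;

Neither function has to carry the other's arguments, and a change to prepayment handling no longer risks breaking invoice payment.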

Danger Signs


The most obvious danger sign that a stored procedure is trying to do too much is what I call "reducible complexity," and reducible complexity should be reduced.  The ideal stored procedure, in my view, is a single very large query with some minor supporting logic around it.  A single large query can be debugged much more easily than a series of shorter ones because one can take advantage of the highly structured nature of SQL.  Consolidating queries thus increases the complexity locally, in the SQL statement, but reduces it over the scope of the function.

As a blanket rule, my thinking is that if you have more than about 5 queries in a stored procedure it is worth taking a look, and if you have more than 10 or 20, something is definitely worth looking at.

A large number of small queries, or worse, a large number of large queries, is a sign that something is amiss.  The first step is to consolidate queries to the extent possible.  If a large number of queries persists even then, the next question is what can be done to break the logic into separate interfaces which can be used independently.
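As an illustration of that consolidation (the invoice and payment tables here are hypothetical), a record-the-payment statement followed by a separate adjust-the-balance statement can collapse into one data-modifying CTE:

-- Hypothetical schema: invoice(id, balance), payment(invoice_id, amount, paid_on).
CREATE OR REPLACE FUNCTION invoice__pay(in_invoice_id int, in_amount numeric)
RETURNS void AS $$
BEGIN
    -- One statement records the payment and adjusts the balance together.
    WITH pay AS (
        INSERT INTO payment (invoice_id, amount, paid_on)
             VALUES (in_invoice_id, in_amount, current_date)
          RETURNING invoice_id, amount
    )
    UPDATE invoice i
       SET balance = i.balance - p.amount
      FROM pay p
     WHERE i.id = p.invoice_id;
END;
$$ LANGUAGE plpgsql;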

A second warning sign is a stored procedure that requires a very large number of inputs.  "Large" here is best judged relative to the distribution across your other interfaces; it is not an absolute guide.

Recommendations


I am a very big fan of using functional interfaces (stored procedures and user-defined functions) to grant access to data, but such interfaces must be kept as simple as possible.  "Simple" here is defined at least as much by complexity that can be reduced through consolidating logic as by complexity that can be reduced by breaking a function into several components.

Friday, February 8, 2013

Building SOLID Databases: Liskov Substitution Weirdness

If the Open/Closed Principle starts looking a little strange as applied to object-relational design, the Liskov Substitution Principle applies in almost exactly the opposite way in the database as it does in applications.  The reasons only become clear once one explores the principle, sees its limits, and investigates the realities of database development.  The canonical examples of LSP violations in application programs do not violate the principle at all in the database.  As always, the reason is that databases model different things than applications do, and so their constraints are different.

The Formal Definition of the Liskov Substitution Principle


Let φ(x) be a property provable about objects x of type T. Then φ(y) should be provable for objects y of type S, where S is a subtype of T.

The Liskov Substitution Principle in Application Programming


The Liskov Substitution Principle is important because it ensures provability of state and behavior for subtypes.  This means you have to look at the full set of constraints that operate on a class's methods (excluding constructors, I think, which operate under a different set of constraints).

In general, for the LSP to be satisfied, preconditions cannot be strengthened, postconditions cannot be weakened, and invariants must be maintained not only in inherited methods but also in non-inherited ones (this is broad enough to include the history constraint).  The typical problem this addresses is the square-rectangle problem, a well-understood problem in object-oriented programming, and one which calls into question certain promises of OOP as a whole.

The square-rectangle problem establishes a problem of irreducible complexity in object-oriented programming.  The typical examples are either "a square is-a rectangle" or "a circle is-an ellipse."  Mathematically both statements are true, but implementing them in an object-oriented framework adds a great deal of complexity.  The reason is that the mathematical hierarchy has no concept of a change of state which preserves invariants, so the object hierarchy ends up being complicated by the need to shim in what is notably lacking in the geometric hierarchy.  This leads either to an explosion of abstract classes, an explosion of object metadata (can this rectangle be stretched? always "no" for a square; can it be scaled?), or a loss of the sort of functionality we typically want from inheritance.

At the database level these problems mostly don't exist in relational designs, but they do exist, differently, in object-relational designs (more on that below).

In application programming, the Liskov Substitution Principle operates primarily to preserve assumptions about state changes in an object when an interface is called.  If a square (defined as a rectangle whose X and Y measurements are the same) can be stretched, it is no longer a square.  On the other hand, if we can't stretch it, it may not fulfill the contract of the rectangle.

When A Square Is-A Rectangle Subtype


The square-rectangle problem can be turned on its head to some extent by asking when the is-a relationship is valid.  In fact, the is-a relationship is valid for immutable objects: it is perfectly valid to have a class ImmutableSquare be a subclass of ImmutableRectangle.

Secondly, the subtyping can be valid if sufficient information is attached to the class to specify the invariants.  For example, we might have a class with attributes "can_stretch" and "can_scale," where setting can_stretch off (as it always would be for a square) ensures that length-to-width proportions are preserved.  The problem here is that the subclass invariants require support in the superclass, which leads to a lot of complexity in implementing the superclass, as well as fundamental limits on what can then be subclassed.

A Look at Database Limitations


In a normal relational database, the LSP is always met: domain-level constraints, for example, apply to stored data, not to the outputs of calculations.

In database systems where tables instantiate types and substitutability is not anticipated (Oracle, DB2), the LSP applies in the same way it does in application programming.  With table inheritance (Informix, PostgreSQL), things start to get a little weird.

For the purpose of stored data, the LSP is generally satisfied because constraints are inherited and strengthened.  A subtype is simply a sub-domain of the parent type possibly with some additional attributes.

The Square-Rectangle Problem in Object-Relational Databases


The square-rectangle problem is only even a potential problem in object-relational databases; purely relational designs can never run into it.  In object-relational designs, a few very specific issues can occur, depending entirely on how object-relational functionality is implemented at the database level.  In databases like Oracle and DB2, the LSP applies pretty much as is.  In Informix and PostgreSQL, the problems faced are somewhat different and lead to new classes of anomalies that need to be considered, most prominently update anomalies when parent tables are updated.  These do not necessarily have easy workarounds.

Consider the following table structure:

CREATE TABLE my_rectangle (
    id serial primary key,
    height numeric,
    width numeric
);

CREATE TABLE my_square (
    check (height = width)
) INHERITS (my_rectangle);

Now we may insert a bunch of squares and rectangles, and the check constraint will be enforced on each table.  Moreover, we could add custom triggers to verify referential integrity for rows referencing my_rectangle and my_square, since an ordinary foreign key pointing at my_rectangle does not see rows stored in the child table.
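A sketch of what such a custom RI trigger might look like (the drawing_element table and its column are hypothetical); note that a plain SELECT against the parent table also scans the children, so squares pass the check too:

-- Hypothetical referencing table; a real FOREIGN KEY to my_rectangle would
-- not see rows stored in my_square, hence the custom trigger.
CREATE TABLE drawing_element (
    id serial primary key,
    rectangle_id int not null
);

CREATE OR REPLACE FUNCTION check_rectangle_exists() RETURNS trigger AS $$
BEGIN
    -- SELECT on the parent table includes rows from child tables.
    IF NOT EXISTS (SELECT 1 FROM my_rectangle WHERE id = NEW.rectangle_id) THEN
        RAISE EXCEPTION 'rectangle % does not exist', NEW.rectangle_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE CONSTRAINT TRIGGER drawing_element_ri
    AFTER INSERT OR UPDATE ON drawing_element
    FOR EACH ROW EXECUTE PROCEDURE check_rectangle_exists();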

So suppose we use update on my_rectangle to double the height of every rectangle in the system:

UPDATE my_rectangle SET height = height * 2;

Oops, we can't do that.  The update cascades to the child table, where rows in my_square would violate their check constraint, so the whole statement fails with an error (and, being a single statement, nothing is updated at all).  The error can be avoided through a few workarounds, each of which has problems:

  1. Disallow updates; only allow inserts and deletes.  This makes referential integrity harder to manage when a row must be deleted and re-inserted.
  2. Use a trigger to rewrite an update as a delete plus an insert into either the current or the parent table, depending on the constraints.  This also makes RI harder to manage, in that the procedure must first disable any custom RI triggers, which requires it to be aware of all of them.

Conclusions


How the Liskov Substitution Principle applies depends quite a bit on the specifics of how an object-relational database system maps tables onto objects.  In a purely relational design you will never need to worry about it, but in an object-relational design there are some nasty corner cases that can come up, particularly where a query operates on heterogeneous subtypes as a set.  This is perhaps the only one of the SOLID principles which is O-R specific, and it hits the db in different ways than the application because the db operates on different principles.