Friday, August 24, 2012

PostgreSQL OR Modelling Part 2: Intro to Object Relational Classes in PostgreSQL

In the last post we went briefly over the considerations and concerns of object-relational programming in PostgreSQL.  Now we will put this into practice.  While LedgerSMB has not fully adopted this approach I think it is likely to be the long-term direction for the project.

In object relational thinking classes have properties and methods, and sets of objects are retrieve and even manipulated using a relational interface.  While relational systems look at sets of tuples, object-relational systems look at relational manipulation of sets of objects.  A table (and to a lesser extent, a view of a composite type) is hence a class and can have various forms of behavior associated with it the rows, which act as objects.  This means a number of considerations change.  Some differences include:

  • SELECT * is often useful to ensure one receives a proper data type
  • Derived values can be immitated by methods, as can dereferences of keys
  • Common filter conditions can be centralized, as can common queries
In general, understanding this section is important to understanding further sections in this series.

Basic Table Structure

Our current example set will use a very simplified schema for storing inventory.  Consider the (greatly simplified) chart of accounts table:

 CREATE TABLE account (
     id int not null unique,
     control_code text primary key, -- account number
     description text not null
);

Populated by:

INSERT INTO account (id, accno, description)
VALUES (1, '1500', 'Inventory'),
       (2, '4500', 'Sales'),
       (3, '5500', 'Purchase'); 

Of course in a real system the chart of account (and inventory) tables would be more complex.

CREATE TABLE inventory_item (
    id serial primary key,
    cogs_account_id int references account(id),
    inv_account_id int references account(id),
    income_account_id int references account(id),
    sku text not null,
    description text,
    last_cost numeric, -- null if never purchased
    sell_price numeric not null,
    active bool not null default true
);

Now, we'd also want to make sure only one active part can be associated with a sku at any given time, so we'd:

CREATE UNIQUE INDEX inventory_item_sku_idx_u
ON inventory_item (sku) WHERE active IS TRUE;

The create table statement also creates a complex type with the same structure, and thus it defines a data structure.  The idea of relations as data structures in themselves will come up again and again in this series.  It is the fact that tables are data structures which allows us to do interesting things here.

Method 1:  Derived Value called "markup"

The first method we may want to add to this table is a markup method, for calculating our markup based on current sell price and last cost.  Since this value will always be based on two other stored values, there is no sense in storing it (other than possibly in a pre-calculated index).  To do this, we:

CREATE FUNCTION markup(inventory_item) RETURNS numeric AS
$$ SELECT CASE WHEN $1.last_cost = 0 THEN NULL

               ELSE  ($1.sell_price - $1.last_cost)
                     / $1.last_cost

           END;

$$ LANGUAGE SQL IMMUTABLE;

Looking through the syntax here, this is a function named "markup" which receives a single input of the table type.  It then calculates a value and returns it based solely on inputs and returns a value.  The fact that it function always returns the same value when the same input is passed is reflected in the IMMUTABLE designation.  The planner uses this to plan queries, and PostgreSQL will not index function outputs unless they are marked immutable.

Once this is done, we can:

SELECT sku, description, i.markup from inventory_item i;

Note that you must include the table designation in calling the method.  PostgreSQL converts the i.markup into markup(i).

Of course, our table being empty, no values will be returned.  However if you try to omit the i. before markup, you will get an error.

Not only can we include this in the output we can search on the output:

SELECT sku, description from inventory_item i 
 where i.markup  < 1.5;

If we find ourselves doing a lot of such queries and need to index the values we can create an index:

CREATE INDEX inventory_item_markup_idx 
ON inventory_item (markup(inventory_item));

If you use object.method notation, you must add extra parentheses because of create index semantics [see comment added below]:

 CREATE INDEX inventory_item_markup_idx ON inventory_item ((inventory_item.markup));

Instead of adding columns and using triggers to maintain them, we can simply use functions to calculate values on the fly.  Any value you can derive directly from values already stored in the table can thus be calculated on output perhaps even with the values indexed.  This is one of the key optimizations that ORDBMS's allow.

Method 2:  Dereferencing  an Account

The above table is defined in a way that makes it easy to do standard, relational joins.  More specialized Object-Relational-friendly references will be covered in a future posting.   However suppose we want to create a link to the accounts.  We might create a method like:

CREATE OR REPLACE FUNCTION cogs_account(inventory_item) 
RETURNS account
LANGUAGE SQL AS
$$ SELECT * FROM account where id = $1.cogs_account_id $$;

We can now:

or_examples=# select (i.cogs_account).* FROM inventory_item i;

We will get an empty row with the chart of accounts structure.  This gives us the beginnings of a path system for LedgerSMB (a real reference/path system will be discussed in a future posting).

We can even:

select * from inventory_item i 
 where (i.cogs_account).control_code = '5500';

Note that in many of these cases, parentheses are necessary to ensure that the system can tell that we are talking about an object instead of a schema or other system.  This is true generally wherever complex data types are used in PostgreSQL (otherwise the i might be taken to be a schema name).  Unlike Oracle, PostgreSQL does not make assumptions of this sort, and unlike DB2 does not have a separate dereferencing operator.

Warning: Dereferencing objects in this way essentially forces a nested loop join.  It is not recommended when working with large return sets.  For example the previous has a plan (given that I already have one row in this table) as:

or_examples=# explain analyze
or_examples-# select * from inventory_item i where (i.cogs_account).control_code = '5500';
                                                 QUERY PLAN                    
                           
--------------------------------------------------------------------------------
----------------------------
 Seq Scan on inventory_item i  (cost=0.00..1.26 rows=1 width=145) (actual time=0
.174..0.175 rows=1 loops=1)
   Filter: ((cogs_account(i.*)).control_code = '5500'::text)
 Total runtime: 0.199 ms
(3 rows)

Note that the filter is not expanded to a join.  This means that for every row filtered upon, it is executing a query to pull the resulting record from the account table.  Depending on the size of that table and the number of pages containing rows referenced, this may perform badly.  For a few rows, however, it will perform well enough.

Method 3:  Computer-discoverable save method

One application for object-relational modeling is to provide a top-level machine-discoverable API which software programs can use, creating more autonomy and such between the application and the database.  We will create here a machine-discoverable save function which only updates values that may be updated, and returns the values as saved back to the application.  We can consider this sort of loose coupling a "dialog" rather than "remote command" interface because both are assumed to be autonomous and try to provide meaningful responses to the other where appropriate.

Our founction looks like this (not using writeable CTE's for backward compatibility):

CREATE FUNCTION save(inventory_item)
RETURNS inventory_item
LANGUAGE PLPGSQL STABLE AS
$$
  DECLARE out_item inventory_item;

 BEGIN
 -- we don't want to allow accounts to change on existing items
 UPDATE inventory_item
    SET sku = in_item.sku,
        description = in_item.description,
        last_cost = in_item.last_cost,
        sell_price = in_item.sell_price,
        active = in_item.active
  WHERE id = in_item.id;

 IF FOUND THEN
     SELECT * INTO out_item FROM inventory_item
      WHERE id = in_item.id;
     RETURN out_item;
 ELSE
 INSERT INTO inventory_item
                 (cogs_account_id,
                  inv_account_id,
                  income_account_id,
                  sku,
                  description,
                  last_cost,
                  sell_price,
                  active)

          VALUES (in_item.cogs_account_id,
                  in_item.inv_account_id,
                  in_item.income_account_id,
                  in_item.sku,
                  in_item.description,
                  in_item.last_cost,
                  in_item.sell_price,
                  in_item.active);
 SELECT * INTO out_item
      FROM inventory_item
     WHERE id = currval('inventory_item_id_seq');
    RETURN out_item;

  END IF;

 END;
$$;

This function is idempotent and it returns the values as saved to the application so it can determine whether to commit or rollback the transaction.  The structure of the tuple is also discoverable using the system catalogs and so an application can actually look up how to construct a query to save the inventory item.  However it certainly cannot be inlined and it will be slow on large sets.  If you have thousands of inventory parts, doing:

SELECT i.save FROM inventory_item i;

Will be painful and do very little other than generate dead tuples.  Don't do it.

On the other hand a software program (either at code generation or run-time) can look up the structure of the type and generate a call like:

 SELECT (i.save).* 
   FROM (SELECT (row(null, 3, 1, 2, 'TEST123', 'Inventory testing item', 1, 2, true)::inventory_item).save) i;

Obviously this is a sub-optimal interface for humans but it has the advantage of discoverability for a computer.  Note the "::inventory_item" may be unnecessary in some cases, however it is required for all practical purposes on all non-trivial databases because it avoids ambiguity issues.  We really want to make sure that it is an inventory_item we are saving especially as data types may change.  This then allows us to control application entry points to the data.

Note that the application has no knowledge of of what is actually happening under the hood of the save function.  We could be saving it in unrelated relations (and this may be a good way to deal with updateable views in an O-R paradigm where single row updates are the primary use case).

Caveats:  In general I think it is a little dangerous to mix imperative code with declarative SQL in this way.   Object-relational modelling is very different from object-oriented programming because with object-relational modelling we are modelling information, while with object-oriented programming we are encapsulating behavior.  This big difference results in endless confusion.

The most obvious way around this is to treat all SQL queries as questions and treat the transactional boundaries as imperative frames to the declarative conversation.  A human-oriented translation of the following exchange might be:

BEGIN;  -- Hello.  I have some questions for you.

SELECT (i.save).*
  FROM (SELECT (row(null, 3, 1, 2, 'TEST124', 
                'Inventory testing item 2', 1, 2, 
                true)::inventory_item).save) i;

-- Suppose I ask you to save an inventory item with the following info for me.
-- What will be saved?


  id | cogs_account_id | inv_account_id | income_account_id |   sku   |       desc
ription        | last_cost | sell_price | active
----+-----------------+----------------+-------------------+---------+-----------
---------------+-----------+------------+--------
  4 |               3 |              1 |                 2 | TEST124 | Inventory
testing item 2 |         1 |          2 | t


-- The above information will be saved if you ask me to.

COMMIT; -- Do it.

This way of thinking about the overall framework helps prevent a lot of problems down the road.  In particular this helps establish a separation of concerns between the application and the database.  The application is responsible for imperative logic (i.e. what must be done) while the database answers declarative queries and only affirmatively stores data on commit.  Imperative changes to data only occur when the application issues the commit command.

In this regard object behavior (outside of storage which is an odd fit for the model) really doesn't belong in the database.  The database is there to provide answers to questions and update stored information when told to commit changes.  All other behavior should be handled by other applications.  In a future post we will look at some ways to broaden the ways applications can receive data from PostgreSQL.  Object-relational modelling then moves beyond the question of "what information do I have and how can I organize it to get answers" to "what derivative information can be useful and how can I add that to my otherwise properly relational database?"

Alternate Constructor inventory_item(int)

Now in many cases we may not want to have to provide the whole object definition in order to instantiate it.  In fact we may want to be able to ask the database to instantiate it for us.  This is where alternate constructors come in.  Alternate constructors furthermore can be for single objects or for sets of objects.  We will look at the single objects first, and later look at the set-based constructors.

This constructor looks like:

CREATE OR REPLACE FUNCTION inventory_item(int)
RETURNS inventory_item
LANGUAGE SQL
AS $$

SELECT * FROM inventory_item WHERE id = $1

$$;

If I have an item in my db with an id of two (saved in with the previous method call) I can:

 or_examples=# select * from inventory_item(2);
 id | cogs_account_id | inv_account_id | income_account_id |   sku   |      descr
iption       | last_cost | sell_price | active
----+-----------------+----------------+-------------------+---------+-----------
-------------+-----------+------------+--------
  2 |               3 |              1 |                 2 | TEST123 | Inventory
testing item |         1 |          2 | t
(1 row)

I can also chain this together with other methods.  If all I want is the markup for item 2, I can:

or_examples=# select i.markup from inventory_item(2) i;
         markup        
------------------------
 1.00000000000000000000
(1 row)

I can even:

or_examples=# select (inventory_item(2)).markup;
         markup        
------------------------
 1.00000000000000000000
(1 row)

An application can then use something like this to traverse in-application links and retrieve new objects.

Alternate Constructor inventory_item(text)

Similarly we can have a constructor which constructs this from a  text field, looking up by active SKU:

CREATE OR REPLACE FUNCTION inventory_item(text)
RETURNS inventory_item
LANGUAGE sql AS $$

SELECT * FROM inventory_item WHERE sku = $1 AND active is true;

$$;

We can then:

SELECT (inventory_item('TEST123')).markup;

and get the same result as before.

Warning:  Once you start dealing with text and int constructors, you have the possibility of ambiguity on queries in.  For example, SELECT inventory_item('2') will run this on the text constructor instead of the integer constructor, giving you no results.  For this reason it is a very good idea to explicitly cast your inputs to the constructor.

Set Constructor inventory_item(tsquery)

Not only can this be used to create a constructor for a single item.  Whole sets can be constructed this way.  For example we could create a text search constructor (and this allows the default search criteria to change over time in a centrally managed way):

CREATE OR REPLACE FUNCTION inventory_item(tsquery)
RETURNS SETOF inventory_item
LANGUAGE SQL AS $$

SELECT * FROM inventory_item WHERE description @@ $1;

$$;

This allows some relatively powerful searches to be done, without the application having to worry about exactly what is searched on.  For example, we can:

or_examples=# select * from inventory_item(plainto_tsquery('test')); id | cogs_account_id | inv_account_id | income_account_id |   sku   |      descr
iption       | last_cost | sell_price | active
----+-----------------+----------------+-------------------+---------+-----------
-------------+-----------+------------+--------
  2 |               3 |              1 |                 2 | TEST123 | Inventory
testing item |         1 |          2 | t
(1 row)

Here test has been found to match testing because testing is a form of the word test.

Of coruse if we have more than one item in the table, we get more than one result:

 or_examples=# select * from inventory_item(plainto_tsquery('test'));
 id | cogs_account_id | inv_account_id | income_account_id |   sku   |       desc
ription        | last_cost | sell_price | active
----+-----------------+----------------+-------------------+---------+-----------
---------------+-----------+------------+--------
  2 |               3 |              1 |                 2 | TEST123 | Inventory
testing item   |         1 |          2 | t
  4 |               3 |              1 |                 2 | TEST124 | Inventory
testing item 2 |         1 |          2 | t
(2 rows)


These provide examples, I hope, provide some ideas for how one can take the object-relational concepts and apply them towards building more sophisticated, robust, and high-performance databases, as well as better interfaces for object-oriented programs.

Next Week:  Table Inheritance in PostgreSQL

5 comments:

  1. > If we find ourselves doing a lot of such queries and need to index the values we can create an index:
    >
    > CREATE INDEX inventory_item_markup_idx
    > ON inventory_item (markup(inventory_item));
    >
    > Note this statement does not support object.method notation here. You must do method(class) instead.

    It works as

    CREATE INDEX inventory_item_markup_idx ON inventory_item ((inventory_item.markup));

    with the extra parentheses. Normally an expression (as opposed to a plain column reference) in an index definition has to have extra parentheses round it, but there's an exception for function calls which is why your original version works.

    ReplyDelete
  2. Note this sort of approach is what makes table inheritance useful. In the next week we will be exploring how to use table inheritance without destroying relational integrity.

    ReplyDelete
  3. INSERT INTO account (id, accno, description)
    VALUES (1, '1500', 'Inventory'),
    (2, '4500', 'Sales'),
    (3, '5500', 'Purchase');


    Should accno be control_code, since it's control_code in CREATE TABLE account (
    id int not null unique,
    control_code text primary key, -- account number
    description text not null
    );

    ReplyDelete
  4. Hi, Chris

    Where is the definition of in_item in


    CREATE FUNCTION save(inventory_item)
    RETURNS inventory_item
    LANGUAGE PLPGSQL STABLE AS
    $$
    DECLARE out_item inventory_item;

    BEGIN
    -- we don't want to allow accounts to change on existing items
    UPDATE inventory_item
    SET sku = in_item.sku,
    description = in_item.description,
    last_cost = in_item.last_cost,
    sell_price = in_item.sell_price,
    active = in_item.active
    WHERE id = in_item.id;

    IF FOUND THEN
    SELECT * INTO out_item FROM inventory_item
    WHERE id = in_item.id;
    RETURN out_item;
    ELSE
    INSERT INTO inventory_item
    (cogs_account_id,
    inv_account_id,
    income_account_id,
    sku,
    description,
    last_cost,
    sell_price,
    active)

    VALUES (in_item.cogs_account_id,
    in_item.inv_account_id,
    in_item.income_account_id,
    in_item.sku,
    in_item.description,
    in_item.last_cost,
    in_item.sell_price,
    in_item.active);
    SELECT * INTO out_item
    FROM inventory_item
    WHERE id = currval('inventory_item_id_seq');
    RETURN out_item;

    END IF;

    END;
    $$;

    ReplyDelete