Sunday, May 26, 2019

Table Inheritance: What's it Good For?

Table inheritance is one of the most misunderstood -- and powerful -- features of PostgreSQL.  With it, certain kinds of hard problems become easy.  While many folks who have been bitten by table inheritance tend to avoid the feature, this blog post is intended to provide a framework for reasoning about when table inheritance is actually the right tool for the job.

Table inheritance is, to be sure, a power tool and thus something to use only when it brings an overall reduction in complexity to the design.  Moreover the current documentation doesn't provide a lot of guidance regarding what the tool actually helps with and where are the performance costs and because inheritance sits orthogonal to relational design, working this out individually is very difficult.

This blog post covers uses of table inheritance which simplify overall database design and are not addressed by declarative partitioning, because they are used in areas other than table partitioning.

Table Inheritance Explained

PostgreSQL provides the ability for tables to exist in an inheritance directed acyclic graph.  Columns provided by parent tables are merged in name and type into the child table.  Altering a parent table and adding a column thus cascades this operation to all child tables, though if any child table has a column with the same name and different type, the operation will fail.

Inheritance, Tables, and Types

Every table in PostgreSQL has a corresponding campsite type, and any table can be implicitly cast to any parent table.  This is transitive.  Combined with tuple processing functions, this gives you a number of very powerful ways of working with data at various different levels of scale.

Indexes and foreign keys are not inherited.  Check constraints are inherited unless set to NO INHERIT.

Inheritance and Querying

When a table is queried, by default all child tables are also queried and their results appended to the result.  Because of exclusion constraint processing, this takes out an ACCESS SHARE lock on all child tables at planning time.  All rows are cast back to the type of the table target (in other words you get the columns of the table you queried).

Comparison to Java Interfaces

Despite the name, the closest equivalent to table inheritance in other programming languages are Java Interfaces.  Here too you get implicit casts, a subset of fields, and a promise of compatible interfaces. And as a Java class can implement multiple interfaces, multiple inheritance in PostgreSQL is supported.  Java programmers are encouraged to think of inheriting tables in interface rather than inheritance terms.

Use in Database Management Design

When we design a database there are often two overlapping concerns.  The first is in relational algebra operations on the data, and the second is in managing the data.  In a purely relational model this breaks down.

Notes Tables

One of the first really productive uses of table inheritance I had was in the notes tables in LedgerSMB.  There are several hundred table in the database, and we want to attach notes to some subset of these tables.  A naive approach might be a single global notes table with a bunch of foreign keys, or an ambiguous foreign key, or we just have a bunch of completely independent notes tables.  All of these have serious obvious problems however.  Large numbers of sparse foreign keys provide tons of NULL-handling problems, and provide a wide table that is harder to reason about.  Ambiguous foreign keys are a terrible anti pattern which should never be used due to data consistency problems, and large numbers of independent tables provide an opportunity for subtle errors due to knowledge management problems.

A slightly better solution might be to define a notes composite type, and use CREATE TABLE OF TYPE instead.  However typed tables of this sort have completely immutable schemas which makes them harder to manage over time.

We can then define a table structure something like as follows:

create table notes (
    id serial primary key,
    created_at timestamp not null default now(),
    created_by text not null,
    subject text not null,
    body text not null,
    fkey int not null,
    check (false) NO INHERIT
);

This table will never have any rows, but child tables can have rows.  For child tables, creating them is now easy:

create table invoice_notes (
    LIKE notes INCLUDING INDEXES,
    foreign key fkey REFERENCES invoice(id),
) INHERITS (notes);

The LIKE ... INCLUDING ALL indicates that we will copy in defaults, primary keys, and index definitions.  This now provides a forward-looking way of managing all notes tables going forward.  Uniqueness criteria remains enforced on a per-table basis.

If I later want to add a materialization of a column using a function I can do that in a reasonably straight-forward manner, at least compared to alternative approaches.

However, that's not all I can do.  I can then provide a search_terms function on the parent table which can be used to query child tables.

create or replace function search_terms(notes)
returns tsvector language sql immutable as
$$
select to_tsvector($1.subject) || to_tsvector($1.body);
$$;

I could then index, using GIN, the output of this function.  I still have to create the index on all current tables but if I index it now on the notes table, all tables I create with LIKE notes INCLUDING ALL will now have that index too. 

The function itself can be queried in a number of ways:

select * from invoice_notes n 
 where plainto_tsvector('something') @@ search_terms(n);

-- or

select * from invoice_notes n 
 where plainto_tsvector('something') @@ n.search_terms;

Once the function is created, that query works out of the box even though I never created a corresponding function for the invoice_notes table type.  Thus providing a consistent interface to a group of tables is an area where table inheritance can help clear out a lot of complexity very fast. Benefits include a more robust database design, more easily re-used human knowledge in how pieces fit together, and easier management of database schemas.

Note on Use in Set/Superset Modeling

There are a number of cases where the query implications of inheritance are more important.  This area is typically tricky because it often involves multiple inheritance and therefore there are a number of additional concerns that quickly crop up, though these have well-defined solutions discussed below.

Imagine we have an analytics database with numbers of pre-aggregated over possibly overlapping sets.  We want to sum up numbers quickly and easily without complicating the query language.  One option would be to create multiple views over a base table which includes the superset, but if your bulk operations primarily work over discrete subsets, you might get more out of breaking these out into subset tables which inherit the larger sets into which they are members.  This is, in effect, a reverse partitioning scheme where a single physical table shows up in multiple query tables.

In certain cases this can be easier to manage than a single large table with multiple views selecting portions of that view.  Use of this technique requires weighing different kinds of complexity and is best left for other posts.

Managing Schema Changes with Multiple Inheritance

In cases where multiple inheritance is used, adding and removing columns is relatively straight-forward, but altering existing tables can result in cases where an alteration interferes with checks on the other parent.  Renaming columns or changing types of columns is particularly tricky.  In most cases where this happens, a type change will not be done because rewriting tables is prohibitive, but renaming columns becomes the substitute and that is no less of a headache.

The key problem to note here is that the problem is that you have to make sure that both parents are changed at the same time, in the same statement.    So the solution here is to create a super parent table with the subset of columns to be acted on, and then drop it when done.   So here we:

begin;
create table to_modify (id int, new_id bigint);
alter table first_parent inherit to_modify;
alter table second_parent inherit to_modify;
alter table to_modify rename id to old_id;
alter table to_modify rename new_id to id;
commit;

The changes will then cascade down the inheritance graph properly.

Conclusions


Table inheritance is a surprisingly awesome feature in PostgreSQL, but misuse has given it a bad reputation.  There are many cases where it simplifies operation and long-term management of the database in cases where partitioning actually doesn't work that well.  This is a feature I expect to try to improve over time and hope others find it useful too, but to start we need to start using it for what it is good for, not the areas it falls short.

2 comments:

  1. Great post Chris! I've just been experimenting and evaluating inheritance for a new project for the past week, and while I felt it could have been a great fit in some cases, I choose not to use it due to the many recommendations to stay clear of it. I finally felt good about my decision a couple of days ago when I saw the "official" recommendation here: https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_table_inheritance

    Now reading your post, I'm curious again :-) One case where I had inheritance in mind was to track row ownership and dates for app users. For instance, I currently have 56 tables which all contains the same columns (owned_by, created_by, deleted_by, created_at, deleted_at). Now as far as I understand, I'd still have to define foreign keys for owned_by, created_by and deleted_by on each table, so I wasn't sure the gain of inheritance was good enough considering potential downsides I wasn't able to evaluate properly without more knowledge / experience. I think my case looks a lot like your notes example though, and I wonder if you can see any pros/cons in using inheritance for a case like this?

    ReplyDelete
    Replies
    1. So here's the tradeoff in this case.

      yes, you have to define the foreign keys, but if this needs to be indexed, you can define the indexes on the parent table and use like ... including all to get identical indexes created on the child. As long as those columns never change, there is no benefit from using inheritance in addition, but if you need to add an additional audit field for audited tables, adding it to the parent cascades (though you can have locking issues).

      The upside is you get forced consistency. The downside is there are a few cases where you can get some fun with locks. Understanding the locks (as documented above) is helpful to avoiding problems, but it is still a (small)c concern.

      Delete