Sunday, May 26, 2019

Table Inheritance: What's it Good For?

Table inheritance is one of the most misunderstood -- and powerful -- features of PostgreSQL.  With it, certain kinds of hard problems become easy.  While many folks who have been bitten by table inheritance tend to avoid the feature, this blog post is intended to provide a framework for reasoning about when table inheritance is actually the right tool for the job.

Table inheritance is, to be sure, a power tool and thus something to use only when it brings an overall reduction in complexity to the design.  Moreover, the current documentation doesn't provide a lot of guidance regarding what the tool actually helps with or where the performance costs are.  Because inheritance sits orthogonal to relational design, working this out individually is very difficult.

This blog post covers uses of table inheritance which simplify overall database design and are not addressed by declarative partitioning, because they are used in areas other than table partitioning.

Table Inheritance Explained

PostgreSQL provides the ability for tables to exist in an inheritance directed acyclic graph.  Columns provided by parent tables are merged in name and type into the child table.  Altering a parent table and adding a column thus cascades this operation to all child tables, though if any child table has a column with the same name and different type, the operation will fail.
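As a minimal sketch of the column merge and the cascade, assuming two hypothetical tables named contact and customer (these names are for illustration only and do not come from LedgerSMB):

```sql
-- Hypothetical tables, for illustration only.
create table contact (
    id   int,
    name text
);

-- customer gets id and name from contact plus its own column.
create table customer (
    credit_limit numeric
) inherits (contact);

-- Adding a column to the parent cascades to every child.
alter table contact add column email text;

-- customer now has id, name, credit_limit, and email.
select id, name, credit_limit, email from customer;
```

Had customer already defined an email column of a different type, the alter table statement would have failed instead of cascading.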

Inheritance, Tables, and Types

Every table in PostgreSQL has a corresponding composite type, and a row of any table can be implicitly cast to the type of any of its parent tables.  This is transitive.  Combined with tuple processing functions, this gives you a number of very powerful ways of working with data at various levels of scale.

Indexes and foreign keys are not inherited.  Check constraints are inherited unless set to NO INHERIT.
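A small sketch of what is and is not inherited, using hypothetical tables document and memo (illustrative names only):

```sql
-- Hypothetical tables, for illustration only.
create table document (
    id    int primary key,
    title text,
    check (length(title) > 0),     -- inherited by children
    check (id > 0) no inherit      -- applies to document only
);

create table memo () inherits (document);

-- Both accepted: memo has neither the primary key index
-- nor the NO INHERIT check on id.
insert into memo (id, title) values (-1, 'first');
insert into memo (id, title) values (-1, 'second');

-- Rejected: the inherited check on title still applies to memo.
insert into memo (id, title) values (1, '');
```

If memo needs uniqueness on id, it has to declare its own primary key or unique index.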

Inheritance and Querying

When a table is queried, by default all child tables are also queried and their results appended to the result.  Because of constraint exclusion processing, this takes out an ACCESS SHARE lock on all child tables at planning time.  All rows are cast back to the type of the table target (in other words you get the columns of the table you queried).
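To make the default behaviour concrete, here is a short sketch with hypothetical tables event and login_event.  The ONLY keyword and the tableoid system column are standard PostgreSQL; the table and column names are assumptions.

```sql
-- Hypothetical tables, for illustration only.
create table event (id int, happened_at timestamptz);
create table login_event (username text) inherits (event);

insert into event values (1, now());
insert into login_event values (2, now(), 'alice');

-- Scans event and login_event; only the parent's columns come back,
-- and tableoid shows which table each row was read from.
select tableoid::regclass as source_table, id, happened_at from event;

-- Scans event alone, skipping all child tables.
select id, happened_at from only event;
```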

Comparison to Java Interfaces

Despite the name, the closest equivalent to table inheritance in programming languages is the Java interface.  Here too you get implicit casts, a subset of fields, and a promise of compatible interfaces.  And just as a Java class can implement multiple interfaces, PostgreSQL supports multiple inheritance.  Java programmers are encouraged to think of inheriting tables in interface rather than inheritance terms.
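The interface analogy can be sketched as follows.  The tables auditable, taggable, and article and the function tag_count are hypothetical names chosen for illustration.

```sql
-- Hypothetical tables, for illustration only.
create table auditable (created_by text, created_at timestamptz);
create table taggable  (tags text[]);

-- Like a class implementing two interfaces: article has the
-- columns of both parents plus its own.
create table article (
    id    int,
    title text
) inherits (auditable, taggable);

-- A function written against one "interface" accepts rows of any
-- table that inherits from it.
create function tag_count(taggable) returns int
language sql immutable as
$$ select coalesce(cardinality($1.tags), 0) $$;

select title, tag_count(a) from article a;
```

The article row is implicitly cast to taggable when it is passed to tag_count, which is the same mechanism the notes example later in this post relies on.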

Use in Database Management Design

When we design a database there are often two overlapping concerns.  The first is in relational algebra operations on the data, and the second is in managing the data.  In a purely relational model this breaks down.

Notes Tables

One of the first really productive uses of table inheritance I had was in the notes tables in LedgerSMB.  There are several hundred tables in the database, and we want to attach notes to some subset of these tables.  A naive approach might be a single global notes table with a bunch of foreign keys, or an ambiguous foreign key, or a bunch of completely independent notes tables.  All of these have serious, obvious problems, however.  Large numbers of sparse foreign keys create tons of NULL-handling problems and produce a wide table that is harder to reason about.  Ambiguous foreign keys are a terrible anti-pattern which should never be used due to data consistency problems, and large numbers of independent tables provide an opportunity for subtle errors due to knowledge management problems.

A slightly better solution might be to define a notes composite type, and use CREATE TABLE ... OF type_name instead.  However, typed tables of this sort have completely immutable schemas, which makes them harder to manage over time.

With inheritance, we can instead define a table structure something like the following:

create table notes (
    id serial primary key,
    created_at timestamp not null default now(),
    created_by text not null,
    subject text not null,
    body text not null,
    fkey int not null,
    check (false) NO INHERIT
);

This table will never have any rows, but child tables can have rows.  For child tables, creating them is now easy:

create table invoice_notes (
    like notes including defaults including indexes,
    foreign key (fkey) references invoice(id)
) INHERITS (notes);

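A LIKE notes clause with INCLUDING DEFAULTS and INCLUDING INDEXES in the child definition copies in defaults, primary keys, and index definitions.  INCLUDING ALL is best avoided here because it also copies check constraints, and the child must not pick up the parent's check (false).  This provides a forward-looking way of managing all notes tables.  Uniqueness criteria remain enforced on a per-table basis.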

If I later want to add a materialization of a column using a function I can do that in a reasonably straight-forward manner, at least compared to alternative approaches.

However, that's not all I can do.  I can then provide a search_terms function on the parent table which can be used to query child tables.

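create or replace function search_terms(notes)
returns tsvector language sql immutable as
$$
-- one-argument to_tsvector uses default_text_search_config; pin a config if that may change
select to_tsvector($1.subject) || to_tsvector($1.body);
$$;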

I could then index, using GIN, the output of this function.  I still have to create the index on all current tables, but if I index it now on the notes table, all tables I later create with LIKE notes and INCLUDING INDEXES will have that index too.

The function itself can be queried in a number of ways:

select * from invoice_notes n
 where plainto_tsquery('something') @@ search_terms(n);

-- or

select * from invoice_notes n
 where plainto_tsquery('something') @@ n.search_terms;

Once the function is created, that query works out of the box even though I never created a corresponding function for the invoice_notes table type.  Thus providing a consistent interface to a group of tables is an area where table inheritance can help clear out a lot of complexity very fast. Benefits include a more robust database design, more easily re-used human knowledge in how pieces fit together, and easier management of database schemas.
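The same interface also works one level up.  As a sketch, assuming the notes parent table and the search_terms function defined above, a single query against the parent searches every notes table at once (plainto_tsquery is PostgreSQL's function for building a tsquery from plain text):

```sql
-- Searches every table that inherits from notes; tableoid shows
-- which child table each matching row came from.
select tableoid::regclass as source_table, id, subject
  from notes n
 where plainto_tsquery('something') @@ search_terms(n);
```

Only the parent's columns are returned, and each child table is searched using whatever indexes that child has.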

Note on Use in Set/Superset Modeling

There are a number of cases where the query implications of inheritance are more important.  This area is typically tricky because it often involves multiple inheritance and therefore there are a number of additional concerns that quickly crop up, though these have well-defined solutions discussed below.

Imagine we have an analytics database with numbers pre-aggregated over possibly overlapping sets.  We want to sum up numbers quickly and easily without complicating the query language.  One option would be to create multiple views over a base table which includes the superset, but if your bulk operations primarily work over discrete subsets, you might get more out of breaking these out into subset tables which inherit from the larger sets of which they are members.  This is, in effect, a reverse partitioning scheme where a single physical table shows up in multiple query tables.

In certain cases this can be easier to manage than a single large table with multiple views selecting portions of that table.  Use of this technique requires weighing different kinds of complexity and is best left for other posts.
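A rough sketch of the reverse partitioning idea, with entirely hypothetical table names and columns, might look like this:

```sql
-- Hypothetical analytics tables, for illustration only.
create table eu_sales     (metric text not null, total numeric not null);
create table mobile_sales (metric text not null, total numeric not null);

-- Rows for EU mobile sales live in one physical table that belongs
-- to both supersets; the identically named columns are merged.
create table eu_mobile_sales () inherits (eu_sales, mobile_sales);

-- Rows that belong to only one superset get their own subset tables.
create table eu_desktop_sales () inherits (eu_sales);
create table us_mobile_sales  () inherits (mobile_sales);

-- Each query sums its own subset tables; eu_mobile_sales rows are
-- counted in both results.
select sum(total) from eu_sales     where metric = 'revenue';
select sum(total) from mobile_sales where metric = 'revenue';
```

Bulk loads and deletes can then target one subset table without touching the others.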

Managing Schema Changes with Multiple Inheritance

In cases where multiple inheritance is used, adding and removing columns is relatively straight-forward, but altering existing tables can result in cases where an alteration interferes with checks on the other parent.  Renaming columns or changing types of columns is particularly tricky.  In most cases where this happens, a type change will not be done because rewriting tables is prohibitive, but renaming columns becomes the substitute and that is no less of a headache.

The key problem here is that you have to make sure that both parents are changed at the same time, in the same statement.  The solution is to create a super parent table with the subset of columns to be acted on, make the change there, and then drop it when done:

create table to_modify (id int, new_id bigint);
alter table first_parent inherit to_modify;
alter table second_parent inherit to_modify;
alter table to_modify rename id to old_id;
alter table to_modify rename new_id to id;

The changes will then cascade down the inheritance graph properly.
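The cleanup step is not shown above.  A minimal sketch, assuming the same first_parent, second_parent, and to_modify tables, would be:

```sql
-- Detach the temporary super parent once the renames have cascaded.
alter table first_parent no inherit to_modify;
alter table second_parent no inherit to_modify;

-- The parents keep their renamed columns; only the helper goes away.
drop table to_modify;
```

Because DDL is transactional in PostgreSQL, the whole sequence, from creating to_modify to dropping it, can be wrapped in a single begin and commit so that other sessions never see the intermediate state.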


Table inheritance is a surprisingly awesome feature in PostgreSQL, but misuse has given it a bad reputation.  There are many cases where it simplifies operation and long-term management of the database, including cases where partitioning doesn't work that well.  This is a feature I expect to try to improve over time and hope others find it useful too, but to start we need to use it for what it is good for, not for the areas where it falls short.

Friday, September 21, 2018

PostgreSQL at 20TB and Beyond Talk (PGConf Russia 2018)

It came out a while ago but I haven't promoted it much yet.

This is the recorded version of the PostgreSQL at 20TB and Beyond talk.  It covers a large, 500TB analytics pipeline and how we manage data.

For those wondering how well PostgreSQL actually scales, this talk is worth watching.

Thoughts on the Code of Conduct Controversy

My overall perspective here is that the PostgreSQL community needs a code of conduct, and one which allows the committee to act in some cases for off-infrastructure activity, but that the current code of conduct has some problems which could have been fixed if more effort had been made to ensure that feedback was gathered while it was still actionable.

This piece discusses what I feel was done poorly but also what was done well and why, despite a few significant missteps, I think PostgreSQL as a project is headed in the right direction in this area.

A second important point here is to defend the importance of a code of conduct to dissenters: to explain why we need one, why its scope needs to extend as far as it does, and why we should not be overly worried about this going in a very bad direction.  I take this direction in part because I found myself defending the need for a code of conduct to folks I collaborate with in Europe, and the context had less to do with PostgreSQL than with the Linux kernel.  But the projects in this regard are far more different than they are similar.

Major Complaint:  Feedback Could Have Been Handled Better (Maybe Next Time)

In early May there was discussion about the formation of a code of conduct committee, in which I argued (successfully) that it was extremely important that the committee be geographically and culturally diverse so as to avoid one country's politics being unintentionally internationalized through a code of conduct.  This was accepted and as I will go into below this is the single most important protection we have against misuse of the code of conduct to push political agendas on the community.  However after this discussion there was no further solicitation for feedback until mid-September.

In mid-September, the Code of Conduct plan was submitted to the list.  In the new code of conduct was a surprising amendment which had been made the previous month, expanding the code of conduct to all interactions between community members unless another code of conduct applied and superseded the PostgreSQL community code of conduct.  I objected to this, as did several others, with actionable criticism and alternatives.  Unfortunately we were joined by a large number of people wanting to relitigate whether we needed a code of conduct in the first place.  Those of us with actionable feedback were told that no changes would be made for about a year.  In essence, what looked like a public comment period was not, and the more actionable feedback was, the more clearly it was ignored.

Had there been an actual comment period on the proposed language, I maintain that things would have been more tame.  Ignoring even actionable feedback, in my view, threw fuel on the fire for those who wanted to re-litigate the whole concept, because it reinforced the view that a plan was announced and then every concern ignored.  This was unfortunate.  If there had been a comment period, a deliberation, and a final draft, things would have gone better.

I hope that next time such a process is followed, where feedback on proposed final wording is taken before the decision is made to refuse to make changes for a year.

Why We Have Codes of Conduct

Humans are social animals.  Groups of humans form social groups, which often have group infrastructure which needs to be managed.  Open source thus has all of the political considerations of a multi-national collaborative community and this includes management of common infrastructure, and how we treat each other.  The kinds of social relationships and interactions that we have in the community are shaped by our culture, gender, and outlook on life, and in an international project there can be a lot of problems.  When national political issues are kept out of the project and the project consists mostly of people who are willing to defend themselves possibly aggressively, a project can get along ok without a code of conduct, but as things change, it is important that there be a means of resolving conflicts within the community.  Hence one needs a dedicated committee and a document which reminds people to act in ways that keep the peace.

Codes of conduct thus have a role in ensuring that people can come together and work in a collegial and civil manner across cultural, political and other disagreements, and continue to build the great software that we all rely on.  In this regard I think the PostgreSQL community has hit the most important milestones and begun to build a code of conduct infrastructure which can last and ensure that the code of conduct does not turn into a way of one group of people forcing a political agenda on the world.

I have been to many conferences.  Often at some point discussions turn to politics in some way.  With the exception of one conversation, these discussions have been thoughtful, receptive, and mutually entertaining, but in that one exception I saw a certain degree of aggressiveness that might, for others, even rise to the level of physical intimidation.  A reminder that we all need to genuinely be nice to each other is a step in the right direction.

Codes of conduct cannot create fairness.  They cannot create social justice.  They cannot broaden meritocracy to community contribution beyond code.  Those things have to be done through other means, but they can remind everyone to treat each other collegially and to respect differences of opinion and so forth.

However, codes of conduct cannot allow mere formalities to defeat this purpose.  A campaign of harassment that is taken off-list is at least as much of a problem as discussions on-list.  Therefore community-related conversations might sometimes have to fall under the jurisdiction of community conflict adjudication mechanisms such as the Code of Conduct.

What PostgreSQL is Doing Right

The danger in any code of conduct is that an internal controversy from one country or culture will be read into disputes in a way which ensures that other cultural groups do not feel comfortable participating.  GLBT issues are an area where this commonly comes up, and in a project where you have a lot of involvement from countries where the views are very different from those of the US, this leads to big problems.  On one hand, some people may see others' cultural views as invalidating their sexual identity, while others would see views pushing universalism in GLBT rights as invalidating their cultural identity.  These issues cannot be resolved without retreating to a single cultural context as the norm, discouraging participation from much of the world, and thus need to be outside what a code of conduct handles.  In this case it does not matter what one believes to be the right approach, but rather the fact that the consequences of siding with either side in such a controversy would be devastating for the community.

One of the key points of the current Code of Conduct is that the committee is itself geographically and culturally diverse.  The intra-committee cultural divisions help ensure that the committee cannot simply bulldoze through a political orthodoxy out of fear of how a domestic controversy might be perceived.  The cultural diversity thus is an immense protection and it effectively ensures that there is a right to engage in the free struggle of political opinion in one's own country.

From a responsibility to civic engagement comes a right to such a free struggle of political opinion, and in my view this is something which is effectively preserved in the community today.  Note that this would not apply to trying to position the PostgreSQL project as against any political, cultural, or other group.  Nor should it protect actual personally directed harassment against any member for any reason.  I believe that the committee is capable of drawing these lines and hence I see the PostgreSQL project as off to a shaky but viable start.

Unlucky Timing

The Code of Conduct controversy accidentally coincided with the Linux Foundation adopting the Contributor Covenant as its code of conduct.  The Contributor Covenant is a code of conduct which transparently attempts to push certain norms of certain parts of the US political spectrum on the rest of the world (see, for example, Opalgate).   While I believe this to be a mistake, time will tell how this is handled.  The Contributor Covenant was soundly and decisively rejected by the PostgreSQL community early on as too transparently political.

A lot of the emotional reactions in this controversy by dissenters may well be in relation to that.  This is one of those things one cannot plan for and it makes it harder to have real discussions today.

Calls to Action and Conclusions

I have submitted a couple of requests for wording changes to the code of conduct committee.  For others who see a need for a committee to help ensure a collegial and productive community, and who see opportunities for improvement, I suggest you do the same.  But simply arguing about whether we need a resolution process is not productive and that probably needs to stop.

I also think the community needs to insist on two modifications to the current process:

  1. There needs to be a comment period and deliberation over feedback between a draft of a new revision and its adoption.
  2. The code of conduct committee needs to reply with reasons why particular suggestions were rejected.

On the other hand, I think PostgreSQL as a project is off to a viable start in what is likely to become the right direction.  And that is something we should all be thankful for.

Wednesday, August 9, 2017

On Contempt Culture, a Reply to Aurynn Shaw

I saw an interesting presentation on contempt culture by Aurynn Shaw, delivered this year at PyCon and shared as a recording on LinkedIn.  I had worked with Aurynn on projects back when she worked for Command Prompt.  You can watch the video below:

Unfortunately comments on a social media network are not sufficient for discussing nuance so I decided to put this blog post together.  In my view she is very right about a lot of things but there are some major areas where I disagree and therefore wanted to put together a full blog post explaining what I see as an alternative to what she rightly condemns.

To start out, I think she is very much right that there often exists a sort of tribalism in tech with people condemning each other's tools, whether it be Perl vs PHP (her example) or vi vs emacs, and I think that can be harmful.  The comments here are aimed at fostering the sort of inclusive and nuanced conversation that is needed.

The Basic Problem

Every programming culture has norms, and many times groups from outside those norms tend to be condemned in some way or another.  There are a number of reasons for this.  One is competition and another is seeking approval in one's in-group.  I think one could take her points further and argue that in part it is an effort to improve the standing of one's group relative to others around it.

Probably the best example we can come up with in the PostgreSQL world is the way MySQL is looked at.  A typical attitude is that everyone should be using PostgreSQL and therefore people choosing MySQL are optimising for the wrong things.

But where I would start to break with Aurynn's analysis would be when we contrast how we look at MySQL with how we look at Oracle.  Oracle, too, has some major oversights (empty string being null if it is a varchar, no transactional DDL, etc).  Almost all of us may dislike the software and the company.  But people who work with Oracle still have prestige.  So bashing tools isn't quite the same thing as bashing the people who use them.  Part of it, no doubt, is that Oracle is more established, is an older player in the market, and therefore there is a natural degree of prestige that comes from working with the product.  But the question I have is what can we learn from that?

Some time ago, I wrote the most popular blog post in the history of this blog.  It was a look at the differences in design between MySQL and PostgreSQL and was syndicated on DZone, featured in Hacker News, and otherwise got fairly wide attention.  In general, aside from a couple of historical errors, the PostgreSQL-using audience loved the piece.  What surprised me though was that the MySQL users also loved the piece.  In fact one comment that appeared (I think on Reddit) said that I had expressed why MySQL was better.

The positive outpouring from MySQL users, I think, came from the fact that I sympathetically looked at what MySQL was designed to do and what market it was designed for (applications that effectively own the database), describing how some things I considered misfeatures actually could be useful in that environment, but also being brutally honest about the tradeoffs.

Applying This to Programming Language Debates

Before I start discussing this topic, it is worth a quick tour of my experience as a software developer.

The first language I ever worked with was BASIC on a C64.  I then dabbled in Logo and some other languages, but the first language I taught myself professionally was PHP.  From there I taught myself some very basic Perl, Python, and C.  For a few years I worked with PHP and bash scripting, only to fall into doing Perl development by accident.  I also became mildly proficient in Javascript.

My PostgreSQL experience grew out of my Perl experience.  And about 3 years ago I was asked to start teaching Python courses.  I rose to this challenge.  Around the same time, I had a small project where we used Java and quickly found myself teaching Java and now I feel like I am moderately capable in that language.   I am now teaching myself Haskell (something I think I could not have done before really mastering Python). So I have worked with a lot of languages.  I can pick up new languages with ease.  Part of it is because I generally seek to understand a language as a product of its own history and the need it was intended to address.

As we all know different programming languages are associated with stereotypes.  Moreover, I would argue that stereotypes are usually imperfect understandings that out-group people have of in-group dynamics, so dismissing stereotypes is often as bad as simply accepting them.

PHP as a case study, compared to C.

I would like to start with an example of PHP, since this is the one specifically addressed in the talk and it is a language I have some significant experience writing software in.

PHP often is seen to be insecure because it is easy to write insecure software in the language.  Of course it is easy to write insecure software in any language, but certain vulnerabilities are a particular problem in PHP due to lexical structure and (sometimes) standard library issues.

Lexically, the big issue with PHP is the fact that the language is designed to be a preprocessor to SGML files (and in fact the name stands for PHP: Hypertext Preprocessor).  For this reason, PHP is easy to embed in SGML PI tags (so you can write a PHP template as a piece of valid HTML).  This is a great feature but it makes cross site scripting particularly easy to overlook.  A lot of the standard library in the 1990's had really odd behaviour, though much of this has been corrected.

Aurynn is right to point to the fact that these were exacerbated by a flood of new programmers during the rise of PHP, but one thing she does not discuss in the talk is how software and internet security were also changing during the time.  In essence, the late 1990's saw the rise of SSH (20k users in 1995 to over 2M in 2000), the end of transmission of passwords across the internet in plain text, the rise of concern about SQL injection and XSS, and so forth.  PHP's basic features were in place just before this really got going.  Add to this a new developer community and you have a recipe for security problems.  Of course, PHP has outgrown a lot of this and PHP developers today have codified best practices to deal with a lot of the current threats.

If we contrast this with C as a programming language, C has even more glaring lexical issues regarding security, from double free bug possibilities to buffer overruns.  C, however, is a very unforgiving language and consequently, it doesn't tend to be a language that has a large, novice developer community.  At the same time, a whole lot of security issues come out of software in C.


There is no such thing as a perfect tool (database, programming language, etc).  As we grow as professionals, part of that process is learning to better use the strengths of the technologies we work with and part of it is learning to overcome the oversights and problems of the tools as well.

Further, the fact that a programmer primarily uses a tool with real oversights does not mean that the programmer has poor judgment.  Rather, this process of learning can have the opposite impact.  C programmers tend to be very knowledgeable because they have to be.  The same is true for Javascript programmers for very different reasons.  And one doesn't have to validate all language design decisions in order to respect others.

Instead of attacking developers of other languages, my recommendation is, when you see a problem, to neutrally and respectfully point it out, not from a position of superiority but a position of respectful assistance and also to understand that often what may seem like poor decisions in the design of a language may in fact have real benefits in some cases.

For example, Java as a language encourages mediocrity of code. It is very easy to become a mediocre Java developer.  But once you understand Java as a language, this becomes a feature because it means that the barrier to understanding and debugging (and hence maintaining!) code is reduced, and once you understand that you can put emphasis instead on design and tooling.    This, of course, also has costs since it is easy for legacy patterns to emerge in the tooling (JavaBeans for example) but it allows some really amazing frameworks, such as Spring.

On the other extreme, Javascript is a language characterised by shortcuts taken during the initial design stage (for time constraint reasons) and some of those cause real problems, but others make hard things possible.  Javascript also makes it very easy to be a bad Javascript programmer.  But perhaps for this reason I have found that professional Javascript programmers tend to be extremely knowledgeable, and have had to work very hard to master software development in the language, and they usually bring to the table great insights into computing problems generally.

So what I would recommend that people take away is the idea that in fact we do grow out of hardship, and that problems in tools are overcome over time.  So for that reason discussing real shortcomings of tools while at the same time respecting communities and their ability to grow and overcome problems is important.

Monday, February 13, 2017

PostgreSQL at 10TB and Beyond Recorded Talk

The PostgreSQL at 10 TB And Beyond talk has now been released on YouTube.  Feel free to watch.  For the folks seeing this on Planet Perl Iron Man, there is a short function written in Perl which extends SQL and runs in PostgreSQL, shown in the final 10 minutes or so of the lecture.

This lecture discusses human and technical approaches to solving volume, velocity, and variety problems on PostgreSQL in the 10TB range on a single, non-sharded large server.

As a side but related note, I am teaching a course in Sweden through Edument, called Advanced PostgreSQL for Programmers, which covers many of the technical aspects discussed here.  You can book the course for the end of this month.  It will be held in Malmo, Sweden.

Thursday, January 26, 2017

PL/Perl and Large PostgreSQL Databases

One of the topics discussed in the large database talk is the way we used PL/Perl to solve some data variety problems in terms of extracting data from structured text documents.

It is certainly possible to use other languages to do the same, but PL/Perl has an edge in a number of important ways.  PL/Perl is light-weight, flexible and fills this particular need better than any other language I have worked with.

While one of the considerations has often been knowledge of Perl in the team, PL/Perl has a number of specific reasons to recommend it:

  1. It is light-weight compared to PL/Java and many other languages.
  2. It excels at processing text in general ways.
  3. It has extremely mature regular expression support.

These features combine to create a procedural language for PostgreSQL which is particularly good at extracting data from structured text documents in the scientific space.  Structured text files are very common and being able to extract, for example, a publication date or other information from the file is very helpful.

Moreover when you mark your functions as immutable, you can index the output, and this is helpful when you want ordered records starting at a certain point.

So for example, suppose we want to be able to query on plasmid lines in UNIPROT documents but we have not set this up before we loaded the table.  We could easily create a PL/Perl function like:

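CREATE OR REPLACE FUNCTION plasmid_lines(uniprot text)
RETURNS text[]
LANGUAGE plperl IMMUTABLE AS
$$
use strict;
use warnings;
my ($uniprot) = @_;
# Keep only the OG lines that describe a plasmid.
my @lines = grep { /^OG\s+Plasmid/ } split /\n/, $uniprot;
# Strip the "OG   Plasmid" prefix, leaving the rest of the line.
return [ map {  my $l = $_; $l =~ s/^OG\s+Plasmid\s*//; $l } @lines ];
$$;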

You could then create a GIN index on the array elements:

CREATE INDEX uniprot_doc_plasmids ON uniprot_docs USING gin (plasmid_lines(doc));
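A query that can use this index might look like the sketch below.  The table uniprot_docs and column doc come from the index definition above; the value 'pBR322' is only an example and has to match the stored element text exactly, including any trailing punctuation left after the prefix is stripped.

```sql
-- Find documents whose plasmid lines include the given entry.
-- The expression must match the indexed expression, plasmid_lines(doc),
-- for the planner to consider the GIN index.
select *
  from uniprot_docs
 where plasmid_lines(doc) @> array['pBR322'];
```

The @> (contains) operator is one of the array operators that the default GIN operator class for arrays supports.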


Tuesday, January 24, 2017

PostgreSQL at 10 TB and Above

I have been invited to give a talk on PostgreSQL at 10TB and above in Malmo, Sweden.  The seminar is free to attend.  I expect to be talking for about 45 minutes with some time for questions and answers.  I also have been invited to give the talk at PG Conf Russia in March.  I do not know whether either will be recorded.  But for those in the Copenhagen/Malmo area, you can register for the seminar at the Eventbrite page.

I thought it would be helpful to talk about what problems will be discussed in the talk.

We won't be talking about the ordinary issues that come with scaling up hardware, or the issues of backup or recovery, or of upgrades. Those could be talks of their own.  But we will be talking about some deep, specific challenges we faced and along the way talking about some of the controversies in database theory that often come up in these areas, and we will talk about solutions.

Two of these challenges concern a subsystem in the database which handled large amounts of data in high-throughput tables (lots of inserts and lots of deletes).   The other two address volume of data.

  1. Performance problems in work queue tables regarding large numbers of deletions off the head of indexes with different workers deleting off different indexes.  This is an atypical case where table partitioning could be used to solve a number of underlying problems with autovacuum performance and query planning.
  2. Race conditions in stored procedures between mvcc snapshots and advisory locks in the work queue tables.  We will talk about how this race condition happens and how we solved it without using row locks: by rechecking results in a new snapshot, which we decided was the cheapest solution to this problem.
  3. Slow access and poor plans regarding accessing data in large tables.  We will talk about what First Normal Form really means, why we opted to break the requirements in this case, what problems this caused, and how we solved them.
  4. Finally, we will look at how new requirements on semi-structured data were easily implemented using procedural languages, and how we made these perform well.

In the end there are a number of key lessons one can take away regarding monitoring and measuring performance in a database.  These include being willing to tackle low-level details, measure, and even simulate performance.

Please join me in Malmo or Moscow for this talk.