Monday, May 27, 2013

Introducing the CPAN PGObject namespace

So I have decided to begin work on a set of CPAN modules, essentially implementing the stored procedure interface work that I have done for LedgerSMB, and incorporating the lessons learned by myself and others.  It is my hope that Perl programmers will find this useful in building sophisticated applications on PostgreSQL.  The idea is to turn PostgreSQL into an encapsulated database sever with a discoverable API, and then allow programs to discover the API directly.  In my experience this leads to more robust code because code contracts between stored procedures and applications are more readily enforcible.

I just got preliminary versions of two of them on CPAN.  The first is PGObject which implements the database-facing routines, and PGObject::Simple which is developer-facing.  The reason for breaking things off this way is to allow for multiple different developer-facing systems off the same backend-facing services.  These are functional, but ones with a great deal more functionality are on their way.

I feel a need to justify "Simple" in the name of the latter.  PGObject::Simple is a quick and dirty stored procedure mapping convention system based on our approach in LedgerSMB.  It is not reduced in functionality, but the overall conventions are simple, and therefore require less overhead to use.  This also means, however, that the flexibility is such that it requires a certain discipline to use effectively.  Objects can quickly become amorphous but used with discipline it is a very powerful system.

For additional structure, PGObject::Simple::Role is under development which creates a Moo role, which can be used with or without Moose, to provide this sort of functionality in more formalized object systems.

These modules are designed to facilitate mapping of stored procedure functionality into applications, allowing PostgreSQL databases to become encapsulated intelligent information servers in their own right.  In future posts I will be going over both of these in detail.

For now though start with the documentation in the links above.

Wednesday, May 1, 2013

Arrays and Indexing, problems and solutions

Arrays in PostgreSQL are relatively well supported by the indexing engines of this RDBMS.  There are a fair number of options however to be aware of and an understanding of how arrays are indexed is very helpful.  Before going forward, please read my intro on arrays so that you understand the basics.

I am leaving out hash indexes which are not very useful for arrays (or much of anything else).  There are probably cases where hash indexes are useful but if you need them you will know....

BTREE Indexing for Ordinality-centric arrays


The default type of index used by PostgreSQL is a btree index, which provides a search tree using only three operators: <, =, >.  Ordering and ordinality of arrays here is quite important and btree indexes are essentially useless when indexing arrays where ordinality is unimportant.

One should remember that btree indexes are useful primarily in ordering based searches.  The following could use a btree index (and might depending on table statistics):

  • SELECT id FROM mytable WHERE myarray < '{0,2,1,2,3}';
  • SELECT id FROM mytable WHERE myarray = '{0,6,5,4,3}';
In theory a prefix search might be possible if such an operator existed and appropriate work was done on the back-end but this is not currently possible without using functional index.  Again in theory the following might work, but GIN indexes would be better for reasons that will become clear.

  • SELECT id FROM mytable WHERE myarray[1] = '5';

The following cannot work with a btree index under any circumstances:

  • SELECT id FROM mytable WHERE myarray[2] = 5;
 The reason this cannot work is because while myarray[1] would correspond to something we could search the index on, myarray[2] would not be exposed to the tree view of the index until we already know myarray[1].  Thus we cannot traverse a tree on the basis of  array elements beyond the first.

Btree indexes may be useful in some cases with arrays, particularly where ordinality is very important to the search.  However, they are entirely useless where ordinality is unimportant.

GiST and GIN indexing


PostgreSQL provides two more general indexing opportunities that work for more general (and non-1NF uses) of arrays than do Btree indexes.

In general the tradeoff is that GIN indexes are more costly to maintain and take up more disk space but are less costly to  query.   As the read to write ratio goes up, GiST indexes start looking more desirable and GIN indexes start looking less desirable.

These indexes are helpful because they suppose a wider range of searches than do BTree indexes.  In particular you can do overlap searches (select * from pages where tags && array['PostgreSQL']), and much more.  GiST and GIN indexes also support other operators.  Also it is worth noting that GiST indexes are originally designed for geometric and geospacial searches, while GIN indexes are designed for arrays.

The supported GiST operators are:
  • <<  Strictly left of
  • &< Extends to the left of
  • &> Extends to the right of
  • >> Strictly right of
  • <<| Strictly Below
  • &<| Does not extend above
  • |&> Does not extend below
  • |>> Strictly above
  • @> Contains
  • <@ Contained in
  • ~=  Is the same as (identical to)
  • && Overlaps
Now this supports a significant superset of GIN index functionality (see below) but it does not really necessarily apply to arrays in a straightforward way.    In general GIN indexing is a better match, but it is worth noting you could easily write functions that could be used to check some of these index conditions and add that functionality if you need it.    In general, however, this is an advanced topic and well beyond the scope of this post.   If you ever need to compare two arrays though and see which has the highest single, or the largest dimensions, or some other objective metric, you might be able to hook this into GiST indexing.

The supported GIN operators are:
  • <@    Is contained in
  •  @> Contains
  •  =     equals
  • &&  Overlaps
 These are generally particularly useful for arrays, and since they are lossless, they are significantly faster for lookups.  For arrays these are preferred most of the time if you are looking primarily for matching specific elements in the array.

On the whole the two most important indexes depending on what you are doing with the array are btree and GIN indexes.  Btrees cover well the case where ordinality matters and the search is primarily on the beginning of the array.  GIN indexes cover more general (and in particular non-1NF setups) relatively well.  GiST indexes seem nice, but on the whole they do not offer a good performance/functionality tradeoff out of the box.  They would be powerful tools however for the advanced developer.