Perspectives on LedgerSMB: Arrays and Indexing, problems and solutions

Arrays in PostgreSQL are relatively well supported by the indexing engines of this RDBMS. There are a fair number of options however to be aware of and an understanding of how arrays are indexed is very helpful. Before going forward, please read my intro on arrays so that you understand the basics.

I am leaving out hash indexes which are not very useful for arrays (or much of anything else). There are probably cases where hash indexes are useful but if you need them you will know....

BTREE Indexing for Ordinality-centric arrays

The default type of index used by PostgreSQL is a btree index, which provides a search tree using only three operators: <, =, >. Ordering and ordinality of arrays here is quite important and btree indexes are essentially useless when indexing arrays where ordinality is unimportant.

One should remember that btree indexes are useful primarily in ordering based searches. The following could use a btree index (and might depending on table statistics):

SELECT id FROM mytable WHERE myarray < '{0,2,1,2,3}';
SELECT id FROM mytable WHERE myarray = '{0,6,5,4,3}';

In theory a prefix search might be possible if such an operator existed and appropriate work was done on the back-end but this is not currently possible without using functional index. Again in theory the following might work, but GIN indexes would be better for reasons that will become clear.

SELECT id FROM mytable WHERE myarray[1] = '5';

The following cannot work with a btree index under any circumstances:

SELECT id FROM mytable WHERE myarray[2] = 5;

The reason this cannot work is because while myarray[1] would correspond to something we could search the index on, myarray[2] would not be exposed to the tree view of the index until we already know myarray[1]. Thus we cannot traverse a tree on the basis of array elements beyond the first.

Btree indexes may be useful in some cases with arrays, particularly where ordinality is very important to the search. However, they are entirely useless where ordinality is unimportant.

GiST and GIN indexing

PostgreSQL provides two more general indexing opportunities that work for more general (and non-1NF uses) of arrays than do Btree indexes.

In general the tradeoff is that GIN indexes are more costly to maintain and take up more disk space but are less costly to query. As the read to write ratio goes up, GiST indexes start looking more desirable and GIN indexes start looking less desirable.

These indexes are helpful because they suppose a wider range of searches than do BTree indexes. In particular you can do overlap searches (select * from pages where tags && array['PostgreSQL']), and much more. GiST and GIN indexes also support other operators. Also it is worth noting that GiST indexes are originally designed for geometric and geospacial searches, while GIN indexes are designed for arrays.

The supported GiST operators are:

<< Strictly left of
&< Extends to the left of
&> Extends to the right of
>> Strictly right of
<<| Strictly Below
&<| Does not extend above
|&> Does not extend below
|>> Strictly above
@> Contains
<@ Contained in
~= Is the same as (identical to)
&& Overlaps

Now this supports a significant superset of GIN index functionality (see below) but it does not really necessarily apply to arrays in a straightforward way. In general GIN indexing is a better match, but it is worth noting you could easily write functions that could be used to check some of these index conditions and add that functionality if you need it. In general, however, this is an advanced topic and well beyond the scope of this post. If you ever need to compare two arrays though and see which has the highest single, or the largest dimensions, or some other objective metric, you might be able to hook this into GiST indexing.

The supported GIN operators are:

<@ Is contained in
@> Contains
= equals
&& Overlaps

These are generally particularly useful for arrays, and since they are lossless, they are significantly faster for lookups. For arrays these are preferred most of the time if you are looking primarily for matching specific elements in the array.

On the whole the two most important indexes depending on what you are doing with the array are btree and GIN indexes. Btrees cover well the case where ordinality matters and the search is primarily on the beginning of the array. GIN indexes cover more general (and in particular non-1NF setups) relatively well. GiST indexes seem nice, but on the whole they do not offer a good performance/functionality tradeoff out of the box. They would be powerful tools however for the advanced developer.

Perspectives on LedgerSMB

Wednesday, May 1, 2013

Arrays and Indexing, problems and solutions

BTREE Indexing for Ordinality-centric arrays

GiST and GIN indexing

No comments:

Post a Comment