Wednesday, November 2, 2011

10 Underrated Features of PostgreSQL

Here are a list of features which I think are generally underrated in PostgreSQL and why.  Many though not all of these are being used in the development version of LedgerSMB, and more of these will likely be used as time goes on either by the main software or by addons.

These are functions which I greatly miss when I set down and work with many other RDBMS's, or which are fairly new and often underutilized.

10:  Language-Specific Full Text Searches
What it is:  Language-specific searches on related words.

Why I like it:  This function allows for a full text search to query related words in a post.  These can be indexed using functional indexes below as long as the system knows which language to search under.  So imagine if you put a text field that says "I am going to test this for the customer next week" but when you go back and search you search for "testing for customer" the result will still show up, because PostgreSQL knows that in English, "test" and "testing" can be two forms of the same word.

How I'd use it:  Any application which requires full text searches can take advantage of this to provide a much richer experience.

The LedgerSMB budget module currently uses full text searches for the descriptions of budgets.  Currently this is locale specific.  Future versions will probably allow notes to be in various languages and searched via a full text search per language.

9:   Pgcrypto
What it is:  Cryptographic extensions to PostgreSQL

Why I like it:   With pgcrypto, one can design systems which provide strong protection of data in the database, making the encryption transparent to the application.

How I'd use it:  If someone wanted to store, say, credit card data in the database, this could be installed using suitable forms of managed keys such that the user could unlock a key and retrieve the data.  This could then be handled such that the application itself would not need to handle encryption or decryption of the data, thus allowing other applications to handle key management and other important tasks.



8:    Functional Indexes


What it is:  Indexes on the outputs of immutable functions

Why I like it:  Functional indexes allow us to store information functionally dependent on information actually stored in the database.  Consequently just because we need to search on information.  This is available in Oracle and to a far lesser extent in MS SQL and DB2.  It is not available in MySQL.

How I'd use it:  Functional indexes are behind language-dependent full-text indexing (and hence can make searches there a lot faster).  It also has a large number of other uses.

For example, suppose I am storing a large bytea column and I want to search quickly on results.  I could index md5(my_bytea) and then search on input = my_bytea and md5(input) = md5(my_bytea).  This would allow an index scan to pull out matching md5's and then checking the full result against matching records.  For many other databases, one would have to store both the original data and the md5 checksum.

Another use would be to ensure uniqueness based on function output.  So for example, we could:

CREATE UNIQUE INDEX u_functional_index ON my_table (lower(my_col));

Then values would be rejected if they are non-unique to a case insensitive search.

7:    Partial Indexes


What it is:  Indexes on a subset of rows.

Why I like it:  Partial indexes allow the development of relatively complex data constraints.  For example, one can require a unique constraint on a subset of rows matching a certain criteria, since uniqueness is implemented on the index.  Of the other major RDBMS's, only Microsoft SQL Server implements anything like this.

How I'd use it:  Partial indexes can be used in two fundamentally different ways:  speeding up common search criteria and enforcing partial unique constraints.  Many are familiar with speeding up searches in this way, but partial unique indexes are where the real power lies. 

So suppose ar and ap invoices share the same physical table, but ar invoices must have unique invoice numbers while ap invoices must be unique per vendor.  We can use partial unique indexes to support both these constraints independently.

CREATE UNIQUE INDEX u_ar_invnumber ON invoices (invnumber) WHERE type = 'ar';

A second important application is adding forward-only unique constraints.  These are helpful where a new unique constraint needs to be added immediately but not all data in the table conforms to that constraint.  As a stop-gap measure (while data is being cleaned up), one might use a partial unique constraint to encourage uniqueness going forward.  Something like:

CREATE UNIQUE INDEX u_ar_invnumber ON invoices (invnumber) WHERE type = 'ar' AND id > 12345;

Data can be cleaned up (if possible), then the index can be dropped and recreated with a different where clause.

6:    Third Party Solutions


What it is:           An extensive set of third party add-ons for PostgreSQL

Why I like it:        There's so much to choose from!

A few of my favorites:   PGTAP is a test harness for stored procedures, building on other test harness frameworks.  It allows you to write production-safe test cases and run them without risking your data.  Very very good.

Slony-I is a very good well-developed single-master, asynchronous replication system for PostgreSQL.  it was developed to help ensure the .org domain registrar would scale appropriately.  Slony-I supports, among other things, replicating between different versions of PostgreSQL, something the native solutions do not.

Speaking of replication, Bucardo allows you to do multi-master and master-slave asynchronous replication as well.

PostGIS provides an excellent basis for GIS applications, and there are many, many other third party, free software add-ons for PostgreSQL  Uses are largely limited by the imagination.

5:    Language Handles


What it is:  Define your own stored procedure languages!

Why I like it:  Makes migration often more feasible than it would be otherwise, and exposes much advanced functionality to stored procedure interfaces.

How I'd use it:  There are three major uses of stored procedure handles (i.e. to create stored procedures in languages other than C, SQL, and PL/PGSQL).

The first of these is migration.  If one has an Oracle database which uses a lot of Java stored procedures you can often port these to PostgreSQL and PL/J with a minimum of headaches.  Similarly if you have an MS SQL database with .Net stored procedures you may be able to make these work with PL/Mono.

The second is to make advanced features of a language available to your stored procedures.  If you are doing a lot of text processing, PL/Perl may be helpful, but if you are doing a lot of statistical calculations, PL/R may be very helpful.

Finally (and here is the dangerous one), language handles can make it possible to do very un-database-like things from within stored procedures.  So for example, you could use PL/SH to execute shell scripts from the stored procedures.  However in general for these areas, external scripts and LISTEN/NOTIFY (see below) are better tools.

4:    Complex Types


What it is:        Tuple-based types, often used as function output types.

Why I like it:     These provide an opportunity to create a very rich API within the db.  PostgreSQL supports not only pure tuples but tuples which contain arrays of other tuples as their members.

How I'd use it:   In LegerSMB, we use complex types to define much of our interface for the application and this dependency will increase over time.  These types allow one to represent in SQL form any well-structured representation of data in any other application.

These can then be used to define inputs and outputs of stored procedures.

3:    Listen/Notify


What it is:  Notification of specified database events.

Why I like it:  this provides a notification service within the database.  Applications listen for a specified notification handle, and other functions can raise those notifications.  This allows one to create message queues (similar to Oracle's Database Notification Services), but that's not all.

Because the notifications are decoupled from low-level database operations, one can raise the notifications based on arbitrary criteria.  Suppose we want to raise a notification anytime the postgres user deletes data from the audit trail?  We can do that. No problem.  Similarly we could raise a notification anytime

How I'd use it:  You can use LISTEN/NOTIFY to send a message to a script that information has been dropped in a table and is ready to email out to a decision-maker.  For example you could use this to provide real-time emailing when, for example, a part's on-hand value drops below the reorder point.  In this way all manner of message queues can be created using the database as a center.

On top of that one can send notifications out around things that do not modify the database. I don't have a specific use case for this yet, but the fact that it is there means that many other things suddenly become possible.

2:    Windowing Functions


What it is:  The ability to have an aggregate function that aggregates incrementally over a set or portion of a set.  Although most RDBMS's these days do support windowing functions (MySQL, DB2, Oracle, and MS SQL all do), this is a feature which is often overlooked.

Why I like it:  This allows you to specify aggregates which are run incrementally across a set.  This set can be partitioned and ordered independent of any ordering on the final set.

How I'd use it:  There are huge numbers of uses for this sort of thing.  For example, windowing functions make it possible to select the highest ranked rows within a set by certain criteria.  They also allow one to add running totals to financial reports, or add running totals within specific transactions for financial reports.

No doubt there are many many more uses I just haven't thought of or come across.  However if you think about being able to use windowing functions the sorts of queries you can write are well expanded.

1:     PL/PGSQL


What it is:   A very clear, concise, well organized language for stored procedures.  Borrowing heavily in approach from Oracle's PL/SQL, this language is a mixture of Ada, PL/I, and SQL.  The result, while different from what most programmers are used to, highly readable, easy to follow, and extremely powerful.

Why I like it: PL/PGSQL adds almost everything you could want that vanilla SQL does not offer in a procedural language.  Conditionals, loops, and named variables all come to mind.  But what it has going for it is conciseness.  Often times I hear people say they are "just" doing stored procedures in PL/PGSQL.  There is no "just" about it though.  This language is very well designed for what it does.

How I'd use it:  LedgerSMB currently contains most SQL in the newer code inside stored procedures written in PL/PGSQL.  We use it for most procedures with large numbers of arguments or where the control structures are at all helpful.  In essence we use it a LOT.

The above list is largely arbitrary.  Another author might put in the fact that data definition language calls are usually transaction-safe (meaning you can roll back a transaction that dropped a table), incremental backups, or any number of other features.  However the features that I have included above are generally those which allow PostgreSQL to be used as a very extensible database server.  It is my view that these features, which are at their best when creating intelligent databases, are frequently under-used especially by small developers because they think of PostgreSQL as merely a data store instead of the very powerful system that it is.

What features do you think are underrated?  Comment below to mention them!

10 comments:

  1. Nice post. I'm looking forward to trying out tables for ephemeral data (no logging, so faster writes) in 9.1.

    ReplyDelete
  2. Foreign Data Wrapper's seem fascinating. A large amount of my dataset is relation, and becomes immutable overtime. So, the prospect of archiving that off to a c-store (maybe luciddb?) then creating a FDW to represent that as a table in Postgres is exciting.

    ReplyDelete
  3. @Brian Cardarella -- he mentioned the awesomeness of PostGIS under #6... unless that was added after your comment.

    ReplyDelete
  4. This is underrated only by people who don't use PostgreSQL, but Transactional DDL is high on my list.

    ReplyDelete
  5. PostGIS is mentioned under 3rd Party Solutions along with Slony and Bucardo. If I was using it more, it would probably get a listing itself.

    ReplyDelete
  6. Point in time recovery is a huge one for me. Although some RDBMS also have it, I find it extremely valuable to keep my data secure and safe.

    ReplyDelete
  7. Streaming of Large Objects (http://www.postgresql.org/docs/9.1/interactive/largeobjects.html), streaming replication (http://www.postgresql.org/docs/9.1/static/warm-standby.html#STREAMING-REPLICATION) to mention even more great features..

    ReplyDelete