Perspectives on LedgerSMB: When to use SELECT * in PostgreSQL

In LedgerSMB we use a lot of queries which involve SELECT *. Many people consider SELECT * harmful but there are several cases where it is useful. Keep in mind we encapsulate the database behind an API, so SELECT * has different implications than it does from applications selecting directly from tables.

The Fundamental Design Questions

It all comes down to software interface contracts and types. Poorly thought-out contracts, loosely applied, lead to unmaintainable code. Clear contracts, carefully applied, lead to maintainable code because the expectations are easily enforced.

PostgreSQL comes with a complex type system where every table, view, or composite type is an object class. In the right contexts, SELECT * provides you a result of a guaranteed type. This is important when doing object relational work because it means you get a series of objects back in a defined class. This allows you to then pass those on to other functions to get derived data.

Select * therefore helps you when working with objects, because you can ensure that the result types are in fact valid objects of a specified class defined in the relation clause of the query.

Where SELECT * can't be helpful

SELECT * is never helpful (and can have significant problems) in specific areas, such as anywhere you have a join. There are specific reasons for these problems.

Consider the following:

chris=# create table typetest (test text);
CREATE TABLE
^
chris=# insert into typetest values ('test1'), ('test2');
INSERT 0 2
chris=# CREATE VIEW typetestview AS select * from typetest;
CREATE VIEW
chris=# select * from typetestview;
test
-------
test1
test2
(2 rows)

chris=# alter table typetest add newfield bool default false;
ALTER TABLE
chris=# select * from typetestview;
test
-------
test1
test2
(2 rows)

(Edited, see note [1]). Interestingly this is fixed in the view definition so dump and reload won't change it. However without setting up casts, you can't take advantage of the fact you are passing up the data structure.

The key problem here is that views are their own types, and consequently you cannot guarantee that the view type will be the same as the underlying table type. This makes castes more complex to maintain and probably not worth the trouble.

Once joins are used in a query, however, SELECT * loses any potential benefit. Joins do not return a defined type, and so SELECT * should never be used in queries utilizing joins (aside possibly from ad hoc queries run by the dba to explore the data).

SELECT * and Stored Procedures

Consider for example the following CRUD stored procedure:

CREATE OR REPLACE FUNCTION accounts__list_all()
RETURNS SETOF accounts
LANGUAGE SQL AS
$$
SELECT * FROM accounts ORDER BY account_no;
$$;

This query is relatively simple, but the stored procedure returns a type that is defined by the underlying table. We all run into cases where application data can't be much further normalized and we may want to have stored procedures delivering that data to the application. In this case, we are likely to use a function like this, and that enables us to do other object-relational things outside it.

Now, if we need to change the underlying accounts table, we can always make a decision as to whether to make accounts a view with a stable representation, a complex type with a hand-coded query returning it, or just propagate the changes upwards. Because the application is not directly selecting from the underlying storage, we have options to ensure that the contract can be maintained. In essence this injects a dependency that allows us to maintain contracts more easily through schema changes.

Consequently although it leads to the same execution plan in this example, there is a tremendous difference, software engineering-wise, between an application calling:

SELECT * FROM accounts ORDER BY account_no;

and

SELECT * FROM accounts__list_all();

In the first case, you have only one contract, between the high level application code and the low-level storage. In the second case, you have two contracts, one between the storage and the procedure (which can be centrally adjusted), and a more important one between the application code and the stored procedure.

Conclusions

In PostgreSQL, the choice of whether to use SELECT * in a query is a relatively clear one. If you want to return objects of a type of an underlying construct, and the return type is closely tied over time to the output type, then SELECT * is fine. On the other hand, if these things aren't true then either you should find ways to make them true, or avoid using SELECT * altogether.

This makes a lot more sense when you realize that things like table methods can be passed up when select * is used (or methods applied to views, or the like).

In general you are going to find two areas where select * is most helpful. The first is in object-relational design. The second case is where you want PostgreSQL to define an object model for you. In reality the first case is a special case of the second.

This way of doing things is very different than the way most applications work today. The database is encapsulated behind an object model and the application consumes that object model. In those cases, select * is very helpful.

[1] Corrected as per comment. I was under the mistaken impression that select * would be a part of the view definition. This still leads to annoying and unexpected changes in view definition, such as when you drop and recreate the view and so I would still discourage it here however.

3 comments:

UnknownAugust 26, 2013 at 6:39 AM
the first point isn't true, a view contains NOT select *, the view-definition contains the column-names.

Regards, akretschmer
UnknownAugust 27, 2013 at 3:47 PM
I'm finding views of the type:
select u.*, row_to_json(ut.*) user_type_json
from user u join user_type ut on ut.user_type_seq=u.user_type_seq

fairly useful in a programming sense. I use my framework to turn these structures back into objects automatically, and you can't believe what my coders can develop with some very similar queries. Yes, it can be more CPU intensive for large data sets, so we avoid it in those instances, but for standard objects like users and other things, it's making the programming much easier-- usually 1 db round trip to get the data they need, and believe it or not, it's faster than the way they were using the api, which was using 2 queries instead of a join. Huge speed improvement using code Igniter (with a few custom plugins), but much better than the original models they had set up.

Monday, August 26, 2013

When to use SELECT * in PostgreSQL

The Fundamental Design Questions

Where SELECT * can't be helpful

SELECT * and Stored Procedures

Conclusions

3 comments: