Tuesday, August 19, 2014

Math and SQL Part 5: Projection and Selection

The SELECT statement is the workhorse of SQL.  Updates and inserts are necessary, but selects are where the power is.  One of the significant issues many people have in understanding these and using them is a clear understanding of the math involved, in part because there are a large number of implicit possibilities in the syntax.  In general folks learn to avoid the implicit aspects of the statement, but there are some implicit odditities that can't go away because they are baked into the language.  One of these is the confusion between selection and projection, and because SQL operates on bags instead of sets, neither of these work in SQL quite like they do in relational math.

In "Relational Theory and SQL," Chris Date says that tuples are unordered.  Tuples are strongly ordered, but what Date is getting at is that a relation has no natural ordering of columns, and therefore from any relation with a given ordering, a new relation can be created with a different ordering.  This process is called 'projection' and it is entirely implicit in SQL.  In the interest of being concise, in that book, Date does not stick with his typical approach of starting from clear definitions and discussing what these mean.

This being said, understanding projection, selection, and the difference between them makes understanding relational databases far easier in my view.

Because SQL operates on bags, I will use both SQL (for bag examples) and Set::Relation (for set examples) in order to discuss the math differences.  I will finally offer an idea of what a clearer English rendition of SQL's math semantics would be.

I:  Projection - Transform One Relation into Vertical Subset


Projection, represented by a Pi character (π) creates a set of ordered tuples with an ordering based on the operation.  What projection does, essentially, is take a subset (or the whole set) from each tuple, possibly re-ordering it in the process, in order to transform one relation into a derived relation.

In relational algebra, projection takes a set of tuples and returns a set of tuples with a subset of fields (possibly re-arranged).

For example, consider a relation R, where the tuple structure of R is (employee_id, first_name, last_name, badge_number, office_id, tax_id)


(employee_idfirst_namelast_nameoffice_idtax_id)
(1EricYoung1123-45-6789)
(2ThomasYoung1324-55-3334)
(3JulieJames1443-45-6234)
(4JaredYoung2533-44-5442)
(5ToddKals2532-44-2332)

Suppose the data looks like this:

Then we could write π(last_name, office_id)(R), we would get:
(last_nameoffice_id)
(Young1)
(James1)
(Young2)
(Kals2)

Note that there is one fewer row in the output table because in both cases, you have a well-formed set.

The equivalent of the previous expression in SQL would be:

SELECT DISTINCT last_name, office_id FROM R;

From these examples a few things should be fairly clear.  First tuples are strongly ordered and we give human-readable labels as a nicety, but that projection can be used to reorder the tuple elements arbitrarily. Consequently tuples are consistently ordered within a relation, but that order is largely arbitrary.  Consequently tuples cannot be said to have a natural order.  Instead they derive their structure and order from the relation they are a part of.

Unfortunately SQL makes all kinds of implicit tuple transformations either by implicit projection or by some sort of reverse operation (I will call that "overlay" for the moment).  Because projection is entirely implicit in SQL, it is never stated.  The SQL statement above in fact doesn't perform a selection operation but merely performs a projection.

II:  Selection - Turn a Relation into a Horizontal Subset


Selection takes a relation and provides a subset of it based on arbitrary criteria.  The structure of the relation is unchanged, but a subset of the correspondences are in the new relation.  In other words, it selects tuples for further processing, and since the relation is a set, it is guaranteed to give a subset out.  It is represented by the sigma character (σ) and the subscript gives the equality condition.

Given the same relation R, we can perform σoffice_id=1(R) and get:

(employee_idfirst_namelast_nameoffice_idtax_id)
(1EricYoung1123-45-6789)
(2ThomasYoung1324-55-3334)
(3JulieJames1443-45-6234)

Simple right?

In SQL terms, a plain selection is what is found in the where clause.  So the equivalent in SQL is simply:

SELECT * FROM R WHERE office_id = 1;

As you can see from these examples, selection and projection are transformed in SQL in some relatively non-intuitive ways.  At the same time, understanding relational databases demands a basic understanding of these two concepts.

Some Notes on Set::Relation


In the course of putting this together I spent a little time with Set::Relation.  I found that the module seems to behave relatively well.  The projection method combined with members() outputs a set.  This was as  far as I took it in my quick spin.  As per the discussion below, Set::Relation operates under the assumption that relations are well-formed sets instead of bags.  This can be very helpful in operating in a highly formal relational environment.

Selection could be done using the semijoin method, or using members() and grep.

Next:  On the Great Null Debate


As a sort of postscript, I have decided not to include strong definitions of joins in this.  The orthography doesn't lend itself well to HTML blogs, and the basic concepts are reasonably straight-forward from an SQL perspective (the exception perhaps being antijoins).  At any rate it is not worth the effort.

No comments:

Post a Comment