Tuesday, August 19, 2014

Math and SQL Part 5: Projection and Selection

The SELECT statement is the workhorse of SQL.  Updates and inserts are necessary, but selects are where the power is.  One of the significant issues many people have in understanding these and using them is a clear understanding of the math involved, in part because there are a large number of implicit possibilities in the syntax.  In general folks learn to avoid the implicit aspects of the statement, but there are some implicit odditities that can't go away because they are baked into the language.  One of these is the confusion between selection and projection, and because SQL operates on bags instead of sets, neither of these work in SQL quite like they do in relational math.

In "Relational Theory and SQL," Chris Date says that tuples are unordered.  Tuples are strongly ordered, but what Date is getting at is that a relation has no natural ordering of columns, and therefore from any relation with a given ordering, a new relation can be created with a different ordering.  This process is called 'projection' and it is entirely implicit in SQL.  In the interest of being concise, in that book, Date does not stick with his typical approach of starting from clear definitions and discussing what these mean.

This being said, understanding projection, selection, and the difference between them makes understanding relational databases far easier in my view.

Because SQL operates on bags, I will use both SQL (for bag examples) and Set::Relation (for set examples) in order to discuss the math differences.  I will finally offer an idea of what a clearer English rendition of SQL's math semantics would be.

I:  Projection - Transform One Relation into Vertical Subset


Projection, represented by a Pi character (π) creates a set of ordered tuples with an ordering based on the operation.  What projection does, essentially, is take a subset (or the whole set) from each tuple, possibly re-ordering it in the process, in order to transform one relation into a derived relation.

In relational algebra, projection takes a set of tuples and returns a set of tuples with a subset of fields (possibly re-arranged).

For example, consider a relation R, where the tuple structure of R is (employee_id, first_name, last_name, badge_number, office_id, tax_id)


(employee_idfirst_namelast_nameoffice_idtax_id)
(1EricYoung1123-45-6789)
(2ThomasYoung1324-55-3334)
(3JulieJames1443-45-6234)
(4JaredYoung2533-44-5442)
(5ToddKals2532-44-2332)

Suppose the data looks like this:

Then we could write π(last_name, office_id)(R), we would get:
(last_nameoffice_id)
(Young1)
(James1)
(Young2)
(Kals2)

Note that there is one fewer row in the output table because in both cases, you have a well-formed set.

The equivalent of the previous expression in SQL would be:

SELECT DISTINCT last_name, office_id FROM R;

From these examples a few things should be fairly clear.  First tuples are strongly ordered and we give human-readable labels as a nicety, but that projection can be used to reorder the tuple elements arbitrarily. Consequently tuples are consistently ordered within a relation, but that order is largely arbitrary.  Consequently tuples cannot be said to have a natural order.  Instead they derive their structure and order from the relation they are a part of.

Unfortunately SQL makes all kinds of implicit tuple transformations either by implicit projection or by some sort of reverse operation (I will call that "overlay" for the moment).  Because projection is entirely implicit in SQL, it is never stated.  The SQL statement above in fact doesn't perform a selection operation but merely performs a projection.

II:  Selection - Turn a Relation into a Horizontal Subset


Selection takes a relation and provides a subset of it based on arbitrary criteria.  The structure of the relation is unchanged, but a subset of the correspondences are in the new relation.  In other words, it selects tuples for further processing, and since the relation is a set, it is guaranteed to give a subset out.  It is represented by the sigma character (σ) and the subscript gives the equality condition.

Given the same relation R, we can perform σoffice_id=1(R) and get:

(employee_idfirst_namelast_nameoffice_idtax_id)
(1EricYoung1123-45-6789)
(2ThomasYoung1324-55-3334)
(3JulieJames1443-45-6234)

Simple right?

In SQL terms, a plain selection is what is found in the where clause.  So the equivalent in SQL is simply:

SELECT * FROM R WHERE office_id = 1;

As you can see from these examples, selection and projection are transformed in SQL in some relatively non-intuitive ways.  At the same time, understanding relational databases demands a basic understanding of these two concepts.

Some Notes on Set::Relation


In the course of putting this together I spent a little time with Set::Relation.  I found that the module seems to behave relatively well.  The projection method combined with members() outputs a set.  This was as  far as I took it in my quick spin.  As per the discussion below, Set::Relation operates under the assumption that relations are well-formed sets instead of bags.  This can be very helpful in operating in a highly formal relational environment.

Selection could be done using the semijoin method, or using members() and grep.

Next:  On the Great Null Debate


As a sort of postscript, I have decided not to include strong definitions of joins in this.  The orthography doesn't lend itself well to HTML blogs, and the basic concepts are reasonably straight-forward from an SQL perspective (the exception perhaps being antijoins).  At any rate it is not worth the effort.

9 comments:

  1. Chris, Set::Relation directly supports every feature that SQL does for working with relations (and it expects you'd use Perl's ops for types like numbers and strings and doesn't provide its own). If you can't find something you're looking for then you can ask me about it and I'll help you.

    The general selection operator is named "restriction", as is common practice, and works like Perl's "grep" operator, taking a Perl higher-order function to define the filter condition. See http://search.cpan.org/dist/Set-Relation/lib/Set/Relation.pm#restriction .

    Also, for the very common special case where the selection filter is simply that specific fields match specific values, I provide semijoin() that replaces the Perl higher order function with a simpler, relation, input that is just the values to match. In its general form, semijoin() corresponds to SQL's "WHERE (...) IN (...)" syntax but it is also meant to replace "WHERE x=1 AND y=2 OR x=4 AND y=6". See http://search.cpan.org/dist/Set-Relation/lib/Set/Relation.pm#semijoin .

    I should also clarify for the readers that all relational operators in Set::Relation result in a relation value (Set::Relation object), same as their relation inputs, and projection() by itself is an example of this.

    The members() method is just a way to export the contents of a relation into a standard Perl data structure so it can be viewed easily, and it is in no way specific to a projection operation.

    You should update your article comments on Set::Relation to reflect these things, especially the presence of selection.

    ReplyDelete
    Replies
    1. On selection, making those changes.

      On projection, let me clarify what I saw so we can decide what edits if any to make:

      1. create a relation.

      2. perform a projection

      3. dump the relation (using Data::Dumper). It is a bag, not a set.

      4. run members(). This exports a set, not a bag.

      The interesting question is whether bag operations are possible using Set::Relation. My sort of testing above suggests that they are, which matches your statement that it supports everything SQL supports. However, SQL makes a fairly major concession to reality in doing bag as opposed to set operations.

      It seems to me that members() is what allows me to restore the bag to a set with your module. Is this correct? Or is there another way to think of it?

      Delete
    2. Chris, we probably should take some things going forward to private email, or instant message, or Skype etc, unless you want it all to be public. My main email is at the bottom of the module documentation and I can tell you other contact info privately.

      That being said, I will clarify some things and answer your questions here.

      First, Set::Relation operates under the paradigm that relations are sets of tuples only, and not bags, per my understanding of the relational model, which your posts so far agree with.

      The most correct way to dump a Set::Relation object is using its export_for_new() method with no arguments; its result, if fed to new(), will produce a logically identical object. Said export has no duplicates. Generally you want to run Data::Dumper on the output of export_for_new() if you want to see the relation value.

      Anything extra you see using Data::Dumper on a Set::Relation object is not logically significant, that is, they don't affect the actual value of the relation. Some Set::Relation operations are implemented lazily behind the scenes for performance, so internally you may see a bag / there may be duplicate tuples, but I assure you they will be eliminated at the latest when it would actually affect the result, eg if you're going to call cardinality(), or if you're going to do an export.

      Loosely speaking its like how Postgres keeps around "deleted" tuples until vacuumed, some which may be duplicates, but still logically treats them as not present, because actually removing them eagerly hurts performance.

      Also, Set::Relation objects internally will generate indexes and retain them when any operation occurs that would use them; for example, indexes on the corresponding attributes are automatically made when doing a join() so that the join performs in O(N) rather than O(N*M), except when doing a cross product which that can't help; these indexes are extra noise you'd see if doing a Data::Dumper on the object itself.

      Set::Relation does not explicitly support bag semantics at all and will not in general produce correct results if you are expecting such. The sole purpose of the optional argument "allow_dup_tuples" on some methods is to let the user gain performance at the expense of correctness when they really don't care, and the effect of that argument is to persist the underlying lazy behavior, preventing de-duplication from happening at some critical points where it otherwise would have.

      I have no intention to add explicit support for bag semantics to this Perl module, as I feel it doesn't have any value. On the other hand, technically Set::Relation is really just an interface role with 2 provided implementing classes (the older one eagerly deduplicates while the newer one does it lazily). And so, a third party is quite welcome to make another class composing that role which adds explicit bag semantic support, as such a class would still be drop-in-compatible with the role and programs using it if properly designed.

      Barring that, the canonical way to do bags with Set::Relation is to fake it over sets, such as by adding an extra attribute to each relation with some auto-generated surrogate key value and then updating your user code with respect to specifying that one or not as applicable. Not trivial, and I mean really in practice set semantics are what users want, but it can be done with work if you need it.

      Delete
    3. Another point of clarity is that the paradigm of Set::Relation is that tuple attributes are identified by name only and not by position. As such, it does not have an internal concept of or track attribute order. It just focuses on what is "natural", per your interpretation of Chris Date. This naturally jives with using a Perl hash to canonically represent a tuple, and a relation as a Perl array of said.

      The only time tuple attribute ordering is significant in Set::Relation is on import and export as it allows you to represent a relation as an array of arrays while just specifying the attribute names once in total, rather than once per tuple; the significance is keeping the attribute names and values lined up with each other, similarly to how Perl's hash keys() and values() output in corresponding order by design, I believe.

      This paradigm is significant because all Set::Relation operations that involve comparing tuple attributes such as union() or join() are "natural" and implicitly do so on corresponding attribute names. As such, before you do a union or join etc you have to make sure that input attributes either do or don't have the same names as appropriate, using rename() for example, to guide the behavior you want; the result relation would just have 1 attribute of each of the corresponding names.

      Doing things this way makes things much more logically clean.

      Under that paradigm, a well designed SQL database schema will, whereever reasonably possible, give corresponding table columns the same names when they have the same domains, and different when different. So eg, use "author_id" both in the "authors" and "books" tables, and don't call it just "id" in the former, say. The more that is followed, the less renaming ("foo AS bar") you have to do and queries are less verbose.

      Delete
    4. Thanks for your reply. I will update accordingly.

      For the record, I usually prefer to keep such discussions public as long as they are likely to portray both sides in a good light because we all know that all the information shared will not make it into the update. This means that anyone in the future reading this gets more information about your module than I might put in place.

      Thanks again!

      Delete
    5. Thank you for your update. There's still at least one thing that needs correcting though, where you say this:

      "Selection could be done using the semijoin method, or using members() and map."

      The most egregious error is that you said "map". Perl's map is for 1:1 list transforms, where grep is for list filtering.

      The other important thing is you should mention Set::Relation's "restriction" method as the general form while semijoin is a common specific form that works for your example.

      For example:

      "Selection could be done using the semijoin or restriction methods, or using members() and grep."

      Delete
  2. Hi,

    I need to validate data in PostGreSQL database against the data in an XML file.
    Could you please help me by providing any idea or sugession.

    Thanks & Regards,
    Venkatesh Gubba
    venkatesh.gubba@gmail.com

    ReplyDelete
    Replies
    1. If you use a Postgres Foreign Data Wrapper for XML, you can use all of Postgres' built-in facilities to compare the XML against the database using just SQL.

      Delete
  3. As long as you stick with strictly relational operators, relations *are* unordered sets. They are merely *represented as* (arbitrarily) ordered bags. Because, e.g., {1,2,3} = {3,2,2,1,3} you don't need to worry about order or duplicates and any result will be equivalent, so you get to enjoy the highly tractable relational world. There are three caveats: (1) You'll want to process the result with a non-relational operation (e.g. select distinct order by) for presentation to a human. This will make the human happy and also hide the implementation's representation of the sets. (2) In those rare non-contrived cases where there are enough duplicates that it's worth the expense of eliminating them before doing further relational processing you can do so - it won't affect the result but you should watch out for the usual dangers of premature and/or unnecessary optimization. (3) Sometimes what you want to do is better expressed non-relationally, so you use non-relational features of SQL at the cost of it being harder to analyze, optimize, debug and explain to colleagues as well as harder to ensure it will work the same when you upgrade your database engine. TL;DR = "Treat relations as sets and they'll act that way!"

    ReplyDelete