Thursday, August 14, 2014

Math and SQL part 4: Basic Data Types. Sets, Tuples, and Bags

A reasonable understanding of the relational model requires understanding the basic data types which make it up, both in the idealized model and in real-world applications.  This post discusses both the idealized model and the accommodations the standard implementations of it make to the messiness of the real world.  We won't deal with NULLs here.  That will be the subject of a future entry.

I Relation and Set


At its root, the relational model is about relations.  Consequently a relation needs to be understood on a mathematical level.  Since a relation is a form of set, let's explore sets first.

A set is an unordered list of unique items.  Thus, {a, b} is a set, but {a, b, b} is not (it is a multi-set or bag, however, see below).  Additionally since it is unordered, {a, b} = {b, a}.  We can project ordering onto a list for our convenience but this does not affect the underlying mathematics.  Sets can be finite or infinite.  In databases, we focus on finite sets (infinite sets can be calculated, but cannot be stored in set form on a finite filesystem).  Infinite sets are for math libraries and don't really concern us here.

A relation is a set of propositions, of correlating facts.   These are represented in tuples (see below). At this point, just recognize that a tuple represents a correlation between two or more facts.  The individual facts are also trivial functions of the tuple as a whole, because each tuple is necessarily unique, each fact is a function of the correlation.

Consider a point in the format of (x, y). For any given point, x and y are trivial functions of it.  For (2, 2), x=2 is trivially true, as is y=2.  This seems tautological (and on one level it is) but it is also important because it gets to the basic operation of a relational database.  For more on this, see part 2 of this series.

II Row and Tuple


A tuple is an ordered set of items.  Unlike a set the order is important and repetition is allowed.  (1, 2, 3), (1, 3, 3), and (1, 1, 1) are all valid tuples.  In tuples, the order ascribes a value to a meaning.  For example, if the preceding tuples referred to points in three dimensional Euclidean space, the first would correspond to the x axis, the second to the distance along the y axis, and the last to the distance along the z axis.   This means that points (1, 2, 3) and (2, 1, 3) are different, and not equal (if they were sets with the same members they would be equal).

While relations are open ended sets (which presumably can be added to indefinitely), tuples are closed-ended.  While one can add more members, this involves a semantic shift in the data.  For example, points (1, 2) and (1, 2, 3) share the same x and y axis but a two dimensional space is different from a three dimensional space semantically.

Relational algebra works on the set level, and perform set operations.  They do so by providing basic set operations across an entire set.  However the comparisons are done on tuples and their members. 

This point cannot be overemphasized.  Ideal relational databases operate on sets and tuples.  Everything else is secondary.  A well designed database will build on this basic point.  As a caution, SQL works on bags, not sets, so some care is needed to make this happen.

III Tuples, Sets, and Relations in Perl


Tuples and sets of course are generally useful in programming.  Understanding these is often a key to better performance, and in Perl this is somewhat counterintuitive.

A naive view from looking at the database would be to see a tuple as a %hash and a set as an @array.  In fact, it is the other way around.  Understanding this is helpful because if you really want a set, a hash is often a better tool than an array in Perl.

Arrays in Perl are ordered and allow for duplicates.  In this regard they are a tuple or a sequence.  Hashes on the other hand do not allow for duplicate keys and the keys are unordered.  Therefore, where you want a set, a hash is usually a better tool, and where you want a sequence or a tuple, an array is a better structure.

Suppose we want to determine which members of one sequence are not found in a specific set (currently represented in an array).  We might do something like:

my %sethash;
$sethash{$_} = 1 for @setarray;
my @excluded = grep {not exists $sethash{$_}} @sequencearray;

This will perform much better than:

my @excluded;
for my $element (@sequencearray){
    push @excluded, $element unless grep {$_ eq $element} @sequencearray;
}

In Perl, set operations are usually easier to make perform well when using hash tables instead of arrays for sets.  This is because hash tables are unordered and enforce uniqueness among their keys.  Thus they are better optimized for the sorts of access patterns set operations are likely to use.

IV Set and Bag


SQL for better or worse generally compromises on the set operation and uses multi-sets (or bags) instead.  A multiset is like a set but it allows duplicates.  This is a compromise to allow for real world data to be easily input into the database.  In import routines, this is important because data may not be perfectly unique, and this may require some processing during the import process to sanitize.

Unfortunately, set operations and bag operations are different and produce different results.  As a preview to the next in the series, we could project a series of operations and get duplicates.  For example, suppose we project y from y = x2.  In this approach we will get duplicate values of y, because when we square negative numbers we get the same result as when we square the positive numbers.  In a set operation, we would just discard the duplicates.  In a bag operation, we return the duplicates.

This distinction is key to understanding the difference between databases in the ideal world vs in the real world.

V A Note on Ordering


As you will note, sets are unordered.  Ordinality is not important.   We can order tuples thus turning them into a sequence (for purposes of limit and offset) but when we do so, we can still revert back to a set for further set operations.  Ordering is mathematically complicated in databases, but it is intuitively easy.  The key thing to remember is that ordering creates a sequence from a set, and that we have to create a set from the sequence before we do further set operations.

It is for this reason that the results of the following query will be in an order dependent on the query plan:

SELECT f.foo, b.bar
  FROM (select foo, id from foobar order by transdate desc limit 20) f
  JOIN barbaz b ON f.id = b.foo_id;

We have a subquery which generates a sequence, but it returns a set of the members of that sequence.  Those are then used in further set operations, and the order returned depends on how the planner decides to do those operations.

Next:  Projection and Selection

2 comments:

  1. Tupes in the Relational Model seems pretty hash-like. The ordering in SQL comes from SQL, not the relational model.

    ReplyDelete
    Replies
    1. I interpret this differently (this will become more clear as I discuss projection and selection). The names are an alternate representation, and they don't address the differences in ordering.

      I think the reason why people think that this comes out of SQL is because SQL makes projection a bit of an implicit operation. We don't actually specify that our column list in a SELECT statement is a projection operation, for example.

      Tuple order in the relational model is a product of projection or of initial relational definition. The descriptive labels are a nicety but I don't think they change the fundamentals.

      Delete