Thursday, August 14, 2014

Math and SQL part 4: Basic Data Types. Sets, Tuples, and Bags

A reasonable understanding of the relational model requires understanding the basic data types which make it up, both in the idealized model and in real-world applications.  This post discusses both the idealized model and the accommodations the standard implementations of it make to the messiness of the real world.  We won't deal with NULLs here.  That will be the subject of a future entry.

I Relation and Set


At its root, the relational model is about relations.  Consequently a relation needs to be understood on a mathematical level.  Since a relation is a form of set, let's explore sets first.

A set is an unordered collection of unique items.  Thus, {a, b} is a set, but {a, b, b} is not (it is, however, a multiset or bag; see below).  Additionally, since a set is unordered, {a, b} = {b, a}.  We can project an ordering onto a set for our convenience, but this does not affect the underlying mathematics.  Sets can be finite or infinite.  In databases, we focus on finite sets (infinite sets can be computed, but cannot be stored in set form on a finite filesystem).  Infinite sets are for math libraries and don't really concern us here.

A relation is a set of propositions, of correlated facts.  These are represented as tuples (see below).  At this point, just recognize that a tuple represents a correlation between two or more facts.  Because each tuple is necessarily unique, the individual facts are trivial functions of the tuple as a whole: each fact is a function of the correlation.

Consider a point in the format of (x, y). For any given point, x and y are trivial functions of it.  For (2, 2), x=2 is trivially true, as is y=2.  This seems tautological (and on one level it is) but it is also important because it gets to the basic operation of a relational database.  For more on this, see part 2 of this series.
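To put this in SQL terms, here is a minimal sketch (the table and names are purely illustrative):

CREATE TABLE point (
    x integer NOT NULL,
    y integer NOT NULL,
    PRIMARY KEY (x, y)  -- set semantics: no duplicate tuples
);

INSERT INTO point (x, y) VALUES (2, 2);

-- Given the tuple (2, 2), x = 2 and y = 2 are trivial functions of it:
SELECT x, y FROM point WHERE x = 2 AND y = 2;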

II Row and Tuple


A tuple is an ordered collection of items.  Unlike a set, the order is important and repetition is allowed.  (1, 2, 3), (1, 3, 3), and (1, 1, 1) are all valid tuples.  In a tuple, position ascribes meaning to each value.  For example, if the preceding tuples referred to points in three-dimensional Euclidean space, the first value would correspond to the distance along the x axis, the second to the distance along the y axis, and the last to the distance along the z axis.  This means that the points (1, 2, 3) and (2, 1, 3) are different and not equal (if they were sets with the same members, they would be equal).

While relations are open-ended sets (which presumably can be added to indefinitely), tuples are closed-ended.  While one can add more members, doing so involves a semantic shift in the data.  For example, the points (1, 2) and (1, 2, 3) share the same x and y coordinates, but a two-dimensional space is semantically different from a three-dimensional one.

Relational algebra works at the set level and performs set operations across entire sets.  The comparisons themselves, however, are done on tuples and their members.

This point cannot be overemphasized.  Ideal relational databases operate on sets and tuples.  Everything else is secondary.  A well-designed database builds on this basic point.  As a caution, SQL works on bags, not sets, so some care is needed to make this happen.
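For example, here is a minimal sketch (standard SQL) of where bag semantics leak through and how set semantics can be restored:

-- UNION ALL is a bag operation and may return duplicates;
-- UNION is the corresponding set operation and removes them.
SELECT x FROM (VALUES (1), (2)) AS a(x)
UNION ALL
SELECT x FROM (VALUES (2), (3)) AS b(x);  -- 1, 2, 2, 3 (in some order)

SELECT x FROM (VALUES (1), (2)) AS a(x)
UNION
SELECT x FROM (VALUES (2), (3)) AS b(x);  -- 1, 2, 3 (in some order)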

III Tuples, Sets, and Relations in Perl


Tuples and sets of course are generally useful in programming.  Understanding these is often a key to better performance, and in Perl this is somewhat counterintuitive.

A naive view, coming from the database, would be to see a tuple as a %hash and a set as an @array.  In fact, it is the other way around.  Understanding this is helpful because if you really want a set, a hash is often a better tool than an array in Perl.

Arrays in Perl are ordered and allow for duplicates.  In this regard they are a tuple or a sequence.  Hashes on the other hand do not allow for duplicate keys and the keys are unordered.  Therefore, where you want a set, a hash is usually a better tool, and where you want a sequence or a tuple, an array is a better structure.

Suppose we want to determine which members of one sequence are not found in a specific set (currently represented in an array).  We might do something like:

# Build a hash as the set: hash keys are unique and lookups are cheap.
my %sethash;
$sethash{$_} = 1 for @setarray;

# Keep only the sequence members not found in the set.
my @excluded = grep { not exists $sethash{$_} } @sequencearray;

This will perform much better than:

my @excluded;
for my $element (@sequencearray) {
    # Scans the whole set array for every element: O(n*m) comparisons.
    push @excluded, $element unless grep { $_ eq $element } @setarray;
}

In Perl, set operations are usually easier to make perform well using hash tables rather than arrays to represent sets.  This is because hash tables are unordered and enforce uniqueness among their keys, so they are better optimized for the sorts of access patterns set operations are likely to use.

IV Set and Bag


SQL, for better or worse, generally compromises on the set concept and uses multisets (or bags) instead.  A multiset is like a set except that it allows duplicates.  This is a compromise to allow real-world data to be easily loaded into the database: imported data may not be perfectly unique, and enforcing uniqueness up front may require some processing to sanitize the data first.

Unfortunately, set operations and bag operations are different and produce different results.  As a preview of the next post in the series, suppose we project y from y = x².  Here we will get duplicate values of y, because squaring a negative number gives the same result as squaring its positive counterpart.  In a set operation, we would simply discard the duplicates.  In a bag operation, we return them.
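A minimal sketch in SQL (the table here is hypothetical):

CREATE TABLE xs (x integer PRIMARY KEY);
INSERT INTO xs (x) VALUES (-2), (-1), (0), (1), (2);

-- Bag projection: duplicates are returned.
SELECT x * x AS y FROM xs;           -- 4, 1, 0, 1, 4 (five rows)

-- Set projection: duplicates are discarded.
SELECT DISTINCT x * x AS y FROM xs;  -- 0, 1, 4 (three rows)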

This distinction is key to understanding the difference between databases in the ideal world vs in the real world.

V A Note on Ordering


As noted above, sets are unordered; ordinality is not important.  We can order a set's tuples, thus turning the set into a sequence (for purposes of LIMIT and OFFSET), but when we do so, we can still revert back to a set for further set operations.  Ordering is mathematically complicated in databases, but it is intuitively easy.  The key thing to remember is that ordering creates a sequence from a set, and that we have to create a set from the sequence again before we do further set operations.

It is for this reason that the results of the following query will be in an order dependent on the query plan:

SELECT f.foo, b.bar
  FROM (SELECT foo, id FROM foobar ORDER BY transdate DESC LIMIT 20) f
  JOIN barbaz b ON f.id = b.foo_id;

We have a subquery which generates a sequence, but it returns the members of that sequence as a set.  Those are then used in further set operations, and the order of the final result depends on how the planner decides to perform those operations.
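If the final ordering matters, it must be imposed again in the outer query, turning the result set back into a sequence.  A sketch (note that transdate must be carried through the subquery for the outer ORDER BY to see it):

SELECT f.foo, b.bar
  FROM (SELECT foo, id, transdate
          FROM foobar
         ORDER BY transdate DESC
         LIMIT 20) f
  JOIN barbaz b ON f.id = b.foo_id
 ORDER BY f.transdate DESC;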

Next:  Projection and Selection

10 comments:

  1. Tuples in the Relational Model seem pretty hash-like. The ordering in SQL comes from SQL, not the relational model.

    1. I interpret this differently (this will become more clear as I discuss projection and selection). The names are an alternate representation, and they don't address the differences in ordering.

      I think the reason people think this comes out of SQL is that SQL makes projection a bit of an implicit operation. We don't actually specify that our column list in a SELECT statement is a projection operation, for example.

      Tuple order in the relational model is a product of projection or of initial relational definition. The descriptive labels are a nicety but I don't think they change the fundamentals.

  2. In section II you say "SQL and other relational query languages work on the set level, and perform set operations" which is incorrect regarding SQL in general, while in section IV you correctly say SQL works with bags. You should update the wording of the earlier part so you don't contradict yourself. For example in the earlier part you could say "Relational query languages (but not SQL) work on the set level,..." instead.

    Also, see my comments I just made on part 3 of your series, in case you don't get an automatic notification each time blog comments are added.

    1. Thanks for your feedback. I will make this change.

  3. Note that a way to effectively do ordering while staying with sets the whole time is to do a relational extend, adding a calculated element using a rank or row_number window function; effectively doing a 'limit' is then a relational restriction on that new element, and you can then project away the element if it's no longer needed.
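    A minimal PostgreSQL-style sketch of that approach (table and column names are illustrative):

    -- Extend with a calculated rank, restrict on it, then project it away.
    SELECT foo, id
      FROM (SELECT foo, id,
                   row_number() OVER (ORDER BY transdate DESC) AS rn
              FROM foobar) ranked
     WHERE rn <= 20;  -- a relational restriction standing in for LIMIT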

    With regard to your comment on multisets in SQL, "This is a compromise to allow for real world data to be easily input into the database." ...

    I believe that is a very bad rationale in practice for having multisets, given all the problems their presence causes, and that 99% of the time users want setlike behavior. There are other methods to achieve the same goal without compromising on keeping everything as sets. Which to use would depend on the use cases in question.

    When the fact there are duplicates is not logically significant, attempts to insert a duplicate tuple into a database relation could just silently discard the duplicate, so there's still just one unique instance of the tuple to represent all duplicates. Most of the time this is the behavior users desire.

    When the fact there are duplicates is logically significant, either because you want to count occurrences of something or you want to at least detect the error without holding things up, you could have an extra calculated tuple element added at insert time: either a surrogate key distinct for each copy, or you keep one copy but update an element storing an occurrence count. I expect the first would be the normal solution, given that it is simpler on existing DBMSs and that the insert order may also be significant.
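    For instance, a sketch of both strategies (ON CONFLICT is PostgreSQL 9.5+ syntax; the tables are illustrative):

    -- Silently discard duplicates: one unique instance represents them all.
    CREATE TABLE observation (val text PRIMARY KEY);
    INSERT INTO observation (val) VALUES ('a')
        ON CONFLICT (val) DO NOTHING;

    -- Or keep one copy and maintain an occurrence count.
    CREATE TABLE counted (val text PRIMARY KEY,
                          occurrences integer NOT NULL DEFAULT 1);
    INSERT INTO counted (val) VALUES ('a')
        ON CONFLICT (val) DO UPDATE
        SET occurrences = counted.occurrences + 1;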

    The only reason in practice I can think of for allowing duplicates is that for some operations the implementation of the DBMS itself can be simpler or perform better, if relations are internally represented using ordered arrays that are being appended to, duplicates aren't so numerous as to swell memory/storage use, and you don't care about getting somewhat wrong answers to queries (such as how many X). In fact, I illustrate such differences in one of my Perl Set::Relation module implementations, where users can choose on a per-operation basis whether or not to do extra work to eliminate duplicates that might come from the operation, such as a restriction or union, so they can choose to make it faster at the sacrifice of some correctness (set-like correctness is the default behaviour).

    1. I made a typo, in the last paragraph it should have said "such as PROJECTION or union".

    2. I am, as a sort of thought exercise (it won't be big or professional at the moment), creating a relational query language. I expect to implement a parser for it in Perl (and an ability to transform it into SQL).

      One thing I am thinking about is allowing a return type of either set, bag, or sequence to be specified (and an ability to turn a set or bag into a sequence). This would allow ordering in a nice, natural way.

      Something like:

      GIVEN foo
      SELECT bar = 'baz'
      PROJECT (id, bar)
      RETURNING SET TO SEQUENCE ON ID ASC

    3. Chris, what is your timeline for that project?

      If you're going down that road, maybe you would be interested in working with me on my Muldis D project, which does exactly what you propose, including being implemented in Perl to start, except that it is meant for industry.

      Like Perl, Muldis D is intended to be both a language in itself and a toolkit for making other languages, but it is particularly savvy to working with tuples and relations. Using my project as a foundation for yours would save you a lot of work and give you a better quality result, including letting you take your ideas or experiments to levels you wouldn't have made the effort to reach on your own.

      You can both run your queries natively in Perl and transform them to SQL (where SQL is capable of doing what you're trying to do; PostgreSQL is much readier for this than some other SQLs).

      By the way, this is a translation to standard Muldis D syntax of what you said (and you can define other syntaxes):

      to_Array( foo matching \%{bar:'baz'} on \~{id,bar}, ordered_by : \~{id} )

      In the above example, the "matching" (semijoin) and "on" (projection) functions are called using infix syntax while the "to_Array" function is called using standard function call syntax with 1 positional and 1 named argument. Also, "foo" is the name of a relation-typed variable, \%{...} is a tuple literal and \~{...} is a literal for a list of attribute names. The ordering syntax is necessarily more complicated than the other operations, and is also more subject to change. Note that arrays and relations are disjoint types, so you'll have to explicitly convert back if you want to use relational operators on the result. Users can define their own functions with the same capabilities of any built-in ones, and their own types.

    4. I don't really have a timeline yet. The first step is to draw up a spec. I have a pretty good idea. Then my thinking is to use Parse::RecDescent to create a parser.

      I am actually thinking of two approaches here. The first is a pipeline approach (given this, give me the following transforms) and the second an algebraic approach, for example PROJECT (foo, bar)(SELECT (id > 1000)(foobarbaz)). I suspect the latter might be easier if I want to do something in Common Lisp ;-), but....

      After this series is done, maybe I can start discussions on a spec and then we can go from there. My thinking is that after that we can talk implementation. I would be interested in using your projects as a back-end.

  4. Chris, okay, sounds like a plan. I look forward to seeing what you come up with later, and discussing it.

    Speaking for myself, while it can be parsed functionally, I eventually ruled out the use of any parser-building tools like Parse::RecDescent for the Muldis D parser and opted to do it manually and in a quasi-procedural fashion. One main reason is that I wanted my parser to be able to operate in a streaming fashion and handle source code text of arbitrary size, for example database dumps, without using much memory to do it. I don't know of any parser-building tools that don't require the entire source and parse tree to be in memory at once. So while at the user level the parser is essentially just a function that takes a character string as input and returns a data structure as output, internally I would be using a pull-based pipeline, e.g. one stage turns the source text into a sequence of tokens, another interprets some meaning of those tokens, another does some tree building, etc.

    Another reason is that, conceptually speaking, I'm not sure how to express some aspects of the grammar in the form PRD and similar tools need, but I do know how to do it as a pipeline. For example, given a bareword sequence "foo bar baz etc" I don't know how to express rules in a parser toolkit for which of those is an infix function and which is a variable, but I know how with a pipeline. (It is an intentional feature that the parser generally has no knowledge of what operators or types or variables etc. are defined in the system, including built-ins, and just parses on what things look like; binding to actual operators or variables happens after the parse is completely done.) I'm sure a more savvy person than I can come along later and translate the parser to that form if useful.
