Tuesday, August 19, 2014

Math and SQL Part 5: Projection and Selection

The SELECT statement is the workhorse of SQL.  Updates and inserts are necessary, but selects are where the power is.  One of the significant issues many people have in understanding these and using them is a clear understanding of the math involved, in part because there are a large number of implicit possibilities in the syntax.  In general folks learn to avoid the implicit aspects of the statement, but there are some implicit odditities that can't go away because they are baked into the language.  One of these is the confusion between selection and projection, and because SQL operates on bags instead of sets, neither of these work in SQL quite like they do in relational math.

In "Relational Theory and SQL," Chris Date says that tuples are unordered.  Tuples are strongly ordered, but what Date is getting at is that a relation has no natural ordering of columns, and therefore from any relation with a given ordering, a new relation can be created with a different ordering.  This process is called 'projection' and it is entirely implicit in SQL.  In the interest of being concise, in that book, Date does not stick with his typical approach of starting from clear definitions and discussing what these mean.

This being said, understanding projection, selection, and the difference between them makes understanding relational databases far easier in my view.

Because SQL operates on bags, I will use both SQL (for bag examples) and Set::Relation (for set examples) in order to discuss the math differences.  I will finally offer an idea of what a clearer English rendition of SQL's math semantics would be.

I:  Projection - Transform One Relation into Vertical Subset


Projection, represented by a Pi character (π) creates a set of ordered tuples with an ordering based on the operation.  What projection does, essentially, is take a subset (or the whole set) from each tuple, possibly re-ordering it in the process, in order to transform one relation into a derived relation.

In relational algebra, projection takes a set of tuples and returns a set of tuples with a subset of fields (possibly re-arranged).

For example, consider a relation R, where the tuple structure of R is (employee_id, first_name, last_name, badge_number, office_id, tax_id)


(employee_idfirst_namelast_nameoffice_idtax_id)
(1EricYoung1123-45-6789)
(2ThomasYoung1324-55-3334)
(3JulieJames1443-45-6234)
(4JaredYoung2533-44-5442)
(5ToddKals2532-44-2332)

Suppose the data looks like this:

Then we could write π(last_name, office_id)(R), we would get:
(last_nameoffice_id)
(Young1)
(James1)
(Young2)
(Kals2)

Note that there is one fewer row in the output table because in both cases, you have a well-formed set.

The equivalent of the previous expression in SQL would be:

SELECT DISTINCT last_name, office_id FROM R;

From these examples a few things should be fairly clear.  First tuples are strongly ordered and we give human-readable labels as a nicety, but that projection can be used to reorder the tuple elements arbitrarily. Consequently tuples are consistently ordered within a relation, but that order is largely arbitrary.  Consequently tuples cannot be said to have a natural order.  Instead they derive their structure and order from the relation they are a part of.

Unfortunately SQL makes all kinds of implicit tuple transformations either by implicit projection or by some sort of reverse operation (I will call that "overlay" for the moment).  Because projection is entirely implicit in SQL, it is never stated.  The SQL statement above in fact doesn't perform a selection operation but merely performs a projection.

II:  Selection - Turn a Relation into a Horizontal Subset


Selection takes a relation and provides a subset of it based on arbitrary criteria.  The structure of the relation is unchanged, but a subset of the correspondences are in the new relation.  In other words, it selects tuples for further processing, and since the relation is a set, it is guaranteed to give a subset out.  It is represented by the sigma character (σ) and the subscript gives the equality condition.

Given the same relation R, we can perform σoffice_id=1(R) and get:

(employee_idfirst_namelast_nameoffice_idtax_id)
(1EricYoung1123-45-6789)
(2ThomasYoung1324-55-3334)
(3JulieJames1443-45-6234)

Simple right?

In SQL terms, a plain selection is what is found in the where clause.  So the equivalent in SQL is simply:

SELECT * FROM R WHERE office_id = 1;

As you can see from these examples, selection and projection are transformed in SQL in some relatively non-intuitive ways.  At the same time, understanding relational databases demands a basic understanding of these two concepts.

Some Notes on Set::Relation


In the course of putting this together I spent a little time with Set::Relation.  I found that the module seems to behave relatively well.  The projection method combined with members() outputs a set.  This was as  far as I took it in my quick spin.  As per the discussion below, Set::Relation operates under the assumption that relations are well-formed sets instead of bags.  This can be very helpful in operating in a highly formal relational environment.

Selection could be done using the semijoin method, or using members() and grep.

Next:  On the Great Null Debate


As a sort of postscript, I have decided not to include strong definitions of joins in this.  The orthography doesn't lend itself well to HTML blogs, and the basic concepts are reasonably straight-forward from an SQL perspective (the exception perhaps being antijoins).  At any rate it is not worth the effort.

Thursday, August 14, 2014

Math and SQL part 4: Basic Data Types. Sets, Tuples, and Bags

A reasonable understanding of the relational model requires understanding the basic data types which make it up, both in the idealized model and in real-world applications.  This post discusses both the idealized model and the accommodations the standard implementations of it make to the messiness of the real world.  We won't deal with NULLs here.  That will be the subject of a future entry.

I Relation and Set


At its root, the relational model is about relations.  Consequently a relation needs to be understood on a mathematical level.  Since a relation is a form of set, let's explore sets first.

A set is an unordered list of unique items.  Thus, {a, b} is a set, but {a, b, b} is not (it is a multi-set or bag, however, see below).  Additionally since it is unordered, {a, b} = {b, a}.  We can project ordering onto a list for our convenience but this does not affect the underlying mathematics.  Sets can be finite or infinite.  In databases, we focus on finite sets (infinite sets can be calculated, but cannot be stored in set form on a finite filesystem).  Infinite sets are for math libraries and don't really concern us here.

A relation is a set of propositions, of correlating facts.   These are represented in tuples (see below). At this point, just recognize that a tuple represents a correlation between two or more facts.  The individual facts are also trivial functions of the tuple as a whole, because each tuple is necessarily unique, each fact is a function of the correlation.

Consider a point in the format of (x, y). For any given point, x and y are trivial functions of it.  For (2, 2), x=2 is trivially true, as is y=2.  This seems tautological (and on one level it is) but it is also important because it gets to the basic operation of a relational database.  For more on this, see part 2 of this series.

II Row and Tuple


A tuple is an ordered set of items.  Unlike a set the order is important and repetition is allowed.  (1, 2, 3), (1, 3, 3), and (1, 1, 1) are all valid tuples.  In tuples, the order ascribes a value to a meaning.  For example, if the preceding tuples referred to points in three dimensional Euclidean space, the first would correspond to the x axis, the second to the distance along the y axis, and the last to the distance along the z axis.   This means that points (1, 2, 3) and (2, 1, 3) are different, and not equal (if they were sets with the same members they would be equal).

While relations are open ended sets (which presumably can be added to indefinitely), tuples are closed-ended.  While one can add more members, this involves a semantic shift in the data.  For example, points (1, 2) and (1, 2, 3) share the same x and y axis but a two dimensional space is different from a three dimensional space semantically.

Relational algebra works on the set level, and perform set operations.  They do so by providing basic set operations across an entire set.  However the comparisons are done on tuples and their members. 

This point cannot be overemphasized.  Ideal relational databases operate on sets and tuples.  Everything else is secondary.  A well designed database will build on this basic point.  As a caution, SQL works on bags, not sets, so some care is needed to make this happen.

III Tuples, Sets, and Relations in Perl


Tuples and sets of course are generally useful in programming.  Understanding these is often a key to better performance, and in Perl this is somewhat counterintuitive.

A naive view from looking at the database would be to see a tuple as a %hash and a set as an @array.  In fact, it is the other way around.  Understanding this is helpful because if you really want a set, a hash is often a better tool than an array in Perl.

Arrays in Perl are ordered and allow for duplicates.  In this regard they are a tuple or a sequence.  Hashes on the other hand do not allow for duplicate keys and the keys are unordered.  Therefore, where you want a set, a hash is usually a better tool, and where you want a sequence or a tuple, an array is a better structure.

Suppose we want to determine which members of one sequence are not found in a specific set (currently represented in an array).  We might do something like:

my %sethash;
$sethash{$_} = 1 for @setarray;
my @excluded = grep {not exists $sethash{$_}} @sequencearray;

This will perform much better than:

my @excluded;
for my $element (@sequencearray){
    push @excluded, $element unless grep {$_ eq $element} @sequencearray;
}

In Perl, set operations are usually easier to make perform well when using hash tables instead of arrays for sets.  This is because hash tables are unordered and enforce uniqueness among their keys.  Thus they are better optimized for the sorts of access patterns set operations are likely to use.

IV Set and Bag


SQL for better or worse generally compromises on the set operation and uses multi-sets (or bags) instead.  A multiset is like a set but it allows duplicates.  This is a compromise to allow for real world data to be easily input into the database.  In import routines, this is important because data may not be perfectly unique, and this may require some processing during the import process to sanitize.

Unfortunately, set operations and bag operations are different and produce different results.  As a preview to the next in the series, we could project a series of operations and get duplicates.  For example, suppose we project y from y = x2.  In this approach we will get duplicate values of y, because when we square negative numbers we get the same result as when we square the positive numbers.  In a set operation, we would just discard the duplicates.  In a bag operation, we return the duplicates.

This distinction is key to understanding the difference between databases in the ideal world vs in the real world.

V A Note on Ordering


As you will note, sets are unordered.  Ordinality is not important.   We can order tuples thus turning them into a sequence (for purposes of limit and offset) but when we do so, we can still revert back to a set for further set operations.  Ordering is mathematically complicated in databases, but it is intuitively easy.  The key thing to remember is that ordering creates a sequence from a set, and that we have to create a set from the sequence before we do further set operations.

It is for this reason that the results of the following query will be in an order dependent on the query plan:

SELECT f.foo, b.bar
  FROM (select foo, id from foobar order by transdate desc limit 20) f
  JOIN barbaz b ON f.id = b.foo_id;

We have a subquery which generates a sequence, but it returns a set of the members of that sequence.  Those are then used in further set operations, and the order returned depends on how the planner decides to do those operations.

Next:  Projection and Selection

Monday, August 11, 2014

Math and SQL part 3: MVCC, Immutability, and Functional Programming

While first normal form is pretty much restricted to relational operations, this piece is more general.  It looks at the basic promise of functional programming, and the way that this applies to PostgreSQL.  In the mean time, we will also look at a fully functional approach to a key-value store in Perl as a close analogue.

I The Basic Promise of a Function


We should start by reminding ourselves of what a function represents: the promise that if you know one or more pieces of information, you can derive or look up another piece of information.

That promise is behind all functional programming, and in areas which are close to the math (such as relational databases) it is especially important.  This basic promise is important, but it also changes the way we have to think about programs.  Rather than computer programs doing things, they merely calculate things.  Whatever is done in the real world as a result is a byproduct of that calculation.

In general, a simple way to think of functional programming is whether output is solely dependent on input, but in relational math, this actually becomes a remarkably blurry line.  With data which can be looked up or stored, this line is often drawn at the point of mutability, because otherwise it is very difficult for routines to share data structures.  This means that data and state which cannot change during the function's existence does not count against it being a function.

Now, obviously a functional view of programming is pretty limited.  While a network server might listen to requests and return a function of input, side effects in the control flow become impossible in a purely function-oriented approach.  For this reason many functional programming environments I have worked with have imperative programs which link the calculations to real-world actions and side effects.

II How State Complicates Programming and Testing


Application state is a major source of bugs in any program.  Very often unforeseen circumstances arise because of slow changes the application's internal control data (called "state") undergoes.  These changes are cumulative, and if not carefully controlled, can lead to eventual inconsistencies which result in operational problems.

Testing a function is far easier than testing a mutable relation.  The number of data points to test are lower if you can do bounds tests on each change of state rather than only at the initial checking.  Concurrency is also easier if you adopt a copy-on-write model (instantiating new objects or other data structures when they are modified).  This avoids worrying about locks when sharing access because writes do not block reads.

III A Functional Key Value Store in Perl


In a past post, we looked at a simple key-value store for Perl.  Now it is time to flesh this out and look at immutability in order to make it more functional.  This new version has a number of important tradeoffs.  While it uses more memory than the previous version, it is far more concurrency safe because we copy on write, and it is easier to make guarantees since everything is immutable.

package MyKVS;
use overload 
    '+' => 'add';
    '-'  => 'remove';

sub new {
     my ($class, $initial) = @_;
     $initial = {} unless ref $initial =~ /HASH/ and not keys %$initial;
     bless $initial, $class;
     return $initial;
};

sub add {
    my ($self, $other) = @_;
    my $new = {};
    $new->{$_} = $self->{$_} for keys %$self;
    $new->{$_} = $other->{$_} for keys %$other;
    return $self->new($new);
}

sub remove {
    my ($self, @other) = @_;
    my $new = {};
    for my $k (keys %$self){
        $new->{$k} = $self->{$k} unless grep {$_ eq $k}, @other;
    }
    return $self->new($new);
}

sub get_value {
    my ($self, $key) = @_;
    return $self->{$key};
}

You will then note that the key value store creates a new key value store on each change.  This means that if we go through the public API, the key value store is mutable for its lifetime.  The code above has not been tested, is probably not the best Perl code in the world, etc. but illustrates the concepts of copy on write fairly well.  Nothing above requires any locking under any circumstances.

We can then use it like:

use MyKVS;
use 5.010;

my $kvs = MyKVS->new({foo => 'bar', 1 => 1});

$kvs += {2 => 'foobar', cat => 'dog'};
my $kvs2 = $kvs + {dog => 'lion'};
$kvs2 -= 2;

say $kvs->get_value('cat');
say $kvs2->get_value('dog');

This will print:
dog
lion

The keys in $kvs are now foo, 1, 2, and cat.  Those in kvs2 are foo, 1, cat, and dog.

IV:  Mutability Challenges


One of the major challenges here is that state ends up having to change over time, and often has to be shared.  Functional programming offers a number of solutions to this problem but all involve essentially handing out immutable structures and copying those when changed.  These principles are not entirely at odds with having some sort of shared state.

This has a number of possibilities, which are beyond the scope of this piece, but this notion of copying on write is fundamental to understanding MVCC on PostgreSQL.

So back to the Key Value Store example, suppose we want to add an ability to share key value stores across an entire application.  We may add then two more methods, optionally in a closure, along with a lexically scoped kvs for the module or methods.

Now, since we copy on write, the only place we need to lock is on the update.  Since our locking strategy will depend on our approach.  In this module the store is static to the module and not shared.  This means we do not need to lock.  If we did, however, we would only lock the update function.  No other functions would need a lock.  In such a way, writers may block writers, but writers never block readers (or vice versa).

{    # start closure

my $kvs = __PACKAGE__->new({});

sub get_store {
     return $kvs;
}

sub update {
     my($self, $update) = @_;
     $kvs += $update;
}

}    # end closure 

Now with these methods something magical happens.  When I update, the local reference hashref gets updated, but everyone getting a hashref gets a private copy, which is effectively immutable (unless one breaks encapsulation). Moreover if one wants to update the global store and get a new updated store, it is as simple as:

$kvs = $kvs->update({key => 'value'});

And now we have a crude multi-version concurrency control for our key value store.  If we wanted to take this further we would store the kvs in shared memory and provide appropriate locking.  Note that locks would only be required on write and retrieving the master copy.

V:  How MVCC Works in PostgreSQL


MVCC is a major feature of modern relational database management systems and it refers to the ability to maintain consistency between requests, handing out older versions of data when necessary to show a consistent snapshot at a specific point in time.  That specific point in time differs based on isolation levels.

MVCC works on the same basic principles we have been discussing here.  Data is copied on write (so there is no difference between an update and delete followed by a select, except that integrity checking doesn't happen on the delete portion).  The database keeps the required bookkeeping information needed to give proper views of the information and then delivers it on request.  Thus inserts are not visible until commit, deletes merely mark data as obsolete, and updates do both.  This avoids a large number of concurrency issues by treating data as immutable and subject to garbage collection.

This basic idea is implemented differently in other database systems of course.  Some physically move rows in anticipation of commit or rollback (using a rollback segment), but PostgreSQL simply stores the rows in the tables, whether obsolete or not, and later collects those via autovacuum (or periodic vacuuming).

The same principle in all cases though is a logical copy of the data being made when the application reads it and another being made when the data writes.  The transaction isolation level affects when this copy happens and which copy the application sees, but not the basic process.

Next:  Basic Foundational Data Types: Sets, Tuples, and Bags

Wednesday, August 6, 2014

Math and SQL, part 2: Functions and First Normal Form

There is probably no piece of relational database theory which is so poorly understood in professional literature than first normal form.  This piece seeks to ground an understanding of 1NF in set mathematics not only so that it can be better leveraged but also so that one can be aware of when (and when not) to violate it.

I:  Defining First Normal Form


In the previous post, we talked about the definition of function and relation.  This is key to understanding first normal form.

First Normal Form requires two things, as defined by Codd:
  1. Each relation is a well-formed relation (i.e. a well-formed set of fact correlations). 
  2. No element of the relation (i.e. correlated fact) is itself a set.  Therefore sets within sets are not allowed.
By extension the second prong prevents elements of the relation from having forms which are readily reduced to sets.  For example, an unordered comma separated list of values violates 1NF.  A clearer way of putting the second prong is:

Each element of a relation must contain a single value for its domain.

A clear mathematical way of looking at First Normal Form is:


  1. Each relation is a set of tuples.
  2. No tuple contains a set.


II: Sets and Ordinality


Let's further note that one defining feature of a set is that it has no ordering.  More formally we would say that sets have no ordinality.  {1, 2, 3} = {2, 1, 3}.  For this reason, no duplicate rows may be allowed, as that means that a relation is not a well-formed set.  

Tuples (the correspondences which make up the relation) do have mandatory ordering, and allow duplication.  However the ordering is not "natural" and therefore can be specified in relational operations.  In relational algebra, the project operation allows you to generate one tuple ordering from another.

Ordinality is an extremely important concept regarding normalization and arrays in the database, as we will see with SQL arrays.

III: SQL Arrays, Ordinality and 1NF


An SQL array is actually a mathematical matrix, carrying with it all the principle properties of this.  In particular, two basic constraints exist:

  1. SQL arrays are ordinal, so [1, 2, 3] != [2, 1, 3]
  2. Multi-dimensional arrays must be well-formed matrices.  I.e. [[2, 1, 3], [3, 2, 1]] is ok but [1, 2, [3, 2]] is not.
Arrays can be used in a number of different ways.  Some of these violate first normal form.  Others do not.  For example the following does not violate 1NF (though I am deliberately avoiding PostgreSQL's built-in types for networking):

CREATE TABLE network_addrs (
    host_id int primary key,
    hostname text,
    ip_as_octets int[]
);
INSERT INTO network_addrs(1, 'myhost', ARRAY[192,168,1,124]);

Because [192,168,1,124] and [124,168,1,192] are not equivalent, first normal form is not violated.

On the other hand consider:

CREATE TABLE web_page (
      file_name text primary key,
      content text,
      tags text[]
);
INSERT INTO web_page ('home.html', '<html></html>', ARRAY['foo', 'bar']);

Because [foo, bar] is equivalent to [bar, foo], we have a 1NF violation.  Now, in PostgreSQL, there may be good reasons to do this, but there are significant costs in any RDBMS as well.

The basic tradeoff in PostgreSQL in the latter version is that you gain the possibility of optimizing select queries for tags using GIN indexes.  However, deleting a tag from the system becomes very painful.  The difference between the one which violates 1NF using tags and the one which properly uses arrays for IP addresses is that I am only going to manage the IP address as it exists at once.  I will never be modifying or deleting octets individually.  For web pages, however, I may need to reach into the array with SQL to update or delete elements.  

This is a good way of thinking about when arrays violate 1NF, and it presents an understanding of some practical problems that come into play when 1NF is violated, but doesn't quite match the theory when tuples are stored as relational elements.  More on this will be covered below.

IV:  1NF, Candidate Keys, and Functional Dependencies


Since a 1NF relation is a set, it has no duplicates.  Thus, the relation itself can be used as a domain for a function, the result of which is its members.  Thus the whole tuple can be said to be a candidate key, and every member a trivial function of it.

Interestingly PostgreSQL supports queries notating columns as functions of relations.  While we might normally run something like:

select proname, nspname 
  from pg_proc p 
  join pg_namespace n ON p.pronamespace = n.oid;

We can also write it as:

select proname, nspname 
  from pg_proc p 
  join pg_namespace n ON pronamespace(p) = oid(n);

These notations are actually directly equivalent, and while the second is sufficiently non-standard that I would probably avoid it for production code, it has the advantage of making the relationship between function and attribute quite clear.

V:  Notes about Tuples in Tuples


Now, technically storing tuples inside of tuples in relations never violates first normal form by itself (though arrays of tuples might be if they are not ordinal).  This is because tuples are always ordinal.  Nonetheless, storing tuples inside of tuples can result in some problems which look very much like 1NF violation problems.  The same holds true with JSON, and other programming structures.  These don't violate 1NF themselves, but they can cause problems of a similar sort depending on how they are used.

In general, a decent rule to live by is that if a tuple will be updated as a whole, problems don't exist but that problems begin to exist when a tuple may be updated (logically) by merely replacing elements in it.  If one thinks of tuples as immutable (more on this in an upcoming post), these problems are solved.

This is an area where functional programming as a mentality makes database design a lot easier.  One should think of tuples as immutable correlated data structures, not as internal state of the system.

VI:  When to Break 1NF


There are many times when 1NF must be broken.  

In the real world we don't often get to verify that our data conforms to rigid mathematical definitions before we load it.  For this reason, the non-duplication rule often must be ignored for data import workflows.  In these cases, one loads the data with duplicates, and filters them out with subsequent operations.  This is acceptable.

Additionally there are cases where the workloads are such that, with PostgreSQL, GIN indexes can give you major performance boosts when you need it.  This comes at a cost, in that deleting a valid element from the system can no longer perform well, but sometimes this cost is worth it for relatively unimportant and freeform data.

VII:  Perl modules and Related Concepts


In writing this I came across a number of Perl modules which are closely related to this post.

The first is Set::Relation, which implements basic relational logic in Perl.  This is worth looking at if you are trying to see how to implement set logic without an RDBMS.  A very nice feature is that by default, relations are immutable, which means you have an ability to use them in a fully Functional Programming environment.

The second is Math::Matrix, which gives you a library for matrix mathematics.  For integer and float arrays, this allows you to incorporate this into the database back-end using PL/PerlU procedures, and would be particularly useful when combined with SQL language arrays (either in the back-end or middle-ware).

VIII:  Final Thoughts


First Normal Form is often treated as a given in relational theory.  However since most books aimed at database professionals don't cover the mathematical basis for the model, it is generally poorly understood.  Understanding First Normal Form however is the beginning of understanding relational logic and relational database systems.  It therefore deserves much more exploration than the brief space it is usually given.

Understanding 1NF is particularly important when one decides to use a non-1NF database design.  These have significant tradeoffs which are not evident to beginners.  However, this is only the starting point.  In future discussions we will cover related aspects which give a more complete understanding of the basics of the system.

Next Week:  Immutability, MVCC, and Functional Programming

Saturday, August 2, 2014

Math and SQL, part 1: Introduction: Relations and Functions

There are two goals I have in this series.  The first is to heavily ground database thinking in mathematics thinking.  The second is to connect this more explicitly to functional programming which attempts to do the same more generally.  The basic foundation of functional programming and relational theory is the same.  While this may seem very rudamentary for Planet Postgresql, I am hoping that the information here will be useful in theoretical courses, etc.

This series will also discuss uses of the same ideas in Perl.  Many of the same ideas are extremely important for developers in many languages.  Consequently these will all be tagged both PostgreSQL and Perl and most will include perl code as well as PostgreSQL code.

I:  SQL as Relational Math Implementation


In basic terms, SQL provides an interface for relational database systems.  The most powerful features of it are essentially relational algebra-based.  In order to better handle concurrent connections, there is a lot missing from relational math, and in order to handle real-world data, there are features in relational algebra not present in relational math.

However for purposes of this series, I will focus on PostgreSQL's SQL interpretation and I will use non-standard syntax features when that makes the math clearer.  I like PostgreSQL in this regard because it means that the math is fairly clear.

Code samples explaining ways to accomplish various operations will be provided in Perl.  Note that these are mostly fairly naive versions.  If one were really creating an RDBMS in Perl, the code would probably be very different.

II:  Relational Math, Relations and Functions


At the core of both databases and functional programming are the concepts of relations and functions.    At their foundation, these are fairly old concepts.  Since a function is a subset of relation, let's start by defining and understanding what a relation is:

A relation is a set of correlated facts.

For example, let's look at two relations, one showing square roots and one showing squares:


  • For square roots:
    • 1: 1
    • 1: -1
    • 4: 2
    • 4: -2
    • 9: 3
    • 9: -3
  • For squares
    • -3: 9
    • -2: 4
    • -1: 1
    • 0: 0
    • 1: 1
    • 2: 4
    • 3: 9
Both of these are relations. They provide a list of facts which correlate.  A square root of 1 is 1.  Another square root is -1.  -1 squared is 1, and so forth.

In programming terms, one may thing of a relation as being the possible output of a subroutine.  In relational databases, however, a relation is a set of correlating facts.  Typically this is implemented as a "table."

On to functions:

A function is a relation where the fact of the relation are fully determined by the domain of the function.  Therefore if you know the fact from the domain, there is only one related fact.

This definition is key to understanding both functional programming and relational databases.  It underlies almost all aspects of the both disciplines.

The following are mathematical functions:

  1. f(x) = x + 1
  2. f(x) = 2 * x
  3. f(x) = x2
These are functions because for any given x there is exactly one f(x).

So things like random() are not functions but rather very large relations.  Similarly now() is not a function because it has no input.  Programming in a way that emphasizes functions generally leads to more easily testable code, fewer bugs, and fewer concurrency problems than programming in a way which emphasizes state.

Now, in database terms, this all becomes important because of the fact that a function is a subset of relation.  Therefore we can speak of a functional dependency within a relation.   Put simply, a candidate key is a subset of a relation which, as a domain, turns the relation into a function.  All facts in a relation are a function of each candidate key.

So let's look at one example for how this works in Perl.  We will use a hash table, which is basically a key/value store.  A hash table is an unordered relation in Perl, such that the value is a function of the key.  In this example, we will mark this up to be a little more transparent mathematically, though in production.

package KeyValueStore;

my %store;

sub set {
    my ($key, $value) = @_;
    $store{$key} = $value;
}

sub remove {
    my $key = shift;
    delete $store{$key};
}

sub get {
    my $key = shift;
    $store{$key};
}

package main;
use KeyValueStore;
KeyValueStore::set('foo', 'bar');
KeyValueStore::set('1', '2');

print KeyValueStore::get('foo');
1;

Now, in the above code, set and remove define the relation, and get returns a function of the key, based on the relation at any given time.  Functional programming would not see this as a function because of state and concurrency issues, but from a basic math perspective, given a relation (%store in this case), this acts as a function.  There are of course ways to make the above work within the limits of functional programming.  These require turning set and remove into mathematical functions and changing the way we store the information.  In a later piece, when I talk about MVCC, I will talk about how to do this.

What is important here is that for a given relation, a fact is a function of a key.  This is is then usually said in a slightly different way, namely that a fact is functionally dependent on a key.  Functional dependency is the normal label because people frequently use the term "function" in databases in a non-mathematical sense.

Next week:  Functions and First Normal Form