Monday, August 11, 2014

Math and SQL part 3: MVCC, Immutability, and Functional Programming

While first normal form is pretty much restricted to relational operations, this piece is more general.  It looks at the basic promise of functional programming, and the way that this applies to PostgreSQL.  In the mean time, we will also look at a fully functional approach to a key-value store in Perl as a close analogue.

I The Basic Promise of a Function


We should start by reminding ourselves of what a function represents: the promise that if you know one or more pieces of information, you can derive or look up another piece of information.

That promise is behind all functional programming, and in areas which are close to the math (such as relational databases) it is especially important.  This basic promise is important, but it also changes the way we have to think about programs.  Rather than computer programs doing things, they merely calculate things.  Whatever is done in the real world as a result is a byproduct of that calculation.

In general, a simple way to think of functional programming is whether output is solely dependent on input, but in relational math, this actually becomes a remarkably blurry line.  With data which can be looked up or stored, this line is often drawn at the point of mutability, because otherwise it is very difficult for routines to share data structures.  This means that data and state which cannot change during the function's existence does not count against it being a function.

Now, obviously a functional view of programming is pretty limited.  While a network server might listen to requests and return a function of input, side effects in the control flow become impossible in a purely function-oriented approach.  For this reason many functional programming environments I have worked with have imperative programs which link the calculations to real-world actions and side effects.

II How State Complicates Programming and Testing


Application state is a major source of bugs in any program.  Very often unforeseen circumstances arise because of slow changes the application's internal control data (called "state") undergoes.  These changes are cumulative, and if not carefully controlled, can lead to eventual inconsistencies which result in operational problems.

Testing a function is far easier than testing a mutable relation.  The number of data points to test are lower if you can do bounds tests on each change of state rather than only at the initial checking.  Concurrency is also easier if you adopt a copy-on-write model (instantiating new objects or other data structures when they are modified).  This avoids worrying about locks when sharing access because writes do not block reads.

III A Functional Key Value Store in Perl


In a past post, we looked at a simple key-value store for Perl.  Now it is time to flesh this out and look at immutability in order to make it more functional.  This new version has a number of important tradeoffs.  While it uses more memory than the previous version, it is far more concurrency safe because we copy on write, and it is easier to make guarantees since everything is immutable.

package MyKVS;
use overload 
    '+' => 'add';
    '-'  => 'remove';

sub new {
     my ($class, $initial) = @_;
     $initial = {} unless ref $initial =~ /HASH/ and not keys %$initial;
     bless $initial, $class;
     return $initial;
};

sub add {
    my ($self, $other) = @_;
    my $new = {};
    $new->{$_} = $self->{$_} for keys %$self;
    $new->{$_} = $other->{$_} for keys %$other;
    return $self->new($new);
}

sub remove {
    my ($self, @other) = @_;
    my $new = {};
    for my $k (keys %$self){
        $new->{$k} = $self->{$k} unless grep {$_ eq $k}, @other;
    }
    return $self->new($new);
}

sub get_value {
    my ($self, $key) = @_;
    return $self->{$key};
}

You will then note that the key value store creates a new key value store on each change.  This means that if we go through the public API, the key value store is mutable for its lifetime.  The code above has not been tested, is probably not the best Perl code in the world, etc. but illustrates the concepts of copy on write fairly well.  Nothing above requires any locking under any circumstances.

We can then use it like:

use MyKVS;
use 5.010;

my $kvs = MyKVS->new({foo => 'bar', 1 => 1});

$kvs += {2 => 'foobar', cat => 'dog'};
my $kvs2 = $kvs + {dog => 'lion'};
$kvs2 -= 2;

say $kvs->get_value('cat');
say $kvs2->get_value('dog');

This will print:
dog
lion

The keys in $kvs are now foo, 1, 2, and cat.  Those in kvs2 are foo, 1, cat, and dog.

IV:  Mutability Challenges


One of the major challenges here is that state ends up having to change over time, and often has to be shared.  Functional programming offers a number of solutions to this problem but all involve essentially handing out immutable structures and copying those when changed.  These principles are not entirely at odds with having some sort of shared state.

This has a number of possibilities, which are beyond the scope of this piece, but this notion of copying on write is fundamental to understanding MVCC on PostgreSQL.

So back to the Key Value Store example, suppose we want to add an ability to share key value stores across an entire application.  We may add then two more methods, optionally in a closure, along with a lexically scoped kvs for the module or methods.

Now, since we copy on write, the only place we need to lock is on the update.  Since our locking strategy will depend on our approach.  In this module the store is static to the module and not shared.  This means we do not need to lock.  If we did, however, we would only lock the update function.  No other functions would need a lock.  In such a way, writers may block writers, but writers never block readers (or vice versa).

{    # start closure

my $kvs = __PACKAGE__->new({});

sub get_store {
     return $kvs;
}

sub update {
     my($self, $update) = @_;
     $kvs += $update;
}

}    # end closure 

Now with these methods something magical happens.  When I update, the local reference hashref gets updated, but everyone getting a hashref gets a private copy, which is effectively immutable (unless one breaks encapsulation). Moreover if one wants to update the global store and get a new updated store, it is as simple as:

$kvs = $kvs->update({key => 'value'});

And now we have a crude multi-version concurrency control for our key value store.  If we wanted to take this further we would store the kvs in shared memory and provide appropriate locking.  Note that locks would only be required on write and retrieving the master copy.

V:  How MVCC Works in PostgreSQL


MVCC is a major feature of modern relational database management systems and it refers to the ability to maintain consistency between requests, handing out older versions of data when necessary to show a consistent snapshot at a specific point in time.  That specific point in time differs based on isolation levels.

MVCC works on the same basic principles we have been discussing here.  Data is copied on write (so there is no difference between an update and delete followed by a select, except that integrity checking doesn't happen on the delete portion).  The database keeps the required bookkeeping information needed to give proper views of the information and then delivers it on request.  Thus inserts are not visible until commit, deletes merely mark data as obsolete, and updates do both.  This avoids a large number of concurrency issues by treating data as immutable and subject to garbage collection.

This basic idea is implemented differently in other database systems of course.  Some physically move rows in anticipation of commit or rollback (using a rollback segment), but PostgreSQL simply stores the rows in the tables, whether obsolete or not, and later collects those via autovacuum (or periodic vacuuming).

The same principle in all cases though is a logical copy of the data being made when the application reads it and another being made when the data writes.  The transaction isolation level affects when this copy happens and which copy the application sees, but not the basic process.

Next:  Basic Foundational Data Types: Sets, Tuples, and Bags

5 comments:

  1. Chris, a couple things:

    1. I think you have a typo in part V; my correction shows capitalized in this quote: "Data is copied on write (so there is no difference between an update and delete followed by AN INSERT,"

    2. I want to point out that your Perl examples can be made considerably simpler (and probably faster using Perl's built in functional programming features. For add() you can say: "my $new = { %{$self}, %{$other} };" for the same behaviour. For remove() you can say "my $other_h = { map { ($_=>undef) } @other; my $new = { map { ($_=>$self->{$_}) } grep { exists $other_h->{$_} } keys %{$self} };" for the same behaviour.

    ReplyDelete
  2. I made my own typo; the "exists" should be "not exists" in my comment.

    ReplyDelete
  3. Thanks for the insights, Chris. Beginning with solving a problem in application, and then showing how an RDBMS already solves that same problem gives the readers a better appreciaition of the databases they use.

    @Darren, the Perl code you propose may perform better, but it makes it non-Perl developers hard to reason about the problem. I did not read _any_ code in this blog post (I am not a Perl guy), and I still was able to get what Chris was trying to convey.

    ReplyDelete
  4. Gurjeet, my code suggestion wasn't really about the performance (and it might not actually perform better). Rather, its point was more demonstrating the relational or SQL paradigm, which is about functions, in a more general-purpose language, Perl, and it was also about making the code easier to understand. One doesn't normally perform set operations in SQL etc using procedural language, and I'm showing you can do it a more natural set way in Perl too.

    ReplyDelete
    Replies
    1. I think for loops are usually clearer than map for things like this especially when concatenating lists. This being said, I do agree that map would be better here if I was writing beyond general uses because it gives you effectively an expression to operate over rather than a series of instructions.

      Delete