Wednesday, September 17, 2014

PGObject Cookbook Part 2.1: Serialization and Deserialization of Numeric Fields

Preface


This article demonstrates the simplest cases of automatic serialization and deserialization of objects to and from the database in PGObject.  It also demonstrates a minimal subset of the problems that three-valued logic introduces, and the most general solutions to those problems.  The next article in this series will address more specific solutions and more complex scenarios.

The Problems


Often we want database fields automatically turned into object types which are useful to an application.  The example here turns SQL numeric fields into Perl Math::BigFloat objects.  However, the transformation isn't perfect and, if not carefully done, can be lossy.  Most application types don't support database NULLs properly, so a NULL making a round trip may end up with an unexpected value if we aren't careful.  We therefore have to create our type in a way which can make round trips in a proper, lossless way.

NULLs introduce another subtle problem with such mappings, in that object methods are usually not prepared to handle them properly.  One solution is to follow the basic functional programming approach and copy on write, which prevents a lot of problems.  Most Math::BigFloat operations do not mutate their objects, so we are relatively safe there, but we still have to be careful.

The simplest way to address this is to build a basic sensitivity to three-valued logic into one's approach.  Even this poses a number of problems, however, in that one can accidentally assign a value in one place that silently changes behavior elsewhere.

A key principle of all our types is that they should handle a NULL round trip properly for the data type, i.e. a NULL from the db should be turned back into a NULL on database insert.  We generally allow programmers to check the types for NULLs, but we don't explicitly handle them with three-valued logic in the application (that's the programmer's job).

The Example Module and Repository


This article follows the code of PGObject::Type::BigFloat.  The code is licensed under the two-clause BSD license, as is the rest of the PGObject framework.  You can read the code to see the boilerplate; I won't be including it here.  I will note, though, that this extends the Math::BigFloat library, which provides arbitrary-precision arithmetic in Perl and is a good match for PostgreSQL's numeric types as used by LedgerSMB.

NULL handling


To solve the problem of null inputs we extend the hashref slightly with a key _pgobject_undef, and allow this to be set or checked by applications with an "is_undef" method.  This is fairly trivial:

sub is_undef {
    my ($self, $set) = @_;
    # combined accessor/mutator: pass a defined value to set the NULL flag
    $self->{_pgobject_undef} = $set if defined $set;
    return $self->{_pgobject_undef};
}

How PGObject Serializes


When a stored procedure is called, the mapper class calls PGObject::call_procedure with an enumerated set of arguments.  A query is generated to call the procedure, and each argument is checked for a "to_db" method.  That method, if it exists, is called and the output used instead of the argument provided.  This allows an object to specify how it is serialized.

The to_db method may return either a literal value or a hashref with two keys, type and value.  If the latter, value is used as the literal and type as the cast type (i.e. it generates ?::type for the placeholder and binds the value to it).  This hashref approach is used automatically when bytea arguments are found.
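As a sketch of the hashref form, here is a stand-in class (the class name and its field are hypothetical, not part of PGObject itself) whose to_db requests an explicit cast:

```perl
package CastDemo;

use strict;
use warnings;

sub new {
    my ($class, $value) = @_;
    return bless { value => $value }, $class;
}

# Returning the hashref form tells PGObject to generate a ?::numeric
# placeholder and bind the value to it, rather than binding untyped.
sub to_db {
    my $self = shift;
    return { type => 'numeric', value => $self->{value} };
}

package main;

my $arg = CastDemo->new('42.5')->to_db;
# $arg is { type => 'numeric', value => '42.5' }
```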

The code used by PGObject::Type::BigFloat is simple:

sub to_db {
    my $self = shift @_;
    return undef if $self->is_undef;
    return $self->bstr;
}

Any type of course can specify a to_db method for serialization purposes.

How and When PGObject Deserializes


Unlike serialization, deserialization from the database can't happen automatically without the developer specifying which database types correspond to which application classes, because multiple types could serialize into the same application classes.  We might even want different portions of an application (for example in a database migration tool) to handle these differently.

For this reason, PGObject has what is called a "type registry," which specifies which database types are deserialized and as what.  The type registry can optionally be segmented into several sub-registries, but most uses will in fact simply use the default registry and assume the whole application wants the same mappings.  If a registry is not specified, the default registry is used, and that is consistent throughout the framework.

Registering a type is fairly straightforward and mostly amounts to boilerplate code in both the type handler and the scripts that use it.  For this type handler:

sub register {
    my $self = shift @_;
    croak "Can't pass reference to register\n".
          "Hint: use the class instead of the object" if ref $self;
    my %args = @_;
    my $registry = $args{registry};
    $registry ||= 'default';
    my $types = $args{types};
    $types = ['float4', 'float8', 'numeric'] unless defined $types and @$types;
    for my $type (@$types){
        my $ret =
            PGObject->register_type(registry => $registry, pg_type => $type,
                                    perl_class => $self);
        return $ret unless $ret;
    }
    return 1;
}

Then we can just call this in another script as:

PGObject::Type::BigFloat->register;

Or we can specify a subset of types, different types, or a different registry.

The deserialization logic is handled by a method called 'from_db' which takes in the database literal and returns the blessed object.  In this case:

sub from_db {
    my ($self, $value) = @_;
    # "$self" stringifies to the class name, so subclasses are blessed correctly
    my $obj = "$self"->new($value);
    $obj->is_undef(1) if ! defined $value;
    return $obj;
}

This supports subclassing, which is in fact the major use case.
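To see the whole round trip in one place, here is a minimal, self-contained stand-in class following the same pattern.  This is not the real module: the class name is made up, and the guard that passes 0 to new() for NULL inputs is my own addition to keep the sketch warning-free.

```perl
package MyNumeric;

use strict;
use warnings;
use parent 'Math::BigFloat';   # core module providing arbitrary-precision math

# Combined accessor/mutator for the NULL flag, as in PGObject::Type::BigFloat
sub is_undef {
    my ($self, $set) = @_;
    $self->{_pgobject_undef} = $set if defined $set;
    return $self->{_pgobject_undef};
}

sub to_db {
    my $self = shift;
    return undef if $self->is_undef;   # NULL survives the trip back
    return $self->bstr;
}

sub from_db {
    my ($self, $value) = @_;
    # flag NULLs rather than letting them silently become NaN or 0
    my $obj = "$self"->new(defined $value ? $value : 0);
    $obj->is_undef(1) if ! defined $value;
    return $obj;
}

package main;

my $n = MyNumeric->from_db('42.5');
print $n->to_db, "\n";                                        # 42.5
my $null = MyNumeric->from_db(undef);
print defined $null->to_db ? 'lost' : 'NULL preserved', "\n"; # NULL preserved
```

The point is that the NULL flag, not the numeric value, is what keeps the round trip lossless.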

Use Cases


This module is used as the database interface for numeric types in the LedgerSMB 1.5 codebase.  We subclass this module and add support for localized input and output (with different decimal and thousands separators).  This gives us a data type which can present itself to the user in one format and to the database in another.  The module could be further subclassed to make NULLs contagious (which in this module they are not) and the like.

Caveats


PGObject::Type::BigFloat does not currently make NULL handling contagious, and it probably never will, as this is part of our philosophy of handing control to the programmer.  Those who do want contagious NULLs can override additional Math::BigFloat methods in subclasses to provide that behavior.

A single NULL can go from the db into the application, return to the db, and be serialized as a NULL, but a running total that includes NULLs will be saved in the db as a 0.  To this point, that behavior is probably correct.  More specific handling of NULLs in the application, however, is left to the developer, who can check the is_undef method.

Next In Series:  Advanced Serialization and Deserialization:  Dates, Times, and JSON

Monday, September 15, 2014

PGObject Cookbook Part 1: Introduction

Preface


I have decided to put together a PGObject Cookbook, showing the power of this framework.  If anyone is interested in porting the db-looking sides to other languages, please let me know.  I would be glad to provide whatever help my time and skills allow.

The PGObject framework is a framework for integrating intelligent PostgreSQL databases into Perl applications.  It addresses some of the same problems as ORMs, but does so in a very different way.  Some modules are almost ORM-like, and more such modules are likely to be added in the future.  Unlike an ORM, however, PGObject mostly serves as an interface to stored procedures, and whatever code-generation routines are added are not intended to be quickly changed.  Moreover, it only supports PostgreSQL, because we make extensive use of PostgreSQL-only features.

For those who are not interested in Perl, this series may still be interesting, as it covers not only how to use the framework but also various problems that arise when we integrate databases with applications.  There are also people who should not use this framework because it is not the right tool for the job.  For example, if you are writing an application that must support many different database systems, you will probably get more out of an ORM than out of this framework.  But you may still get some interesting material from this series, so feel free to enjoy it.

Along the way this will explore a lot of common problems that happen when writing database-centric applications and how these can be solved using the PGObject framework.  Other solutions of course exist and hopefully we can talk about these in the comments.

Much of the content here (outside of the prefaces) will go into a documentation module on CPAN.  However I expect it to also be of far more general interest since the problems are common problems across frameworks.

Introduction


PGObject is written under the theory that the database will be built as a server of information only loosely tied to the application.  Stored procedures should therefore be able to add additional parameters without expecting the application to know what to put there; if a new parameter can accept a NULL and provide the same answer as before, the application can be assured that the database is still usable.

The framework also includes a fairly large number of other capabilities.  As we work through we will go through the main areas of functionality one at a time, building on the simplest capabilities and moving onto the more advanced.  In general these capabilities can be grouped into basic, intermediate, and advanced:

Basic Functionality



  1. registered types, autoserialization, and autodeserialization.
  2. The simple stored procedure mapper
  3. Aggregates and ordering
  4. Declarative mapped methods

Intermediate Functionality


  1. The Bulk Loader
  2. The Composite Type stored procedure mapper
  3. The database admin functions

Advanced Functionality


  1. Memoization of Catalog Lookups
  2. Writing your own stored procedure mapper

This series will cover all the above functionality and likely more.  As we get through the series, I hope that it will start to make sense and we will start to get a lot more discussion (and hopefully use) surrounding the framework.

Design Principles


The PGObject framework came out of a few years of experience building and maintaining LedgerSMB 1.3.  In general we took what we liked and what seemed to work well and rewrote those things that didn't.  Our overall approach has been based on the following principles:
  • SQL-centric: Declarative, hand-coded SQL is usually more productive than application programming languages.  The system should leverage hand-coded SQL.
  • Leveraging Stored Procedures and Query Generators: The system should avoid having people generate SQL queries themselves as strings and execute them.  It's better to store queries persistently in the db, or to generate well-understood queries in general ways where necessary.
  • Flexible and Robust: It should be possible to extend a stored procedure's functionality (and arguments) without breaking existing applications.
  • DB-centric but Loosely Coupled:  The framework assumes that the database is the center of the environment, and a self-contained service in its own right.  Applications need not break because the db structure changed, and the DB should be able to tell the application what inputs it expects.
  • Don't Make Unnecessary Decisions for the Developer:  Applications may use a framework in many atypical ways and we should support them.  This means that very often instead of assuming a single database connection, we instead provide hooks in the framework so the developer can decide how to approach this.  Consequently you can expect your application to have to slightly extend the framework to configure it.
This framework is likely to be very different from anything else you have used.  While it shares some similarities with iBatis in the Java world, it is unique in the sense that the SQL is stored in the database, not in config files.  And while it was originally inspired by a number of technologies (including both REST and SOAP/WSDL), it is very much unlike any other framework I have come across.

Next in Series:  Registered Types:  Autoserialization and Deserialization between Numeric and Math::BigFloat.

Sunday, September 14, 2014

LedgerSMB 1.4.0 Released


15 September 2014, London. The LedgerSMB project - all-volunteer developers and contributors - today announced LedgerSMB 1.4.0.

Based on an open source code base first released in 1999, the LedgerSMB project was formed in 2006 and saw its 1.0 release in the same year. It has now seen continuous development for over eight years, and that shows no signs of slowing down.

"LedgerSMB 1.4 brings major improvements that many businesses need," said Chris Travers, who helped found the project. "Businesses which do manufacturing or retail, or need features like funds accounting will certainly get much more out of this new release."

Better Productivity


LedgerSMB 1.4 features a redesigned contact management framework that allows businesses to better keep track of customers, vendors, employees, sales leads, and more. Contacts can be stored and categorized, and leads can be converted into sales accounts.

Additionally, a new import module has been included that allows businesses to upload CSV text files to import financial transactions and much more. No longer does data entry need to be done entirely by hand or require customizing the software.

Many smaller enhancements are here as well. For example, shipping labels can now be printed for invoices and orders, and user management workflows have been improved.

Better Reporting


The reporting interfaces have been rewritten in LedgerSMB 1.4.0 in order to provide greater flexibility in both reporting and in sharing reports. Almost all reports now include a variety of formatting options including PDF and CSV formats. Reports can also be easily shared within an organization using stable hyperlinks to reports. Additionally the inclusion of a reporting engine means that it is now relatively simple to write third-party reports which offer all these features. Such reports can easily integrate with LedgerSMB or be accessed via a third party web page.

Additionally, the new reporting units system provides a great deal more flexibility in tracking money and resources as they travel through the system. Not only can one track by project or department, but funds accounting and other specialized reporting needs are possible to meet.

Better Integration


Integration of third-party line-of-business applications also continues to improve. While integration has always been possible, owing to the open nature of the code and database structure, it has become easier as more logic is moved to where it can be easily discovered by applications.

There are two major improvement areas in 1.4. First, additional critical information, particularly regarding manufacturing and cost-of-goods-sold tracking, has been moved into the database, where it can be easily shared by other applications. This also allows for better testability and support. Secondly, LedgerSMB now offers a framework for web services, currently available for contact management, allowing integrators to more easily connect programs together.

Commercial Options


LedgerSMB isn't just an open source project. A number of commercial companies offer support, hosting, and customization services for this ERP. A list of some of the most prominent commercial companies involved can be found at http://ledgersmb.org/topic/commercial-support

Thursday, September 11, 2014

Math and SQL Part 6: The Problem with NULLs

This will be the final installment on Math and SQL, and it will cover the problem with NULLs.  NULL handling is probably the most poorly thought-out feature of SQL and is generally inconsistent with the relational model.  Worse, a clear mathematical approach to NULLs is impossible in SQL because too many different meanings are attached to the same value.

Unfortunately, nulls are also indispensable because wider tables are more expressive than narrower tables.  This makes advice such as "don't allow nulls in your database" somewhat dangerous because one ends up having to add them back in fairly frequently.

At the same time understanding the problems that NULLs introduce is key to avoiding the worst of the problems and managing the rest.

Definition of a Null Set


A null set is simply a set with no members.  This brings us to the most obvious use of a NULL: when an outer join results in a row not being found.  This sort of use by itself doesn't do too much harm, but the inherent semantic ambiguity of "what does that mean?" also means you can't just substitute join tables for nullable columns and solve the problems that NULLs bring into the database.  This will hopefully become more clear below.

Null as Unknown


The first major problem surfaces when we ask the question: when I do a left join and the row to the right is not found, does that mean we don't know the answer yet, or that there is no value associated?  A missing result from an outer join will sometimes mean that the answer is not yet known, if only because we are still inserting the data in stages.  But it can equally mean that there is simply no value associated.  In almost all databases, either reading can apply to the same missing row.

But then there is no additional harm done in allowing NULLs to represent unknowns in the tables themselves, right?

Handling NULLs as unknown values complicates database design and introduces enough problems that many experts, like Chris Date, tend to be against their use generally.  The problem is that using joins doesn't solve the problem, but instead only creates additional failure cases to be aware of.  So very often, people do use NULL in the database to mean unknown despite the problems.

NULL as unknown introduces problems to predicate logic because it introduces three-valued logic (true, false, and unknown), but these are typically only problems when one is storing a value (as opposed to a reference such as a key) in the table.  1 + NULL IS NULL.  NULL OR FALSE IS NULL.  NULL OR TRUE IS TRUE.  This makes things complicated.  But sometimes we must....
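The truth tables above can be sketched in Perl, representing UNKNOWN as undef.  The function names (or3, and3, add3) are my own, purely for illustration:

```perl
use strict;
use warnings;

# A toy model of SQL's three-valued logic, with UNKNOWN represented as undef.
sub or3 {
    my ($x, $y) = @_;
    return 1 if (defined $x && $x) || (defined $y && $y);   # TRUE dominates OR
    return 0 if defined $x && defined $y;                   # FALSE OR FALSE
    return undef;                                           # otherwise UNKNOWN
}

sub and3 {
    my ($x, $y) = @_;
    return 0 if (defined $x && !$x) || (defined $y && !$y); # FALSE dominates AND
    return 1 if defined $x && defined $y;                   # TRUE AND TRUE
    return undef;                                           # otherwise UNKNOWN
}

# 1 + NULL IS NULL: arithmetic with an unknown yields an unknown
sub add3 {
    my ($x, $y) = @_;
    return (defined $x && defined $y) ? $x + $y : undef;
}
```

Note how UNKNOWN propagates through arithmetic unconditionally but can be absorbed by a dominating TRUE (for OR) or FALSE (for AND).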

Null as Not Applicable


One severe antipattern that is frequently seen is the use of NULL to mean "Not Applicable" or "No Value."  There are a few data types which have no natural empty/no-op types.  Prime among these are numeric types.  Worse, Oracle treats NULL as the same value as an empty string for VARCHAR types.

Now, the obvious problem here is that the database doesn't know that this NULL is not an unknown, and therefore you end up having to track this yourself, use COALESCE() to convert to sane values, and so forth.  In general, if you can avoid using NULL to mean "Not Applicable," you will find that worthwhile.

Now, if you have to do this, one strategy to make this manageable is to include other fields to tell you what the null means.  Consider for example:

CREATE TABLE wage_class (
   id int not null,
   label text not null
);

INSERT INTO wage_class VALUES(1, 'salary'), (2, 'hourly');

CREATE TABLE wage (
   ssn text not null,
   emp_id int not null,
   wage_class int not null references wage_class(id),
   hourly_wage numeric,
   salary numeric,
   check (wage_class = 1 or salary is null),
   check (wage_class = 2 or hourly_wage is null)
);

This approach allows us to select and handle logic based on the wage class and therefore we know based on the wage_class field whether hourly_wage is applicable or not.  This is far cleaner and allows for better handling in queries than just putting nulls in and expecting them to be semantically meaningful.  This solution can also be quite helpful because it ensures that one does not accidentally process an hourly wage as a salary or vice versa.
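On the application side, the wage_class discriminator lets code dispatch explicitly rather than guessing what a NULL means.  A hypothetical Perl helper (the function name and the 2080-hour work year are my own assumptions for illustration):

```perl
use strict;
use warnings;

# Dispatch on wage_class so a NULL column is never interpreted by accident.
sub annual_pay {
    my $row = shift;
    return $row->{salary}             if $row->{wage_class} == 1;  # salaried
    return $row->{hourly_wage} * 2080 if $row->{wage_class} == 2;  # hourly
    die "unknown wage_class: $row->{wage_class}";
}

print annual_pay({ wage_class => 1, salary => 50000, hourly_wage => undef }), "\n";  # 50000
```

Because the check constraints guarantee the inapplicable column is NULL, the dispatch never needs to inspect it.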

What Nulls Do to Predicate Logic


Because NULLs can represent unknowns, they introduce three-valued predicate logic.  This itself can be pretty nasty.  Consider the very subtle difference between:

   WHERE ssn like '1234%' AND salary < 50000

vs

    WHERE ssn like '1234%' AND salary < 50000 IS NOT FALSE

The latter will pull in hourly employees as well, as they have a NULL salary.

Nulls and Constraints


Despite all the problems, NULLs have become a bit of a necessary evil.  Constraints are a big part of the reason why.

Constraints are far simpler to maintain if they are self-contained in a tuple and therefore require no further table access to verify.  This means that wider tables admit more expressive constraints than narrow tables.

In the example above, we can ensure that every hourly employee has no salary, and every salaried employee has no hourly wage.  This level of mutual exclusion would not be possible if we were to break off salaries and wages into separate, joined tables.

Nulls and Foreign Keys


Foreign keys are a special case of NULLs where the use is routine and poses no problems.  NULL always means "no record referenced" in this context and because of the specifics of three-valued boolean logic, they always drop out of join conditions.

NULLs in foreign keys make foreign key constraints and 5th Normal Form possible in many cases where it would not be otherwise.  Consequently they can be used routinely here with few if any ill effects.

What Nulls Should Have Looked Like:  NULL, NOVALUE, UNKNOWN


In retrospect, SQL would be cleaner if we could be more verbose about what we mean by a NULL.  UNKNOWN could then be reserved for the rare cases where we really must store a record with incomplete data.  NULL could be returned from outer joins, and NOVALUE could be used for foreign keys and for places where we know the field is not applicable.

Tuesday, August 19, 2014

Math and SQL Part 5: Projection and Selection

The SELECT statement is the workhorse of SQL.  Updates and inserts are necessary, but selects are where the power is.  One of the significant barriers many people face in understanding and using them is a clear grasp of the math involved, in part because the syntax allows a large number of implicit possibilities.  In general folks learn to avoid the implicit aspects of the statement, but some implicit oddities can't go away because they are baked into the language.  One of these is the confusion between selection and projection, and because SQL operates on bags instead of sets, neither of these works in SQL quite as it does in relational math.

In "Relational Theory and SQL," Chris Date says that tuples are unordered.  Mathematically, tuples are strongly ordered; what Date is getting at is that a relation has no natural ordering of columns, and therefore from any relation with a given ordering, a new relation can be created with a different ordering.  This process is called "projection," and it is entirely implicit in SQL.  In the interest of being concise, Date does not, in that book, stick with his typical approach of starting from clear definitions and discussing what they mean.

This being said, understanding projection, selection, and the difference between them makes understanding relational databases far easier in my view.

Because SQL operates on bags, I will use both SQL (for bag examples) and Set::Relation (for set examples) in order to discuss the math differences.  I will finally offer an idea of what a clearer English rendition of SQL's math semantics would be.

I:  Projection - Transform One Relation into Vertical Subset


Projection, represented by the Greek letter pi (π), takes a subset (or the whole set) of the fields of each tuple, possibly re-ordering them in the process, in order to transform one relation into a derived relation.

In relational algebra, projection takes a set of tuples and returns a set of tuples with a subset of fields (possibly re-arranged).

For example, consider a relation R, where the tuple structure of R is (employee_id, first_name, last_name, office_id, tax_id).  Suppose the data looks like this:

(employee_id, first_name, last_name, office_id, tax_id)
(1, Eric,   Young, 1, 123-45-6789)
(2, Thomas, Young, 1, 324-55-3334)
(3, Julie,  James, 1, 443-45-6234)
(4, Jared,  Young, 2, 533-44-5442)
(5, Todd,   Kals,  2, 532-44-2332)

Then if we write π(last_name, office_id)(R), we get:

(last_name, office_id)
(Young, 1)
(James, 1)
(Young, 2)
(Kals,  2)

Note that there is one fewer row in the output relation: because the result is a well-formed set, the two (Young, 1) tuples collapse into one.

The equivalent of the previous expression in SQL would be:

SELECT DISTINCT last_name, office_id FROM R;

From these examples a few things should be fairly clear.  First, tuples are strongly ordered, and we give human-readable labels only as a nicety; projection can be used to reorder the tuple elements arbitrarily.  Consequently tuples are consistently ordered within a relation, but that order is largely arbitrary, so tuples cannot be said to have a natural order.  Instead they derive their structure and order from the relation they are a part of.

Unfortunately SQL makes all kinds of implicit tuple transformations, either by implicit projection or by some sort of reverse operation (I will call that "overlay" for the moment).  Because projection is entirely implicit in SQL, it is never stated.  The SQL statement above in fact performs no selection operation at all; it merely performs a projection.
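The SELECT DISTINCT above can be sketched over plain Perl data structures, which makes both the projection and the set semantics explicit:

```perl
use strict;
use warnings;

# The relation R as an array of hashref tuples
my @R = (
    { employee_id => 1, first_name => 'Eric',   last_name => 'Young', office_id => 1 },
    { employee_id => 2, first_name => 'Thomas', last_name => 'Young', office_id => 1 },
    { employee_id => 3, first_name => 'Julie',  last_name => 'James', office_id => 1 },
    { employee_id => 4, first_name => 'Jared',  last_name => 'Young', office_id => 2 },
    { employee_id => 5, first_name => 'Todd',   last_name => 'Kals',  office_id => 2 },
);

# π(last_name, office_id)(R): keep only two fields, then discard duplicates
my %seen;
my @projected = grep { !$seen{"$_->{last_name}|$_->{office_id}"}++ }
                map  { { last_name => $_->{last_name}, office_id => $_->{office_id} } } @R;
# four tuples survive: the two (Young, 1) rows collapse into one
```

The deduplicating grep is what DISTINCT adds; without it we would have bag semantics and five rows.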

II:  Selection - Turn a Relation into a Horizontal Subset


Selection takes a relation and provides a subset of it based on arbitrary criteria.  The structure of the relation is unchanged, but only a subset of the tuples appear in the new relation.  In other words, it selects tuples for further processing, and since the relation is a set, it is guaranteed to give a subset out.  It is represented by the sigma character (σ), with the subscript giving the selection condition.

Given the same relation R, we can perform σ(office_id = 1)(R) and get:

(employee_id, first_name, last_name, office_id, tax_id)
(1, Eric,   Young, 1, 123-45-6789)
(2, Thomas, Young, 1, 324-55-3334)
(3, Julie,  James, 1, 443-45-6234)

Simple, right?

In SQL terms, a plain selection is what is found in the where clause.  So the equivalent in SQL is simply:

SELECT * FROM R WHERE office_id = 1;

As you can see from these examples, selection and projection are transformed in SQL in some relatively non-intuitive ways.  At the same time, understanding relational databases demands a basic understanding of these two concepts.
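In Perl terms, selection is simply a grep over the tuples.  A sketch, reusing the same hashref representation of R:

```perl
use strict;
use warnings;

my @R = (
    { employee_id => 1, first_name => 'Eric',   last_name => 'Young', office_id => 1 },
    { employee_id => 2, first_name => 'Thomas', last_name => 'Young', office_id => 1 },
    { employee_id => 3, first_name => 'Julie',  last_name => 'James', office_id => 1 },
    { employee_id => 4, first_name => 'Jared',  last_name => 'Young', office_id => 2 },
    { employee_id => 5, first_name => 'Todd',   last_name => 'Kals',  office_id => 2 },
);

# σ(office_id = 1)(R): tuple structure is unchanged, only membership shrinks
my @selected = grep { $_->{office_id} == 1 } @R;
# three tuples survive
```

Note that, unlike projection, no deduplication step is needed: selecting from a set always yields a set.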

Some Notes on Set::Relation


In the course of putting this together I spent a little time with Set::Relation.  I found that the module seems to behave relatively well.  The projection method combined with members() outputs a set.  This was as far as I took it in my quick spin.  As per the discussion below, Set::Relation operates under the assumption that relations are well-formed sets instead of bags.  This can be very helpful when operating in a highly formal relational environment.

Selection could be done using the semijoin method, or using members() and grep.

Next:  On the Great Null Debate


As a sort of postscript, I have decided not to include strong definitions of joins here.  The notation doesn't lend itself well to HTML blogs, and the basic concepts are reasonably straightforward from an SQL perspective (the exception perhaps being antijoins).  At any rate it is not worth the effort.

Thursday, August 14, 2014

Math and SQL part 4: Basic Data Types. Sets, Tuples, and Bags

A reasonable understanding of the relational model requires understanding the basic data types which make it up, both in the idealized model and in real-world applications.  This post discusses both the idealized model and the accommodations the standard implementations of it make to the messiness of the real world.  We won't deal with NULLs here.  That will be the subject of a future entry.

I Relation and Set


At its root, the relational model is about relations.  Consequently a relation needs to be understood on a mathematical level.  Since a relation is a form of set, let's explore sets first.

A set is an unordered list of unique items.  Thus, {a, b} is a set, but {a, b, b} is not (it is a multi-set or bag, however, see below).  Additionally since it is unordered, {a, b} = {b, a}.  We can project ordering onto a list for our convenience but this does not affect the underlying mathematics.  Sets can be finite or infinite.  In databases, we focus on finite sets (infinite sets can be calculated, but cannot be stored in set form on a finite filesystem).  Infinite sets are for math libraries and don't really concern us here.

A relation is a set of propositions, of correlating facts.  These are represented as tuples (see below).  At this point, just recognize that a tuple represents a correlation between two or more facts.  Because each tuple is necessarily unique, the individual facts are also trivial functions of the tuple as a whole.

Consider a point in the format of (x, y). For any given point, x and y are trivial functions of it.  For (2, 2), x=2 is trivially true, as is y=2.  This seems tautological (and on one level it is) but it is also important because it gets to the basic operation of a relational database.  For more on this, see part 2 of this series.

II Row and Tuple


A tuple is an ordered set of items.  Unlike a set the order is important and repetition is allowed.  (1, 2, 3), (1, 3, 3), and (1, 1, 1) are all valid tuples.  In tuples, the order ascribes a value to a meaning.  For example, if the preceding tuples referred to points in three dimensional Euclidean space, the first would correspond to the x axis, the second to the distance along the y axis, and the last to the distance along the z axis.   This means that points (1, 2, 3) and (2, 1, 3) are different, and not equal (if they were sets with the same members they would be equal).

While relations are open ended sets (which presumably can be added to indefinitely), tuples are closed-ended.  While one can add more members, this involves a semantic shift in the data.  For example, points (1, 2) and (1, 2, 3) share the same x and y axis but a two dimensional space is different from a three dimensional space semantically.

Relational algebra works at the set level, performing set operations across an entire set.  The comparisons, however, are done on tuples and their members.

This point cannot be overemphasized.  Ideal relational databases operate on sets and tuples.  Everything else is secondary.  A well designed database will build on this basic point.  As a caution, SQL works on bags, not sets, so some care is needed to make this happen.

III Tuples, Sets, and Relations in Perl


Tuples and sets of course are generally useful in programming.  Understanding these is often a key to better performance, and in Perl this is somewhat counterintuitive.

A naive view from looking at the database would be to see a tuple as a %hash and a set as an @array.  In fact, it is the other way around.  Understanding this is helpful because if you really want a set, a hash is often a better tool than an array in Perl.

Arrays in Perl are ordered and allow for duplicates.  In this regard they are a tuple or a sequence.  Hashes on the other hand do not allow for duplicate keys and the keys are unordered.  Therefore, where you want a set, a hash is usually a better tool, and where you want a sequence or a tuple, an array is a better structure.

Suppose we want to determine which members of one sequence are not found in a specific set (currently represented in an array).  We might do something like:

my %sethash;
$sethash{$_} = 1 for @setarray;   # build a hash-based set: O(1) membership tests
my @excluded = grep {not exists $sethash{$_}} @sequencearray;

This will perform much better than:

my @excluded;
for my $element (@sequencearray){
    # a linear scan of the set for every element: O(n*m)
    push @excluded, $element unless grep {$_ eq $element} @setarray;
}

In Perl, set operations are usually easier to make perform well when using hash tables instead of arrays for sets.  This is because hash tables are unordered and enforce uniqueness among their keys.  Thus they are better optimized for the sorts of access patterns set operations are likely to use.

IV Set and Bag


SQL, for better or worse, generally compromises on the set concept and uses multisets (or bags) instead.  A multiset is like a set but allows duplicates.  This is a compromise that lets real-world data be input into the database easily.  In import routines this matters because incoming data may not be perfectly unique, and so may require some sanitizing during the import process.

Unfortunately, set operations and bag operations are different and produce different results.  As a preview of the next article in the series: suppose we project y from y = x^2.  We will get duplicate values of y, because squaring a negative number gives the same result as squaring the corresponding positive number.  A set operation would simply discard the duplicates; a bag operation returns them.
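The difference is easy to see in code.  A rough sketch (Python, for illustration only):

```python
xs = [-3, -2, -1, 0, 1, 2, 3]

bag_projection = [x ** 2 for x in xs]   # bag semantics: duplicates survive
set_projection = set(bag_projection)    # set semantics: duplicates discarded

assert bag_projection == [9, 4, 1, 0, 1, 4, 9]
assert set_projection == {0, 1, 4, 9}
```

This is the same distinction as SELECT vs. SELECT DISTINCT.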

This distinction is key to understanding the difference between databases in the ideal world vs in the real world.

V A Note on Ordering


As you will note, sets are unordered; ordinality is not important.  We can order the tuples of a set, turning it into a sequence (for purposes of limit and offset), and we can then revert back to a set for further set operations.  Ordering is mathematically complicated in databases, but it is intuitively easy.  The key thing to remember is that ordering creates a sequence from a set, and that we have to create a set from that sequence again before we do further set operations.

It is for this reason that the results of the following query will be in an order dependent on the query plan:

SELECT f.foo, b.bar
  FROM (select foo, id from foobar order by transdate desc limit 20) f
  JOIN barbaz b ON f.id = b.foo_id;

We have a subquery which generates a sequence, but it returns a set of the members of that sequence.  Those are then used in further set operations, and the order returned depends on how the planner decides to do those operations.
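The set/sequence round trip can be sketched outside SQL as well (a Python analogy, not SQL semantics): ordering and limiting turn a set into a sequence, and any further set operation requires a set again.

```python
ids = {5, 3, 9, 1, 7}

# ORDER BY ... DESC LIMIT 3: the set becomes a sequence
top_three = sorted(ids, reverse=True)[:3]     # [9, 7, 5]

# a further set operation (here an intersection, like a join filter)
# works on the members as a set again, with no inherent order
result = set(top_three) & {5, 9, 100}
assert result == {9, 5}
```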

Next:  Projection and Selection

Monday, August 11, 2014

Math and SQL part 3: MVCC, Immutability, and Functional Programming

While first normal form is pretty much restricted to relational operations, this piece is more general.  It looks at the basic promise of functional programming and the way that promise applies to PostgreSQL.  Along the way, we will also look at a fully functional approach to a key-value store in Perl as a close analogue.

I The Basic Promise of a Function


We should start by reminding ourselves of what a function represents: the promise that if you know one or more pieces of information, you can derive or look up another piece of information.

That promise is behind all functional programming, and in areas which are close to the math (such as relational databases) it is especially important.  This basic promise is important, but it also changes the way we have to think about programs.  Rather than computer programs doing things, they merely calculate things.  Whatever is done in the real world as a result is a byproduct of that calculation.

In general, a simple test for whether code is functional is whether its output depends solely on its input, but in relational math this actually becomes a remarkably blurry line.  With data which can be looked up or stored, the line is often drawn at the point of mutability, because otherwise it is very difficult for routines to share data structures.  This means that data and state which cannot change during the function's existence do not count against it being a function.

Now, obviously a functional view of programming is pretty limited.  While a network server might listen to requests and return a function of input, side effects in the control flow become impossible in a purely function-oriented approach.  For this reason many functional programming environments I have worked with have imperative programs which link the calculations to real-world actions and side effects.

II How State Complicates Programming and Testing


Application state is a major source of bugs in any program.  Very often unforeseen circumstances arise because of the slow, cumulative changes that an application's internal control data (its "state") undergoes.  If these changes are not carefully controlled, they can lead to eventual inconsistencies which result in operational problems.

Testing a function is far easier than testing a mutable structure.  There are fewer data points to test if you can do bounds checks on each change of state rather than only on the initial input.  Concurrency is also easier if you adopt a copy-on-write model (instantiating new objects or other data structures when they are modified): since writes do not block reads, there is no need to worry about locks when sharing access.
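A minimal copy-on-write update might look like this (a Python sketch; the helper name is mine):

```python
def with_update(mapping, **changes):
    """Return a new dict with changes applied; the original is never mutated."""
    new = dict(mapping)   # copy on write
    new.update(changes)
    return new

v1 = {"a": 1}
v2 = with_update(v1, b=2)
assert v1 == {"a": 1}          # readers holding v1 are unaffected by the write
assert v2 == {"a": 1, "b": 2}
```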

III A Functional Key Value Store in Perl


In a past post, we looked at a simple key-value store for Perl.  Now it is time to flesh this out and look at immutability in order to make it more functional.  This new version has a number of important tradeoffs.  While it uses more memory than the previous version, it is far more concurrency safe because we copy on write, and it is easier to make guarantees since everything is immutable.

package MyKVS;
use overload
    '+' => 'add',
    '-' => 'remove';

sub new {
     my ($class, $initial) = @_;
     $class = ref $class if ref $class;   # so $obj->new($hashref) also works
     $initial = {} unless ref $initial eq 'HASH';
     bless $initial, $class;
     return $initial;
}

sub add {
    my ($self, $other) = @_;
    my $new = {};
    $new->{$_} = $self->{$_} for keys %$self;
    $new->{$_} = $other->{$_} for keys %$other;
    return $self->new($new);
}

sub remove {
    my ($self, $other) = @_;   # overload passes the right-hand operand here
    my @keys = ref $other eq 'ARRAY' ? @$other : ($other);
    my $new = {};
    for my $k (keys %$self){
        $new->{$k} = $self->{$k} unless grep {$_ eq $k} @keys;
    }
    return $self->new($new);
}

sub get_value {
    my ($self, $key) = @_;
    return $self->{$key};
}

You will then note that the key value store creates a new key value store on each change.  This means that if we go through the public API, each key value store is immutable for its lifetime.  The code above has not been tested and is probably not the best Perl code in the world, but it illustrates the concept of copy on write fairly well.  Nothing above requires any locking under any circumstances.

We can then use it like:

use MyKVS;
use 5.010;

my $kvs = MyKVS->new({foo => 'bar', 1 => 1});

$kvs += {2 => 'foobar', cat => 'dog'};
my $kvs2 = $kvs + {dog => 'lion'};
$kvs2 -= 2;

say $kvs->get_value('cat');
say $kvs2->get_value('dog');

This will print:
dog
lion

The keys in $kvs are now foo, 1, 2, and cat.  Those in $kvs2 are foo, 1, cat, and dog.

IV:  Mutability Challenges


One of the major challenges here is that state ends up having to change over time, and often has to be shared.  Functional programming offers a number of solutions to this problem, but all essentially involve handing out immutable structures and copying them when they change.  These principles are not entirely at odds with having some sort of shared state.

This opens up a number of possibilities which are beyond the scope of this piece, but the notion of copying on write is fundamental to understanding MVCC in PostgreSQL.

So back to the key value store example: suppose we want to add the ability to share a key value store across an entire application.  We may then add two more methods, optionally in a closure, along with a lexically scoped $kvs for the module.

Now, since we copy on write, the only place we need to lock is the update, and the locking strategy will depend on our approach.  In this module the store is lexically scoped to the module and not shared across processes, so we do not need to lock at all.  If we did, however, we would only lock the update function; no other function would need a lock.  In such a scheme, writers may block writers, but writers never block readers (or vice versa).

{    # start closure

my $kvs = __PACKAGE__->new({});

sub get_store {
     return $kvs;
}

sub update {
     my ($self, $update) = @_;
     $kvs += $update;
     return $kvs;
}

}    # end closure 

Now with these methods something magical happens.  When update is called, the lexically scoped $kvs is replaced with a new store, but everyone who already obtained a store keeps the old version, which is effectively immutable (unless one breaks encapsulation).  Moreover, if one wants to update the global store and get the new updated store back, it is as simple as:

$kvs = $kvs->update({key => 'value'});

And now we have a crude multi-version concurrency control for our key value store.  If we wanted to take this further we would store the kvs in shared memory and provide appropriate locking.  Note that locks would only be required on write and retrieving the master copy.
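In a language with threads, the same pattern might be sketched like this (Python for illustration, not the Perl module above; the class and method names are mine):

```python
import threading

class SharedKVS:
    """Readers get immutable snapshots; only writers take the lock."""
    def __init__(self):
        self._current = {}
        self._lock = threading.Lock()

    def get_store(self):
        return self._current            # no lock: snapshots are never mutated

    def update(self, changes):
        with self._lock:                # writers block only other writers
            new = dict(self._current)   # copy on write
            new.update(changes)
            self._current = new
        return new

store = SharedKVS()
snapshot = store.get_store()
store.update({"key": "value"})
assert snapshot == {}                          # an old snapshot never changes
assert store.get_store() == {"key": "value"}   # new readers see the new version
```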

V:  How MVCC Works in PostgreSQL


MVCC is a major feature of modern relational database management systems and it refers to the ability to maintain consistency between requests, handing out older versions of data when necessary to show a consistent snapshot at a specific point in time.  That specific point in time differs based on isolation levels.

MVCC works on the same basic principles we have been discussing here.  Data is copied on write (so there is no difference between an update and a delete followed by an insert, except that integrity checking doesn't happen on the delete portion).  The database keeps the bookkeeping information needed to give proper views of the data and delivers it on request.  Thus inserts are not visible until commit, deletes merely mark data as obsolete, and updates do both.  This avoids a large number of concurrency issues by treating data as immutable and subject to garbage collection.
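A toy version of the visibility rule, loosely modeled on PostgreSQL's per-row bookkeeping (the field names and the simplified snapshot check are mine, for illustration):

```python
# Each row version records the transaction that created it ("xmin")
# and, once obsoleted, the transaction that expired it ("xmax").
def visible(row, snapshot_xid):
    return (row["xmin"] <= snapshot_xid and
            (row["xmax"] is None or row["xmax"] > snapshot_xid))

rows = [
    {"val": "old", "xmin": 1, "xmax": 5},    # update at xid 5 expired this version...
    {"val": "new", "xmin": 5, "xmax": None}, # ...and inserted this one
]

# a snapshot taken before the update still sees the old version
assert [r["val"] for r in rows if visible(r, 3)] == ["old"]
# a later snapshot sees only the new version
assert [r["val"] for r in rows if visible(r, 6)] == ["new"]
```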

This basic idea is implemented differently in other database systems of course.  Some physically move rows in anticipation of commit or rollback (using a rollback segment), but PostgreSQL simply stores the rows in the tables, whether obsolete or not, and later collects those via autovacuum (or periodic vacuuming).

The same principle holds in all cases though: a logical copy of the data is made when the application reads it, and another when the application writes.  The transaction isolation level affects when these copies are made and which copy the application sees, but not the basic process.

Next:  Basic Foundational Data Types: Sets, Tuples, and Bags

Wednesday, August 6, 2014

Math and SQL, part 2: Functions and First Normal Form

There is probably no piece of relational database theory so poorly understood in the professional literature as first normal form.  This piece seeks to ground an understanding of 1NF in set mathematics, not only so that it can be better leveraged but also so that one can be aware of when (and when not) to violate it.

I:  Defining First Normal Form


In the previous post, we talked about the definition of function and relation.  This is key to understanding first normal form.

First Normal Form requires two things, as defined by Codd:
  1. Each relation is a well-formed relation (i.e. a well-formed set of fact correlations). 
  2. No element of the relation (i.e. correlated fact) is itself a set.  Therefore sets within sets are not allowed.
By extension the second prong prevents elements of the relation from having forms which are readily reduced to sets.  For example, an unordered comma separated list of values violates 1NF.  A clearer way of putting the second prong is:

Each element of a relation must contain a single value for its domain.

A clear mathematical way of looking at First Normal Form is:


  1. Each relation is a set of tuples.
  2. No tuple contains a set.


II: Sets and Ordinality


Let's further note that one defining feature of a set is that it has no ordering.  More formally we would say that sets have no ordinality.  {1, 2, 3} = {2, 1, 3}.  For this reason, no duplicate rows may be allowed, as that means that a relation is not a well-formed set.  

Tuples (the correspondences which make up the relation) do have mandatory ordering, and allow duplication.  However the ordering is not "natural" and therefore can be specified in relational operations.  In relational algebra, the project operation allows you to generate one tuple ordering from another.

Ordinality is an extremely important concept regarding normalization and arrays in the database, as we will see with SQL arrays.

III: SQL Arrays, Ordinality and 1NF


An SQL array is actually a mathematical matrix, carrying with it all the principal properties of one.  In particular, two basic constraints exist:

  1. SQL arrays are ordinal, so [1, 2, 3] != [2, 1, 3]
  2. Multi-dimensional arrays must be well-formed matrices.  I.e. [[2, 1, 3], [3, 2, 1]] is ok but [1, 2, [3, 2]] is not.
Arrays can be used in a number of different ways.  Some of these violate first normal form.  Others do not.  For example the following does not violate 1NF (though I am deliberately avoiding PostgreSQL's built-in types for networking):

CREATE TABLE network_addrs (
    host_id int primary key,
    hostname text,
    ip_as_octets int[]
);
INSERT INTO network_addrs VALUES (1, 'myhost', ARRAY[192,168,1,124]);

Because [192,168,1,124] and [124,168,1,192] are not equivalent, first normal form is not violated.

On the other hand consider:

CREATE TABLE web_page (
      file_name text primary key,
      content text,
      tags text[]
);
INSERT INTO web_page VALUES ('home.html', '<html></html>', ARRAY['foo', 'bar']);

Because [foo, bar] is equivalent to [bar, foo], we have a 1NF violation.  Now, in PostgreSQL, there may be good reasons to do this, but there are significant costs in any RDBMS as well.

The basic tradeoff in PostgreSQL in the latter case is that you gain the possibility of optimizing select queries on tags using GIN indexes.  However, deleting a tag from the system becomes very painful.  The difference between the table which violates 1NF using tags and the one which properly uses arrays for IP addresses is that the IP address will only ever be managed as a whole; I will never modify or delete octets individually.  For web pages, however, I may need to reach into the array with SQL to update or delete individual elements.

This is a good way of thinking about when arrays violate 1NF, and it presents an understanding of some practical problems that come into play when 1NF is violated, but doesn't quite match the theory when tuples are stored as relational elements.  More on this will be covered below.

IV:  1NF, Candidate Keys, and Functional Dependencies


Since a 1NF relation is a set, it has no duplicates.  Thus, the relation itself can be used as a domain for a function, the result of which is its members.  Thus the whole tuple can be said to be a candidate key, and every member a trivial function of it.

Interestingly PostgreSQL supports queries notating columns as functions of relations.  While we might normally run something like:

select proname, nspname 
  from pg_proc p 
  join pg_namespace n ON p.pronamespace = n.oid;

We can also write it as:

select proname, nspname 
  from pg_proc p 
  join pg_namespace n ON pronamespace(p) = oid(n);

These notations are actually directly equivalent, and while the second is sufficiently non-standard that I would probably avoid it for production code, it has the advantage of making the relationship between function and attribute quite clear.

V:  Notes about Tuples in Tuples


Now, technically, storing tuples inside of tuples in relations never violates first normal form by itself (though arrays of tuples might if those arrays are treated as unordered).  This is because tuples are always ordinal.  Nonetheless, storing tuples inside of tuples can cause problems which look very much like 1NF violations.  The same holds true for JSON and other programming structures.  These don't violate 1NF themselves, but they can cause problems of a similar sort depending on how they are used.

In general, a decent rule to live by is that if a tuple will be updated as a whole, problems don't exist but that problems begin to exist when a tuple may be updated (logically) by merely replacing elements in it.  If one thinks of tuples as immutable (more on this in an upcoming post), these problems are solved.

This is an area where functional programming as a mentality makes database design a lot easier.  One should think of tuples as immutable correlated data structures, not as internal state of the system.

VI:  When to Break 1NF


There are many times when 1NF must be broken.  

In the real world we don't often get to verify that our data conforms to rigid mathematical definitions before we load it.  For this reason, the non-duplication rule often must be ignored for data import workflows.  In these cases, one loads the data with duplicates, and filters them out with subsequent operations.  This is acceptable.

Additionally there are cases where the workloads are such that, with PostgreSQL, GIN indexes can give you major performance boosts when you need it.  This comes at a cost, in that deleting a valid element from the system can no longer perform well, but sometimes this cost is worth it for relatively unimportant and freeform data.

VII:  Perl modules and Related Concepts


In writing this I came across a number of Perl modules which are closely related to this post.

The first is Set::Relation, which implements basic relational logic in Perl.  This is worth looking at if you are trying to see how to implement set logic without an RDBMS.  A very nice feature is that by default, relations are immutable, which means you have an ability to use them in a fully Functional Programming environment.

The second is Math::Matrix, which gives you a library for matrix mathematics.  For integer and float arrays, this allows you to incorporate this into the database back-end using PL/PerlU procedures, and would be particularly useful when combined with SQL language arrays (either in the back-end or middle-ware).

VIII:  Final Thoughts


First Normal Form is often treated as a given in relational theory.  However since most books aimed at database professionals don't cover the mathematical basis for the model, it is generally poorly understood.  Understanding First Normal Form however is the beginning of understanding relational logic and relational database systems.  It therefore deserves much more exploration than the brief space it is usually given.

Understanding 1NF is particularly important when one decides to use a non-1NF database design.  These have significant tradeoffs which are not evident to beginners.  However, this is only the starting point.  In future discussions we will cover related aspects which give a more complete understanding of the basics of the system.

Next Week:  Immutability, MVCC, and Functional Programming

Saturday, August 2, 2014

Math and SQL, part 1: Introduction: Relations and Functions

There are two goals I have in this series.  The first is to ground database thinking heavily in mathematical thinking.  The second is to connect this more explicitly to functional programming, which attempts to do the same more generally.  The basic foundation of functional programming and relational theory is the same.  While this may seem very rudimentary for Planet PostgreSQL, I am hoping that the information here will be useful in theoretical courses, etc.

This series will also discuss uses of the same ideas in Perl.  Many of the same ideas are extremely important for developers in many languages.  Consequently these posts will all be tagged both PostgreSQL and Perl, and most will include Perl code as well as PostgreSQL code.

I:  SQL as Relational Math Implementation


In basic terms, SQL provides an interface for relational database systems.  Its most powerful features are essentially relational-algebra-based.  In order to better handle concurrent connections, SQL leaves out parts of relational math, and in order to handle real-world data, it adds features which relational algebra does not have.

However for purposes of this series, I will focus on PostgreSQL's SQL interpretation and I will use non-standard syntax features when that makes the math clearer.  I like PostgreSQL in this regard because it means that the math is fairly clear.

Code samples explaining ways to accomplish various operations will be provided in Perl.  Note that these are mostly fairly naive versions.  If one were really creating an RDBMS in Perl, the code would probably be very different.

II:  Relational Math, Relations and Functions


At the core of both databases and functional programming are the concepts of relations and functions.  At their foundation, these are fairly old concepts.  Since functions are a subset of relations, let's start by defining and understanding what a relation is:

A relation is a set of correlated facts.

For example, let's look at two relations, one showing square roots and one showing squares:


  • For square roots:
    • 1: 1
    • 1: -1
    • 4: 2
    • 4: -2
    • 9: 3
    • 9: -3
  • For squares
    • -3: 9
    • -2: 4
    • -1: 1
    • 0: 0
    • 1: 1
    • 2: 4
    • 3: 9
Both of these are relations. They provide a list of facts which correlate.  A square root of 1 is 1.  Another square root is -1.  -1 squared is 1, and so forth.

In programming terms, one may think of a relation as being the possible output of a subroutine.  In relational databases, however, a relation is a set of correlating facts.  Typically this is implemented as a "table."

On to functions:

A function is a relation where the facts of the relation are fully determined by the domain of the function.  Therefore, if you know the fact from the domain, there is only one related fact.

This definition is key to understanding both functional programming and relational databases.  It underlies almost all aspects of both disciplines.

The following are mathematical functions:

  1. f(x) = x + 1
  2. f(x) = 2 * x
  3. f(x) = x^2
These are functions because for any given x there is exactly one f(x).
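Using the square and square-root relations listed earlier, the distinction can even be checked mechanically (a Python sketch, with relations represented as sets of pairs):

```python
squares = {(-3, 9), (-2, 4), (-1, 1), (0, 0), (1, 1), (2, 4), (3, 9)}
square_roots = {(y, x) for (x, y) in squares}   # the inverse relation

def is_function(relation):
    """A relation is a function if each domain value relates to exactly one fact."""
    domain = [d for (d, _) in relation]
    return len(domain) == len(set(domain))

assert is_function(squares)           # every x has exactly one square
assert not is_function(square_roots)  # 4 relates to both 2 and -2
```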

So things like random() are not functions but rather very large relations.  Similarly now() is not a function because it has no input.  Programming in a way that emphasizes functions generally leads to more easily testable code, fewer bugs, and fewer concurrency problems than programming in a way which emphasizes state.

Now, in database terms, this all becomes important because of the fact that a function is a subset of relation.  Therefore we can speak of a functional dependency within a relation.   Put simply, a candidate key is a subset of a relation which, as a domain, turns the relation into a function.  All facts in a relation are a function of each candidate key.

So let's look at one example of how this works in Perl.  We will use a hash table, which is basically a key/value store.  A hash table is an unordered relation in Perl, such that the value is a function of the key.  In this example, we will mark this up to be a little more transparent mathematically, though in production code one would rarely bother.

package KeyValueStore;

my %store;

sub set {
    my ($key, $value) = @_;
    $store{$key} = $value;
}

sub remove {
    my $key = shift;
    delete $store{$key};
}

sub get {
    my $key = shift;
    $store{$key};
}

package main;
# (KeyValueStore is defined above in the same file, so no "use" is needed)

KeyValueStore::set('foo', 'bar');
KeyValueStore::set('1', '2');

print KeyValueStore::get('foo');

Now, in the above code, set and remove define the relation, and get returns a function of the key, based on the relation at any given time.  Functional programming would not see this as a function because of state and concurrency issues, but from a basic math perspective, given a relation (%store in this case), this acts as a function.  There are of course ways to make the above work within the limits of functional programming.  These require turning set and remove into mathematical functions and changing the way we store the information.  In a later piece, when I talk about MVCC, I will talk about how to do this.

What is important here is that, for a given relation, a fact is a function of a key.  This is usually said in a slightly different way, namely that a fact is functionally dependent on a key.  "Functional dependency" is the normal label because people frequently use the term "function" in databases in a non-mathematical sense.

Next week:  Functions and First Normal Form

Thursday, March 6, 2014

In Defence of Microsoft

Ok, so the next post was going to be on why PostgreSQL is really great.  But this is going to be a different and unusual post for this blog.

Most of my friends know that I do not revel in trying new Microsoft products and in fact avoid most of them when I can.  I don't find them pleasant to work with for the things I like to do and I tend to work with them only grudgingly.  True, I have written and published papers (even through Microsoft!) on Linux-Windows interoperability.  True I used to work for Microsoft.  But to me their products are relatively irrelevant.  Additionally I tend to shy away from politics on this blog because technical issues and politics require different sorts of discourse.

Yet last week a controversy erupted that shows the limits of these limitations.  Microsoft released an ad that was widely attacked for suggesting that Microsoft products are good for planning weddings.  As much as it pains me, I find myself having to defend the software giant here.

Sometimes technology is political, and sometimes it is important to stand up for another side of the issue than makes the press.

This is a contribution to what I hope will be a general industry-wide (and better-yet society-wide) conversation on gender and the economy, particularly in the technology fields.  All of what you find here is my own opinion and although I am male I grew up around the issue as you will see.  I don't expect that I speak for many.  The views here are contrarian and may be controversial (probably should be, since controversy may inspire thought).

The Ad


I decided not to link directly to the attacks on the ad first, figuring that those who have not seen it may want to see it with fresh eyes rather than after reading the attacks.  The ad is currently found on YouTube, and so I linked there.  I searched Microsoft's site and couldn't find it.  I don't know if they removed it due to the controversy.

The ad portrays a young woman who talks about why she picked a Windows all-in-one tablet over a Mac, and the reasons had to do with the way the tablet made her social life easier while planning her wedding.  These are perfectly good and valid reasons for choosing one product over another.  In no way is anyone portrayed as trivializing the use of technology or as incapable.  It's a decent ad targeted at a certain subset of the population.  I don't see anything wrong with it per se.

I am not saying that women are well enough represented in the ad campaign.  If that was the argument I might say 'sure, there's a valid point there' but that isn't the argument.   The only thing to see here is that Microsoft, in putting together a perfectly legitimate advertisement, managed to offend a bunch of people.

The Response


 The two primary articles attacking this ad were published in Slate and Think Progress.  The arguments basically have three components and are more similar than they are different.

Both responses accuse Microsoft of trivializing how women use technology for wedding planning and for things related to pregnancy.  They accuse Microsoft of playing into stereotypes, and of thus sending a message that women are less capable than men in the use of technology.

 A Personal Detour


Ordinarily I'd just figure this is an issue for women to sort out and leave it at that.  Unfortunately, that is a little harder to do working in open source software development.  The concern is basically that there aren't enough women in tech, and I work in a corner of tech where women are nearly absent.  While on average in the software development industry men outnumber women 2:1, in open source software the ratio is closer to 50:1.

Personally I wonder why we are so obsessed with the fact that women are making certain career and lifestyle choices differently than men.  While the reason isn't hard to find (see below), I do think that such obsession ultimately robs women in particular of agency, the idea that they and they alone are best prepared to make the decisions that involve the navigation of one's life.  Why do we think there aren't enough women in a given industry?  What would be enough?  Why?  To some extent I think this obsession delegitimizes both the decisions of those women who choose to go into software and those who don't, and I have heard more than one woman complain about being asked about this issue frequently.

The obsession doesn't help anybody.  Women may be as capable of programming computers as men are (and given the dominance of women in the early years of programming it seems hard to argue the contrary with any force of history behind such an argument).

Personally too, the first person who ever showed me a computer for work was my grandmother.  It was a large machine, probably as big as my dining room table, with a small CRT screen and a punchcard reader.  What did my grandmother use it for?  She wrote nuclear physics simulations.  Was she typical of her generation?  No.  She worked with Nobel-Prize-winning physicists and was quite renowned in her field.  Such people are never typical of any group.  She was also the first person I knew who complained about efforts to bring more women into STEM fields because she found it undermined her credibility.

But is everyone able to program a computer?  No.  Nor, perhaps beyond a very basic level, is that a skill everyone needs to know.

What's this Argument About, Anyway?


I don't think the argument is only about women in software development.  Running through both criticisms of the ad is an effort to trivialize getting married and having kids.

This seems like a really weird thing to trivialize, since most people get far more happiness out of family contact than they do out of slaving away in a cubicle, working so that someone else gets to make some extra profit, and yet there are certain segments of feminism which repeatedly seek to trivialize these things.  (For those who jump to offence, please calm down and note that feminism is not an organized movement and in fact has tremendous diversity of views on this subject.)

And yet the reasons why there is an effort to trivialize these things is not hard to find.  The US economy bears two fundamental characteristics that shape this debate in very deep ways.  The first is that the economy is based on the notion that women and men are not merely equal but interchangeable in all ways, and hence interchangeability becomes synonymous with equality.  The second is that the US economy is employer-centric, and thus the employer's needs are what are most important, not the needs of the family.  For these reasons, getting married and having kids (especially) has negative career consequences.  These consequences are worse the younger one is, and since women cannot delay having children as long as men can, they ultimately suffer disproportionate costs of gender neutral policies.

From this viewpoint it is easy to conclude that if only women didn't get married and have kids, inequality would be a thing of the past, but this isn't really a solution.  Rather it is a case where an apparent solution on an individual level papers over and conceals a larger problem.

Another aspect of the problem is the extent to which our society has a rather distorted view of the tech industry.  Technology firm founders are idolized well beyond the proportions of their contributions.  Working in technology is a glamorous job.  But it is also portrayed popularly as the industry of lone geniuses, and startup cultures have personal demands that are truly taxing (Marissa Mayer once bragged about 130 hour work weeks that she used to put in at Google), all for uncertain gains in what amounts to an institutionalized form of gambling with your time.

What?  Women don't want to be founders of tech startups as things stand right now?  Seems like they have more sense than men....

A Few Reasons the Attacks on Microsoft Here Are Wrong


The attacks on Microsoft for this ad are wrong for more reasons than I can count.  There is, after all, nothing wrong with the ad.  It portrays a perfectly legitimate reason that someone might choose one product over another, namely that it makes an important aspect of one's life a bit easier, and the ultimate judge of what is important really should be left to the individual, the family, and the local community.  Here are, however, my top few reasons why I find the attacks on the ad misplaced.

  1. Not everyone is or should be a "techie."  People have different views on software and different priorities in life, and that is ok.  There's nothing wrong with deciding not to get married in the US (there would be in much of the rest of the world, where you are expected to retire with your kids but that is a different story).  But conversely there is nothing wrong with treating your wedding as important.
  2. Technology exists to solve human problems, not the other way around.  The argument in the response carries with it a strong subtext that women should be solving technical problems, not using technology to solve human problems, but this misunderstands the proper place of technology in life.  This is, truth be told, a very common mistake and it is the primary cause of software project failures I have seen.  It is also a major part of the idolization of the tech industry and the perpetual promise to totally change our lives (which never seems quite as great when it happens).  Planning a wedding is a human problem and using technology for that is a fascinating use case, IMHO.
  3. Human Relationships are Anything but Trivial.  Getting married and having kids is fundamentally about human relationships.  Employers come and go.  We don't really expect them to stand by their employees when it is not profitable to do so.  Having people in your life you can count on matters more to security in life than having employers interested in your work.

Towards a More Just and Inclusive Economy


Standing up and defending Microsoft is to some extent an important first step in starting a conversation.  It can't be the end though.  The fact is that the critics of Microsoft want something (I hope!) that I want too, namely for women to enter the industry of software development on their own terms.  This has to be the topic of a larger conversation, and one which does not lose sight of either the individual or the systemic problems of the economy in this regard.

It seems hard to imagine that the systemic injustices of the current system (including an aggregate wage gap, though this may be statistically insignificant in the computer sciences) can be done away with in any way other than reducing the dependence on large employers or doing away with the myth of interchangeability (these two things are closely tied, since the idea of interchangeability is important to the development of large corporate organizations).  Perhaps a return to an economic system where men and women worked together as joint principals in household businesses would be a good model.  That has very little traction in the US today however.

In the end though I think it should be more obvious than it apparently is that you can't force someone to enter into an industry on his or her own terms.  The efforts to solve the problem of a gender gap in terms of culture and institutions are likely to fail as long as women look at the tech industry (and in particular the most glamorous parts of it) and don't want to put in the time and effort, or make the sacrifices involved.

But still, even in a more just economy, there are going to be people for whom a computer is primarily a social tool, a way to coordinate with friends and co-workers, to communicate and to plan, and I have trouble seeing the difference between planning an event of deep personal significance and planning as a middle manager for a company, except that the former ought to be a lot more rewarding.

I remain relatively optimistic though that small household businesses based on open source consulting alone have the potential to provide an opportunity to balance flexibly and productively the demands of family and work in such a way that everyone gets pretty much everything they want.  I think (and hope) that as open source software becomes more mature as an industry that this will be the way things will go.

But whatever we do, we must recognize that people decide how to go about participating in the economy based on their own needs and desires, and that none of us has perfect knowledge.  The most important thing we can stop doing is delegitimizing life choices we might decide are not for us.  If the goal is to help women enter industries like software development on their own terms, I think the best thing we can do is just get out of their way.

As a man, I don't know what young women want out of the industry as they consider it as a career or business path.  To be honest, it is better that I don't.  The new generations of programmers, male and female, should enter the industry on their own terms, with their own aspirations and hopes, their own determination to do things their own ways, and their own dreams.   And there is nobody that should tell them how to think or address these.  Not me.  Not the marketeers at Microsoft.  Not the authors at Slate and Think Progress.  That's the change that will make the industry more inclusive.

Thursday, February 27, 2014

In Praise of Perl 5

I have decided to do a two-part series here praising Perl in part 1 and PostgreSQL in part 2.  I expect that PostgreSQL folks will get a lot out of the Perl post and vice versa.  The real power of both programming environments is in the relationship between domain-specific languages and general purpose languages.

These posts are aimed at software engineers more than developers and they make the case for building frameworks on these platforms.  The common thread is flexibility and productivity.

This first post is about Perl 5.  Perl 6 is a different language, more a little sister to Perl 5 than a successor.  The basic point is that Perl 5 gives you a way to build domain-specific languages (DSLs) that can be seamlessly worked into a general-purpose programming environment.  This is almost the exact inverse of PostgreSQL, which offers, as a development environment, a DSL into which almost any general-purpose development tools can be worked.  This combination is extremely powerful, as I will show.

All code in this post is from the LedgerSMB codebase in different eras (before the fork, during the early rewrite, and now planned code for 1.5).  All code in this post may be used under the GNU General Public License version 2 or at your option any later version.

You can see in the code samples below our evolution in how we use Perl.

This is (bad) Perl


Perl is a language many people love to hate.  Here's an example of bad Perl from early in the LedgerSMB codebase.  It is offered as an example of sloppy coding generally and why maintaining Perl code can be difficult at times.  Note the module makes no use of strict or warnings, and almost all variables are globally package-scoped.

     $column_header{description} =
        "<th><a class=listheading href=$href&sort=linedescription>"
      . $locale->text('Description')
      . "</a></th>";

    $form->{title} =
      ( $form->{title} ) ? $form->{title} : $locale->text('AR Transactions');

    $form->header;

    print qq|
<body>

<table width=100%>
  <tr>
    <th class=listtop>$form->{title}</th>
  </tr>
  <tr height="5"></tr>
  <tr>
    <td>$option</td>


This is not a good piece of maintainable code.  It is very hard to modify safely.  Due to global scoping, unit tests are not possible.  There are many other problems as well.  One of our major goals in LedgerSMB is to rewrite all this code as quickly as we can without making the application unusably unstable.

So nobody can deny that it is possible to create unmaintainable messes in Perl.  But it is possible to do this sort of thing in any language.  One can't judge a language solely because it is easy to write bad code in it.

This is (better) Perl


So what does better Perl look like?   Let's try this newer Perl code, which was added late in the LedgerSMB 1.3 development cycle, and handles asset depreciation:

sub depreciate_all {
    my ($request) = @_;
    my $report = LedgerSMB::DBObject::Asset_Report->new(base => $request);
    $report->get_metadata;
    for my $ac(@{$report->{asset_classes}}){
        my $dep = LedgerSMB::DBObject::Asset_Report->new(base => $request);
        $dep->{asset_class} = $ac->{id};
        $dep->generate;
        for my $asset (@{$dep->{assets}}){
            push @{$dep->{asset_ids}}, $asset->{id};
        }
        $dep->save;
    }
    $request->{message} = $request->{_locale}->text('Depreciation Successful');
    my $template = LedgerSMB::Template->new(
        user =>$request->{_user},
        locale => $request->{_locale},
        path => 'UI',
        template => 'info',
        format => 'HTML'
    );
    $template->render($request);
}


This function depreciates all asset classes to a point at a specific date.  There's a fair bit of logic here but it does many times more work than the previous example, is easier to maintain, and is easier to understand.

This is also Perl!


The above two examples are pretty straightforward Perl code examples, but neither one really shows what Perl is capable of doing in terms of writing good-quality, maintainable code.

The fact is that Perl itself is a highly malleable language and this malleability allows you to define domain-specific languages for parts of your program and use them there.

Here's a small class for handling currency records.  POD and comments have been removed.

package LedgerSMB::Currency;
use Moose;
with 'LedgerSMB::PGOSimple::Role', 'LedgerSMB::MooseTypes';

use PGObject::Util::DBMethod;
   
sub _set_prefix { 'currency__' }


has id                => (is => 'rw', isa => 'Int', required => '0');
has symbol            => (is => 'ro', isa => 'Str', required => '1');
has allowed_variance  => (is => 'rw', isa => 'LedgerSMB::Moose::Number',
                          coerce => 1, required => 1);
has display_precision => (is => 'rw', isa => 'Int', required => '0');
has is_default => (is => 'ro', isa => 'Bool', required => '0');

dbmethod list      => (funcname => 'list', returns_objects => 1 );
dbmethod save      => (funcname => 'save', merge_back => 1);

dbmethod get       => (funcname => 'get', returns_objects => 1,
                          arg_list => ['symbol']);

dbmethod get_by_id =>  (funcname => 'get_by_id', returns_objects => 1,
                        arg_list => ['id']);

__PACKAGE__->meta->make_immutable;

Now the code above sets up a whole class, including properties, accessors, and methods delegated to database stored procedures.  The class is effectively entirely declarative.  The same amount of work in a similarly simple module from the 1.3 iteration (TaxForm.pm) requires around 50 lines of code, so more than double, and that's without accessor support.  The 1.4-framework module for handling contact information (phone numbers and email addresses) is around 65 lines of code, with not much more complexity (so around triple).  The simpler Bank.pm (for tracking bank account information) is around 36 lines, so nearly double.

What differentiates the examples though is not only line count but readability, testability, and maintainability.  The LedgerSMB::Currency module is more concise, more readable, and has much better testing and maintenance characteristics than the longer modules from the previous frameworks.  Even without comments or POD, if you read the Moose and PGObject::Util::DBMethod documentation, you know immediately what the module does.  And in such a module, comments may not be appropriate, but POD would likely not only be appropriate but take up significantly more space than the code.

How does that work?


Perl is a very flexible and mutable language.  While you can't add keywords, you can add functions that behave more or less like keywords.  Functions can be exported from one module to another and, used judiciously, this can be used to create domain-specific languages which in fact run on generated Perl code.

The example here uses two modules which provide DSL's for specific purposes.  The first is Moose, which has a long history as an extremely important contributor to current Perl object-oriented programming practices.  This module provides the functions "with" and "has" used above.

Moose, in this case, works with the PGObject::Simple::Role module, which provides a framework for interacting with PostgreSQL databases.  This is extended by LedgerSMB::PGOSimple::Role, which provides handling of database connections and the like.

The second is PGObject::Util::DBMethod, which provides the dbmethod function.  It's worth noting that both has and dbmethod are code generators.  When they run, they create functions which they attach to the package.  Used in this way, has creates the accessors, while dbmethod creates the delegated methods.
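To make the mechanism concrete, here is a minimal sketch of how such a keyword-like code generator can work.  The module and function names (My::DSL, accessor, My::Point) are hypothetical examples for illustration, not the actual internals of Moose or PGObject::Util::DBMethod, but the technique is the same: an ordinary exported function that generates a sub and installs it into the caller's package via the symbol table.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package My::DSL;              # hypothetical DSL module
use Exporter 'import';
our @EXPORT = ('accessor');

# A keyword-like function: called declaratively, it generates a
# read/write accessor and installs it into the calling package.
sub accessor {
    my ($name) = @_;
    my $caller = caller;      # the package using the DSL
    no strict 'refs';
    *{"${caller}::${name}"} = sub {
        my $self = shift;
        $self->{$name} = shift if @_;
        return $self->{$name};
    };
}

package My::Point;            # a class defined with the DSL
My::DSL->import;              # in separate files this would be: use My::DSL;
accessor('x');                # declarative: generates sub x { ... }
accessor('y');
sub new { my ($class, %args) = @_; return bless {%args}, $class }

package main;
my $p = My::Point->new(x => 1, y => 2);
$p->x(10);
print $p->x + $p->y, "\n";    # prints 12
```

Moose's has and DBMethod's dbmethod do considerably more (type constraints, coercion, query generation), but at bottom they are exported functions that write code into your package at load time, just as accessor does here.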

Why is this Powerful and Productive?


The use of robust code generation here at run-time allows you to effectively build modules and classes from specifications of modules and classes rather than implementing that specification by hand.  Virtually all object-oriented frameworks in Perl effectively offer some form of this code generation.

A specification-to-code language trades clarity, expressiveness (within its domain), and robustness on one hand against inflexibility on the other.  This is the fundamental tradeoff of domain-specific languages generally.  When you merge a domain-specific language into a general-purpose one, however, you gain the freedom to compensate for the lack of flexibility by falling back on more general tools when you need to.  This flexibility is where the productivity gains are found.

Compare a framework built as a mini-DSL specification language to one built as an object model.  In an object-model framework one effectively has to juggle object-oriented design (SOLID principles, etc.) with the desire for greater flexibility.  Here, however, the DSLs are orthogonal to the object model.  They allow you to define the object model independently of the framework, while re-using the DSL framework however you want.  Of course these are not mutually exclusive, and it is quite possible to have both in a large and powerful application framework.

Other Similarly Powerful Languages


Perl is not the only language of this kind.  The first example that comes to mind, naturally, is Lisp.  However, other Lispy languages are also worth mentioning.  Most prominent among these are Rebol and Red, whose open source implementations are still very immature.  These languages are extremely mutable, and the syntax can be easily extended or even rewritten.

Metaprogramming helps to some extent with these issues, and it is a common way of addressing them in Ruby and Python, but it makes it much harder to build a framework that is truly orthogonal to the object model.

A major aspect of the power of Perl 5 here is the very set of things which often cause beginning and intermediate programmers headaches.  Perl allows, to a remarkable extent, manipulation of its own internals (perhaps only Rebol, Red, and Lisp take this further).  This allows one to rewrite much of the language, but it also allows for the development of contexts which permit these sorts of extensions.

The key feature I am looking at here is the mutability of the language, and there are few languages which are as mutable.  Perl isn't just a programming language, but a toolkit for building programming languages inside it.