Wednesday, August 6, 2014

Math and SQL, part 2: Functions and First Normal Form

There is probably no piece of relational database theory which is so poorly understood in professional literature than first normal form.  This piece seeks to ground an understanding of 1NF in set mathematics not only so that it can be better leveraged but also so that one can be aware of when (and when not) to violate it.

I:  Defining First Normal Form


In the previous post, we talked about the definition of function and relation.  This is key to understanding first normal form.

First Normal Form requires two things, as defined by Codd:
  1. Each relation is a well-formed relation (i.e. a well-formed set of fact correlations). 
  2. No element of the relation (i.e. correlated fact) is itself a set.  Therefore sets within sets are not allowed.
By extension the second prong prevents elements of the relation from having forms which are readily reduced to sets.  For example, an unordered comma separated list of values violates 1NF.  A clearer way of putting the second prong is:

Each element of a relation must contain a single value for its domain.

A clear mathematical way of looking at First Normal Form is:


  1. Each relation is a set of tuples.
  2. No tuple contains a set.


II: Sets and Ordinality


Let's further note that one defining feature of a set is that it has no ordering.  More formally we would say that sets have no ordinality.  {1, 2, 3} = {2, 1, 3}.  For this reason, no duplicate rows may be allowed, as that means that a relation is not a well-formed set.  

Tuples (the correspondences which make up the relation) do have mandatory ordering, and allow duplication.  However the ordering is not "natural" and therefore can be specified in relational operations.  In relational algebra, the project operation allows you to generate one tuple ordering from another.

Ordinality is an extremely important concept regarding normalization and arrays in the database, as we will see with SQL arrays.

III: SQL Arrays, Ordinality and 1NF


An SQL array is actually a mathematical matrix, carrying with it all the principle properties of this.  In particular, two basic constraints exist:

  1. SQL arrays are ordinal, so [1, 2, 3] != [2, 1, 3]
  2. Multi-dimensional arrays must be well-formed matrices.  I.e. [[2, 1, 3], [3, 2, 1]] is ok but [1, 2, [3, 2]] is not.
Arrays can be used in a number of different ways.  Some of these violate first normal form.  Others do not.  For example the following does not violate 1NF (though I am deliberately avoiding PostgreSQL's built-in types for networking):

CREATE TABLE network_addrs (
    host_id int primary key,
    hostname text,
    ip_as_octets int[]
);
INSERT INTO network_addrs(1, 'myhost', ARRAY[192,168,1,124]);

Because [192,168,1,124] and [124,168,1,192] are not equivalent, first normal form is not violated.

On the other hand consider:

CREATE TABLE web_page (
      file_name text primary key,
      content text,
      tags text[]
);
INSERT INTO web_page ('home.html', '<html></html>', ARRAY['foo', 'bar']);

Because [foo, bar] is equivalent to [bar, foo], we have a 1NF violation.  Now, in PostgreSQL, there may be good reasons to do this, but there are significant costs in any RDBMS as well.

The basic tradeoff in PostgreSQL in the latter version is that you gain the possibility of optimizing select queries for tags using GIN indexes.  However, deleting a tag from the system becomes very painful.  The difference between the one which violates 1NF using tags and the one which properly uses arrays for IP addresses is that I am only going to manage the IP address as it exists at once.  I will never be modifying or deleting octets individually.  For web pages, however, I may need to reach into the array with SQL to update or delete elements.  

This is a good way of thinking about when arrays violate 1NF, and it presents an understanding of some practical problems that come into play when 1NF is violated, but doesn't quite match the theory when tuples are stored as relational elements.  More on this will be covered below.

IV:  1NF, Candidate Keys, and Functional Dependencies


Since a 1NF relation is a set, it has no duplicates.  Thus, the relation itself can be used as a domain for a function, the result of which is its members.  Thus the whole tuple can be said to be a candidate key, and every member a trivial function of it.

Interestingly PostgreSQL supports queries notating columns as functions of relations.  While we might normally run something like:

select proname, nspname 
  from pg_proc p 
  join pg_namespace n ON p.pronamespace = n.oid;

We can also write it as:

select proname, nspname 
  from pg_proc p 
  join pg_namespace n ON pronamespace(p) = oid(n);

These notations are actually directly equivalent, and while the second is sufficiently non-standard that I would probably avoid it for production code, it has the advantage of making the relationship between function and attribute quite clear.

V:  Notes about Tuples in Tuples


Now, technically storing tuples inside of tuples in relations never violates first normal form by itself (though arrays of tuples might be if they are not ordinal).  This is because tuples are always ordinal.  Nonetheless, storing tuples inside of tuples can result in some problems which look very much like 1NF violation problems.  The same holds true with JSON, and other programming structures.  These don't violate 1NF themselves, but they can cause problems of a similar sort depending on how they are used.

In general, a decent rule to live by is that if a tuple will be updated as a whole, problems don't exist but that problems begin to exist when a tuple may be updated (logically) by merely replacing elements in it.  If one thinks of tuples as immutable (more on this in an upcoming post), these problems are solved.

This is an area where functional programming as a mentality makes database design a lot easier.  One should think of tuples as immutable correlated data structures, not as internal state of the system.

VI:  When to Break 1NF


There are many times when 1NF must be broken.  

In the real world we don't often get to verify that our data conforms to rigid mathematical definitions before we load it.  For this reason, the non-duplication rule often must be ignored for data import workflows.  In these cases, one loads the data with duplicates, and filters them out with subsequent operations.  This is acceptable.

Additionally there are cases where the workloads are such that, with PostgreSQL, GIN indexes can give you major performance boosts when you need it.  This comes at a cost, in that deleting a valid element from the system can no longer perform well, but sometimes this cost is worth it for relatively unimportant and freeform data.

VII:  Perl modules and Related Concepts


In writing this I came across a number of Perl modules which are closely related to this post.

The first is Set::Relation, which implements basic relational logic in Perl.  This is worth looking at if you are trying to see how to implement set logic without an RDBMS.  A very nice feature is that by default, relations are immutable, which means you have an ability to use them in a fully Functional Programming environment.

The second is Math::Matrix, which gives you a library for matrix mathematics.  For integer and float arrays, this allows you to incorporate this into the database back-end using PL/PerlU procedures, and would be particularly useful when combined with SQL language arrays (either in the back-end or middle-ware).

VIII:  Final Thoughts


First Normal Form is often treated as a given in relational theory.  However since most books aimed at database professionals don't cover the mathematical basis for the model, it is generally poorly understood.  Understanding First Normal Form however is the beginning of understanding relational logic and relational database systems.  It therefore deserves much more exploration than the brief space it is usually given.

Understanding 1NF is particularly important when one decides to use a non-1NF database design.  These have significant tradeoffs which are not evident to beginners.  However, this is only the starting point.  In future discussions we will cover related aspects which give a more complete understanding of the basics of the system.

Next Week:  Immutability, MVCC, and Functional Programming

13 comments:

  1. Chris, thanks for referencing my Set::Relation module, I'm glad it was useful for your article.

    Speaking to your definition of First Normal Form, where no tuple contains a set, I think that having a set here is a good practice when the set as a whole represents a single value and has an identity.

    For example, a value representing a triangle can be represented by a set of 3 points, or a line segment by a set of 2 points. It would make sense to represent said value as a single set value (encapsulated or otherwise) in a single tuple, and not split it over 2 or 3 tuples. Remember, the points are NOT mutually ordered.

    Another point is that there are certain logical things that simply can't be represented at all in a natural way without having set values in tuples. For example, if you want to have a primary key over a set of tuples on elements that are themselves sets. I can point to the same example again, say you want to have a tuple per each triangle or line segment, then the set representing the triangle or line needs to have a key constraint on it, saying each set is distinct from every other set, and the other tuple attributes have a functional dependency on the set value that is the triangle or line segment's identity.

    Note that any natural solution to the use case I mentioned has no surrogate keys by definition.

    As a next point, the relational model works perfectly well from a logical standpoint when tuple elements are of any type at all, including other relations or sets or tuples etc, and I would argue there is no violation of first normal form occurring here, rather the only violation is where duplicate tuples are allowed.

    In fact, practice would require having set-valued elements in tuples. What form do you think you have when you do a GROUP BY on a relation? A relation whose tuples each have relation-typed elements, is what you have. This nested relations is the input for any aggregate operations you would subsequently apply, that then map each inner relation to simpler values.

    I grant you that in practice most of the time in stored relations your key tuple elements would be of simpler types like numbers or strings, but I would not say that having more complicated values as tuple elements is a violation of 1NF.

    ReplyDelete
    Replies
    1. The triangle point is actually very interesting. It isn't that you have a set, but rather that you have a value which is treated as a set for some purposes (i.e. two triangles are identical if and only if their sets of vertices are identical). Interestingly when you move to polygons with more than three sides, you can't reduce equality to a set operation, and therefore an SQL array of points is actually a better model since ordinality actually is important.

      Delete
  2. Replying to myself ...

    Actually, besides allowing duplicate rows, there is 1 major practice that I consider a 1NF violation, and that is when you have multiple tuple elements with interchangeable function, for example, having separate fields named phone_1 and phone_2 or address_1 and address_2, and many similar, where there is no semantic meaning (eg, preference) between the 2+fields.

    So the above example I would consider the most common kind of 1NF violation, that and duplicate rows. Whereas, a single field containing a value that is a set, I would not consider a 1NF violation, even if it otherwise is not the best choice of design for that particular database. So having a field named "phone_numbers" to replace phone_1 .. phone_2 I would consider as coming into 1NF compliance.

    ReplyDelete
    Replies
    1. That's an interesting point as well. Interestingly although this does not meet the math definition of a 1NF violation, it is isomorphic with one and poses most of the same problems.

      Delete
  3. Chris, putting aside matters of sets and 1NF for the moment, I very much agree with your consideration that it does not violate 1NF to have arrays or such ordered data as tuple elements. Indeed, it is much easier to find many examples where an ordered collection of elements is best treated as a single value and should be valid 1NF as a tuple element. The general case of polygon definitions for example (especially if normalized for direction). And in fact, it would be much easier to argue against any proposal to say that ordered collections can never be 1NF. Especially if framed with language like "only scalar values are allowed". After all, if one says that you can only have 1NF if everything is scalars, then I would ask, what is a character string? That is certainly not a scalar in the strictest sense, but rather an ordered list of elements. So it would be hypocritical to deny "non-scalars" but yet allow strings.

    ReplyDelete
    Replies
    1. Think about the danger of "only scalar values are allowed." Where do we stop? Are datetime fields a 1NF violation (since a datetime can be thought of as a (date,time) couple as well as a scalar offset from a specific point in time? Can you store Heisenberg Matrices in a relational database? What about Dirac Vectors? It would be truly ironic if relational databases, based on math, couldn't be used in math and physics.

      Polygons are interesting here. Ideally one would normalize not only for direction but also starting point, but that might be difficult. These are all cases where PostgreSQL's type encapsulation capabilities are really helpful.

      Delete
    2. This conversation is really interesting. It confirms my sense that 1NF is one of those topics that people gloss over but it deserves a lot more coverage.

      Delete
    3. Yes, I agree with everything as you say here, and that limiting 1NF to "scalars" is illogical. As to normalizing polygons, I can think of a simple way, which is to use distance from the origin [0,0,...]; say the first point is the one closest to or on the origin, and then out of the 2 other points it connects to, the one second-closest to the origin is considered the second point, and then we go in that direction. I speak of normalization in terms of ordering a structure so that one can simply compare 2 values for equality/identity by naive element-wise comparison. Likewise, Unicode normal forms or the like do that for text, for some sense of equal (eg, canonical decomposed). By putting more logic into normalizing value representation, equality tests between 2 values can be made trivial.

      Delete
    4. As I think about it, regarding set-represented triangles and 1NF, the interesting thing is that the triangle, represented as a set or not, is still functionally dependent on the tuple or candidate key. The tuple has a definite number of elements (3) and all possible orders are equivalent.

      This gets to your point of where the set is a complete representation of a single value (which happens to be represented by a finite set). This is where type encapsulation in PostgreSQL becomes really handy.

      Delete
    5. Why does the triangle have to be functionally dependent on anything else just because its a triangle? If anything, my use case example has it that the triangle valued attribute IS the primary key of the relation, as in that use case the relation is a set of triangles with information about each distinct triangle, so the other tuple attributes are dependent on the triangle, not the reverse, in my example.

      Delete
    6. Ok, I am having issues posting this (third time is a charm?)...

      Your triangle is at least a function of the tuple. If not, then you have broken 1NF. That is a trivial but important functional dependency. It could also be dependent on a secondary key, but it will be functionally dependent on the tuple as a whole. The each triangle's vertex, however, is *not* functionally dependent on the tuple as a whole and so seeing this as a collection of points instead of a triangle breaks 1NF.

      Delete
    7. Sorry, when I read your comment about the triangle, I had falsely read it as if you were treating triangles differently than other types like strings or numbers. But then I re-read your post eg "Thus the whole tuple can be said to be a candidate key, and every member a trivial function of it." and so yes in that sense you are treating the triangle the same as the other types.

      Delete
  4. This comment has been removed by the author.

    ReplyDelete