Thursday, September 11, 2014

Math and SQL Part 6: The Problem with NULLs

This will be the final installment on Math and SQL and will cover the problem with NULLs.  NULL handling is probably the most poorly thought-out feature of SQL and is generally inconsistent with the relational model.  Worse, a clear mathematical approach to NULLs is impossible in SQL because too many different meanings are attached to the same value.

Unfortunately, nulls are also indispensable because wider tables are more expressive than narrower tables.  This makes advice such as "don't allow nulls in your database" somewhat dangerous because one ends up having to add them back in fairly frequently.

At the same time, understanding the problems that NULLs introduce is key to avoiding the worst of them and managing the rest.

Definition of a Null Set


A null set is simply a set with no members.  This brings us to the most obvious use of a NULL: the value produced when an outer join does not find a matching row.  This sort of use by itself doesn't do much harm, but the inherent semantic ambiguity of "what does that mean?" also means you can't just substitute join tables for nullable columns and solve the problems that NULLs bring into the database.  This will hopefully become clearer below.
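
For example, here is a minimal sketch (tables and data invented for illustration) of how an outer join manufactures a NULL for a missing row:

CREATE TABLE employee (
   id int primary key,
   name text not null
);

CREATE TABLE parking_space (
   employee_id int references employee(id),
   space_no int not null
);

INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO parking_space VALUES (1, 42);

-- Bob has no parking_space row, so space_no comes back NULL.
-- Does that mean we don't know his space, or that he has none?
SELECT e.name, p.space_no
  FROM employee e
  LEFT JOIN parking_space p ON p.employee_id = e.id;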

Null as Unknown


The first major problem surfaces when we ask the question, "when I do a left join and the row to the right is not found, does that mean we don't know the answer yet, or that there is no value associated?"  A missing result from an outer join will sometimes mean that the answer is not yet known, if only because we are still inserting the data in stages.  But it can just as easily mean that there is no value associated at all.  In almost all databases, both readings are possible in this situation.

But then there is no additional harm done in allowing NULLs to represent unknowns in the tables themselves, right?

Handling NULLs as unknown values complicates database design and introduces enough problems that many experts, such as Chris Date, are generally against their use.  The problem is that using joins instead doesn't solve anything; it only creates additional failure cases to be aware of.  So, very often, people do use NULL in the database to mean unknown despite the problems.

NULL as unknown introduces problems to predicate logic because it introduces three-valued logic (true, false, and unknown), but these are typically only problems when one is storing a value (as opposed to a reference such as a key) in the table.  1 + NULL IS NULL.  NULL OR FALSE IS NULL.  NULL OR TRUE IS TRUE.  This makes things complicated.  But sometimes we must....
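
A quick illustration of those rules (this runs as written in PostgreSQL; boolean literal syntax varies by vendor):

SELECT 1 + NULL;        -- NULL: arithmetic with an unknown is unknown
SELECT NULL OR FALSE;   -- NULL: the result hinges on what the unknown is
SELECT NULL OR TRUE;    -- TRUE: true no matter what the unknown stands for
SELECT NULL = NULL;     -- NULL, not TRUE: two unknowns may or may not match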

Null as Not Applicable


One severe antipattern that is frequently seen is the use of NULL to mean "Not Applicable" or "No Value."  There are a few data types which have no natural empty/no-op value; prime among these are the numeric types.  Worse, Oracle treats NULL as the same value as an empty string for VARCHAR types.

Now, the obvious problem here is that the database doesn't know that this NULL is not an unknown, and therefore you end up having to track this yourself, use COALESCE() to convert to sane values, and so on.  In general, if you can avoid using NULL to mean "Not Applicable," you will find it worthwhile.
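
For example, a small sketch of the sort of cleanup this forces on queries (the compensation table and bonus column here are hypothetical):

-- Treat a NULL bonus as zero so the addition does not itself go NULL.
SELECT emp_id, salary + COALESCE(bonus, 0) AS total_comp
  FROM compensation;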

Now, if you have to do this, one strategy to make this manageable is to include other fields to tell you what the null means.  Consider for example:

CREATE TABLE wage_class (
   id int not null,
   label text not null
);

INSERT INTO wage_class VALUES(1, 'salary'), (2, 'hourly');

CREATE TABLE wage (
   ssn text not null,
   emp_id int not null,
   wage_class int not null references wage_class(id),
   hourly_wage numeric,
   salary numeric,
   -- only salaried employees (wage_class 1) may have a non-NULL salary
   check (wage_class = 1 or salary is null),
   -- only hourly employees (wage_class 2) may have a non-NULL hourly_wage
   check (wage_class = 2 or hourly_wage is null)
);

This approach allows us to select and handle logic based on the wage class, so the wage_class field tells us whether hourly_wage is applicable or not.  This is far cleaner and allows for better handling in queries than just putting NULLs in and expecting them to be semantically meaningful.  It also helps ensure that one does not accidentally process an hourly wage as a salary or vice versa.
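
For instance, a query along these lines (a sketch; the 2080 working hours per year is assumed only for illustration) can branch on wage_class instead of guessing from whichever column happens to be NULL:

SELECT w.emp_id,
       c.label,
       CASE w.wage_class
            WHEN 1 THEN w.salary               -- salaried
            WHEN 2 THEN w.hourly_wage * 2080   -- hourly, assumed 2080 hrs/yr
       END AS est_annual_pay
  FROM wage w
  JOIN wage_class c ON c.id = w.wage_class;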

What Nulls Do to Predicate Logic


Because NULLs can represent unknowns, they introduce three-valued predicate logic.  This itself can be pretty nasty.  Consider the very subtle difference between:

    WHERE ssn like '1234%' AND salary < 50000

vs

    WHERE ssn like '1234%' AND (salary < 50000) IS NOT FALSE

The latter will pull in hourly employees as well, as they have a NULL salary.
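
To make that concrete, here is a sketch against the wage table above (the sample rows are invented):

INSERT INTO wage VALUES ('123456789', 1, 1, NULL, 40000),  -- salaried
                        ('123470000', 2, 2, 15, NULL);     -- hourly

SELECT emp_id FROM wage
 WHERE ssn like '1234%' AND salary < 50000;                -- returns only 1

SELECT emp_id FROM wage
 WHERE ssn like '1234%' AND (salary < 50000) IS NOT FALSE; -- returns 1 and 2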

Nulls and Constraints


Despite all the problems, NULLs have become a bit of a necessary evil.  Constraints are a big part of the reason why.

Constraints are far simpler to maintain if they are self-contained in a tuple and therefore require no further table access to verify.  This means that wider tables allow more expressive constraints than narrower tables.

In the example above, we can ensure that every hourly employee has no salary, and every salaried employee has no hourly wage.  This level of mutual exclusion would not be possible if we were to break off salaries and wages into separate, joined tables.

Nulls and Foreign Keys


Foreign keys are a special case where the use of NULLs is routine and poses no problems.  In this context NULL always means "no record referenced," and because of the specifics of three-valued boolean logic, such rows always drop out of join conditions.

NULLs in foreign keys make foreign key constraints and 5th Normal Form possible in many cases where they would not be otherwise.  Consequently they can be used routinely here with few if any ill effects.
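
A minimal sketch of the pattern (table names invented for illustration); the NULL reference never matches anything, so such rows simply drop out of an inner join:

CREATE TABLE department (
   id int primary key,
   name text not null
);

CREATE TABLE staff (
   id int primary key,
   name text not null,
   department_id int references department(id)  -- NULL means "no department referenced"
);

-- Staff rows with a NULL department_id never satisfy the join condition,
-- so they quietly drop out of this inner join.
SELECT s.name, d.name AS department
  FROM staff s
  JOIN department d ON d.id = s.department_id;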

What Nulls Should Have Looked Like:  NULL, NOVALUE, UNKNOWN


In retrospect, SQL would be cleaner if we could be more verbose about what we mean by a NULL.  UNKNOWN could then be reserved for the rare cases where we really must store a record with incomplete data in it.  NULL could be returned from outer joins, and NOVALUE could be used for foreign keys and places where we know the field is not applicable.

13 comments:

  1. I think the fundamentally best solution is to eliminate the 3-valued logic entirely and have only strict 2-valued logic in the database system.

    Then, in that context, we can have a series of user-definable singleton types to declare whatever reasons there are for regular values to not be present. (I heard once there was a thought of giving SQL 17-valued logic, but that didn't happen.) Then, rather than declaring a database field as nullable, you instead declare that the type of that field is a union type which ranges over the regular type plus whatever of the singletons you want to allow as alternatives when a regular type value isn't present.

    This is how my Muldis D language works, strictly 2-valued logic but supporting user-defined types and union types plus other features, so users have a lot of choice as to how they want to deal with the matter of unknown or not applicable or etc data on a context-specific basis. Likewise, there can be a user-definable singleton to emulate the SQL concept of a null, and user-definable operators for what to do when one is encountered, when emulating SQL behaviour is desired. Choice for users is good.

    I also strongly agree with your idea of having separate 'reason' fields which explain more specifically why a field doesn't have a regular value. (This idea is good enough that it's been independently proposed before.)

    Replies
    1. Three-valued logic certainly gets in the way in a lot of cases where we could otherwise logically prove a proposition. But I am not sure about cases where a value really is unknown.

      Obviously suppose we have a domain where the value must be greater than 0. In such a case, logically, UNKNOWN > 0 is true, and trivially so. Three-valued logic is a mess.

      But here's the problem which I see in general (which has to do with strictly typed languages): when it comes to some language types, there is no natural value for many of these special values. So when we have numeric data, we can't just pick a number and treat it as unknown. The type system has to be able to handle special values, and once you go there, I am not sure you can get away from three- or four-valued logic.

    2. Chris, maybe you missed my point. My proposal is that the type system is natively strict 2-valued logic, but that it also natively supports union types and polymorphic routines.

      In this context, all the complexity related to special values is boosted out of the core and into the user-definable realm. So with the same mechanisms that let you, say, declare that a table field may contain either an integer or a text, you can declare that a field may contain either a numeric or one or more special values that are not numeric, but not, say, text or dates.

      Then operators can be declared overloaded, so if, say, a "function `>` (numeric, unknown)" exists then the system calls it when appropriate, and if it doesn't exist then trying to call it is a no-such-function error.

      Another great use for this feature particularly with numbers is you can have core numeric types without any special values but you can then represent IEEE stuff like underflow, infinity, NaN, etc as such special values, so you can use those special values when you want to and not when you don't.

      So the core DBMS system is much easier to optimize and reason about due to the 2-valued logic, and further complexity only exists when users want it, and since that faux-n-valued-logic is still defined in terms of 2-valued logic, your optimizers or reasoning still works.

      This in contrast with 3-valued logic being baked in at the lowest level, which is what I oppose.

    3. This is an interesting point. Polymorphism could take you fairly far, but I'm still not sure how to get rid of three-valued logic.

      Consider the following:

      10 < n < 100

      n is unknown within that range.

      but if we are asked, is n + 10 > 15, we can say "yes it is." This could be done by treating an unknown-in-range as a type along the lines of your proposal, as I understand it.

      But what if instead we have n + 10 > 35? Here we really don't know. We can't say for sure that it is true or that it is false, so it seems to me we have to return an UNKNOWN BOOL, which brings us back to three-valued logic.

      How does your proposal work with a case like this?

    4. I have a few points to address this.

      First point ...

      The key of my proposal is that the system is natively 2-valued logic, and that if a user wants conceptually 3-valued logic, they have to define it explicitly as a layer over top (this can be reused in the form of an extension or library), and in terms of 2-valued logic. So you still actually have 2-valued logic, but your logical expressions are more complicated when you want to deal with unknowns.

      So, in the likely majority of cases where users only want to deal with knowns, they just use the base system and don't define/use extensions for add-ons. The system-defined boolean operators will only take or return true/false and trying to give anything else to them is an error (no such routine matching that signature).

      Users see complexity where there actually is complexity, and don't where there isn't. Two-valued logic is less complex than N-valued logic, and if users want the latter for some reason, they write more complex code, and it is easy to see when reading the code that it is supporting concepts of unknowns, or that it isn't.

      This is the opposite of SQL. In SQL, 3-valued logic is baked in at the lowest level, and users get that extra complicated behavior without it being obvious; it is hidden. In SQL, if users want to just have 2-valued logic, which usually is the case, they have to write much more verbose and error-prone code to deal with all the possible sources of unknowns. My proposal is to turn this on its head: only have unknowns and their complexity when desired, and by default we don't and we have better error checking.

      Without unknowns at the fundamental level, the DBMS has a much easier time optimizing code that users write. With SQL, the DBMS often can't do many optimizations because that might change the result in the face of unknowns, even though the vast majority of the time the user doesn't want the unknowns. So the optimizing is much harder and less effective. When unknowns are booted to the user space, and used there optionally, the DBMS can have a much easier time optimizing and error-checking.

    5. Second point ...

      With your example, the operator "<" defined for (unknown-in-range,integer) would have a result data type consisting of at least 3 values, two of which are true/false, and the others being whatever the operator could infer otherwise, or unknown in the simplest case.

      This is still 2-valued logic, just the result type of the operator isn't just a boolean.

      And more importantly, the standard "<" operators provided by the system for say (integer,integer) do only return true/false.

      Keep in mind that your "<" example is not a normal less-than operator, but rather some operator that isn't less-than which shares its base name.

      A normal less-than operator is only defined for a type with a total order and only results in true/false.

      If one thinks of "<" as the basis the system uses for being able to ORDER BY an arbitrary set of values (that is, ORDER BY is not an error only if all of the values in the column are in a common matrix bridged by one or more consistent, true/false-returning "<" definitions), then an overload of "<" that may return something else would produce an error if used there. So in practice your operator might be best with a different base name, or else you don't use ORDER BY etc. on unknown-in-range.

      Third point ...

      I want to emphasize that in my proposal, "unknown" is nothing special, it is "just another value", same as a number or string are. Semantics are only ascribed to it by users. Calling your third possible return value of your unknown-in-range operator "unknown" doesn't magically turn things into 3-valued logic, it is still 2-valued logic, just as an operator that might return either a number or a string is still 2-valued logic.

      Likewise, if we assume the "=" equality/identity operator (also the basis for JOIN etc matching or not) is automatically system-defined for all types, as I think should be the way of things (especially in a system supporting union types), then the not special "unknown" is always equal to itself. This is in contrast to SQL where you can't "=" 2 values of a type without explicitly defining that operator for it. If you want to have semantics where this value doesn't equal itself, then define some operator and give it a name other than "=" when that is what you want. Similar to if you want to be able to compare a number and a string and have them compare 'same' if the string looks like a number, which normally isn't default, you declare a not "=" for that.

      Have I answered your question?

    6. Example truth table:

      =       | true  | false | unknown
      --------+-------+-------+--------
      true    | true  | false | false
      false   | false | true  | false
      unknown | false | false | true

      My proposal works like that. We may have "unknown", but since comparing it with itself results in true, and with others results in false, it is still 2-valued logic.

    7. I think once you are treating unknown as a bool for comparison purposes, it makes perfect sense to treat it as a possible output. If you want to go with strict two-valued logic, I think you have to say that you only start with true and false as boolean inputs (and so if someone wants to have unknowns, they have to decide how to do the comparisons).

      Otherwise it makes no sense to me to treat unknown as a valid boolean into your truth table but not a valid output.

    8. I am assuming of course a strictly typed language here, and assuming that the = operator also only makes sense between comparable types.

    9. Yes, I am assuming a strictly typed language, where also there are no automatic coercions that change a value into another value.

      The paradigm I envision is that conceptually at least a type is a set of values and said sets may overlap arbitrarily; there is a universal type, we'll call it "any" here, and all other types are subsets of it; each value may belong to more than one such type.

      Two "values" are considered the same value only if one can be substituted for the other in any possible situation without changing the program behaviour. What we normally think of as distinct types in the absence of any subtyping are then disjoint subsets of "any". Types created with CREATE TYPE in Postgres are disjoint from any others previously defined.

      A subtype-by-constraint (eg a SQL DOMAIN) is a subset of its parent type; the same operators that work for the "parent" work unchanged for the "child" because each child is-a parent.

      A subtype-by-extension (eg that adds attributes) is actually a disjoint set of values from its "parent", however any operators/etc defined for the parent typically also exist with corresponding signatures for the "child"; if necessary a separate union type can exist between the 2 for when one wants to have a variable/parameter/attribute/etc of both types. Suffice it to say I consider subtype-by-extension to be a more complicated affair than subtype-by-constraint. Alternately, the "parent" and said union type can be one and the same if it is defined as, for example, "must have at least these attributes" rather than "must have only these attributes"; the latter is what is typically instantiated so the former tends more to be a virtual type in a manner of speaking.

    10. The automatically system-provided "=" operator that I envision would have a signature of (any,any) and thus for all intents and purposes users can not overload it with some other meaning. The system-provided "=" would be very strict and only return true when both inputs are exactly the same value, and false otherwise. This "=" definition would also determine the behaviour of operations like JOIN or UNION or IN etc. For "=" to return true, the inputs must already be of a common type without conversion or coercion, and the same value as that type determines. Using "=" between values of 2 different types returns false, including examples like this: 0=0.0, 0='0', 0=false, etc.

      The automatically system-provided "<>" or "!=" operator also has the signature (any,any), not user-overridable, and is always the logical 'not' of the result of "=".

      Other system-provided operators like "<",">" etc are defined separately per type and only between those types, so eg (integer,integer) and (text,text) exist but not (integer,text) or (integer,unknown) etc, however these are ones users can define.

      The system only defines boolean logical operators like AND,OR,NOT etc which have signatures (boolean,boolean), however users can define others like (boolean,unknown) to add the N-valued logic semantics.

      What I envision is that if some operator signature is system-defined, users can not overload it, and so code that only uses system-defined types won't change behaviour if users define more stuff; however, users can define signatures for operator base names that don't exist yet, thus extending the system.

      Actually, speaking to Muldis D, it has an extra feature where the previous paragraph doesn't have to be true. In the language, user code lives in one or more packages, and the system built-ins are exposed in the form of one or more other packages, in the same format as those of users. Normally a user package would "use" the standard system-defined one that declares the built-ins like "=" and such which have the behavior I describe. However, they may instead "use" some substitute package which is arbitrarily different from the standard built-ins one, that is free to define "=" differently for example, eg a package that decides to fully emulate SQL semantics. That difference of "=" only affects that user package, and other packages using the regular built-in don't have their behavior changing on them even if used in the same program, eg we cut down on action at a distance. So you effectively can override almost anything, but those changes tend to be package local, for safety. This feature is particularly savvy for, say, different parts of a program wanting to use different versions of a common dependency or language version, say, that are mutually incompatible, and package names are fully qualified with authority and version number.

    11. As to my truth table and the fact that it doesn't output "unknown": that is just the = and <> operators. In contrast, AND/OR/NOT etc are free to output "unknown" if the user wishes it. Likewise XNOR and XOR, the boolean-specific analogues of = and <>, are free to output "unknown" as well.

      But since JOIN etc are based on = by default (assuming an equi-join), they would treat "unknown" as equaling "unknown".

      That being said, Postgres has a feature where you can use a comparison operator of your choice other than "=" to determine when things match in a JOIN, and if the default behavior is as I say, users can tell a DBMS with that feature to use their differently-named general-equality-op-that-treats-unknown-unequal-to-unknown instead, so as to get 3VL-based JOIN behavior.

      Note that what I envision, if users create multiple operators with overlapping signatures, such that either could potentially be called by the system when a user invokes some 'foo' with the same 2 inputs, it is expected both operators would return the same answer for any overlap, and if they don't, that is considered an error (if not caught by the system then a bug in the user's logic).

      Does that answer your questions?

  2. Further to that point, my Set::Relation Perl module (which works now) is strictly 2-valued logic as well. Perl's undef is treated as being equal to itself, so eg if you do a join on 2 relation objects, a Perl undef WILL match itself. And Perl already has union types besides. The outer-join methods included come in several versions, one filling unmatched tuples with undefs, another letting you specify what to fill with on unmatched rows, and more options.
