Thursday, September 11, 2014

Math and SQL Part 6: The Problem with NULLs

This will be the final installment of Math and SQL, and it covers the problem with NULLs.  NULL handling is probably the most poorly thought-out feature of SQL and is generally inconsistent with the relational model.  Worse, a clear mathematical approach to NULLs is impossible in SQL because too many different meanings are attached to the same value.

Unfortunately, NULLs are also indispensable, because wider tables are more expressive than narrower tables.  This makes advice such as "don't allow nulls in your database" somewhat dangerous, because one ends up having to add them back in fairly frequently.

At the same time, understanding the problems that NULLs introduce is key to avoiding the worst of them and managing the rest.

Definition of a Null Set

A null set is simply a set with no members.  This brings us to the most obvious use of NULL: the value produced when an outer join finds no matching row.  This use by itself doesn't do much harm, but the inherent semantic ambiguity ("what does that mean?") also means you can't simply substitute join tables for nullable columns and solve the problems that NULLs bring into the database.  This will hopefully become clearer below.
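A minimal sketch of this case (the table and column names here are hypothetical, chosen only for illustration; this runs as-is in Postgres):

   CREATE TABLE department (id int PRIMARY KEY, name text NOT NULL);
   CREATE TABLE employee (
      id int PRIMARY KEY,
      name text NOT NULL,
      department_id int REFERENCES department(id)  -- nullable
   );

   INSERT INTO department VALUES (1, 'Accounting');
   INSERT INTO employee VALUES (1, 'Alice', 1), (2, 'Bob', NULL);

   -- Bob's row comes back with NULL for d.name.  Is his department
   -- unknown, or does he simply have none?  The join cannot say.
   SELECT e.name, d.name AS dept
     FROM employee e
     LEFT JOIN department d ON d.id = e.department_id;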

Null as Unknown

The first major problem surfaces when we ask, "when I do a left join and no row to the right is found, does that mean we don't know the answer yet, or that there is no value associated?"  A missing result from an outer join will sometimes mean that the answer is not yet known, if only because we are still inserting the data in stages.  But it can also mean that there simply is no value associated.  In almost all databases, both readings are possible in this situation.

But then there is no additional harm done in allowing NULLs to represent unknowns in the tables themselves, right?

Handling NULLs as unknown values complicates database design and introduces enough problems that experts like Chris Date tend to be against their use altogether.  The trouble is that using joins instead doesn't solve the problem; it only creates additional failure cases to be aware of.  So very often, people do use NULL in the database to mean unknown despite the problems.

NULL as unknown introduces problems for predicate logic because it brings in three-valued logic (true, false, and unknown), though these are typically only problems when one is storing a value (as opposed to a reference, such as a key) in the table.  1 + NULL IS NULL.  NULL OR FALSE IS NULL.  NULL OR TRUE IS TRUE.  This makes things complicated.  But sometimes we must....
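These identities can be checked directly; each SELECT below is standard SQL and runs as-is in Postgres:

   SELECT 1 + NULL;       -- NULL: unknown plus one is unknown
   SELECT NULL OR FALSE;  -- NULL: could be true or false
   SELECT NULL OR TRUE;   -- TRUE: true regardless of the unknown
   SELECT NULL = NULL;    -- NULL, not TRUE: two unknowns may differ
   SELECT NULL IS NULL;   -- TRUE: IS NULL is a two-valued test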

Null as Not Applicable

One severe antipattern that is frequently seen is the use of NULL to mean "not applicable" or "no value."  A few data types have no natural empty/no-op value; prime among these are the numeric types.  Worse, Oracle treats NULL as the same value as the empty string for VARCHAR types.

Now, the obvious problem here is that the database doesn't know that this NULL is not an unknown, and therefore you end up having to track the distinction yourself, use COALESCE() to convert to sane values, etc.  In general, if you can avoid using NULL to mean "not applicable," you will find it worthwhile.
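The COALESCE() bookkeeping looks like this; the table and column names are hypothetical, standing in for any pay table with nullable numeric columns:

   -- COALESCE returns its first non-NULL argument, so "not applicable"
   -- can be mapped to a sane sentinel at query time.
   SELECT emp_id,
          COALESCE(salary, 0)      AS salary,
          COALESCE(hourly_wage, 0) AS hourly_wage
     FROM pay;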

Now, if you have to do this, one strategy to make it manageable is to include other fields that tell you what the NULL means.  Consider, for example:

CREATE TABLE wage_class (
   id int not null primary key,  -- primary key needed for the reference below
   label text not null
);

INSERT INTO wage_class VALUES (1, 'salary'), (2, 'hourly');

CREATE TABLE wage (
   ssn text not null,
   emp_id int not null,
   wage_class int not null references wage_class(id),
   hourly_wage numeric,
   salary numeric,
   check (wage_class = 1 or salary is null),      -- only salaried rows may have a salary
   check (wage_class = 2 or hourly_wage is null)  -- only hourly rows may have an hourly wage
);
This approach lets us branch on wage_class, so we know from that field whether hourly_wage is applicable or not.  This is far cleaner and allows for better handling in queries than just putting NULLs in and expecting them to be semantically meaningful.  It is also quite helpful in that it ensures one does not accidentally process an hourly wage as a salary or vice versa.
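For example, assuming the employee pay table above is named wage (a hypothetical name, since the statement does not show one), a query can branch explicitly instead of guessing what a NULL means:

   SELECT emp_id,
          CASE wage_class
               WHEN 1 THEN salary
               WHEN 2 THEN hourly_wage * 2080  -- approx. full-time hours/year
          END AS annualized_pay
     FROM wage;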

What Nulls Do to Predicate Logic

Because NULLs can represent unknowns, they introduce three-valued predicate logic.  This by itself can be pretty nasty.  Consider the very subtle difference between:

   WHERE ssn like '1234%' AND salary < 50000

and:

   WHERE ssn like '1234%' AND salary < 50000 IS NOT FALSE

The latter will pull in hourly employees as well, as they have a NULL salary.
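To see this concretely, here is a minimal sketch (the table name and rows are hypothetical) that can be run in Postgres:

   CREATE TABLE wage_demo (
      ssn text NOT NULL,
      salary numeric  -- NULL for hourly employees
   );
   INSERT INTO wage_demo VALUES ('123456789', 40000),  -- salaried
                                ('123450000', NULL);   -- hourly

   -- For the hourly row, salary < 50000 evaluates to NULL, which is
   -- not TRUE, so the row is filtered out: only 123456789 returns.
   SELECT ssn FROM wage_demo
    WHERE ssn LIKE '1234%' AND salary < 50000;

   -- IS NOT FALSE accepts both TRUE and NULL, so the hourly row
   -- stays: both rows return.
   SELECT ssn FROM wage_demo
    WHERE ssn LIKE '1234%' AND salary < 50000 IS NOT FALSE;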

Nulls and Constraints

Despite all the problems, NULLs have become a bit of a necessary evil.  Constraints are a big part of the reason why.

Constraints are far simpler to maintain if they are self-contained in a tuple and therefore require no further table access to verify.  This means that wider tables allow more expressive constraints than narrow tables.

In the example above, we can ensure that every hourly employee has no salary, and every salaried employee has no hourly wage.  This level of mutual exclusion would not be possible if we were to break off salaries and wages into separate, joined tables.

Nulls and Foreign Keys

Foreign keys are a special case where the use of NULLs is routine and poses no problems.  In this context, NULL always means "no record referenced," and because of the specifics of three-valued boolean logic, such rows always drop out of join conditions.

NULLs in foreign keys make foreign key constraints and 5th Normal Form possible in many cases where it would not be otherwise.  Consequently they can be used routinely here with few if any ill effects.
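A self-contained sketch of that drop-out behavior (the table names are hypothetical; this runs as-is in Postgres):

   CREATE TABLE doc (id int PRIMARY KEY, title text NOT NULL);
   CREATE TABLE note (
      id int PRIMARY KEY,
      body text NOT NULL,
      doc_id int REFERENCES doc(id)  -- NULL means "references no document"
   );
   INSERT INTO doc VALUES (1, 'Spec');
   INSERT INTO note VALUES (1, 'attached', 1), (2, 'free-floating', NULL);

   -- The NULL key never satisfies the join condition (NULL = 1 is NULL,
   -- not TRUE), so the free-floating note drops out of the inner join.
   SELECT n.body, d.title
     FROM note n
     JOIN doc d ON d.id = n.doc_id;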

What Nulls Should Have Looked Like:  NULL, NOVALUE, UNKNOWN

In retrospect, SQL would be cleaner if we could be more verbose about what we mean by a NULL.  UNKNOWN could be reserved for the rare cases where we really must store a record with incomplete data; NULL could be returned from outer joins; and NOVALUE could be used for foreign keys and places where we know the field is not applicable.


  1. Any problems with making a null enum with values NULL, NOVALUE, and UNKNOWN?

    1. three-value logic would be difficult to address with an enum there, right? We'd need a complex set of operations to handle that.

  2. I think the fundamentally best solution is to eliminate the 3-valued logic entirely and have only strict 2-valued logic in the database system.

    Then, in that context, we can have a series of user-definable singleton types to declare whatever reasons there are for regular values to not be present. (I heard once there was a thought of giving SQL 17-valued logic, but that didn't happen.) Then, in that context, rather than declaring a database field as nullable, you instead declare that the type of that field is a union type which ranges over the regular type plus whatever of the singletons you want to allow as alternatives when a regular type value isn't present.

    This is how my Muldis D language works, strictly 2-valued logic but supporting user-defined types and union types plus other features, so users have a lot of choice as to how they want to deal with the matter of unknown or not applicable or etc data on a context-specific basis. Likewise, there can be a user-definable singleton to emulate the SQL concept of a null, and user-definable operators for what to do when one is encountered, when emulating SQL behaviour is desired. Choice for users is good.

    I also strongly agree with your idea of having separate 'reason' fields which explain more specifically why a field doesn't have a regular value. (This idea is good enough that it's been independently proposed before.)

    1. Three-valued logic certainly leads to a lot of cases where we can logically prove a proposition but the three-valued logic gets in the way. But I am not sure about cases where a value really is unknown.

      Suppose, for example, we have a domain where the value must be greater than 0. In such a case, logically, UNKNOWN > 0 is true, and trivially so, yet three-valued logic would return unknown. Three-valued logic is a mess.

      But here's the problem I see in general (which has to do with strictly typed languages): for some language types, there is no natural value for many of these special values. So when we have numeric data, we can't just pick a number and treat it as unknown. The type system has to be able to handle special values, and once you go there, I am not sure you can get away from three- or four-valued logic.

    2. Chris, maybe you missed my point. My proposal is that the type system is natively strict 2-valued logic, but that it also natively supports union types and polymorphic routines.

      In this context, all the complexity related to special values is boosted out of the core and into the user-definable realm. So the same mechanisms that let you, say, declare that a table field may contain either an integer or a text say, you can declare that a field may contain either a numeric or one or more special values that are not numeric, but not say text or dates.

      Then operators can be declared overloaded, so say "function `>` (numeric, unknown)" if that exists then the system calls it when appropriate, and if it doesn't exist then trying to call that is a no-such-function error.

      Another great use for this feature particularly with numbers is you can have core numeric types without any special values but you can then represent IEEE stuff like underflow, infinity, NaN, etc as such special values, so you can use those special values when you want to and not when you don't.

      So the core DBMS system is much easier to optimize and reason about due to the 2-valued logic, and further complexity only exists when users want it, and since that faux-n-valued-logic is still defined in terms of 2-valued logic, your optimizers or reasoning still works.

      This in contrast with 3-valued logic being baked in at the lowest level, which is what I oppose.

    3. This is an interesting point. Polymorphism could take you fairly far, but I am still not sure how to get rid of three-valued logic.

      Consider the following:

      10 < n < 100

      n is unknown within that range.

      but if we are asked, is n + 10 > 15, we can say "yes it is." This could be done by treating an unknown-in-range as a type along your proposal as I understand it.

      But what if instead we have n + 10 > 35? Here we really don't know. We can't say for sure that it is true or that it is false, so it seems to me we have to return an UNKNOWN BOOL, which brings us back to three-valued logic.

      How does your proposal work with a case like this?

    4. I have a few points to address this.

      First point ...

      The key of my proposal is that the system is natively 2-valued logic, and that if a user wants conceptually 3-valued logic, they have to define it explicitly as a layer over top (this can be reused in the form of an extension or library), and in terms of 2-valued logic. So you still actually have 2-valued logic, but your logical expressions are more complicated when you want to deal with unknowns.

      So, in the likely majority of cases where users only want to deal with knowns, they just use the base system and don't define/use extensions for add-ons. The system-defined boolean operators will only take or return true/false and trying to give anything else to them is an error (no such routine matching that signature).

      Users see complexity where there actually is complexity, and don't where there isn't. Two-valued logic is less complex than N-valued logic, and if users want the latter for some reason, they write more complex code, and it is easy to see when reading the code that it is supporting concepts of unknowns, or that it isn't.

      This is the opposite of SQL. In SQL, 3-valued logic is baked in at the lowest level, and users get that extra complicated behavior without it being obvious, it is hidden. In SQL, if users want to just have 2-valued logic, which usually is the case, they have to write much more verbose and error-prone code to deal with all the possible sources. My proposal is to turn this on its head, only have unknowns and their complexity when desired, and by default we don't and we have better error checking.

      Without unknowns at the fundamental level, the DBMS has a much easier time optimizing code that users write. With SQL, the DBMS often can't do many optimizations because that might change the result in the face of unknowns, even though the vast majority of the time the user doesn't want the unknowns. So the optimizing is much harder and less effective. When unknowns are booted to the user space, and used there optionally, the DBMS can have a much easier time optimizing and error-checking.

    5. Second point ...

      With your example, the operator "<" defined for (unknown-in-range,integer) would have a result data type consisting of at least 3 values, two of which are true/false, and the others being whatever the operator could infer otherwise, or unknown in the simplest case.

      This is still 2-valued logic, just the result type of the operator isn't just a boolean.

      And more importantly, the standard "<" operators provided by the system for say (integer,integer) do only return true/false.

      Keep in mind that your "<" example is not a normal less-than operator, but rather some operator that isn't less-than which shares its base name.

      A normal less-than operator is only defined for a type with a total order and only results in true/false.

      If one thinks of "<" as the basis the system uses for ORDER BY over an arbitrary set of values (that is, ORDER BY is not an error only if all of the values in the column are covered by one or more consistent true/false-returning "<" definitions), then an overload of "<" that may return something else would produce an error if used there. So in practice your operator might be best given a different base name, or else you don't use ORDER BY etc. on unknown-in-range.

      Third point ...

      I want to emphasize that in my proposal, "unknown" is nothing special, it is "just another value", same as a number or string are. Semantics are only ascribed to it by users. Calling your third possible return value of your unknown-in-range operator "unknown" doesn't magically turn things into 3-valued logic, it is still 2-valued logic, just as an operator that might return either a number or a string is still 2-valued logic.

      Likewise, if we assume the "=" equality/identity operator (also the basis for JOIN etc matching or not) is automatically system-defined for all types, as I think should be the way of things (especially in a system supporting union types), then the not special "unknown" is always equal to itself. This is in contrast to SQL where you can't "=" 2 values of a type without explicitly defining that operator for it. If you want to have semantics where this value doesn't equal itself, then define some operator and give it a name other than "=" when that is what you want. Similar to if you want to be able to compare a number and a string and have them compare 'same' if the string looks like a number, which normally isn't default, you declare a not "=" for that.

      Have I answered your question?

    6. Example truth table:

      =       | true  | false | unknown
      true    | true  | false | false
      false   | false | true  | false
      unknown | false | false | true

      My proposal works like that. We may have "unknown", but since comparing it with itself results in true, and with others results in false, it is still 2-valued logic.

    7. I think once you are treating unknown as a bool for comparison purposes, it makes perfect sense to treat it as a possible output. If you want to go with strict two-valued logic, I think you have to say that you only start with true and false as boolean inputs (and so if someone wants to have unknowns, they have to decide how to do the comparisons).

      Otherwise it makes no sense to me to treat unknown as a valid boolean into your truth table but not a valid output.

    8. I am assuming of course a strictly typed language here, and assuming that the = operator also only makes sense between comparable types.

    9. Yes, I am assuming a strictly typed language, where also there are no automatic coercions that change a value into another value.

      The paradigm I envision is that conceptually at least a type is a set of values and said sets may overlap arbitrarily; there is a universal type, we'll call it "any" here, and all other types are subsets of it; each value may belong to more than one such type.

      Two "values" are considered the same value only if one can be substituted for the other in any possible situation without changing the program behaviour. What we normally think of as distinct types in the absence of any subtyping are then disjoint subsets of "any". Types created with CREATE TYPE in Postgres are disjoint from any others previously defined.

      A subtype-by-constraint (eg a SQL DOMAIN) are subsets of their parent type; the same operators that work for the "parent" work unchanged for the "child" because each child is-a parent.

      A subtype-by-extension (eg that adds attributes) is actually a disjoint set of values from its "parent", however any operators/etc defined for the parent typically also exist with corresponding signatures for the "child"; if necessary a separate union type can exist between the 2 for when one wants to have a variable/parameter/attribute/etc of both types. Suffice it to say I consider subtype-by-extension to be a more complicated affair than subtype-by-constraint. Alternately, the "parent" and said union type can be one and the same if it is defined as, for example, "must have at least these attributes" rather than "must have only these attributes"; the latter is what is typically instantiated so the former tends more to be a virtual type in a manner of speaking.

    10. The automatically system-provided "=" operator that I envision would have a signature of (any,any) and thus for all intents and purposes users can not overload it with some other meaning. The system-provided "=" would be very strict and only return true when both inputs are exactly the same value, and false otherwise. This "=" definition would also determine the behaviour of operations like JOIN or UNION or IN etc. For "=" to return true, the inputs must already be of a common type without conversion or coercion, and the same value as that type determines. Using "=" between values of 2 different types returns false, including examples like this: 0=0.0, 0='0', 0=false, etc.

      The automatically system-provided "<>" or "!=" operator also has the signature (any,any), not user-overridable, and is always the logical 'not' of the result of "=".

      Other system-provided operators like "<",">" etc are defined separately per type and only between those types, so eg (integer,integer) and (text,text) exist but not (integer,text) or (integer,unknown) etc, however these are ones users can define.

      The system only defines boolean logical operators like AND,OR,NOT etc which have signatures (boolean,boolean), however users can define others like (boolean,unknown) to add the N-valued logic semantics.

      What I envision is that if some operator signature is system-defined, users can not overload it, and so code that only uses system-defined types won't change behaviour if users define more stuff; however, users can define signatures for operator base names that don't exist yet, thus extending the system.

      Actually, speaking to Muldis D, it has an extra feature where the previous paragraph doesn't have to be true. In the language, user code lives in one or more packages, and the system built-ins are exposed in the form of one or more other packages, in the same format as those of users. Normally a user package would "use" the standard system-defined one that declares the built-ins like "=" and such, which have the behavior I describe. However, it may instead "use" some substitute package that is arbitrarily different from the standard built-ins one, and that is free to define "=" differently, eg a package that decides to fully emulate SQL semantics.

      That difference of "=" only affects that user package; other packages using the regular built-ins don't have their behavior change on them even if used in the same program, so we cut out action at a distance. So you effectively can override almost anything, but those changes tend to be package-local, for safety. This feature is particularly handy when, say, different parts of a program want to use different versions of a common dependency or language version that are mutually incompatible, and package names are fully qualified with authority and version number.

    11. As to my truth table, the fact it doesn't output "unknown", that is just the = and <> operators. In contrast, AND/OR/NOT etc are free to output "unknown" if the user wishes it. Likewise XNOR and XOR, the boolean-specific analogies to = and <>, are free to output "unknown" as well.

      But since JOIN etc are based on = by default (assuming an equi-join), they would treat "unknown" as equaling "unknown".

      That being said, Postgres has a feature where you can use a comparison operator of your choice other than "=" to determine when things match in a JOIN, and if the default behavior is as I say, users can tell a DBMS with that feature to use their differently-named general-equality-op-that-treats-unknown-unequal-to-unknown instead, so to get 3VL-based JOIN behavior.

      Note that what I envision, if users create multiple operators with overlapping signatures, such that either could potentially be called by the system when a user invokes some 'foo' with the same 2 inputs, it is expected both operators would return the same answer for any overlap, and if they don't, that is considered an error (if not caught by the system then a bug in the user's logic).

      Does that answer your questions?

  3. Further to that point, my Set::Relation Perl module (which works now) is strictly 2-valued logic as well. Perl's undef is treated as being equal to itself, so eg if you do a join on 2 relation objects, a Perl undef WILL match itself. And Perl already has union types besides. The outer-join methods included come in several versions, one filling unmatched tuples with undefs, another letting you specify what to fill with on unmatched rows, and more options.