Comments on Perspectives on LedgerSMB: Math and SQL Part 6: The Problem with NULLs

As to my truth table, the fact it doesn't outp...

2014-09-15T01:05:36.557-07:00

As to my truth table, the fact it doesn't output "unknown", that is just the = and <> operators. In contrast, AND/OR/NOT etc are free to output "unknown" if the user wishes it. Likewise XNOR and XOR, the boolean-specific analogies to = and <>, are free to output "unknown" as well.

But since JOIN etc are based on = by default (assuming an equi-join), they would treat "unknown" as equaling "unknown".

That being said, Postgres has a feature where you can use a comparison operator of your choice other than "=" to determine when things match in a JOIN, and if the default behavior is as I say, users can tell a DBMS with that feature to use their differently-named general-equality-op-that-treats-unknown-unequal-to-unknown instead, so to get 3VL-based JOIN behavior.

Note that what I envision, if users create multiple operators with overlapping signatures, such that either could potentially be called by the system when a user invokes some 'foo' with the same 2 inputs, it is expected both operators would return the same answer for any overlap, and if they don't, that is considered an error (if not caught by the system then a bug in the user's logic).

Does that answer your questions?

The automatically system-provided "=" op...

2014-09-15T01:04:40.766-07:00

The automatically system-provided "=" operator that I envision would have a signature of (any,any) and thus for all intents and purposes users can not overload it with some other meaning. The system-provided "=" would be very strict and only return true when both inputs are exactly the same value, and false otherwise. This "=" definition would also determine the behaviour of operations like JOIN or UNION or IN etc. For "=" to return true, the inputs must already be of a common type without conversion or coercion, and the same value as that type determines. Using "=" between values of 2 different types returns false, including examples like this: 0=0.0, 0='0', 0=false, etc.

The automatically system-provided "<>" or "!=" operator also has the signature (any,any), not user-overridable, and is always the logical 'not' of the result of "=".

Other system-provided operators like "<",">" etc are defined separately per type and only between those types, so eg (integer,integer) and (text,text) exist but not (integer,text) or (integer,unknown) etc, however these are ones users can define.

The system only defines boolean logical operators like AND,OR,NOT etc which have signatures (boolean,boolean), however users can define others like (boolean,unknown) to add the N-valued logic semantics.

What I envision is that if some operator signature is system-defined, users can not overload it, and so code that only uses system-defined types won't change behaviour if users define more stuff; however, users can define signatures for operator base names that don't exist yet, thus extending the system.

Actually, speaking to Muldis D, it actually has an extra feature where the previous paragraph doesn't have to be true. In the language, user code lives in one or more packages, and the system built-ins are exposed in the form of one or more other packages, in the same format as those of users. Normally a user package would "use" the standard system-defined one that declares the built-ins like "=" and such which have the behavior I describe. However, they may instead "use" some substitute package which is arbitrarily different from the standard built-ins one, that is free to define "=" differently for example, eg a package that decides to fully emulate SQL semantics. That difference of "=" only affects that user package, and other packages using the regular built-in don't have their behavior changing on them even if used in the same program, eg we cut on action at a distance. So you effectively can override almost anything, but those changes tend to be package local, for safety. This feature is particularly savvy for, say, different parts of a program wanting to use different versions of a common dependency or language version, say, that are mutually incompatible, and package names are fully qualified with authority and version number.

Yes, I am assuming a strictly typed language, wher...

2014-09-15T00:59:59.681-07:00

Yes, I am assuming a strictly typed language, where also there are no automatic coercions that change a value into another value.

The paradigm I envision is that conceptually at least a type is a set of values and said sets may overlap arbitrarily; there is a universal type, we'll call it "any" here, and all other types are subsets of it; each value may belong to more than one such type.

Two "values" are considered the same value only if one can be substituted for the other in any possible situation without changing the program behaviour. What we normally think of as distinct types in the absence of any subtyping are then disjoint subsets of "any". Types created with CREATE TYPE in Postgres are disjoint from any others previously defined.

A subtype-by-constraint (eg a SQL DOMAIN) are subsets of their parent type; the same operators that work for the "parent" work unchanged for the "child" because each child is-a parent.

A subtype-by-extension (eg that adds attributes) is actually a disjoint set of values from its "parent", however any operators/etc defined for the parent typically also exist with corresponding signatures for the "child"; if necessary a separate union type can exist between the 2 for when one wants to have a variable/parameter/attribute/etc of both types. Suffice it to say I consider subtype-by-extension to be a more complicated affair than subtype-by-constraint. Alternately, the "parent" and said union type can be one and the same if it is defined as, for example, "must have at least these attributes" rather than "must have only these attributes"; the latter is what is typically instantiated so the former tends more to be a virtual type in a manner of speaking.

I am assuming of course a strictly typed language ...

2014-09-14T22:40:57.988-07:00

I am assuming of course a strictly typed language here, and assuming that the = operator also only makes sense between comparable types.

I think once you are treating unknown as a bool fo...

2014-09-14T22:39:51.538-07:00

I think once you are treating unknown as a bool for comparison purposes, it makes perfect sense to treat it as a possible output. If you want to go with strict two-valued logic, I think you have to say that you only start with true and false as boolean inputs (and so if someone wants to have unknowns, they have to decide how to do the comparisons).

Otherwise it makes no sense to me to treat unknown as a valid boolean into your truth table but not a valid output.

Example truth table: = | true | false | unknown -...

2014-09-14T20:35:51.597-07:00

Second point ... With your example, the operator ...

2014-09-14T20:30:19.909-07:00

Second point ...

With your example, the operator "<" defined for (unknown-in-range,integer) would have a result data type consisting of at least 3 values, two of which are true/false, and the others being whatever the operator could infer otherwise, or unknown in the simplest case.

This is still 2-valued logic, just the result type of the operator isn't just a boolean.

And more importantly, the standard "<" operators provided by the system for say (integer,integer) do only return true/false.

Keep in mind that your "<" example is not a normal less-than operator, but rather some operator that isn't less-than which shares its base name.

A normal less-than operator is only defined for a type with a total order and only results in true/false.

If one thinks of "<" as being the basis the system uses for being able to ORDER BY an arbitrary set of values, that is, ORDER BY is not an error only if all of the values in the column are in a common matrix bridged by one or more (consistent) true/false returning "<" definitions, then an overload of "<" that may return something else would produce an error if used here, so in practice your operator might be best with a different base name, or else you don't use ORDER BY etc on unknown-in-range.

Third point ...

I want to emphasize that in my proposal, "unknown" is nothing special, it is "just another value", same as a number or string are. Semantics are only ascribed to it by users. Calling your third possible return value of your unknown-in-range operator "unknown" doesn't magically turn things into 3-valued logic, it is still 2-valued logic, just as an operator that might return either a number or a string is still 2-valued logic.

Likewise, if we assume the "=" equality/identity operator (also the basis for JOIN etc matching or not) is automatically system-defined for all types, as I think should be the way of things (especially in a system supporting union types), then the not special "unknown" is always equal to itself. This is in contrast to SQL where you can't "=" 2 values of a type without explicitly defining that operator for it. If you want to have semantics where this value doesn't equal itself, then define some operator and give it a name other than "=" when that is what you want. Similar to if you want to be able to compare a number and a string and have them compare 'same' if the string looks like a number, which normally isn't default, you declare a not "=" for that.

Have I answered your question?

I have a few points to address this. First point ...

2014-09-14T20:29:21.271-07:00

I have a few points to address this.

First point ...

The key of my proposal is that the system is natively 2-valued logic, and that if a user wants conceptually 3-valued logic, they have to define it explicitly as a layer over top (this can be reused in the form of an extension or library), and in terms of 2-valued logic. So you still actually have 2-valued logic, but your logical expressions are more complicated when you want to deal with unknowns.

So, in the likely majority of cases where users only want to deal with knowns, they just use the base system and don't define/use extensions for add-ons. The system-defined boolean operators will only take or return true/false and trying to give anything else to them is an error (no such routine matching that signature).

Users see complexity where there actually is complexity, and don't where there isn't. Two-valued logic is less complex than N-valued logic, and if users want the latter for some reason, they write more complex code, and it is easy to see when reading the code that it is supporting concepts of unknowns, or that it isn't.

This is the opposite of SQL. In SQL, 3-valued logic is baked in at the lowest level, and users get that extra complicated behavior without it being obvious, it is hidden. In SQL, if users want to just have 2-valued logic, which usually is the case, they have to write much more verbose and error-prone code to deal with all the possible sources. My proposal is to turn this on its head, only have unknowns and their complexity when desired, and by default we don't and we have better error checking.

Without unknowns at the fundamental level, the DBMS has a much easier time optimizing code that users write. With SQL, the DBMS often can't do many optimizations because that might change the result in the face of unknowns, even though the vast majority of the time the user doesn't want the unknowns. So the optimizing is much harder and less effective. When unknowns are booted to the user space, and used there optionally, the DBMS can have a much easier time optimizing and error-checking.

This is an interestng point. Polymorphism could t...

2014-09-13T20:08:34.644-07:00

This is an interestng point. Polymorphism could take you fairly far, but still not sure how to get rid of three valued logic.

Consider the following:

10 < n < 100

n is unknown within that range.

but if we are asked, is n + 10 > 15, we can say "yes it is." This could be done by treating an unknown-in-range as a type along your proposal as I understand it.

But what if instead we have n + 10 > 35? Here we really don't know. We can't say for sure that it is true or that it is false, so it seems ot me we have to return an UNKNOWN BOOL which brings us back to three valued logic.

How does your proposal work with a case like this?

Chris, maybe you missed my point. My proposal is ...

2014-09-13T13:15:51.521-07:00

Chris, maybe you missed my point. My proposal is that the type system is natively strict 2-valued logic, but that it also natively supports union types and polymorphic routines.

In this context, all the complexity related to special values is boosted out of the core and into the user-definable realm. So the same mechanisms that let you, say, declare that a table field may contain either an integer or a text say, you can declare that a field may contain either a numeric or one or more special values that are not numeric, but not say text or dates.

Then operators can be declared overloaded, so say "function `>` (numeric, unknown)" if that exists then the system calls it when appropriate, and if it doesn't exist then trying to call that is a no-such-function error.

Another great use for this feature particularly with numbers is you can have core numeric types without any special values but you can then represent IEEE stuff like underflow, infinity, NaN, etc as such special values, so you can use those special values when you want to and not when you don't.

So the core DBMS system is much easier to optimize and reason about due to the 2-valued logic, and further complexity only exists when users want it, and since that faux-n-valued-logic is still defined in terms of 2-valued logic, your optimizers or reasoning still works.

This in contrast with 3-valued logic being baked in at the lowest level, which is what I oppose.

Three value logic certainly leads to a lot of case...

2014-09-13T07:48:43.371-07:00

Three value logic certainly leads to a lot of cases where we can logically prove a proposition where three value logic gets in the way. But I am not sure about cases where a value really is unknown.

Obviously suppose we have a domain where the value must be greater than 0. In such a case, logically, UNKNOWN > 0 is true, and trivially so. Three-valued logic is a mess.

But here's the problem which I see in general (which has to do with strictly typed languages): when it comes to some language types, there is no natural value for many of these special values. So when we have numeric data, we can't just pick a number and treat it as unknown. The type system has to be able to handle special values, and once you go there, I am not sure you can get away from three or four valued logic..

Further to that point, my Set::Relation Perl modul...

2014-09-13T00:16:20.990-07:00

Further to that point, my Set::Relation Perl module (which works now) is strictly 2-valued logic as well. Perl's undef is treated as being equal to itself, so eg if you do a join on 2 relation objects, a Perl undef WILL match itself. And Perl already has union types besides. The outer-join methods included come in several versions, one filling unmatched tuples with undefs, another letting you specify what to fill with on unmatched rows, and more options.

I think the fundamentally best solution is to elim...

2014-09-13T00:11:56.291-07:00

I think the fundamentally best solution is to eliminate the 3-valued logic entirely and have only strict 2-valued logic in the database system.

Then, in that context, we can have a series of user-definable singleton types to declare whatever reasons there are for regular values to not be present. (I heard once there was a thought of giving SQL 17-valued logic, but that didn't happen.) Then, in that context, rather than declaring a database field as nullable, you instead declare that the type of that field is a union type which ranges over the regular type plus whatever of the singletons you want to allow as alternatives when a regular type value isn't present.

This is how my Muldis D language works, strictly 2-valued logic but supporting user-defined types and union types plus other features, so users have a lot of choice as to how they want to deal with the matter of unknown or not applicable or etc data on a context-specific basis. Likewise, there can be a user-definable singleton to emulate the SQL concept of a null, and user-definable operators for what to do when one is encountered, when emulating SQL behaviour is desired. Choice for users is good.

I also strongly agree with your idea of having separate 'reason' fields which explain more specifically as to why a field doesn't have a regular value. (This idea is good enough that its been independently proposed before.)