Monday, April 29, 2013

Introduction to PostgreSQL arrays, basic concepts and normalization considerations

This is part of a two part series.  The second will discuss indexing arrays which is a complex topic.

PostgreSQL has very good support for arrays of other data types.  One can have nested data types where an attribute of a relation stores an array of tuples, and each of these tuples can hold arrays of other tuples or primitives.

Arrays are extremely powerful in PostgreSQL but they pose a number of significant issues as well.  They can be used in ways which break first normal form but mere use of an array does not necessarily break first normal form's atomicity requirement because arrays differ in significant fundamental ways from relations.  Understanding arrays in PostgreSQL can help avoid costly errors.  Relations and arrays are subject to fundamentally different constraints and therefore are equivalent in only a few very special cases.

Arrays derive their power in PostgreSQL due to the fact that they have strong definitional requirements.

An Array is a Mathematical Matrix


PostgreSQL arrays follow the basic definition of a mathematical matrix.  The array must be rectangular so if one element has an array in it, all other elements must have an array of the same dimensions.   All members of an array must be of the same data types.

The following are thus valid Integer arrays:

  • '{1,2,3,4,5}'
  • '{{1,2},{3,4},{5,6},{7,8},{9,10}}'

The following are not valid integer arrays:

  • '{1,2,3,{4, 5}}'
  • '{{1,2},{3,4},{5}}'
These constraints make arrays particularly useful in PostgreSQL.  One can use text arrays to represent sets of tuples being input into stored procedures for example.  One can also write functions to do matrix math on numerical arrays (we will look at an example below).  If PostgreSQL lacked true arbitrary precision math via the numeric type, we could implement something similar using arrays.  These checks all make arrays safe to work with in ways that most programming languages do not.

Arrays vs Relations vs Tuples (and First Normal Form)


Arrays bear some relationship to relations and tuples, but all three are different in significant ways which means they are best at solving different problems.

Like tuples, arrays are ordered.  However:

  • Unlike tuples every array element must have the same data type.  This means that arrays of text can represent tuples of any type.  
  • Unlike tuples, arrays do not have a fixed number of elements.  Elements can be added without disturbing the basic type.
Like arrays, relations are basically rectangular, and open ended (items can be added or removed from the end).  However:

  • Arrays are ordered, relations are not.  This means that an array value is a domain, while a relation value is a set or bag of domains (depending on constraints).
  • All data types in an array must be identical.  Relations do not have this restriction.
What this means basically is that the mere use of an array in a column of a table does not necessarily violate 1NF, if the array is used in a way that suggests that ordinality matters.

For example we could store matrices for matrix arithmetic as numeric[] arrays.  We could store some problems that are transformed into numeric[] matrices into the db as well (for example simultaneous linear equations) and these would not necessarily violate 1NF.

Example:  Solving Simultaneous Linear Equations in PL/PGSQL


I will leave understanding this as an exercise to the reader.  It is, however, a small example.  This is a very basic program I originally wrote in BASIC on my C64 when I was younger as a homework solving program, and have since written a number other incarnations as a demonstration.  it solves simultaneous linear equations by arraying them in matrix and reducing the matrix accordingly.  The program will only accept an n by n+1 matrix, and then will calculate.

This example is useful because it shows that anything you can solve through matrix math you can solve using PostgreSQL arrays....

 create or replace function solve_linear(numeric[]) returns numeric[]
language plpgsql as
$$
declare retval numeric[];
        c_outer int; -- counters
        c_inner int;
        c_working int;
        upper1 int;
        upper2 int;
        ratio numeric; --caching of calculation
begin
    IF array_upper($1, 1) <> array_upper($1, 2) - 1 THEN
        RAISE EXCEPTION 'bad input, must be n x n+1 matrix';
    END IF;
    upper1 := array_upper($1, 1);
    upper2 := array_upper($1, 2);
    FOR c_outer IN 1 .. upper1 LOOP
        FOR c_inner IN 1 .. upper1 LOOP
            IF c_inner = c_outer THEN CONTINUE;
            END IF;
            ratio := $1[c_inner][c_outer]/$1[c_outer][c_outer];
            for c_working in 1 .. upper2 loop
                $1[c_inner][c_working] := $1[c_inner][c_working]
                                    - ($1[c_outer][c_working] * ratio);
            end loop;
        end loop;
    end loop;
    retval := '{}';
    for c_outer in 1 .. array_upper($1,1) loop
       retval := retval || $1[c_outer][array_upper($1,1) + 1]/ $1[c_outer][c_outer];
    end loop;
    return retval;
end;
$$;


As an example:

 select * from solve_linear('{{1,2,3},{2,3,4}}'::numeric[]);
                       solve_linear                      
----------------------------------------------------------
 {-1.00000000000000000000000000000000,2.0000000000000000}
(1 row)

In other words, if we have 1x + 2y = 3 and 2x + 3y = 4, then these will intersect where x is -1 and y is 2.

Yes, it is order independent.

 select * from solve_linear('{{2,3,4},{1,2,3}}'::numeric[]);
                             solve_linear                            
----------------------------------------------------------------------
 {-1.0000000000000000000000000000000000000000,2.00000000000000000000}
(1 row)

However, ordering within a subarray is important:

select * from solve_linear('{{2,4,3},{1,3,2}}'::numeric[]);
                            solve_linear                            
---------------------------------------------------------------------
 {0.5000000000000000000000000000000000000000,0.50000000000000000000}
(1 row)


One minor issue with the above code is that it throws a division by 0 if the lines are parallel:

select * from solve_linear('{{1,2,3},{1,2,6}}'::numeric[]);
ERROR:  division by zero
CONTEXT:  PL/pgSQL function "solve_linear" line 19 at assignment

Consequently error handling could be improved, but it isn't too bad for a demonstration.

As it turns out this is a fairly generalizable class of problems and may be particularly helpful when doing geometric math.

I hope this helps show some of the uses and value of PostgreSQL array types.  In the next post we will talk about arrays and indexes.....

No comments:

Post a Comment