Wednesday, January 30, 2013

Building SOLID Databases: Open/Closed Principle

Introduction


Like the Single Responsibility Principle, the Open/Closed Principle is fairly easy to apply to object-relational design in PostgreSQL, very much unlike the Liskov Substitution Principle, which will be the subject of next week's post.  However, the Open/Closed Principle applies only in a weaker way to relational design, and hence to object-relational design, for much the same reason that the Liskov Substitution Principle applies in completely different ways.

This instalment thus marks the beginning of what is likely to be a full rotation, going from similarity to difference and back to similarity again.

As is always the case, relations can be thought of as specialized "fact classes" which are then manipulated and used to create usable information.  Because they model facts, and not behavior, design concerns often apply differently to them.

Additionally this is the first real taste of complexity in how object and relational paradigms often combine in an object-relational setup.

The Open/Closed Principle in Application Design


In application design and development, the Open/Closed Principle facilitates code stability.  The basic principle is that one should be able to extend a module without modifying its source code.  In the LedgerSMB 1.4 codebase, for example, there are places where we prepare a screen for entering report criteria, but wrap the functions which do so in other functions which set up report-specific input information.  The base function is thus open to extension without modification, and its complexity is limited to the most common aspects of generating report filter screens.  Similarly, customizations may simply inherit existing code interfaces and add new routines.  This allows modified routines to exist without the risk of interrupting other callers.

In addition to code stability (meaning a low rate of change to code in response to changing requirements), there are two other frequently-overlooked benefits to respecting the Open/Closed Principle.

The first is that software tends to be quite complex, and state management is a major source of bugs in virtually all software programs.  When a frequently used subroutine is changed, the change may remain compliant with the interface specification and test cases, yet nonetheless introduce or uncover bugs elsewhere in the application.  In essence, the consequences of changing code are not always easily foreseen.  The major problem here is that state often changes when interfaces are called, and consequently changing code behind a currently used interface means there may be subtle state changes that are not adequately thought through.  As the complexity of an interface grows, this problem grows with it.

If instead interfaces are built for flexibility, either by accepting additional data which can be passed on to the next stage or via inheritance, then state is more easily kept correct, bugs are less likely to surface, and the application can be adapted more quickly to changing business rules.

Problems Defining "Extension" and "Modification" in a db schema


Inside the database, state changes are well defined and follow very specific rules.  Consequently it is not sufficient to define "modification" as a "change in source code."  If it were, an "alter table add column ..." would always qualify.  But is this the case?  Or is it merely an extension?  In general, merely adding an optional field cannot pose the state problems seen in application environments, but it can pose some problems of its own.

Defining modification and extension is thus a problem.  In general some sorts of modification are more dangerous than others.  For this reason we may look at these both in a strict view, where ideally tables are left alone and we add new joining tables to extend, and a more relaxed view where extension includes adding columns which do not change internal data constraints (i.e. where all previous insert and update statements would remain valid, and no constraints are relaxed).

In general, what makes relational design nice here is that one can prove or disprove whether a schema change is backwards-compatible using math.  If it is not backwards-compatible, then it is always modification.  If it is backwards-compatible, it is always extension under the relaxed view.
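To make the distinction concrete, here is a minimal sketch (the customer table and its columns are hypothetical).  Under the relaxed view, adding a nullable column is extension; under the strict view, the base table is left alone and a joining table is added instead:

-- relaxed view: extension, since all previous inserts and updates remain valid
ALTER TABLE customer ADD COLUMN twitter_handle text;

-- strict view: the base table stays untouched; extend via a joining table
CREATE TABLE customer_twitter (
    customer_id int PRIMARY KEY REFERENCES customer(id),
    twitter_handle text NOT NULL
);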

Another point that must be borne in mind is that relational math is quite capable of creating new synthetic relations, because relations are structurally transparent and encapsulation is typically weak.  State management issues, meanwhile, are handled through a very well-developed framework of transaction management, locking, and, frequently, snapshot views.  When you combine these techniques with declarative constraints on values, one has a very robust state engine based on a very different approach than object-oriented applications managing application behavior.

Problems with modifying table schemas


In a case where software may be managed via overlapping deployment cycles, certain problems can occur when extending tables by adding columns.  This is because knowledge of the deployment cycle typically only goes one way: the extension team has knowledge of the base team's past cycles, while the base team typically has no knowledge of the extending team's work.  This is typical in cases where software is deployed and then customized.  Adding fields to tables makes extension easy, but the cost is that major version upgrades of the base package may clobber extensions or fail outright.  In essence, the relaxed standard takes on the risk that upgrades of the base package may not go smoothly.

On the other hand, if the software is deployed via a single deployment cycle, as is typical of purely inhouse applications and commercial software, these problems are avoided, and extension by adding fields does not break anything.

The obvious problem here is that these categories are not mutually exclusive, however much they appear to be.  Commercial software may be extended by a second team on-site, and therefore a single deployment cycle at the vendor does not guarantee a single deployment cycle overall.

Object/Relational Interfaces and the OCP


Because object-relational interfaces can encapsulate data, and often can be made to run in certain security contexts (which can be made to cascade), the open/closed principle has a number of applications in ensuring testable database interfaces where security barriers are involved.

For example, we might have an automation system where computers connect to the database in various roles to pull job information.  Rather than having separate accounts for each computer, we can assign them the same login role, but filter out the data they can see based on the client IP address.  This would increase the level of containment in the event of a system compromise.

So we might create an interface of something like my_client() which instantiates the client information from the client IP address, and use that in various functions to filter.   Consequently we might just run:

select * from get_jobs();

The problem, of course, with such a system is testability.  We can't readily test the output because it depends on the client IP address.  So we might instead create a more flexible interface, available only to superusers, which accepts a client object which we can instantiate by name, integer id, or the like.  In that case we may have a query that can be called by DBAs and test suites like this, where 123 is the internal id of the client:

SELECT * FROM get_jobs(client(123));

The get_jobs() function for the production clients would now look like this:

CREATE OR REPLACE FUNCTION get_jobs() RETURNS SETOF jobs 
LANGUAGE SQL SECURITY DEFINER AS $$

SELECT * FROM get_jobs(my_client());

$$;


We have essentially built an API which is open to extension for security controls but closed to modification.  This means we can run test cases against the underlying database functions even on production (since these can be in transactions that roll back), and push tests of the my_client() interface to the clients themselves, to verify their proper setup.
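For completeness, a minimal sketch of what the two client interfaces might look like, assuming a client table keyed by IP address (the table and column names are hypothetical; inet_client_addr() is the standard PostgreSQL function returning the remote address of the current connection):

CREATE OR REPLACE FUNCTION my_client() RETURNS client
LANGUAGE SQL STABLE AS $$

SELECT * FROM client WHERE ip_address = inet_client_addr();

$$;

CREATE OR REPLACE FUNCTION client(in_id int) RETURNS client
LANGUAGE SQL STABLE AS $$

SELECT * FROM client WHERE id = in_id;

$$;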

A Functional Example:  Arbitrary data type support in pg_message_queue


There are a few cases, however, where the open-closed principle has more direct applicability.

In pg_message_queue 0.1, only text, bytea, and xml queues were supported.  Since virtually anything can be put in a text field, this was deemed to be sufficient at the time.   However in 0.2, I wanted to be able to support JSON queues but in a way that would not preclude the extension from running on PostgreSQL 9.1.  The solution was to return to the open/closed principle and build a system which could be extended easily for arbitrary types.  The result was much more powerful than initially hoped for (and in fact now I am using queues with integer and ip address payloads).

In 0.1, the code looked like this:

CREATE TABLE pg_mq_base (
    msg_id bigserial not null,
    sent_at timestamp not null default now(),
    sent_by name not null default session_user,
    delivered_at timestamp
);

CREATE TABLE pg_mq_xml (
    payload xml not null,
    primary key (msg_id)
) inherits (pg_mq_base);

CREATE TABLE pg_mq_text (
    payload text not null,
    primary key (msg_id)
) inherits (pg_mq_base);

CREATE TABLE pg_mq_bytea (
    payload bytea not null,
    primary key (msg_id)
) inherits (pg_mq_base);


The approach here was to use table inheritance so that new queue types could be easily added.  When a queue is added, a table is created like one of the above tables, including all indexes, etc.  The relevant portion of the pg_mq_create_queue function is:

EXECUTE 'CREATE TABLE ' || quote_ident(t_table_name) || '(
    like ' ||  quote_ident('pg_mq_' || in_payload_type ) || ' INCLUDING ALL
  )';



The problem here is that while it was possible to extend this, one couldn't do so easily without modifying the source code of the functions.  In 0.2, we reduced the type-specific tables to:

-- these are types for return values only.  they are not for storage.
-- using tables because types don't inherit

CREATE TABLE pg_mq_text (payload text) inherits (pg_mq_base);
CREATE TABLE pg_mq_bin (payload bytea) inherits (pg_mq_base);


But the real change is that the payload type is now a property of the queue.  The table creation portion of pg_mq_create_queue is now:

EXECUTE 'CREATE TABLE ' || quote_ident(t_table_name) || '(
    like pg_mq_base INCLUDING ALL,
    payload ' || in_payload_type || ' NOT NULL
  )';


This has the advantage of allowing payloads of any type known to PostgreSQL.  We can have queues for mac addresses, internet addresses, GIS data, and even complex types if we want to do more object-relational processing on output.
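Assuming pg_mq_create_queue takes a queue name plus the payload type name which is interpolated as the column type, as the snippet above suggests (I am hedging on the exact signature here), creating a queue with an inet payload might look something like:

SELECT pg_mq_create_queue('host_monitor', 'inet');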

This approach will become more important after 0.3 is out and we begin working on object-relational interfaces on pg_message_queue.

Conclusions


The Open/Closed principle is where we start to see a mismatch between object-oriented application programming and object-relational database design.  It's not that the basic principle doesn't apply, but just that it does so in often strange and counter-intuitive ways.

Wednesday, January 23, 2013

Building SOLID Databases: Single Responsibility and Normalization

Introduction


This instalment will cover the single responsibility principle in object-relational design, and its relationship both to data normalization and object-oriented application programming.   While single responsibility is a fairly easy object-oriented principle to apply here, I think it is critical to explore in depth because it helps provide a clearer framework to address object-relational design.

As in later instalments I will be using snippets of code developed elsewhere for other areas.  These will not be full versions of what was written, but versions sufficient to show the basics of data structure and interface.

Relations and Classes:  Similarities


Relations and classes, on the surface, look deceptively similar, to the point where one can look at relations as sets of objects, and in fact this equivalence is the basis of object-relational database design.

Objects are data structures used to store state, which have identity and are tightly bound to interface.  Relations are data structures which store state, and if they meet second normal form, have identity in the form of a primary key.  Object-relational databases then provide interface and thus in an object-relational database, relations are classes, and contain sets of objects of a certain class.

Relations and Classes:  Differences


So similar are relations and classes in structure that a very common mistake is to simply use a relational database management system as a simple object store.  This tends to result in a brittle database, leading to a relatively brittle application.  Consequently many of us see this approach as something of an anti-pattern.

The reason why this doesn't work terribly well is not that the basic equivalence is bad but that relations and classes are used in very different ways.  On the application layer, classes are used to model (and control) behavior, while in the database, relations and tuples are used to model information.  Thus tying database structures to application classes in this way essentially overloads the data structures, turning the structures into reporting objects as well as behavior objects.

Relations thus need to be seen not merely as classes but as specialized classes used for persistent storage and reporting.  They therefore have fundamentally different requirements than the behavior classes in the application, and different reasons to change.  An application class typically changes when there is a need for a change in behavior, while a relation should change only when there is a change in data retention or reporting.

Relations have traditionally been divorced from interface, and this provides a great deal of power.  While classes tend to be fairly opaque, relations tend to be very transparent.  The reason is that while both represent state information, whether application state or other facts, objects traditionally encapsulate behavior (and thus act as building blocks of behavior), while relations always encapsulate information and are building blocks of information.  Thus the data structures of relations must be transparent, while object-oriented design tends to push for less transparency and more abstraction.

It is worth noting, then, that because these systems are designed to do different things, there are many DBAs who suggest encapsulating the entire database behind an API defined by stored procedures.  The typical problem with this approach is that loose coupling of the application to the interface is difficult (but see one of the first posts on this blog for a solution).  When the db interface is tightly coupled to the application, one typically ends up with problems on several levels, and it tends to sacrifice good application design for good relational database design.

Single Responsibility Principle in Application Programming


The single responsibility principle holds that every class should have a single responsibility which it should entirely encapsulate.  A responsibility is defined as a reason to change.  The canonical example of a violation of this principle is a class which might format and print a report.  Because both data changes and cosmetic changes may require a change to the class, this would be a violation of the principle at issue.   In an ideal world, we'd separate out the printing and the formatting so that cosmetic changes do not require changes when data changes are made and vice versa.

The problem of course with the canonical example is that it is not self-contained.  If you change the data in the report, it will almost certainly require cosmetic changes.  You can try to automate those changes but only within limits, and you can abstract interfaces (dependency inversion) but in the end if you change the data in the report enough, cosmetic changes will become necessary.

Additionally a "reason to change" is epistemologically problematic.  Reasons foreseen are rarely if ever atomic, and so there is a real question as far as how far one pushes this.  In terms of formatting a report, do we want to break out the class that adjusts to paper size so that if we want to move from US Letter to A4 we no longer have to change the rest of the cosmetic layout?

Perfect separation of responsibilities in that example is thus impossible, as it probably always is --- you can only change business rules to a certain point before interfaces must change, and when that happens the cascading flow of necessary changes can be quite significant.

The database is, however, quite different, in that the responsibility of database-level code (including DDL and DML) is limited to the proposition that we should construct answers from known facts.  This makes a huge difference in terms of single responsibility, and it is possible to develop mathematical definitions for single responsibility.

Not only is this possible but it has been done.  All of the normal forms from third on up address single responsibility.

The Definition of Third Normal Form

Quoting Wikipedia,

Codd's definition states that a table is in 3NF if and only if both of the following conditions hold:
  • The relation R (table) is in second normal form (2NF)
  • Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every superkey of R.
A non-prime attribute is an attribute that is not part of any candidate key.  In essence, what third normal form states is that every relation must contain a superkey and values functionally and directly dependent on that superkey.

This will become more important as we look at how data anomalies dovetail with single responsibility.

Normalization and Single Responsibility


The process of database normalization is an attempt to create relational databases in which data anomalies do not exist.  Data anomalies occur where modifying data requires modifying other data to maintain accuracy (where no independent fact changes are recorded), or where existing data may project current or historical facts not actually in existence (join anomalies).

This process works by breaking down keys and superkeys, and their dependencies, such that data is tracked in smaller, self-contained units.  Beginning at third normal form, one can see relations taking on the single responsibility of managing data directly dependent on their superkeys.  From this point forward, a relation's structure changes (assuming no decision to further decompose it into a higher normal form) if and only if there is a change in what data is tracked that is directly dependent on its superkey.
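As a small sketch of the idea (tables hypothetical): storing a department name alongside each employee violates third normal form, because the name depends transitively on the key through department_id, and decomposing removes the update anomaly:

-- violates 3NF: department_name depends on department_id, not directly on id
CREATE TABLE employee_denormalized (
    id int PRIMARY KEY,
    name text NOT NULL,
    department_id int NOT NULL,
    department_name text NOT NULL
);

-- decomposed: each relation manages only data directly dependent on its key
CREATE TABLE department (
    id int PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE employee (
    id int PRIMARY KEY,
    name text NOT NULL,
    department_id int NOT NULL REFERENCES department(id)
);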

The responsibility of the database layer is the storage of facts and the synthesis of answers.  Since the storage relations themselves handle the first, normalization is a prerequisite to good object-relational design.

The one major caveat here, however, is that first normal form's atomicity requirement must be interpreted slightly differently in object-relational setups, because more complex data structures can be atomic compared to a purely relational design.  In a purely relational database, the available data types are relatively simple, and therefore facts must be maximally decomposed.  For example, we might store an IP address plus network mask as four ints for the address and an int for the network mask, or we might store it as a single 32-bit int plus another int for the network mask, but the latter poses problems of display that the former does not.  In an object-relational database, however, we might store the address as an array of four ints for IPv4 or, if we need better performance, we might build a custom type.  If storage is not a problem but ease of maintenance is, we might even define relations, domains, and the like to hold IP addresses, and then store the tuple in a column with appropriate functional interfaces.

None of these approaches necessarily violate first normal form, as long as the data type involved properly and completely encapsulates the required behavior.  Where such encapsulation is problematic, however, they would violate 1NF because they can no longer be treated as atomic values.  In all cases, the specific value has a 1:1 correspondence to an IP address.
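As an illustration, a minimal sketch of the composite type approach (all names here are hypothetical):

CREATE TYPE ipv4_addr AS (
    octets  int[],  -- the four address octets
    masklen int     -- the netmask length
);

CREATE OR REPLACE FUNCTION as_text(ipv4_addr) RETURNS text
LANGUAGE SQL IMMUTABLE AS $$

SELECT array_to_string(($1).octets, '.') || '/' || ($1).masklen;

$$;

The value remains atomic from the relation's point of view, while the functional interface handles the display problem noted above.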

Additionally where the needs are different, storage, application interface, and reporting classes should be different (this can be handled with updateable views, object-relational interfaces, and the like).

Object-Relational Interfaces and Single Responsibility


For purely relational databases, normalization is sufficient to address single responsibility.  Object-relational designs bring some additional complexity because some behavior may be encapsulated in the object interfaces.  There are two fundamental cases where this may make a difference, namely in terms of compositional patterns and in terms of encapsulated data within columns.

A compositional pattern in PostgreSQL typically occurs when we use table inheritance to manage commonly co-occurring fields which occur in ways which are functionally dependent on many other fields in a database.  For example, we might have an abstract notes table, and then have various tables which inherit from it, possibly as part of other larger tables.  A common case where composition makes a big difference is in managing notes.  People may want to attach notes to all kinds of other data in a database, and so one cannot say that the text or subject of a note is mutually dependent with the fields of any one other table.

A typical purely relational approach is to either have many independently managed notes tables or have a single global note table which stores notes for everything, and then have multiple join tables to add join dependencies.  The problem with this is that the note data is usually dependent logically, if not mathematically, on the join dependency, and so there is no reasonable way of expressing this without a great deal of complexity in the database design.

An object-relational approach might be to have multiple notes tables, but have them inherit the table structure of a common notes table.  This table can then be expanded and interfaces added as needed, and it satisfies the single responsibility principle even though we might not be able to say that there is a natural functional dependency within the table itself.
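A minimal sketch of that approach, assuming a hypothetical customer table:

-- abstract table; no data is stored here directly
CREATE TABLE note (
    subject text not null,
    body text not null,
    created_at timestamp not null default now(),
    created_by name not null default session_user
);

-- notes attached to customers, inheriting structure (and future interfaces)
CREATE TABLE customer_note (
    customer_id int not null REFERENCES customer(id),
    PRIMARY KEY (customer_id, created_at)
) INHERITS (note);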

The second case has to do with storing complex information in columns.  Here stability and robustness of code is especially important, and traditional approaches of the single responsibility principle apply directly to the contained data type.

Example:  Machine Configuration Database and SMTP configuration


One of my current projects is building a network configuration database for a LedgerSMB hosting business I am helping to found (more on this soon).  For competitive reasons I cannot show the whole code here.  However, what I would like to do is show a very abbreviated version, as I used it to solve a very specific issue.

One of the basic challenges in a network configuration database is that the direct functional dependencies for a given machine may become quite complex when we assume that a given piece of network software is not likely to be running more than once on a given machine.  Additionally we often want to ensure that certain sorts of software are set to be configured for certain types of machines, and so constraints can exist that force wider tables.

The width and complexity of some configuration tables can possibly pose a management problem over time for the reason that they may not be obviously broken into manageable chunks of columns.

One possible solution is to decompose the storage class into smaller mix-ins, each of which expresses a set of functional dependencies on a specific key, fully encapsulating a single responsibility.  The overall storage class then exists to manage cross-mixin constraints and handle the actual physical storage.  The data can then be presented as a unified table, or as multiple joined tables (and this works even where views would add significant complexity).  In this way the smaller sub-tables can be given the responsibility of managing the configuration of specific pieces of software.

We might therefore have tables like:

-- abstract table, contains no data
CREATE TABLE mi_smtp_config (
    mi_id bigint,
    smtp_hostname text,
    smtp_forward_to text
);

CREATE TABLE machine_instance (
   mi_id bigserial,
   mi_name text not null,
   inservice_date date not null,
   retire_date date,
   ....
) INHERITS (mi_smtp_config, ...);

The major advantage to this approach is that we can easily check and add which fields are set up to configure which software, without looking through a much larger, wider table.   This also provides additional interfaces for related data, and the like.

For example, "select * from mi_smtp_config" is directly equivalent of "select (mi::mi_smtp_config).* from machine_instance mi;

Conclusions


When we think of relations as specialized "fact classes" as opposed to "behavior classes" in the application world, the idea of the single responsibility principle works quite well with relational databases, particularly when paired with other encapsulation processes like stored procedures and views.

In object-relational designs, the principle can be used as a general guide for further decomposing relations into mix-in classes, or creating intelligent data types for attributes, and it becomes possible to solve a number of problems in this regard without breaking normalization rules.

Friday, January 18, 2013

Building SOLID Databases: Introduction

The SOLID design approach is a set of principles developed in object-oriented programming.  This series will explore the applicability of these principles to object-relational databases, and contrast the way in which they manifest in the database with the way in which they manifest in the application layer.

You can't program a database like you'd program an application.  The database essentially serves twin functions: a persistence store, and a model of information derivable from that store.  The ACID consistency model therefore limits what sorts of programming you can reasonably do in the database, because it is a bad idea, outside of logging, to mix transactional and non-transactional side effects.

As an information model, then, the way these principles apply is relatively different.  One way to think of it is that if object-oriented design might be seen as a triangle, applying it in the db requires turning that triangle upside down so you have a flat base to build your application on.

This series will look at applying SOLID principles to object-relational database design, comparing and contrasting the way in which the principles get applied to that of general software development.  Many aspects of relational design in fact play into these topics and so one ends up with a very different mix than one might have with pure object-oriented design.

In general this series continues to look at relations as sets of objects rather than sets of tuples.  This means that the question is how we define data structures and interfaces so that we can bridge object and relational worlds in the database.  The SOLID principles are good starting points but the sort of logic done is fundamentally different and so they are applied in different ways.

Welcome to the series.  I hope you enjoy it.

Thursday, January 10, 2013

An awesome one-liner for PostgreSQL

Recently I was looking at options for exploring CIDR blocks in PostgreSQL.  In particular, I was wondering about checking a CIDR block for unallocated IP addresses in another table.

I had been aware of network address types in PostgreSQL for some time but had not been aware of how powerful they actually were.  I decided to write a function to expand a CIDR block into a list of IP addresses.  While my initial version wasn't perfect (it includes the network and broadcast addresses in the block), changing that will not be hard.

The first version was:

CREATE OR REPLACE FUNCTION all_ips(cidr)
RETURNS SETOF inet LANGUAGE SQL IMMUTABLE AS
$$
 select $1 + s from generate_series(0,  broadcast($1) - network($1)) s;
$$;

That's it.

To exclude network and broadcast addresses, two simple modifications are required:

CREATE OR REPLACE FUNCTION all_ips(cidr)
RETURNS SETOF inet LANGUAGE SQL IMMUTABLE AS
$$
 select $1 + s from generate_series(1,  broadcast($1) - (network($1)) - 1) s;
$$;

And there you have it.  It is fast, or at least as fast as can be expected.

mq_test=# explain analyze
mq_test-# select all_ips('192.168.1.0/24');
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Result  (cost=0.00..0.26 rows=1 width=0) (actual time=0.213..0.511 rows=254 loops=1)
 Total runtime: 0.580 ms
(2 rows)

Ok, not so much for 10.0.0.0/8.....

                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Result  (cost=0.00..0.26 rows=1 width=0) (actual time=5977.386..32370.877 rows=16777214 loops=1)
 Total runtime: 37185.476 ms
(2 rows)

But what do you expect for generating almost 17 million rows?

Friday, January 4, 2013

Object-Relational Algebra 3: The Series Join function

I am back from a break due to working hard on getting LedgerSMB 1.4 closer to release (beta 2 has been released and we are working hard on moving towards beta 3).   Anyway, it is time to finish up the series on object-relational algebra.


The second important addition I would make to relational algebra to give it object-relational capabilities is what I call a "series join" operation.  A series join only makes sense in object-relational approaches because the output can be minimal and yet certain functional dependencies on that output can be readily defined.

I use the capital sigma Σ to refer to a series join, acknowledging that the lower case sigma refers to the select operation and so there is a partial conflict.

A series join takes a theta condition much like any other join, but this theta condition operates on the input relation to the series join.  The input set is joined to itself in the first iteration and then to the output set of the previous iteration in subsequent iterations.  This is repeated until the output set does not change with subsequent iterations.  In a finite data set, the mappings will also be finite.  An optional subscript provides a starting set of values in the input relation, and an optional superscript provides a maximum iteration depth.

The output is a set of tuples (o1, o2), where o1 is the initial object's key or reference and o2 is the linked-to object's key or reference.  From here, functional dependencies arise: path (how we get from one record to another) and depth (how many iterations we have to go through to reach the destination record) are two obvious ones.

A series join provides a useful way to express transitive operations.  This allows, for example, binary transitive closure to be expressed and tested because one can generate all possible paths from one node on a graph up until the point where they loop back on themselves.

Series join operations in the SQL world are roughly approximated by the CONNECT BY and WITH RECURSIVE constructs, both of which are typically used to construct a finite series of self-joins.  However there are key differences too.  In my model we are less worried about the tuple membership than what we can clearly derive from a series join.

Please pardon my layout since I am not quite sure how to make mathematical equations display perfectly in blogger.

Suppose we have a relation employee.

We might have reports = employee Σ₁³ (subscript 1 for the starting id, superscript 3 for the maximum depth) with a theta condition of id θ manager, and this would provide a list of the direct reports to employee 1, their direct reports, and their direct reports in turn.  We can also express certain functional dependencies, such as depth(reports), which is between 0 and 3, and path(reports), which will show the chain of command between the report and the employee.  If reports = employee Σ₁ and employee 1 is the CEO, then we get the entire organizational chart of the business.
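In SQL terms, this example can be roughly approximated with a recursive common table expression, assuming a hypothetical employee table with id and manager columns (a sketch of the approximation, not of the series join itself):

WITH RECURSIVE reports AS (
    SELECT id, manager, 1 AS depth, ARRAY[manager, id] AS path
      FROM employee
     WHERE manager = 1
    UNION ALL
    SELECT e.id, e.manager, r.depth + 1, r.path || e.id
      FROM employee e
      JOIN reports r ON e.manager = r.id
     WHERE r.depth < 3
)
SELECT * FROM reports;

Here depth and path are computed explicitly, whereas in the algebra they fall out as functional dependencies of the series join's output.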

Series joins allow us to do graph traversal, hierarchical searches, and more in an OR database, and approximations of this have been implemented in the relational model.   They are mathematically clear and clean, avoid magic operations, and solve a great number of problems.

Monday, October 15, 2012

Object/Relational Algebra 2: Function Properties

Previously I introduced the idea that functional notation allows relational algebra to play well with other mathematical disciplines, and therefore solves, at least in theory, the limits of relational algebra's expressiveness.  In this second part I will discuss the initial properties of functions as I see them in this approach, and the next, and final, posting in this series will cover the special series join function.  The series join function provides a way to address transitive connections between tuples in a relvar.

Definition and Types of Functions


The general approach of this object-relational algebra involves applying algebra to manipulate operations on functions of sets of tuples.  A set of tuples is a relvar, or relational variable.  The set of tuples in the relation is the domain of the function, and as with all functions, it represents exactly one result per domain element, so for every tuple, a single value will result.  That value of course can be a set (including a null set), a tuple, or even a relvar.  However, it is also necessary to note that a function must return the same basic type of result for every tuple.  We can't return a simple value for one, a relation for another, and a set of values for a third.

Each element in a function can be seen as being relational in the sense of within the expressive boundaries of relational algebra, or non-relational in the sense of being outside those boundaries.  Functions themselves can then be divided into three categories:
  1. Relational functions contain only relational elements and can be expressed solely in relational algebra.
  2. Non-relational functions contain only non-relational elements and cannot be further simplified in relational algebra in any circumstance, and
  3. Semi-relational functions contain both relational and non-relational elements, and these can be partly further simplified in relational operations.
All types of functions have some properties in common.  If a function is a function of a relation, then it is necessarily true that it is also a function of every candidate key in that relation, and that for any well-formed data that could be added, it will continue to return a single value of consistent type.

A second level of categorization can be had regarding whether the domain of the relation is fully mathematically self-contained or not.

For example, suppose we have a relvar of employee data, E, and E contains an attribute dob, representing the employee's date of birth.   Therefore, for every tuple in E, there is a dob element.   We can then have two functions:

age(E)  represents the current age of the employee at the time it is evaluated.

age2000(E) represents the age of the employee on New Year's Eve, 2000.

age2000(E) has a fully self-contained domain.  The values depend on the values stored in the relvar E, and nowhere else.  age(E), however, does not have a fully self-contained domain.  For any given relational operation, age(E) will behave like a function and we can use it as such, but the results will change over time.  Oracle calls these deterministic and non-deterministic functions respectively.   PostgreSQL divides these up into additional categories for planning purposes: in addition to immutable functions, whose output never changes for a given input, you have stable and volatile functions, the latter not really being functions of their input at all.
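In PostgreSQL terms, the pair might be sketched as follows, assuming the employee relvar is a table with a dob column (current_age is so named to avoid colliding with the built-in age function, on which both bodies lean):

CREATE OR REPLACE FUNCTION age2000(employee) RETURNS interval
LANGUAGE SQL IMMUTABLE AS $$

SELECT age(timestamp '2000-12-31', ($1).dob::timestamp);

$$;

CREATE OR REPLACE FUNCTION current_age(employee) RETURNS interval
LANGUAGE SQL STABLE AS $$

-- age(timestamp) measures from current_date, so the result drifts over time
SELECT age(($1).dob::timestamp);

$$;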

Properties of Relational Functions


I didn't start getting really excited about this until I began working with relational functions.  Once I started, though, there was no going back.  Relational functions can be expressed in relational algebra and therefore roughly map to subqueries in SQL.  Of course this is not exact, but there may be cases where it is helpful to look at them from this perspective.  This is particularly important when optimizing simple functions written in SQL and called in where clauses.

Relational functions can be expanded algebraically in other relational operations.  For example:

Suppose we have a relvar L which represents landmarks, and has an attribute country_id which is a foreign key to country, and suppose further we have a relvar C which represents countries and has an attribute called name which represents the country's name.  We can then define the following functions:

let country(L) = σ_{id = L.country_id}(C)

let frenchl(L) = σ_{country(L).name = 'France'}(L)

country(L) can be in-lined into frenchl(L), and this can be transformed into a subselect, and eventually (in a set- rather than bag-based approach at least) a left join.
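In SQL terms (the landmark and country tables are hypothetical), the in-lining corresponds to rewriting a correlated subquery as a join:

-- frenchl as a correlated subquery
SELECT l.*
  FROM landmark l
 WHERE (SELECT c.name FROM country c WHERE c.id = l.country_id) = 'France';

-- the join form the planner can reduce it to
SELECT l.*
  FROM landmark l
  JOIN country c ON c.id = l.country_id
 WHERE c.name = 'France';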

The set of single-value-returning relational functions is a proper superset of the set of functional dependencies in the database reachable through appropriate join dependencies.  Without grouping and aggregation, these sets are identical.  However, with grouping and aggregation, relational functions express a larger array of possible values.

 

Properties of Non-Relational Functions


Non-relational functions cannot be expressed in relational algebra and therefore must be solved outside the stream of relational operations themselves.  Non-relational functions must be treated according to their return types: a set-returning function can be treated as a set, a single-value-returning function can be treated like a relvar's attribute, and a relvar-returning function can be treated like a relvar itself.

This moves relational algebra from the area of solving relational queries into the area of symbolic simplification.

Next:  The special "Join Series" function.

Thursday, October 11, 2012

Object-Relational Algebra 1: Definitions and Assumptions

 Introduction

This post begins a series to try to establish a mathematical basis for object-relational database systems.  The idea here is to extend relational math in order to cover object-relational database operations rather than to model current databases mathematically.  I therefore want to start with my assumptions and definitions.

For simplicity's sake I will follow Codd's assumptions, namely that each row is unique, that row ordering is not significant, and that within each row the columns are significant.  This establishes a definition of a relation as a set of tuples.  Everything here is then an extension of Codd's math, rather than a replacement for it.  In the programming world, boolean operations may mean something slightly different, or bags may be used instead of sets, but the goal here is not to model databases but to model data, and Codd's model is more elegant in this regard.  It is the real world that gets messy.

To Codd's assumptions I will add my own, namely that the value of any attribute is opaque to relational operations but may act as the domain of a function.  This becomes important when I get to the special Σ function and the functional dependencies on its output.  The special function in my system, Σ(R) (join series) with a θ condition, gives you a way to represent self-joins of arbitrary depth in relational math, for example, and combined with functional dependencies of its output gives you a way to express queries of transitive binary closure, but that is not its most interesting use, at least to me.

A second important point here is that I will diverge from some previous efforts in that I will not throw in "magic operators" designed to solve corner cases.  The goal here is to build a more expressive system that can solve more problems.  Therefore an operator that determines transitive binary closure is ruled out, and if we can't solve a problem without a magic operator, then this will be left for others.  Additionally this effort is intended to be minimalist and allow many different behaviors to be encapsulated through different approaches.  For this reason I will build directly on the work of Codd and more or less ignore other approaches that various researchers have taken.

Finally this is somewhat to be distinguished from the way SQL does things.  SQL itself is a messy (from a math standpoint) application of both relational algebra and predicate calculus, using bags instead of sets.

One important problem is choosing a name for this extension.  The term object in computer science tends to be a bit abstract and not really a good description of what is going on here.  A better description might be functional-relational algebra.  I call it object-relational algebra only because it fits in with object-relational databases.

In the posts that follow, relation and function will be given their ordinary mathematical meanings.  However, the relationship between the two is not, to my knowledge, well defined.

What is an Object Anyway?

In computer science terms, an object is a data structure (here represented as a tuple) which has imperative behavior attached to it.  Object-oriented programming is thus essentially an extension to structural programming where behavior follows structure.    Classes define both structure and behavior, and objects instantiate and apply these rules.  Applying a model such as this to an algebra of data is somewhat problematic, because algebra is about derivation of values, not about behavior.

What we are doing here is similar, and yet different.  I haven't fleshed out a theory of inheritance vs composition (PostgreSQL supports multiple inheritance which can be used to implement either inheritance or composition, strictly speaking), and collection tables can be used to implement composition in various databases including PostgreSQL.  A theory of composition and inheritance is left to others.  From the standpoint of my model these are equivalent.

Instead of behavior, I use the same basic approach of tying structure to function in order to expand the reach of the relational model.  In computer science terms, a class is then roughly similar to a relation, and an object roughly similar to a tuple.  However, instead of behavior, the typical focus is on derived information.  Thus functions become important in a way in which they have not typically been used in relational math.

Functions of a Relation

So here, f(R) is a function of relvar R if, and only if, for every possible tuple in R, f(R) returns one distinct result (that result could however be a tuple or even a set).    Functional dependencies stored in the database can then be modeled as functions, but so can functional dependencies which are not stored.

The following then is not a function: f(R) =  x^0.5 because it resolves to two distinct values for every value of attribute x in relation R (one is positive and the other is negative).  However the following is a function:  f(R) = (abs(x^0.5), -1 * abs(x^0.5)) because it resolves to a tuple with both possible answers from the previous, at least if x is always positive or imaginary numbers are allowed.  It could return a set instead and that would be acceptable, however in this case the structure of the result is also set by the function, and the above function is distinct from g(R) = { abs(x^0.5), -1 * abs(x^0.5) } because the structure of the result is different.

In standard relational algebra, a tuple is finitely expressive, namely one can only express a finite set of values off a single tuple.  However, for any given tuple, an infinite number of functions are possible, and thus when combined with functions, a tuple becomes infinitely expressive.  Not only can all functional dependencies of the tuple's superkey be expressed as a function of the tuple, but any transformation of the tuple's values, or the values of functional dependencies, can be expressed as such as well.

A functional approach also allows us to dispense with the rename operation in relational algebra, since renamed relations can be seen as relation-returning functions.

Kinds of Functions

In my model, functions can be divided up into the following categories:
  1. Relational Functions can be expressed solely in relational algebra.
  2. Non-relational functions possess no non-trivial relational algebra reduction.  x^2 is non-relational.
  3. Complex Functions, relationally speaking, have non-trivial relational reductions, but non-relational components too.
Assuming sufficient join dependencies in a database, every functional dependency can be expressed through relational functions.  Moreover trivial relational dependencies can always be expressed by relational functions, and indeed by mere projection operations.   We can then define a trivial relational function as one which reduces solely to project operations off information stored in the tuple.
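For instance, given an employee relvar E with a dob attribute, the simplest case looks like:

dob(E) = π_dob(E)    -- trivial: reduces to a mere projection of stored data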

Result

The resulting system essentially creates something like an object-oriented database but one which is fundamentally relational, and in which objects behave differently than they do with mere object-oriented persistence.   While each tuple is infinitely expressive, this is possible only because of a distinction between primary (Codd's Algebra) stored data and secondary (calculated) answers.  This extension however allows any kind of mathematics (or other logic) to be tied into relational algebra.   This allows relational algebra to be used along with many other kinds of disciplines to build extremely sophisticated data models.

Forthcoming:
1: Exploring properties of relational, non-relational and complex functions
2: The special function  Σ(R) representing the output of a simple series join.