## Friday, January 4, 2013

### Object-Relational Algebra 3: The Series Join function

I am back from a break due to working hard on getting LedgerSMB 1.4 closer to release (beta 2 has been released and we are working hard on moving towards beta 3).   Anyway, to finish up the series on object-relational algebra.

The second important addition I would make to relational algebra to give it object-relational capabilities is what I call a "series join" operation.  A series join only makes sense in object-relational approaches because the output can be minimal and yet certain functional dependencies on that output can be readily defined.

I use the capital sigma Σ to refer to a series join, acknowledging that the lower case sigma refers to the select operation and so there is a partial conflict.

A series join takes a theta condition much like any other join, but this theta condition operates on the input relation to the series join.  The ut set is joined to itself in the first iteration and then to the output set of the previous iteration in subsequent iterations.  This is repeated until the output set does not change with subsequent iterations.  in a finite data set, the mappings will also be finite.  An optional subscript provides a starting set of values in the input relation and an optional superscript provides a maximum iteration depth.

The output is set of tuples of (o1, o2) where o1 is the initial object's key or reference and o2 is the linked to object's key or reference.  From here the following functional dependencies arise:  path can be used to trace a path (how we get from one record to another) and depth (how many iterations we have to go to reach the destintion record) are two obvious ones.

A series join provides a useful way to express transitive operations.  This allows, for example, binary transitive closure to be expressed and tested because one can generate all possible paths from one node on a graph up until the point where they loop back on themselves.

Series join operations in the SQL world are roughly approximated by the CONNECT BY and WITH RECURSIVE constructs, both of which are typically used to construct a finite series of self-joins.  However there are key differences too.  In my model we are less worried about the tuple membership than what we can clearly derive from a series join.

Please pardon my layout since I am not quite sure how to make mathematical equations display perfectly in blogger.

Suppose we have a relation employee.

We might have reports = employee Σ13 with a theta condition of id θ manager and this would provide a list of the direct reports to this employee, their direct reports, and their direct reports.  We can also express certain functional dependencies, such as depth(reports), which is between 0 and 3, and path(reports) which will show the chain of command between the report and the employee.  If reports = employee Σ1 and employee 1 is the CEO, then we get the entire organizational chart of the business.

Series joins allow us to do graph traversal, hierarchical searches, and more in an OR database, and approximations of this have been implemented in the relational model.   They are mathematically clear, clean, avoiding magic operations and solve a great number of problems.

1. What you talk about makes me think of a generic "transitive closure" operation that Date and Darwen talk about in their literature, which takes a binary relation representing nodes of a graph as input and results in another binary relation representing arcs of that graph. This operator is defined in terms of join plus rename plus union plus projection. See also my Perl module Set::Relation which implements this, calling it "tclose". Is your proposed operator the same or different than the "tclose" implemented at http://cpansearch.perl.org/src/DUNCAND/Set-Relation-0.12.7/lib/Set/Relation/V2.pm ?

1. This "tclose" as Chris Date and Hugh Darwen refer to is a convenient foundation on which to write hierarchical queries, such as regarding a bill of materials.

2. I am pretty sure you can show them to be equivalent provided that the operator is a series operator. In other words, what is important is not the join/rename/union/projection (that is effectively what goes on in the series join function) but rather the fact that it is a series.

There is a second difference and that is that I express the output as functions of the relation that is output rather than as tuple members. This is primarily designed to abstract the actual storage output and allow expressive flexibility down the road. The reason has to do with non-relational functional dependencies though to be sure my guess is that these are probably most useful when being dependencies of the joining relations (I can't think of any cases where this would be necessary here), although I suppose if a key was sufficiently informative it might. For example, if you were joining IP addresses together based on routing tables, a non-relational functional dependency might be the broadcast address on any given segment (i.e. it is a functional dependency which can be calculated but not through relational math).

Again anything relational you can do with a series join functional notation you can do with a series of join/rename/union/projection operations but in the end it is the series portion which is important.

I do think they are substantively equivalent on the relational side. The only real difference is that a functional notation allows one to provide hooks for non-relational calculations, looking into the data.

3. One area where a functional notation might make a difference might be if we are generating a hierarchy out of, say CIDR blocks which we might need to know if we were using a database to generate routing tables. We want to know how the blocks relate to eachother.

Here the join condition is going to be only partially relational. Relational math itself is insufficient to tell us whether one CIDR block is contained in another. So we could build a hierarchy using a functional notation, where part of that function is expressed relationally.