Saturday, June 8, 2013

Tangent: Design thoughts about next gen PKI in today's world

With the revelations of massive surveillance by the NSA on Americans, I have thought a bit about how to securely set up end-to-end cryptography in order to offer guarantees of security.  This is by no means limited to offering security relative to governments, but includes the fact that organized criminals can mount attacks similar to wiretaps using man-in-the-middle and other attack techniques.

Designing a perfectly secure system is probably not possible.  A determined enough attacker will be able to gain access to any communications given enough effort, resources, and determination.  The goal of the system I am describing, however, is to maximize the resources an attack requires, and thus force eavesdroppers to focus only on the most valuable targets and keep any compromise as narrow as possible.  Additionally, metadata cannot be protected to the same extent that content can.

In general SSL is good enough for key negotiation and the like, if the system can be made resistant (though not perfectly secure) to man-in-the-middle attacks and if authorities can be sufficiently trusted and validated over time.

Most key exchange approaches focus solely on minimizing risk at the moment of key exchange (i.e. reducing synchronic risk).  This approach is different in that it focuses on resistance to exposure over time (reducing diachronic risk) and seeking to provide as much notification of compromised communications as possible.

Such an approach will *not* protect people against government spying in jurisdictions where keys can be demanded via subpoena or even warrant.  However, this approach seeks to force authorities to obtain keys from the people under surveillance rather than relying on digital surveillance alone.

In general as much as I dislike SSL/TLS (as I dislike pretty much every attempt to port OSI protocols to TCP/IP) it is well developed and well understood and the security dimensions of the protocol are well documented.  Additionally unlike IPSec, it is appropriate for cases where the data may need to be routed and rerouted among different servers, possibly altering routing destinations.  It is therefore the approach I suggest building upon.

This design is not intended to be entirely anti-government.  Given the amount of cyber-espionage occurring, such a framework may be of interest to governments trying to assure the security of their own communications.

Vulnerabilities of SSL


SSL has several known vulnerabilities in its design and architecture.  Many of these are the result of basic design tradeoffs in the protocol.  I think the risk model has shifted (toward very large organized crime operations and surveillance by well-funded governments, including China and the United States), and this makes the world very different.

The threat model that SSL is designed to address is where you are protecting a large number of low- to mid-value targets against relatively casual or trivial attacks.  As attacks by foreign and domestic governments and organized criminals have become more sophisticated, the basic structure of how we use SSL and TLS has not evolved at the same pace.

The two major vulnerabilities of SSL are:

  • Large, central certificate authorities present single points of attack, and
  • Man-in-the-middle attacks are possible when neither side has prior expectations about the other's certificate authorities.
My proposal below focuses on broadening the range of threats against which security is provable.  It is not inconsistent with existing approaches built around central certificate authorities, but adds a layer of diachronic protection as well.

I do not believe this is particularly useful for criminal organizations and it poses problems when moving down to the individual level (these can be solved however).  Nonetheless it should give corporations and governments and even individuals an ability to ensure that their communications are not being unlawfully reviewed without their knowledge.

Solution to Third Party CA Vulnerability


Third party CAs are a current vulnerability of the SSL system.  If keys are registered or obtained through a third party, their processes or systems can be attacked to obtain bogus certificates, or to retrieve keys sufficient to forge certificates.  Third party CAs thus have to be incredibly secure.

The problem, though, is that no system is that secure.  Computer viruses have been planted in US drone operation centers, and spear phishing attacks have succeeded against some of the most secure US government agencies; there is no reason to expect certificate authorities to fare better.

My proposal is to divide trust between an external certificate authority and an internal certificate authority, both of which are tracked over time in the key negotiation process.  The external certificate authority's job remains to validate that the certificate is issued to the proper individual or organization (essentially an internet-based notary public), while the internal certificate authority is to identify individuals and services at an organization.  Because of the structure of the internet, and because of current practice, I would recommend keying this to the purchased domain name.

This by itself essentially means that the operational encryption keys are not certified by root certificate authorities but rather by an internal tier.  In essence root certificate authorities would issue only certificates certifying things like "this domain is owned by the party we issued the certificate to," while subdomains would be a local matter.

This does not interfere with provable security if the following rule is enforced:  Operational certificates for resources at a domain MUST be regarded as provably secure ONLY IF they are issued by a certificate authority whose certificate indicates it was issued for the domain in question (possibly transitively) by a trusted root authority.

Note that this means two things: an attack on a root CA can no longer reveal keys useful in eavesdropping on key exchange, but it could reveal keys useful for carrying out a man-in-the-middle attack.  It thus modestly increases the effort required to eavesdrop on SSL-protected connections, but this is not much protection by itself given the complexity of the attack needed in the first place.  The real value comes from looking at diachronic protection against the man in the middle, and this is where the division really proves valuable.

This is not limited to three levels, but it is worth noting that it would be necessary to check the certificates at each level to make sure that the above resource chain was not broken.  A fourth level might be necessary in scaling down to individual consumers (who typically do not own the domains they send emails from).


Diachronic Protection against the Man in the Middle


The approach mentioned above is chiefly useful because it allows one to track certificate authorities and their keys over time relative to a domain.  A sudden change could indicate a man in the middle, and with mutual authentication both sides should get alarms.

I would propose extending certificate issuance so that a new certificate must be signed with the previous key as well as the current one.  A renewed or re-issued certificate would then be signed both with the prior certificate's key (as evidence of continuity) and by the parent certificate authority.

This means that you have strong evidence that the certificate was not only issued to someone the certificate authority is apparently willing to vouch for, but also that it was received by the holder of the private key of the previous certificate.  Now in the event that the key is fundamentally lost, and a certificate re-issued, the holder would probably want to say something.

Now, this establishes a timeline of key changes which can be tracked, and as keys are issued, timelines diverge.  The key issued by the root CA is no longer sufficient to establish new connections without a warning of a diverged timeline.  Connections with previously unknown parties can still be eavesdropped on, but parties who have been in previous contact will detect that something is wrong and can hopefully alert their users to a possible problem.  This provides both sides an opportunity to avoid problems.

Of course an alert may just mean that the previous key was lost (through a massive loss of information, for example), so it does not necessarily mean a man in the middle.  Unexplained alerts, however, do indicate a problem.

Active Man in the Middle Detection


Additionally this structure should enable us to do man-in-the-middle detection for any real-time bidirectional communication.  Its effectiveness depends on a certificate cache, which allows diachronic tracking of certificates for previously known resources.  For new connections, however, there is a problem.

One solution is to orchestrate several additional requests for certificates from several other resources one knows about, with the legitimate request occurring at a random spot.  The observer in the middle cannot determine which request is the legitimate one, and so, without extensive review, cannot even guess which resources are previously unknown to the client.  This would make it particularly difficult to eavesdrop on communications in volume.

Additionally once one contact pair is eavesdropped on, removing the eavesdrop triggers a diachronic protection warning.

Scaling Down to the Personal


The big thing required for scaling this down is to recognize that an additional tier is needed, because individuals often send emails through the domains of their ISPs or email providers.  In this view one would get a personal CA certificate, which would then issue one certificate for each service, such as web access, email, and so on.

Extensions to X509 Required


The fundamental extension required would be a way of carrying all relevant internal CA certificates serially in a single format, up to the root CA, which could be independently verified.  A few other extensions may also be needed.  For example, a reasonable overlap time may need to be specified for certificate authorities transitioning to a new certificate.  The determination of reasonable policies for such transition periods is beyond the scope of this proposal; however, a temporary change (before reverting to the old certificate) would definitely be suspicious.

Additionally, one would need to have a Key Version List, where keys could be listed in sequence for a period of time.  This may need to be added to the certificate structure as an extension.

Limitations


Security here is provable over time only given the following assumptions:

1.  The private key has not been compromised on either end.
2.  Continuity in changes regarding private keys is known and can be shown.

The reliance on unconditional trust of root certificate authorities is reduced, though some reliance is necessary, and the effort needed to mount an attack would be higher.  However the above limitations mean that false positives for security concerns may occur in the case where keys are lost and certificates re-issued, and false negatives in the event where private keys are compromised.

In the event that authorities (in jurisdictions which allow this) subpoena a private key, they can eavesdrop on connections.  Similarly, spearphishing could be used by organized crime to obtain keys.  These things are thus outside what this threat model protects against.

However the limitations are narrower and help reduce the risk that certificate authorities face, as well as enabling people to better protect their communications.

Tuesday, June 4, 2013

New ventures and new directions for this blog

I have agreed to help found a LedgerSMB hosting business called Efficito.  We have a very basic web page up (which is in the process of further development) but we intend to offer hosting specifically of LedgerSMB 1.3 and higher to customers who want it.  If there is interest, please feel free to email me.

One thing we have committed to do is to leverage PostgreSQL for the overall management of our hosting services.  We intend to use the database as a point of original entry, allowing other processes to enact changes saved in the database.  This is something very different from a standard 3-Tier architecture in that the entry of data into the database triggers real-world side-effects in a distributed environment.
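
To illustrate the general pattern (this is only a minimal sketch, not our actual schema; the table, function, and channel names here are hypothetical), a queue table plus a trigger calling pg_notify lets a worker outside the database react to rows as they are entered:

CREATE TABLE provisioning_request (
    id bigserial PRIMARY KEY,
    hostname text NOT NULL,
    requested_at timestamptz NOT NULL DEFAULT now(),
    completed_at timestamptz
);

CREATE OR REPLACE FUNCTION notify_provisioner() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    -- Wake up any listening worker.  The payload is only the row id;
    -- the worker re-reads the row, so the database remains the point
    -- of original entry and the single source of truth.
    PERFORM pg_notify('provisioning', NEW.id::text);
    RETURN NEW;
END;
$$;

CREATE TRIGGER provisioning_request_notify
    AFTER INSERT ON provisioning_request
    FOR EACH ROW EXECUTE PROCEDURE notify_provisioner();

-- An external worker process runs LISTEN provisioning; on each
-- notification it performs the real-world work and writes completed_at back.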

Obviously I cannot disclose our full codebase to the world, but I expect to be covering a number of aspects of how we will use PostgreSQL, and include small snippets of past or current versions of the code in these posts.  The examples will be naturally minimalistic and lack the context of the full system, but they should be interesting from a question of how one works with PostgreSQL in an automated environment and a number of other topics.

In essence what started off as a blog about LedgerSMB, which has expanded to include object-relational design generally, will now be expanded yet again to include database-centric automation.

I expect to cover  a few pieces in the near future including cross-type referential integrity where inet addresses match cidr blocks, and the use of ip4r to construct exclusion constraints for network ranges in table types.  The first post is probably more generally useful because understanding how to manually do referential integrity can help solve broader problems, like referential integrity over inheritance trees.
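
As a rough preview of the first piece (a sketch only; the tables and trigger below are hypothetical examples, not code from Efficito), manual referential integrity between an inet column and a table of cidr blocks can be enforced with a trigger:

CREATE TABLE network_block (
    block cidr PRIMARY KEY
);

CREATE TABLE host_address (
    addr inet PRIMARY KEY
);

CREATE OR REPLACE FUNCTION check_host_in_block() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    -- Built-in foreign keys cannot express "this inet must fall inside
    -- some cidr block", so we enforce it manually with the << operator.
    IF NOT EXISTS (SELECT 1 FROM network_block WHERE NEW.addr << block) THEN
        RAISE EXCEPTION 'address % is not inside any registered block', NEW.addr;
    END IF;
    RETURN NEW;
END;
$$;

CREATE TRIGGER host_address_in_block
    BEFORE INSERT OR UPDATE ON host_address
    FOR EACH ROW EXECUTE PROCEDURE check_host_in_block();

A complete solution also needs triggers on the referenced table so that blocks which still have hosts cannot be deleted or shrunk; that is the part which generalizes to broader problems like inheritance trees.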

Anyway I hope you enjoy some of the newer areas of focus as well.

Monday, May 27, 2013

Introducing the CPAN PGObject namespace

So I have decided to begin work on a set of CPAN modules, essentially implementing the stored procedure interface work that I have done for LedgerSMB, and incorporating the lessons learned by myself and others.  It is my hope that Perl programmers will find this useful in building sophisticated applications on PostgreSQL.  The idea is to turn PostgreSQL into an encapsulated database server with a discoverable API, and then allow programs to discover the API directly.  In my experience this leads to more robust code because code contracts between stored procedures and applications are more readily enforceable.

I just got preliminary versions of two of them on CPAN.  The first is PGObject, which implements the database-facing routines; the second is PGObject::Simple, which is developer-facing.  The reason for breaking things up this way is to allow multiple different developer-facing systems to be built on the same database-facing services.  These are functional, but versions with a great deal more functionality are on their way.

I feel a need to justify "Simple" in the name of the latter.  PGObject::Simple is a quick and dirty stored procedure mapping convention system based on our approach in LedgerSMB.  It is not reduced in functionality, but the overall conventions are simple, and therefore require less overhead to use.  This also means, however, that the flexibility is such that it requires a certain discipline to use effectively.  Objects can quickly become amorphous but used with discipline it is a very powerful system.

For additional structure, PGObject::Simple::Role is under development which creates a Moo role, which can be used with or without Moose, to provide this sort of functionality in more formalized object systems.

These modules are designed to facilitate mapping of stored procedure functionality into applications, allowing PostgreSQL databases to become encapsulated intelligent information servers in their own right.  In future posts I will be going over both of these in detail.

For now though start with the documentation in the links above.

Wednesday, May 1, 2013

Arrays and Indexing, problems and solutions

Arrays in PostgreSQL are relatively well supported by the indexing engines of this RDBMS.  There are a fair number of options however to be aware of and an understanding of how arrays are indexed is very helpful.  Before going forward, please read my intro on arrays so that you understand the basics.

I am leaving out hash indexes which are not very useful for arrays (or much of anything else).  There are probably cases where hash indexes are useful but if you need them you will know....

BTREE Indexing for Ordinality-centric arrays


The default type of index used by PostgreSQL is a btree index, which provides a search tree using only three operators: <, =, >.  Ordering and ordinality of arrays are quite important here, and btree indexes are essentially useless when indexing arrays where ordinality is unimportant.

One should remember that btree indexes are useful primarily in ordering based searches.  The following could use a btree index (and might depending on table statistics):

  • SELECT id FROM mytable WHERE myarray < '{0,2,1,2,3}';
  • SELECT id FROM mytable WHERE myarray = '{0,6,5,4,3}';
In theory a prefix search might be possible if such an operator existed and appropriate work was done on the back-end, but this is not currently possible without using a functional index (sketched below).  Again, in theory the following might work, but GIN indexes would be better for reasons that will become clear.

  • SELECT id FROM mytable WHERE myarray[1] = '5';
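
That last case does become indexable today with an expression index on the extracted element (a separate index from a btree on the array column itself).  A minimal sketch, reusing the hypothetical mytable and myarray from the examples above:

CREATE INDEX mytable_first_element_idx ON mytable ((myarray[1]));

-- The planner can now use the expression index for queries like:
SELECT id FROM mytable WHERE myarray[1] = 5;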

The following cannot work with a btree index under any circumstances:

  • SELECT id FROM mytable WHERE myarray[2] = 5;
The reason this cannot work is that while myarray[1] would correspond to something we could search the index on, myarray[2] is not exposed to the tree view of the index until we already know myarray[1].  Thus we cannot traverse a tree on the basis of array elements beyond the first.

Btree indexes may be useful in some cases with arrays, particularly where ordinality is very important to the search.  However, they are entirely useless where ordinality is unimportant.

GiST and GIN indexing


PostgreSQL provides two more general indexing options that work for more general (and non-1NF) uses of arrays than btree indexes do.

In general the tradeoff is that GIN indexes are more costly to maintain and take up more disk space, but are less costly to query.  As the read-to-write ratio goes up, GIN indexes start looking more desirable and GiST indexes start looking less desirable.

These indexes are helpful because they support a wider range of searches than btree indexes do.  In particular you can do overlap searches (select * from pages where tags && array['PostgreSQL']), and much more.  GiST and GIN indexes also support other operators.  It is also worth noting that GiST indexes were originally designed for geometric and geospatial searches, while GIN indexes were designed for arrays.

The supported GiST operators are:
  • <<  Strictly left of
  • &< Extends to the left of
  • &> Extends to the right of
  • >> Strictly right of
  • <<| Strictly Below
  • &<| Does not extend above
  • |&> Does not extend below
  • |>> Strictly above
  • @> Contains
  • <@ Contained in
  • ~=  Is the same as (identical to)
  • && Overlaps
Now this supports a significant superset of GIN index functionality (see below), but it does not necessarily apply to arrays in a straightforward way.  In general GIN indexing is a better match, but it is worth noting that you could write functions to check some of these conditions and add that functionality if you need it.  In general, however, this is an advanced topic and well beyond the scope of this post.  If you ever need to compare two arrays and see which has the highest single value, or the largest dimensions, or some other objective metric, you might be able to hook this into GiST indexing.

The supported GIN operators are:
  • <@    Is contained in
  •  @> Contains
  •  =     equals
  • &&  Overlaps
 These are generally particularly useful for arrays, and since they are lossless, they are significantly faster for lookups.  For arrays these are preferred most of the time if you are looking primarily for matching specific elements in the array.
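
For example, a tagging setup like the pages example above might look like this (a sketch; the table itself is hypothetical):

CREATE TABLE pages (
    id    serial PRIMARY KEY,
    title text NOT NULL,
    tags  text[] NOT NULL DEFAULT '{}'
);

CREATE INDEX pages_tags_gin ON pages USING gin (tags);

-- Both of these searches can use the GIN index:
SELECT id, title FROM pages WHERE tags && ARRAY['PostgreSQL'];  -- overlaps
SELECT id, title FROM pages WHERE tags @> ARRAY['PostgreSQL'];  -- contains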

On the whole the two most important indexes depending on what you are doing with the array are btree and GIN indexes.  Btrees cover well the case where ordinality matters and the search is primarily on the beginning of the array.  GIN indexes cover more general (and in particular non-1NF setups) relatively well.  GiST indexes seem nice, but on the whole they do not offer a good performance/functionality tradeoff out of the box.  They would be powerful tools however for the advanced developer.

Monday, April 29, 2013

Introduction to PostgreSQL arrays, basic concepts and normalization considerations

This is part of a two part series.  The second will discuss indexing arrays which is a complex topic.

PostgreSQL has very good support for arrays of other data types.  One can have nested data types where an attribute of a relation stores an array of tuples, and each of these tuples can hold arrays of other tuples or primitives.

Arrays are extremely powerful in PostgreSQL but they pose a number of significant issues as well.  They can be used in ways which break first normal form but mere use of an array does not necessarily break first normal form's atomicity requirement because arrays differ in significant fundamental ways from relations.  Understanding arrays in PostgreSQL can help avoid costly errors.  Relations and arrays are subject to fundamentally different constraints and therefore are equivalent in only a few very special cases.

Arrays derive their power in PostgreSQL due to the fact that they have strong definitional requirements.

An Array is a Mathematical Matrix


PostgreSQL arrays follow the basic definition of a mathematical matrix.  The array must be rectangular, so if one element has an array in it, all other elements must have an array of the same dimensions.  All members of an array must be of the same data type.

The following are thus valid Integer arrays:

  • '{1,2,3,4,5}'
  • '{{1,2},{3,4},{5,6},{7,8},{9,10}}'

The following are not valid integer arrays:

  • '{1,2,3,{4, 5}}'
  • '{{1,2},{3,4},{5}}'
These constraints make arrays particularly useful in PostgreSQL.  One can use text arrays to represent sets of tuples being input into stored procedures, for example.  One can also write functions to do matrix math on numerical arrays (we will look at an example below).  If PostgreSQL lacked true arbitrary precision math via the numeric type, we could implement something similar using arrays.  These checks all make arrays safe to work with in ways that arrays in most programming languages are not.
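
You can see the rectangularity check in action simply by casting literals (the exact error wording varies between versions):

-- Valid: a 5 x 2 matrix.
SELECT '{{1,2},{3,4},{5,6},{7,8},{9,10}}'::int[];

-- Invalid: a ragged array is rejected with an error about
-- mismatched dimensions.
SELECT '{{1,2},{3,4},{5}}'::int[];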

Arrays vs Relations vs Tuples (and First Normal Form)


Arrays bear some relationship to relations and tuples, but all three are different in significant ways which means they are best at solving different problems.

Like tuples, arrays are ordered.  However:

  • Unlike tuples every array element must have the same data type.  This means that arrays of text can represent tuples of any type.  
  • Unlike tuples, arrays do not have a fixed number of elements.  Elements can be added without disturbing the basic type.
Like arrays, relations are basically rectangular, and open ended (items can be added or removed from the end).  However:

  • Arrays are ordered, relations are not.  This means that an array value is a domain, while a relation value is a set or bag of domains (depending on constraints).
  • All data types in an array must be identical.  Relations do not have this restriction.
What this means, basically, is that the mere use of an array in a column of a table does not necessarily violate 1NF, provided the array is used in a way where ordinality matters.

For example we could store matrices for matrix arithmetic as numeric[] arrays.  We could store some problems that are transformed into numeric[] matrices into the db as well (for example simultaneous linear equations) and these would not necessarily violate 1NF.

Example:  Solving Simultaneous Linear Equations in PL/PGSQL


I will leave understanding this as an exercise to the reader.  It is, however, a small example.  This is a very basic program I originally wrote in BASIC on my C64 when I was younger as a homework-solving program, and I have since written a number of other incarnations as a demonstration.  It solves simultaneous linear equations by arranging them in a matrix and reducing the matrix accordingly.  The program will only accept an n by n+1 matrix, and then will calculate the solution.

This example is useful because it shows that anything you can solve through matrix math you can solve using PostgreSQL arrays....

 create or replace function solve_linear(numeric[]) returns numeric[]
language plpgsql as
$$
declare retval numeric[];
        c_outer int; -- counters
        c_inner int;
        c_working int;
        upper1 int;
        upper2 int;
        ratio numeric; --caching of calculation
begin
    IF array_upper($1, 1) <> array_upper($1, 2) - 1 THEN
        RAISE EXCEPTION 'bad input, must be n x n+1 matrix';
    END IF;
    upper1 := array_upper($1, 1);
    upper2 := array_upper($1, 2);
    FOR c_outer IN 1 .. upper1 LOOP
        FOR c_inner IN 1 .. upper1 LOOP
            IF c_inner = c_outer THEN CONTINUE;
            END IF;
            ratio := $1[c_inner][c_outer]/$1[c_outer][c_outer];
            for c_working in 1 .. upper2 loop
                $1[c_inner][c_working] := $1[c_inner][c_working]
                                    - ($1[c_outer][c_working] * ratio);
            end loop;
        end loop;
    end loop;
    retval := '{}';
    for c_outer in 1 .. array_upper($1,1) loop
       retval := retval || $1[c_outer][array_upper($1,1) + 1]/ $1[c_outer][c_outer];
    end loop;
    return retval;
end;
$$;


As an example:

 select * from solve_linear('{{1,2,3},{2,3,4}}'::numeric[]);
                       solve_linear                      
----------------------------------------------------------
 {-1.00000000000000000000000000000000,2.0000000000000000}
(1 row)

In other words, if we have 1x + 2y = 3 and 2x + 3y = 4, then these will intersect where x is -1 and y is 2.

Yes, it is order independent.

 select * from solve_linear('{{2,3,4},{1,2,3}}'::numeric[]);
                             solve_linear                            
----------------------------------------------------------------------
 {-1.0000000000000000000000000000000000000000,2.00000000000000000000}
(1 row)

However, ordering within a subarray is important:

select * from solve_linear('{{2,4,3},{1,3,2}}'::numeric[]);
                            solve_linear                            
---------------------------------------------------------------------
 {0.5000000000000000000000000000000000000000,0.50000000000000000000}
(1 row)


One minor issue with the above code is that it throws a division by 0 if the lines are parallel:

select * from solve_linear('{{1,2,3},{1,2,6}}'::numeric[]);
ERROR:  division by zero
CONTEXT:  PL/pgSQL function "solve_linear" line 19 at assignment

Consequently error handling could be improved, but it isn't too bad for a demonstration.

As it turns out this is a fairly generalizable class of problems and may be particularly helpful when doing geometric math.

I hope this helps show some of the uses and value of PostgreSQL array types.  In the next post we will talk about arrays and indexes.....

Wednesday, April 24, 2013

A few cases where EAV modelling really works in LedgerSMB.

Entity-attribute-value (EAV) modelling is typically seen as an antipattern and for good reasons.  Misapplied it leads to inflexible queries, poor performance and a number of other bad things.  Nonetheless there are classes of problems where EAV is the right solution for the problem at hand.  Two specific areas of LedgerSMB will demonstrate this, namely account/dropdown management and the menu schema.  These actually represent different classes of problems but they have a number of things in common.  The goal of this post is to help readers better understand where EAV modelling can be of great help.

In general, the areas of LedgerSMB which use EAV modelling involve stable, rarely changing data with relatively simple to non-existent validation routines (the menu is expected to be externally validated).  In some cases the basic expectations of EAV are modified slightly.  In both these cases (which represent relatively different classes of problems) we have been very happy with EAV, and while our approach will no doubt be refined in the future, it is unlikely that these portions of the software will ever move away from EAV.

Note the LedgerSMB development team does not advocate starting with EAV for most problems.  Usually where EAV is appropriate it will naturally emerge in the schema design process.

What is EAV?


EAV is a way to model relatively free-form data in such a way as to dispense with the need for extremely complex, long tables for specifying sparse data.  A typical EAV approach has three basic tables representing entity data, attributes, and value data.

To make EAV work successfully, as much data as possible should be put in the entity table or parent/child tables.  It's only where this approach breaks down that EAV is a good idea.  The examples below will flesh this out more.  However  a basic typical design might look like:

CREATE TABLE entity (
    id serial not null unique,
    label text primary key,
    ....
);

CREATE TABLE attribute (
   id serial not null unique,
   att_name text not null,
   att_datatype text not null,
   ....
);

CREATE TABLE value_table (
   id bigserial not null unique,
   entity_id int not null references entity(id),
   attribute_id int not null references attribute(id),
   value text not null
);
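
Retrieving sparse data from such a schema typically means pivoting, with one conditional aggregate per attribute of interest.  A sketch against the tables above (the attribute names 'color' and 'weight' are hypothetical):

SELECT e.label,
       max(CASE WHEN a.att_name = 'color'  THEN v.value END) AS color,
       max(CASE WHEN a.att_name = 'weight' THEN v.value END) AS weight
  FROM entity e
  LEFT JOIN value_table v ON v.entity_id = e.id
  LEFT JOIN attribute a   ON a.id = v.attribute_id
 GROUP BY e.label;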

In the portions below we will explore what is right and what is wrong with this approach.  This post is hardly a definitive guide to the subject, but it is hoped it is a step towards exploring this area.

Why EAV?


EAV dispenses with the requirement for rigid schemas.  Particularly in cases where data validation is not complex or not demanded, EAV can solve problems that standard row-oriented models cannot readily solve.  Some classes of problems (tags in social media, for example) fall readily into a modified EAV model.

In essence EAV can sometimes(!) be a useful approach when the data does not fit well into other models.  Unfortunately, when it is reached for too quickly, bad things result, but this does not prevent it from being useful in cases where it is at home.

Why Not EAV?


EAV, improperly applied, leads to a significant number of problems: an inability to declaratively validate data within the schema of the database itself beyond very simple rules, performance problems, and basic data management problems.

In general my own view is that EAV is best kept as a tool to solve very specific problems, and that most other problems are better solved by careful table design.

Modified EAV in the Account/Dropdown Management Schema


In LedgerSMB, accounts in the chart of accounts are attached to drop-downs.  The schema looks like:

CREATE TABLE account (
  id serial not null unique,
  accno text primary key,
  description text,
  category CHAR(1) NOT NULL,
  gifi_accno text,
  heading int not null references account_heading(id),
  contra bool not null default false,
  tax bool not null default false
);



CREATE TABLE account_link_description (
    description text    primary key,
    summary     boolean not null,
    custom      boolean not null
);


CREATE TABLE account_link (
   account_id int references account(id),
   description text references account_link_description(description),
   primary key (account_id, description)
);


Now what is obviously missing here, from an EAV perspective, are values.  We have entities and attributes, but the presence of the attribute itself is the value.  It would be like EAV with merely boolean values.

The architecture poses all the classic issues one has with EAV.  There are rules regarding which combinations of drop-downs are invalid.  These are not complex rules, but they could change over time.  Currently, if an account is linked to a summary drop-down, it is allowed only that single drop-down link, so no others can exist.  Custom accounts are not validated and can break these rules.

Currently validation occurs via a stored procedure which encodes the rules in its own logic.

Normally the tables involved here are small.  A typical business will have no more than a few dozen account entries, and maybe a hundred or two account_link entries, and there are only 27 account link descriptions.  However larger businesses could have thousands of accounts and tens of thousands of account_link entries.

The approach here scales well because there are only really three typical search patterns. 

The first is a search of accounts for a specific link description (dropdown box).  This will usually pull only a portion of the accounts and can be managed through indexes.

The second is a search of account_links for a given account.  This is trivial and usually no more than about 20 items.

Finally, we may pull all entries from all tables and join/aggregate them.  Even this does not require multiple table scans comparing results in order to determine the appropriate rows to return.
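
As a rough sketch of what these three access patterns look like against the schema above (the literal values are placeholders):

-- 1. Accounts attached to one drop-down (link description):
SELECT a.id, a.accno, a.description
  FROM account a
  JOIN account_link al ON al.account_id = a.id
 WHERE al.description = 'AR_paid';

-- 2. Drop-down links for a single account:
SELECT description FROM account_link WHERE account_id = 42;

-- 3. Everything, aggregated into one row per account:
SELECT a.accno, a.description, array_agg(al.description) AS links
  FROM account a
  LEFT JOIN account_link al ON al.account_id = a.id
 GROUP BY a.accno, a.description;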

One thing that is important to note is that chart of accounts data is pretty stable.  It may change a few records over time, but radical changes or mass updates are very rare.  Thus we can sacrifice some immediate write performance in exchange for benefits over the long run.
 

Lightweight EAV in the menu definition schema


Our menu schema looks like this:

CREATE TABLE menu_node (
    id serial NOT NULL,
    label character varying NOT NULL,
    parent integer,
    "position" integer NOT NULL
);


CREATE TABLE menu_attribute (
    node_id integer NOT NULL references menu_node(id),
    attribute character varying NOT NULL,
    value character varying NOT NULL,
    id serial NOT NULL,
    primary key(node_id, attribute)
);


CREATE TABLE menu_acl (
    id serial NOT NULL,
    role_name character varying,
    acl_type character varying,
    node_id integer,
    CONSTRAINT menu_acl_acl_type_check CHECK ((((acl_type)::text = 'allow'::text) OR ((acl_type)::text = 'deny'::text))),
    PRIMARY KEY (node_id, role_name)
);


What is omitted here is the attribute table.  In essence our menu model has no data validation and the data is extremely stable.  It is expected that the developer validates against the application.

The tables here are relatively small.  By default there are a couple hundred rows in menu_node and nearly a thousand in menu_attribute.  This structure stretches what relational databases will let you do gracefully, since menu_node represents a tree hierarchy and the rest establishes a very open (unvalidated) EAV model.

Access patterns here are typically once per login, pulling all data from these tables (two scans, one for permissions), and then aggregating the attributes such that we get enough information to create a menu structure (<ul> tags with hyperlinks inside <li> tags).
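
A sketch of that aggregation step against the tables above (illustrative only, not the exact LedgerSMB query):

-- One row per menu node, with attributes gathered into parallel arrays
-- that the application turns into link attributes for the menu.
SELECT n.id, n.label, n.parent, n."position",
       array_agg(ma.attribute) AS attributes,
       array_agg(ma.value)     AS attribute_values
  FROM menu_node n
  LEFT JOIN menu_attribute ma ON ma.node_id = n.id
 GROUP BY n.id, n.label, n.parent, n."position"
 ORDER BY n.parent, n."position";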

Menu items are usually only created once and are extraordinarily stable. 

Conclusions


These use cases have a number of striking things in common.  The access patterns are simple, as are the in-db data validation rules (if they exist at all).  The data is rarely written.

The second point is something higher level that these cases have in common.  They represent a way of the application storing its own API in the database.  In other words they represent a way of storing application configuration information, or hooks or references to application logic (though not the logic itself).  These areas are semantically very flexible but typically the data is long-lived once everything is set up.

Tuesday, April 16, 2013

A (Distributist) View on Software Freedom

We are relatively familiar with Stallman's views on software freedom.  These views are deeply based on the idea that we can enumerate rights which should be respected.  However, sometimes the enumeration of rights gets in the way of actually having freedom, because of the assumption that anything beyond those rights is not a right at all.  The idea of four freedoms essentially encourages a lawyer's view of software freedom, where anything outside them can be restricted in order to achieve the appearance of freedom.

This post provides a Distributist view of software freedom.  It functions in part as an introduction to Distributist economic and political theory for open source software professionals and an introduction to open source software for Distributists.  The approach will be shown to be very different from Stallman's or other groups', but it also shows in greater detail why software freedom is important in a more general way than Stallman's enumerated-rights approach does.

It's worth noting off the bat that Distributism arose as a critique of both Capitalism and Communism and represents something with some of the ideals of both sides, but altogether different in character from either.  It's my thesis that open source makes the most sense when looked at through a Distributist approach instead of a liberal rights approach.

Economic Activity, Means of Production, and Ownership


Capitalist, Marxist, and Distributist economic theory is fundamentally based on the recognition of means of production and its place in regulating economic activity.

We humans no longer live, anywhere in the world, in unimproved conditions (say in trees without tools, clothing, or the like).  For us to live we must improve our environment and make it more conducive to human activity.    Production of goods and delivery of services, however, requires a few specifics:  land, tools, labor, and capital.  Without land, a business has nowhere to work.  Without tools, there is nothing that can be done.  Without labor there is nobody to do the work, and without capital there is no way to pay the costs to get started.

The means of production constitute two parts of this equation, namely land (facilities) and tools used to produce goods and deliver services.  Labor and capital represent two other inputs but they are different in kind.  Capital is expended flexibly, and labor is human input by the actual workers.  Under a system of slavery, slaves might be included but otherwise, free workers are different in that they are not owned.

The term ownership is however somewhat problematic.    Ownership itself is socially constructed, and can take different forms in different ways.  The simplest, perhaps even most satisfying, definition of ownership is simply the right to utilize, and direct utilization of, an item, particularly in production of economic goods or services.  This definition largely follows that of Hilaire Belloc's The Servile State which is one of the fundamental works of Distributist economic and historical theory.

Distributism, Capitalism, and Marxist Communism can be differentiated though as to who owns, in the sense of  the right to utilize or direct utilization of the means of production.  These represent fundamentally different approaches to trying to manage economic problems of a society (other approaches exist as well).

Capitalism, Communism, and Distributism as Economic Orders


For this discussion it is important to define these terms clearly and their relationship, so that the scope and type of the Distributist critique of Capitalism and Communism are understood.  This is also a key to understanding the economic benefits of open source software.  It is also worth noting that "perfect" examples of Capitalism, Communism, and Distributism are hard to find, so to some extent one has to ask what aspects of these three systems one finds in any given economic order.  In fact one may find aspects of all three in the US today despite the fact that the US is one of the most clearly Capitalist economies around today.

Capitalism is a system where, in a class-based society, the class with capital buys land and tools, and hires labor to start businesses.  Capitalism thus means "ownership by financier."  Capitalism is fundamentally prone to market failures through mechanisms like overproduction, and workers are largely powerless because they have few options other than working for financier-owned businesses.  As Adam Smith noted, wages and salaries are determined less by supply and demand and more by individual bargaining power, so the less empowered a worker is, the less the worker will earn.

Marxist Communism seeks to solve the problems of Capitalism by taking the means of production from the financiers and vesting those in the state as the representative of the worker.  Of course the problems of Capitalism are caused by too few people having too much control, so further concentrating power in fewer hands will not solve the problem.   Further, while financiers may have a primary goal of making money, government officials have complex goals and so the power that comes with control over economic production (which amounts to control over life itself) inevitably leads to great and terrible dictatorships.

Distributism looks back to the progression of the economy of the Middle Ages towards an economy where most people were effectively self-employed and thus seeks to undo the centralization of power that occurs under Capitalism by vesting ownership in the means of production with individual workers, not the state as an intermediary.  An economy dominated by the self-employed would be a Distributist economy.

The Distributist critique of Capitalism at the beginning tracks the Marxist critique of Capitalism.  Capitalism is seen to create problems for workers who are disadvantaged by wealthy business owners, but the mechanism of that disempowerment is different, and so here they begin to diverge.  In essence because workers are denied the opportunity to participate in the economy as free agents, they are dependent on their employers for basic subsistence.  This leads workers to demand better conditions which leads to class warfare.  This threatens the liberal capitalist order, and results in some socialization of some things that the wealthy business owners are willing to give up for promises of greater gains in the future.

Unlike Marxists, Distributists usually assume that the wealthy classes, by virtue of their economic control, will usually win class warfare and this is well attested historically.  Therefore Distributism seeks to avoid rather than stoke this warfare.  The goal is to gently cultivate businesses and individuals where labor and capital are not separate and therefore not in conflict.  A handyman who owns his own tools and pays for the start-up expenses of going into business is thus a job creator in a way in which venture capital firms are not.  There is nothing more empowering to a worker than the ability to quit and easily start a business, so a porous border between formal employment and self-employment is believed to increase wages.

Software Ownership, EULA's and Open Source


Typically software is not sold but licensed.  The licensee (end user) receives certain rights to utilize the software for certain purposes decided by the licensor.  For this reason, if we define ownership as the right to utilize, particularly for economic production, software comes with a wide range of degrees of ownership.

Some software, for example, is free for non-commercial use.  This offers a very limited degree of ownership (regarding economic production) unless often significant fees are paid.  Other software comes with seemingly arbitrary limits which are specifically designed to provide upselling opportunities to the vendor either regarding software or license rights.

For example, in the past certain versions of Windows have included terms in their End User License Agreement specifying that web servers may not be run on the systems.  The whole purpose is to ensure that people who want to run Apache on their systems must buy versions that also come with IIS.

Similarly client access licenses are typically used as a general upselling opportunity, so that larger users must pay more.

Of course one is only entitled to use what one pays for, and therefore businesses using commercial off-the-shelf software often must spend time and energy ensuring they are compliant with all the various terms of every software license they use.

Open source is different in that it places very few limits on utilization, and some licenses (such as the MIT license) place almost none at all.

In general open source software licenses allow unlimited utilization within a business.  They provide varying degrees of restriction regarding utilization in products that incorporate the software, particularly in how much ownership can be asserted over the software once it is distributed to a customer.

Open Source Development Communities vs Dual-Licensing Companies


There are two predominant approaches to the development and distribution of open source software.  The first is that of dual-licensing by a single company which holds the copyright and sells proprietary licenses in addition to an open source offering, often one which has fewer features.  The second is the multi-vendor model where several companies come together to jointly offer an open source offering.  The companies may or may not have their own proprietary offerings as well.

The big difference between these two models is in who has what degree of control over what people and businesses do with the software.  In the first model,  the company has the incentive to provide the software under a license with as many restrictions on utilization as possible.  In the second, no one company has that power, and thus the incentive is to provide the software under fewer restrictions.

These differences largely show the Distributist position to be correct, that distribution of control means that the vendors have less control and that the workers and consumers have more.  In essence programs developed and owned by communities are less restricted and more open than those which are developed and owned by single companies (owned meaning having the right to tell you what you can or cannot do with the software).

The typical copyleft critique of BSD- and MIT-licensed software is that it gives too much freedom, to the point where it is not possible to enforce downstream software freedom.  I think to some extent this is both incorrect and it misses the point.  It is true that many permissively licensed projects have seen proprietary spinoffs which no longer contribute code back, but in the long view, these rarely outlive a project with significant vitality.

For example, Solaris began as a proprietary spinoff of UC BSD.  Other UC spinoffs, such as FreeBSD, are still being actively developed, mostly under the same license approved by the University of California.  The only development on Solaris I am aware of is on an open source fork of OpenSolaris.  Illustra was an early proprietary spinoff of PostgreSQL.  It was purchased and some code incorporated into Informix, but it otherwise died.  Mammoth PostgreSQL is now alive only as an open sourced set of patches.  We can expect that other forks in coming decades which have broken themselves off from community development will be unable to compete in the long run.  History is littered with failed proprietary forks of BSD-licensed software.....

The community, however, continues.

The obvious counter-example is NCSA Mosaic, but it isn't clear that any proprietary forks of the browser survived either, and Mosaic was originally developed by a single entity.  Thus this exception largely proves the rule.

The community endures.  Companies come and go.

From this perspective corporations are actually not that powerful compared to the flexible networks of developers building open source software.  These networks, in the case of many pieces of important software infrastructure, have proven remarkably robust:

  • BSD began in 1977, and while the University of California Berkeley is no longer the one that manages development, several forks exist today (OpenBSD, FreeBSD, NetBSD, etc).
  • PostgreSQL began in 1985 at UC Berkeley and like BSD escaped into the wild.  It is one of the most successful open source database engines around.
  • Apache Web Server was released first in 1995.
  • The Linux Kernel was begun in earnest in 1991.
  • TeX was initially released in 1978
  • LaTeX was initially released in 1982
In comparison, commercialization ventures in these areas have tended to be much shorter lived, and many technologies that seemed like they were here to stay have disappeared or are dying (Flash being an example that comes to mind).

Thoughts on MIT, BSD, GPL, and AGPL Licenses


Careful observers will note I listed the licenses in order from the most permissive to the most restrictive.

Looking at open source from the eyes of a Distributist theory of economic production and ownership, the MIT, BSD, GPL, and AGPL licenses are differentiated regarding the specifics of the right to utilize that is passed downstream.  Of these, the BSD license family provides, IMO, the best balance between the interests of market participants.  Sublicensing is not allowed, but all other utilization is.

The permissive licenses are fundamentally more Distributist than the copyleft licenses are, and the copyleft licenses are more Capitalist than the permissive licenses are.  With a copyleft license there are important rights retained for further licensing, so one can essentially require additional fees be paid upstream for certain uses.  Much of my discussion here is drawn from Larry Rosen's book on the subject.

The MIT license largely only requires attribution.  One is allowed to do pretty much anything else with the software.  This includes, notably, to sublicense it, in other words to sell part of one's rights received under the license to downstream parties.  In other words, I can take MIT Kerberos, rename it to GuardDog Authentication Services (totally hypothetical name and any resemblance to a real product is purely coincidental), and sell it without making any modifications to the code at all.  I can also incorporate the software with my products, and license the code with restrictions not present in the MIT license.  In essence the MIT license differs from public domain only in that it requires attribution.

The BSD license is significantly more restrictive than the MIT license, in that the license does not mention sublicensing.  I can take PostgreSQL, add important changes, and license the new product with restrictions not present in the original license.  However, I cannot do this unless I add significant, copyright-worthy changes.  The restrictions I add affect changes I make and the work as a whole, not the code I did not modify, and this is a significant difference from MIT-licensed code.  While there are exceptions, the vast majority of IP lawyers I have discussed the interpretation of these licenses with concur with this interpretation.

The permissive licenses above have the advantage in that they pass downstream the right of those who do further development to fully own their work to the extent that society allows (through copyright law and the like).  The copyleft licenses are different in that they pass only limited rights to utilize downstream to further developers.

In the description below I am going to avoid directly addressing hot topics like whether linking has anything to do with derivation.  These arguments, however, further serve to limit the right to utilize the software by providing gray, controversial zones of use, and so I will note them.  The copyleft licenses are efforts to enshrine and protect Stallman's four freedoms (the right to run the software, the right to study the software, the right to copy the software, the right to modify the software).  This is obviously an intentional reference to the rhetorical framework of Roosevelt's New Deal, and it highlights what I think is the flaw in the copyleft approach.

Roosevelt's New Deal was sold to the public as a way out of the Great Depression and a way to help the poor who were suffering from the economic downturn.  However the New Deal included programs intended to centralize industries (some of which were struck down by the Supreme Court prior to the court packing threat), and programs which supplanted organic, family-centric safety nets with impersonal institutions.  In essence the New Deal was corporatist and paved the way for the neoliberalism which has ruled the West in recent decades.  The same basic problems occur within the FSF's approach, which results in the GPL and AGPL being particularly popular with corporations wanting to reduce open source software to shareware.  Note that regarding institutional control, the FSF advocates giving them (the FSF) control over future licensing changes to the software via an "or at your option any later version" clause in the licensing.  For those of us trying to break up corporate control over software, however, the New Deal is the wrong model.  Trust-busting and common goods are the right model.

This is not to say that distributist approaches cannot use the GPL for their software.  It is just to say that the BSD license is probably better in such areas.  The fundamental difference is whether one sees the primary method of preventing abuse as centralized regulation (via the FSF in the form of new versions of the GPL family of licenses) or decentralized downstream control (via multi-vendor communities).  While these are not mutually exclusive, I don't think it is possible to rely equally on centralization and decentralization at the same time.  One has to choose one as primary.  If you choose centralization, you have to pretty much follow the GNU model of requiring copyright assignments, etc.  If you choose decentralization and do not require copyright assignments, you have to admit that this dilutes the protections of the GPL, moving it more towards the BSD family of licenses (because it is harder to prove copyright infringement when you have to prove ownership of code you created first).

The GNU GPL family of licenses places significant burdens on the right to utilize the software to produce new goods incorporating the software.  This is due to the so-called viral nature of the license.  One cannot close off the software and refuse to give downstream authors the same rights.  However on top of that there are controversies around where this starts and stops which make this restriction remarkably broader than it might seem on the surface.  The first is whether, as a copyright matter, you need permission to link to a library and utilize that library's code.  This affects whether, in the GPL v2, a proprietary licensed library connected to a GPL'd application (or vice versa!) is mere aggregation or a derivative work under the full control of the GPL.  This issue has never been resolved in court anywhere in the world to my knowledge and so people avoid doing this.

The GPL v3 is a very long, complex, legal contract which attempts to address the above issue among others (and it places further restrictions on goods produced regarding cryptographic signatures and the like).  On the whole the license creates more problems than it solves.  For example, everyone I have talked to agrees that the GPL v3 license and the BSD family of licenses (assuming no obnoxious advertising clauses) are compatible, but the view particularly of the FSF on what these licenses require is very different from what most lawyers I have talked to from the Software Freedom Law Center and independent IP lawyers have said.  The question boils down to what section 7, Additional Terms, requires of other licenses, and whether (or, more properly, how) the BSD license can be construed to be compatible with that.

The mainstream view is that the licenses were intended to be compatible and so the GPL should be read to be compatible with the BSD license.  This means that, since the BSD license does not mention sublicensing, the additional downstream permissions granted by the copyright holder take effect no matter what, and the 7(b) reasonable legal notices section should be read broadly enough to allow additional non-removable permissions granted by the copyright holder of certain contributions, provided the permissions only govern those contributions specifically.  It is worth noting that the Software Freedom Law Center has dedicated significant resources towards publishing viewpoints explaining this specific position.

Another view I have heard (from Eben Moglen shortly after the GPL v3 was finalized, but to be honest I don't know if his view has changed) is to read the BSD license as allowing sublicensing and read the GPL as requiring a right to sublicense, and then read removal of permissions not as a sublicense per se but just as a notice that the code cannot be safely treated as BSD-licensed.  But if this is the case and the BSD license does not grant that right, then it raises significant legal issues.  As a software developer I don't feel like I can accept this interpretation of both licenses safely.

These controversies and others conspire to create a situation where there exists  significant uncertainty about how one can safely utilize the software in the creation of new products.  In general, one cannot go far wrong by following community norms, which treats the BSD license as compatible and linking to proprietary software and libraries as problematic, but these could be destabilized in the future with litigation and so it is a significant concern.

Of the copyleft licenses, the Affero GPL or AGPL stands alone for placing significant restrictions on the utilization of software to deliver services.  Of the licenses discussed, it is the one which is plainly incompatible with a Distributist approach to economic ownership.  The AGPL requires that modifications to the software be made available to the user of the software when that user is interactively using the software as a service.

The AGPL is specifically intended to close the so-called SaaS loophole, but it is the wrong solution to the wrong problem.  From a customer's perspective the primary concern with SaaS is not functionality but data ownership.  If I take a piece of software and modify it, I can assume users can hire someone to re-implement the modifications easily enough, but if they don't have access to their raw data, they are stuck with me.  This sort of lock-in is the real problem with SaaS from a competitive perspective but it cannot be reasonably addressed through software licenses while maintaining a commitment to distributed ownership and development, and freedom to utilize.

Conclusions


Free and open source software today is split between two models.  The first is what we might call the Capitalist model, exemplified by products like MySQL, which presupposes a single commercial entity holding total copyright ownership over the software.  The second is the Distributist model, where ownership and the right to utilize are distributed.  PostgreSQL is a good example, as is FreeBSD.  The second model is not incompatible with capitalist entities participating, but they don't control the right to utilize the software in the same way.  Similarly, the Capitalist model can't stop Distributist off-shoots (see MariaDB and LedgerSMB), and those offshoots are usually forced into a Distributist model because the founders cannot assert copyright control sufficient to engage in dual licensing.

In general, the Distributist understanding of property and business ownership provides a very different look at free and open source software, and why these are important even for non-programmers.  These programs also provide a very good look at a  Distributist economic order and how it can function in a Capitalist society's knowledge economy.