Wednesday, April 24, 2013

A few cases where EAV modelling really works in LedgerSMB.

Entity-attribute-value (EAV) modelling is typically seen as an antipattern and for good reasons.  Misapplied it leads to inflexible queries, poor performance and a number of other bad things.  Nonetheless there are classes of problems where EAV is the right solution for the problem at hand.  Two specific areas of LedgerSMB will demonstrate this, namely account/dropdown management and the menu schema.  These actually represent different classes of problems but they have a number of things in common.  The goal of this post is to help readers better understand where EAV modelling can be of great help.

In general, the areas of LedgerSMB which use EAV modelling are stable, rarely changing data with relatively simple to non-existent validation routines (the menu is expected to be externally validated).   In some cases the basic expectations of EAV are modified slightly.  In both these cases (which represent relatively different classes of problems) we have been very happy with EAV and while our approach will no doubt be refined in the future, it is unlikely that these portions of the software will ever move from EAV.

Note the LedgerSMB development team does not advocate starting with EAV for most problems.  Usually where EAV is appropriate it will naturally emerge in the schema design process.

What is EAV?


EAV is a way to model relatively free-form data in such a way as to dispense with the need for extremely complex, long tables for specifying sparse data.  A typical EAV approach has three basic tables representing entity data, attributes, and value data.

To make EAV work successfully, as much data as possible should be put in the entity table or parent/child tables.  It's only where this approach breaks down that EAV is a good idea.  The examples below will flesh this out more.  However  a basic typical design might look like:

CREATE TABLE entity (
    id serial not null unique,
    label text primary key,
    ....
);

CREATE TABLE attribute (
   id serial not null unique,
   att_name text not null,
   att_datatype text not null,
   ....
);

CREATE TABLE value_table (
   id bigserial not null unique,
   entity_id int not null references entity(id),
   attribute_id int not null references attribute(id),
   value text not null
);

In portions below we well explore what is right and what is wrong with this approach.  This post is hardly a definitive guide to the subject but it is hoped it is a step towards exploring this area.

Why EAV?


EAV dispenses with the requirement for rigid schemas.  Particularly in cases where data validation is not complex or not demanded, EAV can solve problems that standard row-oriented models cannot readily solve.  Some classes of problems (tags to social media for example) fall readily in a modified EAV model.

In essence EAV can sometimes(!) be a useful approach to use when the data does not fit well into other models.  Unfortunately when it is grabbed for too quickly bad things result, but this does not prevent it from being useful in cases where it is at home.

Why Not EAV?


EAV, improperly applied, leads to a significant number of problems.  Improperly applied, it leads to an inability to declaratively validate data within the schema of the database itself beyond very simple rules.  Improperly applied it leads to performance problems, and basic data management problems.

In general my own view is that EAV is best kept as a tool to solve very specific problems, and that most other problems are better solved by careful table design.

Modified EAV in the Account/Dropdown Management Schema


In LedgerSMB chart of accounts are attached to drop downs.  The schema looks like:

CREATE TABLE account (
  id serial not null unique,
  accno text primary key,
  description text,
  category CHAR(1) NOT NULL,
  gifi_accno text,
  heading int not null references account_heading(id),
  contra bool not null default false,
  tax bool not null default false
);



CREATE TABLE account_link_description (
    description text    primary key,
    summary     boolean not null,
    custom      boolean not null
);


CREATE TABLE account_link (
   account_id int references account(id),
   description text references account_link_description(description),
   primary key (account_id, description)
);


Now what is obviously missing here in an EAV perspective are values.  We have entities, and attributes, but the presence of the attribute itself is the value.  It would be like EAV with merely boolean values.

The architecture poses all the classic issues one has with EAV.  There are rules regarding what combinations regarding which combinations of drop-downs are invalid.  These are not complex rules but they could change over time.  Currently if an account_link is a summary account, then it is only allowed a single drop down link, so no others can exist.  Custom accounts are not validated and can break these rules.

Currently validation occurs via a stored procedure which encodes the rules in its own logic.

Normally the tables involved here are small.  A typical business will have no more than a few dozen account entries, and maybe a hundred or two account_link entries, and there are only 27 account link descriptions.  However larger businesses could have thousands of accounts and tens of thousands of account_link entries.

The approach here scales well because there are only really three typical search patterns. 

The first is a search of accounts for a specific link description (dropdown box).  This will usually pull only a portion of the accounts and can be managed through indexes.

The second is a search of account_links for a given account.  This is trivial and usually no more than about 20 items.

Finally we may pull all entries from all tables, and join/aggregate.  This is not going to require multiple table scans, comparing results, in order to determine the appropriate rows to return.

One thing that is important to note is that chart of accounts data is pretty stable.  It may change a few records over time, but radical changes or mass updates are very rare.  Thus we can sacrifice some immediate performance over the lon run.
 

Lightweight EAV in the menu definition schema


Our menu schema looks like this:

CREATE TABLE menu_node (
    id serial NOT NULL,
    label character varying NOT NULL,
    parent integer,
    "position" integer NOT NULL
);


CREATE TABLE menu_attribute (
    node_id integer NOT NULL references menu_node(id),
    attribute character varying NOT NULL,
    value character varying NOT NULL,
    id serial NOT NULL,
    primary key(node_id, attribute)
);


CREATE TABLE menu_acl (
    id serial NOT NULL,
    role_name character varying,
    acl_type character varying,
    node_id integer,
    CONSTRAINT menu_acl_acl_type_check CHECK ((((acl_type)::text = 'allow'::text) OR ((acl_type)::text = 'deny'::text))),
    PRIMARY KEY (node_id, role_name)
);


What is omitted here is the attribute table.  In essence our menu model has no data validation and the data is extremely stable.  It is expected that the developer validates against the application.

The tables here are relatively small.  By default a couple hundred rows in menu_node and nearly a thousand in menu_attribute.  This structure stretches what relational db's will let you do gracefully since menu_node represents a tree hierarchy, and the rest establishes a very open (unvalidated) EAV model.

Access patterns here are typically once per login, pulling all data from these tables (two scans, one for permissions), and then aggregating the attributes such that we get enough information to create a menu structure (<ul> tags with hyperlinks inside <li> tags).

Menu items are usually only created once and are extraordinarily stable. 

Conclusions


These use cases have a number of striking things in common.  The access patterns are simple, as are the in-db data validation rules (if they exist at all).  The data is rarely written.

The second point is something higher level that these cases have in common.  They represent a way of the application storing its own API in the database.  In other words they represent a way of storing application configuration information, or hooks or references to application logic (though not the logic itself).  These areas are semantically very flexible but typically the data is long-lived once everything is set up.

3 comments:

  1. I wouldn't call the cases you describe EAV.

    The main "feature" of EAV -- which is at the same time its biggest problem -- is that all data for all "classes" is stored in the same tables.

    This makes it next to impossible to have database constraints. And performance is terrible.

    The cases you describe certainly don't have these problems!

    ReplyDelete
    Replies
    1. In the menu system we are actually storing very free-form classes in the database. Technically these can be any of a number of application classes and there is no database constraints possible based on the data semantics itself.

      However this has two important aspects. First it is *application* data rather than business data (we will never run reports on it aside from for the benefit of application developers). Secondly it is extremely stable (it is defined by the developer as an initial application state for the next workflow). Therefore while state is all too often the root of evil, this allows for very tight control.

      Delete
  2. The other thing I'd like to observe is that -- just as in OO programming -- there are many patterns in data modelling.

    Just because the pattern you're using looks similar to the one used by EAV doesn't mean it's bad :-)

    A good database developer learns the patterns and when to use them.

    One thing we do at my company is generate code -- on the fly -- in a Model-Driven-Development platform. Recognising patterns in data models allows us to generate different things in different situations.

    Most of the patterns we recognise and do things with at present are very simple. For example a foreign key to a table with a surrogate key and a unique key on a varchar column in the parent table lets us add that varchar column's value in the response to a web-service request -- denormalising the data on the fly, if you will -- saving a round trip in many cases.


    I'd really like to catalog more data modelling patterns. Do you know of any such efforts?

    ReplyDelete