Tuesday, November 22, 2011

LedgerSMB (and PostgreSQL) at MYGOSSCON

Metatron Technology Consulting has rented a booth at the Malaysia Government Open Source Software Convention (MYGOSSCON), 2011.  This booth will be used to advocate PostgreSQL and LedgerSMB both to private and public sectors.

Anyone going to this convention should look for us!

Saturday, November 12, 2011

The Case for Intelligent Databases

LedgerSMB, as an ERP and accounting system, has over time begun to move towards an intelligent database design based on PostgreSQL.  We believe that this best matches our long-term goals for the project, especially because interoperability is a very important requirement for any ERP system.  Read more to see our general case in this area and see what other projects may be able to learn from our experience.

A Short Developer's Definition of Intelligent Databases

An intelligent database is a database encapsulated behind a defined API.

A Longer Description

While most programmers primarily use the database  primarily as a place to simply store data, this reduces a database to basically an ACID-compliant data store with some reporting capability.  This approach saves the programmer from worrying about advanced database functionality, but it also prevents the database system from being used to do a number of useful tasks that databases can really be leveraged to do.

This post will explore why LedgerSMB is going this direction, and the sorts of functionality (particularly regarding PostgreSQL) that is a part of this plan, what sorts of tradeoffs we see in it, and where we expect to benefit.  This is not an article that discusses the details of implementation (see previous posts for that), but rather at a high-level how and why.

The Toolbox

The fundamental idea of encapsulating a database inside a defined API is that usually the application should hit the API and not the underlying relations where the data is stored if at all possible.,  The  API acts as an interface between actual storage/retrieval and the application.

The major aspects of an API can include views and/or stored procedures.  We use stored procedures for most things but may include views for some sorts of objects as time goes on.

What do Intelligent Databases Cost?

Many current development environments seek to save the programmer from spending heavy engineering time on a database, particularly in the early stages of a project where requirements can change.  The database is relegated to a simple data store with ACID compliance.  Frameworks like Rails will even generate a database schema from programming data structures.

If one goes with an intelligent design for the database, one cannot use agile programming methodologies to design the database.  Instead the database requires some real engineering in its own right. This takes some time and effort in its own right and can slow down development compared to ORM-based code generators such as we see in Rails.

Such engineering requires taking a clear look at what data is expected to be collected, what internal structure it has, designing the database storage first and foremost around that analysis,and then building an API based on how it will be used.  For simple schema development this doesn't take a long time but the complexity does grow with the data collected.

In the LedgerSMB project this is not the end of the world since we are re-engineering the database incrementally and so each major branch includes at least a few areas where old db structures have been pulled out and new, better engineered ones added.  These areas then have an API and the application generally hits the API not the relations.  Old code areas may hit the API or the relations.

Other projects may find that this adds significant development time beyond what is reasonable, but these must still be weighed against the benefits below.

What benefits do Intelligent Databases Provide?

Intelligent databases offer a number of specific benefits that LedgerSMB tries to capitalize on.  The fact that these designs better match database theory is not enough.  tangible benefits were required for us to reach this decision.

Some of these benefits speed up time troubleshooting and debugging code, partly offsetting the additional investment mentioned above.  Others provide compelling features to third party developers who may want to integrate our software with theirs.

The single most important benefit is that of clarity.  First, with very few exceptions, LedgerSMB tries to keep to one language per file.  This means that database queries are in .sql files, and Perl is in .pl and .pm files.  The modules which do query mapping are the only Perl modules that contain both Perl and SQL.  In this way if I want to know what a query does, I can go to the relevant SQL file and read it.  Using features like named inputs in PL/PGSQL, even 100 line queries can be remarkably readable.  The API is also well defined, allowing us to troubleshoot issues with the query writer.

Clarity also means that performance problems can more easily be reviewed and addressed without getting into messy questions of ORMs and how they generate queries.  The queries that produce a specified result set from an API call can be defined and the actual function written out in SQL.

Along with clarity, we get the ability to do as much as possible with declarative constraints.  This provides, we believe, greater robustness in the application and therefore better software in the long run.

The second important benefit is in the area of integration.  Because the API is defined, there are specified integration points that other applications can use to integrate without worrying as much about ensuring that corner cases are properly covered.  This makes integration more robust.  Moreover important areas like LISTEN and NOTIFY allow other applications to integrate in ways approximating message queues, and the like.

In fact, the integration benefit was the driving force behind the decision to move to an intelligent database design.  We wanted to open up the application to add-ons written in other languages (Python at the time, but others could supported without too much effort).

Addressing Separation of Concerns

We have all heard horror stories of what happens when you put everything in the database.  We have also heard of developers afraid to put anything in the database because of fear of going down the roads of these other horror stories.  The secret to success here is in fact to separate concerns and to separate them well.  Many lines are ultimately arbitrary in detail but bright in outlines and so they require a team dedicated to making decisions carefully rather than drawing bold lines and mindlessly following them.

Of course that's a long way of saying you can't get away with sloppy engineering and have to actually put some thought into things.

Other Options

There are other options for application development. However, they all presume that the application has intimate knowledge of the internals of the database.  I am somewhat puzzled why programmers who are so careful to avoid modules understanding the internals of eachother are so careless about encapsulation in the database.

The main other options are use of an ORM, which typically involves designing the database around the ORM and the data model for the application, and thus the RDBMS becomes basically an extension of the object model.  This limits interoperability, and reduces the role that the database has relative to interoperability with other applications.  This being said there isn't necessarily a problem with using VIEWs and ORMs together to provide this sort of abstraction.

A second possibility is to use a dedicated store, like many NoSQL engines offer where the database really is a store for objects, and where the "schema" (in scare quotes because NoSQL usually doesn't really have a schema concept) is just a direct extension of the object model itself.

The tradeoff with NoSQL is similar to using an ORM and building your database around that ORM, but more extreme.  Typically you see even more rapid development early on, combined possibly with additional scalability in some cases, but with the tradeoff of interoperability is more severe, and ad hoc reporting is generally not practical outside the relational model.

Because of our desire to focus on interoperability and a reliance on good, flexible reporting, these options were not options for our project.  They might be for others though.


LedgerSMB has benefited greatly benefited from the decision to move towards an intelligent database layout.  Reasons include our focus on interoperability, and our willingness to do engineering up front.  Other projects may find their mileage varying.

Saturday, November 5, 2011

Elegance and Simplicity

When it comes to software engineering, I always think back to my first year Calculus teacher, Mike Lavender, and his admonishment that "in mathematics, power tools are consider inelegant where hand tools will do." Words to live by in programming and engineering.

So I'd go so far as to redefine elegance in software programming and engineering as "simplicity perfected." The simple fact is that where requirements change, a well engineered simple solution will usually turn out to be more agile than a heavily engineered, complex one (whether programmed with "agile" methodologies or not).

When one takes the idea of trying to perfect simplicity in the art of programming (and software engineering), it becomes easier to recognize foot-guns and inherent problems, and tradeoffs often become more conscious and controlled.

Wednesday, November 2, 2011

10 Underrated Features of PostgreSQL

Here are a list of features which I think are generally underrated in PostgreSQL and why.  Many though not all of these are being used in the development version of LedgerSMB, and more of these will likely be used as time goes on either by the main software or by addons.

These are functions which I greatly miss when I set down and work with many other RDBMS's, or which are fairly new and often underutilized.

10:  Language-Specific Full Text Searches
What it is:  Language-specific searches on related words.

Why I like it:  This function allows for a full text search to query related words in a post.  These can be indexed using functional indexes below as long as the system knows which language to search under.  So imagine if you put a text field that says "I am going to test this for the customer next week" but when you go back and search you search for "testing for customer" the result will still show up, because PostgreSQL knows that in English, "test" and "testing" can be two forms of the same word.

How I'd use it:  Any application which requires full text searches can take advantage of this to provide a much richer experience.

The LedgerSMB budget module currently uses full text searches for the descriptions of budgets.  Currently this is locale specific.  Future versions will probably allow notes to be in various languages and searched via a full text search per language.

9:   Pgcrypto
What it is:  Cryptographic extensions to PostgreSQL

Why I like it:   With pgcrypto, one can design systems which provide strong protection of data in the database, making the encryption transparent to the application.

How I'd use it:  If someone wanted to store, say, credit card data in the database, this could be installed using suitable forms of managed keys such that the user could unlock a key and retrieve the data.  This could then be handled such that the application itself would not need to handle encryption or decryption of the data, thus allowing other applications to handle key management and other important tasks.

8:    Functional Indexes

What it is:  Indexes on the outputs of immutable functions

Why I like it:  Functional indexes allow us to store information functionally dependent on information actually stored in the database.  Consequently just because we need to search on information.  This is available in Oracle and to a far lesser extent in MS SQL and DB2.  It is not available in MySQL.

How I'd use it:  Functional indexes are behind language-dependent full-text indexing (and hence can make searches there a lot faster).  It also has a large number of other uses.

For example, suppose I am storing a large bytea column and I want to search quickly on results.  I could index md5(my_bytea) and then search on input = my_bytea and md5(input) = md5(my_bytea).  This would allow an index scan to pull out matching md5's and then checking the full result against matching records.  For many other databases, one would have to store both the original data and the md5 checksum.

Another use would be to ensure uniqueness based on function output.  So for example, we could:

CREATE UNIQUE INDEX u_functional_index ON my_table (lower(my_col));

Then values would be rejected if they are non-unique to a case insensitive search.

7:    Partial Indexes

What it is:  Indexes on a subset of rows.

Why I like it:  Partial indexes allow the development of relatively complex data constraints.  For example, one can require a unique constraint on a subset of rows matching a certain criteria, since uniqueness is implemented on the index.  Of the other major RDBMS's, only Microsoft SQL Server implements anything like this.

How I'd use it:  Partial indexes can be used in two fundamentally different ways:  speeding up common search criteria and enforcing partial unique constraints.  Many are familiar with speeding up searches in this way, but partial unique indexes are where the real power lies. 

So suppose ar and ap invoices share the same physical table, but ar invoices must have unique invoice numbers while ap invoices must be unique per vendor.  We can use partial unique indexes to support both these constraints independently.

CREATE UNIQUE INDEX u_ar_invnumber ON invoices (invnumber) WHERE type = 'ar';

A second important application is adding forward-only unique constraints.  These are helpful where a new unique constraint needs to be added immediately but not all data in the table conforms to that constraint.  As a stop-gap measure (while data is being cleaned up), one might use a partial unique constraint to encourage uniqueness going forward.  Something like:

CREATE UNIQUE INDEX u_ar_invnumber ON invoices (invnumber) WHERE type = 'ar' AND id > 12345;

Data can be cleaned up (if possible), then the index can be dropped and recreated with a different where clause.

6:    Third Party Solutions

What it is:           An extensive set of third party add-ons for PostgreSQL

Why I like it:        There's so much to choose from!

A few of my favorites:   PGTAP is a test harness for stored procedures, building on other test harness frameworks.  It allows you to write production-safe test cases and run them without risking your data.  Very very good.

Slony-I is a very good well-developed single-master, asynchronous replication system for PostgreSQL.  it was developed to help ensure the .org domain registrar would scale appropriately.  Slony-I supports, among other things, replicating between different versions of PostgreSQL, something the native solutions do not.

Speaking of replication, Bucardo allows you to do multi-master and master-slave asynchronous replication as well.

PostGIS provides an excellent basis for GIS applications, and there are many, many other third party, free software add-ons for PostgreSQL  Uses are largely limited by the imagination.

5:    Language Handles

What it is:  Define your own stored procedure languages!

Why I like it:  Makes migration often more feasible than it would be otherwise, and exposes much advanced functionality to stored procedure interfaces.

How I'd use it:  There are three major uses of stored procedure handles (i.e. to create stored procedures in languages other than C, SQL, and PL/PGSQL).

The first of these is migration.  If one has an Oracle database which uses a lot of Java stored procedures you can often port these to PostgreSQL and PL/J with a minimum of headaches.  Similarly if you have an MS SQL database with .Net stored procedures you may be able to make these work with PL/Mono.

The second is to make advanced features of a language available to your stored procedures.  If you are doing a lot of text processing, PL/Perl may be helpful, but if you are doing a lot of statistical calculations, PL/R may be very helpful.

Finally (and here is the dangerous one), language handles can make it possible to do very un-database-like things from within stored procedures.  So for example, you could use PL/SH to execute shell scripts from the stored procedures.  However in general for these areas, external scripts and LISTEN/NOTIFY (see below) are better tools.

4:    Complex Types

What it is:        Tuple-based types, often used as function output types.

Why I like it:     These provide an opportunity to create a very rich API within the db.  PostgreSQL supports not only pure tuples but tuples which contain arrays of other tuples as their members.

How I'd use it:   In LegerSMB, we use complex types to define much of our interface for the application and this dependency will increase over time.  These types allow one to represent in SQL form any well-structured representation of data in any other application.

These can then be used to define inputs and outputs of stored procedures.

3:    Listen/Notify

What it is:  Notification of specified database events.

Why I like it:  this provides a notification service within the database.  Applications listen for a specified notification handle, and other functions can raise those notifications.  This allows one to create message queues (similar to Oracle's Database Notification Services), but that's not all.

Because the notifications are decoupled from low-level database operations, one can raise the notifications based on arbitrary criteria.  Suppose we want to raise a notification anytime the postgres user deletes data from the audit trail?  We can do that. No problem.  Similarly we could raise a notification anytime

How I'd use it:  You can use LISTEN/NOTIFY to send a message to a script that information has been dropped in a table and is ready to email out to a decision-maker.  For example you could use this to provide real-time emailing when, for example, a part's on-hand value drops below the reorder point.  In this way all manner of message queues can be created using the database as a center.

On top of that one can send notifications out around things that do not modify the database. I don't have a specific use case for this yet, but the fact that it is there means that many other things suddenly become possible.

2:    Windowing Functions

What it is:  The ability to have an aggregate function that aggregates incrementally over a set or portion of a set.  Although most RDBMS's these days do support windowing functions (MySQL, DB2, Oracle, and MS SQL all do), this is a feature which is often overlooked.

Why I like it:  This allows you to specify aggregates which are run incrementally across a set.  This set can be partitioned and ordered independent of any ordering on the final set.

How I'd use it:  There are huge numbers of uses for this sort of thing.  For example, windowing functions make it possible to select the highest ranked rows within a set by certain criteria.  They also allow one to add running totals to financial reports, or add running totals within specific transactions for financial reports.

No doubt there are many many more uses I just haven't thought of or come across.  However if you think about being able to use windowing functions the sorts of queries you can write are well expanded.

1:     PL/PGSQL

What it is:   A very clear, concise, well organized language for stored procedures.  Borrowing heavily in approach from Oracle's PL/SQL, this language is a mixture of Ada, PL/I, and SQL.  The result, while different from what most programmers are used to, highly readable, easy to follow, and extremely powerful.

Why I like it: PL/PGSQL adds almost everything you could want that vanilla SQL does not offer in a procedural language.  Conditionals, loops, and named variables all come to mind.  But what it has going for it is conciseness.  Often times I hear people say they are "just" doing stored procedures in PL/PGSQL.  There is no "just" about it though.  This language is very well designed for what it does.

How I'd use it:  LedgerSMB currently contains most SQL in the newer code inside stored procedures written in PL/PGSQL.  We use it for most procedures with large numbers of arguments or where the control structures are at all helpful.  In essence we use it a LOT.

The above list is largely arbitrary.  Another author might put in the fact that data definition language calls are usually transaction-safe (meaning you can roll back a transaction that dropped a table), incremental backups, or any number of other features.  However the features that I have included above are generally those which allow PostgreSQL to be used as a very extensible database server.  It is my view that these features, which are at their best when creating intelligent databases, are frequently under-used especially by small developers because they think of PostgreSQL as merely a data store instead of the very powerful system that it is.

What features do you think are underrated?  Comment below to mention them!