Saturday, July 28, 2012
The perpetual argument over putting logic in the database will continue, but here is one advantage that is often overlooked: stored procedures provide a powerful way to integrate programs written in different environments.
Case in point: Yesterday, when I got tired of staring at Perl code, I wrote some basic PHP classes to provide integration logic for LedgerSMB. The basic interface, which performs the query-mapping functions, took me a couple of hours including debugging (and I haven't programmed in PHP since 4.2 was current), and it includes an additional class for retrieving and saving company records as well. Not everything works yet, but this will probably be finished up today.
These classes will make it quite easy to write integration logic against the portions of the software that have been moved over to stored procedures, and code that calls the stored procedures this way will be provably free of SQL injection and privilege escalation.
I haven't decided where to put this, whether in the main LedgerSMB project or in another Sourceforge project, but it is coming soon. Once these are working I intend to teach myself enough Java to write the same classes in that language.
In the long run, there is a lot of boilerplate code here that can probably be generated by code generators in all of these languages. The code generators will eventually have to query the system catalogs, and some custom application catalogs. This will minimize the amount of code that has to be written by hand.
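To make the integration point concrete, here is a minimal sketch of the database side (the function name, table, and columns are hypothetical, not the actual LedgerSMB API). Any client language, PHP, Perl, or Java, then calls the function with bound parameters only, which is why the calling code cannot introduce SQL injection:

CREATE OR REPLACE FUNCTION company__get(in_control_code text)
RETURNS TABLE (id int, control_code text, legal_name text) AS
$$
BEGIN
    -- All SQL lives here, server-side; clients never build query strings.
    RETURN QUERY
        SELECT c.id, c.control_code, c.legal_name
          FROM company c
         WHERE c.control_code = in_control_code;
END;
$$ LANGUAGE plpgsql;

-- A client simply prepares:  SELECT * FROM company__get($1)
-- and binds the argument, regardless of the language it is written in.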
Friday, July 20, 2012
On Long Queries/Stored Procedures
In general, in most languages, programmers try to keep their subroutines down to a couple screens in size. This helps, among other things, with readability and ease of debugging. Many SQL queries are significantly longer than that. I have queries which are over 100 lines long in LedgerSMB, and have worked on debugging queries that are more than 200 lines long. In general, though, the maximum length of a query is not a problem if certain practices are followed. This post describes why I think SQL allows for longer procedures and what is required to maintain readability.
Why Long Queries Aren't a Problem Per Se
In a typical programming language, there is no enforced structure within a subroutine. What this means is that as a subroutine becomes longer, maintaining internal patterns and an understanding of state becomes more difficult. SQL queries themselves, however, are defined by their structure. For example, a standard SELECT statement (excluding common table expressions) includes the following parts, in order (a short annotated sketch follows the list):
- Returned column list
- Starting relation
- Join operations
- Filters
- Aggregation criteria
- Aggregation filters
- Postprocessing (ordering, limits, and offset)
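To make the structure concrete, here is a minimal sketch of a query (the table and column names are hypothetical) with each structural part labelled in the order it appears:

SELECT c.name, sum(i.amount) AS total        -- returned column list
  FROM invoice i                             -- starting relation
  JOIN customer c ON c.id = i.customer_id    -- join operations
 WHERE i.transdate >= '2012-01-01'           -- filters
 GROUP BY c.name                             -- aggregation criteria
HAVING sum(i.amount) > 0                     -- aggregation filters
 ORDER BY total DESC                         -- postprocessing
 LIMIT 50 OFFSET 0;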
Because of this structure, it is pretty easy to dive into even a very long query and quickly figure out exactly where the problems are. Maintainability is not dependent on length or overall complexity.
Moreover, if you have common table expressions, it is easy to jump back to the beginning, where they are defined, and reference them as needed.
A second difference is that SQL statements work with state which is typically assumed to be unchanging for purposes of the operation. With rare exceptions, the order of execution doesn't matter (order of operations however does). Consequently you don't have to read line-by-line to track state.
In essence debugging an SQL statement is very much like searching a b-tree, while debugging Perl, Python, or C is very much like traversing a singly linked list.
What Can Be a Problem
Once I spent several days helping a customer troubleshoot a long, complex query. The problem turned out to be a bad filter parameter, but we couldn't tell that right away. The reason was that the structure of the query had decayed a bit and this made maintenance difficult.
The query was not only long, but it was also difficult to understand, because it didn't conform to the above structure. The problem in that case was the use of many inline views to create a cross-tab-type report.
If you can't understand a long query, or if you don't know immediately where to look, it is hard to troubleshoot.
Best Practices
The following recommendations are about keeping long queries well-structured. Certain features of the SQL language are generally to be avoided more as queries become longer, and other features should be used carefully.
- Avoid inline views. OK, sometimes I use inline views in longer queries too, but usually these are well-tested units in their own right that are re-used elsewhere. They should probably be moved into defined views, or into CTEs where applicable (see the sketch after this list).
- Avoid UNION and UNION ALL in long queries. These complicate query maintenance in a number of ways. It is better to move them into small testable units, like a defined view based on a shorter query or a common table expression.
- In long stored procedures, keep your queries front and center, and move as much logic as is reasonable into them.
- Avoid implicit joins, which list every table in the FROM clause as a cross join and then push the join conditions into the WHERE clause alongside the filters. Join logic should be kept separate from filter logic.
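As a sketch of the first point (the table names are hypothetical), an inline view buried in the FROM clause can usually be pulled up into a named CTE, where it can be read, tested, and referenced on its own:

-- Before: inline view, harder to find and test
SELECT c.name, o.order_total
  FROM customer c
  JOIN (SELECT customer_id, sum(amount) AS order_total
          FROM orders GROUP BY customer_id) o
    ON o.customer_id = c.id;

-- After: the same unit named up front as a CTE
WITH order_totals AS (
    SELECT customer_id, sum(amount) AS order_total
      FROM orders
     GROUP BY customer_id
)
SELECT c.name, ot.order_total
  FROM customer c
  JOIN order_totals ot ON ot.customer_id = c.id;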
Tuesday, July 17, 2012
Why LedgerSMB uses Moose. An intro for PostgreSQL folks.
In LedgerSMB 1.4 we are moving to using a Perl object system called Moose for new code. This post will discuss why we are doing so, what we get out of it, etc. For those new to Moose this may serve as a starting point to decide whether to look further into this object framework.
To start with, however, we will need to address our overall strategy regarding data consistency and accuracy.
Why we Use Database Constraints Aggressively
Before getting into Moose specifically it's important to re-iterate why we use database constraints aggressively in LedgerSMB.
In general you can divide virtually all application bugs into a few categories which fall within two major classifications:
- State Errors
- Execution Errors
Database constraints allow us to declare mathematically the constraints of meaningful data and thus drastically reduce the chance of state errors occurring in the application. Foreign keys protect against orphaned records, type constraints protect against invalid data, and check constraints can be used to ensure data falls within meaningful parameters. NOT NULL constraints can protect against necessary but missing information.
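As a minimal sketch of what declaring the constraints of meaningful data looks like (the table and columns here are illustrative, not the actual LedgerSMB schema):

CREATE TABLE invoice_line (
    id          serial PRIMARY KEY,
    invoice_id  int NOT NULL REFERENCES invoice (id),   -- no orphaned lines
    description text NOT NULL,                          -- required information
    qty         numeric NOT NULL CHECK (qty <> 0),      -- meaningful parameters
    sell_price  numeric NOT NULL CHECK (sell_price >= 0)
);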
The other type of error is an execution error. These can be divided into two further categories: misdirected execution (confused deputy problems and the like) and bad execution (cases where the application takes the right information and does the wrong things with it).
Any reduction in state errors has a significant number of benefits for the application. Troubleshooting is simplified, because the chance of a cascade of problems from previously stored bad data is reduced. This leads to faster bugfixes and a more robust and secure application generally. Not only are the remaining problems simpler, there are also fewer of them. We see this as a very good thing. While it is not necessarily a question of alternatives, testing cannot compare to proof.
One really nice thing about PostgreSQL is its very rich type system and the ability to add equally rich constraints against those types. The two together make it an ideal database for this sort of work.
Enter Moose
Recently, Perl programmers have been adopting a system of object-oriented programming based on metaclasses and other concepts borrowed from the LISP world. The leading object framework to result is called Moose, which bills itself as a post-modern object system for a post-modern language (Perl). Moose offers a large number of features; the following are the ones we consider most important:
- A rich type system for declaring object properties and constraints
- Transparent internal structures
- Automatic creation of constructors and accessors
- Method modifiers.
Brief Overview of Useful Type Features
In plain old Perl 5 we would normally build accessors and constructors to check specific types for sanity. This is the imperative approach to programming. With Moose we do so declaratively and everything just works.
For example we can do something like this, if we want to make sure the value is an integer:
has location_class => (is => 'rw', isa => 'Int');
This would be similar to having a part of a table definition in SQL being:
location_class int,
This tells Moose to create a read/write accessor and to check, both in the constructor and whenever the accessor sets the value, that the value is a valid Int.
We can also specify that this is a positive integer:
subtype 'PosInt', as 'Int', where { $_ > 0 };
has location_class => (is => 'rw', isa => 'PosInt');
This would be similar to using a domain:
CREATE DOMAIN posint AS int check(VALUE > 0)
Then the table definition fragment might be:
location_class posint,
But there is a huge difference. In the SQL example, the domain check constraint (at least in PostgreSQL) is only checked at data storage. If I do:
select -1::posint;
I will get -1 returned, not an error. Thus domains are useful only when defining data storage rules. The Moose subtype however, is checked on every instantiation. It defines the runtime allowed values, so any time a negative number is instantiated as a posint, it will generate an error. This then closes a major hole in the use of SQL domains and allows us to tighten up constraints even further.
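For completeness, here is a minimal sketch (with a hypothetical table name) of the storage-time enforcement the domain does give us:

CREATE TABLE location_test (location_class posint);
INSERT INTO location_test VALUES (1);   -- succeeds
INSERT INTO location_test VALUES (-1);  -- fails with a check constraint violation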
So here we can simply specify the rules which define the data and the rest works. Typical object oriented systems in most languages do not include such a rich declarative system for dealing with state constraints.
The type definition system is much richer than the above examples show: we can require that attributes belong to specific classes, specify default values, control when the default is created, and much more. Moose is a rich system in this regard.
Transparent Internal Structures
Moose treats objects as blessed references to hash tables (in Perl we'd call this a hashref), where every attribute is simply listed by name. This ensures that when we copy and sanitize data for display, for example, the output is exactly what we expect. Consequently, if I have a class like:
package location;
use Moose;   # provides 'has' and the type constraints used below
has location_class => (is => 'rw', isa => 'Int');
has line_one => (is => 'rw', isa => 'Str');
has line_two => (is => 'rw', isa => 'Maybe[Str]');
has line_three => (is => 'rw', isa => 'Maybe[Str]');
has city => (is => 'rw', isa => 'Str');
has state => (is => 'rw', isa => 'Str');
has zip => (is => 'rw', isa => 'Maybe[Str]');
has country_name => (is => 'rw', isa => 'Maybe[Str]');
has country_id => (is => 'rw', isa => 'Int');
In that case, when we go to look at the object as a hash, we can print values from it directly in our templates. Our template wrapper can copy the resulting hash value for value, escape each value (in a new hash) as appropriate for the output format, and then pass that on to the templating engine. This provides cross-format safety that accessor-based reading of attributes would not provide.
In other words, we copy these objects as hashrefs, pass them to the templates, and they just work.
Accessors, Constructors, and Methods
Most of the important features here follow from the above quite directly. The declarative approach is used to create constructors and accessors, and this provides a very different way to think about your code.
Now Moose also has some interesting features regarding methods which are useful in integrating LedgerSMB with other programs. These include method modifiers which allow you to specify code to run before, instead of, or after functions. These can take the place of database triggers on the Perl level.
Pitfalls
There are two or three reasons folks may choose not to use Moose, aside from the fact that it does have some overhead.
The first is complexity. Moose has a lot of features and learning how to program effectively in Moose takes some time, though this is reduced by approaching it from a database perspective.
The second is that sometimes it is tempting to look into the raw data structure directly and fiddle with it, and if you do this then the proofs of accuracy are invalid. This is mitigated by the use of correct controls on the database level.
Features we are Still Getting Used To
This isn't to say we are Moose experts already. Moose has a number of concepts we haven't started using yet, like roles, which add cross-class functionality orthogonal to inheritance and which are highly recommended by other Moose programmers.
There are probably many other features too that will eventually come to be indispensable but that we aren't yet excited about.
Future Steps
On the LedgerSMB side, in the long run, I would like to move toward creating our classes directly from the database using code generators. We could query the system catalogs for the stored procedures to expose as methods as well. This is an advanced area we probably won't see much of for a while.
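As a rough sketch of the kind of catalog query such a generator could start from (the naming pattern shown is hypothetical), PostgreSQL can already tell us the argument and return types of every stored procedure:

SELECT p.proname,
       pg_get_function_arguments(p.oid) AS arguments,
       pg_get_function_result(p.oid)    AS returns
  FROM pg_proc p
  JOIN pg_namespace n ON n.oid = p.pronamespace
 WHERE n.nspname = 'public'
   AND p.proname ~ '^company__'
 ORDER BY p.proname;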
It's also likely that roles will start to be used, and perhaps DBObject will become a role instead of an inheritance base.
Update:
Matt Trout has pointed out that setting required => 0 is far preferable to Maybe[] and helped me figure out why that wasn't working before. I agree. Removing the Maybe[] from the type definitions would be a great thing. Thanks Matt!
Wednesday, July 11, 2012
PostgreSQL CTEs and LedgerSMB
LedgerSMB trunk (which will become 1.4) has moved from connectby() to WITH RECURSIVE common table expressions (CTEs). This post describes our experience, why we have found that CTEs are an extremely useful way to write queries, and why we are moving more and more towards them. We have also started using CTEs more frequently in place of inline views where appropriate.
CTEs were first supported in PostgreSQL 8.4. They represent a way to write a subquery whose results remain stable for the duration of the query. CTEs can be simple or recursive (the recursive form generates tree-like structures via self-joins). The stability of CTEs adds both opportunities for better tuning of a query and some performance gotchas: it is important to remember that the optimizer cannot correlate expected results from the rest of the query with a CTE in order to optimize the CTE (though it presumably can go the other way).
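As a small sketch of that last point (true of the PostgreSQL versions discussed here), a filter in the outer query is not pushed down into the CTE; the CTE body runs as written and the filter is applied only to its result. The column names below come from the queries later in this post:

WITH totals AS (
    SELECT trans_id, sum(amount) AS total
      FROM acc_trans
     GROUP BY trans_id
)
SELECT * FROM totals WHERE trans_id = 12345;
-- The aggregate runs over the whole acc_trans table first;
-- the outer WHERE filters the CTE's result afterwards.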
The First Example
The simplest example would be our view, "menu_friendly" which shows the structure and layout of the menu. The menu is generated using a self-join and this view returns the whole menu.
CREATE VIEW menu_friendly AS
WITH RECURSIVE tree (path, id, parent, level, positions)
AS (select id::text as path, id, parent, 0 as level, position::text
FROM menu_node where parent is null
UNION
select path || ',' || n.id::text, n.id, n.parent, t.level + 1,
t.positions || ',' || n.position
FROM menu_node n
JOIN tree t ON t.id = n.parent)
SELECT t."level", t.path,
(repeat(' '::text, (2 * t."level")) || (n.label)::text) AS label,
n.id, n."position"
FROM tree t
JOIN menu_node n USING (id)
ORDER BY string_to_array(t.positions, ',')::int[];
This replaced an older view using connectby() from the tablefunc module in contrib:
CREATE VIEW menu_friendly AS
SELECT t."level", t.path, t.list_order,
(repeat(' '::text, (2 * t."level")) || (n.label)::text) AS label,
n.id, n."position"
FROM (connectby('menu_node'::text, 'id'::text, 'parent'::text,
'position'::text, '0'::text, 0, ','::text
) t(id integer, parent integer, "level" integer, path text,
list_order integer)
JOIN menu_node n USING (id));
The replacement query is longer, it is true. However, it has a number of advantages. The biggest is that we no longer have to depend on a contrib module which may or may not come with PostgreSQL (most Linux distributions package the contrib modules separately) and whose installation methodology may change (as it did in PostgreSQL 9.1).
I also think there is some value in having the syntax for the self-join a bit more obvious rather than having to run to the tablefunc documentation every time one wants to check. The code is clearer and this adds value too.
I haven't checked performance between these two examples, because this isn't very performance-sensitive. After all it is only there for troubleshooting menu structure, and it is only a set of read operations. Given other tests however I would not expect a performance cost.
Better Performance
LedgerSMB generates the menu for the tree-view in a stored procedure called menu_generate() which takes no arguments. The 1.3 version is:
CREATE OR REPLACE FUNCTION menu_generate() RETURNS SETOF menu_item AS
$$
DECLARE
item menu_item;
arg menu_attribute%ROWTYPE;
BEGIN
FOR item IN
SELECT n.position, n.id, c.level, n.label, c.path,
to_args(array[ma.attribute, ma.value])
FROM connectby('menu_node', 'id', 'parent', 'position', '0',
0, ',')
c(id integer, parent integer, "level" integer,
path text, list_order integer)
JOIN menu_node n USING(id)
JOIN menu_attribute ma ON (n.id = ma.node_id)
WHERE n.id IN (select node_id
FROM menu_acl
JOIN (select rolname FROM pg_roles
UNION
select 'public') pgr
ON pgr.rolname = role_name
WHERE pg_has_role(CASE WHEN coalesce(pgr.rolname,
'public')
= 'public'
THEN current_user
ELSE pgr.rolname
END, 'USAGE')
GROUP BY node_id
HAVING bool_and(CASE WHEN acl_type ilike 'DENY'
THEN FALSE
WHEN acl_type ilike 'ALLOW'
THEN TRUE
END))
or exists (select cn.id, cc.path
FROM connectby('menu_node', 'id', 'parent',
'position', '0', 0, ',')
cc(id integer, parent integer,
"level" integer, path text,
list_order integer)
JOIN menu_node cn USING(id)
WHERE cn.id IN
(select node_id FROM menu_acl
JOIN (select rolname FROM pg_roles
UNION
select 'public') pgr
ON pgr.rolname = role_name
WHERE pg_has_role(CASE WHEN
coalesce(pgr.rolname,
'public')
= 'public'
THEN current_user
ELSE pgr.rolname
END, 'USAGE')
GROUP BY node_id
HAVING bool_and(CASE WHEN acl_type
ilike 'DENY'
THEN false
WHEN acl_type
ilike 'ALLOW'
THEN TRUE
END))
and cc.path like c.path || ',%')
GROUP BY n.position, n.id, c.level, n.label, c.path, c.list_order
ORDER BY c.list_order
LOOP
RETURN NEXT item;
END LOOP;
END;
$$ language plpgsql;
This query performs adequately for users with lots of permissions but it doesn't do so well when a user has few permissions. Even on decent hardware the query may take several seconds to run if the user only has access to items within one menu section. In many cases this isn't acceptable performance but on PostgreSQL 8.3 we don't have a real alternative outside of trigger-maintained materialized views. For PostgreSQL 8.4, however, we can do this:
CREATE OR REPLACE FUNCTION menu_generate() RETURNS SETOF menu_item AS
$$
DECLARE
item menu_item;
arg menu_attribute%ROWTYPE;
BEGIN
FOR item IN
WITH RECURSIVE tree (path, id, parent, level, positions)
AS (select id::text as path, id, parent,
0 as level, position::text
FROM menu_node where parent is null
UNION
select path || ',' || n.id::text, n.id,
n.parent,
t.level + 1,
t.positions || ',' || n.position
FROM menu_node n
JOIN tree t ON t.id = n.parent)
SELECT n.position, n.id, c.level, n.label, c.path, n.parent,
to_args(array[ma.attribute, ma.value])
FROM tree c
JOIN menu_node n USING(id)
JOIN menu_attribute ma ON (n.id = ma.node_id)
WHERE n.id IN (select node_id
FROM menu_acl acl
LEFT JOIN pg_roles pr on pr.rolname = acl.role_name
WHERE CASE WHEN role_name
ilike 'public'
THEN true
WHEN rolname IS NULL
THEN FALSE
ELSE pg_has_role(rolname,
'USAGE')
END
GROUP BY node_id
HAVING bool_and(CASE WHEN acl_type ilike 'DENY'
THEN FALSE
WHEN acl_type ilike 'ALLOW'
THEN TRUE
END))
or exists (select cn.id, cc.path
FROM tree cc
JOIN menu_node cn USING(id)
WHERE cn.id IN
(select node_id
FROM menu_acl acl
LEFT JOIN pg_roles pr
on pr.rolname = acl.role_name
WHERE CASE WHEN rolname
ilike 'public'
THEN true
WHEN rolname IS NULL
THEN FALSE
ELSE pg_has_role(rolname,
'USAGE')
END
GROUP BY node_id
HAVING bool_and(CASE WHEN acl_type
ilike 'DENY'
THEN false
WHEN acl_type
ilike 'ALLOW'
THEN TRUE
END))
and cc.path::text
like c.path::text || ',%')
GROUP BY n.position, n.id, c.level, n.label, c.path, c.positions,
n.parent
ORDER BY string_to_array(c.positions, ',')::int[]
LOOP
RETURN NEXT item;
END LOOP;
END;
$$ language plpgsql;
This function is not significantly longer, it is quite a bit clearer, and it performs a lot better. For users with full permissions it runs twice as fast as the 1.3 version, but for users with few permissions, this runs about 70x faster. The big improvement comes from the fact that we are only instantiating the menu node tree once.
So not only do we have fewer external dependencies now but there are significant performance improvements.
A Complex Example
CTE's can be used to "materialize" a subquery for a given query. They can also be nested. The ordering can be important because this can give you a great deal of control as to what exactly you are joining.
I won't post the full source code for the new trial balance function here, but the main query, which contains a nested CTE, is worth walking through.
Here we are using the CTEs to achieve some stability across joins and avoid join projection issues, as well as to build a tree of projects, departments, and the like (business_unit is the table that stores these).
The relevant part of the stored procedure is:
RETURN QUERY
WITH ac (transdate, amount, chart_id) AS (
WITH RECURSIVE bu_tree (id, path) AS (
SELECT id, id::text AS path
FROM business_unit
WHERE parent_id = any(in_business_units)
OR (parent_id IS NULL
AND (in_business_units = '{}'
OR in_business_units IS NULL))
UNION
SELECT bu.id, bu_tree.path || ',' || bu.id
FROM business_unit bu
JOIN bu_tree ON bu_tree.id = bu.parent_id
)
SELECT ac.transdate, ac.amount, ac.chart_id
FROM acc_trans ac
JOIN (SELECT id, approved, department_id FROM ar UNION ALL
SELECT id, approved, department_id FROM ap UNION ALL
SELECT id, approved, department_id FROM gl) gl
ON ac.approved and gl.approved and ac.trans_id = gl.id
LEFT JOIN business_unit_ac buac ON ac.entry_id = buac.entry_id
LEFT JOIN bu_tree ON buac.bu_id = bu_tree.id
WHERE ac.transdate BETWEEN t_roll_forward + '1 day'::interval
AND t_end_date
AND ac.trans_id <> ALL(ignore_trans)
AND (in_department is null
or gl.department_id = in_department)
AND ((in_business_units = '{}' OR in_business_units IS NULL)
OR bu_tree.id IS NOT NULL)
)
SELECT a.id, a.accno, a.description, a.gifi_accno,
case when in_date_from is null then 0 else
CASE WHEN a.category IN ('A', 'E') THEN -1 ELSE 1 END
* (coalesce(cp.amount, 0)
+ sum(CASE WHEN ac.transdate <= coalesce(in_date_from,
t_roll_forward)
THEN ac.amount ELSE 0 END)) end,
sum(CASE WHEN ac.transdate BETWEEN coalesce(in_date_from,
t_roll_forward)
AND coalesce(in_date_to,
ac.transdate)
AND ac.amount < 0 THEN ac.amount * -1 ELSE 0 END) -
case when in_date_from is null then coalesce(cp.debits, 0) else 0 end,
sum(CASE WHEN ac.transdate BETWEEN coalesce(in_date_from,
t_roll_forward)
AND coalesce(in_date_to,
ac.transdate)
AND ac.amount > 0 THEN ac.amount ELSE 0 END) +
case when in_date_from is null then coalesce(cp.credits, 0) else 0 end,
CASE WHEN a.category IN ('A', 'E') THEN -1 ELSE 1 END
* (coalesce(cp.amount, 0) + sum(coalesce(ac.amount, 0)))
FROM account a
LEFT JOIN ac ON ac.chart_id = a.id
LEFT JOIN account_checkpoint cp ON cp.account_id = a.id
AND end_date = t_roll_forward
WHERE (in_accounts IS NULL OR in_accounts = '{}'
OR a.id = ANY(in_accounts))
AND (in_heading IS NULL OR in_heading = a.heading)
GROUP BY a.id, a.accno, a.description, a.category, a.gifi_accno,
cp.end_date, cp.account_id, cp.amount, cp.debits, cp.credits
ORDER BY a.accno;
This query does a lot, including rolling balances forward from checkpoints (which also effectively close the books), pulling only transactions which match specific departments or projects, and more.
Conclusion
We have found the CTE to be a tool whose importance and usefulness are hard to overstate. CTEs have given us maintainability and performance advantages, and they are very flexible. We will probably move towards using more of them in the future as appropriate.
Saturday, July 7, 2012
Progress towards LedgerSMB 1.4
I have been relatively submerged in work and haven't gotten back to blogging as much as I would have liked. On the whole this is a good thing.
A lot of things are changing in the LedgerSMB community for the better. For starters there has been a lot of work towards making the site more manageable. We expect this to be a lot better than it has been in the past.
Additionally my business is planning to address important as-yet-unmet needs that must be addressed if open source accounting/ERP software is to reach the mainstream. I will probably post more on this later.
Similarly, development had accelerated a great deal until the last couple of weeks, when I have had to focus on other things. In the next series of posts I expect to talk a bit about how we are utilizing CTEs and the like in LedgerSMB 1.4 and how new features of PostgreSQL are making our lives a lot easier. I will say that the more I work with CTEs the more I am convinced this is a feature which is highly underrated. I can't wait until we can start requiring PostgreSQL 9.1 so we can use them in write operations.
I will probably also talk a bit about Moose, why we are moving to it, our experiences, and the like. Some other discussions like changes to our architecture may be discussed as well.
However for now, I figure a progress report is due.
- A basic web services framework has been completed and work is ongoing to expose customers and vendors over the framework.
- Many CRM improvements have been added, including lead tracking and the ability to attach documents to customer/vendor records or to companies and persons. Many of these improvements affect employees too.
- A lot of code and schema clean-up in the CRM and employee management areas
- Projects and departments have been replaced by a more flexible system of reporting dimensions supporting funds accounting and more
- A new report framework has been written and many reports have already been moved to it.
I would say we are most of the way there.