Monday, February 24, 2014

Notes on Software Testing

Software testing seems to be one of those really hard things to get right.  I very often run into projects where the testing is inadequate or where, in an overzealous effort to ensure no bugs, test cases are too invasive and test things that shouldn't be tested.  This article attempts to summarize what I have learned from experience in the hope that it is of use to others.

The Two Types of Tests


Software testing has a couple of important functions, and these are different enough to form the basis of a taxonomy of test cases.  These functions include:
  1. Ensuring that a change to a library doesn't break something else
  2. Ensuring general usability by the intended audience
  3. Ensuring sane handling of errors
  4. Ensuring safe and secure operation under all circumstances
These functions themselves fall into roughly two groups.  The first group ensures that the software functions as designed, and the second ensures that where undefined behavior exists, it occurs in a sane and safe way.

The first type of test, then, covers those which ensure that the behavior conforms to the designed outlines of the contract with downstream developers or users.  This is what we may call "design-implementation testing."

The second type of tests are those which ensure that behavior outside the designed parameters is either appropriately documented or appropriately handled, and can be deployed and used in a safe and secure manner.  This, generally, reduces to error testing.

These two types of tests are different enough that they need to be written by different groups.  The design-implementation tests are really best written by the engineers designing the software, and the error tests need to be handled by someone somewhat removed from that process.

Why Software Engineers Should Write Test Cases


Design-implementation tests are a formalization of the interface specification.  As such, the people best prepared to write good software contract tests are those specifying the software contracts, namely the software engineers.

There are a couple of ways this can be done.  Engineers can write quick pseudocode intended to document interfaces and test cases to define the contracts, or can develop a quick prototype with test cases before handing off to developers, or the engineers and the developers can be closely integrated.  In any case the engineers are in the best position, knowledge-wise, to write test cases about whether the interface contracts are violated or not.

This works best when the prototype comes out of an initial short iteration cycle.  The full development, however, can run on a much larger cycle, so the approach is not limited to agile development environments.

Having the engineers write these sorts of test cases ensures that a few very basic principles are not violated:

  1. The tests do not test the internals of dependencies beyond necessity
  2. The tests focus on interface instead of implementation
These rules help avoid test cases that break needlessly when dependencies fix bugs.
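To make the distinction concrete, here is a minimal, hypothetical sketch of a design-implementation test in Perl.  The module and function names are made up for illustration; the point is that the test exercises only the documented contract (valid input returns a number, garbage input dies) and never peeks at the parser's internals:

use strict;
use warnings;
use Test::More tests => 3;

use MyApp::Numbers;   # hypothetical module under test

# Contract: parse_amount() accepts a formatted string and returns a number.
is(MyApp::Numbers::parse_amount('1,000.50'), 1000.50,
   'parses a grouped decimal string');
is(MyApp::Numbers::parse_amount('0'), 0, 'parses zero');

# Contract: invalid input raises an exception; how it does so is not tested.
eval { MyApp::Numbers::parse_amount('not a number') };
ok($@, 'dies on garbage input');

An error-handling or security test, by contrast, would be written by someone outside the design process and would probe inputs the contract never anticipated.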

Why You Still Need QA Folks Who Write Tests After the Fact


Interface and design-implementation tests are not enough.  They cover very basic things, and ensure that correct operation will continue.  However they don't generally cover error handling very well, nor do they cover security-critical questions very well.

For good error handling tests, you really need an outside set of eyes, not too deeply tied to the current design or coding.  It is easier for an outsider than for the developer or the engineer to spot the "user is an idiot" message that was left in as a placeholder.  Some of these problems can be reduced by cross-team review of changes as they come in.

A second problem is that to test security-sensitive failure modes, you really need someone who can think about how to break an interface, not just what it was designed to do.  The more invested one is, brain-cycle-wise, in implementing the software, the harder it often is to see this.

Conclusion


Software testing is something which is best woven into the development process relatively deeply, and it should happen both before and after main development.  Writing test cases is often harder than writing code, and this goes double for good test cases vs good code.

Now obviously there is a difference between testing SQL stored procedures and testing C code, and there may be cases where you can dispense to a small extent with some after-the-fact testing (particularly in declarative programming environments).  After all, you don't have to test what you can prove, but you cannot prove that an existing contract will be maintained into the future.

Thursday, January 23, 2014

PGObject on CPAN: NoSQL Ideas for PostgreSQL Applications

One of the legitimate points Martin Fowler has argued in favor of NoSQL databases is that expecting applications to directly manipulate relational data is far less clean from an application design perspective than having a database encapsulated behind a loosely coupled interface (like a web service).  I would actually go further and point out that such an approach invariably leads to bad database design too, because the information layout becomes the contracted software API: one either has to spend a lot of time and effort separating logical from physical storage layouts, or one ends up with an ossified physical layout that can never change.

This problem has been well understood in the relational database community for a long time.  The real problem has, however, been tooling.  There are effectively two traditional tools for addressing this issue:

1.  Updateable views.  These then form a relational API that allows the database to store information in a way separate from how the application sees it.  If you are using an ORM, this is a really valuable tool.

2.  Stored procedures.  These provide a procedural API, but traditionally a relatively brittle one based on the same approach used by libraries.  Namely, you typically have an ordered series of arguments, and all users of the API are expected to agree on the ordering and number of arguments.  While this may work passably for a single system (and even there lead to "dependency hell"), it poses significant issues in a large heterogeneous environment because the number of applications which must be coordinated in terms of updates becomes very high.  Oracle solves this using revision-based editions, so you can have side-by-side versioning of stored procedures, and applications can specify which edition they are working against.  This is similar to side-by-side versioning of C libraries typical for Linux, or side-by-side versioning of assemblies in .Net.

On the application side, ORMs have become popular, but they still lead to a relational API being contractual, so are really best used with updateable views.

In part because of these shortcomings, we started writing ways around them for LedgerSMB starting with 1.3.  The implementations are PostgreSQL-specific.  More recently I wrote some Perl modules, now on CPAN, to implement these concepts.  These create the general PGObject framework, which gives an application access to PostgreSQL stored procedures in a loosely coupled way.  It is hoped that other implementations of the same ideas will be written and that other applications will use this framework.

The basic premise is that a procedural interface that is discoverable allows for easier management of software contracts than one which is non-discoverable.  The discoverability criteria then become the software contract.
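To make that premise concrete, here is a minimal sketch (not PGObject's actual implementation) of what discoverability can mean in practice: the argument names of a stored procedure are read straight out of the PostgreSQL system catalogs and treated as the contract, rather than relying on a fixed argument order.  The connection string and function name are illustrative only:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=myapp', undef, undef,
                       { RaiseError => 1 });

# pg_proc.proargnames holds the declared argument names; they become the
# discoverable contract between the database and the application.
my ($argnames) = $dbh->selectrow_array(q{
    SELECT proargnames
      FROM pg_proc p
      JOIN pg_namespace n ON n.oid = p.pronamespace
     WHERE n.nspname = 'public' AND p.proname = 'customer__save'
});

# With DBD::Pg the array column typically arrives as a Perl arrayref.
print join(', ', @$argnames), "\n" if $argnames;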

PGObject allows what I call "API Paradigms" to be built around stored procedures.  An API paradigm is a consistent specification of how to write discoverable stored procedures and then re-use them in the application.  Most namespaces under PGObject represent such "paradigms."  The exceptions currently are the Type, Util, Test, and Debug second-tier namespaces.  Currently PGObject::Simple is the only available paradigm.

What follows is a general writeup of the currently usable PGObject::Simple approach and what each module does:

PGObject


PGObject is the bottom-half module.  It is designed to service multiple top-half paradigms (the Simple paradigm is described below; a CompositeType paradigm is also in the works but probably won't be ready initially).  PGObject has effectively one responsibility:  coordinate between application components and the database.  This is split into two sub-responsibilities:

  1. Locate and run stored procedures
  2. Encode/decode data for running in #1 above.

Specifically outside the responsibility of PGObject is anything to do with managing database connections, so every call to a database-facing routine (locating or running a stored procedure) requires a database handle to be passed to it.

The reason for this is that database handles should be managed by the application, not our CPAN modules, and this needs to be flexible enough to handle the possibility that more than one database connection may be needed by an application.  This is not a problem in practice because developers will probably not call these functions unless they are writing their own top-half paradigms (in which case the number of places in their code where they issue calls to these functions will be very limited).

A hook is available to retrieve only functions with a specified first argument type.  If more than one function is found that matches, an exception is thrown.

The Simple top-half paradigm (below) has a total of two such calls, and that's probably typical.

The encoding/decoding system is handled by a few simple rules.

On delivery to the database, any parameter that can('to_db') has that method called, and the return value is inserted in place of the parameter in the stored procedure call.  This allows one to have objects which specify how they serialize.  Bigfloats can serialize as numbers, Datetime subclasses can serialize as date or timestamp strings, and more complex types could serialize however is deemed appropriate (to JSON, a native type string form, a composite type string form, etc).

On retrieval from the database, the type of each column is checked against a type registry (sub-registries may be used for multiple application support, and can be specified at call time as well).  If the type is registered, the return value is passed to the $class->from_db method and the output returned in place of the original value.  This allows for any database type to be mapped back to a handler class.
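As a rough sketch of both halves of that round trip, a handler class only needs the two methods described above.  The class name and constructor here are hypothetical, and registering the class against the type registry is a separate step not shown:

package MyApp::Type::Money;   # hypothetical handler for a numeric column

sub new {
    my ($class, $amount) = @_;
    return bless { amount => $amount }, $class;
}

# Called before a value is sent to the database; whatever this returns is
# used as the stored procedure argument in place of the object.
sub to_db { return shift->{amount} }

# Called on retrieval when the column's type is registered to this class;
# the raw database value comes in, an object goes out.
sub from_db {
    my ($class, $value) = @_;
    return $class->new($value);
}

1;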

Currently PGObject::Type is a reserved namespace for dealing with released type handler classes.  We already have a type handler for DateTime and one for BigFloat, and are working on one for JSON database types.

PGObject::Simple


The second-level modules outside of a few reserved namespaces designate top-half paradigms for interacting with stored procedures.  Currently only Simple is supported.

This must be subclassed to be used by an application, and a method must be provided to retrieve or generate the appropriate database connection.  This allows application-specific wrappers which can interface with other db connection management logic.

All options for PGObject->call_procedure are supported, including running aggregates, order by, etc.  This means more options are available for things like gl reports database-side than the current LedgerSMB code allows.

$object->call_dbmethod takes an args argument, a hashref mapping argument names to values that override the object's own properties.  If I want to have a ->save_as_new method, I can add args => {id => undef} to ensure that undef will be used in place of $self->{id}.
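A minimal sketch of such a ->save_as_new method might look like this (the stored procedure name is made up for illustration):

sub save_as_new {
    my ($self) = @_;
    # args => {id => undef} overrides $self->{id} with NULL so the database
    # assigns a fresh id instead of updating the existing row.
    return $self->call_dbmethod(
        funcname => 'object__save',
        args     => { id => undef },
    );
}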

Both call_procedure (for enumerated arguments) and call_dbmethod (for named arguments) are supported from both the package and an object.  So you can MyClass->call_dbmethod(...) and $myobj->call_dbmethod.  Naturally if the procedure takes args, you will need to specify them or it will just submit nulls.

PGObject::Simple::Role


This is a Moo/Moose role handler for PGObject::Simple.

One of the main features it has is the ability to declaratively define db methods.  So instead of:

sub int {
    my ($self) = @_;
    return $self->call_dbmethod(funcname => 'foo_to_int');
}

You can just

dbmethod( int => (funcname => 'foo_to_int'));

We will probably move dbmethod off into another package so that it can be imported early and used elsewhere as well.  This would allow it to be called without the outermost parentheses.

The overall benefit of this framework is that it allows for discoverable interfaces and the ability to specify, on the database side, what an application needs to know.  This allows for many of the benefits of both relational and NoSQL databases at the same time, including development flexibility, discoverable interfaces, encapsulation, and more.

Saturday, November 23, 2013

Reporting in LedgerSMB 1.4: Part 5, Conclusions

I hope many of you have enjoyed this series.  We've tried hard to avoid inner platform syndrome here by making reporting something that a developer does.

Skills Required to Write Basic Reports


The developer skills required to write reports tend to fall on the database side.  In general one should have:

  1. A good, solid understanding of SQL and PL/PGSQL in a PostgreSQL environment.  This is the single most important skill and it is where most of the reporting effort lies.
  2. Basic understanding of Perl syntax.  Any basic tutorial will do.
  3. A basic understanding of Moose.  A general understanding of the documentation is sufficient, along with existing examples.
  4. A very basic understanding of the LedgerSMB reporting framework as discussed in this series.
These skills are all you need for general tabular reports, and they allow you to build basic reports that can be output in HTML, CSV, ODS, and PDF formats.

Skills Required to Write More Advanced Reports


For more advanced reports, such as new financial statements, government forms, and the like, the following skills are additionally required.  They are not fully discussed here:

  1. An in-depth understanding of our HTML elements abstraction system (this will be discussed in a future post here)
  2. A general proficiency with Template Toolkit, possibly including the LaTeX filter for running portions of the template through LaTeX.

Strengths


The reporting framework here is very database-centric.  In general you cannot have a non-database-centric reporting structure because the data resides in the database, and some knowledge there is required to get it out in a working form.  We have tried hard to make a system where only minimal knowledge elsewhere is required to do this.  If you have db folks who work with your finance folks, they can write the reports.

Weaknesses


Ad hoc reporting is outside the scope of this framework; it is unlikely to be particularly helpful for one-off reports.  Additionally, this generates reports as documents that can be shared.  Updating the data requires running the report again, and while this can be done as a sharable URL, it is not necessarily ideal for all circumstances.

Other Reporting Options


In cases where this reporting framework is not ideal, there are a number of other options available:

  1. Views can be made which can be pulled in via ODBC into spreadsheets like Excel.
  2. Third party report engines like JasperReports can be used instead, and
  3.  One-off SQL queries in psql can be used to generate HTML and (in the most recent versions) LaTeX documents that can be shared.

Wednesday, November 20, 2013

Writing Reports in LedgerSMB 1.4: (Mostly-) Declarative Perl Modules

So far we have talked about getting data in and interacting with the database.  Now we will talk about the largest of the modules and cover workflow scripts in relation to this stage.

At this point you would have a filter screen, a user-defined function which takes the arguments from that screen's inputs (usually prefixed with 'in_' to avoid column name conflicts), and a tabular data structure you expect to return.

As a note here, all the code I am trashing here is my own, in part because I have learned a lot about how to code with Moose over the course of 1.4 development.

In your workflow script you are likely to need to add the following:

use LedgerSMB::Report::MyNewReport;

and

sub run_my_new_report {
    my ($request) = @_;
    LedgerSMB::Report::MyNewReport->new(%$request)->render($request);
}

That's all you need in the workflow script.

Overall Structure and Preamble


The actual Perl module basically defines a number of parameters for the report, and LedgerSMB::Report provides a general framework to cut down on the amount of code (and knowledge of Perl) required to write a report.  Minimally, however, we must define the inputs (if any), the output structure, and how to create that output structure.  We can also define buttons for further actions on the workflow script.  The same workflow script would have to handle the button's actions.

Typically a report will start with something like this (of course MyNewReport is the name of the report here):

package LedgerSMB::Report::MyNewReport;
use Moose;
extends 'LedgerSMB::Report';
with 'LedgerSMB::Report::Dates'; # if date inputs used, use standard defs

This preamble sets up the basic reporting framework generally along with all the features discussed below.  If you need to handle numeric input or secondary dates you will want to change:

with 'LedgerSMB::Report::Dates';

to
 
with 'LedgerSMB::Report::Dates', 'LedgerSMB::MooseTypes';

so that you can use type coercions for numeric and/or date fields (for processing localized formats and the like).

Defining Inputs


Inputs are defined as report properties.  Usually you want these properties to be read-only because you want them to correspond to the report actually run.  You can use the full Moose capabilities in restricting inputs; typically you will want to restrict them by type and possibly coerce as well (at least when using the types defined in LedgerSMB::MooseTypes).

When including the following line you do not have to define the date_from and date_to inputs:

with 'LedgerSMB::Report::Dates';

Typically our conventions are to document inputs inline with POD.  While this is (obviously) not necessary for the functioning of the report, it is helpful for future maintenance and highly recommended.  It is also worth noting in the POD how a match is made (this should be in SQL also if applicable, in a COMMENT ON statement for easy checking of common assumptions regarding API contracts).

For example, from the GL report:

=item amount_from

The lowest value that can match, amount-wise.

=item amount_to

The highest value that can match, amount-wise.

=cut

has 'amount_from' => (is => 'rw', coerce => 1,
                     isa => 'LedgerSMB::Moose::Number');
has 'amount_to' => (is => 'rw', coerce => 1,
                   isa => 'LedgerSMB::Moose::Number');


Those lines demonstrate the full power of Moose in the definition.  One obvious thing that will be fixed in beta is making these read-only (is => 'ro') while they are currently read-write.  There is no reason for these to be read-write.

From the LedgerSMB::Report::PNL you see the following optional string input defined:

=item partnumber

This is the control code of the labor/overhead, service, or part consumed.

=cut

has partnumber => (is => 'ro', isa => 'Str', required => 0);


This would probably be improved by mentioning that the partnumber is an exact match in the POD, but it shows how to designate a read-only, optional string input.

If an input is not listed, it won't be passed on to the stored procedure.  It is critical that all inputs are defined whether using standard modular definitions (LedgerSMB::Report::Dates) or explicit ones.  If an input is being ignored this is one of the first places to check.  Additionally note that because of other aspects of the reporting, it is not currently possible to use strict or slurpy constructors in any sane way.  It is likely we will build our own constructor handling in the future, but currently this is a hard limitation.

Input Definition Antipatterns


There are a few things which someone who has not worked with Moose before is likely to do in this area.  While many of these are relatively harmless in the web interface because of a number of failsafes, if you ever want to re-use the code in a more stateful environment you will have difficulties.  The examples given are, alas, my own code, but I have the benefit of being a newcomer to Moose here and so the lessons are fresh in my mind, or rather codebase.

The first is the use of read-write inputs.  A report output is closely bound to its inputs, so read-write inputs allow the application to misrepresent the report.  The example I gave above is:

has 'amount_to' => (is => 'rw', coerce => 1,
                   isa => 'LedgerSMB::Moose::Number');


Now this allows the application to do something like this:

my $report = LedgerSMB::Report::GL->new(%request);
$report->run_report();
$report->amount_to('10000000');
$report->render; 

The above would misrepresent the report as including a bunch of transactions that may, in fact, be excluded.  This is no good.  On the other hand, if amount_to were read-only (is => 'ro'), then the above code would throw an error instead.

The second major anti-pattern is in the use of Maybe[] as an alternative to required => 0.  For example see the following:

has 'approved' => (is => 'rw', isa => 'Maybe[Bool]');

Oh the joys of looking at code I wrote that is in need of rewrite....  Not only do we have a read-write input, but it is maybe boolean (i.e. true, false, or undef).

Now, this appears to work because undef is passed as NULL to the database, and the same is achieved by the more proper:

has approved => (is => 'ro', isa => 'Bool', required => 0);

The difference is that we will not accept as input a case where $request->{approved} = undef has been set.  Our query handlers drop empty inputs so there is no case where this should happen.  Additionally, this prevents unsetting the attribute after running the report and thus decoupling output from purported input.

Defining Report Structure


Report structure is defined using a series of functions which are overridden by actual reports.  Some of these functions are optional and some are not.  The required ones are covered first.

There are three required functions, namely columns, header_lines, and name.  These are expected to return very specific data structures, but function in a largely declarative way.  In other words, the functional interface effectively defines them as pseudo-constant (they are not fully constant because they are expected to return the localized names).

In all cases, LedgerSMB::Report::text() can be used to translate a string into its local equivalent (assuming localized strings in the .po files).

The columns function returns an arrayref of hashrefs, each of which is a column definition for our "dynatable" templates.  The following are required:

  • col_id --- the name of the row field to use
  • type --- the display type of the field (text, href, checkbox, hidden, etc)
  • name --- localized header for the column
The following are conditionally required or optional:
  •  href_base --- the base of the href. To this is appended the row_id (see below).  Only used by href columns, and then required.
  • pwidth --- Used for width factors for PDF reports.
Here's an example of a columns function for a very simple report (which just lists all SIC codes in the system):

sub columns {
    return [
      { col_id => 'code',
          type => 'href',
     href_base => 'am.pl?action=edit_sic&code=',
          name => LedgerSMB::Report::text('Code'), },

      { col_id => 'description',
          type => 'text',
          name => LedgerSMB::Report::text('Description'), }
    ];
}


In most reports, the columns function is much longer.

The header_lines function provides an arrayref of hashrefs for displaying inputs on the report.  To this, the reporting engine adds the name of the report and the database connected to.  If you want no header lines added, you can just return an empty arrayref:

sub header_lines { return []; }

In many cases however, such inputs should be displayed.  Each hashref has two components:

  • name is the name of the input
  • text is the text label of the input.
Here's a more useful example from LedgerSMB::Report::GL:

 sub header_lines {
    return [{name => 'from_date',
             text => LedgerSMB::Report::text('Start Date')},
            {name => 'to_date',
             text => LedgerSMB::Report::text('End Date')},
            {name => 'accno',
             text => LedgerSMB::Report::text('Account Number')},
            {name => 'reference',
             text => LedgerSMB::Report::text('Reference')},
            {name => 'source',
             text => LedgerSMB::Report::text('Source')}];
}


Finally name() returns the localized name of the report.  This is usually a very simple function:

sub name {
    return LedgerSMB::Report::text('General Ledger Report');
}


Additionally there are two optional functions, buttons and template, which allow additional flexibility.  These are rarely used.

The template function overrides our dynatable-based template as the template to use.  This is used mostly in financial statements but is not used in the trial balance, or other fully tabular reports.

If you use it, just return the path to the template to use.

sub template { return 'Reports/PNL' }

Our PNL reporting module has additional features which are beyond the scope of this post.

Finally buttons returns a list of buttons to be included on the report.  These follow the element_data format of UI/lib/elements.html and are used to add HTML form callbacks to the report.  Here's an example:

sub buttons {
    return  [{
         text => LedgerSMB::Report::text('Add New Tax Form'),
         name => 'action',
         type => 'submit',
         class => 'submit'
    }];
}


How Columns are Selected


The columns to display are dynamically selected according to the following rules:

  • If no column selection criteria are found, then all columns are shown
  • If the render() method is called with a hashref as arguments that includes a member with the name of the column ID prefixed with 'col_', then that column is shown and those which are not so selected are not.
What this means is that typically you will define inputs for selection of columns in the $request object before passing it through if you want a baseline of columns which always show, as sketched below.  Otherwise you will usually allow selection of columns in the filter screen using inputs named as above (i.e. an 'id' field would have a selector named 'col_id').
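For example, reusing the two-column SIC listing above, a workflow script that always wants both columns shown could set the selectors before rendering (a sketch; the report class name is the placeholder used earlier in this post):

# Force both columns on regardless of what the filter screen selected.
$request->{col_code}        = 1;
$request->{col_description} = 1;
LedgerSMB::Report::MyNewReport->new(%$request)->render($request);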

The run_report() Function


The run_report function populates the rows of the report.  It should exist and set $self->rows(\@rows) (an arrayref of row hashrefs) at the end.  This is the only portion where specific knowledge of programming in Perl is particularly helpful.  However, assuming nothing more than a little knowledge, here is a basic template from the SIC listing:

sub run_report{
    my ($self) = @_;
    my @rows = $self->exec_method(funcname => 'sic__list');
    for my $row(@rows){
        $row->{row_id} = $row->{code};
    }
    $self->rows(\@rows);
}


Going through this line by line:

my ($self) = @_;

The first line of the function body is boilerplate here.  It is possible to accept the $request object here as a second parameter but not typical unless you have very specific needs for it.  In that case, you simply:

my ($self, $request) = @_;

Remember that in Perl, @_ is the argument list.

my @rows = $self->exec_method(funcname => 'sic__list');

This line says "Take run the database function named 'sic__list' and map inputs of the report to function arguments.  You would typically just copy this line and change the function name.

for my $row(@rows){
    $row->{row_id} = $row->{code};
}

If you have any hyperlinks in the report, it is necessary to set a row_id so that this can be properly handled.  In any case the row_id is appended to the link from href_base in the column definition.  It is possible to override this on a per-column basis but that's beyond the scope of this introduction.

 $self->rows(\@rows);

This assigns the rows() of the report to the rows returned.  Currently this is handled as a read/write property of reports, but long-run this will probably be changed so that programs cannot override this after running the report.

Ending the File


Always end report classes with the following line

__PACKAGE__->meta->make_immutable;

This improves performance and ensures that no more attributes can be dynamically added to your report.  There are cases where that may be less than desirable, but those fall outside the reporting use case.

Others in series:

  1. Introduction
  2. Filter Screens
  3. Best Practices regarding Stored Procedures
  4. (this piece)
  5. Conclusions

Tuesday, November 12, 2013

On CPAN, Community, and P: A Case Study in What Not to Do

I am going to try to do this piece as respectfully as I can.  I understand people put a lot of work into developing things and they submit them, and when they get panned, it can be difficult.  At the same time, community resources are community resources and so a failure to conduct such case studies in things gone amiss can lead to all kinds of bad things.  Failure to get honest feedback can lead to people not improving, but worse, it can leave beginners sometimes mistakenly believing that bad practices are best practices.  There is also a period of time after which bad practices become less easily excused. 

So somewhat reluctantly I am going to undertake such a study here.  This is solely about community interfacing.  I am not going to critique code.  Rather, I hope this post can be a good one for understanding some of the problems with community interfaces generally, whether CPAN, PGXN, or others.  The lessons apply regardless of language or environment, and the critiques I offer are at a very different level than critiques of code.

So with this, I critique the P CPAN module from a community coding perspective.  This module exports a single function called "P" which acts kind of like printf and sprintf.  It would be an interesting exercise in learning some deep aspects of Perl but from a community resource perspective, it suffers from enough issues to make it a worthy case study.

The gist of this is that community resources require contemplating how something fits into the community and working with the community in mind.  A cool idea or something one finds useful is not always a candidate for publishing as a community resource, at least not without modifications aimed at carefully thinking through how it fits into more general processes.

Four of my own CPAN modules are actually rewrites of code that I wrote in other contexts (particularly for LedgerSMB), and rewrote specifically for publication on CPAN.  In general there is a huge gulf between writing a module for one project or one developer and writing it for everyone.  I believe, looking at P, that it is just something the developer thought was useful personally and published it as is without thinking through any of these issues.  This is all too common and so going through these I hope will prevent too many from making the same mistakes.

Namespace Issues


The name 'P' is an extraordinarily bad choice of name for a public module.  Perl uses nested namespaces, and nesting implies a clear relationship, such as inheritance (though other relationships are possible too).

Taking a top-level namespace is generally discouraged on CPAN where a second or third level namespace will suffice.  There are times and places for top-level namespaces, for example for large projects like Moose, Moo, and the like.  In general these are brand names for conglomerates of modules, or they are functional categories.  They are not shorthand ways of referring to functionality to save typing.

'P' as a name is not helpful generally, and moreover it denies any future large project that namespace.  The project is not ambitious enough to warrant a top-level namespace.  There is no real room for sub-modules and so there are real problems with having this as a top-level module.

Proper Namespacing


It's helpful, I think, to look at three different cases for how to address namespacing.  All three of these are ones I maintain or am planning to write.  I believe they follow generally accepted practices, although I have received some criticism for PGObject being a top-level namespace as well.

  • PGObject is a top-level namespace housing, currently, three other modules (perhaps ten by next year).  I chose to make it a top-level namespace because it is a framework for making object frameworks, and not a simple module.  While the top-level module is a thin "glue" module, it offers services which go in a number of different directions, defying simple categorization.

    Additionally the top-level module is largely a framework for building object frameworks, which complicates the categorization further.  In this regard it is more like Moose than like Class::Struct.  Sub-modules include PGObject::Simple (a simple framework using PGObject, not a simple version of PGObject), PGObject::Simple::Role, and PGObject::Type::BigFloat.
  • Mail::RoundTrip is a module which allows web-based applications to request email verification by users.  The module offers only a few pieces of functionality and is not really built for extensibility.  This should not be a top-level module.
  • Device::POS::Printer is a module I have begun to write for point of sale printers, providing a basic interface for printing, controlling cash drawers, getting error messages, etc.  The module is likely to eventually have a large  number of sub-modules, drivers for various printers etc, but putting Device:: in front does no real harm and adds clarity.  There's no reason to make it a top-level namespace.

The main point is thinking about how your module will fit into the community, how it will be found, etc.  'P' is a name which suggests these things have not been considered.

Exports

The P module exports a single function, P(), which functions like printf and sprintf.  The major reason for this, according to the author, is both to add additional checking and to save typing.  Saving typing is not a worthy goal by itself, though neither is verbosity.  Condensing a function which takes over the roles of two different functions into a single letter, however, is not a recipe for good, readable code.  If others follow suit, you could get code like this:

P(A(R("This is a string", 3)));

Now maybe this code is supposed to print the ASCII representation of "This is a string" repeated three times.  However that is not obvious from the code, leading to code that is hard to read or debug.

Proper Exports 


In Perl, exports affect the language.  Exports are thus to be used sparingly, as they can lead to conflicts and in turn to hard-to-maintain code.  Exports should be rare, well documented, and not terribly subject to name collision.  They should also be verbose enough that they can be understood without tremendous prior knowledge of the module.  P() as an exported function meets none of these criteria.

A good example of exports done right would be a function like has() used by Moose, Mouse, and Moo.  The function is exported and used to declaratively define object properties.  The convention has become widespread because it is obvious what it does.  Again this does not matter so much for personal projects, but it does for published modules on a community repository.
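For comparison, here is how that export reads at a call site.  Even without knowing Moose, the intent of the declaration is reasonably clear, which is what makes it a defensible export:

package MyApp::Point;
use Moose;

# 'has' is exported into this package by Moose, but its name and its
# declarative usage make the meaning obvious at the point of use.
has x => (is => 'ro', isa => 'Num', required => 1);
has y => (is => 'ro', isa => 'Num', required => 1);

__PACKAGE__->meta->make_immutable;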

Test Failure Report Management


The CPANTesters page for P shows that every version on CPAN has had test failures.  This is unusual.  Most modules have clear passes most of the time.  Even complex modules like DBD::Pg show a general attention to test failures and a relatively good record.  A lack of this attention shows a lack of interest in community use; it suggests that fixes to test failures, which people need in order to use the library, are just not important.  So if you manage a module, you really want to take every test failure seriously.


Conclusions


Resources like CPAN, CTAN, PGXN, and the like are subject to one important rule: just because something is good for your own personal use does not make it appropriate for publication as a community resource.  Writing something that fits the needs of a specific project, or a specific coder's style, is very different from writing something that helps a wide range of programmers solve a wide range of problems.  These community resources are not places to upload things just because one wrote them.  They are places to interface with the community through work.  Getting pre-review, post-review, and paying attention to the needs of others is critically important.

Monday, November 11, 2013

Writing Reports in LedgerSMB 1.4, part 3, Stored Procedure Conventions and Best Practices

The stored procedures/UDFs form the heart of the LedgerSMB reports. Everything else basically is there to handle input to the stored procedure and output from it.  It's for this reason that the stored procedures are covered before the Perl modules.

The stored procedures follow the conventions and design principles laid out in the very first post on this blog.  Arguments are prefixed with 'in_' and given consistent names from report criteria.  Anyone wanting to write reports should start with that post in terms of understanding where to start.

Additionally there are a number of standardized conventions for common elements.  The most important of these currently is that many reports require a date range, and the arguments here are standardized as in_date_from and in_date_to (corresponding to date_from and date_to on the application side).  The conventions are set up on the Perl side, so these will be covered in the next section.

In general, there are a number of do's and don'ts associated with these procedures.

1.  Do provide a default ordering that is useful.

2.  Don't provide windowing functions that depend on an ordering that could be overridden.  Running totals should not be implemented here.  As a good general rule, if you use ROWS UNBOUNDED PRECEDING in one of these functions, that's a red flag.

3.  Do try to have one central query with, possibly, some small support logic.

4.   Do define your types so several different reports can share the same output type.

5.  Do write test scripts to test your reporting functions.  Use transactions that roll back (a minimal sketch of such a test appears after the example function below).

6.  Do use plpgsql instead of sql if you have more than a few arguments.  Named arguments are easier to read.  Once we can drop support for 9.2 and lower this will cease to be an issue though.

An example:

CREATE OR REPLACE FUNCTION report__gl
(in_reference text, in_accno text, in_category char(1),
in_source text, in_memo text,  in_description text, in_from_date date,
in_to_date date, in_approved bool, in_from_amount numeric, in_to_amount numeric,
in_business_units int[])
RETURNS SETOF gl_report_item AS
$$
DECLARE
         retval gl_report_item;
         t_balance numeric;
         t_chart_id int;
BEGIN

IF in_from_date IS NULL THEN
   t_balance := 0;
ELSIF in_accno IS NOT NULL THEN
   SELECT id INTO t_chart_id FROM account WHERE accno  = in_accno;
   t_balance := account__obtain_balance(in_from_date ,
                                       (select id from account
                                         where accno = in_accno));
ELSE
   t_balance := null;
END IF;


FOR retval IN
       WITH RECURSIVE bu_tree (id, path) AS (
            SELECT id, id::text AS path
              FROM business_unit
             WHERE parent_id is null
            UNION
            SELECT bu.id, bu_tree.path || ',' || bu.id
              FROM business_unit bu
              JOIN bu_tree ON bu_tree.id = bu.parent_id
            )
       SELECT g.id, g.type, g.invoice, g.reference, g.description, ac.transdate,
              ac.source, ac.amount, c.accno, c.gifi_accno,
              g.till, ac.cleared, ac.memo, c.description AS accname,
              ac.chart_id, ac.entry_id,
              sum(ac.amount) over (rows unbounded preceding) + t_balance
                as running_balance,
              compound_array(ARRAY[ARRAY[bac.class_id, bac.bu_id]])
         FROM (select id, 'gl' as type, false as invoice, reference,
                      description, approved,
                      null::text as till
                 FROM gl
               UNION
               SELECT ar.id, 'ar', invoice, invnumber, e.name, approved, till
                 FROM ar
                 JOIN entity_credit_account eca ON ar.entity_credit_account
                      = eca.id
                 JOIN entity e ON e.id = eca.entity_id
               UNION
               SELECT ap.id, 'ap', invoice, invnumber, e.name, approved,
                      null as till
                 FROM ap
                 JOIN entity_credit_account eca ON ap.entity_credit_account
                      = eca.id
                 JOIN entity e ON e.id = eca.entity_id) g
         JOIN acc_trans ac ON ac.trans_id = g.id
         JOIN account c ON ac.chart_id = c.id
    LEFT JOIN business_unit_ac bac ON ac.entry_id = bac.entry_id
    LEFT JOIN bu_tree ON bac.bu_id = bu_tree.id
        WHERE (g.reference ilike in_reference || '%' or in_reference is null)
              AND (c.accno = in_accno OR in_accno IS NULL)
              AND (ac.source ilike '%' || in_source || '%'
                   OR in_source is null)
              AND (ac.memo ilike '%' || in_memo || '%' OR in_memo is null)
             AND (in_description IS NULL OR
                  g.description
                  @@
                  plainto_tsquery(get_default_lang()::regconfig, in_description))
              AND (transdate BETWEEN in_from_date AND in_to_date
                   OR (transdate >= in_from_date AND  in_to_date IS NULL)
                   OR (transdate <= in_to_date AND in_from_date IS NULL)
                   OR (in_to_date IS NULL AND in_from_date IS NULL))
              AND (in_approved is false OR (g.approved AND ac.approved))
              AND (in_from_amount IS NULL OR ac.amount >= in_from_amount)
              AND (in_to_amount IS NULL OR ac.amount <= in_to_amount)
              AND (in_category = c.category OR in_category IS NULL)
     GROUP BY g.id, g.type, g.invoice, g.reference, g.description, ac.transdate,
              ac.source, ac.amount, c.accno, c.gifi_accno,
              g.till, ac.cleared, ac.memo, c.description,
              ac.chart_id, ac.entry_id, ac.trans_id
       HAVING in_business_units is null or in_business_units
                <@ compound_array(string_to_array(bu_tree.path, ',')::int[])
     ORDER BY ac.transdate, ac.trans_id, c.accno
LOOP
   RETURN NEXT retval;
END LOOP;
END;
$$ language plpgsql;


The above is the function that provides the GL report and search.  It is a monster query, and we could use RETURN QUERY, but for older versions this approach seems more reliable.  Ideally the inline view would be moved to a formal one, along with a few other tweaks, but it works well enough, and despite the length and the inline view it is not hard to debug.
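As promised in point 5 above, here is a minimal sketch of a rollback-style test for this function.  It assumes a DBI connection to a test database; the database name is illustrative, fixture setup is left as a comment, and real LedgerSMB test scripts are more involved, but the shape is the same: open a transaction, add fixtures, run the function, check the results, and roll everything back.

use strict;
use warnings;
use DBI;
use Test::More tests => 1;

my $dbh = DBI->connect('dbi:Pg:dbname=lsmb_test', undef, undef,
                       { AutoCommit => 0, RaiseError => 1 });

# Insert whatever fixture rows the report should find here, inside the
# open transaction (e.g. a gl transaction with a known reference).

my $rows = $dbh->selectall_arrayref(
    q{SELECT * FROM report__gl(null, null, null, null, null, null,
                               null, null, true, null, null, null)},
    { Slice => {} });

ok(defined $rows, 'report__gl executes and returns a result set');

$dbh->rollback;    # nothing persists; the test database is left untouched
$dbh->disconnect;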

Others in series:

Part 1:  Overview

Part 2:  Filter Screens

Part 4:  Perl Modules

Part 5:  Conclusions

Friday, November 8, 2013

Interlude: A cool PL/PGSQL trick

I will be getting back to the report writing in LedgerSMB next week or the week after.  I have an open source, PostgreSQL-based point of sale project I am hoping to bring to a minimally useful version in the mean time.

In the mean time, here's a useful trick I discovered while trying to solve a problem.  The basic problem is one of Object-Relational design, in the sense that we may want to use types as classes and therefore return a tuple which is guaranteed to be of the type defined by one table or view.

However, we may want to override some of the values in this tuple with values from another table.

In older versions of PostgreSQL, I seem to recall it was possible to add additional columns onto the end of a result which would be silently ignored on return.  This doesn't work in current versions and isn't very safe anyway.  A better way is to work with composite tuples, where each member is a tuple you need for processing.

So for example, suppose we have a table which contains pricing information, and a second table which contains temporary pricing information for a given customer, and these are different table structures.  Suppose, for sake of argument we have tables like this:

CREATE TABLE goods (
   id serial not null unique,
   sku text primary key,
   sell numeric not null,  -- sell price
   buy numeric not null  -- purchase price
);

CREATE TABLE pricematrix (
   customer_id int not null references customer(id),
   goods_id int not null references goods(id),
   valid_from date,
   valid_to date,
   sell numeric not null
);

CREATE VIEW pricematrix_today AS
select * from pricematrix
 where (valid_from is null or valid_from <= 'today')
   and (valid_to is null or valid_to >= 'today');

Now, what I want to do is pull records from the goods table, but override the sell column with the one from the pricematrix if it exists, choosing the lowest available price.

A simple approach (this code is not tested unlike my actual production code based on the same principle):

CREATE OR REPLACE FUNCTION get_goods_by_sku(in_sku text, in_customer_id int)
RETURNS goods LANGUAGE PLPGSQL AS
$$
DECLARE t_rec record;
       retval goods;
      pmatrix record;
BEGIN
  
           -- rank() picks the lowest-priced matching pricematrix row per
           -- goods row; PARTITION BY precedes ORDER BY in the window clause.
           SELECT goods, pm INTO t_rec
             FROM goods
        LEFT JOIN (SELECT *,
                          rank() OVER
                           (PARTITION BY goods_id
                            ORDER BY sell) as ranking
                     FROM pricematrix_today
                    WHERE customer_id = in_customer_id
                  ) pm ON (goods.id = pm.goods_id and ranking = 1)
            WHERE goods.sku = in_sku;
  
       pmatrix := t_rec.pm;
       retval  := t_rec.goods;
       IF pmatrix.sell < retval.sell THEN
          retval.sell = pmatrix.sell;
       END IF;
       RETURN retval;
END;
$$;

By keeping our types isolated we can safely manipulate them inside the function before returning them and still be guaranteed a named type tuple on output.

Now obviously this is a contrived example, and in practice we will more often be selecting from views than from tables, but the technique is the same either way, since we can define "type methods" on views as well.