Thursday, March 6, 2014

In Defence of Microsoft

Ok, so the next post was going to be on why PostgreSQL is really great.  But this is going to be a different and unusual post for this blog.

Most of my friends know that I do not revel in trying new Microsoft products and in fact avoid most of them when I can.  I don't find them pleasant to work with for the things I like to do and I tend to work with them only grudgingly.  True, I have written and published papers (even through Microsoft!) on Linux-Windows interoperability.  True, I used to work for Microsoft.  But to me their products are relatively irrelevant.  Additionally, I tend to shy away from politics on this blog because technical issues and politics require different sorts of discourse.

Yet last week a controversy erupted that shows the limits of these self-imposed restrictions.  Microsoft released an ad that was widely attacked for suggesting that Microsoft products are good for planning weddings.  As much as it pains me, I find myself having to defend the software giant here.

Sometimes technology is political, and sometimes it is important to stand up for a side of the issue other than the one that makes the press.

This is a contribution to what I hope will be a general industry-wide (and better-yet society-wide) conversation on gender and the economy, particularly in the technology fields.  All of what you find here is my own opinion and although I am male I grew up around the issue as you will see.  I don't expect that I speak for many.  The views here are contrarian and may be controversial (probably should be, since controversy may inspire thought).

The Ad


 I decided not to directly link to the attacks on the ad first, figuring that those who have not seen it may want to see it with fresh eyes rather than after reading the attacks.  The ad is currently found on Youtube and so I linked there.  I searched on Microsoft's site and couldn't find it.  I don't know if they removed it due to the controversy.

The ad portrays a young woman who talks about why she picked a Windows all-in-one tablet over a Mac, and the reasons had to do with the way in which the tablet made her social life easier while planning her wedding.  These are perfectly good and valid reasons for choosing one product over another.  In no way is anyone portrayed as trivializing the use of technology or as incapable.  It's a decent ad targeted at a certain subset of the population.  I don't see anything wrong with it per se.

I am not saying that women are well enough represented in the ad campaign.  If that was the argument I might say 'sure, there's a valid point there' but that isn't the argument.   The only thing to see here is that Microsoft, in putting together a perfectly legitimate advertisement, managed to offend a bunch of people.

The Response


 The two primary articles attacking this ad were published in Slate and Think Progress.  The arguments basically have three components and are more similar than they are different.

Both responses accuse Microsoft of trivializing how women use technology for wedding planning and for things related to pregnancy.  They accuse Microsoft of playing into stereotypes, and of thus sending a message that women are less capable than men in the use of technology.

 A Personal Detour


Ordinarily I'd just figure this is an issue for women to sort out and leave it at that.  Unfortunately, that is a little harder to do working in open source software development.  The concern is basically that there aren't enough women in tech, and I work in a corner of tech where women are nearly absent.  While on average in the software development industry, men outnumber women 2:1, in open source software, the ratio is closer to 50:1.

Personally I wonder why we are so obsessed with the fact that women are making certain career and lifestyle choices differently than men.  While the reason isn't hard to find (see below), I do think that such obsession ultimately robs women in particular of agency, that idea that they and they alone are best prepared to make the decisions that involve navigation of one's life.  Why do we think there aren't enough women in a given industry?  What would be enough?  Why?  To some extent I think this obsession delegitimizes both the decisions of those women who choose to go into software and those who don't, and I have heard more than one woman complain about being asked about this issue frequently.

The obsession doesn't help anybody.  Women may be as capable of programming computers as men are (and given the dominance of women in the early years of programming it seems hard to argue the contrary with any force of history behind such an argument).

Personally too, the first person who ever showed me a computer for work was my grandmother.  It was a large machine, probably as big as my dining room table, with a small CRT screen and a punchcard reader.  What did my grandmother use it for?  She wrote nuclear physics simulations.  Was she typical of her generation?  No.  She worked with Nobel-Prize-winning physicists and was quite renowned in her field.  Such people are never typical of any group.  She was also the first person I knew who complained about efforts to bring more women into STEM fields because she found it undermined her credibility.

But is everyone able to program a computer?  No.  Nor, perhaps beyond a very basic level, is that a skill everyone needs to know.

What's this Argument About, Anyway?


I don't think the argument is only about women in software development.  Running through both criticisms of the ad is an effort to trivialize getting married and having kids.

This seems like a really weird thing to trivialize since most people get far more happiness out of family contact than they do out of slaving away in a cubicle, working so that someone else gets to make some extra profit, and yet there are certain segments of feminism which repeatedly seek to trivialize these things.  (For those who jump to offence, please calm down and note that feminism is not an organized movement and in fact has tremendous diversity in view on this subject.)

And yet the reasons why there is an effort to trivialize these things are not hard to find.  The US economy bears two fundamental characteristics that shape this debate in very deep ways.  The first is that the economy is based on the notion that women and men are not merely equal but interchangeable in all ways, and hence interchangeability becomes synonymous with equality.  The second is that the US economy is employer-centric, and thus the employer's needs are what are most important, not the needs of the family.  For these reasons, getting married and (especially) having kids has negative career consequences.  These consequences are worse the younger one is, and since women cannot delay having children as long as men can, they ultimately suffer disproportionate costs of gender-neutral policies.

From this viewpoint it is easy to conclude that if only women didn't get married and have kids, inequality would be a thing of the past, but this isn't really a solution.  Rather it is a case where an apparent solution on an individual level papers over and conceals a larger problem.

Another aspect of the problem is the extent to which our society has a rather distorted view of the tech industry.  Technology firm founders are idolized well beyond the proportions of their contributions.  Working in technology is a glamorous job.  But it is also portrayed popularly as the industry of lone geniuses, and startup cultures have personal demands that are truly taxing (Marissa Mayer once bragged about 130 hour work weeks that she used to put in at Google), all for uncertain gains in what amounts to an institutionalized form of gambling with your time.

What?  Women don't want to be founders of tech startups as things stand right now?  Seems like they have more sense than men....

A Few Reasons the Attacks on Microsoft Here Are Wrong


The attacks on Microsoft for this ad are wrong for more reasons than I can count.  There is, after all, nothing wrong with the ad.  It portrays a perfectly legitimate reason that someone might choose one product over another, namely that it makes an important aspect of one's life a bit easier, and the ultimate judge of what is important really should be left to the individual, the family, and the local community.  Here are, however, my top few reasons why I find the attacks on the ad misplaced.

  1. Not everyone is or should be a "techie."  People have different views on software and different priorities in life, and that is ok.  There's nothing wrong with deciding not to get married in the US (there would be in much of the rest of the world, where you are expected to retire with your kids but that is a different story).  But conversely there is nothing wrong with treating your wedding as important.
  2. Technology exists to solve human problems, not the other way around.  The argument in the response carries with it a strong subtext that women should be solving technical problems, not using technology to solve human problems, but this misunderstands the proper place of technology in life.  This is, truth be told, a very common mistake and it is the primary cause of software project failures I have seen.  It is also a major part of the idolization of the tech industry and the perpetual promise to totally change our lives (which never seems quite as great when it happens).  Planning a wedding is a human problem and using technology for that is a fascinating use case, IMHO.
  3. Human Relationships are Anything but Trivial.  Getting married and having kids is fundamentally about human relationships.  Employers come and go. We don't really expect them to stand by their employees when it is not profitable to do so.  Having people in your life you can count on is more important to having security in life than are having employers interested in your work.

Towards a More Just and Inclusive Economy


Standing up and defending Microsoft is to some extent an important first step in starting a conversation.  It can't be the end though.  The fact is that the critics of Microsoft want something (I hope!) that I want too, namely for women to enter the industry of software development on their own terms.  This has to be the topic of a larger conversation, and one which does not lose sight of the individual or the systemic problems of the economy in this regard.

It seems hard to imagine that the systemic injustices of the current system (including an aggregate wage gap, though this may be statistically insignificant in the computer sciences) can be done away with in any way other than reducing the dependence on large employers or doing away with the myth of interchangeability (these two things are closely tied, since the idea of interchangeability is important to the development of large corporate organizations).  Perhaps a return to an economic system where men and women worked together as joint principals in household businesses would be a good model.  That has very little traction in the US today however.

In the end though I think it should be more obvious than it apparently is that you can't force someone to enter into an industry on his or her own terms.  The efforts to solve the problem of a gender gap in terms of culture and institutions are likely to fail as long as women look at the tech industry (and in particular the most glamorous parts of it) and don't want to put in the time and effort, or make the sacrifices involved.

But still, even in a more just economy, there are going to be people for whom a computer is primarily a social tool, a way to coordinate with friends and co-workers, to communicate and to plan, and I have trouble seeing the difference between planning an event of deep personal significance and planning as a middle manager for a company, except that the former ought to be a lot more rewarding.

I remain relatively optimistic though that small household businesses based on open source consulting alone have the potential to provide an opportunity to balance flexibly and productively the demands of family and work in such a way that everyone gets pretty much everything they want.  I think (and hope) that as open source software becomes more mature as an industry that this will be the way things will go.

But whatever we do, we must recognize that people decide on how to go about participating in the economy based on their own needs and desires, and that none of us have perfect knowledge.  The most important thing we can stop doing is delegitimizing life choices we might decide are not for us.  If the goal is to help women enter industries like software development on their own terms, I think the best thing we can do is just get out of their way.

As a man, I don't know what young women want out of the industry as they consider it as a career or business path.  To be honest, it is better that I don't.  The new generations of programmers, male and female, should enter the industry on their own terms, with their own aspirations and hopes, their own determination to do things their own ways, and their own dreams.   And there is nobody that should tell them how to think or address these.  Not me.  Not the marketeers at Microsoft.  Not the authors at Slate and Think Progress.  That's the change that will make the industry more inclusive.

Thursday, February 27, 2014

In Praise of Perl 5

I have decided to do a two-part series here praising Perl in part 1 and PostgreSQL in part 2.  I expect that PostgreSQL folks will get a lot out of the Perl post and vice versa.  The real power of both programming environments is in the relationship between domain-specific languages and general purpose languages.

These posts are aimed at software engineers more than developers and they make the case for building frameworks on these platforms.  The common thread is flexibility and productivity.

This first post is about Perl 5.  Perl 6 is a different language, more a little sister to Perl 5 than a successor.  The basic point is that Perl 5 gives you a way to build domain-specific languages (DSL's) that can be seamlessly worked into a general purpose programming environment.  This is almost the exact inverse of PostgreSQL, which offers, as a development environment, a DSL with the ability to work almost any general purpose development tool into it.  This combination is extremely powerful as I will show.

All code in this post is from the LedgerSMB codebase in different eras (before the fork, during the early rewrite, and now planned code for 1.5).  All code in this post may be used under the GNU General Public License version 2 or at your option any later version.

You can see in the code samples below our evolution in how we use Perl.

This is (bad) Perl


Perl is a language many people love to hate.  Here's an example of bad Perl from early in the LedgerSMB codebase.  It is offered as an example of sloppy coding generally and why maintaining Perl code can be difficult at times.  Note the module makes no use of strict or warnings, and almost all variables are globally package-scoped.

     $column_header{description} =
        "<th><a class=listheading href=$href&sort=linedescription>"
      . $locale->text('Description')
      . "</a></th>";

    $form->{title} =
      ( $form->{title} ) ? $form->{title} : $locale->text('AR Transactions');

    $form->header;

    print qq|
<body>

<table width=100%>
  <tr>
    <th class=listtop>$form->{title}</th>
  </tr>
  <tr height="5"></tr>
  <tr>
    <td>$option</td>


This is not a good piece of maintainable code.  It is very hard to modify safely.  Due to global scoping, unit tests are not possible.  There are many other problems as well.  One of our major goals in LedgerSMB is to rewrite all this code as quickly as we can without making the application unusably unstable.

So nobody can argue that it is impossible to create unmaintainable messes in Perl.  But it is possible to do this sort of thing in any language.  One can't judge a language solely because it is easy to write bad code in it.

This is (better) Perl


So what does better Perl look like?   Let's try this newer Perl code, which was added late in the LedgerSMB 1.3 development cycle, and handles asset depreciation:

sub depreciate_all {
    my ($request) = @_;
    my $report = LedgerSMB::DBObject::Asset_Report->new(base => $request);
    $report->get_metadata;
    for my $ac(@{$report->{asset_classes}}){
        my $dep = LedgerSMB::DBObject::Asset_Report->new(base => $request);
        $dep->{asset_class} = $ac->{id};
        $dep->generate;
        for my $asset (@{$dep->{assets}}){
            push @{$dep->{asset_ids}}, $asset->{id};
        }
        $dep->save;
    }
    $request->{message} = $request->{_locale}->text('Depreciation Successful');
    my $template = LedgerSMB::Template->new(
        user =>$request->{_user},
        locale => $request->{_locale},
        path => 'UI',
        template => 'info',
        format => 'HTML'
    );
    $template->render($request);
}


This function depreciates all asset classes to a point at a specific date.  There's a fair bit of logic here but it does many times more work than the previous example, is easier to maintain, and is easier to understand.

This is also Perl!


The above two examples are pretty straightforward Perl code examples, but neither one really shows what Perl is capable of doing in terms of writing good-quality, maintainable code.

The fact is that Perl itself is a highly malleable language and this malleability allows you to define domain-specific languages for parts of your program and use them there.

Here's a small class for handling currency records.  POD and comments have been removed.

package LedgerSMB::Currency;
use Moose;
with 'LedgerSMB::PGOSimple::Role', 'LedgerSMB::MooseTypes';

use PGObject::Util::DBMethod;
   
sub _set_prefix { 'currency__' }


has id                => (is => 'rw', isa => 'Int', required => '0');
has symbol            => (is => 'ro', isa => 'Str', required => '1');
has allowed_variance  => (is => 'rw', isa => 'LedgerSMB::Moose::Number',
                          coerce => 1, required => 1);
has display_precision => (is => 'rw', isa => 'Int', required => '0');
has is_default => (is => 'ro', isa => 'Bool', required => '0');

dbmethod list      => (funcname => 'list', returns_objects => 1 );
dbmethod save      => (funcname => 'save', merge_back => 1);

dbmethod get       => (funcname => 'get', returns_objects => 1,
                          arg_list => ['symbol']);

dbmethod get_by_id =>  (funcname => 'get_by_id', returns_objects => 1,
                        arg_list => ['id']);

__PACKAGE__->meta->make_immutable;

Now the code above sets up a whole class including properties, accessors, and methods delegated to database stored procedures.  The class is effectively entirely declarative.  The same amount of work in a similarly simple module from the 1.3 iteration (TaxForm.pm) requires around 50 lines of code, so more than double, and that's without accessor support.  The 1.4-framework module for handling contact information (phone numbers and email addresses) is around 65 lines of code, with not much more complexity (so around triple).  The simpler Bank.pm (for tracking bank account information) is around 36 lines so nearly double.

What differentiates the examples though is not only line count but readability, testability, and maintainability.  The LedgerSMB::Currency module is more concise, more readable, and has much better testing and maintenance characteristics than the longer modules from the previous frameworks.  Even without comments or POD, if you read the Moose and PGObject::Util::DBMethod documentation, you know immediately what the module does.  And in such a module, comments may not be appropriate, but POD would likely not only be appropriate but take up significantly more space than the code.

How does that work?


Perl is a very flexible and mutable language.  While you can't add keywords, you can add functions that behave more or less like keywords.  Functions can be exported from one module to another and, used judiciously, this can be used to create domain-specific languages which in fact run on generated Perl code.
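To make the idea of keyword-like functions concrete, here is a minimal sketch using only core Perl.  The module name My::Fields and its "field" function are made up for this illustration (they are not Moose and not part of LedgerSMB); the point is only to show a function being installed into the importing package and then generating accessors there.

```perl
use strict;
use warnings;

# Hypothetical mini-DSL module: exports a keyword-like function "field".
package My::Fields;

sub import {
    my $caller = caller;
    no strict 'refs';
    # Install "field" into whichever package imports us.
    *{"${caller}::field"} = sub {
        my ($name, %opts) = @_;
        my $default = $opts{default};
        # Generate an accessor closure and attach it to the caller's package.
        *{"${caller}::$name"} = sub {
            my $self = shift;
            $self->{$name} = shift if @_;
            return exists $self->{$name} ? $self->{$name} : $default;
        };
    };
}

package My::Currency;

# In a real module file this line would simply be "use My::Fields;".
BEGIN { My::Fields->import }

field symbol            => (default => 'USD');
field display_precision => (default => 2);

sub new { my ($class, %args) = @_; return bless {%args}, $class }

package main;

my $cur = My::Currency->new(symbol => 'EUR');
print $cur->symbol, "\n";             # EUR
print $cur->display_precision, "\n";  # 2
```

Because "field" is installed at compile time, the declarations read like built-in syntax even though they are ordinary function calls.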

The example here uses two modules which provide DSL's for specific purposes.  The first is Moose, which has a long history as an extremely important contributor to current Perl object-oriented programming practices.  This module provides the functions "with" and "has" used above.

Moose, in this case works with a PGObject::Simple::Role module which provides a framework for interacting with PostgreSQL db's.  This is extended by LedgerSMB::PGOSimple::Role which provides handling of database connections and the like.

The second is PGObject::Util::DBMethod, which provides the dbmethod function.  It's worth noting that both has and dbmethod are code generators.  When they run, they create functions which they attach to the package.  Used in this way, has creates the accessors, while dbmethod creates the delegated methods.
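As a rough illustration of what a generator for delegated methods might look like, here is a sketch in core Perl.  It is not the actual implementation of PGObject::Util::DBMethod: the module My::DBMethod, the _call_procedure stub, and the way arguments are pulled from the object are all assumptions made for the example, loosely modelled on the declarations in the currency class above.

```perl
use strict;
use warnings;

package My::DBMethod;   # hypothetical stand-in, not PGObject::Util::DBMethod

sub import {
    my $caller = caller;
    no strict 'refs';
    *{"${caller}::dbmethod"} = sub {
        my ($name, %spec) = @_;
        # Generate the delegated method and attach it to the caller's package.
        *{"${caller}::$name"} = sub {
            my ($self) = @_;
            my $funcname = $self->_set_prefix . $spec{funcname};
            # Pull positional arguments from the named object properties.
            my @args = map { $self->{$_} } @{ $spec{arg_list} || [] };
            return $self->_call_procedure($funcname, @args);
        };
    };
}

package My::Currency;
BEGIN { My::DBMethod->import }   # "use My::DBMethod;" in a real module file

sub new         { my ($class, %args) = @_; return bless {%args}, $class }
sub _set_prefix { 'currency__' }

# Stub backend (assumption): a real one would run the stored procedure.
# Here it just returns the SQL it would have issued.
sub _call_procedure {
    my ($self, $funcname, @args) = @_;
    return "SELECT * FROM $funcname(" . join(', ', ('?') x @args) . ")";
}

dbmethod get  => (funcname => 'get', arg_list => ['symbol']);
dbmethod list => (funcname => 'list');

package main;

my $cur = My::Currency->new(symbol => 'EUR');
print $cur->get,  "\n";   # SELECT * FROM currency__get(?)
print $cur->list, "\n";   # SELECT * FROM currency__list()
```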

Why is this Powerful and Productive?


The use of robust code generation here at run-time allows you to effectively build modules and classes from specifications of modules and classes rather than implementing that specification by hand.  Virtually all object-oriented frameworks in Perl effectively offer some form of this code generation.

A specification to code language provides a general tradeoff between clarity, expressiveness (in its domain) and robustness on one hand, with inflexibility on the other.  This is the fundamental tradeoff of domain-specific languages generally.  When you merge a domain-specific language into a general-purpose one, however, you gain the freedom to compensate for the lack of flexibility by falling back on more general tools when you need to.  This flexibility is where the production gains are found.

Compare a framework built as a mini-DSL specification language to one built as an object model.  In an object model framework one effectively has to juggle object-oriented design (SOLID principles, etc) with the desire for greater flexibility.  Here, however the DSL's are orthogonal to the object model.  They allow you to define the object model orthogonally to the framework, while re-using the DSL framework however you want.  Of course these are not mutually exclusive, and it is quite possible to have both in a large and powerful application framework.

Other Similarly Powerful Languages


 Perl is not the only language of this kind.  The first example that comes to mind, naturally, is Lisp.  However other Lispy languages are also worth mentioning.  Most prominent among these are Rebol and Red, whose open source implementations are still very immature.  These languages are extremely mutable and the syntax can be easily extended or even rewritten.

Metaprogramming helps to some extent with some of these issues and this is a common way of addressing this in Ruby and Python, but this makes it much harder to build a framework that is truly orthogonal to the object model.

A major aspect of the power of Perl 5 here is the very thing which often causes beginning and intermediate programmers headaches.  Perl allows, to a remarkable extent, manipulation of its own internals (perhaps only Rebol, Red, and Lisp take this further).  This allows one to rewrite the language to a remarkable extent, but it also allows for the development of contexts which allow for these sorts of extensions.

The key feature I am looking at here is the mutability of the language.  And there are few languages which are themselves relatively mutable.  Perl isn't just a programming language, but a toolkit for building programming languages inside it.

Monday, February 24, 2014

Notes on Software Testing

It seems software testing is one of those really hard things to get right.  I find very often I run into projects where the testing is inadequate or where, in an overzealous effort to ensure no bugs, test cases are too invasive and test things that shouldn't be tested.  This article attempts to summarize what I have learned from experience in the hope that it is of use to others.

The Two Types of Tests


Software testing has a couple of important functions and these are different enough to form the basis of a taxonomy system of test cases.  These functions include:
  1. Ensuring that a change to a library doesn't break something else
  2. Ensuring general usability by the intended audience
  3. Ensuring sane handling of errors
  4. Ensuring safe and secure operation under all circumstances
 These functions fall themselves into roughly two groups.  The first ensures that the software functions as designed, and the second ensures that where undefined behavior exists, it occurs in a sane and safe way.

The first type of tests then are those which ensure that the behavior conforms to the designed outlines of the contract with downstream developers or users.  This is what we may call "design-implementation testing."

The second type of tests are those which ensure that behavior outside the designed parameters is either appropriately documented or appropriately handled, and can be deployed and used in a safe and secure manner.  This, generally, reduces to error testing.

These two types of tests are different enough they need to be written by different groups.  The design-implementation tests are really best written by the engineers designing the software, and the error tests need to be handled by someone somewhat removed from that process.

Why Software Engineers Should Write Test Cases


Design-implementation tests are a formalization of the interface specification.  As such a formalization the people best prepared to write good software contract tests are those specifying the software contracts, namely the software engineers.

There are a couple ways this can be done.  Engineers can write quick pseudocode intended to document interfaces and test cases to define the contracts, or can develop a quick prototype with test cases before handing off to developers, or the engineers and the developers can be closely integrated.  Either way the engineers are in the best position, knowledge-wise, to write test cases about whether the interface contracts are violated or not.

This works best with an initial short iteration cycle (regarding prototypes).  However the full development could be on a much larger cycle so it is not necessarily limited to agile development environments.

Having the engineers write these sorts of test cases ensures that a few very basic principles are not violated:

  1. The tests do not test the internals of dependencies beyond necessity
  2. The tests focus on interface instead of implementation
These rules help avoid test cases broken needlessly when dependencies fix bugs.
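A small sketch of the second rule, using Test::More.  The My::Ledger class is invented purely for this example; what matters is that the test only goes through the documented interface and never peeks at how the entries are stored.

```perl
use strict;
use warnings;
use Test::More;

package My::Ledger;   # hypothetical class used only for illustration

sub new { return bless { entries => [] }, shift }

sub add_entry {
    my ($self, $amount) = @_;
    push @{ $self->{entries} }, $amount;
    return $self;
}

sub balance {
    my ($self) = @_;
    my $sum = 0;
    $sum += $_ for @{ $self->{entries} };
    return $sum;
}

package main;

my $ledger = My::Ledger->new;
$ledger->add_entry(100)->add_entry(-40);

# Contract test: only the documented interface is exercised.
is($ledger->balance, 60, 'balance reflects all entries added so far');

# A test like the following would be tied to the implementation and would
# break if entries were later stored differently (say, as a running total):
#   is(scalar @{ $ledger->{entries} }, 2, 'two entries stored internally');

done_testing;
```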

Why You Still Need QA Folks Who Write Tests After the Fact


Interface and design-implementation tests are not enough.  They cover very basic things, and ensure that correct operation will continue.  However they don't generally cover error handling very well, nor do they cover security-critical questions very well.

For good error handling tests, you really need an outside set of eyes, not too deeply tied to current design or coding.  It is easier for an outsider to spot that "user is an idiot" that was left in as a placeholder in an error message than it is for the developer or the engineer.  Some of these can be reduced by cross-team review of changes as they come in.

A second problem is that to test security-sensitive failure modes, you really need someone who can think about how to break an interface, not just what it was designed to do.  The more invested one is, brain-cycle-wise, in implementing the software, the harder it often is to see this.
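The kind of after-the-fact test I have in mind might look something like the sketch below.  The parse_amount function and its error messages are hypothetical; the inputs are chosen by asking how the interface could be broken rather than what it was designed to accept.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical function under test.
sub parse_amount {
    my ($input) = @_;
    die "Invalid amount: input is required\n"
        unless defined $input && length $input;
    die "Invalid amount: '$input' is not a number\n"
        unless $input =~ /\A-?\d+(?:\.\d+)?\z/;
    return $input + 0;
}

# Inputs chosen to break the interface rather than exercise the design.
my @hostile = (undef, '', 'abc', '1e999', '1; DROP TABLE ar', '--1', ' 12');

for my $input (@hostile) {
    my $label  = defined $input ? "'$input'" : 'undef';
    my $result = eval { parse_amount($input) };
    my $error  = $@;
    ok(!defined $result, "$label is rejected");
    like($error, qr/\AInvalid amount:/, "$label produces a documented error");
}

done_testing;
```

Checking the text of the error as well as the failure itself is what catches leftover placeholder messages.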

Conclusion


Software testing is something which is best woven into the development process relatively deeply and should happen both before and after main development.  Writing test cases is often harder than writing code, and this goes double for good test cases vs good code.

Now obviously there is a difference between testing SQL stored procedures and testing C code, and there may be cases where you can dispense to a small extent with some after-the-fact testing (particularly in declarative programming environments).  After all, you don't have to test what you can prove, but you cannot prove that an existing contract will be maintained into the future.

Thursday, January 23, 2014

PGObject on CPAN: NoSQL Ideas for PostgreSQL Applications

One of the legitimate points Martin Fowler has argued in favor of NoSQL databases is that expecting applications to directly manipulate relational data is far less clean from an application design perspective than having a database encapsulated behind a loosely coupled interface (like a web service).  I would actually go further and point out that such an approach invariably leads to bad database design too, because the information layout becomes the contracted software API and thus one either has to spend a lot of time and effort separating logical from physical storage layouts or one ends up having an ossified physical layout that can never change.

This problem has been well understood in the relational database community for a long time.  The real problem has, however, been tooling.  There are effectively two traditional tools for addressing this issue:

1.  Updateable views.  These then form a relational API that allows the database to store information in a way separate from how the application sees it.  If you are using an ORM, this is a really valuable tool.

2.  Stored procedures.  These provide a procedural API, but traditionally a relatively brittle one based on the same approach used by libraries.  Namely you typically have an ordered series of arguments, and all users of the API are expected to agree on the ordering and number of arguments.  While this may work passably for a single system (and even there lead to "dependency hell"), it poses significant issues in a large heterogeneous environment because the number of applications which must be coordinated in terms of updates becomes very high.  Oracle solves this using revision based editions, so you can have side-by-side versioning of stored procedures, and allows applications to specify which edition they are working on.  This is similar to side-by-side versioning of C libraries typical for Linux, or side-by-side versioning of assemblies in .Net.

On the application side, ORMs have become popular, but they still lead to a relational API being contractual, so are really best used with updateable views.

In part because of these shortcomings, we started writing ways around them for LedgerSMB starting with 1.3.  The implementations are PostgreSQL-specific.  More recently I wrote some Perl modules, now on CPAN, to implement these concepts.  These create the general PGObject framework, which gives an application access to PostgreSQL stored procedures in a loosely coupled way.  It is hoped that other implementations of the same ideas will be written and other applications will use this framework.

The basic premise is that a procedural interface that is discoverable allows for easier management of software contracts than one which is non-discoverable.  The discoverability criteria then become the software contract.
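To sketch why discoverability loosens the contract, consider the following.  The %catalog hash stands in for whatever a lookup of the procedure's argument names would return, and the convention of stripping an "in_" prefix to find the matching object property is an assumption made for this example rather than a statement of what the CPAN modules do.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a catalog lookup: the ordered argument names of
# each stored procedure.  A real implementation would ask the database.
my %catalog = (
    currency__save => [qw(in_symbol in_display_precision)],
);

# Build the positional argument list from named object properties.
sub map_args {
    my ($funcname, $object) = @_;
    my $argnames = $catalog{$funcname}
        or die "No such function: $funcname\n";
    my @args;
    for my $argname (@$argnames) {
        (my $prop = $argname) =~ s/\Ain_//;   # assumed naming convention
        push @args, $object->{$prop};
    }
    return @args;
}

my %currency = (symbol => 'EUR', display_precision => 2, is_default => 0);
my @args = map_args('currency__save', \%currency);
print join(', ', @args), "\n";   # EUR, 2
```

If the stored procedure later gains or reorders arguments, only the catalog entry changes; the calling code keeps passing the same object.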

PGObject allows what I call "API Paradigms" to be built around stored procedures.  An API paradigm is a consistent specification of how to write discoverable stored procedures and then re-use them in the application.  Most namespaces under PGObject represent such "paradigms."  The exceptions currently are the Type, Util, Test, and Debug second-tier namespaces.  Currently PGObject::Simple is the only available paradigm.

What follows is a general writeup of the currently usable PGObject::Simple approach and what each module does:

PGObject


PGObject is the bottom-half module.  It is designed to service multiple top-half paradigms (the Simple paradigm is described below; I am also working on a CompositeType paradigm, which probably won't be ready initially).  PGObject has effectively one responsibility:  coordinate between application components and the database.  This is split into two sub-responsibilities:

  1. Locate and run stored procedures
  2. Encode/decode data for running in #1 above.

Specifically outside the responsibility of PGObject is anything to do with managing database connections, so every call to a database-facing routine (locating or running a stored procedure) requires a database handle to be passed to it.

The reason for this is that the database handles should be managed by the application not our CPAN modules and this needs to be flexible enough to handle the possibility that more than one database connection may be needed by an application.  This is not a problem because developers will probably not call these functions unless they are writing their own top-half paradigms (in which case the number of places in their code where they issue calls to these functions will be very limited).

A hook is available to retrieve only functions with a specified first argument type.  If more than one function is found that matches, an exception is thrown.

The Simple top-half paradigm (below) has a total of two such calls, and that's probably typical.

The encoding/decoding system is handled by a few simple rules.

On delivery to the database, any parameter that can('to_db') runs that method and inserts the return value in place of the parameter in the stored procedure.  This allows one to have objects which specify how they serialize.  Bigfloats can serialize as numbers, Datetime subclasses can serialize as date or timestamp strings, and more complex types could serialize however is deemed appropriate (to JSON, a native type string form, a composite type string form, etc).
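The encoding rule can be sketched in plain Perl.  This is only an illustration of the can('to_db') dispatch and not PGObject's actual code; My::Date and encode_params are hypothetical names made up for the example.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical value class: a date that knows how to serialize itself.
package My::Date {
    sub new {
        my ($class, %args) = @_;
        return bless {%args}, $class;
    }
    sub to_db {
        my ($self) = @_;
        return sprintf '%04d-%02d-%02d', @{$self}{qw(year month day)};
    }
}

# Replace every parameter that can('to_db') with its serialized form.
# Plain scalars and undef pass through unchanged.
sub encode_params {
    my @params = @_;
    return map { blessed($_) && $_->can('to_db') ? $_->to_db : $_ } @params;
}

my @encoded = encode_params(
    42,
    My::Date->new(year => 2013, month => 11, day => 23),
    undef,
);
print join(', ', map { defined $_ ? $_ : 'NULL' } @encoded), "\n";
# prints "42, 2013-11-23, NULL"
```

The point of the design is that the caller never needs to know which parameters are objects; each object decides for itself how it is represented in the database.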

On retrieval from the database, the type of each column is checked against a type registry (sub-registries may be used for multiple application support, and can be specified at call time as well).  If the type is registered, the return value is passed to the $class->from_db method and the output returned in place of the original value.  This allows for any database type to be mapped back to a handler class.
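The decoding side can be sketched the same way.  Again this is an assumption-laden illustration rather than PGObject's implementation: the registry hash, My::Type::Money, and decode_row are hypothetical, and column types are passed in by hand where the real module would get them from the database.

```perl
use strict;
use warnings;

# Hypothetical handler class for a numeric database type.
package My::Type::Money {
    sub from_db {
        my ($class, $value) = @_;
        return bless { amount => $value + 0 }, $class;
    }
}

# Illustrative registry mapping database type names to handler classes.
my %type_registry = (numeric => 'My::Type::Money');

# Pass values of registered types to the handler's from_db method and
# leave everything else untouched.
sub decode_row {
    my ($row, $types) = @_;
    my %decoded;
    for my $col (keys %$row) {
        my $class = $type_registry{ $types->{$col} // '' };
        $decoded{$col} = $class ? $class->from_db($row->{$col}) : $row->{$col};
    }
    return \%decoded;
}

my $decoded = decode_row(
    { id => 7,      total => '19.95'   },
    { id => 'int4', total => 'numeric' },
);
print ref($decoded->{total}), "\n";   # prints "My::Type::Money"
```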

Currently PGObject::Type is a reserved namespace for dealing with released type handler classes.  We have type handlers for DateTime and BigFloat written already, and are working on one for JSON database types.

PGObject::Simple


The second-level modules outside of a few reserved namespaces designate top-half paradigms for interacting with stored procedures.  Currently only Simple is supported.

This must be subclassed to be used by an application and a method provided to retrieve or generate the appropriate database connection.  This allows application-specific wrappers which can interface with other db connection management logic.

All options for PGObject->call_procedure are supported, including running aggregates, order by, etc.  This means more options are available for things like gl reports database-side than the current LedgerSMB code allows.

$object->call_dbmethod accepts an args argument, a hashref tying argument names to values, which takes precedence over the object's own properties.  If I want to have a ->save_as_new method, I can add args => {id => undef} to ensure that undef will be used in place of $self->{id}.

Both call_procedure (for enumerated arguments) and call_dbmethod (for named arguments) are supported both from the package and from the object.  So you can call MyClass->call_dbmethod(...) and $myobj->call_dbmethod(...).  Naturally if the procedure takes args, you will need to specify them or it will just submit nulls.
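To make the named-argument mapping concrete, here is a minimal sketch in plain Perl.  It is not the PGObject::Simple code: map_named_args is a hypothetical helper, the argument-name list stands in for what would be discovered from the database, and stripping an 'in_' prefix is an assumption borrowed from the LedgerSMB naming convention.

```perl
use strict;
use warnings;

# Build the ordered argument list for a stored procedure from an
# object's properties.  Entries in %$overrides win over the object's
# properties, even when the override value is undef.
sub map_named_args {
    my ($argnames, $object, $overrides) = @_;
    $overrides ||= {};
    my @args;
    for my $argname (@$argnames) {
        (my $key = $argname) =~ s/^in_//;   # assumed naming convention
        push @args, exists $overrides->{$key}
            ? $overrides->{$key}
            : $object->{$key};
    }
    return @args;
}

my $invoice = { id => 12, amount => '100.00', note => 'first' };
my @args = map_named_args(
    [qw(in_id in_amount in_note)],
    $invoice,
    { id => undef },    # the save_as_new case: force a NULL id
);
print join(', ', map { defined $_ ? $_ : 'NULL' } @args), "\n";
# prints "NULL, 100.00, first"
```

Using exists rather than defined for the override check is what lets an explicit undef replace a property that is set on the object.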

PGObject::Simple::Role


This is a Moo/Moose role handler for PGObject::Simple.

One of the main features it has is the ability to declaratively define db methods.  So instead of:

sub int {
    my ($self) = @_;
    return $self->call_dbmethod(funcname => 'foo_to_int');
}

You can just

dbmethod( int => (funcname => 'foo_to_int'));

We will probably move dbmethod off into another package so that it can be imported early and used elsewhere as well.  This would allow it to be called without the outermost parentheses.
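For readers curious how such a declarative helper can work, the following is a rough sketch in plain Perl.  It is not the PGObject::Simple::Role implementation: My::DBMethod and My::Foo are hypothetical, and call_dbmethod here is a stub that only reports which function it would run.

```perl
use strict;
use warnings;

# Hypothetical helper: installs a method named $name into the calling
# package.  The installed method forwards the stored arguments, plus
# any passed at call time, to call_dbmethod.
package My::DBMethod {
    sub dbmethod {
        my ($name, %defaults) = @_;
        my $target = caller;
        no strict 'refs';
        *{"${target}::${name}"} = sub {
            my ($self, %args) = @_;
            return $self->call_dbmethod(%defaults, %args);
        };
        return;
    }
}

package My::Foo {
    sub new {
        my ($class) = @_;
        return bless {}, $class;
    }

    # Stand-in for the real database call.
    sub call_dbmethod {
        my ($self, %args) = @_;
        return "would call $args{funcname}";
    }

    My::DBMethod::dbmethod(int => (funcname => 'foo_to_int'));
}

package main;

print My::Foo->new->int(), "\n";   # prints "would call foo_to_int"
```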

The overall benefit of this framework is that it allows for discoverable interfaces, and the ability to specify what an application needs to know on the database.  This allows for many of the benefits of both relational and NoSQL databases at the same time including development flexibility, discoverable interfaces, encapsulation, and more.

Saturday, November 23, 2013

Reporting in LedgerSMB 1.4: Part 5, Conclusions

I hope many of you have enjoyed this series.  We've tried hard to avoid inner platform syndrome here by making reporting something that a developer does.

Skills Required to Write Basic Reports


The developer skills required to write reports tend to fall on the database side.  In general one should have:

  1. A good, solid understanding of SQL and PL/PGSQL in a PostgreSQL environment.  This is the single most important skill and it is where most of the reporting effort lies.
  2. Basic understanding of Perl syntax.  Any basic tutorial will do.
  3. A basic understanding of Moose.  A general understanding of the documentation is sufficient, along with existing examples.
  4. A very basic understanding of the LedgerSMB reporting framework as discussed in this series.
These are required for general tabular reports, and they allow you to build basic tabular reports that can be output in HTML, CSV, ODS, and PDF formats.

Skills Required to Write More Advanced Reports


For more advanced reports, such as new financial statements, government forms, and the like, additional skills are required.  These are not fully discussed here, but they typically include:

  1. An in-depth understanding of our HTML elements abstraction system (this will be discussed in a future post here)
  2. A general proficiency with Template Toolkit, possibly including the filter for running portions of the template through LaTeX.

Strengths


The reporting framework here is very database-centric.  In general you cannot have a non-database-centric reporting structure because the data resides in the database, and some knowledge there is required to get it out in a working form.  We have tried hard to make a system where only minimal knowledge elsewhere is required to do this.  If you have db folks who work with your finance folks, they can write the reports.

Weaknesses


Ad hoc reporting is outside the scope of this framework, so it is unlikely to be particularly helpful for a one-off report.  Additionally this generates reports as documents that can be shared.  Updating the data requires running the report again, and while this can be done as a sharable URL, it is not necessarily ideal for all circumstances.

Other Reporting Options


In cases where this reporting framework is not ideal, there are a number of other options available:

  1. Views can be made which can be pulled in via ODBC into spreadsheets like Excel.
  2. Third party report engines like JasperReports can be used instead, and
  3. One-off SQL queries in PSQL can be used to generate HTML and (in the most recent versions) LaTeX documents that can be shared.

Wednesday, November 20, 2013

Writing Reports in LedgerSMB 1.4: (Mostly-) Declarative Perl Modules

So far we have talked about getting data in, and interacting with the database.  Now we will talk about the largest of the modules and cover workflow scripts in relation to this stage.

At this point you would have a filter screen, a user defined function which would take the arguments from that screen's inputs (prefixed with 'in_' usually to avoid column name conflicts), and a tabular data structure you expect to return. 

As a note here, all the code I am trashing here is my own, in part because I have learned a lot about how to code with Moose over the course of 1.4 development.

In your workflow script you are likely to need to add the following:

use LedgerSMB::Report::MyNewReport;

and

sub run_my_new_report {
    my ($request) = @_;
    LedgerSMB::Report::MyNewReport->new(%$request)->render($request);
}

That's all you need in the workflow script.

Overall Structure and Preamble


The actual Perl module basically defines a number of parameters for the report, and the LedgerSMB::Report.pm provides a general framework to cut down on the amount of code (and knowledge of Perl) required to write a report.  Minimally we must, however, define inputs, if any, output structure, and how to create the output structure.  We can also define buttons for further actions on the workflow script.  The same workflow script would have to handle the button's actions.

Typically a report will start with something like this (of course MyNewReport is the name of the report here):

package LedgerSMB::Report::MyNewReport;
use Moose;
extends 'LedgerSMB::Report';
with 'LedgerSMB::Report::Dates'; # if date inputs used, use standard defs

This preamble sets up the basic reporting framework generally along with all the features discussed below.  If you need to handle numeric input or secondary dates you will want to change:

with 'LedgerSMB::Report::Dates';

to
 
with 'LedgerSMB::Report::Dates', 'LedgerSMB::MooseTypes';

so that you can use type coercions for numeric and/or date fields (for processing localized formattings and the like). 

Defining Inputs


Inputs are defined as report properties.  Usually you want these properties to be read-only because you want them to correspond to the report actually run.  You can use the full Moose capabilities in restricting inputs.  However typically inputs should be read-only and you are likely to want to restrict to type and possibly coerce as well (at least when using the types defined in LedgerSMB::MooseTypes).

When including the following line you do not have to define the date_from and date_to inputs:

with 'LedgerSMB::Report::Dates';

Typically our conventions are to document inputs inline with POD.  While this is (obviously) not necessary for the functioning of the report, it is helpful for future maintenance and highly recommended.  It is also worth noting in the POD how a match is made (this should be in SQL also if applicable, in a COMMENT ON statement for easy checking of common assumptions regarding API contracts).

For example, from the GL report:

=item amount_from

The lowest value that can match, amount-wise.

=item amount_to

The highest value that can match, amount-wise.

=cut

has 'amount_from' => (is => 'rw', coerce => 1,
                     isa => 'LedgerSMB::Moose::Number');
has 'amount_to' => (is => 'rw', coerce => 1,
                   isa => 'LedgerSMB::Moose::Number');


Those lines demonstrate the full power of Moose in the definition.  One obvious thing that will be fixed in beta is making these read-only (is => 'ro') while they are currently read-write.  There is no reason for these to be read-write.

From the LedgerSMB::Report::PNL you see the following optional string input defined:

=item partnumber

This is the control code of the labor/overhead, service, or part consumed.

=cut

has partnumber => (is => 'ro', isa => 'Str', required => 0);


This would probably be improved by mentioning that the partnumber is an exact match in the POD, but it shows how to designate a read-only, optional string input.

If an input is not listed, it won't be passed on to the stored procedure.  It is critical that all inputs are defined whether using standard modular definitions (LedgerSMB::Report::Dates) or explicit ones.  If an input is being ignored this is one of the first places to check.  Additionally note that because of other aspects of the reporting, it is not currently possible to use strict or slurpy constructors in any sane way.  It is likely we will build our own constructor handling in the future, but currently this is a hard limitation.

Input Definition Antipatterns


There are a few things which someone who has not worked with Moose before is likely to do in this area.  Many of these are relatively harmless in the web interface because of a number of failsafes, but if you ever want to re-use the code in a more stateful environment you will have difficulties.  The examples given are, alas, my own code but I have the benefit of being a newcomer to Moose here and so the lessons are fresh in my mind, or rather codebase.

The first is in use of read-write inputs.  A report output is closely bound to its inputs, so read-write inputs allow the application to misrepresent the report.  The example I gave above is:

has 'amount_to' => (is => 'rw', coerce => 1,
                   isa => 'LedgerSMB::Moose::Number');


Now this allows the application to do something like this:

my $report = LedgerSMB::Report::GL->new(%request);
$report->run_report();
$report->amount_to('10000000');
$report->render; 

The above will represent that the report includes a bunch of transactions that may, in fact, be excluded.  This is no good.  On the other hand, if amount_to were read-only (is => 'ro'), then the above code would throw an error instead.
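The effect of a read-only input can be imitated without Moose.  The sketch below is a hand-written stand-in, not what Moose generates, and My::Report is a hypothetical class; it only shows that an attempt to change the input after construction fails loudly instead of silently changing what the report claims.

```perl
use strict;
use warnings;

package My::Report {
    sub new {
        my ($class, %args) = @_;
        return bless {%args}, $class;
    }

    # Read-only accessor: returns the value and refuses any attempt
    # to set it.
    sub amount_to {
        my $self = shift;
        die "amount_to is read-only\n" if @_;
        return $self->{amount_to};
    }
}

package main;

my $report = My::Report->new(amount_to => '500');
my $changed = eval { $report->amount_to('10000000'); 1 };
print $changed ? "changed\n" : "refused: $@";
print $report->amount_to(), "\n";
# prints "refused: amount_to is read-only" and then "500"
```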

The second major anti-pattern is in the use of Maybe[] as an alternative to required => 0.  For example see the following:

has 'approved' => (is => 'rw', isa => 'Maybe[Bool]');

Oh the joys of looking at code I wrote that is in need of rewrite....  Not only do we have a read-write input, but it is maybe boolean (i.e. true, false, or undef).

Now, this appears to work because undef is passed as NULL to the database, and the same is achieved by the more proper:

has approved => (is => 'ro', isa => 'Bool', required => 0);

The difference is that we will not accept as input a case where $request->{approved} = undef has been set.  Our query handlers drop empty inputs so there is no case where this should happen.  Additionally, this prevents unsetting the attribute after running the report and thus decoupling output from purported input.

Defining Report Structure


Report structure is defined using a series of functions which are overridden by actual reports.  Some of these functions are optional and some are not.  The required ones are covered first.

There are three required functions, namely columns, header_lines, and name.  These are expected to return very specific data structures, but function in a largely declarative way.  In other words, the functional interface effectively defines them as pseudo-constant (they are not fully constant because they are expected to return the localized names).

In all cases, LedgerSMB::Report::text() can be used to translate a string into its local equivalent (assuming localized strings in the .po files).

The columns function returns an arrayref of hashrefs, each of which is a column definition for our "dynatable" templates.  The following are required:

  • col_id --- the name of the row field to use
  • type --- the display type of the field (text, href, checkbox, hidden, etc)
  • name --- localized header for the column
The following are conditionally required or optional:
  •  href_base --- the base of the href. To this is appended the row_id (see below).  Only used by href columns, and then required.
  • pwidth --- Used for width factors for PDF reports.
Here's an example of a columns function for a very simple report (which just lists all SIC codes in the system):

sub columns {
    return [
      { col_id => 'code',
          type => 'href',
     href_base => 'am.pl?action=edit_sic&code=',
          name => LedgerSMB::Report::text('Code'), },

      { col_id => 'description',
          type => 'text',
          name => LedgerSMB::Report::text('Description'), }
    ];
}


In most reports, the columns function is much longer.

The header_lines function provides an arrayref of hashrefs, for displaying inputs on the report.  To this, the reporting engine adds the name of the report and the database connected to.  If you want no header lines added, you can just return an empty arrayref:

sub header_lines { return []; }

In many cases however, such inputs should be displayed.  Each hashref has two components:

  • name is the name of the input
  • text is the text label of the input.
Here's a more useful example from LedgerSMB::Report::GL:

 sub header_lines {
    return [{name => 'from_date',
             text => LedgerSMB::Report::text('Start Date')},
            {name => 'to_date',
             text => LedgerSMB::Report::text('End Date')},
            {name => 'accno',
             text => LedgerSMB::Report::text('Account Number')},
            {name => 'reference',
             text => LedgerSMB::Report::text('Reference')},
            {name => 'source',
             text => LedgerSMB::Report::text('Source')}];
}


Finally name() returns the localized name of the report.  This is usually a very simple function:

sub name {
    return LedgerSMB::Report::text('General Ledger Report');
}


Additionally there are two optional functions, buttons and template, which allow additional flexibility.  These are rarely used.

The template function overrides our dynatable-based template as the template to use.  This is used mostly in financial statements but is not used in the trial balance, or other fully tabular reports.

If you use it, just return the path to the template to use.

sub template { return 'Reports/PNL' }

Our PNL reporting module has additional features which are beyond the scope of this post.

Finally buttons returns a list of buttons to be included on the report.  These follow the element_data format of UI/lib/elements.html and are used to add HTML form callbacks to the report.  Here's an example:

sub buttons {
    return  [{
         text => LedgerSMB::Report::text('Add New Tax Form'),
         name => 'action',
         type => 'submit',
         class => 'submit'
    }];
}


How Columns are Selected


The columns to display are dynamically selected according to the following rules:

  • If no column selection criteria are found, then all columns are shown
  • If the render() method is called with a hashref as arguments that includes a member with the name of the column ID prefixed with 'col_' then the column is shown and those which are not so selected are not.
What this means is that typically you will define inputs for selection of columns in the $request object before passing it through if you want to have a baseline of columns which always show.  Otherwise you will usually allow selection of columns in the filter screen using inputs named as above (i.e. an 'id' field would have a selector named 'col_id').
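The selection rules can be sketched as a small function.  This is not the LedgerSMB::Report code: select_columns is a hypothetical name, and treating any true 'col_' value as "selected" is an assumption.

```perl
use strict;
use warnings;

# Return the columns to show: only those whose 'col_<col_id>' key is
# set in the request, or all columns when none is selected.
sub select_columns {
    my ($columns, $request) = @_;
    my @chosen;
    for my $column (@$columns) {
        push @chosen, $column if $request->{ 'col_' . $column->{col_id} };
    }
    return @chosen ? \@chosen : $columns;
}

my $columns = [
    { col_id => 'code' },
    { col_id => 'description' },
    { col_id => 'id' },
];
my $shown = select_columns($columns, { col_id => 1, col_code => 1 });
print join(', ', map { $_->{col_id} } @$shown), "\n";
# prints "code, id"
```

Note that the columns keep the order given in the columns definition, not the order of the request keys.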

The run_report() Function


The run_report function populates the rows of the report.  It should exist and set $self->rows(\@array) at the end.  This is the only portion where specific knowledge of programming in Perl is particularly helpful.  However, assuming nothing more than a little knowledge, here is a basic template from the SIC listing:

sub run_report{
    my ($self) = @_;
    my @rows = $self->exec_method(funcname => 'sic__list');
    for my $row(@rows){
        $row->{row_id} = $row->{code};
    }
    $self->rows(\@rows);
}


Going through this line by line:

my ($self) = @_;

The first line of the function body is boilerplate here.  It is possible to accept the $request object here as a second parameter but not typical unless you have very specific needs for it.  In that case, you simply:

my ($self, $request) = @_;

Remember that in Perl, @_ is the argument list.

my @rows = $self->exec_method(funcname => 'sic__list');

This line says "run the database function named 'sic__list' and map inputs of the report to function arguments."  You would typically just copy this line and change the function name.

for my $row(@rows){
    $row->{row_id} = $row->{code};
}

If you have any hyperlinks in the report, it is necessary to set a row_id so that this can be properly handled.  In any case the row_id is appended to the link from href_base in the column definition.  It is possible to override this on a per-column basis but that's beyond the scope of this introduction.
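As a small illustration of how the pieces fit together, the link for an href column is simply the column's href_base with the row's row_id appended.  The row_link helper below is hypothetical (the real work happens in the templates), and it assumes the row_id needs no URL escaping.

```perl
use strict;
use warnings;

# Build the link target for one cell of an href column.
sub row_link {
    my ($column, $row) = @_;
    return $column->{href_base} . $row->{row_id};
}

my $column = { col_id => 'code', type => 'href',
               href_base => 'am.pl?action=edit_sic&code=' };
my $row    = { code => '1234', description => 'Example' };
$row->{row_id} = $row->{code};    # as done in run_report

print row_link($column, $row), "\n";
# prints "am.pl?action=edit_sic&code=1234"
```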

 $self->rows(\@rows);

This assigns the rows() of the report to the rows returned.  Currently this is handled as a read/write property of reports, but long-run this will probably be changed so that programs cannot override this after running the report.

Ending the File


Always end report classes with the following line

__PACKAGE__->meta->make_immutable;

This improves performance and ensures that no more attributes can be dynamically added to your report.  There are cases where such may be less than desirable outside of the reports of this sort, but such would be outside the reporting use case.

Others in series:

  1. Introduction
  2. Filter Screens
  3. Best Practices regarding Stored Procedures
  4. (this piece)
  5. Conclusions

Tuesday, November 12, 2013

On CPAN, Community, and P: A Case Study in What Not to Do

I am going to try to do this piece as respectfully as I can.  I understand people put a lot of work into developing things and they submit them, and when they get panned, it can be difficult.  At the same time, community resources are community resources and so a failure to conduct such case studies in things gone amiss can lead to all kinds of bad things.  Failure to get honest feedback can lead to people not improving, but worse, it can leave beginners sometimes mistakenly believing that bad practices are best practices.  There is also a period of time after which bad practices become less easily excused. 

So somewhat reluctantly I am going to undertake such a study here.  This is solely about community interfacing.  I am not going to critique code.  Rather I would hope that this post can be a good one regarding understanding some of the problems regarding community interfaces generally, whether CPAN, PGXN, or others.  The lessons apply regardless of language or environment and the critiques I offer are at a very different level than critiques of code.

So with this, I critique the P CPAN module from a community coding perspective.  This module exports a single function called "P" which acts kind of like printf and sprintf.  It would be an interesting exercise in learning some deep aspects of Perl but from a community resource perspective, it suffers from enough issues to make it a worthy case study.

The gist of this is that community resources require contemplating how something fits into the community and working with the community in mind.  A cool idea or something one finds useful is not always something that is a candidate for publishing as a community resource, at least not without modifications aimed at carefully thinking how things fit into more general processes.

Four of my own CPAN modules are actually rewrites of code that I wrote in other contexts (particularly for LedgerSMB), and rewrote specifically for publication on CPAN.  In general there is a huge gulf between writing a module for one project or one developer and writing it for everyone.  I believe, looking at P, that it is just something the developer thought was useful personally and published it as is without thinking through any of these issues.  This is all too common and so going through these I hope will prevent too many from making the same mistakes.

Namespace Issues


The name 'P' is an extraordinarily bad choice of naming for a public module.  Perl uses nested namespaces, and nesting implies a clear relationship, such as inheritance (though other relationships are possible too).

Taking a top-level namespace is generally discouraged on CPAN where a second or third level namespace will suffice.  There are times and places for top-level namespaces, for example for large projects like Moose, Moo, and the like.  In general these are brand names for conglomerates of modules, or they are functional categories.  They are not shorthand ways of referring to functionality to save typing.

'P' as a name is not helpful generally, and moreover it denies any future large project that namespace.  The project is not ambitious enough to warrant a top-level namespace.  There is no real room for sub-modules and so there are real problems with having this as a top-level module.

Proper Namespacing


It's helpful, I think, to look at three different cases for how to address namespacing.  All three of these are ones I maintain or am planning to write.  I believe they follow generally acceptable practices, although I have received some criticism for PGObject being a top-level namespace as well.

  • PGObject is a top-level namespace currently housing three other modules (perhaps ten by next year).  I chose to make it a top-level namespace because it is a framework for making object frameworks, and not a simple module.  While the top-level module is a thin "glue" module, it offers services which go in a number of different directions, defying simple categorization.

    Additionally the top-level module is largely a framework for building object frameworks, which complicates the categorization further.  In this regard it is more like Moose than like Class::Struct.  Sub-modules include PGObject::Simple (a simple framework using PGObject, not a simple version of PGObject), PGObject::Simple::Role, and PGObject::Type::BigFloat.
  • Mail::RoundTrip is a module which allows web-based applications to request email verification by users.  The module offers only a few pieces of functionality and is not really built for extensibility.  This should not be a top-level module.
  • Device::POS::Printer is a module I have begun to write for point of sale printers, providing a basic interface for printing, controlling cash drawers, getting error messages, etc.  The module is likely to eventually have a large number of sub-modules, drivers for various printers etc, but putting Device:: in front does no real harm and adds clarity.  There's no reason to make it a top-level namespace.

The main point is thinking about how your module will fit into the community, how it will be found, etc.  'P' is a name which suggests these have not been considered.

Exports

The P module exports a single function, P() which functions like printf and sprintf.  The major reason for this, according to the author, is both to add additional checking and to save typing.  Saving typing is not a worthy goal by itself, though neither is verbosity.  Condensing a function which takes over two different functions to a single letter, however, is not a recipe for good, readable code.  If others follow suit, you could get code like this:

P(A(R("This is a string", 3)));

Now maybe this code is supposed to print the ASCII representation of "This is a string" repeated three times.  However that is not obvious from the code, leading to code that is hard to read or debug.

Proper Exports 


In Perl, exports affect the language.  Exports are thus to be used sparingly as they can lead to conflicts which can lead to hard to maintain code.  Exports should be rare, well documented, and not terribly subject to name collision.  They should also be verbose enough they can be understood without tremendous prior knowledge of the module.  P() as an exported function meets none of these criteria.

A good example of exports done right would be a function like has() used by Moose, Mouse, and Moo.  The function is exported and used to declaratively define object properties.  The convention has become widespread because it is obvious what it does.  Again this does not matter so much for personal projects, but it does for published modules on a community repository.
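By way of contrast, here is a sketch of a module that follows these guidelines using the core Exporter module: nothing is exported by default, and the one exportable function has a name that says what it does.  My::Text::Report and format_report_line are hypothetical names invented for the example.

```perl
use strict;
use warnings;

# Hypothetical module: opt-in export of one descriptively named function.
package My::Text::Report {
    use Exporter 'import';
    our @EXPORT_OK = qw(format_report_line);

    # Format the values with a printf-style format and add a newline.
    sub format_report_line {
        my ($format, @values) = @_;
        return sprintf($format, @values) . "\n";
    }
}

package main;

# The caller must ask for the function by name.  In a separate file this
# would be: use My::Text::Report qw(format_report_line);
My::Text::Report->import('format_report_line');

my $line = format_report_line('%-10s %5d', 'widgets', 3);
print $line;
```

Because the export is opt-in, a reader can find where format_report_line comes from by looking at the import list, and a name collision only happens if the caller asks for it.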

Test Failure Report Management


 The CPANTesters page for P shows that every version on CPAN has had test failures.  This is unusual.  Most modules have clear passes most of the time.  Even complex modules like DBD::Pg show a general attention to test failures and a relatively good record.  A lack of this attention shows a lack of interest in community use, and that fixes to test failures, needed for people to use the library, are just not important.  So if you manage a module, you really want to take every test failure seriously.


Conclusions


Resources like CPAN, CTAN, PGXN, and the like are subject to one important rule.  Just because it is good for your own personal use does not make it appropriate for community publication as a community resource.  Writing something that fits the needs of a specific project, or a specific coder's style is very different from writing something that helps a wide range of programmers solve a wide range of problems.  These community resources are not places to upload things just because one wrote them.  They are places to interface with the community through work.  Getting pre-review, post-review, and paying attention to the needs of others is critically important.