Perspectives on LedgerSMB: Application vs Database Programming

Saturday, February 25, 2012

Application vs Database Programming

A few years ago, I had a problem. A database routine for processing bulk payments for LedgerSMB that was humming along in my test cases was failing when tested under load with real data prior to implementation. Some testing showed that while it ran fine with small numbers of inputs, it slowed way down as the number of inputs rose. Andrew Sullivan, who was also on the project through a different company, suggested it was something called a "cache miss" because I was running through a loop, inserting two rows, and then updating a different table. Although he quickly abandoned that theory, his initial assessment was dead right, and within a day I had a fix for the problem.

That is where I learned why database and application programming are so different. I have seen the same mistake I made in contributions from many other developers. I have therefore concluded that it is a universal mistake.

When we program applications, we think in terms of instructions. We break a problem down into instructions, and we order those to get the right result. We usually favor simple, elegant code over more complex code and therefore try to make things as simple as possible. This was what I was doing, and it failed miserably. I have also had the misfortune of looking through stored procedures several hundred lines long that were obviously written by application engineers, and where the ability to think in db-land was missing. Such stored procedures are not maintainable and usually contain a lot of hidden bugs.

In database-level routines, however, it is extremely important to think in set operations. A clear, complex query will be both easier to read and easier to maintain than a set of smaller, simpler queries. The database is an important math engine. Use it as one. Consequently, simplicity and elegance in a stored procedure is often the ability to express the most complicated operations in the fewest database queries possible. Because SQL queries are highly structured (SELECT queries involve a column list, followed by a set of tables linked by JOIN operations, followed by filter operations, etc.), it is very easy to track down problems, whether those are bugs or performance issues.

With smaller operations, like I had been doing, the database cannot optimize the operation as a whole, and one may have to contend with problems caching results in an effective way, or the like. Worse, the operations are not linked together in a predictable way and therefore problems are harder to track down when they involve specific inputs or concurrent operations.

The way we think when we code SQL (including PL/pgSQL) and when we code Perl, Python, or C is just fundamentally different.

22 comments:

Adrian Klaver, February 25, 2012 at 12:34 PM
So true. I actually learned this the other way around. I had a stored function that used a fairly complex SQL query as the basis for further work. At the time my understanding of SQL was quite rudimentary and I had arrived at the query through trial and error as much as anything else. I figured if I rewrote the query logic in Python (which I understood better) it would be more compact and understandable. That turned out to be false. I quickly eclipsed the lines of code in the original function and I was still a long way from encapsulating all its logic. That was when I truly understood how the set nature of SQL could be beneficial.
py3bdk, February 25, 2012 at 1:31 PM
I have seen some MySQL stored functions written by PHP programmers (I don't understand their PHP code): a lot of cursors declared for separate selects, then an open and a fetch (without a loop!) for each cursor (only one row per cursor). And they say that a DBA and pure data modelling are not necessary. I work at a federal agency in my country. The organization as a whole has no DBA! Am I crazy enough?
David T. Macknet, March 4, 2012 at 7:04 AM
I totally agree - but set operations can be rather intimidating for application programmers, because they're just not used to them. So when they glimpse a 5,000-line stored procedure, they can easily freak out and think that somehow it would be more elegant (and, therefore, better) to do it with cursors or loop structures from the UI, rather than letting the engine do its job.

That 5K-line stored procedure, though, will scale where their loops would be sitting around for days. (The one I have in mind is one I wrote to process invoices, and it doesn't care if you give it one or 1,000; it chunks through them in only marginally more time.)
ju, March 4, 2012 at 7:12 AM
Not sure that I understand you, but I prefer to take load off the database and move it to the application (PHP, Python), because the database is usually the bottleneck. Am I right?
Anonymous, March 4, 2012 at 7:57 AM
The unfortunate thing is that junior programmers are going to read your blog post and walk away with the thought that 1,500 line stored procedures are the way to go.

You've learned an important lesson... sometimes the right thing to do is to act on small components of data in SQL and mash the components up in code. Sometimes, for optimization purposes, the right thing to do is a more complex operation in a stored procedure.

What is never the right thing though is to pick up a hammer and think everything looks like a nail. Don't focus on there being a single answer of code or SQL being better. I've inherited code bases of 3,000+ lines of code in each stored procedure with dozens of if x do 1, if y do 2, etc.
JS, March 4, 2012 at 9:09 AM
I've never come across a case where filtering or joining couldn't be done with more consistency, faster and more elegantly by writing a longer single query...or where those things had to be done algorithmically. *However*, just to play devil's advocate here, there are some situations where consistency and operation speed matter less than preventing your DB from being overwhelmed for a period of time crunching a very large query. You can't assume either unlimited DB resources, or that whatever query we're talking about is the most important query that's going to be running on the machine. One example is generating an annual report from a DB that's getting constant inserts and updates, where 10M+ rows need to be summed and averaged in groups by date, month, and year, and then broken down into product categories and then individual items, each of which has to be grouped in the same several ways. Yes you can write a single query that does it, but there's a good chance customers wouldn't be able to place orders on the website while it was running. Yes you could run it off a slave, but let's say our slave's already busy enough serving customer requests. In that case it might be a lot better to break up the query into manageable chunks that, for instance, mirror the DB's partition structure in a bunch of low priority selects, possibly even to collate all the data in the application layer as each chunk comes in, even if it's 10x slower and inconsistent, because then the strain's on a separate application server, and your DB is free to handle more time-sensitive traffic.

I'm just saying I don't think one size fits all... what's important is to understand the strengths and weaknesses of both the application layer and the DB, how to leverage each one to the greatest advantage, and where you're placing the load in a given operation. Then you can use the best tool at your disposal to balance one operation's priorities against the other stresses on the system as a whole.
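As a rough illustration of the chunking trade-off described in this comment, the sketch below uses Python's standard-library sqlite3 module with a hypothetical orders table; the table, column, and function names are assumptions made for the example. One function computes per-category totals in a single query; the other issues one small query per date range and merges the partial results in the application. It carries sums and counts rather than averages, because averaging per-chunk averages would give the wrong answer.

```python
import sqlite3
from collections import defaultdict


def make_db():
    # Hypothetical data, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_date TEXT, category TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
        ("2011-01-15", "books", 20),
        ("2011-01-20", "toys", 35),
        ("2011-02-03", "books", 15),
        ("2011-02-17", "books", 5),
        ("2011-03-09", "toys", 40),
    ])
    return conn


def totals_single_query(conn):
    # One statement: the database sees, and can plan, the whole operation.
    rows = conn.execute(
        "SELECT category, SUM(amount), COUNT(*) FROM orders GROUP BY category")
    return {cat: (total, n) for cat, total, n in rows}


def totals_chunked(conn, boundaries):
    # One small statement per [start, end) date range; partial sums and
    # counts are merged in the application layer.
    totals = defaultdict(lambda: [0, 0])
    for start, end in zip(boundaries, boundaries[1:]):
        rows = conn.execute(
            "SELECT category, SUM(amount), COUNT(*) FROM orders "
            "WHERE order_date >= ? AND order_date < ? GROUP BY category",
            (start, end))
        for cat, total, n in rows:
            totals[cat][0] += total
            totals[cat][1] += n
    return {cat: (total, n) for cat, (total, n) in totals.items()}


if __name__ == "__main__":
    conn = make_db()
    bounds = ["2011-01-01", "2011-02-01", "2011-03-01", "2011-04-01"]
    print(totals_chunked(conn, bounds) == totals_single_query(conn))
```

On a static table the two agree. On a table receiving constant writes, each chunk sees a different moment in time, which is the inconsistency the comment accepts in exchange for keeping each query small.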
dkitchen, March 4, 2012 at 10:02 AM
I think this is why I'm enjoying LINQ to SQL (and related technologies) in .NET. You can break a query down into smaller bits, each one handling a particular concern. Thanks to lazy evaluation, the small steps get evaluated at the end as one big SQL declaration (or at least the potential is there for this to happen). Kind of the best of both worlds. I suspect this approach can also create new problems, but so far I'm pleasantly surprised how well this works.
Christopher Smith, March 4, 2012 at 12:30 PM
What you're really talking about here is the difference between declarative/functional programming and procedural programming.

In the functional/declarative context, most work is most easily expressed pretty succinctly, so you only need to get into complexity if you have to tweak performance because of how that declaration has been transformed into a procedure.

It's important to express this semantic difference, because SQL *isn't* a purely functional/declarative language, so as someone else commented, one could misunderstand what you are getting at and run off writing hideously evil and complex stored procedures.

There is, however, a way to do SQL programming that allows you to break down the work into small pieces without breaking the whole set theory approach to things, and I wish DB developers would employ it more. I refer to it as "view oriented programming". Instead of having a horribly complex SELECT statement, you can break it up into steps by building views, and layering views on top of each other. As long as you have a half-decent query engine, it should translate selections on the views into the same underlying query as the complex select, but now you can easily break down your work into steps *and* you can easily look at the steps along the way, so identifying bugs is so much simpler.
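A minimal sketch of the layered-view idea, again using Python's standard-library sqlite3 module as a stand-in database. The schema and the view names (invoice_paid, invoice_balance, customer_balance) are hypothetical, chosen only for the example. Each view is one step that can be queried on its own, and the final view returns the same rows as an equivalent single SELECT. Whether a given engine plans the two forms identically is an assumption this sketch does not check.

```python
import sqlite3


def build_db():
    # Hypothetical schema and data, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE invoice (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER);
        CREATE TABLE payment (invoice_id INTEGER, amount INTEGER);
        INSERT INTO invoice VALUES (1, 'acme', 100), (2, 'acme', 50), (3, 'globex', 70);
        INSERT INTO payment VALUES (1, 60), (1, 40), (3, 20);

        -- Step 1: total paid per invoice.
        CREATE VIEW invoice_paid AS
            SELECT invoice_id, SUM(amount) AS paid
              FROM payment
             GROUP BY invoice_id;

        -- Step 2: outstanding balance per invoice, built on step 1.
        CREATE VIEW invoice_balance AS
            SELECT i.id, i.customer, i.amount - COALESCE(p.paid, 0) AS balance
              FROM invoice i
              LEFT JOIN invoice_paid p ON p.invoice_id = i.id;

        -- Step 3: outstanding balance per customer, built on step 2.
        CREATE VIEW customer_balance AS
            SELECT customer, SUM(balance) AS balance
              FROM invoice_balance
             GROUP BY customer;
    """)
    return conn


def layered(conn):
    # Query the top view; the intermediate views stay available for debugging.
    return conn.execute(
        "SELECT customer, balance FROM customer_balance ORDER BY customer"
    ).fetchall()


def monolithic(conn):
    # The same result written as one SELECT with an inline subquery.
    return conn.execute("""
        SELECT i.customer, SUM(i.amount - COALESCE(p.paid, 0)) AS balance
          FROM invoice i
          LEFT JOIN (SELECT invoice_id, SUM(amount) AS paid
                       FROM payment
                      GROUP BY invoice_id) p ON p.invoice_id = i.id
         GROUP BY i.customer
         ORDER BY i.customer
    """).fetchall()


if __name__ == "__main__":
    conn = build_db()
    print(layered(conn) == monolithic(conn))
    # Any intermediate step can be inspected directly:
    print(conn.execute("SELECT * FROM invoice_balance ORDER BY id").fetchall())
```

If a customer total looks wrong, selecting from invoice_balance or invoice_paid directly shows which step introduced the error.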
theo012887@gmail.com, April 21, 2013 at 9:36 PM
This post is really great; it has given me a great new idea. Thanks for sharing this.

Best regards
Richard scoth
private server