Saturday, February 25, 2012

Application vs Database Programming

A few years ago, I had a problem.  A database routine for processing bulk payments in LedgerSMB hummed along in my test cases but failed when tested under load with real data prior to deployment.  Some testing showed that while it ran fine with small numbers of inputs, it slowed down dramatically as the number of inputs rose.  Andrew Sullivan, who was also on the project through a different company, suggested the culprit was something called a "cache miss," because I was running through a loop, inserting two rows, and then updating a different table.  Although he quickly abandoned that theory, his initial assessment was dead right, and within a day I had a fix to the problem.
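The routine was doing something like the following (a simplified, hypothetical sketch; the table and column names are made up and are not the actual LedgerSMB schema):

    -- Row-at-a-time processing: loop over the payments, issuing several small
    -- statements per iteration.
    CREATE OR REPLACE FUNCTION process_bulk_payment(in_payment_ids int[])
    RETURNS void LANGUAGE plpgsql AS $$
    DECLARE
        pid int;
    BEGIN
        FOR pid IN SELECT unnest(in_payment_ids) LOOP
            -- two inserts per payment...
            INSERT INTO acc_trans (payment_id, amount, side)
            SELECT id, amount, 'debit' FROM payment WHERE id = pid;

            INSERT INTO acc_trans (payment_id, amount, side)
            SELECT id, -amount, 'credit' FROM payment WHERE id = pid;

            -- ...then an update of a different table on every iteration
            UPDATE invoice
               SET paid = true
             WHERE id = (SELECT invoice_id FROM payment WHERE id = pid);
        END LOOP;
    END;
    $$;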

That experience taught me why database and application programming are so different.  I have since seen the same mistake I made in contributions from many other developers, so I have concluded that it is a nearly universal one.

When we program applications, we think in terms of instructions.  We break a problem down into instructions, and we order those instructions to get the right result.  We usually favor simple, elegant code over more complex code, and therefore tend to make each step as simple as possible.  That is what I was doing, and it failed miserably.  I have also had the misfortune of looking through several-hundred-line stored procedures that were obviously written by application engineers and that showed no ability to think in db-land.  Such stored procedures are not maintainable and usually contain a lot of hidden bugs.

In database-level routines, however, it is extremely important to think in set operations.  A clear, complex query will often be both easier to read and easier to maintain than a set of smaller, simpler queries.  The database is an important math engine; use it as one.  Consequently, simplicity and elegance in a stored procedure often mean expressing the most complicated operations in the fewest database queries possible.  Because SQL queries are highly structured (a SELECT involves a column list, followed by a set of tables linked by JOIN operations, followed by filter operations, etc.), it is very easy to track down problems, whether those are bugs or performance issues.

With many small operations, like the ones I had been writing, the database cannot optimize the work as a whole, and you may have to contend with problems such as caching results effectively.  Worse, the operations are not linked together in a predictable way, so problems are harder to track down when they involve specific inputs or concurrent operations.
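In my case the fix was to collapse the loop into set operations.  A simplified sketch of the shape of that change (again with made-up names, not the actual LedgerSMB code) might look like this:

    -- Set-at-a-time: the same work expressed over the whole input at once.
    CREATE OR REPLACE FUNCTION process_bulk_payment(in_payment_ids int[])
    RETURNS void LANGUAGE plpgsql AS $$
    BEGIN
        -- Both ledger rows for every payment, in a single statement
        INSERT INTO acc_trans (payment_id, amount, side)
        SELECT p.id, p.amount * s.sign, s.side
          FROM payment p
         CROSS JOIN (VALUES (1, 'debit'), (-1, 'credit')) AS s (sign, side)
         WHERE p.id = ANY (in_payment_ids);

        -- One pass over the other table instead of one UPDATE per iteration
        UPDATE invoice i
           SET paid = true
          FROM payment p
         WHERE p.invoice_id = i.id
           AND p.id = ANY (in_payment_ids);
    END;
    $$;

Whatever the real routine looked like, the shape of the change is the point: a couple of set-oriented statements the planner can see and optimize as a whole, instead of thousands of tiny ones it cannot.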

The way we think when we write SQL (including PL/pgSQL) and the way we think when we write Perl, Python, or C are just fundamentally different.

22 comments:

  1. So true. I actually learned this the other way around. I had a stored function that used a fairly complex SQL query as the basis for further work. At the time my understanding of SQL was quite rudimentary and I had arrived at the query through trial and error as much as anything else. I figured if I rewrote the query logic in Python (which I understood better) it would be more compact and understandable. That turned out to be false. I quickly eclipsed the lines of code in the original function and I was still a long way from encapsulating all its logic. That was when I truly understood how the set nature of SQL could be beneficial.

    Replies
    1. That's an interesting story. If you don't mind, how big was the function?

      I have worked with 200+ line SQL queries. These can be either well written or poorly written. Poorly written SQL queries of this sort can be a PAIN. Well written queries of this length are a joy to work with.

    2. The function I stuck with is about 100 lines; about 50 of that is the SQL. I format for readability, so the counts are probably on the high side. At the time it was bigger than anything I had done. The important part, to me, was that the SQL made it easy to boil down groups of data into the subset I really wanted. When I went the pl/pythonu route, I had to pull the data into the function and replicate the filtering the SQL provided. It was doable, it just involved a lot more code.

    3. I will also tell you I have seen a lot of cases where people did not put enough into their SQL queries, and this killed performance.

      FWIW, filtering in SQL is extremely important because it keeps memory overhead lower and allows you to use indexes. It may seem to work OK on small sets... but I have seen people retrieve a million rows and then check them against criteria in Perl when a HAVING clause would have been far more helpful.
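      For example, something like this (made-up table and column names) keeps both the aggregation and the filter in the database, instead of dragging every row into Perl:

        -- Only the groups that pass the filter ever leave the database, and the
        -- work stays where the indexes are instead of in application memory.
        SELECT customer_id, sum(amount) AS total_paid
          FROM payment
         GROUP BY customer_id
        HAVING sum(amount) > 10000;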

  2. I have seen some MySQL stored functions written by PHP programmers (I don't understand their PHP code): a lot of cursors declared for separate SELECTs, then an open and fetch (without a loop!) for each cursor (only one row per cursor). And they say that a DBA and proper data modelling are not necessary. I work at a federal agency in my country, and the organization as a whole has no DBA! Am I crazy enough?

  3. I totally agree - but set operations can be rather intimidating for application programmers, because they're just not used to them. When they glimpse a 5,000-line stored procedure they can easily freak out and think that somehow it would be more elegant (and, therefore, better) to do it with cursors or loop structures from the UI, rather than letting the engine do its job.

    That 5K-line stored procedure, though, will scale where their loops would be sitting around for days. (The one I have in mind is one I wrote to process invoices, and it doesn't care whether you give it one or 1,000; it chunks through them in only marginally more time.)

  4. Not sure that I understand you, but I prefer to take load off the database and move it to the application (PHP, Python), because the database is usually the bottleneck. Am I right?

    Replies
    1. Part of the point is to try to let the database do what it is good at as well as it can. So select queries should do as much filtering as possible in the db, etc.

      I have seen people put *crazy* things in stored procedures (hmmm, let's pause and open a network connection to the user to ask whether we should proceed...), but in general, the more logic you can put in the database regarding selection and storage of information, the better off you will be.

      I have also seen the flip side, where a HAVING clause was simulated in the application, and this caused major issues when it came to working with large sets of data...

  5. The unfortunate thing is that junior programmers are going to read your blog post and walk away with the thought that 1,500-line stored procedures are the way to go.

    You've learned an important lesson... sometimes the right thing to do is to act on small components of data in SQL and mash the components up in code. Sometimes, for optimization purposes, the right thing to do is a more complex operation in a stored procedure.

    What is never the right thing, though, is to pick up a hammer and decide that everything looks like a nail. Don't assume there is a single answer about whether code or SQL is better. I've inherited code bases with 3,000+ lines of code in each stored procedure and dozens of "if x do 1, if y do 2" branches.

    Replies
    1. Actually, the real lesson is to put as much into *single queries* as you can.

      And the problem is that if you just write a 1,500-line stored procedure as an application programmer, you will have many more problems, because you are thinking in instructions rather than sets.

      Now, if you have to have a 1,500-line SQL query, you have other problems O.o

    2. This comment thread would not be complete without a link to http://thedailywtf.com/Articles/The-Query-of-Despair.aspx

  6. I've never come across a case where filtering or joining couldn't be done more consistently, faster, and more elegantly by writing a longer single query... or where those things had to be done algorithmically.

    *However*, just to play devil's advocate here, there are some situations where consistency and operation speed matter less than preventing your DB from being overwhelmed for a period of time crunching a very large query. You can't assume either unlimited DB resources, or that whatever query we're talking about is the most important query that's going to be running on the machine.

    One example is generating an annual report from a DB that's getting constant inserts and updates, where 10M+ rows need to be summed and averaged in groups by date, month, and year, and then broken down into product categories and then individual items, each of which has to be grouped in the same several ways. Yes, you can write a single query that does it, but there's a good chance customers wouldn't be able to place orders on the website while it was running. Yes, you could run it off a slave, but let's say our slave's already busy enough serving customer requests. In that case it might be a lot better to break up the query into manageable chunks that, for instance, mirror the DB's partition structure in a bunch of low-priority selects, possibly even collating all the data in the application layer as each chunk comes in, even if it's 10x slower and inconsistent, because then the strain's on a separate application server, and your DB is free to handle more time-sensitive traffic.

    I'm just saying I don't think one size fits all... what's important is to understand the strengths and weaknesses of both the application layer and the DB, how to leverage each one to the greatest advantage, and where you're placing the load in a given operation. Then you can use the best tool at your disposal to balance one operation's priorities against the other stresses on the system as a whole.

    Replies
    1. "One example is generating an annual report from a DB that's getting constant inserts and updates, where 10M+ rows need to be summed and averaged in groups by date, month, and year, and then broken down into product categories and then individual items, each of which has to be grouped in the same several ways."

      Seems like an issue with the implementation of the database not matching your needs. It's entirely possible to have optimistic MVCC that should allow a query like this to move along without interfering with other people's queries. Even without that, if you have a DB with a partitioning structure like you describe and the query is structured properly, it should take advantage of that partition structure to achieve the same results you want to accomplish by breaking the query up into steps.

      I also don't understand how you discard the notion of running the job on a slave because "it is already busy" but like the approach you describe merely because it lets you load up a separate application server. What is the difference between deploying an application server vs. another slave, particularly if the slave can apparently do the job 10x more efficiently?

      Maybe it's not one size fits all, but it sounds like your straw man is more a case of that size being "DB implementation" rather than database vs. application programming. ;-)

    2. This comment has been removed by the author.

    3. Yeah, you got me. Possibly not the best example, for the reasons you described. First of all, I'm speaking from the POV of someone maintaining application code written in PHP 4 / MySQL 4.x, where the DB doesn't even support partitioning and queries rely heavily on foreign keys with constant battles for row-level locks. I threw in that partitioning thing more as wishful thinking. And no one's about to pay to rewrite the app or database code.

      The real-world version of the above scenario is a little more complicated: first, it's not a website but a POS app used in a bunch of physical locations, so it's very heavy on writes; and second, there's no slave, and the owner doesn't want to pay for a second server but wanted to be able to run these massive reports at will. It was apparent from the outset that a 100-ish-line multi-join query was causing deadlocks against the live DB at least often enough to be a problem.

      The answer, funny or not, was to stick the application code for the reports on a cheap VPS elsewhere that I happened to already be paying for to serve a few mostly static sites, and to break the queries up with a fast index as the axis and the fewest group statements possible, doing all the grouping and summing on the application side. The VPS wasn't processing a lot of dynamic traffic. It didn't seem worth it to slow down the live version by setting up a slave over there, nor did I want to set one up on my VPS; so this way my VPS bill gets paid for, the CEO of the company gets to run his huge reports every six hours without interfering with their InnoDB setup, and there's no need to deal with slave configuration in MySQL or rewrite the app code accordingly until they have at least twice as many stores, which probably won't be for a few years.

      Real life's messy when the client wants to save $30k, which they usually do regardless of warnings from IT consultants. So I'm a big fan of being flexible in how I approach a problem like this, and it's good to know how to jettison any kind of best practice when you're faced with "Every time I run this report, the POSs at the stores freeze for 15 seconds" vs. "you mean it's gonna cost me five figures to fix that?"

      BTW, I wasn't saying that Chris was wrong here. On the whole I think it's a great point, and no one in their right mind should be writing recursive or algorithmic queries that could be consolidated, if at all possible. I *do* rely on the DB as a math engine whenever possible. I'm really just playing devil's advocate, but it's because I've had to hack around my own best instincts often enough to learn that, if you *do* face a situation where time/money/etc. forces you to fragment your queries, there are circumstances where it can save the day.

    4. For the annual report issue, I think you are approaching it from the wrong perspective.

      I think it is completely true that you cannot expect unlimited db resources. However, at the same time, I think it is perfectly reasonable to expect that a properly maintained and tuned database is capable of sharing what resources it has so that a single overwhelming query doesn't bring the whole system down.

      Now, if you have such a high-traffic database, is it unreasonable to add a hot-standby which can be used for such a reporting task?

    5. Josh: MySQL 4.x sucks horribly and reasonable database assumptions don't apply to it. Need we go further than that?

      POS applications are also a bit unusual because they are extremely performance-sensitive. 30 sec. may be marginally acceptable when someone in AR is reviewing invoices, but it is near-business-stopping for a POS. In those environments, make sure you have a reasonable db (PostgreSQL ;-) ) and sufficient resources to run all your queries in reasonable time even with some I/O contention and one core tied up by a big report.

  7. I think this is why I'm enjoying LINQ to SQL (and related technologies) in .net. You can break a query down into smaller bits, each one handling a particular concern. Thanks to lazy evaluation, the small steps get evaluated at the end as one big SQL declaration (or at least the potential is there for this to happen). Kind of the best of both worlds. I suspect this approach can also create new problems, but so far I'm pleasantly surprised how well this works.

  8. What you're really talking about here is the difference between declarative/functional programming and procedural programming.

    In the functional/declarative context, most work can be expressed pretty succinctly, so you only need to get into complexity if you have to tweak performance because of how that declaration has been transformed into a procedure.

    It's important to make this semantic distinction, because SQL *isn't* a purely functional/declarative language, so, as someone else commented, one could misunderstand what you are getting at and run off writing hideously evil and complex stored procedures.

    There is, however, a way to do SQL programming that allows you to break the work down into small pieces without abandoning the whole set-theory approach, and I wish DB developers would employ it more. I refer to it as "view oriented programming". Instead of having a horribly complex SELECT statement, you can break it up into steps by building views and layering views on top of each other. As long as you have a half-decent query engine, it should translate selections on the views into the same underlying query as the complex SELECT, but now you can easily break your work down into steps *and* you can easily look at the steps along the way, so identifying bugs is much simpler.
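    A tiny hypothetical sketch of the idea (all the table and view names are made up):

      -- Each view is one small, inspectable step...
      CREATE VIEW order_lines_extended AS
      SELECT o.id AS order_id, o.customer_id,
             l.quantity * l.unit_price AS line_total
        FROM orders o
        JOIN order_lines l ON l.order_id = o.id;

      CREATE VIEW order_totals AS
      SELECT order_id, customer_id, sum(line_total) AS order_total
        FROM order_lines_extended
       GROUP BY order_id, customer_id;

      -- ...but a decent planner flattens a query against the top view into one
      -- plan, the same as if the whole SELECT had been written by hand.
      SELECT customer_id, sum(order_total) AS lifetime_value
        FROM order_totals
       GROUP BY customer_id;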

    Replies
    1. In general, I think the maximum length of a maintainable SQL query is probably around the 150-200 line mark, and I don't like inline views because they make it harder to follow the logic. Common table expressions and pre-defined views help with that a great deal...

      The only issue is that there are cases where CTEs and views pose optimization issues. CTEs, because the set is stable within the query, are optimization fences: they must be optimized, and run, without knowledge of the rest of the query. As for views, I have occasionally run into trouble doing UNIONs between views which each pulled substantial numbers of rows from the same large table. This led to multiple sequential scans...

      So care must be taken to understand that, from a performance perspective, there are times when you have to really think through what the planner is doing...
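      A quick illustration of the CTE fence (hypothetical table and column names; this reflects how PostgreSQL currently plans CTEs):

        -- The CTE is planned and executed on its own, so the outer filter is not
        -- pushed inside it: the whole table gets scanned and materialized first.
        WITH t AS (SELECT * FROM acc_trans)
        SELECT * FROM t WHERE chart_id = 42;

        -- The equivalent inline subquery lets the planner push the filter down
        -- and use an index on chart_id.
        SELECT *
          FROM (SELECT * FROM acc_trans) AS t
         WHERE chart_id = 42;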

  9. This post is really great; it has given me new and great ideas. Thanks for sharing this


    Best regards
    Richard scoth
    private server
