Monday, January 2, 2012

A Reply to Tony Marston's advocacy of unintelligent databases

Tony Marston has published an interesting critique of my posting about why intelligent databases are helpful.  The response is thought-provoking and I suggest my readers read it.  Nonetheless I believe that the use cases where he is correct are becoming narrower over time, and so I will explain my thoughts here.

A couple of preliminary points must be made, though.  There is a case to be made for intelligent databases, for NoSQL, and for all sorts of approaches.  None of these is as close to a magic bullet as some proponents would like to think, and there will be cases where each approach wins out, because design is the art of endlessly making tradeoffs.  My approach has nothing to do with what is proper; it has to do instead with preserving the ability to use the data in other ways for other applications, as I see this as the core strength of relational database management systems.  Moreover, I used to agree with Tony, but I have since changed my mind, primarily because I have begun working with data environments where the stack assumption for application design doesn't work well.  ERP is a great example, and I will go into why below.

Of course, in some cases data doesn't need to be reused, yet an RDBMS may still be useful.  This is usually the case where ad hoc reporting is a larger requirement than scalability and rapid development.  In other cases where data reuse is not an issue, an RDBMS really brings nothing to the table, and NoSQL solutions of various sorts may be better.

My own belief is that a large number of database systems out there operate according to the model of one database to many applications.  I also think that the better this model is understood, the more even single-application database designers can design with it in mind, or at least ask whether this is the direction they want to go.

And so with the above in mind, on to the responses to specific points:

Well, isn't that what it is supposed to be? The database has always been a dumb data store, with all the business logic held separately within the application. This is an old idea which is now reinforced by the single responsibility principle. It is the application code which is responsible for the application logic while the database is responsible for the storing and retrieval of data. Modern databases also have the ability to enforce data integrity, manage concurrency control, backup, recovery and replication while also controlling data access and maintaining database security. Just because it is possible for it to do other things as well is no reason to say that it should do other things.

The above opinion works quite well in a narrow use case, namely where one application and only one application uses the database.  In this case, the database tables can be modelled more or less directly on the application's object model, and then the only thing the RDBMS brings to the table is some level of ad hoc reporting, at the cost of added development time and complexity compared to some NoSQL solutions.  (NoSQL solutions are inherently single-application databases and therefore are not usable where the considerations below exist.)

The problem is that when you start moving from single applications into enterprise systems, the calculation becomes very different.  It is not uncommon to have several applications sharing a single database, and the databases often must be designed to make it quick and easy to add new applications on top of the same database.

However, as soon as you make this step, something awful happens if you try to use the same approach used for single-application databases: because the database is based on the first application's object model, all subsequent applications must have intimate knowledge of the first application's object model, which is something of an anti-pattern.

A second major problem appears at the same time: while a single application can probably be trusted to enforce meaningful data constraints, applications can't and shouldn't trust each other to input meaningful data.  Where many applications are each responsible for checking all inputs for sanity and ensuring that only meaningful data is stored, the chance of an oversight is very high, and the consequences can be quite severe.  Therefore things like check constraints suddenly become irreplaceable when one makes the jump from one data entry application to two against the same database.
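To make this concrete, here is a minimal sketch of a check constraint doing that work.  It uses Python's sqlite3 module so it is self-contained, and the invoice table and its columns are invented for illustration; the CHECK syntax is essentially the same in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The constraint lives in the schema, so every application that writes to
# this table inherits it -- no application can forget to enforce it.
conn.execute("""
    CREATE TABLE invoice (
        id     INTEGER PRIMARY KEY,
        amount NUMERIC NOT NULL CHECK (amount > 0)
    )
""")

conn.execute("INSERT INTO invoice (amount) VALUES (100)")     # accepted
try:
    conn.execute("INSERT INTO invoice (amount) VALUES (-5)")  # rejected
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

No matter which of the data entry applications issues the second INSERT, the database refuses it.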


Just because a database has more features does not mean that you should bend over backwards to use them. The English language contains a huge number of words that I have never used, but does that make my communication with others unintelligible? I have used many programming languages, but I have rarely used every function that has ever been listed in the manual, but does that mean that my programs don't work? With a relational database it is possible to use views, stored procedures, triggers or other advanced functionality, but their use is not obligatory. If I find it easier to do something in my application code, then that is where I will do it.

Indeed.  However, if you aren't thinking in terms of what the RDBMS can do, but rather just of it as a dumb storage layer, you will miss out on these features when they would be of benefit.  Therefore it is important to be aware of these costs when asking "should my database be usable by more than one application?"

That question becomes surprisingly useful when paired with questions like "should data be able to be fed into the database by third-party tools?"

Where did this idea come from? The application has to access the database at some point in time anyway, so why do people like you keep insisting that it should be done indirectly through as many intermediate layers as possible? I have been writing database applications for several decades, and the idea that when you want to read from or write to the database you should go indirectly through an intermediate component instead of directly with an SQL query just strikes me as sheer lunacy. This is a continuation of the old idea that programmers must be shielded completely from the workings of the database, mainly because SQL is not objected oriented and therefore too complicated for their closed minds. If modern programmers who write for the web are supposed to know HTML, CSS and Javascript as well as their server-side language, then why is it unreasonable for a programmer who uses a database to know the standard language for getting data in and out of a database?

The idea comes from the necessity of ensuring that the database is useful to more than one application.  Once that happens, the issue becomes one of components having too much intimate knowledge of each other's data structures.

However one thing I am not saying is that SQL-type interfaces are categorically out.  It would be a perfectly reasonable approach to encapsulation to build updatable views matching the object models of each application, and indeed reporting frameworks perhaps should be built this way to the extent possible, at least where multiple applications share a database.  The problem this solves is accommodating changes to the schema required by one application but not required by a second application.  One could even use an ORM at that point.
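As a sketch of what such an encapsulating view might look like (again sqlite3 for self-containment, with invented names; in PostgreSQL of that era this would be a view made writable with a rule or an INSTEAD OF trigger):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- application-neutral storage, shared by every application
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        legal_name  TEXT NOT NULL
    );

    -- a view shaped like one application's object model
    CREATE VIEW app_customer AS
        SELECT customer_id AS id, legal_name AS name
          FROM customers;

    -- make the view writable, so the application never touches the base table
    CREATE TRIGGER app_customer_insert INSTEAD OF INSERT ON app_customer
    BEGIN
        INSERT INTO customers (legal_name) VALUES (NEW.name);
    END;
""")

conn.execute("INSERT INTO app_customer (name) VALUES ('Acme Ltd')")
print(conn.execute("SELECT customer_id, legal_name FROM customers").fetchall())
# -> [(1, 'Acme Ltd')]
```

If the base table later changes for a second application's sake, only the view definition needs updating; the first application's consumers of app_customer are untouched.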

If your programmers have to spend large amounts of time in writing boilerplate code for basic read/write/update/delete operations for your database access then you are clearly not using the right tools. A proper RAD (Rapid Application Development) toolkit should take care of the basics for you, and should even ease the pain of dealing with changes to the database structure. Using an advanced toolkit it should possible to create working read/write/update/delete transactions for a database table without having to write a single line of SQL. I have done this with my RADICORE toolkit, so why can't you? Simple JOINs can be handled automatically as well, but more complex queries need to be specified within the table class itself. I have implemented different classes for different database servers, so customers of my software can choose between MySQL, PostgreSQL, Oracle and SQL Server.

How does this work though when you have multiple applications piping heavy read/write workloads through the same database, but where each application has unique data structure requirements and hence its own object model?

The very idea that your software data structure should be used to generate your database schema clearly shows that you consider your software structure, as produced from your implementation of Object Oriented Design (OOD) to be far superior to that of the database.....

It seems I was misunderstood.  I was saying that the generation of database structures from Rails is a bad thing.  I am in complete agreement with Tony on that part I think.

You may think that it slows down development, but I do not. You still have to analyse an application's data requirements before you can go through the design process, but instead of going through Object Oriented design as well as database design I prefer to ignore OOD completely and spend my time in getting the database right. Once this has been done I can use my RAD toolkit to generate all my software classes, so effectively I have killed two birds with one stone. It also means that I don't have to waste my time with mock objects while I'm waiting for the database to be built as it is already there, so I can access the real thing instead of an approximation. As for using an ORM, I never have the problem that they were designed to solve, so I have no need of their solution.

This is an interesting perspective.  I was comparing it, though, to the initial stages of design and development in agile methodologies.

Here's the basic thing.  Agile developers like to start out, when virtually nothing is known about the requirements, with a prototype designed to flesh out those requirements, and then refine that prototype until you get something that arguably works.

I am saying that, in the beginning at least, asking what the data being entered actually is and how to model it neutrally of business rules has a cost in developer time.  Sometimes, if a requirement to collect, say, addresses is dropped, that time will be entirely lost, so it is not really possible to say that such engineering effort is always a clear win.

I would agree though that the time spent there on looking at the data, modelling issues, and normalization is usually recouped later, so I suppose we are in agreement.


This is another area where OO people, in my humble opinion, make a fundamental mistake. If you build your classes around the database structure and have one class per table, and you understand how a database works, you should realise that there are only four basic operations that can be performed on a database table - Create, Read, Update and Delete (which is where the CRUD acronym comes from). In my framework this means that every class has an insertRecord(), getData(), updateRecord() and deleteRecord() method by default, which in turn means that I do not have to waste my time inventing unique method names which are tied to a particular class. Where others have a createCustomer(), createProduct(), createOrder() and createWhatever() method I have the ubiquitous insertRecord() method which can be applied to any object within my application. This makes use of the OO concept of polymorphism, so it should be familiar to every OO programmer. Because of this I have a single page controller for each of the methods, or combination of methods, which can be performed on an object, and I can reuse the same page controller on any object within my application. I have yet to see the same level of reuse in other application frameworks, but isn't a high level of code reuse what OOP is supposed to be about?
I am often told "but that is not the way it is done!" which is another way of saying "it is not the way I was taught". This tells me that either the critic's education was deficient, or he has a closed mind which is not open to other techniques, other methods or other possibilities.
The problem occurs when CRUD operations are not really as simple as this sounds, and when they must have complex constraints enforced across multiple data entry applications.  A good example here is posting GL transactions, where each transaction must be balanced.  This is far simpler to do in a stored procedure than anything else, because the set has emergent constraints that apply beyond the scope of a single record.  Also, if the application interface is abstracted from the low-level storage, stored procedures may be necessary as part of the mapping of views to relations.
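A sketch of the shape of that constraint, in Python with sqlite3 so it runs standalone (in a real system this check would live in a stored procedure inside the database, and the table and function names here are invented): the balance rule is a property of the whole set of lines, so it reduces to one aggregate query inside the posting transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gl_line (journal_id INTEGER, amount NUMERIC)")

def post_journal(conn, journal_id, amounts):
    """Insert a journal's lines, rejecting the whole set unless it balances."""
    with conn:  # one transaction: an imbalanced set rolls back entirely
        conn.executemany("INSERT INTO gl_line VALUES (?, ?)",
                         [(journal_id, a) for a in amounts])
        # set-based balance check, mirroring the stored-procedure approach
        (variance,) = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM gl_line WHERE journal_id = ?",
            (journal_id,)).fetchone()
        if variance != 0:
            raise ValueError("imbalanced transaction")

post_journal(conn, 1, [100, -60, -40])   # balanced: committed
try:
    post_journal(conn, 2, [100, -60])    # imbalanced: rolled back
except ValueError as e:
    print(e)
```

Because the check and the inserts share one transaction, no second application can slip an unbalanced set past it.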


I can even make changes to a table's structure, such as adding or deleting a column, or changing a column's size or type, without having to perform major surgery on my table class. I simply change the database, import the new structure into my data dictionary, and then export the changes to my application. I don't have to change any code unless the structural changes affect any business rules. My ERP application started off with 20 database tables but this has grown to over 200 over the years, so I am speaking from direct experience.

So if you are doing an ERP application and at the same time putting all business logic in your application, you do not expect any third-party applications to hit your database, right?  So really you arrive at the same basic considerations I am arguing for, except that I make the database itself the abstraction layer rather than encapsulating the db inside an abstraction layer.

If we think about it this way, it depends on what the platform is that you are developing on.  We like to have the database be that platform, and that means treating interfaces to it as an API.  Evidently you prefer to force all access to go through your application.  I think we get a number of benefits from our approach, including language neutrality for third-party components, better attention to performance on complex write operations, and more.  That there is a cost, however, cannot be denied.

I would conclude by saying that we agree on a fair number of principles of database design.  We agree normalization is important, and I would add that this leads to application-neutral data storage.  Where we disagree is where the API/third-party tool boundary should be.  I would prefer to put in the database whatever seems to belong to the general task of data integrity, storage, retrieval, and structural presentation, while leaving what we do with this data to the multiple third-party applications which utilize the same database.

9 comments:

  1. This man has very strict opinions on good database design - his article "Stored Procedures are EVIL" is popular - so nobody can expect anything different :).

    There is not only one way - intelligent databases and stored procedures are strong tools that increase performance, effectiveness, robustness, code readability, and availability, but the developer has to understand the technology. There is no one winner and no one super concept.

  2. I've blogged about this piece of insanity myself.

    To me it's a no-brainer. Any time I'm shipping data out of the database to apply some business logic to and stash the result back in the database, I'm wasting huge amounts of time. My paper at pgOpen was all about a system where we made it very fast by pushing logic down into the database, and this is a pattern I have repeated almost endlessly. Forget the theory; this is what works.

  3. While I agree that there may be cases to be made for different database types – relational, OO, NoSQL, etc – and particular circumstances may favour one type over another, it has been my experience that articles such as this, while promoting a particular approach for a particular set of circumstances, are often misread by most of the programming community to mean an approach which is to be used under all circumstances. The type of applications which I write have different needs and goals, and your approach would not be appropriate.

    I have been designing databases, and the applications which use them, for several decades. This used to be green-screen or desktop applications, but for the last decade has been for the web with PHP as my language of choice. I have developed my own framework in 3 different languages, with my latest being available as open source at http://www.radicore.org. This is an implementation of the 3 Tier Architecture with separate components in the Presentation, Business and Data Access layers. There is a separate class in the Business layer for each business entity/database table. I have a separate class in the Data Access layer for each DBMS that I currently support, so I can switch my application between MySQL, PostgreSQL, Oracle and SQL Server simply by changing a single setting in the config file. This is one reason why I keep my business rules in the application and not the database – I don’t want to have a separate copy of the business rules for each DBMS. I have no need for a DBMS which is not relational, therefore I have no need to change my approach to cater for non-relational databases.

    Using the Radicore framework I have written an ERP package which is based upon some of the database designs found in Len Silverston’s Data Model Resource Book. I built the databases, imported them into my data dictionary, exported them to produce my class files, and I instantly had sets of screens with which I could write to and read from my databases. All I had to do to make this into a proper application was to customise the screen layouts where necessary, then code the business rules to handle all the clever stuff.

    Your approach has been coloured by a need to allow other tools or applications to access your database, which has prompted you to place your business logic in the database where it can be accessed by those other tools or applications. I have no such need with any of my applications, for the following reasons:

    - My ERP package does not have just one database - I have implemented Len Silverston’s designs as a series of subsystems with one database for each subsystem. I can still access tables from different databases within a transaction, so having multiple databases does not present any disadvantages.
    - Whenever an application task/transaction needs to access a database table it does so by going through that table’s class where the request can be validated before being processed. It does not matter that the table may be “owned” by a separate subsystem as it has access to all classes for all subsystems.
    - None of my databases is ever accessed by an external tool or application. As it is a web application any external access is provided by web services, and these web services always go through my table classes.
    - In some cases I import data from CSV files, but the import process still goes though the relevant table class(es).

    Your example of a “complex” operation, such as a GL posting which must be balanced, where you say that “is far simpler to do in a stored procedure than anything else” is something that I have had to implement many times, so I can say from experience that it is easier to do in my application code than with a stored procedure.

    Perhaps my approach is only beneficial to the types of application that I write, but I would suggest that my type of application is in greater abundance than yours, therefore is not an “edge case” with limited use.

    Replies
    1. >>This is an implementation of the 3 Tier Architecture with separate components in the Presentation, Business and Data Access layers. There is a separate class in the Business layer for each business entity/database table. I have a separate class in the Data Access layer for each DBMS that I currently support, so I can switch my application between MySQL, PostgreSQL, Oracle and SQL Server simply by changing a single setting in the config file.<<

      First, I tend to think in n-tier rather than 3-tier, and also in terms of interaction within a larger environment than a simple software stack. This will be the subject of a future post. Perhaps why n-tier beats 3 tier, and why a db-centric approach offers a level of interactivity that can't be obtained as easily elsewhere.

      This being said, if you are switching between MySQL, MS SQL, PostgreSQL, and Oracle, this will shape your needs, and the intelligent database features will generally be unavailable.

      >>- My ERP package does not have just one database - I have implemented Len Silverston’s designs as a series of subsystems with one database for each subsystem. I can still access tables from different databases within a transaction, so having multiple databases does not present any disadvantages.<<

      This is probably due to your need to support MySQL, since this db doesn't support schemas? With an RDBMS that does have namespace support, it makes sense to put those in different schemas instead of different databases. Long-run, LSMB will probably go that route, but there are a few obstacles that need to be overcome first.

      >>- None of my databases is ever accessed by an external tool or application. As it is a web application any external access is provided by web services, and these web services always go through my table classes.<<

      How many of those databases are installed at client sites? How do you enforce this?

      >>Your example of a “complex” operation, such as a GL posting which must be balanced, where you say that “is far simpler to do in a stored procedure than anything else” is something that I have had to implement many times, so I can say from experience that it is easier to do in my application code than with a stored procedure.<<

      The reason why this is simpler is that the constraint boils down to something that is fundamentally set-based. In other words, you have an input set, and the members' amounts must total zero. This is ridiculously easy to do in an SQL query without doing a loop. In application logic, you have to loop through the inputs (which you might want to do anyway), but this ultimately requires more code, is less transparent, and is more brittle. Hence it's easier to do this via SQL than via PHP, Perl, or the like.

      In other words, something like:

      SELECT sum(amount) INTO t_variance FROM array_explode(in_lines);
      IF t_variance <> 0 THEN RAISE EXCEPTION 'imbalanced transaction';
      END IF;

      is a lot simpler than:

      my $variance = 0;
      for my $line (@lines) {
        $variance += $line->{amount};
      }
      if ($variance) {
        die 'imbalanced transaction';
      }

      Also I think it is a mistake to look at stored procedures as procedural. A better way to look at them is as a light-weight procedural wrapper over ordered set operations. This being said, at least in Pgsql, you can write fully OO stored procs in languages like Java, Perl, Python, and PHP, though PHP is certainly a niche language in this case.

    2. > This is probably due to your need to support MySQL
      > since this db doesn't support schemas?

      Untrue. This is because the PostgreSQL developers are using the wrong wording.

      In MySQL you have Server->Database->Table.

      In PostgreSQL you have Database->Schema->Table.

      Tables are known as "database tables" not "schema tables", therefore their parent should be "database" (as in MySQL) and not "schema" (as in PostgreSQL).

      Where I have multiple databases to a single MySQL connection I have multiple schemas in a single PostgreSQL connection. The effect is exactly the same, it is just the wording which is different.

    3. "Untrue. This is because the PostgreSQL developers are using the wrong wording."

      It is true that what MySQL calls a 'database' everyone else in the world calls a 'schema' (everyone else, I recognize, except you, Tony ;-) ), but that doesn't entirely get away from the problem. Namely, if you use databases the way you'd use schemas, you end up treating your whole MySQL instance (and not just the "databases" you want to use) as taken over by the multi-schema application. Something like PostgreSQL instead allows a level of separation you can only get in MySQL with separate server instances.

      The fact is that the wording used in PostgreSQL is the wording the standards specifications use, and the wording the rest of the important databases all follow (Oracle, MS SQL, DB2, Interbase, Firebird, etc). For you to call that wrong is not even justifiable ;-).

      But back to the original issue. I am trying to figure out why you need to have separate databases. What do you gain from doing so? Why not try to centralize as much as you can? Requiring cross-database links adds a lot of additional complexity, since it means you can't trust foreign key constraints, and consequently it can't be because you are integrating separate databases used by different applications. What does it buy you in practical terms? Or is it mostly because it is cool to solve these problems?

  4. > My own belief is that a large number of
    > database systems out there operate according
    > to the model of one database to many
    > applications.

    That has not been my experience. I have spent all of my career in writing applications which have sole access to their databases. Having multiple applications trying to perform the same function on the same database is something which I have never encountered.

    > The above opinion works quite well in a
    > narrow use case, namely where one application
    > and only one application uses the database.
    > In this case, the database tables can be
    > modelled more or less directly after the
    > application's object model.

    I do things the other way round - my application's classes are built from the database schema by way of my data dictionary. If I change my schema then I regenerate the necessary application code so that they are always in step.

    > then the only thing the RDBMS brings to the
    > table is some level of ad hoc reporting ...

    Incorrect. I use an RDBMS as a reliable and flexible data store, with backup, recovery and replication options, not just as a reporting tool.

    > ... at the cost of added development time and
    > complexity compared to some NoSQL solutions.

    I disagree completely. It is implementing business logic in the database which is more time consuming, difficult and expensive. And please don't muddy the waters with NoSQL databases, as they are a recent phenomenon which has yet to reach full maturity and live up to the hype.

    > The problem is that when you start moving
    > from single applications into enterprise
    > systems, the calculation becomes very
    > different. It is not uncommon to have several
    > applications sharing a single database, and
    > the databases often must be designed to make
    > it quick and easy to add new applications on
    > top of the same database.

    It has been my experience that when I design and write something like an Order Entry application then that application is the only one which allows orders to be entered. I have never had to design a database that can be used by multiple applications, so all your arguments on that matter are irrelevant to me, as I suspect they would be to a lot of other programmers who only ever deal with the "one database, one application" scenario.

  5. Tony Marston wrote:

    "That has not been my experience. I have spent all of my career in writing applications which have sole access to their databases. Having multiple applications trying to perform the same function on the same database is something which I have never encountered."

    It's something I routinely encounter. People write import/export scripts against their databases all the time, and I have never seen a line-of-business application's database go long without additional tools being written against it.

    "I do things the other way round - my application's classes are built from the database schema by way of my data dictionary. If I change my schema then I regenerate the necessary application code so that they are always in step."

    That's not that far from how we do things in LSMB except that there is an intermediary application-neutral API which is discoverable via the database's data dictionary.

    "Incorrect. I use an RDBMS as a reliable and flexible data store, with backup, recovery and replication options, not just as a reporting tool."

    What sort of flexibility are you looking at that BDB doesn't get you outside of ad hoc reporting?

    "I disagree completely. It is implementing business logic in the database which is more time consuming, difficult and expensive. And please don't muddy the waters with NoSQL databases as they are a recent phenomena which have yet to reach full maturity and live up to their hype."

    First, NoSQL isn't really that new. You could put XML/JSON docs in BDB if you wanted, and in fact a lot of applications have used BDB as a data store for a long time. RPM still uses it iirc, as does OpenLDAP by default. The costs and benefits of doing so are not unknown; in fact they are very well known. I have done enough BDB and XML programming to know where those tradeoffs are, although to be honest that was about 10 years ago, and I can't see how anything has fundamentally changed.

    All that most of the NoSQL solutions really bring to the table that is new is the idea of loose, multi-master replication and an ability to automate using Javascript or the like. These, however, do not alter the core tradeoff I am talking about, which is the use of a schema-based (SQL) versus an application-specific, schemaless (BDB, NoSQL) approach.

    'I have never had design a database that can be used by multiple applications, so all your arguments on that matter are irrelevant to me, as I suspect they would be to a lot of other programmers who only ever deal with the "one database, one application" scenario.'

    I said this at the beginning, actually..... If you view things simply as a stack and expect that all access goes through your application, then this is all irrelevant. However, IME most databases eventually evolve beyond that.

  6. I find this entire discussion fascinating. I do have a question for the OP or anyone else that may come across this topic. Why would a developer be concerned about what future or 3rd party applications might want to do with their data structure?

    From my experience in the enterprise environment, building a new application on top of legacy data storage/structure is a waste of time. I need to build my application around a data structure that represents the real-world problem. Building on top of relic data structures that were designed to solve a different real-world problem unnecessarily convolutes the current one.

    So in terms of using an RDBMS and the "smart" tools, I can use procs, views, and functions to populate the "dumb" tables in my desired application "schema" and even create tables in the same legacy system that are specific to my application.

    To utilize something like Radicore to classify each table as an object, I could easily prefix my table names and simply ignore the legacy datatables that I have no use for in my application. I could even build sub routines inside of my Radicore application that could easily validate data against the legacy system if for whatever reason I needed to return data back to some other applications table, since I get the legacy structure basically for free.

    There is no reason why I couldn't use Radicore against a legacy database and simply create the source data structure with "smart" tools, prefix my tables, and then continue on with Radicore development as it was designed. I lose nothing with this approach.
