Thursday, August 18, 2016

PostgreSQL vs Hadoop

So one of the clients I do work with is moving a large database from PostgreSQL to Hadoop.  The reasons are sound -- volume and velocity are major issues for them, and PostgreSQL is not going away in their data center and in their industry there is a lot more Hadoop usage and tooling than there is PostgreSQL tooling for life science analytics (Hadoop is likely to replace both PostgreSQL and, hopefully, a massive amount of data on NFS).  However this has provided an opportunity to think about big data problems and solutions and their implications.  At the same time I have seen as many people moving from Hadoop to PostgreSQL as the other way around.  No, LedgerSMB will never likely use Hadoop as a backend.  It is definitely not the right solution to any of our problems.

Big data problems tend to fall into three categories, namely managing ever increasing volume of data, managing increasing velocity of data, and dealing with greater variety of data structure.  It's worth noting that these are categories of problems, not specific problems themselves, and the problems within the categories are sufficiently varied that there is no solution for everyone.  Moreover these solutions are hardly without their own significant costs.  All too often I have seen programs like Hadoop pushed as a general solution without attention to these costs and the result is usually something that is overly complex and hard to maintain, may be slow, and doesn't work very well.

So the first point worth noting is that big data solutions are specialist solutions, while relational database solutions for OLTP and analytics are generalist solutions.  Usually those who are smart start with the generalist solutions and move to the specialist solutions unless they know out of the box that the specialist solutions address a specific problem they know they have.  No, Hadoop does not make a great general ETL platform.....

One of the key things to note is that Hadoop is built to solve all three problems simultaneously.  This means that you effectively buy into a lot of other costs if you are trying to solve only one of the V problems with it.

The single largest cost comes from the solutions to the variety of data issues.  PostgreSQL and other relational data solutions provide very good guarantees on the data because they enforce a lack of variety.  You force a schema on write and if that is violated, you throw an error.  Hadoop enforces a schema on read, and so you can store data and then try to read it, and get a lot of null answers back because the data didn't fit your expectations.  Ouch.  But that's very helpful when trying to make sense of a lot of non-structured data.

Now, solutions to check out first if you are faced with volume and velocity problems include Postgres-XL and similar shard/clustering solutions but these really require good data partitioning criteria.  If your data set is highly interrelated, it may not be a good solution because cross-node joins are expensive.  Also you wouldn't use these for smallish datasets either, certainly not if they are under a TB since the complexity cost of these solutions is not lightly undertaken either.

Premature optimization is the root of all evil and big data solutions have their place.  However don't use them just because they are cool or new, or resume-building.  They are specialist tools and overuse creates more problems than underuse.


  1. PostgreSQL 10 roadmap from 2ndQuadrant on columnar indexes may make PostgreSQL suitable for big data:

    1. Certainly for some kinds of big data. And anything that helps certain areas do better is a major win. Having index-oriented tables would also help with certain volume-related issues as well (and there has been talk about that for some time).

    2. But one point worth noting is that "big data" is a bit of a buzzword and poorly defined topic. Columnar stores help certain kinds of aggregation. Index oriented tables help one tune certain kinds of access patterns. Those are important tools in dealing with data volume problems.

      But one of the things which has really been difficult in the deployment I work with that is moving to Hadoop is the fact that TOAST performance overhead is rather difficult to measure and in order to get adequate performance, certain things have had to move to non-1NF designs. Even when you are dealing with structured data, there are access pattern corners where one has to be rather imaginative to keep performance up.

      So it is a mistake to think that all volume problems are the same, or all velocity problems are the same. Once you get into TB of data, attention to detail and awareness of your specific issues become important.

      Today PostgreSQL can deal well with certain kinds of big data problems. And those are expanding. But we should be careful not to be like the folks who think Hadoop is the answer to everything ;-)

  2. The CitusDB extension is an interesting alternative to Postgres-XL, and it's also not a fork. You're correct that if your data doesn't shard well you're still in trouble. Though, doesn't the same apply to hadoop? Or does it just use a hash of the entire row/document to partition?

  3. Attend The Data Analytics Courses From ExcelR. Practical Data Analytics Courses Sessions With Assured Placement Support From Experienced Faculty. ExcelR Offers The Data Analytics Courses.
    Data Analytics Courses

  4. Attend The Data Science Course From ExcelR. Practical Data Science Course Sessions With Assured Placement Support From Experienced Faculty. ExcelR Offers The Data Science science courses

  5. very well explained .I would like to thank you for the efforts you had made for writing this awesome article. This article inspired me to read more. keep it up.
    Simple Linear Regression
    Correlation vs covariance
    data science interview questions
    KNN Algorithm
    Logistic Regression explained


  7. I have to search sites with relevant information on given topic and provide them to teacher our opinion and the article.

    Simple Linear Regression

    Correlation vs Covariance