Monday, February 13, 2017

PostgreSQL at 10TB and Beyond Recorded Talk

The PostgreSQL at 10 TB And Beyond talk has now been released on YouTube. Feel free to watch.  For the folks seeing this on Planet Perl Iron Man: the final ten minutes or so of the lecture include a short Perl function, running inside PostgreSQL, that extends SQL.

This lecture discusses human and technical approaches to solving volume, velocity, and variety problems on PostgreSQL in the 10TB range on a single, non-sharded large server.

As a related side note, I am teaching a course through Edument covering many of the technical aspects discussed here, called Advanced PostgreSQL for Programmers.  You can book the course for the end of this month.  It will be held in Malmo, Sweden.

Thursday, January 26, 2017

PL/Perl and Large PostgreSQL Databases

One of the topics discussed in the large database talk is the way we used PL/Perl to solve some data variety problems in terms of extracting data from structured text documents.

It is certainly possible to use other languages to do the same, but PL/Perl has an edge in a number of important ways.  PL/Perl is light-weight, flexible and fills this particular need better than any other language I have worked with.

While team familiarity with Perl has often been one consideration, PL/Perl has a number of specific points to recommend it:

  1. It is light-weight compared to PL/Java and many other languages.
  2. It excels at processing text in general ways.
  3. It has extremely mature regular expression support.

These features combine to make a procedural language for PostgreSQL that is particularly good at extracting data from structured text documents in the scientific space.  Structured text files are very common there, and being able to extract, for example, a publication date or other information from a file is very helpful.

Moreover, when you mark your functions as IMMUTABLE, you can index their output, which is helpful when you want ordered records starting at a certain point.

So, for example, suppose we want to query on plasmid lines in UniProt documents but did not set this up before loading the table.  We can easily create a PL/Perl function like:

CREATE OR REPLACE FUNCTION plasmid_lines(uniprot text)
RETURNS text[]
LANGUAGE plperl IMMUTABLE AS
$$
use strict;
use warnings;
my ($uniprot) = @_;
# Keep only the OG lines that describe plasmids.
my @lines = grep { /^OG\s+Plasmid/ } split /\n/, $uniprot;
# Strip the line prefix, returning just the plasmid names.
return [ map { my $l = $_; $l =~ s/^OG\s+Plasmid\s*//; $l } @lines ];
$$;


You could then create a GIN index on the array elements:

CREATE INDEX uniprot_doc_plasmids ON uniprot_docs USING gin (plasmid_lines(doc));
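With the index in place, array containment queries can use it.  As a sketch (the `id` column and the plasmid name `'pXYZ'` are hypothetical, but the `uniprot_docs` table and `doc` column are as above):

CREATE TABLE IF NOT EXISTS uniprot_docs (id serial PRIMARY KEY, doc text);

-- Find documents that mention a given plasmid.  The default GIN operator
-- class for text[] supports the @> containment operator, so this query
-- can be answered from the uniprot_doc_plasmids index.
SELECT id
  FROM uniprot_docs
 WHERE plasmid_lines(doc) @> ARRAY['pXYZ'];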

Neat!

Tuesday, January 24, 2017

PostgreSQL at 10 TB and Above

I have been invited to give a talk on PostgreSQL at 10 TB and above in Malmo, Sweden.  The seminar is free to attend.  I expect to talk for about 45 minutes, with some time for questions and answers.  I have also been invited to give the talk at PG Conf Russia in March.  I do not know whether either will be recorded.  But for those in the Copenhagen/Malmo area, you can register for the seminar at the Eventbrite page.

I thought it would be helpful to talk about what problems will be discussed in the talk.

We won't be talking about the ordinary issues that come with scaling up hardware, or about backup, recovery, or upgrades; those could be talks of their own.  Instead, we will cover some deep, specific challenges we faced, the solutions we found, and, along the way, some of the controversies in database theory that often come up in these areas.

Two of these challenges concern a subsystem of the database that handled large amounts of data in high-throughput tables (lots of inserts and lots of deletes).  The other two address sheer volume of data.

  1. Performance problems in work queue tables caused by large numbers of deletions off the head of indexes, with different workers deleting off different indexes.  This is an atypical case where table partitioning could solve a number of underlying problems with autovacuum performance and query planning.
  2. Race conditions in stored procedures between MVCC snapshots and advisory locks in the work queue tables.  We will talk about how this race condition happens and how we solved it without using row locks: by rechecking results in a new snapshot, which we decided was the cheapest solution to this problem.
  3. Slow access and poor plans when accessing data in large tables.  We will talk about what First Normal Form really means, why we opted to break the requirements in this case, what problems this caused, and how we solved them.
  4. Finally, we will look at how new requirements on semi-structured data were easily implemented using procedural languages, and how we made these perform well.

In the end, there are a number of key lessons one can take away regarding monitoring and measuring performance in a database, including a willingness to tackle low-level details, to measure, and even to simulate performance.
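As a rough illustration of the recheck idea in item 2 (the `work_queue` table, its `id` column, and the `:candidate_id` placeholder are hypothetical; the advisory lock functions are standard PostgreSQL):

-- Step 1: pick a candidate item and take an advisory lock on its id.
SELECT id FROM work_queue ORDER BY id LIMIT 1;   -- remember as :candidate_id
SELECT pg_advisory_lock(:candidate_id);

-- Step 2: in READ COMMITTED, each statement runs under a fresh snapshot,
-- so recheck that the row still exists: another worker may have processed
-- and deleted it between our first read and our lock acquisition.
SELECT id FROM work_queue WHERE id = :candidate_id;  -- no row => retry

This is only a sketch of the pattern; the actual implementation discussed in the talk lives in stored procedures.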

Please join me in Malmo or Moscow for this talk.