Friday, September 21, 2018

PostgreSQL at 20TB and Beyond Talk (PGConf Russia 2018)

It came out a while ago but I haven't promoted it much yet.


This is the recorded version of the PostgreSQL at 20TB and Beyond talk.  It covers a large 500TB analytics pipeline and how we manage the data.

For those wondering how well PostgreSQL actually scales, this talk is worth watching.

Thoughts on the Code of Conduct Controversy

My overall perspective here is that the PostgreSQL community needs a code of conduct, and one which allows the committee to act in some cases on off-infrastructure activity, but that the current code of conduct has some problems which could have been fixed if more care had been taken to gather feedback while it was actionable.

This piece discusses what I feel was done poorly but also what was done well and why, despite a few significant missteps, I think PostgreSQL as a project is headed in the right direction in this area.

But a second important point here is to defend the importance of a code of conduct to the dissenters: to explain why we need one, why its scope needs to extend beyond community infrastructure, and why we should not be overly worried about this going in a very bad direction.  I take this direction in part because I found myself defending the need for a code of conduct to folks I collaborate with in Europe, and the context had less to do with PostgreSQL than with the Linux kernel.  But in this regard the two projects are far more different than they are similar.

Major Complaint:  Feedback Could Have Been Handled Better (Maybe Next Time)


In early May there was discussion about the formation of a code of conduct committee, in which I argued (successfully) that it was extremely important that the committee be geographically and culturally diverse, so as to avoid one country's politics being unintentionally internationalized through the code of conduct.  This was accepted, and as I discuss below it is the single most important protection we have against misuse of the code of conduct to push political agendas on the community.  However, after this discussion there was no further solicitation for feedback until mid-September.

In mid-September, the code of conduct plan was submitted to the list.  The new code of conduct contained a surprising amendment, made the previous month, which expanded its reach to all interactions between community members unless another code of conduct applied and superseded the PostgreSQL community code of conduct.  I objected, as did several others, with actionable criticism and alternatives.  Unfortunately we were joined by a large number of people wanting to relitigate whether we needed a code of conduct in the first place.  Those of us with actionable feedback were told that no changes would be made for about a year.  In essence, what looked like a public comment period was not, and the more actionable the feedback was, the more clearly it was ignored.

Had there been an actual comment period on the proposed language, I maintain that things would have been more tame.  Ignoring even actionable feedback in such a period helped throw fuel on the fire for those who wanted to relitigate the whole concept, because it reinforced the view that a plan had been announced and every concern ignored.  This was unfortunate.  If there had been a comment period, a deliberation, and a final draft, things would have gone better.

I hope that next time such a process is followed, with feedback on the proposed final wording taken before any decision is made to freeze changes for a year.

Why We Have Codes of Conduct


Humans are social animals.  Groups of humans form social groups, which often have shared infrastructure that needs to be managed.  Open source thus has all of the political considerations of a multi-national collaborative community, including management of common infrastructure and how we treat each other.  The kinds of social relationships and interactions we have in the community are shaped by our culture, gender, and outlook on life, and in an international project this can create a lot of friction.  When national political issues are kept out of the project, and the project consists mostly of people willing to defend themselves (perhaps aggressively), a project can get along without a code of conduct; but as things change, it is important that there be a means of resolving conflicts within the community.  Hence one needs a dedicated committee and a document which reminds people to act in ways that keep the peace.

Codes of conduct thus have a role in ensuring that people can come together and work in a collegial and civil manner across cultural, political and other disagreements, and continue to build the great software that we all rely on.  In this regard I think the PostgreSQL community has hit the most important milestones and begun to build a code of conduct infrastructure which can last and ensure that the code of conduct does not turn into a way of one group of people forcing a political agenda on the world.

I have been to many conferences.  Often at some point discussions turn to politics in some way.  With the exception of one conversation, these discussions have been thoughtful, receptive, and mutually rewarding, but in that one exception I saw a degree of aggressiveness that might, for others, even rise to the level of physical intimidation.  A reminder that we all need to genuinely be nice to each other is a step in the right direction.

Codes of conduct cannot create fairness.  They cannot create social justice.  They cannot broaden meritocracy to community contribution beyond code.  Those things have to be done through other means, but they can remind everyone to treat each other collegially and to respect differences of opinion and so forth.

However, codes of conduct cannot allow mere formalities to defeat this purpose.  A campaign of harassment taken off-list is at least as much of a problem as one conducted on-list.  Therefore community-related conversations may sometimes have to fall under the jurisdiction of community conflict adjudication mechanisms such as the code of conduct committee.

What PostgreSQL is Doing Right


The danger in any code of conduct is that an internal controversy from one country or culture will be read into disputes in a way which ensures that other cultural groups do not feel comfortable participating.  GLBT issues are an area where this commonly comes up, and in a project with a lot of involvement from countries whose views are very different from those of the US, this leads to big problems.  On one hand, some people may see others' cultural views as invalidating their sexual identity, while others would see views pushing universalism in GLBT rights as invalidating their cultural identity.  These issues cannot be resolved without retreating to a single cultural context as the norm, discouraging participation from much of the world, and thus need to be outside what a code of conduct handles.  In this case it does not matter what one believes to be the right approach; rather, the consequences of siding with either side in such a controversy would be devastating for the community.

One of the key points of the current code of conduct is that the committee is itself geographically and culturally diverse.  These intra-committee cultural divisions help ensure that the committee cannot simply bulldoze through a political orthodoxy out of fear of how a domestic controversy might be perceived.  This cultural diversity is thus an immense protection, and it effectively preserves the right to engage in the free struggle of political opinion in one's own country.

From a responsibility to civic engagement comes a right to such a free struggle of political opinion, and in my view this is something which is effectively preserved in the community today.  Note that this would not apply to trying to position the PostgreSQL project as against any political, cultural, or other group.  Nor should it protect actual personally directed harassment against any member for any reason.  I believe that the committee is capable of drawing these lines and hence I see the PostgreSQL project as off to a shaky but viable start.

Unlucky Timing


The code of conduct controversy happened to coincide with the Linux kernel community adopting a code of conduct based on the Contributor Covenant.  The Contributor Covenant is a code of conduct which transparently attempts to push the norms of certain parts of the US political spectrum on the rest of the world (see, for example, Opalgate).  While I believe this to be a mistake, time will tell how it is handled.  The Contributor Covenant was soundly and decisively rejected by the PostgreSQL community early on as too transparently political.

A lot of dissenters' emotional reactions in this controversy may well be in response to that.  This is one of those things one cannot plan for, and it makes it harder to have real discussions today.

Calls to Action and Conclusions


I have submitted a couple of requests for wording changes to the code of conduct committee.  For others who see a need for a committee to help ensure a collegial and productive community, and see opportunities for improvement, I suggest you do the same.  But simply arguing about whether we need a resolution process is not productive, and that probably needs to stop.

I also think the community needs to insist on two modifications to the current process:

  1. There needs to be a comment period and deliberation over feedback between the draft of a new revision and its adoption.
  2. The code of conduct committee needs to reply with reasons why particular suggestions were rejected.
However, I think PostgreSQL as a project is off to a viable start in what is likely to become the right direction.  And that is something we should all be thankful for.

Wednesday, August 9, 2017

On Contempt Culture, a Reply to Aurynn Shaw

I saw an interesting recorded presentation on contempt culture by Aurynn Shaw, delivered this year at PyCon and shared on LinkedIn.  I had worked with Aurynn on projects back when she worked for Command Prompt.  You can watch the video below:



Unfortunately comments on a social media network are not sufficient for discussing nuance, so I decided to put this blog post together.  In my view she is very right about a lot of things, but there are some major areas where I disagree, and I therefore wanted to write a full post explaining what I see as an alternative to what she rightly condemns.

To start out, I think she is very much right that there often exists a sort of tribalism in tech, with people condemning each other's tools, whether it be Perl vs PHP (her example) or vi vs emacs, and I think that can be harmful.  The comments here are aimed at fostering the sort of inclusive and nuanced conversation that is needed.

The Basic Problem

Every programming culture has norms, and groups from outside those norms tend to be condemned in some way or another.  There are a number of reasons for this: one is competition, and another is seeking approval in one's in-group.  I think one could take her points further and argue that in part it is an effort to improve the standing of one's group relative to others around it.

Probably the best example we can come up with in the PostgreSQL world is the way MySQL is looked at.  A typical attitude is that everyone should be using PostgreSQL and therefore people choosing MySQL are optimising for the wrong things.

But where I would start to break with Aurynn's analysis is in contrasting how we look at MySQL with how we look at Oracle.  Oracle, too, has some major oversights (an empty string being null if it is a varchar, no transactional DDL, etc.).  Almost all of us may dislike the software and the company, but people who work with Oracle still have prestige.  So bashing tools isn't quite the same thing as bashing the people who use them.  Part of it, no doubt, is that Oracle is an older, more established player in the market, and therefore a natural degree of prestige comes from working with the product.  But the question I have is: what can we learn from that?

Some time ago, I wrote the most popular post in the history of this blog.  It was a look at the differences in design between MySQL and PostgreSQL; it was syndicated on DZone, featured on Hacker News, and otherwise got a fairly wide readership.  In general, aside from a couple of historical errors, the PostgreSQL-using audience loved the piece.  What surprised me, though, was that the MySQL users also loved it.  In fact one comment (I think on Reddit) said that I had expressed why MySQL was better.

The positive outpouring from MySQL users, I think, came from the fact that I sympathetically looked at what MySQL was designed to do and what market it was designed for (applications that effectively own the database), describing how some things I considered misfeatures actually could be useful in that environment, but also being brutally honest about the tradeoffs.

Applying This to Programming Language Debates

Before I start discussing this topic, it is worth a quick tour of my experience as a software developer.

The first language I ever worked with was BASIC on a C64.  I then dabbled in Logo and some other languages, but the first language I taught myself professionally was PHP.  From there I taught myself some very basic Perl, Python, and C.  For a few years I worked with PHP and bash scripting, only to fall into doing Perl development by accident.  I also became mildly proficient in Javascript.

My PostgreSQL experience grew out of my Perl experience.  And about 3 years ago I was asked to start teaching Python courses.  I rose to this challenge.  Around the same time, I had a small project where we used Java and quickly found myself teaching Java and now I feel like I am moderately capable in that language.   I am now teaching myself Haskell (something I think I could not have done before really mastering Python). So I have worked with a lot of languages.  I can pick up new languages with ease.  Part of it is because I generally seek to understand a language as a product of its own history and the need it was intended to address.

As we all know, different programming languages are associated with stereotypes.  Moreover, I would argue that stereotypes are usually imperfect understandings that out-group people have of in-group dynamics, so dismissing stereotypes is often as bad as simply accepting them.

PHP as a Case Study, Compared to C

I would like to start with an example of PHP, since this is the one specifically addressed in the talk and it is a language I have some significant experience writing software in.

PHP often is seen to be insecure because it is easy to write insecure software in the language.  Of course it is easy to write insecure software in any language, but certain vulnerabilities are a particular problem in PHP due to lexical structure and (sometimes) standard library issues.

Lexically, the big issue with PHP is that the language was designed as a preprocessor for SGML files (indeed the name stands for PHP: Hypertext Preprocessor).  For this reason PHP is easy to embed in SGML processing-instruction tags, so you can write a PHP template as a piece of valid HTML.  This is a great feature, but it makes cross-site scripting particularly easy to overlook.  A lot of the standard library also had really odd behaviour in the 1990s, though much of this has been corrected.

Aurynn is right to point out that these problems were exacerbated by a flood of new programmers during the rise of PHP, but one thing she does not discuss in the talk is how software and internet security were also changing during that time.  The late 1990s saw the rise of SSH (from 20k users in 1995 to over 2M in 2000), the end of plain-text password transmission across the internet, the rise of concern about SQL injection and XSS, and so forth.  PHP's basic features were in place just before this really got going; add to that a new developer community, and you have a recipe for security problems.  Of course, PHP has outgrown a lot of this, and PHP developers today have codified best practices to deal with current threats.

If we contrast this with C, C has even more glaring language-level security issues, from double-free bugs to buffer overruns.  C, however, is a very unforgiving language, and consequently it doesn't tend to attract a large novice developer community.  Even so, a whole lot of security issues come out of software written in C.

Conclusions

There is no such thing as a perfect tool (database, programming language, etc).  As we grow as professionals, part of that process is learning to better use the strengths of the technologies we work with and part of it is learning to overcome the oversights and problems of the tools as well.

It is further not the case that a programmer's primary use of a tool with real oversights reflects poor judgment on the programmer's part.  Rather, the process of learning those tools can have the opposite effect.  C programmers tend to be very knowledgeable because they have to be.  The same is true of Javascript programmers, for very different reasons.  And one doesn't have to validate all of a language's design decisions in order to respect those who use it.

Instead of attacking developers of other languages, my recommendation is, when you see a problem, to neutrally and respectfully point it out, not from a position of superiority but a position of respectful assistance and also to understand that often what may seem like poor decisions in the design of a language may in fact have real benefits in some cases.

For example, Java as a language encourages mediocrity of code. It is very easy to become a mediocre Java developer.  But once you understand Java as a language, this becomes a feature because it means that the barrier to understanding and debugging (and hence maintaining!) code is reduced, and once you understand that you can put emphasis instead on design and tooling.    This, of course, also has costs since it is easy for legacy patterns to emerge in the tooling (JavaBeans for example) but it allows some really amazing frameworks, such as Spring.

On the other extreme, Javascript is a language characterised by shortcuts taken during the initial design (for reasons of time constraints), and while some of those cause real problems, others make hard things possible.  Javascript also makes it very easy to be a bad Javascript programmer.  But perhaps for this reason I have found that professional Javascript programmers tend to be extremely knowledgeable; they have had to work very hard to master software development in the language, and they usually bring great insights into computing problems generally.

So what I would recommend that people take away is the idea that in fact we do grow out of hardship, and that problems in tools are overcome over time.  So for that reason discussing real shortcomings of tools while at the same time respecting communities and their ability to grow and overcome problems is important.

Monday, February 13, 2017

PostgreSQL at 10TB and Beyond Recorded Talk

The PostgreSQL at 10 TB And Beyond talk has now been released on Youtube. Feel free to watch.  For the folks seeing this on Planet Perl Iron Man: the final 10 minutes or so of the lecture include a short function, written in Perl, that extends SQL inside PostgreSQL.

This lecture discusses human and technical approaches to solving volume, velocity, and variety problems on PostgreSQL in the 10TB range on a single, non-sharded large server.

As a side but related note, I am teaching a course in Sweden through Edument covering many of the technical aspects discussed here, called Advanced PostgreSQL for Programmers.  You can book the course for the end of this month.  It will be held in Malmo, Sweden.

Thursday, January 26, 2017

PL/Perl and Large PostgreSQL Databases

One of the topics discussed in the large database talk is the way we used PL/Perl to solve some data variety problems in terms of extracting data from structured text documents.

It is certainly possible to use other languages to do the same, but PL/Perl has an edge in a number of important ways.  PL/Perl is light-weight, flexible and fills this particular need better than any other language I have worked with.

While one of the considerations has often been knowledge of Perl in the team, PL/Perl has a number of specific reasons to recommend it:

  1. It is light-weight compared to PL/Java and many other languages.
  2. It excels at processing text in general ways.
  3. It has extremely mature regular expression support.
These features combine to create a procedural language for PostgreSQL which is particularly good at extracting data from structured text documents in the scientific space.  Structured text files are very common and being able to extract, for example, a publication date or other information from the file is very helpful.

Moreover when you mark your functions as immutable, you can index the output, and this is helpful when you want ordered records starting at a certain point.

So for example, suppose we want to be able to query on plasmid lines in UNIPROT documents but we have not set this up before we loaded the table.  We could easily create a PL/Perl function like:

CREATE OR REPLACE FUNCTION plasmid_lines(uniprot text)
RETURNS text[]
LANGUAGE plperl IMMUTABLE AS
$$
use strict;
use warnings;
my ($uniprot) = @_;
# keep only the "OG   Plasmid ..." lines of the document
my @lines = grep { /^OG\s+Plasmid/ } split /\n/, $uniprot;
# strip the line prefix, returning just the plasmid names
return [ map { my $l = $_; $l =~ s/^OG\s+Plasmid\s*//; $l } @lines ];
$$;


You could then create a GIN index on the array elements:

CREATE INDEX uniprot_doc_plasmids ON uniprot_docs USING gin (plasmid_lines(doc));

Neat!

Tuesday, January 24, 2017

PostgreSQL at 10 TB and Above

I have been invited to give a talk on PostgreSQL at 10TB and above in Malmo, Sweden.  The seminar is free to attend.  I expect to be talking for about 45 minutes with some time for questions and answers.  I also have been invited to give the talk at PG Conf Russia in March.  I do not know whether either will be recorded.  But for those in the Copenhagen/Malmo area, you can register for the seminar at the Event Brite page.

I thought it would be helpful to talk about what problems will be discussed in the talk.

We won't be talking about the ordinary issues that come with scaling up hardware, or the issues of backup or recovery, or of upgrades. Those could be talks of their own.  But we will be talking about some deep, specific challenges we faced and along the way talking about some of the controversies in database theory that often come up in these areas, and we will talk about solutions.

Two of these challenges concern a subsystem in the database which handled large amounts of data in high-throughput tables (lots of inserts and lots of deletes).   The other two address volume of data.

  1. Performance problems in work queue tables regarding large numbers of deletions off the head of indexes with different workers deleting off different indexes.  This is an atypical case where table partitioning could be used to solve a number of underlying problems with autovacuum performance and query planning.
  2. Race conditions in stored procedures between MVCC snapshots and advisory locks in the work queue tables.  We will talk about how this race condition happens and how we solved it without row locks: by rechecking results in a new snapshot, which we decided was the cheapest solution to this problem.
  3. Slow access and poor plans regarding accessing data in large tables.  We will talk about what First Normal Form really means, why we opted to break the requirements in this case, what problems this caused, and how we solved them.
  4. Finally, we will look at how new requirements on semi-structured data were easily implemented using procedural languages, and how we made these perform well.
In the end there are a number of key lessons one can take away regarding monitoring and measuring performance in a database.  These include being willing to tackle low-level details, measure, and even simulate performance.

Please join me in Malmo or Moscow for this talk.

Sunday, December 11, 2016

What Git got Right and Wrong

Having used various vcs solutions for a while, I think it is worth noting that my least favorite from a user experience perspective is git.  To be sure, git has better handling of whitespace-only merge conflicts, and a few other major features, than any other vcs I have used.

And the data structure and model are solid.

But even understanding what is going on behind the scenes, I find git to be an unintuitive mess at the CLI level.

Let me start by saying things that git got right:


  1. The DAG structure is nice.
  2. Recursive merges are good.
  3. The data model and what goes on behind the scenes are solid.
  4. It is the most full-featured vcs I have worked with.
However, I still have real complaints with the software.  These include fundamentally different concepts merged under the same label, and commands that do many different things depending on how you call them.  Because the concepts are unclear, this is worse than a learning-curve problem: one cannot have a good grasp of what git is doing behind the scenes, because it is not always clear.  In particular:

  1. In what world is a fast-forward a kind of merge?
  2. Is there any command you can explain (besides clone) in three sentences or less?
  3. What does git checkout do?  Why does it depend on what you check out?  Can you expect an intermediate user to predict what happens when they have staged changes on a file and then check out a copy of that file from another revision?
  4. What does git reset do from a user perspective?  Is there any way a beginner can understand that from reading the documentation?
  5. In other words, git commands try to do too much at once, and this is often very confusing.
  6. Submodules are an afterthought and not well supported across the entire tool chain (why does git-archive not have an option to recurse into submodules?)
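To make the checkout complaint concrete, here is a small sketch (the throwaway repo and file names are mine, purely for illustration) of three unrelated operations that all travel under the name git checkout:

```shell
# "git checkout" is three different commands in one.
# Assumes git is on the PATH; everything happens in a throwaway repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "initial"
main_branch=$(git symbolic-ref --short HEAD)  # master or main, either way

echo "v1" > file.txt
git add file.txt
git -c user.email=demo@example.com -c user.name=demo commit -q -m "add file"

git checkout -q -b topic        # use 1: create a branch and move HEAD to it
echo "v2" > file.txt
git checkout -- file.txt        # use 2: throw away working-tree edits to a path
restored=$(cat file.txt)        # file is back to "v1"
git checkout -q "$main_branch"  # use 3: switch back to the original branch
```

Switching branches, discarding local edits, and creating branches are conceptually unrelated, yet an intermediate user must learn which argument shapes select which behaviour.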
These specific complaints come from what I see as a lack of clarity about what the concepts mean, and they suggest that the tools were written at a time when the concepts were not entirely clear in their authors' minds.  In essence, it is not that things are named in unclear ways but that the concepts themselves have unclear boundaries.

Some of this could be fixed in git.  A fast-forward could be described as a shortcut that avoids a merge, rather than as a kind of merge.  Some of it could be fixed with better documentation (git reset could be described in terms of its user-facing effects rather than its internal ones).  And some of it would require a very different command layout.
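The fast-forward point is easy to demonstrate in a throwaway repo (again, repo and commit names here are mine): a so-called fast-forward merge creates no merge commit at all, HEAD simply moves forward, while a true merge adds a new commit with two parents.

```shell
# Fast-forward vs. a real merge: count the commits each one produces.
set -e
repo2=$(mktemp -d)
cd "$repo2"
git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "base"
main2=$(git symbolic-ref --short HEAD)

git checkout -q -b topic
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "work"

git checkout -q "$main2"
git merge -q topic                      # fast-forward: no commit object is made
ff_count=$(git rev-list --count HEAD)   # base + work = 2 commits

git reset -q --hard HEAD~1              # rewind, redo as a real merge
git -c user.email=demo@example.com -c user.name=demo \
    merge -q --no-ff -m "merge topic" topic
noff_count=$(git rev-list --count HEAD) # base + work + merge commit = 3
```

Calling the first operation a "merge" when it merges nothing is exactly the kind of conceptual blur described above; a name like "shortcut" or "advance" would set clearer expectations.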