Perspectives on LedgerSMB

Linux and Baikal Electronics: Why Postgres Should not Follow that Path

2023-04-07T17:36:00.003-07:00

In my recent trip to PgConf Russia, a friend brought up the reaction of a Linux kernel maintainer to a patch submission by Baikal Electronics. The reaction has apparently shocked open source developers throughout Russia and I think perfectly encapsulates the dangers open source projects, including PostgreSQL, have to navigate today in the changing geopolitical context. I decided to write a longer piece on this controversy and the issues surrounding it because I don't want to see PostgreSQL, LedgerSMB, or other open source projects I work with go down a similar path.

I believe that we are entering into a new era in open source development and I hope we move forward with our eyes open rather than sleepwalk into disaster. Particularly as we move forward with dual-use technologies in Postgres, such as Transparent Data Encryption, we will face similar pressures and temptations. I hope we don't go down the same road.

Background: Baikal Electronics, Weapons, and Sanctions

The email occurs in the course of the Russo-Ukraine war and the purported efforts by Western powers to deprive Russia of dual-use computer chips, of a sort that could be useful for cruise missile guidance systems and other efforts. I say "purported" because today computer chips are found in virtually everything and therefore this motivation doesn't make much sense. Western countries have imposed a number of sanctions which attempt to restrict access to microchips and other sophisticated electronics. This has included both export bans to Russia of microchips but also sanctions on Russian microchip manufacturers.

Modern weapons by most major powers are heavily computerized and rely on such components. The stated hope by Western powers was to degrade Russia's military manufacturing and repair capabilities and thus end the war on terms favorable to Western interests. It is also possible that the goal of sanctions was to destroy civilian life in the vain hope of a regime change (though there are no cases of Western sanctions achieving this in history, particularly when used against an adversary).

However, most aspects of modern life rely on these same microchips. Banking systems, traffic lights, water treatment facilities, electrical generation, washing machines, refrigerators, and many more pieces of civilian equipment and infrastructure well outside the conflict zone depend on such technologies. For this reason, companies like Baikal Electronics represent an important part of an effort at import substitution and sanction-resilience, not just for military uses but for civilian uses as well. The Linux kernel maintainers have no problem accepting patches from US military and national security agencies (for example, SE-Linux was contributed by the NSA), but suddenly allowing civilian uses of Linux in Russia has become a problem. It's hard to see this as a conscious decision rather than a product of unconscious bias and human error. However, it could also be a conscious decision to weaponize open source software for geopolitical reasons.

Different Interpretations of Motivation: Weapons vs Civilian Use, or Geopolitical Enforcement

The email here could be interpreted in a number of ways. The most likely two are that this is a deliberate effort to extend sanctions to open source development or that it is a deep concern with what might be mistaken to be primarily a weapons contractor. The first is genuinely scary, while the second could be expected to be resolved with a bit of dialogue.

It is possible that the decision was made due to US/EU sanctions or the desire to take a strong stance in favor of US interests (as portrayed by US media) and I understand this is the sense that people have taken this in Russia. In this view, open source becomes just another weapon of geopolitics, and a way to punish countries we don't like. After Russia, the same fate will surely befall China, and then any other country that dares to get in the way of American ambitions throughout the world. Of course this would be rationalized as the argument that Russia is invading and occupying another country illegally and we ought not to support that, but the fact that the US has illegally invaded and is occupying half of Syria doesn't get the same treatment, so again we are dealing with a "rules-based" international order where the rules don't apply to the West. As I will show below, any open source project that goes down this path will be superseded by more inclusive, international projects which may fork from them.

A more charitable view is based on the misunderstanding that Baikal Electronics' chips are primarily used for military applications and have limited civilian use. This may have been true before the war, though even then the company was making inroads into civilian applications. Today, the company has many civilian uses and, as import substitution becomes more important, these civilian uses are growing. In this particular case, dialog ought to be able to resolve the issue and the patches ought to be able to be reviewed. I think the fact that the Linux kernel maintainers have not decided to directly discriminate against Russians as a whole, or even residents of Russia, is evidence that this is a misunderstanding rather than a genuine effort at removing the political neutrality that defines open source. However, if server vendors or even banks would be barred from making contributions for fixes affecting Baikal chips used in civilian applications, this would be quite troubling, and that seems to be the policy specified in the email.

I think the comments about refusing hardware support for Baikal's chips indicates that this is not merely about mistrust of Baikal as an organization technically, but rather something directed at them and their motivations for contribution.

After all, the Open Source Initiative's Open Source Definition does not allow licenses to discriminate against endeavors or groups of people, whether by citizenship, residency, or employment. While this does not apply to accepting patches, it does apply to distribution so contributions by the NSA become available for use by the Russian armed forces. The kernel maintainers don't have to accept patches from Russian nationals or corporations, but they have to let such people and corporations take advantage of improvements made by US allies. As I will show below, this leads to problems for open source communities trying to cut themselves off from countries they don't like.

The Open Source as Infrastructure from the Unipolar Moment

In order to understand the current situation clearly enough to see what options other countries and corporations have, a bit of history into the current situation is helpful.

Open Source is generally understood to trace back to Cold-War-era academic development practices where academics would write software and share it for purposes of research. This was confined to academia for a number of reasons but particularly due to difficulties with distribution. At the time, the internet was not well developed, and so distribution and collaboration across significant distances, and with significant numbers of people, were difficult.

With the end of the Cold War, the situation dramatically changed. The Unipolar Moment was accompanied by a move to hyper-globalism and this lead to increasing exports of computers, and the rise of a global internet. This in turn lead to increasing investments in telecommunications infrastructure and the rise of a high-speed internet. As computers became faster and storage became denser, and as internet speeds increased, many of the large open source projects we know today either began their lives or moved out from academia into the mainstream.

Open source development practices today are a product of this hyper-connected, globalized world. Globalism, however, has not equally benefited all and has proved to be demonstrably unstable. US efforts to remake the world in its image are now failing, and workers at home are increasingly unhappy with stagnant wages, lost jobs, and other consequences of a globalized labor market. As a result we began to see largely unsuccessful efforts to disconnect the US economy from the Chinese economy beginning in 2016, and more successful efforts to disconnect from the Russian economy starting in 2022. The efforts at severing economic ties with China have furthermore been picked up by the Biden Administration, and there is a general understanding in both the US and China that similar sanctions are coming to China whenever practical and convenient.

The Global South, often weary of being threatened with US, UK, and EU sanctions, has steadfastly refused to support this effort. Indeed, not a single country from the Global South has sanctioned Russia in the last year. They are therefore at risk of being considered unfriendly countries and arbitrarily sanctioned via secondary sanctions. What has developed is then what Rutgers University Political Science Professor Michael Rossi has called "The West vs The Rest." This leads to a dangerous dynamic where efforts to isolate one country risks alienating people in a large number of other countries.

Open source, however, remains as infrastructure. Software products cannot consider themselves open source while enforcing geographic boundaries in distribution or preventing individuals, organizations, agencies, or nations one doesn't like from accessing, using, or modifying the source code. This then means that this infrastructure remains no matter what we would like, and further that efforts at excluding those we don't like will generally backfire and eventually kill the projects that we want to protect.

Possible Responses from Non-Western Powers

Of course the best option for everyone would be to work together to address misunderstandings and make sure that geopolitics does not get in the way of accepting code contributions. However, if that is not possible, then other options might be necessary. There are several possible reactions that can be taken from within non-Western powers to such exclusion. These range from maintained patch sets to forking or even switching open source projects.

The simplest approach for people who are excluded from a major open source project is just to maintain a patch set of existing work for use within their countries. The advantage of this is that it is quick and easy. You can maintain the patch set as a git branch, and perhaps even build or release from that branch. This makes it quick and (usually) easy to keep up to date with git rebase, and if the patch set is small, is quite reasonable to do.

If the patch set becomes too large, then maintenance in this will be difficult, in which case forking and merging in changes from the mainline kernel may be an option. This would allow, for example, a "Rinux" (contraction of Russian Linux) to develop. The mainline kernel would still not take on further improvements from the Rinux fork, but Rinux could take all improvements from Linux. If the developers are competent this would leave Rinux as a better, more capable project, and also very attractive to Chinese and other non-Western companies and governments as a project which would would be less likely to be weaponized. If Rinux were particularly successful, it might even make inroads back into Western countries, leaving Linux increasingly used purely by Western governments and military organizations, and increasingly lagging behind until it wouldn't even be worth following anymore.

A final option would be to switch to a different kernel such as OpenBSD, which has a track record of a global outlook, particularly in areas of cryptography. The particular track record of recruiting cryptographers not bound by US export restrictions may indicate they may be safer partners of collaboration than the Linux kernel at this point.

Of course the best option for everyone would be to work together to address misunderstandings and make sure that geopolitics does not get in the way of accepting code contributions.

Our Choice: To Be Global or To Be Western

Gone now are the days when geopolitics could be safely ignored, but open source software projects today still have to decide how closely to tie themselves to Western geopolitical efforts at maintaining global power. Emerging development communities, raised on open source, in much of the world will dramatically privilege those projects that don't tie themselves to the agendas of any country or group of countries but seek to create commons that all can participate in, well beyond borders.

LedgerSMB was born of a need to maintain software while I was excluded from the SQL-Ledger community. What we found was that more inclusive communities always beat less inclusive ones, and that inclusive forks often kill non-inclusive parent projects.

I hope that the Linux community comes to its senses before it is too late. And I hope the PostgreSQL community never experiences such insanity, because this can be fatal. Of curse the projects that make this error will live on by other names, with other maintainers, and different committers, but the original projects are likely to die. And I don't want this.

Which is worse when working on production databases? Being drunk or tired?

2023-03-08T19:12:00.001-08:00

I have decided to do a series of mini-articles on human factors in database operations. This is the first, and covers fatigue.

In my talk at the PostgreSQL devroom of Fosdem, I asked a few questions:

1. How many of you have seen someone work on a production database while drunk? About half the audience.

2. How many times does this cause a major incident? No hands.

3. How many of you have seen someone cause a major incident by working on a production database while tired? Half the audience again raised their hands.

As an industry, we do not take fatigue seriously enough. We appreciate people who come in and work after long disruptive on-call shifts. We don't tell people they are tired and therefore not safe to work on production systems.

We need to do better. Every single major mistake in my career that has caused production problems has been caused either by power distance problems or by fatigue.

I am not saying people should come into work drunk. There are probably a number of contextual aspects to why drunkenness doesn't cause a problem in these cases. However I am saying that people should not touch production systems under fatigue.

Of course this is easier said than done, If we are drunk, we can feel it, but with even light stress, we often don't feel our fatigue. We aren't capable of self-monitoring our conditions in this regard. Fatigue is thus insidious -- it gradually sneaks up on our, invisible, until we make critical mistakes and bad things happen.

While there are reasons to weigh the balance differently in some areas such as operating motor vehicles (to say nothing about flying an aircraft), the fact is that general brain-intensive can be impaired via moderate fatigue perhaps more than levels of alcohol we consider unacceptable while driving.

If we value production operations, we should adopt the following rule: friends don't let friends work on the production databases tired.

The Coming Storm: Geopolitics and PostgreSQL in a Changing World

2023-03-03T05:24:00.003-08:00

At PGConf India, I watched Bruce Momjian's excellent talk Future Postgres Challenges. This talk discusses technical, technological, and project-related challenges Postgres has faced and continues to face. This immediately lead me to ask a question about efforts at geopolitical disentanglement and how we should try to avoid them in the Postgres community. While this question immediately gets interpreted through the filter of the ongoing war in Ukraine, it is far broader, as the US has been trying to disentangle itself from China and efforts there are ongoing. The problems, however, are not new.

When we as a community had decided to have a code of conduct, there was a lot of concern that the code of conduct would be used in politically motivated ways, in particular regarding culture war topics. I wrote a piece describing some of the issues which I also gave geopolitical importance to at that time. Culture war topics, I reasoned, are necessarily cultural, and hence trying to push one group's cultural ideas on the world through an open source project would be very harmful.

I wrote that piece in early 2016, before the Liberal International Order began to visibly fall apart with votes for Brexit and the election of Donald Trump. But the piece remains relevant because the dangers of allowing our community to be torn apart by current geopolitics are, if anything, even worse than they are in the culture war space.

If, as some would like, we were to shun community members and contributors who are resident in the "wrong" countries, or if we were to treat military contracts to American and Russian or Chinese governments differently, then we would risk the possibility of a fork forming around geopolitical lines. Nothing good will come out of that for anyone. And so the task ahead is to make sure that we continue making sure our community is culturally and geopolitically inclusive.

I expect that we will face a lot of pressure in coming years to bifurcate the community due to differing geopolitical perspectives, but such could well be fatal for Postgres as a project. Let's not let that happen.

Edit: I am moderating the comments as follows: What is on topic is a discussion about what is good for the project and community. Arguments about what is good for community that incidentally argue geopolitics are acceptable. Efforts at arguing geopolitics untethered from that question will get removed. I am trying to cultivate constructive discussion on a topic of community importance and sometimes this means pulling weeds (by which I mean deleting comments that detract from that).

Also all conversations need to be in the spirit of discussing what is right, not who is right. If you wonder why your comment got deleted, that's why.

Table Inheritance: What's it Good For?

2019-05-26T01:03:00.001-07:00

Table inheritance is one of the most misunderstood -- and powerful -- features of PostgreSQL. With it, certain kinds of hard problems become easy. While many folks who have been bitten by table inheritance tend to avoid the feature, this blog post is intended to provide a framework for reasoning about when table inheritance is actually the right tool for the job.

Table inheritance is, to be sure, a power tool and thus something to use only when it brings an overall reduction in complexity to the design. Moreover the current documentation doesn't provide a lot of guidance regarding what the tool actually helps with and where are the performance costs and because inheritance sits orthogonal to relational design, working this out individually is very difficult.

This blog post covers uses of table inheritance which simplify overall database design and are not addressed by declarative partitioning, because they are used in areas other than table partitioning.

Table Inheritance Explained

PostgreSQL provides the ability for tables to exist in an inheritance directed acyclic graph. Columns provided by parent tables are merged in name and type into the child table. Altering a parent table and adding a column thus cascades this operation to all child tables, though if any child table has a column with the same name and different type, the operation will fail.

Inheritance, Tables, and Types

Every table in PostgreSQL has a corresponding campsite type, and any table can be implicitly cast to any parent table. This is transitive. Combined with tuple processing functions, this gives you a number of very powerful ways of working with data at various different levels of scale.

Indexes and foreign keys are not inherited. Check constraints are inherited unless set to NO INHERIT.

Inheritance and Querying

When a table is queried, by default all child tables are also queried and their results appended to the result. Because of exclusion constraint processing, this takes out an ACCESS SHARE lock on all child tables at planning time. All rows are cast back to the type of the table target (in other words you get the columns of the table you queried).

Comparison to Java Interfaces

Despite the name, the closest equivalent to table inheritance in other programming languages are Java Interfaces. Here too you get implicit casts, a subset of fields, and a promise of compatible interfaces. And as a Java class can implement multiple interfaces, multiple inheritance in PostgreSQL is supported. Java programmers are encouraged to think of inheriting tables in interface rather than inheritance terms.

Use in Database Management Design

When we design a database there are often two overlapping concerns. The first is in relational algebra operations on the data, and the second is in managing the data. In a purely relational model this breaks down.

Notes Tables

One of the first really productive uses of table inheritance I had was in the notes tables in LedgerSMB. There are several hundred table in the database, and we want to attach notes to some subset of these tables. A naive approach might be a single global notes table with a bunch of foreign keys, or an ambiguous foreign key, or we just have a bunch of completely independent notes tables. All of these have serious obvious problems however. Large numbers of sparse foreign keys provide tons of NULL-handling problems, and provide a wide table that is harder to reason about. Ambiguous foreign keys are a terrible anti pattern which should never be used due to data consistency problems, and large numbers of independent tables provide an opportunity for subtle errors due to knowledge management problems.

A slightly better solution might be to define a notes composite type, and use CREATE TABLE OF TYPE instead. However typed tables of this sort have completely immutable schemas which makes them harder to manage over time.

We can then define a table structure something like as follows:

create table notes (

id serial primary key,

created_at timestamp not null default now(),

created_by text not null,

subject text not null,

body text not null,

fkey int not null,

check (false) NO INHERIT

);

This table will never have any rows, but child tables can have rows. For child tables, creating them is now easy:

create table invoice_notes (

LIKE notes INCLUDING INDEXES,

foreign key fkey REFERENCES invoice(id),

) INHERITS (notes);

The LIKE ... INCLUDING ALL indicates that we will copy in defaults, primary keys, and index definitions. This now provides a forward-looking way of managing all notes tables going forward. Uniqueness criteria remains enforced on a per-table basis.

If I later want to add a materialization of a column using a function I can do that in a reasonably straight-forward manner, at least compared to alternative approaches.

However, that's not all I can do. I can then provide a search_terms function on the parent table which can be used to query child tables.

create or replace function search_terms(notes)

returns tsvector language sql immutable as

select to_tsvector($1.subject) || to_tsvector($1.body);

$$;

I could then index, using GIN, the output of this function. I still have to create the index on all current tables but if I index it now on the notes table, all tables I create with LIKE notes INCLUDING ALL will now have that index too.

The function itself can be queried in a number of ways:

select * from invoice_notes n

where plainto_tsvector('something') @@ search_terms(n);

-- or

select * from invoice_notes n

where plainto_tsvector('something') @@ n.search_terms;

Once the function is created, that query works out of the box even though I never created a corresponding function for the invoice_notes table type. Thus providing a consistent interface to a group of tables is an area where table inheritance can help clear out a lot of complexity very fast. Benefits include a more robust database design, more easily re-used human knowledge in how pieces fit together, and easier management of database schemas.

Note on Use in Set/Superset Modeling

There are a number of cases where the query implications of inheritance are more important. This area is typically tricky because it often involves multiple inheritance and therefore there are a number of additional concerns that quickly crop up, though these have well-defined solutions discussed below.

Imagine we have an analytics database with numbers of pre-aggregated over possibly overlapping sets. We want to sum up numbers quickly and easily without complicating the query language. One option would be to create multiple views over a base table which includes the superset, but if your bulk operations primarily work over discrete subsets, you might get more out of breaking these out into subset tables which inherit the larger sets into which they are members. This is, in effect, a reverse partitioning scheme where a single physical table shows up in multiple query tables.

In certain cases this can be easier to manage than a single large table with multiple views selecting portions of that view. Use of this technique requires weighing different kinds of complexity and is best left for other posts.

Managing Schema Changes with Multiple Inheritance

In cases where multiple inheritance is used, adding and removing columns is relatively straight-forward, but altering existing tables can result in cases where an alteration interferes with checks on the other parent. Renaming columns or changing types of columns is particularly tricky. In most cases where this happens, a type change will not be done because rewriting tables is prohibitive, but renaming columns becomes the substitute and that is no less of a headache.

The key problem to note here is that the problem is that you have to make sure that both parents are changed at the same time, in the same statement. So the solution here is to create a super parent table with the subset of columns to be acted on, and then drop it when done. So here we:

begin;

create table to_modify (id int, new_id bigint);

alter table first_parent inherit to_modify;

alter table second_parent inherit to_modify;

alter table to_modify rename id to old_id;

alter table to_modify rename new_id to id;

commit;

The changes will then cascade down the inheritance graph properly.

Conclusions

Table inheritance is a surprisingly awesome feature in PostgreSQL, but misuse has given it a bad reputation. There are many cases where it simplifies operation and long-term management of the database in cases where partitioning actually doesn't work that well. This is a feature I expect to try to improve over time and hope others find it useful too, but to start we need to start using it for what it is good for, not the areas it falls short.

PostgreSQL at 20TB and Beyond Talk (PGConf Russia 2018)

2018-09-21T08:48:00.002-07:00

It came out a while ago but I haven't promoted it much yet.

This is the recorded version of the PostgreSQL at 20TB and Beyond talk. It covers a large, 500TB analytics pipeline and how we manage data.

For those wondering how well PostgreSQL actually scales, this talk is worth watching.

Thoughts on the Code of Conduct Controversy

2018-09-21T08:37:00.001-07:00

My overall perspective here is that the PostgreSQL community needs a code of conduct, and one which allows the committee to act in some cases for off-infrastructure activity, but that the current code of conduct has some problems which could have been fixed if efforts had been better taken ensure that feedback was gathered when it was actionable.

This piece discusses what I feel was done poorly but also what was done well and why, despite a few significant missteps, I think PostgreSQL as a project is headed in the right direction in this area.

But a second important point here is to defend the importance of a code of conduct to dissenters here, explain why we need one, and why the scope needs to extend where it needs to extend to, and why we should not be overly worried about this going in a very bad direction. The reason for this direction is that in part I found myself defending the need for a code of conduct to folks I collaborate with in Europe and the context had less to do with PostgreSQL than with the Linux kernel. But the projects in this regard are far more different than they are similar.

Major Complaint: Feedback Could Have Been Handled Better (Maybe Next Time)

In early May there was discussion about the formation of a code of conduct committee, in which I argued (successfully) that it was extremely important that the committee be geographically and culturally diverse so as to avoid one country's politics being unintentionally internationalized through a code of conduct. This was accepted and as I will go into below this is the single most important protection we have against misuse of the code of conduct to push political agendas on the community. However after this discussion there was no further solicitation for feedback until mid-September.

In Mid-September, the Code of Conduct plan was submitted to the list. In the new code of conduct was a surprising amendment which had been made the previous month, expanding the code of conduct to all interactions between community members unless another code of conduct applied and superseded the PostgreSQL community code of conduct. I objected to this as did several others with actionable criticism and alternatives. Unfortunately we were joined by a large numbers of people wanting to relitigate whether we needed a code of conduct in the first place. Those of us with actionable feedback were told that no changes would be made for about a year. In essence what looked like a public comment period was not and the more actionable feedback was, the more clearly it was ignored.

Had there been an actual comment period on the proposed language, I maintain that things would have been more tame, but ignoring even actionable feedback in such a period, in my view, helped throw fuel on the fire regarding those who wanted to re-litigate the whole concept because it further helped push the view that a plan was announced and then any concern ignored. This was unfortunate. If there had been a comment period, a deliberation, and a final draft things would have gone better.

I hope that next time such a process is followed, where feedback on proposed final wording is taken before the decision is made to refuse to make changes for a year.

Why We Have Codes of Conduct

Humans are social animals. Groups of humans form social groups, which often have group infrastructure which needs to be managed. Open source thus has all of the political considerations of a multi-national collaborative community and this includes management of common infrastructure, and how we treat each other. The kinds of social relationships and interactions that we have in the community are shaped by our culture, gender, and outlook on life, and in an international project there can be a lot of problems. When national political issues are kept out of the project and the project consists mostly of people who are willing to defend themselves possibly aggressively, a project can get along ok without a code of conduct, but as things change, it is important that there be a means of resolving conflicts within the community. Hence one needs a dedicated committee and a document which reminds people to act in ways that keep the peace.

Codes of conduct thus have a role in ensuring that people can come together and work in a collegial and civil manner across cultural, political and other disagreements, and continue to build the great software that we all rely on. In this regard I think the PostgreSQL community has hit the most important milestones and begun to build a code of conduct infrastructure which can last and ensure that the code of conduct does not turn into a way of one group of people forcing a political agenda on the world.

I have been to many conferences. Often at some point discussions turn to politics in some way. With the exception of one conversation, these comments have been thoughtful, receptive, and mutually entertaining but in that one exception, I saw a certain degree of aggressiveness that might, for others, even rise to the level of physical intimidation. A reminder that we all need to genuinely be nice to eachother is a step in the right direction.

Codes of conduct cannot create fairness. They cannot create social justice. They cannot broaden meritocracy to community contribution beyond code. Those things have to be done through other means, but they can remind everyone to treat each other collegially and to respect differences of opinion and so forth.

However, codes of conduct cannot enable merely formalities to defeat this purpose. A campaign of harassment that is taken off-list is at least as much of a problem as discussions on-list. Therefore community-related conversations are things which might have to sometimes fall under the jurisdiction of community conflict adjudication mechanisms such as the Code of Conflict.

What PostgreSQL is Doing Right

The danger in any code of conduct is that an internal controversy from one country or culture will be read into disputes in a way which ensures that other cultural groups do not feel comfortable participating. GLBT issues are an area where this commonly comes up, and in a project where you have a lot of involvement from countries where the views are very different from those of the US, this leads to big problems. On one hand, some people may see others' cultural views as invalidating their sexual identity, while others would see views pushing universalism in GLBT rights as invalidating their cultural identity. These issues cannot be resolved without retreating to a single cultural context as the norm, discouraging participation from much of the world, and thus need to be outside what a code of conduct handles. In this case it does not matter what one believes to be the right approach, but rather the fact that the consequences of siding with either side in such a controversy would be devastating for the community.

One of the key points of the current Code of Conduct is that the committee is itself geographically and culturally diverse. This ensures that the intra-committee cultural divisions will help ensure that the committee cannot just bull-doze a political orthodoxy out of fear of how a domestic controversy might be perceived. The cultural diversity thus is an immense protection and it effectively ensures that there is a right to engage in the free struggle of political opinion in one's own country.

From a responsibility to civic engagement comes a right to such a free struggle of political opinion, and in my view this is something which is effectively preserved in the community today. Note that this would not apply to trying to position the PostgreSQL project as against any political, cultural, or other group. Nor should it protect actual personally directed harassment against any member for any reason. I believe that the committee is capable of drawing these lines and hence I see the PostgreSQL project as off to a shaky but viable start.

Unlucky Timing

The Code of Conduct controversy accidentally coincided with the Linux Foundation adopting the Contributor Covenant as its code of conduct. The Contributor Covenant is a code of conduct which transparently attempts to push certain norms of certain parts of the US political spectrum on the rest of the world (see, for example, Opalgate). While I believe this to be a mistake, time will tell how this is handled. The Contributor Covenant was soundly and decisively rejected by the PostgreSQL community early on as too transparently political.

A lot of the emotional reactions in this controversy by dissenters may well be in relation to that. This is one of those things one cannot plan for and it makes it harder to have real discussions today.

Calls to Action and Conclusions

I have submitted a couple of requests for wording changes to the code of conduct committee. For others who see a need for a committee to help ensure a collegial and productive community, and see opportunities for improvement I suggest you do the same. But simply arguing about whether we need a resolution process is not productive and that probably needs to stop.

I also think the community needs to insist on two modifications to the current process:

There needs to be a comment period and deliberation over feedback between a draft of a new revision and its adoption
The code of conduct committee needs to reply with reasons why particular suggestions were rejected.

However on the other hand I think PostgreSQL as a project is off to a viable start in what is likely to become the right direction. And that is something we should all be thankful for.

On Contempt Culture, a Reply to Aurynn Shaw

2017-08-09T22:15:00.000-07:00

I saw an interesting presentation recorded and delivered on LinkedIn on contempt culture by Aurynn Shaw, delivered this year at PyCon. I had worked with Aurynn on projects back when she used to work for Command Prompt. You can watch the video below:

Unfortunately comments on a social media network are not sufficient for discussing nuance so I decided to put this blog post together. In my view she is very right about a lot of things but there are some major areas where I disagree and therefore wanted to put together a full blog post explaining what I see as an alternative to what she rightly condemns.

To start out, I think she is very much right that there often exists a sort of tribalism in tech with people condemning each others tools, whether it be Perl vs PHP (her example) or vi vs emacs, and I think that can be harmful. The comments here are aimed at fostering a sort of inclusive and nuanced conversation that is needed.

The Basic Problem

Every programming culture has norms, and many times groups from outside those norms tend to be condemned in some way or another. There are a number of reasons for this. One is competition and the other is seeking approval in one's in group. I think one could take her points further and argue that in part it is about an effort to improve the relative standing of one's group relative to others around it.

Probably the best example we can come up with in the PostgreSQL world is the way MySQL is looked at. A typical attitude is that everyone should be using PostgreSQL and therefore people choosing MySQL are optimising for the wrong things.

But where I would start to break with Aurynn's analysis would be when we contrast how we look at MySQL with how we look at Oracle. Oracle, too, has some major oversights (empty string being null if it is a varchar, no transactional DDL, etc). Almost all of us may dislike the software and the company. But people who work with Oracle still have prestige. So bashing tools isn't quite the same thing as bashing the people who use them. Part of it, no doubt, is that Oracle is more established, is an older player in the market, and therefore there is a natural degree of prestige that comes from working with the product. But the question I have is what can we learn from that?

Some time ago, I wrote a the most popular blog post in the history of this blog. It was a look at the differences in design between MySQL and PostgreSQL and was syndicated on DZone, featured in Hacker News, and otherwise got a fairly large review. In general, aside from a couple of historical errors, the PostgreSQL-using audience loved the piece. What surprised me though was that the MySQL-users also loved the piece. In fact one comment that appeared (I think on Reddit) said that I had expressed why MySQL was better.

The positive outpouring from MySQL users, I think, came from the fact that I sympathetically looked at what MySQL was designed to do and what market it was designed for (applications that effectively own the database), describing how some things I considered misfeatures actually could be useful in that environment, but also being brutally honest about the tradeoffs.

Applying This to Programming Language Debates

Before I start discussing this topic, it is worth a quick tour of my experience as a software developer.

The first language I ever worked with was BASIC on a C64. I then dabbled in Logo and some other languages, but the first language I taught myself professionally was PHP. From there I taught myself some very basic Perl, Python, and C. For a few years I worked with PHP and bash scripting, only to fall into doing Perl development by accident. I also became mildly proficient in Javascript.

My PostgreSQL experience grew out of my Perl experience. And about 3 years ago I was asked to start teaching Python courses. I rose to this challenge. Around the same time, I had a small project where we used Java and quickly found myself teaching Java and now I feel like I am moderately capable in that language. I am now teaching myself Haskell (something I think I could not have done before really mastering Python). So I have worked with a lot of languages. I can pick up new languages with ease. Part of it is because I generally seek to understand a language as a product of its own history and the need it was intended to address.

As we all know different programming languages are associated with stereotypes. Moreover, I would argue that stereotypes are usually imperfect understandings that out-group people have of in-group dynamics, so dismissing stereotypes is often as bad as simply accepting them.

PHP as a case study, compared to C.

I would like to start with an example of PHP, since this is the one specifically addressed in the talk and it is a language I have some significant experience writing software in.

PHP often is seen to be insecure because it is easy to write insecure software in the language. Of course it is easy to write insecure software in any language, but certain vulnerabilities are a particular problem in PHP due to lexical structure and (sometimes) standard library issues.

Lexically, the big issue with PHP is the fact that the language is designed to be a preprocessor to SGML files (and in fact it used to be called the PHP Hypertext Preprocessor). For this reason everything, PHP is easy to embed in SGML PI tags (so you can write a PHP template as a piece of valid HTML). This is a great feature but it makes cross site scripting particularly easy to overlook. A lot of the standard library in the 1990's had really odd behaviour, though much of this has been corrected.

Aurynn is right to point to the fact that these were exacerbated by a flood of new programmers during the rise of PHP, but one thing she does not discuss in the talk is how software and internet security were also changing during the time. In essence, the late 1990's saw the rise of SSH (20k users in 1995 to over 2M in 2000), the end of transmission of passwords across the internet in plain text, the rise of concern about SQL injection and XSS, and so forth. PHP's basic features were in place just before this really got going, and adding to this a new developer community, and you have a recipe for security problems. Of course, today, PHP has outgrown a lot of this and PHP developers today have codified best practices to deal with a lot of the current threats.

If we contrast this with C as programming language, C has even more glaring lexical issues regarding security, from double free bug possibilities to buffer overruns. C, however, is a very unforgiving language and consequently, it doesn't tend to be a language that has a large, novice developer community. At the same time, a whole lot of security issues come out of software in C.

Conclusions

There is no such thing as a perfect tool (database, programming language, etc). As we grow as professionals, part of that process is learning to better use the strengths of the technologies we work with and part of it is learning to overcome the oversights and problems of the tools as well.

It is further not the case that just because a programmer primarily uses a tool with real oversights that this reflects poor judgment from the programmer. Rather this process of learning can have the opposite impact. C programmers tend to be very knowledgeable because they have to be. The same is true for Javascript programmers for very different reasons. And one doesn't have to validate all language design decisions in order to respect others.

Instead of attacking developers of other languages, my recommendation is, when you see a problem, to neutrally and respectfully point it out, not from a position of superiority but a position of respectful assistance and also to understand that often what may seem like poor decisions in the design of a language may in fact have real benefits in some cases.

For example, Java as a language encourages mediocrity of code. It is very easy to become a mediocre Java developer. But once you understand Java as a language, this becomes a feature because it means that the barrier to understanding and debugging (and hence maintaining!) code is reduced, and once you understand that you can put emphasis instead on design and tooling. This, of course, also has costs since it is easy for legacy patterns to emerge in the tooling (JavaBeans for example) but it allows some really amazing frameworks, such as Spring.

On the other extreme, Javascript is a language characterised by shortcuts taken during the initial design stage (for time constraint reasons) and some of those cause real problems, but others make hard things possible. Javascript makes it, also, very easy to be a bad Javascript programmer. But perhaps for this reason I have found that professional Javascript programmers tend to be extremely knowledgeable, and have had to work very hard to master software development in the language, and they usually bring to the table great insights into computing problems generally.

So what I would recommend that people take away is the idea that in fact we do grow out of hardship, and that problems in tools are overcome over time. So for that reason discussing real shortcomings of tools while at the same time respecting communities and their ability to grow and overcome problems is important.

PostgreSQL at 10TB and Beyond Recorded Talk

2017-02-13T01:58:00.002-08:00

The PostgreSQL at 10 TB And Beyond talk has now been released on Youtube. Feel free to watch. For the folks seeing this on Planet Perl Iron Man, there is a short function which extends SQL written in Perl that runs in PostgreSQL in the final 10 minutes or so of the lecture.

This lecture discusses human and technical approaches to solving volume, velocity, and variety problems on PostgreSQL in the 10TB range on a single, non-sharded large server.

As a side but related note, I am teaching a course through Edument on the topics discussed in Sweden discussing many of the technical aspects discussed here, called Advanced PostgreSQL for Programmers. You can book the course for the end of this month. It will be held in Malmo, Sweden.

PL/Perl and Large PostgreSQL Databases

2017-01-26T04:47:00.001-08:00

One of the topics discussed in the large database talk is the way we used PL/Perl to solve some data variety problems in terms of extracting data from structured text documents.

It is certainly possible to use other languages to do the same, but PL/Perl has an edge in a number of important ways. PL/Perl is light-weight, flexible and fills this particular need better than any other language I have worked with.

While one of the considerations has often been knowledge of Perl in the team, PL/Perl has a number of specific reasons to recommend it:

It is light-weight compared to PL/Java and many other languages
It excels at processing text in general ways.
It has extremely mature regular expression support

These features combine to create a procedural language for PostgreSQL which is particularly good at extracting data from structured text documents in the scientific space. Structured text files are very common and being able to extract, for example, a publication date or other information from the file is very helpful.

Moreover when you mark your functions as immutable, you can index the output, and this is helpful when you want ordered records starting at a certain point.

So for example, suppose we want to be able to query on plasmid lines in UNIPROT documents but we have not set this up before we loaded the table. We could easily create a PL/Perl function like:

CREATE OR REPLACE FUNCTION plasmid_lines(uniprot text)
RETURNS text[]
LANGUAGE PLPERL IMMUTABLE AS
$$
use strict;
use warnings;
my ($uniprot) = @_;
my @lines = grep { /^OG\s+Plasmid/ } split /\n/ $uniprot;
return [ map { my $l = $_; $l =~ s/^OG\s+Plasmid\s*//; $l } @lines ];
$$;

You could then create a GIN index on the array elements:

CREATE INDEX uniprot_doc_plasmids ON uniprot_docs USING gin (plasmid_lines(doc));

Neat!

PostgreSQL at 10 TB and Above

2017-01-24T09:44:00.001-08:00

I have been invited to give a talk on PostgreSQL at 10TB and above in Malmo, Sweden. The seminar is free to attend. I expect to be talking for about 45 minutes with some time for questions and answers. I also have been invited to give the talk at PG Conf Russia in March. I do not know whether either will be recorded. But for those in the Copenhagen/Malmo area, you can register for the seminar at the Event Brite page.

I thought it would be helpful to talk about what problems will be discussed in the talk.

We won't be talking about the ordinary issues that come with scaling up hardware, or the issues of backup or recovery, or of upgrades. Those could be talks of their own. But we will be talking about some deep, specific challenges we faced and along the way talking about some of the controversies in database theory that often come up in these areas, and we will talk about solutions.

Two of these challenges concern a subsystem in the database which handled large amounts of data in high-throughput tables (lots of inserts and lots of deletes). The other two address volume of data.

Performance problems in work queue tables regarding large numbers of deletions off the head of indexes with different workers deleting off different indexes. This is an atypical case where table partitioning could be used to solve a number of underlying problems with autovacuum performance and query planning.
Race conditions in stored procedures between mvcc snapshots and advisory locks in the work queue tables. We will talk about how this race condition happens and we solved it without using row locks. We solved this by rechecking results in a new snapshot which we decided was the cheapest solution to this problem.
Slow access and poor plans regarding accessing data in large tables. We will talk about what First Normal Form really means, why we opted to break the requirements in this case, what problems this caused, and how we solved them.
Finally, we will look at how new requirements on semi-structured data were easily implemented using procedural languages, and how we made these perform well.

In the end there are a number of key lessons one can take away regarding monitoring and measuring performance in a database. These include being willing to tackle low-level details, measure, and even simulate performance.

Please join me in Malmo or Moscow for this talk.

What Git got Right and Wrong

2016-12-11T06:51:00.000-08:00

Having been using various vcs solutions for a while, I think it is worth noting that my least favorite from a user experience perspective is git. To be sure, git has better handling of whitespace-only merge conflicts and a few other major features than any other vcs that I have used.

And the data structure and model are solid.

But even understanding what is going on behind the scenes, I find git to be an unintuitive mess at the CLI issue.

Let me start by saying things that git got right:

The dag structure is nice
Recursive merges are good
The data models and what goes on behind the scenes is solid.
It is the most full-featured vcs I have worked with

However, I still have real complaints with the software These include fundamentally different concepts merged into the same label and the fact that commands may do many different things depending on how you call them. The fact that the concepts are not clear means that it is worse than a learning curve issue. One cannot have a good grasp of what git is doing behind the scenes because this is not always clear. In particular:

In what world is a fast-forward a kind of merge?
Is there any command you can explain (besides clone) in three sentences or less?
What does git checkout do? Why does it depend on what you checkoout? Can you expect an intermediate user to understand what to expect if you have staged changes on a file, when you try to check out a copy of that file from another revision?
What does git reset do from a user perspective? Is there any way a beginner can understand that from reading the documentation?
In other words, git commands try to do too much at once and this is often very confusing.
Submodules are an afterthought and not well supported across the entire tool chain (why does git-archive not have an option to recurse into submodules?)

These specific complaints come from what I see as a lack of clarity regarding what concepts mean and they indicate that those who wrote the tools did so at a time when the concepts were still not entirely clear in their own minds. In essence it is not that things are named in unclear ways but that concepts have unclear boundaries.

Some of this could be fixed in git. fast-forward could be referred to as a shortcut to avoid a merge rather than a kind of merge. Some could be fixed with better documentation (we could describe git reset by what the user-facing changes are, rather than the internal changes). Some would require a very different set of command layouts.

PostgreSQL vs Hadoop

2016-08-18T22:43:00.002-07:00

So one of the clients I do work with is moving a large database from PostgreSQL to Hadoop. The reasons are sound -- volume and velocity are major issues for them, and PostgreSQL is not going away in their data center and in their industry there is a lot more Hadoop usage and tooling than there is PostgreSQL tooling for life science analytics (Hadoop is likely to replace both PostgreSQL and, hopefully, a massive amount of data on NFS). However this has provided an opportunity to think about big data problems and solutions and their implications. At the same time I have seen as many people moving from Hadoop to PostgreSQL as the other way around. No, LedgerSMB will never likely use Hadoop as a backend. It is definitely not the right solution to any of our problems.

Big data problems tend to fall into three categories, namely managing ever increasing volume of data, managing increasing velocity of data, and dealing with greater variety of data structure. It's worth noting that these are categories of problems, not specific problems themselves, and the problems within the categories are sufficiently varied that there is no solution for everyone. Moreover these solutions are hardly without their own significant costs. All too often I have seen programs like Hadoop pushed as a general solution without attention to these costs and the result is usually something that is overly complex and hard to maintain, may be slow, and doesn't work very well.

So the first point worth noting is that big data solutions are specialist solutions, while relational database solutions for OLTP and analytics are generalist solutions. Usually those who are smart start with the generalist solutions and move to the specialist solutions unless they know out of the box that the specialist solutions address a specific problem they know they have. No, Hadoop does not make a great general ETL platform.....

One of the key things to note is that Hadoop is built to solve all three problems simultaneously. This means that you effectively buy into a lot of other costs if you are trying to solve only one of the V problems with it.

The single largest cost comes from the solutions to the variety of data issues. PostgreSQL and other relational data solutions provide very good guarantees on the data because they enforce a lack of variety. You force a schema on write and if that is violated, you throw an error. Hadoop enforces a schema on read, and so you can store data and then try to read it, and get a lot of null answers back because the data didn't fit your expectations. Ouch. But that's very helpful when trying to make sense of a lot of non-structured data.

Now, solutions to check out first if you are faced with volume and velocity problems include Postgres-XL and similar shard/clustering solutions but these really require good data partitioning criteria. If your data set is highly interrelated, it may not be a good solution because cross-node joins are expensive. Also you wouldn't use these for smallish datasets either, certainly not if they are under a TB since the complexity cost of these solutions is not lightly undertaken either.

Premature optimization is the root of all evil and big data solutions have their place. However don't use them just because they are cool or new, or resume-building. They are specialist tools and overuse creates more problems than underuse.

Forthcoming new scalable job queue extension

2016-08-14T00:48:00.000-07:00

So for those of you who know, I now spend most of my time doing more general PostgreSQL consulting and a fair bit of time still on LedgerSMB. One of my major projects lately has been on a large scientific computing platform currently run on PostgreSQL, but due to volume and velocity of data being moved to Hadoop (the client maintains other fairly large PostgreSQL instances with no intention of moving btw).

With this client's permission I have decided to take a lot of the work I have done in optimizing their job queue system and create an extension under PostgreSQL for it.. The job queue currently runs tens of millions of jobs per day (meaning twice that number of write queries, and a fair number of read queries too) and is one of the most heavily optimized parts of the system, so this will be based on a large number of lessons learned on what is a surprisingly hard problem.

It is worth contrasting this to pg_message_queue of which I am also the author. pg_message_queue is intended as a light-weight, easy to use message queue extension that one can use to plug into other programs to solve common problems where notification and message transfer are the main problems. This project will be an industrial scale job queuing system aimed at massive concurrency. As a result simplicity and ease of use take second place to raw power and performance under load. In other words here I am not afraid to assume the dba and programming teams know what they are doing and has the expertise to read the manual and implement appropriately.

The first version (1.x) will support all supported versions of PostgreSQL and make the following guarantees:

massively multiparallel, non-blocking performance (we currently use with 600+ connections to PostgreSQL by worker processes).
Partitioning, coalescing, and cancelling of jobs similar in some ways to TheSchwartz
Exponential pushback based on number of times a job has failed
Jobs may be issued again after deletion but that this can always be detected and bad jobs pruned
Optionally job table partitioning.

The first client written will rely on hand-coded SQL along with DBIx::Class's schema objects. This client will guarantee that:

Work modules done always succeeds or fails in a transaction
A job notifier class will be shown
Pruning of completed jobs will be provided via the perl module and a second query.

The history of this is that this came from a major client's use of The Schwartz and they out grew it for scalability reasons. While the basic approach is thus compatible, the following changes are made:

Job arguments are in json format rather than in Storable format in bytea columns
Highly optimized performance on PostgreSQL
Coalesce is replaced by a single integer cancellation column
Jobs may be requested by batches of various sizes

2.x will support 9.5+ and dispense with the need for both advisory locks and rechecking. I would like to support some sort of graph management as well (i.e. a graph link that goes from one job type to another which specifies "for each x create a job for y" type of semantics. That is still all in design.

use lib '.' considered harmful (exploits discussed)

2016-08-05T09:22:00.001-07:00

Which the discussion of CVE-2016-1238, a quick and easy fix for broken code that has been suggested is to add the following line to the top of broken Perl scripts: Note this applies to Perl as run anywhere, whether pl/perlU, plain perl, or something else.

use lib '.';

In some corners, this has become the goto solution for the problem (pun quite deliberate). It works, gets the job done, and introduces subtle, hidden, and extremely dangerous problems in the process.

For those interested in the concerns specific to PostgreSQL, these will be discussed near the end of this article.

Security and the Garden Path

I am borrowing an idea here from linguistics, the idea of the garden path, as something that I think highlights a lot of subtle security problems. Consider the newspaper headline "Number of Lothian patients made ill by drinking rockets." Starts off simple enough and you get to the end, realizing you must have misread it (and you did, probably, since the number of patients increased who were made ill by drinking, not that some people got sick because they drank hydrazine). The obvious reading and the logical reading diverge and this leads to a lot of fun in linguistic circles.

The same basic problem occurs with regard to security problems. Usually security problems occur because of two problems. Either people do something obviously insecure (plain text authentication for ftp users where it matters) or they do something that looks on the surface like it is secure but behind the scenes does something unexpected.

Perl here has a few surprises here because parsing and running a script is a multi-pass process but we tend to read it as a single pass. Normally these don't cause real problems but in certain cases there are very subtle dangers lurking. Here, with use lib '.', it is possible to inject code into a running program as long as an attacker can get a file placed in the current working directory of the program.

Relative paths, including the current working directory, do have legitimate use cases, but the problems and pitfalls must be understood before selecting this method.

What the lib pragma does

Perl looks for files to require or include based on an array of paths, globally defined, called @INC. Use lib stores a copy of the original lib at first use, and then ensures that the directory specified occurs at the start of the search order. So directories specified with use lib are searched before the default library directories. This becomes important as we look at how Perl programs get executed.

How Perl runs a program

Perl runs a program in two passes. First it creates the parse tree, then it runs the program. This is a recurive process and because of how it works, it is possible for malicious code that gets accidently run in this process to transparently inject code into this (and other portions) of the Perl process.

Keep in mind that this makes Perl a very dynamic language which, on one hand, has serious garden path issues, but on the other ensures that it is an amazingly flexible language.

During the parse stage, Perl systematically works through the file, generating a parse tree, and running any "BEGIN" blocks, "use" statements, and "no" statements. This means that injected code can be run even if later sytnax errors appear to prevent the bulk of the program from running at all or if earlier errors cause run-time exception.

After this process finishes, Perl executes the parse tree that results. This means that Perl code can rewrite the parse tree before your code is ever really written and that code can be inserted into that part o the process.

Transparent code injection during 'use'

Consider a simple Perl script:

#!/usr/bin/perl

use lib '.';
use Cwd;
use 5.010;
use strict;
use warnings;

say getcwd();

Looks straight-forward. And in most cases it will do exactly what it looks like it does. It loads the standard Cwd module and prints out the current working directory.

However, suppose I run it in a different directory, where I add two additional files:

Cwd.pm contains:

package Cwd;

use Injected;

hmmmm that doesn't look good. What does Injected.pm do?

package Injected;

use strict;

sub import {

local @INC = @INC;

my ($module) = caller;

warn $module;

delete $INC{'Injected.pm'};

delete $INC{"$module.pm"};

@INC = grep { $_ ne '.' } @INC;

eval "require $module";

warn "Got you!, via $module";

}

So when Cwd imports Injected, it deletes itself from the memory of having been included, deletes its caller too, reloads the correct caller (not from the current working directory) and then executes some code (here a harmless warning).

Cwd.pm then returns success
test.pl runs Cwd->import() which is now the correct one, but we have already run unintended code that could in theory do anything.

Any program capable of being run in arbitrary directories, written in Perl, which has this line in it (use lib '.') is subject to arbitrary code injection attacks using any module in the dependency tree, required or not.

Instead, do the opposite (where you can)

As a standard partof boilerplate in any secure Perl program, I strongly recommend adding the following line to the top of any script. As long as modules don't add it back in behind your back (would be extremely rare that they would), adding the following line:

no lib '.';

Note that this strips out the current working directory even if it si supplied as a command-line argument. So it may not be possible in all cases. So use common sense, and do some testing, and document this as desired behavior. Note one can still invoke with perl -I./. in most cases so it is possible to turn this safety off..... Additionally if you put that at the start of your module, something you include could possibly put it back in.

Safer Alternatives

In a case where you need a path relative to the script being executed, FindBin is the ideal solution. It gives you a fixed path relative to the script being run, which is usually sufficient for most cases of an application being installed on a system as a third party. So instead you would do:

use FindBin;
use lib $FindBin::Bin;

Then the script's directory will be in the include path.

PL/PerlU notes:

I always add the explicit rejection of cwd in my plperlu functions. However if someone has a program that is broken by CVE-2016-1238 related fixes, it is possible that someone would add a use lib '.' to a perl module, which is a bad idea. As discussed in the previous post, careful code review is required to be absolutely safe. Additionally, it is a very good idea to periodically check the PostgreSQL data directory for perl modules which would indicate a compromised system.

PostgreSQL, PL/Perl, and CVE-2016-1238

2016-08-02T21:39:00.000-07:00

This post is about the dangers in writing user defined functions in untrusted languages, but it is also specifically about how to avoid CVE-2016-1238-related problems when writing PL/PerlU functions. The fix is simple and straight-forward and it is important for it to be in pl/perlU stored procedures and user defined functions for reasons I will discuss. This discusses actual exploits and the severity of being able to inject abritrary Perl code into the running database backend is a good reason to be disciplined and careful about the use of this language.

It is worth saying at the outset that I have been impressed by how well sensible design choices in PostgreSQL generally mitigate problems like this. In essence you have to be working in an environment where a significant number of security measures have been bypassed either intentionally or not. This speaks volumes on the security of PostgreSQL's design since it is highly unlikely that this particular attack vector was an explicit concern. In other words these decisions make many attacks even against untrusted languages far more difficult than they would be otherwise.

The potential consequences are severe enough, however, that secure coding is particularly important in this environment even with the help of the secure design. And in any language it is easy to code yourself into corners where you aren't aware you are introducing security problems until they bite you.

The current implementation, we will see, already has a fair bit of real depth of defense behind it. PostgreSQL is not, by default, vulnerable to the problems in the CVE noted in the title. However, with a little recklessness, it is possible to open the door to the possibility of real problems and it is possible for these problems to be hidden from the code reviewer by error rather than actual malice. Given the seriousness of what can happen if you can run arbitrary code in the PostgreSQL back-end, my view is that all applicable means should be employed to prevent problems.

PL/PerlU can be used in vulnerable ways, but PL/Perl (without the U) is by design safe. Even with PL/PerlU it is worth noting that multiple safety measures have to be bypassed before vulnerability becomes a concern. This is not about any vulnerabilities in PostgreSQL, but what vulnerabilities can be added through carelessly writing stored procedures or user-defined functions.

There are two lessons I would like people to take away from this. The first is how much care has been taken with regard to PostgreSQL regarding security in design. The second is how easily one can accidentally code oneself into a corner. PL/PerlU often is the right solution for many problem domains and it can be used safely but some very basic rules need to be followed to stay out of trouble.

Extending SQL in PostgreSQL using Arbitrary Languages

PostgreSQL has a language handler system that allows user defined functions and stored procedures to be written in many different languages. Out of the box, Python, TCL, C, and Perl come supported out of the box. Perl and TCL come in trusted and untrusted variants (see below), while Python and C are always untrusted.

There are large numbers of external language handlers as well, and it is possible to write ones own. Consequently, PostgreSQL allows, effectively, SQL to be extended by plugins written in any language. The focus here will be on untrusted Perl, or PL/perlU. Untrusted languages have certain inherent risks in their usage and these become important, as here, when there is concern about specific attack vectors.

Trusted vs Untrusted Languages, and what PL/Perl and PL/PerlU can do

PostgreSQL allows languages to be marked as 'trusted' or 'untrusted' by the RDBMS. Trusted languages are made available for anyone to write functions for while untrusted languages are restricted to database superusers. Trusted languages are certified by the developers not to interact with file handles, not to engage in any other activity other than manipulating data in the database.

There is no way to 'use' or 'require' a Perl module in pl/perl (trusted) and therefore it is largely irrelevant for our discussion. However PL/PerlU can access anything else on the system and it does so with the same permissions as the database manager itself. Untrusted languages, like PL/PerlU make it possible to extend SQL in arbitrary ways by injecting (with the database superuser's permission!) code into SQL queries written in whatever langauges one wants. Untrusted languages make PostgreSQL one of the most programmable relational databases in the world, but they also complicate database security in important ways which are well beyond the responsibility of PostgreSQL as a project.

Like test harnesses, untrusted languages are insecure by design (i.e. they allow arbitrary code injection into the database backend) and this issue is as heavily mitigated as possible, making PostgreSQL one of the most security-aware databases in the market.

To define a function in an untrusted language, one must be a database superuser, so the PostgeSQL community places primary trust in the database administration team not to do anything stupid, which is generally a good policy. This article is largely in order to help that policy be effective.

Paths, Permissions, and CVE-2016-1238

The CVE referenced in the title of this article and section is one which allows attackers to inject code into a running Perl routine by placing it in the current working directory of a Perl process. If an optional dependency of a dependency is placed i the current working directory, it may be included in the current Perl interpreter without the understanding of the user. This is a problem primarily because Perl programs are sufficiently complex that people rarely understand the full dependency tree of what they are running.

If the current working directory ends up being one which includes user-writeable data of arbitrary forms, the problem exists. If not, then one is safe.

The problem however is that changing directories changes the include path. You can demonstrate this by doing as follows:

In ./test.pl create a file that contains:

use test;
use test2;

./test.pm is:

chdir test;
\
test/test2.pm is:

warn 'haxored again';

What happens is that the use test statement includes ./test.pm which changes directory, so when you use test2, you are including from test/test2.pm. This means that if any period during this cycle of the Perl interpreter's life you are in a world-writable directory, and including or requring files, you are asking for trouble.

Usual Cases for PL/PerlU

The usual use case of PL/PerlU is where you need a CPAN module to process or handle information.. For example you may want to return json using JSON.pm, or you may want to write to a (non-transactional log).

For example, the most recent PL/PerlU function I wrote used basic regular expressions to parse a public document record stored in a database to extract information relating to patents awarded on protein sequences. It was trivial but was easier to use JSON than to write json serialization inline (and yes, performed well enough given the data sizes operate on).

Are usual cases safe? Deceptively so!

Here are a few pl/perlU functions which show the relevant environment that PL/PerlU functions run in by default:

postgres=# create or replace function show_me_inc() returns setof text language plperlu as
$$
return \@INC;
$$;
CREATE FUNCTION
postgres=# select show_me_inc();
show_me_inc
------------------------------
/usr/local/lib64/perl5
/usr/local/share/perl5
/usr/lib64/perl5/vendor_perl
/usr/share/perl5/vendor_perl
/usr/lib64/perl5
/usr/share/perl5
.
(7 rows)

postgres=# create or replace function get_cwd() returns text language plperlu as
postgres-# $$
postgres$# use Cwd;
postgres$# return getcwd();
postgres$# $$;
CREATE FUNCTION
postgres=# select get_cwd();
get_cwd
---------------------
/var/lib/pgsql/data
(1 row)

Wow. I did not expect such a sensible, secure implementation. PostgreSQL usually refuses to start if non-superusers of the system have write access to the data directory. So this is, by default, a very secure configuration right up until the first chdir() operation......

Now, in the normal use case of a user defined function using PL/PerlU, you are going to have no problems. The reason is that most of the time, if you are doing sane things, you are going to want to write immutable functions which have no side effects and maybe use helpers like JSON.pm to format data. Whether or not there are vulnerabilities exploitable via JSON.pm, they cannot be exploited in this manner.

However, sometimes people do the wrong things in the database and here, particularly, trouble can occur.

What gets you into trouble?

To have an exploit in pl/perlU code, several things have to happen:

Either a previous exploit must exist which has written files to the PostgreSQL data directory, or one must change directories
After changing directories, a vulnerable module must be loaded while in a directory the attacker has write access to.

It is possible, though unlikely, for the first to occur behind your back. But people ask me all the time how to send email from the database backend and so you cannot always guarantee people are thinking through the consequences of their actions.

So vulnerabilities can occur when people are thinking tactically and coding in undisciplined ways. This is true everywhere but here the problems are especially subtle and they are as dangerous as they are rare.

An attack scenario.

So suppose company A receives a text file via an anonymous ftp drop box in X12 format (the specific format is not relevant, just that it comes in with contents and a file name that are specified by the external user). A complex format like X12 means they are highly unlikely to do the parsing themselves so they implement a module loader in PostgreSQL. The module loader operates on a file handle as such:

On import, the module loader changes directories to the incoming directory. In the actual call to get_handle, it opens a file handle, and creates an iterator based on that, and returns it. Nobody notices that the directory is not changed back because this is then loaded into the db and no other file access is done here. I have seen design problems of this magnitude go undetected for an extended period, so coding defensively means assuming they will exist.

Now, next, this is re-implemented in the database as such:

CREATE OR REPLACE FUNCTION load_incoming_file(fiilename text)
RETURNS int
language plperlu as
$$
use CompanyFileLoader '/var/incoming'; # oops, we left the data directory
use CompanyConfig 'our_format'; # oops optional dependency that falls back to .
# in terms of exploit, the rest is irrelevant
# for a proof of concept you could just return 1 here
use strict;
use warnings;

my $filename = shift;
my $records = CompanyFileLoader->get_handle($filename);
while ($r = $records->next){
# logic to insert into the db omitted
}

return $records->count;

$$;

The relevant poart of CompanyFileLoader.pm is (the rest could be replaced with stubs for a proof of concept):

package CompanyFileLoader;
use strict;
use warnings;

my @dirstack;
sub import {
my $dir = pop;
push @dirstack, $dir;
chdir $dir;
}

Now, in a real-world published module, this would cause problems but in a company's internal operations it might not pose discovered problems in a timely fashion.

The relevant part of CompanyConfig is:

package CompanyConfig;
use strict;
use warnings;

eval { require 'SupplementalConfig' };

Now, if a SupplementalConfig.pm is loaded into the same directory as the text files, it will get loaded and run as part of the pl/perlU function.

Now to exploit this someone with knowledge of the system has to place this in that directory. It could be someone internal due to failure to secure the inbound web service properly. It could be someone external who has inside knowledge (a former employee for example). Or a more standard exploit could be tried on the basis that some other shared module might have a shared dependency.

The level of inside knowledge required to pull that off is large but the consequences are actually pretty dire. When loaded, the perl module interacts with the database with the same permissions as the rest of the function, but it also has filesystem access as the database server. This means it could do any of the following:

Write perl modules to exploit this vulnerability in other contexts to the Pg data directory
delete or corrupt database files
possibly alter log files depending on setup.
Many other really bad things.

These risks are inherent with the use of untrusted languages, that you can write vulnerable code and introduce security problems into your database. This is one example of that and I think the PostgreSQL team has done an extremely good job of making the platform secure.

Disciplined coding to prevent problems

The danger can be effectively prevented by following some basic rules:

All user defined functions and stored procedures in PL/PerlU should include the line:

no lib '.';

It is possible that modules could add this back in behind your back, but for published modules this is extremely unlikely. So local development projects should not use lib '.' in order to prevent this.

Secondly, never use chdir in a pl/perl function. Remember you can always do file operations with absolute paths. Without chdir, no initial exploit against the current working directory is possible through pl/perlu. Use of chdir circumvents important safety protections in PostgreSQL.

Thirdly it is important that one sticks to well maintained modules. Dangling chdir's in a module's load logic are far more likely to be found and fixed when lots of other people are using a module, and a dangling chdir is a near requirement to accidental vulnerability. For internal modules, they need to be reviewed both for optional dependencies usage and dangling chdir's in the module load logic.

CVE-2016-1238 and the hazards of testing frameworks

2016-08-01T23:06:00.002-07:00

Because I have lost confidence in the approach taken by those in charge of fixing this problem, I have decided to do a full disclosure series on CVE-2016-1238. As mentioned in a previous post, the proposed fixes for the minor versions don't even remotely fix the problem and at most can provide a false sense of security. For this reason it is extremely important that sysadmins understand how these exploits work and how to secure their systems.

A lot of the information here is not specific to this CVE but concerns security regarding running tests generally. For this reason while this CVE is used as an example, the same basic concepts apply to PostgreSQL db testing, and much more.

As I mentioned in a previous post, prove cannot currently be run safely in any directory that is world writeable, and the current approach of making this a module problem make this far worse, not better. Moreover this is not something which can be patched in prove without breaking the basic guarantee that it tests the application as it would run on the current system's Perl interpreter and if you break that, all bets are off.

All the exploits I cover in this series are exploitable on fully patched systems. However they can be prevented by good system administration. In every case, we will look at how system administration best practices can prevent the problem.

One key problem with the current approach is that it fails to differentiate between unknowing inclusion and user errors brought on simply by not understanding the issues. The latter has been totally disregarded by the people attempting stop-gap patches.

Test harnesses generally are insecure by design. The whole point is to execute arbitrary code and see what happens and with this comes a series of inherent risks. We accept those risks because they are important to the guarantees we usually want to make that our software will perform in the real world as expected. Moreover even the security risks inherent in test harnesses are good because it is better to find problems in an environment that is somewhat sandboxed than it is in your production environment.

So testing frameworks should be considered to be code injection frameworks and they can be vectors by which code can be injected into a tested application with system security implications.

Understood this way, testing frameworks are not only insecure by design but this is desirable since you don't want a testing framework to hide a problem from you that could bite you in production.

This is a generic problem of course. A database testing framework would (and should) allow sql injection that could happen in the tested code so that these can be tested against. So whether you working with Perl, Python, PostgreSQL, or anything else, the testing framework itself has to be properly secured. And in each platform this means specific things.

In PostgreSQL, this means you need to pay close attention to certain things, such as the fact that the database tests should probably not run as a superuser for example since a superuser can do things like create functions wit system access.

In Perl, one of the most important things to consider is the inclusion of modules from the current working directory and the possibility that someone could do bad things there.

What Prove Guarantees

Prove guarantees that your perl code will run, in its current configuration, in accordance with your test cases. This necessarily requires arbitrary code execution and arbitrary dependency requirements resolved in the way Perl would resolve them on your system.

Prove guarantees that the guarantees you specify in your test cases are met by your current Perl configuration. It therefore cannot safely do any extra sandboxing for you.

How Prove Works

The basic architecture of prove is that it wraps a test harness which runs a specified program (via the shebang line) parses its output assuming it to be in the test-anything protocol, and generates a report from the rest. For example if you create a file test.t:

#!/bin/bash

echo 'ok 1';
echo 'ok 2';
echo '1..2';

and run prove test.t

You will get a report like the following:

$ prove test.t
test.t .. ok
All tests successful.

What prove has done is invoke /bin/bash, run the file on it, parse the output, check that 2 tests were run, and that both printed ok (it is a little more complex than this but....), and let you know it worked.

Of course, usually we run perl, not bash.

An Attack Scenario

The most obvious attach scenarios would occur with automated test environment that are poorly secured. In this case, if prove runs from a directory containing material more than one user might submit, user A may be able to inject code into user B's test runs.

Suppose user A has a hook for customization as follows:

eval { require 'custom.pl' };

Now this is intended to run a custom.pl file, at most once, when the code is run, and it checks the current @INC paths. If this doesn't exist it falls back to the current working directory, i.e. the directory the shell was working in when prove was run. If this directory shares information between users, user b can write a custom.pl file into that directory which will run when user A's scheduled tests are run.

The code would then run with the permissions of the test run and any misbehavior would tie back to user A's test run (it is harder to tie the behavior to user B). In the event where user A's test run operates with more system permissions than user B's, the situation is quite a bit worse. Or maybe user B doesn't even have tests run anymore for some reason.

Now, it has been proposed that prove prune its own @INC but that doesn't address this attack because prove has to run perl separately. Additionally separation of concerns dictates that this is not prove's problem.

Behavior Considered Reasonably Safe

As we have seen, there are inherent risks to test frameworks and these have to be properly secured against the very real fact that one is often running untrusted code on them. However, there are a few things that really should be considered safe. These include:

Running tests during installation of trusted software as root. If you don't trust the software, you should not be installing it.
Running tests from directories which store only information from a single user (subdirectories of a user's home directory for example).

Recommendations for Test Systems

Several basic rules for test systems apply:

Understand that test systems run arbitrary code and avoid running test cases for automated build and test systems as privileged users.
Properly secure all working directories to prevent users from unknowingly sharing data or test logic
Running as root is only permitted as part of a process installing trusted software.

Notes on Security, Separation of Concerns and CVE-2016-1238 (Full Disclosure)

2016-07-30T20:52:00.001-07:00

A cardinal rule of software security is that, when faced with a problem, make sure you fully understand it before implementing a fix. A cardinal rule of general library design is that heuristic approaches to deciding whether something is a problem really should be disfavored. These lessons were driven home when I ended up spending a lot of time debugging problems caused by a recent Debian fix for the CVE in the title. Sure everyone messes up sometimes, and so this isn't really condemnation of Debian as a project but their handling of this particular CVE is a pretty good example of what not to do.

In this article I am going to discuss actual exploits. Full disclosure has been a part of the LedgerSMB security culture since we started and discussion of exploits in this case provide administrators with real chances to secure their systems, as well as distro maintainers, Perl developers etc to write more secure software. Recommendations will be further given at the end regarding improving the security of Perl as a programming language.

Part of the problem in this case is that the CVE is poorly scoped but CVE's are often poorly scoped and it is important for developers to work with security researchers to understand a problem, understand the implications of different approaches to fix it and so forth. It is very easy to get in a view that "this must be fixed right now" but all too often (as here) a shallow fix does not completely resolve an issue and causes more problems than it resolves.

The Problem (with exploits)

Perl's system of module inclusion (and other operations) looks for Perl modules in the current directory after exhausting other directories. Technically this is optional but most UNIX and Linux distributions have this behavior. On the whole it is bad practice as well, which is why it is not by the default behavior of shells like bash. But a lot of software depends on this (with some legitimate use cases) and so changing it is problematic.

Perl programs are also often complex and have optional dependencies which may or may not exist on a system. If those do not otherwise exist in the system Perl directories but exist in the current working directory, then these may be loaded from the current working directory. Note that this is not actually limited to the current working directory, and Perl could load the files from all kinds of places, users-specified or not.

So when one is running a Perl program in a world-writeable location, there is the opportunity for another user to put code there that may be picked up by the Perl interpreter and executed. While the CVE is limited to implicit inclusion of the current working directory, the problem is actually quite a bit broader than that. Include paths can be specified on the command line and if any of them are world-writeable, then variations of the same attacks are possible.

Some programs, of course, are intended to run arbitrary Perl code. The test harness programs are good examples of this. Special attention will be given to the ways test harness programs can be exploited here.

These features come together to create opportunities for exploits in multi-user systems which administrators need to be aware of and take immediate steps to prevent. In my view there are a few important and needed features in Perl as well.

A simple exploit:

Create the following files in a safe directory:

t/01-security.t, contents:

use Test::More;
require bar;
eval { require foo };

plan skip_all => 'nothing to do';

lib/bar.pm contents:

use 5.010;
warn "this is ok";

./foo.pm, contents:

use 5.010;
warn "Haxored!";

now run:

prove -Ilib t/01-security.t

Now, what happens here is that the optional requirement of foo.pm in the test script gets resolved to the one that happens to be in your current working directory. If that directory were world writeable, then someone could add that file and it would be run when you run your test cases.

Now, it turns out that this is not a vulnerability with prove. Because prove runs Perl in a separate process and parses the output, eliminating the resolution inside prove itself has no effect. What this means is that the directory where you run something like prove can really matter and if you happen to be in a world writeable directory when you run it (or other Perl programs) you run the risk of including unintended code supplied by other users. Not good. None of the proposed fixes address the full scope of this problem either. Note that if any directory in @INC is world-writeable, a security problem exists. And because these can be specified in the Perl command line, this is far more of a root problem than the mere inclusion of the current working directory.

Security Considerations for System Administrators and Software Developers

All exploits of this sort can be prevented even without the recommendations being followed in the proper fixes section. System administrators should:

Make sure that all directories in @INC other than the current working directory are properly secured against write access (this is a no brainer, but is worth repeating)
Programs such as test harnesses which execute arbitrary Perl code should ONLY be run in properly secured directories, and the only time prove should ever be run as root is when installing (as root) modules from cpan.
Scripts intended to be run in untrusted directories should be audited and one should ensure (and add if it is missing) the following line: no lib '.';

Software developers should:

think carefully about where a script might be executed. If it is intended to be run on directories of user-supplied files, then include the no lib '.'; (This does not apply to test harnesses and other programs which execute arbitrary Perl programs).
Module maintainers should probably avoid using optional config modules to do configuration. These optional configuration modules provide standard points of attack. Use non-executable configuration files instead or modules which contain interfaces for a programmer to change the configuration.

What is wrong with the proposed fixes

The approach recommended by the reporter of the problem is to make modules exclude an implicit current working directory when loading optional dependencies. This, as I will show, raises very serious separation of concerns problems and in Debian's case includes one serious bug which is not obvious from outside. Moreover it doesn't address the problems caused by running test harnesses and the like in untrusted directories. So one gets a *serious* problem with very little real security benefit.

If you are reading through the diff linked to above, you will note it is basically boilerplate that localizes @INC and removes the last entry if it is equal to a single dot. This breaks base.pm badly because inheritance in Perl no longer follows @INC the way use does. Without this patch the following are almost equivalent with the exception that the latter also runs the module's import() routine:

use base 'Myclass';

and

use Myclass;
use base 'Myclass';

But with this patch, the latter works and the former does not unless you specify -MMyclass in the Perl command line. This occurs because someone is thinking technically about an issue without comprehending that this isn't an optional dependency in base and therefore the problem doesn't apply there. But the problem is not quickly evident in the diff, nor is it evident when trying to fix this in this way on the module level. It breaks the software contract badly, and does so for no benefit.

As a general rule, modules should be expected to act sanely when it comes to their own internal structure, but playing with @INC violates basic separation of concerns and a system that cannot be readily understood cannot be readily secured (which is why this is a real issue in the first place -- nobody understands all the optional dependencies of all the dependencies of every script on their system).

Recommendations for proper fixes

There are several things that Perl and Linux distros can do to provide proper fixes to this sort of problem. These approaches do not violate the issues of separation of concerns. The first and most important is to provide an option to globally remove '.' from @INC on a per-system basis. This is one of the things that Debian did right in their reaction to this. Providing tools for administrators to secure their systems is a good thing.

A second thing is that Perl already has a number of enhanced modes for dealing with security concerns and adding another would be a good idea. In fact, this could probably be done as a pragma which would:

On load, check to see that directories in @INC are not world-writeable -- if they are, remove them from @INC and warn, and
when using lib, check to see if the directory is world-writeable and if it is die hard.

But making this a module's responsibility to care what @INC says? That's a recipe for problems and not of solutions both security and otherwise.

What is coming in LedgerSMB 1.5

2016-07-29T08:46:00.000-07:00

LedgerSMB 1.5 rc1 is around the corner. I figure it is time for a very short list of major improvements:

A one-page app design which provides better responsiveness and testability
Speaking of testability, we now have selenium-based bdd-tests as well (and a test coverage starting to approach reasonable -- remember, we started with no tests).
We have spun off our database access framework as PGObject on CPAN. This is an anti-ORM, basically a service locator framework for stored proceures.
Quantity discounts
Many stored procedures moved from PL/PGSQL to plain SQL for better type checking
Full support for templates transparently stored in the PostgreSQL database

When PostgreSQL Doesn't Scale Well Enough

2016-03-20T03:38:00.003-07:00

The largest database I have ever worked on will eventually, it looks like, be moved off PostgreSQL. The reason is that PostgreSQL doesn't scale well enough. I am writing here however because the limitations are so extreme that it ought to give plenty of ammunition for those who think databases don't scale.

The current database size is 10TB and doubling every year. The main portions of the application have no natural partition criteria. The largest table currently is 5TB and the fastest growing portion of the application.

10TB is quite manageable. 20TB will still be manageable. By 40TB we will need a bigger server. But in 5 years we will be at 320 TB and so the future does not look very good for staying with PostgreSQL.

I looked at Postgres-XL and that would be useful if we had good partitioning criteria but that is not the case here.

But how many cases are there like this? Not too many.

EDIT: It seems I was misunderstood. This is not complaining that PostgreSQL doesn't scale well It is about a case that is outside of all reasonable limits.

Part of the reason for writing this is that I hear people complain that the RDBMS model breaks down at 1TB which is hogwash. We are facing problems as we look towards 100TB. Additionally I think that PostgreSQL would handle 100TB fine in many other cases, but not in ours. PostgreSQL at 10, 20 or 50TB is quite usable even in cases where big tables have no adequate partitioning limit (needed to avoid running out of page counters), and at 100TB in most other cases I would expect it to be a great database system. But the sorts of problems we will hit by 100TB will be compounded by the exponential growth of the data (figure within 8 years we expect to be at 1.3PB). So the only solution really is to move to a big data platform.

A couple annoyances (and solutions) regarding partitioned tables

2016-02-21T04:44:00.004-08:00

In one of my projects we had an issue where a large table that was under huge transactional load was having trouble with autovacuum not keeping up. The problem was that the table sometimes held over half a billion records, added and deleted millions of records a day, and that since most of these occurred at the heads of various indexes, autovacuum was just not fast enough.

So we decided to partition the table into around 50 pieces in order to allow autovacuum to achieve a bit better parallelism in managing the data. This helped to some extent. But partitioning is a rare solution for rare problems and comes with unexpected costs. Interestingly most of our problems have been ORM-related. Here are some we ran into and their solutions (spoiler: at the end of the day, effectively, we stopped using an ORM on these tables). At the end of the day, throughput on these tables was increased around 10-fold, and db load cut by about 90%.

Annoyance 1: Redirection and ORM transparency

The first problem we had was getting DBIx::Class to work with the partitioned table. The solution was to add another view in between which did the redirection of inserts, updates, and deletes. This also allowed us to go through the ORM for inserts (we still do) without the cross-locking issues below being a problem.

Annoyance 2: Cross-locking and exclusion constraints

A second major problem is that autovacuum can only free up space when it gets an exclusive lock and if any queries are going through the parent table, then you get constraint exclusion coming into play. The problem here is that constraint exclusion takes out a relatively non-invasive lock on every table at planning time which means you cannot even plan to select a row from one partition if another partition is locked, if you are going through the parent table.

The obvious solution here is not to go through the parent table, but the ORM doesn't support that so we had to drop to SQL. It also took us about 6 months to find and fix.

Annoyance 3: Constraint exclusion doesn't always do what you expect it to!

One day we had a very slow running straight-forward query that should have been able to resolve quickly on an index scan on one of the partitions. However, because the constraint criteria was being brought in via a subquery, it was not available at plan time, so it was falling back on a sequential scan through another large partition. Ouch...... Found the query and fixed it.

Annoyance 4: Solving some performance problems puts more stress on the next bottleneck

The result of the initial success was increased db concurrency, which was great until it became clear our selection of rows to process and delete was leading to lots of indexes having huge numbers of dead tuples at their heads. This meant that selecting rows actually became slower than before. So we had to go back and engineer a new selection algorithm to avoid this problem....

Unrelated Annoyance: Long running transactions causing autovacuum headaches

An interesting unrelated issue we had was the fact that at the time, we had transactions that would sometimes remain open for a week. While the partitions directly affected were small, the problem is that autovacuum cannot clear tuples that are invalidated since the oldest transaction started, so higher processing throughput partitions were adversely affected. After significant effort, we got the worst offenders corrected and now the longest running transactions take just over a day. This is usually sufficient depending on the load of the system (but sometimes the duration spikes to 18 hours).

Was the partitioning worth it? Definitely! However it was a bit of a long road to get there

Why Commons Should Not Have Ideological Litmus Tests

2016-02-10T23:55:00.000-08:00

This will likely be my last post on this topic. I would like to revive this blog on technical rather than ideological issues but there seems like a real effort to force ideology in some cases. I don't address this in terms of specific rights, but in terms of community function and I have a few more things to say on this topic before I return to purely technical questions.

I am also going to say at the outset that LedgerSMB adopted the Ubuntu Code of Conduct very early (thanks to the suggestion of Joshua Drake) and this was a very good choice for our community. The code of conduct provides a reminder for contributors, users, participants, and leadership alike to be civil and responsible in our dealings around the commons we create. Our experience is we have had a very good and civil community with contributions from every walk of life and a wide range of political and cultural viewpoints. I see this as an unqualified success.

Lately I have seen an increasing effort to codify a sort of political orthodoxy around open source participation. The rationale is usually about trying to make people feel safe in a community, but these are usually culture war issues so invariably the goal is to exclude those with specific political viewpoints (most of the world) from full participation, or at least silence them in public life. I see this as extremely dangerous.

On the Economic Nature of Open Source

Open source software is economically very different from the sorts of software developed by large software houses. The dynamics are different in terms of the sort of investment taken on, and the returns are different. This is particularly true for community projects like PostgreSQL and LedgerSMB, but it is true to a lesser extent even for corporate projects like MySQL. The economic implications thus are very different.

With proprietary software, the software houses build the software and absorb the costs for doing so, and then later find ways to monetize that effort. In open source, that is one strategy among many but software is built as a community and in some sense collectively owned (see more on the question of ownerership below).

So with proprietary software, you may have limited ownership over the software, and this will be particularly limited when it comes to the use in economic production (software licenses, particularly for server software, are often written to demand additional fees for many connections etc).

Like the fields and pastures before enclosure, open source software is an economic commons we can all use in economic production. We can all take the common software and apply it to our communities, creating value in those areas we value. And we don't all have to share the same values to do it. But it often feeds our families and more.

But acting as a community has certain requirements. We have to treat eachother with humanity generally. That doesn't mean we have to agree on everything but it does mean that some degree of civility must be maintained and cultivated by those who have taken on that power in open source projects.

On the Nature of Economic Production, Ownership and Power (Functionally Defined)

I am going to start by defining some terms here because I am using these terms in functional rather than formal ways.

Economic Production: Like all organisms we survive by transforming our environment and making it more conducive to our ability to live and thrive. In the interpersonal setting, we would call this economic production. Note that understood in this way, this is a very broad definition and includes everything from cooking dinner for one's family to helping people work together. Some of this may be difficult to value but it can (what is the difference between eating out and eating at home? How much does a manager contribute to success through coordination?).

Ownership: Defining ownership in functional rather than formal terms is interesting. It basically means the right to use and direct usage of something. Seen in this way, ownership is rarely all or nothing. Economic ownership is the right to utilize a resource in economic production. The extent to which one is restricted in economic production using a piece of software the less one owns it, so CAL requirements in commercial software and anti-TIVOization clauses in the GPL v3 are both restrictions on functional ownership.

Economic Power: Economic power is the power to direct or restrict economic production. Since economic production is required for human life, economic power is power over life itself. In an economic order dominated by corporations, corporations control every aspect of our lives. In places where the state has taken over from the corporations, the state takes over this as well. But such power is rarely complete because not all economic production can be centrally controlled.

I am going to come back to these below because my hesitation on kicking people out of the community due to ideological disagreements (no matter how wrong one side may seem to be) have to do with this fear of abuse of economic power.

On Meritocracy (and what should replace it)

Meritocracy is an idea popularized by Eric Raymond, that power in a community should be given to technical merit. In short, one should judge the code, not the person. The idea has obvious appeal and is on the surface hyper-inclusive. We don't have to care about anything regarding each other other than quality of code. There is room for everyone.

More recently there has been push-back in some corners against the idea of meritocracy. This push-back comes from a number of places, but what they have in common is questioning how inclusive it really is.

The most popular concern is that meritocracy suggests that we should tolerate people who actively make the community less welcoming, particularly for underrepresented groups. and therefore meritocracy becomes a cover for excluding the same groups who are otherwise excluded in other social dimensions, that the means of exclusion differs but who is excluded might not.

There is something to be said for the above concern, but advocates have often suggested that any nexus between community and hostile ideas is sufficient to raise a problem and therefore when an Italian Catholic expresses a view of gender based on his religion on Twitter, people not even involved in the project seek his removal from it on the grounds that the ideas are toxic. For reasons that will become clear, that is vast overreach, and a legitimate complaint is thus made toxic by the actions of those who promote it. And similarly toxic are the efforts by some to use social category to insist that their code should be included just to show a welcoming atmosphere.

A larger problem with meritocracy though is the way it sets up open source communities to be unbalanced, ruled by technical merit and thus not able to attract the other sorts of contributions needed to make most software successful. In a community where technical merit is the measure by which we are judged, non-technical contributions are systematically devalued and undervalued. How many open source communities produce software which is poorly documented and without a lot of attention to user interface? If you devalue the efforts at documentation and UI design, how will you produce software which really meets people's needs? If you don't value the business analysts and technical writers, how will you create economic opportunities for them in your community? If you don't value them, how will you leverage their presence to deliver value to your own customers? You can't if your values are skewed.

The successor to meritocracy should be economic communitarianism, i.e. the recognition that what is good for the community is economically good for all its members. Rather than technical merit, the measure of a contribution and a contributor ought to be the value that a contribution brings the community. Some of those will be highly technical but some will not. Sometimes a very ordinary contribution that anyone could offer will turn the tide because only one person was brave enough to do it, or had the vision to see it as necessary. Just because those are not technical does not mean that they are not valuable or should not be deeply respected. I would argue that in many ways the most successful open source communities are the ones which have effectively interpreted meritocracy loosely as economic communitarianism.

On Feeling Safe in the Community

Let's face it People need to feel safe and secure in the community regarding their physical safety and economic interests. Is there any disagreement on this point? If there is, please comment below. But the community cannot be responsible for how someone feels, only in making sure that people are objectively physically and economically secure within it. If someone feels unsafe in attending conferences, community members can help address security concerns and if someone severely misbehaves in community space, then that has to be dealt with for the good of everyone.

I don't think the proponents of ideological safety measures have really thought things through entirely. The world is a big place and it doesn't afford people ideological safety unless they don't go out and work with people they disagree with. As soon as you go across an international border, disagreements will spring up everywhere and if you aren't comfortable with this then interacting on global projects is probably not for you.

Worse, when it comes to conduct outside of community circles, those in power in the community cannot really act constructively most of the time. We don't have intimate knowledge and even if we do, our viewpoints have to be larger than the current conflict.

On "Cultural Relativism:" A welcoming community for all?

One of the points I have heard over and over in discussions regarding community codes of conduct is that welcoming people regardless of viewpoint (particularly on issues like abortion, sexuality, etc) is cultural relativism and thus not acceptable. I guess the question is not acceptable to whom? And do we really want an ideological orthodoxy on every culture war topic to be a part of an open source project? Most people I have met do not want this.

But the overall question I have for people who push culture war codes of conduct is "when you say a welcoming community for all, do you really mean it? Or do you just mean for everyone you agree with? What if the majority changes their minds?"

In the end, as I will show below, trying to enforce an ideological orthodoxy in this way does not bring marginal groups into the community but necessary forces a choice of which marginal groups to further exclude. I don't think that is a good choice and I will go on record and say it is a choice I will steadfastly refuse to make.

A Hypothetical

Ideology is shaped by culture, and ideology of sexuality is shaped by family structures, so consequently where family structures are different, views on sexuality will be also.

So suppose someone on a community email list includes a pro-same-sex marriage email signature, something like:

"Marriage is an institution for the benefit of the spouses, not [to] bind parents to their children" -- Ted Olson, arguing for a right to same-sex marraige before the United States Supreme Court.

So a socially conservative software developer from southern India complaints to the core committee saying that this is an attack on his culture, saying that traditional Indian marriages are not real marriages. Now, I assume most people would agree that it would be best for the core committee not to insist that the email signature be changed for someone to continue to participate. So with such a decision, suppose the complainant changes his signature instead to read:

"If mutual consent makes a sexual act moral, whether within marriage or without, and, by parity of reasoning, even between members of the same sex, the whole basis of sexual morality is gone and nothing but misery and defect awaits the youth of the country… " -- Mohandas Gandhi

Now the first person decries the signature as homophobic and demands the Indian fellow be thrown off the email list. And the community, if it has decided to follow the effort at ideological safety has to resolve the issue. Which group to exclude? The sexual minority? Or the group marginalized through a history of being on the business end of colonialism? And if one chooses the latter, then what does that say about the state of the world? Should Indians, Malaysians, Catholics, etc. band together to fork a competing project? Is that worth it as a cost? Doesn't that hurt everyone?

On Excluding People from the Commons

In my experience, excluding people from the commons carries with it massive cost, and this is a good thing because it keeps economic power from being abused. I have watched the impact first hand. LedgerSMB would not even exist if this weren't an issue with SQL-Ledger. That we are now the only real living fork of SQL-Ledger and far more active than the project we forked from is a testament to the cost.

Of course in that case the issue was economic competition and a developer who did not want to leverage community development to build his own business. I was periodically excluded from SQL-Ledger mailing lists etc for building community documentation (he sold documentation). Finally the fork happened beccause he wouldn't take security reports seriously. And this is one of the reasons why I would like to push for an inclusive community.

But I also experienced economic ramifications from being excluded. It was harder to find customers (again, the reason for exclusion was economic competition so that was the point). In essence, I am deeply aware of the implications of kicking people out.

I have seen on email lists and tracker tickets the comparison of the goal of excluding people with problematic ideologies with McCarthyism. The goal of McCarthyism was indeed similar, to make sure that if you had the wrong ideas you would be unable to continue a professional career. I have had relatives who suffered because they defended the legal rights of the Communist Party during that time. I am aware of cases where the government tried to take away their professional career (unsuccessfully).

Management of community is political and the cost of excluding someone is also political. We already exist in some ways on the margins of the software industry. Exclude too many people and you create your own nemesis. That's what happened to SQL-Ledger and why LedgerSMB is successful today.

Notes on former FreeBSDGirl

One blog entry that comes from the other side of this issue is Randi Harper's piece on why she no longer will go to FreeBSD conferences and participate on IRC channels. I am not familiar with the facts surrounding her complaints and frankly I don't have time to be so what the nature of her harassment complaint is, I will not be the judge.

There is however another side to the issue that is outside what she evidently has experience with, and that is the role of software maintainers in addressing the sorts of complaints she made. Consequently I want to address that side and then discuss her main points at the bottom.

One thing to remember is that when people make accusations of bullying, harassment, etc. the people in charge are also the people with the least actual knowledge of what is going on. Expecting justice from those in power in cases like this will lead, far more often than not, to feelings of betrayal. This is not because of bad intentions but because of lack of knowledge. This was one thing I learned navigating schoolyard bullies when I was growing up and we as project maintainers are in an even lower knowledge role than school administrators are. Bullies are furthermore usually experts at navigating the system and take advantage of those who are not as politically adept, so the more enforcement you throw at the problem, the worse it gets.

So there is an idea that those in charge will stop people from treating eachother badly. That has to stop because it isn't really possible (as reasonable as it sounds). What we can do is keep the peace in community settings and that is about it. One needs bottom up solutions, not top down ones.

So if someone came to me as a maintainer of a project alleging harassment on Twitter and demanding that an active committer be removed, that demand would probably go nowhere. If political statements were mentioned, the response would be "do we want a political orthodoxy?" Yet LedgerSMB has avoided these problems largely because, I think, we are a community of small businesses and therefore are used to working through disagreements and maybe because we are used to seeing these sorts of things as political.

Her main points though are worth reading and pondering. In some areas she is perfectly right and in some areas dangerously wrong.

Randi is most right in noting that personal friction cannot be handled like a technical problem. It is a political problem and needs to be handled as such. I don't think official processes are the primary protection here, and planning doesn't get you very far, but things do need to be handled delicately.

Secondly, there is a difference between telling someone to stay quiet and telling someone not to be shouting publicly. I think it is worth noting that if mediation is going to work then one cannot have people trying to undermine that in public, but people do need friends and family for support and so it is important to avoid the impression that one is insisting on total confidentiality.

Randi is also correct that how one deals with conflict is a key gauge of how healthy an open source community is. Insisting that people be banished because of politically offensive viewpoints however does not strike me as healthy or constructive. Insisting that people behave themselves in community spaces does. In very rare cases it may be necessary to mediate cases that involve behavior outside that, but insisting on strict enforcement of some sort of a codified policy will not bring peace or prosperity.

More controversially I will point out that there is a point that Randi makes implicitly that is worth making explicit here, namely that there is a general tone-deafness to women's actual experiences in open source. I think this is very valid. I can remember a former colleague in LedgerSMB making a number of complaints about how women were treated in open source. Her complaints included both unwanted sexual attention ("desperate geeks") and more actionably the fact that she was repeatedly asked how to attract more women to open source (she responded once on an IRC channel with "do you know how annoying that is?"). She ultimately moved on to other projects following a change in employment that moved LedgerSMB outside the scope of duties, but one obvious lesson that those of us in open source can take from this is just to listen to complaints. Many of these are not ones that policies can solve (you really want a policy aimed at telling people not to ask what needs to be done to attract more women to open source?) but if we listen, we can learn something.

One serious danger in the current push for more expansive codes of conduct is that it puts those who have the least knowledge in the greatest responsibility. My view is that expansive codes of conduct, vesting greater power with maintainers over areas of political advocacy outside community fora will lead to greater, not less conflict. So I am not keen in her proposed remedies.

How Codes of Conducts Should be Used

The final point I want to bring up here is how codes of conduct should be used. These are not things which should be seen as pseudo-legal or process-oriented documents. If you go this way, people will abuse the system. It is better in my experience to vest responsibility with the maintainers in keeping the peace, not dispensing out justice, and to have codes of conduct aimed at the former, not the latter. Justice is a thorny issue, one philosophers around the world have been arguing about for millennia with no clear resolution.

A major problem is the simple fact that perception and reality don't always coincide. I was reminded of this controversy while reading an article in The Local about the New Years Eve sexual assaults, about work by a feminist scholar in Sweden to point out that actually men are more at risk from bodily violence than women are, and that men suffer disproportionately from crime but are the least likely to modify behavior to avoid being victimized. The article is worth reading in light of the current issues.

So I think if one expects justice from a code of conduct, one expects too much. If one expects fairness from a code of conduct, one expects too much. If one expects peace and prosperity for all, then that may be attainable but that is not compatible with the idea that one has a right not to be confronted by people with dangerous ideologies.

Codes of conducts, used right, provide software maintainers with a valuable tool for keeping the peace. Used wrong, they lead open source projects into ruin. In the end, we have to be careful to be ideologically and culturally inclusive and that means that people cannot guarantee that they are safe from ideas they find threatening.

On Contributor Codes of Conduct and Social Justice

2016-01-26T11:07:00.001-08:00

The PostgreSQL, Ruby, and PHP communities have all been considering codes of conduct for contributors. The LedgerSMB community already uses the Ubuntu Code of Conduct. Because this addresses many projects, I am syndicating this further than where there are current issues. This is not a technical post and it covers a wide range of very divisive issues for a very diverse audience. I can only hope that the nuance I am trying to communicate comes across.

Brief History

A proximal cause seems to be an event referred to as "Opalgate" where an Italian individual who claimed to be a part of the Opal project made some unrelated tweets in an exchange about the politics of education and the question of how gender should be presented, and some people got offended and demanded his resignation (at least that is my reading of the Twitter exchange but I have been outside the US long enough to lose the context in which it would likely be read by an American in the US). The details are linked below, but the core questions involve how much major contributors to projects need to keep from saying anything at all about divisive issues seems to be a recurring topic. Moreover it is a legitimate one.

Like some of my blog posts, this goes into touchy territory. I am discussing things which require a great deal of nuance. Chances are, regardless of where you sit on some of these issues, you will be offended by things I say, but there are worse things than to be offended (one of them is never to be challenged by different viewpoints).

I write here as someone who has lived in a number of very different cultures and who can see perspectives on many of these issues which are not present in American political discourse For this reason, I think it is important for me to share the concerns I see because otherwise open source software maintainers often don't have a perspective outside of Western countries, or even outside the US.

Of course as open source software maintainers we want everyone to feel safe and valued as members of the community. But cultural tensions and ways of life do crop up and taking a position on these as a community will always do more harm than good.

Background Reading regarding Opalgate and the question of so-called "social justice warriors" in open source

It may seem strange to put a list of links for background reading near the start of an article, but I want to make sure that such material is available up front. People can read about Opalgate here and the ongoing debate between various parties about it. It's important background reading but somewhat peripheral to the overall problems involved. It may or may not be the best example of the difficulties in running cross-cultural projects but it does highlight the difficulties that come in addressing diverse community bases, those which may have deep philosophical disagreements about things which people take very personally.

The original twitter exchange
A writeup on Geek Feminism
Eric Raymond's post on why Open Source Projects should eject "social justice warriors"
Caroline Ada Ehmke's response to Eric Raymond
Caroline Ada Ehmke's post on Opalgate

In the interest of full disclosure, I too worry that there is too much eagerness to liberate children from concepts of gender and too little thought about how this can and will be abused, and what the life costs for the children actually will be. I believe that we must be human and humane to all but I am concerned that the US is going down a path that strikes me as anything but that in the long run. That doesn't mean that the concerns of the trans community in the US should be ignored, but that doesn't mean they should be paramount either. As communities we need to come together to solve problems not fight culture wars.

Twitter is not a medium which is conducive to thoughtful exchange so I also have to cut some slack. Probably not the wisest medium to discuss controversial topics. But people around the world have deep differences in views on major controversies. My wife, for example, is far more opposed to abortion than I am, and having come to a deeper understanding of her culture, I don't disagree that in her cultural context, it is more harmful. But that brings me to another problem, that many issues are contextual and we cannot see how others really are impacted by such changes, particularly when forced from the outside.

But my view doesn't matter everywhere. It matters in my family, my discussions with people I know, and so forth. But most of the world is not my responsibility nor should it be. These are not entirely easy issues and there should be room for disagreement.

Is Open Source Political?

Caroline Ada Ehmke's basic argument is that open source is inherently political, that it seeks a positive change in the world, and therefore it should ally itself with others sharing the same drive to make the world a better place. I think this viewpoint is misguided but because it is only half-wrong.

Aristotle noted that all human relationships are necessarily political. The three he chose as primary in Politics is illustrative: master and slave (we could update to boss and worker); husband and wife; and king and subject. To Aristotle, the human being alone is incomplete. We are our relationships and our politics follows from them. While there has been an effort to separate the personal and the political in modern times, Feminist historians have kept this tradition alive and well. A notion of the political grounded in humans as social animals is fundamentally more conducive to justice than cold, mechanical, highly engineered social machinery. Moreover Aristotle notes that all communities are built on some concept of the good, that humans only want things that seem good to them and therefore we can assume that all groups seek a better world, but we don't always know which ones deliver, and that is the problem.

Open source begins not with an ideology but with a conviction. Not everybody shares the same conviction. Not everyone participates in open source for the same reason. But everyone has a reason, some conviction that what they are doing is good. There is enough commonality for us all to work together, but that commonality is not as strong as one may think.

In a previous post on this blog I argued for a very different understanding of software freedom than Richard Stallman supposes, for example. While he holds a liberal enumerated liberties view, I hold a traditionalist work-ownership view. Naturally that leads to different things we look for in an ideal license.

And the diversity in viewpoint does not stop there. Some come to open source because they believe that open source is a better way of writing software. Some because they believe that open source software delivers benefits in use. But regardless of our disagreement we share the understanding that open source software brings community and individual benefits.

In two ways then is open source software political:

Communities require governance and this is inherently political, and
To the extent there is a goal to transform the software industry to one of open source that is political.

The first as we will see is a major problem. Open source communities are diverse in a way few Americans can fully comprehend (we like to think everyone is like us and there is one right way, the American Way whether that is in industry -- the right -- or formulations of rights -- the left). Thus most discussions end up being Western-normative (and in particular American-normative) and disregard perspectives from places like India, Malaysia, Indonesia, and so forth.

However it is worth coming back to the point that what brings us together is an economic vision. Yes, that is intrinsically and highly political, but it also has consequences for other causes and therefore it is worth being skeptical of alliances with groups in other directions. What would an open source-based economy look like? What would the businesses look like? Will they be the corporations of today or the perpetual family businesses and trades of yesteryear? And if the latter, what is the implication for the family? Many of these questions (just like questions of same-sex marriage) depend in large part on the current social institutions in a culture -- the implications of an industrial, corporate, weak family society adopting something like same-sex marriage are very different than in an agrarian or family-business, strong family society. My view is that these will likely have different answers in different places.

Thus when an open source community takes a position on, for example, gay rights in the name of providing a welcoming community, they make the community openly hostile to a very large portion of the world and I think that is not what we want. Moreover such a decision is usually a product of white, Western privilege and effectively marginalize those in so-called developing countries who want to see their countries economically develop in a very different direction than the US has. Worse, this is not an unintended side effect but the whole point.

A brief detour into the argument over white privilege

A discussion of so-called white privilege is needed I think for three groups reading this:

Non-Americans who will have trouble understanding the idea as it applies to American society (like all social ideas, it does not apply to all societies or even where it does apply, it may not in the same way).
White Americans who seem to have trouble understanding what people of color in the US mean when they use the term.
Activists who want to use the idea as a political weapon to enforce a sort of orthodoxy

I mentioned Western-Normative above. It is worth pointing out that this forms a part of a larger set of structures that define what is normal or central in a culture, and what is abnormal, marginalized (or perhaps liminal). It is further worth noting that the perception as to these models is more acute to those who are not treated as the paragons of normality. In the US, the paragon of normality is the white, straight male. But unspoken here is that it is the white, straight, urban, wealthy American male (or maybe European, they are white too). Everyone else (women, people of color, Africans, Asians, etc) should strive to be like these paragons of success (I, myself, having lived most of my life now either outside the US or in the rural parts, am most certainly not included in this paragon of normality model, but nevertheless it took years of marriage to someone from a different culture and race to begin to be able to partially see a different perspective).

Now, it doesn't follow that white straight males live up to this image (which is one reason why white privilege theory has proven controversial among the arguably privileged) even where there is wealth, one is brought up in nice neighborhood in the city, etc. But that isn't really the point. The point is that society holds these things to be *normal* and everything else to be only normal to the extent it is like this model. It would be better and more accurate to call this a model of normality rather than privilege and to state at the outset that we cannot really walk a mile in the shoes of people from across many social borders (culture included).

White privilege is real, as is male privilege (in some areas, particularly employment), urban privilege, American privilege, Western privilege, even female privilege (in some areas, particularly family law).

Issues exist in a sticky web of culture, and no culture is perfect

These issues of privilege aren't necessarily wrong in context: it seems unlikely that the workplace can be made less male-normative without men sharing equally in the duties and rights of childrearing, but enforcing that cuts against the goal by some feminists of liberating women from men (and also exists in tension with things like same-sex marriage and gender-nonessentialism). Insisting that men get the same amount of parental leave as women cuts one direction, but insisting that single women get free IVF cuts the other (both of these are either the case in Sweden or efforts are being made to make them the case). In other words, addressing male privilege requires a transformation of the economic and family order together, in such a way that having children becomes an economic investment rather than an economic burden. But then that has implications for the idea of gay rights as we understand the concept in the West because if having and raising children becomes normative then one is providing a sort of parental privilege, and gender equality becomes based on heteronormativity.

But the ultimate white privilege is to deny it is a factor when one uses one's own perception that other cultures are homophobic or transphobic to justify one's own racist paternalism. No need to understand why. We are white. We know what is right. We just need to educate them so they can join the ranks of the elite culturally white enlightened liberals as well. Most of the world, however, disagrees, and as maintainers of open source projects we have to somehow keep the peace. (Note I use the term liberal as it is used in the history of ideas -- in the West it is no less prevalent on the mainstream right than on the mainstream left, though the application may be different.)

Since many of these issues necessarily exist in tension with eachother, there is no such thing as a perfect culture. It isn't even clear the West does better than Southeast Asia on the whole (in fact I would say the SE Asia does better than the West on the whole). But all culture is an effort at these tradeoffs, and it is not the job of open source communities to push Western changes on the rest of the world.

What is Social Justice? Two Theories and a Problem

If open source is inherently political then social justice must in some way matter to open source. Naturally we must understand what social justice is and how it applies. Certainly a sense of being treated fairly by the community is essential for contributors from all walks of life. The cult of meritocracy is an effort at social justice within the community. As some argue it is not entirely without problems (see below) but as a technical community it is a start.

Western concepts of justice today tend to stress individuality, responsibility, and autonomy. The idea is that justice is something that exists between individuals, and maybe between individuals and the state. And while contemporary Western social justice theorists on the left try to relate the parts to the whole of society, it isn't clear that there is room for any parts other than the isolated individual and the state in their theories. If one starts with the view that humans are born free but everywhere in chains (Rousseau), then the job of the state is to liberate people from eachother, and that leaves no room for any other parts.

The individualist view of justice, when seen as primary, breaks down in a number of important ways. The most important is that it provides no real way of understanding parts and how they can be related to the whole. Thus, the state becomes both stronger and isolates people more from eachother, and predictability becomes more important than human judgement. Separatism cannot be tolerated, and assimilationism becomes the rallying cry when it comes to how the central model of normality should deal with those outside. In other words the only way that this approach can deal with those on the margins is to destroy their culture and assimilate the individuals remaining. Resistance must be made futile (Opalgate can be seen as such an effort). For this reason, this view of justice is incompatible with real cultural pluralism. This is not a question of the political spectrum in the US or Europe. It is a fundamental cultural assumption in the much of the West. Interestingly, the insistence that the personal is political means that intellectual feminism already exists in tension with this cold, mechanical view of justice.

Another view of justice can be found in Thomas Aquinas's view that in addition to justice between individuals, there is a need to recognize that just as individuals are parts in relation to the whole, so are other organs within society. In other words, justice is a function of power, and justice is in part about just design and proper distribution of power and responsibility. In this regard, Aquinas built on the thought experiments of Plato's Republic and the Politics of Aristotle. In this regard, key questions of social justice include the structure of an open source community and the relationship between the parts of the community (how users and developers interact and share power and responsibility), the relationship between open source projects and so forth.

In the end though, there was a reason why Socrates eventually rejected every formulation of justice he pondered. Justice itself is complex and to formulate it removes a critical component of it, namely human judgement when weighing harms which are not directly comparable. I think it is therefore quite necessary for people to remain humble about the topic and to realize that nobody sees all the pieces, and that we as humans learn more from disagreement than from agreement. Therefore every one of us is ignorant to some extent on the nature of justice and so disagreements are healthy.

Open Source Projects and So-Called Social Justice Warriors

Coraline Ehmke, in her post on ticket on Opalgate asked:

Is this what the other maintainers want to be reflected in the project? Will any transgender developers feel comfortable contributing?"

This is good question but another question needs to be asked as well. Given that a lot of people live in societies with very different family and social structures, should people feel comfortable using software if the maintainers of the project have come out as openly hostile to the traditional family structures in a culture? Does not a community that is welcoming of all need to avoid the impulse to delegitimize social institutions in other cultures, ones where one necessarily lacks an understanding into how it plays into questions of economic support and power? If open source is already political do we want to ally ourselves with groups that could alienate important portions of our user base by insisting that they change their way of life?

It is important that we maintain a community that is welcoming to all, but that means we have to work with people we disagree with. A mere difference of opinion should never be sufficient to trigger a problem with the code of conduct and expressing an opinion outside community resources should never be sufficient to consider the community unduly unwelcoming. A key component of the community is whether people can work together with people when they disagree, and forcing agreement or even silencing opposition is the opposite of social justice when it comes to a large-reaching global project.

Should open source communities eject social justice warriors as ESR suggests? Not if they are willing to work comfortably with people despite disagreements on hot button issues. Should we welcome them? If they are willing to work with people comfortably despite disagreements on hot button issues. Should we require civility? Yes. Should we as communities take stances on hot button issues internationally? Absolutely not. What about as individuals? Don't we have a civic duty to engage in our own communities as we feel best? And if both those are true, must we not be tolerant of a wide range of differences in opinion, even those we find deeply and horribly wrong?

PGObject Cookbook Part 2.1: Serialization and Deserialization of Numeric Fields

2014-09-17T17:37:00.000-07:00

Preface

This article demonstrates the simplest cases regarding autoserialization and deserialization to the database of objects in PGObject. It also demonstrates a minimal subset of the problems that three valued logic introduces and the most general solutions to those problems. The next article in this series will address more specific solutions and more complex scenarios.

The Problems

Often times we want to have database fields automatically turned into object types which are useful to an application. The example here turns SQL numeric fields into Perl Math::Bigfloat objects. However the transformation isn't perfect and if not carefully done can be lossy. Most applications types don't support database nulls properly and therefore a NULL making a round trip may end up with an unexpected value if we aren't careful. Therefore we have to create our type in a way which can make round trips in a proper, lossless way.

NULLs introduce another subtle problem with such mappings, in that object methods are usually not prepared to handle them properly. One solution here is to try to follow the basic functional programming approach and copy on write. This prevents a lot of problems. Most Math::BigFloat operations do not mutate the objects so we are relatively safe there, but we still have to be careful.

The simplest way to address this is to build into one's approach a basic sensitivity into three value logic. However, this poses a number of problems, in that one can accidentally assign a value which can have other values which can impact things elsewhere.

A key principle on all our types is that they should handle a null round trip properly for the data type, i.e. a null from the db should be turned into a null on database insert. We generally allow programmers to check the types for nulls, but don't explicitly handle them with three value logic in the application (that's the programmer's job).

The Example Module and Repository

This article follows the code of PGObject::Type::BigFloat.. The code is licensed under the two-clause BSD license as is the rest of the PGObject framework. You can read the code to see the boilerplate. I won't be including it in here. I will though note that this extends the Math::BigFloat library which provides arbitrary precision arithmetic for PostgreSQL and is a good match for LedgerSMB's numeric types.

NULL handling

To solve the problem of null inputs we extend the hashref slightly with a key _pgobject_undef and allow this to be set or checked by applications with a function "is_undef." This is fairly trivial:

sub is_undef {
my ($self, $set) = @_;
$self->{_pgobject_undef} = $set if defined $set;
return $self->{_pgobject_undef};
}

How PGObject Serializes

When a stored procedure is called, the mapper class calls PGObject::call_procedure with an enumerated set of arguments. A query is generated to call the procedure, and each argument is checked for a "to_db" method. That method, if it exists, is called and the output used instead of the argument provided. This allows an object to specify how it is serialized.

The to_db method may return either a literal value or a hashref with two keys, type and value. If the latter, the value is used as the value literal and the type is the cast type (i.e. it generates ?::type for the placeholder and binds the value to it). This hash approach is automatically used when bytea arguments are found.

The code used by PGObject::Type::BigFloat is simple:

sub to_db {
my $self = shift @_;
return undef if $self->is_undef;
return $self->bstr;
}

Any type of course can specify a to_db method for serialization purposes.

How and When PGObject Deserializes

Unlike serialization, deserialization from the database can't happen automatically without the developer specifying which database types correspond to which application classes, because multiple types could serialize into the same application classes. We might even want different portions of an application (for example in a database migration tool) to handle these differently.

For this reason, PGObject has what is called a "type registry" which specifies which types are deserialized and as what. The type registry is optionally segmented into several "registries" but most uses will in fact simply use the default registry and assume the whole application wants to use the same mappings. If a registry is not specified the default subregistry is used and that is consistent throughout the framework.

Registering a type is fairly straight forward but mostly amounts to boilerplate code in both the type handler and using scripts. For this type handler:

sub register{
my $self = shift @_;
croak "Can't pass reference to register \n".
"Hint: use the class instead of the object" if ref $self;
my %args = @_;
my $registry = $args{registry};
$registry ||= 'default';
my $types = $args{types};
$types = ['float4', 'float8', 'numeric'] unless defined $types and @$types;
for my $type (@$types){
my $ret =
PGObject->register_type(registry => $registry, pg_type => $type,
perl_class => $self);
return $ret unless $ret;
}
return 1;
}

Then we can just call this in another script as:

PGObject::Type::BigFloat->register;

Or we can specify a subset of types or different types, or the like.

The deserialization logic is handled by a method called 'from_db' which takes in the database literal and returns the blessed object. In this case:

sub from_db {

my ($self, $value) = @_;

my $obj = "$self"->new($value);

$obj->is_undef(1) if ! defined $value;

return $obj;

}

This supports subclassing, which is in fact the major use case.

Use Cases

This module is used as the database interface for numeric types in the LedgerSMB 1.5 codebase. We subclass this module and add support for localized input and output (with different decimal and thousands separators). This gives us a data type which can present itself to the user as one format and to the database as another. The module could be further subclassed to make nulls contageous (which in this module they are not) and the like.

Caveats

PGObject::Type::BigFloat does not currently handle making the null handling contageous and this module as such probably never will, as this is part of our philosophy of handing control to the programmer. Those who do want contageous nulls can override additional methods from Math::BigFloat to provide such in subclasses.

A single null can go from the db into the application and return to the db and be serialized as a null, but a running total of nulls will be saved in the db as a 0. To this point, that behavior is probably correct. More specific handling of nulls in the application, however, is passed to the developer which can check the is_undef method.

Next In Series: Advanced Serialization and Deserialization: Dates, Times, and JSON

PGObject Cookbook Part 1: Introduction

2014-09-15T20:21:00.001-07:00

Preface

I have decided to put together a PGObject Cookbook, showing the power of this framework. If anyone is interested in porting the db-looking sides to other languages, please let me know. I would be glad to provide whatever help my time and skills allow.

The PGObject framework is a framework for integrated intelligent PostgreSQL databases into Perl applications. It addresses some of the same problems as ORMs but does so in a very different way. Some modules are almost ORM-like and more such modules are likely to be added in the future. However unlike an ORM, PGObject mostly serves as an interface to stored procedures and whatever code generation routines will be added, these are not intended to be quickly changed. Moreover it only supports PostgreSQL because we make extended use of PostgreSQL-only features.

For those who are clearly not interested in Perl, this series may still be interesting as it not only covers how to use the framework but also various problems that happen when we integrate databases with applications. And there are people who should not use this framework because it is not the right tool for the job. For example, if you are writing an application that must support many different database systems, you probably will get more out of an ORM than you will this framework. But you still may get some interesting stuff from this series so feel free to enjoy it.

Along the way this will explore a lot of common problems that happen when writing database-centric applications and how these can be solved using the PGObject framework. Other solutions of course exist and hopefully we can talk about these in the comments.

Much of the content here (outside of the prefaces) will go into a documentation module on CPAN. However I expect it to also be of far more general interest since the problems are common problems across frameworks.

Introduction

PGObject is written under the theory that the database will be built as a server of information and only loosely tied to the application. Therefore stored procedures should be able to add additional parameters without expecting that the application knows what to put there, so if the parameter can accept a null and provide the same answer as before, the application can be assured that the database is still usable.

The framework also includes a fairly large number of other capabilities. As we work through we will go through the main areas of functionality one at a time, building on the simplest capabilities and moving onto the more advanced. In general these capabilities can be grouped into basic, intermediate, and advanced:

Basic Functionality

registered types, autoserialization, and autodeserialization.
The simple stored procedure mapper
Aggregates and ordering
Declarative mapped methods

Intermediate Functionality

The Bulk Loader
The Composite Type stored procedure mapper
The database admin functions

Advanced Functionality

Memoization of Catalog Lookups
Writing your own stored procedure mapper

This series will cover all the above functionality and likely more. As we get through the series, I hope that it will start to make sense and we will start to get a lot more discussion (and hopefully use) surrounding the framework.

Design Principles

The PGObject framework came out of a few years of experience building and maintaining LedgerSMB 1.3. In general we took what we liked and what seemed to work well and rewrote those things that didn't. Our overall approach has been based on the following principles:

SQL-centric: Declarative, hand-coded SQL is usually more productive than application programming languages. The system should leverage hand-coded SQL.
Leveraging Stored Procedures and Query Generators: The system should avoid having people generate SQL queries themselves as strings and executing them. It's better to store them persistently in the db or generate well-understood queries in general ways where necessary.
Flexible and Robust: It should be possible to extend a stored procedure's functionality (and arguments) without breaking existing applications.
DB-centric but Loosely Coupled: The framework assumes that databases are the center of the environment, and that it is a self-contained service in its own right. Applications need not be broken because the db structure changed, and the DB should be able to tell the application what inputs it expects.
Don't Make Unnecessary Decisions for the Developer: Applications may use a framework in many atypical ways and we should support them. This means that very often instead of assuming a single database connection, we instead provide hooks in the framework so the developer can decide how to approach this. Consequently you can expect your application to have to slightly extend the framework to configure it.

This framework is likely to be very different from anything else you have used. While it shares some similarities with iBatis in the Java world, it is unique in the sense that the SQL is stored in the database, not in config files. And while it was originally inspired by a number of technologies (including both REST and SOAP/WSDL), it is very much unlike any other framework I have come across.

Next in Series: Registered Types: Autoserialization and Deserialization between Numeric and Math::BigFloat.

LedgerSMB 1.4.0 Released

2014-09-14T16:36:00.000-07:00

15 September 2014, London. The LedgerSMB project - all-volunteer developers and contributors - today announced LedgerSMB 1.4.0.

Based on an open source code base first released in 1999, the LedgerSMB project was formed in 2006 and saw it's 1.0 release in the same year. It has now seen continuous development for over eight years and that shows no signs of slowing down.

"LedgerSMB 1.4 brings major improvements that many businesses need," said Chris Travers, who helped found the project. "Businesses which do manufacturing or retail, or need features like funds accounting will certainly get much more out of this new release."

Better Productivity

LedgerSMB 1.4 features a redesigned contact management framework that allows businesses to better keep track of customers, vendors, employers, sales leads, and more. Contacts can be stored and categorized, and leads can be converted into sales accounts.

Additionally, a new import module has been included that allows businesses to upload csv text files to import financial transactions and much more. No longer is data entry something that needs to be done entirely by hand or involves customizing the software.

Many smaller enhancements are here as well, For example, shipping labels can now be printed for invoices and orders, user management workflows have been improved,

Better Reporting

The reporting interfaces have been rewritten in LedgerSMB 1.4.0 in order to provide greater flexibility in both reporting and in sharing reports. Almost all reports now include a variety of formatting options including PDF and CSV formats. Reports can also be easily shared within an organization using stable hyperlinks to reports. Additionally the inclusion of a reporting engine means that it is now relatively simple to write third-party reports which offer all these features. Such reports can easily integrate with LedgerSMB or be accessed via a third party web page.

Additionally, the new reporting units system provides a great deal more flexibility in tracking money and resources as they travel through the system. Not only can one track by project or department, but funds accounting and other specialized reporting needs are possible to meet.

Better Integration

Integration of third-party line of business applications is also something which continues to improve. While all integration is possible, owing to the open nature of the code and db structure, it has become easier as more logic is moved to where it can be easily discovered by applications.

There are two major improvement areas in 1.4. First additional critical information, particularly regarding manufacturing and cost of goods sold tracking, has been moved into the database where it can be easily shared by other applications. This also allows for better testability and support. Secondly LedgerSMB now offers a framework for web services, which are currently available for contact management purposes, allowing integrators to more easily connect programs together.

Commercial Options

LedgerSMB isn't just an open source project. A number of commercial companies offer support, hosting, and customization services for this ERP. A list of some of the most prominant commercial companies involved can be found at http://ledgersmb.org/topic/commercial-support