Thursday, January 26, 2017

PL/Perl and Large PostgreSQL Databases

One of the topics discussed in the large database talk is the way we used PL/Perl to solve some data variety problems in terms of extracting data from structured text documents.

It is certainly possible to use other languages to do the same, but PL/Perl has an edge in a number of important ways.  PL/Perl is light-weight, flexible and fills this particular need better than any other language I have worked with.

While one of the considerations has often been knowledge of Perl in the team, PL/Perl has a number of specific reasons to recommend it:

  1. It is light-weight compared to PL/Java and many other languages
  2. It excels at processing text in general ways.
  3. It has extremely mature regular expression support
These features combine to create a procedural language for PostgreSQL which is particularly good at extracting data from structured text documents in the scientific space.  Structured text files are very common and being able to extract, for example, a publication date or other information from the file is very helpful.

Moreover when you mark your functions as immutable, you can index the output, and this is helpful when you want ordered records starting at a certain point.

So for example, suppose we want to be able to query on plasmid lines in UNIPROT documents but we have not set this up before we loaded the table.  We could easily create a PL/Perl function like:

CREATE OR REPLACE FUNCTION plasmid_lines(uniprot text) 
RETURNS text[]
LANGUAGE PLPERL IMMUTABLE AS
$$
use strict;
use warnings;
my ($uniprot) = @_;
my @lines = grep { /^OG\s+Plasmid/ } split /\n/ $uniprot;
return [ map {  my $l = $_; $l =~ s/^OG\s+Plasmid\s*//; $l } @lines ];
$$;


You could  then create a GIN index on the array elements:

CREATE INDEX uniprot_doc_plasmids ON uniprot_docs USING gin (plasmid_lines(doc));

Neat!

14 comments:

  1. ERP Software SolutionsERP Software Solutions gives the correct view of business information.

    ReplyDelete
  2. Gaining Python certifications will validate your skills and advance your career.
    python certification

    ReplyDelete
  3. Selenium is one of the most popular automated testing tool used to automate various types of applications. Selenium is a package of several testing tools designed in a way for to support and encourage automation testing of functional aspects of web-based applications and a wide range of browsers and platforms and for the same reason, it is referred to as a Suite.

    Selenium Interview Questions and Answers
    Javascript Interview Questions
    Human Resource (HR) Interview Questions

    ReplyDelete
  4. Yep you are right on this one. The PL/Perl is best to solve large
    database problems..

    ReplyDelete

  5. It's very interesting! I love your blog! I learn a lot from him! Recently, I was looking for a company on the internet of things on the internet. I found Pro4People on Google and entered contact us . It was the best I could do !!

    ReplyDelete
  6. Thanks for sharing with us, This article gives more useful information to me. Great post, keep updating.
    Web Designing Company in Bangalore | Website Development Company in Bangalore | Web Design Company in Bangalore

    ReplyDelete