First, a note: all my tests involved a relatively simple table with a schema like this (column names did vary):
CREATE TABLE enctest (
id int,
id_text text,
id_enc bytea
);
In MySQL varbinary(64) was used instead of bytea.
The id column was populated from a sequence from 1 to 100000 (I had more trouble loading this in MySQL than in PostgreSQL). id_text was a text cast of id, and id_enc was the value of id_text encrypted using 128-bit AES. This was intended to mimic sales data consisting of short strings that would be decrypted and converted to numeric data before aggregation.
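The post doesn't show the exact load commands; as a rough sketch of how such a table could be populated on the PostgreSQL side with pgcrypto (the key string and the raw AES call here are illustrative assumptions, not the values used in the tests):

CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Illustrative only: 'sixteen byte key' is a placeholder 128-bit key.
INSERT INTO enctest (id, id_text, id_enc)
SELECT i,
       i::text,
       encrypt(i::text::bytea, 'sixteen byte key'::bytea, 'aes')  -- raw 128-bit AES of the text value
  FROM generate_series(1, 100000) AS s(i);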
The goal was to see how fast the different implementations would decrypt all records and aggregate them as numeric data. For PostgreSQL, pgcrypto was used. The MySQL tests were conducted under ANSI mode, and the tables were InnoDB.
What I found was remarkably disturbing. While MySQL was blazingly fast, that speed comes at the cost of basic error checking: rather than raising an error, decrypting with the wrong key would sometimes give the wrong data back, even in traditional modes. Per the documentation, warnings are only promoted to errors on insert, not on select. In other words, MySQL is just as permissive in read operations with STRICT mode turned on as with it turned off.
mysql> select sum(cast(aes_decrypt(id_enc, sha2('secret', 512)) as decimal)) FROM enctest;
+----------------------------------------------------------------+
| sum(cast(aes_decrypt(id_enc, sha2('secret', 512)) as decimal)) |
+----------------------------------------------------------------+
| 5000050000 |
+----------------------------------------------------------------+
1 row in set (0.33 sec)
That is fast. Very fast. My similar query in PostgreSQL took about 200 seconds, approximately 600 times as long, and was entirely CPU-bound the whole time.
efftest=# explain (analyse, verbose, costs, buffers)
          select sum(pgp_sym_decrypt(testvalsym, 'mysecretpasswd')::numeric) from sumtest;
                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=7556.16..7556.17 rows=1 width=62) (actual time=217381.965..217381.966 rows=1 loops=1)
   Output: sum((pgp_sym_decrypt(testvalsym, 'mysecretpasswd'::text))::numeric)
   Buffers: shared read=5556 written=4948
   ->  Seq Scan on public.sumtest  (cost=0.00..6556.08 rows=100008 width=62) (actual time=0.015..1504.897 rows=100000 loops=1)
         Output: testval, testvaltext, testvalenc, testvalsym
         Buffers: shared read=5556 written=4948
 Total runtime: 217382.010 ms
(7 rows)
My first thought was that for there to be nearly a three-orders-of-magnitude difference between the two implementations, something must be seriously wrong on the PostgreSQL side. This is a huge difference. But then something occurred to me. What if I used the wrong password?
On PostgreSQL:
efftest=# explain (analyse, verbose, costs, buffers)
select sum(pgp_sym_decrypt(testvalsym, 'mysecretpasswd2')::numeric) from sumtest;
ERROR: Wrong key or corrupt data
On MySQL, it is a very different story:
mysql> select sum(cast(aes_decrypt(id_enc, sha2('secret2', 512)) as decimal)) FROM enctest;
+-----------------------------------------------------------------+
| sum(cast(aes_decrypt(id_enc, sha2('secret2', 512)) as decimal)) |
+-----------------------------------------------------------------+
| 1456 |
+-----------------------------------------------------------------+
1 row in set, 6335 warnings (0.34 sec)
Hmmm, out of 100000 rows, only 6335 (about 6%) gave a warning, and we got a meaningless answer back. Thanks, MySQL. So I tried some others:
mysql> select sum(cast(aes_decrypt(id_enc, sha2('s', 512)) as decimal)) FROM enctest;
+-----------------------------------------------------------+
| sum(cast(aes_decrypt(id_enc, sha2('s', 512)) as decimal)) |
+-----------------------------------------------------------+
| 1284 |
+-----------------------------------------------------------+
1 row in set, 6230 warnings (0.35 sec)
Again about 6% warnings and a meaningless answer returned. Wow, this is fun...
Try as I might, I couldn't get MySQL to throw any errors, and I always got meaningless results back with the wrong key. A closer look revealed that MySQL was throwing warnings only when certain rare criteria were met, and was performing no validation to ensure that the decrypted data matched the data that went in. Further review showed that the cryptograms were much shorter on MySQL than on PostgreSQL, suggesting that PostgreSQL pads short strings so that the cryptography better protects the data. More on this later.
This suggested that the difference in performance might well be related to extra sanity checks in PostgreSQL that MySQL omits for the sake of speed. Armed with this knowledge, I tried the following:
efftest=# update sumtest set testvalsym = pgp_sym_encrypt(testvaltext, 'mysecretpasswd', 's2k-mode=0, s2k-digest-algo=md5');
UPDATE 100000
The query returned fairly quickly. However, these settings are not really recommended for production environments.
I then re-ran my data test queries and my performance queries, and the results were two orders of magnitude faster:
efftest=# explain (analyse, verbose, costs, buffers)
select sum(pgp_sym_decrypt(testvalsym, 'mysecretpasswd2')::numeric) from sumtest;
ERROR: Wrong key or corrupt data
efftest=# explain (analyse, verbose, costs, buffers)
          select sum(pgp_sym_decrypt(testvalsym, 'mysecretpasswd')::numeric) from sumtest;
                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=13111.00..13111.01 rows=1 width=71) (actual time=1996.574..1996.575 rows=1 loops=1)
   Output: sum((pgp_sym_decrypt(testvalsym, 'mysecretpasswd'::text))::numeric)
   Buffers: shared hit=778 read=10333
   ->  Seq Scan on public.sumtest  (cost=0.00..12111.00 rows=100000 width=71) (actual time=0.020..128.722 rows=100000 loops=1)
         Output: testval, testvaltext, testvalenc, testvalsym
         Buffers: shared hit=778 read=10333
 Total runtime: 1996.617 ms
(7 rows)
Much, much faster. Of course that comes at the cost of security features.
The primary security features changed here are what are called string-to-key (S2K) functions. PostgreSQL also offers some relatively complex containers for short data, which include things like padding and session keys. MySQL does not provide string-to-key management and requires that you generate the key yourself. PostgreSQL provides a number of options for string-to-key generation, which allow salted hashes to be used to derive the actual encryption key.
One of the most obvious implications is that with MySQL you have to generate your salted hash yourself, while PostgreSQL may generate a different salted key for each row. This is very important for encryption, particularly with smaller strings, because it helps thwart rainbow tables. In essence, with salted keys there is no 1:1 relationship between the passphrase/data combination and the cryptogram, because there is no 1:1 relationship between the passphrase and the key. Further testing suggested that this is not responsible for the performance difference, but it does suggest there are more checks lurking beneath the surface which MySQL omits.
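A quick way to see the salting in action (an illustrative check, not part of the original tests) is to encrypt the same value twice: pgp_sym_encrypt salts each message, so repeated runs differ, while raw AES with a fixed key, which is closer to what MySQL's AES_ENCRYPT does, is deterministic:

-- PGP symmetric encryption: a different salt/session key every time, so the cryptograms differ
SELECT pgp_sym_encrypt('42', 'mysecretpasswd')
     = pgp_sym_encrypt('42', 'mysecretpasswd');   -- false

-- Raw AES with a fixed (placeholder) key: deterministic, like MySQL's AES_ENCRYPT
SELECT encrypt('42'::bytea, 'sixteen byte key'::bytea, 'aes')
     = encrypt('42'::bytea, 'sixteen byte key'::bytea, 'aes');   -- true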
So, given that the issue is not string-to-key management, the issue must be padding. For very short strings, PostgreSQL is managing padding and containers, while MySQL is simply encrypting the short strings with no more than minimal padding. With far less padding and container handling to do, the decryption routines are much faster, but this comes at the cost of any reasonable security. Additionally, PostgreSQL performs data checks that MySQL does not.
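The container overhead is easy to observe (again an illustrative check; exact sizes depend on the options used): a one-character plaintext comes back from pgp_sym_encrypt as a full PGP message, but from raw AES as a single 16-byte block, which is roughly what MySQL stores:

SELECT octet_length(pgp_sym_encrypt('1', 'mysecretpasswd'));
-- dozens of bytes: salt, session key, packet headers, integrity data

SELECT octet_length(encrypt('1'::bytea, 'sixteen byte key'::bytea, 'aes'));
-- 16 bytes: one padded AES block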
So what does this tell us? I think the primary lesson, which has been driven home for me a few times, is that database-level encryption is tricky. This is particularly true when other considerations are involved, like the performance of aggregating data over significant sets. Add to this the woes of in-database key management and the like, and in-database encryption is definitely expert territory. In this regard, MySQL's approach seems to require a lot more complexity to maintain security than PostgreSQL's.
It is important to remember that short encrypted strings are relatively common in databases which use encryption. One of the most common uses is for things like credit card numbers. For the reasons mentioned here I would suggest that PostgreSQL is much more trustworthy in these cases.
Monday, August 26, 2013
When to use SELECT * in PostgreSQL
In LedgerSMB we use a lot of queries which involve SELECT *. Many people consider SELECT * harmful, but there are several cases where it is useful. Keep in mind that we encapsulate the database behind an API, so SELECT * has different implications than it does for applications selecting directly from tables.
The Fundamental Design Questions
It all comes down to software interface contracts and types. Poorly thought-out contracts, loosely applied, lead to unmaintainable code. Clear contracts, carefully applied, lead to maintainable code because the expectations are easily enforced.
PostgreSQL comes with a complex type system where every table, view, or composite type is an object class. In the right contexts, SELECT * provides you a result of a guaranteed type. This is important when doing object relational work because it means you get a series of objects back in a defined class. This allows you to then pass those on to other functions to get derived data.
SELECT * therefore helps you when working with objects, because you can ensure that the result rows are in fact valid objects of the class defined by the relation named in the query's FROM clause.
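As a hypothetical sketch of what passing those objects on looks like (the table, columns, and function here are invented for illustration): a function can take the table's row type as its argument, and a query that selects whole rows can feed it directly, either explicitly or via attribute ("table method") notation:

CREATE TABLE inventory_item (
    id          int PRIMARY KEY,
    sku         text,
    unit_price  numeric,
    qty_on_hand int
);

-- A function over the row type: it receives a whole inventory_item object.
CREATE FUNCTION item_value(i inventory_item) RETURNS numeric
LANGUAGE SQL AS
$$ SELECT i.unit_price * i.qty_on_hand $$;

-- SELECT * guarantees rows of type inventory_item, so they can be passed on.
SELECT i.*, item_value(i) AS stock_value FROM inventory_item i;
SELECT i.*, i.item_value  AS stock_value FROM inventory_item i;  -- method-style call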
Where SELECT * can't be helpful
SELECT * is never helpful (and can cause significant problems) in certain areas, most notably anywhere you have a join. There are concrete reasons for these problems.
Consider the following:
chris=# create table typetest (test text);
CREATE TABLE
chris=# insert into typetest values ('test1'), ('test2');
INSERT 0 2
chris=# CREATE VIEW typetestview AS select * from typetest;
CREATE VIEW
chris=# select * from typetestview;
test
-------
test1
test2
(2 rows)
chris=# alter table typetest add newfield bool default false;
ALTER TABLE
chris=# select * from typetestview;
test
-------
test1
test2
(2 rows)
(Edited, see note [1].) Interestingly, this is fixed in the view definition, so dump and reload won't change it. However, without setting up casts, you can't take advantage of the fact that you are passing up the data structure.
The key problem here is that views are their own types, and consequently you cannot guarantee that the view type will be the same as the underlying table type. This makes casts more complex to maintain and probably not worth the trouble.
Once joins are used in a query, however, SELECT * loses any potential benefit. Joins do not return a defined type, and so SELECT * should never be used in queries utilizing joins (aside, possibly, from ad hoc queries run by the DBA to explore the data).
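For a join there is no ready-made row type, so a wrapper has to spell out its own contract rather than relying on *. A hypothetical sketch (the account_lines table and the column names are invented for illustration):

-- No table or view defines this shape, so the function declares it explicitly.
CREATE FUNCTION accounts__with_balances()
RETURNS TABLE (account_no int, description text, balance numeric)
LANGUAGE SQL AS
$$
    SELECT a.account_no, a.description, coalesce(sum(l.amount), 0)
      FROM accounts a
 LEFT JOIN account_lines l ON l.account_no = a.account_no
  GROUP BY a.account_no, a.description
  ORDER BY a.account_no;
$$;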
SELECT * and Stored Procedures
Consider for example the following CRUD stored procedure:
CREATE OR REPLACE FUNCTION accounts__list_all()
RETURNS SETOF accounts
LANGUAGE SQL AS
$$
SELECT * FROM accounts ORDER BY account_no;
$$;
This query is relatively simple, but the stored procedure returns a type that is defined by the underlying table. We all run into cases where application data can't be much further normalized and we may want to have stored procedures delivering that data to the application. In this case, we are likely to use a function like this, and that enables us to do other object-relational things outside it.
Now, if we need to change the underlying accounts table, we can always decide whether to make accounts a view with a stable representation, a complex type with a hand-coded query returning it, or simply to propagate the changes upwards. Because the application is not selecting directly from the underlying storage, we have options for ensuring that the contract can be maintained. In essence, this adds a layer of indirection that allows us to maintain contracts more easily through schema changes.
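As a hedged example of the first option (the renamed table, added column, and description column are hypothetical): if the storage has to change, the old shape can be preserved as a view, and accounts__list_all() recreated against it, so callers never notice:

-- Storage changes underneath...
ALTER TABLE accounts RENAME TO accounts_storage;
ALTER TABLE accounts_storage ADD COLUMN internal_flags int NOT NULL DEFAULT 0;

-- ...but the published shape stays stable.
CREATE VIEW accounts AS
    SELECT account_no, description   -- the original column list only
      FROM accounts_storage;

-- Then recreate accounts__list_all() with RETURNS SETOF accounts, binding it to
-- the view's row type; its callers and their contract are unaffected.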
Consequently, although it leads to the same execution plan in this example, there is a tremendous difference, software-engineering-wise, between an application calling:
SELECT * FROM accounts ORDER BY account_no;
and
SELECT * FROM accounts__list_all();
In the first case, you have only one contract, between the high level application code and the low-level storage. In the second case, you have two contracts, one between the storage and the procedure (which can be centrally adjusted), and a more important one between the application code and the stored procedure.
Conclusions
In PostgreSQL, the choice of whether to use SELECT * in a query is a relatively clear one. If you want to return objects of the type of an underlying construct, and the return type stays closely tied to that construct over time, then SELECT * is fine. On the other hand, if these things aren't true, then either find ways to make them true or avoid using SELECT * altogether. This makes a lot more sense when you realize that things like table methods can be passed up when SELECT * is used (or methods applied to views, and the like).
In general, you will find two areas where SELECT * is most helpful. The first is in object-relational design. The second is where you want PostgreSQL to define an object model for you. In reality, the first case is a special case of the second.
This way of doing things is very different from the way most applications work today: the database is encapsulated behind an object model, and the application consumes that object model. In those cases, SELECT * is very helpful.
[1] Corrected as per a comment. I was under the mistaken impression that the SELECT * would remain part of the view definition. It still leads to annoying and unexpected changes in the view definition, such as when you drop and recreate the view, so I would still discourage it here.
Sunday, August 18, 2013
In defence of hand coded SQL
One common comment I get when I point out that I hand-write all my SQL queries rather than relying on an ORM or the like is that this is drudge work made obsolete by modern tools; when I mention that these queries are usually wrapped in stored procedures, the reactions range from disdainful to horrified. This piece is the other side: why I do this and why I find it works. I am not saying these approaches are free of costs; software engineering is about tradeoffs, and those tradeoffs are real. My approach is not a magic bullet, but it forms a vital piece of how I build software on the database.
The first thing to note is that I use a lot of SELECT * FROM table queries when querying tables whose structure matches the output structure. We all run into tables that cannot reasonably be normalized further and whose structure can feed directly into the application. In a stored procedure wrapper, SELECT * reduces the maintenance points for such tables when new fields need to be added (the query still matches the specified return type with no modifications). This has costs, in that it discourages refactoring of tables down the road, but that just needs to be watched. One can still have central management by using views if needed. Central management of type definitions is generally a good thing. Views can take the place of an ORM....
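A hypothetical sketch of that last point (all table, column, and function names invented for illustration): a view can assemble the application-facing shape from normalized tables, and the SELECT * wrapper then targets the view, which becomes the centrally managed type:

CREATE VIEW customer_summary AS
SELECT c.id,
       c.name,
       count(i.id)                AS open_invoices,
       coalesce(sum(i.amount), 0) AS amount_due
  FROM customers c
  LEFT JOIN invoices i ON i.customer_id = c.id AND NOT i.paid
 GROUP BY c.id, c.name;

CREATE FUNCTION customer_summary__list_all()
RETURNS SETOF customer_summary
LANGUAGE SQL AS
$$ SELECT * FROM customer_summary ORDER BY name $$;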
The second point is that CRUD queries of this sort don't really take significant time to write, even on a well-normalized database, and having them encapsulated behind a reasonably well-designed procedural interface is not a bad thing, provided that some of the classical difficulties of stored procedures are addressed.
I find that my overall development time is not slowed down by hand-writing SQL, and this remains true even as the software matures. The time savings of automatic query tools are traded for the time one doesn't spend thinking about how best to use queries in the application. The fact is that, as application developers, we tend to do a lot in application code that could be done better as part of a query. Sitting down and thinking about how the queries fit into the application is one of the single most productive exercises one can do.
The reason is that a lot of data can be processed and filtered in the queries themselves. This allows one to request that the database send back data in the way the application can make best use of it. This can eliminate a lot of application-level code and lead to a shrinking codebase. This in turn allows application-level code to make better use of data returned from queries, which leads to better productivity all around.
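For instance (a made-up illustration of the kind of query involved, reusing the hypothetical customers and invoices tables above), rather than fetching every invoice row and summing in application code, the query can return exactly the rows and totals the screen needs:

SELECT c.id, c.name, sum(i.amount) AS overdue_total
  FROM customers c
  JOIN invoices i ON i.customer_id = c.id
 WHERE i.due_date < current_date
   AND NOT i.paid
 GROUP BY c.id, c.name
HAVING sum(i.amount) > 0
 ORDER BY overdue_total DESC;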