First a note, all my tests involved a relatively simple table with a schema like this (column names did vary):
CREATE TABLE enctest (
id int,
id_text text,
id_enc bytea
);
In MySQL varbinary(64) was used instead of bytea.
The id was formed from a sequence from 1 to 100000. I had more trouble loading this in MySQL than in PostgreSQL. id_text was a text cast of id, and id_enc was the value of id_text encrypted using 128-bit AES encryption. This was intended to mimic sales data consisting of short strings that would be decrypted and converted to numeric data before aggregation.
The goal was to see how fast the different implementations would decrypt all records and aggregate as numeric data types. For PostgreSQL, pgcrypto was used. The tests were conducted under ANSI mode on MySQL, and the tables were innodb.
What I found was remarkably disturbing. While MySQL was blazingly fast, this speed came at the cost of basic error checking and rather than an error, decrypting with the wrong key would give the wrong data back sometimes, even on traditional modes. This is because the errors instead of warnings, per the documentation, are only transformed on insert, not on select. In other words, MySQL is just as permissive in read operations with STRICT mode turned on as turned off.
mysql> select sum(cast(aes_decrypt(id_enc, sha2('secret', 512)) as decimal)) FROM enctest;
+----------------------------------------------------------------+
| sum(cast(aes_decrypt(id_enc, sha2('secret', 512)) as decimal)) |
+----------------------------------------------------------------+
| 5000050000 |
+----------------------------------------------------------------+
1 row in set (0.33 sec)
That is fast. Very fast, My similar query in PostgreSQL took about 200 seconds, so approx 600x as long, and was entirely CPU-bound the whole time.
efftest=# explain (analyse, verbose, costs, buffers) select sum(pgp_sym_decrypt(testvalsym, 'mysecretpasswd')::numeric) from sumtest;
QUERY PLAN
--------------------------------------------------------------------------------
-----------------------------------------------
Aggregate (cost=7556.16..7556.17 rows=1 width=62) (actual time=217381.965..217
381.966 rows=1 loops=1)
Output: sum((pgp_sym_decrypt(testvalsym, 'mysecretpasswd'::text))::numeric)
Buffers: shared read=5556 written=4948
-> Seq Scan on public.sumtest (cost=0.00..6556.08 rows=100008 width=62) (ac
tual time=0.015..1504.897 rows=100000 loops=1)
Output: testval, testvaltext, testvalenc, testvalsym
Buffers: shared read=5556 written=4948
Total runtime: 217382.010 ms
(7 rows)
My first thought was that for there to be a 3-orders-of-magnitude difference between the two implementations, something must be seriously wrong on the PostgreSQL side. This is a huge difference. But then something occurred to me. What if I use the wrong password?
On PostgreSQL:
efftest=# explain (analyse, verbose, costs, buffers)
select sum(pgp_sym_decrypt(testvalsym, 'mysecretpasswd2')::numeric) from sumtest;
ERROR: Wrong key or corrupt data
On MySQL, it is a very different story:
mysql> select sum(cast(aes_decrypt(id_enc, sha2('secret2', 512)) as decimal)) FROM enctest;
+-----------------------------------------------------------------+
| sum(cast(aes_decrypt(id_enc, sha2('secret2', 512)) as decimal)) |
+-----------------------------------------------------------------+
| 1456 |
+-----------------------------------------------------------------+
1 row in set, 6335 warnings (0.34 sec)
Hmmm, out of 100000 rows, only 6000 (6%) gave a warning, and we got a meaningless answer back. Thanks, MySQL. So I tried some others:
mysql> select sum(cast(aes_decrypt(id_enc, sha2('s', 512)) as decimal)) FROM enctest;
+-----------------------------------------------------------+
| sum(cast(aes_decrypt(id_enc, sha2('s', 512)) as decimal)) |
+-----------------------------------------------------------+
| 1284 |
+-----------------------------------------------------------+
1 row in set, 6230 warnings (0.35 sec)
Again 6% warnings, meaningless answer returned. Wow this is fun.....
Try as I might I couldn't get MySQL to throw any errors, and I always got meaningless results back with the wrong key. A closer look would reveal that MySQL was throwing warnings only when certain rare criteria were met and was performing no validation on the data to ensure it matched the data in. Further review showed that the cryptograms were much shorter on MySQL than PostgreSQL suggesting that PostgreSQL was padding short strings in order to ensure that cryptography would better protect the data. More on this later.
This suggested that the difference in the performance might well be related to extra sanity checks in PostgreSQL that MySQL omitted for speed-related purposes. Armed with this knowledge, I tried the following:
efftest=# update sumtest set testvalsym = pgp_sym_encrypt(testvaltext, 'mysecretpasswd', 's2k-mode=0, s2k-digest-algo=md5');
UPDATE 100000
The query returned pretty fast. However these settings are not really recommended for production environments.
I went ahead and tried again my data test queries and my performance queries and the results were two orders of magnitude faster:
efftest=# explain (analyse, verbose, costs, buffers)
select sum(pgp_sym_decrypt(testvalsym, 'mysecretpasswd2')::numeric) from sumtest;
ERROR: Wrong key or corrupt data
efftest=# update sumtest set testvalsym = pgp_sym_encrypt(testvaltext, 'mysecretpasswd', 's2k-mode=0, s2k-digest-algo=md5');
UPDATE 100000
efftest=# explain (analyse, verbose, costs, buffers) select sum(pgp_sym_decrypt(testvalsym, 'mysecretpasswd2')::numeric) from sumtest;
ERROR: Wrong key or corrupt data
efftest=# explain (analyse, verbose, costs, buffers)
select sum(pgp_sym_decrypt(testvalsym, 'mysecretpasswd')::numeric) from sumtest;
QUERY PLAN
--------------------------------------------------------------------------------
-----------------------------------------------
Aggregate (cost=13111.00..13111.01 rows=1 width=71) (actual time=1996.574..199
6.575 rows=1 loops=1)
Output: sum((pgp_sym_decrypt(testvalsym, 'mysecretpasswd'::text))::numeric)
Buffers: shared hit=778 read=10333
-> Seq Scan on public.sumtest (cost=0.00..12111.00 rows=100000 width=71) (a
ctual time=0.020..128.722 rows=100000 loops=1)
Output: testval, testvaltext, testvalenc, testvalsym
Buffers: shared hit=778 read=10333
Total runtime: 1996.617 ms
(7 rows)
Much, much faster. Of course that comes at the cost of security features.
The primary security features changed here are what are called string to key functions. PostgreSQL also offers some relatively complex containers for short data which include things like padding and session keys. MySQL does not provide string to key management, and requires that you generate the hexadecimal key yourself. PostgreSQL provides a number of options for string to key generation which allow for salted hashes to be used for the actual encryption.
One of the most obvious implications here is that with MySQL, you have to generate your salted hash yourself, while with PostgreSQL, it may generate a different salted hash for each line. This is very important for encryption particularly with smaller strings because this helps thwart rainbow tables. In essence with salted keys, there is no 1:1 relationship between the passphrase/data combination and the cryptogram, because there is no 1:1 relationship between the passphrase and the key. Further testing suggests that this is not responsible for the performance difference but it does suggest there are more checks lurking beneath the surface which are omitted from MySQL.
So given that the issue is not string to key management, the issue must be padding. For very short strings, PostgreSQL is managing padding and containers, while MySQL is purely encrypting short strings without more than minimal padding. Since there is insufficient padding, the decryption routines are much faster, but this comes at a cost of any reasonable security. Additionally PostgreSQL provides data checks that are not done on MySQL.
So what does this tell us? I think the primary lesson which I have had driven home a few times is that database-level encryption is tricky. This is particularly true when other considerations are involved, like performance aggregating data over significant sets. Add to this the woes of in-db key management and the like and in-db encryption is definitely expert territory. In this regard, MySQL's approach seems to require a lot more complexity to maintain security than PostgreSQL's.
It is important to remember that short encrypted strings are relatively common in databases which use encryption. One of the most common uses is for things like credit card numbers. For the reasons mentioned here I would suggest that PostgreSQL is much more trustworthy in these cases.
For future people who read this... would you mind citing versions of mysql and postgres used?
ReplyDeletePostgreSQL 9.1.9 and MySQL 5.5.32
DeleteWouldn't it make more sense to test similar encryption algos? pgp_sym_encrypt uses some form of PGP encryption... using encrypt() and decrypt() directly gives postgres much better numbers:
ReplyDelete-- note this is a debug enabled build...
=> select version();
version
──────────────────────────────────────────────────────────────────────────────────────────────────────────
PostgreSQL 9.3rc1 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.8.1 20130725 (prerelease), 64-bit
=> insert into enctest select g, g, encrypt(g::text::bytea, digest('secret2', 'sha256'), 'aes') from generate_series(1, 100000) as g;
INSERT 0 100000
Time: 519.701 ms
-- makes sum() easier
=> set bytea_output to 'escape';
SET
=> select sum(decrypt(id_enc, digest('secret2', 'sha256'), 'aes')::text::numeric) from enctest ;
sum
────────────
5000050000
(1 row)
Time: 296.031 ms
Yeah MySQL giving you bad data there due to the wrong key has nothing to do with MySQL vs PostgreSQL. It's because of the encryption algorithm. The PGP one does fancier stuff to make sure the data is decrypted correctly. If you did the equivalent algorithm in MySQL it would probably give you error checking just like pgp_sym_decrypt(), and would be fantastically slower.
ReplyDeleteThat was more or less the conclusion. The recommended methods in PostgreSQL are more secure because they do a number of fancy things for the sake of reliability.
Delete