How to Search on Securely Encrypted Database Fields – SitePoint

This post was originally published on the ParagonIE blog and reposted here with their permission.

We [ParagonIE] get asked the same question a lot (or some remix of it).

This question shows up from time to time in open source encryption libraries bug trackers. This was one of the weird problems covered in my talk at B-Sides Orlando (titled Building Defensible Solutions to Weird Problems), and weve previously dedicated a small section to it in one of our white papers.

You know how to search database fields, but the question is, How do we securely encrypt database fields but still use these fields in search queries?

Our secure solution is rather straightforward, but the path between most teams asking that question and discovering our straightforward solution is fraught with peril: bad designs, academic research projects, misleading marketing, and poor threat modeling.

If youre in a hurry, feel free to skip ahead to the solution.

Lets start with a simple scenario (which might be particularly relevant for a lot of local government or health care applications):

Lets first explore the flaws with the obvious answers to this problem.

The most obvious answer to most teams (particularly teams that dont have security or cryptography experts) would be to do something like this:

In the above snippet, the same plaintext always produces the same ciphertext when encrypted with the same key. But more concerning with ECB mode is that every 16-byte chunk is encrypted separately, which can have some extremely unfortunate consequences.

Formally, these constructions are not semantically secure: If you encrypt a large message, you will see blocks repeat in the ciphertext.

In order to be secure, encryption must be indistinguishable from random noise to anyone that does not hold the decryption key. Insecure modes include ECB mode and CBC mode with a static (or empty) IV.

You want non-deterministic encryption, which means each message uses a unique nonce or initialization vector that never repeats for a given key.

There is a lot of academic research going into such topics as homomorphic, order-revealing, and order-preserving encryption techniques.

As interesting as this work is, the current designs are nowhere near secure enough to use in a production environment.

For example, order-revealing encryption leaks enough data to infer the plaintext.

Homomorphic encryption schemes are often repackaging vulnerabilities (practical chosen-ciphertext attacks) as features.

As weve covered in a previous blog post, when it comes to real-world cryptography, confidentiality without integrity is the same as no confidentiality. What happens if an attacker gains access to the database, alters ciphertexts, and studies the behavior of the application upon decryption?

Theres potential for ongoing cryptography research to one day produce an innovative encryption design that doesnt undo decades of research into safe cryptography primitives and cryptographic protocol designs. However, were not there yet, and you dont need to invest into a needlessly complicated research prototype to solve the problem.

I dont expect most engineers to arrive at this solution without a trace of sarcasm. The bad idea here is, because you need secure encryption (see below), your only recourse is to query every ciphertext in the database and then iterate through them, decrypting them one-by-one and performing your search operation in the application code.

If you go down this route, you will open your application to denial of service attacks. It will be slow for your legitimate users. This is a cynics answer, and you can do much better than that, as well demonstrate below.

Lets start by avoiding all the problems outlined in the insecure/ill-advised section in one fell swoop: All ciphertexts will be the result of an authenticated encryption scheme, preferably with large nonces (generated from a secure random number generator).

With an authenticated encryption scheme, ciphertexts are non-deterministic (same message and key, but different nonce, yields a different ciphertext) and protected by an authentication tag. Some suitable options include: XSalsa20-Poly1305, XChacha20-Poly1305, and (assuming its not broken before CAESAR concludes) NORX64-4-1. If youre using NaCl or libsodium, you can just use crypto_secretbox here.

Consequently, our ciphertexts are indistinguishable from random noise, and protected against chosen-ciphertext attacks. Thats how secure, boring encryption ought to be.

However, this presents an immediate challenge: We cant just encrypt arbitrary messages and query the database for matching ciphertexts. Fortunately, there is a clever workaround.

Before you begin, make sure that encryption is actually making your data safer. It is important to emphasize that encrypted storage isnt the solution to securing a CRUD app thats vulnerable to SQL injection. Solving the actual problem (i.e. preventing the SQL injection) is the only way to go.

If encryption is a suitable security control to implement, this implies that the cryptographic keys used to encrypt/decrypt data are not accessible to the database software. In most cases, it makes sense to keep the application server and database server on separate hardware.

Possible use-case: Storing social security numbers, but still being able to query them.

In order to store encrypted information and still use the plaintext in SELECT queries, were going to follow a strategy we call blind indexing. The general idea is to store a keyed hash (e.g. HMAC) of the plaintext in a separate column. It is important that the blind index key be distinct from the encryption key and unknown to the database server.

For very sensitive information, instead of a simple HMAC, you will want to use a key-stretching algorithm (PBKDF2-SHA256, scrypt, Argon2) with the key acting as a static salt, to slow down attempts at enumeration. We arent worried about offline brute-force attacks in either case, unless an attacker can obtain the key (which must not stored in the database).

So if your table schema looks like this (in PostgreSQL flavor):

You would store the encrypted value in humans.ssn. A blind index of the plaintext SSN would go into humans.ssn_bidx. A naive implementation might look like this:

A more comprehensive proof-of-concept is included in the supplemental material for my B-Sides Orlando 2017 talk. Its released under the Creative Commons CC0 license, which for most people means the same thing as public domain.

Depending on your exact threat model, this solution leaves two questions that must be answered before it can be adopted:

Given our example above, assuming your encryption key and your blind index key are separate, both keys are stored in the webserver, and the database server doesnt have any way of obtaining these keys, then any attacker that only compromises the database server (but not the web server) will only be able to learn if several rows share a social security number, but not what the shared SSN is. This duplicate entry leak is necessary in order for indexing to be possible, which in turn allows fast SELECT queries from a user-provided value.

Furthermore, if an attacker is capable of both observing/changing plaintexts as a normal user of the application while observing the blind indices stored in the database, they can leverage this into a chosen-plaintext attack, where they iterate every possible value as a user and then correlate with the resultant blind index value. This is more practical in the HMAC scenario than in the e.g. Argon2 scenario. For high-entropy or low-sensitivity values (not SSNs), the physics of brute force can be on our side.

A much more practical attack for such a criminal would be to substitute values from one row to another then access the application normally, which will reveal the plaintext unless a distinct per-row key was employed (e.g. hash_hmac('sha256', $rowID, $masterKey, true) could even be an effective mitigation here, although others would be preferable). The best defense here is to use an AEAD mode (passing the primary key as additional associated data) so that the ciphertexts are tied to a particular database row. (This will not prevent attackers from deleting data, which is a much bigger challenge.)

Compared to the amount of information leaked by other solutions, most applications threat models should find this to be an acceptable trade-off. As long as youre using authenticated encryption for encryption, and either HMAC (for blind indexing non-sensitive data) or a password hashing algorithm (for blind indexing sensitive data), its easy to reason about the security of your application.

However, it does have one very serious limitation: It only works for exact matches. If two strings differ in a meaningless way but will always produce a different cryptographic hash, then searching for one will never yield the other. If you need to do more advanced queries, but still want to keep your decryption keys and plaintext values out of the hands of the database server, were going to have to get creative.

It is also worth noting that, while HMAC/Argon2 can prevent attackers that do not possess the key from learning the plaintext values of what is stored in the database, it might reveal metadata (e.g. two seemingly-unrelated people share a street address) about the real world.

Possible use-case: Encrypting peoples legal names, and being able to search with only partial matches.

Lets build on the previous section, where we built a blind index that allows you to query the database for exact matches.

This time, instead of adding columns to the existing table, were going to store extra index values into a join table.

The reason for this change is to normalize our data structures. You can get by with just adding columns to the existing table, but its likely to get messy.

The next change is that were going to store a separate, distinct blind index per column for every different kind of query we need (each with its own key). For example:

Every index needs to have a distinct key, and great pains should be taken to prevent blind indices of subsets of the plaintext from leaking real plaintext values to a criminal with a knack for crossword puzzles. Only create indexes for serious business needs, and log access to these parts of your application aggressively.

Thus far, all of the design propositions have been in favor of allowing developers to write carefully considered SELECT queries, while minimizing the number of times the decryption subroutine is invoked. Generally, that is where the train stops and most peoples goals have been met.

However, there are situations where a mild performance hit in search queries is acceptable if it means saving a lot of disk space.

The trick here is simple: Truncate your blind indexes to e.g. 16, 32, or 64 bits, and treat them as a Bloom filter:

It may also be worth converting these values from a string to an integer, if your database server will end up storing it more efficiently.

I hope Ive adequately demonstrated that it is not only possible to build a system that uses secure encryption while allowing fast queries (with minimal information leakage against very privileged attackers), but that its possible to build such a system simply, out of the components provided by modern cryptography libraries with very little glue.

If youre interested in implementing encrypted database storage into your software, wed love to provide you and your company with our consulting services. Contact ParagonIE if youre interested.

Go here to see the original:
How to Search on Securely Encrypted Database Fields - SitePoint

Related Posts

Comments are closed.