Task
I’m working on my final project for school: we are supposed to make a web app of our choosing, and it has to include specific features. One is that all data must be encrypted; another is that we have to have search functionality. My app (a customer support framework) has a ticket feature where customers can submit help-request tickets. The contents of these tickets need to be encrypted at rest, while at the same time admins need to be able to search the contents of tickets.
Current Plan
My current plan is to store an AES-256-encrypted copy of the message in message.content to meet the encryption requirement, and also store a tokenized and hashed version in message.hashed to meet the searchability requirement.
The tokenization/hashing method will be:
- strip the message to alphanumeric + whitespace ([a-zA-Z0-9 ])
- tokenize by splitting the message by whitespace,
- SHA-256 each token,
- rejoin all the hashed tokens into a space-separated string and store it in the message.hashed field.
Thus “this is a test string” becomes <hash of this> <hash of is> <hash of a> <hash of test> <hash of string>.
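The tokenization/hashing steps above might look like this in Python (a sketch; `hash_tokens` is a made-up helper name, since the post doesn’t say what stack the app uses):

```python
import hashlib
import re

def hash_tokens(text: str) -> str:
    """Strip to alphanumerics + spaces, split on whitespace,
    SHA-256 each token, and rejoin as a space-separated string
    (the value that would land in message.hashed)."""
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", "", text)
    return " ".join(
        hashlib.sha256(token.encode()).hexdigest()
        for token in cleaned.split()
    )
```

`hash_tokens("this is a test string")` yields five 64-character hex digests separated by spaces. Note that as written, case is preserved, so “Test” and “test” hash to different values; you may want to lowercase before hashing.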
When the user searches, their search string goes through all of the steps in the tokenization/hashing method, and then we query the message table for message.hashed LIKE '%<hashed string>%'. If my thinking is right, we should be able to find it.
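That query step can be sketched end to end with an in-memory SQLite table (illustrative only; the table and column names follow the message.hashed naming above, everything else is assumed):

```python
import hashlib
import re
import sqlite3

def hash_tokens(text: str) -> str:
    # Same strip/tokenize/SHA-256/rejoin steps as in the plan.
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", "", text)
    return " ".join(hashlib.sha256(t.encode()).hexdigest() for t in cleaned.split())

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE message (id INTEGER PRIMARY KEY, hashed TEXT)")
db.execute("INSERT INTO message (hashed) VALUES (?)",
           (hash_tokens("this is a test string"),))

def search(query: str) -> list:
    # The search string goes through the same hashing steps, then we
    # LIKE-match the joined hashes. Because the hashes are space-joined,
    # a multi-word query behaves like a *phrase* search: it only matches
    # the same words in the same consecutive order.
    needle = hash_tokens(query)
    rows = db.execute("SELECT id FROM message WHERE hashed LIKE ?",
                      ("%" + needle + "%",))
    return [row[0] for row in rows]
```

One consequence worth knowing: `search("test string")` finds the row, but `search("string test")` does not, even though both words occur in the message — the LIKE match requires the hashed tokens to appear consecutively and in order.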
Concerns
- Statistical analysis of hashed tokens
- I really don’t see a way around this, to make the string searchable the hashing needs to be predictable.
- The message.hashed field could potentially be huge: if each word gets a SHA-256 hash, a large message could result in a very long hash string. Maybe we just store the last 4 characters of each hash?
- This would increase collisions, but the likelihood of multiple last-4s colliding in a given search string should be pretty dang small, and any collisions would likely not be valid language.
- Would this help with the statistical-analysis concern? Increasing collisions would decrease the effectiveness of statistical analysis. It would be a performance hit, but after returning all matches against the hashes I could decrypt the message.content data and run the raw search query against the unencrypted text, removing any incorrect results caused by collisions.
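The last-4 idea plus the decrypt-and-filter pass could be sketched like this (plaintexts are passed in directly here to keep the sketch self-contained; in the real app they would come from decrypting message.content for the candidate rows):

```python
import hashlib
import re

HASH_SUFFIX_LEN = 4  # store only the last 4 hex characters of each digest

def short_hash_tokens(text: str) -> str:
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", "", text)
    return " ".join(
        hashlib.sha256(t.encode()).hexdigest()[-HASH_SUFFIX_LEN:]
        for t in cleaned.split()
    )

def search_with_filter(messages, query):
    """messages: list of (plaintext, short_hashed) pairs.
    First pass: cheap substring match on the truncated hashes, which may
    return false positives from collisions. Second pass: confirm against
    the decrypted text and drop the collisions."""
    needle = short_hash_tokens(query)
    candidates = [pt for pt, hashed in messages if needle in hashed]
    return [pt for pt in candidates if query in pt]
```

With 4 hex characters there are only 65,536 distinct values, so first-pass collisions are expected at any real scale; it’s the second pass that makes the results exact.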
I’m interested in hearing everyone’s thoughts, am I being logical in my reasoning?


What’s the expected volume of records you plan to store?
For a small volume on a school assignment (a few thousand records per query), I would add a processor/filter in my base database access layer and do the encryption and decryption there for any field annotated as @Encrypt or similar at the field level (language dependent; not sure what you are using).
Some libraries use a similar approach during their serialization and deserialization steps. I’m guessing you are required to write the whole thing yourself, but reading how those work might give you ideas, since they tend to have hooks for wiring in custom logic during the process.
This would add overhead during reads and writes, but it would be pretty transparent to the rest of the business logic, and as mentioned, as long as the requirements don’t say you need to support searching over a few million records in X amount of time, it might be OK.
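In Python, that field-level hook could be sketched with a descriptor (the names `Encrypt` and `Message` are made up for illustration, and the base64 calls below are a stand-in, not encryption — a real implementation would call AES-256 via a crypto library there):

```python
import base64

def encrypt(plaintext: str) -> bytes:
    # PLACEHOLDER ONLY: base64 is an encoding, NOT encryption.
    # Swap in real AES-256 (e.g. AES-GCM) from a crypto library.
    return base64.b64encode(plaintext.encode())

def decrypt(ciphertext: bytes) -> str:
    return base64.b64decode(ciphertext).decode()

class Encrypt:
    """Descriptor standing in for a field-level @Encrypt annotation:
    values are encrypted on write and decrypted on read, so the rest
    of the business logic only ever sees plaintext."""
    def __set_name__(self, owner, name):
        self.storage = "_enc_" + name
    def __set__(self, obj, value):
        setattr(obj, self.storage, encrypt(value))
    def __get__(self, obj, objtype=None):
        return decrypt(getattr(obj, self.storage))

class Message:
    content = Encrypt()
```

Usage: after `m = Message(); m.content = "help me"`, reading `m.content` transparently decrypts, while the stored `m._enc_content` holds only ciphertext — which is the “transparent to the business logic” property described above.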
The hash idea sounds quicker at first (hashing vs. on-the-fly encryption/decryption), but it does not sound like it would scale well either, unless the message size is constrained like you mentioned. Another problem is that it could be extremely easy to brute-force with a rainbow table, which kind of defeats encrypting it to begin with. If you pursue that approach, you’d also need to store a salt with each hash to prevent that attack type.
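One wrinkle on the salting point: a random salt per row would also break the search, because the query tokens could no longer be hashed to match any row without knowing that row’s salt. A common compromise is a single server-side secret used as an HMAC key — still deterministic (so search keeps working) but useless to precomputed rainbow tables without the key. A sketch, with the key name and loading assumed:

```python
import hashlib
import hmac

# Assumption: one server-side secret, loaded from config/env, never stored
# in the database. Deterministic across rows, so query tokens hash the
# same way as stored tokens, unlike a per-row random salt.
SECRET_KEY = b"hypothetical-server-side-secret"

def keyed_token_hash(token: str) -> str:
    # HMAC-SHA256 of the token under the server secret; an attacker
    # without the key cannot precompute a token -> hash table.
    return hmac.new(SECRET_KEY, token.encode(), hashlib.sha256).hexdigest()
```

The trade-off: anyone who obtains the key (or the running server) can dictionary-attack the hashes again, so the key needs the same protection as the AES key.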
Custom encryption solutions and security through obscurity tend to be the weakest points in an implementation, which it sounds like is part of what the assignment wants you to think about.