What is it?
An implementation in Python for storing passwords with cryptographically recognized techniques.
Why is it important?
Stored passwords get disclosed from time to time. Good examples give us guidance and help us avoid mistakes.
Password disclosure is bad news for users and the authenticator. Worse it undermines the trust the public has in our systems for storing and accessing private, personal services.
When using cryptography it is possible that even if the values stored are disclosed that the secrets they reflect remain opaque, or at least cost ineffective to discover.
How does it work?
Using the cypher Password-Based Key Derivation Function 2 or (perhaps even better) scrypt or bcrypt one can make it expensive to obtain secrets from stored values. These cyphers are known as key derivation functions (KDF) and below is an example of using a KDF in Python on Google App Engine.
A KDF is given a token (e.g. a password) and returns what is called a derived key, which is the value that one stores.
With PBKDF2 it can be computationally expensive to verify that a token is equal to a derived key. The expense is proportional to the number of iterations given as a parameter. With scrypt the number of iterations increases not only the computational intensity but the amount of memory needed.
There is an excellent Q & A Security.SE: How to securely hash passwords? that goes into detail how to do this. What is below is just an implementation that I hope adheres to the principles set out there.
The entire class below can be found in a Github Gist.
The topic merits a little background. If someone slurps a database of passwords, there are a number of ways they can obtain the secret passwords from the stored values. If the passwords are stored in plain text then the gig is up. If the passwords are hashed, then there is more work that must be done.
One would typically start by comparing the values against a database of popular passwords, which may indicate the algorithm used to hash the values. For example the following, with outcomes shortened for brievity:
These indicators of underlying mechanisms and values are called oracles. Once one has an oracle that indicates the algorithm, one can use a dictionary attack to check for the presence of common passwords. These dictionary attacks are often very computationally cost effective, when employed with rainbow tables.
That said, oracles of this sort are like slugs: you can kill them with salts. That is not the end of the story though, since modern GPUs can calculate hashes such as SHA1 at a rate of around 2.3 billion per second.
The KDF requires a salt, but it also makes the oracle much more expensive. The KDF increass the comparisons necessary – one for every potential value of the iterations – and one can vary the number of iterations, so even where one matches a password to a secret that match generally reveals the secret in that one instance. With a regular hash or where the KDF iterations are constant the match will reveal the number of iterations.
Here is a KDF at work, with just an 8 byte key for illustration, and a salt of “” (two double-quote characters):
The outcome is always the same length, but varies on the iterations. It similarly varies on the salt. Here is one iteration with different salts:
You can try PBKDF2 online. Note how the number of iterations significantly increases the time to produce the result. Changing the salt will alter the outcome, but not the computation time.
Thus, even if one obtains the stored secrets it is computationally expensive to obtain the passwords that were used to authenticate individuals these secrets represent.
Here is what we include in Python:
As an aside, developing with
onto App Engine used to be something of a challenging. This issue has, with great thanks to the App Engine developers, been resolved. The challenge of
getting Crypto onto App Engine is why we settled on PBKDF2 instead of bcrypt or scrypt. It may be easy to get the below working for those KDFs as well, now.
I am going to use a class in App Engine’s NDB to store the credentials. One can reference theses credentials by a key, or making them an internal property of another NDB model, and changing them to work in another data store context should be straightforward.
class Credentials – definition
Here is the opening of the class definition with the constants and stored values for each credential. Every user would have a corresponding instance of the
class in the datastore that would be used to authenticate their identity.
I am not going to get into the
here, other than to note that are also worth bearing in mind in any authentication scheme.
String casting leakage
We do not want our logs or any other conversion of the credentials to leak the contents of the model, so we overload the following to prevent that from happening. This is just a good practice for this sort of thing.
We need to get some random data throughout the process, and this is how we do it here.
We vary the number of iterations that the KDF uses in a fairly complex way. The strength of the KDF comes not only from a high number of iterations but from the number being both unpredictable and increasing over time as computation power increases.
Generating a key
Generating a key is straightforward application of the KDF. Note that this
would only be called internally by the
Even where one may get the iterations, we use a static offset
– very slightly increasing our resilience.
Setting a key
When a user sets their password (in this case the
parameter), this is the function that is called. It sets the
to new values, and then generates a key.
Verifying a key
When we receive a password (
) we run it through the algorithm with the parameters that correspond to the stored instance of the
for the user the password is purported to belong to.
Nothing is perfect. This is just a small piece of a security puzzle, and there are many other vectors one must bear in mind.
This does raise the bar for the specific concern where the stored secrets are divulged, and the risk then that the authentication they represent may be disclosed.
A number of improvements could be made to the above that I can think of. For example, storing the iterations and salt with the derived key partially (but not entirely) undermines their value if the database is slurped.
There may be inherent flaws in the algorithms that we employ here. The risk is usually in the form of short-circuits that reduce the computation time (and memory, in the case of scrypt) needed to verify a token.
Inherent issues may exist in the pseudo-random number generator that reduce the entropy of the cryptography and as a result reduce the computation time needed since the possible outcomes are significantly reduced.
Unlike a key, an algorithm can be studied and analyzed by experts to determine if it is likely to be secure. An algorithm that you have invented yourself and kept secret has not had the opportunity for such review.
I hope that you take away from this some illumination, if not inspiration and curiosity, about some important elements of storing passwords.
Any thoughts you have are most welcome!