Lesson 5: Cryptographic Hashing
The digital world cannot work without cryptographic hashing. From official documents and downloads to passwords and cryptocurrency, there is always something that needs a digital fingerprint!
Hashing is mysterious to many programmers. Choosing the right hash function boils down to one question: What are we hashing - data or passwords? In this post, we demystify hashing and explain cryptographic hash functions as well as those specialised for password hashing.
In cryptography, a hash function is a one-way mathematical function that takes an input of arbitrary size, called the message, and produces a fixed-size output called a hash or a message digest. A message digest is a digital fingerprint that, for all practical purposes, identifies our data. If the data changes, even by a single bit, the message digest changes in ways you cannot predict. In essence, hashing provides data integrity: it lets us check that data has not been corrupted or altered during transfer.
Cryptographic hashing is used throughout our digital lives, often without us realising it. Here are some examples:
SHA-256 checksum to verify a file's or download's integrity.
Password hashing for secure storage and password verification.
SSL/TLS certificates to authenticate a website’s identity.
Digital signatures to check a digital message or document is authentic.
Signed tokens like a JSON Web Token (JWT) that provides verified access or a way to transmit information securely between parties.
These help secure our online activities and will be familiar to many. As a developer, chances are you are already using cryptographic hashing in your applications, for example through a cryptography library (e.g. hashlib or crypto) to generate hashes, or through a JWT library that uses hashing to sign and verify tokens.
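The first item in the list above, verifying a download against a published SHA-256 checksum, can be sketched with Python's hashlib. The function names `sha256_of_file` and `verify_download` and the chunk size are hypothetical choices made for this illustration, not part of any library.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Feed the file in pieces so a large download need not fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Compare the computed checksum with the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

The check is only as trustworthy as the source of the expected checksum, so it should come from a channel the attacker cannot modify.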
The terms hashing and hash function can refer to many things in computer science. Cryptographic hashing is distinct from the hashing used in data structures (e.g. a dictionary or map) and programming languages (such as Python’s hash() function, which returns the hash value of an object). They share the same idea (i.e. take input data and transform it into a fixed-length output) but differ greatly in implementation and application. In this post, I use hashing and hash functions to mean the cryptographic kind.
A hash function is deterministic, i.e. given the same input, it always produces the same output. It is also irreversible: given a hash, it is computationally infeasible to find an input that produces it. This property is called pre-image resistance. Another crucial property is collision resistance, which means it is computationally infeasible to find two different inputs that hash to the same value (such a pair is called a collision). These security properties set cryptographic hash functions apart from other hash functions.
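The properties that can be observed directly (determinism, fixed output size, and the unpredictable change in the digest) are easy to see in a short Python sketch. The helper `sha256_hex` is a name made up for this example. Pre-image and collision resistance cannot be demonstrated by running code; they are claims about what is infeasible to compute.

```python
import hashlib


def sha256_hex(text):
    """Hash a string (encoded as UTF-8) and return the digest as hex."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Deterministic: the same input always gives the same digest
print(sha256_hex("Hello World!") == sha256_hex("Hello World!"))  # True

# Fixed size: 64 hex characters (256 bits) regardless of input length
print(len(sha256_hex("")), len(sha256_hex("a" * 10_000)))  # 64 64

# A one-character change gives an unrelated-looking digest
a = sha256_hex("Hello World!")
b = sha256_hex("Hello World?")
differing = sum(x != y for x, y in zip(a, b))
print(a)
print(b)
print(differing, "of 64 hex characters differ")
```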
To date, SHA-2 (Secure Hash Algorithm 2) is the most widely used family of hashing algorithms. It replaces its predecessor, SHA-1, which is now considered insecure because practical collision attacks have been demonstrated. SHA-2 includes variants such as SHA-224, SHA-256, SHA-384, and SHA-512. Interestingly, the names of SHA-2 variants omit the version of the algorithm and use the number to indicate the output size. For instance, SHA-256 produces 256-bit (32-byte) outputs. SHA-256 is currently the most commonly used variant because it provides a minimum of 128 bits of security, which meets the requirements of most security purposes. SHA-3, released in 2015, is the latest addition to the SHA family, but its adoption is slow. While SHA-2 remains secure, SHA-3 is recommended for building future-proof applications.
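As a quick check of the naming scheme, the following Python sketch asks hashlib for the digest size of several SHA-2 and SHA-3 variants. It assumes a Python version where the SHA-3 functions are available in hashlib (3.6 or later).

```python
import hashlib

names = ("sha224", "sha256", "sha384", "sha512", "sha3_256", "sha3_512")

# digest_size is reported in bytes; multiply by 8 to get bits
sizes = {name: hashlib.new(name).digest_size * 8 for name in names}

for name, bits in sizes.items():
    digest = hashlib.new(name, b"Hello World!").hexdigest()
    print(f"{name:9} {bits:4} bits  {digest[:16]}...")
```

In each case the number in the name matches the output size in bits.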
A common mistake in hashing is poor handling of generated hashes. A hash does not verify the sender’s identity or provide evidence of who created it. When stored without protection, such as on a user’s device, hashes can be tampered with. Attackers who gain access to the storage can replace both a message and its hash with their own. A recipient who then receives the message and the hash has no way to tell whether they are genuine or come from a trusted source.
A viable solution is to mix a secret into the hash for message authentication, on the presumption that only parties possessing the secret can generate the same hash. However, the naive construction sha256(secret || message), where || denotes concatenation, must not be used. This is because SHA-256 and SHA-512 are susceptible to length extension attacks. Say d1 = sha256(secret || message). The digest d1 is simply the internal state of the hash function after it has processed the secret, the message, and the padding. Anyone who knows d1 and the length of the secret (which can be guessed) can resume hashing from that state, append data of their choice, and obtain d2 = sha256(secret || message || padding || message2), a valid hash for the extended message, without knowing the secret. The new hash and the modified message can then be passed off as genuine. Because the recomputed hash will match, the recipient will trust the message even though it has been tampered with. There are many ways to protect a hash, but I will simply say that if you ever need to hash a message with a secret to provide authentication, use a message authentication code (MAC) function such as HMAC (hash-based message authentication code).
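The HMAC recommendation above can be sketched with Python's standard hmac module. The secret, the message, and the function names `sign` and `verify` are placeholders invented for this example. In a real system the secret would come from secure key storage, not from source code.

```python
import hashlib
import hmac


def sign(secret: bytes, message: bytes) -> str:
    """Return an HMAC-SHA256 tag for the message as a hex string."""
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, message: bytes, tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = sign(secret, message)
    return hmac.compare_digest(expected, tag)


secret = b"hypothetical-shared-secret"  # placeholder value
message = b"amount=100&to=alice"
tag = sign(secret, message)

print(verify(secret, message, tag))                   # True
print(verify(secret, message + b"&to=mallory", tag))  # False
```

Using `hmac.compare_digest` instead of `==` avoids leaking, through timing, how many leading characters of a forged tag were correct.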
Let’s see how to create a SHA-256 hash in a few languages and tools.
Python (using hashlib):
import hashlib
message = 'Hello World!'
# Create a SHA-256 hash object
sha256_hash = hashlib.sha256()
# Update the hash object with the data
sha256_hash.update(message.encode('utf-8'))
# Get the hexadecimal representation of the hash
# (named hex_digest to avoid shadowing the built-in hash() function)
hex_digest = sha256_hash.hexdigest()
print('SHA-256 hash as hex:', hex_digest)
Node.js (using crypto):
const crypto = require('crypto');
const message = 'Hello World!';
// Create a SHA-256 hash object
const sha256Hash = crypto.createHash('sha256');
// Update the hash object with the data
sha256Hash.update(message, 'utf-8');
// Get the hexadecimal representation of the hash
const hash = sha256Hash.digest('hex');
console.log('SHA-256 hash as hex:', hash);
Java (using MessageDigest):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256Example {
    public static void main(String[] args) {
        try {
            String message = "Hello World!";
            // Create a MessageDigest object using the SHA-256 algorithm
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            // Update the digest with the data
            md.update(message.getBytes(StandardCharsets.UTF_8));
            // Get the byte array of the hash
            byte[] hash = md.digest();
            // Convert the byte array to a hexadecimal string
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) {
                sb.append(String.format("%02x", b));
            }
            System.out.println("SHA-256 hash as hex: " + sb.toString());
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
    }
}
Coding tip: It is potentially unsafe to use the platform default character encoding when converting a string into a byte array or vice versa. It is a good practice to specify the encoding explicitly (e.g. UTF-8) to ensure safe conversions.
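To illustrate the tip, here is a small Python sketch showing that the same string hashed under two different encodings yields two different digests. The sample text is arbitrary. Any string containing a non-ASCII character behaves the same way.

```python
import hashlib

text = "café"

# The same characters become different bytes under different encodings
print(text.encode("utf-8"))    # b'caf\xc3\xa9'
print(text.encode("latin-1"))  # b'caf\xe9'

utf8_digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
latin1_digest = hashlib.sha256(text.encode("latin-1")).hexdigest()

# Different bytes in, different hash out
print(utf8_digest == latin1_digest)  # False
```

A hash function operates on bytes, not characters, so two systems that disagree on the encoding will disagree on the hash.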
Using OpenSSL:
$ echo -n 'Hello World!' > hello.txt # assuming default encoding is UTF-8
$ openssl dgst -sha256 hello.txt
Regardless of the tool and programming language you use, the SHA-256 algorithm always hashes the string Hello World! in UTF-8 encoding to the hexadecimal value 7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069.
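People often confuse hashing with encryption. These two primitives are commonly used together in real-world cryptography, but they are distinct. Hashing is one-way, uses no key, always produces a fixed-size output, and provides integrity. Encryption is reversible by anyone holding the right key, and provides confidentiality.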
Hashing passwords
History tells us that it is a bad idea to store passwords as plaintext in databases. Passwords can be leaked when databases are breached or compromised. Password hashing allows you to authenticate a user without actually knowing their password. This can be done in the following steps:
When a user registers in our application, we hash the password and save it to the database.
When the user wants to authenticate, we hash the provided password and compare it with the stored hash from the database. If it matches, the user is authenticated.
If you are worried that two different passwords could end up with the same hash, recall that a secure hash function is collision-resistant. However, simply hashing passwords is not secure enough, because hashed passwords can still be cracked. When hashed passwords are leaked, attackers can recover them with a brute force attack (i.e. hashing every candidate password and comparing the results). With rainbow tables (i.e. pre-computed tables of hashes for many passwords), attackers can crack many passwords at once.
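The following Python sketch shows why unsalted hashes fall to pre-computed tables. It uses a plain dictionary lookup, which is simpler than a real rainbow table (a more compact structure) but illustrates the same idea. The user names, passwords, and wordlist are invented for this example.

```python
import hashlib


def sha256_hex(password):
    """Unsalted SHA-256 of a password, as hex (insecure, for illustration)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# Hypothetical leaked table of unsalted password hashes
leaked = {
    "alice": sha256_hex("123456"),
    "bob": sha256_hex("qwerty"),
    "carol": sha256_hex("correct horse battery staple"),
}

# The attacker hashes a wordlist once...
wordlist = ["password", "123456", "qwerty", "letmein"]
lookup = {sha256_hex(word): word for word in wordlist}

# ...and reuses the result against every leaked hash
cracked = {user: lookup[h] for user, h in leaked.items() if h in lookup}
print(cracked)  # {'alice': '123456', 'bob': 'qwerty'}
```

The cost of building the table is paid once, and every user with a password in the wordlist is exposed by a single lookup.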
If we want to slow attackers down, for example by forcing them to crack one password at a time, we can use salts, which are random values added to passwords before hashing. A common practice is to use a salt of at least 16 bytes (128 bits) that is different for each user and random enough to defend against rainbow tables. This effectively creates a unique hash function for each password, and it is computationally infeasible for attackers to pre-compute a rainbow table for every possible salt. A salt is not meant to be secret, so it does not need to be protected and can be stored unencrypted in the database alongside the hash.
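The effect of a salt can be sketched in a few lines of Python. This is for illustration only: it uses a single fast SHA-256 call, which is not suitable for storing real passwords. The function name `salted_hash` and the way the salt is prepended are assumptions of this example.

```python
import hashlib
import os


def salted_hash(password, salt=None):
    """Hash a password with a random 16-byte salt (illustration only)."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each user
    digest = hashlib.sha256(salt + password.encode("utf-8")).hexdigest()
    return salt, digest


# Two users choose the same password but get different salts
salt_a, hash_a = salted_hash("123456")
salt_b, hash_b = salted_hash("123456")
print(hash_a == hash_b)  # False: same password, different stored hashes

# Verification reuses the stored salt to recompute the hash
_, recomputed = salted_hash("123456", salt_a)
print(recomputed == hash_a)  # True
```

Because each stored hash depends on its own salt, a table pre-computed without that salt is useless against it.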
Hashing and salting are necessary for passwords, but still not good enough. With today’s computing power, it is possible to compute billions of hashes per second. If we can hash passwords this fast, attackers can try candidate passwords at the same speed. In response, specialised hash functions designed for password hashing, such as Argon2, PBKDF2, bcrypt, and scrypt, are used. These password hashes are designed to be computationally expensive, and thus slow. They are fast enough that users do not notice the delay, but slow enough to make brute force attacks impractical. The above-mentioned password hashes are battle-tested and hence recommended. Library implementations typically take care of salting for you, and support a variable-length salt and a number of other configurable parameters such as time and memory cost.
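As a rough sketch of the register-then-verify flow with a deliberately slow hash, the following Python code uses PBKDF2 from the standard library. The iteration count is an assumed work factor that should be tuned for your hardware and current guidance, and the function names and the tuple used as a stored record are hypothetical. A dedicated password hashing library remains the better choice in production.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # assumed work factor; tune for your hardware


def hash_password(password, iterations=ITERATIONS):
    """Registration: derive a key from the password with a fresh salt."""
    salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    # Store all three values in the user record
    return salt, iterations, key


def verify_password(password, salt, iterations, expected_key):
    """Login: recompute with the stored salt and compare in constant time."""
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(key, expected_key)


record = hash_password("my_secure_password")
print(verify_password("my_secure_password", *record))  # True
print(verify_password("wrong_password", *record))      # False
```

Storing the iteration count with each record makes it possible to raise the work factor later without invalidating existing hashes.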
In practice, one should not implement a password hash on their own. Established password hashing libraries are available in many programming languages and are relatively simple to use. Here’s how to use Python’s password hashing library Passlib with the Argon2 password hash (Passlib needs an Argon2 backend such as argon2-cffi to be installed):
from passlib.hash import argon2
password = 'my_secure_password'
# hash the password with Argon2 using the defaults
h = argon2.hash(password)
print(h)
# verify the password
if argon2.verify(password, h):
    print('Password is correct.')
else:
    print('Password is incorrect.')
This will print output similar to the following (the salt is random, so the hash differs on every run):
$argon2id$v=19$m=65536,t=3,p=4$jnGO8Z6zttYaY6y1NuZcCw$IStDoIQ9xAzJGF2N6BOwZg5UcyrsucHfpvywFgKIM0Y
Password is correct.
As a developer, it is important to know which hashing algorithms are secure and which are broken. Always avoid broken hash functions like MD5 and SHA-1. Failure to do so not only makes your applications vulnerable to attacks but also puts your users at unnecessary risk. SHA-2 is still secure for most cases but cannot be used to hash secrets naively. SHA-256 is the most widely used hashing algorithm in the industry, while SHA-3 adoption is slow. When storing passwords, use a recommended password hash such as Argon2, PBKDF2, bcrypt, or scrypt. These specialised algorithms are designed to be slow and make use of salts, and thus are effective against rainbow tables and brute force attacks. In the next post, we discuss encryption, which offers confidentiality by securing messages with a key. Both hashing and encryption provide the theory and foundations for advanced topics like digital signatures and token-based security. Make sure you have a good grasp of both concepts before going deeper.