Hashing
Hashing is one of the most useful ideas in coding interviews and system design. In coding interviews, it powers hash maps, sets, frequency counting, grouping, and fast lookup. In system design, it helps distribute keys across servers, shards, caches, and streams.
This page starts from first principles and builds toward practical interview use.
1. What Is Hashing?
Hashing converts an input, often called a key, into a fixed-size or numeric value using a hash function.
key -> hash function -> hash value
Examples:
"abc" -> hash function -> some number
"user_123" -> hash function -> some number
"file.png" -> hash function -> some number
For deterministic hashing, the same input should produce the same output:
hash("abc") = 96354
hash("abc") = 96354
hash("abc") = 96354
Important ideas:
- Hashing is not encryption.
- Hashing is usually one-way: you should not expect to recover the original input from the hash.
- Different inputs can sometimes produce the same hash value. This is called a collision.
"abc" -> hash -> 42
"xyz" -> hash -> 42
collision: two different keys produced the same hash value
In interviews, hashing usually means: “Can I turn this value into something I can look up quickly?”
2. What Does serverIndex = hash % N Mean?
In system design, hashing is often used to decide where a key should live.
serverIndex = hash(key) % N
Where:
keyis something like a user ID, request ID, file ID, or cache key.hash(key)converts the key into a number.Nis the number of servers, shards, or buckets.% Nmaps the number into a valid index from0toN - 1.
Example:
hash("user_123") = 17
N = 4
17 % 4 = 1
"user_123" goes to server 1
Diagram:
+----------+
"user_123" -> | hash() | -> 17
+----------+
|
v
17 % 4
|
v
server 1
The problem is that changing N changes the result:
17 % 4 = 1
17 % 5 = 2
So if you add or remove servers, many keys may move to different servers. This is why simple modulo hashing is easy to understand but painful when the server count changes often.
This motivates consistent hashing, covered later.
3. How Does a String Become a Number?
Computers already represent characters as numbers. In Python, ord() gives the Unicode code point for a character:
print(ord("a")) # 97
print(ord("b")) # 98
print(ord("你")) # 20320
A basic hash function can combine these character numbers into one final number.
Educational example:
def simple_hash(key: str) -> int:
hash_value = 0
for char in key:
hash_value = hash_value * 31 + ord(char)
return hash_value
print(simple_hash("abc"))
Manual walkthrough for "abc":
start = 0
'a' -> 97
hash = 0 * 31 + 97 = 97
'b' -> 98
hash = 97 * 31 + 98 = 3105
'c' -> 99
hash = 3105 * 31 + 99 = 96354
Why multiply by 31?
Multiplication makes character order matter.
Without multiplication, "abc" and "cba" could easily collapse into similar values because you are mostly adding character numbers. With multiplication, earlier characters get carried forward and affect the final number more.
"abc" and "cba" contain the same letters,
but their positions should produce different hash values.
This simple_hash() function is for learning only. Real hash functions handle overflow, distribution, speed, and collision behavior much more carefully.
4. Hash Tables / Hash Maps
A hash table stores key-value pairs using hashing.
key -> hash -> bucket index -> store/find value
Example:
"user_123" -> hash -> 17
17 % bucket_count -> bucket 1
bucket 1 contains the value for "user_123"
In Python:
dictis a hash map.setis a hash set.
Example:
users = {
"user_123": "Gary",
"user_456": "Alice",
}
print(users["user_123"])
Conceptually, Python uses hashing to locate "user_123" quickly instead of scanning every key one by one.
Common operations:
| Operation | Average Time | Notes |
|---|---|---|
| Insert | O(1) | Add a key-value pair. |
| Lookup | O(1) | Find value by key. |
| Delete | O(1) | Remove key-value pair. |
Worst case can degrade if many keys collide into the same bucket. Real hash tables use collision-handling strategies such as:
- Chaining: each bucket stores a list of entries.
- Open addressing: if a bucket is occupied, probe another bucket.
Simple chaining diagram:
bucket 0 -> []
bucket 1 -> [("user_123", "Gary"), ("user_999", "Bob")]
bucket 2 -> []
bucket 3 -> [("user_456", "Alice")]
5. Common Interview Patterns Using Hashing
| Pattern | Data Structure | Example Problem |
|---|---|---|
| Existence check | set |
Find duplicates |
| Value to index | dict |
Two Sum |
| Frequency count | dict / Counter |
Valid Anagram |
| Grouping | defaultdict(list) |
Group Anagrams |
| Sliding window memory | dict / set |
Longest substring without repeating characters |
Two Sum
def two_sum(nums: list[int], target: int) -> list[int]:
seen = {} # value -> index
for i, num in enumerate(nums):
need = target - num
if need in seen:
return [seen[need], i]
seen[num] = i
return []
Complexity:
- Time:
O(n) - Space:
O(n)
Why hashing helps: each complement lookup is average O(1).
Count Frequency
from collections import Counter
def char_frequency(s: str) -> Counter:
return Counter(s)
print(char_frequency("banana"))
Complexity:
- Time:
O(n) - Space:
O(k), wherekis the number of unique characters.
Group Anagrams
from collections import defaultdict
def group_anagrams(words: list[str]) -> list[list[str]]:
groups = defaultdict(list)
for word in words:
key = "".join(sorted(word))
groups[key].append(word)
return list(groups.values())
Complexity:
- Time:
O(n * m log m), wherenis number of words andmis average word length. - Space:
O(n * m)for grouped output and keys.
Why hashing helps: the sorted word becomes a dictionary key for grouping.
Detect Duplicate
def has_duplicate(nums: list[int]) -> bool:
seen = set()
for num in nums:
if num in seen:
return True
seen.add(num)
return False
Complexity:
- Time:
O(n) - Space:
O(n)
6. Hashing in System Design
Practical uses:
- Sharding: decide which database shard owns a key.
- Caching: decide which cache node stores a key.
- Load balancing: route related requests consistently.
- Request routing: send a request to the same backend based on user/session key.
- Deduplication: identify repeated files, events, or payloads.
- Checksums: detect accidental data changes.
- Consistent hashing: reduce key movement when servers change.
Stable hash example:
import hashlib
def stable_hash_to_int(key: str) -> int:
digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
return int(digest, 16)
def choose_server(key: str, num_servers: int) -> int:
return stable_hash_to_int(key) % num_servers
print(choose_server("user_123", 4))
Important: do not use Python’s built-in hash() for distributed systems.
Python randomizes string hash values between processes/runs for security. That means this can change:
print(hash("user_123"))
For distributed systems, use a stable hash such as:
- SHA-256
- MurmurHash
- xxHash
- CityHash
Use cryptographic hashes when integrity/security matters. Use fast non-cryptographic hashes when performance matters and security is not the goal.
7. Consistent Hashing
Simple modulo hashing:
hash(key) % N
Problem:
N changes -> many keys move
Example:
hash("user_123") = 17
17 % 4 = 1
17 % 5 = 2
Adding one server changed the destination from server 1 to server 2.
Consistent hashing reduces this movement.
Mental model:
hash ring
[S1]
/ \
keyA [S2]
| |
v v
next server next server
\ /
[S3]
How it works:
- Imagine a ring of hash values.
- Servers are placed on the ring using a hash of the server ID.
- Keys are placed on the same ring using a hash of the key.
- A key goes to the next server clockwise.
Example:
key -> hash -> position on ring
server -> hash -> position on ring
key belongs to the first server found clockwise
Why this helps:
- When a server is added, only nearby keys move to it.
- When a server is removed, only its keys move to the next server.
- Most keys stay where they were.
Consistent hashing is useful for:
- Distributed caches
- Distributed databases
- Request routing
- Storage systems
Virtual nodes:
Real systems often place each physical server on the ring multiple times as virtual nodes.
Server A -> A1, A2, A3
Server B -> B1, B2, B3
Server C -> C1, C2, C3
This improves distribution and reduces the chance that one server gets too many keys.
8. Replication, Quorums, and N/R/W
Hashing answers:
Which node should own this key?
Replication and quorum settings answer:
How many copies exist, and how many must agree before a read/write succeeds?
Common terms:
N= replication factor, the number of nodes that store a copy of the data.W= write quorum, the number of replicas that must confirm a write.R= read quorum, the number of replicas that must respond to a read.
Example with N = 3:
key = "user_123"
consistent hashing chooses primary node:
user_123 -> Node A
replication stores copies on:
Node A, Node B, Node C
Now the system can tune R and W.
| Setting | Meaning | Optimized For | Tradeoff |
|---|---|---|---|
R = 1, W = N |
All replicas must accept writes; reads need only one replica. | Fast reads | Writes are slower and less available because every replica must be reachable. |
R = N, W = 1 |
Writes need one replica; reads check all replicas. | Fast writes | Reads are slower and less available because every replica must be reachable. |
R + W > N |
Read and write quorums overlap on at least one replica. | Stronger consistency | Higher latency and lower availability than weaker quorum settings. |
R + W <= N |
Read and write quorums might not overlap. | Lower latency / higher availability | Reads may return stale data. |
Why R + W > N matters:
N = 3
W = 2
R = 2
W + R = 4
4 > 3
If a write succeeds on 2 replicas, and a read checks 2 replicas, the read must overlap with at least one replica that saw the latest successful write.
Replicas: A, B, C
Write quorum: A, B
Read quorum: B, C
Overlap: B
That overlap is what gives stronger consistency.
Important nuance: people often say R + W > N gives “strong consistency”, but in real systems it also depends on implementation details such as conflict resolution, read repair, hinted handoff, sloppy quorums, clocks, leader/leaderless design, and whether reads always choose the newest version. For interviews, the safe wording is:
R + W > N gives quorum overlap, which supports stronger consistency.
Quick examples:
N = 3, W = 2, R = 2: balanced quorum, stronger consistency.N = 3, W = 3, R = 1: read-optimized because reads are fast, but writes require all replicas.N = 3, W = 1, R = 3: write-optimized because writes are fast, but reads require all replicas.N = 3, W = 1, R = 1: fastest and most available, but stale reads are possible.
Interview framing:
Hashing decides where data is placed.
Replication decides how many copies exist.
R/W quorum decides the consistency vs latency/availability tradeoff.
9. Where Hashing Shows Up
Hashing appears in many places, not just coding interview hash maps. In interviews, recognize the pattern: a system has a key, hashes it, and uses the result to find, group, route, verify, or distribute something.
| Area | Example |
|---|---|
| Programming language data structures | Python dict, Java HashMap |
| Relational databases | Hash indexes, hash joins, partitioning/sharding |
| NoSQL databases | DynamoDB partition keys, Cassandra partitioning, Redis Cluster hash slots |
| Caching | Decide which cache node stores a key |
| Load balancing | Route the same user, session, or request key to the same backend |
| Distributed systems | Sharding, consistent hashing, data placement |
| Security | Password hashing, file integrity checks, signed payload verification |
| Storage systems | Deduplication, checksums, content-addressed storage |
| Message streaming | Route events to partitions/shards, e.g. by user ID |
| Probabilistic data structures | Bloom filters use multiple hashes to test whether an item was probably seen before |
10. Which Hashing Should You Use?
Different hashing algorithms are built for different jobs. In interviews, the important thing is not memorizing every algorithm; it is knowing when you need speed, stability, security, or password protection.
Rules of thumb:
- For coding interview hash maps/sets, use the language built-ins.
- For distributed systems, use a stable hash. Do not use Python
hash(). - For security, use cryptographic hashes or HMACs.
- For passwords, use password hashing algorithms, not normal hashes.
- For very high-throughput routing, prefer fast non-cryptographic hashes unless security is required.
11. Different Types of Hashing Algorithms
| Type | Examples | Used For |
|---|---|---|
| Non-cryptographic hash | MurmurHash, xxHash, CityHash | Fast lookup, sharding, routing |
| Cryptographic hash | SHA-256, SHA-512 | Integrity, signatures, content identity |
| Password hashing | bcrypt, scrypt, Argon2 | Password storage |
Key points:
- Fast hashes are good for performance but not security.
- Cryptographic hashes are slower but safer for integrity/security.
- Password hashing must be intentionally slow and salted.
Password example idea:
Bad: password -> SHA-256 -> store hash
Good: password -> bcrypt/scrypt/Argon2 with salt -> store password hash
12. AWS / Cloud References
TODO: Verify AWS-specific details against official AWS documentation before publishing.
Hashing-like ideas appear in many cloud systems:
- DynamoDB partition key: the partition key is hashed to determine physical partition placement.
- S3 object key distribution: object keys are distributed internally; historically key naming patterns mattered more, but modern S3 automatically scales request rates.
- ElastiCache / Redis Cluster: keys are assigned to hash slots.
- Load balancers: some algorithms/configurations may use hashing-like routing to keep related requests on the same target.
- Kinesis partition key: the partition key determines which shard receives the record.
Interview framing:
If the service needs to distribute records by key,
ask: what is the key, how is it hashed, and what happens when capacity changes?
13. Common Pitfalls
- Confusing hashing with encryption.
- Assuming hashes are always unique.
- Forgetting collisions.
- Using Python
hash()for distributed systems. - Using
% Nwhen server count changes frequently. - Using SHA-256 for high-throughput non-security routing when a faster hash may be better.
- Storing passwords with normal hashes instead of password hashing algorithms.
- Forgetting that a bad hash function can create uneven distribution.
- Forgetting that hot keys can overload one shard even if hashing is correct.
14. Interview Cheat Sheet
Coding interviews:
- Use
setfor existence checks. - Use
dictfor mapping. - Use
Counter/defaultdictfor frequency and grouping. - Hashing gives average
O(1)lookup, insert, and delete. - Always think about collisions, even if Python handles them internally.
System design:
- Use hashing to distribute keys.
hash % Nis simple but bad whenNchanges.- Consistent hashing reduces movement when servers are added or removed.
- Choose stable hash functions for distributed systems.
- Watch for hot keys and uneven partitions.
Security:
- Hashing is not encryption.
- Hashing is usually one-way.
- Cryptographic hashes are for integrity and identity.
- Passwords need bcrypt, scrypt, or Argon2.
Quick mental model:
Coding:
key -> hash -> bucket -> fast lookup
System design:
key -> stable hash -> shard/cache/server
Security:
input -> cryptographic hash -> integrity check
password -> slow salted password hash -> safe password storage