Hashing

Hashing is one of the most useful ideas in coding interviews and system design. In coding interviews, it powers hash maps, sets, frequency counting, grouping, and fast lookup. In system design, it helps distribute keys across servers, shards, caches, and streams.

This page starts from first principles and builds toward practical interview use.


1. What Is Hashing?

Hashing converts an input, often called a key, into a fixed-size or numeric value using a hash function.

key -> hash function -> hash value

Examples:

"abc"      -> hash function -> some number
"user_123" -> hash function -> some number
"file.png" -> hash function -> some number

For deterministic hashing, the same input should produce the same output:

hash("abc") = 96354
hash("abc") = 96354
hash("abc") = 96354

Important ideas:

"abc"  -> hash -> 42
"xyz"  -> hash -> 42

collision: two different keys produced the same hash value

In interviews, hashing usually means: “Can I turn this value into something I can look up quickly?”


2. What Does serverIndex = hash % N Mean?

In system design, hashing is often used to decide where a key should live.

serverIndex = hash(key) % N

Where:

Example:

hash("user_123") = 17
N = 4

17 % 4 = 1

"user_123" goes to server 1

Diagram:

              +----------+
"user_123" -> | hash()   | -> 17
              +----------+
                    |
                    v
                 17 % 4
                    |
                    v
               server 1

The problem is that changing N changes the result:

17 % 4 = 1
17 % 5 = 2

So if you add or remove servers, many keys may move to different servers. This is why simple modulo hashing is easy to understand but painful when the server count changes often.

This motivates consistent hashing, covered later.


3. How Does a String Become a Number?

Computers already represent characters as numbers. In Python, ord() gives the Unicode code point for a character:

print(ord("a"))  # 97
print(ord("b"))  # 98
print(ord("你")) # 20320

A basic hash function can combine these character numbers into one final number.

Educational example:

def simple_hash(key: str) -> int:
    hash_value = 0
    for char in key:
        hash_value = hash_value * 31 + ord(char)
    return hash_value


print(simple_hash("abc"))

Manual walkthrough for "abc":

start = 0

'a' -> 97
hash = 0 * 31 + 97 = 97

'b' -> 98
hash = 97 * 31 + 98 = 3105

'c' -> 99
hash = 3105 * 31 + 99 = 96354

Why multiply by 31?

Multiplication makes character order matter.

Without multiplication, "abc" and "cba" could easily collapse into similar values because you are mostly adding character numbers. With multiplication, earlier characters get carried forward and affect the final number more.

"abc" and "cba" contain the same letters,
but their positions should produce different hash values.

This simple_hash() function is for learning only. Real hash functions handle overflow, distribution, speed, and collision behavior much more carefully.


4. Hash Tables / Hash Maps

A hash table stores key-value pairs using hashing.

key -> hash -> bucket index -> store/find value

Example:

"user_123" -> hash -> 17
17 % bucket_count -> bucket 1
bucket 1 contains the value for "user_123"

In Python:

Example:

users = {
    "user_123": "Gary",
    "user_456": "Alice",
}

print(users["user_123"])

Conceptually, Python uses hashing to locate "user_123" quickly instead of scanning every key one by one.

Common operations:

Operation Average Time Notes
Insert O(1) Add a key-value pair.
Lookup O(1) Find value by key.
Delete O(1) Remove key-value pair.

Worst case can degrade if many keys collide into the same bucket. Real hash tables use collision-handling strategies such as:

Simple chaining diagram:

bucket 0 -> []
bucket 1 -> [("user_123", "Gary"), ("user_999", "Bob")]
bucket 2 -> []
bucket 3 -> [("user_456", "Alice")]

5. Common Interview Patterns Using Hashing

Pattern Data Structure Example Problem
Existence check set Find duplicates
Value to index dict Two Sum
Frequency count dict / Counter Valid Anagram
Grouping defaultdict(list) Group Anagrams
Sliding window memory dict / set Longest substring without repeating characters

Two Sum

def two_sum(nums: list[int], target: int) -> list[int]:
    seen = {}  # value -> index

    for i, num in enumerate(nums):
        need = target - num
        if need in seen:
            return [seen[need], i]
        seen[num] = i

    return []

Complexity:

Why hashing helps: each complement lookup is average O(1).

Count Frequency

from collections import Counter


def char_frequency(s: str) -> Counter:
    return Counter(s)


print(char_frequency("banana"))

Complexity:

Group Anagrams

from collections import defaultdict


def group_anagrams(words: list[str]) -> list[list[str]]:
    groups = defaultdict(list)

    for word in words:
        key = "".join(sorted(word))
        groups[key].append(word)

    return list(groups.values())

Complexity:

Why hashing helps: the sorted word becomes a dictionary key for grouping.

Detect Duplicate

def has_duplicate(nums: list[int]) -> bool:
    seen = set()

    for num in nums:
        if num in seen:
            return True
        seen.add(num)

    return False

Complexity:


6. Hashing in System Design

Practical uses:

Stable hash example:

import hashlib


def stable_hash_to_int(key: str) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16)


def choose_server(key: str, num_servers: int) -> int:
    return stable_hash_to_int(key) % num_servers


print(choose_server("user_123", 4))

Important: do not use Python’s built-in hash() for distributed systems.

Python randomizes string hash values between processes/runs for security. That means this can change:

print(hash("user_123"))

For distributed systems, use a stable hash such as:

Use cryptographic hashes when integrity/security matters. Use fast non-cryptographic hashes when performance matters and security is not the goal.


7. Consistent Hashing

Simple modulo hashing:

hash(key) % N

Problem:

N changes -> many keys move

Example:

hash("user_123") = 17

17 % 4 = 1
17 % 5 = 2

Adding one server changed the destination from server 1 to server 2.

Consistent hashing reduces this movement.

Mental model:

              hash ring

                 [S1]
             /           \
        keyA               [S2]
         |                   |
         v                   v
      next server        next server
             \           /
                 [S3]

How it works:

Example:

key -> hash -> position on ring
server -> hash -> position on ring

key belongs to the first server found clockwise

Why this helps:

Consistent hashing is useful for:

Virtual nodes:

Real systems often place each physical server on the ring multiple times as virtual nodes.

Server A -> A1, A2, A3
Server B -> B1, B2, B3
Server C -> C1, C2, C3

This improves distribution and reduces the chance that one server gets too many keys.


8. Replication, Quorums, and N/R/W

Hashing answers:

Which node should own this key?

Replication and quorum settings answer:

How many copies exist, and how many must agree before a read/write succeeds?

Common terms:

Example with N = 3:

key = "user_123"

consistent hashing chooses primary node:
user_123 -> Node A

replication stores copies on:
Node A, Node B, Node C

Now the system can tune R and W.

Setting Meaning Optimized For Tradeoff
R = 1, W = N All replicas must accept writes; reads need only one replica. Fast reads Writes are slower and less available because every replica must be reachable.
R = N, W = 1 Writes need one replica; reads check all replicas. Fast writes Reads are slower and less available because every replica must be reachable.
R + W > N Read and write quorums overlap on at least one replica. Stronger consistency Higher latency and lower availability than weaker quorum settings.
R + W <= N Read and write quorums might not overlap. Lower latency / higher availability Reads may return stale data.

Why R + W > N matters:

N = 3
W = 2
R = 2

W + R = 4
4 > 3

If a write succeeds on 2 replicas, and a read checks 2 replicas, the read must overlap with at least one replica that saw the latest successful write.

Replicas: A, B, C

Write quorum: A, B
Read quorum:  B, C

Overlap: B

That overlap is what gives stronger consistency.

Important nuance: people often say R + W > N gives “strong consistency”, but in real systems it also depends on implementation details such as conflict resolution, read repair, hinted handoff, sloppy quorums, clocks, leader/leaderless design, and whether reads always choose the newest version. For interviews, the safe wording is:

R + W > N gives quorum overlap, which supports stronger consistency.

Quick examples:

Interview framing:

Hashing decides where data is placed.
Replication decides how many copies exist.
R/W quorum decides the consistency vs latency/availability tradeoff.

9. Where Hashing Shows Up

Hashing appears in many places, not just coding interview hash maps. In interviews, recognize the pattern: a system has a key, hashes it, and uses the result to find, group, route, verify, or distribute something.

Area Example
Programming language data structures Python dict, Java HashMap
Relational databases Hash indexes, hash joins, partitioning/sharding
NoSQL databases DynamoDB partition keys, Cassandra partitioning, Redis Cluster hash slots
Caching Decide which cache node stores a key
Load balancing Route the same user, session, or request key to the same backend
Distributed systems Sharding, consistent hashing, data placement
Security Password hashing, file integrity checks, signed payload verification
Storage systems Deduplication, checksums, content-addressed storage
Message streaming Route events to partitions/shards, e.g. by user ID
Probabilistic data structures Bloom filters use multiple hashes to test whether an item was probably seen before

10. Which Hashing Should You Use?

Different hashing algorithms are built for different jobs. In interviews, the important thing is not memorizing every algorithm; it is knowing when you need speed, stability, security, or password protection.

Rules of thumb:


11. Different Types of Hashing Algorithms

Type Examples Used For
Non-cryptographic hash MurmurHash, xxHash, CityHash Fast lookup, sharding, routing
Cryptographic hash SHA-256, SHA-512 Integrity, signatures, content identity
Password hashing bcrypt, scrypt, Argon2 Password storage

Key points:

Password example idea:

Bad:  password -> SHA-256 -> store hash
Good: password -> bcrypt/scrypt/Argon2 with salt -> store password hash

12. AWS / Cloud References

TODO: Verify AWS-specific details against official AWS documentation before publishing.

Hashing-like ideas appear in many cloud systems:

Interview framing:

If the service needs to distribute records by key,
ask: what is the key, how is it hashed, and what happens when capacity changes?

13. Common Pitfalls


14. Interview Cheat Sheet

Coding interviews:

System design:

Security:

Quick mental model:

Coding:
key -> hash -> bucket -> fast lookup

System design:
key -> stable hash -> shard/cache/server

Security:
input -> cryptographic hash -> integrity check
password -> slow salted password hash -> safe password storage
Contents