The Million-Row Problem#
You’re building a URL shortener. Every time someone creates a short link, you generate a random code and check if it already exists in the database. One database query per attempt. At 1,000 URLs, this is fine — the query takes a millisecond, the index is tiny, nobody notices.
At 100 million URLs, you’re generating codes that collide more often (birthday paradox), each collision triggers another database round trip, and those round trips add up under high throughput. You’re not slow because your code is bad — you’re slow because you’re asking the database a question it doesn’t need to answer.
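Concretely, the naive generate-and-check loop looks something like this (a sketch — `existsInDb` stands in for whatever database client you use):

```typescript
// Naive approach: every candidate code costs one database round trip.
function randomBase62(length: number): string {
  const chars =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
  let code = "";
  for (let i = 0; i < length; i++) {
    code += chars[Math.floor(Math.random() * chars.length)];
  }
  return code;
}

async function generateCodeNaive(
  existsInDb: (code: string) => Promise<boolean> // one network round trip per call
): Promise<string> {
  while (true) {
    const code = randomBase62(6);
    // Collision → loop again → another round trip. At high collision
    // rates this is where the latency piles up.
    if (!(await existsInDb(code))) return code;
  }
}
```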
What if you could check “does this code already exist?” without touching the database at all?
A Bit Array With an Attitude#
A Bloom filter is a bit array (say, 20 bits, all starting at 0) combined with a handful of hash functions. Let me walk through the full lifecycle:
Starting state — empty bit array:
[0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0]
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Check "abc123" — is it in the set?
hash1 → position 3 → 0 ❌
→ "Definitely not" (one zero = guaranteed absent, no need to check other hashes)
Insert "abc123":
hash1 → position 3
hash2 → position 8
hash3 → position 14
[0][0][0][1][0][0][0][0][1][0][0][0][0][0][1][0][0][0][0][0]
          ^              ^                 ^
          3              8                 14
Check "abc123" — is it in the set now?
hash1 → position 3 → 1 ✅
hash2 → position 8 → 1 ✅
hash3 → position 14 → 1 ✅
→ "Probably exists" ✓ (correct — we just inserted it)
Check "xyz789" — is it in the set?
hash1 → position 3 → 1 ✅ (set by abc123!)
hash2 → position 6 → 0 ❌
→ "Definitely not" (one zero = guaranteed absent)
Insert "xyz789":
hash1 → position 3 (already 1 — shared with abc123)
hash2 → position 6
hash3 → position 17
[0][0][0][1][0][0][1][0][1][0][0][0][0][0][1][0][0][1][0][0]
          ^        ^     ^                 ^        ^
          3        6     8                 14       17   (position 3 shared with abc123)
Check "def456" — is it in the set? (we NEVER inserted it)
hash1 → position 3 → 1 ✅ (set by abc123)
hash2 → position 6 → 1 ✅ (set by xyz789)
hash3 → position 14 → 1 ✅ (set by abc123)
→ "Probably exists" ✗ FALSE POSITIVE!
(all bits happen to be set by OTHER elements)
Check "nope00" — is it in the set?
hash1 → position 2 → 0 ❌
→ "Definitely not" ✓ (correct — one zero bit = guaranteed absent)
The “def456” check is the key insight. It was never inserted, but all three of its hash positions were coincidentally set by other elements. That’s a false positive — the Bloom filter is wrong, but it’s wrong in a predictable, controllable way.
Two possible answers:
- “Definitely not in the set” — 100% correct, every time
- “Probably in the set” — correct most of the time, occasionally wrong (false positive)
It never gives false negatives. If it says no, you can trust it completely.
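The behavior walked through above can be sketched as a minimal implementation. The hash here is FNV-1a mixed with a per-function seed — an illustrative choice for the example, not the scheme any particular library uses:

```typescript
// Minimal Bloom filter: an m-slot array plus k seeded hash functions.
class SimpleBloom {
  private bits: Uint8Array;

  constructor(private m: number, private k: number) {
    this.bits = new Uint8Array(m); // one byte per bit, for clarity over compactness
  }

  // FNV-1a, perturbed by a seed so one function yields k independent-ish hashes.
  private hash(item: string, seed: number): number {
    let h = 2166136261 ^ seed; // FNV offset basis
    for (let i = 0; i < item.length; i++) {
      h ^= item.charCodeAt(i);
      h = Math.imul(h, 16777619); // FNV prime
    }
    return (h >>> 0) % this.m;
  }

  add(item: string): void {
    for (let s = 0; s < this.k; s++) this.bits[this.hash(item, s)] = 1;
  }

  has(item: string): boolean {
    // One zero bit → definitely absent; all ones → probably present.
    for (let s = 0; s < this.k; s++) {
      if (this.bits[this.hash(item, s)] === 0) return false;
    }
    return true;
  }
}
```

With `new SimpleBloom(20, 3)` this mirrors the 20-bit, 3-hash walkthrough in spirit (the actual positions depend on the hash function, so they won't match the diagram exactly). The "no false negatives" guarantee falls out of the structure: `add` and `has` compute the same positions, so an inserted item can never hit a zero bit.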
Why Would You Want an Inaccurate Data Structure?#
Because it’s absurdly small and fast.
| | Hash Set | Database Query | Bloom Filter |
|---|---|---|---|
| Memory | O(n) — stores all values | 0 (on disk) | Fixed-size bit array |
| Lookup speed | O(1) | ~1ms (network + index) | O(1) — a few hash computations |
| Accuracy | 100% | 100% | ~99%+ (configurable) |
| Can delete? | Yes | Yes | No |
| Network call? | No | Yes | No |
A Bloom filter storing 1 million items with a 0.1% false positive rate uses about 1.8 MB of memory. A HashSet storing the same 1 million strings uses 50-100 MB. The database stores them on disk but needs a network round trip for every check.
The Bloom filter trades a tiny amount of accuracy for a massive reduction in memory and the elimination of network calls. For many use cases, that trade-off is more than worth it.
The URL Shortener, Optimized#
Here’s how this applies to the problem from the intro:
import { BloomFilter } from "bloom-filters";
const bloom = BloomFilter.create(1_000_000, 0.001); // 1M capacity, 0.1% false positive
const MAX_RETRIES = 10; // bound the number of attempts (value illustrative)

async function generateUniqueCode(): Promise<string> {
  for (let i = 0; i < MAX_RETRIES; i++) {
    const code = randomBase62(6);
    if (bloom.has(code)) continue; // in-memory check (~1μs)
    // Bloom says "definitely not exists" — verify with DB as safety net
    const exists = await db.url.findUnique({ where: { shortCode: code } });
    if (!exists) {
      bloom.add(code);
      return code;
    }
  }
  throw new Error("Failed to generate unique code");
}
The Bloom filter handles the common case: the code doesn’t exist (which is true for the vast majority of attempts). Only when the Bloom filter says “probably exists” — either correctly or as a false positive — do we fall back to the database. At 100 million URLs, this eliminates 99%+ of database queries for code generation.
The database is still the source of truth. The Bloom filter is a fast, cheap pre-filter that prevents unnecessary round trips. It’s the same principle as caching — avoid the expensive operation when you already know the answer.
The Tuning Knob#
You control the false positive rate. Lower rate = more memory:
| Items | False Positive Rate | Memory |
|---|---|---|
| 1M | 1% | ~1.2 MB |
| 1M | 0.1% | ~1.8 MB |
| 1M | 0.01% | ~2.4 MB |
| 10M | 0.1% | ~18 MB |
| 100M | 0.1% | ~180 MB |
Even at 100 million items with 0.1% false positives, you’re using 180 MB — less than a single Chrome tab. A HashSet storing 100 million strings would need several gigabytes.
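The numbers in the table follow from the standard Bloom filter sizing formulas: for n items and target false-positive rate p, the optimal bit count is m = −n·ln(p) / (ln 2)², and the optimal number of hash functions is k = (m/n)·ln 2. A quick sketch to reproduce them:

```typescript
// Standard Bloom filter sizing:
//   bits needed   m = -n·ln(p) / (ln 2)²
//   hash functions k = (m/n)·ln 2
function bloomSize(
  n: number, // expected number of items
  p: number  // target false positive rate, e.g. 0.001
): { bits: number; hashes: number; megabytes: number } {
  const bits = Math.ceil((-n * Math.log(p)) / (Math.LN2 * Math.LN2));
  const hashes = Math.round((bits / n) * Math.LN2);
  return { bits, hashes, megabytes: bits / 8 / 1e6 };
}
```

`bloomSize(1_000_000, 0.001)` works out to roughly 1.8 MB and about 10 hash functions, matching the table's second row; scaling n by 100 scales the memory by 100, which is why 100M items at the same rate lands near 180 MB.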
Where Bloom Filters Show Up in Production#
This isn’t a theoretical data structure. It’s used at massive scale:
Google Chrome's Safe Browsing feature historically checked every URL you visit against a local Bloom filter of known malicious sites (later versions moved to a hash-prefix set, but the principle is the same). If the local filter says "not malicious" (which is almost always), Chrome skips the network call to the Safe Browsing API entirely. Only on a "probably malicious" hit does it make the API call to confirm.
Cassandra and HBase use Bloom filters to check if a row exists in an SSTable before reading from disk. Disk reads are expensive. A Bloom filter that says “definitely not in this file” saves a disk seek.
Medium uses Bloom filters to track which articles a user has already seen, so recommendations don’t show duplicates. Storing “user X has seen articles [1, 2, 3, …, 10000]” for millions of users as full sets would be massive. Bloom filters compress this to a few KB per user.
Bitcoin SPV (Simplified Payment Verification) nodes use Bloom filters to request only relevant transactions from full nodes without downloading the entire blockchain.
When to Use It#
| Scenario | Bloom Filter? | Why |
|---|---|---|
| Pre-filter before expensive lookup (DB, disk, API) | Yes | Eliminates most unnecessary lookups |
| Web crawler (“have I visited this URL?”) | Yes | Millions of URLs, memory-efficient |
| Spell checker (“is this a real word?”) | Yes | Dictionary is static, fast lookup |
| Cache miss prevention | Yes | Avoid DB query when key isn’t cached |
| Duplicate detection in streams | Yes | Can’t store all seen items in memory |
When NOT to Use It#
- You need to delete elements — standard Bloom filters don’t support deletion (flipping a bit to 0 might affect other elements). Counting Bloom filters exist but add complexity.
- False positives are unacceptable — financial transactions, access control, anything where “probably yes” isn’t good enough.
- The dataset fits in a regular Set — if you have 10,000 items, just use a HashSet. Bloom filters shine at millions+.
- You need to retrieve the actual values — Bloom filters only answer “exists?” They don’t store or return the elements themselves.
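As an aside on the first point: the counting variant replaces each bit with a small counter, so deletion becomes a decrement rather than a destructive bit-clear. A sketch, using the same illustrative FNV-style hash as before (the 8-bit saturating counters are also an assumption, not any specific library's design):

```typescript
// Counting Bloom filter sketch: counters instead of bits, enabling deletion.
class CountingBloom {
  private counts: Uint8Array;

  constructor(private m: number, private k: number) {
    this.counts = new Uint8Array(m); // 8-bit counters
  }

  private hash(item: string, seed: number): number {
    let h = 2166136261 ^ seed; // FNV-1a with a per-function seed
    for (let i = 0; i < item.length; i++) {
      h ^= item.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.m;
  }

  add(item: string): void {
    for (let s = 0; s < this.k; s++) {
      const i = this.hash(item, s);
      if (this.counts[i] < 255) this.counts[i]++; // saturate to avoid overflow
    }
  }

  remove(item: string): void {
    // Only safe if the item was actually added — decrementing for a
    // never-added item can silently corrupt other entries.
    for (let s = 0; s < this.k; s++) {
      const i = this.hash(item, s);
      if (this.counts[i] > 0) this.counts[i]--;
    }
  }

  has(item: string): boolean {
    for (let s = 0; s < this.k; s++) {
      if (this.counts[this.hash(item, s)] === 0) return false;
    }
    return true;
  }
}
```

The complexity the bullet point mentions is visible here: counters cost 4–8× the memory of plain bits, and `remove` is only correct for items that were genuinely added.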
The Pattern#
Bloom filters fit into the same optimization philosophy as caching and indexing: avoid the expensive operation when a cheaper check can give you the answer first.
Request comes in
→ Check Bloom filter (nanoseconds, in-memory)
→ "Definitely not" → skip the expensive lookup
→ "Probably yes" → do the actual lookup to confirm
It’s not about replacing your database or your cache. It’s about putting a cheap, fast guard in front of them so they only do work when it matters. Another tool in the “optimize before you scale” toolkit.
