This username is taken ❌ How is this check so fast?
I am a developer with learnings in many different languages, frameworks and technologies.
Just so you know, here is some math: Active Users on Instagram: 2000000000
❌ Solution 1:
Lookup in the database by username We can have a column username with 2 billion rows and an index on the username column, and we can run an exists query which will take around 1ms to 20ms (if B Tree Index is used, then ~30 to ~35 comparisons + memory lookup time [cache hit] + disk seek time [cache miss]).
Ok, so we are getting a good response time here. 😄 But, will you be able to manage the database with this many rows? 🤔
You need to manage read speed, write speed, consistency, fault tolerance and database replication, 😲
If you shard your database in multiple nodes, then you need to check in each node, which will increase the time from 1 to 20 ms to 1 to 20 ms * (no. of shards) 😲
So clearly, this is not a cost-effective and efficient solution. ❌
❌ Solution 2:
Lookup in the in-memory cache by username as key A Redis can lookup up to 100000 keys per second, so the raw time we take for lookup for one username would be around 1 to 10 microseconds.
Oh, this is much less than what the database took. 😄
This is raw time; you need to add network delay into consideration too, because the Redis server will be on a different VM. 😲
Each username would be ~10 to ~16 characters long, so we need ~2 bytes to store the key name, ~1 byte to store the value (any minimum-sized value works), but then there are hidden values too, for example redis key overhead which is ~50 bytes and then metadata per key and value which will be ~40 bytes so our total comes out near ~100 bytes per key, but we have 2 billion usernames so the total would be 100 bytes * 2 billion usernames = 200 GB 😲
So clearly, this is also not a cost-effective and efficient solution. ❌
✅ Solution 3: Bloom Filters
We initialize 10000s of bits that can be marked 0 or 1 inside the memory with 0
We make 3 to 7 hash functions that map our username to one of these bits
We feed the username to these hash functions and check if the bit is already 1 or 0
If any of the mapped bits is 0, then it is guaranteed that the username does not exist. If all mapped bits are marked 1, then we can say this username might exist. 😲
Depending on the policy, we can tell the user that the username is taken, or we can check the database to further verify it.
✅ A Bloom filter is an in-memory probabilistic data structure.
✅ The backend architecture of the bloom filter is similar to Redis, but instead of Redis taking 200 GB, here it will take ~4 to ~6 GB total memory, as there is less overhead and just bit structures, and the speed is ~50ns.
⚠️ Hash functions' design should evenly distribute over different bits for different usernames. ⚠️ Periodically resetting the bloom filter is required.
Does it mean you shall always use bloom filters? No, always follow the rule that “over-optimization is the root cause of all evil” and then scale as the user base or data increases