The simplest answer is a hash table.
Drawbacks:
What if the data is too big to fit in memory on a single machine?
Answer:
Distributed key value storage system.
Design a distributed key-value storage system.
It takes in a key and returns the corresponding value.
Make sure your system is:
1) Highly available, and
2) Consistent
Questions to ask the interviewer:
- What type of data am I storing here?
- What data type is the key and what data type is the value?
- How much data am I storing?
- How frequently will the data be accessed?
- Is the traffic evenly distributed?
- Is this a read-heavy or a write-heavy system?
How many people will be using it?
1 million.
How many requests per second at peak?
1 request per second.
When we say distributed key-value store, it means the data is distributed across multiple machines.
Question: How do we distribute the data?
Answer: Sharding.
Rule: we need an even distribution of data across shards.
- Range-based partitioning
Store all keys starting with the letter A on one shard, keys starting with B on another, and so on.
Cons:
Produces uneven distribution: one shard can grow much bigger than another.
What if one shard becomes hot?
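Range-based routing can be sketched as follows (the shard names and letter ranges here are hypothetical):

```python
# Minimal sketch of range-based partitioning.
# Popular ranges can overload a single shard -- the "hot shard" problem above.
SHARD_RANGES = [
    ("a", "h", "shard-0"),  # keys starting with a..h
    ("i", "p", "shard-1"),  # keys starting with i..p
    ("q", "z", "shard-2"),  # keys starting with q..z
]

def route(key: str) -> str:
    first = key[0].lower()
    for low, high, shard in SHARD_RANGES:
        if low <= first <= high:
            return shard
    return "shard-0"  # fallback for keys that don't start with a letter
```

Range partitioning makes range scans easy (adjacent keys live together), which is exactly why it skews under uneven key distributions.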
- ID/hash-key-based partitioning
Run the key through a hash function, which gives you an ID.
Based on this ID we decide which shard stores the value.
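A minimal sketch of hash-based shard selection (the shard count and key format are assumed for illustration):

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size

def shard_for(key: str) -> int:
    # Hashing spreads keys evenly across shards regardless of how
    # the raw keys themselves are distributed.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

The trade-off versus range partitioning: distribution is even, but range scans are lost, and the simple modulo means adding a shard remaps most keys (consistent hashing, below, addresses that).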
The master holds the latest data, and the slaves copy from the master.
If the master goes down, one of the slaves becomes the new master.
Reads can be served from the master (for the most consistent data).
Writes can be directed to one of the slaves.
Then, after some time period, we promote this most up-to-date slave to be the new master, and the original master becomes a slave.
The master-slave architecture should be set up so that the slaves continuously copy from the current master; when the master changes, the slaves start copying from the new master, which keeps them up to date as well.
This approach compromises on latency for the most consistent data: reads that need the freshest value must go to the master.
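The replication and failover steps above can be sketched as a toy in-memory model (real systems replicate over the network via a replication log; the class and method names here are hypothetical):

```python
# Toy sketch of master-slave replication with failover.
class Node:
    def __init__(self, name: str):
        self.name = name
        self.data = {}

class Cluster:
    def __init__(self, master: Node, slaves: list):
        self.master = master
        self.slaves = slaves

    def write(self, key, value):
        self.master.data[key] = value
        for slave in self.slaves:        # slaves copy off the master
            slave.data[key] = value

    def read(self, key):
        return self.master.data.get(key)  # master read => most consistent

    def failover(self):
        # Promote the most up-to-date slave; the old master rejoins as a slave.
        new_master = max(self.slaves, key=lambda s: len(s.data))
        self.slaves.remove(new_master)
        self.slaves.append(self.master)
        self.master = new_master
```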
We can put the most frequently accessed data in an LRU cache on the server.
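A minimal LRU cache sketch built on an ordered dictionary (the capacity is an assumed tuning parameter):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()  # insertion order = recency order

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used
```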
As the data grows and the customers grow, we'd eventually need to split the architecture into multiple clusters.
We can use consistent hashing to add or remove servers.
Since we have decided to shard based on a hash ID, this approach to load balancing makes sense when we decide to add more servers.
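A minimal consistent-hash ring sketch (the virtual-node count is an assumed parameter). The key property: removing a server only remaps the keys that lived on it.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes: int = 100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str):
        # Each server gets many virtual points on the ring for even spread.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key: str) -> str:
        # A key belongs to the first ring point at or after its hash,
        # wrapping around at the end of the ring.
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]
```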