Key Value Store

What is a key-value store?
The simplest answer is a hash table.
Drawback:
What if the data is too big to fit in memory on a single machine?
Answer:
A distributed key-value storage system.
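Before going distributed, the single-machine version really is just a hash table. A minimal sketch in Python (the class name and string keys/values are illustrative assumptions):

```python
# A minimal in-memory key-value store backed by a hash table (Python dict).
class KeyValueStore:
    def __init__(self):
        self._table = {}

    def put(self, key, value):
        self._table[key] = value

    def get(self, key):
        # Returns None if the key is absent.
        return self._table.get(key)

    def delete(self, key):
        self._table.pop(key, None)

store = KeyValueStore()
store.put("user:1", "alice")
print(store.get("user:1"))  # alice
```

Everything that follows is about what breaks when this table no longer fits on one machine.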

Goals

Design a distributed key-value storage system.
It takes in a key and returns the associated value.
Make sure your system is:
1) Highly available, and
2) Consistent

Questions to ask the interviewer:
  1. What type of data am I storing here?
    1. What data type is the key and what data type is the value?
  2. How much data am I storing?
  3. How frequently will the data be accessed?
  4. Is the traffic evenly distributed?
  5. Is this a read-heavy or a write-heavy system?


Scope

How many people will be using it?
1 million

Capacity Estimations

How many requests per second at the worst time?
1 request per second.

High Level Design



Detailed Component Design

When you say distributed key value store, it means the data is distributed.

Question: How do you store distributed data?
Answer: Sharding

Sharding

Rule: We need even distribution of data.

  1. Range-Based Partitioning
    Store all the keys starting with letter A on one shard, and so on.
    Cons:
    Produces uneven distribution; one shard can grow much bigger than another.
    What if one shard becomes hot?
  2. ID/Hash-Key-Based Partitioning
    Run the key through some sort of hash function; this gets you an ID.
    Based on this ID, we decide which shard stores the value.
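Hash-key-based partitioning can be sketched in a few lines; the shard count and the choice of MD5 here are illustrative assumptions, not part of the original design:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical fixed shard count

def shard_for(key: str) -> int:
    # Hash the key to a large integer, then map it onto a shard ID.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

The modulo step is what makes this scheme awkward to rescale: changing NUM_SHARDS remaps almost every key, which is the problem consistent hashing (below, under Cluster) addresses.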


Scale the system

System availability


The master holds the latest data and the slaves copy from the master.
If the master goes down, one of the slaves becomes the new master.


Consistency

Reads can be done from the master.
Writes can be done to one of the slaves.
Then, after some time period, we promote this most-updated slave to be the new master.
The original master then becomes a slave.
The master-slave architecture should be such that the slaves continuously copy off the master, so when we change the master, the slaves start copying off the new master, keeping them up to date as well.
This approach trades latency for access to the most consistent data.


Cache

We can put the most accessed data in LRU cache on the server.
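A per-server LRU cache can be sketched on top of Python's OrderedDict; the capacity and None-on-miss behavior are illustrative assumptions:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```

For example, with capacity 2: after putting a and b, reading a, then putting c, the cache evicts b because it is the least recently used entry.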

Cluster

As the data and the customer base grow, we'd eventually need to split the architecture into multiple clusters.
We can use consistent hashing to add or remove servers.
Since we have already decided to shard based on a hash ID, this approach to load balancing makes sense when we add more servers.
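A minimal consistent-hash ring can be sketched as follows (no virtual nodes, for brevity; MD5 and the class/server names are illustrative assumptions). Each key belongs to the first server clockwise from its hash point, so adding or removing one server only remaps the keys adjacent to it:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers):
        # Each server is placed on the ring at its hash point.
        self._ring = sorted((self._hash(s), s) for s in servers)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, key: str) -> str:
        # Walk clockwise from the key's hash point to the next server.
        h = self._hash(key)
        points = [point for point, _ in self._ring]
        i = bisect.bisect(points, h) % len(self._ring)  # wrap around
        return self._ring[i][1]

    def add_server(self, server: str):
        bisect.insort(self._ring, (self._hash(server), server))

    def remove_server(self, server: str):
        self._ring.remove((self._hash(server), server))
```

Removing a server that does not own a key leaves that key's assignment unchanged, which is exactly the property that plain `hash(key) % num_servers` lacks.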