Skip to content

Latest commit

 

History

History
 
 

README.md

Design Twitter

Goals

  1. User should be able to:
    1. Post a new tweet
    2. Follow other users.
    3. Like a tweet
  2. Create a Time Line for:
    1. Home Time Line/ News Feed: That consists of top tweets from the users our current user follows.
    2. User Time Line: That consist of top recent tweets our user has made
    3. Search Time Line: That consists of tweets based on a search term
  3. The system needs to be:
    1. Highly available
    2. Can have some latency
    3. Can take a hit on consistency


Scope

Let's first desgin a system for 1 user for now.

Capacity Estimations

Let's have some assumptions
100 Million total Users
200,000 active users
100,000 tweets per day
On average 1 user follows 200 other users

Liked tweets
1 user likes 5 tweets per day
200,000 users will like 10,00,000 tweets per day.

Tweet Views
Let's say the timeline of any user has at max 10 tweets.
Our current user visits their timeline 2ce a day and also visits 5 other people's timelines
So total timelines the current user sees is 7
And as stated above that each timeline consists of about 10 tweets
So total tweets = 70
And we have about 200,000 active users
So total tweet views is 200,000*70 = 140,000,000

Storage Estimations

What does our tweet consist of ?
1) Text (max 140 chars)
2) Image/video
3) other metadata like timestamp, tweet id etc.
Pure Text tweet without Image/Video
Let's say we need about 300 bytes to store the text
And about 50 bytes to store metadata.
So 1 pure text tweet size is: 300 bytes + 50 bytes = 350Bytes
Text+Image tweet
Let's say we need about 300 bytes to store the text
And about 50 bytes to store metadata.
And 200KB to store the image
So 1 Text+Image tweet size is: 300 bytes + 50 bytes + 200KB = 200.35KB
Text+Video tweet
Let's say we need about 300 bytes to store the text
And about 50 bytes to store metadata.
And 2MB to store the video
So 1 Text+Image tweet size is: 300 bytes + 50 bytes + 2MB = 2.00035MB
Total Storage Per Day
Let's say we have 20% tweets that contain Images and 10% tweets contain Videos
So out of 100,000 we have 20,000 tweets with images and 10,000 tweets with videos and rest 70,000 pure text tweets.
Total storage required would be:
70,000*350Bytes+20,000*200.35KB+10,000*2.00035MB = 24.03500 GB

Bandwidth Estimations

So if 24GB per day then 24GB/24Hours = 1GB per hour, 1GB/60Mins*60Seconds per second

High Level Design

Post Tweet



Get News Feed



General system analysis

We saw from Capacity Estimations we get 100,000 tweet writes per day but 140,000,000 tweet views per day.
So our system is READ HEAVY


Code

Classes

Tweet
UserId (who created the tweet)
TweetID
Text
Image
Video
TimeStamp
FavouritedBy = [List of UserID] (users that have marked this tweet as their favourite)

User
ID
UserName
Follows = [List of UserIDs]
FavouriteTweets = [List of TweetIDs]


API

CreateTweet(UserId, Text, [Optional]Image, [Optional]Video)
Creates a tweet object
Returns success if a tweet object is created successfully
PostTweet(TweetObject)
Returns success if the tweet is posted successfully


Detailed Component Design

Post Tweet

Generally the user has 1:Many relationship, when it comes to tweets and followers.
The above tables explain this.

Redis Database: Redis is in memory database, it has key value pairs kinda like hash table. It offers data replication.
So in our above diagram the data is replicated 3 times in the Redis cluster
It is very easy to insert huge amounts of data easily

Bottleneck of the above approch
Every time the user tweets the data is getting replicated 3 times, so what about hot users like celebs ?
To solve this problem, we can merge celeb tweets at load time of timeline.
So every time user tweets, all their follower's time line gets pre computed.
When the follower accesses their time line, during the access time, if the follower is following celeb that time the tweet is loaded.



Scale the design

Sharding

Sharding based on User ID
Store all data pertaining to one user Id on same shard.
What if a user becomed hot ?
What if one user has less data and other has more ?
Sharding based on Tweet ID
A tweet is mapped to some random shard by a hashing server.
To search for tweets we'd have to query all the servers
This is time consuming
Sharding based on Tweet Creation time
We store all the latest tweets on current shard.
But with this approch the traffic will hit mainly the current shard
Sharding based on Tweet Creation time ans Tweet ID
We shard based on Tweet ID, but the id of the tweet is epoch time + random increasing number.
So we still need to get all tweets from all servers but we wont have to sort them, as they'd be sorted coz of their ID


Cache

What would you wanna cache ?
1) Hot tweets that are gonna be viewed by a lot of people
2) Hot users/ their timelines that are gonna be followed by a lot of people
A tweet is not hot after some period of time. So we can use LRU cache for this approch

Replication and Fault Tolerance

We can use a master slave architecture here for database.
All reads go to one server and writes go to another server
And then the data from the write server is replicated