Pastebin System Design


Understand the Problem and Define the Design Scope (Functional Requirements)

Use Cases

  • Users can share code, text, or configuration over the internet quickly.

  • Sharing text between individuals and groups is easily achieved via the URLs generated by this system.

Come up with the most important features

  • The user should be able to upload only text.

  • Users should be able to set expiration dates for their data.

  • Users can share their data publicly or restrict it to a chosen set of users.

  • When a user shares data, the system should return a unique URL that can be shared.

  • Users should be able to create custom URLs.

  • Optional: analytics.

Capacity Planning and Estimate (Non-Functional Requirement)

Traffic

  • This is a read-heavy system. Assume a read:write ratio of 100:1 and 500 million total requests per month.

  • That gives roughly 5M writes and 495M reads per month, i.e. about 200 reads and 2 writes per second. Writes per year ≈ 60M.

  • Total requests per year = 500M × 12 = 6 billion.
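These numbers are easy to sanity-check in code. A minimal sketch, assuming 500M requests/month, a 100:1 read:write ratio and a 30-day month:

```java
// Back-of-envelope traffic math for the assumptions above.
public class TrafficEstimate {
    static final long TOTAL_PER_MONTH = 500_000_000L;
    static final long SECONDS_PER_MONTH = 30L * 24 * 3600; // ~2.6M seconds

    // 100 reads for every write => writes are 1/101 of all requests.
    static long writesPerMonth() { return TOTAL_PER_MONTH / 101; }            // ~5M
    static long readsPerMonth()  { return TOTAL_PER_MONTH - writesPerMonth(); }
    static long readsPerSecond() { return readsPerMonth() / SECONDS_PER_MONTH; } // ~190

    public static void main(String[] args) {
        System.out.println("writes/month ~ " + writesPerMonth());
        System.out.println("reads/sec ~ " + readsPerSecond());
        System.out.println("requests/year = " + TOTAL_PER_MONTH * 12); // 6 billion
    }
}
```

The exact per-second figures depend on how bursty traffic is; these are averages, so provision for peaks well above them.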

Storage

  • Storage decisions are based on data type, amount of data and access pattern.

  • We assume the maximum allowed paste size is 10 MB (roughly an 800-page book).

  • Most users never need that much; the average paste is around 70 KB (~30 pages).

  • To be safe, we can assume 100 KB of storage per write request.

  • Total storage per year = 60M writes × 100 KB ≈ 6 TB; over 10 years ≈ 60 TB.

Memory

  • By the Pareto principle, 80% of requests hit 20% of the data, so we can size the cache at roughly 20% of the hot content.

  • Cache memory ≈ 0.2 × 5M writes/month × 100 KB ≈ 100 GB.
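The storage and cache arithmetic, sketched with the same assumptions (~5M writes/month, 100 KB per paste, 20% hot set):

```java
// Storage and cache sizing sketch; all inputs are the assumptions above.
public class StorageEstimate {
    static final long WRITES_PER_MONTH = 5_000_000L;
    static final long BYTES_PER_PASTE  = 100_000L; // 100 KB

    static long bytesPerYear()  { return WRITES_PER_MONTH * 12 * BYTES_PER_PASTE; } // 6 TB
    static long bytesTenYears() { return bytesPerYear() * 10; }                     // 60 TB
    static long cacheBytes()    { return WRITES_PER_MONTH * BYTES_PER_PASTE / 5; }  // 20% ~ 100 GB

    public static void main(String[] args) {
        System.out.println("per year ~ " + bytesPerYear() / 1_000_000_000_000L + " TB");
        System.out.println("10 years ~ " + bytesTenYears() / 1_000_000_000_000L + " TB");
        System.out.println("cache ~ " + cacheBytes() / 1_000_000_000L + " GB");
    }
}
```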

Latency Expectation

  • Should the system be highly available or strongly consistent? Is eventual consistency acceptable?

  • High availability

  • Low latency

High-Level Design

Consistency

  • Every read should return the most recent version of the data.

API EndPoints

  • Create API

      @PostMapping("/create/newContent")
      public String createPasteBin(String content, String apiAccessToken,
                                   String customUrl, String expirationDate,
                                   Boolean status, List<String> accessList) {
          return pasteBinService.createPasteBin(content, apiAccessToken,
                                                customUrl, expirationDate);
      }


    status, accessList, customUrl and expirationDate are optional. This API generates the short URL used to share across the internet and responds with appropriate HTTP status codes (20x, 40x, 50x).

Get API

@GetMapping("/{url}")
public String getContent(@PathVariable String url){
    return pasteBinService.getContent(url);
}// returns the content

  • List of users who can access a given URL (if needed):

        @GetMapping("/accessible/{shorturl}")
        public List<String> getAccessibleUsers(@PathVariable("shorturl") String url){
            return pasteBinService.getAccessibleUsers(url);
        }// returns the list of users who can access the given URL

Database Schema and DB Selection

  • User table

  • user_id: user id, primary key

  • short_url_id: the short URLs created by the user (FK)

  • name: name of the user

  • email: email of the user

  • DOB: date of birth

  • last_login: last login info

  • pastebinLink Table

  • short_url_id: unique URL generated by the key generation service; primary key

  • custom_url: optional, if the user enters a custom URL

  • user_id: unique user id sequentially generated by the DB (with some configs, continue reading)

  • content: the actual data/text saved and shared

  • creationTime: content creation time

  • expirationTime: (optional) the time after which the data should be deleted from the DB

  • accessible_user_list: the list of users that can access the content
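The schema can be sketched as plain Java records; the field names mirror the tables above, while the types are illustrative assumptions:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.util.List;

// Sketch of the two tables as in-memory records (types are assumptions).
record User(long userId, String name, String email, LocalDate dob, Instant lastLogin) {}

record PasteBinLink(String shortUrlId, String customUrl, long userId, String content,
                    Instant creationTime, Instant expirationTime,
                    List<Long> accessibleUserList) {}
```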

SQL

  • The data to be stored is simple, so SQL works. Since reads and writes are heavy, we can index short_url_id in the pastebinLink table and user_id in the User table.

  • Because the R:W load is heavy and the number of rows is huge, we can shard the DB. To map each request to the correct shard we can use range-based sharding or consistent hashing; note that the distribution of data across shards becomes uneven if the data itself is not uniformly distributed.

  • Range-based sharding is limited: it may require redefining and redistributing ranges, causing operational overhead and potential downtime during the migration process.

  • So we can go for consistent hashing, which adjusts ranges dynamically based on the number of nodes and can scale to any range.

  • Note: we can add a constraint that if the content is larger than 100 KB, we store a link that points to object/blob storage like Amazon S3 instead.
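The consistent-hashing idea can be sketched with a sorted ring; this minimal version uses a placeholder hash and no virtual nodes (real deployments add both a stronger hash and virtual nodes for even distribution):

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal consistent-hash ring: maps a short URL to one of N shards.
public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addShard(String shard) {
        ring.put(hash(shard), shard);
    }

    public String shardFor(String shortUrl) {
        // First shard at or after the key's position; wrap around if past the end.
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(shortUrl));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private int hash(String s) {
        return Math.abs(s.hashCode()); // placeholder hash, fine for a sketch
    }
}
```

The payoff is that adding or removing a shard only remaps the keys adjacent to it on the ring, instead of reshuffling everything as range-based schemes can.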

NOSQL

  • Scaling, consistency, redundancy, availability and dynamic scaling across multiple DB instances are handled by managed NoSQL services (e.g. Amazon DynamoDB, with S3 for blob storage), reducing setup effort so devs can focus on business requirements.

    1. Eventual consistency is acceptable 2. Low latency 3. High performance 4. Reduced operational load 5. Amazon provides a good ecosystem for analytics, monitoring and logging.

  • Thus NoSQL makes our system almost invincible.


  • Now, should we use SQL or NoSQL? In this case you can use both: if the content is larger than 100 KB, show the first part to the user immediately and, within the next 100 ms or less, serve the complete data from blob storage like S3.

  • Or we can shift completely to NoSQL. These are two options you can discuss with the interviewer; let me know which you think is better.

  • We can also put writes behind a queue: small content is written instantly, while for larger content the system carries on and a cron job independently writes it to the data store.

  • i.e. larger content is processed asynchronously, avoiding delays in user interactions.

  • The queue provides a buffer, handling peak loads and ensuring data integrity (fault tolerance).

Algorithm

There are two possible ways to implement URL encoding for this design problem:

  • Base62

  • Key Generation Service

A base is the number of digits or characters available to represent a number.

  • Base 10 uses digits [0–9]

  • Base 62 uses [0–9][a-z][A-Z]

  • How many characters shall we keep in our short URL?

  • We don't need 8 characters: even 7 characters give 62^7 ≈ 3.5 trillion combinations, which far exceeds our needs (assume 100M URLs/month × 12 months × 200 years = 240 billion unique URLs over 200 years).
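Base62 encoding can be sketched as repeated division: the encoder below turns a numeric id (e.g. a KGS-issued counter value) into a short string.

```java
// Base62 encoding of a numeric key into a short-URL string.
public class Base62 {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    public static String encode(long id) {
        if (id == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62))); // least-significant digit first
            id /= 62;
        }
        return sb.reverse().toString();
    }
}
```

For example, `Base62.encode(125)` returns `"21"` (125 = 2 × 62 + 1), and any id below 62^7 fits in at most 7 characters.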

  • Tiny URL design focuses on URL generation; here we will focus on a separate key generation service and a recap of base 62. (Rate limiting + analytics will be introduced later.)

  • Initial diagram

  • The problem with this approach is that read, write and key generation are handled by a single server, which takes a heavy toll on the server and creates a single point of failure.

  • We can add multiple servers and a load balancer to eliminate the SPOF and scale.

  • We can adopt a microservice architecture to direct requests to the required servers; say a 5:2 ratio of read to write servers to handle the traffic, plus an independent key generation service. Add more servers to eliminate SPOFs.

  • An API gateway directs requests to the respective servers and reduces load; a load balancer (optional but good to have) evenly distributes load across the servers.

  • What about the KGS?

  • The KGS creates unique keys and stores them in the database in two tables: an unused range and a used range. Whenever a range is claimed by a server, that range is moved into the used table.

  • But this increases the total number of reads on the database. Assuming each key is 10 bytes and our URL is 7 characters long, there are 62^7 ≈ 3.5 trillion possible keys.

  • We don't need to pre-generate all of them: a working pool of ~10 billion keys × 10 bytes ≈ 100 GB of storage.

  • This can be kept in independent fast storage like a cache, or in a single database rather than a sharded one. We will continue with Redis for fast access.

How does the KGS work?

  • We have multiple instances of the service running, so we need to track which instance is using which key range. To do this we can introduce ZooKeeper, which coordinates key generation across the instances.

  • Keys are stored in the database; when a write server requests keys from Redis via the KGS, it receives a range of keys, marks it as used, and caches that range locally so frequent calls between the write servers and the KGS are not needed. The range counters are coordinated by ZooKeeper, and if an instance holding a range goes down, that range is simply discarded; since we have plenty of possible keys, we can start with a new range.
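The range-leasing idea can be sketched as follows. Here an in-process counter stands in for the ZooKeeper/Redis coordinator, and the block size is an assumption:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of KGS range leasing: each write server leases a block of
// sequential ids from the coordinator and hands them out locally
// until the block is exhausted, avoiding a coordinator call per key.
public class KeyRange {
    private static final long BLOCK_SIZE = 1_000_000L;
    // Stand-in for the external coordinator (ZooKeeper/Redis in the text).
    private static final AtomicLong nextBlockStart = new AtomicLong(0);

    private long current;
    private long end;

    public synchronized long nextKey() {
        if (current == end) { // block exhausted: lease a new one
            current = nextBlockStart.getAndAdd(BLOCK_SIZE);
            end = current + BLOCK_SIZE;
        }
        return current++;
    }
}
```

Two servers leasing from the same coordinator always receive disjoint blocks, so the keys they issue never collide; if a server dies mid-block, the unused remainder of its block is simply abandoned.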

Now let's introduce Blob Storage

Caching

  • R:W ≈ 200:2 per second; a read-heavy application.

  • We can introduce caching at: 1. application server cache, 2. global cache across app servers, 3. CDN cache, 4. browser cache.

  • For our system, 1. browser-level cache, 2. a global cache for the application servers and 3. a global cache at the DB level will do. A CDN is not required, since the cached data is at most ~100 GB based on our calculation.

  • For this read-heavy application we can use a write-through cache, because the cache will then always have the latest data; the disadvantage of a write-through cache is that infrequently-requested data is also written to the cache.

  • To mitigate this, an LRU cache eviction policy can be used.

  • Finally, writing to blob storage can be done asynchronously.
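An LRU eviction policy can be sketched with Java's `LinkedHashMap` in access order: the least-recently-used entry is dropped once the cache exceeds its capacity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: LinkedHashMap with accessOrder = true keeps entries
// ordered by last access; removeEldestEntry evicts past the capacity.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

With capacity 2, putting `a`, `b`, touching `a`, then putting `c` evicts `b`, since `b` is the least recently used.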

We can further improve the performance, monitoring and analytics of this system by introducing Kafka and the Elastic Stack.

  • Implementing Kafka as a queue to buffer writes and decouple them from the caching layer could improve cache performance.

  • The Elastic Stack (Elasticsearch, Logstash, Kibana) for metrics collection, visualization and analytics could give you insights to further optimize and improve the system over time.

However, for our system a simple write-through cache may still provide adequate performance for our workload. Kafka and the Elastic Stack would bring operational benefits like monitoring, alerting and troubleshooting rather than major performance gains, so discuss with the interviewer whether they are needed.

Guess what, this is not the end: there is always space for improvement!! We'll meet in the next chapter, till then CIAO.