Understand the problem and define the design scope (Functional Requirements)
Use cases
Custom short URLs are easy to share, less error-prone to type, and more readable.
Some applications allow only a limited number of characters per message (e.g., tweets); in such cases short URLs come in handy.
This system will be used by mobile and web clients.
Number of users: we can assume 100 million unique URLs generated per month.
Come up with the most important features
The service should be able to create a short URL from a long URL.
A short URL should redirect to the long URL.
Users can create custom short URLs.
Metrics collection for analytics.
Capacity Planning and Estimation (Non-Functional Requirements)
Traffic
This is a read-heavy system; assume a read:write ratio of 200:1.
URLs generated per second = URLs generated per month / (30 × 24 × 3600) = 100M / 2,592,000 ≈ 40 unique URLs generated per second (total requests per second ≈ 8,040).
R:W per second = 8,000:40 | R:W per month = 20B:100M | R:W per year = 240B:1.2B
Total requests per year ≈ 241.2B
Storage
Storage decisions are based on the data type, the amount of data, and the access pattern.
Assume the database schema stores longUrl, shortUrl, createdDate, and user info.
Size of longUrl = 2 KB (2,048 characters)
Size of shortUrl = 256 bytes (assuming max); the other fields have similar sizes.
We can consider the total size of one data object to be 5 KB (5 × 10^3 bytes).
Storage required per month = 100M × 5 KB = (100 × 10^6) × (5 × 10^3) = 5 × 10^11 bytes = 500 GB/month
Storage for 10 years = 500 GB × 12 × 10 = 60 TB (approx.)
Memory
By the Pareto principle, 80% of requests are for 20% of the data, so we can allocate memory for 20% of a month's data as cache.
Cache memory per month = 0.2 × 100M × 5 KB = 100 GB of cache memory
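As a quick sanity check of the back-of-the-envelope numbers above, here is a minimal Java sketch; the constants mirror our assumptions (100M URLs/month, 200:1 read ratio, 5 KB per record, 20% cached):

public class CapacityEstimate {
    public static void main(String[] args) {
        long urlsPerMonth = 100_000_000L;                    // 100M new URLs per month
        long secondsPerMonth = 30L * 24 * 3600;              // ~2.59M seconds
        long writesPerSec = urlsPerMonth / secondsPerMonth;  // ~38, rounded up to ~40
        long readsPerSec = writesPerSec * 200;               // 200:1 read:write ratio

        long objectSizeBytes = 5_000L;                       // ~5 KB per stored record
        long bytesPerMonth = urlsPerMonth * objectSizeBytes; // 5 x 10^11 = 500 GB
        long bytesTenYears = bytesPerMonth * 12 * 10;        // ~60 TB
        long cacheBytes = bytesPerMonth / 5;                 // 20% of a month's data = 100 GB

        System.out.printf("writes/s = %d, reads/s = %d%n", writesPerSec, readsPerSec);
        System.out.printf("storage/month = %.0f GB, 10 years = %.0f TB, cache = %.0f GB%n",
                bytesPerMonth / 1e9, bytesTenYears / 1e12, cacheBytes / 1e9);
    }
}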
Latency Expectation
Should the system be highly available or strongly consistent? Or is eventual consistency preferred?
Say a user creates a domain, and some service (a web crawler) fetches the long URLs from this newly created domain, sends them to our service, and converts them to short URLs; end users don't even know such domains exist. If changes to a highly popular, user-facing website must be handled with a backup plan, discuss this with the interviewer, understand the exact requirements, and choose the DB and cache (eviction and write policies) appropriately. We will discuss this soon.
High-Level Design
Consistency
Strong Consistency - all nodes in the distributed system have the same view of the data at all times (e.g., financial systems).
Eventual Consistency - nodes in the distributed system may not have the same view of the data at any given moment, but they will eventually become consistent via a cron job/queue/broker (e.g., social media).
Availability means the DB is available for read/write operations. In other words, every request to the system receives a response, without a guarantee that it contains the most recent version of the data.
API Endpoints
Create API
@PostMapping("/create/shorturl/")
public String createShortURL(String longUrl, String apiAccessToken, String customUrl) {
    return tinyUrlService.createShortURL(longUrl, apiAccessToken, customUrl);
}
Used to create the short URL from the long URL. The API access token can be configured in properties / sent for every user (optional). The custom URL is also optional. This will generate the short URL and respond with appropriate HTTP status codes (20x, 40x, 50x).
Get API
@GetMapping("/{shorturl}")
public String getLongURL(String shorturl) {
    return tinyUrlService.getLongURL(shorturl);
} // return the long URL, or 301/302 for redirection
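As a variation, here is a minimal sketch of returning an actual HTTP redirect instead of the raw long URL (assuming Spring Web's ResponseEntity, HttpStatus, and java.net.URI; getLongURL returning null for unknown URLs is an assumption for illustration):

@GetMapping("/{shorturl}")
public ResponseEntity<Void> redirectToLongURL(@PathVariable String shorturl) {
    String longUrl = tinyUrlService.getLongURL(shorturl);
    if (longUrl == null) {
        return ResponseEntity.notFound().build();      // 404 for unknown short URLs
    }
    return ResponseEntity.status(HttpStatus.FOUND)     // 302; MOVED_PERMANENTLY (301) lets browsers cache
            .location(URI.create(longUrl))
            .build();
}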
List of user-created custom URLs (if needed)
@GetMapping("/user/customurls")
public List<String> getUserCustomURLs(String username, String apiAccessToken) {
    return tinyUrlService.getUserCustomURLs(username, apiAccessToken);
} // return the list of user-created custom URLs
Database Schema and DB Selection
Here we will look at SQL and NoSQL. First, let's look into SQL, and why.
SQL
Assuming consistency is the most important priority, i.e., we agree to give up some performance: ACID is highly prioritized, and before a user or service reads/writes the database, all the distributed database nodes MUST have the same view of the data. Also, our requirements don't involve complex queries or complex join operations.
Since we have read-heavy operations, if we go for an SQL data model we can use indexing and sharding; this will improve the performance of the read-heavy system (but it introduces the complexity of consistent hashing to map a key to the required table, which is handled easily by a NoSQL database like Amazon DynamoDB). A caching layer (at the DB, the server, and the end user) will reduce the heavy load on servers and the DB and can be tuned based on the peak availability hours of the domain and geography; a CDN is completely optional (maybe an existing CDN can be used). This will take the read/write operations from Saiyan to Super Saiyan 2.
But a good algorithm and a coordinator between the servers will provide more consistency. Time for the algorithm to generate the short URL.
NoSQL
Given infinite scaling, high read rates, sharded DBs, and the fact that many such options are already built into NoSQL services like Amazon DynamoDB, the load on the developer is reduced, and dynamic scaling also handles peak-hour needs. Since all these major scaling requirements are taken care of by Amazon, I would go for NoSQL because:
1. Eventual consistency is acceptable. 2. Low latency. 3. High performance. 4. Less heavy lifting for developers, so we can focus more on the functional requirements. 5. Amazon provides a good ecosystem for analytics, monitoring, and logging.
Thus NoSQL makes our system invincible.
Algorithm
There are a few possible ways to implement URL encoding for this design problem:
Random encoding
Base62
MD5
Key Generation Service
A base is the number of digits or characters that can be used to represent a particular number.
Base 10 uses the digits [0-9].
Base 62 uses [0-9][a-z][A-Z].
How many characters shall we keep in our tiny URL?
We don't want to take 8 characters as the total length of the short URL, as 62^8 far exceeds the usage limit (assume 100M/month × 12 months × 200 years = 240 billion unique URLs generated over 200 years); 7 characters already give 62^7 ≈ 3.5 trillion combinations, which is more than enough.
Random Encoding
You can completely ignore the code below (for random encoding); it is just for reference (please focus on the algorithm used).
import java.security.SecureRandom;

public class ShortUrlGenerator {
    private static final int NUM_CHARS_SHORT_LINK = 7;
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final int ALPHABET_LENGTH = ALPHABET.length();

    private SecureRandom secureRandom = new SecureRandom();

    public String generateRandomShortUrl() {
        char[] result = new char[NUM_CHARS_SHORT_LINK];
        while (true) {
            // Pick 7 random characters from the 62-character alphabet.
            for (int i = 0; i < NUM_CHARS_SHORT_LINK; i++) {
                int randomIndex = secureRandom.nextInt(ALPHABET_LENGTH);
                result[i] = ALPHABET.charAt(randomIndex);
            }
            String shortLink = new String(result);
            // Make sure the short link isn't already used.
            if (!DB.checkShortLinkExists(shortLink)) {
                return shortLink;
            }
        }
    }

    // Assume there's a DB class with a method to check if a short link exists.
    private static class DB {
        public static boolean checkShortLinkExists(String shortLink) {
            // Implementation to check if the short link exists in the database.
            // This method needs to be implemented based on the database being used.
            return false;
        }
    }

    public static void main(String[] args) {
        // Example usage
        ShortUrlGenerator generator = new ShortUrlGenerator();
        String randomShortUrl = generator.generateRandomShortUrl();
        System.out.println("Random Short URL: " + randomShortUrl);
    }
}
Generate a random 7-character string and check whether this short URL is already in the DB (putIfAbsent).
If it is not in the DB, insert it.
If it is present in the DB, repeat the process until a unique short URL is placed in the DB.
This approach increases the total load on the DB (even if a cache is introduced), since the total number of reads increases. In addition, the more servers there are in the distributed system, the higher the chance that the same random string is generated, and the checkShortLinkExists requests increase both the load and the short-URL generation time, reducing the overall performance of the system.
Base62 Encoding
Similar to the random encoding algorithm, this generates a base62 string from a base10 value; these base10 values act as unique keys that can range from 0 to ~3,500B (62^7).
Generating the short URL with this approach provides more unique keys.
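Here is a minimal sketch of base62 encoding from a base10 counter value (the class and method names are illustrative):

public class Base62Encoder {
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Converts a base10 id to its base62 representation, e.g., 125 -> "21".
    public static String encode(long id) {
        if (id == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62)));
            id /= 62;
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(Base62Encoder.encode(100_000_000L)); // prints "6LAze"
    }
}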
In distributed systems, multiple servers exist, so how do the servers keep track of the current count? It could be implemented via threads sharing a synchronized counter, but that would degrade performance (increase the waiting time). What if we allocate a specific range to each server?
Say one server operates in the range 100M-200M, the next server operates on 200M+1 to 300M, and so on.
But does keeping the range on the server cause a problem? If a server goes down, its associated range goes down with it. The process can continue from where it left off, or, as a well-supported alternative, we can introduce ZooKeeper.
ZooKeeper will keep track of the counter range for each server, and if a server's range goes down it can assign a different (next unused) range to that server.
Apart from this, ZooKeeper also handles other coordination tasks in distributed systems, such as configuration management, leader election, and distributed locking.
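Below is a minimal, in-process sketch of the range-allocation idea; the coordinator here is a local AtomicLong standing in for ZooKeeper, and all names are illustrative:

import java.util.concurrent.atomic.AtomicLong;

public class RangeCounter {
    private static final long RANGE_SIZE = 100_000_000L; // e.g., 100M ids per lease

    // Stand-in for the coordinator (ZooKeeper in the real design),
    // which hands out the next unused range atomically.
    private static final AtomicLong nextRangeStart = new AtomicLong(0);

    private long current;
    private long end;

    public RangeCounter() {
        leaseNewRange();
    }

    private void leaseNewRange() {
        current = nextRangeStart.getAndAdd(RANGE_SIZE);
        end = current + RANGE_SIZE;
    }

    // Each server hands out ids from its leased range without any
    // cross-server coordination until the range is exhausted.
    public synchronized long nextId() {
        if (current >= end) {
            leaseNewRange();
        }
        return current++;
    }
}

Each id from nextId() would then be base62-encoded (see Base62Encoder above) to produce the short URL.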
MD5 Hashing
Apply the MD5 hash function to the original URL to produce a 32-digit hexadecimal string.
Take the first 7 characters of the hash as the candidate TinyURL.
Check the database to see whether this 7-character URL already exists.
If it exists, take the next 7 characters of the hash and check again.
Repeat, taking successive 7-character slices, until a unique TinyURL is found.
Store the mapping of TinyURL to original URL in the database.
The 7-char TinyURL encodings may collide initially, but slicing successive 7-char segments lets us find a unique short URL in practice.
The mapping is stored so the TinyURL can be redirected to the long URL when accessed.
This approach is not recommended; it is included just for knowledge.
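For reference, here is a minimal sketch of the MD5 slicing approach using java.security.MessageDigest (the DB-existence check is omitted; class and method names are illustrative):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5ShortUrl {
    // Hash the long URL with MD5 and return the 32-char hex digest.
    static String md5Hex(String longUrl) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(longUrl.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Take successive non-overlapping 7-char slices of the 32-char hash:
    // attempt 0 -> chars 0-6, attempt 1 -> chars 7-13, ... (valid for attempts 0-3).
    static String candidate(String hash, int attempt) {
        return hash.substring(attempt * 7, attempt * 7 + 7);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String hash = md5Hex("https://example.com/some/very/long/path");
        System.out.println(candidate(hash, 0)); // first candidate TinyURL
    }
}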
Key Generation Service
Some consider this approach better than base62 with ZooKeeper. Let's get into the discussion.
Caching
R:W = 8,000:40 per second, a read-heavy application.
We can introduce caching at: 1. the application server cache, 2. a global cache across app servers, 3. the CDN cache, 4. the browser cache.
For our system, 1. a browser-level cache, 2. a global cache for the application servers, and 3. a global cache at the DB level will do. A CDN is not required, as at most ~100 GB of cache data is needed based on our calculation.
For this read-heavy application we can use a write-through cache, because the cache will always have the latest data; the disadvantage of a write-through cache is that infrequently requested data is also written to the cache.
To mitigate this, an LRU cache eviction policy can be used (see the sketch below).
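A minimal LRU cache sketch built on LinkedHashMap's access-order mode, mapping shortUrl -> longUrl (the capacity and names are illustrative):

import java.util.LinkedHashMap;
import java.util.Map;

public class LruUrlCache extends LinkedHashMap<String, String> {
    private final int capacity;

    public LruUrlCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true enables LRU ordering
        this.capacity = capacity;
    }

    // Evict the least recently used entry once we exceed capacity.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruUrlCache cache = new LruUrlCache(2);
        cache.put("abc1234", "https://example.com/long/1");
        cache.put("def5678", "https://example.com/long/2");
        cache.get("abc1234");                               // touch: abc1234 is now most recent
        cache.put("ghi9012", "https://example.com/long/3"); // evicts def5678
        System.out.println(cache.keySet());                 // [abc1234, ghi9012]
    }
}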
Initial idea
We improve on the initial idea by refining the algorithm and addressing single points of failure, scalability, performance, caching, the coordinator service, and the load balancer.
We can further improve the monitoring and analytics of this system by introducing Kafka and the Elastic Stack.
Using Kafka as a queue to buffer writes and decouple them from the caching layer could improve cache performance.
The Elastic Stack (Elasticsearch, Logstash, Kibana) for metrics collection, visualization, and analytics could give you insights to further optimize and improve the system over time.
However, for our system a simple write-through cache may still provide adequate performance for our workload. Kafka and the Elastic Stack would provide operational benefits like monitoring, alerting, and troubleshooting rather than major performance gains, so discuss with the interviewer whether this is a requirement or not.
Guess what, this is not the end; there is always space for improvement!! We'll meet in the next chapter, till then CIAO.