Bytely - About
This project is only about building a URL shortener, which seems easy. But before considering them easy, we need to understand what this actually is and how it works under the hood, especially in high load scenarios.
Step 1: Understand the Problem
System design interview questions are intentionally left open-ended. To design a well-crafted system, it is critical to ask clarification questions.
Interview Progress
Candidate: Can you give an example of how a URL shortener work?
Interviewer: The URL Shortener performs basically 2 things: shortening the urls and redirecting the short urls to their original destinations.
Candidate: What is the Traffic Volume?
Interviewer: 100 million URLs are generated per day.
Candidate: How long is the shortened URL?
Interviewer: As short as possible.
Candidate: What characters are allowed in the shortened URL?
Interviewer Shortened URL can be a combination of numbers (0-9) and characters (a-z, A-Z).
Candidate: Can shortened URLs be deleted or updated?
Interviewer: For simplicity, let us assume shortened URLs cannot be deleted or updated. Here are the basic use cases:
- URL shortening: given a long URL => return a much shorter URL
- URL redirecting: given a shorter URL => redirect to the original URL
- High availability, scalability, and fault tolerance considerations
Now, based on the information we should perform a Back of The Envelope Estimation:
Back of the Envelope Estimation
- Write operation: 100 million URLs are generated per day.
- Write operation per second:
100 million / 24 /3600 = 1160 - Read operation: Assuming ratio of read operation to write operation is 10:1, read operation per second:
1160 * 10 = 11,600 - Assuming the URL shortener service will run for 10 years, this means we must support
100 million * 365 * 10 = 365 billionrecords. - Assume average URL length is 100.
- Storage requirement over 10 years:
365 billion * 100 bytes * 10 years = 365 TB
Step 2: High Level Design
In this section we should focus on overall workflow, how data is stored, transferred and how it is accessed.
API Endpoints
We'll choose REST-style APIs. There are only 2 APIs:
- Create a new short URL:
POST api/v1/shorten/- request param:
long_url - response:
short_url
- request param:
- Redirect short URL:
GET api/v1/{short_url}/- return
long_urlfor HTTP redirection
- return
In terms of URL redirection, we have 2 options: 301 or 302 redirect. The 301 redirect is permanent and the 302 redirect is temporary. In this project, we will use 302 redirect because we need analytics data on the number of redirects.
URL Shortening
The most intuitive way is to use a hash table like <shortURL:longURL>. Now, hashtable requires us to find a proper hash function:
- Each
longURLmust be hashed to onehashValue - Each
hashValuecan be mapped to oneshortURL
Step 3: Design Deep Dive
From now on, we can dive deeper on low level design, like data model, hash function choices, etc.
Data Model
In the high-level design, everything is stored in a hash table. This is a good starting point; however, this approach is not feasible for real-world systems as memory resources are limited and expensive. A better option is to store <shortURL, longURL> mapping in a relational database. The simplest database model is Link(id, shortURL, longURL).
Hash Function
Our expected hashValue should consist of characters from [0-9, a-z, A-Z], containing 62 possible characters. We are said that the hash value should be as short as possible, so we need to compute the smallest n that 62^n >= 365 billion. The smallest n is 7, because 62^7 = ~3.5 trillion. This means we need to use 7 character long short codes for hashValue.
We should consider 2 types of hash functions now: hash + collision resolution and base62 conversion.
Hash + Collision Resolution
To shorten a long URL, we should implement a hash function that hashes a long URL to a 7-character string. A straightforward solution is to use well-known hash functions like CRC32, MD5, or SHA-1. But if we look at the natural lengths of these hash functions, their lengths are way longer than 7. So we need a way: maybe cut 1st 7 characters? But this will cause hash collisions. To resolve collisions we can recursively append a new predefined string until no more collision is discovered. (Soon I'll dive deeper on this).
Warning
In this situation, a Bloom Filter is a better solution. However we have another option.
Base62 Conversion
Base conversion helps to convert the same number between its different number representation systems. For example, 11157 becomes 2TX.
| Hash + Collision Resolution | Base62 Conversion |
|---|---|
| Fixed short URL length | The short URL length is not fixed, depends on ID |
| No need for unique ID generator | Depends of unique ID generator |
| Collision is possible | Collision is impossible |
| Impossible to figure out next available short code | Easy to figure out next short URL based on ID |
URL Shortening
Well, let's look at the overall flow of the URL shortening process:
- The
longURLis the input. - The system checks if the
longURLis in the database. - If it is, it means the
longURLwas converted toshortURLbefore. In this case, fetch theshortURLfrom the database and return it to the client. - If not, the
longURLis new. A new unique ID (primary key) is generated by the unique ID generator. - Convert the ID to
shortURLwith base 62 conversion. - Create a new database row with the
ID,shortURL, andlongURL.
Now, everything seems complete, until we reach the distributed environment. At this point, we have to discuss a Unique ID Generator, as it is a challenging task. You can visit the link about related discussion.
Step 4: Wrap Up
If there is extra time at the end of the interview, here are a few additional talking points.
Additional points
- Rate limiter: A potential security problem we could face is that malicious users send an overwhelmingly large number of URL shortening requests. Rate limiter helps to filter out requests based on IP address or other filtering rules. We need to focus on chapter 4 for this.
- Web server scaling: Since the web tier is stateless, it is easy to scale the web tier by adding or removing web servers.
- Database scaling: Database replication and sharding are common techniques.
- Analytics: Data is increasingly important for business success. Integrating an analytics solution to the URL shortener could help to answer important questions like how many people click on a link? When do they click the link? etc.
- Availability, consistency, and reliability. We should recall the chapter 1 for this.
Step 5: Implementation
So, I decided to implement this project step-by-step, with version-based approach. Let's distribute the whole process into 8 stages:
- A "toy" in-memory service
- Persistent (database-backed) service with DB sequence as short code
- Better code generation and deduplication strategy
- Add Caching to the Hot Path
- Rate Limiting & Abuse Protections
- Analytics (without killing the hot path)
- Vertical scaling (multiple workers behind nginx)
- Horizontal scaling (multiple app instances behind a load balancer)
This step-by-step ensures we are not rushing nor overcomplicating the problem. We are balancing the learning path and actual project outcome.
Danger
Overcomplicating the problem is a recipe for failure.
Reference Material: System Design Interview by Alex Xu