Systems Design: YouTube


Requirements

Functional

  1. User uploads a video
  2. User searches for videos by title
  3. User can comment on, like, and share a video

Non-functional

  1. Highly reliable: an uploaded video should never be lost.
  2. Highly available: consistency can take a hit; briefly stale view or like counts are acceptable.
  3. Low latency: playback should feel real-time, without lag.

Estimation

Assume 1B total users, of whom 500M are active. Let’s say each active user views 5 videos per day.

Read QPS: 500M * 5 views / ~100K seconds per day ≈ 25K video views/sec

Write QPS: Let’s say the upload:view ratio is 1:200, so 25K / 200 ≈ 125 videos uploaded/sec

Storage:

At ~125 videos/sec uploaded, that is ~7.5K videos/min. If each video is 10 minutes long and each minute of video takes 50 MB: 7.5K videos/min * 10 min * 50 MB ≈ 3,750 GB/min, roughly 3.75 TB of new video every minute!
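A quick script to sanity-check these numbers (the constants mirror the assumptions above):

```python
# Back-of-envelope check; 100K seconds/day is the usual rounding of 86,400.
DAU = 500_000_000        # daily active users
VIEWS_PER_USER = 5       # videos viewed per active user per day
SECONDS_PER_DAY = 100_000
UPLOAD_VIEW_RATIO = 200  # 1 upload per 200 views

read_qps = DAU * VIEWS_PER_USER / SECONDS_PER_DAY   # 25,000 views/sec
write_qps = read_qps / UPLOAD_VIEW_RATIO            # 125 uploads/sec

VIDEO_MINUTES = 10       # average video length
MB_PER_MINUTE = 50       # size of one minute of video

uploads_per_min = write_qps * 60                    # 7,500 videos/min
gb_per_min = uploads_per_min * VIDEO_MINUTES * MB_PER_MINUTE / 1_000

print(f"Read QPS: {read_qps:,.0f}, Write QPS: {write_qps:,.0f}")
print(f"New video storage: {gb_per_min:,.0f} GB/min")  # ~3,750 GB/min
```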

API

Upload: POST https://www.googleapis.com/upload/youtube/v3/videos

{
  userDetails,
  videoDetails (title, desc, loc, tags, thumbnail, fileDetails)
}

HTTP 202 (request accepted)
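To make the contract concrete, here is a hypothetical client call. The endpoint matches the URL above, but the body fields, auth header, and inline-JSON upload are illustrative assumptions; a real upload would send the video bytes as a resumable/multipart request.

```python
import requests

resp = requests.post(
    "https://www.googleapis.com/upload/youtube/v3/videos",
    headers={"Authorization": "Bearer <token>"},  # assumed auth scheme
    json={
        "userDetails": {"userId": "u123"},
        "videoDetails": {
            "title": "My first video",
            "desc": "Hello world",
            "loc": "NYC",
            "tags": ["intro"],
            "thumbnail": "thumb0.jpg",
            "fileDetails": {"fileName": "video.mp4", "sizeBytes": 52_428_800},
        },
    },
)
print(resp.status_code)  # the design returns HTTP 202: accepted for async processing
```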

Search: GET https://api.google.com/youtube/v1/search

Query params: q, location, offset, limit
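A paginated search call might look like this (the response shape is an assumption):

```python
import requests

resp = requests.get(
    "https://api.google.com/youtube/v1/search",
    params={"q": "system design", "location": "NYC", "offset": 0, "limit": 20},
)
for video in resp.json()["results"]:  # assumed response envelope
    print(video["videoId"], video["title"])
```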

Delete

Update video metadata

Stream: send video chunks over HTTP from a given offset (e.g. via Range requests), or use a WebSocket.
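For example, resuming playback from a byte offset with a standard HTTP Range request (the /stream endpoint here is hypothetical):

```python
import requests

offset = 1_048_576  # resume from the 1 MiB mark
resp = requests.get(
    "https://api.google.com/youtube/v1/videos/abc123/stream",
    headers={"Range": f"bytes={offset}-"},
    stream=True,
)
print(resp.status_code)  # expect 206 Partial Content
for chunk in resp.iter_content(chunk_size=64 * 1024):
    pass  # feed each chunk to the player buffer
```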

Design

At a high level, we need the following components:

  1. Processing Queue: Each uploaded video is pushed to a processing queue, to be de-queued later for encoding, thumbnail generation, and storage (a minimal worker sketch of this pipeline follows the list).
  2. Encoder: Encodes each uploaded video into multiple formats.
  3. Thumbnail generator: Generates a few thumbnails for each video.
  4. Video and thumbnail storage: Stores video and thumbnail files in a distributed file store.
  5. User database: Stores user information, e.g., name, email, address.
  6. Video metadata storage: A metadata database holding all information about videos: title, file path in the system, uploading user, total views, likes, dislikes, etc. It also stores all video comments.
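A minimal in-process sketch of that pipeline, with assumed stand-ins for the real pieces (a durable queue like Kafka/SQS, an encoder farm running ffmpeg, and a blob store like S3):

```python
import queue

processing_queue: "queue.Queue[str]" = queue.Queue()

FORMATS = ["360p", "720p", "1080p"]  # assumed target renditions

def encode(video_path: str, fmt: str) -> str:
    """Placeholder for the real encoder invocation."""
    return f"{video_path}.{fmt}.mp4"

def generate_thumbnails(video_path: str, count: int = 5) -> list[str]:
    """Placeholder for frame extraction (five thumbnails per video, as below)."""
    return [f"{video_path}.thumb{i}.jpg" for i in range(count)]

def store(renditions: list[str], thumbnails: list[str]) -> None:
    """Placeholder for writes to blob storage plus metadata DB updates."""
    print(f"stored {len(renditions)} renditions, {len(thumbnails)} thumbnails")

def process_one() -> None:
    video_path = processing_queue.get()  # de-queue an uploaded video
    renditions = [encode(video_path, f) for f in FORMATS]
    store(renditions, generate_thumbnails(video_path))
    processing_queue.task_done()

processing_queue.put("uploads/video123.mp4")  # enqueued by the upload API
process_one()
```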

Database

Video and user metadata live in MySQL. The core relationships: a User uploads Videos, and a User comments on, likes, and shares Videos.

Video(videoId, title, desc, thumbnail, userId, likes, dislikes, views)

Comment(commentId, videoId, userId, comment, createdAt, updatedAt)

User(userId, name, email, address, age, createdAt)
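The same schema as runnable DDL; sqlite3 is used here only so the sketch executes, while the design itself calls for MySQL:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE User (
    userId    INTEGER PRIMARY KEY,
    name      TEXT,
    email     TEXT,
    address   TEXT,
    age       INTEGER,
    createdAt TEXT
);
CREATE TABLE Video (
    videoId   INTEGER PRIMARY KEY,
    title     TEXT,
    "desc"    TEXT,               -- quoted: DESC is a reserved word
    thumbnail TEXT,
    userId    INTEGER REFERENCES User(userId),
    likes     INTEGER DEFAULT 0,
    dislikes  INTEGER DEFAULT 0,
    views     INTEGER DEFAULT 0
);
CREATE TABLE Comment (
    commentId INTEGER PRIMARY KEY,
    videoId   INTEGER REFERENCES Video(videoId),
    userId    INTEGER REFERENCES User(userId),
    comment   TEXT,
    createdAt TEXT,
    updatedAt TEXT
);
""")
```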

Where would videos be stored? Videos can be stored in a distributed file store such as Amazon S3. The service is read-heavy, so we will focus on building a system that can retrieve videos quickly.

Where would thumbnails be stored? There will be many more thumbnails than videos: if every video has five thumbnails, we need a very efficient storage system that can serve huge read traffic. Two considerations drive the choice of storage for thumbnails:

  1. Thumbnails are small files, say, a maximum of 5KB each.
  2. Read traffic for thumbnails will be huge compared to videos. Users will be watching one video at a time, but they might be looking at a page with 20 thumbnails of other videos.

Let’s evaluate storing all the thumbnails on disk. Given the huge number of files, reading them requires many seeks to different locations on the disk, which is quite inefficient and results in higher latencies.

S3 would be the best choice for the simple reason that you can easily use S3 as an origin for CloudFront, Amazon’s CDN. By using CloudFront (or indeed any CDN), the images are hosted physically around the world and served from the fastest location for a given user.

Sharding

Sharding based on UserID: all of a user’s metadata lands on one shard, so what happens if a user becomes popular? The load on that shard spikes. We need uniform distribution.

Sharding based on VideoID: Our hash function maps each VideoID to a random server, where we store that video’s metadata. To find a user’s videos, we query all servers, and each returns its set of videos; a centralized server aggregates and ranks these results before returning them to the user. This approach solves the popular-user problem but shifts it to popular videos.
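A toy sketch of VideoID-based sharding with the scatter-gather read described above (the shard count and ranking key are assumptions):

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count

def shard_for(video_id: str) -> int:
    """Hash the VideoID so metadata spreads uniformly across shards."""
    return int(hashlib.md5(video_id.encode()).hexdigest(), 16) % NUM_SHARDS

def videos_for_user(user_id: str, shards: list[list[dict]]) -> list[dict]:
    """Scatter-gather: query every shard for the user's videos, then have a
    central aggregator merge and rank the partial results (by views here)."""
    partials = [v for shard in shards for v in shard if v["userId"] == user_id]
    return sorted(partials, key=lambda v: v["views"], reverse=True)
```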
