NETFLIX HIGH-LEVEL SYSTEM DESIGN
STEP - 1
Functional Req
User Registration + login + profiles
Search
Recommendations
Activity tracking
Recommendation generation
Dashboard
Payment
Content Ingestion and Delivery
Rights + Licences
Categorization
Notifications
Trending Movies
Watch History
Endorsement
Design Constraints
Number of users: 100 million
Daily active users: 10% (10 million)
Content Upload / sec: Minimal
Content Streamed / sec: Need scaling
Number of videos: let's say 20,000
Average per hour of video: 3 GB HD + 1 GB SD = 4 GB
Average video length: 2.5 hours, so ~10 GB per video
Content size: the average size of the video data
User activity tracking: High
Content streaming latency
STEP - 2
Define Microservices
We can bucketize functional requirements into Microservices.
This is a huge system, so many teams are needed to handle these requirements.
Multiple Microservices - Breadth Oriented System.
High Priority Microservices
Low Priority MS
STEP - 3
Logical Architecture whole services
User talks to Content Management and Activity MS
Content Owner uploads to Content Management MS some movie/s.
User is streaming content from Content Management MS.
Activity MS feeds Trending Movies through a Pub-Sub/batch service, because Trending Movies has a much heavier workload. User devices (e.g. Roku boxes or smart TVs) push the activity events.
Trending Movies is not an offline MS; it works on current (near real-time) data.
Recommendation MS is more offline; it pushes recommended movies to the Dashboard.
Users can query from the Dashboard.
The Dashboard has a RESTful API to stream from Content Management, e.g. to show short preview videos to users from the Dashboard MS.
We can expect our read: write ratio to be 200:1
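The Activity → Trending flow above can be sketched with an in-process queue standing in for the pub-sub service (Kafka/SNS or similar in production); all names here are hypothetical:

```python
import queue

# Hypothetical sketch: devices push watch events into a pub-sub topic;
# the Trending Movies MS consumes them in batches and keeps view counts.
activity_topic = queue.Queue()

def publish_watch_event(user_id, content_id):
    """Activity MS side: publish an event (a real system would use a broker)."""
    activity_topic.put({"user": user_id, "content": content_id})

def consume_batch(max_events=100):
    """Trending Movies MS side: drain a batch of events, update counts."""
    counts = {}
    for _ in range(max_events):
        try:
            event = activity_topic.get_nowait()
        except queue.Empty:
            break
        counts[event["content"]] = counts.get(event["content"], 0) + 1
    return counts

publish_watch_event("u1", "movie42")
publish_watch_event("u2", "movie42")
publish_watch_event("u3", "movie7")
print(consume_batch())  # {'movie42': 2, 'movie7': 1}
```

Batching is what keeps the heavy Trending workload off the request path: the Activity MS just publishes and returns.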
STEP - 4A
We could discuss the tiers, data model, APIs and workflow of each microservice, but since time is limited we pick the most important MS.
Tiers, DATA MODEL - APIs - Workflow/Algo
Content Ingestion and Delivery Microservices
App tier, In-Memory (cache for content metadata / compute for compression), Storage tier
Data Structures
Metadata
K: content id, V: JSON object for the metadata: cast and crew, summary, thumbnail
Data:
K: content id, V: bitstream
How to store metadata:
Hashmap in memory
Row oriented in storage (There is no analytics data, easy to read-write)
How to store Data:
Video is not going to be served from a cache; it comes from storage, because the data is very large. The constraints are storage throughput and network throughput, plus client-side decoding.
Client devices such as Roku decode via a ring buffer (FIFO), a producer-consumer data structure.
The producer should be faster than the consumer. The network is the bigger bottleneck here, so we don't have to serve from memory; this is not a CPU workload, it is a storage workload.
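A minimal sketch of such a playback ring buffer (FIFO, producer-consumer); capacity and chunk names are illustrative:

```python
from collections import deque

class RingBuffer:
    """Fixed-capacity FIFO playback buffer: the network producer appends
    chunks, the decoder consumer pops them in order."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = deque()

    def push(self, chunk):
        """Producer (network): returns False when full, so the producer backs off."""
        if len(self.buf) >= self.capacity:
            return False
        self.buf.append(chunk)
        return True

    def pop(self):
        """Consumer (decoder): returns None when empty -> playback stalls (rebuffer)."""
        return self.buf.popleft() if self.buf else None

rb = RingBuffer(capacity=3)
for chunk in ["c0", "c1", "c2", "c3"]:
    rb.push(chunk)           # "c3" is rejected: buffer full
print(rb.pop(), rb.pop())    # c0 c1 -- FIFO order
```

If `pop()` returns None the player rebuffers, which is exactly the "producer must be faster than consumer" condition above.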
Filesystem: filesystems maintain offset-to-chunk mappings. Given any file, you can read from an offset; all filesystems have the capability to let you read from anywhere within the file. The offset comes from the watcher's device, which says "give me this movie at this offset".
APIs: create(K, V) for ingestion, read(K) for metadata, read(content_id, offset) from the filesystem.
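A toy in-memory sketch of these three APIs; the dicts stand in for the metadata store and the filesystem, and `CHUNK` plus all identifiers are illustrative assumptions:

```python
# Hypothetical in-memory model of the three APIs:
#   create(k, v) for ingestion, read(k) for metadata,
#   read(content_id, offset) for video data.
metadata = {}   # content_id -> JSON-like metadata dict
data = {}       # content_id -> bitstream (bytes)

CHUNK = 4       # bytes per read here; a real system uses large chunks

def create(content_id, meta, bitstream):
    metadata[content_id] = meta
    data[content_id] = bitstream

def read_metadata(content_id):
    return metadata[content_id]

def read_data(content_id, offset):
    """Read one chunk starting at `offset`, like a filesystem seek + read."""
    return data[content_id][offset:offset + CHUNK]

create("m1", {"cast": ["A", "B"], "summary": "..."}, b"0123456789")
print(read_metadata("m1")["cast"])   # ['A', 'B']
print(read_data("m1", 4))            # b'4567'
```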
Video Uploads: since videos can be huge, if the connection drops while uploading we should support resuming from the same point.
Video Encoding: newly uploaded videos are stored on the server and a task is added to the processing queue to encode the video into multiple formats. Once all encoding is completed, the uploader is notified and the video is made available for viewing/sharing.
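One way to sketch resumable uploads: the server tracks the byte offset received per upload id, and a client that is out of sync is told where to resume. This is an illustrative assumption, not an actual protocol:

```python
# Hypothetical resumable-upload sketch: server remembers bytes received.
uploads = {}  # upload_id -> bytearray received so far

def upload_chunk(upload_id, offset, chunk):
    """Append `chunk` if `offset` matches what we already have;
    otherwise reject and return the offset to resume from."""
    buf = uploads.setdefault(upload_id, bytearray())
    if offset != len(buf):
        return len(buf)          # client out of sync: resume here
    buf.extend(chunk)
    return len(buf)

upload_chunk("vid1", 0, b"AAAA")               # first chunk lands, then connection drops
resume_at = upload_chunk("vid1", 0, b"AAAA")   # stale retry rejected, returns 4
upload_chunk("vid1", resume_at, b"BBBB")       # resume from byte 4
print(bytes(uploads["vid1"]))                  # b'AAAABBBB'
```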
Component Detail Design
STEP 4B - SCALING
Content Ingestion and Delivery Microservices
Need to scale for storage? Yes
Metadata : 20,000 movies * ~10 KB = 200 MB, so storage is trivial
Data : 20,000 movies * ~ 10 GB = 200 TB
Need to scale for throughput? Yes
Number of create(K, V) is negligible
read(K, offset)/s : 3 MB/s per stream * 10 million users = 30 TB/sec
Assumption (fair): a single server delivers around 1-2 GB/s (SSD) or 300 MB/s (spinning disk)
Metadata read is negligible
Need to scale for API parallelization? Not relevant
Apis themselves are constant latency, so no need to parallelize
Availability? Yes
Hotspot? Yes definitely
Geo-location based distribution? Yes
CDNs (content delivery networks) address network latency. We need to deliver 30 TB/sec all over the world, divided across regions based on access patterns. Say we have 10 CDN locations worldwide; each CDN then delivers ~3 TB/s.
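The back-of-envelope numbers in this step can be checked with a few lines:

```python
# Sanity check of the scaling estimates used in this section.
movies = 20_000
metadata_bytes = movies * 10 * 1024          # ~10 KB of metadata per movie
data_tb = movies * 10 / 1000                 # ~10 GB of video per movie, in TB

dau = 100_000_000 // 10                      # 10% of 100M users active
stream_rate_mb = 3                           # MB/s per concurrent stream
peak_tb_per_s = dau * stream_rate_mb / 1_000_000

cdns = 10
print(f"metadata: {metadata_bytes / 1024**2:.0f} MB")  # ~195 MB (the '200 MB' above)
print(f"data: {data_tb:.0f} TB")                       # 200 TB
print(f"peak read: {peak_tb_per_s:.0f} TB/s")          # 30 TB/s
print(f"per CDN: {peak_tb_per_s / cdns:.0f} TB/s")     # 3 TB/s
```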
STEP 4C
Architectural layout
App Tier
LB: use IP Anycast for DNS load balancing. The LBs typically share the same IP address. When one LB fails and goes passive, the other becomes active; they continuously health-check each other.
App tier servers are stateless, so we don't need to keep server state or session details on them. This lets them handle transactions very fast and makes the implementation easy to scale out.
Just round-robin the requests: this method cycles through a list of servers and sends each new request to the next server. When it reaches the end of the list, it starts over at the beginning.
Req1 → Server A, Req2 → Server B, ...; at the end of the list, start again at A
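A round-robin router is a one-liner with `itertools.cycle`:

```python
import itertools

# Minimal round-robin load-balancing sketch: cycle through the server list.
servers = ["A", "B", "C"]
rr = itertools.cycle(servers)

def route(request):
    """Send each new request to the next server in the cycle."""
    return next(rr)

print([route(f"req{i}") for i in range(5)])  # ['A', 'B', 'C', 'A', 'B']
```

Round-robin works precisely because the app tier is stateless: any server can take any request.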
Cluster manager: heartbeats are sent to a separate cluster manager, which informs the load balancer. The LB knows which servers are currently healthy.
Cache - Storage Tier
Config store: keeps track of how many servers are active and holds the shard-to-server mapping.
Which shards belong to which server?
Like K: 4 → Server5, Server37, Server25.
This information is passed to the LB; the LB will go to Server5, Server37 or Server25.
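A sketch of how the LB might combine the shard map with the cluster manager's health info (server names and health flags are hypothetical):

```python
# Hypothetical sketch: config store holds shard -> replica servers,
# the cluster manager marks health, and the LB picks a healthy replica.
shard_map = {4: ["Server5", "Server37", "Server25"]}
healthy = {"Server5": False, "Server37": True, "Server25": True}

def route_to_shard(shard_id):
    """Return the first healthy replica holding this shard."""
    for server in shard_map[shard_id]:
        if healthy.get(server):
            return server
    raise RuntimeError(f"no healthy replica for shard {shard_id}")

print(route_to_shard(4))  # Server37, since Server5 is down
```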
SHARDING - REPLICATION
Sharding:
Metadata: No need, if needed, horizontal
Data: Yes
Horizontal by content-id: horizontal partitioning alone doesn't reduce the hotspot. Say I shard the filesystem data by content id, with some movies on one server and some elsewhere, but each full movie within the same shard. What happens? If a movie is very hot and a lot of people are watching it, that particular server becomes a hotspot. So we need to propose a hybrid.
Hybrid: content id + chunk id [128 MB chunks]: hash(content_id, chunk_id) % N = shard id → server
Why hybrid? Look at the data model: how many rows? 20,000, but each row is huge. This is "short and fat" content rather than "long and thin" content.
If we look at URL shortener data, was it short and fat or thin and long? Thin and long. A thin data model cannot be cut vertically; a short-and-fat one can.
Given the data model, I can figure out whether to partition horizontally, vertically, or both. In this case I will do both: partition by movie as well as by chunks.
If each chunk goes to a different shard, do we lose any throughput? No: my reads come as read(K, offset). Given an offset, I go to exactly one chunk, so I can distribute the work across all the chunks.
Why does this also remove the hotspot? The probability that you and I are watching the same movie is higher than the probability that we are watching the same portion (the same chunk) of the movie at the same moment. So I reduce hotspots and distribute load; and since my reads carry both keys, it is still not scatter-gather: each read goes to one local shard.
The first offset maps to shard 1, the second offset to shard 2, and so on: the offset determines the chunk id, which determines the shard.
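The hybrid scheme can be sketched directly from the formula above, hash(content_id, chunk_id) % N with 128 MB chunks; the shard count and hash choice are illustrative:

```python
import hashlib

N_SHARDS = 8
CHUNK_MB = 128

def shard_for(content_id, offset_mb):
    """Hybrid sharding: the byte offset picks a chunk, and
    hash(content_id, chunk_id) % N picks the shard for that chunk."""
    chunk_id = offset_mb // CHUNK_MB
    key = f"{content_id}:{chunk_id}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % N_SHARDS

# A hot movie's chunks map independently to shards instead of
# piling up on the single server that holds the whole movie:
print({off: shard_for("hot-movie", off) for off in (0, 128, 256, 384)})
```

Note that a single read(K, offset) still resolves to exactly one shard, so there is no scatter-gather.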
Placement of shards in servers
Consistent Hashing
(Consistent hashing is a very useful strategy for distributed caching systems and DHTs. It allows us to distribute data across a cluster in such a way that will minimize reorganization when nodes are added or removed. Hence, the caching system will be easier to scale up or scale down.)
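A minimal consistent-hash ring with virtual nodes, showing that removing a server only remaps that server's share of the keys (vnode count and server names are illustrative):

```python
import bisect
import hashlib

def _h(key):
    """Map a string key to a point on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each server owns many virtual nodes on the ring; a key belongs to
    the first vnode clockwise from its hash."""
    def __init__(self, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, server)

    def add(self, server):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_h(f"{server}#{i}"), server))

    def remove(self, server):
        self.ring = [(h, s) for (h, s) in self.ring if s != server]

    def lookup(self, key):
        i = bisect.bisect(self.ring, (_h(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHashRing()
for s in ("S1", "S2", "S3"):
    ring.add(s)
before = {f"chunk{i}": ring.lookup(f"chunk{i}") for i in range(1000)}
ring.remove("S2")
moved = sum(1 for k, s in before.items() if ring.lookup(k) != s)
print(f"{moved}/1000 keys moved")  # roughly a third: only S2's keys remap
```

With naive `hash % N`, removing a server would remap nearly all keys; here only the departed server's keys move.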
Replication - Yes
If even with sharding you cannot reduce the hotspot, the other way to remove it is replication: sharding alone does not eliminate the hotspot, but with replication I can reduce it.
We need availability and we have a hotspot, so:
AP
It does not need to be strictly consistent at the nanosecond level. This is a read-heavy system: write once, read every time. Effectively a read-only AP system.
Best configuration: N = 3 or more, R = 1, W=1, high throughput, lowest latency.
I don't need conflict resolution between the multiple replicas of the movie; I just take the movie from one replica and return it. Read-only system.
Latency: best, Throughput: best, Consistency: Bad
N = 3 , R= 1 , W = 1
Since W = 1 and N = 3, any one of the 3 replicas will take the write and return to the user before the update shows up in other replicas
Let’s say replica in server A takes the write and returns control to the user.
Subsequent read
○ Since R = 1 and N = 3, any one of the 3 replicas will take the read and return it to the user
○ If the read happens from server C, until C receives the update from A, C will return a stale result
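The stale-read scenario above (N=3, W=1, R=1) can be played out with three dicts standing in for the replicas; the replica names and the explicit `replicate()` step are illustrative:

```python
# N=3, W=1, R=1 sketch: a write is acknowledged by one replica and
# propagated asynchronously; a read from another replica may be stale.
replicas = {"A": {}, "B": {}, "C": {}}

def write(key, value, target="A"):
    """W=1: a single replica acknowledges the write."""
    replicas[target][key] = value

def read(key, source):
    """R=1: read from a single replica -- possibly stale."""
    return replicas[source].get(key)

def replicate():
    """Async replication catching up from A to the others."""
    for r in ("B", "C"):
        replicas[r].update(replicas["A"])

write("movie1", "v2")        # server A takes the write, returns to the user
print(read("movie1", "C"))   # None -- C has not received the update yet
replicate()
print(read("movie1", "C"))   # v2 -- consistent after replication
```

This is the trade the section describes: best latency and throughput, with consistency sacrificed, which is acceptable for read-only movie data.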