You are designing the real-time commenting system that powers live video broadcasts on platforms like Facebook Live and Instagram Live. When a creator starts streaming, viewers across the globe can post text comments that appear on every viewer's screen in near real-time. The system must handle wildly varying traffic patterns, from a casual stream with a handful of viewers to a celebrity broadcast attracting tens of millions of simultaneous participants.
A critical challenge is the mobile-first nature of live video audiences. Viewers frequently lose connectivity, switch between Wi-Fi and cellular, or background the app entirely. When they return, they expect to catch up on comments they missed without seeing duplicates or gaps. The system must also support moderation capabilities, allowing creators and platform moderators to filter inappropriate content and ban disruptive users.
At peak load, the platform serves over 10 million concurrent live streams worldwide, with the most popular individual streams reaching 5 million or more simultaneous viewers. Each comment must propagate from submission to display on remote screens within a few hundred milliseconds.
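To see why naive per-viewer delivery breaks at this scale, a quick back-of-envelope calculation helps. The 5 million viewer figure comes from the requirements above; the 1,000 comments per second rate is an assumed number for a busy celebrity broadcast, not something the prompt specifies.

```python
# Back-of-envelope: cost of pushing every comment to every viewer
# of one hot stream. 5M viewers is from the requirements; the
# 1,000 comments/sec rate is an assumption for illustration.
viewers = 5_000_000
comments_per_sec = 1_000  # assumed rate

naive_pushes_per_sec = viewers * comments_per_sec
print(f"{naive_pushes_per_sec:,} pushes/sec")  # 5,000,000,000 pushes/sec
```

Five billion individual pushes per second for a single stream is the number that motivates the fanout discussion below.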
Based on real interview experiences, these are the areas interviewers probe most deeply:
A single stream with millions of viewers creates an enormous fanout problem. Pushing each comment individually to every connected client is not feasible at this scale.
Hints to consider:
- Can you publish each comment once per connection server rather than once per viewer, letting each server fan out locally to its own clients?
- For streams where no viewer can read every comment anyway, is delivering a sampled subset acceptable?
Maintaining millions of concurrent WebSocket connections is an infrastructure challenge distinct from stateless HTTP serving.
Hints to consider:
- How does the system track which server holds each viewer's connection, and how do those servers learn about new comments for the streams they serve?
- What happens to millions of open connections when a connection server fails or is drained for a deployment?
Viewers connected to different servers may receive the same comments in slightly different orders, which can produce a jarring user experience.
Hints to consider:
- Can a single authoritative sequence per stream (for example, one Kafka partition keyed by stream ID) give every server the same order?
- Do clients need strict global ordering, or is a per-stream sequence number enough for a consistent display?
Mobile networks are inherently unreliable, and viewers routinely disconnect and rejoin streams.
Hints to consider:
- If clients track the last sequence number they saw, how does a server replay exactly what they missed on reconnect?
- How long should recent comments be buffered, and where, to make catch-up cheap?
Confirm the expected scale: number of concurrent streams, peak viewers per stream, and average comment rate per second. Ask whether intelligent sampling of comments is acceptable for ultra-popular streams where viewers cannot read every comment anyway. Clarify how long comments need to be retained -- only for the lifetime of the stream plus a short buffer, or indefinitely for replay. Confirm whether rich media (emoji reactions, stickers) is in scope or if the system handles text only.
Viewers connect via WebSockets to a fleet of connection servers behind a load balancer. When a viewer submits a comment, the request hits an API gateway that validates the user and passes the comment through a moderation filter. Approved comments are written to a Kafka topic partitioned by stream ID. A fanout service consumes from Kafka and publishes each comment to a Redis Pub/Sub channel for the corresponding stream. Each connection server subscribes to channels for the streams its connected viewers are watching, then pushes incoming comments to those clients over their WebSocket connections. A Redis sorted set per stream holds the last several minutes of comments, enabling reconnection catch-up.
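The per-stream ordering in this pipeline comes from partitioning the Kafka topic by stream ID, so all of a stream's comments land on one ordered partition. A minimal sketch of that routing follows; the hash choice and partition count are assumptions for illustration.

```python
import hashlib

NUM_PARTITIONS = 64  # assumed topic partition count

def partition_for(stream_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Route every comment for one stream to the same partition,
    giving that stream's comments a single authoritative order."""
    # Use a deterministic digest rather than Python's built-in
    # hash(), which is salted per process.
    digest = hashlib.sha256(stream_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Kafka's default partitioner does the equivalent when messages are keyed by stream ID; the point is that ordering holds within a stream, not across streams.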
For hot streams, a single Kafka partition and Redis Pub/Sub channel can become bottlenecks. Introduce tiered fanout: the fanout service publishes to a per-stream channel, and each connection server holding viewers for that stream subscribes once. This means the fanout cost scales with the number of connection servers, not the number of viewers. For streams exceeding a configurable threshold (e.g., 100K concurrent viewers), activate comment sampling where only a representative subset of comments is delivered to each viewer, rotating which comments are shown to maintain diversity. Connection servers batch outgoing WebSocket messages into small windows (e.g., 50ms) to reduce per-message overhead and smooth delivery.
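The tiered-fanout idea can be modeled in a few lines: each connection server subscribes to a stream's channel once, so a publish touches each server once regardless of how many of its viewers are watching. This is an illustrative in-memory sketch, not the actual Redis Pub/Sub wiring; the class and method names are invented for the example.

```python
from collections import defaultdict

class TieredFanout:
    """Toy model of tiered fanout: publish cost is O(subscribed
    connection servers), not O(viewers)."""

    def __init__(self):
        # stream_id -> {server_id: set of viewer_ids on that server}
        self.viewers = defaultdict(lambda: defaultdict(set))

    def watch(self, stream_id, server_id, viewer_id):
        # Subscribing is idempotent: the first viewer on a server
        # creates that server's single subscription to the stream.
        self.viewers[stream_id][server_id].add(viewer_id)

    def publish(self, stream_id, comment):
        # One channel message per subscribed server; each server
        # then fans out locally over its own WebSocket connections.
        messages = 0
        for server_id in self.viewers[stream_id]:
            messages += 1  # the pub/sub delivery to that server
        return messages

fanout = TieredFanout()
for i in range(100_000):
    fanout.watch("hot-stream", f"server-{i % 3}", f"viewer-{i}")
print(fanout.publish("hot-stream", "hi"))  # 3 -- one per server, not 100K
```

With 100,000 viewers spread across three connection servers, a publish costs three channel deliveries; the expensive last hop is distributed across the servers that already hold the sockets.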
For moderation, run an inline filter that checks each comment against a banned-word trie and a per-stream ban list stored in Redis, rejecting matches before they enter the fanout path. For reconnection, clients include their last-seen sequence number when re-establishing a WebSocket; the connection server queries the Redis sorted set to fetch missed comments. For long-term durability, persist comments asynchronously to Cassandra for post-stream replay and analytics. Monitor per-stream fanout latency and automatically enable sampling when delivery times exceed thresholds. Use consistent hashing to co-locate viewers of the same stream on the same connection servers, reducing the number of servers that must subscribe to each channel.
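Reconnection catch-up hinges on the per-stream buffer of recent comments keyed by sequence number. Below is a sketch of that logic using an in-memory sorted list as a stand-in for the Redis sorted set; the class and method names are illustrative, not a real API.

```python
import bisect

class RecentComments:
    """Stand-in for the per-stream Redis sorted set: comments scored
    by sequence number, trimmed to a bounded retention window."""

    def __init__(self, max_entries: int = 10_000):
        self.seqs = []    # sorted sequence numbers
        self.bodies = []  # comment bodies, aligned with seqs
        self.max_entries = max_entries

    def add(self, seq: int, body: str) -> None:
        i = bisect.bisect_left(self.seqs, seq)
        self.seqs.insert(i, seq)
        self.bodies.insert(i, body)
        if len(self.seqs) > self.max_entries:
            # Drop the oldest entry, like trimming the sorted set.
            self.seqs.pop(0)
            self.bodies.pop(0)

    def catch_up(self, last_seen_seq: int):
        """Comments strictly after the client's last-seen sequence,
        in order -- what a reconnecting client requests."""
        i = bisect.bisect_right(self.seqs, last_seen_seq)
        return list(zip(self.seqs[i:], self.bodies[i:]))

buf = RecentComments()
for seq, body in [(1, "hi"), (2, "wow"), (3, "lol"), (4, "gg")]:
    buf.add(seq, body)
print(buf.catch_up(2))  # [(3, 'lol'), (4, 'gg')]
```

In Redis the equivalent query is ZRANGEBYSCORE with an exclusive lower bound at the last-seen sequence, which keeps catch-up both gap-free and duplicate-free.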
Deepen your understanding of the patterns used in this problem: