Design a distributed message queue system where multiple publishers send messages to named topics and multiple subscribers consume those messages. The system must provide an at-most-once delivery guarantee: a message may be lost under certain failure conditions, but it must never be delivered more than once to a given subscriber. Think of systems like Apache Kafka, Google Cloud Pub/Sub, or Amazon SNS/SQS.
The core engineering challenges are partitioning topics for write throughput, coordinating consumer groups for parallel consumption, enforcing at-most-once semantics (which fundamentally changes how producers, brokers, and consumers handle acknowledgments), managing backpressure from slow consumers, and providing operational visibility into lag, drop rates, and throughput. You will need to reason about delivery semantics, partition strategies, consumer offset management, and failure handling.
Based on real interview experiences at Anchorage Digital and Amazon, these are the areas interviewers probe most deeply:
The at-most-once guarantee is the defining constraint of this design. Interviewers want to see that you understand how it shapes every layer: producer acknowledgments, broker replication, and consumer offset management. Getting this wrong (accidentally building at-least-once) is the most common mistake.
Hints to consider:
Scaling both writes and reads requires partitioning topics and assigning partitions to consumers within a group. Interviewers evaluate your understanding of partition-key routing, rebalancing, and the interaction between partition count and parallelism.
Hints to consider:
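One way to sketch key-based routing and group assignment -- the function names and the round-robin strategy here are illustrative choices, not a prescribed API:

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Route every message with the same key to the same partition,
    preserving per-key ordering. A stable hash (not Python's salted
    built-in hash()) keeps routing consistent across process restarts."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def assign_partitions(partitions, consumers):
    """Round-robin assignment within a consumer group: each partition is
    owned by exactly one consumer, so parallelism is capped by partition count."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

Note that adding or removing a consumer changes the whole assignment, which is why a join or leave triggers a rebalance, and that with more consumers than partitions the extra consumers sit idle.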
At-most-once systems cannot retry, so slow consumers risk unbounded queue growth or message drops. Interviewers expect explicit backpressure and drop policies.
Hints to consider:
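One way to make the drop policy explicit is a bounded buffer in front of the consumer. This is a sketch with hypothetical names; the key point is that drops are counted so monitoring can surface a drop rate:

```python
from collections import deque

class BoundedBuffer:
    """Bounded consumer-side buffer with an explicit drop policy.
    Under at-most-once there is no retry, so overflow must be a
    deliberate, observable decision rather than unbounded growth."""

    def __init__(self, capacity: int, policy: str = "drop_newest"):
        self.buf = deque()
        self.capacity = capacity
        self.policy = policy
        self.dropped = 0  # exported to monitoring as a drop-rate counter

    def offer(self, msg) -> bool:
        """Returns True if the message was accepted without a drop."""
        if len(self.buf) < self.capacity:
            self.buf.append(msg)
            return True
        self.dropped += 1
        if self.policy == "drop_oldest":
            self.buf.popleft()   # shed the oldest message, keep the newest
            self.buf.append(msg)
        # "drop_newest": reject the incoming message instead
        return False

    def poll(self):
        return self.buf.popleft() if self.buf else None
```

Whether to prefer drop_oldest (freshness wins) or drop_newest (earlier messages win) depends on the workload; the interview point is that the policy is chosen, bounded, and measured.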
Messages must be stored durably on brokers (subject to the at-most-once caveat) and retained for a configurable period. Interviewers assess your storage model and lifecycle management.
Hints to consider:
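A combined time- and size-based retention sweep might look like the following sketch; the `(last_write_ts, size_bytes)` tuple layout is an assumption made for illustration, not a real broker's segment metadata:

```python
def segments_to_delete(segments, now, retention_seconds, max_total_bytes):
    """Pick closed log segments eligible for deletion, oldest first.

    `segments` is a list of (last_write_ts, size_bytes) tuples ordered
    oldest to newest; the newest (active) segment is never deleted.
    A segment goes when it has aged out OR the partition is over its
    size budget."""
    doomed = []
    total = sum(size for _, size in segments)
    for i, (last_write_ts, size) in enumerate(segments[:-1]):  # spare the active segment
        expired = now - last_write_ts > retention_seconds
        over_size = total > max_total_bytes
        if expired or over_size:
            doomed.append(i)
            total -= size
        else:
            break  # segments are time-ordered; everything later is newer
    return doomed
```

Deleting whole segments (rather than individual messages) is what makes retention cheap: expiry is a file unlink, not a per-record scan.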
In production, many teams share the same message queue cluster. Interviewers look for isolation, quotas, and observability.
Hints to consider:
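Per-tenant quotas are commonly enforced with a token bucket at the broker's front door. A minimal sketch, assuming a per-tenant publish quota (class and parameter names are illustrative):

```python
class TenantRateLimiter:
    """Token bucket per tenant: a noisy tenant is throttled at admission
    instead of degrading the shared cluster for everyone else."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.burst = burst
        self.buckets = {}  # tenant -> (tokens, last_refill_ts)

    def allow(self, tenant: str, now: float) -> bool:
        """Charge one token for a publish; False means reject/throttle."""
        tokens, last = self.buckets.get(tenant, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[tenant] = (tokens - 1.0, now)
            return True
        self.buckets[tenant] = (tokens, now)
        return False
```

Passing `now` explicitly keeps the sketch testable; a production limiter would read a monotonic clock and likely shard buckets across brokers.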
Confirm the delivery guarantee is at-most-once, not at-least-once. Ask about expected message throughput (messages per second), average message size, number of topics and consumer groups, and retention period. Clarify whether ordering is required globally or per-partition. Determine if multi-tenancy is in scope and whether the system must run in a single region or across multiple regions.
Sketch the main components: producer clients, a set of broker nodes that own topic partitions, a coordination service (ZooKeeper or equivalent) for metadata and consumer group management, consumer clients organized into consumer groups, and a monitoring stack. Show the publish flow: the producer selects a partition via key hashing, sends the message to the partition leader with acks=1, and the broker appends it to the log. Show the consume flow: the consumer fetches messages from its assigned partitions, commits the advanced offset first, and only then processes the messages.
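The publish flow can be sketched end to end. `Topic` here is an in-memory stand-in for a single broker, so leaders, followers, and segment files are collapsed into a plain list; it exists only to show key hashing and the acks=1 acknowledgment point:

```python
import hashlib

class Topic:
    """Toy stand-in for a partitioned topic on one broker."""

    def __init__(self, num_partitions: int):
        self.logs = [[] for _ in range(num_partitions)]

    def publish(self, key: bytes, value):
        """Hash the key to a partition, append to that partition's log,
        and ack with (partition, offset). Under acks=1 the leader acks
        after its local append only; async follower replication would
        start here, and a leader crash before it completes loses the
        tail -- acceptable under at-most-once."""
        p = int.from_bytes(hashlib.md5(key).digest()[:8], "big") % len(self.logs)
        self.logs[p].append(value)
        return p, len(self.logs[p]) - 1
```

Two publishes with the same key land on the same partition with dense, increasing offsets, which is exactly the per-partition ordering guarantee the design promises.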
Walk through a message lifecycle. A producer publishes a message to topic T, partition P. The broker leader receives the write, appends it to the segment file, and acknowledges the producer without waiting for follower replication. If the leader crashes before replication, the message is lost -- acceptable under at-most-once. A consumer in group G is assigned partition P. It fetches the next batch of messages starting from its committed offset. Before processing, it commits the new offset. If the consumer crashes after committing but before processing, those messages are skipped -- also acceptable. Explain how this differs from at-least-once (where you would commit after processing, risking redelivery on crash).
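The commit-order difference described above is tiny in code but decides the whole guarantee. A toy simulation -- in-memory log, simulated crash, not a real client API -- makes it concrete:

```python
def at_most_once_step(log, state, process):
    offset = state["committed"]
    state["committed"] = offset + 1   # commit BEFORE processing:
    process(log[offset])              # a crash here skips the message forever

def at_least_once_step(log, state, process):
    offset = state["committed"]
    process(log[offset])              # a crash here leaves the offset uncommitted,
    state["committed"] = offset + 1   # so the message is redelivered on restart

def drain(step, log, crash_once_on):
    """Drive a consumer over `log`, crashing exactly once mid-processing
    of one message, then 'restarting' from the committed offset."""
    state = {"committed": 0}
    processed, crashed = [], [False]

    def process(msg):
        if msg == crash_once_on and not crashed[0]:
            crashed[0] = True
            raise RuntimeError("simulated crash mid-processing")
        processed.append(msg)

    while state["committed"] < len(log):
        try:
            step(log, state, process)
        except RuntimeError:
            pass  # restart: resume from state["committed"]
    return processed
```

With a crash while processing "b", `drain(at_most_once_step, ["a", "b", "c"], "b")` yields `["a", "c"]` -- the message is lost but never duplicated -- while `drain(at_least_once_step, ...)` yields `["a", "b", "c"]`, because "b" is redelivered (and any side effect performed before the crash would have been applied twice).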
Cover consumer group rebalancing: when a consumer joins or leaves, the coordinator triggers a rebalance, reassigning partitions; during rebalance, consumption pauses briefly. Discuss backpressure: bounded buffers with drop policies, lag monitoring, and per-tenant rate limits. Address retention: time-based and size-based segment deletion, with optional compaction for keyed topics. Mention operational tooling: dashboards for per-topic throughput, consumer lag alerts, and self-service topic provisioning. Discuss scaling: add broker nodes and redistribute partitions, increase partition count for hot topics, and use rack-aware replica placement for fault tolerance.
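Consumer lag is the central operational signal in all of the above. Its computation is trivial -- the dict shapes here are illustrative, not a real metrics API -- which is worth saying out loud in an interview before describing the alerting built on top of it:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag for one consumer group: how far the group's
    committed offset trails the partition's log end offset. Sustained
    growth means the group cannot keep up, and under at-most-once drop
    policies that translates directly into messages at risk of loss."""
    return {p: end - committed_offsets.get(p, 0)
            for p, end in log_end_offsets.items()}
```

A partition with no committed offset (a group that has never consumed it, e.g. mid-rebalance) shows up with lag equal to the full log, which is usually the right default for alerting.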