Problem Statement
Design a telemetry ingestion and analytics platform that collects high-frequency sensor data and large video recordings from field devices such as drones. Each drone produces time-series readings (GPS coordinates, battery health, radar tracks, accelerometer output) at rates between 1 and 20 Hz, alongside high-definition video that can exceed 10 GB per test session. Multiple engineering teams need to upload complete test runs, search them by metadata, review synchronized telemetry and video playback, and run cross-session aggregations for trend analysis.
The system must handle hundreds of concurrent test uploads, absorb sustained write throughput of tens of thousands of data points per second, and durably store petabytes of historical data. Connectivity from field locations may be intermittent, so the upload pipeline must tolerate partial transmissions and resume without data loss. Your design should emphasize the separation of structured telemetry from unstructured video, write-path scalability with idempotent delivery, and efficient read paths for metadata search, time-range analytics, and video streaming.
Key Requirements
Functional
- Resumable test-run uploads -- Drones upload sensor streams and video files with support for chunked, resumable transfers over unreliable cellular or satellite links
- Metadata search and filtering -- Engineers locate test sessions by drone ID, test name, date range, geographic area, and custom tags with sub-second response times
- Synchronized playback -- Users view video alongside time-aligned telemetry overlays and can scrub to any timestamp to see corresponding sensor readings
- Cross-session analytics -- Support aggregate queries spanning thousands of sessions, such as battery degradation by temperature band or failure rate by route
- Dataset export -- Enable bulk export of filtered telemetry subsets for offline machine learning training and custom analysis pipelines
Non-Functional
- Scalability -- Handle 500 concurrent drones, 100K sensor writes per second, 50 TB of new video daily, and multi-year retention at petabyte scale
- Reliability -- Tolerate network drops, partial uploads, and service failures with at-least-once delivery and no data loss
- Latency -- Metadata searches return in under 500 ms, single-session time-series queries complete in under 2 seconds, and video streaming begins within 1 second
- Consistency -- Eventual consistency is acceptable for analytics rollups; strong consistency is required for session metadata to prevent duplicate processing
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Separating Time-Series Telemetry from Large Video Blobs
Interviewers expect you to recognize that high-frequency structured sensor data and multi-gigabyte video files have fundamentally different storage, transport, and query requirements and should flow through distinct pipelines.
Hints to consider:
- Route telemetry through a streaming ingestion layer into a time-series or wide-column store optimized for append-heavy, range-scan workloads
- Direct video files to object storage with resumable multipart uploads and lifecycle policies that transition older recordings to cheaper cold tiers
- Maintain a metadata catalog (session ID, timestamp ranges, file pointers) in a search-friendly index that links the two data types
- Discuss tradeoffs between storing raw video versus transcoding to multiple resolutions for bandwidth-efficient adaptive streaming
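The linking catalog above can be sketched as a small document that gets indexed into the search layer. The field names here are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class SessionCatalogEntry:
    # Hypothetical catalog schema: one entry per test session,
    # linking the telemetry time range to its video object pointers.
    session_id: str
    drone_id: str
    test_name: str
    start_ts_ms: int                                 # telemetry range start (epoch millis)
    end_ts_ms: int                                   # telemetry range end
    tags: list = field(default_factory=list)
    video_keys: list = field(default_factory=list)   # object-storage pointers

entry = SessionCatalogEntry(
    session_id="s-0042",
    drone_id="drone-17",
    test_name="coastal-survey",
    start_ts_ms=1_700_000_000_000,
    end_ts_ms=1_700_000_900_000,
    tags=["night", "high-wind"],
    video_keys=["videos/drone-17/s-0042/part-000.mp4"],
)
doc = asdict(entry)  # plain dict, ready to index into the search layer
```

Because the catalog carries only pointers, the search index stays small even as the underlying telemetry and video grow to petabytes.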
2. Write-Path Scalability and Idempotent Ingestion
With thousands of sensors across hundreds of drones producing continuous streams, the ingestion layer must handle massive write throughput without losing data or creating duplicates during retries.
Hints to consider:
- Place a distributed message queue such as Kafka as a durable buffer between drones and storage, enabling backpressure, replay, and fan-out to multiple consumers
- Partition by drone ID to preserve per-drone ordering while spreading load across brokers
- Assign idempotency keys (drone_id plus sequence_number plus timestamp) so retries from unreliable networks are deduplicated on the server side
- Batch writes to reduce per-message overhead and improve storage efficiency, balanced against freshness needs
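A minimal sketch of server-side deduplication using the composite idempotency key described above. The in-memory set stands in for what would realistically be a TTL'd key-value store; record field names are assumptions:

```python
def make_idempotency_key(drone_id, session_id, sequence_number):
    # Composite key: stable across retries of the same payload.
    return f"{drone_id}:{session_id}:{sequence_number}"

class DedupWriter:
    """Drops replayed messages before they reach the time-series store.
    The `seen` set is an in-memory stand-in for a shared dedup store."""

    def __init__(self, store):
        self.store = store   # list standing in for the database
        self.seen = set()

    def write(self, record):
        key = make_idempotency_key(
            record["drone_id"], record["session_id"], record["seq"]
        )
        if key in self.seen:
            return False     # duplicate retry from an unreliable link -- ignore
        self.seen.add(key)
        self.store.append(record)
        return True
```

With this in place, at-least-once delivery from the queue becomes effectively exactly-once at the storage layer, since replays are filtered by key.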
3. Query Performance Across Diverse Access Patterns
The system must serve fast metadata filtering, time-series range scans, cross-session aggregations, and video retrieval. No single datastore excels at all four, so interviewers look for a polyglot persistence strategy.
Hints to consider:
- Use a search index such as Elasticsearch for metadata filtering with rich query DSL and faceted search
- Store time-series telemetry in a columnar or wide-row database like Cassandra or ClickHouse optimized for range scans and temporal aggregations
- Precompute rollups and materialized views for common aggregation queries to avoid full table scans
- Serve video through a CDN or edge caching layer for frequently accessed segments, backed by object storage
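To make the wide-row layout concrete, here is a sketch of how a `(drone_id, time_bucket)` partition key turns a time-range query into a bounded set of targeted partition reads rather than a full scan. The one-hour bucket size is an assumed tuning choice:

```python
BUCKET_MS = 3_600_000  # one-hour buckets (an assumed tuning choice)

def partition_key(drone_id, ts_ms):
    """Partition key under which a single reading is stored."""
    return (drone_id, ts_ms // BUCKET_MS)

def partitions_for_range(drone_id, start_ms, end_ms):
    """Enumerate the partitions a range scan must touch, so the query
    layer can fan out a bounded number of targeted reads."""
    first = start_ms // BUCKET_MS
    last = end_ms // BUCKET_MS
    return [(drone_id, b) for b in range(first, last + 1)]
```

Bucket size is a tradeoff: larger buckets mean fewer partitions per query but hotter partitions during ingest, so it should be sized against per-drone write rates.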
4. Handling Unreliable Field Connectivity
Drones operate in environments with spotty coverage, requiring resilient upload protocols and intelligent local buffering strategies that prevent data loss during outages.
Hints to consider:
- Support chunked uploads with resumable protocols so connection drops do not force a restart from the beginning
- Buffer data locally on the drone with disk spillover when memory fills, and flush when connectivity returns
- Apply exponential backoff with jitter on retry to avoid thundering herd when many drones reconnect simultaneously
- Consider a hybrid push/pull model where lightweight metadata is pushed immediately while larger payloads are scheduled for pull during maintenance windows
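The backoff-with-jitter idea above can be sketched as full-jitter exponential backoff wrapped in a simple retry loop; the parameter defaults are illustrative assumptions:

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter exponential backoff: a delay drawn uniformly from
    [0, min(cap, base * 2**attempt)] so reconnecting drones spread out
    instead of retrying in lockstep."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def upload_with_retry(send, payload, max_attempts=6, base=1.0):
    """Retry `send` on connection failures, sleeping a jittered,
    exponentially growing delay between attempts."""
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except ConnectionError:
            time.sleep(backoff_delay(attempt, base=base))
    raise RuntimeError("upload failed after retries")
```

Full jitter (randomizing over the whole interval rather than adding a small offset) is what actually breaks up the thundering herd when hundreds of drones regain connectivity at once.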
5. Cost Optimization and Data Lifecycle Management
Storing petabytes of telemetry and video over multiple years dominates operational costs, requiring thoughtful tiering and retention policies.
Hints to consider:
- Transition video to infrequent-access or archival storage tiers after a defined active analysis window
- Apply time-based TTLs or downsampling for raw telemetry, preserving full resolution for recent data and hourly or daily rollups for historical data
- Use columnar compression formats for cold telemetry to reduce storage footprint
- Separate high-value sessions (crashes, edge cases) from routine tests with different retention policies
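The downsampling step above can be sketched as a rollup that collapses raw points into per-hour summaries, which is the shape retained after the raw-resolution TTL expires. Keeping min/avg/max (rather than avg alone) is an assumed choice that preserves outliers for failure analysis:

```python
from collections import defaultdict

BUCKET_MS = 3_600_000  # one-hour rollup buckets

def hourly_rollup(readings):
    """Collapse raw (ts_ms, value) points into per-hour min/avg/max,
    the resolution kept for historical data after raw TTL expiry."""
    buckets = defaultdict(list)
    for ts_ms, value in readings:
        buckets[ts_ms // BUCKET_MS].append(value)
    return {
        hour: {"min": min(vals), "max": max(vals), "avg": sum(vals) / len(vals)}
        for hour, vals in buckets.items()
    }
```

At 20 Hz, an hour of raw readings is 72,000 points per sensor; a three-value rollup reduces that by four orders of magnitude, which is where the multi-year storage savings come from.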
Suggested Approach
Step 1: Clarify Requirements
Start by confirming the scale and priorities with your interviewer. Ask about the number of active drones, expected growth over the next two to three years, and the typical duration and data volume of a test session. Clarify which query patterns are most critical and whether real-time monitoring is needed or batch analytics is sufficient. Determine acceptable data-availability SLAs after a test completes (immediate, minutes, hours). Ask about compliance requirements for data retention, encryption at rest, or geographic restrictions, and confirm the acceptable data loss tolerance.
Step 2: High-Level Architecture
Sketch a diagram with these core components:
Ingestion Layer: Drones upload through a load-balanced API gateway. Telemetry flows into Kafka partitioned by drone_id. Video uploads go directly to object storage using multipart upload with pre-signed URLs.
Storage Layer: A Kafka consumer writes telemetry to Cassandra using wide rows keyed on (drone_id, time_bucket) for time-series queries. A separate consumer indexes session metadata into Elasticsearch. Video remains in object storage with pointers stored in the search index.
Query Layer: Application servers route metadata searches to Elasticsearch, time-series queries to Cassandra, and video streaming through a CDN backed by object storage.
Processing Layer: Batch processing jobs (Spark or Flink) read from Kafka or Cassandra to compute rollups, detect anomalies, and produce aggregated views for cross-session analytics.
Step 3: Deep Dive on Telemetry Ingestion
Walk through the telemetry ingestion path in detail:
- A drone creates a session and receives a session_id and upload credentials from the API
- Telemetry is batched locally in one-second windows and pushed to a Kafka topic with key equal to drone_id; each payload includes session_id, timestamp, and sensor readings
- A consumer group reads from Kafka, deduplicates using the composite key of session_id, drone_id, and sequence_number, and writes to Cassandra
- Another consumer indexes session metadata in Elasticsearch whenever it encounters a new session_id, enabling fast filtering and search
- Discuss handling late-arriving data, schema evolution as new sensor types appear, and backfill scenarios
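The one-second batching step on the drone side can be sketched as a small windowing buffer that hands each completed window to a publish callback (standing in for the Kafka producer). The message shape and sequence numbering are illustrative assumptions:

```python
class TelemetryBatcher:
    """Groups readings into fixed-length windows before publishing, so
    each message carries a batch of points instead of a single reading.
    `publish` stands in for the Kafka producer send call."""

    def __init__(self, publish, window_ms=1000):
        self.publish = publish
        self.window_ms = window_ms
        self.window_start = None
        self.buffer = []
        self.seq = 0   # per-session sequence number used for dedup downstream

    def add(self, ts_ms, reading):
        if self.window_start is None:
            self.window_start = ts_ms
        if ts_ms - self.window_start >= self.window_ms:
            self.flush()
            self.window_start = ts_ms
        self.buffer.append((ts_ms, reading))

    def flush(self):
        if self.buffer:
            self.publish({"seq": self.seq, "points": list(self.buffer)})
            self.seq += 1
            self.buffer.clear()
```

The monotonically increasing `seq` is what the downstream consumer combines with session_id and drone_id to deduplicate retried batches.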
Step 4: Address Secondary Concerns
Cover remaining non-functional requirements:
Video Pipeline: Explain the multipart upload flow with retry logic, object-storage lifecycle policies that transition recordings to cold tiers after a defined window, and an optional transcoding pipeline to generate multiple bitrates for adaptive streaming.
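The resumable part of the multipart flow can be sketched as follows: the file is split into fixed-size parts, acknowledged part numbers are persisted, and a resumed upload skips straight past them. The part size is an assumed tuning choice, and `put_chunk` stands in for a PUT to a pre-signed part URL:

```python
CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB parts (an assumed tuning choice)

def upload_resumable(data, put_chunk, done_parts):
    """Upload `data` in fixed-size parts, skipping parts already in
    `done_parts` (the ack set persisted across connection drops)."""
    total = (len(data) + CHUNK_SIZE - 1) // CHUNK_SIZE
    for part in range(total):
        if part in done_parts:
            continue                # already acknowledged -- resume past it
        chunk = data[part * CHUNK_SIZE:(part + 1) * CHUNK_SIZE]
        put_chunk(part, chunk)      # e.g. PUT against a pre-signed part URL
        done_parts.add(part)
    return total
```

For a 10 GB session video this means a connection drop costs at most one part of re-transfer, rather than forcing a restart from byte zero.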
Monitoring and Observability: Track ingestion lag via Kafka consumer offset monitoring, write throughput, query latency at the 99th percentile, storage growth rate, and failed upload retries. Alert on stalled consumers and capacity thresholds.
Security: Authenticate drone-to-cloud connections via mTLS, encrypt data in transit with TLS and at rest with KMS-managed keys, and enforce role-based access control on query APIs.
Disaster Recovery: Replicate Kafka across availability zones for the telemetry buffer, enable cross-region replication for object storage, and configure Cassandra with a replication factor of three. Discuss RTO and RPO targets and backup strategies.