Problem Statement
Design a system that ingests high-frequency sensor data and video from drone test flights and makes the collected data available for search, analysis, and synchronized playback by engineering teams. Each test run produces time-series telemetry streams (GPS coordinates, battery health, radar tracks, accelerometer readings) at rates of 1 to 20 Hz, along with high-resolution video that can reach 10-20 GB per session. Drones operate in environments with intermittent cellular connectivity.
The core engineering challenges are reliably uploading large volumes of structured telemetry and unstructured video over unreliable networks, storing petabytes of data cost-effectively, supporting diverse query patterns (metadata search, time-range scans, cross-session aggregations), and aligning video playback with telemetry timestamps. You will need to reason about write-path scalability, idempotent ingestion, separate storage strategies for hot telemetry and cold blobs, and tiered data lifecycle management.
Key Requirements
Functional
- Test session upload -- drones upload time-series sensor streams and associated video for each test run, with support for resumable uploads over intermittent connections
- Search and filter -- engineers find test sessions by metadata such as drone ID, test name, date range, location, and custom tags
- Synchronized playback -- users view video with overlaid telemetry, scrub to specific timestamps, and see aligned sensor readings
- Cross-session analytics -- teams run aggregations and export datasets spanning many test runs (for example, battery degradation trends across the fleet)
Non-Functional
- Scalability -- handle 500-plus concurrent vehicles, 100,000 telemetry writes per second, 50 TB of new video daily, and multi-year retention at petabyte scale
- Reliability -- tolerate vehicle network drops, partial uploads, and service failures with no data loss; guarantee at-least-once delivery for telemetry
- Latency -- metadata searches return in under 500 ms; time-series queries over a single session complete in under two seconds; video streaming starts in under one second
- Consistency -- eventual consistency acceptable for analytics views; strong consistency for session metadata to prevent duplicate processing
What Interviewers Focus On
Based on real interview experiences at Anduril, Snowflake, and Qualtrics, these are the areas interviewers probe most deeply:
1. Separating Telemetry from Video
Interviewers want to see that you recognize that structured time-series data and multi-gigabyte video files have fundamentally different storage, ingestion, and retrieval characteristics and must be handled by separate pipelines.
Hints to consider:
- Route telemetry through a streaming ingestion layer (Kafka) into a time-series or wide-column store (Apache Cassandra, TimescaleDB) optimized for append-heavy workloads
- Upload video files directly to object storage (Amazon S3 or Google Cloud Storage) using multipart resumable uploads with pre-signed URLs
- Maintain a session catalog in a search index (Elasticsearch) that links telemetry partition keys and video object keys back to the session ID
- Discuss the tradeoff between storing raw video versus transcoding to multiple resolutions for adaptive bitrate playback
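To make the session-catalog idea concrete, here is a minimal sketch of the document such a catalog might index per session, linking telemetry partition keys and video object keys back to the session ID. All field names and the partition-key format are illustrative assumptions, not a fixed schema.

```python
def build_catalog_doc(session_id, drone_id, started_at, tags, video_keys):
    """Build the document indexed into the search layer for one test session.

    started_at is an ISO-8601 UTC string; the telemetry_partition format
    (drone_id:date) is an assumed convention mirroring the time-series
    store's partition key.
    """
    return {
        "session_id": session_id,
        "drone_id": drone_id,
        "started_at": started_at,
        "tags": sorted(tags),
        # Pointer into the time-series store: which partition holds this run.
        "telemetry_partition": f"{drone_id}:{started_at[:10]}",
        # Pointers into object storage for the raw video objects.
        "video_object_keys": list(video_keys),
    }

doc = build_catalog_doc(
    "sess-042", "drone-7", "2024-05-01T14:03:00Z",
    ["outdoor", "battery-test"],
    ["sessions/sess-042/video/part-000.mp4"],
)
```

Querying by drone ID, tag, or date range then becomes a search over these small documents rather than a scan of the heavy telemetry or video stores.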
2. Write-Path Scalability and Idempotency
Hundreds of drones streaming telemetry simultaneously create sustained write pressure. Retries over unreliable networks make idempotency essential to avoid duplicates.
Hints to consider:
- Use Kafka as a durable ingestion buffer, partitioned by drone ID so that per-drone ordering is preserved while writes are distributed across brokers
- Deduplicate at the consumer using a composite key of (session_id, drone_id, sequence_number); store recently seen keys in a fast lookup (Redis set with TTL)
- Batch telemetry writes to Cassandra to reduce per-message overhead while balancing freshness requirements
- On the vehicle side, buffer data locally during connectivity gaps and resume uploads with sequence numbers to allow the server to detect and skip duplicates
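The dedup step above can be sketched with an in-memory stand-in for the Redis set with TTL; the composite key and the 300-second window are assumptions for illustration.

```python
import time

class DedupWindow:
    """In-memory stand-in for a Redis set with TTL: remembers composite keys
    (session_id, drone_id, sequence_number) long enough to drop retried
    messages from unreliable uplinks."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen = {}  # composite key -> expiry time

    def is_duplicate(self, session_id, drone_id, seq):
        now = self.clock()
        # Evict expired keys lazily so memory stays bounded.
        self._seen = {k: t for k, t in self._seen.items() if t > now}
        key = (session_id, drone_id, seq)
        if key in self._seen:
            return True
        self._seen[key] = now + self.ttl
        return False
```

A real deployment would use a shared Redis `SET ... NX EX ttl` so every consumer in the group sees the same window, but the logic is the same: first sight writes the key, a retry within the TTL is dropped.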
3. Supporting Multiple Query Patterns
The system must serve metadata filtering, time-range telemetry scans, cross-session aggregations, and video retrieval. No single datastore excels at all of these.
Hints to consider:
- Use Elasticsearch for metadata search with rich query DSL, faceted filtering, and geospatial queries to locate test sessions quickly
- Store time-series data in Cassandra with a partition key of (drone_id, date_bucket) and a clustering key of timestamp for efficient range scans
- Precompute rollups and materialized views (hourly or daily aggregates) in a columnar store like ClickHouse for cross-session analytics
- Serve video through a CDN (Amazon CloudFront) backed by S3 for low-latency streaming with adaptive bitrate support
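As a small illustration of the partitioning scheme above, this helper maps a reading to its (drone_id, date_bucket) partition key; daily bucketing is an assumption, and real schemas sometimes use smaller buckets to cap partition size.

```python
from datetime import datetime, timezone

def partition_for(drone_id, ts_epoch_s):
    """Compute the (drone_id, date_bucket) partition key for a telemetry
    reading, so one day of one drone's data lands in a single partition and
    time-range scans within a session touch few partitions."""
    day = datetime.fromtimestamp(ts_epoch_s, tz=timezone.utc).strftime("%Y-%m-%d")
    return (drone_id, day)
```

Within each partition, rows are ordered by the timestamp clustering key, so a scrub-to-timestamp query is a single sorted range read.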
4. Handling Unreliable Networks
Drones operate in the field with spotty connectivity. The upload protocol must be resilient to connection drops, partial transfers, and burst reconnections.
Hints to consider:
- Support chunked, resumable uploads using the TUS protocol or S3 multipart upload so that only incomplete chunks need retransmission
- Buffer telemetry locally on the drone with spillover to disk when memory fills; use exponential backoff with jitter on reconnection to avoid thundering herd
- Consider a hybrid push/pull model: drones push lightweight telemetry immediately but defer large video uploads to maintenance windows or high-bandwidth connections
- Implement upload manifests so the server knows which chunks have been received and can request only the missing ones
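Two of the mechanics above fit in a few lines: diffing an upload manifest to find chunks that still need retransmission, and exponential backoff with full jitter on reconnect. The chunk-indexing scheme and backoff parameters are illustrative assumptions.

```python
import random

def missing_chunks(total_chunks, received):
    """Given the chunk count from the upload manifest and the set of chunk
    indices the server has acknowledged, return the indices to retransmit."""
    return sorted(set(range(total_chunks)) - set(received))

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: spreads reconnecting drones
    across the interval to avoid a thundering herd after an outage."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

On reconnect a drone asks the server for its manifest, retransmits only `missing_chunks(...)`, and sleeps `backoff_delay(attempt)` between failed attempts.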
5. Data Lifecycle and Cost Management
Multi-year retention of petabytes of data dominates operational cost. Interviewers expect thoughtful tiering and retention policies.
Hints to consider:
- Transition video to infrequent-access or archival tiers (S3 Glacier) after 90 days, keeping hot data on standard tiers for active analysis
- Downsample raw telemetry over time: retain full-resolution data for 30 days, hourly rollups for one year, and daily rollups indefinitely
- Use columnar compression (Apache Parquet) for cold telemetry exports to shrink storage footprint
- Tag high-value sessions (crashes, anomalies) for extended retention while applying aggressive TTLs to routine tests
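The downsampling policy above can be sketched as a rollup function; averaging per hour is one reasonable aggregate, and real pipelines usually also keep min/max/count.

```python
from collections import defaultdict

def hourly_rollup(samples):
    """Downsample raw (epoch_seconds, value) samples into per-hour averages,
    the shape retained after the 30-day full-resolution window expires."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts // 3600].append(value)
    # Key each rollup row by the start of its hour.
    return {hour * 3600: sum(vals) / len(vals) for hour, vals in buckets.items()}
```

A scheduled job runs this over expiring raw partitions, writes the rollups to the analytics store, and lets the TTL reclaim the raw rows.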
Suggested Approach
Step 1: Clarify Requirements
Confirm the number of concurrent vehicles, typical session duration, sensor frequency range, and video resolution. Ask about query priorities: is real-time monitoring needed or is batch analytics sufficient? Clarify SLAs for data availability after a test completes (immediately, minutes, hours). Determine compliance requirements around encryption, geographic restrictions, and retention periods. Verify acceptable data loss tolerance.
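A quick back-of-envelope pass on the stated numbers helps anchor the clarifying questions; the 200-byte average telemetry reading is an assumption.

```python
# Video: 50 TB/day of new footage at multi-year retention.
video_pb_per_year = 50e12 * 365 / 1e15        # ~18.25 PB of video per year

# Telemetry: 100,000 writes/sec, assuming ~200 bytes per reading.
telemetry_tb_per_day = 100_000 * 200 * 86_400 / 1e12   # ~1.7 TB/day raw
```

The takeaway: video dominates storage cost by orders of magnitude, which is why tiering and lifecycle policies center on the video pipeline while telemetry can afford a longer hot window.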
Step 2: High-Level Architecture
Sketch the system in layers:
- Ingestion layer -- vehicles upload via a load-balanced API gateway; telemetry flows into Kafka partitioned by drone ID; video uploads go to S3 via multipart pre-signed URLs
- Storage layer -- a Kafka consumer writes telemetry to Cassandra (wide rows keyed by drone_id and date_bucket); another consumer indexes session metadata into Elasticsearch; video remains in S3 with pointers stored in Elasticsearch
- Query layer -- application servers handle metadata searches via Elasticsearch, time-series queries via Cassandra, and video streaming via CloudFront backed by S3
- Processing layer -- Apache Spark jobs consume from Kafka to compute rollups, detect anomalies, and produce aggregated views for cross-session analytics
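The "partitioned by drone ID" choice in the ingestion layer can be made explicit with a stable hash partitioner; Kafka's default key-hashing partitioner does the same job, so this sketch just shows why the mapping preserves per-drone ordering.

```python
import hashlib

def partition_for_drone(drone_id, num_partitions):
    """Stable hash partitioner: the same drone always maps to the same
    partition, so its messages stay ordered, while different drones spread
    across brokers."""
    digest = hashlib.md5(drone_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because ordering is only per-partition in Kafka, keying by drone ID is what makes "per-drone ordering preserved, load distributed" actually hold.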
Step 3: Deep Dive on Telemetry Ingestion
Walk through the telemetry write path in detail. A drone creates a session and receives a session ID and upload credentials from the API. Telemetry is batched locally in one-second windows and pushed to a Kafka topic with key equal to drone ID; the payload includes session ID, timestamp, sequence number, and sensor readings. A consumer group reads from Kafka, deduplicates using the composite key, and writes to Cassandra with the schema PRIMARY KEY ((drone_id, date), timestamp) for efficient time-range queries. A second consumer indexes session metadata to Elasticsearch whenever it encounters a new session ID. Discuss handling late-arriving data, schema evolution for new sensor types, and backfill scenarios.
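The one-second batching step can be sketched as follows; the payload shape and per-session sequence numbering are assumptions matching the write path described above.

```python
def batch_by_second(readings):
    """Group (epoch_seconds, payload) readings into one-second windows and
    attach a monotonically increasing sequence number, which the consumer
    later uses as part of its dedup key."""
    windows = {}
    for ts, payload in sorted(readings):
        windows.setdefault(int(ts), []).append(payload)
    return [
        {"sequence_number": seq, "window_start": start, "readings": items}
        for seq, (start, items) in enumerate(sorted(windows.items()))
    ]
```

Each batch becomes one Kafka message keyed by drone ID; on retry the same sequence number is resent, letting the consumer detect and skip the duplicate.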
Step 4: Address Secondary Concerns
Cover the video pipeline: multipart upload with retry logic, S3 lifecycle policies to Glacier after 90 days, and optional transcoding for adaptive bitrate streaming. Discuss monitoring and observability: metrics on Kafka consumer lag, Cassandra write throughput, query latency percentiles, and storage growth rate. Address security: mTLS for drone-to-cloud connections, encryption at rest via KMS, and role-based access control for query APIs. Mention disaster recovery: multi-region Kafka replication, cross-region S3 replication, and Cassandra replication factor of three.
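The Glacier transition mentioned above is expressed declaratively as an S3 lifecycle rule; this dict follows the shape that `put_bucket_lifecycle_configuration` accepts, with the rule ID and key prefix as placeholder assumptions.

```python
# One lifecycle rule: move session video to Glacier 90 days after creation.
lifecycle_rule = {
    "ID": "archive-session-video",
    "Filter": {"Prefix": "sessions/"},   # placeholder key prefix
    "Status": "Enabled",
    "Transitions": [
        {"Days": 90, "StorageClass": "GLACIER"},
    ],
}

lifecycle_config = {"Rules": [lifecycle_rule]}
```

Applying it via boto3 would look like `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle_config)`; the point for the interview is that tiering is a bucket policy, not application code.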
Related Learning Resources
- Building block: Blob Storage -- foundational concepts for storing and retrieving large binary objects like video files
- Building block: Message Queues -- covers durable ingestion buffers, partitioning, and backpressure patterns used in the telemetry pipeline