Problem Statement
Design a system that ingests high-frequency sensor data and video from drone test flights and makes the collected data available for search, analysis, and synchronized playback by engineering teams. Each test run produces time-series telemetry streams (GPS coordinates, battery health, radar tracks, accelerometer readings) at rates of 1 to 20 Hz, along with high-resolution video that can reach 10-20 GB per session. Drones operate in environments with intermittent cellular connectivity.
The core engineering challenges are reliably uploading large volumes of structured telemetry and unstructured video over unreliable networks, storing petabytes of data cost-effectively, supporting diverse query patterns (metadata search, time-range scans, cross-session aggregations), and aligning video playback with telemetry timestamps. You will need to reason about write-path scalability, idempotent ingestion, separate storage strategies for hot telemetry and cold blobs, and tiered data lifecycle management.
Key Requirements
Functional
- Test session upload -- drones upload time-series sensor streams and associated video for each test run, with support for resumable uploads over intermittent connections
- Search and filter -- engineers find test sessions by metadata such as drone ID, test name, date range, location, and custom tags
- Synchronized playback -- users view video with overlaid telemetry, scrub to specific timestamps, and see aligned sensor readings
- Cross-session analytics -- teams run aggregations and export datasets spanning many test runs (for example, battery degradation trends across the fleet)
Non-Functional
- Scalability -- handle 500-plus concurrent vehicles, 100,000 telemetry writes per second, 50 TB of new video daily, and multi-year retention at petabyte scale
- Reliability -- tolerate vehicle network drops, partial uploads, and service failures with no data loss; guarantee at-least-once delivery for telemetry
- Latency -- metadata searches return in under 500 ms; time-series queries over a single session complete in under two seconds; video streaming starts in under one second
- Consistency -- eventual consistency acceptable for analytics views; strong consistency for session metadata to prevent duplicate processing
What Interviewers Focus On
Based on real interview experiences at Anduril, Snowflake, and Qualtrics, these are the areas interviewers probe most deeply:
1. Separating Telemetry from Video
Interviewers want to see that you recognize that structured time-series data and multi-gigabyte video files have fundamentally different storage, ingestion, and retrieval characteristics and must be handled by separate pipelines.
Hints to consider:
- Route telemetry through a streaming ingestion layer (Kafka) into a time-series or wide-column store (Apache Cassandra, TimescaleDB) optimized for append-heavy workloads
- Upload video files directly to object storage (Amazon S3 or Google Cloud Storage) using multipart resumable uploads with pre-signed URLs
- Maintain a session catalog in a search index (Elasticsearch) that links telemetry partition keys and video object keys back to the session ID
- Discuss the tradeoff between storing raw video versus transcoding to multiple resolutions for adaptive bitrate playback
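To make the session-catalog idea concrete, here is a minimal sketch of the document such a catalog might index per session, linking telemetry partition keys and video object keys back to the session ID. All field names and the partition-key format are illustrative assumptions, not a fixed schema.

```python
def build_catalog_doc(session_id, drone_id, started_at, tags, video_keys):
    """Build the document indexed into the search layer for one test session.

    started_at is an ISO-8601 UTC string; the telemetry_partition format
    (drone_id:date) is an assumed convention mirroring the time-series
    store's partition key.
    """
    return {
        "session_id": session_id,
        "drone_id": drone_id,
        "started_at": started_at,
        "tags": sorted(tags),
        # Pointer into the time-series store: which partition holds this run.
        "telemetry_partition": f"{drone_id}:{started_at[:10]}",
        # Pointers into object storage for the raw video objects.
        "video_object_keys": list(video_keys),
    }

doc = build_catalog_doc(
    "sess-042", "drone-7", "2024-05-01T14:03:00Z",
    ["outdoor", "battery-test"],
    ["sessions/sess-042/video/part-000.mp4"],
)
```

Querying by drone ID, tag, or date range then becomes a search over these small documents rather than a scan of the heavy telemetry or video stores.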
2. Write-Path Scalability and Idempotency
Hundreds of drones streaming telemetry simultaneously create sustained write pressure. Retries over unreliable networks make idempotency essential to avoid duplicates.
Hints to consider:
- Use Kafka as a durable ingestion buffer, partitioned by drone ID so that per-drone ordering is preserved while writes are distributed across brokers
- Deduplicate at the consumer using a composite key of (session_id, drone_id, sequence_number); store recently seen keys in a fast lookup (Redis set with TTL)
- Batch telemetry writes to Cassandra to reduce per-message overhead while balancing freshness requirements
- On the vehicle side, buffer data locally during connectivity gaps and resume uploads with sequence numbers to allow the server to detect and skip duplicates
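The dedup step above can be sketched with an in-memory stand-in for the Redis set with TTL; the composite key and the 300-second window are assumptions for illustration.

```python
import time

class DedupWindow:
    """In-memory stand-in for a Redis set with TTL: remembers composite keys
    (session_id, drone_id, sequence_number) long enough to drop retried
    messages from unreliable uplinks."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen = {}  # composite key -> expiry time

    def is_duplicate(self, session_id, drone_id, seq):
        now = self.clock()
        # Evict expired keys lazily so memory stays bounded.
        self._seen = {k: t for k, t in self._seen.items() if t > now}
        key = (session_id, drone_id, seq)
        if key in self._seen:
            return True
        self._seen[key] = now + self.ttl
        return False
```

A real deployment would use a shared Redis `SET ... NX EX ttl` so every consumer in the group sees the same window, but the logic is the same: first sight writes the key, a retry within the TTL is dropped.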
3. Supporting Multiple Query Patterns
The system must serve metadata filtering, time-range telemetry scans, cross-session aggregations, and video retrieval. No single datastore excels at all of these.
Hints to consider:
- Use Elasticsearch for metadata search with rich query DSL, faceted filtering, and geospatial queries to locate test sessions quickly
- Store time-series data in Cassandra with a partition key of (drone_id, date_bucket) and a clustering key of timestamp for efficient range scans
- Precompute rollups and materialized views (hourly or daily aggregates) in a columnar store like ClickHouse for cross-session analytics
- Serve video through a CDN (Amazon CloudFront) backed by S3 for low-latency streaming with adaptive bitrate support
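As a small illustration of the partitioning scheme above, this helper maps a reading to its (drone_id, date_bucket) partition key; daily bucketing is an assumption, and real schemas sometimes use smaller buckets to cap partition size.

```python
from datetime import datetime, timezone

def partition_for(drone_id, ts_epoch_s):
    """Compute the (drone_id, date_bucket) partition key for a telemetry
    reading, so one day of one drone's data lands in a single partition and
    time-range scans within a session touch few partitions."""
    day = datetime.fromtimestamp(ts_epoch_s, tz=timezone.utc).strftime("%Y-%m-%d")
    return (drone_id, day)
```

Within each partition, rows are ordered by the timestamp clustering key, so a scrub-to-timestamp query is a single sorted range read.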
4. Handling Unreliable Networks
Drones operate in the field with spotty connectivity. The upload protocol must be resilient to connection drops, partial transfers, and burst reconnections.
Hints to consider:
- Support chunked, resumable uploads using the TUS protocol or S3 multipart upload so that only incomplete chunks need retransmission
- Buffer telemetry locally on the drone with spillover to disk when memory fills; use exponential backoff with jitter on reconnection to avoid thundering herd
- Consider a hybrid push/pull model: drones push lightweight telemetry immediately but defer large video uploads to maintenance windows or high-bandwidth connections
- Implement upload manifests so the server knows which chunks have been received and can request only the missing ones
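Two of the mechanics above fit in a few lines: diffing an upload manifest to find chunks that still need retransmission, and exponential backoff with full jitter on reconnect. The chunk-indexing scheme and backoff parameters are illustrative assumptions.

```python
import random

def missing_chunks(total_chunks, received):
    """Given the chunk count from the upload manifest and the set of chunk
    indices the server has acknowledged, return the indices to retransmit."""
    return sorted(set(range(total_chunks)) - set(received))

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: spreads reconnecting drones
    across the interval to avoid a thundering herd after an outage."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

On reconnect a drone asks the server for its manifest, retransmits only `missing_chunks(...)`, and sleeps `backoff_delay(attempt)` between failed attempts.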
5. Data Lifecycle and Cost Management
Multi-year retention of petabytes of data dominates operational cost. Interviewers expect thoughtful tiering and retention policies.
Hints to consider:
- Transition video to infrequent-access or archival tiers (S3 Glacier) after 90 days, keeping hot data on standard tiers for active analysis
- Downsample raw telemetry over time: retain full-resolution data for 30 days, hourly rollups for one year, and daily rollups indefinitely
- Use columnar compression (Apache Parquet) for cold telemetry exports to shrink storage footprint
- Tag high-value sessions (crashes, anomalies) for extended retention while applying aggressive TTLs to routine tests
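The downsampling policy above can be sketched as a rollup function; averaging per hour is one reasonable aggregate, and real pipelines usually also keep min/max/count.

```python
from collections import defaultdict

def hourly_rollup(samples):
    """Downsample raw (epoch_seconds, value) samples into per-hour averages,
    the shape retained after the 30-day full-resolution window expires."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts // 3600].append(value)
    # Key each rollup row by the start of its hour.
    return {hour * 3600: sum(vals) / len(vals) for hour, vals in buckets.items()}
```

A scheduled job runs this over expiring raw partitions, writes the rollups to the analytics store, and lets the TTL reclaim the raw rows.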
Suggested Approach
Step 1: Clarify Requirements
Confirm the number of concurrent vehicles, typical session duration, sensor frequency range, and video resolution. Ask about query priorities: is real-time monitoring needed or is batch analytics sufficient? Clarify SLAs for data availability after a test completes (immediately, minutes, hours). Determine compliance requirements around encryption, geographic restrictions, and retention periods. Verify acceptable data loss tolerance.
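A quick back-of-envelope pass on the stated numbers helps anchor the clarifying questions; the 200-byte average telemetry reading is an assumption.

```python
# Video: 50 TB/day of new footage at multi-year retention.
video_pb_per_year = 50e12 * 365 / 1e15        # ~18.25 PB of video per year

# Telemetry: 100,000 writes/sec, assuming ~200 bytes per reading.
telemetry_tb_per_day = 100_000 * 200 * 86_400 / 1e12   # ~1.7 TB/day raw
```

The takeaway: video dominates storage cost by orders of magnitude, which is why tiering and lifecycle policies center on the video pipeline while telemetry can afford a longer hot window.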
Step 2: High-Level Architecture
Sketch the system in layers:
- Ingestion layer -- vehicles upload via a load-balanced API gateway; telemetry flows into Kafka partitioned by drone ID; video uploads go to S3 via multipart pre-signed URLs
- Storage layer -- a Kafka consumer writes telemetry to Cassandra (wide rows keyed by drone_id and date_bucket); another consumer indexes session metadata into Elasticsearch; video remains in S3 with pointers stored in Elasticsearch
- Query layer -- application servers handle metadata searches via Elasticsearch, time-series queries via Cassandra, and video streaming via CloudFront backed by S3
- Processing layer -- Apache Spark jobs consume from Kafka to compute rollups, detect anomalies, and produce aggregated views for cross-session analytics
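The "partitioned by drone ID" choice in the ingestion layer can be made explicit with a stable hash partitioner; Kafka's default key-hashing partitioner does the same job, so this sketch just shows why the mapping preserves per-drone ordering.

```python
import hashlib

def partition_for_drone(drone_id, num_partitions):
    """Stable hash partitioner: the same drone always maps to the same
    partition, so its messages stay ordered, while different drones spread
    across brokers."""
    digest = hashlib.md5(drone_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because ordering is only per-partition in Kafka, keying by drone ID is what makes "per-drone ordering preserved, load distributed" actually hold.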
Step 3: Deep Dive on Telemetry Ingestion
Walk through the telemetry write path in detail. A drone creates a session and receives a session ID and upload credentials from the API. Telemetry is batched locally in one-second windows and pushed to a Kafka topic with key equal to drone ID; the payload includes session ID, timestamp, sequence number, and sensor readings. A consumer group reads from Kafka, deduplicates using the composite key, and writes to Cassandra with the schema PRIMARY KEY ((drone_id, date), timestamp) for efficient time-range queries. A second consumer indexes session metadata to Elasticsearch whenever it encounters a new session ID. Discuss handling late-arriving data, schema evolution for new sensor types, and backfill scenarios.
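The one-second batching step can be sketched as follows; the payload shape and per-session sequence numbering are assumptions matching the write path described above.

```python
def batch_by_second(readings):
    """Group (epoch_seconds, payload) readings into one-second windows and
    attach a monotonically increasing sequence number, which the consumer
    later uses as part of its dedup key."""
    windows = {}
    for ts, payload in sorted(readings):
        windows.setdefault(int(ts), []).append(payload)
    return [
        {"sequence_number": seq, "window_start": start, "readings": items}
        for seq, (start, items) in enumerate(sorted(windows.items()))
    ]
```

Each batch becomes one Kafka message keyed by drone ID; on retry the same sequence number is resent, letting the consumer detect and skip the duplicate.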
Step 4: Address Secondary Concerns
Cover the video pipeline: multipart upload with retry logic, S3 lifecycle policies to Glacier after 90 days, and optional transcoding for adaptive bitrate streaming. Discuss monitoring and observability: metrics on Kafka consumer lag, Cassandra write throughput, query latency percentiles, and storage growth rate. Address security: mTLS for drone-to-cloud connections, encryption at rest via KMS, and role-based access control for query APIs. Mention disaster recovery: multi-region Kafka replication, cross-region S3 replication, and Cassandra replication factor of three.
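The Glacier transition mentioned above is expressed declaratively as an S3 lifecycle rule; this dict follows the shape that `put_bucket_lifecycle_configuration` accepts, with the rule ID and key prefix as placeholder assumptions.

```python
# One lifecycle rule: move session video to Glacier 90 days after creation.
lifecycle_rule = {
    "ID": "archive-session-video",
    "Filter": {"Prefix": "sessions/"},   # placeholder key prefix
    "Status": "Enabled",
    "Transitions": [
        {"Days": 90, "StorageClass": "GLACIER"},
    ],
}

lifecycle_config = {"Rules": [lifecycle_rule]}
```

Applying it via boto3 would look like `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle_config)`; the point for the interview is that tiering is a bucket policy, not application code.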
Related Learning Resources
- Building block: Blob Storage -- foundational concepts for storing and retrieving large binary objects like video files
- Building block: Message Queues -- covers durable ingestion buffers, partitioning, and backpressure patterns used in the telemetry pipeline