Design a service deployment system
Problem Statement
Design a system that can roll out new versions of services to cameras and other devices deployed worldwide. The system should handle version management and deployment scheduling, and ensure reliable updates across distributed devices in the field. Think of how Tesla vehicles, Ring doorbells, or Axon's own body cameras receive firmware and software patches over the air without being brought into a lab.
This is an over-the-air (OTA) deployment platform where operators define versions, target device cohorts (by model, region, or customer), and configure rollout policies. Devices securely fetch and apply updates, report status, and automatically roll back if something goes wrong. Interviewers ask this to assess whether you can design safe, global rollouts for heterogeneous, intermittently connected clients. It tests your ability to manage versioning, orchestration, security and trust, streaming telemetry, backpressure, and reliability at the edge -- all while making pragmatic trade-offs to avoid bricking devices and to keep bandwidth and cost under control.
Key Requirements
Functional
- Version management -- create, store, and catalog firmware/software versions with metadata such as device model compatibility, dependencies, release notes, and cryptographic signatures
- Targeted rollout control -- define deployment policies that specify which devices receive an update (by geography, customer, model, or ID), rollout speed (percentage-based waves), and maintenance windows by region and time zone
- Device update orchestration -- coordinate the download, verification, installation, and activation sequence on each device, supporting resumable transfers, health checks post-install, and automatic rollback on failure
- Real-time monitoring and alerting -- dashboards showing deployment progress, success/failure rates per cohort, device error logs, and alerts when anomaly thresholds are breached
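The percentage-based waves in the targeted rollout requirement can be sketched with stable hash bucketing. This is a minimal illustration, not a prescribed design: the field names (`rollout_id`, `models`, `regions`, `basis_points`) and the choice of SHA-256 are assumptions made for the example. Expressing the wave size in basis points (1/100 of a percent) keeps the comparison in integers, so a 0.1% canary is simply `10`.

```python
import hashlib


def rollout_bucket(device_id: str, rollout_id: str) -> int:
    """Map a device to a stable bucket in [0, 10000) for one rollout.

    Mixing in the rollout ID (a hypothetical field) means a device that lands
    in the canary wave of one rollout is not always the canary for the next.
    """
    digest = hashlib.sha256(f"{rollout_id}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def is_eligible(device: dict, policy: dict) -> bool:
    """Return True if the device matches the targeting rules and current wave."""
    if device["model"] not in policy["models"]:
        return False
    # An empty region set is treated as "all regions" (assumption).
    if policy["regions"] and device["region"] not in policy["regions"]:
        return False
    bucket = rollout_bucket(device["device_id"], policy["rollout_id"])
    return bucket < policy["basis_points"]
```

Because a device's bucket never changes within a rollout, raising `basis_points` from 100 (1%) to 1000 (10%) only adds devices; everything already in the earlier wave stays in.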
Non-Functional
- Scalability -- support tens of millions of devices checking in concurrently during global rollout windows; handle terabytes of artifact downloads per hour across CDNs
- Reliability -- keep the share of devices that fail because of an update at or below 0.01%; degrade gracefully under network partitions without bricking devices
- Latency -- device check-in requests should receive a response within 200ms (excluding artifact download time); rollout policy changes must propagate to all regions within 5 seconds
- Consistency -- eventual consistency for device state metadata; strong consistency for rollout policy updates and artifact version mappings to prevent split-brain scenarios
What Interviewers Focus On
Based on real interview experiences at Axon, these are the areas interviewers probe most deeply:
1. Artifact Distribution and Cost Optimization
Firmware binaries can be large, and downloading them to millions of devices simultaneously creates enormous bandwidth costs and CDN load. Interviewers want to see how you minimize redundant transfers, handle intermittent connections, and keep costs predictable.
Hints to consider:
- Explore delta/differential updates that transfer only the blocks that changed between versions, which can cut payload size by as much as 80-90% when few blocks change
- Leverage edge CDNs with regional caching and consider peer-to-peer distribution among nearby devices
- Design resumable chunked downloads with checksums per block so devices can recover from interruptions without restarting
- Discuss bandwidth throttling and scheduling downloads during off-peak hours to reduce cellular data costs
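The resumable chunked download hint can be sketched as follows. This is an in-memory illustration under stated assumptions: the manifest layout (`size`, `chunk_size`, `chunk_sha256`) and the `fetch_range` callable are hypothetical stand-ins for an HTTP range request against a CDN, and a real device would persist verified chunks to flash rather than keep them in a dictionary.

```python
import hashlib
from typing import Callable, Dict, Optional


def build_manifest(artifact: bytes, chunk_size: int) -> dict:
    """Server side: describe the artifact as fixed-size chunks with SHA-256 digests."""
    chunks = [artifact[i:i + chunk_size] for i in range(0, len(artifact), chunk_size)]
    return {
        "size": len(artifact),
        "chunk_size": chunk_size,
        "chunk_sha256": [hashlib.sha256(c).hexdigest() for c in chunks],
    }


def resume_download(
    manifest: dict,
    fetch_range: Callable[[int, int], bytes],
    verified: Dict[int, bytes],
) -> Optional[bytes]:
    """Fetch only the chunks that are not yet verified.

    Returns the full artifact once every chunk is verified, otherwise None.
    `verified` survives between calls, so an interrupted transfer resumes
    instead of restarting from byte zero.
    """
    digests = manifest["chunk_sha256"]
    for index, expected in enumerate(digests):
        if index in verified:
            continue  # already downloaded and checked on an earlier attempt
        offset = index * manifest["chunk_size"]
        length = min(manifest["chunk_size"], manifest["size"] - offset)
        try:
            data = fetch_range(offset, length)
        except ConnectionError:
            return None  # keep what we have; the caller retries later
        if hashlib.sha256(data).hexdigest() != expected:
            continue  # corrupted chunk stays unverified and is refetched next pass
        verified[index] = data
    if len(verified) < len(digests):
        return None
    return b"".join(verified[i] for i in range(len(digests)))
```

The per-chunk digests only detect corruption in transit; they do not replace the signature check on the whole artifact before installation.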
2. Safe Rollout Orchestration and Blast Radius Containment
A bad firmware update can render thousands of devices inoperable. Interviewers expect you to articulate a staged rollout strategy that detects problems early and limits damage.
Hints to consider:
- Propose canary deployments starting with 0.1% of devices, monitoring for increased error rates or device unresponsiveness before expanding
- Implement circuit-breaker logic that automatically pauses rollouts when failure rates exceed a threshold (e.g., 2% install failures in 10 minutes)
- Use A/B boot partitions on devices so that a failed update boots back into the previous stable version after a watchdog timeout
- Model the rollout as a state machine with explicit pause, resume, and cancel transitions that propagate quickly through the control plane
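The state machine and circuit-breaker hints above can be combined in one small sketch. The state names, the transition table, and the `min_samples` guard are assumptions chosen for illustration; the 2% threshold over a 10-minute window mirrors the example figure in the hint. Timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque
from enum import Enum


class RolloutState(Enum):
    DRAFT = "draft"
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


# Explicit transition table: anything not listed here is rejected.
ALLOWED = {
    RolloutState.DRAFT: {RolloutState.RUNNING, RolloutState.CANCELLED},
    RolloutState.RUNNING: {RolloutState.PAUSED, RolloutState.COMPLETED, RolloutState.CANCELLED},
    RolloutState.PAUSED: {RolloutState.RUNNING, RolloutState.CANCELLED},
    RolloutState.COMPLETED: set(),
    RolloutState.CANCELLED: set(),
}


class Rollout:
    def __init__(self, window_seconds=600, max_failure_rate=0.02, min_samples=50):
        self.state = RolloutState.DRAFT
        self.window_seconds = window_seconds
        self.max_failure_rate = max_failure_rate
        self.min_samples = min_samples  # avoid tripping on the first few reports
        self._results = deque()  # (timestamp, succeeded) pairs, oldest first

    def transition(self, new_state: RolloutState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state

    def record_install(self, timestamp: float, succeeded: bool) -> None:
        """Record one install report and pause the rollout if the breaker trips."""
        self._results.append((timestamp, succeeded))
        # Drop reports that have aged out of the sliding window.
        while self._results and self._results[0][0] <= timestamp - self.window_seconds:
            self._results.popleft()
        total = len(self._results)
        failures = sum(1 for _, ok in self._results if not ok)
        if (
            self.state is RolloutState.RUNNING
            and total >= self.min_samples
            and failures / total > self.max_failure_rate
        ):
            self.transition(RolloutState.PAUSED)
```

Once paused, the rollout resumes only through an explicit `transition(RolloutState.RUNNING)`, which keeps a human or a separate health check in the loop before the wave widens again.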
3. Device-to-Cloud Communication Model
Devices sit behind NATs, firewalls, and unreliable cellular networks. A naive push model will not work at scale. Interviewers want to understand how devices discover updates and report status.
Hints to consider:
- Use a pull-based model where devices poll the platform at randomized (jittered) intervals and back off exponentially after failures, so that check-ins do not synchronize into a thundering herd
- Design idempotent commands so devices can safely retry check-ins without executing the same update twice
- Include long-polling or WebSocket connections for low-latency notifications to devices that need immediate updates, falling back to polling for battery-constrained devices
- Store device heartbeat and telemetry in a time-series database for anomaly detection and compliance auditing
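The polling and idempotency hints above can be illustrated with a short sketch. The "equal jitter" variant shown here (a delay drawn from the upper half of the backoff range) is one common choice among several, not the only valid one, and the `DeviceAgent` class with its `command_id` and `target_version` fields is hypothetical.

```python
import random


def next_poll_delay(
    base_seconds: float,
    consecutive_failures: int,
    max_seconds: float,
    rng: random.Random,
) -> float:
    """Return a jittered delay before the next check-in.

    The ceiling doubles with each consecutive failure up to max_seconds, and
    the delay is drawn uniformly from [ceiling / 2, ceiling] so that devices
    which failed at the same moment do not retry at the same moment.
    """
    exponent = min(consecutive_failures, 16)  # bound the growth of 2**n
    ceiling = min(max_seconds, base_seconds * 2 ** exponent)
    return rng.uniform(ceiling / 2, ceiling)


class DeviceAgent:
    """Device-side command handler that tolerates duplicate deliveries."""

    def __init__(self):
        self.applied_command_ids = set()
        self.installed_versions = []

    def handle(self, command: dict) -> str:
        # The server assigns each update command a unique ID (assumption), so
        # a retried check-in that re-delivers the command becomes a no-op.
        if command["command_id"] in self.applied_command_ids:
            return "already_applied"
        self.installed_versions.append(command["target_version"])
        self.applied_command_ids.add(command["command_id"])
        return "applied"
```

On a real device the set of applied command IDs would need to survive reboots, since the install itself usually triggers one.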
4. Security and Integrity Verification
Compromised updates can destroy trust and enable supply-chain attacks. Interviewers expect end-to-end security in your design.
Hints to consider:
- Sign all firmware artifacts with asymmetric cryptography (e.g., Ed25519); devices verify signatures before installation using embedded public keys
- Use mutual TLS (mTLS) for device-to-cloud communication so only authenticated devices can fetch updates and servers trust device identity
- Implement a secure boot chain in which the bootloader verifies the kernel and the kernel verifies the application, so that no unsigned code runs
- Rotate signing keys periodically and support revocation lists to invalidate compromised keys and certificates
5. Data Model and State Management
Managing millions of devices, their current versions, cohort memberships, and in-progress rollouts requires careful schema design and indexing.
Hints to consider:
- Design a device registry with fields like device_id, current_version, target_version, cohort_tags, last_checkin_time, and install_state
- Store rollout policies as separate entities linking version_id to cohort_ids with deployment parameters (start_time, rollout_percentage, pause_state)
- Use optimistic concurrency control (a per-record version number checked by conditional writes, or version vectors where several writers update the same record) to prevent conflicting updates to device state
- Index devices by cohort and version for efficient queries like "find all devices in cohort_A running a version below 2.0"
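The registry, conditional-write, and indexing hints above can be sketched with SQLite from the Python standard library. Several choices here are assumptions made for the example: cohort tags are normalized into a `device_cohorts` table, a `row_version` column carries the optimistic-concurrency token, and versions are stored a second time as a sortable integer (`current_version_code`) because comparing version strings would rank "1.10" below "1.9".

```python
import sqlite3


def version_code(version: str) -> int:
    """Turn 'major.minor' into a sortable integer (assumes minor < 1000)."""
    major, minor = (int(part) for part in version.split("."))
    return major * 1000 + minor


def open_registry() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE devices (
            device_id            TEXT PRIMARY KEY,
            current_version      TEXT NOT NULL,
            current_version_code INTEGER NOT NULL,
            target_version       TEXT,
            install_state        TEXT NOT NULL DEFAULT 'idle',
            last_checkin_time    INTEGER,
            row_version          INTEGER NOT NULL DEFAULT 0
        );
        -- The (cohort, device_id) primary key doubles as the cohort index.
        CREATE TABLE device_cohorts (
            cohort    TEXT NOT NULL,
            device_id TEXT NOT NULL REFERENCES devices(device_id),
            PRIMARY KEY (cohort, device_id)
        );
        CREATE INDEX idx_devices_version ON devices(current_version_code);
    """)
    return db


def add_device(db, device_id: str, version: str, cohorts) -> None:
    db.execute(
        "INSERT INTO devices (device_id, current_version, current_version_code) "
        "VALUES (?, ?, ?)",
        (device_id, version, version_code(version)),
    )
    db.executemany(
        "INSERT INTO device_cohorts (cohort, device_id) VALUES (?, ?)",
        [(cohort, device_id) for cohort in cohorts],
    )


def set_install_state(db, device_id: str, new_state: str, expected_row_version: int) -> bool:
    """Conditional write: succeeds only if nobody changed the row in between."""
    cursor = db.execute(
        "UPDATE devices SET install_state = ?, row_version = row_version + 1 "
        "WHERE device_id = ? AND row_version = ?",
        (new_state, device_id, expected_row_version),
    )
    return cursor.rowcount == 1


def devices_below(db, cohort: str, version: str) -> list:
    """Find devices in a cohort running a version lower than `version`."""
    rows = db.execute(
        "SELECT d.device_id FROM device_cohorts c "
        "JOIN devices d ON d.device_id = c.device_id "
        "WHERE c.cohort = ? AND d.current_version_code < ? "
        "ORDER BY d.device_id",
        (cohort, version_code(version)),
    )
    return [row[0] for row in rows]
```

A caller whose `set_install_state` returns `False` re-reads the row and decides whether its update still applies, rather than overwriting a newer state reported by the device.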