Design a system to roll out new versions of a mobile OS to devices worldwide
Problem Statement
Design a secure, planet-scale system that delivers operating system updates to hundreds of millions of mobile devices across diverse geographies, network conditions, and device models. The system must handle multi-gigabyte binary payloads while supporting sophisticated deployment strategies like canary releases, regional rollouts, and emergency rollbacks. Users should be able to discover, download, and install updates reliably even on unstable networks, while operators must have real-time visibility into rollout health and the ability to halt problematic releases immediately. The solution needs to balance aggressive bandwidth usage with network operator concerns, prevent simultaneous mass requests that could overwhelm infrastructure, and maintain a strong security posture against supply chain attacks and man-in-the-middle tampering.
Key Requirements
Functional
- Update Discovery & Eligibility -- devices must check in periodically and receive update offers based on model, region, carrier, OS version, and current rollout phase
- Resumable Download -- support partial downloads with pause/resume capability across network failures and user interruptions for multi-GB payloads
- Staged Rollout Control -- release managers configure phased deployments targeting specific cohorts with gradual expansion from 1% to 100% over days or weeks
- Installation Workflow -- coordinate pre-flight checks, atomic installation, device reboot, post-install verification, and safe rollback if validation fails
- Emergency Controls -- provide instant global or regional pause mechanisms and rollback capabilities when telemetry indicates problems
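The eligibility requirement can be sketched as a pure function over device attributes and a rollout campaign. This is a minimal illustration, not a prescribed design: the `Device` and `Campaign` types, their field names, and the idea of passing in a precomputed `bucket` (a stable per-device number in [0, 100)) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    device_id: str
    model: str
    region: str
    carrier: str
    os_version: tuple  # e.g. (14, 1, 0); tuples compare element by element


@dataclass(frozen=True)
class Campaign:
    target_version: tuple
    min_source_version: tuple    # oldest version that may update directly
    models: frozenset
    regions: frozenset
    blocked_carriers: frozenset  # carriers that have not yet approved the build
    rollout_percent: float       # current rollout phase, 0..100


def is_eligible(device: Device, campaign: Campaign, bucket: float) -> bool:
    """Return True if the device should be offered the campaign's update.

    `bucket` is assumed to be a stable per-device value in [0, 100),
    computed elsewhere (for example from a hash of the device ID).
    """
    return (
        device.model in campaign.models
        and device.region in campaign.regions
        and device.carrier not in campaign.blocked_carriers
        and campaign.min_source_version <= device.os_version < campaign.target_version
        and bucket < campaign.rollout_percent
    )
```

Keeping the check side-effect free makes it cheap to evaluate on every check-in and easy to cache per (model, region, carrier, version) combination.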
Non-Functional
- Scalability -- support 500M+ active devices with 50M concurrent downloads during peak rollout periods
- Reliability -- 99.99% update availability with graceful degradation; no device bricking even under partial system failures
- Latency -- eligibility checks under 200ms; download initiation within 2 seconds; real-time telemetry aggregation under 30 seconds
- Consistency -- eventual consistency for device state is acceptable; strong consistency for rollout configuration and pause signals
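A rough capacity estimate helps ground the scalability targets above. The device and concurrency counts come from the requirements; the payload sizes (3 GB full image, 300 MB delta) and the 5 Mbit/s average per-device throughput are assumptions chosen only to make the arithmetic concrete.

```python
ACTIVE_DEVICES = 500_000_000        # from the scalability requirement
CONCURRENT_DOWNLOADS = 50_000_000   # from the scalability requirement

# Assumed values, for illustration only.
FULL_IMAGE_BYTES = 3 * 10**9        # 3 GB full OS image
DELTA_BYTES = 300 * 10**6           # 300 MB delta update
PER_DEVICE_MBPS = 5                 # average sustained throughput per device


def peak_bandwidth_tbps(concurrent: int, per_device_mbps: float) -> float:
    """Aggregate egress in Tbit/s if every concurrent device downloads at once."""
    return concurrent * per_device_mbps * 1e6 / 1e12


def total_egress_pb(devices: int, payload_bytes: int) -> float:
    """Total bytes served to update every device once, in petabytes."""
    return devices * payload_bytes / 1e15


peak = peak_bandwidth_tbps(CONCURRENT_DOWNLOADS, PER_DEVICE_MBPS)
full = total_egress_pb(ACTIVE_DEVICES, FULL_IMAGE_BYTES)
delta = total_egress_pb(ACTIVE_DEVICES, DELTA_BYTES)

print(f"peak egress:        {peak:.0f} Tbit/s")
print(f"full-image rollout: {full:.0f} PB")
print(f"delta rollout:      {delta:.0f} PB")
```

Under these assumptions the peak is about 250 Tbit/s and a full-image rollout moves about 1,500 PB versus about 150 PB for deltas, which is why pacing and differential delivery dominate the design discussion.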
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Content Delivery & Bandwidth Management
At planetary scale with multi-GB payloads, naive approaches instantly saturate CDN capacity and ISP peering points. Interviewers want to see you reason about differential delivery, geographic distribution strategies, and cost optimization.
Hints to consider:
- Implement delta updates by shipping only changed blocks between versions, reducing typical downloads from 3GB to 200-400MB
- Use a multi-tier CDN strategy with edge PoPs, regional caches, and origin shields to minimize egress costs and improve cache hit rates
- Design chunk-based downloads with content-addressed storage allowing resume from any CDN node
- Consider peer-to-peer delivery within safe corporate or home networks to offload CDN traffic during mass rollouts
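The delta and chunking hints above can be sketched with fixed-size blocks addressed by their SHA-256 digest. This is a simplified illustration: production delta formats are considerably more sophisticated, and the function names and the tiny `BLOCK_SIZE` are assumptions for the example.

```python
import hashlib

BLOCK_SIZE = 4  # tiny for demonstration; real systems would use KB- or MB-sized blocks


def block_hashes(image: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Split an image into fixed-size blocks and return each block's SHA-256 hex digest."""
    return [
        hashlib.sha256(image[i:i + block_size]).hexdigest()
        for i in range(0, len(image), block_size)
    ]


def delta_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Return (index, digest) for blocks of `new` whose content is absent from `old`.

    Because blocks are content-addressed, a block that merely moved
    position does not need to be shipped again.
    """
    already_have = set(block_hashes(old, block_size))
    return [
        (index, digest)
        for index, digest in enumerate(block_hashes(new, block_size))
        if digest not in already_have
    ]


def remaining_chunks(manifest: list, local_store: dict) -> list:
    """Digests still to fetch, in manifest order, so a download can resume anywhere.

    `local_store` maps digest -> bytes for chunks already on the device.
    Any CDN node can serve a chunk, since the digest alone identifies it.
    """
    pending, seen = [], set()
    for digest in manifest:
        if digest not in local_store and digest not in seen:
            seen.add(digest)
            pending.append(digest)
    return pending
```

For example, with `old = b"AAAABBBBCCCC"` and `new = b"AAAAXXXXCCCCDDDD"`, only the blocks at indices 1 and 3 need to be shipped.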
2. Rollout Orchestration & Blast Radius Control
The difference between a successful release and a global incident lies in controlled exposure and rapid feedback loops. Interviewers expect you to demonstrate defense-in-depth through staged rollouts.
Hints to consider:
- Implement a ring-based rollout model: internal dogfood → beta users → 1% random → 5% → 25% → 100% with mandatory wait periods and health checks between stages
- Design cohort assignment using a deterministic hash of the device ID so the same devices always land in the same rings, which keeps early-ring testing predictable and lets each stage expand without reshuffling devices
- Build real-time anomaly detection comparing install success rates, boot failure rates, and app crash rates against baseline metrics
- Create automatic circuit breakers that pause rollouts when failure rates exceed historical baselines by a statistically significant margin
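Two of the hints above lend themselves to a short sketch: stable cohort assignment and a statistical circuit breaker. The version below is one possible approach under stated assumptions. The bucket granularity, the optional `salt` parameter, the z-score threshold of 3, and the minimum sample size are all illustrative choices, and the test used is a simple one-sided z-test of the observed failure rate against a known baseline.

```python
import hashlib
import math


def rollout_bucket(device_id: str, salt: str = "") -> float:
    """Map a device to a stable bucket in [0, 100) with 0.01 granularity.

    The same device always gets the same bucket for a given salt.
    Passing a per-release salt (a hypothetical option) would rotate
    which devices go first.
    """
    digest = hashlib.sha256(f"{salt}:{device_id}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") % 10_000) / 100


def in_rollout(device_id: str, percent: float, salt: str = "") -> bool:
    """A device at 1% stays included when the rollout widens to 5%, 25%, 100%."""
    return rollout_bucket(device_id, salt) < percent


def should_pause(failures: int, installs: int, baseline_rate: float,
                 z_threshold: float = 3.0, min_samples: int = 1000) -> bool:
    """Trip the breaker when the failure rate is significantly above baseline."""
    if installs < min_samples:
        return False  # too little data to distinguish signal from noise
    observed = failures / installs
    stderr = math.sqrt(baseline_rate * (1 - baseline_rate) / installs)
    if stderr == 0:
        return observed > baseline_rate
    return (observed - baseline_rate) / stderr > z_threshold
```

With a 1% baseline and 10,000 installs, 120 failures gives a z-score of about 2 and does not trip the breaker, while 150 failures gives a z-score of about 5 and does.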
3. Security & Trust Chain
A compromised update pipeline could brick millions of devices or install malicious code at OS level. Interviewers probe your understanding of supply chain security and cryptographic verification.
Hints to consider:
- Require multi-party signature schemes where release builds need signatures from both build infrastructure and release engineering before distribution
- Implement certificate pinning in device firmware so update clients only trust a specific root CA, preventing MITM attacks
- Use hardware-backed attestation (TPM/Secure Enclave) to verify update integrity before installation and prevent downgrades to vulnerable versions
- Design manifest files with cryptographic hashes for each chunk allowing incremental verification during download without waiting for the full payload
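The manifest-based checks above can be sketched as follows. This is a minimal illustration under loud assumptions: the `Manifest` type and its fields are hypothetical, cryptographic signature verification against pinned keys is assumed to have already happened (it is not shown), and `signers` simply records which parties' signatures passed that earlier step.

```python
import hashlib
from dataclasses import dataclass

# Assumed signer identities, mirroring the two parties named in the hints.
REQUIRED_SIGNERS = frozenset({"build-infra", "release-eng"})


@dataclass(frozen=True)
class Manifest:
    version: int          # monotonically increasing build number (assumed)
    chunk_sha256: tuple   # expected hex digest of each chunk, in order
    signers: frozenset    # parties whose signatures were verified earlier


def manifest_acceptable(manifest: Manifest, installed_version: int) -> bool:
    """Require every mandatory signer and refuse downgrades or re-installs."""
    return (
        REQUIRED_SIGNERS <= manifest.signers
        and manifest.version > installed_version
    )


def verify_chunk(manifest: Manifest, index: int, data: bytes) -> bool:
    """Check one chunk as it arrives, without waiting for the full payload."""
    if not 0 <= index < len(manifest.chunk_sha256):
        return False
    return hashlib.sha256(data).hexdigest() == manifest.chunk_sha256[index]
```

A client would discard and re-fetch any chunk for which `verify_chunk` returns False, and would refuse to start downloading at all unless `manifest_acceptable` holds.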
4. Device State Management & Recovery
Millions of devices exist in various states of the update lifecycle at any moment. The system must track this distributed state and handle failure scenarios gracefully.
Hints to consider:
- Model the update as a state machine: CHECKING → AVAILABLE → DOWNLOADING → DOWNLOADED → INSTALLING → INSTALLED → VERIFIED with explicit error states and retry logic
- Store device state in a globally distributed database partitioned by device ID with eventual consistency acceptable for most transitions
- Implement A/B partition schemes where updates install to an inactive partition, allowing instant rollback by simply rebooting to the previous partition
- Design exponential backoff with jitter for retry attempts to prevent synchronized thundering herds after regional network outages
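The state machine and retry hints above can be sketched together. The seven main states come from the list; the error states (`DOWNLOAD_FAILED`, `ROLLED_BACK`), the exact transition table, and the backoff parameters are assumptions for illustration. The backoff uses the "full jitter" variant, where each delay is drawn uniformly between zero and the capped exponential ceiling.

```python
import random
from enum import Enum, auto


class UpdateState(Enum):
    CHECKING = auto()
    AVAILABLE = auto()
    DOWNLOADING = auto()
    DOWNLOADED = auto()
    INSTALLING = auto()
    INSTALLED = auto()
    VERIFIED = auto()
    DOWNLOAD_FAILED = auto()  # assumed error state
    ROLLED_BACK = auto()      # assumed error state (booted back to old partition)


S = UpdateState
ALLOWED = {
    S.CHECKING: {S.AVAILABLE},
    S.AVAILABLE: {S.DOWNLOADING},
    S.DOWNLOADING: {S.DOWNLOADED, S.DOWNLOAD_FAILED},
    S.DOWNLOAD_FAILED: {S.DOWNLOADING},           # retry resumes the download
    S.DOWNLOADED: {S.INSTALLING},
    S.INSTALLING: {S.INSTALLED, S.ROLLED_BACK},
    S.INSTALLED: {S.VERIFIED, S.ROLLED_BACK},     # post-install health check
    S.ROLLED_BACK: {S.CHECKING},
    S.VERIFIED: set(),                            # terminal
}


def transition(current: UpdateState, target: UpdateState) -> UpdateState:
    """Return the new state, or raise ValueError for an illegal transition."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def backoff_delay(attempt: int, base: float = 30.0,
                  cap: float = 6 * 3600.0, rng=random) -> float:
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0, ceiling)
```

Spreading retries uniformly across the whole window, rather than adding a small jitter to a fixed delay, is what keeps devices from reconnecting in lockstep after a regional outage.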
5. Operational Visibility & Control Plane
Release managers need real-time insight across hundreds of millions of devices to make informed rollout decisions and respond to incidents.
Hints to consider:
- Stream device telemetry through a high-throughput message bus aggregating metrics by cohort, region, and device model in near real-time
- Build dashboards showing funnel metrics: eligible devices → offer accepted → download started → download completed → install succeeded → health verified
- Implement feature flags for gradual control plane rollouts, allowing you to deploy new eligibility logic or targeting rules safely
- Design the pause mechanism as a strongly consistent global flag checked before every eligibility response and during download initiation
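The funnel and pause hints above can be illustrated with a small sketch. It assumes telemetry arrives as deduplicated `(cohort, stage)` events, one per device per stage, and it models the pause flag as a plain in-memory object; in a real system that flag would live in a strongly consistent store, which this example does not attempt to show. All names here are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field

FUNNEL = [
    "eligible", "offer_accepted", "download_started",
    "download_completed", "install_succeeded", "health_verified",
]


def funnel_counts(events) -> dict:
    """Aggregate (cohort, stage) events into per-cohort stage counts."""
    counts = defaultdict(Counter)
    for cohort, stage in events:
        counts[cohort][stage] += 1
    return counts


def step_conversion(stage_counts) -> list:
    """Return (from_stage, to_stage, rate) per funnel step; rate is None if no data."""
    steps = []
    for src, dst in zip(FUNNEL, FUNNEL[1:]):
        denom = stage_counts.get(src, 0)
        rate = stage_counts.get(dst, 0) / denom if denom else None
        steps.append((src, dst, rate))
    return steps


@dataclass
class PauseState:
    global_paused: bool = False
    paused_regions: set = field(default_factory=set)

    def blocks(self, region: str) -> bool:
        return self.global_paused or region in self.paused_regions


def respond_to_check_in(region: str, eligible: bool, pause: PauseState) -> str:
    """Consult the pause flag before anything else, then apply eligibility."""
    if pause.blocks(region):
        return "PAUSED"
    return "OFFER" if eligible else "NO_UPDATE"
```

A sharp drop in one step's rate for a single cohort or device model is usually the first visible sign that a release should be paused.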