Design a system to roll out new versions of a mobile OS to devices worldwide
Problem Statement
Rolling out a new version of a mobile operating system to billions of devices worldwide is one of the most high-stakes distribution problems in software engineering. A failed update can brick a device, leaving the user with an unusable phone and the manufacturer with a costly warranty claim. The update must be delivered reliably over networks ranging from high-speed fiber to congested cellular connections in developing markets.
The rollout cannot happen all at once. Different device models have different hardware configurations, carriers impose their own certification requirements, and regional regulations may restrict update timing. A staged approach is essential: release to a small cohort first, monitor device health metrics, and expand gradually. If problems emerge, the rollout must pause or roll back without manual intervention.
You need to design a system that manages the lifecycle of OS updates from build artifact storage through staged rollout, handles resumable downloads over unreliable networks, verifies update integrity on-device, and monitors fleet health to automatically halt distribution when anomalies are detected.
Key Requirements
Functional
- Staged Rollout Management -- Define rollout cohorts by device model, region, carrier, and percentage. Advance through stages (canary, early adopters, general availability) based on configurable health criteria.
- Resumable Downloads -- Devices can pause and resume update downloads across network interruptions without re-downloading completed chunks.
- Integrity Verification -- Each update package is cryptographically signed; the device verifies the signature and checksum before applying the update.
- Health Monitoring and Auto-Pause -- Aggregate device health signals (boot success rate, crash rate, battery drain) post-update and automatically pause the rollout if metrics degrade beyond a threshold.
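The auto-pause requirement above can be sketched as a small decision function that a rollout controller might run each time fresh fleet metrics arrive. This is a minimal illustration, not a definitive design: the names `HealthThresholds`, `StageHealth`, and `evaluate_stage` are hypothetical, and every threshold value is a placeholder assumption rather than a recommended number.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    WAIT = "wait"        # not enough data or soak time yet
    ADVANCE = "advance"  # metrics look healthy, the next stage may open
    PAUSE = "pause"      # metrics degraded, halt distribution


@dataclass(frozen=True)
class HealthThresholds:
    # All values below are illustrative assumptions.
    min_reports: int = 1000          # devices that must report before judging
    min_boot_success: float = 0.999  # fraction of updated devices that booted
    max_crash_ratio: float = 1.2     # post-update crash rate / baseline rate
    min_soak_hours: float = 24.0     # how long a stage must stay healthy


@dataclass(frozen=True)
class StageHealth:
    reports: int
    boot_success_rate: float
    crash_rate: float
    baseline_crash_rate: float
    hours_in_stage: float


def evaluate_stage(h: StageHealth,
                   t: HealthThresholds = HealthThresholds()) -> Decision:
    """Decide whether the current stage should wait, pause, or advance."""
    if h.reports < t.min_reports:
        return Decision.WAIT  # too few devices reporting to judge either way
    if h.boot_success_rate < t.min_boot_success:
        return Decision.PAUSE
    if (h.baseline_crash_rate > 0
            and h.crash_rate / h.baseline_crash_rate > t.max_crash_ratio):
        return Decision.PAUSE
    if h.hours_in_stage < t.min_soak_hours:
        return Decision.WAIT  # healthy so far, but the soak period is not over
    return Decision.ADVANCE
```

One design choice worth discussing in an interview: this sketch waits for a minimum number of reports before pausing, which avoids reacting to noise but delays the response to a severe failure. A real controller might add a separate fast path that pauses on a small absolute number of boot failures.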
Non-Functional
- Scalability -- Distribute multi-gigabyte update packages to billions of devices across every continent without saturating origin infrastructure.
- Latency -- Devices within an active rollout cohort should be able to begin downloading within minutes of the stage opening.
- Reliability -- An interrupted or failed update must never leave a device in an unrecoverable state; the system must support A/B partition schemes or rollback mechanisms.
- Bandwidth Efficiency -- Use delta updates (binary diffs between versions) to minimize download size, especially for users on metered connections.
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Staged Rollout Strategy
Interviewers want to understand how you balance speed of distribution against risk. An overly cautious rollout delays security patches; an overly aggressive one risks mass device failures.
Hints to consider:
- Think about defining cohorts as a combination of dimensions (model, region, carrier, and a percentage bucket derived from a hash of the device ID) so that you get representative coverage at each stage.
- Consider how you determine when a stage is "healthy enough" to advance: what metrics do you track, what thresholds trigger a pause, and how long do you wait before declaring success?
- Evaluate how you handle mandatory security patches that need faster rollout versus feature updates that can proceed cautiously.
- Think about how you communicate rollout status to carrier partners who may need to approve each stage.
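The cohort hint above can be made concrete with a short sketch: hash the device ID into a stable bucket, then combine that bucket with model, region, and carrier filters. The `Device` and `Stage` types, the `bucket` and `is_eligible` functions, and the choice of SHA-256 with a rollout-specific salt are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Device:
    device_id: str
    model: str
    region: str
    carrier: str


@dataclass(frozen=True)
class Stage:
    name: str
    percent: float  # share of matching devices to include, 0 to 100
    models: Optional[FrozenSet[str]] = None    # None means "no restriction"
    regions: Optional[FrozenSet[str]] = None
    carriers: Optional[FrozenSet[str]] = None


def bucket(device_id: str, rollout_id: str) -> float:
    """Map a device to a stable value in [0, 100) for this rollout.

    Salting with rollout_id (an assumption of this sketch) gives each
    rollout a different canary population instead of always hitting
    the same devices first.
    """
    digest = hashlib.sha256(f"{rollout_id}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000 / 100.0


def is_eligible(device: Device, stage: Stage, rollout_id: str) -> bool:
    """Return True if the device falls inside the stage's cohort."""
    filters = (
        (stage.models, device.model),
        (stage.regions, device.region),
        (stage.carriers, device.carrier),
    )
    for allowed, value in filters:
        if allowed is not None and value not in allowed:
            return False
    return bucket(device.device_id, rollout_id) < stage.percent
```

Because the bucket is deterministic, raising `percent` from 1 to 10 keeps every device from the 1% cohort inside the 10% cohort, so expanding a stage never revokes an offer that a device has already received.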
2. CDN and Download Architecture
Multi-gigabyte packages served to billions of devices require a carefully designed distribution layer.
Hints to consider:
- Consider how you use CDN edge nodes to cache update packages close to users, reducing latency and origin load.
- Think about chunked downloads with byte-range requests so that devices can resume from the last successfully received chunk.
- Evaluate peer-to-peer distribution (devices on the same Wi-Fi network sharing chunks) to reduce external bandwidth for enterprise or household deployments.
- Consider how you avoid thundering herd effects when a new stage opens and millions of devices simultaneously request the update. Jittered polling intervals and server-side throttling both help.
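Two of the hints above, resuming via byte-range requests and jittered polling, can be sketched from the device's point of view. This is a simplified illustration under stated assumptions: the `Manifest` layout (a fixed chunk size plus one SHA-256 digest per chunk) and all function names are hypothetical, and the partial download is held in memory for brevity, whereas a real client would hash data on disk.

```python
import hashlib
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Manifest:
    total_size: int
    chunk_size: int
    chunk_sha256: List[str]  # hex digest of each chunk, in order


def build_manifest(payload: bytes, chunk_size: int) -> Manifest:
    """Server-side helper: compute one digest per fixed-size chunk."""
    hashes = [
        hashlib.sha256(payload[i:i + chunk_size]).hexdigest()
        for i in range(0, len(payload), chunk_size)
    ]
    return Manifest(len(payload), chunk_size, hashes)


def verified_bytes(received: bytes, m: Manifest) -> int:
    """Return how many leading bytes form complete, hash-verified chunks."""
    offset = 0
    for expected in m.chunk_sha256:
        end = min(offset + m.chunk_size, m.total_size)
        chunk = received[offset:end]
        if len(chunk) < end - offset:
            break  # chunk only partially downloaded
        if hashlib.sha256(chunk).hexdigest() != expected:
            break  # chunk corrupted, re-download from here
        offset = end
    return offset


def resume_headers(received: bytes, m: Manifest) -> Optional[Dict[str, str]]:
    """Build the HTTP Range header for the next request, or None if done."""
    start = verified_bytes(received, m)
    if start >= m.total_size:
        return None
    return {"Range": f"bytes={start}-"}


def jittered_delay(base_seconds: float, jitter: float, rng: random.Random) -> float:
    """Spread polls over [base * (1 - jitter), base * (1 + jitter)]."""
    return base_seconds * (1.0 + rng.uniform(-jitter, jitter))
```

For example, with a 1000-byte chunk size and 1500 bytes received, `resume_headers` returns `{"Range": "bytes=1000-"}`: the trailing half chunk is discarded and the download restarts at the last verified chunk boundary. Per-chunk digests only detect corruption; the signature check on the whole package described in the requirements would still be needed before applying the update.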