Design a Weather System
Problem Statement
Design a weather monitoring system that collects data from thousands of distributed sensors, processes it in near real time, and serves current conditions, alerts, and historical trends to users. The platform aggregates telemetry (temperature, humidity, wind, precipitation) from professional stations and community devices to power maps, apps, and APIs.
Interviewers ask this to test your ability to design an end-to-end data system: reliable high-throughput ingestion from flaky edge devices, time-series storage and geo-aware indexing, streaming aggregation pipelines, and low-latency APIs with real-time updates. The problem probes your judgment on partitioning by time and location, backpressure, data quality (deduplication, out-of-order events), and cost controls through caching and retention policies.
Key Requirements
Functional
- Current conditions -- users view current weather conditions for a given location with low latency and near-real-time freshness
- Real-time alerts -- users subscribe to configurable alerts (temperature thresholds, severe weather) for chosen areas and receive notifications promptly
- Historical queries -- users query historical data and aggregates (hourly averages, trends) for a location and time range
- Sensor management -- authorized operators register and manage sensors and monitor device health, while sensors securely send authenticated telemetry
Non-Functional
- Scalability -- support thousands of sensors producing continuous telemetry with burst handling during reconnects and storm events
- Reliability -- maintain 99.9% uptime for current conditions API; tolerate sensor failures and network partitions without data loss
- Latency -- current conditions served under 100ms, alert delivery within 30 seconds of threshold breach, historical queries under 500ms
- Consistency -- eventual consistency acceptable for current conditions display; strong consistency for alert rule evaluation and sensor configuration
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. High-Throughput Sensor Ingestion
Thousands of sensors produce continuous telemetry, often in bursts during reconnects or weather events. The ingestion layer must handle variable throughput without data loss.
Hints to consider:
- Use Kafka as a durable, partitioned log to buffer bursty sensor writes, preserving per-device ordering and enabling multiple consumers
- Partition by sensor ID or geohash to spread load and avoid hot partitions from sensors in dense areas
- Implement backpressure mechanisms so burst traffic does not overwhelm downstream processing
- Handle idempotent writes using sensor ID plus timestamp as deduplication keys to manage retries from flaky connections
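The last two hints can be combined into a small sketch: key each record by sensor ID so a device's readings land on one partition (preserving per-device order), and drop retried duplicates using (sensor ID, timestamp) as the idempotency key. All names here are hypothetical; a production system would bound the dedup set with a TTL store rather than an in-memory set.

```python
import zlib
from typing import Iterable

def partition_for(sensor_id: str, num_partitions: int) -> int:
    """Stable hash of the sensor ID: every reading from one device maps to
    the same partition, preserving per-device ordering for consumers."""
    return zlib.crc32(sensor_id.encode()) % num_partitions

def dedupe_readings(readings: Iterable[dict], seen: set) -> list[dict]:
    """Drop retried duplicates using (sensor_id, ts) as the idempotency key,
    so flaky connections can safely resend without double-counting."""
    accepted = []
    for r in readings:
        key = (r["sensor_id"], r["ts"])
        if key in seen:
            continue  # retry of an already-ingested reading
        seen.add(key)
        accepted.append(r)
    return accepted
```

In a real deployment the `seen` set would live in a TTL-bounded store (e.g., a cache with expiry matching the maximum retry window) so memory stays bounded.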
2. Data Quality and Out-of-Order Handling
Sensor data arrives late, out of order, or duplicated due to network retries and clock skew. Aggregates and alerts must remain correct despite imperfect data.
Hints to consider:
- Use event-time processing with watermarks and allowed lateness to handle out-of-order data in streaming aggregations
- Normalize device clocks by requiring NTP sync on devices and validating timestamps server-side (for example, rejecting readings stamped far in the future or past)
- Implement late-data correction by allowing aggregation windows to be updated within a grace period
- Design health checks that detect stale sensors (missing heartbeats) and mark their data as potentially unreliable
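The watermark and allowed-lateness hints can be sketched with a toy event-time aggregator: the watermark tracks the maximum event time seen, and a window is only finalized once the watermark passes the window end plus a grace period, so late-but-in-grace readings still count. This is a minimal stand-in for what a stream processor such as Flink provides; the class and parameters are illustrative.

```python
from collections import defaultdict

class WindowedAverager:
    """Tumbling event-time windows with a watermark and allowed lateness (sketch)."""

    def __init__(self, window_s: int = 3600, lateness_s: int = 300):
        self.window_s = window_s
        self.lateness_s = lateness_s
        self.watermark = 0                  # max event time observed so far
        self.open = defaultdict(list)       # window start -> values
        self.closed = {}                    # window start -> final average

    def on_event(self, ts: int, value: float) -> None:
        self.watermark = max(self.watermark, ts)
        start = ts - ts % self.window_s
        if start in self.closed:
            return  # too late even for the grace period: route to a correction path
        self.open[start].append(value)
        # Finalize any window whose end + lateness has passed the watermark.
        for s in list(self.open):
            if s + self.window_s + self.lateness_s <= self.watermark:
                vals = self.open.pop(s)
                self.closed[s] = sum(vals) / len(vals)
```

The key property: an out-of-order reading that arrives within `lateness_s` of its window closing is still included in the final average, instead of being silently dropped.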
3. Geo-Aware Storage and Query Patterns
Users query by location, requiring efficient spatial indexing. Historical queries span large time ranges yet must still be served with low latency.
Hints to consider:
- Partition time-series data by geohash prefix and time bucket for efficient geo-temporal range scans
- Maintain a materialized "latest per tile" view in Redis for instant current-conditions queries without scanning raw time-series
- Use pre-aggregated rollup tables (hourly, daily) to speed historical queries and reduce storage costs
- Implement TTL-based data retention with tiered storage: hot data in fast stores, cold data in object storage
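Two of these hints can be made concrete in a few lines: a composite partition key built from a geohash prefix and a time bucket (so a geo-temporal range query touches few partitions), and an in-memory stand-in for the Redis "latest per tile" view that ignores stale out-of-order updates. The key format and prefix length are assumptions for illustration, not a prescribed schema.

```python
def partition_key(geohash: str, ts: int, prefix_len: int = 4, bucket_s: int = 3600) -> str:
    """Composite key: the geohash prefix groups nearby sensors into one tile,
    and the time bucket bounds how much data a range scan must touch."""
    bucket = ts - ts % bucket_s
    return f"{geohash[:prefix_len]}#{bucket}"

class LatestPerTile:
    """In-memory stand-in for a 'latest conditions per tile' view (sketch).
    In production this would be a Redis hash keyed by tile, updated by the
    stream pipeline, so current-conditions reads never scan raw time series."""

    def __init__(self, prefix_len: int = 4):
        self.prefix_len = prefix_len
        self._latest = {}  # tile -> (ts, reading)

    def update(self, geohash: str, ts: int, reading: dict) -> None:
        tile = geohash[:self.prefix_len]
        cur = self._latest.get(tile)
        if cur is None or ts >= cur[0]:   # drop stale out-of-order updates
            self._latest[tile] = (ts, reading)

    def current(self, geohash: str):
        entry = self._latest.get(geohash[:self.prefix_len])
        return entry[1] if entry else None
```

A current-conditions API call then reduces to one key lookup on the caller's tile, which is what makes the sub-100ms latency target realistic.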
4. Real-Time Alerts and Notification Pipeline
Users expect timely alerts when weather conditions breach configured thresholds, requiring low-latency streaming evaluation.
Hints to consider:
- Evaluate alert rules against incoming sensor data using a stream processing engine with windowed aggregations
- Deduplicate alerts to prevent notification spam when conditions fluctuate around threshold boundaries
- Fan out notifications through multiple channels (push, email, SMS) with retry logic and acknowledgment tracking
- Design maintenance windows and silence rules so planned sensor outages do not trigger false alerts
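The dedup hint is commonly implemented with hysteresis: fire once when the value crosses the threshold, and re-arm only after it drops a margin below it, so a reading fluctuating around the boundary produces one notification instead of a spam storm. A minimal sketch, with hypothetical class and parameter names:

```python
class ThresholdAlert:
    """Fires once when the value crosses `high`; re-arms only after the value
    falls below `high - hysteresis`, suppressing flapping near the boundary."""

    def __init__(self, high: float, hysteresis: float = 1.0):
        self.high = high
        self.rearm = high - hysteresis
        self.active = False  # True while the alert condition is latched

    def evaluate(self, value: float) -> bool:
        if not self.active and value >= self.high:
            self.active = True
            return True          # breach: send a notification
        if self.active and value < self.rearm:
            self.active = False  # condition cleared: re-arm for the next breach
        return False
```

The stream processor would keep one such latch per (alert rule, area) pair; downstream, the fan-out service handles channel retries and acknowledgment tracking.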