Oracle system design interviews often feature problems on scalable monitoring infrastructure, but no public source (LeetCode, Glassdoor, Reddit, or Oracle-specific prep sites) documents a verbatim "Design Metrics and Logging System" question with this exact tag set (ElasticSearch, data_engineering, web, Kafka, System Design, Monitoring, backend, infrastructure). The closest analogs are the generic distributed metrics/logging designs asked at Google, Facebook, Datadog, and Atlassian, which align with Oracle's focus on OCI Logging, streaming (Kafka-like), and analytics.[1][3][5]
Design a distributed system to collect, process, aggregate, and query metrics and logs from thousands of web/backend services at scale.
No verbatim I/O examples appear in Oracle sources; the following are standard for this class of problem, derived from similar designs at Datadog and New Relic:
Input (Metrics via Kafka producers):

```
Topic: metrics
Message: {"timestamp": "2026-02-02T04:19:00Z", "service": "user-api", "metric": "request_latency_ms", "value": 150.5, "tags": {"env": "prod", "region": "us-east"}}
```
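A producer for this topic can be sketched in plain Python. The event schema below mirrors the example message; the actual Kafka send is left as a comment because it depends on a client library (kafka-python is assumed here) and broker configuration.

```python
import json
from datetime import datetime, timezone

def build_metric_event(service, metric, value, tags):
    """Serialize one metric sample into the JSON envelope used on the
    'metrics' topic (schema mirrors the example message above)."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "service": service,
        "metric": metric,
        "value": value,
        "tags": tags,
    })

event = build_metric_event("user-api", "request_latency_ms", 150.5,
                           {"env": "prod", "region": "us-east"})
# With a Kafka client (assumption: kafka-python) the publish would be:
# producer.send("metrics", key=b"user-api", value=event.encode("utf-8"))
```

Keying by service name keeps each service's samples on one partition, preserving per-service ordering for downstream aggregators.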
Input (Logs via Kafka):

```
Topic: logs
Message: {"timestamp": "2026-02-02T04:19:00Z", "service": "user-api", "level": "ERROR", "message": "Timeout fetching user data", "trace_id": "abc123"}
```
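The log envelope can be built the same way. The sketch below (function and level names are illustrative, not from the source) validates the level field so malformed records are rejected before they reach the logs topic:

```python
import json
from datetime import datetime, timezone

LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"}

def build_log_event(service, level, message, trace_id=None):
    # Reject unknown levels early, before the record hits the 'logs' topic.
    if level not in LEVELS:
        raise ValueError(f"unknown log level: {level}")
    record = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "service": service,
        "level": level,
        "message": message,
    }
    if trace_id:  # optional correlation id for cross-service tracing
        record["trace_id"] = trace_id
    return json.dumps(record)

log = build_log_event("user-api", "ERROR", "Timeout fetching user data", "abc123")
```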
Output (Query API Response - Aggregations via ElasticSearch/Kibana-like):

```
GET /metrics/query?service=user-api&metric=request_latency_ms&window=1h
Response: { "avg": 145.2, "p95": 320.1, "count": 45000, "from": "2026-02-02T03:19:00Z" }
```
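Server-side, this response aggregates raw samples over the query window. In ElasticSearch that would be avg and percentiles aggregations; the minimal in-memory sketch below shows the same math over simulated samples, using the simple nearest-rank definition of p95 (an assumption, as the source does not specify a percentile method):

```python
import math

def aggregate(samples):
    """Compute the avg/p95/count fields of the query response from raw
    latency samples (nearest-rank p95, a common simple definition)."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)  # nearest-rank index
    return {
        "avg": round(sum(ordered) / len(ordered), 1),
        "p95": ordered[rank],
        "count": len(ordered),
    }

result = aggregate([100, 120, 150, 180, 200, 220, 250, 300, 320, 500])
```

At real scale this runs incrementally per time bucket rather than over raw samples, which is why exact percentiles give way to sketch-based estimates (e.g. t-digest) in production systems.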
Output (Log Search):

```
GET /logs/search?service=user-api&level=ERROR&trace_id=abc123
Response: [ {"timestamp": "...", "message": "Timeout...", "highlight": "user data"} ]
```
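The log-search endpoint translates its query parameters into an ElasticSearch bool query with a highlight clause. The body below follows standard Query DSL; the field names match the log schema above, and treating all filters as exact-match `term` clauses is an assumption:

```python
def log_search_body(service, level, trace_id):
    """Build an ElasticSearch Query DSL body for the /logs/search call.
    Filters are exact-match terms; 'highlight' marks hits in 'message'."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"level": level}},
                    {"term": {"trace_id": trace_id}},
                ]
            }
        },
        "highlight": {"fields": {"message": {}}},
        "sort": [{"timestamp": "desc"}],  # newest log lines first
    }

body = log_search_body("user-api", "ERROR", "abc123")
```

Using `filter` rather than `must` skips relevance scoring, which lets ElasticSearch cache these clauses across repeated queries.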
These examples emphasize Kafka buffering, ElasticSearch time-series index mappings, and aggregation pipelines. For an Oracle-specific twist, expect OCI Streaming (Kafka-compatible) and Logging Analytics integration.[3][7][1][5]
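A time-series mapping along the lines mentioned above might look like the following (the index fields mirror the metric message; the exact type choices are assumptions: `keyword` keeps service/tags aggregatable, `date` enables window range filters):

```python
# Illustrative ElasticSearch index mapping for the metrics stream
# (field/type choices are assumptions, not from the source).
METRICS_MAPPING = {
    "mappings": {
        "properties": {
            "timestamp": {"type": "date"},     # range filters per query window
            "service":   {"type": "keyword"},  # exact-match terms/aggregations
            "metric":    {"type": "keyword"},
            "value":     {"type": "double"},   # numeric aggs (avg, percentiles)
            "tags": {
                "properties": {
                    "env":    {"type": "keyword"},
                    "region": {"type": "keyword"},
                }
            },
        }
    }
}
```

In practice this mapping would live in an index template applied to time-partitioned indices (e.g. daily), so old data can be dropped or rolled over cheaply.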