Oracle system design interviews often feature problems on scalable monitoring infrastructure, but no public source (LeetCode, Glassdoor, Reddit, or Oracle-specific prep sites) documents a verbatim "Design Metrics and Logging System" question with this exact tag set (ElasticSearch, data_engineering, web, Kafka, System Design, Monitoring, backend, infrastructure). The closest analogs are the generic distributed metrics/logging designs asked at Google, Facebook, Datadog, and Atlassian, which align with Oracle's focus on OCI Logging, streaming (Kafka-like), and analytics.[1][3][5]
Design a distributed system to collect, process, aggregate, and query metrics and logs from thousands of web/backend services at scale.
No verbatim I/O examples appear in Oracle sources; the following are standard for this class of problem, derived from similar designs at Datadog and New Relic:
Input (Metrics via Kafka producers):

```
Topic: metrics
Message: {"timestamp": "2026-02-02T04:19:00Z", "service": "user-api", "metric": "request_latency_ms", "value": 150.5, "tags": {"env": "prod", "region": "us-east"}}
```
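A producer for this topic can be sketched in plain Python. The event schema below mirrors the example message; the actual Kafka send is left as a comment because it depends on a client library (kafka-python is assumed here) and broker configuration.

```python
import json
from datetime import datetime, timezone

def build_metric_event(service, metric, value, tags):
    """Serialize one metric sample into the JSON envelope used on the
    'metrics' topic (schema mirrors the example message above)."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "service": service,
        "metric": metric,
        "value": value,
        "tags": tags,
    })

event = build_metric_event("user-api", "request_latency_ms", 150.5,
                           {"env": "prod", "region": "us-east"})
# With a Kafka client (assumption: kafka-python) the publish would be:
# producer.send("metrics", key=b"user-api", value=event.encode("utf-8"))
```

Keying by service name keeps each service's samples on one partition, preserving per-service ordering for downstream aggregators.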
Input (Logs via Kafka):

```
Topic: logs
Message: {"timestamp": "2026-02-02T04:19:00Z", "service": "user-api", "level": "ERROR", "message": "Timeout fetching user data", "trace_id": "abc123"}
```
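The log envelope can be built the same way. The sketch below (function and level names are illustrative, not from the source) validates the level field so malformed records are rejected before they reach the logs topic:

```python
import json
from datetime import datetime, timezone

LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"}

def build_log_event(service, level, message, trace_id=None):
    # Reject unknown levels early, before the record hits the 'logs' topic.
    if level not in LEVELS:
        raise ValueError(f"unknown log level: {level}")
    record = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "service": service,
        "level": level,
        "message": message,
    }
    if trace_id:  # optional correlation id for cross-service tracing
        record["trace_id"] = trace_id
    return json.dumps(record)

log = build_log_event("user-api", "ERROR", "Timeout fetching user data", "abc123")
```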
Output (Query API Response - Aggregations via ElasticSearch/Kibana-like):

```
GET /metrics/query?service=user-api&metric=request_latency_ms&window=1h
Response: { "avg": 145.2, "p95": 320.1, "count": 45000, "from": "2026-02-02T03:19:00Z" }
```
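Server-side, this response aggregates raw samples over the query window. In ElasticSearch that would be avg and percentiles aggregations; the minimal in-memory sketch below shows the same math over simulated samples, using the simple nearest-rank definition of p95 (an assumption, as the source does not specify a percentile method):

```python
import math

def aggregate(samples):
    """Compute the avg/p95/count fields of the query response from raw
    latency samples (nearest-rank p95, a common simple definition)."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)  # nearest-rank index
    return {
        "avg": round(sum(ordered) / len(ordered), 1),
        "p95": ordered[rank],
        "count": len(ordered),
    }

result = aggregate([100, 120, 150, 180, 200, 220, 250, 300, 320, 500])
```

At real scale this runs incrementally per time bucket rather than over raw samples, which is why exact percentiles give way to sketch-based estimates (e.g. t-digest) in production systems.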
Output (Log Search):

```
GET /logs/search?service=user-api&level=ERROR&trace_id=abc123
Response: [ {"timestamp": "...", "message": "Timeout...", "highlight": "user data"} ]
```
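The log-search endpoint translates its query parameters into an ElasticSearch bool query with a highlight clause. The body below follows standard Query DSL; the field names match the log schema above, and treating all filters as exact-match `term` clauses is an assumption:

```python
def log_search_body(service, level, trace_id):
    """Build an ElasticSearch Query DSL body for the /logs/search call.
    Filters are exact-match terms; 'highlight' marks hits in 'message'."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"level": level}},
                    {"term": {"trace_id": trace_id}},
                ]
            }
        },
        "highlight": {"fields": {"message": {}}},
        "sort": [{"timestamp": "desc"}],  # newest log lines first
    }

body = log_search_body("user-api", "ERROR", "abc123")
```

Using `filter` rather than `must` skips relevance scoring, which lets ElasticSearch cache these clauses across repeated queries.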
These examples emphasize Kafka buffering, ElasticSearch time-series index mappings, and aggregation pipelines. For an Oracle-specific twist, expect OCI Streaming (Kafka-compatible) and Logging Analytics integration.[3][7][1][5]
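A time-series mapping along the lines mentioned above might look like the following (the index fields mirror the metric message; the exact type choices are assumptions: `keyword` keeps service/tags aggregatable, `date` enables window range filters):

```python
# Illustrative ElasticSearch index mapping for the metrics stream
# (field/type choices are assumptions, not from the source).
METRICS_MAPPING = {
    "mappings": {
        "properties": {
            "timestamp": {"type": "date"},     # range filters per query window
            "service":   {"type": "keyword"},  # exact-match terms/aggregations
            "metric":    {"type": "keyword"},
            "value":     {"type": "double"},   # numeric aggs (avg, percentiles)
            "tags": {
                "properties": {
                    "env":    {"type": "keyword"},
                    "region": {"type": "keyword"},
                }
            },
        }
    }
}
```

In practice this mapping would live in an index template applied to time-partitioned indices (e.g. daily), so old data can be dropped or rolled over cheaply.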