Design a reproducible, scalable system that evaluates autonomous-driving ML models on petabytes of sensor logs. The system must let ML engineers declare an evaluation (a specific test set, a metrics module, and slicing rules), run it against any trained model, and obtain both aggregate and per-slice metrics with confidence intervals. Evaluations must be content-addressed, so that re-running the same eval on the same model is served instantly from cache.

The service should support two execution modes: small evaluations (≤100K scenes) run synchronously on a handful of GPUs for interactive debugging, while large evaluations (≥1M scenes) are sharded across a cluster in a map-reduce pattern for overnight regression testing.

Every metric must be reported per user-declared slice (e.g., weather=rain, time=night, geography=highway) and accompanied by a bootstrap 95% confidence interval, so reviewers can judge whether observed differences are statistically significant. The design must guarantee that any evaluator can reproduce an earlier result bit-for-bit given the same eval hash and model hash, and that promotion decisions are gated on per-slice criteria, not just global accuracy.
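To make the content-addressing requirement concrete, here is a minimal sketch of an evaluation declaration and its hash. Everything here is an assumption for illustration: the EvalSpec fields, the canonical-JSON-then-SHA-256 scheme, and the names eval_hash and cache_key are hypothetical, not part of the spec.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvalSpec:
    """Declares an evaluation: what to test on, how to score, how to slice.

    Field names are illustrative; the requirement is only that the
    declaration be content-addressable.
    """
    test_set_id: str      # immutable snapshot of scene IDs
    metrics_module: str   # pinned version, e.g. "detection_metrics==2.3.1"
    slice_rules: tuple    # e.g. (("weather", "rain"), ("time", "night"))

def eval_hash(spec: EvalSpec) -> str:
    """Content-address the spec: identical declarations hash identically.

    Canonical JSON (sorted keys, fixed separators) makes the hash
    independent of field ordering and whitespace.
    """
    canonical = json.dumps(asdict(spec), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def cache_key(spec: EvalSpec, model_hash: str) -> str:
    """Results are keyed by (eval hash, model hash): the same pair always
    yields the same key, so a re-run becomes a cache lookup."""
    return f"{eval_hash(spec)}:{model_hash}"
```

A result store keyed by cache_key(spec, model_hash) is what makes re-runs instantaneous: any change to the test set, metrics version, or slicing rules produces a new key and forces a fresh computation.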
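For the large-evaluation mode, one possible shape of the map-reduce pattern is sketched below. The mapper scores a shard of scenes and buckets per-scene metric values by slice; the reducer merges shard outputs and canonicalizes ordering so the result does not depend on shard completion order. The functions map_shard and reduce_shards, the score callable, and the scene["slices"] layout are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

def map_shard(scenes: Iterable[dict],
              score: Callable[[dict], float]) -> Dict[str, List[float]]:
    """Shard-local map step: score each scene with the pinned metrics module
    and bucket the per-scene value under every slice the scene belongs to."""
    out: Dict[str, List[float]] = defaultdict(list)
    for scene in scenes:
        value = score(scene)
        for key in scene["slices"]:   # e.g. ["weather=rain", "time=night"]
            out[key].append(value)
    return dict(out)

def reduce_shards(shard_outputs: Iterable[Dict[str, List[float]]]
                  ) -> Dict[str, List[float]]:
    """Reduce step: concatenate per-slice values across shards, then sort to
    a canonical order (in practice one would sort by scene ID) so the merged
    result is bit-for-bit identical regardless of which shard finishes first."""
    merged: Dict[str, List[float]] = defaultdict(list)
    for shard in shard_outputs:
        for key, values in shard.items():
            merged[key].extend(values)
    for key in merged:
        merged[key].sort()
    return dict(merged)
```

The small-evaluation mode can reuse the same two functions with a single in-process "shard", which keeps interactive and overnight runs on one code path.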
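The per-slice confidence intervals could be computed with a percentile bootstrap over per-scene metric values, as in the sketch below. The fixed RNG seed is what lets a re-run with the same eval hash reproduce the interval bit-for-bit; the function names and the choice of 10,000 resamples are assumptions.

```python
import numpy as np

def bootstrap_ci(values: np.ndarray, n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple:
    """Percentile-bootstrap 95% CI for the mean of per-scene metric values.

    A fixed seed makes the resampling deterministic, which is required for
    bit-for-bit reproducibility of earlier results.
    """
    rng = np.random.default_rng(seed)
    n = len(values)
    # Resample scene-level metrics with replacement and recompute the mean.
    # (Materializing all resample indices is fine for a sketch; at 1M+ scenes
    # a streaming variant such as a Poisson bootstrap would be needed.)
    idx = rng.integers(0, n, size=(n_resamples, n))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(values.mean()), float(lo), float(hi)

def per_slice_report(per_scene: dict) -> dict:
    """per_scene maps a slice name (e.g. 'weather=rain') to that slice's
    per-scene metric values; returns (mean, ci_lo, ci_hi) per slice."""
    return {name: bootstrap_ci(np.asarray(vals))
            for name, vals in per_scene.items()}
```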
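Finally, a minimal sketch of per-slice promotion gating. Using the CI lower bound rather than the point estimate is a conservative, illustrative choice, and the report/thresholds formats are hypothetical: a slice passes only when we are confident its true metric clears the bar, and a missing slice fails closed.

```python
def gate_promotion(report: dict, thresholds: dict) -> bool:
    """Gate on per-slice criteria, not global accuracy alone.

    report:     slice name -> (mean, ci_lo, ci_hi), as from per_slice_report
    thresholds: slice name -> minimum acceptable metric value
    """
    for slice_name, min_value in thresholds.items():
        _mean, ci_lo, _ci_hi = report.get(slice_name, (float("-inf"),) * 3)
        if ci_lo < min_value:
            return False  # this slice's CI does not clear its bar
    return True
```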