Design an end-to-end machine-learning system that predicts the sale price of single-family houses anywhere in the United States. The system must ingest a daily stream of new MLS listings and closed-sale records (≈ 500 k rows day⁻¹), engineer location-centric features, train a regression model, and expose REST/GraphQL predictions with < 100 ms p99 latency. Your design should cover:

(1) data ingestion from heterogeneous county feeds (CSV, XML, API) with schema evolution;
(2) a feature store that keeps batch features (census, school ratings, crime) and streaming features (days-on-market, mortgage rates) consistent;
(3) a training pipeline that retrains weekly on a rolling two-year window, handles missingness, outliers, and high-cardinality location IDs, and guarantees no data leakage (e.g., future macro variables must not be used);
(4) model selection and tuning infrastructure that compares linear, GBDT, and neural-network regressors under RMSE and MAE on a time-based split;
(5) an online inference service that enriches a raw listing with derived features in real time and returns both a point estimate and a prediction interval;
(6) monitoring for drift in the price distribution and in the feature space, with automatic rollback; and
(7) a compliance/audit layer ensuring explainability (SHAP) and fair-lending tests across protected groups.

Sketch the architecture, choose storage/compute components, estimate throughput vs. cost, and describe how you would A/B test a new model in production without double-charging customers.
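A minimal sketch of the leakage-safe evaluation asked for in items (3)–(4): sort records by sale date, split at a cutoff so the test window is strictly later than the training window, and compare candidate regressors under RMSE and MAE. The synthetic records, the single `sqft` feature, and the toy mean/linear models are illustrative assumptions standing in for the real feature pipeline and model zoo.

```python
import math
from datetime import date, timedelta

# Toy closed-sale records: (sale_date, sqft, price). Illustrative only.
records = [
    (date(2023, 1, 1) + timedelta(days=i), 1_000 + 10 * i, 150_000 + 120 * i)
    for i in range(730)  # two-year rolling window
]

# Time-based split: everything before the cutoff trains, everything at or
# after it tests. This guarantees no future information (e.g., later macro
# variables) leaks into training.
cutoff = date(2024, 7, 1)
train = [r for r in records if r[0] < cutoff]
test = [r for r in records if r[0] >= cutoff]

def fit_mean(rows):
    """Baseline: always predict the training-set mean price."""
    m = sum(r[2] for r in rows) / len(rows)
    return lambda sqft: m

def fit_linear(rows):
    """Closed-form ordinary least squares on the single sqft feature."""
    n = len(rows)
    mx = sum(r[1] for r in rows) / n
    my = sum(r[2] for r in rows) / n
    sxx = sum((r[1] - mx) ** 2 for r in rows)
    sxy = sum((r[1] - mx) * (r[2] - my) for r in rows)
    slope = sxy / sxx
    return lambda sqft: my + slope * (sqft - mx)

def rmse_mae(model, rows):
    errs = [model(r[1]) - r[2] for r in rows]
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    mae = sum(abs(e) for e in errs) / len(errs)
    return rmse, mae

for name, fit in [("mean baseline", fit_mean), ("linear", fit_linear)]:
    rmse, mae = rmse_mae(fit(train), test)
    print(f"{name}: RMSE={rmse:,.0f} MAE={mae:,.0f}")
```

The same harness extends to GBDT and neural-network regressors by swapping in their `fit` functions; the key invariant is that the split is by time, never random.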
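One simple way to return both a point estimate and a prediction interval, as item (5) requires, is split-conformal calibration: hold out a calibration set, compute the point model's absolute residuals on it, and widen every prediction by the empirical 90th-percentile residual. The fixed linear `predict` rule and the synthetic data below are hypothetical stand-ins for the trained regressor and real listings.

```python
import random

random.seed(0)

# Synthetic (feature, price) pairs with Gaussian noise; illustrative only.
data = [(x, 200_000 + 150 * x + random.gauss(0, 5_000)) for x in range(2_000)]
train, calib = data[:1_500], data[1_500:]

# Point model: a fixed linear rule standing in for the trained regressor.
def predict(x):
    return 200_000 + 150 * x

# Split-conformal calibration: the 90th-percentile absolute residual on the
# held-out calibration set becomes a symmetric interval half-width.
residuals = sorted(abs(predict(x) - y) for x, y in calib)
q = residuals[int(0.9 * len(residuals))]

def predict_with_interval(x):
    """Return (point_estimate, (lower, upper)) for one enriched listing."""
    p = predict(x)
    return p, (p - q, p + q)

point, (lo, hi) = predict_with_interval(1_000)
print(f"point={point:,.0f} interval=({lo:,.0f}, {hi:,.0f})")
```

By construction the interval covers at least 90 % of the calibration points; in production the residual quantile would be refreshed with each weekly retrain.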
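Drift in the price distribution (item 6) can be monitored with a population-stability-index check between the training-time distribution and recent predictions; a common rule of thumb flags PSI above roughly 0.25 as severe drift and triggers the rollback path. The bin count, the 0.25 threshold, and the synthetic distributions here are illustrative assumptions.

```python
import math
import random

random.seed(1)

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of prices."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def hist(sample):
        counts = [0] * bins
        for v in sample:
            counts[sum(v > e for e in edges)] += 1
        # Floor empty buckets at one count to avoid log(0).
        return [max(c, 1) / len(sample) for c in counts]

    pe, pa = hist(expected), hist(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(pe, pa))

baseline = [random.gauss(350_000, 60_000) for _ in range(5_000)]  # training era
stable   = [random.gauss(350_000, 60_000) for _ in range(5_000)]  # same market
shifted  = [random.gauss(420_000, 60_000) for _ in range(5_000)]  # market jump

print(f"stable PSI  = {psi(baseline, stable):.3f}")
print(f"shifted PSI = {psi(baseline, shifted):.3f}")
if psi(baseline, shifted) > 0.25:
    print("drift detected -> trigger automatic rollback")
```

The same check runs per feature to catch drift in the feature space, not just in the target distribution.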