category: System Design · ML System Design · difficulty: hard · frequency: must · first seen: 2026-03-13
The "Mining Novel Data from a Large Unlabeled Corpus" problem is a standard OpenAI Machine Learning Engineer (MLE) interview case study. It tests your ability to design scalable systems that identify and filter high-quality, diverse, or "novel" data to improve LLM training when raw web data becomes redundant or low-quality.
1. Problem Statement
You are given a massive, unlabeled text corpus (e.g., a "dump" of the internet). Your goal is to design a pipeline to extract a subset of data that is novel (unseen or rare concepts), high-quality (coherent, factual, and free of toxicity), and diverse (covering many domains).
2. Analysis & Methodology
A senior-level approach should prioritize scaling laws: spend a small fraction of the compute budget building predictive quality models on samples before committing to processing the full dataset.
- Perplexity: lower perplexity under a reference language model often correlates with coherent, standard language, while extremely low perplexity can flag boilerplate.
- Information density: token-level entropy separates content-rich text from repetitive filler.
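These two signals can be sketched with stdlib-only code. This is an illustrative toy: the smoothed unigram background model stands in for a real language model, and the whitespace tokenizer is a placeholder.

```python
import math
from collections import Counter

def token_entropy(text: str) -> float:
    """Shannon entropy (bits/token) of the whitespace-token distribution.
    Low entropy flags repetitive, information-poor text."""
    tokens = text.split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def unigram_perplexity(text: str, background: Counter, total: int) -> float:
    """Perplexity of `text` under an add-one-smoothed unigram background
    model built from a reference corpus. Unusually high perplexity flags
    gibberish; unusually low perplexity flags boilerplate."""
    tokens = text.split()
    vocab = len(background) + 1  # +1 for the unseen-word bucket
    log_prob = sum(
        math.log2((background[t] + 1) / (total + vocab)) for t in tokens
    )
    return 2 ** (-log_prob / max(len(tokens), 1))
```

In a real pipeline the background model would be a KenLM or small transformer LM trained on a trusted corpus, and documents would be bucketed by perplexity rather than hard-thresholded.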
- Exact and near-duplicate removal: use MinHash with LSH (locality-sensitive hashing) to drop exact and near-duplicates at scale.
- Semantic deduplication: embed documents (e.g., with a model like text-embedding-3-small) and remove semantically redundant data.
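A minimal MinHash sketch of the near-duplicate step, assuming word 3-shingles and salted hashes in place of true random permutations (the standard shortcut; at production scale you would use a library such as datasketch and band the signatures into LSH buckets):

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """Word k-shingles; near-duplicate documents share most shingles."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

def minhash(text: str, num_perm: int = 64) -> list:
    """MinHash signature: for each 'permutation' (simulated by salting the
    hash with a seed), keep the minimum hash value over all shingles."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated Jaccard similarity exceeds a threshold (commonly around 0.8) are collapsed to a single representative.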
- Heuristic filters: remove gibberish using regex rules, word-count thresholds, and language identification.
- Model-based classifiers: train a lightweight quality classifier (e.g., FastText or a small BERT) on a small labeled set contrasting high-quality sources (Wikipedia, textbooks) with raw web crawl.
- LLM-as-a-judge: for critical subsets, use a more powerful model (e.g., GPT-4o) to grade samples on logic, coherence, and reasoning-step depth.
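The cheap heuristic tier can be sketched as a single predicate; the thresholds below are illustrative placeholders, not anyone's production values.

```python
import re

# Assumption: these cutoffs are illustrative, chosen only for the sketch.
MIN_WORDS, MAX_WORDS = 50, 100_000
MAX_SYMBOL_RATIO = 0.10    # share of non-alphanumeric, non-space characters
MAX_DUP_LINE_RATIO = 0.30  # share of repeated lines (nav menus, footers)

def passes_heuristics(text: str) -> bool:
    """Fast gibberish/boilerplate gate applied before any model-based scoring."""
    words = text.split()
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):
        return False
    symbols = len(re.findall(r"[^\w\s]", text))
    if symbols / max(len(text), 1) > MAX_SYMBOL_RATIO:
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and 1 - len(set(lines)) / len(lines) > MAX_DUP_LINE_RATIO:
        return False
    return True
```

Because each check is O(length) and stateless, this tier parallelizes trivially across a Spark or Ray cluster and typically removes the bulk of the raw crawl before the expensive classifiers run.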
3. Proposed Pipeline Design
- Stage 1, massive filtering: run fast heuristic filters and deduplication with distributed processing (e.g., Spark).
- Stage 2, embedding and clustering: embed the surviving data and cluster it; identify "sparse" clusters where the model currently lacks coverage (novelty).
- Stage 3, synthetic enhancement: if a domain is high-quality but underrepresented, use an LLM to generate synthetic Q&A pairs or reasoning chains to augment that specific niche.
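One way Stage 2's novelty signal could be operationalized, once cluster labels exist, is to oversample documents from small clusters. The inverse-cluster-size weighting and the `temperature` knob below are assumptions for illustration, not a prescribed method.

```python
from collections import Counter

def novelty_weights(labels: list, temperature: float = 0.5) -> list:
    """Given one cluster label per document (from k-means or similar run
    upstream on embeddings), return sampling weights that are inversely
    related to cluster size, so sparse ('novel') regions are oversampled.
    `temperature` in (0, 1] softens the inversion: 1.0 gives every cluster
    equal total mass, values near 0 approach uniform sampling."""
    counts = Counter(labels)
    raw = [(1.0 / counts[lbl]) ** temperature for lbl in labels]
    z = sum(raw)
    return [w / z for w in raw]
```

Sampling training data with these weights shifts the mixture toward underrepresented clusters without discarding the dominant ones entirely.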
4. Failure Modes & Edge Cases
- Data leakage: ensure the mined "novel" data does not accidentally include evaluation benchmarks; decontaminate against held-out test sets.
- Bias reinforcement: aggressive quality filtering can inadvertently remove niche dialects or cultural perspectives if the classifier is trained too narrowly.
- Scaling costs: running an LLM grader on trillions of tokens is prohibitively expensive; reserve it for validation or for distilling a cheaper classifier.
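The leakage check is usually an n-gram overlap filter in the style of GPT-3's decontamination, which used 13-gram matching; the `n=8` default below is purely illustrative.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Set of word n-grams in lowercased `text`."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_benchmark_index(benchmark_docs: list, n: int = 8) -> set:
    """Union of all n-grams appearing in held-out evaluation sets."""
    index = set()
    for doc in benchmark_docs:
        index |= ngrams(doc, n)
    return index

def is_contaminated(doc: str, index: set, n: int = 8) -> bool:
    """True if a training document shares any n-gram with a benchmark."""
    return not ngrams(doc, n).isdisjoint(index)
```

At corpus scale the benchmark index is small enough to broadcast to every worker, so this check adds little cost relative to the rest of the pipeline.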