User Record Similarity Matcher

[ OK ] 193 — full content available

[ INFO ] category: Coding difficulty: hard freq: high first seen: 2026-01-13

[HARD][CODING][HIGH]Hash TableGraphUnion FindBFS

$ cat problem.md

User Record Similarity Matcher is a Stripe interview problem that builds a graph from user records based on field similarity scores, then finds connected users using graph algorithms like BFS or Union-Find.[1]

Problem Statement

Given a list of user records rows, each with fields id (unique integer), name (string), email (string), and company (string), plus a weights map (field name to non-negative float, e.g., {name: 0.2, email: 0.5, company: 0.3}) and a float threshold (0 ≤ threshold ≤ 1), define similarity score between records A and B as the sum of weights[field] for matching fields (exact string equality). Records are directly linked if score ≥ threshold (undirected relation).[1]

The core task constructs an undirected graph with records as nodes and direct links as edges. For a target_user_id, progressively extended requirements are:

Direct links: Return IDs directly linked to target (exclude target itself).
One-hop indirect: Include direct links plus their direct links.
Full connected component: Return all reachable IDs (same component via any path length), typically excluding target (specify choice).[1]

Suitable function signatures: find_direct_links(rows, weights, threshold, target_user_id) -> List[int] find_all_linked(rows, weights, threshold, target_user_id) -> List[int] Assumptions: Unique IDs; rows fit in memory (no distributed design needed). Tests Hash Table (adjacency list, visited sets), Graph (construction/traversal), Union Find (component merging), BFS (reachability/flood fill).[1]

Example

rows = [ { id: 1, name: "Alice", email: "[email protected]", company: "Stripe" }, { id: 2, name: "Alicia", email: "[email protected]", company: "Stripe" }, { id: 3, name: "Alice", email: "[email protected]", company: "Google" } ] weights = {name: 0.2, email: 0.5, company: 0.3} threshold = 0.5, target_user_id = 1

Record 1 vs 2: email + company match → 0.5 + 0.3 = 0.8 ≥ 0.5 → linked.
Record 1 vs 3: name match → 0.2 < 0.5 → not linked.
Record 2 vs 3: name match → 0.2 < 0.5 → not linked. For find_all_linked, output includes {2, 3} if further links exist; here direct is {2}, but example notes 3 as one-hop via 2 (implying additional setup).[1]

Constraints

No explicit numeric bounds given, but implied: N = |rows| reasonable for O(N²) similarity checks (e.g., N ≤ 10⁴ typical for interviews); strings short; weights sum ≤ 1; 0 ≤ threshold ≤ 1; IDs unique positives. Graph may have up to O(N²) edges (dense case). [1]

user@intervues:~/stripe$