Design a Code Vulnerability Analysis System
Problem Statement
Your team needs to build an automated security scanning platform that integrates with GitHub repositories. When developers open pull requests, the system should intercept these events, examine the code changes for security vulnerabilities using language-specific analysis engines (supporting Java, C#, and Python initially), and post detailed feedback directly on the pull request. Additionally, engineering managers need a dashboard to track vulnerability trends across their teams without exposing sensitive security details to unauthorized personnel.
The system must process approximately 30,000 pull requests daily with significant traffic spikes during business hours. Security is paramount -- vulnerability data must remain isolated between organizations, and access controls must prevent unauthorized viewing of security findings. The architecture should accommodate future language support and scale to handle growing adoption across multiple enterprise customers.
Key Requirements
Functional
- Webhook ingestion -- Accept GitHub webhook events for new and updated pull requests without blocking or timing out
- Language-specific analysis -- Route code to appropriate analyzers (Java, C#, Python) based on file extensions and project configuration
- Result delivery -- Post findings as inline comments on the original pull request with clear vulnerability descriptions and remediation guidance
- Analytics dashboard -- Provide managers with aggregated metrics (vulnerability counts, resolution times, team comparisons) with proper access controls
Non-Functional
- Scalability -- Handle 30,000 PRs per day (~1,250 per hour on average, substantially higher during business-hour peaks) with room for 3x growth
- Reliability -- Ensure each comment is posted at most once (no duplicates, even across retries), recover from analyzer crashes, and degrade gracefully when the GitHub API is unavailable
- Latency -- Return webhook acknowledgment within 5 seconds; complete analysis and post results within 10 minutes for 95th percentile
- Consistency -- Maintain strong consistency for job state and findings storage; eventual consistency acceptable for analytics aggregations
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Asynchronous Processing Architecture
The core challenge is decoupling webhook receipt from expensive analysis work. Interviewers want to see how you prevent webhook timeouts, handle work distribution, and manage job lifecycle.
Hints to consider:
- Use a message queue to buffer incoming webhook events and enable fast acknowledgment back to GitHub
- Implement worker pools that consume from the queue, with separate pools per language analyzer for independent scaling
- Design a job state machine (pending → analyzing → posting → complete) stored in a database to track progress and enable retries
- Consider how you'll handle duplicate webhook deliveries from GitHub using idempotency keys
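The state machine and duplicate-delivery hints above can be sketched in a few lines. The states come from the flow described here; the transition table and the use of GitHub's `X-GitHub-Delivery` header as the idempotency key are illustrative choices, not a prescribed schema:

```python
# Sketch of the job state machine (pending -> analyzing -> posting -> complete)
# plus a duplicate-webhook guard. States/transitions are illustrative.
from enum import Enum

class JobState(Enum):
    PENDING = "pending"
    ANALYZING = "analyzing"
    POSTING = "posting"
    COMPLETE = "complete"
    FAILED = "failed"

# Allowed transitions; anything else is rejected, so a retried or crashed
# worker cannot move a job backwards or skip a step.
ALLOWED = {
    JobState.PENDING: {JobState.ANALYZING},
    JobState.ANALYZING: {JobState.POSTING, JobState.FAILED},
    JobState.POSTING: {JobState.COMPLETE, JobState.FAILED},
}

def transition(current: JobState, target: JobState) -> JobState:
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target

# GitHub sends a unique X-GitHub-Delivery ID per webhook; remembering seen
# IDs (in production, a database unique constraint) makes ingestion idempotent.
seen_deliveries: set[str] = set()

def is_duplicate(delivery_id: str) -> bool:
    if delivery_id in seen_deliveries:
        return True
    seen_deliveries.add(delivery_id)
    return False
```

In a real system the transition check would run inside a database transaction (e.g. a conditional `UPDATE ... WHERE status = ?`) so two workers cannot advance the same job concurrently.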
2. Rate Limiting and External API Management
GitHub enforces strict rate limits on API calls (typically 5,000 requests per hour per OAuth token). Posting comments consumes this budget, and exceeding limits blocks all operations.
Hints to consider:
- Implement token bucket rate limiting before making GitHub API calls, using distributed state (Redis) to coordinate across workers
- Batch related comments when possible or consolidate findings into fewer API calls
- Add exponential backoff and circuit breakers for transient GitHub failures
- Store GitHub API response headers to track remaining quota and adjust worker throughput dynamically
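A minimal single-process token bucket shows the core of the first hint. In production the bucket state would live in Redis so all workers draw from one shared budget; the refill rate below is derived from GitHub's default 5,000 requests/hour quota:

```python
# Minimal token bucket for pacing GitHub API calls (single-process sketch;
# a distributed version would keep tokens/last-refill in Redis).
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def try_acquire(self, n: int = 1) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# 5,000 requests/hour ~= 1.39 tokens/second; capacity bounds burst size.
bucket = TokenBucket(capacity=100, refill_per_sec=5000 / 3600)
```

A worker that fails `try_acquire` would requeue or delay the job rather than call GitHub and burn a rate-limited 403.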
3. Multi-Tenancy and Data Security
Different organizations using the system must have complete isolation of their code and vulnerability findings. This affects data storage, API design, and the analytics layer.
Hints to consider:
- Partition data by tenant ID at the database level, using row-level security policies or separate schemas
- Encrypt sensitive fields (code snippets, vulnerability details) at rest and in transit
- Implement role-based access control where managers can only view aggregated metrics for their authorized teams
- Use separate API tokens per tenant for GitHub integration to prevent cross-tenant access even if application logic fails
4. Handling Long-Running and Unpredictable Analysis Jobs
Static analysis tasks vary wildly in duration (5 seconds to 10 minutes) depending on code size and complexity. Some jobs may crash or time out.
Hints to consider:
- Set per-job timeouts and implement graceful shutdown signals to analyzers
- Use a dead-letter queue for failed jobs and expose them in an operator dashboard
- Consider horizontal scaling of analyzer workers with autoscaling based on queue depth
- Track job durations and use percentiles to set appropriate timeout values and inform capacity planning
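The last hint, deriving timeouts from observed percentiles rather than guessing, can be sketched directly. The nearest-rank percentile method and the sample durations below are illustrative:

```python
# Derive a per-job timeout from recent job durations: take a high percentile
# and add headroom so legitimately slow jobs still finish.
def percentile(samples: list[float], p: float) -> float:
    s = sorted(samples)
    # Nearest-rank method: smallest value covering p percent of samples.
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Made-up recent analyzer runtimes, in seconds.
recent_durations = [5, 8, 12, 30, 45, 70, 110, 240, 400, 590]

p95 = percentile(recent_durations, 95)
timeout_seconds = p95 * 1.5  # 1.5x headroom over the 95th percentile
```

Recomputing this per language keeps a slow Java analyzer from inflating timeouts for the fast Python pool.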
Suggested Approach
Step 1: Clarify Requirements
Begin by confirming scope boundaries and constraints with your interviewer:
- Scale confirmation: Validate the 30,000 PR/day volume and ask about burst patterns (are weekends lower? Is there a time-zone concentration?)
- Language support: Confirm initial support for Java, C#, Python -- will each have similar processing times? Are you integrating existing analyzer tools or building from scratch?
- Security scope: Clarify what vulnerabilities to detect (SQL injection, XSS, hardcoded secrets?) and whether false positives are acceptable
- GitHub integration: Ask if you're building a GitHub App, using OAuth tokens, or leveraging GitHub Actions -- this affects authentication and rate limits
- Analytics requirements: Understand what metrics managers need (counts by severity? Time to remediation? Comparison across teams?) and real-time vs. batch requirements
Step 2: High-Level Architecture
Sketch the core data flow and major components:
Webhook Ingestion Layer: A lightweight HTTP service receives GitHub webhooks, validates signatures, extracts PR metadata (repo, branch, diff URL), and publishes events to a message queue (Kafka or RabbitMQ). This service responds within 5 seconds to avoid GitHub retries.
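The signature-validation step in the ingestion layer follows GitHub's webhook scheme: GitHub signs the raw request body with the shared webhook secret and sends the HMAC-SHA256 digest in the `X-Hub-Signature-256` header as `sha256=<hexdigest>`. A minimal check, assuming the raw body bytes and header are already extracted from the request:

```python
# Verify a GitHub webhook signature before enqueueing the event.
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position via timing.
    return hmac.compare_digest(expected, signature_header)
```

Requests failing this check are rejected with a 401 before any queue or database work happens.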
Job Orchestration: A coordinator service consumes webhook events, fetches the actual code diff from GitHub, determines which language analyzers are needed based on file extensions, and creates analysis jobs in a job queue. Job metadata (PR ID, tenant ID, status, created timestamp) is persisted in PostgreSQL.
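The coordinator's routing step can be sketched as a simple extension-to-queue mapping over the changed files in a diff. The extensions and queue names here are assumptions:

```python
# Map changed files in a PR diff to the analyzer queues that should
# receive a job. One job per language keeps worker pools independent.
EXTENSION_TO_ANALYZER = {
    ".java": "analyzer.java",
    ".cs": "analyzer.csharp",
    ".py": "analyzer.python",
}

def analyzers_for_diff(changed_files: list[str]) -> set[str]:
    queues = set()
    for path in changed_files:
        for ext, queue in EXTENSION_TO_ANALYZER.items():
            if path.endswith(ext):
                queues.add(queue)
    return queues  # files with no matching analyzer are simply skipped
```

Project-level configuration (e.g. a repo opting a directory out of scanning) would be applied before this mapping.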
Analyzer Workers: Separate worker pools for each language consume jobs from language-specific queues. Workers download code, run the security scanner (via local library or subprocess), parse results, and persist findings to the database with tenant isolation.
Result Poster: A dedicated service reads completed analysis jobs, formats findings as GitHub comments (with line numbers and severity badges), applies rate limiting, and posts to the GitHub API. It marks jobs as complete and handles retries with idempotency checks.
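The poster's idempotency-plus-retry behavior can be sketched as follows. The in-memory key set stands in for a database unique constraint on (job ID, finding ID), and `post_fn` stands in for the real GitHub client call:

```python
# Post a comment at most once per key, retrying transient failures with
# exponential backoff. `posted_keys` would be a DB table in production.
import time

posted_keys: set[str] = set()

def post_comment_once(key: str, post_fn, max_attempts: int = 4) -> bool:
    if key in posted_keys:
        return False  # already delivered; skip the duplicate
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            post_fn()
            posted_keys.add(key)
            return True
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted retries; job goes to the dead-letter queue
            time.sleep(delay)
            delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
    return False
```

Recording the key only after a successful post means a crash mid-call can cause one retry of the same comment, which is why the key should ultimately be enforced by a database constraint rather than memory.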
Analytics Service: Provides a read-only API for the manager dashboard, querying aggregated views (materialized views or time-series tables) that roll up findings by team, date, and severity without exposing raw vulnerability details.
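The rollup the dashboard reads is essentially a count keyed by (team, date, severity), with raw vulnerability details left out. A toy version, with illustrative field names and made-up rows:

```python
# Aggregate findings into the counts the manager dashboard queries.
# Only dimensions and counts survive; no code snippets or finding details.
from collections import Counter

findings = [
    {"team": "payments", "date": "2024-05-01", "severity": "high"},
    {"team": "payments", "date": "2024-05-01", "severity": "high"},
    {"team": "identity", "date": "2024-05-01", "severity": "low"},
]

rollup = Counter((f["team"], f["date"], f["severity"]) for f in findings)
```

In the real system this aggregation lives in a materialized view refreshed on a schedule, which is where the accepted eventual consistency for analytics comes from.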