Design a Code Vulnerability Analysis System
Problem Statement
Your team needs to build an automated security scanning platform that integrates with GitHub repositories. When developers open pull requests, the system should intercept these events, examine the code changes for security vulnerabilities using language-specific analysis engines (supporting Java, C#, and Python initially), and post detailed feedback directly on the pull request. Additionally, engineering managers need a dashboard to track vulnerability trends across their teams without exposing sensitive security details to unauthorized personnel.
The system must process approximately 30,000 pull requests daily with significant traffic spikes during business hours. Security is paramount -- vulnerability data must remain isolated between organizations, and access controls must prevent unauthorized viewing of security findings. The architecture should accommodate future language support and scale to handle growing adoption across multiple enterprise customers.
Key Requirements
Functional
- Webhook ingestion -- Accept GitHub webhook events for new and updated pull requests without blocking or timing out
- Language-specific analysis -- Route code to appropriate analyzers (Java, C#, Python) based on file extensions and project configuration
- Result delivery -- Post findings as inline comments on the original pull request with clear vulnerability descriptions and remediation guidance
- Analytics dashboard -- Provide managers with aggregated metrics (vulnerability counts, resolution times, team comparisons) with proper access controls
Non-Functional
- Scalability -- Handle 30,000 PRs per day (~1,250 per hour on average, substantially higher during business-hour peaks) with room for 3x growth
- Reliability -- Ensure each comment is posted at most once (no duplicates, even across retries), recover from analyzer crashes, and degrade gracefully when the GitHub API is unavailable
- Latency -- Return webhook acknowledgment within 5 seconds; complete analysis and post results within 10 minutes for 95th percentile
- Consistency -- Maintain strong consistency for job state and findings storage; eventual consistency acceptable for analytics aggregations
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Asynchronous Processing Architecture
The core challenge is decoupling webhook receipt from expensive analysis work. Interviewers want to see how you prevent webhook timeouts, handle work distribution, and manage job lifecycle.
Hints to consider:
- Use a message queue to buffer incoming webhook events and enable fast acknowledgment back to GitHub
- Implement worker pools that consume from the queue, with separate pools per language analyzer for independent scaling
- Design a job state machine (pending → analyzing → posting → complete) stored in a database to track progress and enable retries
- Consider how you'll handle duplicate webhook deliveries from GitHub using idempotency keys
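The state machine and duplicate-delivery hints above can be sketched in a few lines. The states come from the flow described here; the transition table and the use of GitHub's `X-GitHub-Delivery` header as the idempotency key are illustrative choices, not a prescribed schema:

```python
# Sketch of the job state machine (pending -> analyzing -> posting -> complete)
# plus a duplicate-webhook guard. States/transitions are illustrative.
from enum import Enum

class JobState(Enum):
    PENDING = "pending"
    ANALYZING = "analyzing"
    POSTING = "posting"
    COMPLETE = "complete"
    FAILED = "failed"

# Allowed transitions; anything else is rejected, so a retried or crashed
# worker cannot move a job backwards or skip a step.
ALLOWED = {
    JobState.PENDING: {JobState.ANALYZING},
    JobState.ANALYZING: {JobState.POSTING, JobState.FAILED},
    JobState.POSTING: {JobState.COMPLETE, JobState.FAILED},
}

def transition(current: JobState, target: JobState) -> JobState:
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target

# GitHub sends a unique X-GitHub-Delivery ID per webhook; remembering seen
# IDs (in production, a database unique constraint) makes ingestion idempotent.
seen_deliveries: set[str] = set()

def is_duplicate(delivery_id: str) -> bool:
    if delivery_id in seen_deliveries:
        return True
    seen_deliveries.add(delivery_id)
    return False
```

In a real system the transition check would run inside a database transaction (e.g. a conditional `UPDATE ... WHERE status = ?`) so two workers cannot advance the same job concurrently.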
2. Rate Limiting and External API Management
GitHub enforces strict rate limits on API calls (typically 5,000 requests per hour per OAuth token). Posting comments consumes this budget, and exceeding limits blocks all operations.
Hints to consider:
- Implement token bucket rate limiting before making GitHub API calls, using distributed state (Redis) to coordinate across workers
- Batch related comments when possible or consolidate findings into fewer API calls
- Add exponential backoff and circuit breakers for transient GitHub failures
- Store GitHub API response headers to track remaining quota and adjust worker throughput dynamically
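A minimal single-process token bucket shows the core of the first hint. In production the bucket state would live in Redis so all workers draw from one shared budget; the refill rate below is derived from GitHub's default 5,000 requests/hour quota:

```python
# Minimal token bucket for pacing GitHub API calls (single-process sketch;
# a distributed version would keep tokens/last-refill in Redis).
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def try_acquire(self, n: int = 1) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# 5,000 requests/hour ~= 1.39 tokens/second; capacity bounds burst size.
bucket = TokenBucket(capacity=100, refill_per_sec=5000 / 3600)
```

A worker that fails `try_acquire` would requeue or delay the job rather than call GitHub and burn a rate-limited 403.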
3. Multi-Tenancy and Data Security
Different organizations using the system must have complete isolation of their code and vulnerability findings. This affects data storage, API design, and the analytics layer.
Hints to consider:
- Partition data by tenant ID at the database level, using row-level security policies or separate schemas
- Encrypt sensitive fields (code snippets, vulnerability details) at rest and in transit
- Implement role-based access control where managers can only view aggregated metrics for their authorized teams
- Use separate API tokens per tenant for GitHub integration to prevent cross-tenant access even if application logic fails
4. Handling Long-Running and Unpredictable Analysis Jobs
Static analysis tasks vary wildly in duration (5 seconds to 10 minutes) depending on code size and complexity. Some jobs may crash or time out.
Hints to consider:
- Set per-job timeouts and implement graceful shutdown signals to analyzers
- Use a dead-letter queue for failed jobs and expose them in an operator dashboard
- Consider horizontal scaling of analyzer workers with autoscaling based on queue depth
- Track job durations and use percentiles to set appropriate timeout values and inform capacity planning
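The last hint, deriving timeouts from observed percentiles rather than guessing, can be sketched directly. The nearest-rank percentile method and the sample durations below are illustrative:

```python
# Derive a per-job timeout from recent job durations: take a high percentile
# and add headroom so legitimately slow jobs still finish.
def percentile(samples: list[float], p: float) -> float:
    s = sorted(samples)
    # Nearest-rank method: smallest value covering p percent of samples.
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Made-up recent analyzer runtimes, in seconds.
recent_durations = [5, 8, 12, 30, 45, 70, 110, 240, 400, 590]

p95 = percentile(recent_durations, 95)
timeout_seconds = p95 * 1.5  # 1.5x headroom over the 95th percentile
```

Recomputing this per language keeps a slow Java analyzer from inflating timeouts for the fast Python pool.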
Suggested Approach
Step 1: Clarify Requirements
Begin by confirming scope boundaries and constraints with your interviewer:
- Scale confirmation: Validate the 30,000 PR/day volume and ask about burst patterns (are weekends lower? Is there a time-zone concentration?)
- Language support: Confirm initial support for Java, C#, Python -- will each have similar processing times? Are you integrating existing analyzer tools or building from scratch?
- Security scope: Clarify what vulnerabilities to detect (SQL injection, XSS, hardcoded secrets?) and whether false positives are acceptable
- GitHub integration: Ask if you're building a GitHub App, using OAuth tokens, or leveraging GitHub Actions -- this affects authentication and rate limits
- Analytics requirements: Understand what metrics managers need (counts by severity? Time to remediation? Comparison across teams?) and real-time vs. batch requirements
Step 2: High-Level Architecture
Sketch the core data flow and major components:
Webhook Ingestion Layer: A lightweight HTTP service receives GitHub webhooks, validates signatures, extracts PR metadata (repo, branch, diff URL), and publishes events to a message queue (Kafka or RabbitMQ). This service responds within 5 seconds to avoid GitHub retries.
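The signature-validation step in the ingestion layer follows GitHub's webhook scheme: GitHub signs the raw request body with the shared webhook secret and sends the HMAC-SHA256 digest in the `X-Hub-Signature-256` header as `sha256=<hexdigest>`. A minimal check, assuming the raw body bytes and header are already extracted from the request:

```python
# Verify a GitHub webhook signature before enqueueing the event.
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position via timing.
    return hmac.compare_digest(expected, signature_header)
```

Requests failing this check are rejected with a 401 before any queue or database work happens.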
Job Orchestration: A coordinator service consumes webhook events, fetches the actual code diff from GitHub, determines which language analyzers are needed based on file extensions, and creates analysis jobs in a job queue. Job metadata (PR ID, tenant ID, status, created timestamp) is persisted in PostgreSQL.
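The coordinator's routing step can be sketched as a simple extension-to-queue mapping over the changed files in a diff. The extensions and queue names here are assumptions:

```python
# Map changed files in a PR diff to the analyzer queues that should
# receive a job. One job per language keeps worker pools independent.
EXTENSION_TO_ANALYZER = {
    ".java": "analyzer.java",
    ".cs": "analyzer.csharp",
    ".py": "analyzer.python",
}

def analyzers_for_diff(changed_files: list[str]) -> set[str]:
    queues = set()
    for path in changed_files:
        for ext, queue in EXTENSION_TO_ANALYZER.items():
            if path.endswith(ext):
                queues.add(queue)
    return queues  # files with no matching analyzer are simply skipped
```

Project-level configuration (e.g. a repo opting a directory out of scanning) would be applied before this mapping.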
Analyzer Workers: Separate worker pools for each language consume jobs from language-specific queues. Workers download code, run the security scanner (via local library or subprocess), parse results, and persist findings to the database with tenant isolation.
Result Poster: A dedicated service reads completed analysis jobs, formats findings as GitHub comments (with line numbers and severity badges), applies rate limiting, and posts to the GitHub API. It marks jobs as complete and handles retries with idempotency checks.
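The poster's idempotency-plus-retry behavior can be sketched as follows. The in-memory key set stands in for a database unique constraint on (job ID, finding ID), and `post_fn` stands in for the real GitHub client call:

```python
# Post a comment at most once per key, retrying transient failures with
# exponential backoff. `posted_keys` would be a DB table in production.
import time

posted_keys: set[str] = set()

def post_comment_once(key: str, post_fn, max_attempts: int = 4) -> bool:
    if key in posted_keys:
        return False  # already delivered; skip the duplicate
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            post_fn()
            posted_keys.add(key)
            return True
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted retries; job goes to the dead-letter queue
            time.sleep(delay)
            delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
    return False
```

Recording the key only after a successful post means a crash mid-call can cause one retry of the same comment, which is why the key should ultimately be enforced by a database constraint rather than memory.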
Analytics Service: Provides a read-only API for the manager dashboard, querying aggregated views (materialized views or time-series tables) that roll up findings by team, date, and severity without exposing raw vulnerability details.
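The rollup the dashboard reads is essentially a count keyed by (team, date, severity), with raw vulnerability details left out. A toy version, with illustrative field names and made-up rows:

```python
# Aggregate findings into the counts the manager dashboard queries.
# Only dimensions and counts survive; no code snippets or finding details.
from collections import Counter

findings = [
    {"team": "payments", "date": "2024-05-01", "severity": "high"},
    {"team": "payments", "date": "2024-05-01", "severity": "high"},
    {"team": "identity", "date": "2024-05-01", "severity": "low"},
]

rollup = Counter((f["team"], f["date"], f["severity"]) for f in findings)
```

In the real system this aggregation lives in a materialized view refreshed on a schedule, which is where the accepted eventual consistency for analytics comes from.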