Cloud IDE

Design a cloud-based IDE similar to Replit or GitHub Codespaces. Users can write code, manage files, run terminal commands, and see real-time output—all in the browser without local setup. This problem tests your ability to design systems with resource management, real-time streaming, and multi-tenancy isolation. The core challenges are VM/container lifecycle management and efficient terminal output streaming at scale.
This walkthrough follows the Interview Framework. Use it as a guide, not a script—adapt based on interviewer cues.
Functional Requirements

- Users should be able to create workspaces with a file tree (create, edit, delete files and folders)
- Users should be able to run code and terminal commands with real-time output streaming (stdout/stderr)
- Users should be able to stop running processes
- Users should be able to install packages and persist environment state within a session
- Users should be able to share workspaces with others (view/edit permissions)
Sharing here means access control (view/edit) and last-write-wins. Real-time collaborative editing (OT/CRDT) is out of scope unless explicitly required.
Package installs persist within an active session. Assume user-level installs (pip/npm/etc.) into /workspace or /home; OS-level installs require prebuilt images or an allowlist. Cross-session environment persistence (persisting the full filesystem) is optional (paid tier) and discussed as a trade-off later.
Non-Functional Requirements

| Requirement | Target | Rationale |
|---|---|---|
| Cold start latency | < 5 seconds | Users expect near-instant execution |
| Output latency | < 100ms | Real-time feel for terminal output |
| Availability | 99.9% | Critical for paid/enterprise users |
| Concurrent users | 100K simultaneous | Scale for popular platforms |
| Execution isolation | Strong | Security: users can't access each other's data |
In an interview, clarify: "Should we support long-running jobs (hours) or just interactive development sessions?" This significantly impacts VM lifecycle design. For this guide, we'll focus on interactive development with a 12-hour max session runtime.
Capacity Estimation
Assumptions:
- 100K concurrent users, each with 1 active workspace session
- Average session: 2 vCPU, 4GB RAM
- Peak concurrent sessions: 100K
Compute resources:
- 100K sessions × 2 vCPU = 200K vCPUs needed at peak
- 100K sessions × 4GB = 400TB RAM at peak
- At ~40GB usable RAM per node, this requires ~10,000 compute nodes
Terminal output streaming:
- Not all sessions actively run processes simultaneously—assume 50% peak utilization
- 50K active processes × 1KB/second = 50MB/second of output data
- This is manageable with a modest Kafka cluster (3-5 brokers)
The main cost driver is compute, not storage. VM utilization optimization (pre-warming, pooling) directly impacts infrastructure costs.
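The estimates above are simple arithmetic and worth being able to re-derive quickly in an interview. A few lines make the math explicit (the 40GB usable-RAM figure assumes headroom reserved for the OS and node agents):

```python
# Sanity check of the capacity estimates above.
sessions = 100_000
vcpu_per_session = 2
ram_gb_per_session = 4

total_vcpu = sessions * vcpu_per_session              # 200K vCPUs at peak
total_ram_tb = sessions * ram_gb_per_session / 1000   # 400 TB at peak

usable_ram_gb_per_node = 40                           # assumed usable after OS/agents
nodes = sessions * ram_gb_per_session / usable_ram_gb_per_node  # ~10,000 nodes

# Streaming: ~50% of sessions emit ~1KB/s of terminal output.
active_processes = sessions // 2
output_mb_per_s = active_processes * 1 / 1000         # 50 MB/s into Kafka
```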
Core Entities

Workspace
├── id: UUID
├── owner_id: UUID
├── name: string
├── template: string (e.g., "python", "node", "go")
├── created_at: timestamp
├── updated_at: timestamp
└── sharing_mode: enum (private, view, edit)

File
├── id: UUID
├── workspace_id: UUID (FK)
├── path: string (e.g., "/src/main.py")
├── content: text (for small files)
├── content_ref: string (optional, object storage pointer for large files)
├── is_directory: boolean
├── created_at: timestamp
└── updated_at: timestamp

Process
├── id: UUID
├── workspace_id: UUID (FK)
├── sandbox_id: UUID (FK)
├── command: string (e.g., "python main.py", "npm run dev")
├── status: enum (pending, running, completed, failed, cancelled)
├── started_at: timestamp
├── finished_at: timestamp
└── exit_code: integer

Sandbox (VM/Container instance)
├── id: UUID
├── workspace_id: UUID (FK)
├── user_id: UUID
├── status: enum (provisioning, warm, assigned, running, idle, terminated)
├── instance_type: string (cpu-small, cpu-large, gpu)
├── ip_address: string
├── created_at: timestamp
├── last_activity_at: timestamp
└── expires_at: timestamp

Entity Relationships

- User 1:N Workspace
- Workspace 1:N File
- Workspace 1:1 Sandbox (active session)
- Workspace 1:N Process
- Sandbox 1:N Process
Keep Sandbox as a separate entity from Process. A sandbox persists across multiple command executions within a session, maintaining installed packages and filesystem state.
Protocol Choices

| Operation | Protocol | Reason |
|---|---|---|
| CRUD operations | REST | Standard request-response |
| Terminal streaming | WebSocket | Real-time bidirectional |
| File uploads | REST + multipart | Large payloads |

REST Endpoints
POST   /api/workspaces              Create workspace
GET    /api/workspaces/{id}         Get workspace with file tree
PUT    /api/workspaces/{id}         Update workspace metadata
DELETE /api/workspaces/{id}         Delete workspace

GET    /api/workspaces/{id}/files   List files (tree structure)
GET    /api/files/{id}              Get file content
POST   /api/workspaces/{id}/files   Create file or directory
PUT    /api/files/{id}              Update file content
DELETE /api/files/{id}              Delete file or directory
POST   /api/files/{id}/move         Move/rename file

POST   /api/workspaces/{id}/run     Run command (returns stream token + sandbox_id)
POST   /api/processes/{id}/cancel   Cancel running process
POST   /api/processes/{id}/input    Send stdin input

POST   /api/workspaces/{id}/sandbox Request/connect sandbox for workspace
GET    /api/sandboxes/{id}/status   Get sandbox status
DELETE /api/sandboxes/{id}          Terminate sandbox
Prefer WebSocket messages for interactive stdin/cancel to minimize latency. Keep REST input/cancel as a fallback for non-WS clients or automation.
Run response (202):
{ "process_id": "proc-123", "sandbox_id": "sbx-456", "stream_token": "signed-token" }
The token is sandbox-scoped and short-lived; the server can return the same token for subsequent commands until it expires. If a sandbox already exists for the workspace, the API returns the existing sandbox_id and a refreshed token.
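One minimal way to implement such a token is an HMAC-signed payload of sandbox ID plus expiry. This is an illustrative sketch (the secret, TTL, and format are assumptions; a real system might use JWTs instead):

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # hypothetical key, held only by the control plane

def mint_stream_token(sandbox_id: str, ttl_s: int = 300) -> str:
    """Issue a short-lived token scoped to a single sandbox."""
    expires = int(time.time()) + ttl_s
    payload = f"{sandbox_id}:{expires}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def validate_stream_token(token: str, sandbox_id: str) -> bool:
    """Check signature, scope, and expiry before accepting a WS connection."""
    try:
        sid, expires, sig = token.rsplit(":", 2)
    except ValueError:
        return False
    payload = f"{sid}:{expires}"
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(sig, expected)  # constant-time signature check
        and sid == sandbox_id               # token is bound to this sandbox
        and int(expires) > time.time()      # not yet expired
    )
```

Because validation is stateless, any WebSocket server can verify the token without a database round trip; refreshing a token is just minting a new one.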
WebSocket Protocol
The client uses a short-lived stream_token from the run response; the server validates the token and sandbox ownership.
WSS /api/stream/{sandbox_id}?token=stream_token
{ "type": "output", "process_id": "proc-123", "stream": "stdout" | "stderr", "data": "Hello, world!\n", "timestamp": 1699999999999 }
{ "type": "status", "process_id": "proc-123", "status": "completed", "exit_code": 0 }
{ "type": "resume", "process_id": "proc-123", "last_seen_id": "1699999999999-0" }
{ "type": "input", "process_id": "proc-123", "data": "user input\n" }
{ "type": "cancel", "process_id": "proc-123" }
The WebSocket connection is per-sandbox, not per-process. This allows streaming output from multiple concurrent processes (e.g., a dev server and a build command) while maintaining a single connection, reducing overhead.
Unlike notebook-style systems where executions are serialized, a Cloud IDE typically allows multiple concurrent processes (e.g., running a server while executing tests). The sandbox manages process isolation internally.
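The multiplexing this implies on the client side is a small dispatch by process_id. A sketch of that demultiplexing logic (shown in Python for consistency with the rest of this guide; a real browser client would do the same in JavaScript):

```python
import json
from collections import defaultdict

class SandboxStream:
    """Routes messages from one per-sandbox WebSocket to per-process state."""

    def __init__(self):
        self.outputs = defaultdict(list)  # process_id -> list of output chunks
        self.statuses = {}                # process_id -> terminal status

    def on_message(self, raw: str):
        msg = json.loads(raw)
        if msg["type"] == "output":
            # Several concurrent processes share the connection;
            # the process_id field tells us which terminal pane to update.
            self.outputs[msg["process_id"]].append(msg["data"])
        elif msg["type"] == "status":
            self.statuses[msg["process_id"]] = msg["status"]
```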
Architecture Overview

```mermaid
flowchart TB
    subgraph Clients
        WEB[Web Browser]
    end

    subgraph Edge["Edge Layer"]
        LB[Load Balancer]
        CDN[CDN - Static Assets]
    end

    subgraph App["Application Layer"]
        API[API Servers]
        WSS[WebSocket Servers]
    end

    subgraph Orchestration["Sandbox Orchestration"]
        SM[Sandbox Manager]
        POOL[Warm Pool Controller]
        K8S[Kubernetes Cluster]
    end

    subgraph Streaming["Terminal Output Streaming"]
        KAFKA[Kafka / Log Bus]
    end

    subgraph Storage["Storage Layer"]
        PG[(PostgreSQL Metadata)]
        REDIS[(Redis Session State)]
        S3[(Object Storage Workspaces/Files)]
    end

    subgraph Compute["Sandbox Compute"]
        VM1[Sandbox Pod 1]
        VM2[Sandbox Pod 2]
        VMN[Sandbox Pod N]
    end

    WEB -->|HTTPS| LB
    WEB -->|WSS| LB
    LB --> API
    LB --> WSS

    API --> SM
    API --> PG
    API --> S3

    SM --> POOL
    SM --> K8S
    POOL --> K8S

    K8S --> VM1
    K8S --> VM2
    K8S --> VMN

    VM1 --> KAFKA
    VM2 --> KAFKA
    VMN --> KAFKA

    KAFKA --> WSS

    WSS --> REDIS
    SM --> REDIS
```
Component Responsibilities
API Servers
- Handle REST requests for workspaces, files, processes
- Authenticate users, authorize actions
- Persist metadata to PostgreSQL, files to S3
WebSocket Servers
- Maintain persistent connections with clients
- Subscribe to shared Kafka topics (keyed by sandbox_id)
- Fan out terminal output to connected clients
- Append recent output to Redis streams for reconnect replay
- Handle stdin input forwarding
- Use consistent hashing/partition affinity so the server holding the client connection consumes that sandbox's partition (or add a routing layer for fan-out)
Sandbox Manager
- Orchestrate sandbox lifecycle (create, monitor, terminate)
- Route process execution requests to the appropriate sandbox
- Track sandbox health and resource usage
- Handle sandbox assignment for workspaces
Warm Pool Controller
- Maintain a pool of pre-provisioned sandboxes
- Scale pool size based on demand prediction
- Handle different instance types (CPU, GPU)
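The controller's replenishment loop can be sketched in a few lines. This is a simplification under stated assumptions (a single instance type, a fixed safety factor, and no accounting for sandboxes already provisioning); real controllers track in-flight provisioning per instance type:

```python
def warm_pool_target(predicted_demand_per_min: float,
                     provision_time_s: float,
                     safety_factor: float = 1.5,
                     min_pool: int = 10) -> int:
    """Size the warm pool to cover expected assignments during one
    provisioning window, with headroom for bursts."""
    window_demand = predicted_demand_per_min * provision_time_s / 60
    return max(min_pool, int(window_demand * safety_factor))

def replenish(current_warm: int, target: int) -> int:
    """How many new sandboxes to start provisioning (never negative)."""
    return max(0, target - current_warm)
```

The intuition: if cold provisioning takes 30s and you expect 600 assignments per minute, you need roughly 300 warm sandboxes just to cover one provisioning window, plus a buffer for bursts.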
Kubernetes Cluster
- Run sandbox containers/pods
- Provide network isolation between sandboxes
- Enforce resource limits (CPU, memory, disk)

Data Flow: Run Command
Let's walk through what happens when a user clicks "Run" or executes a terminal command:
```mermaid
sequenceDiagram
    participant U as User Browser
    participant API as API Server
    participant SM as Sandbox Manager
    participant R as Redis
    participant K as Kubernetes
    participant S as Sandbox Pod
    participant KF as Kafka
    participant WS as WebSocket Server

    U->>API: POST /workspaces/{id}/run {command}
    API->>SM: Request process execution
    SM->>R: Check workspace's active sandbox

    alt No active sandbox
        SM->>K: Create sandbox pod
        K-->>SM: Pod ready (IP, port)
        SM->>R: Store sandbox mapping
    end

    SM->>S: Execute command via gRPC
    SM-->>API: Process started
    API-->>U: 202 Accepted {process_id, sandbox_id, stream_token}

    U->>WS: Subscribe to process output (sandbox_id + stream_token)
    WS->>R: Get sandbox routing metadata
    WS->>KF: Subscribe to shared topic (keyed by sandbox_id)

    loop Output streaming
        S->>KF: Publish stdout/stderr
        KF->>WS: Consume messages
        WS->>U: Forward via WebSocket
    end

    S->>KF: Publish completion status
    KF->>WS: Consume completion
    WS->>U: Send status: completed
```
Sandbox Container Architecture
Each sandbox runs as an isolated Kubernetes pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sandbox-{workspace_id}
  labels:
    type: sandbox
    workspace: "{workspace_id}"
spec:
  containers:
    - name: runtime
      image: sandbox-python:3.11
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
      securityContext:
        runAsNonRoot: true
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        # Writable mount for user files and user-level package installs
        - name: workspace
          mountPath: /workspace
    - name: agent
      image: sandbox-agent:latest
      ports:
        - containerPort: 8080  # agent control-plane port (illustrative)
      volumeMounts:
        - name: workspace
          mountPath: /workspace
  volumes:
    - name: workspace
      emptyDir: {}
```
With a read-only root filesystem, package installs write to mounted volumes like /workspace, /home/sandbox, and /tmp. Use virtualenvs or language-specific paths under those mounts.
Two-container pattern:
- Runtime container: Runs user processes (Python/Node/Go/etc.), provides terminal shell access
- Agent container: Manages process lifecycle, captures terminal output, syncs files, communicates with control plane
Security is critical. User code runs in untrusted sandboxes. Use: (1) Container isolation with dropped capabilities, (2) Network policies blocking unauthorized egress, (3) Resource limits preventing DoS, (4) Read-only filesystem where possible, (5) Non-root user execution.
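Point (2) can be expressed as a Kubernetes NetworkPolicy. A minimal sketch, assuming the `type: sandbox` pod label from the spec above and hypothetical namespace labels for the control plane and streaming tiers:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sandbox-isolation
spec:
  podSelector:
    matchLabels:
      type: sandbox
  policyTypes: ["Ingress", "Egress"]
  ingress:
    # Only the control plane (e.g., gRPC calls to the agent) may reach sandbox pods;
    # sandbox-to-sandbox traffic is denied because no rule allows it.
    - from:
        - namespaceSelector:
            matchLabels:
              role: control-plane  # illustrative label
  egress:
    # Allow publishing output to the Kafka tier; all other egress is denied.
    - to:
        - namespaceSelector:
            matchLabels:
              role: streaming  # illustrative label
```

A real deployment would also need egress rules for DNS and for package registries (or a proxy that enforces an allowlist).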
Terminal Output Streaming Deep Dive
The output streaming pipeline is the heart of real-time terminal feel:
```mermaid
flowchart LR
    subgraph Sandbox["Sandbox Pod"]
        CODE[User Process] -->|stdout/stderr| AGENT[Agent Process]
        AGENT -->|Batch + Compress| BUF[Buffer]
    end

    subgraph Bus["Message Bus"]
        BUF -->|Publish| KAFKA[Kafka]
        KAFKA -->|Keyed partitions| PART[Partitions]
    end

    subgraph Delivery["WebSocket Layer"]
        PART -->|Subscribe| WSS[WS Server]
        WSS -->|Fan-out| C1[Client 1]
        WSS -->|Fan-out| C2[Client 2]
    end
```
Avoid one Kafka topic per sandbox. Use a small number of shared topics with partitions keyed by sandbox_id to preserve ordering without exploding metadata.
At scale, shard sandbox_id to a WebSocket server so only the owning shard consumes that partition; otherwise every WS server would need to consume everything and filter locally.
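The two mappings involved can be sketched with the same hash on both sides. This is a simplification of what Kafka's partitioner and a shard map would do (partition count and server list are illustrative), not a drop-in implementation:

```python
import hashlib

NUM_PARTITIONS = 64                    # partitions of the shared output topic
WS_SERVERS = ["ws-0", "ws-1", "ws-2"]  # illustrative WebSocket server fleet

def partition_for(sandbox_id: str) -> int:
    """All output for one sandbox keys to one partition, preserving order."""
    digest = hashlib.sha256(sandbox_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

def server_for(partition: int) -> str:
    """Each WS server owns a disjoint set of partitions, so only the server
    holding the client's connection consumes that sandbox's stream."""
    return WS_SERVERS[partition % len(WS_SERVERS)]
```

Clients connecting for sandbox X are routed (via the load balancer or a redirect) to `server_for(partition_for(X))`, so Kafka consumption and WebSocket fan-out happen on the same node. The mapping must be recomputed when the fleet changes, which is where a proper consistent-hash ring improves on the modulo shown here.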
Agent output handling:
```python
import time

import msgpack  # assumed available in the agent image

class OutputStreamer:
    """Batches stdout/stderr chunks and publishes them to a shared,
    sandbox-keyed Kafka topic."""

    FLUSH_BYTES = 4096       # flush once the buffer reaches 4KB...
    FLUSH_INTERVAL_S = 0.05  # ...or after 50ms, whichever comes first

    def __init__(self, kafka_producer, sandbox_id):
        self.producer = kafka_producer
        self.sandbox_id = sandbox_id
        self.topic = "sandbox-output"
        self.key = sandbox_id.encode()  # keys all events to one partition
        self.buffer = []
        self.last_flush = time.monotonic()

    def capture(self, process_id: str, stream: str, data: bytes):
        self.buffer.append({
            "process_id": process_id,
            "stream": stream,  # "stdout" or "stderr"
            "data": data,
            "ts": int(time.time() * 1000),
        })
        if self._should_flush():
            self._flush()

    def _should_flush(self):
        buffer_size = sum(len(m["data"]) for m in self.buffer)
        time_elapsed = time.monotonic() - self.last_flush
        return buffer_size >= self.FLUSH_BYTES or time_elapsed >= self.FLUSH_INTERVAL_S

    def _flush(self):
        if not self.buffer:
            return
        self.producer.send(
            self.topic,
            key=self.key,
            value=msgpack.packb({
                "sandbox_id": self.sandbox_id,
                "events": self.buffer,
            }),
        )
        self.buffer = []
        self.last_flush = time.monotonic()
```
Why batch output?
- Individual characters would create millions of messages
- 50ms batching provides perceived real-time feel
- Reduces Kafka throughput and WebSocket message overhead
Interview insight: Mention the trade-off between latency and throughput. Smaller batches = lower latency but higher overhead. 50ms is a sweet spot—humans can't perceive delays under 100ms as "laggy."
Output Persistence & Reconnection
A common interview follow-up: "What happens if the user disconnects mid-process and reconnects?"
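The core of the answer is replay: the client sends its `last_seen_id` in the resume message, and the server replays buffered events newer than that ID before resuming the live feed. A sketch of the replay logic, using an in-memory list as a stand-in for the Redis stream mentioned earlier (a real implementation would use XADD with a capped stream and XRANGE for the replay):

```python
# Each entry mimics a Redis stream entry: (id, event),
# with IDs ordered as "<ms-timestamp>-<sequence>".
def replay_after(stream: list[tuple[str, dict]], last_seen_id: str) -> list[dict]:
    """Return the events a client missed while disconnected."""
    def key(entry_id: str) -> tuple[int, int]:
        ms, seq = entry_id.split("-")
        return int(ms), int(seq)

    cutoff = key(last_seen_id)
    return [event for entry_id, event in stream if key(entry_id) > cutoff]
```

Capping the buffer (e.g., last N KB per sandbox) bounds memory; clients that fall behind the cap get a truncation marker instead of full replay.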