Cloud IDE

Design a cloud-based IDE similar to Replit or GitHub Codespaces. Users can write code, manage files, run terminal commands, and see real-time output—all in the browser without local setup. This problem tests your ability to design systems with resource management, real-time streaming, and multi-tenancy isolation. The core challenges are VM/container lifecycle management and efficient terminal output streaming at scale.
This walkthrough follows the Interview Framework. Use it as a guide, not a script—adapt based on interviewer cues.
Functional Requirements

- Users should be able to create workspaces with a file tree (create, edit, delete files and folders)
- Users should be able to run code and terminal commands with real-time output streaming (stdout/stderr)
- Users should be able to stop running processes
- Users should be able to install packages and persist environment state within a session
- Users should be able to share workspaces with others (view/edit permissions)
Sharing here means access control (view/edit) and last-write-wins. Real-time collaborative editing (OT/CRDT) is out of scope unless explicitly required.
Package installs persist within an active session. Assume user-level installs (pip/npm/etc.) into /workspace or /home; OS-level installs require prebuilt images or an allowlist. Cross-session environment persistence (persisting the full filesystem) is optional (paid tier) and discussed as a trade-off later.
Non-Functional Requirements

| Requirement | Target | Rationale |
|---|---|---|
| Cold start latency | < 5 seconds | Users expect near-instant execution |
| Output latency | < 100ms | Real-time feel for terminal output |
| Availability | 99.9% | Critical for paid/enterprise users |
| Concurrent users | 100K simultaneous | Scale for popular platforms |
| Execution isolation | Strong | Security: users can't access each other's data |
In an interview, clarify: "Should we support long-running jobs (hours) or just interactive development sessions?" This significantly impacts VM lifecycle design. For this guide, we'll focus on interactive development with a 12-hour max session runtime.
Capacity Estimation
Assumptions:
- 100K concurrent users, each with 1 active workspace session
- Average session: 2 vCPU, 4GB RAM
- Peak concurrent sessions: 100K
Compute resources:
- 100K sessions × 2 vCPU = 200K vCPUs needed at peak
- 100K sessions × 4GB = 400TB RAM at peak
- At ~40GB usable RAM per node, this requires ~10,000 compute nodes
Terminal output streaming:
- Not all sessions actively run processes simultaneously—assume 50% peak utilization
- 50K active processes × 1KB/second = 50MB/second of output data
- This is manageable with a modest Kafka cluster (3-5 brokers)
The main cost driver is compute, not storage. VM utilization optimization (pre-warming, pooling) directly impacts infrastructure costs.
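The estimates above are simple arithmetic and worth being able to re-derive quickly in an interview. A few lines make the math explicit (the 40GB usable-RAM figure assumes headroom reserved for the OS and node agents):

```python
# Sanity check of the capacity estimates above.
sessions = 100_000
vcpu_per_session = 2
ram_gb_per_session = 4

total_vcpu = sessions * vcpu_per_session              # 200K vCPUs at peak
total_ram_tb = sessions * ram_gb_per_session / 1000   # 400 TB at peak

usable_ram_gb_per_node = 40                           # assumed usable after OS/agents
nodes = sessions * ram_gb_per_session / usable_ram_gb_per_node  # ~10,000 nodes

# Streaming: ~50% of sessions emit ~1KB/s of terminal output.
active_processes = sessions // 2
output_mb_per_s = active_processes * 1 / 1000         # 50 MB/s into Kafka
```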
Core Entities

Workspace
├── id: UUID
├── owner_id: UUID
├── name: string
├── template: string (e.g., "python", "node", "go")
├── created_at: timestamp
├── updated_at: timestamp
└── sharing_mode: enum (private, view, edit)

File
├── id: UUID
├── workspace_id: UUID (FK)
├── path: string (e.g., "/src/main.py")
├── content: text (for small files)
├── content_ref: string (optional, object storage pointer for large files)
├── is_directory: boolean
├── created_at: timestamp
└── updated_at: timestamp

Process
├── id: UUID
├── workspace_id: UUID (FK)
├── sandbox_id: UUID (FK)
├── command: string (e.g., "python main.py", "npm run dev")
├── status: enum (pending, running, completed, failed, cancelled)
├── started_at: timestamp
├── finished_at: timestamp
└── exit_code: integer

Sandbox (VM/Container instance)
├── id: UUID
├── workspace_id: UUID (FK)
├── user_id: UUID
├── status: enum (provisioning, warm, assigned, running, idle, terminated)
├── instance_type: string (cpu-small, cpu-large, gpu)
├── ip_address: string
├── created_at: timestamp
├── last_activity_at: timestamp
└── expires_at: timestamp

Entity Relationships

- User 1:N Workspace
- Workspace 1:N File
- Workspace 1:1 Sandbox (active session)
- Workspace 1:N Process
- Sandbox 1:N Process
Keep Sandbox as a separate entity from Process. A sandbox persists across multiple command executions within a session, maintaining installed packages and filesystem state.
Protocol Choices

| Operation | Protocol | Reason |
|---|---|---|
| CRUD operations | REST | Standard request-response |
| Terminal streaming | WebSocket | Real-time bidirectional |
| File uploads | REST + multipart | Large payloads |

REST Endpoints
POST   /api/workspaces              Create workspace
GET    /api/workspaces/{id}         Get workspace with file tree
PUT    /api/workspaces/{id}         Update workspace metadata
DELETE /api/workspaces/{id}         Delete workspace

GET    /api/workspaces/{id}/files   List files (tree structure)
GET    /api/files/{id}              Get file content
POST   /api/workspaces/{id}/files   Create file or directory
PUT    /api/files/{id}              Update file content
DELETE /api/files/{id}              Delete file or directory
POST   /api/files/{id}/move         Move/rename file

POST   /api/workspaces/{id}/run     Run command (returns stream token + sandbox_id)
POST   /api/processes/{id}/cancel   Cancel running process
POST   /api/processes/{id}/input    Send stdin input

POST   /api/workspaces/{id}/sandbox Request/connect sandbox for workspace
GET    /api/sandboxes/{id}/status   Get sandbox status
DELETE /api/sandboxes/{id}          Terminate sandbox
Prefer WebSocket messages for interactive stdin/cancel to minimize latency. Keep REST input/cancel as a fallback for non-WS clients or automation.
Run response (202):
{ "process_id": "proc-123", "sandbox_id": "sbx-456", "stream_token": "signed-token" }
The token is sandbox-scoped and short-lived; the server can return the same token for subsequent commands until it expires. If a sandbox already exists for the workspace, the API returns the existing sandbox_id and a refreshed token.
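One minimal way to implement such a token is an HMAC-signed payload of sandbox ID plus expiry. This is an illustrative sketch (the secret, TTL, and format are assumptions; a real system might use JWTs instead):

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # hypothetical key, held only by the control plane

def mint_stream_token(sandbox_id: str, ttl_s: int = 300) -> str:
    """Issue a short-lived token scoped to a single sandbox."""
    expires = int(time.time()) + ttl_s
    payload = f"{sandbox_id}:{expires}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def validate_stream_token(token: str, sandbox_id: str) -> bool:
    """Check signature, scope, and expiry before accepting a WS connection."""
    try:
        sid, expires, sig = token.rsplit(":", 2)
    except ValueError:
        return False
    payload = f"{sid}:{expires}"
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(sig, expected)  # constant-time signature check
        and sid == sandbox_id               # token is bound to this sandbox
        and int(expires) > time.time()      # not yet expired
    )
```

Because validation is stateless, any WebSocket server can verify the token without a database round trip; refreshing a token is just minting a new one.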
WebSocket Protocol
The client uses a short-lived stream_token from the run response; the server validates the token and sandbox ownership.
WSS /api/stream/{sandbox_id}?token=stream_token
{ "type": "output", "process_id": "proc-123", "stream": "stdout" | "stderr", "data": "Hello, world!\n", "timestamp": 1699999999999 }
{ "type": "status", "process_id": "proc-123", "status": "completed", "exit_code": 0 }
{ "type": "resume", "process_id": "proc-123", "last_seen_id": "1699999999999-0" }
{ "type": "input", "process_id": "proc-123", "data": "user input\n" }
{ "type": "cancel", "process_id": "proc-123" }
The WebSocket connection is per-sandbox, not per-process. This allows streaming output from multiple concurrent processes (e.g., a dev server and a build command) while maintaining a single connection, reducing overhead.
Unlike notebook-style systems where executions are serialized, a Cloud IDE typically allows multiple concurrent processes (e.g., running a server while executing tests). The sandbox manages process isolation internally.
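The multiplexing this implies on the client side is a small dispatch by process_id. A sketch of that demultiplexing logic (shown in Python for consistency with the rest of this guide; a real browser client would do the same in JavaScript):

```python
import json
from collections import defaultdict

class SandboxStream:
    """Routes messages from one per-sandbox WebSocket to per-process state."""

    def __init__(self):
        self.outputs = defaultdict(list)  # process_id -> list of output chunks
        self.statuses = {}                # process_id -> terminal status

    def on_message(self, raw: str):
        msg = json.loads(raw)
        if msg["type"] == "output":
            # Several concurrent processes share the connection;
            # the process_id field tells us which terminal pane to update.
            self.outputs[msg["process_id"]].append(msg["data"])
        elif msg["type"] == "status":
            self.statuses[msg["process_id"]] = msg["status"]
```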
Architecture Overview

```mermaid
flowchart TB
    subgraph Clients
        WEB[Web Browser]
    end

    subgraph Edge["Edge Layer"]
        LB[Load Balancer]
        CDN[CDN - Static Assets]
    end

    subgraph App["Application Layer"]
        API[API Servers]
        WSS[WebSocket Servers]
    end

    subgraph Orchestration["Sandbox Orchestration"]
        SM[Sandbox Manager]
        POOL[Warm Pool Controller]
        K8S[Kubernetes Cluster]
    end

    subgraph Streaming["Terminal Output Streaming"]
        KAFKA[Kafka / Log Bus]
    end

    subgraph Storage["Storage Layer"]
        PG[(PostgreSQL Metadata)]
        REDIS[(Redis Session State)]
        S3[(Object Storage Workspaces/Files)]
    end

    subgraph Compute["Sandbox Compute"]
        VM1[Sandbox Pod 1]
        VM2[Sandbox Pod 2]
        VMN[Sandbox Pod N]
    end

    WEB -->|HTTPS| LB
    WEB -->|WSS| LB
    LB --> API
    LB --> WSS

    API --> SM
    API --> PG
    API --> S3

    SM --> POOL
    SM --> K8S
    POOL --> K8S

    K8S --> VM1
    K8S --> VM2
    K8S --> VMN

    VM1 --> KAFKA
    VM2 --> KAFKA
    VMN --> KAFKA

    KAFKA --> WSS

    WSS --> REDIS
    SM --> REDIS
```
Component Responsibilities
API Servers
- Handle REST requests for workspaces, files, processes
- Authenticate users, authorize actions
- Persist metadata to PostgreSQL, files to S3
WebSocket Servers
- Maintain persistent connections with clients
- Subscribe to shared Kafka topics (keyed by sandbox_id)
- Fan out terminal output to connected clients
- Append recent output to Redis streams for reconnect replay
- Handle stdin input forwarding
- Use consistent hashing/partition affinity so the server holding the client connection consumes that sandbox's partition (or add a routing layer for fan-out)
Sandbox Manager
- Orchestrate sandbox lifecycle (create, monitor, terminate)
- Route process execution requests to the appropriate sandbox
- Track sandbox health and resource usage
- Handle sandbox assignment for workspaces
Warm Pool Controller
- Maintain a pool of pre-provisioned sandboxes
- Scale pool size based on demand prediction
- Handle different instance types (CPU, GPU)
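The controller's replenishment loop can be sketched in a few lines. This is a simplification under stated assumptions (a single instance type, a fixed safety factor, and no accounting for sandboxes already provisioning); real controllers track in-flight provisioning per instance type:

```python
def warm_pool_target(predicted_demand_per_min: float,
                     provision_time_s: float,
                     safety_factor: float = 1.5,
                     min_pool: int = 10) -> int:
    """Size the warm pool to cover expected assignments during one
    provisioning window, with headroom for bursts."""
    window_demand = predicted_demand_per_min * provision_time_s / 60
    return max(min_pool, int(window_demand * safety_factor))

def replenish(current_warm: int, target: int) -> int:
    """How many new sandboxes to start provisioning (never negative)."""
    return max(0, target - current_warm)
```

The intuition: if cold provisioning takes 30s and you expect 600 assignments per minute, you need roughly 300 warm sandboxes just to cover one provisioning window, plus a buffer for bursts.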
Kubernetes Cluster
- Run sandbox containers/pods
- Provide network isolation between sandboxes
- Enforce resource limits (CPU, memory, disk)

Data Flow: Run Command
Let's walk through what happens when a user clicks "Run" or executes a terminal command:
```mermaid
sequenceDiagram
    participant U as User Browser
    participant API as API Server
    participant SM as Sandbox Manager
    participant R as Redis
    participant K as Kubernetes
    participant S as Sandbox Pod
    participant KF as Kafka
    participant WS as WebSocket Server

    U->>API: POST /workspaces/{id}/run {command}
    API->>SM: Request process execution
    SM->>R: Check workspace's active sandbox

    alt No active sandbox
        SM->>K: Create sandbox pod
        K-->>SM: Pod ready (IP, port)
        SM->>R: Store sandbox mapping
    end

    SM->>S: Execute command via gRPC
    SM-->>API: Process started
    API-->>U: 202 Accepted {process_id, sandbox_id, stream_token}

    U->>WS: Subscribe to process output (sandbox_id + stream_token)
    WS->>R: Get sandbox routing metadata
    WS->>KF: Subscribe to shared topic (keyed by sandbox_id)

    loop Output streaming
        S->>KF: Publish stdout/stderr
        KF->>WS: Consume messages
        WS->>U: Forward via WebSocket
    end

    S->>KF: Publish completion status
    KF->>WS: Consume completion
    WS->>U: Send status: completed
```
Sandbox Container Architecture
Each sandbox runs as an isolated Kubernetes pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sandbox-{workspace_id}
  labels:
    type: sandbox
    workspace: "{workspace_id}"
spec:
  containers:
    - name: runtime
      image: sandbox-python:3.11
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
      securityContext:
        runAsNonRoot: true
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        # Writable mount for user files and user-level package installs
        - name: workspace
          mountPath: /workspace
    - name: agent
      image: sandbox-agent:latest
      ports:
        - containerPort: 8080  # agent control-plane port (illustrative)
      volumeMounts:
        - name: workspace
          mountPath: /workspace
  volumes:
    - name: workspace
      emptyDir: {}
```
With a read-only root filesystem, package installs write to mounted volumes like /workspace, /home/sandbox, and /tmp. Use virtualenvs or language-specific paths under those mounts.
Two-container pattern:
- Runtime container: Runs user processes (Python/Node/Go/etc.), provides terminal shell access
- Agent container: Manages process lifecycle, captures terminal output, syncs files, communicates with control plane
Security is critical. User code runs in untrusted sandboxes. Use: (1) Container isolation with dropped capabilities, (2) Network policies blocking unauthorized egress, (3) Resource limits preventing DoS, (4) Read-only filesystem where possible, (5) Non-root user execution.
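Point (2) can be expressed as a Kubernetes NetworkPolicy. A minimal sketch, assuming the `type: sandbox` pod label from the spec above and hypothetical namespace labels for the control plane and streaming tiers:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sandbox-isolation
spec:
  podSelector:
    matchLabels:
      type: sandbox
  policyTypes: ["Ingress", "Egress"]
  ingress:
    # Only the control plane (e.g., gRPC calls to the agent) may reach sandbox pods;
    # sandbox-to-sandbox traffic is denied because no rule allows it.
    - from:
        - namespaceSelector:
            matchLabels:
              role: control-plane  # illustrative label
  egress:
    # Allow publishing output to the Kafka tier; all other egress is denied.
    - to:
        - namespaceSelector:
            matchLabels:
              role: streaming  # illustrative label
```

A real deployment would also need egress rules for DNS and for package registries (or a proxy that enforces an allowlist).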
Terminal Output Streaming Deep Dive
The output streaming pipeline is the heart of real-time terminal feel:
```mermaid
flowchart LR
    subgraph Sandbox["Sandbox Pod"]
        CODE[User Process] -->|stdout/stderr| AGENT[Agent Process]
        AGENT -->|Batch + Compress| BUF[Buffer]
    end

    subgraph Bus["Message Bus"]
        BUF -->|Publish| KAFKA[Kafka]
        KAFKA -->|Keyed partitions| PART[Partitions]
    end

    subgraph Delivery["WebSocket Layer"]
        PART -->|Subscribe| WSS[WS Server]
        WSS -->|Fan-out| C1[Client 1]
        WSS -->|Fan-out| C2[Client 2]
    end
```
Avoid one Kafka topic per sandbox. Use a small number of shared topics with partitions keyed by sandbox_id to preserve ordering without exploding metadata.
At scale, shard sandbox_id to a WebSocket server so only the owning shard consumes that partition; otherwise every WS server would need to consume everything and filter locally.
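The two mappings involved can be sketched with the same hash on both sides. This is a simplification of what Kafka's partitioner and a shard map would do (partition count and server list are illustrative), not a drop-in implementation:

```python
import hashlib

NUM_PARTITIONS = 64                    # partitions of the shared output topic
WS_SERVERS = ["ws-0", "ws-1", "ws-2"]  # illustrative WebSocket server fleet

def partition_for(sandbox_id: str) -> int:
    """All output for one sandbox keys to one partition, preserving order."""
    digest = hashlib.sha256(sandbox_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

def server_for(partition: int) -> str:
    """Each WS server owns a disjoint set of partitions, so only the server
    holding the client's connection consumes that sandbox's stream."""
    return WS_SERVERS[partition % len(WS_SERVERS)]
```

Clients connecting for sandbox X are routed (via the load balancer or a redirect) to `server_for(partition_for(X))`, so Kafka consumption and WebSocket fan-out happen on the same node. The mapping must be recomputed when the fleet changes, which is where a proper consistent-hash ring improves on the modulo shown here.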
Agent output handling:
```python
import time

import msgpack  # assumed available in the agent image

class OutputStreamer:
    """Batches stdout/stderr chunks and publishes them to a shared,
    sandbox-keyed Kafka topic."""

    FLUSH_BYTES = 4096       # flush once the buffer reaches 4KB...
    FLUSH_INTERVAL_S = 0.05  # ...or after 50ms, whichever comes first

    def __init__(self, kafka_producer, sandbox_id):
        self.producer = kafka_producer
        self.sandbox_id = sandbox_id
        self.topic = "sandbox-output"
        self.key = sandbox_id.encode()  # keys all events to one partition
        self.buffer = []
        self.last_flush = time.monotonic()

    def capture(self, process_id: str, stream: str, data: bytes):
        self.buffer.append({
            "process_id": process_id,
            "stream": stream,  # "stdout" or "stderr"
            "data": data,
            "ts": int(time.time() * 1000),
        })
        if self._should_flush():
            self._flush()

    def _should_flush(self):
        buffer_size = sum(len(m["data"]) for m in self.buffer)
        time_elapsed = time.monotonic() - self.last_flush
        return buffer_size >= self.FLUSH_BYTES or time_elapsed >= self.FLUSH_INTERVAL_S

    def _flush(self):
        if not self.buffer:
            return
        self.producer.send(
            self.topic,
            key=self.key,
            value=msgpack.packb({
                "sandbox_id": self.sandbox_id,
                "events": self.buffer,
            }),
        )
        self.buffer = []
        self.last_flush = time.monotonic()
```
Why batch output?
- Individual characters would create millions of messages
- 50ms batching provides perceived real-time feel
- Reduces Kafka throughput and WebSocket message overhead
Interview insight: Mention the trade-off between latency and throughput. Smaller batches = lower latency but higher overhead. 50ms is a sweet spot—humans can't perceive delays under 100ms as "laggy."
Output Persistence & Reconnection
A common interview follow-up: "What happens if the user disconnects mid-process and reconnects?"
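The core of the answer is replay: the client sends its `last_seen_id` in the resume message, and the server replays buffered events newer than that ID before resuming the live feed. A sketch of the replay logic, using an in-memory list as a stand-in for the Redis stream mentioned earlier (a real implementation would use XADD with a capped stream and XRANGE for the replay):

```python
# Each entry mimics a Redis stream entry: (id, event),
# with IDs ordered as "<ms-timestamp>-<sequence>".
def replay_after(stream: list[tuple[str, dict]], last_seen_id: str) -> list[dict]:
    """Return the events a client missed while disconnected."""
    def key(entry_id: str) -> tuple[int, int]:
        ms, seq = entry_id.split("-")
        return int(ms), int(seq)

    cutoff = key(last_seen_id)
    return [event for entry_id, event in stream if key(entry_id) > cutoff]
```

Capping the buffer (e.g., last N KB per sandbox) bounds memory; clients that fall behind the cap get a truncation marker instead of full replay.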