
Design an Identity and Access Management system for AI agents

System Design | Optional

Problem Statement

You are asked to design a cloud-scale Identity and Access Management (IAM) system purpose-built for AI agent identities. Unlike human users who authenticate interactively, AI agents operate autonomously, often spawning sub-agents, and require programmatic credential issuance with tightly scoped, short-lived tokens. The system must support both role-based access control (RBAC) and attribute-based access control (ABAC) to handle the diverse authorization needs of modern AI workloads.

The core challenge is low-latency authorization. Every API call an AI agent makes must be authorized, and agents can generate thousands of requests per second. Routing every authorization check through a single central policy engine would create an unacceptable bottleneck. Instead, you need a distributed enforcement architecture in which policy enforcement points (PEPs) at the edge cache policies and evaluate them locally.

The system must also enforce multi-tenant isolation so that one customer's agents can never access another customer's resources, even if a policy misconfiguration occurs. Comprehensive audit logging of every authorization decision is required for compliance and forensic analysis.

Key Requirements

Functional

Agent identity lifecycle -- Create, rotate, suspend, and delete agent identities with support for hierarchical relationships (parent agents spawning child agents with delegated, narrower permissions).
Short-lived credential issuance -- Issue time-bound tokens (minutes to hours) that agents use for API authentication, with automatic renewal and revocation capabilities.
Policy evaluation (RBAC + ABAC) -- Evaluate authorization requests against role assignments and attribute-based rules (e.g., agent type, request context, resource tags, time of day) with sub-millisecond decision latency at the PEP.
Audit trail -- Record every authentication and authorization event with the agent identity, requested resource, decision, and policy version used, supporting both real-time streaming and historical queries.
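
The short-lived credential requirement above can be sketched as follows. This is a minimal illustration, not a recommended token format: it assumes a single shared HMAC key and a home-grown `payload.signature` layout, whereas a production system would more likely use a standard format such as JWT with asymmetric, per-tenant keys held in a KMS. The names `SIGNING_KEY`, `issue_token`, and `verify_token` are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical demo key; a real deployment would keep per-tenant keys in a KMS/HSM.
SIGNING_KEY = b"demo-key-not-for-production"


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


def issue_token(agent_id, tenant_id, scopes, ttl_seconds, now=None, parent_id=None):
    """Return a signed token whose claims expire ttl_seconds after `now`."""
    now = time.time() if now is None else now
    claims = {
        "sub": agent_id,
        "tenant": tenant_id,       # mandatory tenant claim
        "scopes": sorted(scopes),
        "parent": parent_id,       # set when a parent agent spawned this one
        "iat": now,
        "exp": now + ttl_seconds,
    }
    payload = json.dumps(claims, sort_keys=True).encode("utf-8")
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()
    return _b64(payload) + "." + _b64(sig)


def verify_token(token, now=None, revoked=frozenset()):
    """Return the claims if the token is authentic, unexpired, and not revoked; else None."""
    now = time.time() if now is None else now
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:
        return None  # malformed token
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None  # signature mismatch
    claims = json.loads(payload)
    if now >= claims["exp"] or claims["sub"] in revoked:
        return None
    return claims
```

Passing `now` explicitly keeps expiry logic testable; the `revoked` set stands in for whatever revocation feed the verifier actually consumes.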

Non-Functional

Scalability -- Support 100 million agent identities across thousands of tenants, handling 1 million authorization decisions per second globally.
Latency -- Authorization decisions at the PEP must complete in under 1 millisecond for cached policies, with cold-cache lookups under 10 milliseconds.
Availability -- 99.999% uptime for the authorization path; credential issuance can tolerate slightly lower availability (99.99%).
Security -- Zero cross-tenant data leakage; cryptographic isolation of tenant policy namespaces; defense in depth against privilege escalation.
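
To make the availability and throughput targets concrete, the arithmetic can be sketched as below. The per-PEP figure assumes a hypothetical fleet size and perfectly even load, neither of which the problem statement specifies.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def downtime_budget_seconds(availability_percent):
    """Maximum yearly downtime permitted by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def per_pep_rate(global_rate, pep_count):
    """Average decisions per second each PEP absorbs, assuming even load."""
    return global_rate / pep_count
```

Under these definitions, 99.999% allows roughly 315 seconds (about 5.3 minutes) of downtime per year on the authorization path, while 99.99% allows roughly 52.6 minutes for credential issuance. With an assumed fleet of 1,000 PEPs, 1 million decisions per second averages out to 1,000 per PEP.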

What Interviewers Focus On

Based on real interview experiences, these are the areas interviewers probe most deeply:

1. Policy Model Design (RBAC + ABAC)

Interviewers want to see how you model permissions for AI agents that may have dynamic, context-dependent access needs. They probe whether your model can express complex rules without becoming unmanageable. Hints to consider:

Think about layered evaluation: RBAC for coarse-grained role assignments (e.g., "reader", "executor"), ABAC for fine-grained contextual rules (e.g., "only during business hours", "only for resources tagged non-sensitive")
Consider how policies are versioned and how you roll back a bad policy change without downtime
Explore how delegation works when a parent agent grants a subset of its own permissions to a child agent, and how you prevent privilege escalation along the chain
Discuss how you handle deny-override semantics: an explicit deny in any policy always wins over an allow
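
The hints above (layered evaluation, deny-override, and delegation without escalation) can be sketched in a few lines. This is one possible shape under stated assumptions, not a prescribed design: the role catalogue `ROLE_ACTIONS`, the `Rule` type, the example `RULES`, and the context keys `hour` and `resource_tags` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

# Hypothetical role catalogue: role name -> permitted actions (the RBAC layer).
ROLE_ACTIONS: Dict[str, FrozenSet[str]] = {
    "reader": frozenset({"read"}),
    "executor": frozenset({"read", "execute"}),
}


@dataclass(frozen=True)
class Rule:
    """An ABAC rule: an effect plus a predicate over the request context."""
    effect: str  # "allow" or "deny"
    condition: Callable[[dict], bool]


# Example rules mirroring the hints: business hours only, never sensitive resources.
RULES = [
    Rule("allow", lambda ctx: 9 <= ctx["hour"] < 17),
    Rule("deny", lambda ctx: "sensitive" in ctx["resource_tags"]),
]


def evaluate(roles: List[str], action: str, context: dict, rules: List[Rule]) -> bool:
    # Layer 1 (RBAC): some assigned role must grant the action at all.
    if not any(action in ROLE_ACTIONS.get(role, frozenset()) for role in roles):
        return False
    # Layer 2 (ABAC): collect the effects of every rule whose condition matches.
    effects = {rule.effect for rule in rules if rule.condition(context)}
    # Deny-override: one matching deny beats any number of allows.
    if "deny" in effects:
        return False
    # Default deny: at least one rule must explicitly allow.
    return "allow" in effects


def delegate(parent_actions: FrozenSet[str], requested: FrozenSet[str]) -> FrozenSet[str]:
    """Grant a child only actions the parent already holds (no escalation)."""
    if not requested <= parent_actions:
        extra = sorted(requested - parent_actions)
        raise PermissionError(f"escalation attempt: {extra}")
    return requested
```

Because `delegate` only ever returns a subset of the parent's actions, permissions can shrink but never grow as they pass down a chain of sub-agents.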

2. Distributed Authorization with Cached Policies

The 1 million decisions-per-second requirement means you cannot make a network call for every authorization check. Interviewers look for a PEP architecture that keeps policy evaluation local. Hints to consider:

Think about pushing compiled policy bundles to PEPs (similar to Open Policy Agent's bundle model) that are evaluated locally in memory
Consider how you propagate policy updates, for example a Kafka topic of policy change events that PEPs subscribe to, with a bounded staleness guarantee
Explore the cache invalidation problem: what happens if a credential is revoked but the PEP's cache still considers it valid
Discuss how you handle the cold-start case when a new PEP comes online and needs to bootstrap its policy cache
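
A minimal sketch of the PEP-side cache implied by these hints follows. It assumes bundles carry a monotonically increasing version, that revocations arrive as separate events ahead of the next bundle, and that the PEP fails closed when it is cold or has not heard from the update stream within a staleness bound. The `PepCache` class and its method names are hypothetical, and the transport (Kafka or otherwise) is deliberately left out.

```python
import time


class PepCache:
    """Local policy cache that fails closed once its bundle is too stale."""

    def __init__(self, max_staleness_seconds, clock=time.monotonic):
        self.max_staleness = max_staleness_seconds
        self.clock = clock
        self.bundle_version = 0   # 0 means "not bootstrapped yet"
        self.allowed = set()      # (agent_id, action) pairs from the bundle
        self.revoked = set()      # agent ids revoked since issuance
        self.last_sync = None     # time of the last bundle or heartbeat

    def apply_bundle(self, version, allowed_pairs):
        """Install a bundle (cold start or update); ignore stale or duplicate versions."""
        if version <= self.bundle_version:
            return False
        self.bundle_version = version
        self.allowed = set(allowed_pairs)
        self.last_sync = self.clock()
        return True

    def apply_revocation(self, agent_id):
        """Handle a revocation event pushed ahead of the next bundle."""
        self.revoked.add(agent_id)

    def heartbeat(self):
        """Record that the update stream is caught up even though nothing changed."""
        if self.last_sync is not None:
            self.last_sync = self.clock()

    def authorize(self, agent_id, action):
        if self.last_sync is None:
            return False  # cold cache: fail closed until bootstrapped
        if self.clock() - self.last_sync > self.max_staleness:
            return False  # staleness bound exceeded: fail closed
        if agent_id in self.revoked:
            return False  # revocation wins over any cached allow
        return (agent_id, action) in self.allowed
```

Failing closed trades availability for safety; whether a stale PEP should instead keep serving its last known bundle is exactly the kind of trade-off worth discussing against the 99.999% target.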

3. Multi-Tenant Isolation

Interviewers probe how you guarantee that tenant A's agents can never be authorized to access tenant B's resources, even if there is a bug in the policy engine. Hints to consider:

Think about tenant ID as a mandatory, unforgeable claim embedded in every token, verified at every layer
Consider using separate DynamoDB partition keys per tenant so that a query can never accidentally cross tenant boundaries
Explore how you test isolation, for example chaos testing that injects cross-tenant requests and verifies they are always denied
Discuss the blast radius if a single PEP is compromised: can it be used to authorize requests for other tenants?
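
One way to picture the defense-in-depth idea is a hard tenant check that runs before, and independently of, the policy engine, combined with tenant-prefixed storage keys. The sketch below is illustrative only: the function names, the claim layout, and the `TENANT#...#ITEM#...` key format are assumptions rather than any particular database's API.

```python
def tenant_guard(token_claims, resource):
    """Hard tenant check, independent of whatever the policy engine decides."""
    tenant = token_claims.get("tenant")
    return bool(tenant) and tenant == resource.get("tenant")


def authorize(token_claims, resource, policy_decision):
    """policy_decision is the policy engine's verdict, which may be buggy."""
    if not tenant_guard(token_claims, resource):
        return False  # cross-tenant or missing tenant claim: always deny
    return bool(policy_decision)


def storage_key(tenant_id, item_id):
    """Compose a tenant-prefixed partition key so lookups stay within one tenant."""
    if not tenant_id or "#" in tenant_id:
        raise ValueError("invalid tenant id")
    return f"TENANT#{tenant_id}#ITEM#{item_id}"


def cross_tenant_violations(tenants):
    """Inject every cross-tenant request while simulating a maximally permissive engine."""
    return [
        (a, b)
        for a in tenants
        for b in tenants
        if a != b and authorize({"tenant": a}, {"tenant": b}, policy_decision=True)
    ]
```

An isolation test in this style would assert that `cross_tenant_violations` returns an empty list even when the simulated policy engine allows everything, which is the property the chaos-testing hint is after.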