Design an Identity and Access Management system for AI agents
Problem Statement
You are asked to design a cloud-scale Identity and Access Management (IAM) system purpose-built for AI agent identities. Unlike human users, who authenticate interactively, AI agents operate autonomously, often spawn sub-agents, and require programmatic credential issuance with tightly scoped, short-lived tokens. The system must support both role-based access control (RBAC) and attribute-based access control (ABAC) to handle the diverse authorization needs of modern AI workloads.
The core challenge is low-latency authorization. Every API call an AI agent makes must be authorized, and agents can generate thousands of requests per second. Routing every authorization check through a single central policy engine would create an unacceptable bottleneck. Instead, you need a distributed enforcement architecture in which policy enforcement points (PEPs) at the edge cache policies and evaluate them locally.
The system must also enforce multi-tenant isolation so that one customer's agents can never access another customer's resources, even if a policy misconfiguration occurs. Comprehensive audit logging of every authorization decision is required for compliance and forensic analysis.
Key Requirements
Functional
- Agent identity lifecycle -- Create, rotate, suspend, and delete agent identities with support for hierarchical relationships (parent agents spawning child agents with delegated, narrower permissions).
- Short-lived credential issuance -- Issue time-bound tokens (minutes to hours) that agents use for API authentication, with automatic renewal and revocation capabilities.
- Policy evaluation (RBAC + ABAC) -- Evaluate authorization requests against role assignments and attribute-based rules (e.g., agent type, request context, resource tags, time of day) with sub-millisecond decision latency at the PEP.
- Audit trail -- Record every authentication and authorization event with the agent identity, requested resource, decision, and policy version used, supporting both real-time streaming and historical queries.
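To make the lifecycle and credential requirements above more concrete, here is a minimal sketch of short-lived token issuance and parent-to-child delegation. It is an illustration under assumptions, not a prescribed design: the names `issue_token`, `verify_token`, and `delegate`, the claim layout, and the shared HMAC key are all hypothetical, and a production issuer would more likely use asymmetrically signed tokens with managed, rotated keys.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical demo key; a real issuer would use a managed, rotated key.
SECRET = b"demo-signing-key"


def _sign(payload: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over the encoded claims."""
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def issue_token(agent_id, tenant_id, scopes, now, ttl_seconds=900, parent=None):
    """Issue a signed token carrying a tenant claim, scopes, and an expiry."""
    claims = {
        "sub": agent_id,
        "tenant": tenant_id,
        "scopes": sorted(scopes),
        "exp": now + ttl_seconds,
        "parent": parent,
    }
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    return body.decode() + "." + _sign(body)


def verify_token(token, now):
    """Return the claims if the signature and expiry check out, else None."""
    body, _, signature = token.rpartition(".")
    if not hmac.compare_digest(signature, _sign(body.encode())):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    if now >= claims["exp"]:
        return None
    return claims


def delegate(parent_token, child_id, requested_scopes, now, ttl_seconds=300):
    """Issue a child token whose scopes and lifetime never exceed the parent's."""
    parent = verify_token(parent_token, now)
    if parent is None:
        raise PermissionError("parent token is invalid or expired")
    if not set(requested_scopes) <= set(parent["scopes"]):
        raise PermissionError("child scopes must be a subset of the parent's")
    # The child token must not outlive the parent token.
    ttl = min(ttl_seconds, parent["exp"] - now)
    return issue_token(child_id, parent["tenant"], requested_scopes, now,
                       ttl, parent=parent["sub"])
```

The child inherits the parent's tenant rather than supplying its own, and both the scope set and the expiry are capped by the parent's, which is one simple way to rule out escalation through delegation.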
Non-Functional
- Scalability -- Support 100 million agent identities across thousands of tenants, handling 1 million authorization decisions per second globally.
- Latency -- Authorization decisions at the PEP must complete in under 1 millisecond for cached policies, with cold-cache lookups under 10 milliseconds.
- Availability -- 99.999% uptime for the authorization path; credential issuance can tolerate slightly lower availability (99.99%).
- Security -- Zero cross-tenant data leakage; cryptographic isolation of tenant policy namespaces; defense in depth against privilege escalation.
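It can help to turn these targets into concrete budgets before designing anything. The sketch below does the arithmetic: 99.999% availability leaves roughly 5.26 minutes of downtime per year, 99.99% leaves roughly 52.56 minutes, and 1 million decisions per second produces 86.4 billion audit events per day. The 200-byte audit record size is purely an assumption for illustration; the problem statement does not give one.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400

DECISIONS_PER_SECOND = 1_000_000  # from the scalability requirement
DECISIONS_PER_DAY = DECISIONS_PER_SECOND * SECONDS_PER_DAY


def downtime_minutes_per_year(availability_percent):
    """Downtime budget implied by an availability target, in minutes per year."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


def audit_bytes_per_day(record_size_bytes):
    """Raw audit volume per day if every decision emits one record.

    record_size_bytes is an assumed average, not a figure from the requirements.
    """
    return DECISIONS_PER_DAY * record_size_bytes


if __name__ == "__main__":
    print(f"99.999% -> {downtime_minutes_per_year(99.999):.2f} min/year")
    print(f"99.99%  -> {downtime_minutes_per_year(99.99):.2f} min/year")
    print(f"decisions/day: {DECISIONS_PER_DAY:,}")
    print(f"audit TB/day at 200 B/record: {audit_bytes_per_day(200) / 1e12:.2f}")
```

Under the assumed record size, the audit trail alone is on the order of 17 TB of raw data per day, which is why the audit path is usually treated as a streaming pipeline rather than a synchronous write on the authorization path.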
What Interviewers Focus On
Based on real interview experiences, these are the areas interviewers probe most deeply:
1. Policy Model Design (RBAC + ABAC)
Interviewers want to see how you model permissions for AI agents that may have dynamic, context-dependent access needs. They probe whether your model can express complex rules without becoming unmanageable.
Hints to consider:
- Think about layered evaluation: RBAC for coarse-grained role assignments (e.g., "reader", "executor"), ABAC for fine-grained contextual rules (e.g., "only during business hours", "only for resources tagged non-sensitive")
- Consider how policies are versioned and how you roll back a bad policy change without downtime
- Explore how delegation works when a parent agent grants a subset of its own permissions to a child agent, and how you prevent privilege escalation along the delegation chain
- Discuss how you handle deny-override semantics: an explicit deny in any policy always wins over an allow
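The layered evaluation and deny-override hints above can be sketched as follows. This is one possible semantics, stated as an assumption: RBAC acts as a coarse gate, ABAC rules then refine the decision, any matching deny wins, and the default is deny when no allow rule matches. The role table, the `Rule` type, and the context keys (`hour`, `resource_tags`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical coarse-grained role table (the RBAC layer).
ROLE_PERMISSIONS = {
    "reader": {"read"},
    "executor": {"read", "execute"},
}


@dataclass(frozen=True)
class Rule:
    """A fine-grained ABAC rule: an effect applied when the condition holds."""
    effect: str                        # "allow" or "deny"
    action: str
    condition: Callable[[dict], bool]  # evaluated against the request context


# Hypothetical ABAC rules mirroring the examples in the hints.
RULES = [
    Rule("allow", "read", lambda ctx: True),
    Rule("allow", "execute", lambda ctx: 9 <= ctx["hour"] < 17),
    Rule("deny", "execute", lambda ctx: "sensitive" in ctx["resource_tags"]),
]


def evaluate(roles, action, context, rules=RULES):
    """Return "allow" or "deny" using an RBAC gate, then ABAC with deny-override."""
    # Layer 1 (RBAC): the action must be granted by at least one role.
    granted = set()
    for role in roles:
        granted |= ROLE_PERMISSIONS.get(role, set())
    if action not in granted:
        return "deny"

    # Layer 2 (ABAC): collect the effects of every rule that matches.
    effects = {r.effect for r in rules if r.action == action and r.condition(context)}
    if "deny" in effects:   # an explicit deny always wins
        return "deny"
    if "allow" in effects:
        return "allow"
    return "deny"           # default deny when nothing matches
```

For example, `evaluate(["executor"], "execute", {"hour": 10, "resource_tags": {"sensitive"}})` is denied even though the business-hours allow rule matches, because the matching deny rule overrides it.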
2. Distributed Authorization with Cached Policies
The 1 million decisions-per-second requirement means you cannot make a network call for every authorization check. Interviewers look for a PEP architecture that keeps policy evaluation local.
Hints to consider:
- Think about pushing compiled policy bundles to PEPs (similar to Open Policy Agent's bundle model) that are evaluated locally in memory
- Consider how you propagate policy updates, for example through a Kafka topic of policy change events that PEPs subscribe to, with an explicit bound on how stale a PEP's copy may become
- Explore the cache invalidation problem: what happens if a credential is revoked but the PEP's cache still considers it valid?
- Discuss how you handle the cold-start case when a new PEP comes online and needs to bootstrap its policy cache
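The hints above (local evaluation, update propagation, revocation, and cold start) can be tied together in a small in-memory sketch of a PEP cache. Everything here is an assumption for illustration: the class name `PepPolicyCache`, the event shapes, and the five-second staleness bound are hypothetical, and events are plain dictionaries rather than messages from any real Kafka client.

```python
class PepPolicyCache:
    """Local policy cache for one PEP with versioned updates and bounded staleness."""

    def __init__(self, max_staleness_seconds=5.0):
        self.max_staleness = max_staleness_seconds
        self.version = 0
        self.allowed = frozenset()  # set of (role, action) pairs
        self.revoked = set()        # revoked token IDs
        self.last_sync = None       # None until the cache is bootstrapped

    def bootstrap(self, snapshot, now):
        """Cold start: load a full snapshot before consuming change events."""
        self.version = snapshot["version"]
        self.allowed = frozenset(snapshot["allowed"])
        self.revoked = set(snapshot["revoked"])
        self.last_sync = now

    def apply_event(self, event, now):
        """Apply one change event; return True if it was accepted."""
        if self.last_sync is None:
            return False  # events are meaningless without a snapshot
        if event["type"] == "heartbeat":
            # A heartbeat at our version confirms the cache is caught up.
            if event["version"] == self.version:
                self.last_sync = now
                return True
            return False
        if event["version"] <= self.version:
            return False  # duplicate or out-of-order delivery
        if event["type"] == "bundle":
            self.allowed = frozenset(event["allowed"])
        elif event["type"] == "revoke":
            self.revoked.add(event["token_id"])
        self.version = event["version"]
        self.last_sync = now
        return True

    def authorize(self, token_id, role, action, now):
        """Decide locally; return (decision, reason-or-policy-version)."""
        if self.last_sync is None or now - self.last_sync > self.max_staleness:
            return ("deny", "stale-cache")  # fail closed past the staleness bound
        if token_id in self.revoked:
            return ("deny", "revoked")
        if (role, action) in self.allowed:
            return ("allow", f"v{self.version}")
        return ("deny", f"v{self.version}")
```

Failing closed on a stale cache favors security over availability. Whether that is acceptable against a 99.999% availability target, or whether a PEP should keep serving from its last known bundle for some grace period, is exactly the kind of trade-off worth discussing in the interview. Returning the policy version with each decision also gives the audit trail the version it needs.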
3. Multi-Tenant Isolation
Interviewers probe how you guarantee that tenant A's agents can never be authorized to access tenant B's resources, even if there is a bug in the policy engine.
Hints to consider:
- Think about tenant ID as a mandatory, unforgeable claim embedded in every token, verified at every layer
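- Consider making the tenant ID part of every DynamoDB partition key so that key-based queries are scoped to a single tenant by construction (a full-table Scan is not bounded by partition key, so it needs to be restricted separately)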
- Explore how you test isolation, for example with chaos testing that injects cross-tenant requests and verifies they are always denied
- Discuss the blast radius if a single PEP is compromised: can it be used to authorize requests for other tenants?
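One way to make isolation hold even when the policy engine is buggy is to check the tenant boundary outside the engine, before it is consulted. The sketch below illustrates that idea together with a tenant-prefixed storage key. The function names and the `TENANT#...` key layout are hypothetical, and no real DynamoDB API is called; the key builder only shows the shape such a partition key might take.

```python
def tenant_key(tenant_id, entity_type, entity_id):
    """Build a storage key that always starts with the tenant ID.

    The "#" separator is rejected inside tenant IDs so that one tenant's ID
    cannot be crafted to collide with another tenant's key prefix.
    """
    if not tenant_id or "#" in tenant_id:
        raise ValueError("invalid tenant id")
    return f"TENANT#{tenant_id}#{entity_type}#{entity_id}"


def authorize_with_tenant_guard(token_claims, resource, policy_decision):
    """Enforce the tenant boundary first, then defer to the policy engine.

    policy_decision is any callable returning "allow" or "deny"; it stands in
    for the real policy engine and is never consulted for cross-tenant requests.
    """
    token_tenant = token_claims.get("tenant")
    if not token_tenant or token_tenant != resource.get("tenant"):
        return "deny"
    return "allow" if policy_decision(token_claims, resource) == "allow" else "deny"


def buggy_engine(token_claims, resource):
    """Simulates a misconfigured policy that allows everything."""
    return "allow"


if __name__ == "__main__":
    # Injected cross-tenant request: denied despite the allow-all engine.
    print(authorize_with_tenant_guard({"tenant": "a"}, {"tenant": "b"}, buggy_engine))
    # Same-tenant request: the engine's decision is used.
    print(authorize_with_tenant_guard({"tenant": "a"}, {"tenant": "a"}, buggy_engine))
```

The `buggy_engine` stand-in doubles as a small isolation test in the spirit of the chaos-testing hint: a cross-tenant request must be denied regardless of what the engine returns.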