# Distributed Locking for AI Agent Coordination
When multiple AI agents work on the same codebase simultaneously, they need a way to avoid stepping on each other's toes. This is the distributed coordination problem — and it's one of the oldest problems in computer science, now showing up in an entirely new context.
# The Problem
Imagine three Claude Code agents working on a feature branch. Agent A is refactoring the authentication module. Agent B is updating the API routes that depend on auth. Agent C is writing tests for both. Without coordination, Agent B might read a file that Agent A is halfway through rewriting. Agent C might test against a state that no longer exists.
This isn't hypothetical. It's what happens in every multi-agent engineering setup that lacks coordination primitives.
# Redis-Backed File Claims
The solution we implemented in Nexus uses Redis as a distributed lock manager with file-level granularity. When an agent needs to modify a file, it "claims" it:
```
CLAIM file:src/auth/login.ts agent:agent-a ttl:30000
```

The claim is a Redis key with a TTL (time-to-live). This gives us several properties for free:
- Mutual exclusion: Only one agent can hold a claim on a file at a time
- Crash tolerance: If an agent dies, the TTL expires and the lock is automatically released
- Visibility: Any agent can query Redis to see who holds what
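To make that concrete, here is a minimal sketch of what acquiring a single claim might look like with an ioredis client. The `claimFile` helper and the `claim:` key prefix are illustrative assumptions rather than the actual Nexus API; the load-bearing piece is `SET ... PX ... NX`, which creates the key with a TTL only if no other agent already holds it.

```ts
import Redis from "ioredis";

const redis = new Redis();

// Hypothetical helper: returns true if the claim was acquired.
// The `claim:<file>` key layout is an assumption for illustration.
async function claimFile(file: string, agentId: string, ttlMs: number): Promise<boolean> {
  // SET key value PX <ttl> NX: succeeds only if the key does not already exist.
  const result = await redis.set(`claim:${file}`, agentId, "PX", ttlMs, "NX");
  return result === "OK"; // ioredis returns "OK" on success, null when the key exists
}

// Usage: agent-a tries to claim the login module for 30 seconds.
if (!(await claimFile("src/auth/login.ts", "agent-a", 30_000))) {
  // Visibility for free: anyone can ask Redis who holds the claim.
  const holder = await redis.get("claim:src/auth/login.ts");
  console.log(`src/auth/login.ts is already claimed by ${holder}`);
}
```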
# Pipeline Pattern for Atomic Operations
A single file claim is simple, but real work often requires claiming multiple files atomically. You don't want to claim auth/login.ts but fail on auth/types.ts — that leaves you in a half-locked state.
We use Redis pipelines to issue the claims as a single batch and treat the batch as all-or-nothing:
```ts
const pipeline = redis.pipeline();
for (const file of files) {
  // NX: only set if the key does not exist; PX: expire after `ttl` milliseconds
  pipeline.set(`claim:${file}`, agentId, "PX", ttl, "NX");
}
const results = await pipeline.exec();
```

The NX flag means "only set if not exists." If any claim in the batch fails, we release the claims that did succeed, so no agent is left holding a partial set. That rollback is the all-or-nothing guarantee that makes the system reliable.
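For completeness, a rough sketch of that rollback, reusing the illustrative ioredis client and `claim:` prefix from above (the `claimFiles` name is hypothetical): any keys set before the first rejection are deleted, so no partial lock set survives.

```ts
// Hypothetical multi-file claim with application-level rollback.
async function claimFiles(files: string[], agentId: string, ttl: number): Promise<boolean> {
  const pipeline = redis.pipeline();
  for (const file of files) {
    pipeline.set(`claim:${file}`, agentId, "PX", ttl, "NX");
  }
  const results = await pipeline.exec();

  // Each pipeline entry is [error, reply]; a null reply means NX blocked the SET.
  const acquired = files.filter((_, i) => results?.[i]?.[1] === "OK");
  if (acquired.length === files.length) return true;

  // Roll back: release the claims we did take so nothing stays half-locked.
  if (acquired.length > 0) {
    await redis.del(...acquired.map((file) => `claim:${file}`));
  }
  return false;
}
```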
# Heartbeat-Based Liveness
TTLs handle the crash case, but what about an agent that's alive but slow? A 30-second TTL might expire while a legitimate operation is still in progress.
The solution is heartbeats. Every agent with active claims sends periodic heartbeat signals that extend the TTL:
```
PEXPIRE claim:src/auth/login.ts 30000
```

If heartbeats stop — because the agent crashed, lost network, or was terminated — the claims expire naturally. No manual cleanup required.
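A minimal heartbeat loop, under the same illustrative key layout, can be a timer that re-arms the TTL on every claim the agent still holds. The constants and the `startHeartbeat` name are assumptions; the values mirror the 30-second TTL and 10-second renewal described in the lessons below.

```ts
const TTL_MS = 30_000;       // claim lifetime
const HEARTBEAT_MS = 10_000; // renew well before the TTL can lapse

// Hypothetical heartbeat: extend the TTL of every claim this agent still holds.
function startHeartbeat(files: string[]): NodeJS.Timeout {
  return setInterval(async () => {
    const pipeline = redis.pipeline();
    for (const file of files) {
      pipeline.pexpire(`claim:${file}`, TTL_MS); // same effect as the PEXPIRE above
    }
    await pipeline.exec();
  }, HEARTBEAT_MS);
}

// Stop renewing when the work is done; on a crash, the TTL simply runs out.
const heartbeat = startHeartbeat(["src/auth/login.ts"]);
// ... do the work ...
clearInterval(heartbeat);
```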
# Conflict Resolution
What happens when two agents try to claim the same file? The first one wins (Redis NX guarantees this). The second agent gets a rejection and must decide:
- Wait and retry — poll until the claim is released
- Request release — send a message to the holding agent asking it to finish up
- Escalate — flag the conflict for human review
In practice, option 2 works best for AI agents. They're cooperative by nature and can often reorganize their work to avoid the conflict entirely.
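One way to wire the three options together, as a sketch only: the pub/sub channel, message shape, and backoff schedule below are assumptions, since the section does not specify how agents message each other. It reuses the hypothetical `claimFile` and `TTL_MS` from the earlier sketches.

```ts
// Hypothetical conflict handler combining the three strategies above.
async function claimWithConflictHandling(file: string, agentId: string): Promise<boolean> {
  if (await claimFile(file, agentId, TTL_MS)) return true;

  // Option 2: ask the current holder to wrap up (channel name is an assumption).
  const holder = await redis.get(`claim:${file}`);
  await redis.publish(
    `agent:${holder}:requests`,
    JSON.stringify({ type: "release-request", file, from: agentId })
  );

  // Option 1: poll with exponential backoff while the holder finishes.
  for (let delay = 1_000; delay <= 16_000; delay *= 2) {
    await new Promise((resolve) => setTimeout(resolve, delay));
    if (await claimFile(file, agentId, TTL_MS)) return true;
  }

  // Option 3: escalate; how the flag is surfaced is implementation-specific.
  console.warn(`Conflict on ${file}: held by ${holder}, release request from ${agentId} timed out`);
  return false;
}
```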
# Lessons Learned
Building this system taught us several things about distributed coordination for AI agents:
Agents are more cooperative than processes. Traditional distributed locking assumes adversarial or at least independent actors. AI agents can actually communicate about their intentions, which makes conflict resolution much smoother.
TTLs should be generous. AI agents doing code generation can take unpredictable amounts of time. Short TTLs cause spurious expirations. We settled on 30 seconds with heartbeat renewal every 10 seconds.
Visibility matters more than speed. The ability for any agent (or human) to see who holds what locks is incredibly valuable for debugging. We built a dashboard view of all active claims that updates in real-time via WebSocket.
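A crude version of that visibility, under the same assumed key layout, is just a scan over the claim keys together with their holders and remaining TTLs; the real dashboard presumably layers the WebSocket updates on top of something like this.

```ts
// Hypothetical snapshot of all active claims, e.g. to feed a dashboard view.
async function listActiveClaims(): Promise<{ file: string; holder: string | null; ttlMs: number }[]> {
  const claims: { file: string; holder: string | null; ttlMs: number }[] = [];
  let cursor = "0";
  do {
    // SCAN avoids blocking Redis the way KEYS would on a large keyspace.
    const [next, keys] = await redis.scan(cursor, "MATCH", "claim:*", "COUNT", 100);
    cursor = next;
    for (const key of keys) {
      const [holder, ttlMs] = await Promise.all([redis.get(key), redis.pttl(key)]);
      claims.push({ file: key.replace(/^claim:/, ""), holder, ttlMs });
    }
  } while (cursor !== "0");
  return claims;
}
```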
The full implementation lives in the Nexus coordination server, where it's battle-tested across multi-agent engineering sessions.