# Distributed Locking for AI Agent Coordination
When multiple AI agents work on the same codebase simultaneously, they need a way to avoid stepping on each other's toes. This is the distributed coordination problem — and it's one of the oldest problems in computer science, now showing up in an entirely new context.
# The Problem
Imagine three Claude Code agents working on a feature branch. Agent A is refactoring the authentication module. Agent B is updating the API routes that depend on auth. Agent C is writing tests for both. Without coordination, Agent B might read a file that Agent A is halfway through rewriting. Agent C might test against a state that no longer exists.
This isn't hypothetical. It's what happens in every multi-agent engineering setup that lacks coordination primitives.
# Redis-Backed File Claims
The solution we implemented in Nexus uses Redis as a distributed lock manager with file-level granularity. When an agent needs to modify a file, it "claims" it:
```
CLAIM file:src/auth/login.ts agent:agent-a ttl:30000
```

The claim is a Redis key with a TTL (time-to-live). This gives us several properties for free:
- Mutual exclusion: Only one agent can hold a claim on a file at a time
- Crash tolerance: If an agent dies, the TTL expires and the lock is automatically released
- Visibility: Any agent can query Redis to see who holds what
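To make that concrete, here is a minimal sketch of what acquiring a single claim might look like with an ioredis client. The `claimFile` helper and the `claim:` key prefix are illustrative assumptions rather than the actual Nexus API; the load-bearing piece is `SET ... PX ... NX`, which creates the key with a TTL only if no other agent already holds it.

```ts
import Redis from "ioredis";

const redis = new Redis();

// Hypothetical helper: returns true if the claim was acquired.
// The `claim:<file>` key layout is an assumption for illustration.
async function claimFile(file: string, agentId: string, ttlMs: number): Promise<boolean> {
  // SET key value PX <ttl> NX: succeeds only if the key does not already exist.
  const result = await redis.set(`claim:${file}`, agentId, "PX", ttlMs, "NX");
  return result === "OK"; // ioredis returns "OK" on success, null when the key exists
}

// Usage: agent-a tries to claim the login module for 30 seconds.
if (!(await claimFile("src/auth/login.ts", "agent-a", 30_000))) {
  // Visibility for free: anyone can ask Redis who holds the claim.
  const holder = await redis.get("claim:src/auth/login.ts");
  console.log(`src/auth/login.ts is already claimed by ${holder}`);
}
```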
# Pipeline Pattern for Atomic Operations
A single file claim is simple, but real work often requires claiming multiple files atomically. You don't want to claim auth/login.ts but fail on auth/types.ts — that leaves you in a half-locked state.
We use Redis pipelines to issue the claims as a single batch and treat the batch as all-or-nothing:
```ts
const pipeline = redis.pipeline();
for (const file of files) {
  // NX: only set if the key does not exist; PX: expire after `ttl` milliseconds
  pipeline.set(`claim:${file}`, agentId, "PX", ttl, "NX");
}
const results = await pipeline.exec();
```

The NX flag means "only set if not exists." If any claim in the batch fails, we release the claims that did succeed, so no agent is left holding a partial set. That rollback is the all-or-nothing guarantee that makes the system reliable.
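For completeness, a rough sketch of that rollback, reusing the illustrative ioredis client and `claim:` prefix from above (the `claimFiles` name is hypothetical): any keys set before the first rejection are deleted, so no partial lock set survives.

```ts
// Hypothetical multi-file claim with application-level rollback.
async function claimFiles(files: string[], agentId: string, ttl: number): Promise<boolean> {
  const pipeline = redis.pipeline();
  for (const file of files) {
    pipeline.set(`claim:${file}`, agentId, "PX", ttl, "NX");
  }
  const results = await pipeline.exec();

  // Each pipeline entry is [error, reply]; a null reply means NX blocked the SET.
  const acquired = files.filter((_, i) => results?.[i]?.[1] === "OK");
  if (acquired.length === files.length) return true;

  // Roll back: release the claims we did take so nothing stays half-locked.
  if (acquired.length > 0) {
    await redis.del(...acquired.map((file) => `claim:${file}`));
  }
  return false;
}
```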
# Heartbeat-Based Liveness
TTLs handle the crash case, but what about an agent that's alive but slow? A 30-second TTL might expire while a legitimate operation is still in progress.
The solution is heartbeats. Every agent with active claims sends periodic heartbeat signals that extend the TTL:
```
PEXPIRE claim:src/auth/login.ts 30000
```

If heartbeats stop — because the agent crashed, lost network, or was terminated — the claims expire naturally. No manual cleanup required.
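A minimal heartbeat loop, under the same illustrative key layout, can be a timer that re-arms the TTL on every claim the agent still holds. The constants and the `startHeartbeat` name are assumptions; the values mirror the 30-second TTL and 10-second renewal described in the lessons below.

```ts
const TTL_MS = 30_000;       // claim lifetime
const HEARTBEAT_MS = 10_000; // renew well before the TTL can lapse

// Hypothetical heartbeat: extend the TTL of every claim this agent still holds.
function startHeartbeat(files: string[]): NodeJS.Timeout {
  return setInterval(async () => {
    const pipeline = redis.pipeline();
    for (const file of files) {
      pipeline.pexpire(`claim:${file}`, TTL_MS); // same effect as the PEXPIRE above
    }
    await pipeline.exec();
  }, HEARTBEAT_MS);
}

// Stop renewing when the work is done; on a crash, the TTL simply runs out.
const heartbeat = startHeartbeat(["src/auth/login.ts"]);
// ... do the work ...
clearInterval(heartbeat);
```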
# Conflict Resolution
What happens when two agents try to claim the same file? The first one wins (Redis NX guarantees this). The second agent gets a rejection and must decide:
- Wait and retry — poll until the claim is released
- Request release — send a message to the holding agent asking it to finish up
- Escalate — flag the conflict for human review
In practice, option 2 works best for AI agents. They're cooperative by nature and can often reorganize their work to avoid the conflict entirely.
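One way to wire the three options together, as a sketch only: the pub/sub channel, message shape, and backoff schedule below are assumptions, since the section does not specify how agents message each other. It reuses the hypothetical `claimFile` and `TTL_MS` from the earlier sketches.

```ts
// Hypothetical conflict handler combining the three strategies above.
async function claimWithConflictHandling(file: string, agentId: string): Promise<boolean> {
  if (await claimFile(file, agentId, TTL_MS)) return true;

  // Option 2: ask the current holder to wrap up (channel name is an assumption).
  const holder = await redis.get(`claim:${file}`);
  await redis.publish(
    `agent:${holder}:requests`,
    JSON.stringify({ type: "release-request", file, from: agentId })
  );

  // Option 1: poll with exponential backoff while the holder finishes.
  for (let delay = 1_000; delay <= 16_000; delay *= 2) {
    await new Promise((resolve) => setTimeout(resolve, delay));
    if (await claimFile(file, agentId, TTL_MS)) return true;
  }

  // Option 3: escalate; how the flag is surfaced is implementation-specific.
  console.warn(`Conflict on ${file}: held by ${holder}, release request from ${agentId} timed out`);
  return false;
}
```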
# Lessons Learned
Building this system taught us several things about distributed coordination for AI agents:
Agents are more cooperative than processes. Traditional distributed locking assumes adversarial or at least independent actors. AI agents can actually communicate about their intentions, which makes conflict resolution much smoother.
TTLs should be generous. AI agents doing code generation can take unpredictable amounts of time. Short TTLs cause spurious expirations. We settled on 30 seconds with heartbeat renewal every 10 seconds.
Visibility matters more than speed. The ability for any agent (or human) to see who holds what locks is incredibly valuable for debugging. We built a dashboard view of all active claims that updates in real-time via WebSocket.
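A crude version of that visibility, under the same assumed key layout, is just a scan over the claim keys together with their holders and remaining TTLs; the real dashboard presumably layers the WebSocket updates on top of something like this.

```ts
// Hypothetical snapshot of all active claims, e.g. to feed a dashboard view.
async function listActiveClaims(): Promise<{ file: string; holder: string | null; ttlMs: number }[]> {
  const claims: { file: string; holder: string | null; ttlMs: number }[] = [];
  let cursor = "0";
  do {
    // SCAN avoids blocking Redis the way KEYS would on a large keyspace.
    const [next, keys] = await redis.scan(cursor, "MATCH", "claim:*", "COUNT", 100);
    cursor = next;
    for (const key of keys) {
      const [holder, ttlMs] = await Promise.all([redis.get(key), redis.pttl(key)]);
      claims.push({ file: key.replace(/^claim:/, ""), holder, ttlMs });
    }
  } while (cursor !== "0");
  return claims;
}
```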
The full implementation lives in the Nexus coordination server, where it's battle-tested across multi-agent engineering sessions.