DISTRIBUTED-SYSTEMS-ENGINEERING Contents

Distributed Locks (Why They’re Dangerous)

Distributed locks coordinate access to shared resources across nodes, but naive implementations cause data corruption and split-brain failures. This lesson explains safe locking patterns, fencing tokens, and production failure scenarios.

On this page

Distributed Locks: Coordination Without Corruption

Distributed locks are used to ensure that only one node performs a critical operation at a time. Examples include scheduled jobs, leader-only tasks, inventory updates, and schema migrations. While the concept appears simple, incorrect distributed locking is one of the most common sources of data corruption in production systems.

The fundamental question is not how to acquire a lock. It is how to guarantee that no two nodes believe they hold the lock simultaneously under failure conditions.

Why Distributed Locks Are Hard

In a single process, a mutex works because memory is shared and failure is deterministic. In distributed systems:

  • Nodes can crash after acquiring a lock.
  • Network partitions can isolate lock holders.
  • Clock drift affects lease expiration.
  • Retries can create duplicate lock attempts.

Without careful design, two nodes may both execute critical sections.

Locking Strategies

Consensus-Based Locks

Use a consensus system (e.g., Raft-backed store) to serialize lock ownership. Lock acquisition is recorded as a log entry committed by majority.

This provides strong safety guarantees but introduces latency equal to quorum commit time.

Lease-Based Locks

Lock is granted for a limited time (TTL). If holder fails, lease expires and others may acquire it.

Lease-based locks require careful timeout tuning and clock considerations.

Production Scenario: Double Job Execution

Symptom

A scheduled job designed to run once per hour executes twice in parallel during a network partition.

Root Cause

The lock was implemented using a single-instance in-memory store. During partition, both sides assumed leadership and acquired independent locks.

Diagnosis

  • No majority quorum enforcement.
  • No fencing tokens protecting downstream systems.
  • Lock TTL expired unexpectedly due to GC pause.

Resolution

  • Move lock state to consensus-backed store.
  • Attach fencing tokens to each lock acquisition.
  • Ensure critical resource validates fencing tokens.

The Importance of Fencing Tokens

Even with consensus-based locks, stale lock holders can cause corruption if they continue executing after losing ownership.

Fencing tokens solve this problem:

  • Each lock acquisition receives a monotonically increasing token.
  • Downstream systems reject operations with stale tokens.
  • Even if a node believes it holds the lock, outdated tokens prevent damage.

This protects against delayed messages and GC pauses.

Clock Drift and Lease Risk

Lease-based locks depend on timeouts. If clocks drift or GC pauses delay execution, a node may believe its lease is valid when it is not.

Mitigation strategies:

  • Use monotonic clocks where possible.
  • Keep lease durations short relative to expected pause times.
  • Implement fencing tokens even with leases.

Common Anti-Patterns

  • Using a single Redis instance without replication for critical locks.
  • Assuming network partitions are rare enough to ignore.
  • Relying solely on TTL expiration without fencing.
  • Ignoring idempotency in lock-protected operations.

Lock vs Idempotency

In many cases, distributed locks can be avoided by designing idempotent operations. Instead of preventing duplicates, allow safe retries.

Locks coordinate execution. Idempotency tolerates repetition. The latter is often safer and more scalable.

Operational Checklist

  • Is lock state stored in a consensus-backed system?
  • Are fencing tokens enforced by the resource being protected?
  • Is lease duration aligned with GC and network characteristics?
  • Are lock acquisition failures observable and logged?
  • Have partition scenarios been tested?

Failure Injection Test

# Distributed lock validation
1) Acquire lock on node A
2) Simulate long GC pause on node A
3) Expire lease and acquire lock on node B
4) Resume node A execution
5) Verify stale token is rejected

Key Takeaways

  • Distributed locks require majority-backed coordination for safety.
  • Lease-based locks alone are insufficient without fencing tokens.
  • Clock drift and GC pauses are real failure factors.
  • Idempotency can reduce reliance on locks.
  • Testing under partition is mandatory before production use.

Distributed locking is a coordination tool, not a shortcut. Implemented correctly, it prevents race conditions. Implemented incorrectly, it creates them.