Why Concurrency Is Hard
Writing correct concurrent code is one of the hardest problems in systems programming. The fundamental issue is that modern CPUs and compilers do not execute instructions in the order you write them. They reorder operations for performance — legally, within a defined contract called the memory model — and this breaks the intuitions programmers bring from sequential code.
Three distinct sources of reordering conspire against you:
1. Compiler reordering. The compiler may move loads and stores across statement boundaries if it can prove the transformation is safe for a single-threaded program. It has no visibility into concurrent readers on other threads.
2. CPU out-of-order execution. Modern superscalar CPUs execute hundreds of instructions simultaneously via deep pipelines, reordering operations to hide memory latency. The CPU commits results in-order (preserving single-threaded semantics) but stores may become globally visible out of order.
3. Store buffers and invalidation queues. Each CPU core has a private store buffer that batches writes before they are flushed to the L3 cache (which is the point of coherence on most architectures). A write that has been placed in the store buffer is immediately visible to the same core (store-to-load forwarding) but invisible to other cores until flushed.
Core 0 Store Buffer 0 L3 Cache Store Buffer 1 Core 1
------ -------------- -------- -------------- ------
x = 1; ──write──► [ x=1 pending ] ◄────────────────────────────── read x → sees 0!
y = 2; ──write──► [ y=2 pending ]
│
(flush on sync)
│
└──────────────► x=1, y=2 ────────────────► read x → sees 1
The key insight: the C abstract machine is sequentially consistent, but real hardware is not. Every concurrent algorithm you write must explicitly account for the gap.
Hardware Memory Models
Different CPU architectures offer different guarantees, forming a spectrum from strong to weak consistency.
x86 / x86-64 — Total Store Order (TSO)
x86 provides TSO: all stores appear in program order relative to each other (store buffer preserves order), and all loads appear in program order. The only allowed reordering is store → load (a later load can be reordered before an earlier store to a different address). This means most x86 code written without explicit fences “accidentally” works, lulling programmers into a false sense of safety.
Allowed on x86: Forbidden on x86:
----------------- ----------------------
store(x) load(y) before load(x) ← x86 prevents this
load(y) [reordered] store(y) before store(x) ← x86 prevents this
ARM / POWER — Weak Memory Models
ARM and POWER have much weaker models. Almost any reordering is allowed: store-store, load-load, load-store, store-load. Code that works on x86 without barriers often fails on ARM. This is why Android/iOS developers must use explicit atomic operations.
The practical takeaway: Never write code that relies on implicit x86 TSO ordering. Your code will be ported to ARM, compiled with LLVM that may reorder things, or run on a future architecture.
The C++11 Memory Model and Java Memory Model
Both C++ (since C++11) and Java define formal memory models based on the happens-before relationship.
Happens-before (HB): If operation A happens-before operation B, then A’s effects are visible to B. HB is established by:
- Program order within a single thread
- Synchronization operations: mutex unlock HB mutex lock, volatile write HB volatile read (Java),
releasestore HBacquireload (C++) - Thread creation: the spawning code HB the first instruction of the new thread
- Thread join: the last instruction of a thread HB the code that joins it
If two operations on the same memory location have no HB relationship between them and at least one is a write — that is a data race, and the behavior is undefined in C++ and unspecified in Java. Undefined behavior means the compiler may produce literally any output: it is not a runtime error you can catch, it is a compile-time contract violation.
Thread 1: Thread 2:
--------- ---------
x = 1; if (ready) {
ready = true; print(x); // data race! no HB between x=1 and x read
}
Memory Ordering
C++11 std::atomic and Rust’s std::sync::atomic expose six memory orderings. Choosing the right one is critical for both correctness and performance:
// C++ memory ordering options
std::atomic<int> counter{0};
// Relaxed: no ordering constraints. Only atomicity guaranteed.
// Use for: statistics counters where you don't care about ordering.
counter.fetch_add(1, std::memory_order_relaxed);
// Acquire: all subsequent reads/writes in this thread see effects
// from any release on the same variable. Use on LOAD side.
int val = flag.load(std::memory_order_acquire);
// Release: all preceding reads/writes in this thread are visible
// to any acquire on the same variable. Use on STORE side.
flag.store(1, std::memory_order_release);
// Acquire-Release: both acquire and release semantics. Use on RMW ops
// (fetch_add, compare_exchange) that act as both sides of a sync.
node->next.exchange(new_node, std::memory_order_acq_rel);
// Sequential Consistency (seq_cst): strongest. Global total order
// of all seq_cst operations across all threads. Expensive on ARM/POWER.
counter.fetch_add(1, std::memory_order_seq_cst); // default
The acquire-release pair is the fundamental synchronization primitive. A release store “publishes” a block of writes; an acquire load “subscribes” to them. This is how mutexes, channels, and condition variables are built.
Thread 1 (producer): Thread 2 (consumer):
-------------------- --------------------
data = compute(); while (!ready.load(acquire)) { spin; }
ready.store(true, release); use(data); // guaranteed to see compute() result
Memory ordering performance on x86: relaxed ≈ acquire ≈ release (x86 TSO gives these for free). seq_cst requires an MFENCE instruction (~20-30 cycles). On ARM, every ordering level has a cost.
Mutex: Kernel-Backed Sleep
A mutex (mutual exclusion lock) ensures that only one thread executes a critical section at a time. Unlike a spinlock, a contended mutex causes the waiting thread to sleep (deschedule itself), yielding the CPU to other work.
Linux futex (Fast Userspace muTEX): The naive implementation of a mutex uses a syscall to lock() every time, which is extremely expensive (~1µs). The futex system call solves this:
State: 0 = unlocked, 1 = locked (no waiters), 2 = locked (waiters exist)
lock():
if CAS(mutex, 0, 1) succeeds → fast path, no syscall needed
else → slow path: set state=2, call futex_wait() → thread sleeps in kernel
unlock():
if atomic_exchange(mutex, 0) == 1 → no waiters, done (no syscall)
else → state was 2, call futex_wake(1) to wake one waiter
This is the uniprocessor fast path: if the mutex is uncontended, lock and unlock are pure userspace CAS operations. The kernel is only involved when threads actually contend.
Priority inversion: Thread L (low priority) holds a mutex. Thread H (high priority) blocks on it. Thread M (medium priority) preempts L, preventing L from releasing the mutex. H is now stuck waiting behind M — a high-priority thread is blocked by a lower-priority one.
Solutions: Priority inheritance (Linux PI mutexes — L temporarily inherits H’s priority), Priority ceiling (all threads holding the mutex run at ceiling priority).
// Rust: Mutex<T> makes the data inaccessible without the lock
use std::sync::{Arc, Mutex};
use std::thread;
let counter = Arc::new(Mutex::new(0u64));
let handles: Vec<_> = (0..8).map(|_| {
let c = Arc::clone(&counter);
thread::spawn(move || {
let mut guard = c.lock().unwrap(); // blocks until acquired
*guard += 1;
// guard drops here → mutex unlocked automatically
})
}).collect();
for h in handles { h.join().unwrap(); }
println!("{}", *counter.lock().unwrap()); // 8
Spinlock: Busy-Wait
A spinlock does not put the waiting thread to sleep. It loops on an atomic check until the lock is available:
// Minimal spinlock in C using GCC builtins
typedef struct { int locked; } spinlock_t;
void spin_lock(spinlock_t *l) {
while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE)) {
// CPU hint: we're in a spin-wait loop
// Prevents speculative execution from pipelining loads harmfully
__asm__ volatile("pause" ::: "memory"); // x86 PAUSE instruction
}
}
void spin_unlock(spinlock_t *l) {
__atomic_clear(&l->locked, __ATOMIC_RELEASE);
}
When spinlocks win: Critical section duration < context switch cost (~1-5µs on Linux). Spinlocks eliminate the kernel round-trip. Used heavily in OS kernels, interrupt handlers, and real-time systems.
When spinlocks lose: On a single-core machine, spinning is pure waste — the holder can’t run. On multicore, if the holder is preempted, all spinners waste their entire quantum. Never use spinlocks in user-space code unless you pin threads to cores and carefully control scheduling.
The PAUSE instruction on x86 is critical: it hints to the CPU that this is a spin-wait loop, reducing power consumption and preventing the CPU from aggressively speculating on the spin variable’s value (which would cause pipeline flush on lock acquisition).
Read-Write Lock (RwLock)
Many workloads are read-heavy. A plain mutex serializes all readers, wasting parallelism. An RwLock allows concurrent readers while ensuring writer exclusivity.
State machine:
Free ──────────────► Locked (exclusive) [writer]
Free ──────────────► Locked (shared, n) [n readers]
Locked (shared, n) ─► Locked (shared, n+1) [another reader]
Locked (shared, n) ─► Free (if n was 1)
Locked (exclusive) ─► Free
Writer starvation: Naively allowing readers to always succeed when other readers hold the lock causes writers to starve. Most implementations give priority to waiting writers once one arrives — new readers block until the writer gets its turn.
use std::sync::{Arc, RwLock};
let db: Arc<RwLock<HashMap<String, String>>> = Arc::new(RwLock::new(HashMap::new()));
// Multiple concurrent readers:
let r1 = db.read().unwrap(); // shared lock
let r2 = db.read().unwrap(); // compatible — both succeed simultaneously
// Exclusive writer:
drop(r1); drop(r2);
let mut w = db.write().unwrap(); // waits for all readers to finish
w.insert("key".to_string(), "val".to_string());
Semaphore vs Mutex
A semaphore is a generalized counter with two operations: wait (decrement, block if 0) and signal (increment, wake a waiter).
| Property | Mutex | Semaphore |
|---|---|---|
| Ownership | Thread that locks must unlock | Any thread can signal |
| Initial value | 1 (binary) | Any non-negative integer |
| Use case | Mutual exclusion | Resource counting, signaling |
| Priority inversion | Has solutions (PI mutex) | Harder to solve |
Counting semaphore example: A database connection pool with 10 connections. Each acquire decrements the semaphore; each release increments it. Threads block when all 10 are in use.
The crucial distinction: Mutexes have ownership — the thread that acquires must be the one to release. Semaphores have no ownership — one thread can signal that another thread waited on. This makes semaphores useful for producer-consumer signaling but dangerous for mutual exclusion (no protection against accidental double-release).
Condition Variables
A condition variable allows threads to wait for a predicate to become true, atomically releasing the mutex while waiting.
// C++ condition variable — canonical pattern
std::mutex mu;
std::condition_variable cv;
std::queue<Task> queue;
bool done = false;
// Producer
void producer(Task t) {
std::unique_lock<std::mutex> lock(mu);
queue.push(t);
cv.notify_one(); // wake one waiter
}
// Consumer — ALWAYS use a while loop, not if!
void consumer() {
std::unique_lock<std::mutex> lock(mu);
// SPURIOUS WAKEUP: wait() can return without notify being called.
// The OS may wake the thread for scheduling reasons.
// The while loop re-checks the predicate.
while (queue.empty() && !done) {
cv.wait(lock); // atomically: releases lock + sleeps
// on wake: re-acquires lock before returning
}
if (!queue.empty()) {
Task t = queue.front(); queue.pop();
process(t);
}
}
cv.notify_one() wakes one waiting thread. cv.notify_all() wakes all — use when the predicate may be satisfied for multiple waiters simultaneously (e.g., a shutdown signal).
The spurious wakeup rule is non-negotiable. POSIX explicitly allows pthread_cond_wait to return without a signal. Always wrap wait() in a while loop checking the predicate.
Atomic Operations and Lock-Free Programming
Atomic operations are indivisible read-modify-write operations guaranteed to be seen as instantaneous by all cores. The key primitives:
Compare-And-Swap (CAS): The foundation of lock-free programming.
// CAS semantics: atomically do:
// if (*ptr == expected) { *ptr = desired; return true; }
// else { expected = *ptr; return false; }
bool compare_exchange_weak(T& expected, T desired, memory_order success, memory_order failure);
Fetch-Add: Atomically add and return old value. Used for lock-free counters and sequence numbers.
Test-And-Set: Atomically set to 1 and return old value. The building block for spinlocks.
Lock-free stack (Treiber stack):
use std::sync::atomic::{AtomicPtr, Ordering};
use std::ptr;
struct Node<T> {
data: T,
next: *mut Node<T>,
}
struct Stack<T> {
head: AtomicPtr<Node<T>>,
}
impl<T> Stack<T> {
fn push(&self, data: T) {
let node = Box::into_raw(Box::new(Node { data, next: ptr::null_mut() }));
loop {
let head = self.head.load(Ordering::Acquire);
unsafe { (*node).next = head; }
// CAS: if head hasn't changed, install our node as new head
if self.head.compare_exchange_weak(head, node,
Ordering::Release, Ordering::Relaxed).is_ok() {
return;
}
// CAS failed: another thread raced us, retry
}
}
}
The ABA Problem
CAS checks “is the value still A?” — but another thread may have changed A→B→A between your load and CAS, making the CAS succeed incorrectly.
Thread 1 reads head = A
Thread 2 pops A, pushes B, pops B, pushes A (with different next!)
Thread 1's CAS(head, A, new) succeeds — but A's next pointer changed!
Solutions:
- Tagged pointers (version counter): Use the unused high/low bits of a pointer to store a monotonically increasing version. CAS on
(pointer, version)— even if the pointer returns to A, the version never matches.
// Pack tag into low 3 bits (assuming 8-byte alignment)
typedef struct { uintptr_t ptr_and_tag; } tagged_ptr_t;
#define GET_PTR(tp) ((void*)((tp).ptr_and_tag & ~0x7ULL))
#define GET_TAG(tp) ((tp).ptr_and_tag & 0x7ULL)
#define MAKE_TP(ptr, tag) ((tagged_ptr_t){(uintptr_t)(ptr) | ((tag) & 0x7)})
-
Hazard pointers: Before accessing a pointer, publish it to a thread-local hazard slot. Memory reclamation skips any pointer that appears in any hazard slot.
-
Epoch-based reclamation (EBR): Threads announce which epoch they’re in. Memory is only freed when it’s safe for all currently-active epochs.
The Michael-Scott Queue
The canonical lock-free FIFO queue, published by Maged Michael and Michael Scott in 1996. Used in Java’s ConcurrentLinkedQueue.
Enqueue:
1. Allocate new node
2. Read tail
3. CAS tail->next from null to new node (may fail → retry)
4. CAS tail from old tail to new node (may fail → another thread did it)
Dequeue:
1. Read head
2. Read head->next (the real first element — head is a dummy)
3. CAS head from old head to head->next (may fail → retry)
4. Return old head->next->data
The key insight: if a thread sees tail->next != null, the tail pointer is lagging behind — another thread is mid-enqueue. Help it along by advancing tail before retrying. This helping mechanism is a common pattern in lock-free algorithms.
Language-Specific Concurrency
Go: CSP-Style with Goroutines and Channels
Go implements Communicating Sequential Processes (CSP): share memory by communicating rather than communicate by sharing memory.
package main
import (
"sync"
"sync/atomic"
)
// Channel-based producer-consumer (idiomatic Go)
func pipeline() {
jobs := make(chan int, 100) // buffered channel = lock-free queue
results := make(chan int, 100)
// 4 worker goroutines
var wg sync.WaitGroup
for w := 0; w < 4; w++ {
wg.Add(1)
go func() {
defer wg.Done()
for j := range jobs { // blocks until job available or channel closed
results <- j * j
}
}()
}
// Send jobs
for j := 0; j < 100; j++ { jobs <- j }
close(jobs) // signals workers to exit
go func() { wg.Wait(); close(results) }()
for r := range results { _ = r }
}
// select: multiplex on multiple channels
func multiplex(ch1, ch2 <-chan int, done <-chan struct{}) {
for {
select {
case v := <-ch1:
_ = v
case v := <-ch2:
_ = v
case <-done:
return // clean shutdown
}
}
}
// sync/atomic for counters (faster than Mutex for simple increments)
var hits int64
atomic.AddInt64(&hits, 1)
Rust: Fearless Concurrency via Types
Rust’s ownership system prevents data races at compile time. The Send trait marks types safe to transfer between threads. The Sync trait marks types safe to share references between threads. The compiler enforces these automatically.
use std::thread;
use std::sync::{Arc, Mutex};
// This WILL NOT COMPILE — data race caught at compile time:
// let mut x = 5;
// thread::spawn(|| x += 1); // error: `x` moved into closure
// println!("{}", x); // error: use after move
// Correct: wrap in Arc<Mutex<T>> for shared mutable state
let x = Arc::new(Mutex::new(5));
let x2 = Arc::clone(&x);
let t = thread::spawn(move || {
*x2.lock().unwrap() += 1;
});
t.join().unwrap();
println!("{}", *x.lock().unwrap()); // 6
Mutex<T> in Rust is different from C++: the data is inside the mutex. You cannot access T without holding the lock. The lock guard is dropped when it goes out of scope — no unlock() to forget.
Java: synchronized, volatile, java.util.concurrent
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;
import java.util.concurrent.locks.*;
// LockSupport.park/unpark: the primitive underlying all Java blocking
// Thread.sleep() vs LockSupport.park(): park can be woken by unpark
// without a timeout; it's how ReentrantLock is implemented.
// AtomicLong: lock-free counter using CAS
AtomicLong counter = new AtomicLong(0);
counter.incrementAndGet(); // maps to LOCK XADD on x86
// ReentrantReadWriteLock: multiple readers OR one writer
ReadWriteLock rwl = new ReentrantReadWriteLock();
rwl.readLock().lock();
try { /* read data */ } finally { rwl.readLock().unlock(); }
// StampedLock (Java 8+): optimistic reads (no lock acquisition!)
StampedLock sl = new StampedLock();
long stamp = sl.tryOptimisticRead();
int value = sharedValue; // read without lock
if (!sl.validate(stamp)) { // check if writer intervened
stamp = sl.readLock(); // fall back to real read lock
try { value = sharedValue; } finally { sl.unlockRead(stamp); }
}
TypeScript / Node.js: Single-Threaded Event Loop
Node.js runs JavaScript on a single thread. No mutex needed for shared state because only one piece of code runs at a time. Concurrency is achieved by not blocking: I/O operations yield to the event loop while waiting.
// This is safe in Node.js — no concurrent access possible
let counter = 0;
async function increment() {
// No await here → this is atomic with respect to other async functions
counter++;
}
// But SharedArrayBuffer + Worker threads require explicit synchronization
import { Worker, isMainThread, workerData } from 'worker_threads';
// In main thread:
const sab = new SharedArrayBuffer(4);
const arr = new Int32Array(sab);
new Worker(__filename, { workerData: { sab } });
// In worker — Atomics for safe cross-thread access:
const arr2 = new Int32Array(workerData.sab);
Atomics.add(arr2, 0, 1); // atomic fetch-add
Atomics.wait(arr2, 0, 0); // block until arr2[0] !== 0
Atomics.notify(arr2, 0, 1); // wake one waiter
Deadlock: Coffman Conditions
Deadlock requires all four Coffman conditions to hold simultaneously. Breaking any one prevents deadlock:
Thread A: Thread B:
--------- ---------
lock(mutex1) lock(mutex2)
lock(mutex2) ──┐ lock(mutex1) ──┐
│ │
└── waits for ──► ┘
(circular wait → deadlock)
- Mutual exclusion: Resources cannot be shared (a mutex can only be held by one thread). Hard to eliminate — it’s the whole point.
- Hold and wait: Threads hold resources while waiting for more. Eliminate by: acquire all resources at once, or release before acquiring more.
- No preemption: Resources cannot be forcibly taken. Eliminate by: use trylock() with timeout, back off and retry.
- Circular wait: A cycle exists in the resource-wait graph. Eliminate by: enforce a global lock ordering — always acquire mutex1 before mutex2.
Lock ordering is the most practical solution. Establish a total order on all mutexes (by address, by ID) and always acquire in that order. Now no cycle can form.
// Enforce lock ordering by mutex address
fn transfer(from: &Mutex<Account>, to: &Mutex<Account>, amount: u64) {
// Always lock lower address first
let (first, second) = if (from as *const _) < (to as *const _) {
(from, to)
} else {
(to, from)
};
let _g1 = first.lock().unwrap();
let _g2 = second.lock().unwrap();
// ... transfer
}
Livelock: Threads are not blocked but keep reacting to each other without making progress (two people in a corridor stepping the same direction to let the other pass). Detected by: threads are active but throughput is zero. Fix with randomized backoff.
Starvation: A thread is perpetually denied access to a resource. Common cause: unfair schedulers, priority-based systems, readers starving writers. Fix with fair queues (FIFO lock acquisition) or explicit anti-starvation in the scheduler.
Interview Questions Deep-Dives
”Explain the difference between a mutex and a semaphore”
The key differences that matter in interviews:
- Ownership: A mutex has ownership — the thread that acquires it must release it. A semaphore has no ownership — any thread can call
signal. This makes mutexes appropriate for mutual exclusion (protecting invariants) and semaphores appropriate for signaling (producer-consumer, resource counting). - Initial value: A mutex is always binary (locked/unlocked). A semaphore can be initialized to any non-negative integer — a counting semaphore initialized to N grants access to N threads simultaneously.
- Priority inversion: Mutex implementations often include priority inheritance to handle priority inversion. Semaphores do not.
- Usage: Use mutex when a single resource has an owner. Use semaphore when counting available slots in a pool, or when one thread signals another.
”How do you avoid deadlocks?”
Five concrete strategies:
- Lock ordering: Establish a global total order on all locks. Always acquire in that order. No cycle can form.
- Lock timeouts: Use
try_lockwith a deadline. If acquisition times out, release all held locks and retry with random backoff. - Lock-free data structures: Eliminate mutexes entirely for hot paths using atomic CAS operations.
- Single-lock design: Use a single coarse-grained lock rather than fine-grained locks. Simpler correctness, lower throughput — trade off appropriately.
- Transaction-based systems: STM (Software Transactional Memory) or database-style optimistic concurrency: read without locks, validate at commit, retry on conflict.
For system design interviews: also mention that deadlocks in distributed systems (distributed deadlock) require timeouts and cycle detection in the wait-for graph, since there is no global lock manager.