[2026] C++ Lock-Free Programming: CAS, ABA, Memory Order, High-Performance Queues [#34-3]
Key takeaways
Complete guide to lock-free programming: replace mutex bottlenecks with atomics and lock-free algorithms. Covers CAS loops, memory_order, ABA hazards, Michael-Scott queue, hazard pointers, and production benchmarks.
Introduction: Mutex Contention Bottleneck
"Processing 1M requests/sec, but lock contention is killing us"
In a high-performance server, multiple worker threads fetched tasks from a work queue. Protecting the queue with std::mutex meant that increasing thread count barely improved throughput, with lock wait time dominating overall latency. Switching to a lock-free queue on the same hardware increased throughput 2-3x.
Lock-based code requires grabbing a mutex for every queue push/pop. More threads = longer lock wait times, and while one thread holds the lock, others block. Lock-free approaches use compare-and-swap (CAS) atomic operations to synchronize without locks, allowing threads to progress without waiting for each other.
Slow code (mutex):
// Mutex-protected queue - lock contention
std::queue<Task> queue;
std::mutex mtx;
void push(const Task& t) {
std::lock_guard<std::mutex> lock(mtx);
queue.push(t);
}
bool pop(Task& t) {
std::lock_guard<std::mutex> lock(mtx);
if (queue.empty()) return false;
t = queue.front();
queue.pop();
return true;
}
Fast code (lock-free):
// Lock-free queue - CAS synchronization without locks
template <typename T>
class LockFreeQueue {
struct Node {
T data;
std::atomic<Node*> next;
};
std::atomic<Node*> head;
std::atomic<Node*> tail;
// push/pop use CAS (see examples below)
};
Cause: A mutex admits only one thread to the queue at a time. Lock-free lets multiple threads attempt CAS simultaneously; on contention, the loser retries without blocking.
What you'll learn:
- Understand lock-free concepts and CAS (compare-and-swap) operations
- Implement lock-free stacks and queues with complete code
- Avoid common mistakes: ABA problems, memory ordering, memory reclamation
- Verify performance gains over mutex with benchmarks
- Apply production patterns: hazard pointers, RCU, etc.
Conceptual Analogy
Mutex is like standing in a single line at a bank counter—waiting for your turn. Lock-free CAS is like a ticket machine that “only changes the displayed number if it matches what you saw”—one thread succeeds at a time, but no line forms. More CPU retries, but less latency variance from blocking.
Problem Scenarios
Scenario 1: High-Frequency Log Collector
Dozens of threads write logs to a buffer simultaneously. Mutex protection requires grabbing a lock for every log write, causing high latency during high-frequency logging. Lock-free ring buffers or MPSC (multi-producer single-consumer) queues enable lock-free log accumulation, with a separate thread batch-flushing.
Scenario 2: Real-Time Trading Order Book
Orders arrive and cancel within milliseconds. Protecting the order book with a mutex delays order processing behind lock waits, which can be costly while prices are moving. Lock-free data structures for order books reduce latency jitter.
Scenario 3: Game Engine Task Scheduler
Hundreds of tasks distributed to multiple threads every frame. Mutex-based task queues cause all workers to wait for queue locks at frame start, risking frame drops. Lock-free task queues (including work stealing) enable workers to fetch tasks without blocking.
Scenario 4: Network Packet Buffer
Receive thread puts packets in buffer, processing thread removes them. Single mutex protection can cause buffer overflow when packet receive rate is high while processing thread waits for lock. Lock-free SPSC (single producer single consumer) queues enable nearly independent receive and processing.
Scenario 5: Statistics Counter Aggregation
Multiple threads increment request counts, byte counts, etc. Mutex-protecting counters causes severe contention during high-frequency updates. std::atomic::fetch_add enables lock-free safe aggregation.
Scenario 6: Cache Invalidation/Update
Multiple threads read shared cache pointer, replacing with new pointer on updates. Mutex-protecting entire cache blocks even read paths. std::atomic pointers with CAS for “lock-free reads, CAS writes” pattern enable nearly blocking-free reads.
Table of Contents
- Basic Concepts
- Atomic Operations and CAS
- Lock-Free Stack Complete Example
- Lock-Free Queue Complete Example
- Memory Order (memory_order)
- ABA Problem and Solutions
- Common Errors and Solutions
- Best Practices and Cautions
- Performance Benchmarks
- Production Patterns
- Practical Examples
1. Basic Concepts
What is Lock-Free?
Lock-free algorithms are designed so that, without using locks (mutexes, spinlocks, etc.), at least one thread always makes progress even when multiple threads access a data structure simultaneously. No deadlocks, no priority inversion from lock waits.
flowchart TB
subgraph lock_based[Lock-Based]
L1[Thread A: Acquire lock] --> L2[Thread B: Wait]
L2 --> L3[Thread C: Wait]
L3 --> L4[After A completes, B, C proceed sequentially]
end
subgraph lock_free[Lock-Free]
F1[Thread A: CAS attempt] --> F2[Success → Progress]
F3[Thread B: CAS attempt] --> F4[Failure → Retry]
F4 --> F5[Progress on next CAS]
end
Analogy: Lock-based is “only person with bathroom key can enter, others wait”. Lock-free is “multiple people try opening door simultaneously, first one enters, others retry”.
Lock-Free vs Wait-Free
| Distinction | Lock-Free | Wait-Free |
|---|---|---|
| Definition | At least one thread progresses | All threads complete in finite steps |
| Retry | CAS failure allows retry | No retry, completion guaranteed |
| Implementation difficulty | Relatively low | Very high |
| Production use | Queues, stacks, counters, etc. | Limited (atomic fetch_add, etc.) |
Most lock-free data structures are lock-free, not wait-free; the CAS-failure retry loop is the common pattern.
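The distinction shows up directly in code. A hedged sketch (function names are illustrative): `fetch_add` maps to a single hardware read-modify-write on common platforms, so it is wait-free, while the CAS loop is only lock-free because an individual thread may retry:

```cpp
#include <atomic>

std::atomic<int> hits{0};

// Wait-free on typical hardware: one RMW instruction, no retry loop.
// Returns the previous value.
int record_hit() {
    return hits.fetch_add(1, std::memory_order_relaxed);
}

// Lock-free but not wait-free: this thread may retry unboundedly
// under contention, yet some thread always succeeds on each round.
int record_hit_cas() {
    int old_val = hits.load(std::memory_order_relaxed);
    while (!hits.compare_exchange_weak(old_val, old_val + 1,
                                       std::memory_order_relaxed))
        ;  // old_val is refreshed to the current value on failure
    return old_val;
}
```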
2. Atomic Operations and CAS
Basic std::atomic usage
#include <atomic>
#include <iostream>
#include <thread>
// Copy-paste and run: g++ -std=c++17 -pthread -o atomic_basic atomic_basic.cpp && ./atomic_basic
int main() {
std::atomic<int> counter{0};
std::thread t1([&]() {
for (int i = 0; i < 100000; ++i)
counter.fetch_add(1, std::memory_order_relaxed);
});
std::thread t2([&]() {
for (int i = 0; i < 100000; ++i)
counter.fetch_add(1, std::memory_order_relaxed);
});
t1.join();
t2.join();
std::cout << "counter = " << counter.load() << "\n"; // 200000
return 0;
}
Explanation: fetch_add atomically increments value, preventing data races. memory_order_relaxed is used when ordering isn’t needed (counters, etc.).
Compare-And-Swap (CAS)
CAS is an atomic operation: “if this memory location’s value equals expected value, change to new value, otherwise don’t change”. In C++, std::atomic::compare_exchange_strong / compare_exchange_weak are CAS.
#include <atomic>
std::atomic<int> value{0};
// "If 0, set to 1" — only one thread succeeds
bool try_claim() {
int expected = 0;
return value.compare_exchange_strong(expected, 1);
}
// Lock-free increment (actually fetch_add is better)
void cas_increment() {
int old_val = value.load(std::memory_order_relaxed);
while (!value.compare_exchange_weak(old_val, old_val + 1,
std::memory_order_release,
std::memory_order_relaxed)) {
// On failure, old_val automatically updates to current value
}
}
compare_exchange_weak vs strong:
- weak: Possible spurious failure on ARM, PowerPC, etc. More efficient in loops.
- strong: No spurious failure. Suitable for single attempts.
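A small sketch of the two variants (names are illustrative, not from any library): `strong` for a single decisive attempt, `weak` inside a retry loop where a spurious failure just costs one more iteration:

```cpp
#include <atomic>

std::atomic<int> slots{3};

// strong: one-shot attempt — succeeds only if the value is exactly
// `expected_free`; no spurious failure to worry about
bool claim_if_exactly(int expected_free) {
    return slots.compare_exchange_strong(expected_free, expected_free - 1);
}

// weak: always paired with a loop, so a spurious failure is harmless
void take_one() {
    int cur = slots.load(std::memory_order_relaxed);
    while (!slots.compare_exchange_weak(cur, cur - 1,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed))
        ;  // cur was refreshed to the current value on failure
}
```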
CAS operation flow
flowchart TD
A[CAS attempt] --> B{Current value == expected?}
B -->|Yes| C[Replace with desired]
C --> D[Return true]
B -->|No| E[Update expected to current value]
E --> F[Return false]
F --> G[Retry in loop]
Key atomic operations summary
#include <atomic>
std::atomic<int> a{0};
// Read/write
a.load(std::memory_order_acquire);
a.store(42, std::memory_order_release);
// Arithmetic
a.fetch_add(1); // a++
a.fetch_sub(1); // a--
a.fetch_and(0xFF);
a.fetch_or(1);
// Exchange
a.exchange(100); // Returns previous value
// Conditional exchange (CAS)
int expected = 0;
a.compare_exchange_strong(expected, 1); // expected updated by reference
a.compare_exchange_weak(expected, 1);
3. Lock-Free Stack Complete Example
Singly-Linked List Stack
Stacks only need a CAS on the head pointer: push links the new node in front of head, and pop swings head to the next node.
#include <atomic>
#include <memory>
template <typename T>
class LockFreeStack {
public:
struct Node {
T data;
Node* next;
explicit Node(const T& d) : data(d), next(nullptr) {}
};
LockFreeStack() : head_(nullptr) {}
void push(const T& value) {
Node* new_node = new Node(value);
new_node->next = head_.load(std::memory_order_relaxed);
while (!head_.compare_exchange_weak(new_node->next, new_node,
std::memory_order_release,
std::memory_order_relaxed)) {
// CAS failure updates new_node->next to current head
}
}
bool pop(T& value) {
Node* old_head = head_.load(std::memory_order_acquire);
while (old_head &&
!head_.compare_exchange_weak(old_head, old_head->next,
std::memory_order_acquire,
std::memory_order_relaxed)) {
// CAS failure updates old_head to current head
}
if (!old_head) return false;
value = old_head->data;
delete old_head; // Production needs hazard pointers, etc. (see ABA below)
return true;
}
bool empty() const {
return head_.load(std::memory_order_acquire) == nullptr;
}
private:
std::atomic<Node*> head_;
};
Code explanation:
- push: Set the new node's next to the current head, then CAS head to the new node only if head still equals next. If another thread pushed first, the CAS fails, new_node->next is updated to the new head, and the loop retries.
- pop: Read head, then CAS head to old_head->next only if head still equals old_head.
- Caution: Above implementation has ABA and memory reclamation issues. Production needs hazard pointers, etc. (see below).
Memory order explanation
// push: release — writes before this CAS visible to other threads
head_.compare_exchange_weak(..., std::memory_order_release, ...);
// pop: acquire — can read values written by other threads before this CAS
head_.compare_exchange_weak(..., std::memory_order_acquire, ...);
4. Lock-Free Queue Complete Example
Michael-Scott Lock-Free Queue (MPMC)
Classic lock-free queue using two atomic pointers (head, tail). A dummy node lets empty and non-empty queues be handled uniformly.
#include <atomic>
#include <memory>
template <typename T>
class LockFreeQueue {
public:
struct Node {
std::atomic<Node*> next;
T data;
Node() : next(nullptr), data() {}
explicit Node(const T& d) : next(nullptr), data(d) {}
};
LockFreeQueue() {
Node* dummy = new Node();
head_.store(dummy);
tail_.store(dummy);
}
~LockFreeQueue() {
while (Node* n = head_.load()) {
head_.store(n->next.load());
delete n;
}
}
void push(const T& value) {
Node* new_node = new Node(value);
Node* old_tail = nullptr;
Node* next = nullptr;
while (true) {
old_tail = tail_.load(std::memory_order_acquire);
next = old_tail->next.load(std::memory_order_acquire);
if (old_tail == tail_.load(std::memory_order_acquire)) {
if (next == nullptr) {
if (old_tail->next.compare_exchange_weak(next, new_node,
std::memory_order_release,
std::memory_order_relaxed)) {
tail_.compare_exchange_weak(old_tail, new_node,
std::memory_order_release,
std::memory_order_relaxed);
return;
}
} else {
tail_.compare_exchange_weak(old_tail, next,
std::memory_order_release,
std::memory_order_relaxed);
}
}
}
}
bool pop(T& value) {
Node* old_head = nullptr;
Node* old_tail = nullptr;
Node* next = nullptr;
while (true) {
old_head = head_.load(std::memory_order_acquire);
old_tail = tail_.load(std::memory_order_acquire);
next = old_head->next.load(std::memory_order_acquire);
if (old_head == head_.load(std::memory_order_acquire)) {
if (old_head == old_tail) {
if (next == nullptr) return false;
tail_.compare_exchange_weak(old_tail, next,
std::memory_order_release,
std::memory_order_relaxed);
} else {
value = next->data;
if (head_.compare_exchange_weak(old_head, next,
std::memory_order_release,
std::memory_order_relaxed)) {
delete old_head;
return true;
}
}
}
}
}
bool empty() const {
return head_.load(std::memory_order_acquire)->next.load(
std::memory_order_acquire) == nullptr;
}
private:
std::atomic<Node*> head_;
std::atomic<Node*> tail_;
};
Code explanation:
- Dummy node: Ensures head and tail always point to valid nodes.
- push: If tail->next is nullptr, link new node and move tail to new node. If another thread pushed first, advance tail to next node.
- pop: If head->next is data node, return its value and move head to next. If head == tail, queue is empty or tail hasn’t caught up yet.
SPSC (Single Producer Single Consumer) Queue
With exactly one producer and one consumer, a simpler and faster implementation is possible: no locks and no CAS, only ordering guarantees on plain loads and stores.
#include <atomic>
#include <vector>
template <typename T, size_t Capacity>
class SPSCQueue {
public:
SPSCQueue() : write_idx_(0), read_idx_(0) {
buffer_.resize(Capacity);
}
bool push(const T& value) {
size_t current = write_idx_.load(std::memory_order_relaxed);
size_t next = (current + 1) % Capacity;
if (next == read_idx_.load(std::memory_order_acquire))
return false; // full
buffer_[current] = value;
write_idx_.store(next, std::memory_order_release);
return true;
}
bool pop(T& value) {
size_t current = read_idx_.load(std::memory_order_relaxed);
if (current == write_idx_.load(std::memory_order_acquire))
return false; // empty
value = buffer_[current];
read_idx_.store((current + 1) % Capacity, std::memory_order_release);
return true;
}
private:
std::vector<T> buffer_;
std::atomic<size_t> write_idx_;
std::atomic<size_t> read_idx_;
};
Code explanation: In SPSC, producer and consumer each modify only one index, enabling synchronization with just load/store and memory ordering without CAS. release/acquire guarantee write-read ordering.
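One subtlety worth demonstrating: this full/empty scheme sacrifices one slot, so a ring with `Capacity` N holds at most N−1 elements. A compact copy of the same index logic (fixed array instead of vector; the `Ring` name is just for this sketch) makes that checkable:

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Same index scheme as the SPSCQueue above. "Full" is detected when the
// write index would catch up to the read index, so one slot always stays
// unused: capacity 4 holds at most 3 elements.
template <typename T, std::size_t Capacity>
struct Ring {
    std::array<T, Capacity> buf{};
    std::atomic<std::size_t> w{0}, r{0};

    bool push(const T& v) {
        std::size_t cur = w.load(std::memory_order_relaxed);
        std::size_t nxt = (cur + 1) % Capacity;
        if (nxt == r.load(std::memory_order_acquire)) return false;  // full
        buf[cur] = v;
        w.store(nxt, std::memory_order_release);
        return true;
    }
    bool pop(T& v) {
        std::size_t cur = r.load(std::memory_order_relaxed);
        if (cur == w.load(std::memory_order_acquire)) return false;  // empty
        v = buf[cur];
        r.store((cur + 1) % Capacity, std::memory_order_release);
        return true;
    }
};
```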
5. Memory Order (memory_order)
Order types
| memory_order | Meaning | Use cases |
|---|---|---|
| seq_cst | Sequential consistency (strongest) | Default, debugging |
| acquire | Reads/writes after this operation won’t reorder before it | load, CAS success |
| release | Reads/writes before this operation won’t reorder after it | store, CAS success |
| acq_rel | acquire + release | CAS (read-modify-write) |
| relaxed | No ordering guarantee, only atomicity | Counters, etc. |
Release-Acquire Pair
// Thread A (producer)
data = 42;
ready.store(true, std::memory_order_release); // the data write above cannot reorder after this store
// Thread B (consumer)
while (!ready.load(std::memory_order_acquire))
;
std::cout << data; // Guaranteed to see 42
A synchronizes-with relationship forms between the thread writing with release and the thread reading with acquire: the data write is ordered before the ready store, so once B observes ready == true, it is guaranteed to also observe data == 42.
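A runnable version of this hand-off, with a plain (non-atomic) `int payload` guarded only by the flag's ordering:

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // not atomic: ordered via the flag below
std::atomic<bool> ready{false};

// Producer publishes payload, then raises the flag with release; the
// consumer spins with acquire, so it must observe payload == 42.
int run_handoff() {
    int seen = 0;
    std::thread producer([] {
        payload = 42;
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&seen] {
        while (!ready.load(std::memory_order_acquire))
            ;  // spin until published
        seen = payload;
    });
    producer.join();
    consumer.join();
    return seen;
}
```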
Recommendations for Lock-Free
// push: release — head/tail shouldn't update before new node contents visible to other threads
head_.compare_exchange_weak(..., std::memory_order_release, ...);
// pop: acquire — must read node contents after reading head
head_.compare_exchange_weak(..., std::memory_order_acquire, ...);
6. ABA Problem and Solutions
What is the ABA problem?
Thread A reads head as P. Meanwhile, thread B pops P and pushes P back. Head is still P, but P->next might differ from before. When A attempts CAS with “expected = P”, it succeeds, but P->next might already be changed, causing logical errors.
sequenceDiagram
participant A as Thread A
participant B as Thread B
participant Q as Queue
A->>Q: Read head = P
B->>Q: pop() → Remove P
B->>Q: push(P) → Re-add P
A->>Q: CAS(P, P->next) → Success!
Note over A: P->next might already be changed
Solution 1: ABA Counter (Tagged Pointer)
Put a version counter in unused pointer bits so that CAS fails when the address matches but the "generation" differs. On x86-64, where canonical pointers occupy only 48 bits, the upper 16 bits can hold the tag.
#include <atomic>
#include <cstdint>
struct TaggedPtr {
uintptr_t ptr : 48; // low 48 bits: the pointer (x86-64 canonical address)
uintptr_t tag : 16; // high 16 bits: ABA-prevention counter
};
std::atomic<uintptr_t> head_{0};
void push(Node* new_node) {
uintptr_t old_head = head_.load(std::memory_order_relaxed);
TaggedPtr old_tp;
old_tp.ptr = old_head & 0x0000FFFFFFFFFFFF;
old_tp.tag = old_head >> 48;
new_node->next = reinterpret_cast<Node*>(old_tp.ptr);
TaggedPtr new_tp;
new_tp.ptr = reinterpret_cast<uintptr_t>(new_node);
new_tp.tag = old_tp.tag + 1;
uintptr_t new_val = (static_cast<uintptr_t>(new_tp.tag) << 48) | new_tp.ptr;
while (!head_.compare_exchange_weak(old_head, new_val,
std::memory_order_release,
std::memory_order_relaxed)) {
old_tp.ptr = old_head & 0x0000FFFFFFFFFFFF;
old_tp.tag = old_head >> 48;
new_node->next = reinterpret_cast<Node*>(old_tp.ptr);
new_tp.ptr = reinterpret_cast<uintptr_t>(new_node);
new_tp.tag = old_tp.tag + 1;
new_val = (static_cast<uintptr_t>(new_tp.tag) << 48) | new_tp.ptr;
}
}
Solution 2: Hazard Pointer
Don't delete nodes immediately; instead, track "which threads are currently using this pointer". A pointer still in use is never freed or reused, which prevents ABA.
// Hazard pointer overview (simplified): publish the pointer you are about
// to dereference, revalidate, and only then use it. In a real implementation
// hazard_ptr is a slot in a shared registry that other threads can scan.
thread_local Node* hazard_ptr = nullptr;
bool pop(T& value) {
    Node* old_head;
    while (true) {
        old_head = head_.load(std::memory_order_acquire);
        if (!old_head) return false;
        hazard_ptr = old_head;                  // declare this thread is using old_head
        if (head_.load() != old_head) continue; // revalidate after publishing
        Node* next = old_head->next.load();     // safe: old_head cannot be freed now
        if (head_.compare_exchange_weak(old_head, next,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed))
            break;
    }
    value = old_head->data;
    hazard_ptr = nullptr;
    // Put old_head on a retire list; delete only when no thread's hazard_ptr holds it
    retire(old_head);
    return true;
}
Solution 3: RCU (Read-Copy-Update)
Reads are lock-free; writers create a copy, modify it, and atomically swap the pointer. The old version is reclaimed only after all readers holding it have finished.
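A minimal RCU-style sketch using `shared_ptr` with the C++17 `std::atomic_load`/`std::atomic_store` free functions (C++20 offers `std::atomic<std::shared_ptr<T>>` instead); `ConfigMap` and the function names are illustrative:

```cpp
#include <atomic>
#include <map>
#include <memory>
#include <string>

using ConfigMap = std::map<std::string, int>;
std::shared_ptr<const ConfigMap> config_ = std::make_shared<const ConfigMap>();

// Reader: lock-free snapshot. The snapshot stays valid even if a
// writer swaps in a new version meanwhile.
std::shared_ptr<const ConfigMap> read_config() {
    return std::atomic_load(&config_);
}

// Writer: copy, modify the copy, publish with an atomic pointer swap.
// The old version is freed once the last reader releases its snapshot.
void update_config(const std::string& key, int value) {
    auto cur = std::atomic_load(&config_);
    auto copy = std::make_shared<ConfigMap>(*cur);
    (*copy)[key] = value;
    std::atomic_store(&config_,
                      std::shared_ptr<const ConfigMap>(std::move(copy)));
}
```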
7. Common Errors and Solutions
Issue 1: Separating load + store instead of CAS
Symptom: Occasional lost updates and logic errors
Cause: Splitting "read, check, write" into separate steps lets another thread change the value in between.
// ❌ Wrong
void set_if_zero() {
if (flag.load() == 0) {
flag.store(1); // Another thread might have already set to 1
}
}
// ✅ Correct
void set_if_zero() {
int expected = 0;
flag.compare_exchange_strong(expected, 1);
}
Issue 2: Over-relaxing memory order
Symptom: Very rare incorrect value reads, hard-to-reproduce bugs
Cause: Using only memory_order_relaxed lets the compiler/CPU reorder operations, so a reader may observe the flag before the data it guards.
// ❌ Wrong: using only relaxed
ready.store(true, std::memory_order_relaxed);
// Consumer might see ready before data (theoretically)
// ✅ Correct: release/acquire pair
ready.store(true, std::memory_order_release);
// Consumer
while (!ready.load(std::memory_order_acquire))
;
int x = data; // data write definitely completed before
Issue 3: Immediately deleting popped node (other threads still referencing)
Symptom: use-after-free, crashes
Cause: Deleting a popped node immediately can race with another thread that is still reading it (ABA or a delayed read).
// ❌ Dangerous: delete immediately after pop
Node* old_head = ...;
head_.compare_exchange_weak(old_head, old_head->next);
delete old_head; // Another thread might still reference old_head
// ✅ Solution: delay reuse with hazard pointer, RCU, or ABA counter
retire(old_head); // Delete later when safe
Issue 4: Fighting compare_exchange's by-reference update of expected
Symptom: Infinite loop, CAS never succeeds
Cause: compare_exchange_weak always takes expected by reference and, on failure, writes the current value back into it. Resetting expected manually (or hiding the CAS behind a helper that takes expected by value) discards that update, so the loop keeps retrying with the same stale value.
// ❌ Wrong: resetting expected discards the refreshed value
int expected = 0;
while (!atomic_val.compare_exchange_weak(expected, 1)) {
    expected = 0; // spins forever once atomic_val != 0
}
// ✅ Correct: let the failed CAS refresh expected and compute from it
int expected = atomic_val.load();
while (!atomic_val.compare_exchange_weak(expected, expected + 1)) {
    // expected now holds the current value; the next attempt uses it
}
Issue 5: False Sharing
Symptom: Slower than expected when using multiple atomic variables
Cause: Atomics sharing a cache line make one thread's write invalidate that line in every other core's cache.
// ❌ Wrong
struct Counters {
std::atomic<int> a;
std::atomic<int> b; // On same cache line as a
};
// ✅ Correct (cache line padding)
struct alignas(64) Counters {
std::atomic<int> a;
char padding1[64 - sizeof(std::atomic<int>)];
std::atomic<int> b;
char padding2[64 - sizeof(std::atomic<int>)];
};
Issue 6: "Just using atomics" doesn't make it lock-free
Symptom: Code assumed to be lock-free deadlocks or livelocks
Cause: With multiple atomics, each operation is atomic but the overall algorithm may not be. Verify that the lock-free definition holds: at least one thread always makes progress.
Issue 7: Checking empty() before pop()
Symptom: empty() returned false but pop fails, or vice versa
Cause: In lock-free code, other threads can intervene between the empty() check and the pop().
// ❌ Wrong usage
if (!queue.empty()) {
T value;
queue.pop(value); // Another thread might have popped in between, now empty
}
// ✅ Correct usage
T value;
if (queue.pop(value)) {
// Successfully popped
}
8. Best Practices and Cautions
1. Use atomics only for simple cases
Counters, flags, one-time initialization—fetch_add, compare_exchange_strong alone suffice. No need for complex data structures.
// Counter: fetch_add
std::atomic<uint64_t> count_{0};
count_.fetch_add(delta, std::memory_order_relaxed);
// Flag: compare_exchange
std::atomic<bool> done_{false};
bool expected = false;
if (done_.compare_exchange_strong(expected, true)) {
// Only this thread performs initialization
}
2. Prefer battle-tested libraries
Rather than implementing lock-free queues/stacks yourself, use proven libraries like Folly, Boost.Lockfree, libcds for safety.
3. Profile before adopting
“Lock-free is faster” is general wisdom, but actual bottleneck might not be mutex. Verify with profiling before adoption.
4. Minimize memory order
seq_cst (default) provides strongest guarantee but highest cost. Use minimum necessary release/acquire.
5. Consider ABA and memory reclamation
For pointer-based lock-free structures, always consider ABA problem and “delete immediately after pop”. Apply one of: tagged pointers, hazard pointers, RCU.
6. Testing and verification
Lock-free code has many hard-to-reproduce bugs. Use Thread Sanitizer (TSan), stress tests, formal verification tools.
# Check data races with TSan
g++ -fsanitize=thread -g -O2 -std=c++17 -pthread -o test test.cpp
./test
9. Performance Benchmarks
Test environment (example)
- CPU: Intel Core i7-12700 (12 cores)
- Compile: see command below
g++ -O3 -std=c++17 -pthread -o lockfree_bench lockfree_bench.cpp
- Threads: 4 producers, 4 consumers
- Operations: 1,000,000 pushes + 1,000,000 pops
Benchmark results (example)
| Data structure | Total time (ms) | Ops/sec | Relative speed |
|---|---|---|---|
| std::queue + mutex | 180 | 5.5M | 1.0x |
| Lock-Free queue (MPMC) | 72 | 13.9M | 2.5x |
| SPSC queue (ring buffer) | 28 | 35.7M | 6.4x |
Benchmark code example
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>
template <typename Queue>
void benchmark(Queue& q, int num_producers, int num_consumers, int ops_per_thread) {
std::vector<std::thread> producers, consumers;
std::atomic<int> consumed{0};
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < num_producers; ++i) {
producers.emplace_back([&, i]() {
for (int j = 0; j < ops_per_thread; ++j) {
q.push(i * ops_per_thread + j);
}
});
}
for (int i = 0; i < num_consumers; ++i) {
    consumers.emplace_back([&]() {
        int val;
        // Stop once every produced item has been consumed (across all consumers)
        while (consumed.load(std::memory_order_relaxed) <
               num_producers * ops_per_thread) {
            if (q.pop(val))
                consumed.fetch_add(1, std::memory_order_relaxed);
        }
    });
}
for (auto& t : producers) t.join();
for (auto& t : consumers) t.join();
auto end = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(end - start).count();
int total_ops = num_producers * ops_per_thread * 2; // push + pop
std::cout << "Time: " << ms << " ms, Ops/sec: " << (total_ops / (ms / 1000.0)) << "\n";
}
Counter benchmark (atomic vs mutex)
| Method | 8 threads 1M increments (ms) | Relative speed |
|---|---|---|
| mutex | 45 | 1.0x |
| atomic fetch_add | 8 | 5.6x |
// Atomic counter
std::atomic<int> counter{0};
for (int i = 0; i < 1'000'000; ++i) counter.fetch_add(1, std::memory_order_relaxed);
10. Production Patterns
Pattern 1: Use existing libraries
Safer to use proven libraries than implementing yourself.
- Folly (Facebook): folly::MPMCQueue, folly::ProducerConsumerQueue
- Boost.Lockfree: boost::lockfree::queue, boost::lockfree::stack
- libcds: various lock-free data structures
#include <boost/lockfree/queue.hpp>
boost::lockfree::queue<int> q(128);
void producer() {
q.push(42);
}
void consumer() {
int value;
if (q.pop(value)) {
// Process
}
}
Pattern 2: Bounded vs Unbounded
- Bounded: Fixed-size buffer. Predictable memory; fails or blocks when full.
- Unbounded: Dynamic allocation. Memory usage varies; OOM is possible.
Production systems mostly use bounded queues, applying retry or backpressure when full.
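A hedged sketch of the backpressure side. `StubQueue` is a stand-in for any bounded queue whose `push` returns false when full (it is not lock-free itself; it only models the interface):

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in bounded queue for illustration only.
struct StubQueue {
    std::vector<int> items;
    std::size_t cap = 2;
    bool push(int v) {
        if (items.size() >= cap) return false;  // full → caller backs off
        items.push_back(v);
        return true;
    }
};

// Retry with exponential backoff; gives up after max_retries so the
// caller can shed load or fall back to a blocking path.
template <typename Queue, typename T>
bool push_with_backpressure(Queue& q, const T& item, int max_retries = 8) {
    auto delay = std::chrono::microseconds(1);
    for (int attempt = 0; attempt < max_retries; ++attempt) {
        if (q.push(item)) return true;
        std::this_thread::sleep_for(delay);  // let consumers drain
        delay *= 2;
    }
    return false;
}
```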
Pattern 3: Batch processing to amortize CAS cost
Inserting or removing several items with a single CAS amortizes its cost.
// Batch push: build the chain locally, then splice it in with one CAS
void push_batch(const std::vector<T>& items) {
    if (items.empty()) return;
    Node* first = new Node(items[0]);
    Node* last = first;
    for (size_t i = 1; i < items.size(); ++i) {
        last->next = new Node(items[i]);
        last = last->next;
    }
    // Splice: CAS tail->next from nullptr to first (same retry structure
    // as the single-item push), then swing tail_ to last
    // ...
}
Pattern 4: Thread-local buffer + periodic flush
Each thread accumulates entries in a local buffer and flushes them to the lock-free queue when full or periodically, reducing the number of CAS operations.
thread_local std::vector<LogEntry> local_buffer;
constexpr size_t FLUSH_THRESHOLD = 64;
void log(LogEntry e) {
local_buffer.push_back(e);
if (local_buffer.size() >= FLUSH_THRESHOLD) {
for (auto& entry : local_buffer) {
global_queue.push(entry);
}
local_buffer.clear();
}
}
Pattern 5: Shutdown flag and drain
On shutdown, stop producers first; consumers then drain the queue until it is empty.
std::atomic<bool> shutdown_{false};
void producer() {
while (!shutdown_.load(std::memory_order_acquire)) {
produce_and_push();
}
}
void consumer() {
T value;
while (true) {
if (queue.pop(value)) {
process(value);
} else if (shutdown_.load(std::memory_order_acquire)) {
break;
}
}
}
Pattern 6: Metrics collection
Even in lock-free structures, collect CAS failure counts, wait times, etc. for monitoring. Update statistics with fetch_add.
std::atomic<uint64_t> cas_failures{0};
// Inside CAS loop
while (!head_.compare_exchange_weak(...)) {
cas_failures.fetch_add(1, std::memory_order_relaxed);
}
11. Practical Examples
Example 1: Lock-Free Counter (Multi-thread Aggregation)
#include <atomic>
#include <thread>
#include <vector>
class LockFreeCounter {
public:
void add(uint64_t delta) {
count_.fetch_add(delta, std::memory_order_relaxed);
}
uint64_t get() const {
return count_.load(std::memory_order_acquire);
}
private:
std::atomic<uint64_t> count_{0};
};
// Usage: multiple threads aggregate request counts, byte counts, etc.
Example 2: Lock-Free Flag (Run Once)
#include <atomic>
std::atomic<bool> initialized{false};
void init_once() {
if (!initialized.load(std::memory_order_acquire)) {
bool expected = false;
if (initialized.compare_exchange_strong(expected, true,
std::memory_order_acq_rel)) {
// Only this thread performs initialization
do_actual_init();
} else {
// Another thread initializing. Wait or skip
}
}
}
Example 3: Lock-Free Cache Pointer (Read Optimization)
#include <atomic>
#include <memory>
template <typename T>
class Cache {
public:
std::shared_ptr<T> get() {
return std::atomic_load(&cached_); // lock-free read (C++17 free function; C++20 prefers std::atomic<std::shared_ptr<T>>)
}
void update(std::shared_ptr<T> new_val) {
std::atomic_store(&cached_, new_val); // Atomic swap
}
private:
std::shared_ptr<T> cached_;
};
Example 4: Semaphore Alternative (Simple Counting)
다음은 cpp를 활용한 상세한 구현 코드입니다. 필요한 모듈을 import하고, 클래스를 정의하여 데이터와 기능을 캡슐화하며, 반복문으로 데이터를 처리합니다. 각 부분의 역할을 이해하면서 코드를 살펴보시기 바랍니다.
#include <atomic>
class LockFreeSemaphore {
public:
explicit LockFreeSemaphore(int initial) : count_(initial) {}
void acquire() {
int expected;
do {
expected = count_.load(std::memory_order_acquire);
while (expected == 0) {
expected = count_.load(std::memory_order_acquire);
}
} while (!count_.compare_exchange_weak(expected, expected - 1,
std::memory_order_acq_rel));
}
void release() {
count_.fetch_add(1, std::memory_order_release);
}
private:
std::atomic<int> count_;
};
References
- C++ std::atomic cppreference
- Folly MPMCQueue
- Boost.Lockfree
- Anthony Williams, C++ Concurrency in Action
- Intel, Lock-Free Programming
Summary
| Item | Description |
|---|---|
| Lock-Free | Guarantees at least one thread progresses without locks |
| CAS | Conditional atomic update with compare_exchange_strong/weak |
| Stack | Only head CAS. Both push/pop use CAS |
| Queue | head, tail CAS. Dummy node maintains consistency |
| ABA | Mitigate with tagged pointers, hazard pointers, RCU |
| Memory order | Synchronize with release/acquire |
| Errors | Separating load+store, overusing relaxed, immediate deletion |
| Production | Proven libraries, bounded, batch processing |
Core principles:
- Simple cases (counters, flags): std::atomic fetch_add or a single CAS suffices
- Complex data structures: prefer proven libraries
- Direct implementation: always consider ABA, memory reclamation, memory order
- Verify actual environment gains over mutex with benchmarks
Implementation Checklist
- Pass expected by reference when using CAS
- Apply appropriate memory_order (release/acquire) to push/pop
- Don’t immediately delete popped nodes (consider hazard pointers, etc.)
- Apply tagged pointers or hazard pointers if ABA possible
- Prevent false sharing (cache line alignment)
- Judge by pop() return value instead of empty()
- Production: review proven libraries like Folly and Boost.Lockfree
One-line summary: With CAS and memory order understood, and ABA and memory reclamation accounted for, lock-free data structures can be used safely.
Previous: C++ Cache Hit Optimization with Memory Alignment and Padding #34-2
Frequently Asked Questions (FAQ)
Q. When do I use this in production?
A. When profiling shows mutex contention on a hot path: std::atomic counters and flags, compare_exchange, lock-free queues/stacks, together with the ABA, memory order, and hazard pointer considerations covered above. Apply by referring to the examples and selection guides in the main text.
Q. What should I read first?
A. Follow Previous links at bottom of each article to learn in order. Check C++ Series Index for complete flow.
Q. How to study deeper?
A. Refer to cppreference and official library documentation. Also utilize reference links at end of article. Next: Python meets C++: pybind11 #35-1
Related Articles
- C++ atomic | Building Thread-Safe Counters Without Mutex (with memory_order)
- C++ Data Race | “When to Use Atomic Instead of Mutex?” Common Interview Question
- C++ Cache Hit Optimization with Memory Alignment and Padding | Solving False Sharing