[2026] C++ Lock-Free Programming: CAS, ABA, Memory Order, High-Performance Queues [#34-3]

Key Takeaways

Complete guide to lock-free programming: replace mutex bottlenecks with atomics and lock-free algorithms. Covers CAS loops, memory_order, ABA hazards, Michael-Scott queue, hazard pointers, and production benchmarks.

Introduction: Mutex Contention Bottleneck

“Processing 1M requests/sec but lock contention is killing us”

In a high-performance server, multiple worker threads fetched tasks from a shared work queue. With the queue protected by std::mutex, adding threads barely improved throughput: lock wait time dominated overall latency. Switching to a lock-free queue on the same hardware increased throughput 2-3x. Lock-based code must grab the mutex for every queue push/pop, so more threads mean longer waits, and while one thread holds the lock, every other thread blocks. Lock-free approaches instead synchronize with compare-and-swap (CAS) atomic operations, letting threads make progress without waiting on each other. Slow code (mutex):

// Mutex-protected queue - lock contention
std::queue<Task> queue;
std::mutex mtx;
void push(const Task& t) {
    std::lock_guard<std::mutex> lock(mtx);
    queue.push(t);
}
bool pop(Task& t) {
    std::lock_guard<std::mutex> lock(mtx);
    if (queue.empty()) return false;
    t = queue.front();
    queue.pop();
    return true;
}

Fast code (lock-free):

// Lock-free queue - CAS synchronization without locks
template <typename T>
class LockFreeQueue {
    struct Node {
        T data;
        std::atomic<Node*> next;
    };
    std::atomic<Node*> head;
    std::atomic<Node*> tail;
    // push/pop use CAS (see examples below)
};

Cause: A mutex allows only one thread to access the queue at a time; lock-free lets multiple threads attempt CAS simultaneously and, on contention, retry without blocking. What you’ll learn:

  • Understand lock-free concepts and CAS (compare-and-swap) operations
  • Implement lock-free stacks and queues with complete code
  • Avoid common mistakes: ABA problems, memory ordering, memory reclamation
  • Verify performance gains over mutex with benchmarks
  • Apply production patterns: hazard pointers, RCU, etc.

Conceptual Analogy

Mutex is like standing in a single line at a bank counter—waiting for your turn. Lock-free CAS is like a ticket machine that “only changes the displayed number if it matches what you saw”—one thread succeeds at a time, but no line forms. More CPU retries, but less latency variance from blocking.

Problem Scenarios

Scenario 1: High-Frequency Log Collector

Dozens of threads write logs to a buffer simultaneously. Mutex protection requires grabbing a lock for every log write, causing high latency during high-frequency logging. Lock-free ring buffers or MPSC (multi-producer single-consumer) queues enable lock-free log accumulation, with a separate thread batch-flushing.

Scenario 2: Real-Time Trading Order Book

Orders arrive/cancel in milliseconds. Protecting the order book with mutex causes order processing delays from lock waits, potentially disadvantageous during market price fluctuations. Lock-free data structures for order books reduce latency jitter.

Scenario 3: Game Engine Task Scheduler

Hundreds of tasks distributed to multiple threads every frame. Mutex-based task queues cause all workers to wait for queue locks at frame start, risking frame drops. Lock-free task queues (including work stealing) enable workers to fetch tasks without blocking.

Scenario 4: Network Packet Buffer

Receive thread puts packets in buffer, processing thread removes them. Single mutex protection can cause buffer overflow when packet receive rate is high while processing thread waits for lock. Lock-free SPSC (single producer single consumer) queues enable nearly independent receive and processing.

Scenario 5: Statistics Counter Aggregation

Multiple threads increment request counts, byte counts, etc. Mutex-protecting counters causes severe contention during high-frequency updates. std::atomic::fetch_add enables lock-free safe aggregation.

Scenario 6: Cache Invalidation/Update

Multiple threads read a shared cache pointer; updates swap in a new pointer. Protecting the whole cache with a mutex blocks even the read path. An std::atomic pointer updated with CAS gives a "lock-free read, CAS write" pattern that keeps reads nearly block-free.

Table of Contents

  1. Basic Concepts
  2. Atomic Operations and CAS
  3. Lock-Free Stack Complete Example
  4. Lock-Free Queue Complete Example
  5. Memory Order (memory_order)
  6. ABA Problem and Solutions
  7. Common Errors and Solutions
  8. Best Practices and Cautions
  9. Performance Benchmarks
  10. Production Patterns
  11. Practical Examples

1. Basic Concepts

What is Lock-Free?

Lock-free algorithms guarantee that, without using locks (mutexes, spinlocks, etc.), even when multiple threads access a data structure simultaneously, at least one thread always makes progress. There are no deadlocks and no priority inversion from lock waits. The flow difference is sketched below:

flowchart TB
  subgraph lock_based[Lock-Based]
    L1[Thread A: Acquire lock] --> L2[Thread B: Wait]
    L2 --> L3[Thread C: Wait]
    L3 --> L4[After A completes, B, C proceed sequentially]
  end
  subgraph lock_free[Lock-Free]
    F1[Thread A: CAS attempt] --> F2[Success → Progress]
    F3[Thread B: CAS attempt] --> F4[Failure → Retry]
    F4 --> F5[Progress on next CAS]
  end

Analogy: Lock-based is “only person with bathroom key can enter, others wait”. Lock-free is “multiple people try opening door simultaneously, first one enters, others retry”.

Lock-Free vs Wait-Free

| Distinction | Lock-Free | Wait-Free |
| --- | --- | --- |
| Definition | At least one thread progresses | All threads complete in finite steps |
| Retry | Retries on CAS failure | No retry; completion guaranteed |
| Implementation difficulty | Relatively low | Very high |
| Production use | Queues, stacks, counters, etc. | Limited (atomic fetch_add, etc.) |

Most lock-free data structures are lock-free, not wait-free; the CAS retry loop is the common pattern.

2. Atomic Operations and CAS

Basic std::atomic usage


#include <atomic>
#include <iostream>
#include <thread>
// Copy-paste and run: g++ -std=c++17 -pthread -o atomic_basic atomic_basic.cpp && ./atomic_basic
int main() {
    std::atomic<int> counter{0};
    std::thread t1([&]() {
        for (int i = 0; i < 100000; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&]() {
        for (int i = 0; i < 100000; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    std::cout << "counter = " << counter.load() << "\n";  // 200000
    return 0;
}

Explanation: fetch_add atomically increments value, preventing data races. memory_order_relaxed is used when ordering isn’t needed (counters, etc.).

Compare-And-Swap (CAS)

CAS is an atomic operation: "if the value at this memory location equals the expected value, replace it with the new value; otherwise leave it unchanged". In C++, std::atomic::compare_exchange_strong / compare_exchange_weak implement CAS.

#include <atomic>
std::atomic<int> value{0};
// "If 0, set to 1" — only one thread succeeds
bool try_claim() {
    int expected = 0;
    return value.compare_exchange_strong(expected, 1);
}
// Lock-free increment (actually fetch_add is better)
void cas_increment() {
    int old_val = value.load(std::memory_order_relaxed);
    while (!value.compare_exchange_weak(old_val, old_val + 1,
                                        std::memory_order_release,
                                        std::memory_order_relaxed)) {
        // On failure, old_val automatically updates to current value
    }
}

compare_exchange_weak vs strong:

  • weak: Possible spurious failure on ARM, PowerPC, etc. More efficient in loops.
  • strong: No spurious failure. Suitable for single attempts.

CAS operation flow


flowchart TD
    A[CAS attempt] --> B{Current value == expected?}
    B -->|Yes| C[Replace with desired]
    C --> D[Return true]
    B -->|No| E[Update expected to current value]
    E --> F[Return false]
    F --> G[Retry in loop]

Key atomic operations summary


#include <atomic>
std::atomic<int> a{0};
// Read/write
a.load(std::memory_order_acquire);
a.store(42, std::memory_order_release);
// Arithmetic
a.fetch_add(1);   // a++
a.fetch_sub(1);   // a--
a.fetch_and(0xFF);
a.fetch_or(1);
// Exchange
a.exchange(100);  // Returns previous value
// Conditional exchange (CAS)
int expected = 0;
a.compare_exchange_strong(expected, 1);  // expected updated by reference
a.compare_exchange_weak(expected, 1);

3. Lock-Free Stack Complete Example

Singly-Linked List Stack

A stack needs CAS only on the head pointer: push links the new node in front of head; pop advances head to the next node.

#include <atomic>
#include <memory>
template <typename T>
class LockFreeStack {
public:
    struct Node {
        T data;
        Node* next;
        explicit Node(const T& d) : data(d), next(nullptr) {}
    };
    LockFreeStack() : head_(nullptr) {}
    void push(const T& value) {
        Node* new_node = new Node(value);
        new_node->next = head_.load(std::memory_order_relaxed);
        while (!head_.compare_exchange_weak(new_node->next, new_node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
            // CAS failure updates new_node->next to current head
        }
    }
    bool pop(T& value) {
        Node* old_head = head_.load(std::memory_order_acquire);
        while (old_head &&
               !head_.compare_exchange_weak(old_head, old_head->next,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed)) {
            // CAS failure updates old_head to current head
        }
        if (!old_head) return false;
        value = old_head->data;
        delete old_head;  // Production needs hazard pointers, etc. (see ABA below)
        return true;
    }
    bool empty() const {
        return head_.load(std::memory_order_acquire) == nullptr;
    }
private:
    std::atomic<Node*> head_;
};

Code explanation:

  • push: Set new node’s next to current head, CAS head to new node only when head equals next. If another thread pushes, CAS fails and new_node->next updates to new head, so retry.
  • pop: Read head, CAS head to old_head->next only when head equals old_head.
  • Caution: Above implementation has ABA and memory reclamation issues. Production needs hazard pointers, etc. (see below).

Memory order explanation


// push: release — writes before this CAS visible to other threads
head_.compare_exchange_weak(..., std::memory_order_release, ...);
// pop: acquire — can read values written by other threads before this CAS
head_.compare_exchange_weak(..., std::memory_order_acquire, ...);

4. Lock-Free Queue Complete Example

Michael-Scott Lock-Free Queue (MPMC)

The classic lock-free queue uses two atomic pointers (head, tail). A dummy node lets empty and non-empty queues be handled uniformly.

#include <atomic>
#include <memory>
template <typename T>
class LockFreeQueue {
public:
    struct Node {
        std::atomic<Node*> next;
        T data;
        Node() : next(nullptr), data() {}
        explicit Node(const T& d) : next(nullptr), data(d) {}
    };
    LockFreeQueue() {
        Node* dummy = new Node();
        head_.store(dummy);
        tail_.store(dummy);
    }
    ~LockFreeQueue() {
        while (Node* n = head_.load()) {
            head_.store(n->next.load());
            delete n;
        }
    }
    void push(const T& value) {
        Node* new_node = new Node(value);
        Node* old_tail = nullptr;
        Node* next = nullptr;
        while (true) {
            old_tail = tail_.load(std::memory_order_acquire);
            next = old_tail->next.load(std::memory_order_acquire);
            if (old_tail == tail_.load(std::memory_order_acquire)) {
                if (next == nullptr) {
                    if (old_tail->next.compare_exchange_weak(next, new_node,
                                                            std::memory_order_release,
                                                            std::memory_order_relaxed)) {
                        tail_.compare_exchange_weak(old_tail, new_node,
                                                    std::memory_order_release,
                                                    std::memory_order_relaxed);
                        return;
                    }
                } else {
                    tail_.compare_exchange_weak(old_tail, next,
                                                std::memory_order_release,
                                                std::memory_order_relaxed);
                }
            }
        }
    }
    bool pop(T& value) {
        Node* old_head = nullptr;
        Node* old_tail = nullptr;
        Node* next = nullptr;
        while (true) {
            old_head = head_.load(std::memory_order_acquire);
            old_tail = tail_.load(std::memory_order_acquire);
            next = old_head->next.load(std::memory_order_acquire);
            if (old_head == head_.load(std::memory_order_acquire)) {
                if (old_head == old_tail) {
                    if (next == nullptr) return false;
                    tail_.compare_exchange_weak(old_tail, next,
                                                std::memory_order_release,
                                                std::memory_order_relaxed);
                } else {
                    value = next->data;
                    if (head_.compare_exchange_weak(old_head, next,
                                                   std::memory_order_release,
                                                   std::memory_order_relaxed)) {
                        delete old_head;
                        return true;
                    }
                }
            }
        }
    }
    bool empty() const {
        return head_.load(std::memory_order_acquire)->next.load(
                   std::memory_order_acquire) == nullptr;
    }
private:
    std::atomic<Node*> head_;
    std::atomic<Node*> tail_;
};

Code explanation:

  • Dummy node: Ensures head and tail always point to valid nodes.
  • push: If tail->next is nullptr, link new node and move tail to new node. If another thread pushed first, advance tail to next node.
  • pop: If head->next is data node, return its value and move head to next. If head == tail, queue is empty or tail hasn’t caught up yet.

SPSC (Single Producer Single Consumer) Queue

With exactly one producer and one consumer, a much simpler and faster implementation is possible: no locks, no CAS, just loads and stores with ordering guarantees.

#include <atomic>
#include <vector>
template <typename T, size_t Capacity>
class SPSCQueue {
public:
    SPSCQueue() : write_idx_(0), read_idx_(0) {
        buffer_.resize(Capacity);
    }
    bool push(const T& value) {
        size_t current = write_idx_.load(std::memory_order_relaxed);
        size_t next = (current + 1) % Capacity;
        if (next == read_idx_.load(std::memory_order_acquire))
            return false;  // full
        buffer_[current] = value;
        write_idx_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& value) {
        size_t current = read_idx_.load(std::memory_order_relaxed);
        if (current == write_idx_.load(std::memory_order_acquire))
            return false;  // empty
        value = buffer_[current];
        read_idx_.store((current + 1) % Capacity, std::memory_order_release);
        return true;
    }
private:
    std::vector<T> buffer_;
    std::atomic<size_t> write_idx_;
    std::atomic<size_t> read_idx_;
};

Code explanation: In SPSC, producer and consumer each modify only one index, enabling synchronization with just load/store and memory ordering without CAS. release/acquire guarantee write-read ordering.

5. Memory Order (memory_order)

Order types

| memory_order | Meaning | Use cases |
| --- | --- | --- |
| seq_cst | Sequential consistency (strongest) | Default, debugging |
| acquire | Later reads/writes won't be reordered before this operation | load, CAS success |
| release | Earlier reads/writes won't be reordered after this operation | store, CAS success |
| acq_rel | acquire + release | CAS (read-modify-write) |
| relaxed | No ordering guarantee, atomicity only | Counters, etc. |

Release-Acquire Pair


// Thread A (producer)
data = 42;
ready.store(true, std::memory_order_release);  // Guarantees data write completes before
// Thread B (consumer)
while (!ready.load(std::memory_order_acquire))
    ;
std::cout << data;  // Guaranteed to see 42

A synchronizes-with relationship forms between the thread that stores with release and the thread that loads the stored value with acquire: everything written before the release store is visible after the acquire load, so once B sees ready == true, it is also guaranteed to see data == 42.

Recommendations for Lock-Free


// push: release — head/tail shouldn't update before new node contents visible to other threads
head_.compare_exchange_weak(..., std::memory_order_release, ...);
// pop: acquire — must read node contents after reading head
head_.compare_exchange_weak(..., std::memory_order_acquire, ...);

6. ABA Problem and Solutions

What is the ABA problem?

Thread A reads head as P. Meanwhile, thread B pops P and pushes P back (or pushes a new node that reuses P's address). head is still P, so A's CAS with "expected = P" succeeds, but P->next may no longer be the value A originally read, causing logical corruption. The interleaving:

sequenceDiagram
    participant A as Thread A
    participant B as Thread B
    participant Q as Queue
    A->>Q: Read head = P
    B->>Q: pop() → Remove P
    B->>Q: push(P) → Re-add P
    A->>Q: CAS(P, P->next) → Success!
    Note over A: P->next might already be changed

Solution 1: ABA Counter (Tagged Pointer)

Store a version counter alongside the pointer so that CAS fails when the address is the same but the "generation" differs. On x86-64, where canonical pointers effectively use 48 bits, the top 16 bits can hold the tag.

#include <atomic>
#include <cstdint>
struct TaggedPtr {
    uintptr_t ptr : 48;   // Pointer (lower 48 bits of the packed word)
    uintptr_t tag : 16;   // ABA prevention counter
};
std::atomic<uintptr_t> head_{0};
void push(Node* new_node) {
    uintptr_t old_head = head_.load(std::memory_order_relaxed);
    TaggedPtr old_tp;
    old_tp.ptr = old_head & 0x0000FFFFFFFFFFFF;
    old_tp.tag = old_head >> 48;
    new_node->next = reinterpret_cast<Node*>(old_tp.ptr);
    TaggedPtr new_tp;
    new_tp.ptr = reinterpret_cast<uintptr_t>(new_node);
    new_tp.tag = old_tp.tag + 1;
    uintptr_t new_val = (static_cast<uintptr_t>(new_tp.tag) << 48) | new_tp.ptr;
    while (!head_.compare_exchange_weak(old_head, new_val,
                                        std::memory_order_release,
                                        std::memory_order_relaxed)) {
        old_tp.ptr = old_head & 0x0000FFFFFFFFFFFF;
        old_tp.tag = old_head >> 48;
        new_node->next = reinterpret_cast<Node*>(old_tp.ptr);
        new_tp.ptr = reinterpret_cast<uintptr_t>(new_node);
        new_tp.tag = old_tp.tag + 1;
        new_val = (static_cast<uintptr_t>(new_tp.tag) << 48) | new_tp.ptr;
    }
}

Solution 2: Hazard Pointer

Don’t delete nodes immediately; instead, each thread publishes which pointer it is currently using. A node whose address appears in any thread's hazard pointer is never reclaimed or reused, which prevents ABA.

// Hazard Pointer overview (simplified)
thread_local Node* hazard_ptr = nullptr;
bool pop(T& value) {
    Node* old_head = head_.load(std::memory_order_acquire);
    do {
        Node* next = old_head ? old_head->next.load() : nullptr;
        hazard_ptr = old_head;  // Declare this thread is using old_head
        if (head_.load() != old_head) continue;  // Revalidate
    } while (!head_.compare_exchange_weak(old_head, next, ...));
    value = old_head->data;
    hazard_ptr = nullptr;
    // Put old_head in retire list, delete only when not in other threads' hazard_ptr
    retire(old_head);
    return true;
}

Solution 3: RCU (Read-Copy-Update)

Reads proceed without locks; writers create a copy, update it, then atomically swap the pointer. The old version is reclaimed only after all readers that started before the swap have finished.

7. Common Errors and Solutions

Issue 1: Separating load + store instead of CAS

Symptom: Occasional data loss, logic errors. Cause: Splitting "read, check, write" into separate steps lets other threads change the value in between.

// ❌ Wrong
void set_if_zero() {
    if (flag.load() == 0) {
        flag.store(1);  // Another thread might have already set to 1
    }
}
// ✅ Correct
void set_if_zero() {
    int expected = 0;
    flag.compare_exchange_strong(expected, 1);
}

Issue 2: Over-relaxing memory order

Symptom: Very rare incorrect reads; hard-to-reproduce bugs. Cause: With only memory_order_relaxed, the compiler and CPU may reorder operations, so a reader can observe the flag before the writes it is supposed to guard.

// ❌ Wrong: using only relaxed
ready.store(true, std::memory_order_relaxed);
// Consumer might see ready before data (theoretically)
// ✅ Correct: release/acquire pair
ready.store(true, std::memory_order_release);
// Consumer
while (!ready.load(std::memory_order_acquire))
    ;
int x = data;  // data write definitely completed before

Issue 3: Immediately deleting popped node (other threads still referencing)

Symptom: Use-after-free, crashes. Cause: Calling delete on a popped node immediately can race with another thread that still holds a reference to it (ABA or a delayed read).

// ❌ Dangerous: delete immediately after pop
Node* old_head = ...;
head_.compare_exchange_weak(old_head, old_head->next);
delete old_head;  // Another thread might still reference old_head
// ✅ Solution: delay reuse with hazard pointer, RCU, or ABA counter
retire(old_head);  // Delete later when safe

Issue 4: Discarding the updated expected value

Symptom: Infinite loop; CAS never succeeds. Cause: compare_exchange_weak writes the current value back into expected (passed by reference) on failure. Resetting expected each iteration, or copying it through a by-value wrapper, throws that update away, so the loop keeps retrying with a stale value.

// ❌ Wrong: expected is re-created every iteration and never learns the current value
while (true) {
    int expected = 0;
    if (atomic_val.compare_exchange_weak(expected, expected + 1)) break;
}
// ✅ Correct: keep expected across iterations; CAS refreshes it on failure
int expected = atomic_val.load();
while (!atomic_val.compare_exchange_weak(expected, expected + 1)) {
    // expected now holds the current value; retry with it
}

Issue 5: False Sharing

Symptom: Slower than expected when using multiple atomic variables. Cause: Atomics that share a cache line make every write by one thread invalidate that line in other cores' caches, even though the variables are logically independent.

// ❌ Wrong
struct Counters {
    std::atomic<int> a;
    std::atomic<int> b;  // On same cache line as a
};
// ✅ Correct (cache line padding)
struct alignas(64) Counters {
    std::atomic<int> a;
    char padding1[64 - sizeof(std::atomic<int>)];
    std::atomic<int> b;
    char padding2[64 - sizeof(std::atomic<int>)];
};

Issue 6: “Just using atomics” not lock-free

Symptom: Code assumed to be "lock-free" deadlocks or livelocks. Cause: With multiple atomics, each individual operation is atomic, but the algorithm as a whole may not be. Verify that the lock-free property (at least one thread always progresses) actually holds.

Issue 7: Judging by empty() then pop

Symptom: empty() says non-empty but pop fails, or vice versa. Cause: In a lock-free structure, other threads can intervene between the empty() check and the pop().

// ❌ Wrong usage
if (!queue.empty()) {
    T value;
    queue.pop(value);  // Another thread might have popped in between, now empty
}
// ✅ Correct usage
T value;
if (queue.pop(value)) {
    // Successfully popped
}

8. Best Practices and Cautions

1. Use atomics only for simple cases

Counters, flags, and one-time initialization need nothing more than fetch_add or compare_exchange_strong; don't reach for a complex data structure.

// Counter: fetch_add
std::atomic<uint64_t> count_{0};
count_.fetch_add(delta, std::memory_order_relaxed);
// Flag: compare_exchange
std::atomic<bool> done_{false};
bool expected = false;
if (done_.compare_exchange_strong(expected, true)) {
    // Only this thread performs initialization
}

2. Prefer battle-tested libraries

Rather than implementing lock-free queues/stacks yourself, use proven libraries like Folly, Boost.Lockfree, libcds for safety.

3. Profile before adopting

“Lock-free is faster” is general wisdom, but actual bottleneck might not be mutex. Verify with profiling before adoption.

4. Minimize memory order

seq_cst (default) provides strongest guarantee but highest cost. Use minimum necessary release/acquire.

5. Consider ABA and memory reclamation

For pointer-based lock-free structures, always account for the ABA problem and never delete a node immediately after pop. Apply one of: tagged pointers, hazard pointers, RCU.

6. Testing and verification

Lock-free code has many hard-to-reproduce bugs. Use Thread Sanitizer (TSan), stress tests, formal verification tools.

# Check data races with TSan
g++ -fsanitize=thread -g -O2 -std=c++17 -pthread -o test test.cpp
./test

9. Performance Benchmarks

Test environment (example)

  • CPU: Intel Core i7-12700 (12 cores)
  • Compile: g++ -O3 -std=c++17 -pthread -o lockfree_bench lockfree_bench.cpp
  • Threads: 4 producers, 4 consumers
  • Operations: 1,000,000 pushes + 1,000,000 pops

Benchmark results (example)

| Data structure | Total time (ms) | Ops/sec | Relative speed |
| --- | --- | --- | --- |
| std::queue + mutex | 180 | 5.5M | 1.0x |
| Lock-free queue (MPMC) | 72 | 13.9M | 2.5x |
| SPSC queue (ring buffer) | 28 | 35.7M | 6.4x |

Benchmark code example


#include <chrono>
#include <iostream>
#include <thread>
#include <vector>
template <typename Queue>
void benchmark(Queue& q, int num_producers, int num_consumers, int ops_per_thread) {
    std::vector<std::thread> producers, consumers;
    std::atomic<int> consumed{0};
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < num_producers; ++i) {
        producers.emplace_back([&, i]() {
            for (int j = 0; j < ops_per_thread; ++j) {
                q.push(i * ops_per_thread + j);
            }
        });
    }
    for (int i = 0; i < num_consumers; ++i) {
        consumers.emplace_back([&]() {
            int val;
            // Each fetch_add claims one item "ticket"; spin until that item arrives
            while (consumed.fetch_add(1) < num_producers * ops_per_thread) {
                while (!q.pop(val))
                    ;
            }
        });
    }
    for (auto& t : producers) t.join();
    for (auto& t : consumers) t.join();
    auto end = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start).count();
    int total_ops = num_producers * ops_per_thread * 2;  // push + pop
    std::cout << "Time: " << ms << " ms, Ops/sec: " << (total_ops / (ms / 1000.0)) << "\n";
}

Counter benchmark (atomic vs mutex)

| Method | 8 threads × 1M increments (ms) | Relative speed |
| --- | --- | --- |
| mutex | 45 | 1.0x |
| atomic fetch_add | 8 | 5.6x |
// Atomic counter
std::atomic<int> counter{0};
for (int i = 0; i < 1'000'000; ++i) counter.fetch_add(1, std::memory_order_relaxed);

10. Production Patterns

Pattern 1: Use existing libraries

Safer to use proven libraries than implementing yourself.

  • Folly (Facebook): folly::MPMCQueue, folly::ProducerConsumerQueue
  • Boost.Lockfree: boost::lockfree::queue, boost::lockfree::stack
  • libcds: A broad collection of lock-free containers

Example with Boost.Lockfree:
#include <boost/lockfree/queue.hpp>
boost::lockfree::queue<int> q(128);
void producer() {
    q.push(42);
}
void consumer() {
    int value;
    if (q.pop(value)) {
        // Process
    }
}

Pattern 2: Bounded vs Unbounded

  • Bounded: Fixed-size buffer. Predictable memory; push fails or blocks when full.
  • Unbounded: Dynamic allocation. Memory usage varies; OOM is possible.

Production mostly uses bounded queues, applying retry or backpressure when full.

Pattern 3: Batch processing to amortize CAS cost

Inserting or removing several items with a single CAS amortizes the CAS cost across the batch.

// Batch push: link multiple items then CAS tail once
void push_batch(const std::vector<T>& items) {
    Node* first = new Node(items[0]);
    Node* last = first;
    for (size_t i = 1; i < items.size(); ++i) {
        last->next = new Node(items[i]);
        last = last->next;
    }
    // CAS link last to tail (multiple items with one CAS)
    // ...
}

Pattern 4: Thread-local buffer + periodic flush

Each thread accumulates entries in a thread-local buffer and flushes them to the lock-free queue when the buffer fills or periodically, cutting the number of CAS operations.

thread_local std::vector<LogEntry> local_buffer;
constexpr size_t FLUSH_THRESHOLD = 64;
void log(LogEntry e) {
    local_buffer.push_back(e);
    if (local_buffer.size() >= FLUSH_THRESHOLD) {
        for (auto& entry : local_buffer) {
            global_queue.push(entry);
        }
        local_buffer.clear();
    }
}

Pattern 5: Shutdown flag and drain

On shutdown, stop the producers first, then let the consumers drain the queue until it is empty.

std::atomic<bool> shutdown_{false};
void producer() {
    while (!shutdown_.load(std::memory_order_acquire)) {
        produce_and_push();
    }
}
void consumer() {
    T value;
    while (true) {
        if (queue.pop(value)) {
            process(value);
        } else if (shutdown_.load(std::memory_order_acquire)) {
            break;
        }
    }
}

Pattern 6: Metrics collection

Even with lock-free structures, collect CAS failure counts, retry latencies, and similar metrics for monitoring; update the statistics with fetch_add.

std::atomic<uint64_t> cas_failures{0};
// Inside CAS loop
while (!head_.compare_exchange_weak(...)) {
    cas_failures.fetch_add(1, std::memory_order_relaxed);
}

11. Practical Examples

Example 1: Lock-Free Counter (Multi-thread Aggregation)

#include <atomic>
#include <thread>
#include <vector>
class LockFreeCounter {
public:
    void add(uint64_t delta) {
        count_.fetch_add(delta, std::memory_order_relaxed);
    }
    uint64_t get() const {
        return count_.load(std::memory_order_acquire);
    }
private:
    std::atomic<uint64_t> count_{0};
};
// Usage: multiple threads aggregate request counts, byte counts, etc.
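A usage sketch for the counter above (the class is repeated so the snippet stands alone): four threads each add 100000. Because fetch_add is a single atomic read-modify-write, no increment is ever lost and the total is exactly 400000.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

class LockFreeCounter {
public:
    void add(uint64_t delta) {
        count_.fetch_add(delta, std::memory_order_relaxed);
    }
    uint64_t get() const {
        return count_.load(std::memory_order_acquire);
    }
private:
    std::atomic<uint64_t> count_{0};
};

uint64_t run_demo() {
    LockFreeCounter counter;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        workers.emplace_back([&counter] {
            for (int i = 0; i < 100000; ++i) counter.add(1);
        });
    }
    for (auto& w : workers) w.join();
    return counter.get();  // exactly 400000: fetch_add never loses updates
}
```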

Example 2: Lock-Free Flag (Run Once)

#include <atomic>
std::atomic<bool> initialized{false};
void init_once() {
    if (!initialized.load(std::memory_order_acquire)) {
        bool expected = false;
        if (initialized.compare_exchange_strong(expected, true,
                                                std::memory_order_acq_rel)) {
            // Only this thread performs initialization
            do_actual_init();
        } else {
            // Another thread initializing. Wait or skip
        }
    }
}
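The "wait or skip" branch can be completed with a second flag that losers spin on until the winner publishes its writes. This is an illustrative sketch (the names and the int state are hypothetical); in practice std::call_once or a function-local static usually does this job for you.

```cpp
#include <atomic>

std::atomic<bool> claimed{false};    // who won the right to initialize
std::atomic<bool> init_done{false};  // has initialization finished
int config_value = 0;                // illustrative state set up exactly once

void init_once_blocking() {
    bool expected = false;
    if (claimed.compare_exchange_strong(expected, true,
                                        std::memory_order_acq_rel)) {
        config_value = 42;                                 // the actual init work
        init_done.store(true, std::memory_order_release);  // publish it
    } else {
        // Losers wait until the winner's writes become visible; the
        // acquire load pairs with the winner's release store.
        while (!init_done.load(std::memory_order_acquire)) { /* spin */ }
    }
}
```

Two flags are needed because claimed alone only says someone *started* initializing; init_done is what guarantees the work is finished and visible before any loser proceeds.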

Example 3: Lock-Free Cache Pointer (Read Optimization)

#include <atomic>
#include <memory>
template <typename T>
class Cache {
public:
    std::shared_ptr<T> get() {
        // NOTE: the atomic_load/atomic_store free functions for shared_ptr
        // are deprecated in C++20 and are typically implemented with an
        // internal lock pool, so this read is atomic but usually NOT
        // lock-free. In C++20, prefer std::atomic<std::shared_ptr<T>>.
        return std::atomic_load(&cached_);
    }
    void update(std::shared_ptr<T> new_val) {
        std::atomic_store(&cached_, new_val);  // atomic swap of the pointer
    }
private:
    std::shared_ptr<T> cached_;
};

Example 4: Semaphore Alternative (Simple Counting)

#include <atomic>
class LockFreeSemaphore {
public:
    explicit LockFreeSemaphore(int initial) : count_(initial) {}
    void acquire() {
        int expected;
        do {
            // Busy-waits while the count is zero, so this is a spinning
            // semaphore: acquire() is not non-blocking under contention.
            // In production, C++20 std::counting_semaphore is preferable.
            expected = count_.load(std::memory_order_acquire);
            while (expected == 0) {
                expected = count_.load(std::memory_order_acquire);
            }
        } while (!count_.compare_exchange_weak(expected, expected - 1,
                                              std::memory_order_acq_rel));
    }
    void release() {
        count_.fetch_add(1, std::memory_order_release);
    }
private:
    std::atomic<int> count_;
};


Summary

Item           Description
Lock-free      Guarantees at least one thread makes progress without locks
CAS            Conditional atomic update via compare_exchange_strong/weak
Stack          Head-only CAS; both push and pop CAS the head
Queue          head and tail CAS; a dummy node maintains consistency
ABA            Mitigate with tagged pointers, hazard pointers, RCU
Memory order   Synchronize with release/acquire pairs
Pitfalls       Separated load+store, overused relaxed, immediate node deletion
Production     Proven libraries, bounded queues, batch processing
Core principles:
  1. For simple cases (counters, flags), atomic fetch_add and CAS are enough
  2. For complex data structures, prefer proven libraries
  3. If implementing directly, always account for ABA, memory reclamation, and memory order
  4. Benchmark in your actual environment to verify gains over a mutex
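Principle 4 can be checked with a micro-benchmark along these lines. The sketch below is illustrative (names like kIters are made up); absolute numbers vary by hardware, but both variants must land on the same final count, which is the part worth asserting.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kIters = 100000;  // increments per thread (arbitrary)

uint64_t mutex_count = 0;       // protected by mtx
std::mutex mtx;
std::atomic<uint64_t> atomic_count{0};

void mutex_body() {
    for (int i = 0; i < kIters; ++i) {
        std::lock_guard<std::mutex> lock(mtx);
        ++mutex_count;
    }
}

void atomic_body() {
    for (int i = 0; i < kIters; ++i)
        atomic_count.fetch_add(1, std::memory_order_relaxed);
}

void run_threads(int n, void (*body)()) {
    std::vector<std::thread> ts;
    for (int i = 0; i < n; ++i) ts.emplace_back(body);
    for (auto& t : ts) t.join();
}

void benchmark() {
    auto time_ms = [](void (*body)()) {
        auto t0 = std::chrono::steady_clock::now();
        run_threads(4, body);
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    };
    std::printf("mutex: %.2f ms, atomic: %.2f ms\n",
                time_ms(mutex_body), time_ms(atomic_body));
}
```

Run it pinned to the core counts your service actually uses; a result that holds at 4 threads may invert at 64.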

Implementation Checklist

  • Pass expected by reference when using CAS
  • Apply appropriate memory_order (release/acquire) to push/pop
  • Don’t immediately delete popped nodes (consider hazard pointers, etc.)
  • Apply tagged pointers or hazard pointers if ABA possible
  • Prevent false sharing (cache line alignment)
  • Judge by pop() return value instead of empty()
  • Production: review proven libraries like Folly, Boost.Lockfree

One-line summary: Understanding CAS and memory order, and accounting for ABA and memory reclamation, enables safe use of lock-free data structures.
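The empty()-vs-pop() checklist item in code form: checking empty() and then popping is a time-of-check-to-time-of-use race, whereas a bool-returning pop makes the check and the removal one step. Sketched here against the mutex queue from the introduction; the same rule applies to the lock-free variants.

```cpp
#include <mutex>
#include <queue>

std::queue<int> q;
std::mutex mtx;

// RACY: another thread can pop between empty() and front(), so front()
// may dereference an empty queue:
//   if (!q.empty()) { int v = q.front(); q.pop(); }

// CORRECT: emptiness is decided and acted on atomically inside pop.
bool pop(int& out) {
    std::lock_guard<std::mutex> lock(mtx);
    if (q.empty()) return false;
    out = q.front();
    q.pop();
    return true;
}
```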

Previous: C++ Cache Hit Optimization with Memory Alignment and Padding #34-2

Frequently Asked Questions (FAQ)

Q. When do I use this in production?

A. Use std::atomic for simple shared state (counters, flags, one-time initialization). For lock-free queues and stacks, prefer proven libraries such as Folly or Boost.Lockfree, and hand-roll only after addressing the ABA problem, memory order, and memory reclamation (hazard pointers). Apply the examples and selection guides in the main text above, and verify gains with benchmarks.

Q. What should I read first?

A. Follow Previous links at bottom of each article to learn in order. Check C++ Series Index for complete flow.

Q. How to study deeper?

Keywords

C++, lock-free, atomic, CAS, ABA, memory_order, performance optimization, multithreading, interview
