[2026] C++ Cache-Friendly Code | 10x Performance Boost with Memory Access Patterns
Key Takeaways
Practical C++ cache optimization guide. Achieve 10x performance improvement with cache-friendly code. Includes AoS vs SoA, data locality, cache line alignment with real-world examples and benchmarks.
🎯 What You’ll Learn (Reading Time: 18 minutes)
TL;DR: Learn how to achieve 10x performance improvement through C++ cache optimization. Master practical techniques including memory access patterns, AoS vs SoA, and data locality with benchmarks.

What You'll Learn:
- ✅ Perfectly understand CPU cache principles and cache misses
- ✅ Master cache-friendly code writing patterns
- ✅ Acquire AoS vs SoA and data locality optimization techniques
- ✅ Verify performance improvements with real-world benchmarks

Real-World Applications:
- 🔥 Process large datasets 10x faster
- 🔥 Improve game engine frame rates
- 🔥 Reduce real-time system response times
- 🔥 Increase server throughput

Difficulty: Intermediate | Performance Gain: 10x | Benchmarks: Included
Introduction: A 10x Difference for the Same Operation
"Just changing array traversal direction made it 10x faster"
I wrote code to traverse a 2D array. Performance differed by 10x depending on traversal direction.
CPUs fetch memory in cache line units (the memory block size fetched at once by CPU cache, usually 64 bytes). When access order follows “consecutive addresses,” cache hits (fast access when needed data is in cache) increase. When jumping around, cache misses (must fetch from main memory when not in cache) increase. Paying attention to data locality (keeping frequently used data close and consecutive to improve cache efficiency) through row-major traversal, struct alignment, and keeping related data together can make the same operation much faster.
Slow Code:
int matrix[1000][1000];
// ❌ Column-major traversal (slow)
for (int col = 0; col < 1000; ++col) {
for (int row = 0; row < 1000; ++row) {
sum += matrix[row][col]; // Many cache misses
}
}
// Time: about 50ms
Fast Code (complete, compilable program):
// After copying and pasting: g++ -std=c++17 -O2 -o cache_fast cache_fast.cpp && ./cache_fast
#include <iostream>
#include <chrono>
int main() {
static int matrix[1000][1000] = {}; // static: a 4MB array on the stack risks overflow
long long sum = 0;
auto start = std::chrono::high_resolution_clock::now();
// ✅ Row-major traversal (fast)
for (int row = 0; row < 1000; ++row) {
for (int col = 0; col < 1000; ++col) {
sum += matrix[row][col]; // Cache hit
}
}
auto end = std::chrono::high_resolution_clock::now();
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "sum=" << sum << " time=" << ms << "ms\n";
return 0;
}
Execution Result: Outputs in format sum=0 time=Nms (N value varies by environment).
Cause: CPU cache prefetches consecutive memory
What You’ll Learn:
- Understand CPU cache principles
- Write cache-friendly code
- Utilize data locality
- Optimize memory access in production
Production Experience: This article is based on real problems and solutions encountered in large-scale C++ projects. Includes practical pitfalls and debugging tips not covered in books or documentation.
Table of Contents
- CPU Cache Basics
- Data Locality
- Reducing Cache Misses
- Struct Layout Optimization
- AoS vs SoA Complete Example
- False Sharing
- Using Prefetch
- Practical Optimization Patterns
- Common Errors and Solutions
- Performance Benchmarks
- Production Patterns
1. CPU Cache Basics
Memory Hierarchy
Typical latencies and capacities at each level:
CPU Register < 1ns (fastest)
↓
L1 Cache ~1ns (32-64KB)
↓
L2 Cache ~3ns (256KB-1MB)
↓
L3 Cache ~10ns (8-32MB)
↓
RAM ~100ns (several GB)
↓
SSD ~100us (hundreds of GB)
↓
HDD ~10ms (several TB, slowest)
Cache Line: Usually fetches memory in 64-byte units
Memory Hierarchy Visualization
The same hierarchy, visualized:
flowchart TB
subgraph fast[Fast Access]
R[Register]
L1[L1 Cache 32KB]
L2[L2 Cache 256KB]
end
subgraph slow[Slow Access]
L3[L3 Cache 8MB]
RAM[RAM 100ns]
end
R --> L1 --> L2 --> L3 --> RAM
Cache Hit vs Miss
int arr[1000];
// Cache hit: sequential access
for (int i = 0; i < 1000; ++i) {
sum += arr[i]; // Fast
}
// Cache miss: irregular access
for (int i = 0; i < 1000; i += 64) {
sum += arr[i]; // Slow (cache line waste)
}
Detailed Code Explanation:

Cache hit (sequential access):
- Accesses arr[0], arr[1], arr[2], … in order.
- A cache line usually fetches 64 bytes (16 ints) at once.
- Reading arr[0] loads arr[0]~arr[15] into cache together.
- The next 15 accesses are cache hits (very fast, ~1ns each).
- Result: only about 62 memory fetches for 1000 accesses (1000/16).

Cache miss (strided access):
- Jumps 64 ints (256 bytes) at a time: arr[0], arr[64], arr[128], …
- Every access lands on a different cache line, so every access misses.
- Each of the ~16 accesses must fetch from main memory.
- The other 15 ints loaded into each cache line are discarded unused.
- Result: each access costs ~100ns instead of ~1ns, roughly 100x more per element touched, and the gap grows with data size.
2. Data Locality
Temporal Locality
// ✅ Good: repeated access to same data
int sum = 0;
for (int i = 0; i < 100; ++i) {
sum += data[0]; // data[0] stays in cache
}
// ❌ Bad: different data each time
for (int i = 0; i < 100; ++i) {
sum += data[rand() % 10000]; // Many cache misses
}
Spatial Locality
struct Point {
int x, y, z;
};
std::vector<Point> points(1000);
// ✅ Good: sequential access
for (const auto& p : points) {
sum += p.x + p.y + p.z; // Cache-friendly
}
// ❌ Bad: pointer chasing
struct Node {
int value;
Node* next;
};
Node* head = /* ....*/;
for (Node* p = head; p != nullptr; p = p->next) {
sum += p->value; // Many cache misses
}
3. Reducing Cache Misses
Pattern 1: Use Contiguous Memory
// ❌ Bad: linked list (cache misses)
std::list<int> data;
for (int val : data) {
sum += val; // Cache miss per node
}
// ✅ Good: vector (cache-friendly)
std::vector<int> data;
for (int val : data) {
sum += val; // Contiguous memory, cache hits
}
Pattern 2: Row-Major Traversal
const int N = 1000;
int matrix[N][N];
// ❌ Column-major (slow)
for (int col = 0; col < N; ++col) {
for (int row = 0; row < N; ++row) {
matrix[row][col] = 0; // Cache miss
}
}
// ✅ Row-major (fast)
for (int row = 0; row < N; ++row) {
for (int col = 0; col < N; ++col) {
matrix[row][col] = 0; // Cache hit
}
}
Detailed Code Explanation:

C++ 2D array memory layout:
- int matrix[N][N] is stored in row-major order.
- Memory order: matrix[0][0], matrix[0][1], …, matrix[0][N-1], matrix[1][0], …
- Elements of the same row are contiguous in memory.

Column-major traversal (slow):

Access order: matrix[0][0] → matrix[1][0] → matrix[2][0] → ...
Memory order: [0][0] [0][1] [0][2] ... [1][0] [1][1] ...

- Reading matrix[0][0] loads matrix[0][0]~[0][15] into cache.
- But the next access, matrix[1][0], is 1000 * 4 bytes = 4KB away.
- The cached matrix[0][1]~[0][15] are discarded unused.
- Result: almost every access is a cache miss (very slow).

Row-major traversal (fast):

Access order: matrix[0][0] → matrix[0][1] → matrix[0][2] → ...
Memory order: [0][0] [0][1] [0][2] ... (matches!)

- Reading matrix[0][0] loads matrix[0][0]~[0][15] into cache.
- The next 15 accesses (matrix[0][1]~[0][15]) are all cache hits.
- Result: only 1 cache miss per 16 accesses (very fast).

Performance difference (1000x1000 matrix):
- Column-major: ~100ms (1,000,000 cache misses)
- Row-major: ~10ms (62,500 cache misses)
- About a 10x difference!

Production Tip: Choosing the wrong traversal order in matrix multiplication, image processing, etc. significantly degrades performance.
4. Struct Layout Optimization
Minimize Padding
// ❌ Bad: padding inflates the struct to 12 bytes
struct Bad {
char c1; // 1 byte
// 3 bytes padding
int i; // 4 bytes
char c2; // 1 byte
// 3 bytes padding (struct aligned to 4)
};
// ✅ Good: largest member first, 8 bytes total
struct Good {
int i; // 4 bytes
char c1; // 1 byte
char c2; // 1 byte
// 2 bytes padding
};
Separate Hot/Cold Data
// ❌ Bad: frequently and rarely used data mixed
struct Entity {
int id; // Frequently used
float x, y, z; // Frequently used
std::string name; // Occasionally used
std::string description; // Rarely used
};
// ✅ Good: separate hot data only
struct EntityHot {
int id;
float x, y, z;
};
struct EntityCold {
std::string name;
std::string description;
};
std::vector<EntityHot> hotData;
std::map<int, EntityCold> coldData;
5. AoS vs SoA Complete Example
Memory Layout Comparison
flowchart LR
subgraph AoS["AoS: Array of Structures"]
direction TB
A1["(x0,y0,z0,vx0,vy0,vz0,r0,g0,b0)"]
A2["(x1,y1,z1,vx1,vy1,vz1,r1,g1,b1)"]
A1 --> A2
end
subgraph SoA["SoA: Structure of Arrays"]
direction TB
S1["x: (x0,x1,x2,...)"]
S2["vx: (vx0,vx1,vx2,...)"]
S3["r: (r0,r1,r2,...)"]
S1 --> S2 --> S3
end
AoS (Array of Structures)
#include <vector>
#include <chrono>
#include <iostream>
struct ParticleAoS {
float x, y, z; // Position (12 bytes)
float vx, vy, vz; // Velocity (12 bytes)
float r, g, b; // Color (12 bytes)
// Total 36 bytes
};
void updatePositionsAoS(std::vector<ParticleAoS>& particles) {
for (auto& p : particles) {
p.x += p.vx;
p.y += p.vy;
p.z += p.vz;
}
}
int main() {
std::vector<ParticleAoS> particles(100000);
// Initialization...
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 100; ++i) {
updatePositionsAoS(particles);
}
auto end = std::chrono::high_resolution_clock::now();
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "AoS: " << ms << " ms\n";
return 0;
}
AoS Memory Layout:
Memory: [x0,y0,z0,vx0,vy0,vz0,r0,g0,b0][x1,y1,z1,vx1,vy1,vz1,r1,g1,b1]...
↑ 36 bytes (9 floats) ↑ 36 bytes
Cache line (64B) holds ~1.7 particles → loads colors even when only using position (waste)
SoA (Structure of Arrays)
struct ParticleSystemSoA {
std::vector<float> x, y, z;
std::vector<float> vx, vy, vz;
std::vector<float> r, g, b;
void resize(size_t n) {
x.resize(n); y.resize(n); z.resize(n);
vx.resize(n); vy.resize(n); vz.resize(n);
r.resize(n); g.resize(n); b.resize(n);
}
size_t size() const { return x.size(); }
};
void updatePositionsSoA(ParticleSystemSoA& particles) {
const size_t n = particles.size();
for (size_t i = 0; i < n; ++i) {
particles.x[i] += particles.vx[i];
particles.y[i] += particles.vy[i];
particles.z[i] += particles.vz[i];
}
}
SoA Memory Layout:
x array: [x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,...]
vx array: [vx0,vx1,vx2,vx3,vx4,vx5,vx6,vx7,vx8,vx9,vx10,vx11,...]
r array: [r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,r10,r11,r12,r13,r14,r15,...]
Cache line (64B) holds 16 floats → utilizes all x[i]~x[i+15] when reading x[i]
AoS vs SoA Selection Guide
| Situation | Recommended | Reason |
|---|---|---|
| Process same field in bulk (position updates) | SoA | Maximum cache efficiency |
| Use multiple fields of one object together | AoS | Simpler code |
| Apply SIMD vectorization | SoA | Easy to process 4/8 in parallel |
| Query/modify individual entities | AoS | Access with single index |
6. False Sharing
False Sharing Principle
flowchart LR
subgraph bad[❌ False Sharing]
CL["Cache Line 64B"]
C0["counter[0]"]
C1["counter[1]"]
C2["counter[2]"]
CL --> C0 --> C1 --> C2
T1["Thread 1 modifies"] -.->|invalidates| C0
T2["Thread 2 modifies"] -.->|invalidates| C1
end
When one thread modifies counter[0], the cache for counter[1] and counter[2] in the same cache line is invalidated, forcing other threads to reload from memory every time.
Problem: Multiple Threads Modify Same Cache Line
#include <thread>
#include <vector>
#include <atomic>
#include <chrono>
#include <iostream>
// ❌ Bad: false sharing occurs
void badParallelCounter() {
const int numThreads = 4;
std::vector<int> counters(numThreads, 0); // 4 ints = 16 bytes, same cache line!
std::vector<std::thread> threads;
for (int t = 0; t < numThreads; ++t) {
threads.emplace_back([&counters, t]() {
for (int i = 0; i < 10000000; ++i) {
counters[t]++; // Causes other threads' cache line invalidation
}
});
}
for (auto& th : threads) th.join();
}
Cause: When counters[0], counters[1], … fit in the same 64-byte cache line, every time one thread modifies counters[0] that cache line is invalidated in the other cores, and threads using counters[1] must refetch it from memory.
Solution: Cache Line Alignment
#include <cstddef>
// ✅ Good: align to cache line boundary
struct alignas(64) CacheLineAlignedCounter {
int value;
char padding[64 - sizeof(int)]; // Prevent cache line sharing
};
void goodParallelCounter() {
const int numThreads = 4;
std::vector<CacheLineAlignedCounter> counters(numThreads);
std::vector<std::thread> threads;
for (int t = 0; t < numThreads; ++t) {
threads.emplace_back([&counters, t]() {
for (int i = 0; i < 10000000; ++i) {
counters[t].value++;
}
});
}
for (auto& th : threads) th.join();
}
Using alignas (C++11) with std::atomic
// Each counter placed on separate cache line
struct alignas(64) ThreadLocalCounter {
std::atomic<int64_t> count{0};
};
void benchmarkCounters() {
const int numThreads = 4;
std::vector<ThreadLocalCounter> counters(numThreads);
auto start = std::chrono::high_resolution_clock::now();
std::vector<std::thread> threads;
for (int t = 0; t < numThreads; ++t) {
threads.emplace_back([&counters, t]() {
for (int i = 0; i < 10000000; ++i) {
counters[t].count.fetch_add(1, std::memory_order_relaxed);
}
});
}
for (auto& th : threads) th.join();
auto end = std::chrono::high_resolution_clock::now();
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "Aligned: " << ms << " ms\n";
}
Performance Difference: Eliminating false sharing often results in 2~10x speedup.
7. Using Prefetch
Basic Usage
// GCC/Clang provide __builtin_prefetch with no header; on MSVC use _mm_prefetch from <xmmintrin.h>
void processWithPrefetch(const std::vector<int>& data) {
const size_t n = data.size();
const int PREFETCH_DISTANCE = 8; // How many elements ahead to preload
for (size_t i = 0; i < n; ++i) {
// Preload next block into cache
if (i + PREFETCH_DISTANCE < n) {
__builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
// Args: (address, rw: 0 = read, locality: 3 = high, keep in all cache levels)
}
process(data[i]);
}
}
Prefetch in Linked List Traversal
struct Node {
int value;
Node* next;
};
int sumListWithPrefetch(Node* head) {
int sum = 0;
Node* curr = head;
while (curr != nullptr) {
// Preload next node (before following pointer)
if (curr->next != nullptr) {
__builtin_prefetch(curr->next, 0, 3);
}
sum += curr->value;
curr = curr->next;
}
return sum;
}
Following Index Array
// indices[i] is index into data → access data[indices[i]]
void gatherWithPrefetch(const std::vector<float>& data,
const std::vector<size_t>& indices) {
const size_t n = indices.size();
float sum = 0;
for (size_t i = 0; i < n; ++i) {
if (i + 4 < n) {
__builtin_prefetch(&data[indices[i + 4]], 0, 3);
}
sum += data[indices[i]];
}
}
Warning: If prefetch distance is too large, data gets evicted from cache. If too small, no effect. Experiment with 4~16.
8. Practical Optimization Patterns
Pattern 1: Block Processing
const int BLOCK_SIZE = 64; // Elements per block; tune so a block's working set fits in L1
void processBlocked(int* data, int N) {
for (int i = 0; i < N; i += BLOCK_SIZE) {
int end = std::min(i + BLOCK_SIZE, N);
for (int j = i; j < end; ++j) {
process(data[j]);
}
}
}
Pattern 2: Loop Fusion
std::vector<int> data(10000);
// ❌ Bad: multiple traversals
for (int& val : data) {
val *= 2;
}
for (int& val : data) {
val += 10;
}
// ✅ Good: single traversal
for (int& val : data) {
val *= 2;
val += 10;
}
Pattern 3: Improve Locality with Sorting
struct Entity {
int type;
// Data...
};
std::vector<Entity> entities;
// Sort by type
std::sort(entities.begin(), entities.end(),
[](const Entity& a, const Entity& b) {
return a.type < b.type;
});
// Process same types consecutively (cache-friendly)
for (const auto& e : entities) {
processType(e.type, e);
}
Pattern 4: Data Reorganization (SoA)
// ❌ Bad: AoS (Array of Structures)
struct Particle {
float x, y, z; // Position
float vx, vy, vz; // Velocity
float r, g, b; // Color
};
std::vector<Particle> particles(10000);
// Update position only (colors also loaded into cache, waste)
for (auto& p : particles) {
p.x += p.vx;
p.y += p.vy;
p.z += p.vz;
}
// ✅ Good: SoA (Structure of Arrays)
struct ParticleSystem {
std::vector<float> x, y, z;
std::vector<float> vx, vy, vz;
std::vector<float> r, g, b;
};
ParticleSystem particles;
particles.x.resize(10000);
// ...
// Update position only (only position data loaded into cache)
for (size_t i = 0; i < particles.x.size(); ++i) {
particles.x[i] += particles.vx[i];
particles.y[i] += particles.vy[i];
particles.z[i] += particles.vz[i];
}
SoA Detailed Explanation:
- AoS problem: each Particle is 36 bytes, so a 64-byte cache line holds fewer than two particles. A position update touches x/y/z/vx/vy/vz (24 of the 36 bytes); the colors ride along in every cache line unused, wasting about a third of memory bandwidth, and the interleaved layout also blocks SIMD vectorization.
- SoA advantage: a 64-byte cache line holds 16 floats from one array. Reading x[0] loads x[0]~x[15], and the next 15 accesses are all cache hits; every byte fetched is used.
- Performance improvement: for the 10,000-particle update above, AoS ~50μs vs SoA ~10μs, about 5x faster!
9. Common Errors and Solutions
Error 1: Severe Degradation from Column-Major Traversal
Symptom: 2D array processing is 10x+ slower than expected. Cause: C/C++ arrays are row-major but were traversed column-major. Solution:
// ❌ Wrong order
for (int col = 0; col < COLS; ++col)
for (int row = 0; row < ROWS; ++row)
process(matrix[row][col]);
// ✅ Correct order (row-major)
for (int row = 0; row < ROWS; ++row)
for (int col = 0; col < COLS; ++col)
process(matrix[row][col]);
Error 2: Multithreading Slowed by False Sharing
Symptom: Adding threads makes the program slower. Cause: Different threads modify variables in the same cache line. Solution:
// ❌ Sharing same cache line
std::atomic<int> counters[8];
// ✅ Cache line alignment
struct alignas(64) AlignedCounter {
std::atomic<int> value{0};
};
std::vector<AlignedCounter> counters(8);
Error 3: Performance Degradation from Excessive Prefetch
Symptom: Adding __builtin_prefetch makes it slower.
Cause: Prefetching too far ahead evicts valid data from cache.
Solution:
// ❌ Distance too large (cache pollution)
__builtin_prefetch(&data[i + 64], 0, 3);
// ✅ Appropriate distance (4~16)
__builtin_prefetch(&data[i + 8], 0, 3);
Error 4: Index Mismatch When Mixing SoA and AoS
Symptom: After converting to SoA, some particles reference wrong data.
Cause: Only some of the arrays were resized, or indices were computed inconsistently across arrays.
Solution:
// ✅ Maintain SoA size consistency
struct ParticleSystem {
std::vector<float> x, y, z, vx, vy, vz;
void resize(size_t n) {
x.resize(n); y.resize(n); z.resize(n);
vx.resize(n); vy.resize(n); vz.resize(n);
}
};
Error 5: Using list Instead of vector
Symptom: std::list traversal much slower than std::vector.
Cause: List nodes are scattered, causing many cache misses.
Solution:
// ❌ list for traversal only
std::list<int> items;
for (auto v : items) sum += v;
// ✅ vector for traversal-heavy workloads
std::vector<int> items;
for (auto v : items) sum += v;
10. Performance Benchmarks
Benchmark 1: Row-Major vs Column-Major
#include <iostream>
#include <chrono>
const int N = 4096;
int matrix[N][N];
void benchmarkRowMajor() {
long long sum = 0;
auto start = std::chrono::high_resolution_clock::now();
for (int row = 0; row < N; ++row) {
for (int col = 0; col < N; ++col) {
sum += matrix[row][col];
}
}
auto end = std::chrono::high_resolution_clock::now();
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "Row-major: " << ms << " ms (sum=" << sum << ")\n";
}
void benchmarkColMajor() {
long long sum = 0;
auto start = std::chrono::high_resolution_clock::now();
for (int col = 0; col < N; ++col) {
for (int row = 0; row < N; ++row) {
sum += matrix[row][col];
}
}
auto end = std::chrono::high_resolution_clock::now();
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "Col-major: " << ms << " ms (sum=" << sum << ")\n";
}
int main() {
benchmarkRowMajor(); // e.g., 15ms
benchmarkColMajor(); // e.g., 150ms (about 10x difference)
}
Benchmark 2: AoS vs SoA
Measured results (machine-dependent):
Environment: 100,000 particles, position-only updates 100 times
AoS: 52ms
SoA: 11ms
Ratio: SoA is about 4.7x faster
Benchmark 3: Eliminating False Sharing
Measured results (machine-dependent):
Environment: 4 threads, 10 million counter increments each
False sharing: 180ms
alignas(64) applied: 45ms
Ratio: about 4x faster
Benchmark 4: vector vs list Traversal
// 1 million element traversal sum
std::vector<int> vec(1000000);
std::list<int> lst(1000000);
// vector: ~2ms
// list: ~25ms (about 12x slower)
11. Production Patterns
Pattern 1: ECS (Entity-Component-System) Style SoA
// Common pattern in game engines
struct TransformComponent {
std::vector<float> x, y, z;
std::vector<float> rotX, rotY, rotZ;
};
struct RenderComponent {
std::vector<uint32_t> textureId;
std::vector<float> r, g, b, a;
};
void updateTransforms(TransformComponent& tf, float dt) {
for (size_t i = 0; i < tf.x.size(); ++i) {
tf.x[i] += 0.1f * dt; // Sequential position access only
tf.y[i] += 0.1f * dt;
tf.z[i] += 0.1f * dt;
}
}
Pattern 2: Cache Line Size Constants
namespace cache {
constexpr size_t LINE_SIZE = 64;
constexpr size_t L1_SIZE = 32 * 1024;
constexpr size_t L2_SIZE = 256 * 1024;
}
template<typename T>
struct alignas(cache::LINE_SIZE) CacheLineAligned {
T value;
};
Pattern 3: Optimize After Profiling
// 1. Check cache misses with profiler (perf, VTune)
// 2. Identify loops with many cache misses
// 3. Apply traversal order, SoA conversion, prefetch
// 4. Verify with benchmarks
Pattern 4: Data-Oriented Design Checklist
- [ ] Use contiguous memory (vector > list)
- [ ] Row-major traversal (2D arrays)
- [ ] Consider SoA (when processing same fields in bulk)
- [ ] Separate hot/cold (group frequently used fields only)
- [ ] Cache line alignment for multithreading (prevent false sharing)
- [ ] Experiment with prefetch (pointer chaining, index arrays)
Performance Comparison Summary
| Optimization | Effect | Difficulty |
|---|---|---|
| Row-major traversal | 5~10x | Easy |
| SoA conversion | 3~5x | Medium |
| Eliminate false sharing | 2~10x | Easy |
| Remove list, use vector | 5~20x | Easy |
| Prefetch | 1.2~1.5x | Medium |
Related Articles (Internal Links)
Other articles connected to this topic.
- C++ Profiling | Finding Bottlenecks with perf and gprof
- C++ Compile-Time Optimization | constexpr, PCH, Modules, ccache, Unity Build [#15-3]
- C++ Move Semantics | Eliminate Unnecessary Copies and Optimize Performance with std::move
Keywords Covered in This Article (Related Searches)
Search for C++ cache-friendly, cache friendly, memory access patterns, performance optimization, data-oriented design, AoS SoA, false sharing, prefetch to find this article helpful.
Summary
| Principle | Description |
|---|---|
| Contiguous Memory | vector > list |
| Row-Major Traversal | 2D arrays are row-major |
| SoA > AoS | Separate frequently used fields only |
| Minimize Padding | Large fields first |
| Loop Fusion | Multiple traversals → one |
| Sorting | Process same types consecutively |
| Prevent False Sharing | Separate per-thread data with alignas(64) |
| Prefetch | Preload next block when following pointers/indices |
Core Principles:
- Prefer contiguous memory
- Maximize sequential access
- Separate hot data
- Consider cache lines
- Verify with measurements
Frequently Asked Questions (FAQ)
Q. When do you use this in production?
A. This covers CPU cache principles, reducing cache misses, data locality, and optimizing memory access patterns. Apply the examples and selection guide from the main text in production.
Q. What should I read first?
A. Follow the previous article links at the bottom of each post to learn in order. Check the C++ Series Index for the complete flow.
Q. How can I study deeper?
A. Refer to cppreference and official library documentation. Use the reference links at the end of the article.

One-Line Summary: Leveraging contiguous memory and locality increases cache hits, improving performance.

Previous Article: [C++ Practical Guide #15-1] Profiling and Finding Bottlenecks: Performance Measurement Basics
Next Article: [C++ Practical Guide #15-3] Compile-Time Optimization: constexpr and Template Metaprogramming
Related Articles
- C++ Profiling | Finding Bottlenecks with perf and gprof
- C++ Compile-Time Optimization | constexpr, PCH, Modules, ccache, Unity Build [#15-3]
- C++ Cache-Efficient Code: Data-Oriented Design Guide
- C++ Cache Optimization in Practice | Cache-Friendly Structures, Prefetch, False Sharing, AoS vs SoA Guide
- C++ STL Algorithm Basics | sort, find, count, transform, accumulate Guide