[2026] C++ Cache-Friendly Code | 10x Performance Boost with Memory Access Patterns
Key Takeaways
Practical C++ cache optimization guide. Achieve 10x performance improvement with cache-friendly code. Includes AoS vs SoA, data locality, cache line alignment with real-world examples and benchmarks.
🎯 What You’ll Learn (Reading Time: 18 minutes)
TL;DR: Learn how to achieve 10x performance improvement through C++ cache optimization. Master practical techniques including memory access patterns, AoS vs SoA, and data locality with benchmarks.

What You'll Learn:
- ✅ Perfectly understand CPU cache principles and cache misses
- ✅ Master cache-friendly code writing patterns
- ✅ Acquire AoS vs SoA and data locality optimization techniques
- ✅ Verify performance improvements with real-world benchmarks

Real-World Applications:
- 🔥 Process large datasets 10x faster
- 🔥 Improve game engine frame rates
- 🔥 Reduce real-time system response times
- 🔥 Increase server throughput

Difficulty: Intermediate | Performance Gain: 10x | Benchmarks: Included
Introduction: A 10x Difference for the Same Operation
"Just changing array traversal direction made it 10x faster"
I wrote code to traverse a 2D array. Performance differed by 10x depending on traversal direction.
CPUs fetch memory in cache line units (the memory block size fetched at once by CPU cache, usually 64 bytes). When access order follows “consecutive addresses,” cache hits (fast access when needed data is in cache) increase. When jumping around, cache misses (must fetch from main memory when not in cache) increase. Paying attention to data locality (keeping frequently used data close and consecutive to improve cache efficiency) through row-major traversal, struct alignment, and keeping related data together can make the same operation much faster.
Slow Code:
int matrix[1000][1000];
// ❌ Column-major traversal (slow)
for (int col = 0; col < 1000; ++col) {
for (int row = 0; row < 1000; ++row) {
sum += matrix[row][col]; // Many cache misses
}
}
// Time: about 50ms
Fast Code (complete, compilable program):
// After copying and pasting: g++ -std=c++17 -O2 -o cache_fast cache_fast.cpp && ./cache_fast
#include <iostream>
#include <chrono>
int main() {
static int matrix[1000][1000] = {}; // static: a 4MB array on the stack risks overflow
long long sum = 0;
auto start = std::chrono::high_resolution_clock::now();
// ✅ Row-major traversal (fast)
for (int row = 0; row < 1000; ++row) {
for (int col = 0; col < 1000; ++col) {
sum += matrix[row][col]; // Cache hit
}
}
auto end = std::chrono::high_resolution_clock::now();
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "sum=" << sum << " time=" << ms << "ms\n";
return 0;
}
Execution Result: Outputs in format sum=0 time=Nms (N value varies by environment).
Cause: CPU cache prefetches consecutive memory
What You’ll Learn:
- Understand CPU cache principles
- Write cache-friendly code
- Utilize data locality
- Optimize memory access in production
Production Experience: This article is based on real problems and solutions encountered in large-scale C++ projects. Includes practical pitfalls and debugging tips not covered in books or documentation.
Table of Contents
- CPU Cache Basics
- Data Locality
- Reducing Cache Misses
- Struct Layout Optimization
- AoS vs SoA Complete Example
- False Sharing
- Using Prefetch
- Practical Optimization Patterns
- Common Errors and Solutions
- Performance Benchmarks
- Production Patterns
1. CPU Cache Basics
Memory Hierarchy
Typical latencies and capacities at each level:
CPU Register < 1ns (fastest)
↓
L1 Cache ~1ns (32-64KB)
↓
L2 Cache ~3ns (256KB-1MB)
↓
L3 Cache ~10ns (8-32MB)
↓
RAM ~100ns (several GB)
↓
SSD ~100us (hundreds of GB)
↓
HDD ~10ms (several TB, slowest)
Cache Line: Usually fetches memory in 64-byte units
Memory Hierarchy Visualization
The same hierarchy, visualized:
flowchart TB
subgraph fast[Fast Access]
R[Register]
L1[L1 Cache 32KB]
L2[L2 Cache 256KB]
end
subgraph slow[Slow Access]
L3[L3 Cache 8MB]
RAM[RAM 100ns]
end
R --> L1 --> L2 --> L3 --> RAM
Cache Hit vs Miss
int arr[1000];
// Cache hit: sequential access
for (int i = 0; i < 1000; ++i) {
sum += arr[i]; // Fast
}
// Cache miss: irregular access
for (int i = 0; i < 1000; i += 64) {
sum += arr[i]; // Slow (cache line waste)
}
Detailed Code Explanation:

Cache hit (sequential access):
- Accesses arr[0], arr[1], arr[2], … in order.
- A cache line usually fetches 64 bytes (16 ints) at once.
- Reading arr[0] loads arr[0]~arr[15] into cache together.
- The next 15 accesses are cache hits (very fast, ~1ns each).
- Result: only about 62 memory fetches for 1000 accesses (1000/16).

Cache miss (strided access):
- Jumps 64 ints (256 bytes) at a time: arr[0], arr[64], arr[128], …
- Every access lands on a different cache line, so every access misses.
- Each of the ~16 accesses must fetch from main memory.
- The other 15 ints loaded into each cache line are discarded unused.
- Result: each access costs ~100ns instead of ~1ns, roughly 100x more per element touched, and the gap grows with data size.
2. Data Locality
Temporal Locality
// ✅ Good: repeated access to same data
int sum = 0;
for (int i = 0; i < 100; ++i) {
sum += data[0]; // data[0] stays in cache
}
// ❌ Bad: different data each time
for (int i = 0; i < 100; ++i) {
sum += data[rand() % 10000]; // Many cache misses
}
Spatial Locality
struct Point {
int x, y, z;
};
std::vector<Point> points(1000);
// ✅ Good: sequential access
for (const auto& p : points) {
sum += p.x + p.y + p.z; // Cache-friendly
}
// ❌ Bad: pointer chasing
struct Node {
int value;
Node* next;
};
Node* head = /* ....*/;
for (Node* p = head; p != nullptr; p = p->next) {
sum += p->value; // Many cache misses
}
3. Reducing Cache Misses
Pattern 1: Use Contiguous Memory
// ❌ Bad: linked list (cache misses)
std::list<int> data;
for (int val : data) {
sum += val; // Cache miss per node
}
// ✅ Good: vector (cache-friendly)
std::vector<int> data;
for (int val : data) {
sum += val; // Contiguous memory, cache hits
}
Pattern 2: Row-Major Traversal
const int N = 1000;
int matrix[N][N];
// ❌ Column-major (slow)
for (int col = 0; col < N; ++col) {
for (int row = 0; row < N; ++row) {
matrix[row][col] = 0; // Cache miss
}
}
// ✅ Row-major (fast)
for (int row = 0; row < N; ++row) {
for (int col = 0; col < N; ++col) {
matrix[row][col] = 0; // Cache hit
}
}
Detailed Code Explanation:

C++ 2D array memory layout:
- int matrix[N][N] is stored in row-major order.
- Memory order: matrix[0][0], matrix[0][1], …, matrix[0][N-1], matrix[1][0], …
- Elements of the same row are contiguous in memory.

Column-major traversal (slow):

Access order: matrix[0][0] → matrix[1][0] → matrix[2][0] → ...
Memory order: [0][0] [0][1] [0][2] ... [1][0] [1][1] ...

- Reading matrix[0][0] loads matrix[0][0]~[0][15] into cache.
- But the next access, matrix[1][0], is 1000 * 4 bytes = 4KB away.
- The cached matrix[0][1]~[0][15] are discarded unused.
- Result: almost every access is a cache miss (very slow).

Row-major traversal (fast):

Access order: matrix[0][0] → matrix[0][1] → matrix[0][2] → ...
Memory order: [0][0] [0][1] [0][2] ... (matches!)

- Reading matrix[0][0] loads matrix[0][0]~[0][15] into cache.
- The next 15 accesses (matrix[0][1]~[0][15]) are all cache hits.
- Result: only 1 cache miss per 16 accesses (very fast).

Performance difference (1000x1000 matrix):
- Column-major: ~100ms (1,000,000 cache misses)
- Row-major: ~10ms (62,500 cache misses)
- About a 10x difference!

Production Tip: Choosing the wrong traversal order in matrix multiplication, image processing, etc. significantly degrades performance.
4. Struct Layout Optimization
Minimize Padding
// ❌ Bad: padding inflates the struct to 12 bytes
struct Bad {
char c1; // 1 byte
// 3 bytes padding
int i; // 4 bytes
char c2; // 1 byte
// 3 bytes padding (struct aligned to 4)
};
// ✅ Good: largest member first, 8 bytes total
struct Good {
int i; // 4 bytes
char c1; // 1 byte
char c2; // 1 byte
// 2 bytes padding
};
Separate Hot/Cold Data
// ❌ Bad: frequently and rarely used data mixed
struct Entity {
int id; // Frequently used
float x, y, z; // Frequently used
std::string name; // Occasionally used
std::string description; // Rarely used
};
// ✅ Good: separate hot data only
struct EntityHot {
int id;
float x, y, z;
};
struct EntityCold {
std::string name;
std::string description;
};
std::vector<EntityHot> hotData;
std::map<int, EntityCold> coldData;
5. AoS vs SoA Complete Example
Memory Layout Comparison
flowchart LR
subgraph AoS["AoS: Array of Structures"]
direction TB
A1["(x0,y0,z0,vx0,vy0,vz0,r0,g0,b0)"]
A2["(x1,y1,z1,vx1,vy1,vz1,r1,g1,b1)"]
A1 --> A2
end
subgraph SoA["SoA: Structure of Arrays"]
direction TB
S1["x: (x0,x1,x2,...)"]
S2["vx: (vx0,vx1,vx2,...)"]
S3["r: (r0,r1,r2,...)"]
S1 --> S2 --> S3
end
AoS (Array of Structures)
#include <vector>
#include <chrono>
#include <iostream>
struct ParticleAoS {
float x, y, z; // Position (12 bytes)
float vx, vy, vz; // Velocity (12 bytes)
float r, g, b; // Color (12 bytes)
// Total 36 bytes
};
void updatePositionsAoS(std::vector<ParticleAoS>& particles) {
for (auto& p : particles) {
p.x += p.vx;
p.y += p.vy;
p.z += p.vz;
}
}
int main() {
std::vector<ParticleAoS> particles(100000);
// Initialization...
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 100; ++i) {
updatePositionsAoS(particles);
}
auto end = std::chrono::high_resolution_clock::now();
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "AoS: " << ms << " ms\n";
return 0;
}
AoS Memory Layout:
Memory: [x0,y0,z0,vx0,vy0,vz0,r0,g0,b0][x1,y1,z1,vx1,vy1,vz1,r1,g1,b1]...
↑ 36 bytes (9 floats) ↑ 36 bytes
Cache line (64B) holds ~1.7 particles → loads colors even when only using position (waste)
SoA (Structure of Arrays)
struct ParticleSystemSoA {
std::vector<float> x, y, z;
std::vector<float> vx, vy, vz;
std::vector<float> r, g, b;
void resize(size_t n) {
x.resize(n); y.resize(n); z.resize(n);
vx.resize(n); vy.resize(n); vz.resize(n);
r.resize(n); g.resize(n); b.resize(n);
}
size_t size() const { return x.size(); }
};
void updatePositionsSoA(ParticleSystemSoA& particles) {
const size_t n = particles.size();
for (size_t i = 0; i < n; ++i) {
particles.x[i] += particles.vx[i];
particles.y[i] += particles.vy[i];
particles.z[i] += particles.vz[i];
}
}
SoA Memory Layout:
x array: [x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,...]
vx array: [vx0,vx1,vx2,vx3,vx4,vx5,vx6,vx7,vx8,vx9,vx10,vx11,...]
r array: [r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,r10,r11,r12,r13,r14,r15,...]
Cache line (64B) holds 16 floats → utilizes all x[i]~x[i+15] when reading x[i]
AoS vs SoA Selection Guide
| Situation | Recommended | Reason |
|---|---|---|
| Process same field in bulk (position updates) | SoA | Maximum cache efficiency |
| Use multiple fields of one object together | AoS | Simpler code |
| Apply SIMD vectorization | SoA | Easy to process 4/8 in parallel |
| Query/modify individual entities | AoS | Access with single index |
6. False Sharing
False Sharing Principle
flowchart LR
subgraph bad[❌ False Sharing]
CL["Cache Line 64B"]
C0["counter[0]"]
C1["counter[1]"]
C2["counter[2]"]
CL --> C0 --> C1 --> C2
T1["Thread 1 modifies"] -.->|invalidates| C0
T2["Thread 2 modifies"] -.->|invalidates| C1
end
When one thread modifies counter[0], the cache for counter[1] and counter[2] in the same cache line is invalidated, forcing other threads to reload from memory every time.
Problem: Multiple Threads Modify Same Cache Line
#include <thread>
#include <vector>
#include <atomic>
#include <chrono>
#include <iostream>
// ❌ Bad: false sharing occurs
void badParallelCounter() {
const int numThreads = 4;
std::vector<int> counters(numThreads, 0); // 4 ints = 16 bytes, same cache line!
std::vector<std::thread> threads;
for (int t = 0; t < numThreads; ++t) {
threads.emplace_back([&counters, t]() {
for (int i = 0; i < 10000000; ++i) {
counters[t]++; // Causes other threads' cache line invalidation
}
});
}
for (auto& th : threads) th.join();
}
Cause: When counters[0], counters[1], … fit in the same 64-byte cache line, every time one thread modifies counters[0] that cache line is invalidated in the other cores, and threads using counters[1] must refetch it from memory.
Solution: Cache Line Alignment
#include <cstddef>
// ✅ Good: align to cache line boundary
struct alignas(64) CacheLineAlignedCounter {
int value;
char padding[64 - sizeof(int)]; // Prevent cache line sharing
};
void goodParallelCounter() {
const int numThreads = 4;
std::vector<CacheLineAlignedCounter> counters(numThreads);
std::vector<std::thread> threads;
for (int t = 0; t < numThreads; ++t) {
threads.emplace_back([&counters, t]() {
for (int i = 0; i < 10000000; ++i) {
counters[t].value++;
}
});
}
for (auto& th : threads) th.join();
}
Using alignas (C++11) with std::atomic
// Each counter placed on separate cache line
struct alignas(64) ThreadLocalCounter {
std::atomic<int64_t> count{0};
};
void benchmarkCounters() {
const int numThreads = 4;
std::vector<ThreadLocalCounter> counters(numThreads);
auto start = std::chrono::high_resolution_clock::now();
std::vector<std::thread> threads;
for (int t = 0; t < numThreads; ++t) {
threads.emplace_back([&counters, t]() {
for (int i = 0; i < 10000000; ++i) {
counters[t].count.fetch_add(1, std::memory_order_relaxed);
}
});
}
for (auto& th : threads) th.join();
auto end = std::chrono::high_resolution_clock::now();
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "Aligned: " << ms << " ms\n";
}
Performance Difference: Eliminating false sharing often results in 2~10x speedup.
7. Using Prefetch
Basic Usage
// GCC/Clang provide __builtin_prefetch with no header; on MSVC use _mm_prefetch from <xmmintrin.h>
void processWithPrefetch(const std::vector<int>& data) {
const size_t n = data.size();
const int PREFETCH_DISTANCE = 8; // How many elements ahead to preload
for (size_t i = 0; i < n; ++i) {
// Preload next block into cache
if (i + PREFETCH_DISTANCE < n) {
__builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
// Args: (address, rw: 0 = read, locality: 3 = high, keep in all cache levels)
}
process(data[i]);
}
}
Prefetch in Linked List Traversal
struct Node {
int value;
Node* next;
};
int sumListWithPrefetch(Node* head) {
int sum = 0;
Node* curr = head;
while (curr != nullptr) {
// Preload next node (before following pointer)
if (curr->next != nullptr) {
__builtin_prefetch(curr->next, 0, 3);
}
sum += curr->value;
curr = curr->next;
}
return sum;
}
Following Index Array
// indices[i] is index into data → access data[indices[i]]
void gatherWithPrefetch(const std::vector<float>& data,
const std::vector<size_t>& indices) {
const size_t n = indices.size();
float sum = 0;
for (size_t i = 0; i < n; ++i) {
if (i + 4 < n) {
__builtin_prefetch(&data[indices[i + 4]], 0, 3);
}
sum += data[indices[i]];
}
}
Warning: If prefetch distance is too large, data gets evicted from cache. If too small, no effect. Experiment with 4~16.
8. Practical Optimization Patterns
Pattern 1: Block Processing
const int BLOCK_SIZE = 64; // Elements per block; tune so a block's working set fits in L1
void processBlocked(int* data, int N) {
for (int i = 0; i < N; i += BLOCK_SIZE) {
int end = std::min(i + BLOCK_SIZE, N);
for (int j = i; j < end; ++j) {
process(data[j]);
}
}
}
Pattern 2: Loop Fusion
std::vector<int> data(10000);
// ❌ Bad: multiple traversals
for (int& val : data) {
val *= 2;
}
for (int& val : data) {
val += 10;
}
// ✅ Good: single traversal
for (int& val : data) {
val *= 2;
val += 10;
}
Pattern 3: Improve Locality with Sorting
struct Entity {
int type;
// Data...
};
std::vector<Entity> entities;
// Sort by type
std::sort(entities.begin(), entities.end(),
[](const Entity& a, const Entity& b) {
return a.type < b.type;
});
// Process same types consecutively (cache-friendly)
for (const auto& e : entities) {
processType(e.type, e);
}
Pattern 4: Data Reorganization (SoA)
// ❌ Bad: AoS (Array of Structures)
struct Particle {
float x, y, z; // Position
float vx, vy, vz; // Velocity
float r, g, b; // Color
};
std::vector<Particle> particles(10000);
// Update position only (colors also loaded into cache, waste)
for (auto& p : particles) {
p.x += p.vx;
p.y += p.vy;
p.z += p.vz;
}
// ✅ Good: SoA (Structure of Arrays)
struct ParticleSystem {
std::vector<float> x, y, z;
std::vector<float> vx, vy, vz;
std::vector<float> r, g, b;
};
ParticleSystem particles;
particles.x.resize(10000);
// ...
// Update position only (only position data loaded into cache)
for (size_t i = 0; i < particles.x.size(); ++i) {
particles.x[i] += particles.vx[i];
particles.y[i] += particles.vy[i];
particles.z[i] += particles.vz[i];
}
SoA Detailed Explanation:
- AoS problem: each Particle is 36 bytes, so a 64-byte cache line holds fewer than two particles. A position update touches x/y/z/vx/vy/vz (24 of the 36 bytes); the colors ride along in every cache line unused, wasting about a third of memory bandwidth, and the interleaved layout also blocks SIMD vectorization.
- SoA advantage: a 64-byte cache line holds 16 floats from one array. Reading x[0] loads x[0]~x[15], and the next 15 accesses are all cache hits; every byte fetched is used.
- Performance improvement: for the 10,000-particle update above, AoS ~50μs vs SoA ~10μs, about 5x faster!
9. Common Errors and Solutions
Error 1: Severe Degradation from Column-Major Traversal
Symptom: 2D array processing is 10x+ slower than expected. Cause: C/C++ arrays are row-major but were traversed column-major. Solution:
// ❌ Wrong order
for (int col = 0; col < COLS; ++col)
for (int row = 0; row < ROWS; ++row)
process(matrix[row][col]);
// ✅ Correct order (row-major)
for (int row = 0; row < ROWS; ++row)
for (int col = 0; col < COLS; ++col)
process(matrix[row][col]);
Error 2: Multithreading Slowed by False Sharing
Symptom: Adding threads makes the program slower. Cause: Different threads modify variables in the same cache line. Solution:
// ❌ Sharing same cache line
std::atomic<int> counters[8];
// ✅ Cache line alignment
struct alignas(64) AlignedCounter {
std::atomic<int> value{0};
};
std::vector<AlignedCounter> counters(8);
Error 3: Performance Degradation from Excessive Prefetch
Symptom: Adding __builtin_prefetch makes it slower.
Cause: Prefetching too far ahead evicts valid data from cache.
Solution:
// ❌ Distance too large (cache pollution)
__builtin_prefetch(&data[i + 64], 0, 3);
// ✅ Appropriate distance (4~16)
__builtin_prefetch(&data[i + 8], 0, 3);
Error 4: Index Mismatch When Mixing SoA and AoS
Symptom: After converting to SoA, some particles reference wrong data.
Cause: Only some of the arrays were resized, or indices were computed inconsistently across arrays.
Solution:
// ✅ Maintain SoA size consistency
struct ParticleSystem {
std::vector<float> x, y, z, vx, vy, vz;
void resize(size_t n) {
x.resize(n); y.resize(n); z.resize(n);
vx.resize(n); vy.resize(n); vz.resize(n);
}
};
Error 5: Using list Instead of vector
Symptom: std::list traversal much slower than std::vector.
Cause: List nodes are scattered, causing many cache misses.
Solution:
// ❌ list for traversal only
std::list<int> items;
for (auto v : items) sum += v;
// ✅ vector for traversal-heavy workloads
std::vector<int> items;
for (auto v : items) sum += v;
10. Performance Benchmarks
Benchmark 1: Row-Major vs Column-Major
#include <iostream>
#include <chrono>
const int N = 4096;
int matrix[N][N];
void benchmarkRowMajor() {
long long sum = 0;
auto start = std::chrono::high_resolution_clock::now();
for (int row = 0; row < N; ++row) {
for (int col = 0; col < N; ++col) {
sum += matrix[row][col];
}
}
auto end = std::chrono::high_resolution_clock::now();
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "Row-major: " << ms << " ms (sum=" << sum << ")\n";
}
void benchmarkColMajor() {
long long sum = 0;
auto start = std::chrono::high_resolution_clock::now();
for (int col = 0; col < N; ++col) {
for (int row = 0; row < N; ++row) {
sum += matrix[row][col];
}
}
auto end = std::chrono::high_resolution_clock::now();
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "Col-major: " << ms << " ms (sum=" << sum << ")\n";
}
int main() {
benchmarkRowMajor(); // e.g., 15ms
benchmarkColMajor(); // e.g., 150ms (about 10x difference)
}
Benchmark 2: AoS vs SoA
Measured results (machine-dependent):
Environment: 100,000 particles, position-only updates 100 times
AoS: 52ms
SoA: 11ms
Ratio: SoA is about 4.7x faster
Benchmark 3: Eliminating False Sharing
Measured results (machine-dependent):
Environment: 4 threads, 10 million counter increments each
False sharing: 180ms
alignas(64) applied: 45ms
Ratio: about 4x faster
Benchmark 4: vector vs list Traversal
// 1 million element traversal sum
std::vector<int> vec(1000000);
std::list<int> lst(1000000);
// vector: ~2ms
// list: ~25ms (about 12x slower)
11. Production Patterns
Pattern 1: ECS (Entity-Component-System) Style SoA
// Common pattern in game engines
struct TransformComponent {
std::vector<float> x, y, z;
std::vector<float> rotX, rotY, rotZ;
};
struct RenderComponent {
std::vector<uint32_t> textureId;
std::vector<float> r, g, b, a;
};
void updateTransforms(TransformComponent& tf, float dt) {
for (size_t i = 0; i < tf.x.size(); ++i) {
tf.x[i] += 0.1f * dt; // Sequential position access only
tf.y[i] += 0.1f * dt;
tf.z[i] += 0.1f * dt;
}
}
Pattern 2: Cache Line Size Constants
namespace cache {
constexpr size_t LINE_SIZE = 64;
constexpr size_t L1_SIZE = 32 * 1024;
constexpr size_t L2_SIZE = 256 * 1024;
}
template<typename T>
struct alignas(cache::LINE_SIZE) CacheLineAligned {
T value;
};
Pattern 3: Optimize After Profiling
// 1. Check cache misses with profiler (perf, VTune)
// 2. Identify loops with many cache misses
// 3. Apply traversal order, SoA conversion, prefetch
// 4. Verify with benchmarks
Pattern 4: Data-Oriented Design Checklist
- [ ] Use contiguous memory (vector > list)
- [ ] Row-major traversal (2D arrays)
- [ ] Consider SoA (when processing same fields in bulk)
- [ ] Separate hot/cold (group frequently used fields only)
- [ ] Cache line alignment for multithreading (prevent false sharing)
- [ ] Experiment with prefetch (pointer chaining, index arrays)
Performance Comparison Summary
| Optimization | Effect | Difficulty |
|---|---|---|
| Row-major traversal | 5~10x | Easy |
| SoA conversion | 3~5x | Medium |
| Eliminate false sharing | 2~10x | Easy |
| Remove list, use vector | 5~20x | Easy |
| Prefetch | 1.2~1.5x | Medium |
Related Articles (Internal Links)
Other articles connected to this topic.
- C++ Profiling | Finding Bottlenecks with perf and gprof
- C++ Compile-Time Optimization | constexpr, PCH, Modules, ccache, Unity Build [#15-3]
- C++ Move Semantics | Eliminate Unnecessary Copies and Optimize Performance with std::move
Keywords Covered in This Article (Related Searches)
Search for C++ cache-friendly, cache friendly, memory access patterns, performance optimization, data-oriented design, AoS SoA, false sharing, prefetch to find this article helpful.
Summary
| Principle | Description |
|---|---|
| Contiguous Memory | vector > list |
| Row-Major Traversal | 2D arrays are row-major |
| SoA > AoS | Separate frequently used fields only |
| Minimize Padding | Large fields first |
| Loop Fusion | Multiple traversals → one |
| Sorting | Process same types consecutively |
| Prevent False Sharing | Separate per-thread data with alignas(64) |
| Prefetch | Preload next block when following pointers/indices |
Core Principles:
- Prefer contiguous memory
- Maximize sequential access
- Separate hot data
- Consider cache lines
- Verify with measurements
Frequently Asked Questions (FAQ)
Q. When do you use this in production?
A. This covers CPU cache principles, reducing cache misses, data locality, and optimizing memory access patterns. Apply the examples and selection guide from the main text in production.
Q. What should I read first?
A. Follow the previous article links at the bottom of each post to learn in order. Check the C++ Series Index for the complete flow.
Q. How can I study deeper?
A. Refer to cppreference and official library documentation. Use the reference links at the end of the article.

One-Line Summary: Leveraging contiguous memory and locality increases cache hits, improving performance.

Previous Article: [C++ Practical Guide #15-1] Profiling and Finding Bottlenecks: Performance Measurement Basics
Next Article: [C++ Practical Guide #15-3] Compile-Time Optimization: constexpr and Template Metaprogramming
Related Articles
- C++ Profiling | Finding Bottlenecks with perf and gprof
- C++ Compile-Time Optimization | constexpr, PCH, Modules, ccache, Unity Build [#15-3]
- C++ Cache-Efficient Code: Data-Oriented Design Guide
- C++ Cache Optimization in Practice | Cache-Friendly Structures, Prefetch, False Sharing, AoS vs SoA Guide
- C++ STL Algorithm Basics | sort, find, count, transform, accumulate Guide