C++ Memory Alignment | Complete Guide to Alignment, Padding & False Sharing

Key Points of This Article

This guide covers C++ memory alignment and padding, the alignas and alignof keywords, false sharing prevention, and struct layout optimization, with practical examples.

Introduction

Memory alignment is the requirement that an object's address be a multiple of some power of two so that the CPU can read and write it efficiently. Compilers insert padding between struct members to satisfy this requirement, which directly affects object size and performance.

What You’ll Learn

  • Check and control alignment with alignof and alignas
  • Optimize struct member order to save memory
  • Prevent False Sharing to improve multithreaded performance
  • Master advanced patterns like SIMD and cache line optimization

Reality in Production

When you are learning, everything seems clean and theoretical, but production is different: you wrestle with legacy code, chase tight deadlines, and face unexpected bugs. I first learned the material in this guide as theory, and only after applying it to real projects did I understand why things are designed the way they are. The trial and error of my first project is especially memorable. I followed what the books said but could not work out why my code did not behave as expected, and I was stuck for days. A senior developer's code review finally uncovered the problem, and I learned a lot in the process. This guide therefore covers not just theory but also the pitfalls and solutions you are likely to meet in practice.

Table of Contents

  1. Memory Alignment Basics
  2. Practical Implementation
  3. Advanced Usage
  4. Performance Comparison
  5. Real-World Cases
  6. Troubleshooting
  7. Conclusion

Memory Alignment Basics

What is Alignment?

Each type has an alignment boundary: a power-of-two value of which the starting address of every object of that type must be a multiple. For example, a 4-byte int should start at an address that is a multiple of 4. Some CPUs fault on unaligned access, while others handle it with a performance penalty.
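To make "address is a multiple of the alignment" concrete, here is a minimal sketch of two helpers. The names is_aligned and padding_to_boundary are hypothetical and chosen only for this illustration; both assume the alignment is a power of two, which holds for every alignment C++ allows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `p` sits on an `alignment`-byte boundary.
// `alignment` is assumed to be a non-zero power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: number of bytes to skip from `p` to reach the next
// `alignment`-byte boundary (0 if `p` is already aligned).
inline std::size_t padding_to_boundary(const void* p, std::size_t alignment) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    return static_cast<std::size_t>((alignment - addr % alignment) % alignment);
}
```

The second helper performs the same calculation a compiler does when it decides how many padding bytes to insert before a struct member.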

Alignment Requirements by Type

#include <iostream>
using namespace std;
int main() {
    cout << "char: " << alignof(char) << endl;      // 1
    cout << "short: " << alignof(short) << endl;    // 2
    cout << "int: " << alignof(int) << endl;        // 4
    cout << "long: " << alignof(long) << endl;      // 4 (64-bit Windows) / 8 (64-bit Linux)
    cout << "double: " << alignof(double) << endl;  // 8
    cout << "int*: " << alignof(int*) << endl;      // 8 (64-bit)
    
    return 0;
}

Why Padding Occurs

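The compiler inserts unused bytes (padding) so that each member starts on its own alignment boundary. It also adds tail padding so that sizeof is a multiple of the struct's alignment, which keeps every element of an array aligned.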

struct Bad {
    char c;    // Offset 0 (1 byte)
    // 7 bytes padding (offsets 1-7)
    double d;  // Offset 8 (8 bytes)
    int i;     // Offset 16 (4 bytes)
    // 4 bytes tail padding (offsets 20-23)
};  // Total 24 bytes on typical 64-bit platforms
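Rather than counting padding by hand, the layout can be measured with offsetof. The following is a minimal sketch; the struct Sample and the two constants are hypothetical names used only for this illustration, and the values mentioned in the comments assume a typical 64-bit ABI where alignof(double) is 8.

```cpp
#include <cstddef>   // offsetof, std::size_t

// Hypothetical struct used only for this illustration.
struct Sample {
    char   c;  // 1 byte
    double d;  // 8 bytes, must start on an 8-byte boundary
    int    i;  // 4 bytes
};

// Padding the compiler placed between `c` and `d`
// (7 bytes where alignof(double) == 8).
constexpr std::size_t gap_before_d =
    offsetof(Sample, d) - (offsetof(Sample, c) + sizeof(char));

// Padding after the last member, added so that every element of a
// Sample array stays aligned (4 bytes under the same assumption).
constexpr std::size_t tail_padding =
    sizeof(Sample) - (offsetof(Sample, i) + sizeof(int));
```

Printing these constants next to sizeof(Sample) is a quick way to check a layout on the platform you actually target.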

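An everyday analogy: think of memory as an apartment building. The stack is like an elevator: fast, but with limited space. The heap is like a warehouse: spacious, but it takes time to find things. Pointers are like notes that record an address, such as “Floor 3, Unit 302”. Alignment fits the same picture: each type may only move into a unit whose number is a multiple of its alignment, and the units skipped in between are the padding.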

Practical Implementation

1) Struct Padding Optimization

Inefficient Layout

#include <iostream>
using namespace std;
struct Bad {
    char c;    // 1 byte
    // 7 bytes padding
    double d;  // 8 bytes
    int i;     // 4 bytes
    // 4 bytes tail padding
};  // Total 24 bytes on typical 64-bit platforms
int main() {
    cout << "Bad: " << sizeof(Bad) << endl;  // 24
    
    return 0;
}

Optimized Layout

struct Good {
    double d;  // 8 bytes
    int i;     // 4 bytes
    char c;    // 1 byte
    // 3 bytes padding
};  // Total 16 bytes
int main() {
    cout << "Good: " << sizeof(Good) << endl;  // 16
    
    return 0;
}

Optimization Principles

  1. Order members from largest alignment to smallest (double → int → char)
  2. Group same-size types
  3. Minimize padding

struct Best {
    double d1;  // 8 bytes
    double d2;  // 8 bytes
    int i1;     // 4 bytes
    int i2;     // 4 bytes
    char c1;    // 1 byte
    char c2;    // 1 byte
    char c3;    // 1 byte
    char c4;    // 1 byte
    // 4 bytes tail padding (28 bytes of members rounded up to a multiple of 8)
};  // Total 32 bytes (no padding between members)
int main() {
    cout << "Best: " << sizeof(Best) << endl;  // 32
    
    return 0;
}
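The principles above follow from a simple rule that can be simulated directly: round the running offset up to each member's alignment, add the member's size, and finally round the total up to the largest alignment. The sketch below is an illustration of that rule, not a description of any particular compiler. Member, round_up, and layout_size are hypothetical names, and bit-fields, base classes, and packing pragmas are ignored.

```cpp
#include <cstddef>
#include <initializer_list>

// Hypothetical description of one member: its size and alignment in bytes.
struct Member {
    std::size_t size;
    std::size_t align;
};

// Round `offset` up to the next multiple of `align`.
constexpr std::size_t round_up(std::size_t offset, std::size_t align) {
    return (offset + align - 1) / align * align;
}

// Simulate the usual struct layout rule for members in declaration order.
constexpr std::size_t layout_size(std::initializer_list<Member> members) {
    std::size_t offset = 0;
    std::size_t max_align = 1;
    for (const Member& m : members) {
        offset = round_up(offset, m.align) + m.size;  // padding, then the member
        if (m.align > max_align) max_align = m.align;
    }
    return round_up(offset, max_align);               // tail padding
}

// char, double, int: 1 + 7 padding + 8 + 4 + 4 tail padding
static_assert(layout_size({{1, 1}, {8, 8}, {4, 4}}) == 24, "unordered layout");
// double, int, char: 8 + 4 + 1 + 3 tail padding
static_assert(layout_size({{8, 8}, {4, 4}, {1, 1}}) == 16, "ordered layout");
```

Feeding the same three members in two different orders shows why ordering alone changes the size from 24 to 16 bytes.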

2) alignas - Specify Alignment

Syntax:

alignas(alignment) type name;

Struct Alignment

#include <iostream>
using namespace std;
struct alignas(16) Aligned {
    int x;
    int y;
};
int main() {
    cout << "Alignment: " << alignof(Aligned) << endl;  // 16
    cout << "Size: " << sizeof(Aligned) << endl;   // 16
    
    return 0;
}

Variable Alignment

#include <cstdint>   // uintptr_t
#include <iostream>
using namespace std;
int main() {
    alignas(64) int cacheLine[16];  // 64-byte alignment
    
    cout << "Address: " << reinterpret_cast<uintptr_t>(cacheLine) << endl;
    // The printed address is a multiple of 64
    
    return 0;
}
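alignas covers variables and members, but a heap buffer needs its alignment requested at allocation time. One option, sketched below under the assumption of a C++17 compiler, is the aligned form of operator new. AlignedDelete and make_aligned_floats are hypothetical names introduced for this example.

```cpp
#include <cstddef>
#include <memory>
#include <new>      // std::align_val_t, aligned operator new/delete (C++17)

// Hypothetical deleter: releases memory obtained from aligned operator new.
// The same alignment must be passed to operator delete.
template <std::size_t Alignment>
struct AlignedDelete {
    void operator()(void* p) const noexcept {
        ::operator delete(p, std::align_val_t{Alignment});
    }
};

template <std::size_t Alignment>
using AlignedFloats = std::unique_ptr<float[], AlignedDelete<Alignment>>;

// Hypothetical helper: allocate `count` floats on an `Alignment`-byte boundary.
// The floats are left uninitialized, as with a plain local array.
template <std::size_t Alignment>
AlignedFloats<Alignment> make_aligned_floats(std::size_t count) {
    void* raw = ::operator new(count * sizeof(float), std::align_val_t{Alignment});
    return AlignedFloats<Alignment>(static_cast<float*>(raw));
}
```

For example, `auto buf = make_aligned_floats<64>(16);` yields a buffer whose address is a multiple of 64 and that frees itself when buf goes out of scope.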

3) Removing Padding (pragma pack)

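Warning: #pragma pack is a compiler extension (supported by MSVC, GCC, and Clang). Packed members can be slower to access, and dereferencing a pointer or reference to a misaligned member is undefined behavior.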

#include <iostream>
using namespace std;
#pragma pack(push, 1)
struct Packed {
    char c;    // 1 byte
    int i;     // 4 bytes
    double d;  // 8 bytes
};  // Total 13 bytes (no padding)
#pragma pack(pop)
int main() {
    cout << "Packed: " << sizeof(Packed) << endl;  // 13
    
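    Packed p;
    p.i = 10;  // OK: the compiler knows p.i is unaligned and emits a suitable (possibly slower) store
    // int* q = &p.i;  // Dangerous: q is misaligned, so *q = 10 is undefined behavior (may crash)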
    
    return 0;
}

Use scenarios:

  • Network protocols (packet structures)
  • File formats (binary serialization)
  • Hardware interfaces (register maps)
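For these byte-oriented formats, a common alternative to dereferencing a misaligned pointer is to copy the bytes with std::memcpy, which is valid at any offset and which compilers typically turn into a single load or store where the hardware allows it. A minimal sketch follows; load_u32 and store_u32 are hypothetical names, and the value keeps the host's byte order.

```cpp
#include <cstdint>
#include <cstring>   // std::memcpy

// Hypothetical helper: read a 32-bit value from any byte offset,
// aligned or not, in host byte order.
inline std::uint32_t load_u32(const unsigned char* src) {
    std::uint32_t value;
    std::memcpy(&value, src, sizeof value);
    return value;
}

// Hypothetical helper: write a 32-bit value to any byte offset.
inline void store_u32(unsigned char* dst, std::uint32_t value) {
    std::memcpy(dst, &value, sizeof value);
}
```

With these, a field at offset 1 of a packet buffer can be read as `load_u32(buffer + 1)` without ever forming a misaligned `uint32_t*`.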

Advanced Usage

1) Preventing False Sharing

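False sharing: a performance degradation that occurs when threads on different cores modify different variables that share a cache line. Each write invalidates the other core's copy of the line, so the line bounces between cores even though no data is logically shared.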

Problem Code

#include <atomic>
#include <thread>
#include <vector>
#include <chrono>
#include <iostream>
struct Counters {
    std::atomic<int> counter1{0};  // bytes 0-3
    std::atomic<int> counter2{0};  // bytes 4-7
};  // Both counters share one cache line (typically 64 bytes)
int main() {
    Counters counters;
    
    auto start = std::chrono::high_resolution_clock::now();
    
    std::thread t1([&]() {
        for (int i = 0; i < 10000000; ++i) {
            counters.counter1++;
        }
    });
    
    std::thread t2([&]() {
        for (int i = 0; i < 10000000; ++i) {
            counters.counter2++;
        }
    });
    
    t1.join();
    t2.join();
    
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    
    std::cout << "False Sharing: " << duration << "ms" << std::endl;
    // Example result: about 500 ms (the exact time depends on the hardware)
    
    return 0;
}

Solution Code

struct CountersAligned {
    alignas(64) std::atomic<int> counter1{0};
    alignas(64) std::atomic<int> counter2{0};
};  // Each counter starts on its own 64-byte cache line
int main() {
    CountersAligned counters;
    
    auto start = std::chrono::high_resolution_clock::now();
    
    std::thread t1([&]() {
        for (int i = 0; i < 10000000; ++i) {
            counters.counter1++;
        }
    });
    
    std::thread t2([&]() {
        for (int i = 0; i < 10000000; ++i) {
            counters.counter2++;
        }
    });
    
    t1.join();
    t2.join();
    
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    
    std::cout << "No False Sharing: " << duration << "ms" << std::endl;
    // Example result: about 150 ms (roughly 3x faster; the exact time depends on the hardware)
    
    return 0;
}
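The hard-coded 64 is an assumption about the cache line size. C++17 defines std::hardware_destructive_interference_size for this purpose, but not every standard library provides it, so the sketch below falls back to 64 when the feature-test macro is absent. kCacheLine, PaddedCounter, and on_different_cache_lines are hypothetical names for this illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <new>   // std::hardware_destructive_interference_size, when available

// Use the library's value when it is provided; otherwise assume 64 bytes.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kCacheLine = 64;  // assumption: common on x86-64
#endif

// Hypothetical wrapper: one counter per cache line. alignas also rounds
// sizeof up to kCacheLine, so array elements never share a line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<int> value{0};
};

static_assert(sizeof(PaddedCounter) % kCacheLine == 0,
              "each PaddedCounter must occupy whole cache lines");

// True if the two addresses fall into different kCacheLine-sized blocks.
inline bool on_different_cache_lines(const void* a, const void* b) {
    const auto line_a = reinterpret_cast<std::uintptr_t>(a) / kCacheLine;
    const auto line_b = reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
    return line_a != line_b;
}
```

A `PaddedCounter counters[N]` array can then replace the two alignas(64) members, with one element per thread.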

2) SIMD Alignment

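The aligned load/store instructions of SSE and AVX (for example _mm_load_ps and _mm256_load_ps) require 16-byte and 32-byte alignment, respectively. The example below needs AVX enabled at compile time (for example -mavx on GCC/Clang).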

#include <immintrin.h>
#include <iostream>
int main() {
    // ❌ Not aligned
    float data1[8];
    // __m256 a = _mm256_load_ps(data1);  // May crash
    
    // ✅ 32-byte alignment
    alignas(32) float data2[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    __m256 a = _mm256_load_ps(data2);  // Safe
    
    // Operation
    __m256 b = _mm256_set1_ps(2.0f);
    __m256 c = _mm256_mul_ps(a, b);
    
    // Store result
    alignas(32) float result[8];
    _mm256_store_ps(result, c);
    
    for (float x : result) {
        std::cout << x << " ";  // 2 4 6 8 10 12 14 16
    }
    
    return 0;
}

Performance Comparison

Aligned vs Unaligned Access

Test: 100 million int reads

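| Access Type | Time | Relative Speed |
|---|---|---|
| Aligned (4-byte boundary) | 50 ms | 1x |
| Unaligned (1-byte boundary) | 200 ms | 0.25x |

Conclusion: aligned access was 4x faster in this test. The gap depends heavily on the CPU; on recent x86-64 processors it is often much smaller unless an access crosses a cache line boundary.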

False Sharing Comparison

Test: 2 threads, 10 million increments each

| Structure | Time | Speedup |
|---|---|---|
| False sharing (same cache line) | 500 ms | 1x |
| Cache line separation (alignas(64)) | 150 ms | 3.3x |

Conclusion: about a 3x improvement with cache line separation

Struct Size Comparison

struct Bad {
    char c;    // 1 + 7 padding
    double d;  // 8
    int i;     // 4 + 4 tail padding
};  // 24 bytes
struct Good {
    double d;  // 8
    int i;     // 4
    char c;    // 1 + 3 padding
};  // 16 bytes

Conclusion: reordering the members shrinks the struct from 24 to 16 bytes, a 33% saving

Real-World Cases

Case 1: Multithreaded Counter - False Sharing Prevention

#include <atomic>
#include <thread>
#include <vector>
#include <iostream>
struct alignas(64) AlignedCounter {
    std::atomic<int> counter{0};
    // alignas(64) already rounds sizeof(AlignedCounter) up to 64;
    // the explicit padding only documents the layout
    char padding[60];
};
int main() {
    AlignedCounter counters[4];
    
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back([&, i]() {
            for (int j = 0; j < 1000000; ++j) {
                counters[i].counter++;
            }
        });
    }
    
    for (auto& t : threads) {
        t.join();
    }
    
    for (int i = 0; i < 4; ++i) {
        std::cout << "Counter " << i << ": " << counters[i].counter << std::endl;
    }
    
    return 0;
}

Case 2: SIMD Vector Operations

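#include <immintrin.h>
#include <cstddef>
#include <iostream>
// a, b, and c must be 32-byte aligned. Compile with AVX enabled (e.g. -mavx on GCC/Clang).
void vectorAdd(const float* a, const float* b, float* c, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {  // Full chunks of 8 floats
        __m256 va = _mm256_load_ps(a + i);
        __m256 vb = _mm256_load_ps(b + i);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_store_ps(c + i, vc);
    }
    for (; i < n; ++i) {  // Scalar tail when n is not a multiple of 8
        c[i] = a[i] + b[i];
    }
}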
int main() {
    alignas(32) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(32) float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    alignas(32) float c[8];
    
    vectorAdd(a, b, c, 8);
    
    for (float x : c) {
        std::cout << x << " ";  // 9 9 9 9 9 9 9 9
    }
    
    return 0;
}
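The arrays above get their alignment from alignas, but std::vector<float> gives no 32-byte guarantee for its heap storage. One way to close that gap, sketched here under the assumption of C++17, is a small allocator built on aligned operator new. AlignedAllocator and AvxFloatVector are hypothetical names for this example, not part of any library.

```cpp
#include <cstddef>
#include <new>      // std::align_val_t, aligned operator new/delete (C++17)
#include <vector>

// Hypothetical minimal allocator: every allocation starts on an
// `Alignment`-byte boundary.
template <class T, std::size_t Alignment>
struct AlignedAllocator {
    static_assert(Alignment >= alignof(T), "Alignment is too small for T");
    using value_type = T;

    AlignedAllocator() noexcept = default;
    template <class U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    // rebind must be spelled out because of the non-type Alignment parameter.
    template <class U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n) {
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t{Alignment}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Alignment});
    }
};

// All instances are interchangeable because the allocator is stateless.
template <class T, class U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept {
    return true;
}
template <class T, class U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept {
    return false;
}

// A float vector whose storage starts on a 32-byte boundary,
// as _mm256_load_ps expects.
using AvxFloatVector = std::vector<float, AlignedAllocator<float, 32>>;
```

Only data() is guaranteed to be aligned; an offset such as data() + i is 32-byte aligned only when i is a multiple of 8, which is exactly the stride vectorAdd uses.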

Case 3: Network Protocol - Padding Removal

#include <cstdint>
#include <iostream>
#pragma pack(push, 1)
struct PacketHeader {
    uint8_t version;     // 1 byte
    uint16_t length;     // 2 bytes
    uint32_t sequence;   // 4 bytes
    uint64_t timestamp;  // 8 bytes
};  // Total 15 bytes (no padding)
#pragma pack(pop)
int main() {
    std::cout << "PacketHeader: " << sizeof(PacketHeader) << std::endl;  // 15
    
    PacketHeader header;
    header.version = 1;
    header.length = 100;
    header.sequence = 12345;
    header.timestamp = 1234567890;
    
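    // Send over network. Multi-byte fields are still in host byte order here;
    // convert them first (e.g. with htons/htonl) if the protocol specifies network byte order.
    // send(socket, &header, sizeof(header), 0);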
    
    return 0;
}
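A packed struct fixes the field offsets but not the byte order, and it relies on a compiler extension. An alternative is to write each field into a byte buffer explicitly. The sketch below assumes the protocol uses big-endian (network) byte order, which the original example does not state; kHeaderSize, put_be, and encode_header are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kHeaderSize = 15;  // 1 + 2 + 4 + 8 bytes, no padding

// Hypothetical helper: write the low `bytes` bytes of `value` to `out`,
// most significant byte first, and return the position after them.
inline std::uint8_t* put_be(std::uint8_t* out, std::uint64_t value, int bytes) {
    assert(bytes >= 1 && bytes <= 8);
    for (int shift = (bytes - 1) * 8; shift >= 0; shift -= 8) {
        *out++ = static_cast<std::uint8_t>(value >> shift);
    }
    return out;
}

// Hypothetical encoder for the same four fields as PacketHeader.
// Assumption: the wire format is big-endian.
inline std::array<std::uint8_t, kHeaderSize> encode_header(std::uint8_t version,
                                                           std::uint16_t length,
                                                           std::uint32_t sequence,
                                                           std::uint64_t timestamp) {
    std::array<std::uint8_t, kHeaderSize> buf{};
    std::uint8_t* p = buf.data();
    p = put_be(p, version, 1);
    p = put_be(p, length, 2);
    p = put_be(p, sequence, 4);
    put_be(p, timestamp, 8);
    return buf;
}
```

The resulting array has the same 15-byte size as the packed struct, but its contents are identical on little-endian and big-endian hosts and no misaligned member is ever accessed.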

Case 4: Game Engine - Data-Oriented Design

#include <vector>
#include <iostream>
// ❌ AoS (Array of Structures) - Many cache misses
struct EntityAoS {
    float x, y, z;      // Position
    float vx, vy, vz;   // Velocity
    int health;
    int id;
};
std::vector<EntityAoS> entitiesAoS(10000);
// ✅ SoA (Structure of Arrays) - Cache-friendly
struct EntitiesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<int> health;
    std::vector<int> id;
};
void updatePositions(EntitiesSoA& entities, float dt) {
    for (size_t i = 0; i < entities.x.size(); ++i) {
        entities.x[i] += entities.vx[i] * dt;
        entities.y[i] += entities.vy[i] * dt;
        entities.z[i] += entities.vz[i] * dt;
    }
}
int main() {
    EntitiesSoA entities;
    entities.x.resize(10000);
    entities.y.resize(10000);
    entities.z.resize(10000);
    entities.vx.resize(10000, 1.0f);
    entities.vy.resize(10000, 1.0f);
    entities.vz.resize(10000, 1.0f);
    
    updatePositions(entities, 0.016f);
    
    std::cout << "Position update complete" << std::endl;
    
    return 0;
}

Troubleshooting

Problem 1: Unaligned Access

Symptom: Crash or performance degradation

// ❌ Not aligned
char buffer[100];
int* ptr = reinterpret_cast<int*>(buffer + 1);
*ptr = 10;  // Unaligned: undefined behavior (slow on some CPUs, a crash on others)
// ✅ Guarantee alignment and create the int with placement new (requires <new>)
alignas(int) char alignedBuffer[100];
int* alignedPtr = new (alignedBuffer) int(10);
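When the starting offset inside a buffer is not under your control, std::align from <memory> can find the next suitably aligned address. A minimal sketch follows; emplace_int is a hypothetical name for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>   // std::align
#include <new>      // placement new

// Hypothetical helper: construct an int at the first suitably aligned
// address inside [buffer, buffer + buffer_size).
// Returns nullptr if no aligned int fits in the remaining space.
inline int* emplace_int(unsigned char* buffer, std::size_t buffer_size, int value) {
    assert(buffer != nullptr);
    void* cursor = buffer;
    std::size_t space = buffer_size;
    // std::align moves `cursor` forward to the next alignof(int) boundary and
    // shrinks `space` by the bytes skipped; it returns nullptr if the int
    // would no longer fit.
    if (std::align(alignof(int), sizeof(int), cursor, space) == nullptr) {
        return nullptr;
    }
    return new (cursor) int(value);
}
```

Starting one byte past an int-aligned buffer, the helper skips to the next boundary (3 bytes on platforms where alignof(int) is 4) instead of writing through a misaligned pointer.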

Problem 2: Struct Size Assumptions

Symptom: Serialization errors, memory calculation errors

struct Data {
    char c;
    int i;
};
// ❌ Wrong assumption
// Assume sizeof(Data) == 5 (actually 8)
// ✅ Use sizeof
size_t size = sizeof(Data);  // 8
// ✅ Verify with static_assert
static_assert(sizeof(Data) == 8, "Data size mismatch");

Problem 3: Platform Differences

Symptom: Different sizes on Windows and Linux

struct Data {
    long l;
    int i;
};
// Windows 64-bit: sizeof(long) == 4
// Linux 64-bit: sizeof(long) == 8
// ✅ Use fixed-size types
#include <cstdint>
struct DataFixed {
    int64_t l;  // Always 8 bytes
    int32_t i;  // Always 4 bytes
};

Problem 4: SIMD Alignment Error

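Symptom: _mm256_load_ps crash

// ❌ Not aligned: a float array only guarantees 4-byte alignment
float data[8] = {};
__m256 a = _mm256_load_ps(data);  // May fault if data is not on a 32-byte boundary
// ✅ 32-byte alignment
alignas(32) float aligned[8] = {};
__m256 b = _mm256_load_ps(aligned);  // Safe
// Or use unaligned load
__m256 c = _mm256_loadu_ps(data);  // Works for any address; may be slower when the data is misaligned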

Conclusion

C++ memory alignment directly impacts performance and memory efficiency.

Key Summary

  1. Alignment Basics
    • CPU requires alignment boundary per type
    • Compiler inserts padding to maintain alignment
    • Check with alignof, control with alignas
  2. Struct Optimization
    • Place large types first
    • Group same-size types
    • Minimize padding
  3. False Sharing Prevention
    • Separate cache lines (64 bytes)
    • Use alignas(64)
    • About 3x faster in the two-thread benchmark above
  4. SIMD Optimization
    • SSE: 16-byte alignment for aligned loads/stores
    • AVX: 32-byte alignment for aligned loads/stores
    • _mm256_load_ps (requires alignment) vs _mm256_loadu_ps (any address)

Selection Guide

| Situation | Method |
|---|---|
| Reduce struct size | Place large types first |
| Multithreaded counter | alignas(64) |
| SIMD operations | alignas(32) |
| Network protocol | #pragma pack(1) |

Code Example Cheatsheet

// Check alignment
cout << alignof(int) << endl;
// Check size
cout << sizeof(MyStruct) << endl;
// Specify alignment
alignas(64) int cacheLine[16];
// Struct alignment
struct alignas(16) Aligned { int x, y; };
// Remove padding (caution!)
#pragma pack(push, 1)
struct Packed { char c; int i; };
#pragma pack(pop)
// SIMD alignment
alignas(32) float data[8];
__m256 a = _mm256_load_ps(data);

Next Steps

  • Cache Optimization: C++ Cache Optimization
  • Data-Oriented Design: C++ Cache and Data-Oriented Design
  • Cache-Friendly Code: C++ Writing Cache-Friendly Code
