C++ Branch Prediction | Complete Guide to likely/unlikely Optimization

C++ Branch Prediction | Complete Guide to likely/unlikely Optimization

이 글의 핵심

C++ branch prediction: CPU pipeline, misprediction penalty, [[likely]]/[[unlikely]], branch elimination, sorting effects, and PGO with practical examples.

Introduction

Modern CPUs use pipeline and superscalar architectures to overlap execution of multiple instructions. When branches in if or loops split the address of next instruction to execute based on conditions, we don’t know “which path will execute” until the condition is computed. So CPUs use branch predictor to perform speculative execution on one path based on past patterns. If correct, proceed; if wrong, pipeline flush (misprediction penalty) wastes dozens of cycles.

What You’ll Learn

  • Understand branch prediction principles and misprediction penalty
  • Provide hints to compiler with [[likely]], [[unlikely]]
  • Optimize performance with branch elimination, sorting, and PGO
  • Master commonly used branch optimization patterns in production

Table of Contents

  1. Basic Concepts
  2. Practical Implementation
  3. Advanced Usage
  4. Performance Comparison
  5. Real-World Cases
  6. Troubleshooting
  7. Conclusion

Basic Concepts

CPU Branch Prediction Principles

  1. Sequential execution assumption: Pipeline normally fetches “next address” sequentially.
  2. Conditional branch: When destination has two or more branches before condition is determined, predictor chooses one side.
  3. Misprediction: Discard work already started on wrong path and refill with correct address. This cost repeated millions of times in loops significantly impacts perceived performance.

Predictable vs Unpredictable

// Predictable branch: fast
for (int i = 0; i < n; ++i) {
    if (i < n/2) {  // Always same pattern
        // ...
    }
}
// Unpredictable branch: slow
for (int i = 0; i < n; ++i) {
    if (data[i] % 2 == 0) {  // Random
        // ...
    }
}

Practical Implementation

1) [[likely]] / [[unlikely]] (C++20)

#include <iostream>
#include <stdexcept>
int divide(int a, int b) {
    if (b == 0) [[unlikely]] {
        throw std::runtime_error("Division by zero");
    }
    
    return a / b;
}
void process(int* data, int n) {
    for (int i = 0; i < n; ++i) {
        if (data[i] > 0) [[likely]] {
            // Mostly positive
            processPositive(data[i]);
        } else {
            processNegative(data[i]);
        }
    }
}
int main() {
    int result = divide(10, 2);
    std::cout << result << std::endl;
    
    return 0;
}

2) Branch Elimination (Branchless)

Conditional Move (cmov)

// ❌ Branch
int max(int a, int b) {
    if (a > b) {
        return a;
    } else {
        return b;
    }
}
// ✅ Conditional move
int max(int a, int b) {
    return (a > b) ? a : b;
}
// Compiler generates cmov instruction

Using Masks

#include <vector>
#include <chrono>
#include <iostream>
int main() {
    std::vector<int> data(10000000);
    for (int i = 0; i < data.size(); ++i) {
        data[i] = i % 100 - 50;
    }
    
    // ❌ Many branches
    auto start1 = std::chrono::high_resolution_clock::now();
    int sum1 = 0;
    for (int x : data) {
        if (x > 0) {
            sum1 += x;
        }
    }
    auto end1 = std::chrono::high_resolution_clock::now();
    auto time1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
    
    // ✅ Branch elimination
    auto start2 = std::chrono::high_resolution_clock::now();
    int sum2 = 0;
    for (int x : data) {
        int mask = (x > 0) ? 1 : 0;
        sum2 += x * mask;
    }
    auto end2 = std::chrono::high_resolution_clock::now();
    auto time2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
    
    std::cout << "With branches: " << time1 << "ms" << std::endl;
    std::cout << "Branchless: " << time2 << "ms" << std::endl;
    
    return 0;
}

3) Sorting Effect

#include <algorithm>
#include <vector>
#include <random>
#include <chrono>
#include <iostream>
int main() {
    std::vector<int> data(10000000);
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, 10000);
    std::generate(data.begin(), data.end(), [&]() { return dis(gen); });
    
    // ❌ Random data (unpredictable)
    auto start1 = std::chrono::high_resolution_clock::now();
    int sum1 = 0;
    for (int x : data) {
        if (x > 5000) {
            sum1 += x;
        }
    }
    auto end1 = std::chrono::high_resolution_clock::now();
    auto time1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
    
    // ✅ After sorting (predictable)
    std::sort(data.begin(), data.end());
    
    auto start2 = std::chrono::high_resolution_clock::now();
    int sum2 = 0;
    for (int x : data) {
        if (x > 5000) {
            sum2 += x;
        }
    }
    auto end2 = std::chrono::high_resolution_clock::now();
    auto time2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
    
    std::cout << "Random: " << time1 << "ms" << std::endl;
    std::cout << "Sorted: " << time2 << "ms" << std::endl;
    
    return 0;
}

Result: 2-3x faster after sorting

Advanced Usage

1) PGO (Profile-Guided Optimization)

Profile Generation

# GCC/Clang
g++ -O3 -fprofile-generate program.cpp -o program
./program  # Run representative workload
g++ -O3 -fprofile-use program.cpp -o program_optimized

MSVC

# Generate profile
cl /O2 /GL /LTCG:PGI program.cpp
program.exe  # Run representative workload
# Use profile
cl /O2 /GL /LTCG:PGO program.cpp

2) Lookup Tables

// ❌ Many branches
int getDayName(int day) {
    if (day == 0) return "Sunday";
    if (day == 1) return "Monday";
    if (day == 2) return "Tuesday";
    // ...
}
// ✅ Lookup table
const char* DAY_NAMES[] = {
    "Sunday", "Monday", "Tuesday", "Wednesday",
    "Thursday", "Friday", "Saturday"
};
const char* getDayName(int day) {
    return DAY_NAMES[day];
}

3) Virtual Function Optimization

// ❌ Virtual function (indirect branch)
class Shape {
public:
    virtual double area() const = 0;
};
std::vector<Shape*> shapes;
double total = 0;
for (auto* shape : shapes) {
    total += shape->area();  // Indirect branch
}
// ✅ Separate by type
std::vector<Circle> circles;
std::vector<Rectangle> rectangles;
double total = 0;
for (const auto& circle : circles) {
    total += circle.area();  // Direct call
}
for (const auto& rect : rectangles) {
    total += rect.area();
}

Performance Comparison

Branch Misprediction Cost

Test: 10 million iterations

Branch PatternTimeSpeedup
Predictable (always true)10ms10x
Predictable (always false)10ms10x
Unpredictable (random)100ms1x
Conclusion: 10x slower on misprediction

Branch Elimination Effect

Test: 10 million conditional operations

MethodTimeSpeedup
Branch (random)100ms1x
Branchless (mask)50ms2x
Conclusion: 2x improvement with branch elimination

Sorting Effect

Test: 10 million random data points

MethodTimeSpeedup
Random data100ms1x
After sorting30ms3.3x
Conclusion: 3x improvement with sorting

Real-World Cases

Case 1: Packet Filtering

#include <vector>
#include <chrono>
#include <iostream>
struct Packet {
    int type;
    int size;
};
void filterPackets(const std::vector<Packet>& packets) {
    int count = 0;
    
    for (const auto& packet : packets) {
        if (packet.type == 1) [[likely]] {
            // Mostly type 1
            count++;
        }
    }
    
    std::cout << "Type 1 packets: " << count << std::endl;
}
int main() {
    std::vector<Packet> packets(10000000);
    for (auto& p : packets) {
        p.type = (rand() % 100 < 90) ? 1 : 2;  // 90% type 1
        p.size = 1500;
    }
    
    auto start = std::chrono::high_resolution_clock::now();
    filterPackets(packets);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    
    std::cout << "Time: " << duration << "ms" << std::endl;
    
    return 0;
}

Case 2: Image Processing - Threshold Filter

#include <vector>
#include <algorithm>
#include <chrono>
#include <iostream>
void applyThreshold(std::vector<uint8_t>& image, uint8_t threshold) {
    // ❌ Many branches
    auto start1 = std::chrono::high_resolution_clock::now();
    for (auto& pixel : image) {
        if (pixel > threshold) {
            pixel = 255;
        } else {
            pixel = 0;
        }
    }
    auto end1 = std::chrono::high_resolution_clock::now();
    auto time1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
    
    std::cout << "With branches: " << time1 << "ms" << std::endl;
}
void applyThresholdBranchless(std::vector<uint8_t>& image, uint8_t threshold) {
    // ✅ Branch elimination
    auto start2 = std::chrono::high_resolution_clock::now();
    for (auto& pixel : image) {
        int mask = (pixel > threshold) ? 0xFF : 0x00;
        pixel = mask;
    }
    auto end2 = std::chrono::high_resolution_clock::now();
    auto time2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
    
    std::cout << "Branchless: " << time2 << "ms" << std::endl;
}
int main() {
    std::vector<uint8_t> image(10000000);
    std::generate(image.begin(), image.end(), []() { return rand() % 256; });
    
    applyThreshold(image, 128);
    applyThresholdBranchless(image, 128);
    
    return 0;
}

Case 3: Finance - Conditional Fees

#include <vector>
#include <chrono>
#include <iostream>
struct Transaction {
    double amount;
    int type;
};
double calculateFees(const std::vector<Transaction>& transactions) {
    double total_fee = 0.0;
    
    for (const auto& tx : transactions) {
        if (tx.type == 1) [[likely]] {
            // Regular transaction (90%)
            total_fee += tx.amount * 0.001;
        } else if (tx.type == 2) {
            // Special transaction (9%)
            total_fee += tx.amount * 0.002;
        } else [[unlikely]] {
            // Exception transaction (1%)
            total_fee += tx.amount * 0.005;
        }
    }
    
    return total_fee;
}
int main() {
    std::vector<Transaction> transactions(10000000);
    for (auto& tx : transactions) {
        int r = rand() % 100;
        tx.type = (r < 90) ? 1 : (r < 99) ? 2 : 3;
        tx.amount = 1000.0;
    }
    
    auto start = std::chrono::high_resolution_clock::now();
    double fee = calculateFees(transactions);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    
    std::cout << "Fee: " << fee << std::endl;
    std::cout << "Time: " << duration << "ms" << std::endl;
    
    return 0;
}

Troubleshooting

Problem 1: Excessive Hints

Symptom: Wrong hints causing performance degradation

// ❌ Wrong hint
if (condition) [[unlikely]] {
    // Actually executes frequently (50%)
    // Performance degradation
}
// ✅ Apply after profiling
// perf stat -e branch-misses ./program

Problem 2: Branch Elimination Cost

Symptom: Branch elimination actually slower

// Simple branch (predictable)
if (x > 0) {
    y = x;
} else {
    y = 0;
}
// Branch elimination (not always faster)
y = (x > 0) ? x : 0;
// ✅ Compiler optimizes
// Choose after measurement

Problem 3: Sorting Cost vs Benefit

Symptom: Sorting cost exceeds branch prediction benefit

std::vector<int> data = generateData(1000);
// Single pass: don't sort
for (int x : data) {
    if (x > threshold) { /* ....*/ }
}
// Multiple passes: sort
std::sort(data.begin(), data.end());
for (int i = 0; i < 100; ++i) {
    for (int x : data) {
        if (x > threshold) { /* ....*/ }
    }
}

Guideline: Consider sorting if passes > 10

Problem 4: Platform Dependency

Symptom: Fast on one CPU but slow on another

# Measure on target CPU
perf stat -e branch-misses,branches ./program
# Output:
# 1,234,567 branch-misses
# 10,000,000 branches
# 12.3% miss rate

Solution: Benchmark on target CPU class

Conclusion

C++ branch prediction is a core element of performance optimization.

Key Summary

  1. Branch Prediction
    • CPU speculatively executes branch result
    • Dozens of cycles lost on misprediction
  2. [[likely]] / [[unlikely]]
    • C++20 standard attributes
    • Hints to compiler
  3. Branch Elimination
    • Conditional move (cmov)
    • Using masks
  4. Sorting Effect
    • Predictable patterns
    • 3x performance improvement
  5. PGO
    • Based on actual workload
    • Automatic optimization

Optimization Techniques

TechniqueEffectDifficulty
[[likely]]/[[unlikely]]1.2-1.5xLow
Branch elimination2xMedium
Sorting3xLow
PGO1.5-2xMedium

Code Example Cheatsheet

// likely/unlikely
if (rare) [[unlikely]] { /* ....*/ }
// Branch elimination
result = (condition) ? a : b;
// Mask
int mask = (x > 0) ? 1 : 0;
sum += x * mask;
// Sorting
std::sort(data.begin(), data.end());
// Lookup table
result = table[index];

Next Steps

References

  • “Computer Architecture: A Quantitative Approach” - Hennessy, Patterson
  • “Optimized C++” - Kurt Guntheroth
  • “Agner Fog’s Optimization Manuals” One-line summary: Branch prediction is fast with predictable patterns, and performance can be improved 2-3x with [[likely]]/[[unlikely]], branch elimination, sorting, and PGO.