[2026] C++ Profiling | Finding Bottlenecks with perf and gprof When You Don’t Know What’s Slow


Key takeaways

Measure before you optimize: C++ profiling with Linux perf, gprof, flame graphs, std::chrono, and Valgrind. Fix bottlenecks with data, not guesses—CPU sampling, symbols, and production-safe workflows.

Introduction: “I don’t know what’s slow”

Real-world scenarios


What often happens:
- You spend three days optimizing a function you “thought” was slow; the real bottleneck was file I/O.
- perf report shows ??? for symbols and you can’t analyze.
- You profile with gprof but gmon.out never appears.
- Valgrind runs 30× slower and feels impractical.
- Your API server is at 100% CPU and you don’t know which handler is hot.
- After 24 hours, memory grows from 2GB to 8GB (possible leak).
- Your algorithm is O(n) but gets much slower than linear as n grows (cache effects suspected).

In these situations, measurement beats guessing. Use a profiler to find hotspots, visualize with flame graphs, optimize the top ~20% of time first—that usually gives the best return.

Optimizing from guesses wastes time

The program felt slow, so you optimized from intuition. The real bottleneck (the part that limits overall performance) was elsewhere. Wrong approach:

// “This function must be slow” — optimize it
void processData(std::vector<int>& data) {
    // complex optimization attempts...
}
// In reality this was the bottleneck
void loadData() {
    // file I/O is slow
}

After profiling:

  • processData: ~5% of time
  • loadData: ~80% of time ← real bottleneck

Lessons:
  • Don’t guess—measure
  • Find bottlenecks with a profiler
  • Optimize the slowest parts first

Profiling means measuring which functions use how much CPU or memory at runtime. Without it, “this part feels slow” often points at the wrong layer—I/O or another module may dominate. Use CPU sampling (e.g. perf) or instrumentation first to see where time goes, then optimize the top few percent.

End-to-end profiling flow


flowchart TD
    A[Program is slow] --> B[Guess without measuring]
    B --> C{Hit the bottleneck?}
    C -->|No| D[Wasted time]
    A --> E[Run profiling]
    E --> F[Find hotspots]
    F --> G[Optimize top ~20%]
    G --> H[Re-measure]
    H --> I{Goal met?}
    I -->|No| E
    I -->|Yes| J[Done]

After reading this article you will:

  • Use profiling tools effectively
  • Pinpoint bottlenecks accurately
  • Measure performance quantitatively
  • Optimize effectively in practice

Table of contents

  1. What is profiling
  2. Basic timing
  3. Profiling tools
  4. Complete profiling example
  5. Bottleneck analysis
  6. Practical optimization process
  7. Common problems
  8. Profiling best practices
  9. Production profiling patterns
  10. Checklist

1. What is profiling

Why measure performance


“Don’t guess—measure.”
- Intuition is often wrong
- Bottlenecks hide in unexpected places
- Optimization without measurement wastes time

Kinds of profiling

1. CPU profiling
  • Which functions use the most CPU
  • Call counts and time spent

2. Memory profiling
  • Memory usage
  • Allocation/deallocation counts
  • Leaks

3. Cache profiling
  • Cache miss counts
  • Access patterns

Profiling categories at a glance


flowchart LR
    subgraph CPU[CPU profiling]
        C1[perf]
        C2[gprof]
        C3[VS Profiler]
    end
    subgraph MEM[Memory profiling]
        M1[Valgrind Memcheck]
        M2[AddressSanitizer]
    end
    subgraph CACHE[Cache profiling]
        K1[Valgrind Cachegrind]
        K2[perf stat]
    end

2. Basic timing

Measuring with std::chrono

Since C++11, std::chrono can measure intervals. Take high_resolution_clock::now() at start and end, subtract to get a duration, then duration_cast to milliseconds or microseconds. That turns “feels slow” into numbers.

// After pasting: g++ -std=c++17 -o profile_time profile_time.cpp && ./profile_time
#include <chrono>
#include <iostream>
void slowFunction() {
    // heavy work...
}
int main() {
    auto start = std::chrono::high_resolution_clock::now();
    slowFunction();
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Time: " << duration.count() << " ms\n";
    return 0;
}

Sample output: Time: N ms (N depends on the environment). Details:

  • high_resolution_clock: finest clock available
  • now(): current time as time_point
  • duration_cast: convert e.g. to milliseconds
  • count(): integer value in that unit

RAII timer helper

Record time in the constructor and print elapsed time in the destructor—classic RAII timer. { Timer t("slowFunction"); slowFunction(); } prints when the scope ends. Exceptions and early returns still run the destructor, so you miss fewer “end times” than manual prints.

#include <chrono>
#include <iostream>

class Timer {
    std::chrono::high_resolution_clock::time_point start;
    const char* name;
public:
    Timer(const char* n) : name(n) {
        start = std::chrono::high_resolution_clock::now();
    }
    ~Timer() {
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
        std::cout << name << ": " << duration.count() << " us\n";
    }
};
void processData() {
    Timer timer("processData");
    // work...
}  // prints automatically in destructor

Note: Keep the Timer in the right scope—use { } blocks so the measured region is clear.

Multiple sections


#include <chrono>
#include <iostream>
#include <map>
#include <string>

class Profiler {
    std::map<std::string, long long> timings;
    std::chrono::high_resolution_clock::time_point start;
public:
    void startTimer() {
        start = std::chrono::high_resolution_clock::now();
    }
    void record(const std::string& name) {
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
        timings[name] += duration.count();
        start = end;
    }
    void report() {
        for (const auto& [name, time] : timings) {
            std::cout << name << ": " << time << " us\n";
        }
    }
};

// Stubs standing in for real work
void loadData() { /* ... */ }
void processData() { /* ... */ }
void saveData() { /* ... */ }

int main() {
    Profiler prof;
    prof.startTimer();
    loadData();
    prof.record("loadData");
    processData();
    prof.record("processData");
    saveData();
    prof.record("saveData");
    prof.report();
}

Usage: Each record() adds the time since the previous record() (or startTimer()). start = end advances to the next segment; repeat to accumulate totals.

3. Profiling tools

Choosing a tool


flowchart TD
    A[Need profiling] --> B{Platform?}
    B -->|Linux| C[perf]
    B -->|Linux/Mac| D[gprof]
    B -->|Linux/Mac| E[Valgrind]
    B -->|Windows| F[VS Profiler]
    C --> G[CPU sampling]
    D --> H[Instrumentation]
    E --> I[Memory/cache]
    F --> G

perf (Linux)

The standard Linux profiler. Sampling records which function is on-CPU periodically—low overhead, usable even in production-like settings.

# Profile while running
perf record ./myapp
# View results
perf report
# Per-function stats
perf stat ./myapp

Example output:

  50.23%  myapp  [.] processData
  30.45%  myapp  [.] loadFile
  15.32%  myapp  [.] parseJson

perf report tips:

# Include call graph
perf record -g ./myapp
# Text report
perf report --stdio
# Filter symbol
perf report --symbol-filter=processData

Interpreting perf stat:

 Performance counter stats for './myapp':
          1,234.56 msec task-clock
                42      context-switches
                 0      cpu-migrations
               128      page-faults
     3,456,789,012      cycles
     2,345,678,901      instructions

  • task-clock: CPU time (ms)
  • context-switches: context switch count
  • page-faults: page fault count
  • cycles, instructions: hardware counters

IPC (instructions per cycle): instructions / cycles near 1 suggests efficient CPU use; well below 0.5 may indicate memory stalls or bad branch prediction.

gprof (GNU profiler)

Compile with -pg to inject profiling code. Running produces gmon.out; gprof reports per-function time and call counts.

g++ -pg -O2 main.cpp -o myapp
./myapp
gprof myapp gmon.out

Sample gprof output:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 80.0      0.80     0.80        1   800.00   800.00  loadFile
 15.0      0.95     0.15      100     1.50     1.50  processData
  5.0      1.00     0.05        1    50.00    50.00  saveResult

Note: -pg with -O2 can inline and merge functions—use -O0/-O1 if you need clearer call relationships.

Valgrind Callgrind

Simulates execution step by step—accurate call graphs and cache info, but 10–50× slower—use on short runs only.

valgrind --tool=callgrind ./myapp
callgrind_annotate callgrind.out.12345
# GUI: kcachegrind

Options:

valgrind --tool=callgrind --cache-sim=yes ./myapp
valgrind --tool=callgrind --toggle-collect=processData ./myapp

Visual Studio Profiler


1. Debug → Performance Profiler
2. CPU Usage
3. Start, run app
4. Inspect Hot Path and per-function time

Tool comparison

| Tool        | Platform  | Method          | Overhead           | Production |
|-------------|-----------|-----------------|--------------------|------------|
| perf        | Linux     | Sampling        | Low (~5%)          | Yes        |
| gprof       | Linux/Mac | Instrumentation | Medium (~10%)      | Sometimes  |
| Valgrind    | Linux/Mac | Simulation      | Very high (10–50×) | No         |
| VS Profiler | Windows   | Sampling        | Low                | Yes        |

Flame graphs

Flame graphs stack frames bottom-up; width shows share of CPU time—great for spotting hot paths.

perf record -F 99 -g ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

How to read:

  • Width: fraction of CPU time on that path
  • Height: call stack (caller below, callee above)
  • Wide bars: hottest paths

Full flame graph workflow:

git clone --depth 1 https://github.com/brendangregg/FlameGraph
export PATH="$PATH:$(pwd)/FlameGraph"
perf record -F 99 -g --call-graph dwarf,8192 ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
open flamegraph.svg   # macOS
# xdg-open flamegraph.svg  # Linux

Common patterns:

| Pattern                 | Meaning         | Action                                          |
|-------------------------|-----------------|-------------------------------------------------|
| Wide memcpy             | Copy-bound      | Buffer pools, zero-copy                         |
| Wide malloc/free        | Allocation cost | Pools, arenas                                   |
| Wide std::sort          | Sort cost       | Avoid sort, partial sort                        |
| Wide pthread_mutex_lock | Lock wait       | Smaller critical sections, lock-free where safe |

4. Complete profiling example

Target program


// profile_target.cpp — analyze with perf, gprof
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>
#include <iostream>
void processDataCacheUnfriendly(std::vector<int>& data) {
    const size_t stride = 16;
    for (size_t i = 0; i < data.size(); i += stride) {
        data[i] = data[i] * 2 + 1;
    }
}
void processDataCacheFriendly(std::vector<int>& data) {
    for (size_t i = 0; i < data.size(); ++i) {
        data[i] = data[i] * 2 + 1;
    }
}
void sortData(std::vector<int>& data) {
    std::sort(data.begin(), data.end());
}
void fillRandom(std::vector<int>& data) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000000);
    for (auto& v : data) {
        v = dis(gen);
    }
}
int main() {
    const size_t N = 10'000'000;
    std::vector<int> data(N);
    fillRandom(data);
    sortData(data);
    processDataCacheUnfriendly(data);
    processDataCacheFriendly(data);
    return 0;
}

perf example


g++ -std=c++17 -O2 -g -o profile_target profile_target.cpp
perf record -F 99 -g --call-graph dwarf,8192 ./profile_target
perf report --stdio
perf stat -e cycles,instructions,cache-references,cache-misses ./profile_target

Sample perf report --stdio:

#   45.23%  profile_target    [.] sortData
#   28.10%  profile_target    [.] fillRandom
#   12.30%  profile_target    [.] processDataCacheUnfriendly
#   10.00%  profile_target    [.] processDataCacheFriendly

Hotspot: sortData ~45% → consider algorithm changes or removing sort.

gprof example


g++ -std=c++17 -O2 -pg -g -o profile_target_gprof profile_target.cpp
./profile_target_gprof
gprof -p profile_target_gprof gmon.out
gprof -q profile_target_gprof gmon.out
gprof profile_target_gprof gmon.out > gprof_report.txt

Reading gprof: focus on % time, self seconds, calls.

Hotspot workflow


flowchart TD
    A[Run program] --> B[perf record -g]
    B --> C[perf report]
    C --> D{Top 3 functions?}
    D --> E[Widest bar = bottleneck]
    E --> F[Refine with Timer]
    F --> G[Pick optimization target]
    G --> H[Re-measure after fix]

5. Bottleneck analysis

Finding hotspots


// Profiling says:
// 80% - loadFile()      ← bottleneck!
// 15% - processData()
// 5%  - saveResult()
void loadFile(const std::string& path) {
    Timer timer("loadFile");
    std::ifstream file;
    { Timer t("open"); file.open(path); }
    { Timer t("read"); /* read... slow here */ }
    { Timer t("parse"); /* parse... */ }
}

Call counts (simple instrumentation)


#include <iostream>
#include <map>
#include <string>

class CallCounter {
    static std::map<std::string, int> counts;
    std::string name;
public:
    CallCounter(const char* n) : name(n) {
        counts[name]++;
    }
    static void report() {
        for (const auto& [name, count] : counts) {
            std::cout << name << ": " << count << " calls\n";
        }
    }
};
std::map<std::string, int> CallCounter::counts;

Pareto (80/20)

~80% of runtime often comes from the top ~20% of functions.
Optimizing those first yields most of the win.

6. Practical optimization process

  1. Measure baseline (chrono, benchmarks)
  2. Profile (perf record -g, etc.)
  3. Optimize the real hotspot (e.g. reserve for vectors)
  4. Re-measure
  5. Repeat

Benchmarking tips

  • Warm up caches before timing
  • Run multiple iterations and average or take median
  • Use -O2/-O3 for release-like numbers when that matches production

Memory profiling

valgrind --leak-check=full ./myapp

AddressSanitizer (faster than Valgrind for many bugs):

g++ -g -O1 -fsanitize=address -fno-omit-frame-pointer main.cpp -o myapp
./myapp

7. Common problems

perf permission denied

Lower kernel.perf_event_paranoid or run with appropriate privileges (see your distro docs).

No gmon.out

Ensure -pg and normal process exit (not only Ctrl+C/abort in some setups).

Valgrind too slow

Use smaller inputs, or use perf for CPU-only work.

Symbols show as ???

Build with -g, avoid stripping debug info.

Inlined functions disappear from profile

Try -O1/-O0 for profiling builds, or mark critical functions __attribute__((noinline)).

10. Checklist

Before profiling

  • -g for symbols
  • Choose optimization level (-O1 often balances accuracy vs reality)
  • perf: check perf_event_paranoid
  • gprof: -pg
  • Valgrind: shrink workload

After profiling

  • Identify top ~20% functions
  • Drill down with timers
  • Record baseline before changes
  • Re-measure after changes
  • Regression-test behavior

Principles

  • Measure, don’t guess
  • Fix big bottlenecks first
  • Compare before/after
  • Use profilers systematically



Summary

| Tool        | Platform  | Role              |
|-------------|-----------|-------------------|
| perf        | Linux     | CPU sampling      |
| gprof       | Linux/Mac | Per-function time |
| Valgrind    | Linux/Mac | Memory, cache     |
| VS Profiler | Windows   | CPU, memory       |
| std::chrono | All       | Manual timing     |

Principles: measure first; optimize hotspots; compare before/after; use profilers.

Practical tips

Debugging

  • Fix compiler warnings first
  • Reproduce with a small test case

Performance

  • Don’t optimize without profiling
  • Define measurable goals

Code review

  • Check common review feedback early
  • Follow team conventions

FAQ

When is this useful in practice?

A. Finding bottlenecks with perf/gprof/Valgrind, measuring performance, and choosing what to optimize—use the article’s workflows and examples.

perf vs gprof?

A. On Linux, prefer perf (sampling, low overhead). gprof needs -pg and rebuilds. For exact call graphs on short runs, consider Callgrind.

Does profiling slow the app?

A. perf sampling is usually ~5% overhead. gprof instrumentation is higher. Valgrind is 10–50×—short runs only.

What to read first?

A. Follow previous-post links or the C++ series index.

Go deeper?

A. See cppreference and official tool documentation.

One-line summary: Use chrono and profilers to find real hotspots, then optimize.

Next: C++ practical guide #15-2: cache-friendly code
Previous: Perfect forwarding (#14-2)
