[2026] C++ Profiling | Finding Bottlenecks with perf and gprof When You Don’t Know What’s Slow


Key takeaways

Measure before you optimize: C++ profiling with Linux perf, gprof, flame graphs, std::chrono, and Valgrind. Fix bottlenecks with data, not guesses—CPU sampling, symbols, and production-safe workflows.

Introduction: “I don’t know what’s slow”

Real-world scenarios


What often happens:
- You spend three days optimizing a function you “thought” was slow; the real bottleneck was file I/O.
- perf report shows ??? for symbols and you can’t analyze.
- You profile with gprof but gmon.out never appears.
- Valgrind runs 30× slower and feels impractical.
- Your API server is at 100% CPU and you don’t know which handler is hot.
- After 24 hours, memory grows from 2GB to 8GB (possible leak).
- Your algorithm is O(n) but gets much slower than linear as n grows (cache effects suspected).

In these situations, measurement beats guessing. Use a profiler to find hotspots, visualize with flame graphs, optimize the top ~20% of time first—that usually gives the best return.

Optimizing from guesses wastes time

The program felt slow, so you optimized from intuition. The real bottleneck (the part that limits overall performance) was elsewhere. Wrong approach:

// “This function must be slow” — optimize it
void processData(std::vector<int>& data) {
    // complex optimization attempts...
}
// In reality this was the bottleneck
void loadData() {
    // file I/O is slow
}

After profiling:

  • processData: ~5% of time
  • loadData: ~80% of time ← real bottleneck

Lessons:
  • Don’t guess—measure
  • Find bottlenecks with a profiler
  • Optimize the slowest parts first

Profiling means measuring which functions use how much CPU or memory at runtime. Without it, “this part feels slow” often points at the wrong layer—I/O or another module may dominate. Use CPU sampling (e.g. perf) or instrumentation first to see where time goes, then optimize the top few percent.

End-to-end profiling flow


flowchart TD
    A[Program is slow] --> B[Guess without measuring]
    B --> C{Hit the bottleneck?}
    C -->|No| D[Wasted time]
    A --> E[Run profiling]
    E --> F[Find hotspots]
    F --> G[Optimize top ~20%]
    G --> H[Re-measure]
    H --> I{Goal met?}
    I -->|No| E
    I -->|Yes| J[Done]

After reading this article you will:

  • Use profiling tools effectively
  • Pinpoint bottlenecks accurately
  • Measure performance quantitatively
  • Optimize effectively in practice

Table of contents

  1. What is profiling
  2. Basic timing
  3. Profiling tools
  4. Complete profiling example
  5. Bottleneck analysis
  6. Practical optimization process
  7. Common problems
  8. Profiling best practices
  9. Production profiling patterns
  10. Checklist

1. What is profiling

Why measure performance


“Don’t guess—measure.”
- Intuition is often wrong
- Bottlenecks hide in unexpected places
- Optimization without measurement wastes time

Kinds of profiling

1. CPU profiling
  • Which functions use the most CPU
  • Call counts and time spent

2. Memory profiling
  • Memory usage
  • Allocation/deallocation counts
  • Leaks

3. Cache profiling
  • Cache miss counts
  • Access patterns

Profiling categories at a glance


flowchart LR
    subgraph CPU[CPU profiling]
        C1[perf]
        C2[gprof]
        C3[VS Profiler]
    end
    subgraph MEM[Memory profiling]
        M1[Valgrind Memcheck]
        M2[AddressSanitizer]
    end
    subgraph CACHE[Cache profiling]
        K1[Valgrind Cachegrind]
        K2[perf stat]
    end

2. Basic timing

Measuring with std::chrono

Since C++11, std::chrono can measure intervals. Take high_resolution_clock::now() at start and end, subtract to get a duration, then duration_cast to milliseconds or microseconds. That turns “feels slow” into numbers.

// After pasting: g++ -std=c++17 -o profile_time profile_time.cpp && ./profile_time
#include <chrono>
#include <iostream>
void slowFunction() {
    // heavy work...
}
int main() {
    auto start = std::chrono::high_resolution_clock::now();
    slowFunction();
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Time: " << duration.count() << " ms\n";
    return 0;
}

Sample output: Time: N ms (N depends on the environment). Details:

  • high_resolution_clock: finest clock available
  • now(): current time as time_point
  • duration_cast: convert e.g. to milliseconds
  • count(): integer value in that unit

RAII timer helper

Record time in the constructor and print elapsed time in the destructor—classic RAII timer. { Timer t("slowFunction"); slowFunction(); } prints when the scope ends. Exceptions and early returns still run the destructor, so you miss fewer “end times” than manual prints.

#include <chrono>
#include <iostream>

class Timer {
    std::chrono::high_resolution_clock::time_point start;
    const char* name;
public:
    Timer(const char* n) : name(n) {
        start = std::chrono::high_resolution_clock::now();
    }
    ~Timer() {
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
        std::cout << name << ": " << duration.count() << " us\n";
    }
};
void processData() {
    Timer timer("processData");
    // work...
}  // prints automatically in destructor

Note: Keep the Timer in the right scope—use { } blocks so the measured region is clear.

Multiple sections


#include <chrono>
#include <iostream>
#include <map>
#include <string>

class Profiler {
    std::map<std::string, long long> timings;
    std::chrono::high_resolution_clock::time_point start;
public:
    void startTimer() {
        start = std::chrono::high_resolution_clock::now();
    }
    void record(const std::string& name) {
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
        timings[name] += duration.count();
        start = end;
    }
    void report() {
        for (const auto& [name, time] : timings) {
            std::cout << name << ": " << time << " us\n";
        }
    }
};

// Stubs standing in for real work
void loadData() { /* ... */ }
void processData() { /* ... */ }
void saveData() { /* ... */ }

int main() {
    Profiler prof;
    prof.startTimer();
    loadData();
    prof.record("loadData");
    processData();
    prof.record("processData");
    saveData();
    prof.record("saveData");
    prof.report();
}

Usage: Each record() adds the time since the previous record() (or startTimer()). start = end advances to the next segment; repeat to accumulate totals.

3. Profiling tools

Choosing a tool


flowchart TD
    A[Need profiling] --> B{Platform?}
    B -->|Linux| C[perf]
    B -->|Linux/Mac| D[gprof]
    B -->|Linux/Mac| E[Valgrind]
    B -->|Windows| F[VS Profiler]
    C --> G[CPU sampling]
    D --> H[Instrumentation]
    E --> I[Memory/cache]
    F --> G

perf (Linux)

The standard Linux profiler. Sampling records which function is on-CPU periodically—low overhead, usable even in production-like settings.

# Profile while running
perf record ./myapp
# View results
perf report
# Per-function stats
perf stat ./myapp

Example output:

  50.23%  myapp  [.] processData
  30.45%  myapp  [.] loadFile
  15.32%  myapp  [.] parseJson

perf report tips:

# Include call graph
perf record -g ./myapp
# Text report
perf report --stdio
# Filter symbol
perf report --symbol-filter=processData

Interpreting perf stat:

 Performance counter stats for './myapp':
          1,234.56 msec task-clock
                42      context-switches
                 0      cpu-migrations
               128      page-faults
     3,456,789,012      cycles
     2,345,678,901      instructions

  • task-clock: CPU time (ms)
  • context-switches: context switch count
  • page-faults: page fault count
  • cycles, instructions: hardware counters

IPC (instructions per cycle): instructions / cycles near 1 suggests efficient CPU use; well below 0.5 may indicate memory stalls or bad branch prediction.

gprof (GNU profiler)

Compile with -pg to inject profiling code. Running produces gmon.out; gprof reports per-function time and call counts.

g++ -pg -O2 main.cpp -o myapp
./myapp
gprof myapp gmon.out

Sample gprof output:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 80.0      0.80     0.80        1   800.00   800.00  loadFile
 15.0      0.95     0.15      100     1.50     1.50  processData
  5.0      1.00     0.05        1    50.00    50.00  saveResult

Note: -pg with -O2 can inline and merge functions—use -O0/-O1 if you need clearer call relationships.

Valgrind Callgrind

Simulates execution step by step—accurate call graphs and cache info, but 10–50× slower—use on short runs only.

valgrind --tool=callgrind ./myapp
callgrind_annotate callgrind.out.12345
# GUI: kcachegrind

Options:

valgrind --tool=callgrind --cache-sim=yes ./myapp
valgrind --tool=callgrind --toggle-collect=processData ./myapp

Visual Studio Profiler


1. Debug → Performance Profiler
2. CPU Usage
3. Start, run app
4. Inspect Hot Path and per-function time

Tool comparison

| Tool        | Platform  | Method          | Overhead           | Production |
|-------------|-----------|-----------------|--------------------|------------|
| perf        | Linux     | Sampling        | Low (~5%)          | Yes        |
| gprof       | Linux/Mac | Instrumentation | Medium (~10%)      | Sometimes  |
| Valgrind    | Linux/Mac | Simulation      | Very high (10–50×) | No         |
| VS Profiler | Windows   | Sampling        | Low                | Yes        |

Flame graphs

Flame graphs stack frames bottom-up; width shows share of CPU time—great for spotting hot paths.

perf record -F 99 -g ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

How to read:

  • Width: fraction of CPU time on that path
  • Height: call stack (caller below, callee above)
  • Wide bars: hottest paths

Full flame graph workflow:

git clone --depth 1 https://github.com/brendangregg/FlameGraph
export PATH="$PATH:$(pwd)/FlameGraph"
perf record -F 99 -g --call-graph dwarf,8192 ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
open flamegraph.svg   # macOS
# xdg-open flamegraph.svg  # Linux

Common patterns:

| Pattern                 | Meaning         | Action                                          |
|-------------------------|-----------------|-------------------------------------------------|
| Wide memcpy             | Copy-bound      | Buffer pools, zero-copy                         |
| Wide malloc/free        | Allocation cost | Pools, arenas                                   |
| Wide std::sort          | Sort cost       | Avoid sort, partial sort                        |
| Wide pthread_mutex_lock | Lock wait       | Smaller critical sections, lock-free where safe |

4. Complete profiling example

Target program


// profile_target.cpp — analyze with perf, gprof
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>
#include <iostream>
void processDataCacheUnfriendly(std::vector<int>& data) {
    const size_t stride = 16;
    for (size_t i = 0; i < data.size(); i += stride) {
        data[i] = data[i] * 2 + 1;
    }
}
void processDataCacheFriendly(std::vector<int>& data) {
    for (size_t i = 0; i < data.size(); ++i) {
        data[i] = data[i] * 2 + 1;
    }
}
void sortData(std::vector<int>& data) {
    std::sort(data.begin(), data.end());
}
void fillRandom(std::vector<int>& data) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000000);
    for (auto& v : data) {
        v = dis(gen);
    }
}
int main() {
    const size_t N = 10'000'000;
    std::vector<int> data(N);
    fillRandom(data);
    sortData(data);
    processDataCacheUnfriendly(data);
    processDataCacheFriendly(data);
    return 0;
}

perf example


g++ -std=c++17 -O2 -g -o profile_target profile_target.cpp
perf record -F 99 -g --call-graph dwarf,8192 ./profile_target
perf report --stdio
perf stat -e cycles,instructions,cache-references,cache-misses ./profile_target

Sample perf report --stdio:

#   45.23%  profile_target    [.] sortData
#   28.10%  profile_target    [.] fillRandom
#   12.30%  profile_target    [.] processDataCacheUnfriendly
#   10.00%  profile_target    [.] processDataCacheFriendly

Hotspot: sortData ~45% → consider algorithm changes or removing sort.

gprof example


g++ -std=c++17 -O2 -pg -g -o profile_target_gprof profile_target.cpp
./profile_target_gprof
gprof -p profile_target_gprof gmon.out
gprof -q profile_target_gprof gmon.out
gprof profile_target_gprof gmon.out > gprof_report.txt

Reading gprof: focus on % time, self seconds, calls.

Hotspot workflow


flowchart TD
    A[Run program] --> B[perf record -g]
    B --> C[perf report]
    C --> D{Top 3 functions?}
    D --> E[Widest bar = bottleneck]
    E --> F[Refine with Timer]
    F --> G[Pick optimization target]
    G --> H[Re-measure after fix]

5. Bottleneck analysis

Finding hotspots


// Profiling says:
// 80% - loadFile()      ← bottleneck!
// 15% - processData()
// 5%  - saveResult()
void loadFile(const std::string& path) {
    Timer timer("loadFile");
    std::ifstream file;
    { Timer t("open"); file.open(path); }
    { Timer t("read"); /* read... slow here */ }
    { Timer t("parse"); /* parse... */ }
}

Call counts (simple instrumentation)


#include <iostream>
#include <map>
#include <string>

class CallCounter {
    static std::map<std::string, int> counts;
    std::string name;
public:
    CallCounter(const char* n) : name(n) {
        counts[name]++;
    }
    static void report() {
        for (const auto& [name, count] : counts) {
            std::cout << name << ": " << count << " calls\n";
        }
    }
};
std::map<std::string, int> CallCounter::counts;

Pareto (80/20)

~80% of runtime often comes from the top ~20% of functions.
Optimizing those first yields most of the win.

6. Practical optimization process

  1. Measure baseline (chrono, benchmarks)
  2. Profile (perf record -g, etc.)
  3. Optimize the real hotspot (e.g. reserve for vectors)
  4. Re-measure
  5. Repeat

Benchmarking tips

  • Warm up caches before timing
  • Run multiple iterations and average or take median
  • Use -O2/-O3 for release-like numbers when that matches production

Memory profiling

valgrind --leak-check=full ./myapp

AddressSanitizer (faster than Valgrind for many bugs):

g++ -g -O1 -fsanitize=address -fno-omit-frame-pointer main.cpp -o myapp
./myapp

7. Common problems

perf permission denied

Lower kernel.perf_event_paranoid or run with appropriate privileges (see your distro docs).

No gmon.out

Ensure -pg and normal process exit (not only Ctrl+C/abort in some setups).

Valgrind too slow

Use smaller inputs, or use perf for CPU-only work.

Symbols show as ???

Build with -g, avoid stripping debug info.

Inlined functions disappear from profile

Try -O1/-O0 for profiling builds, or mark critical functions __attribute__((noinline)).

10. Checklist

Before profiling

  • -g for symbols
  • Choose optimization level (-O1 often balances accuracy vs reality)
  • perf: check perf_event_paranoid
  • gprof: -pg
  • Valgrind: shrink workload

After profiling

  • Identify top ~20% functions
  • Drill down with timers
  • Record baseline before changes
  • Re-measure after changes
  • Regression-test behavior

Principles

  • Measure, don’t guess
  • Fix big bottlenecks first
  • Compare before/after
  • Use profilers systematically



Summary

| Tool        | Platform  | Role              |
|-------------|-----------|-------------------|
| perf        | Linux     | CPU sampling      |
| gprof       | Linux/Mac | Per-function time |
| Valgrind    | Linux/Mac | Memory, cache     |
| VS Profiler | Windows   | CPU, memory       |
| std::chrono | All       | Manual timing     |

Principles: measure first; optimize hotspots; compare before/after; use profilers.

Practical tips

Debugging

  • Fix compiler warnings first
  • Reproduce with a small test case

Performance

  • Don’t optimize without profiling
  • Define measurable goals

Code review

  • Check common review feedback early
  • Follow team conventions

FAQ

When is this useful in practice?

A. Finding bottlenecks with perf/gprof/Valgrind, measuring performance, and choosing what to optimize—use the article’s workflows and examples.

perf vs gprof?

A. On Linux, prefer perf (sampling, low overhead). gprof needs -pg and rebuilds. For exact call graphs on short runs, consider Callgrind.

Does profiling slow the app?

A. perf sampling is usually ~5% overhead. gprof instrumentation is higher. Valgrind is 10–50×—short runs only.

What to read first?

A. Follow previous-post links or the C++ series index.

Go deeper?

A. See cppreference and official tool documentation.

One-line summary: Use chrono and profilers to find real hotspots, then optimize.

Next: C++ practical guide #15-2: cache-friendly code
Previous: Perfect forwarding (#14-2)
