Why are intermittent crashes hard to reproduce?

Scheduling, thread interleaving, and memory layout vary from run to run, so the same bug shows different symptoms each time. Core dumps, backtraces, and logs of the last known-good state are the evidence you have to work from.

When should I use rr (record and replay)?

Use rr when you need deterministic replay of a nondeterministic failure, gdb alone is not enough, and your environment supports rr.

No core dump is written—what now?

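Check `ulimit -c`, container settings, and kernel.core_pattern; on systemd hosts, systemd-coredump may be capturing the dump instead of writing a file. Production often disables dumps, so try to reproduce in staging as well.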

What to suspect first in multithreaded segfaults?

Races, use-after-free, bad lock ordering. ThreadSanitizer and minimal repros narrow scope.

[2026] C++ Crash Debugging Case Study | Fixing an Intermittent Segmentation Fault

March 30, 2026 · 24 min read · Updated March 30, 2026 · Advanced

Key points

Production C++ server: intermittent segfaults traced with core dumps, gdb, and rr—how to debug “cannot reproduce” crashes and data races.

Introduction

“Sometimes the server dies” is among the hardest reports. This post covers a hard-to-reproduce intermittent crash solved with core dumps, gdb, and rr.

What you will learn

How to configure and use core dumps
gdb techniques at the crash site
Using rr when local repro fails
Debugging data races in multithreaded code

Table of contents

Symptom: intermittent SIGSEGV
Core dump setup
gdb: analyzing the crash
Hypothesis: dangling pointer?
Reproduction fails locally
Recording with rr
Reverse debugging
Root cause: data race
Fix: synchronization
TSan validation
After the fix
Lessons
More techniques
Closing thoughts

1. Symptom: intermittent SIGSEGV

Production

Roughly 1–2 segfaults per day:

$ dmesg | tail
[12345.678] chat_server[23456]: segfault at 0 ip 00007f1234567890 sp 00007fff12345678 error 4 in chat_server
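The `error 4` field in that line is the x86 page-fault error code, which is a bitmask. As a rough aid for reading such lines, the sketch below decodes its three low bits. The struct and function names are invented for illustration, and the bit meanings assume an x86 or x86-64 Linux kernel.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the server): decodes the low three bits of
// the x86 page-fault error code that the kernel prints as "error N".
struct FaultInfo {
    bool protection_violation; // bit 0: 1 = page present but access denied, 0 = page not mapped
    bool write;                // bit 1: 1 = write access, 0 = read access
    bool user_mode;            // bit 2: 1 = fault raised while running in user mode
};

constexpr FaultInfo decode_fault(std::uint32_t code) {
    return FaultInfo{(code & 0x1u) != 0, (code & 0x2u) != 0, (code & 0x4u) != 0};
}
```

For the log above, `decode_fault(4)` describes a user-mode read of an unmapped page. Combined with the faulting address in `segfault at 0`, that points at a read through a null pointer.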

Characteristics

No reliable repro in dev
Intermittent, seemingly random
More frequent under load

2. Core dump setup

System

$ ulimit -c unlimited
$ sudo sysctl -w kernel.core_pattern=/var/coredumps/core.%e.%p.%t
$ sudo mkdir -p /var/coredumps
$ sudo chmod 1777 /var/coredumps

Server

$ ulimit -c
unlimited
$ ./chat_server
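`ulimit -c unlimited` only affects processes started from that shell. If the server is launched by a supervisor instead, one option is to raise the limit from inside the process at startup. The following is a minimal sketch using POSIX `getrlimit` and `setrlimit`; the function name is hypothetical, and nothing here implies the original server does this.

```cpp
#include <sys/resource.h>

// Hypothetical startup helper: raise the soft core-size limit to the hard
// limit so the kernel is allowed to write a core file for this process.
// Returns false if either system call fails.
bool enable_core_dumps() {
    struct rlimit lim{};
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return false;
    }
    lim.rlim_cur = lim.rlim_max; // an unprivileged process may raise soft up to hard
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

If the hard limit is already 0 (common in containers), this call succeeds but no core is written; the limit then has to be raised by whatever launches the process.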

Next morning

$ ls -lh /var/coredumps/
-rw------- 1 user user 1.2G Mar 30 03:42 core.chat_server.23456.1711756920

3. gdb: analyzing the crash

Load core

$ gdb ./chat_server /var/coredumps/core.chat_server.23456.1711756920
(gdb) bt
#0  0x00007f1234567890 in std::vector<Message>::operator[] (this=0x0, __n=5)
    at /usr/include/c++/11/bits/stl_vector.h:1046
#1  0x00007f2345678901 in ChatRoom::broadcast (this=0x7f3456789012, msg=...)
    at src/chat_room.cpp:145
...

Finding

The crash is reached through ChatRoom::broadcast, and frame #0 was entered with this=0x0: a null pointer dereference, consistent with "segfault at 0" in dmesg
Thread: Asio worker

4. Hypothesis: dangling pointer?

Suspect code

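// Excerpt: unrelated members and declarations are omitted.
class Connection {
    ChatRoom* room_; // raw, non-owning pointer, touched by several threads

public:
    void handleMessage(const std::string& msg) {
        if (room_) {                // first read: null check
            room_->broadcast(msg);  // second read: nothing keeps room_ valid in between
        }
    }

    void leaveRoom() {
        room_ = nullptr;            // unsynchronized write
    }
};

class ChatRoom {
    std::vector<Connection*> connections_;

public:
    void broadcast(const std::string& msg) {
        for (auto* conn : connections_) {
            conn->send(msg);
        }
    }
};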

Hypotheses

Connection still points at a destroyed ChatRoom
Multithreaded race clearing room_ while in use

5. Reproduction fails locally

$ ./load_test.sh --users=1000 --duration=600
# No crash

Why

Timing-dependent races
Different CPU and load vs production
Rare interleaving

Conclusion: debugging without an easy repro → try rr

6. Recording with rr

Setup

$ sudo apt install rr
$ echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid

Record

$ rr record ./chat_server
rr: Saving execution to trace directory `/root/.local/share/rr/chat_server-0'.

7. Reverse debugging

$ rr replay /root/.local/share/rr/chat_server-0
(rr) c
Program received signal SIGSEGV, Segmentation fault.
0x00007f1234567890 in ChatRoom::broadcast (this=0x0, msg=...)

Watch `room_`

(rr) reverse-continue
(rr) up
(rr) watch -l room_
(rr) reverse-continue
Old value = (ChatRoom*) 0x7f3456789012
New value = (ChatRoom*) 0x0
(rr) bt
#0  Connection::leaveRoom() at src/connection.cpp:67
#1  ChatRoom::removeConnection() at src/chat_room.cpp:89
#2  Server::handleDisconnect() at src/server.cpp:123

Insight

Thread A: inside `handleMessage` (using `room_`)
Thread B: `leaveRoom` sets `room_ = nullptr`
→ Data race

8. Root cause: data race

Interleaving

// Thread A
void Connection::handleMessage(const std::string& msg) {
    if (room_) {              // room_ looks valid
        room_->broadcast(msg); // B may clear room_ here → crash
    }
}
// Thread B
void Connection::leaveRoom() {
    room_ = nullptr;
}

Timeline

Time  | Thread A                    | Thread B
------|-----------------------------|-----------------------
t0    | if (room_) { // true        |
t1    |                             | room_ = nullptr;
t2    | room_->broadcast(msg);      |
      | SIGSEGV                     |
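The timeline can be replayed deterministically in a single thread, which makes the check-then-use window easier to see. The sketch below is an illustration, not the server code: `Room`, both function names, and the `between` callback (standing in for Thread B at t1) are assumptions. It uses `std::atomic` so the simulation itself is well defined, and it also shows that making the pointer atomic does not close the window by itself.

```cpp
#include <atomic>
#include <functional>

struct Room {
    int broadcasts = 0;
};

// Mirrors the buggy shape: the pointer is read once for the check and again
// for the use. `between` stands in for Thread B running at t1.
// Returns false where the real code would have dereferenced null.
bool check_then_use(std::atomic<Room*>& room, const std::function<void()>& between) {
    if (room.load() == nullptr) {
        return false;
    }
    between();                     // t1: the other thread runs here
    Room* current = room.load();   // t2: second read
    if (current == nullptr) {
        return false;              // this is the SIGSEGV in the original code
    }
    ++current->broadcasts;
    return true;
}

// Reads the pointer exactly once, so a later store of nullptr cannot turn a
// passed check into a null dereference. This does NOT protect against the
// Room being destroyed; lifetime needs separate handling.
bool read_once_then_use(std::atomic<Room*>& room, const std::function<void()>& between) {
    Room* current = room.load();   // single read
    if (current == nullptr) {
        return false;
    }
    between();
    ++current->broadcasts;
    return true;
}
```

Passing a callback that stores `nullptr` makes `check_then_use` fail every time, while `read_once_then_use` still delivers. The real fix still needs synchronization or ownership, because the room itself may be freed in that same window.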

9. Fix: synchronization

Mutex

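#include <mutex>
#include <string>

class Connection {
    mutable std::mutex roomMutex_;  // guards every read and write of room_
    ChatRoom* room_ = nullptr;

public:
    void handleMessage(const std::string& msg) {
        // The lock is held across broadcast(), so broadcast() must never call
        // back into leaveRoom() on this connection or it will self-deadlock.
        std::lock_guard<std::mutex> lock(roomMutex_);
        if (room_) {
            room_->broadcast(msg);
        }
    }

    void leaveRoom() {
        std::lock_guard<std::mutex> lock(roomMutex_);
        room_ = nullptr;
    }
};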

shared_ptr + weak_ptr

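#include <memory>
#include <mutex>
#include <string>

class Connection {
    // A single weak_ptr object is not safe for concurrent lock() and reset(),
    // so access to room_ itself still needs a mutex.
    std::mutex roomMutex_;
    std::weak_ptr<ChatRoom> room_;

public:
    void handleMessage(const std::string& msg) {
        std::shared_ptr<ChatRoom> room;
        {
            std::lock_guard<std::mutex> lock(roomMutex_);
            room = room_.lock();    // owning copy keeps the room alive for this call
        }
        if (room) {
            room->broadcast(msg);   // runs outside the lock
        }
    }

    void setRoom(const std::shared_ptr<ChatRoom>& room) {
        std::lock_guard<std::mutex> lock(roomMutex_);
        room_ = room;
    }

    void leaveRoom() {
        std::lock_guard<std::mutex> lock(roomMutex_);
        room_.reset();
    }
};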
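What `weak_ptr` adds over a raw pointer is a lifetime check: `lock()` either returns an owning `shared_ptr` or an empty one if the object is gone. The sketch below isolates that behaviour in a single thread. `Room` and `try_broadcast` are illustrative names, not the server's types.

```cpp
#include <memory>
#include <string>

struct Room {
    int delivered = 0;
    void broadcast(const std::string&) { ++delivered; }
};

// Delivers msg only if the room is still alive. lock() returns an owning
// shared_ptr, so the room cannot be destroyed while broadcast() runs.
bool try_broadcast(const std::weak_ptr<Room>& weak, const std::string& msg) {
    if (std::shared_ptr<Room> room = weak.lock()) {
        room->broadcast(msg);
        return true;
    }
    return false; // room already destroyed: skip instead of crashing
}
```

Once the last `shared_ptr` to the room is released, `try_broadcast` returns false rather than touching freed memory. This covers lifetime only; concurrent writes to the same `weak_ptr` object still need their own synchronization.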

Asio strand (serialize handlers)

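#include <boost/asio.hpp>
#include <string>

class Connection {
    boost::asio::strand<boost::asio::io_context::executor_type> strand_;
    ChatRoom* room_ = nullptr;

public:
    // Constructor added for illustration; the original excerpt omits it.
    explicit Connection(boost::asio::io_context& io)
        : strand_(boost::asio::make_strand(io)) {}

    // room_ is only touched inside handlers posted to strand_, and those never
    // run concurrently. The Connection must outlive every queued handler.
    void handleMessage(const std::string& msg) {
        boost::asio::post(strand_, [this, msg]() {
            if (room_) {
                room_->broadcast(msg);
            }
        });
    }

    void leaveRoom() {
        boost::asio::post(strand_, [this]() {
            room_ = nullptr;
        });
    }
};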
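The guarantee the strand relies on is that handlers posted to it never overlap. The toy queue below imitates that with the standard library only, so the idea can be tried without Boost. It is a simplified stand-in, not how Boost.Asio implements strands: `SerialQueue` is a made-up class, and it serializes handlers only while a single thread calls `run()`.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Toy stand-in for a strand (NOT Boost.Asio): handlers are queued from any
// thread and executed one at a time, in FIFO order, by the caller of run().
// Assumption: only one thread calls run() at a time.
class SerialQueue {
    std::mutex mutex_;                        // protects queue_ only
    std::deque<std::function<void()>> queue_;

public:
    void post(std::function<void()> handler) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push_back(std::move(handler));
    }

    // Runs queued handlers until the queue is empty; returns how many ran.
    // Handlers run outside the lock, so they may call post() themselves.
    int run() {
        int executed = 0;
        for (;;) {
            std::function<void()> next;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (queue_.empty()) {
                    return executed;
                }
                next = std::move(queue_.front());
                queue_.pop_front();
            }
            next();
            ++executed;
        }
    }
};
```

Posting "broadcast", "leave", "broadcast" in that order delivers exactly one message: the check and the use sit inside one handler, and the clearing handler cannot run in the middle of it.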

10. TSan validation

$ g++ -g -O1 -fsanitize=thread -std=c++17 *.cpp -o chat_server_tsan

Run against the pre-fix code, the TSan build reports the race:

$ ./chat_server_tsan
WARNING: ThreadSanitizer: data race (pid=12345)
  Write of size 8 at 0x7f1234567890 by thread T2:
    #0 Connection::leaveRoom() src/connection.cpp:67
    
  Previous read of size 8 at 0x7f1234567890 by thread T1:
    #0 Connection::handleMessage() src/connection.cpp:45
...
SUMMARY: ThreadSanitizer: data race src/connection.cpp:67 in Connection::leaveRoom()

11. After the fix

Load / soak

$ ./chat_server_tsan
# 24h run → 0 races reported
# Production: 1 week → 0 crashes (example)
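A soak run exercises the whole server. A small targeted stress test can exercise just the racy pair of calls and is cheap enough to run in CI under `-fsanitize=thread`. The sketch below does that for the mutex variant. `Room`, `Conn`, and `stress` are illustrative stand-ins rather than the server's classes, and the iteration count is arbitrary.

```cpp
#include <atomic>
#include <mutex>
#include <string>
#include <thread>

struct Room {
    std::atomic<int> delivered{0};
    void broadcast(const std::string&) { delivered.fetch_add(1); }
};

// Stand-in for the mutex-protected Connection; names are illustrative.
class Conn {
    std::mutex mutex_;
    Room* room_ = nullptr;

public:
    void setRoom(Room* room) {
        std::lock_guard<std::mutex> lock(mutex_);
        room_ = room;
    }
    void leaveRoom() { setRoom(nullptr); }
    bool handleMessage(const std::string& msg) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (room_ == nullptr) {
            return false;
        }
        room_->broadcast(msg);
        return true;
    }
};

// Hammers handleMessage() from one thread while another repeatedly leaves and
// rejoins the room. Returns how many messages were delivered.
int stress(int iterations) {
    Room room;
    Conn conn;
    conn.setRoom(&room);
    std::thread sender([&] {
        for (int i = 0; i < iterations; ++i) {
            conn.handleMessage("ping");
        }
    });
    std::thread leaver([&] {
        for (int i = 0; i < iterations; ++i) {
            conn.leaveRoom();
            conn.setRoom(&room);
        }
    });
    sender.join();
    leaver.join();   // both threads finish before room goes out of scope
    return room.delivered.load();
}
```

The delivered count varies from run to run; what matters is that the run finishes without a crash and, when built with TSan, without a race report. Removing the two `lock_guard` lines should make TSan flag the same read/write pair shown in the report above.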

Overhead (illustrative)

Approach  | Overhead | Safety
----------|----------|------------
Mutex     | ~5%      | High
weak_ptr  | ~10%     | Very high
Strand    | ~2%      | High (Asio)

We chose the strand because the server already runs on Asio.

12. Lessons

Takeaways

Enable core dumps in production where policy allows
rr is powerful for “can’t repro” bugs
TSan in CI catches races early
Strand / locks for shared mutable state

Intermittent crash workflow

graph TD
    A[Crash] --> B{Core dump?}
    B -->|Yes| C[gdb backtrace]
    B -->|No| D[Fix core settings, wait]
    C --> E{Repro locally?}
    E -->|Yes| F[gdb]
    E -->|No| G[rr record]
    G --> H[rr replay / reverse]
    H --> I[Root cause]
    I --> J[Fix]
    J --> K[TSan / ASan]

Patterns

// Bad: unsynchronized shared state
class BadConnection {
    ChatRoom* room_ = nullptr;
public:
    void handleMessage(const std::string& msg) {
        if (room_) room_->broadcast(msg); // races with any concurrent write to room_
    }
};

// Good: every read and write of room_ takes the same mutex
class GoodConnection {
    std::mutex mutex_;
    ChatRoom* room_ = nullptr;
public:
    void handleMessage(const std::string& msg) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (room_) room_->broadcast(msg);
    }
};

13. More techniques

(gdb) watch room_
(gdb) continue
(gdb) break Connection::handleMessage if room_ == 0
(gdb) info threads
(gdb) thread 2
(gdb) bt

Closing thoughts

Core dumps pinpointed the faulting instruction
gdb showed the stack
rr made nondeterminism debuggable
TSan confirmed the race
Strand serialized access safely

With the right tools, even "unreproducible" crashes can be fixed.

FAQ

Q1. rr in production?

Possible, with overhead; some teams run a subset of hosts under rr to catch nasty bugs.

Q2. Cores are huge

Pipe `core_pattern` to a compressor, or cap the size with `ulimit -c`.

Q3. TSan + ASan together?

No. The two cannot be enabled in the same build, so run them as separate CI jobs.

Keywords

C++, crash, segmentation fault, core dump, gdb, rr, reverse debugging, data race, ThreadSanitizer, TSan, multithreading, case study
