Linux Internals & Operations: CFS Scheduling, Page Cache, VFS, Syscalls, Production Patterns

Key takeaways

Linux joins user space and the kernel at the syscall boundary, while CFS shares CPU time, the page cache and swap absorb memory pressure, and VFS unifies filesystems. This article ties those subsystems to limits, observability, and incident patterns you actually see in production.

Introduction

Linux powers servers, cloud instances, and embedded devices. To answer why latency spikes, memory alarms fire, or containers OOM while the node looks fine, you need a coherent picture: user-space threads enter the kernel through system calls, the CFS scheduler divides CPU time, virtual memory holds file pages in the page cache and may push anonymous pages to swap, and VFS presents a single interface over ext4, XFS, NFS, overlayfs, and more.

This is not a command cheat sheet. It connects internals to operations: cgroup limits, pressure metrics, and debugging workflows. Pair disk and inode issues with Linux disk full vs inode full, and syscall-centric user-space work with Linux syscall programming in C++.

What this guide covers

  • Process scheduling: CFS intuition, vruntime, runqueues, interaction with nice and cgroup CPU weights
  • Memory management: page cache, dirty writeback, reclaim, swap, OOM
  • Filesystem architecture: VFS, dentry/inode caches, struct file and f_op
  • System call interface: entry path, vdso, tracepoints, seccomp in containers
  • Production patterns: cgroups v2, limits, observability signals, incident triage

Prerequisites and vocabulary

  • User space: where applications and libraries run; cannot touch hardware directly.
  • Kernel space: privileged mode; manages CPUs, memory, block devices, and networking.
  • System call: the supported contract for requesting kernel services.
  • Task: the unit of scheduling; in practice this maps to a thread (a process is a group of threads).

CPU and multitasking: why the scheduler matters

A single CPU core executes one thread at a time. The OS uses time slicing so many tasks appear to run concurrently. The scheduler decides who runs for how long, interacting with preemption, runqueues, priorities, and cgroup caps.

CFS intuition

The default scheduling class for normal tasks, CFS (Completely Fair Scheduler), approximates fairness using virtual runtime (vruntime). Tasks accrue vruntime as they consume CPU; the scheduler tends to pick tasks with smaller vruntime so that, over time, CPU time balances across competing tasks.

Weights (priority/nice/cgroup cpu.weight) change how fast vruntime grows. Lower nice (higher priority) tends to grow vruntime more slowly, so the task is picked more often. This is statistical fairness, not a hard latency guarantee.

Inside CFS: cfs_rq and time slices

Each CPU has a CFS runqueue (cfs_rq). Tasks are ordered by vruntime, commonly backed by a red-black tree, so the “leftmost” task is the fairest next candidate. CFS does not use a single fixed quantum for every task forever; parameters such as sched_latency_ns (target scheduling latency) and sched_min_granularity_ns (minimum slice to avoid thrashing) shape how often tasks rotate under load.

When the number of runnable tasks spikes, per-task slices can approach the minimum granularity and scheduling overhead rises: you may see high context-switch rates without a proportional amount of useful work, which is worth correlating with perf sched events.

Hierarchical CPU: cgroups and CFS

Services and containers usually sit under CPU cgroups. With hierarchical cpu.weight and cpu.max, a child task cannot escape a parent cap even if CFS would otherwise treat it as “fair” within the child group. When a process “looks healthy in top” but is slow, expand both the process tree and the cgroup tree.

Real-time and deadline classes

SCHED_FIFO/SCHED_RR follow different preemption rules; misconfiguration can starve normal tasks. SCHED_DEADLINE targets periodic real-time constraints with runtime/deadline parameters. Typical web and API workloads remain on CFS—know these classes exist when you debug vendor or media pipelines that require them.

Operational pitfalls

  • Context-switch storms: lock contention, too many runnable threads, or tight polling loops burn CPU without progress—perf shows scheduling hotspots.
  • CPU throttling: cgroup cpu.max injects forced idle; latency rises even if code looks fine in profiles.

Memory: page cache, swap, and OOM

The kernel manages memory in pages (often 4KiB). Anonymous mappings (heap/stack) and file-backed mappings participate differently in reclaim; the page cache dominates read-heavy performance.

Page cache

Reads and writes flow through the page cache; repeated reads hit RAM instead of disk. Writes create dirty pages; background writeback flushes them over time. MemAvailable approximates memory that can be reclaimed without driving the system into distress—better than raw free for capacity alarms.

Why it matters: “Memory full” alerts often include evictable cache. A cold cache after maintenance or migration can raise read latency—correlate with iostat, vmstat, and application access patterns.

LRU-oriented reclaim and kswapd

The kernel tracks pages in active/inactive style lists (details evolve by version) to keep recently used pages longer. kswapd works in the background around watermarks to reduce direct reclaim on the syscall path. Direct reclaim can synchronously hunt for pages and cause latency spikes—a common “random slowness” source under memory pressure.

Dirty ratios and writeback stalls

Parameters like dirty_ratio / dirty_background_ratio (or byte-based tunables) influence how much dirty memory accumulates before background or synchronous flush. Bursty logging or batch jobs can interact with disk bandwidth to create write stalls.

Swap and pressure

When RAM is tight, the kernel reclaims pages. Clean file pages are cheap to drop; anonymous pages may go to swap. Heavy swapping turns memory pressure into disk latency and can devolve into thrashing. vm.swappiness biases reclaim behavior between file cache and anonymous pages—interpret it together with workload (databases with large buffers are especially sensitive).

OOM killer

If reclaim cannot free enough memory, the OOM killer terminates a process. Selection depends on oom_score, cgroup limits, and task attributes. Containers may OOM in their cgroup while the host looks fine.

Operational notes

  • drop_caches: useful for benchmarks or targeted debugging; can hurt latency by cooling caches—avoid as a “fix” without root-cause analysis.
  • Watermarks / min_free_kbytes: tune carefully; wrong values can trigger early reclaim or leave the system fragile under bursts.

Filesystem stack and VFS

open/read/write enter VFS first. VFS sits above concrete filesystems (ext4, XFS, NFS, tmpfs, overlayfs) and provides a unified object model.

dentry and inode caches

  • inode: metadata and mapping to data extents on disk.
  • dentry: cached path components that stitch pathname resolution together.

Hot paths stay in the dentry cache; cold paths hit directory blocks on disk. Millions of tiny files stress dentry/inode caches and can interact with inode exhaustion.

VFS objects: struct file and file_operations

A successful open() installs a struct file in the FD table: current offset, flags (O_APPEND, O_NONBLOCK, …), linkage to an inode, and a pointer to file_operations (f_op). read/write/mmap dispatch through f_op into ext4/XFS/tmpfs or other implementations. The same syscall therefore follows different kernel paths for block-backed files, sockets, pipes, or network filesystems.

Path lookup and RCU

Resolving a path walks dentries; cache misses read directories from backing storage. Modern kernels optimize lookups (including RCU-friendly read paths), but metadata-heavy workloads can still burn CPU and I/O.

Page cache and the block layer

Filesystems map logical blocks through the block layer to drivers. fsync/fdatasync matter when you need durability guarantees—databases and distributed systems often couple semantics here.

Network filesystems and containers

NFS, Ceph FUSE, and similar stacks add latency and cache-coherence concerns under VFS. overlayfs image layers can amplify metadata operations, which often explains "fast on the host, slow in the container."


System call interface

Syscalls are the contract between user space and the kernel. On x86-64, userspace passes the syscall number and arguments per the ABI; the kernel dispatches through the syscall table. vdso maps a small userspace page for fast helpers like gettimeofday, avoiding full syscall transitions where possible.

Entry, return, and tracepoints

User libraries invoke the syscall instruction path; the CPU raises privilege and jumps to the kernel entry stub, which dispatches to the handler for the syscall number. Return values land in registers; errors follow the POSIX errno convention. Tracepoints such as sys_enter_* / sys_exit_* pair well with perf and eBPF for low-overhead syscall analytics.

Legacy paths (int 0x80 on IA-32, old vsyscall quirks) are mostly historical; modern x86-64 workloads use the syscall instruction fast path.

Observability and debugging

  • strace -c -p PID: syscall counts and errno patterns.
  • perf trace: syscall streams with scheduling context.
  • eBPF: attach to execve, connect, openat, etc., for policy and telemetry.

seccomp and containers

Runtimes apply seccomp profiles to reduce allowed syscalls. Unexpected syscalls fail fast—often as SIGSYS—so profiles must match real binaries across releases.


Production Linux patterns

cgroups and systemd

cgroups v2 groups CPU, memory, I/O, pids, and more in a single hierarchy. systemd places services into cgroups and ties restarts and resource accounting to them. “Alive but slow” frequently maps to CPU throttling; “suddenly killed” to memory limits or OOM.

Memory: memory.max vs memory.high

memory.max is a hard ceiling. memory.high applies soft pressure: above the threshold the kernel more aggressively reclaims or throttles to push usage down. Kubernetes QoS and limits interact with these semantics—pods can OOM while nodes look healthy if limits are tight.

PID limits: pids.max

Fork bombs or thread leaks exhaust pids controller quotas before memory does; failures can be non-obvious without checking cgroup pids events.

Limits operators touch often

  • ulimit -n: per-process FD cap—reverse proxies and connection pools hit this.
  • fs.file-max: system-wide open file ceiling.
  • Network tunables: verify against kernel version docs—copy/paste tuning without measurement is risky.

Observability stack

  • Metrics: CPU utilization and run-queue latency / saturation; MemAvailable; disk util%; TCP retransmits.
  • Logs: structured app logs plus journalctl for boot/OOM timelines.
  • Profiling: perf, async-profiler, eBPF for combined kernel+user stacks.

Incident triage checklist

  1. Separate symptoms: CPU vs memory vs disk vs network bottleneck.
  2. Check limits: cgroup, ulimit, cloud credits, Kubernetes requests/limits.
  3. Kernel logs: dmesg for OOM, storage errors, NIC resets.
  4. Scheduler/syscalls: perf sched, strace for hot loops or syscall storms.
  5. Correlate changes: deploys, kernel upgrades, network path shifts.

Troubleshooting map

Symptom | Likely subsystem | Start with
--- | --- | ---
Broad slowness, high steal | CPU scheduling / hypervisor steal | vmstat 1, host credit metrics
RAM "free" but app slow | page cache miss / I/O wait | iostat -xz, workload IO pattern
Rising swap | reclaim / memory pressure | vmstat, sar -W
open/stat storms | dentry/inode caches / metadata | strace, batch jobs touching many files
Container-only OOM | cgroup memory.max | systemd-cgtop, kube limits

Conclusion

Linux connects user space to the kernel at syscalls. CFS shares CPU time fairly in the normal class, page cache and swap mediate memory pressure, VFS unifies filesystem implementations, and cgroups enforce the limits that manifest as throttling and OOM in real services. Treating symptoms as signals from a specific subsystem—scheduler, reclaim, dentry cache, syscall policy—shortens incidents and avoids cargo-cult tuning.

For host hardening and SSH exposure, see Linux server security hardening.