
How Cgroups Work — Resource Limits and Accounting
A cgroup (control group) is a Linux kernel feature that limits, accounts for, and isolates the resource usage of a group of processes. If namespaces control what a process can see, cgroups control what it can consume.
When you run docker run --memory=512m --cpus=2, Docker creates a cgroup with those limits. The container's processes are placed in that cgroup. The kernel enforces the limits transparently — no cooperation from the containerized application is required.
The Cgroup Hierarchy
Cgroups are organized as a tree. Each node in the tree is a cgroup that can have resource limits and contain processes. Child cgroups inherit their parent's constraints and can have additional limits applied.
The hierarchy is exposed as a filesystem, typically mounted at /sys/fs/cgroup. Creating a directory creates a cgroup. Writing a PID to cgroup.procs adds a process. Writing a value to memory.max sets a memory limit. Everything is done through file I/O — no special syscalls needed.
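A minimal sketch of that filesystem interface (assumptions: cgroup v2 is mounted at /sys/fs/cgroup, and the cgroup name `demo` is arbitrary; the privileged writes only run as root, the shape of the interface is the point):

```shell
# Create a cgroup, set a memory limit, and add a process -- all via file I/O.
CG=/sys/fs/cgroup/demo
if [ "$(id -u)" -eq 0 ] && [ -f /sys/fs/cgroup/cgroup.controllers ]; then
  mkdir -p "$CG"                        # creating a directory creates a cgroup
  echo "268435456" > "$CG/memory.max"   # 256 MiB hard memory limit
  echo "$$" > "$CG/cgroup.procs"        # move the current shell into the cgroup
fi
echo "target cgroup: $CG"
```

Removing the cgroup later is the same idiom in reverse: move its processes out, then `rmdir` the directory.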
Resource Controllers
Each resource type has a controller that enforces limits and tracks usage.
Memory Controller
The memory controller limits how much memory a cgroup can use. Key control files:
- memory.max — hard limit. When the cgroup's memory usage hits this limit, the kernel's OOM killer terminates a process in the cgroup.
- memory.high — soft limit. When exceeded, the kernel aggressively reclaims memory from the cgroup (swapping, page cache eviction) but does not kill processes. Applications slow down but survive.
- memory.current — current memory usage (read-only).
- memory.swap.max — maximum swap usage.
When a cgroup exceeds memory.max, the OOM killer selects a process to terminate based on the oom_score_adj value. In a container context, this means the container is killed when it exceeds its memory limit — the same behavior you see when a Kubernetes pod is "OOMKilled."
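OOM kills are visible in the cgroup's memory.events file. A sketch that extracts the counter (the here-string stands in for a real /sys/fs/cgroup/<path>/memory.events; the field names are the actual v2 fields):

```shell
# Count OOM kills recorded for a cgroup by parsing memory.events.
events='low 0
high 4
max 21
oom 3
oom_kill 3'
oom_kills=$(printf '%s\n' "$events" | awk '$1 == "oom_kill" { print $2 }')
echo "oom_kill count: $oom_kills"
```

A nonzero oom_kill counter is the kernel-level evidence behind a pod's "OOMKilled" status.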
CPU Controller
The CPU controller limits CPU time. Two mechanisms:
CPU shares (cpu.weight in v2, cpu.shares in v1) set relative priority. A cgroup with weight 200 gets twice the CPU time of a cgroup with weight 100 — but only when the CPU is contended. If no other cgroup needs the CPU, any cgroup can use 100%.
CPU quota (cpu.max in v2) sets a hard limit. The file holds two values, quota and period, in microseconds — for example, 200000 100000 means "200ms of CPU time per 100ms period," the equivalent of 2 CPU cores. A container with --cpus=1.5 gets cpu.max = 150000 100000.
The difference matters: shares are relative and allow bursting (your container can use idle CPU). Quotas are absolute limits (your container is throttled even if CPUs are idle).
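The --cpus-to-quota conversion is simple arithmetic; a sketch (assuming the default 100ms period):

```shell
# Translate a fractional CPU count into the cpu.max "quota period" format.
period=100000                 # default period: 100ms, in microseconds
cpus="1.5"                    # desired limit, as in --cpus=1.5
quota=$(awk -v c="$cpus" -v p="$period" 'BEGIN { printf "%d", c * p }')
echo "cpu.max: $quota $period"
```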
I/O Controller
The I/O controller (io.max, io.weight) throttles disk read/write bandwidth and IOPS per device. You can limit a container to, say, 50 MB/s of disk writes, preventing a noisy neighbor from saturating the disk for every other container.
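io.max takes per-device lines keyed by major:minor number. A sketch building the 50 MB/s write cap mentioned above (the device number 8:0 is a typical value for /dev/sda and is an assumption — check `ls -l /dev` on your host):

```shell
# Build an io.max line limiting write bandwidth on one block device.
dev="8:0"
wbps=$((50 * 1000 * 1000))     # 50 MB/s in bytes per second (decimal megabytes)
line="$dev wbps=$wbps"
echo "$line"                   # as root: echo "$line" > /sys/fs/cgroup/<path>/io.max
```

The same line format accepts rbps (read bandwidth) and riops/wiops (IOPS) keys.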
PID Controller
The PID controller (pids.max) limits the number of processes a cgroup can create. This is the fork bomb defense — without it, a container could call fork() in an infinite loop and exhaust the host's PID space, denying service to every other container and the host itself.
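A sketch of the cap (the cgroup name `demo` is an arbitrary assumption; the write requires root and cgroup v2, so it is guarded):

```shell
# Limit a cgroup to 100 tasks; fork()/clone() beyond that fail with EAGAIN.
CG=/sys/fs/cgroup/demo
limit=100
if [ "$(id -u)" -eq 0 ] && [ -d "$CG" ]; then
  echo "$limit" > "$CG/pids.max"
  cat "$CG/pids.current"        # how many tasks the cgroup holds right now
fi
echo "pids.max: $limit"
```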
Cgroup v1 vs v2
Cgroup v1 was the original implementation. Each controller had its own independent hierarchy. A process could be in one cgroup for memory and a different cgroup for CPU. This made configuration complex and sometimes inconsistent — setting a memory limit on one hierarchy had no relationship to the CPU limit on another.
Cgroup v2 (unified hierarchy) puts all controllers on a single tree. A process belongs to exactly one cgroup, and all controllers are managed together. This is simpler, more predictable, and the only version actively developed.
Key v2 improvements:
- Single hierarchy — one cgroup tree, all controllers attached to it.
- Pressure Stall Information (PSI) — real-time metrics showing how much time processes in a cgroup spend waiting for CPU, memory, or I/O. Used by Kubernetes for resource decisions.
- Better delegation — unprivileged users can manage sub-cgroups (enabling rootless containers).
- Threaded mode — threads within a process can be in different sub-cgroups for CPU scheduling.
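PSI is read from per-cgroup pressure files (cpu.pressure, memory.pressure, io.pressure). A sketch parsing the format — the sample line mimics a real memory.pressure file, where "some" is the percentage of time at least one task was stalled:

```shell
# Extract the 10-second stall average from a PSI line.
psi='some avg10=1.23 avg60=0.40 avg300=0.12 total=987654'
avg10=$(printf '%s\n' "$psi" | sed -n 's/^some avg10=\([0-9.]*\).*/\1/p')
echo "memory stall (10s avg): ${avg10}%"
```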
Most modern distributions (Ubuntu 22.04+, Fedora 31+, Debian 11+) default to cgroup v2. Docker and Kubernetes fully support v2.
How Container Runtimes Use Cgroups
When you run docker run --memory=512m --cpus=1.5 --pids-limit=100 nginx, the runtime:
- Creates a cgroup directory: /sys/fs/cgroup/docker/<container-id>/
- Writes 536870912 to memory.max (512 MiB in bytes)
- Writes 150000 100000 to cpu.max
- Writes 100 to pids.max
- Writes the container's PID to cgroup.procs
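The same sequence as a script (a sketch: `demo` stands in for the container ID, and the writes only run as root on a cgroup v2 host):

```shell
# Replicate the runtime's cgroup setup for --memory=512m --cpus=1.5 --pids-limit=100.
CG=/sys/fs/cgroup/docker/demo
mem=$((512 * 1024 * 1024))      # 512 MiB in bytes
if [ "$(id -u)" -eq 0 ] && [ -f /sys/fs/cgroup/cgroup.controllers ]; then
  mkdir -p "$CG"
  echo "$mem"          > "$CG/memory.max"
  echo "150000 100000" > "$CG/cpu.max"      # --cpus=1.5
  echo "100"           > "$CG/pids.max"     # --pids-limit=100
  echo "$$"            > "$CG/cgroup.procs" # the container's init PID in practice
fi
echo "memory.max bytes: $mem"
```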
The container process and all its children are now constrained. The kernel enforces limits on every memory allocation, CPU scheduling decision, and fork() call.
Kubernetes adds another layer: the kubelet creates cgroups for each pod (with the pod's resource requests and limits) and nested cgroups for each container within the pod. The pod-level cgroup ensures the total resource usage of all containers in the pod stays within bounds.
Monitoring Cgroup Usage
Every cgroup exposes usage statistics through files:
```shell
$ cat /sys/fs/cgroup/docker/<id>/memory.current
234881024
$ cat /sys/fs/cgroup/docker/<id>/cpu.stat
usage_usec 8420316
user_usec 6320000
system_usec 2100316
nr_periods 1542
nr_throttled 12
throttled_usec 48000
```
docker stats, Prometheus cAdvisor, and Kubernetes metrics all read from these files. The nr_throttled and throttled_usec values in cpu.stat tell you whether your CPU limit is too tight — if a container is frequently throttled, it needs more CPU or the limit should be raised.
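A sketch of that diagnosis, computing the fraction of scheduling periods in which the cgroup was throttled (the sample values match the transcript above):

```shell
# Compute the throttle ratio from cpu.stat fields.
stat='nr_periods 1542
nr_throttled 12
throttled_usec 48000'
ratio=$(printf '%s\n' "$stat" | awk '
  $1 == "nr_periods"   { p = $2 }
  $1 == "nr_throttled" { t = $2 }
  END { printf "%.1f", 100 * t / p }')
echo "throttled in ${ratio}% of periods"
```

Under 1% throttling, as here, is usually benign; a ratio in the tens of percent means the quota is actively constraining the workload.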
Next Steps
- How Container Images Work — layered filesystems and the OCI specification.
- How Memory Works — the memory model that cgroups constrain.
- How the Kernel Works — the kernel features cgroups rely on.