
How Containers Work — Process Isolation, Not Virtual Machines
A container is a process running on the host operating system with restricted visibility and limited resources. It is not a virtual machine. There is no hypervisor, no guest kernel, no hardware emulation. The container shares the host kernel — it just cannot see or touch most of the host's resources.
When you run `docker run nginx`, Docker does not boot a machine. It creates a regular Linux process, wraps it in a set of namespaces (so it sees only its own filesystem, PIDs, and network), assigns it to a cgroup (so it cannot consume unlimited CPU or memory), and mounts a union filesystem (so it gets a layered, copy-on-write root filesystem). The result looks like an isolated machine but is actually a constrained process.
The Three Pillars
Containers rest on three kernel features. Each solves a different problem:
Namespaces control what a process can see. A container in its own PID namespace sees itself as PID 1 — it has no idea other processes exist on the host. A container in its own network namespace has its own IP address, routing table, and firewall rules. A container in its own mount namespace has its own root filesystem. Linux provides eight namespace types (mount, PID, network, IPC, UTS, user, cgroup, and — since Linux 5.6 — time), and a typical container uses most of them.
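You can observe namespace membership directly: the kernel exposes each process's namespaces as symlinks under `/proc/<pid>/ns/`, and two processes are in the same namespace exactly when the linked IDs match. A minimal sketch (assumes a Linux `/proc`; returns an empty dict elsewhere):

```python
import os

def namespace_ids(pid="self"):
    """Read a process's namespace IDs from /proc/<pid>/ns/.

    Each entry is a symlink target like 'pid:[4026531836]'; two
    processes share a namespace exactly when these IDs are equal.
    """
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):      # non-Linux fallback
        return {}
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}

if __name__ == "__main__":
    for name, ident in namespace_ids().items():
        print(f"{name:16} {ident}")
```

Comparing this output for a shell on the host and a shell inside a container shows different IDs for `pid`, `net`, and `mnt` — the containerized process lives in its own copies of those namespaces.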
Cgroups (control groups) control what a process can use. A cgroup limits CPU time, memory, disk I/O, and the number of processes a container can create. When a container hits its memory limit, the kernel's OOM killer terminates it — the same mechanism that kills any process that exhausts its memory allocation.
Union filesystems control what a process sees on disk. A union filesystem (typically overlayfs) stacks read-only image layers with a writable layer on top. Reads fall through the layers until the file is found. Writes go to the top layer. This is how multiple containers can share the same base image without duplicating gigabytes of data.
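The fall-through-and-copy-on-write behavior can be modeled in a few lines. This is a toy dictionary model, not overlayfs itself, but it shows why two containers can share image layers yet never see each other's writes:

```python
class UnionFS:
    """Toy model of a union filesystem: read-only lower layers with
    one writable upper layer on top, like overlayfs.

    Reads fall through the stack top-down until a file is found;
    writes always land in the upper layer (copy-on-write).
    """

    def __init__(self, *lower_layers):
        self.lowers = list(lower_layers)   # read-only, bottom .. top
        self.upper = {}                    # private writable layer

    def read(self, path):
        for layer in [self.upper] + self.lowers[::-1]:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        self.upper[path] = data            # never touches a lower layer

base = {"/etc/os-release": "debian", "/usr/bin/nginx": "v1.0"}
app  = {"/usr/bin/nginx": "v1.2"}          # upper image layer shadows base

# Two containers share the same image layers but get private upper layers.
c1, c2 = UnionFS(base, app), UnionFS(base, app)
c1.write("/etc/os-release", "patched")     # copy-on-write, visible to c1 only

print(c1.read("/usr/bin/nginx"))   # v1.2    (topmost layer wins)
print(c1.read("/etc/os-release"))  # patched
print(c2.read("/etc/os-release"))  # debian  (c2 unaffected)
```

The shared `base` and `app` dictionaries play the role of image layers stored once on disk; only each container's upper layer is per-container state.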
How a Container Starts
When a container runtime like runc launches a container, the sequence is:
1. Parse the OCI bundle — read the configuration (rootfs path, environment variables, capabilities, namespace settings).
2. Create namespaces — call `clone()` with flags for each namespace type (`CLONE_NEWPID`, `CLONE_NEWNET`, `CLONE_NEWNS`, etc.).
3. Set up the cgroup — create a cgroup directory, write resource limits, add the container's PID.
4. Mount the root filesystem — set up the overlayfs mount with image layers as lower directories and a writable upper directory.
5. Pivot root — switch the container's root filesystem from the host's `/` to the container's overlayfs mount.
6. Drop capabilities — remove Linux capabilities the container does not need (no raw sockets, no kernel module loading, no clock changes).
7. Execute the entrypoint — `exec()` the container's command (e.g., `nginx -g 'daemon off;'`).
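Steps 2 and 7 can be sketched with the raw `unshare(2)` syscall via ctypes. This is a minimal illustration, not a real runtime: it skips cgroups, overlayfs, `pivot_root`, and capability dropping, and actually creating these namespaces requires root (or an unprivileged user namespace). The flag values are the standard ones from `<linux/sched.h>`:

```python
import ctypes
import os
import sys

# Namespace flags shared by clone() and unshare(), from <linux/sched.h>.
CLONE_NEWNS  = 0x00020000   # mount namespace
CLONE_NEWUTS = 0x04000000   # hostname / domain name
CLONE_NEWIPC = 0x08000000   # System V IPC, POSIX message queues
CLONE_NEWPID = 0x20000000   # PID numbering
CLONE_NEWNET = 0x40000000   # network stack

libc = ctypes.CDLL(None, use_errno=True)

def run_isolated(argv):
    """Steps 2 and 7 only: new namespaces, then exec the entrypoint.

    A real runtime sets up the cgroup, mounts overlayfs, pivots root,
    and drops capabilities between these two calls. Requires root.
    """
    flags = (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC
             | CLONE_NEWPID | CLONE_NEWNET)
    if libc.unshare(flags) != 0:
        raise OSError(ctypes.get_errno(), "unshare failed (not root?)")
    pid = os.fork()               # first child in a new PID namespace
    if pid == 0:
        os.execvp(argv[0], argv)  # step 7: become the container process
    os.waitpid(pid, 0)

if __name__ == "__main__" and os.geteuid() == 0:
    # The child sees itself as PID 1 in its own PID namespace.
    run_isolated(sys.argv[1:] or ["sh", "-c", "echo pid=$$"])
```

Note that `unshare()` affects the *next* `fork()` for the PID namespace, which is why the sketch forks before exec'ing — the forked child is the process that becomes PID 1.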
After step 7, the container is a regular process. The kernel enforces all restrictions through existing mechanisms — no special "container mode" exists.
Containers vs Virtual Machines
The fundamental difference: a VM runs its own kernel. A container shares the host kernel.
A VM boots a full guest operating system with its own kernel. This provides strong isolation — the guest kernel handles syscalls independently — but costs hundreds of megabytes of RAM and seconds to start. Each VM duplicates kernel code, device drivers, and system services.
A container shares the host kernel. It calls the same syscalls as every other process on the host. Isolation comes from namespaces and cgroups, not from hardware separation. This means containers start in milliseconds, use megabytes of overhead instead of gigabytes, and can run hundreds per host instead of dozens.
The tradeoff: containers provide weaker isolation than VMs. A kernel vulnerability affects all containers on the host. A VM with its own kernel is unaffected by host kernel bugs (though hypervisor vulnerabilities are also possible).
| | Virtual Machine | Container |
|---|---|---|
| Isolation | Hardware-level (hypervisor) | Kernel-level (namespaces + cgroups) |
| Startup time | Seconds to minutes | Milliseconds |
| Memory overhead | Hundreds of MB (guest OS) | Megabytes (process + runtime state) |
| Kernel | Own guest kernel | Shared host kernel |
| Density | Tens per host | Hundreds per host |
| Security boundary | Strong (hypervisor) | Moderate (kernel features) |
| Examples | QEMU/KVM, VMware, Hyper-V | Docker, Podman, containerd |
Why Containers Are Not Secure by Default
Sharing the host kernel means sharing the kernel's attack surface. A container process makes syscalls to the same kernel as every other process. If a syscall has a vulnerability, a container can exploit it to escape its namespace isolation.
Default container configurations run as root inside the container, have access to most syscalls, and share the host's kernel. Hardening requires: running as a non-root user, dropping Linux capabilities, restricting syscalls with seccomp profiles, using read-only root filesystems, and limiting network access.
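Those hardening steps map directly onto `docker run` flags. A configuration sketch (the image name and limit values are illustrative, and the image must be built to run as a non-root user with a read-only root):

```shell
# Hardened docker run — each flag applies one mitigation:
#   --user                   non-root UID/GID inside the container
#   --cap-drop ALL           start with zero Linux capabilities
#   no-new-privileges        block setuid/setgid privilege escalation
#   --read-only + --tmpfs    immutable rootfs, writable scratch only
#   --memory, --pids-limit   cgroup caps on RAM and process count
docker run \
  --user 65532:65532 \
  --cap-drop ALL \
  --security-opt no-new-privileges \
  --read-only --tmpfs /tmp \
  --memory 256m --pids-limit 100 \
  my-app
```

Capabilities the application genuinely needs can then be added back individually with `--cap-add`, which is far safer than starting from the default set.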
Rootless containers (supported by Podman and recent Docker) run the entire container runtime as an unprivileged user, adding a layer of defense through user namespaces: root inside the container maps to an ordinary unprivileged UID on the host.
The Runtime Stack
The modern container stack has three layers:
- CLI/daemon (Docker, Podman, nerdctl) — user-facing interface that calls the high-level runtime.
- High-level runtime (containerd, CRI-O) — manages container lifecycle, image pulling, storage, and networking. Speaks the Kubernetes CRI protocol.
- Low-level runtime (runc, crun, youki) — creates namespaces, sets up cgroups, pivots root, executes the process. Speaks the OCI Runtime Specification.
When you run `docker run nginx`, Docker tells containerd, which calls runc, which creates the namespaces, cgroups, and overlayfs mount, then `exec()`s the nginx process.
Next Steps
- How Namespaces Work — deep dive into each Linux namespace type.
- How Cgroups Work — resource limits, the OOM killer, and cgroup v2.
- How Processes Work — the foundation: what a process is before you isolate it.
- How the Kernel Works — the kernel that all containers share.