
How Containers Work — Isolation Without Virtual Machines
A container looks like a virtual machine — it has its own filesystem, its own process tree, its own network interfaces, its own hostname. But unlike a VM, it shares the host's kernel. There's no guest OS, no hypervisor, no hardware emulation. A container is a regular process running on the host, with kernel-enforced boundaries around what it can see and use.
This makes containers fast to start (milliseconds vs seconds for VMs), lightweight (no OS overhead), and dense (thousands of containers on a single host). It also makes them fundamentally different from VMs in terms of isolation.
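The shared kernel is easy to observe from the command line. A minimal sketch; the docker line assumes Docker is installed, so it is shown as a comment:

```shell
# The host's kernel version:
uname -r
# A container reports the *same* kernel version, because there is no guest OS.
# (Requires Docker, so it is commented out here.)
#   docker run --rm alpine uname -r
```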
What Makes a Container?
A container is built from three Linux kernel features:
1. Namespaces — what the process can see.
2. Cgroups — how much the process can use.
3. Layered filesystem — where the process's root filesystem comes from.
That's it. Docker, Podman, containerd — they're all tooling that configures namespaces, cgroups, and a filesystem, then starts a process. The kernel does the actual isolation.
How Do Namespaces Work?
Namespaces limit a process's view of the system. Each namespace isolates a different resource:
| Namespace | What it isolates | Effect |
|---|---|---|
| PID | Process IDs | Container sees its own process tree. Its PID 1 is not the host's PID 1. |
| Mount | Filesystem mounts | Container has its own root filesystem. Can't see host mounts. |
| Network | Network stack | Container has its own IP address, routes, ports, and firewall rules. |
| UTS | Hostname | Container has its own hostname. |
| User | User/group IDs | Root inside the container can map to an unprivileged user on the host. |
| IPC | Inter-process communication | Separate shared memory, semaphores, message queues. |
| Cgroup | Cgroup root | Container can't see host cgroup hierarchy. |
When Docker starts a container, it creates a new set of namespaces and starts the container's process inside them. The process thinks it's alone on the machine — its own PID 1, its own root filesystem, its own network interface — but it's all a kernel-enforced illusion.
The unshare syscall creates new namespaces. setns joins an existing namespace. clone with namespace flags creates a child process in new namespaces. These are the primitives that container runtimes use.
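These primitives are visible from userspace: every process's namespace memberships appear as symlinks under /proc/&lt;pid&gt;/ns, and the unshare(1) command wraps the syscall of the same name. A small sketch; the unshare line needs root (or unprivileged user namespaces), so it is commented out:

```shell
# One symlink per namespace type; two processes that show the same inode
# number for a given type are in the same namespace.
ls -l /proc/self/ns
# A new UTS namespace: the hostname change is invisible to the host.
# (Needs root, or a kernel that allows unprivileged user namespaces.)
#   unshare --uts sh -c 'hostname demo; hostname'   # prints "demo"
#   hostname                                        # host name unchanged
```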
How Do Cgroups Work?
Namespaces control what a process can see. Cgroups (control groups) control how much it can use:
| Cgroup controller | What it limits |
|---|---|
| cpu | CPU time (e.g., 50% of one core) |
| memory | RAM (e.g., 512 MB max, then OOM kill) |
| io | Disk I/O bandwidth and IOPS |
| pids | Maximum number of processes |
| cpuset | Which CPU cores the process can use |
Cgroups are hierarchical — you can set limits on a group of processes, then further subdivide within the group. Kubernetes uses cgroups to enforce pod resource limits (resources.limits.memory: "512Mi").
When a container exceeds its memory cgroup limit, the kernel's OOM killer terminates processes inside the container. The container doesn't crash the host — the damage is contained. This is the key resource isolation mechanism.
Cgroups are exposed as a filesystem: /sys/fs/cgroup/. Each cgroup is a directory. Writing to files in that directory sets limits. Container runtimes create cgroup directories, write limits, and move processes into them.
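A sketch of that filesystem interface, assuming cgroup v2 mounted at /sys/fs/cgroup (the default on current distributions). Only the read is unprivileged; the limit-setting lines need root and are commented out, and the "demo" group name is made up for the example:

```shell
# Which cgroup does the current process belong to? (cgroup v2: one line)
cat /proc/self/cgroup
# Creating a group and capping it at 512 MB (root only, cgroup v2 paths):
#   mkdir /sys/fs/cgroup/demo
#   echo 536870912 > /sys/fs/cgroup/demo/memory.max   # 512 * 1024 * 1024
#   echo $$ > /sys/fs/cgroup/demo/cgroup.procs        # move this shell in
```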
How Does the Container Filesystem Work?
A container needs its own root filesystem — the /bin, /lib, /etc that its programs expect. But copying an entire OS image for every container would waste disk space.
Overlayfs solves this with layers. Each layer is a directory of files, and overlayfs presents them as a single merged filesystem:
- Read a file → overlayfs searches layers top-down. The first layer containing the file wins.
- Write a file → overlayfs copies the file to the top (read-write) layer and modifies it there. The lower layers are untouched. This is copy-on-write.
- Delete a file → overlayfs creates a "whiteout" marker in the top layer, hiding the file in lower layers.
This means:
- Multiple containers can share the same base image layers (Ubuntu, Alpine). The layers are read-only and shared.
- Each container's changes are isolated to its own read-write layer.
- Container images are just stacked tarballs — one per layer. Pulling an image downloads only the layers you don't already have.
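The layer mechanics can be sketched with plain directories. The mount itself needs root (or a user namespace), so it is commented out; lowerdir/upperdir/workdir are the real overlayfs mount options, but the directory names and the etc-config file are made up for this example:

```shell
cd "$(mktemp -d)"
mkdir lower upper work merged
echo "from the base image"  > lower/etc-config
echo "container's own copy" > upper/etc-config   # upper shadows lower
echo "only in the base"     > lower/readme
# mount -t overlay overlay \
#   -o lowerdir=lower,upperdir=upper,workdir=work merged
# cat merged/etc-config   # top layer wins: "container's own copy"
# cat merged/readme       # falls through to the lower layer
cat upper/etc-config
```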
How Does Container Networking Work?
Each container gets its own network namespace — its own network interfaces, IP addresses, routing table, and iptables rules. But it still needs to communicate with the host and other containers.
The most common setup: the container runtime creates a virtual ethernet pair (veth). One end goes inside the container's network namespace, the other connects to a bridge on the host. The bridge acts like a virtual switch connecting all containers.
When a container sends a packet to the internet:
- The packet leaves the container's veth interface.
- It arrives at the host bridge.
- The host's iptables/nftables NATs it (translating the container's private IP to the host's IP).
- The packet goes out the host's physical interface.
Port mapping (-p 8080:80) adds an iptables rule that forwards traffic arriving on host port 8080 to container port 80. eBPF-based solutions like Cilium replace iptables with more efficient packet processing.
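Recreating that veth-and-bridge path by hand shows there is no magic in it. Every setup command needs root, so they appear as comments; "demo", "veth-host", "veth-ctr", and the 10.0.0.0/24 addresses are illustrative names, not defaults of any runtime. The one unprivileged line just lists the interfaces in the current (host) network namespace:

```shell
# Interfaces visible in the current network namespace:
ls /sys/class/net
# Container-style networking by hand (root only; all names illustrative):
#   ip netns add demo                                   # new network namespace
#   ip link add veth-host type veth peer name veth-ctr  # create the veth pair
#   ip link set veth-ctr netns demo                     # move one end inside
#   ip addr add 10.0.0.1/24 dev veth-host && ip link set veth-host up
#   ip netns exec demo ip addr add 10.0.0.2/24 dev veth-ctr
#   ip netns exec demo ip link set veth-ctr up
#   # NAT outbound traffic from the namespace (nftables equivalents exist):
#   iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -j MASQUERADE
```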
Containers vs Virtual Machines
| | Container | Virtual Machine |
|---|---|---|
| Isolation | Kernel namespaces | Hardware + hypervisor |
| Kernel | Shared with host | Own kernel |
| Start time | Milliseconds | Seconds to minutes |
| Memory overhead | Minimal | Full OS (hundreds of MB) |
| Density | Thousands per host | Tens per host |
| Security boundary | Process-level | Hardware-level |
| Compatibility | Same kernel (Linux on Linux) | Any OS on any host |
Containers provide weaker isolation: they share the kernel, so a single kernel vulnerability can allow escape from the container. VMs are stronger: a VM escape requires a hypervisor bug, and the hypervisor is a much smaller attack surface.
In practice, many deployments combine both: VMs for multi-tenant isolation (different customers), containers for application isolation within a tenant.
What Is a Container Image?
A container image is a stack of filesystem layers plus metadata (entrypoint command, environment variables, exposed ports). The OCI (Open Container Initiative) specification standardizes the format.
A Dockerfile describes how to build an image:
```dockerfile
FROM ubuntu:24.04                              # base layer
RUN apt-get update && apt-get install -y curl  # dependencies layer
COPY ./app /app                                # application layer
CMD ["/app/server"]                            # default command
```
Each instruction creates a layer. Layers are cached — if the apt-get install step didn't change, its cached layer is reused. This makes rebuilds fast and pushes/pulls small: only changed layers are transferred.
Images are stored in registries (Docker Hub, GitHub Container Registry, private registries). The registry stores layers individually and deduplicates shared layers across images.
Next Steps
This lesson completes the systems learning path. You now understand:
- How Memory Works — stack, heap, virtual memory
- How Processes Work — fork, exec, scheduling
- How the Kernel Works — syscalls, user/kernel boundary
- How Threads Work — concurrency, synchronization
- How File Systems Work — inodes, journaling, copy-on-write
- How Containers Work — namespaces, cgroups, overlayfs
These are the building blocks underneath everything: networking, search, databases, and distributed systems.