
How Containers Work — Isolation Without Virtual Machines
A container looks like a virtual machine — it has its own filesystem, its own process tree, its own network interfaces, its own hostname. But unlike a VM, it shares the host's kernel. There's no guest OS, no hypervisor, no hardware emulation. A container is a regular process running on the host, with kernel-enforced boundaries around what it can see and use.
This makes containers fast to start (milliseconds vs seconds for VMs), lightweight (no OS overhead), and dense (thousands of containers on a single host). It also makes them fundamentally different from VMs in terms of isolation.
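The shared kernel is easy to observe from the command line. A minimal sketch; the docker line assumes Docker is installed, so it is shown as a comment:

```shell
# The host's kernel version:
uname -r
# A container reports the *same* kernel version, because there is no guest OS.
# (Requires Docker, so it is commented out here.)
#   docker run --rm alpine uname -r
```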
What Makes a Container?
A container is built from three Linux kernel features:
1. Namespaces — what the process can see.
2. Cgroups — how much the process can use.
3. Layered filesystem — where the process's root filesystem comes from.
That's it. Docker, Podman, containerd — they're all tooling that configures namespaces, cgroups, and a filesystem, then starts a process. The kernel does the actual isolation.
How Do Namespaces Work?
Namespaces limit a process's view of the system. Each namespace isolates a different resource:
| Namespace | What it isolates | Effect |
|---|---|---|
| PID | Process IDs | Container sees its own process tree. Its PID 1 is not the host's PID 1. |
| Mount | Filesystem mounts | Container has its own root filesystem. Can't see host mounts. |
| Network | Network stack | Container has its own IP address, routes, ports, and firewall rules. |
| UTS | Hostname | Container has its own hostname. |
| User | User/group IDs | Root inside the container can map to an unprivileged user on the host. |
| IPC | Inter-process communication | Separate shared memory, semaphores, message queues. |
| Cgroup | Cgroup root | Container can't see host cgroup hierarchy. |
When Docker starts a container, it creates a new set of namespaces and starts the container's process inside them. The process thinks it's alone on the machine — its own PID 1, its own root filesystem, its own network interface — but it's all a kernel-enforced illusion.
The unshare syscall creates new namespaces. setns joins an existing namespace. clone with namespace flags creates a child process in new namespaces. These are the primitives that container runtimes use.
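These primitives are visible from userspace: every process's namespace memberships appear as symlinks under /proc/&lt;pid&gt;/ns, and the unshare(1) command wraps the syscall of the same name. A small sketch; the unshare line needs root (or unprivileged user namespaces), so it is commented out:

```shell
# One symlink per namespace type; two processes that show the same inode
# number for a given type are in the same namespace.
ls -l /proc/self/ns
# A new UTS namespace: the hostname change is invisible to the host.
# (Needs root, or a kernel that allows unprivileged user namespaces.)
#   unshare --uts sh -c 'hostname demo; hostname'   # prints "demo"
#   hostname                                        # host name unchanged
```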
How Do Cgroups Work?
Namespaces control what a process can see. Cgroups (control groups) control how much it can use:
| Cgroup controller | What it limits |
|---|---|
| cpu | CPU time (e.g., 50% of one core) |
| memory | RAM (e.g., 512 MB max, then OOM kill) |
| io | Disk I/O bandwidth and IOPS |
| pids | Maximum number of processes |
| cpuset | Which CPU cores the process can use |
Cgroups are hierarchical — you can set limits on a group of processes, then further subdivide within the group. Kubernetes uses cgroups to enforce pod resource limits (resources.limits.memory: "512Mi").
When a container exceeds its memory cgroup limit, the kernel's OOM killer terminates processes inside the container. The container doesn't crash the host — the damage is contained. This is the key resource isolation mechanism.
Cgroups are exposed as a filesystem: /sys/fs/cgroup/. Each cgroup is a directory. Writing to files in that directory sets limits. Container runtimes create cgroup directories, write limits, and move processes into them.
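A sketch of that filesystem interface, assuming cgroup v2 mounted at /sys/fs/cgroup (the default on current distributions). Only the read is unprivileged; the limit-setting lines need root and are commented out, and the "demo" group name is made up for the example:

```shell
# Which cgroup does the current process belong to? (cgroup v2: one line)
cat /proc/self/cgroup
# Creating a group and capping it at 512 MB (root only, cgroup v2 paths):
#   mkdir /sys/fs/cgroup/demo
#   echo 536870912 > /sys/fs/cgroup/demo/memory.max   # 512 * 1024 * 1024
#   echo $$ > /sys/fs/cgroup/demo/cgroup.procs        # move this shell in
```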
How Does the Container Filesystem Work?
A container needs its own root filesystem — the /bin, /lib, /etc that its programs expect. But copying an entire OS image for every container would waste disk space.
Overlayfs solves this with layers. Each layer is a directory of files, and overlayfs presents them as a single merged filesystem:
- Read a file → overlayfs searches layers top-down. The first layer containing the file wins.
- Write a file → overlayfs copies the file to the top (read-write) layer and modifies it there. The lower layers are untouched. This is copy-on-write.
- Delete a file → overlayfs creates a "whiteout" marker in the top layer, hiding the file in lower layers.
This means:
- Multiple containers can share the same base image layers (Ubuntu, Alpine). The layers are read-only and shared.
- Each container's changes are isolated to its own read-write layer.
- Container images are just stacked tarballs — one per layer. Pulling an image downloads only the layers you don't already have.
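The layer mechanics can be sketched with plain directories. The mount itself needs root (or a user namespace), so it is commented out; lowerdir/upperdir/workdir are the real overlayfs mount options, but the directory names and the etc-config file are made up for this example:

```shell
cd "$(mktemp -d)"
mkdir lower upper work merged
echo "from the base image"  > lower/etc-config
echo "container's own copy" > upper/etc-config   # upper shadows lower
echo "only in the base"     > lower/readme
# mount -t overlay overlay \
#   -o lowerdir=lower,upperdir=upper,workdir=work merged
# cat merged/etc-config   # top layer wins: "container's own copy"
# cat merged/readme       # falls through to the lower layer
cat upper/etc-config
```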
How Does Container Networking Work?
Each container gets its own network namespace — its own network interfaces, IP addresses, routing table, and iptables rules. But it still needs to communicate with the host and other containers.
The most common setup: the container runtime creates a virtual ethernet pair (veth). One end goes inside the container's network namespace, the other connects to a bridge on the host. The bridge acts like a virtual switch connecting all containers.
When a container sends a packet to the internet:
- The packet leaves the container's veth interface.
- It arrives at the host bridge.
- The host's iptables/nftables NATs it (translating the container's private IP to the host's IP).
- The packet goes out the host's physical interface.
Port mapping (-p 8080:80) adds an iptables rule that forwards traffic arriving on host port 8080 to container port 80. eBPF-based solutions like Cilium replace iptables with more efficient packet processing.
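Recreating that veth-and-bridge path by hand shows there is no magic in it. Every setup command needs root, so they appear as comments; "demo", "veth-host", "veth-ctr", and the 10.0.0.0/24 addresses are illustrative names, not defaults of any runtime. The one unprivileged line just lists the interfaces in the current (host) network namespace:

```shell
# Interfaces visible in the current network namespace:
ls /sys/class/net
# Container-style networking by hand (root only; all names illustrative):
#   ip netns add demo                                   # new network namespace
#   ip link add veth-host type veth peer name veth-ctr  # create the veth pair
#   ip link set veth-ctr netns demo                     # move one end inside
#   ip addr add 10.0.0.1/24 dev veth-host && ip link set veth-host up
#   ip netns exec demo ip addr add 10.0.0.2/24 dev veth-ctr
#   ip netns exec demo ip link set veth-ctr up
#   # NAT outbound traffic from the namespace (nftables equivalents exist):
#   iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -j MASQUERADE
```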
Containers vs Virtual Machines
| | Container | Virtual Machine |
|---|---|---|
| Isolation | Kernel namespaces | Hardware + hypervisor |
| Kernel | Shared with host | Own kernel |
| Start time | Milliseconds | Seconds to minutes |
| Memory overhead | Minimal | Full OS (hundreds of MB) |
| Density | Thousands per host | Tens per host |
| Security boundary | Process-level | Hardware-level |
| Compatibility | Same kernel (Linux on Linux) | Any OS on any host |
Containers provide weaker isolation: they share the kernel, so a single kernel vulnerability can allow escape from the container. VMs are stronger: a VM escape requires a hypervisor bug, and the hypervisor is a much smaller attack surface.
In practice, many deployments combine both: VMs for multi-tenant isolation (different customers), containers for application isolation within a tenant.
What Is a Container Image?
A container image is a stack of filesystem layers plus metadata (entrypoint command, environment variables, exposed ports). The OCI (Open Container Initiative) specification standardizes the format.
A Dockerfile describes how to build an image:
```dockerfile
FROM ubuntu:24.04                              # base layer
RUN apt-get update && apt-get install -y curl  # dependencies layer
COPY ./app /app                                # application layer
CMD ["/app/server"]                            # default command
```
Each instruction creates a layer. Layers are cached — if the apt-get install step didn't change, its cached layer is reused. This makes rebuilds fast and pushes/pulls small: only changed layers are transferred.
Images are stored in registries (Docker Hub, GitHub Container Registry, private registries). The registry stores layers individually and deduplicates shared layers across images.
Next Steps
This lesson completes the systems learning path. You now understand:
- How Memory Works — stack, heap, virtual memory
- How Processes Work — fork, exec, scheduling
- How the Kernel Works — syscalls, user/kernel boundary
- How Threads Work — concurrency, synchronization
- How File Systems Work — inodes, journaling, copy-on-write
- How Containers Work — namespaces, cgroups, overlayfs
These are the building blocks underneath everything: networking, search, databases, and distributed systems.