How Containers Work — Isolation Without Virtual Machines

2026-03-22

A container looks like a virtual machine — it has its own filesystem, its own process tree, its own network interfaces, its own hostname. But unlike a VM, it shares the host's kernel. There's no guest OS, no hypervisor, no hardware emulation. A container is a regular process running on the host, with kernel-enforced boundaries around what it can see and use.

This makes containers fast to start (milliseconds vs seconds for VMs), lightweight (no OS overhead), and dense (thousands of containers on a single host). It also makes them fundamentally different from VMs in terms of isolation.

What Makes a Container?

A container is built from three Linux kernel features:

1. Namespaces — what the process can see.

2. Cgroups — how much the process can use.

3. Layered filesystem — the filesystem the process sees.

That's it. Docker, Podman, containerd — they're all tools that configure namespaces, cgroups, and a filesystem, then start a process. The kernel does the actual isolation.

How Do Namespaces Work?

Namespaces limit a process's view of the system. Each namespace isolates a different resource:

| Namespace | What it isolates | Effect |
|---|---|---|
| PID | Process IDs | Container sees its own process tree. Its PID 1 is not the host's PID 1. |
| Mount | Filesystem mounts | Container has its own root filesystem. Can't see host mounts. |
| Network | Network stack | Container has its own IP address, routes, ports, and firewall rules. |
| UTS | Hostname | Container has its own hostname. |
| User | User/group IDs | Root inside the container can map to an unprivileged user on the host. |
| IPC | Inter-process communication | Separate shared memory, semaphores, message queues. |
| Cgroup | Cgroup root | Container can't see the host cgroup hierarchy. |

When Docker starts a container, it creates a new set of namespaces and starts the container's process inside them. The process thinks it's alone on the machine — its own PID 1, its own root filesystem, its own network interface — but it's all a kernel-enforced illusion.

The unshare syscall creates new namespaces. setns joins an existing namespace. clone with namespace flags creates a child process in new namespaces. These are the primitives that container runtimes use.
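You can observe namespace membership without creating anything: on Linux, /proc/&lt;pid&gt;/ns exposes one symlink per namespace, and two processes are in the same namespace exactly when the inode numbers in those links match. A minimal sketch (Python, Linux-only; the namespaces helper is illustrative):

```python
# Inspect the namespaces of a process via /proc (Linux only).
# Each entry in /proc/<pid>/ns is a symlink whose target encodes the
# namespace type and an inode number, e.g. "pid:[4026531836]".
import os

def namespaces(pid="self"):
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}

if __name__ == "__main__":
    for name, target in namespaces().items():
        print(f"{name:12s} {target}")
```

Run this on the host and again inside a container: the inode numbers differ for every namespace the runtime unshared.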

How Do Cgroups Work?

Namespaces control what a process can see. Cgroups (control groups) control how much it can use:

| Cgroup controller | What it limits |
|---|---|
| cpu | CPU time (e.g., 50% of one core) |
| memory | RAM (e.g., 512 MB max, then OOM kill) |
| io | Disk I/O bandwidth and IOPS |
| pids | Maximum number of processes |
| cpuset | Which CPU cores the process can use |

Cgroups are hierarchical — you can set limits on a group of processes, then further subdivide within the group. Kubernetes uses cgroups to enforce pod resource limits (resources.limits.memory: "512Mi").

When a container exceeds its memory cgroup limit, the kernel's OOM killer terminates processes inside the container. The container doesn't crash the host — the damage is contained. This is the key resource isolation mechanism.

Cgroups are exposed as a filesystem: /sys/fs/cgroup/. Each cgroup is a directory. Writing to files in that directory sets limits. Container runtimes create cgroup directories, write limits, and move processes into them.
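Those steps can be sketched as pure computation — the paths and values a runtime would write, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup. The cgroup_files helper is illustrative, not a real runtime API, and actually writing these files requires root:

```python
# Sketch of the files a container runtime would write to create a cgroup
# and confine a process (cgroup v2 layout assumed). This only computes
# the path -> value mapping; writing the real files requires root.

def cgroup_files(name, mem_limit="512M", cpu_quota_pct=50, pid=None):
    base = f"/sys/fs/cgroup/{name}"
    period_us = 100_000                               # default scheduling period
    quota_us = period_us * cpu_quota_pct // 100       # CPU time per period
    files = {
        f"{base}/memory.max": mem_limit,              # exceed this -> OOM kill in group
        f"{base}/cpu.max": f"{quota_us} {period_us}", # "50000 100000" = half a core
        f"{base}/pids.max": "256",                    # cap the process count
    }
    if pid is not None:
        files[f"{base}/cgroup.procs"] = str(pid)      # move the process into the group
    return files

for path, value in cgroup_files("demo", pid=1234).items():
    print(path, "<-", value)
```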

How Does the Container Filesystem Work?

A container needs its own root filesystem — the /bin, /lib, /etc that its programs expect. But copying an entire OS image for every container would waste disk space.

Overlayfs solves this with layers:

  • Container layer (read-write): changes, new files, deleted files
  • Application layer (read-only): your code
  • Dependencies layer (read-only): apt packages
  • Base image layer (read-only): Ubuntu/Alpine

All layers are merged into a single view at /.

Each layer is a directory of files. Overlayfs presents them as a single merged filesystem:

  • Read a file → overlayfs searches layers top-down. The first layer containing the file wins.
  • Write a file → overlayfs copies the file to the top (read-write) layer and modifies it there. The lower layers are untouched. This is copy-on-write.
  • Delete a file → overlayfs creates a "whiteout" marker in the top layer, hiding the file in lower layers.

This means:

  • Multiple containers can share the same base image layers (Ubuntu, Alpine). The layers are read-only and shared.
  • Each container's changes are isolated to its own read-write layer.
  • Container images are just stacked tarballs — one per layer. Pulling an image downloads only the layers you don't already have.
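The lookup, copy-on-write, and whiteout rules above can be modeled with a toy in-memory overlay — a sketch, not real overlayfs (layers are dicts rather than directories, and the Overlay class is purely illustrative):

```python
# Toy model of overlayfs: top-down lookup, copy-up on write, whiteouts on delete.
WHITEOUT = object()  # marker in the top layer that hides a lower-layer file

class Overlay:
    def __init__(self, *lower):
        self.upper = {}            # the container's read-write layer
        self.lower = list(lower)   # read-only image layers, top-down

    def read(self, path):
        for layer in [self.upper, *self.lower]:   # search layers top-down
            if path in layer:
                if layer[path] is WHITEOUT:
                    raise FileNotFoundError(path) # hidden by a whiteout
                return layer[path]                # first layer with the file wins
        raise FileNotFoundError(path)

    def write(self, path, data):
        self.upper[path] = data    # copy-up: lower layers stay untouched

    def delete(self, path):
        self.upper[path] = WHITEOUT

base = {"/etc/os-release": "Ubuntu 24.04"}
deps = {"/usr/bin/curl": "curl binary"}
fs = Overlay(deps, base)

fs.write("/etc/os-release", "patched")   # copy-on-write into the top layer
print(fs.read("/etc/os-release"))        # -> patched (base layer unchanged)
fs.delete("/usr/bin/curl")               # whiteout hides the deps-layer copy
```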

How Does Container Networking Work?

Each container gets its own network namespace — its own network interfaces, IP addresses, routing table, and iptables rules. But it still needs to communicate with the host and other containers.

The most common setup: the container runtime creates a virtual ethernet pair (veth). One end goes inside the container's network namespace, the other connects to a bridge on the host. The bridge acts like a virtual switch connecting all containers.

When a container sends a packet to the internet:

  1. The packet leaves the container's veth interface.
  2. It arrives at the host bridge.
  3. The host's iptables/nftables NATs it (translating the container's private IP to the host's IP).
  4. The packet goes out the host's physical interface.

Port mapping (-p 8080:80) adds an iptables rule that forwards traffic arriving on host port 8080 to container port 80. eBPF-based solutions like Cilium replace iptables with more efficient packet processing.
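The NAT step can be modeled as a toy translation table — a sketch only (real NAT lives in the kernel's connection-tracking machinery; the Nat class and addresses here are illustrative):

```python
# Toy source-NAT table: outbound packets get the container's private
# source address rewritten to the host's address, and the mapping is
# remembered so replies can be translated back.
import itertools

class Nat:
    def __init__(self, host_ip):
        self.host_ip = host_ip
        self.ports = itertools.count(30000)   # ephemeral port allocator
        self.table = {}                       # host port -> (container ip, port)

    def outbound(self, src_ip, src_port, dst):
        host_port = next(self.ports)
        self.table[host_port] = (src_ip, src_port)
        return (self.host_ip, host_port, dst)  # packet as seen on the wire

    def inbound(self, host_port):
        return self.table[host_port]           # rewrite the reply's destination

nat = Nat("203.0.113.7")                       # example host address
print(nat.outbound("172.17.0.2", 43210, ("93.184.216.34", 443)))
```

Port mapping is the same table built in reverse: a fixed entry that maps a host port to a container address before any outbound traffic exists.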

Containers vs Virtual Machines

|  | Container | Virtual Machine |
|---|---|---|
| Isolation | Kernel namespaces | Hardware + hypervisor |
| Kernel | Shared with host | Own kernel |
| Start time | Milliseconds | Seconds to minutes |
| Memory overhead | Minimal | Full OS (hundreds of MB) |
| Density | Thousands per host | Tens per host |
| Security boundary | Process-level | Hardware-level |
| Compatibility | Same kernel (Linux on Linux) | Any OS on any host |

Containers provide weaker isolation — they share the kernel, so a kernel vulnerability can let a process escape the container. VMs are stronger — a VM escape requires a hypervisor bug, and the hypervisor is a much smaller attack surface.

In practice, many deployments combine both: VMs for multi-tenant isolation (different customers), containers for application isolation within a tenant.

What Is a Container Image?

A container image is a stack of filesystem layers plus metadata (entrypoint command, environment variables, exposed ports). The OCI (Open Container Initiative) specification standardizes the format.

A Dockerfile describes how to build an image:

FROM ubuntu:24.04                              # base layer
RUN apt-get update && apt-get install -y curl  # dependencies layer
COPY ./app /app                                # application layer
CMD ["/app/server"]                            # default command

Each instruction creates a layer. Layers are cached — if the apt-get install instruction and everything before it didn't change, the cached layer is reused. This makes rebuilds fast and pushes/pulls small (only changed layers are transferred).
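The caching rule can be sketched as content hashing: each instruction's cache key covers the instruction text plus its parent layer's key, so changing one instruction invalidates it and everything after it. (A simplification — a real builder also hashes copied file contents for COPY/ADD; the build function is illustrative.)

```python
# Sketch of Dockerfile layer caching: a layer's cache key is a hash of
# its parent's key plus the instruction text, so an unchanged prefix of
# the Dockerfile reuses cached layers and a change invalidates the rest.
import hashlib

def build(instructions, cache):
    parent = "scratch"
    layers, rebuilt = [], []
    for inst in instructions:
        key = hashlib.sha256((parent + "\n" + inst).encode()).hexdigest()[:12]
        if key not in cache:                       # cache miss: rebuild this layer
            cache[key] = f"layer built from {inst!r}"
            rebuilt.append(inst)
        layers.append(key)
        parent = key                               # chain into the next key
    return layers, rebuilt

cache = {}
dockerfile = ["FROM ubuntu:24.04",
              "RUN apt-get update && apt-get install -y curl",
              "COPY ./app /app"]
build(dockerfile, cache)            # first build: every layer is rebuilt
dockerfile[2] = "COPY ./app-v2 /app"
_, rebuilt = build(dockerfile, cache)
print(rebuilt)                      # only the changed instruction rebuilds
```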

Images are stored in registries (Docker Hub, GitHub Container Registry, private registries). The registry stores layers individually and deduplicates shared layers across images.

Next Steps

This lesson completes the systems learning path. You now understand:

  • Namespaces — kernel-enforced limits on what a process can see.
  • Cgroups — limits on how much a process can use.
  • Overlay filesystems — layered, copy-on-write container images.
  • Container networking — veth pairs, bridges, and NAT.

These are the building blocks underneath everything: networking, search, databases, and distributed systems.