🤔 Introduction

So... what exactly is a container? What exactly is a container runtime (like Docker)? What exactly is a container image? And... why can't my Windows-based application run on my Linux server?

When talking about containerization, we think of the well-known Docker or Podman. A common analogy in programming tutorials says that "a container is like a lightweight virtual machine that allows code to be run on any operating system".
This analogy gets the job done for most application developers; however, it's not the full story. In this article, let's go through the journey of recreating a container runtime from scratch, to develop a deeper understanding of, and appreciation for, the tooling that we use every day.

🤩 TLDR; Demo Video by Liz Rice

🤔 What is a container

So... let's start from the beginning. What the hell is a container?

In short, a container is simply an "isolated" process on a host operating system, with its own filesystem, PID tree, network, and resources (nothing else can see its internals except the host).
It's not a lightweight operating system, a virtual machine, or hardware virtualization. It's purely an isolation layer at the operating system level, an OS-level virtualization technology if you will.
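
To make "just a process with its own view" a little more concrete: on Linux, every process exposes the namespaces it belongs to as symlinks under /proc/<pid>/ns, and two processes share a namespace when the inode numbers in those links match. The following is a minimal sketch, assuming a Linux host; parseNsLink is a hypothetical helper name, not part of any library.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a /proc/<pid>/ns/* symlink target such as
// "pid:[4026531836]" into the namespace kind and its inode number.
func parseNsLink(target string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, err
	}
	return kind, inode, nil
}

func main() {
	// Linux only: on other systems Readlink fails and we simply report it.
	for _, ns := range []string{"pid", "mnt", "net", "uts", "ipc", "user"} {
		target, err := os.Readlink("/proc/self/ns/" + ns)
		if err != nil {
			fmt.Println(ns, "unavailable:", err)
			continue
		}
		kind, inode, err := parseNsLink(target)
		if err != nil {
			fmt.Println(ns, "unparseable:", err)
			continue
		}
		fmt.Printf("%-4s namespace inode %d\n", kind, inode)
	}
}
```

Running this once on the host and once inside a container should print different inode numbers for every namespace the container does not share with the host.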

Now, if none of this makes sense, let's walk through an example.

🤔 Example

Let's take a simple example of running a Python script via python app.py on your MacBook. In order to do this, your Mac needs to have:

python and pip installed
All requirements and dependencies in requirements.txt already installed

The problem

This is not too bad. However, there's a problem, a big problem: this Python script will have access to almost everything on your machine that your user has access to, namely:

Filesystem: Can see everything your user can see
CPU & RAM: Unlimited access (unless throttled by OS)
Network: Full access
Processes: Can spawn subprocesses (via subprocess, etc)
Devices: USB, audio, etc., if your user has access
Syscalls: Darwin/macOS syscalls via libc
Security: Controlled by macOS (SIP, sandboxing)

This is an uncontrolled and insecure way to deploy this Python app on a random remote server. The reason is that the server needs to control what this Python app can access and how much network bandwidth, CPU, or GPU resources it can use.

A malicious script can totally saturate the network bandwidth, use up all the CPU, read sensitive files, install malware, or even worse, rm -rf the entire file system if granted write access.

Solving the isolation problem, using containerization

Now, let the fun begin! Let's start to "containerize" or "isolate" this Python script.

This isolation process, at the highest level, only takes 2 steps:

Isolation of the runtime environment from the OS, using namespaces: Namespaces provide processes and their children with a view of a subset of the machine's resources. The six most commonly used kinds of namespaces are listed below (newer kernels also add cgroup and time namespaces):
1. PID: The PID namespace gives a process and its children their own view of a subset of the processes in the system.
2. MNT: The mount namespace allows a process to have its own filesystem. This is how we can have a process think it's running on Ubuntu, BusyBox, or Alpine: by swapping out the filesystem the container sees.
3. NET: The network namespace gives the processes that use it their own network stack. With some routing logic, we can make the container communicate with the host, and subsequently with the internet.
4. UTS: The UTS namespace gives its processes their own view of the system's hostname and domain name.
5. IPC: The IPC namespace isolates various inter-process communication mechanisms, such as message queues.
6. USER: The user namespace maps the UIDs (and GIDs) a process sees to a different set of UIDs (and GIDs) on the host. This is extremely useful, since we can let the container run as root inside its own user namespace while it maps to an unprivileged user on the host.
Isolation of physical resources, using control groups, or cgroups: Cgroups are exposed by the kernel as a special file system. These files can be manipulated to put hard limits on how much of each resource a process can use.
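
The UID mapping performed by the user namespace is easy to picture as a small lookup table. Below is a minimal sketch of that translation, not kernel code: the idMap type and toHostID function are hypothetical names, with fields modelled on the three columns of /proc/<pid>/uid_map (ID inside the namespace, ID outside, length), and the sample mapping is made up for illustration.

```go
package main

import "fmt"

// idMap mirrors one line of /proc/<pid>/uid_map: a run of Size IDs that
// starts at ContainerID inside the namespace and at HostID on the host.
type idMap struct {
	ContainerID, HostID, Size int
}

// toHostID translates an ID seen inside the user namespace to the ID the
// host uses. ok is false when the ID is not covered by any mapping.
func toHostID(maps []idMap, containerID int) (hostID int, ok bool) {
	for _, m := range maps {
		if containerID >= m.ContainerID && containerID < m.ContainerID+m.Size {
			return m.HostID + (containerID - m.ContainerID), true
		}
	}
	return 0, false
}

func main() {
	// Hypothetical mapping: root in the container is uid 1000 on the host.
	maps := []idMap{{ContainerID: 0, HostID: 1000, Size: 1}}
	for _, uid := range []int{0, 1} {
		if host, ok := toHostID(maps, uid); ok {
			fmt.Printf("container uid %d -> host uid %d\n", uid, host)
		} else {
			fmt.Printf("container uid %d is unmapped\n", uid)
		}
	}
}
```

With a mapping like this, a process that believes it is root (uid 0) only has the rights of uid 1000 on the host.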

And this is it: two simple yet extremely powerful tools exposed by the Linux kernel that form the backbone of containerization technologies. Docker also uses them behind the scenes.

Speaking of Docker, let's try to recreate a simple version of its container runtime, knowing what we know so far.

👉 Container Runtime from Scratch in Go

Let's try to rebuild something similar to docker run, by passing in an arbitrary command that we want to containerize. For example, docker run python app.py with app.py in some random folder.

For this example, let's build our runtime in Go, since Docker itself is written in Go.

What we need

Ability to pass in commands from the command line
Ability to copy existing files from the host into the container
Ability to spawn, isolate, and manage a random process (this process will be our container!!! 🤩)

Let's build

The container manager

Think of this manager as the docker process. The container is isolated, but Docker still needs all the privileges required to spawn containers, for example to make syscalls, forward signals, etc.

func manager() {
	// Re-run this program (via /proc/self/exe) as the "runContainer" subcommand.
	// The child is created directly inside the new namespaces requested below.
	cmd := exec.Command("/proc/self/exe", append([]string{"runContainer"}, os.Args[2:]...)...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | // new UTS namespace (hostname)
			syscall.CLONE_NEWPID | // new PID namespace
			syscall.CLONE_NEWNS | // new MNT namespace
			syscall.CLONE_NEWIPC | // new IPC namespace
			syscall.CLONE_NEWNET, // new NET namespace
	}
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	// Spawn the container process in its new namespaces
	must(cmd.Start())

	// Forward termination signals (like Ctrl+C) to the container
	go func() {
		ch := make(chan os.Signal, 1)
		signal.Notify(ch, syscall.SIGINT, syscall.SIGTERM)
		for sig := range ch {
			_ = cmd.Process.Signal(sig)
		}
	}()

	must(cmd.Wait())
}
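
Why does the manager re-execute itself instead of just calling a function? Go's os/exec applies the clone flags when it creates the child and then immediately executes a binary, so there is no hook for running Go setup code inside the new namespaces first. Calling the same binary again with a different subcommand is a common workaround. Here is a minimal sketch of how that argument vector is assembled; reexecArgs is a hypothetical helper name, and the sample arguments are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// reexecArgs builds the argument list for the second invocation of the
// binary: the "run" subcommand is swapped for "runContainer" and the user's
// command is passed through untouched. args is expected to look like os.Args.
func reexecArgs(args []string) ([]string, error) {
	if len(args) < 3 || args[1] != "run" {
		return nil, errors.New("usage: <program> run <command> [args...]")
	}
	return append([]string{"runContainer"}, args[2:]...), nil
}

func main() {
	// Sample invocation, standing in for os.Args.
	argv, err := reexecArgs([]string{"container", "run", "python", "app.py"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// A real manager would now call exec.Command("/proc/self/exe", argv...).
	fmt.Println("would re-exec /proc/self/exe with:", argv)
}
```

Validating the arguments before re-executing also avoids an index-out-of-range panic when the program is started without a command.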

The container

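func runContainer() {
	// We are now in an isolated process with our own namespaces

	// 1. Set the hostname; this won't affect the host's hostname
	must(syscall.Sethostname([]byte("container")))

	// 2. Set up cgroup limits while the host's /sys/fs/cgroup is still visible
	setupCgroup()

	// 3. Set up the file system isolation
	// 3.1 Keep mount changes from propagating back to the host;
	// pivot_root is also rejected while the surrounding mounts are shared
	must(syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""))

	// 3.2 Bind-mount rootfs onto itself so that it is a mount point, as pivot_root requires
	must(syscall.Mount("rootfs", "rootfs", "", syscall.MS_BIND|syscall.MS_REC, ""))

	// 3.3 Mount proc inside the new root
	must(os.MkdirAll("rootfs/proc", 0555))
	must(syscall.Mount("proc", "rootfs/proc", "proc", 0, ""))

	// 3.4 pivot_root into rootfs, parking the old root at /oldrootfs
	must(os.MkdirAll("rootfs/oldrootfs", 0700))
	must(syscall.PivotRoot("rootfs", "rootfs/oldrootfs"))
	must(os.Chdir("/"))

	// 3.5 Unmount the old root
	must(syscall.Unmount("/oldrootfs", syscall.MNT_DETACH))
	must(os.RemoveAll("/oldrootfs"))

	// 4. Run the requested command (e.g. the Python script) in this containerized environment
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	must(cmd.Run())
}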

  

func setupCgroup() {
	// Assumes cgroup v2 (unified hierarchy) mounted at /sys/fs/cgroup,
	// with the pids and cpu controllers enabled for child cgroups
	cgroupPath := "/sys/fs/cgroup/mycontainer/"
	must(os.MkdirAll(cgroupPath, 0755))

	// Set a limit: maximum number of processes allowed in this cgroup is 20.
	// Prevents fork bombs or accidental overuse of system resources.
	must(os.WriteFile(cgroupPath+"pids.max", []byte("20"), 0644))

	// Limit CPU usage (cpu controller)
	// 50% CPU: allow 50ms of CPU time every 100ms ("<quota> <period>" in microseconds)
	must(os.WriteFile(cgroupPath+"cpu.max", []byte("50000 100000"), 0644))

	// Move this process, and therefore everything it spawns, into the cgroup
	must(os.WriteFile(cgroupPath+"cgroup.procs", []byte(strconv.Itoa(os.Getpid())), 0644))
}
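
The "50ms out of every 100ms" arithmetic generalizes to any percentage. On a cgroup v2 (unified hierarchy) system the CPU limit lives in a single cpu.max file containing "<quota> <period>" in microseconds, rather than in the separate cpu.cfs_quota_us and cpu.cfs_period_us files of cgroup v1. The following is a minimal sketch of that calculation; cpuMax and limitFiles are hypothetical helper names, and the 100000 microsecond period is an assumption carried over from the example above.

```go
package main

import (
	"fmt"
	"strconv"
)

// cpuMax renders the value for cgroup v2's cpu.max file: "<quota> <period>",
// both in microseconds. percent is the share of one CPU to allow.
func cpuMax(percent, periodMicros int) (string, error) {
	if percent <= 0 || periodMicros <= 0 {
		return "", fmt.Errorf("percent and period must be positive, got %d and %d", percent, periodMicros)
	}
	quota := periodMicros * percent / 100
	return strconv.Itoa(quota) + " " + strconv.Itoa(periodMicros), nil
}

// limitFiles maps cgroup v2 control files to the content to write into them.
func limitFiles(maxPids, cpuPercent int) (map[string]string, error) {
	cpu, err := cpuMax(cpuPercent, 100000)
	if err != nil {
		return nil, err
	}
	return map[string]string{
		"pids.max": strconv.Itoa(maxPids),
		"cpu.max":  cpu,
	}, nil
}

func main() {
	files, err := limitFiles(20, 50)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, name := range []string{"pids.max", "cpu.max"} {
		fmt.Printf("%s <- %q\n", name, files[name])
	}
}
```

Keeping the computation separate from the file writes makes the limits easy to check without root privileges or a real cgroup hierarchy.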

All together

package main

import (
	"os"
	"os/exec"
	"os/signal"
	"strconv"
	"syscall"
)

func main() {
	if len(os.Args) < 3 {
		panic("usage: run <command> [args...]")
	}
	switch os.Args[1] {
	case "run":
		manager()
	case "runContainer":
		runContainer()
	default:
		panic("wat should I do")
	}
}

func manager() {
	// Re-run this program (via /proc/self/exe) as the "runContainer" subcommand.
	// The child is created directly inside the new namespaces requested below.
	cmd := exec.Command("/proc/self/exe", append([]string{"runContainer"}, os.Args[2:]...)...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | // new UTS namespace (hostname)
			syscall.CLONE_NEWPID | // new PID namespace
			syscall.CLONE_NEWNS | // new MNT namespace
			syscall.CLONE_NEWIPC | // new IPC namespace
			syscall.CLONE_NEWNET, // new NET namespace
	}
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	// Spawn the container process in its new namespaces
	must(cmd.Start())

	// Forward termination signals (like Ctrl+C) to the container
	go func() {
		ch := make(chan os.Signal, 1)
		signal.Notify(ch, syscall.SIGINT, syscall.SIGTERM)
		for sig := range ch {
			_ = cmd.Process.Signal(sig)
		}
	}()

	must(cmd.Wait())
}

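func runContainer() {
	// We are now in an isolated process with our own namespaces

	// 1. Set the hostname; this won't affect the host's hostname
	must(syscall.Sethostname([]byte("container")))

	// 2. Set up cgroup limits while the host's /sys/fs/cgroup is still visible
	setupCgroup()

	// 3. Set up the file system isolation
	// 3.1 Keep mount changes from propagating back to the host;
	// pivot_root is also rejected while the surrounding mounts are shared
	must(syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""))

	// 3.2 Bind-mount rootfs onto itself so that it is a mount point, as pivot_root requires
	must(syscall.Mount("rootfs", "rootfs", "", syscall.MS_BIND|syscall.MS_REC, ""))

	// 3.3 Mount proc inside the new root
	must(os.MkdirAll("rootfs/proc", 0555))
	must(syscall.Mount("proc", "rootfs/proc", "proc", 0, ""))

	// 3.4 pivot_root into rootfs, parking the old root at /oldrootfs
	must(os.MkdirAll("rootfs/oldrootfs", 0700))
	must(syscall.PivotRoot("rootfs", "rootfs/oldrootfs"))
	must(os.Chdir("/"))

	// 3.5 Unmount the old root
	must(syscall.Unmount("/oldrootfs", syscall.MNT_DETACH))
	must(os.RemoveAll("/oldrootfs"))

	// 4. Run the requested command (e.g. the Python script) in this containerized environment
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	must(cmd.Run())
}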

  

func setupCgroup() {
	// Assumes cgroup v2 (unified hierarchy) mounted at /sys/fs/cgroup,
	// with the pids and cpu controllers enabled for child cgroups
	cgroupPath := "/sys/fs/cgroup/mycontainer/"
	must(os.MkdirAll(cgroupPath, 0755))

	// Set a limit: maximum number of processes allowed in this cgroup is 20.
	// Prevents fork bombs or accidental overuse of system resources.
	must(os.WriteFile(cgroupPath+"pids.max", []byte("20"), 0644))

	// Limit CPU usage (cpu controller)
	// 50% CPU: allow 50ms of CPU time every 100ms ("<quota> <period>" in microseconds)
	must(os.WriteFile(cgroupPath+"cpu.max", []byte("50000 100000"), 0644))

	// Move this process, and therefore everything it spawns, into the cgroup
	must(os.WriteFile(cgroupPath+"cgroup.procs", []byte(strconv.Itoa(os.Getpid())), 0644))
}


func must(err error) {
	if err != nil {
		panic(err)
	}
}

Conclusion

That's it: we just created a super insecure, barebones version of docker run. To run it on a Linux host, as root and from a directory that contains an unpacked root filesystem named rootfs, we simply do

go run main.go run <whatever-command-here>

# Example
go run main.go run python app.py

🎉🎉 And that's it, we have successfully containerized our application 🎉🎉

👉 Cross-Platforms Containers

Now that we know what a container actually is, let's answer some fundamental questions about running containers across platforms.

Namespaces and cgroups are all Linux-specific, so how do I containerize my Windows environment?
- On Windows, containerization uses Windows job objects and Hyper-V isolation instead of namespaces and cgroups.
Can my Linux container even run on Windows?
- Yes! But only if your Windows machine has WSL2 or Hyper-V, which supplies the Linux kernel that the container needs.
Can my Windows container even run on Linux?
- No! To run a Windows container, you need a Windows kernel. Since running containers on the Linux kernel is way more efficient and effective, there is currently no support or use case for this.
