How Does a Docker Container Work Internally?

Eduardo Zepeda

June 18, 2022

Containers, especially Docker containers, are used everywhere. We tend to see them as small isolated operating systems that live inside our system. Using Docker commands we can create them, modify them, delete them, and even get inside them and run commands. But have you ever wondered how they work internally?

We know that a container is a linux process with several characteristics:

A linux process, or group of processes, executed by a user.
It is isolated from the operating system that hosts it (Namespaces).
It has a limited amount of resources (Cgroups).
It has a file system independent of the operating system in which it runs (Chroot).

To achieve this, Docker and other container technologies take advantage of some features of GNU/Linux (from now on only linux):

Processes
Namespaces
Cgroups
Chroot

I am going to explain them briefly but you can go deeper on your own if you want to.

Processes, namespaces and cgroups on linux

Process

In simple words, a process is an instance of a running program. What is important here is that each process in linux has a PID, which is a number used to identify the process.

As you know, you can view processes with commands such as ps, top and htop.
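To make the idea of a PID concrete, here is a minimal Go sketch that prints its own PID, its parent's PID, and the /proc directory where linux exposes information about a process. The helper name procDir is hypothetical, chosen only for this example.

```go
package main

import (
	"fmt"
	"os"
)

// procDir returns the /proc directory where linux exposes details about a PID.
// The helper name is made up for this example.
func procDir(pid int) string {
	return fmt.Sprintf("/proc/%d", pid)
}

func main() {
	pid := os.Getpid()
	fmt.Printf("PID: %d, parent PID: %d\n", pid, os.Getppid())
	fmt.Printf("Details about this process live in %s\n", procDir(pid))
}
```

Every run prints a different PID, because the kernel assigns a new one to each process it starts.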

A container is a process, or a group of processes, isolated from the rest of the operating system, by means of a namespace.

Namespace

A namespace limits what we can see.

Namespaces are a Linux abstraction layer that isolates system resources. Processes inside a namespace can see the resources in that namespace, including the other processes in it, but they cannot see or interact with what lives outside of it. At any given time, a process belongs to exactly one namespace of each type.

A namespace is what makes a container feel like another operating system.

In linux, a namespace is normally destroyed when its last process finishes running.

Types of namespaces in linux

There are different types of namespaces that control the resources to which a process has access:

UTS (Unix Time Sharing) namespace: Isolates hostname and domain.
PID namespace: Isolates process identifiers.
Mounts namespace: Isolates mount points.
IPC namespace: Isolates communication resources between processes.
Network namespace: Isolates network resources.
User namespace: Isolates user and group identifiers.
Cgroup namespace: Isolates the view of /proc/[pid]/cgroup and /proc/[pid]/mountinfo.

For example, if we use a namespace of type UTS, the changes we make to the hostname from our namespace will not affect the hostname of the main operating system.

Figure: Example of namespaces in linux. Each namespace has its own hostname and domainname.
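One way to see which namespaces a process belongs to is to read the symbolic links under /proc/self/ns, whose targets look like uts:[4026531838]. The following is a minimal sketch of that idea; the helper name parseNsLink is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// parseNsLink splits a link target such as "uts:[4026531838]" into the
// namespace type ("uts") and its identifier ("4026531838").
func parseNsLink(link string) (string, string, bool) {
	kind, rest, found := strings.Cut(link, ":[")
	if !found || !strings.HasSuffix(rest, "]") {
		return "", "", false
	}
	return kind, strings.TrimSuffix(rest, "]"), true
}

func main() {
	entries, err := os.ReadDir("/proc/self/ns")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc/self/ns:", err)
		os.Exit(1)
	}
	for _, e := range entries {
		link, err := os.Readlink(filepath.Join("/proc/self/ns", e.Name()))
		if err != nil {
			continue // skip entries we are not allowed to read
		}
		if kind, id, ok := parseNsLink(link); ok {
			fmt.Printf("%-20s type=%s id=%s\n", e.Name(), kind, id)
		}
	}
}
```

Two processes that show the same identifier for a given type share that namespace. A process started in a new UTS namespace would show a different identifier for uts.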

cgroup

In linux, cgroups limit what we can use.

The cgroups, or control groups, provided by the linux kernel allow us to organize our processes into groups and limit the CPU, memory, I/O, number of processes and network packets generated by each of these groups.

Linux reads this configuration from a series of files under the path /sys/fs/cgroup/. We can create new cgroups, or modify existing ones, by creating folders and files inside this location.

For example, using cgroups we can tell linux: “limit the number of CPUs this process can use to only one, and that it can only use 20% of the CPU capacity, and also assign it a maximum of 1GB of RAM”.

Figure: Example of cgroups in linux. Cgroups allow you to limit system resources.
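The request in the example above can be translated into concrete values. The sketch below assumes the cgroup v1 control file names (cpuset.cpus, cpu.cfs_period_us, cpu.cfs_quota_us, memory.limit_in_bytes) and only prints what would be written; it does not touch /sys/fs/cgroup. The helper name cpuQuota is hypothetical.

```go
package main

import "fmt"

// cpuQuota returns the CFS quota, in microseconds, that grants the given
// percentage of one CPU during each period (also in microseconds).
func cpuQuota(percent, periodUs int) int {
	return periodUs * percent / 100
}

func main() {
	const periodUs = 100000        // 100 ms, the kernel's default CFS period
	const memoryBytes = 1000000000 // 1 GB, using decimal units

	// Each entry is a control file (cgroup v1 names) and the value to write into it.
	limits := []struct{ file, value string }{
		{"cpuset.cpus", "0"}, // allow the group to run on the first CPU only
		{"cpu.cfs_period_us", fmt.Sprint(periodUs)},
		{"cpu.cfs_quota_us", fmt.Sprint(cpuQuota(20, periodUs))},
		{"memory.limit_in_bytes", fmt.Sprint(memoryBytes)},
	}
	for _, l := range limits {
		fmt.Printf("%-22s <- %s\n", l.file, l.value)
	}
}
```

A quota of 20000 microseconds out of every 100000 means the group may run for at most 20% of the time on that CPU.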

Create a container from scratch with Go

Simplifying the above we need:

Namespaces: to isolate the processes of our container from the main operating system.
Chroot: to provide our container with a file system different from that of the main operating system.
Cgroups: to limit the system resources that our container can access.

Now let's build the base of a container using Go, the same programming language Docker is written in.

package main

import (
    "fmt"
    "os"
    "os/exec"
)

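// Usage: ./container run <cmd> <args>
func main() {
    if len(os.Args) < 3 {
        fmt.Fprintln(os.Stderr, "usage: container run <cmd> <args>")
        os.Exit(1)
    }
    switch os.Args[1] {
    case "run":
        run()
    default:
        panic("This command doesn't exist")
    }
}

// run executes the requested command as a child process of this program.
func run() {
    fmt.Printf("Code executing %v with Process Id (PID): %d \n", os.Args[2:], os.Getpid())
    cmd := exec.Command(os.Args[2], os.Args[3:]...)

    // Connect the command's standard streams to our own.
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    if err := cmd.Run(); err != nil {
        panic(err)
    }
}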

I explain the code below.

Inside the main function, os.Args[1] returns the first argument of the program. If that argument is run, the program executes the run function. Easy, isn't it?

./container run <cmd> <args>

exec.Command prepares whatever we pass after run as the command to execute, along with its arguments. This can be an echo, a bash, an ls, or whatever you want.

./container run echo "Hello world"
Code executing [echo Hello world] with Process Id (PID): 292753
Hello world

The lines that follow, the ones with the cmd prefix, can be summarized as follows.

Connect the standard input of the command to the standard input of our program.
Connect the standard output of the command to the standard output of our program.
Connect the error output of the command to the error output of our program.

What does this mean? It means that everything we type in our terminal goes directly to the standard input of the command stored in cmd, and everything the command prints shows up in our terminal.
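The same fields accept any reader or writer, not only the terminal streams. As a minimal sketch, the hypothetical helper runCapture below points cmd.Stdout at an in-memory buffer, so the output of the command is collected instead of being printed.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// runCapture runs a command and returns what it wrote to standard output.
// Standard error is still passed through to our own stderr.
func runCapture(name string, args ...string) (string, error) {
	var out bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &out // any io.Writer works here, not only os.Stdout
	cmd.Stderr = os.Stderr
	err := cmd.Run()
	return out.String(), err
}

func main() {
	out, err := runCapture("echo", "Hello world")
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Printf("captured %q\n", out)
}
```

For a container we want the opposite behavior, an interactive session, which is why the code in this post connects all three streams to the terminal.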

To conclude:

cmd.Run executes the command that we created with exec.Command and waits for it to finish.
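cmd.Run also returns an error, which tells us whether the command could be started and how it ended. The following sketch shows one way to turn that error into an exit code; the helper names exitCode and runCode are hypothetical, and the example assumes a sh binary is available.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// exitCode translates the error returned by cmd.Run into a number:
// 0 on success, the command's own exit status if it ran and failed,
// and -1 if the command could not be started.
func exitCode(err error) int {
	if err == nil {
		return 0
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode()
	}
	return -1
}

// runCode runs a command and reports its exit code.
func runCode(name string, args ...string) int {
	return exitCode(exec.Command(name, args...).Run())
}

func main() {
	fmt.Println("exit code:", runCode("sh", "-c", "exit 3"))
}
```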

Containers and namespaces

So far we have a program that creates a process from the arguments we pass to it.

That works, but we have a problem: we are not using namespaces, so our program is not isolated from the rest of the system. We can see all the processes of the main operating system, and we are using its file system instead of a file system of our own for the container.

To assign a namespace to our program, we will set the SysProcAttr field of the command so that it starts in a new namespace of type UTS. This requires adding "syscall" to the imports.

func run() {
    fmt.Printf("Code executing %v with Process Id (PID): %d \n", os.Args[2:], os.Getpid())
    cmd := exec.Command(os.Args[2], os.Args[3:]...)

    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    // Start the command in a new UTS namespace.
    cmd.SysProcAttr = &syscall.SysProcAttr{
        Cloneflags: syscall.CLONE_NEWUTS,
    }

    if err := cmd.Run(); err != nil {
        panic(err)
    }
}

As you read in the list of namespaces, UTS is the namespace for isolating hostname and domain names.

Namespace UTS

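After setting Cloneflags, any change we make to the hostname applies only within the new namespace. In other words, changes inside our container will not affect anything outside of it. Keep in mind that creating a namespace this way requires root privileges, so the program has to be run as root (for example, with sudo).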

# Original hostname
hostname
originalHostname

# Renaming the hostname inside the container 
./container run /bin/bash
hostname anotherName

# Hostname changed inside the container
hostname
anotherName

# Exiting container
exit

# The original hostname didn't change
hostname
originalHostname
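Cloneflags is a bit mask, so several namespaces can be requested at once by combining flags with a bitwise OR. The sketch below builds such a mask from a list of names. The map and the helper flagsFor are hypothetical, while the CLONE_NEW* constants come from Go's syscall package on linux.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// cloneFlags maps a short namespace name to its clone flag.
// The short names are an assumption made for this example.
var cloneFlags = map[string]uintptr{
	"uts":  syscall.CLONE_NEWUTS,
	"pid":  syscall.CLONE_NEWPID,
	"mnt":  syscall.CLONE_NEWNS,
	"ipc":  syscall.CLONE_NEWIPC,
	"net":  syscall.CLONE_NEWNET,
	"user": syscall.CLONE_NEWUSER,
}

// flagsFor combines the flags of the requested namespaces into one bit mask.
func flagsFor(names ...string) (uintptr, error) {
	var flags uintptr
	for _, name := range names {
		flag, ok := cloneFlags[name]
		if !ok {
			return 0, fmt.Errorf("unknown namespace type %q", name)
		}
		flags |= flag
	}
	return flags, nil
}

func main() {
	flags, err := flagsFor("uts", "pid")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// This value is what would be assigned to syscall.SysProcAttr.Cloneflags.
	fmt.Printf("Cloneflags: %#x\n", flags)
}
```

Building the mask does not require any privileges. Only starting a process with it does.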

Isolating processes with the PID namespace

Now that we have seen how a namespace works, let's use one for the main job of a container: isolating processes.

We will make the following changes to the main code.

In the run function, we always prepend child to the arguments, so that the re-executed program runs the function of the same name.
exec.Command("/proc/self/exe", args...) re-executes our own binary as a new process, passing it our commands.
CLONE_NEWPID will be used to create a new namespace to isolate the processes in our container.
The syscall.Sethostname function sets the hostname automatically, which is useful for telling that we are inside the container.

The rest of the code does exactly the same.

package main

import (
    "fmt"
    "os"
    "os/exec"
    "syscall"
)

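// go run container.go run <cmd> <args>
func main() {
    if len(os.Args) < 3 {
        fmt.Fprintln(os.Stderr, "usage: container run <cmd> <args>")
        os.Exit(1)
    }
    switch os.Args[1] {
    case "run":
        run()
    case "child":
        child()
    default:
        panic("This command doesn't exist")
    }
}

// run re-executes this same binary as "child" inside new UTS and PID namespaces.
func run() {
    args := append([]string{"child"}, os.Args[2:]...)
    cmd := exec.Command("/proc/self/exe", args...)

    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    cmd.SysProcAttr = &syscall.SysProcAttr{
        Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID,
    }

    if err := cmd.Run(); err != nil {
        panic(err)
    }
}

// child runs inside the new namespaces and executes the requested command.
func child() {
    fmt.Printf("Code executing %v with Process Id (PID): %d \n", os.Args[2:], os.Getpid())

    if err := syscall.Sethostname([]byte("container")); err != nil {
        panic(err)
    }
    cmd := exec.Command(os.Args[2], os.Args[3:]...)

    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    if err := cmd.Run(); err != nil {
        panic(err)
    }
}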

Now, if we run the code, we will see that the PID is 1, the first process. We have isolated the processes! However, since we have not changed the file system, ps will still show the same processes as in our main operating system.

Remember that the ps command gets the processes from the /proc directory of the file system you are using. In other words, we need another file system.
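To illustrate why the contents of /proc decide what ps shows, here is a minimal sketch that lists the visible PIDs by keeping only the directory entries whose names are numbers. The helper names isPIDName and listPIDs are hypothetical, and real tools read much more than this.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// isPIDName reports whether a /proc entry name looks like a process ID.
func isPIDName(name string) bool {
	pid, err := strconv.Atoi(name)
	return err == nil && pid > 0
}

// listPIDs returns the PIDs visible under procRoot (normally "/proc").
func listPIDs(procRoot string) ([]int, error) {
	entries, err := os.ReadDir(procRoot)
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, e := range entries {
		if e.IsDir() && isPIDName(e.Name()) {
			pid, _ := strconv.Atoi(e.Name()) // already validated by isPIDName
			pids = append(pids, pid)
		}
	}
	return pids, nil
}

func main() {
	pids, err := listPIDs("/proc")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc:", err)
		os.Exit(1)
	}
	fmt.Printf("%d processes visible through /proc\n", len(pids))
}
```

If /proc still belongs to the host, a program like this one sees every process of the host, no matter which PID namespace it runs in.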

Set up a new file system for the container

To give the container its own file system, separate from that of our operating system, we will use chroot, which linux offers both as a command and as a system call.

Chroot changes the default root location to a directory of your choice.

ls /another_file_system
bin dev home lib ... proc

This new file system can have other libraries and configurations installed and be designed to our liking. It can be a copy of the one you are using or a completely different one.
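Before changing the root, it can help to check that the directory really looks like a file system we can use. The sketch below is only an illustration: the helper missingDirs is hypothetical, and the choice of bin and proc as required directories is an assumption based on the listing above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// missingDirs reports which of the wanted directories do not exist under root.
func missingDirs(root string, want []string) []string {
	var missing []string
	for _, name := range want {
		info, err := os.Stat(filepath.Join(root, name))
		if err != nil || !info.IsDir() {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	root := "/another_file_system" // the example path used in this post
	if missing := missingDirs(root, []string{"bin", "proc"}); len(missing) > 0 {
		fmt.Printf("%s is not ready for chroot, missing: %v\n", root, missing)
		return
	}
	fmt.Printf("%s looks usable as a root file system\n", root)
}
```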

To isolate the processes of our container we are going to:

Change the file system to the new one with Chroot
Move to the root directory
Mount the proc pseudo-file system on the proc folder of the new root

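func child() {
    fmt.Printf("Code executing %v with Process Id (PID): %d \n", os.Args[2:], os.Getpid())

    syscall.Sethostname([]byte("container"))
    cmd := exec.Command(os.Args[2], os.Args[3:]...)

    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    // Switch to the new root, move into it, and mount a fresh proc there.
    if err := syscall.Chroot("/another_file_system"); err != nil {
        panic(err)
    }
    if err := os.Chdir("/"); err != nil {
        panic(err)
    }
    if err := syscall.Mount("proc", "proc", "proc", 0, ""); err != nil {
        panic(err)
    }

    if err := cmd.Run(); err != nil {
        panic(err)
    }
}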

Now our container is going to read the processes from our new file system, instead of the file system of the main operating system.

Limiting container resources with cgroups

Finally, we are going to limit the resources that our container can access using the linux cgroups.

The cgroups are located under the path /sys/fs/cgroup/, and we can create a new one by creating a new folder inside the directory of the cgroup type.

In this case we will limit the memory, so our cgroup will be inside /sys/fs/cgroup/memory/<cgroup_name>. This is the cgroup v1 layout; on systems that use only cgroup v2, the directories and file names are different. Remember I told you that cgroups work by reading a series of directories and files?

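func child() {
    // ...
    setcgroup()
    // ...
}

// Requires "path/filepath" and "strconv" in the import list.
func setcgroup() {
    cgPath := filepath.Join("/sys/fs/cgroup/memory", "newcgroup")
    // Creating the directory creates the cgroup; it may already exist from a previous run.
    if err := os.Mkdir(cgPath, 0755); err != nil && !os.IsExist(err) {
        panic(err)
    }

    // Cap memory at 100 MB, then add the current process to the cgroup.
    if err := os.WriteFile(filepath.Join(cgPath, "memory.limit_in_bytes"), []byte("100000000"), 0700); err != nil {
        panic(err)
    }
    if err := os.WriteFile(filepath.Join(cgPath, "tasks"), []byte(strconv.Itoa(os.Getpid())), 0700); err != nil {
        panic(err)
    }
}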

We create a directory for our cgroup with the linux permissions 0755.

We then write to two files inside our cgroup, which the kernel creates along with the directory, to set the guidelines we want to implement:

memory.limit_in_bytes, to limit the maximum memory to 100 MB (100000000 bytes).
tasks, to tell linux that this cgroup configuration applies to the process number (PID) of our container, which we obtain with the os.Getpid function.
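To check that a process really ended up in the cgroup, we can read /proc/self/cgroup, where each line has the form hierarchy-ID:controller-list:path. The sketch below parses those lines. The type cgroupEntry and the helper parseCgroupLine are hypothetical names for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// cgroupEntry holds the controllers and the cgroup path of one line.
type cgroupEntry struct {
	Controllers string // for example "memory"; empty on cgroup v2
	Path        string // for example "/newcgroup"
}

// parseCgroupLine parses a line such as "9:memory:/newcgroup".
func parseCgroupLine(line string) (cgroupEntry, bool) {
	parts := strings.SplitN(line, ":", 3)
	if len(parts) != 3 {
		return cgroupEntry{}, false
	}
	return cgroupEntry{Controllers: parts[1], Path: parts[2]}, true
}

func main() {
	f, err := os.Open("/proc/self/cgroup")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot open /proc/self/cgroup:", err)
		os.Exit(1)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if e, ok := parseCgroupLine(scanner.Text()); ok {
			fmt.Printf("controllers=%q path=%q\n", e.Controllers, e.Path)
		}
	}
}
```

Run from inside the container after setcgroup, the line for the memory controller should point at the new cgroup.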

And that's it. With that we have a process that has its own file system, is isolated from the main operating system, and can access only a part of the resources.

Summary

In summary, it is possible to create a container using namespaces, cgroups and chroot, to isolate from the outside, limit resources, and provide its own file system, respectively.

The code in this post is based on a talk by Liz Rice at ContainerCamp.