Linux Container Primitives: An Introduction to Namespaces

Part two of the Linux Container series

October 29, 2019

The following list shows the topics of all scheduled blog posts regarding Linux containers. It will be updated with the corresponding links as new posts are released.

First introduced in Linux kernel version 2.4.19 in 2002, namespaces define groups of processes that share a common view of specific system resources. This isolates the view that a group of processes has of a system resource: a process can, for instance, have its own hostname while the real hostname of the system has an entirely different value.

Various namespace types exist. As of Linux kernel version 4.19, the following types are available:

UTS
Mount
PID
Network
IPC (Inter Process Communication)
Control Group
User

Consider the UTS namespace as an example: Every process in a single UTS namespace shares the hostname with every other process in the same UTS namespace. Otherwise, the value of the hostname is isolated between different UTS namespaces. The code listing below illustrates this:

root@box:~# hostname # Get the hostname
box
root@box:~# unshare -u # Create a sub shell in a new UTS namespace
$ hostname # Get the hostname in the created UTS namespace
box
$ hostname anotherbox # Change the hostname in the UTS namespace
$ hostname # Verify that it has been changed
anotherbox
$ exit # Exit the UTS namespace shell
root@box:~# hostname # Verify that the real hostname of the parent UTS namespace hasn't changed
box

Namespaces can also be used in combination by making a process a member of multiple new namespaces at once. This is useful for containerization where multiple resources have to be isolated at once.

It’s important to note that by default every process is a member of one namespace of each type listed above. These namespaces are called default, init or root namespaces. If no additional namespace configuration is in place, processes and all their direct children reside in these exact namespaces. This can be verified by executing lsns in two different terminals to list all namespaces and comparing the namespace identifiers, which will be equal.

The isolation provided by namespaces is highly configurable and flexible. For instance, it’s possible for a database application and a web application to share the same network namespace, allowing both processes to communicate while other processes that reside in other network namespaces are not able to do so.

System Calls

Three system calls are commonly used in conjunction with namespaces:

clone: Creates child processes.
unshare: Disables sharing a namespace with the parent process, effectively unsharing a namespace and its associated resources. Note that this system call changes the shared namespaces in place without spawning a new process. The PID namespace is an exception in this case, as discussed in one of the next blog posts.
setns: Attaches a process to an already existing namespace.

Similar to the fork system call, clone is used to create child processes. There are multiple differences between the two calls, the most significant being that clone can be parameterized extensively using flags. For example, it allows sharing parts of the execution context, such as the process memory space, with a child process. Therefore clone can also create threads and is more versatile than the legacy fork call, which does not support most of this behavior. Instead, fork essentially creates child processes as copies of the parent process. Before going into more detail about the differences and similarities, it’s important to understand what happens when fork, clone or a system call in general is invoked in a C program.
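To make the difference in sharing the memory space more tangible, consider the following sketch. It is an illustration only: the names counter, bump and counter_after_child are invented here, and the stack size of 1 MiB is an arbitrary assumption. The child writes to a global variable, and the parent sees that write only if CLONE_VM was passed.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024) /* assumed size, chosen generously */

static volatile int counter;

/* Executed by the child: modifies the global variable */
static int bump(void *arg) {
    (void)arg;
    counter = 42;
    return 0;
}

/* Hypothetical helper: runs bump in a child created by clone and
 * returns the value of counter as seen by the parent afterwards,
 * or -1 on error. */
int counter_after_child(int share_memory) {
    counter = 0;
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;
    /* SIGCHLD lets the parent wait for the child like after fork */
    int flags = SIGCHLD | (share_memory ? CLONE_VM : 0);
    pid_t pid = clone(bump, stack + STACK_SIZE, flags, NULL);
    int ok = pid > 0 && waitpid(pid, NULL, 0) == pid;
    free(stack);
    return ok ? counter : -1;
}
```

With share_memory set, the child operates on the parent's memory, so the parent reads 42. Without it, the child works on its own copy, as a child created by fork would, and the parent still reads 0.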

fork() != fork

When one of these two calls is used in C code, the code that is actually executed is not the system call itself as defined in the system call table. Instead, a wrapper function of the C library is called, which is often named after the system call it wraps. These wrappers exist because they are more convenient for developers than using system calls directly. For instance, to use a system call it’s necessary to set up registers, switch to kernel mode, handle the call results and switch back to user mode [1]. A wrapper simplifies this by performing these tasks in its own code.
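The distinction between a wrapper and the underlying system call can be sketched with a simpler call than fork. The example below uses getpid, which is not discussed in this post and only serves as an illustration. It compares the dedicated glibc wrapper with an invocation by system call number through the generic syscall function. Note that syscall is itself a library function that takes care of the register setup, so this is still not a hand-written kernel entry.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Uses the dedicated C library wrapper */
pid_t pid_via_wrapper(void) {
    return getpid();
}

/* Invokes the system call by its number from the system call table */
pid_t pid_via_syscall_number(void) {
    return (pid_t)syscall(SYS_getpid);
}
```

Both functions are expected to return the same process ID, as they end up in the same kernel code.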

When inspecting the fork wrapper function it becomes clear that the actual fork system call that’s supposed to be wrapped is not used at all. Instead, the ARCH_FORK macro gets called, which is an inline system call to clone defined as:

#define ARCH_FORK() \
  INLINE_SYSCALL (clone, 4, \
		  CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, 0, \
		  NULL, &THREAD_SELF->tid)

As a result, the fork and clone library functions invoke the same system call, namely clone. This can be verified by compiling a C application containing a call to fork and using strace to trace the system calls made when executing the resulting binary: no calls to fork are present, only a call to clone with the SIGCHLD flag. The legacy fork call is thus implemented with a call to clone. This is possible because clone is more powerful and configurable than fork, so it can replace fork entirely for spawning processes and threads.
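The same idea can be reproduced by hand. The sketch below is not the glibc implementation: it omits the CLONE_CHILD_SETTID and CLONE_CHILD_CLEARTID flags and any fork handlers, and it assumes an architecture such as x86-64 where flags is the first argument of the raw clone system call. The function names are invented for this example.

```c
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Creates a child similar to fork, but through the raw clone system
 * call. Only the termination signal is set in the flags; the stack
 * and the remaining arguments are zero, so the child continues on a
 * copy of the parent's stack. */
pid_t fork_via_clone(void) {
    return (pid_t)syscall(SYS_clone, (long)SIGCHLD, 0L, 0L, 0L, 0L);
}

/* Hypothetical helper: lets a child exit with `code` and returns the
 * exit status collected by the parent, or -1 on error. */
int child_exit_status(int code) {
    pid_t pid = fork_via_clone();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code); /* child: terminate immediately */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because SIGCHLD is passed as the termination signal, the parent can collect the child with waitpid in the same way as after a regular fork.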

Using clone()

The clone system call accepts various flags to configure process creation. For namespaces, a subset of these flags specifies the new namespaces a process will join. When such a flag is supplied, the child process is initialized with a modified copy of the parent’s namespace configuration: it becomes a member of a new namespace for each type whose flag is set. If the flag of a namespace type is not specified, the process stays in the parent’s namespace of this type, providing no additional level of isolation. Consider the UTS namespace example from above: the clone flag responsible for the creation of such a namespace is CLONE_NEWUTS.

The glibc wrapper for clone has the following prototype, which accepts the flags described above:

int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg)

child_func: Function pointer to the function being executed by the child process.
child_stack: The downwardly growing stack the child will operate on. The pointer has to reference the top end of the stack memory.
flags: Integer value representing all used flags, combined using a bitwise OR.
arg: Argument that is passed to child_func.
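How such a flags value can be assembled is sketched below. The helpers ns_flag and build_clone_flags as well as the short namespace names are assumptions made for this illustration and are not part of any library. The sketch also ORs in SIGCHLD, since the low byte of the flags value holds the signal that is sent to the parent when the child terminates.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: maps a short namespace name to its clone flag.
 * Returns 0 if the name is unknown. */
int ns_flag(const char *name) {
    static const struct { const char *name; int flag; } table[] = {
        {"uts", CLONE_NEWUTS},   {"mnt", CLONE_NEWNS},
        {"pid", CLONE_NEWPID},   {"net", CLONE_NEWNET},
        {"ipc", CLONE_NEWIPC},   {"cgroup", CLONE_NEWCGROUP},
        {"user", CLONE_NEWUSER},
    };
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (strcmp(name, table[i].name) == 0)
            return table[i].flag;
    return 0;
}

/* Hypothetical helper: combines the flags of all given namespace
 * names with a bitwise OR. Returns -1 if a name is unknown. */
int build_clone_flags(const char **names, size_t count) {
    int flags = SIGCHLD; /* termination signal in the low byte */
    for (size_t i = 0; i < count; i++) {
        int flag = ns_flag(names[i]);
        if (flag == 0)
            return -1;
        flags |= flag;
    }
    return flags;
}
```

Passing the names "uts" and "pid" yields CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, which could then be handed to clone as its flags argument.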

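The following code snippet shows a minimal example of using clone to spawn a shell process in a new UTS namespace. Creating the namespace requires the CAP_SYS_ADMIN capability, so the program has to be run as root. Error handling is kept to a minimum.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

// Generous stack size; a few KiB can be too small for library calls
#define STACKSIZE (1024 * 1024)

// Will be executed by the child process
static int run(void *arg) {
    (void)arg; // unused
    system("/bin/sh");
    return 0;
}

int main(void) {
    char *childStack = malloc(STACKSIZE);
    if (childStack == NULL) {
        perror("malloc");
        return EXIT_FAILURE;
    }
    // The stack grows downward, so pass the address of its top end
    char *childStackTop = childStack + STACKSIZE;
    // CLONE_VFORK suspends the parent until the child exits
    pid_t childPid = clone(run, childStackTop,
        CLONE_VFORK | CLONE_NEWUTS, NULL);
    if (childPid == -1)
        perror("clone"); // e.g. EPERM without CAP_SYS_ADMIN
    free(childStack);
    return childPid == -1 ? EXIT_FAILURE : EXIT_SUCCESS;
}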

After compiling and running this example, a shell in a new UTS namespace is spawned.
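The isolation can also be checked without an interactive shell. In the following sketch the child renames the host inside its new UTS namespace and the parent verifies that its own hostname is untouched. The function names are invented for this illustration, the hostname "anotherbox" is taken from the listing at the beginning of this post, and the call only succeeds with the CAP_SYS_ADMIN capability, for instance when running as root.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024) /* assumed size, chosen generously */

/* Executed by the child: renames the host in the new UTS namespace */
static int rename_host(void *arg) {
    const char *name = arg;
    return sethostname(name, strlen(name)) == 0 ? 0 : 1;
}

/* Hypothetical helper: returns 1 if the parent's hostname is unchanged
 * after the child renamed its own, 0 if it changed, and -1 if a step
 * failed (for example due to a missing CAP_SYS_ADMIN capability). */
int hostname_survives_child_rename(void) {
    char before[256] = {0}, after[256] = {0};
    if (gethostname(before, sizeof(before) - 1) != 0)
        return -1;
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;
    pid_t pid = clone(rename_host, stack + STACK_SIZE,
                      CLONE_NEWUTS | SIGCHLD, "anotherbox");
    int status = 0;
    int ok = pid > 0 && waitpid(pid, &status, 0) == pid &&
             WIFEXITED(status) && WEXITSTATUS(status) == 0;
    free(stack);
    if (!ok || gethostname(after, sizeof(after) - 1) != 0)
        return -1;
    return strcmp(before, after) == 0;
}
```

When run with sufficient privileges the function is expected to return 1, since the child only changes the hostname of its own UTS namespace.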

unshare()

This system call allows processes to disable sharing namespaces after they have been created. In contrast, clone moves child processes to namespaces while creating them. There is also a CLI application called unshare that creates a new process and unshares specific system resources, whereas using the system call in a C program isolates a process at runtime without allocating a new process:

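#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    // Leave the current UTS namespace; requires CAP_SYS_ADMIN
    if (unshare(CLONE_NEWUTS) == -1) {
        perror("unshare");
        return EXIT_FAILURE;
    }
    // Print hostname
    system("hostname");
    // Set hostname
    system("hostname testname");
    // Print hostname again
    system("hostname");
    // Print the available namespaces
    system("lsns");
    // Spawn a shell; execvp expects a NULL-terminated argument vector
    char *const shellArgs[] = {"/bin/sh", NULL};
    execvp(shellArgs[0], shellArgs);
    perror("execvp"); // Only reached if execvp failed
    return EXIT_FAILURE;
}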

When compiling and running the program, output similar to the following listing can be observed:

root@box:~# ./a.out
box
testname
[...]
NS         TYPE      NPROCS  PID USER           COMMAND
[...]
4026531835 cgroup    336     1 root             /sbin/init splash
4026531836 pid       293     1 root             /sbin/init splash
4026531837 user      293     1 root             /sbin/init splash
4026531838 uts       333     1 root             /sbin/init splash
4026531839 ipc       336     1 root             /sbin/init splash
4026531840 mnt       326     1 root             /sbin/init splash
4026532009 net       292     1 root             /sbin/init splash
[...]
4026532436 uts         3 27750 root             ./a.out
[...]
root@box:~# lsns | grep uts
4026531838 uts       134   803 user             /bin/sh

As seen above, the created process is able to set its own hostname, isolating the value of this setting in place. Additionally, the lsns output of the spawned process lists two UTS namespaces, whereas the parent shell only sees one. According to the namespace IDs, one of them is the default namespace of that type, which the parent process is a member of. The other one has been created by the unshare call and is the namespace the spawned process now belongs to.
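That unshare really changes the namespace membership of the calling process in place can be observed by comparing its namespace identifier before and after the call. The following sketch assumes /proc is mounted and that the process holds CAP_SYS_ADMIN; the function name is made up for this example.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if unshare moved the calling process
 * to a different UTS namespace, 0 if the identifier stayed the same,
 * and -1 if a step failed (for example with EPERM). */
int unshare_changes_uts_id(void) {
    char before[64], after[64];
    ssize_t n = readlink("/proc/self/ns/uts", before, sizeof(before) - 1);
    if (n < 0)
        return -1;
    before[n] = '\0';

    /* No new process is created here, the caller itself is moved */
    if (unshare(CLONE_NEWUTS) != 0)
        return -1;

    n = readlink("/proc/self/ns/uts", after, sizeof(after) - 1);
    if (n < 0)
        return -1;
    after[n] = '\0';
    return strcmp(before, after) != 0;
}
```

The process ID stays the same across the call, only the identifier behind /proc/self/ns/uts differs afterwards.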

The command lsns gathers the displayed information by checking the contents of the /proc/<PID>/ns directory for each PID. In the context of containers it should not be possible to get information on parent namespaces like it’s possible in this example. When creating a container, the /proc directory or sensitive parts of it should therefore be isolated to prevent this information leak.

setns()

To add processes to already existing namespaces, setns is used. It disassociates a process from its original namespace and associates it with another namespace of the same type. The prototype of this system call is as follows:

int setns(int fd, int nstype)

fd: File descriptor of a symbolic link representing a specific namespace as represented in /proc/<PID>/ns.
nstype: This parameter is designated for checks regarding the namespace type. By passing a CLONE_NEW* flag, the namespace type of the first parameter is checked before entering the namespace. This makes sure that the passed file descriptor indeed points to the desired namespace type. To disable this check, a zero value can also be used for this parameter.

A simple example that invokes a command in a given namespace can be examined in the following code listing ([2] – modified):

[...]
// Get namespace file descriptor
int fd = open(argv[1], O_RDONLY);
// Join the namespace
setns(fd, 0);
// Execute a command in the namespace
execvp(argv[2], &argv[2]);
[...]

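The code above moves the calling process into the given namespace and then replaces its program image with the specified command. No additional child process is created by execvp, so the command runs under the same PID but in a different namespace than the shell that launched the program.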
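The listing above disables the type check by passing zero. A variant that uses the nstype parameter could look like the following sketch. The helper join_namespace is an assumption made for this illustration, and successfully joining a namespace requires the CAP_SYS_ADMIN capability.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Hypothetical helper: joins the namespace referenced by `path`,
 * for example "/proc/1/ns/uts". `nstype` is either 0 (no check) or
 * the CLONE_NEW* flag of the expected namespace type.
 * Returns 0 on success and -1 on error, leaving errno set by the
 * failing call where possible. */
int join_namespace(const char *path, int nstype) {
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;
    int result = setns(fd, nstype);
    /* The descriptor is not needed once the namespace was joined */
    close(fd);
    return result;
}
```

Passing a mismatching type, for example join_namespace("/proc/self/ns/uts", CLONE_NEWNET), is rejected by the kernel, which protects against accidentally entering a namespace of an unexpected type.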

To perform this from a CLI, the nsenter application can be used. Consider a shell process residing in a separate UTS namespace: executing nsenter -a -t 1 in it starts a shell that is a member of all namespaces of the system initialization process with PID 1. This effectively reverts the call to unshare, making the original UTS namespace available again. Changing the hostname from this shell now affects the hostname of the default namespace and therefore the system’s hostname. As a result, it’s important to prevent these types of setns calls by isolating the exposed namespaces. This illustrates that namespaces alone may not be sufficient to prevent system modifications.

Next post in series

Continue reading the next article in this series: The Mount Namespace and a Description of a Related Information Leak in Docker

References / Credits

Credits: The elaboration and software project associated with this subject are results of a Master’s thesis created at SCHUTZWERK in collaboration with Aalen University by Philipp Schmied.

References

1 - LWN: Glibc and the kernel user-space API
2 - LWN: Namespaces in operation, part 2: the namespaces API

~ Philipp Schmied