Linux Container Primitives: PID and Network Namespaces

Part four of the Linux Container series

August 24, 2020

After discussing the mount namespace and an information leak issue in Docker in the previous blog post, this part illustrates the PID and network namespace types. The following list shows the topics of all scheduled blog posts. It will be updated with the corresponding links as new posts are released.

PID Namespaces

Creating a PID namespace isolates the process ID number space. Processes in a container can therefore have PIDs starting from the value 1, whereas the PID of the same process outside of the namespace is an entirely different number. As a result, multiple processes on a system can have the same PID value as long as they reside in different PID namespaces. Because the PIDs are unique and persistent within such a namespace, the processes of a container can be suspended and resumed while keeping their PIDs. Furthermore, containers can have their own init processes with a PID value of 1 that prepare the container internals and reap child processes themselves rather than delegating this task to the system-wide init process.

Joining namespaces of this type limits process communication: it’s not possible to interact with processes of other PID namespaces in a way that requires a process identifier when issuing system calls. For example, processes can only send signals to processes that reside in their current namespace or in namespaces descending from it. This isolation is one of the main use cases for this namespace type in containerization.

PID namespaces can be nested up to a depth of 32. A process has one PID in the namespace it resides in and one additional PID in every namespace that’s a direct ancestor, walking back to the root namespace. Changing PID namespaces is a one-way operation: it’s not possible to move a process up to an ancestor namespace.

Moving a process to its own PID namespace does not by itself isolate the list of processes visible to that process. This can be verified by creating a shell in its own PID namespace and examining the process list:

user@box:~$ sudo unshare -fp /bin/bash
root@box:~# ps aux
USER  PID [...] COMMAND
[...]
user  5380 [...] /usr/lib/[...]/chromium-browser --enable-pinch
user  5388 [...] /usr/lib/[...]/chromium-browser --type=zygote
[...]

As can be seen in the listing above, processes residing in other namespaces are still visible. ps builds the process list by reading the /proc/<PID> directories, and these are not isolated by joining a new PID namespace. Appending the option --mount-proc to the command used above resolves this issue: unshare then also creates a new mount namespace and remounts the /proc directory using mount -t proc proc /proc. This effectively isolates the contents of /proc and prevents a container from accessing the information stored there for processes residing in other namespaces. This remounting mechanism is part of the default behavior of Docker and other container engines. It prevents leaking information about host processes and accessing parts of these processes through a shared /proc.

A simple C++ usage example for this namespace type can be examined below:

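#include <sched.h>    // clone() and the CLONE_* flags (g++ defines _GNU_SOURCE)
#include <unistd.h>   // getpid(), getppid()
#include <cstdio>     // perror()
#include <cstdlib>    // system()
#include <iostream>

// Stack for the child; the size is chosen for illustration only.
static char childStack[1024 * 1024];

// Entry point of the child, which runs inside the new PID namespace.
int run(void *) {
    std::cout << "[Child] PID: " << getpid() << std::endl;
    std::cout << "[Child] Parent PID: " << getppid() << std::endl;
    system("/bin/sh");
    return 0;
}

int main(int argc, char const *argv[]) {
    // The stack grows downwards, so clone() expects its topmost address.
    char *childStackTop = childStack + sizeof(childStack);
    // CLONE_VFORK suspends the parent until the child exits or calls execve().
    pid_t childPid = clone(run, childStackTop, CLONE_VFORK | CLONE_NEWPID,
        nullptr);
    if (childPid == -1) {
        perror("clone");  // e.g. EPERM when run without root privileges
        return 1;
    }
    std::cout << "[Parent] Child PID: " << childPid << std::endl;
    return 0;
}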

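After compiling this code and executing it with root privileges, it becomes clear that the child is running with PID 1 in the isolated environment and a different, much higher identifier outside of the namespace. The parent PID reported by the child is 0 because its parent resides outside of the new namespace.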

A process can be created in a new PID namespace by passing the CLONE_NEWPID flag to clone. Passing the same flag to unshare also creates a new PID namespace, whereas setns selects an existing one. In both cases the namespace is referenced by the /proc/<PID>/ns/pid_for_children file and the process calling unshare or setns stays in its current namespace. Only the children created afterwards are placed in the target namespace. After unshare, the first child receives a PID value of 1, making it the container’s init process. This behavior results from an assumption made by many libraries and applications: the PID of a process is not subject to change. Joining a new namespace at runtime would change the identifier of a process. In fact, older glibc versions (up to 2.24) even cached the PID returned by getpid [1].

Ever since PID namespaces have been implemented, PIDs are represented by a kernel structure called pid. It holds one upid structure per namespace level (both are defined in include/linux/pid.h), which makes it possible to determine the numeric PID as seen in a specific PID namespace.

Network Namespaces

A network namespace gives processes their own private network stack, including interfaces, routing tables and sockets [2]. The corresponding clone flag is CLONE_NEWNET, and the ip CLI application makes it easy to manage network namespaces. The ip netns command is able to create a new persistent network namespace. This is accomplished by bind-mounting /proc/self/ns/net to /var/run/netns/<name of namespace>. As a result, the namespace can be configured without moving processes into it first.

To create a network namespace from the command line, ip netns add is used as follows:

root@box:~# ip netns add one # Create a network namespace
root@box:~# strace ip netns add two # Trace the creation
[...]
unshare(CLONE_NEWNET) = 0
mount("/proc/self/ns/net", "/var/run/netns/two", [...]) = 0
[...]
root@box:~# ip netns exec one ip link # Execute command in NS
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT [...]
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

As seen in the output listed above, adding a new network namespace is accomplished by following these steps:

Entering a new network namespace by using unshare with the CLONE_NEWNET flag
Executing the mount operation mentioned above

Deleting a network namespace works analogously (ip netns delete). However, a namespace is only destroyed once all processes residing in it have terminated. Until then, it’s merely marked for deletion.

A new network namespace comes with its own loopback interface, which is down by default. To configure a namespace, it may be convenient to spawn a shell in the desired network namespace with ip netns exec <name> bash. Among others, the following configuration scenarios are possible:

Linking network namespaces in order to provide a way for different namespaces to communicate by creating a network bridge.
Routing a network namespace through the host network stack to enable internet access.
Assigning a physical interface to a namespace. Please note that each network interface belongs to exactly one namespace at any given point in time. The same is true for sockets.

To enable communication between the default network namespace and another namespace, virtual Ethernet (veth) interfaces can be used. These interfaces are created in pairs, in this example veth0 and veth1, and allow a pipe-like communication:

# Create the virtual interface pair
root@box:~# ip link add veth0 type veth peer name veth1
# Move veth1 to the namespace named `one`
root@box:~# ip link set veth1 netns one
# Set the IP for veth1 in the new namespace
root@box:~# ip netns exec one ifconfig veth1 10.1.1.1/24 up
# Set the IP in the default namespace
root@box:~# ifconfig veth0 10.1.1.2/24 up

As seen above, it’s possible to move an interface to a different namespace. This can, for example, be used to enable communication between containers, similar to the functionality provided by docker run --link. This way, private container networks can be built that are isolated from the host and from other containers on the same host. Internet access for containers can be provided in the same manner. While the Docker platform does this by default using a bridge network, it may be necessary to configure parts of this aspect manually for other container engines.

Moving an interface back to the default network namespace is accomplished with the command ip link set eth0 netns 1. The command resolves the target namespace from the specified PID, here the value 1, which allows administrators to identify namespaces by the process identifiers present in a given namespace. Moving interfaces between namespaces is implemented in the dev_change_net_namespace function of the Linux kernel [3]. Besides closing and moving the desired interface, it also notifies the protocol handlers registered for the device and flushes the device’s unicast and multicast address lists.

A common use case for network namespaces is running multiple services that bind to the same port on a single machine. This works by allocating a different port in the host’s network namespace for each port that’s exposed by a container’s network namespace. As a result, all containerized web services can be configured to run on the default port 80 while they are exposed on entirely different ports. Consequently, applications do not require information on the real port that’s used to expose the service, and the host system can map ports freely. This provides a certain degree of portability when deploying containers that use network functionality.

Next post in series

Continue reading the next article in this series: The User Namespace

Credits

The elaboration and the software project associated with this subject are results of a Master’s thesis written at SCHUTZWERK in collaboration with Aalen University by Philipp Schmied.

References

[1] Namespaces in operation, part 4: more on PID namespaces
[2] Namespaces in operation, part 7: Network namespaces
[3] dev.c - Protocol independent device support routines

~ Philipp Schmied