Linux Container Primitives: Mount Namespaces and Information Leaks

Part three of the Linux Container series

March 24, 2020

In the previous part of the Linux Container Primitives series, basic information about namespaces was covered. This post discusses the mount namespace type and its usage in detail. It also describes an information leak related to the usage of mount namespaces in Docker.

Mount Namespaces

Mount namespaces were the first namespace type added to the Linux kernel, in version 2.4.19 in 2002 [1]. Their goal is to restrict the view of the global file hierarchy by providing each namespace with its own set of mount points.

A newly created namespace initially uses a copy of the parent’s mount tree. Mount points are added and removed with the mount and umount commands. Their implementation had to be made namespace-aware: instead of operating on one global set of mount points, a mount or unmount operation affects the mount namespace of the calling process.

The initial implementation of mount namespaces caused a usability issue that reduced their practical value: to make a device available in all namespaces, or in a subset of them, one mount operation had to be executed per namespace. Ideally, a single mount operation would be enough. In addition, a way to manage mount points in all namespaces at once would be more convenient. For these reasons, shared sub-trees were introduced [2].

The basic idea of shared sub-trees is that a propagation type is associated with each mount point. This type determines how mount operations beneath a mount point are propagated to other related mount points. Internally, the kernel uses peer groups to decide whether a mount event is propagated to a specific mount point. A peer group is a set of mount points that share mount events with each other. A mount point is added to a peer group when a shared mount point is replicated, either because a new mount namespace is created or because the mount point is bind-mounted. A mount point leaves its peer group when it is unmounted or when its mount namespace is destroyed.

The following propagation types are available:

MS_SHARED: The mount shares mount events with all mount points in the same peer group. Mount operations on a peer’s mount are also propagated back to the original mount.
MS_PRIVATE: No mount events are propagated and none are received. Mounts created beneath the mount point do not become visible in other mount namespaces.
MS_SLAVE: Mounts of this type only receive events from their master peer group. They do not propagate their own events, so mounts created beneath them effectively stay private.
MS_UNBINDABLE: This type is similar to the private type with one addition: mounts of this type cannot be used as the source of a bind mount. An example usage of this type can be found below.

The type names used by the mount command and by Docker (shared, private, slave, unbindable) can be prefixed with an r to apply the change recursively to all child mounts of a mount point.
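The relation between the four types and the recursive r prefix can be sketched as a small shell helper. This is only an illustration: the function name propagation_flag is hypothetical and not part of any tool, while the --make-* options it prints are those of the mount(8) command from util-linux.

```shell
# Hypothetical helper: map a propagation type to the matching mount(8) option.
# Pass "recursive" as the second argument to get the r-prefixed variant.
propagation_flag() {
    case "$1" in
        shared|private|slave|unbindable) ;;
        *) echo "unknown propagation type: $1" >&2; return 1 ;;
    esac
    if [ "${2:-}" = "recursive" ]; then
        printf '%s\n' "--make-r$1"
    else
        printf '%s\n' "--make-$1"
    fi
}

# Example (requires root and an existing mount point at /mnt):
#   mount "$(propagation_flag slave recursive)" /mnt
```

Calling propagation_flag shared prints --make-shared, the option used later in this post to share a mount with the namespaces of Alice and Bob.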

Consider the following scenario [1]: Two users, Alice and Bob, share parts of the same file hierarchy, and each of them is placed in a separate mount namespace. Each user is then given an individual view of the system directories. To achieve this, the system’s directories are bind-mounted into the user-specific directories of the sub-tree, as shown in the left figure [1] below. Note that the folders of Alice and Bob are replicated in the bind-mounted sub-tree directories, which is not desired.

Using the MS_UNBINDABLE flag prevents this and results in the hierarchy that can be seen in the second figure below [1]:

To make use of the advantages provided by shared sub-trees, a mount now has to be shared with both users. As can be seen in the first figure below, a data medium was inserted. However, it is only present outside of the mount namespaces of Alice and Bob.

By executing a mount command with the --make-shared flag, the mount of the data medium is now also present in the mount namespaces of both users:

Moreover, every mount operation executed in the /mnt directory will now automatically be propagated to the mount namespaces of Alice and Bob.

This approach also enables Alice and Bob to have private mounts that are not shared with other users. This can be achieved either by making the sub-tree mounts of the users slaves or by creating private mounts inside the respective user directories.

The mount points available for a specific process of a namespace can be found along with their propagation types in the /proc/self/mountinfo file.
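As a minimal sketch of how this file can be read: each line of mountinfo lists the mount point in the fifth field, followed from the seventh field onward by optional tags such as shared:N, master:N or unbindable, up to a single - separator. A mount without any of these tags is private. The function name mount_propagation is an assumption made for this example.

```shell
# Print "<mount point> <propagation>" for each mountinfo line read from stdin.
mount_propagation() {
    awk '{
        prop = ""
        # Optional fields start at column 7 and end at the "-" separator.
        for (i = 7; i <= NF && $i != "-"; i++) {
            tag = ""
            if ($i ~ /^shared:/)         tag = "shared"
            else if ($i ~ /^master:/)    tag = "slave"
            else if ($i == "unbindable") tag = "unbindable"
            if (tag != "") prop = (prop == "" ? tag : prop "," tag)
        }
        # No propagation tag at all means the mount is private.
        if (prop == "") prop = "private"
        print $5, prop
    }'
}
```

It can be applied to the current process with mount_propagation < /proc/self/mountinfo. The number after shared: or master: identifies the peer group, so mounts with the same number exchange mount events.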

By default Docker uses the private propagation type when mounting directories with the -v parameter. This can be verified by examining the output of docker inspect <Containername>:

[...]
"Mounts": [
    {
        "Type": "bind",
        "Source": "/tmp/test",
        "Destination": "/tmp/test",
        "Mode": "",
        "RW": true,
        "Propagation": "rprivate"
    }
],
[...]

This isolates the mount points of the host and the container: no mount events are propagated between them. Mount points created on the host therefore do not become available to containerized processes, and vice versa, because separate mount namespaces are in place.

Consider sharing a host directory with a container after an initial process has already been started in it. In this scenario, mounting takes place across two mount namespaces: the host namespace and the respective container namespace. The nsenter utility assists with this task by using setns to join the container’s mount namespace, which allows mount points to be created inside the container. Because the capabilities held in the host’s mount namespace are preserved, a folder governed by the host can be mounted into the container, effectively sharing it between the host and the container.

For containers, host paths can also be masked to provide each container with its own version of a specific folder. For example, masking /proc/acpi prevents a containerized process from enabling host interfaces like the Bluetooth device. Paths that are not properly masked can lead to security issues [3]. The Docker platform automatically masks certain paths under /proc and /sys to counter these potential issues.
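Whether a path is masked can be checked from inside a container by looking for a mount placed exactly on that path. The sketch below assumes that the container runtime implements masking by mounting something over the path (for example an empty read-only tmpfs over a directory); the function name has_mount_at is hypothetical.

```shell
# Succeed if the mountinfo data on stdin lists a mount whose
# mount point (field 5) is exactly the path given as $1.
has_mount_at() {
    awk -v target="$1" '$5 == target { found = 1 } END { exit !found }'
}

# Example, run inside a container:
#   has_mount_at /proc/acpi < /proc/self/mountinfo && echo "masked"
```

A match only shows that something is mounted over the path. It does not prove that the mount was created for masking purposes.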

Discovering an Information Leak in Docker

During an assessment of the Docker platform, two information leaks regarding the /proc/asound path were discovered in the associated OCI container specification:

Leak of Audio Device Status of the Host

When media is being played on the host, the /proc/asound/card*/pcm*p/sub*/status files may contain information regarding the status of media playback. Consider this command for a demonstration:

docker run --rm \
    ubuntu:latest bash -c \
        "sleep 5; \
        cat /proc/asound/card*/pcm*p/sub*/status | \
        grep state | \
        cut -d ' ' -f2 | \
        grep RUNNING || echo 'not running'"

When playing a media file, watching a video or executing similar actions involving sound output on the host, the command above demonstrates that a containerized process is able to gather information on this status. Therefore a process in a container is able to check whether and what kind of user activity is present on the host system. Also, this may indicate whether a container runs on a desktop system or a server as media playback rarely happens on server systems.

The scenario described above concerns media playback. The /proc/asound/card*/pcm*c/sub*/status files (pcm*c instead of pcm*p) leak the same kind of information for sound capture, such as recording audio or making calls on the host system.
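Both cases can be covered by one small script. The following sketch assumes that a status file either contains a line of the form "state: RUNNING" (as the grep and cut pipeline above implies) or just the word closed when the stream is not in use. The function names pcm_state and report_pcm_states are made up for this example.

```shell
# Print the PCM state found in a status file read from stdin.
# Falls back to "closed" if no "state:" line is present (assumed format).
pcm_state() {
    awk '$1 == "state:" { print $2; found = 1 }
         END { if (!found) print "closed" }'
}

# Report every playback (p) and capture (c) substream visible to this process.
report_pcm_states() {
    for f in /proc/asound/card*/pcm*[pc]/sub*/status; do
        [ -r "$f" ] || continue   # glob did not match or the path is masked
        printf '%s %s\n' "$f" "$(pcm_state < "$f")"
    done
}
```

Running report_pcm_states inside a container prints nothing if /proc/asound is not visible, which is the expected result once the path is masked.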

Leak of Monitor Type

Monitors can also act as sound output devices when connected using a compatible interface like DisplayPort. The listing below illustrates how the model type of the connected monitors can be read from within a container:

docker run --rm \
    ubuntu:latest bash -c \
        "cat /proc/asound/card*/eld* | \
        grep monitor_name"

monitor_name SMS24A650 # (A Samsung monitor)
monitor_name SMS24A650

This information should not be accessible from within a container as it contains specific information on the host and its environment.

These issues have been reported [4] to the Docker maintainers and are now fixed upstream: /proc/asound was added to the list of masked paths, so that each container manages its own version of the affected paths. This path list is part of the OCI specification, so the fix also propagates to containerd, which uses the same specification.

Next post in series

Continue reading the next article in this series: The PID and Network Namespaces

References / Credits

Credits: The elaboration and software project associated with this subject are results of a Master’s thesis created at SCHUTZWERK in collaboration with Aalen University by Philipp Schmied.

References

[1] Mount namespaces and shared subtrees
[2] Shared subtrees
[3] Moby Pull Request #37404
[4] Minor Information Leaks in /proc/asound

~ Philipp Schmied