Part eight of the Linux Container series

Linux Container Primitives: Network and Block I/O Control Groups

In the previous post of the Linux Container Primitives series, the basics of control groups were covered. This post illustrates the purpose of two cgroup controllers: Network and Block I/O. The following list shows the topics of all scheduled blog posts. It will be updated with the corresponding links as new posts are released.

  1. An Introduction to Linux Containers
  2. Linux Capabilities
  3. An Introduction to Namespaces
  4. The Mount Namespace and a Description of a Related Information Leak in Docker
  5. The PID and Network Namespaces
  6. The User Namespace
  7. Namespaces Kernel View and Usage in Containerization
  8. An Introduction to Control Groups
  9. The Network and Block I/O Controllers
  10. The Memory, CPU, Freezer and Device Controllers
  11. Control Groups Kernel View and Usage in Containerization

The Network Controller (v1)

This section covers the resource controllers net_cls and net_prio. Both attach an identifier to every socket created by a process that is managed by the respective controller. The difference is that net_prio assigns an ID that is unique to each control group, whereas net_cls uses a user-specified identifier that does not have to be unique per cgroup, which allows sockets to be tagged flexibly in classes [1]. These identifiers allow quick checks of whether a socket originates from a given control group or class. This is more efficient than searching the control group tree, for example with the cgroup function is_descendant(), especially if the check has to be performed regularly.

There are multiple use cases for these additional socket attributes, among them:

  • Setting the priority of network packets originating from a specific socket or device by using the network priority set by net_prio.
  • Using iptables in combination with the net_cls class identifier to filter and route packets based on control group membership.
  • Scheduling network packets based on class identifiers.
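
As a minimal sketch of the first use case: net_prio exposes a file net_prio.ifpriomap per cgroup, which takes lines of the form "<ifname> <priority>". The helper name set_net_prio, the mount path and the cgroup name LowPrio below are assumptions for illustration only.

```shell
# Hypothetical helper: set the priority that a net_prio cgroup assigns to
# traffic leaving through one interface. The file expects "<ifname> <prio>".
set_net_prio() {
    cgroup_dir=$1 iface=$2 prio=$3
    printf '%s %s\n' "$iface" "$prio" > "$cgroup_dir/net_prio.ifpriomap"
}

# Usage (as root, assuming the controller is mounted at this path and the
# cgroup "LowPrio" exists; both names are examples):
#   set_net_prio /sys/fs/cgroup/net_prio/LowPrio eth0 5
```

Passing the cgroup directory as a parameter keeps the helper independent of where the v1 hierarchy is mounted on a given system.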

The following simple net_cls example drops all IP-based traffic of processes that are not members of a specific control group [2]:

root@box:~# mkdir /sys/fs/cgroup/net_cls # Create mountpoint
root@box:~# mount -t cgroup \
    -o net_cls net_cls /sys/fs/cgroup/net_cls # Mount the controller
root@box:~# mkdir /sys/fs/cgroup/net_cls/IPAllowed # Create cgroup
root@box:~# echo 0x100001 > \
    /sys/fs/cgroup/net_cls/IPAllowed/net_cls.classid # Set the fixed class identifier
root@box:~# tc qdisc add dev <interface> root handle 10: htb
root@box:~# tc class add dev <interface> parent 10: classid 10:1 \
        htb rate 40mbit
root@box:~# tc filter add dev <interface> parent 10: protocol ip \
        prio 10 handle 1: cgroup
root@box:~# iptables -A OUTPUT -m cgroup ! \
        --cgroup 0x100001 -j DROP # Disallow IP for all non-members
root@box:~# echo $$ > \
        /sys/fs/cgroup/net_cls/IPAllowed/cgroup.procs # Add process to cgroup
-- Filtering active --
root@box:~# tc qdisc del dev <interface> root; \
        tc qdisc add dev <interface> root pfifo # Revert settings

The tc (Traffic Control) commands listed above set up an HTB (Hierarchical Token Bucket) qdisc (queueing discipline) with the handle 10: on a network interface and add the class 10:1, which is limited to 40 Mbit/s. The cgroup filter then classifies outgoing traffic by the class identifier of the originating control group: the value 0x100001 written to net_cls.classid corresponds to the tc class 10:1. iptables can match on the same class identifier with its cgroup match to add rules for a control group, in this case dropping all outgoing packets of sockets that do not carry the identifier.
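
The relation between the class identifier and the tc handle can be sketched as follows. The kernel documentation for net_cls describes the value as 0xAAAABBBB, with AAAA being the major and BBBB the minor handle number. The helper names classid_from_handle and handle_from_classid are made up for this illustration.

```shell
# Hypothetical helper: build a net_cls class ID from a tc handle MAJOR:MINOR.
# Both parts are given in hexadecimal, as tc writes them (e.g. 10:1).
classid_from_handle() {
    major=$((0x$1))
    minor=$((0x$2))
    printf '0x%x\n' $(( (major << 16) | minor ))
}

# Hypothetical helper: split a class ID back into the tc handle it selects.
handle_from_classid() {
    id=$(($1))
    printf '%x:%x\n' $(( (id >> 16) & 0xffff )) $(( id & 0xffff ))
}

classid_from_handle 10 1       # prints 0x100001
handle_from_classid 0x100001   # prints 10:1
```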

The net_* controllers are an example of controllers whose settings do not automatically affect child control groups, meaning that they are not inherited throughout the hierarchy.

The Block I/O Controller (v1/v2)

The blkio (v1) / io (v2) controller is used to enforce I/O resource usage policies. The most common use cases are:

  • Specifying upper bandwidth limits, for example with the blkio.throttle.read_bps_device file, which sets the maximum read bandwidth for a device in bytes per second. In version 2, the rbps parameter of the io.max file is the equivalent.

  • Denying access to a specific device.

  • Limiting with proportional, time-based division: this allows setting weights for control groups, which are used to prioritize device accesses when performing I/O operations. cgroup v1 provides the blkio.weight file for this purpose, whereas version 2 uses io.weight.
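
The bandwidth limit files from the first item expect one line per device, identified by its major and minor number. A minimal sketch of the two formats is shown below. The helper names are hypothetical, and the functions only print the line that would be written to the respective file.

```shell
# Hypothetical helpers that only format the configuration lines; the
# device is identified by its MAJOR:MINOR numbers (8:0 is commonly sda).

# cgroup v1: line for blkio.throttle.read_bps_device ("MAJ:MIN BYTES_PER_SEC")
blkio_v1_read_limit() {
    printf '%s:%s %s\n' "$1" "$2" "$3"
}

# cgroup v2: line for io.max ("MAJ:MIN rbps=BYTES_PER_SEC")
io_v2_read_limit() {
    printf '%s:%s rbps=%s\n' "$1" "$2" "$3"
}

# Limit reads from device 8:0 to 1 MiB/s (1048576 bytes per second).
blkio_v1_read_limit 8 0 1048576   # prints: 8:0 1048576
io_v2_read_limit 8 0 1048576      # prints: 8:0 rbps=1048576
```

To apply a limit, the output would be redirected as root into the corresponding file of the target cgroup. The location of that file depends on where the cgroup hierarchy is mounted.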

Enforcing Limits in the Kernel

Bandwidth limits are enforced in blk_throtl_bio, which resides in block/blk-throttle.c. This function makes use of throtl_charge_bio to ultimately charge for the data volume used in an I/O operation. Depending on the resource usage, an I/O operation can be executed directly or may have to be delayed in a queue to meet the resource limits. For delayed operations, a dispatcher function then causes pending operations to be executed using pre-calculated timers in order to throttle the requested operations. throtl_trim_slice calculates the required time limit, yielding the time slice in which the operation may be executed.
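
The decision between direct execution and delay can be illustrated with a strongly simplified model. The sketch below is not the kernel's algorithm. It assumes a single time slice and a byte budget that grows linearly with the elapsed time, and the function name throttle_wait_ms and its parameters are made up for this example.

```shell
# Simplified model of bandwidth throttling, not the kernel's algorithm.
# Usage: throttle_wait_ms LIMIT_BPS ELAPSED_MS BYTES_DISPATCHED REQUEST_BYTES
# Prints how long (in ms) a request must wait so that the bytes dispatched
# in the current time slice stay within LIMIT_BPS bytes per second.
throttle_wait_ms() {
    limit_bps=$1 elapsed_ms=$2 dispatched=$3 request=$4
    allowed=$(( limit_bps * elapsed_ms / 1000 ))   # budget earned so far
    total=$(( dispatched + request ))
    if [ "$total" -le "$allowed" ]; then
        echo 0                                     # dispatch directly
    else
        extra=$(( total - allowed ))
        echo $(( (extra * 1000 + limit_bps - 1) / limit_bps ))  # round up
    fi
}

throttle_wait_ms 1000000 100 0 50000        # prints 0
throttle_wait_ms 1000000 100 100000 50000   # prints 50
```

In the second call the budget of 100000 bytes is already used up, so the additional 50000 bytes have to wait for 50 ms at a limit of 1000000 bytes per second.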

To allow or deny access to a specific device, functions of security/device_cgroup.c are used. When cgroup configuration strings are written to the files in the virtual control group file system, devcgroup_update_access parses this information and configures the control group accordingly, e.g. by setting flags that indicate whether access to a device is allowed or denied for the processes of a cgroup. This builds the exception list used in the listing below. When a block device is accessed, __blkdev_get (fs/block_dev.c) is called, which performs access checks before granting access. For these checks, __devcgroup_check_permission (security/device_cgroup.c) is called, which evaluates the internal exception list as follows:

// current is the task_struct of the calling process
dev_cgroup = task_devcgroup(current);
if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW)
    // default allow: the exceptions are denials, so the request
    // must not match any of them, not even partially
    rc = !match_exception_partial(&dev_cgroup->exceptions,
        type, major, minor, access);
else
    // default deny: the request must be fully covered by an exception
    rc = match_exception(&dev_cgroup->exceptions, type, major,
        minor, access);
if (!rc)
    return -EPERM; // deny access
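
The two branches of this check can be modelled in user space. The following sketch is a simplification and not kernel code. The exception list format, the variable EXCEPTIONS and all function names are assumptions made for illustration. In default-allow mode an exception acts as a denial and already applies on a partial overlap of the access flags. In default-deny mode an exception only grants access if it covers the whole request.

```shell
# Simplified user-space model of the decision shown above; not kernel code.
# EXCEPTIONS holds one exception per line: "<type> <major> <minor> <access>",
# where "*" is a wildcard and <access> is a subset of the letters r, w, m.
EXCEPTIONS='c 1 3 rw
b 8 * r'

# Succeeds if every letter in $1 also occurs in $2.
access_subset() {
    case "$1" in *[!$2]*) return 1 ;; esac
    return 0
}

# Succeeds if at least one letter in $1 occurs in $2.
access_overlap() {
    case "$1" in *[$2]*) return 0 ;; esac
    return 1
}

# Usage: device_allowed allow|deny TYPE MAJOR MINOR ACCESS
device_allowed() {
    behavior=$1 type=$2 major=$3 minor=$4 access=$5
    matched=1   # 1 = no exception matched so far
    while read -r e_type e_major e_minor e_access; do
        [ "$e_type" = "$type" ] || continue
        [ "$e_major" = "*" ] || [ "$e_major" = "$major" ] || continue
        [ "$e_minor" = "*" ] || [ "$e_minor" = "$minor" ] || continue
        if [ "$behavior" = allow ]; then
            access_overlap "$access" "$e_access" && matched=0
        else
            access_subset "$access" "$e_access" && matched=0
        fi
    done <<EOF
$EXCEPTIONS
EOF
    if [ "$behavior" = allow ]; then
        [ "$matched" -ne 0 ]    # allowed unless an exception overlaps
    else
        [ "$matched" -eq 0 ]    # allowed only if an exception covers it
    fi
}

device_allowed deny c 1 3 rw && echo "allowed"    # prints allowed
device_allowed deny c 1 3 rwm || echo "denied"    # prints denied
```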

The default scheduler for I/O operations in Linux kernels before version 5.0 is typically the CFQ (Completely Fair Queuing) scheduler. It was extended to support the I/O related cgroup controllers after control groups were introduced in the kernel. This makes it possible to account for and constrain the I/O resources consumed by processes, for example using pre-defined weights. The CFQ I/O scheduler is implemented in block/cfq-iosched.c, not to be confused with kernel/sched/fair.c, where the process-related CFS (Completely Fair Scheduler) is implemented. The kernel structure cfq_group holds various settings per cgroup-device relationship. This includes the applied policies and weights, which are considered when scheduling I/O operations.
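
How weights translate into shares can be shown with an idealized calculation. The sketch assumes that all listed control groups keep the device busy at the same time. It does not reproduce the scheduler's implementation, and the function name io_share_percent is hypothetical.

```shell
# Idealized model, not kernel code: percentage of device time a cgroup
# receives when every listed group keeps the device busy.
# Usage: io_share_percent OWN_WEIGHT ALL_WEIGHTS...
# ALL_WEIGHTS must include the group's own weight.
io_share_percent() {
    own=$1
    shift
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    echo $(( own * 100 / total ))
}

io_share_percent 100 100 300 600   # prints 10
io_share_percent 600 100 300 600   # prints 60
```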

Next post in series

  • The next post in this series, “The Memory, CPU, Freezer and Device Controllers”, will be published soon.

Credits

The elaboration and software project associated with this subject are results of a Master’s thesis created at SCHUTZWERK in collaboration with Aalen University by Philipp Schmied.

References

Philipp Schmied