January 12, 2021
Linux Container Primitives: Network and Block I/O Control Groups
Part eight of the Linux Container series
In the previous post of the Linux Container Primitives series, the basics of control groups were covered. This post illustrates the purpose of two cgroup controllers: Network and Block I/O. The following list shows the topics of all scheduled blog posts. It will be updated with the corresponding links as new posts are released.
- An Introduction to Linux Containers
- Linux Capabilities
- An Introduction to Namespaces
- The Mount Namespace and a Description of a Related Information Leak in Docker
- The PID and Network Namespaces
- The User Namespace
- Namespaces Kernel View and Usage in Containerization
- An Introduction to Control Groups
- The Network and Block I/O Controllers
- The Memory, CPU, Freezer and Device Controllers
- Control Groups Kernel View and Usage in Containerization
The Network Controller (v1)
This section covers the resource controllers net_cls and net_prio. Both cause identifiers to be attached to sockets once they are created by a process that is managed by one of the two controllers. The difference between the two is that net_prio assigns an ID that is unique for each control group, whereas net_cls uses a specified identifier that does not have to be unique for each cgroup, allowing flexible tagging of sockets in classes. These identifiers allow quick checks to determine whether a socket originates from the same control group or class. This is more efficient than searching the control group tree, for example with the cgroup function is_descendant(), especially if the check has to be performed regularly.
There are multiple use cases for these additional socket attributes, including:
- Setting the priority of network packets originating from a specific socket or device by using the network priority set by net_prio.
- Using iptables in combination with the net_cls class identifier to filter and route packets based on control group membership.
- Scheduling network packets based on class identifiers.
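The first use case can be sketched in a few lines of shell. The helper name set_ifprio, the cgroup name LowPrio and the interface eth0 are assumptions made for illustration; the only kernel interface relied on is the net_prio.ifpriomap file, which takes lines of the form "interface priority". The function takes the cgroup directory as a parameter so that it can be tried against a scratch directory without root privileges.

```shell
# Sketch: assign a network priority to a cgroup's traffic on one interface.
# $1: cgroup directory, $2: interface name, $3: priority value
set_ifprio() {
    printf '%s %s\n' "$2" "$3" > "$1/net_prio.ifpriomap"
}

# On a real system this would be run as root against a mounted net_prio
# hierarchy, for example (path and names are assumptions):
#   set_ifprio /sys/fs/cgroup/net_prio/LowPrio eth0 1

# Dry run against a scratch directory so the written line can be inspected.
scratch=$(mktemp -d)
set_ifprio "$scratch" eth0 1
cat "$scratch/net_prio.ifpriomap"    # prints: eth0 1
rm -r "$scratch"
```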
A simple usage example of net_cls that drops all IP-based traffic for all processes not present in a specific control group is shown below:
    root@box:~# mkdir /sys/fs/cgroup/net_cls                              # Create mountpoint
    root@box:~# mount -t cgroup -o net_cls net_cls /sys/fs/cgroup/net_cls # Mount the controller
    root@box:~# mkdir /sys/fs/cgroup/net_cls/IPAllowed                    # Create cgroup
    # Set the fixed class identifier
    root@box:~# echo 0x100001 > /sys/fs/cgroup/net_cls/IPAllowed/net_cls.classid
    root@box:~# tc qdisc add dev <interface> root handle 10: htb
    root@box:~# tc class add dev <interface> parent 10: classid 10:1 htb rate 40mbit
    root@box:~# tc filter add dev <interface> parent 10: protocol ip prio 10 handle 1: cgroup
    # Disallow IP for all non-members
    root@box:~# iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP
    # Add process to cgroup
    root@box:~# echo $$ > /sys/fs/cgroup/net_cls/IPAllowed/cgroup.procs
    -- Filtering active --
    # Revert settings
    root@box:~# tc qdisc del dev <interface> root; tc qdisc add dev <interface> root pfifo
The tc (Traffic Control) commands listed above set up an HTB (Hierarchical Token Bucket) qdisc (queueing discipline) on a network interface. A tc filter of type cgroup then uses the fixed control group class identifier to classify the traffic originating from a control group. With iptables, it is then possible to use the cgroup match to add rules for a control group, e.g. allowing network access.
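The class identifier 0x100001 and the tc class 10:1 used above are the same value in two notations: net_cls.classid has the form 0xAAAABBBB, where AAAA is the major and BBBB the minor handle number, both hexadecimal. A minimal sketch of the conversion follows; the function names are hypothetical helpers and not part of tc or the kernel.

```shell
# Build a net_cls.classid value from a tc class handle "major:minor".
# Both parts are hexadecimal, as in tc.
classid_from_handle() {
    major=${1%%:*}
    minor=${1##*:}
    printf '0x%x\n' $(( (0x$major << 16) | 0x$minor ))
}

# Convert a classid back into the tc "major:minor" notation.
handle_from_classid() {
    printf '%x:%x\n' $(( $1 >> 16 )) $(( $1 & 0xffff ))
}

classid_from_handle 10:1        # prints: 0x100001
handle_from_classid 0x100001    # prints: 10:1
```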
The net_* controllers are an example where child control groups are not automatically affected, meaning that this setting is not inherited throughout the hierarchy.
The Block IO Controller (v1/v2)
The blkio (io in cgroup v2) controller is used to enable I/O resource usage policies. The most common use cases are:
- Specifying upper bandwidth limits, for example in the blkio.throttle.read_bps_device file, which sets the maximum read bandwidth for a device in bytes per second. The rbps parameter of the io.max file is the equivalent in version 2.
- Denying access to a specific device.
- Limiting with proportional, time-based division: this allows setting weights for various control groups that are used to prioritize device accesses when performing I/O operations. The blkio.weight file serves this purpose in cgroup v1, whereas io.weight is used in version 2.
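To make the bandwidth limit concrete, the sketch below only builds the lines that would be written to the two files. Both expect the device as MAJ:MIN followed by the limit in bytes per second; version 2 additionally names the limit (rbps). The helper names, the device number 8:0 and the cgroup paths in the comments are assumptions for illustration.

```shell
# $1: device as MAJ:MIN, $2: read limit in bytes per second
v1_read_limit() { printf '%s %s\n' "$1" "$2"; }        # for blkio.throttle.read_bps_device
v2_read_limit() { printf '%s rbps=%s\n' "$1" "$2"; }   # for io.max

limit=$(( 1024 * 1024 ))      # 1 MiB/s

v1_read_limit 8:0 "$limit"    # prints: 8:0 1048576
v2_read_limit 8:0 "$limit"    # prints: 8:0 rbps=1048576

# Applying a limit would then look like this (as root, paths are assumptions):
#   v1_read_limit 8:0 "$limit" > /sys/fs/cgroup/blkio/Limited/blkio.throttle.read_bps_device
#   v2_read_limit 8:0 "$limit" > /sys/fs/cgroup/Limited/io.max
```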
Enforcing Limits in the Kernel
Bandwidth limits are enforced in blk_throtl_bio, which resides in block/blk-throttle.c. This function uses throtl_charge_bio to ultimately charge for the data volume used in an I/O operation. Depending on the resource usage, an I/O operation can be executed directly or may have to be delayed in a queue to meet the resource limits. For delayed operations, a dispatcher function then causes pending operations to be executed using pre-calculated timers in order to throttle the requested operations. With throtl_trim_slice, the required time limit is calculated, yielding the time slice the operation may be executed in.
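The decision between dispatching directly and delaying can be illustrated with a strongly simplified model. This is not the kernel's algorithm, only a sketch under the assumption that a budget of limit × elapsed time is available within the current time slice; the function name and its parameters are hypothetical.

```shell
# $1: limit in bytes per second
# $2: milliseconds elapsed in the current time slice
# $3: bytes already charged in this slice
# $4: size of the new I/O operation in bytes
# Prints the number of milliseconds the operation has to wait (0 = dispatch now).
throttle_wait_ms() {
    allowed=$(( $1 * $2 / 1000 ))      # budget accumulated so far
    needed=$(( $3 + $4 ))              # bytes charged including the new operation
    if [ "$needed" -le "$allowed" ]; then
        echo 0
    else
        # time until the budget covers the missing bytes, rounded up
        echo $(( ((needed - allowed) * 1000 + $1 - 1) / $1 ))
    fi
}

throttle_wait_ms 1000000 100 0 50000     # prints: 0
throttle_wait_ms 1000000 100 0 250000    # prints: 150
```

With a limit of 1 MB/s and 100 ms elapsed, 100 000 bytes may be transferred immediately; a 250 000 byte operation exceeds that budget by 150 000 bytes and therefore has to wait 150 ms.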
To allow or deny access to a specific device, functions of security/device_cgroup.c are used. When cgroup configuration strings are passed to the files present in the virtual control group file system, devcgroup_update_access parses this information and configures the control group accordingly, e.g. by setting flags indicating whether accessing a device is allowed or denied for processes of a cgroup. This builds an exception list as seen in the listing below. Upon accessing a block device, a function in fs/block_dev.c is called which performs access checks prior to allowing access. To perform these checks, a function in security/device_cgroup.c is called, resulting in the following checks being performed using the internal exception list:
    // current is the current task_struct
    dev_cgroup = task_devcgroup(current);
    if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW)
        // perform checks based on the exception list
        rc = !match_exception_partial(&dev_cgroup->exceptions,
                                      type, major, minor, access);
    else
        rc = match_exception(&dev_cgroup->exceptions,
                             type, major, minor, access);
    if (!rc)
        return -EPERM; // deny access
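The effect of the two branches can be reproduced with a small user-space model. It is a sketch only: the function name is hypothetical, devices are compared as plain strings, and access bits as well as partial matches are ignored. With default-allow behavior, an exception denies access; with default-deny behavior, only an exception grants it.

```shell
# $1: default behavior of the cgroup ("allow" or "deny")
# $2: requested device, e.g. "b 8:0"
# remaining arguments: entries of the exception list
# Returns success (0) if access would be granted in this simplified model.
device_allowed() {
    behavior=$1
    wanted=$2
    shift 2
    matched=no
    for entry in "$@"; do
        if [ "$entry" = "$wanted" ]; then
            matched=yes
        fi
    done
    if [ "$behavior" = allow ]; then
        [ "$matched" = no ]     # default allow: exceptions are denials
    else
        [ "$matched" = yes ]    # default deny: exceptions are permissions
    fi
}

if device_allowed allow "b 8:0" "b 8:0"; then echo allowed; else echo denied; fi  # prints: denied
if device_allowed deny "b 8:0" "b 8:0"; then echo allowed; else echo denied; fi   # prints: allowed
```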
The default scheduler for I/O operations in the Linux kernel is the CFQ (Completely Fair Queueing) scheduler. It was extended to support the I/O-related cgroup controllers after control groups had been introduced in the kernel. This makes it possible to account for and constrain the I/O resources consumed by processes, for example using pre-defined weights. The CFQ I/O scheduler is implemented in block/cfq-iosched.c, not to be confused with kernel/sched/fair.c, where the process-related CFS (Completely Fair Scheduler) is implemented. The kernel structure cfq_group maps various settings per cgroup-device relationship. This includes applied policies and weights, which are considered in order to schedule I/O operations.
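How weights translate into shares can be shown with simple arithmetic: under contention, each control group receives roughly its weight divided by the sum of all competing weights. The sketch below assumes exactly this proportional model and uses arbitrary example weights; the function name is hypothetical and the result is an idealization, not a measurement of the scheduler.

```shell
# Print the share of device time in percent (rounded down) for each weight given.
weight_shares() {
    total=0
    for w in "$@"; do
        total=$(( total + w ))
    done
    for w in "$@"; do
        echo $(( w * 100 / total ))
    done
}

weight_shares 500 250 250    # prints 50, 25 and 25 on separate lines
```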
Credits: The elaboration and software project associated to this subject are results of a Master’s thesis created at SCHUTZWERK in collaboration with Aalen University by Philipp Schmied.