January 12, 2021
Linux Container Primitives: Network and Block I/O Control Groups
Part eight of the Linux Container series
![preview-image for Logo](https://www.schutzwerk.com/en/blog/linux-container-cgroups-02-network-block-io/container6.jpg)
In the previous post of the Linux Container Primitives series, the basics of control groups were covered. This post illustrates the purpose of two cgroup controllers: Network and Block I/O. The following list shows the topics of all scheduled blog posts. It will be updated with the corresponding links once new posts are released.
- An Introduction to Linux Containers
- Linux Capabilities
- An Introduction to Namespaces
- The Mount Namespace and a Description of a Related Information Leak in Docker
- The PID and Network Namespaces
- The User Namespace
- Namespaces Kernel View and Usage in Containerization
- An Introduction to Control Groups
- The Network and Block I/O Controllers
- The Memory, CPU, Freezer and Device Controllers
- Control Groups Kernel View and Usage in Containerization
The Network Controller (v1)
This section covers the resource controllers net_cls and net_prio. Both cause identifiers to be attached to sockets once they are created by a process that's being managed by one of the two controllers. The difference between the two controllers is that net_prio assigns an ID that's unique for each control group whereas net_cls uses a specified identifier that does not have to be unique for each cgroup, allowing flexible tagging of sockets in classes [1]. Adding these identifiers allows quick checks to determine whether a socket originates from the same control group or class. This is more efficient than searching the control group tree, for example with the cgroup function is_descendant(), especially if the check has to be performed regularly.
These additional socket attributes have multiple use-cases, among them:
- Setting the priority of network packets originating from a specific socket or device by using the network priority set by net_prio.
- Using iptables in combination with the net_cls class identifier to filter and route packets based on the control group membership.
- Scheduling network packets based on class identifiers.
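As a minimal sketch of the first use-case: in cgroup v1, each control group of the net_prio controller contains a net_prio.ifpriomap file that accepts lines of the form "interface priority". The helper name set_ifprio, the cgroup path and the interface name eth0 are assumptions chosen for illustration. Writing to the real file requires root privileges and a mounted net_prio hierarchy.

```shell
# Hypothetical helper: set the priority of traffic that processes of a
# cgroup send through a given network interface.
# $1: cgroup directory, $2: interface name, $3: priority
set_ifprio() {
    printf '%s %s\n' "$2" "$3" > "$1/net_prio.ifpriomap"
}

# Example call (assumed mount point and cgroup name, needs root):
# set_ifprio /sys/fs/cgroup/net_prio/mygroup eth0 5
```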
A simple usage example of net_cls that drops all IP based traffic for all processes not present in a specific control group can be seen below [2]:
root@box :~# mkdir /sys/fs/cgroup/net_cls # Create mountpoint
root@box :~# mount -t cgroup \
-o net_cls net_cls /sys/fs/cgroup/net_cls # Mount the controller
root@box :~# mkdir /sys/fs/cgroup/net_cls/IPAllowed # Create cgroup
root@box :~# echo 0x100001 > \
/sys/fs/cgroup/net_cls/IPAllowed/net_cls.classid # Set the fixed class identifier
root@box :~# tc qdisc add dev <interface> root handle 10: htb
root@box :~# tc class add dev <interface> parent 10: classid 10:1 \
htb rate 40mbit
root@box :~# tc filter add dev <interface> parent 10: protocol ip \
prio 10 handle 1: cgroup
root@box :~# iptables -A OUTPUT -m cgroup ! \
--cgroup 0x100001 -j DROP # Disallow IP for all non-members
root@box :~# echo $$ > \
/sys/fs/cgroup/net_cls/IPAllowed/cgroup.procs # Add process to cgroup
-- Filtering active --
root@box :~# tc qdisc del dev <interface> root; \
tc qdisc add dev <interface> root pfifo # Revert settings
The tc (Traffic Control) commands listed above set up a qdisc (queueing discipline) of type HTB (Hierarchical Token Bucket) on a network interface and attach a filter of type cgroup to it. This filter uses the fixed class identifier of a control group to classify the traffic originating from that control group. With iptables, the class identifier can then be used by the cgroup match to add rules for a control group, e.g. allowing network access.
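The class identifier 0x100001 written to net_cls.classid corresponds to the tc class 10:1 used in the listing: the value has the form 0xAAAABBBB, where AAAA is the major and BBBB the minor handle number, both hexadecimal. The following sketch derives the identifier from a tc handle. The helper name classid_from_handle is hypothetical.

```shell
# Hypothetical helper: convert a tc handle "major:minor" (hexadecimal, as
# used by tc) into the 0xAAAABBBB form expected by net_cls.classid.
classid_from_handle() {
    major=${1%%:*}   # part before the colon
    minor=${1##*:}   # part after the colon
    printf '0x%x%04x\n' "$((0x$major))" "$((0x$minor))"
}

classid_from_handle 10:1   # prints 0x100001
```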
This controller type is an example where child control groups are not automatically affected by the net_* controllers, meaning that this setting is not inherited throughout the hierarchy.
The Block IO Controller (v1/v2)
The blkio (v1) / io (v2) controller enables I/O resource usage policies. The most common use-cases are:
- Specifying upper bandwidth limits, for example in the blkio.throttle.read_bps_device file to specify the maximum read bandwidth for a device in bytes per second. Alternatively, the rbps parameter in conjunction with the io.max file is the equivalent for version 2.
- Denying access to a specific device.
- Limiting with proportional time-based division: This allows setting weights for various control groups that will be used to prioritize all device accesses when performing I/O operations. The blkio.weight file is present for this purpose in cgroup v1 whereas this is configured with io.weight in version 2.
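The bandwidth limit from the first item can be sketched as follows. In version 1 the file expects a line of the form "major:minor bytes_per_second", in version 2 a line of the form "major:minor rbps=bytes_per_second". The helper name limit_read_bps, the cgroup paths and the device number 8:0 are assumptions for illustration. Writing to the real files requires root privileges.

```shell
# Hypothetical helper: cap the read bandwidth of one block device for a cgroup.
# $1: cgroup version (1 or 2), $2: cgroup directory,
# $3: device as major:minor, $4: limit in bytes per second
limit_read_bps() {
    case "$1" in
        1) printf '%s %s\n' "$3" "$4" > "$2/blkio.throttle.read_bps_device" ;;
        2) printf '%s rbps=%s\n' "$3" "$4" > "$2/io.max" ;;
        *) echo "unknown cgroup version: $1" >&2; return 1 ;;
    esac
}

# Example calls (assumed paths and device, needs root): limit reads to 1 MiB/s
# limit_read_bps 1 /sys/fs/cgroup/blkio/mygroup 8:0 1048576
# limit_read_bps 2 /sys/fs/cgroup/mygroup 8:0 1048576
```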
Enforcing Limits in the Kernel
Enforcing bandwidth limits is implemented in blk_throtl_bio which resides in block/blk-throttle.c. This function makes use of throtl_charge_bio to ultimately charge for the data volume used in an I/O operation. Depending on the resource usage, an I/O operation can be executed directly or may have to be delayed using a queue to meet the resource limitations. For delayed operations, a dispatcher function will then cause pending operations to be executed using pre-calculated timers in order to throttle requested operations. With throtl_trim_slice the required time limiting is calculated, yielding the time slice the operation may be executed in.
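The decision between executing an operation directly and delaying it can be illustrated with a strongly simplified model. This is not the kernel's implementation: the function name, the fixed slice length of 100 ms and the rounding are assumptions made for illustration. Given a limit in bytes per second, the bytes already charged and the time elapsed, the sketch computes how long a new operation would have to wait.

```shell
# Simplified model (not kernel code): print the number of milliseconds an
# I/O operation has to be delayed so that the bandwidth limit is respected.
# $1: limit in bytes/s, $2: bytes already charged,
# $3: milliseconds elapsed, $4: size of the new operation in bytes
throttle_wait_ms() {
    limit=$1
    slice_ms=100   # assumed slice length for this model
    # Round the elapsed time up to a whole number of slices (at least one)
    slices=$(( ($3 + slice_ms - 1) / slice_ms ))
    if [ "$slices" -lt 1 ]; then slices=1; fi
    allowed=$(( limit * slices * slice_ms / 1000 ))
    total=$(( $2 + $4 ))
    if [ "$total" -le "$allowed" ]; then
        echo 0   # within budget: execute directly
    else
        # Time needed to earn the missing bytes at the configured rate, rounded up
        echo $(( ((total - allowed) * 1000 + limit - 1) / limit ))
    fi
}

throttle_wait_ms 1048576 0 0 4096         # prints 0
throttle_wait_ms 1048576 104857 10 4096   # prints 4
```

With a limit of 1 MiB/s, the first 4 KiB operation fits into the budget of the current slice. Once that budget is used up, the same operation has to wait 4 ms.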
To allow or deny access to a specific device, functions of security/device_cgroup.c are used. When cgroup configuration strings are passed to the files present in the virtual control group file system, devcgroup_update_access parses this information and configures the control group accordingly, e.g. by setting flags indicating whether accessing a device is allowed or denied for processes of a cgroup. This builds an exception list as seen in the listing below. Upon accessing a block device, __blkdev_get (fs/block_dev.c) is called, which performs access checks prior to allowing access. To perform these checks, __devcgroup_check_permission (security/device_cgroup.c) is called, resulting in the following checks being performed using the internal exception list:
// current is the current task_struct
dev_cgroup = task_devcgroup(current);
if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW)
    // perform checks based on the exception list
    rc = !match_exception_partial(&dev_cgroup->exceptions,
                                  type, major, minor, access);
else
    rc = match_exception(&dev_cgroup->exceptions, type, major,
                         minor, access);
if (!rc)
    return -EPERM; // deny access
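The two branches of this check can be modelled outside the kernel. The following shell sketch is a simplified illustration, not kernel code: the function name and the textual exception format ("type major:minor access", one entry per line) are assumptions. With a default-allow policy, any exception that overlaps the request even partially denies it. With a default-deny policy, a single exception has to cover the complete request.

```shell
# Simplified model of the device cgroup check (not kernel code).
# $1: default behavior (allow or deny), $2: exception list,
# $3: device type (b or c), $4: device as major:minor, $5: access (subset of rwm)
# Returns 0 if access is allowed, 1 if it is denied.
check_device_access() {
    partial=no
    full=no
    while read -r ex_type ex_dev ex_access; do
        # Entry must apply to the device type ("a" stands for all types)
        [ "$ex_type" = "$3" ] || [ "$ex_type" = a ] || continue
        # Entry must apply to the device; "*" acts as a wildcard
        case "$4" in $ex_dev) ;; *) continue ;; esac
        any=no
        all=yes
        rest=$5
        while [ -n "$rest" ]; do
            c=${rest%"${rest#?}"}   # first character of the requested access
            rest=${rest#?}
            case "$ex_access" in *"$c"*) any=yes ;; *) all=no ;; esac
        done
        if [ "$any" = yes ]; then partial=yes; fi
        if [ "$all" = yes ]; then full=yes; fi
    done <<EOF
$2
EOF
    if [ "$1" = allow ]; then
        [ "$partial" = no ]   # default allow: a partial match denies
    else
        [ "$full" = yes ]     # default deny: a complete match is required
    fi
}

# Default allow with writes to device 8:0 denied: reading is still permitted
check_device_access allow 'b 8:0 w' b 8:0 r && echo allowed   # prints allowed
```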
For a long time, the default scheduler for I/O operations in the Linux kernel was the CFQ (Completely Fair Queuing) scheduler; it was removed together with the legacy block layer in Linux 5.0. It was extended to support the I/O related cgroup controllers after control groups were introduced in the kernel. This makes it possible to account and constrain processes regarding their consumed I/O resources, for example using pre-defined weights. The CFQ I/O scheduler is implemented in block/cfq-iosched.c, not to be confused with kernel/sched/fair.c where the process-related CFS (Completely Fair Scheduler) is implemented. The kernel structure cfq_group maps various settings per cgroup-device relationship. This includes applied policies and weights which are considered in order to schedule I/O operations.
Next post in series
- Continue reading the next article in this series: The Memory, CPU, Freezer and Device Controllers
Credits
Credits: The elaboration and software project associated with this subject are results of a Master's thesis created at SCHUTZWERK in collaboration with Aalen University by Philipp Schmied.