Part eight of the Linux Container series
Linux Container Primitives: Network and Block I/O Control Groups
In the previous post of the Linux Container Primitives series, the basics of control groups were covered. This post illustrates the purpose of two cgroup controllers: Network and Block I/O. The following list shows the topics of all scheduled blog posts. It will be updated with the corresponding links as new posts are released.
- An Introduction to Linux Containers
- Linux Capabilities
- An Introduction to Namespaces
- The Mount Namespace and a Description of a Related Information Leak in Docker
- The PID and Network Namespaces
- The User Namespace
- Namespaces Kernel View and Usage in Containerization
- An Introduction to Control Groups
- The Network and Block I/O Controllers
- The Memory, CPU, Freezer and Device Controllers
- Control Groups Kernel View and Usage in Containerization
The Network Controller (v1)
This section covers the resource controllers net_cls and net_prio. Both cause identifiers to be attached to sockets when they are created by a process that is managed by one of the two controllers. The difference between them is that net_prio assigns an ID that is unique for each control group, whereas net_cls uses a specified identifier that does not have to be unique for each cgroup, allowing flexible tagging of sockets in classes. These identifiers allow quick checks to determine whether a socket originates from the same control group or class. This is more efficient than searching the control group tree, for example with the cgroup function is_descendant(), especially if the check has to be performed regularly.
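To make the efficiency argument concrete, the following user-space C sketch contrasts the two kinds of checks. It is not kernel code: struct cg_node, walk_is_descendant() and same_class() are hypothetical names chosen for illustration, and the hierarchy is modelled as a plain parent-pointer tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a control group node. */
struct cg_node {
    const struct cg_node *parent; /* NULL for the root of the hierarchy */
    uint32_t classid;             /* identifier that is copied to sockets */
};

/* Tree-based check: the cost grows with the depth of the hierarchy. */
bool walk_is_descendant(const struct cg_node *cg,
                        const struct cg_node *ancestor)
{
    assert(ancestor != NULL);
    for (; cg != NULL; cg = cg->parent)
        if (cg == ancestor)
            return true;
    return false;
}

/* Tag-based check: a single integer comparison per socket. */
bool same_class(uint32_t sock_classid, uint32_t wanted_classid)
{
    return sock_classid == wanted_classid;
}
```

The tag comparison takes constant time regardless of how deep the hierarchy is, which is why it suits checks that run for every packet.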
There are multiple use cases for these additional socket attributes, among them:
- Setting the priority of network packets originating from a specific socket or device by using the network priority set by net_prio.
- Using iptables in combination with the net_cls class identifier to filter and route packets based on control group membership.
- Scheduling network packets based on class identifiers.
A simple usage example of net_cls that drops all IP-based traffic for all processes not present in a specific control group can be seen below:
root@box:~# mkdir /sys/fs/cgroup/net_cls                     # Create mountpoint
root@box:~# mount -t cgroup \
    -o net_cls net_cls /sys/fs/cgroup/net_cls                # Mount the controller
root@box:~# mkdir /sys/fs/cgroup/net_cls/IPAllowed           # Create cgroup
root@box:~# echo 0x100001 > \
    /sys/fs/cgroup/net_cls/IPAllowed/net_cls.classid         # Set the fixed class identifier
root@box:~# tc qdisc add dev <interface> root handle 10: htb
root@box:~# tc class add dev <interface> parent 10: classid 10:1 \
    htb rate 40mbit
root@box:~# tc filter add dev <interface> parent 10: protocol ip \
    prio 10 handle 1: cgroup
root@box:~# iptables -A OUTPUT -m cgroup ! \
    --cgroup 0x100001 -j DROP                                # Disallow IP for all non-members
root@box:~# echo $$ > \
    /sys/fs/cgroup/net_cls/IPAllowed/cgroup.procs            # Add process to cgroup
-- Filtering active --
root@box:~# tc qdisc del dev <interface> root; \
    tc qdisc add dev <interface> root pfifo                  # Revert settings
The tc (Traffic Control) commands listed above set up a qdisc (queueing discipline) on a network interface and attach a filter of type cgroup to it, which classifies the traffic originating from a control group based on its fixed class identifier. The queueing itself is handled by an HTB (Hierarchical Token Bucket) qdisc. With iptables it is then possible to use the cgroup match to add rules for a control group, e.g. allowing network access.
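The class identifier 0x100001 and the tc class 10:1 used above refer to the same class: net_cls.classid has the layout 0xAAAABBBB, where AAAA is the major and BBBB the minor handle number, both hexadecimal. The following C sketch shows the conversion. The function names are hypothetical helpers for illustration and not part of any tc or kernel API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* net_cls.classid layout: 0xAAAABBBB, AAAA = major, BBBB = minor. */
uint32_t classid_encode(uint16_t major, uint16_t minor)
{
    return ((uint32_t)major << 16) | minor;
}

/* Render the matching tc handle ("major:minor", both hexadecimal). */
int classid_to_tc_handle(uint32_t classid, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%x:%x",
                    (unsigned)(classid >> 16),
                    (unsigned)(classid & 0xffffu));
}
```

Calling classid_encode(0x10, 0x1) yields the value written to net_cls.classid in the example, and classid_to_tc_handle() turns it back into the handle notation that tc expects.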
This controller type is an example where child control groups are not automatically affected by the net_* controllers, meaning that this setting is not inherited throughout the hierarchy.
The Block IO Controller (v1/v2)
The blkio (v1) or io (v2) controller is used to enforce I/O resource usage policies. The most common use cases are:
- Specifying upper bandwidth limits, for example via the blkio.throttle.read_bps_device file, which sets the maximum read bandwidth for a device in bytes per second. In version 2, the rbps parameter of the io.max file is the equivalent.
- Denying access to a specific device.
- Limiting with proportional, time-based division: this allows setting weights for control groups that are used to prioritize device accesses when performing I/O operations. The blkio.weight file serves this purpose in cgroup v1, whereas io.weight is used in version 2.
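The bandwidth limit files expect one line per device, keyed by the device's major and minor number. As a minimal sketch, the following C helpers build the lines for both cgroup versions. The function names are hypothetical, and the device 8:0 mentioned below is only an assumption for the example (it is commonly the first SCSI/SATA disk).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* cgroup v1: line for blkio.throttle.read_bps_device,
 * format "MAJOR:MINOR BYTES_PER_SECOND". */
int format_v1_read_bps(char *buf, size_t len, unsigned dev_major,
                       unsigned dev_minor, unsigned long long bytes_per_sec)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%u:%u %llu",
                    dev_major, dev_minor, bytes_per_sec);
}

/* cgroup v2: line for io.max,
 * format "MAJOR:MINOR rbps=BYTES_PER_SECOND". */
int format_v2_io_max_rbps(char *buf, size_t len, unsigned dev_major,
                          unsigned dev_minor, unsigned long long bytes_per_sec)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%u:%u rbps=%llu",
                    dev_major, dev_minor, bytes_per_sec);
}
```

For a read limit of 1 MiB/s on device 8:0, the first helper produces "8:0 1048576" and the second "8:0 rbps=1048576". Writing the resulting string into the respective file of a cgroup applies the limit.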
Enforcing Limits in the Kernel
Enforcing bandwidth limits is implemented in blk_throtl_bio, which resides in block/blk-throttle.c. This function uses throtl_charge_bio to ultimately charge for the data volume used in an I/O operation. Depending on the resource usage, an I/O operation can be executed directly or may have to be delayed in a queue to meet the resource limits. For delayed operations, a dispatcher function then executes pending operations using pre-calculated timers in order to throttle the requested operations. throtl_trim_slice calculates the required time limit, yielding the time slice in which the operation may be executed.
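The decision between dispatching directly and delaying can be illustrated with a strongly simplified model. The following C sketch is not the kernel algorithm from block/blk-throttle.c; throttle_wait_ms() and its parameters are hypothetical and only capture the idea of a byte budget that grows with elapsed time.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model (assumption, not kernel code): how long must an I/O
 * operation of io_bytes wait so that the byte budget earned since the
 * start of the current time slice is respected?
 */
uint64_t throttle_wait_ms(uint64_t limit_bps, uint64_t elapsed_ms,
                          uint64_t bytes_done, uint64_t io_bytes)
{
    assert(limit_bps > 0);

    /* Budget earned so far in this slice. */
    uint64_t allowed = limit_bps * elapsed_ms / 1000;
    uint64_t wanted = bytes_done + io_bytes;

    if (wanted <= allowed)
        return 0; /* within budget: dispatch directly */

    /* Time needed to earn the missing bytes, rounded up. */
    uint64_t excess = wanted - allowed;
    return (excess * 1000 + limit_bps - 1) / limit_bps;
}
```

A return value of zero corresponds to the direct execution path, while any other value stands for the delay after which a dispatcher would release the queued operation.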
To allow or deny access to a specific device, functions in security/device_cgroup.c are used. When cgroup configuration strings are written to the files in the virtual control group file system, devcgroup_update_access parses this information and configures the control group accordingly, e.g. by setting flags indicating whether accessing a device is allowed or denied for processes of a cgroup. This builds an exception list as seen in the listing below. Upon accessing a block device, a function in fs/block_dev.c is called which performs access checks prior to allowing access. To perform these checks, a function in security/device_cgroup.c is called, resulting in the following checks being performed using the internal exception list:
// current is the current task_struct
dev_cgroup = task_devcgroup(current);
if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW)
    // perform checks based on the exception list
    rc = !match_exception_partial(&dev_cgroup->exceptions,
                                  type, major, minor, access);
else
    rc = match_exception(&dev_cgroup->exceptions,
                         type, major, minor, access);
if (!rc)
    return -EPERM; // deny access
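The meaning of the exception list flips with the default behavior: with a default of "allow", the exceptions describe what is denied, and with a default of "deny" they describe what is granted. The following user-space C sketch models this logic under simplifying assumptions. All names in it (struct dev_exception, device_access_allowed(), the ACC_* flags, ANY_ID) are hypothetical and only loosely follow the kernel listing above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { ACC_READ = 1, ACC_WRITE = 2, ACC_MKNOD = 4 };
enum behavior { DEFAULT_ALLOW, DEFAULT_DENY };
#define ANY_ID UINT32_MAX /* wildcard for a major or minor number */

/* Hypothetical, simplified exception entry. */
struct dev_exception {
    char type;             /* 'b' = block device, 'c' = character device */
    uint32_t major, minor;
    unsigned access;       /* bitmask of ACC_* flags */
};

static bool ids_match(const struct dev_exception *ex, char type,
                      uint32_t major, uint32_t minor)
{
    return ex->type == type &&
           (ex->major == ANY_ID || ex->major == major) &&
           (ex->minor == ANY_ID || ex->minor == minor);
}

/* Returns true if the requested access is permitted in this model. */
bool device_access_allowed(enum behavior behavior,
                           const struct dev_exception *list, size_t n,
                           char type, uint32_t major, uint32_t minor,
                           unsigned access)
{
    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!ids_match(&list[i], type, major, minor))
            continue;
        if (behavior == DEFAULT_ALLOW) {
            /* Exceptions are denials: any overlapping bit blocks. */
            if (access & list[i].access)
                return false;
        } else {
            /* Exceptions are grants: every requested bit must be covered. */
            if ((access & ~list[i].access) == 0)
                return true;
        }
    }
    /* No exception applied: fall back to the default behavior. */
    return behavior == DEFAULT_ALLOW;
}
```

With a single read-only entry for block device 8:0, a default-deny group may read that device but not write to it, whereas a default-allow group may do everything except read it.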
The default scheduler for I/O operations in the Linux kernel is the CFQ (Completely Fair Queuing) scheduler. It was extended to support the I/O related cgroup controllers after control groups were introduced in the kernel. This makes it possible to account for and constrain the I/O resources consumed by processes, for example using pre-defined weights. The CFQ I/O scheduler is implemented in block/cfq-iosched.c, not to be confused with kernel/sched/fair.c, where the process-related CFS (Completely Fair Scheduler) is implemented. The kernel structure cfq_group maps various settings per cgroup-device relationship. This includes applied policies and weights, which are considered when scheduling I/O operations.
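The effect of weights can be approximated with an idealized model: when all groups keep a device busy, each group's share of device time is its weight divided by the sum of all weights. The C sketch below computes this share. It is an assumption-laden illustration with a hypothetical function name and does not reproduce the actual CFQ implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Idealized model (assumption): share of device time, in permille, that
 * the group at index idx receives while all groups are issuing I/O.
 */
unsigned weight_share_permille(const unsigned *weights, size_t n, size_t idx)
{
    assert(weights != NULL && idx < n);

    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += weights[i];
    assert(total > 0);

    return (unsigned)((uint64_t)weights[idx] * 1000 / total);
}
```

Two groups with weights 100 and 300 would receive roughly a quarter and three quarters of the device time in this model.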
Next post in series
- Continue reading the next article in this series “The Memory, CPU, Freezer and Device Controllers”
Credits: The elaboration and software project associated to this subject are results of a Master’s thesis created at SCHUTZWERK in collaboration with Aalen University by Philipp Schmied.