     Central, scheduler-driven, power-performance control
                        (EXPERIMENTAL)

Abstract
========

The topic of a single simple power-performance tunable that is wholly
scheduler centric and has well-defined and predictable properties has come up
on several occasions in the past [1,2]. With techniques such as
scheduler-driven DVFS [3], we now have a good framework for implementing such a
tunable. This document describes the overall ideas behind its design and
implementation.


Table of Contents
=================

1. Motivation
2. Introduction
3. Signal Boosting Strategy
4. OPP selection using boosted CPU utilization
5. Per task group boosting
6. Questions and Answers
   - What about "auto" mode?
   - How are multiple groups of tasks with different boost values managed?
7. References


1. Motivation
=============

Sched-DVFS [3] is a new event-driven cpufreq governor which allows the
scheduler to select the optimal DVFS operating point (OPP) for running a task
allocated to a CPU. The introduction of sched-DVFS enables running workloads at
the most energy efficient OPPs.

However, sometimes it may be desirable to intentionally boost the performance
of a workload even if that implies a reasonable increase in energy
consumption. For example, in order to reduce the response time of a task, we
may want to run the task at a higher OPP than the one that is actually required
by its CPU bandwidth demand.

This last requirement is especially important if we consider that one of the
main goals of the sched-DVFS component is to replace all currently available
CPUFreq policies. Since sched-DVFS is event-based, as opposed to the
sampling-driven governors we currently have, it is already more responsive at
selecting the optimal OPP to run tasks allocated to a CPU. However, just
tracking the actual task load demand may not be enough from a performance
standpoint. For example, load tracking alone cannot provide behaviors similar
to those of the "performance" and "interactive" CPUFreq governors.

This document describes an implementation of a tunable, stacked on top of
sched-DVFS, which extends its functionality to support task performance
boosting.

By "performance boosting" we mean the reduction of the time required to
complete a task activation, i.e. the time elapsed from a task wakeup to its
next deactivation (e.g. because it goes back to sleep or it terminates). For
example, if we consider a simple periodic task which executes the same workload
for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
that task must complete each of its activations in less than 5[s].

A previous attempt [5] to introduce such a boosting feature has not been
successful mainly because of the complexity of the proposed solution. The
approach described in this document exposes a single simple interface to
user-space. This single tunable knob allows the tuning of system-wide
scheduler behaviours ranging from energy efficiency at one end through to
incremental performance boosting at the other end. This first tunable affects
all tasks. However, a more advanced extension of the concept is also provided
which uses CGroups to boost the performance of only selected tasks while using
the energy efficient default for all others.

The rest of this document introduces in more detail the proposed solution,
which has been named SchedTune.


2. Introduction
===============

SchedTune exposes a simple user-space interface with a single power-performance
tunable:

  /proc/sys/kernel/sched_cfs_boost

This permits expressing a boost value as an integer in the range [0..100].

A value of 0 (default) configures the CFS scheduler for maximum energy
efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
required to satisfy their workload demand.
A value of 100 configures the scheduler for maximum performance, which
translates to the selection of the maximum OPP on that CPU.

Values between 0 and 100 can be chosen to suit other scenarios, for example to
improve interactive response or to react to other system events (battery level
etc).

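As an illustration, a user-space agent could validate and write the boost value
along these lines. This is only a minimal Python sketch under stated
assumptions: the helper name set_cfs_boost() and its overridable path argument
are hypothetical and exist only for this example; the tunable path and the
[0..100] range are the ones given above.

```python
from pathlib import Path

# Path of the tunable as given in this document.
SCHED_CFS_BOOST = Path("/proc/sys/kernel/sched_cfs_boost")


def set_cfs_boost(boost, path=SCHED_CFS_BOOST):
    """Write a system-wide boost value (hypothetical helper, needs root).

    `path` is overridable only so that the sketch can be exercised on a
    system without a SchedTune-enabled kernel.
    """
    if isinstance(boost, bool) or not isinstance(boost, int):
        raise TypeError("boost must be an integer")
    if not 0 <= boost <= 100:
        raise ValueError("boost must be in the range [0..100]")
    Path(path).write_text(f"{boost}\n")
    return boost
```

For instance, set_cfs_boost(0) would restore the energy-efficient default and
set_cfs_boost(100) would request maximum performance.
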
A CGroup based extension is also provided, which permits further user-space
defined task classification to tune the scheduler for different goals depending
on the specific nature of the task, e.g. background vs interactive vs
low-priority.

The overall design of the SchedTune module is built on top of "Per-Entity Load
Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating
Performance Point (OPP) selection.
Each time a task is allocated on a CPU, sched-DVFS has the opportunity to tune
the operating frequency of that CPU to better match the workload demand. The
selection of the actual OPP being activated is influenced by the global boost
value, or the boost value for the task CGroup when in use.

This simple biasing approach leverages existing frameworks, which means minimal
modifications to the scheduler, and yet it makes it possible to achieve a range
of different behaviours, all from a single simple tunable knob.
The only new concept introduced is that of signal boosting.


3. Signal Boosting Strategy
===========================

The whole PELT machinery works based on the value of a few load tracking signals
which basically track the CPU bandwidth requirements for tasks and the capacity
of CPUs. The basic idea behind the SchedTune knob is to artificially inflate
some of these load tracking signals to make a task or RQ appear more demanding
than it actually is.

Which signals have to be inflated depends on the specific "consumer". However,
independently of the specific (signal, consumer) pair, it is important to
define a simple and possibly consistent strategy for the concept of boosting a
signal.

A boosting strategy defines how the "abstract" user-space defined
sched_cfs_boost value is translated into an internal "margin" value to be added
to a signal to get its inflated value:

        margin         := boosting_strategy(sched_cfs_boost, signal)
        boosted_signal := signal + margin

Different boosting strategies were identified and analyzed before selecting the
one found to be most effective.

Signal Proportional Compensation (SPC)
--------------------------------------

In this boosting strategy the sched_cfs_boost value is used to compute a
margin which is proportional to the complement of the original signal.
When a signal has a maximum possible value, its complement is defined as
the delta between the actual value and its possible maximum.

Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
the maximum possible value, the margin becomes:

        margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal)

where sched_cfs_boost is read as a percentage, i.e. as a factor between 0 and 1.

Using this boosting strategy:
- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
- each intermediate sched_cfs_boost value inflates the signal in question by
  the corresponding fraction of the headroom left between its actual value and
  the maximum value.

For example, by applying the SPC boosting strategy to the selection of the OPP
to run a task it is possible to achieve these behaviors:

-   0% boosting: run the task at the minimum OPP required by its workload
- 100% boosting: run the task at the maximum OPP available for the CPU
-  50% boosting: run at the half-way OPP between minimum and maximum

This means that, at 50% boosting, a task is scheduled to run midway between the
performance strictly required by its workload and the maximum theoretically
achievable performance on the specific target platform.

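The SPC computation can be sketched in a few lines of Python. This is an
illustrative sketch, not the kernel code: the function names spc_margin() and
boost_signal() are hypothetical, and SCHED_LOAD_SCALE is assumed to be 1024
purely for the sake of the example.

```python
SCHED_LOAD_SCALE = 1024  # assumed maximum signal value, for illustration only


def spc_margin(boost_pct, signal):
    """Margin proportional to the complement of `signal` (SPC)."""
    if not 0 <= boost_pct <= 100:
        raise ValueError("boost must be in the range [0..100]")
    if not 0 <= signal <= SCHED_LOAD_SCALE:
        raise ValueError("signal must be in the range [0..SCHED_LOAD_SCALE]")
    # The boost is a percentage, hence the integer division by 100.
    return boost_pct * (SCHED_LOAD_SCALE - signal) // 100


def boost_signal(boost_pct, signal):
    """boosted_signal := signal + margin"""
    return signal + spc_margin(boost_pct, signal)
```

With these assumptions, a signal of 512 stays at 512 with a 0% boost, becomes
768 with a 50% boost (midway to the maximum) and saturates at 1024 with a 100%
boost.
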
A graphical representation of an SPC boosted signal is shown in the following
figure, where:
 a) "-" represents the original signal
 b) "b" represents a  50% boosted signal
 c) "p" represents a 100% boosted signal


   ^
   |  SCHED_LOAD_SCALE
   +-----------------------------------------------------------------+
   |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
   |
   |                                             boosted_signal
   |                                          bbbbbbbbbbbbbbbbbbbbbbbb
   |
   |                                            original signal
   |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
   |                                          |
   |bbbbbbbbbbbbbbbbbb                        |
   |                                          |
   |                                          |
   |                                          |
   |                  +-----------------------+
   |                  |
   |                  |
   |                  |
   |------------------+
   |
   |
   +----------------------------------------------------------------------->

The plot above shows a ramped load signal (labelled 'original signal') and its
boosted equivalent. For each step of the original signal, the boosted signal
corresponding to a 50% boost is midway between the original signal and the
upper bound. Boosting by 100% generates a boosted signal which is always
saturated to the upper bound.


4. OPP selection using boosted CPU utilization
==============================================

It is worth calling out that the implementation does not introduce any new load
signals. Instead, it provides an API to tune existing signals. This tuning is
done on demand and only in scheduler code paths where it is sensible to do so.
The new API calls are defined to return either the default signal or a boosted
one, depending on the value of sched_cfs_boost. This is a clean and
non-invasive modification of the existing code paths.

The signal representing a CPU's utilization is boosted according to the
previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
(i.e. a CFS run-queue) to appear more used than it actually is.

Thus, with sched_cfs_boost enabled we have the following main functions to
get the current utilization of a CPU:

  cpu_util()
  boosted_cpu_util()

The new boosted_cpu_util() is similar to the first but returns a boosted
utilization signal which is a function of the sched_cfs_boost value.

This function is used in the CFS scheduler code paths where sched-DVFS needs to
decide the OPP to run a CPU at.
For example, this allows selecting the highest OPP for a CPU which has
the boost value set to 100%.


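The effect of a boosted utilization on OPP selection can be illustrated with
the following Python sketch. It is not the sched-DVFS implementation: the OPP
table, the select_opp() helper and the value of SCHED_LOAD_SCALE are
assumptions made up for this example, and the Python boosted_cpu_util() merely
mirrors the name of the kernel function described above.

```python
SCHED_LOAD_SCALE = 1024  # assumed maximum utilization value

# Hypothetical OPP table: (frequency in MHz, capacity at that frequency).
OPPS = [(500, 256), (1000, 512), (1500, 768), (2000, 1024)]


def boosted_cpu_util(util, boost_pct):
    """Return `util` inflated by the SPC margin for `boost_pct`."""
    margin = boost_pct * (SCHED_LOAD_SCALE - util) // 100
    return util + margin


def select_opp(util, boost_pct=0, opps=OPPS):
    """Pick the lowest OPP whose capacity covers the boosted utilization."""
    target = boosted_cpu_util(util, boost_pct)
    for freq_mhz, capacity in opps:
        if capacity >= target:
            return freq_mhz
    return opps[-1][0]  # saturate at the highest OPP
```

With this made-up table, a CPU utilization of 200 selects the 500 MHz OPP
without boosting, the 1500 MHz OPP with a 50% boost (boosted utilization of
612) and the 2000 MHz OPP with a 100% boost.

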
5. Per task group boosting
==========================

The availability of a single knob which is used to boost all tasks in the
system is certainly a simple solution, but it is unlikely to fit many
utilization scenarios, especially in the mobile device space.

For example, on battery powered devices there usually are many background
services which are long running and need energy efficient scheduling. On the
other hand, some applications are more performance sensitive and require an
interactive response and/or maximum performance, regardless of the energy cost.
To better service such scenarios, the SchedTune implementation has an extension
that provides a more fine-grained boosting interface.

A new CGroup controller, namely "schedtune", can be enabled, which allows
defining and configuring task groups with different boosting values.
Tasks that require special performance can be put into separate CGroups.
The value of the boost associated with the tasks in this group can be specified
using a single knob exposed by the CGroup controller:

   schedtune.boost

This knob allows the definition of a boost value that is to be used for
SPC boosting of all tasks attached to this group.

The current schedtune controller implementation is really simple and has these
main characteristics:

  1) It is only possible to create hierarchies that are one level deep

     The root control group defines the system-wide boost value to be applied
     by default to all tasks. Its direct subgroups are named "boost groups" and
     they define the boost value for a specific set of tasks.
     Further nested subgroups are not allowed since they do not have a sensible
     meaning from a user-space standpoint.

  2) It is possible to define only a limited number of "boost groups"

     This number is defined at compile time and by default configured to 16.
     This is a design decision motivated by two main reasons:
     a) In a real system we do not expect utilization scenarios with more than
        a few boost groups. For example, a reasonable collection of groups
        could be just "background", "interactive" and "performance".
     b) It simplifies the implementation considerably, especially for the code
        which has to compute the per CPU boosting once there are multiple
        RUNNABLE tasks with different boost values.

Such a simple design should allow servicing the main utilization scenarios
identified so far. It provides a simple interface which can be used to manage
the power-performance of all tasks or only selected tasks.
Moreover, this interface can be easily integrated by user-space run-times (e.g.
Android, ChromeOS) to implement a QoS solution for task boosting based on task
classification, which has been a long standing requirement.

Setup and usage
---------------

0. Use a kernel with CGROUP_SCHEDTUNE support enabled

1. Check that the "schedtune" CGroup controller is available:

   root@linaro-nano:~# cat /proc/cgroups
   #subsys_name    hierarchy       num_cgroups     enabled
   cpuset          0               1               1
   cpu             0               1               1
   schedtune       0               1               1

2. Mount a tmpfs to create the CGroups mount point (Optional)

   root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup

3. Mount the "schedtune" controller

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
   root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune

4. Set up the system-wide boost value (Optional)

   If not configured, the root control group has a 0% boost value, which
   basically disables boosting for all tasks in the system, thus running in
   an energy-efficient mode.

   root@linaro-nano:~# echo $SYSBOOST > /sys/fs/cgroup/stune/schedtune.boost

5. Create task groups and configure their specific boost value (Optional)

   For example, here we create a "performance" boost group configured to boost
   all its tasks to 100%

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
   root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost

6. Move tasks into the boost group

   For example, the following moves the task with PID $TASKPID (and all its
   threads) into the "performance" boost group.

   root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs

This simple configuration allows only the threads of the $TASKPID task to run,
when needed, at the highest OPP on the most capable CPU of the system.


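Steps 5 and 6 above could also be driven from a user-space run-time. The
following Python sketch shows one possible shape for that; the helper names
create_boost_group() and move_task() and the overridable root argument are
hypothetical and exist only for this example, while the file names and the
mount point are the ones used in the steps above.

```python
from pathlib import Path

# Mount point used in the setup steps of this document.
STUNE_ROOT = Path("/sys/fs/cgroup/stune")


def create_boost_group(name, boost, root=STUNE_ROOT):
    """Create a boost group directly under the root and set its boost value."""
    # Only one level of hierarchy is allowed, so reject nested names.
    if "/" in name or name in ("", ".", ".."):
        raise ValueError("boost groups must be direct children of the root")
    if not 0 <= boost <= 100:
        raise ValueError("boost must be in the range [0..100]")
    group = Path(root) / name
    group.mkdir(exist_ok=True)
    (group / "schedtune.boost").write_text(f"{boost}\n")
    return group


def move_task(pid, group):
    """Move a task (and all its threads) into `group` via cgroup.procs."""
    (Path(group) / "cgroup.procs").write_text(f"{pid}\n")
```

For example, move_task(pid, create_boost_group("performance", 100)) would
reproduce the "performance" group configuration shown above for a given pid.

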
6. Questions and Answers
========================

What about "auto" mode?
-----------------------

The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
with some suitable user-space element. This element could use the exposed
system-wide or cgroup based interface.

How are multiple groups of tasks with different boost values managed?
---------------------------------------------------------------------

The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
on a CPU. When sched-DVFS selects the OPP to run a CPU at, the CPU utilization
is boosted with a value which is the maximum of the boost values of the
currently RUNNABLE tasks in its RQ.

This allows sched-DVFS to boost a CPU only while there are boosted tasks ready
to run and switch back to the energy efficient mode as soon as the last boosted
task is dequeued.


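The "maximum boost among RUNNABLE tasks" rule can be sketched as follows. This
is a simplified Python model and not the kernel data structure: the
CpuBoostTracker class and its method names are hypothetical, and tasks are
represented only by their boost value.

```python
from collections import Counter


class CpuBoostTracker:
    """Track the boost values of the RUNNABLE tasks on one CPU's RQ."""

    def __init__(self):
        # boost value -> number of RUNNABLE tasks using that value
        self._runnable = Counter()

    def enqueue(self, boost_pct):
        """A task with the given boost value became RUNNABLE on this CPU."""
        self._runnable[boost_pct] += 1

    def dequeue(self, boost_pct):
        """A task with the given boost value left this CPU's RQ."""
        if self._runnable[boost_pct] <= 0:
            raise ValueError("no RUNNABLE task with this boost value")
        self._runnable[boost_pct] -= 1
        if self._runnable[boost_pct] == 0:
            del self._runnable[boost_pct]

    def cpu_boost(self):
        """Maximum boost among RUNNABLE tasks, 0 when there are none."""
        return max(self._runnable, default=0)
```

In this model, enqueueing tasks with boost values 10 and 100 makes cpu_boost()
return 100; once the 100% task is dequeued it drops back to 10, and to 0 when
the RQ is empty.

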
7. References
=============
[1] http://lwn.net/Articles/552889
[2] http://lkml.org/lkml/2012/5/18/91
[3] http://lkml.org/lkml/2015/6/26/620