..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2014 Intel Corporation.

Power Management
================

The DPDK Power Management feature allows user space applications to save power
by dynamically adjusting CPU frequency or entering different C-States.

* Adjusting the CPU frequency dynamically according to the utilization of the RX queue.

* Entering deeper C-States, using adaptive algorithms that speculatively suspend
  the application for brief periods when no packets are received.

The interfaces for adjusting the operating CPU frequency are in the power management library.
C-State control is implemented in applications according to the different use cases.

CPU Frequency Scaling
---------------------

The Linux kernel provides a cpufreq module for CPU frequency scaling for each lcore.
For example, for cpuX, /sys/devices/system/cpu/cpuX/cpufreq/ has the following sysfs files for frequency scaling:

* affected_cpus

* bios_limit

* cpuinfo_cur_freq

* cpuinfo_max_freq

* cpuinfo_min_freq

* cpuinfo_transition_latency

* related_cpus

* scaling_available_frequencies

* scaling_available_governors

* scaling_cur_freq

* scaling_driver

* scaling_governor

* scaling_max_freq

* scaling_min_freq

* scaling_setspeed

In the DPDK, scaling_governor is configured in user space.
A user space application can then prompt the kernel to adjust the CPU frequency by writing to scaling_setspeed,
according to the strategies defined by the user space application.

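As a rough illustration of the sysfs interaction described above, the following sketch builds the path of a cpufreq file, parses a frequency list in the format of scaling_available_frequencies, and writes a value to scaling_setspeed. The function names ``cpufreq_path``, ``parse_freq_list`` and ``write_setspeed`` are hypothetical and are not part of the DPDK power library. The sketch assumes the values are expressed in kHz, that the governor accepts user space requests, and that the process has permission to write the file.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Build the sysfs path of one cpufreq file for a given CPU.
 * Returns 0 on success, -1 if the buffer is too small. */
int cpufreq_path(char *buf, size_t len, unsigned int cpu, const char *file)
{
    assert(buf != NULL && file != NULL);
    int n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/cpufreq/%s",
                     cpu, file);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Parse a whitespace-separated list of frequencies (assumed to be in kHz),
 * as found in scaling_available_frequencies, into freqs[].
 * Returns the number of entries parsed, at most max. */
size_t parse_freq_list(const char *text, unsigned long *freqs, size_t max)
{
    assert(text != NULL && freqs != NULL);
    size_t count = 0;
    const char *p = text;

    while (count < max) {
        char *end;
        unsigned long khz = strtoul(p, &end, 10);
        if (end == p)   /* nothing left to convert */
            break;
        freqs[count++] = khz;
        p = end;
    }
    return count;
}

/* Ask the kernel for a new frequency by writing to scaling_setspeed.
 * Returns 0 on success, -1 on any error (for example, no permission). */
int write_setspeed(unsigned int cpu, unsigned long khz)
{
    char path[128];

    if (cpufreq_path(path, sizeof(path), cpu, "scaling_setspeed") != 0)
        return -1;
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%lu", khz) > 0;
    return (fclose(f) == 0 && ok) ? 0 : -1;
}
```

An application would typically read scaling_available_frequencies once, keep the parsed table, and then pass one of its entries to ``write_setspeed`` whenever its own strategy calls for a change.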
Core-load Throttling through C-States
-------------------------------------

Core state can be altered by speculative sleeps whenever the specified lcore has nothing to do.
In the DPDK, if no packet is received after polling,
speculative sleeps can be triggered according to the strategies defined by the user space application.

Per-core Turbo Boost
--------------------

Individual cores can be allowed to enter a Turbo Boost state on a per-core
basis. This is achieved by enabling Turbo Boost Technology in the BIOS, then
looping through the relevant cores and enabling/disabling Turbo Boost on each
core.

Use of Power Library in a Hyper-Threaded Environment
----------------------------------------------------

When the power library is in use on a system with Hyper-Threading enabled,
the frequency of the physical core is set to the highest frequency requested by its Hyper-Thread siblings.
So even though an application may request a scale down, the core frequency will
remain at the highest frequency until all Hyper-Threads on that core request a scale down.

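The rule above amounts to taking the maximum over the sibling requests. The following sketch models only that rule; ``core_effective_khz`` is a hypothetical helper, and the selection itself is done by the platform rather than by application code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the frequency a physical core runs at, given the frequency
 * requested by each of its Hyper-Thread siblings: the highest one wins. */
uint32_t core_effective_khz(const uint32_t *requested_khz, size_t n_siblings)
{
    assert(requested_khz != NULL && n_siblings > 0);

    uint32_t max = requested_khz[0];
    for (size_t i = 1; i < n_siblings; i++) {
        if (requested_khz[i] > max)
            max = requested_khz[i];
    }
    return max;
}
```

With two siblings requesting 2300000 and 1200000, the model returns 2300000; it drops to 1200000 only once both siblings request it.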
API Overview of the Power Library
---------------------------------

The main methods exported by the power library are for CPU frequency scaling and include the following:

* **Freq up**: Prompt the kernel to scale up the frequency of the specific lcore.

* **Freq down**: Prompt the kernel to scale down the frequency of the specific lcore.

* **Freq max**: Prompt the kernel to scale up the frequency of the specific lcore to the maximum.

* **Freq min**: Prompt the kernel to scale down the frequency of the specific lcore to the minimum.

* **Get available freqs**: Read the available frequencies of the specific lcore from the sys file.

* **Freq get**: Get the current frequency of the specific lcore.

* **Freq set**: Prompt the kernel to set the frequency for the specific lcore.

* **Enable turbo**: Prompt the kernel to enable Turbo Boost for the specific lcore.

* **Disable turbo**: Prompt the kernel to disable Turbo Boost for the specific lcore.

Use Cases
---------

The power management mechanism is used to save power when performing L3 forwarding.

Empty Poll API
--------------

Abstract
~~~~~~~~

For packet processing workloads such as DPDK, polling is continuous.
This means CPU cores always show 100% busy, independent of how much work
those cores are doing. It is therefore critical to accurately determine how busy
a core is. Without that information:

* There is no indication of overload conditions.
* The user does not know how much real load is on the system, resulting
  in wasted energy because no power management is applied.

Compared to the original l3fwd-power design, instead of going to sleep
after detecting an empty poll, the new mechanism just lowers the core frequency.
As a result, the application does not stop polling the device, which leads
to improved handling of bursts of traffic.

When the system becomes busy, the empty poll mechanism can also increase the core
frequency (including turbo) to make a best effort at handling intensive traffic.
This gives more flexible and balanced traffic awareness than the standard
l3fwd-power application.

Proposed Solution
~~~~~~~~~~~~~~~~~
The proposed solution focuses on how many empty polls are executed.
A low number of empty polls means the core is busy processing its
workload, so a higher frequency is needed. A high number of empty polls
indicates that the core is not doing any real work, so the frequency
can be lowered to save power.

In the current implementation, each core has one empty-poll counter, which assumes
that one core is dedicated to one queue. This will need to be expanded in the future to
support multiple queues per core.

Power state definition:
^^^^^^^^^^^^^^^^^^^^^^^

* LOW: not currently used, reserved for future use.

* MED: the frequency used to process a modest traffic workload.

* HIGH: the frequency used to process a busy traffic workload.

There are two phases to establish the power management system:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Training phase. This phase is used to measure the optimal frequency
  change thresholds for a given system. The thresholds will differ from
  system to system due to differences in processor micro-architecture,
  cache and device configurations.
  In this phase, the user must ensure that no traffic can enter the
  system so that counts can be measured for empty polls at low, medium
  and high frequencies. Each frequency is measured for two seconds.
  Once the training phase is complete, the threshold numbers are
  displayed, normal mode resumes, and traffic can be allowed into
  the system. These threshold numbers can be supplied on the command line
  when starting the application in normal mode to avoid re-training
  every time.

* Normal phase. Every 10ms the run-time counters are compared
  to the supplied threshold values, and a decision is made on
  whether to move to a different power state (by adjusting the
  frequency).

API Overview for Empty Poll Power Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* **State Init**: initialize the power management system.

* **State Free**: free the resources held by the power management system.

* **Update Empty Poll Counter**: update the empty poll counter.

* **Update Valid Poll Counter**: update the valid poll counter.

* **Set the Frequency Index**: update the power state/frequency mapping.

* **Detect empty poll state change**: run the empty poll state change detection algorithm and take action accordingly.

Use Cases
---------
The mechanism can be applied to any device that is based on polling, e.g. NIC, FPGA.

References
----------

* l3fwd-power: The sample application in DPDK that performs L3 forwarding with power management.

* The "L3 Forwarding with Power Management Sample Application" chapter in the *DPDK Sample Applications User Guide*.