42. Virtual Machine Power Management Application
Applications running in virtual environments have an abstract view of the underlying hardware on the host. Specifically, applications cannot see the binding of virtual components to physical hardware. When looking at CPU resourcing, the pinning of Virtual CPUs (vCPUs) to Physical CPUs (pCPUs) on the host is not apparent to an application and this pinning may change over time. In addition, operating systems on Virtual Machines (VMs) do not have the ability to govern their own power policy. The Model Specific Registers (MSRs) for enabling P-state transitions are not exposed to the operating systems running on the VMs.
The solution demonstrated in this sample application shows an example of how a DPDK application can indicate its processing requirements using VM-local only information (vCPU/lcore, and so on) to a host resident VM Power Manager. The VM Power Manager is responsible for:
- Accepting requests for frequency changes for a vCPU
- Translating the vCPU to a pCPU using libvirt
- Performing the change in frequency
This application demonstrates the following features:
The handling of VM application requests to change frequency. VM applications can request frequency changes for a vCPU. The VM Power Management Application uses libvirt to translate that virtual CPU (vCPU) request to a physical CPU (pCPU) request and performs the frequency change.
The acceptance of power management policies from VM applications. A VM application can send a policy to the host application. The policy contains rules that define the power management behaviour of the VM. The host application then applies the rules of the policy independent of the VM application. For example, the policy can contain time-of-day information for busy/quiet periods, and the host application can scale up/down the relevant cores when required. See Command Line Options Available When Sending a Policy to the Host for information on setting policy values.
Out-of-band monitoring of workloads using core hardware event counters. The host application can manage power for an application by looking at the event counters of the cores and taking action based on the branch miss/hit ratio. See Command Line Options for Enabling Out-of-band Branch Ratio Monitoring.
Note: This functionality also applies in non-virtualised environments.
In addition to the librte_power library used on the host, the application uses a special version of librte_power on each VM, which directs frequency changes and policies to the host monitor rather than the ACPI cpufreq sysfs interface used on the host in non-virtualised environments.
In the above diagram, the DPDK Applications are shown running in virtual machines, and the VM Power Monitor application is shown running in the host.
DPDK VM Application
- Reuses the librte_power interface, but uses an implementation that forwards frequency requests to the host using a virtio-serial channel
- Each lcore has exclusive access to a single channel
- Sample application reuses l3fwd_power
- A CLI for changing frequency from within a VM is also included
VM Power Monitor
- Accepts VM commands over virtio-serial endpoints, monitored using epoll
- Commands include the virtual core to be modified, using libvirt to get the physical core mapping
- Uses librte_power to effect frequency changes using the Linux userspace power governor (acpi_cpufreq OR intel_pstate driver)
- CLI: For adding VM channels to monitor, inspecting and changing channel state, and manually altering CPU frequency. Also allows for the changing of vCPU to pCPU pinning
42.1. Sample Application Architecture Overview
The VM power management solution employs qemu-kvm to provide communications channels between the host and VMs in the form of a virtio-serial connection that appears as a para-virtualised serial device on a VM and can be configured to use various backends on the host. For this example, each virtio-serial endpoint on the host is configured as an AF_UNIX file socket, supporting poll/select and epoll for event notification. In this example, each channel endpoint on the host is monitored for EPOLLIN events using epoll. Each channel is specified as qemu-kvm arguments or as libvirt XML for each VM, where each VM can have several channels up to a maximum of 64 per VM. In this example, each DPDK lcore on a VM has exclusive access to a channel.
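The monitoring of several channel endpoints for readable data can be illustrated with a small Python sketch. It is not the application's code: connected socket pairs stand in for the virtio-serial endpoints, the `selectors` module stands in for direct use of epoll (on Linux its default selector is epoll-based), and the `poll_channels` name and the channel labels are hypothetical.

```python
import selectors
import socket


def poll_channels(channels, timeout=1.0):
    """Return {name: bytes} for every channel that has data ready to read."""
    sel = selectors.DefaultSelector()
    for name, sock in channels.items():
        # EVENT_READ plays the role of EPOLLIN in this sketch.
        sel.register(sock, selectors.EVENT_READ, data=name)
    ready = {}
    for key, _events in sel.select(timeout):
        ready[key.data] = key.fileobj.recv(4096)
    sel.close()
    return ready


# Stand-ins for two channel endpoints: each pair is (host side, guest side).
host0, guest0 = socket.socketpair()
host1, guest1 = socket.socketpair()

guest1.send(b"scale_up")  # only the second "guest" sends a request
print(poll_channels({"vm0.0": host0, "vm0.1": host1}))
```

Only the endpoint that received data is reported, which mirrors the one-channel-per-lcore arrangement: the host can tell which lcore made a request from the channel it arrived on.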
To enable frequency changes from within a VM, the VM forwards a librte_power request over the virtio-serial channel to the host. Each request contains the vCPU and power command (scale up/down/min/max). The API for the host librte_power and guest librte_power is consistent across environments, with the selection of VM or host implementation determined automatically at runtime based on the environment. On receiving a request, the host translates the vCPU to a pCPU using the libvirt API before forwarding it to the host librte_power.
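The request path described above can be sketched in Python. This is an illustration only: `PowerRequest`, `VCPU_TO_PCPU` and `handle_request` are hypothetical names, and the pinning table is hard-coded, whereas the real host obtains the pinning from libvirt and applies the change through librte_power.

```python
from dataclasses import dataclass

# Hypothetical pinning table for one VM; the real host queries libvirt for it.
VCPU_TO_PCPU = {"vm0": {0: 4, 1: 5, 2: 6}}

VALID_COMMANDS = {"up", "down", "min", "max"}


@dataclass
class PowerRequest:
    """What a guest sends over its channel: a vCPU and a scale command."""
    vm_name: str
    vcpu: int
    command: str


def handle_request(req, pinning=VCPU_TO_PCPU):
    """Translate the vCPU to a pCPU and return the host-side action."""
    if req.command not in VALID_COMMANDS:
        raise ValueError(f"unknown power command: {req.command}")
    pcpu = pinning[req.vm_name][req.vcpu]
    # A real host would now ask librte_power to scale this pCPU.
    return pcpu, req.command


print(handle_request(PowerRequest("vm0", 1, "max")))
```

The guest only ever names its own vCPU; the translation to a physical core stays on the host, which is why a change in pinning does not affect the guest application.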
In addition to the ability to send power management requests to the host, a VM can send a power management policy to the host. In some cases, using a power management policy is a preferred option because it can eliminate possible latency issues that can occur when sending power management requests. Once the VM sends the policy to the host, the VM no longer needs to worry about power management, because the host now manages the power for the VM based on the policy. The policy can specify power behavior that is based on incoming traffic rates or time-of-day power adjustment (busy/quiet hour power adjustment for example). See Command Line Options Available When Sending a Policy to the Host for more information.
One method of power management is to sense how busy a core is when processing packets and adjust power accordingly. One technique for doing this is to monitor the ratio of the branch miss to branch hit counters and scale the core power accordingly. This technique is based on the premise that when a core is not processing packets, the ratio of branch misses to branch hits is very low, but when the core is processing packets, it is measurably higher. This capability is implemented as a policy of type BRANCH_RATIO.
See Command Line Options Available When Sending a Policy to the Host for more information on using the BRANCH_RATIO policy option.
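As a rough illustration of the branch ratio technique, the following Python sketch compares the miss-to-hit ratio against a threshold and picks a scaling action. The function name and the handling of a zero hit count are assumptions made for the example; the default threshold of 0.01 is the default value quoted for the branch ratio command line option.

```python
def branch_ratio_action(branch_misses, branch_hits, threshold=0.01):
    """Return "max" when the core looks busy, "min" when it looks idle."""
    if branch_hits == 0:
        # Assumption: with no samples, treat the core as idle.
        return "min"
    ratio = branch_misses / branch_hits
    return "max" if ratio > threshold else "min"


# A tightly polling core: very few misses relative to hits.
print(branch_ratio_action(branch_misses=5, branch_hits=1000))
# A core processing packets: the ratio rises above the threshold.
print(branch_ratio_action(branch_misses=50, branch_hits=1000))
```

The threshold is workload dependent, so a deployment would be expected to tune it rather than rely on the default.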
A JSON interface enables the specification of power management requests and policies in JSON format. The JSON interfaces provide a more convenient and more easily interpreted interface for the specification of requests and policies. See JSON Interface for Power Management Requests and Policies for more information.
42.1.1. Performance Considerations
While the Haswell microarchitecture allows for independent power control for each core, earlier microarchitectures do not offer such fine-grained control. When deploying on pre-Haswell platforms, greater care must be taken when selecting which cores are assigned to a VM. For example, a core does not scale down in frequency until all of its siblings are similarly scaled down.
42.2. Configuration
42.2.1. BIOS
To use the power management features of the DPDK, you must enable Enhanced Intel SpeedStep® Technology in the platform BIOS. Otherwise, the sysfs directory /sys/devices/system/cpu/cpu0/cpufreq does not exist, and you cannot use CPU frequency-based power management. Refer to the relevant BIOS documentation to determine how to access these settings.
42.2.2. Host Operating System
The DPDK Power Management library can use either the acpi_cpufreq or the intel_pstate kernel driver for the management of core frequencies. In many cases, the intel_pstate driver is the default power management environment.
Should the acpi_cpufreq driver be required, the intel_pstate module must be disabled, and the acpi_cpufreq module loaded in its place.
To disable the intel_pstate driver, add the following to the grub Linux command line:
intel_pstate=disable
On reboot, load the acpi_cpufreq module:
modprobe acpi_cpufreq
42.2.3. Hypervisor Channel Configuration
Configure virtio-serial channels using libvirt XML. The XML structure is as follows:
<name>{vm_name}</name>
<controller type='virtio-serial' index='0'>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
</controller>
<channel type='unix'>
  <source mode='bind' path='/tmp/powermonitor/{vm_name}.{channel_num}'/>
  <target type='virtio' name='virtio.serial.port.poweragent.{vm_channel_num}'/>
  <address type='virtio-serial' controller='0' bus='0' port='{N}'/>
</channel>
In this structure, a single controller of type virtio-serial is created. Up to 32 channels can be associated with a single controller, and multiple controllers can be specified. The convention is to use the name of the VM in the host path {vm_name} and to increment {channel_num} for each channel. Likewise, the port value {N} must be incremented for each channel.
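Writing the channel elements by hand becomes repetitive when a VM has many lcores, so a small generator can help. The Python sketch below follows the naming convention described above; the `channel_xml` function, the use of the same number for `{channel_num}` and `{vm_channel_num}`, and the starting port value of 1 are assumptions made for the example.

```python
CHANNEL_TEMPLATE = """<channel type='unix'>
  <source mode='bind' path='/tmp/powermonitor/{vm_name}.{channel_num}'/>
  <target type='virtio' name='virtio.serial.port.poweragent.{channel_num}'/>
  <address type='virtio-serial' controller='0' bus='0' port='{port}'/>
</channel>"""

MAX_CHANNELS_PER_CONTROLLER = 32


def channel_xml(vm_name, num_channels, first_port=1):
    """Return the <channel> elements for one VM on controller 0."""
    if num_channels > MAX_CHANNELS_PER_CONTROLLER:
        raise ValueError("this sketch handles a single controller only")
    blocks = []
    for channel_num in range(num_channels):
        blocks.append(CHANNEL_TEMPLATE.format(
            vm_name=vm_name,
            channel_num=channel_num,
            # The port value is incremented for each channel.
            port=first_port + channel_num,
        ))
    return "\n".join(blocks)


print(channel_xml("vm0", 2))
```

The output is meant to be pasted into the devices section of the VM definition alongside the controller element; it does not validate that the chosen ports are free.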
On the host, create the /tmp/powermonitor/ directory and give the qemu user ownership of it so that each channel can appear at its path:
mkdir /tmp/powermonitor/
chown qemu:qemu /tmp/powermonitor
Note that files and directories in /tmp are generally removed when rebooting the host, so you may need to repeat the previous steps after each reboot.
The serial device as it appears on a VM is configured with the target element attribute name and must be in the form virtio.serial.port.poweragent.{vm_channel_num}, where vm_channel_num is typically the lcore channel to be used in DPDK VM applications.
Each channel on a VM is present at:
/dev/virtio-ports/virtio.serial.port.poweragent.{vm_channel_num}
42.3. Compiling and Running the Host Application
42.3.1. Compiling the Host Application
For information on compiling the DPDK and sample applications, see Compiling the Sample Applications.
The application is located in the vm_power_manager subdirectory.
To build just the vm_power_manager application using make:
cd dpdk/examples/vm_power_manager/
make
The resulting binary is dpdk/build/examples/vm_power_manager.
To build just the vm_power_manager application using meson/ninja:
cd dpdk
meson setup build
cd build
ninja
meson configure -Dexamples=vm_power_manager
ninja
The resulting binary is dpdk/build/examples/dpdk-vm_power_manager.
42.3.2. Running the Host Application
The application does not have any specific command line options other than the EAL options:
./<build_dir>/examples/dpdk-vm_power_manager [EAL options]
The application requires exactly two cores to run: one core for the CLI and the other for the channel endpoint monitor. For example, to run on cores 0 and 1 on a system with four memory channels, issue the command:
./<build_dir>/examples/dpdk-vm_power_manager -l 0-1 -n 4
After successful initialization, the VM Power Manager CLI prompt appears:
vm_power>
Now, it is possible to add virtual machines to the VM Power Manager:
vm_power> add_vm {vm_name}
When a {vm_name} is specified with the add_vm command, a lookup is performed with libvirt to ensure that the VM exists. {vm_name} is a unique identifier to associate channels with a particular VM and for executing operations on a VM within the CLI. VMs do not have to be running to add them.
It is possible to issue several commands from the CLI to manage VMs.
Remove the virtual machine identified by {vm_name} from the VM Power Manager using the command:
rm_vm {vm_name}
Add communication channels for the specified VM using the following command. The virtio channels must be enabled in the VM configuration (qemu/libvirt) and the associated VM must be active. {list} is a comma-separated list of channel numbers to add. Specifying the keyword all attempts to add all channels for the VM:
add_channels {vm_name} {list}|all
Enable query of physical core information from a VM:
set_query {vm_name} enable|disable
Enable or disable the communication channels in {list} (comma-separated) for the specified VM. Alternatively, replace list with the keyword all. Disabled channels still receive packets on the host, but the commands they specify are ignored. Set the status to enabled to begin processing requests again:
set_channel_status {vm_name} {list}|all enabled|disabled
Print to the CLI information on the specified VM. The information lists the number of vCPUs, the pinning to pCPU(s) as a bit mask, along with any communication channels associated with each VM, and the status of each channel:
show_vm {vm_name}
Set the binding of a virtual CPU on a VM with name {vm_name} to the physical CPU mask:
set_pcpu_mask {vm_name} {vcpu} {pcpu}
Set the binding of the virtual CPU on the VM to the physical CPU:
set_pcpu {vm_name} {vcpu} {pcpu}
It is also possible to perform manual control and inspection in relation to CPU frequency scaling.
Get the current frequency for each core specified in the mask:
show_cpu_freq_mask {mask}
Set the current frequency for the cores specified in {core_mask} by scaling each up/down/min/max:
set_cpu_freq {core_mask} up|down|min|max
Get the current frequency for the specified core:
show_cpu_freq {core_num}
Set the current frequency for the specified core by scaling up/down/min/max:
set_cpu_freq {core_num} up|down|min|max
42.3.3. Command Line Options for Enabling Out-of-band Branch Ratio Monitoring
The following command line parameter enables out-of-band monitoring of branch ratios on cores doing busy polling using PMDs:
--core-branch-ratio {list of cores}:{branch ratio for listed cores}
Specify the list of cores on which to monitor the ratio of branch misses to branch hits. A tightly-polling PMD thread has a very low branch ratio, therefore the core frequency scales down to the minimum allowed value. On receiving packets, the code path changes, causing the branch ratio to increase. When the ratio goes above the ratio threshold, the core frequency scales up to the maximum allowed value. The specified branch ratio is a floating point number that identifies the threshold at which to scale up or down for the elements of the core list. If the ratio is not included, the default branch ratio of 0.01 is used, but this value will need adjustment for different workloads.
This parameter can be used multiple times for different sets of cores. The branch ratio mechanism can also be useful for non-PMD cores and hyper-threaded environments where C-States are disabled.
42.4. Compiling and Running the Guest Applications
It is possible to use the l3fwd-power application (for example) with the vm_power_manager.
The distribution also provides a guest CLI for validating the setup.
For both l3fwd-power and the guest CLI, the host application must use the add_channels command to monitor the channels for the VM. To do this, issue the following commands in the host application:
vm_power> add_vm vmname
vm_power> add_channels vmname all
vm_power> set_channel_status vmname all enabled
vm_power> show_vm vmname
42.4.1. Compiling the Guest Application
For information on compiling DPDK and the sample applications in general, see Compiling the Sample Applications.
For compiling and running the l3fwd-power sample application, see L3 Forwarding with Power Management Sample Application.
The application is in the guest_cli subdirectory under vm_power_manager.
To build just the guest_vm_power_manager application using make, issue the following commands:
cd dpdk/examples/vm_power_manager/guest_cli/
make
The resulting binary is dpdk/build/examples/guest_cli.
Note: This sample application conditionally links in the Jansson JSON library. Consequently, if you are using a multilib or cross-compile environment, you may need to set the PKG_CONFIG_LIBDIR environment variable to point to the relevant pkgconfig folder so that the correct library is linked in.
For example, if you are building for a 32-bit target, you could find the correct directory using the following find command:
# find /usr -type d -name pkgconfig
/usr/lib/i386-linux-gnu/pkgconfig
/usr/lib/x86_64-linux-gnu/pkgconfig
Then use:
export PKG_CONFIG_LIBDIR=/usr/lib/i386-linux-gnu/pkgconfig
You then use the make command as normal, which should find the 32-bit version of the library, if it is installed. If not, the application builds without the JSON interface functionality.
To build just the guest_vm_power_manager application using meson/ninja:
cd dpdk
meson setup build
cd build
ninja
meson configure -Dexamples=vm_power_manager/guest_cli
ninja
The resulting binary is dpdk/build/examples/guest_cli.
42.4.2. Running the Guest Application
The standard EAL command line parameters are necessary:
./<build_dir>/examples/dpdk-guest_vm_power_mgr [EAL options] -- [guest options]
The guest example uses a channel for each lcore enabled. For example, to run on cores 0, 1, 2 and 3:
./<build_dir>/examples/dpdk-guest_vm_power_mgr -l 0-3
42.4.3. Command Line Options Available When Sending a Policy to the Host
Optionally, there are several command line options for a user who needs to send a power policy to the host application:
--vm-name {name of guest vm}
- Allows the user to change the virtual machine name passed down to the host application using the power policy. The default is ubuntu2.
--vcpu-list {list vm cores}
- A comma-separated list of cores in the VM that the user wants the host application to monitor. The list of cores in any VM starts at zero, and the host application maps these to the physical cores once the policy passes down to the host. Valid syntax includes individual cores 2,3,4, a range of cores 2-4, or a combination of both 1,3,5-7.
--busy-hours {list of busy hours}
- A comma-separated list of hours in which to set the core frequency to the maximum. Valid syntax includes individual hours 2,3,4, a range of hours 2-4, or a combination of both 1,3,5-7. Valid hour values are 0 to 23.
--quiet-hours {list of quiet hours}
- A comma-separated list of hours in which to set the core frequency to minimum. Valid syntax includes individual hours 2,3,4, a range of hours 2-4, or a combination of both 1,3,5-7. Valid hour values are 0 to 23.
--policy {policy type}
- The type of policy. This can be one of the following values:
- TRAFFIC - Based on incoming traffic rates on the NIC.
- TIME - Uses a busy/quiet hours policy.
- BRANCH_RATIO - Uses branch ratio counters to determine core busyness.
- WORKLOAD - Sets the frequency to low, medium or high based on the received policy setting.
Note: Not all policy types need all parameters. For example, BRANCH_RATIO only needs the vcpu-list parameter.
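The list syntax accepted by --vcpu-list, --busy-hours and --quiet-hours (individual values, ranges, or a mix of both) can be illustrated with a short parser. This Python sketch is not the application's parser; the `parse_list` name and its `lowest`/`highest` bounds parameters are hypothetical.

```python
def parse_list(spec, lowest=0, highest=None):
    """Expand a string such as "1,3,5-7" into a sorted list of integers."""
    values = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part}")
            values.update(range(start, end + 1))
        else:
            values.add(int(part))
    for value in values:
        if value < lowest or (highest is not None and value > highest):
            raise ValueError(f"value out of range: {value}")
    return sorted(values)


print(parse_list("1,3,5-7"))            # a vCPU list mixing both forms
print(parse_list("17-23", highest=23))  # busy hours, limited to 0 to 23
```

Passing `highest=23` models the hour options, where values outside 0 to 23 are invalid; for a vCPU list the upper bound depends on the VM and is left open here.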
After successful initialization, the VM Power Manager Guest CLI prompt appears:
vm_power(guest)>
To change the frequency of an lcore, use a set_cpu_freq command similar to the following:
set_cpu_freq {core_num} up|down|min|max
where {core_num} is the lcore and channel whose frequency is changed by scaling up/down/min/max.
To start an application, configure the power policy, and send it to the host, use a command like the following:
./<build_dir>/examples/dpdk-guest_vm_power_mgr -l 0-3 -n 4 -- --vm-name=ubuntu --policy=BRANCH_RATIO --vcpu-list=2-4
Once the VM Power Manager Guest CLI appears, issuing the ‘send_policy now’ command will send the policy to the host:
send_policy now
Once the policy is sent to the host, the host application takes over the power monitoring of the specified cores in the policy.
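To make the host-side behaviour more concrete, the following Python sketch shows how a TIME policy and a TRAFFIC policy could each be turned into a scaling decision. It is a simplified model rather than the host application's logic: the function names are hypothetical, the TRAFFIC thresholds correspond to the avg_packet_thresh and max_packet_thresh pairs described in JSON Name-value Pairs, and the treatment of hours that are in neither list and of packet counts exactly on a threshold is an assumption.

```python
def time_action(hour, busy_hours, quiet_hours):
    """TIME policy: maximum in busy hours, minimum in quiet hours."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in the range 0 to 23")
    if hour in busy_hours:
        return "max"
    if hour in quiet_hours:
        return "min"
    return None  # assumption: leave the frequency alone outside both lists


def traffic_action(packets, avg_packet_thresh, max_packet_thresh):
    """TRAFFIC policy: pick min, med or max from the packet count."""
    if packets > max_packet_thresh:
        return "max"
    if packets < avg_packet_thresh:
        return "min"
    return "med"  # assumption: values on a threshold count as medium


busy = [17, 18, 19, 20, 21, 22, 23]
quiet = [2, 3, 4, 5, 6]
print(time_action(18, busy, quiet))
print(traffic_action(300000, avg_packet_thresh=100000, max_packet_thresh=500000))
```

In both cases the guest is out of the loop once the policy has been sent: the inputs (time of day, packet counts) are observed on the host.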
42.5. JSON Interface for Power Management Requests and Policies
In addition to the command line interface for the host command, and a virtio-serial interface for VM power policies, there is also a JSON interface through which power commands and policies can be sent.
Note: This functionality adds a dependency on the Jansson library. Install the Jansson development package on the system to avail of the JSON parsing functionality in the app. Issue the apt-get install libjansson-dev command to install the development package. The command and package name may be different depending on your operating system. It is worth noting that the app builds successfully if this package is not present, but a warning displays during compilation, and the JSON parsing functionality is not present in the app.
Send a request or policy to the VM Power Manager by simply opening a fifo file at /tmp/powermonitor/fifo, writing a JSON string to that file, and closing the file.
The JSON string can be a power management request or a policy, and takes the following format:
{"packet_type": {
"pair_1": value,
"pair_2": value
}}
The packet_type header can contain one of two values, depending on whether a power management request or policy is being sent. The two possible values are instruction and policy, and the expected name-value pairs are different depending on which type is sent.
The pairs are in the format of standard JSON name-value pairs. The value type varies between the different name-value pairs, and may be integers, strings, arrays, and so on. See JSON Interface Examples for examples of policies and instructions and JSON Name-value Pairs for the supported names and value types.
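The open, write and close sequence can be scripted. The following Python sketch wraps the name-value pairs in the packet_type header and writes the result to the fifo path given above; the `send_packet` helper and its parameters are hypothetical and are not part of the application.

```python
import json


def send_packet(packet_type, pairs, fifo_path="/tmp/powermonitor/fifo"):
    """Write one JSON request or policy to the fifo and return the string."""
    if packet_type not in ("instruction", "policy"):
        raise ValueError(f"unsupported packet type: {packet_type}")
    payload = json.dumps({packet_type: pairs})
    # Opening a fifo for writing blocks until a reader has it open,
    # so the VM Power Manager must already be running.
    with open(fifo_path, "w") as fifo:
        fifo.write(payload)
    return payload
```

For example, calling `send_packet("instruction", {"name": "ubuntu", "command": "power", "unit": "SCALE_MAX", "resource_id": 10})` would write a power management request of the kind described in this section. Using json.dumps rather than hand-built strings avoids syntax slips such as trailing commas.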
42.5.1. JSON Interface Examples
The following is an example JSON string that creates a time-profile policy.
{"policy": {
"name": "ubuntu",
"command": "create",
"policy_type": "TIME",
"busy_hours":[ 17, 18, 19, 20, 21, 22, 23 ],
"quiet_hours":[ 2, 3, 4, 5, 6 ],
"core_list":[ 11 ]
}}
The following is an example JSON string that removes the named policy.
{"policy": {
"name": "ubuntu",
"command": "destroy"
}}
The following is an example JSON string for a power management request.
{"instruction": {
"name": "ubuntu",
"command": "power",
"unit": "SCALE_MAX",
"resource_id": 10
}}
To query the available frequencies of an lcore from the guest CLI, use the query_cpu_freq command, where {core_num} is the lcore to query. Before using this command, enable responses using the set_query command on the host.
query_cpu_freq {core_num}|all
To query the capabilities of an lcore, use the query_cpu_caps command, where {core_num} is the lcore to query. Before using this command, enable responses using the set_query command on the host.
query_cpu_caps {core_num}|all
42.5.2. JSON Name-value Pairs
The following are the name-value pairs supported by the JSON interface:
- avg_packet_thresh
- busy_hours
- command
- core_list
- mac_list
- max_packet_thresh
- name
- policy_type
- quiet_hours
- resource_id
- unit
- workload
42.5.2.1. avg_packet_thresh
- Description
- The threshold below which the frequency is set to the minimum value for the TRAFFIC policy. If the traffic rate is above this value and below the maximum value, the frequency is set to medium.
- Type
- integer
- Values
- The number of packets below which the TRAFFIC policy applies the minimum frequency, or the medium frequency if between the average and maximum thresholds.
- Required
- Yes
- Example
"avg_packet_thresh": 100000
42.5.2.2. busy_hours
- Description
- The hours of the day in which we scale up the cores for busy times.
- Type
- array of integers
- Values
- An array with a list of hour values (0-23).
- Required
- For the TIME policy only.
- Example
"busy_hours":[ 17, 18, 19, 20, 21, 22, 23 ]
42.5.2.3. command
- Description
- The type of packet to send to the VM Power Manager. It is possible to create or destroy a policy or send a direct command to adjust the frequency of a core, as is possible on the command line interface.
- Type
- string
- Values
Possible values are:
- CREATE: Create a new policy.
- DESTROY: Remove an existing policy.
- POWER: Send an immediate command, max, min, and so on.
- Required
- Yes
- Example
"command": "CREATE"
42.5.2.4. core_list
- Description
- The cores to which to apply a policy.
- Type
- array of integers
- Values
- An array with a list of virtual CPUs.
- Required
- For CREATE/DESTROY policy requests only.
- Example
"core_list":[ 10, 11 ]
42.5.2.5. mac_list
- Description
- When the policy is of type TRAFFIC, it is necessary to specify the MAC addresses that the host must monitor.
- Type
- array of strings
- Values
- An array with a list of MAC address strings.
- Required
- For TRAFFIC policy types only.
- Example
"mac_list":[ "de:ad:be:ef:01:01","de:ad:be:ef:01:02" ]
42.5.2.6. max_packet_thresh
- Description
- In a policy of type TRAFFIC, the threshold value above which the frequency is set to a maximum.
- Type
- integer
- Values
- The number of packets per interval above which the TRAFFIC policy applies the maximum frequency.
- Required
- For the TRAFFIC policy only.
- Example
"max_packet_thresh": 500000
42.5.2.7. name
- Description
- The name of the VM or host. Allows the parser to associate the policy with the relevant VM or host OS.
- Type
- string
- Values
- Any valid string.
- Required
- Yes
- Example
"name": "ubuntu2"
42.5.2.8. policy_type
- Description
- The type of policy to apply. See the --policy option description for more information.
- Type
- string
- Values
Possible values are:
- TIME: Time-of-day policy. Scale the frequencies of the relevant cores up/down depending on busy and quiet hours.
- TRAFFIC: Use statistics from the NIC and scale up and down accordingly.
- WORKLOAD: Determine how heavily loaded the cores are and scale up and down accordingly.
- BRANCH_RATIO: An out-of-band policy that looks at the ratio between branch hits and misses on a core and uses that information to determine how much packet processing a core is doing.
- Required
- For CREATE and DESTROY policy requests only.
- Example
"policy_type": "TIME"
42.5.2.9. quiet_hours
- Description
- The hours of the day to scale down the cores for quiet times.
- Type
- array of integers
- Values
- An array with a list of hour numbers with values in the range 0 to 23.
- Required
- For the TIME policy only.
- Example
"quiet_hours":[ 2, 3, 4, 5, 6 ]
42.5.2.10. resource_id
- Description
- The core to which to apply a power command.
- Type
- integer
- Values
- A valid core ID for the VM or host OS.
- Required
- For the POWER instruction only.
- Example
"resource_id": 10
42.5.2.11. unit
- Description
- The type of power operation to apply in the command.
- Type
- string
- Values
- SCALE_MAX: Scale the frequency of this core to the maximum.
- SCALE_MIN: Scale the frequency of this core to the minimum.
- SCALE_UP: Scale up the frequency of this core.
- SCALE_DOWN: Scale down the frequency of this core.
- ENABLE_TURBO: Enable Intel® Turbo Boost Technology for this core.
- DISABLE_TURBO: Disable Intel® Turbo Boost Technology for this core.
- Required
- For the POWER instruction only.
- Example
"unit": "SCALE_MAX"
42.5.2.12. workload
- Description
- In a policy of type WORKLOAD, it is necessary to specify how heavy the workload is.
- Type
- string
- Values
- HIGH: Scale the frequency of this core to the maximum.
- MEDIUM: Scale the frequency of this core to a medium value.
- LOW: Scale the frequency of this core to the minimum.
- Required
- For the WORKLOAD policy only.
- Example
"workload": "MEDIUM"