22. L3 Forwarding with Power Management Sample Application

22.1. Introduction

The L3 Forwarding with Power Management application is an example of power-aware packet processing using the DPDK. The application is based on the existing L3 Forwarding sample application, with power management algorithms added to control the P-states and C-states of the Intel processor via a power management library.

22.2. Overview

The application demonstrates the use of the Power libraries in the DPDK to implement packet forwarding. The initialization and run-time paths are very similar to those of the L3 Forwarding Sample Application. The main difference is that this application introduces power-aware optimization algorithms that leverage the Power library to control the P-states and C-states of the processor based on packet load.

The DPDK includes poll-mode drivers (PMDs) to configure Intel NIC devices and their receive (Rx) and transmit (Tx) queues. The design principle of these PMDs is to access the Rx and Tx descriptors directly, without any interrupts, so that packets can be received, processed and delivered quickly in user space.

In general, the DPDK executes an endless packet processing loop on dedicated IA cores that includes the following steps:

  • Retrieve input packets by polling the Rx queues through the PMD

  • Process each received packet or pass received packets to other processing cores through software queues

  • Send pending output packets to the Tx queues through the PMD

In this way, the PMD achieves better performance than a traditional interrupt-mode driver, at the cost of keeping cores active and running at the highest frequency, and hence consuming the maximum power all the time. However, during periods of light network traffic, which occur regularly in communication infrastructure systems due to the well-known “tidal effect”, the PMD is still busy waiting for network packets, which wastes a lot of power.

Processor performance states (P-states) are the capability of an Intel processor to switch between different supported operating frequencies and voltages. If configured correctly, according to system workload, this feature provides power savings. CPUFreq is the infrastructure provided by the Linux* kernel to control the processor performance state capability. CPUFreq supports a user space governor that enables setting the frequency by manipulating a virtual file device from a user space application. The Power library in the DPDK provides a set of APIs for manipulating this virtual file device, allowing a user space application to set the CPUFreq governor and the frequency of specific cores.
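For orientation, the following is a minimal sketch (not taken from the sample itself) of how a user space application might drive CPUFreq through these Power library APIs; error handling is trimmed and the helper name is illustrative.

#include <rte_power.h>

/*
 * Illustrative sketch: switch one lcore to the userspace CPUFreq governor
 * and exercise the frequency-control APIs.
 */
static int
power_setup_lcore(unsigned int lcore_id)
{
	/* Sets the userspace governor and reads the available frequencies. */
	if (rte_power_init(lcore_id) != 0)
		return -1;

	/* Start at the highest P-state. A heuristic can later call
	 * rte_power_freq_down()/rte_power_freq_up() to step the frequency,
	 * or rte_power_set_freq() to jump to a specific frequency index. */
	rte_power_freq_max(lcore_id);

	return 0;
}

/* On shutdown, rte_power_exit(lcore_id) restores the original governor. */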

This application includes a P-state power management algorithm that generates a frequency hint to be sent to CPUFreq. The algorithm uses the number of received and available Rx packets on recent polls to make a heuristic decision to scale the frequency up or down. Specifically, several thresholds are checked to see whether a specific core running a DPDK polling thread needs to increase its frequency by one step, based on how close the polled Rx queues are to being full. The algorithm also decreases the frequency by one step if the number of packets processed per loop is far less than the expected threshold or if the thread’s sleep time exceeds a threshold.

C-states are also known as sleep states. They allow software to put an Intel core into a low power idle state from which it is possible to exit via an event, such as an interrupt. However, there is a tradeoff between the power consumed in the idle state and the time required to wake up from it (the exit latency). Therefore, as you go into deeper C-states, the power consumed is lower but the exit latency increases. Each C-state has a target residency: when entering a C-state, the core should remain in it for at least the target residency in order to fully realize the benefits of entering that C-state. CPUIdle is the infrastructure provided by the Linux kernel to control the processor C-state capability. Unlike CPUFreq, CPUIdle does not provide a mechanism for an application to change the C-state directly. Instead, it uses its own heuristic algorithms in kernel space to select the target C-state to enter, by executing privileged instructions such as HLT and MWAIT, based on the speculative sleep duration of the core. In this application, we introduce a heuristic algorithm that allows packet processing cores to sleep for a short period if no Rx packets have been received on recent polls. In this way, CPUIdle automatically puts the corresponding cores into deeper C-states instead of leaving them running in the C0 state waiting for packets.

Note

To fully demonstrate the power saving capability of using C-states, it is recommended to enable deeper C3 and C6 states in the BIOS during system boot up.

22.3. Compiling the Application

To compile the sample application see Compiling the Sample Applications.

The application is located in the l3fwd-power sub-directory.

22.4. Running the Application

The application has a number of command line options:

./<build_dir>/examples/dpdk-l3fwd-power [EAL options] -- -p PORTMASK [-P] --config (port,queue,lcore)[,(port,queue,lcore)] [--max-pkt-len PKTLEN] [--no-numa]

where,

  • -p PORTMASK: Hexadecimal bitmask of ports to configure

  • -P: Sets all ports to promiscuous mode so that packets are accepted regardless of the packet’s Ethernet MAC destination address. Without this option, only packets with the Ethernet MAC destination address set to the Ethernet address of the port are accepted.

  • -u: optional, sets uncore min/max frequency to minimum value.

  • -U: optional, sets uncore min/max frequency to maximum value.

  • -i (frequency index): optional, sets uncore frequency to frequency index value, by setting min and max values to be the same.

  • --config (port,queue,lcore)[,(port,queue,lcore)]: determines which queues from which ports are mapped to which cores.

  • --max-pkt-len: optional, maximum packet length in decimal (64-9600)

  • --no-numa: optional, disables NUMA awareness

  • --empty-poll: Traffic-aware power management. See below for details

  • --telemetry: Telemetry mode.

  • --pmd-mgmt: PMD power management mode.

  • --max-empty-polls: Number of empty polls to wait before entering sleep state. Applies to --pmd-mgmt mode only.

  • --pause-duration: Set the duration of the pause callback (microseconds). Applies to --pmd-mgmt mode only.

  • --scale-freq-min: Set minimum frequency for scaling. Applies to --pmd-mgmt mode only.

  • --scale-freq-max: Set maximum frequency for scaling. Applies to --pmd-mgmt mode only.

See L3 Forwarding Sample Application for details. The L3fwd-power example reuses the L3fwd command line options.

22.5. Explanation

The following sections provide some explanation of the sample application code. As mentioned in the overview section, the initialization and run-time paths are identical to those of the L3 forwarding application. The following sections describe aspects that are specific to the L3 Forwarding with Power Management sample application.

22.5.1. Power Library Initialization

The Power library is initialized in the main routine. It changes the P-state governor to userspace for the specific cores that are under control. The Timer library is also initialized, and several timers are created later on; these are responsible for deciding at run time whether the frequency needs to be scaled down, based on CPU utilization statistics.

Note

Only the power management related initialization is shown.

int
main(int argc, char **argv)
{
	struct lcore_conf *qconf;
	struct rte_eth_dev_info dev_info;
	struct rte_eth_txconf *txconf;
	int ret;
	uint16_t nb_ports;
	uint16_t queueid;
	unsigned lcore_id;
	uint64_t hz;
	uint32_t n_tx_queue, nb_lcores;
	uint32_t dev_rxq_num, dev_txq_num;
	uint8_t socketid;
	uint16_t portid, nb_rx_queue, queue;
	const char *ptr_strings[NUM_TELSTATS];

	/* init EAL */
	ret = rte_eal_init(argc, argv);
	if (ret < 0)
		rte_exit(EXIT_FAILURE, "Invalid EAL parameters\n");
	argc -= ret;
	argv += ret;

	/* catch SIGINT and restore cpufreq governor to ondemand */
	signal(SIGINT, signal_exit_now);

	/* init RTE timer library to be used late */
	rte_timer_subsystem_init();

	/* if we're running pmd-mgmt mode, don't default to baseline mode */
	baseline_enabled = false;

	/* parse application arguments (after the EAL ones) */
	ret = parse_args(argc, argv);
	if (ret < 0)
		rte_exit(EXIT_FAILURE, "Invalid L3FWD parameters\n");

	if (app_mode == APP_MODE_DEFAULT)
		app_mode = autodetect_mode();

	RTE_LOG(INFO, L3FWD_POWER, "Selected operation mode: %s\n",
			mode_to_str(app_mode));

	/* only legacy and empty poll mode rely on power library */
	if ((app_mode == APP_MODE_LEGACY || app_mode == APP_MODE_EMPTY_POLL) &&
			init_power_library())
		rte_exit(EXIT_FAILURE, "init_power_library failed\n");

	if (update_lcore_params() < 0)
		rte_exit(EXIT_FAILURE, "update_lcore_params failed\n");

	if (check_lcore_params() < 0)
		rte_exit(EXIT_FAILURE, "check_lcore_params failed\n");

	ret = init_lcore_rx_queues();
	if (ret < 0)
		rte_exit(EXIT_FAILURE, "init_lcore_rx_queues failed\n");

	nb_ports = rte_eth_dev_count_avail();

	if (check_port_config() < 0)
		rte_exit(EXIT_FAILURE, "check_port_config failed\n");

	nb_lcores = rte_lcore_count();

	/* initialize all ports */
	RTE_ETH_FOREACH_DEV(portid) {
		struct rte_eth_conf local_port_conf = port_conf;
		/* not all app modes need interrupts */
		bool need_intr = app_mode == APP_MODE_LEGACY ||
				app_mode == APP_MODE_INTERRUPT;

		/* skip ports that are not enabled */
		if ((enabled_port_mask & (1 << portid)) == 0) {
			printf("\nSkipping disabled port %d\n", portid);
			continue;
		}

		/* init port */
		printf("Initializing port %d ... ", portid );
		fflush(stdout);

		ret = rte_eth_dev_info_get(portid, &dev_info);
		if (ret != 0)
			rte_exit(EXIT_FAILURE,
				"Error during getting device (port %u) info: %s\n",
				portid, strerror(-ret));

		dev_rxq_num = dev_info.max_rx_queues;
		dev_txq_num = dev_info.max_tx_queues;

		nb_rx_queue = get_port_n_rx_queues(portid);
		if (nb_rx_queue > dev_rxq_num)
			rte_exit(EXIT_FAILURE,
				"Cannot configure not existed rxq: "
				"port=%d\n", portid);

		n_tx_queue = nb_lcores;
		if (n_tx_queue > dev_txq_num)
			n_tx_queue = dev_txq_num;
		printf("Creating queues: nb_rxq=%d nb_txq=%u... ",
			nb_rx_queue, (unsigned)n_tx_queue );
		/* If number of Rx queue is 0, no need to enable Rx interrupt */
		if (nb_rx_queue == 0)
			need_intr = false;

		if (need_intr)
			local_port_conf.intr_conf.rxq = 1;

		ret = rte_eth_dev_info_get(portid, &dev_info);
		if (ret != 0)
			rte_exit(EXIT_FAILURE,
				"Error during getting device (port %u) info: %s\n",
				portid, strerror(-ret));

		ret = config_port_max_pkt_len(&local_port_conf, &dev_info);
		if (ret != 0)
			rte_exit(EXIT_FAILURE,
				"Invalid max packet length: %u (port %u)\n",
				max_pkt_len, portid);

		if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE)
			local_port_conf.txmode.offloads |=
				RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;

		local_port_conf.rx_adv_conf.rss_conf.rss_hf &=
			dev_info.flow_type_rss_offloads;
		if (local_port_conf.rx_adv_conf.rss_conf.rss_hf !=
				port_conf.rx_adv_conf.rss_conf.rss_hf) {
			printf("Port %u modified RSS hash function based on hardware support,"
				"requested:%#"PRIx64" configured:%#"PRIx64"\n",
				portid,
				port_conf.rx_adv_conf.rss_conf.rss_hf,
				local_port_conf.rx_adv_conf.rss_conf.rss_hf);
		}

		if (local_port_conf.rx_adv_conf.rss_conf.rss_hf == 0)
			local_port_conf.rxmode.mq_mode = RTE_ETH_MQ_RX_NONE;
		local_port_conf.rxmode.offloads &= dev_info.rx_offload_capa;
		port_conf.rxmode.offloads = local_port_conf.rxmode.offloads;

		ret = rte_eth_dev_configure(portid, nb_rx_queue,
					(uint16_t)n_tx_queue, &local_port_conf);
		if (ret < 0)
			rte_exit(EXIT_FAILURE, "Cannot configure device: "
					"err=%d, port=%d\n", ret, portid);

		ret = rte_eth_dev_adjust_nb_rx_tx_desc(portid, &nb_rxd,
						       &nb_txd);
		if (ret < 0)
			rte_exit(EXIT_FAILURE,
				 "Cannot adjust number of descriptors: err=%d, port=%d\n",
				 ret, portid);

		ret = rte_eth_macaddr_get(portid, &ports_eth_addr[portid]);
		if (ret < 0)
			rte_exit(EXIT_FAILURE,
				 "Cannot get MAC address: err=%d, port=%d\n",
				 ret, portid);

		print_ethaddr(" Address:", &ports_eth_addr[portid]);
		printf(", ");

		/* init memory */
		ret = init_mem(NB_MBUF);
		if (ret < 0)
			rte_exit(EXIT_FAILURE, "init_mem failed\n");

		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
			if (rte_lcore_is_enabled(lcore_id) == 0)
				continue;

			/* Initialize TX buffers */
			qconf = &lcore_conf[lcore_id];
			qconf->tx_buffer[portid] = rte_zmalloc_socket("tx_buffer",
				RTE_ETH_TX_BUFFER_SIZE(MAX_PKT_BURST), 0,
				rte_eth_dev_socket_id(portid));
			if (qconf->tx_buffer[portid] == NULL)
				rte_exit(EXIT_FAILURE, "Can't allocate tx buffer for port %u\n",
						 portid);

			rte_eth_tx_buffer_init(qconf->tx_buffer[portid], MAX_PKT_BURST);
		}

		/* init one TX queue per couple (lcore,port) */
		queueid = 0;
		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
			if (rte_lcore_is_enabled(lcore_id) == 0)
				continue;

			if (queueid >= dev_txq_num)
				continue;

			if (numa_on)
				socketid = \
				(uint8_t)rte_lcore_to_socket_id(lcore_id);
			else
				socketid = 0;

			printf("txq=%u,%d,%d ", lcore_id, queueid, socketid);
			fflush(stdout);

			txconf = &dev_info.default_txconf;
			txconf->offloads = local_port_conf.txmode.offloads;
			ret = rte_eth_tx_queue_setup(portid, queueid, nb_txd,
						     socketid, txconf);
			if (ret < 0)
				rte_exit(EXIT_FAILURE,
					"rte_eth_tx_queue_setup: err=%d, "
						"port=%d\n", ret, portid);

			qconf = &lcore_conf[lcore_id];
			qconf->tx_queue_id[portid] = queueid;
			queueid++;

			qconf->tx_port_id[qconf->n_tx_port] = portid;
			qconf->n_tx_port++;
		}
		printf("\n");
	}

	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
		if (rte_lcore_is_enabled(lcore_id) == 0)
			continue;

		if (app_mode == APP_MODE_LEGACY) {
			/* init timer structures for each enabled lcore */
			rte_timer_init(&power_timers[lcore_id]);
			hz = rte_get_timer_hz();
			rte_timer_reset(&power_timers[lcore_id],
					hz/TIMER_NUMBER_PER_SECOND,
					SINGLE, lcore_id,
					power_timer_cb, NULL);
		}
		qconf = &lcore_conf[lcore_id];
		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
		fflush(stdout);

		/* init RX queues */
		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
			struct rte_eth_rxconf rxq_conf;

			portid = qconf->rx_queue_list[queue].port_id;
			queueid = qconf->rx_queue_list[queue].queue_id;

			if (numa_on)
				socketid = \
				(uint8_t)rte_lcore_to_socket_id(lcore_id);
			else
				socketid = 0;

			printf("rxq=%d,%d,%d ", portid, queueid, socketid);
			fflush(stdout);

			ret = rte_eth_dev_info_get(portid, &dev_info);
			if (ret != 0)
				rte_exit(EXIT_FAILURE,
					"Error during getting device (port %u) info: %s\n",
					portid, strerror(-ret));

			rxq_conf = dev_info.default_rxconf;
			rxq_conf.offloads = port_conf.rxmode.offloads;
			ret = rte_eth_rx_queue_setup(portid, queueid, nb_rxd,
				socketid, &rxq_conf,
				pktmbuf_pool[socketid]);
			if (ret < 0)
				rte_exit(EXIT_FAILURE,
					"rte_eth_rx_queue_setup: err=%d, "
						"port=%d\n", ret, portid);

			if (parse_ptype) {
				if (add_cb_parse_ptype(portid, queueid) < 0)
					rte_exit(EXIT_FAILURE,
						 "Fail to add ptype cb\n");
			}

			if (app_mode == APP_MODE_PMD_MGMT && !baseline_enabled) {
				/* Set power_pmd_mgmt configs passed by user */
				rte_power_pmd_mgmt_set_emptypoll_max(max_empty_polls);
				ret = rte_power_pmd_mgmt_set_pause_duration(pause_duration);
				if (ret < 0)
					rte_exit(EXIT_FAILURE,
						"Error setting pause_duration: err=%d, lcore=%d\n",
							ret, lcore_id);

				ret = rte_power_pmd_mgmt_set_scaling_freq_min(lcore_id,
						scale_freq_min);
				if (ret < 0)
					rte_exit(EXIT_FAILURE,
						"Error setting scaling freq min: err=%d, lcore=%d\n",
							ret, lcore_id);

				ret = rte_power_pmd_mgmt_set_scaling_freq_max(lcore_id,
						scale_freq_max);
				if (ret < 0)
					rte_exit(EXIT_FAILURE,
						"Error setting scaling freq max: err=%d, lcore %d\n",
							ret, lcore_id);

				ret = rte_power_ethdev_pmgmt_queue_enable(
						lcore_id, portid, queueid,
						pmgmt_type);
				if (ret < 0)
					rte_exit(EXIT_FAILURE,
						"rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n",
							ret, portid);
			}
		}
	}

22.5.2. Monitoring Loads of Rx Queues

In general, the polling nature of the DPDK prevents the OS power management subsystem from knowing whether the network load is actually heavy or light. In this sample, the network load is sampled by monitoring the received and available descriptors on the NIC Rx queues in recent polls. Based on the number of returned and available Rx descriptors, this example implements algorithms that generate frequency scaling hints and speculative sleep durations, and uses them to control the P-state and C-state of the processor via the power management library. Frequency (P-state) control and sleep state (C-state) control work independently for each logical core, and their combination contributes to a power efficient packet processing solution when serving light network loads.

The rte_eth_rx_burst() and rte_eth_rx_queue_count() functions are used in the endless packet processing loop to return the number of received and available Rx descriptors. These numbers for a specific queue are passed to the P-state and C-state heuristic algorithms to generate hints based on recent network load trends.

Note

Only power control related code is shown.

static int main_intr_loop(__rte_unused void *dummy)
{
	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
	unsigned int lcore_id;
	uint64_t prev_tsc, diff_tsc, cur_tsc;
	int i, j, nb_rx;
	uint16_t portid, queueid;
	struct lcore_conf *qconf;
	struct lcore_rx_queue *rx_queue;
	uint32_t lcore_rx_idle_count = 0;
	uint32_t lcore_idle_hint = 0;
	int intr_en = 0;

	const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
				   US_PER_S * BURST_TX_DRAIN_US;

	prev_tsc = 0;

	lcore_id = rte_lcore_id();
	qconf = &lcore_conf[lcore_id];

	if (qconf->n_rx_queue == 0) {
		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n",
				lcore_id);
		return 0;
	}

	RTE_LOG(INFO, L3FWD_POWER, "entering main interrupt loop on lcore %u\n",
			lcore_id);

	for (i = 0; i < qconf->n_rx_queue; i++) {
		portid = qconf->rx_queue_list[i].port_id;
		queueid = qconf->rx_queue_list[i].queue_id;
		RTE_LOG(INFO, L3FWD_POWER,
				" -- lcoreid=%u portid=%u rxqueueid=%" PRIu16 "\n",
				lcore_id, portid, queueid);
	}

	/* add into event wait list */
	if (event_register(qconf) == 0)
		intr_en = 1;
	else
		RTE_LOG(INFO, L3FWD_POWER, "RX interrupt won't enable.\n");

	while (!is_done()) {
		stats[lcore_id].nb_iteration_looped++;

		cur_tsc = rte_rdtsc();

		/*
		 * TX burst queue drain
		 */
		diff_tsc = cur_tsc - prev_tsc;
		if (unlikely(diff_tsc > drain_tsc)) {
			for (i = 0; i < qconf->n_tx_port; ++i) {
				portid = qconf->tx_port_id[i];
				rte_eth_tx_buffer_flush(portid,
						qconf->tx_queue_id[portid],
						qconf->tx_buffer[portid]);
			}
			prev_tsc = cur_tsc;
		}

start_rx:
		/*
		 * Read packet from RX queues
		 */
		lcore_rx_idle_count = 0;
		for (i = 0; i < qconf->n_rx_queue; ++i) {
			rx_queue = &(qconf->rx_queue_list[i]);
			rx_queue->idle_hint = 0;
			portid = rx_queue->port_id;
			queueid = rx_queue->queue_id;

			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
					MAX_PKT_BURST);

			stats[lcore_id].nb_rx_processed += nb_rx;
			if (unlikely(nb_rx == 0)) {
				/**
				 * No packets received from the Rx queue; try
				 * to sleep for a while, forcing the CPU to
				 * enter deeper C-states.
				 */
				rx_queue->zero_rx_packet_count++;

				if (rx_queue->zero_rx_packet_count <=
						MIN_ZERO_POLL_COUNT)
					continue;

				rx_queue->idle_hint = power_idle_heuristic(
						rx_queue->zero_rx_packet_count);
				lcore_rx_idle_count++;
			} else {
				rx_queue->zero_rx_packet_count = 0;
			}

			/* Prefetch first packets */
			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
				rte_prefetch0(rte_pktmbuf_mtod(
						pkts_burst[j], void *));
			}

			/* Prefetch and forward already prefetched packets */
			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
				rte_prefetch0(rte_pktmbuf_mtod(
						pkts_burst[j + PREFETCH_OFFSET],
						void *));
				l3fwd_simple_forward(
						pkts_burst[j], portid, qconf);
			}

			/* Forward remaining prefetched packets */
			for (; j < nb_rx; j++) {
				l3fwd_simple_forward(
						pkts_burst[j], portid, qconf);
			}
		}

		if (unlikely(lcore_rx_idle_count == qconf->n_rx_queue)) {
			/**
			 * All Rx queues empty in recent consecutive polls,
			 * sleep in a conservative manner, meaning sleep as
			 * little as possible.
			 */
			for (i = 1,
			    lcore_idle_hint = qconf->rx_queue_list[0].idle_hint;
					i < qconf->n_rx_queue; ++i) {
				rx_queue = &(qconf->rx_queue_list[i]);
				if (rx_queue->idle_hint < lcore_idle_hint)
					lcore_idle_hint = rx_queue->idle_hint;
			}

			if (lcore_idle_hint < SUSPEND_THRESHOLD)
				/**
				 * execute "pause" instructions to avoid a
				 * context switch, which generally takes
				 * hundreds of microseconds, for a short sleep.
				 */
				rte_delay_us(lcore_idle_hint);
			else {
				/* suspend until rx interrupt triggers */
				if (intr_en) {
					turn_on_off_intr(qconf, 1);
					sleep_until_rx_interrupt(
							qconf->n_rx_queue,
							lcore_id);
					turn_on_off_intr(qconf, 0);
					/**
					 * start receiving packets immediately
					 */
					if (likely(!is_done()))
						goto start_rx;
				}
			}
			stats[lcore_id].sleep_time += lcore_idle_hint;
		}
	}

	return 0;
}

22.5.3. P-State Heuristic Algorithm

The power_freq_scaleup_heuristic() function is responsible for generating a frequency hint for the specified logical core, according to the number of available descriptors returned by rte_eth_rx_queue_count(). On every poll for new packets, the number of available descriptors on an Rx queue is evaluated, and the algorithm used for frequency hinting is as follows (a simplified sketch follows the note below):

  • If the number of available descriptors exceeds 96, the maximum frequency is hinted.

  • If the number of available descriptors exceeds 64, a trend counter is incremented by 100.

  • If the number of available descriptors exceeds 32, the trend counter is incremented by 1.

  • When the trend counter reaches 10000, the frequency hint is changed to the next higher frequency.

Note

The assumption is that the Rx queue size is 128; the thresholds specified above must be adjusted accordingly based on the actual hardware Rx queue size, which is configured via the rte_eth_rx_queue_setup() function.
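The following is a simplified sketch of such a scale-up heuristic, fed by rte_eth_rx_queue_count(); it is not the exact code of power_freq_scaleup_heuristic(), and the threshold macros and the trend variable are illustrative stand-ins for the sample's per-lcore state.

#include <rte_ethdev.h>

/* Illustrative thresholds, assuming a 128-entry Rx queue. */
#define QLEN_HINT_MAX		96	/* queue nearly full: hint max frequency */
#define QLEN_TREND_STRONG	64
#define QLEN_TREND_WEAK		32
#define TREND_SCALE_UP_LIMIT	10000

enum freq_hint { FREQ_CURRENT, FREQ_HIGHER, FREQ_HIGHEST };

static uint32_t trend;	/* kept per lcore/queue in the real application */

static enum freq_hint
freq_scaleup_hint(uint16_t portid, uint16_t queueid)
{
	/* Number of Rx descriptors holding packets not yet polled. */
	int qlen = rte_eth_rx_queue_count(portid, queueid);

	if (qlen > QLEN_HINT_MAX)
		return FREQ_HIGHEST;
	if (qlen > QLEN_TREND_STRONG)
		trend += 100;
	else if (qlen > QLEN_TREND_WEAK)
		trend += 1;

	if (trend >= TREND_SCALE_UP_LIMIT) {
		trend = 0;
		return FREQ_HIGHER;	/* step to the next higher frequency */
	}
	return FREQ_CURRENT;
}

In the sample, such hints ultimately translate into rte_power_freq_max() or rte_power_freq_up() calls on the polling lcore.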

In general, a thread needs to poll packets from multiple Rx queues. Most likely, different queues have different loads, so they return different frequency hints. The algorithm evaluates all the hints and scales frequency up in an aggressive manner, going to the highest frequency as long as any one Rx queue requires it. In this way, any negative performance impact is minimized.

On the other hand, frequency scale-down is controlled in the timer callback function, sketched below. Specifically, if the sleep times of a logical core indicate that it is sleeping for more than 25% of the sampling period, or if the average number of packets processed per iteration is less than expected, the frequency is decreased by one step.
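A minimal sketch of such a scale-down check, in the spirit of the sample's power_timer_cb() but with illustrative statistics, sampling period and thresholds:

#include <string.h>
#include <rte_common.h>
#include <rte_cycles.h>
#include <rte_lcore.h>
#include <rte_power.h>
#include <rte_timer.h>

/* Illustrative per-lcore counters, reset every sampling period. */
struct pm_stats {
	uint64_t sleep_cycles;	/* TSC cycles spent sleeping */
	uint64_t rx_packets;	/* packets received */
	uint64_t iterations;	/* polling-loop iterations */
};

#define SLEEP_PCT_SCALE_DOWN	25
#define EXPECTED_PKTS_PER_LOOP	16	/* illustrative expectation */

static void
scale_down_cb(struct rte_timer *tim __rte_unused, void *arg)
{
	struct pm_stats *s = arg;
	unsigned int lcore_id = rte_lcore_id();
	/* Sampling period of 100 ms, expressed in TSC cycles. */
	const uint64_t period_cycles = rte_get_tsc_hz() / 10;

	uint64_t sleep_pct = s->sleep_cycles * 100 / period_cycles;
	uint64_t pkts_per_loop = s->iterations ?
			s->rx_packets / s->iterations : 0;

	if (sleep_pct > SLEEP_PCT_SCALE_DOWN ||
			pkts_per_loop < EXPECTED_PKTS_PER_LOOP)
		rte_power_freq_down(lcore_id);	/* drop one P-state step */

	memset(s, 0, sizeof(*s));	/* start a new sampling period */
}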

22.5.4. C-State Heuristic Algorithm

Whenever recent rte_eth_rx_burst() polls return zero packets 5 times in a row, an idle counter begins incrementing for each successive zero poll. At the same time, the function power_idle_heuristic() is called to generate a speculative sleep duration in order to force the logical core to enter a deeper sleeping C-state. There is no way to control the C-state directly; the CPUIdle subsystem in the OS is intelligent enough to select the C-state to enter based on the actual sleep period of the given logical core. The algorithm has the following sleeping behavior depending on the idle counter (a simplified sketch appears at the end of this section):

  • If the idle count is less than 100, the counter value is used as a microsecond sleep value through rte_delay_us(), which executes pause instructions to avoid a costly context switch while still saving power.

  • If idle count is between 100 and 999, a fixed sleep interval of 100 μs is used. A 100 μs sleep interval allows the core to enter the C1 state while keeping a fast response time in case new traffic arrives.

  • If the idle count is greater than 1000, a fixed sleep value of 1 ms is used until the next timer expires. This allows the core to enter the C3/C6 states.

Note

The thresholds specified above need to be adjusted for different Intel processors and traffic profiles.

If a thread polls multiple Rx queues and the different queues return different sleep duration values, the algorithm controls the sleep time in a conservative manner, sleeping for the least possible time in order to avoid a potential performance impact.
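A minimal sketch of such an idle heuristic, using illustrative names for the thresholds described above:

#include <stdint.h>

/* Illustrative thresholds mirroring the behaviour described above. */
#define MIN_ZERO_POLLS		5	/* zero polls tolerated before sleeping */
#define SHORT_SLEEP_LIMIT	100
#define LONG_SLEEP_LIMIT	1000
#define C1_SLEEP_US		100	/* long enough to enter C1 */
#define DEEP_SLEEP_US		1000	/* lets CPUIdle pick C3/C6 */

/*
 * Map the consecutive-zero-poll counter of one Rx queue to a speculative
 * sleep duration in microseconds. Short hints are served by rte_delay_us()
 * (pause instructions); longer ones by an actual sleep.
 */
static uint32_t
idle_sleep_hint(uint32_t zero_rx_polls)
{
	if (zero_rx_polls <= MIN_ZERO_POLLS)
		return 0;		/* keep busy polling */
	if (zero_rx_polls < SHORT_SLEEP_LIMIT)
		return zero_rx_polls;	/* micro sleep, counter value in us */
	if (zero_rx_polls < LONG_SLEEP_LIMIT)
		return C1_SLEEP_US;
	return DEEP_SLEEP_US;
}

When a thread owns several Rx queues, the smallest hint among them is used, as shown in main_intr_loop() above.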

22.6. Empty Poll Mode

Additionally, there is a traffic-aware mode of operation called “Empty Poll”, in which the number of empty polls is monitored to keep track of how busy the application is. Empty poll mode can be enabled by the command line option --empty-poll.

See Power Management chapter in the DPDK Programmer’s Guide for empty poll mode details.

./<build_dir>/examples/dpdk-l3fwd-power -l xxx -n 4 -a 0000:xx:00.0 -a 0000:xx:00.1 \
    -- -p 0x3 -P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1

Where,

--empty-poll: Enable the empty poll mode instead of the original algorithm

--empty-poll="training_flag,med_threshold,high_threshold"

  • training_flag: optional, enables/disables training mode. The default value is 0. If the training_flag is set to 1 (true), the application starts in training mode and prints out the trained threshold values. If the training_flag is set to 0 (false), the application starts in normal mode and uses either the default thresholds or those supplied on the command line. The trained threshold values are specific to the user’s system and may give a better power profile than the default threshold values.

  • med_threshold : optional, sets the empty poll threshold of a modestly busy system state. If this is not supplied, the application will apply the default value of 350000.

  • high_threshold : optional, sets the empty poll threshold of a busy system state. If this is not supplied, the application will apply the default value of 580000.

  • -l : optional, set up the LOW power state frequency index

  • -m : optional, set up the MED power state frequency index

  • -h : optional, set up the HIGH power state frequency index

22.6.1. Empty Poll Mode Example Usage

To initially obtain the ideal thresholds for the system, the training mode should be run first. This is achieved by running the l3fwd-power app with the training flag set to “1”, and the other parameters set to 0.

./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --empty-poll "1,0,0" -P

This will run the training algorithm for x seconds on each core (cores 2 and 3), and then print out the recommended threshold values for those cores. The thresholds should be very similar for each core.

POWER: Bring up the Timer
POWER: set the power freq to MED
POWER: Low threshold is 230277
POWER: MED threshold is 335071
POWER: HIGH threshold is 523769
POWER: Training is Complete for 2
POWER: set the power freq to MED
POWER: Low threshold is 236814
POWER: MED threshold is 344567
POWER: HIGH threshold is 538580
POWER: Training is Complete for 3

Once the values have been measured for a particular system, the app can then be started without the training mode so traffic can start immediately.

./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --empty-poll "0,340000,540000" -P

22.7. Telemetry Mode

The telemetry mode support for l3fwd-power is a standalone mode. In this mode, l3fwd-power does simple l3 forwarding along with calculating empty polls, full polls, and the busy percentage for each forwarding core. The aggregation of these values for all cores is reported as application-level telemetry to the metrics library every 500 ms from the main core.

The busy percentage is calculated by recording the poll_count; when the count reaches a defined value, the total number of cycles it took is measured and compared with minimum and maximum reference cycles, and the busy rate is accordingly set to 0%, 50% or 100%, as in the sketch below.
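A simplified sketch of that classification is shown here; the window size is illustrative and the reference-cycle values are assumed to be calibrated elsewhere, per lcore.

#include <stdint.h>
#include <rte_cycles.h>

#define POLL_WINDOW	10000ULL	/* polls per measurement window (illustrative) */

static uint64_t poll_count;
static uint64_t window_start_tsc;
static uint64_t min_ref_cycles;		/* window cost when all polls are empty */
static uint64_t max_ref_cycles;		/* window cost when all polls carry packets */

static void
update_busy_percent(uint8_t *busy_percent)
{
	if (++poll_count < POLL_WINDOW)
		return;

	uint64_t cycles = rte_rdtsc() - window_start_tsc;

	if (cycles <= min_ref_cycles)
		*busy_percent = 0;	/* window was mostly empty polls */
	else if (cycles >= max_ref_cycles)
		*busy_percent = 100;	/* window dominated by packet work */
	else
		*busy_percent = 50;

	poll_count = 0;
	window_start_tsc = rte_rdtsc();
}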

./<build_dir>/examples/dpdk-l3fwd-power --telemetry -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --telemetry

The new stats empty_poll, full_poll and busy_percent can be viewed by running the script /usertools/dpdk-telemetry-client.py and selecting the menu option Send for global Metrics.

22.8. PMD Power Management Mode

The PMD power management mode support for l3fwd-power is a standalone mode. In this mode, l3fwd-power does simple l3 forwarding along with enabling a power saving scheme on specific port/queue/lcore combinations. The main purpose of this mode is to demonstrate how to use the PMD power management API.

./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- --pmd-mgmt -p 0x0f --config="(0,0,2),(0,1,3)"

Instead of using explicit power management, this traffic-aware mode relies on automatic PMD power management. It is limited to one queue per core, and three power management schemes are available (an API-level sketch follows the list):

baseline

This mode will not enable any power saving features.

monitor

This will use the rte_power_monitor() function to enter a power-optimized state (subject to platform support).

pause

This will use rte_power_pause() or rte_pause() to avoid busy looping when there is no traffic.

scale

This will use the frequency scaling routines available in the librte_power library. The reaction time of the scale mode is longer than that of the pause and monitor modes.
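At the API level, selecting a scheme amounts to enabling PMD power management on each configured Rx queue. The sketch below uses illustrative port/queue/lcore values; the sample performs this per configured queue, as shown in the main() excerpt above.

#include <rte_power_pmd_mgmt.h>

/*
 * Illustrative values: lcore 2 polls queue 0 of port 0. The scheme can be
 * RTE_POWER_MGMT_TYPE_MONITOR, RTE_POWER_MGMT_TYPE_PAUSE or
 * RTE_POWER_MGMT_TYPE_SCALE.
 */
static int
enable_pmd_power_mgmt(void)
{
	return rte_power_ethdev_pmgmt_queue_enable(2 /* lcore */,
			0 /* port */, 0 /* queue */,
			RTE_POWER_MGMT_TYPE_MONITOR);
}

/* rte_power_ethdev_pmgmt_queue_disable(2, 0, 0) reverts the queue on exit. */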

See Power Management chapter in the DPDK Programmer’s Guide for more details on PMD power management.

./<build_dir>/examples/dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --pmd-mgmt=scale

22.9. Setting Uncore Values

Uncore frequency can be adjusted by manipulating the related sysfs entries to set the minimum and maximum uncore values. This is done for each package and die on the SKU. The kernel driver enabling this is available from kernel version 5.6 onwards. Three options are available for setting the uncore frequency; a sketch of the underlying sysfs interface appears at the end of this section:

-u

This will set uncore minimum and maximum frequencies to minimum possible value.

-U

This will set uncore minimum and maximum frequencies to maximum possible value.

-i

This allows you to set a specific uncore frequency index, by setting the uncore frequency to the frequency pointed to by that index. Frequency indexes are spaced 100 MHz apart, from maximum to minimum. Frequency index values are in descending order, i.e., index 0 is the maximum frequency index.

dpdk-l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" -i 1
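For reference, the sysfs entries that these options ultimately drive (exposed by the intel_uncore_frequency kernel driver) can also be written directly. The package/die path in this sketch is an assumption for package 0, die 0; adjust it for the actual topology.

#include <stdio.h>

static int
set_uncore_khz(const char *entry, unsigned int khz)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		"/sys/devices/system/cpu/intel_uncore_frequency/"
		"package_00_die_00/%s", entry);

	f = fopen(path, "w");
	if (f == NULL)
		return -1;
	fprintf(f, "%u\n", khz);
	fclose(f);
	return 0;
}

/*
 * Equivalent of "-i" pinning min and max to one value, e.g. 1.8 GHz
 * (illustrative):
 *
 *	set_uncore_khz("min_freq_khz", 1800000);
 *	set_uncore_khz("max_freq_khz", 1800000);
 */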