56. Dispatcher Library
56.1. Overview
The purpose of the dispatcher is to help reduce coupling in an Eventdev-based DPDK application.
In particular, the dispatcher addresses a scenario where an application’s modules share the same event device and event device ports, and performs work on the same lcore threads.
The dispatcher replaces the conditional logic that follows an event device dequeue operation, where events are dispatched to different parts of the application, typically based on fields in the rte_event, such as the queue_id, sub_event_type, or sched_type.
Below is an excerpt from a fictitious application consisting of two modules: A and B. In this example, event-to-module routing is based purely on queue id, where module A expects all events sent to one queue id, and module B expects all events sent to two other queue ids.
Note
Event routing may reasonably be done based on other rte_event
fields (or even event user data). Indeed, that’s the very reason to
have match callback functions, instead of a simple queue
id-to-handler mapping scheme. Queue id-based routing serves well in
a simple example.
for (;;) {
	struct rte_event events[MAX_BURST];
	unsigned int n;
	unsigned int i;

	n = rte_event_dequeue_burst(dev_id, port_id, events,
				    MAX_BURST, 0);

	for (i = 0; i < n; i++) {
		const struct rte_event *event = &events[i];

		switch (event->queue_id) {
		case MODULE_A_QUEUE_ID:
			module_a_process(event);
			break;
		case MODULE_B_STAGE_0_QUEUE_ID:
			module_b_process_stage_0(event);
			break;
		case MODULE_B_STAGE_1_QUEUE_ID:
			module_b_process_stage_1(event);
			break;
		}
	}
}
The issue this example attempts to illustrate is that the centralized conditional logic has knowledge of things that should be private to the modules. In other words, this pattern leads to a violation of module encapsulation.
The shared conditional logic contains explicit knowledge about what events should go where. If, for example, module_a_process() is split into two processing stages (a module-internal affair), the shared conditional code must be updated to reflect this change.
The centralized event routing code becomes an issue in larger applications, where modules are developed by different organizations. This pattern also makes module reuse across different applications more difficult. The part of the conditional logic relevant for a particular application may need to be duplicated across many module instantiations (e.g., applications and test setups).
The dispatcher separates the mechanism (routing events to their receiver) from the policy (which events should go where).
The basic operation of the dispatcher is as follows:
- Dequeue a batch of events from the event device.
- For each event, determine which handler should receive the event, using a set of application-provided, per-handler event matching callback functions.
- Deliver the events matching a particular handler to that handler, using its process callback.
Had the above application made use of the dispatcher, the code relevant for its module A might have looked something like this:
static bool
module_a_match(const struct rte_event *event, void *cb_data)
{
	return event->queue_id == MODULE_A_QUEUE_ID;
}

static void
module_a_process_events(uint8_t event_dev_id, uint8_t event_port_id,
			const struct rte_event *events,
			uint16_t num, void *cb_data)
{
	uint16_t i;

	for (i = 0; i < num; i++)
		module_a_process_event(&events[i]);
}

/* In the module's initialization code */
rte_dispatcher_register(dispatcher, module_a_match, NULL,
			module_a_process_events, module_a_data);
Note
Error handling is left out of this and future example code in this chapter.
When the shared conditional logic is removed, a new question arises: which part of the system actually runs the dispatching mechanism? Or phrased differently, what is replacing the function hosting the shared conditional logic (typically launched on all lcores using rte_eal_remote_launch())? To solve this issue, the dispatcher is run as a DPDK Service.
The dispatcher is a layer between the application and the event device in the receive direction. In the transmit (i.e., item of work submission) direction, the application directly accesses the Eventdev core API (e.g., rte_event_enqueue_burst()) to submit new or forwarded events to the event device.
56.2. Dispatcher Creation
A dispatcher is created using the rte_dispatcher_create() function.
The event device must be configured before the dispatcher is created.
Usually, only one dispatcher is needed per event device. A dispatcher handles exactly one event device.
A dispatcher is freed using the rte_dispatcher_free() function. The dispatcher’s service functions must not be running on any lcore at the point of this call.
56.3. Event Port Binding
To be able to dequeue events, the dispatcher must know which event ports are to be used, on all the lcores it uses. The application provides this information using rte_dispatcher_bind_port_to_lcore().
This call is typically made from the part of the application that deals with deployment issues (e.g., iterating lcores and determining which lcore does what), at the time of application initialization.
The rte_dispatcher_unbind_port_from_lcore() function is used to undo this operation.
Multiple lcore threads may not safely use the same event port.
Note
This property (which is a feature, not a bug) is inherited from the core Eventdev APIs.
Event ports cannot safely be bound or unbound while the dispatcher’s service function is running on any lcore.
56.4. Event Handlers
The dispatcher handler is an interface between the dispatcher and an application module, used to route events to the appropriate part of the application.
56.4.1. Handler Registration
The event handler interface consists of two function pointers:
- The rte_dispatcher_match_t callback, whose job is to decide whether an event is to be the property of this handler.
- The rte_dispatcher_process_t callback, which is used by the dispatcher to deliver matched events.
An event handler registration is valid on all lcores.
The functions pointed to by the match and process callbacks reside in the application’s domain logic, with one or more handlers per application module.
A module may use more than one event handler, for convenience or to further decouple sub-modules. However, the dispatcher may impose an upper limit on the number of handlers. In addition, installing a large number of handlers increases dispatcher overhead, although this does not necessarily translate to a system-level performance degradation. See the section on Event Clustering for more information.
Handler registration and unregistration cannot safely be done while the dispatcher’s service function is running on any lcore.
56.4.2. Event Matching
A handler’s match callback function decides if an event should be delivered to this handler, or not.
An event is routed to no more than one handler. Thus, if a match function returns true, no further match functions will be invoked for that event.
Match functions must not depend on being invoked in any particular order (e.g., in the handler registration order).
Events failing to match any handler are dropped, and the ev_drop_count counter is updated accordingly.
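The matching rules above (at most one handler per event, and a drop counter for unmatched events) can be illustrated with a small, self-contained model. This is a sketch only: it does not use the DPDK API, every toy_ name is hypothetical, and a single cb_data pointer per handler is a simplification (the real registration call takes separate match and process data). The array order used here is an artifact of the sketch; as stated above, real match functions must not rely on any invocation order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_event; only the field used here. */
struct toy_event {
	uint8_t queue_id;
};

typedef bool (*toy_match_t)(const struct toy_event *event, void *cb_data);
typedef void (*toy_process_t)(const struct toy_event *event, void *cb_data);

/* Hypothetical handler: one match and one process callback. */
struct toy_handler {
	toy_match_t match;
	toy_process_t process;
	void *cb_data;
};

/*
 * Offer one event to the handlers. The first handler whose match
 * callback returns true receives the event, and no further match
 * callbacks are invoked. If no handler matches, the event is dropped
 * and the drop counter is incremented.
 */
static bool
toy_dispatch_one(const struct toy_handler *handlers, size_t num_handlers,
		 const struct toy_event *event, uint64_t *drop_count)
{
	size_t i;

	for (i = 0; i < num_handlers; i++) {
		if (handlers[i].match(event, handlers[i].cb_data)) {
			handlers[i].process(event, handlers[i].cb_data);
			return true;
		}
	}

	(*drop_count)++;

	return false;
}

/* Example module state: the queue id it owns and a delivery counter. */
struct toy_module {
	uint8_t queue_id;
	unsigned int received;
};

static bool
toy_module_match(const struct toy_event *event, void *cb_data)
{
	const struct toy_module *module = cb_data;

	return event->queue_id == module->queue_id;
}

static void
toy_module_process(const struct toy_event *event, void *cb_data)
{
	struct toy_module *module = cb_data;

	(void)event;
	module->received++;
}
```

With two such modules owning queue ids 0 and 1, an event carrying any other queue id matches neither handler and only increments the drop counter.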
56.4.3. Event Delivery
The handler callbacks are invoked by the dispatcher’s service function, upon the arrival of events at the event ports bound to the running service lcore.
A particular event is delivered to at most one handler.
The application must not depend on all match callback invocations for a particular event batch being made before any process calls are made. For example, if the dispatcher dequeues two events from the event device, it may choose to find out the destination for the first event, and deliver it, and then continue to find out the destination for the second, and then deliver that event as well. The dispatcher may also choose a strategy where no event is delivered until the destination handlers for both events have been determined.
The events provided in a single process call always belong to the same event port dequeue burst.
56.4.4. Event Clustering
The dispatcher maintains the order of events destined for the same handler.
Order here refers to the order in which the events were delivered from the event device to the dispatcher (i.e., in the event array populated by rte_event_dequeue_burst()), in relation to the order in which the dispatcher delivers these events to the application.
The dispatcher does not guarantee to maintain the order of events delivered to different handlers.
For example, assume that MODULE_A_QUEUE_ID expands to the value 0, and MODULE_B_STAGE_0_QUEUE_ID expands to the value 1. Then consider a scenario where the following events are dequeued from the event device (qid is short for event queue id).
[e0: qid=1], [e1: qid=1], [e2: qid=0], [e3: qid=1]
The dispatcher may deliver the events in the following manner:
module_b_stage_0_process([e0: qid=1], [e1: qid=1])
module_a_process([e2: qid=0])
module_b_stage_0_process([e3: qid=1])
The dispatcher may also choose to cluster (group) all events destined for module_b_stage_0_process() into one array:
module_b_stage_0_process([e0: qid=1], [e1: qid=1], [e3: qid=1])
module_a_process([e2: qid=0])
Here, the event e2 is reordered and placed behind e3, from a delivery order point of view. This kind of reshuffling is allowed, since the events are destined for different handlers.
The dispatcher may also deliver e2 before the three events destined for module B.
An example of what the dispatcher may not do is to reorder event e1 so that it precedes e0 in the array passed to module B’s stage 0 process callback.
Although clustering requires some extra work for the dispatcher, it leads to fewer process function calls. In addition, and likely more importantly, it improves temporal locality of memory accesses to handler-specific data structures in the application, which in turn may lead to fewer cache misses and improved overall performance.
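The ordering guarantee can be modeled with a stable grouping pass over a burst. The following is a self-contained sketch, not the dispatcher's actual implementation: all toy_ names are hypothetical, routing is hard-coded on queue id (0 for module A, 1 for module B's stage 0, as in the example above), and the simple nested loop favors clarity over speed.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_HANDLERS 2

/* Hypothetical stand-in for struct rte_event. */
struct toy_event {
	uint8_t queue_id;
	unsigned int seq; /* position in the dequeued burst */
};

/* Handler index for an event, or -1 if no handler matches. */
static int
toy_handler_for(const struct toy_event *event)
{
	if (event->queue_id < TOY_NUM_HANDLERS)
		return event->queue_id;

	return -1;
}

/*
 * Copy the events of a burst into 'clustered', grouped by handler.
 * Within each group, events keep their relative burst order (the
 * guarantee described above). The order between groups is an
 * arbitrary choice of this sketch. 'counts[h]' receives the number of
 * events destined for handler h. Returns the total number of events
 * copied; unmatched events are left out.
 */
static size_t
toy_cluster(const struct toy_event *events, size_t num,
	    struct toy_event *clustered, size_t counts[TOY_NUM_HANDLERS])
{
	size_t out = 0;
	size_t h;
	size_t i;

	for (h = 0; h < TOY_NUM_HANDLERS; h++) {
		counts[h] = 0;

		for (i = 0; i < num; i++) {
			if (toy_handler_for(&events[i]) == (int)h) {
				clustered[out++] = events[i];
				counts[h]++;
			}
		}
	}

	return out;
}
```

Applied to the burst e0..e3 above, this sketch places e2 first (module A), followed by e0, e1 and e3 (module B, stage 0), which corresponds to the variant where e2 is delivered before the three events destined for module B.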
56.5. Finalize
The dispatcher may be configured to notify one or more parts of the application when the matching and processing of a batch of events has completed.
The rte_dispatcher_finalize_register() call is used to register a finalize callback. The function rte_dispatcher_finalize_unregister() is used to remove a callback.
The finalize hook may be used by a set of event handlers (in the same module, or in a set of cooperating modules) sharing an event output buffer, since it allows for flushing of the buffers at the last possible moment. In particular, it allows for buffering of RTE_EVENT_OP_FORWARD events, which must be flushed before the next rte_event_dequeue_burst() call is made (assuming implicit release is employed).
The following is an example with an application-defined event output buffer (the event_buffer):
static void
finalize_batch(uint8_t event_dev_id, uint8_t event_port_id,
	       void *cb_data)
{
	struct event_buffer *buffer = cb_data;
	unsigned lcore_id = rte_lcore_id();
	struct event_buffer_lcore *lcore_buffer =
		&buffer->lcore_buffer[lcore_id];

	event_buffer_lcore_flush(lcore_buffer);
}

/* In the module's initialization code */
rte_dispatcher_finalize_register(dispatcher, finalize_batch,
				 shared_event_buffer);
The dispatcher does not track any relationship between a handler and a finalize callback, and all finalize callbacks will be called, if (and only if) at least one event was dequeued from the event device.
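The calling rule for finalize callbacks can be sketched with a toy model of one service function iteration. This is an illustration under stated assumptions, not the dispatcher's code: the toy_ types and functions are hypothetical, the "process" step simply forwards every event into an output buffer, and the flush only updates counters where a real application would enqueue to the event device.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_BUFFER_CAPACITY 32

/* Hypothetical stand-in for struct rte_event. */
struct toy_event {
	uint8_t queue_id;
};

/* Hypothetical application-defined event output buffer. */
struct toy_event_buffer {
	struct toy_event events[TOY_BUFFER_CAPACITY];
	size_t count;          /* events currently buffered */
	size_t flushed_events; /* events flushed so far */
	unsigned int flushes;  /* number of flush calls */
};

typedef void (*toy_finalize_t)(void *cb_data);

struct toy_finalizer {
	toy_finalize_t fun;
	void *cb_data;
};

/* Finalize callback: flush whatever the handlers have buffered. */
static void
toy_buffer_flush(void *cb_data)
{
	struct toy_event_buffer *buffer = cb_data;

	/* A real application would enqueue the events here. */
	buffer->flushed_events += buffer->count;
	buffer->count = 0;
	buffer->flushes++;
}

static void
toy_buffer_add(struct toy_event_buffer *buffer,
	       const struct toy_event *event)
{
	/* Make room first if the buffer is full. */
	if (buffer->count == TOY_BUFFER_CAPACITY)
		toy_buffer_flush(buffer);

	buffer->events[buffer->count++] = *event;
}

/*
 * One iteration of the modeled service function: process the dequeued
 * events, then call every finalize callback if (and only if) at least
 * one event was dequeued.
 */
static void
toy_service_iteration(const struct toy_event *dequeued, size_t num,
		      struct toy_event_buffer *out,
		      const struct toy_finalizer *finalizers,
		      size_t num_finalizers)
{
	size_t i;

	/* Stand-in for process callbacks forwarding every event. */
	for (i = 0; i < num; i++)
		toy_buffer_add(out, &dequeued[i]);

	if (num == 0)
		return;

	for (i = 0; i < num_finalizers; i++)
		finalizers[i].fun(finalizers[i].cb_data);
}
```

In this model, an iteration that dequeues three events ends with exactly one flush covering all three, while an iteration that dequeues nothing triggers no flush at all.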
Finalize callback registration and unregistration cannot safely be done while the dispatcher’s service function is running on any lcore.
56.6. Service
The dispatcher is a DPDK service, and is managed in a manner similar to other DPDK services (e.g., an Event Timer Adapter).
Below is an example of how to configure a particular lcore to serve as a service lcore, and to map an already-configured dispatcher (referenced by the dispatcher pointer) to that lcore.
static void
launch_dispatcher_core(struct rte_dispatcher *dispatcher,
		       unsigned lcore_id)
{
	uint32_t service_id;

	rte_service_lcore_add(lcore_id);

	rte_dispatcher_service_id_get(dispatcher, &service_id);

	rte_service_map_lcore_set(service_id, lcore_id, 1);

	rte_service_lcore_start(lcore_id);

	rte_service_runstate_set(service_id, 1);
}
As the final step, the dispatcher must be started.
rte_dispatcher_start(dispatcher);
56.6.1. Multi Service Dispatcher Lcores
In an Eventdev application, most (or all) compute-intensive and performance-sensitive processing is done in an event-driven manner, where CPU cycles spent on application domain logic are the direct result of items of work (i.e., rte_event events) dequeued from an event device.
In light of this, it makes sense, at least in principle, to have the dispatcher service be the only DPDK service on all lcores used for packet processing.
However, there is nothing in DPDK that prevents colocating other services with the dispatcher service on the same lcore.
Tasks that, prior to the introduction of the dispatcher into the application, were performed on the lcore even though no events were received are prime targets for being converted into such auxiliary services, running on the dispatcher core set.
An example of such a task would be the management of a per-lcore timer wheel (i.e., calling rte_timer_manage()).
Applications employing Read-Copy-Update (RCU) (or a similar technique) may opt to factor quiescent state signaling (e.g., calling rte_rcu_qsbr_quiescent()) out into a separate service, to ensure that resource reclamation occurs even when some lcores currently do not process any events.
If services other than the dispatcher service are mapped to a service lcore, it’s important that those services are well-behaved and don’t interfere with event processing to the extent that the system’s throughput and/or latency requirements are at risk of not being met.
In particular, to avoid jitter, they should have a small upper bound for the maximum amount of time spent in a single service function call.
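One way to keep the time spent in a single service function call small is to give each call a fixed work budget. The sketch below is a hypothetical, self-contained model: the toy_ names and the budget value are assumptions chosen for illustration, and the function signature is only loosely modeled on a DPDK service callback (a user data pointer in, an integer status out).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed per-call budget; a real value depends on latency targets. */
#define TOY_SERVICE_BUDGET 8

/* Hypothetical backlog of auxiliary work items. */
struct toy_backlog {
	size_t pending; /* work items not yet handled */
	size_t done;    /* work items handled so far */
};

/*
 * Handle at most TOY_SERVICE_BUDGET work items per call, and leave
 * the rest for later calls. This bounds the time the lcore is kept
 * away from event dispatching, regardless of the backlog size.
 */
static int32_t
toy_aux_service(void *userdata)
{
	struct toy_backlog *backlog = userdata;
	size_t budget = TOY_SERVICE_BUDGET;

	while (backlog->pending > 0 && budget > 0) {
		backlog->pending--;
		backlog->done++;
		budget--;
	}

	return 0;
}
```

With a backlog of 20 items, three calls are needed to drain it, and the dispatcher service gets a chance to run between each of them.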
An example of a scenario with a more CPU-heavy colocated service is a low-lcore count deployment, where the event device lacks the RTE_EVENT_ETH_RX_ADAPTER_CAP_INTERNAL_PORT capability (and thus requires software to feed incoming packets into the event device). In this case, the best performance may be achieved if the Event Ethernet RX and/or TX Adapters are mapped to lcores also used for event dispatching, since otherwise the adapter lcores would have a lot of idle CPU cycles.