2019-06-25 11:12:58 +00:00
|
|
|
.. SPDX-License-Identifier: BSD-3-Clause
|
|
|
|
Copyright(c) 2010-2014 Intel Corporation.
|
2017-04-21 10:43:26 +00:00
|
|
|
|
|
|
|
Writing Efficient Code
|
|
|
|
======================
|
|
|
|
|
|
|
|
This chapter provides some tips for developing efficient code using the DPDK.
|
|
|
|
For additional and more general information,
|
|
|
|
please refer to the *Intel® 64 and IA-32 Architectures Optimization Reference Manual*
|
|
|
|
which is a valuable reference to writing efficient code.
|
|
|
|
|
|
|
|
Memory
|
|
|
|
------
|
|
|
|
|
|
|
|
This section describes some key memory considerations when developing applications in the DPDK environment.
|
|
|
|
|
|
|
|
Memory Copy: Do not Use libc in the Data Plane
|
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
Many libc functions are available in the DPDK, via the Linux* application environment.
|
|
|
|
This can ease the porting of applications and the development of the configuration plane.
|
|
|
|
However, many of these functions are not designed for performance.
|
|
|
|
Functions such as memcpy() or strcpy() should not be used in the data plane.
|
|
|
|
To copy small structures, the preference is for a simpler technique that can be optimized by the compiler.
|
|
|
|
Refer to the *VTune™ Performance Analyzer Essentials* publication from Intel Press for recommendations.
|
|
|
|
|
|
|
|
For specific functions that are called often,
|
|
|
|
it is also a good idea to provide a self-made optimized function, which should be declared as static inline.
|
|
|
|
|
|
|
|
The DPDK API provides an optimized rte_memcpy() function.
|
|
|
|
|
|
|
|
Memory Allocation
|
|
|
|
~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
Other functions of libc, such as malloc(), provide a flexible way to allocate and free memory.
|
|
|
|
In some cases, using dynamic allocation is necessary,
|
|
|
|
but it is really not advised to use malloc-like functions in the data plane because
|
|
|
|
managing a fragmented heap can be costly and the allocator may not be optimized for parallel allocation.
|
|
|
|
|
|
|
|
If you really need dynamic allocation in the data plane, it is better to use a memory pool of fixed-size objects.
|
|
|
|
This API is provided by librte_mempool.
|
|
|
|
This data structure provides several services that increase performance, such as memory alignment of objects,
|
|
|
|
lockless access to objects, NUMA awareness, bulk get/put and per-lcore cache.
|
|
|
|
The rte_malloc () function uses a similar concept to mempools.
|
|
|
|
|
|
|
|
Concurrent Access to the Same Memory Area
|
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
Read-Write (RW) access operations by several lcores to the same memory area can generate a lot of data cache misses,
|
|
|
|
which are very costly.
|
|
|
|
It is often possible to use per-lcore variables, for example, in the case of statistics.
|
|
|
|
There are at least two solutions for this:
|
|
|
|
|
|
|
|
* Use RTE_PER_LCORE variables. Note that in this case, data on lcore X is not available to lcore Y.
|
|
|
|
|
|
|
|
* Use a table of structures (one per lcore). In this case, each structure must be cache-aligned.
|
|
|
|
|
|
|
|
Read-mostly variables can be shared among lcores without performance losses if there are no RW variables in the same cache line.
|
|
|
|
|
|
|
|
NUMA
|
|
|
|
~~~~
|
|
|
|
|
|
|
|
On a NUMA system, it is preferable to access local memory since remote memory access is slower.
|
|
|
|
In the DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to create a pool on a specific socket.
|
|
|
|
|
|
|
|
Sometimes, it can be a good idea to duplicate data to optimize speed.
|
|
|
|
For read-mostly variables that are often accessed,
|
|
|
|
it should not be a problem to keep them in one socket only, since data will be present in cache.
|
|
|
|
|
|
|
|
Distribution Across Memory Channels
|
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
Modern memory controllers have several memory channels that can load or store data in parallel.
|
|
|
|
Depending on the memory controller and its configuration,
|
|
|
|
the number of channels and the way the memory is distributed across the channels varies.
|
|
|
|
Each channel has a bandwidth limit,
|
|
|
|
meaning that if all memory access operations are done on the first channel only, there is a potential bottleneck.
|
|
|
|
|
|
|
|
By default, the :ref:`Mempool Library <Mempool_Library>` spreads the addresses of objects among memory channels.
|
|
|
|
|
2018-05-15 09:49:22 +00:00
|
|
|
Locking memory pages
|
|
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
The underlying operating system is allowed to load/unload memory pages at its own discretion.
|
|
|
|
These page loads could impact the performance, as the process is on hold when the kernel fetches them.
|
|
|
|
|
|
|
|
To avoid these you could pre-load, and lock them into memory with the ``mlockall()`` call.
|
|
|
|
|
|
|
|
.. code-block:: c
|
|
|
|
|
|
|
|
if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
|
|
|
|
RTE_LOG(NOTICE, USER1, "mlockall() failed with error \"%s\"\n",
|
|
|
|
strerror(errno));
|
|
|
|
}
|
|
|
|
|
2017-04-21 10:43:26 +00:00
|
|
|
Communication Between lcores
|
|
|
|
----------------------------
|
|
|
|
|
|
|
|
To provide a message-based communication between lcores,
|
|
|
|
it is advised to use the DPDK ring API, which provides a lockless ring implementation.
|
|
|
|
|
|
|
|
The ring supports bulk and burst access,
|
|
|
|
meaning that it is possible to read several elements from the ring with only one costly atomic operation
|
|
|
|
(see :doc:`ring_lib`).
|
|
|
|
Performance is greatly improved when using bulk access operations.
|
|
|
|
|
|
|
|
The code algorithm that dequeues messages may be something similar to the following:
|
|
|
|
|
|
|
|
.. code-block:: c
|
|
|
|
|
|
|
|
#define MAX_BULK 32
|
|
|
|
|
|
|
|
while (1) {
|
|
|
|
/* Process as many elements as can be dequeued. */
|
2018-05-15 09:49:22 +00:00
|
|
|
count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK, NULL);
|
2017-04-21 10:43:26 +00:00
|
|
|
if (unlikely(count == 0))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
my_process_bulk(obj_table, count);
|
|
|
|
}
|
|
|
|
|
2022-09-02 04:40:05 +00:00
|
|
|
PMD
|
|
|
|
---
|
2017-04-21 10:43:26 +00:00
|
|
|
|
|
|
|
The DPDK Poll Mode Driver (PMD) is also able to work in bulk/burst mode,
|
|
|
|
allowing the factorization of some code for each call in the send or receive function.
|
|
|
|
|
|
|
|
Avoid partial writes.
|
|
|
|
When PCI devices write to system memory through DMA,
|
|
|
|
it costs less if the write operation is on a full cache line as opposed to part of it.
|
|
|
|
In the PMD code, actions have been taken to avoid partial writes as much as possible.
|
|
|
|
|
|
|
|
Lower Packet Latency
|
|
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
Traditionally, there is a trade-off between throughput and latency.
|
|
|
|
An application can be tuned to achieve a high throughput,
|
|
|
|
but the end-to-end latency of an average packet will typically increase as a result.
|
|
|
|
Similarly, the application can be tuned to have, on average,
|
|
|
|
a low end-to-end latency, at the cost of lower throughput.
|
|
|
|
|
|
|
|
In order to achieve higher throughput,
|
|
|
|
the DPDK attempts to aggregate the cost of processing each packet individually by processing packets in bursts.
|
|
|
|
|
|
|
|
Using the testpmd application as an example,
|
2022-09-02 04:40:05 +00:00
|
|
|
the burst size can be set on the command line to a value of 32 (also the default value).
|
|
|
|
This allows the application to request 32 packets at a time from the PMD.
|
2017-04-21 10:43:26 +00:00
|
|
|
The testpmd application then immediately attempts to transmit all the packets that were received,
|
2022-09-02 04:40:05 +00:00
|
|
|
in this case, all 32 packets.
|
2017-04-21 10:43:26 +00:00
|
|
|
|
|
|
|
The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
|
|
|
|
This behavior is desirable when tuning for high throughput because
|
2022-09-02 04:40:05 +00:00
|
|
|
the cost of tail pointer updates to both the RX and TX queues can be spread
|
|
|
|
across 32 packets,
|
2017-04-21 10:43:26 +00:00
|
|
|
effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.
|
|
|
|
However, this is not very desirable when tuning for low latency because
|
2022-09-02 04:40:05 +00:00
|
|
|
the first packet that was received must also wait for another 31 packets to be received.
|
|
|
|
It cannot be transmitted until the other 31 packets have also been processed because
|
2017-04-21 10:43:26 +00:00
|
|
|
the NIC will not know to transmit the packets until the TX tail pointer has been updated,
|
2022-09-02 04:40:05 +00:00
|
|
|
which is not done until all 32 packets have been processed for transmission.
|
2017-04-21 10:43:26 +00:00
|
|
|
|
|
|
|
To consistently achieve low latency, even under heavy system load,
|
|
|
|
the application developer should avoid processing packets in bunches.
|
|
|
|
The testpmd application can be configured from the command line to use a burst value of 1.
|
|
|
|
This will allow a single packet to be processed at a time, providing lower latency,
|
|
|
|
but with the added cost of lower throughput.
|
|
|
|
|
|
|
|
Locks and Atomic Operations
|
|
|
|
---------------------------
|
|
|
|
|
2021-02-05 08:48:47 +00:00
|
|
|
This section describes some key considerations when using locks and atomic
|
|
|
|
operations in the DPDK environment.
|
|
|
|
|
|
|
|
Locks
|
|
|
|
~~~~~
|
|
|
|
|
|
|
|
On x86, atomic operations imply a lock prefix before the instruction,
|
2017-04-21 10:43:26 +00:00
|
|
|
causing the processor's LOCK# signal to be asserted during execution of the following instruction.
|
|
|
|
This has a big impact on performance in a multicore environment.
|
|
|
|
|
|
|
|
Performance can be improved by avoiding lock mechanisms in the data plane.
|
|
|
|
It can often be replaced by other solutions like per-lcore variables.
|
|
|
|
Also, some locking techniques are more efficient than others.
|
|
|
|
For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
|
|
|
|
|
2021-02-05 08:48:47 +00:00
|
|
|
Atomic Operations: Use C11 Atomic Builtins
|
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
DPDK generic rte_atomic operations are implemented by __sync builtins. These
|
|
|
|
__sync builtins result in full barriers on aarch64, which are unnecessary
|
|
|
|
in many use cases. They can be replaced by __atomic builtins that conform to
|
|
|
|
the C11 memory model and provide finer memory order control.
|
|
|
|
|
|
|
|
So replacing the rte_atomic operations with __atomic builtins might improve
|
|
|
|
performance for aarch64 machines.
|
|
|
|
|
|
|
|
Some typical optimization cases are listed below:
|
|
|
|
|
|
|
|
Atomicity
|
|
|
|
^^^^^^^^^
|
|
|
|
|
|
|
|
Some use cases require atomicity alone, the ordering of the memory operations
|
|
|
|
does not matter. For example, the packet statistics counters need to be
|
|
|
|
incremented atomically but do not need any particular memory ordering.
|
|
|
|
So, RELAXED memory ordering is sufficient.
|
|
|
|
|
|
|
|
One-way Barrier
|
|
|
|
^^^^^^^^^^^^^^^
|
|
|
|
|
|
|
|
Some use cases allow for memory reordering in one way while requiring memory
|
|
|
|
ordering in the other direction.
|
|
|
|
|
|
|
|
For example, the memory operations before the spinlock lock are allowed to
|
|
|
|
move to the critical section, but the memory operations in the critical section
|
|
|
|
are not allowed to move above the lock. In this case, the full memory barrier
|
|
|
|
in the compare-and-swap operation can be replaced with ACQUIRE memory order.
|
|
|
|
On the other hand, the memory operations after the spinlock unlock are allowed
|
|
|
|
to move to the critical section, but the memory operations in the critical
|
|
|
|
section are not allowed to move below the unlock. So the full barrier in the
|
|
|
|
store operation can use RELEASE memory order.
|
|
|
|
|
|
|
|
Reader-Writer Concurrency
|
|
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
|
|
|
|
Lock-free reader-writer concurrency is one of the common use cases in DPDK.
|
|
|
|
|
|
|
|
The payload or the data that the writer wants to communicate to the reader,
|
|
|
|
can be written with RELAXED memory order. However, the guard variable should
|
|
|
|
be written with RELEASE memory order. This ensures that the store to guard
|
|
|
|
variable is observable only after the store to payload is observable.
|
|
|
|
|
|
|
|
Correspondingly, on the reader side, the guard variable should be read
|
|
|
|
with ACQUIRE memory order. The payload or the data the writer communicated,
|
|
|
|
can be read with RELAXED memory order. This ensures that, if the store to
|
|
|
|
guard variable is observable, the store to payload is also observable.
|
|
|
|
|
2017-04-21 10:43:26 +00:00
|
|
|
Coding Considerations
|
|
|
|
---------------------
|
|
|
|
|
|
|
|
Inline Functions
|
|
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
Small functions can be declared as static inline in the header file.
|
|
|
|
This avoids the cost of a call instruction (and the associated context saving).
|
|
|
|
However, this technique is not always efficient; it depends on many factors including the compiler.
|
|
|
|
|
|
|
|
Branch Prediction
|
|
|
|
~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
The Intel® C/C++ Compiler (icc)/gcc built-in helper functions likely() and unlikely()
|
|
|
|
allow the developer to indicate if a code branch is likely to be taken or not.
|
|
|
|
For instance:
|
|
|
|
|
|
|
|
.. code-block:: c
|
|
|
|
|
|
|
|
if (likely(x > 1))
|
|
|
|
do_stuff();
|
|
|
|
|
|
|
|
Setting the Target CPU Type
|
|
|
|
---------------------------
|
|
|
|
|
2021-02-05 08:48:47 +00:00
|
|
|
The DPDK supports CPU microarchitecture-specific optimizations by means of RTE_MACHINE option.
|
2017-04-21 10:43:26 +00:00
|
|
|
The degree of optimization depends on the compiler's ability to optimize for a specific microarchitecture,
|
|
|
|
therefore it is preferable to use the latest compiler versions whenever possible.
|
|
|
|
|
|
|
|
If the compiler version does not support the specific feature set (for example, the Intel® AVX instruction set),
|
|
|
|
the build process gracefully degrades to whatever latest feature set is supported by the compiler.
|
|
|
|
|
|
|
|
Since the build and runtime targets may not be the same,
|
|
|
|
the resulting binary also contains a platform check that runs before the
|
|
|
|
main() function and checks if the current machine is suitable for running the binary.
|
|
|
|
|
|
|
|
Along with compiler optimizations,
|
|
|
|
a set of preprocessor defines are automatically added to the build process (regardless of the compiler version).
|
|
|
|
These defines correspond to the instruction sets that the target CPU should be able to support.
|