..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2018 Intel Corporation.

.. include:: <isonum.txt>

NTB Rawdev Driver
=================


The ``ntb`` rawdev driver provides a non-transparent bridge between two
separate hosts so that they can communicate with each other. Many use
cases can benefit from this, such as fault tolerance and visual
acceleration.

This PMD allows two hosts to handshake for device start and stop, to
allocate memory for the peer to access, and to read and write the memory
allocated by the peer. The PMD also allows doorbell registers to be used
to notify the peer, and scratchpad registers to be used to share
information.


BIOS setting on Intel Xeon
--------------------------

Intel Non-transparent Bridge (NTB) needs special BIOS settings on both systems.
Note that for 4th Generation Intel\ |reg| Xeon\ |reg| Scalable Processors,
the option ``Port Subsystem Mode`` should be changed from ``Gen5`` to ``Gen4 Only``,
followed by a reboot.

- Set ``Non-Transparent Bridge PCIe Port Definition`` for needed PCIe ports
  as ``NTB to NTB`` mode, on both hosts.
- Set ``Enable NTB BARs`` as ``Enabled``, on both hosts.
- Set ``Enable SPLIT BARs`` as ``Disabled``, on both hosts.
- Set ``Imbar1 Size``, ``Imbar2 Size``, ``Embar1 Size`` and ``Embar2 Size``,
  as 12-29 (i.e., 4K-512M) for 2nd Generation Intel\ |reg| Xeon\ |reg| Scalable Processors;
  as 12-51 (i.e., 4K-128PB) for 3rd and 4th Generation Intel\ |reg| Xeon\ |reg| Scalable Processors.
  Note that these BAR sizes should be the same on both hosts.
- Set ``Crosslink Control override`` as ``DSD/USP`` on one host,
  ``USD/DSP`` on another host.
- Set ``PCIe PLL SSC (Spread Spectrum Clocking)`` as ``Disabled``, on both hosts.
  This is a hardware requirement when using Re-timer Cards.


Device Setup
------------

The Intel NTB devices need to be bound to a DPDK-supported kernel driver
before use, such as igb_uio or vfio. The ``dpdk-devbind.py`` script can be used to
show the status of devices and to bind them to a suitable kernel driver. They will
appear under the category of "Misc (rawdev) devices".


Prerequisites
-------------

The NTB PMD needs the kernel PCI driver to support write combining (WC) to get
better performance. The difference can be more than 10 times.
There are two ways to enable WC.

- Insert igb_uio with the ``wc_activate=1`` flag if the igb_uio driver is used.

  .. code-block:: console

     insmod igb_uio.ko wc_activate=1


- Enable WC for the NTB device's BAR 2 and BAR 4 (mapped memory) manually.
  The reference is https://www.kernel.org/doc/html/latest/x86/mtrr.html

  Get the BAR base addresses using ``lspci -vvv -s ae:00.0 | grep Region``.

  .. code-block:: console

     # lspci -vvv -s ae:00.0 | grep Region
     Region 0: Memory at 39bfe0000000 (64-bit, prefetchable) [size=64K]
     Region 2: Memory at 39bfa0000000 (64-bit, prefetchable) [size=512M]
     Region 4: Memory at 39bfc0000000 (64-bit, prefetchable) [size=512M]


  Use the following commands to enable WC.

  .. code-block:: console

     echo "base=0x39bfa0000000 size=0x20000000 type=write-combining" >> /proc/mtrr
     echo "base=0x39bfc0000000 size=0x20000000 type=write-combining" >> /proc/mtrr


  And the results:

  .. code-block:: console

     # cat /proc/mtrr
     reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
     reg01: base=0x07f000000 ( 2032MB), size=   16MB, count=1: uncachable
     reg02: base=0x39bfa0000000 (60553728MB), size=  512MB, count=1: write-combining
     reg03: base=0x39bfc0000000 (60554240MB), size=  512MB, count=1: write-combining


  To disable WC for these regions, use the following commands.

  .. code-block:: console

     echo "disable=2" >> /proc/mtrr
     echo "disable=3" >> /proc/mtrr

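
The ``base`` and ``size`` values written to ``/proc/mtrr`` above can be
derived from the ``lspci`` output. The following is a minimal sketch, not part
of DPDK: the names ``mtrr_entries`` and ``REGION_RE`` are hypothetical, it
assumes the ``Region`` line format shown above with a ``K``, ``M`` or ``G``
size suffix, and it only builds the strings without writing them to
``/proc/mtrr``.

.. code-block:: python

   import re

   # Matches lines such as:
   #   Region 2: Memory at 39bfa0000000 (64-bit, prefetchable) [size=512M]
   REGION_RE = re.compile(
       r"Region (\d+): Memory at ([0-9a-fA-F]+) .*\[size=(\d+)([KMG])\]"
   )
   UNIT = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


   def mtrr_entries(lspci_output, bars=(2, 4)):
       """Return /proc/mtrr write-combining entries for the selected BARs."""
       entries = []
       for line in lspci_output.splitlines():
           match = REGION_RE.search(line)
           if match is None:
               continue
           bar, base, size, unit = match.groups()
           if int(bar) not in bars:
               continue
           # Convert the human-readable size into bytes, e.g. 512M -> 0x20000000.
           size_bytes = int(size) * UNIT[unit]
           entries.append(
               f"base=0x{base} size=0x{size_bytes:x} type=write-combining"
           )
       return entries


   sample = """\
   Region 0: Memory at 39bfe0000000 (64-bit, prefetchable) [size=64K]
   Region 2: Memory at 39bfa0000000 (64-bit, prefetchable) [size=512M]
   Region 4: Memory at 39bfc0000000 (64-bit, prefetchable) [size=512M]
   """

   for entry in mtrr_entries(sample):
       print(entry)

Each printed line is one string that would be echoed into ``/proc/mtrr``.
BAR 0 is skipped by default because only BAR 2 and BAR 4 carry the mapped
memory.
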

Ring Layout
-----------

Since reads and writes to the remote system's memory go through the PCI bus,
a remote read is much more expensive than a remote write. Thus, enqueue and
dequeue on the ntb ring should avoid remote reads. The ring layout for ntb
is as follows:


- Ring Format::

   desc_ring:

      0               16                                              64
      +---------------------------------------------------------------+
      |                        buffer address                         |
      +---------------+-----------------------------------------------+
      | buffer length |                      resv                     |
      +---------------+-----------------------------------------------+

   used_ring:

      0               16              32
      +---------------+---------------+
      | packet length |     flags     |
      +---------------+---------------+


- Ring Layout::

      +------------------------+   +------------------------+
      | used_ring              |   | desc_ring              |
      | +---+                  |   | +---+                  |
      | |   |                  |   | |   |                  |
      | +---+      +--------+  |   | +---+                  |
      | |   | ---> | buffer | <+---+-|   |                  |
      | +---+      +--------+  |   | +---+                  |
      | |   |                  |   | |   |                  |
      | +---+                  |   | +---+                  |
      |  ...                   |   |  ...                   |
      |                        |   |                        |
      |            +---------+ |   |            +---------+ |
      |            | tx_tail | |   |            | rx_tail | |
      | System A   +---------+ |   | System B   +---------+ |
      +------------------------+   +------------------------+
                    <---------traffic---------


- Enqueue and Dequeue

  Based on this ring layout, enqueue reads rx_tail to get the number of free
  buffers, and writes used_ring and tx_tail to tell the peer which buffers
  are filled with data.
  Dequeue reads tx_tail to get the number of packets that have arrived, and
  writes desc_ring and rx_tail to tell the peer about the newly allocated
  buffers.
  In this way, only remote writes happen and remote reads are avoided,
  which gives better performance.

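
The enqueue and dequeue steps above can be illustrated with a small
simulation. This is a sketch under stated assumptions, not the driver's
implementation: the ``Memory`` class, the helper names, the ring size of 4 and
the use of the slot index as the buffer address are all hypothetical
simplifications. System A is the receiver and System B the sender, as in the
ring layout diagram, and ``Memory.read`` asserts that the reader owns the
memory, so the code only runs if no remote read occurs.

.. code-block:: python

   RING_SIZE = 4  # hypothetical; the real ring size is set by the driver


   class Memory:
       """Memory that lives on one host; only its owner may read it."""

       def __init__(self, owner):
           self.owner = owner
           self.cells = {}

       def read(self, who, key):
           # A read by the other host would be a remote read over PCI.
           assert who == self.owner, f"remote read of {key} by {who}"
           return self.cells[key]

       def write(self, who, key, value):
           # Writes are allowed from either host (remote writes are cheap).
           self.cells[key] = value


   mem_a = Memory("A")  # receiver side: used_ring, buffers, tx_tail
   mem_b = Memory("B")  # sender side: desc_ring, rx_tail
   mem_a.write("A", "tx_tail", 0)
   mem_b.write("B", "rx_tail", 0)

   posted = 0    # private to A: buffers handed to the peer so far
   received = 0  # private to A: packets consumed so far
   sent = 0      # private to B: packets written so far


   def post_buffers(count):
       """Receiver A: publish free buffers through desc_ring and rx_tail."""
       global posted
       for _ in range(count):
           slot = posted % RING_SIZE
           # Simplification: the buffer address is just the slot index.
           mem_b.write("A", ("desc", slot), slot)
           posted += 1
       mem_b.write("A", "rx_tail", posted)


   def enqueue(packets):
       """Sender B: fill free buffers, then publish used_ring and tx_tail."""
       global sent
       free = mem_b.read("B", "rx_tail") - sent
       accepted = packets[:free]
       for data in accepted:
           slot = sent % RING_SIZE
           buf = mem_b.read("B", ("desc", slot))
           mem_a.write("B", ("buffer", buf), data)
           mem_a.write("B", ("used", slot), len(data))
           sent += 1
       mem_a.write("B", "tx_tail", sent)
       return len(accepted)


   def dequeue():
       """Receiver A: drain arrived packets and re-post their buffers."""
       global received
       arrived = mem_a.read("A", "tx_tail") - received
       out = []
       for _ in range(arrived):
           slot = received % RING_SIZE
           length = mem_a.read("A", ("used", slot))
           out.append(mem_a.read("A", ("buffer", slot))[:length])
           received += 1
       post_buffers(arrived)
       return out


   post_buffers(RING_SIZE)
   print(enqueue([b"one", b"two", b"three", b"four", b"five"]))
   print(dequeue())

Only four buffers are posted before the first ``enqueue`` call, so the fifth
packet is not accepted until ``dequeue`` has re-posted buffers. The descriptor
and used-entry bit fields from the ring format are not modelled here.
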

Limitation
----------

This PMD is only supported on Intel Xeon Platforms:

- 4th Generation Intel® Xeon® Scalable Processors.
- 3rd Generation Intel® Xeon® Scalable Processors.
- 2nd Generation Intel® Xeon® Scalable Processors.