What is DPDK
DPDK is the Data Plane Development Kit, a set of libraries that accelerate packet processing workloads on a wide variety of CPU architectures. It is designed to run on x86, POWER and ARM processors. Its poll-mode drivers move packet processing from the operating system kernel to processes running in user space. This offloading achieves higher computing efficiency and higher packet throughput than is possible with the interrupt-driven processing provided by the kernel.
Why DPDK for voipmonitor
Packet sniffing in the Linux kernel is driven by IRQ interrupts: every packet (or, if the driver supports it, every batch of packets) has to be handled by an interrupt, which limits throughput to around 3 Gbit/s on 10 Gbit cards (depending on the CPU). DPDK reads packets directly in user space without interrupts, which allows faster packet reading (so-called poll-mode reading). It needs some tweaks to the operating system (CPU affinity / NOHZ kernel), because the reader thread is sensitive to any scheduler delays, which can occur on an overloaded or misconfigured system. At a 6 Gbit/s rate with 3,000,000 packets per second, even slight delays can cause packet drops.
DPDK version >= 21.08.0 is required - download the latest version from:
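To put the numbers above in perspective, here is a small worked calculation. It only does arithmetic on the figures quoted in the text (6 Gbit/s, 3,000,000 packets per second); the function names are made up for this example.

```shell
#!/bin/sh
# Time available to handle one packet on a single polling core.
# $1 = packets per second
packet_budget_ns() {
    echo $(( 1000000000 / $1 ))
}

# Average packet size for a given load.
# $1 = bits per second, $2 = packets per second
avg_packet_bytes() {
    echo $(( $1 / 8 / $2 ))
}

echo "budget: $(packet_budget_ns 3000000) ns per packet"
echo "average packet: $(avg_packet_bytes 6000000000 3000000) bytes"
```

At roughly 333 ns per packet, a scheduler delay of even a fraction of a millisecond means thousands of packets have to wait in the NIC ring buffer, which is why the tuning described in this document matters.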
How it works
On supported NICs (https://core.dpdk.org/supported/) the Ethernet port needs to be unbound from the kernel and bound to DPDK. The command for it is:
- no special driver is needed - Debian 10/11 already supports this out of the box
- bind/unbind means that once you unbind a NIC port from the kernel you cannot use it within the operating system - the port disappears (you will not see eth1, for example)
- you can unbind from DPDK and bind back to the kernel so that eth1 can be used again
- DPDK references a NIC port by its PCI address, which you can get from the "dpdk-devbind.py -s" command, for example
list of available network devices:
dpdk-devbind.py -s
Network devices using kernel driver
===================================
0000:0b:00.0 'NetXtreme II BCM5709 Gigabit Ethernet 1639' if=enp11s0f0 drv=bnx2 unused= *Active*
0000:0b:00.1 'NetXtreme II BCM5709 Gigabit Ethernet 1639' if=enp11s0f1 drv=bnx2 unused=
0000:1f:00.0 'Ethernet Controller 10-Gigabit X540-AT2 1528' if=ens3f0 drv=ixgbe unused=
0000:1f:00.1 'Ethernet Controller 10-Gigabit X540-AT2 1528' if=ens3f1 drv=ixgbe unused=
bind both 10 Gbit ports to the vfio-pci driver (this driver is available by default on Debian >= 10)
modprobe vfio-pci
dpdk-devbind.py -b vfio-pci 0000:1f:00.0 0000:1f:00.1
bind the second port (0000:1f:00.1) back to the kernel:
dpdk-devbind.py -b ixgbe 0000:1f:00.1
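If you do not want to copy PCI addresses by hand, the status listing can be filtered by driver. This is a minimal sketch: pci_for_driver is a helper name invented for this example, and it assumes the "drv=<name>" column format shown in the listing above.

```shell
#!/bin/sh
# List PCI addresses of devices currently bound to a given driver.
# Reads "dpdk-devbind.py -s" style output on stdin.
# $1 = driver name, e.g. ixgbe or vfio-pci
pci_for_driver() {
    awk -v want="drv=$1" '{
        # the PCI address is the first field; look for drv=<name> in the rest
        for (i = 2; i <= NF; i++)
            if ($i == want) { print $1; break }
    }'
}

# Usage on a real system (needs dpdk-devbind.py in PATH):
#   dpdk-devbind.py -s | pci_for_driver ixgbe
```

The printed addresses can then be passed to dpdk-devbind.py -b as in the commands above.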
On some systems vfio-pci does not work for 10 Gbit cards - instead, igb_uio (for Intel cards) needs to be loaded, together with special kernel parameters:
/etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="iommu=pt intel_iommu=on"
Loading igb_uio for X540-AT2 4 port 10Gbit card (if vfio does not work)
More information about drivers:
dpdk is now ready to be used by voipmonitor
DPDK requires huge pages which can be configured in two ways:
- 16 GB of huge pages (16 x 1 GB) allocated on NUMA node0 (the first CPU, which handles the NIC card)
echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
or permanently add to /etc/default/grub: default_hugepagesz=1G hugepagesz=1G hugepages=16 - but this spreads the huge pages evenly across all NUMA nodes, which you might not need, because DPDK needs huge pages only for its mbuf pool, which only has to be allocated on the NUMA node that handles the NIC card.
With more than one physical CPU, turn off automatic NUMA balancing, which causes memory latency
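The per-node sysfs path is easy to mistype, so here is a small sketch that builds it and checks the arithmetic. The helper names are invented for this example; the path layout is the one used in the echo command above.

```shell
#!/bin/sh
# Build the sysfs path for a per-NUMA-node huge page reservation.
# $1 = NUMA node number, $2 = huge page size in kB (1048576 = 1 GB pages)
hugepage_sysfs_path() {
    echo "/sys/devices/system/node/node${1}/hugepages/hugepages-${2}kB/nr_hugepages"
}

# Total reserved memory in GB.
# $1 = number of pages, $2 = huge page size in kB
hugepage_total_gb() {
    echo $(( $1 * $2 / 1048576 ))
}

path=$(hugepage_sysfs_path 0 1048576)
echo "path: $path"
echo "16 pages of 1 GB = $(hugepage_total_gb 16 1048576) GB"
# Show the current reservation if the file exists (read-only, no root needed)
if [ -r "$path" ]; then
    echo "currently reserved: $(cat "$path") pages"
fi
```

Reading the file back after the echo is a quick way to confirm that the kernel could actually reserve all requested pages; on a fragmented system it may grant fewer.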
echo 0 > /proc/sys/kernel/numa_balancing
Disable transparent huge pages which can cause latency or high TLB shootdowns
echo never > /sys/kernel/mm/transparent_hugepage/enabled
or permanently - add transparent_hugepage=never to /etc/default/grub
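To verify both settings, you can read the same files back. The sketch below assumes the usual format of the transparent_hugepage file, where the active mode is shown in square brackets; thp_active_mode is a name invented for this example.

```shell
#!/bin/sh
# Extract the active mode from a transparent_hugepage "enabled" line,
# e.g. "always madvise [never]" -> "never".
thp_active_mode() {
    echo "$1" | sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

thp_file=/sys/kernel/mm/transparent_hugepage/enabled
nb_file=/proc/sys/kernel/numa_balancing

# Both files are world-readable; skip them quietly if they do not exist
if [ -r "$thp_file" ]; then
    echo "THP mode: $(thp_active_mode "$(cat "$thp_file")")"
fi
if [ -r "$nb_file" ]; then
    echo "numa_balancing: $(cat "$nb_file")"
fi
```

With the settings from this section applied, the expected values are "never" and "0".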
- The ideal configuration is for the sniffer's DPDK reader and worker threads to run on dedicated CPU cores, with all other processes prevented from ever running on those cores. This can be configured manually for every process or system-wide by kernel parameters:
add to /etc/default/grub
- we tell the kernel that no processes can run on cores 2 and 30 (in our case 2 and 30 are one physical core and its hyperthread sibling). In voipmonitor.conf: dpdk_read_thread_cpu_affinity = 2, dpdk_worker_thread_cpu_affinity = 30
This was proven to work for 6 Gbit/s of traffic on an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz. For higher traffic or a less powerful CPU you might need to put the reader and worker threads on two separate physical cores.
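Which core numbers form a hyperthread pair differs between machines, so do not assume 2 and 30 on your hardware. A minimal sketch for looking it up, using the kernel's thread_siblings_list file in sysfs (thread_siblings is a helper name invented here; the optional second argument exists only to make it testable):

```shell
#!/bin/sh
# Print the hyperthread siblings of a CPU core, e.g. "2,30".
# $1 = core number
# $2 = sysfs CPU directory (optional, defaults to the real one)
thread_siblings() {
    dir=${2:-/sys/devices/system/cpu}
    cat "$dir/cpu$1/topology/thread_siblings_list"
}

# Usage on the target machine:
#   thread_siblings 2
```

The isolation parameter itself is not shown in the text above. On Linux this is usually done with isolcpus= (for example isolcpus=2,30); treat that as an assumption rather than the exact line from the original setup.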
- NOHZ kernel
For best performance, use the data cores as isolated CPUs and operate them in tickless mode on kernel version 4.4 or above. For this, compile the kernel with CONFIG_NO_HZ_FULL=y (default Debian kernels do not have this option). We were able to achieve stable reading from the NIC without packet loss (6 Gbit/s, 3,000,000 packets/s), but for higher traffic this might be needed.
The CONFIG_NO_HZ_FULL Linux kernel build option is used to configure a tickless kernel. The idea is to configure certain processor cores to operate in tickless mode so that these cores do not receive any periodic interrupts. These cores run dedicated tasks (no other tasks are scheduled on them, which removes the need to send a scheduling tick). A CONFIG_HZ based timer interrupt invalidates the L1 cache on the core, and this can degrade dataplane performance by a few percentage points (to be quantified, but estimated at 1-3%). Running tickless typically means getting 1 timer interrupt per second instead of 1000 per second.
Add to /etc/default/grub
nohz=on nohz_full=2,30 rcu_nocbs=2,30 rcu_nocb_poll clocksource=tsc
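Before relying on nohz_full= it is worth checking that the running kernel was built with the option at all. A minimal sketch, assuming the kernel config is installed as /boot/config-<version> (as on Debian); kernel_has_nohz_full is a name invented for this example.

```shell
#!/bin/sh
# Return success if a kernel config file enables CONFIG_NO_HZ_FULL.
# $1 = path to the config file (defaults to the running kernel's config)
kernel_has_nohz_full() {
    cfg=${1:-/boot/config-$(uname -r)}
    grep -q '^CONFIG_NO_HZ_FULL=y' "$cfg" 2>/dev/null
}

if kernel_has_nohz_full; then
    echo "CONFIG_NO_HZ_FULL is enabled"
    # On such kernels this file lists the tickless cores after reboot
    cat /sys/devices/system/cpu/nohz_full 2>/dev/null || true
else
    echo "CONFIG_NO_HZ_FULL=y not found in the kernel config"
fi
```

If the option is missing, the nohz_full= parameter is unlikely to have any effect, and a custom kernel build is needed as described above.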
other kernel tweaks
Add to /etc/default/grub
cpuidle.off=1 skew_tick=1 acpi_irq_nobalance idle=poll transparent_hugepage=never audit=0 nosoftlockup mce=ignore_ce mitigations=off selinux=0 nmi_watchdog=0
We are not sure whether these have any impact, but they were recommended during DPDK implementation and testing (be aware that mitigations=off turns off the security mitigations for known CPU vulnerabilities).
- dpdk_read_thread_cpu_affinity sets the CPU core on which the reader (polling the NIC for packets) will run.
- dpdk_worker_thread_cpu_affinity sets the CPU core on which the worker will run - it should be the hyperthread sibling of the core you set for dpdk_read_thread_cpu_affinity
- dpdk_pci_device - the interface that will be used for sniffing packets
- it is important to lock the reader and worker threads to particular CPU cores so that the sniffer will not use those cores for other threads
- with more than one NUMA node (two or more physical CPUs), always choose CPU cores for the reader and worker threads that are on the same NUMA node as the NIC PCI card
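After editing /etc/default/grub, regenerating the GRUB config and rebooting, it is easy to miss a parameter that did not make it to the kernel. The following sketch compares a list of expected parameters with /proc/cmdline; missing_kernel_params is a helper name invented for this example.

```shell
#!/bin/sh
# Print every expected parameter that is missing from a kernel command line.
# $1 = space-separated list of expected parameters
# $2 = command line to check (optional, defaults to /proc/cmdline)
missing_kernel_params() {
    cmdline=${2:-$(cat /proc/cmdline 2>/dev/null)}
    for p in $1; do
        case " $cmdline " in
            *" $p "*) ;;        # present as a whole word
            *) echo "$p" ;;     # not found
        esac
    done
}

# Check a few of the parameters used in this document
missing_kernel_params "nohz_full=2,30 rcu_nocbs=2,30 transparent_hugepage=never"
```

No output means all listed parameters are active on the running kernel.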
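To find out which NUMA node the NIC sits on and which cores belong to that node, the kernel exposes both in sysfs. A minimal sketch (nic_numa_cpus is a name invented here; the optional second argument exists only for testing, and treating a reported node of -1 as node 0 is an assumption for single-node systems):

```shell
#!/bin/sh
# Print the NUMA node of a PCI device and the CPU cores on that node.
# $1 = PCI address, e.g. 0000:04:00.0
# $2 = sysfs root (optional, defaults to /sys)
nic_numa_cpus() {
    root=${2:-/sys}
    node=$(cat "$root/bus/pci/devices/$1/numa_node") || return 1
    # -1 means no NUMA affinity is reported; assume node 0 in that case
    if [ "$node" -lt 0 ]; then
        node=0
    fi
    echo "node=$node cpus=$(cat "$root/devices/system/node/node$node/cpulist")"
}

# Usage on the target machine:
#   nic_numa_cpus 0000:04:00.0
```

Pick the reader and worker cores from the printed cpus list.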
voipmonitor.conf:
interface = dpdk:0
dpdk = yes
dpdk_read_thread_cpu_affinity = 2
dpdk_worker_thread_cpu_affinity = 30
dpdk_pci_device = 0000:04:00.0
thread_affinity = 1,3-5,4 ; this sets the CPU affinity for voipmonitor. It is automatically set to all CPU cores except dpdk_read_thread_cpu_affinity and dpdk_worker_thread_cpu_affinity. Using this option overrides the automatic selection. You normally do not want to change this unless you decide to leave some cores dedicated to other important processes on your server or you want to hold the sniffer on a particular NUMA node.
dpdk_nb_rx = 4096 ; default is 4096 if not specified. This is the ring buffer on the NIC port. The maximum for Intel X540 is 4096 but it can be larger for other cards. You can get the maximum with ethtool -g eth1
dpdk_nb_tx = 1024 ; we do not need a ring buffer for sending, but DPDK wants to have one - default is 1024
dpdk_nb_mbufs = 1024 ; number of packets, multiplied by 1024, buffered between reader and worker (buffer size). Each packet takes around 2 kB, which means that it will allocate 2 GB of RAM by default. A higher value (4096) is recommended for >= 5 Gbit traffic
dpdk_pkt_burst = 32 ; do not change this unless you know exactly what you are doing
dpdk_mempool_cache_size = 512 ; size of the cache for the DPDK mempool (do not change this unless you know exactly what you are doing)
dpdk_memory_channels = 4 ; number of memory channels - if not specified, DPDK uses its default value (TODO: we are not sure if it tries to guess it or what the default is)
dpdk_force_max_simd_bitwidth = 512 ; default is not set - if your CPU supports AVX-512 and you have compiled DPDK with AVX-512 support you can try to enable this and set it to 512
dpdk_ring_size = ; number of packets * 1024 in the ring buffer which holds references to mbuf structures between the worker thread and voipmonitor's packet buffer. If not specified it equals dpdk_nb_mbufs
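Because dpdk_nb_mbufs directly determines how much memory the mbuf pool takes, it helps to compute it before raising the value. This is a rough estimate only, based on the "around 2 kB per packet" figure above; mbuf_pool_mb is a name invented for this example.

```shell
#!/bin/sh
# Approximate RAM used by the mbuf pool, in MB.
# $1 = dpdk_nb_mbufs (in units of 1024 packets)
# $2 = bytes per packet buffer (optional, defaults to about 2 kB)
mbuf_pool_mb() {
    per_pkt=${2:-2048}
    echo $(( $1 * 1024 * per_pkt / 1048576 ))
}

echo "dpdk_nb_mbufs=1024 -> $(mbuf_pool_mb 1024) MB"
echo "dpdk_nb_mbufs=4096 -> $(mbuf_pool_mb 4096) MB"
```

With dpdk_nb_mbufs = 4096 the pool needs about 8 GB, which still fits into the 16 GB of huge pages reserved on the NIC's NUMA node earlier in this document.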