Program type `BPF_PROG_TYPE_XDP`

v4.8

XDP (Express Data Path) programs can attach to network devices and are called for every incoming (ingress) packet received by that network device. XDP programs can take quite a large number of actions, most prominent of which are manipulation of the packet, dropping the packet, redirecting it and letting it pass to the network stack.

Notable use cases for XDP programs include DDoS protection, load balancing, and high-throughput packet filtering. If loaded with native driver support, XDP programs are called just after the packet is received but before memory is allocated for a socket buffer. This call site makes XDP programs extremely performant, especially where much of the traffic is forwarded or dropped: other eBPF program types and techniques run only after the relatively expensive socket buffer allocation has taken place, which is wasted work if the packet is then discarded.

Usage

XDP programs are typically put into an ELF section prefixed with xdp. The kernel calls the XDP program with an xdp_md context. The return value indicates what action the kernel should take with the packet. The following values are permitted:

XDP_ABORTED - Signals that an unrecoverable error has taken place. Returning this action causes the kernel to trigger the xdp_exception tracepoint and print a line to the trace log, which allows such occurrences to be debugged. It is also expensive, so it should not be used in production without consideration.
XDP_DROP - Discards the packet. Since the packet is dropped very early, it will be invisible to tools like tcpdump. Consider recording drops using a custom feedback mechanism to maintain visibility.
XDP_PASS - Pass the packet to the network stack. The packet can be manipulated beforehand.
XDP_TX - Send the packet back out the same network port it arrived on. The packet can be manipulated beforehand.
XDP_REDIRECT - Redirect the packet to one of a number of locations. The packet can be manipulated beforehand.

XDP_REDIRECT should not be returned on its own, only in combination with a helper function call. A number of helper functions can be used to redirect the current packet. These set hidden values in the context to inform the kernel which redirection action to take after the program exits.

Packets can be redirected in the following ways:

The packet can be redirected to egress on a different interface than where it entered (like XDP_TX but for a different interface). This can be done using the bpf_redirect helper (not recommended) or the bpf_redirect_map helper in combination with a BPF_MAP_TYPE_DEVMAP or BPF_MAP_TYPE_DEVMAP_HASH map.
The packet can be redirected to another CPU for further processing using the bpf_redirect_map helper in combination with a BPF_MAP_TYPE_CPUMAP map.
The packet can be redirected to userspace, bypassing the kernel network stack, using the bpf_redirect_map helper in combination with a BPF_MAP_TYPE_XSKMAP map.

Context

XDP programs are called with the struct xdp_md context. This is a very simple context representing a single packet.

`data`

v4.8

This field contains a pointer to the start of packet data. The XDP program can read from this region between data and data_end, as long as it always performs bounds checks.

`data_end`

v4.8

This field contains a pointer to the end of the packet data. The verifier enforces that any XDP program checks that offsets from data are less than data_end before the program attempts to read from them.
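The bounds check described for data and data_end can be sketched as follows. This is a minimal illustration written as plain C so that it can be compiled and exercised outside the kernel; in a real XDP program the two pointers would be derived from the context as (void *)(long)ctx->data and (void *)(long)ctx->data_end. The struct eth_hdr_sketch type and the read_ethertype function are hypothetical names used only for this example.

```c
#include <stdint.h>

/* Hypothetical stand-in for the Ethernet header layout (14 bytes). */
struct eth_hdr_sketch {
    uint8_t dst[6];
    uint8_t src[6];
    uint8_t proto[2]; /* EtherType, big-endian on the wire */
};

/*
 * Returns the EtherType in host byte order, or -1 if the Ethernet header
 * does not fit between data and data_end.
 */
int read_ethertype(const void *data, const void *data_end)
{
    const struct eth_hdr_sketch *eth = data;

    /* The bounds check: the end of the header must not lie past data_end. */
    if ((const void *)(eth + 1) > data_end)
        return -1;

    /* Only after the check is it safe to read the header fields. */
    return (eth->proto[0] << 8) | eth->proto[1];
}
```

The comparison is done before any field is read, which is the ordering the verifier expects in an actual XDP program as well.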

`data_meta`

v4.15

This field contains a pointer to the start of a metadata region in the packet memory. By default, no metadata room is available, so the value of data_meta and data will be the same. The XDP program can request metadata room with the bpf_xdp_adjust_meta helper. On success, data_meta is updated so that it is lower than data. The room between data_meta and data is freely usable by the XDP program.

If the packet with metadata is passed to the kernel, that metadata will be available in the __sk_buff via its data_meta and data fields.

This means that XDP programs can communicate information to, for example, BPF_PROG_TYPE_SCHED_CLS programs, which can then manipulate the socket buffer to change __sk_buff->mark or __sk_buff->priority on behalf of an XDP program.
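The way the metadata region sits between data_meta and data can be modelled in plain C as below. This is only a userspace sketch: in a real XDP program the room would first be requested with bpf_xdp_adjust_meta using a negative delta, and both pointers would then be re-read from the context. The struct pkt_meta_sketch layout and both function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical metadata layout shared between an XDP program and a later
 * consumer such as a BPF_PROG_TYPE_SCHED_CLS program. */
struct pkt_meta_sketch {
    uint32_t mark;
};

/*
 * Stores mark in the metadata area. data_meta and data stand in for the
 * pointers derived from the context. Returns 0 on success, or -1 if the
 * area between data_meta and data cannot hold the struct.
 */
int meta_store_mark(void *data_meta, void *data, uint32_t mark)
{
    struct pkt_meta_sketch meta = { .mark = mark };

    /* Same style of bounds check as for packet data, but against data. */
    if ((uint8_t *)data_meta + sizeof(meta) > (uint8_t *)data)
        return -1;

    memcpy(data_meta, &meta, sizeof(meta));
    return 0;
}

/* Reads the mark back, as the consumer would. Returns 0 on success. */
int meta_load_mark(const void *data_meta, const void *data, uint32_t *mark)
{
    struct pkt_meta_sketch meta;

    if ((const uint8_t *)data_meta + sizeof(meta) > (const uint8_t *)data)
        return -1;

    memcpy(&meta, data_meta, sizeof(meta));
    *mark = meta.mark;
    return 0;
}
```

When data_meta equals data, which is the default state described above, both functions report failure because there is no room to use.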

`ingress_ifindex`

v4.16

This field contains the network interface index the packet arrived on.

`rx_queue_index`

v4.16

This field contains the queue index within the NIC on which the packet was received.

Note

While this field is normally read-only, offloaded XDP programs are allowed to write to it to perform custom RSS (Receive-Side Scaling) in the network device (since v4.18).

`egress_ifindex`

v5.8

This field is read-only and contains the network interface index the packet has been redirected out of. This field is only ever set after an initial XDP program redirected a packet to another device with a BPF_MAP_TYPE_DEVMAP and the value of the map contained a file descriptor of a secondary XDP program. This secondary program will be invoked with a context that has egress_ifindex, rx_queue_index, and ingress_ifindex set so it can modify fields in the packet to match the redirection.

XDP fragments

v5.18

An increasingly common performance optimization technique is to use larger packets and to bulk process them (Jumbo packets, GRO, BIG-TCP). It might therefore happen that packets get larger than a single memory page or that we want to glue multiple already allocated packets together. This breaks the existing assumption XDP programs have of all the packet data living in a linear area between data and data_end.

In order to offer support and not break existing programs, the concept of "XDP fragment aware" programs was introduced. XDP program authors writing such programs can compare the length between the data and data_end pointer and the output of bpf_xdp_get_buff_len. If the XDP program needs to work with data beyond the linear portion it should use the bpf_xdp_load_bytes and bpf_xdp_store_bytes helpers.
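To illustrate why direct pointer access only covers the linear part, the following is a small userspace model of a packet made of a linear area plus fragments. It is not kernel code and does not call the real helpers; model_buff_len and model_load_bytes are hypothetical functions that mimic, in spirit, what bpf_xdp_get_buff_len and bpf_xdp_load_bytes provide to a fragment-aware program.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: part 0 is the linear area, the rest are fragments. */
struct buf_part_sketch {
    const uint8_t *ptr;
    size_t len;
};

/* Total length over all parts (linear area plus fragments). */
size_t model_buff_len(const struct buf_part_sketch *parts, size_t nparts)
{
    size_t total = 0;

    for (size_t i = 0; i < nparts; i++)
        total += parts[i].len;
    return total;
}

/*
 * Copies len bytes starting at offset into dst, crossing part boundaries
 * as needed. Returns 0 on success and -1 if the range is out of bounds.
 */
int model_load_bytes(const struct buf_part_sketch *parts, size_t nparts,
                     size_t offset, void *dst, size_t len)
{
    size_t total = model_buff_len(parts, nparts);
    uint8_t *out = dst;

    if (offset > total || len > total - offset)
        return -1;

    for (size_t i = 0; i < nparts && len > 0; i++) {
        if (offset >= parts[i].len) {     /* range starts in a later part */
            offset -= parts[i].len;
            continue;
        }
        size_t n = parts[i].len - offset; /* bytes available in this part */
        if (n > len)
            n = len;
        memcpy(out, parts[i].ptr + offset, n);
        out += n;
        len -= n;
        offset = 0;                       /* later parts are read from their start */
    }
    return 0;
}
```

In this model, a read that starts in the linear area and ends in a fragment succeeds only through the copy function, which mirrors why fragment-aware programs are expected to go through the helpers for such data.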

To indicate that a program is "XDP Fragment aware", the program should be loaded with the BPF_F_XDP_HAS_FRAGS flag. Program authors can indicate that they wish libraries like libbpf to load programs with this flag by placing their program in an xdp.frags/ ELF section instead of an xdp/ section.

Note

If a program is both "XDP Fragment aware" and should be attached to a BPF_MAP_TYPE_CPUMAP or BPF_MAP_TYPE_DEVMAP, the two ELF naming conventions are combined: xdp.frags/cpumap/ or xdp.frags/devmap/.

Warning

XDP fragments are not supported by all network drivers, check the driver support table.

Attachment

There are two ways of attaching XDP programs to network devices. The legacy way is via a netlink socket, the details of which are complex. Examples of libraries that implement netlink XDP attaching are vishvananda/netlink and libbpf.

The modern and recommended way is to use BPF links. Doing so is as easy as calling BPF_LINK_CREATE with target_ifindex set to the index of the target network interface, attach_type set to BPF_XDP, and the same flags as would be used for the netlink approach.

There are some subtle differences. The netlink method gives the network interface a reference to the program, which means that after attaching, the program stays attached until it is explicitly detached, even if the original loader exits. This is in contrast to kprobes, for example, which stop as soon as the loader exits (assuming the program is not pinned). With links, this referencing does not occur. Creating the link returns a file descriptor that is used to manage the lifecycle: if the link file descriptor is closed, or the loader exits without pinning it, the program is detached from the network interface.

Warning

Hardware offloaded GRO and LRO are incompatible with XDP and have to be disabled in order to use XDP. Not doing so will result in an -EINVAL error upon attaching. The following command can be used to disable GRO and LRO: ethtool -K {ifname} lro off gro off

Warning

For XDP programs without fragments support, the maximum MTU lies between 1500 and 4096 bytes; the exact limit depends on the driver. If the MTU configured on the device is higher than the limit, XDP programs cannot be attached.

Flags

`XDP_FLAGS_UPDATE_IF_NOEXIST`

If set, the kernel will only attach the XDP program if the network interface doesn't have an XDP program attached already.

Note

This flag is only used with the netlink attach method. The link attach method handles this behavior more generically.

`XDP_FLAGS_SKB_MODE`

If set, the kernel will attach the program in SKB (Socket buffer) mode. This mode is also known as "Generic mode". This always works regardless of driver support. It works by calling the XDP program after a socket buffer has already been allocated, further up the stack than where an XDP program would normally be called. This negates the speed advantage of XDP programs. This mode also lacks full feature support, since some actions can no longer be taken this high up the network stack.

It is recommended to use BPF_PROG_TYPE_SCHED_CLS prog types instead if driver support isn't available since it offers more capabilities with roughly the same performance.

This flag is mutually exclusive with XDP_FLAGS_DRV_MODE and XDP_FLAGS_HW_MODE.

`XDP_FLAGS_DRV_MODE`

If set, the kernel will attach the program in driver mode. This requires support from the network driver, but most major network card vendors have support in the latest kernel.

This flag is mutually exclusive with XDP_FLAGS_SKB_MODE and XDP_FLAGS_HW_MODE.

`XDP_FLAGS_HW_MODE`

If set, the kernel will attach the program in hardware offload mode. This requires both driver and hardware support for XDP offloading. Currently only select Netronome devices support offloading. However, it should be noted that only a subset of normal features are supported.
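Because the three mode flags are mutually exclusive, a loader may want to reject invalid combinations before talking to the kernel. The following is a minimal sketch of such a caller-side check; xdp_mode_flags_valid is a hypothetical helper, not a kernel or libbpf function, and the flag values are repeated from linux/if_link.h only so that the example stands on its own.

```c
#include <stdint.h>

/* Flag values as defined in <linux/if_link.h>, repeated for a standalone sketch. */
#define XDP_FLAGS_UPDATE_IF_NOEXIST (1U << 0)
#define XDP_FLAGS_SKB_MODE          (1U << 1)
#define XDP_FLAGS_DRV_MODE          (1U << 2)
#define XDP_FLAGS_HW_MODE           (1U << 3)

/* Returns 1 if at most one attach mode is requested, 0 otherwise. */
int xdp_mode_flags_valid(uint32_t flags)
{
    uint32_t modes = flags & (XDP_FLAGS_SKB_MODE |
                              XDP_FLAGS_DRV_MODE |
                              XDP_FLAGS_HW_MODE);

    /* A value with zero or one bit set satisfies x & (x - 1) == 0. */
    return (modes & (modes - 1)) == 0;
}
```

Flags outside the mode group, such as XDP_FLAGS_UPDATE_IF_NOEXIST, are masked out and do not affect the result.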

`XDP_FLAGS_REPLACE`

If set, the kernel will atomically replace the existing program with this new program. You will also have to pass the file descriptor of the old program via the netlink request.

Note

This flag is only used with the netlink attach method. The link attach method handles this behavior more generically.

Device map program

v5.8

XDP programs can be attached to map values of a BPF_MAP_TYPE_DEVMAP map. Once attached, this program will run after the first program has concluded but before the packet is sent off to the new network device. These programs are called with additional context, see egress_ifindex.

Only XDP programs that have been loaded with the BPF_XDP_DEVMAP value in expected_attach_type are allowed to be attached in this way.

Program authors can indicate to loaders like libbpf that a given program should be loaded with this expected attach type by placing the program in a xdp/devmap/ ELF section.

CPU map program

v5.9

XDP programs can be attached to map values of a BPF_MAP_TYPE_CPUMAP map. Once attached, this program will run on the new logical CPU. The idea is to spend minimal time in the first XDP program, using it only to pick the target CPU, and to perform the more CPU-intensive tasks in this second program.

Only XDP programs that have been loaded with the BPF_XDP_CPUMAP value in expected_attach_type are allowed to be attached in this way.

Program authors can indicate to loaders like libbpf that a given program should be loaded with this expected attach type by placing the program in a xdp/cpumap/ ELF section.

Driver support

Driver name	Native XDP	XDP Fragments	AF_XDP
Mellanox mlx4	v4.8
Mellanox mlx5	v4.9	v5.18¹, v6.4	v5.3
Qlogic qede	v4.10
Netronome nfp	v4.10		v5.18
Virtio	v4.10	v6.3	v6.11
Broadcom bnxt	v4.11	v5.19
Intel ixgbe	v4.12		v4.20
Cavium thunder (nicvf)	v4.12
Intel i40e	v4.13	v6.4	v4.20
Tun	v4.14
Netdevsim	v4.16
Intel ixgbevf	v4.17
Veth	v4.19	v5.5
Freescale dpaa2	v5.0		v6.2
Socionext netsec	v5.3
TI cpsw	v5.3
Solarflare efx	v5.5
Intel ice	v5.5	v6.3	v5.5
Marvell mvneta	v5.5	v5.18
Amazon ena	v5.6
Hyper-V netvsc	v5.6
Marvell mvpp2	v5.9
Xen xennet	v5.9
Intel igb	v5.10		v6.14
Freescale dpaa	v5.11
Intel igc	v5.13		v5.14
STmicro stmmac	v5.13		v5.13
Freescale enetc	v5.13
Bond	v5.15
Marvell otx2	v5.16
Microsoft mana	v5.17
Fungible fun	v5.18
Atlantic aq	v5.19	v5.19
Mediatek mtk	v6.0
Freescale fec_enet	v6.2
Microchip lan966x	v6.2
Engleder tsnep	v6.3		v6.4
Google gve	v6.4		v6.4
VMware vmxnet3	v6.6
Pensando Ionic	v6.9	v6.9
TI CPSW	v6.10

Note

This table was last updated for Linux v6.10 and is subject to change in the future.

Max MTU

Plain XDP (fragments disabled) has the limitation that every packet must fit within a single memory page (typically 4096 bytes). This same memory page is also used to store NIC specific metadata and metadata to be passed to the network stack. The room needed for the metadata eats into the available space for the packet data. This means that the actual maximum MTU is some amount lower. The exact value depends on a lot of factors including but not limited to: the driver, the NIC, the CPU architecture, the kernel version and kernel configuration.

The following table has been calculated from mathematical formulas based on the driver code and constants derived from the most common systems. This table assumes a 4k page size, most common L2 cache line sizes for the given architectures, a 6.8 kernel (kernel version doesn't seem to make a big difference). Please refer to tools/mtu-calc in the doc sources to see the exact formulas used and/or to calculate exact max MTU if you have a non-standard system.
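The basic arithmetic behind such a limit can be sketched as below. This is only an illustration under stated assumptions, not the formula of any particular driver: max_mtu_sketch is a hypothetical function, the 256 bytes of headroom correspond to XDP_PACKET_HEADROOM, and the 320 bytes reserved at the tail are an assumed value for the network stack metadata on a common 64-bit system.

```c
/*
 * Usable packet bytes in one page once headroom and tail metadata are
 * reserved. All three inputs are assumptions supplied by the caller.
 */
int max_mtu_sketch(int page_size, int headroom, int tail_reserved)
{
    return page_size - headroom - tail_reserved;
}
```

With a 4096 byte page, max_mtu_sketch(4096, 256, 320) gives 3520. Individual drivers subtract further driver-specific amounts, so the per-driver values can only be obtained from the actual formulas in tools/mtu-calc.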

The first table below applies to plain XDP, the second to XDP with fragments.

Vendor	Driver	x86	arm	arm64	armv7	riscv
Kernel	Veth	3520	3518	3520	3454	3518
Kernel	VirtIO	3506	3506	3506	3442	3506
Kernel	Tun	1500	1500	1500	1500	1500
Kernel	Bond	⁴	⁴	⁴	⁴	⁴
Xen	Netfront	3840	3840	3840	3840	3840
Amazon	ENA	3498	3498	3498	3434	3498
Aquantia/Marvell	AQtion	2048	2048	2048	2048	2048
Broadcom	BNXT	3502	3500	3502	3436	3500
Cavium	Thunder (nicvf)	1508	1508	1508	1508	1508
Engleder	TSN Endpoint	∞²	∞²	∞²	∞²	∞²
Freescale	FEC	∞²	∞²	∞²	∞²	∞²
Freescale	DPAA	3706	3706	3706	3642	3706
Freescale	DPAA2	?³	?³	?³	?³	?³
Freescale	ENETC	∞²	∞²	∞²	∞²	∞²
Fungible	Funeth	3566	3566	3566	3502	3566
Google	GVE	2032	2032	2032	2032	2032
Intel	I40e	3046	3046	3046	3046	3046
Intel	ICE	3046	3046	3046	3046	3046
Intel	IGB	3046	3046	3046	3046	3046
Intel	IGC	1500	1500	1500	1500	1500
Intel	IXGBE	3050	3050	3050	3050	3050
Intel	IXGBEVF	3050	3050	3050	3050	3050
Marvell	NETA	3520	3520	3520	3456	3520
Marvell	PPv2	3552	3552	3552	3488	3552
Marvell	Octeon TX2	1508	1508	1508	1508	1508
MediaTek	MTK	3520	3520	3520	3456	3520
Mellanox	MLX4	3498	3498	3498	3434	3498
Mellanox	MLX5	3498	3498	3498	3434	3498
Microchip	LAN966x	∞²	∞²	∞²	∞²	∞²
Microsoft	Mana	3506	3506	3506	3442	3506
Microsoft	Hyper-V	3506	3506	3506	3442	3506
Netronome	NFP	4096	4096	4096	4096	4096
Pensando	Ionic	3502	3502	3502	3438	3502
Qlogic	QEDE	∞²	∞²	∞²	∞²	∞²
Solarflare	SFP (SFC9xxx PF/VF)	3530	3546	3530	3386	3514
Solarflare	SFP (Riverhead)	3522	3530	3522	3370	3498
Solarflare	SFP (SFC4000A)	3508	3538	3508	3378	3506
Solarflare	SFP (SFC4000B)	3528	3542	3528	3382	3510
Solarflare	SFP (SFC9020/SFL9021)	3528	3542	3528	3382	3510
Socionext	NetSec	1500	1500	1500	1500	1500
STMicro	ST MAC	1500	1500	1500	1500	1500
TI	CPSW	∞²	∞²	∞²	∞²	∞²
VMWare	VMXNET 3	3494	3492	3494	3428	3492

Vendor	Driver	x86	arm	arm64	armv7	riscv
Kernel	Veth	73152	73150	73152	73086	73150
Kernel	VirtIO	∞²	∞²	∞²	∞²	∞²
Kernel	Tun
Kernel	Bond	⁴	⁴	⁴	⁴	⁴
Xen	Netfront
Amazon	ENA
Aquantia/Marvell	AQtion	∞²	∞²	∞²	∞²	∞²
Broadcom	BNXT	∞²	∞²	∞²	∞²	∞²
Cavium	Thunder (nicvf)
Engleder	TSN Endpoint
Freescale	FEC
Freescale	DPAA
Freescale	DPAA2	?³	?³	?³	?³	?³
Freescale	ENETC
Fungible	Funeth
Google	GVE
Intel	I40e	9702	9702	9702	9702	9702
Intel	ICE	3046	3046	3046	3046	3046
Intel	IGB
Intel	IGC
Intel	IXGBE
Intel	IXGBEVF
Marvell	NETA	∞²	∞²	∞²	∞²	∞²
Marvell	PPv2	3552	3552	3552	3488	3552
Marvell	Octeon TX2
MediaTek	MTK
Mellanox	MLX4
Mellanox	MLX5	∞²	∞²	∞²	∞²	∞²
Microchip	LAN966x
Microsoft	Mana
Microsoft	Hyper-V
Netronome	NFP
Pensando	Ionic
Qlogic	QEDE
Solarflare	SFP (SFC9xxx PF/VF)
Solarflare	SFP (Riverhead)
Solarflare	SFP (SFC4000A)
Solarflare	SFP (SFC4000B)
Solarflare	SFP (SFC9020/SFL9021)
Socionext	NetSec
STMicro	ST MAC
TI	CPSW
VMWare	VMXNET 3

Warning

If the configured MTU on a network interface is higher than the limit calculated by the network driver, XDP programs cannot be attached. When attaching via netlink, most drivers will use netlink debug messages to communicate the exact limit. When attaching via BPF links, no such feedback is given by default. The error message can still be obtained by attaching an eBPF program to the bpf_xdp_link_attach_failed tracepoint and printing the error message or passing it to userspace.

VLAN Offload

When VLAN hardware offload is enabled on the NIC, the NIC driver performs outermost VLAN header stripping and insertion. VLAN stripping at the driver level means that an XDP program that intercepts a VLAN-tagged packet at ingress will see the packet's Ethernet header without any VLAN tag.

Why does it happen like this, and why can we still see VLAN headers in the tcpdump on the "allowed" traffic? Roughly speaking, when packet data is pre-processed at the low hardware level, the driver code at first cuts out the VLAN from the Ethernet part and stores it in a separate vlan field in its receive descriptor structure. Then, a few steps later, some XDP program is run on the packet. When it is finished and it's clear that the packet will not be dropped, the driver writes the VLAN from the corresponding receive descriptor to the dedicated fields (vlan_proto and vlan_tci) in the allocated socket buffer. Thus, VLAN is preserved separately from the packet data in the socket buffer structure while it travels further in the stack. We can observe it in the tcpdump but not at XDP level.

VLAN offloads can be checked with the command ethtool -k <dev_name> | grep vlan-offload. To see the VLAN header in the XDP program, either disable VLAN offloads via ethtool -K <dev_name> rxvlan off txvlan off, or use the bpf_xdp_metadata_rx_vlan_tag kfunc, which is supported by some recent drivers.
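With VLAN offloads disabled, the tag stays in the packet data and can be parsed like any other header. The following is a minimal sketch in plain C, written so it can be tested in userspace; in an XDP program the pointers would come from ctx->data and ctx->data_end. The function name outer_vlan_id and the two macro names are hypothetical, while the EtherType values 0x8100 (802.1Q) and 0x88A8 (802.1ad) are the standard ones.

```c
#include <stdint.h>

#define ETHERTYPE_8021Q  0x8100
#define ETHERTYPE_8021AD 0x88A8

/*
 * Returns the 12-bit VLAN ID of the outermost tag, -1 if the frame is
 * truncated, or -2 if the frame carries no VLAN tag (as is the case
 * when the NIC has already stripped it).
 */
int outer_vlan_id(const void *data, const void *data_end)
{
    const uint8_t *p = data;
    const uint8_t *end = data_end;

    /* Ethernet header: 6 + 6 address bytes, then 2 EtherType bytes. */
    if (p + 14 > end)
        return -1;

    uint16_t proto = (uint16_t)((p[12] << 8) | p[13]);
    if (proto != ETHERTYPE_8021Q && proto != ETHERTYPE_8021AD)
        return -2;

    /* The tag adds 4 bytes: TCI (2), then the inner EtherType (2). */
    if (p + 18 > end)
        return -1;

    uint16_t tci = (uint16_t)((p[14] << 8) | p[15]);
    return tci & 0x0FFF; /* the lower 12 bits of the TCI hold the VLAN ID */
}
```

A result of -2 on traffic that is known to be tagged is a hint that the tag was stripped by the NIC before the program ran.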

Helper functions

Not all helper functions are available in all program types. These are the helper calls available for XDP programs:

Supported helper functions

KFuncs

Supported kfuncs

¹ Only the legacy RQ mode supports XDP frags, which is not the default and will require setting via ethtool.
² Driver does not have logic to limit the max MTU and XDP usage, but implicit limits such as in firmware or hardware may still apply.
³ MTU limit is loaded from firmware.
⁴ MTU limit is determined by slave devices.
