Program type BPF_PROG_TYPE_XDP
XDP (Express Data Path) programs can attach to network devices and are called for every incoming (ingress) packet received by that network device. XDP programs can take quite a large number of actions, most prominent of which are manipulation of the packet, dropping the packet, redirecting it and letting it pass to the network stack.
Notable use cases for XDP programs are DDoS protection, load balancing, and high-throughput packet filtering. If loaded with native driver support, XDP programs are called just after the packet is received but before memory is allocated for a socket buffer. This call site makes XDP programs extremely performant, especially in use cases where much of the traffic is forwarded or dropped. Other eBPF program types and techniques run only after the relatively expensive socket buffer allocation has taken place, even when the packet is then discarded.
Usage
XDP programs are typically put into an ELF section prefixed with xdp. The XDP program is called by the kernel with an xdp_md context. The return value indicates what action the kernel should take with the packet. The following values are permitted:
- XDP_ABORTED - Signals that an unrecoverable error has taken place. Returning this action will cause the kernel to trigger the xdp_exception tracepoint and print a line to the trace log. This allows for debugging of such occurrences. It is also expensive, so it should not be used without consideration in production.
- XDP_DROP - Discards the packet. Since the packet is dropped very early, it will be invisible to tools like tcpdump. Consider recording drops using a custom feedback mechanism to maintain visibility.
- XDP_PASS - Passes the packet to the network stack. The packet can be manipulated beforehand.
- XDP_TX - Sends the packet back out the same network port it arrived on. The packet can be manipulated beforehand.
- XDP_REDIRECT - Redirects the packet to one of a number of locations. The packet can be manipulated beforehand.
XDP_REDIRECT should not be returned by itself, but always in combination with a helper function call. A number of helper functions can be used to redirect the current packet. These set hidden values in the context to inform the kernel what actual redirection action to take after the program exits.
Packets can be redirected in the following ways:
- The packet can be redirected to egress on a different interface than where it entered (like XDP_TX but for a different interface). This can be done using the bpf_redirect helper (not recommended) or the bpf_redirect_map helper in combination with a BPF_MAP_TYPE_DEVMAP or BPF_MAP_TYPE_DEVMAP_HASH map.
- The packet can be redirected to another CPU for further processing using the bpf_redirect_map helper in combination with a BPF_MAP_TYPE_CPUMAP map.
- The packet can be redirected to userspace, bypassing the kernel network stack, using the bpf_redirect_map helper in combination with a BPF_MAP_TYPE_XSKMAP map.
Context
XDP programs are called with the struct xdp_md context. This is a very simple context representing a single packet.
data
This field contains a pointer to the start of packet data. The XDP program can read from the region between data and data_end, as long as it always performs bounds checks.
data_end
This field contains a pointer to the end of the packet data. The verifier enforces that any XDP program checks that offsets from data are less than data_end before the program attempts to read from them.
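The bounds check described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the names eth_proto and xdp_check_eth are made up for this example, the drop/pass policy is arbitrary, and the entry point is compiled only when targeting BPF (clang defines __bpf__ for that target) on the assumption that libbpf's bpf_helpers.h is available.

```c
#include <linux/bpf.h>       /* struct xdp_md, XDP_* actions */
#include <linux/if_ether.h>  /* struct ethhdr, ETH_P_IP */

/* Returns the EtherType in host byte order, or -1 if a full Ethernet
 * header does not fit between data and data_end. */
static inline __attribute__((always_inline))
int eth_proto(void *data, void *data_end)
{
    struct ethhdr *eth = data;

    /* Bounds check: the whole header must end at or before data_end. */
    if ((void *)(eth + 1) > data_end)
        return -1;

    /* h_proto is stored in network byte order; assemble it bytewise. */
    const unsigned char *p = (const unsigned char *)&eth->h_proto;
    return (p[0] << 8) | p[1];
}

#ifdef __bpf__
#include <bpf/bpf_helpers.h> /* SEC(); assumes libbpf headers are installed */

SEC("xdp")
int xdp_check_eth(struct xdp_md *ctx)
{
    /* The context stores the pointers as 32-bit fields, hence the casts. */
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Illustrative policy only: drop frames that are too short to hold
     * an Ethernet header and pass everything else to the stack. */
    return eth_proto(data, data_end) < 0 ? XDP_DROP : XDP_PASS;
}
#endif
```

Keeping the check in a small inlined function that takes data and data_end makes the same logic easy to exercise outside the kernel with an ordinary byte buffer.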
data_meta
This field contains a pointer to the start of a metadata region in the packet memory. By default, no metadata room is available, so the values of data_meta and data will be the same. The XDP program can request metadata room with the bpf_xdp_adjust_meta helper; on success, data_meta is updated so that it is now less than data. The room between data_meta and data is freely usable by the XDP program.
If the packet with metadata is passed to the kernel, that metadata will be available in the __sk_buff via its data_meta and data fields.
This means that XDP programs can communicate information to, for example, BPF_PROG_TYPE_SCHED_CLS programs, which can then manipulate the socket buffer to change __sk_buff->mark or __sk_buff->priority on behalf of an XDP program.
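A rough sketch of the XDP side of this hand-off is shown below. The struct pkt_meta layout, the value 42, and the names meta_store and xdp_tag are hypothetical and exist only for this example. The entry point is compiled only when targeting BPF and assumes libbpf's bpf_helpers.h is available.

```c
#include <linux/bpf.h>
#include <linux/types.h>

/* Hypothetical metadata layout shared between the XDP program and a
 * later consumer. The size is kept a multiple of 4 bytes. */
struct pkt_meta {
    __u32 mark;
};

/* Writes the metadata if it fits between data_meta and data.
 * Returns 0 on success, -1 if the region is too small. */
static inline __attribute__((always_inline))
int meta_store(void *data_meta, void *data, __u32 mark)
{
    struct pkt_meta *meta = data_meta;

    /* Same style of bounds check as for packet data, but against data. */
    if ((void *)(meta + 1) > data)
        return -1;

    meta->mark = mark;
    return 0;
}

#ifdef __bpf__
#include <bpf/bpf_helpers.h> /* SEC(), helper declarations */

SEC("xdp")
int xdp_tag(struct xdp_md *ctx)
{
    /* A negative delta grows the metadata area in front of the packet.
     * If the driver cannot provide it, pass the packet untagged. */
    if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct pkt_meta)) < 0)
        return XDP_PASS;

    /* Pointers must be re-read from the context after the adjustment. */
    void *data_meta = (void *)(long)ctx->data_meta;
    void *data = (void *)(long)ctx->data;

    meta_store(data_meta, data, 42); /* 42 is an arbitrary example value */
    return XDP_PASS;
}
#endif
```

A BPF_PROG_TYPE_SCHED_CLS program could then perform the mirror-image check against __sk_buff->data_meta and __sk_buff->data and copy the value into __sk_buff->mark.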
ingress_ifindex
This field contains the network interface index the packet arrived on.
rx_queue_index
This field contains the queue index within the NIC on which the packet was received.
Note
While this field is normally read-only, offloaded XDP programs are allowed to write to it to perform custom RSS (Receive-Side Scaling) in the network device (since v4.18).
egress_ifindex
This field is read-only and contains the network interface index the packet has been redirected out of. This field is only ever set after an initial XDP program redirected a packet to another device with a BPF_MAP_TYPE_DEVMAP and the value of the map contained a file descriptor of a secondary XDP program. This secondary program will be invoked with a context that has egress_ifindex, rx_queue_index, and ingress_ifindex set so it can modify fields in the packet to match the redirection.
XDP fragments
An increasingly common performance optimization technique is to use larger packets and to bulk process them (Jumbo packets, GRO, BIG-TCP). It might therefore happen that packets get larger than a single memory page or that we want to glue multiple already allocated packets together. This breaks the existing assumption XDP programs have of all the packet data living in a linear area between data and data_end.
In order to offer support and not break existing programs, the concept of "XDP fragment aware" programs was introduced. XDP program authors writing such programs can compare the length between the data and data_end pointers with the output of bpf_xdp_get_buff_len; if the latter is larger, part of the packet lives outside the linear area. If the XDP program needs to work with data beyond the linear portion, it should use the bpf_xdp_load_bytes and bpf_xdp_store_bytes helpers.
To indicate that a program is "XDP Fragment aware" the program should be loaded with the BPF_F_XDP_HAS_FRAGS flag. Program authors can indicate that they wish libraries like libbpf to load programs with this flag by placing their program in an xdp.frags/ ELF section instead of an xdp/ section.
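As a hedged illustration, a fragment-aware program might look roughly like the sketch below. The names is_fragmented and xdp_peek_frags are invented for this example, and reading four bytes just past the linear area is an arbitrary choice. The entry point is compiled only when targeting BPF and assumes libbpf's bpf_helpers.h is available.

```c
#include <linux/bpf.h>
#include <linux/types.h>

/* Returns 1 when the total packet length exceeds the linear area,
 * meaning some of the data is stored in fragments. */
static inline __attribute__((always_inline))
int is_fragmented(__u64 linear_len, __u64 total_len)
{
    return total_len > linear_len;
}

#ifdef __bpf__
#include <bpf/bpf_helpers.h> /* SEC(), helper declarations */

SEC("xdp.frags")
int xdp_peek_frags(struct xdp_md *ctx)
{
    __u64 linear_len = ctx->data_end - ctx->data;
    __u64 total_len = bpf_xdp_get_buff_len(ctx);

    if (is_fragmented(linear_len, total_len)) {
        __u8 peek[4];

        /* Data beyond data_end cannot be reached via direct packet
         * access, so it is copied out with the helper instead. */
        if (bpf_xdp_load_bytes(ctx, linear_len, peek, sizeof(peek)) == 0) {
            /* peek[] now holds the first bytes outside the linear area. */
        }
    }

    return XDP_PASS;
}
#endif
```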
Note
If a program is both "XDP Fragment aware" and should be attached to a BPF_MAP_TYPE_CPUMAP or BPF_MAP_TYPE_DEVMAP, the two ELF naming conventions are combined: xdp.frags/cpumap/ or xdp.frags/devmap/.
Warning
XDP fragments are not supported by all network drivers; check the driver support table.
Attachment
There are two ways of attaching XDP programs to network devices. The legacy way is via a netlink socket, the details of which are complex. Examples of libraries that implement netlink XDP attaching are vishvananda/netlink and libbpf.
The modern and recommended way is to use BPF links. Doing so is as easy as calling BPF_LINK_CREATE with the target_ifindex set to the network interface target, attach_type set to BPF_XDP, and the same flags as would be used for the netlink approach.
There are some subtle differences. The netlink method gives the network interface a reference to the program, which means that after attaching, the program stays attached until some program detaches it, even if the original loader exits. This is in contrast to, for example, kprobes, which stop as soon as the loader exits (assuming the program is not pinned). With links, this referencing doesn't occur: the creation of the link returns a file descriptor which is used to manage the lifecycle. If the link file descriptor is closed, or the loader exits without pinning it, the program is detached from the network interface.
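For illustration, the link-based attach can be sketched with a raw bpf() syscall as below. The wrapper name xdp_link_create is made up for this example, and the sketch passes BPF_XDP (a member of enum bpf_attach_type) as the attach type. In practice a library such as libbpf would normally issue this call on your behalf.

```c
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Attaches the XDP program referred to by prog_fd to the interface
 * ifindex using a BPF link. Returns the link file descriptor (>= 0),
 * or -1 with errno set on failure. */
static int xdp_link_create(int prog_fd, int ifindex, __u32 flags)
{
    union bpf_attr attr;

    /* Unused fields of the attribute union must be zeroed. */
    memset(&attr, 0, sizeof(attr));
    attr.link_create.prog_fd = prog_fd;
    attr.link_create.target_ifindex = ifindex;
    attr.link_create.attach_type = BPF_XDP;
    attr.link_create.flags = flags; /* e.g. XDP_FLAGS_DRV_MODE */

    return (int)syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
}
```

The returned descriptor owns the attachment: closing it, or letting the process exit without pinning the link, detaches the program again.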
Warning
Hardware offloaded GRO and LRO are incompatible with XDP and have to be disabled in order to use XDP. Not doing so will result in a -EINVAL error upon attaching.
The following command can be used to disable GRO and LRO: ethtool -K {ifname} lro off gro off
Warning
For XDP programs without fragments support, there is a max MTU of between 1500 and 4096 bytes; the exact limit depends on the driver. If the configured MTU on the device is set higher than the limit, XDP programs cannot be attached.
Flags
XDP_FLAGS_UPDATE_IF_NOEXIST
If set, the kernel will only attach the XDP program if the network interface doesn't have an XDP program attached already.
Note
This flag is only used with the netlink attach method; the link attach method handles this behavior more generically.
XDP_FLAGS_SKB_MODE
If set, the kernel will attach the program in SKB (Socket buffer) mode. This mode is also known as "Generic mode". This always works regardless of driver support. It works by calling the XDP program after a socket buffer has already been allocated, further up the stack than where an XDP program would normally be called. This negates the speed advantage of XDP programs. This mode also lacks full feature support since some actions cannot be taken this high up the network stack anymore.
It is recommended to use BPF_PROG_TYPE_SCHED_CLS programs instead if driver support isn't available, since they offer more capabilities with roughly the same performance.
This flag is mutually exclusive with XDP_FLAGS_DRV_MODE and XDP_FLAGS_HW_MODE.
XDP_FLAGS_DRV_MODE
If set, the kernel will attach the program in driver mode. This requires support from the network driver, but the drivers of most major network card vendors support it in the latest kernel.
This flag is mutually exclusive with XDP_FLAGS_SKB_MODE and XDP_FLAGS_HW_MODE.
XDP_FLAGS_HW_MODE
If set, the kernel will attach the program in hardware offload mode. This requires both driver and hardware support for XDP offloading. Currently only select Netronome devices support offloading. However, it should be noted that only a subset of normal features are supported.
XDP_FLAGS_REPLACE
If set, the kernel will atomically replace the existing program with this new program. You will also have to pass the file descriptor of the old program via the netlink request.
Note
This flag is only used with the netlink attach method; the link attach method handles this behavior more generically.
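Since the three mode flags above are mutually exclusive, a loader may want to reject invalid combinations before asking the kernel to attach anything. The check below is a small sketch of that idea. The function name xdp_mode_flags_valid is invented for this example, and the flag constants are taken from the linux/if_link.h UAPI header.

```c
#include <linux/if_link.h> /* XDP_FLAGS_* constants */

/* Returns 1 if at most one of the SKB/DRV/HW mode flags is set,
 * 0 if two or more are combined. Other flags are ignored. */
static int xdp_mode_flags_valid(unsigned int flags)
{
    unsigned int modes = flags & (XDP_FLAGS_SKB_MODE |
                                  XDP_FLAGS_DRV_MODE |
                                  XDP_FLAGS_HW_MODE);

    /* modes & (modes - 1) clears the lowest set bit, so the result is
     * zero exactly when zero or one mode bit was set. */
    return (modes & (modes - 1)) == 0;
}
```

Setting no mode flag at all is treated as valid here, on the assumption that the kernel then picks a mode itself.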
Device map program
XDP programs can be attached to map values of a BPF_MAP_TYPE_DEVMAP map. Once attached, this program will run after the first program has concluded but before the packet is sent off to the new network device. These programs are called with additional context, see egress_ifindex.
Only XDP programs that have been loaded with the BPF_XDP_DEVMAP value in expected_attach_type are allowed to be attached in this way.
Program authors can indicate to loaders like libbpf that a given program should be loaded with this expected attach type by placing the program in an xdp/devmap/ ELF section.
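A device map program could, for example, rewrite the source MAC address to suit the egress device. The sketch below is illustrative only: the names set_src_mac and xdp_egress_fixup are made up, and deriving the MAC address from egress_ifindex is a hypothetical stand-in for a real lookup (for instance in a map keyed by interface index). The entry point is compiled only when targeting BPF and assumes libbpf's bpf_helpers.h is available.

```c
#include <linux/bpf.h>
#include <linux/if_ether.h> /* struct ethhdr, ETH_ALEN */

/* Overwrites the Ethernet source address after a bounds check.
 * Returns 0 on success, -1 if the header does not fit. */
static inline __attribute__((always_inline))
int set_src_mac(void *data, void *data_end, const unsigned char mac[ETH_ALEN])
{
    struct ethhdr *eth = data;

    if ((void *)(eth + 1) > data_end)
        return -1;

    __builtin_memcpy(eth->h_source, mac, ETH_ALEN);
    return 0;
}

#ifdef __bpf__
#include <bpf/bpf_helpers.h> /* SEC() */

SEC("xdp/devmap")
int xdp_egress_fixup(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Hypothetical: a locally administered address whose last byte is
     * taken from the egress interface index. */
    unsigned char mac[ETH_ALEN] = {
        0x02, 0, 0, 0, 0, (unsigned char)ctx->egress_ifindex
    };

    if (set_src_mac(data, data_end, mac) < 0)
        return XDP_DROP;

    /* XDP_PASS lets the frame continue to the egress device. */
    return XDP_PASS;
}
#endif
```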
CPU map program
Available since v5.9.
XDP programs can be attached to map values of a BPF_MAP_TYPE_CPUMAP map. Once attached, this program will run on the new logical CPU. The idea is to spend minimal time in the first XDP program, which only schedules the packet, and to perform the more CPU-intensive tasks in this second program.
Only XDP programs that have been loaded with the BPF_XDP_CPUMAP value in expected_attach_type are allowed to be attached in this way.
Program authors can indicate to loaders like libbpf that a given program should be loaded with this expected attach type by placing the program in an xdp/cpumap/ ELF section.
Driver support
Driver name | Native XDP | XDP hardware Offload | XDP Fragments | AF_XDP |
---|---|---|---|---|
v4.8 | ||||
v4.9 | v5.18 [1], v6.4 | v5.3 | ||
v4.10 | ||||
v4.10 | v5.18 | |||
v4.10 | v6.3 | v6.11 | ||
v4.11 | v5.19 | |||
v4.12 | v4.20 | |||
v4.12 | ||||
v4.13 | v6.4 | v4.20 | ||
v4.14 | ||||
v4.16 | ||||
v4.17 | ||||
v4.19 | v5.5 | |||
v5.0 | v6.2 | |||
v5.3 | ||||
v5.3 | ||||
v5.5 | ||||
v5.5 | v6.3 | v5.5 | ||
v5.5 | v5.18 | |||
v5.6 | ||||
v5.6 | ||||
v5.9 | ||||
v5.9 | ||||
v5.10 | ||||
v5.11 | ||||
v5.13 | v5.14 | |||
v5.13 | v5.13 | |||
v5.13 | ||||
v5.15 | ||||
v5.16 | ||||
v5.17 | ||||
v5.18 | ||||
v5.19 | v5.19 | |||
v6.0 | ||||
v6.2 | ||||
v6.2 | ||||
v6.3 | v6.4 | |||
v6.4 | v6.4 | |||
v6.6 | ||||
v6.9 | v6.9 | |||
v6.10 |
Note
This table was last updated for Linux v6.10 and is subject to change in the future.
Max MTU
Plain XDP (fragments disabled) has the limitation that every packet must fit within a single memory page (typically 4096 bytes). This same memory page is also used to store NIC specific metadata and metadata to be passed to the network stack. The room needed for the metadata eats into the available space for the packet data. This means that the actual maximum MTU is some amount lower. The exact value depends on a lot of factors including but not limited to: the driver, the NIC, the CPU architecture, the kernel version and kernel configuration.
The following table has been calculated from formulas based on the driver code and constants derived from the most common systems. It assumes a 4k page size, the most common L2 cache line sizes for the given architectures, and a 6.8 kernel (the kernel version doesn't seem to make a big difference). Please refer to tools/mtu-calc in the doc sources to see the exact formulas used and/or to calculate the exact max MTU if you have a non-standard system.
Vendor | Driver | x86 | arm | arm64 | armv7 | riscv |
---|---|---|---|---|---|---|
Kernel | Veth | 3520 | 3518 | 3520 | 3454 | 3518 |
Kernel | VirtIO | 3506 | 3506 | 3506 | 3442 | 3506 |
Kernel | Tun | 1500 | 1500 | 1500 | 1500 | 1500 |
Kernel | Bond | 4 | 4 | 4 | 4 | 4 |
Xen | Netfront | 3840 | 3840 | 3840 | 3840 | 3840 |
Amazon | ENA | 3498 | 3498 | 3498 | 3434 | 3498 |
Aquantia/Marvell | AQtion | 2048 | 2048 | 2048 | 2048 | 2048 |
Broadcom | BNXT | 3502 | 3500 | 3502 | 3436 | 3500 |
Cavium | Thunder (nicvf) | 1508 | 1508 | 1508 | 1508 | 1508 |
Engelder | TSN Endpoint | ∞2 | ∞2 | ∞2 | ∞2 | ∞2 |
Freescale | FEC | ∞2 | ∞2 | ∞2 | ∞2 | ∞2 |
Freescale | DPAA | 3706 | 3706 | 3706 | 3642 | 3706 |
Freescale | DPAA2 | ?3 | ?3 | ?3 | ?3 | ?3 |
Freescale | ENETC | ∞2 | ∞2 | ∞2 | ∞2 | ∞2 |
Fungible | Funeth | 3566 | 3566 | 3566 | 3502 | 3566 |
Google | GVE | 2032 | 2032 | 2032 | 2032 | 2032 |
Intel | I40e | 3046 | 3046 | 3046 | 3046 | 3046 |
Intel | ICE | 3046 | 3046 | 3046 | 3046 | 3046 |
Intel | IGB | 3046 | 3046 | 3046 | 3046 | 3046 |
Intel | IGC | 1500 | 1500 | 1500 | 1500 | 1500 |
Intel | IXGBE | 3050 | 3050 | 3050 | 3050 | 3050 |
Intel | IXGBEVF | 3050 | 3050 | 3050 | 3050 | 3050 |
Marvell | NETA | 3520 | 3520 | 3520 | 3456 | 3520 |
Marvell | PPv2 | 3552 | 3552 | 3552 | 3488 | 3552 |
Marvell | Octeon TX2 | 1508 | 1508 | 1508 | 1508 | 1508 |
MediaTek | MTK | 3520 | 3520 | 3520 | 3456 | 3520 |
Mellanox | MLX4 | 3498 | 3498 | 3498 | 3434 | 3498 |
Mellanox | MLX5 | 3498 | 3498 | 3498 | 3434 | 3498 |
Microchip | LAN966x | ∞2 | ∞2 | ∞2 | ∞2 | ∞2 |
Microsoft | Mana | 3506 | 3506 | 3506 | 3442 | 3506 |
Microsoft | Hyper-V | 3506 | 3506 | 3506 | 3442 | 3506 |
Netronome | NFP | 4096 | 4096 | 4096 | 4096 | 4096 |
Pensando | Ionic | 3502 | 3502 | 3502 | 3438 | 3502 |
Qlogic | QEDE | ∞2 | ∞2 | ∞2 | ∞2 | ∞2 |
Solarflare | SFP (SFC9xxx PF/VF) | 3530 | 3546 | 3530 | 3386 | 3514 |
Solarflare | SFP (Riverhead) | 3522 | 3530 | 3522 | 3370 | 3498 |
Solarflare | SFP (SFC4000A) | 3508 | 3538 | 3508 | 3378 | 3506 |
Solarflare | SFP (SFC4000B) | 3528 | 3542 | 3528 | 3382 | 3510 |
Solarflare | SFP (SFC9020/SFL9021) | 3528 | 3542 | 3528 | 3382 | 3510 |
Socionext | NetSec | 1500 | 1500 | 1500 | 1500 | 1500 |
STMicro | ST MAC | 1500 | 1500 | 1500 | 1500 | 1500 |
TI | CPSW | ∞2 | ∞2 | ∞2 | ∞2 | ∞2 |
VMWare | VMXNET 3 | 3494 | 3492 | 3494 | 3428 | 3492 |
With XDP fragments enabled, the limits are:
Vendor | Driver | x86 | arm | arm64 | armv7 | riscv |
---|---|---|---|---|---|---|
Kernel | Veth | 73152 | 73150 | 73152 | 73086 | 73150 |
Kernel | VirtIO | ∞2 | ∞2 | ∞2 | ∞2 | ∞2 |
Kernel | Tun | |||||
Kernel | Bond | 4 | 4 | 4 | 4 | 4 |
Xen | Netfront | |||||
Amazon | ENA | |||||
Aquantia/Marvell | AQtion | ∞2 | ∞2 | ∞2 | ∞2 | ∞2 |
Broadcom | BNXT | ∞2 | ∞2 | ∞2 | ∞2 | ∞2 |
Cavium | Thunder (nicvf) | |||||
Engelder | TSN Endpoint | |||||
Freescale | FEC | |||||
Freescale | DPAA | |||||
Freescale | DPAA2 | ?3 | ?3 | ?3 | ?3 | ?3 |
Freescale | ENETC | |||||
Fungible | Funeth | |||||
Google | GVE | |||||
Intel | I40e | 9702 | 9702 | 9702 | 9702 | 9702 |
Intel | ICE | 3046 | 3046 | 3046 | 3046 | 3046 |
Intel | IGB | |||||
Intel | IGC | |||||
Intel | IXGBE | |||||
Intel | IXGBEVF | |||||
Marvell | NETA | ∞2 | ∞2 | ∞2 | ∞2 | ∞2 |
Marvell | PPv2 | 3552 | 3552 | 3552 | 3488 | 3552 |
Marvell | Octeon TX2 | |||||
MediaTek | MTK | |||||
Mellanox | MLX4 | |||||
Mellanox | MLX5 | ∞2 | ∞2 | ∞2 | ∞2 | ∞2 |
Microchip | LAN966x | |||||
Microsoft | Mana | |||||
Microsoft | Hyper-V | |||||
Netronome | NFP | |||||
Pensando | Ionic | |||||
Qlogic | QEDE | |||||
Solarflare | SFP (SFC9xxx PF/VF) | |||||
Solarflare | SFP (Riverhead) | |||||
Solarflare | SFP (SFC4000A) | |||||
Solarflare | SFP (SFC4000B) | |||||
Solarflare | SFP (SFC9020/SFL9021) | |||||
Socionext | NetSec | |||||
STMicro | ST MAC | |||||
TI | CPSW | |||||
VMWare | VMXNET 3 |
Warning
If the configured MTU on a network interface is higher than the limit calculated by the network driver, XDP programs cannot be attached. When attaching via netlink, most drivers will use netlink debug messages to communicate the exact limit. When attaching via BPF links, no such feedback is given by default. The error message can still be obtained by attaching an eBPF program to the bpf_xdp_link_attach_failed tracepoint and printing the error message or passing it to userspace.
Helper functions
Not all helper functions are available in all program types. These are the helper calls available for XDP programs:
Supported helper functions
bpf_perf_event_output
bpf_get_smp_processor_id
bpf_csum_diff
bpf_xdp_adjust_head
bpf_xdp_adjust_meta
bpf_redirect
bpf_redirect_map
bpf_xdp_adjust_tail
bpf_xdp_get_buff_len
bpf_xdp_load_bytes
bpf_xdp_store_bytes
bpf_fib_lookup
bpf_check_mtu
bpf_sk_lookup_udp
bpf_sk_lookup_tcp
bpf_sk_release
bpf_skc_lookup_tcp
bpf_tcp_check_syncookie
bpf_tcp_gen_syncookie
bpf_tcp_raw_gen_syncookie_ipv4
bpf_tcp_raw_gen_syncookie_ipv6
bpf_tcp_raw_check_syncookie_ipv4
bpf_tcp_raw_check_syncookie_ipv6
bpf_map_lookup_elem
bpf_map_update_elem
bpf_map_delete_elem
bpf_map_push_elem
bpf_map_pop_elem
bpf_map_peek_elem
bpf_map_lookup_percpu_elem
bpf_get_prandom_u32
bpf_get_numa_node_id
bpf_tail_call
bpf_ktime_get_ns
bpf_ktime_get_boot_ns
bpf_ringbuf_output
bpf_ringbuf_reserve
bpf_ringbuf_submit
bpf_ringbuf_discard
bpf_ringbuf_query
bpf_for_each_map_elem
bpf_loop
bpf_strncmp
bpf_spin_lock
bpf_spin_unlock
bpf_jiffies64
bpf_per_cpu_ptr
bpf_this_cpu_ptr
bpf_timer_init
bpf_timer_set_callback
bpf_timer_start
bpf_timer_cancel
bpf_trace_printk
bpf_get_current_task
bpf_get_current_task_btf
bpf_probe_read_user
bpf_probe_read_kernel
bpf_probe_read_user_str
bpf_probe_read_kernel_str
bpf_snprintf_btf
bpf_snprintf
bpf_task_pt_regs
bpf_trace_vprintk
bpf_cgrp_storage_get
bpf_cgrp_storage_delete
bpf_dynptr_data
bpf_dynptr_from_mem
bpf_dynptr_read
bpf_dynptr_write
bpf_kptr_xchg
bpf_ktime_get_tai_ns
bpf_ringbuf_discard_dynptr
bpf_ringbuf_reserve_dynptr
bpf_ringbuf_submit_dynptr
bpf_user_ringbuf_drain
KFuncs
Supported kfuncs
bpf_arena_alloc_pages
bpf_arena_free_pages
bpf_cast_to_kern_ctx
bpf_cgroup_acquire
bpf_cgroup_ancestor
bpf_cgroup_from_id
bpf_cgroup_release
bpf_crypto_decrypt
bpf_crypto_encrypt
bpf_ct_change_status
bpf_ct_change_timeout
bpf_ct_insert_entry
bpf_ct_release
bpf_ct_set_nat_info
bpf_ct_set_status
bpf_ct_set_timeout
bpf_dynptr_adjust
bpf_dynptr_clone
bpf_dynptr_from_xdp
bpf_dynptr_is_null
bpf_dynptr_is_rdonly
bpf_dynptr_size
bpf_dynptr_slice
bpf_dynptr_slice_rdwr
bpf_iter_bits_destroy
bpf_iter_bits_new
bpf_iter_bits_next
bpf_iter_css_destroy
bpf_iter_css_new
bpf_iter_css_next
bpf_iter_css_task_destroy
bpf_iter_css_task_new
bpf_iter_css_task_next
bpf_iter_num_destroy
bpf_iter_num_new
bpf_iter_num_next
bpf_iter_task_destroy
bpf_iter_task_new
bpf_iter_task_next
bpf_iter_task_vma_destroy
bpf_iter_task_vma_new
bpf_iter_task_vma_next
bpf_list_pop_back
bpf_list_pop_front
bpf_list_push_back_impl
bpf_list_push_front_impl
bpf_map_sum_elem_count
bpf_obj_drop_impl
bpf_obj_new_impl
bpf_percpu_obj_drop_impl
bpf_percpu_obj_new_impl
bpf_preempt_disable
bpf_preempt_enable
bpf_rbtree_add_impl
bpf_rbtree_first
bpf_rbtree_remove
bpf_rcu_read_lock
bpf_rcu_read_unlock
bpf_rdonly_cast
bpf_refcount_acquire_impl
bpf_skb_ct_alloc
bpf_skb_ct_lookup
bpf_task_acquire
bpf_task_from_pid
bpf_task_get_cgroup1
bpf_task_release
bpf_task_under_cgroup
bpf_throw
bpf_wq_init
bpf_wq_set_callback_impl
bpf_wq_start
bpf_xdp_ct_alloc
bpf_xdp_ct_lookup
bpf_xdp_flow_lookup
bpf_xdp_metadata_rx_hash
bpf_xdp_metadata_rx_timestamp
bpf_xdp_metadata_rx_vlan_tag
crash_kexec