Program type BPF_PROG_TYPE_SK_REUSEPORT
Socket reuse port programs can be attached to a SO_REUSEPORT
socket group to replace the default socket selection mechanism.
Usage
In v3.9 the SO_REUSEPORT
socket option was added, which allows multiple sockets to listen on the same port on the same host. Its original purpose was to distribute traffic efficiently across threads, work that would otherwise have to be done in userspace and cause unnecessary delay.
By default, incoming connections and datagrams are distributed to the server sockets using a hash based on the 4-tuple of the connection—that is, the peer IP address and port plus the local IP address and port.
With the introduction of the BPF_PROG_TYPE_SK_REUSEPORT
program type, the BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
map, and the bpf_sk_select_reuseport
helper function, the default distribution behavior can be replaced with a BPF program.
A key feature is that the sockets do not have to belong to the same process. This means that you can steer traffic between two processes to do A/B testing or software updates without dropping connections. For the latter scenario, the typical use case is to use a map-in-map with a BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
as the inner map, allowing userspace to switch out all sockets at once. In that scenario, existing TCP connections are still handled by the old sockets and process, while new connections are routed to the new process.
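A minimal program of this type might look like the sketch below, which reuses the kernel-computed hash to pick a slot in a BPF_MAP_TYPE_REUSEPORT_SOCKARRAY. The map name socks, the size NR_SOCKS, and the function names are hypothetical. To stay self-contained, the sketch declares SEC, __uint, __type, and the helper pointer locally; a real program would normally take them from libbpf's <bpf/bpf_helpers.h>.

```c
#include <linux/bpf.h>

/* Local stand-ins for definitions normally provided by <bpf/bpf_helpers.h>. */
#define SEC(name) __attribute__((section(name), used))
#define __uint(name, val) int (*name)[val]
#define __type(name, val) __typeof__(val) *name

static long (*bpf_sk_select_reuseport)(struct sk_reuseport_md *reuse, void *map,
                                       void *key, __u64 flags) =
    (void *)(unsigned long)BPF_FUNC_sk_select_reuseport;

#define NR_SOCKS 4 /* hypothetical number of sockets in the group */

/* Sockets of the group; userspace is assumed to fill slots 0..NR_SOCKS-1. */
struct {
    __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
    __uint(max_entries, NR_SOCKS);
    __type(key, __u32);
    __type(value, __u64);
} socks SEC(".maps");

/* Maps the kernel-provided 4-tuple hash onto a slot of the array. */
static inline __u32 pick_index(__u32 hash)
{
    return hash % NR_SOCKS;
}

SEC("sk_reuseport")
int select_by_hash(struct sk_reuseport_md *reuse_md)
{
    __u32 index = pick_index(reuse_md->hash);

    /* On success the helper records the socket stored at socks[index]. */
    bpf_sk_select_reuseport(reuse_md, &socks, &index, 0);

    /* If the helper failed (e.g. empty slot), SK_PASS without a selected
     * socket is expected to leave the choice to the kernel's default
     * selection. */
    return SK_PASS;
}

char _license[] SEC("license") = "GPL";
```

This particular policy only reproduces a hash-based spread; the point of the program type is that pick_index could be replaced by any logic over the context fields described in the next section.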
Context
The context of this program type is struct sk_reuseport_md
. All fields of this context type are read-only and may not be modified by the program directly.
C structure
struct sk_reuseport_md {
/*
* Start of directly accessible data. It begins from
* the tcp/udp header.
*/
__bpf_md_ptr(void *, data);
/* End of directly accessible data */
__bpf_md_ptr(void *, data_end);
/*
* Total length of packet (starting from the tcp/udp header).
* Note that the directly accessible bytes (data_end - data)
* could be less than this "len". Those bytes could be
* indirectly read by a helper "bpf_skb_load_bytes()".
*/
__u32 len;
/*
* Eth protocol in the mac header (network byte order). e.g.
* ETH_P_IP(0x0800) and ETH_P_IPV6(0x86DD)
*/
__u32 eth_protocol;
__u32 ip_protocol; /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */
__u32 bind_inany; /* Is sock bound to an INANY address? */
__u32 hash; /* A hash of the packet 4 tuples */
/* When reuse->migrating_sk is NULL, it is selecting a sk for the
* new incoming connection request (e.g. selecting a listen sk for
* the received SYN in the TCP case). reuse->sk is one of the sk
* in the reuseport group. The bpf prog can use reuse->sk to learn
* the local listening ip/port without looking into the skb.
*
* When reuse->migrating_sk is not NULL, reuse->sk is closed and
* reuse->migrating_sk is the socket that needs to be migrated
* to another listening socket. migrating_sk could be a fullsock
* sk that is fully established or a reqsk that is in-the-middle
* of 3-way handshake.
*/
__bpf_md_ptr(struct bpf_sock *, sk);
__bpf_md_ptr(struct bpf_sock *, migrating_sk);
};
data
This field contains a pointer to the start of the directly accessible data, which begins at the TCP/UDP header.
Note
This program type only has read access; it may not modify the packet data.
data_end
This field contains a pointer to the end of the directly accessible data.
len
This field contains the total length of the packet (starting from the TCP/UDP header).
Note
The directly accessible bytes (data_end - data) could be less than this len
. Those bytes could be indirectly read by a helper bpf_skb_load_bytes
.
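Because only the bytes between data and data_end can be read directly, every access needs a bounds check first. The sketch below reads the destination port, which sits at bytes 2-3 of both the TCP and the UDP header. It is written as a plain C function over the UAPI struct so it can be compiled and exercised in userspace; the name l4_dest_port is hypothetical, and inside a BPF program the same check is what the verifier would require.

```c
#include <linux/bpf.h>
#include <linux/types.h>

/* Returns the destination port (host byte order) of the TCP/UDP header
 * at md->data, or -1 if fewer than 4 bytes are directly accessible. */
static inline int l4_dest_port(const struct sk_reuseport_md *md)
{
    const __u8 *data = md->data;
    const __u8 *data_end = md->data_end;

    /* Source port: bytes 0-1, destination port: bytes 2-3. */
    if (data + 4 > data_end)
        return -1;

    /* Ports are stored in network byte order (big endian). */
    return (data[2] << 8) | data[3];
}
```

When the bytes of interest lie beyond data_end but within len, bpf_skb_load_bytes would be the way to read them instead.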
eth_protocol
This field contains the Ethernet protocol from the MAC header (network byte order), e.g. ETH_P_IP
(0x0800
) and ETH_P_IPV6
(0x86DD
)
ip_protocol
This field contains the IP protocol, e.g. IPPROTO_TCP
, IPPROTO_UDP
.
bind_inany
This field is true
if the socket group is bound to an INANY address.
hash
This field is a hash of the packet's 4-tuple.
sk and migrating_sk
These fields are used together to handle socket migration. The value of migrating_sk tells the program why it was invoked.
When migrating_sk is NULL, the program is selecting a socket for a new incoming connection request (e.g. selecting a listening socket for the received SYN in the TCP case). sk is one of the sockets in the reuseport group, and the program can use it to learn the local listening IP and port without looking into the skb.
When migrating_sk is not NULL, sk is closed and migrating_sk is the socket that needs to be migrated to another listening socket. migrating_sk can be a full socket that is fully established or a request socket that is in the middle of the 3-way handshake.
Attachment
This program type can be attached to a reuse port socket group by using the setsockopt
syscall on one of the sockets in the group with the SOL_SOCKET
socket level and SO_ATTACH_REUSEPORT_EBPF
socket option.
This program should be loaded with the BPF_SK_REUSEPORT_SELECT
expected_attach_type
to use it only for the selection logic or BPF_SK_REUSEPORT_SELECT_OR_MIGRATE
if the program should also handle socket migration logic.
Socket migration
Before v5.14, the reuse port feature had a defect in its logic. When a SYN packet is received, the connection is tied to a listening socket. Accordingly, when the listener is closed, in-flight requests during the three-way handshake and child sockets in the accept queue are dropped even if other listeners could accept such connections.
This situation can happen when server management tools restart server processes such as nginx. For instance, when we change the nginx configuration and restart it, it spins up new workers that respect the new configuration and closes all listeners on the old workers, resulting in in-flight connection requests being dropped.
To fix this defect, the concept of socket migration was added, which will repeat the socket selection logic to pick a new socket. When not using eBPF, the same hash logic is used, but only if the net.ipv4.tcp_migrate_req
sysctl setting has been enabled. When using eBPF with this program type, loading the program with the BPF_SK_REUSEPORT_SELECT_OR_MIGRATE
attachment type indicates that this program also overrides the migration logic, so the sysctl option does not need to be set in this case. This does mean that the program can be called for initial selection as well as for migration. The sk
and migrating_sk
context fields indicate for which purpose the program is invoked.
When invoked for migration, the program can take one of the following actions:
- Return SK_PASS after selecting a socket with bpf_sk_select_reuseport to use the selected socket as the new listener.
- Return SK_PASS without calling bpf_sk_select_reuseport to fall back to random selection.
- Return SK_DROP to cancel the migration.
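The actions above can be sketched as a program loaded with BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, for which libbpf uses the section name sk_reuseport/migrate. The map name new_listeners, its size, and the fixed slot 0 are hypothetical choices for illustration. As a self-containment measure, SEC, __uint, __type, and the helper pointer are declared locally here; they normally come from <bpf/bpf_helpers.h>.

```c
#include <linux/bpf.h>

/* Local stand-ins for definitions normally provided by <bpf/bpf_helpers.h>. */
#define SEC(name) __attribute__((section(name), used))
#define __uint(name, val) int (*name)[val]
#define __type(name, val) __typeof__(val) *name

static long (*bpf_sk_select_reuseport)(struct sk_reuseport_md *reuse, void *map,
                                       void *key, __u64 flags) =
    (void *)(unsigned long)BPF_FUNC_sk_select_reuseport;

/* Listeners of the new process generation (hypothetical map). */
struct {
    __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} new_listeners SEC(".maps");

/* Nonzero when the program was invoked to migrate a socket. */
static inline int is_migration(const struct sk_reuseport_md *reuse_md)
{
    return reuse_md->migrating_sk != 0;
}

SEC("sk_reuseport/migrate")
int select_or_migrate(struct sk_reuseport_md *reuse_md)
{
    __u32 index = 0;

    /* Initial selection: select nothing and leave it to the kernel. */
    if (!is_migration(reuse_md))
        return SK_PASS;

    /* Migration: hand the socket to the listener in slot 0, or cancel
     * the migration if that slot cannot be used. */
    if (bpf_sk_select_reuseport(reuse_md, &new_listeners, &index, 0) != 0)
        return SK_DROP;

    return SK_PASS;
}

char _license[] SEC("license") = "GPL";
```

Returning SK_DROP on failure is one possible policy; returning SK_PASS there instead would fall back to random selection, as listed above.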
Note
The kernel selects a listening socket in three places, but it does not have a struct sk_buff
when closing a listener or retransmitting a SYN+ACK. However, some helper functions do not expect the skb to be NULL (e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in BPF_FUNC_skb_load_bytes_relative()), so the kernel temporarily allocates an empty skb before running the eBPF program.
Helper functions
Not all helper functions are available in all program types. These are the helper calls available for socket reuse port programs:
Supported helper functions
bpf_cgrp_storage_delete
bpf_cgrp_storage_get
bpf_dynptr_data
bpf_dynptr_from_mem
bpf_dynptr_read
bpf_dynptr_write
bpf_for_each_map_elem
bpf_get_current_pid_tgid (v6.10)
bpf_get_current_task
bpf_get_current_task_btf
bpf_get_ns_current_pid_tgid (v6.10)
bpf_get_numa_node_id
bpf_get_prandom_u32
bpf_get_smp_processor_id
bpf_get_socket_cookie
bpf_jiffies64
bpf_kptr_xchg
bpf_ktime_get_boot_ns
bpf_ktime_get_coarse_ns
bpf_ktime_get_ns
bpf_ktime_get_tai_ns
bpf_loop
bpf_map_delete_elem
bpf_map_lookup_elem
bpf_map_lookup_percpu_elem
bpf_map_peek_elem
bpf_map_pop_elem
bpf_map_push_elem
bpf_map_update_elem
bpf_per_cpu_ptr
bpf_probe_read_kernel
bpf_probe_read_kernel_str
bpf_probe_read_user
bpf_probe_read_user_str
bpf_ringbuf_discard
bpf_ringbuf_discard_dynptr
bpf_ringbuf_output
bpf_ringbuf_query
bpf_ringbuf_reserve
bpf_ringbuf_reserve_dynptr
bpf_ringbuf_submit
bpf_ringbuf_submit_dynptr
bpf_sk_select_reuseport
bpf_skb_load_bytes
bpf_skb_load_bytes_relative
bpf_snprintf
bpf_snprintf_btf
bpf_spin_lock
bpf_spin_unlock
bpf_strncmp
bpf_tail_call
bpf_task_pt_regs
bpf_this_cpu_ptr
bpf_timer_cancel
bpf_timer_init
bpf_timer_set_callback
bpf_timer_start
bpf_trace_printk
bpf_trace_vprintk
bpf_user_ringbuf_drain
KFuncs
There are currently no kfuncs supported for this program type.