Program type BPF_PROG_TYPE_SOCK_OPS
Socket ops programs are attached to cGroups and get called for multiple lifecycle events of a socket, giving the program the opportunity to change settings per connection or to record the existence of a socket.
Usage
Socket ops programs are called multiple times on the same socket during different parts of its lifecycle for different operations. Some operations query the program for certain parameters, others just inform the program of certain events so the program can perform some action at that time.
Regardless of the type of operation, the program should always return 1 on success. A negative integer indicates that an operation is not supported. For operations that query information, the reply field in the context is used to "reply" to the query; the program is expected to set it equal to the requested value.
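The convention above can be sketched as follows. This is a minimal illustration, not a complete program: the handler is written as a plain C function so that it can be compiled and exercised in user space, whereas a real program would mark it with SEC("sockops") and build it for the BPF target. The name handle_sock_ops and the value EXAMPLE_SYN_RTO are assumptions made for this sketch.

```c
#include <linux/bpf.h>

/* Hypothetical reply value, chosen only for illustration. */
#define EXAMPLE_SYN_RTO 10

/* Body of a socket ops handler, written as a plain function. */
int handle_sock_ops(struct bpf_sock_ops *ctx)
{
    switch (ctx->op) {
    case BPF_SOCK_OPS_TIMEOUT_INIT:
        /* Query op: the answer goes into the reply field. */
        ctx->reply = EXAMPLE_SYN_RTO;
        break;
    default:
        /* Op not handled by this program: reply with -1. */
        ctx->reply = -1;
        break;
    }
    /* The program itself returns 1 in both cases. */
    return 1;
}
```

The clamping example later in this document follows the same pattern: it stores its answer in reply and returns 1.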
There are a few envisioned use cases for this program type. The first is to reply with certain settings like RTO, RTT and ECN (see the ops section for details) or to set socket options using the bpf_setsockopt helper to tune settings/options on a per-connection basis.
For example, it is easy to use Facebook's internal IPv6 addresses to determine if both hosts of a connection are in the same data center. Therefore, it is easy to write a BPF program to choose a small SYN RTO value when both hosts are in the same data center.
Secondly, socket ops programs are in an excellent position to gather detailed metrics about connections, especially since v4.16.
Thirdly, socket ops programs can be used to implement TCP options which are not known to the kernel, both on the sending and receiving side. See BPF_SOCK_OPS_PARSE_HDR_OPT_CB and BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
Last but not least, socket ops programs can be used to dynamically add sockets to BPF_MAP_TYPE_SOCKMAP or BPF_MAP_TYPE_SOCKHASH maps. Since socket ops programs are notified when sockets are connecting or listening, they can add the sockets to these maps before any actual message traffic happens. This allows BPF_PROG_TYPE_SK_MSG and BPF_PROG_TYPE_SK_SKB programs to operate without user space needing to add sockets to the sock maps. The bpf_sock_map_update and bpf_sock_hash_update helpers exist for this very purpose.
Ops
After attaching the program, it will be invoked for multiple sockets and multiple ops. The op field in the context indicates for which operation the program is invoked. Availability of fields in the context and the meaning of return values vary from op to op.
The ops ending with _CB
are callbacks which are just called to notify the program of an event. Return values for these ops are ignored. Some of these callbacks are not triggered unless activated by setting flags on the socket. Setting these flags is done by the program itself with the use of the bpf_sock_ops_cb_flags_set
helper which can both set and unset flags.
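Because a single helper call both sets and unsets flags, a program that wants to toggle one callback usually has to preserve the others. The sketch below computes the mask to hand to bpf_sock_ops_cb_flags_set, assuming the helper replaces the socket's whole flag set with the value it is given. The function names are hypothetical, and the functions are plain C so that the bit logic can be checked in user space.

```c
#include <linux/bpf.h>

/* Mask that enables `flag` while keeping callbacks that are already on. */
__u32 cb_flags_enable(const struct bpf_sock_ops *ctx, __u32 flag)
{
    return ctx->bpf_sock_ops_cb_flags | flag;
}

/* Mask that disables `flag` while keeping all other callbacks. */
__u32 cb_flags_disable(const struct bpf_sock_ops *ctx, __u32 flag)
{
    return ctx->bpf_sock_ops_cb_flags & ~flag;
}

/* Inside a BPF program the result would be passed to the helper, e.g.:
 *   bpf_sock_ops_cb_flags_set(ctx,
 *           cb_flags_enable(ctx, BPF_SOCK_OPS_STATE_CB_FLAG));
 */
```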
BPF_SOCK_OPS_TIMEOUT_INIT
When invoked with this op, the program can overwrite the default RTO (retransmission timeout) for a SYN or SYN-ACK. The program can reply with -1 if the default value should be used.
BPF_SOCK_OPS_RWND_INIT
When invoked with this op, the program can overwrite the default initial advertised window (in packets), or reply with -1 if the default value should be used.
BPF_SOCK_OPS_TCP_CONNECT_CB
The program is invoked with this op when a socket is in the 'connect' state: it has sent out a SYN message but is not yet established. This is just a notification; the return value is discarded.
BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB
The program is invoked with this op when an active socket transitions to an established connection. This happens when an outgoing connection establishes. This is just a notification; the return value is discarded.
BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
The program is invoked with this op when a passive socket transitions to an established connection. This happens when an incoming connection establishes. This is just a notification; the return value is discarded.
BPF_SOCK_OPS_NEEDS_ECN
When invoked with this op, the program is asked if ECN (Explicit Congestion Notification) should be enabled for a given connection. The program is expected to reply with 0 or 1.
BPF_SOCK_OPS_BASE_RTT
When invoked with this op, the program is asked for the base RTT (Round Trip Time) for a given connection. If the measured RTT goes above this value, it indicates the connection is congested and the congestion control algorithm will take steps.
BPF_SOCK_OPS_RTO_CB
When BPF_SOCK_OPS_RTO_CB_FLAG is set via bpf_sock_ops_cb_flags_set, the program may be called with this op to indicate that an RTO (Retransmission Timeout) has triggered. This is just a notification; the return value is discarded.
The arguments in the context will have the following meanings:
args[0]: value of icsk_retransmits
args[1]: value of icsk_rto
args[2]: whether RTO has expired
BPF_SOCK_OPS_RETRANS_CB
When the BPF_SOCK_OPS_RETRANS_CB_FLAG flag is set with bpf_sock_ops_cb_flags_set, the program is invoked with this op when a packet from the skb has been retransmitted. This is just a notification; the return value is discarded.
The arguments in the context will have the following meanings:
args[0]: sequence number of 1st byte
args[1]: # segments
args[2]: return value of tcp_transmit_skb (0 => success)
BPF_SOCK_OPS_STATE_CB
When the BPF_SOCK_OPS_STATE_CB_FLAG flag is set with bpf_sock_ops_cb_flags_set, the program is invoked with this op when the TCP state of the socket changes. This is just a notification; the return value is discarded.
The arguments in the context will have the following meanings:
args[0]: old_state
args[1]: new_state
The states will be one of:
enum {
    BPF_TCP_ESTABLISHED = 1,
    BPF_TCP_SYN_SENT,
    BPF_TCP_SYN_RECV,
    BPF_TCP_FIN_WAIT1,
    BPF_TCP_FIN_WAIT2,
    BPF_TCP_TIME_WAIT,
    BPF_TCP_CLOSE,
    BPF_TCP_CLOSE_WAIT,
    BPF_TCP_LAST_ACK,
    BPF_TCP_LISTEN,
    BPF_TCP_CLOSING, /* Now a valid state */
    BPF_TCP_NEW_SYN_RECV
};
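As an illustration of how the two state arguments might be consumed, the sketch below maps a state value to a printable name and detects a transition into BPF_TCP_CLOSE. Both function names are hypothetical, and the code is plain C so it can be tried in user space; in a BPF program the same checks would sit inside the SEC("sockops") function.

```c
#include <linux/bpf.h>

/* Printable name for a BPF_TCP_* state value. */
const char *tcp_state_name(__u32 state)
{
    switch (state) {
    case BPF_TCP_ESTABLISHED:  return "ESTABLISHED";
    case BPF_TCP_SYN_SENT:     return "SYN_SENT";
    case BPF_TCP_SYN_RECV:     return "SYN_RECV";
    case BPF_TCP_FIN_WAIT1:    return "FIN_WAIT1";
    case BPF_TCP_FIN_WAIT2:    return "FIN_WAIT2";
    case BPF_TCP_TIME_WAIT:    return "TIME_WAIT";
    case BPF_TCP_CLOSE:        return "CLOSE";
    case BPF_TCP_CLOSE_WAIT:   return "CLOSE_WAIT";
    case BPF_TCP_LAST_ACK:     return "LAST_ACK";
    case BPF_TCP_LISTEN:       return "LISTEN";
    case BPF_TCP_CLOSING:      return "CLOSING";
    case BPF_TCP_NEW_SYN_RECV: return "NEW_SYN_RECV";
    default:                   return "UNKNOWN";
    }
}

/* For BPF_SOCK_OPS_STATE_CB, args[0] is the old state and args[1] the new
 * one. Returns non-zero when the socket has just moved into CLOSE. */
int is_close_transition(const struct bpf_sock_ops *ctx)
{
    return ctx->op == BPF_SOCK_OPS_STATE_CB &&
           ctx->args[0] != BPF_TCP_CLOSE &&
           ctx->args[1] == BPF_TCP_CLOSE;
}
```

A check like is_close_transition is a natural place to flush per-connection metrics, since it runs once the connection has ended.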
BPF_SOCK_OPS_TCP_LISTEN_CB
The program is invoked with this op when the listen syscall is used on the socket, transitioning it to the LISTEN state. This is just a notification; the return value is discarded.
BPF_SOCK_OPS_RTT_CB
When the BPF_SOCK_OPS_RTT_CB_FLAG flag is set with bpf_sock_ops_cb_flags_set, the program is invoked with this op for every round trip. This is just a notification; the return value is discarded.
BPF_SOCK_OPS_PARSE_HDR_OPT_CB
The program is invoked with this op to parse TCP headers. If the BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG flag is set, the program will be invoked for all received TCP headers; if BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG is set, the program is only invoked for TCP headers carrying options unknown to the kernel.
The program will be invoked to handle the packets received on an already established connection.
The TCP header in question starts at sock_ops->skb_data; the bpf_load_hdr_opt helper can also be used to search for a particular option.
This is just a notification; the return value is discarded.
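To make the layout of the bytes between skb_data and skb_data_end concrete, here is a user-space sketch that walks the option list of a TCP header and looks for one option kind. The function name and the EX_ constants are assumptions for this example. In a BPF program the loop would need a bound the verifier can prove, and bpf_load_hdr_opt is likely the simpler route.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_TCPOPT_EOL 0   /* end of option list */
#define EX_TCPOPT_NOP 1   /* one-byte padding option */

/* Walks the options of the TCP header in [hdr, end) and returns a pointer
 * to the first option of the given kind, or NULL if it is absent or the
 * header is malformed. */
const uint8_t *find_tcp_option(const uint8_t *hdr, const uint8_t *end,
                               uint8_t kind)
{
    const uint8_t *p;
    size_t hdr_len;

    if (end - hdr < 20)                      /* fixed header is 20 bytes */
        return NULL;
    hdr_len = (size_t)(hdr[12] >> 4) * 4;    /* data offset, in 32-bit words */
    if (hdr_len < 20 || hdr_len > (size_t)(end - hdr))
        return NULL;

    p = hdr + 20;
    end = hdr + hdr_len;
    while (p < end) {
        if (*p == EX_TCPOPT_EOL)
            return NULL;
        if (*p == EX_TCPOPT_NOP) {
            p++;
            continue;
        }
        /* Any other option has a length byte covering kind + len + data. */
        if (end - p < 2 || p[1] < 2 || p[1] > end - p)
            return NULL;
        if (*p == kind)
            return p;
        p += p[1];
    }
    return NULL;
}
```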
BPF_SOCK_OPS_HDR_OPT_LEN_CB
When the BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG flag is set with bpf_sock_ops_cb_flags_set, the program is invoked with this op to reserve space for TCP options which will be written to the packet when the program is invoked with the BPF_SOCK_OPS_WRITE_HDR_OPT_CB op.
The arguments in the context will have the following meanings:
args[0]: bool want_cookie. (in writing SYNACK only)
sock_ops->skb_data: Not available because no header has been written yet.
sock_ops->skb_tcp_flags: The tcp_flags of the outgoing skb. (e.g. SYN, ACK, FIN).
The bpf_reserve_hdr_opt helper should be used to reserve space.
This is just a notification; the return value is discarded.
BPF_SOCK_OPS_WRITE_HDR_OPT_CB
When the BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG flag is set with bpf_sock_ops_cb_flags_set, the program is invoked with this op to write TCP options to the packet. The room for these options has been reserved in a previous invocation of the program with the BPF_SOCK_OPS_HDR_OPT_LEN_CB op.
The arguments in the context will have the following meanings:
args[0]: bool want_cookie. (in writing SYNACK only)
sock_ops->skb_data: Referring to the outgoing skb. It covers the TCP header that has already been written by the kernel and the earlier BPF programs.
sock_ops->skb_tcp_flags: The tcp_flags of the outgoing skb. (e.g. SYN, ACK, FIN).
The bpf_store_hdr_opt helper should be used to write the option.
The bpf_load_hdr_opt helper can also be used to search for a particular option that has already been written by the kernel or the earlier BPF programs.
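The reserve and write steps have to agree on the size of the option. One way to keep them in sync is to build the option bytes (kind, length, payload) with a single routine and use its result in both ops. The sketch below is user-space C under stated assumptions: the function name is hypothetical, kind 253 is one of the experimental option kinds, and the 40-byte cap reflects the maximum option space of a TCP header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_TCP_MAX_OPT_SPACE 40   /* 60-byte max header minus 20 fixed bytes */

/* Encodes one TCP option as kind, length, payload into buf.
 * Returns the total option length, or 0 if it does not fit. */
size_t encode_tcp_option(uint8_t *buf, size_t cap, uint8_t kind,
                         const void *payload, size_t payload_len)
{
    size_t total = payload_len + 2;   /* kind byte + length byte + payload */

    if (total > cap || total > EX_TCP_MAX_OPT_SPACE)
        return 0;
    buf[0] = kind;
    buf[1] = (uint8_t)total;          /* length covers the whole option */
    memcpy(buf + 2, payload, payload_len);
    return total;
}
```

In a BPF program, the returned length would be the amount passed to bpf_reserve_hdr_opt during BPF_SOCK_OPS_HDR_OPT_LEN_CB, and the filled buffer would be handed to bpf_store_hdr_opt during BPF_SOCK_OPS_WRITE_HDR_OPT_CB.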
Context
struct bpf_sock_ops
C structure
/* User bpf_sock_ops struct to access socket values and specify request ops
 * and their replies.
 * Some of this fields are in network (bigendian) byte order and may need
 * to be converted before use (bpf_ntohl() defined in samples/bpf/bpf_endian.h).
 * New fields can only be added at the end of this structure
 */
struct bpf_sock_ops {
    __u32 op;
    union {
        __u32 args[4];      /* Optionally passed to bpf program */
        __u32 reply;        /* Returned by bpf program */
        __u32 replylong[4]; /* Optionally returned by bpf prog */
    };
    __u32 family;
    __u32 remote_ip4;       /* Stored in network byte order */
    __u32 local_ip4;        /* Stored in network byte order */
    __u32 remote_ip6[4];    /* Stored in network byte order */
    __u32 local_ip6[4];     /* Stored in network byte order */
    __u32 remote_port;      /* Stored in network byte order */
    __u32 local_port;       /* stored in host byte order */
    __u32 is_fullsock;      /* Some TCP fields are only valid if
                             * there is a full socket. If not, the
                             * fields read as zero.
                             */
    __u32 snd_cwnd;
    __u32 srtt_us;          /* Averaged RTT << 3 in usecs */
    __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
    __u32 state;
    __u32 rtt_min;
    __u32 snd_ssthresh;
    __u32 rcv_nxt;
    __u32 snd_nxt;
    __u32 snd_una;
    __u32 mss_cache;
    __u32 ecn_flags;
    __u32 rate_delivered;
    __u32 rate_interval_us;
    __u32 packets_out;
    __u32 retrans_out;
    __u32 total_retrans;
    __u32 segs_in;
    __u32 data_segs_in;
    __u32 segs_out;
    __u32 data_segs_out;
    __u32 lost_out;
    __u32 sacked_out;
    __u32 sk_txhash;
    __u64 bytes_received;
    __u64 bytes_acked;
    __bpf_md_ptr(struct bpf_sock *, sk);
    /* [skb_data, skb_data_end) covers the whole TCP header.
     *
     * BPF_SOCK_OPS_PARSE_HDR_OPT_CB: The packet received
     * BPF_SOCK_OPS_HDR_OPT_LEN_CB: Not useful because the
     *                              header has not been written.
     * BPF_SOCK_OPS_WRITE_HDR_OPT_CB: The header and options have
     *                                been written so far.
     * BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: The SYNACK that concludes
     *                                     the 3WHS.
     * BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: The ACK that concludes
     *                                      the 3WHS.
     *
     * bpf_load_hdr_opt() can also be used to read a particular option.
     */
    __bpf_md_ptr(void *, skb_data);
    __bpf_md_ptr(void *, skb_data_end);
    __u32 skb_len;          /* The total length of a packet.
                             * It includes the header, options,
                             * and payload.
                             */
    __u32 skb_tcp_flags;    /* tcp_flags of the header. It provides
                             * an easy way to check for tcp_flags
                             * without parsing skb_data.
                             *
                             * In particular, the skb_tcp_flags
                             * will still be available in
                             * BPF_SOCK_OPS_HDR_OPT_LEN even though
                             * the outgoing header has not
                             * been written yet.
                             */
    __u64 skb_hwtstamp;
};
op
This field will indicate the current operation, see the ops section for the possible values and meanings.
args
This field is an array of 4 __u32 values, used by some operations to provide additional information. The meaning of the arguments is dependent on the op.
reply
This field is used as the return value for operations that expect one. It is the only field the BPF program is allowed to modify.
replylong
This field was envisioned to be used for replies that do not fit in a single __u32, but in practice this has not occurred as of v6.3.
family
The address family of the socket for which the program is invoked. One of the AF_* enums.
remote_ip4
The remote IPv4 address in network byte order if family == AF_INET.
local_ip4
The local IPv4 address in network byte order if family == AF_INET.
remote_ip6
The remote IPv6 address in network byte order if family == AF_INET6.
local_ip6
The local IPv6 address in network byte order if family == AF_INET6.
remote_port
The remote transport layer (layer 4) port in network byte order.
local_port
The local transport layer (layer 4) port in host byte order.
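The two port fields do not share a byte order: per the comments in the structure above, remote_port is stored in network byte order while local_port is stored in host byte order. The sketch below normalizes both before comparing. The function names are hypothetical, and ntohl() stands in for the bpf_ntohl() macro that the examples in this document use inside BPF programs.

```c
#include <arpa/inet.h>
#include <linux/bpf.h>

/* remote_port is in network byte order, so it must be converted. */
__u32 remote_port_host(const struct bpf_sock_ops *ctx)
{
    return ntohl(ctx->remote_port);
}

/* local_port is already in host byte order and is used as-is. */
__u32 local_port_host(const struct bpf_sock_ops *ctx)
{
    return ctx->local_port;
}

/* Non-zero if either end of the connection uses the given port. */
int involves_port(const struct bpf_sock_ops *ctx, __u32 port)
{
    return remote_port_host(ctx) == port || local_port_host(ctx) == port;
}
```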
is_fullsock
Some TCP fields are only valid if there is a full socket. If not, the fields read as zero.
snd_cwnd
The sending congestion window
srtt_us
The averaged/smoothed RTT (Round Trip Time), stored 3 bits shifted left in μs (microseconds).
actual srtt in μs = ctx->srtt_us >> 3;
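A small sketch of that conversion, with an extra step to milliseconds for logging. The function names are hypothetical and the code is plain C so it can be checked in user space.

```c
#include <linux/bpf.h>

/* Smoothed RTT in microseconds: undo the 3-bit left shift. */
__u32 srtt_usecs(const struct bpf_sock_ops *ctx)
{
    return ctx->srtt_us >> 3;
}

/* Smoothed RTT in whole milliseconds (rounded down). */
__u32 srtt_msecs(const struct bpf_sock_ops *ctx)
{
    return srtt_usecs(ctx) / 1000;
}
```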
bpf_sock_ops_cb_flags
This field contains the flags that indicate which optional operations are enabled. Possible values are listed in include/uapi/linux/bpf.h. To change the contents of the field, the bpf_sock_ops_cb_flags_set helper must be used.
state
This field contains the connection state of the socket.
The states will be one of:
enum {
    BPF_TCP_ESTABLISHED = 1,
    BPF_TCP_SYN_SENT,
    BPF_TCP_SYN_RECV,
    BPF_TCP_FIN_WAIT1,
    BPF_TCP_FIN_WAIT2,
    BPF_TCP_TIME_WAIT,
    BPF_TCP_CLOSE,
    BPF_TCP_CLOSE_WAIT,
    BPF_TCP_LAST_ACK,
    BPF_TCP_LISTEN,
    BPF_TCP_CLOSING, /* Now a valid state */
    BPF_TCP_NEW_SYN_RECV
};
rtt_min
The minimum observed RTT (Round Trip Time)
snd_ssthresh
The slow start size threshold.
rcv_nxt
The TCP sequence number we want to receive next.
snd_nxt
The TCP sequence number we will send next.
snd_una
The first byte we want an ACK for.
mss_cache
Cached effective MSS (Maximum Segment Size), not including SACKS.
ecn_flags
ECN (Explicit Congestion Notification) status bits.
rate_delivered
Saved rate sample: packets delivered.
rate_interval_us
Saved rate sample: time elapsed.
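The two rate sample fields can be combined into an approximate delivery rate. The sketch below assumes every delivered packet carries mss_cache bytes, which makes the result an estimate and not an exact byte count. The function name is hypothetical.

```c
#include <linux/bpf.h>

/* Approximate delivery rate in bytes per second.
 * Returns 0 when no rate sample is available. */
__u64 delivery_rate_bytes_per_sec(const struct bpf_sock_ops *ctx)
{
    if (ctx->rate_delivered == 0 || ctx->rate_interval_us == 0)
        return 0;
    /* packets * bytes per packet * microseconds per second / elapsed us */
    return (__u64)ctx->rate_delivered * ctx->mss_cache * 1000000ULL /
           ctx->rate_interval_us;
}
```

The multiplication is done in 64 bits to avoid overflowing the 32-bit fields.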
packets_out
Number of packets which are "in flight".
retrans_out
Number of packets re-transmitted out.
total_retrans
Total # of packet re-transmits for entire connection.
segs_in
RFC4898 tcpEStatsPerfSegsIn total number of segments in.
data_segs_in
RFC4898 tcpEStatsPerfDataSegsIn total number of data segments in.
segs_out
RFC4898 tcpEStatsPerfSegsOut the total number of segments sent.
data_segs_out
RFC4898 tcpEStatsPerfDataSegsOut total number of data segments sent.
lost_out
Number of lost packets.
sacked_out
Number of SACK'd packets.
sk_txhash
Computed flow hash for use on transmit.
bytes_received
RFC4898 tcpEStatsAppHCThruOctetsReceived sum(delta(rcv_nxt)), or how many bytes were acked.
bytes_acked
RFC4898 tcpEStatsAppHCThruOctetsAcked sum(delta(snd_una)), or how many bytes were acked.
sk
Pointer to the struct bpf_sock.
skb_data
skb_data to skb_data_end covers the whole TCP header.
BPF_SOCK_OPS_PARSE_HDR_OPT_CB - The packet received
BPF_SOCK_OPS_HDR_OPT_LEN_CB - Not useful because the header has not been written.
BPF_SOCK_OPS_WRITE_HDR_OPT_CB - The header and options have been written so far.
BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB - The SYNACK that concludes the 3WHS.
BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB - The ACK that concludes the 3WHS.
bpf_load_hdr_opt can also be used to read a particular option.
skb_data_end
The end pointer of the TCP header.
skb_len
The total length of a packet. It includes the header, options, and payload.
skb_tcp_flags
The tcp_flags of the header. It provides an easy way to check for tcp_flags without parsing skb_data.
In particular, the skb_tcp_flags will still be available in BPF_SOCK_OPS_HDR_OPT_LEN_CB even though the outgoing header has not been written yet.
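For instance, a program that only wants to add an option to SYN or SYN-ACK packets can test the flag bits directly. The sketch below uses locally defined constants for the standard TCP header flag bits. The EX_ names and the function names are assumptions for this example and are not taken from a uapi header.

```c
#include <stdint.h>

/* Bit positions of the flags in the TCP header flags byte. */
#define EX_TCP_FLAG_FIN 0x01
#define EX_TCP_FLAG_SYN 0x02
#define EX_TCP_FLAG_RST 0x04
#define EX_TCP_FLAG_ACK 0x10

/* Pure SYN: SYN set, ACK clear. */
int is_syn(uint32_t tcp_flags)
{
    return (tcp_flags & (EX_TCP_FLAG_SYN | EX_TCP_FLAG_ACK)) == EX_TCP_FLAG_SYN;
}

/* SYN-ACK: both SYN and ACK set. */
int is_synack(uint32_t tcp_flags)
{
    return (tcp_flags & (EX_TCP_FLAG_SYN | EX_TCP_FLAG_ACK)) ==
           (EX_TCP_FLAG_SYN | EX_TCP_FLAG_ACK);
}

/* In a BPF program these would be called as is_syn(ctx->skb_tcp_flags). */
```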
skb_hwtstamp
The timestamp at which the packet was received as reported by the hardware/NIC.
In sockops, the skb is also available to the BPF program during the BPF_SOCK_OPS_PARSE_HDR_OPT_CB event. The hwtstamp can help a sockops program better measure the one-way delay when the sender has put its tx timestamp in a TCP header option.
Warning
hwtstamps can only be compared against other hwtstamps from the same device.
Attachment
Socket ops programs are attached to cGroups via the BPF_PROG_ATTACH syscall or via BPF link.
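A minimal sketch of the BPF_PROG_ATTACH path using the raw bpf() syscall. It assumes prog_fd refers to an already loaded socket ops program and cgroup_fd is a file descriptor opened on a cgroup v2 directory. Both function names are hypothetical. Loader libraries such as libbpf wrap this call, so most programs would not issue it by hand.

```c
#define _GNU_SOURCE
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fills the attribute block for attaching a sock_ops program to a cgroup. */
void sock_ops_attach_attr(union bpf_attr *attr, int cgroup_fd, int prog_fd)
{
    memset(attr, 0, sizeof(*attr));
    attr->target_fd = cgroup_fd;          /* cgroup to attach to */
    attr->attach_bpf_fd = prog_fd;        /* loaded sock_ops program */
    attr->attach_type = BPF_CGROUP_SOCK_OPS;
}

/* Returns 0 on success, -1 with errno set on failure. */
int attach_sock_ops(int cgroup_fd, int prog_fd)
{
    union bpf_attr attr;

    sock_ops_attach_attr(&attr, cgroup_fd, prog_fd);
    return (int)syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}
```

attach_flags is left at zero here; whether other flags are appropriate depends on how the cgroup hierarchy is meant to combine programs.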
Examples
Clamping a connection
// Copyright (c) 2017 Facebook
#define DEBUG 1
SEC("sockops")
int bpf_clamp(struct bpf_sock_ops *skops)
{
    int bufsize = 150000;
    int to_init = 10;
    int clamp = 100;
    int rv = 0;
    int op;
    /* For testing purposes, only execute rest of BPF program
     * if one of the port numbers is 55601
     */
    if (bpf_ntohl(skops->remote_port) != 55601 && skops->local_port != 55601) {
        skops->reply = -1;
        return 0;
    }
    op = (int) skops->op;
#ifdef DEBUG
    bpf_printk("BPF command: %d\n", op);
#endif
    /* Check that both hosts are within same datacenter. For this example
     * it is the case when the first 5.5 bytes of their IPv6 addresses are
     * the same.
     */
    if (skops->family == AF_INET6 &&
        skops->local_ip6[0] == skops->remote_ip6[0] &&
        (bpf_ntohl(skops->local_ip6[1]) & 0xfff00000) ==
        (bpf_ntohl(skops->remote_ip6[1]) & 0xfff00000)) {
        switch (op) {
        case BPF_SOCK_OPS_TIMEOUT_INIT:
            rv = to_init;
            break;
        case BPF_SOCK_OPS_TCP_CONNECT_CB:
            /* Set sndbuf and rcvbuf of active connections */
            rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF,
                                &bufsize, sizeof(bufsize));
            rv += bpf_setsockopt(skops, SOL_SOCKET,
                                 SO_RCVBUF, &bufsize,
                                 sizeof(bufsize));
            break;
        case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
            rv = bpf_setsockopt(skops, SOL_TCP,
                                TCP_BPF_SNDCWND_CLAMP,
                                &clamp, sizeof(clamp));
            break;
        case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
            /* Set sndbuf and rcvbuf of passive connections */
            rv = bpf_setsockopt(skops, SOL_TCP,
                                TCP_BPF_SNDCWND_CLAMP,
                                &clamp, sizeof(clamp));
            rv += bpf_setsockopt(skops, SOL_SOCKET,
                                 SO_SNDBUF, &bufsize,
                                 sizeof(bufsize));
            rv += bpf_setsockopt(skops, SOL_SOCKET,
                                 SO_RCVBUF, &bufsize,
                                 sizeof(bufsize));
            break;
        default:
            rv = -1;
        }
    } else {
        rv = -1;
    }
#ifdef DEBUG
    bpf_printk("Returning %d\n", rv);
#endif
    skops->reply = rv;
    return 1;
}
Dump statistics
#define INTERVAL 1000000000ULL
int _version SEC("version") = 1;
char _license[] SEC("license") = "GPL";
struct {
    __u32 type;
    __u32 map_flags;
    int *key;
    __u64 *value;
} bpf_next_dump SEC(".maps") = {
    .type = BPF_MAP_TYPE_SK_STORAGE,
    .map_flags = BPF_F_NO_PREALLOC,
};
SEC("sockops")
int _sockops(struct bpf_sock_ops *ctx)
{
    struct bpf_tcp_sock *tcp_sk;
    struct bpf_sock *sk;
    __u64 *next_dump;
    __u64 now;
    switch (ctx->op) {
    case BPF_SOCK_OPS_TCP_CONNECT_CB:
        bpf_sock_ops_cb_flags_set(ctx, BPF_SOCK_OPS_RTT_CB_FLAG);
        return 1;
    case BPF_SOCK_OPS_RTT_CB:
        break;
    default:
        return 1;
    }
    sk = ctx->sk;
    if (!sk)
        return 1;
    next_dump = bpf_sk_storage_get(&bpf_next_dump, sk, 0,
                                   BPF_SK_STORAGE_GET_F_CREATE);
    if (!next_dump)
        return 1;
    now = bpf_ktime_get_ns();
    if (now < *next_dump)
        return 1;
    tcp_sk = bpf_tcp_sock(sk);
    if (!tcp_sk)
        return 1;
    *next_dump = now + INTERVAL;
    bpf_printk("dsack_dups=%u delivered=%u\n",
               tcp_sk->dsack_dups, tcp_sk->delivered);
    bpf_printk("delivered_ce=%u icsk_retransmits=%u\n",
               tcp_sk->delivered_ce, tcp_sk->icsk_retransmits);
    return 1;
}
Adding socket to map
// Copyright (c) 2017-2018 Covalent IO
SEC("sockops")
int bpf_sockmap(struct bpf_sock_ops *skops)
{
    __u32 lport, rport;
    int op, err = 0, index, key, ret;
    op = (int) skops->op;
    switch (op) {
    case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
        lport = skops->local_port;
        rport = skops->remote_port;
        if (lport == 10000) {
            ret = 1;
#ifdef SOCKMAP
            err = bpf_sock_map_update(skops, &sock_map, &ret,
                                      BPF_NOEXIST);
#else
            err = bpf_sock_hash_update(skops, &sock_map, &ret,
                                       BPF_NOEXIST);
#endif
        }
        break;
    case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
        lport = skops->local_port;
        rport = skops->remote_port;
        if (bpf_ntohl(rport) == 10001) {
            ret = 10;
#ifdef SOCKMAP
            err = bpf_sock_map_update(skops, &sock_map, &ret,
                                      BPF_NOEXIST);
#else
            err = bpf_sock_hash_update(skops, &sock_map, &ret,
                                       BPF_NOEXIST);
#endif
        }
        break;
    default:
        break;
    }
    return 0;
}
Helper functions
Supported helper functions
bpf_setsockopt
bpf_getsockopt
bpf_sock_ops_cb_flags_set
bpf_sock_map_update
bpf_sock_hash_update
bpf_get_socket_cookie
bpf_get_local_storage
bpf_perf_event_output
bpf_sk_storage_get
bpf_sk_storage_delete
bpf_get_netns_cookie
bpf_load_hdr_opt
bpf_store_hdr_opt
bpf_reserve_hdr_opt
bpf_tcp_sock
bpf_skc_to_tcp6_sock
bpf_skc_to_tcp_sock
bpf_skc_to_tcp_timewait_sock
bpf_skc_to_tcp_request_sock
bpf_skc_to_udp6_sock
bpf_skc_to_unix_sock
bpf_ktime_get_coarse_ns
bpf_map_lookup_elem
bpf_map_update_elem
bpf_map_delete_elem
bpf_map_push_elem
bpf_map_pop_elem
bpf_map_peek_elem
bpf_map_lookup_percpu_elem
bpf_get_prandom_u32
bpf_get_smp_processor_id
bpf_get_numa_node_id
bpf_tail_call
bpf_ktime_get_ns
bpf_ktime_get_boot_ns
bpf_ringbuf_output
bpf_ringbuf_reserve
bpf_ringbuf_submit
bpf_ringbuf_discard
bpf_ringbuf_query
bpf_for_each_map_elem
bpf_loop
bpf_strncmp
bpf_spin_lock
bpf_spin_unlock
bpf_jiffies64
bpf_per_cpu_ptr
bpf_this_cpu_ptr
bpf_timer_init
bpf_timer_set_callback
bpf_timer_start
bpf_timer_cancel
bpf_trace_printk
bpf_get_current_task
bpf_get_current_task_btf
bpf_probe_read_user
bpf_probe_read_kernel
bpf_probe_read_user_str
bpf_probe_read_kernel_str
bpf_snprintf_btf
bpf_snprintf
bpf_task_pt_regs
bpf_trace_vprintk
bpf_cgrp_storage_get
bpf_cgrp_storage_delete
bpf_dynptr_data
bpf_dynptr_from_mem
bpf_dynptr_read
bpf_dynptr_write
bpf_kptr_xchg
bpf_ktime_get_tai_ns
bpf_ringbuf_discard_dynptr
bpf_ringbuf_reserve_dynptr
bpf_ringbuf_submit_dynptr
bpf_user_ringbuf_drain
KFuncs
There are currently no kfuncs supported for this program type