Program type BPF_PROG_TYPE_CGROUP_SOCKOPT
cGroup socket option programs are executed when a process in the cGroup to which the program is attached calls the getsockopt or setsockopt syscall, depending on the attach type. They can inspect, modify, or block the operation.
Usage
cGroup socket option programs are typically placed in the cgroup/getsockopt or cgroup/setsockopt ELF section to indicate the BPF_CGROUP_GETSOCKOPT and BPF_CGROUP_SETSOCKOPT attach types respectively.
BPF_CGROUP_SETSOCKOPT
BPF_CGROUP_SETSOCKOPT is triggered before the kernel handles the socket option and has a writable context: it can modify the supplied arguments before passing them down to the kernel. This hook has access to the cGroup and socket local storage.
If a BPF program sets optlen to -1, control returns to userspace after all other BPF programs in the cGroup chain finish (i.e. the kernel's setsockopt handling is not executed).
Note
optlen cannot be increased beyond the user-supplied value. It can only be decreased or set to -1. Any other value will trigger EFAULT.
Return Type:
0 - reject the syscall, EPERM will be returned to userspace.
1 - success, continue with the next BPF program in the cGroup chain.
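The return codes and the optlen = -1 bypass described above can be sketched as follows. This is a minimal illustration rather than a recommended policy: MY_SOL and MY_OPTNAME are hypothetical values for an option implemented purely in BPF, the function name is made up, and the choice to deny IP_TOS is arbitrary. The #else branch is a simplified stand-in that lets the logic compile as ordinary C for experimentation; a real BPF object only uses the __bpf__ branch.

```c
#ifdef __bpf__                      /* real build: clang -target bpf */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#else                               /* host build, for experimenting only */
#include <stdint.h>
#define SEC(name)
struct bpf_sockopt {                /* simplified stand-in for the kernel struct */
    void *sk, *optval, *optval_end;
    int32_t level, optname, optlen, retval;
};
#endif

#ifndef SOL_IP
#define SOL_IP 0                    /* Linux value of SOL_IP */
#endif
#ifndef IP_TOS
#define IP_TOS 1                    /* Linux value of IP_TOS */
#endif
#define MY_SOL     0x1234           /* hypothetical custom level */
#define MY_OPTNAME 1                /* hypothetical custom option */

SEC("cgroup/setsockopt")
int restrict_setsockopt(struct bpf_sockopt *ctx)
{
    /* Return 0: the syscall is rejected and userspace sees EPERM. */
    if (ctx->level == SOL_IP && ctx->optname == IP_TOS)
        return 0;

    /* Option handled entirely in BPF: skip the kernel's setsockopt handling. */
    if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) {
        ctx->optlen = -1;
        return 1;
    }

    /* Everything else goes to the next program in the chain, then the kernel. */
    return 1;
}
```

Only the hypothetical option touches optlen; all other calls leave the arguments exactly as userspace supplied them.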
BPF_CGROUP_GETSOCKOPT
BPF_CGROUP_GETSOCKOPT is triggered after the kernel handling of the socket option. The BPF hook can observe optval, optlen and retval if it is interested in whatever the kernel has returned. The BPF hook can override the values above, adjust optlen and reset retval to 0. If optlen has been increased above the initial getsockopt value (i.e. the userspace buffer is too small), EFAULT is returned.
This hook has access to the cGroup and socket local storage.
Note
The only acceptable values for retval are 0 and the original value that the kernel returned. Any other value will trigger EFAULT.
Return Type:
0 - reject the syscall, EPERM will be returned to userspace.
1 - success: copy optval and optlen to userspace, return retval from the syscall (note that this can be overwritten by the BPF program from the parent cGroup).
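As an illustration of observing retval before overriding a result, the sketch below rewrites the value reported for IP_TOS, but only when the kernel call succeeded. The policy (always reporting 0) and the function name are assumptions made for the example. The #else branch is a simplified stand-in so the logic can be compiled as ordinary C; a real BPF object only uses the __bpf__ branch.

```c
#ifdef __bpf__                      /* real build: clang -target bpf */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#else                               /* host build, for experimenting only */
#include <stdint.h>
#define SEC(name)
struct bpf_sockopt {                /* simplified stand-in for the kernel struct */
    void *sk, *optval, *optval_end;
    int32_t level, optname, optlen, retval;
};
#endif

#ifndef SOL_IP
#define SOL_IP 0                    /* Linux value of SOL_IP */
#endif
#ifndef IP_TOS
#define IP_TOS 1                    /* Linux value of IP_TOS */
#endif

SEC("cgroup/getsockopt")
int mask_tos_getsockopt(struct bpf_sockopt *ctx)
{
    unsigned char *optval = ctx->optval;
    unsigned char *optval_end = ctx->optval_end;

    if (ctx->level != SOL_IP || ctx->optname != IP_TOS)
        return 1;                   /* not our option: pass through unchanged */

    if (ctx->retval != 0)
        return 1;                   /* kernel failed: keep its error as-is */

    if (optval + 1 > optval_end)
        return 0;                   /* no byte to rewrite safely: EPERM */

    optval[0] = 0;                  /* report TOS 0 regardless of the kernel's value */
    ctx->optlen = 1;                /* shrinking is allowed, growing is not */
    return 1;
}
```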
cGroup Inheritance
Suppose there is the following cGroup hierarchy, where a BPF_CGROUP_GETSOCKOPT program is attached at each level with BPF_F_ALLOW_MULTI:
A (root, parent)
 \
  B (child)
When the application calls the getsockopt syscall from cGroup B, the programs are executed from the bottom up: B, A. The first program (B) sees the result of the kernel's getsockopt. It can optionally adjust optval, optlen and reset retval to 0. After that, control is passed to the second program (A), which sees the same context as B, including any modifications B made.
The same applies to BPF_CGROUP_SETSOCKOPT: if a program is attached to both A and B, the trigger order is B, then A. If B changes any of the input arguments (level, optname, optval, optlen), then the next program in the chain (A) will see those changes, not the original setsockopt input arguments. The potentially modified values are then passed down to the kernel.
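The ordering can be pictured with a small userspace model. This is a toy written in plain C, not kernel code: struct ctx_model, prog_a, prog_b and run_chain are invented names, and the model deliberately ignores the programs' return codes. It only shows that the child's program runs first and that the parent's program receives the context as the child left it.

```c
#include <stddef.h>

/* Toy stand-in for the sockopt context; only what the model needs. */
struct ctx_model {
    int optlen;
    int trace[4];                   /* order in which the programs ran */
    int ntrace;
};

typedef int (*prog_fn)(struct ctx_model *);

/* Program attached to child cGroup B: trims optlen. */
static int prog_b(struct ctx_model *c)
{
    c->trace[c->ntrace++] = 'B';
    c->optlen = 2;
    return 1;
}

/* Program attached to parent cGroup A: records what it was handed. */
static int seen_by_a;
static int prog_a(struct ctx_model *c)
{
    c->trace[c->ntrace++] = 'A';
    seen_by_a = c->optlen;
    return 1;
}

/* Effective chain for a process in B: child first, then its ancestors. */
static void run_chain(struct ctx_model *c)
{
    prog_fn chain[] = { prog_b, prog_a };

    for (size_t i = 0; i < sizeof(chain) / sizeof(chain[0]); i++)
        (void)chain[i](c);          /* return codes are ignored in this model */
}
```

Starting from optlen = 8, prog_a observes the value 2 written by prog_b rather than the original 8.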
Large optval
When optval is larger than PAGE_SIZE, the BPF program can access only the first PAGE_SIZE bytes of that data. So it has two options:
- Set optlen to zero, which indicates that the kernel should use the original buffer from userspace. Any modifications done by the BPF program to optval are ignored.
- Set optlen to a value less than PAGE_SIZE, which indicates that the kernel should use BPF's trimmed optval.
When the BPF program returns with optlen greater than PAGE_SIZE, userspace will receive the original kernel buffers without any modifications that the BPF program might have applied.
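Both options can be combined in one setsockopt program, as in the sketch below. Several things here are assumptions for illustration: PAGE_SIZE is hard-coded to 4096 (the real page size depends on the architecture and configuration), MY_SOL, MY_OPTNAME and MY_OPT_MAX describe a hypothetical option of which only the first 64 bytes matter, and the function name is made up. The #else branch is a simplified stand-in so the logic can be compiled as ordinary C.

```c
#ifdef __bpf__                      /* real build: clang -target bpf */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#else                               /* host build, for experimenting only */
#include <stdint.h>
#define SEC(name)
struct bpf_sockopt {                /* simplified stand-in for the kernel struct */
    void *sk, *optval, *optval_end;
    int32_t level, optname, optlen, retval;
};
#endif

#ifndef PAGE_SIZE
#define PAGE_SIZE  4096             /* assumption: 4 KiB pages */
#endif
#define MY_SOL     0x1234           /* hypothetical custom level */
#define MY_OPTNAME 1                /* hypothetical custom option */
#define MY_OPT_MAX 64               /* hypothetical: bytes this program cares about */

SEC("cgroup/setsockopt")
int large_optval_setsockopt(struct bpf_sockopt *ctx)
{
    if (ctx->optlen <= PAGE_SIZE)
        return 1;                   /* whole value is visible: nothing special to do */

    if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) {
        /* Second option: hand the kernel BPF's trimmed copy of optval. */
        ctx->optlen = MY_OPT_MAX;
        return 1;
    }

    /* First option: the kernel uses the original userspace buffer. */
    ctx->optlen = 0;
    return 1;
}
```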
Context
struct bpf_sockopt
C structure
struct bpf_sockopt {
    __bpf_md_ptr(struct bpf_sock *, sk);
    __bpf_md_ptr(void *, optval);
    __bpf_md_ptr(void *, optval_end);
    __s32 level;
    __s32 optname;
    __s32 optlen;
    __s32 retval;
};
sk
Pointer to the socket for which the syscall is invoked.
optval
Pointer to the start of the option value, the end pointer being optval_end. The program must perform a bounds check against optval_end before accessing the memory.
For BPF_CGROUP_SETSOCKOPT the option value contains the value the process wants to set. For BPF_CGROUP_GETSOCKOPT the option value contains the value the syscall returned.
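The bounds check usually takes the form "start + size > end" before any read or write. The sketch below applies it in a setsockopt program that caps IP_TTL; the limit MAX_TTL and the function name are hypothetical, and leaving short values to the kernel is a design choice of this example. The #else branch is a simplified stand-in so the logic can be compiled as ordinary C.

```c
#ifdef __bpf__                      /* real build: clang -target bpf */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#else                               /* host build, for experimenting only */
#include <stdint.h>
#define SEC(name)
struct bpf_sockopt {                /* simplified stand-in for the kernel struct */
    void *sk, *optval, *optval_end;
    int32_t level, optname, optlen, retval;
};
#endif

#ifndef SOL_IP
#define SOL_IP 0                    /* Linux value of SOL_IP */
#endif
#ifndef IP_TTL
#define IP_TTL 2                    /* Linux value of IP_TTL */
#endif
#define MAX_TTL 64                  /* hypothetical policy limit */

SEC("cgroup/setsockopt")
int clamp_ttl_setsockopt(struct bpf_sockopt *ctx)
{
    unsigned char *optval = ctx->optval;
    unsigned char *optval_end = ctx->optval_end;
    int ttl;

    if (ctx->level != SOL_IP || ctx->optname != IP_TTL)
        return 1;

    /* Bounds check: prove that sizeof(ttl) bytes are accessible. */
    if (optval + sizeof(ttl) > optval_end)
        return 1;                   /* shorter than an int: leave it to the kernel */

    __builtin_memcpy(&ttl, optval, sizeof(ttl));
    if (ttl > MAX_TTL) {
        ttl = MAX_TTL;
        __builtin_memcpy(optval, &ttl, sizeof(ttl));
    }
    return 1;
}
```

The value is copied with __builtin_memcpy instead of being dereferenced through a cast so that the example makes no assumption about the alignment of optval.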
optval_end
This is the end pointer of the option value.
level
This field indicates the socket level for which the syscall is invoked. Values are one of the SOL_* constants, typically SOL_SOCKET, SOL_IP, SOL_IPV6, SOL_TCP, or SOL_UDP unless dealing with more specialized protocols. Only BPF_CGROUP_SETSOCKOPT programs are allowed to modify this field.
optname
This field indicates the name of the socket option. Valid options depend on the socket level. More information can be found in man pages such as socket(7), ip(7), tcp(7), udp(7), etc.
Only BPF_CGROUP_SETSOCKOPT programs are allowed to modify this field.
optlen
This field indicates the length of the socket option, which should be smaller than or equal to optval_end - optval. The program can modify this value to trim the option value. Both BPF_CGROUP_SETSOCKOPT and BPF_CGROUP_GETSOCKOPT programs are allowed to modify this field.
retval
This field indicates the return value of the syscall. Only BPF_CGROUP_GETSOCKOPT programs can read and/or modify this value to override the return value of the syscall.
Attachment
cGroup socket option programs are attached to cGroups via the BPF_PROG_ATTACH syscall command or via BPF link.
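For the BPF_PROG_ATTACH route, a userspace loader could issue the bpf syscall directly, roughly as sketched below. The helper names sockopt_attach_attr and attach_sockopt_prog are made up for this example, the program is assumed to be loaded already, and cgroup_fd is assumed to come from opening the cGroup v2 directory. Passing BPF_F_ALLOW_MULTI is a choice that matches the inheritance scenario described earlier, not a requirement. The call needs sufficient privileges to succeed.

```c
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fill the bpf_attr fields used by the BPF_PROG_ATTACH command. */
void sockopt_attach_attr(union bpf_attr *attr, int prog_fd, int cgroup_fd,
                         enum bpf_attach_type type)
{
    memset(attr, 0, sizeof(*attr));
    attr->target_fd = cgroup_fd;            /* cGroup to attach to */
    attr->attach_bpf_fd = prog_fd;          /* loaded sockopt program */
    attr->attach_type = type;               /* BPF_CGROUP_GETSOCKOPT or BPF_CGROUP_SETSOCKOPT */
    attr->attach_flags = BPF_F_ALLOW_MULTI; /* allow programs at several levels */
}

/* Returns 0 on success, -1 with errno set on failure. */
int attach_sockopt_prog(int prog_fd, int cgroup_fd, enum bpf_attach_type type)
{
    union bpf_attr attr;

    sockopt_attach_attr(&attr, prog_fd, cgroup_fd, type);
    return (int)syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}
```

A loader would typically call attach_sockopt_prog once per attach type, for example once with BPF_CGROUP_GETSOCKOPT and once with BPF_CGROUP_SETSOCKOPT.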
Example
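SEC("cgroup/getsockopt")
int getsockopt(struct bpf_sockopt *ctx)
{
    __u8 *optval = ctx->optval;
    __u8 *optval_end = ctx->optval_end;

    /* Custom socket option. */
    if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) {
        /* Bounds check required before touching optval. */
        if (optval + 1 > optval_end)
            return 0; /* EPERM */
        ctx->retval = 0;
        optval[0] = ...; /* placeholder: value to report */
        ctx->optlen = 1;
        return 1;
    }

    /* Modify kernel's socket option. */
    if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
        if (optval + 1 > optval_end)
            return 0; /* EPERM */
        ctx->retval = 0;
        optval[0] = ...; /* placeholder: value to report */
        ctx->optlen = 1;
        return 1;
    }

    /* optval larger than PAGE_SIZE: use kernel's buffer. */
    if (ctx->optlen > PAGE_SIZE)
        ctx->optlen = 0;

    return 1;
}

SEC("cgroup/setsockopt")
int setsockopt(struct bpf_sockopt *ctx)
{
    __u8 *optval = ctx->optval;
    __u8 *optval_end = ctx->optval_end;

    /* Custom socket option. */
    if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) {
        /* do something */
        ctx->optlen = -1; /* skip the kernel's setsockopt handling */
        return 1;
    }

    /* Modify kernel's socket option. */
    if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
        if (optval + 1 > optval_end)
            return 0; /* EPERM */
        optval[0] = ...; /* placeholder: value to set */
        return 1;
    }

    /* optval larger than PAGE_SIZE: use kernel's buffer. */
    if (ctx->optlen > PAGE_SIZE)
        ctx->optlen = 0;

    return 1;
}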
Helper functions
Supported helper functions
bpf_get_netns_cookie
bpf_sk_storage_get
bpf_sk_storage_delete
bpf_setsockopt (v5.15)
bpf_getsockopt (v5.15)
bpf_tcp_sock
bpf_get_current_uid_gid
bpf_get_local_storage
bpf_get_current_cgroup_id
bpf_perf_event_output
bpf_get_retval
bpf_set_retval
bpf_map_lookup_elem
bpf_map_update_elem
bpf_map_delete_elem
bpf_map_push_elem
bpf_map_pop_elem
bpf_map_peek_elem
bpf_map_lookup_percpu_elem
bpf_get_prandom_u32
bpf_get_smp_processor_id
bpf_get_numa_node_id
bpf_tail_call
bpf_ktime_get_ns
bpf_ktime_get_boot_ns
bpf_ringbuf_output
bpf_ringbuf_reserve
bpf_ringbuf_submit
bpf_ringbuf_discard
bpf_ringbuf_query
bpf_for_each_map_elem
bpf_loop
bpf_strncmp
bpf_spin_lock
bpf_spin_unlock
bpf_jiffies64
bpf_per_cpu_ptr
bpf_this_cpu_ptr
bpf_timer_init
bpf_timer_set_callback
bpf_timer_start
bpf_timer_cancel
bpf_trace_printk
bpf_get_current_task
bpf_get_current_task_btf
bpf_probe_read_user
bpf_probe_read_kernel
bpf_probe_read_user_str
bpf_probe_read_kernel_str
bpf_snprintf_btf
bpf_snprintf
bpf_task_pt_regs
bpf_trace_vprintk
bpf_cgrp_storage_get
bpf_cgrp_storage_delete
bpf_dynptr_data
bpf_dynptr_from_mem
bpf_dynptr_read
bpf_dynptr_write
bpf_kptr_xchg
bpf_ktime_get_tai_ns
bpf_ringbuf_discard_dynptr
bpf_ringbuf_reserve_dynptr
bpf_ringbuf_submit_dynptr
bpf_user_ringbuf_drain
KFuncs
There are currently no kfuncs supported for this program type