Program type BPF_PROG_TYPE_CGROUP_SOCKOPT
cGroup socket option programs are executed when a process in the cGroup to which the program is attached calls the getsockopt or setsockopt syscall, depending on the attach type. They can modify or block the operation.
Usage
cGroup socket option programs are typically placed in the cgroup/getsockopt or cgroup/setsockopt ELF section, which indicates the BPF_CGROUP_GETSOCKOPT or BPF_CGROUP_SETSOCKOPT attach type respectively.
BPF_CGROUP_SETSOCKOPT
BPF_CGROUP_SETSOCKOPT is triggered before the kernel handles the socket option, and it has a writable context: it can modify the supplied arguments before passing them down to the kernel. This hook has access to the cGroup and socket local storage.
If the BPF program sets optlen to -1, control is returned to userspace after all other BPF programs in the cGroup chain finish (i.e. the kernel's setsockopt handling is not executed).
Note
optlen cannot be increased beyond the user-supplied value. It can only be decreased or set to -1. Any other value will trigger EFAULT.
Return Type:
- 0: reject the syscall; EPERM will be returned to userspace.
- 1: success; continue with the next BPF program in the cGroup chain.
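The optlen rule from the note above can be summarized as a small decision function. This is a userspace sketch for illustration only, not kernel code: the enum, the function name `classify_setsockopt_optlen`, and its parameters are hypothetical, and the sketch deliberately ignores the large-optval case.

```c
/* Hypothetical outcome labels; the kernel does not define these names. */
enum sockopt_outcome {
    SOCKOPT_RUN_KERNEL,    /* kernel setsockopt handling runs */
    SOCKOPT_BYPASS_KERNEL, /* control returns to userspace, kernel handling skipped */
    SOCKOPT_EFAULT,        /* syscall fails with EFAULT */
};

/*
 * Models the documented rule: optlen may be decreased or set to -1;
 * any other value is an error.
 */
static enum sockopt_outcome
classify_setsockopt_optlen(int user_optlen, int bpf_optlen)
{
    if (bpf_optlen == -1)
        return SOCKOPT_BYPASS_KERNEL;
    if (bpf_optlen < -1 || bpf_optlen > user_optlen)
        return SOCKOPT_EFAULT;
    return SOCKOPT_RUN_KERNEL;
}
```

For a user-supplied length of 4, a program that leaves optlen at 4 or trims it to 1 lets the kernel handler run, while setting it to 8 fails the syscall.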
BPF_CGROUP_GETSOCKOPT
BPF_CGROUP_GETSOCKOPT is triggered after the kernel handles the socket option. The BPF hook can observe optval, optlen and retval if it is interested in what the kernel returned. The hook can also override these values, adjust optlen, and reset retval to 0. If optlen is increased above the initial getsockopt value (i.e. the userspace buffer is too small), EFAULT is returned.
This hook has access to the cGroup and socket local storage.
Note
The only acceptable values for retval are 0 and the original value that the kernel returned. Any other value will trigger EFAULT.
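The two getsockopt constraints described in this section (optlen must not grow past the userspace buffer, and retval may only be 0 or the kernel's original value) can be sketched as one check. This is a hedged userspace model, not the kernel's implementation; `model_getsockopt_result` and its parameters are hypothetical names, and only the rules stated here are modeled.

```c
#include <errno.h>

/*
 * Returns the value the syscall would report under the documented rules:
 * -EFAULT if a rule is violated, otherwise the retval left by the program.
 */
static int model_getsockopt_result(int kernel_retval, int user_optlen,
                                   int bpf_retval, int bpf_optlen)
{
    /* optlen grown past what the userspace buffer can hold. */
    if (bpf_optlen > user_optlen)
        return -EFAULT;
    /* retval may only be reset to 0 or left at the kernel's value. */
    if (bpf_retval != 0 && bpf_retval != kernel_retval)
        return -EFAULT;
    return bpf_retval;
}
```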
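Return Type:
- 0: reject the syscall; EPERM will be returned to userspace.
- 1: success; optval and optlen are copied to userspace and retval is returned from the syscall (a BPF program attached to a parent cGroup can still override it).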
cGroup Inheritance
Suppose there is the following cGroup hierarchy, where each cGroup has a BPF_CGROUP_GETSOCKOPT program attached with BPF_F_ALLOW_MULTI:
A (root, parent)
 \
  B (child)
When the application calls the getsockopt syscall from cGroup B, the programs are executed from the bottom up: B, then A. The first program (B) sees the result of the kernel's getsockopt. It can optionally adjust optval and optlen, and reset retval to 0. After that, control is passed to the second program (A), which sees the same context as B, including any modifications B made.
The same applies to BPF_CGROUP_SETSOCKOPT: if programs are attached to A and B, the trigger order is B, then A. If B makes any changes to the input arguments (level, optname, optval, optlen), the next program in the chain (A) sees those changes, not the original setsockopt arguments. The potentially modified values are then passed down to the kernel.
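The shared-context behavior described above can be illustrated with a small userspace simulation. Everything here is a hypothetical stand-in (`struct chain_ctx`, `run_chain`, `prog_a`, `prog_b`); it only demonstrates that the parent's program observes the child's modifications, and it intentionally does not model how return values are handled.

```c
#include <stddef.h>

/* Stand-in for the fields a program may change; not struct bpf_sockopt. */
struct chain_ctx {
    int level;
    int optname;
    int optlen;
};

typedef int (*chain_prog_fn)(struct chain_ctx *ctx);

/* Records what the parent's program observed. */
static int optlen_seen_by_a;

/* Child cGroup program: trims the option to one byte. */
static int prog_b(struct chain_ctx *ctx)
{
    ctx->optlen = 1;
    return 1;
}

/* Parent cGroup program: only observes the context. */
static int prog_a(struct chain_ctx *ctx)
{
    optlen_seen_by_a = ctx->optlen;
    return 1;
}

/* Runs the programs in the given order (child first) on one shared context. */
static void run_chain(chain_prog_fn *progs, size_t n, struct chain_ctx *ctx)
{
    for (size_t i = 0; i < n; i++)
        progs[i](ctx); /* return-value handling is not modeled here */
}
```

Running the chain `{ prog_b, prog_a }` on a context with optlen 4 leaves `optlen_seen_by_a` at 1, because A runs after B on the same context.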
Large optval
When optval is larger than PAGE_SIZE, the BPF program can access only the first PAGE_SIZE bytes of that data. So it has two options:
- Set optlen to zero, which indicates that the kernel should use the original buffer from userspace. Any modifications done by the BPF program to optval are ignored.
- Set optlen to a value less than PAGE_SIZE, which indicates that the kernel should use BPF's trimmed optval.
When the BPF program returns with the optlen greater than PAGE_SIZE, the userspace will receive original kernel buffers without any modifications that the BPF program might have applied.
Context
struct bpf_sockopt
C structure
struct bpf_sockopt {
__bpf_md_ptr(struct bpf_sock *, sk);
__bpf_md_ptr(void *, optval);
__bpf_md_ptr(void *, optval_end);
__s32 level;
__s32 optname;
__s32 optlen;
__s32 retval;
};
sk
Pointer to the socket for which the syscall is invoked.
optval
Pointer to the start of the option value, the end pointer being optval_end. The program must perform a bounds check against optval_end before accessing the memory.
For BPF_CGROUP_SETSOCKOPT the option value contains the option the process wants to set. For BPF_CGROUP_GETSOCKOPT the option value contains the option the syscall returned.
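The bounds check against optval_end typically takes the shape shown below. This sketch uses a hypothetical userspace mock (`struct mock_sockopt_buf`, `first_optval_byte`) so that it can be compiled and exercised outside the kernel; in a real program the pointers come from struct bpf_sockopt and the verifier enforces the check.

```c
#include <stdint.h>

/* Userspace stand-in with the same pointer pair as struct bpf_sockopt. */
struct mock_sockopt_buf {
    void *optval;
    void *optval_end;
};

/* Returns the first option byte, or -1 if the buffer holds no bytes. */
static int first_optval_byte(const struct mock_sockopt_buf *ctx)
{
    uint8_t *optval = ctx->optval;
    uint8_t *optval_end = ctx->optval_end;

    /* Prove that one byte is in range before dereferencing. */
    if (optval + 1 > optval_end)
        return -1;

    return optval[0];
}
```

To access N bytes instead of one, the comparison would use `optval + N > optval_end`.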
optval_end
This is the end pointer of the option value.
level
This field indicates the socket level for which the syscall is invoked. The value is one of the SOL_* constants, typically SOL_SOCKET, SOL_IP, SOL_IPV6, SOL_TCP, or SOL_UDP unless dealing with more specialized protocols. Only BPF_CGROUP_SETSOCKOPT programs are allowed to modify this field.
optname
This field indicates the name of the socket option. Valid options depend on the socket level. More info can be found in the man pages such as socket(7), ip(7), tcp(7), udp(7), etc.
Only BPF_CGROUP_SETSOCKOPT programs are allowed to modify this field.
optlen
This field indicates the length of the socket option, which should be less than or equal to optval_end - optval. The program can modify this value to trim the option value. Both BPF_CGROUP_SETSOCKOPT and BPF_CGROUP_GETSOCKOPT programs are allowed to modify this field.
retval
This field indicates the return value of the syscall. Only BPF_CGROUP_GETSOCKOPT programs can read and/or modify this value to override the return value of the syscall.
Attachment
cGroup socket option programs are attached to cGroups via the BPF_PROG_ATTACH syscall or via BPF link.
Example
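SEC("cgroup/getsockopt")
int getsockopt(struct bpf_sockopt *ctx)
{
    __u8 *optval = ctx->optval;
    __u8 *optval_end = ctx->optval_end;

    /* Custom socket option (MY_SOL and MY_OPTNAME are placeholders). */
    if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) {
        /* Bounds check before writing to optval. */
        if (optval + 1 > optval_end)
            return 0;
        ctx->retval = 0;
        optval[0] = ...;
        ctx->optlen = 1;
        return 1;
    }

    /* Modify kernel's socket option. */
    if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
        /* Bounds check before writing to optval. */
        if (optval + 1 > optval_end)
            return 0;
        ctx->retval = 0;
        optval[0] = ...;
        ctx->optlen = 1;
        return 1;
    }

    /* optval larger than PAGE_SIZE: use the kernel's buffer. */
    if (ctx->optlen > PAGE_SIZE)
        ctx->optlen = 0;

    return 1;
}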
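SEC("cgroup/setsockopt")
int setsockopt(struct bpf_sockopt *ctx)
{
    __u8 *optval = ctx->optval;
    __u8 *optval_end = ctx->optval_end;

    /* Custom socket option (MY_SOL and MY_OPTNAME are placeholders). */
    if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) {
        /* do something */
        ctx->optlen = -1; /* bypass the kernel's setsockopt handling */
        return 1;
    }

    /* Modify kernel's socket option. */
    if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
        /* Bounds check before writing to optval. */
        if (optval + 1 > optval_end)
            return 0;
        optval[0] = ...;
        return 1;
    }

    /* optval larger than PAGE_SIZE: use the kernel's buffer. */
    if (ctx->optlen > PAGE_SIZE)
        ctx->optlen = 0;

    return 1;
}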
Helper functions
Supported helper functions
- bpf_cgrp_storage_delete
- bpf_cgrp_storage_get
- bpf_dynptr_data
- bpf_dynptr_from_mem
- bpf_dynptr_read
- bpf_dynptr_write
- bpf_for_each_map_elem
- bpf_get_current_ancestor_cgroup_id v6.4
- bpf_get_current_cgroup_id
- bpf_get_current_pid_tgid v6.10
- bpf_get_current_task
- bpf_get_current_task_btf
- bpf_get_current_uid_gid
- bpf_get_local_storage
- bpf_get_netns_cookie
- bpf_get_ns_current_pid_tgid v6.10
- bpf_get_numa_node_id
- bpf_get_prandom_u32
- bpf_get_retval
- bpf_get_smp_processor_id
- bpf_getsockopt v5.15
- bpf_jiffies64
- bpf_kptr_xchg
- bpf_ktime_get_boot_ns
- bpf_ktime_get_ns
- bpf_ktime_get_tai_ns
- bpf_loop
- bpf_map_delete_elem
- bpf_map_lookup_elem
- bpf_map_lookup_percpu_elem
- bpf_map_peek_elem
- bpf_map_pop_elem
- bpf_map_push_elem
- bpf_map_update_elem
- bpf_per_cpu_ptr
- bpf_perf_event_output
- bpf_probe_read_kernel
- bpf_probe_read_kernel_str
- bpf_probe_read_user
- bpf_probe_read_user_str
- bpf_ringbuf_discard
- bpf_ringbuf_discard_dynptr
- bpf_ringbuf_output
- bpf_ringbuf_query
- bpf_ringbuf_reserve
- bpf_ringbuf_reserve_dynptr
- bpf_ringbuf_submit
- bpf_ringbuf_submit_dynptr
- bpf_set_retval
- bpf_setsockopt v5.15
- bpf_sk_storage_delete
- bpf_sk_storage_get
- bpf_snprintf
- bpf_snprintf_btf
- bpf_spin_lock
- bpf_spin_unlock
- bpf_strncmp
- bpf_tail_call
- bpf_task_pt_regs
- bpf_tcp_sock
- bpf_this_cpu_ptr
- bpf_timer_cancel
- bpf_timer_init
- bpf_timer_set_callback
- bpf_timer_start
- bpf_trace_printk
- bpf_trace_vprintk
- bpf_user_ringbuf_drain
KFuncs
Supported kfuncs
- __bpf_trap v6.12
- bpf_arena_alloc_pages v6.12
- bpf_arena_free_pages v6.12
- bpf_arena_reserve_pages v6.12
- bpf_cast_to_kern_ctx v6.12
- bpf_cgroup_read_xattr v6.12
- bpf_copy_from_user_dynptr v6.12
- bpf_copy_from_user_str v6.12
- bpf_copy_from_user_str_dynptr v6.12
- bpf_copy_from_user_task_dynptr v6.12
- bpf_copy_from_user_task_str v6.12
- bpf_copy_from_user_task_str_dynptr v6.12
- bpf_dynptr_adjust v6.12
- bpf_dynptr_clone v6.12
- bpf_dynptr_copy v6.12
- bpf_dynptr_from_skb v6.12
- bpf_dynptr_is_null v6.12
- bpf_dynptr_is_rdonly v6.12
- bpf_dynptr_memset v6.12
- bpf_dynptr_size v6.12
- bpf_dynptr_slice v6.12
- bpf_dynptr_slice_rdwr v6.12
- bpf_get_kmem_cache v6.12
- bpf_iter_bits_destroy v6.12
- bpf_iter_bits_new v6.12
- bpf_iter_bits_next v6.12
- bpf_iter_css_destroy v6.12
- bpf_iter_css_new v6.12
- bpf_iter_css_next v6.12
- bpf_iter_css_task_destroy v6.12
- bpf_iter_css_task_new v6.12
- bpf_iter_css_task_next v6.12
- bpf_iter_dmabuf_destroy v6.12
- bpf_iter_dmabuf_new v6.12
- bpf_iter_dmabuf_next v6.12
- bpf_iter_kmem_cache_destroy v6.12
- bpf_iter_kmem_cache_new v6.12
- bpf_iter_kmem_cache_next v6.12
- bpf_iter_num_destroy v6.12
- bpf_iter_num_new v6.12
- bpf_iter_num_next v6.12
- bpf_iter_task_destroy v6.12
- bpf_iter_task_new v6.12
- bpf_iter_task_next v6.12
- bpf_iter_task_vma_destroy v6.12
- bpf_iter_task_vma_new v6.12
- bpf_iter_task_vma_next v6.12
- bpf_local_irq_restore v6.12
- bpf_local_irq_save v6.12
- bpf_map_sum_elem_count v6.12
- bpf_preempt_disable v6.12
- bpf_preempt_enable v6.12
- bpf_probe_read_kernel_dynptr v6.12
- bpf_probe_read_kernel_str_dynptr v6.12
- bpf_probe_read_user_dynptr v6.12
- bpf_probe_read_user_str_dynptr v6.12
- bpf_rcu_read_lock v6.12
- bpf_rcu_read_unlock v6.12
- bpf_rdonly_cast v6.12
- bpf_res_spin_lock v6.12
- bpf_res_spin_lock_irqsave v6.12
- bpf_res_spin_unlock v6.12
- bpf_res_spin_unlock_irqrestore v6.12
- bpf_sock_addr_set_sun_path v6.12
- bpf_sock_ops_enable_tx_tstamp
- bpf_strchr v6.12
- bpf_strchrnul v6.12
- bpf_strcmp v6.12
- bpf_strcspn v6.12
- bpf_stream_vprintk v6.12
- bpf_strlen v6.12
- bpf_strnchr v6.12
- bpf_strnlen v6.12
- bpf_strnstr v6.12
- bpf_strrchr v6.12
- bpf_strspn v6.12
- bpf_strstr v6.12
- bpf_wq_init v6.12
- bpf_wq_set_callback_impl v6.12
- bpf_wq_start v6.12