Program type BPF_PROG_TYPE_TRACING
Tracing programs are a newer alternative to kprobes and tracepoints. They use BPF trampolines, a newer mechanism that adds practically zero overhead. In addition, tracing programs can be attached to other BPF programs for troubleshooting and debugging, which is not possible with kprobes.
Usage
There are a few variations of tracing programs depending on their attach type.
Raw tracepoint
Raw tracepoint programs can be loaded as their own dedicated program type or as an attach type under the tracing program type. When loaded as a tracing program, a raw tracepoint program attaches to the BTF ID of a tracepoint via a link instead of having to use the special syscall to attach.
For details see the Raw Tracepoint page.
Fentry
Fentry programs are similar in function to a kprobe attached to a function's first instruction. This program type is invoked before control passes to the function to allow for tracing/observation.
Kprobes do not have to be attached at the entry point of a function; they can be installed at any point in the function. Fentry programs, by contrast, are always attached at the entry point.
Fentry programs are attached using a BPF trampoline, which causes less overhead than kprobes. Fentry programs can also be attached to BPF programs such as XDP, TC or cgroup programs, which makes debugging eBPF programs easier. Kprobes lack this capability.
Fentry programs are typically located in an ELF section prefixed with fentry/.
Fexit
Fexit programs are similar to kretprobes. The program is invoked when the function returns no matter where the return occurs. Fexit programs get invoked with the input arguments and the return value of the function, so there is no need to store the input arguments in a map like you would have to do with kprobes and kretprobes.
Fexit programs are typically located in an ELF section prefixed with fexit/.
Modify return
Fmodify_return programs run after the fentry program but before the function we are tracing. Unlike the fentry and fexit programs, the fmodify_return program can return non-zero values. When a non-zero value is returned, the function we are tracing will not be executed and the value returned by the fmodify_return program will be returned instead.
Fmodify_return programs are provided with the input arguments to the function under trace and a return value. If multiple fmodify_return programs are attached, then the return value of the previous fmodify_return program will be provided as the input to the next fmodify_return program.
Unlike fentry/fexit programs, fmodify_return programs are only allowed for security hooks (with an extra CAP_MAC_ADMIN check) and functions whitelisted for error injection (ALLOW_ERROR_INJECTION).
Fmodify_return programs are typically located in an ELF section prefixed with fmod_ret/.
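The call order described for fmodify_return programs can be sketched as a small userspace model. This is not kernel or libbpf code: `call_with_fmod_ret`, `traced_double` and the two example programs are hypothetical names invented for illustration, and stopping the chain at the first non-zero return is an assumption about how the trampoline behaves.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: none of these names are kernel or libbpf APIs. */
typedef int (*traced_fn)(int arg);
typedef int (*fmod_ret_fn)(int arg, int prev_ret);

/*
 * Run each fmod_ret program in turn, handing it the previous program's
 * return value. The traced function is called only if every program
 * returned zero; otherwise the first non-zero value is returned instead.
 */
static int call_with_fmod_ret(traced_fn fn, const fmod_ret_fn *progs,
                              size_t n_progs, int arg)
{
    int ret = 0;

    assert(fn != NULL);
    for (size_t i = 0; i < n_progs; i++) {
        ret = progs[i](arg, ret);
        if (ret != 0)
            return ret; /* traced function is skipped */
    }
    return fn(arg);
}

/* Hypothetical traced function and programs used for illustration. */
static int traced_double(int arg)
{
    return arg * 2;
}

static int prog_allow(int arg, int prev_ret)
{
    (void)arg;
    return prev_ret; /* pass the previous value through unchanged */
}

static int prog_deny_negative(int arg, int prev_ret)
{
    (void)prev_ret;
    return arg < 0 ? -1 : 0; /* -1 stands in for an error code */
}
```

With the chain `{ prog_allow, prog_deny_negative }`, a call with a non-negative argument reaches `traced_double`, while a negative argument makes the caller see `-1` without the traced function ever running.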
Iterator
Iterator programs use the same program type but have a different use case. Iterator programs are used to iterate over a list of in-kernel data structures to efficiently collect and/or summarize data, specifically in cases where the kernel-userspace boundary would otherwise become a bottleneck.
Iterator programs can only be attached to specific pre-defined iterators. Each iterator follows the naming convention bpf_iter__<iter_name>, which is a type that can be found in the vmlinux of the kernel. This type is also the context with which the program will be invoked for each data structure in the iterator.
Iterator programs are typically located in an ELF section prefixed with iter/.
Context
Raw tracepoint
Please see the Raw Tracepoint page for details.
Fentry / Fexit / Fmodify_return
Programs for all of these attach types are provided with an array of u64 values representing the arguments to the function that is being traced. The Fexit and Fmodify_return programs are also provided with the return value of the function or the previous Fmodify_return program.
The BPF_PROG and BPF_PROG2 macros defined in libbpf can be used to cast the values of the array to their proper types to provide a more natural way of declaring a BPF program.
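To make the "array of u64 values" concrete, here is a userspace sketch of the casting that a BPF_PROG-style wrapper performs on the author's behalf. It is a model, not the libbpf macro itself: the traced function `do_open`, the handler names and the `observed` struct are all hypothetical, and the layout (one slot per argument, return value in the following slot) is assumed for scalar arguments only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* What the handler observed; stands in for real tracing output. */
struct observed {
    int fd;
    size_t name_len;
    long ret;
};

static struct observed last;

/*
 * Typed handler for a hypothetical traced function
 *     long do_open(int fd, const char *name);
 * The trailing ret parameter is only meaningful for fexit-style programs.
 */
static int handle_do_open(int fd, const char *name, long ret)
{
    last.fd = fd;
    last.name_len = strlen(name);
    last.ret = ret;
    return 0;
}

/* Raw entry point: one u64 slot per argument, return value in the next slot. */
static int raw_do_open_exit(const uint64_t *ctx)
{
    assert(ctx != NULL);
    return handle_do_open((int)ctx[0],
                          (const char *)(uintptr_t)ctx[1],
                          (long)ctx[2]);
}
```

The author only writes the typed handler; the raw entry point is the part a macro would generate. Because the arguments and the return value arrive together, nothing has to be stashed in a map between entry and exit.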
Some functions in the kernel are passed structures which are larger than 8 bytes. In that case the value of the argument may be spread over multiple indexes in the array. The BPF_PROG macro cannot deal with this, so when writing BPF programs that may attach to functions which take structures as arguments, the BPF_PROG2 macro should be used instead.
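The slot arithmetic behind this can be illustrated with a userspace sketch. The 16-byte `struct range` and the function name are hypothetical, and the assumption that such a struct occupies exactly two consecutive slots reflects common 64-bit calling conventions rather than a guarantee for every architecture.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte argument passed by value to a traced function. */
struct range {
    uint64_t start;
    uint64_t len;
};

_Static_assert(sizeof(struct range) == 16, "model assumes two u64 slots");

/*
 * Model of a traced function declared as
 *     f(struct range r, uint64_t flags);
 * The struct is rebuilt from two consecutive slots, so the scalar that
 * follows it lives in slot 2, not slot 1.
 */
static uint64_t range_end_plus_flags(const uint64_t *ctx)
{
    struct range r;
    uint64_t flags;

    assert(ctx != NULL);
    memcpy(&r, &ctx[0], sizeof(r)); /* occupies ctx[0] and ctx[1] */
    flags = ctx[2];                 /* one-slot-per-argument would wrongly read ctx[1] */
    return r.start + r.len + flags;
}
```

A wrapper that assumes one slot per argument would read `len` where it expects `flags`, which is the kind of mismatch BPF_PROG2 is meant to avoid.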
Iterator
The context for iterator programs differs per iterator. However, the first field of every iterator context, meta, is a pointer to struct bpf_iter_meta:
struct bpf_iter_meta {
    struct seq_file *seq;
    __u64 session_id;
    __u64 seq_num;
};
The rest of the context struct will contain the data structure for the current iteration.
The metadata contains a sequence file, which is effectively the output of the iterator program. When a pinned iterator is read, the iterator program will be invoked for each data structure in the iterator. The iterator program can then use the seq_file to output data to the user. Dedicated print helpers are used to write to the sequence file, such as bpf_seq_printf, bpf_seq_printf_btf, and bpf_seq_write.
The session_id is an incrementing value which is used to identify the current iteration session. The seq_num is the current iteration number within the session.
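The relationship between a session, seq_num and the output file can be sketched in plain userspace C. Everything here is a stand-in: `model_seq` replaces the kernel's seq_file with a fixed buffer, the `model_*` types and `run_session` are hypothetical, and the extra final call with a NULL element is an assumption about how iterators signal the end of a session.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-in for the seq_file: a plain, bounded text buffer. */
struct model_seq {
    char buf[128];
    size_t len;
};

struct model_iter_meta {
    struct model_seq *seq;
    uint64_t session_id;
    uint64_t seq_num;
};

/* Hypothetical per-iterator context: meta first, then the current element. */
struct model_iter_ctx {
    struct model_iter_meta *meta;
    const int *item; /* NULL on the final call */
};

static void seq_write_str(struct model_seq *seq, const char *s)
{
    size_t n = strlen(s);

    if (n < sizeof(seq->buf) - seq->len) {
        memcpy(seq->buf + seq->len, s, n + 1);
        seq->len += n;
    }
}

/* The "program": invoked once per element, plus once more with item == NULL. */
static int dump_item(struct model_iter_ctx *ctx)
{
    char line[32];

    if (ctx->item == NULL)
        return 0;
    if (ctx->meta->seq_num == 0)
        seq_write_str(ctx->meta->seq, "value\n"); /* header on first element */
    snprintf(line, sizeof(line), "%d\n", *ctx->item);
    seq_write_str(ctx->meta->seq, line);
    return 0;
}

/* One read of the iterator corresponds to one session. */
static void run_session(struct model_seq *seq, uint64_t session_id,
                        const int *items, size_t n)
{
    struct model_iter_meta meta = { seq, session_id, 0 };
    struct model_iter_ctx ctx = { &meta, NULL };

    assert(seq != NULL);
    for (size_t i = 0; i < n; i++) {
        ctx.item = &items[i];
        dump_item(&ctx);
        meta.seq_num++;
    }
    ctx.item = NULL; /* end-of-session call */
    dump_item(&ctx);
}
```

Checking `seq_num == 0` is how the program prints a header exactly once per session, and the NULL check is why real iterator programs usually begin by testing their element pointers.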
Attachment
All tracing programs are attached via BPF links. The program should be loaded with the correct attach type, and the same attach type must be used when creating the link via the attach_type attribute. The tracepoint, function or iterator to attach to should be specified via the target_btf_id attribute, whose value must match the BTF ID of the target in the vmlinux BTF blob.
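Loaders typically derive the attach type from the ELF section name. The lookup below is a userspace sketch of that convention using the section prefixes mentioned on this page; the attach type strings are the names of the corresponding `enum bpf_attach_type` values, while `sec_rules` and `attach_type_for_section` are hypothetical names and not part of libbpf.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Section prefix and the name of the matching enum bpf_attach_type value. */
struct sec_rule {
    const char *prefix;
    const char *attach_type;
};

static const struct sec_rule sec_rules[] = {
    { "tp_btf/",   "BPF_TRACE_RAW_TP"  },
    { "fentry/",   "BPF_TRACE_FENTRY"  },
    { "fexit/",    "BPF_TRACE_FEXIT"   },
    { "fmod_ret/", "BPF_MODIFY_RETURN" },
    { "iter/",     "BPF_TRACE_ITER"    },
};

/*
 * Return the attach type name for a section, or NULL if the prefix is not
 * one of the tracing prefixes. On success, *target (if provided) points at
 * the part after the prefix, i.e. the name to resolve to a BTF ID.
 */
static const char *attach_type_for_section(const char *sec, const char **target)
{
    assert(sec != NULL);
    for (size_t i = 0; i < sizeof(sec_rules) / sizeof(sec_rules[0]); i++) {
        size_t n = strlen(sec_rules[i].prefix);

        if (strncmp(sec, sec_rules[i].prefix, n) == 0) {
            if (target)
                *target = sec + n;
            return sec_rules[i].attach_type;
        }
    }
    return NULL;
}
```

For the section `fexit/inet_stream_connect` this yields `BPF_TRACE_FEXIT` with the target name `inet_stream_connect`; a real loader would then look that name up in the vmlinux BTF to obtain the ID used at load and link time.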
Example
Raw tracepoint
/**
 * A trivial example tracepoint program that shows how to
 * acquire and release a struct task_struct * pointer.
 */
SEC("tp_btf/task_newtask")
int BPF_PROG(task_acquire_release_example, struct task_struct *task, u64 clone_flags)
{
    struct task_struct *acquired;

    acquired = bpf_task_acquire(task);
    if (acquired)
        /*
         * In a typical program you'd do something like store
         * the task in a map, and the map will automatically
         * release it later. Here, we release it manually.
         */
        bpf_task_release(acquired);
    return 0;
}
Fentry
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
extern const int bpf_prog_active __ksym;
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 12);
} ringbuf SEC(".maps");

SEC("fentry/security_inode_getattr")
int BPF_PROG(d_path_check_rdonly_mem, struct path *path, struct kstat *stat,
             __u32 request_mask, unsigned int query_flags)
{
    void *active;
    u32 cpu;

    cpu = bpf_get_smp_processor_id();
    active = (void *)bpf_per_cpu_ptr(&bpf_prog_active, cpu);
    if (active) {
        /* FAIL here! 'active' points to 'regular' memory. It
         * cannot be submitted to ring buffer.
         */
        bpf_ringbuf_submit(active, 0);
    }
    return 0;
}
char _license[] SEC("license") = "GPL";
Fexit
/* socket_cookies is assumed to be a cgroup storage map defined elsewhere,
 * with a value type that has cookie_key and cookie_value fields.
 */
SEC("fexit/inet_stream_connect")
int BPF_PROG(update_cookie_tracing, struct socket *sock,
             struct sockaddr *uaddr, int addr_len, int flags, int ret)
{
    struct socket_cookie *p;

    /* Only IPv6 connections are of interest. */
    if (uaddr->sa_family != AF_INET6)
        return 0;

    p = bpf_cgrp_storage_get(&socket_cookies, sock->sk->sk_cgrp_data.cgroup, 0, 0);
    if (!p)
        return 0;

    if (p->cookie_key != bpf_get_socket_cookie(sock->sk))
        return 0;

    p->cookie_value |= 0xF0;
    return 0;
}
Fmodify_return
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "hid_bpf_helpers.h"
SEC("fmod_ret/hid_bpf_device_event")
int BPF_PROG(hid_y_event, struct hid_bpf_ctx *hctx)
{
    s16 y;
    __u8 *data = hid_bpf_get_data(hctx, 0 /* offset */, 9 /* size */);

    if (!data)
        return 0; /* EPERM check */

    bpf_printk("event: size: %d", hctx->size);
    bpf_printk("incoming event: %02x %02x %02x",
               data[0],
               data[1],
               data[2]);
    bpf_printk(" %02x %02x %02x",
               data[3],
               data[4],
               data[5]);
    bpf_printk(" %02x %02x %02x",
               data[6],
               data[7],
               data[8]);

    y = data[3] | (data[4] << 8);
    y = -y;
    data[3] = y & 0xFF;
    data[4] = (y >> 8) & 0xFF;

    bpf_printk("modified event: %02x %02x %02x",
               data[0],
               data[1],
               data[2]);
    bpf_printk(" %02x %02x %02x",
               data[3],
               data[4],
               data[5]);
    bpf_printk(" %02x %02x %02x",
               data[6],
               data[7],
               data[8]);
    return 0;
}
SEC("fmod_ret/hid_bpf_device_event")
int BPF_PROG(hid_x_event, struct hid_bpf_ctx *hctx)
{
    s16 x;
    __u8 *data = hid_bpf_get_data(hctx, 0 /* offset */, 9 /* size */);

    if (!data)
        return 0; /* EPERM check */

    x = data[1] | (data[2] << 8);
    x = -x;
    data[1] = x & 0xFF;
    data[2] = (x >> 8) & 0xFF;
    return 0;
}
SEC("fmod_ret/hid_bpf_rdesc_fixup")
int BPF_PROG(hid_rdesc_fixup, struct hid_bpf_ctx *hctx)
{
    __u8 *data = hid_bpf_get_data(hctx, 0 /* offset */, 4096 /* size */);

    if (!data)
        return 0; /* EPERM check */

    bpf_printk("rdesc: %02x %02x %02x",
               data[0],
               data[1],
               data[2]);
    bpf_printk(" %02x %02x %02x",
               data[3],
               data[4],
               data[5]);
    bpf_printk(" %02x %02x %02x ...",
               data[6],
               data[7],
               data[8]);

    /*
     * The original report descriptor contains:
     *
     * 0x05, 0x01, // Usage Page (Generic Desktop) 30
     * 0x16, 0x01, 0x80, // Logical Minimum (-32767) 32
     * 0x26, 0xff, 0x7f, // Logical Maximum (32767) 35
     * 0x09, 0x30, // Usage (X) 38
     * 0x09, 0x31, // Usage (Y) 40
     *
     * So byte 39 contains Usage X and byte 41 Usage Y.
     *
     * We simply swap the axes here.
     */
    data[39] = 0x31;
    data[41] = 0x30;
    return 0;
}
char _license[] SEC("license") = "GPL";
Iterator
/* Assumed declarations for the globals used below; they are not part of
 * the original excerpt.
 */
int count = 0;
int tgid = 0;
int last_tgid = 0;
int unique_tgid_count = 0;

SEC("iter/task_file")
int dump_task_file(struct bpf_iter__task_file *ctx)
{
    struct seq_file *seq = ctx->meta->seq;
    struct task_struct *task = ctx->task;
    struct file *file = ctx->file;
    __u32 fd = ctx->fd;

    if (task == NULL || file == NULL)
        return 0;

    /* First element of the session: reset the counter and print a header. */
    if (ctx->meta->seq_num == 0) {
        count = 0;
        BPF_SEQ_PRINTF(seq, " tgid gid fd file\n");
    }

    if (tgid == task->tgid && task->tgid != task->pid)
        count++;

    if (last_tgid != task->tgid) {
        last_tgid = task->tgid;
        unique_tgid_count++;
    }

    BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
                   (long)file->f_op);
    return 0;
}
Helper functions
Not all helper functions are available in all program types. These are the helper calls available for tracing programs:
Supported helper functions
bpf_cgrp_storage_delete
bpf_cgrp_storage_get
bpf_copy_from_user
bpf_copy_from_user_task
bpf_current_task_under_cgroup
bpf_d_path
bpf_dynptr_data
bpf_dynptr_from_mem
bpf_dynptr_read
bpf_dynptr_write
bpf_find_vma
bpf_for_each_map_elem
bpf_get_attach_cookie (v5.19)
bpf_get_branch_snapshot
bpf_get_current_ancestor_cgroup_id
bpf_get_current_cgroup_id
bpf_get_current_comm
bpf_get_current_pid_tgid
bpf_get_current_task
bpf_get_current_task_btf
bpf_get_current_uid_gid
bpf_get_func_arg
bpf_get_func_arg_cnt
bpf_get_func_ip
bpf_get_func_ret
bpf_get_ns_current_pid_tgid
bpf_get_numa_node_id
bpf_get_prandom_u32
bpf_get_smp_processor_id
bpf_get_socket_cookie
bpf_get_stack
bpf_get_stackid
bpf_get_task_stack
bpf_jiffies64
bpf_kptr_xchg
bpf_ktime_get_boot_ns
bpf_ktime_get_ns
bpf_ktime_get_tai_ns
bpf_loop
bpf_map_delete_elem
bpf_map_lookup_elem
bpf_map_lookup_percpu_elem
bpf_map_peek_elem
bpf_map_pop_elem
bpf_map_push_elem
bpf_map_update_elem
bpf_per_cpu_ptr
bpf_perf_event_output
bpf_perf_event_read
bpf_perf_event_read_value
bpf_probe_read
bpf_probe_read_kernel
bpf_probe_read_kernel_str
bpf_probe_read_str
bpf_probe_read_user
bpf_probe_read_user_str
bpf_probe_write_user
bpf_ringbuf_discard
bpf_ringbuf_discard_dynptr
bpf_ringbuf_output
bpf_ringbuf_query
bpf_ringbuf_reserve
bpf_ringbuf_reserve_dynptr
bpf_ringbuf_submit
bpf_ringbuf_submit_dynptr
bpf_send_signal
bpf_send_signal_thread
bpf_seq_printf
bpf_seq_printf_btf
bpf_seq_write
bpf_sk_storage_delete (v5.11)
bpf_sk_storage_get (v5.11)
bpf_skb_output
bpf_skc_to_mptcp_sock
bpf_skc_to_tcp6_sock
bpf_skc_to_tcp_request_sock
bpf_skc_to_tcp_sock
bpf_skc_to_tcp_timewait_sock
bpf_skc_to_udp6_sock
bpf_skc_to_unix_sock
bpf_snprintf
bpf_snprintf_btf
bpf_sock_from_file
bpf_spin_lock
bpf_spin_unlock
bpf_strncmp
bpf_tail_call
bpf_task_pt_regs
bpf_task_storage_delete
bpf_task_storage_get
bpf_this_cpu_ptr
bpf_timer_cancel
bpf_timer_init
bpf_timer_set_callback
bpf_timer_start
bpf_trace_printk
bpf_trace_vprintk
bpf_user_ringbuf_drain
bpf_xdp_get_buff_len
bpf_xdp_output
KFuncs
Supported kfuncs
bpf_arena_alloc_pages
bpf_arena_free_pages
bpf_cast_to_kern_ctx
bpf_cgroup_acquire
bpf_cgroup_ancestor
bpf_cgroup_from_id
bpf_cgroup_release
bpf_cpumask_acquire
bpf_cpumask_and
bpf_cpumask_any_and_distribute
bpf_cpumask_any_distribute
bpf_cpumask_clear
bpf_cpumask_clear_cpu
bpf_cpumask_copy
bpf_cpumask_create
bpf_cpumask_empty
bpf_cpumask_equal
bpf_cpumask_first
bpf_cpumask_first_and
bpf_cpumask_first_zero
bpf_cpumask_full
bpf_cpumask_intersects
bpf_cpumask_or
bpf_cpumask_release
bpf_cpumask_set_cpu
bpf_cpumask_setall
bpf_cpumask_subset
bpf_cpumask_test_and_clear_cpu
bpf_cpumask_test_and_set_cpu
bpf_cpumask_test_cpu
bpf_cpumask_weight
bpf_cpumask_xor
bpf_dynptr_adjust
bpf_dynptr_clone
bpf_dynptr_is_null
bpf_dynptr_is_rdonly
bpf_dynptr_size
bpf_dynptr_slice
bpf_dynptr_slice_rdwr
bpf_iter_bits_destroy
bpf_iter_bits_new
bpf_iter_bits_next
bpf_iter_css_destroy
bpf_iter_css_new
bpf_iter_css_next
bpf_iter_css_task_destroy
bpf_iter_css_task_new
bpf_iter_css_task_next
bpf_iter_num_destroy
bpf_iter_num_new
bpf_iter_num_next
bpf_iter_task_destroy
bpf_iter_task_new
bpf_iter_task_next
bpf_iter_task_vma_destroy
bpf_iter_task_vma_new
bpf_iter_task_vma_next
bpf_key_put
bpf_list_pop_back
bpf_list_pop_front
bpf_list_push_back_impl
bpf_list_push_front_impl
bpf_lookup_system_key
bpf_lookup_user_key
bpf_map_sum_elem_count
bpf_obj_drop_impl
bpf_obj_new_impl
bpf_percpu_obj_drop_impl
bpf_percpu_obj_new_impl
bpf_preempt_disable
bpf_preempt_enable
bpf_rbtree_add_impl
bpf_rbtree_first
bpf_rbtree_remove
bpf_rcu_read_lock
bpf_rcu_read_unlock
bpf_rdonly_cast
bpf_refcount_acquire_impl
bpf_sock_destroy
bpf_task_acquire
bpf_task_from_pid
bpf_task_get_cgroup1
bpf_task_release
bpf_task_under_cgroup
bpf_throw
bpf_verify_pkcs7_signature
bpf_wq_init
bpf_wq_set_callback_impl
bpf_wq_start
cgroup_rstat_flush
cgroup_rstat_updated
crash_kexec
hid_bpf_allocate_context
hid_bpf_get_data
hid_bpf_hw_output_report
hid_bpf_hw_request
hid_bpf_input_report
hid_bpf_release_context
hid_bpf_try_input_report