Program type BPF_PROG_TYPE_SCHED_CLS
This program type allows a Traffic Control (TC) classifier (also known as a filter) to be implemented in eBPF. TC serves a number of use cases, all of which involve manipulating traffic. For example, TC is used to implement QoS (Quality of Service), allowing latency-sensitive traffic such as VoIP (Voice over IP) to be processed ahead of, say, web traffic. It can also drop packets to simulate packet loss, add latency to simulate distant clients, or apply bandwidth limits to applications or users, to name a few.
TC allows an admin to filter traffic using a hierarchical model of qdiscs (
Usage
TC Classifier programs are typically put into an ELF section prefixed with tc/ or classifier/. The TC Classifier program is called by the kernel with a __sk_buff context. The return value indicates what action the kernel should take with the packet. Which values are permitted depends on the mode the program is attached in:
Regular classifier
By default, when a BPF classifier is attached to a qdisc, it acts like any other classifier. It can't take actions such as dropping or redirecting packets; instead, its return value is used to pick a class based on the contents of the packet. A return value of -1 indicates that the default class should be picked, a return value of 0 means the filter did not match and the next filter should be tried, and any positive number is the ID of the class.
While possible, this use case is rare; eBPF programs are typically used in direct action mode.
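The return value convention for a regular classifier can be sketched in plain C. The snippet below is a minimal illustration, not a loadable eBPF object: classify_by_dport, CLASS_HANDLE, and the port-to-class mapping are hypothetical names and choices made for this example, and it assumes class IDs are encoded as a 32-bit handle with the major number in the upper 16 bits and the minor number in the lower 16 bits. In a real program, the same logic would sit in a function placed in a tc/ or classifier/ section and would first parse the port out of the __sk_buff context.

```c
#include <stdint.h>

/* Assumption: a class handle is 32 bits wide, with the major number in the
 * upper 16 bits and the minor number in the lower 16 bits. */
#define CLASS_HANDLE(major, minor) \
    ((((uint32_t)(major)) << 16) | ((uint32_t)(minor) & 0xFFFFu))

/* Hypothetical policy: pick a class from the destination port.
 * The return value follows the regular (non direct-action) convention:
 *   -1 : use the default class
 *    0 : no match, the next filter should try
 *   >0 : ID of the class to use
 */
static inline int classify_by_dport(uint16_t dport)
{
    switch (dport) {
    case 5060:                          /* SIP signalling (VoIP) */
        return (int)CLASS_HANDLE(1, 1); /* class 1:1 */
    case 80:
    case 443:                           /* web traffic */
        return (int)CLASS_HANDLE(1, 2); /* class 1:2 */
    case 22:                            /* SSH: fall back to the default class */
        return -1;
    default:
        return 0;                       /* did not match */
    }
}
```

With this mapping, port 5060 yields the handle 0x10001 (class 1:1), while an unlisted port yields 0 so that the next filter in the chain gets a chance to classify the packet.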
Direct action
When attached in direct action mode, the eBPF program will act as both a classifier and an action. This mode simplifies setups for the most common use cases where we just want to always execute an action. In direct action mode the return value can be one of:
TC_ACT_UNSPEC (-1) - Signals that the default configured action should be taken.
TC_ACT_OK (0) - Signals that the packet should proceed.
TC_ACT_RECLASSIFY (1) - Signals that the packet has to restart classification from the root qdisc. This is typically used after modifying the packet, since its classification might then have a different result.
TC_ACT_SHOT (2) - Signals that the packet should be dropped; no other TC processing should happen.
TC_ACT_PIPE (3) - While defined, this action should not be used and holds no particular meaning for eBPF classifiers.
TC_ACT_STOLEN (4) - While defined, this action should not be used and holds no particular meaning for eBPF classifiers.
TC_ACT_QUEUED (5) - While defined, this action should not be used and holds no particular meaning for eBPF classifiers.
TC_ACT_REPEAT (6) - While defined, this action should not be used and holds no particular meaning for eBPF classifiers.
TC_ACT_REDIRECT (7) - Signals that the packet should be redirected; the details of how and where to are set as side effects by helper functions.
Classifiers in direct action mode can still set a class ID by setting the tc_classid field of the context.
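As a rough sketch of how a direct action program combines a verdict with a class ID, the plain C snippet below drops oversized packets and tags everything else with a class. It is an illustration under stated assumptions rather than a loadable eBPF object: struct pkt_ctx is a hypothetical stand-in for the two __sk_buff fields involved (len and tc_classid), da_verdict and its parameters are made-up names, and the return codes are redefined locally with the values listed above instead of being taken from the kernel headers.

```c
#include <stdint.h>

/* Return codes with the values listed above; a real program would get
 * them from the kernel headers instead of redefining them. */
enum {
    TC_ACT_UNSPEC = -1,
    TC_ACT_OK     = 0,
    TC_ACT_SHOT   = 2,
};

/* Hypothetical stand-in for the context fields this sketch touches. */
struct pkt_ctx {
    uint32_t len;        /* packet length in bytes */
    uint32_t tc_classid; /* class ID chosen by the program */
};

/* Hypothetical direct-action policy: drop packets longer than max_len,
 * otherwise record the class ID and let the packet proceed. */
static inline int da_verdict(struct pkt_ctx *ctx, uint32_t max_len,
                             uint32_t classid)
{
    if (ctx->len > max_len)
        return TC_ACT_SHOT;    /* dropped, no further TC processing */

    ctx->tc_classid = classid; /* classification as a side effect */
    return TC_ACT_OK;          /* packet proceeds */
}
```

The point of the sketch is that in direct action mode the return value carries the action, so the class has to be communicated separately through the tc_classid field.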
Context
This program type may not read from or write to every field of the context, either because doing so might break assumptions in the kernel or because the data is not available at the point where the program is hooked into the kernel.
Context fields
Attachment
As of kernel v6.2, the only way to attach eBPF programs to TC is via a netlink socket, the details of which are complex. If you wish to manage attachment through an API, using a netlink library is recommended. The most common approach, however, is the iproute2 tc CLI tool, which is the standard implementation of network utilities that use the netlink protocol.
The most basic example of attaching a TC classifier is:
# Add a qdisc of type `clsact` to device `eth1`
$ tc qdisc add dev eth1 clsact
# Load the `program.o` ELF file, and attach the `my_func` section to the qdisc of eth1 on the ingress side.
$ tc filter add dev eth1 ingress bpf obj program.o sec my_func
For more details on the tc command, see the general man page.
For more details on the bpf filter options, see the tc-bpf man page.
In addition, since v6.6 the kernel supports tcx (the new TC BPF fast path with BPF link support), which allows for more advanced features such as attaching multiple programs to a single qdisc or attaching programs to a qdisc on the egress side:
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
| Program Type | Attach Type | ELF Section Name | Sleepable |
+===========================================+========================================+==================================+===========+
| ``BPF_PROG_TYPE_SCHED_CLS`` | | ``classifier`` [#tc_legacy]_ | |
+ + +----------------------------------+-----------+
| | | ``tc`` [#tc_legacy]_ | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_NETKIT_PRIMARY`` | ``netkit/primary`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_NETKIT_PEER`` | ``netkit/peer`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_TCX_INGRESS`` | ``tc/ingress`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_TCX_EGRESS`` | ``tc/egress`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_TCX_INGRESS`` | ``tcx/ingress`` | |
+ +----------------------------------------+----------------------------------+-----------+
| | ``BPF_TCX_EGRESS`` | ``tcx/egress`` | |
+-------------------------------------------+----------------------------------------+----------------------------------+-----------+
The definition of return codes for tcx programs can be found in the kernel sources:
/* (Simplified) user return codes for tcx prog type.
 * A valid tcx program must return one of these defined values. All other
 * return codes are reserved for future use. Must remain compatible with
 * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
 * return codes are mapped to TCX_NEXT.
 */
enum tcx_action_base {
    TCX_NEXT = -1,
    TCX_PASS = 0,
    TCX_DROP = 2,
    TCX_REDIRECT = 7,
};
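The two rules stated in the kernel comment (the values stay compatible with their TC_ACT_* counterparts, and unknown return codes are mapped to TCX_NEXT) can be sketched in plain C as follows. This is an illustration only: tcx_normalize is a hypothetical name chosen for this example, not a kernel function, and the TC_ACT_* values are redefined locally using the numbers from the direct action list above.

```c
/* tcx return codes, copied from the kernel definition quoted above. */
enum tcx_action_base {
    TCX_NEXT     = -1,
    TCX_PASS     = 0,
    TCX_DROP     = 2,
    TCX_REDIRECT = 7,
};

/* TC_ACT_* counterparts, with the values from the direct action list. */
enum {
    TC_ACT_UNSPEC   = -1,
    TC_ACT_OK       = 0,
    TC_ACT_SHOT     = 2,
    TC_ACT_REDIRECT = 7,
};

/* Compile-time check of the "must remain compatible" rule. */
_Static_assert(TCX_NEXT == TC_ACT_UNSPEC, "TCX_NEXT must match TC_ACT_UNSPEC");
_Static_assert(TCX_PASS == TC_ACT_OK, "TCX_PASS must match TC_ACT_OK");
_Static_assert(TCX_DROP == TC_ACT_SHOT, "TCX_DROP must match TC_ACT_SHOT");
_Static_assert(TCX_REDIRECT == TC_ACT_REDIRECT,
               "TCX_REDIRECT must match TC_ACT_REDIRECT");

/* Hypothetical helper: reduce an arbitrary program return code to one of
 * the defined tcx codes, treating anything unknown as TCX_NEXT. */
static inline enum tcx_action_base tcx_normalize(int code)
{
    switch (code) {
    case TCX_PASS:
    case TCX_DROP:
    case TCX_REDIRECT:
        return (enum tcx_action_base)code;
    default:
        return TCX_NEXT; /* covers TCX_NEXT itself and all unknown codes */
    }
}
```

Under this mapping, a value that is meaningful for classic TC but not defined for tcx, such as 1 (TC_ACT_RECLASSIFY), would be treated as TCX_NEXT.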
For more details of tcx, see the
Helper functions
Not all helper functions are available in all program types. These are the helper calls available for TC classifier programs:
Supported helper functions
bpf_cgrp_storage_delete
bpf_cgrp_storage_get
bpf_check_mtu
bpf_clone_redirect
bpf_csum_diff
bpf_csum_level
bpf_csum_update
bpf_dynptr_data
bpf_dynptr_from_mem
bpf_dynptr_read
bpf_dynptr_write
bpf_fib_lookup
bpf_for_each_map_elem
bpf_get_cgroup_classid
bpf_get_current_pid_tgid (v6.10)
bpf_get_current_task
bpf_get_current_task_btf
bpf_get_hash_recalc
bpf_get_listener_sock
bpf_get_ns_current_pid_tgid (v6.10)
bpf_get_numa_node_id
bpf_get_prandom_u32
bpf_get_route_realm
bpf_get_smp_processor_id
bpf_get_socket_cookie
bpf_get_socket_uid
bpf_jiffies64
bpf_kptr_xchg
bpf_ktime_get_boot_ns
bpf_ktime_get_ns
bpf_ktime_get_tai_ns
bpf_l3_csum_replace
bpf_l4_csum_replace
bpf_loop
bpf_map_delete_elem
bpf_map_lookup_elem
bpf_map_lookup_percpu_elem
bpf_map_peek_elem
bpf_map_pop_elem
bpf_map_push_elem
bpf_map_update_elem
bpf_per_cpu_ptr
bpf_perf_event_output
bpf_probe_read_kernel
bpf_probe_read_kernel_str
bpf_probe_read_user
bpf_probe_read_user_str
bpf_redirect
bpf_redirect_neigh
bpf_redirect_peer
bpf_ringbuf_discard
bpf_ringbuf_discard_dynptr
bpf_ringbuf_output
bpf_ringbuf_query
bpf_ringbuf_reserve
bpf_ringbuf_reserve_dynptr
bpf_ringbuf_submit
bpf_ringbuf_submit_dynptr
bpf_set_hash
bpf_set_hash_invalid
bpf_sk_assign
bpf_sk_fullsock
bpf_sk_lookup_tcp
bpf_sk_lookup_udp
bpf_sk_release
bpf_sk_storage_delete
bpf_sk_storage_get
bpf_skb_adjust_room
bpf_skb_ancestor_cgroup_id
bpf_skb_cgroup_classid
bpf_skb_cgroup_id
bpf_skb_change_head (v5.8)
bpf_skb_change_proto
bpf_skb_change_tail
bpf_skb_change_type
bpf_skb_ecn_set_ce
bpf_skb_get_tunnel_key
bpf_skb_get_tunnel_opt
bpf_skb_get_xfrm_state
bpf_skb_load_bytes
bpf_skb_load_bytes_relative
bpf_skb_pull_data
bpf_skb_set_tstamp
bpf_skb_set_tunnel_key
bpf_skb_set_tunnel_opt
bpf_skb_store_bytes
bpf_skb_under_cgroup
bpf_skb_vlan_pop
bpf_skb_vlan_push
bpf_skc_lookup_tcp
bpf_snprintf
bpf_snprintf_btf
bpf_spin_lock
bpf_spin_unlock
bpf_strncmp
bpf_tail_call
bpf_task_pt_regs
bpf_tcp_check_syncookie
bpf_tcp_gen_syncookie
bpf_tcp_raw_check_syncookie_ipv4
bpf_tcp_raw_check_syncookie_ipv6
bpf_tcp_raw_gen_syncookie_ipv4
bpf_tcp_raw_gen_syncookie_ipv6
bpf_tcp_sock
bpf_this_cpu_ptr
bpf_timer_cancel
bpf_timer_init
bpf_timer_set_callback
bpf_timer_start
bpf_trace_printk
bpf_trace_vprintk
bpf_user_ringbuf_drain
KFuncs
Supported kfuncs
bpf_arena_alloc_pages
bpf_arena_free_pages
bpf_cast_to_kern_ctx
bpf_cgroup_acquire
bpf_cgroup_ancestor
bpf_cgroup_from_id
bpf_cgroup_release
bpf_crypto_decrypt
bpf_crypto_encrypt
bpf_ct_change_status
bpf_ct_change_timeout
bpf_ct_insert_entry
bpf_ct_release
bpf_ct_set_nat_info
bpf_ct_set_status
bpf_ct_set_timeout
bpf_dynptr_adjust
bpf_dynptr_clone
bpf_dynptr_from_skb
bpf_dynptr_is_null
bpf_dynptr_is_rdonly
bpf_dynptr_size
bpf_dynptr_slice
bpf_dynptr_slice_rdwr
bpf_iter_bits_destroy
bpf_iter_bits_new
bpf_iter_bits_next
bpf_iter_css_destroy
bpf_iter_css_new
bpf_iter_css_next
bpf_iter_css_task_destroy
bpf_iter_css_task_new
bpf_iter_css_task_next
bpf_iter_num_destroy
bpf_iter_num_new
bpf_iter_num_next
bpf_iter_task_destroy
bpf_iter_task_new
bpf_iter_task_next
bpf_iter_task_vma_destroy
bpf_iter_task_vma_new
bpf_iter_task_vma_next
bpf_list_pop_back
bpf_list_pop_front
bpf_list_push_back_impl
bpf_list_push_front_impl
bpf_map_sum_elem_count
bpf_obj_drop_impl
bpf_obj_new_impl
bpf_percpu_obj_drop_impl
bpf_percpu_obj_new_impl
bpf_preempt_disable
bpf_preempt_enable
bpf_rbtree_add_impl
bpf_rbtree_first
bpf_rbtree_remove
bpf_rcu_read_lock
bpf_rcu_read_unlock
bpf_rdonly_cast
bpf_refcount_acquire_impl
bpf_sk_assign_tcp_reqsk
bpf_skb_ct_alloc
bpf_skb_ct_lookup
bpf_skb_get_fou_encap
bpf_skb_get_xfrm_info
bpf_skb_set_fou_encap
bpf_skb_set_xfrm_info
bpf_task_acquire
bpf_task_from_pid
bpf_task_get_cgroup1
bpf_task_release
bpf_task_under_cgroup
bpf_throw
bpf_wq_init
bpf_wq_set_callback_impl
bpf_wq_start
bpf_xdp_ct_alloc
bpf_xdp_ct_lookup
crash_kexec