Program type BPF_PROG_TYPE_SCHED_CLS
This program type allows for the implementation of a Traffic Control (TC) classifier (aka filter) in eBPF. TC serves a number of use cases, all of which involve manipulating traffic. For example, TC is used to implement QoS (Quality of Service), allowing latency-sensitive traffic like VoIP (Voice over IP) to be processed ahead of, let's say, web traffic. It can also drop packets to simulate packet loss, add latency to simulate distant clients, or apply bandwidth limitations for applications or users, to name a few.
TC allows an admin to filter traffic using a hierarchical model of qdiscs (
Usage
TC Classifier programs are typically put into an ELF section prefixed with tc/ or classifier/. The kernel calls the TC Classifier program with a __sk_buff context. The return value indicates what action the kernel should take with the packet. The following values are permitted:
Regular classifier
By default, when a BPF classifier is attached to a qdisc, it acts like any other classifier. It can't take actions such as dropping or redirecting packets. Instead, its return value is used to pick a class based on the contents of the packet. A return value of -1 indicates that the default class should be picked, a return value of 0 means the filter did not match and the next filter should be tried, and any positive number is the id of the class to pick.
While possible, this use case is rare. eBPF programs are typically attached in direct action mode.
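The return value convention described above can be sketched as follows. This is a minimal illustration, not a recommended policy: the mark values, the class handle 1:10, and the function name `pick_class` are all hypothetical, and `SEC` is defined locally as a stand-in for the macro that libbpf's `bpf_helpers.h` normally provides.

```c
#include <linux/bpf.h> /* struct __sk_buff */

/* Local stand-in for SEC() from libbpf's <bpf/bpf_helpers.h>. */
#ifndef SEC
#define SEC(name) __attribute__((section(name), used))
#endif

/* Hypothetical values, chosen only for this example. */
#define MARK_UNKNOWN 0
#define MARK_VOIP    1
#define CLASSID_VOIP 0x00010010 /* handle 1:10, major number in the upper 16 bits */

SEC("classifier")
int pick_class(struct __sk_buff *skb)
{
	if (skb->mark == MARK_UNKNOWN)
		return 0;            /* no match: the next filter should try */
	if (skb->mark == MARK_VOIP)
		return CLASSID_VOIP; /* positive value: id of the class to use */
	return -1;                   /* any other mark: pick the default class */
}
```

The program only reads the `mark` field, so it needs no bounds checks on packet data. A classifier that inspects packet bytes would additionally have to validate every access against `data_end`.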
Direct action
When attached in direct action mode, the eBPF program will act as both a classifier and an action. This mode simplifies setups for the most common use cases where we just want to always execute an action. In direct action mode the return value can be one of:
- TC_ACT_UNSPEC(-1) - Signals that the default configured action should be taken.
- TC_ACT_OK(0) - Signals that the packet should proceed.
- TC_ACT_RECLASSIFY(1) - Signals that the packet has to re-start classification from the root qdisc. This is typically used after modifying the packet so its classification might have different results.
- TC_ACT_SHOT(2) - Signals that the packet should be dropped, no other TC processing should happen.
- TC_ACT_PIPE(3) - While defined, this action should not be used and holds no particular meaning for eBPF classifiers.
- TC_ACT_STOLEN(4) - While defined, this action should not be used and holds no particular meaning for eBPF classifiers.
- TC_ACT_QUEUED(5) - While defined, this action should not be used and holds no particular meaning for eBPF classifiers.
- TC_ACT_REPEAT(6) - While defined, this action should not be used and holds no particular meaning for eBPF classifiers.
- TC_ACT_REDIRECT(7) - Signals that the packet should be redirected. The details of how and where to redirect are set as side effects by helper functions.
Classifiers in direct action mode can still set a class id by setting the tc_classid field of the __sk_buff context.
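As a rough sketch of direct action mode, the program below both decides the packet's fate and sets a class id. The mark values, the class handle 1:10, and the function name `direct_action` are assumptions made for this example, and `SEC` is again a local stand-in for the macro from libbpf's `bpf_helpers.h`.

```c
#include <linux/bpf.h>     /* struct __sk_buff */
#include <linux/pkt_cls.h> /* TC_ACT_* return codes */

/* Local stand-in for SEC() from libbpf's <bpf/bpf_helpers.h>. */
#ifndef SEC
#define SEC(name) __attribute__((section(name), used))
#endif

/* Hypothetical values, chosen only for this example. */
#define MARK_BLOCKED 0xdead
#define MARK_VOIP    1
#define CLASSID_VOIP 0x00010010 /* handle 1:10, major number in the upper 16 bits */

SEC("tc")
int direct_action(struct __sk_buff *skb)
{
	/* Act as an action: drop packets carrying the blocked mark. */
	if (skb->mark == MARK_BLOCKED)
		return TC_ACT_SHOT;

	/* Act as a classifier: steer marked packets into a class. */
	if (skb->mark == MARK_VOIP)
		skb->tc_classid = CLASSID_VOIP;

	/* Let the packet proceed. */
	return TC_ACT_OK;
}
```

When attaching with the iproute2 tc tool, direct action mode is requested with the `da` (or `direct-action`) flag of the bpf filter. Without it, the return value is interpreted as described under "Regular classifier".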
Context
This program type may not read from or write to every field of the context, either because doing so might break assumptions in the kernel or because the data is not available at the point where the program is hooked into the kernel.
Context fields
Attachment
As of kernel version v6.2, the only way to attach eBPF programs to TC is via a netlink socket, the details of which are complex. Using a netlink library is recommended if you wish to manage attachment via an API. However, the most common approach is the iproute2 tc CLI tool, which is the standard implementation of network utilities using the netlink protocol.
The most basic example of attaching a TC classifier is:
# Add a qdisc of type `clsact` to device `eth1`
$ tc qdisc add dev eth1 clsact
# Load the `program.o` ELF file, and attach the `my_func` section to the qdisc of eth1 on the ingress side.
$ tc filter add dev eth1 ingress bpf obj program.o sec my_func
For more details on the tc command, see the general man page.
For more details on the bpf filter options, see the tc-bpf man page.
In addition, the kernel supports the tcx (the new tc BPF fast path with BPF link support) since kernel v6.6, which allows for more advanced features like attaching multiple programs to a single qdisc, or attaching programs to a qdisc on the egress side:
| Attach Type | ELF Section Name |
|---|---|
| BPF_NETKIT_PRIMARY | netkit/primary |
| BPF_NETKIT_PEER | netkit/peer |
| BPF_TCX_INGRESS | tc/ingress |
| BPF_TCX_EGRESS | tc/egress |
| BPF_TCX_INGRESS | tcx/ingress |
| BPF_TCX_EGRESS | tcx/egress |
The definition of return codes for tcx programs can be found in the kernel sources:
/* (Simplified) user return codes for tcx prog type.
 * A valid tcx program must return one of these defined values. All other
 * return codes are reserved for future use. Must remain compatible with
 * their TC_ACT_* counter-parts. For compatibility in behavior, unknown
 * return codes are mapped to TCX_NEXT.
 */
enum tcx_action_base {
	TCX_NEXT	= -1,
	TCX_PASS	= 0,
	TCX_DROP	= 2,
	TCX_REDIRECT	= 7,
};
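The rule in the comment above, that unknown return codes are mapped to TCX_NEXT, can be illustrated with a small standalone function. This is only a sketch of the documented behavior: `tcx_normalize` is a hypothetical name, not a kernel API, and the enum is repeated so the block compiles on its own.

```c
/* Repeated from the kernel definition so this block is self-contained. */
enum tcx_action_base {
	TCX_NEXT     = -1,
	TCX_PASS     = 0,
	TCX_DROP     = 2,
	TCX_REDIRECT = 7,
};

/* Hypothetical helper: reduce an arbitrary program return value to one of
 * the defined tcx actions, treating every unknown code as TCX_NEXT. */
static inline enum tcx_action_base tcx_normalize(int code)
{
	switch (code) {
	case TCX_PASS:
	case TCX_DROP:
	case TCX_REDIRECT:
		return (enum tcx_action_base)code;
	case TCX_NEXT:
	default:
		return TCX_NEXT;
	}
}
```

Under this rule a program that returns, say, TC_ACT_RECLASSIFY (1) from a tcx hook is treated as if it had returned TCX_NEXT, so sticking to the four defined values keeps behavior predictable.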
For more details of tcx, see the
Helper functions
Not all helper functions are available in all program types. These are the helper calls available for TC classifier programs:
Supported helper functions
- bpf_cgrp_storage_delete
- bpf_cgrp_storage_get
- bpf_check_mtu
- bpf_clone_redirect
- bpf_csum_diff
- bpf_csum_level
- bpf_csum_update
- bpf_dynptr_data
- bpf_dynptr_from_mem
- bpf_dynptr_read
- bpf_dynptr_write
- bpf_fib_lookup
- bpf_for_each_map_elem
- bpf_get_cgroup_classid
- bpf_get_current_ancestor_cgroup_id v6.4
- bpf_get_current_cgroup_id v6.4
- bpf_get_current_pid_tgid v6.10
- bpf_get_current_task
- bpf_get_current_task_btf
- bpf_get_hash_recalc
- bpf_get_listener_sock
- bpf_get_netns_cookie v6.13
- bpf_get_ns_current_pid_tgid v6.10
- bpf_get_numa_node_id
- bpf_get_prandom_u32
- bpf_get_route_realm
- bpf_get_smp_processor_id
- bpf_get_socket_cookie
- bpf_get_socket_uid
- bpf_jiffies64
- bpf_kptr_xchg
- bpf_ktime_get_boot_ns
- bpf_ktime_get_ns
- bpf_ktime_get_tai_ns
- bpf_l3_csum_replace
- bpf_l4_csum_replace
- bpf_loop
- bpf_map_delete_elem
- bpf_map_lookup_elem
- bpf_map_lookup_percpu_elem
- bpf_map_peek_elem
- bpf_map_pop_elem
- bpf_map_push_elem
- bpf_map_update_elem
- bpf_per_cpu_ptr
- bpf_perf_event_output
- bpf_probe_read_kernel
- bpf_probe_read_kernel_str
- bpf_probe_read_user
- bpf_probe_read_user_str
- bpf_redirect
- bpf_redirect_neigh
- bpf_redirect_peer
- bpf_ringbuf_discard
- bpf_ringbuf_discard_dynptr
- bpf_ringbuf_output
- bpf_ringbuf_query
- bpf_ringbuf_reserve
- bpf_ringbuf_reserve_dynptr
- bpf_ringbuf_submit
- bpf_ringbuf_submit_dynptr
- bpf_set_hash
- bpf_set_hash_invalid
- bpf_sk_assign
- bpf_sk_fullsock
- bpf_sk_lookup_tcp
- bpf_sk_lookup_udp
- bpf_sk_release
- bpf_sk_storage_delete
- bpf_sk_storage_get
- bpf_skb_adjust_room
- bpf_skb_ancestor_cgroup_id
- bpf_skb_cgroup_classid
- bpf_skb_cgroup_id
- bpf_skb_change_head v5.8
- bpf_skb_change_proto
- bpf_skb_change_tail
- bpf_skb_change_type
- bpf_skb_ecn_set_ce
- bpf_skb_get_tunnel_key
- bpf_skb_get_tunnel_opt
- bpf_skb_get_xfrm_state
- bpf_skb_load_bytes
- bpf_skb_load_bytes_relative
- bpf_skb_pull_data
- bpf_skb_set_tstamp
- bpf_skb_set_tunnel_key
- bpf_skb_set_tunnel_opt
- bpf_skb_store_bytes
- bpf_skb_under_cgroup
- bpf_skb_vlan_pop
- bpf_skb_vlan_push
- bpf_skc_lookup_tcp
- bpf_snprintf
- bpf_snprintf_btf
- bpf_spin_lock
- bpf_spin_unlock
- bpf_strncmp
- bpf_tail_call
- bpf_task_pt_regs
- bpf_tcp_check_syncookie
- bpf_tcp_gen_syncookie
- bpf_tcp_raw_check_syncookie_ipv4
- bpf_tcp_raw_check_syncookie_ipv6
- bpf_tcp_raw_gen_syncookie_ipv4
- bpf_tcp_raw_gen_syncookie_ipv6
- bpf_tcp_sock
- bpf_this_cpu_ptr
- bpf_timer_cancel
- bpf_timer_init
- bpf_timer_set_callback
- bpf_timer_start
- bpf_trace_printk
- bpf_trace_vprintk
- bpf_user_ringbuf_drain
KFuncs
Supported kfuncs
- __bpf_trap
- bpf_arena_alloc_pages
- bpf_arena_free_pages
- bpf_arena_reserve_pages
- bpf_cast_to_kern_ctx
- bpf_cgroup_acquire
- bpf_cgroup_ancestor
- bpf_cgroup_from_id
- bpf_cgroup_read_xattr
- bpf_cgroup_release
- bpf_copy_from_user_dynptr
- bpf_copy_from_user_str
- bpf_copy_from_user_str_dynptr
- bpf_copy_from_user_task_dynptr
- bpf_copy_from_user_task_str
- bpf_copy_from_user_task_str_dynptr
- bpf_crypto_decrypt
- bpf_crypto_encrypt
- bpf_ct_change_status
- bpf_ct_change_timeout
- bpf_ct_insert_entry
- bpf_ct_release
- bpf_ct_set_nat_info
- bpf_ct_set_status
- bpf_ct_set_timeout
- bpf_dynptr_adjust
- bpf_dynptr_clone
- bpf_dynptr_copy
- bpf_dynptr_from_skb
- bpf_dynptr_is_null
- bpf_dynptr_is_rdonly
- bpf_dynptr_memset
- bpf_dynptr_size
- bpf_dynptr_slice
- bpf_dynptr_slice_rdwr
- bpf_get_kmem_cache
- bpf_iter_bits_destroy
- bpf_iter_bits_new
- bpf_iter_bits_next
- bpf_iter_css_destroy
- bpf_iter_css_new
- bpf_iter_css_next
- bpf_iter_css_task_destroy
- bpf_iter_css_task_new
- bpf_iter_css_task_next
- bpf_iter_dmabuf_destroy
- bpf_iter_dmabuf_new
- bpf_iter_dmabuf_next
- bpf_iter_kmem_cache_destroy
- bpf_iter_kmem_cache_new
- bpf_iter_kmem_cache_next
- bpf_iter_num_destroy
- bpf_iter_num_new
- bpf_iter_num_next
- bpf_iter_task_destroy
- bpf_iter_task_new
- bpf_iter_task_next
- bpf_iter_task_vma_destroy
- bpf_iter_task_vma_new
- bpf_iter_task_vma_next
- bpf_list_back
- bpf_list_front
- bpf_list_pop_back
- bpf_list_pop_front
- bpf_list_push_back_impl
- bpf_list_push_front_impl
- bpf_local_irq_restore
- bpf_local_irq_save
- bpf_map_sum_elem_count
- bpf_obj_drop_impl
- bpf_obj_new_impl
- bpf_percpu_obj_drop_impl
- bpf_percpu_obj_new_impl
- bpf_preempt_disable
- bpf_preempt_enable
- bpf_probe_read_kernel_dynptr
- bpf_probe_read_kernel_str_dynptr
- bpf_probe_read_user_dynptr
- bpf_probe_read_user_str_dynptr
- bpf_rbtree_add_impl
- bpf_rbtree_first
- bpf_rbtree_left
- bpf_rbtree_remove
- bpf_rbtree_right
- bpf_rbtree_root
- bpf_rcu_read_lock
- bpf_rcu_read_unlock
- bpf_rdonly_cast
- bpf_refcount_acquire_impl
- bpf_res_spin_lock
- bpf_res_spin_lock_irqsave
- bpf_res_spin_unlock
- bpf_res_spin_unlock_irqrestore
- bpf_send_signal_task
- bpf_sk_assign_tcp_reqsk
- bpf_skb_ct_alloc
- bpf_skb_ct_lookup
- bpf_skb_get_fou_encap
- bpf_skb_get_xfrm_info
- bpf_skb_set_fou_encap
- bpf_skb_set_xfrm_info
- bpf_strchr
- bpf_strchrnul
- bpf_strcmp
- bpf_strcspn
- bpf_stream_vprintk
- bpf_strlen
- bpf_strnchr
- bpf_strnlen
- bpf_strnstr
- bpf_strrchr
- bpf_strspn
- bpf_strstr
- bpf_task_acquire
- bpf_task_from_pid
- bpf_task_from_vpid
- bpf_task_get_cgroup1
- bpf_task_release
- bpf_task_under_cgroup
- bpf_throw
- bpf_wq_init
- bpf_wq_set_callback_impl
- bpf_wq_start
- bpf_xdp_ct_alloc
- bpf_xdp_ct_lookup
- crash_kexec