
Struct ops sched_ext_ops

v6.12

The sched_ext (scheduler extension) ops can be used to implement a custom scheduler in BPF.

Usage

The Linux kernel provides built-in scheduler implementations like CFS and EEVDF. These schedulers are designed to provide a good balance between fairness and performance for most workloads. However, there are use cases where a custom scheduler is needed to meet specific requirements. The BPF scheduler extension provides a way to implement a custom scheduler in BPF.

See also kernel docs

Fields and ops

A BPF scheduler can implement an arbitrary scheduling policy by implementing and loading operations in this table. Note that a userland scheduling policy can also be implemented using the BPF scheduler as a shim layer.

Note

The following definition has been modified from the one found in the kernel for the sake of readability. This does not impact the definition for the purposes of implementing a BPF program.

struct sched_ext_ops {
    char name[SCX_OPS_NAME_LEN];
    u32  dispatch_max_batch;
    u64  flags;
    u32  timeout_ms;
    u32  exit_dump_len;
    u64  hotplug_seq;

    s32  (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags);
    void (*enqueue)(struct task_struct *p, u64 enq_flags);
    void (*dequeue)(struct task_struct *p, u64 deq_flags);
    void (*dispatch)(s32 cpu, struct task_struct *prev);
    void (*tick)(struct task_struct *p);
    void (*runnable)(struct task_struct *p, u64 enq_flags);
    void (*running)(struct task_struct *p);
    void (*stopping)(struct task_struct *p, bool runnable);
    void (*quiescent)(struct task_struct *p, u64 deq_flags);
    bool (*yield)(struct task_struct *from, struct task_struct *to);
    bool (*core_sched_before)(struct task_struct *a, struct task_struct *b);
    void (*set_weight)(struct task_struct *p, u32 weight);
    void (*set_cpumask)(struct task_struct *p, const struct cpumask *cpumask);
    void (*update_idle)(s32 cpu, bool idle);
    void (*cpu_acquire)(s32 cpu, struct scx_cpu_acquire_args *args);
    void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args);

    s32  (*init_task)(struct task_struct *p, struct scx_init_task_args *args);
    void (*exit_task)(struct task_struct *p, struct scx_exit_task_args *args);

    void (*enable)(struct task_struct *p);
    void (*disable)(struct task_struct *p);

    void (*dump)(struct scx_dump_ctx *ctx);
    void (*dump_cpu)(struct scx_dump_ctx *ctx, s32 cpu, bool idle);
    void (*dump_task)(struct scx_dump_ctx *ctx, struct task_struct *p);

#ifdef CONFIG_EXT_GROUP_SCHED
    s32  (*cgroup_init)(struct cgroup *cgrp, struct scx_cgroup_init_args *args);
    void (*cgroup_exit)(struct cgroup *cgrp);
    s32  (*cgroup_prep_move)(struct task_struct *p, struct cgroup *from, struct cgroup *to);
    void (*cgroup_move)(struct task_struct *p, struct cgroup *from, struct cgroup *to);
    void (*cgroup_cancel_move)(struct task_struct *p, struct cgroup *from, struct cgroup *to);
    void (*cgroup_set_weight)(struct cgroup *cgrp, u32 weight);
#endif /* CONFIG_EXT_GROUP_SCHED */

    void (*cpu_online)(s32 cpu);
    void (*cpu_offline)(s32 cpu);

    s32  (*init)(void);
    void (*exit)(struct scx_exit_info *info);
};

name

v6.12

name[SCX_OPS_NAME_LEN] (SCX_OPS_NAME_LEN = 128)

The BPF scheduler's name, for observability purposes.

Must be a non-zero valid BPF object name including only isalnum(), _ and . chars. Shows up in kernel.sched_ext_ops sysctl while the BPF scheduler is enabled.

dispatch_max_batch

v6.12

u32 dispatch_max_batch

Maximum number of tasks that the dispatch operation can dispatch in a single invocation.

flags

v6.12

u64 flags

The flags field is a bitfield that can be set to control the behavior of the scheduler. enum scx_ops_flags defines the flags that can be set in this field.

timeout_ms

v6.12

u32 timeout_ms

The maximum amount of time, in milliseconds, that a runnable task should be able to wait before being scheduled. May not exceed 30 seconds, which is also the default.

exit_dump_len

v6.12

u32 exit_dump_len

scx_exit_info.dump buffer length. If 0, the default value of 32768 is used.

hotplug_seq

v6.12

u64 hotplug_seq

A sequence number that may be set by the scheduler to detect when a hot-plug event has occurred during the loading process. If 0, no detection occurs. Otherwise, the scheduler will fail to load if the sequence number does not match scx_hotplug_seq on the enable path.

select_cpu

v6.12

s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags);

Pick the target CPU for a task which is being woken up.

Decision made here isn't final. p may be moved to any CPU while it is getting dispatched for execution later. However, as p is not on the rq at this point, getting the eventual execution CPU right here saves a small bit of overhead down the line.

If an idle CPU is returned, the CPU is kicked and will try to dispatch. While an explicit custom mechanism can be added, select_cpu serves as the default way to wake up idle CPUs.

p may be inserted into a DSQ directly by calling scx_bpf_dsq_insert. If so, the enqueue will be skipped. Directly inserting into SCX_DSQ_LOCAL will put p in the local DSQ of the CPU returned by this operation.

Note

select_cpu is never called for tasks that can only run on a single CPU or tasks with migration disabled, as they don't have the option to select a different CPU. See select_task_rq for details.

Parameters

p: task being woken up

prev_cpu: the cpu p was on before sleeping

wake_flags: SCX_WAKE_*, possible values are:

  • SCX_WAKE_FORK (0x04) - Wakeup after fork
  • SCX_WAKE_TTWU (0x08) - Wakeup via the try-to-wake-up path
  • SCX_WAKE_SYNC (0x10) - Synchronous wakeup: the waker goes to sleep after the wakeup

Returns

The ID of the CPU that p should run on.
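As an illustration, a minimal select_cpu can defer to the built-in idle-CPU selection logic and dispatch directly when an idle CPU is found. This is a sketch, not a complete scheduler; it assumes the BPF_STRUCT_OPS macro and the scx/common.bpf.h header from the scx tooling, and uses the kfunc names as they appear on this page:

```c
#include <scx/common.bpf.h>

s32 BPF_STRUCT_OPS(sketch_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	/* Let the default logic pick a CPU, preferring an idle one. */
	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle)
		/* Idle CPU found: insert into its local DSQ directly,
		 * which skips ops.enqueue() for this wakeup. */
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}
```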

enqueue

v6.12

void (*enqueue)(struct task_struct *p, u64 enq_flags);

Enqueue a task on the BPF scheduler

p is ready to run. Insert directly into a DSQ by calling scx_bpf_dsq_insert or enqueue on the BPF scheduler. If not directly inserted, the bpf scheduler owns p and if it fails to dispatch p, the task will stall.

If p was inserted into a DSQ from select_cpu, this callback is skipped.

Parameters

p: task being enqueued

enq_flags: Enqueue flags, possible values defined by enum scx_enq_flags
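When select_cpu did not insert the task directly, a minimal enqueue can fall back to the global DSQ, which CPUs consume automatically when their local DSQs run empty. A sketch, assuming the scx tooling headers:

```c
#include <scx/common.bpf.h>

void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Global FIFO: every CPU pulls from SCX_DSQ_GLOBAL when its
	 * local DSQ is empty, so tasks inserted here cannot be lost. */
	scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}
```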

dequeue

v6.12

void (*dequeue)(struct task_struct *p, u64 deq_flags);

Remove a task from the BPF scheduler

Remove p from the BPF scheduler. This is usually called to isolate the task while updating its scheduling properties (e.g. priority).

The ext core keeps track of whether the BPF side owns a given task or not and can gracefully ignore spurious dispatches from BPF side, which makes it safe to not implement this method. However, depending on the scheduling logic, this can lead to confusing behaviors - e.g. scheduling position not being updated across a priority change.

Parameters

p: task being dequeued

deq_flags: Dequeue flags, possible values defined by enum scx_deq_flags

dispatch

v6.12

void (*dispatch)(s32 cpu, struct task_struct *prev);

Dispatch tasks from the BPF scheduler and/or user DSQs

Called when a CPU's local DSQ is empty. The operation should dispatch one or more tasks from the BPF scheduler into the DSQs using scx_bpf_dsq_insert and/or move from user DSQs into the local DSQ using scx_bpf_dsq_move_to_local.

The maximum number of times scx_bpf_dsq_insert can be called without an intervening scx_bpf_dsq_move_to_local is specified by ops.dispatch_max_batch. See the comments on top of the two functions for more details.

When not NULL, prev is an SCX task with its slice depleted. If prev is still runnable as indicated by set SCX_TASK_QUEUED in prev->scx.flags, it is not enqueued yet and will be enqueued after dispatch returns. To keep executing prev, return without dispatching or moving any tasks. Also see SCX_OPS_ENQ_LAST.

Parameters

cpu: CPU to dispatch tasks for

prev: previous task being switched out
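A dispatch implementation often just refills the local DSQ from a shared queue. In this sketch, SHARED_DSQ is a hypothetical ops-created DSQ ID that the scheduler would have created in init via scx_bpf_create_dsq; the header and macro come from the scx tooling:

```c
#include <scx/common.bpf.h>

#define SHARED_DSQ 0 /* hypothetical DSQ ID, created in ops.init */

void BPF_STRUCT_OPS(sketch_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Move one task from the shared queue into this CPU's local DSQ.
	 * If nothing is moved and prev is still runnable, prev keeps
	 * running (see the note about SCX_TASK_QUEUED above). */
	scx_bpf_dsq_move_to_local(SHARED_DSQ);
}
```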

tick

v6.12

void (*tick)(struct task_struct *p);

Periodic tick. This operation is called every 1/HZ seconds on CPUs which are executing an SCX task. Setting p->scx.slice to 0 will trigger an immediate dispatch cycle on the CPU.

Parameters

p: task running currently

runnable

v6.12

void (*runnable)(struct task_struct *p, u64 enq_flags);

A task is becoming runnable on its associated CPU

This and the following three functions can be used to track a task's execution state transitions. A task becomes runnable on a CPU, and then goes through one or more running and stopping pairs as it runs on the CPU, and eventually becomes quiescent when it's done running on the CPU.

p is becoming runnable on the CPU because it's

  • waking up (SCX_ENQ_WAKEUP)
  • being moved from another CPU
  • being restored after temporarily taken off the queue for an attribute change.

This and enqueue are related but not coupled. This operation notifies p's state transition and may not be followed by enqueue e.g. when p is being dispatched to a remote CPU, or when p is being enqueued on a CPU experiencing a hotplug event. Likewise, a task may be enqueue'd without being preceded by this operation e.g. after exhausting its slice.

Parameters

p: task becoming runnable

enq_flags: Bitfield of flags, valid values defined in enum scx_enq_flags

running

v6.12

void (*running)(struct task_struct *p);

A task is starting to run on its associated CPU. See runnable for explanation on the task state notifiers.

Parameters

p: task starting to run

stopping

v6.12

void (*stopping)(struct task_struct *p, bool runnable);

A task is stopping execution. See runnable for explanation on the task state notifiers. If !runnable, quiescent will be invoked after this operation returns.

Parameters

p: task that is stopping execution

runnable: is task p still runnable?

quiescent

v6.12

void (*quiescent)(struct task_struct *p, u64 deq_flags);

A task is becoming not runnable on its associated CPU. See runnable for explanation on the task state notifiers.

p is becoming quiescent on the CPU because it's

  • sleeping (SCX_DEQ_SLEEP)
  • being moved to another CPU
  • being temporarily taken off the queue for an attribute change (SCX_DEQ_SAVE).

This and dequeue are related but not coupled. This operation notifies p's state transition and may not be preceded by dequeue, e.g. when p is being dispatched to a remote CPU.

Parameters

p: task becoming not runnable

deq_flags: Bitfield of flags, valid values defined in enum scx_deq_flags

yield

v6.12

bool (*yield)(struct task_struct *from, struct task_struct *to);

Yield CPU. If to is NULL, from is yielding the CPU to other runnable tasks. The BPF scheduler should ensure that other available tasks are dispatched before the yielding task. Return value is ignored in this case.

If to is not-NULL, from wants to yield the CPU to to.

Parameters

from: yielding task

to: optional yield target task

Returns

If the bpf scheduler can implement the request, return true; otherwise, false.

core_sched_before

v6.12

bool (*core_sched_before)(struct task_struct *a, struct task_struct *b);

Task ordering for core-sched. Used by core-sched to determine the ordering between two tasks. See Documentation/admin-guide/hw-vuln/core-scheduling.rst for details on core-sched.

Both a and b are runnable and may or may not currently be queued on the BPF scheduler.

If not specified, the default is ordering them according to when they became runnable.

Parameters

a: task A

b: task B

Returns

Should return true if a should run before b. false if there's no required ordering or b should run before a.

set_weight

v6.12

void (*set_weight)(struct task_struct *p, u32 weight);

Set task weight. Update p's weight to weight.

Parameters

p: task to set weight for

weight: new weight [1..10000]

set_cpumask

v6.12

void (*set_cpumask)(struct task_struct *p, const struct cpumask *cpumask);

Set CPU affinity. Update p's CPU affinity to cpumask.

Parameters

p: task to set CPU affinity for

cpumask: cpumask of cpus that p can run on

update_idle

v6.12

void (*update_idle)(s32 cpu, bool idle);

Update the idle state of a CPU. This operation is called when a runqueue's CPU enters or leaves the idle state. By default, implementing this operation disables the built-in idle CPU tracking, and the helpers that rely on it become unavailable.

The user must also implement select_cpu, as the default implementation relies on scx_bpf_select_cpu_dfl.

Specify the SCX_OPS_KEEP_BUILTIN_IDLE flag to keep the built-in idle tracking.

Parameters

cpu: CPU to update the idle state for

idle: whether entering or exiting the idle state

cpu_acquire

v6.12

void (*cpu_acquire)(s32 cpu, struct scx_cpu_acquire_args *args);

A CPU is becoming available to the BPF scheduler. A CPU that was previously released from the BPF scheduler is now once again under its control.

Parameters

cpu: The CPU being acquired by the BPF scheduler.

args: Acquire arguments, see struct scx_cpu_acquire_args.

cpu_release

v6.12

void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args);

A CPU is taken away from the BPF scheduler. The specified CPU is no longer under the control of the BPF scheduler. This could be because it was preempted by a higher priority sched_class, though there may be other reasons as well. The caller should consult args->reason to determine the cause.

Parameters

cpu: The CPU being released by the BPF scheduler.

args: Release arguments, see struct scx_cpu_release_args.

init_task

v6.12

s32 (*init_task)(struct task_struct *p, struct scx_init_task_args *args);

Initialize a task to run in a BPF scheduler. Either we're loading a BPF scheduler or a new task is being forked. Initialize p for BPF scheduling. This operation may block and can be used for allocations, and is called exactly once for a task.

Note

The BPF program assigned to this field is allowed to be sleepable.

Parameters

p: task to initialize for BPF scheduling

args: init arguments, see struct scx_init_task_args

Returns

0 for success, -errno for failure. An error return while loading will abort loading of the BPF scheduler. During a fork, it will abort that specific fork.

exit_task

v6.12

void (*exit_task)(struct task_struct *p, struct scx_exit_task_args *args);

Exit a previously-running task from the system. p is exiting or the BPF scheduler is being unloaded. Perform any necessary cleanup for p.

Parameters

p: task to exit

args: exit arguments, see struct scx_exit_task_args

enable

v6.12

void (*enable)(struct task_struct *p);

Enable BPF scheduling for a task. Enable p for BPF scheduling. enable is called on p any time it enters SCX, and is always paired with a matching disable.

Parameters

p: task to enable BPF scheduling for

disable

v6.12

void (*disable)(struct task_struct *p);

Disable BPF scheduling for a task. p is exiting, leaving SCX or the BPF scheduler is being unloaded. Disable BPF scheduling for p. A disable call is always matched with a prior enable call.

Parameters

p: task to disable BPF scheduling for

dump

v6.12

void (*dump)(struct scx_dump_ctx *ctx);

Dump BPF scheduler state on error. Use scx_bpf_dump to generate BPF scheduler specific debug dump.

Parameters

ctx: debug dump context, see struct scx_dump_ctx

dump_cpu

v6.12

void (*dump_cpu)(struct scx_dump_ctx *ctx, s32 cpu, bool idle);

Dump BPF scheduler state for a CPU on error. Use scx_bpf_dump to generate BPF scheduler specific debug dump for cpu. If idle is true and this operation doesn't produce any output, cpu is skipped for dump.

Parameters

ctx: debug dump context, see struct scx_dump_ctx

cpu: CPU to generate debug dump for

idle: cpu is currently idle without any runnable tasks

dump_task

v6.12

void (*dump_task)(struct scx_dump_ctx *ctx, struct task_struct *p);

Dump BPF scheduler state for a runnable task on error

Use scx_bpf_dump to generate BPF scheduler specific debug dump for p.

Parameters

ctx: debug dump context, see struct scx_dump_ctx

p: runnable task to generate debug dump for

cgroup_init

v6.12

s32 (*cgroup_init)(struct cgroup *cgrp, struct scx_cgroup_init_args *args);

Initialize a cGroup. Either the BPF scheduler is being loaded or cgrp is being created; initialize cgrp for sched_ext. This operation may block.

Note

This field is only available on kernels compiled with the CONFIG_EXT_GROUP_SCHED Kconfig enabled.

Note

The BPF program assigned to this field is allowed to be sleepable.

Parameters

cgrp: cgroup being initialized

args: init arguments, see struct scx_cgroup_init_args

Returns

0 for success, -errno for failure. An error return while loading will abort loading of the BPF scheduler. During cgroup creation, it will abort the specific cgroup creation.

cgroup_exit

v6.12

void (*cgroup_exit)(struct cgroup *cgrp);

Exit a cGroup. Either the BPF scheduler is being unloaded or cgrp is being destroyed; exit cgrp for sched_ext. This operation may block.

Note

This field is only available on kernels compiled with the CONFIG_EXT_GROUP_SCHED Kconfig enabled.

Note

The BPF program assigned to this field is allowed to be sleepable.

Parameters

cgrp: cgroup being exited

cgroup_prep_move

v6.12

s32 (*cgroup_prep_move)(struct task_struct *p, struct cgroup *from, struct cgroup *to);

Prepare a task to be moved to a different cGroup. Prepare p for move from cGroup from to to. This operation may block and can be used for allocations.

Note

This field is only available on kernels compiled with the CONFIG_EXT_GROUP_SCHED Kconfig enabled.

Note

The BPF program assigned to this field is allowed to be sleepable.

Parameters

p: task being moved

from: cgroup p is being moved from

to: cgroup p is being moved to

Returns

0 for success, -errno for failure. An error return aborts the migration.

cgroup_move

v6.12

void (*cgroup_move)(struct task_struct *p, struct cgroup *from, struct cgroup *to);

Commit cGroup move. p is dequeued during this operation.

Note

This field is only available on kernels compiled with the CONFIG_EXT_GROUP_SCHED Kconfig enabled.

Parameters

p: task being moved

from: cgroup p is being moved from

to: cgroup p is being moved to

cgroup_cancel_move

v6.12

void (*cgroup_cancel_move)(struct task_struct *p, struct cgroup *from, struct cgroup *to);

Cancel cGroup move. p was cgroup_prep_move'd but failed before reaching cgroup_move. Undo the preparation.

Note

This field is only available on kernels compiled with the CONFIG_EXT_GROUP_SCHED Kconfig enabled.

Parameters

p: task whose cgroup move is being canceled

from: cgroup p was being moved from

to: cgroup p was being moved to

cgroup_set_weight

v6.12

void (*cgroup_set_weight)(struct cgroup *cgrp, u32 weight);

A cGroup's weight is being changed. Update cgrp's weight to weight.

Note

This field is only available on kernels compiled with the CONFIG_EXT_GROUP_SCHED Kconfig enabled.

Parameters

cgrp: cgroup whose weight is being updated

weight: new weight [1..10000]

cpu_online

v6.12

void (*cpu_online)(s32 cpu);

A CPU came online. cpu will not call enqueue or dispatch, nor run tasks associated with other CPUs beforehand.

Note

The BPF program assigned to this field is allowed to be sleepable.

Parameters

cpu: CPU which just came up

cpu_offline

v6.12

void (*cpu_offline)(s32 cpu);

A CPU is going offline. cpu will not call enqueue or dispatch, nor run tasks associated with other CPUs afterwards.

Note

The BPF program assigned to this field is allowed to be sleepable.

Parameters

cpu: CPU which is going offline

init

v6.12

s32 (*init)(void);

Initialize the BPF scheduler.

Note

The BPF program assigned to this field is allowed to be sleepable.

exit

v6.12

void (*exit)(struct scx_exit_info *info);

Clean up after the BPF scheduler. exit is also called on init failure, which is a bit unusual. This is to allow rich reporting through info on how init failed.

Note

The BPF program assigned to this field is allowed to be sleepable.

Parameters

info: Exit info, see struct scx_exit_info

Types

enum scx_ops_flags

This enum defines all of the flags that can be set as bitfield in flags.

enum scx_ops_flags {
    SCX_OPS_KEEP_BUILTIN_IDLE   = 1LLU << 0,
    SCX_OPS_ENQ_LAST            = 1LLU << 1,
    SCX_OPS_ENQ_EXITING         = 1LLU << 2,
    SCX_OPS_SWITCH_PARTIAL      = 1LLU << 3,
    SCX_OPS_HAS_CGROUP_WEIGHT   = 1LLU << 16,
};

SCX_OPS_KEEP_BUILTIN_IDLE

v6.12

Keep built-in idle tracking even if update_idle is implemented.

SCX_OPS_ENQ_LAST

v6.12

By default, if there are no other tasks to run on the CPU, ext core keeps running the current task even after its slice expires. If this flag is specified, such tasks are passed to ops.enqueue with SCX_ENQ_LAST. See the comment above SCX_ENQ_LAST for more info.

SCX_OPS_ENQ_EXITING

v6.12

An exiting task may schedule after PF_EXITING is set. In such cases, bpf_task_from_pid may not be able to find the task and if the BPF scheduler depends on PID lookup for dispatching, the task will be lost leading to various issues including RCU grace period stalls.

To mask this problem, by default, unhashed tasks are automatically dispatched to the local DSQ on enqueue. If the BPF scheduler doesn't depend on PID lookups and wants to handle these tasks directly, this flag can be used.

SCX_OPS_SWITCH_PARTIAL

v6.12

If set, only tasks with policy set to SCHED_EXT are attached to sched_ext. If clear, SCHED_NORMAL tasks are also included.

SCX_OPS_HAS_CGROUP_WEIGHT

v6.12

CPU cGroup support flag: indicates that the BPF scheduler supports the cgroup cpu.weight interface.

enum scx_enq_flags

enum scx_enq_flags {
    SCX_ENQ_WAKEUP          = 1LLU << 0,
    SCX_ENQ_HEAD            = 1LLU << 4,
    SCX_ENQ_CPU_SELECTED    = 1LLU << 10,
    SCX_ENQ_PREEMPT         = 1LLU << 32,
    SCX_ENQ_REENQ           = 1LLU << 40,
    SCX_ENQ_LAST            = 1LLU << 41,
    SCX_ENQ_CLEAR_OPSS      = 1LLU << 56,
    SCX_ENQ_DSQ_PRIQ        = 1LLU << 57,
};

SCX_ENQ_WAKEUP

v6.12

Set when the task is being enqueued because it is waking up.

SCX_ENQ_HEAD

v6.12

Place the task at the head of the runqueue. If not set, the task is placed at the tail.

SCX_ENQ_CPU_SELECTED

v6.12

This flag is set by the scheduler core internals in select_task_rq, if the scheduler called the select_task_rq callback of the current class. This callback for SCX translates into select_cpu.

select_task_rq/select_cpu is not called when a task can only run on 1 CPU or CPU migration is disabled, since in those cases there is no decision to be made.

SCX_ENQ_PREEMPT

v6.12

Set this flag to trigger preemption when calling scx_bpf_dsq_insert with a local DSQ as the target. The slice of the current task is cleared to zero and the CPU is kicked into the scheduling path. Implies SCX_ENQ_HEAD.

SCX_ENQ_REENQ

v6.12

The task being enqueued was previously enqueued on the current CPU's SCX_DSQ_LOCAL, but was removed from it in a call to the bpf_scx_reenqueue_local kfunc. If bpf_scx_reenqueue_local was invoked in a cpu_release callback, and the task is again dispatched back to SCX_DSQ_LOCAL by this current enqueue, the task will not be scheduled on the CPU until at least the next invocation of the cpu_acquire callback.

SCX_ENQ_LAST

v6.12

The task being enqueued is the only task available for the CPU. By default, ext core keeps executing such tasks but when SCX_OPS_ENQ_LAST is specified, they're enqueue'd with the SCX_ENQ_LAST flag set.

The BPF scheduler is responsible for triggering a follow-up scheduling event. Otherwise, execution may stall.

SCX_ENQ_CLEAR_OPSS

v6.12

Set when control over a task is being handed back from the BPF scheduler to the SCX core; p->scx.ops_state is reset to SCX_OPSS_NONE.

SCX_ENQ_DSQ_PRIQ

v6.12

This flag is set when a task is inserted into a DSQ with the scx_bpf_dsq_insert_vtime kfunc. It indicates that the ordering of the task in the DSQ is based on the virtual time of the task, not insertion order.

enum scx_deq_flags

enum scx_deq_flags {
    SCX_DEQ_SLEEP           = 1LLU << 0,
    SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32,
};

SCX_DEQ_SLEEP

v6.12

Task is no longer runnable.

SCX_DEQ_CORE_SCHED_EXEC

v6.12

The generic core-sched layer decided to execute the task even though it hasn't been dispatched yet. Dequeue from the BPF side.

enum scx_dsq_id_flags

DSQ (dispatch queue) IDs are 64-bit values of the format:

Bits: [63] [62 ..  0]
      [ B] [   ID   ]
  • B: 1 for IDs for built-in DSQs, 0 for ops-created user DSQs
  • ID: 63 bit ID

Built-in IDs:

Bits: [63] [62] [61..32] [31 ..  0]
      [ 1] [ L] [   R  ] [    V   ]
  • 1: 1 for built-in DSQs.
  • L: 1 for LOCAL_ON DSQ IDs, 0 for others
  • R: reserved
  • V: For LOCAL_ON DSQ IDs, a CPU number. For others, a pre-defined value.
enum scx_dsq_id_flags {
    SCX_DSQ_FLAG_BUILTIN    = 1LLU << 63,
    SCX_DSQ_FLAG_LOCAL_ON   = 1LLU << 62,

    SCX_DSQ_INVALID         = SCX_DSQ_FLAG_BUILTIN | 0,
    SCX_DSQ_GLOBAL          = SCX_DSQ_FLAG_BUILTIN | 1,
    SCX_DSQ_LOCAL           = SCX_DSQ_FLAG_BUILTIN | 2,
    SCX_DSQ_LOCAL_ON        = SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON,
};

SCX_DSQ_FLAG_BUILTIN

v6.12

This flag is part of a DSQ ID. If set, it indicates that the DSQ is a built-in DSQ; if clear, the DSQ was created by the BPF scheduler.

SCX_DSQ_FLAG_LOCAL_ON

v6.12

This flag is part of a DSQ ID. If set, it indicates that the DSQ is a local DSQ and that the CPU number is encoded in the ID.

SCX_DSQ_GLOBAL

v6.12

Combined flags. The DSQ is builtin and global.

SCX_DSQ_LOCAL

v6.12

Combined flags. The DSQ is builtin and local to the current CPU.

SCX_DSQ_LOCAL_ON

v6.12

Combined flags. The DSQ is builtin and local to a specific CPU (encoded in the ID).

enum scx_ent_flags

enum scx_ent_flags {
    SCX_TASK_QUEUED             = 1 << 0,
    SCX_TASK_RESET_RUNNABLE_AT  = 1 << 2,
    SCX_TASK_DEQD_FOR_SLEEP     = 1 << 3,
    SCX_TASK_CURSOR             = 1 << 31,
};

SCX_TASK_QUEUED

v6.12

On ext runqueue

SCX_TASK_RESET_RUNNABLE_AT

v6.12

runnable_at should be reset

SCX_TASK_DEQD_FOR_SLEEP

v6.12

Last dequeue was for SLEEP.

SCX_TASK_CURSOR

v6.12

iteration cursor, not a task

enum scx_task_state

enum scx_task_state {
    SCX_TASK_NONE,
    SCX_TASK_INIT,
    SCX_TASK_READY,
    SCX_TASK_ENABLED,
};

SCX_TASK_NONE

v6.12

init_task not called yet

SCX_TASK_INIT

v6.12

init_task succeeded, but task can be cancelled

SCX_TASK_READY

v6.12

fully initialized, but not in sched_ext

SCX_TASK_ENABLED

v6.12

fully initialized and in sched_ext

enum scx_kf_mask

Mask bits for sched_ext_entity.kf_mask. Not all kfuncs can be called from everywhere and the following bits track which kfunc sets are currently allowed for current. This simple per-task tracking works because SCX ops nest in a limited way. BPF will likely implement a way to allow and disallow kfuncs depending on the calling context which will replace this manual mechanism. See scx_kf_allow().

enum scx_kf_mask {
    SCX_KF_UNLOCKED     = 0,
    SCX_KF_CPU_RELEASE  = 1 << 0,
    SCX_KF_DISPATCH     = 1 << 1,
    SCX_KF_ENQUEUE      = 1 << 2,
    SCX_KF_SELECT_CPU   = 1 << 3,
    SCX_KF_REST         = 1 << 4,
};

SCX_KF_UNLOCKED

v6.12

sleepable and not rq locked

SCX_KF_CPU_RELEASE

v6.12

This flag is set when kfuncs are enabled that may only be called from the cpu_release callback. SCX_KF_ENQUEUE and SCX_KF_DISPATCH may be nested inside SCX_KF_CPU_RELEASE.

SCX_KF_DISPATCH

v6.12

This flag is set when kfuncs are enabled that may only be called from the dispatch callback. SCX_KF_REST may be nested inside SCX_KF_DISPATCH.

SCX_KF_ENQUEUE

v6.12

This flag is set when kfuncs are enabled that may only be called from the enqueue callback. SCX_KF_SELECT_CPU may be nested inside SCX_KF_ENQUEUE.

SCX_KF_SELECT_CPU

v6.12

This flag is set when kfuncs are enabled that may only be called from the select_cpu callback.

SCX_KF_REST

v6.12

This flag is set when the rest of the kfuncs (kfuncs not under any of the other flags) are enabled.

enum scx_ops_state

enum scx_ops_state {
    SCX_OPSS_NONE,          // (1)!
    SCX_OPSS_QUEUEING,      // (2)!
    SCX_OPSS_QUEUED,        // (3)!
    SCX_OPSS_DISPATCHING,   // (4)!
};
  1. v6.12 owned by the SCX core
  2. v6.12 in transit to the BPF scheduler
  3. v6.12 owned by the BPF scheduler
  4. v6.12 in transit back to the SCX core

enum scx_ent_dsq_flags

enum scx_ent_dsq_flags {
    SCX_TASK_DSQ_ON_PRIQ = 1 << 0, // (1)!
};
  1. v6.12 task is queued on the priority queue of a DSQ

enum scx_exit_kind

enum scx_exit_kind {
    SCX_EXIT_NONE,
    SCX_EXIT_DONE,          // (1)!

    SCX_EXIT_UNREG = 64,    // (2)!
    SCX_EXIT_UNREG_BPF,     // (3)!
    SCX_EXIT_UNREG_KERN,    // (4)!
    SCX_EXIT_SYSRQ,         // (5)!

    SCX_EXIT_ERROR = 1024,  // (6)!
    SCX_EXIT_ERROR_BPF,     // (7)!
    SCX_EXIT_ERROR_STALL,   // (8)!
};
  1. v6.12
  2. v6.12 user-space initiated unregistration
  3. v6.12 BPF-initiated unregistration
  4. v6.12 kernel-initiated unregistration
  5. v6.12 requested by 'S' sysrq
  6. v6.12 runtime error, error message contains details
  7. v6.12 ERROR but triggered through scx_bpf_error()
  8. v6.12 watchdog detected stalled runnable tasks

enum scx_cpu_preempt_reason

enum scx_cpu_preempt_reason {
    SCX_CPU_PREEMPT_RT,      // (1)!
    SCX_CPU_PREEMPT_DL,      // (2)!
    SCX_CPU_PREEMPT_STOP,    // (3)!
    SCX_CPU_PREEMPT_UNKNOWN, // (4)!
};
  1. next task is being scheduled by &rt_sched_class
  2. next task is being scheduled by &dl_sched_class
  3. next task is being scheduled by &stop_sched_class
  4. unknown reason for SCX being preempted

struct task_struct

This struct is the main data structure for a task in the Linux kernel. It contains all the information about a task, including its state, scheduling information, and more. Due to its size, only the fields relevant to SCX are documented here. The below definition is a simplified version of the actual struct. For the full definition see the Linux source code.

struct task_struct {
#ifdef CONFIG_SCHED_CLASS_EXT
    struct sched_ext_entity scx;
#endif
};

scx

v6.12

struct sched_ext_entity scx

SCX specific information for the task.

struct cgroup

This is the representation of a cGroup in the Linux kernel. It is part of a tree of cGroups, in the hierarchy defined by the user. BPF schedulers can walk pointers provided in this struct to access cGroup related information.

As this struct is not directly related to SCX, it is not documented here. For the full definition see the Linux source code.

struct sched_ext_entity

This struct is embedded in task_struct and contains all fields necessary for a task to be scheduled by SCX.

The fields on this structure are read-only unless otherwise noted.

struct sched_ext_entity {
    struct scx_dispatch_q      *dsq;
    struct scx_dsq_list_node    dsq_list; 
    struct rb_node              dsq_priq; 

    u32 dsq_seq;
    u32 dsq_flags;
    u32 flags; 
    u32 weight;
    s32 sticky_cpu;
    s32 holding_cpu;
    u32 kf_mask; 

    struct task_struct  *kf_tasks[2];
    atomic_long_t        ops_state;
    struct list_head     runnable_node;
    unsigned long        runnable_at;

#ifdef CONFIG_SCHED_CORE
    u64 core_sched_at;
#endif

    u64  ddsp_dsq_id;
    u64  ddsp_enq_flags;
    u64  slice;
    u64  dsq_vtime;
    bool disallow;

#ifdef CONFIG_EXT_GROUP_SCHED
    struct cgroup       *cgrp_moving_from;
#endif
    struct list_head    tasks_node;
};

dsq

v6.12

struct scx_dispatch_q *dsq

The DSQ the task is currently on, or NULL if the task is not on any DSQ.

dsq_list

v6.12

struct scx_dsq_list_node dsq_list

The linked list node that links the task into the FIFO DSQ it is on. The list is kept in dispatch order.

dsq_priq

v6.12

struct rb_node dsq_priq

The red-black tree node that links the task into the vtime DSQ it is on. The priority queue is ordered by p->scx.dsq_vtime.

dsq_seq

v6.12

u32 dsq_seq

This is the DSQ sequence number the task is on.

dsq_flags

v6.12

u32 dsq_flags

Flags related to the DSQ the task is on. See enum scx_ent_dsq_flags for possible values.

flags

v6.12

u32 flags

This field contains both flags defined in enum scx_ent_flags and a task state defined in enum scx_task_state. The value of the task state can be masked out with scx_entity.flags & SCX_TASK_STATE_MASK.

weight

v6.12

u32 weight

The weight of the task. A value in the range 1..10000. The higher the weight, the more priority the task should have.

sticky_cpu

v6.12

s32 sticky_cpu

Docs could be improved

This part of the docs is incomplete, contributions are very welcome

holding_cpu

v6.12

s32 holding_cpu

Docs could be improved

This part of the docs is incomplete, contributions are very welcome

kf_mask

v6.12

u32 kf_mask

See scx_kf_mask.

kf_tasks

v6.12

struct task_struct *kf_tasks[2]

SCX_CALL_OP_TASK()

ops_state

v6.12

atomic_long_t ops_state

Used to track the task ownership between the SCX core and the BPF scheduler. Valid values described by [enum scx_ops_state] (#enum-scx_ops_state).

State transitions look as follows:

 NONE -> QUEUEING -> QUEUED -> DISPATCHING
   ^              |                 |
   |              v                 v
   \-------------------------------/

QUEUEING and DISPATCHING states can be waited upon. See wait_ops_state() call sites for explanations on the conditions being waited upon and why they are safe. Transitions out of them into NONE or QUEUED must store_release and the waiters should load_acquire.

Tracking scx_ops_state enables sched_ext core to reliably determine whether any given task can be dispatched by the BPF scheduler at all times and thus relaxes the requirements on the BPF scheduler. This allows the BPF scheduler to try to dispatch any task anytime regardless of its state as the SCX core can safely reject invalid dispatches.

runnable_node

v6.12

struct list_head runnable_node

This is a node of a runqueue linked list. This node is linked into the rq->scx.runnable_list when the task becomes runnable.

runnable_at

v6.12

unsigned long runnable_at

The jiffies value at which the task became runnable.

core_sched_at

v6.12

u64 core_sched_at

See scx_prio_less()

Note

This field is only available on kernels compiled with the CONFIG_SCHED_CORE Kconfig enabled.

ddsp_dsq_id

v6.12

u64 ddsp_dsq_id

The DSQ ID when on the direct dispatch path.

ddsp_enq_flags

v6.12

u64 ddsp_enq_flags

The DSQ enqueue flags when on the direct dispatch path.

slice

v6.12

u64 slice

Note

This field can be modified by the BPF scheduler.

Runtime budget in nanoseconds. This is usually set through scx_bpf_dispatch but can also be modified directly by the BPF scheduler. Automatically decreased by SCX as the task executes. On depletion, a scheduling event is triggered.

This value is cleared to zero if the task is preempted by SCX_KICK_PREEMPT and shouldn't be used to determine how long the task ran. Use p->se.sum_exec_runtime instead.

dsq_vtime

v6.12

u64 dsq_vtime

Note

This field can be modified by the BPF scheduler.

Used to order tasks when dispatching to the vtime-ordered priority queue of a DSQ. This is usually set through scx_bpf_dispatch_vtime but can also be modified directly by the BPF scheduler. Modifying it while a task is queued on a DSQ may mangle the ordering and is not recommended.

disallow

v6.12

bool disallow

Note

This field can be modified by the BPF scheduler.

Reject switching into SCX.

If set, reject future sched_setscheduler(2) calls updating the policy to SCHED_EXT with -EACCES.

Can be set from init_task while the BPF scheduler is being loaded (scx_init_task_args->fork). If set and the task's policy is already SCHED_EXT, the task's policy is rejected and forcefully reverted to SCHED_NORMAL. The number of such events are reported through /sys/kernel/debug/sched_ext::nr_rejected. Setting this flag during fork is not allowed.

cgrp_moving_from

v6.12

struct cgroup *cgrp_moving_from

Note

This field is only available on kernels compiled with the CONFIG_EXT_GROUP_SCHED Kconfig enabled.

tasks_node

v6.12

struct list_head tasks_node

This is a node of a task list linked list. This node is linked into the scx_tasks after forking.

struct cpumask

This structure is a bitmap, one bit for every CPU.

struct cpumask { 
    DECLARE_BITMAP(bits, NR_CPUS); 
};

struct scx_cpu_acquire_args

Argument container for cpu_acquire. Currently empty, but may be expanded in the future.

struct scx_cpu_acquire_args {};

struct scx_cpu_release_args

argument container for cpu_release

struct scx_cpu_release_args {
    enum scx_cpu_preempt_reason reason;
    struct task_struct         *task;
};

reason

v6.12

The reason the CPU was preempted. See enum scx_cpu_preempt_reason for possible values.

task

v6.12

The task that's going to be scheduled on the CPU.

struct scx_init_task_args

Argument container for init_task

struct scx_init_task_args {
    bool fork;
#ifdef CONFIG_EXT_GROUP_SCHED
    struct cgroup *cgroup;
#endif
};

fork

v6.12

Set if init_task is being invoked on the fork path, as opposed to the scheduler transition path.

cgroup

v6.12

The cGroup the task is joining.

struct scx_exit_task_args

Argument container for exit_task

struct scx_exit_task_args {
    bool cancelled;
};

cancelled

v6.12

Whether the task exited before running on sched_ext.

struct scx_dump_ctx

Informational context provided to dump operations.

struct scx_dump_ctx {
    enum scx_exit_kind  kind;
    s64                 exit_code;
    const char         *reason;
    u64                 at_ns;
    u64                 at_jiffies;
};

struct scx_cgroup_init_args

Argument container for cgroup_init

struct scx_cgroup_init_args {
    u32 weight;
};

weight

v6.12

The weight of the cGroup [1..10000].

struct scx_exit_info

This struct is passed to exit to describe why the BPF scheduler is being disabled.

struct scx_exit_info {
    enum scx_exit_kind  kind;
    s64                 exit_code;
    const char          *reason;
    unsigned long       *bt;
    u32                 bt_len;
    char                *msg;
    char                *dump;
};

kind

v6.12

enum scx_exit_kind kind

Broad category of the exit reason, one of enum scx_exit_kind

exit_code

v6.12

s64 exit_code

Exit code if gracefully exiting

reason

v6.12

const char *reason;

Textual representation of the above

bt

v6.12

unsigned long *bt;

Backtrace if exiting due to an error

bt_len

v6.12

u32 bt_len;

msg

v6.12

char *msg;

Informational message

dump

v6.12

char *dump;

debug dump