Chapter 14. Tuning the Task Scheduler

Contents

14.1. Introduction
14.2. Process Classification
14.3. O(1) Scheduler
14.4. Completely Fair Scheduler
14.5. For More Information

Modern operating systems, such as openSUSE®, normally run many different tasks at the same time. For example, you can be searching in a text file while receiving an e-mail and copying a big file to an external hard drive. These simple tasks require many additional processes to be run by the system. To provide each task with its required system resources, the Linux kernel needs a tool to distribute available system resources to individual tasks. And this is exactly what task scheduler does.

The following sections explain the most important terms related to process scheduling. They also introduce information about the task scheduler policy, scheduling algorithm, description of the task scheduler used by openSUSE, and references to other sources of relevant information.

14.1. Introduction

The Linux kernel controls the way tasks (or processes) are managed in the running system. The task scheduler, sometimes called process scheduler, is the part of the kernel that decides which task to run next. It is one of the core components of a multitasking operating system (such as Linux), being responsible for best utilizing system resources to guarantee that multiple tasks are being executed simultaneously.

14.1.1. Preemption

The theory behind task scheduling is very simple. If there are runnable processes in a system, at least one process must always be running. If there are more runnable processes than processors in a system, not all the processes can be running all the time.

Therefore, some processes need to be stopped temporarily, or suspended, so that others can be running again. The scheduler decides what process in the queue will run next.

As already mentioned, Linux, like all other Unix variants, is a multitasking operating system. That means that several tasks can be running at the same time. Linux provides a so called preemptive multitasking, where the scheduler decides when a process is suspended. This forced suspension is called preemption. All Unix flavors have been providing preemptive multitasking since the beginning.

14.1.2. Timeslice

The time period for which a process will be running before it is preempted is defined in advance. It is called a process' timeslice and represents the amount of processor time that is provided to each process. By assigning timeslices, the scheduler makes global decisions for the running system, and prevents individual processes from dominating over the processor resources.

14.1.3. Process Priority

The scheduler evaluates processes based on their priority. To calculate the current priority of a process, the task scheduler uses complex algorithms. As a result, each process is given a value according to which it is allowed to run on a processor.

14.2. Process Classification

Processes are usually classified according to their purpose and behavior. Although the borderline is not always clearly distinct, generally two criterias are used to sort them. These criteria are independent and do not exclude each other.

One approach is to classify a process either I/O-bound or processor-bound.

I/O-bound

I/O stands for Input/Output devices, such as keyboards, mice, or optical and hard disks. I/O-bound processes spend the majority of time submitting and waiting for requests. They are run very frequently, but for short time intervals, not to block other processes waiting for I/O requests.

processor-bound

On the other hand, processor-bound tasks use their time to execute a code, and usually run until they are preempted by the scheduler. They do not block processes waiting for I/O requests, and, therefore, can be run less frequently but for longer time intervals.

Another approach is to divide processes by either being interactive, batch, or real-time ones.

  • Interactive processes spend a lot of time waiting for I/O requests, such as keyboard or mouse operations. The scheduler must wake up such process quickly on user request, or the user will find the environment unresponsive. The typical delay is approximately 100 ms. Office applications, text editors or image manipulation programs represent typical interactive processes.

  • Batch processes often run in the background and do not need to be responsive. They usually receive lower priority from the scheduler. Multimedia converters, database search engines, or log files analyzers are typical examples of batch processes.

  • Real-time processes must never be blocked by low-priority processes, and the scheduler guarantees a short response time to them. Applications for editing multimedia content are a good example here.

14.3. O(1) Scheduler

The Linux kernel version 2.6 introduced a new task scheduler, called O(1) scheduler (see Big O notation), It was used as the default scheduler up to Kernel version 2.6.22. Its main task is to schedule tasks within a fixed amount of time, no matter how many runnable processes there are in the system.

The scheduler calculates the timeslices dynamically. However, to determine the appropriate timeslice is a complex task: Too long timeslices cause the system to be less interactive and responsive, while too short ones make the processor waste a lot of time on the overhead of switching the processes too frequently. The default timeslice is usually rather low, for example 20ms. The scheduler determines the timeslice based on priority of a process, which allows the processes with higher priority to run more often and for a longer time.

A process does not have to utilize all its timeslice at once. For instance, a process with a timeslice of 150ms does not have to be running for 150ms in one go. It can be running in five different schedule slots for 30ms instead. Interactive tasks typically benefit from this approach because they do not need such a large timeslice at once while they need to be responsive as long as possible.

The scheduler also assigns process priorities dynamically. It monitors the processes' behavior and, if needed, adjusts its priority. For example, a process which is being suspended for a long time is brought up by increasing its priority.

14.4. Completely Fair Scheduler

Since the Linux kernel version 2.6.23, a new approach has been taken to the scheduling of runnable processes. Completely Fair Scheduler (CFS) became the default Linux kernel scheduler. Since then, important changes and improvements have been made. The information in this chapter applies to openSUSE with kernel version 2.6.32. The scheduler environment was divided into several parts, and three main new features were introduced:

Modular Scheduler Core

The core of the scheduler was enhanced with scheduling classes. These classes are modular and represent scheduling policies.

Completely Fair Scheduler

Introduced in kernel 2.6.23 and extended in 2.6.24, CFS tries to assure that each process obtains its fair share of the processor time.

Group Scheduling

For example, if you split processes into groups according to which user is running them, CFS tries to provide each of these groups with the same amount of processor time.

As a result, CFS brings more optimized scheduling for both servers and desktops.

14.4.1. How CFS Works

CFS tries to guarantee a fair approach to each runnable task. To find the most balanced way of task scheduling, it uses the concept of red-black tree. A red-black tree is a type of self-balancing data search tree which provides inserting and removing entries in a reasonable way so that it remains well balanced. For more information, see the wiki pages of Red-black tree.

When a task enters into the run queue (a planned time line of processes to be executed next), the scheduler records the current time. While the process waits for processor time, its wait value gets incremented by an amount derived from the total number of tasks currently in the run queue and the process priority. As soon as the processor runs the task, its wait value gets decremented. If the value drops below a certain level, the task is preempted by the scheduler and other tasks get closer to the processor. By this algorithm, CFS tries to reach the ideal state where the wait value is always zero.

14.4.2. Grouping Processes

Since the Linux kernel version 2.6.24, CFS can be tuned to be fair to users or groups rather than to tasks only. Runnable tasks are then grouped to form entities, and CFS tries to be fair to these entities instead of individual runnable tasks. The scheduler also tries to be fair to individual tasks within these entities.

Tasks can be grouped in two mutually exclusive ways:

  • By user IDs

  • By kernel control groups.

The way the kernel scheduler lets you group the runnable tasks depends on setting the kernel compile-time options CONFIG_FAIR_USER_SCHED and CONFIG_FAIR_CGROUP_SCHED. The default setting in openSUSE® 11.4 is to use control groups, which lets you create groups as needed. For more information, see Chapter 10, Kernel Control Groups.

14.4.3. Kernel Configuration Options

Basic aspects of the task scheduler behavior can be set through the kernel configuration options. Setting these options is part of the kernel compilation process. Because kernel compilation process is a complex task and out of this document's scope, refer to relevant source of information (for example http://en.opensuse.org/Configure,_Build_and_Install_a_Custom_Linux_Kernel).

[Warning]Kernel Compilation

If you run openSUSE on a kernel that was not shipped with it, for example on a self-compiled kernel, you loose the entire support entitlement.

14.4.4. Terminology

Documents regarding task scheduling policy often use several technical terms which you need to know to understand the information correctly. Here are some of them:

Latency

Delay between the time a process is scheduled to run and the actual process execution.

Granularity

The relation between granularity and latency can be expressed by the following equation:

gran = ( lat / rtasks ) - ( lat / rtasks / rtasks )

where gran stands for granularity, lat stand for latency, and rtasks is the number of running tasks.

SCHED_BATCH

Scheduling policy designed for CPU-intensive tasks.

SCHED_OTHER

Default Linux time-sharing scheduling policy.

14.4.5. Runtime Tuning

The sysctl interface for examining and changing kernel parameters at runtime introduces important variables by means of which you can change the default behavior of the task scheduler. The syntax of the sysctl is simple, and all the following commands must be entered on the command line as root.

To read a value from a kernel variable, enter

sysctl variable

To assign a value, enter

sysctl variable=value

To get a list of all scheduler related sysctl variables, enter

sysctl -A | grep "sched" | grep -v "domain"
saturn.example.com:~ # sysctl -A | grep "sched" | grep -v "domain"
kernel.sched_child_runs_first = 0
kernel.sched_min_granularity_ns = 1000000
kernel.sched_latency_ns = 5000000
kernel.sched_wakeup_granularity_ns = 1000000
kernel.sched_shares_ratelimit = 250000
kernel.sched_tunable_scaling = 1
kernel.sched_shares_thresh = 4
kernel.sched_features = 15834238
kernel.sched_migration_cost = 500000
kernel.sched_nr_migrate = 32
kernel.sched_time_avg = 1000
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_compat_yield = 0

Note that variables ending with _ns and _us accept values in nanoseconds and microseconds, respectively.

A list of the most important task scheduler sysctl tuning variables (located at /proc/sys/kernel/) with a short description follows:

sched_child_runs_first

A freshly forked child runs before the parent continues execution. Setting this parameter to 1 is beneficial for an application in which the child performs an execution after fork. For example make -j<NO_CPUS> performs better when sched_child_runs_first is turned off. The default value is 0.

sched_compat_yield

Enables the aggressive yield behavior of the old 0(1) scheduler. Java applications that use synchronization extensively perform better with this value set to 1. Only use it when you see a drop in performance. The default value is 0.

Expect applications that depend on the sched_yield() syscall behavior to perform better with the value set to 1.

sched_migration_cost

Amount of time after the last execution that a task is considered to be cache hot in migration decisions. A hot task is less likely to be migrated, so increasing this variable reduces task migrations. The default value is 500000 (ns).

If the CPU idle time is higher than expected when there are runnable processes, try reducing this value. If tasks bounce between CPUs or nodes too often, try increasing it.

sched_latency_ns

Targeted preemption latency for CPU bound tasks. Increasing this variable increases a CPU bound task's timeslice. A task's timeslice is its weighted fair share of the scheduling period:

timeslice = scheduling period * (task's weight/total weight of tasks in the run queue)

The task's weight depends on the task's nice level and the scheduling policy. Minimum task weight for a SCHED_OTHER task is 15, corresponding to nice 19. The maximum task weight is 88761, corresponding to nice -20.

Timeslices become smaller as the load increases. When the number of runnable tasks exceeds sched_latency_ns/sched_min_granularity_ns, the slice becomes number_of_running_tasks * sched_min_granularity_ns. Prior to that, the slice is equal to sched_latency_ns.

This value also specifies the maximum amount of time during which a sleeping task is considered to be running for entitlement calculations. Increasing this variable increases the amount of time a waking task may consume before being preempted, thus increasing scheduler latency for CPU bound tasks. The default value is 20000000 (ns).

sched_min_granularity_ns

Minimal preemption granularity for CPU bound tasks. See sched_latency_ns for details. The default value is 4000000 (ns).

sched_wakeup_granularity_ns

The wake-up preemption granularity. Increasing this variable reduces wake-up preemption, reducing disturbance of compute bound tasks. Lowering it improves wake-up latency and throughput for latency critical tasks, particularly when a short duty cycle load component must compete with CPU bound components. The default value is 5000000 (ns).

[Warning]

Settings larger than half of sched_latency_ns will result in zero wake-up preemption and short duty cycle tasks will be unable to compete with CPU hogs effectively.

sched_rt_period_us

Period over which real-time task bandwidth enforcement is measured. The default value is 1000000 (µs).

sched_rt_runtime_us

Quantum allocated to real-time tasks during sched_rt_period_us. Setting to -1 disables RT bandwidth enforcement. By default, RT tasks may consume 95%CPU/sec, thus leaving 5%CPU/sec or 0.05s to be used by SCHED_OTHER tasks.

sched_features

Provides information about specific debugging features.

sched_stat_granularity_ns

Specifies the granularity for collecting task scheduler statistics.

sched_nr_migrate

Controls how many tasks can be moved across processors through migration software interrupts (softirq). If a large number of tasks is created by SCHED_OTHER policy, they will all be run on the same processor. The default value is 32. Increasing this value gives a performance boost to large SCHED_OTHER threads at the expense of increased latencies for real-time tasks.

14.4.6. Debugging Interface and Scheduler Statistics

CFS comes with a new improved debugging interface, and provides runtime statistics information. Relevant files were added to the /proc file system, which can be examined simply with the cat or less command. A list of the related /proc files follows with their short description:

/proc/sched_debug

Contains the current values of all tunable variables (see Section 14.4.5, “Runtime Tuning”) that affect the task scheduler behavior, CFS statistics, and information about the run queue on all available processors.

saturn.example.com:~ # less /proc/sched_debug
Sched Debug Version: v0.09, 2.6.32.8-0.3-default #1
now at 2413026096.408222 msecs
  .jiffies                                 : 4898148820
  .sysctl_sched_latency                    : 5.000000
  .sysctl_sched_min_granularity            : 1.000000
  .sysctl_sched_wakeup_granularity         : 1.000000
  .sysctl_sched_child_runs_first           : 0.000000
  .sysctl_sched_features                   : 15834238
  .sysctl_sched_tunable_scaling            : 1 (logaritmic)

cpu#0, 1864.411 MHz
  .nr_running                    : 1
  .load                          : 1024
  .nr_switches                   : 37539000
  .nr_load_updates               : 22950725
[...]
cfs_rq[0]:/
  .exec_clock                    : 52940326.803842
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 54410632.307072
  .max_vruntime                  : 0.000001
[...]
rt_rq[0]:/
  .rt_nr_running                 : 0
  .rt_throttled                  : 0
  .rt_time                       : 0.000000
  .rt_runtime                    : 950.000000

runnable tasks:
  task  PID   tree-key    switches prio exec-runtime    sum-exec sum-sleep
--------------------------------------------------------------------------
R cat 16884 54410632.307072    0   120  54410632.307072 13.836804 0.000000
/proc/schedstat

Displays statistics relevant to the current run queue. Also domain-specific statistics for SMP systems are displayed for all connected processors. Because the output format is not user-friendly, read the contents of /usr/src/linux/Documentation/scheduler/sched-stats.txt for more information.

/proc/PID/sched

Displays scheduling information on the process with id PID.

saturn.example.com:~ # cat /proc/`pidof nautilus`/sched
 nautilus (4009, #threads: 1)
 ---------------------------------------------------------
 se.exec_start                      :    2419575150.560531
 se.vruntime                        :      54549795.870151
 se.sum_exec_runtime                :       4867855.829415
 se.avg_overlap                     :             0.401317
 se.avg_wakeup                      :             3.247651
 se.avg_running                     :             0.323432
 se.wait_start                      :             0.000000
 se.sleep_start                     :    2419575150.560531
 [...]
 nr_voluntary_switches              :               938552
 nr_involuntary_switches            :                71872
 se.load.weight                     :                 1024
 policy                             :                    0
 prio                               :                  120
 clock-delta                        :                  109

14.5. For More Information

To get a compact knowledge about Linux kernel task scheduling, you need to explore several information sources. Here are some of them: