Saturday, October 24, 2009

Kernel Wait Queues


One problem that might arise with read operation is what to do when there’s no data yet, but we’re not at end-of-file. The default answer is ‘‘go to sleep waiting for data.’’ This section shows how a process is put to sleep, how it is awakened, and how an application can ask if there is data without just blindly issuing a read call and blocking. We then apply the same concepts to write.

Whenever a process must wait for an event (such as the arrival of data or the termination of a process), it should go to sleep. Sleeping causes the process to suspend execution, freeing the processor for other uses. At some future time, when the event being waited for occurs, the process will be woken up and will continue with its job.

There are several ways of handling sleeping and waking up in Linux, each suited to different needs. All, however, work with the same basic data type, a wait queue (wait_queue_head_t). A wait queue is exactly that—a queue of processes that are waiting for an event. Wait queues are declared and initialized as follows:

wait_queue_head_t my_queue;
init_waitqueue_head (&my_queue);


When a wait queue is declared statically (i.e., not as an automatic variable of a procedure or as part of a dynamically-allocated data structure), it is also possible to initialize the queue at compile time:

DECLARE_WAIT_QUEUE_HEAD (my_queue);

It is a common mistake to neglect to initialize a wait queue (especially since earlier versions of the kernel did not requir e this initialization); if you forget, the results will usually not be what you intended. Once the wait queue is declared and initialized, a process may use it to go to sleep. Sleeping is accomplished by calling one of the variants of sleep_on, depending on how deep a sleep is called for.

sleep_on(wait_queue_head_t *queue);

Puts the process to sleep on this queue. sleep_on has the disadvantage of not being interruptible; as a result, the process can end up being stuck (and unkillable) if the event it’s waiting for never happens.

interruptible_sleep_on(wait_queue_head_t *queue);

The interruptible variant works just like sleep_on, except that the sleep can be interrupted by a signal. This is the form that device driver writers have been using for a long time, before wait_event_interruptible appeared.

sleep_on_timeout(wait_queue_head_t *queue, long timeout);
interruptible_sleep_on_timeout(wait_queue_head_t *queue, long timeout);


These two functions behave like the previous two, with the exception that the sleep will last no longer than the given timeout period. The timeout is specified in ‘‘jiffies’’.

void wait_event(wait_queue_head_t queue, int condition);
int wait_event_interruptible(wait_queue_head_t queue, int condition);


These macros are the preferred way to sleep on an event. They combine waiting for an event and testing for its arrival in a way that avoids race conditions. They will sleep until the condition, which may be any boolean C expression, evaluates true. The macros expand to a while loop, and the condition is reevaluated over time—the behavior is different from that of a function call or a simple macro, where the arguments are evaluated only at call time. The latter macro is implemented as an expression that evaluates to 0 in case of success and -ERESTARTSYS if the loop is interrupted by a signal. It is worth repeating that driver writers should almost always use the interruptible instances of these functions/macros. The noninterruptible version exists for the small number of situations in which signals cannot be dealt with, for example, when waiting for a data page to be retrieved from swap space. Most drivers do not present such special situations. Of course, sleeping is only half of the problem; something, somewhere will have to wake the process up again. When a device driver sleeps directly, there is usually code in another part of the driver that performs the wakeup, once it knows that the event has occurred. Typically a driver will wake up sleepers in its interrupt handler once new data has arrived. Other scenarios are possible, however. Just as there is more than one way to sleep, so there is also more than one way to wake up. The high-level functions provided by the kernel to wake up processes are as follows:.

wake_up(wait_queue_head_t *queue);
This function will wake up all processes that are waiting on this event queue.

wake_up_interruptible(wait_queue_head_t *queue);
wake_up_interruptible wakes up only the processes that are in interruptible
sleeps. Any process that sleeps on the wait queue using a noninterruptible
function or macro will continue to sleep.

wake_up_sync(wait_queue_head_t *queue);

wake_up_interruptible_sync(wait_queue_head_t *queue);


Normally, a wake_up call can cause an immediate reschedule to happen, meaning that other processes might run before wake_up retur ns. The “synchronous” variants instead make any awakened processes runnable, but do not reschedule the CPU. This is used to avoid rescheduling when the current process is known to be going to sleep, thus forcing a reschedule anyway. Note that awakened processes could run immediately on a different processor, so these functions should not be expected to provide mutual exclusion. If your driver is using interruptible_sleep_on, there is little difference between wake_up and wake_up_interruptible. Calling the latter is a common convention, however, to preserve consistency between the two calls.

As an example of wait queue usage, imagine you want to put a process to sleep when it reads your device and awaken it when someone else writes to the device.

The following code does just that:

DECLARE_WAIT_QUEUE_HEAD(wq);

ssize_t sleepy_read (struct file *filp, char *buf, size_t count,
loff_t *pos)
{
printk(KERN_DEBUG "process %i (%s) going to sleep\n",
current->pid, current->comm);
interruptible_sleep_on(&wq);
printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
return 0; /* EOF */
}

ssize_t sleepy_write (struct file *filp, const char *buf, size_t count,
loff_t *pos)
{
printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
current->pid, current->comm);
wake_up_interruptible(&wq);
return count; /* succeed, to avoid retrial */
}

An important thing to remember with wait queues is that being woken up does not guarantee that the event you were waiting for has occurred; a process can be woken for other reasons, mainly because it received a signal. Any code that sleeps should do so in a loop that tests the condition after returning from the sleep.

ptrace system call

Linux provides few system calls for tracing. Ptrace system call is used to trace the process and strace is used to trace the system calls and signals. Let’s see the usage of ptrace here.


What is ptrace?

A parent process can observe and control the execution of another process, and examine its core images and registers using the ptrace system call.

Ptrace finds its application primarily in break-point debugging and system call tracing.


How to trace a process?

Parent forks a child and the child does a PTRACE_TRACEME allowing the parent to trace the child process. The child then execs a new process (a process that is to be traced) by the parent (tracer) process.

When the child’s (exec’d) process is being traced, the child process stops when it is delivered a signal, even if the signal is being ignored. (SIGKILL is exceptional, it kills the child process, as usual). The parent process (the tracer) will be notified at its wait() system call. It may then inspect and modify the child process while it is stopped. The parent may then ignore the signal or re-send another signal to the child. The parent can kill the child or leave the child process to continue when it’s done with tracing.

The parent process can also attach another existing process and trace it using the request PTRACE_ATTACH.


Syntax:


#include sys/ptrace.h

long ptrace(enum __ptrace_request request, pid_t pid,
void *addr, void *data);


pid is the id of the process to be traced.
Request determines the type of action to be performed on the tracing (child) process
Addr, data, we shall see them later. This varies depending on the requests. Few request don’t need the addr, data.


Requests:

Lets see some of the important requests that can be performed on the process (to be traced) by using ptrace system call. Man ptrace will explore you more options than this.

PTRACE_TRACEME
Indicates that this process is to be traced by its parent. Any signal (except SIGKILL) delivered to this process will cause it to stop and its parent to be notified via wait(). Also, all subsequent calls to exec() by this process will cause a SIGTRAP to be sent to it, giving the parent a chance to gain control before the new program begins execution. A process probably shouldn’t make this request if its parent isn’t expecting to trace it. (pid, addr, and data are ignored.) This request is done only by the child process.

PTRACE_PEEKTEXT, PTRACE_PEEKDATA
Reads a word at the location addr in the child’s memory, returning the word as the result of the ptrace() call. Linux does not have separate text and data address spaces, so the two requests are currently equivalent. (The argument data is ignored.)

PTRACE_PEEKUSR
Reads a word at offset addr in the childâs USER area, which holds the registers and other information about the process

PTRACE_POKETEXT, PTRACE_POKEDATA
Copies the word data to location addr in the child’s memory.

PTRACE_POKEUSR
Copies the word data to offset addr in the child’s USER area. As above, the offset must typically be word-aligned. In order to maintain the integrity of the kernel, some modifications to the USER area are disallowed.

PTRACE_GETREGS, PTRACE_GETFPREGS
Copies the child’s general purpose or floating-point registers, respectively, to location data in the parent. (addr is ignored.)

PTRACE_GETSIGINFO
Retrieve information about the signal that caused the stop. Copies a siginfo_t structure from the child to location data in the parent. (addr is ignored.)

PTRACE_SETREGS, PTRACE_SETFPREGS
Copies the childâs general purpose or floating-point registers, respectively, from location data in the parent. As for PTRACE_POKEUSER, some general purpose register modifications may be disallowed. (addr is ignored.)

PTRACE_SETSIGINFO
Set signal information. Copies a siginfo_t structure from location data in the parent to the child. This will only affect signals that would normally be delivered to the child and were caught by the tracer.

PTRACE_CONT
Restarts the stoped child process

PTRACE_SYSCALL, PTRACE_SINGLESTEP
Restarts the stopped child as for PTRACE_CONT, but arranges for the child to be stopped at the next entry to or exit from a system call, or after execution of a single instruction, respectively.

PTRACE_KILL
Sends the child a SIGKILL to terminate it.

PTRACE_ATTACH
Attaches to the process specified in pid, making it a traced "child" of the current process; the behavior of the child is as if it had done a PTRACE_TRACEME.

PTRACE_DETACH
Restarts the stopped child as for PTRACE_CONT, but first detaches from the process, undoing the reparenting effect of PTRACE_ATTACH, and the effects of PTRACE_TRACEME.

Return values:

On success, PTRACE_PEEK* requests return the requested data, while other requests return zero. On error, all requests return -1, and errno is set appropriately.

Examples:

Lets see an example to print the system call number called by a child process.

#include sys/ptrace.h
#include sys/types.h
#include sys/wait.h
#include unistd.h
#include linux/user.h

int main()
{ pid_t child;
struct user_regs_struct regs;
child = fork();
if(child == 0) {
ptrace(PTRACE_TRACEME, 0, NULL, NULL);
execl("/bin/ls", "ls", NULL);
}
else {
wait(NULL);
ptrace(PTRACE_GETREGS,
child, 0 ,
&regs);
printf("The child made a "
"system call %ld\n", regs.orig_eax);
ptrace(PTRACE_CONT, child, NULL, NULL);
}
return 0;
}

In the above program, a parent forks a child and the child executes the image of ls by calling the execl system call.


The system call numbers are defined in /usr/src/linux-2.6.23.12/arch/i386/kernel/syscall_table.S
ENTRY(sys_call_table)
.long sys_unlink /* 10 */
.long sys_execve /*exec’s system call no is 11 */


So, the output of the program should print the system call number as 11 and list out all the files in the dir.

Before executing the system call, the kernel checks whether the process is being traced. If it is, the kernel stops the process and gives control to the tracking process so it can examine and modify the traced process' registers.


Before running exec, the child calls ptrace with the first argument, equal to TRACE_TRACEME. This tells the kernel that the process is being traced, and when the child executes the execve system call, it hands over control to its parent. The parent waits for notification from the kernel with a wait() call. Then the parent can check the arguments of the system call or do other things, such as looking into the registers.
When the system call occurs, the kernel saves the original contents of the eax register, which contains the system call number. We can read this value from child's USER segment by calling ptrace with the first argument

PTRACE_GETREGS.


After we are done examining the system call, the child can continue with a call to ptrace with the first argument PTRACE_CONT, which lets the system call continue.
The structure user_regs_struct is defined in /usr/include/sys/user.h and contains ebx, ecx, edx system call argv registers and eip, instruction pointer. You can also print those register values if you wish to.


Lets see another example which calculates the number of machine cycle instructions executed by ls.

#include stdio.h
#include stdlib.h
#include signal.h
#include syscall.h
#include sys/ptrace.h
#include sys/types.h
#include sys/wait.h
#include unistd.h
#include errno.h

int main(void)
{
long long counter = 0; /* machine instruction counter */
int wait_val; /* child's return value */
int pid; /* child's process id */

puts("Please wait");

switch (pid = fork()) {
case -1:
perror("fork");
break;
case 0: /* child process starts */
ptrace(PTRACE_TRACEME, 0, 0, 0);
/*
* must be called in order to allow the
* control over the child process
*/
execl("/bin/ls", "ls", NULL);
/*
* executes the program and causes
* the child to stop and send a signal
* to the parent, the parent can now
* switch to PTRACE_SINGLESTEP
*/
break;
/* child process ends */
default:/* parent process starts */
wait(&wait_val);
/*
* parent waits for child to stop at next
* instruction (execl()) */
while (wait_val == 1407 ) {
counter++;
if (ptrace(PTRACE_SINGLESTEP, pid, 0, 0) != 0)
perror("ptrace");
/*
* switch to singlestep tracing and
* release child
* if unable call error.
*/
wait(&wait_val);
/* wait for next instruction to complete */
}
}
printf("Number of machine instructions : %lld\n", counter);
return 0;
}


Here, the parent forks a child which execs ls. The child sets its trace bit by calling PTRACE_TRACEME allowing the parent to trace it. Parent does a single step tracing by calling ptrace with the arg. PTRACE_SINGLESTEP to the child’s pid.

You can exec your program to see how faster it gets executed by checking for lowest machine cycle instructions.


It will be more interesting if you try to overwrite the child’s image using the other ptrace requests like POKE and set register of child’s area. You can also try inserting breakpoints to it.

Zombie and Orphan Processes

A zombie process is a process that has completed execution but still has an entry in the process table, this entry being still needed to allow the process that started the zombie process to read its exit status.

When a process ends, all of the memory and resources associated with it are de allocated so they can be used by other processes. However, the process's entry in the process table remains. The parent can read the child's exit status by executing the wait system call, at which stage the zombie is removed. The wait call may be executed in sequential code, but it is commonly executed in a handler for the SIGCHLD signal, which the parent is sent whenever a child has died.

After the zombie is removed, its process ID and entry in the process table can then be reused. However, if a parent fails to call wait, the zombie will be left in the process table. In some situations this may be desirable, for example if the parent creates another child process it ensures that it will not be allocated the same process ID.

A zombie process is not the same as an orphan process. An orphan process is a process that is still executing, but whose parent has died. They don't become zombie processes; instead, they are adopted by init (process ID 1), which waits on its children.

Zombies can be identified in the output from the UNIX ps command by the presence of a "Z" in the STAT column. Zombies that exist for more than a short period of time typically indicate a bug in the parent program. As with other leaks, the presence of a few zombies isn't worrisome in itself, but may indicate a problem that would grow serious under heavier loads. Since there is no memory allocated to zombie processes except for the process table entry itself, the primary concern with many zombies is not running out of memory, but rather running out of process ID numbers.

To remove zombies from a system, the SIGCHLD signal can be sent to the parent manually, using the kill command. If the parent process still refuses to reap the zombie, the next step would be to remove the parent process. When a process loses its parent, init becomes its new parent. Init periodically executes the wait system call to reap any zombies with init as parent.

Nice Command

`Nice' prints or modifies a process's "niceness", a parameter that affects whether the process is scheduled favorably.
Synopsis:

nice [OPTION]... [COMMAND [ARG]...]

If no arguments are given, `nice' prints the current niceness. Otherwise, `nice' runs the given COMMAND with its niceness adjusted. By default, its niceness is incremented by 10.

Nicenesses range at least from -20 (resulting in the most favorable scheduling) through 19 (the least favorable). Some systems may have a wider range of nicenesses; An attempt to set the niceness outside the supported range is treated as an attempt to use the minimum or maximum supported value.

A niceness should not be confused with a scheduling priority, which lets applications determine the order in which threads are scheduled to run. Unlike a priority, niceness is merely advice to the scheduler, which the scheduler is free to ignore.

Try opening a file with different nice values and notice the delay during opening the file to feel the usage of niceness value:

[root@localhost ]# nice -n 18 vi unused.c
[root@localhost ]# nice -n -18 vim unused.c

It is sometimes useful to run a non-interactive program with reduced niceness. Say if you would like to find the factors of numbers 50, 90, 999 by the command factor, you can use the nice command with + value for niceness.

Resource Limits in Linux

Linux provides two system calls to control the maximum system resource consumption. They are setrlimits and getrlimits.


getrlimit() and setrlimit() get and set resource limits respectively. Each resource has an associated soft and hard limit, as defined by the rlimit structure (the rlim argument to both getrlimit() and setrlimit()):


struct rlimit {

rlim_t rlim_cur; /* Soft limit */

rlim_t rlim_max; /* Hard limit (ceiling for rlim_cur) */

};


The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may only set its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit. A privileged process may make arbitrary changes to either limit value.


Format of the system calls:
int getrlimit(int resource, struct rlimit *rlim);

int setrlimit(int resource, const struct rlimit *rlim);


Usage:
{

struct rlimit lim;
if(getrlimit(RLIMIT_STACK, &lim) <>

{

perror("getrlimit");

exit(1);

}
}

These resource limits could also be set from the shell for the processes started by the shell using the shell command uname.


The resources (RLIMIT_) that could be controlled are:

  • Maximum size of the core file. (RLIMIT_CORE) When 0 no core dump files are created. When non-zero, larger dumps are truncated to this size.
  • Maximum size of the process’s data segment. This includes initialized data, uninitialized data, and heap. (RLIMIT_DATA).
  • Maximum scheduling priority (RLIMIT_NICE nice value).
  • Maximum size of the file that the process may create (RLIMIT_FSIZE). Attempts to extend a file beyond this limit result in delivery of a SIGXFSZ signal.
  • Maximum number of pending signals (RLIMIT_SIGPENDING).
  • Maximum size that may be locked into RAM. In effect this limit will round to the nearest multiple of system page size (RLIMIT_MEMLOCK).
  • Maximum resident set size (RLIMIT_RSS).
  • Maximum number of open file descriptors (RLIMIT_NOFILE). Pipe size in 512-byte blocks
  • Maximum number of bytes in POSIX message queues.( RLIMIT_MSGQUEUE)Maximum rt priority (RLIMIT_RTPRIO).Maximum stack size (RLIMIT_STACK).
  • Maximum amount of cpu time in seconds. When the process reaches the soft limit, it is sent a SIGXCPU signal. The default action for this signal is to terminate the process (RLIMIT_CPU).
  • Maximum number of processes available to a single user (RLIMIT_NPROC)
  • Maximum amount of virtual memory in bytes. (RLIMIT_AS)Maximum number of file locks (RLIMIT_LOCKS).

What defines a Process in Linux?

The kernel stores the processes in a circular doubly linked list called the task list. Each element in the task list is a process descriptor of the type struct task_struct, which is defined in include/linux/sched.h. The process descriptor contains all the information about a specific process. A task_struct exactly defines a Linux process

The important information of processes contained in task_struct is:

  • Volatile long state - contains whether this task is runnable, and if not whether it is interruptible (may receive signals) or not.
  • Long counter - represents the dynamic portion of the task's goodness. Whenever it is reset, it is set to priority. At certain points in the task's execution (most commonly when the 10 msec timer goes off while the task is running) the counter is decremented. If the counter ever reaches 0 while the task is executing, need_resched is set.
  • Long priority - represents the static portion of the task's goodness.
  • Long need_resched - is examined before returning to the current task after a system call and each time the 10 msec timer interrupts. If need_resched is set, schedule () is called.
  • Unsigned long policy - states whether the current task is being scheduled under soft realtime first in first out (SCHED_FIFO), soft real time round robin (SCHED_RR ), or the normal Linux priority scheduling policy (SCHED_OTHER).
  • Unsigned rt_priority - is used to determine a soft real time task's goodness.
  • Struct mm_struct *mm - points to the memory management information for the task. Some tasks share a single mm_struct.


Apart from these, task_struct also contains a pointer to the task_struct of its parent and child processes, process credentials, resource limits, file system info, list of open files information and VM state of the process.

System call interface

System Call


A system call is an interface between a user-space application and a service that the kernel provides. Because the service is provided in the kernel, a direct call cannot be performed; instead, you must use a process of crossing the user-space/kernel boundary. The way you do this differs based on the particular architecture.



Let’s see how to add the system call to the Linux kernel 2.6 and how to use it from the user space for i386 architecture.



Implementation of system call


The implementation of system calls in Linux varies based on the architecture, but it can also differ within a given architecture. For example, older x86 processors used an interrupt mechanism to migrate from user-space to kernel-space, but new IA-32 processors provide instructions that optimize this transition (using sysenter and sysexit instructions).


Each system call is multiplexed into the kernel through a single entry point. The eax register is used to identify the particular system call that should be invoked, which is specified in the C library (per the call from the user-space application). When the C library has loaded the system call index and any arguments, a software interrupt is invoked (interrupt 0x80), which results in execution (through the interrupt handler) of the system_call function. This function handles all system calls, as identified by the contents of eax. After a few simple tests, the actual system call is invoked using the system_call_table and index contained in eax. Upon return from the system call, syscall_exit is eventually reached, and a call to resume_userspace transitions back to user-space. Execution resumes in the C library, which then returns to the user application.




The simplified flows of the system call (ex getpid) using the interrupt method:




The system call table and various linkages:





The system call table uses the index provided in eax to identify which system call to invoke from the table (sys_call_table). A sample of the contents of this table and the locations of these entities is also shown.





Adding a Linux System Call


Three basic steps to add a new system call to the kernel:

  1. Add the new function.
  2. Update the header files.
  3. Update the system call table for the new function.

You can create a new file for your system calls. Let’s see a simple example.




Step 1 is to add kernel Functions for the new system call:



asmlinkage long sys_getjiffies( void )
{
  return (long)get_jiffies_64();
}
 
asmlinkage long sys_diffjiffies( long ujiffies )
{
  return (long)get_jiffies_64() - ujiffies;
}


The above two functions are used for monitoring jiffies. The first one returns the current jiffies, while the second returns the difference of the current and the value that the caller passes in. Note the use of the asmlinkage modifier. This macro (defined in linux/include/asm-i386/linkage.h) tells the compiler to pass all function arguments on the stack.


Final Kernel Function for the system call example:


asmlinkage long sys_pdiffjiffies( long ujiffies,
                                  long __user *presult )
{
  long cur_jiffies = (long)get_jiffies_64();
  long result;
  int  err = 0;
 
  if (presult) {
 
    result = cur_jiffies - ujiffies;
    err = put_user( result, presult );
 
  }
 
  return err ? -EFAULT : 0;
}

This function takes two arguments: a long and a pointer to a long that's defined as __user. The __user macro simply tells the compiler that the pointer should not be dereferenced. This function calculates the difference between two jiffies values, and then provides the result to the user through a user-space pointer. The put_user function places the result value into user-space at the location that presult specifies. If an error occurs during this operation, it will be returned, and you'll likewise notify the user-space caller.



Step 2 is to update the header files to make room for the new functions in the system call table:


For this, update the header file linux/include/asm/unistd.h with the new system call numbers.

#define __NR_getcpu                             318
#define __NR_epoll_pwait       319
#define __NR_getjiffies           320
#define __NR_diffjiffies          321
#define __NR_pdiffjiffies        322
 
#define NR_syscalls                               323




Step 3 is updating the system call table:



Update the file linux/arch/i386/kernel/syscall_table.S for the new functions that will populate the particular indexes

.long sys_getcpu
.long sys_epoll_pwait
.long sys_getjiffies                 /* 320 */
.long sys_diffjiffies
.long sys_pdiffjiffies

Note: The size of this table is defined by the symbolic constant NR_syscalls.

Add your system call file to your Kernel Makefile (/usr/src/linux-x.x.x./kernel/Makefile) so that it gets updated.

Now the system needs to be re-compiled to include the chages. This is done by executing the below at /usr/src/linux-x.x.x

Make menuconfig or xconfig or config

Make dep

Make

Now a new image needs to be created to use the system call. This could be done by executing make install at /usr/src/linux-x.x-x. or by manually editing /etc/grub.conf or lilo. Now you can write an application to test your system call from the booted new image.

Q & A on system call

Differences between read and fread?

The former is a system call, while the latter is a C standard library function call. The latter is more efficient, because it use buffering. When an fread() call is made, more than the requested amount is read() from the file. The extra bytes are held in a buffer, local to the C standard library, and not directly accessible by your program. When your program next calls fread(), it may be able to satisfy the request using bytes already in its buffer, eliminating the need for another read() system call.