Saturday, February 2, 2013

Select Poll



SELECT/POLL


"Asynchronous I/O" is the ability of a process to perform input/output on multiple sources at the same time. The term is also used when the system performs I/O only once data is actually available or ready to be sent, rather than issuing a read/write operation and blocking as a result. I/O events can arrive through several channels (timeouts, signals, data on a socket, etc.), and the key is to monitor these channels simultaneously.

Applications that want non-blocking I/O use the poll, select, and epoll system calls to watch multiple file descriptors and determine whether they can read from or write to one or more open files without blocking. These calls can also block a process until any of the file descriptors being waited on becomes available for reading or writing.

The functions poll and select pass a set of File Descriptors (FDs) to the kernel (poll as an array of structures, select as bitmask sets), with an optional timeout value. When there is activity, or when the timeout expires, the poll/select system call returns. The application must then scan the whole result set to see which FDs have an event that it was interested in receiving. This scheme works well with small numbers of FDs, but does not scale to thousands of FDs. The epoll call was added in Linux version 2.5.45 to scale to thousands of file descriptors.
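The scan step described above can be sketched in user space as follows. This is only an illustration: first_readable() and demo_scan() are hypothetical helper names, not part of any library, and two pipes stand in for the monitored sources.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait until one of the nfds descriptors is readable
 * and return the index of the first ready one, or -1 on timeout or error. */
int first_readable(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
        nfds_t i;
        int ready;

        for (i = 0; i < nfds; i++) {
                fds[i].events = POLLIN;
                fds[i].revents = 0;
        }

        ready = poll(fds, nfds, timeout_ms);
        if (ready <= 0)
                return -1;

        /* poll() only reports how many entries are ready, so the caller
         * has to walk the whole array. This is the linear scan that does
         * not scale to thousands of descriptors. */
        for (i = 0; i < nfds; i++)
                if (fds[i].revents & POLLIN)
                        return (int)i;
        return -1;
}

/* Hypothetical demo: two pipes, data written only to the second one.
 * Returns the index reported by first_readable(), or -2 on setup failure. */
int demo_scan(void)
{
        int a[2], b[2];
        struct pollfd fds[2];
        int idx = -2;

        if (pipe(a) < 0)
                return -2;
        if (pipe(b) < 0) {
                close(a[0]);
                close(a[1]);
                return -2;
        }

        fds[0].fd = a[0];
        fds[1].fd = b[0];
        if (write(b[1], "x", 1) == 1)
                idx = first_readable(fds, 2, 1000);

        close(a[0]);
        close(a[1]);
        close(b[0]);
        close(b[1]);
        return idx;
}
```

Calling demo_scan() should report the second array entry, since only the second pipe has data waiting.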

This article talks about the poll system call.

poll() System Call

The poll() system call was introduced in Linux 2.1.23. The poll() library call was introduced in libc 5.4.28. To check the version of glibc on an RPM-based system, run the following command:
linux$  rpm -qa | grep glibc

#include <poll.h>
int poll (struct pollfd fds[], nfds_t nfds, int timeout);

nfds is the number of pollfd structures in the fds array.
timeout is the timeout value in milliseconds. A negative value means wait indefinitely; zero makes poll() return immediately.

For each member of the array pointed to by fds, poll() examines the file descriptor for the event(s) specified in the events field of the structure. The members of struct pollfd are as follows:
fd specifies an open file descriptor.
events specifies the events that need to be watched. It is a bitmask constructed by OR'ing a combination of flags, some of which are given below.
POLLIN - Data other than high-priority data may be read without blocking.
POLLOUT - Normal data may be written without blocking.
POLLERR - An error has occurred on the device or stream. This flag is only valid in the revents bitmask; it shall be ignored in the events member.
revents is set by the kernel with the bits indicating which events occurred on fd.

struct pollfd {
        int fd;
        short events;
        short revents;
};
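A minimal sketch of how events, revents, and timeout interact, assuming a single descriptor is being watched. wait_events() is a hypothetical wrapper written for this illustration, not a library function.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical wrapper: wait up to timeout_ms for the requested events on
 * one descriptor. Returns the value of poll(): 1 if the descriptor has
 * events, 0 on timeout, -1 on error. The bits the kernel reported are
 * stored in *revents. */
int wait_events(int fd, short events, int timeout_ms, short *revents)
{
        struct pollfd p;
        int ret;

        p.fd = fd;
        p.events = events;      /* what we ask for */
        p.revents = 0;          /* filled in by the kernel */

        ret = poll(&p, 1, timeout_ms);
        *revents = p.revents;
        return ret;
}
```

For example, on a freshly created pipe the read end times out when waiting for POLLIN, while the write end reports POLLOUT immediately. Once a byte has been written, the read end reports POLLIN.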

Device Driver

Support for any of these calls (poll, select, or epoll) requires support from the device driver. This support (for all three calls) is provided through the driver's poll method. This method has the following prototype:
unsigned int (*poll) (struct file *filp, poll_table *wait);

This driver method is called whenever a user-space program performs a poll, select, or epoll system call involving a file descriptor associated with the driver. The device method is in charge of these two steps:
1.      Call poll_wait() on each wait queue that could indicate a change in the poll status. If no file descriptors are currently available for I/O, the kernel causes the process to wait on the wait queues for all file descriptors passed to the system call.
2.      Return a bit mask describing the operations that can be performed without blocking. The kernel reports this mask to user space through the revents field of the pollfd.
The second argument to the driver's poll method is the poll_table. The driver treats it as an opaque object; the kernel uses it to obtain a poll_table_entry structure for each wait queue. It is passed to the driver method so that the driver can load it with every wait queue that could wake up the process and change the status of the poll operation. The driver adds a wait queue to the poll_table structure by calling the function poll_wait().

Function Flow on poll() from user space

The user application calls select() or poll(). For poll(), the function in kernel space that gets called is do_sys_poll(). For select(), the first function to be called in kernel space is core_sys_select(), located in fs/select.c, which is a wrapper for do_select(). This section looks at what the poll() path does.
 
do_sys_poll() then calls
    poll_initwait()
        Sets poll_table’s function pt->qproc to __pollwait()
    do_poll(), which returns fdcount
        Calls do_pollfd()
            Calls f_op->poll()
                Drivers implement the poll() routines
                Calls void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
                    Calls __pollwait()  // Defined in fs/select.c
                        Calls poll_get_entry() to get struct poll_table_entry
                        struct poll_table_entry {
                                struct file * filp;
                                wait_queue_t wait;
                                wait_queue_head_t * wait_address;
                        };
                        Adds filp to entry->filp
                        Adds wait_address to entry->wait_address
                        Calls init_waitqueue_entry() to add the “current” process to entry->wait->private
                        Calls add_wait_queue() to add entry->wait to the list wait_address. Thus, at the end of poll_wait(), the process has been added to the event’s wait queue.
    poll_freewait()
        When a task wakes up, it has to be removed from all the wait queues it is on. Having a list of all the wait queues a task is on helps save time.

Data Structure

This section explains some internals of the poll_table_struct. Whenever a user application calls poll, select, or epoll_ctl, the kernel invokes the poll method of all files referenced by the system call, passing the same poll_table to each of them [3]. The poll_table structure is a wrapper around a function that builds the actual data structure.
typedef struct poll_table_struct {
        poll_queue_proc qproc;
} poll_table;

As seen in the previous section, poll_initwait(), called from do_sys_poll(), sets the poll_table’s function pointer pt->qproc to __pollwait().
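The callback indirection can be modelled in user space. The sketch below is a toy, not kernel code: every toy_ name is hypothetical, the wait queue is reduced to a counter, and TOY_POLLIN is a stand-in for POLLIN. It only shows the shape of the mechanism, in which the driver calls poll_wait() and poll_wait() forwards to whatever function was installed in qproc.

```c
#include <stddef.h>

/* Toy stand-ins for struct file and wait_queue_head_t (assumptions). */
struct toy_file { int id; };
struct toy_wait_queue_head { int nwaiters; };

struct toy_poll_table_struct;
typedef void (*toy_poll_queue_proc)(struct toy_file *,
                                    struct toy_wait_queue_head *,
                                    struct toy_poll_table_struct *);

/* Same shape as poll_table: a wrapper around one function pointer. */
typedef struct toy_poll_table_struct {
        toy_poll_queue_proc qproc;
} toy_poll_table;

/* Plays the role of __pollwait(): registers the caller on the queue. */
static void toy_pollwait(struct toy_file *filp,
                         struct toy_wait_queue_head *wq, toy_poll_table *p)
{
        (void)filp;
        (void)p;
        wq->nwaiters++;
}

/* Plays the role of poll_initwait(): installs the queueing callback. */
void toy_poll_initwait(toy_poll_table *pt)
{
        pt->qproc = toy_pollwait;
}

/* Plays the role of poll_wait(): forwards to qproc when a table is given. */
void toy_poll_wait(struct toy_file *filp, struct toy_wait_queue_head *wq,
                   toy_poll_table *p)
{
        if (p && wq)
                p->qproc(filp, wq, p);
}

#define TOY_POLLIN 0x0001

/* A toy device with one wait queue and a "data ready" flag. */
struct toy_dev {
        struct toy_wait_queue_head readq;
        int data_ready;
};

/* Toy driver poll method: step 1 registers the wait queue,
 * step 2 returns the mask of operations that would not block. */
unsigned int toy_dev_poll(struct toy_file *filp, struct toy_dev *dev,
                          toy_poll_table *wait)
{
        unsigned int mask = 0;

        toy_poll_wait(filp, &dev->readq, wait);
        if (dev->data_ready)
                mask |= TOY_POLLIN;
        return mask;
}
```

Passing a NULL table in the toy model skips registration and only returns the mask, so the same driver method can be used both to queue the caller and to re-check readiness.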

The poll_table_page structure, for poll and select, is a linked list of memory pages containing poll_table_entry structures.
struct poll_table_entry {
        struct file * filp;
        wait_queue_t wait;
        wait_queue_head_t * wait_address;
};
struct poll_table_page {
        struct poll_table_page * next;
        struct poll_table_entry * entry;
        struct poll_table_entry entries[0];
};

This structure is maintained by the kernel so that the process can be removed from all of those queues before poll or select returns. When the poll call completes, the poll_table structure is deallocated, and all wait queue entries previously added to the poll table (if any) are removed from the table and their wait queues.

The following figure is taken from [3].

 

Figure 1: Data Structures behind poll

Code Example     

We will take the example of signalfd() to explain the poll() mechanism [4]. signalfd() has been available on Linux since kernel 2.6.22.

The signalfd function creates a file descriptor that can be used to accept signals targeted at the caller.  This provides an alternative to the use of a signal handler and has the advantage that the file descriptor may be monitored by select, poll or epoll.
Synopsis
#include <sys/signalfd.h>
int signalfd (int fd, const sigset_t *mask, int flags);

The mask argument specifies the set of signals that the caller wishes to accept via the file descriptor. The set of signals to be received via the file descriptor should be blocked using sigprocmask(2), to prevent the signals being handled according to their default dispositions.
If the fd argument is -1, then the call creates a new file descriptor and associates the signal set specified in mask with that descriptor.  If fd is not -1, then it must specify a valid existing signalfd file descriptor, and mask is used to replace the signal set associated with that descriptor.
signalfd() returns a file descriptor that supports read, close, poll, select and epoll calls.

Driver Code

Add a poll() pointer to the file_operations structure and define the method. The following example is taken from fs/signalfd.c

static const struct file_operations signalfd_fops = {
        .release        = signalfd_release,
        .poll           = signalfd_poll,
        .read           = signalfd_read,
};

static unsigned int signalfd_poll(struct file *file, poll_table *wait)
{
        struct signalfd_ctx *ctx = file->private_data;
        unsigned int events = 0;
 
        poll_wait(file, &current->sighand->signalfd_wqh, wait);
        if (next_signal(&current->pending, &ctx->sigmask) ||
               next_signal(&current->signal->shared_pending,
                       &ctx->sigmask))
               events |= POLLIN;
        return events;
}

Later, when a signal is available, the driver calls:
wake_up(&current->sighand->signalfd_wqh);

This will cause the select/poll system call to wake up and check all file descriptors again (by calling the f_op->poll function).

User Code

User Space Poll routine
#include <poll.h>
int poll(struct pollfd *ufds, unsigned int nfds, int timeout);

The following user code was tried on a machine running SUSE 11.1. This is the output of “uname -a”:
$ Linux linux-gg13 2.6.27.7-9-default #1 SMP 2008-12-04 18:10:04 +0100 x86_64 x86_64 x86_64 GNU/Linux

#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <poll.h>
#include <sys/signalfd.h>

int main (int argc, char *argv[])
{
        int sigfd;
        sigset_t mask;
        struct pollfd fds[1];
        int timeout_msecs = 200000;
        int ret;

        /* Handle SIGTERM and SIGINT. */
        sigemptyset (&mask);
        sigaddset (&mask, SIGTERM);  /* kill -15 */
        sigaddset (&mask, SIGINT);   /* signal value = 2, keyboard shortcut = ctrl-c */

        /* Block signals handled using signalfd() to remove default signal actions */
        if (sigprocmask (SIG_BLOCK, &mask, NULL) < 0) {
                perror ("sigprocmask");
                return 1;
        }

        /* Create a file descriptor from which we will read the signals. */
        sigfd = signalfd (-1, &mask, 0);
        if (sigfd < 0) {
                perror ("signalfd");
                return 1;
        }

        fds[0].fd = sigfd;
        fds[0].events = POLLIN;
        fds[0].revents = 0;

        /* Sleep until a signal is pending on sigfd or the timeout expires. */
        ret = poll (fds, 1, timeout_msecs);
        if (ret < 0) {
                perror ("poll");
                close (sigfd);
                return 1;
        }

        /* Bitwise AND tests the POLLIN bit in revents. */
        if (fds[0].revents & POLLIN) {
                printf ("\n revents in fd[0] = 0x%x ", fds[0].revents);
        }

        if (ret > 0) {
                /* An event has occurred on the fd. */
                struct signalfd_siginfo si;
                ssize_t res;

                res = read (sigfd, &si, sizeof(si));
                if (res < 0) {
                        perror ("read");
                        return 1;
                }
                if (si.ssi_signo == SIGTERM) {
                        printf ("...SIGTERM\n");
                } else if (si.ssi_signo == SIGINT) {
                        printf ("...SIGINT\n");
                }
        } else {
                /* ret == 0: no event before the timeout expired. */
                printf ("...Timeout\n");
        }
        close (sigfd);
        return 0;
}

Running User Code

When the above code is run, the kernel prints the following trace as soon as the program starts and reaches the poll() call. The trace is obtained via a call to dump_stack() added at the start of the signalfd_poll() routine.

Call Trace:          
dump_stack                                 
signalfd_poll
do_sys_poll
sys_poll
system_call_fastpath

When ^C is pressed, sending SIGINT to the process, the same stack trace is seen again, because do_sys_poll() calls signalfd_poll() once more after being woken up.

dump_stack                                 
signalfd_poll
do_sys_poll
sys_poll
system_call_fastpath

References:
[3] Linux Device Drivers, 3rd Edition By Jonathan Corbet, Greg Kroah-Hartman, Alessandro Rubini http://www.makelinux.net/ldd3/chp-6-sect-3.shtml
[4] http://www.kernel.org/doc/man-pages/online/pages/man2/signalfd.2.html
