Dissecting Interrupts and Browsing DMA

by Alessandro Rubini

Though last month's installment appeared to cover everything about interrupts, that is not true. One month later, you are ready to understand the ultimate technology for interrupt handling. We also begin to investigate the fascinating world of memory management by explaining the tasks of a DMA-capable driver.

Changes in Current Linux Versions

Before we start, you should note two changes in recent Linux versions. In Linux 1.3.70, the first steps were taken to support shared interrupts. The idea is that several devices and device drivers share the same interrupt line. This is also part of the PCI specification, where every device has its own vendor- and product-dependent device ID. When reading this ID, Linux gets quite verbose about plugged-in PCI devices when booting with PCI enabled. For this reason, the declaration of request_irq and free_irq in linux/sched.h now reads:

extern int request_irq(unsigned int irq,
   void (*handler)(int, void *, struct pt_regs *),
   unsigned long flags,
   const char *device,
   void *dev_id);
extern void free_irq(unsigned int irq, void *dev_id);

Thus, when registering an interrupt with request_irq(), a fourth parameter must be handed to Linux: the device ID. Currently, most devices will pass a NULL for dev_id when requesting and freeing interrupts—this results in the same behaviour as before. If you really want to share interrupts, you also have to set SA_SHIRQ in flags, in addition to passing dev_id. Sharing of interrupts only works if all device drivers on one interrupt line agree to share their interrupt.

The second “change” is not a real change, but rather, a stylistic upgrade: get_user_byte() and put_user_byte() are considered obsolete and should not be used in new code. They are replaced by the more flexible get_user and put_user calls.

Linus Torvalds explains that these functions use compiler “magic” to look at the pointer they are passed and read or write the correct size item. This means you can't use void * or unsigned long as a pointer; you always have to use a pointer to the right type. Also, if you give it a char *, you get a char back, not an unsigned char, unlike the old get_fs_byte() function. If you need an unsigned value, use a pointer to an unsigned type. Never cast the return value to get the access size you want—if you think you need to do that, you are doing something wrong. Your argument pointer should always be of the right type.

Essentially, you should think of get_user() as a simple pointer dereference (kind of like *(xxx) in plain C, except it fetches the pointer data from user space instead). In fact, on some architectures, that is exactly what it is.

While we're fixing previous oversights, it is worth noting that the kernel offers a facility to autodetect interrupt lines. That's slightly different from what was shown in our article a couple of months ago. Those who are interested can look at <linux/interrupt.h> for documentation about it.

We now return you to your regularly scheduled programming.

A Split View of Interrupts

As you may remember from last time, interrupt handling is managed by a single driver function. Our implementation dealt with both low-level (acknowledging the interrupt) and high-level (such as awakening tasks) issues. This way of doing things may work for simple drivers, but it is prone to failure if the handling is too slow.

If you look at the code, it is clear that acknowledging the interrupt and retrieving or sending the relevant data is a minimal part of the workings. With common devices where you are moving only one or a few bytes per interrupt, most of the time is spent in managing device-specific queues, chains, and whatever other strange data structures are used in your implementation. Don't take our skel_interrupt() as an example here, since it is the most simplified interrupt handler possible; real devices may have a lot of status information and many modes of operation. If you spend too much time processing your data structures, it is possible to miss the next interrupt or interrupts and either hang or lose data, because when an interrupt handler is running, at least that interrupt is blocked, and with fast interrupt handlers, all interrupts are blocked.

The solution devised for this problem is to split the task of interrupt handling into two halves:

  • First, a fast “top half” handles hardware issues and must terminate before a new interrupt is issued. Normally, very little is done here besides moving data between the device and some memory buffer (and not even that, if your device driver uses DMA), and making sure the hardware is in a sane state.

  • Then a slower “bottom half” of the handler runs with interrupts enabled and can take as long as is needed to accomplish everything. It is executed as soon as possible after the interrupt is serviced.

Fortunately, the kernel offers a special way to schedule execution of the bottom half code, which isn't necessarily related to a particular process; this means both the request to run the function and the execution of the function itself are done outside of the context of any process. A special mechanism is needed, because the other kernel functions all operate in the context of a process—an orderly thread of execution, normally associated with a running instance of a user-level program—while interrupt handling is asynchronous and not related to a particular process.

Bottom Halves: When and How

From the programmer's point of view, the bottom half is very similar to the top half, in that it cannot call schedule() and can only perform atomic memory allocations. This is understandable, because the function is not called in the context of a process; the bottom half is asynchronous, just like the top half, the normal interrupt handler. The difference is mainly that interrupts are enabled and there is no critical code in progress. So, when exactly are those routines executed?

As you know, the processor is mostly executing on behalf of a process, both in user space and in kernel space (while executing system calls). The notable exceptions are servicing an interrupt and scheduling another process in place of the current one: during these operations the current pointer is not meaningful, even though it is always a valid pointer to a struct task_struct. Additionally, kernel code is controlling the CPU when a process enters/exits a system call. This happens quite often, as a single piece of code handles every system call.

With this in mind, it is apparent that if you want your bottom half executed as soon as possible, it must be called from the scheduler or when entering or leaving a system call, since doing it during interrupt service is taboo. Actually, Linux calls do_bottom_half(), defined in kernel/softirq.c from schedule() (kernel/sched.c) and from ret_from_sys_call(), defined in an architecture-specific assembly file, usually entry.S.

The bottom halves are not bound to the interrupt number, though the kernel keeps a static array of 32 such functions. Currently (I'm using 1.3.71), there is no way to ask for a free bottom half number or ID, so we will hard-code one. This is dirty coding, but it is used only to show the idea; later we will remove such ID stealing.

In order to execute your bottom half, you need to tell the kernel about it. This is accomplished through the mark_bh() function, which takes one argument, the ID of your bottom half.

This listing shows the code for a split-interrupt handler using the “dirty” ID pseudo-allocation.

#include <linux/interrupt.h>
#define SKEL_BH 16 /* dirty planning */
/*
 * This is the top half, argument to request_irq()
 */
static void skel_interrupt(int irq,
                           struct pt_regs *regs)
{
  do_top_half_stuff()
  /* tell the kernel to run the bh later */
  mark_bh(SKEL_BH);
}
/*
 * This is the bottom half
 */
static void do_skel_bh(void)
{
  do_bottom_half_stuff()
}
/*
 * But the bh must be initialized ...
 */
int init_module(void)
{
  /* ... */
  init_bh(SKEL_BH, do_skel_bh);
  /* ... */
}
/*
 * ... and cleaned up
 */
void cleanup_module(void)
{
  /* ... */
  disable_bh(SKEL_BH)
  /* ... */
}

Using Task Queues

Actually, dirty allocation of a bottom half ID is not really needed, because the kernel implements a more sophisticated mechanism which you will surely enjoy.

This mechanism is called “task queues”, because the functions to be called are kept in queues, constructed of linked lists. This also means you can register more than one bottom half function that is associated with your driver. Moreover, task queues are not directly related to interrupt handling and can be used independently of interrupt management.

A task queue is a chain of struct tq_struct's, as declared in <linux/tqueue.h>.

struct tq_struct {
    /* linked list of active bh's */
    struct tq_struct *next;
    /* must be initialized to zero */
    int sync;
    /* function to call */
    void (*routine)(void *);
    /* argument to function */
    void *data;
};
typedef struct tq_struct * task_queue;
void queue_task(struct tq_struct *bh_pointer,
                task_queue *bh_list);
void run_task_queue(task_queue *list);

You'll notice the routine argument is a function getting a pointer argument. This is a useful feature, as you'll soon discover by yourself, but remember the data pointer is your complete responsibility: if it points to kmalloc()ed memory, you must remember to free it yourself.

Another thing to keep in mind is that the next field is used to keep the queue consistent, so you must be careful never to look in it, and never insert the same tq_struct in multiple queues, nor twice in the same queue.

There are a few other functions similar to queue_task() in the header, which are worth looking at. Here we stick to the most general call.

In order to use a task queue, you will need either to declare your own queue or add tasks to a predefined queue. We are going to explore both methods.

This listing shows how to run multiple bottom halves in your interrupt handler with your own queue.

#include <linux/interrupt.h>
#include <linux/tqueue.h#gt;
DECLARE_TASK_QUEUE(tq_skel);
#define SKEL_BH 16 /* dirty planning */
/*
 * Two different tasks
 */
static struct tq_struct task1;
static struct tq_struct task2;
/*
 * The bottom half only runs the queue
 */
static void do_skel_bh(void)
{
  run_task_queue(&tq_skel);
}
/*
 * The top half queues the different tasks based
 * on some conditions
 */
static void skel_interrupt(int irq,
                           struct pt_regs *regs)
{
  do_top_half_stuff()
  if (condition1()(I) {
    queue_task(&task1, &tq_skel);
    mark_bh(SKEL_BH);
  }
  if (condition2()(I) {
    queue_task(&task2, &tq_skel);
    mark_bh(SKEL_BH);
  }
}
/*
 * And init as usual
 */
int init_module(void)
{
  /* ... */
  task1.routine=proc1; task1.data=arg1;
  task2.routine=proc2; task2.data=arg2;
  init_bh(SKEL_BH, do_skel_bh);
  /* ... */
}
void cleanup_module(void)
{
  /* ... */
  disable_bh(SKEL_BH)
  /* ... */
}

Using task queues is an enjoyable experience and helps keep your code clean. For example, if you are running the skel-machine described in the previous few Kernel Korner columns, you can service multiple hardware devices by using a single interrupt handling function that gets the hardware structure as an argument. This behaviour can be accomplished by having a tq_struct as a member of the Skel_Hw structure. The big advantage here is if multiple devices request attention at nearly the same time, all of them are queued and serviced later all at once (with interrupts enabled). Of course, this only works if the Skel hardware isn't too picky about when interrupts are acknowledged and the interrupt condition is dealt with.

Running Predefined Queues

The kernel proper defines three task queues for your convenience and amusement. A custom driver should normally use one of those queues instead of declaring its own queue, and any driver may use the predefined queues without declaring new ones. The only reason there are special queues is higher performance: queues with smaller IDs are executed first.

The three predefined queues are:

struct tq_struct *tq_timer;
struct tq_struct *tq_scheduler;
struct tq_struct *tq_immediate;

The first is run in relation with kernel timers and won't be discussed here—it is left as an exercise for the reader. The next is run anytime scheduling takes place, and the last is run “immediately” upon exit from the interrupt handler, as a generic bottom half; this will generally be the queue you use in your driver to replace the older bottom half mechanism.

The “immediate” queue is used like tq_skel above. You don't need to choose an ID and declare it, although mark_bh() must still be called with the IMMEDIATE_BH argument as shown below. Correspondingly, the tq_timer queue uses mark_bh(TIMER_BH), but the tq_scheduler queue does not need to be marked to run, and so mark_bh() is not called.

This listing shows an example of using the “immediate” queue:

#include <linux/interrupt.h>
#include <linux/tqueue.h>
/*
 * Two different tasks
 */
static struct tq_struct task1;
static struct tq_struct task2;
/*
 * The top half queues tasks, and no bottom
 * half is there
 */
static void skel_interrupt(int irq,
                           struct pt_regs *regs)
{
  do_top_half_stuff()
  if (condition1()(I) {
    queue_task(&task1,&tq_immediate);
    mark_bh(IMMEDIATE_BH);
    }
  if (condition2()(I) {
    queue_task(&task2,&tq_skel);
    mark_bh(IMMEDIATE_BH);
    }
}
/*
 * And init as usual, but nothing has to be
 * cleaned up
 */
int init_module(void)
{
  /* ... */
  task1.routine=proc1; task1.data=arg1;
  task2.routine=proc2; task2.data=arg2;
  /* ... */
}
An Example: Playing with tq_scheduler

Task queues are quite amusing to play with, but most of us lack “real” hardware needing deferred handling. Fortunately, the implementation of run_task_queue() is smart enough you can enjoy yourself even without suitable hardware.

The good news is run_task_queue() calls queued functions after removing them from the queue. Thus, you can re-insert a task in the queue from within the task itself.

This silly task prints a message only every ten seconds, up to the end of the world. It needs to be registered only once, and then it will arrange its own existence.

struct tq_struct silly;
void silly_task(void *unused)
{
  static unsigned long last;
  if (jiffies/HZ/10 != last) {
    last=jiffies/HZ/10;
    printk(KERN_INFO "I'm there\n");
  }
queue_task(&silly, &tq_scheduler);
/* tq_scheduler doesn't need mark_bh() */
}

If you're thinking about viruses, please hold on, and remember a wise administrator doesn't do anything as root without reading the code beforehand.

But let's leave task queues and begin looking at memory issues.

DMA on the PC—Dirty, Machine-dependent, Awful

Remember the old days of the IBM PC? Yes, those days, when the PC was delivered with 128 KB of RAM, an 8086-processor, tape interface and 360 KB floppy. [You got a whole 128KB? Lucky you! —ED] Those were the days when DMA on ISA bus was considered fast. The idea of DMA is to transfer a block of data from a device to memory or vice versa without letting the CPU do the boring job of moving individual bytes. Instead, after initialization of both the device and the on-board DMA controller of the motherboard, the device signals the DMA controller that it has data to transfer. The DMA controller puts the RAM in a state receiving data from the data bus, the device puts the data on the data bus, and when this has been done, the controller increments an address counter and decrements a length counter, so that further transfers go into the next location.

In those old days this technique was fast, allowing transfer rates of up to 800 kilobytes per second on 16-bit ISA. Today we have transfer rates of 132 MB/second with PCI 2.0. But good old ISA DMA has still its applications: imagine a sound card playing 16-bit coded music at 48 kHz stereo. This would result in 192 kilobytes per second. Transmitting the data by interrupt requesting, say, two words every 20 microseconds would certainly let the CPU drop a whole lot of interrupts. Polling the data (non-interrupt driven data transfer) at that rate would certainly stop the CPU from doing anything useful, too. What we need is a continuous data flow at midrange speed—perfect for DMA. Linux only has to initiate and stop the data transfer. The rest is done by the hardware itself.

We will deal only with ISA DMA in this article—most expansion cards are still ISA, and ISA DMA is fast enough for many applications. However, DMA on the ISA bus has severe restrictions:

  • The hardware and the DMA controllers know nothing about virtual memory—all they can see is physical memory and its addresses (This restriction belongs to all kinds of ISA DMA.) Instead of using a page here and another there and gluing them together in virtual memory, we must allocate continuous blocks of physical memory, and we may not swap them out.

  • Intel-based systems have two DMA controllers, one with the lower four DMA channels allowing byte-wise transfers and one with the upper four channels allowing (faster) word transfer. Both have only a 16-bit address counter and a 16-bit length counter. Therefore, we may not transfer more than 65535 bytes (or words, with the second controller) at once.

  • The address counter represents only the lower 16 bits of the address (A0-A15 on DMA channels 0-3, A1-A16 on channels 4-7). The upper 8 bits of address space are represented in a page register. They will not change during a transfer, meaning transfers can take place only within one 16-bit (64 KB) segment of memory.

  • Upper 8 bits? 24 bits of address space in total? Yes, it's sad, but true. We can only transfer within the lower 16MB of RAM. If your data eventually needs to reach another address, the CPU has to copy it “by hand” again, using the DMA-accessible memory as a “bounce buffer”, as it is called.

Allocating a DMA Buffer

Right, you know the restrictions—so decide now to read on or not!

The first thing we need for DMA is a buffer. The restrictions (lower 16 MB of memory, continuous page-addresses in physical memory) are all fulfilled if you allocate your buffer with:

void  *dma_buf;
dma_buf = kmalloc(buffer_size,
                  GFP_BUFFER | GFP_DMA);

The returned address will never fit the start of a page, as much as you would prefer it do so. The reason is Linux has a quite advanced memory management on used and unused blocks of pages with kmalloc(). It maintains lists of free segments as small as 32 bytes (64 on the DEC Alpha), another list for segments of double size, another for quadruple-size, etc. Every time you free kmalloc()ed memory, Linux will try to join your released segment with a free neighbor segment. If the neighbor is free, too, they are passed into the list of double size, where it is checked again, if it has a free-neighbor, to get into the next order list.

The sizes kmalloc supports at the moment (Linux 1.3.71) range from PAGE_SIZE > 7 to PAGE_SIZE << 5. Each step in the power of 2 is one list, so two joined elements in the one list form one element of the next higher order.

You might wonder why you don't get a simple whole page. That's because at the beginning of every segment, some list information is maintained. This is called the (somewhat misleading) page_descriptor, and its size is currently between 16 and 32 bytes (depending on your machine type). Therefore, if you ask Linux for 64KB of RAM, Linux will have to use a block of free 128KB RAM and hand you 128 kilobytes to 32 bytes.

And this is difficult to get—the floppy driver sometimes dreams of this, when it tries to allocate its DMA buffer at run time and can't find any continuous RAM. Therefore, always think in powers of two, but then subtract some bytes (about 0x20). If you want to look at the magic numbers, they're all in mm/kmalloc.c.

The Role of Interrupts

Most devices using DMA will generate interrupts. For example, a sound card will generate an interrupt to tell the CPU, “Gimme data, I'm running out of it.”

The machine we built our driver for is quite fascinating: it is a laboratory interface with its own CPU, own RAM, analog and digital inputs and outputs, and bells and whistles. It uses a character channel for receiving commands and sending answers and uses DMA for block transfer of sampled data (or data to output). Therefore, an invoked interrupt might have these reasons:

  • send data on the character channel

  • read data from the character channel

  • initiate a DMA-transfer to or from the host

  • the transfer is done

  • the transfer is going to cross the boundary of the DMA page register (DMA page-register has to be incremented)

Your interrupt handler has to find out what is meant by the interrupt. Normally, it will read a status register of the device having more detailed information on the task to perform.

As you see, we have gone far away from the general truth of a clean loading and unloading of a module and are right in the middle of dirty specialization. We decided our lab driver should perform the following tasks:

  • Allocate DMA-able memory on user request and map this memory to the user space (the user has direct access to the DMA buffer and can “see” how it fills up—this is faster and more flexible than first capturing the data and then transferring it via, say, a character channel to the user).

  • Check, whenever a transfer is required, if the buffer address is valid (was allocated by our driver) and if the length is sufficient.

  • And, of course, do not forget to release the memory when the user closes the character channel, even if it has not been released explicitly.

This concept differs from the floppy driver, where you will never look directly at the actual DMA buffer. But it is probably good for you as well: you might decide to use this method for a frame grabber. The user might allocate multiple buffers, set up the transfer address for the one and look at the other while the first is filled up. As the only thing the user and the system have to do is toggle the sampling address and the buffers are both mapped into the user address space, not a single byte of the picture has to be transferred by the CPU—they just arrive at the location where the user wants them. We will describe the tricks to do this in the next Kernel Korner.

Before we start with real code, let us think of the steps to be taken for a complete transfer:

  1. An interrupt occurs, meaning a transfer should start.

  2. The interrupt handler starts the transfer.

  3. The interrupt handler returns, while the CPU starts its normal activity and the transfer is running.

  4. Interrupt occurs, meaning the transfer is finished.

  5. The interrupt handler ends the transfer...

  6. ...and probably asks the device for another one, wakes up sleeping beauties, etc.

Don't be disappointed that we don't write your whole device driver—things will be very different in different situations! Here's the code for step 2) and 5):

static int skel_dma_start (unsigned long dma_addr,
                           int dma_chan,
                           unsigned long dma_len,
                           int want_to_read) {
    unsigned long flags;
    if (!dma_len || dma_len > 0xffff)
        /* Invalid length */
        return -EINVAL;
    if (dma_addr & 0xffff !=
       (dma_addr + dma_len) & 0xffff)
        /* We would cross a 64kb-segment */
        return -EINVAL;
    if (dma_addr + dma_len > MAX_DMA_ADDRESS)
        /* Only lower 16 MB */
        return -EINVAL;
    /* Don't need any irqs here... */
    save_flags (flags); cli ();
    /* Get a well defined state */
    disable_dma (dma_chan);
    clear_dma_ff (dma_chan);
    set_dma_mode (dma_chan,
                  want_to_read ?
                  /* we want to get data */
                  DMA_MODE_READ
                  /* we want to send data */
                 : DMA_MODE_WRITE);
    set_dma_addr (dma_chan, dma_addr);
    set_dma_count (dma_chan, dma_len);
    enable_dma (dma_chan);
    restore_flags (flags);
    return 0;
}
static void skel_dma_stop (int dma_chan) {
    disable_dma (dma_chan);
}

Sorry, we can't give you a more detailed code here, as you have to decide for yourself how to include DMA in your driver. As usual, the best way to get things working is to look at some working implementation as a starting point.

Deeper and Further

If you want to go deeper with the topics just described, the best teacher is the source, as usual. Split-half interrupt handlers and task queues are used throughout the mainstream kernel, while the DMA implementation shown here is taken from our ceddrv-0.17, available by ftp from tsx-11.mit.edu.

The next installment will come back to more concrete issues—we realize that both DMA and task queues may appear to be rather esoteric topics. We'll show how mmap() works and how a driver may implement its semantics.

Alessandro Rubini is a 27-year-old Linuxer with a taste for the practical side of computer science and a tendency to delay sleeping. This helps him meet deadlines by exploiting time-zone offsets.

Georg V. Zezschwitz is also 27-year-old Linuxer with the same taste for the practical side of computer science and a tendency to delay sleeping.

Load Disqus comments