Today, we’re going to look at the many ways of implementing user-to-kernel transitions on x86, i.e. system calls. Let’s first quickly review what system calls actually need to accomplish.
In modern operating systems there is a distinction between user mode (executing normal application code) and kernel mode[1] (being able to touch system configuration and devices). System calls are the way for applications to request services from the operating system kernel and bridge the gap. To facilitate that, the CPU needs to provide a mechanism for applications to securely transition from user to kernel mode.
Secure in this context means that the application cannot just jump to arbitrary kernel code, because that would effectively allow the application to do what it wants on the system. The kernel must be able to configure defined entry points and the system call mechanism of the processor must enforce these. After the system call is handled, the operating system also needs to know where to return to in the application, so the system call mechanism also has to provide this information.
- Software Interrupts using the int instruction
- Call Gates
- Fast system calls using sysenter/sysexit
- Fast system calls using syscall/sysret
Software interrupts are the oldest mechanism. The key idea is to use the same method to enter the kernel as hardware interrupts do. In essence, it is still the mechanism that was introduced with Protected Mode in 1982 on the 286, but even the earlier CPUs already had cruder versions of this.
Because interrupt vector 0x80 can still be used to invoke system calls[2] on 64-bit Linux, we are going to stick with this example:
The processor finds the kernel entry address by taking the interrupt vector number from the int instruction and looking up the corresponding descriptor in the Interrupt Descriptor Table (IDT). This descriptor will be an Interrupt or Trap Gate[3] to kernel mode and it contains the pointer to the handling function in the kernel.
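To make the IDT lookup a bit more concrete, here is a small Python sketch that packs and unpacks a 64-bit mode gate descriptor (16 bytes per vector). It is a software model, not kernel code. The helper names (`encode_gate`, `decode_gate`, `lookup_vector`), the handler address and the selector value are made up for illustration.

```python
import struct
from dataclasses import dataclass

# Gate types that matter for this discussion (64-bit mode encodings).
GATE_TYPES = {0xE: "interrupt gate", 0xF: "trap gate"}
GATE_FORMAT = "<HHBBHII"  # 16 bytes: offset is scattered over three fields


@dataclass
class Gate:
    offset: int     # address of the handling function in the kernel
    selector: int   # code segment selector loaded into CS
    gate_type: str
    dpl: int        # least privileged level that may use `int` on this vector
    present: bool


def encode_gate(offset, selector, gate_type=0xE, dpl=0, present=True):
    """Pack a gate descriptor the way it is laid out in memory."""
    type_attr = (0x80 if present else 0) | (dpl << 5) | gate_type
    return struct.pack(
        GATE_FORMAT,
        offset & 0xFFFF,               # offset bits 15:0
        selector,
        0,                             # IST index, unused in this sketch
        type_attr,
        (offset >> 16) & 0xFFFF,       # offset bits 31:16
        (offset >> 32) & 0xFFFFFFFF,   # offset bits 63:32
        0,                             # reserved
    )


def decode_gate(raw):
    """Reassemble the fields the processor needs from one 16-byte entry."""
    low, selector, _ist, type_attr, mid, high, _reserved = struct.unpack(GATE_FORMAT, raw)
    return Gate(
        offset=low | (mid << 16) | (high << 32),
        selector=selector,
        gate_type=GATE_TYPES.get(type_attr & 0xF, "other"),
        dpl=(type_attr >> 5) & 3,
        present=bool(type_attr & 0x80),
    )


def lookup_vector(idt, vector):
    """Index the table with the vector number taken from the int instruction."""
    return decode_gate(idt[vector * 16:(vector + 1) * 16])


if __name__ == "__main__":
    idt = bytearray(256 * 16)
    # Hypothetical handler address and kernel code selector.
    idt[0x80 * 16:0x81 * 16] = encode_gate(0xFFFFFFFF81000000, selector=0x10, dpl=3)
    print(lookup_vector(bytes(idt), 0x80))
```

The gate for a system call vector needs a DPL of 3 here, because otherwise user code would not be allowed to trigger it with a software interrupt.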
These kinds of transitions between different privilege levels using gates cause the processor to switch the stack as well. The stack pointer for the kernel privilege level is kept in the Task State Segment[4]. After switching to the new stack, the processor pushes (among other information) the return address and the user’s stack pointer onto the kernel stack. A typical handler routine in the kernel would then continue with pushing general purpose registers on the stack as well to preserve them. The data structure that is created on the stack during this process is called the interrupt frame.
To return to userspace[5], the kernel executes an iret (interrupt return) instruction after restoring the general purpose registers. iret restores the user’s stack and execution continues after the int instruction that entered the kernel in the first place. Despite the short description here, iret is one of the most complicated instructions in the x86 instruction set.
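The entry and return paths described above can be sketched as a toy model in Python. It assumes 64-bit mode, where the hardware part of the interrupt frame consists of SS, RSP, RFLAGS, CS and RIP. The `Cpu` class, the function names and all selector and address values are illustrative assumptions. Privilege checks and the exact value the real hardware loads into SS are left out.

```python
from dataclasses import dataclass, field

IF_FLAG = 1 << 9  # interrupt enable bit in RFLAGS


@dataclass
class Cpu:
    rip: int     # assumed to already point after the int instruction
    rsp: int
    rflags: int
    cs: int
    ss: int
    stack: dict = field(default_factory=dict)  # address -> 64-bit value

    def push(self, value):
        self.rsp -= 8
        self.stack[self.rsp] = value

    def pop(self):
        value = self.stack.pop(self.rsp)
        self.rsp += 8
        return value


def software_interrupt(cpu, handler, kernel_cs, kernel_ss, kernel_rsp,
                       masks_interrupts=True):
    """Model the hardware side of entering the kernel through a gate."""
    user_state = (cpu.ss, cpu.rsp, cpu.rflags, cpu.cs, cpu.rip)
    # Stack switch: the kernel stack pointer comes from the TSS.
    cpu.ss, cpu.rsp = kernel_ss, kernel_rsp
    # The interrupt frame: SS, RSP, RFLAGS, CS, RIP (RIP ends up on top).
    for value in user_state:
        cpu.push(value)
    cpu.cs, cpu.rip = kernel_cs, handler
    if masks_interrupts:  # interrupt gate; a trap gate leaves IF alone
        cpu.rflags &= ~IF_FLAG


def iret(cpu):
    """Pop the interrupt frame and return the privilege level we land in."""
    rip, cs, rflags, rsp, ss = (cpu.pop() for _ in range(5))
    cpu.rip, cpu.cs, cpu.rflags, cpu.rsp, cpu.ss = rip, cs, rflags, rsp, ss
    return cs & 3  # the low two bits of the popped CS select the target ring


if __name__ == "__main__":
    cpu = Cpu(rip=0x401002, rsp=0x7FFD0000, rflags=0x202, cs=0x33, ss=0x2B)
    software_interrupt(cpu, handler=0xFFFFFFFF81000000, kernel_cs=0x10,
                       kernel_ss=0x18, kernel_rsp=0xFFFFC90000004000)
    print("in kernel, rsp =", hex(cpu.rsp))
    print("iret returns to ring", iret(cpu))
```

A real handler would push the general purpose registers right after entry and pop them again before iret. The model skips that because it is plain software convention rather than something the processor does.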
Our second mechanism, the Call Gate, is very similar to using software interrupts. Although Call Gates are the somewhat official way of implementing system calls in the absence of the more modern alternatives discussed below, I’m aware of no use except by malware.
I’ve highlighted the differences between the software interrupt flow and Call Gate traversal here:
Instead of int, the user initiates the system call by executing a far call. Far calls are left-overs from the x86 segmented memory model where a call instruction doesn’t only specify the instruction pointer to go to, but also refers to the memory segment the instruction pointer is relative to using a segment selector (0x18 in the example).
The processor looks up the corresponding segment in the Global Descriptor
Table and in our case
finds a Call Gate instead of an ordinary segment. The Call Gate specifies
the instruction pointer in the kernel just as the Interrupt Gate does. The
processor ignores the instruction pointer provided by the
call instruction in
this case. The rest works similarly to the software interrupt case, except that
the kernel has to use a different instruction for the return path because of a
different stack frame layout created by the hardware.
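As a rough illustration of the far call path, the following Python sketch splits a selector such as 0x18 into its fields and shows why the offset in the instruction is irrelevant when the selected entry is a Call Gate. The descriptor table is modelled as a plain list of dictionaries, which is an assumption made for readability. Real descriptors are packed bit fields, and all privilege checks are omitted.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int   # which descriptor table entry is meant
    table: str   # "GDT" or "LDT"
    rpl: int     # requested privilege level


def decode_selector(selector):
    """Split a 16-bit segment selector into index, table indicator and RPL."""
    return Selector(
        index=selector >> 3,
        table="LDT" if selector & 0x4 else "GDT",
        rpl=selector & 0x3,
    )


def far_call(gdt, selector, offset):
    """Return the address execution continues at after a far call."""
    entry = gdt[decode_selector(selector).index]
    if entry["kind"] == "call_gate":
        # The gate dictates the entry point; the offset is ignored.
        return entry["target"]
    # Ordinary code segment: the offset is relative to the segment base.
    return entry["base"] + offset


if __name__ == "__main__":
    # Hypothetical table layout; entry 3 is what selector 0x18 refers to.
    gdt = [
        {"kind": "null"},
        {"kind": "code", "base": 0},
        {"kind": "data", "base": 0},
        {"kind": "call_gate", "target": 0xFFFFFFFF81000000},
    ]
    print(decode_selector(0x18))
    print(hex(far_call(gdt, 0x18, 0xDEADBEEF)))
```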
As you’ll see below, both of these kernel entry methods are dog slow. For both the interrupt and call gate traversals, the processor re-loads code and stack segment registers from the GDT. Each descriptor load is very expensive, because the processor has to decipher a fairly messed up data structure and perform many checks.
Many of these checks are pointless in modern operating systems. The features provided by the segmented memory model and the hierarchical protection domains they enable are not used. Instead of disjoint memory segments, every segment has a base of zero and the maximum size. Protection is realized via paging and the protection rings are only used to implement kernel and user mode.
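To give an idea of what the processor has to decipher on each descriptor load, here is a hedged Python sketch that decodes a legacy 8-byte segment descriptor. The two example values are the commonly seen encodings of flat 32-bit code and data segments. The function name and the returned dictionary are my own choices, and only a subset of the fields is decoded.

```python
def decode_segment(descriptor):
    """Pull base, limit and a few attributes out of an 8-byte descriptor."""
    # Base and limit are each split over two non-adjacent bit ranges.
    limit = (descriptor & 0xFFFF) | (((descriptor >> 48) & 0xF) << 16)
    base = ((descriptor >> 16) & 0xFFFFFF) | (((descriptor >> 56) & 0xFF) << 24)
    access = (descriptor >> 40) & 0xFF
    granular = bool((descriptor >> 55) & 1)
    if granular:
        # With the granularity bit set, the limit counts 4 KiB pages.
        limit = (limit << 12) | 0xFFF
    return {
        "base": base,
        "limit": limit,
        "dpl": (access >> 5) & 3,
        "present": bool(access & 0x80),
        "code": bool(access & 0x08),
    }


if __name__ == "__main__":
    flat_kernel_code = 0x00CF9A000000FFFF
    flat_user_data = 0x00CFF2000000FFFF
    for descriptor in (flat_kernel_code, flat_user_data):
        print(hex(descriptor), decode_segment(descriptor))
```

Both descriptors decode to a base of zero and a limit of 0xFFFFFFFF, so the only thing that still differs between them is the privilege level and whether the segment holds code or data.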
The solution to this problem comes in an Intel and an AMD flavor:
sysenter/sysexit is the instruction pair for fast system calls that Intel introduced with the Pentium II in 1997. AMD came up with a similar but incompatible instruction pair, syscall/sysret, with the K6-2.
Both of these instruction pairs work pretty much the same. Instead of having to consult descriptor tables in memory for what to do, most of the functionality is hardcoded and the unused flexibility is lost:
sysenter and syscall assume a flat memory model and load segment descriptors with fixed values. They are also incompatible with any non-standard use of the privilege levels.
A kernel-accessible model-specific
register (MSR) points to
the kernel’s system call entry point.
sysenter also switches to the kernel
stack in this way. The processor does not have to interpret data structures in
memory. The return address is left in a general-purpose register for the kernel
to save wherever it wants to.
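The following Python sketch models what syscall and sysret do in 64-bit mode, to show how little state is involved compared to a gate traversal. It is a simplification: registers and MSRs are plain dictionaries, the MSR contents are hypothetical example values, and details such as canonical address checks or the exact masking of RFLAGS on return are ignored.

```python
# MSR numbers used by syscall/sysret in 64-bit mode.
MSR_STAR = 0xC0000081   # selector bases for kernel and user segments
MSR_LSTAR = 0xC0000082  # kernel entry point
MSR_FMASK = 0xC0000084  # RFLAGS bits to clear on entry


def syscall(regs, msrs):
    """Enter the kernel: no memory accesses, no descriptor table lookups."""
    regs["rcx"] = regs["rip"]        # return address, assumed to point after syscall
    regs["r11"] = regs["rflags"]     # saved flags
    regs["rip"] = msrs[MSR_LSTAR]
    regs["rflags"] &= ~msrs[MSR_FMASK]
    kernel_cs = (msrs[MSR_STAR] >> 32) & 0xFFFC
    regs["cs"] = kernel_cs
    regs["ss"] = kernel_cs + 8
    # RSP is left untouched: the kernel has to switch stacks itself.


def sysret(regs, msrs):
    """Return to 64-bit user code using the values syscall left in registers."""
    user_base = (msrs[MSR_STAR] >> 48) & 0xFFFF
    regs["rip"] = regs["rcx"]
    regs["rflags"] = regs["r11"]     # simplified: real hardware masks a few bits
    regs["cs"] = (user_base + 16) | 3
    regs["ss"] = (user_base + 8) | 3


if __name__ == "__main__":
    # Hypothetical MSR setup: entry point, selector bases, mask that clears IF.
    msrs = {
        MSR_LSTAR: 0xFFFFFFFF81000000,
        MSR_STAR: (0x23 << 48) | (0x10 << 32),
        MSR_FMASK: 0x200,
    }
    regs = {"rip": 0x401002, "rsp": 0x7FFD0000, "rflags": 0x202,
            "cs": 0x33, "ss": 0x2B, "rcx": 0, "r11": 0}
    syscall(regs, msrs)
    print({name: hex(value) for name, value in regs.items()})
    sysret(regs, msrs)
    print({name: hex(value) for name, value in regs.items()})
```

Note that in this model the user stack pointer survives the entry unchanged, which is the counterpart to the return address being left in a general-purpose register: the kernel decides where to store both.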
I’ve done microbenchmarks (code is on GitHub) of the cost of these mechanisms[7]. I’ve measured the cost of entering and exiting the kernel with an empty system call handler in the kernel, so we are just looking at the cost the hardware imposes on the system call path. The difference in performance is striking:
Using sysenter/sysexit or syscall/sysret to perform a system call is an order of magnitude faster than using traditional methods. Both modern methods cost around 70 cycles per roundtrip from user to kernel and back. This is less than a single 64-bit integer division!
[2] Linux needs to offer int 0x80 for compatibility with ancient 32-bit applications and it takes no steps to prevent 64-bit applications from using it. This is weird, because all 64-bit CPUs have at least support for the much faster
[3] The only difference between an interrupt gate and a trap gate is that the former causes the processor to mask interrupts when traversing the gate. This is largely irrelevant for our discussion.
[4] The TSS is a vestige of hardware-supported task switching, also introduced with the 286. This feature was never really used and AMD neutered it when they designed the 64-bit extension to x86.
[5] iret can also return to kernel mode. This is decided depending on the segment selectors that are pushed as part of