Today, we’re going to look at the many ways of implementing user-to-kernel transitions on x86, i.e. system calls. Let’s first quickly review what system calls actually need to accomplish.

In modern operating systems there is a distinction between user mode (executing normal application code) and kernel mode [1] (being able to touch system configuration and devices). System calls are the way for applications to request services from the operating system kernel and bridge the gap. To facilitate that, the CPU needs to provide a mechanism for applications to securely transition from user to kernel mode.

Secure in this context means that the application cannot just jump to arbitrary kernel code, because that would effectively allow the application to do what it wants on the system. The kernel must be able to configure defined entry points and the system call mechanism of the processor must enforce these. After the system call is handled, the operating system also needs to know where to return to in the application, so the system call mechanism also has to provide this information.

I came up with four mechanisms that match this description and work in 64-bit environments. I’m going to save the weirder ones that only work in 32-bit mode for another post. So we have:

  1. Software Interrupts using the int instruction
  2. Call Gates
  3. Fast system calls using sysenter/sysexit
  4. Fast system calls using syscall/sysret

Software interrupts are the oldest mechanism. The key idea is to use the same method to enter the kernel as hardware interrupts do. In essence, it is still the mechanism that was introduced with Protected Mode in 1982 on the 286, but even the earlier CPUs already had cruder versions of this.

Because interrupt vector 0x80 can still be used to invoke system calls [2] on 64-bit Linux, we are going to stick with this example:

Kernel Entry using a Software Interrupt

The processor finds the kernel entry address by taking the interrupt vector number from the int instruction and looking up the corresponding descriptor in the Interrupt Descriptor Table (IDT). This descriptor will be an Interrupt or Trap Gate [3] to kernel mode and it contains the pointer to the handling function in the kernel.
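
To make this concrete, here is a minimal sketch of what invoking the legacy int 0x80 entry looks like from a 64-bit Linux program. The inline assembly and constraints are my own illustration; note that this path dispatches through the 32-bit system call table, so the call number differs from the native 64-bit ABI.

```c
/* Minimal sketch: invoking the legacy int 0x80 entry on 64-bit Linux.
 * int 0x80 uses the 32-bit syscall table, so 20 here is the 32-bit
 * __NR_getpid, not the 64-bit one. */
#include <stdio.h>

int main(void)
{
    long pid;
    asm volatile("int $0x80"
                 : "=a"(pid)   /* result comes back in eax */
                 : "a"(20)     /* 20 = getpid in the 32-bit syscall table */
                 : "memory");
    printf("getpid via int 0x80: %ld\n", pid);
    return 0;
}
```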

These kinds of transitions between different privilege levels using gates cause the processor to switch the stack as well. The stack pointer for the kernel privilege level is kept in the Task State Segment [4]. After switching to the new stack, the processor pushes (among other information) the return address and the user’s stack pointer onto the kernel stack. A typical handler routine in the kernel would then continue with pushing general purpose registers on the stack as well to preserve them. The data structure that is created on the stack during this process is called the interrupt frame.
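
As an illustration, the part of that frame the hardware itself pushes on a user-to-kernel transition on x86-64 looks roughly like this (the struct and field names are mine, listed from the lowest stack address upward):

```c
/* Hardware-pushed portion of the interrupt frame on x86-64 when an
 * interrupt/trap gate is traversed from user mode (ring 3) to kernel
 * mode (ring 0). The handler pushes general purpose registers on top. */
struct interrupt_frame {
    unsigned long rip;    /* where to resume in user code */
    unsigned long cs;     /* user code segment selector */
    unsigned long rflags; /* saved flags */
    unsigned long rsp;    /* user stack pointer */
    unsigned long ss;     /* user stack segment selector */
};
```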

To return to userspace [5], the kernel executes an iret (interrupt return) instruction after restoring the general purpose registers. iret restores the user’s stack, and execution continues after the int instruction that entered the kernel in the first place. Despite the short description here, iret is one of the most complicated instructions in the x86 instruction set.
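
If you want to see the shape of such a handler without writing assembly, GCC’s x86 interrupt attribute will generate the register saving and the final iretq for you. This is only a sketch (a real kernel entry path does considerably more), and it needs to be compiled with -mgeneral-regs-only:

```c
/* Sketch: a trivial handler built with GCC's x86 "interrupt" attribute.
 * The compiler saves/restores the registers the function clobbers and
 * returns with iretq instead of ret. */
struct interrupt_frame;   /* hardware-pushed frame, as described above */

__attribute__((interrupt))
void int80_handler(struct interrupt_frame *frame)
{
    (void)frame;
    /* a real kernel would dispatch to the system call implementation here */
}
```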

Our second mechanism, the Call Gate, is very similar to using software interrupts. Although Call Gates are the somewhat official way of implementing system calls in the absence of the more modern alternatives discussed below, I’m not aware of any use except by malware.

I’ve highlighted the differences between the software interrupt flow and Call Gate traversal here:

Kernel Entry using a Call Gate

Instead of int, the user initiates the system call by executing a far call. Far calls are leftovers from the x86 segmented memory model, where a call instruction doesn’t just specify the instruction pointer to go to, but also names the memory segment that instruction pointer is relative to, using a selector (0x18 in the example).

The processor looks up the corresponding segment in the Global Descriptor Table (GDT) and in our case finds a Call Gate instead of an ordinary segment. The Call Gate specifies the instruction pointer in the kernel just as the Interrupt Gate does; the processor ignores the instruction pointer provided by the call instruction in this case. The rest works similarly to the software interrupt case, except that the kernel has to use a different return instruction (a far return instead of iret) because the hardware creates a different stack frame layout.
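
For illustration, a far call through a call gate could look roughly like the sketch below. This assumes the kernel has actually installed a call gate at GDT selector 0x18, which mainstream kernels do not do, so running it as-is would simply fault. In 64-bit mode only the indirect far-call form is encodable, so the selector:offset pair has to live in memory.

```c
/* Sketch: far call through a (hypothetical) call gate at selector 0x18.
 * The offset part of the operand is ignored when the selector names a
 * call gate; the processor takes the target RIP from the gate instead. */
#include <stdint.h>

struct __attribute__((packed)) far_ptr {
    uint32_t offset;    /* ignored for call gates */
    uint16_t selector;  /* hypothetical call-gate selector */
};

static void call_gate_syscall(void)
{
    struct far_ptr target = { .offset = 0, .selector = 0x18 };
    /* indirect far call (AT&T syntax); raises #GP unless such a gate exists */
    asm volatile("lcall *%0" : : "m"(target) : "memory");
}
```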

As you’ll see below, both of these kernel entry methods are dog slow. For both the interrupt and call gate traversals, the processor re-loads code and stack segment registers from the GDT. Each descriptor load is very expensive, because the processor has to decipher a fairly messed up data structure and perform many checks.

Many of these checks are pointless in modern operating systems. The features provided by the segmented memory model and the hierarchical protection domains they enable are not used. Instead of disjoint memory segments, every segment has a base of zero and the maximum size. Protection is realized via paging and the protection rings are only used to implement kernel and user mode.

The solution to this problem comes in an Intel and an AMD flavor: sysenter/sysexit is Intel’s instruction pair for fast system calls, introduced with the Pentium II in 1997. AMD came up with a similar but incompatible pair, syscall/sysret, with the K6-2 in 1998 [6].

Both of these instruction pairs work pretty much the same. Instead of having to consult descriptor tables in memory for what to do, most of the functionality is hardcoded and the unused flexibility is lost: sysenter and syscall assume a flat memory model and load segment descriptors with fixed values. They are also incompatible with any non-standard use of the privilege levels.

A model-specific register (MSR), which only the kernel can write, points to the kernel’s system call entry point. sysenter also switches to the kernel stack this way, while syscall leaves the stack switch to the kernel’s entry code. In contrast to the Call Gate or software interrupt case, the processor does not have to interpret data structures in memory.
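
As a sketch of what this looks like on the kernel side for the syscall flavor, the entry point goes into the LSTAR MSR via wrmsr. The MSR numbers are from the architecture manuals; the entry stub and setup function names are hypothetical.

```c
/* Sketch: installing a 64-bit syscall entry point (kernel-mode code). */
#include <stdint.h>

#define MSR_STAR    0xC0000081u  /* CS/SS selector bases for syscall/sysret */
#define MSR_LSTAR   0xC0000082u  /* RIP loaded by syscall in 64-bit mode */
#define MSR_SFMASK  0xC0000084u  /* RFLAGS bits cleared on syscall entry */

static inline void wrmsr(uint32_t msr, uint64_t value)
{
    asm volatile("wrmsr" : : "c"(msr),
                 "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
}

extern void syscall_entry(void);  /* hypothetical assembly entry stub */

void setup_syscall_msrs(void)
{
    wrmsr(MSR_LSTAR, (uint64_t)syscall_entry);
    wrmsr(MSR_SFMASK, 1u << 9);  /* clear IF so the entry runs with interrupts off */
    /* MSR_STAR additionally needs the kernel/user segment selector bases. */
}
```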

Saving the instruction pointer where the kernel should return to works differently with sysenter and syscall as well. syscall saves the instruction pointer in the RCX register. sysenter leaves it to the caller to specify where the system call should return to.
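
For comparison with the int 0x80 sketch above, here is a minimal sketch of a native 64-bit system call. The clobber list includes rcx and r11 precisely because syscall stashes the return RIP and the saved RFLAGS there.

```c
/* Minimal sketch: a native 64-bit system call via the syscall
 * instruction (Linux ABI: call number in rax, 39 = getpid). */
#include <stdio.h>

int main(void)
{
    long pid;
    asm volatile("syscall"
                 : "=a"(pid)
                 : "a"(39)                  /* __NR_getpid on x86-64 */
                 : "rcx", "r11", "memory"); /* clobbered by the kernel entry */
    printf("getpid via syscall: %ld\n", pid);
    return 0;
}
```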

I’ve done microbenchmarks (code is on GitHub) of the cost of these mechanisms [7]. I’ve measured the cost of entering and exiting the kernel with an empty system call handler in the kernel, so we are just looking at the cost the hardware imposes on the system call path. The difference in performance is striking:

Kernel Entry Microbenchmarks

Using either syscall/sysret or sysenter/sysexit to perform a system call is an order of magnitude faster than using the traditional methods. Both modern methods cost around 70 cycles per roundtrip from user to kernel and back. That is less than a single 64-bit integer division!
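
If you just want a rough feel for these numbers without the full benchmark harness, a simple timestamp-counter loop gets you in the ballpark. This is my illustration, not the author’s benchmark: it times a real getpid rather than an empty handler, and a careful measurement would also pin the CPU, serialize the TSC reads, and subtract loop overhead.

```c
/* Rough illustration: cycles per system call roundtrip via rdtsc. */
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    const long iterations = 1000000;
    uint64_t start = rdtsc();
    for (long i = 0; i < iterations; i++)
        syscall(SYS_getpid);   /* cheap syscall, bypasses any libc caching */
    uint64_t end = rdtsc();
    printf("~%lu cycles per getpid roundtrip (incl. kernel-side work)\n",
           (unsigned long)((end - start) / iterations));
    return 0;
}
```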

No description of sysenter would be complete without mentioning the sharp edges of using it. Check out these two issues that can ruin your day if you implement system call paths.

Footnotes

  1. Supervisor and supervisor mode are rarely used synonyms for kernel and kernel mode, but you will find them in the Intel SDM. It also explains the term hypervisor.

  2. Linux needs to offer int 0x80 for compatibility with ancient 32-bit applications, and it takes no steps to prevent 64-bit applications from using it. This is weird, because all 64-bit CPUs at least support the much faster syscall.

  3. The only difference between an interrupt and a trap gate is that the former causes the processor to mask interrupts when traversing the gate. This is largely irrelevant for our discussion.

  4. The TSS is a vestige of the hardware task-switching support that was also introduced with the 286. This feature was never really used, and AMD neutered it when they designed the 64-bit extension to x86.

  5. iret can also return to kernel mode. This is decided by the segment selectors that are pushed as part of the int entry.

  6. I have used publicly available CPUID dumps to research when these features appeared. 

  7. The benchmarks were performed on an Intel Xeon E3-1270 v5 (Skylake client) running at 3.6 GHz in a VM, but as none of these operations cause VM exits, the numbers should be comparable on bare metal.