1001 Ways of Implementing a System Call
Today, we’re going to look at the many ways of implementing user-to-kernel transitions on x86, i.e. system calls. Let’s first quickly review what system calls actually need to accomplish.
In modern operating systems there is a distinction between user mode (executing normal application code) and kernel mode[1] (being able to touch system configuration and devices). System calls are the way for applications to request services from the operating system kernel and bridge the gap. To facilitate that, the CPU needs to provide a mechanism for applications to securely transition from user to kernel mode.
Secure in this context means that the application cannot just jump to arbitrary kernel code, because that would effectively allow the application to do what it wants on the system. The kernel must be able to configure defined entry points and the system call mechanism of the processor must enforce these. After the system call is handled, the operating system also needs to know where to return to in the application, so the system call mechanism also has to provide this information.
I came up with four mechanisms that match this description and work in 64-bit environments. I’m going to save the weirder ones that only work on 32-bit for another post. So we have:
- Software Interrupts using the `int` instruction
- Call Gates
- Fast system calls using `sysenter`/`sysexit`
- Fast system calls using `syscall`/`sysret`
Software interrupts are the oldest mechanism. The key idea is to use the same method to enter the kernel as hardware interrupts do. In essence, it is still the mechanism that was introduced with Protected Mode in 1982 on the 286, but even the earlier CPUs already had cruder versions of this.
Because interrupt vector 0x80 can still be used to invoke system calls[2] on 64-bit Linux, we are going to stick with this example:
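As a minimal sketch, such an invocation from a 64-bit process could look like the following GCC inline assembly. Note that this vector dispatches into the 32-bit syscall table, where number 4 is `write`, and pointer arguments must fit in 32 bits:

```c
/* Sketch: write(2) via the legacy int 0x80 vector on 64-bit Linux. */
static long int80_write(int fd, const char *buf, unsigned long len) {
    long ret;
    asm volatile ("int $0x80"
                  : "=a"(ret)     /* return value comes back in eax */
                  : "a"(4),       /* 32-bit __NR_write              */
                    "b"((long)fd), "c"(buf), "d"(len)
                  : "memory");
    return ret;
}
```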
The processor finds the kernel entry address by taking the interrupt vector number from the `int` instruction and looking up the corresponding descriptor in the Interrupt Descriptor Table (IDT). This descriptor will be an Interrupt or Trap Gate[3] to kernel mode and it contains the pointer to the handling function in the kernel.
These kinds of transitions between different privilege levels using gates cause the processor to switch the stack as well. The stack pointer for the kernel privilege level is kept in the Task State Segment[4]. After switching to the new stack, the processor pushes (among other information) the return address and the user’s stack pointer onto the kernel stack. A typical handler routine in the kernel would then continue with pushing general purpose registers on the stack as well to preserve them. The data structure that is created on the stack during this process is called the interrupt frame.
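The hardware-pushed part of that frame always has the same five-quadword layout in 64-bit mode. As a sketch (the struct and field names are mine, not from any header):

```c
#include <stdint.h>

/* Sketch of what the CPU itself pushes for a user-to-kernel interrupt or
   trap in 64-bit mode, from lowest to highest stack address. */
struct hw_interrupt_frame {
    uint64_t rip;    /* user return address                 */
    uint64_t cs;     /* user code segment selector          */
    uint64_t rflags; /* flags at the time of the transition */
    uint64_t rsp;    /* user stack pointer                  */
    uint64_t ss;     /* user stack segment selector         */
};
```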
To return to userspace[5], the kernel executes an `iret` (interrupt return) instruction after restoring the general purpose registers. `iret` restores the user’s stack and execution continues after the `int` instruction that entered the kernel in the first place. Despite the short description here, `iret` is one of the most complicated instructions in the x86 instruction set.
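Putting both halves together, a handler stub might look roughly like this sketch. The label `int80_entry` and the C dispatcher `handle_syscall` are made up, and a real stub would save all general purpose registers, not just the three shown:

```c
/* Hypothetical entry stub as GCC toplevel assembly. The CPU has already
   pushed SS, RSP, RFLAGS, CS and RIP; the stub preserves the registers
   the C handler may clobber, then returns with iretq. */
asm(
    ".global int80_entry\n"
    "int80_entry:\n"
    "    push %rdi\n"
    "    push %rsi\n"
    "    push %rdx\n"
    "    call handle_syscall\n"   /* hypothetical C-level dispatcher */
    "    pop  %rdx\n"
    "    pop  %rsi\n"
    "    pop  %rdi\n"
    "    iretq\n"                 /* pops RIP, CS, RFLAGS, RSP and SS */
);
```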
Our second mechanism, the Call Gate, is very similar to using software interrupts. Although Call Gates are the somewhat official way of implementing system calls in the absence of the more modern alternatives discussed below, I’m aware of no use except by malware.
The differences between the software interrupt flow and Call Gate traversal are as follows:
Instead of `int`, the user initiates the system call by executing a far call. Far calls are left-overs from the x86 segmented memory model, where a call instruction doesn’t only specify the instruction pointer to go to, but also refers to the memory segment the instruction pointer is relative to using a selector (`0x18` in the example).
The processor looks up the corresponding segment in the Global Descriptor Table and in our case finds a Call Gate instead of an ordinary segment. The Call Gate specifies the instruction pointer in the kernel just as the Interrupt Gate does; the processor ignores the instruction pointer provided by the `call` instruction in this case. The rest works similarly to the software interrupt case, except that the kernel has to use a different return instruction (a far return, `lret`) because of the different stack frame layout created by the hardware.
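As a sketch, a userspace program could traverse such a Call Gate like this, assuming the kernel has installed one at GDT selector `0x18`. For a call gate, the CPU ignores the offset half of the far pointer:

```c
#include <stdint.h>

/* Hypothetical far pointer operand for an indirect far call (m16:32). */
struct far_pointer {
    uint32_t offset;    /* ignored when the selector names a call gate */
    uint16_t selector;  /* 0x18: GDT index of the assumed call gate    */
} __attribute__((packed));

static inline void call_gate_enter(void) {
    static const struct far_pointer target = { 0, 0x18 };
    asm volatile ("lcall *%0" : : "m"(target) : "memory");
}
```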
As you’ll see below, both of these kernel entry methods are dog slow. For both the interrupt and call gate traversals, the processor re-loads code and stack segment registers from the GDT. Each descriptor load is very expensive, because the processor has to decipher a fairly messed up data structure and perform many checks.
Many of these checks are pointless in modern operating systems. The features provided by the segmented memory model and the hierarchical protection domains they enable are not used. Instead of disjoint memory segments, every segment has a base of zero and the maximum size. Protection is realized via paging and the protection rings are only used to implement kernel and user mode.
The solution to this problem comes in an Intel and an AMD flavor:
`sysenter`/`sysexit` is the instruction pair to implement fast system calls on Intel, introduced with the Pentium II in 1997. AMD came up with a similar but incompatible instruction pair, `syscall`/`sysret`, with the K6-2 in 1998[6].
Both of these instruction pairs work pretty much the same. Instead of having to consult descriptor tables in memory for what to do, most of the functionality is hardcoded and the unused flexibility is lost: `sysenter` and `syscall` assume a flat memory model and load segment descriptors with fixed values. They are also incompatible with any non-standard use of the privilege levels.
A kernel-accessible model-specific register (MSR) points to the kernel’s system call entry point. `sysenter` also switches to the kernel stack in this way. The processor does not have to interpret data structures in memory, as opposed to the Call Gate or software interrupt case.
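On the kernel side, pointing these MSRs at the entry code boils down to a few `wrmsr` writes. A minimal sketch, assuming a flat GDT with the kernel code selector at `0x08`; the entry symbols and stack value are placeholders, and the `STAR`/`SFMASK` setup needed for `syscall` is omitted for brevity:

```c
#include <stdint.h>

/* MSR numbers from the Intel SDM and AMD APM. */
#define MSR_IA32_SYSENTER_CS   0x174
#define MSR_IA32_SYSENTER_ESP  0x175
#define MSR_IA32_SYSENTER_EIP  0x176
#define MSR_LSTAR              0xC0000082  /* 64-bit syscall entry point */

static inline void wrmsr(uint32_t msr, uint64_t value) {
    asm volatile ("wrmsr"
                  : : "c"(msr), "a"((uint32_t)value),
                      "d"((uint32_t)(value >> 32)));
}

/* Hypothetical kernel init code; entry symbols are placeholders. */
void setup_fast_syscall_entry(uint64_t kernel_stack_top) {
    extern void sysenter_entry(void), syscall_entry(void);
    wrmsr(MSR_IA32_SYSENTER_CS,  0x08);              /* kernel code selector  */
    wrmsr(MSR_IA32_SYSENTER_ESP, kernel_stack_top);  /* stack used by sysenter */
    wrmsr(MSR_IA32_SYSENTER_EIP, (uint64_t)sysenter_entry);
    wrmsr(MSR_LSTAR,             (uint64_t)syscall_entry);
}
```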
Saving the instruction pointer where the kernel should return to works differently with `sysenter` and `syscall` as well. `syscall` saves the instruction pointer in the `RCX` register. `sysenter` leaves it to the caller to specify where the system call should return to.
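From userspace, the `syscall` path then looks like this sketch. `RCX` and `R11` appear in the clobber list precisely because the hardware uses them for the return address and the saved flags:

```c
/* Sketch: write(2) through the syscall instruction on 64-bit Linux. */
static long syscall_write(int fd, const char *buf, unsigned long len) {
    long ret;
    asm volatile ("syscall"
                  : "=a"(ret)
                  : "a"(1 /* __NR_write */),
                    "D"((long)fd), "S"(buf), "d"(len)
                  : "rcx", "r11", "memory");
    return ret;
}
```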
I’ve done microbenchmarks (code is on GitHub) of the cost of these mechanisms[7]. I’ve measured the cost of entering and exiting the kernel with an empty system call handler in the kernel. We are just looking at the cost the hardware imposes on the system call path. The difference in performance is striking:
Using either `syscall`/`sysret` or `sysenter`/`sysexit` to perform a system call is an order of magnitude faster than using the traditional methods. Both modern methods cost around 70 cycles per round trip from user to kernel and back. This is less than a single 64-bit integer division!
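The measurement idea boils down to timing a tight loop of cheap system calls with the timestamp counter. A simplified sketch of the approach, not the linked benchmark code: `getpid` stands in for an empty in-kernel handler, and serialization around `rdtsc` is omitted:

```c
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    enum { ROUNDS = 1000000 };
    long ret;
    uint64_t start = rdtsc();
    for (int i = 0; i < ROUNDS; i++)
        asm volatile ("syscall"
                      : "=a"(ret)
                      : "a"(39 /* __NR_getpid */)
                      : "rcx", "r11", "memory");
    printf("%llu cycles per round trip\n",
           (unsigned long long)((rdtsc() - start) / ROUNDS));
    return 0;
}
```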
No description of `sysenter` would be complete without mentioning the sharp edges of using it. Check out these two issues that can ruin your day if you implement system call paths.
Footnotes
1. Supervisor and supervisor mode are rarely used synonyms for kernel and kernel mode, but you will find them in the Intel SDM. It also explains the term hypervisor.
2. Linux needs to offer `int 0x80` for compatibility with ancient 32-bit applications and it takes no steps to prevent 64-bit applications from using it. This is weird, because all 64-bit CPUs at least support the much faster `syscall`.
3. The only difference between an Interrupt and a Trap Gate is that the former causes the processor to mask interrupts when traversing the gate. This is largely irrelevant for our discussion.
4. The TSS is a vestige of hardware-supported task switching, also introduced with the 286. This feature was never really used and AMD neutered it when they designed the 64-bit extension to x86.
5. `iret` can also return to kernel mode. This is decided based on the segment selectors that were pushed as part of `int`.
6. I have used publicly available CPUID dumps to research when these features appeared.
7. The benchmarks were performed on an Intel Xeon E3-1270 v5 (Skylake client) running at 3.6 GHz in a VM, but since none of these operations cause VM exits, the numbers should be comparable on bare metal.