Split Lock Detection VM Hangs
Recently, I’ve noticed strange hangs of KVM VMs on a custom VMM. As it fits the topic of this blog, I thought I’d make the issue more googleable. Before we dive into the issue, we have to set the scene a bit.
The Scene
Consider that we want to run a KVM vCPU on Linux, but we want it to
unconditionally exit after 1ms regardless of what the guest does. To
achieve this, we can create a CLOCK_MONOTONIC timer with timer_create
that sends a signal to the thread that runs the vCPU (via
SIGEV_THREAD_ID). We choose SIGUSR1, but other signals work as well.
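The timer setup described above could look roughly like the following sketch. The function names create_vcpu_kick_timer and arm_vcpu_kick_timer are made up for illustration, and the fallback macro for sigev_notify_thread_id is an assumption about libc headers that do not define it themselves.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Some libc versions do not expose this field name (assumption: glibc layout). */
#ifndef sigev_notify_thread_id
#define sigev_notify_thread_id _sigev_un._tid
#endif

/*
 * Hypothetical helper: create a CLOCK_MONOTONIC timer that delivers
 * SIGUSR1 to the calling thread. Call it from the vCPU thread.
 * Returns 0 on success, -1 on error.
 */
static int create_vcpu_kick_timer(timer_t *timer)
{
    struct sigevent sev;

    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_THREAD_ID;
    sev.sigev_signo = SIGUSR1;
    sev.sigev_notify_thread_id = (pid_t)syscall(SYS_gettid);

    return timer_create(CLOCK_MONOTONIC, &sev, timer);
}

/* Hypothetical helper: arm the timer as a one-shot firing after timeout_ms. */
static int arm_vcpu_kick_timer(timer_t timer, long timeout_ms)
{
    struct itimerspec its;

    memset(&its, 0, sizeof(its));
    its.it_value.tv_sec = timeout_ms / 1000;
    its.it_value.tv_nsec = (timeout_ms % 1000) * 1000000L;

    return timer_settime(timer, 0, &its, NULL);
}
```

The VMM would call arm_vcpu_kick_timer(timer, 1) right before each KVM_RUN. On older glibc versions, linking may additionally need -lrt.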
We have to make sure that we do not receive the signal while the vCPU
is not executing guest code, because a signal delivered at that point
would not fulfill its goal of kicking the vCPU out of guest execution.
For that, we mask SIGUSR1 with pthread_sigmask in the host thread and
unmask it for the vCPU via KVM_SET_SIGNAL_MASK.
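A minimal sketch of the masking, assuming an x86-64 host where the kernel expects an 8-byte signal set in struct kvm_signal_mask. The helper names and the KERNEL_SIGSET_SIZE constant are made up for this example.

```c
#define _GNU_SOURCE
#include <linux/kvm.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

/* Assumption: the kernel's sigset_t is 8 bytes (x86-64); glibc's is larger. */
#define KERNEL_SIGSET_SIZE 8

/* Block SIGUSR1 in the calling (vCPU) thread while it runs host code. */
static int block_kick_signal_in_thread(void)
{
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    return pthread_sigmask(SIG_BLOCK, &set, NULL);
}

/* Tell KVM to use a mask without SIGUSR1 for the duration of KVM_RUN. */
static int unblock_kick_signal_for_vcpu(int vcpu_fd)
{
    struct kvm_signal_mask *mask;
    sigset_t set;
    int ret;

    /* Start from the thread's current mask and remove SIGUSR1 from it. */
    if (pthread_sigmask(SIG_BLOCK, NULL, &set) != 0)
        return -1;
    sigdelset(&set, SIGUSR1);

    mask = calloc(1, sizeof(*mask) + KERNEL_SIGSET_SIZE);
    if (!mask)
        return -1;
    mask->len = KERNEL_SIGSET_SIZE;
    memcpy(mask->sigset, &set, KERNEL_SIGSET_SIZE);

    ret = ioctl(vcpu_fd, KVM_SET_SIGNAL_MASK, mask);
    free(mask);
    return ret;
}
```

Both helpers are meant to run once on the vCPU thread before the first KVM_RUN.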
This setup works beautifully and in essence emulates the VMX
preemption timer. There is only one wart at this point. When KVM_RUN
returns EINTR because the timer signal was pending, we need to
“consume” the signal or the next KVM_RUN will immediately exit again.
We can do this with sigtimedwait with a zero timeout.
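Consuming the pending signal could be sketched as follows. The function name consume_kick_signal is hypothetical, and the code assumes SIGUSR1 is blocked in the calling thread as described above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <time.h>

/*
 * Dequeue a pending SIGUSR1 without blocking.
 * Returns 1 if a signal was consumed, 0 if none was pending, -1 on error.
 */
static int consume_kick_signal(void)
{
    const struct timespec zero = { 0, 0 };
    sigset_t set;
    int sig;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);

    /* A zero timeout turns sigtimedwait into a non-blocking poll. */
    do {
        sig = sigtimedwait(&set, NULL, &zero);
    } while (sig < 0 && errno == EINTR);

    if (sig == SIGUSR1)
        return 1;
    return errno == EAGAIN ? 0 : -1;
}
```

A VMM would call this whenever KVM_RUN fails with EINTR, before re-arming the timer.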
Weird VM Hangs
When I used this scheme on my Intel Tiger Lake laptop, I noticed strange hangs in VMs. The VM would sometimes get stuck on one instruction. The weird thing was that the vCPU could still receive and handle interrupts, but this one harmless-looking instruction would never complete. The effect was that some Linux kernel threads would just get stuck while others continued to run.
The instruction in question was this from the set_bit function of my Linux 5.4 guest:
ffffffff810238b0 <set_bit>:
ffffffff810238b0: f0 48 0f ab 3e lock bts %rdi,(%rsi)
ffffffff810238b5: c3 ret
Way too late, I noticed the following warning in the host’s kernel log with a matching instruction pointer:
x86/split lock detection: #AC: vmm/61253 took a split_lock trap at address: 0xffffffff810238b0
Split lock detection is an anti-DoS feature that can find or kill processes that perform misaligned locked memory accesses, because they trigger extremely slow paths in the CPU that impact the performance of other cores in the system.
When I checked in more detail, the lock bts was indeed performing a
misaligned locked memory access, but why would this warning cause a
permanent hang at this instruction?
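Whether a locked access counts as a split lock comes down to whether it crosses a cache line boundary. The check can be sketched as below, assuming 64-byte cache lines and, for the 64-bit lock bts above, an 8-byte access. The function name and the constant are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on current Intel client CPUs. */
#define CACHE_LINE_SIZE 64u

/* True if the access [addr, addr + size) touches more than one cache line. */
static bool crosses_cache_line(uint64_t addr, size_t size)
{
    uint64_t first_line = addr / CACHE_LINE_SIZE;
    uint64_t last_line = (addr + size - 1) / CACHE_LINE_SIZE;

    return first_line != last_line;
}
```

For example, an 8-byte access at 0x1038 stays within one line, while the same access at 0x103c spans two lines and would be a split lock when locked.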
On my laptop running Linux 6.6, split lock detection was in its
default setting warn. This is reasonable, because the underlying
issue is not something you typically care about on a desktop
system. The documentation of the relevant kernel parameter reads as
follows:
split_lock_detect=
[X86] Enable split lock detection or bus lock detection
When enabled (and if hardware support is present), atomic
instructions that access data across cache line
boundaries will result in an alignment check exception
for split lock detection or a debug exception for
bus lock detection.
...
warn - the kernel will emit rate-limited warnings
about applications triggering the #AC
exception or the #DB exception. This mode is
the default on CPUs that support split lock
detection or bus lock detection. Default
behavior is by #AC if both features are
enabled in hardware.
There were no clues about the hang here either. 🤔
Going Deeper
When I checked the kernel function that emits the warning (called via
handle_guest_split_lock), the pieces started falling into place:
static void split_lock_warn(unsigned long ip)
{
    struct delayed_work *work;
    int cpu;

    if (!current->reported_split_lock)
        pr_warn_ratelimited("#AC: %s/%d took a split_lock trap at address: 0x%lx\n",
                            current->comm, current->pid, ip);
    current->reported_split_lock = 1;

    if (sysctl_sld_mitigate) {
        /*
         * misery factor #1:
         * sleep 10ms before trying to execute split lock.
         */
        if (msleep_interruptible(10) > 0)
            return;
        /*
         * Misery factor #2:
         * only allow one buslocked disabled core at a time.
         */
        if (down_interruptible(&buslock_sem) == -EINTR)
            return;
        work = &sl_reenable_unlock;
    } else {
        work = &sl_reenable;
    }

    cpu = get_cpu();
    schedule_delayed_work_on(cpu, work, 2);

    /* Disable split lock detection on this CPU to make progress */
    sld_update_msr(false);
    put_cpu();
}
When the host detects a split lock, it will try to punish the offending thread by introducing a 10ms delay. But recall that our vCPU has a 1ms timer pending!
The situation is thus the following:
- The VMM programs a 1ms timer and starts guest execution with KVM_RUN.
- The guest executes a misaligned lock bts and exits with an #AC exception.
- The host Linux kernel sleeps for 10ms to punish this behavior.
- After 1ms, the timer signal interrupts the sleep and the function immediately returns with split lock detection still enabled.
At this point, the VMM sees that its timeout has expired and processes its timeout events. It programs a new timeout and we have the same sequence of events again.
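The sequence above can be condensed into a toy model. It is only a sketch: it ignores timer rounding inside msleep_interruptible and the buslock semaphore, and all names in it are made up for illustration.

```c
#include <stdbool.h>

/* Length of the punishment sleep in split_lock_warn(). */
#define SPLIT_LOCK_SLEEP_MS 10u

/*
 * Toy model of one KVM_RUN attempt: the guest only gets past the
 * split lock if the punishment sleep finishes before the VMM timer
 * signal arrives and interrupts it.
 */
static bool attempt_completes_instruction(unsigned timeout_ms)
{
    bool sleep_interrupted = timeout_ms < SPLIT_LOCK_SLEEP_MS;

    return !sleep_interrupted;
}

/* Count how many of `attempts` consecutive KVM_RUN calls make progress. */
static unsigned count_successful_attempts(unsigned timeout_ms, unsigned attempts)
{
    unsigned done = 0;

    for (unsigned i = 0; i < attempts; i++) {
        if (attempt_completes_instruction(timeout_ms))
            done++;
    }
    return done;
}
```

With a 1ms timeout, no attempt ever succeeds in this model, no matter how often the VMM retries. That is the hang.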
I have created a minimal example of this issue here. The guest code
just counts how many times it can execute the lock bts instruction.
When you execute this test program once with split_lock_detect=warn
and once with split_lock_detect=off, you get the following data:
The plot shows the number of loops that the guest finished on the vertical axis and the pending timeout in ms on the horizontal axis.
You can clearly see that for timeouts below 10ms, this (artificial) guest makes no progress at all when split lock detection is enabled! On the other hand, when split lock detection is disabled, the guest makes roughly as much progress as we give it time.
Workarounds
As I already mentioned, the easiest workaround is to turn split lock
detection off via split_lock_detect=off. This is safe unless you run
a public cloud. Alternatively, the punishment can be disabled by
writing 0 into /proc/sys/kernel/split_lock_mitigate.
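If the VMM should apply the second workaround itself at startup, a small helper along these lines might do. The function name write_sysctl_value is hypothetical, and writing to files under /proc/sys typically requires root.

```c
#include <stdio.h>

/*
 * Write an integer value to a sysctl file.
 * Returns 0 on success, -1 on error.
 */
static int write_sysctl_value(const char *path, int value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;

    ok = fprintf(f, "%d\n", value) > 0;
    /* Errors from the kernel may only surface when the stream is flushed. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Calling write_sysctl_value("/proc/sys/kernel/split_lock_mitigate", 0) would then disable the punishment while keeping the warning.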
A Bug?
The split_lock_warn function is clearly written to allow the
offender to make some progress. But in the situation where
msleep_interruptible is actually interrupted, this is not the case
anymore. It looks like a bug to me.
It’s a difficult question what the correct behavior should be here. If
msleep_interruptible managed to sleep at least a bit (i.e. some
punishment was dealt), we should still go into the lower part of the
function that disables split lock detection and allow for forward
progress. This may make it possible to circumvent this punishment
though.