x86 CPUs usually identify themselves and their features using the cpuid instruction. But even without looking at their self-reported identities or timing behavior, it is possible to tell CPU microarchitectures apart.

Take for example the ud0 instruction. This instruction is used to generate an Invalid Opcode Exception (#UD). It is encoded with the two bytes 0F FF.

If we place this instruction at the end of an executable page in memory and the following page is not executable, we see differences across x86 microarchitectures. On my Goldmont Plus-based Intel NUC, executing this instruction will indeed cause an #UD exception. On Linux, this exception is delivered as SIGILL.

If I retry the same setup on my Skylake desktop, the result is a SIGSEGV instead. This signal is caused by a page fault during instruction fetch. This means that the CPU did not manage to decode this instruction with just the two bytes and tried to fetch more bytes. My somewhat older Broadwell-based laptop has the same behavior.

Using baresifter, we can reverse engineer (more on that in a future blog post) that Skylake and Broadwell actually try to decode ud0 as if it had source and destination operands. After the the two opcode bytes, they expect a ModR/M byte and as many additional immediate or displacement bytes as the ModR/M byte indicate.

I have put the code for this example on Github.

Why would this matter? Afterall, this behavior is now even documented in the Intel Software Developer’s Manual:

Some older processors decode the UD0 instruction without a ModR/M byte. As a result, those processors would deliver an invalid-opcode exception instead of a fault on instruction fetch when the instruction with a ModR/M byte (and any implied bytes) would cross a page or segment boundary.

I have picked an easy example for this post. Beyond this documented difference, there are many other undocumented differences in instruction fetch behavior for other illegal opcodes that makes it fairly easy to figure out what microarchitecture we are dealing with. This still applies when a hypervisor intercepts cpuid and changes the (virtual) CPU’s self-reported identity. It is also possible to fingerprint different x86 instruction decoding libraries using this approach and narrow down which hypervisor software stack is used.

One usecase I can think of is to build malware that is tailored to recognize its target using instruction fetch fingerprinting. Let’s say the malware’s target is an embedded system with an ancient x86 CPU. If it is actively fingerprinting the CPU, it can avoid deploying its payload in an automated malware analysis system and be discovered, unless the malware analysis is performed on the exact same type of system targeted by the malware.