This article provides a technical analysis of Zenbleed, a side-channel attack affecting all AMD Zen 2 processors. Tavis Ormandy reported this vulnerability to AMD on 15 May 2023 and it was assigned CVE-2023-20593. The vulnerability is of particular concern for shared hosting providers, virtualisation platforms, and other shared-tenant systems. However, any scenario where a malicious actor can execute code potentially poses a threat, including in contexts such as privilege escalation, sandbox escape, and possibly even malicious JavaScript executing in a web browser.
While AMD has historically enjoyed relative respite from side-channel attack publications, this past disparity was largely due to Intel’s processors being a more attractive research target, with a greater depth of information available around engineering features (e.g. red unlock) and internals (e.g. microcode structure), and a greater share of the server market at the time. In the five years since Meltdown and Spectre, researchers have been busy closing the knowledge gap around AMD’s processors, making it easier to discover impactful security issues.
The Zenbleed vulnerability exploits incorrect recovery behaviour after a branch misprediction involving optimised vector instructions, resulting in information within floating point unit (FPU) registers being leaked. Vectorisation is frequently utilised in common library functions (e.g. memcpy, memcmp, strlen) for performance reasons, making this a very wide-reaching vulnerability in terms of the types of data that can be extracted.
To understand Zenbleed, we need to dig into modern processor design. Modern x86_64 processors do not simply execute one instruction after the next. Instead, they operate in a superscalar manner, essentially executing multiple instructions at once using techniques such as instruction-level parallelism (ILP) and out-of-order execution. While the processor outwardly appears to have a small number of general purpose registers (e.g. rax, rbx, r12, etc.) and a bank of SIMD registers (e.g. xmm0, ymm3, etc.), each processor core actually has a far larger number of internal registers. The named registers aren’t uniquely represented by a single physical hardware register each, but are rather dynamically allocated in a register file. This enables some very important optimisations.
For example, if you were to execute the instruction xchg rax, rcx, the processor almost certainly doesn’t move any values between physical hardware registers within the register file. Instead, it performs a register rename, essentially swapping the labels on the register file entries. This also happens with SIMD registers, allowing for complex behaviours and optimisations relating to the “nesting” of registers (e.g. xmm1 being one half of ymm1, which in turn is one half of zmm1).
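The label-swapping idea can be sketched with a toy model. This is an illustration only: the RenameTable class and its methods are made up for the example, and a real renamer also allocates a fresh physical register on every write, which this sketch skips.

```python
class RenameTable:
    """Toy register alias table: architectural names map to physical slots."""

    def __init__(self, names):
        # Start with one physical slot per architectural register.
        self.physical = [0] * len(names)
        self.alias = {name: slot for slot, name in enumerate(names)}

    def write(self, name, value):
        self.physical[self.alias[name]] = value

    def read(self, name):
        return self.physical[self.alias[name]]

    def xchg(self, a, b):
        # Swap the labels only; no value moves between physical slots.
        self.alias[a], self.alias[b] = self.alias[b], self.alias[a]


rat = RenameTable(["rax", "rcx"])
rat.write("rax", 0x1111)
rat.write("rcx", 0x2222)
before = list(rat.physical)

rat.xchg("rax", "rcx")
print(hex(rat.read("rax")), hex(rat.read("rcx")))  # 0x2222 0x1111
print(rat.physical == before)                      # True: nothing was copied
```

The values seen through rax and rcx are exchanged, yet the contents of the physical slots are untouched.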
When we think of a classical processor design, we typically think of it having an instruction decoder, an arithmetic logic unit (ALU), a floating point unit (FPU), etc. However, a superscalar processor actually has several of these per core, and uses a complex scheduling system to execute many operations at the same time. By identifying data dependencies between instructions, the processor can identify cases where later instructions do not depend upon the results of previous instructions, allowing it to execute those instructions at the same time.
For example, consider the following sequence of instructions:
Rather than executing the first instruction, stalling while waiting for the memory fetch to complete, then working on the next instructions, the processor can instead look ahead and see that sub rax, 0x8 does not depend upon the results of the first two instructions and choose to execute it simultaneously. It may also recognise that xor rax, rax sets rax to zero, thus not depending on the value of rax before that time, allowing it to start working on further instructions too, as long as memory accesses are correctly ordered. Not only this, but if the processor’s register allocation scheme keeps track of which entries in the register file are zero, then it does not need to explicitly zero a register to represent rax, but can simply reuse an already-zeroed entry.
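A rough way to see this kind of dependency analysis is to track, for each instruction, which earlier instruction last wrote each register it reads. The sketch below uses a hypothetical instruction sequence chosen for illustration (it is not a listing from the original article), ignores the flags register and memory ordering, and treats xor rax, rax as a zeroing idiom that reads nothing.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str
    reads: frozenset
    writes: frozenset


# Hypothetical sequence, for illustration only.
program = [
    Instr("mov rbx, [rsi]", frozenset({"rsi"}), frozenset({"rbx"})),
    Instr("add rbx, rdx", frozenset({"rbx", "rdx"}), frozenset({"rbx"})),
    Instr("sub rax, 0x8", frozenset({"rax"}), frozenset({"rax"})),
    # Zeroing idiom: the result does not depend on the old value of rax.
    Instr("xor rax, rax", frozenset(), frozenset({"rax"})),
    Instr("add rax, rbx", frozenset({"rax", "rbx"}), frozenset({"rax"})),
]


def raw_dependencies(prog):
    """For each instruction, list the earlier instructions whose results it reads."""
    last_writer = {}
    deps = []
    for index, ins in enumerate(prog):
        deps.append(sorted({last_writer[r] for r in ins.reads if r in last_writer}))
        for reg in ins.writes:
            last_writer[reg] = index
    return deps


deps = raw_dependencies(program)
for ins, d in zip(program, deps):
    print(f"{ins.text:<16} depends on {d}")
```

Instructions with an empty dependency list are candidates for being issued alongside earlier work. Only read-after-write dependencies are tracked here, on the assumption that register renaming removes the others.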
By carefully accounting for data dependencies and memory access ordering, the processor can parallelise operations across multiple physical ALUs and other units at the same time, re-ordering operations to try to ensure maximum utilisation of parallel units at all times. This also occurs with SIMD instructions, with special accounting for the upper and lower halves of the SIMD registers (xmm*, ymm*, zmm*) to help identify data dependencies when independent pieces of data are simultaneously processed in a vectorised manner.
This behaviour also interacts with speculative execution, where the processor tries to guess what the result of a branch instruction will be and continues execution as if the guess was correct, then rolls back to the previous state if the guess was incorrect. For example:
When the processor hits je skip, the memory fetch from the first instruction is still in flight, so it doesn’t yet know whether the branch will be taken or not. Without speculative execution this results in a pipeline stall while the memory fetch completes. To avoid this stall, the processor makes a branch prediction (i.e. an informed guess based on various metadata and prior observations) and saves a checkpoint. It then continues execution as if its prediction was correct (i.e. either after the branch or at the branch target, depending on what the prediction was) and either commits or rolls back its state depending on whether its prediction later turns out to be correct.
Let’s say that the processor guesses that the branch is not taken. It executes the code immediately after the branch (i.e. add rcx, 4, …) and continues until it hits the write hazard at mov [rcx], rax. It may also look ahead and see that it would execute add rcx, 8, which is not dependent on the write hazard, and execute that too. ILP also applies here, so some of these operations can be done in parallel.
When the memory fetch issued by cmp rax, [rcx] comes back, the processor now knows whether or not its prediction was correct. If it was, it commits the speculatively executed state and carries on. If it wasn’t, it has to roll back the state to an earlier checkpoint.
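The checkpoint, speculate, then commit-or-roll-back cycle can be sketched as follows. The SpeculativeCore class and its method names are invented for this example, and the whole register state is snapshotted for simplicity, which is not how hardware does it.

```python
import copy


class SpeculativeCore:
    """Toy core that snapshots register state at a predicted branch."""

    def __init__(self, regs):
        self.regs = dict(regs)
        self._checkpoint = None

    def predict_branch(self):
        # Save the state before running down the guessed path.
        self._checkpoint = copy.deepcopy(self.regs)

    def resolve_branch(self, prediction_correct):
        if not prediction_correct:
            # Misprediction: discard speculative work, restore the checkpoint.
            self.regs = self._checkpoint
        # Either way the checkpoint is no longer needed.
        self._checkpoint = None


core = SpeculativeCore({"rax": 5, "rcx": 0x1000})
core.predict_branch()        # guess: branch not taken
core.regs["rcx"] += 4        # speculatively executed add rcx, 4
core.resolve_branch(prediction_correct=False)
print(hex(core.regs["rcx"]))  # 0x1000
```

After a misprediction the speculative add leaves no trace in the register state. Zenbleed is a case where a real processor's rollback does not restore everything.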
The Zenbleed vulnerability arises from faulty behaviour when a branch misprediction rollback occurs immediately after a special SIMD register optimisation and register rename occur.
The optimisation in question is called the XMM Register Merge Optimization. AMD Zen 2 processors keep track of SIMD registers whose upper halves have been zeroed, using a z-bit in their Register Allocation Table (RAT). When an instruction writes non-zero data to the upper half of a register, the z-bit is cleared, indicating that there is data present and any subsequent instructions that might be affected by that data cannot be executed until the data dependency is resolved. However, if the upper half is zeroed, instructions that also do not modify that upper half can proceed without waiting, avoiding the data dependency and resulting pipeline stall.
Tavis Ormandy’s writeup of Zenbleed demonstrates this optimisation using the AVX2 optimised strlen function from glibc:
vpxor xmm0, xmm0, xmm0      ; xor xmm0 with xmm0 and store the result in xmm0 (extends to ymm0)
vpcmpeqb ymm1, ymm0, [rdi]  ; compare the memory at rdi with ymm0, store the result in ymm1
vpmovmskb eax, ymm1         ; set eax to a 32-bit bitmap of the null bytes flagged in ymm1
tzcnt eax, eax              ; count trailing zeroes
vzeroupper                  ; zero the upper 128 bits of ymm0-ymm15
The first instruction zeroes the 128-bit SIMD register xmm0 (similar to xor rax, rax) and, in the process, also zeroes the 256-bit SIMD register ymm0 which encompasses it, since xmm0 is the lower half of ymm0.
The second instruction, vpcmpeqb (vector compare equal bytes), treats the ymm0 register as 32 packed bytes and compares those to the 32 bytes of memory pointed to by rdi. Bytes that are equal produce a corresponding byte of all 1s in the ymm1 destination register, whereas bytes that are not equal produce a corresponding byte of all 0s.
The third instruction, vpmovmskb (vector move byte mask), takes the most significant bit of each packed byte in the ymm1 register and writes it to the corresponding bit in eax. This results in MSBs from 32 separate bytes in ymm1 being packed into a single 32-bit general purpose register.
The fourth instruction, tzcnt, counts the trailing zero bits in eax. Since each set bit in eax now marks a byte in the source memory that was zero, the number of zero bits below the lowest set bit is the offset of the first \0 byte within the 32-byte chunk, i.e. the length of the string if it ends inside that chunk.
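To make the data flow concrete, here is a small emulation of the second, third and fourth instructions on a single 32-byte chunk. The function names mirror the instruction mnemonics but are plain helpers written for this example, and the chunk contents are made up.

```python
def vpcmpeqb(a: bytes, b: bytes) -> bytes:
    """Byte-wise compare: 0xFF where the bytes are equal, 0x00 otherwise."""
    return bytes(0xFF if x == y else 0x00 for x, y in zip(a, b))


def vpmovmskb(v: bytes) -> int:
    """Pack the most significant bit of each byte into an integer bitmask."""
    mask = 0
    for i, byte in enumerate(v):
        mask |= (byte >> 7) << i
    return mask


def tzcnt32(x: int) -> int:
    """Count trailing zero bits of a 32-bit value (32 if the value is zero)."""
    if x == 0:
        return 32
    return (x & -x).bit_length() - 1


chunk = b"hello\0" + b"\xaa" * 26   # 32 bytes of example memory at rdi
ymm0 = bytes(32)                    # all zero, as left by vpxor
ymm1 = vpcmpeqb(ymm0, chunk)
eax = vpmovmskb(ymm1)
print(bin(eax), tzcnt32(eax))       # 0b100000 5
```

In this model the final value is the index of the first null byte in the chunk, which is 5 for the string "hello".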
The fifth instruction, vzeroupper, is not functionally required – the code has already finished calculating its result – but its presence is important for performance. The instruction zeroes the upper halves of all ymm registers (and zmm registers too) – or, rather, what it actually does is set the corresponding z-bits in the RAT to indicate that the upper halves of each register are zero, without actually zeroing any underlying entries in the register file. The lower half of each ymm register (accessible via xmm*) is still allocated in the register file, but it is merged with an upper half that is unallocated and marked as zero via its z-bit.
This is why the vzeroupper instruction helps prevent the processor from falsely assuming data dependencies in subsequent instructions that use the ymm registers. The XMM Register Merge Optimization allows the processor to identify instructions which do not write to the upper portion of the register, thus letting them execute without treating the upper (zero) portion of the register as a data dependency. This uncouples the data dependency between overlapping xmm and ymm registers.
Unfortunately it seems that AMD Zen 2 processors do not correctly handle the case when a vzeroupper instruction is speculatively executed and then rolled back due to branch misprediction. The scenario is as follows:
- SIMD instructions that support the XMM Register Merge Optimisation are executed, using xmm operands.
- A register rename is triggered on the overlapping ymm operand, e.g. by the vmovdqa instruction.
- A branch is reached and the CPU speculatively executes past it.
- A vzeroupper instruction is speculatively executed, which sets the z-bit on the upper halves of all ymm registers and deallocates their respective entries in the register file.
- The branch condition is resolved and misprediction is detected.
- The processor rolls back the vzeroupper instruction by clearing the z-bits and re-allocating the entries.
- Execution continues from the correct branch path.
However, when the rollback occurs, the processor resets the z-bit to zero, leaving the register in an undefined state, with the upper half of the ymm register pointing at an uninitialised entry in the register file. This is comparable to a use-after-free bug, but in the processor’s register file instead of system memory.
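The use-after-free analogy can be mimicked with a toy model. The sketch below is an illustration only: the class names, the last-in-first-out free list and the single tracked register are assumptions made for the example, and it does not describe AMD's actual register file or rollback logic.

```python
class ToyRegisterFile:
    """Shared pool of physical entries; freeing an entry does not scrub it."""

    def __init__(self, entries=4):
        self.slots = [0] * entries
        self.free = list(range(entries))

    def alloc(self, value):
        slot = self.free.pop()       # most recently freed entry is reused first
        self.slots[slot] = value
        return slot

    def release(self, slot):
        self.free.append(slot)       # contents are left in place


class UpperHalf:
    """Upper half of one ymm register: a z-bit plus a pointer to an entry."""

    def __init__(self, regfile, value):
        self.regfile = regfile
        self.zbit = False
        self.slot = regfile.alloc(value)

    def vzeroupper(self):
        self.zbit = True
        self.regfile.release(self.slot)   # the stale pointer is kept

    def buggy_rollback(self):
        # Only the z-bit is restored; the entry may now belong to someone else.
        self.zbit = False

    def read(self):
        return 0 if self.zbit else self.regfile.slots[self.slot]


SECRET = 0xDEADBEEF
regfile = ToyRegisterFile()
attacker = UpperHalf(regfile, 0x41414141)
attacker.vzeroupper()                 # speculative: entry goes back to the pool
victim_slot = regfile.alloc(SECRET)   # another thread reuses the freed entry
attacker.buggy_rollback()             # misprediction detected, z-bit cleared
leaked = attacker.read()
print(hex(leaked))                    # 0xdeadbeef
```

The attacker's register ends up reading a value it never wrote, which is the essence of the leak in this simplified picture.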
Since the register file is shared between the SMT threads running on the same physical core, this can be used to snoop on data in the SIMD registers across hyperthreads. This isn’t the only attack scenario, though – the same attack can be leveraged for privilege escalation.
While it might initially seem like SIMD registers aren’t particularly interesting, they are used in optimised versions of almost all string and memory manipulation functions in standard libraries. This means they are constantly handling sensitive data like passwords, keys, configuration files, etc. making all this data vulnerable to leakage.
There is a PoC exploit for Zenbleed on GitHub which is capable of dumping data across hyperthreads. The code is also nicely commented and quite easy to follow.
AMD released Bulletin AMD-SB-7008 “Cross-Process Information Leak” to track the issue. They also released a microcode patch to address the issue on Family 17h Model 31h (EPYC 7002 series) and Family 17h Model A0h (Sabrina SoCs). So far there are no microcode updates for consumer products, meaning that AMD’s desktop, mobile, HEDT, and workstation (Threadripper) processors remain vulnerable. AGESA firmware updates are scheduled for release in October and December 2023, which should contain new microcode for those products. It seems that the coordinated disclosure process for Zenbleed went a little off the rails, possibly due to AMD accidentally publishing information several months ahead of the agreed embargo date, resulting in the bug being disclosed 3-4 months ahead of patch availability.
On systems where the microcode or firmware updates cannot be applied, a workaround is possible using a chicken bit in the DE_CFG register at MSR 0xC0011029. Setting bit 9 in this register enables a backup fix, but has additional performance impact compared to the microcode update. Linux’s name for this workaround bit is MSR_AMD64_DE_CFG_ZEN2_FP_BACKUP_FIX_BIT, which it should automatically apply on affected platforms when no microcode update is present. The bit can manually be set on Linux using msr-tools, or on FreeBSD with cpucontrol.
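The bit manipulation involved is small enough to sketch. The MSR address and bit number below come from the text. The helper names are made up for this example, and read_msr assumes a Linux system with the msr kernel module loaded and root privileges, so it is defined here but not called.

```python
import os
import struct

DE_CFG_MSR = 0xC0011029     # DE_CFG register, as given in the text
FP_BACKUP_FIX_BIT = 9       # chicken bit, as given in the text


def with_backup_fix(value: int) -> int:
    """Return the register value with the workaround bit set."""
    return value | (1 << FP_BACKUP_FIX_BIT)


def backup_fix_enabled(value: int) -> bool:
    """Report whether the workaround bit is set in a DE_CFG value."""
    return bool(value & (1 << FP_BACKUP_FIX_BIT))


def read_msr(cpu: int, reg: int) -> int:
    """Read a 64-bit MSR through /dev/cpu/N/msr (Linux, root, msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


example = 0x100             # hypothetical current DE_CFG value
print(hex(with_backup_fix(example)))   # 0x300
```

On a real system the value returned by read_msr(0, DE_CFG_MSR) could be passed to backup_fix_enabled to check whether the workaround is already active. Writing the new value back is best left to msr-tools or the kernel.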
At the time of writing, Microsoft do not appear to have a security update that applies the DE_CFG[9] chicken bit workaround. You can modify MSRs using RWEverything on Windows, although that comes with its own risks and is probably not a sensible thing to do in production.
You can check which microcode version is currently loaded to confirm whether an update has been applied, although the method is OS specific. On Windows, the microcode version information is found in the following registry key:
HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor\0
The Update Revision value describes the microcode version that has been loaded into the processor, and the Previous Update Revision describes the microcode version that was loaded into the processor by the system firmware (UEFI / BIOS) at boot.
On Linux, /proc/cpuinfo will list the microcode version alongside other processor details:
model name : AMD EPYC 7601 32-Core Processor
The same info can also usually be found in the kernel boot log.
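As a minimal sketch, the microcode field can be pulled out of /proc/cpuinfo text like this. The function name is made up for the example, and the sample text, including its microcode value, is illustrative rather than output captured from a real machine.

```python
def microcode_revisions(cpuinfo_text):
    """Return the microcode revision of each logical CPU as a list of ints."""
    revisions = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.append(int(value.strip(), 16))
    return revisions


# Illustrative sample only; on a real system, read the text from /proc/cpuinfo.
sample = (
    "processor\t: 0\n"
    "model name\t: AMD EPYC 7601 32-Core Processor\n"
    "microcode\t: 0x8001250\n"
)
print([hex(r) for r in microcode_revisions(sample)])   # ['0x8001250']
```

To inspect the running machine, pass in the contents of /proc/cpuinfo, for example via open("/proc/cpuinfo").read().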
For Zen 2 architecture EPYC processors, a microcode version of 0x0830107a or higher indicates that a fix was applied. For Zen 2 architecture Sabrina SoCs, a microcode version of 0x08a00008 or higher indicates that a fix was applied. As noted above, all other processor families, including desktop Ryzen processors, are yet to receive a microcode update with a patch, so we don’t yet know what the fixed microcode versions will be.
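The comparison itself is a simple threshold check. The sketch below encodes only the two revisions quoted in the text. The dictionary keys and the function name are labels invented for this example, and the comparison is only meaningful between revisions for the same product line.

```python
# Minimum fixed microcode revisions, as quoted in the text.
FIXED_REVISIONS = {
    "zen2-epyc": 0x0830107A,
    "zen2-sabrina": 0x08A00008,
}


def has_zenbleed_microcode(product: str, revision: int) -> bool:
    """True if the loaded revision is at or above the known fixed revision."""
    return revision >= FIXED_REVISIONS[product]


print(has_zenbleed_microcode("zen2-epyc", 0x0830107A))   # True
print(has_zenbleed_microcode("zen2-epyc", 0x08301055))   # False (hypothetical older revision)
```

Products without a known fixed revision raise a KeyError here. That is deliberate in this sketch, since no threshold is known for them.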
In the interim, Linux should automatically apply software mitigations for Zenbleed. You can query the status of these mitigations through the sysfs interface, under the following directory:
/sys/devices/system/cpu/vulnerabilities/
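A small helper can dump whatever status files the kernel exposes there. This is a sketch under the assumption that the directory is /sys/devices/system/cpu/vulnerabilities. Which entries appear depends on the kernel version, and the code makes no assumption about a Zenbleed-specific file name.

```python
from pathlib import Path


def read_mitigations(directory="/sys/devices/system/cpu/vulnerabilities"):
    """Return {entry name: status text} for each file in the sysfs directory."""
    base = Path(directory)
    if not base.is_dir():
        return {}          # not Linux, or sysfs not mounted
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(base.iterdir())
        if entry.is_file()
    }


for name, status in read_mitigations().items():
    print(f"{name}: {status}")
```

On systems without this directory the function returns an empty dictionary and nothing is printed.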
If you’re running a server with a Zen 2 EPYC processor, you should update your firmware and install all OS patches to help ensure that Zenbleed is patched. If your system vendor has yet to release firmware updates to address this issue, it is possible that your OS will still load the new microcode blobs at boot, so make sure to check that first before trying to implement any manual workarounds. As always, refer to vendor guidance for good practice mitigation strategies.
Originally published by Graham Sutherland: Zenbleed – AMD Side-Channel Attack Targets Vectorised Functions