Exploiting an Unbounded memcpy in Parallels Desktop
This post details the development of a guest-to-host virtualization escape for Parallels Desktop on macOS, as used in our successful Pwn2Own 2021 entry. Given privileged code execution in the guest (i.e. via kernel module), the exploit obtains code execution in the Parallels application process running on the host system.
After providing a brief look at the approach I took towards evaluating Parallels and exploring some of its relevant attack surface, the remainder of the post will demonstrate how we were able to reliably exploit an unbounded memcpy(...) corruption-style vulnerability to achieve arbitrary code execution.
Recon
Parallels Desktop is a virtualization product for macOS, comparable to other virtualization solutions such as VirtualBox or VMware. It acts as a hypervisor to allow running guest operating systems (e.g. Windows, Linux) on a macOS host, ‘all without rebooting!’ (part of their marketing byline).
My primary motivation for taking a look at Parallels was due to its sudden inclusion in Pwn2Own 2021. Prior to this, nobody at RET2 had actually used Parallels before, let alone tried poking at it for vulnerabilities.
When faced with a new target, it can be a helpful first step to see what previous work already exists, be it writeups, bugs, exploits, or other relevant documentation/materials. At the time (early 2021), public research on Parallels was pretty scarce, save for one ZDI blog post and a handful of ZDI advisories. Since Pwn2Own, a few more resources [1, 2, 3] have trickled out, but the number of public works is still limited.
Identifying Guest-to-Host Attack Surface
After creating a new VM and spending some time getting acquainted with Parallels’ features, some of the more interesting functionality that immediately stood out was the numerous “tools” made available to the guest VM.
Guest tools (or Parallels Tools) implicitly provide most of the sharing features offered by Parallels: things like drag-and-drop, copy-pasting between host and guest, shared folders, and dynamic screen resolution.
At this point, it was quickly apparent that some of Parallels’ by-design shared folder behavior could be used for a guest-to-host escape. As part of the Pwn2Own meta-game, I opted to ignore these issues fearing a possible collision (it seemed at least two other teams ended up finding these same issues!).
My focus shifted to the host-side implementation of these tools, presuming it was likely that the guest directly controlled input to that part of the code. To pursue this route, a binary to reverse would be helpful. By spawning a few VMs and looking at the system process list, we can infer prl_vm_app is the binary that runs for each guest.
Investigating the Parallels Host Process
While evaluating the Parallels worker process running on the host, we are most interested in identifying the parts of the code to which we can directly control input. The prl_vm_app Mach-O was massive, and putting in some reversing effort starting at the entrypoint did not prove immediately helpful.
I chose to instead look at a legitimate use case of the guest tools, with the hope that this would help identify where to start looking in the host prl_vm_app binary. The guest side of the tools implementation takes the form of several kernel modules, which act as a client for the interface provided by the host.
Fortunately, for Linux guests, Parallels provides source code for the kernel modules, which must be built from source depending on the specific Linux distribution. By reading the source, we can infer:
- how the guest provides input to the host side of the tools implementation
- how this interface is intended to work
There are five kernel modules, each with separate source directories:
- prl_tg: interfaces with the “toolgate”, a generic way to pass data to operations identified by opcode
- prl_eth: virtual network
- prl_fs: shared folders
- prl_fs_freeze: a suspend/resume helper dealing with filesystem structures
- prl_vid: video, potentially with some sort of DRM
One of the header files for prl_tg included opcode definitions for a multitude of toolgate operations, and based on the large number of them, this interface appealed the most.
From the guest tools source code, we can see that the guest kernel invokes toolgate operations by writing the physical address of a request structure to an I/O port. The request structure contains an opcode, size, and optionally inline data and/or “out-of-line” buffers.
While certainly useful context, none of the information we’ve gathered from the guest tools source code can be taken as ground truth; it is merely a suggestion of intent. We’ll need to find the host-side implementation of the toolgate to see how things really (don’t) work.
Locating the Toolgate Entrypoint
Shared folders were something easily poked at (i.e. with simple command-line file operations in the guest), and internally appeared to be implemented with toolgate requests. Finding the host implementation of shared folders seemed a promising path to finding the overall toolgate implementation.
The following procedure can be used to locate the prl_vm_app function responsible for shared folders requests:
- create a guest with tools installed; the default config shares the home directory at /media/psf/Home
- on the host, attach (e.g. with lldb) to the prl_vm_app process and set a breakpoint on unlink
- in the guest, create and delete a file in the shared folder, hitting the breakpoint
- look at the backtrace
Matching up the backtrace functions with those in your disassembler/decompiler of choice, you should see a function with a switch-case over the various toolgate opcodes for shared folders requests.
However, this function solely handles shared folders requests, and reversing deeper into the backtrace does not bring us to the initial toolgate entrypoint as we hoped… The shared folders requests are handled on a separate thread; the requests are taken from a queue and handled asynchronously. Our search will now turn to finding where the queue is populated.
This can be done with watchpoints:
- set a breakpoint where the requests are taken off the queue
- in the guest, trigger a shared folders toolgate request, hitting the breakpoint
- note the address of the array used for the queue slots, set a watchpoint on the next array slot
- in the guest, trigger another shared folders request, hitting the watchpoint
- look at the backtrace
The watchpoint should trigger on a thread named VCPU0 (or potentially another “virtual CPU” thread if there are multiple cores), and reversing through the backtrace should eventually reveal the toolgate entrypoint, where the guest-supplied request structure is parsed and interpreted.
Toolgate Protocol and Handlers
Using the guest tools source as a guide, we can reverse engineer how toolgate requests are parsed and handled.
To interact with the toolgate, the guest writes a physical address to the TG_PORT_SUBMIT I/O port. Other ports are indicated in the guest tools source (these port numbers would be offset by a base port):
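A sketch of these definitions (TG_PORT_SUBMIT’s offset of 8 matches the example below; the other offsets are approximate reconstructions of the guest tools header and may differ):

```cpp
// Port offsets relative to the toolgate device's base I/O port
enum {
    TG_PORT_STATUS = 0x0,  // read back the status of a request
    TG_PORT_SUBMIT = 0x8,  // write the physical address of a TG_PAGED_REQUEST
    TG_PORT_CANCEL = 0x10, // cancel a pending (asynchronous) request
};
```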
For example, the base port in the guest is usually 0x8000, so to submit a toolgate request, the physical address of the request structure would be written to port 0x8008 with the out instruction.
The host then expects a TG_PAGED_REQUEST structure to be present at the guest physical address (again, structure definitions taken from guest tools source):
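A paraphrased sketch of those definitions (field names follow the guest tools source and the descriptions below, but the exact layout and packing here are approximate):

```cpp
#include <cstdint>

struct TG_PAGED_BUFFER {
    uint64_t Va;         // only the page offset (Va & 0xfff) is meaningful
    uint32_t ByteCount;  // size of the blob in bytes
    uint32_t Writable;   // whether the host may write the blob back
    uint64_t Pages[];    // guest physical page frame numbers backing the blob
};

struct TG_PAGED_REQUEST {
    uint32_t Request;          // toolgate opcode
    uint32_t Status;           // written by the host: success / error / pending
    uint16_t InlineByteCount;  // amount of inline data following the header
    uint16_t BufferCount;      // number of out-of-line buffers
    uint32_t RequestSize;      // total size of this variably sized structure
    uint64_t RequestPages[];   // PFNs for each page the structure spans
    // ... followed by inline data and BufferCount TG_PAGED_BUFFER entries
};
```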
- Request: the “opcode” of the toolgate operation
- Status: a return value written by the host to indicate success, an error, or pending
  - Certain requests are asynchronous (and will be pending), as we saw with shared folders earlier. These can be canceled by writing the request’s physical address to the TG_PORT_CANCEL port.
- RequestSize: size of the variably sized TG_PAGED_REQUEST
- RequestPages: physical page frame numbers for each page in the TG_PAGED_REQUEST structure
  - Only really necessary if the request spans multiple pages. RequestPages[0] is somewhat redundant considering the physical address written to the I/O port
- Buffers: “out-of-line” data blobs, again with physical page frame numbers for each page of each blob
  - The name Va implying virtual address is a bit odd, but in reality only the page offset (Va & 0xfff) matters; the Pages array defines the actual location in guest memory by page frame number
The host is free to use the inline data and/or the buffers both as input and/or output. For example, a specific opcode may expect Buffers[0] to be an input buffer, and will use Buffers[1] as the output.
In prl_vm_app, the toolgate is represented by a CPCIToolgate class. One member is an array of registered handler entries containing a minimum/maximum opcode and the handler object itself. Upon receiving a guest request, this array is searched through to find the handler (if any) that handles the given opcode. Then a virtual call on the handler object does all the operation-specific work.
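A sketch of what this dispatch looks like (the structure layout and names here are my interpretation, not the actual decompilation):

```cpp
// One entry per registered tool; kept in an array inside CPCIToolgate
struct HandlerEntry {
    uint32_t min_opcode;        // inclusive
    uint32_t max_opcode;        // inclusive
    CToolgateHandler *handler;  // hypothetical base class name
};

// On each guest request: linear search for a matching opcode range,
// then a virtual call into the handler for the operation-specific work
for (const HandlerEntry &e : toolgate->handlers) {
    if (req->Request >= e.min_opcode && req->Request <= e.max_opcode) {
        e.handler->HandleRequest(req);  // virtual dispatch
        break;
    }
}
```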
The members of this array will define our attack surface. Breaking in the debugger and dumping the handlers array, we can map opcode ranges to their handlers:
| Opcode range (inclusive) | Handler C++ class name |
| ------------------------ | ----------------------- |
| 0x200-0x23f | CSFoldersTool |
| 0x8000-0x8001 | CUTCommandsSender |
| 0x8010-0x8012 | CPTool |
| 0x8020-0x8029 | CSMTool |
| 0x8030-0x8033 | LayoutSyncHost |
| 0x8040-0x8044 | CSFoldersTool |
| 0x8050-0x8050 | Tg2OtgImpl |
| 0x8200-0x8202 | CGracefulShutdownHost |
| 0x8210-0x8211 | CToolsCenterHost |
| 0x8220-0x8221 | CSIAServer |
| 0x8230-0x8231 | CCoherenceToolServer |
| 0x8301-0x8301 | CWinMicroApp |
| 0x8302-0x8302 | CFavRunAppsHost |
| 0x8304-0x8304 | CDragAndDropBase |
| 0x8320-0x8328 | CSHAShellExt |
| 0x8340-0x8347 | CUIEmuHost |
| 0x8410-0x8411 | CSharedProfileTool |
| 0x8420-0x8440 | CInvSharing |
| 0x8500-0x8503 | CDesktopUtilitiesHost |
| 0x8700-0x8702 | CSFilterTgHandler |
| 0x8800-0x8801 | CHostCEP |
| 0x8900-0x8900 | CPrintingTool |
| 0x9020-0x9021 | CLocationTool |
| 0x9030-0x9030 | CEnergySavingTool |
| 0x9040-0x9043 | CVSDebug |
| 0x9050-0x9050 | AutoPauseHost |
| 0x9060-0x9060 | VmHostname |
| 0x9070-0x9070 | UniversalChannelHost |
| 0x9080-0x9080 | CDynResHost |
| 0x9100-0x9101 | CWinMicroApp |
| 0x9110-0x9111 | VolumeControllerHost |
| 0x9120-0x9124 | TimeSynchronizationHost |
| 0x9200-0x9200 | CToolbox |
There seemed to also be a video toolgate completely separate from the standard toolgate (i.e. different virtual device, different I/O base port). This component wasn’t investigated much, but contained the following opcode ranges dumped dynamically as before:
| Opcode range (inclusive) | Handler C++ class name |
| ------------------------ | ---------------------- |
| 0x100-0x11f | VideoTgHandler |
| 0x400-0x41f | DirectXHandler |
| 0x8100-0x811f | VideoTgHandler |
| 0x8130-0x813f | OpenGLHandler |
We can now focus our reversing efforts on the handlers’ virtual methods for handling toolgate requests, code we know is reachable with arbitrary input. Based on the handler class names, these requests should do “interesting” things. The workflow for this is to pick a handler class, find its vtable (which will have a symbol name like `vtable for'CDragAndDropBase`), then reverse the virtual handler function (for example, the function at offset 16 in the vtable).
The path forward from here to vulnerability discovery is mostly the standard reversing grind. Let’s move on to the bugs found in some of these opcode handlers.
Bug 1: Megasmash (an unbounded memcpy(...))
Opcode 0x8050 is handled by Tg2OtgImpl, seemingly a proxy between the toolgate and “OTG” (possibly Open Tools Gate). The proxy works something like this:
- Buffers[0] contains a header of sorts. This includes the size of the OTG input request, and will contain the output return value and response size once handled
- According to the input size from the header, the remaining buffers (starting at Buffers[1]) are “flattened” into a single byte array. This flattened array is treated as the OTG request to pass through.
- An OTG opcode is fetched from offset 0 in the flattened blob, and the request is dispatched to the appropriate OTG handler for that opcode
- The return value and response size are set in the header, and the response blob is “unflattened” into Buffers[1] onwards
The flattened request is represented by a QByteArray, a class supplied by the 3rd-party QtCore library, part of the Qt family. It is instantiated in the following way, using size from the input header:
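Roughly, the call site looks like this (variable names are hypothetical; otg_hdr stands in for the guest-controlled header from Buffers[0]):

```cpp
// otg_hdr->size is a fully guest-controlled 32-bit value; a size of
// -1 (0xffffffff) is never rejected
QByteArray flattened(otg_hdr->size, Qt::Uninitialized);
```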
And the respective QtCore implementation:
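From memory of Qt 5.x’s qbytearray.cpp (the exact code may differ slightly between versions):

```cpp
QByteArray::QByteArray(int size, Qt::Initialization)
{
    d = Data::allocate(uint(size) + 1u);  // uint(-1) + 1u wraps to 0
    Q_CHECK_PTR(d);
    d->size = size;                       // still -1: a "4GB" QByteArray
    d->data()[size] = '\0';
}
```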
There is a pretty standard integer overflow here. If the provided size is -1 (0xffffffff), the size passed to Data::allocate will be 0.
Looking at the allocate implementation, we can see that the following code will return a special empty singleton in QtCore’s data section, qt_array_empty:
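Abridged sketch of Qt 5.x’s QArrayData::allocate (reconstructed from memory; signature and singleton name as referenced above, details may differ):

```cpp
QArrayData *QArrayData::allocate(size_t objectSize, size_t alignment,
                                 size_t capacity, AllocationOptions options)
{
    // a capacity of 0 short-circuits to a static singleton in QtCore's
    // data section -- no heap allocation happens at all
    if (!(options & RawData) && !capacity)
        return const_cast<QArrayData *>(&qt_array_empty);

    // ... normal path: compute the block size and malloc() a real header ...
}
```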
A QByteArray typically stores its data inline; or in other words, d->data() in the snippet above is effectively (void*)d + sizeof(*d) (a pointer to the address immediately following the declared structure fields). In the case of the singleton, the inline data would also be in the data section (although it has size zero).
Back in the OTG proxy, the provided buffers will be flattened by a loop, which copies one buffer at a time into increasing offsets of the newly constructed QByteArray (which is expected to have size 0xffffffff, but is actually empty).
The overall effect of this loop can be summed up as:
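In pseudocode (buffer_data/buffer_size are hypothetical helpers; every buffer’s size and contents are guest-controlled):

```cpp
// dst is the singleton's "inline data", i.e. a pointer into QtCore's
// writable data section rather than into a 4GB heap allocation
char *dst = flattened.data();
size_t offset = 0;
for (uint32_t i = 1; i < req->BufferCount; i++) {
    memcpy(dst + offset, buffer_data(i), buffer_size(i));  // no bounds check
    offset += buffer_size(i);
}
```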
In theory, this lets us corrupt a significant portion of QtCore’s data section. There is, however, the small problem of the excessively large copy being guaranteed to cause a fault (there will be a read-only mapping at some point after the writable data section).
These kinds of “wild copy” bugs are less than ideal but have been proven exploitable by researchers several times in the past [1, 2]. Sometimes it is possible to modify memory or program state during the copy loop to cut it short, which wasn’t viable in this case. Another method is to race another thread using the corrupted data before the copying thread crashes, which I could not find a reliable method of doing.
We’ll just keep this bug in our back pocket for now…
Bug 2: An Information Leak
This bug is less exciting, but provides a useful information leak to bypass ASLR in the host process.
The OTG handler for OTG opcode 10 appears to fetch various configuration settings for the guest. Different types of settings can be requested, the vulnerable type is 2. Here is a rough pseudocode interpretation for this type of request:
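Something along these lines (my interpretation; fill_settings is a hypothetical stand-in for the real logic):

```cpp
void handle_settings_type2(QByteArray &response)
{
    response.resize(0x90);           // QByteArray::resize() does not zero new bytes
    fill_settings(response.data());  // only the first 0x20 bytes are written
    // bytes 0x20..0x8f retain stale heap contents, and the OTG proxy
    // copies the full 0x90-byte response back to the guest
}
```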
This handler resizes the output QByteArray to size 0x90, then proceeds to populate the first 0x20 bytes. The remaining 0x70 bytes are left uninitialized, and will be copied out to the guest by the OTG proxy.
This allows us to leak 0x70 bytes of uninitialized heap memory from the prl_vm_app host process, which can be repeated as necessary to obtain any leaks (heap or .text) we need.
Bug 3: Limited Stack Buffer Overflow
Opcode 0x8304 is handled by CDragAndDropBase, presumably to implement drag and drop behavior.
This handler uses “worker” objects, represented by CDragDropWorkerBase. An initial request must be made to create a new worker, and a handle to this worker is passed back to the guest for use in subsequent requests. As a side note, the handle was simply the heap address of the worker object, so was a limited information leak by itself.
Drag-and-drop requests have their own structure expected by the host, including a type field specifying what drag-and-drop operation to perform. A pseudocode snippet of the vulnerable code is reproduced below, preceded by some relevant structure definitions (struct interpretations are my own):
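(Reconstructed from the discussion below; all layouts, field names, and the handler signature are illustrative, not the actual decompilation.)

```cpp
struct dnd_blob {
    uint32_t type;         // which drag-and-drop operation to perform
    ushort   str[1024];    // guest UTF-16, *supposed* to be NUL-terminated
    uint8_t  extra[0x74];  // trailing request data, also guest-controlled
};

struct dnd_msg {
    ushort str[1024 + 1];         // local copy of the string
    CDragDropWorkerBase *worker;  // sits right after the string buffer
};

void CDragAndDropBase::handleRequest(dnd_blob *req, CDragDropWorkerBase *worker)
{
    dnd_msg msg;
    msg.worker = worker;

    // [1] a length of -1 means "scan for a NUL terminator" -- but nothing
    // guarantees one exists within req->str
    QString s = QString::fromUtf16(req->str, -1);

    // [2] stack copy sized by the string, which may exceed 1024 characters
    // and spill past msg.str into msg.worker
    memcpy(msg.str, s.utf16(), (s.size() + 1) * sizeof(ushort));

    // [3] write through the (possibly corrupted) worker pointer
    msg.worker->current_request = req;  // hypothetical field name
    // ...
}
```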
The guest-supplied string is supposed to contain a maximum of 1024 UTF-16 characters. At the first annotation [1], the length passed to QString::fromUtf16(...) is -1, indicating that the input should be treated as null-terminated to determine the actual length.
There is no check, however, that the input indeed contains a null terminator. The 0x74 additional bytes following the input (and potentially whatever is on the stack afterwards) may be interpreted as part of the string, exceeding the expected 1024 length limit.
At [2], the string is copied into a local stack structure. This is where the overflow can occur. There is just enough data in the input dnd_blob structure to corrupt the msg.worker pointer with a controlled value, which is written to at [3]. This gives us a write of an uncontrolled value (the req pointer) to an arbitrary location.
In theory, if the contents of the stack after the dnd_blob structure were non-zero and somewhat controllable, this overflow could corrupt the return address or other stack data. Empirically, there seemed to always be zeroes following the dnd_blob, and this path was not pursued.
Exception Handling on macOS
It is quite possible that the semi-controlled pointer write above would have been enough to finagle our way to code execution. Instead, our exploit will leverage this weaker bug to render the megasmash (bug #1) reliably exploitable.
The key realization to make is that when the megasmash inevitably causes a page fault, this doesn’t necessarily result in the death of the process. As part of the Mach architecture of macOS, the page fault generates an exception (EXC_BAD_ACCESS
), which is sent as a mach message to the registered exception port (at either the thread, task, or host level). The thread that generated the exception remains suspended until a response to the mach message is received (but other threads may continue to run).
Under normal circumstances, there is neither a thread-level nor a task-level exception port registered, and the host-level (i.e. system-wide) exception handling ends up killing the task. prl_vm_app, however, does register a task-level exception port using task_set_exception_ports, for 3 exception types: EXC_BAD_ACCESS, EXC_BAD_INSTRUCTION, and EXC_ARITHMETIC.
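For illustration, registering such a task-level exception port looks roughly like this (the Mach calls are real APIs; the behavior/flavor values prl_vm_app passes are assumptions):

```cpp
#include <mach/mach.h>

mach_port_t exc_port;
mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &exc_port);
mach_port_insert_right(mach_task_self(), exc_port, exc_port,
                       MACH_MSG_TYPE_MAKE_SEND);

task_set_exception_ports(
    mach_task_self(),
    EXC_MASK_BAD_ACCESS | EXC_MASK_BAD_INSTRUCTION | EXC_MASK_ARITHMETIC,
    exc_port,
    EXCEPTION_DEFAULT,   // assumed behavior value
    THREAD_STATE_NONE);
// a dedicated thread then loops: mach_msg() receive, handle, reply
```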
I Know the Pieces Fit
In terms of intended functionality, my guess is that Parallels intercepts these exceptions for crash reporting purposes. A separate thread is spawned with the sole purpose of, in a loop, reading messages from the exception port, handling them, and responding back.
What happens if this exception handling thread triggers an exception?
- The exception handling thread enters the kernel exception delivery code, and is effectively suspended
- A message is sent to the task-level exception port
- The kernel waits for a response…
This response will never come, seeing as the thread intended by prl_vm_app to respond to the exception port message is itself suspended due to an exception.
As a side note, a potentially more robust technique would have been to set the exception ports at the thread level (with thread_set_exception_ports), for each individual thread except for the exception handler. That way, exceptions generated by the handler will get forwarded to the default host-level machinery.
If we find a way to force the exception handler thread to fault, the megasmash suddenly seems a lot more exploitable:
- Perform whatever setup so the exception handler will crash later
- Megasmash the QtCore data section, which faults and generates an exception
- Exception handler faults and generates an exception, leaving both threads suspended indefinitely
- Wait for another thread to use the corrupted QtCore data section in an interesting way
Using our limited stack-based buffer overflow (bug #3), we can corrupt a pointer the exception handler will later attempt dereferencing, causing the thread to fault and suspend indefinitely.
This is simple to achieve in practice. When the exception handler is being initialized, prior to calling task_set_exception_ports, the pre-existing ports are retrieved (with task_get_exception_ports) and stored in a heap-allocated structure. Later on, when the handler thread receives an exception message, one of its first actions is to reset the exception ports to the original ones using this structure. This heap pointer resides in the data section, and will serve as the target pointer we will corrupt.
Putting Everything Together
The final exploit then looks something like this:
1. Use the info leak (bug #2) to obtain the base addresses of prl_vm_app and QtCore
   - prl_vm_app: Experimenting by running the leak many times had a text pointer showing up relatively often. It turned out to be a function pointer passed to CFRunLoopTimerCreate
   - QtCore: The CDragDropWorkerBase objects contain QList objects, which, if empty, contain a pointer to QListData::shared_null (in the QtCore data section). By spraying many workers then deleting them, we can (eventually) get this leak
2. Use the uncontrolled pointer write (bug #3) to corrupt the pointer to the original exception ports
   - By misaligning the write 4 bytes into the 8-byte pointer, we can ensure the new value will be a non-canonical garbage address
3. Megasmash the QtCore data section (bug #1)
   - We target a function pointer qt_signal_spy_callback_set (a hook of sorts triggered often through normal operation of prl_vm_app) with fully controlled and carefully placed data, amidst our massive corruption of Qt’s writable data segment
   - The megasmash will run off the end of the writable data segment, triggering an EXC_BAD_ACCESS
4. Crash the exception handler thread
   - The prl_vm_app exception handler will attempt to restore the original exception ports from the heap-allocated structure. Since we corrupted this pointer with an invalid address in step 2, the dereference forces the exception handler thread to fault.
5. Wait for another thread to use the corrupted function pointer
   - Stack pivot, then a short ropchain to call system(...)
     - rcx happens to be pointing into the corrupted QtCore data section; an lea rsp, [rcx+0x30] ; ret gadget kicks off the ropchain
     - Providing the string open -a Calculator to system(...) is sufficient to pop calc
Roughly, the steps described above, which make up the final exploit, are illustrated by the following animation:
A less dramatic video of the successful $40,000 exploit entry running live is available on ZDI’s day two stream of Pwn2Own 2021. The full exploit code can be found here on our GitHub.
Bugfixes
The bugs used in this exploit were all patched by Parallels in version 49187.
- Bug #3 (the stack buffer overflow) was patched with a check for a null word in the first 1024 UTF-16 characters
- Bug #2 (the information leak) was patched by calling bzero(...) after resizing the output array
- Bug #1 (megasmash) was patched by ensuring the size passed to QByteArray is non-negative
  - While potentially a dangerous pattern, Qt stated that the QtCore integer overflow was not a bug in QtCore, and that it is instead a misuse of the API to ask to create a structure with negative size
  - Interestingly, at the time, the master branch of QtCore did not exhibit the integer overflow due to a refactor in how the size parameter is passed around.
The confusion over whether Qt should actually fix the overflow further delayed the publication of this writeup.
For one reason or another, ZDI opted to drop Parallels from 2022’s competition:
> RIP Parallels 2021-2021 https://t.co/aThASMwp1P
>
> — Bruno (@bkth_) January 12, 2022
Conclusion
In this post, we detailed a guest-to-host escape of the Parallels Desktop virtualization platform by chaining together three different vulnerabilities. More importantly, we were able to produce another meaningful (real-world) example of both surviving and then exploiting a ‘wild’ unbounded memcpy(...).
We believe that the techniques discussed here can be generalized and applied to successfully exploit similar issues that may otherwise appear impractical or impossible to exploit.