ioctlvhax

Introduction

plutoo and I found this bug back in september of last year. Effectively this will grant ROP in an IOS usermode process which may then be further used to target the kernel. The vulnerability itself is a TOCTTOU race condition.

The Bug

Initially the ioctlv handling of the IOS kernel contained a major design flaw, namely that the buffer address verification of the vectors happens in-place provided that we supply more than 8 vectors. This enables the PPC side to change the buffer address after its verification. Due to the nature of the bug exploitation requires the number of vectors passed in to be relatively large.

int _ioctlv(int fd, int cmd, u32 num_in, u32 num_io, ioctlv_vec *vec, ...)
{
/* ... */
dev_ctxt *dev = get_dev(...);
u32 num_total = num_int + num_out;
/* ... */
/* Copy in vectors if num_total <= 8 else use external vectors. */
/* Check vector buffer addrs. */
/* ... */
}
view raw _ioctlv_old.c hosted with ❤ by GitHub

The Fix

The bug was fixed with version 5.2.0 by adding a new field to the device context that limits the number of vectors, which is set to 8 by default and may be changed using syscall 0x2E if required.

int _ioctlv(int fd, int cmd, u32 num_in, u32 num_io, ioctlv_vec *vec, ...)
{
/* ... */
dev_ctxt *dev = get_dev(...);
u32 num_total = num_int + num_out;
/* ... */
if(num_total <= dev->max_vecs)
{
/* Copy in vectors if num_total <= 8 else use external vectors. */
/* Check vector buffer addrs. */
}
else
res = -11;
/* ... */
}
int _dev_register(char *path, int mqid, int pid)
{
/* ... */
dev->max_vecs = 8;
/* ... */
}
int syscall_2E_set_ioctlv_max_vecs(char *name, u16 num)
{
/* ... */
dev->max_vecs = num;
/* ... */
}
view raw _ioctlv_new.c hosted with ❤ by GitHub

Exploitation

The goal was to gain ROP under an IOS usermode process. For this we had to look for a device that did not check the number of vectors itself. It turns out that “/dev/im” provides us with some very handy ioctlv handlers, namely:

static int dev_im_mq;
static int hb_param_idx;
static int hb_param_type;
int SetDeviceState(fd_ctxt *fd, int *buf)
{
/* ... */
switch(buf[0])
{
/* ... */
case 3:
hb_param_idx = buf[2];
hb_param_type = buf[1];
/* ... */
}
/* ... */
}
int GetHomeButtonParams(int *buf)
{
buf[0] = hb_param_type;
buf[1] = hb_param_idx;
return 0;
}
int dev_im_handler(...)
{
while(!recv_msg(dev_im_mq, &req, 0))
{
switch(req->cmd)
{
/* ... */
case 4:
/* ... */
SetDeviceState(fd_ctxt, (int *)req->vecs[0].phys);
/* ... */
case 7:
/* ... */
GetHomeButtonParams((int *)req->vecs[0].phys);
/* ... */
/* ... */
}
/* ... */
}
/* ... */
}
view raw dev_im.c hosted with ❤ by GitHub

Thus this allows us to write 8 bytes worth of data to an address we eventually control. With this arbitrary write we can now carefully setup a ROP stack inside the AUXIL process, overwrite the return address of one of the devices’ handler threads and get the handler thread to return by overwriting the corresponding message queue handle.

Note that this is by no means the only way to exploit this flaw – interested readers are encouraged to let us know about any alternative strategies they might come up with.

Reading BCA – The Hard Way

If you ever wanted to know what’s stored in a DVD’s BCA, you’ve probably read it out using your favourite device. Here I present you a different approach: reading it out by scanning the area and running “pattern recognition” on the image.

Here is a part of a Wii game that I’ve scanned:

Image

After filtering and thresholding it looks like this:

Image

The black and white image is then fed into a tool I wrote that will estimate the midpoint and the major and minor axes of the inner ellipse, detect all the BCA cuts around it, mark them and spit out the timing between the cuts. This timing can then be decoded into the actual data that is stored in the area. Note that the thick cuts are actually two normal ones that my scanner’s resolution didn’t pick up anymore, the tool will detect that too. Here is a part of a generated image (the blue lines are the detected cuts):

Image

And here is a complete output too:

Image

Shamir’s Secret Sharing

To be able to understand what is going to follow, I first have to do a little mathematics:

Let’s define a polynomial over a finite field P \in \mathbb{F}_q[X] of degree k as P(X) = a_0 + a_1 X + a_2 X^2 + \dots + a_k X^k (a_0,\dots,a_k \in \mathbb{F}_q) where q is a prime number and \mathbb{F}_q is a finite field with q elements. If we have (X_i) (X_i \ne X_j for i \ne j) and (Y_i = P(X_i)) with i = 1,\dots,k we can use Lagrange interpolation to directly calculate a_0: a_0 = \sum_i Y_i \cdot \prod_{i \ne j} (-X_j)(X_i - X_j)^{-1}.

But what are we going to use this for? Consider you want to divide a secret into n different parts and you want that k \leq n parts are required to reconstruct the secret. This is what is called a (n,k) threshold scheme. We can use our little mathematics from above to implement such a scheme. Choose the secret S and create a random polynomial of degree k - 1 with a_0 = S. To create the parts (shares) Y_i evaluate the polynomial at random X_i and distribute the pairs (X_i, Y_i) to the different participants (the X_i can be public, the Y_i are private to each one). To reconstruct the secret, k of the participants can share their Y_i and then use the explicit formula from above ([AS]).

Here is an example run along with some source of my implementation:

//...
bn_t *N = bn_from_str(bn_alloc(32),
    "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
    "FFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F");
bn_t *s = bn_from_str(bn_alloc(32),
    "098765432100DEADBEEF000000000000"
    "000000000000CAFEBABE001234567890");

//Create random polynomial.
poly_t *p = ssecrets_create_poly(s, 3, N);

bn_t **lx = (bn_t **)malloc(sizeof(bn_t *) * 4);
bn_t **ls = (bn_t **)malloc(sizeof(bn_t *) * 4);

u32 i;
for(i = 0; i < 3; i++)
{
    //Evaluate at random x.
    lx[i] = bn_reduce(bn_rand(bn_alloc(N->n)), N);
    ls[i] = ssecrets_create_share(p, lx[i]);
    bn_print(stdout, "P(", lx[i], ") = ");
    bn_print(stdout, "", ls[i], "\n");
}

//Reconstruct secret.
bn_t *cs = ssecrets_calc_secret(lx, ls, 3, N);
bn_print(stdout, "secret = ", cs, "\n");
//...

/*
* P(9EF35877FC0E2DB6B76BFF03B8CFA2EADD90190719DA598482D1569C6F9CA1EF) =
* B451E41F20E5E85EC35CB94E6E88E292B25CBF67B687D93B6B42D18EEE26A8A9
*
* P(810A3DB3AB3E4EA758A535A0616A6C9E1FE08B610D5F0496911A1AFFD3BC8136) =
* B1B26A93205BD5416288A8EB0DE050C76D501247A8BE7BD4BA4416FEB743C90D
*
* P(37D9B47A812B434077C025ADA5300EA41778036222ECD4853E4D64CB92CDF135) =
* 9D58638CB24E0783966106ED8BB9D21F8F308C845829E6DD3CF1F7AD0C0499BB
*
* secret =
* 098765432100DEADBEEF000000000000000000000000CAFEBABE001234567890
*/

References:

 

On Decompilation (Of SPU Binaries)

First one needs to understand, that an ideal decompiler, that will recover the original source code, is impossible to create. The process of compilation is removing a lot of high level information, such as used data types or loop types. Then optimization could re-arrange code/data flow, eliminate common subexpressions, perform function inlining, etc. Therefore any output of our decompiler will most likely not match the original source code, but if we do it right, we can get a (very) close approximation.

So what is it, that I’m doing at the moment?

Currently the decompiler will need to be fed with an ELF binary, that has its section headers present and from them it will extract information about executable and data regions. For every executable region it will then extract subroutines and split every of them into basic blocks. A basic block is a piece of code, that only contains simple operations and ends with a branch. This basics blocks might get split again, if a branch to any of its operation occurs from inside the subroutine. This way we will collect a list of all the basic blocks, can link them according to their exit branches and end up with the control flow graphs of all the subroutines. For identification of high level statements like ifs and loops (see [REC] for loop detection) we compute DFS and dominator trees and for later use the frontiers of each block (see [DOM] for used algorithms). Next we perform register liveness analysis to identify subroutine arguments and return values (see [LIV] for used algorithm). Now we can translate all the assembler instructions into simplified operations, e.g. translate “ori rt, ra, 0” into “rt <- ra” and identify local stack variables in that process too, following IBM’s SPU ABI specifications (see [ABI]).

More additional steps are then required to produce high level code, as you can see in the following list (entries marked with (DONE) are finished and working already):

  • Extract text/data regions (DONE)
  • Extract subroutines (DONE)
  • Extract basic blocks (DONE)
  • Create control flow graphs (DONE)
  • Compute (reverse) DFS/dominator/frontier information (DONE)
  • Extract high level statements using (reverse) DFS/dominator trees (DONE)
  • Local and global liveness analysis to find subroutine arguments (DONE)
  • Extract and simplify operations from every basic block (DONE)
  • Transform operations into SSA form using frontier information (PARTIAL)
  • Perform constant/expression propagation and dead code elimination (PARTIAL)
  • Transform operations from SSA form
  • Build expression tree from optimized operations
  • Output high level (pseudo-)code from restored expressions

I’m currently looking into methods that don’t need SSA form to perform propagation, but if they don’t work (too well) I’ll have to stick with SSA. Here is the output of a little argument test program and here is the original source (R ~ register, S ~ stackvar, I ~ immediate).

After SPU decompilation is working, I plan to refactor the source code to make it more generic, so it’s easier to port it to other CPU architectures (e.g. PPC).

References:

The Exploit

As the exploit that was used to dump lv0ldr/bootldr/howeveryouliketocallit is public now, let’s have a closer look at it to understand what’s going on. Here is what I have reversed from lv0 (it shares the syscon portion of the code with its SPU counterpart):

//In .data section.
static u8 tmp_pkt[0x800];

//Get size from sc packet.
#define GET_SIZE(pkt) ((pkt[4] << 8) | pkt[5])

int read_cmpl_msg(/*...*/, u8 *payload_buf /*r5*/, int min_size /*r6*/, /*...*/)
{
    u16 pkt_size;

    //Get packet header.
    memcpy_aligned_64(tmp_pkt, MMIO_SC_PKT, 0x10);

    //Check packet size.
    pkt_size = GET_SIZE(tmp_pkt);
    if(pkt_size - 4 < min_size || pkt_size + 8 > 0x800)
        return ERR;

    //Run first sc_checksum.
    if(!sc_checksum(...))
        return ERR;

    //Read packet again (plus header!).
    pkt_size = GET_SIZE(tmp_pkt);
    memcpy_aligned_64(tmp_pkt, MMIO_SC_PKT, size + 0x1B);

    //Get size again (not checked now).
    //I suspect that this is actually a compiler 'quirk' and not a
    //programmer mistake. The original source probably accesses the
    //packet size through a structure and the compiler noticed the
    //non const access of the packet and generated this read of the
    //size member because it could have changed.
    pkt_size = GET_SIZE(tmp_pkt);

    //Let's have some fun (payload_buf on caller stack).
    memcpy(payload_buf, tmp_pkt + 8, size - 4);

    //Run second sc_checksum.
    if(!sc_checksum(...))
        return ERR;

    //...
}

The syscon library implements some high level functions, e.g. to shutdown the console on panic or to read certain configuration values. Every of this functions internally uses another function to exchange packets with syscon and the exchange function uses the read_cmpl_msg one to get the answer packet. The top-level function will pass a fixed size buffer to the exchange function. So if we are able to control syscon packets, e.g. by emulating MMIO (and thanks to IBM we are), we can change the packet size between the two packet readings and overwrite the caller stack. And if we first copy a little stub to shared LS and let the return address point to it, we can easily dump the whole 256 kB.

Nothing more left to say now, let’s wait and see if this is going to be fixed in future firmware versions (we just have to check lv0 fortunately).