# Introduction

The goal of deprivileging qemu is this: Even if there is a bug (for
example in qemu) which permits a domain to gain control of the device
model, the compromised device model process is prevented from
violating the system's overall security properties.  That is, a guest
cannot "escape" from the virtualisation by using a qemu bug.

This document lists the various technical measures which we either
have taken, or plan to take, to effect this goal.  Some of them are
required for the system to be considered secure (that is, there are
known attack vectors which they close); others are "just in case"
(that is, there are no known attack vectors, but we apply the
restrictions to reduce the possibility of unknown attack vectors).

# Restrictions done

The following restrictions are currently implemented.

## Having qemu switch user

'''Description''': Have QEMU switch to a non-root user, one per
domain ID.  Not being the root user limits what a compromised QEMU
process can do to the system, and having one user per domain ID limits
what a compromised QEMU process can do to the QEMU processes of other
VMs.

'''Implementation''': The toolstack adds the following to the qemu command-line:

    -runas <uid>:<gid>

'''How to test''':

    grep '[UG]id' /proc/<qpid>/status

'''Testing status''': Not tested

## Xen library / file-descriptor restrictions

'''Description''': Close and restrict Xen-related file descriptors.
Specifically:

 * Close all xenstore-related file descriptors
 * Make sure that all open instances of `privcmd` and `evtchn` file
   descriptors have had `IOCTL_PRIVCMD_RESTRICT` and
   `IOCTL_EVTCHN_RESTRICT_DOMID` ioctls called on them, respectively.

'''Implementation''': The toolstack adds the following to the qemu command-line:

    -xen-domid-restrict

'''How to test''':

Use `fishdescriptor` to pull a file descriptor from a running QEMU,
then use `depriv-fd-checker` to check that it has the desired
properties, and that hypercalls which are meant to fail do fail.  (In
Debian `fishdescriptor` can be found in the binary package
`chiark-scripts`; `depriv-fd-checker` is included in the Xen
source tree.)

'''Testing status''': Tested

## Chroot

'''Description''': QEMU runs in its own chroot, so that even if it
could call an 'open' command of some sort, there would be nothing for
it to see.

'''Implementation''': The toolstack creates a directory in the libxl "run-dir"; e.g.
`/var/run/xen/qemu-root-<domid>`

It then adds the following to the qemu command-line:

    -chroot /var/run/xen/qemu-root-<domid>

'''How to test''': Check `/proc/<qpid>/root`

'''Testing status''': Not tested

## Namespaces for unused functionality (Linux only)

'''Description''': QEMU doesn't use the functionality associated with
mount and IPC namespaces.  (IPC namespaces control non-file-based IPC
mechanisms within the kernel; unix and network sockets are not
affected by this.)  Making separate namespaces for these for QEMU
won't affect normal operation, but it does mean that even if other
restrictions fail, the process won't be able to even name system mount
points or existing non-file-based IPC descriptors to attempt to attack
them.

'''Implementation''':

In theory this could be done in QEMU (similar to -sandbox, -runas,
-chroot, and so on), but a patch doing this in QEMU was NAKed upstream
(see [qemu-namespaces]).
They preferred that this be done as a setup step by
whatever executes QEMU; i.e., have the process which execs QEMU first
call:

    unshare(CLONE_NEWNS | CLONE_NEWIPC)

'''How to test''': Check `/proc/<qpid>/ns/[ipc,mnt]`

'''Testing status''': Not tested

[qemu-namespaces]: https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg04723.html

## Basic RLIMITs

'''Description''': A number of limits on the resources that a given
process / userid is allowed to consume.  These can limit the ability
of a compromised QEMU process to DoS domain 0 by exhausting various
resources available to it.

'''Implementation''':

Limits that can be implemented immediately without much effort:

 - `RLIMIT_FSIZE` (file size) to 256KiB.

Probably not necessary but why not:

 - `RLIMIT_CORE`: 0
 - `RLIMIT_MSGQUEUE`: 0
 - `RLIMIT_LOCKS`: 0
 - `RLIMIT_MEMLOCK`: 0

Note: mlock() is used by QEMU only when both "realtime" and "mlock"
are specified; this does not apply to QEMU running as a Xen DM.

'''How to test''': Check `/proc/<qpid>/limits`

'''Testing status''': Not tested

## libxl UID cleanup

'''Description''': Domain IDs are reused, and thus restricted UIDs are
reused.  If a compromised QEMU can fork (due to seccomp or
RLIMIT_NPROC limits being ineffective for some reason), it may avoid
being killed when its domain dies, then wait until the domain ID is
reused, at which point it will have control over the domain in
question (which probably belongs to someone else).

libxl should kill all processes running under the UIDs associated with
a domain both when the VM is destroyed, and before starting a VM with
the same UID.

'''Implementation''': This is unnecessarily tricky.

The kill() system call can have three kinds of targets:

 - A single pid
 - A process group
 - "Every process except me to which I am allowed to send a signal" (-1)

Targeting a single pid is racy and likely to be beaten by the
following loop:

    while(1) {
        if(fork())
            _exit(0);
    }

That is, by the time you've read the process list and found the
process id you want to kill, that process has exited and there is a
new process whose pid you don't know about.

Targeting a process group will be ineffective, as unprivileged
processes are allowed to make their own process groups.

kill(-1) can be used but must be done with care.  Consider the
following code, for example:

    setuid(target_uid);
    kill(-1, 9);

This looks like it will do the trick; but by setting all of the user
ids (effective, real, and saved), it opens the 'killing' process up to
being killed by the target process:

    while(1) {
        if(fork())
            _exit(0);
        else
            kill(-1, 9);
    }

Fortunately there is an asymmetry we can take advantage of.  From the
POSIX spec:

> For a process to have permission to send a signal to a process
> designated by pid, unless the sending process has appropriate
> privileges, the real or effective user ID of the sending process shall
> match the real or saved set-user-ID of the receiving process.

The solution is to allocate a second "reaper" uid that is only used to kill
target processes.  We set the euid of the killing process to the `target_uid`,
but the ruid of the killing process to `reaper_uid`, leaving the suid of the
killing process as 0:

    setresuid(reaper_uid, target_uid, 0);
    kill(-1, 9);

NOTE: We cannot use `setreuid(reaper_uid, target_uid)` here, as that
will set *both* euid *and* suid to `target_uid`, making the killing
process vulnerable to the target process again.
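
The asymmetry can be made concrete with a short C sketch.  This is an
illustration under stated assumptions, not libxl's actual code: the
names `struct cred`, `may_signal()` and `kill_uid()` are hypothetical.
`may_signal()` models only the unprivileged case of the POSIX rule
quoted above; `kill_uid()` shows one way the `setresuid()` / `kill()`
pair might be wrapped in a child process so that the caller keeps its
own credentials.  It must run as root and assumes Linux, where
`kill(-1, ...)` fails with `ESRCH` if no process was signalled.

```c
#define _GNU_SOURCE             /* for setresuid() on glibc */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Real, effective and saved user IDs of a process (hypothetical type). */
struct cred { uid_t ruid, euid, suid; };

/*
 * Model of the POSIX rule for an unprivileged sender: the sender's
 * real or effective UID must match the receiver's real or saved UID.
 */
static bool may_signal(struct cred sender, struct cred receiver)
{
    return sender.ruid == receiver.ruid || sender.ruid == receiver.suid ||
           sender.euid == receiver.ruid || sender.euid == receiver.suid;
}

/*
 * Kill every process that can be signalled with ruid = reaper_uid and
 * euid = target_uid.  Must be called as root.  Returns 0 on success
 * (including "nothing left to kill"), -1 on failure.
 */
static int kill_uid(uid_t reaper_uid, uid_t target_uid)
{
    int status;
    pid_t child;

    /* With uid 0 here, kill(-1) would take down the whole system. */
    assert(reaper_uid != 0 && target_uid != 0);

    child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* ruid = reaper, euid = target, suid stays 0. */
        if (setresuid(reaper_uid, target_uid, 0) != 0)
            _exit(1);
        if (kill(-1, SIGKILL) != 0 && errno != ESRCH)
            _exit(1);
        _exit(0);
    }

    if (waitpid(child, &status, 0) != child)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

With a target process whose three UIDs are all `target_uid`,
`may_signal()` returns true from the killer (ruid `reaper_uid`, euid
`target_uid`, suid 0) to the target, and false in the other direction.
Using the credentials that `setuid(target_uid)` or
`setreuid(reaper_uid, target_uid)` would leave behind makes it return
true in both directions, which is the vulnerability described above.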

Since this will kill all other `reaper_uid` processes as well, we must
either allocate a separate `reaper_uid` per domain, or use locking to
ensure that only one killing process is active at a time.

# Restrictions / improvements still to do

This section lists potential restrictions still to do, roughly in
order of ease of implementation, with low-hanging fruit first.

## Further RLIMITs

RLIMIT_AS limits the total amount of memory; but this includes the
virtual memory which QEMU uses as a mapcache.  xen-mapcache.c already
fiddles with this; it would be straightforward to make it *set* the
rlimit to what it thinks a sensible limit is.

RLIMIT_NPROC limits the total number of processes or threads.  QEMU
uses threads for some devices, so this would require some thought.

Other things that would take some cleverness / changes to QEMU to
utilize due to ordering constraints:

 - RLIMIT_NOFILE (after all necessary files are opened)

## libxl: Treat QMP connection as untrusted

'''Description''': Currently libxl talks with QEMU via QMP; but its
interactions have not historically been considered from a security
point of view.  For example, qmp_synchronous_send() waits for a
response from QEMU, which a compromised QEMU could simply not send
(thus preventing the toolstack from making forward progress).

'''Implementation''': Audit toolstack interactions with QEMU which
happen after the guest has started running, and assume QEMU has been
compromised.

## seccomp filtering (Linux only)

'''Description''': Turn on seccomp filtering to disable syscalls which
QEMU doesn't need.

'''Implementation''': Enable from the command-line:

    -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny

`elevateprivileges` is currently required to allow `-runas` to work.
Removing this requirement would mean making sure that the uid change
happened before the seccomp(2) call, perhaps by changing the uid before
executing QEMU.  (But this would then require other changes to create
the QMP socket, VNC socket, and so on.)

It should be noted that `-sandbox` is implemented as a blacklist, not
a whitelist; that is, it disables known-unused functionality which may
be harmful, rather than disabling all functionality except that known
to be safe and needed.  This is unfortunately necessary since qemu
doesn't know what system calls libraries might end up making.  (See
[lwn-seccomp] for a more complete discussion.)

This feature is not on by default and may not be available in all
environments.  We therefore need to either:

 1. Require that this feature be enabled to build qemu
 2. Check for `-sandbox` support at runtime before using it

[lwn-seccomp]: https://lwn.net/Articles/738694/

## Disks

The chroot (and seccomp?) happens late enough that QEMU can
initialize itself and open its disks.  If you want to add a disk at run
time or insert a CD, you can't pass a path because QEMU is
chrooted.  Instead use the add-fd QMP command and use
/dev/fdset/<fdset-id> as the path.

A further layer of restriction could be to set RLIMIT_NOFILE to '0',
and hand all disks over QMP.

## Migration

When calling xen-save-devices-state, since QEMU is running in a chroot
it is not useful to pass a filename (it doesn't even have write access
inside the chroot).  Instead, give it an open fd using the add-fd
mechanism.

Additionally, all the restrictions need to be applied to the qemu
started up on the post-migration side.
One issue that needs to be
solved is how to signal the toolstack on restore that qemu is ready
for the domain to be started (since this is normally done via
xenstore, and at this point the xenstore connections will have been
closed).

## Network namespacing (Linux only)

Enter QEMU into its own network namespace (in addition to mount & IPC
namespaces):

    unshare(CLONE_NEWNET);

As a Xen DM, QEMU does actually use the network namespace, for two
purposes: to set up network tap devices, and to open VNC connections.

### Network

If QEMU runs in its own network namespace, it can't open the tap
device itself because the interface won't be visible outside of its
own namespace.  So instead, have the toolstack open the device and pass
it as an fd on the command-line:

    -device rtl8139,netdev=tapnet0,mac=... -netdev tap,id=tapnet0,fd=<tapfd>

### VNC

If QEMU runs in its own network namespace, it is not straightforward
to listen on a TCP socket outside of its own network namespace.  One
option would be to use VNC over a UNIX socket:

    -vnc unix:/var/run/xen/vnc-<domid>

However, this would break functionality in the general case; I think
we need to have the toolstack open a socket and pass the fd to QEMU
(which requires changes to QEMU).
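
The toolstack half of that idea might look roughly like the C sketch
below.  It is only an illustration: `open_listener_for_qemu()` is a
hypothetical helper, not an existing libxl function, and how QEMU
would be told to use the inherited descriptor is exactly the missing
QEMU change mentioned above, so it is not shown.  The sketch relies
on the fact that a socket stays in the network namespace in which it
was created, even if the process holding it later moves to another
namespace.

```c
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical toolstack helper: open a TCP listening socket in the
 * toolstack's own network namespace, to be inherited by a QEMU that
 * runs in a separate one.  addr and port are in host byte order.
 * Returns the file descriptor, or -1 on failure.
 */
static int open_listener_for_qemu(uint32_t addr, uint16_t port)
{
    struct sockaddr_in sa = { 0 };
    int one = 1;
    int flags;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(addr);
    sa.sin_port = htons(port);

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) != 0 ||
        bind(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0 ||
        listen(fd, 1) != 0) {
        close(fd);
        return -1;
    }

    /* Clear close-on-exec so the descriptor survives exec() of QEMU. */
    flags = fcntl(fd, F_GETFD);
    if (flags < 0 || fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) != 0) {
        close(fd);
        return -1;
    }

    return fd;
}
```

The process that launches QEMU would call this before
`unshare(CLONE_NEWNET)` and before exec, then hand the fd number to
QEMU by whatever interface the QEMU-side change ends up providing.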