# Introduction

The goal of deprivileging QEMU is this: Even if there is a bug (for
example in QEMU) which permits a domain to gain control of the device
model, the compromised device model process is prevented from
violating the system's overall security properties.  That is, a guest
cannot "escape" from the virtualisation by using a QEMU bug.

This document lists the various technical measures which we have
either taken or plan to take to achieve this goal.  Some of them are
required for the system to be considered secure (that is, there are
known attack vectors which they close); others are "just in case"
(that is, there are no known attack vectors, but we apply the
restrictions to reduce the possibility of unknown attack vectors).

# Restrictions done

The following restrictions are currently implemented.

## Having QEMU switch user

'''Description''': Have QEMU switch to a non-root user, one per
domain ID.  Not being the root user limits what a compromised QEMU
process can do to the system, and having one user per domain ID limits
what a compromised QEMU process can do to the QEMU processes of other
VMs.

'''Implementation''': The toolstack adds the following to the QEMU command line:

    -runas <uid>:<gid>

'''How to test''':

    grep '[UG]id' /proc/<qpid>/status

'''Testing status''': Not tested

## Xen library / file-descriptor restrictions

'''Description''': Close and restrict Xen-related file descriptors.
Specifically:
 * Close all xenstore-related file descriptors
 * Make sure that all open instances of `privcmd` and `evtchn` file
   descriptors have had `IOCTL_PRIVCMD_RESTRICT` and
   `IOCTL_EVTCHN_RESTRICT_DOMID` ioctls called on them, respectively.

'''Implementation''': The toolstack adds the following to the QEMU command line:

    -xen-domid-restrict

'''How to test''':

Use `fishdescriptor` to pull a file descriptor from a running QEMU,
then use `depriv-fd-checker` to check that it has the desired
properties, and that hypercalls which are meant to fail do fail.  (In
Debian `fishdescriptor` can be found in the binary package
`chiark-scripts`; `depriv-fd-checker` is included in the Xen
source tree.)

'''Testing status''': Tested

## Chroot

'''Description''': QEMU runs in its own chroot, such that even if it
could call an 'open' command of some sort, there would be nothing for
it to see.

'''Implementation''': The toolstack creates a directory in the libxl "run-dir"; e.g.
`/var/run/xen/qemu-root-<domid>`.

It then adds the following to the QEMU command line:

    -chroot /var/run/xen/qemu-root-<domid>

'''How to test''':  Check `/proc/<qpid>/root`

'''Testing status''': Not tested

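A minimal C sketch of the `/proc/<qpid>/root` check, assuming it is run
in domain 0 by a user allowed to inspect the QEMU process (normally
root); `proc_root_is` is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Return 1 if /proc/<pid>/root resolves to `expected`, 0 if it resolves
 * to something else, and -1 if the link cannot be read (for example
 * because the caller lacks permission to inspect the process).
 */
int proc_root_is(pid_t pid, const char *expected)
{
    char path[64], target[4096];
    ssize_t n;

    snprintf(path, sizeof(path), "/proc/%ld/root", (long)pid);
    n = readlink(path, target, sizeof(target) - 1);
    if (n < 0)
        return -1;
    target[n] = '\0';   /* readlink() does not NUL-terminate */
    return strcmp(target, expected) == 0;
}
```

For a correctly confined QEMU, the expected value would be the
per-domain directory, e.g. `/var/run/xen/qemu-root-<domid>`; a process
that has not been chrooted reports `/`.
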
## Namespaces for unused functionality (Linux only)

'''Description''': QEMU doesn't use the functionality associated with
mount and IPC namespaces. (IPC namespaces control non-file-based IPC
mechanisms within the kernel; unix and network sockets are not
affected by this.)  Making separate namespaces for these for QEMU
won't affect normal operation, but it does mean that even if other
restrictions fail, the process won't even be able to name system mount
points or existing non-file-based IPC descriptors in order to attack
them.

'''Implementation''':

In theory this could be done in QEMU (similar to -sandbox, -runas,
-chroot, and so on), but a patch doing this in QEMU was NAKed upstream
(see [qemu-namespaces]). Upstream preferred that this be done as a
setup step by whatever executes QEMU; i.e., have the process which
execs QEMU first call:

    unshare(CLONE_NEWNS | CLONE_NEWIPC)

'''How to test''':  Check `/proc/<qpid>/ns/[ipc,mnt]`

'''Testing status''': Not tested

[qemu-namespaces]: https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg04723.html

## Basic RLIMITs

'''Description''': A number of limits on the resources that a given
process / userid is allowed to consume.  These can limit the ability
of a compromised QEMU process to DoS domain 0 by exhausting various
resources available to it.

'''Implementation''':

Limits that can be implemented immediately without much effort:
 - `RLIMIT_FSIZE` (file size): 256KiB

Probably not necessary, but why not:
 - `RLIMIT_CORE`: 0
 - `RLIMIT_MSGQUEUE`: 0
 - `RLIMIT_LOCKS`: 0
 - `RLIMIT_MEMLOCK`: 0

Note: mlock() is used by QEMU only when both "realtime" and "mlock"
are specified; this does not apply to QEMU running as a Xen DM.

'''How to test''': Check `/proc/<qpid>/limits`

'''Testing status''': Not tested

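A minimal C sketch of how the limits listed in this section might be
applied.  It assumes the code runs in the child process between fork()
and the exec of QEMU (where exactly libxl applies them is not stated
here); `set_limit` and `apply_basic_rlimits` are hypothetical names.

```c
#include <stddef.h>
#include <sys/resource.h>

/* Set both the soft and the hard limit of `resource` to `value`. */
int set_limit(int resource, rlim_t value)
{
    struct rlimit rl = { .rlim_cur = value, .rlim_max = value };

    return setrlimit(resource, &rl);
}

/*
 * Apply the basic limits to the calling process.  Resource limits are
 * preserved across exec, so the QEMU image started afterwards inherits
 * them.  Returns 0 on success, -1 on the first failure.
 */
int apply_basic_rlimits(void)
{
    static const struct {
        int resource;
        rlim_t value;
    } limits[] = {
        { RLIMIT_FSIZE,    256 * 1024 },  /* 256KiB */
        { RLIMIT_CORE,     0 },
        { RLIMIT_MSGQUEUE, 0 },
        { RLIMIT_LOCKS,    0 },
        { RLIMIT_MEMLOCK,  0 },
    };
    size_t i;

    for (i = 0; i < sizeof(limits) / sizeof(limits[0]); i++)
        if (set_limit(limits[i].resource, limits[i].value) < 0)
            return -1;
    return 0;
}
```

Because the hard limit is lowered together with the soft limit, an
unprivileged QEMU cannot raise the limits again.
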
## libxl UID cleanup

'''Description''': Domain IDs are reused, and thus restricted UIDs are
reused.  If a compromised QEMU can fork (due to seccomp or
RLIMIT_NPROC limits being ineffective for some reason), it may avoid
being killed when its domain dies, then wait until the domain ID is
reused, at which point it will have control over the domain in
question (which probably belongs to someone else).

libxl should kill all processes running under the UIDs associated with
a domain both when the VM is destroyed, and before starting a VM with
the same UID.

'''Implementation''': This is unnecessarily tricky.

The kill() system call can have three kinds of targets:
 - A single pid
 - A process group
 - "Every process except me to which I am allowed to send a signal" (-1)

Targeting a single pid is racy and likely to be beaten by the
following loop:

    while (1) {
        if (fork())
            _exit(0);
    }

That is, by the time you've read the process list and found the
process id you want to kill, that process has exited and there is a
new process whose pid you don't know about.

Targeting a process group will be ineffective, as unprivileged
processes are allowed to make their own process groups.

kill(-1) can be used but must be done with care.  Consider the
following code, for example:

    setuid(target_uid);
    kill(-1, 9);

This looks like it will do the trick; but by setting all of the user
ids (effective, real, and saved), it opens the 'killing' process up to
being killed by the target process:

    while (1) {
        if (fork())
            _exit(0);
        else
            kill(-1, 9);
    }

Fortunately there is an asymmetry we can take advantage of.  From the
POSIX spec:

> For a process to have permission to send a signal to a process
> designated by pid, unless the sending process has appropriate
> privileges, the real or effective user ID of the sending process shall
> match the real or saved set-user-ID of the receiving process.

The solution is to allocate a second "reaper" uid that is only used to kill
target processes.  We set the euid of the killing process to the `target_uid`,
but the ruid of the killing process to `reaper_uid`, leaving the suid of the
killing process as 0:

    setresuid(reaper_uid, target_uid, 0);
    kill(-1, 9);

The killing process may signal the targets, because its euid matches
their real and saved uids; the targets may not signal it back, because
their real and effective uids (`target_uid`) match neither its ruid
(`reaper_uid`) nor its suid (0).

NOTE: We cannot use `setreuid(reaper_uid, target_uid)` here, as that
will set *both* euid *and* suid to `target_uid`, making the killing
process vulnerable to the target process again.

Since this will kill all other `reaper_uid` processes as well, we must
either allocate a separate `reaper_uid` per domain, or use locking to
ensure that only one killing process is active at a time.

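The reasoning in this section can be checked with a small model of the
POSIX rule, followed by a sketch of the reaper itself.  Both are
illustrative only: `struct ids`, `may_signal` and `reap_uid` are
hypothetical names, the model ignores privileged senders, and the
reaper assumes it is called as root with a `reaper_uid` that is either
per-domain or protected by a lock.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * glibc only declares setresuid() when _GNU_SOURCE is defined before
 * the first system header; declare it here so that the sketch does not
 * depend on that.
 */
int setresuid(uid_t ruid, uid_t euid, uid_t suid);

struct ids { uid_t ruid, euid, suid; };

/*
 * The POSIX rule for unprivileged senders: the sender's real or
 * effective UID must match the receiver's real or saved UID.
 */
int may_signal(struct ids sender, struct ids receiver)
{
    return sender.ruid == receiver.ruid || sender.ruid == receiver.suid ||
           sender.euid == receiver.ruid || sender.euid == receiver.suid;
}

/*
 * Kill every process running as target_uid.  The work is done in a
 * child so that the caller keeps its own credentials.  Returns 0 if
 * the child switched UIDs and issued the kill, -1 otherwise.
 */
int reap_uid(uid_t reaper_uid, uid_t target_uid)
{
    int status;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        /* ruid = reaper, euid = target, suid = 0 */
        if (setresuid(reaper_uid, target_uid, 0) < 0)
            _exit(1);
        /* Result ignored: failure may just mean nothing was left to kill. */
        kill(-1, SIGKILL);
        _exit(0);
    }
    if (waitpid(child, &status, 0) != child)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

With a target `{ t, t, t }`, the model gives the three cases discussed
in this section: a killer `{ t, t, t }` (after `setuid`) and a killer
`{ r, t, t }` (after `setreuid`) can both be signalled by the target,
whereas a killer `{ r, t, 0 }` (after `setresuid`) cannot.
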
# Restrictions / improvements still to do

This lists potential restrictions still to do.  It is meant to be
listed in order of ease of implementation, with low-hanging fruit
first.

## Further RLIMITs

`RLIMIT_AS` limits the total amount of memory; but this includes the
virtual memory which QEMU uses as a mapcache.  xen-mapcache.c already
fiddles with this; it would be straightforward to make it *set* the
rlimit to what it thinks a sensible limit is.

`RLIMIT_NPROC` limits the total number of processes or threads.  QEMU
uses threads for some devices, so this would require some thought.

Other things that would take some cleverness / changes to QEMU to
utilize due to ordering constraints:
 - `RLIMIT_NOFILE` (after all necessary files are opened)

## libxl: Treat QMP connection as untrusted

'''Description''': Currently libxl talks with QEMU via QMP; but its
interactions have not historically been considered from a security
point of view.  For example, qmp_synchronous_send() waits for a
response from QEMU, which a compromised QEMU could simply not send
(thus preventing the toolstack from making forward progress).

'''Implementation''': Audit toolstack interactions with QEMU which
happen after the guest has started running, and assume QEMU has been
compromised.

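One building block for such an audit is never blocking indefinitely on
QEMU.  A minimal C sketch, assuming a plain file descriptor rather than
libxl's own event machinery; `read_with_timeout` is a hypothetical
helper, not an existing libxl function.

```c
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to `len` bytes from `fd`, waiting at most `timeout_ms`
 * milliseconds for data to arrive.  Returns the number of bytes read
 * (0 means the peer closed the connection), or -1 with errno set;
 * errno is ETIMEDOUT if nothing arrived in time.
 *
 * Simplification: if poll() is interrupted by a signal, the full
 * timeout starts again instead of only the remaining time.
 */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    if (rc == 0) {
        errno = ETIMEDOUT;
        return -1;
    }
    return read(fd, buf, len);
}
```

A caller that gets `ETIMEDOUT` can treat the device model as
unresponsive and continue (for instance by destroying the domain)
instead of waiting on a QEMU that may be compromised.
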
## seccomp filtering (Linux only)

'''Description''': Turn on seccomp filtering to disable syscalls which
QEMU doesn't need.

'''Implementation''': Enable from the command line:

    -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny

`elevateprivileges` is currently required to allow `-runas` to work.
Removing this requirement would mean making sure that the uid change
happened before the seccomp2 call, perhaps by changing the uid before
executing QEMU.  (But this would then require other changes to create
the QMP socket, VNC socket, and so on).

It should be noted that `-sandbox` is implemented as a blacklist, not
a whitelist; that is, it disables known-unused functionality which may
be harmful, rather than disabling all functionality except that known
to be safe and needed.  This is unfortunately necessary since QEMU
doesn't know what system calls libraries might end up making.  (See
[lwn-seccomp] for a more complete discussion.)

This feature is not on by default and may not be available in all
environments.  We therefore need to either:
 1. Require that this feature be enabled to build QEMU
 2. Check for `-sandbox` support at runtime before using it

[lwn-seccomp]: https://lwn.net/Articles/738694/

## Disks

The chroot (and seccomp?) happens late enough that QEMU can
initialize itself and open its disks. If you want to add a disk at run
time or insert a CD, you can't pass a path because QEMU is
chrooted. Instead use the add-fd QMP command and use
/dev/fdset/<fdset-id> as the path.

A further layer of restriction could be to set `RLIMIT_NOFILE` to '0',
and hand all disks over QMP.

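On a UNIX-domain QMP socket, the descriptor for add-fd travels as
`SCM_RIGHTS` ancillary data alongside the command.  A minimal C sketch
under that assumption; `send_with_fd` and `qmp_add_fd` are hypothetical
helper names, and QMP capability negotiation and reply parsing are
omitted.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Send `len` bytes of `data` over the UNIX socket `sock`, attaching
 * `fd` as SCM_RIGHTS ancillary data.  Returns 0 on success, -1 on error.
 */
int send_with_fd(int sock, const void *data, size_t len, int fd)
{
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;           /* forces correct alignment */
    } ctrl;
    struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == (ssize_t)len ? 0 : -1;
}

/* Pass `fd` to QEMU with an add-fd command for fd set `fdset_id`. */
int qmp_add_fd(int qmp_sock, int fd, int fdset_id)
{
    char cmd[96];
    int n = snprintf(cmd, sizeof(cmd),
                     "{ \"execute\": \"add-fd\", "
                     "\"arguments\": { \"fdset-id\": %d } }", fdset_id);

    if (n < 0 || (size_t)n >= sizeof(cmd))
        return -1;
    return send_with_fd(qmp_sock, cmd, (size_t)n, fd);
}
```

After a successful `qmp_add_fd(sock, disk_fd, 1)`, the disk would be
referred to as `/dev/fdset/1` in subsequent commands.
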
## Migration

When calling xen-save-devices-state, since QEMU is running in a chroot
it is not useful to pass a filename (it doesn't even have write access
inside the chroot). Instead, give it an open fd using the add-fd
mechanism.

Additionally, all the restrictions need to be applied to the QEMU
started up on the post-migration side.  One issue that needs to be
solved is how to signal the toolstack on restore that QEMU is ready
for the domain to be started (since this is normally done via
xenstore, and at this point the xenstore connections will have been
closed).

## Network namespacing (Linux only)

Enter QEMU into its own network namespace (in addition to mount & IPC
namespaces):

    unshare(CLONE_NEWNET);

When running as a Xen DM, QEMU does actually use the network for two
purposes: 1) To set up network tap devices 2) To open VNC connections.

### Network

If QEMU runs in its own network namespace, it can't open the tap
device itself, because the interface is not visible from outside the
namespace it lives in. So instead, have the toolstack open the device
and pass it as an fd on the command line:

    -device rtl8139,netdev=tapnet0,mac=... -netdev tap,id=tapnet0,fd=<tapfd>

### VNC

If QEMU runs in its own network namespace, it is not straightforward
to listen on a TCP socket outside of its own network namespace. One
option would be to use VNC over a UNIX socket:

    -vnc unix:/var/run/xen/vnc-<domid>

However, this would break functionality in the general case; I think
we need to have the toolstack open a socket and pass the fd to QEMU
(which requires changes to QEMU).