% QEMU Deprivileging / dm_restrict
% Revision 1

\clearpage

# Basics
---------------- ----------------------------------------------------
         Status: **Tech Preview**

Architecture(s): x86

   Component(s): toolstack
---------------- ----------------------------------------------------

# Overview

By default, the QEMU device model is run in domain 0.  If an attacker
can gain control of a QEMU process, it could easily take control of a
system.

dm_restrict is a set of operations to restrict QEMU running in domain
0.  It consists of two halves:

1. Mechanisms to restrict QEMU to only being able to affect its own
   domain

2. Mechanisms to restrict QEMU's ability to interact with domain 0.

# User details

## Getting the right versions of software

Linux: 4.11+

QEMU: 3.0+ (or the version that comes with Xen 4.12+)

## Setting up a group and userid range

For maximum security, libxl needs to run the device model for each
domain under a user id (UID) corresponding to its domain id.  There
are 32752 possible domain IDs, and so libxl needs 32752 user ids set
aside for it.  Setting up a group for all device models to run as is
also recommended.

The simplest and most effective way to do this is to allocate a
contiguous block of UIDs, and create a single user named
`xen-qemuuser-range-base` with the first UID.  For example, under
Debian:

    adduser --system --uid 131072 --group --no-create-home xen-qemuuser-range-base

Two comments on this method:

1. Most modern systems have 32-bit UIDs, and so can in theory go up
   to 2^31 (or 2^32 if uids are unsigned).  POSIX only guarantees
   16-bit UIDs, however; UID 65535 is reserved for an invalid value,
   and 65534 is normally allocated to "nobody".

2. Additionally, some container systems have proposed using the upper
   16 bits of the uid for a container ID.  Using a multiple of 2^16
   for the range base (as is done above) will result in all UIDs
   being interpreted by such systems as a single container ID.

Another, less secure way is to run all QEMUs as the same UID.  To do
this, create a user named `xen-qemuuser-shared`; for example:

    adduser --no-create-home --system xen-qemuuser-shared

A final way to give each QEMU its own UID is to allocate one UID per
VM, and set the UID in the domain config file with the
`device_model_user` argument.  For example, suppose you have a VM
named `c6-01`.  You might do the following:

    adduser --system --no-create-home --group xen-qemuuser-c6-01

And then add the following line to your config file:

    device_model_user="xen-qemuuser-c6-01"

If you use this method, you should also allocate one "reaper" user to
be used for killing device models:

    adduser --system --no-create-home --group xen-qemuuser-reaper

NOTE: It is important when using `device_model_user` that EACH VM
HAVE A SEPARATE UID, and that none of these UIDs map to root.  xl
will throw an error if a uid maps to zero, but not if multiple VMs
have the same uid.  Multiple VMs with the same device model uid will
cause problems.

It is also important that `xen-qemuuser-reaper` not have any
processes associated with it, as they will be destroyed when
deprivileged qemu processes are destroyed.

## Domain config changes

The core domain config change is to add the following line to the
domain configuration:

    dm_restrict=1

This will perform a number of restrictions, outlined below in the
'Technical details' section; a combined example of the UID setup and
domain configuration follows that section.

# Technical details

See docs/design/qemu-deprivilege.md for technical details.
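As an illustration, the following sketch ties together the steps
above: creating the UID range base user, enabling `dm_restrict` for a
guest, and checking which user the device model runs as.  The guest
name `guest1`, the configuration path `/etc/xen/guest1.cfg`, and the
device model binary name `qemu-system-i386` are placeholders and may
differ on your system; other required guest options (memory, disk,
and so on) are omitted.

    # /etc/xen/guest1.cfg (fragment; name and other options are placeholders)
    name = "guest1"
    type = "hvm"
    dm_restrict = 1

After creating the UID range as described earlier and starting the
guest, the device model process should be running under a UID taken
from that range rather than as root:

    xl create /etc/xen/guest1.cfg
    # The device model binary name may differ on your system.
    ps -C qemu-system-i386 -o user,pid,args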
# Limitations

The following features still need to be implemented:

* Inserting a new cdrom while the guest is running (xl cdrom-insert)

* Support for qdisk backends

A number of restrictions still need to be implemented.  A compromised
device model may be able to do the following:

* Delay or exploit weaknesses in the toolstack

* Launch "fork bombs" or other resource exhaustion attacks

* Make network connections on the management network

* Break out of the restrictions after migration

Additionally, getting PCI passthrough to work securely would require
a significant rework of how passthrough works at the moment.  It may
be implemented at some point but is not a near-term priority.

See SUPPORT.md for security support status.

# History

------------------------------------------------------------------------
Date       Revision Version  Notes
---------- -------- -------- -------------------------------------------
2018-09-14 1        Xen 4.12 Imported from docs/misc
---------- -------- -------- -------------------------------------------