=head1 NAME

xen-tscmode - Xen TSC (time stamp counter) and timekeeping discussion

=head1 OVERVIEW

As of Xen 4.0, a new config option called tsc_mode may be specified
for each domain.  The default for tsc_mode handles the vast majority
of hardware and software environments.  This document is targeted
at Xen users and administrators who may need to select a non-default
tsc_mode.

Proper selection of tsc_mode depends on an understanding not only of
the guest operating system (OS), but also of the set of applications
that will ever run on this guest OS.  This is because tsc_mode applies
equally to both the OS and ALL apps that are running on this
domain, now or in the future.

Key questions to be answered for the OS and/or each application are:

=over 4

=item *

Does the OS/app use the rdtsc instruction at all?
(We will explain below how to determine this.)

=item *

At what frequency is the rdtsc instruction executed by either the OS
or any running apps?  If the sum exceeds about 10,000 rdtsc instructions
per second per processor, we call this a "high-TSC-frequency"
OS/app/environment.  (This is relatively rare, and developers of OSes
and apps that are high-TSC-frequency are usually aware of it.)

=item *

If the OS/app does use rdtsc, will it behave incorrectly if "time goes
backwards" or if the frequency of the TSC suddenly changes?  If so,
we call this a "TSC-sensitive" app or OS; otherwise it is "TSC-resilient".

=back

This last is the US$64,000 question as it may be very difficult
(or, for legacy apps, even impossible) to predict all possible
failure cases.  As a result, unless proven otherwise, any app
that uses rdtsc must be assumed to be TSC-sensitive and, as we
will see, this is the default starting in Xen 4.0.

Xen's new tsc_mode parameter determines the circumstances under which
the family of rdtsc instructions is executed "natively" vs emulated.
Roughly speaking, native means rdtsc is fast but TSC-sensitive apps
may, under unpredictable circumstances, run incorrectly; emulated means
there is some performance degradation (unobservable in most cases),
but TSC-sensitive apps will always run correctly.  Prior to Xen 4.0,
all rdtsc instructions were native: "fast but potentially incorrect."
Starting at Xen 4.0, the default is that all rdtsc instructions are
"correct but potentially slow".  The tsc_mode parameter in 4.0 provides
an intelligent default but allows system administrators to adjust
how rdtsc instructions are executed on a per-domain basis.
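
As a minimal sketch of what such a per-domain setting might look like,
the fragment below sets tsc_mode in a domain configuration file.  The
domain name is hypothetical, and the symbolic value is the spelling
described in xl.cfg(5) for recent xl toolstacks, which is an assumption
about the toolstack in use; this document refers to the modes by number,
as listed below.

    # Hypothetical excerpt from a domain configuration file.
    name     = "tsc-test-guest"     # illustrative name, not from this document
    tsc_mode = "always_emulate"     # corresponds to tsc_mode=1 in this document

Leaving the tsc_mode line out entirely selects the default behaviour
(tsc_mode=0).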

The non-default choices for tsc_mode are:

=over 4

=item * B<tsc_mode=1> (always emulate).

All rdtsc instructions are emulated; this is the best choice when
TSC-sensitive apps are running and it is necessary to understand
worst-case performance degradation for a specific hardware environment.

=item * B<tsc_mode=2> (never emulate).

This is the same as prior to Xen 4.0 and is the best choice if it
is certain that all apps running in this VM are TSC-resilient and
highest performance is required.

=item * B<tsc_mode=3> (PVRDTSCP).

This mode has been removed.

=back

If tsc_mode is left unspecified (or set to B<tsc_mode=0>), a hybrid
algorithm is used to ensure correctness while providing the
best performance possible given:

=over 4

=item *

the requirement of correctness,

=item *

the underlying hardware, and

=item *

whether or not the VM has been saved/restored/migrated

=back

The rest of this document explains this algorithm in more detail.

=head1 DETERMINING RDTSC FREQUENCY

To determine the frequency of rdtsc instructions that are emulated,
an "xl" command can be used by a privileged user of domain0.  The
command:

    # xl debug-key s; xl dmesg | tail

provides information about TSC usage in each domain where TSC
emulation is currently enabled.

=head1 TSC HISTORY

To understand tsc_mode completely, some background on TSC is required:

The x86 "timestamp counter", or TSC, is a 64-bit register on each
processor that increases monotonically.  Historically, TSC incremented
every processor cycle, but on recent processors, it increases
at a constant rate even if the processor changes frequency (for example,
to reduce processor power usage).  TSC is known by x86 programmers
as the fastest, highest-precision measurement of the passage of time
so it is often used as a foundation for performance monitoring.
And since it is guaranteed to be monotonically increasing and, at
64 bits, is guaranteed not to wrap around within 10 years, it is
sometimes used as a random number or a unique sequence identifier,
such as to stamp transactions so they can be replayed in a specific
order.

On most older SMP and early multi-core machines, TSC was not synchronized
between processors.  Thus if an application were to read the TSC on
one processor, be moved by the OS to another processor, and then read
TSC again, it might appear that "time went backwards".  This loss of
monotonicity resulted in many obscure application bugs when TSC-sensitive
apps were ported from a uniprocessor to an SMP environment; as a result,
many applications -- especially in the Windows world -- removed their
dependency on TSC and replaced their timestamp needs with OS-specific
functions, losing both performance and precision. On some more recent
generations of multi-core machines, especially multi-socket multi-core
machines, the TSC was synchronized but if one processor were to enter
certain low-power states, its TSC would stop, destroying the synchrony
and again causing obscure bugs.  This reinforced decisions to avoid use
of TSC altogether.  On the most recent generations of multi-core
machines, however, synchronization is provided across all processors
in all power states, even on multi-socket machines, and the hardware
provides a flag that indicates that TSC is synchronized and "invariant".
Thus TSC is once again useful for applications, and even newer operating
systems are using and depending upon TSC for critical timekeeping
tasks when running on these recent machines.

We will refer to hardware that ensures TSC is both synchronized and
invariant as "TSC-safe" and any hardware on which TSC is not (or
may not remain) synchronized as "TSC-unsafe".

As a result of TSC's sordid history, two classes of applications use
TSC: old applications designed for single processors, and the most recent
enterprise applications which require high-frequency high-precision
timestamping.

We will refer to apps that might break if running on a TSC-unsafe
machine as "TSC-sensitive"; we will refer to apps that don't use TSC,
or that use TSC in a way for which monotonicity and frequency invariance
are unimportant, as "TSC-resilient".

The emergence of virtualization once again complicates the usage of
TSC.  When features such as save/restore or live migration are employed,
a guest OS and all its currently running applications may be invisibly
transported to an entirely different physical machine.  While TSC
may be "safe" on one machine, it is essentially impossible to precisely
synchronize TSC across a data center or even a pool of machines.  As
a result, when run in a virtualized environment, rare and obscure
"time going backwards" problems might once again occur for those
TSC-sensitive applications.  Worse, if a guest OS moves from, for
example, a 3GHz machine to a 1.5GHz machine, attempts by an OS/app to
measure time intervals with TSC may without notice be incorrect by a
factor of two.

The rdtsc (read timestamp counter) instruction is used to read the
TSC register.  The rdtscp instruction is a variant of rdtsc on recent
processors.  We refer to these together as the rdtsc family of instructions,
or just "rdtsc".  Instructions in the rdtsc family are non-privileged, but
privileged software may configure the processor (for example, by setting
the CR4.TSD control bit) to cause rdtsc family instructions executed by
less-privileged code to trap.  This trap can be detected by Xen, which can
then transparently "emulate" the results of the rdtsc instruction and
return control to the code following the rdtsc instruction.

To provide a "safe" TSC, i.e. to ensure both TSC monotonicity and a
fixed rate, Xen provides rdtsc emulation whenever necessary or when
explicitly specified by a per-VM configuration option.  TSC emulation is
relatively slow -- roughly 15-20 times slower than the rdtsc instruction
when executed natively.  However, except when an OS or application uses
the rdtsc instruction at a high frequency (e.g. more than about 10,000 times
per second per processor), this performance degradation is not noticeable
(i.e. <0.3%).  Moreover, TSC emulation is nearly always faster than
OS-provided alternatives (e.g. Linux's gettimeofday).  For environments
where it is certain that all apps are TSC-resilient (i.e.
"TSC-safeness" is not necessary) and highest performance is a
requirement, TSC emulation may be entirely disabled (tsc_mode==2).
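
To see where a figure such as 0.3% can come from, the sketch below
multiplies an rdtsc rate by a per-call emulation cost.  The 300 ns cost
is an assumed, illustrative value chosen to be consistent with the bound
quoted above; it is not a measurement from this document, and the
function name is hypothetical.

    # Percentage of one processor's time spent emulating rdtsc.
    #   $1 = emulated rdtsc instructions per second per processor
    #   $2 = assumed cost of one emulated rdtsc, in nanoseconds (hypothetical)
    tsc_overhead_pct() {
        awk -v rate="$1" -v cost_ns="$2" \
            'BEGIN { printf "%.2f\n", rate * cost_ns / 1e9 * 100 }'
    }

    tsc_overhead_pct 10000 300     # prints 0.30

At 100,000 emulated rdtsc instructions per second the same assumed cost
gives 3%, which illustrates why high-TSC-frequency environments are the
ones where emulation may become noticeable.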

The default mode (tsc_mode==0) checks TSC-safeness of the underlying
hardware on which the virtual machine is launched.  If it is
TSC-safe, rdtsc will execute at hardware speed; if it is not, rdtsc
will be emulated.  Once a virtual machine is saved/restored or migrated,
however, there are two possibilities: TSC remains native IF the source
physical machine and target physical machine have the same TSC frequency
(or, for HVM/PVH guests, if TSC scaling support is available); else TSC
is emulated.  Note that, though emulated, the "apparent" TSC frequency
will be the TSC frequency of the initial physical machine, even after
migration.
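
The decision described above for tsc_mode==0 can be summarised as a
small shell function.  This is only a simplified model of the prose in
this document, not Xen's implementation; the function name and its
yes/no arguments are hypothetical, and treating a TSC-unsafe host as
always emulated, even after migration, is an assumption.

    # Model of how rdtsc executes for a tsc_mode==0 domain.
    #   $1 = current host is TSC-safe                          (yes/no)
    #   $2 = domain has been saved/restored or migrated        (yes/no)
    #   $3 = host TSC frequency matches the initial machine    (yes/no)
    #   $4 = HVM/PVH guest on a host with TSC scaling support  (yes/no)
    tsc_mode0_rdtsc() {
        if [ "$1" != yes ]; then
            echo emulated           # TSC-unsafe hardware: emulate
        elif [ "$2" != yes ]; then
            echo native             # still on the machine it was launched on
        elif [ "$3" = yes ] || [ "$4" = yes ]; then
            echo native             # frequency preserved, directly or by scaling
        else
            echo emulated           # frequency differs and cannot be scaled
        fi
    }

    tsc_mode0_rdtsc yes no  yes no     # prints native
    tsc_mode0_rdtsc yes yes no  no     # prints emulated
    tsc_mode0_rdtsc yes yes no  yes    # prints native

In the emulated cases the guest continues to see the TSC frequency of
the machine on which it was first launched.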

Finally, tsc_mode==1 always enables TSC emulation, regardless of
the underlying physical hardware. The "apparent" TSC frequency will
be the TSC frequency of the initial physical machine, even after migration.
This mode is useful to measure any performance degradation that
might be encountered by a tsc_mode==0 domain after migration occurs
or when it is running on TSC-unsafe hardware.

Note that while Xen ensures that an emulated TSC is "safe" across migration,
it does not ensure that it continues to tick at the same rate during
the actual migration.  As an oversimplified example, if TSC is ticking
once per second in a guest, and the guest is saved when the TSC is 1000,
then restored 30 seconds later, TSC is only guaranteed to be greater
than or equal to 1001, not precisely 1030.  This has some OS implications
as will be seen in the next section.

=head1 TSC INVARIANT BIT and NO_MIGRATE

Related to TSC emulation, the "TSC Invariant" bit is architecturally defined
as a cpuid bit on the most recent x86 processors.  If set, TSC invariance
ensures that the TSC is "safe", that is, it will increment at a constant rate
regardless of power events, will be synchronized across all processors, and
was properly initialized to zero on all processors at boot-time
by system hardware/BIOS.  As long as system software never writes to TSC,
TSC will be safe and continuously incremented at a fixed rate and thus
can be used as a system "clocksource".

This bit is used by some OSes, and specifically by Linux starting with
version 2.6.30(?), to select TSC as a system clocksource.  Once selected,
TSC remains the Linux system clocksource unless manually overridden.  In
a virtualized environment, since it is not possible to synchronize TSC
across all the machines in a pool or data center, a migration may "break"
TSC as a usable clocksource; while time will not go backwards, it may
not track wallclock time well enough to avoid certain time-sensitive
consequences.  As a result, Xen can only expose the TSC Invariant bit
to a guest OS if it is certain that the domain will never migrate.
As of Xen 4.0, the "no_migrate=1" VM configuration option may be specified
to disable migration.  If no_migrate is selected and the VM is running
on a physical machine with "TSC Invariant", Linux 2.6.30+ will safely
use TSC as the system clocksource.  However, attempts to migrate or, once
saved, restore this domain will fail.
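
Inside a Linux guest, the clocksource actually in use and the
TSC-related CPU flags reported by the kernel can be inspected from the
shell.  The sketch below assumes the standard Linux sysfs and procfs
paths and a grep that supports the -o option; the function names are
hypothetical, and the flag names are those Linux prints in
/proc/cpuinfo rather than anything defined by Xen.

    # Print the clocksource the guest kernel is currently using.
    current_clocksource() {
        cat "${1:-/sys/devices/system/clocksource/clocksource0/current_clocksource}"
    }

    # Print the TSC-related flags found in a cpuinfo-style file, one per line.
    tsc_cpu_flags() {
        grep -o -w -E 'constant_tsc|nonstop_tsc|rdtscp' "${1:-/proc/cpuinfo}" | sort -u
    }

Called without arguments, both functions read the live system files; a
guest for which current_clocksource prints "tsc" is relying on TSC for
timekeeping, which is the case that no_migrate is intended to keep safe.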

There is another cpuid-related complication: The x86 cpuid instruction is
non-privileged.  HVM domains are configured to always trap this instruction
to Xen, where Xen can "filter" the result.  In a PV OS, all cpuid instructions
have been replaced by a paravirtualized equivalent of the cpuid instruction
("pvcpuid") and also trap to Xen.  But apps in a PV guest that use a
cpuid instruction execute it directly, without a trap to Xen.  As a result,
an app may directly examine the physical TSC Invariant cpuid bit and make
decisions based on that bit.

=head1 HARDWARE TSC SCALING

Intel VMX TSC scaling and AMD SVM TSC ratio allow the guest TSC read
by the guest's rdtsc/rdtscp instructions to increase at a frequency
different from the host TSC frequency.

If an HVM container in default TSC mode (tsc_mode=0) is created on a host
that provides constant TSC, its guest TSC frequency will be the same as
the host's.  If it is later migrated to another host that provides constant
TSC and supports Intel VMX TSC scaling/AMD SVM TSC ratio, its guest TSC
frequency will be the same before and after migration.

For the HVM container above in default TSC mode (tsc_mode=0), if both
hosts support rdtscp, both guest rdtsc and rdtscp instructions will be
executed natively before and after migration.

=head1 AUTHORS

Dan Magenheimer <dan.magenheimer@oracle.com>