Reliability, Availability, and Serviceability (RAS) Extensions
==============================================================

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

In conjunction with the |EHF|, support for the RAS extensions enables the
firmware-first paradigm for handling platform errors: exceptions resulting from
errors are routed to and handled in EL3. These errors are Synchronous External
Aborts (SEA), Asynchronous External Aborts (signalled as SErrors), and Fault
Handling and Error Recovery interrupts. The |EHF| document mentions various
:ref:`error handling use-cases <delegation-use-cases>`.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual. The rest of this document assumes familiarity with the
architecture and its terminology.

Overview
--------

As mentioned above, the RAS support in |TF-A| enables routing to and handling of
exceptions resulting from platform errors in EL3. It allows the platform to
define an External Abort handler, and to register RAS nodes and interrupts. The
RAS framework also provides `helpers`__ for accessing Standard Error Records as
introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

The build option ``RAS_EXTENSION``, when set to ``1``, includes the RAS
framework in the run time firmware; ``EL3_EXCEPTION_HANDLING`` and
``HANDLE_EA_EL3_FIRST`` must also be set to ``1``.
``RAS_TRAP_LOWER_EL_ERR_ACCESS`` controls access to the RAS error record
registers from lower ELs.

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

See more on `Engaging the RAS framework`_.

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution.
Please refer to :ref:`RAS Porting Guide <External Abort handling and RAS
Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

-  A handler to probe error records for errors;
-  When the probing identifies an error, a handler to handle it;
-  For a memory-mapped error record, its base address and size in KB; for a
   system register-accessed record, the start index of the record and the
   number of contiguous records from that index;
-  Any node-specific auxiliary data.

With this information supplied, when the run time firmware is notified of an
error through one of these mechanisms, the RAS framework can iterate through
and probe the error records, and invoke the appropriate handler for any error
found.

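The probe-then-handle flow described above can be sketched as follows. This is
a minimal, self-contained illustration and not the |TF-A| implementation:
``struct record_sketch``, ``dispatch_sketch()``, and the two example callbacks
are hypothetical stand-ins, and the handler signature is simplified (it omits
the error data argument that the real handler receives).

```c
#include <stddef.h>

/* Simplified stand-ins for illustration only; not the TF-A definitions. */
struct record_sketch;

typedef int (*probe_fn_t)(const struct record_sketch *rec, int *probe_data);
typedef int (*handle_fn_t)(const struct record_sketch *rec, int probe_data);

struct record_sketch {
    probe_fn_t probe;     /* returns non-zero when an error is pending */
    handle_fn_t handler;  /* invoked only after a successful probe */
    const void *aux;      /* node-specific auxiliary data */
};

/*
 * Probe each record in order. On the first record that reports an error,
 * invoke its handler with the data produced by the probe.
 * Returns 1 if an error was found and handled, 0 otherwise.
 */
static int dispatch_sketch(const struct record_sketch *recs, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        int probe_data = 0;

        if (recs[i].probe(&recs[i], &probe_data) != 0) {
            (void)recs[i].handler(&recs[i], probe_data);
            return 1;
        }
    }
    return 0;
}

/* Example callbacks: aux points to an int that is non-zero on error. */
static int last_handled = -1;

static int probe_flag(const struct record_sketch *rec, int *probe_data)
{
    const int *flag = rec->aux;

    if (*flag != 0) {
        *probe_data = *flag;  /* hand the observed value to the handler */
        return 1;
    }
    return 0;
}

static int handle_flag(const struct record_sketch *rec, int probe_data)
{
    (void)rec;
    last_handled = probe_data;  /* record what was handled */
    return 0;
}
```

In this sketch the probe and handler communicate only through ``probe_data``,
which mirrors the role that parameter plays in the prototypes shown later in
this section.
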
The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its
arguments; that structure is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from the probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
               int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records. Note that the macro must be used in the same file where the array is
defined.

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform writing
its own. Both helpers above:

-  Return a non-zero value when an error is detected in a Standard Error Record;
-  Set ``probe_data`` to the index of the error record upon detecting an error.

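The contract of these helpers (non-zero return on error, ``probe_data`` set to
the record index) can be illustrated with a small standalone sketch. It is not
the helpers' implementation: ``probe_status_sketch()`` is a hypothetical
function that scans a plain array of status values rather than reading real
error record registers, and the position of the valid bit (bit 30) is stated
here as an assumption to be checked against the Arm Architecture Reference
Manual.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed position of the 'valid' bit in a status value (illustrative). */
#define STATUS_V_BIT (UINT64_C(1) << 30)

/*
 * Scan 'num' status values. If one has the valid bit set, store its index
 * in *probe_data and return 1. Otherwise leave *probe_data untouched and
 * return 0.
 */
static int probe_status_sketch(const uint64_t *status, size_t num,
                               int *probe_data)
{
    for (size_t i = 0; i < num; i++) {
        if ((status[i] & STATUS_V_BIT) != 0) {
            *probe_data = (int)i;
            return 1;
        }
    }
    return 0;
}
```

A platform-specific probe handler for records that are not in the Standard
Error Record format would need to honour the same return-value contract.
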
Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt``, containing:

-  Interrupt number;
-  The associated error record information (pointer to the corresponding
   ``struct err_record_info``);
-  Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.

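The reason for the sorting requirement can be seen in a binary search over the
array. The following is a minimal sketch under stated assumptions, not the
framework's code: ``struct ras_irq_sketch`` and ``find_irq_sketch()`` are
hypothetical names, and the member names are illustrative rather than those of
``struct ras_interrupt``.

```c
#include <stddef.h>

/* Simplified stand-in for a RAS interrupt descriptor (illustrative). */
struct ras_irq_sketch {
    unsigned int intr_number;  /* interrupt number; array sorted on this */
    const void *err_record;    /* associated error record information */
    void *cookie;              /* optional platform cookie */
};

/*
 * Binary search over an array sorted by increasing interrupt number.
 * Returns the matching entry, or NULL when the interrupt is not listed.
 */
static const struct ras_irq_sketch *
find_irq_sketch(const struct ras_irq_sketch *irqs, size_t count,
                unsigned int intr)
{
    size_t lo = 0;
    size_t hi = count;  /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (irqs[mid].intr_number == intr)
            return &irqs[mid];
        if (irqs[mid].intr_number < intr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}
```

If the array were not sorted, a search of this kind could miss entries that are
present, which is why the ordering is a requirement and not a recommendation.
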
Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions that are part of Armv8.4 introduced new architectural
features to deal with Double Fault conditions, specifically the ``NMEA`` and
``EASE`` bits in the ``SCR_EL3`` register. These were introduced to assist EL3
software which runs part of its entry/exit routines with exceptions momentarily
masked. In such systems, External Aborts/SErrors are not handled immediately
when they occur, but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes the entirety of EL3 with all exceptions
unmasked. This means that all exceptions routed to EL3 are handled immediately.
|TF-A| is thus able to detect a Double Fault condition in software, without
needing the intended advantages of the Armv8.4 Double Fault architecture
extensions.

Double faults are fatal: handling terminates at the platform double fault
handler, which does not return.

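The idea of detecting a Double Fault in software can be illustrated with an
"error handling in progress" marker. This sketch is purely illustrative and
does not describe how |TF-A| implements the detection: all names are
hypothetical, a single global flag stands in for what would be per-CPU state,
and the stand-in double fault hook returns so that the example can be
exercised, whereas a real platform handler would not return.

```c
#include <stdbool.h>

/* Illustrative state; real firmware would keep this per CPU. */
static bool handling_error;
static int double_faults_seen;

/* Stand-in for a platform double fault handler (a real one never returns). */
static void on_double_fault_sketch(void)
{
    double_faults_seen++;
}

/*
 * Called on entry to error handling. Returns 0 when handling may proceed,
 * or -1 when another error was already being handled (a Double Fault).
 */
static int error_entry_sketch(void)
{
    if (handling_error) {
        on_double_fault_sketch();
        return -1;
    }
    handling_error = true;
    return 0;
}

/* Called once the current error has been fully handled. */
static void error_exit_sketch(void)
{
    handling_error = false;
}
```

Such a check only works when a second error is taken immediately, which is the
property the preceding paragraphs attribute to running EL3 with exceptions
unmasked.
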
Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice constructed from three distinct, but
related, build options:

-  ``RAS_EXTENSION=1`` includes the RAS framework in the run time firmware;

-  ``EL3_EXCEPTION_HANDLING=1`` enables handling of exceptions at EL3. See
   `Interaction with Exception Handling Framework`_;

-  ``HANDLE_EA_EL3_FIRST=1`` enables routing of External Aborts and SErrors to
   EL3.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``RAS_EXTENSION``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through the platform-supplied error records, probing them, and, when an error
is identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, the priority management is implicit;
for non-interrupt exceptions, it is explicit, using :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2019, Arm Limited and Contributors. All rights reserved.*