The PCI Express Advanced Error Reporting Driver Guide HOWTO
T. Long Nguyen
Yanmin Zhang
07/29/2006


1. Overview

1.1 About this guide

This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.

1.2 Copyright © Intel Corporation 2006.

1.3 What is the PCI Express AER Driver?

PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCI Express components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCI Express advanced error reporting
extended capability structure that provides more robust error reporting.

The PCI Express AER driver provides the infrastructure to support the
PCI Express Advanced Error Reporting capability. The PCI Express AER
driver provides three basic functions:

- Gathers comprehensive error information when errors occur.
- Reports errors to users.
- Performs error recovery actions.

The AER driver attaches only to Root Ports that support the PCI Express
AER capability.


2. User Guide

2.1 Include the PCI Express AER Root Driver into the Linux Kernel

The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. To use it, the driver must be
compiled into the kernel. Option CONFIG_PCIEAER enables it. Because
CONFIG_PCIEAER depends on CONFIG_PCIEPORTBUS, set both
CONFIG_PCIEPORTBUS=y and CONFIG_PCIEAER=y.

2.2 Load PCI Express AER Root Driver
Some systems have AER support in the BIOS. Enabling the AER Root driver
on a system whose BIOS also handles AER may result in unpredictable
behavior. To avoid this conflict, a successful load of the AER Root
driver requires ACPI _OSC support in the BIOS so that the AER Root
driver can request native control of AER. See the PCI FW 3.0
Specification for details regarding _OSC usage. Currently, many
firmware implementations use PCI Express but do not provide _OSC
support. To support such firmware, forceload, a parameter of type bool,
lets AER initialization continue even though the firmware has no _OSC
support. To enable this workaround, add aerdriver.forceload=y to the
kernel boot parameter line when booting the kernel. Note that
forceload=n by default.

2.3 AER error output
When a PCI Express AER error is captured, an error message is output to
the console. A correctable error is output as a warning; any other
error is printed as an error. Users can therefore choose a log level
that filters out correctable error messages.

An example follows.
+------ PCI-Express Device Error -----+
Error Severity : Uncorrected (Fatal)
PCIE Bus Error type : Transaction Layer
Unsupported Request : First
Requester ID : 0500
VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h
TLB Header:
04000001 00200a03 05010000 00050100

In the example, 'Requester ID' is the ID of the device that sent
the error message to the Root Port. Refer to the PCI Express
specifications for the other fields.

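As a rough illustration of the 'Requester ID' field, the sketch below
splits a 16-bit Requester ID into bus, device and function numbers. It
assumes the conventional layout (bus in bits 15:8, device in bits 7:3,
function in bits 2:0) and a device that does not use ARI. The struct and
function names are hypothetical and are not part of the AER driver.

```c
#include <stdint.h>

/* Fields of a 16-bit PCI Express Requester ID (hypothetical helper type). */
struct requester_id {
    uint8_t bus;       /* bits 15:8 */
    uint8_t device;    /* bits 7:3  */
    uint8_t function;  /* bits 2:0  */
};

/* Split a raw Requester ID, e.g. 0x0500, into its three fields. */
struct requester_id decode_requester_id(uint16_t raw)
{
    struct requester_id id;

    id.bus      = (uint8_t)(raw >> 8);
    id.device   = (uint8_t)((raw >> 3) & 0x1f);
    id.function = (uint8_t)(raw & 0x07);
    return id;
}
```

With the value from the example output, decode_requester_id(0x0500)
gives bus 05h, device 00h and function 0h, which agrees with the
Bus/Device/Function line of that output.
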

3. Developer Guide

Enabling AER-aware support requires a software driver to configure
the AER capability structure within its device and to provide callbacks.

To support AER well, developers first need to understand how AER
works.

PCI Express errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact the functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When AER is enabled, a PCI Express device automatically sends an
error message to the PCI Express Root Port above it when the device
captures an error. The Root Port, upon receiving an error reporting
message, internally processes and logs the error message in its PCI
Express capability structure. The logged information includes the
error reporting agent's requester ID, stored in the Error Source
Identification Registers, and the corresponding error bits set in the
Root Error Status Register. If AER error reporting is enabled in the
Root Error Command Register, the Root Port generates an interrupt when
an error is detected.

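The logging described above can be pictured with a small user-space
sketch that interprets raw values of the Root Error Status Register and
the Error Source Identification Register. The bit positions are assumed
to follow the AER capability layout in the PCI Express specification
(ERR_COR received in bit 0, ERR_FATAL/NONFATAL received in bit 2, fatal
message received in bit 6; correctable source ID in the low 16 bits,
uncorrectable source ID in the high 16 bits). The macro, struct and
function names are hypothetical, and this is not the AER driver's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed Root Error Status bits (names are local to this sketch). */
#define ROOT_STATUS_COR_RCV    0x00000001u  /* ERR_COR received */
#define ROOT_STATUS_UNCOR_RCV  0x00000004u  /* ERR_FATAL/ERR_NONFATAL received */
#define ROOT_STATUS_FATAL_RCV  0x00000040u  /* at least one fatal message */

struct root_error_info {
    bool correctable;       /* a correctable error was reported */
    bool uncorrectable;     /* a non-fatal or fatal error was reported */
    bool fatal;             /* at least one of them was fatal */
    uint16_t cor_source;    /* requester ID of the ERR_COR sender */
    uint16_t uncor_source;  /* requester ID of the ERR_FATAL/NONFATAL sender */
};

/* Interpret the two raw register values read from a Root Port. */
struct root_error_info decode_root_error(uint32_t status, uint32_t source_id)
{
    struct root_error_info info;

    info.correctable   = (status & ROOT_STATUS_COR_RCV) != 0;
    info.uncorrectable = (status & ROOT_STATUS_UNCOR_RCV) != 0;
    info.fatal         = (status & ROOT_STATUS_FATAL_RCV) != 0;
    info.cor_source    = (uint16_t)(source_id & 0xffff);
    info.uncor_source  = (uint16_t)(source_id >> 16);
    return info;
}
```

For instance, a status value of 0x44 together with a source value of
0x05000000 would describe a fatal error reported by requester 0500.
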
Note that the errors described above are related to the PCI Express
hierarchy and links. They do not include any device-specific
errors, because device-specific errors are still sent directly to
the device driver.

3.1 Configure the AER capability structure

AER-aware drivers of PCI Express components need to change the device
control registers to enable AER. They may also change AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting can be used to enable AER. See
section 3.3.

3.2. Provide callbacks

3.2.1 Callback reset_link to reset the PCI Express link

This callback is used to reset the PCI Express physical link when a
fatal error happens. The Root Port AER service driver provides a
default reset_link function, but different upstream ports might
have different specifications for resetting the PCI Express link, so
all upstream ports should provide their own reset_link functions.

In struct pcie_port_service_driver, a new pointer, reset_link, is
added.

pci_ers_result_t (*reset_link) (struct pci_dev *dev);

Section 3.2.2.2 provides more detailed info on when to call
reset_link.

3.2.2 PCI error-recovery callbacks

The PCI Express AER Root driver uses error callbacks to coordinate
with the downstream device drivers associated with the hierarchy in
question when performing error recovery actions.

struct pci_driver has a pointer, err_handler, to a struct
pci_error_handlers, which consists of several callback function
pointers. The AER driver follows the rules defined in
pci-error-recovery.txt except for the PCI Express specific parts (e.g.
reset_link). Refer to pci-error-recovery.txt for detailed
definitions of the callbacks.

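To make the err_handler relationship concrete, here is a minimal
user-space sketch of a driver that supplies error callbacks. All types
below are local stand-ins that only borrow the kernel's names so that
the sketch compiles outside the kernel tree; a real driver takes them
from the kernel headers, where the structures have more members and
the enum values differ. The driver "foo", its callbacks and the
registers_readable field are hypothetical, and the return values show
one possible policy rather than a required one.

```c
#include <stddef.h>

/* Local stand-ins for kernel types; numeric values are not the kernel's. */
enum pci_channel_state { pci_channel_io_normal, pci_channel_io_frozen };

typedef enum {
    PCI_ERS_RESULT_CAN_RECOVER,
    PCI_ERS_RESULT_NEED_RESET,
    PCI_ERS_RESULT_DISCONNECT,
    PCI_ERS_RESULT_RECOVERED
} pci_ers_result_t;

struct pci_dev { int registers_readable; };  /* hypothetical device state */

struct pci_error_handlers {
    pci_ers_result_t (*error_detected)(struct pci_dev *dev,
                                       enum pci_channel_state state);
    pci_ers_result_t (*mmio_enabled)(struct pci_dev *dev);
};

struct pci_driver {
    const char *name;
    const struct pci_error_handlers *err_handler;
};

/* Called first: report whether the driver thinks it can recover. */
pci_ers_result_t foo_error_detected(struct pci_dev *dev,
                                    enum pci_channel_state state)
{
    (void)dev;
    if (state == pci_channel_io_frozen)
        return PCI_ERS_RESULT_NEED_RESET;  /* this driver asks for a reset */
    return PCI_ERS_RESULT_CAN_RECOVER;     /* try to recover without one */
}

/* Called when recovery may proceed: check the device is usable again. */
pci_ers_result_t foo_mmio_enabled(struct pci_dev *dev)
{
    return dev->registers_readable ? PCI_ERS_RESULT_RECOVERED
                                   : PCI_ERS_RESULT_NEED_RESET;
}

const struct pci_error_handlers foo_err_handler = {
    .error_detected = foo_error_detected,
    .mmio_enabled   = foo_mmio_enabled,
};

struct pci_driver foo_driver = {
    .name        = "foo",
    .err_handler = &foo_err_handler,  /* NULL here would mean no recovery */
};
```

The point of the sketch is the indirection: the recovery code reaches
the callbacks only through foo_driver.err_handler, so a driver that
leaves this pointer unset cannot take part in recovery.
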
The following sections specify when the error callback functions are
called.

3.2.2.1 Correctable errors

Correctable errors have no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

3.2.2.2 Non-correctable (non-fatal and fatal) errors

If an error message indicates a non-fatal error, a link reset at the
upstream component is not required. The AER driver calls
error_detected(dev, pci_channel_io_normal) on all drivers associated
with the hierarchy in question. For example:
EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
If Upstream port A captures an AER error, the hierarchy consists of
Downstream port B and EndPoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover. If recovery is possible, the AER driver calls
mmio_enabled next.

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. A link reset at the upstream component is
then necessary. Because different kinds of devices might use different
approaches to reset the link, the AER port service driver is required
to provide the function that resets the link. First, the kernel checks
whether the upstream component has an AER driver. If it does, the
kernel uses the reset_link callback of that AER driver. If the upstream
component has no AER driver and the port is a downstream port, the
kernel uses the AER driver of the Root Port that reported the AER
error. Upstream ports should provide their own AER service drivers with
a reset_link function. If error_detected returns
PCI_ERS_RESULT_CAN_RECOVER and reset_link returns
PCI_ERS_RESULT_RECOVERED, the error handling goes to mmio_enabled.

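The handling of the three severities described in sections 3.2.2.1 and
3.2.2.2 can be summarized in a small sketch. It is a simplified model
written for this guide, not the AER driver's code: the severity enum,
the recovery_plan struct and plan_recovery are hypothetical names, and
the channel state enum is a local stand-in that only borrows the
kernel's identifiers.

```c
#include <stdbool.h>

/* Hypothetical severity classification used only by this sketch. */
enum error_severity { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL };

/* Local stand-in; the real definition lives in the kernel headers. */
enum pci_channel_state { pci_channel_io_normal, pci_channel_io_frozen };

struct recovery_plan {
    bool notify_drivers;           /* call error_detected on the hierarchy */
    enum pci_channel_state state;  /* state passed to error_detected */
    bool reset_link;               /* reset the link at the upstream port */
};

/* Map an error severity to the recovery steps described in the text. */
struct recovery_plan plan_recovery(enum error_severity severity)
{
    struct recovery_plan plan = { false, pci_channel_io_normal, false };

    switch (severity) {
    case SEV_CORRECTABLE:
        /* Hardware already recovered: only clear status and log. */
        break;
    case SEV_NONFATAL:
        /* Link still works: notify drivers, no link reset needed. */
        plan.notify_drivers = true;
        break;
    case SEV_FATAL:
        /* Link unreliable: notify drivers as frozen, then reset link. */
        plan.notify_drivers = true;
        plan.state = pci_channel_io_frozen;
        plan.reset_link = true;
        break;
    }
    return plan;
}
```

Keeping the decision in one table-like function makes the difference
between the non-fatal and fatal paths easy to see: only the fatal path
freezes the channel and requires reset_link.
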
3.3 Helper functions

3.3.1 int pci_find_aer_capability(struct pci_dev *dev);
pci_find_aer_capability locates the PCI Express AER capability
in the device configuration space. If the device doesn't support
PCI Express AER, the function returns 0.

3.3.2 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
pci_enable_pcie_error_reporting enables the device to send error
messages to the Root Port when an error is detected. Note that devices
don't enable error reporting by default, so device drivers need to
call this function to enable it.

3.3.3 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
pci_disable_pcie_error_reporting stops the device from sending error
messages to the Root Port when an error is detected.

3.3.4 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
pci_cleanup_aer_uncorrect_error_status clears the uncorrectable
error status register.

3.4 Frequently Asked Questions

Q: What happens if a PCI Express device driver does not provide an
error recovery handler (pci_driver->err_handler is equal to NULL)?

A: The devices attached to the driver won't be recovered. If the
error is fatal, the kernel will print warning messages. Please refer
to section 3 for more information.

Q: What happens if an upstream port service driver does not provide
the reset_link callback?

A: Fatal error recovery will fail if the errors are reported by the
upstream ports to which the service driver is attached.

Q: How does this infrastructure deal with a driver that is not PCI
Express aware?

A: This infrastructure calls the error callback functions of the
driver when an error happens. But if the driver is not aware of
PCI Express, the device might not report its own errors to the Root
Port.

Q: What modifications does such a driver need to make it compatible
with the PCI Express AER Root driver?

A: It can call the helper functions to enable AER in its devices and
to clear the uncorrectable error status register. Refer to section 3.3.
