Configurable sysfs parameters for the x86-64 machine check code.

Machine checks report internal hardware error conditions detected
by the CPU. Uncorrected errors typically cause a machine check
(often with panic); corrected ones cause a machine check log entry.

Machine checks are organized in banks (normally associated with
a hardware subsystem) and subevents in a bank. The exact meaning
of the banks and subevents is CPU specific.

mcelog knows how to decode them.

When you see the "Machine check errors logged" message in the system
log, mcelog should be run to collect and decode machine check entries
from /dev/mcelog. Normally mcelog should be run regularly from a cron job.

Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
(N = CPU number).

The directory contains some configurable entries:

Entries:

bankNctl
(N = bank number)
        64-bit hex bitmask enabling/disabling specific subevents for bank N.
        When a bit in the bitmask is zero, the respective
        subevent will not be reported.
        By default all events are enabled.
        Note that the BIOS maintains another mask to disable specific events
        per bank. This is not visible here.

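The bitmask semantics of bankNctl can be sketched in Python as follows.
This is only an illustration: the helper names are hypothetical, and it
assumes the entry is read and written as plain hexadecimal text, which
this file does not spell out in detail.

```python
# Hypothetical helpers illustrating the bankNctl bitmask semantics.
# Assumption: the sysfs entry holds a 64-bit mask as hexadecimal text.

MASK_64 = 0xFFFFFFFFFFFFFFFF

def parse_bank_mask(text):
    """Parse the hex text of a bankNctl entry into an integer."""
    return int(text.strip(), 16)

def subevent_enabled(mask, bit):
    """Return True if subevent number `bit` (0..63) is reported."""
    return (mask >> bit) & 1 == 1

def disable_subevent(mask, bit):
    """Return a copy of `mask` with subevent `bit` cleared (not reported)."""
    return mask & ~(1 << bit) & MASK_64

def format_bank_mask(mask):
    """Format a mask back to hex text, suitable for writing to the entry."""
    return "%x" % mask
```

For example, starting from the default of all events enabled
(ffffffffffffffff), disabling subevent 3 yields fffffffffffffff7.
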
The following entries appear for each CPU, but they are truly shared
between all CPUs.

check_interval
        How often to poll for corrected machine check errors, in seconds
        (note that the output is hexadecimal). Default is 5 minutes. When the
        poller finds MCEs it triggers an exponential speedup (poll more often)
        on the polling interval. When the poller stops finding MCEs, it
        triggers an exponential backoff (poll less often) on the polling
        interval. The check_interval variable is both the initial and
        maximum polling interval.

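The speedup and backoff behaviour described for check_interval can be
sketched as below. This is a model of the description, not the kernel
code: the factor of 2 and the lower bound of 1 second are assumptions
made for illustration, and the function names are hypothetical.

```python
# Illustrative model of the polling interval adjustment.
# Assumptions (not stated in this file): the interval changes by a
# factor of 2 per poll and never drops below 1 second.

def parse_check_interval(text):
    """Parse the hexadecimal check_interval output into seconds."""
    return int(text.strip(), 16)

def next_interval(current, found_errors, check_interval, factor=2, minimum=1):
    """Return the next polling interval in seconds.

    Errors found: poll more often (speedup), bounded below by `minimum`.
    No errors:    poll less often (backoff), capped at `check_interval`.
    """
    if found_errors:
        return max(minimum, current // factor)
    return min(check_interval, current * factor)
```

With the default of 5 minutes the entry reads 12c (300 seconds). In this
model a poll that finds errors shortens the interval to 150 seconds, and
the next quiet poll brings it back to the 300 second maximum.
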
tolerant
        Tolerance level. When a machine check exception occurs for an
        uncorrected machine check, the kernel can take different actions.
        Since machine check exceptions can happen at any time, it is sometimes
        risky for the kernel to kill a process because it defies
        normal kernel locking rules. The tolerance level configures
        how hard the kernel tries to recover even at some risk of
        deadlock. Higher tolerant values trade potentially better uptime
        against the risk of a crash or even corruption (for tolerant >= 3).

        0: always panic on uncorrected errors, log corrected errors
        1: panic or SIGBUS on uncorrected errors, log corrected errors
        2: SIGBUS or log uncorrected errors, log corrected errors
        3: never panic or SIGBUS, log all errors (for testing only)

        Default: 1

        Note this only makes a difference if the CPU allows recovery
        from a machine check exception. Current x86 CPUs generally do not.

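The tolerant levels listed above can be summarized as a small lookup
table. The names below are hypothetical and the table only restates the
list; it does not query the kernel.

```python
# Restatement of the tolerant levels as data (hypothetical names).

TOLERANT_LEVELS = {
    0: "always panic on uncorrected errors, log corrected errors",
    1: "panic or SIGBUS on uncorrected errors, log corrected errors",
    2: "SIGBUS or log uncorrected errors, log corrected errors",
    3: "never panic or SIGBUS, log all errors (for testing only)",
}

DEFAULT_TOLERANT = 1

def describe_tolerant(level):
    """Return the documented behaviour for a tolerant level."""
    if level not in TOLERANT_LEVELS:
        raise ValueError("unknown tolerant level: %r" % (level,))
    return TOLERANT_LEVELS[level]

def may_panic(level):
    """True if the level allows a panic on an uncorrected error (0 and 1)."""
    describe_tolerant(level)  # validates the level
    return level <= 1

def may_sigbus(level):
    """True if the level allows SIGBUS on an uncorrected error (1 and 2)."""
    describe_tolerant(level)  # validates the level
    return level in (1, 2)
```
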
trigger
        Program to run when a machine check event is detected.
        This is an alternative to running mcelog regularly from cron
        and allows events to be detected faster.

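As a minimal sketch, setting the trigger entry from Python might look
like this. It assumes the entry is configured by writing the program's
path into the file and that the caller has permission to do so. The
function name and the example program path are hypothetical.

```python
import os

# Directory layout as described earlier in this file.
MCE_SYSFS_ROOT = "/sys/devices/system/machinecheck"

def set_trigger(program, cpu=0, root=MCE_SYSFS_ROOT):
    """Write `program` to the trigger entry of machinecheck<cpu>.

    Assumption: the entry accepts the program path as plain text.
    The entry is shared between all CPUs, so any CPU number will do.
    Returns the path of the file that was written.
    """
    path = os.path.join(root, "machinecheck%d" % cpu, "trigger")
    with open(path, "w") as f:
        f.write(program + "\n")
    return path

# Example (needs root; the handler path is hypothetical):
#   set_trigger("/usr/local/sbin/mce-handler")
```
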
TBD: document entries for AMD threshold interrupt configuration.

For more details about the x86 machine check architecture,
see the Intel and AMD architecture manuals from their developer websites.

For more details about the architecture, see
http://one.firstfloor.org/~andi/mce.pdf