[NMI watchdog is available for x86 and x86-64 architectures]

Is your system locking up unpredictably? No keyboard activity, just
a frustrating complete hard lockup? Do you want to help us debug
such lockups? If the answer to all of these is yes, then this document
is definitely for you.

On many x86/x86-64 systems the hardware provides a feature that enables
us to generate 'watchdog NMI interrupts'. (NMI: Non-Maskable Interrupt,
which gets executed even if the system is otherwise locked up hard.)
This can be used to debug hard kernel lockups. By executing periodic
NMI interrupts, the kernel can monitor whether any CPU has locked up,
and print out debugging messages if so.

In order to use the NMI watchdog, you need to have APIC support in your
kernel. For SMP kernels, APIC support gets compiled in automatically. For
UP, enable either CONFIG_X86_UP_APIC (Processor type and features -> Local
APIC support on uniprocessors) or CONFIG_X86_UP_IOAPIC (Processor type and
features -> IO-APIC support on uniprocessors) in your kernel config.
CONFIG_X86_UP_APIC is for uniprocessor machines without an IO-APIC.
CONFIG_X86_UP_IOAPIC is for uniprocessor machines with an IO-APIC. [Note:
certain kernel debugging options, such as Kernel Stack Meter or Kernel
Tracer, may implicitly disable the NMI watchdog.]

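As a rough illustration of the configuration check above, the following
Python sketch scans kernel config text for the two uniprocessor APIC
options. The function name and the sample text are made up for this
example, and where the config file lives on a given system is left to
the reader; only the two option names come from this document.

```python
# Option names taken from the text above.
APIC_OPTIONS = ("CONFIG_X86_UP_APIC", "CONFIG_X86_UP_IOAPIC")


def enabled_apic_options(config_text):
    """Return the UP APIC options that are set to 'y' in kernel config text."""
    enabled = []
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options show up as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if name in APIC_OPTIONS and value == "y":
            enabled.append(name)
    return enabled


# Hypothetical config excerpt, for illustration only.
sample_config = """\
CONFIG_X86_UP_APIC=y
# CONFIG_X86_UP_IOAPIC is not set
"""
print(enabled_apic_options(sample_config))
```

An empty result on a UP kernel would suggest that neither option is
enabled and that the NMI watchdog cannot be used with that kernel.
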
For x86-64, the needed APIC is always compiled in, and the NMI watchdog is
always enabled with I/O-APIC mode (nmi_watchdog=1). Currently, local APIC
mode (nmi_watchdog=2) does not work on x86-64.

Using local APIC (nmi_watchdog=2) needs the first performance register, so
you can't use it for other purposes (such as high precision performance
profiling). However, at least oprofile and the perfctr driver disable the
local APIC NMI watchdog automatically.

To actually enable the NMI watchdog, use the 'nmi_watchdog=N' boot
parameter. For example, the relevant lilo.conf entry:

        append="nmi_watchdog=1"

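To confirm that the parameter actually reached the kernel, one option is
to inspect the kernel command line after boot. The sketch below is only
an illustration: the helper names are invented here, and it assumes the
command line is readable from /proc/cmdline and that the last occurrence
of a repeated parameter is the one that counts.

```python
def nmi_watchdog_mode(cmdline):
    """Return the value given to nmi_watchdog= on a command line, or None."""
    mode = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "nmi_watchdog" and sep:
            # Assumption: if the parameter is repeated, the last one wins.
            mode = value
    return mode


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (path is an assumption)."""
    with open(path) as f:
        return f.read()


# Hypothetical command line, for illustration only.
print(nmi_watchdog_mode("ro root=/dev/hda1 nmi_watchdog=1"))
```

On a live system, nmi_watchdog_mode(read_cmdline()) returning None would
mean the parameter was not passed at boot.
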
For SMP machines and UP machines with an IO-APIC use nmi_watchdog=1.
For UP machines without an IO-APIC use nmi_watchdog=2; this only works
for some processor types. If in doubt, boot with nmi_watchdog=1 and
check the NMI count in /proc/interrupts; if the count is zero then
reboot with nmi_watchdog=2 and check the NMI count again. If it is still
zero then report the problem; you probably have a processor that needs
to be added to the NMI code.

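The NMI count check described above can be sketched in Python as
follows. This is a minimal sketch, assuming the NMI row in
/proc/interrupts starts with the label "NMI:" followed by one count per
CPU; the function names and the sample text are invented for this
example and the numbers in it are not real measurements.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts found in /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        label, sep, rest = line.strip().partition(":")
        if sep and label == "NMI":
            counts = []
            for token in rest.split():
                if not token.isdigit():
                    break  # stop at any trailing description text
                counts.append(int(token))
            return counts
    return []  # no NMI row found


def watchdog_seems_active(interrupts_text):
    """True if at least one CPU reports a non-zero NMI count."""
    return any(count > 0 for count in nmi_counts(interrupts_text))


# Made-up sample in the general shape of /proc/interrupts.
sample_interrupts = """\
           CPU0       CPU1
  0:      51234          0   IO-APIC-edge  timer
NMI:        812        799
LOC:      51100      51090
"""
print(nmi_counts(sample_interrupts))
print(watchdog_seems_active(sample_interrupts))
```

Reading the file twice a few seconds apart and comparing the counts is a
slightly stronger check, since a watchdog that is running should keep
the count growing.
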
A 'lockup' is the following scenario: if any CPU in the system does not
execute the periodic local timer interrupt for more than 5 seconds, then
the NMI handler generates an oops and kills the process. This
'controlled crash' (and the resulting kernel messages) can be used to
debug the lockup. Thus whenever the lockup happens, wait 5 seconds and
the oops will show up automatically. If the kernel produces no messages
then the system has crashed so hard (e.g. hardware-wise) that either it
cannot even accept NMI interrupts, or the crash has made the kernel
unable to print messages.

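The detection rule above can be modelled in a few lines of Python. This
is a toy model for illustration, not the kernel's code: the class name,
the one-NMI-per-second rate and the tick values are all assumptions.
Only the 5 second threshold and the idea of watching the local timer
interrupt come from this document.

```python
LOCKUP_SECONDS = 5  # threshold taken from the text above


class WatchdogModel:
    """Toy model of a per-CPU NMI watchdog (NMI rate is an assumption)."""

    def __init__(self, num_cpus, nmi_hz=1):
        self.threshold = LOCKUP_SECONDS * nmi_hz
        self.last_ticks = [0] * num_cpus
        self.stalled_nmis = [0] * num_cpus

    def on_nmi(self, cpu, timer_ticks):
        """Handle one watchdog NMI; return True if the CPU looks locked up."""
        if timer_ticks != self.last_ticks[cpu]:
            # The timer interrupt made progress: record it, reset the stall count.
            self.last_ticks[cpu] = timer_ticks
            self.stalled_nmis[cpu] = 0
            return False
        # No timer progress since the previous NMI on this CPU.
        self.stalled_nmis[cpu] += 1
        return self.stalled_nmis[cpu] >= self.threshold


model = WatchdogModel(num_cpus=1)
ticks = 100
results = []
for second in range(8):
    if second < 2:
        ticks += 1000  # the CPU services timer interrupts for two seconds
    results.append(model.on_nmi(0, ticks))  # then the tick count stops moving
print(results)
```

In this run the tick count stops changing after the second NMI, and the
model starts reporting a lockup once five further NMIs have seen no
progress. Because the check runs from the NMI handler, it still executes
when ordinary interrupts are blocked.
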
NOTE: starting with 2.4.2-ac18 the NMI-oopser is disabled by default;
you have to enable it with a boot time parameter. Prior to 2.4.2-ac18
the NMI-oopser was enabled unconditionally on x86 SMP boxes.

[ feel free to send bug reports, suggestions and patches to
  Ingo Molnar or the Linux SMP mailing
  list at ]