Most of the text from Keith Owens, hacked by AK

x86_64 page size (PAGE_SIZE) is 4K.

Like all other architectures, x86_64 has a kernel stack for every
active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big.
These stacks contain useful data as long as a thread is alive or a
zombie. While the thread is in user space the kernel stack is empty
except for the thread_info structure at the bottom.

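The size relationship above can be illustrated with a small user-space
sketch. It assumes that each thread stack is aligned to THREAD_SIZE,
which the text does not state, and the helper names stack_base() and
stack_room() are hypothetical. The kernel's own lookup of thread_info
may work differently.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE   4096UL            /* x86_64 page size, per the text */
#define THREAD_SIZE (2 * PAGE_SIZE)   /* size of one per-thread kernel stack */

/* Lowest address of the stack containing sp.
 * Assumption of this sketch: stacks are aligned to THREAD_SIZE. */
static uintptr_t stack_base(uintptr_t sp)
{
    return sp & ~(uintptr_t)(THREAD_SIZE - 1);
}

/* Bytes left between sp and a structure of info_size bytes that sits
 * at the bottom of the stack (thread_info in the text). */
static uintptr_t stack_room(uintptr_t sp, uintptr_t info_size)
{
    uintptr_t floor = stack_base(sp) + info_size;

    assert(sp >= floor);              /* sp below the floor means overflow */
    return sp - floor;
}
```

Because the stack grows downwards towards the structure at the bottom,
the value returned by stack_room() is the space a thread has left before
it would overwrite that structure.
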
In addition to the per thread stacks, there are specialized stacks
associated with each CPU. These stacks are only used while the kernel
is in control of that CPU; when a CPU returns to user space the
specialized stacks contain no useful data. The main CPU stacks are:

* Interrupt stack. IRQSTACKSIZE

  Used for external hardware interrupts. If this is the first external
  hardware interrupt (i.e. not a nested hardware interrupt) then the
  kernel switches from the current task's stack to the interrupt stack.
  Like the split thread and interrupt stacks on i386 (with
  CONFIG_4KSTACKS), this gives more room for kernel interrupt processing
  without having to increase the size of every per thread stack.

  The interrupt stack is also used when processing a softirq.

  Switching to the kernel interrupt stack is done by software based on a
  per CPU interrupt nest counter. This is needed because x86_64 "IST"
  hardware stacks cannot nest without races.

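The nest counter logic described above can be sketched in plain C. This
is only a model: struct cpu_state and the two helper functions are
hypothetical names, and starting the counter at 0 is an assumption of
the sketch. The real switch is done in the low-level entry code.

```c
#include <assert.h>

enum stack_kind { TASK_STACK, IRQ_STACK };

/* Hypothetical per-CPU state; irq_nest == 0 means no interrupt active. */
struct cpu_state {
    int irq_nest;
    enum stack_kind current;
};

/* On entry, only the first (non-nested) interrupt switches stacks. */
static void irq_enter_sketch(struct cpu_state *cpu)
{
    if (cpu->irq_nest++ == 0)
        cpu->current = IRQ_STACK;
}

/* On exit, only the outermost interrupt switches back to the task stack. */
static void irq_exit_sketch(struct cpu_state *cpu)
{
    assert(cpu->irq_nest > 0);
    if (--cpu->irq_nest == 0)
        cpu->current = TASK_STACK;
}
```

A nested interrupt finds the counter already non-zero and simply keeps
running on the interrupt stack, so the frames of the outer interrupt are
never overwritten.
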
x86_64 also has a feature which is not available on i386, the ability
to automatically switch to a new stack for designated events such as
double fault or NMI, which makes it easier to handle these unusual
events on x86_64. This feature is called the Interrupt Stack Table
(IST). There can be up to 7 IST entries per CPU. The IST code is an
index into the Task State Segment (TSS). The IST entries in the TSS
point to dedicated stacks; each stack can be a different size.

An IST is selected by a non-zero value in the IST field of an
interrupt-gate descriptor. When an interrupt occurs and the hardware
loads such a descriptor, the hardware automatically sets the new stack
pointer based on the IST value, then invokes the interrupt handler. If
software wants to allow nested IST interrupts then the handler must
adjust the IST values on entry to and exit from the interrupt handler.
(This is occasionally done, e.g. for debug exceptions.)

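To make the relationship between the gate descriptor and the TSS more
concrete, here is a sketch of the two hardware structures in 64-bit
mode and of the stack selection step. The struct and field names are
chosen for this sketch and are not the kernel's definitions, and
__attribute__((packed)) assumes GCC or Clang. gate_stack() is a
simplified model: it ignores the privilege-level change case and the
stack frame the hardware pushes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IST_ENTRIES 7                 /* up to 7 IST entries per CPU */

/* 64-bit Task State Segment as laid out by the hardware. */
struct tss64 {
    uint32_t reserved0;
    uint64_t rsp[3];                  /* stacks for privilege levels 0-2 */
    uint64_t reserved1;
    uint64_t ist[IST_ENTRIES];        /* IST1..IST7 stack pointers */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iomap_base;
} __attribute__((packed));

/* 64-bit interrupt-gate descriptor. */
struct idt_gate {
    uint16_t offset_low;
    uint16_t selector;
    uint8_t  ist;                     /* bits 0-2: IST index, 0 = none */
    uint8_t  type_attr;
    uint16_t offset_mid;
    uint32_t offset_high;
    uint32_t reserved;
} __attribute__((packed));

static_assert(sizeof(struct tss64) == 104, "unexpected TSS size");
static_assert(sizeof(struct idt_gate) == 16, "unexpected gate size");

/* Stack pointer the handler starts on: the IST entry if the gate names
 * one, otherwise (simplified) the stack that was already in use. */
static uint64_t gate_stack(const struct tss64 *tss,
                           const struct idt_gate *gate,
                           uint64_t current_rsp)
{
    unsigned int idx = gate->ist & 0x7;

    return idx ? tss->ist[idx - 1] : current_rsp;
}
```

The IST field is only three bits wide and the value 0 means "no IST",
which is why at most 7 dedicated stacks can be described per CPU.
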
Events with different IST codes (i.e. with different stacks) can be
nested. For example, a debug interrupt can safely be interrupted by an
NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack
pointers on entry to and exit from all IST events, in theory allowing
IST events with the same code to be nested. However, in most cases the
stack size allocated to an IST assumes no nesting for the same code.
If that assumption is ever broken then the stacks will become corrupt.

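The effect of adjusting the IST values can be modelled with a few lines
of C. This is a sketch with hypothetical helper names, not the code in
entry.S. It assumes that the adjustment moves the IST entry by one
EXCEPTION_STKSZ; the text only says that the stack pointers are
adjusted.

```c
#include <assert.h>
#include <stdint.h>

#define EXCEPTION_STKSZ 4096UL        /* PAGE_SIZE, per the text */

/* Hardware step: the handler starts wherever the IST entry points. */
static uint64_t ist_deliver(const uint64_t *ist_slot)
{
    return *ist_slot;
}

/* Software step on handler entry: move the entry down so that a nested
 * event with the same IST code starts below the frame now in use. */
static void ist_handler_enter(uint64_t *ist_slot)
{
    assert(*ist_slot >= EXCEPTION_STKSZ);
    *ist_slot -= EXCEPTION_STKSZ;
}

/* Software step on handler exit: restore the original entry. */
static void ist_handler_exit(uint64_t *ist_slot)
{
    *ist_slot += EXCEPTION_STKSZ;
}
```

Without the adjustment a second event with the same code is delivered
to the same address as the first and overwrites its frame. With the
adjustment the nested event lands one EXCEPTION_STKSZ lower, which is
only safe if that much memory was actually allocated below the stack.
This is the assumption the paragraph above warns about.
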
The currently assigned IST stacks are:

* STACKFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for interrupt 12 - Stack Fault Exception (#SS).

  This allows the CPU to recover from invalid stack segments. Rarely
  happens.

* DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for interrupt 8 - Double Fault Exception (#DF).

  Invoked when handling one exception causes another exception. Happens
  when the kernel is very confused (e.g. kernel stack pointer corrupt).
  Using a separate stack allows the kernel to recover from it well enough
  in many cases to still output an oops.

* NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for non-maskable interrupts (NMI).

  NMI can be delivered at any time, including when the kernel is in the
  middle of switching stacks. Using IST for NMI events avoids making
  assumptions about the previous state of the kernel stack.

* DEBUG_STACK. DEBUG_STKSZ

  Used for hardware debug interrupts (interrupt 1) and for software
  debug interrupts (INT3).

  When debugging a kernel, debug interrupts (both hardware and
  software) can occur at any time. Using IST for these interrupts
  avoids making assumptions about the previous state of the kernel
  stack.

* MCE_STACK. EXCEPTION_STKSZ (PAGE_SIZE).

  Used for interrupt 18 - Machine Check Exception (#MC).

  MCE can be delivered at any time, including when the kernel is in the
  middle of switching stacks. Using IST for MCE events avoids making
  assumptions about the previous state of the kernel stack.

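The list above can be summarised as a mapping from interrupt vector to
IST stack. In the sketch below the stack names come from the list, but
the numeric index given to each name is an assumption of the sketch, as
is the helper ist_for_vector(). The vector numbers for NMI (2) and INT3
(3) are the architectural ones; the list itself does not spell them out.

```c
#include <assert.h>

/* IST indexes; the ordering is an assumption of this sketch. */
enum ist_stack {
    IST_NONE = 0,                     /* 0 in a gate means "no IST" */
    STACKFAULT_STACK,
    DOUBLEFAULT_STACK,
    NMI_STACK,
    DEBUG_STACK,
    MCE_STACK,
};

static_assert(MCE_STACK <= 7, "at most 7 IST entries per CPU");

/* Which IST stack, if any, an interrupt vector is handled on. */
static enum ist_stack ist_for_vector(unsigned int vector)
{
    switch (vector) {
    case 1:  return DEBUG_STACK;        /* hardware debug interrupt */
    case 2:  return NMI_STACK;          /* non-maskable interrupt */
    case 3:  return DEBUG_STACK;        /* software debug interrupt (INT3) */
    case 8:  return DOUBLEFAULT_STACK;  /* #DF */
    case 12: return STACKFAULT_STACK;   /* #SS */
    case 18: return MCE_STACK;          /* #MC */
    default: return IST_NONE;           /* handled on the current stack */
    }
}
```

Both debug vectors share DEBUG_STACK, so they count as events with the
same IST code for the nesting rules described earlier.
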
For more details see the Intel IA32 or AMD AMD64 architecture manuals.