1 |
199 |
simons |
From: michael@Physik.Uni-Dortmund.DE (Michael Dirkmann)
|
2 |
|
|
|
3 |
|
|
thanks for your information. Attached is the tex-code of your
|
4 |
|
|
SMP-documentation :
|
5 |
|
|
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
|
6 |
|
|
\documentclass[]{article}
|
7 |
|
|
\parindent0.0cm
|
8 |
|
|
\parskip0.2cm
|
9 |
|
|
|
10 |
|
|
\begin{document}
|
11 |
|
|
|
12 |
|
|
\begin{center}
|
13 |
|
|
\LARGE \bf
|
14 |
|
|
An Implementation Of Multiprocessor Linux
|
15 |
|
|
\normalsize
|
16 |
|
|
\end{center}
|
17 |
|
|
|
18 |
|
|
{ \it
|
19 |
|
|
This document describes the implementation of a simple SMP
|
20 |
|
|
Linux kernel extension and how to use this to develop SMP Linux kernels for
|
21 |
|
|
architectures other than the Intel MP v1.1 architecture for Pentium and 486
|
22 |
|
|
processors.}
|
23 |
|
|
|
24 |
|
|
\hfill Alan Cox, 1995
|
25 |
|
|
|
26 |
|
|
|
27 |
|
|
The author wishes to thank Caldera Inc. ( http://www.caldera.com )
|
28 |
|
|
whose donation of an ASUS dual Pentium board made this project possible,
|
29 |
|
|
and Thomas Radke, whose initial work on multiprocessor Linux formed
|
30 |
|
|
the backbone of this project.
|
31 |
|
|
|
32 |
|
|
\section{Background: The Intel MP specification.}
|
33 |
|
|
Most IBM PC style multiprocessor motherboards combine Intel 486 or Pentium
|
34 |
|
|
processors and glue chipsets with a hardware/software specification. The
|
35 |
|
|
specification places much of the onus for hard work on the chipset and
|
36 |
|
|
hardware rather than the operating system.
|
37 |
|
|
|
38 |
|
|
The Intel Pentium processors have a wide variety of built-in facilities for
|
39 |
|
|
supporting multiprocessing, including hardware cache coherency, built in
|
40 |
|
|
interprocessor interrupt handling and a set of atomic test and set,
|
41 |
|
|
exchange and similar operations. The cache coherency in particular makes the
|
42 |
|
|
operating systems job far easier.
|
43 |
|
|
|
44 |
|
|
The specification defines a detailed configuration structure in ROM that
|
45 |
|
|
the boot up processor can read to find the full configuration of the
|
46 |
|
|
processors and busses. It also defines a procedure for starting up the
|
47 |
|
|
other processors.
|
48 |
|
|
|
49 |
|
|
|
50 |
|
|
\section{Mutual Exclusion Within A Single Processor Linux Kernel}
|
51 |
|
|
For any kernel to function in a sane manner it has to provide internal
|
52 |
|
|
locking and protection of its own tables to prevent two processes updating
|
53 |
|
|
them at once and for example allocating the same memory block. There are
|
54 |
|
|
two strategies for this within current Unix and Unixlike kernels.
|
55 |
|
|
Traditional unix systems from the earliest of days use a scheme of 'Coarse
|
56 |
|
|
Grained Locking' where the entire kernel is protected as a small number of
|
57 |
|
|
locks only. Some modern systems use fine grained locking. Because fine
|
58 |
|
|
grained locking has more overhead it is normally used only on
|
59 |
|
|
multiprocessor kernels and real time kernels. In a real time kernel the
|
60 |
|
|
fine grained locking reduces the amount of time locks are held and reduces
|
61 |
|
|
the critical (to real time programming at least) latency times.
|
62 |
|
|
|
63 |
|
|
Within the Linux kernel certain guarantees are made. No process running in
|
64 |
|
|
kernel mode will be pre-empted by another kernel mode process unless it
|
65 |
|
|
voluntarily sleeps. This ensures that blocks of kernel code are
|
66 |
|
|
effectively atomic with respect to other processes and greatly simplifies
|
67 |
|
|
many operation. Secondly interrupts may pre-empt a kernel running process,
|
68 |
|
|
but will always return to that process. A process in kernel mode may
|
69 |
|
|
disable interrupts on the processor and guarantee such an interruption will
|
70 |
|
|
not occur. The final guarantee is that an interrupt will not be pre-empted
|
71 |
|
|
by a kernel task. That is interrupts will run to completion or be
|
72 |
|
|
pre-empted by other interrupts only.
|
73 |
|
|
|
74 |
|
|
The SMP kernel chooses to continue these basic guarantees in order to make
|
75 |
|
|
initial implementation and deployment easier. A single lock is maintained
|
76 |
|
|
across all processors. This lock is required to access the kernel space.
|
77 |
|
|
Any processor may hold it and once it is held may also re-enter the kernel
|
78 |
|
|
for interrupts and other services whenever it likes until the lock is
|
79 |
|
|
relinquished. This lock ensures that a kernel mode process will not be
|
80 |
|
|
pre-empted and ensures that blocking interrupts in kernel mode behaves
|
81 |
|
|
correctly. This is guaranteed because only the processor holding the lock
|
82 |
|
|
can be in kernel mode, only kernel mode processes can disable interrupts
|
83 |
|
|
and only the processor holding the lock may handle an interrupt.
|
84 |
|
|
|
85 |
|
|
Such a choice is however poor for performance. In the longer term it is
|
86 |
|
|
necessary to move to finer grained parallelism in order to get the best
|
87 |
|
|
system performance. This can be done hierarchically by gradually refining
|
88 |
|
|
the locks to cover smaller areas. With the current kernel highly CPU bound
|
89 |
|
|
process sets perform well but I/O bound task sets can easily degenerate to
|
90 |
|
|
near single processor performance levels. This refinement will be needed to
|
91 |
|
|
get the best from Linux/SMP.
|
92 |
|
|
|
93 |
|
|
\subsection{Changes To The Portable Kernel Components}
|
94 |
|
|
The kernel changes are split into generic SMP support changes and
|
95 |
|
|
architecture specific changes necessary to accommodate each different
|
96 |
|
|
processor type Linux is ported to.
|
97 |
|
|
|
98 |
|
|
|
99 |
|
|
\subsubsection{Initialisation}
|
100 |
|
|
The first problem with a multiprocessor kernel is starting the other
|
101 |
|
|
processors up. Linux/SMP defines that a single processor enters the normal
|
102 |
|
|
kernel entry point start\_kernel(). Other processors are assumed not to be
|
103 |
|
|
started or to have been captured elsewhere. The first processor begins the
|
104 |
|
|
normal Linux initialisation sequences and sets up paging, interrupts and
|
105 |
|
|
trap handlers. After it has obtained the processor information about the
|
106 |
|
|
boot CPU, the architecture specific function
|
107 |
|
|
|
108 |
|
|
|
109 |
|
|
{\tt \bf{void smp\_store\_cpu\_info(int processor\_id) }}
|
110 |
|
|
|
111 |
|
|
is called to store any information about the processor into a per processor
|
112 |
|
|
array. This includes things like the bogomips speed ratings.
|
113 |
|
|
|
114 |
|
|
Having completed the kernel initialisation the architecture specific
|
115 |
|
|
function
|
116 |
|
|
|
117 |
|
|
{\tt \bf void smp\_boot\_cpus(void) }
|
118 |
|
|
|
119 |
|
|
is called and is expected to start up each other processor and cause it to
|
120 |
|
|
enter start\_kernel() with its paging registers and other control
|
121 |
|
|
information correctly loaded. Each other processor skips the setup except
|
122 |
|
|
for calling the trap and irq initialisation functions that are needed on
|
123 |
|
|
some processors to set each CPU up correctly. These functions will
|
124 |
|
|
probably need to be modified in existing kernels to cope with this.
|
125 |
|
|
|
126 |
|
|
|
127 |
|
|
Each additional CPU the calls the architecture specific function
|
128 |
|
|
|
129 |
|
|
{\tt \bf void smp\_callin(void)}
|
130 |
|
|
|
131 |
|
|
which does any final setup and then spins the processor while the boot
|
132 |
|
|
up processor forks off enough idle threads for each processor. This is
|
133 |
|
|
necessary because the scheduler assumes there is always something to run.
|
134 |
|
|
Having generated these threads and forked init the architecture specific
|
135 |
|
|
|
136 |
|
|
{\tt \bf void smp\_commence(void)}
|
137 |
|
|
|
138 |
|
|
function is invoked. This does any final setup and indicates to the system
|
139 |
|
|
that multiprocessor mode is now active. All the processors spinning in the
|
140 |
|
|
smp\_callin() function are now released to run the idle processes, which
|
141 |
|
|
they will run when they have no real work to process.
|
142 |
|
|
|
143 |
|
|
|
144 |
|
|
\subsubsection{Scheduling}
|
145 |
|
|
The kernel scheduler implements a simple but very and effective task
|
146 |
|
|
scheduler. The basic structure of this scheduler is unchanged in the
|
147 |
|
|
multiprocessor kernel. A processor field is added to each task, and this
|
148 |
|
|
maintains the number of the processor executing a given task, or a magic
|
149 |
|
|
constant (NO\_PROC\_ID) indicating the job is not allocated to a processor.
|
150 |
|
|
|
151 |
|
|
Each processor executes the scheduler itself and will select the next task
|
152 |
|
|
to run from all runnable processes not allocated to a different processor.
|
153 |
|
|
The algorithm used by the selection is otherwise unchanged. This is
|
154 |
|
|
actually inadequate for the final system because there are advantages to
|
155 |
|
|
keeping a process on the same CPU, especially on processor boards with per
|
156 |
|
|
processor second level caches.
|
157 |
|
|
|
158 |
|
|
Throughout the kernel the variable 'current' is used as a global for the
|
159 |
|
|
current process. In Linux/SMP this becomes a macro which expands to
|
160 |
|
|
current\_set[smp\_processor\_id()]. This enables almost the entire kernel to
|
161 |
|
|
be unaware of the array of running processors, but still allows the SMP
|
162 |
|
|
aware kernel modules to see all of the running processes.
|
163 |
|
|
|
164 |
|
|
The fork system call is modified to generate multiple processes with a
|
165 |
|
|
process id of zero until the SMP kernel starts up properly. This is
|
166 |
|
|
necessary because process number 1 must be init, and it is desirable that
|
167 |
|
|
all the system threads are process 0.
|
168 |
|
|
|
169 |
|
|
The final area within the scheduling of processes that does cause problems
|
170 |
|
|
is the fact the uniprocessor kernel hard codes tests for the idle threads
|
171 |
|
|
as task[0] and the init process as task[1]. Because there are multiple idle
|
172 |
|
|
threads it is necessary to replace these with tests that the process id is
|
173 |
|
|
|
174 |
|
|
|
175 |
|
|
\subsubsection{Memory Management}
|
176 |
|
|
The memory management core of the existing Linux system functions
|
177 |
|
|
adequately within the multiprocessor framework providing the locking is
|
178 |
|
|
used. Certain processor specific areas do need changing, in particular
|
179 |
|
|
invalidate() must invalidate the TLBs of all processors before it returns.
|
180 |
|
|
|
181 |
|
|
|
182 |
|
|
\subsubsection{Miscellaneous Functions}
|
183 |
|
|
The portable SMP code rests on a small set of functions and variables
|
184 |
|
|
that are provided by the processor specification functionality. These are
|
185 |
|
|
|
186 |
|
|
{\tt \bf int smp\_processor\_id(void) }
|
187 |
|
|
|
188 |
|
|
which returns the identity of the process the call is executed upon. This
|
189 |
|
|
call is assumed to be valid at all times. This may mean additional tests
|
190 |
|
|
are needed during initialisation.
|
191 |
|
|
|
192 |
|
|
|
193 |
|
|
{\tt \bf int smp\_num\_cpus;}
|
194 |
|
|
|
195 |
|
|
This is the number of processors in the system. \
|
196 |
|
|
|
197 |
|
|
{\tt \bf void smp\_message\_pass(int target, int msg, unsigned long data,
|
198 |
|
|
int wait)}
|
199 |
|
|
|
200 |
|
|
This function passes messages between processors. At the moment it is not
|
201 |
|
|
sufficiently defined to sensibly document and needs cleaning up and further
|
202 |
|
|
work. Refer to the processor specific code documentation for more details.
|
203 |
|
|
|
204 |
|
|
|
205 |
|
|
\subsection{Architecture Specific Code For the Intel MP Port}
|
206 |
|
|
The architecture specific code for the intel port splits fairly cleanly
|
207 |
|
|
into four sections. Firstly the initialisation code used to boot the
|
208 |
|
|
system, secondly the message handling and support code, thirdly the
|
209 |
|
|
interrupt and kernel syscall entry function handling and finally the
|
210 |
|
|
extensions to standard kernel facilities to cope with multiple processors.
|
211 |
|
|
|
212 |
|
|
\subsubsection{Initialisation}
|
213 |
|
|
The Intel MP architecture captures all the processors except for a single
|
214 |
|
|
processor known as the 'boot processor' in the BIOS at boot time. Thus a
|
215 |
|
|
single processor enters the kernel bootup code. The first processor
|
216 |
|
|
executes the bootstrap code, loads and uncompresses the kernel. Having
|
217 |
|
|
unpacked the kernel it sets up the paging and control registers then enters
|
218 |
|
|
the C kernel startup.
|
219 |
|
|
|
220 |
|
|
The assembler startup code for the kernel is modified so that it can be
|
221 |
|
|
used by the other processors to do the processor identification and various
|
222 |
|
|
other low level configurations but does not execute those parts of the
|
223 |
|
|
startup code that would damage the running system (such as clearing the BSS
|
224 |
|
|
segment).
|
225 |
|
|
|
226 |
|
|
In the initialisation done by the first processor the arch/i386/mm/init
|
227 |
|
|
code is modified to scan the low page, top page and BIOS for intel MP
|
228 |
|
|
signature blocks. This is necessary because the MP signature blocks must
|
229 |
|
|
be read and processed before the kernel is allowed to allocate and destroy
|
230 |
|
|
the page at the top of low memory. Having established the number of
|
231 |
|
|
processors it reserves a set of pages to provide a stack come boot up area
|
232 |
|
|
for each processor in the system. These must be allocated at startup to
|
233 |
|
|
ensure they fall below the 1Mb boundary.
|
234 |
|
|
|
235 |
|
|
Further processors are started up in smp\_boot\_cpus() by programming the
|
236 |
|
|
APIC controller registers and sending an inter-processor interrupt (IPI) to
|
237 |
|
|
the processor. This message causes the target processor to begin executing
|
238 |
|
|
code at the start of any page of memory within the lowest 1Mb, in 16bit
|
239 |
|
|
real mode. The kernel uses the single page it allocated for each processor
|
240 |
|
|
to use as stack. Before booting a given CPU the relocatable code from
|
241 |
|
|
trampoline.S and trampoline32.S is copied to the bottom of its stack page
|
242 |
|
|
and used as the target for the startup.
|
243 |
|
|
|
244 |
|
|
The trampoline code calculates the desired stack base from the code
|
245 |
|
|
segment (since the code segment on startup is the bottom of the stack),
|
246 |
|
|
enters 32bit mode and jumps to the kernel entry assembler. This as
|
247 |
|
|
described above is modified to only execute the parts necessary for each
|
248 |
|
|
processor, and then to enter start\_kernel(). On entering the kernel the
|
249 |
|
|
processor initialises its trap and interrupt handlers before entering
|
250 |
|
|
smp\_callin(), where it reports its status and sets a flag that causes the
|
251 |
|
|
boot processor to continue and look for further processors. The processor
|
252 |
|
|
then spins until smp\_commence() is invoked.
|
253 |
|
|
|
254 |
|
|
Having started each processor up the smp\_commence( ) function flips a
|
255 |
|
|
flag. Each processor spinning in smp\_callin() then loads the task register
|
256 |
|
|
with the task state segment (TSS) of its idle thread as is needed for task
|
257 |
|
|
switching.
|
258 |
|
|
|
259 |
|
|
\subsubsection{Message Handling and Support Code}
|
260 |
|
|
The architecture specific code implements the smp\_processor\_id() function
|
261 |
|
|
by querying the APIC logical identity register. Because the APIC isn't
|
262 |
|
|
mapped into the kernel address space at boot, the initial value returned is
|
263 |
|
|
rigged by setting the APIC base pointer to point at a suitable constant.
|
264 |
|
|
Once the system starts doing the SMP setup (in smp\_boot\_cpus()), the APIC
|
265 |
|
|
is mapped with a vremap() call and the apic pointer is adjusted
|
266 |
|
|
appropriately. From then on the real APIC logical identity register is
|
267 |
|
|
read.
|
268 |
|
|
|
269 |
|
|
Message passing is accomplished using a pair of IPIs on interrupt 13
|
270 |
|
|
(unused by the 80486 FPUs in SMP mode) and interrupt 16. Two are used in
|
271 |
|
|
order to separate messages that cannot be processed until the receiver
|
272 |
|
|
obtains the kernel spinlock from messages that can be processed
|
273 |
|
|
immediately. In effect IRQ 13 is a fast IRQ handler that does not obtain
|
274 |
|
|
the locks, and cannot cause a reschedule, while IRQ 16 is a slow IRQ that
|
275 |
|
|
must acquire the kernel spinlocks and can cause a reschedule. This
|
276 |
|
|
interrupt is used for passing on slave timer messages from the processor
|
277 |
|
|
that receives the timer interrupt to the rest of the processors, so that
|
278 |
|
|
they can reschedule running tasks.
|
279 |
|
|
|
280 |
|
|
|
281 |
|
|
\subsubsection{Entry And Exit Code}
|
282 |
|
|
A single spinlock protects the entire kernel. The interrupt handlers, the
|
283 |
|
|
syscall entry code and the exception handlers all acquire the lock before
|
284 |
|
|
entering the kernel proper. When the processor is trying to acquire the
|
285 |
|
|
spinlock it spins continually on the lock with interrupts disabled. This
|
286 |
|
|
causes a specific deadlock problem. The lock owner may need to send an
|
287 |
|
|
invalidate request to the rest of the processors and wait for these to
|
288 |
|
|
complete before continuing. A processor spinning on the lock would not be
|
289 |
|
|
able to do thus. Thus the loop of the spinlock tests and handles invalidate
|
290 |
|
|
requests. If the invalidate bit for the spinning CPU is set the processor
|
291 |
|
|
invalidates its TLB and atomically clears the bit. When the spinlock is
|
292 |
|
|
obtained that processor will take an IPI and in the IPI test the bit and
|
293 |
|
|
skip the invalidate as the bit is clear.
|
294 |
|
|
|
295 |
|
|
One complexity of the spinlock is that a process running in kernel mode
|
296 |
|
|
can sleep voluntarily and be pre-empted. A switch from such a process to a
|
297 |
|
|
process executing in user space may reduce the lock count. To track this
|
298 |
|
|
the kernel uses a syscall\_count and a per process lock\_depth parameter to
|
299 |
|
|
track the kernel lock state. The switch\_to() function is modified in SMP
|
300 |
|
|
mode to adjust the lock appropriately.
|
301 |
|
|
|
302 |
|
|
The final problem is the idle thread. In the single processor kernel the
|
303 |
|
|
idle thread executes 'hlt' instructions. This saves power and reduces the
|
304 |
|
|
running temperature of the processors when they are idle. However it means
|
305 |
|
|
the process spends all its time in kernel mode and would thus hold the
|
306 |
|
|
kernel spinlock. The SMP idle thread continually reschedules a new task and
|
307 |
|
|
returns to user mode. This is far from ideal and will be modified to use
|
308 |
|
|
'hlt' instructions and release the spinlock soon. Using 'hlt' is even more
|
309 |
|
|
beneficial on a multiprocessor system as it almost completely takes an idle
|
310 |
|
|
processor off the bus.
|
311 |
|
|
|
312 |
|
|
Interrupts are distributed by an i82489 APIC. This chip is set up to work
|
313 |
|
|
as an emulation of the traditional PC interrupt controllers when the
|
314 |
|
|
machine boots (so that an Intel MP machine boots one CPU and PC
|
315 |
|
|
compatible). The kernel has all the relevant locks but does not yet
|
316 |
|
|
reprogram the 82489 to deliver interrupts to arbitrary processors as it
|
317 |
|
|
should. This requires further modification of the standard Linux interrupt
|
318 |
|
|
handling code, and is particularly messy as the interrupt handler behaviour
|
319 |
|
|
has to change as soon as the 82489 is switched into SMP mode.
|
320 |
|
|
|
321 |
|
|
|
322 |
|
|
\subsubsection{Extensions To Standard Facilities}
|
323 |
|
|
The kernel maintains a set of per processor control information such as
|
324 |
|
|
the speed of the processor for delay loops. These functions on the SMP
|
325 |
|
|
kernel look the values up in a per processor array that is set up from the
|
326 |
|
|
data generated at boot up by the smp\_store\_cpu\_info() function. This
|
327 |
|
|
includes other facts such as whether there is an FPU on the processor. The
|
328 |
|
|
current kernel does not handle floating point correctly, this requires some
|
329 |
|
|
changes to the techniques the single CPU kernel uses to minimise floating
|
330 |
|
|
point processor reloads.
|
331 |
|
|
|
332 |
|
|
The highly useful atomic bit operations are prefixed with the 'lock'
|
333 |
|
|
prefix in the SMP kernel to maintain their atomic properties when used
|
334 |
|
|
outside of (and by) the spinlock and message code. Amongst other things
|
335 |
|
|
this is needed for the invalidate handler, as all CPU's will invalidate at
|
336 |
|
|
the same time without any locks.
|
337 |
|
|
|
338 |
|
|
Interrupt 13 floating point error reporting is removed. This facility is
|
339 |
|
|
not usable on a multiprocessor board, nor relevant to the Intel MP
|
340 |
|
|
architecture which does not cover the 80386/80387 processor pair. \
|
341 |
|
|
|
342 |
|
|
The /proc filesystem support is changed so that the /proc/cpuinfo file
|
343 |
|
|
contains a column for each processor present. This information is extracted
|
344 |
|
|
from the data save by smp\_store\_cpu\_info().
|
345 |
|
|
|
346 |
|
|
\end{document}
|