#
# Copyright (c) 2006 Steven Rostedt
# Licensed under the GNU Free Documentation License, Version 1.2
#

RT-mutex implementation design
------------------------------

This document tries to describe the design of the rtmutex.c implementation.
It doesn't describe the reasons why rtmutex.c exists. For that please see
Documentation/rt-mutex.txt. This document does explain problems that happen
without this code, but only as background needed to understand what the code
actually is doing.

The goal of this document is to help others understand the priority
inheritance (PI) algorithm that is used, as well as reasons for the
decisions that were made to implement PI in the manner that was done.


Unbounded Priority Inversion
----------------------------

Priority inversion is when a lower priority process executes while a higher
priority process wants to run. This happens for several reasons, and
most of the time it can't be helped. Anytime a high priority process wants
to use a resource that a lower priority process has (a mutex for example),
the high priority process must wait until the lower priority process is done
with the resource. This is a priority inversion. What we want to prevent
is something called unbounded priority inversion. That is when the high
priority process is prevented from running by a lower priority process for
an undetermined amount of time.

The classic example of unbounded priority inversion is where you have three
processes, let's call them processes A, B, and C, where A is the highest
priority process, C is the lowest, and B is in between. A tries to grab a lock
that C owns and must wait, letting C run to release the lock. But in the
meantime, B executes, and since B is of a higher priority than C, it preempts C,
but by doing so, it is in fact preempting A which is a higher priority process.
Now there's no way of knowing how long A will be sleeping waiting for C
to release the lock, because for all we know, B is a CPU hog and will
never give C a chance to release the lock. This is called unbounded priority
inversion.

Here's a little ASCII art to show the problem.

   grab lock L1 (owned by C)
     |
A ---+
        C preempted by B
          |
C    +----+

B         +-------->
                B now keeps A from running.


Priority Inheritance (PI)
-------------------------

There are several ways to solve this issue, but other ways are out of scope
for this document. Here we only discuss PI.

PI is where a process inherits the priority of another process if the other
process blocks on a lock owned by the current process. To make this easier
to understand, let's use the previous example, with processes A, B, and C again.

This time, when A blocks on the lock owned by C, C would inherit the priority
of A. So now if B becomes runnable, it would not preempt C, since C now has
the high priority of A. As soon as C releases the lock, it loses its
inherited priority, and A then can continue with the resource that C had.

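As a rough illustration of the A, B, C example, the following sketch picks the
next process to run, first without and then with an inherited priority. The
struct proc type and the effective_prio and pick_next helpers are made up for
this illustration and are not kernel code. The sketch assumes that a lower
prio number means a higher priority.

```c
#include <stddef.h>

/* Hypothetical process descriptor; lower prio number means higher priority. */
struct proc {
	char name;
	int normal_prio;	/* priority the process was given */
	int inherited_prio;	/* priority inherited via PI, or -1 if none */
	int runnable;		/* 0 while blocked on a lock */
};

/* Effective priority: the better (numerically lower) of the two. */
static int effective_prio(const struct proc *p)
{
	if (p->inherited_prio >= 0 && p->inherited_prio < p->normal_prio)
		return p->inherited_prio;
	return p->normal_prio;
}

/* Pick the runnable process with the best effective priority. */
static const struct proc *pick_next(const struct proc *procs, size_t n)
{
	const struct proc *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!procs[i].runnable)
			continue;
		if (!best || effective_prio(&procs[i]) < effective_prio(best))
			best = &procs[i];
	}
	return best;
}
```

With A blocked and no inheritance, pick_next chooses B over C. Once C's
inherited_prio is set to A's priority, C is chosen instead, so B can no
longer keep A waiting indefinitely.
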
Terminology
-----------

Here I explain some terminology that is used in this document to help describe
the design that is used to implement PI.

PI chain - The PI chain is an ordered series of locks and processes that cause
           processes to inherit priorities from a previous process that is
           blocked on one of its locks. This is described in more detail
           later in this document.

mutex    - In this document, to differentiate between locks that implement
           PI and spin locks that are used in the PI code, from now on
           the PI locks will be called mutexes.

lock     - In this document from now on, I will use the term lock when
           referring to spin locks that are used to protect parts of the PI
           algorithm. These locks disable preemption for UP (when
           CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from
           entering critical sections simultaneously.

spin lock - Same as lock above.

waiter   - A waiter is a struct that is stored on the stack of a blocked
           process. Since the scope of the waiter is within the code for
           a process being blocked on the mutex, it is fine to allocate
           the waiter on the process's stack (local variable). This
           structure holds a pointer to the task, as well as the mutex that
           the task is blocked on. It also has the plist node structures to
           place the task in the wait_list of a mutex as well as the
           pi_list of a mutex owner task (described below).

           waiter is sometimes used in reference to the task that is waiting
           on a mutex. This is the same as waiter->task.

waiters  - A list of processes that are blocked on a mutex.

top waiter - The highest priority process waiting on a specific mutex.

top pi waiter - The highest priority process waiting on one of the mutexes
                that a specific process owns.

Note: task and process are used interchangeably in this document, mostly to
      differentiate between two processes that are being described together.


PI chain
--------

The PI chain is a list of processes and mutexes that may cause priority
inheritance to take place. Multiple chains may converge, but a chain
would never diverge, since a process can't be blocked on more than one
mutex at a time.

Example:

   Processes: A, B, C, D, E
   Mutexes:   L1, L2, L3, L4

   A owns L1
          B blocked on L1
          B owns L2
                 C blocked on L2
                 C owns L3
                        D blocked on L3
                        D owns L4
                               E blocked on L4

The chain would be:

   E->L4->D->L3->C->L2->B->L1->A

To show where two chains merge, we could add another process F and
another mutex L5 where B owns L5 and F is blocked on mutex L5.

The chain for F would be:

   F->L5->B->L1->A

Since a process may own more than one mutex, but never be blocked on more than
one, the chains merge.

Here we show both chains:

   E->L4->D->L3->C->L2-+
                       |
                       +->B->L1->A
                       |
                 F->L5-+

For PI to work, the processes at the right end of these chains (or we may
also call it the Top of the chain) must be equal to or higher in priority
than the processes to the left or below in the chain.

Also since a mutex may have more than one process blocked on it, we can
have multiple chains merge at mutexes. If we add another process G that is
blocked on mutex L2:

   G->L2->B->L1->A

And once again, to show how this can grow I will show the merging chains
again.

   E->L4->D->L3->C-+
                   +->L2-+
                   |     |
                 G-+     +->B->L1->A
                         |
                   F->L5-+

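The chains in this section can be followed with a pair of pointers: each task
points at the one mutex it is blocked on, and each mutex points at its owner.
The sketch below is only an illustration under that assumption. The pi_task
and pi_mutex types and the chain_top helper are hypothetical names, not the
kernel's structures, and the walk assumes there is no deadlock (no cycle).

```c
#include <stddef.h>

struct pi_mutex;

/* Hypothetical, simplified task and mutex; not the kernel structures. */
struct pi_task {
	const char *name;
	struct pi_mutex *blocked_on;	/* at most one mutex, or NULL */
};

struct pi_mutex {
	const char *name;
	struct pi_task *owner;		/* NULL if the mutex is free */
};

/*
 * Follow task -> mutex -> owner links until a task that is not blocked.
 * Returns the task at the top of the chain and stores the number of
 * tasks visited in *ntasks (if ntasks is not NULL).
 */
static struct pi_task *chain_top(struct pi_task *task, int *ntasks)
{
	int count = 1;

	while (task->blocked_on && task->blocked_on->owner) {
		task = task->blocked_on->owner;
		count++;
	}
	if (ntasks)
		*ntasks = count;
	return task;
}
```

Because a task has a single blocked_on pointer, a walk started at E or at F
can only ever move toward A, which is why chains merge but never diverge.
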
Plist
-----

Before I go further and talk about how the PI chain is stored through lists
on both mutexes and processes, I'll explain the plist. This is similar to
the struct list_head functionality that is already in the kernel.
The implementation of plist is out of scope for this document, but it is
very important to understand what it does.

There are a few differences between plist and list, the most important one
being that plist is a priority sorted linked list. This means that the
nodes of the plist are kept sorted by priority, such that it takes O(1) to
retrieve the highest priority item in the list. Obviously this is useful to
store processes based on their priorities.

Another difference, which is important for implementation, is that, unlike
list, the head of the list is a different element than the nodes of a list.
So the head of the list is declared as struct plist_head and nodes that will
be added to the list are declared as struct plist_node.

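To make the idea of a priority sorted list concrete, here is a minimal sketch
of one. It is not the kernel's plist implementation: the prio_head and
prio_node types and both helpers are invented for this example, the list is
singly linked, and insertion is a simple O(n) scan. What it shares with plist
is a separate head type and O(1) access to the highest priority node. The
sketch assumes a lower prio number means a higher priority.

```c
#include <stddef.h>

/* Illustrative only: these are not the kernel's plist definitions. */
struct prio_node {
	int prio;			/* lower number means higher priority */
	struct prio_node *next;
};

struct prio_head {
	struct prio_node *first;	/* highest priority node, or NULL */
};

/* Insert keeping the list sorted; equal priorities stay in FIFO order. */
static void prio_list_add(struct prio_node *node, struct prio_head *head)
{
	struct prio_node **pos = &head->first;

	while (*pos && (*pos)->prio <= node->prio)
		pos = &(*pos)->next;
	node->next = *pos;
	*pos = node;
}

/* O(1): the head always points at the highest priority node. */
static struct prio_node *prio_list_first(const struct prio_head *head)
{
	return head->first;
}
```

Keeping equal priorities in FIFO order is a choice made for this sketch; the
text above does not say how plist orders nodes of equal priority.
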
Mutex Waiter List
-----------------

Every mutex keeps track of all the waiters that are blocked on it. The mutex
has a plist to store these waiters by priority. This list is protected by
a spin lock that is located in the struct of the mutex. This lock is called
wait_lock. Since the modification of the waiter list is never done in
interrupt context, the wait_lock can be taken without disabling interrupts.


Task PI List
------------

To keep track of the PI chains, each process has its own PI list. This is
a list of all top waiters of the mutexes that are owned by the process.
Note that this list only holds the top waiters and not all waiters that are
blocked on mutexes owned by the process.

The top of the task's PI list is always the highest priority task that
is waiting on a mutex that is owned by the task. So if the task has
inherited a priority, it will always be the priority of the task that is
at the top of this list.

This list is stored in the task structure of a process as a plist called
pi_list. This list is protected by a spin lock also in the task structure,
called pi_lock. This lock may also be taken in interrupt context, so when
locking the pi_lock, interrupts must be disabled.


Depth of the PI Chain
---------------------

The maximum depth of the PI chain is not dynamic, and could actually be
defined. But it is very complex to figure out, since it depends on all
the nesting of mutexes. Let's look at the example where we have 3 mutexes,
L1, L2, and L3, and four separate functions func1, func2, func3 and func4.
The following shows a locking order of L1->L2->L3, but may not actually
be directly nested that way.

void func1(void)
{
	mutex_lock(L1);

	/* do anything */

	mutex_unlock(L1);
}

void func2(void)
{
	mutex_lock(L1);
	mutex_lock(L2);

	/* do something */

	mutex_unlock(L2);
	mutex_unlock(L1);
}

void func3(void)
{
	mutex_lock(L2);
	mutex_lock(L3);

	/* do something else */

	mutex_unlock(L3);
	mutex_unlock(L2);
}

void func4(void)
{
	mutex_lock(L3);

	/* do something again */

	mutex_unlock(L3);
}

Now we add 4 processes that run each of these functions separately.
Processes A, B, C, and D which run functions func1, func2, func3 and func4
respectively, and such that D runs first and A last. With D being preempted
in func4 in the "do something again" area, we have a locking that follows:

   D owns L3
          C blocked on L3
          C owns L2
                 B blocked on L2
                 B owns L1
                        A blocked on L1

And thus we have the chain A->L1->B->L2->C->L3->D.

This gives us a PI depth of 4 (four processes), but looking at any of the
functions individually, it seems as though they only have at most a locking
depth of two. So, although the locking depth is defined at compile time,
it still is very difficult to find the possibilities of that depth.

Now since mutexes can be defined by user-land applications, we don't want a
DoS (denial of service) type of application that nests large amounts of
mutexes to create a large PI chain, and have the code holding spin locks
while looking at a large amount of data. So to prevent this, the
implementation not only implements a maximum lock depth, but also only holds
at most two different locks at a time, as it walks the PI chain. More about
this below.


Mutex owner and flags
---------------------

The mutex structure contains a pointer to the owner of the mutex. If the
mutex is not owned, this owner is set to NULL. Since all architectures
have the task structure on at least a four byte alignment (and if this is
not true, the rtmutex.c code will be broken!), this allows for the two
least significant bits to be used as flags. This part is also described
in Documentation/rt-mutex.txt, but will also be briefly described here.

Bit 0 is used as the "Pending Owner" flag. This is described later.
Bit 1 is used as the "Has Waiters" flag. This is also described later
  in more detail, but is set whenever there are waiters on a mutex.

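The following sketch shows how two flag bits can share a word with an aligned
pointer, as described for the owner field. The macro names, struct demo_task
and the three helpers are assumptions made for this example and are not taken
from rtmutex.c. Only the bit assignments (bit 0 for "Pending Owner", bit 1 for
"Has Waiters") come from the text.

```c
#include <stddef.h>
#include <stdint.h>

#define OWNER_PENDING		1UL	/* bit 0: "Pending Owner" */
#define OWNER_HAS_WAITERS	2UL	/* bit 1: "Has Waiters"   */
#define OWNER_FLAG_MASK		3UL

/* Stand-in for the task structure; aligned to at least four bytes. */
struct demo_task {
	long dummy;
};

/* Pack an owner pointer and flags into one word. */
static uintptr_t owner_pack(struct demo_task *owner, uintptr_t flags)
{
	return (uintptr_t)owner | (flags & OWNER_FLAG_MASK);
}

/* Strip the flag bits to recover the task pointer. */
static struct demo_task *owner_task(uintptr_t word)
{
	return (struct demo_task *)(word & ~(uintptr_t)OWNER_FLAG_MASK);
}

/* Extract only the flag bits. */
static uintptr_t owner_flags(uintptr_t word)
{
	return word & OWNER_FLAG_MASK;
}
```

This only works because the two low bits of a suitably aligned pointer are
always zero, which is the alignment requirement the text warns about.
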
cmpxchg Tricks
--------------

Some architectures implement an atomic cmpxchg (Compare and Exchange). This
is used (when applicable) to keep the fast path of grabbing and releasing
mutexes short.

cmpxchg is basically the following function performed atomically:

/* Shown as plain C for clarity; the real operation is a single atomic step. */
unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C)
{
	unsigned long T = *A;	/* remember the old value of A */
	if (*A == *B) {
		*A = *C;	/* A was what we expected, so update it */
	}
	return T;		/* caller compares this with B */
}
#define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c)

This is really nice to have, since it allows you to only update a variable
if the variable is what you expect it to be. You know it succeeded if
the return value (the old value of A) is equal to B.

The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If
the architecture does not support CMPXCHG, then this macro is simply set
to fail every time. But if CMPXCHG is supported, then this will
help out extremely to keep the fast path short.

The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize
the system for architectures that support it. This will also be explained
later in this document.

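As a user-space analogy of the fast path, the sketch below uses the C11
atomic_compare_exchange_strong operation in place of the kernel's cmpxchg.
This is a swapped-in technique for illustration, not how rt_mutex_cmpxchg is
implemented. struct demo_mutex, fast_trylock and fast_unlock are hypothetical
names, and the owner is represented by an arbitrary nonzero word.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mutex holding only the owner word (0 means unowned). */
struct demo_mutex {
	_Atomic uintptr_t owner;
};

/* Fast lock: succeeds only if the owner word is exactly 0. */
static bool fast_trylock(struct demo_mutex *m, uintptr_t me)
{
	uintptr_t expected = 0;

	return atomic_compare_exchange_strong(&m->owner, &expected, me);
}

/* Fast unlock: succeeds only if we own it and no flag bits are set. */
static bool fast_unlock(struct demo_mutex *m, uintptr_t me)
{
	uintptr_t expected = me;

	return atomic_compare_exchange_strong(&m->owner, &expected, 0);
}
```

If any flag bit has been set in the owner word, the comparison in fast_unlock
fails, and the caller would have to fall back to a slow path. That is the
sense in which the flags and the compare-and-exchange work together.
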
Priority adjustments
--------------------

The implementation of the PI code in rtmutex.c has several places where a
process must adjust its priority. With the help of the pi_list of a
process it is rather easy to know what needs to be adjusted.

The functions implementing the task adjustments are rt_mutex_adjust_prio,
__rt_mutex_adjust_prio (same as the former, but expects the task pi_lock
to already be taken), rt_mutex_getprio, and rt_mutex_setprio.

rt_mutex_getprio and rt_mutex_setprio are only used in __rt_mutex_adjust_prio.

rt_mutex_getprio returns the priority that the task should have. Either the
task's own normal priority, or if a process of a higher priority is waiting on
a mutex owned by the task, then that higher priority should be returned.
Since the pi_list of a task holds a priority-ordered list of all the top
waiters of all the mutexes that the task owns, rt_mutex_getprio simply needs
to compare the top pi waiter to its own normal priority, and return the higher
priority back.

(Note: if looking at the code, you will notice that the lower number of
       prio is returned. This is because the prio field in the task structure
       is an inverse order of the actual priority. So a "prio" of 5 is
       of higher priority than a "prio" of 10.)

__rt_mutex_adjust_prio examines the result of rt_mutex_getprio, and if the
result does not equal the task's current priority, then rt_mutex_setprio
is called to adjust the priority of the task to the new priority.
Note that rt_mutex_setprio is defined in kernel/sched.c to implement the
actual change in priority.

It is interesting to note that __rt_mutex_adjust_prio can either increase
or decrease the priority of the task. In the case that a higher priority
process has just blocked on a mutex owned by the task, __rt_mutex_adjust_prio
would increase/boost the task's priority. But if a higher priority task
were for some reason to leave the mutex (timeout or signal), this same function
would decrease/unboost the priority of the task. That is because the pi_list
always contains the highest priority task that is waiting on a mutex owned
by the task, so we only need to compare the priority of that top pi waiter
to the normal priority of the given task.

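The comparison described in this section can be sketched in a few lines. This
is not the code from rtmutex.c: struct demo_task is a trimmed-down stand-in,
the whole pi_list is reduced to a single pointer to its top entry, and
demo_getprio and demo_adjust_prio are hypothetical names. As in the note
above, a lower prio number means a higher priority.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down task: lower prio number = higher priority. */
struct demo_task {
	int normal_prio;			/* priority without boosting */
	int prio;				/* priority currently in effect */
	const struct demo_task *top_pi_waiter;	/* top of pi_list, or NULL */
};

/* Priority the task should have: its own or its top pi waiter's. */
static int demo_getprio(const struct demo_task *task)
{
	if (task->top_pi_waiter &&
	    task->top_pi_waiter->prio < task->normal_prio)
		return task->top_pi_waiter->prio;
	return task->normal_prio;
}

/* Boost or unboost; returns 1 if the priority was changed, else 0. */
static int demo_adjust_prio(struct demo_task *task)
{
	int prio = demo_getprio(task);

	if (prio == task->prio)
		return 0;
	task->prio = prio;	/* stands in for the real priority change */
	return 1;
}
```

The same demo_adjust_prio call boosts the task when a higher priority waiter
appears and unboosts it when that waiter goes away, because the decision only
depends on the current top pi waiter and the task's normal priority.
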
High level overview of the PI chain walk
----------------------------------------

The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain.

The implementation has gone through several iterations, and has ended up
with what we believe is the best. It walks the PI chain by only grabbing
at most two locks at a time, and is very efficient.

The rt_mutex_adjust_prio_chain can be used either to boost or lower process
priorities.

rt_mutex_adjust_prio_chain is called with a task to be checked for PI
(de)boosting (the owner of a mutex that a process is blocking on), a flag to
check for deadlocking, the mutex that the task owns, and a pointer to a waiter
that is the process's waiter struct that is blocked on the mutex (although this
parameter may be NULL for deboosting).

For this explanation, I will not mention deadlock detection. This explanation
will try to stay at a high level.

When this function is called, there are no locks held. That also means
that the state of the owner and lock can change after this function is entered.

Before this function is called, the task has already had rt_mutex_adjust_prio
performed on it. This means that the task is set to the priority that it
should be at, but the plist nodes of the task's waiter have not been updated
with the new priorities, and that this task may not be in the proper locations
in the pi_lists and wait_lists that the task is blocked on. This function
solves all that.

A loop is entered, where task is the owner to be checked for PI changes that
was passed by parameter (for the first iteration). The pi_lock of this task is
taken to prevent any more changes to the pi_list of the task. This also
prevents new tasks from completing the blocking on a mutex that is owned by this
task.

If the task is not blocked on a mutex then the loop is exited. We are at
the top of the PI chain.

A check is now done to see if the original waiter (the process that is blocked
on the current mutex) is the top pi waiter of the task. That is, is this
waiter on the top of the task's pi_list. If it is not, it either means that
there is another process higher in priority that is blocked on one of the
mutexes that the task owns, or that the waiter has just woken up via a signal
or timeout and has left the PI chain. In either case, the loop is exited, since
we don't need to do any more changes to the priority of the current task, or any
task that owns a mutex that this current task is waiting on. A priority chain
walk is only needed when a new top pi waiter is made to a task.

The next check sees if the task's waiter plist node has the priority equal to
the priority the task is set at. If they are equal, then we are done with
the loop. Remember that the function started with the priority of the
task adjusted, but the plist nodes that hold the task in other processes'
pi_lists have not been adjusted.

Next, we look at the mutex that the task is blocked on. The mutex's wait_lock
is taken. This is done by a spin_trylock, because the locking order of the
pi_lock and wait_lock goes in the opposite direction. If we fail to grab the
lock, the pi_lock is released, and we restart the loop.

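The trylock-and-restart step can be sketched outside the kernel as follows.
This is an illustration of the general back-off pattern only. The toy
demo_spinlock type is built on a C11 atomic_flag and is not the kernel's spin
lock, lock_pi_then_wait is a hypothetical helper, and the max_retries limit
exists only so the sketch is guaranteed to terminate.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spin lock built on a C11 atomic flag; not the kernel spin lock. */
struct demo_spinlock {
	atomic_flag held;
};

static bool demo_trylock(struct demo_spinlock *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

static void demo_lock(struct demo_spinlock *l)
{
	while (!demo_trylock(l))
		;	/* spin until the lock is free */
}

static void demo_unlock(struct demo_spinlock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Take pi_lock, then wait_lock, although other code takes them in the
 * opposite order. Never spin on wait_lock while holding pi_lock: back
 * off and retry instead. Returns the number of retries needed, or -1
 * (with no locks held) once max_retries is exceeded.
 */
static int lock_pi_then_wait(struct demo_spinlock *pi_lock,
			     struct demo_spinlock *wait_lock,
			     int max_retries)
{
	int retries = 0;

	for (;;) {
		demo_lock(pi_lock);
		if (demo_trylock(wait_lock))
			return retries;		/* both locks held */
		demo_unlock(pi_lock);		/* let the other side progress */
		if (++retries > max_retries)
			return -1;
	}
}
```

Dropping pi_lock before retrying is what avoids a deadlock with a path that
already holds wait_lock and is waiting for the same pi_lock.
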
462 |
|
|
Now that we have both the pi_lock of the task as well as the wait_lock of
|
463 |
|
|
the mutex the task is blocked on, we update the task's waiter's plist node
|
464 |
|
|
that is located on the mutex's wait_list.
|
465 |
|
|
|
466 |
|
|
Now we release the pi_lock of the task.
|
467 |
|
|
|
468 |
|
|
Next the owner of the mutex has its pi_lock taken, so we can update the
|
469 |
|
|
task's entry in the owner's pi_list. If the task is the highest priority
|
470 |
|
|
process on the mutex's wait_list, then we remove the previous top waiter
|
471 |
|
|
from the owner's pi_list, and replace it with the task.
|
472 |
|
|
|
473 |
|
|
Note: It is possible that the task was the current top waiter on the mutex,
|
474 |
|
|
in which case the task is not yet on the pi_list of the waiter. This
|
475 |
|
|
is OK, since plist_del does nothing if the plist node is not on any
|
476 |
|
|
list.
|
477 |
|
|
|
478 |
|
|
If the task was not the top waiter of the mutex, but it was before we
|
479 |
|
|
did the priority updates, that means we are deboosting/lowering the
|
480 |
|
|
task. In this case, the task is removed from the pi_list of the owner,
|
481 |
|
|
and the new top waiter is added.
|
482 |
|
|
|
483 |
|
|
Lastly, we unlock both the pi_lock of the task, as well as the mutex's
|
484 |
|
|
wait_lock, and continue the loop again. On the next iteration of the
|
485 |
|
|
loop, the previous owner of the mutex will be the task that will be
|
486 |
|
|
processed.
|
487 |
|
|
|
488 |
|
|
Note: One might think that the owner of this mutex might have changed
|
489 |
|
|
since we just grab the mutex's wait_lock. And one could be right.
|
490 |
|
|
The important thing to remember is that the owner could not have
|
491 |
|
|
become the task that is being processed in the PI chain, since
|
492 |
|
|
we have taken that task's pi_lock at the beginning of the loop.
|
493 |
|
|
So as long as there is an owner of this mutex that is not the same
|
494 |
|
|
process as the tasked being worked on, we are OK.
|
495 |
|
|
|
496 |
|
|
Looking closely at the code, one might be confused. The check for the
|
497 |
|
|
end of the PI chain is when the task isn't blocked on anything or the
|
498 |
|
|
task's waiter structure "task" element is NULL. This check is
|
499 |
|
|
protected only by the task's pi_lock. But the code to unlock the mutex
|
500 |
|
|
sets the task's waiter structure "task" element to NULL with only
|
501 |
|
|
the protection of the mutex's wait_lock, which was not taken yet.
|
502 |
|
|
Isn't this a race condition if the task becomes the new owner?
|
503 |
|
|
|
504 |
|
|
The answer is No! The trick is the spin_trylock of the mutex's
|
505 |
|
|
wait_lock. If we fail that lock, we release the pi_lock of the
|
506 |
|
|
task and continue the loop, doing the end of PI chain check again.
|
507 |
|
|
|
508 |
|
|
In the code to release the lock, the wait_lock of the mutex is held
|
509 |
|
|
the entire time, and it is not let go when we grab the pi_lock of the
|
510 |
|
|
new owner of the mutex. So if the switch of a new owner were to happen
|
511 |
|
|
after the check for end of the PI chain and the grabbing of the
|
512 |
|
|
wait_lock, the unlocking code would spin on the new owner's pi_lock
|
513 |
|
|
but never give up the wait_lock. So the PI chain loop is guaranteed to
|
514 |
|
|
fail the spin_trylock on the wait_lock, release the pi_lock, and
|
515 |
|
|
try again.
|
516 |
|
|
|
517 |
|
|
If you don't quite understand the above, that's OK. You don't have to,
|
518 |
|
|
unless you really want to make a proof out of it ;)
|
519 |
|
|
|
520 |
|
|
|
521 |
|
|
Pending Owners and Lock stealing
|
522 |
|
|
--------------------------------
|
523 |
|
|
|
524 |
|
|
One of the flags in the owner field of the mutex structure is "Pending Owner".
|
525 |
|
|
What this means is that an owner was chosen by the process releasing the
|
526 |
|
|
mutex, but that owner has yet to wake up and actually take the mutex.
|
527 |
|
|
|
528 |
|
|
Why is this important? Why can't we just give the mutex to another process
|
529 |
|
|
and be done with it?
|
530 |
|
|
|
531 |
|
|
The PI code is to help with real-time processes, and to let the highest
|
532 |
|
|
priority process run as long as possible with little latencies and delays.
|
533 |
|
|
If a high priority process owns a mutex that a lower priority process is
|
534 |
|
|
blocked on, when the mutex is released it would be given to the lower priority
|
535 |
|
|
process. What if the higher priority process wants to take that mutex again.
|
536 |
|
|
The high priority process would fail to take that mutex that it just gave up
|
537 |
|
|
and it would need to boost the lower priority process to run with full
|
538 |
|
|
latency of that critical section (since the low priority process just entered
|
539 |
|
|
it).
|
540 |
|
|
|
541 |
|
|
There's no reason a high priority process that gives up a mutex should be
|
542 |
|
|
penalized if it tries to take that mutex again. If the new owner of the
|
543 |
|
|
mutex has not woken up yet, there's no reason that the higher priority process
|
544 |
|
|
could not take that mutex away.
|
545 |
|
|
|
546 |
|
|
To solve this, we introduced Pending Ownership and Lock Stealing. When a
|
547 |
|
|
new process is given a mutex that it was blocked on, it is only given
|
548 |
|
|
pending ownership. This means that it's the new owner, unless a higher
|
549 |
|
|
priority process comes in and tries to grab that mutex. If a higher priority
|
550 |
|
|
process does come along and wants that mutex, we let the higher priority
|
551 |
|
|
process "steal" the mutex from the pending owner (only if it is still pending)
|
552 |
|
|
and continue with the mutex.
|
553 |
|
|
|
554 |
|
|
|
555 |
|
|
Taking of a mutex (The walk through)
|
556 |
|
|
------------------------------------
|
557 |
|
|
|
558 |
|
|
OK, now let's take a look at the detailed walk through of what happens when
|
559 |
|
|
taking a mutex.
|
560 |
|
|
|
561 |
|
|
The first thing that is tried is the fast taking of the mutex. This is
|
562 |
|
|
done when we have CMPXCHG enabled (otherwise the fast taking automatically
|
563 |
|
|
fails). Only when the owner field of the mutex is NULL can the lock be
|
564 |
|
|
taken with the CMPXCHG and nothing else needs to be done.
|
565 |
|
|
|
566 |
|
|
If there is contention on the lock, whether it is owned or pending owner
|
567 |
|
|
we go about the slow path (rt_mutex_slowlock).
|
568 |
|
|
|
569 |
|
|
The slow path function is where the task's waiter structure is created on
|
570 |
|
|
the stack. This is because the waiter structure is only needed for the
|
571 |
|
|
scope of this function. The waiter structure holds the nodes to store
|
572 |
|
|
the task on the wait_list of the mutex, and if need be, the pi_list of
|
573 |
|
|
the owner.
|
574 |
|
|
|
575 |
|
|
The wait_lock of the mutex is taken since the slow path of unlocking the
|
576 |
|
|
mutex also takes this lock.
|
577 |
|
|
|
578 |
|
|
We then call try_to_take_rt_mutex. This is where the architecture that
|
579 |
|
|
does not implement CMPXCHG would always grab the lock (if there's no
|
580 |
|
|
contention).
|
581 |
|
|
|
582 |
|
|
try_to_take_rt_mutex is used every time the task tries to grab a mutex in the
|
583 |
|
|
slow path. The first thing that is done here is an atomic setting of
|
584 |
|
|
the "Has Waiters" flag of the mutex's owner field. Yes, this could really
|
585 |
|
|
be false, because if the mutex has no owner, there are no waiters and
|
586 |
|
|
the current task also won't have any waiters. But we don't have the lock
|
587 |
|
|
yet, so we assume we are going to be a waiter. The reason for this is to
|
588 |
|
|
play nice for those architectures that do have CMPXCHG. By setting this flag
|
589 |
|
|
now, the owner of the mutex can't release the mutex without going into the
|
590 |
|
|
slow unlock path, and it would then need to grab the wait_lock, which this
|
591 |
|
|
code currently holds. So setting the "Has Waiters" flag forces the owner
|
592 |
|
|
to synchronize with this code.
|
593 |
|
|
|
594 |
|
|
Now that we know that we can't have any races with the owner releasing the
|
595 |
|
|
mutex, we check to see if we can take the ownership. This is done if the
|
596 |
|
|
mutex doesn't have a owner, or if we can steal the mutex from a pending
|
597 |
|
|
owner. Let's look at the situations we have here.
|
598 |
|
|
|
599 |
|
|
  1) Has owner that is pending
  ----------------------------

  The mutex has an owner, but it hasn't woken up and the mutex flag
  "Pending Owner" is set. The first check is to see if the owner isn't the
  current task. This is because this function is also used for the pending
  owner to grab the mutex. When a pending owner wakes up, it checks to see
  if it can take the mutex, and this is done if the owner is already set to
  itself. If so, we succeed and leave the function, clearing the "Pending
  Owner" bit.

  If the pending owner is not current, we check to see if the current
  priority is higher than the pending owner's. If not, we fail the function
  and return.

  There's also something special about a pending owner: a pending owner
  is never blocked on a mutex, so there is no PI chain to worry about. It also
  means that if the mutex doesn't have any waiters, there's no accounting
  needed to update the pending owner's pi_list, since we only worry about
  processes blocked on the current mutex.

  If there are waiters on this mutex, and we just stole the ownership, we need
  to take the top waiter, remove it from the pi_list of the pending owner, and
  add it to the current task's pi_list. Note that at this moment, the pending
  owner is no longer on the list of waiters. This is fine, since the pending
  owner would add itself back when it realizes that it had the ownership
  stolen from itself. When the pending owner tries to grab the mutex, it will
  fail in try_to_take_rt_mutex if the owner field points to another process.

  2) No owner
  -----------

  If there is no owner (or we successfully stole the lock), we set the owner
  of the mutex to current, and set the "Has Waiters" flag if the current
  mutex actually has waiters, or we clear the flag if it doesn't. See, it was
  OK that we set that flag early, since now it is cleared.

  3) Failed to grab ownership
  ---------------------------

  The most interesting case is when we fail to take ownership. This means that
  there exists an owner, or there's a pending owner with equal or higher
  priority than the current task.

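The three situations above can be condensed into a small decision sketch.
Everything here is a simplified assumption for illustration: the owner and its
flags are plain fields rather than one atomic word, wait_lock is assumed to be
held by the caller, a lower prio value means a higher priority, and the pi_list
bookkeeping that follows a steal is left out. The names try_to_take and
struct mutex_sketch are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task { int prio; };      /* assumed: lower value = higher priority */

struct mutex_sketch {
	struct task *owner;     /* NULL when the mutex is free */
	bool pending;           /* "Pending Owner" flag */
	bool has_waiters;       /* "Has Waiters" flag */
	int nr_waiters;         /* stand-in for the wait_list */
};

/* Returns true if "current" now owns the mutex. Caller holds wait_lock. */
static bool try_to_take(struct mutex_sketch *m, struct task *current)
{
	assert(m && current);

	/* Pessimistically assume we will end up as a waiter. */
	m->has_waiters = true;

	if (m->owner) {
		/* Case 3: a real (non-pending) owner can't be displaced. */
		if (!m->pending)
			return false;
		/* Case 1: steal only with strictly higher priority. */
		if (m->owner != current && current->prio >= m->owner->prio)
			return false;
	}

	/* Case 2 (or a successful steal, or the pending owner itself). */
	m->owner = current;
	m->pending = false;
	m->has_waiters = m->nr_waiters > 0;   /* correct the early guess */
	return true;
}
```

Note that on failure the "Has Waiters" flag stays set, which matches the text:
the caller is about to become a waiter, so the early guess turns out to be
right.
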
We'll continue on the failed case.

If the mutex has a timeout, we set up a timer to go off to break us out
of this mutex if we failed to get it after a specified amount of time.

Now we enter a loop that will continue to try to take ownership of the mutex,
or fail from a timeout or signal.

Once again we try to take the mutex. This will usually fail the first time
in the loop, since it had just failed to get the mutex. But the second time
in the loop, this would likely succeed, since the task would likely be
the pending owner.

If the mutex is being taken in the TASK_INTERRUPTIBLE state, a check for
signals and timeout is done here.

The waiter structure has a "task" field that points to the task that is blocked
on the mutex. This field can be NULL the first time it goes through the loop
or if the task is a pending owner and had its mutex stolen. If the "task"
field is NULL then we need to set up the accounting for it.

Task blocks on mutex
--------------------

The accounting of a mutex and process is done with the waiter structure of
the process. The "task" field is set to the process, and the "lock" field
to the mutex. The plist nodes are initialized to the process's current
priority.

Since the wait_lock was taken at the entry of the slow lock, we can safely
add the waiter to the wait_list. If the current process is the highest
priority process currently waiting on this mutex, then we remove the
previous top waiter process (if it exists) from the pi_list of the owner,
and add the current process to that list. Since the pi_list of the owner
has changed, we call rt_mutex_adjust_prio on the owner to see if the owner
should adjust its priority accordingly.

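The accounting just described might look roughly like the following sketch.
It is a deliberately simplified model: small arrays stand in for the plists,
no locks are taken, a lower prio value means a higher priority, and the helper
names (top_waiter, pi_remove, adjust_prio, block_on) are made up for the
example rather than taken from rtmutex.c.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8

struct task;
struct mutex_sketch;

struct waiter {
	struct task *task;              /* blocked task */
	struct mutex_sketch *lock;      /* mutex it is blocked on */
	int prio;                       /* priority at the time of blocking */
};

struct task {
	int normal_prio;                /* priority without any boosting */
	int prio;                       /* effective priority */
	struct waiter *pi_list[MAX_WAITERS];  /* top waiters of owned mutexes */
	int nr_pi;
};

struct mutex_sketch {
	struct task *owner;
	struct waiter *wait_list[MAX_WAITERS];
	int nr_waiters;
};

/* Highest priority waiter; the earlier waiter wins on equal priority. */
static struct waiter *top_waiter(struct mutex_sketch *m)
{
	struct waiter *top = NULL;

	for (int i = 0; i < m->nr_waiters; i++)
		if (!top || m->wait_list[i]->prio < top->prio)
			top = m->wait_list[i];
	return top;
}

static void pi_remove(struct task *t, struct waiter *w)
{
	for (int i = 0; i < t->nr_pi; i++)
		if (t->pi_list[i] == w) {
			t->pi_list[i] = t->pi_list[--t->nr_pi];
			return;
		}
}

/* Effective priority = best of normal_prio and every pi_list entry. */
static void adjust_prio(struct task *t)
{
	int prio = t->normal_prio;

	for (int i = 0; i < t->nr_pi; i++)
		if (t->pi_list[i]->prio < prio)
			prio = t->pi_list[i]->prio;
	t->prio = prio;
}

/* Queue "current" on the mutex and update the owner's pi_list if needed. */
static void block_on(struct mutex_sketch *m, struct waiter *w,
		     struct task *current)
{
	struct task *owner = m->owner;
	struct waiter *old_top = top_waiter(m);

	assert(owner && m->nr_waiters < MAX_WAITERS &&
	       owner->nr_pi < MAX_WAITERS);
	w->task = current;
	w->lock = m;
	w->prio = current->prio;
	m->wait_list[m->nr_waiters++] = w;

	if (top_waiter(m) != w)
		return;                 /* top waiter unchanged, nothing to do */
	if (old_top)
		pi_remove(owner, old_top);
	owner->pi_list[owner->nr_pi++] = w;
	adjust_prio(owner);
}
```

The owner's pi_list only ever holds one entry per mutex it owns (that mutex's
top waiter), which is why a new top waiter replaces the old one instead of
being added alongside it.
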
If the owner is also blocked on a lock, and had its pi_list changed
(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead
and run rt_mutex_adjust_prio_chain on the owner, as described earlier.

Now all locks are released, and if the current process is still blocked on a
mutex (waiter "task" field is not NULL), then we go to sleep (call schedule).

Waking up in the loop
---------------------

The task can then be woken from schedule for a few reasons:
  1) we were given pending ownership of the mutex.
  2) we received a signal and were TASK_INTERRUPTIBLE
  3) we had a timeout and were TASK_INTERRUPTIBLE

In any of these cases, we continue the loop and once again try to grab the
ownership of the mutex. If we succeed, we exit the loop. Otherwise, on a
signal or timeout we exit the loop, and if we had the mutex stolen we simply
add ourselves back on the lists and go back to sleep.

Note: Because of timeouts and signals, the steal mutex algorithm needs to be
      careful. This is because the current process is still on the wait_list.
      And because priorities change dynamically, especially for SCHED_OTHER
      tasks, the current process can be the highest priority task on the
      wait_list.

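The decision made on each pass through the loop can be summarised in a few
lines. This is only a sketch of the control flow described in this section:
struct wake_state and loop_step are invented for the example, and the use of
-EINTR and -ETIMEDOUT as the error values is an assumption, not something
this document states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum loop_action {
	LOOP_ACQUIRED,          /* we own the mutex, leave the loop */
	LOOP_ABORT,             /* signal or timeout, leave the loop */
	LOOP_SLEEP_AGAIN        /* re-queue if needed and call schedule */
};

/* Hypothetical snapshot of what the task sees after waking up. */
struct wake_state {
	bool took_mutex;        /* did the retry of the take succeed? */
	bool interruptible;     /* sleeping as TASK_INTERRUPTIBLE? */
	bool signal_pending;
	bool timed_out;
};

static enum loop_action loop_step(const struct wake_state *s, int *err)
{
	assert(s != NULL && err != NULL);
	*err = 0;

	/* Always retry the take first, whatever woke us. */
	if (s->took_mutex)
		return LOOP_ACQUIRED;

	/* Signals and timeouts only matter for interruptible sleeps. */
	if (s->interruptible) {
		if (s->signal_pending)
			*err = -EINTR;
		if (s->timed_out)
			*err = -ETIMEDOUT;
		if (*err)
			return LOOP_ABORT;
	}

	/* Mutex was stolen or a spurious wake-up: go back to sleep. */
	return LOOP_SLEEP_AGAIN;
}
```
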
Failed to get mutex on Timeout or Signal
----------------------------------------

If a timeout or signal occurred, the waiter's "task" field would not be
NULL and the task needs to be taken off the wait_list of the mutex and perhaps
the pi_list of the owner. If this process was a high priority process, then
rt_mutex_adjust_prio_chain needs to be executed again on the owner,
but this time it will be lowering the priorities.


Unlocking the Mutex
-------------------

The unlocking of a mutex also has a fast path for those architectures with
CMPXCHG. Since the taking of a mutex on contention always sets the
"Has Waiters" flag of the mutex's owner, we use this to know if we need to
take the slow path when unlocking the mutex. If the mutex doesn't have any
waiters, the owner field of the mutex would equal the current process and
the mutex can be unlocked by just replacing the owner field with NULL.

If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available),
the slow unlock path is taken.

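A minimal user-space sketch of that fast path, using a C11 compare-and-exchange
in place of the kernel's CMPXCHG, could look like this. The structure, the
function name unlock_sketch and the flag bit values are assumptions for the
example, and the slow path is only signalled to the caller, not implemented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct task { int prio; };      /* hypothetical stand-in for a task */

#define PENDING_OWNER 1UL       /* assumed bit position */
#define HAS_WAITERS   2UL       /* assumed bit position */
#define FLAG_MASK     (PENDING_OWNER | HAS_WAITERS)

struct mutex_sketch {
	atomic_uintptr_t owner; /* task pointer, flags in the two low bits */
};

enum unlock_path { UNLOCK_FAST, UNLOCK_SLOW };

/* Try the fast unlock; report which path the caller ended up on. */
static enum unlock_path unlock_sketch(struct mutex_sketch *m,
				      struct task *current)
{
	uintptr_t expected = (uintptr_t)current;

	/*
	 * Fast path: the field is exactly "current" with no flag bits,
	 * so there are no waiters and we can swap in NULL.
	 */
	if (atomic_compare_exchange_strong(&m->owner, &expected, (uintptr_t)0))
		return UNLOCK_FAST;

	/* On failure "expected" holds the real value; we must still own it. */
	assert((expected & ~(uintptr_t)FLAG_MASK) == (uintptr_t)current);

	/* A flag bit is set: the caller must take wait_lock and go slow. */
	return UNLOCK_SLOW;
}
```

Because the comparison is against the bare pointer, any flag bit set by a
contending task is enough to make the exchange fail, with no separate test of
the "Has Waiters" bit needed in the fast path.
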
The first thing done in the slow unlock path is to take the wait_lock of the
mutex. This synchronizes the locking and unlocking of the mutex.

A check is made to see if the mutex has waiters or not. On architectures that
do not have CMPXCHG, this is the place where the owner of the mutex will
determine if a waiter needs to be awoken or not. On architectures that
do have CMPXCHG, that check is done in the fast path, but it is still needed
in the slow path too. If a waiter of a mutex woke up because of a signal
or timeout between the time the owner failed the fast path CMPXCHG check and
the grabbing of the wait_lock, the mutex may not have any waiters, thus the
owner still needs to make this check. If there are no waiters then the mutex
owner field is set to NULL, the wait_lock is released and nothing more is
needed.

If there are waiters, then we need to wake one up and give that waiter
pending ownership.

In the wake-up code, the pi_lock of the current owner is taken. The top
waiter of the lock is found and removed from the wait_list of the mutex
as well as the pi_list of the current owner. The task field of the new
pending owner's waiter structure is set to NULL, and the owner field of the
mutex is set to the new owner with the "Pending Owner" bit set, as well
as the "Has Waiters" bit if there still are other processes blocked on the
mutex.

The pi_lock of the previous owner is released, and the new pending owner's
pi_lock is taken. Remember that this is the trick to prevent the race
condition in rt_mutex_adjust_prio_chain from adding itself as a waiter
on the mutex.

We now clear the "pi_blocked_on" field of the new pending owner, and if
the mutex still has waiters pending, we add the new top waiter to the pi_list
of the pending owner.

Finally we unlock the pi_lock of the pending owner and wake it up.


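The hand-off to the pending owner described above can be sketched as follows.
This is a single-threaded model with the pi_lock and wait_lock steps reduced to
comments, arrays in place of plists, and plain fields in place of the flag bits
in the owner word. The names wakeup_next, pi_add and pi_remove are assumptions
for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 8

struct task;

struct waiter {
	struct task *task;              /* NULL once no longer blocked */
	int prio;
};

struct task {
	int prio;
	struct waiter *pi_blocked_on;   /* waiter this task is blocked with */
	struct waiter *pi_list[MAX_WAITERS];
	int nr_pi;
};

struct mutex_sketch {
	struct task *owner;
	bool pending;                   /* "Pending Owner" flag */
	bool has_waiters;               /* "Has Waiters" flag */
	struct waiter *wait_list[MAX_WAITERS];  /* highest priority first */
	int nr_waiters;
};

static void pi_add(struct task *t, struct waiter *w)
{
	assert(t->nr_pi < MAX_WAITERS);
	t->pi_list[t->nr_pi++] = w;
}

static void pi_remove(struct task *t, struct waiter *w)
{
	for (int i = 0; i < t->nr_pi; i++)
		if (t->pi_list[i] == w) {
			t->pi_list[i] = t->pi_list[--t->nr_pi];
			return;
		}
}

/* Give pending ownership to the top waiter; the caller then wakes it. */
static struct task *wakeup_next(struct mutex_sketch *m)
{
	struct task *old = m->owner;
	struct waiter *top;
	struct task *next;

	assert(old && m->nr_waiters > 0);

	/* (pi_lock of the current owner would be taken here) */
	top = m->wait_list[0];
	next = top->task;
	for (int i = 1; i < m->nr_waiters; i++)
		m->wait_list[i - 1] = m->wait_list[i];
	m->nr_waiters--;
	pi_remove(old, top);
	top->task = NULL;

	m->owner = next;
	m->pending = true;
	m->has_waiters = m->nr_waiters > 0;

	/* (old owner's pi_lock released, pending owner's pi_lock taken) */
	next->pi_blocked_on = NULL;
	if (m->nr_waiters > 0)
		pi_add(next, m->wait_list[0]);

	/* (pending owner's pi_lock released) */
	return next;
}
```
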
Contact
-------

For updates on this document, please email Steven Rostedt


Credits
-------

Author:  Steven Rostedt

Reviewers:  Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and Randy Dunlap

Updates
-------

This document was originally written for 2.6.17-rc3-mm1