1 |
3 |
xianfeng |
Locking scheme used for directory operations is based on two
|
2 |
|
|
kinds of locks - per-inode (->i_mutex) and per-filesystem
|
3 |
|
|
(->s_vfs_rename_mutex).
|
4 |
|
|
|
5 |
|
|
For our purposes all operations fall in 5 classes:
|
6 |
|
|
|
7 |
|
|
1) read access. Locking rules: caller locks directory we are accessing.
|
8 |
|
|
|
9 |
|
|
2) object creation. Locking rules: same as above.
|
10 |
|
|
|
11 |
|
|
3) object removal. Locking rules: caller locks parent, finds victim,
|
12 |
|
|
locks victim and calls the method.
|
13 |
|
|
|
14 |
|
|
4) rename() that is _not_ cross-directory. Locking rules: caller locks
|
15 |
|
|
the parent, finds source and target, if target already exists - locks it
|
16 |
|
|
and then calls the method.
|
17 |
|
|
|
18 |
|
|
5) link creation. Locking rules:
|
19 |
|
|
* lock parent
|
20 |
|
|
* check that source is not a directory
|
21 |
|
|
* lock source
|
22 |
|
|
* call the method.
|
23 |
|
|
|
24 |
|
|
6) cross-directory rename. The trickiest in the whole bunch. Locking
|
25 |
|
|
rules:
|
26 |
|
|
* lock the filesystem
|
27 |
|
|
* lock parents in "ancestors first" order.
|
28 |
|
|
* find source and target.
|
29 |
|
|
* if old parent is equal to or is a descendent of target
|
30 |
|
|
fail with -ENOTEMPTY
|
31 |
|
|
* if new parent is equal to or is a descendent of source
|
32 |
|
|
fail with -ELOOP
|
33 |
|
|
* if target exists - lock it.
|
34 |
|
|
* call the method.
|
35 |
|
|
|
36 |
|
|
|
37 |
|
|
The rules above obviously guarantee that all directories that are going to be
|
38 |
|
|
read, modified or removed by method will be locked by caller.
|
39 |
|
|
|
40 |
|
|
|
41 |
|
|
If no directory is its own ancestor, the scheme above is deadlock-free.
|
42 |
|
|
Proof:
|
43 |
|
|
|
44 |
|
|
First of all, at any moment we have a partial ordering of the
|
45 |
|
|
objects - A < B iff A is an ancestor of B.
|
46 |
|
|
|
47 |
|
|
That ordering can change. However, the following is true:
|
48 |
|
|
|
49 |
|
|
(1) if object removal or non-cross-directory rename holds lock on A and
|
50 |
|
|
attempts to acquire lock on B, A will remain the parent of B until we
|
51 |
|
|
acquire the lock on B. (Proof: only cross-directory rename can change
|
52 |
|
|
the parent of object and it would have to lock the parent).
|
53 |
|
|
|
54 |
|
|
(2) if cross-directory rename holds the lock on filesystem, order will not
|
55 |
|
|
change until rename acquires all locks. (Proof: other cross-directory
|
56 |
|
|
renames will be blocked on filesystem lock and we don't start changing
|
57 |
|
|
the order until we had acquired all locks).
|
58 |
|
|
|
59 |
|
|
(3) any operation holds at most one lock on non-directory object and
|
60 |
|
|
that lock is acquired after all other locks. (Proof: see descriptions
|
61 |
|
|
of operations).
|
62 |
|
|
|
63 |
|
|
Now consider the minimal deadlock. Each process is blocked on
|
64 |
|
|
attempt to acquire some lock and already holds at least one lock. Let's
|
65 |
|
|
consider the set of contended locks. First of all, filesystem lock is
|
66 |
|
|
not contended, since any process blocked on it is not holding any locks.
|
67 |
|
|
Thus all processes are blocked on ->i_mutex.
|
68 |
|
|
|
69 |
|
|
Non-directory objects are not contended due to (3). Thus link
|
70 |
|
|
creation can't be a part of deadlock - it can't be blocked on source
|
71 |
|
|
and it means that it doesn't hold any locks.
|
72 |
|
|
|
73 |
|
|
Any contended object is either held by cross-directory rename or
|
74 |
|
|
has a child that is also contended. Indeed, suppose that it is held by
|
75 |
|
|
operation other than cross-directory rename. Then the lock this operation
|
76 |
|
|
is blocked on belongs to child of that object due to (1).
|
77 |
|
|
|
78 |
|
|
It means that one of the operations is cross-directory rename.
|
79 |
|
|
Otherwise the set of contended objects would be infinite - each of them
|
80 |
|
|
would have a contended child and we had assumed that no object is its
|
81 |
|
|
own descendent. Moreover, there is exactly one cross-directory rename
|
82 |
|
|
(see above).
|
83 |
|
|
|
84 |
|
|
Consider the object blocking the cross-directory rename. One
|
85 |
|
|
of its descendents is locked by cross-directory rename (otherwise we
|
86 |
|
|
would again have an infinite set of contended objects). But that
|
87 |
|
|
means that cross-directory rename is taking locks out of order. Due
|
88 |
|
|
to (2) the order hadn't changed since we had acquired filesystem lock.
|
89 |
|
|
But locking rules for cross-directory rename guarantee that we do not
|
90 |
|
|
try to acquire lock on descendent before the lock on ancestor.
|
91 |
|
|
Contradiction. I.e. deadlock is impossible. Q.E.D.
|
92 |
|
|
|
93 |
|
|
|
94 |
|
|
These operations are guaranteed to avoid loop creation. Indeed,
|
95 |
|
|
the only operation that could introduce loops is cross-directory rename.
|
96 |
|
|
Since the only new (parent, child) pair added by rename() is (new parent,
|
97 |
|
|
source), such loop would have to contain these objects and the rest of it
|
98 |
|
|
would have to exist before rename(). I.e. at the moment of loop creation
|
99 |
|
|
rename() responsible for that would be holding filesystem lock and new parent
|
100 |
|
|
would have to be equal to or a descendent of source. But that means that
|
101 |
|
|
new parent had been equal to or a descendent of source since the moment when
|
102 |
|
|
we had acquired filesystem lock and rename() would fail with -ELOOP in that
|
103 |
|
|
case.
|
104 |
|
|
|
105 |
|
|
While this locking scheme works for arbitrary DAGs, it relies on
|
106 |
|
|
ability to check that directory is a descendent of another object. Current
|
107 |
|
|
implementation assumes that directory graph is a tree. This assumption is
|
108 |
|
|
also preserved by all operations (cross-directory rename on a tree that would
|
109 |
|
|
not introduce a cycle will leave it a tree and link() fails for directories).
|
110 |
|
|
|
111 |
|
|
Notice that "directory" in the above == "anything that might have
|
112 |
|
|
children", so if we are going to introduce hybrid objects we will need
|
113 |
|
|
either to make sure that link(2) doesn't work for them or to make changes
|
114 |
|
|
in is_subdir() that would make it work even in presence of such beasts.
|