Started Oct 1999 by Kanoj Sarcar

The intent of this file is to have an up-to-date, running commentary
from different people about how locking and synchronization are done
in the Linux vm code.

page_table_lock & mmap_sem
--------------------------------------

Page stealers pick processes out of the process pool and scan for
the best process to steal pages from. To guarantee the existence
of the victim mm, an mm_count increment and a matching mmdrop are done
in swap_out(). Page stealers hold kernel_lock to protect against a
bunch of races. The vma list of the victim mm is also scanned by the
stealer, and the page_table_lock is used to preserve list sanity
against the process adding to or deleting from the list. This also
guarantees existence of the vma. Vma existence is not guaranteed once
try_to_swap_out() drops the page_table_lock. To guarantee the existence
of the underlying file structure, a get_file is done before the
swapout() method is invoked. The page passed into swapout() is
guaranteed not to be reused for a different purpose, because the page
reference count due to its presence in the user's pte is not released
until after swapout() returns.

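A rough sketch of the existence guarantees above (illustrative caller
code, not the actual swap_out() body; atomic_inc() on mm_count,
mmdrop() and get_file() are the interfaces named above, the scan logic
is elided):

	atomic_inc(&mm->mm_count);		/* pin the victim mm */
	spin_lock(&mm->page_table_lock);	/* keep the vma list sane */
	/* ... scan vmas; get_file(vma->vm_file) before calling the
	 * swapout() method, so the file cannot vanish under us ... */
	spin_unlock(&mm->page_table_lock);
	mmdrop(mm);				/* release the mm reference */
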
Any code that modifies the vmlist, or the vm_start/vm_end/
vm_flags:VM_LOCKED/vm_next of any vma *in the list* must prevent
kswapd from looking at the chain.

The rules are:
1. To scan the vmlist (look but don't touch) you must hold the
   mmap_sem with read bias, i.e. down_read(&mm->mmap_sem).
2. To modify the vmlist you need to hold the mmap_sem with
   read&write bias, i.e. down_write(&mm->mmap_sem) *AND*
   you need to take the page_table_lock. (Rules 1 and 2 are
   sketched just after this list.)
3. The swapper takes _just_ the page_table_lock; this is done
   because the mmap_sem can be an extremely long lived lock
   and the swapper just cannot sleep on that.
4. The exception to this rule is expand_stack, which just
   takes the read lock and the page_table_lock; this is ok
   because it doesn't really modify fields anybody relies on.
5. You must be able to guarantee that while holding mmap_sem
   or page_table_lock of mm A, you will not try to get either lock
   for mm B.

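A minimal sketch of rules 1 and 2 (illustrative caller code, not a
quote from the tree):

	/* Rule 1: scan the vmlist (look but don't touch). */
	down_read(&mm->mmap_sem);
	for (vma = mm->mmap; vma; vma = vma->vm_next)
		;	/* look only */
	up_read(&mm->mmap_sem);

	/* Rule 2: modify the vmlist. */
	down_write(&mm->mmap_sem);
	spin_lock(&mm->page_table_lock);	/* keeps the swapper out (rule 3) */
	/* ... insert/remove vmas, change vm_start/vm_end ... */
	spin_unlock(&mm->page_table_lock);
	up_write(&mm->mmap_sem);
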
The caveats are:
1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
The update of mmap_cache is racy (the page stealer can race with other
code that invokes find_vma with the mmap_sem held), but that is okay,
since it is a hint. This can be fixed, if desired, by having find_vma
grab the page_table_lock.

Code that adds/deletes elements from the vmlist chain:
1. callers of insert_vm_struct
2. callers of merge_segments
3. callers of avl_remove

Code that changes vm_start/vm_end/vm_flags:VM_LOCKED of vmas on
the list:
1. expand_stack
2. mprotect
3. mlock
4. mremap

It is advisable that changes to vm_start/vm_end be protected, although
in some cases it is not really needed. E.g., vm_start is modified by
expand_stack(); it is hard to come up with a destructive scenario
without the vmlist protection in this case.

The page_table_lock nests with the inode i_shared_lock and the kmem cache
c_spinlock spinlocks. This is okay, since code that holds i_shared_lock
never asks for memory, and the kmem code asks for pages after dropping
c_spinlock. The page_table_lock also nests with the pagecache_lock and
pagemap_lru_lock spinlocks, and no code asks for memory with these locks
held.

The page_table_lock is grabbed while holding the kernel_lock spinning
monitor.

The page_table_lock is a spin lock.

swap_list_lock/swap_device_lock
-------------------------------
The swap devices are chained in priority order from the "swap_list" header.
The "swap_list" is used for the round-robin swaphandle allocation strategy.
The number of free swaphandles is maintained in "nr_swap_pages". These two
together are protected by the swap_list_lock.

The swap_device_lock, which is per swap device, protects the reference
counts on the corresponding swaphandles, maintained in the "swap_map"
array, and the "highest_bit" and "lowest_bit" fields.

Both of these are spinlocks, and are never acquired from interrupt level.
The locking hierarchy is swap_list_lock -> swap_device_lock.

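For example, a caller honoring the hierarchy (a sketch assuming the
swap_list_lock()/swap_device_lock() macro interface of that era, with p
a struct swap_info_struct pointer):

	swap_list_lock();			/* list lock always first */
	swap_device_lock(p);
	/* ... touch p->swap_map, p->highest_bit, p->lowest_bit ... */
	swap_device_unlock(p);
	swap_list_unlock();
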
To prevent races between swap space deletion or async readahead swapins
deciding whether a swap handle is in use (i.e. worthy of being read in
from disk) and an unmap -> swap_free making the handle unused, the swap
delete and readahead code grabs a temporary reference on the swaphandle,
which prevents warning messages from swap_duplicate <- read_swap_cache_async.

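The temporary reference pattern, roughly (an illustration of the shape,
not the literal code; swap_duplicate() and swap_free() are the
interfaces named above):

	if (swap_duplicate(entry)) {	/* temp ref: handle stays in use */
		/* ... safe to read_swap_cache_async() the entry ... */
		swap_free(entry);	/* drop the temp reference */
	}
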
Swap cache locking
------------------
Pages are added into the swap cache with kernel_lock held, to make sure
that multiple pages are not being added (and hence lost) by associating
all of them with the same swaphandle.

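Schematically (kernel_lock is the big kernel lock; this shape is an
illustration, not the literal add path):

	lock_kernel();
	add_to_swap_cache(page, entry);	/* no second page can slip in
					 * for the same swaphandle */
	unlock_kernel();
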
Pages are guaranteed not to be removed from the scache if the page is
"shared": i.e., other processes hold a reference on the page or the
associated swap handle. The only code that does not follow this rule is
shrink_mmap, which deletes pages from the swap cache if no process has
a reference on the page (multiple processes might have references on
the corresponding swap handle though). lookup_swap_cache() races with
shrink_mmap when establishing a reference on a scache page, so it must
check whether the page it located is still in the swapcache, or whether
shrink_mmap deleted it. (This race exists because shrink_mmap looks at
the page ref count with pagecache_lock held, but then drops
pagecache_lock before deleting the page from the scache.)

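Conceptually the check amounts to something like this (a sketch of the
pattern, not the actual lookup_swap_cache() body; PageSwapCache() and
page_cache_release() are the era's helpers):

	page = lookup_swap_cache(entry);
	if (page && !PageSwapCache(page)) {
		/* shrink_mmap deleted it between lookup and here */
		page_cache_release(page);
		page = NULL;
	}
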
do_wp_page and do_swap_page have MP races in them while trying to figure
out whether a page is "shared", by looking at the page_count + swap_count.
To preserve the sum of the counts, the page lock _must_ be acquired before
calling is_page_shared (else processes might switch their swap_count refs
to page count refs after the page count ref has been snapshotted).

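That ordering, schematically (assuming the era's lock_page()/UnlockPage()
helpers around the is_page_shared call named above):

	lock_page(page);	/* freeze the page_count + swap_count sum */
	shared = is_page_shared(page);
	UnlockPage(page);
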
Swap device deletion code currently breaks all the scache assumptions,
since it grabs neither mmap_sem nor page_table_lock.