Started Jan 2000 by Kanoj Sarcar

Memory balancing is needed for non __GFP_WAIT as well as for non
__GFP_IO allocations.

There are two reasons to be requesting non __GFP_WAIT allocations:
the caller cannot sleep (typically interrupt context), or does not want
to incur the cost of page stealing and possible swap I/O, for
whatever reason.

Non __GFP_IO allocation requests are made to prevent file system deadlocks.

In the absence of non sleepable allocation requests, it seems detrimental
to be doing balancing. Page reclamation can be kicked off lazily, that
is, only when needed (when zone free memory is 0), instead of making it
a proactive process.

That being said, the kernel should try to fulfill requests for direct
mapped pages from the direct mapped pool, instead of falling back on
the dma pool, so as to keep the dma pool filled for dma requests (atomic
or not). A similar argument applies to highmem and direct mapped pages.
OTOH, if there are a lot of free dma pages, it is preferable to satisfy
regular memory requests by allocating one from the dma pool, instead
of incurring the overhead of regular zone balancing.

In 2.2, memory balancing/page reclamation would kick off only when the
_total_ number of free pages fell below 1/64th of total memory. With the
right ratio of dma and regular memory, it is quite possible that balancing
would not be done even when the dma zone was completely empty. 2.2 has
been running production machines of varying memory sizes, and seems to be
doing fine even with the presence of this problem. In 2.3, due to
HIGHMEM, this problem is aggravated.
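
The 2.2 problem can be illustrated with a small sketch. This is not the
2.2 code: struct zone_counts and global_needs_balancing() are hypothetical
names made up for illustration, and only the 1/64 ratio comes from the
text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone counters, not the kernel's real structures. */
struct zone_counts {
	unsigned long free_pages;
	unsigned long total_pages;
};

/*
 * 2.2-style check as described in the text: balance only when the
 * _total_ number of free pages falls below 1/64th of total memory.
 * Individual zones are never looked at.
 */
bool global_needs_balancing(const struct zone_counts *zones, int nr_zones)
{
	unsigned long free = 0, total = 0;

	assert(nr_zones > 0);
	for (int i = 0; i < nr_zones; i++) {
		free += zones[i].free_pages;
		total += zones[i].total_pages;
	}
	return free < total / 64;
}
```

With a 4096 page dma zone that is completely empty and a 61440 page
regular zone that still has 2000 free pages, the threshold is 1024 pages
and the check reports that no balancing is needed, which is the situation
described above.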

In 2.3, zone balancing can be done in one of two ways: depending on the
zone size (and possibly on the size of lower class zones), we can decide
at init time how many free pages we should aim for while balancing any
zone. The good part is, while balancing, we do not need to look at sizes
of lower class zones; the bad part is, we might do too frequent balancing
due to ignoring possibly lower usage in the lower class zones. Also,
with a slight change in the allocation routine, it is possible to reduce
the memclass() macro to be a simple equality.
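
A minimal sketch of this first approach, under stated assumptions: the
struct and function names are hypothetical, and the 1/64 ratio is borrowed
from the 2.2 rule purely for illustration (the text does not say which
ratio the init-time target would use).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone bookkeeping, for illustration only. */
struct zone_target {
	unsigned long size;		/* pages in the zone */
	unsigned long free_pages;	/* currently free pages */
	unsigned long target_free;	/* fixed once at init time */
};

/* Decide at init time how many free pages to aim for (assumed: size/64). */
void zone_init_target(struct zone_target *z)
{
	z->target_free = z->size / 64;
}

/* At balancing time only the zone itself is consulted. */
bool zone_needs_balancing(const struct zone_target *z)
{
	assert(z->free_pages <= z->size);
	return z->free_pages < z->target_free;
}
```

Under these assumptions, a 61440 page regular zone with 500 free pages is
flagged for balancing (its target is 960) no matter how many dma pages are
free, which is the "too frequent balancing" drawback mentioned above.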

Another possible solution is that we balance only when the free memory
of a zone _and_ all its lower class zones falls below 1/64th of the
total memory in the zone and its lower class zones. This fixes the 2.2
balancing problem, and stays as close to 2.2 behavior as possible. Also,
the balancing algorithm works the same way on the various architectures,
which have different numbers and types of zones. If we wanted to get
fancy, we could assign different weights to free pages in different
zones in the future.
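
The cumulative check can be sketched as follows. The names are
hypothetical and the array layout (lowest class zone first, for example
0 = dma, 1 = regular, 2 = highmem) is an assumption of this sketch, not
something taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone counters, for illustration only. */
struct zone_counts {
	unsigned long free_pages;
	unsigned long total_pages;
};

/*
 * Balance zone 'zone_idx' only when the free memory of that zone and
 * all its lower class zones falls below 1/64th of the total memory in
 * those same zones. zones[] is assumed ordered lowest class first.
 */
bool class_needs_balancing(const struct zone_counts *zones, int zone_idx)
{
	unsigned long free = 0, total = 0;

	assert(zone_idx >= 0);
	for (int i = 0; i <= zone_idx; i++) {
		free += zones[i].free_pages;
		total += zones[i].total_pages;
	}
	return free < total / 64;
}
```

For the dma zone the sum covers only the dma zone itself, so an empty dma
zone is always flagged, while free dma pages still count in favor of the
regular and highmem zones.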

Note that if the size of the regular zone is huge compared to the dma
zone, it becomes less significant to consider the free dma pages while
deciding whether to balance the regular zone. The first solution
becomes more attractive then.

The appended patch implements the second solution. It also "fixes" two
problems: first, kswapd is woken up as in 2.2 on low memory conditions
for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
so as to give a fighting chance for replace_with_highmem() to get a
HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
fall back into the regular zone. This also makes sure that HIGHMEM pages
are not leaked (for example, in situations where a HIGHMEM page is in
the swapcache but is not being used by anyone).

kswapd also needs to know about the zones it should balance. kswapd is
primarily needed in a situation where balancing cannot be done,
probably because all allocation requests are coming from interrupt
context and all process contexts are sleeping. For 2.3, kswapd does not
really need to balance the highmem zone, since interrupt context does not
request highmem pages. kswapd looks at the zone_wake_kswapd field in the
zone structure to decide whether a zone needs balancing.

Page stealing from process memory and shm is done if stealing the page would
alleviate memory pressure on any zone in the page's node that has fallen below
its watermark.

pages_min/pages_low/pages_high/low_on_memory/zone_wake_kswapd: These are
per-zone fields, used to determine when a zone needs to be balanced. When
the number of free pages falls below pages_min, the hysteresis field
low_on_memory gets set. This stays set till the number of free pages
becomes pages_high. When low_on_memory is set, page allocation requests
will try to free some pages in the zone (providing __GFP_WAIT is set in
the request). Orthogonal to this is the decision to poke kswapd to free
some zone pages. That decision is not hysteresis based, and is done when
the number of free pages is below pages_low, in which case
zone_wake_kswapd is also set.
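
The interaction of these fields can be sketched as a small state update.
The field names follow the text, but struct zone_state and both functions
are hypothetical. The text does not say when zone_wake_kswapd is cleared,
so this sketch assumes it is simply recomputed on every update.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-zone fields named in the text. */
struct zone_state {
	unsigned long free_pages;
	unsigned long pages_min, pages_low, pages_high;
	bool low_on_memory;	/* hysteresis flag */
	bool zone_wake_kswapd;	/* plain threshold flag */
};

/* Recompute both flags after free_pages has changed. */
void zone_update_flags(struct zone_state *z)
{
	assert(z->pages_min <= z->pages_low && z->pages_low <= z->pages_high);

	/* Hysteresis: set below pages_min, cleared only at pages_high. */
	if (z->free_pages < z->pages_min)
		z->low_on_memory = true;
	else if (z->free_pages >= z->pages_high)
		z->low_on_memory = false;

	/* No hysteresis (assumption: recomputed each time). */
	z->zone_wake_kswapd = z->free_pages < z->pages_low;
}

/* An allocation tries to free zone pages only if it is allowed to wait. */
bool alloc_should_free_pages(const struct zone_state *z, bool gfp_wait)
{
	return z->low_on_memory && gfp_wait;
}
```

Between pages_min and pages_high the low_on_memory flag keeps whatever
value it had, which is what makes it hysteresis based, while the kswapd
decision depends only on the current free page count.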


(Good) Ideas that I have heard:
1. Dynamic experience should influence balancing: number of failed requests
for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net)
2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
dma pages. (lkd@tantalophile.demon.co.uk)