What is Linux Memory Policy?

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system. Linux has
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
The current memory policy support was added to Linux 2.6 around May 2004. This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets (Documentation/cpusets.txt),
which are an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes. Memory policies are a
programming interface that a NUMA-aware application can take advantage of. When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority. See "MEMORY POLICIES AND CPUSETS" below for more details.

MEMORY POLICY CONCEPTS

Scope of Memory Policies

The Linux kernel supports _scopes_ of memory policy, described here from
most general to most specific:

    System Default Policy: this policy is "hard coded" into the kernel. It
    is the policy that governs all page allocations that aren't controlled
    by one of the more specific policy scopes discussed below. When the
    system is "up and running", the system default policy will use "local
    allocation" described below. However, during boot up, the system
    default policy will be set to interleave allocations across all nodes
    with "sufficient" memory, so as not to overload the initial boot node
    with boot-time allocations.

    Task/Process Policy: this is an optional, per-task policy. When defined
    for a specific task, this policy controls all page allocations made by or
    on behalf of the task that aren't controlled by a more specific scope.
    If a task does not define a task policy, then all page allocations that
    would have been controlled by the task policy "fall back" to the System
    Default Policy.

        The task policy applies to the entire address space of a task. Thus,
        it is inheritable, and indeed is inherited, across both fork()
        [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task
        to establish the task policy for a child task exec()'d from an
        executable image that has no awareness of memory policy. See the
        MEMORY POLICY APIs section, below, for an overview of the system call
        that a task may use to set/change its task/process policy.

        In a multi-threaded task, task policies apply only to the thread
        [Linux kernel task] that installs the policy and any threads
        subsequently created by that thread. Any sibling threads existing
        at the time a new task policy is installed retain their current
        policy.

        A task policy applies only to pages allocated after the policy is
        installed. Any pages already faulted in by the task when the task
        changes its task policy remain where they were allocated based on
        the policy at the time they were allocated.

    VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's
    virtual address space. A task may define a specific policy for a range
    of its virtual address space. See the MEMORY POLICY APIs section,
    below, for an overview of the mbind() system call used to set a VMA
    policy.

        A VMA policy will govern the allocation of pages that back this region of
        the address space. Any regions of the task's address space that don't
        have an explicit VMA policy will fall back to the task policy, which may
        itself fall back to the System Default Policy.

        VMA policies have a few complicating details:

            VMA policy applies ONLY to anonymous pages. These include pages
            allocated for anonymous segments, such as the task stack and heap, and
            any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
            If a VMA policy is applied to a file mapping, it will be ignored if
            the mapping used the MAP_SHARED flag. If the file mapping used the
            MAP_PRIVATE flag, the VMA policy will only be applied when an
            anonymous page is allocated on an attempt to write to the mapping--
            i.e., at Copy-On-Write.

            VMA policies are shared between all tasks that share a virtual address
            space--a.k.a. threads--independent of when the policy is installed; and
            they are inherited across fork(). However, because VMA policies refer
            to a specific region of a task's address space, and because the address
            space is discarded and recreated on exec*(), VMA policies are NOT
            inheritable across exec(). Thus, only NUMA-aware applications may
            use VMA policies.

            A task may install a new VMA policy on a sub-range of a previously
            mmap()ed region. When this happens, Linux splits the existing virtual
            memory area into 2 or 3 VMAs, each with its own policy.

            By default, VMA policy applies only to pages allocated after the policy
            is installed. Any pages already faulted into the VMA range remain
            where they were allocated based on the policy at the time they were
            allocated. However, since 2.6.16, Linux supports page migration via
            the mbind() system call, so that page contents can be moved to match
            a newly installed policy.

    Shared Policy: Conceptually, shared policies apply to "memory objects"
    mapped shared into one or more tasks' distinct address spaces. An
    application installs shared policies the same way as VMA policies--using
    the mbind() system call, specifying a range of virtual addresses that map
    the shared object. However, unlike VMA policies, which can be considered
    to be an attribute of a range of a task's address space, shared policies
    apply directly to the shared object. Thus, all tasks that attach to the
    object share the policy, and all pages allocated for the shared object,
    by any task, will obey the shared policy.

        As of 2.6.22, only shared memory segments, created by shmget() or
        mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
        policy support was added to Linux, the associated data structures were
        added to hugetlbfs shmem segments. At the time, hugetlbfs did not
        support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
        shmem segments were never "hooked up" to the shared policy support.
        Although hugetlbfs segments now support lazy allocation, their support
        for shared policy has not been completed.

        As mentioned above [re: VMA policies], allocations of page cache
        pages for regular files mmap()ed with MAP_SHARED ignore any VMA
        policy installed on the virtual address range backed by the shared
        file mapping. Rather, shared page cache pages, including pages backing
        private mappings that have not yet been written by the task, follow
        task policy, if any, else System Default Policy.

        The shared policy infrastructure supports different policies on subset
        ranges of the shared object. However, Linux still splits the VMA of
        the task that installs the policy for each range of distinct policy.
        Thus, different tasks that attach to a shared memory segment can have
        different VMA configurations mapping that one shared object. This
        can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
        a shared memory region, when one task has installed shared policy on
        one or more ranges of the region.

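The fallback order among the scopes described above can be sketched as a small
lookup. This is a toy model, not kernel code: 'struct policy', 'enum mode' and
effective_policy() are hypothetical names, and a NULL pointer stands for "no
policy installed at this scope".

```c
#include <assert.h>
#include <stddef.h>

enum mode { MODE_DEFAULT, MODE_PREFERRED, MODE_BIND, MODE_INTERLEAVE };

struct policy {
    enum mode mode;
    unsigned long nodes;   /* bit i set: node i is in the optional node set */
};

/*
 * Choose the policy that governs an allocation.  NULL means no policy is
 * installed at that scope; 'range_pol' stands for the VMA policy (or the
 * shared policy of a shared object) covering the faulting address.
 */
const struct policy *effective_policy(const struct policy *range_pol,
                                      const struct policy *task_pol,
                                      const struct policy *system_default)
{
    assert(system_default != NULL);   /* the system default always exists */
    if (range_pol != NULL)
        return range_pol;             /* most specific scope wins */
    if (task_pol != NULL)
        return task_pol;
    return system_default;            /* most general scope */
}
```

The sketch leaves out the cases listed above in which a VMA policy is not
consulted at all, such as page cache pages of MAP_SHARED file mappings; for
those, a caller of this model would pass NULL as 'range_pol'.
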
Components of Memory Policies

    A Linux memory policy is a tuple consisting of a "mode" and an optional set
    of nodes. The mode determines the behavior of the policy, while the
    optional set of nodes can be viewed as the arguments to the behavior.

    Internally, memory policies are implemented by a reference counted
    structure, struct mempolicy. Details of this structure will be discussed
    in context, below, as required to explain the behavior.

        Note: in some functions AND in the struct mempolicy itself, the mode
        is called "policy". However, to avoid confusion with the policy tuple,
        this document will continue to use the term "mode".

    Linux memory policy supports the following 4 behavioral modes:

        Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
        context or scope dependent.

            As mentioned in the Policy Scope section above, during normal
            system operation, the System Default Policy is hard coded to
            contain the Default mode.

            In this context, default mode means "local" allocation--that is,
            attempt to allocate the page from the node associated with the cpu
            where the fault occurs. If the "local" node has no memory, or the
            node's memory is exhausted [no free pages available], local
            allocation will "fall back to"--attempt to allocate pages from--
            "nearby" nodes, in order of increasing "distance".

                Implementation detail -- subject to change: "Fallback" uses
                a per node list of sibling nodes--called zonelists--built at
                boot time, or when nodes or memory are added or removed from
                the system [memory hotplug]. These per node zonelists are
                constructed with nodes in order of increasing distance based
                on information provided by the platform firmware.

            When a task/process policy or a shared policy contains the Default
            mode, this also means "local allocation", as described above.

            In the context of a VMA, Default mode means "fall back to task
            policy"--which may or may not specify Default mode. Thus, Default
            mode can not be counted on to mean local allocation when used
            on a non-shared region of the address space. However, see
            MPOL_PREFERRED below.

            The Default mode does not use the optional set of nodes.

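The distance-ordered fallback used by local allocation can be illustrated with
a short sketch. It is only a model of the idea: the kernel's zonelists are more
involved, and build_fallback_order(), MAX_NODES and the distance table passed
in are hypothetical names and data chosen for the example.

```c
#include <assert.h>

#define MAX_NODES 4

/*
 * Fill order[] with all node ids sorted by increasing distance from
 * 'from'; ties keep the lower node id first.  dist[a][b] is a
 * hypothetical firmware-style distance table.
 */
void build_fallback_order(const int dist[MAX_NODES][MAX_NODES],
                          int from, int order[MAX_NODES])
{
    assert(from >= 0 && from < MAX_NODES);
    for (int i = 0; i < MAX_NODES; i++)
        order[i] = i;
    /* Insertion sort: stable, and plenty for a handful of nodes. */
    for (int i = 1; i < MAX_NODES; i++) {
        int node = order[i];
        int j = i - 1;
        while (j >= 0 && dist[from][order[j]] > dist[from][node]) {
            order[j + 1] = order[j];
            j--;
        }
        order[j + 1] = node;
    }
}
```

With a table in which node 0 is at distance 10 from itself, 20 from nodes 1
and 3, and 30 from node 2, the order built for node 0 is 0, 1, 3, 2: the local
node first, then the two equally near neighbours, then the farthest node.
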
        MPOL_BIND: This mode specifies that memory must come from the
        set of nodes specified by the policy.

            The memory policy APIs do not specify an order in which the nodes
            will be searched. However, unlike "local allocation", the Bind
            policy does not consider the distance between the nodes. Rather,
            allocations will fall back to the nodes specified by the policy in
            order of numeric node id. Like everything in Linux, this is subject
            to change.

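As a rough illustration of the Bind behavior just described, the following
model picks a node strictly from the policy's node set, in order of numeric
node id. bind_pick_node() and the 'has_free' mask (nodes that currently have
free pages) are hypothetical; the kernel does not expose such a function.

```c
#include <assert.h>

/*
 * Return the lowest-numbered node that is both in 'bind_mask' and in
 * 'has_free' (nodes with free pages), or -1 if no such node exists.
 * Bit i of each mask stands for node i.
 */
int bind_pick_node(unsigned long bind_mask, unsigned long has_free)
{
    unsigned long candidates = bind_mask & has_free;

    assert(bind_mask != 0);   /* Bind needs a non-empty node set */
    for (int node = 0; node < (int)(8 * sizeof(unsigned long)); node++)
        if (candidates & (1UL << node))
            return node;
    return -1;   /* never falls back to a node outside the set */
}
```

Unlike the local allocation fallback, distance plays no part here: a policy
binding to nodes 1 and 3 tries node 1 first regardless of where the faulting
cpu is, and fails rather than use node 0 or 2.
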
        MPOL_PREFERRED: This mode specifies that the allocation should be
        attempted from the single node specified in the policy. If that
        allocation fails, the kernel will search other nodes, exactly as
        it would for a local allocation that started at the preferred node,
        in order of increasing distance from the preferred node. "Local"
        allocation policy can be viewed as a Preferred policy that starts at
        the node containing the cpu where the allocation takes place.

            Internally, the Preferred policy uses a single node--the
            preferred_node member of struct mempolicy. A "distinguished"
            value of this preferred_node, currently '-1', is interpreted
            as "the node containing the cpu where the allocation takes
            place"--local allocation. This is the way to specify
            local allocation for a specific range of addresses--i.e. for
            VMA policies.

        MPOL_INTERLEAVE: This mode specifies that page allocations be
        interleaved, on a page granularity, across the nodes specified in
        the policy. This mode also behaves slightly differently, based on
        the context where it is used:

            For allocation of anonymous pages and shared memory pages,
            Interleave mode indexes the set of nodes specified by the policy
            using the page offset of the faulting address into the segment
            [VMA] containing the address modulo the number of nodes specified
            by the policy. It then attempts to allocate a page, starting at
            the selected node, as if the node had been specified by a Preferred
            policy or had been selected by a local allocation. That is,
            allocation will follow the per node zonelist.

            For allocation of page cache pages, Interleave mode indexes the set
            of nodes specified by the policy using a node counter maintained
            per task. This counter wraps around to the lowest specified node
            after it reaches the highest specified node. This will tend to
            spread the pages out over the nodes specified by the policy based
            on the order in which they are allocated, rather than based on any
            page offset into an address range or file. During system boot up,
            the temporary interleaved system default policy works in this
            mode.

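The two ways Interleave mode selects a starting node can be modelled in a few
lines. This is a sketch under stated assumptions, not the kernel's code: the
node set is a single unsigned long bit mask, pages are assumed to be 4 KiB,
and interleave_by_offset() and interleave_by_counter() are hypothetical names.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* assumption: 4 KiB pages */
#define MASK_BITS ((int)(8 * sizeof(unsigned long)))

/* Number of nodes in 'mask' (bit i set means node i is in the policy). */
static unsigned int node_count(unsigned long mask)
{
    unsigned int n = 0;
    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/* Id of the n-th (0-based) node in 'mask', or -1 if there is none. */
static int nth_node(unsigned long mask, unsigned int n)
{
    for (int node = 0; node < MASK_BITS; node++) {
        if (mask & (1UL << node)) {
            if (n == 0)
                return node;
            n--;
        }
    }
    return -1;
}

/* Anonymous and shared memory: index by page offset into the VMA. */
int interleave_by_offset(unsigned long mask, unsigned long vma_start,
                         unsigned long fault_addr)
{
    assert(mask != 0 && fault_addr >= vma_start);
    unsigned long page_off = (fault_addr - vma_start) >> PAGE_SHIFT;
    return nth_node(mask, (unsigned int)(page_off % node_count(mask)));
}

/* Page cache: step a per-task counter, wrapping highest -> lowest node. */
int interleave_by_counter(unsigned long mask, int prev_node)
{
    assert(mask != 0 && prev_node >= -1 && prev_node < MASK_BITS);
    for (int step = 1; step <= MASK_BITS; step++) {
        int node = (prev_node + step) % MASK_BITS;
        if (mask & (1UL << node))
            return node;
    }
    return -1;   /* not reached for a non-empty mask */
}
```

For a policy over nodes 0, 2 and 3, the offset variant maps pages 0, 1, 2, 3
of a VMA to nodes 0, 2, 3, 0, so the placement of a given page is repeatable.
The counter variant depends only on the node returned by the previous call
(pass -1 for the first one), so placement follows allocation order instead.
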
MEMORY POLICY APIs

Linux supports 3 system calls for controlling memory policy. These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

    Note: the headers that define these APIs and the parameter data types
    for user space applications reside in a package that is not part of
    the Linux kernel. The kernel system call interfaces, with the 'sys_'
    prefix, are defined in <linux/syscalls.h>; the mode and flag
    definitions are defined in <linux/mempolicy.h>.

Set [Task] Memory Policy:

    long set_mempolicy(int mode, const unsigned long *nmask,
                       unsigned long maxnode);

    Sets the calling task's "task/process memory policy" to the mode
    specified by the 'mode' argument and the set of nodes defined
    by 'nmask'. 'nmask' points to a bit mask of node ids containing
    at least 'maxnode' ids.

    See the set_mempolicy(2) man page for more details.


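As an illustration of the 'nmask' and 'maxnode' arguments, the sketch below
builds a node bit mask and asks for interleaving across two nodes. It calls
the raw system call through syscall(2) so that it does not depend on the
user space wrapper library. The helper names, MASK_LONGS and the POL_*
constants are local to this example; the POL_* values mirror the kernel's
MPOL_* mode numbering.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))
#define MASK_LONGS 16   /* arbitrary capacity chosen for this sketch */

/* Values of the kernel's mode constants, redefined under local names. */
enum { POL_DEFAULT = 0, POL_PREFERRED = 1, POL_BIND = 2, POL_INTERLEAVE = 3 };

/* Set the bit for 'node' in a nodemask laid out as an array of longs. */
void nodemask_set(unsigned long *mask, unsigned int node)
{
    assert(node < MASK_LONGS * BITS_PER_LONG);
    mask[node / BITS_PER_LONG] |= 1UL << (node % BITS_PER_LONG);
}

/*
 * Interleave the calling task's future allocations across nodes a and b.
 * Returns 0 on success, -1 with errno set on failure.
 */
long interleave_two_nodes(unsigned int a, unsigned int b)
{
    unsigned long mask[MASK_LONGS];

    memset(mask, 0, sizeof(mask));
    nodemask_set(mask, a);
    nodemask_set(mask, b);
#ifdef SYS_set_mempolicy
    return syscall(SYS_set_mempolicy, (long)POL_INTERLEAVE, mask,
                   (unsigned long)(MASK_LONGS * BITS_PER_LONG));
#else
    errno = ENOSYS;   /* built on a system without this call */
    return -1;
#endif
}
```

A call such as interleave_two_nodes(0, 1) can fail, for example with EINVAL
when one of the nodes does not exist or is not allowed, so the return value
should be checked. Only pages allocated after a successful call are affected.
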
Get [Task] Memory Policy or Related Information

    long get_mempolicy(int *mode,
                       unsigned long *nmask, unsigned long maxnode,
                       void *addr, int flags);

    Queries the "task/process memory policy" of the calling task, or
    the policy or location of a specified virtual address, depending
    on the 'flags' argument.

    See the get_mempolicy(2) man page for more details.


Install VMA/Shared Policy for a Range of Task's Address Space

    long mbind(void *start, unsigned long len, int mode,
               const unsigned long *nmask, unsigned long maxnode,
               unsigned flags);

    mbind() installs the policy specified by (mode, nmask, maxnode) as
    a VMA policy for the range of the calling task's address space
    specified by the 'start' and 'len' arguments. Additional actions
    may be requested via the 'flags' argument.

    See the mbind(2) man page for more details.

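A minimal sketch of installing a VMA policy on part of a mapping follows. It
maps anonymous memory and then requests local allocation for a single page by
using Preferred mode with an empty node set. The raw system call is invoked
through syscall(2); map_anon_pages(), prefer_local_on_page() and POL_PREFERRED
are names local to this example, with POL_PREFERRED mirroring the kernel's
value for MPOL_PREFERRED.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

#define POL_PREFERRED 1   /* value of the kernel's MPOL_PREFERRED */

/* Map 'pages' anonymous private pages; returns NULL on failure. */
void *map_anon_pages(size_t pages)
{
    size_t len = pages * (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/*
 * Ask for local allocation on one page inside a mapping: Preferred mode
 * with an empty node set.  Returns 0 on success, -1 with errno set.
 */
long prefer_local_on_page(void *base, size_t page_index)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *start = (char *)base + page_index * page;

    assert(base != NULL);
#ifdef SYS_mbind
    return syscall(SYS_mbind, start, (unsigned long)page,
                   (long)POL_PREFERRED, (void *)NULL, 0UL, 0UL);
#else
    (void)start;
    errno = ENOSYS;   /* built on a system without this call */
    return -1;
#endif
}
```

Calling prefer_local_on_page(p, 1) on a three page mapping targets only the
middle page, which is the sub-range case in which Linux splits the original
VMA. With 'flags' set to 0, pages that were already faulted in stay where
they are.
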
MEMORY POLICY COMMAND LINE INTERFACE

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers. Some distributions
package the headers and compile-time libraries in a separate development
package.


MEMORY POLICIES AND CPUSETS

Memory policies work within cpusets as described above. For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints. If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset, or
the intersection of the set of nodes specified for the policy and the set of
nodes with memory is the empty set, the policy is considered invalid
and cannot be installed.

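The validity rule in the paragraph above amounts to two set tests, sketched
here with one bit per node. policy_nodes_valid() is a hypothetical helper that
restates the rule; it is not a kernel or library function.

```c
#include <assert.h>
#include <stdbool.h>

/* In each mask, bit i set means node i belongs to the set. */
bool policy_nodes_valid(unsigned long policy_nodes,
                        unsigned long cpuset_allowed,
                        unsigned long nodes_with_memory)
{
    assert(nodes_with_memory != 0);   /* some node must have memory */
    if (policy_nodes & ~cpuset_allowed)
        return false;   /* names a node the cpuset does not allow */
    if ((policy_nodes & nodes_with_memory) == 0)
        return false;   /* none of the requested nodes has memory */
    return true;
}
```

For example, with a cpuset allowing nodes 0 to 2, a policy over nodes 0 and 1
passes, while a policy over nodes 0 and 3 is rejected because node 3 lies
outside the cpuset.
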
The interaction of memory policies and cpusets can be problematic for a
couple of reasons:

1) the memory policy APIs take physical node ids as arguments. As mentioned
   above, it is illegal to specify nodes that are not allowed in the cpuset.
   The application must query the allowed nodes using the get_mempolicy()
   API with the MPOL_F_MEMS_ALLOWED flag to determine the allowed nodes and
   restrict itself to those nodes. However, the resources available to a
   cpuset can be changed by the system administrator, or a workload manager
   application, at any time. So, a task may still get errors attempting to
   specify policy nodes, and must query the allowed memories again.

2) when tasks in two cpusets share access to a memory region, such as shared
   memory segments created by shmget() or mmap() with the MAP_ANONYMOUS and
   MAP_SHARED flags, and any of the tasks installs shared policy on the region,
   only nodes whose memories are allowed in both cpusets may be used in the
   policies. Obtaining this information requires "stepping outside" the
   memory policy APIs to use the cpuset information and requires that one
   know in what cpusets other tasks might be attaching to the shared region.
   Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
   allocation is the only valid policy.

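The query-and-restrict step from item 1 might look like the sketch below. It
uses the raw system call through syscall(2); query_allowed_nodes(),
restrict_to_allowed(), MASK_LONGS and F_MEMS_ALLOWED are names local to this
example, with F_MEMS_ALLOWED mirroring the kernel's value for
MPOL_F_MEMS_ALLOWED.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define MASK_LONGS 16               /* arbitrary capacity for this sketch */
#define F_MEMS_ALLOWED (1UL << 2)   /* value of MPOL_F_MEMS_ALLOWED */

/* Fetch the nodes the calling task may allocate from; 0 on success. */
long query_allowed_nodes(unsigned long allowed[MASK_LONGS])
{
    memset(allowed, 0, MASK_LONGS * sizeof(unsigned long));
#ifdef SYS_get_mempolicy
    return syscall(SYS_get_mempolicy, (void *)NULL, allowed,
                   (unsigned long)(MASK_LONGS * 8 * sizeof(unsigned long)),
                   (void *)NULL, F_MEMS_ALLOWED);
#else
    errno = ENOSYS;   /* built on a system without this call */
    return -1;
#endif
}

/* Drop disallowed nodes from 'wanted'; returns how many nodes remain. */
unsigned int restrict_to_allowed(unsigned long wanted[MASK_LONGS],
                                 const unsigned long allowed[MASK_LONGS])
{
    unsigned int remaining = 0;

    assert(wanted != NULL && allowed != NULL);
    for (int i = 0; i < MASK_LONGS; i++) {
        wanted[i] &= allowed[i];
        for (unsigned long w = wanted[i]; w; w &= w - 1)
            remaining++;
    }
    return remaining;
}
```

Because the cpuset can change at any time, a result of
query_allowed_nodes() is only a snapshot. A task that later gets an error
from set_mempolicy() or mbind() would repeat the query and the restriction
before retrying, and would treat a remaining count of zero as "no usable
nodes".
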