The paging design used on the x86-64 linux kernel port in 2.4.x provides:

o  per process virtual address space limit of 512 Gigabytes
o  top of userspace stack located at address 0x0000007fffffffff
o  start of the kernel mapping = 0x0000010000000000
o  global RAM per system 508*512GB=254 Terabytes
o  no need of any common code change
o  512GB of vmalloc/ioremap space

Description:

x86-64 has a 4 level page structure, similar to ia32 PSE but with
some extensions. Each level consists of a 4K page with 512 64bit
entries. The levels are named in Linux PML4, PGD, PMD, PTE; AMD calls
them PML4E, PDPE, PDE, PTE respectively. For direct and kernel mapping
only 3 levels are used, with the PMD pointing to 2MB pages.

Userspace sees and can modify only the 3rd/2nd/1st level
pagetables (pgd_offset() implicitly walks the 1st slot of the 4th
level pagetable and returns an entry into the 3rd level pagetable).
This is where the per-process 512 Gigabyte limit comes from.

The common code pgd is the PDPE, the pmd is the PDE, the
pte is the PTE. The PML4 remains invisible to the common
code.

Since the per-process limit is 512 Gigabytes (due to the kernel common
code 3 level pagetable limitation), the highest virtual address mapped
into userspace is 0x7fffffffff and it makes sense to use it
as the top of the userspace stack to allow the stack to grow as
much as possible.

The kernel mapping and the direct memory mapping are split. Direct memory
mapping starts directly after userspace, after a 512GB gap, while the
kernel mapping is at the end of the (negative) virtual address space, to
exploit the kernel code model. There is no support for discontig memory;
this implies that kernel mapping/vmalloc/ioremap/module mappings are not
represented at their "real" addresses in mem_map, but only via their
direct mapped (but normally not used) aliases.

Future:

During 2.5.x we can break the 512 Gigabyte per-process limit,
possibly by removing from the common code any knowledge about the
architecture-dependent physical layout of the virtual to physical
mapping.

Once the 512 Gigabyte limit is removed, the userspace stack will
be moved (most probably to virtual address 0x00007fffffffffff).
Nothing will break in userspace due to that move, just as nothing
breaks on IA32 when compiling the kernel with CONFIG_2G.

Linus agreed not to break common code and to live with the 512 Gigabyte
per-process limitation for the 2.4.x timeframe, and he has given me and
Andi some very useful hints... (thanks! :)

Thanks also to H. Peter Anvin for his interesting and useful suggestions on
the x86-64-discuss lists!

Current PML4 Layout:
Each CPU has a PML4 page that never changes.
Each slot is 512GB of virtual memory.

0   user space pgd (rewritten on context switch, see PML4 below)
1   unmapped
2   __PAGE_OFFSET - start of direct mapping of physical memory
... direct mapping in further slots as needed.
509 some io mappings (others are in a memory hole below 4gb)
510 vmalloc and ioremap space
511 kernel code mapping, fixmaps and modules.

Other memory management related issues follow:

PAGE_SIZE:

If somebody is wondering why we still have such a small 4k page size
these days (16 or 32 kbytes would of course be much better for
performance), it is because PAGE_SIZE has to remain 4k for 32bit apps
to provide a 100% backwards compatible IA32 API (we can't allow silent
fs corruption, or at best a loss of coherency with the page cache,
by allocating MAP_SHARED areas in MAP_ANONYMOUS memory with a
do_mmap_fake). I think it could be possible to have a dynamic page
size between 32bit and 64bit apps, but it would need extremely
intrusive changes in the common code, starting with the page cache,
and we sure don't want to depend on them right now even if the
hardware would support that.

PAGETABLE SIZE:

In turn we can't afford to have pagetables larger than 4k, because
we might not be able to allocate them due to physical memory
fragmentation; failing to allocate the kernel stack is a minor
issue compared to failing the allocation of a pagetable. If we
fail the allocation of a pagetable, the only thing we can do is
sched_yield polling the freelist (deadlock prone) or segfault
the task (not even the sighandler would be sure to run).

KERNEL STACK:

1st stage:

The kernel stack will at first be allocated with an order 2 allocation
(16k) (the utilization of the stack on a 64bit platform really
isn't exactly double that of a 32bit platform, because the local
variables may not all be 64bit wide, but it is not much less). This will
make things even worse than they are right now on IA32 with
respect to failing fork/clone due to memory fragmentation.

2nd stage:

We'll benchmark whether reserving one register as the task_struct
pointer will improve performance of the kernel (instead of
recalculating the task_struct pointer starting from the stack
pointer each time). My guess is that recalculating will be faster,
but it is worth a try.

If reserving one register for the task_struct pointer
turns out to be faster, we can as well split the task_struct and the
kernel stack. task_struct can be a slab allocation or a
PAGE_SIZEd allocation, and the kernel stack can then be
allocated with an order 1 allocation. Really this is risky,
since 8k on a 64bit platform is going to be less than 7k
on a 32bit platform, but we could try it out. This would
reduce the fragmentation problem by an order of magnitude,
making it equal to current IA32.

We must also consider that x86-64 seems to provide in hardware a
per-irq stack that could allow us to remove the irq handler
footprint from the regular per-process stack, so it could allow
us to live with a smaller kernel stack compared to the other
linux architectures.

3rd stage:

Before going into production, if we still have the order 2
allocation we can add a sysctl that allows the kernel stack to be
allocated with vmalloc during memory fragmentation. This has to
remain turned off during benchmarks :) but it should be ok in real
life.

Order of PAGE_CACHE_SIZE and other allocations:

In the long run we can increase PAGE_CACHE_SIZE to be
an order 2 allocation, and the slab/buffercache etc.
could also all be done with order 2 allocations. To make the above
work we would have to change lots of common code, thus it can be done
only once the basic port is in a production state. Having
a working larger PAGE_CACHE_SIZE would of course also be a benefit
for IA32 and other architectures.

vmalloc:
vmalloc should be outside the first 512GB, to keep that space free
for user space. It needs its own pgd to work on in common code.
It currently gets its own pgd in the 510th slot of the per CPU PML4.

PML4:
Each CPU has its own PML4 (= top level of the 4 level page hierarchy). On
context switch the first slot is rewritten to the pgd of the new process
and CR3 is flushed.

Modules:
Modules need to be in the same 4GB range as the core kernel; otherwise
a GOT would be needed. Modules are currently at 0xffffffffa0000000
to 0xffffffffafffffff. This is in between the kernel text and the
vsyscall/fixmap mappings.

Vsyscalls:
Vsyscalls have a reserved space near the end of the virtual address
space that is accessible by user space. This address is part of the
ABI and cannot be changed. They have ffffffffff600000 to
ffffffffffe00000 (but only some small space at the beginning is
allocated and known to user space currently). See vsyscall.c for
more details.

Fixmaps:
Fixed mappings set up at boot. Used to access the IO APIC and some other
hardware. These extend downwards from the end of the vsyscall space
(ffffffffffe00000), but are of course not accessible by user space.

Early mapping:
On a 120TB memory system bootmem could use up to 3.5GB
of memory for its bootmem bitmap. To avoid having to map 3.5GB by hand
for bootmem's purposes, the full direct mapping is created before bootmem
is initialized. The direct mapping needs some memory for its page tables;
these are taken directly from the physical memory after the kernel. To
access these pages they need to be mapped, which is done by a temporary
mapping with a few spare static 2MB PMD entries.

Unsolved issues:
2MB pages for user space - may need to add a highmem zone for that again to
avoid fragmentation.

Andrea SuSE
Andi Kleen SuSE