The paging design used on the x86-64 Linux kernel port in 2.4.x provides:

o       per-process virtual address space limit of 512 Gigabytes
o       top of userspace stack located at address 0x0000007fffffffff
o       start of the kernel mapping = 0x0000010000000000
o       global RAM per system 508*512GB = 254 Terabytes (see the sketch after this list)
o       no need for any common code change
o       512GB of vmalloc/ioremap space

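As a quick sanity check, the figures above follow directly from the pagetable
geometry: one top-level slot spans 512GB, and 508 such slots of direct mapping
give 254 Terabytes. A minimal userspace C sketch of the arithmetic (an
illustration only, not kernel code):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                /* one top-level (PML4) slot maps 2^39 bytes = 512GB */
                uint64_t slot = 1ULL << 39;

                printf("per-process limit : %llu GB\n",
                       (unsigned long long)(slot >> 30));
                printf("top of user stack : %#llx\n",
                       (unsigned long long)(slot - 1));
                printf("508 slots of RAM  : %llu TB\n",
                       (unsigned long long)((508 * slot) >> 40));
                return 0;
        }
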
Description:
        x86-64 has a 4 level page structure, similar to ia32 PAE but with
        some extensions. Each level consists of a 4K page with 512 64-bit
        entries. The levels are named in Linux PML4, PGD, PMD, PTE; AMD calls
        them PML4E, PDPE, PDE, PTE respectively. For the direct and kernel
        mappings only 3 levels are used, with the PMD pointing to 2MB pages.

        Userspace can modify and see only the 3rd/2nd/1st level
        pagetables (pgd_offset() implicitly walks the 1st slot of the 4th
        level pagetable and returns an entry into the 3rd level pagetable).
        This is where the per-process 512 Gigabytes limit comes from.

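        To make that walk concrete, the sketch below splits a virtual address
        into the four per-level indices (9 bits each, 512 entries per level,
        4K pages, as described above). Every address below the 512 Gigabytes
        limit lands in slot 0 of the 4th level table, which is exactly the
        slot pgd_offset() walks implicitly. This is a userspace illustration,
        not kernel code:

        #include <stdio.h>
        #include <stdint.h>

        #define LEVEL_MASK 0x1ffULL     /* 512 entries per level -> 9 index bits */

        static void split(uint64_t va)
        {
                printf("%#018llx -> PML4 %3llu  PGD %3llu  PMD %3llu  PTE %3llu\n",
                       (unsigned long long)va,
                       (unsigned long long)((va >> 39) & LEVEL_MASK),  /* 4th level */
                       (unsigned long long)((va >> 30) & LEVEL_MASK),  /* 3rd level */
                       (unsigned long long)((va >> 21) & LEVEL_MASK),  /* 2nd level */
                       (unsigned long long)((va >> 12) & LEVEL_MASK)); /* 1st level */
        }

        int main(void)
        {
                split(0x0000000000400000ULL);   /* typical program text address */
                split(0x0000007fffffffffULL);   /* top of the userspace stack   */
                return 0;
        }
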
        The common code pgd is the PDPE, the pmd is the PDE, the
        pte is the PTE. The PML4 remains invisible to the common
        code.

        Since the per-process limit is 512 Gigabytes (due to the common
        code's 3 level pagetable limitation), the highest virtual address
        mapped into userspace is 0x7fffffffff, and it makes sense to use it
        as the top of the userspace stack to allow the stack to grow as
        much as possible.

        The kernel mapping and the direct memory mapping are split. The direct
        memory mapping starts directly after userspace, after a 512GB gap,
        while the kernel mapping is at the end of the (negative) virtual
        address space, to exploit the kernel code model. There is no support
        for discontig memory; this implies that the kernel
        mapping/vmalloc/ioremap/module mappings are not represented by their
        "real" mapping in mem_map, but only by their direct-mapped (but
        normally not used) alias.

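        Because the direct mapping begins one 512GB slot after the end of
        userspace, it starts at 2*512GB = 0x0000010000000000 (the address
        quoted at the top), and converting between a physical address and its
        direct-mapped alias is a plain offset. A userspace sketch of that
        relationship; the constant and helper names here are illustrative,
        not the kernel's actual macros:

        #include <stdio.h>
        #include <stdint.h>

        /* start of the direct mapping of physical memory (slot 2 of the PML4) */
        #define DIRECT_MAP_BASE 0x0000010000000000ULL

        static uint64_t virt_to_phys_demo(uint64_t va) { return va - DIRECT_MAP_BASE; }
        static uint64_t phys_to_virt_demo(uint64_t pa) { return pa + DIRECT_MAP_BASE; }

        int main(void)
        {
                uint64_t pa = 0x100000;                 /* 1MB physical            */
                uint64_t va = phys_to_virt_demo(pa);    /* its direct-mapped alias */

                printf("phys %#llx <-> virt %#llx, round trip ok: %d\n",
                       (unsigned long long)pa, (unsigned long long)va,
                       virt_to_phys_demo(va) == pa);
                return 0;
        }
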
Future:

        During 2.5.x we can break the 512 Gigabytes per-process limit,
        possibly by removing from the common code any knowledge about the
        architecture-dependent physical layout of the virtual to physical
        mapping.

        Once the 512 Gigabytes limit is removed, the userspace stack will
        be moved (most probably to virtual address 0x00007fffffffffff).
        Nothing will break in userspace due to that move, just as nothing
        breaks on IA32 when compiling the kernel with CONFIG_2G.

Linus agreed not to break common code and to live with the 512 Gigabytes
per-process limitation for the 2.4.x timeframe, and he has given me and Andi
some very useful hints... (thanks! :)

Thanks also to H. Peter Anvin for his interesting and useful suggestions on
the x86-64-discuss lists!

Current PML4 Layout:
        Each CPU has a PML4 page that never changes.
        Each slot is 512GB of virtual memory (slot base addresses are worked
        out in the sketch after this list).


        1    unmapped
        2    __PAGE_OFFSET - start of direct mapping of physical memory
        ...  direct mapping in further slots as needed.
        509  some IO mappings (others are in a memory hole below 4GB)
        510  vmalloc and ioremap space
        511  kernel code mapping, fixmaps and modules.

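        Each slot number maps straight to a virtual address: slot N starts at
        N * 512GB, and slots in the upper half (256 and above) are
        sign-extended into the canonical negative range. A small arithmetic
        sketch (illustration only, not kernel code) printing the bases of the
        slots listed above:

        #include <stdio.h>
        #include <stdint.h>

        /* base virtual address of PML4 slot n: n * 512GB, sign-extended at bit 47 */
        static uint64_t slot_base(unsigned n)
        {
                uint64_t base = (uint64_t)n << 39;
                if (base & (1ULL << 47))        /* upper-half slots 256..511 */
                        base |= 0xffff000000000000ULL;
                return base;
        }

        int main(void)
        {
                printf("slot   2 (__PAGE_OFFSET)   : %#018llx\n",
                       (unsigned long long)slot_base(2));
                printf("slot 510 (vmalloc/ioremap) : %#018llx\n",
                       (unsigned long long)slot_base(510));
                printf("slot 511 (kernel/fixmaps)  : %#018llx\n",
                       (unsigned long long)slot_base(511));
                return 0;
        }
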
Other memory management related issues follow:

PAGE_SIZE:

        If somebody is wondering why these days we still have such a small
        4k pagesize (16 or 32 kbytes would of course be much better for
        performance), PAGE_SIZE has to remain 4k for 32bit apps to provide a
        100% backwards compatible IA32 API (we can't allow silent fs
        corruption or, at best, a loss of coherency with the page cache by
        allocating MAP_SHARED areas in MAP_ANONYMOUS memory with a
        do_mmap_fake). I think it could be possible to have a dynamic page
        size between 32bit and 64bit apps, but it would need extremely
        intrusive changes in the common code, starting with the page cache,
        and we certainly don't want to depend on them right now, even if the
        hardware would support that.

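        One way to see the constraint from plain userspace: 32-bit binaries
        built against a 4k page size hand mmap() offsets and lengths in 4k
        multiples, and those stop being page-aligned the moment the kernel's
        PAGE_SIZE grows. A trivial sketch of that dependency (ordinary POSIX
        calls, nothing kernel-specific):

        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                long psz = sysconf(_SC_PAGESIZE);  /* page size the kernel exports */
                long off = 4096;                   /* offset an IA32 app assumes is aligned */

                printf("page size %ld: a 4k file offset is %spage-aligned\n",
                       psz, off % psz ? "NOT " : "");
                return 0;
        }
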
PAGETABLE SIZE:

        In turn, we can't afford to have pagetables larger than 4k, because
        we might not be able to allocate them due to physical memory
        fragmentation, and failing to allocate the kernel stack is a minor
        issue compared to failing the allocation of a pagetable. If we
        fail the allocation of a pagetable, the only thing we can do is
        sched_yield while polling the freelist (deadlock prone) or segfault
        the task (not even the sighandler would be sure to run).

KERNEL STACK:

        1st stage:

        The kernel stack will at first be allocated with an order 2 allocation
        (16k) (the stack utilization on a 64bit platform really isn't exactly
        double that of a 32bit platform, because the local variables may not
        all be 64bit wide, but it is not much less). This will make things
        even worse than they are right now on IA32 with respect to failing
        fork/clone due to memory fragmentation.

        2nd stage:

        We'll benchmark whether reserving one register as the task_struct
        pointer will improve performance of the kernel (instead of
        recalculating the task_struct pointer starting from the stack
        pointer each time). My guess is that recalculating will be faster,
        but it is worth a try (a sketch of the recalculation appears after
        the notes below).

                If reserving one register for the task_struct pointer
                turns out to be faster, we can as well split task_struct and
                the kernel stack. task_struct can be a slab allocation or a
                PAGE_SIZEd allocation, and the kernel stack can then be
                allocated with an order 1 allocation. Really this is risky,
                since 8k on a 64bit platform is going to be less than 7k
                on a 32bit platform, but we could try it out. This would
                reduce the fragmentation problem by an order of magnitude,
                making it equal to current IA32.

                We must also consider that x86-64 seems to provide in hardware
                a per-irq stack that could allow us to remove the irq handler
                footprint from the regular per-process stack, which could
                allow us to live with a smaller kernel stack compared to the
                other Linux architectures.

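        For reference, "recalculating the task_struct pointer starting from
        the stack pointer" usually means masking the stack pointer down to
        the base of the fixed-size stack block that also holds the
        task_struct. A userspace simulation of the idea (a 16k aligned block
        stands in for the kernel stack; the names are illustrative, not the
        kernel's):

        #include <stdio.h>
        #include <stdint.h>
        #include <stdlib.h>

        #define STACK_SIZE 16384        /* order 2 allocation: 4 x 4k pages */

        struct task_struct_demo { int pid; };

        int main(void)
        {
                /* task_struct at the bottom of a 16k-aligned block; the kernel
                 * stack would grow down from the top of the same block */
                struct task_struct_demo *task = aligned_alloc(STACK_SIZE, STACK_SIZE);
                if (!task)
                        return 1;
                task->pid = 42;

                /* a "stack pointer" somewhere inside the block */
                uintptr_t sp = (uintptr_t)task + STACK_SIZE - 200;

                /* recalculate the task_struct pointer by masking the stack pointer */
                struct task_struct_demo *current =
                        (struct task_struct_demo *)(sp & ~((uintptr_t)STACK_SIZE - 1));

                printf("recovered pid = %d\n", current->pid);
                free(task);
                return 0;
        }
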
        3rd stage:

        Before going into production, if we still have the order 2
        allocation, we can add a sysctl that allows the kernel stack to be
        allocated with vmalloc during memory fragmentation. This has to
        remain turned off during benchmarks :) but it should be ok in real
        life.

Order of PAGE_CACHE_SIZE and other allocations:

        In the long run we can increase PAGE_CACHE_SIZE to be
        an order 2 allocation, and the slab/buffercache etc. could
        also all be done with order 2 allocations. To make the above
        work we would have to change lots of common code, so it can be done
        only once the basic port is in a production state. Having
        a working, larger PAGE_CACHE_SIZE would of course also be a benefit
        for IA32 and other architectures.

vmalloc:
        vmalloc should be outside the first 512GB to keep that space free
        for user space. It needs its own pgd to work on in common code.
        It currently gets its own pgd in the 510th slot of the per-CPU PML4.

PML4:
        Each CPU has its own PML4 (the top level of the 4 level page
        hierarchy). On context switch the first slot is rewritten to the pgd
        of the new process and CR3 is flushed.

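        A toy model of that context switch step, with a plain array standing
        in for the per-CPU PML4 (a simplification for illustration; the real
        switch also reloads CR3, which flushes the TLB and which userspace
        obviously cannot do):

        #include <stdio.h>
        #include <stdint.h>

        #define PML4_ENTRIES 512

        /* per-CPU top-level pagetable: slots 1..511 never change */
        static uint64_t cpu_pml4[PML4_ENTRIES];

        static void switch_to(uint64_t next_pgd)
        {
                cpu_pml4[0] = next_pgd;         /* rewrite the user slot */
                /* ...then reload CR3 so stale user translations are dropped */
        }

        int main(void)
        {
                switch_to(0x1000);              /* hypothetical pgd of process A */
                printf("slot 0 now %#llx\n", (unsigned long long)cpu_pml4[0]);
                switch_to(0x2000);              /* hypothetical pgd of process B */
                printf("slot 0 now %#llx\n", (unsigned long long)cpu_pml4[0]);
                return 0;
        }
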
Modules:
        Modules need to be in the same 4GB range as the core kernel.
        Otherwise a GOT would be needed. Modules are currently at
        0xffffffffa0000000 to 0xffffffffafffffff. This is in between the
        kernel text and the vsyscall/fixmap mappings.

Vsyscalls:
        Vsyscalls have a reserved space near the end of the virtual address
        space that is accessible by user space. This address is part of the
        ABI and cannot be changed. They occupy ffffffffff600000 to
        ffffffffffe00000 (but only some small space at the beginning is
        currently allocated and known to user space). See vsyscall.c for
        more details.

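        Because the address is fixed by the ABI, user space can in principle
        call straight into it; historically gettimeofday() is the entry point
        at the very start of the region. A sketch of such a call (on recent
        kernels the legacy vsyscall page may be emulated or disabled
        entirely, so treat this as a historical illustration):

        #include <stdio.h>
        #include <sys/time.h>

        /* legacy vsyscall gettimeofday entry at the start of the reserved region */
        typedef int (*vgtod_t)(struct timeval *, struct timezone *);
        #define VSYSCALL_GTOD ((vgtod_t)0xffffffffff600000UL)

        int main(void)
        {
                struct timeval tv;

                if (VSYSCALL_GTOD(&tv, NULL) == 0)
                        printf("vsyscall gettimeofday: %ld.%06ld\n",
                               (long)tv.tv_sec, (long)tv.tv_usec);
                return 0;
        }
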
Fixmaps:
        Fixed mappings set up at boot, used to access the IO APIC and some
        other hardware. They run downwards from the end of the vsyscall space
        (ffffffffffe00000), but are of course not accessible by user space.

Early mapping:
        On a 120TB memory system bootmem could use up to 3.5GB
        of memory for its bootmem bitmap. To avoid having to map 3.5GB by
        hand for bootmem's purposes, the full direct mapping is created
        before bootmem is initialized. The direct mapping needs some memory
        for its page tables; these are taken directly from the physical
        memory after the kernel. To access these pages they need to be
        mapped, which is done with a temporary mapping using a few spare
        static 2MB PMD entries.

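        The bitmap figure follows from bootmem keeping one bit per 4k page
        frame; for 120TB that works out to a few gigabytes, the same order of
        magnitude as the estimate above. A quick back-of-the-envelope check
        (plain arithmetic, not kernel code):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                uint64_t ram    = 120ULL << 40;         /* 120TB of RAM        */
                uint64_t bitmap = ram / 4096 / 8;       /* one bit per 4k page */

                printf("bootmem bitmap for 120TB: %llu MB\n",
                       (unsigned long long)(bitmap >> 20));
                return 0;
        }
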
Unsolved issues:
         2MB pages for user space - may need to add a highmem zone for that again to
         avoid fragmentation.

Andrea  SuSE
Andi Kleen  SuSE

$Id: mm.txt,v 1.1.1.1 2004-04-15 02:33:37 phoenix Exp $
