-*-Mode: outline-*-

                Light-weight System Calls for IA-64
                -----------------------------------

                        Started: 13-Jan-2003
                    Last update: 27-Sep-2003

                      David Mosberger-Tang

Using the "epc" instruction effectively introduces a new mode of
execution to the ia64 linux kernel. We call this mode the
"fsys-mode". To recap, the normal states of execution are:

  - kernel mode:
        Both the register stack and the memory stack have been
        switched over to kernel memory. The user-level state is saved
        in a pt_regs structure at the top of the kernel memory stack.

  - user mode:
        Both the register stack and the memory stack are in
        user memory. The user-level state is contained in the
        CPU registers.

  - bank 0 interruption-handling mode:
        This is the non-interruptible state which all
        interruption-handlers start execution in. The user-level
        state remains in the CPU registers and some kernel state may
        be stored in bank 0 of registers r16-r31.

In contrast, fsys-mode has the following special properties:

  - execution is at privilege level 0 (most-privileged)

  - CPU registers may contain a mixture of user-level and kernel-level
    state (it is the responsibility of the kernel to ensure that no
    security-sensitive kernel-level state is leaked back to
    user-level)

  - execution is interruptible and preemptible (an fsys-mode handler
    can disable interrupts and avoid all other interruption-sources
    to avoid preemption)

  - neither the memory-stack nor the register-stack can be trusted while
    in fsys-mode (they point to the user-level stacks, which may
    be invalid, or completely bogus addresses)

In summary, fsys-mode is much more similar to running in user-mode
than it is to running in kernel-mode. However, because execution is
at privilege level 0, fsys-mode requires some care (see below).


* How to tell fsys-mode

Linux operates in fsys-mode when (a) the privilege level is 0 (most
privileged) and (b) the stacks have NOT been switched to kernel memory
yet. For convenience, the header file provides
three macros:

        user_mode(regs)
        user_stack(task,regs)
        fsys_mode(task,regs)

The "regs" argument is a pointer to a pt_regs structure. The "task"
argument is a pointer to the task structure to which the "regs"
pointer belongs. user_mode() returns TRUE if the CPU state pointed
to by "regs" was executing in user mode (privilege level 3).
user_stack() returns TRUE if the state pointed to by "regs" was
executing on the user-level stack(s). Finally, fsys_mode() returns
TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
The fsys_mode() macro is equivalent to the expression:

        !user_mode(regs) && user_stack(task,regs)
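The relationship between the three predicates can be illustrated with a
small stand-alone C model. This is only a sketch: struct model_regs and
the model_*() functions are hypothetical stand-ins invented for this
example, not the kernel's macros, and the "task" argument is dropped
because the model records stack ownership directly in a flag.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the saved state; NOT pt_regs. */
struct model_regs {
    unsigned cpl;          /* privilege level when the state was saved: 0..3 */
    bool on_user_stacks;   /* true if the stacks were not yet switched */
};

/* Models user_mode(): the saved state ran at privilege level 3. */
static bool model_user_mode(const struct model_regs *regs)
{
    return regs->cpl == 3;
}

/* Models user_stack(): the saved state ran on the user-level stack(s). */
static bool model_user_stack(const struct model_regs *regs)
{
    return regs->on_user_stacks;
}

/* Models fsys_mode(): privileged, but still on the user-level stacks. */
static bool model_fsys_mode(const struct model_regs *regs)
{
    return !model_user_mode(regs) && model_user_stack(regs);
}
```

In this model, a state with cpl 0 and on_user_stacks set is the only
combination reported as fsys-mode; full kernel mode (cpl 0, kernel
stacks) and user mode (cpl 3, user stacks) are both rejected.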

* How to write an fsyscall handler

The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
(fsyscall_table). This table contains one entry for each system call.
By default, a system call is handled by fsys_fallback_syscall(). This
routine takes care of entering (full) kernel mode and calling the
normal Linux system call handler. For performance-critical system
calls, it is possible to write a hand-tuned fsyscall_handler. For
example, fsys.S contains fsys_getpid(), which is a hand-tuned version
of the getpid() system call.

The entry and exit-state of an fsyscall handler are as follows:

** Machine state on entry to fsyscall handler:

 - r10     = 0
 - r11     = saved ar.pfs (a user-level value)
 - r15     = system call number
 - r16     = "current" task pointer (in normal kernel-mode, this is in r13)
 - r32-r39 = system call arguments
 - b6      = return address (a user-level value)
 - ar.pfs  = previous frame-state (a user-level value)
 - PSR.be  = cleared to zero (i.e., little-endian byte order is in effect)
 - all other registers may contain values passed in from user-mode

** Required machine state on exit from fsyscall handler:

 - r11     = saved ar.pfs (as passed into the fsyscall handler)
 - r15     = system call number (as passed into the fsyscall handler)
 - r32-r39 = system call arguments (as passed into the fsyscall handler)
 - b6      = return address (as passed into the fsyscall handler)
 - ar.pfs  = previous frame-state (as passed into the fsyscall handler)

Fsyscall handlers can execute with very little overhead, but with that
speed comes a set of restrictions:

 o Fsyscall-handlers MUST check for any pending work in the flags
   member of the thread-info structure and, if any of the
   TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
   doing a full system call (by calling fsys_fallback_syscall).

 o Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
   r15, b6, and ar.pfs) because they will be needed in case of a
   system call restart. Of course, all "preserved" registers also
   must be preserved, in accordance with the normal calling conventions.

 o Fsyscall-handlers MUST check whether argument registers contain a
   NaT value before using them in any way that could trigger a
   NaT-consumption fault. If a system call argument is found to
   contain a NaT value, an fsyscall-handler may return immediately
   with r8=EINVAL, r10=-1.

 o Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
   any other operation that would trigger mandatory RSE
   (register-stack engine) traffic.

 o Fsyscall-handlers MUST NOT write to any stacked registers because
   it is not safe to assume that user-level called a handler with the
   proper number of arguments.

 o Fsyscall-handlers need to be careful when accessing per-CPU variables:
   unless proper safe-guards are taken (e.g., interruptions are avoided),
   execution may be pre-empted and resumed on another CPU at any given
   time.

 o Fsyscall-handlers must be careful not to leak sensitive kernel
   information back to user-level. In particular, before returning to
   user-level, care needs to be taken to clear any scratch registers
   that could contain sensitive information (note that the current
   task pointer is not considered sensitive: it's already exposed
   through ar.k6).

 o Fsyscall-handlers MUST NOT access user-memory without first
   validating access-permission (this can typically be done via
   probe.r.fault and/or probe.w.fault) and without guarding against
   memory access exceptions (this can be done with the EX() macros
   defined by asmmacro.h).

The above restrictions may seem draconian, but remember that it's
possible to trade off some of the restrictions by paying a slightly
higher overhead. For example, if an fsyscall-handler could benefit
from the shadow register bank, it could temporarily disable PSR.i and
PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
needed. In other words, following the above rules yields extremely
fast system call execution (while fully preserving system call
semantics), but there is also a lot of flexibility in handling more
complicated cases.
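The control flow that the first and third restrictions impose on a
handler can be sketched in C. This is a model for illustration only:
real handlers are written in IA-64 assembly, and every name below
(struct fsys_result, MODEL_ALLWORK_MASK, model_fsys_handler) is
hypothetical. The mask value is a placeholder rather than the real
TIF_ALLWORK_MASK, and the order of the two checks is a choice made by
this sketch, not something the rules above prescribe.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model types; none of these names exist in the kernel. */
enum fsys_outcome { FSYS_FAST_RETURN, FSYS_FALLBACK };

struct fsys_result {
    enum fsys_outcome outcome;
    long r8;    /* return value, or error code on failure */
    long r10;   /* 0 on success, -1 on error */
};

/* Placeholder bit mask; NOT the real value of TIF_ALLWORK_MASK. */
#define MODEL_ALLWORK_MASK 0x0fUL

static struct fsys_result model_fsys_handler(unsigned long thread_flags,
                                             bool arg_is_nat,
                                             long fast_value)
{
    struct fsys_result res = { FSYS_FAST_RETURN, 0, 0 };

    if (arg_is_nat) {
        /* Never consume a NaT argument: report r8=EINVAL, r10=-1. */
        res.r8 = EINVAL;
        res.r10 = -1;
        return res;
    }
    if (thread_flags & MODEL_ALLWORK_MASK) {
        /* Pending work: hand over to the full system call path. */
        res.outcome = FSYS_FALLBACK;
        return res;
    }
    /* Fast path: the result is returned directly in r8, r10 stays 0. */
    res.r8 = fast_value;
    return res;
}
```

Calling model_fsys_handler(0, false, 42) takes the fast path and yields
r8=42 with r10=0, while any set bit inside the placeholder mask forces
the fallback outcome.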

* Signal handling

The delivery of (asynchronous) signals must be delayed until fsys-mode
is exited. This is accomplished with the help of the lower-privilege
transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
checks whether the interrupted task was in fsys-mode and, if so, sets
PSR.lp and returns immediately. When fsys-mode is exited via the
"br.ret" instruction that lowers the privilege level, a trap will
occur. The trap handler clears PSR.lp again and returns immediately.
The kernel exit path then checks for and delivers any pending signals.
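The deferral sequence described above can be pictured as a tiny state
machine. The C sketch below is a loose model under stated assumptions:
struct model_task and both functions are hypothetical, the PSR.lp bit
is represented by a plain boolean, and signal delivery is reduced to a
counter.

```c
#include <stdbool.h>

/* Hypothetical model of the state involved; not a kernel structure. */
struct model_task {
    bool in_fsys_mode;
    bool psr_lp;            /* models the PSR.lp bit */
    bool signal_pending;
    int  signals_delivered;
};

/* Models the resume-notification step: defer while in fsys-mode. */
static void model_notify_resume(struct model_task *t)
{
    if (t->in_fsys_mode) {
        t->psr_lp = true;   /* arm the lower-privilege transfer trap */
        return;             /* return immediately, signal stays pending */
    }
    if (t->signal_pending) {
        t->signal_pending = false;
        t->signals_delivered++;
    }
}

/* Models the "br.ret" that lowers the privilege level. */
static void model_fsys_return(struct model_task *t)
{
    t->in_fsys_mode = false;
    if (t->psr_lp) {
        t->psr_lp = false;        /* trap handler clears PSR.lp */
        model_notify_resume(t);   /* exit path delivers pending signals */
    }
}
```

A signal that arrives while in_fsys_mode is set therefore stays pending
until model_fsys_return() runs, which mirrors the rule that delivery is
delayed until fsys-mode is exited.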

* PSR Handling

The "epc" instruction doesn't change the contents of PSR at all. This
is in contrast to a regular interruption, which clears almost all
bits. Because of that, some care needs to be taken to ensure things
work as expected. The following discussion describes how each PSR bit
is handled.

PSR.be  Cleared when entering fsys-mode. A srlz.d instruction is used
        to ensure the CPU is in little-endian mode before the first
        load/store instruction is executed. PSR.be is normally NOT
        restored upon return from an fsys-mode handler. In other
        words, user-level code must not rely on PSR.be being preserved
        across a system call.
PSR.up  Unchanged.
PSR.ac  Unchanged.
PSR.mfl Unchanged. Note: fsys-mode handlers must not write
        floating-point registers!
PSR.mfh Unchanged. Note: fsys-mode handlers must not write
        floating-point registers!
PSR.ic  Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
PSR.i   Unchanged. Note: fsys-mode handlers can clear the bit, if needed.
PSR.pk  Unchanged.
PSR.dt  Unchanged.
PSR.dfl Unchanged. Note: fsys-mode handlers must not write
        floating-point registers!
PSR.dfh Unchanged. Note: fsys-mode handlers must not write
        floating-point registers!
PSR.sp  Unchanged.
PSR.pp  Unchanged.
PSR.di  Unchanged.
PSR.si  Unchanged.
PSR.db  Unchanged. The kernel prevents user-level from setting a hardware
        breakpoint that triggers at any privilege level other than 3
        (user-mode).
PSR.lp  Unchanged.
PSR.tb  Lazy redirect. If a taken-branch trap occurs while in
        fsys-mode, the trap-handler modifies the saved machine state
        such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3. Note: the
        taken branch would occur on the branch invoking the
        fsyscall-handler, at which point, by definition, a syscall
        restart is still safe. If the system call number is invalid,
        the fsys-mode handler will return directly to user-level. This
        return will trigger a taken-branch trap, but since the trap is
        taken _after_ restoring the privilege level, the CPU has already
        left fsys-mode, so no special treatment is needed.
PSR.rt  Unchanged.
PSR.cpl Cleared to 0.
PSR.is  Unchanged (guaranteed to be 0 on entry to the gate page).
PSR.mc  Unchanged.
PSR.it  Unchanged (guaranteed to be 1).
PSR.id  Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.da  Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.dd  Unchanged. Note: the ia64 linux kernel never sets this bit.
PSR.ss  Lazy redirect. If set, "epc" will cause a Single Step Trap to
        be taken. The trap handler then modifies the saved machine
        state such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.
PSR.ri  Unchanged.
PSR.ed  Unchanged. Note: This bit could only have an effect if an fsys-mode
        handler performed a speculative load that gets NaTted. If so, this
        would be the normal & expected behavior, so no special treatment is
        needed.
PSR.bn  Unchanged. Note: fsys-mode handlers may clear the bit, if needed.
        Doing so requires clearing PSR.i and PSR.ic as well.
PSR.ia  Unchanged. Note: the ia64 linux kernel never sets this bit.
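The "lazy redirect" applied for PSR.tb and PSR.ss can be pictured with
the following C sketch. It only models the effect on the saved machine
state: struct model_saved_state, model_lazy_redirect() and the
gate_break_entry parameter (standing in for the address of
syscall_via_break() in the gate page) are hypothetical names chosen for
illustration, not kernel interfaces.

```c
#include <stdbool.h>

/* Hypothetical model of the saved machine state touched by the redirect. */
struct model_saved_state {
    unsigned long ip;    /* where execution resumes */
    unsigned cpl;        /* privilege level execution resumes at */
};

/*
 * Models the trap handler's fix-up: if the trap hit while in fsys-mode,
 * resume at the break-based entry point with privilege level 3, so the
 * system call is restarted through the regular path.
 */
static void model_lazy_redirect(struct model_saved_state *state,
                                bool trapped_in_fsys_mode,
                                unsigned long gate_break_entry)
{
    if (!trapped_in_fsys_mode)
        return;                     /* nothing special to do */
    state->ip = gate_break_entry;
    state->cpl = 3;
}
```

The redirect is "lazy" in the sense that nothing is done up front; the
saved state is only rewritten if a trap actually occurs in fsys-mode.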

* Using fast system calls

To use fast system calls, userspace applications need only call
__kernel_syscall_via_epc(). For example:

-- example fgettimeofday() call --
-- fgettimeofday.S --

#include <asm/asmmacro.h>       // header name missing in the source;
                                // asmmacro.h (GLOBAL_ENTRY/END) is assumed

GLOBAL_ENTRY(fgettimeofday)
        .prologue
        .save ar.pfs, r11
        mov r11 = ar.pfs
        .body

        // movl is used because a 64-bit immediate does not fit in "mov"
        movl r2 = 0xa000000000020660;;  // gate address
                                // found by inspection of System.map for the
                                // __kernel_syscall_via_epc() function. See
                                // below for how to do this for real.

        mov b7 = r2
        mov r15 = 1087          // gettimeofday syscall
        ;;
        br.call.sptk.many b6 = b7
        ;;

        .restore sp

        mov ar.pfs = r11
        br.ret.sptk.many rp;;   // return to caller
END(fgettimeofday)

-- end fgettimeofday.S --

In reality, getting the gate address is accomplished by two extra
values passed via the ELF auxiliary vector (include/asm-ia64/elf.h):

 o AT_SYSINFO : is the address of __kernel_syscall_via_epc()
 o AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO

The ELF DSO is a pre-linked library that is mapped in by the kernel at
the gate page. It is a proper ELF shared object so, with a dynamic
loader that recognises the library, you should be able to make calls to
the exported functions within it as with any other shared library.
AT_SYSINFO points into the kernel DSO at the
__kernel_syscall_via_epc() function for historical reasons (it was
used before the kernel DSO) and as a convenience.
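Looking up these two values amounts to a linear scan of the auxiliary
vector. The C sketch below shows that scan over an in-memory array so
that it stays self-contained. struct sketch_auxv, the SKETCH_* constants
and sketch_auxv_lookup() are names made up for this example; the tag
numbers match the AT_NULL, AT_SYSINFO and AT_SYSINFO_EHDR values in
<elf.h>, and the entry layout is a simplified two-word version of the
64-bit auxiliary-vector entry.

```c
#include <stdint.h>

/* Tag numbers, matching AT_NULL, AT_SYSINFO and AT_SYSINFO_EHDR. */
#define SKETCH_AT_NULL          0
#define SKETCH_AT_SYSINFO       32
#define SKETCH_AT_SYSINFO_EHDR  33

/* Simplified 64-bit auxiliary-vector entry: a tag and its value. */
struct sketch_auxv {
    uint64_t a_type;
    uint64_t a_val;
};

/*
 * Walk the vector up to its AT_NULL terminator and return the value
 * stored for "type", or 0 if the tag is not present.
 */
static uint64_t sketch_auxv_lookup(const struct sketch_auxv *av,
                                   uint64_t type)
{
    for (; av->a_type != SKETCH_AT_NULL; av++) {
        if (av->a_type == type)
            return av->a_val;
    }
    return 0;
}
```

On a running system with glibc, the same lookup is available without
walking the vector by hand, via getauxval(AT_SYSINFO) and
getauxval(AT_SYSINFO_EHDR) from <sys/auxv.h>. A return value of 0 means
the kernel did not supply that entry.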