% chapter included in forwardcom.tex
\documentclass[forwardcom.tex]{subfiles}
\begin{document}
\RaggedRight
\chapter{Memory model} \label{memoryModel}
ForwardCom has a flat address space using unsigned 64-bit addresses and 64-bit pointers. A future extension to 128-bit addresses is possible, but this will probably not be relevant in the foreseeable future.
\vv

Absolute addresses are rarely used. Most data objects, functions and jump targets are addressed with signed offsets of 32 bits or less relative to some reference point contained in a 64-bit pointer. This pointer can be the instruction pointer (IP), the data section pointer (DATAP), the thread data
pointer (THREADP), the stack pointer (SP), or a general purpose register.
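\vv

As an illustration of this addressing scheme, the following C sketch shows how an effective address may be formed from a 64-bit base pointer and a sign-extended signed offset. The function name is invented for illustration; this is not a specified ForwardCom interface:

\begin{verbatim}
#include <stdint.h>

// Effective address = 64-bit base pointer (IP, DATAP, THREADP, SP, or a
// general purpose register) plus a sign-extended signed offset.
static inline uint64_t effective_address(uint64_t base, int32_t offset) {
    return base + (int64_t)offset;   // sign extension before the 64-bit add
}
\end{verbatim}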
\vv

An application can have access to the following sections of data:
\vv

\begin{itemize}
\item Program code (CODE). This memory block is executable, but usually without read and write access. The CODE section can be shared between multiple processes running the same program.
\item Read-only program data (CONST). This contains constants and tables used by the program without write access. It may be shared between multiple processes.
\item Static read/write program data sections, which can be initialized (DATA) and uninitialized (BSS). This is used for global data and for static data inside functions. Multiple instances are needed if multiple processes are running the same code.
\item Stack data (STACK). This is used for non-static data inside functions. Each process or thread has its own stack, with addresses relative to the stack pointer. The stack grows downward, from high toward low addresses, as data are added to it.
\item Thread data (THREADD). Allocated when a thread is created and used for thread-local static data and the thread environment block. This has one instance for each thread.
\item Program heap (HEAP). Used for dynamic memory allocation by an application program.
\end{itemize}
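\vv

The following C sketch indicates where different kinds of variables would typically be placed. The mapping shown follows common practice for C implementations and is an illustration only, not a rule stated in this chapter:

\begin{verbatim}
#include <stdlib.h>

const int table[4] = {1, 2, 3, 4};      // CONST:   read-only data
int initialized = 99;                   // DATA:    static, initialized
int uninitialized[100];                 // BSS:     static, zero-initialized
_Thread_local int per_thread;           // THREADD: thread-local static data

int main(void) {                        // CODE:    the instructions themselves
    int local = 5;                      // STACK:   non-static local data
    int *heap = malloc(100 * sizeof(int)); // HEAP:  dynamic allocation
    free(heap);
    return local;
}
\end{verbatim}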
References within the CODE section use 8-bit, 16-bit, 24-bit, or 32-bit signed offsets relative to the instruction pointer, scaled by the code word size, which is 4 bytes.
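\vv

The scaling by 4 means that each offset field covers four times the range of a byte offset. A minimal sketch of the target calculation (the function name is invented for illustration):

\begin{verbatim}
#include <stdint.h>

// Branch target = instruction pointer + signed offset field * 4 bytes.
// An 8-bit field thus reaches +/- 512 bytes, a 16-bit field +/- 128 KB,
// a 24-bit field +/- 32 MB, and a 32-bit field +/- 8 GB.
static inline uint64_t branch_target(uint64_t ip, int32_t offset_field) {
    return ip + (int64_t)offset_field * 4;   // scale by the code word size
}
\end{verbatim}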
\vv

A CONST section may be placed either near the CODE section and addressed relative to the instruction pointer (IP), or together with the other data sections and addressed relative to the data section pointer (DATAP).
\vv

The DATA and BSS sections are addressed relative to the data section pointer (DATAP), which is a special register that points to some reference point in these sections. The default reference point is where DATA ends and BSS begins. Multiple running instances of the same program will have different values of the data section pointer. The CODE and CONST sections contain no absolute references to DATA or BSS, only references relative to the data section pointer. This makes it possible for multiple processes to share the same CODE and CONST sections while each has its own private DATA and BSS sections, without the need for virtual address translation. The DATA and BSS sections can be placed anywhere in the address space independently of where CONST and CODE are placed.
\vv

STACK data are addressed relative to the stack pointer (SP). Heap data are addressed through pointers provided by a heap allocation function.
\vv

Thread data are addressed relative to the thread data pointer (THREADP), which is separate for each thread in the process. The thread data section may be allocated on the stack when a new thread is created.
\vv

The STACK, DATA, BSS, and HEAP data sections are preferably kept together in one contiguous block in order to optimize caching and memory management.
\vv

This model allows the program to access up to 8 GB of CODE, 2 GB of CONST, 2 GB of DATA, 2 GB of BSS, 2 GB of THREADD, an almost unlimited size of STACK with 2 GB frames, and an almost unlimited amount of HEAP data.
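\vv

These limits appear to follow from the signed 32-bit offsets described above: an unscaled offset reaches $2^{31}$ bytes in one direction, while CODE offsets are scaled by the 4-byte code word size. The stack as a whole can exceed this because only each stack frame needs to be addressable within a 32-bit offset from SP:
\[
2^{31}\,\mathrm{bytes} = 2\,\mathrm{GB}, \qquad
4 \times 2^{31}\,\mathrm{bytes} = 8\,\mathrm{GB}.
\]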
\vv

A pointer to the CONST section may be provided in the thread environment block in order to access CONST data in case the CONST section is placed more than 2 GB away from IP or DATAP, or if the CONST section is relocated independently of CODE and DATA.
\vv

\subsection{Padding space} \label{PaddingSpace}
%\label{extraSpaceAtEndOfData}
Any read/writeable data section that is not followed by another readable section may have an unused space at the end, at least the size of the maximum vector length. The purpose of this extra space is to make it possible for the pop instruction to read more than necessary when restoring a vector of unknown length, and to make it possible to search for the end of a zero-terminated string without having to read one byte at a time. This does not apply to IP-addressed read-only data sections that contain no strings, because a variable-size unused space between the constant section and the code section would make it necessary to recalculate relative addresses when a program is loaded on a system with a different maximum vector length, which would reduce the advantage of position-independent code.
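\vv

As an example of the kind of optimization this padding enables, the following C sketch scans a zero-terminated string one 64-bit word at a time, using a standard bit trick to detect a zero byte. Reading up to 7 bytes past the terminator is harmless precisely because the padding space is readable. This is a sketch assuming 8-byte alignment of the string:

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

// Find the length of a zero-terminated string 8 bytes at a time.
// Reading past the terminator is safe only because the section is
// padded with readable space. Assumes 8-byte alignment of s.
size_t padded_strlen(const char *s) {
    const uint64_t *p = (const uint64_t *)s;
    for (;; p++) {
        uint64_t w = *p;
        // Nonzero exactly when some byte within w is zero:
        if ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL)
            break;
    }
    const char *c = (const char *)p;
    while (*c) c++;                 // locate the exact terminator
    return (size_t)(c - s);
}
\end{verbatim}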
\vv

An alternative solution is to allow vector reads that begin in a readable section to extend past the end of the readable section without raising an exception. The part of the vector that falls outside the permitted section must read as zero. It remains to be investigated which of these two solutions is more efficient.
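\vv

A behavioral model of this alternative rule, expressed in C for clarity (hypothetical, since the hardware semantics are yet to be decided):

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

// A vector read that starts inside a readable section may extend past
// the end of the section; bytes outside the section read as zero.
void vector_read(const uint8_t *src, const uint8_t *section_end,
                 uint8_t *dst, size_t veclen) {
    for (size_t i = 0; i < veclen; i++)
        dst[i] = (src + i < section_end) ? src[i] : 0;
}
\end{verbatim}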
\vv

\subsection{Stack direction} \label{StackDirection}
Most microprocessor systems have the data stack growing backwards. This also applies to the ForwardCom system, but mainly for a different reason. When a vector register is saved on the stack by the push instruction, it is stored as the length followed by the amount of data indicated by the length. When the vector register is restored using the pop instruction, it is necessary to read the length first and then the data. The stack pointer must point to the low end, where the length is stored; otherwise it would be impossible to find the length.
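\vv

The following C sketch models this layout. It is only meant to show why the length field must end up at the address the stack pointer points to; the actual instruction encoding and alignment rules are not shown:

\begin{verbatim}
#include <stdint.h>
#include <string.h>

// Model of saving/restoring a variable-length vector on a stack that
// grows downward. The length field ends up at the low address, where
// the stack pointer points, so pop can read it before the data.
typedef struct { uint8_t *sp; } Stack;

void push_vector(Stack *st, const uint8_t *data, uint64_t len) {
    st->sp -= len;                      // room for the vector data
    memcpy(st->sp, data, len);
    st->sp -= sizeof(uint64_t);         // length field below the data
    memcpy(st->sp, &len, sizeof len);   // SP now points at the length
}

uint64_t pop_vector(Stack *st, uint8_t *data) {
    uint64_t len;
    memcpy(&len, st->sp, sizeof len);   // read the length first
    st->sp += sizeof(uint64_t);
    memcpy(data, st->sp, len);          // then the data above it
    st->sp += len;
    return len;
}
\end{verbatim}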
\vv

It is possible to make additional stacks growing forwards or backwards, but any stack containing variable-length vectors must grow backwards.
\vv

\section{Thread memory protection} \label{threadMemoryProtection}
Each thread has its own stack. The thread data (THREADD) may be placed on this stack. The ForwardCom system allows inter-thread memory protection. The stack data of the main thread of a program is accessible to all its child threads, but all other threads in the program can have private data which is not accessible to any other threads, not even to the main thread. Any communication and synchronization between threads must use static memory or memory belonging to the main thread.
\vv

It is recommended to use this inter-thread memory protection in all cases except where legacy software requires one memory space shared by all threads.
\vv

\subsubsection{Isolated memory blocks} \label{isolatedMemoryBlocks}
It is possible to make a system function that allocates an isolated memory block surrounded by inaccessible memory on both sides. Such a memory block, which will be accessible only to a specific thread, can be used for example for a data buffer in cases where security requirements are high. Each thread can have only a limited number of such protected memory blocks because of the limited size of the memory map.
\section{Memory management} \label{memoryManagement}
It is a design goal to minimize memory fragmentation and to minimize the need for virtual address translation. Traditional designs often have very complicated memory management systems with multilevel address translation, large translation lookaside buffers (TLB), and huge page tables.
\vv

The reason why many current systems have large page tables with fixed-size memory pages is that this is the only efficient way to look up an address in a heavily fragmented memory space. It is a design goal of ForwardCom to minimize or eliminate memory fragmentation. This is achieved in several ways explained below. If there are only a few memory blocks visible to a particular thread, then it is possible to have variable-size memory blocks. It requires only a limited number of hardware comparators to look up an address in a table of only a few memory blocks with variable size. The ambition is to replace the TLB, which has a large number of fixed-size memory blocks, by a memory map with a few memory blocks of variable size.
A memory map with such a limited number of entries can easily be implemented on the chip in a very efficient way and it can easily be changed on task switches. Each process and each thread must have its own memory map.
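\vv

A software model of such a memory map might look as follows. The field names and the linear loop are invented for illustration; a hardware implementation would compare an address against all entries in parallel. The offset field anticipates the constant-offset address translation described below:

\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

// Software model of a small memory map with variable-size entries.
typedef struct {
    uint64_t start;     // first address of the block
    uint64_t end;       // one past the last address of the block
    uint32_t access;    // read/write/execute permission bits
    int64_t  offset;    // virtual-to-physical offset; 0 = no translation
} MapEntry;

bool map_lookup(const MapEntry *map, int n, uint64_t addr,
                uint32_t required, uint64_t *phys) {
    for (int i = 0; i < n; i++) {       // one comparator pair per entry
        if (addr >= map[i].start && addr < map[i].end) {
            if ((map[i].access & required) != required)
                return false;           // insufficient access rights
            *phys = addr + (uint64_t)map[i].offset;  // constant offset
            return true;
        }
    }
    return false;                       // no matching entry: violation
}
\end{verbatim}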
\vv

In most cases, the main thread of an application will only need three blocks of memory: CONST (read only), CODE (execute only), and the combined STACK+DATA+BSS+HEAP (read-write). A child thread needs one more entry for its private stack. Similar blocks are defined for system code.
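\vv

Continuing the sketch above, a main thread's map could then contain just three entries. All addresses and permission-bit values here are invented for illustration:

\begin{verbatim}
enum { READ = 1, WRITE = 2, EXECUTE = 4 };  // hypothetical permission bits

MapEntry main_thread_map[] = {
    { 0x00010000, 0x00020000, READ,         0 },  // CONST (read only)
    { 0x00020000, 0x00040000, EXECUTE,      0 },  // CODE (execute only)
    { 0x00040000, 0x00080000, READ | WRITE, 0 },  // STACK+DATA+BSS+HEAP
};
\end{verbatim}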
\vv

The memory map supports virtual address translation in the form of a constant offset that defines the distance between the virtual address and the physical address for each map entry. The hardware should not waste time and power on virtual address translation when it is not used.
\vv

A limited number of extra entries are needed in the memory map to deal with cases where the memory becomes fragmented, but memory fragmentation can be avoided in most cases. The following techniques are provided to simplify memory management and avoid memory fragmentation:
\begin{itemize}
\item There is only one type of function library, which can be used for both static and dynamic linking. These libraries are linked with a mechanism that keeps the CONST, CODE and DATA sections contiguous with the similar sections of the main program. How this avoids memory fragmentation and improves caching is described on page \pageref{libraryLinkMethods} below.
\item The required stack size is calculated by the compiler and the linker so that stack overflow can be avoided in most cases. This technique is described on page \pageref{predictingStackSize}.
\item The operating system can keep statistical records of the heap use of each program in order to predict the required heap size. The same technique can be used for predicting stack use in cases where the required stack size cannot be predicted exactly (e. g. recursive function calls).
\end{itemize}
The memory space may become fragmented despite the use of these techniques. Problems that can result in memory fragmentation are listed below.
\begin{itemize}
\item Recursive functions can use unlimited stack space. We may require that the programmer specifies a maximum recursion level in a pragma.
\item Allocation of variable-size arrays on the stack using the alloca function in C. We may require that the programmer specifies a maximum size.
\item Runtime linking. The program can reserve space for loading and linking function libraries at run time (see page \pageref{runtimeLinking}). The memory may become fragmented if the memory space reserved for this purpose turns out to be insufficient.
\item Script languages and byte code languages. It is difficult to predict the required size of stack and heap when running interpreted or emulated code. It is recommended to use a just-in-time compiler instead. Self-modifying scripts cannot be compiled. The same problem can occur with large user-defined macros.
\item Unpredictable number of threads without protection. The required stack size for a thread may be computed in advance, but in some cases it may be difficult to predict the number of threads that a program will generate. Multiple threads will mostly share the same code sections, but they need separate stacks. The stack of a thread can be placed anywhere in memory without problems if inter-thread memory protection is used. But if memory is shared between threads and the number of threads is unpredictable then the shared
memory space may become fragmented.
\item Unpredictable heap size. Programs that process large amounts of data, e. g. multimedia processing, may need a large heap. A program heap can use discontiguous memory, but this will require extra entries in the memory map.
\item Lazy loading and code overlay. A large program may have certain code units that are rarely used and loaded only when needed. Lazy loading can be useful to save memory, but it may require virtual memory translation and it may cause memory fragmentation. A straightforward solution is to implement such code units as separate executable programs.
\item Hot patching, i. e. updating of code while it is running.
\item Shared memory for inter-process communication. This requires extra entries in the memory map as explained below.
\item Many programs running. The memory can become fragmented when many programs of different sizes are loaded and unloaded randomly or swapped in and out of memory.
\item Memory mapping of files and devices.
\end{itemize}
A possible remedy against overflow of stack and heap is to place the STACK, DATA, BSS and HEAP data together (in this order) in an address range with large unused virtual address spaces below and above, so that the stack can grow downwards and the heap can grow upwards into the vacant spaces. This method can avoid fragmentation of the virtual address space, but not the physical address space. Fragmentation of the physical address space can be remedied by moving data from a memory block of insufficient size to another block that is larger. This method has the cost of a time delay when the data block is moved.
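\vv

The intended layout can be pictured like this, with addresses increasing upward:

\begin{verbatim}
    ~~~ unused virtual space ~~~      (HEAP can grow upward)
    HEAP
    BSS
    DATA
    STACK                             (grows downward)
    ~~~ unused virtual space ~~~      (STACK can grow downward)
\end{verbatim}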
\vv

If runtime linking runs into memory problems or a lack of memory map entries, then it is allowed to mix CONST and CODE sections together in a common section with both read and execute access. If a library function contains constant data that originate from an untrusted source, while the code is trusted, then it is preferable to put the untrusted data into the DATA section rather than the CONST section, in order to prevent execution of malicious code placed in the CONST section.
\vv

An application that needs many independent memory blocks can make a separate thread or sub-process to service each memory block. Each thread has its own memory map with an entry for its private memory block.
\vv

\label{sharedMemory} Shared memory can be used when there is a need to transfer large amounts of data between two processes. One process shares a part of its memory with another process. The receiving process needs an extra entry in its memory map to indicate read and/or write access rights to the shared memory block. The process that owns the shared memory block does not need any extra entry in its memory map. There is a limit to how many shared memory blocks an application can receive access to, if we want to keep the memory map small. If one program needs to communicate with a large number of other programs then we can use one of these solutions: (1) let the program that needs many connections own the shared memory and give each of its clients access to one part of it, (2) run multiple threads in (or multiple instances of) the program that needs many connections so that each thread has access to only one shared memory block, (3) let multiple communication channels use the same shared memory block or parts of it, (4) communicate through function calls, (5) communicate through network sockets, or (6) communicate through files.
\vv

Executable memory cannot be shared between different applications. A function that is used by multiple different applications will usually be placed in a function library that these applications are linking to.
The mechanism of interprocess calls must be used if one application needs to call a function in another application. This is described on page \pageref{interProcessCalls}.
\vv

Software that relies heavily on memory mapping should be avoided. As the trend in microprocessor technology goes toward an increasing number of CPU cores, it should be possible to give each critical application or thread its own core with its own memory space, so that context switching and memory fragmentation can be avoided or minimized.
\vv

\subsubsection{Conclusion} \label{MemoryManagementConclusion}
We can probably keep memory fragmentation low by using the principles discussed here so that a relatively small memory map for each thread will be sufficient to cover normal cases. This will be much more efficient than the large TLB and multilevel address translation of current designs. It will save a lot of chip space and power. We can avoid the cost of TLB misses and page faults, and it will make task switches faster. The actual size of the memory map will depend on the hardware implementation.
\vv

This system gives priority to performance-critical applications. The programmer will be able to get top performance by observing some discipline to avoid memory fragmentation, for example by recycling allocated memory. However, the system should also be able to run less well-behaved applications. Applications that cause heavy memory fragmentation are probably built with less regard for performance. The system should have methods to allow such applications to work, perhaps even software-based methods to deal with memory fragmentation, but it is acceptable to prioritize the performance of well-behaved applications at the cost of inferior performance for applications that cause heavy memory fragmentation.
\end{document}