The persistent cache approach.
Boris Muratshin (firstname.lastname@example.org),
Alexander Artiushin (email@example.com)
Jul 31, 2002
Ideally, memory should be able to supply a processor with data in a way that avoids any extra delays. Unfortunately, no compilation methods are known that can provide such a data transport. In modern computer systems, the reduction of data retrieval time is achieved via a hierarchical memory organization in which each successive memory (cache) level is larger and slower than the previous one, and only the last level is addressable. The first-level cache size is commonly 8 Kwords (1.5 MB on the HP8500); the second and third (if present) levels are much larger.
When an instruction receives a virtual memory address as an operand, the address is converted into a physical one. Using this physical address as a key (tag), the data are requested from memory, and if these data already exist in the cache, we get a speed-up. Why does the cache work with physical addresses? The problem is the probable intersection of the address spaces allocated by the OS to different processes. It is possible to extend the virtual address (tag) with the process ID and convert it into a physical one only when a cache lookup fails, but in that case the cache complexity increases significantly against the small saving in address-conversion costs.
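The trade-off above can be sketched in a toy model, assuming a simple page-table translation (the `translate` function, page tables, and all data below are illustrative, not part of the proposal):

```python
PAGE = 4096

# Hypothetical per-process page tables: virtual page -> physical page.
# Processes 1 and 2 happen to map the same physical page 7.
page_tables = {
    1: {0: 7, 1: 3},
    2: {0: 7},
}

def translate(pid, vaddr):
    """Convert a virtual address into a physical one via the page table."""
    vpage, off = divmod(vaddr, PAGE)
    return page_tables[pid][vpage] * PAGE + off

# Physically tagged cache: translation must happen before every lookup.
phys_cache = {}

def phys_lookup(pid, vaddr):
    return phys_cache.get(translate(pid, vaddr))

# Virtually tagged cache with the tag extended by the process ID:
# identical virtual addresses of different processes cannot collide,
# and translation is only needed on a miss.
virt_cache = {}

def virt_lookup(pid, vaddr):
    return virt_cache.get((pid, vaddr))
```

With the extended tag, process 2 never sees process 1's entry even when both use the same virtual address, whereas the physically tagged cache shares the line whenever the addresses map to the same physical location.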
Moreover, address conversion is usually executed in the processor, whereas the upper cache levels are commonly separate from it. The shortcomings of the existing caching approach are obvious and stem mostly from the shared use of cache memory by all running processes. Whatever caching strategy is used, when control returns to a process after a series of context switches, the process should consider its own cached data context totally lost (the so-called cold-start problem). When the regular number of simultaneously running processes is measured in hundreds, caches of huge size are needed to preserve even part of a process's personal data.
Virtual page swapping causes yet another problem. When, during a virtual address conversion, we find the corresponding virtual page missing, it will be read in, but some other page will be evicted. Obviously, we should invalidate all the cache lines corresponding to this condemned page.
The advantages of the existing caching technique are sometimes considered moot points. The average retrieval time is better, especially for sequential and localized access patterns; in any other case a program should not expect significant help from the caching system. Moreover, such "average" behavior plays a mean trick on programmers: it makes no sense to take care of an algorithm's "beauty" or to optimize programs by hand, because the cache levels off all code, and frequently "bad" code turns out faster than "good" code. Attempts to exploit cache particularities during optimization result in a lack of portability, even between different versions of the same processor. There are attempts to make the cache more predictable, e.g. the speculative loading technique, where the compiler places an instruction to load data into the cache before they are needed (SPARC V9, IBM POWER3, and HP PA-8xxx). The ability to explicitly invalidate an unnecessary cache line could also be a useful feature. Some systems (e.g. TMS320C6xxx) allow the cache to be configured as directly addressed super-fast memory, but in this case there are difficulties with sharing it between tasks.
Additional words should be devoted to the cache coherency problem. Since coherency is maintained by inter-processor messages, and the number of these messages grows significantly nonlinearly with the number of processors, there is an objective upper limit on the number of processors in a system. Let's note that the majority of these messages are useless, since they will certainly never be used. This is the price of caching by physical addresses, which creates extra dependencies between processor units. A method avoiding these extra dependencies would significantly raise the above-mentioned upper limit on the number of processors in a system.
Let's turn to the essence of the proposed approach. The basic ideas are:
- the separation of virtual addresses into global and local, which can be done on the basis of the address value alone
- caching is performed in the virtual address space
- each task works with its own cache context, and its cached local data cannot be affected by other tasks
- the cache memory buffer is split into two equal parts (pages)
- at a task switch, the cache pages are switched; the new task immediately runs with one page, while the other page is used first for saving its content into main memory and second for restoring the data of the next task from main memory
- the global data tag is extended with an identifier provided by the OS, e.g. the process number
Let the computing system under discussion consist of at least one processor module, and let it be intended to run a number of processes in parallel, each of which can contain more than one parallel thread. We will call data that can be used by only one thread local, and data shared between different threads of one process global. Data sharing between processes is allowed but requires special treatment: e.g. such data are uncacheable, or unswappable, or accessed only via system calls.
The program loader is implemented to place a process's global data into segments with easily discriminable virtual addresses: e.g. a binary '&' of the address against some predefined or software-redefinable bit mask should give true. The OS kernel (below, simply the kernel) is responsible for thread scheduling and switching.
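The address-based test can be sketched as follows; the particular mask value is an assumption (here the loader is presumed to place process-global segments in the upper half of a 32-bit virtual address space):

```python
# Hypothetical, software-redefinable mask separating global from local
# virtual addresses; the loader places global segments accordingly.
GLOBAL_MASK = 0x8000_0000

def is_global(vaddr):
    """True if the virtual address belongs to a process-global segment."""
    return (vaddr & GLOBAL_MASK) != 0
```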
Each processor module contains a local caching mechanism (below, simply the cache) of the same size and organization. The cache contains the following logical blocks:
- two equal data pages, one of which we call active and the other shadowed
- pages access block
- global data management block
- global data page
- at least one DMA channel
Depending on the implementation, the data and instruction caches can be placed either on the same cache page, sharing the entire caching mechanism, or on independent pages, duplicating some (or all) of the mentioned mechanisms.
The pages access block implements one of the known caching strategies and algorithms: e.g. it can be fully associative and write-through, with LRU extrusion.
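As a behavioral sketch only (line count, the backing `memory` dict, and all names are assumptions, not the hardware design), such a fully associative, write-through, LRU block could act like this:

```python
from collections import OrderedDict

class PageAccessBlock:
    """Toy model of the pages access block: fully associative,
    write-through, with LRU extrusion."""

    def __init__(self, lines, memory):
        self.lines = lines            # capacity in cache lines
        self.memory = memory          # stand-in for main memory
        self.cache = OrderedDict()    # tag -> data, oldest first

    def read(self, tag):
        if tag in self.cache:
            self.cache.move_to_end(tag)     # refresh LRU position
            return self.cache[tag]
        data = self.memory[tag]             # miss: fetch from memory
        self._insert(tag, data)
        return data

    def write(self, tag, data):
        self.memory[tag] = data             # write-through: memory first
        self._insert(tag, data)

    def _insert(self, tag, data):
        self.cache[tag] = data
        self.cache.move_to_end(tag)
        if len(self.cache) > self.lines:
            self.cache.popitem(last=False)  # extrude least recently used
```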
The global data management block is used to maintain data coherency within a processor module and also at the scale of the whole system.
The global data page is shared between all threads; this page does not have to be the same size as a local data page and is used directly by the global data management block as an associative memory pool.
The DMA channels are meant for copying data from the cache's shadowed local page to main memory and back. Each of the existing DMA channels works independently with its own (at least at that moment) page region.
The active page is dedicated for use by the task active on this processor module, and the content of this page is the caching context of the current task. The kernel contains the mechanism for downloading the shadowed page content into main memory and uploading it back. At the moment the kernel changes the current task, the cache pages are switched, so the active page becomes shadowed and vice versa. Which page is currently active and which is shadowed is determined by the value of some control register, and this value is changed either automatically or explicitly by the kernel.
At that moment the shadowed page should already contain the caching context of the next task, fully uploaded by the kernel. Just after the task switch, the kernel begins to download the content of the new shadowed page. After that, once the kernel has chosen the next task to switch to, the caching context of that task should be uploaded into the shadowed page. The uploading and downloading can proceed simultaneously, with a time shift to prevent them from colliding.
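A schematic model of this switching sequence follows; DMA timing, the kernel's scheduler, and all names are simplifications assumed for illustration:

```python
class PersistentCache:
    def __init__(self):
        self.pages = [{}, {}]   # two equal data pages
        self.active = 0         # index selected by a control register

    @property
    def active_page(self):
        return self.pages[self.active]

    @property
    def shadowed_page(self):
        return self.pages[1 - self.active]

    def switch(self):
        """Flip the control bit: active becomes shadowed and vice versa."""
        self.active = 1 - self.active

contexts = {}   # per-task caching contexts kept in main memory

def task_switch(cache, old_task, next_after):
    # Pages flip first, so the incoming task runs immediately on the page
    # the kernel pre-loaded for it.
    cache.switch()
    # Then the new shadowed page (the old task's context) is downloaded
    # to main memory...
    contexts[old_task] = dict(cache.shadowed_page)
    # ...and the context of the task scheduled after this one is uploaded.
    cache.shadowed_page.clear()
    cache.shadowed_page.update(contexts.get(next_after, {}))
```

In this model the incoming task's data are available the very cycle after the flip, while saving and restoring happen off the critical path, which is the point of keeping two pages.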
When a cache line is set in the global page, the line tag is extended with some identifier obtained from the OS (which takes it from, e.g., a dedicated register). An example of such an identifier is the process ID; in any case this extended tag should be unique at the scale of the whole system. When the processor requests global data that are not registered in the global page, they should be retrieved from main memory, registered in the global page, and put into the local (active) page.
In case the global page is full, different implementations are possible: e.g. we can decline to cache this value, or evict some value from the active and global pages; in any case all global data in the local pages should be coherent with those in the global page. In the process of uploading data to the shadowed page, the global data management block passes through the page's data (in which, as we remember, global data can easily be separated from local by virtual address) and brings the global lines into conformity with the content of the global page. In a multiprocessor system, the global data management block should implement some mechanism of distributed transactions to maintain system-wide data coherence. Whether the contents of the global pages coincide or not depends on the supported protocol of distributed transactions.
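A minimal sketch of the extended-tag registration and the conformity pass, assuming the same upper-half address mask as before (all structures here are hypothetical):

```python
GLOBAL_MASK = 0x8000_0000   # assumed global/local address discriminator

global_page = {}   # extended tag (pid, vaddr) -> data

def global_read(pid, vaddr, main_memory, local_page):
    """Serve a global-data request: register in the global page on a miss,
    then put the data into the local (active) page."""
    tag = (pid, vaddr)                        # tag extended with the OS id
    if tag not in global_page:
        global_page[tag] = main_memory[vaddr]  # retrieve and register
    local_page[vaddr] = global_page[tag]
    return local_page[vaddr]

def bring_to_conformity(pid, page):
    """Conformity pass during upload: refresh every global line of the
    page from the global page, leaving local lines untouched."""
    for vaddr in page:
        if vaddr & GLOBAL_MASK:
            page[vaddr] = global_page[(pid, vaddr)]
```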
The caching context of each task is stored in memory allocated by the kernel for this task as part of its description.
Touching some control register can stop the caching activity; the caching of global data can be stopped in the same way. In case of a hardware interrupt, e.g. on a virtual page miss, caching activity is stopped until the end of the interrupt handler's work.
The active page additionally has a direct addressing mode. In this way every task can (if necessary) use the active page not as a caching buffer but as its own area of super-fast memory.
What are the advantages of the proposed cache organization? First of all, super-fast memory can have its own address space and its own support instructions, or it can be mapped into the virtual address space. Either way, the existence of such memory is by no means accounted for in high-level languages. The obvious way to utilize it is to store temporary variables (which appear during compilation). Alas, there is no need to store thousands of temporary variables, so we need some new ideas. There are at least three variants.
- A stack is arranged in this memory. The stack contains local variables (temporaries included) that are owned only by the current task. Traditionally, a thread can pass the address of a variable from its own stack to another thread of the same process, but this technique is rather an evil deed and creates more problems than it solves. Also, these data are grouped: caching the top of the stack is always useful, and the actual stack top is usually not too large. When the stack outgrows the memory allocated for it, an exception occurs; the OS allocates an additional buffer for saving the data and uses the super-fast memory as a window of accelerated access to the stack. If there is a request outside the window borders, conversion of the virtual address to a physical one gives an appropriate physical address pointing into one of the previously allocated saving buffers.
- This super-fast memory can be manually managed via OS calls.
- This method assumes some changes in high-level languages and/or the creation of new compilation methods supporting the mentioned architectural details. This is especially important for architectures with explicit instruction-level parallelism (ILP), where caching in general can be harmful due to its inherent unpredictability.
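The stack-window mechanism from the first variant above can be sketched roughly as follows; the window size, the slot-indexed layout, and all names are assumptions made for illustration:

```python
WINDOW = 8                  # fast-memory window size, in words (assumed)

fast_mem = [0] * WINDOW     # the active page used as super-fast memory
save_buffers = {}           # OS-allocated saving buffers: slot -> value

def stack_read(slot, window_base):
    """Read a stack slot. Slots inside the window [window_base,
    window_base + WINDOW) are served by super-fast memory; requests
    outside the window borders fall through address translation into a
    previously allocated saving buffer."""
    if window_base <= slot < window_base + WINDOW:
        return fast_mem[slot - window_base]
    return save_buffers[slot]
```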
The second advantage is the explicit division of data into local and global. This allows direct management of the inter-processor messages intended for coherency support, and the entire spectrum of policies is possible. If we prohibit global data caching, synchronization occurs at the level of main memory access; this is not so bad if global data are rare. If there are no global data at all, coherency support is unnecessary. In the situation where all data are global, we have the same behavior as a traditional cache with physical tags. It is important for a programmer to understand clearly which data are in use at any moment and what the payment for them is. The choice here can be realized by two system calls for memory allocation: one for local and one for process-global data.
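The two-call allocation interface suggested above could look like this; the bump allocators and the address ranges are hypothetical, chosen to match the earlier mask-based global/local test:

```python
GLOBAL_MASK = 0x8000_0000   # assumed global/local address discriminator

_next_local = 0x0000_1000   # local data range (lower half)
_next_global = GLOBAL_MASK  # process-global data range (upper half)

def alloc_local(size):
    """Allocate thread-local data; the address tests false against the mask."""
    global _next_local
    addr = _next_local
    _next_local += size
    return addr

def alloc_global(size):
    """Allocate process-global data; the address tests true against the mask."""
    global _next_global
    addr = _next_global
    _next_global += size
    return addr
```

The programmer's choice of call thereby decides, per allocation, whether the data participate in coherency traffic at all.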
The third advantage: a saved task context can contain some additional data, such as the branch history table (BHT) or the physical address cache (TLB). Some systems keep the BHT together with the instruction cache; in any case, if this table becomes task-personal, it allows improving jump prediction for speculative execution without significant cost (e.g. in the MIPS R10000 the BHT contains just 512 entries). A common TLB size is in the hundreds of entries, so there is no problem arranging it inside the savable cache page. The obvious difficulty is the necessity to take care of swapped-out pages during the context upload.
This letter is the essence of the patent application: B. Muratshin, A. Artiushin, "The persistent cache approach for multitasking and SMP systems", Novosibirsk, Jul 2002, patent application RUS N2002121880/20(022788). OpenCores members can use it freely under the OpenCores license.