\chapter{Cache/Memory Controller Module}
\label{cache}

The project includes a cache+memory controller module from revision 114.\\

Both the I- and the D-Cache are implemented, but the parametrization
generics are still mostly unused, with many values hardcoded, and SDRAM is
not supported yet. Besides, there are some loose ends in the implementation
still to be solved, explained in section~\ref{cache_problems}.\\

\section{Cache Initialization and Control}
\label{cache_init_and_control}

The cache module comes up from reset in an indeterminate, unusable state.
It needs to be initialized before being enabled.\\
Initialization mostly means marking all D- and I-cache lines as invalid.
The old R3000 had its own means to achieve this, but this core implements an
alternative, simplified scheme.\\

The standard R3000 cache control flags in the SR are not used, either. Instead,
two flags from the SR have been repurposed for cache control.\\

\subsection{Cache control flags}
\label{cache_control_flags}

Bits 17 and 16 of the SR are NOT used for their standard R3000 purpose.
Instead they are used as explained below:

\begin{itemize}
\item Bit 17: Cache enable [reset value = 0]
\item Bit 16: I- and D-Cache line invalidate [reset value = 0]
\end{itemize}

You always use both these flags together to set the cache operating mode:

\begin{itemize}
\item Bits 17:16='00'\\
    Cache is enabled and working.
\item Bits 17:16='01'\\
    Cache is in D- and I-cache line invalidation mode.\\
    Writing word X.X.X.N to ANY address will
    invalidate I-Cache line N (N is an 8-bit value and X is an 8-bit
    don't care). Besides, the actual write will be performed too; be
    careful...\\

    Reading from any address will cause the
    corresponding D-cache line to be invalidated; the read will not be
    actually performed and the read value is undefined.\\
\item Bits 17:16='10'\\
    Cache is disabled; no lines are being invalidated.\\
\item Bits 17:16='11'\\
    Cache behavior is UNDETERMINED -- i.e. guaranteed crash.\\
\end{itemize}

After reset, the cache memory comes up in an undetermined state, but it also
comes up disabled. Before enabling it, you need to invalidate all
cache lines in software (see routine cache\_init in the memtest sample).\\
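
The mode encoding and the line-invalidation rule above can be sketched in
software. The Python below is a hypothetical behavioural model built from this
section's description (the function names are illustrative, not part of the
core):

```python
# Hypothetical model of the repurposed SR cache-control bits described
# above: bit 17 = cache enable, bit 16 = line invalidate.
CACHE_ENABLE = 1 << 17
CACHE_INVALIDATE = 1 << 16

def cache_mode(sr):
    """Decode SR bits 17:16 into the operating mode listed above."""
    bits = (sr & (CACHE_ENABLE | CACHE_INVALIDATE)) >> 16
    return {
        0b00: "enabled",
        0b01: "invalidate",   # reads/writes invalidate cache lines
        0b10: "disabled",
        0b11: "undefined",    # guaranteed crash per the text
    }[bits]

def icache_line_invalidated_by_write(data_word):
    # In invalidate mode, writing word X.X.X.N invalidates I-cache line N,
    # where N is the low byte of the written DATA, regardless of address.
    return data_word & 0xFF
```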

\section{Cache Tags and Cache Address Mirroring}
\label{cache_tags}

In order to save space in the I-Cache tag table, the tags are shorter than
they should be -- 14 bits instead of the 20 bits we would need to cover the
entire 32-bit address:

\needspace{10\baselineskip}
\begin{verbatim}

              _________  <-- These address bits are NOT in the tag
             /         \
     31 .. 27| 26 .. 21 |20 .. 12|11 .. 4|3:2|
    +--------+----------+--------+-------+---+
    |    5   |          |    9   |   8   | 2 |
    +--------+----------+--------+-------+---+
        ^                    ^       ^     ^--- LINE_INDEX_SIZE
      5 bits               9 bits    LINE_NUMBER_SIZE

\end{verbatim}\\

Since bits 26 downto 21 are not included in the tag, there will be a
'mirror' effect in the cache: addresses that differ only in those bits alias
each other. We have effectively split the memory space
into 32 separate blocks of 2MB, which is obviously not enough but will do
for the initial versions of the core.

In subsequent versions of the cache, the tag size needs to be enlarged AND
some of the top bits might be omitted when they're not needed to implement
the default MIPS memory map (namely bit 30, which is always '0').
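
The field split above can be modeled to demonstrate the mirror effect. This
Python sketch uses the bit positions from the diagram; the function name is
illustrative:

```python
# Model of the tag split shown above: the 14-bit tag is bits 31:27 plus
# bits 20:12; bits 26:21 are NOT tagged. Field widths taken from the
# diagram; this is an illustration, not the core's VHDL.

def split_address(addr):
    tag_hi = (addr >> 27) & 0x1F     # bits 31:27, 5 bits
    tag_lo = (addr >> 12) & 0x1FF    # bits 20:12, 9 bits
    line   = (addr >> 4) & 0xFF      # bits 11:4, LINE_NUMBER
    index  = (addr >> 2) & 0x3       # bits 3:2, LINE_INDEX
    tag = (tag_hi << 9) | tag_lo     # 14-bit tag
    return tag, line, index

# Two addresses that differ only in the untagged bits 26:21 get the same
# tag, line and index -- the 'mirror' effect: they alias in the cache.
a = 0x0000_0430
b = a | (1 << 21)                    # 2MB away
assert split_address(a) == split_address(b)
```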

\section{Memory Controller}
\label{memory_controller}

The cache functionality and the memory controller functionality are so
closely related that I found it convenient to bundle both in the same
module: I have experimented with separate modules and was unable to achieve
the same performance with my synthesis tools of choice.
So, while it would be desirable to split the cache from the memory controller
in some later version, for the time being they are a single module.\\

The memory controller interfaces the cache to external memory, be it off-core
or off-chip.
It implements the refill and writethrough state machines,
which will necessarily be different for different kinds of memory.\\

\subsection{Memory Map Definition}
\label{memory_map_definition}

The MIPS architecture specs define a memory map which determines which areas
are cached and what the default address translation is \cite[p.~2-8]{r3k_ref_man}.\\
Neither the memory translation nor the cacheable/uncacheable attribute of
the standard MIPS architecture has been implemented. In this core, program
addresses are always identical to hardware addresses.\\

When requested to perform a refill or a writethrough, the memory controller
needs to know what type of memory it is dealing with. The type of
memory for each memory area is defined in a hardcoded memory map
implemented in function 'decode\_address\_mips1', defined in package
'mips\_pkg.vhdl'. This function will synthesize into regular, combinational
decode logic.\\

For each address, the memory map logic will supply the following information:

\begin{enumerate}
\item What kind of memory it is.
\item How many wait states to use.
\item Whether it is writeable or not (ignored in current version).
\item Whether it is cacheable or not (ignored in current version).
\end{enumerate}

In the present implementation the memory map can't be modified at run time.\\
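
Functionally, 'decode\_address\_mips1' amounts to a pure combinational lookup
from address to memory attributes. The Python sketch below illustrates the
idea; the address ranges are invented for the example and do NOT reflect the
actual map in 'mips\_pkg.vhdl':

```python
# Sketch of what a hardcoded address decoder boils down to: a pure
# function from address to memory attributes. The ranges below are
# hypothetical, for illustration only.
from collections import namedtuple

MemAttr = namedtuple("MemAttr", "mem_type wait_states writeable cacheable")

def decode_address(addr):
    if addr < 0x0000_4000:                   # hypothetical on-chip BRAM
        return MemAttr("MT_BRAM", 0, True, True)
    if 0x2000_0000 <= addr < 0x2010_0000:    # hypothetical 16-bit SRAM
        return MemAttr("MT_SRAM_16B", 1, True, True)
    if 0xFFFF_0000 <= addr <= 0xFFFF_FFFF:   # hypothetical I/O registers
        return MemAttr("MT_IO_SYNC", 0, True, False)
    return MemAttr("MT_UNMAPPED", 0, False, False)
```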

These are the currently supported memory types:

\begin{tabular}{ll}
\hline
Identifier & Description \\
\hline
MT\_BRAM & Synchronous, on-chip FPGA BRAM\\
MT\_IO\_SYNC & Synchronous, on-chip register (meant for I/O)\\
MT\_SRAM\_16B & Asynchronous, off-chip static memory, 16-bit wide\\
MT\_SRAM\_8B & Asynchronous, off-chip static memory, 8-bit wide\\
MT\_DDR\_16B & Not used yet\\
MT\_UNMAPPED & Unmapped area\\
\hline
\end{tabular}

\subsection{Invalid memory accesses}
\label{invalid_memory}

Whenever the CPU attempts an invalid memory access, the 'unmapped' output
port of the Cache module will be asserted for 1 clock cycle.

The accesses that will raise the 'unmapped' output are these:

\begin{enumerate}
\item Code fetch from an address decoded as MT\_IO\_SYNC.
\item Data write to a memory address decoded as other than RAM or IO.
\item Any access to an address decoded as MT\_UNMAPPED.
\end{enumerate}
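
The three cases above can be summarized as a small decision function. The
following is a behavioural model derived from this list and the memory-type
table, not the actual VHDL:

```python
# Model of the 'unmapped' decision: given the decoded memory type and the
# kind of access, decide whether the cache module would pulse 'unmapped'.
# Memory-type names are from the table in the previous subsection.

def raises_unmapped(mem_type, access):
    # access is one of "fetch", "read", "write"
    if mem_type == "MT_UNMAPPED":
        return True                     # any access to unmapped space
    if access == "fetch" and mem_type == "MT_IO_SYNC":
        return True                     # code fetch from I/O
    ram_or_io = {"MT_BRAM", "MT_SRAM_16B", "MT_SRAM_8B", "MT_IO_SYNC"}
    if access == "write" and mem_type not in ram_or_io:
        return True                     # write to something not RAM or IO
    return False
```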

The 'unmapped' output is ignored by the current version of the parent MCU
module -- it is only used to raise a permanent flag that is then connected
to a LED for debugging purposes, hardly a useful approach in a real project.

In subsequent versions of the MCU module, the 'unmapped' signal will trigger
a hardware interrupt.

Note again that the memory attributes 'writeable' and 'cacheable' are not
used in the current version. In subsequent versions 'writeable' will be
used and 'cacheable' will be removed.

\subsection{Uncacheable memory areas}
\label{uncacheable_memory}

There are no predefined 'uncacheable' memory areas in the current version of
the core; all memory addresses are cacheable unless they are defined as
IO (see mips\_cache.vhdl).\\
In short, if it's not defined as MT\_UNMAPPED or MT\_IO\_SYNC, it is
cacheable.

\section{Cache Refill and Writethrough Chronograms}
\label{cache_state_machine}

The cache state machine deals with cache refills and data writethroughs. It is
this state machine that 'knows' about the type and size of the external
memories. When DRAM is eventually implemented, or when other widths of SRAM are
supported, this state machine will need to change.

The following subsections describe the refill and writethrough procedures.
Bear in mind that there is a single state machine that handles it all -- refills
and writethroughs can't be done simultaneously.

\subsubsection{SRAM interface read cycle timing -- 16-bit interface}
\label{sram_read_cycle_16b}

The refill procedure is identical for both D- and I-cache refills. All that
matters is the type of memory being read.

\needspace{15\baselineskip}
\begin{verbatim}
==== Chronogram 4.1: 16-bit SRAM refill -- DATA ============================
             __    __    __    __    __    __        __    __    __    __
clk        _/  \__/  \__/  \__/  \__/  \__/  \_ ..._/  \__/  \__/  \__/

cache/ps    ?|  (1)  |  (2)  |      ...      |  (2)  |??

refill_ctr  ?|       0       |      ...      |   3   |??

chip_addr   ?| 210h  | 211h  |      ...      | 217h  |--

data_rd     -XXXXX [218h] XXXXX [219h]  ...  XXXXX [217h] |--
             |<- 2-state sequence ->|
            _                                                        __
sram_oe_n    \________________________________ ... _________________/

   |<- Total: 24 clock cycles to refill a 4-word cache line ->|
============================================================================
\end{verbatim}\\

(NOTE: signal names left-clipped to fit page margins)

In the diagram, the data coming into bram\_data\_rd is depicted with some delay.

Signal \emph{cache/ps} is the current state of the cache state machine, and
in this chronogram it takes the following values:

\begin{enumerate}
\item data\_refill\_sram\_0
\item data\_refill\_sram\_1
\end{enumerate}

Each of the two states reads a halfword from SRAM. The two-state sequence is
repeated four times to refill the four-word cache line.

Signal \emph{refill\_ctr} counts the word index within the line being refilled,
and runs from 0 to 3.\\
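
The cycle count in the chronogram can be reconstructed as follows; the
three-cycles-per-halfword figure is an assumption derived from the 24-cycle
total shown above, not a parameter stated elsewhere:

```python
# Back-of-the-envelope refill timing for the 16-bit SRAM interface.
# A 4-word cache line is 8 halfword reads; assuming 3 clock cycles per
# halfword read (consistent with the chronogram's 24-cycle total).
WORDS_PER_LINE = 4
HALFWORDS_PER_WORD = 2
CYCLES_PER_READ = 3   # assumption: matches the total shown above

def refill_cycles():
    return WORDS_PER_LINE * HALFWORDS_PER_WORD * CYCLES_PER_READ
```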

\subsubsection{SRAM interface read cycle timing -- 8-bit interface}
\label{sram_read_cycle_8b}

The refill from an 8-bit static memory is essentially the same as depicted
above, except we need to read 4 bytes (over the LSB lines of the static memory
data bus) instead of 2 16-bit halfwords. The operation takes correspondingly
longer to perform and uses an extra address line but is otherwise identical.

TODO: 8-bit refill chronogram to be done.

\subsubsection{16-bit SRAM interface write cycle timing}
\label{sram_write_cycle}

The path of the state machine that deals with SRAM writethroughs is linear, so
a state diagram would not be very interesting. As you can see in the source
code, all the states are one clock long except for states
\emph{data\_writethrough\_sram\_0b} and \emph{data\_writethrough\_sram\_1b},
which will be as long as the number of wait states plus one.
This is the only writethrough parameter that is influenced by the wait state
attribute.\\

A general memory write will be 32 bits wide and thus will take two 16-bit
memory accesses to complete. Unaligned, halfword or byte wide CPU writes might
in some cases be optimized to take only a single 16-bit memory access. This
module does no such optimization yet.
For simplicity, all writethroughs take two 16-bit access cycles, even if one
of them has both we\_n signals deasserted.\\

The following chronogram has been copied from a simulation of the 'hello'
sample. It's a 32-bit wide write to address 00000430h.
As you can see, the 'chip address' (the address fed to the SRAM chip) is the
target address divided by 2 (because there are two 16-bit halfwords to the
32-bit word). In this particular case, all four bytes of the word are written
and so both we\_n signals are asserted for both halfwords.

In this example, the SRAM is being accessed with 1 WS: WE\_N is asserted for
two cycles.
Note how a number of cycles are spent in order to guarantee compliance with
the setup and hold times of the SRAM relative to the we, address and data lines.
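
The chip-address arithmetic described above can be sketched in a few lines;
the high-halfword-first ordering is read off chronogram 4.3, and the function
name is illustrative:

```python
# Sketch of the writethrough address arithmetic: the SRAM 'chip address'
# is the halfword address (byte address / 2), and a 32-bit write becomes
# two 16-bit accesses at consecutive chip addresses.

def writethrough_accesses(byte_addr, data32):
    chip = byte_addr >> 1                 # halfword address fed to the SRAM
    hi = (data32 >> 16) & 0xFFFF
    lo = data32 & 0xFFFF
    # Assumption: high halfword first, matching chronogram 4.3
    return [(chip, hi), (chip + 1, lo)]

# Write to 00000430h as in the 'hello' simulation: chip addresses 218h/219h,
# data halfwords 0000h then 044Ch.
assert writethrough_accesses(0x430, 0x0000_044C) == [(0x218, 0x0000),
                                                     (0x219, 0x044C)]
```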

\needspace{15\baselineskip}
\begin{verbatim}
==== Chronogram 4.3: 16-bit SRAM writethrough, 32-bit wide =================
                     __    __    __    __    __    __    __    __    __    _
clk                _/  \__/  \__/  \__/  \__/  \__/  \__/  \__/  \__/  \__/

cache/ps           ?| (1) | (2) | (3) | (4) | (5) | (6) | (7) |?

sram_chip_addr     ?|      218h      |      219h      |?

sram_data_wr       -------|  0000h   |     044Ch      |-
                   _______            ____             _______
sram_byte_we_n(0)         \__________/    \___________/
                   _______            ____             _______
sram_byte_we_n(1)         \__________/    \___________/

                   ________________________________________________________
sram_oe_n
============================================================================
\end{verbatim}\\

Signal \emph{cache/ps} is the current state of the cache state machine, and
in this chronogram it takes the following values:

\begin{enumerate}
\item data\_writethrough\_sram\_0a
\item data\_writethrough\_sram\_0b
\item data\_writethrough\_sram\_0c
\item data\_writethrough\_sram\_1a
\item data\_writethrough\_sram\_1b
\item data\_writethrough\_sram\_1c
\end{enumerate}

\section{Known Problems}
\label{cache_problems}

The cache implementation is still provisional and has a number of
acknowledged problems:

\begin{enumerate}
\item All parameters are hardcoded -- generics are almost ignored.
\item The SRAM read state machine does not guarantee the internal FPGA
    $T_{hold}$. On my current target board it works because the FPGA hold
    times (including an input mux in the parent module) are far smaller than
    the SRAM response times, but it would be better to insert an extra cycle
    after the wait states in the SRAM read state machine.
\item Cache logic is mixed with memory controller logic.
\end{enumerate}