\chapter{Cache/Memory Controller Module}

The project includes a cache+memory controller module from revision 114.\\

Both the I- and the D-Cache are implemented, but the parametrization
generics are still mostly unused, with many values hardcoded, and SDRAM is
not supported yet. Besides, there are some loose ends in the implementation
still to be solved, explained in section~\ref{cache_problems}.\\

\section{Cache Initialization and Control}

The cache module comes up from reset in an indeterminate, unusable state.
It needs to be initialized before being enabled.\\
Initialization mostly means marking all D- and I-cache lines as invalid.
The old R3000 had its own means to achieve this, but this core implements an
alternative, simplified scheme.\\

The standard R3000 cache control flags in the SR are not used, either. Instead,
two flags from the SR have been commandeered for cache control.\\

\subsection{Cache control flags}

Bits 17 and 16 of the SR are NOT used for their standard R3000 purpose.
Instead they are used as explained below:

\begin{itemize}
\item Bit 17: Cache enable [reset value = 0]
\item Bit 16: I- and D-Cache line invalidate [reset value = 0]
\end{itemize}


You always use both these flags together to set the cache operating mode:

\begin{itemize}
\item Bits 17:16='00'\\
    Cache is enabled and working.
\item Bits 17:16='01'\\
    Cache is in D- and I-cache line invalidation mode.\\
    Writing word X.X.X.N to ANY address will
    invalidate I-Cache line N (N is an 8-bit value and X is an 8-bit
    don't care). Besides, the actual write will be performed too; be

    Reading from any address will cause the
    corresponding D-cache line to be invalidated; the read will not be
    actually performed and the read value is undefined.
\item Bits 17:16='10'\\
    Cache is disabled; no lines are being invalidated.
\item Bits 17:16='11'\\
    Cache behavior is UNDETERMINED -- i.e. guaranteed crash.
\end{itemize}
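
The mode encoding above can be modelled in a few lines of Python. This is
only an illustrative sketch and not part of the core or its software samples:
the names \texttt{CACHE\_MODES} and \texttt{cache\_mode} are hypothetical, and
only the bit positions and the four modes are taken from the list above.

\begin{verbatim}
```python
# Illustrative model of the cache control flags in SR bits 17:16.
# All names are hypothetical; only the bit positions and the four
# modes are taken from the text.
CACHE_MODES = {
    0b00: "enabled",
    0b01: "invalidate",    # D- and I-cache line invalidation mode
    0b10: "disabled",
    0b11: "undetermined",  # must never be used
}


def cache_mode(sr: int) -> str:
    """Return the cache operating mode encoded in SR bits 17:16."""
    return CACHE_MODES[(sr >> 16) & 0b11]
```
\end{verbatim}

For instance, \texttt{cache\_mode(0x00010000)} selects the invalidation mode,
since only bit 16 is set.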

After reset the cache memory comes up in an undetermined state, but
it also comes up disabled. Before enabling it, you need to invalidate all
cache lines in software (see routine cache\_init in the memtest sample).\\

\section{Cache Tags and Cache Address Mirroring}

In order to save space in the I-Cache tag table, the tags are shorter than
they should be -- 14 bits instead of the 20 bits we would need to cover the
entire 32-bit address:

\begin{verbatim}
             ___________  <-- These address bits are NOT in the tag
           /             \
 31 .. 27 |  26 .. 21   | 20 .. 12 | 11 .. 4 | 3:2 |
|    5    |             |    9     |    8    |  2  |   |
  5 bits                   9 bits    LINE_NUMBER_SIZE
\end{verbatim}

Since bits 26 downto 21 are not included in the tag, there will be a
'mirror' effect in the cache. We have effectively split the memory space
into 32 separate blocks of 2~MB (one block per value of bits 31 downto 27,
each spanning bits 20 downto 0), which is obviously not enough but will do
for the initial tests.
In subsequent versions of the cache, the tag size needs to be enlarged AND
some of the top bits might be omitted when they're not needed to implement
the default memory map (namely bit 30 which is always '0').
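
The mirroring can be checked with a small Python model of the address split.
This is a sketch under assumptions: the field boundaries follow the diagram
above, but the function name \texttt{split\_address}, the dictionary keys and
the order in which the two tag fields are concatenated are hypothetical and
not taken from the VHDL source.

\begin{verbatim}
```python
# Field boundaries follow the address diagram in this section.
# Function and key names are hypothetical, chosen for illustration.
def split_address(addr: int) -> dict:
    """Split a 32-bit byte address into the fields used by the I-Cache."""
    tag_hi = (addr >> 27) & 0x1F    # bits 31..27, stored in the tag
    ignored = (addr >> 21) & 0x3F   # bits 26..21, NOT stored in the tag
    tag_lo = (addr >> 12) & 0x1FF   # bits 20..12, stored in the tag
    line = (addr >> 4) & 0xFF       # bits 11..4, cache line number
    word = (addr >> 2) & 0x3        # bits 3..2, word within the line
    tag = (tag_hi << 9) | tag_lo    # 14-bit tag (concatenation order assumed)
    return {"tag": tag, "ignored": ignored, "line": line, "word": word}


a = split_address(0x00000430)
b = split_address(0x00200430)   # differs from the first address only in bit 21
# Same tag and same line: the cache cannot tell the two addresses apart.
aliased = (a["tag"], a["line"]) == (b["tag"], b["line"])
```
\end{verbatim}

Two addresses that differ only in bits 26 downto 21 map to the same line with
the same tag, which is the mirror effect described above.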
\section{Memory Controller}

The cache functionality and the memory controller functionality are so
closely related that I found it convenient to bundle both in the same
module: I have experimented with separate modules and was unable to come up
with the same performance with my synthesis tools of choice.
So, while it would be desirable to split the cache from the memory controller
at some later version, for the time being they are a single module.\\

The memory controller interfaces the cache to external memory, be it off-core
or off-chip.
The memory controller implements the refill and writethrough state machines,
which will necessarily be different for different kinds of memory.\\

\subsection{Memory Map Definition}

The MIPS architecture specs define a memory map which determines which areas
are cached and what the default address translation is
\cite[p.~2-8]{r3k_ref_man}.\\
Neither the memory translation nor the cacheable/uncacheable attribute of
the standard MIPS architecture has been implemented. In this core, program
addresses are always identical to hardware addresses.\\

When requested to perform a refill or a writethrough, the memory controller
needs to know what type of memory it is dealing with. The type of
memory for each memory area is defined in a hardcoded memory map
implemented in function 'decode\_address\_mips1', defined in package
'mips\_pkg.vhdl'. This function will synthesize into regular, combinational
decode logic.\\

For each address, the memory map logic will supply the following information:

\begin{itemize}
\item What kind of memory it is
\item How many wait states to use
\item Whether it is writeable or not (ignored in current version)
\item Whether it is cacheable or not (ignored in current version)
\end{itemize}

In the present implementation the memory map can't be modified at run time.\\


These are the currently supported memory types:

\begin{tabular}{ll}
\hline
Identifier & Description \\
\hline
MT\_BRAM & Synchronous, on-chip FPGA BRAM\\
MT\_IO\_SYNC & Synchronous, on-chip register (meant for I/O)\\
MT\_SRAM\_16B & Asynchronous, off-chip static memory, 16-bit wide\\
MT\_SRAM\_8B & Asynchronous, off-chip static memory, 8-bit wide\\
MT\_DDR\_16B & Not used yet\\
MT\_UNMAPPED & Unmapped area\\
\hline
\end{tabular}

\subsection{Invalid memory accesses}

Whenever the CPU attempts an invalid memory access, the 'unmapped' output
port of the Cache module will be asserted for 1 clock cycle.

The accesses that will raise the 'unmapped' output are these:

\begin{itemize}
\item Code fetch from address decoded as MT\_IO\_SYNC.
\item Data write to memory address decoded as other than RAM or IO.
\item Any access to an address decoded as MT\_UNMAPPED.
\end{itemize}

The 'unmapped' output is ignored by the current version of the parent MCU
module -- it is only used to raise a permanent flag that is then connected
to an LED for debugging purposes, hardly a useful approach in a real project.

In subsequent versions of the MCU module, the 'unmapped' signal will trigger
a hardware interrupt.

Note again that the memory attributes 'writeable' and 'cacheable' are not
used in the current version. In subsequent versions 'writeable' will be
used and 'cacheable' will be removed.

\subsection{Uncacheable memory areas}

There are no predefined 'uncacheable' memory areas in the current version of
the core; all memory addresses are cacheable unless they are defined as
IO (see mips\_cache.vhdl).\\
In short, if it's not defined as MT\_UNMAPPED or MT\_IO\_SYNC, it is
cacheable.

\section{Cache Refill and Writethrough Chronograms}

The cache state machine deals with cache refills and data writethroughs. It is
this state machine that 'knows' about the type and size of the external
memories. When DRAM is eventually implemented, or when other widths of SRAM are
supported, this state machine will need to change.

The following subsections describe the refill and writethrough procedures.
Bear in mind that there is a single state machine that handles it all -- refills
and writethroughs can't be done simultaneously.

\subsubsection{SRAM interface read cycle timing -- 16-bit interface}

The refill procedure is identical for both D- and I-cache refills. All that
matters is the type of memory being read.

\begin{verbatim}
==== Chronogram 4.1: 16-bit SRAM refill -- DATA ============================
              __    __    __    __    __    __    _     __    __    __    __
clk         _/  \__/  \__/  \__/  \__/  \__/  \__/ ..._/  \__/  \__/  \__/

cache/ps    ?| (1) | (2)             | ...             | (2)             |??

refill_ctr  ?| 0                     | ...             < 3               |??

chip_addr   ?| 210h      | 211h      | ...             | 217h            |--

data_rd     -XXXXX [218h] XXXXX [219h] | ...            XXXXX [217h]     |--
             |<- 2-state sequence ->|
            _                                                             __
sram_oe_n    \____________________________________ ... __________________/
             |<- Total: 24 clock cycles to refill a 4-word cache line ->|
\end{verbatim}

(NOTE: signal names left-clipped to fit page margins)

In the diagram, the data coming into bram\_data\_rd is depicted with some delay.


Signal \emph{cache/ps} is the current state of the cache state machine, and
in this chronogram it takes the following values:

\begin{enumerate}
\item idle
\item data\_refill\_sram\_0
\item data\_refill\_sram\_1
\end{enumerate}

Each of the two refill states reads a halfword from SRAM. The two-state
sequence is repeated four times to refill the four-word cache entry.

Signal \emph{refill\_ctr} counts the word index within the line being refilled,
and runs from 0 to 3.\\

\subsubsection{SRAM interface read cycle timing -- 8-bit interface}

TODO: 8-bit refill procedure to be done.

\subsubsection{SRAM interface write cycle timing}

The path of the state machine that deals with SRAM writethroughs is linear so
a state diagram would not be very interesting. As you can see in the source
code, all the states are one clock long except for states
\emph{data\_writethrough\_sram\_0b} and \emph{data\_writethrough\_sram\_1b},
which will be as long as the number of wait states plus one.
This is the only writethrough parameter that is influenced by the number of
wait states.
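
The rule above gives a simple cycle count for one writethrough. The following
Python sketch assumes the six-state path listed with the writethrough
chronogram in this section (states 0a, 0b, 0c, 1a, 1b and 1c) and does not
count the idle state; the function name \texttt{writethrough\_cycles} is
hypothetical.

\begin{verbatim}
```python
# Sketch of the cycle count implied by the text: six states per
# writethrough, of which the two "b" states last (wait_states + 1) clocks.
# Idle cycles are not counted. The function name is hypothetical.
def writethrough_cycles(wait_states: int) -> int:
    single_cycle_states = 4                  # 0a, 0c, 1a, 1c
    b_state_cycles = 2 * (wait_states + 1)   # 0b and 1b
    return single_cycle_states + b_state_cycles
```
\end{verbatim}

With 1 wait state this gives 8 clock cycles for the two 16-bit accesses.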
259 |
260 |
261 |
A general memory write will be 32-bit wide and thus it will take two 16-bit
262 |
memory accesses to complete. Unaligned, halfword or byte wide CPU writes might
263 |
in some cases be optimized to take only a single 16-bit memory access. This
264 |
module does no such optimization.
265 |
For simplicity, all writethroughs take two 16-bit access cycles, even if one
266 |
of them has both we\_n signals deasserted.\\

The following chronogram has been copied from a simulation of the 'hello'
sample. It's a 32-bit wide write to address 00000430h.
As you can see, the 'chip address' (the address fed to the SRAM chip) is the
target address divided by 2 (because the SRAM is addressed in 16-bit
halfwords, that is, 2 bytes per chip address, and there are 2 halfwords to the
32-bit word). In this particular case, all the four bytes of the long word are
written and so both the we\_n signals are asserted for both halfwords.
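
The address arithmetic for this example can be reproduced with a short Python
sketch. The helper name \texttt{sram16\_chip\_addresses} is hypothetical; it
only restates the divide-by-two rule for a word-aligned byte address.

\begin{verbatim}
```python
# Hypothetical helper: maps a word-aligned byte address to the two
# halfword addresses presented to a 16-bit SRAM chip.
def sram16_chip_addresses(byte_addr: int) -> list:
    first = byte_addr >> 1          # target address divided by 2
    return [first, first + 1]       # one access per 16-bit halfword


chip_addrs = [hex(a) for a in sram16_chip_addresses(0x00000430)]
```
\end{verbatim}

For the write to 00000430h, \texttt{chip\_addrs} holds the chip addresses 218h
and 219h.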

In this example, the SRAM is being accessed with 1 WS: WE\_N is asserted for
two cycles.
Note how a lot of cycles are lost in order to guarantee compliance with the
setup and hold times of the SRAM on the WE, address and data lines.

\begin{verbatim}
==== Chronogram 4.3: 16-bit SRAM writethrough, 32-bit wide =================
                     __    __    __    __    __    __    __    __    __    _
clk                _/  \__/  \__/  \__/  \__/  \__/  \__/  \__/  \__/  \__/

cache/ps           ?| (1) | (2) | (3)       | (4) | (5) | (6)       | (7) |?

sram_chip_addr           ?| 218h                  | 219h                  |?

sram_data_wr       -------| 0000h                 | 044Ch                 |-
                   _____________             ___________             _______
sram_byte_we_n(0)               \___________/           \___________/
                   _____________             ___________             _______
sram_byte_we_n(1)               \___________/           \___________/
\end{verbatim}

Signal \emph{cache/ps} is the current state of the cache state machine, and
in this chronogram it takes the following values:

\begin{enumerate}
\item idle
\item data\_writethrough\_sram\_0a
\item data\_writethrough\_sram\_0b
\item data\_writethrough\_sram\_0c
\item data\_writethrough\_sram\_1a
\item data\_writethrough\_sram\_1b
\item data\_writethrough\_sram\_1c
\end{enumerate}

\section{Known Problems}
\label{cache_problems}

\begin{itemize}
\item All parameters hardcoded -- generics are almost ignored.
\item SRAM read state machine does not guarantee internal FPGA $T_{hold}$.
    On my current target board it works because the FPGA hold times
    (including an input mux in the parent module) are far smaller than the
    SRAM response times, but it would be better to insert an extra cycle
    after the wait states in the SRAM read state machine.
\end{itemize}