\chapter{Cache/Memory Controller Module}
\label{cache}

The project includes a cache+memory controller module from revision 114.\\

Both the I- and the D-Cache are implemented, but the parametrization
generics are still mostly unused, with many values hardcoded, and SDRAM is
not supported yet. Besides, there are some loose ends in the implementation
still to be solved, explained in section~\ref{cache_problems}.\\

\section{Cache Initialization and Control}
\label{cache_init_and_control}

The cache module comes up from reset in an indeterminate, unusable state.
It needs to be initialized before being enabled.\\
Initialization means mostly marking all D- and I-cache lines as invalid.
The old R3000 had its own means to achieve this, but this core implements an
alternative, simplified scheme.\\

The standard R3000 cache control flags in the SR are not used, either. Instead,
two flags from the SR have been repurposed for cache control.\\

\subsection{Cache control flags}
\label{cache_control_flags}

Bits 17 and 16 of the SR are NOT used for their standard R3000 purpose.
Instead they are used as explained below:

\begin{itemize}
\item Bit 17: Cache enable [reset value = 0]
\item Bit 16: I- and D-Cache line invalidate [reset value = 0]
\end{itemize}

You always use both these flags together to set the cache operating mode:

\begin{itemize}
\item Bits 17:16='00'\\
Cache is enabled and working.
\item Bits 17:16='01'\\
Cache is in D- and I-cache line invalidation mode.\\
Writing word X.X.X.N to ANY address will
invalidate I-Cache line N (N is an 8-bit line number and X is an 8-bit
don't care). Besides, the actual write will be performed too; be
careful...\\

Reading from any address will cause the
corresponding D-cache line to be invalidated; the read will not
actually be performed and the read value is undefined.\\
\item Bits 17:16='10'\\
Cache is disabled; no lines are being invalidated.\\
\item Bits 17:16='11'\\
Cache behavior is UNDETERMINED -- i.e. guaranteed crash.\\
\end{itemize}

Now, after reset the cache memory comes up in an undetermined state, but
it also comes up disabled. Before enabling it, you need to invalidate all
cache lines in software (see routine cache\_init in the memtest sample).\\

\section{Cache Tags and Cache Address Mirroring}
\label{cache_tags}

In order to save space in the I-Cache tag table, the tags are shorter than
they should be -- 14 bits instead of the 20 bits we would need to cover the
entire 32-bit address:

\needspace{10\baselineskip}
\begin{verbatim}

             _________   <-- These address bits are NOT in the tag
            /         \
|31 .. 27 | 26 .. 21  |     20 .. 12    |    11 .. 4    |3:2|   |
+---------+-----------+-----------------+---------------+---+---+
|    5    |           |        9        |       8       | 2 |   |
+---------+-----------+-----------------+---------------+---+---+
     ^                         ^                ^         ^- LINE_INDEX_SIZE
  5 bits                    9 bits      LINE_NUMBER_SIZE

\end{verbatim}\\

Since bits 26 downto 21 are not included in the tag, there will be a
'mirror' effect in the cache. We have effectively split the memory space
into 32 separate blocks of 2MB each, which is obviously not enough but will
do for the initial versions of the core.

In subsequent versions of the cache, the tag size needs to be enlarged AND
some of the top bits might be omitted when they're not needed to implement
the default MIPS memory map (namely bit 30 which is always '0').

\section{Memory Controller}
\label{memory_controller}

The cache functionality and the memory controller functionality are so
closely related that I found it convenient to bundle both in the same
module: I have experimented with separate modules and was unable to achieve
the same performance with my synthesis tools of choice.
So, while it would be desirable to split the cache from the memory controller
in some later version, for the time being they are a single module.\\

The memory controller interfaces the cache to external memory, be it off-core
or off-chip.
The memory controller implements the refill and writethrough state machines,
which will necessarily be different for different kinds of memory.\\

\subsection{Memory Map Definition}
\label{memory_map_definition}

The MIPS architecture specs define a memory map which determines which areas
are cached and what the default address translation is \cite[p.~2-8]{r3k_ref_man}.\\
Neither the memory translation nor the cacheable/uncacheable attribute of
the standard MIPS architecture has been implemented. In this core, program
addresses are always identical to hardware addresses.\\

When requested to perform a refill or a writethrough, the memory controller
needs to know what type of memory it will be dealing with. The type of
memory for each memory area is defined in a hardcoded memory map
implemented in function 'decode\_address\_mips1', defined in package
'mips\_pkg.vhdl'. This function will synthesize into regular, combinational
decode logic.\\

For each address, the memory map logic will supply the following information:

\begin{enumerate}
\item What kind of memory it is.
\item How many wait states to use.
\item Whether it is writeable or not (ignored in current version).
\item Whether it is cacheable or not (ignored in current version).
\end{enumerate}

In the present implementation the memory map can't be modified at run time.\\

These are the currently supported memory types:

\begin{tabular}{ll}
\hline
Identifier & Description \\
\hline
MT\_BRAM & Synchronous, on-chip FPGA BRAM\\
MT\_IO\_SYNC & Synchronous, on-chip register (meant for I/O)\\
MT\_SRAM\_16B & Asynchronous, off-chip static memory, 16-bit wide\\
MT\_SRAM\_8B & Asynchronous, off-chip static memory, 8-bit wide\\
MT\_DDR\_16B & Not used yet\\
MT\_UNMAPPED & Unmapped area\\
\hline
\end{tabular}

\subsection{Invalid memory accesses}
\label{invalid_memory}

Whenever the CPU attempts an invalid memory access, the 'unmapped' output
port of the Cache module will be asserted for 1 clock cycle.

The accesses that will raise the 'unmapped' output are these:

\begin{enumerate}
\item Code fetch from an address decoded as MT\_IO\_SYNC.
\item Data write to a memory address decoded as anything other than RAM or IO.
\item Any access to an address decoded as MT\_UNMAPPED.
\end{enumerate}

The 'unmapped' output is ignored by the current version of the parent MCU
module -- it is only used to raise a permanent flag that is then connected
to a LED for debugging purposes, hardly a useful approach in a real project.

In subsequent versions of the MCU module, the 'unmapped' signal will trigger
a hardware interrupt.

Note again that the memory attributes 'writeable' and 'cacheable' are not
used in the current version. In subsequent versions 'writeable' will be
used and 'cacheable' will be removed.

\subsection{Uncacheable memory areas}
\label{uncacheable_memory}

There are no predefined 'uncacheable' memory areas in the current version of
the core; all memory addresses are cacheable unless they are defined as
IO (see mips\_cache.vhdl).\\
In short, if an address is not decoded as MT\_UNMAPPED or MT\_IO\_SYNC, it is
cacheable.

\section{Cache Refill and Writethrough Chronograms}
\label{cache_state_machine}

The cache state machine deals with cache refills and data writethroughs. It is
this state machine that 'knows' about the type and size of the external
memories. When DRAM is eventually implemented, or when other widths of SRAM
are supported, this state machine will need to change.

The following subsections will describe the refill and writethrough procedures.
Bear in mind that there is a single state machine that handles it all -- refills
and writethroughs can't be done simultaneously.

\subsubsection{SRAM interface read cycle timing -- 16-bit interface}
\label{sram_read_cycle_16b}

The refill procedure is identical for both D- and I-cache refills. All that
matters is the type of memory being read.

\needspace{15\baselineskip}
\begin{verbatim}
==== Chronogram 4.1: 16-bit SRAM refill -- DATA ============================
             __    __    __    __    __    __        __    __    __
clk        _/  \__/  \__/  \__/  \__/  \__/  \__/..._/  \__/  \__/  \__/

cache/ps   ?|   (1)   |   (2)   |      ...      |   (2)   |??

refill_ctr ?|         0         |      ...      |    3    |??

chip_addr  ?|   210h  |   211h  |      ...      |  217h   |--

data_rd    -XXXXX [218h] XXXXX [219h] |  ...  XXXXX [217h] |--
            |<- 2-state sequence ->|
            _                                                  __
sram_oe_n    \_______________________________ ... ____________/
           |<- Total: 24 clock cycles to refill a 4-word cache line ->|
============================================================================
\end{verbatim}\\

(NOTE: signal names left-clipped to fit page margins)

In the diagram, the data coming into bram\_data\_rd is depicted with some delay.

Signal \emph{cache/ps} is the current state of the cache state machine, and
in this chronogram it takes the following values:

\begin{enumerate}
\item data\_refill\_sram\_0
\item data\_refill\_sram\_1
\end{enumerate}

Each of the two states reads a halfword from SRAM. The two-state sequence is
repeated four times to refill the four-word cache entry.

Signal \emph{refill\_ctr} counts the word index within the line being refilled,
and runs from 0 to 3.\\

\subsubsection{SRAM interface read cycle timing -- 8-bit interface}
\label{sram_read_cycle_8b}

The refill from an 8-bit static memory is essentially the same as depicted
above, except we need to read 4 bytes (over the LSB lines of the static memory
data bus) instead of two 16-bit halfwords. The operation takes correspondingly
longer to perform and uses an extra address line but is otherwise identical.

TODO: 8-bit refill chronogram to be done.

\subsubsection{16-bit SRAM interface write cycle timing}
\label{sram_write_cycle}

The path of the state machine that deals with SRAM writethroughs is linear, so
a state diagram would not be very interesting. As you can see in the source
code, all the states are one clock long except for states
\emph{data\_writethrough\_sram\_0b} and \emph{data\_writethrough\_sram\_1b},
which will be as long as the number of wait states plus one.
This is the only writethrough parameter that is influenced by the wait state
attribute.\\

A general memory write will be 32 bits wide and thus will take two 16-bit
memory accesses to complete. Unaligned, halfword or byte-wide CPU writes might
in some cases be optimized to take only a single 16-bit memory access. This
module does no such optimization yet.
For simplicity, all writethroughs take two 16-bit access cycles, even if one
of them has both we\_n signals deasserted.\\

The following chronogram has been copied from a simulation of the 'hello'
sample. It's a 32-bit wide write to address 00000430h.
As you can see, the 'chip address' (the address fed to the SRAM chip) is the
target address divided by 2 (because there are two 16-bit halfwords to the
32-bit word). In this particular case, all four bytes of the long word are
written and so both we\_n signals are asserted for both halfwords.

In this example, the SRAM is being accessed with 1 WS: WE\_N is asserted for
two cycles.
Note how a number of cycles are spent in order to guarantee compliance with
the setup and hold times of the SRAM against the we, address and data lines.

\needspace{15\baselineskip}
\begin{verbatim}
==== Chronogram 4.3: 16-bit SRAM writethrough, 32-bit wide =================
                     __    __    __    __    __    __    __    __    __
clk               __/  \__/  \__/  \__/  \__/  \__/  \__/  \__/  \__/  \__/

cache/ps          ?| (1) |    (2)    | (3) | (4) |    (5)    | (6) | (7) |?

sram_chip_addr    ?|        218h           |        219h            |?

sram_data_wr      -------|     0000h       |        044Ch           |-
                  ______            ____________            _______________
sram_byte_we_n(0)       \__________/            \__________/
                  ______            ____________            _______________
sram_byte_we_n(1)       \__________/            \__________/
                  _________________________________________________________
sram_oe_n
============================================================================
\end{verbatim}\\

Signal \emph{cache/ps} is the current state of the cache state machine, and
in this chronogram it takes the following values:

\begin{enumerate}
\item data\_writethrough\_sram\_0a
\item data\_writethrough\_sram\_0b
\item data\_writethrough\_sram\_0c
\item data\_writethrough\_sram\_1a
\item data\_writethrough\_sram\_1b
\item data\_writethrough\_sram\_1c
\end{enumerate}

\section{Known Problems}
\label{cache_problems}

The cache implementation is still provisional and has a number of
acknowledged problems:

\begin{enumerate}
\item All parameters hardcoded -- generics are almost ignored.
\item SRAM read state machine does not guarantee internal FPGA $T_{hold}$.
In my current target board it works because the FPGA hold times
(including an input mux in the parent module) are far smaller than the
SRAM response times, but it would be better to insert an extra cycle
after the wait states in the SRAM read state machine.
\item Cache logic mixed with memory controller logic.
\end{enumerate}