URL
https://opencores.org/ocsvn/ion/ion/trunk
Subversion Repositories ion
Compare Revisions
- This comparison shows the changes necessary to convert path
/ion/trunk/doc
- from Rev 155 to Rev 156
Rev 155 → Rev 156
/ion_project.txt
1,8 → 1,8
This file contains instructions and notes about the Ion CPU core project. |
The core structure is briefly explained in section 2. The rest of this doc is |
mostly usage instructions for the test samples and custom utilities. |
In its present state it' more a reminder to myself than a proper explaination of |
the system. |
In its present state it's more a reminder to myself than a proper explanation |
of the system. |
|
Last modified: Apr/04/2011 |
|
36,12 → 36,15
2.- Catch all undefined opcodes (and trigger exception). |
3.- Operate in kernel/user mode as per the architecture definition. |
4.- Handle exceptions in a manner compatible to MIPS-I standard. |
5.- Implement as much of CP0 as necessary for the above goals. |
6.- Be no bigger than Plasma in a Spartan-3 or Cyclone-2 device, and |
5.- Code cache and data cache, even if not standard. |
No MMU and no TLB, and no cache-related instructions. |
6.- Implement as much of CP0 as necessary for the above goals. |
7.- Interface to external SRAM (or FLASH) on 8- and 16-bit data bus. |
8.- Be no bigger than Plasma in a Spartan-3 or Cyclone-2 device, and |
no slower -- Plasma is used as a reference in many ways. |
Speed measured in raw clock frequency for the time being. |
(I.e. don't consider stalls, interlocks, etc. yet) |
7.- Interlock behavior of MUL/DIV and L* compatible to toolchain. |
9.- Interlock behavior of MUL/DIV and L* compatible to toolchain. |
That is, interlock loads instead of relying on a delay slot. |
|
Unaligned load/stores are excluded not because of patent concerns (the |
48,53 → 51,56
patents already expired) but because they're not essential for a first |
version of the core. The same goes for all other exclusions. |
|
As of rev. 154 all the 1st block goals have been accomplished (but not very |
heavily tested; many bugs remain, probably). |
|
|
For a second iteration I plan on the following: |
|
1.- Proper interlocking of load cycles (with no wasted cycles). |
2.- External interrupt support. |
3.- Code cache and data cache, even if not standard. |
No MMU and no TLB, and no cache-related instructions. |
4.- Interface to external SRAM (or FLASH) on 8- and 16-bit data bus. |
3.- Trap handlers (instruction emulation) for unaligned load and store |
instructions. |
4.- Trap handlers (instruction emulation) for the most usual MIPS32 |
instructions. |
|
None of these things have been done. |
|
Many of the above goals have already been accomplished. |
|
|
|
1.2.- Development status |
|
In its present state, the CPU can pass a basic opcode test and can execute |
some basic MIPS-I code compiled with standard gcc tools (specifically, it |
can run an 'Adventure' demo and a tiny 'hello world' program, see |
section 6). |
'Basic' means that the core has a number of limitations that prevent it from |
running just any code -- mostly unimplemented instructions and skeletal |
memory interface. |
The CPU is already able to execute almost any MIPS-I code (excluding some |
unimplemented instructions such as cache control). |
It can pass a basic opcode test and can execute some basic applications |
compiled with standard gcc tools (specifically, it can run an 'Adventure' |
demo and a tiny 'hello world' program, see section 6). |
|
Besides, the core can already access external static memory (SRAM or FLASH) |
on 8-bit and/or 16 bit buses. |
My main development target is a DE-1 board from Terasic (Cyclone-2) which |
happens to have that kind of memory on it. |
The most important limitations are the very basic memory interface, with |
no support for SDRAM, and the absence of MIPS32 trap handlers (see xxx). |
|
The memory controller can already access external static memory (SRAM or |
FLASH) on 8-bit and/or 16 bit buses. Still does not support SDRAM or |
static RAM in other bus widths. |
My main development target is a DE-1 board from Terasic (Cyclone-2) and I |
have focused on the kind of memory it has. |
|
Wait states can be configured at synthesis, see section 2.6 below. |
Code sample 'memtest' takes advantage of this to do a basic test of the |
external SRAM, and code sample 'Adventure' uses both Flash and SRAM. |
|
|
The opcode test is included in '/src/opcodes/opcodes.s' (see section 6). |
The code samples can be found in the /src directory (see section 6). |
|
|
This is the state of the CPU at this time: |
|
### MIPS-I things not implemented |
- Kernel/user status. |
- RTE instruction -- as of now, returns from traps with JR. |
- Most of the CP0 registers and of course all of the CP1. |
- External interrupts. |
- Caches (there's an empty 'stub' cache and an unfinished real cache) |
1.- External interrupts. |
2.- Some basic means of debugging (besides the trap instruction). |
|
### Things implemented but not fully tested. |
- Memory pause input -- work in progress along with the cache module. |
1.- Rte instruction. |
|
### Things with provisional implementation |
|
109,6 → 115,13
The interlock logic needs a stronger test bench anyway. |
2.- Documentation is too sparse and source code is barely commented. |
This is critical and I plan to fix it ASAP. |
3.- The D-Cache handles RAW hazards in a very inefficient way. |
Data refills in a SW+LW sequence should only be triggered when the |
SW invalidates the same line the LW is loading. Instead, the current |
cache triggers the data refill always (for a SW+LW sequence, that |
is). |
This performance drag has to be fixed without ruining the clock rate |
(that's the catch). |
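The refill condition described in point 3 can be sketched as a small model. This is only an illustration: the 4-word line comes from this document, but the direct-mapped organization and the 256-line size are assumptions, and the names `line_index` and `sw_hits_lw_line` are hypothetical.

```python
# Sketch only: the cache geometry is assumed, not taken from the VHDL sources.
WORDS_PER_LINE = 4   # this doc states that each refill reads 4 words
NUM_LINES = 256      # hypothetical; suggested by the 8-bit line number N

def line_index(addr):
    """Line index of a byte address in a direct-mapped cache (assumed)."""
    return (addr // (4 * WORDS_PER_LINE)) % NUM_LINES

def sw_hits_lw_line(sw_addr, lw_addr):
    """True only when the SW touches the same line the following LW loads.

    This is the only case in which the SW+LW sequence should force a
    data refill; in every other case the LW could proceed normally.
    """
    return line_index(sw_addr) == line_index(lw_addr)
```

With these assumptions, a store to 0x100 followed by a load from 0x104 falls in the same line, whereas a load from 0x110 does not and would need no forced refill.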
|
|
### Performance |
115,18 → 128,20
|
In my main test system, a Cyclone-2 grade -7, I'm quite sure that the core |
with caches and with mul/div and all other necessary functionality, plus |
a barebones UART, will be below 2500 LEs, running at least at 50 MHz (with |
'balanced optimization' on Quartus-II). |
a barebones UART, will be below 2500 LEs + 18 BRAMs, running at least at |
50 MHz (with 'balanced optimization' on Quartus-II). |
|
|
As soon as the core is in a stable state I will include a few synthesis |
performance numbers for common configurations. |
|
As soon as I can build a dhrystone benchmark I will post results (and commit |
the code). The core needs a timer before I can do that. |
|
1.3.- Next steps |
|
* Implement efficient load interlock detection with no wasted cycles. |
* Add a couple other code samples, including one with FP arithmetic. |
* Do whatever it takes to use standard C library functions (see xxx). |
* Add a couple of benchmarks, including one with FP arithmetic. |
* Modify the software simulator so it can boot uClinux. |
* Make a uClinux port suitable for a R3000 derivative, from BuildRoot. |
|
137,6 → 152,7
2.- CPU description |
================================================================================ |
|
This section is about the 'core' cpu, excluding the cache. |
|
2.0.- Some general features |
|
282,17 → 298,14
Memory wait cycles have already been implemented and tested with a 'stub' |
cache (module mips_cache_stub) and the first version of the real cache |
(module mips_cache.vhdl). The 'stub cache' is actually just an interface |
to external 8/16-bit wide memory that has only been barely tested on real |
hardware. This stub cache is described in section 2.7 below while the real |
to external 8/16-bit wide memory, described in section 2.7 below. The real |
cache is described in section 2.8. |
The memory wait state logic is a work in progress and might change as |
development on the cache module proceeds. |
|
In short, the 'mem_wait' input will unconditionally stall all pipeline |
stages as long as it is active. It is meant to be used by the cache at cache |
refills. |
|
Again, the current implementation of the wait input and its logic is going |
The current implementation of the wait input and its logic is going |
to change. Eventually it will be described here. |
|
|
301,7 → 314,7
2.2 Pipeline |
|
Here is where I would explain the structure of the cpu in detail; these |
brief comments will have to wait until I write some real documentation. |
brief comments will have to suffice until I write some real documentation. |
|
This section could really use a diagram; since it can take me days to draw |
one, that will have to wait for a further revision. |
468,9 → 481,9
|
The C toolchain needs to be set up for MIPS-I compliance in order to build |
object code compatible with this scheme. |
But all succeeding versions of the MIPS architecture, including the MIPS-32 |
that this core aims to be compatible with in the future, implement a |
different scheme instead, 'load interlock' ([1], pag. 28). |
But all succeeding versions of the MIPS architecture implement a |
different scheme instead, 'load interlock' ([1], pag. 28). You often see |
this behavior in code generated by gcc, even when using the -mips1 flag. |
In short, it pays to implement load interlocks so this core does, but the |
feature should be optional through a generic. |
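As a rough illustration of what a load interlock has to detect, the sketch below checks whether the instruction after a load reads the register being loaded. It works on raw MIPS-I instruction words and covers integer instructions only; it is a conservative software model, not the logic used in the core, and the function names are hypothetical.

```python
# Sketch only: a software model of load-use detection, not the core's VHDL.
LOAD_OPCODES = {0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26}  # LB LH LWL LW LBU LHU LWR
# Opcodes whose rt field is a source register:
# SPECIAL (R-type), BEQ, BNE, SB, SH, SWL, SW, SWR
RT_IS_SOURCE = {0x00, 0x04, 0x05, 0x28, 0x29, 0x2A, 0x2B, 0x2E}

def fields(word):
    """Split an instruction word into (opcode, rs, rt)."""
    return (word >> 26) & 0x3F, (word >> 21) & 0x1F, (word >> 16) & 0x1F

def needs_load_interlock(load_word, next_word):
    """True if next_word reads the register that load_word is loading."""
    opcode, _, dest = fields(load_word)
    if opcode not in LOAD_OPCODES or dest == 0:
        return False                      # not a load, or loading into $0
    next_opcode, rs, rt = fields(next_word)
    if next_opcode in (0x02, 0x03):       # J and JAL have no register sources
        return False
    # rs is treated as a source for everything else (conservative).
    return rs == dest or (next_opcode in RT_IS_SOURCE and rt == dest)
```

For example, `lw $2,0($4)` (0x8C820000) followed by `addu $3,$2,$5` (0x00451821) needs the interlock, while the same load followed by `addiu $3,$5,1` (0x24A30001) does not.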
|
533,12 → 546,16
2.4 Exceptions |
|
The only exceptions supported so far are software exceptions, and of those |
only the instructions BREAK and SYSCALL and the unimplemented opcode trap. |
only the instructions BREAK and SYSCALL, the unimplemented opcode trap and |
the user access to CP0 trap. |
Memory privileges are not and will not be implemented. Hardware/software |
interrupts are still unimplemented too. |
|
Both do a limited version of the regular MIPS exception behavior. |
They save their own address to EPC, abort the following instruction, and |
jump to the exception vector 0x03c. All as per the specs except the vector |
Exceptions are meant to work as in the R3000 CPUs except for the vector |
address. |
They save their own address to EPC, update the SR, abort the following |
instruction, and jump to the exception vector 0x0180. All as per the specs |
except the vector address (we only use one). |
|
The following instruction is aborted even if it is a load or a jump, and |
traps work as specified even from a delay slot -- in that case, the address |
546,10 → 563,11
instruction's as explained in [1], pag. 64. |
|
Plasma used to save in epc the address of the instruction after break or |
syscall. This core will use the standard MIPS way instead. |
syscall, and used a nonstandard vector address (0x03c). This core will use |
the standard R3000 way instead. |
|
Note that the epc register is not used by any instruction other than mfc0; |
ERET is not implemented yet, because privilege levels aren't either. |
RTE is implemented and works as per R3000 specs. |
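The exception entry and return described in this section can be modelled in a few lines. The sketch assumes the core follows the standard R3000 status register layout (the KU/IE 'stack' in SR bits 5..0), as the text says it is meant to; the function names are hypothetical and this is not the core's actual logic.

```python
# Sketch only: standard R3000 behavior, assumed to match this core.
VECTOR = 0x0180  # the single exception vector mentioned in this section

def enter_exception(sr, pc, in_delay_slot=False):
    """Return (new_sr, epc, next_pc) for a trap taken at address pc."""
    # Push the KU/IE stack: current -> previous, previous -> old.
    # The new current pair is 00: kernel mode, interrupts disabled.
    new_sr = (sr & 0xFFFFFFC0) | ((sr & 0x0F) << 2)
    # From a delay slot, EPC gets the address of the preceding jump.
    epc = pc - 4 if in_delay_slot else pc
    return new_sr, epc, VECTOR

def rte(sr):
    """Pop the KU/IE stack: previous -> current, old -> previous."""
    return (sr & 0xFFFFFFF0) | ((sr >> 2) & 0x0F)
```

A trap taken from user mode with interrupts enabled (SR low bits 000011) yields low bits 001100, and a subsequent RTE restores 000011.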
|
|
2.5.- Multiplier |
591,7 → 609,7
d.- Whether it is cacheable or not |
|
In the present implementation the memory map can't be modified at run time. |
|
|
The cache module uses 'decode_address_mips1' to determine what to do for |
each cache refill -- the refill state machine is different for each kind of |
memory, see section 2.7. |
598,9 → 616,20
|
Note that the cache stub implements only points a, b and c. |
|
(NOTE: the cache module includes the memory controller, which is what |
actually uses all this information. The I- and D-Cache logic don't care |
about memory types or mappings). |
|
|
2.7.- 'Stub' cache |
|
IMPORTANT: in all revisions later than 106, the cache stub module is no |
longer supported -- it is incompatible with changes made to the rest of the |
project. |
It remains present for clarity only. |
This explanation applies only to the relevant (<=106) revisions. |
|
|
As of revision 53, there is a first synthesizable version of the cache, a |
'stub' with almost no functionality meant to bring the system up. |
The cache in its present state (rev. 53) is just an interface to external |
856,27 → 885,27
As of revision 114, the project already includes a real cache module, still |
unfinished. Both the simulation template and the synthesis template still |
use the stub cache by default, so if you want to test the real cache you |
have to modify the templates manually. |
have to modify the templates manually. |
Always use the latest revision! The cache module is still a work in |
progress (last bug fix in rev. 153). |
|
Only the I-Cache is implemented; the D-Cache still uses the stub logic from |
the mips_cache_stub module. Besides, SDRAM is not supported yet. And there |
are a number of loose ends in the implementation still to be solved. For |
example, the cache size is supposed to be parametrizable through generics; |
but though the generics are already there, the actual code does not use them |
and uses hardcoded constants instead. There's many things like this |
still unfinished. |
Both the I- and the D-Cache are implemented. But the parametrization |
generics are still mostly unused, with many values hardcoded. And SDRAM is |
not supported yet. Besides, there are some loose ends in the implementation |
still to be solved. |
|
As time permits I will add timing diagrams to this section. For now, I will |
only say that the timing diagrams will be nearly identical to those of the |
stub cache with only one important difference: In the stub cache, each |
refill operation only reads a single word from memory, whereas in the real |
cache each refill reads 4 words in reverse order (i.e. 3, 2, 1, 0 LSB |
address bits). |
cache each refill reads 4 words. |
|
FIXME: add at least one cache timing diagram. |
|
|
2.8.1.- Cache initialization and control |
|
Bits 17 and 16 of the SR are NOT used for their standard MIPS-I purpose. |
Bits 17 and 16 of the SR are NOT used for their standard R3000 purpose. |
Instead they are used as explained below: |
|
- Bit 17: Cache enable [reset value = 0] |
885,12 → 914,17
cache refill (even successive accesses to the same line). |
When '1', caches are enabled and work as usual. |
|
- Bit 16: I-Cache line invalidate [reset value = 0] |
- Bit 16: I- and D-Cache line invalidate [reset value = 0] |
|
When bits 17:16='01', writing word X.X.X.N to ANY address will |
invalidate I-Cache line N (N is an 8-bit word and X is an 8-bit |
don't care). |
Besides, the actual write will be performed too. |
|
When bits 17:16='01', reading from any address will cause the |
corresponding data cache line to be invalidated; the read will not be |
actually performed and the read value is undefined. |
|
When bit 16 is '0', the cache will work as usual. |
When bits 17:16='11' cache behavior is UNDETERMINED. |
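The meaning of SR bits 17 and 16 can be summarized with a small decoder. The bit positions and the four combinations come from the text above; the constant and function names are hypothetical, and taking N as the least significant byte of the written word X.X.X.N is an assumption.

```python
# Sketch only: names are illustrative, not identifiers from the project.
SR_CACHE_ENABLE = 1 << 17      # bit 17: cache enable
SR_LINE_INVALIDATE = 1 << 16   # bit 16: I- and D-Cache line invalidate

def cache_mode(sr):
    """Decode SR bits 17:16 into the behavior described above."""
    modes = {
        0b00: "disabled",      # every access causes a refill
        0b01: "invalidate",    # stores/loads invalidate cache lines
        0b10: "enabled",       # caches work as usual
        0b11: "undetermined",  # behavior not defined
    }
    return modes[(sr >> 16) & 0x3]

def icache_line_from_store(word):
    """I-Cache line N selected by storing word X.X.X.N in invalidate mode.

    Assumes N is the least significant byte of the stored word.
    """
    return word & 0xFF
```

For instance, `cache_mode(SR_LINE_INVALIDATE)` gives "invalidate", and in that mode a store of 0x12345607 would select line 7.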
|
930,14 → 964,15
state. |
|
There are a few simulation test bench templates in the /src directory, which |
are used by all the code samples (and there's just two code samples in this |
release of the project...). |
are used by all the code samples. |
The only one actually useful is '/src/mips_tb2_template.vhdl'. The others |
are remnants of previous versions that will be removed ASAP. |
|
The template in file '/src/mips_tb2_template.vhdl' is filled in with all |
the necessary data (mostly memory init. strings) from the |
the necessary data (mostly memory initialization strings) from the |
code sample object file(s) and then written to '/vhdl/tb/mips_tb2.vhdl'. |
This is done by a python script (/src/bin2hdl.py) which is invoked from |
the makefiles. |
The idea is that these tb templates are an execution harness to be shared by |
all test programs. See the template comments for a quick explanation of |
their purpose (and the reason there is more than one template). |
945,6 → 980,11
The test bench templates provided are only good to test instruction |
execution and basic core functionality. As the project moves forward and |
new features are added (e.g. caches) I will add more test bench templates. |
While the test benches and sample code are good enough to catch MOST errors |
in the full system (i.e. cache included) they don't help with diagnosis; |
once you know there's an error, and the approximate address where it's |
triggered (approximate because of the cache) you have to dig into the |
simulation waveforms to find it. It's easier than it seems. |
|
|
3.1.- Running the simulation |
998,6 → 1038,7
Be aware that the simulation timeout is arbitrarily fixed in the makefile. |
This time may or may not be enough to execute the program. Change it if |
necessary. |
|
I should include some means to check for program termination (perhaps a |
debug register written to from boot.s after returning from main, or vhdl |
debug code that catches 1-instruction closed loops). |
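One way to recognize the '1-instruction closed loop' idea is to check whether a jump or branch targets its own address, as in the usual `b .` idiom. The sketch below does this on raw MIPS-I instruction words; it only handles J and an always-taken BEQ, and the function names are hypothetical. It is a software illustration, not the planned vhdl debug code.

```python
# Sketch only: detects J and BEQ rs,rs instructions that jump to themselves.
def sign_extend16(value):
    """Interpret a 16-bit field as a signed integer."""
    return value - 0x10000 if value & 0x8000 else value

def jumps_to_itself(word, pc):
    """True if the instruction 'word' at address 'pc' targets 'pc'."""
    opcode = (word >> 26) & 0x3F
    if opcode == 0x02:  # J target
        target = ((pc + 4) & 0xF0000000) | ((word & 0x03FFFFFF) << 2)
        return target == pc
    if opcode == 0x04:  # BEQ rs, rt, offset
        rs = (word >> 21) & 0x1F
        rt = (word >> 16) & 0x1F
        offset = sign_extend16(word & 0xFFFF) << 2
        target = (pc + 4 + offset) & 0xFFFFFFFF
        # Comparing a register with itself makes the branch always taken.
        return rs == rt and target == pc
    return False
```

Note that the delay slot instruction still executes on every iteration, so the loop is really two instructions long; `b .` assembles to 0x1000FFFF (BEQ $0,$0,-1).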
1006,21 → 1047,22
|
3.2.- Simulation file logging |
|
The test bench will log any of the following events: |
The simulation test bench will log any of the following events: |
|
- Changes in the register bank. |
- Changes in registers HI and LO (implemented even if mul/div is not). |
- Changes in register EPC (implemented even though it is unused). |
- Changes in registers EPC and SR. |
- Data loads (any resulting register change is logged separately). |
- Data stores. |
|
Note that changes in other internal registers, including PC, are not logged. |
This means that for example a long chain of NOPs, or MOVEs that don't change |
register values, will not be seen in the log file. |
register values, will not be seen in the log file. This is on purpose. |
|
Events are logged with the current value of the PC; this value usually |
points to the instruction following the instruction that triggered the |
event, due to pipelining. This holds true even for load instructions. |
Events are logged with the address of the instruction that triggered |
the change. This holds true even for load instructions. |
Note that early versions of the project logged the address of the |
preceding instruction -- it was confusing and I have fixed it. |
|
The simulation log file is stored by default in modelsim's working directory |
(see above). I don't provide any automated script to do the comparison, you |
1029,6 → 1071,10
|
3.2.1.- Log file format |
|
FIXME: the log examples below are from an early version of the logger that |
did not use the instruction address but the previous address. They are very |
misleading and should be updated. |
|
There is a text line for each of the following events: |
|
* Register change |
1107,7 → 1153,7
(00000298) [03]=20000000 |
... |
|
(NOTE: this example taken from revision 1, yours may vary) |
(NOTE: this example taken from revision 1, yours will probably vary) |
|
The read cycle at pc=0x288 modifies register 0x05; that's why there are two |
lines with the same pc value. |
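Since no comparison script is provided, a register-change line like the one above could be parsed and compared with something along these lines. This is a sketch that assumes the '(pc) [reg]=value' layout shown in the example, with all fields in hex; the other event types are not covered and are simply skipped, and the function names are hypothetical.

```python
import re

# Matches lines such as "(00000298) [03]=20000000"; layout assumed from the example.
LINE_RE = re.compile(r"^\(([0-9a-fA-F]{8})\) \[([0-9a-fA-F]{2})\]=([0-9a-fA-F]{8})$")

def parse_register_change(line):
    """Return (pc, register, value) or None if the line has another format."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    pc, reg, value = (int(group, 16) for group in match.groups())
    return pc, reg, value

def first_mismatch(log_a, log_b):
    """Index of the first differing register change, or None if both agree."""
    a = [e for e in map(parse_register_change, log_a) if e is not None]
    b = [e for e in map(parse_register_change, log_b) if e is not None]
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return index
    # One log may simply be longer than the other.
    return None if len(a) == len(b) else min(len(a), len(b))
```

Both arguments of `first_mismatch` are iterables of text lines, for example two open log files.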
1159,7 → 1205,7
|
The test bench needs to access those signals in order to detect changes in |
the internal cpu state that should be logged. That is, it really needs to |
look at those signals if it is to be of any use. |
look at those signals if it is to be of any use; this is no whim of mine. |
|
If you are using any other simulation tool, look for an alternative method |
to get those internal signals or just add them to the core interface. I |
1169,6 → 1215,11
I guess this is why Mentor people took the trouble to write SignalSpy. |
|
I plan to move to Symphony EDA eventually, so I'll have to fix this. |
|
Using GHDL would be an option, except that it only supports vhdl. The |
project will use a SDRAM model in verilog for which I could not find a |
vhdl replacement. If the project is to be ported to GHDL (a very desirable |
goal) this will have to be worked around. |
|
|
|
1178,9 → 1229,10
|
4.1.- Pre-generated demo |
|
The project includes 3 synthesizable code samples, a 'Hello world' demo |
and a memory tester. Only the 'hello' demo is included in pre-generated |
form, the others have to be built using the included makefiles. |
The project includes a few synthesizable code samples, including a |
'Hello world' demo and a memory tester. Only the 'hello' demo is included |
in pre-generated form, the others have to be built using the included |
makefiles -- assuming you have a mips toolchain. |
|
This is just for convenience, so that you can launch some demo on hardware |
without installing the C toolchain. |
1200,26 → 1252,32
I assume you are familiar with Altera tools but anyway this is how to set up |
a project using Quartus II: |
|
1.- Create new project with the new project wizard |
Top entity should be c2sb_demo |
Suggested path is /syn/altera/<project name> |
2.- Set target device as EP2C20F484C7 |
This choice determines speed grade and chip package |
3.- 'Next' your way out of the new project wizard |
4.- Add to the project all the vhdl files in /vhdl and /vhdl/demo |
Select file c2sb_demo.vhdl as top |
5.- Import pin constraints file (assignments->import assignments) |
6.- Create a clock constraint for signal clk (52 MHz or some other |
1.- Create new project with the new project wizard. |
Top entity should be c2sb_demo. |
Suggested path is /syn/altera/<project name>. |
2.- Set target device as EP2C20F484C7. |
This choice determines speed grade and chip package. |
3.- 'Next' your way out of the new project wizard. |
4.- Add to the project all the vhdl files in /vhdl and /vhdl/demo. |
Select file c2sb_demo.vhdl as top. |
5.- Import pin constraints file (assignments->import assignments). |
6.- Create a clock constraint for signal clk (51 MHz or some other |
suitable speed which gives us some minimal slack). |
7.- In the device settings window, click "Device and pin options..." |
8.- Select tab "Dual-Purpose pins" |
9.- Double-click on nCEO value column and select "use as regular I/O" |
7.- In the device settings window, click "Device and pin options...". |
8.- Select tab "Dual-Purpose pins". |
9.- Double-click on nCEO value column and select "use as regular I/O". |
IMPORTANT: otherwise the synthesis will fail; we need to use a FPGA |
pin that happens to be dual-purpose (programming and regular). |
10.-Save the project and synthesize |
11.-Make sure the clock constraint is met (timing analyzer report) |
12.-If you have a terminal hooked to the serial port (19200/8/N/1) you |
10.-Select 'balanced' optimization. |
11.-Save the project and synthesize. |
12.-Make sure the clock constraint is met (timing analyzer report). |
There is a random element to the synthesis process, as you know, |
so it is possible that you need to repeat it if the first trial does |
not pass the constraints. |
13.- Program the FPGA from Quartus-2 |
14.-If you have a terminal hooked to the serial port (19200/8/N/1) you |
should see a welcome message after depressing the reset button. |
(by default this is pushbutton 2). |
|
In case you need to troubleshoot, my synthesis of the default demo is |
usually like this: |
1226,8 → 1284,8
|
- Selected optimization: balanced |
- All other options: default |
- <2250 LEs plus 27 M4K blocks if you use the real cache |
- Clock constraint met but not by much (~53 MHz) |
- <2400 LEs plus a lot of M4K blocks if you use the cache and code BRAM |
- Clock constraint met but not by much (~51+ MHz) |
- <60 warnings, mostly harmless (ugliest: unused pins, undeclared clock) |
|
Note that none of the on-board goodies are used in the demo except as noted |
1244,12 → 1302,15
|
4.2.- Porting to other dev boards |
|
I will only deal here with the 'hello' demo. |
I will only deal here with the 'hello' demo; the process is the same |
for all other samples that don't involve the flash. |
|
|
The 'hello' demo should be easily portable to any board which has all of |
this: |
|
- An FPGA capable enough (the demo uses internal memory only) |
- An FPGA capable enough (the demo uses internal memory for code) |
- At least 4KB of 16-bit wide SRAM |
- A reset pin (possibly a pushbutton) |
- A clock input (uart modules assume 50MHz, see below) |
- RXD and TXD UART pins, plus a connector, header or whatever |
1272,21 → 1333,17
to match your board setup. |
|
All the code in this project is vendor agnostic (or should be, I have only |
tried it on Quartus and ISE). Specifically, it does not instance memory |
tried it on Quartus and ISE). Specifically, it does not instantiate memory |
blocks (relying instead on memory inference) or clock managers or buffers. |
This has its drawbacks but is a stated goal of the project -- in the long |
run it pays, I think. |
|
|
The memory test demo, on the other hand, uses the external static RAM on |
the DE-1 board. Porting it involves only adapting the memory interface |
signals and should be quite straightforward. You're on your own, though. |
|
4.3 'Adventure' demo |
|
There is a second demo targeting the same hardware as the hello demo above: |
There is another demo targeting the same hardware as the hello demo above: |
a port of 'Adventure'. The C source (included) has been slightly modified |
to not use any library functions nor any fulesystem (instead uses a built-in |
to not use any library functions nor any filesystem (instead uses a built-in |
constant string table). |
|
Build steps are the same as for the hello demo (the make target is |
1295,7 → 1352,9
Since the binary executable is too large to fit internal BRAM, it has to be |
executed from the DE-1 onboard flash. You need to write file adventure.bin |
to the start of the FLASH using the 'Control Panel' tool that came with your |
DE-1 board. That's the only salient difference. |
DE-1 board. That's the only salient difference. That and the amount of SRAM: |
the 512KB present on the DE-1 are enough, but I don't remember right now |
what the minimum is; please look at the map file. |
|
The game will offer you an auto-walkthrough option. Answer 'y' and it will |
play itself for about 250 moves, leaving you at an intermediate stage of |
1331,7 → 1390,8
|
The simulator can be compiled for compatibility to Plasma or to a more |
standard mips1. This affects the simulated memory map and the trap vectors |
only, as of rev. 70. Note that mips1 simulation is far from complete yet. |
only, as of rev. 70. Note that mips1 simulation is far from complete yet, |
and the caches are not simulated at all. |
|
Upon startup, the simulator loads a number of binary files as requested in |
the command line: |