URL https://opencores.org/ocsvn/ion/ion/trunk

Subversion Repositories ion

Compare Revisions

This comparison shows the changes necessary to convert path
```
/ion/trunk/doc
```
from Rev 155 to Rev 156

Rev 155 → Rev 156

/ion_project.txt

1,8 → 1,8

 This file contains instructions and notes about the Ion CPU core project.
 The core structure is briefly explained in section 2. The rest of this doc is
 mostly usage instructions for the test samples and custom utilities.
-In its present state it' more a reminder to myself than a proper explaination of
-the system.
+In its present state it's more a reminder to myself than a proper explanation
+of the system.
 Last modified: Apr/04/2011

36,12 → 36,15

 .- Catch all undefined opcodes (and trigger exception).
 .- Operate in kernel/user mode as per the architecture definition.
 .- Handle exceptions in a manner compatible to MIPS-I standard.
-.- Implement as much of CP0 as necessary for the above goals.
-.- Be no bigger than Plasma in a Spartan-3 or Cyclone-2 device, and
+.- Code cache and data cache, even if not standard.
+        No MMU and no TLB, and no cache-related instructions.
+.- Implement as much of CP0 as necessary for the above goals.
+.- Interface to external SRAM (or FLASH) on 8- and 16-bit data bus.
+.- Be no bigger than Plasma in a Spartan-3 or Cyclone-2 device, and
         no slower -- Plasma is used as a reference in many ways.
         Speed measured in raw clock frequency for the time being.
         (I.e. do not consider stalls, interlocks, etc. yet)
-.- Interlock behavior of MUL/DIV and L* compatible to toolchain.
+.- Interlock behavior of MUL/DIV and L* compatible to toolchain.
         That is, interlock loads instead of relying on a delay slot.
     Unaligned load/stores are excluded not because of patent concerns (the

48,53 → 51,56

     patents already expired) but because they're not essential for a first
     version of the core. The same goes for all other exclusions.
+    As of rev. 154 all the 1st block goals have been accomplished (but not very
+    heavily tested; many bugs remain, probably).
     For a second iteration I plan on the following:
 .- Proper interlocking of load cycles (with no wasted cycles).
 .- External interrupt support.
-.- Code cache and data cache, even if not standard.
-        No MMU and no TLB, and no cache-related instructions.
-.- Interface to external SRAM (or FLASH) on 8- and 16-bit data bus.
+.- Trap handlers (instruction emulation) for unaligned load and store
+        instructions.
+.- Trap handlers (instruction emulation) for the most usual MIPS32
+        instructions.
+    None of these things have been done.
-    Many of the above goals have already been accomplished.
 .2.- Development status
-    In its present state, the CPU can pass a basic opcode test and can execute
-    some basic MIPS-I code compiled with standard gcc tools (specifically, it
-    can run an 'Adventure' demo and a tiny 'hello world' program, see
-    section 6).
-    'Basic' means that the core has a number of limitations that prevent it from
-    running just any code -- mostly unimplemented instructions and skeletal
-    memory interface.
+    The CPU is already able to execute almost any MIPS-I code (excluding some
+    unimplemented instructions such as cache control).
+    It can pass a basic opcode test and can execute some basic applications
+    compiled with standard gcc tools (specifically, it can run an 'Adventure'
+    demo and a tiny 'hello world' program, see section 6).
-    Besides, the core can already access external static memory (SRAM or FLASH)
-    on 8-bit and/or 16 bit buses.
-    My main development target is a DE-1 board from Terasic (Cyclone-2) which
-    happens to have that kind of memory on it.
+    The most important limitations are the very basic memory interface, with
+    no support for SDRAM, and the absence of MIPS32 trap handlers (see xxx).
+    The memory controller can already access external static memory (SRAM or
+    FLASH) on 8-bit and/or 16-bit buses. It still does not support SDRAM or
+    static RAM in other bus widths.
+    My main development target is a DE-1 board from Terasic (Cyclone-2) and I
+    have focused on the kind of memory it has.
     Wait states can be configured at synthesis, see section 2.6 below.
     Code sample 'memtest' takes advantage of this to do a basic test of the
     external SRAM, and code sample 'Adventure' uses both Flash and SRAM.
-    The opcode test is included in '/src/opcodes/opcodes.s' (see section 6).
+    The code samples can be found in the /src directory (see section 6).
     This is the state of the CPU at this time:
     ### MIPS-I things not implemented
-        - Kernel/user status.
-        - RTE instruction -- as of now, returns from traps with JR.
-        - Most of the CP0 registers and of course all of the CP1.
-        - External interrupts.
-        - Caches (there's an empty 'stub' cache and an unfinished real cache)
+.- External interrupts.
+.- Some basic means of debugging (besides the trap instruction).
     ### Things implemented but not fully tested.
-        - Memory pause input -- work in progress along with the cache module.
+.- Rte instruction.
     ### Things with provisional implementation

109,6 → 115,13

             The interlock logic needs a stronger test bench anyway.
 .- Documentation is too sparse and source code is barely commented.
             This is critical and I plan to fix it ASAP.
+.- The D-Cache handles RAW hazards in a very inefficient way.
+            Data refills in a SW+LW sequence should only be triggered when the
+            SW invalidates the same line the LW is loading. Instead, the current
+            cache triggers the data refill always (for a SW+LW sequence, that
+            is).
+            This performance drag has to be fixed without ruining the clock rate
+            (that's the catch).
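
The refill condition described in the D-Cache item above can be sketched as follows. This is a minimal Python model, not the project's VHDL: the direct-mapped organization, the 16-byte (4-word) line size, the 256-line count and all function names are assumptions made only for illustration.

```python
# Hypothetical model of the D-Cache refill decision for a SW+LW sequence.
# Assumed geometry (NOT taken from the VHDL sources): direct-mapped cache,
# 256 lines, 4 words (16 bytes) per line.
LINE_BYTES = 16
NUM_LINES = 256

def line_index(addr):
    """Cache line selected by a byte address."""
    return (addr // LINE_BYTES) % NUM_LINES

def refill_current(sw_addr, lw_addr):
    """Behavior described above: a SW+LW sequence always triggers a refill."""
    return True

def refill_desired(sw_addr, lw_addr):
    """Desired behavior: refill only if the SW touched the line the LW loads."""
    return line_index(sw_addr) == line_index(lw_addr)
```

Under these assumptions, a SW to 0x1000 followed by a LW from 0x1010 would not need a refill, whereas a LW from 0x100C would.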
     ### Performance

115,18 → 128,20

     In my main test system, a Cyclone-2 grade -7, I'm quite sure that the core
     with caches and with mul/div and all other necessary functionality, plus
-    a barebones UART, will be below 2500 LEs, running at least at 50 MHz (with
-    'balanced optimization' on Quartus-II).
+    a barebones UART, will be below 2500 LEs + 18 BRAMs, running at least at
+    50 MHz (with 'balanced optimization' on Quartus-II).
     As soon as the core is in a stable state I will include a few synthesis
     performance numbers for common configurations.
+    As soon as I can build a dhrystone benchmark I will post results (and commit
+    the code). The core needs a timer before I can do that.
 .3.- Next steps
     * Implement efficient load interlock detection with no wasted cycles.
-    * Add a couple other code samples, including one with FP arithmetic.
+    * Do whatever it takes to use standard C library functions (see xxx).
+    * Add a couple of benchmarks, including one with FP arithmetic.
     * Modify the software simulator so it can boot uClinux.
     * Make a uClinux port suitable for a R3000 derivative, from BuildRoot.

137,6 → 152,7

2.- CPU description

================================================================================

This section is about the 'core' cpu, excluding the cache.

2.0.- Some general features

282,17 → 298,14

     Memory wait cycles have already been implemented and tested with a 'stub'
     cache (module mips_cache_stub) and the first version of the real cache
     (module mips_cache.vhdl). The 'stub cache' is actually just an interface
-    to external 8/16-bit wide memory that has only been barely tested on real
-    hardware. This stub cache is described in section 2.7 below while the real
+    to external 8/16-bit wide memory, described in section 2.7 below. The real
     cache is described in section 2.8.
-    The memory wait state logic is a work in progress and might change as
-    development on the cache module proceeds.
     In short, the 'mem_wait' input will unconditionally stall all pipeline
     stages as long as it is active. It is meant to be used by the cache at cache
     refills.
-    Again, the current implementation of the wait input and its logic is going
+    The current implementation of the wait input and its logic is going
     to change. Eventually it will be described here.

301,7 → 314,7

 .2 Pipeline
     Here is where I would explain the structure of the cpu in detail; these
-    brief comments will have to wait until I write some real documentation.
+    brief comments will have to suffice until I write some real documentation.
     This section could really use a diagram; since it can take me days to draw
     one, that will have to wait for a further revision.

468,9 → 481,9

     The C toolchain needs to be set up for MIPS-I compliance in order to build
     object code compatible with this scheme.
-    But all succeeding versions of the MIPS architecture, including the MIPS-32
-    that this core aims to be compatible with in the future, implement a
-    different scheme instead, 'load interlock' ([1], pag. 28).
+    But all succeeding versions of the MIPS architecture implement a
+    different scheme instead, 'load interlock' ([1], pag. 28). You often see
+    this behavior in code generated by gcc, even when using the -mips1 flag.
     In short, it pays to implement load interlocks so this core does, but the
     feature should be optional through a generic.
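
As a rough illustration of what a load interlock has to detect, here is a simplified Python sketch of the load-use check. It is not derived from the core's VHDL; the function names are hypothetical, and coprocessor instructions and the LWL/LWR merge read are deliberately not modelled.

```python
# Simplified load-use hazard check on raw 32-bit MIPS-I instruction words.
# Illustrative only: COPz instructions and the LWL/LWR register merge are
# ignored, and all names here are hypothetical.
LOAD_OPCODES = {0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26}  # LB, LH, LWL, LW, LBU, LHU, LWR
READS_RT = {0x00, 0x04, 0x05, 0x28, 0x29, 0x2A, 0x2B, 0x2E}  # SPECIAL, BEQ, BNE, stores
NO_RS = {0x02, 0x03}  # J, JAL: bits 25:21 belong to the jump target

def fields(instr):
    """Return (opcode, rs, rt) of an instruction word."""
    return (instr >> 26) & 0x3F, (instr >> 21) & 0x1F, (instr >> 16) & 0x1F

def needs_interlock(load_instr, next_instr):
    """True if next_instr reads the register that load_instr is loading."""
    load_op, _, load_rt = fields(load_instr)
    if load_op not in LOAD_OPCODES or load_rt == 0:
        return False  # not a load, or the target is $0
    next_op, next_rs, next_rt = fields(next_instr)
    sources = set()
    if next_op not in NO_RS:
        sources.add(next_rs)
    if next_op in READS_RT:
        sources.add(next_rt)
    return load_rt in sources
```

For example, `lw $8, 0($4)` (0x8C880000) followed by `addu $9, $8, $10` (0x010A4821) is flagged, while the same load followed by a NOP is not.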

533,12 → 546,16

 .4 Exceptions
     The only exceptions supported so far are software exceptions, and of those
-    only the instructions BREAK and SYSCALL and the unimplemented opcode trap.
+    only the instructions BREAK and SYSCALL, the unimplemented opcode trap and
+    the user access to CP0 trap.
+    Memory privileges are not and will not be implemented. Hardware/software
+    interrupts are still unimplemented too.
-    Both do a limited version of the regular MIPS exception behavior.
-    They save their own address to EPC, abort the following instruction, and
-    jump to the exception vector 0x03c. All as per the specs except the vector
+    Exceptions are meant to work as in the R3000 CPUs except for the vector
     address.
+    They save their own address to EPC, update the SR, abort the following
+    instruction, and jump to the exception vector 0x0180. All as per the specs
+    except the vector address (we only use one).
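
The exception entry sequence above can be sketched as a small Python model. The vector value comes from the text; the SR update shown is the usual R3000 KU/IE stack push (bits 5:0), which is assumed here because the text says exceptions follow the R3000. The function names are hypothetical and the delay-slot case is not modelled.

```python
# Hypothetical model of exception entry and return; not the core's VHDL.
EXCEPTION_VECTOR = 0x0180  # single vector, as stated above
KU_IE_MASK = 0x3F          # assumed R3000 layout: KUo/IEo, KUp/IEp, KUc/IEc

def enter_exception(pc, sr):
    """Return (epc, new_sr, next_pc) for a trap raised at address pc."""
    epc = pc                                        # trap saves its own address
    pushed = ((sr & KU_IE_MASK) << 2) & KU_IE_MASK  # push stack, KUc = IEc = 0
    new_sr = (sr & ~KU_IE_MASK) | pushed
    return epc, new_sr, EXCEPTION_VECTOR

def return_from_exception(sr):
    """Pop the KU/IE stack as RFE does on the R3000 (old bits are kept)."""
    stack = sr & KU_IE_MASK
    popped = (stack & 0x30) | (stack >> 2)
    return (sr & ~KU_IE_MASK) | popped
```

With these assumptions, a trap at 0x400 with SR=0x03 yields EPC=0x400, SR=0x0C and a jump to 0x0180; the matching return restores SR to 0x03.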
     The following instruction is aborted even if it is a load or a jump, and
     traps work as specified even from a delay slot -- in that case, the address

546,10 → 563,11

     instruction's as explained in [1], pag. 64.
     Plasma used to save in epc the address of the instruction after break or
-    syscall. This core will use the standard MIPS way instead.
+    syscall, and used a nonstandard vector address (0x03c). This core will use
+    the standard R3000 way instead.
     Note that the epc register is not used by any instruction other than mfc0;
-    ERET is not implemented yet, because privilege levels aren't either.
+    RTE is implemented and works as per R3000 specs.
 .5.- Multiplier

591,7 → 609,7

         d.- Whether it is cacheable or not
     In the present implementation the memory map can't be modified at run time.
     The cache module uses 'decode_address_mips1' to determine what to do for
     each cache refill -- the refill state machine is different for each kind of
     memory, see section 2.7.

598,9 → 616,20

     Note that the cache stub implements only points a, b and c.
+    (NOTE: the cache module includes the memory controller, which is what
+    actually uses all this information. The I- and D-Cache logic don't care
+    about memory types or mappings).
 .7.- 'Stub' cache
+    IMPORTANT: in all revisions later than 106, the cache stub module is no
+    longer supported -- it is incompatible with changes made to the rest of the
+    project.
+    It remains present for clarity only.
+    This explanation applies only to the relevant (<=106) revisions.
     As of revision 53, there is a first synthesizable version of the cache, a
     'stub' with almost no functionality meant to bring the system up.
     The cache in its present state (rev. 53) is just an interface to external

856,27 → 885,27

     As of revision 114, the project already includes a real cache module, still
     unfinished. Both the simulation template and the synthesis template still
     use the stub cache by default, so if you want to test the real cache you
-    have to modify the templated manually.
+    have to modify the templates manually.
+    Always use the latest revision! The cache module is still a work in
+    progress (last bug fix in rev. 153).
-    Only the I-Cache is implemented; the D-Cache still uses the stub logic from
-    the mips_cache_stub module. Besides, SDRAM is not supported yet. And there
-    are a number of loose ends in the implementation still to be solved. For
-    example, the cache size is supposed to be parametrizable through generics;
-    but though the generics are already there, the actual code does not use them
-    and uses hardcoded constants instead. There's many things like this
-    still unfinished.
+    Both the I- and the D-Cache are implemented. But the parametrization
+    generics are still mostly unused, with many values hardcoded. And SDRAM is
+    not supported yet. Besides, there are some loose ends in the implementation
+    still to be solved.
     As time permits I will add timing diagrams to this section. For now, I will
     only say that the timing diagrams will be nearly identical to those of the
     stub cache with only one important difference: In the stub cache, each
     refill operation only reads a single word from memory, whereas in the real
-    cache each refill reads 4 words in reverse order (i.e. 3, 2, 1, 0 LSB
-    address bits).
+    cache each refill reads 4 words.
+    FIXME: add at least one cache timing diagram.
 .8.1.- Cache initialization and control
-    Bits 17 and 16 of the SR are NOT used for their standard MIPS-I purpose.
+    Bits 17 and 16 of the SR are NOT used for their standard R3000 purpose.
     Instead they are used as explained below:
     - Bit 17: Cache enable              [reset value = 0]

885,12 → 914,17

         cache refill (even successive accesses to the same line).
         When '1', caches are enabled and work as usual.
-    - Bit 16: I-Cache line invalidate   [reset value = 0]
+    - Bit 16: I- and D-Cache line invalidate   [reset value = 0]
         When bits 17:16='01', writing word X.X.X.N to ANY address will
         invalidate I-Cache line N (N is an 8-bit word and X is an 8-bit
         don't care).
         Besides, the actual write will be performed too.
+        When bits 17:16='01', reading from any address will cause the
+        corresponding data cache line to be invalidated; the read will not be
+        actually performed and the read value is undefined.
         When bit 16 is '0', the cache will work as usual.
         When bits 17:16='11' cache behavior is UNDETERMINED.
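
The meaning of SR bits 17:16 described above can be summarized with a short Python decoder. This is only an illustrative sketch with hypothetical names; in particular, taking N as the least significant byte of the written word X.X.X.N is an assumption.

```python
# Hypothetical decoder for the non-standard use of SR bits 17:16.
SR_CACHE_ENABLE = 1 << 17
SR_LINE_INVALIDATE = 1 << 16

def cache_mode(sr):
    """Classify the cache behavior selected by SR bits 17:16."""
    enable = bool(sr & SR_CACHE_ENABLE)
    invalidate = bool(sr & SR_LINE_INVALIDATE)
    if enable and invalidate:
        return "undetermined"   # 17:16 = '11'
    if invalidate:
        return "invalidate"     # 17:16 = '01'
    if enable:
        return "enabled"        # 17:16 = '10'
    return "disabled"           # 17:16 = '00', the reset value

def invalidated_icache_line(word):
    """I-Cache line N selected by writing X.X.X.N (N assumed to be the low byte)."""
    return word & 0xFF
```

Under this reading, software would set bits 17:16 to '01', write or read once per line to invalidate both caches, and then set them to '10' to enable the caches.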

930,14 → 964,15

     state.
     There are a few simulation test bench templates in the /src directory, which
-    are used by all the code samples (and there's just two code samples in this
-    release of the project...).
+    are used by all the code samples.
     The only one actually useful is '/src/mips_tb2_template.vhdl'. The others
     are remnants of previous versions that will be removed ASAP.
     The template in file '/src/mips_tb2_template.vhdl' is filled in with all
-    the necesary data (mostly memory init. strings) from the
+    the necessary data (mostly memory initialization strings) from the
     code sample object file(s) and then written to '/vhdl/tb/mips_tb2.vhdl'.
+    This is done by a python script (/src/bin2hdl.py) which is invoked from
+    the makefiles.
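
To give an idea of what this kind of template filling involves, here is a minimal Python sketch. It is NOT the project's bin2hdl.py, whose options and output format are not described here: the function names, the '@code@' placeholder, the big-endian word order and the X"..." literal layout are all assumptions made for illustration.

```python
# Hypothetical sketch of turning a binary image into a VHDL init string and
# pasting it into a template. Not the project's bin2hdl.py.

def words_to_vhdl(data, big_endian=True):
    """Format a bytes object as comma-separated 32-bit VHDL hex literals."""
    padded = data + b"\x00" * (-len(data) % 4)  # pad to a whole word
    order = "big" if big_endian else "little"
    words = [int.from_bytes(padded[i:i + 4], order)
             for i in range(0, len(padded), 4)]
    return ", ".join('X"%08X"' % w for w in words)

def fill_template(template, tag, value):
    """Replace every occurrence of a placeholder tag in the template text."""
    return template.replace(tag, value)
```

A makefile rule could then, for instance, read the object binary, call `words_to_vhdl` on it and write the filled template to the test bench file.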
     The idea is that these tb templates are an execution harness to be shared by
     all test programs. See the template comments for a quick explanation of
     their purpose (and the reason there is more than one template).

945,6 → 980,11

     The test bench templates provided are only good to test instruction
     execution and basic core functionality. As the project moves forward and
     new features are added (e.g. caches) I will add more test bench templates.
+    While the test benches and sample code are good enough to catch MOST errors
+    in the full system (i.e. cache included) they don't help with diagnosis;
+    once you know there's an error, and the approximate address where it's
+    triggered (approximate because of the cache) you have to dig into the
+    simulation waveforms to find it. It's easier than it seems.
 .1.- Running the simulation

998,6 → 1038,7

Be aware that the simulation timeout is arbitrarily fixed in the makefile.

This time may or may not be enough to execute the program. Change it if

necessary.

I should include some means to check for program termination (perhaps a

debug register written to from boot.s after returning from main, or vhdl

debug code that catches 1-instruction closed loops).

1006,21 → 1047,22

 .2.- Simulation file logging
-    The test bench will log any of the following events:
+    The simulation test bench will log any of the following events:
     - Changes in the register bank.
     - Changes in registers HI and LO (implemented even if mul/div is not).
-    - Changes in register EPC (implemented even though it is unused).
+    - Changes in registers EPC and SR.
     - Data loads (any resulting register change is logged separately).
     - Data stores.
     Note that changes in other internal registers, including PC, are not logged.
     This means that for example a long chain of NOPs, or MOVEs that don't change
-    register values, will not be seen in the log file.
+    register values, will not be seen in the log file. This is on purpose.
-    Events are logged with the current value of the PC; this value usually
-    points to the instruction following the instruction that triggered the
-    event, due to pipelining. This holds true even for load instructions.
+    Events are logged with the address of the instruction that triggered
+    the change. This holds true even for load instructions.
+    Note that early versions of the project logged the address of the
+    preceding instruction -- it was confusing and I have fixed it.
     The simulation log file is stored by default in modelsim's working directory
     (see above). I don't provide any automated script to do the comparison, you

1029,6 → 1071,10

 .2.1.- Log file format
+    FIXME: the log examples below are from an early version of the logger that
+    did not use the instruction address but the previous address. They are very
+    misleading and should be updated.
     There is a text line for each of the following events:
     * Register change

1107,7 → 1153,7

                 (00000298) [03]=20000000
                 ...
-    (NOTE: this example taken from revision 1, yours may vary)
+    (NOTE: this example taken from revision 1, yours will probably vary)
     The read cycle at pc=0x288 modifies register 0x05; that's why there are two
     lines with the same pc value.

1159,7 → 1205,7

     The test bench needs to access those signals in order to detect changes in
     the internal cpu state that should be logged. That is, it really needs to
-    look at those signals if it is to be of any use.
+    look at those signals if it is to be of any use; this is no whim of mine.
     If you are using any other simulation tool, look for an alternative method
     to get those internal signals or just add them to the core interface. I

1169,6 → 1215,11

     I guess this is why Mentor people took the trouble to write SignalSpy.
     I plan to move to Symphony EDA eventually, so I'll have to fix this.
+    Using GHDL would be an option, except that it only supports vhdl. The
+    project will use a SDRAM model in verilog for which I could not find a
+    vhdl replacement. If the project is to be ported to GHDL (a very desirable
+    goal) this will have to be worked around.

1178,9 → 1229,10

 .1.- Pre-generated demo
-    The project includes 3 synthesizable code samples, a 'Hello world' demo
-    and a memory tester. Only the 'hello' demo is included in pre-generated
-    form, the others have to be built using the included makefiles.
+    The project includes a few synthesizable code samples, including a
+    'Hello world' demo and a memory tester. Only the 'hello' demo is included
+    in pre-generated form, the others have to be built using the included
+    makefiles -- assuming you have a mips toolchain.
     This is just for convenience, so that you can launch some demo on hardware
     without installing the C toolchain.

1200,26 → 1252,32

     I assume you are familiar with Altera tools but anyway this is how to set up
     a project using Quartus II:
-.- Create new project with the new project wizard
-            Top entity should be c2sb_demo
-            Suggested path is /syn/altera/<project name>
-.- Set target device as EP2C20F484C7
-            This choice determines speed grade and chip package
-.- 'Next' your way out of the new project wizard
-.- Add to the project all the vhdl files in /vhdl and /vhdl/demo
-            Select file c2sb_demo.vhdl as top
-.- Import pin constraints file (assignments->import assignments)
-.- Create a clock constraint for signal clk (52 MHz or some other
+.- Create new project with the new project wizard.
+            Top entity should be c2sb_demo.
+            Suggested path is /syn/altera/<project name>.
+.- Set target device as EP2C20F484C7.
+            This choice determines speed grade and chip package.
+.- 'Next' your way out of the new project wizard.
+.- Add to the project all the vhdl files in /vhdl and /vhdl/demo.
+            Select file c2sb_demo.vhdl as top.
+.- Import pin constraints file (assignments->import assignments).
+.- Create a clock constraint for signal clk (51 MHz or some other
             suitable speed which gives us some minimal slack).
-.- In the device settings window, click "Device and pin options..."
-.- Select tab "Dual-Purpose pins"
-.- Double-click on nCEO value column and select "use as regular I/O"
+.- In the device settings window, click "Device and pin options...".
+.- Select tab "Dual-Purpose pins".
+.- Double-click on nCEO value column and select "use as regular I/O".
             IMPORTANT: otherwise the synthesis will fail; we need to use a FPGA
             pin that happens to be dual-purpose (programming and regular).
-.-Save the project and synthesize
-.-Make sure the clock constraint is met (timing analyzer report)
-.-If you have a terminal hooked to the serial port (19200/8/N/1) you
+.-Select 'balanced' optimization.
+.-Save the project and synthesize.
+.-Make sure the clock constraint is met (timing analyzer report).
+            There is a random element to the synthesis process, as you know,
+            so it is possible that you need to repeat it if the first trial does
+            not pass the constraints.
+.- Program the FPGA from Quartus II.
+.-If you have a terminal hooked to the serial port (19200/8/N/1) you
             should see a welcome message after depressing the reset button.
+            (By default this is pushbutton 2.)
     In case you need to troubleshoot, my synthesis of the default demo is
     usually like this:

1226,8 → 1284,8

         - Selected optimization: balanced
         - All other options: default
-        - <2250 LEs plus 27 M4K blocks if you use the real cache
-        - Clock constraint met but not by much (~53 MHz)
+        - <2400 LEs plus a lot of M4K blocks if you use the cache and code BRAM
+        - Clock constraint met but not by much (~51+ MHz)
         - <60 warnings, mostly harmless (ugliest: unused pins, undeclared clock)
     Note that none of the on-board goodies are used in the demo except as noted

1244,12 → 1302,15

 .2.- Porting to other dev boards
-    I will only deal here with the 'hello' demo.
+    I will only deal here with the 'hello' demo; the process is the same
+    for all other samples that don't involve the flash.
     The 'hello' demo should be easily portable to any board which has all of
     this:
-        - An FPGA capable enough (the demo uses internal memory only)
+        - An FPGA capable enough (the demo uses internal memory for code)
+        - At least 4KB of 16-bit wide SRAM
         - A reset pin (possibly a pushbutton)
         - A clock input (uart modules assume 50MHz, see below)
         - RXD and TXD UART pins, plus a connector, header or whatever

1272,21 → 1333,17

     to match your board setup.
     All the code in this project is vendor agnostic (or should be, I have only
-    tried it on Quartus and ISE). Specifically, it does not instance memory
+    tried it on Quartus and ISE). Specifically, it does not instantiate memory
     blocks (relying instead on memory inference) or clock managers or buffers.
     This has its drawbacks but is a stated goal of the project -- in the long
     run it pays, I think.
-    The memory test demo, on the other hand, uses the external static RAM on
-    the DE-1 board. Porting it involves only adapting the memory interface
-    signals and should be quite straightforward. You're on your own, though.
 .3 'Adventure' demo
-    There is a second demo targeting the same hardware as the hello demo above:
+    There is another demo targeting the same hardware as the hello demo above:
     a port of 'Adventure'. The C source (included) has been slightly modified
-    to not use any library functions nor any fulesystem (instead uses a built-in
+    to not use any library functions nor any filesystem (instead uses a built-in
     constant string table).
     Build steps are the same as for the hello demo (the make target is

1295,7 → 1352,9

     Since the binary executable is too large to fit internal BRAM, it has to be
     executed from the DE-1 onboard flash. You need to write file adventure.bin
     to the start of the FLASH using the 'Control Panel' tool that came with your
-    DE-1 board. That's the only salient difference.
+    DE-1 board. That's the only salient difference, that and the amount of SRAM:
+    the 512KB present on the DE-1 are enough, but I don't remember right now
+    what the minimum is; please look at the map file.
     The game will offer you an auto-walkthrough option. Answer 'y' and it will
     play itself for about 250 moves, leaving you at an intermediate stage of

1331,7 → 1390,8

     The simulator can be compiled for compatibility to Plasma or to a more
     standard mips1. This affects the simulated memory map and the trap vectors
-    only, as of rev. 70. Note that mips1 simulation is far from complete yet.
+    only, as of rev. 70. Note that mips1 simulation is far from complete yet,
+    and the caches are not simulated at all.
     Upon startup, the simulator loads a number of binary files as requested in
     the command line: