// Author: Nicolae Dumitrache
// e-mail: ndumitrache@opencores.org
//
/////////////////////////////////////////////////////////////////////////////////
//
// Copyright (C) 2012 Nicolae Dumitrache
//
// This source file may be used and distributed without
// restriction provided that this copyright statement is not
// removed from the file and that any derivative work contains
// the original copyright notice and the associated disclaimer.
// ...
// from http://www.opencores.org/lgpl.shtml
//
///////////////////////////////////////////////////////////////////////////////////
// Additional Comments:
//
// Description: Next186 Bus Interface Unit, 32-bit wide SRAM, double CPU frequency (2T), delayed read (1T)
// - Links the CPU with a 32-bit synchronous static RAM (or cache)
// - Able to address up to 1MB
// - 16-byte instruction queue
// - Works at 2x the CPU frequency (80MHz on Spartan-3AN), requiring a minimum of 2T per instruction
// - The 32-bit data bus and the BIU clock running at twice the CPU frequency keep the instruction queue almost always full, so the CPU is rarely starved
// - Data misalignment penalties apply only when a data word crosses a 4-byte boundary
//
// The CPU itself can execute up to one instruction per clock when given separate data and instruction buses.
// A BIU with separate data/instruction buses, running at the same frequency as the CPU, is also possible,
// but it requires more resources.
//
//////////////////////////////////////////////////////////////////////////////////
// How to compute each instruction duration, in clock cycles (specific to this BIU implementation):
//
// 1 - From Next186_features.doc, look up how many T states each instruction requires (they are always
// less than or equal to those of the 486, and much lower than those of the original 80186).
// 2 - Multiply this number by 2. The BIU runs at twice the ALU frequency because it must multiplex data and instruction fetches
// in order to keep the ALU permanently fed with instructions. The 16-byte queue acts as a flexible instruction buffer.
// 3 - Add penalties, as follows:
// +1T for each memory read - the synchronous SRAM needs this extra cycle to deliver the data
// +2T for each jump - required to flush and re-fill the instruction queue
// +1T for each 16-bit (word) read/write that crosses a 4-byte boundary - specific to the 32-bit bus width
// +1T if the jump target address has its two lowest bits equal to 11 - specific to the 32-bit bus width
// +1T when the instruction queue runs empty - this is very rare, occurring only when many 5-6 byte memory write
// instructions are executed back to back
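//
// The rules above can be condensed into a single formula. This is only a restatement of steps 1-3;
// the symbol names are illustrative and do not come from Next186_features.doc:
//
//   T_total = 2*T_doc + N_read + 2*N_jump + N_word_cross + N_jump_addr3 + N_queue_empty
//
//   T_doc         - T states listed for the instruction in Next186_features.doc
//   N_read        - number of memory reads
//   N_jump        - number of jumps (instruction queue flushes)
//   N_word_cross  - number of 16-bit reads/writes that cross a 4-byte boundary
//   N_jump_addr3  - 1 if the jump target address is 3 (mod 4), else 0
//   N_queue_empty - number of times the instruction queue runs empty
//
// For example, "inc word ptr [3]" gives 2*2 + 1 + 0 + 2 + 0 + 0 = 7T.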
//
// Some examples:
// - "lea ax,[bx+si+1234]" requires 2T
// - "add ax, 2345" requires 2T
// - "xchg ax, bx" requires 4T
// - "inc word ptr [1]" requires 5T (2x2T inc M + 1T read)
// - "inc word ptr [3]" requires 7T (2x2T inc M + 1T read + 1T unaligned read + 1T unaligned write)
// - "imul ax,bx,234" requires 4T (2x2T imul)
// - "loop" to a target address != 3 (mod 4) requires 4T (2x1T loop + 2T flush)
// - "loop" to a target address == 3 (mod 4) requires 5T (2x1T loop + 2T flush + 1T unaligned jump)
// - "call" to address 0 requires 4T (2x1T call near + 2T flush)
// - "ret" to address 0 requires 5T (2x2T ret + 1T read penalty)
//
//////////////////////////////////////////////////////////////////////////////////

`timescale 1ns / 1ps
|
|
|
|