URL
https://opencores.org/ocsvn/m65c02/m65c02/trunk
Subversion Repositories m65c02
[/] [m65c02/] [trunk/] [README.md] - Rev 2
Compare with Previous | Blame | View Log
M65C02 Microprocessor Core
=======================
Copyright (C) 2012-2013, Michael A. Morris <morrisma@mchsi.com>.
All Rights Reserved.
Released under LGPL.
News
----
Recently completed tests have verified the M65C02 soft-
processor to operate as designed at a frequency of 73.728 MHz in an XC3S50A-
4VQG100I FPGA. See below for a more complete description of Release 2.7.2.
with which this milestone was achieved.
General Description
-------------------
This project provides a microprogrammed synthesizable IP core compatible with
the WDC and Rockwell 65C02 microprocessors.
It is provided as a core. Several external components are required to form a
functioning processor: (1) memory, (2) interrupt controller, and (3) I/O
interface buffers. The Verilog testbench provided demonstrates a simple
configuration for a functioning processor implemented with the M65C02 core:
M65C02_Core. The M65C02 core supports the full instruction set of the W65C02.
The core accepts an interrupt signal from an external interrupt controller.
The core provides the interrupt mask bit to the external interrupt controller,
and expects the controller to handle the detection of the NMI edge, the
prioritization of the interrupt sources, and to provide the interrupt and
exception vectors. The core also provides an indication of whether the BRK
instruction is being executed. With this additional information, the external
interrupt controller is expected to provide the same vector for the BRK
exception as the vector for the IRQ interrupt request, or another suitable
vector. This approach to interrupt handling can be used to support a vectored
interrupt structure with more interrupt sources than the original processor
implementation supported: NMI, RST, and IRQ.
With Release 2.x, the core now provides a microcycle length controller as an
integral component of the M65C02 Microprogram Controller (MPC). The M65C02
core microprogram can now inform the external memory controller, on a cycle by
cycle basis, of the memory cycle type. Logic external to the core can use this
output to map the memory cycle to whatever memory is appropriate, and to drive
the microcycle length inputs of the core to extend each microcycle if
necessary. Thus, the Release 2.x core no longer assumes that the external
memory is implemented as an asynchronous memory device, and as a result, the
core no longer expects that the memory will accept an address and return the
read data at that address in the same cycle. With the built-in microcycle
length controller, single cycle LUT-based zero page memory, 2 cycle internal
block RAM memory, and 4 cycle external memory can easily be supported. A Wait
input can also be used to extend, i.e. add wait states, to the 4 cycle
microcycles, so a wide variety of memories can be easily supported; the only
limitation being the memory types supported by the user-supplied external
memory controller.
The core provides a large number of status and control signals that external
logic may use. It also provides access to many internal signals such as all of
the registers, A, X, Y, S, and P. The *Mode*, *Done*, *SC*, and *RMW* status
outputs may be used to provide additional signals to external devices.
*Mode* provides an indication of the kind of instruction being executed:
0 - STP - stop processor instruction executed,
1 - INV - invalid instruction (uniformly treated a single cycle NOPs),
2 - BRK - break instruction being executed
3 - JMP - branch/jump/return (Bcc, BBRx/BBSx, JMP/JSR, RTS/RTI),
4 - STK - stack access (PHA/PLA, PHX/PLX, PHY/PLY),
5 - INT - single cycle instruction (INC/DEC A, TAX/TXA, SEI/CLI, etc.),
6 - MEM - multi-cycle instruction with memory access for operands,
7 - WAI - wait for interrupt instruction being executed.
*Done* is asserted during the instruction fetch of the next instruction.
During that fetch cycle, all instructions complete execution. Thus, the M65C02
is pipelined, and executes many instructions in fewer cycles than the 65C02.
*SC* is used to indicate a single cycle instruction.
*RMW* indicates that a read-modify-write instruction will be performed. External
logic can use this signal to lock memory.
*IO_Op* indicates the I/O cycle required. *IO_Op* signals data memory writes,
data memory reads, and instruction memory reads. Therefore, external logic may
implement separate data and instruction memories and potentially double the
amount of memory that an implementation may access.
Implementation
--------------
The implementation of the core provided consists of five Verilog source files
and several memory initialization files:
M65C02_Core.v - Top level module
M65C02_MPCv3.v - M65C02 MPC with microcycle length controller
M65C02_AddrGen.v - M65C02 Address Generator module
M65C02_ALU.v - M65C02 ALU module
M65C02_BIN.v - M65C02 Binary Mode Adder module
M65C02_BCD.v - M65C02 Decimal Mode Adder module
M65C02_Decoder_ROM.coe - M65C02 core microprogram ALU control fields
M65C02_uPgm_V3a.coe - M65C02 core microprogram (sequence control)
M65C02_Core.ucf - User Constraints File: period and pin LOCs
M65C02.tcl - Project settings file
tb_M65C02_Core.v - Completed core testbench with test RAM
M65C02_Tst3.txt - Memory configuration file of M65C02 "ROM" program
M65C02_Tst3.a65 - Kingswood A65 assembler source code test program
tb_M65C02_ALU.v - testbench for the ALU module
tb_M65C02_BCD.v - testbench for the BCD adder module
Synthesis
---------
The objective for the core is to synthesize such that the FF-FF speed is 100 MHz
or higher in a Xilinx XC3S200AN-5FGG256 FPGA using Xilinx ISE 10.1i SP3. In that
regard, the core provided meets and exceeds that objective. Using the settings
provided in the M65C02.tcl file, ISE 10.1i tool implements the design and
reports that the 10.000 ns period (100 MHz) constraint is satisfied.
The ISE 10.1i SP3 implementation results are as follows:
Number of Slice FFs: 191
Number of 4-input LUTs: 747
Number of Occupied Slices: 459
Total Number of 4-input LUTs: 760 (13 used as route-throughs)
Number of BUFGMUXs: 1
Number of RAMB16BWEs 2 (M65C02_Decoder_ROM, M65C02_uPgm_V3a)
Best Case Achievable: 9.962 ns (0.038 ns Setup, 1.028 ns Hold)
Status
------
Design and verification is complete.
Release Notes
-------------
###Release 1
Release 1 of the M65C02 had an issue in that addressing wrapping of zero page
addressing was not properly implemented. Unlike the W65C02 and MOS6502, the
M65C02 synthesizable core implemented the addressing modes, but allowed page
boundaries to be crossed for all addressing modes. This initial behavior is
more like that of the WDC 65C802/816 microprocessors in native mode. With this
release, Release 2, the zero page addressing modes of the M65C02 core behave
like those of the WDC W65C02.
Following Release 1, a couple of quick patches were made to the zero page
addressing, but these failed to address all of the issues. Release 2 uses the
same basic next address generation logic, except that it now allows the
microcode to control when addresses are computed modulo 256. With this change,
all outstanding issues with respect to zero page addressing have been
corrected.
###Release 2
Release 2 has reworked the Microprogram Controller (MPC) to include a
microcycle length controller directly. With this new MPC, it is expected that
it will be easier to adapt the core to use LUT RAM for page 0 (data page) and
page 1 (stack page), and to attach a external memory controller with variable
length access cycles. The microcycle length controller allows 1, 2, or 4 cycle
microcycles. Neither the 1 and 2 cycle microcyles support wait state
insertion, but the 4 cycle microcycle allows the insertion of wait states.
With this architecture, LUT and internal Block RAMs can be used to provide
high speed operation. The 4 cycle external memory microcycle should easily
allow the core to support asynchronous or synchronous external memory. Release
1 allowed variable length microcycles, but the address-based mechanism
implemented was difficult to use in practice. Release 1 targeted a single
cycle memory like that provided by the distributed LUT RAMs of the target
FPGAs. The approach used in Release 2 should make it much easier to adapt the
M65C02 core.
####Release 2.1
Release 2.1 has modified the core to export signals to an external memory
controller that would allow the memory controller to drive the core logic with
the required microcycle length value for the next microcycle. The test bench for
the core is running in parallel with the original Release 1 (with zero page
adressing corrected) core (M65C02_Base.v) so that a self-checking configuration
is achieved between the two cores and the common test program. Release 2.1 also
includes a modified memory model module, M65C02_RAM,v, that supports all three
types of memory that is expected to be used with the core: LUT (page 0), BRAM
(page 1 and internal program/data memory), and external pipelined SynchRAM.
####Release 2.2
Release 2.2 has been tested using microcycles of 1, 2, or 4 cycles in length.
During testing, some old issues returned when multi-cycle microcycles were
used. With single cycle microcycles there were no problems with either of the
two cores: M65C02_Core.v or M65C02_Base.v. For example, with 2 and 4 cycle
microcycles, the modification of the PSW before the first instruction of the
ISR was found to be taking place several microcycles before it should. This
issue was tracked down to the fact that the microprogram ROMs and the PSW
update logic were not being qualified by the internal Rdy signal, or end-of-
microcycle. In the single cycle microcycle case, previous corrections applied
to address this issue still worked, but the single cycle solutions applied did
not generalize to the multi-cycle cases. Thus, several modules were modified
so that ISR, BCD, and zero page addressing modes now behave correctly for
single and multi-cycle microcycles.
####Release 2.3
Release 2.3 implements the standard 6502/65C02 vector fetch operations and
adds the WAI and STP instructions. Both versions are updated to incorporate
these features. The testbench has been modified to include another M6502_RAM
module, and to separate the two modules into "ROM" at high memory and "RAM" at
low memory. The test program has been updated to include initialization of
"RAM" by the test program running from "ROM". Initialization of the stack
pointer is still part of the core logic, and the test program expects that S
is initialized to 0xFF on reset, and that the reset vector fetch sequence does
not modify the stack. In other words, the Release 2.3 core does not write to
the stack before fetching the vector and starting execution at that address.
####Release 2.4
Release 2.4 incorporates the 32 Rockwell instruction opcodes and the WAI and STP
instructions.
####Release 2.5
Release 2.5 makes some minor modifications to the M65C02 core module to allow
the output of some signals that allow the generation of interface signals such
as the active low Vector Pull output of the W65C02S microprocessor. In
addition to bringing out of these signals, Release 2.5 also provides an
implementation of a standalone microprocessor, or system-on-chip, which
demonstrates how the M65C02 can be used to provide a stand-alone
implementation of a 65C02 processor. This implementation is composed of the
following files:
M65C02.v - M65C02 microprocessor demonstration
ClkGen.xaw - Xilinx Architecture Wizard clock generator file
M65C02.ucf - User Constraints File: period and pin LOCs
M65C02.tcl - Project settings file
tb_M65C02.v - M65C02 testbench with RAM/ROM and interrupt sources
The header of the M65C02.v module provides details of the differences between
the 65C02 microprocessor implementation represented by the M65C02.v and a
65C02 processor implementation as represented by the WDC W65C02S microprocessor.
The M65C02 implementation is targeted at an XC3S50A-4VQG100I FPGA. The User
Constraints File (ucf) has been developed so that the resulting implementation
can be used as a fully functional microprocessor when attached to external I/O
devices, external SRAM device(s) (25ns or faster), and external an NOR Flash
device (4kB, 45ns or faster). A development board is presently being developed
to demonstrate the M65C02, and to provide a suitable platform for further
development of the remaining FPGA resources into a more complete system-on-
chip based on the M65C02 core.
The Xilinx ISE 10.1i SP3 synthesis results for the M65C02 are as follows:
Used Avail %
Number of Slice Flip Flops 200 1408 14%
Number of 4 input LUTs 736 1408 52%
Logic Distribution
Number of occupied Slices 426 704 60%
Number of Slices related logic 426 426 100%
Number of Slices unrelated logic 0 426 0%
Total Number of 4 input LUTs 745 1408 52%
Number used as logic 735
Number used as a route-thru 9
Number used as Shift registers 1
Number of bonded IOBs
Number of bonded pads 53 68 77%
IOB Flip Flops 79
Number of BUFGMUXs 4 24 16%
Number of DCMs 1 2 50%
Number of RAMB16BWEs 2 3 66%
Best Case Achievable: 13.213ns (0.037ns Setup, 1.023ns Hold)
Please read the header and other comments for more details on the M65C02
processor implementation. In particular, read and understand the discussion
regarding the use of an FPGA-specific clock multiplexer to manage the memory
cycle length in lieu of supporting wait state generation/insertion.
#####Release 2.6
Modified the M65C02 processor to use the last available block RAM in the
XC3S50A-xVQG100I device as a 2kB Boot/Monitor ROM. Added an external pin to
inhibit writes into this block RAM. The UCF file includes a PULLUP on the pin
which enables writes. Also modified the clock stretch logic to only apply when
system ROM, CE[2], or User ROM, CE[1], are addressed. The Boot/Monitor
ROM/RAM, IO (CE[3]), and User RAM, CE[0], do not use the clock stretching
logic and therefore require devices able to respond in a single memory cycle of
the M65C02, ~25ns.
Adding the additional (internal) device select and data multiplexer to the
M65C02 caused a drop in performance. External memory operating frequency
decreased from ~20 MHz (max) to ~16 MHz for a -5 speed grade part. There was
also an increase in the size of the implementation, but that was expected and
did use a reasonable number of additional resources.
The following table summarizes PAR results for the new release of the M65C02
processor:
Used Avail %
Number of Slice Flip Flops 205 1408 14%
Number of 4 input LUTs 724 1408 51%
Logic Distribution
Number of occupied Slices 443 704 62%
Number of Slices related logic 443 443 100%
Number of Slices unrelated logic 0 426 0%
Total Number of 4 input LUTs 732 1408 51%
Number used as logic 723
Number used as a route-thru 8
Number used as Shift registers 1
Number of bonded IOBs
Number of bonded pads 54 68 79%
IOB Flip Flops 80
Number of BUFGMUXs 4 24 16%
Number of DCMs 1 2 50%
Number of RAMB16BWEs 3 3 100%
Best Case Achievable: 15.147ns (0.003ns Setup, 0.817ns Hold)
The modified files are:
M65C02.v - M65C02 microprocessor demonstration
M65C02.ucf - User Constraints File: period and pin LOCs
tb_M65C02.v - M65C02 testbench with RAM/ROM and interrupt sources
Additional work is needed for verification, but this release successfully
executes the same test program as the previous release of the M65C02 processor
and the M65C02 core.
#####Release 2.7
Modified the Release 2.6 M65C02 processor to use a newly released version of the
microprogram controller. The new microprogram controller, M65C02_MPCv4.v,
modifies the behavior of the built-in microcycle length controller. It fixes the
microcycle length to 4, and adds four additional states by which external
devices can request wait states. The new microprogram controller adds wait
states in integer multiples of the memory cycle. In this way, the clock stretch
logic built using a FF and a BUFGMUX clock multiplexer can be removed, and the
external Phi1O and Phi2O signals will maintain their natural 50% DC signal
characteristic.
The change to the microprogram controller required a change to the core and to
the interface between the core and the M65C02 processor. Within the core, the
change in the microprogram controller removed the need for the cycle extension
logic used to insert an extra state in the microcycle whenever a BCD instruction
is executed. That extra cycle is only needed when the core is operating with
single memory. Since the microcycle is fixed to 4 with the new microprogram
controller, the BCD mode microcycle extension logic was removed.
The interface change refers to the need to increase the width of the microstate
signal, MC, from 2 to 3 bits. Within the M65C02 processor, the additional states
supported by the larger MC port required that the clock enable for the external
memory data input register be modified. The nominal external input data sampling
point is cycle 3, falling edge of Phi2O. With wait states, the data sampling
point becomed cycle 3 or cycle 7. For data sampling, the external Rdy input
signal must also be asserted. A final change to the M65C02 processor is that the
Phi1O and Phi2O signals are now set and reset using four microstate decode
signals rather than two.
The incorporation of the last block memory into the design resulted in a loss
of performance. The M65C02 processor is unable to maintain an external memory
cycle rate of 18.432 MHz when the internal block RAM is included. The
additional decode and input data multiplexer impose a path delay that lowers
the memory interface operating speed to 16 MHz. Thus, the nearest baud rate
frequency is 14.7456 MHz.
Operating at 14.7456 MHz requires external devices to request a wait state if
they are unable to accept or supply data within 33.908ns. (At 16 MHz
operation, the access time requirement is 31.25ns.) A single wait state
extends the memory access time to 101.725ns. At 14.7456 MHz or 16 MHz, the
memory cycle characteristics of the M65C02 processor allow the use of low-cost
high-speed asynchronous SRAMs, and with one wait state, low-cost NOR Flash
EEPROMs in 45, 55, 70, or 90ns speed grades.
The following table summarizes PAR results for Release 2.7 of the M65C02
processor:
Used Avail %
Number of Slice Flip Flops 205 1408 14%
Number of 4 input LUTs 720 1408 51%
Number of occupied Slices 401 704 56%
Number of Slices related logic 401 401 100%
Number of Slices unrelated logic 0 401 0%
Total Number of 4 input LUTs 728 1408 51%
Number used as logic 719
Number used as a route-thru 8
Number used as Shift registers 1
Number of bonded IOBs
Number of bonded pads 54 68 79%
IOB Flip Flops 79
Number of BUFGMUXs 4 24 16%
Number of DCMs 1 2 50%
Number of RAMB16BWEs 3 3 100%
Best Case Achievable: 15.625ns (0.000ns Setup, 0.961ns Hold)
The files modified in this release are:
M65C02.v - M65C02 microprocessor demonstration
M65C02_Core.v - M65C02 core logic
M65C02_MPCv4.v - M65C02 core microprogram controller
M65C02.ucf - User Constraints File: period and pin LOCs
M65C02.tcl - M65C02 ISE tool configurations/settings
tb_M65C02.v - M65C02 testbench with RAM/ROM and interrupt sources
Testing with the current testbench demonstrates that the M65C02 processor
correctly executes the 65C02 test program, M65C02_Tst3.a65, used in previous
testing of the M65C02 core with tb_M65C02_Core.v. That provides confidence
that the integration of the core logic with the memory interface, interrupt
handler, reset controller, and internal block RAM did not introduce any errors
related to the core. However, the circuits in the wrapper around the core
logic have not been extensively tested. The testing that has been performed to
date indicate these circuits are operating correctly, but the tests performed
to date only test the nominal cases and not those cases on the margins.
For example, the interrupt handler has demonstrated that it is able to handle
vector generation for RST, IRQ, and BRK; NMI vector processing has not yet
been tested. Another signal not yet tested is the reset logic's characteristic
that requires the external nRst signal to be asserted for four cycle of the
input clock before it is recognized. This behavior has not yet been tested, nor
has the related behavior that a loss of lock of the internal clock generator
will assert reset to the M65C02 processor.
#####Release 2.71
Corrected logic for generating an internal reset signal, Rst, based on an
external reset, nRst, and the state of the DCM_Locked signal. The vector
reduction operator applied, '&', is incorrect. The correct vector reduction
operator is '|', or logic OR. The correction has been made, and the FPGA
correctly drives the nRstO output with the complement of the internal reset
signal, Rst.
The changes have been made to the M65C02.v module, and only that module has
been loaded into the MAM65C02 GitHUB repository.
#####Release 2.72
Improved the timing of the soft-core microprocessor, M65C02, by using a more
efficient scheme for the internal bus multiplexers. Previous releases of the
core, M65C02_Core, and the soft-core microprocessor used multiplexers
generated using _switch/case select_ constructs.
Although these constructs are an effective and fast means for generating bus
multiplexers, there are some penalties. This latest release has resorted to
using one-hot decode ROMs tied to the various bus selects in the
implementation, and then forcing the various data sources to connect to the
busses as gated signals. When not gated, a logic 0 is driven onto the bus. At
the terminal end, a simple OR gate is used to collect all of the desired gated
signals.
The result of this effort has been a significant improvement in the
combinatorial path delays. Prior to this optimization, the synthesizer
reported a clock period performance of ~55 MHz. After the OR bus optimization
was fully incorporated, the synthesizer reports a minimum period of ~74 MHz.
This is nearly a 35% improvement in the combinatorial path delays.
The resulting improvement is sufficient to allow the soft-core processor to
support an operating speed of **73.728 MHz** which corresponds to a single
instruction cycle time of **18.432 MHz** given this core's 4 cycle microcycle.
In addition to the improved combinatorial path delays, the improvement in path
delays has allowed the core to be synthesized, Mapped, and PARed for minimum
area. The result is a significant reduction in the resource utilization in the
target XC3S50A-4VQG100I FPGA.
The following table summarizes PAR results for Release 2.7 of the M65C02
processor: **XC3S50A-4VQG100I**
Used Avail %
Number of Slice Flip Flops 248 1408 17%
Number of 4 input LUTs 647 1408 45%
Number of occupied Slices 400 704 56%
Number of Slices related logic 400 400 100%
Number of Slices unrelated logic 0 400 0%
Total Number of 4 input LUTs 661 1408 46%
Number used as logic 646
Number used as a route-thru 14
Number used as Shift registers 1
Number of bonded IOBs
Number of bonded pads 54 68 79%
IOB Flip Flops 79
Number of BUFGMUXs 3 24 16%
Number of DCMs 1 2 50%
Number of RAMB16BWEs 3 3 100%
Best Case Achievable: 13.516ns (0.047ns Setup, 1.021ns Hold)
The files modified in this release are:
M65C02.v - M65C02 microprocessor demonstration
M65C02_Core.v - M65C02 core logic
M65C02_AddrGen.v - M65C02 core microprogram controller
M65C02_ALU.v - M65C02 core ALU
M65C02_BIN.v - M65C02 ALU Binary mode adder
M65C02_BCD.v - M65C02 ALU Decimal mode adder
M65C02.ucf - User Constraints File: period and pin LOCs
M65C02.tcl - M65C02 ISE tool configurations/settings
Additional optimizations in the ALU can be applied, but with the improvements
made with this release, a -5 speed grade part can be made to operate at 90+
MHz. If higher speeds are needed, then further optimization, including adding
pipeline registers to the ALU, can be made. Some pipelining can be easily
added because of the 4 clock microcycle around which the soft-core processor
is built.
#####Release 2.73
Improved the modularity of the M65C02 top level module by creating modules for
clock generation and interrupt handling. Updated the design document, and
deleted unnecessary files.