Zip CPU :: An honest assessment
Having now worked with the Zip CPU for a while, it is worth offering an honest assessment of how well it works and how well it was designed. At the end of this assessment, I will propose some changes that may take place in a later version of this Zip CPU to make it better.
- The Zip CPU is lightweight and fully featured as it exists today. For anyone who wishes to build a general purpose CPU and then experiment with adding particular features, the Zip CPU makes a good starting point: it is fairly simple, and modifications should be simple as well. Further, with builds as small as 1.1k LUTs, it is light enough for many constrained environments.
The Zip CPU has a working backend for GCC and binutils!
It is possible to build a minimal Operating System using the Zip CPU! A proof of concept of such an O/S already exists.
The Zip CPU was designed to be an implementable soft core that would be placed within an FPGA, controlling actions internal to the FPGA. It fits this role rather nicely. It does not fit the role of a system on a chip very well, but then it was never intended to be a system on a chip but rather a system within a chip.
The extremely simplified instruction set of the Zip CPU was a good choice. Although it lacks many commonly used instructions, among them PUSH, POP, JSR (sometimes a JL instruction or CALL), and RET, the simplified instruction set has demonstrated an amazing versatility. Indeed, the implementation of GCC doesn't seem to notice the lack of these instructions, working fairly well in spite of it.
This simplified instruction set is easy to decode.
The simplified bus transactions (32-bit words only) were also very easy to implement.
The novel approach of having a single interrupt vector, which simply returns the CPU to wherever it left off within the last interrupt context, doesn't appear to have been much of a problem. If most modern systems handle interrupt vectoring in software anyway, why maintain hardware support for it?
It's actually even better than that. A single interrupt entry point means that the supervisor code has a single entrance point, rather than multiple entry points. This makes the supervisor code easier to write and implement.
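The single-entry-point supervisor described above can be sketched as a simple dispatch loop. This is only an illustrative model in C, not the actual ZipCPU supervisor: `run_user_until_interrupt()` is a hypothetical stub standing in for the primitive that resumes user code and returns when the next interrupt fires, and the cause codes are invented for the example.

```c
#include <assert.h>

/* Hypothetical interrupt causes, decoded in software by the supervisor. */
typedef enum { CAUSE_TIMER, CAUSE_UART, CAUSE_SYSCALL } cause_t;

static int timer_ticks, uart_events;

/* Stub standing in for "resume user code until an interrupt occurs":
 * here the cause is simply injected by the caller for demonstration. */
static cause_t run_user_until_interrupt(cause_t injected) { return injected; }

static void supervisor_step(cause_t injected)
{
    /* Single entry point: every interrupt lands here, and the
     * supervisor dispatches on the cause in software rather than
     * relying on a hardware vector table. */
    switch (run_user_until_interrupt(injected)) {
    case CAUSE_TIMER: timer_ticks++; break;
    case CAUSE_UART:  uart_events++; break;
    default:          break;
    }
}
```

The point of the sketch is that the whole "vector table" collapses into one `switch`, which is why a single hardware entry point keeps the supervisor simple.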
My goal of a high rate of instructions per clock may not be the proper measure. For example, if instructions are being read from a SPI flash device, such as is common among FPGA implementations, these same instructions may suffer stalls of between 64 and 128 cycles per instruction just to read the instruction from the flash. Executing the instruction in a single clock cycle is no longer the appropriate measure. At the same time, it should be possible to use the DMA peripheral to copy instructions from the FLASH to a temporary memory location, after which they may be executed at a single instruction cycle per access again.
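The copy-from-flash idea above can be sketched as an overlay loader. This is a hedged illustration: the symbol names and sizes are assumptions, a real build would take the region boundaries from the linker script, and the ZipCPU's DMA peripheral would perform the copy in hardware rather than `memcpy()`.

```c
#include <stdint.h>
#include <string.h>

/* Copy an instruction region from (slow) flash into fast on-chip RAM
 * so it can then execute at roughly one cycle per instruction fetch.
 * memcpy() here is a software stand-in for the DMA peripheral. */
static void load_overlay(uint32_t *ram, const uint32_t *flash, size_t nwords)
{
    memcpy(ram, flash, nwords * sizeof(uint32_t));
}
```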
The novel approach to pipelined data accesses over the wishbone bus is worth paying attention to. Thus, even without a data cache, the Zip CPU can achieve nearly single cycle data accesses--after the initial memory pipeline setup cost.
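The cost structure of such a pipelined burst can be captured in a toy model. The numbers below are illustrative assumptions, not measured ZipCPU latencies: the first access pays the full bus setup latency, and each subsequent access in the burst completes one cycle later.

```c
#include <assert.h>

/* Toy cost model for a pipelined bus burst: the first word pays the
 * full setup cost, every following word retires one cycle apart. */
static int burst_cycles(int setup_cost, int nwords)
{
    return (nwords > 0) ? setup_cost + (nwords - 1) : 0;
}
```

Under this model a four-word burst with a five-cycle setup costs eight cycles total, versus twenty for four independent accesses, which is the whole appeal of the pipelined approach.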
The early branching algorithm is about as simple as it can get, and yet it is quite useful. BRAnch instructions take two cycles to execute, and LJMP instructions three--assuming the target is in the instruction cache. Conditional branches still suffer complete pipeline stalls, as there is no true branch prediction in this CPU.
The CPU has no octet (8-bit) support. This is both good and bad. Realistically, the CPU works just fine without it. The C language specification allows for bytes being other than 8-bits, and defines a "byte" as the smallest unit of addressable memory. Hence, on the Zip CPU a byte is a 32-bit value. Characters can be supported as subsets of 32-bit words without any problem. Practically, though, it has made compiling non-Zip CPU code difficult--especially anything that assumes sizeof(int)=4*sizeof(char), that accesses int8's or int16's, or that tries to create unions with characters and integers and then attempts to reference the address of the characters within that union.
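The union problem above is worth making concrete. The snippet below is the kind of portable-looking C that silently breaks on the ZipCPU: on a typical 8-bit-byte host, `sizeof(int) == 4*sizeof(char)` and `u.c[1]` aliases one octet of `u.i`; on the ZipCPU, `char` and `int` are both one 32-bit word, so the array has a single element and the indexing assumption collapses.

```c
#include <assert.h>

/* Code that assumes an int is four chars wide. On an 8-bit-byte host
 * this union overlays four octets on one int; on the ZipCPU, where a
 * "byte" is a 32-bit word, c[] has exactly one element. */
union word_bytes {
    int  i;
    char c[sizeof(int)];
};

/* Only meaningful when a char is one octet of the int. */
static char second_octet(int v)
{
    union word_bytes u;
    u.i = v;
    return u.c[1];
}
```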
What makes this lack of octet support particularly painful is that the implementations of the standard C library I have come across all assume that a byte is 8-bits. This has slowed down the delivery of a standard library, and therefore of GCC testbench support to prove the capability of the compiler.
- The Zip CPU does not support a data cache. One can still be built externally, but this is a limitation of the CPU proper as built. Further, under the theory of the Zip CPU design (that of an embedded soft-core processor within an FPGA, where any "address" may reference either memory or a peripheral that may have side-effects), any data cache would need to be based upon an initial knowledge of whether or not it is supporting memory (cachable) or peripherals. This knowledge must exist somewhere, and that somewhere is currently and by design external to the CPU.
This may also be written off as a "feature" of the Zip CPU, since the addition of a data cache can greatly increase the LUT count of a soft core.
Where this is particularly painful is when using the compiler in an unoptimized fashion. In this mode, GCC generates a lot of references to local variables placed on the stack. Examining the assembly the compiler produces shows that some of these references are useful and meaningful, while others are not. Particularly annoying is when the compiler saves (stores) a result, immediately reads it back, and never references it anywhere else! This means that nearly every instruction will involve both a read from and a write to memory. The compiler does much better, however, when optimizations are turned on.
Many other instruction sets offer three operand instructions, whereas the Zip CPU only offers two operand instructions. This means that it takes the Zip CPU more instructions to do many of the same operations. The good part of this is that it gives the Zip CPU a greater amount of flexibility in its immediate operand mode, although that increased flexibility isn't necessarily as valuable as one might like.
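The two-operand cost can be seen in a tiny register model. This is a hypothetical illustration, not ZipCPU assembly: computing `c = a + b` into a fresh register takes one instruction on a three-operand ISA, but a two-operand ADD overwrites its destination, so a MOV must come first.

```c
#include <assert.h>

/* Hypothetical register file for illustration only. */
static int regs[16];

/* Two-operand primitives: the destination is also a source. */
static void MOV(int src, int dst) { regs[dst] = regs[src]; }
static void ADD(int src, int dst) { regs[dst] += regs[src]; }

/* c = a + b with a two-operand ISA: two instructions, not one. */
static int add_two_operand(int a, int b)
{
    regs[1] = a;
    regs[2] = b;
    MOV(1, 3);   /* R3 <- R1        (the extra instruction)  */
    ADD(2, 3);   /* R3 <- R3 + R2   (two-operand ADD)        */
    return regs[3];
}
```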
The Zip CPU doesn't support out of order execution. I suppose it could be modified to do so, but then it would no longer be the "simple" and low LUT count CPU it was designed to be. The primary results are that 1) loads may unnecessarily stall the CPU, even if other things could be done while waiting for the load to complete, 2) bus errors on stores will never be caught at the point of the error (this will need to be fixed with the integration of an MMU), 3) divide or floating point instructions will lock the CPU for many cycles until complete, and 4) true branch prediction becomes more difficult.
The Zip CPU is by no means generic: it will never handle addresses wider than 32 bits (16GB, given 32-bit words) without a complete and total redesign. This may limit its utility as a generic CPU in the future, although as an embedded CPU within an FPGA this isn't really much of a limit or restriction.
Particularly annoying when porting GCC is that the ZipCPU supports only some comparisons. For example, it supports signed less than, greater than, and greater than or equal to, but not signed less than or equal to. As for unsigned comparisons, the Zip CPU supports an unsigned less than comparison, but no unsigned greater than, less-than-or-equal, or greater-than-or-equal comparisons. While GCC has been instructed to rewrite its comparisons so that these "missing" predicates can still be fully useful, it currently means that GCC isn't going to do as much with the available predicates as the CPU can do.
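The kind of rewriting GCC performs can be shown at the C level. These helpers are hypothetical illustrations of the legalization, not compiler internals: a missing signed less-than-or-equal can be obtained by swapping operands into the available greater-than-or-equal, and a missing unsigned greater-than by swapping operands into the available unsigned less-than.

```c
#include <assert.h>

/* a <= b rewritten using only the available signed >= predicate. */
static int le_via_ge(int a, int b)   { return b >= a; }

/* The one unsigned predicate the text says the CPU provides. */
static int ltu(unsigned a, unsigned b) { return a < b; }

/* a > b (unsigned) rewritten using only unsigned less-than. */
static int gtu_via_ltu(unsigned a, unsigned b) { return ltu(b, a); }
```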
CPU Speed, as currently implemented, is limited by the propagation time from the output of the ALU/MEM/DIV/FPU unit to the result of the first ALU operation. This limits our clock speed to 100MHz on an Artix-7, or 80MHz on a Spartan-6. Still ... work is underway to further pipeline the CPU and raise this maximum clock rate up to 200 MHz on an Artix-7.
There is really only one ugly problem currently known with the ZipCPU.
The ZipCPU lacks a Memory Management Unit, or MMU.
An MMU would allow the ZipCPU to make use of virtual memory. It would allow every program running on the CPU to believe it owned the entire memory space, and it would allow the user programs to effectively request more memory from the kernel whenever they needed it. A properly implemented MMU would also allow the kernel to restrict peripheral access from user programs.
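What an MMU would add can be sketched as a page-table lookup. Everything here is an assumption for illustration (a toy 20-bit virtual space, 1024-word pages, a single-level table); a real ZipCPU MMU would choose its own page size and table layout.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS 10                 /* 1024-word pages (assumed)        */
#define NPAGES    (1u << 10)         /* toy 20-bit virtual address space */

/* page_table[vpn] holds the physical page number for that virtual page. */
static uint32_t page_table[NPAGES];

/* Translate a virtual word address into a physical one: split off the
 * virtual page number, look it up, and reattach the in-page offset. */
static uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_BITS;
    uint32_t off = vaddr & ((1u << PAGE_BITS) - 1);
    return (page_table[vpn] << PAGE_BITS) | off;
}
```

A permissions field alongside each entry is what would then let the kernel fence user programs off from supervisor memory and peripherals, as described below.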
As an added benefit, a proper MMU could (should) be implemented with a data cache to then solve the problem of what gets cached and what does not.
The big problem here is that most *Nix implementations depend upon an MMU, and so the lack of an MMU has become one of the big hurdles preventing a full *Nix type of implementation.
This has resulted in the following consequences in the demonstration O/S:
- Process stack sizes are static, not dynamic, and the full stack size needed by a process must be known ahead of time.
- Supervisor memory has no protection from user processes, and so user processes can (inadvertently) overwrite kernel data structures.
- Peripheral memory has no protection from user processes either, so on one particularly bad day I reconfigured several one-time-only configurable flash memory registers ...
The lack of an MMU is a candidate for fixing in future versions of the Zip CPU.
Several big changes are waiting for the next generation: the introduction of an MMU, the completion of the floating point unit (FPU), and (perhaps) even a high-speed version of the CPU capable of running on a 200MHz clock within an Artix-7 FPGA. Non-working drafts exist for all of these capabilities, so the best I can say is that they are works in progress.