Proposal for a new high performance ISA
by Agner on Mar 23, 2016
Posts: 12 | Joined: Mar 23, 2016 | Last seen: Mar 18, 2023
I want to inform you about a proposal for a new open ISA with focus on high performance vector processors. It is published at http://www.agner.org/optimize/#instructionset
I totally agree with the philosophy of an open ISA and open core. In fact, I have been arguing the same for years. But I want to take the idea a step further and design an ISA and microarchitecture that can actually outperform existing commercial architectures. The design must have a strong focus on features that are important for top performance, such as efficient vector support and an efficient pipeline structure. OpenRISC and RISC-V are well suited for smaller systems where silicon space must be economized, and for scientific experiments, but they have little focus on the details that matter for the performance of larger and more powerful processors.

My proposal combines the best of the RISC and CISC principles. Instructions can have three different sizes: one, two or three 32-bit words. Common simple instructions can also be packed two-by-two into single code words. The detection of the instruction length is made as simple as possible in order to make it easy for the decoder to identify several instructions per clock cycle. Instructions have a medium level of complexity: an instruction can do multiple things, but only if it fits into the pipeline structure so that it does one thing at each pipeline stage. This makes it possible to maintain a throughput of one instruction per clock cycle per pipeline lane (except for division and cache misses). There is no need to split instructions into micro-operations or to use microcode.

Vector support is integrated as a fundamental part of the design rather than an appendix. The vector registers have variable length, where the maximum length is implementation-dependent. A software program will automatically use the maximum vector length supported by the processor, without the need to compile the code separately for each vector length.
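As a sketch of the length-detection idea above, here is a toy decoder. The 2-bit length field in the top bits of the first code word is a hypothetical encoding chosen for illustration; the proposal's actual field placement and values may differ.

```python
def instr_length(first_word: int) -> int:
    """Instruction size in 32-bit words, derived from the first word alone."""
    il = (first_word >> 30) & 0b11      # hypothetical 2-bit length field
    # 0 -> one word packing two half-size instructions, 1 -> one word,
    # 2 -> two words, 3 -> three words
    return (1, 1, 2, 3)[il]

def instruction_boundaries(code: list[int]) -> list[int]:
    """Walk a code block and return each instruction's length in words.

    Because the length depends only on the first word, a wide decoder
    can locate several instruction starts within one clock cycle."""
    lengths, i = [], 0
    while i < len(code):
        n = instr_length(code[i])
        lengths.append(n)
        i += n
    return lengths
```

The point of the sketch is that no serial scan over opcode bytes (as on x86) is needed: each instruction's size is a pure function of its first word.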
A special addressing mode makes it easy to loop through an array so that the last iteration uses a smaller vector length when the array size is not divisible by the maximum vector length.

The proposal includes recommendations for standardization of the whole ecosystem: ABI standard, binary file format, function libraries, compiler support and OS support. With open standards for the entire ecosystem we would be able to combine different programming languages in the same program and use the same function libraries with all compilers.

This proposal is the result of a long discussion on my blog: http://www.agner.org/optimize/blog/read.php?i=421 Feel free to comment on my blog (no registration needed), or here on the cores forum. The proposal is in an initial discussion phase where a template instruction format has been designed, but the finer details of each instruction have not been decided yet. This is not a final standard, but an impetus to start an evolutionary process that could involve ISA specification, software development tools, an emulator and hardware design.

About me: I have done research for many years on microarchitectures and on how to optimize software to get the maximum performance out of a given microarchitecture. My reports, manuals and other resources are widely used by programmers and compiler makers. See www.agner.org/optimize/
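The array-loop idea can be sketched as follows. This is a software model with made-up names, not the actual addressing mode: the index starts negative (relative to a pointer to the end of the array) and counts up toward zero, and the vector length shrinks automatically on the final iteration.

```python
MAX_VLEN = 8  # elements per full vector register; implementation-dependent

def vector_sum(data: list[float]) -> float:
    """Loop over an array whose size need not be a multiple of MAX_VLEN."""
    end = len(data)            # base pointer to the end of the array
    j = -len(data)             # negative index, counts up toward zero
    total = 0.0
    while j < 0:
        vlen = min(MAX_VLEN, -j)                       # last pass may be shorter
        total += sum(data[end + j : end + j + vlen])   # stands in for a vector add
        j += vlen
    return total
```

With an 11-element array and MAX_VLEN = 8, the loop runs twice: one full vector of 8 elements, then a short vector of 3, with no scalar cleanup code.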
RE: Proposal for a new high performance ISA
by dgisselq on Mar 24, 2016
Posts: 247 | Joined: Feb 20, 2015 | Last seen: Oct 24, 2024
I look forward to the opportunity to review this instruction set. I think, after having implemented a CPU, I might even have an opinion as to what you might find helpful. Give me some time to get back to you.
Dan
RE: Proposal for a new high performance ISA
by dgisselq on Mar 24, 2016
Posts: 247 | Joined: Feb 20, 2015 | Last seen: Oct 24, 2024
Here's a quick question: have you looked over the RTX codes that GCC will try to map into instructions? You can find these in section 16.9 of the gccint.pdf file. You may find some capabilities there that you haven't thought of (yet). Otherwise ... I'm still reading.
Dan
P.S. Have you ever ported GCC to a new backend before?
RE: Proposal for a new high performance ISA
by Agner on Mar 24, 2016
Posts: 12 | Joined: Mar 23, 2016 | Last seen: Mar 18, 2023
Thank you for the link.
No, I haven't tried to write a GCC backend yet. Which would you recommend - GCC or LLVM? There is a lot of work to do: make an assembler, linker, compiler, simulator and FPGA implementation. But first we have to fix the details of the new standard.
RE: Proposal for a new high performance ISA
by dgisselq on Mar 24, 2016
Posts: 247 | Joined: Feb 20, 2015 | Last seen: Oct 24, 2024
No, I haven't tried to write a GCC backend yet. Which would you recommend - GCC or LLVM?
From my own experience: I have been working on the ZipCPU here on OpenCores. If you look it up, you'll discover that the ZipCPU has a completely different purpose from what you are working on: it's designed to be a system within a chip - basically an FPGA microcontroller that takes as few resources from the FPGA as possible. Yes, it's designed to be a minimalist (32-bit) CPU.

I am right now finishing (?) up my work on a GCC compiler port. (You know how the first 90% takes 10% of the time? I think I may have finished the first 9% of the time ...) I have ported the binutils assembler, linker, disassembler, and (hopefully) the entire binutils stack, although my port doesn't support DLLs or thread-local storage (yet - although the hardware doesn't really support it yet either). Along the way, I had to make changes to the CPU given what I am learning in hindsight from the tools work. For example,
But let me come back to your question: should you support GCC or LLVM? I'm going to suggest that you want to support whatever compiler gives you the easiest path to getting Debian/Linux running on your platform. I think that will be GCC, but I'll invite you to prove me wrong.
Dan
RE: Proposal for a new high performance ISA
by dgisselq on Mar 24, 2016
Posts: 247 | Joined: Feb 20, 2015 | Last seen: Oct 24, 2024
I didn't find any information regarding interrupts and interrupt handling, nor did I find any description of protected modes within the document. Was this an oversight?
Dan
RE: Proposal for a new high performance ISA
by dgisselq on Mar 24, 2016
Posts: 247 | Joined: Feb 20, 2015 | Last seen: Oct 24, 2024
Agner,
I finally made it through your document! Thank you for putting the effort into trying to build a new/better CPU architecture. I like your goal(s)--so much so that I was tempted to start building something similar to what you are proposing before I saw your proposal!
Again, you've stepped into a gigantic undertaking. There's a long, long road between here and a working CPU, but I commend you for trying. Hope this helps!
Dan
RE: Proposal for a new high performance ISA
by redbear on Mar 24, 2016
Posts: 8 | Joined: Feb 16, 2009 | Last seen: Nov 13, 2023
Need block verification :-) ?
RE: Proposal for a new high performance ISA
by robfinch on Mar 25, 2016
Posts: 28 | Joined: Sep 29, 2005 | Last seen: Nov 18, 2024
Hi Agner,
I just started working on a simple vector processor (3-element vectors) the other day, with the goal of having it run a ray-tracing program. So this might be in sync with what I'm looking into right now. I started reading your initial docs on a vector ISA. Since I don't really know much about vector machinery, I reserve most of my comments for later. I've heard that the ISA only makes a small difference to overall performance when all other things are considered. It would seem to me, however, that for a vector CPU a vector-oriented ISA would make sense.

One thing I noticed was using r31 to indicate no index register; wouldn't it make more sense to use R0? R0 is typically reserved for the value of zero in a number of other CPUs.

Push/pop operations: the pop operation can't be implemented without being able to update two registers in a single instruction. This isn't simple to do. For my own project I ended up test-coding it as two separate micro-ops. Then I ended up removing them because of a lack of LUT resources. My system has gotten by fine without these ops (but I still wish they could be implemented).

A quandary was the link register. I'd keep it around. Some programmers optimize by placing jump addresses in registers. Since there are several code-type addresses in a system that need to be managed (exception addresses, instruction addresses), I collected them together into a code address register set for my own project.

The variable-length instruction coding with the four sizes seems reasonable. Kinda reminds me of the RISC-V encoding bits.

I think I see what you're trying to do with the variable-length vector registers. I'd put the vector length in an SPR though. Will it vary while a given program is running?

Looking forward to seeing more of this project.
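The two-writes problem with POP can be made concrete with a toy machine model. The register numbering follows the proposal's r31-as-stack-pointer convention, but the 8-byte stack slot and the function names are illustrative only.

```python
SP = 31  # stack pointer register number (r31 in the proposal)

class State:
    """Minimal architectural state: 32 registers and a sparse memory."""
    def __init__(self):
        self.regs = [0] * 32
        self.mem = {}

def push(s: State, src: int) -> None:
    s.regs[SP] -= 8                    # one register write, plus a memory write
    s.mem[s.regs[SP]] = s.regs[src]

def pop(s: State, dst: int) -> None:
    s.regs[dst] = s.mem[s.regs[SP]]    # register write 1: the destination
    s.regs[SP] += 8                    # register write 2: the stack pointer
```

PUSH needs only one register-file write port (plus a memory port), while POP commits two register writes in a single instruction - exactly what a minimal register file with one write port cannot do, which is why small FPGA cores often drop it or split it into micro-ops.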
RE: Proposal for a new high performance ISA
by dgisselq on Mar 25, 2016
Posts: 247 | Joined: Feb 20, 2015 | Last seen: Oct 24, 2024
Wow, you just reiterated a lot of my thoughts!
Push/pop operations: the pop operation can't be implemented without being able to update two registers in a single instruction. This isn't simple to do. For my own project I ended up test-coding it as two separate micro-ops. Then I ended up removing them because of a lack of LUT resources. My system has gotten by fine without these ops (but I still wish they could be implemented).
I almost forgot: this was the reason I got rid of push and pop instructions in the ZipCPU. Like I mentioned earlier, though, GCC is good enough that I hardly miss them.
A quandary was the link register. I'd keep it around. Some programmers optimize by placing jump addresses in registers. Since there are several code-type addresses in a system that need to be managed (exception addresses, instruction addresses), I collected them together into a code address register set for my own project.
Let's chat about that link register for a moment. Do you need it? When building the ZipCPU, call instructions posed a problem: they required two writes to the register set, one to write the link register and another to set the PC to the new value. I managed to replace the call sequence with "MOV return_label,R0; JMP _subroutine; return_label:". The only trick was that I had to tell GCC that these two instructions and a label actually constituted a single instruction that just happened to clobber R0. (I couldn't tell GCC about the existence of the label within the combination, or else it would optimize it out of existence.) Aside from that, this sequence has worked very well for me. Return instructions simply jump to R0. Further, it wasn't difficult to tell GCC that it needed to save R0 if it called any other functions.
I should mention that at one time I tried placing the return address on the stack. Returning from a function then involved (in the callee) reading this stack value into the program counter, and in the caller I needed to 1) allocate an extra space on the stack for this address, and 2) write the return address (manually) to that location. If the caller was already using the stack, then the stack had already been adjusted to have enough memory. If not, the stack pointer needed to be updated once at the top of the callee to create this location and once at the end of the callee to dispose of it. In the end, the register calling approach sped up leaf functions, although in all other cases the result was a wash. However, the difference was significant enough to change my marks on certain performance benchmarks.
YMMV,
Dan
RE: Proposal for a new high performance ISA
by Agner on Mar 25, 2016
Posts: 12 | Joined: Mar 23, 2016 | Last seen: Mar 18, 2023
Thank you for all your comments. This discussion is really helpful.

To dgisselq:

TEST instruction and flags. This instruction does not set any flags; it is a combined ALU and conditional-jump instruction. There is also an instruction to test a single bit, which is shorter because it needs only a bit index, not a 64-bit constant. I don't know how well GCC can handle combined ALU-and-conditional-jump instructions. If you want to split it into two instructions - which would be suboptimal - you can explicitly require an ALU instruction to write to one of the integer registers that can be used for flags, or use a compare instruction followed by a test-bit-and-conditional-jump.

Branches. The need for 32-bit displacements is one of the most important reasons for allowing instructions larger than 32 bits. I have designed it so that all immediate constants, addresses and offsets have power-of-2 sizes and proper alignment, in order to make the link process smooth. I don't have 64-bit absolute addresses because you can reach +/- 8 gigabytes with a 32-bit signed offset scaled by the code word size, which is 4.

Atomic instructions. My document has only a short note that these are needed. I don't know which ones are needed most.

Protected mode. The simplest solution would be to make a category in the memory map for system code: you have privileged access when the instruction pointer is within a system code segment. It is not decided yet how to make system calls. We need input from people with different experiences to find out what is most efficient. A simple solution is just to make a function call to a privileged address and have a table of function pointers to system functions (MacOS has fixed absolute addresses for some system functions). Maybe there are security problems in this if a hacker can make a call to some address inside a system function.

1. Market adoption. My exercise at this initial stage is to lay out what an ideal system would look like if we could redesign everything from the bottom, knowing what we know today. It may be necessary to make compromises later, but for now I prefer to start with an almost blank slate. We have to look at the ideal solution before we can weigh the costs and benefits of a compromise.

2. Vector support. My idea is that the processor should have parallel ALUs so that it can process a whole vector at once. The size of the vector registers, the number of parallel ALUs and the width of the data bus to the cache should all match, so that you can have a throughput of one full vector per clock cycle (floating point add/mul is pipelined with a latency of several clock cycles, but still a throughput of one vector per clock). The system should be scalable so that different processors can have different vector lengths. This is where the x86 family is heading. The problem with the existing solutions is that you need to compile the code separately for each vector size (see below).

3. FPGA on chip. If the FPGA is connected to some bus distant from the CPU, access times are longer, and it is less suited for single application-specific instructions. Such a system might be used for coding a whole state machine that does some calculation autonomously. My idea is to have a smaller programmable logic unit that does just single tasks in this mini-FPGA and leaves the rest to the CPU software stream. It will have latencies longer than one clock cycle anyway, but I prefer to keep the access time low. It will be difficult to construct no matter which solution we choose, but I think we will see some kind of programmable logic in mainstream CPUs within a few years anyway, so it is worth discussing what it might look like. Actually, Intel has already introduced a universal boolean logic function with three inputs and a truth table (LUT) with 8 entries; it is an instruction in the AVX512 instruction set.

4. Instructions with memory operands. My idea is that ALU instructions can have three different types of source operands: immediate, register or memory - but no memory destination operand and no more than one non-register operand. The CPU pipeline will have one stage for calculating the address of the memory operand, one stage for reading it, and one stage for the ALU operation. This gives a few clock cycles of extra latency after changing a pointer register, but this latency has little effect in out-of-order processors. The total throughput of the processor is increased because the combination of reading a memory operand and doing some calculation with it is ubiquitous in most software.

5. Division by zero. Traps on floating point division by zero can be enabled. Do we need traps on integer division by zero? I think it is illogical to have traps on integer division by zero but not on integer addition and multiplication overflow. Instead, I decided to have an optional flags output on all integer arithmetic instructions. You can specify a register for flags output, but you don't have flags output when you don't need it.

6. Push and pop. Push and pop instructions will be used mostly for spilling registers, and this can be done in other ways if you want. But we still have implicit push and pop in call and return instructions, so I think we need the hardware machinery for stack operations anyway. If you save the function return address in a register, the branch predictor has a problem distinguishing between a function return and a register-indirect jump. If you can't make this distinction, you will get a lot of expensive branch mispredictions.

7. Pipeline. Please see chapter 7 in my document. This is of course not part of the ISA specification, and different implementations are allowed.

8. Potentially missing instructions. Overflow detection: this is supported with an optional flags output.
Leading/trailing bits: you can use the bit scan instructions after inverting all bits. Conditional move and store: most instructions can have a predicate for conditional execution, and vector instructions have masks for conditional execution at the vector element level. Insert into bit field: I don't have these - what are they used for? Conditional trap: I don't like traps; I prefer a conditional jump to some error-handling code.

9. Multithreading. Sorry if I have not expressed myself clearly enough. Of course multiple threads can use the same CPU core on a time-share basis, but not simultaneously. I am referring to the problematic trick that Intel calls hyperthreading: two threads running simultaneously in the same core, competing for all the resources of the core.

10. Modifications of the ELF format. I don't see a big problem in this. Each ISA has its own variant of ELF with different relocation types; even 32-bit and 64-bit x86 each have their own ELF variant. A compatible linker is simple to build. I want to standardize as much as possible of the software ecosystem right from the beginning. There is no reason why we have different function calling conventions, different object file formats, different assembly languages, etc. for Windows and Linux, other than historical path dependence.

11. Memory organization. The TLB system of modern processors is very complex and costly. If you can reduce the number of separate segments per process by designing a memory model that keeps e.g. the code segments of all DLLs together, then you can simplify the memory management. If the number of memory segments per process can be kept sufficiently low, we can replace the traditional TLB, with its large number of fixed-size pages, by a memory map with a limited number of variable-size sections. The potential advantage is so high that it is worth exploring and experimenting with.

The difference between Windows DLLs and Linux shared objects is that the latter have symbol interposition. This means that you can override all symbol references in a shared object, even internal references. The cost is that all internal accesses to variables and functions have to go through a table lookup. This is totally wasteful, because the feature is virtually never used. If you really need it, give the function a pointer parameter to whatever reference you want to be changeable.

12. Base pointer. Almost all integer registers can be used as base pointer. The special registers can be used as well (instruction pointer and data section pointer).

13. Variable length registers. The most important reason for having variable-length vectors is that the system should be scalable, so that software can adapt at runtime to microprocessors with different maximum vector lengths. It is a big problem with existing systems that you have to compile separately for each vector size. The software programmer has to make a separate branch for each vector size in order to run optimally on all processors. Each branch has to be coded, tested and maintained, and a new branch has to be added each time a new processor generation with a larger vector size comes on the market with instruction set extensions of unpredictable design. This is such a high burden on the programmers that it is simply not done. I have done a lot of research on this, and the only software I have seen with properly maintained CPU dispatching for all vector sizes is Intel's own function libraries. Unfortunately, some of these libraries are deliberately designed to run slower on processors that don't have the "GenuineIntel" label in the CPUID. Even some of the most popular math programs use software libraries that are 7 years behind the hardware market.
The system with variable-length vectors can adapt at runtime to the vector length of the processor, without recompilation, when a new processor with a different vector length comes on the market. It can even save a full-length vector of a length that did not exist when the software was developed. A further advantage of variable-length vectors is that they facilitate loops through an array whose size is not divisible by the vector length. I have designed an addressing mode that handles this situation smoothly, without any extra code for the odd extra array elements.

14. Immediate constants. I am not supporting 64-bit absolute addresses, but you can load a 64-bit address into a pointer. See above.

15. Call stack. No, there is only one stack. It is used for return addresses, register spilling and local variables. It may be useful to switch to another stack on interrupts and traps; this would make it possible to use the red zone beyond the stack pointer for temporary storage. But I am uncertain whether this is worth the extra hardware complexity.

16. Context switch. Obviously, we want to save all registers on a context switch. Lazy saving of vector registers may be relevant, but I am not sure it is worth the extra complexity. I am considering making tiny instructions for "save register and advance pointer", including full-length vector registers. This would make it possible to save all registers using 64 tiny instructions = 128 bytes of code, using only a little more than 64 clock cycles, and similarly for restore. This is probably better than a microcoded "save all" instruction; microcode should be avoided if possible.

To Robfinch:

r0 = 0. We don't need r0 = 0 because instructions can have an immediate operand of 0. r31 is the stack pointer. The stack pointer can be used as base pointer, but there is no use for the stack pointer as array index, so I decided to use 31 for "no index".

POP instruction. Yes, the pop instruction has two outputs. So have most other instructions if you specify a flags output, so the hardware must support two outputs anyway.

Link register. See point 6 above and chapter 4.2 in my document.

Vector length. I don't want to put the vector length in a system register because it is modified during a loop through an odd-sized array, and it has to be renamed during out-of-order processing. You may have vectors of different lengths in the same program.
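The optional flags output on integer arithmetic can be modeled as below. This is a sketch only: the flag encoding (bit 0 = signed overflow, bit 1 = unsigned carry) is made up for illustration, not taken from the proposal.

```python
MASK64 = (1 << 64) - 1

def add64(a: int, b: int, want_flags: bool = False):
    """64-bit add; optionally also produce a flags value destined for a
    second output register, modeling the proposal's optional flags output."""
    total = a + b
    result = total & MASK64
    if not want_flags:
        return result, None                       # no flags register written
    carry = int(total > MASK64)                   # unsigned overflow (carry out)
    sa, sb, sr = a >> 63, b >> 63, result >> 63   # sign bits of inputs/result
    overflow = int(sa == sb and sr != sa)         # signed overflow
    return result, (carry << 1) | overflow        # hypothetical packed flags
```

The design point the model illustrates: the flags write is opt-in per instruction, so code that doesn't test flags pays no second-output cost and creates no false dependency on a shared flags register.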
RE: Proposal for a new high performance ISA
by dgisselq on Mar 25, 2016
Posts: 247 | Joined: Feb 20, 2015 | Last seen: Oct 24, 2024
This is an exciting project. I wish you the best. I would like, with your permission, to return to a couple items:
But let's move on. Your ideas are novel enough that they are going to need some preliminary proof-of-concept work. I anticipate that you will want a simulator running, with much of your toolchain, before you move much farther. Have you made any choices here for what your roadmap is going forward? Will you be using FPGAs, Verilator, or something else to prove that these ideas work?
Wishing you all the best,
Dan
RE: Proposal for a new high performance ISA
by Agner on Mar 26, 2016
Posts: 12 | Joined: Mar 23, 2016 | Last seen: Mar 18, 2023
To dgisselq:
2+10. You can load a 64-bit absolute address into a register and make an indirect jump to the register value. It is unconditional only: I don't allow predication on indirect jumps, calls and returns, because this would add extra complexity to the branch predictor.

5. The abort function doesn't need to use a trap. This will be implementation-dependent.

6. My main reason for supporting push and pop is that the smoothest way to make function calls - from a software point of view - is to push the return address on call and pop it on return. A secondary advantage is that you don't need extra instructions to adjust the stack pointer when spilling registers. We will also need instructions for "save register and advance pointer"; such instructions will use the same machinery as push and pop.

7. Whether a particular brand of FPGA allows easy reconfiguration of a small area is subordinate to the general discussion of whether the principle is practicable in an ASIC. I predict that mainstream CPUs will have some kind of programmable logic within a few years, and we may as well play with that idea and find the most efficient ways to do it.

11. GCC isn't the only compiler in the world, though certainly one of the best. GCC is constantly developing, and it's open source, so anybody can add features. I am more concerned with the special vector loop with a negative index. It will require quite a bit of work to make GCC autovectorize a loop with this method and call the special function libraries with variable-length vector parameters.

The roadmap is long: first discuss the details of the ISA and ABI standard, then make an assembler, compiler, linker, simulator, profiler, FPGA implementation and ASIC. There are a lot of possibilities for interesting research projects and experiments along the road, and we really don't know where it will take us.
RE: Proposal for a new high performance ISA
by rob06 on Mar 26, 2016
Posts: 1 | Joined: Dec 27, 2008 | Last seen: Nov 21, 2023
There is an existing project like this: OR1K, the OpenRISC project. I am wondering why it is not mentioned in this proposal. Expending effort to duplicate the basics is perhaps not as good a strategy as improving on what is there and making one very strong project.

Also, I don't know if there is a way to have a separate forum for this project, but I am getting far too much email on this, and I don't want to opt out of OpenCores email. So if it is possible, I would encourage you to set up the project and communicate in that space.
RE: Proposal for a new high performance ISA
by aikijw on Mar 27, 2016
Posts: 76 | Joined: Oct 21, 2011 | Last seen: Jul 8, 2023
... or you could implement a subject filter using your mail client's built-in filter capability. This is a forum; your reasons for taking the conversation elsewhere are specious.
Regards,
/jw
There is an existing project like this: OR1K, the OpenRISC project. I am wondering why it is not mentioned in this proposal. Expending effort to duplicate the basics is perhaps not as good a strategy as improving on what is there and making one very strong project. Also, I don't know if there is a way to have a separate forum for this project, but I am getting far too much email on this, and I don't want to opt out of OpenCores email. So if it is possible, I would encourage you to set up the project and communicate in that space.