Proposal for a new high performance ISA
by Agner on Mar 23, 2016
Posts: 12 | Joined: Mar 23, 2016 | Last seen: Mar 18, 2023
I want to inform you about a proposal for a new open ISA with focus on high performance vector processors. It is published at http://www.agner.org/optimize/#instructionset
I totally agree with the philosophy of an open ISA and open core. In fact, I have been arguing the same for years. But I want to take the idea a step further and design an ISA and microarchitecture that can actually outperform existing commercial architectures. The design must have a strong focus on features that are important for top performance, such as efficient vector support and an efficient pipeline structure. OpenRISC and RISC-V are well suited for smaller systems where silicon space must be economized, and for scientific experiments, but they have little focus on the details that matter for the performance of larger and more powerful processors.

My proposal combines the best of the RISC and CISC principles. Instructions can have three different sizes: one, two or three 32-bit words. Common simple instructions can also be packed two-by-two into single code words. The detection of the instruction length is made as simple as possible in order to make it easy for the decoder to identify several instructions per clock cycle. Instructions have a medium level of complexity: an instruction can do multiple things, but only if it fits into the pipeline structure so that it does one thing at each pipeline stage. This makes it possible to maintain a throughput of one instruction per clock cycle per pipeline lane (except for division and cache misses). There is no need to split instructions into micro-operations or to use microcode.

Vector support is integrated as a fundamental part of the design rather than an appendix. The vector registers have variable length, where the maximum length is implementation-dependent. A software program will automatically use the maximum vector length supported by the processor, without the need to compile the code separately for each vector length.
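As a sketch of the length-detection idea above, here is a toy decoder. The 2-bit length field in the top bits of the first code word is a hypothetical encoding chosen for illustration; the proposal's actual field placement and values may differ.

```python
def instr_length(first_word: int) -> int:
    """Instruction size in 32-bit words, derived from the first word alone."""
    il = (first_word >> 30) & 0b11      # hypothetical 2-bit length field
    # 0 -> one word packing two half-size instructions, 1 -> one word,
    # 2 -> two words, 3 -> three words
    return (1, 1, 2, 3)[il]

def instruction_boundaries(code: list[int]) -> list[int]:
    """Walk a code block and return each instruction's length in words.

    Because the length depends only on the first word, a wide decoder
    can locate several instruction starts within one clock cycle."""
    lengths, i = [], 0
    while i < len(code):
        n = instr_length(code[i])
        lengths.append(n)
        i += n
    return lengths
```

The point of the sketch is that no serial scan over opcode bytes (as on x86) is needed: each instruction's size is a pure function of its first word.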
A special addressing mode makes it easy to loop through an array so that the last iteration uses a smaller vector length when the array size is not divisible by the maximum vector length.

The proposal includes recommendations for standardization of the whole ecosystem: ABI standard, binary file format, function libraries, compiler support and OS support. With open standards for the entire ecosystem we would be able to combine different programming languages in the same program and use the same function libraries with all compilers.

This proposal is the result of a long discussion on my blog: http://www.agner.org/optimize/blog/read.php?i=421 Feel free to comment on my blog (no registration needed), or here on the cores forum. The proposal is in an initial discussion phase where a template instruction format has been designed, but the finer details of each instruction have not been decided yet. This is not a final standard, but an impetus to start an evolutionary process that could involve ISA specification, software development tools, an emulator and hardware design.

About me: I have done research for many years on microarchitectures and on how to optimize software to get the maximum performance out of a given microarchitecture. My reports, manuals and other resources are widely used by programmers and compiler makers. See www.agner.org/optimize/
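The array-loop idea can be sketched as follows. This is a software model with made-up names, not the actual addressing mode: the index starts negative (relative to a pointer to the end of the array) and counts up toward zero, and the vector length shrinks automatically on the final iteration.

```python
MAX_VLEN = 8  # elements per full vector register; implementation-dependent

def vector_sum(data: list[float]) -> float:
    """Loop over an array whose size need not be a multiple of MAX_VLEN."""
    end = len(data)            # base pointer to the end of the array
    j = -len(data)             # negative index, counts up toward zero
    total = 0.0
    while j < 0:
        vlen = min(MAX_VLEN, -j)                       # last pass may be shorter
        total += sum(data[end + j : end + j + vlen])   # stands in for a vector add
        j += vlen
    return total
```

With an 11-element array and MAX_VLEN = 8, the loop runs twice: one full vector of 8 elements, then a short vector of 3, with no scalar cleanup code.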
RE: Proposal for a new high performance ISA
by dgisselq on Mar 24, 2016
Posts: 247 | Joined: Feb 20, 2015 | Last seen: Oct 24, 2024
I look forward to the opportunity to review this instruction set. I think, after having implemented a CPU, I might even have an opinion as to what you might find helpful. Give me some time to get back to you.
Dan
RE: Proposal for a new high performance ISA
by dgisselq on Mar 24, 2016
Posts: 247 | Joined: Feb 20, 2015 | Last seen: Oct 24, 2024
Here's a quick question: have you looked over the RTX codes that GCC will try to map into instructions? You can find these in section 16.9 of the gccint.pdf file. You may find some capabilities there that you haven't thought of (yet). Otherwise ... I'm still reading.
Dan
P.S. Have you ever ported GCC to a new backend before?
RE: Proposal for a new high performance ISA
by Agner on Mar 24, 2016
Posts: 12 | Joined: Mar 23, 2016 | Last seen: Mar 18, 2023
Thank you for the link.
No, I haven't tried to write a GCC backend yet. Which would you recommend - GCC or LLVM? There is a lot of work to do: make an assembler, linker, compiler, simulator and FPGA implementation. But first we have to fix the details of the new standard.
RE: Proposal for a new high performance ISA
by dgisselq on Mar 24, 2016
Posts: 247 | Joined: Feb 20, 2015 | Last seen: Oct 24, 2024
No, I haven't tried to write a GCC backend yet. Which would you recommend - GCC or LLVM?
From my own experience: I have been working on the ZipCPU here on OpenCores. If you look it up, you'll discover that the ZipCPU has a completely different purpose from what you are working on: it's designed to be a system within a chip - basically an FPGA microcontroller that takes as few resources from the FPGA as possible. Yes, it's designed to be a minimalist (32-bit) CPU.

I am right now finishing (?) up my work on a GCC compiler port. (You know how the first 90% takes 10% of the time? I think I may have finished the first 9% of the time ...) I have ported the binutils assembler, linker, disassembler, and (hopefully) the entire binutils stack, although my port doesn't support DLLs or thread-local storage (yet - although the hardware doesn't really support it yet either). Along the way, I had to make changes to the CPU given what I am learning in hindsight from the tools work. For example,
But let me come back to your question: should you support GCC or LLVM? I'm going to suggest that you want to support whatever compiler gives you the easiest path to getting Debian/Linux running on your platform. I think that will be GCC, but I'll invite you to prove me wrong.
Dan
RE: Proposal for a new high performance ISA
by dgisselq on Mar 24, 2016
Posts: 247 | Joined: Feb 20, 2015 | Last seen: Oct 24, 2024
I didn't find any information regarding interrupts and interrupt handling, nor did I find any description of protected modes within the document. Was this an oversight?
Dan
RE: Proposal for a new high performance ISA
by dgisselq on Mar 24, 2016
Posts: 247 | Joined: Feb 20, 2015 | Last seen: Oct 24, 2024
Agner,
I finally made it through your document! Thank you for putting the effort into trying to build a new/better CPU architecture. I like your goal(s)--so much so that I was tempted to start building something similar to what you are proposing before I saw your proposal!
Again, you've stepped into a gigantic undertaking. There's a long, long road between here and a working CPU, but I commend you for trying. Hope this helps!
Dan
RE: Proposal for a new high performance ISA
by redbear on Mar 24, 2016
Posts: 8 | Joined: Feb 16, 2009 | Last seen: Nov 13, 2023
Need block verification :-) ?
RE: Proposal for a new high performance ISA
by robfinch on Mar 25, 2016
Posts: 28 | Joined: Sep 29, 2005 | Last seen: Nov 18, 2024
Hi Agner,
I just started working on a simple vector processor (3-element vectors) the other day, with the goal of having it run a ray-tracing program. So this might be in sync with what I'm looking into right now. I started reading your initial docs on a vector ISA. Since I don't really know much about vector machinery, I reserve most of my comments for later. I've heard that the ISA only makes a small difference to overall performance when all other things are considered. It would seem to me, however, that for a vector CPU a vector-oriented ISA would make sense.

One thing I noticed was using r31 to indicate no index register; wouldn't it make more sense to use R0? R0 is typically reserved for the value of zero in a number of other CPUs.

Push/pop operations: the pop operation can't be implemented without being able to update two registers in a single instruction. This isn't simple to do. For my own project I ended up test-coding it as two separate micro-ops. Then I ended up removing them because of a lack of LUT resources. My system has gotten by fine without these ops (but I still wish they could be implemented).

A quandary was the link register. I'd keep it around. Some programmers optimize by placing jump addresses in registers. Since there are several code-type addresses in a system that need to be managed (exception addresses, instruction addresses), I collected them together into a code address register set for my own project.

The variable-length instruction coding with the four sizes seems reasonable. Kinda reminds me of the RISC-V encoding bits.

I think I see what you're trying to do with the variable-length vector registers. I'd put the vector length in an SPR though. Will it vary while a given program is running?

Looking forward to seeing more of this project.
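The two-writes problem with POP can be made concrete with a toy machine model. The register numbering follows the proposal's r31-as-stack-pointer convention, but the 8-byte stack slot and the function names are illustrative only.

```python
SP = 31  # stack pointer register number (r31 in the proposal)

class State:
    """Minimal architectural state: 32 registers and a sparse memory."""
    def __init__(self):
        self.regs = [0] * 32
        self.mem = {}

def push(s: State, src: int) -> None:
    s.regs[SP] -= 8                    # one register write, plus a memory write
    s.mem[s.regs[SP]] = s.regs[src]

def pop(s: State, dst: int) -> None:
    s.regs[dst] = s.mem[s.regs[SP]]    # register write 1: the destination
    s.regs[SP] += 8                    # register write 2: the stack pointer
```

PUSH needs only one register-file write port (plus a memory port), while POP commits two register writes in a single instruction - exactly what a minimal register file with one write port cannot do, which is why small FPGA cores often drop it or split it into micro-ops.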
RE: Proposal for a new high performance ISA
by dgisselq on Mar 25, 2016
Posts: 247 | Joined: Feb 20, 2015 | Last seen: Oct 24, 2024
Wow, you just reiterated a lot of my thoughts!
Push/pop operations: the pop operation can't be implemented without being able to update two registers in a single instruction. This isn't simple to do. For my own project I ended up test-coding it as two separate micro-ops. Then I ended up removing them because of a lack of LUT resources. My system has gotten by fine without these ops (but I still wish they could be implemented).
I almost forgot: this was the reason I got rid of push and pop instructions in the ZipCPU. Like I mentioned earlier, though, GCC is good enough that I hardly miss them.
A quandary was the link register. I'd keep it around. Some programmers optimize by placing jump addresses in registers. Since there are several code-type addresses in a system that need to be managed (exception addresses, instruction addresses), I collected them together into a code address register set for my own project.
Let's chat about that link register for a moment. Do you need it? When building the ZipCPU, call instructions posed a problem: they required two writes to the register set, one to write the link register and another to set the PC to the new value. I managed to replace the call sequence with "MOV return_label,R0; JMP _subroutine; return_label:". The only trick was that I had to tell GCC that these two instructions and a label actually constituted a single instruction that just happened to clobber R0. (I couldn't tell GCC about the existence of the label within the combination, or else it would optimize it out of existence.) Aside from that, this sequence has worked very well for me. Return instructions simply jump to R0. Further, it wasn't difficult to tell GCC that it needed to save R0 if it called any other functions.
I should mention that at one time I tried placing the return address on the stack. Returning from a function then involved (in the callee) reading this stack value into the program counter, and in the caller I needed to 1) allocate an extra space on the stack for this address, and 2) write the return address (manually) to that location. If the caller was already using the stack, then the stack had already been adjusted to have enough memory. If not, the stack pointer needed to be updated once at the top of the callee to create this location and once at the end of the callee to dispose of it. In the end, the register calling approach sped up leaf functions, although in all other cases the result was a wash. However, the difference was significant enough to change my marks on certain performance benchmarks.
YMMV,
Dan
RE: Proposal for a new high performance ISA
by Agner on Mar 25, 2016
Posts: 12 | Joined: Mar 23, 2016 | Last seen: Mar 18, 2023
Thank you for all your comments. This discussion is really helpful.

To dgisselq:

TEST instruction and flags. This instruction does not set any flags; it is a combined ALU and conditional-jump instruction. There is also an instruction to test a single bit, which is shorter because it needs only a bit index, not a 64-bit constant. I don't know how well GCC can handle combined ALU-and-conditional-jump instructions. If you want to split it into two instructions - which would be suboptimal - you can explicitly require an ALU instruction to write to one of the integer registers that can be used for flags, or use a compare instruction followed by a test-bit-and-conditional-jump.

Branches. The need for 32-bit displacements is one of the most important reasons for allowing instructions larger than 32 bits. I have designed it so that all immediate constants, addresses and offsets have power-of-2 sizes and proper alignment, in order to make the link process smooth. I don't have 64-bit absolute addresses because you can reach +/- 8 gigabytes with a 32-bit signed offset scaled by the code word size, which is 4.

Atomic instructions. My document has only a short note that these are needed. I don't know which ones are needed most.

Protected mode. The simplest solution would be to make a category in the memory map for system code: you have privileged access when the instruction pointer is within a system code segment. It is not decided yet how to make system calls. We need input from people with different experiences to find out what is most efficient. A simple solution is just to make a function call to a privileged address and have a table of function pointers to system functions (MacOS has fixed absolute addresses for some system functions). Maybe there are security problems in this if a hacker can make a call to some address inside a system function.

1. Market adoption. My exercise at this initial stage is to lay out what an ideal system would look like if we could redesign everything from the bottom, knowing what we know today. It may be necessary to make compromises later, but for now I prefer to start with an almost blank slate. We have to look at the ideal solution before we can weigh the costs and benefits of a compromise.

2. Vector support. My idea is that the processor should have parallel ALUs so that it can process a whole vector at once. The size of the vector registers, the number of parallel ALUs and the width of the data bus to the cache should all match, so that you can have a throughput of one full vector per clock cycle (floating point add/mul is pipelined with a latency of several clock cycles, but still a throughput of one vector per clock). The system should be scalable so that different processors can have different vector lengths. This is where the x86 family is heading. The problem with the existing solutions is that you need to compile the code separately for each vector size (see below).

3. FPGA on chip. If the FPGA is connected to some bus distant from the CPU, access times are longer, and it is less suited for single application-specific instructions. Such a system might be used for coding a whole state machine that does some calculation autonomously. My idea is to have a smaller programmable logic unit that does just single tasks in this mini-FPGA and leaves the rest to the CPU software stream. It will have latencies longer than one clock cycle anyway, but I prefer to keep the access time low. It will be difficult to construct no matter which solution we choose, but I think we will see some kind of programmable logic in mainstream CPUs within a few years anyway, so it is worth discussing what it might look like. Actually, Intel has already introduced a universal boolean logic function with three inputs and a truth table (LUT) with 8 entries; it is an instruction in the AVX512 instruction set.

4. Instructions with memory operands. My idea is that ALU instructions can have three different types of source operands: immediate, register or memory - but no memory destination operand and no more than one non-register operand. The CPU pipeline will have one stage for calculating the address of the memory operand, one stage for reading it, and one stage for the ALU operation. This gives a few clock cycles of extra latency after changing a pointer register, but this latency has little effect in out-of-order processors. The total throughput of the processor is increased because the combination of reading a memory operand and doing some calculation with it is ubiquitous in most software.

5. Division by zero. Traps on floating point division by zero can be enabled. Do we need traps on integer division by zero? I think it is illogical to have traps on integer division by zero but not on integer addition and multiplication overflow. Instead, I decided to have an optional flags output on all integer arithmetic instructions. You can specify a register for flags output, but you don't have flags output when you don't need it.

6. Push and pop. Push and pop instructions will be used mostly for spilling registers, and this can be done in other ways if you want. But we still have implicit push and pop in call and return instructions, so I think we need the hardware machinery for stack operations anyway. If you save the function return address in a register, the branch predictor has a problem distinguishing between a function return and a register-indirect jump. If you can't make this distinction, you will get a lot of expensive branch mispredictions.

7. Pipeline. Please see chapter 7 in my document. This is of course not part of the ISA specification, and different implementations are allowed.

8. Potentially missing instructions. Overflow detection: this is supported with an optional flags output.
Leading/trailing bits: you can use the bit scan instructions after inverting all bits. Conditional move and store: most instructions can have a predicate for conditional execution, and vector instructions have masks for conditional execution at the vector element level. Insert into bit field: I don't have these - what are they used for? Conditional trap: I don't like traps; I prefer a conditional jump to some error-handling code.

9. Multithreading. Sorry if I have not expressed myself clearly enough. Of course multiple threads can use the same CPU core on a time-share basis, but not simultaneously. I am referring to the problematic trick that Intel calls hyperthreading: two threads running simultaneously in the same core, competing for all the resources of the core.

10. Modifications of the ELF format. I don't see a big problem in this. Each ISA has its own variant of ELF with different relocation types; even 32-bit and 64-bit x86 each have their own ELF variant. A compatible linker is simple to build. I want to standardize as much as possible of the software ecosystem right from the beginning. There is no reason why we have different function calling conventions, different object file formats, different assembly languages, etc. for Windows and Linux, other than historical path dependence.

11. Memory organization. The TLB system of modern processors is very complex and costly. If you can reduce the number of separate segments per process by designing a memory model that keeps e.g. the code segments of all DLLs together, then you can simplify the memory management. If the number of memory segments per process can be kept sufficiently low, we can replace the traditional TLB, with its large number of fixed-size pages, by a memory map with a limited number of variable-size sections. The potential advantage is so high that it is worth exploring and experimenting with.

The difference between Windows DLLs and Linux shared objects is that the latter have symbol interposition. This means that you can override all symbol references in a shared object, even internal references. The cost is that all internal accesses to variables and functions have to go through a table lookup. This is totally wasteful, because the feature is virtually never used. If you really need it, give the function a pointer parameter to whatever reference you want to be changeable.

12. Base pointer. Almost all integer registers can be used as base pointer. The special registers can be used as well (instruction pointer and data section pointer).

13. Variable length registers. The most important reason for having variable-length vectors is that the system should be scalable, so that software can adapt at runtime to microprocessors with different maximum vector lengths. It is a big problem with existing systems that you have to compile separately for each vector size. The software programmer has to make a separate branch for each vector size in order to run optimally on all processors. Each branch has to be coded, tested and maintained, and a new branch has to be added each time a new processor generation with a larger vector size comes on the market with instruction set extensions of unpredictable design. This is such a high burden on the programmers that it is simply not done. I have done a lot of research on this, and the only software I have seen with properly maintained CPU dispatching for all vector sizes is Intel's own function libraries. Unfortunately, some of these libraries are deliberately designed to run slower on processors that don't have the "GenuineIntel" label in the CPUID. Even some of the most popular math programs use software libraries that are 7 years behind the hardware market.
The system with variable-length vectors can adapt at runtime to the vector length of the processor, without recompilation, when a new processor with a different vector length comes on the market. It can even save a full-length vector of a length that did not exist when the software was developed. A further advantage of variable-length vectors is that they facilitate loops through an array whose size is not divisible by the vector length. I have designed an addressing mode that handles this situation smoothly, without any extra code for the odd extra array elements.

14. Immediate constants. I am not supporting 64-bit absolute addresses, but you can load a 64-bit address into a pointer. See above.

15. Call stack. No, there is only one stack. It is used for return addresses, register spilling and local variables. It may be useful to switch to another stack on interrupts and traps; this would make it possible to use the red zone beyond the stack pointer for temporary storage. But I am uncertain whether this is worth the extra hardware complexity.

16. Context switch. Obviously, we want to save all registers on a context switch. Lazy saving of vector registers may be relevant, but I am not sure it is worth the extra complexity. I am considering making tiny instructions for "save register and advance pointer", including full-length vector registers. This would make it possible to save all registers using 64 tiny instructions = 128 bytes of code, using only a little more than 64 clock cycles, and similarly for restore. This is probably better than a microcoded "save all" instruction; microcode should be avoided if possible.

To Robfinch:

r0 = 0. We don't need r0 = 0 because instructions can have an immediate operand of 0. r31 is the stack pointer. The stack pointer can be used as base pointer, but there is no use for the stack pointer as array index, so I decided to use 31 for "no index".

POP instruction. Yes, the pop instruction has two outputs. So have most other instructions if you specify a flags output, so the hardware must support two outputs anyway.

Link register. See point 6 above and chapter 4.2 in my document.

Vector length. I don't want to put the vector length in a system register because it is modified during a loop through an odd-sized array, and it has to be renamed during out-of-order processing. You may have vectors of different lengths in the same program.
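The optional flags output on integer arithmetic can be modeled as below. This is a sketch only: the flag encoding (bit 0 = signed overflow, bit 1 = unsigned carry) is made up for illustration, not taken from the proposal.

```python
MASK64 = (1 << 64) - 1

def add64(a: int, b: int, want_flags: bool = False):
    """64-bit add; optionally also produce a flags value destined for a
    second output register, modeling the proposal's optional flags output."""
    total = a + b
    result = total & MASK64
    if not want_flags:
        return result, None                       # no flags register written
    carry = int(total > MASK64)                   # unsigned overflow (carry out)
    sa, sb, sr = a >> 63, b >> 63, result >> 63   # sign bits of inputs/result
    overflow = int(sa == sb and sr != sa)         # signed overflow
    return result, (carry << 1) | overflow        # hypothetical packed flags
```

The design point the model illustrates: the flags write is opt-in per instruction, so code that doesn't test flags pays no second-output cost and creates no false dependency on a shared flags register.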
RE: Proposal for a new high performance ISA
by dgisselq on Mar 25, 2016
Posts: 247 | Joined: Feb 20, 2015 | Last seen: Oct 24, 2024
This is an exciting project. I wish you the best. I would like, with your permission, to return to a couple items:
But let's move on. Your ideas are novel enough that they are going to need some preliminary proof-of-concept work. I anticipate that you will want a simulator running, with much of your toolchain, before you move much farther. Have you made any choices here for what your roadmap is going forward? Will you be using FPGAs, Verilator, or something else to prove that these ideas work?
Wishing you all the best,
Dan
RE: Proposal for a new high performance ISA
by Agner on Mar 26, 2016
Posts: 12 | Joined: Mar 23, 2016 | Last seen: Mar 18, 2023
To dgisselq:
2+10. You can load a 64-bit absolute address into a register and make an indirect jump to the register value. It is unconditional only: I don't allow predication on indirect jumps, calls and returns, because this would add extra complexity to the branch predictor.

5. The abort function doesn't need to use a trap. This will be implementation-dependent.

6. My main reason for supporting push and pop is that the smoothest way to make function calls - from a software point of view - is to push the return address on call and pop it on return. A secondary advantage is that you don't need extra instructions to adjust the stack pointer when spilling registers. We will also need instructions for "save register and advance pointer"; such instructions will use the same machinery as push and pop.

7. Whether a particular brand of FPGA allows easy reconfiguration of a small area is subordinate to the general discussion of whether the principle is practicable in an ASIC. I predict that mainstream CPUs will have some kind of programmable logic within a few years, and we may as well play with that idea and find the most efficient ways to do it.

11. GCC isn't the only compiler in the world, though certainly one of the best. GCC is constantly developing, and it's open source, so anybody can add features. I am more concerned with the special vector loop with a negative index. It will require quite a bit of work to make GCC autovectorize a loop with this method and call the special function libraries with variable-length vector parameters.

The roadmap is long: first discuss the details of the ISA and ABI standard, then make an assembler, compiler, linker, simulator, profiler, FPGA implementation and ASIC. There are a lot of possibilities for interesting research projects and experiments along the road, and we really don't know where it will take us.
RE: Proposal for a new high performance ISA
by rob06 on Mar 26, 2016
Posts: 1 | Joined: Dec 27, 2008 | Last seen: Nov 21, 2023
There is an existing project like this: OR1K, the OpenRISC project. I am wondering why it is not mentioned in this proposal. Expending effort to duplicate the basics is perhaps not as good a strategy as improving on what is there and making one very strong project.

Also, I don't know if there is a way to have a separate forum for this project, but I am getting far too much email on this, and I don't want to opt out of OpenCores email. So if it is possible, I would encourage you to set up the project and communicate in that space.
RE: Proposal for a new high performance ISA
by aikijw on Mar 27, 2016
Posts: 76 | Joined: Oct 21, 2011 | Last seen: Jul 8, 2023
... or you could implement a subject filter using your mail client's built-in filter capability. This is a forum; your reasons for taking the conversation elsewhere are specious.
Regards,
/jw
There is an existing project like this: OR1K, the OpenRISC project. I am wondering why it is not mentioned in this proposal. Expending effort to duplicate the basics is perhaps not as good a strategy as improving on what is there and making one very strong project. Also, I don't know if there is a way to have a separate forum for this project, but I am getting far too much email on this, and I don't want to opt out of OpenCores email. So if it is possible, I would encourage you to set up the project and communicate in that space.