Explicitly Parallel RISC (EPRISC)
by tekhknowledge on 2002-03-31
source: Ed Tomaszewski
Here is an idea for extending the performance of RISC processors in the face of Explicitly Parallel Instruction Computing (EPIC).
The race is still on to increase processor throughput through higher clock speeds and architectural improvements. Some of these architectural improvements deal with issuing and executing multiple instructions at the same time. After a couple of generations of hardware-only solutions, designers have hit “the wall” of factorially increasing complexity. It is also getting tougher to find enough instruction-level parallelism to use the full potential of these parallel designs, so designers have enlisted the help of compiler writers to generate the right kind of instruction stream. EPIC represents the next step in this evolution, reaching for even greater parallelism by relying on compiler technology. Unfortunately, it has required a redo of the entire software base, including OSes and applications. There is no compatibility between the preceding software base and EPIC’s native mode, and many applications, even when recompiled, cannot exploit enough instruction-level parallelism to keep the processor fully utilized.
This idea tries to embrace the compiler instruction-scheduling approach while reusing as much as possible of current RISC processor designs and their existing software. The goal is to still be able to execute all existing applications without recompilation. Minimal changes to an OS would be needed to allow for newer recompiled applications. Existing compilers would still work, but a new one would generate code with blocks of data-independent instructions. Multiple RISC processors would be glued together by a shared register file and unified instruction and data paths. The reason for focusing only on RISC is that its fixed-size instructions make for easy instruction stream parsing.
The crux of the idea is that a compiler would generate varying-length blocks of data-independent instructions, i.e. no instruction’s source or destination registers match another instruction’s within the block. Each block would be terminated in some way to indicate its end. Because every instruction in a block must use registers that no other instruction in that block touches, the number of instructions within a block is bounded by the number of registers available in the register file. A way to allow bigger blocks would be to increase the register file size, but a larger register file means compatibility issues. My feeling is to leave it alone for now but choose a RISC processor that has at least 32 registers. A program would consist of some number of these blocks. The processor can now process the instructions within these blocks sequentially, “N” instructions at a time, where “N” is the superscalarness of the processor, e.g. if the processor is configured as two-way superscalar then it can issue two instructions at a time until it hits the end of the block. I suggest using register scoreboarding to handle the case where we cross into a new block. Let me briefly explain scoreboarding for those who are unfamiliar. Every register has a scoreboard bit associated with it. When an instruction is issued, the scoreboard bit of the register that will receive the result is set. When the result finally arrives, the bit is cleared. If another instruction needs the result before it has arrived, that instruction stalls until the result arrives. While we are within a block there should be no scoreboard stalls between its instructions, since the compiler has guaranteed data independence. If the processor straddles two blocks then it should wait until all of the instructions from the first block are issued before starting to issue from the next; that way the registers’ scoreboard bits can get set before issuing instructions that may be data dependent.
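To make the block rule concrete, here is a minimal Python sketch of how a compiler pass might cut an already ordered instruction list into data-independent blocks. The `Instr` tuple, the `split_into_blocks` name, and the register and opcode names are all hypothetical, invented only for illustration. The sketch applies the strict rule stated above (no register shared between any two instructions of a block) and cuts greedily; a real compiler would also reorder instructions to lengthen blocks, which is not attempted here.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical three-operand RISC instruction (illustration only)."""
    op: str
    dest: str                # destination register, e.g. "r1"
    srcs: Tuple[str, ...]    # source registers


def split_into_blocks(program: List[Instr]) -> List[List[Instr]]:
    """Greedily cut a sequential program into blocks in which no
    instruction's registers match another instruction's registers."""
    blocks: List[List[Instr]] = []
    current: List[Instr] = []
    used = set()             # every register touched by the current block
    for ins in program:
        regs = {ins.dest, *ins.srcs}
        if regs & used:      # register clash: end the block here
            blocks.append(current)
            current, used = [], set()
        current.append(ins)
        used |= regs
    if current:
        blocks.append(current)
    return blocks


# Assumed example program, not taken from any real instruction set.
program = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("sub", "r4", ("r5", "r6")),
    Instr("mul", "r7", ("r1", "r4")),   # reads r1 and r4, so a new block starts
    Instr("or",  "r8", ("r9", "r10")),
    Instr("and", "r2", ("r7", "r8")),   # reads r7 and r8, so a new block starts
]

for number, block in enumerate(split_into_blocks(program)):
    print(number, [ins.op for ins in block])
```

For this assumed program the greedy cut yields three blocks of two, two, and one instructions. In hardware, each boundary would be expressed by whatever end-of-block marker the design chooses.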
Once all the instructions from the first block are issued, any dependencies in the next block will be caught by the scoreboard, which stalls the issue of the dependent instruction until its data arrives. There could also be resource dependencies caused by units that take multiple cycles to complete, but the result is the same: the processor will advance up to that instruction and stall until the resource becomes available. Some care is needed to handle common state information such as condition codes; one option is to take the code from the last instruction in the block that sets it. A condition code used in a branch would be set in the previous block. The advantage of this design is that a processor configured to any superscalarness can run this code. Even older processors can run this code, as long as the end-of-block marker is treated as a no-op. This gives RISC processor makers a way of creeping towards higher superscalarness without requiring a revamp of their existing software base.
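The issue rule and the scoreboard can be sketched as a small Python model. This is a simplification under stated assumptions, not a description of any real pipeline: the `Instr` tuple, the `cycles_to_issue` function, and the per-opcode latency table are hypothetical. Issue is in order, instructions from two different blocks never issue in the same cycle, and a register's scoreboard bit is treated as set from the issue cycle until the result arrives.

```python
from typing import Dict, List, NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical three-operand RISC instruction (illustration only)."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def cycles_to_issue(blocks: List[List[Instr]], width: int,
                    latency: Dict[str, int]) -> int:
    """Count cycles needed to issue every instruction on an N-way machine.

    busy_until[r] is the cycle in which register r's result arrives; the
    scoreboard bit for r counts as set while busy_until[r] > current cycle.
    Opcodes missing from `latency` are assumed to take one cycle.
    """
    busy_until: Dict[str, int] = {}
    cycle = 0
    for block in blocks:                 # never mix two blocks in one cycle
        pos = 0
        while pos < len(block):
            slots = width
            while slots > 0 and pos < len(block):
                ins = block[pos]
                regs = (ins.dest, *ins.srcs)
                if any(busy_until.get(r, 0) > cycle for r in regs):
                    break                # scoreboard stall: retry next cycle
                busy_until[ins.dest] = cycle + latency.get(ins.op, 1)
                pos += 1
                slots -= 1
            cycle += 1
    return cycle


# Assumed example: three blocks, with a multiply that takes three cycles.
blocks = [
    [Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r5", "r6"))],
    [Instr("mul", "r7", ("r1", "r4")), Instr("or", "r8", ("r9", "r10"))],
    [Instr("and", "r2", ("r7", "r8"))],
]
latency = {"mul": 3}

for width in (1, 2, 4):
    print(width, cycles_to_issue(blocks, width, latency))
```

The same block stream runs unchanged at every width, which is the compatibility point made above. With these assumed latencies the one-way configuration needs 6 issue cycles and the two-way needs 5; going to four-way gains nothing more, because no block holds more than two instructions and the last block still waits on the multiply.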
As good as this may sound, this design would still face the same dilemma as EPIC, i.e., for a few chosen applications the processor would have record-breaking performance, but in many others it would be underutilized because of the lack of instruction-level parallelism.
The next idea is to allow one “N”-way processor to split off some of its processing elements into several “1”-way processors, so that the processor can be reconfigured to adapt to the current operating environment. Each processing element would need its own private register file for executing as a uniprocessor. A shared, many-ported register file would allow for collaboration between multiple processors. Any other shared resource, like condition codes, would have to be handled similarly. Newly compiled applications could switch between any number of ways during execution but might need to do a context switch to refresh the register file. There is some challenge in designing the caching systems to accommodate the different modes, but my gut feeling is that it is doable.
In actuality, I think that having “N” independent processors with multi-threading compilers and OSes gives the most performance, but if a designer were to field a new leading-edge processor that contained multiple uniprocessors on a chip, he would probably be greeted with yawns. The ability to merge many into one gives designers the ability to hype this processor by achieving record-breaking performance for at least a few well-chosen applications.
In summary, I believe that this idea represents a fairly low-risk, low-cost approach to advancing RISC processors to the next level of performance and hype. I believe that this architecture is realizable and does not need a technological breakthrough to implement. I welcome debate either way, for or against. I can be reached at tekhknowledge@opencores.org.
Ed