Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
Hi.
On Mon, 2004-09-06 at 12:45, markus at reaaliaika.net wrote:
I'm almost forced to answer this... :-)

Although there were many interesting ideas, I'll reply to only one issue: programming languages and memory. It's true that programming languages were developed with a central memory in the developer's mind. Programming languages assume that you have uniform access to all data objects in the system. Whenever you don't have this (for example, when you're working with code distributed over a network or a memory bridge), you have to write some complex code to receive/transmit information between two points. This assumption makes it hard to divide one program across multiple processors without having a shared memory. And when you do have shared memory, it restricts the speed achievable by the system overall. But there's a good reason to use one single memory block in general-purpose CPUs: if you divide it to get parallel accesses (like DSPs do), you at the same time hurt dynamic adaptation to the workload; one of your memories might be full while there is plenty of room in the others. Thus you need to implement some kind of "gateway" to store larger arrays across multiple memories --> you get the shared memory.

Good point. For an initial system, I'd just direct-map arrays to physical memory. It would be wasteful as heck. There would need to be hard-wired addressing circuitry on the FPGA to do a better job without losing much speed.
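To make that "gateway" concrete, here is a minimal sketch of what such hard-wired addressing circuitry would compute -- plain modulo interleaving, where the bank count and names are purely illustrative assumptions:

```c
#include <stdint.h>

#define NUM_BANKS 4   /* assumed; a power of two keeps the decode trivial */

/* One logical array spread across NUM_BANKS physical memories.
 * Consecutive indices land in different banks, so a pipelined loop
 * can have several fetches in flight at once. */
static inline uint32_t bank_of(uint32_t index)   { return index % NUM_BANKS; }
static inline uint32_t offset_of(uint32_t index) { return index / NUM_BANKS; }

uint32_t gateway_read(uint32_t *const banks[NUM_BANKS], uint32_t index)
{
    return banks[bank_of(index)][offset_of(index)];
}
```

With a power-of-two bank count the bank select is just the low index bits, so the decode costs almost nothing in FPGA logic.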
In general, programming languages don't limit the underlying architectures (compilers can do a nice job of overcoming this), but of course the programming language causes the programmer to think about things the way they are represented in the language.

Actually, it's not legal in C to distribute structure data as arrays of properties. However, if you ignore some of the lesser-known rules, you could do it and few would notice.
> To be more specific, let's examine the memory layout of a typical
> object in almost any modern language. The data fields are generally
> sequential in memory... That naturally eliminates parallel memory
> access.

For example, if my memory serves me well, the C standard doesn't explicitly say in what ways objects are stored. First, the compiler can use whatever alignment it decides, and second, the tables can be split or interleaved. This should not break any code, except code that uses fancy pointer arithmetic.

Agreed. I've seen that fancy pointer arithmetic, but it's pretty rare.
The reason for this is that regular CPUs have more efficient instructions for handling objects stored sequentially. They also have efficient instructions for accessing arrays.

In fact, I've benchmarked Intel CPUs in the past to see which data layout works better. The parallel-arrays-of-properties layout generally wins by about 15% in my EDA tool benchmarks. When I look at the code, everything seems like it should run at the same speed; the parallel arrays run faster anyway. I think the reason has to do with cache efficiency. With parallel arrays of data, you load only the data fields of interest into the cache. With the traditional structure layout, you drag in all the other fields too.
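To make the two layouts concrete, here is a sketch; the field names are invented for illustration:

```c
#include <stdint.h>

#define NUM_CELLS 1000000

/* Traditional layout: one struct per object. A loop that only needs x
 * still drags y, width, and flags through the cache, line by line. */
struct cell {
    int32_t  x, y;
    int32_t  width;
    uint32_t flags;
};
struct cell cells_aos[NUM_CELLS];

/* Parallel arrays of properties: a loop over x touches only x's array,
 * so every byte fetched into the cache is a byte the loop uses. */
int32_t  cell_x[NUM_CELLS];
int32_t  cell_y[NUM_CELLS];
int32_t  cell_width[NUM_CELLS];
uint32_t cell_flags[NUM_CELLS];

int64_t sum_x_soa(void)
{
    int64_t sum = 0;
    for (int i = 0; i < NUM_CELLS; i++)
        sum += cell_x[i];   /* 4 useful bytes per 4 bytes fetched */
    return sum;
}
```

The same loop over `cells_aos` would fetch four times the data for identical arithmetic, which matches the roughly 15% win seen above once other effects are folded in.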
> Instead, we need to represent data fields for each class of object
> in separate arrays.

Many databases do this, i.e. they load fields into separate arrays.

Yep. It's not too hard to do in a computer language.

Bill
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
> VLIW computers have not been particularly successful -- sure, you are
> talking about multiple streams of data -- but the physics problems
> are the same.

I'm not a fan of VLIW. It sounds good... just use a bunch of functional units in parallel. However, these guys tend to forget that it's more of a data-routing problem than a computation problem. As a router guy, VLIW sounds to me like a great way to burn power.

Depends on the problem you want to solve. Running Word on a VLIW processor doesn't make sense ;-) I am a video guy. The problem with video is that you're always short of bandwidth. Those damn pixels need to be drawn pretty fast, and the calculations per pixel are enormous; 10 operations per pixel isn't an exception, where one operation (e.g. shading/rendering) isn't one instruction! And then it all needs to be done at least three times per pixel (RGB). The advantage, however, is that the computations per color per pixel are identical. So VLIW does work in this case, together with huge pipelines (200+ stages deep). Although this all brings us back to the issue of feeding the execution units with the data they need fast enough.

Cheers,
Richard
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
Hi, Rudi.
On Mon, 2004-09-06 at 13:16, Rudolf Usselmann wrote:
On Mon, 2004-09-06 at 18:00, Bill Cox wrote:
> Hi.
[SNIP]
> The bottleneck in almost all of the algorithms I examined was not
> computation. The real bottleneck was memory access. I've concluded
> that parallel memory access is the key to beating the Von Neumann CPU.

That's not quite correct (your conclusion, that is). I think you should clarify what you mean by memory access. Obviously we can get plenty of linear bandwidth out of today's DDR chips by building really wide buses; e.g. a 256-bit bus with 400MHz DDRs probably provides more linear memory bandwidth than an Intel P4 can handle today, not counting GPUs.

You're correct.
The problem that many have tried to solve, more or less unsuccessfully, is random-access latency. That's why all modern high-performance CPUs use caches: they allow zero-penalty random access to a small amount of program and data space.

Latency and cache misses are the key issues that keep my inner loops running slowly, from what I can see.
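The distinction shows up clearly in code. A hedged sketch of the two kinds of inner loop (names illustrative):

```c
#include <stddef.h>

struct node { struct node *next; int payload; };

/* Latency-bound: each load depends on the previous one, so the CPU
 * stalls for the full cache-miss latency on every pointer chase. */
int sum_list(const struct node *n)
{
    int sum = 0;
    while (n != NULL) {
        sum += n->payload;
        n = n->next;        /* address unknown until this load returns */
    }
    return sum;
}

/* Bandwidth-bound: addresses are known in advance, so the hardware can
 * keep many fetches outstanding and hide most of the latency. */
int sum_array(const int *a, size_t len)
{
    int sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += a[i];
    return sum;
}
```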
If you replaced your main system memory with high-speed SRAMs, you would not need caches anymore, and your system bottleneck would be back in the CPU.

The cache + traditional SRAM is a pretty good approximation. However, there still seems to be only one data address bus in an Intel CPU. Data is inherently sequentially accessed. This can be improved upon.
[SNIP]
> So, why hasn't this been done yet?
Well, two obvious reasons come to mind: 1) cost, and 2) I guess nobody has come up with an alternative economical memory design. We need memory that can be randomly accessed just like a plain old SRAM, BUT it must be small (1T preferred) like DRAM, and we must be able to manufacture it in high volumes at a high yield. So the real issue here is memory architecture, not integration or distribution, etc.

Cost is a real issue, as is trying to compete against Intel. However, having multiple memory systems working in parallel is potentially faster than the current single-data-bus architecture. So, for example, if every cycle the FPGA core receives all the data fields needed to execute a while loop, it could execute the whole loop body in parallel. The Intel CPUs, on the other hand, still read just one memory data item per cycle, so far as I know.

It's always good talking to you.

Bill
Winning with a reconfigurable computer
by bporcella on Sep 7, 2004
At 12:16 PM 9/6/2004, you wrote:
> Latency and cache misses are the key issues that keep my inner loops
> running slowly, from what I can see.

OK, gotta admit, thinking about this stuff is MUCH more fun than doing real work...

So suppose that we could convince ourselves that it is possible to, say, quadruple DDR-cache throughput (relative to a present state-of-the-art microprocessor) without, of course, compromising latency -- on a single chip. I.e., we create 4 instances of independent caches (32-64 bits) that we can run at, say, 2GHz. What's next?

Well, "layout" IS one of those interesting problems that has been seriously studied over the last 50 years or so... If we could show a real 4X speedup using 4 RISC-type engines and a configurable-logic interconnect system, we could possibly convince some hard-hitting venture capitalist (Sequoia?) to drop a few million on further research.

Some ideas: I suspect that the answer to the throughput question is already pretty well understood by somebody -- so we should be able to discover it with a reasonable level of research. For the application speedup question, we could put together a simulation. The cache and processor are easy to simulate. If we could come up with ANY interconnect system that would show a 4x speedup over the present state of the art, we would have something. Any ideas on how to configure the interconnect system?
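The cache side of that simulation really is small. A minimal direct-mapped hit/miss model, with all parameters assumed for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BITS 5                     /* 32-byte lines (assumed)  */
#define SET_BITS  10                    /* 1024 sets -> 32KB cache  */
#define NUM_SETS  (1u << SET_BITS)

static uint32_t tags[NUM_SETS];
static bool     valid[NUM_SETS];

/* Returns true on a hit; on a miss, installs the line and returns
 * false. Drive it from an address trace of the inner loop under study
 * and count hits versus misses. */
bool cache_access(uint32_t addr)
{
    uint32_t set = (addr >> LINE_BITS) & (NUM_SETS - 1);
    uint32_t tag = addr >> (LINE_BITS + SET_BITS);
    if (valid[set] && tags[set] == tag)
        return true;                    /* hit                      */
    valid[set] = true;                  /* miss: fill the line      */
    tags[set]  = tag;
    return false;
}
```

The interconnect model is the genuinely open part of the simulation.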
bj Porcella
http://pages.sbcglobal.net/bporcella/
Winning with a reconfigurable computer
by Unknown on Sep 7, 2004
On Mon, 2004-09-06 at 21:48, bporcella wrote:
At 12:16 PM 9/6/2004, you wrote:
> > Latency and cache misses are the key issues that keep my inner
> > loops running slowly, from what I can see.
>
> Ok gotta admit thinking about this stuff is MUCH more fun than doing
> real work...

It's Labor Day weekend! Who needs real work?
> So suppose that we could convince ourselves that it is possible to,
> say, quadruple DDR-cache throughput (relative to a present
> state-of-the-art microprocessor) without compromising latency -- on a
> single chip. I.e., we create 4 instances of independent caches (32-64
> bits) that we can run at, say, 2GHz. What's next? Well, "layout" IS
> one of those interesting problems that has been seriously studied
> over the last 50 years or so... If we could show a real 4X speedup
> using 4 RISC-type engines and a configurable logic interconnect
> system, we could possibly convince some hard-hitting venture
> capitalist (Sequoia?) to drop a few million on further research.

Actually, Xilinx has a venture fund. They also have all the really hard pieces to make something like this possibly work: a great fab relationship, money, true layout artists, not to mention an excellent FPGA fabric and most of the hard IP needed. I also have high respect for their management. They also probably have a bit more sense than to invest in a new CPU architecture... By now, everyone should understand that it's not how good your CPU is; it's what software you run, who you are, and how much it all costs (and this architecture doesn't feel particularly cheap). However, I love trying to convince myself that something could actually happen. I'd love to work on a project like this. In the meantime, I'll keep my day job making one-mask structured ASICs better...
> Some ideas: I suspect that the answer to the throughput question is
> already pretty well understood by somebody -- so we should be able to
> discover it with a reasonable level of research.

Someone listening out there probably knows the answers: How fast can a Xilinx FPGA access external memory? How many ports can it support? How fast of a direct-mapped cache can the internal memories make?
> For the application speedup question we could put together a
> simulation. The cache and processor are easy to simulate. If we
> could come up with ANY interconnect system that would show a 4x
> speedup over the present state of the art, we would have something.
> Any ideas on how to configure the interconnect system?

I'm pretty comfortable with interconnect. It'd be custom for each algorithm being sped up. If only inner loops are optimized, the routing to the SRAM buses should be fairly clean. Once the code for the memory accesses is synthesized, we should be able to throw the problem at the place-and-route tools and get fair results. It's not like we're trying to route data from any bus to any functional unit; there should be only a few destinations and sources per SRAM interface for a typical algorithm.

Putting together some demonstrations showing speedups of inner loops should be doable. The hardware compiler is a big project, and building the prototype system isn't simple. I have to say that pulling off this project sounds far-fetched. Unless Xilinx or Altera wants bragging rights to the fastest general-purpose computing machine, there probably won't be any money available to build it.

Still, it's a three-day weekend, and it's time to think about far-fetched ideas. How cool would it be to build the world's fastest single-processor computer? Way cool... I think I'll go to bed and dream about it now.

Thanks for the lively discussion!
Bill
Winning with a reconfigurable computer
by Unknown on Sep 7, 2004
Let's see if I get it right. The main issue here is getting random data fast enough into multiple execution units, or one big execution unit that works on a lot of data (what's the difference?), right?

How about using QDR-II, or DDR-II for that matter, memories? These are true SRAMs, produced in a 90nm technology. The QDR-II has separate read and write ports and uses a 250MHz double-pumped (DDR) clock per port. This provides an astonishing:
- 1G read/write operations per second
- 18Gbit/sec bandwidth per port (36-bit data bus)
- 2GByte/sec bandwidth per port (32-bit data bus)

All this without any latency! Pipelined yes, latency no. These devices are available in 72Mbit (2M x 36-bit) densities. Not directly enough for main memory, but it would make a hell of a cache to execute your while loops from. Best of all, both Xilinx and Altera claim they can handle these memories with their FPGAs.

Then there's another memory variant called QuadPort. These memories have 4 independent ports, all running at 133MHz. This allows you to access the same shared memory from 4 units independently. Unfortunately these memories are pretty small, 1Mbit max. But still, this might be a candidate for your cache.

Cheers,
Richard
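As a sanity check, those figures are self-consistent if each port moves one word per clock edge -- a quick back-of-the-envelope, assuming a 250MHz clock, double data rate, and two independent (read and write) ports:

```c
#include <stdio.h>

int main(void)
{
    const double clk_hz = 250e6;        /* per-port clock (assumed)   */
    const double xfers  = clk_hz * 2.0; /* DDR: two transfers/clock   */
    const int    ports  = 2;            /* separate read + write port */

    printf("ops/sec, both ports:  %.1fG\n", ports * xfers / 1e9); /* 1.0  */
    printf("Gbit/s per 36b port:  %.1f\n", 36 * xfers / 1e9);     /* 18.0 */
    printf("GByte/s per 32b port: %.1f\n", 32 * xfers / 8 / 1e9); /* 2.0  */
    return 0;
}
```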
Winning with a reconfigurable computer
by Unknown on Sep 7, 2004
On Tue, 2004-09-07 at 02:16, Bill Cox wrote:
[SNIP]
> The cache + traditional SRAM is a pretty good approximation.
> However, there still seems to be only one data address bus in an
> Intel CPU. Data is inherently sequentially accessed. This can be
> improved upon.

I think this will be the biggest challenge. The problem right now is that most CPUs implement multiple execution units, each of which can consume multiple data items. Keeping them all fed will be tricky, even once high-speed SRAM-like memory becomes available. I suspect that at some point all memories will be multi-port and feature low-pin-count (multi-)gigabit links to the CPU.

[SNIP]
> Cost is a real issue, as is trying to compete against Intel.
> However, having multiple memory systems working in parallel is
> potentially faster than the current single-data-bus architecture.
> So, for example, if every cycle the FPGA core receives all the data
> fields needed to execute a while loop, it could execute the whole
> thing in parallel. The Intel CPUs, on the other hand, still read
> just one memory data item per cycle, so far as I know.

Well, the only way you can do this right now is with address guessing (aka prediction), and that will only marginally work. EVERY CPU maker out there has tried to solve this problem, and I think all of them came to the conclusion that it's impossible to create a universal solution. Sure, as you say, if your memory can have unlimited ports you will get much closer to the solution, but cost is a factor...
> It's always good talking to you.

Anyway, allow me to redirect this whole discussion in a slightly different direction. I actually think this is what you really want to do: Problem Specific CPUs.
Going against Intel and trying to compete with them is really a tough job. Unlike Microsoft, I believe Intel actually has quite a few very bright engineers, and I have a lot of respect for what they have accomplished; the same goes for AMD.
Here are two bad examples of problem-specific CPUs (let's call them PSCPUs):
- Java processor
- Verilog processor (I know a company was developing one at one point, but I don't know whether it ever became a product.)
The reason I call these bad examples is that they are still trying to solve generic problems, just using different languages and representations.
Two years ago I worked on a PSCPU with a British company. The project fell through for various other reasons. The idea was to build an accelerator for the Spice/HSpice simulator. One guy, a mathematician, sat down, pulled the Spice code apart, and decided that all he needed to accelerate the inner loop was a massively parallel matrix multiplier. So we were going to build just that. It would include some sort of sequencer and data steering, and the heart of it would be a matrix multiplier. We would use an array of these units, interconnected to each other and to some memory, to solve dedicated problems.
This is just an example of where I am trying to go. Perhaps what you, Bill, really want is a dedicated engine that can accelerate routing, or solve some other sort of *dedicated* problem? That's something I think we can do at OpenCores. Competing with Intel, on the other hand, I would seriously doubt!
Regards,
rudi
=============================================================
Rudolf Usselmann, ASICS World Services, http://www.asics.ws
Your Partner for IP Cores, Design, Verification and Synthesis
Winning with a reconfigurable computer
by Unknown on Sep 7, 2004
On Tue, 2004-09-07 at 02:17, Richard Herveille wrote:
> Let's see if I get it right. The main issue here is getting random
> data fast enough into multiple execution units, or one big execution
> unit that works on a lot of data (what's the difference?), right?

Right. The concept of an execution unit might not be quite right, though. Basically, I'm talking about synthesizing the critical inner loops of an algorithm directly to hardware and executing a whole loop iteration per clock cycle. So the execution unit in this case is custom-built for a specific software loop. This custom execution unit often needs several fields from different classes, all in parallel. With several memory ports, we could provide that data just as the loop's execution unit needs it.
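As a hedged illustration (the loop and field names are invented, not taken from any real tool), here is the kind of inner loop being discussed; each property array would bind to its own memory port, so one pipelined iteration can retire per clock:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical placement-style inner loop. Each array binds to its
 * own physical memory port, so x, y, and weight for instance i arrive
 * in the same cycle and the whole body pipelines to one iteration per
 * clock. */
int64_t wirelength_cost(const int32_t *x, const int32_t *y,
                        const int32_t *weight, int n)
{
    int64_t cost = 0;
    for (int i = 0; i + 1 < n; i++) {
        int64_t dx = x[i + 1] - x[i];                 /* port A */
        int64_t dy = y[i + 1] - y[i];                 /* port B */
        cost += weight[i] * (llabs(dx) + llabs(dy));  /* port C */
    }
    return cost;
}
```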
> How about using QDR-II, or DDR-II for that matter, memories? These
> are true SRAMs, produced in a 90nm technology. [...] These devices
> are available in 72Mbit (2M x 36-bit) densities. Not directly enough
> for main memory, but it would make a hell of a cache to execute your
> while loops from.

OK, that's cool. While these memories seem on the small side, they're large enough to do practical work on real EDA problems. For example, let's look at the placement problem. A big design might have 1 million placeable instances. Each QDR-II memory could hold a 32-bit field for any class that had at most 2M instances. Memory for instance ports would be a problem; we'd need memories several times the size of the instance memory to hold even a single field for ports on instances. So I'd guess I'd need several memories of the 2M x 36-bit variety, but also at least a couple that were several times deeper. We could use the 2M x 36-bit memories as caches for the bigger arrays, and store most of the data in a single large external memory.
> Then there's another memory variant called QuadPort. These memories
> have 4 independent ports, all running at 133MHz. This allows you to
> access the same shared memory from 4 units independently.
> Unfortunately these memories are pretty small, 1Mbit max. But still,
> this might be a candidate for your cache.

These don't appeal to me much. I'd rather go with true parallel memories.

Bill
Winning with a reconfigurable computer
by Unknown on Sep 7, 2004
Bill,
Quite a discussion you have out here ;) I have spent quite some time on this subject, and I am sure it will be one of the hottest research areas in the following decade.

First, let me say I wrote several compilers from ANSI C to Verilog RTL; one of the compilers (the first one) is also available on OpenCores.

The biggest problem, as you already noted, is of course memory bandwidth. More specifically, the problem is delivering the data on time -- the biggest problem here is latency. Therefore searching *only* for bigger-bandwidth RAM will not help.

Although it may seem at first glance that C->RTL conversion is straightforward and all it requires is a lot of work, this is not true. I have not been able to prove that solving this problem optimally would require a machine computationally stronger than a Turing machine, but that seems very likely, since the problem becomes very similar to code optimization. However, the problem can be solved non-optimally with transformations that maintain code equivalence, as current compilers do. This is of course very hard with all the pointers in C code, as you already mentioned. With simple transformations you don't get enough optimization to compete with high-frequency sequential CPUs. Java and C# therefore seem much more suitable for conversion than full ANSI C compatibility -- or, alternatively, a C where pointers cannot point to just any area, but only into specific arrays.

One of the biggest findings I had is that every loop that can be optimized (containing memory accesses) can be duplicated, where in the first loop you calculate memory addresses and in the second you collect the data. In the worst case your second loop is empty and you have no optimization, but in most cases you can generate the addresses needed to cope with long RAM access latencies.

The next important finding is that it is quite impossible for a machine to change the architecture (it usually needs to change the whole algorithm) to be more suitable for HW and to save area.

best regards,
Marko
Winning with a reconfigurable computer
by Unknown on Sep 7, 2004
Hi, Marko.
Hey, a real expert in the field! I was hoping there would be one or two of you out there to keep the discussion honest.

On Tue, 2004-09-07 at 05:17, Marko Mlinar wrote:
> Quite a discussion you have out here ;) I have spent quite some time
> on this subject, and I am sure it will be one of the hottest research
> areas in the following decade.

Almost makes me wish I were a grad student, rather than an EDA tool developer.
> First, let me say I wrote several compilers from ANSI C to Verilog
> RTL; one of the compilers (the first one) is also available on
> OpenCores.

I'll check it out. Do you have a link to it? I missed it when I checked the links from the OpenCores site.
> The biggest problem, as you already noted, is of course memory
> bandwidth. More specifically, the problem is delivering the data on
> time -- the biggest problem here is latency. Therefore searching
> *only* for bigger-bandwidth RAM will not help.

Agreed.
> Although it may seem at first glance that C->RTL conversion is
> straightforward and all it requires is a lot of work, this is not
> true. I have not been able to prove that solving this problem
> optimally would require a machine computationally stronger than a
> Turing machine, but that seems very likely, since the problem becomes
> very similar to code optimization.

Yes, it's obviously an NP-hard (or worse) problem to solve optimally. I shouldn't have said it's not hard. In particular, doing it well is very hard. However, it's doable, as you've proved.
> However, the problem can be solved non-optimally with transformations
> that maintain code equivalence, as current compilers do. This is of
> course very hard with all the pointers in C code, as you already
> mentioned. With simple transformations you don't get enough
> optimization to compete with high-frequency sequential CPUs. Java
> and C# therefore seem much more suitable for conversion than full
> ANSI C compatibility -- or, alternatively, a C where pointers cannot
> point to just any area, but only into specific arrays.

I agree that a subset of Java or C# would be a better starting point. Both languages are less ambiguous, and both deemphasize pointers. At work, we work in a subset of C, and we already use integers as handles to objects. Data is already stored in arrays of properties. We don't allow programmers to use pointers, except in very specific cases (like returning multiple values).

Even simpler than compiling Java or C# would be to pick a subset that's expressive enough to allow good algorithms to be written. We've already done this at work. No '.', '->', or '*' operators are allowed. Everything is either a variable or an object, and objects are accessed through their handles. It should be relatively easy to analyze our code, and our data structures are well defined in a text-based schema file, which is used to generate all our object support code. We do allow new property arrays to be allocated to extend the fields of a class. We also automate generation of recursive destructors, eliminating most of the dangling-pointer problems. So we have little use for garbage collection, and our object support code can get dumped into the C->Verilog translator just like algorithmic code.
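For readers unfamiliar with the style, a minimal sketch of such handle-based code (the class and fields are invented for illustration, not viASIC's actual schema):

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Net;   /* an object handle is just an integer index */

/* Properties of class Net live in parallel arrays, one array per
 * field. Extending the class later just means allocating one more
 * property array. */
static uint32_t *net_width;
static Net      *net_first_pin;  /* handles may reference other objects */
static uint32_t  net_count;

void net_init(uint32_t count)
{
    net_count     = count;
    net_width     = calloc(count, sizeof *net_width);
    net_first_pin = calloc(count, sizeof *net_first_pin);
}

/* Accessors replace '.', '->', and '*': field(handle), not obj->field. */
static inline uint32_t net_get_width(Net n)             { return net_width[n]; }
static inline void     net_set_width(Net n, uint32_t w) { net_width[n] = w; }
```

Since each property array is independent, a translator can map each one to its own physical memory without worrying about aliasing.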
> One of the biggest findings I had is that every loop that can be
> optimized (containing memory accesses) can be duplicated, where in
> the first loop you calculate memory addresses and in the second you
> collect the data. In the worst case your second loop is empty and
> you have no optimization, but in most cases you can generate the
> addresses needed to cope with long RAM access latencies.

This is a cool idea. I'd like to examine how it translates into hardware more carefully.
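A sketch of how it might translate, with the FIFO depth an assumption: in hardware the two loop copies would run concurrently, with a FIFO between them absorbing the RAM latency; here they are software-pipelined into one C loop to stay runnable:

```c
#include <stdint.h>

#define AHEAD 16   /* assumed FIFO depth; must cover the RAM latency */

/* Duplicated form of:  for (i = 0; i < n; i++) sum += data[index[i]];
 * The address-generation copy runs AHEAD iterations in front of the
 * data-collection copy. */
int64_t gather_sum(const int32_t *data, const uint32_t *index, int n)
{
    int32_t fifo[AHEAD];    /* stands in for the hardware prefetch FIFO */
    int64_t sum = 0;

    for (int i = 0; i < n + AHEAD; i++) {
        if (i >= AHEAD)                 /* collection loop: consume the  */
            sum += fifo[i % AHEAD];     /* value fetched AHEAD iters ago */
        if (i < n)                      /* address loop: compute address */
            fifo[i % AHEAD] = data[index[i]];   /* and issue the fetch   */
    }
    return sum;
}
```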
> The next important finding is that it is quite impossible for a
> machine to change the architecture (it usually needs to change the
> whole algorithm) to be more suitable for HW and to save area.

I'm not surprised. I have found that taking advantage of parallel processing generally requires changes to the algorithms at a high level. I'm not thinking of generating hardware that's many times faster than a traditional CPU. I'm just thinking that more than 2x is doable, primarily through good memory management across multiple data ports. I also think the same tricks could be used to speed up a more traditional CPU. The main thing is to hammer down the memory bottleneck through several independent memory buses.
> best regards,
> Marko

Thanks for the well-informed thoughts on the topic.

Bill
Winning with a reconfigurable computer
by Unknown on Sep 7, 2004
On Tue, 2004-09-07 at 16:17, Marko Mlinar wrote:
Very interesting, and kind of what I would have expected. If that were an easy/possible solution, we would have seen it in production a long time ago.
One question does come to mind here: why do people always take such a high-level language as C as a starting point? As you, Marko, point out, it's impossible to resolve some of the higher-level abstractions. Why not take assembly language, try to parallelize it, and map that into hardware?
rudi
=============================================================
Rudolf Usselmann, ASICS World Services, http://www.asics.ws
Your Partner for IP Cores, Design, Verification and Synthesis
Winning with a reconfigurable computer
by Unknown on Sep 7, 2004
On Tuesday 07 September 2004 13:16, Bill Cox wrote:
> Hi, Marko.
>
> Hey, a real expert in the field! I was hoping there would be one or
> two of you out there to keep the discussion honest.

Thanks, but don't expect too much, since I am also searching for most of the answers ;) What solution are you trying to make? Is the solution commercial? Is it under viASIC? Are you interested in a commercial cooperation?
> > I have spent quite some time on this subject and I am sure it will
> > be one of the hottest research areas in the following decade.
>
> Almost makes me wish I were a grad student, rather than an EDA tool
> developer.

Well, that is not so bad either ;)
> I'll check it out. Do you have a link to it? I missed it when I
> checked the links from the OpenCores site.

It is part of or1ksim, called the CUC (custom unit compiler). I believe it is currently broken and unmaintained, but there were some versions which worked on the simulator. The CUC was abandoned due to large circuit sizes, and I began to build a second tool, which is more commercial.
> Yes, it's obviously an NP-hard (or worse) problem to solve optimally.
> I shouldn't have said it's not hard. In particular, doing it well is
> very hard. However, it's doable, as you've proved.

Well, I have proven that conversion is possible -- which is undoubtedly true. But speed is another issue. The CUC suffered severely from memory bandwidth, despite the memory bursts it tried to create by concatenating memory accesses.
> I agree that a subset of Java or C# would be a better starting point.
> Both languages are less ambiguous, and both deemphasize pointers.

Not fully true -- C# has pointers, but their usage is limited to smaller areas, which means the arrays used there can be put into smaller RAMs.
> At work, we work in a subset of C, and we already use integers as
> handles to objects. Data is already stored in arrays of properties.
> We don't allow programmers to use pointers, except in very specific
> cases (like returning multiple values).

Moreover, you could allow pointers as indexes. All this allows you to distribute the memories into smaller pieces, which is very valuable.
> Even simpler than compiling Java or C# would be to pick a subset
> that's expressive enough to allow good algorithms to be written.
> [...] It should be relatively easy to analyze our code, and our data
> structures are well defined in a text-based schema file, which is
> used to generate all our object support code.

If I understand you correctly, you first transform the C code to the schema and then optimize it further.
> > One of the biggest findings I had is that every loop that can be
> > optimized (containing memory accesses) can be duplicated, where in
> > the first loop you calculate memory addresses and in the second you
> > collect the data.
>
> This is a cool idea. I'd like to examine how it translates into
> hardware more carefully.

I think this is the right way to go.
> I'm not surprised. I have found that taking advantage of parallel
> processing generally requires changes to the algorithms at a high
> level.

Yes, it is quite obvious, but very constraining.
> I'm not thinking of generating hardware that's many times faster than
> a traditional CPU. I'm just thinking that more than 2x is doable,
> primarily through good memory management across multiple data ports.

I believe 2x is doable, but more would probably require manual algorithm changes. If you can afford to have SW developers write code optimized for your architecture, then you are definitely a winner.
> I also think the same tricks could be used to speed up a more
> traditional CPU. The main thing is to hammer down the memory
> bottleneck through several independent memory buses.

Yes; if you look at the IA-64 architecture, it has a lot of instructions dedicated to address prediction and to supplying addresses sooner than they are needed.
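On a conventional CPU the closest software analogue is explicit prefetching. A sketch using GCC's __builtin_prefetch (a real GCC/Clang extension; the prefetch distance of 16 is a guess that would need tuning per machine):

```c
#include <stdint.h>

/* Request the address early, use the data later -- the software
 * version of the duplicated address/collection loops. */
int64_t sum_indexed(const int32_t *data, const uint32_t *index, int n)
{
    int64_t sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&data[index[i + 16]]); /* issue early */
        sum += data[index[i]];                        /* consume now */
    }
    return sum;
}
```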
> Thanks for the well-informed thoughts on the topic.

I am very interested in this discussion, since apparently you also have some experience in the area.

best regards,
Marko
Winning with a reconfigurable computer
by Unknown on Sep 7, 2004
> Very interesting, and kind of what I would have expected. [...] Why
> not take assembly language, try to parallelize it, and map that into
> hardware?

Rudi,

asm->Verilog conversion is surely simpler than C->Verilog, but if you want the Verilog small and fast enough, it is much easier to transform from C directly. Even C is not high-level enough for conversion. Java is a bit better, and some functional language would be better still, but the main problems remain and nobody is using one.

For example, asm versus C: all asm instructions use 32-bit types, which would yield a very large die area. With C functions you at least have 8-bit and 16-bit types. Detecting the width actually used is possible to some extent, but it is very limited, and even then very hard.

best regards,
Marko
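Marko's width point in two functions (types chosen for illustration):

```c
#include <stdint.h>

/* At the C level the narrow widths are explicit, so a C->Verilog tool
 * can emit 8- and 16-bit datapaths directly. */
uint8_t  scale8(uint8_t pixel, uint8_t gain) { return (uint8_t)((pixel * gain) >> 4); }
uint16_t sum16(uint16_t a, uint16_t b)       { return (uint16_t)(a + b); }

/* Compiled to a 32-bit RISC ISA, both become full 32-bit ALU
 * operations; recovering the narrow widths from the asm afterwards is
 * the hard part. */
```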
Winning with a reconfigurable computer
by Unknown on Sep 7, 2004
On Tuesday 7 September 2004 14:40, Marko Mlinar wrote:

> asm->Verilog conversion is surely simpler than C->Verilog, but if you
> want the Verilog small and fast enough, it is much easier to
> transform from C directly. Even C is not high-level enough for
> conversion.

C seems to be a nice target because a lot of people know C -- a lot more than know VHDL or Verilog. But the C guys don't understand anything about hardware, so their code is too often slow... :)
From the software point of view, I don't understand why Behavioral Compiler was not used more. A lot of its features are useful. For example, it could extract an FSM from "wait clock" statements inside a function, which the usual synthesizers don't permit. Using "wait" inside functions could really reduce code size. Using at least one big memory makes it possible to handle even a very high-level description. So a compiler that could handle high-level HDL could soon be a great help.

The other point is the optimizing part. "Sometimes" the synthesizer is not clever at all, which tends to force you to dirty up the code to get what you need. In that case, intermediate human intervention should be possible -- something more powerful than the usual compiler scripting.

Nicolas Boulay
Winning with a reconfigurable computer
by Unknown on Sep 8, 2004
Some related news:
http://www.us.design-reuse.com/news/news8597.html
Tensilica Announces Major IC Design Automation Breakthrough, The Automatic Generation of Optimized Programmable RTL Engines from Standard C Code
Automates RTL Block Design; Adds Flexibility through Programmability
SANTA CLARA, Calif. - Sept. 6, 2004 - Tensilica(R), Inc. today announced that it has achieved a major design automation breakthrough - the automated design of optimized configurable processors from standard C code using the company's new XPRES™ (Xtensa(R) PRocessor Extension Synthesis) compiler. This tool enables the rapid development of optimized system-on-chip (SOC) devices without requiring designers to hand code their hardware using design languages like VHDL and Verilog, which take months of design and verification effort.
Instead, designers input the original algorithm that they're trying to optimize, written in standard ANSI C/C++, and the XPRES compiler, coupled with Tensilica's automated processor generation technology, automatically generates an RTL (register transfer level) hardware description and associated software tool chain. In less than an hour, the resulting hardware block is delivered in the form of a pre-verified Xtensa LX processor core, enabling customers to future proof their designs due to its inherent programmability, and avoid the cost and risk associated with verifying custom logic. Additionally, the generated RTL fully rivals the performance and efficiency of hand-coded RTL blocks with many concurrent operations, efficient data types, and optimized multiple wide deep pipelines.