Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
Hi.
Aren't discussion boards great? You can publish ideas without brown-nosing anybody!

It's kind of a radical possibility, but a very old idea. If an FPGA, or Parallel Array Processor (or any other reconfigurable computer), could be a better general-purpose CPU, the market for it could eventually run into the billions (even if Intel still made most of them). It certainly would add a fun twist to my Linux machine.

Summary of the following discussion:

I believe that we can write hardware compilers for general-purpose algorithms that will run faster when compiled to FPGAs than on an Intel machine. That could result in some pretty amazing things.

IMO, the main focus of such a machine needs to be massively improved memory access, rather than the more common focus on parallel computation. By simultaneously optimizing the logic used in computation with a customized memory system, improvements vs traditional CPU architectures become practical.

This would require minor changes to the computer languages we use today in how they represent and track objects in memory. Also, new hardware compilers would have to be written. Building the hardware is probably the easiest part.

Some ideas are outlined below.

Bill

----------------------------------------------

While I'm not an expert in the large field of reconfigurable computing, I am an expert in configurable logic fabrics. I'm also well versed in synthesis technology. My main focus has been on place-and-route algorithms used to implement logic in FPGAs and ASICs. That said, I haven't kept up with the many papers in reconfigurable computing, so these ideas may well be old news. Sorry about that.

Last I heard, Intel processors still usually won at running generic CPU-intensive algorithms (like place and route), even if you compiled your C code into an advanced ASIC. IMO, that makes the entire field of reconfigurable computing a neat idea, but with little practical use. However, I'm convinced that this need not be the case. I've analyzed several EDA algorithms I've written, and have concluded that almost all of them could run faster in an FPGA than on an Intel machine. However, we'd have to get super-fast memory interfaces on new FPGAs, and some really fast on-chip memory to implement caches. In the meantime, we can compile our algorithms to custom ASICs and start winning in hardware there.

The bottleneck in almost all of the algorithms I examined was not computation. The real bottleneck was memory access. I've concluded that parallel memory access is the key to beating the Von Neumann CPU. In a typical inner loop, I found that at most a few CPU instructions get executed before we hit another memory access, usually a read. The instructions can easily be done in parallel, and would run faster on any FPGA than on an Intel CPU. However, that memory access presents a real problem. Intel CPUs suck data in from main memory at unbelievable rates. So, if my ASIC- or FPGA-based system had to spend even 1 ns per data read or write, it loses.

So, how can we beat a traditional CPU? Simple. Read memory in parallel. I didn't measure it, but in the inner loops I examined I saw very high potential for parallel memory access. For example, in simulated annealing placement algorithms, we compute bounding boxes on nets. One memory interface could suck in the port references, and others could use those port references to suck in X and Y coordinates. There were several other fields on instances, cells, and ports that could also be accessed in parallel.
Part of the problem here is that the Von Neumann bottleneck has made it into our computer languages. It's so ingrained that we naturally reject anything that's much different.

To be more specific, let's examine the memory layout of a typical object in almost any modern language. The data fields are generally sequential in memory. An object reference is generally just a pointer to the first field of the object. If we lay out objects in memory this way, we naturally have to have all the fields of an object in the same address space. We can play tricks like having multiple memory banks, but in the end, we access the same address port to get each field of our object. That naturally eliminates parallel memory access.

Instead, we need to represent the data fields for each class of object in separate arrays. For example, let's suppose that one array in memory contains only the arrays of ports on nets. Another contains only port X locations, and another contains only port Y locations. Each array can easily be located in a different physical memory. An object reference is simply an index into these arrays, rather than a pointer into memory. Now, a net is just an index. It can be used to access any property simply by indexing into the corresponding array for that property. Going back to the simulated annealing example, once we've got port data streaming in at high speed through one memory port, we can use the port indexes to access the other memories containing the X and Y locations. This simple parallel access gives the much-needed multiplier of more than a 2x speed improvement vs a traditional CPU. We didn't get the big win of 100x or anything like that, but we win.

Generally, compiling C code into hardware is not all that hard (lots of work, but not that hard). There are multiple companies that do it today. However, those nasty pointers are a problem. I've seen hardware compilers for C compile all your algorithms just fine, and then generate mux trees to munge all memory accesses down into a single address bus. The problem is that the language definition tells you how memory is laid out. We can fix the language with only minor changes, so that how we represent objects is not visible to the programmer.

Another problem in more modern languages is garbage collection. It's nice to have, but it greatly complicates the job of compiling custom memory hardware for a specific application. However, if we stick to a slightly modified C, we're in good shape. We can even add advanced object tracking to the language to make up for not having garbage collection.

To summarize, to win we'd need a new or modified C compiler, and a hardware compiler capable of generating not only parallel logic, but parallel memories. If access to those parallel memories is fast enough, with good caches and high-speed interfaces, we could easily win vs a traditional computer. There's a lot of work involved, but no real technology barriers.

So, why hasn't this been done yet?

Bill
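As an illustration of the field-per-array layout Bill describes, here is a minimal C sketch. It is not from the original post; the names (PortRef, portX, portY, netBoundingBox) and sizes are invented for the example. On the proposed hardware, each array would sit in its own physical memory, so the reads in the loop could be issued in parallel rather than serialized through one address bus.

#include <stdint.h>

typedef uint32_t PortRef;          /* a "port" reference is just an index */

#define MAX_PORTS 100000

/* Field-per-array ("structure of arrays") storage for ports. */
static int32_t portX[MAX_PORTS];   /* X locations, could live in bank A */
static int32_t portY[MAX_PORTS];   /* Y locations, could live in bank B */

/* Bounding-box computation for one net, given its list of port indexes.
 * A conventional CPU serializes these reads; hardware with separate
 * memories could stream netPorts[] from one port while fetching
 * portX[] and portY[] from two others at the same time. */
static void netBoundingBox(const PortRef *netPorts, uint32_t numPorts,
                           int32_t *xMin, int32_t *yMin,
                           int32_t *xMax, int32_t *yMax)
{
    *xMin = *yMin = INT32_MAX;
    *xMax = *yMax = INT32_MIN;
    for (uint32_t i = 0; i < numPorts; i++) {
        PortRef p = netPorts[i];
        if (portX[p] < *xMin) *xMin = portX[p];
        if (portX[p] > *xMax) *xMax = portX[p];
        if (portY[p] < *yMin) *yMin = portY[p];
        if (portY[p] > *yMax) *yMax = portY[p];
    }
}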
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
> The bottleneck in almost all of the algorithms I examined was not computation. The real bottleneck was memory access. I've concluded that parallel memory access is the key to beating the Von Neumann CPU.

Bill, this finding of yours is perfectly true. There are many older studies/papers which prove this. In any computation, the bottleneck is usually memory access time. I carried out similar simulation-based research some time ago on program execution speeds with various CPU and memory speeds. Traditionally, logic manufacturing on a chip is optimized for high speed, but memory manufacturing technology is optimized for area. This difference in manufacturing process/methodology has created a "gap" between logic speeds and memory speeds, and that gap is ever increasing in the Von Neumann architecture. So an Intel CPU usually spends most of its time waiting for memory, and the number of wasted CPU cycles is increasing.

> Part of the problem here is that the Von Neumann bottleneck has made it into our computer languages. It's so ingrained that we naturally reject anything that's much different.

This is probably correct, although I am no software expert. But even then, FPGAs need highly optimized memory arrays, with much lower access times than Intel CPU caches provide, to speed up your applications.

> To summarize, to win we'd need a new or modified C compiler, and a hardware compiler capable of generating not only parallel logic, but parallel memories. If access to those parallel memories is fast enough, with good caches and high-speed interfaces, we could easily win vs a traditional computer. There's a lot of work involved, but no real technology barriers. So, why hasn't this been done yet?
> Bill

I think we have yet to optimize memory interfaces for speed, which would primarily be accomplished if memory manufacturing technologies improve.

-Rohit
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
Nice idea. It looks like FPGA + memories.
You speak about the use of many memory ports. But which is better: two 32-bit ports or one 64-bit port? Usually the true limitation is the pin count (because of package cost, routing problems, power consumption). The latest GPU memory controllers use four 64-bit ports instead of one big 256-bit port. I believe it's more a problem of fanout on the address bus. From inside the chip, it looks like a multi-bank port.
From a market point of view, if you create a C compiler that could use an FPGA plus common DDR-SDRAM chips, you could have great success! But is it so easy to create a C compiler that targets an FPGA plus "some" memory ports?

nicO
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
On Monday, 6 September 2004 13:35, Rohit Mathur wrote:
> > So, why hasn't this been done yet?
> > Bill
>
> I think we have yet to optimize memory interfaces for speed, which would primarily be accomplished if memory manufacturing technologies improve.
> -Rohit

You remember the big Slot 1 cartridge of the first Pentium II? It was used to provide 512 KB of cache less expensively than the on-die cache of the Pentium Pro. If Intel/AMD want more speed, they could always do the same again. But instead of using SRAM, they could use the very latest memory interface, like DDR2 or DDR3, as GPUs do. Video cards can use the fastest memory because there is no connector in the routing path. So you could imagine an external 128 MB L3 cache running at 1 GHz, which is 4 times the current data rate of RAM. Then the usual memory access could concentrate on high bandwidth rather than low latency.

nicO
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
Bill and OpenCores forum readers :-)
Great discussion point, I quite enjoyed the read.
I am currently an Electrical and Computer Systems Engineering student at
Monash University in Melbourne, Australia. As part of a core course unit we
all study Computer Architecture, particularly the MIPS architecture, which
seems to be a fairly efficient system. As an extension to the architecture,
we have also been versed in parallel processor systems and memory
architectures.
I think what you propose could make a great final year thesis project for a
student at a university such as Monash to get your idea rolling! What better
way to start off a good open source project than to get a student involved
who may then follow it for the rest of his/her career.
We also have VHDL / FPGA based subjects, so I'm sure a few students might
look into it. You just never know.
Perhaps OpenCores can contact a few universities around the world and
sponsor / take care of a project for a set of students to look further into
the ideas you posed. This way everyone at OpenCores can follow along and
help guide the project with the experience of all the forum readers.
What do you think, Bill?
- Justin Young
Winning with a reconfigurable computer
by bporcella on Sep 6, 2004
An HTML attachment was scrubbed...
URL: http://www.opencores.org/forums.cgi/cores/attachments/20040906/c91f615f/attachment.htm
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
On Mon, 6 Sep 2004, Bill Cox wrote:
> Hi.
[snip]
> parallel memories. If access to those parallel memories is fast enough, with good caches and high-speed interfaces, we could easily win vs a traditional computer. There's a lot of work involved, but no real technology barriers.
> So, why hasn't this been done yet?
> Bill

I dimly remember IBM making a fuss about something they called 'memory in processor' systems around 1998, and even more dimly remember they were similar to what you're talking about. But googling isn't turning up anything. Does anyone else have a better memory than me?

Graham
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
* Graham Seaman <graham at seul.org> [2004-09-06T12:19:09-0400]:

> I dimly remember IBM making a fuss about something they called 'memory in processor' systems around 1998, and even more dimly remember they were similar to what you're talking about. But googling isn't turning up anything. Does anyone else have a better memory than me?

How about these?

Intelligent RAM (IRAM): http://iram.cs.berkeley.edu/
Processor-in-Memory (PIM): http://www.cse.nd.edu/~pim/

-dave
Winning with a reconfigurable computer
by markus on Sep 6, 2004
I'm almost forced to answer this... :-)
Although there were many interesting ideas, I'll reply to only one issue: programming languages and memory.

It's true that programming languages were developed with a central memory in the developer's mind. Programming languages assume that you have uniform access to all data objects in the system. Whenever you don't have this (for example, you're working with code distributed over a network or a memory bridge), you have to write some complex code to receive/transmit information between two points. This assumption makes it hard to divide one program among multiple processors without having a shared memory. And when you have that, it places restrictions on the speed achievable by the system overall.

But there's a good reason to use one single memory block in general-purpose CPUs: if you divide it to get parallel accesses (like DSPs do), you at the same time hurt the dynamic adaptation to the workload; one of your memories might be full while there is plenty of room in the others. Thus you need to implement some kind of "gateway" to store larger arrays across multiple memories --> you get the shared memory back.

In general, programming languages don't limit the underlying architectures (compilers can do a nice job of overcoming this), but of course a programming language causes the programmer to think about things in the way they are represented in the language.
> To be more specific, let's examine the memory layout of a typical object in almost any modern language. The data fields are generally sequential in memory... That naturally eliminates parallel memory access.

For example, if my memory serves me well, the C standard doesn't explicitly say in what way objects are stored. First, the compiler can use whatever alignment it decides, and second, the tables can be split or interleaved. This should not break any code, except code that uses fancy pointer arithmetic. The reason for this is that regular CPUs have more efficient instructions for handling objects stored in a sequential manner.
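A small, purely illustrative C sketch of the sequential layout being discussed here (the struct and its fields are invented for the example; exact offsets and padding depend on the compiler's alignment choices):

#include <stdio.h>
#include <stddef.h>

struct Port {
    char name[6];   /* may be followed by padding bytes */
    int  x;
    int  y;
};

int main(void)
{
    /* Every field of one object is reached through one base pointer,
     * so all accesses go through the same address bus. */
    printf("sizeof(struct Port) = %zu\n", sizeof(struct Port));
    printf("offsetof x = %zu, offsetof y = %zu\n",
           offsetof(struct Port, x), offsetof(struct Port, y));
    return 0;
}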
> Instead, we need to represent data fields for each class of object in separate arrays.

Many databases do this, i.e. they load fields into separate arrays.
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
On Monday, 6 September 2004 18:42, David I. Lehn wrote:
PIM sounds like processor + memory on the same die. Quite common nowadays. This reminds me of an article about IBM's new supercomputer. They said that each node uses 128 MByte "on chip" (or 128 Mb?) and can then use the "external memory" the way swap is usually used. So throughput is the true goal of main memory, while latency is the goal of on-chip memory. So memory could be inside the chip, with external storage when that's not enough, like disk swap.

I read an article a few months ago about a company that designed a CPU with a kind of FPGA as a computing unit. They claimed extraordinary performance for tasks such as DES computation. Another article spoke about a supercomputer-on-a-chip: basically a network of a few dozen SIMD-like processors, with hundreds of 64-bit multipliers.

Maybe a kind of mix of all of this could be done :)

nicO
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
On Mon, 2004-09-06 at 18:00, Bill Cox wrote:
> Hi.
[SNIP]
> The bottleneck in almost all of the algorithms I examined was not computation. The real bottleneck was memory access. I've concluded that parallel memory access is the key to beating the Von Neumann CPU.

That's not quite correct (your conclusion, that is). I think you should clarify what you mean by memory access. Obviously we can get plenty of linear bandwidth out of today's DDR chips and by building really wide buses; e.g. a 256-bit bus with 400 MHz DDRs is probably more linear memory bandwidth than an Intel P4 can handle today, not counting GPUs. The problem that many have tried to solve, more or less unsuccessfully, is random access latency. That's why all modern high-performance CPUs use caches: they allow zero-penalty random access to a small amount of program and data space. If you replaced your main system memory with high-speed SRAMs, you would not need caches anymore, and your system bottleneck would be back in the CPU.

[SNIP]
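The bandwidth-versus-latency distinction made above can be illustrated with a small, self-contained C sketch (sizes and names are arbitrary, and the absolute numbers depend entirely on the machine): a streaming sum is limited by bandwidth and prefetching, while a dependent pointer chase pays the full random-access latency on every load.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 22)   /* ~4M elements: larger than a typical cache */

int main(void)
{
    long   *data = malloc(N * sizeof *data);
    size_t *next = malloc(N * sizeof *next);
    size_t *perm = malloc(N * sizeof *perm);
    if (!data || !next || !perm) return 1;

    /* Build one big random cycle so every load depends on the previous one. */
    for (size_t i = 0; i < N; i++) { data[i] = (long)i; perm[i] = i; }
    for (size_t i = N - 1; i > 0; i--) {               /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i + 1 < N; i++) next[perm[i]] = perm[i + 1];
    next[perm[N - 1]] = perm[0];

    clock_t t0 = clock();
    long sum = 0;
    for (size_t i = 0; i < N; i++) sum += data[i];     /* streaming: bandwidth-bound */
    clock_t t1 = clock();

    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];        /* chasing: latency-bound */
    clock_t t2 = clock();

    printf("sum=%ld p=%zu  stream=%ld ms  chase=%ld ms\n", sum, p,
           (long)((t1 - t0) * 1000 / CLOCKS_PER_SEC),
           (long)((t2 - t1) * 1000 / CLOCKS_PER_SEC));
    free(data); free(next); free(perm);
    return 0;
}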
> So, why hasn't this been done yet?
> Bill

Well, two obvious reasons come to mind:

1) Cost.

2) I guess nobody has come up with an alternative, economical memory design. We need memory that can be randomly accessed just like a plain old SRAM, BUT it must be small (1T preferred) like DRAM, and we must be able to manufacture it in high volumes at a high yield.

So the real issue here is memory architecture, not integration or distribution, etc.
Regards,
rudi
=============================================================
Rudolf Usselmann, ASICS World Services, http://www.asics.ws
Your Partner for IP Cores, Design, Verification and Synthesis
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
On Monday, 6 September 2004 18:42, David I. Lehn wrote:

> Intelligent RAM (IRAM): http://iram.cs.berkeley.edu/

IRAM is quite different: it's a mix of CPU and memory (lots of memory), but if memory space runs out, they don't add an external DRAM chip but another IRAM chip... So the real bottleneck becomes the network between the chips. Whatever the task, this solution provides the most memory bandwidth. Imagine a computer where you add processors the way you add DRAM today. :)

On Monday, 6 September 2004 19:16, Rudolf Usselmann wrote:

> 2) I guess nobody has come up with an alternative, economical memory design. We need memory that can be randomly accessed just like a plain old SRAM, BUT it must be small (1T preferred) like DRAM, and we must be able to manufacture it in high volumes at a high yield.

This kind of memory exists: the "FRAM", and all the technology called "spintronics", which uses the spin of electrons instead of the presence of electrons. Current FRAMs are quite small and have density problems (because of the control logic, not because of the cell). They have around 15 ps of random access time. This technology could be added to current CMOS technology. FRAM uses 1T and is non-volatile memory. Intel has signed a contract with a French laboratory for the caches of their future chips. This will happen within 5 to 10 years. So it will be the time of a 4 GByte chip with 4 CPU cores in it :)

nicO
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
On Mon, 2004-09-06 at 08:08, Nicolas Boulay wrote:
> Nice idea. It looks like FPGA + memories.
> You speak about the use of many memory ports. But which is better: two 32-bit ports or one 64-bit port? Usually the true limitation is the pin count (because of package cost, routing problems, power consumption).

Good points. The really big FPGAs come with several hundred pins. Are the general-purpose I/Os capable of high-speed memory access? If not, that's a real problem. For starters, I'd probably dump virtual memory access, and just directly map property arrays onto multiple external memory buses. Hopefully I could get 3 to 5 memory interfaces before running into power problems.
> The latest GPU memory controllers use four 64-bit ports instead of one big 256-bit port. I believe it's more a problem of fanout on the address bus. From inside the chip, it looks like a multi-bank port.

There are lots of great tricks for speeding up a memory. However, it still takes a single address port, or two in the case of a Harvard architecture, but still only one for data memory. That's a real problem for parallel access.
> From a market point of view, if you create a C compiler that could use an FPGA plus common DDR-SDRAM chips, you could have great success! But is it so easy to create a C compiler that targets an FPGA plus "some" memory ports?
> nicO

Well, it's a really big task. I'd hate to have to do it as an open-source project. However, there's nothing in the list of tasks that isn't doable. It's mostly common technology. As long as the compiler understands the data arrays, scheduling which array goes in which memory seems fairly straightforward (but still a lot of work).

I'm assuming that someone would build a board with a honking big FPGA on it, with several large independent external memory banks attached to it. There might even be a second-level external cache for each memory bus. Internal memory would mostly be used for first-level cache. Assuming such a board exists, assigning arrays of data to memory banks is a scheduling problem.

I assume the big FPGA would have at least one embedded CPU that could be used for running the vast majority of a program's code. Some nifty piece of the compiler would schedule critical inner loops to be implemented in hardware (probably with user input). The time-consuming inner loops would hopefully run faster than a general-purpose CPU, since the data accesses they do could be done in parallel.

Of course, if the best memory access times we can get on an FPGA are 10x slower than the best Intel processor buses and cache, then the whole idea won't work. For this to work out, the FPGA would have to have very good memory access times.

Bill
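A purely hypothetical sketch of what the "slightly modified C" and the bank-assignment step might look like. The pragmas below are invented for this illustration only (a standard C compiler simply ignores unknown pragmas), and the array names reuse the bounding-box example from earlier in the thread.

#include <stdint.h>

#define MAX_PORTS 100000

#pragma memory_bank(0)      /* hypothetical: pin this array to external bank 0 */
int32_t portX[MAX_PORTS];

#pragma memory_bank(1)      /* hypothetical: external bank 1, on its own bus   */
int32_t portY[MAX_PORTS];

#pragma memory_bank(2)      /* hypothetical: external bank 2, on its own bus   */
uint32_t netPorts[MAX_PORTS];

#pragma to_hardware         /* hypothetical: compile this loop into FPGA logic */
int64_t netHalfPerimeter(uint32_t first, uint32_t count)
{
    int32_t xMin = INT32_MAX, xMax = INT32_MIN;
    int32_t yMin = INT32_MAX, yMax = INT32_MIN;
    /* With each array behind its own address bus, the generated hardware
     * could fetch netPorts[], portX[] and portY[] concurrently. */
    for (uint32_t i = first; i < first + count; i++) {
        uint32_t p = netPorts[i];
        if (portX[p] < xMin) xMin = portX[p];
        if (portX[p] > xMax) xMax = portX[p];
        if (portY[p] < yMin) yMin = portY[p];
        if (portY[p] > yMax) yMax = portY[p];
    }
    return (int64_t)(xMax - xMin) + (int64_t)(yMax - yMin);
}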
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
On Mon, 2004-09-06 at 11:43, bporcella wrote:
> At 04:00 AM 9/6/2004, you wrote:
> Some of these issues have been touched on --- but I'll throw in my 2 cents worth.....
[snip]

Always worth more than that :-)
> > However, I'm convinced that this need not be the case. I've analyzed several EDA algorithms I've written, and have concluded that almost all of them could run faster in an FPGA than on an Intel machine. However, we'd have to get super-fast memory interfaces on new FPGAs, and some really fast on-chip memory to implement caches. In the meantime, we can compile our algorithms to custom ASICs and start winning in hardware there.
>
> As you mention "new FPGAs", you clearly understand part of the problem -- one cannot use the generic I/Os and the distributed general-purpose memory of FPGAs to implement a high-speed cache -- the cache and its interface must be carefully laid out. There are also SSO (simultaneous switching output) issues related to all this.

Yep. You get it. These aren't insurmountable problems, but unless Xilinx or Altera throw a lot of $ at the problem and come out with an FPGA that can do the job in the latest process, it probably won't happen. I wonder if the lure of the general-purpose CPU market is interesting enough to entice them.
> VLIW computers have not been particularly successful -- sure, you are talking about multiple streams of data -- but the physics problems are the same.

I'm not a fan of VLIW. It sounds good... just use a bunch of functional units in parallel. However, these guys tend to forget that it's more of a data routing problem than a computation problem. As a router guy, VLIW sounds to me like a great way to burn power.
> I suspect the Intels are pretty close to the bleeding edge of I/O throughput - per package - already - and they have all the big bucks to spend on improving the state of the art. If you can catch up with them using any alternative architecture -- they continue doubling their I/O throughput every 2 years.......

The only way this could work out is if a big guy like Xilinx put a real effort into it. Otherwise, you'd get left in Intel's dust as they shrink to 65 nm and below. Isn't Xilinx also on the bleeding edge of I/O throughput? Is it possible to make good memory buses with what they have? I doubt I could run the Xilinx memories at 1 Gig or more. That doesn't mean it couldn't be done...
> > The bottleneck in almost all of the algorithms I examined was not computation. The real bottleneck was memory access. I've concluded that parallel memory access is the key to beating the Von Neumann CPU.
> [big snip]
> Lots of people have come to this conclusion. One would think the solution would be shared distributed memory machines using standard microprocessors --- just need a good inter-processor communication system (sounds simple, doesn't it :-).
> > So, why hasn't this been done yet?
>
> Cause it's hard.

It'd sure be fun, though :-)
bj Porcella
http://pages.sbcglobal.net/bporcella/
Winning with a reconfigurable computer
by markus on Sep 6, 2004
Bill Cox:
> There are lots of great tricks for speeding up a memory. However, it still takes a single address port, or two in the case of a Harvard architecture, but still only one for data memory.

Not always. In my guitar effect box project I use the Motorola DSP563xx series, and there are two separate data memory "spaces" (X and Y). Basically you can have three memory accesses (code + 2 x data) in one cycle. And that's not to mention other, more modern DSPs. It could be useful for you to check what kind of memory-access instructions that DSP has.
> NicO: But is that so easy to create a C compiler that targets an FPGA plus "some" memory ports?

I have a HLL compiler "wreck" for the DSP563xx, although it's not a C compiler... but a conventional compiler anyway. It was a huge job to get the compiler to reserve memory blocks "wisely" from the three alternatives (P, X and Y memories) and to make decisions about which data is located in internal memories and which goes outside. It's not that simple; for example, if you pass a pointer to a function, you need different opcodes to handle different memory spaces (on the DSP563xx) --> you need separate functions accessing the different memory address spaces.
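A purely illustrative C sketch of the separate-functions problem described above. The X_MEM and Y_MEM qualifiers are invented for this example (and defined away so the snippet compiles as plain C); real DSP toolchains expose memory spaces through their own, tool-specific extensions.

#define X_MEM   /* would mark data living in X data memory */
#define Y_MEM   /* would mark data living in Y data memory */

X_MEM int coeffs[64];      /* filter coefficients, X space */
Y_MEM int samples[64];     /* sample buffer, Y space       */

/* One variant per address space: on the DSP the load opcode differs,
 * so the code cannot be shared unless the compiler clones it. */
static int sum_x(X_MEM const int *p, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++) s += p[i];
    return s;
}

static int sum_y(Y_MEM const int *p, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++) s += p[i];
    return s;
}

int total(void)
{
    /* Identical C source, but two different functions at the opcode level. */
    return sum_x(coeffs, 64) + sum_y(samples, 64);
}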
> As long as the compiler understands the data arrays, scheduling which array goes in which memory seems fairly straightforward (but still a lot of work).

No, IMHO it's not that simple (in the general case). You don't have runtime access pattern information in the compiler, so it can't make optimal decisions during compilation. Any scheduling algorithm you select has its worst case for some kinds of access patterns.
> Assuming such a board exists, assigning arrays of data to memory banks is a scheduling problem.

You might also want to have a runtime scheduler, which would change the bank of the data if it seems to be a bottleneck?