OpenCores
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
Hi.

Aren't discussion boards great? You can publish ideas without
brown-nosing anybody!

It's kind of a radical possibility, but a very old idea. If an FPGA, or
Parallel Array Processor (or any other reconfigurable computer), could
be a better general purpose CPU, the market for it could eventually run
into the billions (even if Intel still made most of them). It certainly
would add a fun twist to my linux machine.

Summary of following discussion:

I believe that we can write hardware compilers for general purpose
algorithms that will run faster when compiled to FPGAs than on an Intel
machine. That could result in some pretty amazing things.

IMO, the main focus of such a machine needs to be massively improved
memory access, rather than the more common focus on parallel
computation. By simultaneously optimizing the logic used in computation
with a customized memory system, improvements vs traditional CPU
architectures become practical.

This would require minor changes to the computer languages we use today
in how they represent and track objects in memory. Also, new hardware
compilers would have to be written. Building the hardware is probably
the easiest part.

Some ideas are outlined below.

Bill

----------------------------------------------

While I'm not an expert in the large field of reconfigurable computing,
I am an expert in configurable logic fabrics. I'm also well versed in
synthesis technology. My main focus has been on place and route
algorithms used to implement logic in FPGAs and ASICs.

That said, I haven't kept up with the many papers in reconfigurable
computing, so these ideas may well be old news. Sorry about that.

Last I heard, Intel processors still usually won at running generic CPU
intensive algorithms (like place and route), even if you compiled your C
code into an advanced ASIC. IMO, that makes the entire field of
reconfigurable computing a neat idea, but with little practical use.

However, I'm convinced that this need not be the case. I've analyzed
several EDA algorithms I've written, and have concluded that almost all
of them could run faster in an FPGA vs on an Intel machine. However,
we'd have to get super-fast memory interfaces on new FPGAs, and some
really fast on-chip memory to implement caches. In the meantime, we can
compile our algorithms to custom ASICs and start winning in hardware
there.

The bottleneck in almost all of the algorithms I examined was not
computation. The real bottleneck was memory access. I've concluded
that parallel memory access is the key to beating the Von Neumann CPU.

In a typical inner loop, I found that at most a few CPU instructions get
executed before we hit another memory access, usually a read
instruction. The instructions can easily be done in parallel, and would
run faster on any FPGA than on an Intel CPU. However, that memory
access presents a real problem. Intel CPUs suck data in from main
memory at unbelievable rates. So, if my ASIC or FPGA based system had
to spend even 1ns per data read or write, it loses.

So, how can we beat a traditional CPU? Simple: read memory in
parallel. I didn't measure it, but in the inner loops I read through,
I saw very high potential for parallel memory access. For example, in
simulated annealing placement algorithms, we compute bounding boxes on
nets. One memory interface could suck in the port references, and
others could use those port references to suck in X and Y coordinates.
There were several other fields on instances, cells, and ports that also
could be accessed in parallel.
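
To make the kind of inner loop I'm talking about concrete, here's a
rough C sketch of a net bounding-box computation (the structure and
field names are made up for illustration, not taken from any real
tool). The comments mark the memory reads; on a conventional CPU they
all funnel through one address bus, but a custom memory system could
issue them from separate memories.

    #include <limits.h>

    /* Hypothetical netlist structures, laid out the usual way. */
    typedef struct Port { int x, y; } Port;
    typedef struct Net  { int numPorts; Port **ports; } Net;
    typedef struct BBox { int xMin, yMin, xMax, yMax; } BBox;

    /* Compute the bounding box of a net.  Almost every statement is a
       memory read: the port reference, then its X, then its Y.  The
       compares and updates are cheap and easy to do in parallel. */
    BBox netBBox(const Net *net) {
        BBox b = { INT_MAX, INT_MAX, INT_MIN, INT_MIN };
        for (int i = 0; i < net->numPorts; i++) {
            const Port *p = net->ports[i];   /* read 1: port reference */
            int x = p->x;                    /* read 2: X coordinate   */
            int y = p->y;                    /* read 3: Y coordinate   */
            if (x < b.xMin) b.xMin = x;
            if (x > b.xMax) b.xMax = x;
            if (y < b.yMin) b.yMin = y;
            if (y > b.yMax) b.yMax = y;
        }
        return b;
    }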

Part of the problem here is that the Von Neumann bottleneck has made it
into our computer languages. It's so ingrained, that we naturally
reject anything that's much different.

To be more specific, let's examine the memory layout of a typical object
in almost any modern language. The data fields are generally sequential
in memory. An object reference is generally just a pointer to the first
field of the object.

If we lay out objects in memory in this way, we naturally have to have
all our fields for an object in the same address space. We can play
tricks like having multiple memory banks, but in the end, we access the
same address port to get each field of our object. That naturally
eliminates parallel memory access.

Instead, we need to represent data fields for each class of object in
separate arrays. For example, let's suppose that one array in memory
contains only arrays of ports on nets. Another contains only port X
locations, and another contains only port Y locations. Each array can
easily be located in different physical memories. An object reference
is simply an index into these arrays, rather than a pointer to memory.

Now, a net is just an index. It can be used to access any property just
by indexing into the corresponding array for that property by the net
index. Going back to the simulated annealing example, once we've got
port data streaming in at high speed through one memory port, we can use
the port indexes to access the other memories containing the X and Y
locations. This simple parallel access gives us the much-needed
multiplier: more than a 2x speed improvement vs a traditional CPU. We
don't get a big 100x win or anything like that, but we win.
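
As an illustration only (these array names are mine, not from any
existing tool or language extension), here's how the same bounding-box
loop might look with index-based references and one flat array per
property, so that each array can live in its own physical memory:

    #include <limits.h>

    /* One array per property; an object reference is just an index.
       Each array could be mapped to a separate physical memory. */
    extern int netFirstPort[];  /* net index  -> first entry in netPorts[]  */
    extern int netNumPorts[];   /* net index  -> number of ports on the net */
    extern int netPorts[];      /* flattened per-net lists of port indexes  */
    extern int portX[];         /* port index -> X location                 */
    extern int portY[];         /* port index -> Y location                 */

    typedef struct BBox { int xMin, yMin, xMax, yMax; } BBox;

    BBox netBBox(int net) {
        BBox b = { INT_MAX, INT_MAX, INT_MIN, INT_MIN };
        int first = netFirstPort[net];
        int count = netNumPorts[net];
        for (int i = 0; i < count; i++) {
            int port = netPorts[first + i];  /* index stream: memory A */
            int x = portX[port];             /* X lookup: memory B     */
            int y = portY[port];             /* Y lookup: memory C     */
            if (x < b.xMin) b.xMin = x;
            if (x > b.xMax) b.xMax = x;
            if (y < b.yMin) b.yMin = y;
            if (y > b.yMax) b.yMax = y;
        }
        return b;
    }

On a regular CPU this transformation changes very little, but a
hardware compiler could stream the port indexes through one memory
interface while the X and Y lookups go to two others, overlapping the
fetches for one port with the index fetch for the next.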

Generally, compiling C code into hardware is not all that hard (lots
of work, but not that hard). There are multiple companies that do it
today. However, those nasty pointers are a problem. I've seen hardware
compilers for C compile all your algorithms just fine, and then generate
mux trees to munge all memory accesses down into a single address bus.
The problem is that the language definition tells you how memory is laid
out. We can fix the language with only minor changes, so that how we
represent objects is not visible to the programmer.

Another problem in more modern languages is garbage collection. It's
nice to have, but it greatly complicates the job of compiling custom
memory hardware for a specific application. However, if we stick to a
slightly modified C, we're in good shape. We can even add advanced
object tracking to the language to make up for not having garbage
collection.

To summarize, to win we'd need a new or modified C compiler, and a
hardware compiler capable of generating not only parallel logic, but
parallel memories. If access to those parallel memories is fast enough,
with good caches and high-speed interfaces, we could easily win vs a
traditional computer. There's a lot of work involved, but no real
technology barriers.

So, why hasn't this been done yet?

Bill



Winning with a reconfigurable computer
by Unknown on Sep 6, 2004


The bottleneck in almost all of the algorithms I examined was not
computation. The real bottleneck was memory access. I've concluded
that parallel memory access is the key to beating the Von Neumann CPU.

Bill, this finding of yours is perfectly true. There are many older
studies/papers which prove this. In any computation, the bottleneck is
usually memory access time. I carried out similar simulation-based
research some time ago on program execution speeds with various CPU and
memory speeds. Traditionally, logic manufacturing on a chip is
optimized for high speed, but memory manufacturing technology is
optimized for area. This difference in manufacturing process/methodology
has created a "gap" between logic speeds and memory speeds, and the gap
keeps increasing in the Von Neumann architecture. So an Intel CPU
usually spends most of its time waiting for memory, and the number of
wasted CPU cycles is increasing.


Part of the problem here is that the Von Neumann bottleneck has made it
into our computer languages. It's so ingrained, that we naturally
reject anything that's much different.

This is probably correct, although I am no software expert. But even
then, FPGAs need highly optimized memory arrays, with much lower access
times than Intel CPU caches provide, to speed up your applications.


To summarize, to win we'd need a new or modified C compiler, and a
hardware compiler capable of generating not only parallel logic, but
parallel memories. If access to those parallel memories is fast enough,
with good caches and high-speed interfaces, we could easily win vs a
traditional computer. There's a lot of work involved, but no real
technology barriers.

So, why hasn't this been done yet?
Bill


I think we have yet to optimize memory interfaces for speed, which
would primarily be accomplished if memory manufacturing technologies
improve.

-Rohit






Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
Nice idea. It looks like FPGA + memories.

You speak about the use of many memory ports. But which is best: two
32-bit ports or one 64-bit port?

Usually the true limitation is the pin count (because of package cost,
routing problems, power consumption).

The latest GPU memory controllers use four 64-bit ports instead of one
big 256-bit port. I believe it's more a problem of fan-out in the
address bus. From inside the chip, it looks like a multi-bank port.

From a market point of view, if you create a C compiler that could use
an FPGA plus common DDR-SDRAM chips, you could have great success!

But is it really so easy to create a C compiler that targets an FPGA
plus "some" memory ports?

nicO

On Monday, 6 September 2004 at 13:00, Bill Cox wrote:
[snip]





Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
On Monday, 6 September 2004 at 13:35, Rohit Mathur wrote:
>
> So, why hasn't this been done yet?
> Bill


I think we are still to optimize memory interfaces for speed, which
primarily would be accomplished if memory manufacturing technologies
improve.

-Rohit


You remember the big Slot 1 cartridge of the first Pentium II? It was
used to provide 512 KB of cache less expensively than the on-die cache
of the Pentium Pro.

If Intel/AMD want more speed, they could always do the same. But
instead of using SRAM, they could use the latest memory interfaces,
like DDR2 or DDR3, as GPUs do. Video cards can use the fastest memory
because there is no connector in the routing path.

So you could imagine an external 128 MB L3 cache running at 1 GHz,
which is 4 times the current data rate of RAM. Then the main memory
could concentrate on high bandwidth rather than low latency.

nicO



Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
Bill and OpenCores forum readers :-)

Great discussion point, I quite enjoyed the read. I am currently an
Electrical and Computer Systems Engineering student at Monash
University in Melbourne, Australia. As part of a core course unit we
all study Computer Architecture, particularly the MIPS architecture,
which seems to be a fairly efficient system. As an extension to the
architecture, we have also been versed in parallel processor systems
and memory architectures.

I think what you propose could make a great final year thesis project
for a student at a university such as Monash to get your idea rolling!
What better way to start off a good open source project than to get a
student involved who may then follow it for the rest of his/her
career. We also have VHDL / FPGA based subjects, so I'm sure a few
students might look into it. You just never know.

Perhaps OpenCores can contact a few universities around the world and
sponsor / take care of a project for a set of students to look further
into the ideas you posed. This way everyone at OpenCores can follow
along and help guide the project with the experience of all the forum
readers. What do you think Bill?

- Justin Young

-----Original Message-----
From: Bill Cox [mailto:bill at viasic.com]
Sent: Monday, 6 September 2004 9:00 PM
To: Discussion list about free open source IP cores
Subject: [oc] Winning with a reconfigurable computer

[snip]
Winning with a reconfigurable computer
by bporcella on Sep 6, 2004
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
On Mon, 6 Sep 2004, Bill Cox wrote:

Hi.


parallel memories. If access to those parallel memories is fast enough,
with good caches and high-speed interfaces, we could easily win vs a
traditional computer. There's a lot of work involved, but no real
technology barriers.

So, why hasn't this been done yet?


I dimly remember IBM making a fuss about something they called 'memory in
processor' systems around 1998, and even more dimly remember they were
similar to what you're talking about. But googling isn't turning up
anything. Does anyone else have a better memory than me?

Graham






Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
* Graham Seaman <graham at seul.org> [2004-09-06T12:19:09-0400]:
I dimly remember IBM making a fuss about something they called 'memory in
processor' systems around 1998, and even more dimly remember they were
similar to what you're talking about. But googling isn't turning up
anything. Does anyone else have a better memory than me?
How about these?

Intelligent RAM (IRAM): http://iram.cs.berkeley.edu/
Processor-in-Memory (PIM): http://www.cse.nd.edu/~pim/

-dave
Winning with a reconfigurable computer
by markus on Sep 6, 2004
I'm almost forced to answer this... :-)

Although there were many interesting ideas, I'll give my reply to only
one issue: programming languages and memory.

It's true that programming languages were developed with a central
memory in the developer's mind. Programming languages assume that you
have uniform access to all data objects in the system. Whenever you
don't have this (for example, you're working with code distributed over
a network or a memory bridge), you have to write some complex code to
receive/transmit information between the two points.

This assumption makes it hard to divide one program across multiple
processors without having a shared memory. And when you do have shared
memory, it restricts the speed achievable by the system overall.

But there's a good reason to use one single memory block in
general-purpose CPUs: if you divide it to get parallel accesses (like
DSPs do), you at the same time hurt dynamic adaptation to the workload;
one of your memories might be full while there is plenty of room in the
others. Thus you need to implement some kind of "gateway" to store
larger arrays across multiple memories --> you're back to shared memory.

In general, programming languages don't limit the underlying
architectures (compilers can do a nice job of overcoming this), but of
course a programming language causes the programmer to think about
things in the way they're represented in the language.

To be more specific, let's examine the memory layout of a typical
object in almost any modern language. The data fields are generally
sequential in memory...
...That naturally eliminates parallel memory access.


For example, if my memory serves me well, the C standard doesn't
explicitly say in what way objects are stored. First, the compiler can
use whatever alignment it decides on, and second, the tables can be
split or interleaved. This should not break any code, except code that
uses fancy pointer arithmetic.

The reason for this is that regular CPUs have more efficient
instructions for handling objects stored sequentially.

Instead, we need to represent data fields for each class of object
in separate arrays.


Many databases do this, i.e. they load fields into separate arrays.


Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
On Monday, 6 September 2004 at 18:42, David I. Lehn wrote:


PIM sounds like processor + memory on the same die. Quite common
nowadays.

This reminds me of an article about IBM's new supercomputer. They said
that each node uses 128 MByte "on chip" (or 128 Mbit?) and then can use
the "external memory" the way swap is usually used.

So throughput is the true goal of main memory, while latency is the
goal of on-chip memory.

So the memory could be inside the chip, with external storage when
it's not enough, like disk swap.

I read an article a few months ago about a company that designs a CPU
with a kind of FPGA as a computing unit. They claim extraordinary
performance for a CPU on tasks such as DES computation.

Another article speaks about a supercomputer-on-a-chip. This is
basically a network of a few dozen SIMD-like processors, with hundreds
of 64-bit multipliers.

Maybe a kind of mix of all of this could be done :)

nicO



Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
On Mon, 2004-09-06 at 18:00, Bill Cox wrote:
Hi.

[SNIP]

The bottleneck in almost all of the algorithms I examined was not
computation. The real bottleneck was memory access. I've concluded
that parallel memory access is the key to beating the Von Neumann CPU.


That's not quite correct (your conclusion, that is). I think you
should clarify what you mean by memory access.

Obviously we can get plenty of linear bandwidth out of today's DDR
chips by building really wide buses. E.g. using a 256-bit bus with
400 MHz DDRs probably gives more linear memory bandwidth than an Intel
P4 can handle today, not counting GPUs.

The problem that many have tried to solve, more or less
unsuccessfully, is random access latency. That's why all modern
high-performance CPUs use caches. They allow zero-penalty random
access to a small amount of program and data space.

If you were to replace your main system memory with high-speed SRAMs,
you would not need caches anymore, and your system bottleneck would be
back in the CPU.


[SNIP]

So, why hasn't this been done yet?


Well, two obvious reasons come to mind:

1) Cost
2) I guess nobody has come up with an alternative economical
memory design. We need memory that can be randomly accessed
just like a plain old SRAM, BUT, it must be small (1T
preferred) like DRAM and we must be able to manufacture
it in high volumes at a high yield.

So the real issue here is memory architecture, not integration
or distribution, etc.

Bill

Regards,
rudi

=============================================================
Rudolf Usselmann, ASICS World Services, http://www.asics.ws
Your Partner for IP Cores, Design, Verification and Synthesis
Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
On Monday, 6 September 2004 at 18:42, David I. Lehn wrote:
Intelligent RAM (IRAM): http://iram.cs.berkeley.edu/


IRAM is quite different; it's a mix of CPU and memory (a lot of
memory), but if they run out of memory space, they don't try to add an
external DRAM chip but another IRAM chip...

So the real bottleneck becomes the network between the chips.

Whatever the task, this solution provides the most memory bandwidth.

Imagine a computer where you add processors the way you add DRAM
today. :)

On Monday, 6 September 2004 at 19:16, Rudolf Usselmann wrote:
2) I guess nobody has come up with an alternative economical
memory design. We need memory that can be randomly accessed
just like a plain old SRAM, BUT, it must be small (1T
preferred) like DRAM and we must be able to manufacture
it in high volumes at a high yield.


This kind of memory exists. It is the "FRAM", and all the technology
called "spintronics", which uses the spin of electrons instead of the
presence of electrons. Current FRAMs are quite small and have density
problems (because of the control logic, not because of the cell). They
have around 15 ps of random access time.

This technology could be added to current CMOS technology. FRAM uses a
1T cell and is non-volatile memory. Intel has signed a contract with a
French laboratory for the caches of their future chips. This will
happen in 5 to 10 years.

So then it will be the time of a 4 GByte chip with 4 CPU cores in it :)

nicO



Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
On Mon, 2004-09-06 at 08:08, Nicolas Boulay wrote:
Nice idea. It looks like FPGA + memories.

You speak about the use of many memory ports. But which is best: two
32-bit ports or one 64-bit port?

Usually the true limitation is the pin count (because of package cost,
routing problems, power consumption).


Good points. The really big FPGAs come with several hundred pins. Are
the general-purpose I/Os capable of high-speed memory access? If not,
that's a real problem.

For starters, I'd probably dump virtual memory access, and just directly
map property arrays onto multiple external memory buses. Hopefully I
could get 3 to 5 memory interfaces before running into power problems.

The latest GPU memory controllers use four 64-bit ports instead of one
big 256-bit port. I believe it's more a problem of fan-out in the
address bus. From inside the chip, it looks like a multi-bank port.


There are lots of great tricks for speeding up a memory. However, it
still takes a single address port, or two in the case of a Harvard
architecture, but still only one for data memory. That's a real problem
for parallel access.

From a market point of view, if you create a C compiler that could use
an FPGA plus common DDR-SDRAM chips, you could have great success!

But is it really so easy to create a C compiler that targets an FPGA
plus "some" memory ports?

nicO


Well, it's a really big task. I'd hate to have to do it as an
open-source project. However, there's nothing in the list of tasks that
isn't doable. It's mostly common technology.

As long as the compiler understands the data arrays, scheduling which
array goes in which memory seems fairly straight-forward (but still a
lot of work).

I'm assuming that someone would build a board with a honking big FPGA on
it, with several large independent external memory banks attached to
it. There might even be second-level external cache for each memory
bus. Internal memory would mostly be used for first-level cache.

Assuming such a board exists, assigning arrays of data to memory banks
is a scheduling problem.
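
As a very rough sketch of what I mean (all names hypothetical, and
assuming the compiler can estimate an access count per property array
from the inner loops), a first cut could just greedily balance
estimated traffic across the banks:

    #include <stddef.h>

    /* Hypothetical compile-time description of one property array. */
    typedef struct PropArray {
        const char *name;
        size_t      bytes;        /* storage needed                */
        long        estAccesses;  /* estimated inner-loop accesses */
        int         bank;         /* filled in by the scheduler    */
    } PropArray;

    /* Greedy first cut: put each array in the least-loaded bank that
       still has room.  A real compiler would also try to keep arrays
       that are read in the same loop in different banks. */
    void assignBanks(PropArray *arrays, int numArrays,
                     size_t *bankFree, long *bankLoad, int numBanks) {
        for (int i = 0; i < numArrays; i++) {
            int best = -1;
            for (int b = 0; b < numBanks; b++) {
                if (bankFree[b] < arrays[i].bytes)
                    continue;                    /* doesn't fit here */
                if (best < 0 || bankLoad[b] < bankLoad[best])
                    best = b;
            }
            arrays[i].bank = best;               /* -1: fit nowhere */
            if (best >= 0) {
                bankFree[best] -= arrays[i].bytes;
                bankLoad[best] += arrays[i].estAccesses;
            }
        }
    }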

I assume the big FPGA would have at least one embedded CPU that could be
used for running the vast majority of a program's code. Some nifty
piece of the compiler would schedule critical inner loops to be
implemented in hardware (probably with user input). The time-consuming
inner loops would hopefully run faster than on a general-purpose CPU,
since the data accesses they do could be done in parallel.

Of course, if the best memory access times we can achieve on an FPGA
are 10x slower than the best Intel processor buses and caches, then the
whole idea won't work. For this to work out, the FPGA would have to
have very good memory access times.

Bill




Winning with a reconfigurable computer
by Unknown on Sep 6, 2004
On Mon, 2004-09-06 at 11:43, bporcella wrote:
At 04:00 AM 9/6/2004, you wrote:

Some of these issues have been touched on --- but I'll throw in my 2
cents worth.....
[snip]


Always worth more than that :-)


> However, I'm convinced that this need not be the case. I've
> analyzed
> several EDA algorithms I've written, and have concluded that almost
> all
> of them could run faster in an FPGA vs on an Intel machine.
> However,
> we'd have to get super-fast memory interfaces on new FPGAs, and some
> really fast on-chip memory to implement caches. In the meantime, we
> can
> compile our algorithms to custom ASICs and start winning in hardware
> there.


As you mention "new FPGAs", you clearly understand part of the problem
-- one cannot use generic I/Os and the distributed general-purpose
memory of FPGAs to implement a high-speed cache -- the cache and its
interface must be carefully laid out. There are also SSO (Simultaneous
Switching Output) issues related to all this.


Yep. You get it. These aren't insurmountable problems, but unless
Xilinx or Altera throw a lot of $ at the problem and come out with an
FPGA that can do the job in the latest process, it probably won't happen.

I wonder if the lure of the general purpose CPU market is interesting
enough to entice them.

VLIW computers have not been particularly successful -- sure, you are
talking about multiple streams of data -- but the physics problems are
the same.


I'm not a fan of VLIW. It sounds good... just use a bunch of functional
units in parallel. However, these guys tend to forget that it's more of
a data routing problem than a computation problem. To a router guy like
me, VLIW sounds like a great way to burn power.

I suspect Intel is pretty close to the bleeding edge of I/O throughput
per package already, and they have all the big bucks to spend on
improving the state of the art. Even if you catch up with them using an
alternative architecture, they continue doubling their I/O throughput
every 2 years.......


The only way this could work out is if a big guy like Xilinx put a real
effort into it. Otherwise, you'd get left in Intel's dust as they
shrink to 65nm and below. Isn't Xilinx also on the bleeding edge of I/O
throughput? Is it possible to make good memory buses with what they
have? I doubt I could run the Xilinx memories at 1Gig or more. That
doesn't mean it couldn't be done...

> The bottleneck in almost all of the algorithms I examined was not
> computation. The real bottleneck was memory access. I've concluded
> that parallel memory access is the key to beating the Von Neumann
> CPU.


[big snip]

Lots of people have come to this conclusion. One would think the
solution would be shared distributed memory machines using standard
microprocessors --- just need a good inter-processor communication
system (sounds simple, doesn't it? :-).


> So, why hasn't this been done yet?


Because it's hard.

It'd sure be fun, though :-)

bj Porcella http://pages.sbcglobal.net/bporcella/



Winning with a reconfigurable computer
by markus on Sep 6, 2004
Bill Cox:
There are lots of great tricks for speeding up a memory. However,
it still takes a single address port, or two in the case of a Harvard
architecture, but still only one for data memory.


Not always. In my guitar effect box project I use the Motorola
DSP563xx series, and there are two separate data memory "spaces" (X and
Y). Basically you can have three memory accesses (code + 2 x data) in
one cycle. And that's not to mention other, more modern DSPs.

It could be useful for you to check what kind of memory access
instructions they have in that DSP.

> NicO: But is it really so easy to create a C compiler that targets an
> FPGA plus "some" memory ports?


Well, it's a really big task. I'd hate to have to do it as an
open-source project. However, there's nothing in the list of tasks
that isn't doable. It's mostly common technology.


I have an HLL compiler "wreck" for the DSP563xx, although it's not a C
compiler... but a conventional compiler anyway. It was a huge job
getting the compiler to reserve memory blocks "wisely" from three
alternatives (P, X and Y memories) and to make decisions about which
data is located in internal memories and which goes outside.

It's not that simple; for example, if you pass a pointer to a function,
you need different opcodes to handle different memory spaces (on the
DSP563xx) --> you need separate functions for accessing different
memory address spaces.

As long as the compiler understands the data arrays, scheduling
which array goes in which memory seems fairly straight-forward
(but still a lot of work).


No, IMHO it's not that simple (in the general case). You don't have
runtime access-pattern information in the compiler, so it can't make
optimal decisions during compilation. Any scheduling algorithm you
select has its worst case for some kinds of access patterns.

Assuming such a board exists, assigning arrays of data to memory
banks is a scheduling problem.


You might also want to have a runtime scheduler, which would change
the bank of the data, if it seems to be a bottleneck?

