OpenCores

OpenCores

Maximum piplelined, parallalel and parametrized multiplier

no use

no use

1/1

no use

no use

Maximum piplelined, parallalel and parametrized multiplier by deltax on Apr 12, 2016			deltax Posts: 11 Joined: Mar 18, 2016 Last seen: Dec 29, 2016
Hi all, just for learn (see my other topic) I'm developing a simple multiplier, to refresh my VHDL coding skills, and then maybe partecipate to the other more ambitious and challenging projects on OpenCores. I know this is a open source Platform, by now I'm using Vivado but can easply ported to a open source Platform. However, my general idea may be contain some logical or implementation errors that make the project wrong; tell me if it is. The general idea is, starting from 2number with N and M bit respectively, do the multiplication in the shortest time possible. So I have implemented a solution that can be maximise the parallelization. The 2 principal stages are the shifters and the adders. In order to make fast the adders, I choose the inner signal to be M+N bit long, to avoid the carry in the adder and performs the sum in parallel. The parallel sum are performed in parallel, like the attached image. I have implemented the solution until this point, I'm testing it because I have some problem with the synthesis in Vivado. The second step is to pipeline the circuite. A pipeline can be put at the end of the sifters, and between each adder stage. My main doubt is the solution M+N bit is too expensive in terms of area and signal integrity (bus too wide) and overwhelm the benefit derivated to the maximum parallelization. Let me know what do you think! Criticism, possible improvment etc.[q]<a href="uploads/1460446656_xitIg.jpg">xitIg.jpg</a> (49 kb)[/q] Hi all, just for learn (see my other topic) I'm developing a simple multiplier, to refresh my VHDL coding skills, and then maybe partecipate to the other more ambitious and challenging projects on OpenCores. I know this is a open source Platform, by now I'm using Vivado but can easply ported to a open source Platform. However, my general idea may be contain some logical or implementation errors that make the project wrong; tell me if it is. The general idea is, starting from 2number with N and M bit respectively, do the multiplication in the shortest time possible. So I have implemented a solution that can be maximise the parallelization. The 2 principal stages are the shifters and the adders. In order to make fast the adders, I choose the inner signal to be M+N bit long, to avoid the carry in the adder and performs the sum in parallel. The parallel sum are performed in parallel, like the attached image. I have implemented the solution until this point, I'm testing it because I have some problem with the synthesis in Vivado. The second step is to pipeline the circuite. A pipeline can be put at the end of the sifters, and between each adder stage. My main doubt is the solution M+N bit is too expensive in terms of area and signal integrity (bus too wide) and overwhelm the benefit derivated to the maximum parallelization. Let me know what do you think! Criticism, possible improvment etc. xitIg.jpg (49 kb)

RE: Maximum piplelined, parallalel and parametrized multiplier by hellwig on Apr 12, 2016			hellwig Posts: 32 Joined: Dec 30, 2007 Last seen: Nov 3, 2024
There is a bit more theory behind really fast multiplication. See for example Hennessy and Patterson, either under "carry save adder" or "Wallace tree". Hellwig There is a bit more theory behind really fast multiplication. See for example Hennessy and Patterson, either under "carry save adder" or "Wallace tree". Hellwig

RE: Maximum piplelined, parallalel and parametrized multiplier by deltax on Apr 12, 2016			deltax Posts: 11 Joined: Mar 18, 2016 Last seen: Dec 29, 2016
yes I know, I read some paper about the various existing techniques. However, if I'm not wrong, all of them involving the fastet way to manage the carry problem. In my implementation - that have for sure problem, the first I think the wast of resources - there isn't carry problem due to the wide of the signal. Indeed, if you took 2 number with N and M bit, the highest number that you can represent is M+N bits long. If I define ALL the internal signal M+N wide, I have no carry for these adders. Probabily the final result will be slower than the wallace three or the other solution that involve the carry managment with a fixed number of bit, I did it only to learn. yes I know, I read some paper about the various existing techniques. However, if I'm not wrong, all of them involving the fastet way to manage the carry problem. In my implementation - that have for sure problem, the first I think the wast of resources - there isn't carry problem due to the wide of the signal. Indeed, if you took 2 number with N and M bit, the highest number that you can represent is M+N bits long. If I define ALL the internal signal M+N wide, I have no carry for these adders. Probabily the final result will be slower than the wallace three or the other solution that involve the carry managment with a fixed number of bit, I did it only to learn.

RE: Maximum piplelined, parallalel and parametrized multiplier by hellwig on Apr 12, 2016			hellwig Posts: 32 Joined: Dec 30, 2007 Last seen: Nov 3, 2024
I do not understand this, sorry. You can of course represent the "internal signals" (which are the partial products, I assume) with N+M bits, and there are no adders (and no carries) involved. But in the end you have to add these partial products, and there you will have the carry propagation problem again. Also, I don't understand your circuit diagram: each adder will have a carry output. What do you do with these carries? Perhaps you can draw a complete 4x4 multiplier, so that your solution for a fast multiplier is easier to grasp. Hellwig I do not understand this, sorry. You can of course represent the "internal signals" (which are the partial products, I assume) with N+M bits, and there are no adders (and no carries) involved. But in the end you have to add these partial products, and there you will have the carry propagation problem again. Also, I don't understand your circuit diagram: each adder will have a carry output. What do you do with these carries? Perhaps you can draw a complete 4x4 multiplier, so that your solution for a fast multiplier is easier to grasp. Hellwig

RE: Maximum piplelined, parallalel and parametrized multiplier by UlrichAlthoefer on Apr 12, 2016			UlrichAlthoefer Posts: 1 Joined: Oct 20, 2011 Last seen: Apr 15, 2016
[q]Hi all, just for learn (see my other topic) I'm developing a simple multiplier, to refresh my VHDL coding skills, and then maybe partecipate to the other more ambitious and challenging projects on OpenCores. I know this is a open source Platform, by now I'm using Vivado but can easply ported to a open source Platform. However, my general idea may be contain some logical or implementation errors that make the project wrong; tell me if it is. The general idea is, starting from 2number with N and M bit respectively, do the multiplication in the shortest time possible. So I have implemented a solution that can be maximise the parallelization. The 2 principal stages are the shifters and the adders. In order to make fast the adders, I choose the inner signal to be M+N bit long, to avoid the carry in the adder and performs the sum in parallel. The parallel sum are performed in parallel, like the attached image. I have implemented the solution until this point, I'm testing it because I have some problem with the synthesis in Vivado. The second step is to pipeline the circuite. A pipeline can be put at the end of the sifters, and between each adder stage. My main doubt is the solution M+N bit is too expensive in terms of area and signal integrity (bus too wide) and overwhelm the benefit derivated to the maximum parallelization. Let me know what do you think! Criticism, possible improvment etc.[q]<a href="uploads/1460446656_xitIg.jpg">xitIg.jpg</a> (49 kb)[/q][/q] Hello, do you know the following link: https://threesixty360.wordpress.com/25-ways-to-multiply/ There is an 'Arithmetic Module Generator' (AMG) for generating Verilog or VHDL. Greetings from Germany Ulrich Hi all, just for learn (see my other topic) I'm developing a simple multiplier, to refresh my VHDL coding skills, and then maybe partecipate to the other more ambitious and challenging projects on OpenCores. I know this is a open source Platform, by now I'm using Vivado but can easply ported to a open source Platform. However, my general idea may be contain some logical or implementation errors that make the project wrong; tell me if it is. The general idea is, starting from 2number with N and M bit respectively, do the multiplication in the shortest time possible. So I have implemented a solution that can be maximise the parallelization. The 2 principal stages are the shifters and the adders. In order to make fast the adders, I choose the inner signal to be M+N bit long, to avoid the carry in the adder and performs the sum in parallel. The parallel sum are performed in parallel, like the attached image. I have implemented the solution until this point, I'm testing it because I have some problem with the synthesis in Vivado. The second step is to pipeline the circuite. A pipeline can be put at the end of the sifters, and between each adder stage. My main doubt is the solution M+N bit is too expensive in terms of area and signal integrity (bus too wide) and overwhelm the benefit derivated to the maximum parallelization. Let me know what do you think! Criticism, possible improvment etc. xitIg.jpg (49 kb) Hello, do you know the following link: https://threesixty360.wordpress.com/25-ways-to-multiply/ There is an 'Arithmetic Module Generator' (AMG) for generating Verilog or VHDL. Greetings from Germany Ulrich

RE: Maximum piplelined, parallalel and parametrized multiplier by dgisselq on Apr 12, 2016			dgisselq Posts: 247 Joined: Feb 20, 2015 Last seen: Oct 24, 2024
Daniele, <P>I was hoping you'd start this topic! <P>I recently needed to build a multiplier in Verilog as part of my <A HREF="/project,dblclockfft">FFT project</A>. The project uses hardware multiplies as long as you let it do so, but when your hardware multiplies run out it tries to build a multiplier in logic. If you want to look at this multiplier, use the program to build an FFT and then look for the bimpy.v and longbimpy.v files. (Bimpy = Binary Multiply, or equivalently two-bit Binary Multiply). <P>When I started, I was quite encouraged by the Wikipedia article on the topic. Looking back now, I can find several Wikipedia articles on the topic. (Maybe I got lucky when I searched the first time.) At any rate, the article I found most useful is <A HREF="https://en.wikipedia.org/wiki/Binary_multiplier">here</A>. Other articles you might be interested in are <A HREF="https://en.wikipedia.org/wiki/Binary_number#Multiplication">here</A>, <A HREF="https://en.wikipedia.org/wiki/Multiplication_algorithm">here</A>, and <A HREF="https://en.wikipedia.org/wiki/Multiplication">here</A>. What the article's didn't discuss were multiplying numbers of different lengths together, and what to do when those numbers were signed. Some algorithms work for both signed and unsigned numbers--but only when the numbers are the same length. (I made plenty of mistakes trying to apply algorithms to multiplicands of varying lengths, only to be surprised when they didn't work ...) <P>I then came across an article from Xilinx on the web, although I don't seem to have written down the reference. The article attempted to stuff as much logic into a Xilinx logic element as possible. Given that Xilinx logic elements accept 6 inputs per output, or five inputs if you can produce two outputs with the same bits, Xilinx suggested a different approach. They suggested building a two-bit multiplier out of LUT's: Two bits in from source 1, two bits from source 2, plus a carry from somewhere else, and produce the result with a minimum number of LUTs. It was a neat idea, but frankly it didn't run any faster or with any fewer elements. <P>One thing to keep in mind is that, within modern FPGA's, carry's are fast. They are optimized for speed. (Use them!) <P>I considered building a <A HREF="https://en.wikipedia.org/wiki/Booth's_multiplication_algorithm">Booth multiplier</A>, but what I needed required exact timing. The explanation's of Booth's algorithm didn't seem to lend themselves to a pipeline of a fixed length. <P>So back to long multiplies: What worked well for me was to build a "tableau" (or a set of partial product numbers to add), and then to add them in an optimized fashion. I would add adjacent pairs of numbers together. So a tableau of length N would become N/2 in one clock, N/4 in the next, and so forth. Further, by doing this, I managed to exploit the high speed carrys present in the FPGA's. While it's not the same as a Wallace tree, it's close and (in my humble opinion) it was simpler and easier to implement. <P>This has been my minimal clock/area approach, although I would love to hear from others if anyone has done anything faster. (Especially since the cost of these multiplies remains fairly high!) <P>Yours, <P>Dan Daniele, I was hoping you'd start this topic! I recently needed to build a multiplier in Verilog as part of my FFT project. The project uses hardware multiplies as long as you let it do so, but when your hardware multiplies run out it tries to build a multiplier in logic. If you want to look at this multiplier, use the program to build an FFT and then look for the bimpy.v and longbimpy.v files. (Bimpy = Binary Multiply, or equivalently two-bit Binary Multiply). When I started, I was quite encouraged by the Wikipedia article on the topic. Looking back now, I can find several Wikipedia articles on the topic. (Maybe I got lucky when I searched the first time.) At any rate, the article I found most useful is here. Other articles you might be interested in are here, here, and here. What the article's didn't discuss were multiplying numbers of different lengths together, and what to do when those numbers were signed. Some algorithms work for both signed and unsigned numbers--but only when the numbers are the same length. (I made plenty of mistakes trying to apply algorithms to multiplicands of varying lengths, only to be surprised when they didn't work ...) I then came across an article from Xilinx on the web, although I don't seem to have written down the reference. The article attempted to stuff as much logic into a Xilinx logic element as possible. Given that Xilinx logic elements accept 6 inputs per output, or five inputs if you can produce two outputs with the same bits, Xilinx suggested a different approach. They suggested building a two-bit multiplier out of LUT's: Two bits in from source 1, two bits from source 2, plus a carry from somewhere else, and produce the result with a minimum number of LUTs. It was a neat idea, but frankly it didn't run any faster or with any fewer elements. One thing to keep in mind is that, within modern FPGA's, carry's are fast. They are optimized for speed. (Use them!) I considered building a Booth multiplier, but what I needed required exact timing. The explanation's of Booth's algorithm didn't seem to lend themselves to a pipeline of a fixed length. So back to long multiplies: What worked well for me was to build a "tableau" (or a set of partial product numbers to add), and then to add them in an optimized fashion. I would add adjacent pairs of numbers together. So a tableau of length N would become N/2 in one clock, N/4 in the next, and so forth. Further, by doing this, I managed to exploit the high speed carrys present in the FPGA's. While it's not the same as a Wallace tree, it's close and (in my humble opinion) it was simpler and easier to implement. This has been my minimal clock/area approach, although I would love to hear from others if anyone has done anything faster. (Especially since the cost of these multiplies remains fairly high!) Yours, Dan

RE: Maximum piplelined, parallalel and parametrized multiplier by deltax on Apr 13, 2016			deltax Posts: 11 Joined: Mar 18, 2016 Last seen: Dec 29, 2016
So, as I thought, my implementation seems completely useless. Well, at least I have learnt the lesson and exercised a bit with the VHDL syntax, package etc. I will try to implement Booth's algorithm. Dan, what do you exactly mean with " The explanation's of Booth's algorithm didn't seem to lend themselves to a pipeline of a fixed length", regardin the implementation of Booth' algorithm? So, as I thought, my implementation seems completely useless. Well, at least I have learnt the lesson and exercised a bit with the VHDL syntax, package etc. I will try to implement Booth's algorithm. Dan, what do you exactly mean with " The explanation's of Booth's algorithm didn't seem to lend themselves to a pipeline of a fixed length", regardin the implementation of Booth' algorithm?

RE: Maximum piplelined, parallalel and parametrized multiplier by dgisselq on Apr 13, 2016			dgisselq Posts: 247 Joined: Feb 20, 2015 Last seen: Oct 24, 2024
Daniele, <P>Please take this as a challenge to prove me wrong, but ... <P>As I read the Wikipedia article, it sounded like the strength of Booth's algorithm is that for some (sparse) multiplicands, you could skip bits. This wouldn't work well in a pipeline, since the pipeline would then be of an uncertain length--dependent upon which bits were 'on' in the multiplicand. <P>If, instead, you didn't skip bits--then you are back to processing the multiply with one clock per bit with little gain. <P>When compared with my partial product sum implementation, I was processing the multiply in log_2(N)+2 clocks (IIRC), rather than N, so I felt I was doing better. <P>Feel free to tell me I'm wrong ... <P>Yours, <P>Dan Daniele, Please take this as a challenge to prove me wrong, but ... As I read the Wikipedia article, it sounded like the strength of Booth's algorithm is that for some (sparse) multiplicands, you could skip bits. This wouldn't work well in a pipeline, since the pipeline would then be of an uncertain length--dependent upon which bits were 'on' in the multiplicand. If, instead, you didn't skip bits--then you are back to processing the multiply with one clock per bit with little gain. When compared with my partial product sum implementation, I was processing the multiply in log_2(N)+2 clocks (IIRC), rather than N, so I felt I was doing better. Feel free to tell me I'm wrong ... Yours, Dan

no use

no use

1/1

no use

no use

© copyright 1999-2025 OpenCores.org, equivalent to Oliscience, all rights reserved. OpenCores®, registered trademark.