I inspected the generated code and was surprised to find that it is simply a large tree of small adders.
Have you considered the following:
Having the first stage add more than two bits? Most FPGAs have four- or six-input LUTs, so adding only two bits per LUT wastes a lot of space in a wide counter.
Using Wallace trees? They should improve the performance.
There are some performance figures for a Wallace Tree population counter given in this thread: http://opencores.org/forum,Cores,0,3737
(e.g. 512-bit input: 585 LUTs, 5.271 ns clock period in the slowest speed grade part.)
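To make the two suggestions concrete, here is a rough behavioural model in Python. It is only a sketch, not HDL and not the code from the thread linked above. The function names (`lut_first_stage`, `full_adder`, `wallace_popcount`) and the group size of 6 are my own assumptions for illustration, and it says nothing about actual LUT usage or timing.

```python
def lut_first_stage(bits, group=6):
    """Count ones in chunks of `group` bits (assumed: one chunk per LUT-sized input set)."""
    return [sum(bits[i:i + group]) for i in range(0, len(bits), group)]


def full_adder(a, b, c):
    """3:2 compressor: three bits of equal weight in, (sum, carry) out."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def wallace_popcount(bits):
    """Population count using layers of 3:2 compressors, Wallace-tree style."""
    columns = [list(bits)]  # columns[w] holds the bits of weight 2**w
    while any(len(col) > 2 for col in columns):
        nxt = [[] for _ in range(len(columns) + 1)]
        for w, col in enumerate(columns):
            i = 0
            while len(col) - i >= 3:
                s, c = full_adder(col[i], col[i + 1], col[i + 2])
                nxt[w].append(s)      # sum bit keeps the same weight
                nxt[w + 1].append(c)  # carry bit moves up one weight
                i += 3
            nxt[w].extend(col[i:])    # one or two leftover bits pass through
        if not nxt[-1]:
            nxt.pop()
        columns = nxt
    # At most two bits per column remain; this sum stands in for the
    # final carry-propagate adder.
    return sum(bit << w for w, col in enumerate(columns) for bit in col)
```

Both functions can be checked against a plain `sum(bits)`: the first-stage group counts add up to the total, and `wallace_popcount(bits)` returns the same value without any carry propagation until the last step.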
Regards, Allan
Hi Allan,
Thanks for your comments.
I will have a look and try to improve the code.
Nikos