 ## Challenge #3 – Counting the Number of “1”s

November 13, 2008

Time for a new challenge! The last two had some great responses and solutions. If you read through the comments you’d see there were some disagreements on what is the best approach. Some claimed a hand crafted approach is the best, while others said it was more of a theoretical problem and we should use a synthesis tool to solve it.
Both have pros and cons, although for those specific challenges I personally tend to go with the hand crafted approach – you, of course, don’t have to agree with me.

For this time we got a very practical problem that pops up again and again: counting the number of “1”s in a vector.
Use the metrics given in challenge #1 and find the minimal delay circuit for a combo cloud that counts the number of “1”s in an 8-bit vector. You get 8 bits in and supply 4 output bits which give a binary representation of the amount of “1”s in the 8-bit vector.

Oh and don’t forget to mention how your method scales when counting 16-bit and 32-bit vectors.

1. Hi Nir,
Can you please tell me in detail about your statement is your puzzle “For this time we got a very practical problem that pops up again and again: counting the number of “1″s in a vector…”

Can you please tell me what are the practical situations / scenarios where you need this?
If you can share your problem from where you have created this puzzle …

2. I just used this circuit today for a CDR system. In CDRs you usually get early and late signals from the analog samplers. What is normally done is counting the amount of “ups” and deducting it from the amount of “downs” this in turn is fed into an integrator.
This is just one example but I came across this circuit many times in the past.

3. Seems that the bitonic sorting algorithm also can be used to this one. For 8 bit I come on 20 delay units:

– Take 12 for the sort on 8 bit (call this B)
– add one layer of gates + an inverter to make induvidual bits set in (B)
(this is for example C = B & ~B, C = B & ~B, …), this can be done since every higher bit set implies all lower bits set.
– havint this, we just need a 8 to 4 decoder, where the LSB has the most terms (C | C | C | C) –> 2 levels of logic

so we have worst case : 12 + (1 + 2) + (2 * 2) + 1 (the last +1 for comensating the odd nr of levels!)
= 20.

It scales to the scaling for the sort + 3 (fixed) + log2(Bits/2) (or term for LSB) + 1 if logic levels are odd.

Stefaan

4. I managed with 19 delay units but I feel it can be done faster.
I also used a sorting approach but not for all the bits. The LSB can be calculated in parallel by building a sort of XOR tree with NANDs and NORs.
The MSB is just an AND tree.
my critical path is for the second from right bit (LSB+1) I feel this can be done faster, but as I said this is just a feeling…
N.

5. ok, I believe my feeling was correct – it can be done with less delay…

6. no one there?
it can be done with 16 delay units.

7. 16 looks very tight to beat. Give me some hint?

Stefaan

8. Stefaan,

look at each digit of the output separately! that is the hint…
e.g. the lsb can be made with an XOR tree built with 2 stage NANDs plus an inverter here or there.
The tricky one is the digit second from right (in the 4 bit result)

Nir

9. Ohh!!!
This is some difficult to solve this query. Please give some hint.

10. Nir Dahan,

Can you publish the solution. I’m awaiting for the answer

11. Hi,
for large scale problems, like 32 bit and 64 bit common solution is to use bit compression.
for example, take a look at this 36 bit solution given by altera. ( I pasted it here from my Stratix Cookbook!, the author of this code is baeckler ):
it uses nine 6:3 compressor ( which takes 27 6-luts ) six_three_comp is a six input three output compressor that maps to a 6-luts. it’s a lut!

module thirtysix_six_comp (data,sum);

input [35:0] data;
output [5:0] sum;
wire [5:0] sum;

wire [5:0] word_l;
wire [5:0] word_m;
wire [5:0] word_h;

wire [2:0] sa,sb,sc,sd,se,sf;
wire [2:0] slo,sme,shi;

six_three_comp a (.data(data[5:0]),.sum(sa));
six_three_comp b (.data(data[11:6]),.sum(sb));
six_three_comp c (.data(data[17:12]),.sum(sc));
six_three_comp d (.data(data[23:18]),.sum(sd));
six_three_comp e (.data(data[29:24]),.sum(se));
six_three_comp f (.data(data[35:30]),.sum(sf));

six_three_comp lo (.data({sa,sb,sc,sd,se,sf}),.sum(slo));
six_three_comp me (.data({sa,sb,sc,sd,se,sf}),.sum(sme));
six_three_comp hi (.data({sa,sb,sc,sd,se,sf}),.sum(shi));

wire [7:0] tmp_sum;
.b({2’b0,sme,1’b0}),
.c({1’b0,shi,2’b0}),
.o(tmp_sum));
defparam t .WIDTH = 6;
assign sum = tmp_sum[5:0];

endmodule

12. Hi,

You can actually count the number of ones very efficiently by utilizing lookup tables

let’s say you have and 8 bit vector: ABCD EFGH,
you can break it up into groups of 3, by extending
the vector:

0 ABCD EFGH,

breaking up into groups of 3, results in 3 groups:

0AB, CDE, FGH

simply use a lookup table for each of them (decoder):

input output
000 —> 000
001 —> 001
010 —> 001
011 —> 010
100 —> 001
101 —> 010
110 —> 010
111 —> 011

and now, three partial sums result.

you can use a carry-save adder to compute the 3 value converted numbers.

I don’t think this method scales too well, although you could break it up for 16 and 32 bits into groups of 4, so that would mean 4 partial sums, and 8 partial sums result.. respectively.

hope this helps

13. california (comment above)
I think this is a very inefficient way to count.
look at the first comments especially the one by “stefaan”. His logic is faster and more compact than the LUT solution.
it is also a good idea to look at the previous 2 challenges.

14. My idea is to sort the input vector using “tree sort” (using And,OR gates) and get all the ‘1’s in one side. Then use a priority encoder to get the output.

Hope this helps.

15. Stefan, I’m bit confused with your answer and could not follow up. Could you elaborate the flow with an example. Thanks

16. You can use Full adders and half adders to get the solution. The idea is to do something like Wallace tree. You have to generate groups from 3 1 bits to get a 2 bit result.

x }
x } FA 1
x }
x ]
x ] FA 2
x ]
x } HA 1
x }
+—

x x }
x x } Use one FA for bit 0 and one FA for bit 1
x x }
+—

x x 0 } Use a regular 3 bit adder = 3 full adders.
0 x x }
+—
x x x x

So, this can be done by 7 FAs and 1 HA.

17. Actually for the last step, you only need 1 FA and 1 HA. So, you need 2 HAs and 5 FAs.

18. using FA is way too expensive for this challenge. please reread the conditions set.
also in real life it is more expensive than the solutions proposed here…

19. I read the conditions now. I was thinking more higher level. 16 units is pretty good. Let me see what I can get.

In general, what is your experience with hand coding vs. synthesizing this kind of circuits. I would think that synthesis should give a pretty optimum result for such a small circuit even if you design it with an LUT.

20. I could get it with like 10 logic levels

21. 18 gates and three full adders

22. Is not an XOR tree the same as using adders? So if adders are disallowed then so should chaining XOR’s. I would say that a tuned FA library cell would be at worst equal to the XOR implementation of that FA.