## Challenge #1 – DBI Detect

October 8, 2008

It has been a while since we had a challenge question on the site (last one was the divide by 3 question), and I would like to have more of those in the future. I will basically pose a problem and ask you to solve it under certain conditions – e.g. least hardware or latency, lowest power etc.

This time the challenge is related to a real problem I encountered recently. I reached a certain solution, which I do not claim to be optimal, actually I have the feeling it can be done better – I am therefore very interested in your own way of solving the problem.

Your challenge is to design a combo-block with 8 inputs and 1 output. You receive an 8-bit vector, If the vector contains 4 ‘1’s or more, the output should be high, otherwise low (This kind of calculation is commonly used for data bus inversion detection).

What is the best way to design it with respect to minimizing latency (in term of delay units), meaning the lowest logic depth possible.

Just so we could compare solutions, let’s agree on some metrics. I am aware that your own library might have different delay ratios between the different elements, but we gotta have something to work with.

• Inverter – 1 delay unit
• NOR, NAND – 2 delay units
• AND, OR – 3 delay units
• 3 or 4 input NOR, NAND – 4 delay units (2 for first stage + 2 for second stage)
• 3 or 4 input OR, AND – 6 delay units (2 for first stage + 2 for second stage)
• XOR, MUX – 7 delay units (2 AND/OR + 1 Inverter)

Take it from here guys…

1. Is 31 delay units good enough to mention?

Stefaan

2. I managed to do it in less – I don’t want to mention the exact number cause I want more answers coming.

One fellow engineer told me that my little delay table above is far from realistic. This is true of course, but I would like to keep it that way.

I looked into a real library and found some more ways to optimize this.

I think this is a beautiful problem that is worth more attention…

3. Hi,
this solution is quite an extreme but has “only” a delay of 23 units. However, that’s what makes it interesting…
There are 70 ways of distributing 4 ‘1’s across 8 bit. Now use 70 4-input NANDS + INV (5 delay in total) to explicitly find those combinations. Now we need to cascade 4 hierarchies of ORs (3 times 4-input NOR + INV and one 2 input OR, thus 18 delay units in total) in order to get the final result. If we have more than 4 bit, the circuit still works since the AND-grid just finds more than a single 4-bit combination. The advantage of this circuit is that it uses as much parallelism as possible with a significant area overhead. In order to decrease the area overhead you could use plain OR/AND gates which in turn have more delay. I hope I’m not on the wrong track here…

Philip

4. hi Philip,

that is a very nice solution indeed.
however, I believe the delay can be cut significantly.
I still don’t want to publish my answer, since this challenge is only 48 hours or so online…
thanks!!

5. Hi Nir,
I could cut the delay to 19 by changing the idea slightly. Rather than looking for 4 1s I’m now looking for 5 zeros… The concept is the same, though!

Philip

6. well done! best answer so far (including the email answers which are not published here) but hey – we can go lower than that!

Nir

7. Hi,

After discovering the fun of playing with gates I got it on 16 delay units :

First attempts with two trees of adders failed at 31 (see first post), but after optimising here and there I got the adder solution on 21 delay units.

Some observations however could be reused :
– we don’t need to count the excact amount of bits (obviuos)
– the use of active low logic gives smalled delays
– a tree approach can be useful, but not with adders

So my best solution is this :
– split the 8 bit in two parts of 4
– decode the 4 bits in 4 status signals :
* one bit set : 10 (active low)
* two or more bits set : 9 (active low)
* three bits set : 10 (same as one bit set, inverted inputs)
* all four bits set : 5 (active low)
– two such decoders are combined as follow (noted in active high logic):
* a = one1 & three2
* b = two1 & two2
* c = three1 & one2
* d = four1 | four 2
this adds 2 delay units (only nor and nand are used)
– final or of the a, b, c, d, signals adds 2 levels or nand/nors = 4 delay units

Result : 16 delay units

Stefaan

8. Stefaan,

16 is indeed one of the best results so far (I got another claim in my email that it can be done in 16 but I await a detailed diagram)

however… I know it can be done with less…
keep the great ideas coming…
Nir

9. Hi,
I have another idea. unfortunately, the result in terms of delay is rather medium but perhaps somebody can tune it a bit.
We stick to the “find more than 5 zeros” idea. The idea is, to use something similar to the autocorrelation function of the input vector. we correlate the input vector with shifted versions of itself. The idea is the following: (Remember: we use the inverted input and look for 5 or more 1’s now)
now we compare (using pairwise ANDs and ORing the results together) the input with a by one bit shifted input. now repeat this with a 2 bit shifted input, a 3 bit shifted input and so on. Now we have 7 intermediate signals that show us a kind of autocorrelation. Now we can use a nice property: If the input has more than 5 ones we know that ALL 7 autocorrelation results MUST be one. Otherwise we have less than 5 1’s….

If you optimize this a bit you can use NORs in the very first comparator stage but I still get delays in the ~20th because of a long OR and AND tree.

I though I just put it out there…

Philip

10. Next attempt: 14

– stage1 : from two bits, do an and and an or action : and output == 2 ‘ones’ found, or output meand >= 1 ‘one’ found (sounds rather basic, …)
– do this 4 times (bit 0 and 1, 2 and 3, 4 and 5, 6 and 7).
– make stage2 with the same logic connected as follow :
from bit 0-1 and 2-3 : 2 and 2 –> 4 (f0)
from bit 0-1 and 2-3 : 2 or 2 –> >=2 (t0)
from bit 4-5 and 6-7 : 2 and 2 –> 4 (f1)
from bit 4-5 and 6-7 : 2 or 2 –> >=2 (t1)
from bit 0-1 and 2-3 : >=1 and >=1 –> >=2 (t2)
from bit 0-1 and 2-3 : >=1 or >=1 –> >=1 (o0)
from bit 4-5 and 6-7 : >=1 and >=1 –> >=2 (t3)
from bit 4-5 and 6-7 : >=1 or >=1 –> >=1 (o1)
– f0 and f1 represent signals where 4 ones are found
– stage 3 is again the same logic applied :
t0 and t1 –> 4 (f2)
t0 or t1 –> >=2 (t4)
t2 and t3 –> 4 (f3)
t2 or t3 –> >=2 (t5)
o0 and o1 –> >=2 (t6)
– f2 and f3 represent >= 4 bits ‘one’ found
– finally : (t4 & t5 & t6) = f4
and out = f0 | f1 | f2 | f3 | f4

Result :
– 3 stages with 2 input and/ors (actually nand/nors) = 6 delay units
– final 3 input and combined with 5 input or : 8 units when optimised with nor/nand

6 + 8 = 14

I hope the explanation is good enough to get the picture?

Stefaan

11. …getting closer
great ideas Stefaan but it can be pushed a bit further, at least with the solution I came up with, which I do not guarantee to be optimal.

12. Status : 12

Going further on my previous idea of splitting in two groups and count the bits per group.
It is not necessary to count exactly, detecting ‘one bit set’ changed to ‘at least one bit set’, and so on.

– at least one bit set : two levels = 4 delays (active high output)
– at least two bits set : three levels = 6 delays (active low output)
– at least three bits set : three levels = 6 delays (active low output)
– four bits set : two levels = 4 delays (active high output)

Combination of above to detect all possibilities with at least 4 bits set : one level = 2 delays
(one1 and three2, two1 and two2, three1 and one2, four1 or four2)

Final or of 4 signals : 2 levels = 4 delays

6+2+4 = 12 delay units

Stefaan

13. 12 is my result as well!!!
now let’s try to break this or prove this is not possible!!!

Let A,B,C,D,E,F,G,H be the eight bits, then the function to detect four or more bits are 1 can be simplified as:
function(A,B,C,D,E,F,G,H) =
(A+B).(C+D).(E+F).(G+H)
+ (AB+CD).(E+F).(G+H)
+ (AB+EF).(C+D).(G+H)
+ (AB+GH).(C+D).(E+F)
+ (CD+EF).(A+B).(G+H)
+ (CD+GH).(A+B).(E+F)
+ (EF+GH).(A+B).(C+D)
+ ABCD + ABEF + ABGH + CDEF + CDGH + EFGH

This function can be implemented in a logic depth of 6 gates with a delay of 14 delay units.

15. Hi Animesh,
How do you count 14 delay units? You have 12 outermost OR gates which translate to at least 10 unites (2 cascaded 4/3-input ORs).
And this:
(AB+EF).(C+D).(G+H)
will be: 2 input AND + 2 input OR + 3 input AND = 3+3+5=11

I count at least 21 units.

Philip

16. I will give the implementation of function:

First stage logic will generate:
Four NOR2 : (A+B)’ , (C+D)’ , (E+F)’ , (G+H)’
Four NAND2 : (AB)’ , (CD)’ , (EF)’ , (GH)’

Second stage logic will generate:
Six NOR2: (A+B).(C+D) , (A+B).(E+F) , (A+B).(G+H) , (C+D).(E+F) , (C+D).(G+H), (E+F).(G+H)
Six NAND2: AB+CD , AB+EF , AB+GH , CD+EF , CD+GH , EF+GH

Third stage of logic will generate:
Seven NAND2: ((A+B).(C+D).(E+F).(G+H))’ , ((AB+CD).(E+F).(G+H) )’ , ((AB+EF).(C+D).(G+H))’ ,
((AB+GH).(C+D).(E+F) )’ , ((CD+EF).(A+B).(G+H) )’ , ((CD+GH).(A+B).(E+F) )’ , ((EF+GH).(A+B).(C+D) )’
Three NOR2: (ABCD + ABEF)’ , (ABGH + CDEF )’ , (CDGH + EFGH)’

Fourth Stage of logic will generate:
One INV: (A+B).(C+D).(E+F).(G+H)
Three NAND3:
(AB+CD).(E+F).(G+H) + (AB+EF).(C+D).(G+H) + (AB+GH).(C+D).(E+F) ,
(CD+EF).(A+B).(G+H) + (CD+GH).(A+B).(E+F) + (EF+GH).(A+B).(C+D) ,
ABCD + ABEF + ABGH + CDEF + CDGH + EFGH

Fifth stage has two NAND2 combining above four terms
Sixth stage have one NAND2 to combining two terms generated by fifth stage to implement the required function.

So, critical path have NOR2,NOR2,NOR2,NAND3,NAND2,NAND2 at a total delay of 2+2+2+4+2+2 = 14 delay units.
If you require additional clarification, I will provide a figure.

17. A small mistake from my side in above implementation:
The fifth stage will have two NOR2 instead of two NAND2 as stated above, but the total delay will remain same of 14 delay units.
So, critical path have NOR2,NOR2,NOR2,NAND3,NOR2,NAND2 at a total delay of 2+2+2+4+2+2 = 14 delay units.

18. 2+2+2+2+3 = 11 delay

19. the equation 2+2+2+2+3=11 is correct, but we would like to see a circuit or explanation

20. Hi guys,
we (a colleague and myself) found another 12 delay solution! I’ll hand in an explanation tomorrow but it uses the idea of my last post!

Philip

21. too lazy to do the critical path math , but the
required logic can be implemented using some CSA’s
and ignoring sum logic and extracting carry logic.

verilog implementation is below.
——————-
module (a,y);
input [7:0] a;
output y;

wire [3:0] k;
assign k[3:0] = a[0] + a[1] + a[2] + a[3] +
a[4] + a[5] + a[6] + a[7];

// assert y = 1 when k is 4,5,6,7,8
assign y = k[2] | (k[3] ;//since a + abar.b = a + b
endmodule

22. that is exactly the point anon, to extract the critical path. I believe that using the adding approach will NOT lead to the optimal solution.

try to calculate the critical path – I am curious about your results…

23. Using the same technique with rows of half-adders as I proposed in the onehot problem…

module dbi (combo, unlock);
input [7:0] combo;
output unlock;
reg unlock;

reg [7:0] sum;
reg [7:0] carry;
reg one, twoa, twob, four;

always @(combo)
begin
{carry[0],sum[0]} = combo[0] + combo[1] + combo[2];
{carry[1],sum[1]} = combo[3] + combo[4] + combo[5];
{carry[2],sum[2]} = combo[6] + combo[7];
{four,twoa} = carry[0] + carry[1] + carry[2];
{twob,one} = sum[0] + sum[1] + sum[2];
// OUTPUT LOGIC — QED OPTIMIZATION
unlock = four | (twoa & twob);
end

endmodule

Synthesis produces this result…

combo[3] (in) 0.00 0.00 f
U76/Z (EO3) 2.01 2.01 f
U80/Z (AO6P) 0.93 2.94 r
U83/Z (AO2) 0.47 3.41 f
unlock (out) 0.00 3.41 f

Which based on an inverter being 3.41/0.33=10 units.

I appreciate the simplification of choosing arbitrary values for the cells, but without real values, you are optimizing imaginary solutions to imaginary problems.

Regards,
David

24. Just for fun, I tried anon’s very simple adder approach and got the following synthesis results…

a[7] (in) 0.00 0.00 f
U32/Z (EO3) 2.01 2.01 f
U37/Z (AO3P) 0.63 2.64 r
U41/Z (AO3) 0.39 3.03 f
y (out) 0.00 3.03 f

This is funny, because the simplest possible approach in a real-world analysis yields an extremely fast 3.03/.33=9 units. I guess good optimizing synthesizers have been optimized for strings of adders.

Regards,
David

25. hi David,

this is far from an imaginary approach.
the idea of limiting the units to 2 input delay units and favoring NANDs/NORs on other gates allows us to find an algorithm for the lowest amount of DECISIONS we have to take.
we can then further on simplify this.
I tried synthesizing the adder approach vs. another approach very similar to what was presented by Stefaan here and I saw a 25% difference.
moreover, it is later very easy to find AOI or OAI combinations especially if you have regular structure. AOI and OAI are very fast gates wrt. to the amount of logic they handle.

26. Here is the synthesized result of anon’s bunch of adders…

module dbi_adder ( a, y );
input [7:0] a;
output y;
wire n57, n58, n59, n60, n61, n62, n63, n64, n65, n66, n67, n68, n69, n70,
n71, n72, n73, n74, n75, n76, n77, n78;

ND2 U28 ( .A(a[3]), .B(a[5]), .Z(n62) );
ND2 U29 ( .A(a[7]), .B(a[1]), .Z(n61) );
AO2 U30 ( .A(a[7]), .B(a[1]), .C(a[3]), .D(a[5]), .Z(n63) );
EO3 U31 ( .A(a[2]), .B(a[6]), .C(a[4]), .Z(n67) );
EO3 U32 ( .A(a[7]), .B(a[1]), .C(a[0]), .Z(n66) );
AO6 U33 ( .A(n61), .B(n73), .C(n62), .Z(n60) );
ND2 U34 ( .A(n63), .B(n73), .Z(n59) );
ND2 U35 ( .A(n77), .B(n74), .Z(n58) );
ND2 U36 ( .A(a[7]), .B(n70), .Z(n69) );
AO3P U37 ( .A(n66), .B(n78), .C(n67), .D(n72), .Z(n68) );
AO6 U38 ( .A(n58), .B(n59), .C(n60), .Z(n57) );
AO3 U39 ( .A(a[7]), .B(n70), .C(n78), .D(n69), .Z(n64) );
IV U40 ( .A(n72), .Z(n65) );
AO3 U41 ( .A(n65), .B(n64), .C(n57), .D(n68), .Z(y) );
EO U42 ( .A(a[1]), .B(a[0]), .Z(n70) );
AO7P U43 ( .A(a[6]), .B(a[4]), .C(a[2]), .Z(n74) );
AN2 U44 ( .A(n77), .B(n74), .Z(n75) );
AO7P U45 ( .A(a[7]), .B(a[1]), .C(a[0]), .Z(n73) );
AN2P U46 ( .A(n71), .B(n73), .Z(n76) );
AO2P U47 ( .A(a[1]), .B(a[7]), .C(a[5]), .D(a[3]), .Z(n71) );
ND2P U48 ( .A(n75), .B(n76), .Z(n72) );
EO U49 ( .A(a[5]), .B(a[3]), .Z(n78) );
ND2 U50 ( .A(a[6]), .B(a[4]), .Z(n77) );
endmodule

A couple NAND, AND-OR-INV, XOR, and an INV. I think that’s awesome results for a coding style that is self-documenting and easily verified by inspection. Instead of a very convoluted psuedo-pre-optimized sea of gates kind of style. Some of the “solutions” that have presented might not even work, they are going to require exhaustive simulation before they are trustworthy. We need to balance speed, area, and power versus bugs.

Regards,
David

27. David,

I think no one really thinks synthesis tools are bad and that they shouldn’t be used.
but sometimes on critical blocks with pencil and paper and a little bit thinking you can do better – that’s all.
I had projects where I had to design a certain critical block that was used around 300 times within the design, it had to be fast and low power. That is a classical example for a pencil and paper approach.

No ASIC would exist nowadays without the help of synthesis, but these small and modest challenges are here to show examples were the “other” approach might be more useful.
Moreover, sometimes the best result come when you have a rough idea on how the solution *should* look like and you “guide” the synthesis in that direction by deliberately using a specific coding style. I use this approach in a lot every single day.

N.

28. I need a solution to the above challenge