Challenge #2 – One Hot Detection

October 20, 2008

The last challenge was a big success with many people sending their solutions via email or just posting them as a comments.
Many of you said they were waiting for the next challenge. So, before returning to the usual set of posts about different aspects of digital design, let’s look at another one.

Imagine you have a vector of 8 bits, The vector is supposed to be one hot coded (only a single logic “1” is allowed in the set). Your task if you choose to accept it :-), is to design a combo block to detect if the vector is indeed one-hot encoded.

We are again looking for the block with the shortest delay. As for the solution metrics for this challenge please refer to the previous challenge.

Also try to think how your design scales when the input vector is 16 bits wide, 32 bits wide and the general case of n bits wide.

Good luck!


  1. Let us assume the 8-bit number is A[7:0]
    Symbol conventions: . = AND, + = OR, ‘ = NOT

    The logic equation (A[7]+A[6]+A[5]+A[4])’ . (A[3]+A[2]+A[1]+A[0]’)’ is one case when output will be active. This equation can be evaluated in 8 delay units.

    We have 8 such cases to evaluate in parallel and then OR them all. OR-ing all the 8 cases will take another 9 delay units.

    The output is evaluated with such a circuit in 17 delay units.

    This design is scalable in the sense that the logic equation above can be extended to n-bits easily and will be implemented as a slice and replicated n times. The n outputs will then be OR-ed to get the final output.

    Will think of a faster design.

  2. Sundar,

    we are certainly looking for a faster solution – maybe something faster by an order of 2? maybe more…


  3. How about this one. It could probably be a bit better, but it should scale fairly well for 16 and 32 bit buses.

    module onehot(input wire [7:0] x,
    output wire isonehot);

    wire [3:0] firstnor,firstfail_neg;

    wire [1:0] secondor;
    wire [3:0] secondfail;

    wire thirdor;
    wire [2:0] thirdfail;



    // Two delay units in total

    nand (secondor[0],firstnor[3],firstnor[2]);
    nand (secondor[1],firstnor[1],firstnor[0]);


    // Four delay units in total


    // Six delay units in total


    // 11 delay units in total (four units for the nor + inverters for thirdfail[2:0])


  4. I think it can be done slightly better still…

  5. Hi,
    by using a (new) 2 input xor component which is comprised of 2 2-input NANDs and 2 inverters (total delay of 5) it is possible to build a more efficient xor tree than using the rather complex 7-delay xor. For a 8 bit input vector we get a total delay of 15.
    In general it scales with log_2(N)*5.


  6. 10 is the one to beat.

    Even more, if you beat 10, you probably win also the previous challange!

    1. detect 2 or more bits in the first 4 : 3 levels of logic (active low output)
    2. combine two of above with a nand gate : one extra level, this represents the at least two ones signal (active high)
    3. detect a single one in 8 bits = 3 levels of logic active low output
    4. combine output of step 2 with inverted output of step 3 in a nand gate (1 logic level)

    Since one logic level is 2 delays and only 2 input nand/nor ports are used : 10 delay units. (the inverter in step 4 is not in the critical path).

    For N = number of bits :
    logic levels needed = ceil(log2(N))+3
    delay units = 2 x logic levels + 1 if levels are odd

    (the +1 is for an extra inverter in that case)


  7. Sorry,

    my formula in the above post is wrong, it must be :

    logic levels needed = ceil(log2(N))+2
    and you have to add one delay unit when the nr of levels is even.


  8. The most scalable solution is a row of half-adders, (which reduce the signals by 3:1) followed by another row of half-adders on all the sum outputs, followed by another row, etc…, etc… until you are left with only a single sum output and a whole bunch of carries. The input is one-hot if the final sum is one and all of the carries are zero. There may be faster solutions for certain small fixed widths, but nothing will be faster for the n-input general case. I can’t calculate the delay since you didn’t give a delay for a half-adder and I think your XOR cell is very badly designed to be that slow.

  9. Stefaan,

    my solution is 10 as well. I really had the hopes you would beat it with the approach from the last challenge.
    I guess both of our approaches are basically the same.

    I think that even if you take a realistic XOR, it would never be faster than a NAND2 or a NOR2 (probably around double)
    The solution Stefaan (from the comments above suggests) utilizes a tree of NANDs or NORs. I think that cascading XORs in a tree like fashion would at best yield a log2(N)XOR delays, which the solution proposed by Stefaan would certainly beat.

    one more thing. I specifically gave a very bad XOR delay to try to lead everybody away from XOR based solutions and more to NANDs NORs. The table itself has many other inaccuracies (e.g. a 3 input NAND will probably be in the same order as an 2 input OR gate).
    We really needed a way to compare approaches and I thought that limiting the “library” to strictly 2 input elements gives an easier way to compare…

    The real world is NOT like that – I know.

  10. Are there any CAD tools available which can be used to solve this kind of problems? For example, a colleague has been thinking about how to solve the problem of inverting 3 signals with only 2 inverters and an unlimited amount of AND/OR gates using a minimum amount of logic.

    BTW: I think it should be fairly easy to remove the inverters from my solution to get down to 10 delay units now that I think a little bit more about it.

  11. the problem with the 2 not gates is a very famous one, and IMHO a very hard one.
    I know of two approaches to solve it – and it is indeed beautiful.

    I think however it doesn’t really fall into the category of the above challenges.
    Theoretically a synthesis tool should be able to solve these problems. Reality shows different.
    This is also part of the reason I post those challenges, to trigger people into thinking in how their implementations will look like.
    Sure synthesis tools are great, but especially with problems like these a good designers can “help” the synthesis and get MUCH better results.

    just my 2 cents

  12. stefaan wrote:
    “I specifically gave a very bad XOR delay to try to lead everybody away from XOR based solutions”

    This is an artificial problem then. Here is a real-world solution…

    module onehot8 (onehot, pass);
    input [7:0] onehot;
    output pass;

    reg [7:0] sum0, sum1;
    reg [7:0] carry;
    reg pass;

    always @(onehot)
    // First row of half-adders
    {carry[0],sum0[0]} = onehot[0] + onehot[1] + onehot[2];
    {carry[1],sum0[1]} = onehot[3] + onehot[4] + onehot[5];
    {carry[2],sum0[2]} = onehot[6] + onehot[7];
    // Second row of half-adders
    {carry[3],sum1[0]} = sum0[0] + sum0[1] + sum0[2];
    // Output logic
    pass = sum1[0] & ~|carry[3:0];


    With the following report_timing after synthesis (with dont_use on EXOR gates)…

    onehot[0] (in) 0.00 0.00 f
    U28/Z (IVAP) 0.33 0.33 r
    U30/Z (ND2P) 0.25 0.58 f
    U37/Z (AO7P) 0.63 1.21 r
    U40/Z (AO3) 0.50 1.71 f
    U41/Z (AO6) 0.82 2.53 r
    pass (out) 0.00 2.53 r

    You will have to judge the time units of this result because it is using a bunch of and-or-inv gates. But I calculate 2.53/.33 = 7.6 units.


  13. Hi,
    after someone raised the question whether there are any tools to tackle those kinds of problems I tried Berkley’s abc tool (http://www.eecs.berkeley.edu/~alanmi/abc/). I specified our imaginary gate library according to the given data (the cell areas are set to zero to prevent optimization in two domains). The tool seems to be very picky and I got very bad results at the beginning (~30). My first attempt was to feed it with a pla definition of our problem. However, the generated output tree was highly unbalanced which resulted in relatively high delays. Next step was to feed it with the quine&mccluskey handoptimized equation just as a starting point for further optimization. However, still anything but acceptable. Then I simply used the 7 minterms directly and played with various options.
    I managed to get a result as good as 10 delay units by just playing around with various different parameters. abc is an synthesis/mapping tool which uses heuristical approaches, thus it seems that it gets easily stuck on local optima. It seems that you can get fairly good results if you have a clue about the optimal solution.

    I neither don’t know if abc is the very right tool for this problem nor if I used it correctly (perhaps it has a superman mode that i don’t know of).

    If someone is interested, here is what abc coughed up:

    input a, b, c, d, e, f, g;
    output Y;
    wire n8, n9, n10, n11, n12, n13, n14, n15, n16, n17, n18, n19, n20, n21,
    n22, n23, n24, n25;
    nand2 g00(.a(b), .b(a), .O(n8));
    nand2 g01(.a(d), .b(c), .O(n9));
    nand2 g02(.a(n9), .b(n8), .O(n10));
    inv1 g03(.a(e), .O(n11));
    nand2 g04(.a(g), .b(f), .O(n12));
    nor2 g05(.a(n12), .b(n11), .O(n13));
    nor2 g06(.a(b), .b(a), .O(n14));
    nor2 g07(.a(d), .b(c), .O(n15));
    nor2 g08(.a(n15), .b(n14), .O(n16));
    nand4 g09(.a(c), .b(b), .c(a), .d(d), .O(n17));
    nand4 g10(.a(n16), .b(n13), .c(n10), .d(n17), .O(n18));
    nand3 g11(.a(g), .b(f), .c(e), .O(n19));
    nor2 g12(.a(n9), .b(n8), .O(n20));
    inv1 g13(.a(f), .O(n21));
    inv1 g14(.a(g), .O(n22));
    nand2 g15(.a(n22), .b(n21), .O(n23));
    nand2 g16(.a(n12), .b(n11), .O(n24));
    nand4 g17(.a(n23), .b(n20), .c(n19), .d(n24), .O(n25));
    nand2 g18(.a(n25), .b(n18), .O(Y));


  14. sorry,
    I made a mistake and the provided output is wrong! (it has only 7 inputs due to an error in the input file). unfortunately, the corrected version has a delay of _11_ units, which proves again that pen&paper pays off.


  15. Hi all,
    It can be done in 6 levels
    No. of levels = 1 + log2( (n(n-1))/2 )
    n= 8 in given problem.
    No. of levels = 1 + log2( (8(8-1))/2 )
    = 1 + log2 ( 4*7)
    = 1+ 5
    LEVEL 1:x[0]= A[1]&A[0] ;
    x[1] = A[0]&A[2];
    x[3] =A[1]&A[2];
    (So level 1 comprises of anding of all possible combinations = n(n-1)/2 )

    Now just or all results of level 1.
    Levels required now = log2( no. of x’s from level 1)

    LEVEL 2: y[0]=x[0] | x[1];
    y[1]=x[1] | x[2];
    LEVEL 3: z= y[0] | y[1];

    if z =1’b1 then vector is not one hot encoded

  16. Shobit et al:

    Your solution of 6 levels does not exclude the case with NO bit set. So it actually takes 7 levels to properly generate Nir’s requested result. With his metrics, that’s a delay of 14 using only NAND and NOR gates.

    I think you’ll find that all pairs of bits are not checked in the solutions with fewer than 7 levels.

    The trade-off in getting a log increase in delay is to get an n-squared increase in gate count. Accepting a linear delay increase nets a 3n gate count using a ‘carry-chain’ approach:

    more_than_one_of_eight_set = b0&b1 | (b0|b1)&b2 | ((b0|b1)|b2)&b3 | … | ((((((b0|b1)|b2)|b3)|b4)|b5)|b6)&b7;

    none_of_eight_set = !(b0|b1|b2|b3|b4|b5|b6|b7);

    valid_one_of_eight_set = !(more_than_one_of_eight_set|none_of_eight_set);

    —validate one-hot encoding—
    bits tree chain
    levels gates levels gates
    2 2 3 2 4
    4 5 15 4 9
    8 7 63 8 23
    16 8 255 16 51
    32 10 1023 32 75

  17. Sorry about the table–the comment box is in courier and not proportional. The 5 columns are:


  18. Hi,

    here is B.Eq to determine if an eight bit vector is truly one hot or not.

    one_hot_pass = (((A0 xor A1) and (A2 nor A3)) or ((A0 nor A1) and (A2 xor A3))) xor (((A4 xor A5) and (A6 nor A7)) or ((A4 nor A5) and (A6 xor A7)));

    this should be done 4 gate delay


    • Karthik,

      Doesn’t your solution report a false positive with this scenario and others like it?:



  19. It can be done in 10cycles.
    Select one bit every clock cycle.It should pass thru a ‘D’flip flop.The output of this D flip flop in inverted and connected to the enable port of the same D flop.
    Now use another D flop.The output of the first flop and next input .i.e next bit of the input vector is ANDED and supplied to second flop.
    The EXOR output of the D Flops tells whether this was a perfect One Hot Encoded or not.It also considers case of not even a single bit being encoded as ‘1’.

  20. Can someone pls evaluate the below ckt to detect one-hot encoding??

    We have 8-inputs A7, A6…..A1, A0.
    i) A7 and A6 are given as inputs to an Xor gate whose output is L1.
    ii) A5, A4 & A3 are given as inputs to a Full adder whose outputs Sum and not(Cout) are fed to an AND gate, whose output is L2.
    iii)A2, A1 & A0 are given as inputs to a Full adder whose outputs Sum and not (Cout) are fed to an AND gate, whose output is L3.
    iv) L1, L2 & L3 are given to a Full adder whose outputs Sum and not(Cout) are fed to an AND gate, whose output is F.

    My analysis says ‘F’ gives output 1 only for one-hot encoding.

    Please correct me if I am wrong.

  21. Evaluates within 8 (or 9) gate delays for 8-bit.

    For a four bit vector a3, a2, a1, a0.
    Level 1 : x = nand(a3,a2) ; y = nand(a1,a0) z = nor (a3,a2) p = nor (a1, a0) = 2 gate delays
    Level 2: P = nand(x,y) Q = nor (z,p) = 2 gate delays
    Level 3: Z = nor (P,Q) = 2 gate delays

    Parallel this for the remaining 4-bits using the same logic structure. The results must be NANDed. This adds 2 gate delays. Therefore 8 delays in total. The output is 1 on failure. The delay is 9 if the output is 1 on success (requires an OR on the output instead of NOR).

  22. Often for tools it’s good if you can define a tree like structure as that will get the best results for logic levels. We can detect if only one bit is set for each pair of bits. Followed by detecting if one bit is set for the output of a pair of pairs so on and so on. The problem is we need to detect a third failed condition. If both bits are set we have failed so the output needs to be zero. I believe the below would cover the third failed condition zeroing the output and would create a tree like structure for a 2^n bit bus.

    Assuming a is my input, o is my output.
    Define two buses xsb and x.

    always@(*) begin
    for(j=0;i<(1<<j);i=i+1) begin
    xsb[i] = (xsb[2*i] | xsb[2*i+1]) | (x[2*i]&x[2*i+1]);
    x[i] = (x[2*i]&~x[2*i+1])|(~x[2*i]&x[2*i+1]);
    o = x[0] & ~xsb[0]; //zero output if xsb[0] got set

    • looks like my always@(*) statement got a bit messed up but hopefully its clear 🙂

  23. This can be solved using a 4:2 compressor. Say x1~x8 are the 8 bits of 8bit vector. The equation for 4:2 compressor is:
    x1+x2+x3+x4+Cin1 = Sum1 + 2*(Carry1 + Cout1)
    x5+x6+x7+x8+Cin2 = Sum2 + 2*(Carry2 + Cout2)
    assign Cin1=0;
    assign Cin2=0;
    Y= ~Carry1 & ~Carry2 & ~Cout1 & ~Cout2 & (Sum1^Sum2)

    The compressor is not complicated, especially with Cin equals to 0, you can search in google for more detail.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: