Archive for January, 2008

h1

Ultimate Technical Interview Question – Take 2

January 31, 2008

Allow me to quote from Martin Gardner’s excellent, excellent book Mathematical Carnival (chapter 17):
When a mathematical puzzle is found to contain a major flaw - when the answer is wrong, when there is no answer, or when, contrary to claims, there is more than one answer or a better answer - the puzzle is said to be "cooked".

From the number of hits, it looks like the last post was quite popular. Therefore, I decided to give the problem some more thought and to try to find more minimal solutions – or as defined in the above quote “to cook this problem”.

My initial hunch was to try and utilize an SR latch somehow. After all it is a memory element for the price of only two gates. I just had a feeling there is someway to do it like that.
I decided to leave the count-to-3 circuitry, cause if we want to do a divide by 3, we somehow have to count…
Here is what I first came up with:

div_by_3_sr_1_cir.png

The basic idea is to use the LSB of the counter to set the SR flop and to reset the SR flop with a combination of some states and the low clock.
Here is the timing diagram that corresponds to the circuit above.

div_by_3_sr_1_wave.png

But! not everything is bright. The timing diagram is not marked red for nothing.
In an ideal world the propagation time through the bottom NOR gate would be zero. This would mean that exactly when the S pin of the SR latch goes high the R pin of the flop goes low – which means both pins are never high at the same time. Just as a reminder, if both inputs of an SR latch are high, we get a race condition and the outputs can toggle – not something you want on your clock signal. Back to the circuit… In our case, the propagation time through the bottom NOR gate is not zero, and the S pin of the latch will first go high, then – only after some time, the R pin will go low. In other words we will have on overlap time where both R and S pin of the latch will be high.

div_by_3_sr_1_prob.png

Looking back at the waveform, it would be nice if we could eliminate the second pulse in each set of two pulses on the R pin of the latch (marked as a * on the waveform). This means we just have to use the pulse which occurs during the “00” state of the counter.
This is easy enough, since we have to use the “00” from the counter and the “0” from the clock itself – this is just the logic for a 3 input NOR gate!

The complete and corrected circuit looks like this now:

div_by_3_sr_2_cir.png

And the corresponding waveform below. Notice how the S and R inputs of the SR latch are not overlapping.

div_by_3_sr_2_wave.png

h1

Ultimate Technical Interview Question – The Standard Solution

January 24, 2008

OK, so I am getting tons of email with requests to post a solution for this question which was initially posted here.
I am going to post now what I consider the “standard minimal solution”, but some of you have come up with some neat and tricky ways, which I will save for future a post.
The basic insight was to notice that if you are doing a divide by 3 and wanna keep the duty cycle at 50% you have to use the falling edge of the clock as well.
The trick is how to come up with a minimal design, implementing as little as possible flip-flops, logic and guaranteeing glitch free divided clock.

Most solutions that came in, utilized 4 or 5 flip flops plus a lot more logic than I believe is necessary. The solution, which I believe is minimal requires 3 flops – two working on the rising edge of the clock and generating a count-to-3 counter and an additional flop working on the falling edge of the clock.

A count-to-3 counter can be achieved with 2 flops and a NOR or a NAND gate only, as depicted below. These counters are also very robust and do not have a “stuck state”.

count_to_3.png

The idea now is to use the falling edge of the clock to sample one of the counter bits and generate simply a delayed version of it.
We will then use some more logic (preferably as little as possible) to combine the rising edge bits and falling edge bit in a way that will generate a divide by 3 output (with respect to out incoming clock).

The easiest way (IMHO) to actually solve this, is by drawing the wave forms and simply playing around. Here is what I came up with, which I believe to be the optimal solution for this approach – but you are more than welcome to question me!

clk_div_by_3.png

and here is also the wave form diagram that describes the operation of the circuit, I guess it is self-explanatory.

clk_div_by_3_wave.png

One more interesting point about this implementation is that it does not require reset! The circuit will wake up in some state and will arrive a steady state operation that will generate a divide by 3 clock on its own. We discussed some of those techniques in the past when talking about ring counters – link to that post here.

h1

Low-Power Design Book

January 16, 2008

Everybody is talking low power design now. I try to give some tips here and there on this blog – mainly from the digital design or RTL point of view.

This book (google books link): Low-Power CMOS Circuits: Technology, Logic Design and CAD Tools By Christian Piguet has really something for everyone. Whether you are an analog designer, digital designer, architect or even a CAD guy – read it. It is heavy on examples, which immediately gets my points.

I found the low-power RTL chapter very informative and it even covers some of the stuff I addressed in this blog.
Check it out, it is worth your time!

h1

Real World Examples #1 – DBI Bug Solution

January 7, 2008

In the previous post I presented the problem. If you haven’t read it, go back to it now cause it will make this entire explanation simpler.

Given the RTL code that was described, the synthesizer will generate something of this sort:

dbi_bug_sol1.png

A straight forward approach, to solve the problem, would be to try to generate the MSB of the addition logic and do the comparison on the 4-bit result. This logic cloud would (probably) be created if we would make the result vector to be 4-bit wide in the first place. It would look something like this:

dbi_bug_sol2.png

This looks nice on the paper, but press the pause button for a second and think – what is really hiding behind the MSB logic? You could probably re-use some of the addition logic already present, but you would have to do some digging in the layout netlist and make sure you got the right nets. On top of that, you would probably need to introduce some logic involving XORs (because of the nature of the addition). This is quite simple if you get to use any gate you wish, but it becomes complex when you got only NANDs and NORs available. It is possible from a logical point of view, but since you need to employ several spare cells, you might run into timing problems since the spare cells are spread all over and are not necessarily in the vicinity of your logic. Therefore, a solution with the least amount of gates is recommended!

So let’s rethink the problem. We know that the circuit works for 0-7 “1”s but fails only for the case of 8 “1”s. We also know that in that case the circuit behaves as if there were 0 “1”s. Remember we go 4 input NANDs and NORs to our disposal. We could take any 4 bits of the vector, AND them and OR them with the current result. It’s true, we do not identify 8 “1”s but in a case of 8 “1”s the AND result of any 4 bits will be high and together with the OR it will give the correct result. On other cases the output of this AND will be low and pass the correct result via the old circuit! There is a special case where there are exactly 4 bits on and these are the bits that are fed into our added AND gate, but in this case we have to anyway assert the DBI bit.
The above paragraph was relatively complicated so here is a picture to describe it:

dbi_bug_sol3.png

It is important to notice that with this solution, the newly introduced AND gate is driven directly from the flip-flops of the vector. This makes it much easier to locate in the layout netlist, since flip-flop names are not changed at all (or very slightly changed).

Here is the above circuit implemented with 4 input NAND gates only (marked in red). This is also the final solution that was implemented.

dbi_bug_sol4_fix.png

Closing words – this example is aimed to show that when doing ECOs one has to really put effort and try to look for the cheapest and simplest solution. Every gate counts, and a lot of tricks need to be used. This is also the true essence of our work, but let’s not get philosophical…

h1

Real World Examples #1 – The DBI bug

January 3, 2008

OK, back after the long holidays (which were spent mainly in bed due to severe sickness, both of myself and my kids…) with some new ideas.
I thought it would be interesting to pick up some real life examples and blog about them. I mainly concentrated so far on design guide lines, tricky puzzles and general advice. I guess it would benefit many if we dive into the real world a bit. So – I added a new category called (in a very sophisticated way) “real life examples”, which all this stuff will be tagged under.
Let’s start with the first one.

The circuit under scrutiny, was supposed to calculate a DBI (Data Bus Inversion) bit for an 8-bit vector. Basically, on this specific application, if the 8-bit vector had more than 4 “1”s a DBI bit should have gone high, otherwise it should have stayed low.
The RTL designer decided to add all the bits up and if the result was 4 or higher the DBI bit was asserted – this is not a bad approach in itself and usually superior to LUT.

The pseudo code looked something like that:

assign sum_of_ones = data[0] + data[1] + data[2] + data[3] + data[4] + data[5] + data[6] + data[7];
assign dbi_bit = (sum_of_ones > 3);

The problem was that accidentally the designer chose “sum_of_ones” to be 3-bit wide only! This meant that if the vector was all “1”s, the adder logic that generates the value for sums_of_ones would wrap around and give a value of “000”, which in turn would not result in the DBI bit being asserted as it should. During verification and simulation the problem was not detected for some reason (a thing in itself to question), but we were now facing with a problem we needed to fix as cheaply as possible. We decided to try a metal fix.

The $50K (or whatever the specific mask set cost was) question is how do you fix this as fast as possible with as little overhead as possible, assuming you have only 4 input NAND and 4 input NOR gates available?
Answer in the next post