Archive for the ‘Real World Examples’ Category

h1

Real World Examples #5 – Clock Divider by 5

August 26, 2009

Here is a neat little circuit that was used in an actual project a long, long time ago (in a galaxy far, far away…).

The requirement was to build a divide by 5 circuit for the clock with 50% duty cycle. The initial (on reset) behavior was not important – i.e. the circuit could wake up in an undefined state, but should have settled after a given time. The engineer produced the circuit below:

divby5

Basically, the circuit is made out of a 3-bit counter, that counts from 000 to 100 and then resets. Signal ‘X’ goes high when the value of the counter is either 000, 001 or 010. Signal ‘Y’ goes high when the counter equals its ‘middle’ state 010. ‘Z’ is a sample on the falling edge of ‘Y’ in order to generate the 50% duty cycle.

So far so good. The general thinking was OK, but there was a major problem with the circuit, can you discover what it was? How would you fix it in RTL? and more important, how would you fix it in an ECO (as it was eventually done)?

No extra flops are allowed!

h1

Real World Examples #4 – More on “Thinking Hardware”

January 20, 2009

I was reviewing some code not so long ago, and noticed together with the owner of the code, that we had some timing problems.
Part of the code looked something like that (Verilog):


wire [127:0] a;
wire [127:0] b;
wire [127:0] c;
assign c = select_register ? a : b;

For those not familiar with Verilog syntax, the code describes a MUX construct using the ternary operator. The two data inputs for the MUX are “a” and “b” and the select is “select_register”.

So why was this code translated into a relatively slow design? The answer is in the width of the signals. The code actually synthesizes to 128 parallel MUX structures. The “select_register” has actually 128 loads.
When a construct like this is hidden within a large code, our tendency is to just neglect it by saying it is “only” 2:1 MUX deep, but we have to look more carefully than that – and always remember to consider the load.

Solving this problem is relatively easy by replication. Just creating more versions of the “select_register” helped significantly.

h1

Real World Examples #3 – PRBS Look-ahead

August 8, 2008

PRBS generation is very useful in many digital design applications as well as in DFT.
I am almost always confused when given a PRBS polynomial and asked to implement it, so I find it handy to visit this site.

This is all nice and well for simple PRBS patterns. In some systems however, the PHY is working in a much higher rate than the digital core (say n times higher). The data is collected in wide buses in the core and then serialized (n:1 serialization) and driven out of the chip by the PHY.

This means that if we do a normal PRBS in the core clock domain, we would not get a real PRBS pattern on the pin of the chip but rather a mixed up version of PRBS with repeating sub-patterns. Best way to see this is to experiment with it on paper.

To get a real PRBS on the pin we must calculate n PRBS steps in each core clock cycle. That is, execute the polynomial, then execute it again on the result and then again, n times.

Let me describe a real life example I encountered not so long ago. The core was operating 8 times slower than the PHY and there was a requirement for a maximum length PRBS7 to be implemented.
There are a few maximum length polynomials for a PRBS7, here are two of them:

Both of these will generate a maximum length sequence of 127 different states. We now have to format it into 8 registers and hand it over to the PHY on each clock, But which of the two should we use? is there a speed/power/area advantage of one over the other? does it really matter?

Well, if you do a PRBS look-ahead, which is approximately the same order as your PRBS polynomial, then it really does matter. In our case we have to do a 8 look-ahead for a PRBS 7.

Compare the implementations of both polynomials below. For convenience both diagrams show the 8 intermediate steps needed for calculating the 8 look-ahead. In the circuit itself only the final result (the contents of the boxes in step 8 ) is used.


Because the XOR gate of the second polynomial is placed more close to where we have to shift in the new calculation of the PRBS, the amount of XORs (already too small in the second image to even notice) accumulate with each step. For the final step we have to use an XOR tree that basically XORs 7 of the 8 original bits – this is more in amount than the first implementation (even if you reuse some of the XORs in the logic) and the logic itself is deeper and thus the circuit becomes slower compared to the other implementation.

The first implementation requires at most a 3 input XOR for the calculation of look-ahead bit6 but the rest require only 2 input XOR gates.

Bottom line, if you do a PRBS look-ahead and have the possibility to choose a polynomial, choose one with lower exponents.

h1

Real World Examples #2 – Fast Counters

February 11, 2008

This is something which will be obvious to the “old school” people, because it was used a lot in the past.

A few weeks ago a designer who was working on a very, very small block asked for my advice on the implementation of counters. The problem was that he was using a 7-bit counter, defined in RTL as cntr <= cntr +1;
The synthesis tool generated a normal binary counter, but unfortunately it could not fulfill the timing requirements – a few GHz.
(Don’t ask me why this was not done in full-custom to begin with…)

Now, the key to solving this problem was to notice that in this specific design only the terminal count was necessary. This meant that all intermediate counter states were not used anywhere else, but the circuits purpose was to determine if certain amount of clock cycles have occured.

This brings us to the question: “Under these conditions, is there a cheaper/faster counter than the usual binary counter?”
Well I wouldn’t write this post if the answer was negative… so obviously the answer is “Yes” – this is our old friend the LFSR!
LFSRs can be also used as counters, and they are being used in two very common, specific ways:

  1. As a terminal counter – counter needs to measure a certain amount of clock edges. It counts to a specific value and then cycles over or resets
  2. As a FIFO pointer – where the actual value itself is not of great importance but the order of increment needs to be deterministic and/or the relationship to another pointer of the same nature

Back in the age of prehistorical chip design (the 1970s), when designers really had to think hard for every gate, LFSRs were a very common building block and were often used as counters.

A slight disadvantage, is that the counting space of a full length n-bit LFSR is not 2^n but rather (2^n)-1. This sounds a bit petty on my side but believe me it can be annoying. Fear not! There is a very easy way to transform the state space to a full 2^n states. (can you find how???)

So next time you need a very fast counter, or when you need pointers for your FIFO structure – consider your good old friend the LFSR. Normally with just a single XOR gate as glue logic to your registers, you achieve (almost) the same counting capabilities given to you by the common binary counter.

h1

Real World Examples #1 – DBI Bug Solution

January 7, 2008

In the previous post I presented the problem. If you haven’t read it, go back to it now cause it will make this entire explanation simpler.

Given the RTL code that was described, the synthesizer will generate something of this sort:

dbi_bug_sol1.png

A straight forward approach, to solve the problem, would be to try to generate the MSB of the addition logic and do the comparison on the 4-bit result. This logic cloud would (probably) be created if we would make the result vector to be 4-bit wide in the first place. It would look something like this:

dbi_bug_sol2.png

This looks nice on the paper, but press the pause button for a second and think – what is really hiding behind the MSB logic? You could probably re-use some of the addition logic already present, but you would have to do some digging in the layout netlist and make sure you got the right nets. On top of that, you would probably need to introduce some logic involving XORs (because of the nature of the addition). This is quite simple if you get to use any gate you wish, but it becomes complex when you got only NANDs and NORs available. It is possible from a logical point of view, but since you need to employ several spare cells, you might run into timing problems since the spare cells are spread all over and are not necessarily in the vicinity of your logic. Therefore, a solution with the least amount of gates is recommended!

So let’s rethink the problem. We know that the circuit works for 0-7 “1”s but fails only for the case of 8 “1”s. We also know that in that case the circuit behaves as if there were 0 “1”s. Remember we go 4 input NANDs and NORs to our disposal. We could take any 4 bits of the vector, AND them and OR them with the current result. It’s true, we do not identify 8 “1”s but in a case of 8 “1”s the AND result of any 4 bits will be high and together with the OR it will give the correct result. On other cases the output of this AND will be low and pass the correct result via the old circuit! There is a special case where there are exactly 4 bits on and these are the bits that are fed into our added AND gate, but in this case we have to anyway assert the DBI bit.
The above paragraph was relatively complicated so here is a picture to describe it:

dbi_bug_sol3.png

It is important to notice that with this solution, the newly introduced AND gate is driven directly from the flip-flops of the vector. This makes it much easier to locate in the layout netlist, since flip-flop names are not changed at all (or very slightly changed).

Here is the above circuit implemented with 4 input NAND gates only (marked in red). This is also the final solution that was implemented.

dbi_bug_sol4_fix.png

Closing words – this example is aimed to show that when doing ECOs one has to really put effort and try to look for the cheapest and simplest solution. Every gate counts, and a lot of tricks need to be used. This is also the true essence of our work, but let’s not get philosophical…

h1

Real World Examples #1 – The DBI bug

January 3, 2008

OK, back after the long holidays (which were spent mainly in bed due to severe sickness, both of myself and my kids…) with some new ideas.
I thought it would be interesting to pick up some real life examples and blog about them. I mainly concentrated so far on design guide lines, tricky puzzles and general advice. I guess it would benefit many if we dive into the real world a bit. So – I added a new category called (in a very sophisticated way) “real life examples”, which all this stuff will be tagged under.
Let’s start with the first one.

The circuit under scrutiny, was supposed to calculate a DBI (Data Bus Inversion) bit for an 8-bit vector. Basically, on this specific application, if the 8-bit vector had more than 4 “1”s a DBI bit should have gone high, otherwise it should have stayed low.
The RTL designer decided to add all the bits up and if the result was 4 or higher the DBI bit was asserted – this is not a bad approach in itself and usually superior to LUT.

The pseudo code looked something like that:

assign sum_of_ones = data[0] + data[1] + data[2] + data[3] + data[4] + data[5] + data[6] + data[7];
assign dbi_bit = (sum_of_ones > 3);

The problem was that accidentally the designer chose “sum_of_ones” to be 3-bit wide only! This meant that if the vector was all “1”s, the adder logic that generates the value for sums_of_ones would wrap around and give a value of “000”, which in turn would not result in the DBI bit being asserted as it should. During verification and simulation the problem was not detected for some reason (a thing in itself to question), but we were now facing with a problem we needed to fix as cheaply as possible. We decided to try a metal fix.

The $50K (or whatever the specific mask set cost was) question is how do you fix this as fast as possible with as little overhead as possible, assuming you have only 4 input NAND and 4 input NOR gates available?
Answer in the next post