## Designing Robust Circuits

May 25, 2007

There are many ways to design any given circuit, and many trade-offs to weigh: power, area, speed and so on.
In this post we will discuss robustness and, as usual, we will use a practical, real-life example to make our point.

When one talks about robustness in digital design, one usually means that if a certain type of failure occurs during operation, the circuit does not need outside “help” in order to return to a defined, or at least allowed, state. Maybe this is a bit cryptic, so let’s look at a very simple example – a ring counter.

As pictured on the right, a 4-bit ring counter has 4 different states, with only a single “1” in each state. “Counting” is performed by shifting, or more correctly rotating, the “1” in one direction with each rising clock edge. Ring counters have many uses; one of the most common is as a pointer for a synchronous FIFO. Because of their simplicity, one often finds them in high speed full custom designs. Ring counters allow only a subset of all possible states as legal states. For example, the state “1001” is not allowed.

A very simple implementation of a ring counter is depicted below. The 4 flip-flops are connected as a circular shift register. Three of the registers have an asynchronous reset pin, while the leftmost has an asynchronous set pin. On reset, the ring counter will therefore assume the state “1000”.
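The structure above can be sketched in a few lines of RTL. This is a minimal behavioral equivalent, not the original schematic; the signal names (`clk`, `rst_n`, `q`) are my own.

```verilog
// Simple 4-bit ring counter: on reset the leftmost bit is set and the
// rest are cleared, then the single "1" is rotated on each clock edge.
module ring_counter_simple (
    input  wire       clk,
    input  wire       rst_n,   // active-low asynchronous reset
    output reg  [3:0] q
);
    always @(posedge clk or negedge rst_n)
        if (!rst_n)
            q <= 4'b1000;          // reset state: "1000"
        else
            q <= {q[0], q[3:1]};   // rotate the "1" one position to the right
endmodule
```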

Now, imagine that for some reason (inappropriate reset removal, crosstalk noise etc.) the state “1100” appeared in the above design – an illegal state. From now on, the ring counter will forever toggle between illegal states, and this situation will persist until the counter is reset again. If a system is noisy – and such a risk is not unthinkable – hard resetting the entire system just to bring the counter back to a known state might be disastrous.

Let’s inspect a different, more robust design of a ring counter in the picture below.

With the new implementation, the NOR gate generates the leftmost bit of the counter. More importantly, the NOR gate will drive “0”s into the 3-bit shift register until all 3 bits are “0”, and only then drive a “1”. If we start from a forbidden or illegal state like “0110”, the new circuit will go through the states “0110”–>“0011”–>“0001”, reaching a legal state on its own! This means we might see unwanted behavior for a few cycles, but we do not need to reset the circuit to bring it back to a legal state.
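The self-correcting version differs only in how the leftmost bit is produced. A sketch under the same naming assumptions as before (you can trace the “0110”–>“0011”–>“0001” recovery by hand):

```verilog
// Self-correcting ring counter: a 3-bit shift register plus a NOR gate
// that supplies the fourth (leftmost) bit. The NOR output is "1" only
// when all three stored bits are "0", so any illegal state flushes out.
module ring_counter_robust (
    input  wire       clk,
    input  wire       rst_n,
    output wire [3:0] q
);
    reg  [2:0] sr;                // the 3-bit shift register
    wire       nor_out = ~|sr;    // NOR of all three stored bits

    always @(posedge clk or negedge rst_n)
        if (!rst_n)
            sr <= 3'b000;             // reset: q becomes "1000" via the NOR
        else
            sr <= {nor_out, sr[2:1]}; // shift right, NOR feeds the top

    assign q = {nor_out, sr};
endmodule
```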

In a later post, when discussing Johnson counters, we will see this property again.

## Late Arriving Signals

May 23, 2007

As I mentioned before, it is my personal opinion that many digital designers distance themselves further and further from the physical implementation of digital circuits and concentrate on the HDL implementation instead. A relatively simple construction like the one I am about to discuss is already quite hard to debug directly in HDL. With a visual aid showing what the circuit looks like, it is much easier (and faster) to find a solution.

The classic example we will discuss is that of a late arriving signal. Look at the picture below. The critical path through the circuit runs along the red arrow. Let’s assume that there is a setup violation at FF6.
Let’s also assume that in this example the logic cloud marked “A”, which controls the MUX that chooses between FF3 and FF4, is quite heavy. The combination of cloud “A” and cloud “B” plus the MUXes in series is simply too slow. But we have to use the result of “A” before calculating “B”! What can be done?

The key observation is that we can duplicate the entire logic that follows “A”. In one copy we assume the result of “A” was a logic “0”, in the other a logic “1”. Later, we choose between the two pre-computed results. Another picture will make it clearer.
Notice how the MUX that selected between FF3 and FF4 has vanished. There is now a MUX that selects between FF3 and FF5 (“A” result was “0”) and a MUX in the parallel logic that selects between FF4 and FF5 (“A” result was “1”).
At the end of the path we introduce a new MUX which selects between the two calculations, this time controlled by cloud “A”. It is easy to see that although this implementation takes more area due to the duplicated logic, the big logic clouds “A” and “B” are now evaluated in parallel rather than in series.
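In RTL the trick looks roughly as follows. This is a sketch, not the circuit from the figure: `cloud_b` is an arbitrary stand-in for the heavy “B” logic, and `ff3`/`ff4`/`ff5`/`sel_a` are hypothetical names for the register outputs and the late result of “A”.

```verilog
// Late-select duplication: evaluate cloud "B" for both possible outcomes
// of the late-arriving cloud "A" result, then select at the very end.
module late_select #(parameter W = 8) (
    input  wire [W-1:0] ff3, ff4, ff5,
    input  wire         sel_a,   // late-arriving result of cloud "A"
    output wire [W-1:0] y
);
    // placeholder for the heavy cloud "B" logic (arbitrary computation)
    function [W-1:0] cloud_b(input [W-1:0] p, input [W-1:0] q);
        cloud_b = p ^ q;
    endfunction

    // both branches are computed in parallel, without waiting for sel_a
    wire [W-1:0] y_if_a0 = cloud_b(ff3, ff5);  // assumes "A" resolved to 0
    wire [W-1:0] y_if_a1 = cloud_b(ff4, ff5);  // assumes "A" resolved to 1

    // the late sel_a now drives only a single final MUX
    assign y = sel_a ? y_if_a1 : y_if_a0;
endmodule
```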

This technique is relatively easy to implement, and easy to spot if you have a circuit diagram of your design. Also, do not count on the synthesis tool to do this for you. It might manage relatively small structures, but when the logic clouds get bigger you should implement this trick on your own – you will see improvements in timing (and often in synthesis run time). What you pay is area, and possibly power – nothing comes for free…

## Do You Think Low Power???

May 20, 2007

There is almost no design today, where low power is not a concern. Reducing power is an issue which can be tackled on many levels, from the system design to the most fundamental implementation techniques.

In digital design, clock gating is the backbone of low power design. True, there are many other ways a designer can influence power consumption, but IMHO clock gating is the easiest and simplest to introduce without a huge overhead or compromise.

Here is a simple example of how to easily implement low power features.

The picture on the right shows a very simple synchronous FIFO. This FIFO is a very common design structure, easily implemented as a shift register. The data is pushed to the right with each clock, and the tap select decides which register to read. The problem with this construction is that with each clock all the flip-flops potentially toggle, and a clock edge is delivered to every one of them. This hurts especially in data or packet processing applications, where such a FIFO can span thousands of flip-flops!!

The better approach is, instead of moving the entire data around with each clock, to “move” the clock itself. Well, not really move it, but to keep only one specific cell (or row, in the case of vectors) active while all the other flip-flops are gated. This is done with a simple counter (or a state machine for specific applications) that rotates a “one hot” signal – thus enabling only one cell at a time. Notice that the data_in signal is connected to all the cells in parallel. When new data arrives, only the cell which receives a clock edge at that moment will store a new value.
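A behavioral sketch of the idea, with illustrative names and widths of my choosing: a one-hot pointer rotates, and only the selected row captures data_in. In a real ASIC flow the per-row enable below would be mapped to a proper integrated clock-gating cell rather than left as a plain enable.

```verilog
// One-hot write pointer: only the enabled row stores data, all other
// rows keep their value and (after clock gating) see no clock toggles.
module gated_fifo #(parameter W = 8, DEPTH = 4) (
    input  wire         clk,
    input  wire         rst_n,
    input  wire         push,
    input  wire [W-1:0] data_in
);
    reg [DEPTH-1:0] onehot;           // rotating one-hot write pointer
    reg [W-1:0]     mem [0:DEPTH-1];  // storage rows

    always @(posedge clk or negedge rst_n)
        if (!rst_n)
            onehot <= {{(DEPTH-1){1'b0}}, 1'b1};
        else if (push)
            onehot <= {onehot[DEPTH-2:0], onehot[DEPTH-1]};  // rotate

    // data_in fans out to every row, but only the enabled row toggles
    integer i;
    always @(posedge clk)
        for (i = 0; i < DEPTH; i = i + 1)
            if (push && onehot[i])
                mem[i] <= data_in;
endmodule
```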

## Another Synchronization Pitfall…

May 18, 2007

Many are the headaches of a designer doing multi clock domain designs. The basics that everyone should know when working with multiple clock domains are presented in this paper. In this post I would like to discuss a lesser known problem, which is overlooked by most designers. As a small anecdote, this problem was encountered by a design team led by a friend of mine. The team was offered a 2 day vacation as a reward for anyone tracking down and solving the weird failures they experienced. I guess this alone is a good reason to continue reading…

OK, we all know that when sending a control signal (better a single one! – see the paper referenced above) from one clock domain to another, we must synchronize it at the receiving end using a two stage shift register (some libraries even have a dedicated “sync cell” for this purpose).
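For reference, that standard two-stage synchronizer looks like this in RTL (signal names are my own, not from the figure):

```verilog
// Classic two-flop synchronizer: the first stage may go metastable,
// but it gets a full clk_b cycle to resolve before being used.
module sync2 (
    input  wire clk_b,     // receiving clock domain
    input  wire rst_n,
    input  wire async_in,  // signal arriving from the other domain
    output reg  sync_out
);
    reg meta;  // first, potentially metastable stage
    always @(posedge clk_b or negedge rst_n)
        if (!rst_n) {sync_out, meta} <= 2'b00;
        else        {sync_out, meta} <= {meta, async_in};
endmodule
```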

Take a look at the hypothetical example below.

At first glance all is well: the control signal, which is the output of some combinational logic, is synchronized at the receiving end.
So what is wrong?
In some cases the combinational logic might generate a hazard, depending on its inputs. Regardless of whether it is a static hazard (as depicted in the timing diagram) or a dynamic one, it is possible that exactly this point is sampled at the other end. Take a close look at the timing diagram: the glitch was recognized as a “0” on clk_b’s side, although it was never intended to be.

The solution to this problem is relatively easy and involves adding another sampling stage clocked by the sending clock, as depicted below. Notice how this time the control signal at the receiving end was not recognized as a “0”. This is because the glitch had enough time to settle before the next rising edge of clk_a.

In general, a control signal sent between two clock domains should switch strictly monotonically: either 1–>0 or 0–>1. Static hazards (1–>0–>1 or 0–>1–>0) and dynamic hazards (1–>0–>1–>0 or 0–>1–>0–>1) are a recipe for trouble.
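Putting the fix together in RTL: the hazard-prone combinational output is first registered in the sending domain, so any glitch settles within a clk_a cycle before it ever reaches the crossing. The names (`comb_out`, `ctrl_b`) are hypothetical stand-ins for the signals in the figure.

```verilog
// Register in the sending domain first, then synchronize in the
// receiving domain: the crossing only ever sees clean transitions.
module glitch_free_crossing (
    input  wire clk_a,     // sending clock
    input  wire clk_b,     // receiving clock
    input  wire rst_n,
    input  wire comb_out,  // hazard-prone combinational signal
    output wire ctrl_b     // clean, synchronized signal in clk_b domain
);
    // extra stage in the sending domain: glitches die out here
    reg ctrl_a;
    always @(posedge clk_a or negedge rst_n)
        if (!rst_n) ctrl_a <= 1'b0;
        else        ctrl_a <= comb_out;

    // the usual two-stage synchronizer in the receiving domain
    reg [1:0] sync_b;
    always @(posedge clk_b or negedge rst_n)
        if (!rst_n) sync_b <= 2'b00;
        else        sync_b <= {sync_b[0], ctrl_a};

    assign ctrl_b = sync_b[1];
endmodule
```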

Just a few more lines on synchronization faults. Quite often they pop up in only some of the manufactured parts. You might have 2 identical chips: one shows the problem, the other does not. This can be due to slight process variations that make some logic faster or slower and, in turn, generate a hazard at exactly the wrong moment.

## Eliminating Unnecessary MUX structures

May 16, 2007

You will often hear engineers in our business say something along these lines:
“I first code, and then let synthesis find the optimal implementation” or “synthesis tools are so good these days, there is no point in spending time thinking at the circuit level…”. Well, not me – sorry!! I am a true fan of “helping” or “directing” the synthesis.
The example I will discuss in this post is a real life example that came up while reviewing a fellow engineer’s work.

The block in question is quite a heavy one, with very tight timing requirements and complicated functionality (aren’t they all like that…). Somewhere in the code I encountered this if-else-if statement (Verilog):

```verilog
if (s1)
  y = 1'b1;
else if (s2)
  y = 1'b0;
else
  y = x;
```

Now, if this statement had stood on its own, it would not have raised much suspicion. But it happened to be part of the critical path. At first glance, the if-else-if ladder translates into a chain of cascaded MUXes, but looking carefully, one can simplify it into two gates (or even one complex gate in most libraries), as shown below.

I am not saying that a good synthesis tool is unable to simplify this construction, and I have to admit I do not really know what goes on inside the optimization process – it seems to be some sort of black magic of our art – but the fact is that timing improved after describing the if-else-if statement explicitly as an or-and combination.
The reason may be, as depicted, that the MUXes are somehow “dragged” into the logic clouds just before and after them in the hope of simplifying them there. I just don’t know!
A good sign that such a simplification is possible is an if-else-if ladder or a case statement with constants on the right hand side (RHS). It does make the code a bit less readable, but IMHO it is worth it.
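For the ladder above, the explicit or-and form is a one-liner. Checking all three cases: s1 forces a “1”, otherwise s2 forces a “0”, otherwise x passes through – exactly the original priority behavior.

```verilog
// Two-gate equivalent of the if-else-if ladder:
// an OR (for the s1 override) and an AND with an inverted input (for s2).
assign y = s1 | (~s2 & x);
```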

Here is a short summary of some common MUX constructs with fixed inputs and their simplified forms.