## Reordering Nets for Low Power

May 10, 2009

As posts accumulate, you can see that low power design aspects is a big topic on this site. I try to bring more subtle design examples for lower power design that you can control and implement (i.e. in RTL and the micro architectural stage).

Identifying “glitchy” nets is not always easy. Some good candidates are wide parity or CRC calculations (deep and wide XOR trees), complicated arithmetic paths and basically most logic that originates in very wide buses and converges to a single output controlling a specific path (e.g. as a select pin of a MUX for a wide data path).

If you happen to identify a good candidate, it is advisable (when possible) that you feed the “glitchy” nets as late as possible in the calculation path. This way the total amount of toggling in the logic is reduced.

Sounds easy enough? Well the crux of the problem is identifying those opportunities – and it is far from easy. I hope this post at least makes you more aware of that possibility.

To sum up, here are two figures that illustrate the issue visually. The figure below depicts the situation before the transformation.. The nets which are highlighted represent high activity nets.

After the transformation – pushing the glitchy net calculation late in the path. The transformation is logically equivalent (otherwise there is no point…) and we see less high activity nets.

## Who Said Clock Skew is Only Bad?

October 2, 2008

We always have this fear of adding clock skew. Well, seems like this is one of the holy cows of digital design, but sometimes clock skew can be advantageous.

Take a look at the example below. The capturing flop would normally violate setup requirements due to the deep logic cloud. By intentionally adding delay we could help make the clock arrive later and thus meet the setup condition. Nothing comes for free though, if we have another register just after the capturing one, the timing budget there will be cut.

This technique can also be implemented on the block level as well. Assume we have two blocks A and B. B’s signals, which are headed towards A, are generated by a deep logic cloud. On the other hand A’s signals, which arrive at B, are generated by a rather small logic cloud. Skewing the clock in the direction of A now, will give more timing budget for the B to A signals but will eat away the budget from A to B’s signals.

Inserting skew is very much disliked by physical implementation guys although a lot of the modern tools know how to handle it very nicely and even account for the clock re-convergence pessimism (more on this in another post). I have the feeling this dislike is more of a relic of the past, but as we push designs to be more complex, faster, less power hungry etc. we have to consider such techniques.

## Arithmetic Tips and Tricks #2 – Another Look at a Slow Adder

August 18, 2008

Do you remember the old serial adder circuit below? A stream of bits comes in (LSB first) on the FA inputs, the present carry-out bit is registered and fed in the next cycle as a carry in. The sum comes in serially on the output (LSB first).

True, it is rather slow – it takes n cycles to add n bits. But hold on, check out the logic depth – one full adder only!! This means the clock can run a lot faster than your typical n-bit adder.
Moreover, it is by far the smallest, cheapest and consumes the least power of all adders known to mankind.

Of course you gotta have this high speed clock available in the system already, and you still gotta know when to stop adding and to sample your result.
Taking all this into consideration, I am sure this old nugget can still be useful somewhere. If you already used it before, or have an idea, place a comment.

## Predictive Synchronizers

July 7, 2008

As we discussed many times before, synchronization of signals involves also latency issues. Sometimes these latency issues are quite a mess. This post will go over the principle of operation of predictive synchronizers, which offer a specific solution for a very specific case.

Let’s start by describing the conditions for this specific case. For the sake of explanation let us assume we have two clock domains with different clock periods. On top we have a certain limited or capped jitter component defined by our spec.
Taking the conservative approach, we would always use a full two flip flop synchronizer. However, a closer look at a typical waveform reveals something interesting.

The figure above shows both clocks. The limited jitter as defined by our spec, is shown in gray. Notice how only during specific periods a full synchronizer needs to be used. For the upper clock each 5th cycle is a “dangerous” one, while for the lower clock each 4th is problematic. The time window in which these danger zones occur is predictable.

In general we could count the clock cycles, and then, when the next clock edge occurs in the “danger zone” we could switch and use a full synchronizer circuit, otherwise a single flop is enough.

A circuit which implements this idea can be seen below. The potentially metastable node is blocked by the FSM during the “danger time” and the synchronizer output is taken, otherwise the normal, first flop’s output, is taken. The logic at the output is basically that of a MUX.

## Non-Power-of-2 Gray Counter Design

June 9, 2008

So… you want to design a counter with a cycle which is different than a power of 2. You would like to use a Gray counter because of its advantages and just because it is simply beautiful, but alas, your cycle length is not a power of two – what to do?
This post will try to give you a sort of recipe of how to design such a non-power-of-2 Gray counter and the reasoning behind.

First, if your cycle length is an odd number, you are in trouble since this is just not possible to construct a counter with the Gray properties and with an odd cycle length. A simple way to see why it is so, is to notice that a Gray counter changes its parity with each count because only one bit changes at a time.
This naturally means that the parity toggles, but since we have an odd number of states and if we started with even parity – the last state will also have odd parity, and when we wrap around the parity won’t change! Assuming that the first and last states are different, this means that 2 bits must change at a time, thus contradicting the Gray hypothesis.

OK, so we limited ourselves to an even amount of states, is it possible now? It is! We could ask our friend Google and come up with some info and even some patents, but the best discussion on the subject that I found was written by Clive Maxfield here.

When approaching this problem, what (hopefully) should immediately struck us, is that we have to somehow use the reflection property of the Gray code (This method among others is discussed by Clive as well). Let’s take a deeper look at the 4-bit Gray code below.

The pairs of states which have identical distance from the axis of reflection are only different by their MSB. This in turn, means that we could eliminate pairs-at-a-time around the axis of reflection, and arrive to our desired number of states for the counter. Moreover, we notice that the (n-1) LSBs count up to a certain value then change direction and count down again. This property remains true even if we remove any amount of pairs around the axis of reflection.

What we have to do now, is to find this “switching value”, when we reach it on the up-count, toggle the direction bit – which is also our MSB, and block the (n-1) LSBs Gray counter for this direction switch cycle (otherwise 2 bits would change). We now count down to the initial state (all zeros). When we reach it, we again have to switch direction and block the counter and so on ad infinitum.

We can use the modular up/down Gray counter I described here, here and here. for our (n-1) LSBs. We have to find a priori the “switching value”, which is the (n-1) bit Gray value of our number of counter states divided by 2. For Example, if you want a 10 state Gray counter then: 10/2 = 5, therefore we need the 5th Gray value of a normal 3 bit Gray code, which turns out to be 110
The rest of the circuit is depicted in the figure below:

It is important to see that we use the minimal possible memory elements required for the Gray counter (i.e. no extra states to remember or pipeline) and that during “direction switching” we gate the clock for the (n-1) LSBs up/down Gray counter using an ordinary clock gate construct.
If we look carefully we see that the “direction switching” logic is basically a mux structure with the select being the direction bit.

A timing diagram of the above circuit for a 10 state Gray counter is also depicted below for clarity.

May 23, 2008

I was recently talking to some friends, and they mentioned some problems they encountered after tape out. Turns out that, the suspicious part of the design was done full custom and the designers thought it would be best to save some power and area and use asynchronous ripple counters like the one pictured below. The problem was that those counters were later fed into a semi-custom block – the rest is history.

Asynchronous ripple counters are nice and great but you really have to be careful with them. They are asynchronous because not all bits change at the same time. For the MSB to change the signal has to ripple through all the bits to its right, changing them first. The nice thing about them is that they are cheap in area and power. This is why they are so attractive in fast designs, but this is also why they are very dangerous because the ripple time through the counter can approach the order of magnitude of the clock period. This means that a digital circuit that depends on the asynchronous ripple counter as an input might violate the setup-hold window of the capturing flop behind it. To sum up, just because it is coming from a flop doesn’t mean it has to be synchronous.

If you can, even if you are a full custom designer, I strongly recommend replacing your ripple counters with the following almost identical circuit.

It is based on T-flops (toggle flip flops are just normal flops with an XOR of the current state and the input, which is also called the toggle signal) and from the principle of operation is almost the same, although here instead of generating the clock edge for the next stage, we generate a toggle signal when all previous (LSB) are “1”. Notice that the counter is synchronous since the clock signal (marked red) arrives simultaneously to all flops.

## Latch Based Design Robustness

May 19, 2008

Latch based design is usually not given enough attention by digital designers. There are many reasons for that, some are very well based, other reasons are just because latch based design is just unknown and looked at as a strange beast.

I intend to have a serious of posts concerning latch based design issues. I have to admit that most of my design experience was gained doing flip flop based designs, but the latch based designs I did were always interesting and challenging. If any of you readers have some interesting latch based design examples please send them over to my email and I will include them in later posts.

Just to start the ball rolling, here is a very interesting paper on the robustness of latch based design with comparison to flip flop based design. If you don’t have the time to go through the entire paper just look at figure 1 and its description.

I also added this paper to the list of recommended reading list.

## Ring Buffers

May 14, 2008

On the last technical post I discussed the problem of transferring information serially between two different clock domains with similar frequency but with drifting phase.
This post will try to explain how this issue is being solved.

When approaching this problem, we have to remember that the phase might drift over time and we have to quantify this drift before the design starts. Modeling the channel beforehand is very helpful here.
Once we know the needed margin, we can approach the design of the ring buffer.

The ring buffer is a FIFO with both ends “tied together” as depicted below. Pointers designate the read and write position and are moved with each respected clock signal in the direction of the arrow (in the figure below – clockwise). Remember, the read and write pointers move at different times but the overall rate of change of both is similar. This means that in some moment one can move ahead of the other, and in another it can lag behind, but over time the the amount of clock edges is the same.

The tolerance of the ring buffer is represented below with the dashed arrows. The read and write clocks can drift in time up to a point just before they meet and cross each other.

The series of images below depicts how the read and write pointers move with time and how the buffer is filled with new information (green) and how it is read (red). Notice how the first two reads will generate garbage because it reads out information that was not written into the buffer.

One of the most complicated issues is the start-up of the ring buffer, because both clock domains are unrelated. A certain “start” signal has to be generated and “tell” both pointers to start to advance. If this is not done carefully enough, one pointer will start to advance ahead of time and thus “bite away” some of the margin we designed for. This problem is even more complicated, when a lot of channels with different ring buffers are operated in parallel.

In one of the next posts we will explore a simple technique that enables us to determine if the ring buffer failed and the information read is actually one which is not updated.

## Clock Domain Crossing – An Important Problem

April 21, 2008

Sometimes, when crossing clock domains, synchronizers are just not enough.

Imagine sending data serially over a single line and receiving it on the other side from the output of a common synchronizer as shown bellow.

Assuming one clock cycle is enough to recover from metastability under the given operating conditions, what seems to be the main problem is not the integrity of the signal – i.e. making sure it is not propagating metastability through the rest of the circuit – but rather the correctness of the data.

Let’s observe the waveform below. The red vertical lines represent the sampling point of the incoming signal. We see from the waveform that since sometimes we sample during a transition – in effect violating the setup-hold window – the output of the first sampling flop (marked “x“) goes metastable. This metastability does not propagate further into the circuit, it is effectively blocked by the second flop, but since the result of recovery from metastability is not certain (see previous post) the outcome might be a corrupt data.
In this specific example we see that net x goes metastable after sampling the 3rd bit but recovers correctly. In a later sampling, for the 6th bit we see that the recovered outcome is not correct and as a result the output data is wrong.

Another interesting case is when both the send clock and the receive clock are frequency locked but their phase might drift in time or the clock signals might experience occasional jitter.
In that case, a bit might “stretch” or “shrink” and can be accidentally sampled twice or not sampled at all.
The waveform below demonstrates the problem. Notice how bit 2, was stretched and sampled twice.

To sum up, never use a simple synchronizer structure to transfer information serially between clock domains, even if they are frequency locked. You might be in more trouble than you initially thought.

On the next post we will discuss how to solve this problem with ring buffers (sometimes mistakenly called FIFOs).

## Resource Sharing vs. Performance

June 27, 2007

I wanted to spend a few words on the issue of resource sharing vs. performance. I believe it is trivial for most engineers but a few extra words won’t do any harm I guess.
The issue is relevant most evidently when there is a need to perform a “heavy” or “expensive” calculation on several inputs in a repeated way.

The approaches usually in consideration are: building a balanced tree structure, sequencing the operations, or a combination of the two.

A tree structure architecture is depicted below. The logic cloud represents the “heavy” calculation. One can see immediately that the operation on a,b and c,d is done in parallel and thus saves latency on the expense of instantiating the logic cloud twice.

The other common solution, depicted below, is to use the logic cloud only once but introducing a state machine which controls a MUX, that determines which values will be calculated on the next cycle. The overhead of designing this FSM is minimal (and even less). The main saving is in using the logic cloud only once. Notice that we pay here in throughput and latency! With some more thinking, one could also save a calculation cycle by introducing another MUX in the feedback path, and using one of the inputs just for the first calculation, thereafter using always the feedback path.