## Reordering Nets for Low Power

May 10, 2009

As posts accumulate, you can see that low power design aspects is a big topic on this site. I try to bring more subtle design examples for lower power design that you can control and implement (i.e. in RTL and the micro architectural stage).

Identifying “glitchy” nets is not always easy. Some good candidates are wide parity or CRC calculations (deep and wide XOR trees), complicated arithmetic paths and basically most logic that originates in very wide buses and converges to a single output controlling a specific path (e.g. as a select pin of a MUX for a wide data path).

If you happen to identify a good candidate, it is advisable (when possible) that you feed the “glitchy” nets as late as possible in the calculation path. This way the total amount of toggling in the logic is reduced.

Sounds easy enough? Well the crux of the problem is identifying those opportunities – and it is far from easy. I hope this post at least makes you more aware of that possibility.

To sum up, here are two figures that illustrate the issue visually. The figure below depicts the situation before the transformation.. The nets which are highlighted represent high activity nets.

After the transformation – pushing the glitchy net calculation late in the path. The transformation is logically equivalent (otherwise there is no point…) and we see less high activity nets.

## Reducing Power Through Retiming

February 23, 2009

Here is an interesting and almost trivial technique for (potential) power reduction, which I never used myself, nor seen used in others’ designs. Well… maybe I am doing the wrong designs… but I thought it is well worth mentioning. So, if any of my readers use this, please do post a short comment on how exactly did you implement it and if it really resulted in some significant savings.

We usually have many high activity nets in the design. They are in many cases toggling during calculation more than once per cycle. Even worse, they often drive long and high capacitive nets. Since, in a usual synchronous design (which 99% of us do), we only need the stable result once per cycle – when the calculation is done – we can just put a register to drive the high capacitive net. The register effectively blocks all toggling on that net (where it hurts) and allows it to change maximum one time per cycle.

The image below tells the whole story. (a) is before the insertion of the flop, (b) right after.

This is all nice, but just remember that in real life it can be quite hard to identify those special nets and those special high toggling logic clouds. Moreover, most of the time we cannot afford the flop for latency reasons. But if you happen to be in the early design phase and you know more or less your floor plan, think about moving some of those flops so they will reduce the toggling on those high capacitive nets.

## Arithmetic Tips and Tricks #2 – Another Look at a Slow Adder

August 18, 2008

Do you remember the old serial adder circuit below? A stream of bits comes in (LSB first) on the FA inputs, the present carry-out bit is registered and fed in the next cycle as a carry in. The sum comes in serially on the output (LSB first).

True, it is rather slow – it takes n cycles to add n bits. But hold on, check out the logic depth – one full adder only!! This means the clock can run a lot faster than your typical n-bit adder.
Moreover, it is by far the smallest, cheapest and consumes the least power of all adders known to mankind.

Of course you gotta have this high speed clock available in the system already, and you still gotta know when to stop adding and to sample your result.
Taking all this into consideration, I am sure this old nugget can still be useful somewhere. If you already used it before, or have an idea, place a comment.

## Low Power Methodology Manual

May 24, 2008

I recently got an email from Synopsys. It was telling me I can download a personalized copy of a “Low Power Methodology Manual”. Now, to tell you the truth, I sometimes get overly suspicious of those emails (not necessarily from Synopsys) and I didn’t really expect to find real value in the manual – boy, was I wrong, big time!

Here you get a very nice book (as pdf file), which has extremely practical advice. It does not just spend ink on vague concepts – you get excellent explanations with examples. And mind you, this is all for free.

Just rush to their site and download this excellent reference book that should be on the table of each digital designer.

The hard copy version can be bought here or here.

## The Principle Behind Multi-Vdd Designs

April 2, 2008

Multi-Vdd design is a sort of buzz word lately. There are still many issues involved before it could become a real accepted and supported design methodology, but I wanted to write a few words on the principle behind the multi-Vdd approach.

The basic idea is that by lowering the operating voltage of a logic gate we naturally also cut the power dissipation through the gate.
The price we pay is that gates operated by lower voltage are somewhat slower (exact factor is dependent on many factors).

The basic idea is to identify the non-critical paths and to power the gates in those paths with a lower voltage. Seen below are two paths, there is obviously less logic through the blue path than through the orange one and is therefore a candidate for being supplied with lower Vdd.

The idea looks elegant but as always the devil is in the details. There are routing overheads for the different power grids, level shifters must be introduced when two different Vdd logic paths converge to create a new logical function, new power source for the new Vdd must be designed and most important of all, there has to be support present by the CAD tools – if that doesn’t exist this technique will be buried.

## A Concise Guide to Why and How to Split your State Machines

September 8, 2007

So, why do we really care about state machine partitioning? Why can’t I have my big fatty FSM with 147 states if I want to?

Well, smaller state machines are:

1. Easier to debug and probably less buggy
2. More easily modified
3. Require less decoding
4. Are more suitable for low power applications
5. Just nicer…

There is no rule of thumb stating the correct size of an FSM. Moreover, a lot of times it just doesn’t make sense to split the FSM – So when can we do it? or when should we do it? Part of the answer lies in a deeper analysis of the FSM itself, its transitions and most important, the probability of occupying specific states.

Look at the diagram below. After some (hypothetical) analysis we recognize that in certain modes of operation, we spend either a lot of time among the states marked in red or among the states marked in blue. Transitions between the red and blue areas are possible but are less frequent.

The trick now, is to look at the entire red zone as one state for a new “blue” FSM, and vice versa for the a new “red” FSM. We basically split the original FSM into two completely separate FSMs and add to each of the FSM a new state, which we will call a “wait state”. The diagram below depicts our new construction.

Notice how for the “red” FSM transitioning in and out of the new “wait state” is exactly equivalent (same conditions) to switching in and out of the red zone of the original FSM. Same goes for the blue FSM but the conditions for going in and out of the “wait state” are naturally reversed.

OK, so far so good, but what is this good for? For starters, it would probably be easier now to choose state encodings for each separate FSM that will reduce switching (check out this post on that subject). However, the sweetest thing is that when we are in the “red wait state” we could gate the clock for the rest of the red FSM and all its dependent logic! This is a significant bonus, since although previously such strategy would have been possible, it would just be by far more complicated to implement. The price we pay is additional states which will sometimes lead to more flip-flops needed to hold the current state.

As mentioned before, it is not wise to just blindly partition your FSMs arbitrarily. It is important to try to look for patterns and recognize “regions of operation”. Then, try to find transitions in and out of this regions which are relatively simple (ideally one condition to go in and one to go out). This means that sometimes it pays to include in a “region” one more state, just to make the transitioning in and out of the “region” simpler.

Use this technique. It will make your FSMs easy to debug, simple to code and hopefully will enable you to introduce low power concepts more easily in your design.

## FSM State Encoding – More Switching Reduction Tips

September 4, 2007

I promised before to write some words on reducing switching activity by cleverly assigning the states of an FSM, so here goes…

Look at the example below. The FSM has five states “A”-“E”. Most naturally, one would just sequentially enumerate them (or use some enumeration scheme given by VHDL or Veriog – which is easier for debugging purposes).
In the diagram the sequential enumeration is marked in red. Now, consider only the topology of the FSM – i.e. without any reference to the probability of state transitions. You will notice that the diagram states (pun intended) in red near each arc the amount of bits switching for this specific transition. For example, to go from state “E” (100) to state “B” (001), two bits will toggle.

But could we choose a better enumeration scheme that will reduce the amount of switching? Turns out that yes (don’t tell anybody but I forced this example to have a better enumeration ðŸ™‚ ). If you look at the green state enumeration you will clearly see that at most only one bit toggles for every transition.

If you sum up all transitions (assuming equal probability) you would see that the green implementation toggles exactly half the time as the red. An interesting point is that we need only to consider states “B” – “E”, because once state “A” is exited it can never be returned to (this is sometimes being referred to as “black hole” or “a pit”).

The fact that we chose the states enumeration more cleverly doesn’t only mean that we reduced switching in the actual flip-flops that hold the state itself, but we also reduce glitches/hazards in all the combinational logic that is dependent on the FSM! The latter point is extremely important since those combinational clouds can be huge in comparison to the n flops that hold the state of the FSM.

The procedure on choosing the right enumeration deserve more words but this will become a too lengthy post. In the usually small FSMs that the average designer handles on a daily basis, the most efficient enumeration can be easily reached by trial and error. I am sure there is somewhere some sort of clever algorithm that given an FSM topology can spit out the best enumeration. If you are aware of something like that, please send me an email.

## Low Power Buses – More Tricks for Switching Reduction

August 31, 2007

Viewing the search engine key words which people use to get to this blog, I can see that low power tips and tricks are one of the most interesting topics for people. Before I start this post, it is important to mention that although there is almost always something to do, the price can be great. Price doesn’t always mean in area or complexity, sometimes it is just your own precious time. You can spend tons of time thinking on a very clever architecture or encoding for a bus but you might miss your dead lines all together.
OK, enough of me blubbering about “nonsense”, let’s get into some more switching reduction tricks.

Switching reduction means less dynamic power consumption, this has little to do with static power or leakage current reduction. When thinking of architectures or designing a block in HDL (verilog or VHDL) this is the main point we can tackle though. There is much less we could do about static power reduction by using various HDL tricks. This can be left to the system architect, our standard cell library developers or our FPGA vendor.

• Bus States Re-encoding
• Buses usually transfer information across a chip, therefore in a lot of cases they are wide and long. Reduction of switching on a wide or long bus is of high importance. Assume you already have a design in a late stage which is already pretty well debugged. Try running some real life cases and extract what are the most common transitions that occur on the bus. If we got a 32-bit bus that switches a lot between 00…00 and 11…11 we know it is bad. It is a good idea to re-encode the state 11…11 into 00…01, for example. Then, decode it back on the other side. We would save the switching of 31 bits in this case. This is naturally a very simple case, but analyze your system, these things happen in real life and are relatively easy to solve – even on a late stage of a design! If you read this blog for sometime now, you probably know that I prefer visualization. The diagram below summarizes the entire paragraph.

• Exploiting Special Cases – Identifying Patterns
• Imagine this, you have a system which uses a memory. During many operation stages you have to dump some contents into or out of the memory element. This is done by addressing the memory address by address in a sequential manner. We probably can’t do much about the data, since it is by nature random but what about the address bus? We see a pattern that repeats over and over again: an address is followed by the next. We could add another line which tells the other side to increment the previous address given to him. This way we save the entire switching on the bus when sweeping through the entire address range.
The diagram below gives a qualitative solution of how an approach like this would work. If you are really a perfectionist, you could gate the clock to the bus sampling flops which reserve the previous state, because their value is only important when doing the increments. You would just have to be careful on some corner cases.

Generally speaking it is always a good idea to recognize patterns and symmetry and exploit it when transmitting information on a bus. Sometimes it can be the special numbering system being used, or a specific sequence which is often used on a bus or a million different other things. The point is to identify the trade off between investing a lot of investigations and the simplicity of the design.

On one of the future posts, we will investigate how we could use the same switching reduction techniques for FSM state assignments, so stay tuned.

## Arithmetic Tips & Tricks #1

August 1, 2007

Every single one of us had sometime or another to design a block utilizing some arithmetic operations. Usually we use the necessary operator and forget about it, but since we are “hardware men” (should be said with pride and a full chest) we know there is much more going under the hood. I intend to have a series of posts dealing specifically with arithmetic implementation tips and tricks. There are plenty of them, I don’t know all, probably not even half. So if you got some interesting ones please send them to me and I will post them with credits.

Let’s start. This post will explain 2 of the most obvious and simple ones.

• Multiplying by a constant
• Multipliers are extremely area hungry and thus when possible should be eliminated. One of the classic examples is when multiplying by a constant.
Assume you need to multiply the result of register A by a factor, say 5. Instead of instantiating a multiplier, you could “shift and add”. 5 in binary is 101, just add A to A00 (2 trailing zeros, have the effect of multiplying by 4) and you have the equivalent of multiplying by 5, since what you basically did was 4A+A = 5A.
This is of course very simplistic, but when you write your code, make sure the constant is not passed on as an argument to a function. It might be that the synthesis tool knows how to handle it, but why take the risk.

• Sometimes (or even often), we need to add two values where one is much smaller than the other and bounded. For example adding a 3 bit value to a 32 bit register. The idea here is not to be neat and pad the 3 bit value by leading zeros and create by force a 32 bit register from it. Why? adding two 32 bit values instantiates full adder logic on all the 32 bits, while adding 3 bits to 32 will infer a full adder logic on the 3 LSBs and an increment logic (which is much faster and cheaper) on the rest of the bits. I am quite positive that today’s synthesis tools know how to handle this, but again, it is good practice to always check the synthesis result and see what came up. If you didn’t get what you wanted it is easy enough to force it by coding it in such a way.

## Low Power Techniques – Reducing Switching

June 15, 2007

In one of the previous posts we discussed a cool technique to reduce leakage current. This time we will look at dynamic power consumption due to switching and some common techniques to reduce it.

Usually, with just a little bit of thinking, reduction of switching activity is quite possible. Let’s look at some examples.

• Bus inversion
• Bus inversion is an old technique which is used a lot in communication protocols between chip-sets (memories, processors, etc.), but not very often between modules within a chip. The basic idea is to add another line to the bus, which signals whether to invert the entire bus (or not). When more than half of the lines needs to be switched the bus inversion line is asserted. Here is a small example of a hypothetical transaction and the comparison of amount of transitions between the two schemes.

If you studied the above example a bit, you could immediately see that I manipulated the values in such a way that a significant difference in the total amount of transitions is evident.

• Binary Number Representation
• The two most common binary number representation in applications are 2’s complement and signed magnitude, with the former one usually preferred. However, for some very specific applications signed digit shows advantages in switching. Imagine you have a sort of integrator, which does nothing more than summing up values each clock cycle. Imagine also that the steady state value is around 0, but fluctuations above and below are common. If you would use 2’s complement going from 0 to -1 will result in switching of the entire bit range (-1 in 2’s complement is represented by 111….). If you would use signed digit, only 2 bits will switch when going from 0 to -1.

• Disabling/Enabling Logic Clouds
• When handling a heavy logic cloud (with wide adders, multipliers, etc.) it is wise to enable this logic only when needed.
Take a look at the diagrams below. On the left implementation, only the flop at the end of the path – flop “B” has an enable signal, since flop “A” could not be gated (its outputs are used someplace else!) the entire logic cloud is toggling and wasting power. On the right (no pun intended) implementation, the enable signal was moved before the logic cloud and just for good measures, the clock for flop “B” was gated.

• High Activity Nets
• This trick is usually completely ignored by designers. This is a shame since only power saving tools which can drive input vectors on your design and run an analysis of the active nets, might be able to resolve this.
The idea here is to identify the nets which have high activity among other very quiet nets, and to try to push them as deep as possible in the logic cloud.

On the left, we see a logic cloud which is a function of X1..Xn,Y. X1..Xn change with very low frequency, while Y is a high activity net. On the implementation on the right, the logic cloud was duplicated, once assuming Y=0 and once for Y=1, and then selecting between the 2 options depending on the value of Y. Often, the two new logic clouds will be reduced in size since Y has a fixed value there.