Archive for the ‘Low Power’ Category


Low Power Buses – More Tricks for Switching Reduction

August 31, 2007

Looking at the search engine keywords which bring people to this blog, I can see that low power tips and tricks are among the topics readers find most interesting. Before I start this post, it is important to mention that although there is almost always something that can be done, the price can be great. Price doesn’t always mean area or complexity; sometimes it is just your own precious time. You can spend tons of time thinking about a very clever architecture or encoding for a bus, but you might miss your deadlines altogether.
OK, enough of me blabbering about “nonsense”, let’s get into some more switching reduction tricks.

Switching reduction means less dynamic power consumption; it has little to do with static power or leakage current reduction. Dynamic power is, however, the main thing we can tackle when thinking about architectures or designing a block in HDL (Verilog or VHDL). There is much less we can do about static power reduction with HDL tricks; that can be left to the system architect, our standard cell library developers or our FPGA vendor.

  • Bus States Re-encoding
  • Buses usually transfer information across a chip, so in many cases they are wide and long. Reducing switching on a wide or long bus is therefore of high importance. Assume you already have a design at a late stage which is already pretty well debugged. Try running some real-life cases and extract the most common transitions that occur on the bus. If we have a 32-bit bus that switches a lot between 00…00 and 11…11, we know it is bad. It is a good idea to re-encode the state 11…11 into 00…01, for example, and decode it back on the other side. We would save the switching of 31 bits in this case. This is naturally a very simple case, but analyze your system; these things happen in real life and are relatively easy to solve – even at a late stage of a design! If you have read this blog for some time now, you probably know that I prefer visualization. The diagram below summarizes the entire paragraph.

    bus_reencoding.png
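    The re-encoding above is easy to model in software before committing it to RTL. Here is a minimal Python sketch (the traffic pattern and the swapped code words are hypothetical) that counts bus transitions with and without the re-encoding:

```python
def hamming(a, b):
    """Number of bus lines that toggle between two 32-bit words."""
    return bin((a ^ b) & 0xFFFFFFFF).count("1")

def reencode(word):
    """Hypothetical re-encoding: swap the frequent state 11...11 with 00...01.
    Swapping keeps the mapping bijective, so the decoder is the same function."""
    if word == 0xFFFFFFFF:
        return 0x00000001
    if word == 0x00000001:
        return 0xFFFFFFFF
    return word

# made-up traffic that ping-pongs between the two worst-case states
traffic = [0x00000000, 0xFFFFFFFF, 0x00000000, 0xFFFFFFFF]

raw = sum(hamming(a, b) for a, b in zip(traffic, traffic[1:]))
encoded = [reencode(w) for w in traffic]
enc = sum(hamming(a, b) for a, b in zip(encoded, encoded[1:]))

# raw == 96 toggles, enc == 3 toggles: 31 bits saved per transition
assert [reencode(w) for w in encoded] == traffic  # decoder recovers the data
```

    Since the mapping just swaps two code words, it is its own inverse, so the block on the other side of the bus is identical.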

  • Exploiting Special Cases – Identifying Patterns
  • Imagine this: you have a system which uses a memory. During many operation stages you have to dump contents into or out of the memory element. This is done by addressing the memory address by address in a sequential manner. We probably can’t do much about the data, since it is random by nature, but what about the address bus? We see a pattern that repeats over and over again: an address is followed by the next one. We could add another line which tells the other side to increment the previous address it was given. This way we save all the switching on the bus when sweeping through the entire address range.
    The diagram below gives a qualitative picture of how an approach like this would work. If you are really a perfectionist, you could gate the clock to the bus sampling flops which preserve the previous state, because their value is only important when doing the increments. You would just have to be careful with some corner cases.

    increment_bus.png
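    A quick behavioral Python model of the scheme (the signal names and the address stream are made up) shows the encode/decode pair; when the increment line is asserted, the address lines simply hold their previous value, so they contribute zero toggles:

```python
def transmit(addresses):
    """Encode an address stream as (incr, bus_value) pairs.
    When incr is set, the bus lines hold their previous value."""
    out, prev_addr, bus = [], None, 0
    for a in addresses:
        if prev_addr is not None and a == prev_addr + 1:
            out.append((1, bus))      # bus unchanged: no toggles on the wide bus
        else:
            bus = a                   # random access: drive the new address
            out.append((0, bus))
        prev_addr = a
    return out

def receive(pairs):
    """Decode: on incr, add one to the last reconstructed address."""
    addrs, prev = [], None
    for incr, bus in pairs:
        prev = prev + 1 if incr else bus
        addrs.append(prev)
    return addrs

seq = [0x40, 0x41, 0x42, 0x43, 0x10]
assert receive(transmit(seq)) == seq
```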

    Generally speaking, it is always a good idea to recognize patterns and symmetry and exploit them when transmitting information on a bus. Sometimes it is the special numbering system being used, or a specific sequence which often appears on the bus, or a million other things. The point is to weigh the trade-off between investing a lot of investigation time and keeping the design simple.

    In one of the future posts, we will investigate how we could use the same switching reduction techniques for FSM state assignments, so stay tuned.


    Arithmetic Tips & Tricks #1

    August 1, 2007

    Every single one of us has, at some time or another, had to design a block utilizing some arithmetic operations. Usually we use the necessary operator and forget about it, but since we are “hardware men” (should be said with pride and a full chest) we know there is much more going on under the hood. I intend to have a series of posts dealing specifically with arithmetic implementation tips and tricks. There are plenty of them; I don’t know all of them, probably not even half. So if you have some interesting ones, please send them to me and I will post them with credits.

    Let’s start. This post will explain two of the most obvious and simple ones.

  • Multiplying by a constant
  • Multipliers are extremely area hungry and thus should be eliminated when possible. One of the classic examples is multiplying by a constant.
    Assume you need to multiply the result of register A by a factor, say 5. Instead of instantiating a multiplier, you can “shift and add”. 5 in binary is 101, so just add A to A00 (the 2 trailing zeros have the effect of multiplying by 4) and you have the equivalent of multiplying by 5, since what you basically did was 4A+A = 5A.
    This is of course very simplistic, but when you write your code, make sure the constant is not passed on as an argument to a function. It might be that the synthesis tool knows how to handle that, but why take the risk?
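    A sketch of the “shift and add” idea for the factor-of-5 example (in Python just to check the arithmetic; in RTL the shift is free, it is only wiring):

```python
def times5(a):
    # 5 = 0b101, so 5*a = 4*a + a: one shift plus one add, no multiplier needed
    return (a << 2) + a

# holds for any value of register A
assert all(times5(a) == 5 * a for a in range(1024))
```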

  • Adding a bounded value
  • Sometimes (or even often) we need to add two values where one is much smaller than the other and bounded, for example adding a 3-bit value to a 32-bit register. The idea here is not to be neat and pad the 3-bit value with leading zeros, forcibly creating a 32-bit value from it. Why? Adding two 32-bit values instantiates full adder logic on all 32 bits, while adding 3 bits to 32 infers full adder logic on the 3 LSBs and increment logic (which is much faster and cheaper) on the rest of the bits. I am quite positive that today’s synthesis tools know how to handle this, but again, it is good practice to always check the synthesis result and see what came out. If you didn’t get what you wanted, it is easy enough to force it by coding it that way.
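    Here is a behavioral Python sketch of what the tools should infer (the split into a 3-bit full-adder stage and an increment-only stage is modeled explicitly; the result is bit-exact with a plain 32-bit add):

```python
def add_bounded(a32, b3):
    """Add a 3-bit value to a 32-bit register the cheap way:
    full-adder logic on the 3 LSBs, then only an incrementer for the carry."""
    low_sum = (a32 & 0x7) + (b3 & 0x7)                 # 3-bit full-adder stage
    carry = low_sum >> 3
    high = ((a32 >> 3) + carry) & ((1 << 29) - 1)      # increment-only stage (29 bits)
    return (high << 3) | (low_sum & 0x7)

# matches a plain 32-bit add (modulo 2**32)
for a in (0, 7, 0xFFFFFFF8, 0xDEADBEEF):
    for b in range(8):
        assert add_bounded(a, b) == (a + b) & 0xFFFFFFFF
```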


    Low Power Techniques – Reducing Switching

    June 15, 2007

    In one of the previous posts we discussed a cool technique to reduce leakage current. This time we will look at dynamic power consumption due to switching and some common techniques to reduce it.

    Usually, with just a little bit of thinking, reduction of switching activity is quite possible. Let’s look at some examples.

  • Bus inversion
  • Bus inversion is an old technique which is used a lot in communication protocols between chip-sets (memories, processors, etc.), but not very often between modules within a chip. The basic idea is to add another line to the bus which signals whether to invert the entire bus (or not). When more than half of the lines would need to switch, the bus inversion line is asserted. Here is a small example of a hypothetical transaction and a comparison of the number of transitions between the two schemes.

    bu_inversion.png
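    To make this concrete, here is a small Python model of bus inversion on a hypothetical 8-bit bus. It counts the toggles on the data lines plus the extra invert line, so the overhead of the added wire is included in the comparison:

```python
def popcount(x):
    return bin(x).count("1")

def bus_invert(words, width=8):
    """Encode a word stream with one extra 'invert' line: flip the whole
    bus whenever more than half of its lines would otherwise toggle."""
    mask, out, prev = (1 << width) - 1, [], 0
    for w in words:
        if popcount((w ^ prev) & mask) > width // 2:
            w = (~w) & mask          # send the inverted word instead
            out.append((1, w))
        else:
            out.append((0, w))
        prev = w                     # the bus now holds what we sent
    return out

def transitions(words, start=0):
    prev, n = start, 0
    for w in words:
        n += popcount(prev ^ w)
        prev = w
    return n

data = [0x00, 0xFF, 0x01, 0xFE]      # deliberately nasty traffic
plain_toggles = transitions(data)
encoded = bus_invert(data)
coded_toggles = transitions([w for _, w in encoded]) \
              + transitions([inv for inv, _ in encoded])  # invert line toggles too

assert coded_toggles < plain_toggles
# the receiver recovers the data by re-inverting when the line is set
assert [(~w & 0xFF) if inv else w for inv, w in encoded] == data
```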

    If you study the above example a bit, you will immediately see that I manipulated the values in such a way that a significant difference in the total number of transitions is evident.

  • Binary Number Representation
  • The two most common binary number representations in applications are 2’s complement and signed magnitude, with the former usually preferred. However, for some very specific applications signed magnitude shows advantages in switching. Imagine you have a sort of integrator which does nothing more than summing up values each clock cycle. Imagine also that the steady state value is around 0, but fluctuations above and below are common. If you use 2’s complement, going from 0 to -1 will result in switching of the entire bit range (-1 in 2’s complement is represented by 111…1). If you use signed magnitude, only 2 bits switch when going from 0 to -1.

    reduce_switching_coding.png
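    A two-line check of the claim, on a hypothetical 8-bit bus:

```python
def twos(v):
    """8-bit two's complement encoding."""
    return v & 0xFF

def sign_mag(v):
    """8-bit signed magnitude encoding: MSB is the sign bit."""
    return (0x80 | -v) if v < 0 else v

def toggles(a, b):
    return bin(a ^ b).count("1")

# 0 -> -1: the whole bus flips in 2's complement, only 2 bits in signed magnitude
assert toggles(twos(0), twos(-1)) == 8
assert toggles(sign_mag(0), sign_mag(-1)) == 2
```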

  • Disabling/Enabling Logic Clouds
  • When handling a heavy logic cloud (with wide adders, multipliers, etc.) it is wise to enable the logic only when needed.
    Take a look at the diagrams below. In the left implementation, only the flop at the end of the path – flop “B” – has an enable signal; since flop “A” cannot be gated (its outputs are used someplace else!) the entire logic cloud keeps toggling and wasting power. In the right (no pun intended) implementation, the enable signal was moved to before the logic cloud and, just for good measure, the clock for flop “B” was gated.

    reducing_switching_enable1.png reducing_switching_enable2.png

  • High Activity Nets
  • This trick is usually completely ignored by designers. This is a shame, although in fairness only power analysis tools which can drive input vectors through your design and analyze the activity of the nets may be able to expose the opportunity.
    The idea is to identify the nets which have high activity among otherwise very quiet nets, and to try to push them as deep as possible into the logic cloud.

    reducing_switching_high_activity_net1.png reducing_switching_high_activity_net2.png

    On the left, we see a logic cloud which is a function of X1..Xn and Y. X1..Xn change with very low frequency, while Y is a high-activity net. In the implementation on the right, the logic cloud was duplicated, once assuming Y=0 and once assuming Y=1, and the output is selected between the two options depending on the value of Y. Often, the two new logic clouds will be reduced in size since Y has a fixed value in each.
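    The right-hand implementation is essentially a Shannon expansion around Y. A small Python sketch with a made-up logic cloud shows the equivalence (in hardware, the two specialized clouds stay quiet as long as X1..Xn do, and only the final mux sees the high-activity net):

```python
from itertools import product

def f(x1, x2, x3, y):
    """A hypothetical logic cloud over slow inputs x1..x3 and a fast net y."""
    return (x1 & x2) ^ (x3 | y)

def f_mux(x1, x2, x3, y):
    """Shannon expansion around y: build the cloud twice with y fixed,
    then select between the two copies with y."""
    f_y0 = f(x1, x2, x3, 0)   # cloud specialized for y = 0
    f_y1 = f(x1, x2, x3, 1)   # cloud specialized for y = 1
    return f_y1 if y else f_y0

# the two implementations are logically identical
assert all(f(*v) == f_mux(*v) for v in product((0, 1), repeat=4))
```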


    Low Power – Clock Gating Is Not The End Of It…

    June 12, 2007

    A good friend of mine, who works for one of the micro-electronics giants, told me that low power is the buzzword today. They care less about speed/frequency and more about minimizing power consumption.

    He introduced me to a technique in logic design I was not familiar with. It is basically described in this paper. Let me just give you the basic idea.

    The main observation is that even when not active, logic gates have different leakage current values depending on their inputs. The example given in the article shows that a NAND gate’s leakage current can be reduced by almost a factor of 2.5 depending on its inputs!
    How is this applied in practice? Assume that a certain part of the design is clock gated; this means all flip-flops are inactive and, in turn, so are the logic clouds between them. By “muxing” a logic-dependent value onto the output of each flop, we can minimize the leakage through the logic clouds. When waking up, we return to the old stored value.

    The article, which is not recent work by the way, describes a neat and cheap way of implementing a storage element with a “sleep mode” output of either logic “1” or logic “0”. Notice that the “non-sleep mode”, or normal operation, value is still kept in the storage element. The cool thing is that this need not really be a true MUX at the output of the flop – after finalizing the design, an off-line application analyzes the logic clouds between the storage elements and determines which values need to be forced at the output of each flop during sleep mode. Then, the proper flavor of the storage element is instantiated in place (either a “sleep mode” logic “0” or a “sleep mode” logic “1”).

    It turns out that the main problem is the analysis of the logic clouds and that the complexity of this problem is rather high. There is also some routing overhead for the “sleep mode” lines and of course a minor area overhead.
    I am interested to know how those trade-offs are handled. As usual, emails and comments are welcome.

    Bottom line – this is a way cool technique!!!


    Big Chips – Some Low Power Considerations

    June 2, 2007

    As designers, especially ones who only code in HDL, we don’t normally take into account the physical size of the chip we are working on. There are many effects which surface only after the synthesis stage, when approaching layout.

    As usual, let’s look at an example. Consider the situation described on the diagram below.

    global_decoding.png

    Imagine that blocks A and B are located physically far from one another and cannot be placed closer together. If the speeds we are dealing with are relatively high, it may very well be that the flight time of the signals from one side of the chip to the other already becomes critical, and even a flop-to-flop connection without any logic in between will violate setup requirements!
    Now, imagine, as depicted, that many signals are sent across the chip. If you need to pipeline, you will need to pipeline many parallel lines. This may result in a lot of extra flip-flops. Moreover, your layout tool will have to insert a lot of buffers to keep the signal edges sharp. From an architectural point of view, decoding globally may sound attractive at first, since you only need to do it once, but it can lead to a very power hungry architecture.

    The alternative is to send as few long lines as possible across the chip, as depicted below.

    local_decoding.png

    With this architecture, block B decodes the logic locally. If the lines sent to block B also need to be spread all over the chip, we definitely pay by duplicating the decoding logic for each target block.

    There are no strict criteria for deciding between the former and the latter architectures, as there is no definite crossover point. I believe this is more of a feeling-and-experience thing. It is just important to keep it in mind when working on large designs.


    Do You Think Low Power???

    May 20, 2007

    There is almost no design today, where low power is not a concern. Reducing power is an issue which can be tackled on many levels, from the system design to the most fundamental implementation techniques.

    In digital design, clock gating is the backbone of low power design. It is true that there are many other ways a designer can influence power consumption, but IMHO clock gating is the easiest and simplest to introduce without a huge overhead or compromise.

    Here is a simple example on how to easily implement low power features.

    sync_fifo.png The picture on the right shows a very simple synchronous FIFO. This FIFO is a very common design structure which is easily implemented using a shift register. The data is pushed to the right with each clock, and the tap select decides which register to pick. The problem with this construction is that with each clock all the flip-flops potentially toggle, and a clock is driven to all of them. This hurts especially in data or packet processing applications, where the size of this FIFO can be in the range of thousands of flip-flops!

    sync_fifo_low_power.png The correct approach is, instead of moving the entire data around with each clock, to “move” the clock itself. Well, not really move it, but to keep only one specific cell (or row, in the case of vectors) active while all the other flip-flops are gated. This is done by using a simple counter (or a state machine for specific applications) that rotates a “one hot” signal – thus enabling only one cell at a time. Notice that the data_in signal is connected to all the cells in parallel. When new data arrives, only the cell which receives a clock edge at that moment will store the new value.
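    A behavioral Python model (the depth and the data stream are arbitrary) makes the difference visible by counting flip-flop bit toggles in both schemes:

```python
DEPTH = 8

def shift_fifo(stream):
    """Shift-register FIFO: every push moves all the data one cell along,
    so potentially every flop in the chain toggles each clock."""
    regs, toggles = [0] * DEPTH, 0
    for d in stream:
        new = [d] + regs[:-1]
        toggles += sum(bin(a ^ b).count("1") for a, b in zip(regs, new))
        regs = new
    return toggles

def onehot_fifo(stream):
    """Low-power version: the data stays put; a rotating one-hot enable
    means only the single selected cell can change per clock."""
    regs, ptr, toggles = [0] * DEPTH, 0, 0
    for d in stream:
        toggles += bin(regs[ptr] ^ d).count("1")
        regs[ptr] = d
        ptr = (ptr + 1) % DEPTH      # rotate the one-hot enable
    return toggles

stream = [0xAA, 0x55] * 32
assert onehot_fifo(stream) < shift_fifo(stream)
```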