Resource Sharing vs. Performance

June 27, 2007

I wanted to spend a few words on the issue of resource sharing vs. performance. I believe it is trivial for most engineers but a few extra words won’t do any harm I guess.
The issue is relevant most evidently when there is a need to perform a “heavy” or “expensive” calculation on several inputs in a repeated way.

The approaches usually in consideration are: building a balanced tree structure, sequencing the operations, or a combination of the two.

A tree structure architecture is depicted below. The logic cloud represents the “heavy” calculation. One can see immediately that the operation on a,b and c,d is done in parallel and thus saves latency on the expense of instantiating the logic cloud twice.

The other common solution, depicted below, is to use the logic cloud only once but introducing a state machine which controls a MUX, that determines which values will be calculated on the next cycle. The overhead of designing this FSM is minimal (and even less). The main saving is in using the logic cloud only once. Notice that we pay here in throughput and latency! With some more thinking, one could also save a calculation cycle by introducing another MUX in the feedback path, and using one of the inputs just for the first calculation, thereafter using always the feedback path.

Puzzle #4 – Solution

June 24, 2007

Here are the block diagrams for the solution of the MinMax problem.

Puzzle #6 – The Spy – (A real tough one…)

June 22, 2007

This one I heard a while back and saw that a version of it also appears in Peter Winkler’s excellent book Mathematical Puzzles – A Connoisseur’s Collection. Here is the version that appears in the book:

A spy in an enemy country wants to transmit information back to his home country.
The spy wants to utilize the enemy country’s daily morning radio transmission of 15-bits (which is also received in his home country). The spy is able to infiltrate the radio station 5 minutes before transmission time, analyze the transmission that is about to go on air, and can either leave as it is, or flip a single bit somewhere in the transmission (a flip of more than one bit would make the original transmission too corrupt).

how much information can the spy transmit to his operators?

remember:

• The transmission is most likely a different set of 15-bits each day but can also repeat the last day’s transmission. Best, assume it is random
• The spy is allowed to change a maximum of 1 bit in any position
• The spy has agreed on an algorithm/strategy with his operators before he was sent to the enemy country
• No other information or communication is available. the communication is strictly one way
• The spy sees for the first time the intended daily transmission 5 minutes before it goes on the air, he does not hold a list of all future transmissions
• The information on the other end should be extracted in a deterministic way

I believe this one is too tough for an interview question – it took me well over an hour to come up with a solution (well, that actually doesn’t say much…). Anyways, this is definitely one of my favorite puzzles.

June 19, 2007

I actually enjoy surfing the web and reading technical papers which are somewhat related to my work. A lot of the good stuff appears in books, but if you want to find the coolest techniques and breakthrough ideas, they naturally first appear in technical papers.

I have to admit I don’t like the format used by the standard technical papers, some of them seem to be made non-readable on purpose. Here is a real paper that can compete for the dubious title of being the most non-readable paper around.

Here is one of my papers. Before you continue, stop and try digesting what was written…

If you made through the first page, consider yourself a hero. That “technical paper” was generated automatically using SCIgen.

I bet a lot of people would be impressed if you present a list of papers generated by this service. A sort of a high-tech “emperor’s new cloths” syndrome – no one wants to admit he doesn’t understand a technical paper describing some “major” work in his own field…

Puzzle #3 – Solution

June 19, 2007

This post is written only for completeness reasons. The answer to puzzle #3 was almost immediately given in the comments. I will just repeat it here.
The important observations are that XOR (X,X) = 0 and that XOR(X,0) = X The solution is therefore:

```Operation       Result
---------------------------------
X = XOR(,)     X^Y,Y
Y = XOR(,)     X^Y,X^Y^Y = X
X = XOR(,)     X^X^Y = Y,X  done!
```

Low Power Techniques – Reducing Switching

June 15, 2007

In one of the previous posts we discussed a cool technique to reduce leakage current. This time we will look at dynamic power consumption due to switching and some common techniques to reduce it.

Usually, with just a little bit of thinking, reduction of switching activity is quite possible. Let’s look at some examples.

• Bus inversion
• Bus inversion is an old technique which is used a lot in communication protocols between chip-sets (memories, processors, etc.), but not very often between modules within a chip. The basic idea is to add another line to the bus, which signals whether to invert the entire bus (or not). When more than half of the lines needs to be switched the bus inversion line is asserted. Here is a small example of a hypothetical transaction and the comparison of amount of transitions between the two schemes.

If you studied the above example a bit, you could immediately see that I manipulated the values in such a way that a significant difference in the total amount of transitions is evident.

• Binary Number Representation
• The two most common binary number representation in applications are 2′s complement and signed magnitude, with the former one usually preferred. However, for some very specific applications signed digit shows advantages in switching. Imagine you have a sort of integrator, which does nothing more than summing up values each clock cycle. Imagine also that the steady state value is around 0, but fluctuations above and below are common. If you would use 2′s complement going from 0 to -1 will result in switching of the entire bit range (-1 in 2′s complement is represented by 111….). If you would use signed digit, only 2 bits will switch when going from 0 to -1.

• Disabling/Enabling Logic Clouds
• When handling a heavy logic cloud (with wide adders, multipliers, etc.) it is wise to enable this logic only when needed.
Take a look at the diagrams below. On the left implementation, only the flop at the end of the path – flop “B” has an enable signal, since flop “A” could not be gated (its outputs are used someplace else!) the entire logic cloud is toggling and wasting power. On the right (no pun intended) implementation, the enable signal was moved before the logic cloud and just for good measures, the clock for flop “B” was gated.

• High Activity Nets
• This trick is usually completely ignored by designers. This is a shame since only power saving tools which can drive input vectors on your design and run an analysis of the active nets, might be able to resolve this.
The idea here is to identify the nets which have high activity among other very quiet nets, and to try to push them as deep as possible in the logic cloud.

On the left, we see a logic cloud which is a function of X1..Xn,Y. X1..Xn change with very low frequency, while Y is a high activity net. On the implementation on the right, the logic cloud was duplicated, once assuming Y=0 and once for Y=1, and then selecting between the 2 options depending on the value of Y. Often, the two new logic clouds will be reduced in size since Y has a fixed value there.

A Short Note on Automatic Clock Gates Insertion

June 13, 2007

As we discussed before, clock gating is one of the most solid logic design techniques, which one can use when aiming for low power design.
It is only natural that most tools on the market support an automatic clock gating insertion option. Here is a quote from a synopsys article describing their power compiler tool

…Module clock gating can be used at the architectural level to disable the clock to parts of the design that are not in use. Synopsysâ€™ Power Compilerâ„˘ helps replace the clock gating logic inserted manually, gating the clock to any module using an Integrated Clock Gating (ICG) cell from the library. The tool automatically identifies such combinational logic…

But what does it really mean? What is this combinational logic that the tool “recognizes”?

The answer is relatively simple. Imagine a flip-flop with an enable signal. Implementation wise, this is done with a normal flip-flop and a MUX before with a feedback path to preserve the logical value of the flop when the enable is low. This is equivalent to a flop with the MUX removed and the enable signal controlling the enable of a clock gate cell, which in turn drives the clock for the flip-flop.

The picture below is better than any verbal explanation.

Puzzle #5 – Binary-Gray

June 12, 2007

Assuming you have an n-bit binary counter, made of n identical cascaded cells, which hold the corresponding bit value. Each of the binary cells dissipates a power of P units only when it toggles.
You also have an n-bit Gray counter made of n cascaded cells, which dissipates 3P units of power per cell when it toggles.

You now let the counters run through an entire cycle (2^n different values) until they return to their starting position. Which counter burns more power?

Low Power – Clock Gating Is Not The End Of It…

June 12, 2007

A good friend of mine, who works for one of the micro-electronics giants, told me how low power is the buzz word today. They care less about speed/frequency and more about minimizing power consumption.

He exposed me to a technique in logic design I was not familiar with. It is basically described in this paper. Let me just give you the basic idea.

The main observation is that even when not active, logic gates have different leakage current values depending on their inputs. The example given in the article shows that a NAND gate can have its leakage current reduced by almost a factor of 2.5 depending on the inputs!
How is this applied in reality? Assume that a certain part of the design is clock gated, this means all flip-flops are inactive and in turn the logic clouds between them. By “muxing” a different value at the output of the flop, which is logic dependent, we could minimize the leakage through the logic clouds. When waking up, we return to the old stored value.

The article, which is not a recent work by the way, describes a neat and cheap way of implementing a storage element with a “sleep mode” output of either logic “1″ or logic “0″. Notice that the “non-sleep mode” or normal operation value is still kept in the storage element. The cool thing is, that this need not really be a true MUX in the output of the flop – after finalizing the design an off-line application analyzes the logic clouds between the storage elements and determines what values are needed to be forced during sleep mode at the output of each flop. Then, the proper flavor of the storage element is instantiated in place (either a “sleep mode” logic “0″ or a “sleep mode” logic “1″).

It turns out that the main problem is the analysis of the logic clouds and that the complexity of this problem is rather high. There is also some routing overhead for the “sleep mode” lines and of course a minor area overhead.
I am interested to know how those trade-offs are handled. As usual, emails and comments are welcome.

Bottom line – this is a way cool technique!!!

Puzzle #4 – The min-max question

June 8, 2007

Here is a question you are bound to stumble upon in one of your logic design job interviews, why? I don’t know, I personally think it is pretty obvious, but what do I know…

MinMax2 is a component with 2 inputs – A and B, and 2 outputs – Max and Min. You guessed it, you connect the 2 n-bit numbers at the inputs and the component drives the Max output with the bigger of the two and the Min output with the smaller of the two.

Your job is to design a component – MinMax4, with 4 inputs and 4 outputs which sorts the 4 numbers using only MinMax2 components. Try to use as little as possible MinMax2 components.

If you made it so far, try making a MinMax6 component from MinMax2 and MinMax4 components.

For bonus points – how many different input sequences are needed to verify the logical behavior of MinMax4?