Archive for the ‘Coding Style’ Category

Real World Examples #4 – More on “Thinking Hardware”

January 20, 2009

I was reviewing some code not so long ago and noticed, together with the owner of the code, that we had some timing problems.
Part of the code looked something like this (Verilog):


wire [127:0] a;
wire [127:0] b;
wire [127:0] c;
assign c = select_register ? a : b;

For those not familiar with Verilog syntax, the code describes a MUX construct using the ternary operator. The two data inputs for the MUX are “a” and “b” and the select is “select_register”.

So why did this code translate into a relatively slow design? The answer is in the width of the signals. The code actually synthesizes to 128 parallel MUX structures, so “select_register” actually drives 128 loads.
When a construct like this is hidden within a large body of code, our tendency is to dismiss it as “only” a 2:1 MUX deep, but we have to look more carefully than that – and always remember to consider the load.

Solving this problem is relatively easy by replication. Simply creating more copies of “select_register”, each driving a subset of the MUXes, helped significantly.
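As a sketch of what the fix might look like (the replicated signal names here are hypothetical, not from the original design), each copy of the select drives only part of the bus:

```verilog
wire [127:0] a;
wire [127:0] b;
wire [127:0] c;

// Two copies of the select, each loaded by only 64 MUXes.
// In the real fix these would be separate replicated flops;
// they are shown as wires here for brevity.
wire select_register_rep0;
wire select_register_rep1;

assign c[63:0]   = select_register_rep0 ? a[63:0]   : b[63:0];
assign c[127:64] = select_register_rep1 ? a[127:64] : b[127:64];
```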

A Coding Tip for Multi Clock Domain Designs

December 13, 2008

Multi clock domain designs are always interesting, but almost always hide some synchronization problems which are far from trivial. There are tools on the market that identify all(??) clock domain crossings within a design. I personally have no experience with them, so I can’t give an opinion (although I have heard some unflattering remarks from fellow engineers).

Each company seems to have its own way of handling this problem. One of the oldest, easiest and, IMHO, most efficient ways is to keep strict naming guidelines for your signals, whether combinatorial or sequential!

The most common way is to add a prefix to each signal which describes its driver clock e.g. clk_800_mux_32to1_out or clk_666_redge_acknowledge.

If you haven’t used this simple technique, you won’t believe how useful it is. Many synchronization problems are actually discovered during the coding process itself. Moreover, it even makes life easier when doing the code review.
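A minimal sketch of the convention (the clock prefixes and the two-flop synchronizer below are illustrative, not from any specific design):

```verilog
// Each signal carries the name of the clock that drives it.
reg clk_800_req;         // driven in the 800MHz domain
reg clk_666_req_meta;    // first synchronizer stage, 666MHz domain
reg clk_666_req_sync;    // safe to use in the 666MHz domain

always @(posedge clk_666) begin
  clk_666_req_meta <= clk_800_req;       // the crossing is explicit in the names
  clk_666_req_sync <= clk_666_req_meta;
end
```

The crossing point becomes immediately visible: any assignment where the right-hand side carries a different clock prefix than the left-hand side deserves a close look.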

If you have more tips on naming convention guidelines for signals in RTL – post them as a comment!

Another Reason to Add Hierarchies to Your Designs

November 30, 2008

We are usually very annoyed when the team leader insists on code restructuring and hierarchical design.
I know this very well from the other side as well – trying to convince designers to restructure a design which they already know so very well.

Well, here is another small, yet important reason why you might want to do this more often.
Assume your design is more or less mature: you ran some simulations, went through some synthesis runs, and you see that you don’t meet timing.
You analyze the timing report only to find a huge timing path, optimized by the tool and made of tons of NANDs, NORs, XORs and what not. You see the starting point and the end point very clearly, but you find yourself asking: is this the path that goes through the MUX, or maybe the one through the adder?

Most logic designs are extremely complicated, and the circuit is not something you can just draw immediately on paper. Moreover, looking at a timing report of optimized logic, it is very hard to interpret the exact path taken through the higher-level structures – or in other words, what part of the code am I really looking at here? Adding a hierarchy will also add its name to the optimized structures in the timing report, and you can then easily pinpoint your problems.

I even saw an engineer who uses this technique as a debugging tool. If he has a very deep logic cloud, he will intentionally build a hierarchy around, say, a simple 2:1 MUX in the design and look for it in the timing report. This enables him to “feel” how the synthesis tool optimizes the path and where manual optimization needs to be implemented.

Use this on your bigger blocks, it saves a lot of time and effort in the long run.
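A sketch of the idea (module and signal names are made up for illustration): wrapping even a trivial MUX in its own module makes its instance name appear in the cell names of the timing report.

```verilog
// This wrapper exists only to give the MUX a visible name in reports.
module mux2_marker (
  input         sel,
  input  [31:0] a,
  input  [31:0] b,
  output [31:0] y
);
  assign y = sel ? a : b;
endmodule

// Inside the design:
//   mux2_marker u_dbg_mux (.sel(sel), .a(path_a), .b(path_b), .y(path_y));
// Cells on this path now show up as ".../u_dbg_mux/..." in the timing report.
```

Note that for this to work the synthesis tool must preserve the hierarchy; depending on your flow you may need to prevent it from automatically ungrouping small modules.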

Fun With Enable Flip-Flops

October 27, 2008

Almost every library has enable flip-flops included. Unfortunately, they are not always used to their full potential. We will explore some of that potential in this post.

An enable flop is nothing but a regular flop which only registers new data if the enable signal is high, otherwise it keeps the old value. We normally implement this using a MUX and a feedback from the flop’s output as depicted below.

So what is the big deal about it? The nice thing is that the enable flop is already implemented by the guys who built the library in a very optimized way. Usually implementing this with a MUX before the flop will eat away from the cycle time you could otherwise use for your logic. However, a short glance at your library will prove that this MUX comes almost for free when you use an enable flop (for my current library the cost is 20ps).

So how can we use this to our advantage?

Example #1 – Soft reset coding

In many applications soft reset is a necessity. It is a signal usually driven by a register that will (soft) reset all flip flops given that a clock is running. Many times an enable “if” is also used in conjunction.
This is usually coded in this way (I use Verilog pseudo syntax and ask the forgiveness of you VHDL people):
always @(posedge clk or negedge hard_rst)
  if (!hard_rst)
    ff <= 1'b0;
  else if (!soft_rst)
    ff <= 1'b0;
  else if (en)
    ff <= D;

The above code usually results in the construction given in the picture below. The red arrow represents the critical timing path through a MUX and the AND gate that was generated for the soft reset.

Now, if we could only exchange the order of the last two “if” clauses, this would put the MUX in front of the AND gate and then we could use an enable flop… but if we do that, it will not be logically equivalent anymore. Thinking about it a bit harder, we can use a trick – let’s exchange the MUX and the AND gate, but during soft reset force the select pin of the MUX to “1”, thus transferring a “0” to the flop! Here’s the result in picture form.

We can now use an enable flop, and we basically get the MUX delay almost for free. This may look a bit petty to you, but this trick can save you a few extra precious tens or hundreds of picoseconds.
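In (pseudo) Verilog, the trick might be coded like this – a sketch which should be logically equivalent to the original code above:

```verilog
always @(posedge clk or negedge hard_rst)
  if (!hard_rst)
    ff <= 1'b0;
  else if (en || !soft_rst)   // soft reset forces the enable (MUX select) to "1"...
    ff <= D & soft_rst;       // ...while the AND gate forces a "0" into the flop
```

Checking the cases: when soft_rst is low the enable is forced high and the data is forced to 0; when soft_rst is high and en is high the flop takes D; otherwise it holds – exactly the original behavior, but now the AND sits in front of the enable flop’s built-in MUX.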

Example #2 – Toggle Flip Flops

Toggle flops are really neat, and there are many cool ways to use them. The normal implementation requires an XOR gate combining the T input and a feedback of the flop itself.

Let’s have a closer look at the logical implementation of an XOR gate and how it relates to a MUX implementation: (a) is a MUX gate equivalent implementation, (b) is an XOR gate equivalent implementation and (c) is an XOR implemented from a MUX.

Now, let’s try making a T flop using an enable flop. We saw already how to change the MUX into an XOR gate – all that is left, is to put everything together.
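In Verilog the end result is a one-liner (a sketch, with hypothetical signal names): since q ^ t equals t ? ~q : q, the toggle flop’s XOR maps directly onto the enable flop’s internal MUX.

```verilog
reg q;
always @(posedge clk or negedge rst_n)
  if (!rst_n)
    q <= 1'b0;
  else if (t)    // T input maps onto the enable pin
    q <= ~q;     // the enable MUX plus the inverted feedback forms the XOR
```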

Why You Don’t See a Lot of Verilog or VHDL on This Site

August 31, 2008

I get a lot of emails from readers from all over the world. Many want me to help them with their latest design or problem. This is OK; after all, this is what this site is all about – giving tips and tricks and helping other designers make their way through the complex world of ASIC digital design.

Many ask me for solutions directly in Verilog or VHDL. Although this would be pretty simple to do, I try to make sure NOT to do so. The reason is that it is my personal belief that thinking of a design in terms of Verilog or VHDL directly is a mistake and leads to poorer designs. I am aware that some readers may disagree, but I have repeatedly seen this kind of thinking lead to bigger, power-hungry and faulty designs.

Moreover, I am of the opinion that for design work it is not necessary to know all the ins and outs of VHDL or Verilog (this is different if you do modeling, especially for a mixed-signal project).
Eventually we all have to write code, but if you looked at my code you’d see it is extremely simple. For example, I rarely use any “for” statement and strictly try to avoid arrays.

Another important point on the subject is for you guys who interview people for a new position in your company. Please don’t ask your candidates to write something in VHDL as an interview question. I see and hear about this mistake over and over again. The candidate should know how to think in hardware terms; it is of far lesser importance whether he can generate some sophisticated code.
If he/she knows what he/she is aiming for in hardware terms, he/she will be a much better designer than a Verilog or VHDL wiz who doesn’t know what his code will be synthesized into. This is, by the way, a very typical problem for people who come from a CS background and go into design.

A Concise Guide to Why and How to Split your State Machines

September 8, 2007

So, why do we really care about state machine partitioning? Why can’t I have my big fatty FSM with 147 states if I want to?

Well, smaller state machines:

  1. Are easier to debug and probably less buggy
  2. Are more easily modified
  3. Require less decoding
  4. Are more suitable for low power applications
  5. Are just nicer…

There is no rule of thumb stating the correct size of an FSM. Moreover, a lot of times it just doesn’t make sense to split the FSM – so when can we do it? Or when should we do it? Part of the answer lies in a deeper analysis of the FSM itself, its transitions and, most importantly, the probability of occupying specific states.

Look at the diagram below. After some (hypothetical) analysis we recognize that in certain modes of operation, we spend either a lot of time among the states marked in red or among the states marked in blue. Transitions between the red and blue areas are possible but are less frequent.

fsm_partition1.png

The trick now is to look at the entire red zone as one state of a new “blue” FSM, and vice versa for a new “red” FSM. We basically split the original FSM into two completely separate FSMs and add to each of them a new state, which we will call a “wait state”. The diagram below depicts our new construction.

fsm_partition2.png

Notice how, for the “red” FSM, transitioning in and out of the new “wait state” is exactly equivalent (same conditions) to switching in and out of the red zone of the original FSM. The same goes for the blue FSM, but the conditions for going in and out of the “wait state” are naturally reversed.

OK, so far so good, but what is this good for? For starters, it would probably be easier now to choose state encodings for each separate FSM that will reduce switching (check out this post on that subject). However, the sweetest thing is that when we are in the “red wait state” we can gate the clock for the rest of the red FSM and all its dependent logic! This is a significant bonus, since although such a strategy would have been possible before, it would have been by far more complicated to implement. The price we pay is the additional states, which will sometimes lead to more flip-flops needed to hold the current state.

As mentioned before, it is not wise to just blindly partition your FSMs arbitrarily. It is important to try to look for patterns and recognize “regions of operation”. Then, try to find transitions in and out of these regions which are relatively simple (ideally one condition to go in and one to go out). This means that sometimes it pays to include one more state in a “region”, just to make transitioning in and out of the “region” simpler.

Use this technique. It will make your FSMs easier to debug and simpler to code, and hopefully it will enable you to introduce low power concepts more easily in your design.
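A minimal sketch of one half of the resulting structure (state names, conditions and the gating signal are hypothetical): the “red” FSM idles in its wait state until handed control, and the wait state drives a clock-gating enable for the red logic.

```verilog
// "Red" FSM after the split - a sketch.
localparam RED_WAIT = 2'd0, RED_S1 = 2'd1, RED_S2 = 2'd2;

reg [1:0] red_state;
always @(posedge clk or negedge rst_n)
  if (!rst_n)
    red_state <= RED_WAIT;
  else
    case (red_state)
      RED_WAIT: if (enter_red) red_state <= RED_S1;  // same condition that entered
      RED_S1:   if (red_done)  red_state <= RED_S2;  // the red zone originally
      RED_S2:   if (leave_red) red_state <= RED_WAIT;
      default:  red_state <= RED_WAIT;
    endcase

// While waiting, the red FSM and its dependent logic can be clock gated.
wire red_clk_en = (red_state != RED_WAIT) || enter_red;
```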

FSM State Encoding – More Switching Reduction Tips

September 4, 2007

I promised before to write some words on reducing switching activity by cleverly assigning the states of an FSM, so here goes…

Look at the example below. The FSM has five states, “A”–“E”. Most naturally, one would just enumerate them sequentially (or use some enumeration scheme given by VHDL or Verilog – which is easier for debugging purposes).
In the diagram, the sequential enumeration is marked in red. Now, consider only the topology of the FSM – i.e. without any reference to the probability of state transitions. You will notice that the diagram states (pun intended), in red near each arc, the number of bits switching for that specific transition. For example, to go from state “E” (100) to state “B” (001), two bits will toggle.

red_switching_fsm.png

But could we choose a better enumeration scheme that will reduce the amount of switching? It turns out that yes (don’t tell anybody, but I forced this example to have a better enumeration 🙂 ). If you look at the green state enumeration you will clearly see that at most one bit toggles for every transition.

If you sum up all transitions (assuming equal probability), you will see that the green implementation toggles exactly half as often as the red one. An interesting point is that we only need to consider states “B”–“E”, because once state “A” is exited it can never be returned to (this is sometimes referred to as a “black hole” or “a pit”).

Choosing the state enumeration more cleverly doesn’t only mean that we reduce switching in the actual flip-flops that hold the state itself; we also reduce glitches/hazards in all the combinational logic that depends on the FSM! The latter point is extremely important, since those combinational clouds can be huge in comparison to the n flops that hold the state of the FSM.

The procedure for choosing the right enumeration deserves more words, but that would make this post too lengthy. In the usually small FSMs that the average designer handles on a daily basis, the most efficient enumeration can easily be reached by trial and error. I am sure there is somewhere some sort of clever algorithm that, given an FSM topology, can spit out the best enumeration. If you are aware of something like that, please send me an email.
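As a sketch of the idea (the encodings below are illustrative, not the ones from the diagram): for a cycle of states B→C→D→E→B, a Gray-style assignment makes every transition along the cycle toggle exactly one bit.

```verilog
// Sequential encoding - transitions along B->C->D->E->B toggle up to 3 bits
// (e.g. 011 -> 100 toggles all three).
localparam [2:0] B_SEQ = 3'b001, C_SEQ = 3'b010,
                 D_SEQ = 3'b011, E_SEQ = 3'b100;

// Gray-style encoding - every transition along the same cycle toggles 1 bit.
localparam [2:0] B_GRAY = 3'b000, C_GRAY = 3'b001,
                 D_GRAY = 3'b011, E_GRAY = 3'b010;
```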

Arithmetic Tips & Tricks #1

August 1, 2007

Every single one of us has at some time or another had to design a block utilizing some arithmetic operations. Usually we use the necessary operator and forget about it, but since we are “hardware men” (should be said with pride and a full chest) we know there is much more going on under the hood. I intend to have a series of posts dealing specifically with arithmetic implementation tips and tricks. There are plenty of them; I don’t know all of them, probably not even half. So if you have some interesting ones, please send them to me and I will post them with credits.

Let’s start. This post will explain two of the most obvious and simple ones.

  • Multiplying by a constant
    Multipliers are extremely area hungry and thus, when possible, should be eliminated. One of the classic examples is multiplying by a constant.
    Assume you need to multiply the result of register A by a factor, say 5. Instead of instantiating a multiplier, you can “shift and add”. 5 in binary is 101; just add A to A00 (two trailing zeros have the effect of multiplying by 4) and you have the equivalent of multiplying by 5, since what you basically did was 4A + A = 5A.
    This is of course very simplistic, but when you write your code, make sure the constant is not passed on as an argument to a function. It might be that the synthesis tool knows how to handle it, but why take the risk.

  • Adding a bounded value
    Sometimes (or even often), we need to add two values where one is much smaller than the other and bounded – for example, adding a 3-bit value to a 32-bit register. The idea here is not to be neat and pad the 3-bit value with leading zeros, forcibly creating a 32-bit value from it. Why? Adding two 32-bit values instantiates full-adder logic on all 32 bits, while adding 3 bits to 32 infers full-adder logic on the 3 LSBs and increment logic (which is much faster and cheaper) on the rest of the bits. I am quite positive that today’s synthesis tools know how to handle this but, again, it is good practice to always check the synthesis result and see what came up. If you didn’t get what you wanted, it is easy enough to force it by coding it that way.

Replication

July 25, 2007

Replication is an extremely important technique in digital design. The basic idea is that under some circumstances it is useful to take the same logic cloud or the same flip-flops and produce more instances of them, even though a single copy would normally be enough from a logical point of view.
Why would I want to spend more area on my chip and create more logic when I know I could do without it?

Imagine the situation in the picture below. The darkened flip-flop has to drive 3 other nets all over the chip, and due to the physical placement of the capturing flops it cannot be placed close to all of them. The layout tool finds, as a compromise, some place in the middle, which in turn will generate a negative slack on all the paths.

replication1.png

We notice that in the above example the logic cloud just before the darkened flop has a positive slack or, in other words, “some time to give”. We now use this and produce a copy of the darkened flop, but this time closer to each of the capturing flops.

replication2.png

Yet another option is to duplicate the entire logic cloud plus the sending flop, as pictured below. This will usually generate even better results.

replication3.png

Notice that we also reduce the fan-out of the driving flop, thus further improving timing.

It is important to take care, while writing the HDL code, that the paths are really separated. This means that when you want to replicate flops and logic clouds, make sure you give the registers/signals/wires different names. It is a good idea to keep some sort of naming convention for replicated paths, so that in the future, when a change is made on one path, it will be easy enough to mirror that change on the other replications.

There is no need to mention that when using this technique we pay in area and power – but I will still mention it 🙂
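A sketch of what the separated paths might look like (the `_rep` suffix convention is made up for illustration; the `keep` attribute, or your tool’s equivalent, is typically needed so synthesis does not merge the duplicates back together):

```verilog
(* keep = "true" *) reg data_valid_rep0;  // drives capture flops in region X
(* keep = "true" *) reg data_valid_rep1;  // drives capture flops in region Y

always @(posedge clk) begin
  data_valid_rep0 <= data_valid_next;  // same next-state logic,
  data_valid_rep1 <= data_valid_next;  // deliberately duplicated names
end
```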

Low Power Techniques – Reducing Switching

June 15, 2007

In one of the previous posts we discussed a cool technique to reduce leakage current. This time we will look at dynamic power consumption due to switching and some common techniques to reduce it.

Usually, with just a little bit of thinking, reduction of switching activity is quite possible. Let’s look at some examples.

  • Bus inversion
    Bus inversion is an old technique which is used a lot in communication protocols between chip-sets (memories, processors, etc.), but not very often between modules within a chip. The basic idea is to add another line to the bus which signals whether to invert the entire bus (or not). When more than half of the lines would need to switch, the bus inversion line is asserted. Here is a small example of a hypothetical transaction and a comparison of the number of transitions between the two schemes.

    bu_inversion.png

    If you study the above example a bit, you will immediately see that I manipulated the values in such a way that a significant difference in the total number of transitions is evident.
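    The transmit side of the bus inversion scheme might be sketched like this (module and signal names are made up; an 8-bit bus is assumed):

```verilog
// The receiver recovers the data as: data = invert ? ~bus_out : bus_out;
module bus_invert_tx (
  input            clk,
  input            rst_n,
  input      [7:0] data_in,
  output reg [7:0] bus_out,
  output reg       invert
);
  // Bits that would toggle if we sent data_in as-is.
  wire [7:0] diff = data_in ^ bus_out;

  integer i;
  reg [3:0] toggles;
  always @* begin
    toggles = 0;
    for (i = 0; i < 8; i = i + 1)
      toggles = toggles + diff[i];
  end

  always @(posedge clk or negedge rst_n)
    if (!rst_n) begin
      bus_out <= 8'b0;
      invert  <= 1'b0;
    end else if (toggles > 4) begin  // more than half would switch
      bus_out <= ~data_in;           // send inverted: fewer wires toggle
      invert  <= 1'b1;
    end else begin
      bus_out <= data_in;
      invert  <= 1'b0;
    end
endmodule
```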

  • Binary Number Representation
    The two most common binary number representations in applications are 2’s complement and sign-magnitude, with the former usually preferred. However, for some very specific applications sign-magnitude shows advantages in switching. Imagine you have a sort of integrator, which does nothing more than summing up values each clock cycle. Imagine also that the steady state value is around 0, but fluctuations above and below are common. If you used 2’s complement, going from 0 to -1 would result in switching of the entire bit range (-1 in 2’s complement is represented by 111….). If you used sign-magnitude, only 2 bits would switch when going from 0 to -1.

    reduce_switching_coding.png

  • Disabling/Enabling Logic Clouds
    When handling a heavy logic cloud (with wide adders, multipliers, etc.) it is wise to enable this logic only when needed.
    Take a look at the diagrams below. In the left implementation, only the flop at the end of the path – flop “B” – has an enable signal; since flop “A” cannot be gated (its outputs are used someplace else!), the entire logic cloud is toggling and wasting power. In the right (no pun intended) implementation, the enable signal was moved to before the logic cloud and, just for good measure, the clock for flop “B” was gated.

    reducing_switching_enable1.pngreducing_switching_enable2.png

  • High Activity Nets
    This trick is usually completely ignored by designers. This is a shame, since only power-analysis tools, which can drive input vectors on your design and analyze the activity of its nets, might be able to catch this.
    The idea here is to identify the nets which have high activity among otherwise very quiet nets, and to try to push them as deep as possible into the logic cloud.

    reducing_switching_high_activity_net1.pngreducing_switching_high_activity_net2.png

    On the left, we see a logic cloud which is a function of X1..Xn,Y. X1..Xn change with very low frequency, while Y is a high activity net. In the implementation on the right, the logic cloud was duplicated, once assuming Y=0 and once assuming Y=1, with a final selection between the two options depending on the value of Y. Often, the two new logic clouds will be reduced in size, since Y has a fixed value in each.
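The last restructuring is essentially a Shannon expansion around Y. In pseudo-Verilog (f, f0 and f1 are placeholders for the original cloud and its two cofactors, not real functions):

```verilog
// Original: high-activity y ripples through the whole deep cloud,
// toggling many internal nets on every transition of y.
//   assign out = f(x, y);

// Restructured: two cofactor clouds, with y pushed to a final 2:1 MUX.
//   wire [31:0] out_y0 = f0(x);        // cloud evaluated with y = 0
//   wire [31:0] out_y1 = f1(x);        // cloud evaluated with y = 1
//   assign out = y ? out_y1 : out_y0;  // y's activity now hits only the MUX
```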