Archive for the 'Coding Style' Category

h1

A Concise Guide to Why and How to Split your State Machines

September 8, 2007

So, why do we really care about state machine partitioning? Why can’t I have my big fatty FSM with 147 states if I want to?

Well, smaller state machines are:

  1. Easier to debug and probably less buggy
  2. More easily modified
  3. Require less decoding
  4. Are more suitable for low power applications
  5. Just nicer…

There is no rule of thumb stating the correct size of an FSM. Moreover, a lot of times it just doesn’t make sense to split the FSM - So when can we do it? or when should we do it? Part of the answer lies in a deeper analysis of the FSM itself, its transitions and most important, the probability of occupying specific states.

Look at the diagram below. After some (hypothetical) analysis we recognize that in certain modes of operation, we spend either a lot of time among the states marked in red or among the states marked in blue. Transitions between the red and blue areas are possible but are less frequent.

fsm_partition1.png

The trick now, is to look at the entire red zone as one state for a new “blue” FSM, and vice versa for the a new “red” FSM. We basically split the original FSM into two completely separate FSMs and add to each of the FSM a new state, which we will call a “wait state”. The diagram below depicts our new construction.

fsm_partition2.png

Notice how for the “red” FSM transitioning in and out of the new “wait state” is exactly equivalent (same conditions) to switching in and out of the red zone of the original FSM. Same goes for the blue FSM but the conditions for going in and out of the “wait state” are naturally reversed.

OK, so far so good, but what is this good for? For starters, it would probably be easier now to choose state encodings for each separate FSM that will reduce switching (check out this post on that subject). However, the sweetest thing is that when we are in the “red wait state” we could gate the clock for the rest of the red FSM and all its dependent logic! This is a significant bonus, since although previously such strategy would have been possible, it would just be by far more complicated to implement. The price we pay is additional states which will sometimes lead to more flip-flops needed to hold the current state.

As mentioned before, it is not wise to just blindly partition your FSMs arbitrarily. It is important to try to look for patterns and recognize “regions of operation”. Then, try to find transitions in and out of this regions which are relatively simple (ideally one condition to go in and one to go out). This means that sometimes it pays to include in a “region” one more state, just to make the transitioning in and out of the “region” simpler.

Use this technique. It will make your FSMs easy to debug, simple to code and hopefully will enable you to introduce low power concepts more easily in your design.

h1

FSM State Encoding - More Switching Reduction Tips

September 4, 2007

I promised before to write some words on reducing switching activity by cleverly assigning the states of an FSM, so here goes…

Look at the example below. The FSM has five states “A”-”E”. Most naturally, one would just sequentially enumerate them (or use some enumeration scheme given by VHDL or Veriog - which is easier for debugging purposes).
In the diagram the sequential enumeration is marked in red. Now, consider only the topology of the FSM - i.e. without any reference to the probability of state transitions. You will notice that the diagram states (pun intended) in red near each arc the amount of bits switching for this specific transition. For example, to go from state “E” (100) to state “B” (001), two bits will toggle.

red_switching_fsm.png

But could we choose a better enumeration scheme that will reduce the amount of switching? Turns out that yes (don’t tell anybody but I forced this example to have a better enumeration :) ). If you look at the green state enumeration you will clearly see that at most only one bit toggles for every transition.

If you sum up all transitions (assuming equal probability) you would see that the green implementation toggles exactly half the time as the red. An interesting point is that we need only to consider states “B” - “E”, because once state “A” is exited it can never be returned to (this is sometimes being referred to as “black hole” or “a pit”).

The fact that we chose the states enumeration more cleverly doesn’t only mean that we reduced switching in the actual flip-flops that hold the state itself, but we also reduce glitches/hazards in all the combinational logic that is dependent on the FSM! The latter point is extremely important since those combinational clouds can be huge in comparison to the n flops that hold the state of the FSM.

The procedure on choosing the right enumeration deserve more words but this will become a too lengthy post. In the usually small FSMs that the average designer handles on a daily basis, the most efficient enumeration can be easily reached by trial and error. I am sure there is somewhere some sort of clever algorithm that given an FSM topology can spit out the best enumeration. If you are aware of something like that, please send me an email.

h1

Arithmetic Tips & Tricks #1

August 1, 2007

Every single one of us had sometime or another to design a block utilizing some arithmetic operations. Usually we use the necessary operator and forget about it, but since we are “hardware men” (should be said with pride and a full chest) we know there is much more going under the hood. I intend to have a series of posts dealing specifically with arithmetic implementation tips and tricks. There are plenty of them, I don’t know all, probably not even half. So if you got some interesting ones please send them to me and I will post them with credits.

Let’s start. This post will explain 2 of the most obvious and simple ones.

  • Multiplying by a constant
  • Multipliers are extremely area hungry and thus when possible should be eliminated. One of the classic examples is when multiplying by a constant.
    Assume you need to multiply the result of register A by a factor, say 5. Instead of instantiating a multiplier, you could “shift and add”. 5 in binary is 101, just add A to A00 (2 trailing zeros, have the effect of multiplying by 4) and you have the equivalent of multiplying by 5, since what you basically did was 4A+A = 5A.
    This is of course very simplistic, but when you write your code, make sure the constant is not passed on as an argument to a function. It might be that the synthesis tool knows how to handle it, but why take the risk.

  • Adding a bounded value
  • Sometimes (or even often), we need to add two values where one is much smaller than the other and bounded. For example adding a 3 bit value to a 32 bit register. The idea here is not to be neat and pad the 3 bit value by leading zeros and create by force a 32 bit register from it. Why? adding two 32 bit values instantiates full adder logic on all the 32 bits, while adding 3 bits to 32 will infer a full adder logic on the 3 LSBs and an increment logic (which is much faster and cheaper) on the rest of the bits. I am quite positive that today’s synthesis tools know how to handle this, but again, it is good practice to always check the synthesis result and see what came up. If you didn’t get what you wanted it is easy enough to force it by coding it in such a way.

    h1

    Replication

    July 25, 2007

    Replication is an extremely important technique in digital design. The basic idea is that under some circumstances it is useful to take the same logic cloud or the same flip-flops and produce more instances of them, even though only a single copy would normally be enough from a logical point of view.
    Why would I want to spend more area on my chip and create more logic when I know I could do without it?

    Imagine the situation on the picture below. The darkened flip-flop has to drive 3 other nets all over the chip and due to the physical placement of the capturing flops it can not be placed close by to all of them. The layout tool finds as a compromise some place in the middle, which in turn will generate a negative slack on all the paths.

    replication1.png

    We notice that in the above example the logic cloud just before the darkened flop has a positive slack or in other words, “some time to give”. We now use this and produce a copy of the darkened flop, but this time closer to each of the capturing flops.

    replication2.png

    Yet another option, is to duplicate the entire logic cloud plus the sending flop, as pictured below. This will usually generate even better results.

    replication3.png

    Notice that we also reduce the fan out of the driving flop, thus further improving on timing.

    It is important to take care about while writing the HDL code, that the paths are really separated. This means when you want to replicate flops and logic clouds make sure you give the registers/signals/wires different names. It is a good idea to keep some sort of naming convention for replicated paths, so in the future when a change is made on one path, it would be easy enough to mirror that change on the other replications.

    There is no need to mention that when using this technique we pay in area and power - but I will still mention it :-)

    h1

    Low Power Techniques - Reducing Switching

    June 15, 2007

    In one of the previous posts we discussed a cool technique to reduce leakage current. This time we will look at dynamic power consumption due to switching and some common techniques to reduce it.

    Usually, with just a little bit of thinking, reduction of switching activity is quite possible. Let’s look at some examples.

  • Bus inversion
  • Bus inversion is an old technique which is used a lot in communication protocols between chip-sets (memories, processors, etc.), but not very often between modules within a chip. The basic idea is to add another line to the bus, which signals whether to invert the entire bus (or not). When more than half of the lines needs to be switched the bus inversion line is asserted. Here is a small example of a hypothetical transaction and the comparison of amount of transitions between the two schemes.

    bu_inversion.png

    If you studied the above example a bit, you could immediately see that I manipulated the values in such a way that a significant difference in the total amount of transitions is evident.

  • Binary Number Representation
  • The two most common binary number representation in applications are 2’s complement and signed magnitude, with the former one usually preferred. However, for some very specific applications signed digit shows advantages in switching. Imagine you have a sort of integrator, which does nothing more than summing up values each clock cycle. Imagine also that the steady state value is around 0, but fluctuations above and below are common. If you would use 2’s complement going from 0 to -1 will result in switching of the entire bit range (-1 in 2’s complement is represented by 111….). If you would use signed digit, only 2 bits will switch when going from 0 to -1.

    reduce_switching_coding.png

  • Disabling/Enabling Logic Clouds
  • When handling a heavy logic cloud (with wide adders, multipliers, etc.) it is wise to enable this logic only when needed.
    Take a look at the diagrams below. On the left implementation, only the flop at the end of the path - flop “B” has an enable signal, since flop “A” could not be gated (its outputs are used someplace else!) the entire logic cloud is toggling and wasting power. On the right (no pun intended) implementation, the enable signal was moved before the logic cloud and just for good measures, the clock for flop “B” was gated.

    reducing_switching_enable1.pngreducing_switching_enable2.png

  • High Activity Nets
  • This trick is usually completely ignored by designers. This is a shame since only power saving tools which can drive input vectors on your design and run an analysis of the active nets, might be able to resolve this.
    The idea here is to identify the nets which have high activity among other very quiet nets, and to try to push them as deep as possible in the logic cloud.

    reducing_switching_high_activity_net1.pngreducing_switching_high_activity_net2.png

    On the left, we see a logic cloud which is a function of X1..Xn,Y. X1..Xn change with very low frequency, while Y is a high activity net. On the implementation on the right, the logic cloud was duplicated, once assuming Y=0 and once for Y=1, and then selecting between the 2 options depending on the value of Y. Often, the two new logic clouds will be reduced in size since Y has a fixed value there.

    h1

    Designing Robust Circuits

    May 25, 2007

    There are many ways to design a certain circuit. Moreover, there are many trade-offs like power, area, speed etc.
    In this post we will discuss a bit about robustness and as usual, we will use a practical, real life example to state our point.

    When one talks about robustness in digital design, one usually means that if a certain type of failure occurs during operation the circuit does not need outside “help” in order to return to a defined or at least allowed state. Maybe this is a bit cryptic so let’s look at a very simple example - a ring counter.

    ring_counter_states.pngAs pictured on the right a 4 bit ring counter has 4 different states, with only a single “1″ in each state. “counting” is performed by shifting or more correctly rotating the “1″ to one direction with each rising clock edge. Ring counters have many uses, one of the most common is as a pointer for a synchronous FIFO. Because of their simplicity, one finds them many times in high speed full custom designs. Ring counters have only a subset of all possible states as allowed or legal states. For example, the state “1001″ is not allowed.

    A very simple implementation for a ring counter is the one depicted below. The 4 flip-flops are connected in a circular shift register fashion. Three of the registers have an asynchronous reset pin while the left most has an asynchronous set pin. When going into the reset state the ring counter will assume the state “1000″.

    ring_counter1.png

    Now, imagine that for some reason (inappropriate reset removal, cross talk noise etc.) the state “1100″ appeared in the above design - an illegal state. From now on, the ring counter will always toggle between illegal states and this situation will continue until the next asynchronous reset is de-asserted. If a system is noisy, and such risk is not unthinkable, hard reseting the entire system just to bring the counter to a known state might be disastrous.

    Let’s inspect a different, more robust design of a ring counter in the picture below.

    ring_counter2.png

    With the new implementation the NOR gate is functioning as the left most output. But more important, the NOR gate will drive “0″s into the 3-bit shift register until all 3-bits are “0″, then a “1″ will be driven. If we look at a forbidden or illegal state like “0110″, we see that the new circuit will go through the following states: “0110″–>”0011″–>”0001″ until it independently reaches a legal state! This means we might experience an unwanted behavior for a few cycles but we would not need to reset the circuit to bring it back to a legal state.

    In a later post, when discussing Johnson counters, we will see this property again.

    h1

    Late Arriving Signals

    May 23, 2007

    As I mentioned before, it is my personal opinion that many digital designers put themselves more and more further away from the physical implementation of digital circuits and concentrate more on the HDL implementations. A relatively simple construction like the one I am about to discuss, is already quite hard to debug directly in HDL. With a visual aid of how the circuit looks like, it is much easier (and faster) to find a solution.

    The classic example we will discuss is that of a late arriving signal. Look at the picture below. The critical path through the circuit is along the red arrow. Let’s assume that there is a setup violation on FF6.
    late_arriving_signal_1.png Let’s also assume that in this example the logic cloud marked as “A”, which in turn controls the MUX that chooses between FF3 and FF4, is quite heavy. The combination of cloud “A” and cloud “B” plus the MUXes in sequence is just too much. But we have to use the result of “A” before calculating “B”! What can be done?

    The most important observation is that we could duplicate the entire logic that follows “A”. We assume for the duplicated blocks that one time the result of “A” was a logic “0″ and in another logic “1″. Later we could choose between the two calculations. Another picture will make it clearer. late_arriving_signal_2.png
    Notice how the MUX that selected between FF3 and FF4 has vanished. There is now a MUX that selects between FF3 and FF5 (”A” result was a “0″) and a MUX in the parallel logic that selects between FF4 and FF5 (”A” result was a “1″) .
    At the end of the path we introduced a new MUX which selects between the two calculations we made, this time depending on cloud “A”. It is easy to see that although this implementation takes more area due to the duplicated logic, the calculation of the big logic clouds “A” and “B” is done in parallel rather than in series.

    This technique is relatively easy to implement and to spot if you have a circuit diagram of your design. Also do not count on the synthesis tool to do this for you. It might be able to do it with relatively small structures but when those logic clouds get bigger, you should implement this trick on your own - you will see improvements in timing (and often in synthesis run time). What you pay for is area and maybe power - nothing comes for free…

    h1

    Do You Think Low Power???

    May 20, 2007

    There is almost no design today, where low power is not a concern. Reducing power is an issue which can be tackled on many levels, from the system design to the most fundamental implementation techniques.

    In digital design, clock gating is the back bone of low power design. It is true that there are many other ways the designer can influence the power consumption, but IMHO clock gating is the easiest and simplest to introduce without a huge overhead or compromise.

    Here is a simple example on how to easily implement low power features.

    sync_fifo.png The picture on the right shows a very simple synchronous FIFO. That FIFO is a very common design structure which is easily implementable using a shift register. The data is being pushed to the right with each clock and the tap select decides which register to pick. The problem with this construction is that with each clock all the flip-flops potentially toggle, and a clock is driven to all. This hurts especially in data or packet processing applications where the size of this FIFO can be in the range of thousands of flip-flops!!

    sync_fifo_low_power.png The correct approach is instead of moving the entire data around with each clock, to “move” the clock itself. Well not really move, but to keep only one specific cell (or row in the case of vectors) active while all the other flip-flops are gated. This is done by using a simple counter (or a state machine for specific applications) that rotates a “one hot” signal - thus enabling only one cell at a time. Notice that the data_in signal is connected to all the cells in parallel,. When new data arrives only the cell which receives a clock edge in that moment will have a new value stored.

    h1

    Another Synchronization Pitfall…

    May 18, 2007

    Many are the headaches of a designer doing multi clock domain designs. The basics that everyone should know when doing multi clock domain designs are presented in this paper. I would like to discuss on this post a lesser known problem, which is overlooked by most designers. Just as a small anecdote, this problem was encountered by a design team led by a friend of mine. The team was offered a 2 day vacation reward for anyone tracking and solving the weird failures that they experienced. I guess this already is a good reason to continue reading…

    OK, we all know that when sending a control signal (better be a single one! - see the paper referenced above) from one clock domain to another, we must synchronize it at the other end by using a two stage shift register (some libraries even have a “sync cell” especially for this purpose).

    Take a look at the hypothetical example below

    sync_pitfall1.png sync_pitfall2.png

    Apparently all is well, the control signal, which is an output of some combinational logic, is being synchronized at the other end.
    So what is wrong?
    In some cases the combinational logic might generate a hazard, depending on the inputs. Regardless whether it is a static one (as depicted in the timing diagram) or a dynamic one, it is possible that exactly that point is being sampled at the other end. Take a close look at the timing diagram, the glitch was recognized as a “0″ on clk_b’s side although it was not intended to be.

    The solution to this problem is relatively easy and involves adding another sampling stage clocked with the sending clock as depicted below. Notice how this time the control signal at the other end was not recognized as a “0″. This is because the glitch had enough time to settle until the next rising edge of clk_a.

    sync_pitfall3.pngsync_pitfall4.png

    In general, the control signal sent between the two clock domains should present a strict behavior during switching- either 1–>0 or a 0–>1. Static hazards (1–>0–>1 or 0–>1–>0) or Dynamic hazards (1–>0–>1–>0 or 0–>1–>0–>1) are a cause for a problem.

    Just a few more lines on synchronization faults. Quite often they might pop up in only some of the designs. You might have 2 identical chips, one will show a problem the other not. This can be due to slight process variations that make some logic faster or slower, and in turn generate a hazard exactly at the wrong moment.

    h1

    Eliminating Unnecessary MUX structures

    May 16, 2007

    You will often hear engineers in our business saying something along these lines:
    “I first code, and then let synthesis find the optimal implementation” or “synthesis tools are so good these days, there is no use in spending time on thinking in the circuit level…”. Well not me - sorry!! I am a true fan of “helping” or “directing” the synthesis.
    The example I will discuss on this post, is a real life example that occurred while reviewing a fellow engineer’s work.

    The block in discussion is quite a heavy one, with very tight timing requirements and complicated functionality (aren’t they all like that…). Somewhere in the code I encountered this if-else-if statement (Verilog):

    if (s1)
       y = 1'b1;
    else if (s2)
             y = 1'b0;
          else
             y = x;
    
    

    Now, if this would have stood on its own, it would not have risen much suspicion. But this statement happened to be part of the critical path. On first look, the if-else-if ladder is translated into a set of cascaded muxes, but looking carefully at it, one can simplify it into two gates (or even one complex gate in most libraries) as shown below.

    cascaded_mux.png

    I do not say that a good synthesis tool is not able to simplify this construction, and I have to admit I do not really know what is going on inside the optimization process - this seems to be some sort of black magic of our art - but fact is, that timing improved after describing the if-else-if statement explicitly as an or-and combination.
    The reason can be, as depicted, that the muxes are being “dragged” somehow into the logic clouds just before and after them in hope of simplifying them there. I just don’t know!
    A good sign to spot when such simplification is easily possible, is when you have an if-else-if ladder or a case statement with constants on the right hand side (RHS). It does make the code a bit less readable, but IMHO it is worth it.

    Here is a short summery of some common mux constructs with fixed inputs and their simplified forms.

    mux_fixed_inputs.png