Archive for the ‘Layout’ Category


Reducing Power Through Retiming

February 23, 2009

Here is an interesting and almost trivial technique for (potential) power reduction, which I have never used myself, nor seen used in others’ designs. Well… maybe I am doing the wrong designs… but I thought it was well worth mentioning. So, if any of my readers use this, please do post a short comment on how exactly you implemented it and whether it really resulted in significant savings.

We usually have many high-activity nets in a design. In many cases they toggle more than once per cycle while a calculation is in progress. Even worse, they often drive long, high-capacitance nets. Since in a usual synchronous design (which 99% of us do) we only need the stable result once per cycle, when the calculation is done, we can simply put a register in front of the high-capacitance net. The register effectively blocks all the intermediate toggling on that net (where it hurts) and allows it to change at most once per cycle.
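A rough way to see the effect is a toy Python model, not a real power estimate. It counts the toggles on a net with and without a register in front of it. The waveform, the `registered` helper and the 4-samples-per-cycle resolution are all made up for illustration. Switching energy on a net scales roughly with the number of toggles times the capacitance being driven, so fewer toggles on the long net is what we are after.

```python
def count_toggles(samples):
    """Count value changes between consecutive samples of a net."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def registered(samples, samples_per_cycle, reset_value=0):
    """Model a flop driving the net: the output only updates at cycle
    boundaries, taking the settled (last) value of the previous cycle."""
    out = []
    q = reset_value
    for start in range(0, len(samples), samples_per_cycle):
        cycle = samples[start:start + samples_per_cycle]
        out.extend([q] * len(cycle))  # flop output is held for the whole cycle
        q = cycle[-1]                 # settled value captured at the clock edge
    return out


# Hypothetical glitchy logic-cloud output: 4 samples per clock cycle, 4 cycles.
glitchy_net = [0, 1, 0, 1,  1, 0, 1, 1,  1, 0, 1, 0,  0, 1, 0, 0]
flopped_net = registered(glitchy_net, samples_per_cycle=4)

print(count_toggles(glitchy_net))  # toggles on the long net without the flop
print(count_toggles(flopped_net))  # toggles on the long net behind the flop
```

In this made-up waveform the unregistered net toggles 10 times in 4 cycles, while the registered one toggles only twice. The price is the extra cycle of latency, which the model shows as the output being delayed by one cycle.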

The image below tells the whole story. (a) is before the insertion of the flop, (b) right after.


This is all nice, but just remember that in real life it can be quite hard to identify those special nets and those special high-toggling logic clouds. Moreover, most of the time we cannot afford the flop for latency reasons. But if you happen to be in the early design phase and you know your floor plan more or less, think about moving some of those flops so that they reduce the toggling on those high-capacitance nets.


On Replication and Wire Length

September 12, 2008

It is, for some reason, a common view that when using replication you also have to pay in increased wire length. It looks reasonable, doesn’t it? After all, you now have more blocks to wire into and out of, and therefore total wire length should increase, right? Well, not really…

In some cases this might be true, but in most cases wire length should decrease. Wiring in a chip obeys taxicab geometry laws, so it is a bit less intuitive than usual.

Here is a simple example showing how wire length can decrease after replication. Sure, I chose the block placements and the replicated block (R) size to be in my favor, but this is not a rigorous math proof.

Before replication

After replication

Notice how blocks (A) and (B) are now actually farther apart. This leaves more room for other critical logic to be placed in the precious place near the center. On the other hand, after replication we now have one really long wire going out of block (C).
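The same kind of comparison can be done numerically. The sketch below is not the placement from the pictures. The coordinates, the bus widths and the names `R1`/`R2` are assumptions chosen for illustration. It sums taxicab (Manhattan) distances, weighted by the number of wires in each connection, for a block R that receives one wire from C and drives 8-bit buses to A and B.

```python
def manhattan(p, q):
    """Taxicab distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def total_wire_length(placement, nets):
    """nets is a list of (source, sink, number_of_wires) tuples."""
    return sum(width * manhattan(placement[src], placement[dst])
               for src, dst, width in nets)


# Hypothetical case: one copy of R sits near C and drives two long 8-bit buses.
before_placement = {"C": (0, 0), "R": (1, 1), "A": (10, 0), "B": (0, 10)}
before_nets = [("C", "R", 1), ("R", "A", 8), ("R", "B", 8)]

# After replication: a copy of R next to each load; only C's single wire is long.
after_placement = {"C": (0, 0), "R1": (9, 0), "R2": (0, 9),
                   "A": (10, 0), "B": (0, 10)}
after_nets = [("C", "R1", 1), ("C", "R2", 1), ("R1", "A", 8), ("R2", "B", 8)]

before_length = total_wire_length(before_placement, before_nets)
after_length = total_wire_length(after_placement, after_nets)
print(before_length, after_length)
```

With these assumed numbers the total drops from 162 to 34 units, because the wide buses become short and only the narrow input is routed the long way. With 1-bit outputs the gain would mostly disappear, so the result depends heavily on the widths and placement you start from.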

Bottom line: don’t be afraid to use replication when you can. It has many advantages, and not only for improving timing.


ECO Flow

December 5, 2007

Here is a useful checklist you should use when doing your ECOs.

  1. RTL bug fix
    Correct the bug in RTL, then run simulations for the specific test cases and for some of your general golden tests. Check that you corrected the problem and, more importantly, didn’t destroy any correct behavior.

  2. Implement ECO in synthesis netlist
    Using your spare cells and/or rewiring, implement the bug fix directly in the synthesis Verilog netlist. Remember that you do not re-synthesize the entire design; you are patching it locally.

  3. Run equivalence check between synthesis and RTL
    Using your favorite (or available) formal verification tool, run an equivalence check to see whether the code you corrected really translates to the netlist you patched. Put simply, the formal verification tool runs through the entire state space and tries to find an input vector that produces two different states in the RTL code and in the synthesis netlist. If the two designs are equivalent, you can be sure that your RTL simulations would give the same result (logically speaking) as the synthesis netlist.

  4. Implement ECO in layout netlist
    You now have to patch your layout netlist as well. Notice that this netlist is very different from the synthesis netlist. It usually has extra buffers inserted for edge shaping or hold violation correction, and its logic may even have been optimized in a totally different way.
    This is the real thing: a change here has to take into account the actual position of the cells, the actual names, etc. Try to work closely with the layout expert. Make sure you know and understand the floorplan as well. It is very common to connect a logic gate that sits on the other side of the chip just because it is logically correct, when in reality it will violate timing requirements.

  5. Run equivalence check between layout and synthesis
    This makes sure that the changes you made in the layout netlist are logically equivalent to the synthesis netlist. Some tools and company-internal flows enable a direct comparison of the layout netlist to the RTL. In many flows this is not possible, and one has to go through the synthesis netlist change as well.

  6. Layout to GDS / gate level simulations / STA runs on layout netlist (all that backend stuff…)
    Let the layout guys do their magic. As a designer you are usually not involved in this step.
    However, depending on your timing closure requirements, run STA on the layout netlist to see if everything is still OK. This step might be the most crucial, since even a very small change might create huge timing violations and you would have to redo your work.
    Gate level simulations are also recommended, depending on your application and internal flow.
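To make the idea behind the equivalence checks concrete, here is a toy Python sketch for a purely combinational patch. The signal names, the gating fix and the `find_mismatch` helper are hypothetical. Real formal tools do not enumerate inputs like this (they rely on techniques such as SAT or BDDs and compare logic cones between matched state points), so treat it only as an illustration of what "looking for a distinguishing input vector" means.

```python
from itertools import product


def inv(a):
    return 1 - a


def nor2(a, b):
    return 1 - (a | b)


def rtl_fixed(ctrl, block):
    """Hypothetical RTL after the bug fix: ctrl is gated off when block is high."""
    return ctrl & (1 - block)


def netlist_unpatched(ctrl, block):
    """Hypothetical netlist before the ECO: block is ignored."""
    return ctrl


def netlist_patched(ctrl, block):
    """Hypothetical ECO: an existing inverter plus one spare NOR2 cell."""
    n1 = inv(ctrl)
    return nor2(n1, block)  # ~(~ctrl | block) == ctrl & ~block


def find_mismatch(f, g, n_inputs):
    """Return the first input vector where f and g differ, or None if equivalent."""
    for vec in product((0, 1), repeat=n_inputs):
        if f(*vec) != g(*vec):
            return vec
    return None


print(find_mismatch(rtl_fixed, netlist_unpatched, 2))  # a counterexample vector
print(find_mismatch(rtl_fixed, netlist_patched, 2))    # None: patch matches RTL
```

The unpatched netlist is caught by the vector `(1, 1)`, while the patched one has no distinguishing vector.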


Spare Cells

November 26, 2007

What are spare cells and why the heck do we need them?

Spare cells are basically elements embedded in the design which are not driving anything. The idea is that maybe they will enable an easy (metal) fix without the need of a full redesign.

Sometimes not everything works after tape-out: a counter might not be reset correctly, a control signal might need to be additionally blocked when another signal is high, etc. These kinds of problems could be solved easily if “only I had another AND gate here…”
Spare cells aim to give a chance of solving those kinds of problems. Generally, the layout guys try to embed in the free spaces of the floor-plan some cells which are not driving anything. There is almost always free space around, and adding more cells doesn’t cost us in power (maybe in leakage in newer technologies), area (this space is there anyhow) or design time (the process is 99% automatic).
Having spare cells might mean that we are able to fix a design for a few tens of thousands of dollars (sometimes less) rather than a few hundred thousand.

So which spare cells should we use? It is always a good idea to have a few free memory elements, so I would recommend a few flip-flops. Even a number as low as 100 FFs in a 50K FF design is usually OK. Remember, you are not trying to build a new block, but rather to have a cheap possibility for a solution by rewiring some gates and FFs.
What gates should we throw in? If you remember some basic boolean algebra, you know that either NANDs or NORs alone can create any boolean function! This means that integrating only NANDs or only NORs as spare cells would be sufficient. Usually, both NANDs and NORs are thrown in for more flexibility. 3-input, or even better 4-input, NANDs and NORs should be used.
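If the boolean algebra is a bit rusty, the NAND half of that claim is easy to check with a few lines of Python. This is just a truth-table sketch with made-up helper names, building NOT, AND, OR and NOR out of a 2-input NAND only.

```python
from itertools import product


def nand(a, b):
    return 1 - (a & b)


# Everything below is built from nand() only.
def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    return nand(not_(a), not_(b))


def nor_(a, b):
    return not_(or_(a, b))


# Compare against Python's own bit operators for every input combination.
nand_only_is_enough = all(
    and_(a, b) == (a & b)
    and or_(a, b) == (a | b)
    and nor_(a, b) == 1 - (a | b)
    for a, b in product((0, 1), repeat=2)
) and all(not_(a) == 1 - a for a in (0, 1))

print(nand_only_is_enough)
```

The price of using only one gate type is more cells per function (an OR costs three NANDs here), which is one reason to throw in both NANDs and NORs.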

A small trick is tying all inputs of the NANDs to a logical “1” and all inputs of the NORs to a logical “0”. This way, if you decide to use only 2 of the 4 inputs, the other inputs do not affect the output (check it yourself). This in turn means less layout work when tying and untying the inputs of those spare cells.
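For those who would rather let a script "check it yourself", here is a small Python sketch. The `nand4` and `nor4` functions are plain behavioral models, not library cells. It confirms that a 4-input NAND with two inputs left tied to 1 behaves as a 2-input NAND, and a 4-input NOR with two inputs left tied to 0 behaves as a 2-input NOR.

```python
from itertools import product


def nand4(a, b, c, d):
    return 1 - (a & b & c & d)


def nor4(a, b, c, d):
    return 1 - (a | b | c | d)


pairs = list(product((0, 1), repeat=2))

# Unused NAND inputs left at 1: the cell acts as a 2-input NAND.
nand_tie_high_ok = all(nand4(a, b, 1, 1) == 1 - (a & b) for a, b in pairs)

# Unused NOR inputs left at 0: the cell acts as a 2-input NOR.
nor_tie_low_ok = all(nor4(a, b, 0, 0) == 1 - (a | b) for a, b in pairs)

# The wrong tie-off makes the output stuck: a NAND with any input at 0 is always 1.
nand_tie_low_stuck = all(nand4(a, b, 0, 0) == 1 for a, b in pairs)

print(nand_tie_high_ok, nor_tie_low_ok, nand_tie_low_stuck)
```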

The integration of spare cells is usually done after the synthesis step, and in the Verilog netlist it basically looks like an instantiation of library cells. This should not be done earlier, since the synthesis tool would just optimize all those cells away, as they drive nothing. The layout guy has to somehow, by feeling (or black magic), spread the spare cells around evenly.

I believe that when an ECO (Engineering Change Order) is needed and a metal fix is considered, this is where our real work as digital designers starts. I consider ECOs, and in turn the use of spare cells to solve or patch a problem, the epitome of our use of skills, experience, knowledge and creativity!

More on ECOs will be written in the future…



July 25, 2007

Replication is an extremely important technique in digital design. The basic idea is that under some circumstances it is useful to take the same logic cloud or the same flip-flops and produce more instances of them, even though only a single copy would normally be enough from a logical point of view.
Why would I want to spend more area on my chip and create more logic when I know I could do without it?

Imagine the situation in the picture below. The darkened flip-flop has to drive 3 other nets all over the chip, and due to the physical placement of the capturing flops it cannot be placed close to all of them. As a compromise, the layout tool finds some place in the middle, which in turn generates negative slack on all the paths.


We notice that in the above example the logic cloud just before the darkened flop has a positive slack or in other words, “some time to give”. We now use this and produce a copy of the darkened flop, but this time closer to each of the capturing flops.
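A back-of-the-envelope Python model shows how the slack moves around. All numbers here are invented: the clock period, the wire delay per unit of distance, the cloud delay and the coordinates. Setup time and clock-to-Q are ignored. Each path is simply "cloud to launching flop" or "launching flop to capturing flop", with wire delay proportional to Manhattan distance.

```python
PERIOD_PS = 1000         # hypothetical clock period
WIRE_PS_PER_UNIT = 100   # hypothetical wire delay per unit of distance
CLOUD_DELAY_PS = 300     # hypothetical delay of the cloud before the launching flop


def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def worst_slack(cloud_pos, launch_for):
    """launch_for maps each capturing flop position to the position of the
    launching flop (original or replica) that drives it."""
    slacks = []
    for capture_pos, launch_pos in launch_for.items():
        into_launch = CLOUD_DELAY_PS + WIRE_PS_PER_UNIT * manhattan(cloud_pos, launch_pos)
        into_capture = WIRE_PS_PER_UNIT * manhattan(launch_pos, capture_pos)
        slacks.append(PERIOD_PS - into_launch)   # cloud -> launching flop
        slacks.append(PERIOD_PS - into_capture)  # launching flop -> capturing flop
    return min(slacks)


cloud = (5, 5)

# One launching flop in the middle, three capturing flops far away.
single_flop = {(5, 17): (5, 5), (17, 5): (5, 5), (5, -7): (5, 5)}

# One replica per capturing flop, placed halfway towards it.
replicated = {(5, 17): (5, 11), (17, 5): (11, 5), (5, -7): (5, -1)}

print(worst_slack(cloud, single_flop))  # negative: every output path fails
print(worst_slack(cloud, replicated))   # positive: the cloud's spare time is used
```

With these assumed numbers the worst slack goes from -200 ps to +100 ps. The cloud-to-flop paths give up most of their 700 ps of positive slack so that the flop-to-flop paths can meet timing.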


Yet another option is to duplicate the entire logic cloud plus the sending flop, as pictured below. This will usually generate even better results.


Notice that we also reduce the fan out of the driving flop, thus further improving on timing.

While writing the HDL code, it is important to make sure that the paths are really separated. This means that when you want to replicate flops and logic clouds, you give the registers/signals/wires different names. It is a good idea to keep some sort of naming convention for replicated paths, so that when a change is later made on one path, it is easy to mirror that change on the other replicas.

There is no need to mention that when using this technique we pay in area and power – but I will still mention it 🙂


Some Layout Considerations

July 1, 2007

I work on a fairly large chip. The more I reflect on what could have been done better, the more I realize how important floor planning is, and how important it is to identify long lines within the chip early and to tackle these problems in the architectural planning phase.

The average digital designer will be happy once he has finished his HDL coding, simulated it and verified that it works fine. Next he will run it through synthesis to see if timing is OK, and job done, right? Wrong! There are many problems that simply can’t surface during synthesis. To name a few: routing congestion, crosstalk effects, parasitics, etc. This post will try to concentrate on another issue, which is much easier to understand but which is usually encountered too late in the design to do anything radical about it: the physical placement of flip-flops.

The picture below shows a hypothetical architecture of a design, which is very representative of the problems I want to describe.


Flop A is forced to be placed close to the analog interface at the bottom, to have a clean interface to the digital core. In the same way, Flop B is placed near the top, to have a clean interface to the analog part at the top. The signal between them needs to physically cross the entire chip. The layout tools will place many buffers to keep clean, sharp edges, but in many cases timing is still violated. If this signal has to get through within one clock period, you are in trouble. Often this is not the case, and pipeline stages can be added along the way, or a multi-cycle path can be defined.
Most designers choose to introduce pipeline stages in order to have a cleaner synthesis flow (fewer special constraints).
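For a first estimate of how many pipeline stages such a crossing needs, a few lines of Python arithmetic are enough. This is a rough sketch with invented numbers. It assumes the total flight time (wire plus buffers) can be split evenly between the stages, and that each stage loses a fixed overhead for clock-to-Q and setup. The function name and parameters are hypothetical.

```python
def pipeline_flops_needed(flight_ps, period_ps, overhead_ps):
    """Number of flops to insert between flop A and flop B so that every
    segment of the crossing fits into one clock period."""
    budget_ps = period_ps - overhead_ps       # usable wire time per segment
    if budget_ps <= 0:
        raise ValueError("overhead leaves no time for the wire")
    segments = -(-flight_ps // budget_ps)     # integer ceiling division
    return max(segments - 1, 0)


# Hypothetical crossing: 2600 ps of flight time, 1 GHz clock, 200 ps overhead.
print(pipeline_flops_needed(flight_ps=2600, period_ps=1000, overhead_ps=200))

# A short hop that fits into one period needs no extra flops.
print(pipeline_flops_needed(flight_ps=700, period_ps=1000, overhead_ps=200))
```

For the first case the answer is 3 extra flops per wire. Multiply that by the width of the bus that has to cross, and it becomes clear why these lines are worth finding early.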

The other example shown in the diagram is a register that has loads all over the design. It drives signals in the analog interfaces as well as some state machines in the core itself. Normally, this is not a single wire but an entire bus and pipelining this can be very expensive. In a typical design there are hundreds of registers controlling state machines and settings all over the chip, with wires criss crossing by the thousands. Locating the bad guys should be done as soon as possible.

Some common solutions are:

  1. Using local decoding as described in this post
  2. Reducing the width of your register bus (costs in register read/write time)
  3. Defining registers as quasi-static – changeable only during the power up sequence, static during normal operation

Big Chips – Some Low Power Considerations

June 2, 2007

As designers, especially ones who only code in HDL, we don’t normally take into account the physical size of the chip we are working on. There are many effects which surface only past the synthesis stage and when approaching the layout.

As usual, let’s look at an example. Consider the situation described in the diagram below.


Imagine that blocks A and B are located physically far from one another and cannot be placed any closer. If the speeds we are dealing with are relatively high, it may very well be that the flight time of the signals from one side of the chip to the other is already critical, and even a flop-to-flop connection without any logic in between will violate setup requirements!
Now imagine, as depicted, that many signals are sent across the chip. If you need to pipeline, you need to pipeline a lot of parallel lines. This may result in a lot of extra flip-flops. Moreover, your layout tool will have to put in a lot of buffers to keep the signal edges sharp. From an architectural point of view, decoding globally may sound attractive at first, since you only need to do it once, but it can lead to a very power-hungry architecture.

The alternative is to send as few long lines as possible across the chip, as depicted below.


With this architecture, block B decodes the logic locally. If the lines sent to block B also need to be spread all over the chip, we definitely pay by duplicating the decoding logic in each target block.
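To get a feeling for the difference, here is a small Python sketch that counts long lines and pipeline flops for the two architectures. It assumes, purely for illustration, that the decoded signals are one-hot select lines and that every long line needs the same number of pipeline stages. The function name and the numbers are made up.

```python
def long_wire_cost(n_selects, pipeline_stages, decode_globally):
    """Count the long lines crossing the chip and the flops needed to pipeline them.

    decode_globally=True  -> all decoded (one-hot) select lines cross the chip
    decode_globally=False -> only the encoded bits cross; block B decodes locally
    """
    if decode_globally:
        lines = n_selects
    else:
        lines = max(1, (n_selects - 1).bit_length())  # bits needed to encode
    return {"long_lines": lines, "pipeline_flops": lines * pipeline_stages}


# Hypothetical case: 16 select lines, 2 pipeline stages along the way.
global_decode = long_wire_cost(16, pipeline_stages=2, decode_globally=True)
local_decode = long_wire_cost(16, pipeline_stages=2, decode_globally=False)

print(global_decode)  # 16 long lines, 32 pipeline flops
print(local_decode)   # 4 long lines, 8 pipeline flops
```

The model leaves out the cost of the local decoder itself, which has to be paid once per target block, and that is the other side of the trade-off.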

There are no strict criteria for deciding between the former and the latter architecture, as there is no definite crossover point. I believe this is more a matter of feeling and experience. It is just important to keep this in mind when working on large designs.