Archive for the ‘Synthesis’ Category


Real World Examples #4 – More on “Thinking Hardware”

January 20, 2009

I was reviewing some code not so long ago and noticed, together with the owner of the code, that we had some timing problems.
Part of the code looked something like this (Verilog):


wire [127:0] a;
wire [127:0] b;
wire [127:0] c;
assign c = select_register ? a : b;

For those not familiar with Verilog syntax, the code describes a MUX construct using the ternary operator. The two data inputs for the MUX are “a” and “b” and the select is “select_register”.

So why was this code translated into a relatively slow design? The answer lies in the width of the signals. The code actually synthesizes into 128 parallel MUX structures, so “select_register” actually drives 128 loads.
When a construct like this is hidden inside a large body of code, our tendency is to dismiss it as “only” a 2:1 MUX deep, but we have to look more carefully than that – and always remember to consider the load.

Solving this problem is relatively easy by replication: creating a few extra copies of “select_register” helped significantly.
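A sketch of the replication fix (the names clk, select_next and the 4-way split are my own illustration, not the original code): several copies of the select register each drive a slice of the wide MUX, cutting the fanout per register from 128 loads to 32.

```verilog
wire [127:0] a, b;
wire [127:0] c;

// Four copies of the select register, each driving a 32-bit slice.
// The synthesis tool must be told not to merge the copies back
// together (e.g. with a dont_touch attribute on these registers).
reg sel_r0, sel_r1, sel_r2, sel_r3;
always @(posedge clk) begin
  sel_r0 <= select_next;
  sel_r1 <= select_next;
  sel_r2 <= select_next;
  sel_r3 <= select_next;
end

assign c[ 31:  0] = sel_r0 ? a[ 31:  0] : b[ 31:  0];
assign c[ 63: 32] = sel_r1 ? a[ 63: 32] : b[ 63: 32];
assign c[ 95: 64] = sel_r2 ? a[ 95: 64] : b[ 95: 64];
assign c[127: 96] = sel_r3 ? a[127: 96] : b[127: 96];
```

The functionality is unchanged; only the load per select register drops by a factor of four.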


Who Said Clock Skew is Only Bad?

October 2, 2008

We always have this fear of adding clock skew. It seems to be one of the holy cows of digital design, but sometimes clock skew can actually be advantageous.

Take a look at the example below. The capturing flop would normally violate its setup requirement due to the deep logic cloud in front of it. By intentionally adding delay on the clock path we can make the clock arrive later at the capturing flop and thus meet the setup condition. Nothing comes for free, though: if there is another register just after the capturing one, the timing budget of that next stage is cut by the same amount.

This technique can be applied at the block level as well. Assume we have two blocks, A and B. B’s signals heading towards A are generated by a deep logic cloud, while A’s signals arriving at B are generated by a rather small one. Skewing the clock in the direction of A gives more timing budget to the B-to-A signals but eats away at the budget of the A-to-B signals.
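In SDC terms, intentional skew can be described as extra clock latency on the capturing register. A hedged sketch (the clock name clk, the instance name u_capture_reg and the 0.3 ns figure are all made up for illustration):

```tcl
# Delay the clock arrival at the capturing flop by 0.3 ns.
# This relaxes setup on the deep path feeding it, at the cost
# of 0.3 ns taken from the following pipeline stage.
set_clock_latency 0.3 [get_pins u_capture_reg/CK]
```

How the skew is actually implemented (extra buffers on the clock branch, useful-skew option in the CTS tool, etc.) is up to the physical implementation flow.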

Inserting skew is very much disliked by the physical implementation guys, although many of the modern tools handle it very nicely and even account for clock reconvergence pessimism (more on this in another post). I have the feeling this dislike is more a relic of the past; as we push designs to be more complex, faster, less power hungry etc., we have to consider such techniques.


Max Area = 0 ?

August 24, 2008

You are working on a design, you simulated the thing and it looks promising, and the first synthesis run also looks clean – job’s done, right? Wrong!

Many ASIC designers do not care about the area of their blocks. The design has to meet the max_transition, max_capacitance and timing requirements, but who cares about the area? Well, if you are an engineer at heart, you should.

I completely agree that it is a well accepted strategy not to constrain for area (or max_area = 0) when you first approach synthesis. But this doesn’t mean you should ignore the synthesis area reports, even if die size is not an issue in your project.

Not thinking about the area of your design is definitely a bad habit. Given that your transition, capacitance and timing requirements are met, you should aim for lower area. In many cases the tool will meet the timing requirements at the cost of huge logic duplication and parallelism. This is OK for the critical path, but if you can do better than that on the other paths, why not just “help” the tool?

For example, try pre-scaling wide increment logic, or pre-decoding deep logic clouds with information that is available a cycle earlier. This adds some flip-flops, but you might find your area decreasing significantly.
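A sketch of the pre-scaling idea for a counter (my own illustration of the technique, not code from the post): instead of one monolithic 32-bit incrementer, split the counter in two and decide the carry into the upper half one cycle ahead, so only a 16-bit adder sits on the critical path.

```verilog
module split_counter (
    input         clk,
    input         rst_n,
    output [31:0] count
);
  reg [15:0] cnt_lo;
  reg [15:0] cnt_hi;
  reg        hi_en;   // pre-decoded carry, registered a cycle ahead

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      cnt_lo <= 16'd0;
      cnt_hi <= 16'd0;
      hi_en  <= 1'b0;
    end else begin
      cnt_lo <= cnt_lo + 16'd1;
      // hi_en is decided one cycle early: it is 1 exactly in the
      // cycle where cnt_lo sits at 16'hFFFF, so cnt_hi increments
      // on the same edge that cnt_lo wraps to zero.
      hi_en  <= (cnt_lo == 16'hFFFE);
      if (hi_en)
        cnt_hi <= cnt_hi + 16'd1;
    end
  end

  assign count = {cnt_hi, cnt_lo};
endmodule
```

The extra flip-flop (hi_en) buys a much smaller and faster incrementer than the full-width version the tool would otherwise have to build timing-clean.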

There is almost no design that can’t be improved, sometimes with a lot of engineering effort, but most designs have a lot of low-hanging fruit. In my current project I was working with one of my best engineers on optimizing some big blocks that were a legacy from another designer. In almost all of them we were able to reduce the overall size by 30%, and in some cases by over 50%! This was not because the blocks were poorly designed; the previous designer simply cared less about area.

Bottom line – remember that smaller blocks mean:

    – Other blocks can be placed closer
    – Shorter wires need to be driven through the chip
    – Less hardware
    – Lower power
    – A design that is just more neat 🙂

Why Not Just Over-Constrain My Design?

June 25, 2008

This is a question often raised by beginners trying to squeeze performance out of their designs.
So why does over-constraining a design not necessarily improve performance? The truth is that I don’t really know. I assume it is connected to internal variables and measuring algorithms inside the synthesis tool, which give up trying to improve performance once they reach a certain local minimum in some n-variable space (really!).

But empirically, I (and many others) have found that you cannot get the best performance by just over-constraining your design in an unrealistic manner. The constraint has to be reasonably close to the actual maximum speed that can be reached. The graph below sums up this problem pretty neatly.

As seen above, there is a certain min-max range for the achievable frequency, and its peak is not reached by constraining for the highest frequency!
The flat region on the left of the figure is the speed reached without any optimization, that is, right after mapping your HDL into gates. As we move to the right, we see actual speed improvement as we constrain for higher speeds. Then a peak is reached, and constraining for even higher speeds results in poorer performance.

I have worked relatively little with FPGAs in my career, but I have seen this phenomenon there as well. Keep it in mind.


ECO Flow

December 5, 2007

Here is a useful checklist you should use when doing your ECOs.

  1. RTL bug fix
     Correct the bug in the RTL, then run simulations for the specific test cases plus some of your general golden tests. Check that you corrected the problem and, more importantly, that you didn’t break any correct behavior.

  2. Implement the ECO in the synthesis netlist
     Using your spare cells and/or rewiring, implement the bug fix directly in the synthesis Verilog netlist. Remember, you do not re-synthesize the entire design; you are patching it locally.

  3. Run an equivalence check between the synthesis netlist and the RTL
     Using your favorite (or available) formal verification tool, run an equivalence check to see whether the code you corrected really matches the netlist you patched. Put simply, the formal verification tool runs through the entire state space and tries to find an input vector that creates two different states in the RTL code and the synthesis netlist. If the two designs are equivalent, you can be sure your RTL simulations would give the same result (logically speaking) as the synthesis netlist.

  4. Implement the ECO in the layout netlist
     You now have to patch your layout netlist as well. Notice that this netlist is very different from the synthesis netlist: it usually has extra buffers inserted for edge shaping or hold-violation fixing, and it may even be optimized into completely different logic.
     This is the real thing; a change here has to take into account the actual position of the cells, the actual instance names, etc. Work in close proximity with the layout expert, and make sure you know and understand the floorplan as well. It is very common to connect to a logic gate on the other side of the chip just because it is logically correct, when in reality it will violate timing requirements.

  5. Run an equivalence check between the layout and synthesis netlists
     This makes sure the changes you made in the layout netlist are logically equivalent to the synthesis netlist. Some tools and internal company flows enable a direct comparison of the layout netlist to the RTL; in many they do not, and one has to go through the synthesis netlist change as well.

  6. Layout to GDS / gate-level simulations / STA runs on the layout netlist (all that backend stuff…)
     Let the layout guys do their magic; as a designer you are usually not involved in this step.
     However, depending on your timing closure requirements, run STA on the layout netlist to see that everything is still OK. This step might be the most crucial, since even a very small change can create huge timing violations and force you to redo your work.
     Gate-level simulations are also recommended, depending on your application and internal flow.
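As a toy illustration of step 2 (every cell, instance and net name below is invented; real library cell names depend on your technology library), here is what gating a control signal with spare cells might look like directly in the gate-level netlist:

```verilog
// Before the ECO: ctrl drove the flop's data input directly.
//   DFF_X1 u_ctrl_reg (.D(ctrl), .CK(clk), .Q(ctrl_q));

// After the ECO: ctrl is blocked while block_n is low, built
// from an existing spare NAND plus a spare inverter instead of
// re-synthesizing the design (NAND followed by INV gives AND).
wire ctrl_gated_n, ctrl_gated;
NAND2_X1 u_spare_nand_17 (.A1(ctrl), .A2(block_n), .ZN(ctrl_gated_n));
INV_X1   u_spare_inv_03  (.A(ctrl_gated_n),        .ZN(ctrl_gated));
DFF_X1   u_ctrl_reg      (.D(ctrl_gated), .CK(clk), .Q(ctrl_q));
```

The equivalence checks in steps 3 and 5 then verify that this hand edit matches the corrected RTL.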


Spare Cells

November 26, 2007

What are spare cells and why the heck do we need them?

Spare cells are basically elements embedded in the design which do not drive anything. The idea is that they might enable an easy (metal-only) fix without the need for a full redesign.

Sometimes not everything works after tape-out: a counter might not be reset correctly, a control signal might need to be blocked when another signal is high, etc. These kinds of problems could be solved easily “if only I had another AND gate here…”
Spare cells aim to give us a chance of solving those kinds of problems. Generally, the layout guys try to embed cells that drive nothing in the free spaces of the floorplan. There is almost always free space around, and adding more cells costs us nothing in power (maybe a little leakage in newer technologies), area (the space is there anyhow) or design time (the process is 99% automatic).
Having spare cells might mean we can fix a design for a few tens of thousands of dollars (sometimes less) rather than a few hundreds of thousands.

So which spare cells should we use? It is always a good idea to have a few free memory elements, so I would recommend a few flip-flops. Even a number as low as 100 flip-flops in a 50K flip-flop design is usually OK. Remember, you are not trying to build a new block, but rather to have a cheap shot at a solution by rewiring some gates and flip-flops.
Which gates should we throw in? If you remember some basic Boolean algebra, you know that NANDs and NORs can each implement any Boolean function! This means that integrating only NANDs or only NORs as spare cells would be sufficient. Usually both NANDs and NORs are thrown in for more flexibility, and 3-input, or even better, 4-input NANDs and NORs should be used.

A small trick is tying all inputs of the NANDs to logical “1” and all inputs of the NORs to logical “0”. This way, if you later decide to use only 2 of the 4 inputs, the remaining inputs do not affect the output (check it yourself), which in turn means less layout work when tying and untying the inputs of those spare cells.
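In the post-synthesis netlist the tie-off trick might look like this (library cell, instance and net names are illustrative, not from any specific library):

```verilog
// 4-input spare NAND, all inputs tied to 1. Rewiring only A1 and
// A2 later leaves the function NAND(A1, A2): the tied-high A3/A4
// inputs have no effect on the output.
NAND4_X1 u_spare_nand_0 (.A1(1'b1), .A2(1'b1), .A3(1'b1), .A4(1'b1),
                         .ZN(spare_nand_0_zn));

// 4-input spare NOR, all inputs tied to 0. Rewiring only A1 and
// A2 later leaves the function NOR(A1, A2): the tied-low A3/A4
// inputs have no effect on the output.
NOR4_X1  u_spare_nor_0  (.A1(1'b0), .A2(1'b0), .A3(1'b0), .A4(1'b0),
                         .ZN(spare_nor_0_zn));
```

An ECO then only has to untie the inputs it actually needs and reroute them to real signals.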

The integration of spare cells is usually done after the synthesis step and, in the Verilog netlist, basically looks like an instantiation of library cells. It should not be done earlier, since the synthesis tool would simply optimize all those cells away because they drive nothing. The layout guy then has to spread the spare cells around evenly, by feeling (or black magic).

I believe that when an ECO (Engineering Change Order) is needed and a metal fix is considered, this is where our real work as digital designers starts. I consider ECOs, and in turn the use of spare cells to solve or patch a problem, the epitome of our skills, experience, knowledge and creativity!

More on ECOs will be written in the future…


A Short Note on Automatic Clock Gates Insertion

June 13, 2007

As we discussed before, clock gating is one of the most solid logic design techniques one can use when aiming for low power.
It is only natural that most tools on the market support automatic clock gating insertion. Here is a quote from a Synopsys article describing their Power Compiler tool:

…Module clock gating can be used at the architectural level to disable the clock to parts of the design that are not in use. Synopsys’ Power Compiler™ helps replace the clock gating logic inserted manually, gating the clock to any module using an Integrated Clock Gating (ICG) cell from the library. The tool automatically identifies such combinational logic…

But what does it really mean? What is this combinational logic that the tool “recognizes”?

The answer is relatively simple. Imagine a flip-flop with an enable signal. Implementation-wise, this is a normal flip-flop with a MUX in front of it, whose feedback path preserves the logical value of the flop when the enable is low. This is equivalent to removing the MUX and using the enable signal to control a clock gate cell, which in turn drives the flip-flop’s clock.
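The two equivalent forms can be sketched in Verilog (the ICG is modeled behaviorally here for illustration; a real flow instantiates a library ICG cell, and the signal names are my own):

```verilog
module clk_gate_equiv (
    input      clk,
    input      en,
    input      d,
    output reg q_mux,   // form 1: enable flop
    output reg q_icg    // form 2: clock-gated flop
);
  // Form 1: flip-flop with enable. Synthesis infers a MUX in
  // front of the flop with a feedback path from Q.
  always @(posedge clk)
    if (en) q_mux <= d;

  // Form 2: same function via a clock gate.
  // Behavioural ICG: latch the enable while clk is low (to avoid
  // glitches on the gated clock), then AND it with the clock.
  reg en_lat;
  always @(clk or en)
    if (!clk) en_lat = en;   // transparent while clk is low

  wire gclk = clk & en_lat;

  always @(posedge gclk)
    q_icg <= d;              // this flop only toggles when enabled
endmodule
```

The second form saves the MUX and, more importantly, stops the clock from toggling the flop (and its clock buffers) when the enable is low.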

The picture below is better than any verbal explanation.

[Figure: auto_clock_gating.png]