Reducing Power Through Retiming

February 23, 2009

Here is an interesting and almost trivial technique for (potential) power reduction, which I have never used myself, nor seen used in others’ designs. Well… maybe I am doing the wrong designs… but I thought it was well worth mentioning. So, if any of my readers use this, please post a short comment on how exactly you implemented it and whether it really resulted in significant savings.

We usually have many high-activity nets in a design. In many cases they toggle more than once per cycle while a calculation is in progress. Even worse, they often drive long, high-capacitance nets. Since in a usual synchronous design (which 99% of us do) we only need the stable result once per cycle, when the calculation is done, we can simply put a register at the driver of the high-capacitance net. The register blocks the intermediate toggling on that net (where it hurts) and allows the net to change at most once per cycle.

The image below tells the whole story. (a) is before the insertion of the flop, (b) right after.


This is all nice, but remember that in real life it can be quite hard to identify those special nets and those special high-toggling logic clouds. Moreover, most of the time we cannot afford the extra flop for latency reasons. But if you happen to be in the early design phase and you more or less know your floor plan, think about moving some of your existing flops so that they reduce the toggling on those high-capacitance nets.
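The register insertion described above can be sketched in RTL roughly as follows. This is only an illustration under assumptions: the module name, the port names and the logic inside the combinational cloud are all hypothetical and not taken from a real design.

```verilog
// Hypothetical example: all names and the cloud logic are illustrative only.
module registered_long_net #(
    parameter WIDTH = 32
) (
    input  wire             clk,
    input  wire [WIDTH-1:0] a,
    input  wire [WIDTH-1:0] b,
    input  wire [WIDTH-1:0] c,
    output reg  [WIDTH-1:0] long_net   // assumed to be the long, high-capacitance net
);
    // Combinational cloud: its output may glitch several times per cycle
    // while the inputs settle and the carry chain resolves.
    wire [WIDTH-1:0] cloud_result = (a + b) ^ c;

    // Case (b): the flop sits at the driver of the long net, so the net
    // sees only the settled value and changes at most once per cycle.
    always @(posedge clk)
        long_net <= cloud_result;
endmodule
```

Note that a newly added flop costs one cycle of latency. The retiming variant moves a flop that already exists at the far end of the net to the near end, so the total latency stays the same.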


  1. Apart from timing, this becomes a verification nightmare. How can we formally verify the design if we use retiming? Retiming support in FV tools is very basic, and the tools will not understand all of the retiming done in the design.

  2. Kiran
    My understanding is that we have to insert the register in the RTL. I think it’s not easy to identify that cloud of logic in the RTL, as Nir said (if I understood correctly), and then insert a register. If it is done in the RTL, then there is no problem with the FV tools.

  3. Hi Nir,

    It’s always interesting to read your blog. I have picked up quite a few valuable insights from reading your posts. I have a question regarding using Verilog for parameterization. How do you use a MEM_DEPTH parameter to come up with a MEM_ADDRESS parameter? This is essentially a log operation that Verilog doesn’t have native support for.


    • Hi Sanjay,

      You can very easily write a function that implements the log base 2 logic using shift operations. It takes the MEM_DEPTH parameter as input and returns the MEM_ADDRESS parameter.

      See the examples below:

      // Ceil log2 calculation for an integer (depth >= 1)
      function integer clogb2 ( input integer depth );
        begin
          // Subtract one first so that exact powers of two round correctly,
          // e.g. a depth of 8 gives 3 rather than 4.
          depth = depth - 1;
          for (clogb2 = 0; depth > 0; clogb2 = clogb2 + 1)
            depth = depth >> 1;
        end
      endfunction

      // Floor log2 calculation for an integer (left commented out)
      //function integer flogb2 ( input integer depth );
      //  begin
      //    for (flogb2 = 0; depth > 1; flogb2 = flogb2 + 1)
      //      depth = depth >> 1;
      //  end
      //endfunction
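      As a usage sketch, a function like this can be called from a localparam to derive the address width at elaboration time. The module below is hypothetical (its name, ports and the wrap-around counter are assumptions for illustration only). Its copy of clogb2 subtracts one before the loop so that a depth of 8 yields 3 address bits, and it assumes MEM_DEPTH is at least 2.

```verilog
// Hypothetical example: an address counter sized from MEM_DEPTH.
module addr_counter #(
    parameter MEM_DEPTH = 1024            // assumed to be >= 2
) (
    input  wire clk,
    input  wire rst,
    output wire last                      // high on the final address
);
    // Constant function: ceil(log2(depth)) for depth >= 1.
    function integer clogb2 (input integer depth);
        begin
            depth = depth - 1;
            for (clogb2 = 0; depth > 0; clogb2 = clogb2 + 1)
                depth = depth >> 1;
        end
    endfunction

    // Address width derived from the depth at elaboration time.
    localparam MEM_ADDRESS = clogb2(MEM_DEPTH);

    reg [MEM_ADDRESS-1:0] addr;

    assign last = (addr == MEM_DEPTH - 1);

    // Count 0 .. MEM_DEPTH-1, then wrap back to 0.
    always @(posedge clk) begin
        if (rst || last)
            addr <= {MEM_ADDRESS{1'b0}};
        else
            addr <= addr + 1'b1;
    end
endmodule
```

      If your tools support Verilog-2005 or SystemVerilog, the built-in $clog2 system function covers the same need without a hand-written function.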

  4. Hi,
    A Verilog preprocessor is what you are looking for.
    e.g.: http://cvs.seul.org/viewcvs/viewcvs.cgi/eda/vbpp/

    you can read about the features here:

    It’s a bit outdated but still works like a charm for me… (the Debian package is vbpp)


  5. Dear Nir,
    Sounds interesting, but I would not use this approach in my designs for the following reasons…
    1. Flops have their own dynamic power cost
    2. If you use clock gating you may save power, but it adds area on top of the flop itself
    3. Nowadays we are working on high-speed designs… breaking the timing path with flops will compromise the efficiency of the design

    The typical approach we use post-CTS is to upsize cells and add buffers to increase the drive strength, so that the capacitive load seen by each driver is reduced

  6. This was one of the first lessons in my Low Power VLSI class at IISc — effective pipeline partitioning helps in low power.

  7. Hi Nir,
    I’m curious how many of the toggling pulses that you see in digital simulation actually occur in silicon. The delays and signal transition times will be different from the sharp 1-0 transitions that digital simulation models, and the worst-case delays will be different from the typical delays in silicon. If you have pulse suppression modeled in your cell library, that might be the best you can do to accurately count transitions for power estimation. I’d be curious what you think about whether digital simulation gives you sufficient accuracy in this regard.
