Archive for July, 2007


The Double Edge Flip Flop

July 31, 2007

Sometimes it is necessary to use both the rising and the falling edge of the clock to sample the data. This comes up (naturally) in many DDR applications. The double edge flop is sometimes depicted like this:


The simplest design one can imagine (at least the simplest I can imagine…) would be to use two flip flops, one sensitive to the rising edge of the clock and the other to the falling edge, and to MUX the outputs of both, using the clock itself as the select. This approach is shown below:


What’s wrong with the above approach? Well, in an ideal world it is OK, but we have to remember that semi-custom tools/users don’t like to have the clock in the data path. This requirement is justified: a clock in the data path can cause a lot of headaches later when doing the clock tree synthesis and when analyzing the timing reports. It is a good idea to avoid such constructions unless they are absolutely necessary. This recommendation also applies to the reset net – try not to combine the reset net into your logic clouds.

Here is a cool circuit that can help solve this problem:


I will not deprive you of the pleasure of drawing the timing diagrams yourself 🙂 and working out how and why this circuit works. Let me just say that IMHO this is a darn cool circuit!

Searching the web a bit, I came across a paper by Ralf Hildebrandt which describes practically the same idea. He names it a “Pseudo Dual-Edge Flip Flop”; you can find his short (but more detailed) description, including VHDL code, here.
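For readers who prefer checking behavior in software before drawing waveforms, here is a minimal behavioral sketch in Python. The mux version follows the description above. The XOR version is an assumption on my part: it models the commonly cited pseudo dual-edge structure (each flop samples the data XORed with the other flop's output, and the circuit output is the XOR of both flop outputs), which may differ in detail from the circuit pictured. The function names and the edge-by-edge modeling are illustrative only.

```python
import random

def mux_deff(data):
    """Two flops plus a mux that uses the clock as select (clock in the data path)."""
    q_pos = q_neg = 0
    out = []
    for i, d in enumerate(data):
        clk = 1 if i % 2 == 0 else 0      # clock level after this edge; even index = rising edge
        if clk:
            q_pos = d                     # rising-edge flop samples the data
        else:
            q_neg = d                     # falling-edge flop samples the data
        out.append(q_pos if clk else q_neg)
    return out

def xor_deff(data):
    """Assumed XOR-based structure: the clock only reaches the flop clock pins."""
    q_pos = q_neg = 0
    out = []
    for i, d in enumerate(data):
        if i % 2 == 0:
            q_pos = d ^ q_neg             # rising-edge flop samples d XOR q_neg
        else:
            q_neg = d ^ q_pos             # falling-edge flop samples d XOR q_pos
        out.append(q_pos ^ q_neg)         # output is the XOR of both flop outputs
    return out

random.seed(0)
data = [random.randint(0, 1) for _ in range(32)]   # one data value per clock edge
print(mux_deff(data) == data, xor_deff(data) == data)
```

In this model both versions reproduce the data after every edge. The difference is that the XOR version never uses the clock as a logic signal. The model ignores all delays, so it says nothing about timing or glitches on the output.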



July 25, 2007

Replication is an extremely important technique in digital design. The basic idea is that under some circumstances it is useful to take the same logic cloud or the same flip-flops and produce more instances of them, even though only a single copy would normally be enough from a logical point of view.
Why would I want to spend more area on my chip and create more logic when I know I could do without it?

Imagine the situation in the picture below. The darkened flip-flop has to drive 3 other nets all over the chip and, due to the physical placement of the capturing flops, it cannot be placed close to all of them. As a compromise, the layout tool finds some place in the middle, which in turn will generate a negative slack on all the paths.


We notice that in the above example the logic cloud just before the darkened flop has a positive slack or in other words, “some time to give”. We now use this and produce a copy of the darkened flop, but this time closer to each of the capturing flops.


Yet another option is to duplicate the entire logic cloud plus the sending flop, as pictured below. This will usually generate even better results.


Notice that we also reduce the fan out of the driving flop, thus further improving on timing.

While writing the HDL code, it is important to make sure that the paths are really separated. This means that when you want to replicate flops and logic clouds, you must give the registers/signals/wires different names. It is a good idea to keep some sort of naming convention for replicated paths, so that in the future, when a change is made on one path, it is easy to mirror that change on the other replications.

There is no need to mention that when using this technique we pay in area and power – but I will still mention it 🙂


Puzzle #7 – Transitions

July 17, 2007

It’s time for puzzle #7.

An FSM receives an endless stream of “0”s and “1”s. The stream cannot be assumed to have certain properties like randomness, transition density or the like.

Is it possible to build a state machine, which at any given moment outputs whether there were more 0–>1 or 1–>0 transitions so far?

If yes, describe briefly the FSM. If no, give a short proof.


2 Lessons on PRBS Generators and Randomness

July 10, 2007

The topic of “what is random” is rather deep and complicated. I am far from an authority on the subject and must admit to being pretty ignorant about it. However, this post will deal with two very simple but rather common errors (or misbehaviors) in the use of random number generators.

    LFSR width and random numbers for your testbench

Say you designed a pretty complicated block or even a system in HDL and you wish to test it by injecting some random numbers into the inputs (just for the heck of it). For simplicity, let’s assume your block receives an integer with a value between 1 and 15. You think to yourself that it would be pretty neat to use a 4-bit LFSR which generates all possible values between 1 and 15 in a pseudo-random order and just repeat the sequence over and over again. Together with the other types of noise you inject into the system, this should be pretty thorough, right? Well, not really!

Imagine for a second what the sequence looks like: each number will always be followed by the same specific number! For example, you will never be able to verify a case where the same number is injected into the block twice in a row!
To verify all other cases (at least for all different pairs of numbers) you would need to use an LFSR with a larger width (How much larger?). What you need to do then is to pick only 4 bits of this bigger LFSR and inject them into your block.
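The point can be checked with a few lines of Python. This is only a sketch under my own assumptions: a Fibonacci-style LFSR, taps (4, 3) for the 4-bit case and (8, 6, 5, 4) for an 8-bit case, and the wide LFSR is advanced 4 steps per sample. The stepping matters: adjacent bits of a shift register taken on consecutive clocks are just shifted copies of each other, so sampling the same 4 bits every single step would still leave each value with only two possible successors.

```python
def lfsr_step(state, width, taps):
    """Advance a Fibonacci LFSR by one bit; taps are 1-based bit positions."""
    fb = 0
    for t in taps:
        fb ^= (state >> (t - 1)) & 1
    return ((state << 1) | fb) & ((1 << width) - 1)

def successor_pairs(width, taps, steps_per_sample, seed=1):
    """Collect every (value, next value) pair seen on the low 4 bits over one period."""
    pairs = set()
    state = seed
    for _ in range((1 << width) - 1):
        nxt = state
        for _ in range(steps_per_sample):
            nxt = lfsr_step(nxt, width, taps)
        pairs.add((state & 0xF, nxt & 0xF))
        state = nxt
    return pairs

narrow = successor_pairs(4, (4, 3), 1)         # the 4-bit LFSR used directly
wide = successor_pairs(8, (8, 6, 5, 4), 4)     # 4 bits taken from an 8-bit LFSR

print(len(narrow), any(a == b for a, b in narrow))   # 15 False
print(len(wide), any(a == b for a, b in wide))       # 255 True
```

With these assumptions the 4-bit LFSR produces only 15 of the possible pairs and never repeats a value back to back, while the 8-bit version covers every pair except (0, 0), including repeated values. Note that the 4-bit samples taken this way can also be 0, so a block that only accepts 1 to 15 would need that value filtered out.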

I know this sounds very obvious, but I have seen this basic mistake done several times before – by me and by others as well (regardless of their experience level).

    PRBS and my car radio “mix” function

On sunny days I ride my bicycle to work, but on rainy days I chicken out and use the car for the 6km I have to go. Since I don’t often like what is on the radio, I decided to go through my collection of CDs, choose the 200 or so songs I would like to listen to in the car and burn them as mp3s on a single CD (don’t ask how much time this took). Unfortunately, if you just pop in the CD and press play, the songs play in alphabetical order. Luckily enough, my car CD player has a “mix” option. So far so good, but after a while I started to notice that when using the “mix” option, song 149 is always followed by song 148, which in turn is followed by song 18, and believe me this is annoying to the bone. The whole idea of “mixing” is that you don’t know what to expect next!

I assume that the “mix” function is accomplished by some sort of PRBS generator, which explains the deterministic order of song playing. But my advice to you, if you design a circuit of this sort (for a CD player, or whatever), is to introduce some sort of true randomness into the system. For example, one could time the interval between power-up of the radio and the first human keystroke on the CD player and load this into the PRBS generator as a seed value, thus producing a different starting song for the play list each time. This, however, does not solve the problem of the song playing order being deterministic. But given such a “random” number from the user, one could use it to generate an offset for the PRBS generator, making it “jump” an arbitrary number of steps instead of the usual one step.
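Here is a rough Python sketch of the seed-plus-jump idea. Everything specific in it is my assumption, not how any real CD player works: an 8-bit LFSR with taps (8, 6, 5, 4), a hypothetical `timer_ticks` value standing in for the measured power-up-to-keystroke interval, and states above the number of songs simply being skipped. One detail worth noting is that the jump size should share no common factor with the LFSR period (255 here), otherwise some songs would never be played.

```python
from math import gcd

PERIOD = 255  # an 8-bit maximal-length LFSR cycles through 255 non-zero states

def lfsr8_step(state):
    """One step of an 8-bit Fibonacci LFSR with taps 8, 6, 5, 4 (assumed polynomial)."""
    fb = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
    return ((state << 1) | fb) & 0xFF

def play_order(num_songs, timer_ticks):
    """Derive one full play order (songs numbered from 1) from a user-timing value."""
    seed = timer_ticks % PERIOD + 1                       # any non-zero state is a valid seed
    stride = (timer_ticks // PERIOD) % (PERIOD - 1) + 1   # how many steps to jump per song
    while gcd(stride, PERIOD) != 1:                       # keep the jump coprime with the period
        stride += 1
    order, state = [], seed
    for _ in range(PERIOD):
        if state <= num_songs:                            # skip states that are not song numbers
            order.append(state)
        for _ in range(stride):
            state = lfsr8_step(state)
    return order

a = play_order(200, 12345)
b = play_order(200, 54321)
print(a[:5], b[:5])
```

Each call still plays every song exactly once, but both the starting song and the "which song follows which" relation now depend on the timing value. This sketch only handles up to 255 songs; a longer play list would need a wider LFSR.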

My point was not to indicate that this is the most clever way to do things, but I do think that with little effort one could come up with slightly more sophisticated systems, that make a big difference.


The Ultimate Interview Question for Logic Design – A Mini Challenge

July 9, 2007

I have had countless interviews with many different companies, large corporations and start ups. For some reason, in almost all the interviews which were done in Israel, a single question popped up more often than others (maybe it is an Israeli High-Tech thing…).

Design a clock divide-by-3 circuit with 50% duty cycle

The solution should be easy enough even for a beginner designer. Since this is such a popular question, and since I am getting a decent amount of readers lately, I thought why not make a small challenge – try to find a solution to this problem with minimum hardware.

Please send me your solutions by email – can be found on the about me page.


Puzzle #5 – Binary-Gray counters – solution

July 5, 2007

The binary-Gray puzzle from last week generated some flow of comments and emails.
Basically, the important point to notice is the amount each counter toggles while going through a complete counting cycle.

For a Gray coded counter, by definition only one bit changes at a time. Therefore, for an n-bit counter we get 2^n toggling events for a complete counting cycle.

For a binary coded n-bit counter, we have 2^(n+1)-2 toggling events for a complete counting cycle. You could verify this by

  1. Taking my word for it (don’t – check it yourself)
  2. Writing down the results manually for a few simple cases and convincing yourself it is so
  3. Calculating the general case, for which you have to remember how to calculate the sum of a simple series (best way)

Anyways, given the above assumptions and the fact that per bit the Gray counter consumes 3 times more power (2 times more would also just work, but the difference would be a constant), the Gray counter will always consume more power.
3*2^n > 2^(n+1) – 2
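If you want a fourth way to verify the toggle counts, a brute-force check is easy to write. The sketch below is mine: it assumes the standard reflected Gray code (i XOR i>>1) and counts a complete cycle including the wrap-around back to zero. For the binary counter, the LSB toggles 2^n times per cycle and each higher bit half as often as the one below it, which is the series that sums to 2^(n+1)-2.

```python
def toggles_per_cycle(codes):
    """Count bit flips over one full counting cycle, including the wrap to the first code."""
    wrapped = codes[1:] + codes[:1]
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, wrapped))

for n in range(1, 11):
    binary = list(range(1 << n))
    gray = [i ^ (i >> 1) for i in binary]          # reflected Gray code (assumed encoding)
    assert toggles_per_cycle(gray) == 2 ** n
    assert toggles_per_cycle(binary) == 2 ** (n + 1) - 2
    # With 3x the power per toggle, the Gray counter always loses.
    assert 3 * toggles_per_cycle(gray) > toggles_per_cycle(binary)

print("toggle counts verified for n = 1..10")
```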


Some Layout Considerations

July 1, 2007

I work on a fairly large chip. The more I reflect on what could have been done better, the more I realize how important floor planning is, and how important it is to identify long lines within the chip and tackle these problems in the architectural planning phase.

The average digital designer will be happy once he has finished his HDL coding, simulated it and verified it is working fine. Next he will run it through synthesis to see if timing is OK, and job done, right? Wrong! There are many problems that simply can’t surface during synthesis. To name a few: routing congestion, cross talk effects, parasitics etc. This post will try to concentrate on another issue which is much easier to understand, but by the time you encounter it, it is usually too late in the design to do anything radical about it – the physical placement of flip-flops.

The picture below shows a hypothetical architecture of a design, which is very representative of the problems I want to describe.


Flop A is forced to be placed close to the analog interface at the bottom, to have a clean interface to the digital core. In the same way, Flop B is placed near the top, to have a clean interface to the analog part at the top. The signal between them needs to physically cross the entire chip. The layout tools will place many buffers to keep the edges clean and sharp, but in many cases timing is still violated. If this signal has to make it across within one clock period, you are in trouble. Many times this is not the case, and pipeline stages can be added along the way, or a multi-cycle path can be defined.
Most designers choose to introduce pipeline stages and to have a cleaner synthesis flow (less special constraints).

The other example shown in the diagram is a register that has loads all over the design. It drives signals in the analog interfaces as well as some state machines in the core itself. Normally, this is not a single wire but an entire bus and pipelining this can be very expensive. In a typical design there are hundreds of registers controlling state machines and settings all over the chip, with wires criss crossing by the thousands. Locating the bad guys should be done as soon as possible.

Some common solutions are:

  1. Using local decoding as described in this post
  2. Reducing the width of your register bus (costs in register read/write time)
  3. Defining registers as quasi-static – changeable only during the power up sequence, static during normal operation