Counting Really, Really Fast With An FPGA

fast

During one of [Michael]’s many forum lurking sessions, he came across a discussion about frequency counting on a CPLD. He wondered if he could do the same on an FPGA, and how hard it would be to count high clock rates. As it turns out, it’s pretty hard with a naive solution. Being a bit more clever turns the task into a cakewalk, with a low-end FPGA being able to count clocks over 500 MHz.

The simplest solution for counting a clock would be to count a clock for a second with a huge, 30-bit counter. This is a terrible idea: long counters have a lot of propagation delays. Also, any sampling would have to run at least twice as fast as the input signal – not a great idea if you’re counting really fast clocks.

The solution is to have the input signal drive a very small counter – only five bits – and sample the counter using a slower clock on board the FPGA. [Michael] used a 5-bit Gray code, getting rid of the problem of the ‘11111’ to ‘00000’ rollover of a normal binary counter.

Because [Michael] is using a 5 bit clock with 31 edges sampled at 32 MHz, he can theoretically sample a 992 MHz clock. There isn’t a chance in hell of the Spartan 6 on his Papilio Pro board ever being able to measure that, but he is able to measure a 500 MHz clock, something that would be impossible without his clever bit of code.

25 thoughts on “Counting Really, Really Fast With An FPGA

  1. There is annother interesting way to implement a counter like this, and that is to remove the adder from the critical path. Instead of carrying you keep two numbers that when added together make up the count, you can do this with only one full-adder per bit. and the cost is an additional flip-flop per bit, however this should run at the speed of a single full adder.

    1. Still, I think his method will yield slightly higher clock rates, since there is almost no routing overhead in the ‘high-speed’ part of the design.

  2. Hi – I’m the project’s author.

    I was thinking of offering a prize for a faster Spartan 6lX9 design, where all it had to do was count the frequency of the incoming signal and squirt ASCII down any serial port.
    However that would be unfair of me as it is the maximum switching rate of the clock buffer that is the limiting factor, not the programmable logic – as it should be the fastest design that passes timing. 8-)

    The underlying logic of LFSRs are faster, but how do you cleanly transfer the state of the LFSR into your clock domain and then convert it back to a binary number? This makes LFSRs better four when you are counting to a preset terminal count then reseting.

    One other strike against LFSRs is that the logic blocks in FPGAs have special features that speed up addition (most notably the carry chain), so a long LFSR will be slower that a 5-bit binary counter.

    Mike

    1. If you are counting frequency, you are essentially looking at the change in clock count over a period of time. e.g. every 0.1sec (or whatever)

      Can’t you run the counter itself asynchronously and not worry about getting the reading while it is live? Once the sampling period is over, you gate the input signal to stop the counter. When the counter is stopped, you wouldn’t need to worry about crossing the clock domains any more because its bits stopped changing already. Examine the bits and when you are done let sampling period starts again and the counter to do it work again.

      1. there’s many way to skin a cat but you still end up with problem that ~500MHz is the limit on the IOs and clocks

      2. Yes – you can gate the clock, but then you will need a long counter, 29 bits if you want to count for a second at up to 500MHz.

        I’m not sure if that is possible in the FPGA I’m using, as it will require a 0.066 ns carry delay per bit, and it looks to be about 0.12 ns from the datasheet.

        To use that technique might have to use around a 16-bit counter, and would need to pause it every 125us to read back the current count. This would cause errors in the final total.

        1. AFAICS the carry delay doesn’t matter. It matters if you want a synchronous counter, but we don’t. If you make a simple asynchronous counter (aka ripple counter), there is no carry as such; each bit toggles when the next lower bit goes from 1 to 0. Now reading the all the bits at once, when the clock is running, is inconsistent/incorrect, because the bits are progressively skewed.

          But the thing is, you don’t need to read it while it’s running! At the end of the measurement period, you gate the input off, then wait enough time for the last toggle to propagate through (assuming the worst-case, where the MSB is 0 and all other bits are 1, so it’s toggling from 01111… to 1000…., it’s 0.12ns * 30ish, so pick any convenient multiple of a convenient clock that gives you 10ns or so), _then_ read the output.

          about ripple counters, for those who don’t know them…

          http://www.eecs.tufts.edu/~dsculley/tutorial/flopsandcounters/flops6.html

          1. You can do that for later stages, but for early stages, clocking via local routing decreases greatly your maximum possible frequency.

          2. Guess no one reads my comment (just 1 comment below you.) That’s how they make a 400MHz frequency counter in the old grand grand daddy XC4000. They used a chain of ripple counter. That chip wasn’t even fast by any stretch of imagination.

            >Consequently, the first stage is an unconditional divide-by-2,
            as shown in Figure 2. The clock-to-setup delay of 2.44 ns
            permits 400-MHz operation even under worst-case conditions.

            Now how much faster is the FPGA used again? If you can make a gray code counter at 500MHz to count your input, you have already a way to get that clock into the FPGA and toggle some logic inside. The part lacking is to think outside of the everything should be synchronous mindset.

            Once you get the 500MHz in this case down to 250MHz, even your average undergrad EE should be able to make the rest out of synchronous logic in a modern FPGA.

      3. BTW take a look at this:

        http://caxapa.ru/thumbs/404644/XC4002XL_fcounter_400MHz.pdf

        >This article describes a full-featured, single-chip frequency
        counter that operates at 400 MHz, consumes only 130 mW at
        the maximum input frequency, and occupies less than 90% of
        an XC4002XL, the smallest XC4000 family member.

        If they can measure 400MHz with the old XC4000, I am pretty sure that the same trick can be done on after so many generation of improvements.

        Now that was a much better hack.

        1. Sure, you just immediately divide down the input in a CLB flop, and use that to count the clock (and possibly do something to pick up the +/-1, but that’s not a big deal).

          The improvement isn’t ‘huge’ though, despite so many generations of improvements, because FPGAs aren’t optimized for this – there is no fast loopback path from a register to itself. The clock speed limitation is a combination of the output delay, the loopback delay, and the setup delay of a single CLB. On the old XC4000, that total was ~2.44 ns. On a Spartan-6 this is ~1.7 ns.

          So it doesn’t get you much faster at all. Realistically once you get past 500 MHz, you have to get creative no matter what generation FPGA you’re using.

          1. I got round to trying trying using a flip-flop as a /2 prescaler – half a dozen more lines of code and it passes timing at 1.051 ns (950 MHz). However I don’t have any suitable test gear to test it with.

            Thank’s for the idea Tekkieneet!

            Still feel that it is a bit like cheating – I’ll have to set the gate time to two seconds to make up for the missing LSB :)

    2. If you are already at the maximum frequency, then yes, there is little advantage to an LFSR. However, an LFSR will always be faster than a simple counter, even with a carry chain, as it only requires a single gate with two inputs. The rest of the chain is just shift registers. You can even pack those shift registers into SRLs, up to a 128 bits in a single slice (but that limits flexibility).

      As far as reading it periodically, you can just clock it into a second set of registers (periodically) with the high speed clock, then with a low speed clock pass it through logic for converting to numerals. You can also do comparisons utilizing the high speed clock and a single LUT if you are comparing to a 6 bit constant, or using a series of LUTs and a carry chain if you are comparing against a larger constant.

    3. So don’t use a clock buffer. At least not one straight off.

      You could probably count higher with a clever input structure, but you’d have to think about it. Toggle frequencies in a CLB are much higher (~800 MHz), and the fact that there’s a high latency path from fabric to clock tree doesn’t matter for a counter.

      Heckuva lot more work, though.

    4. Counting the input clock and measuring its frequency are two very different things. Counting accurately is probably limited to around 500 MHz easily. Measuring its frequency, though, could be done with a TDC, and there you can get into the GHz range.

      1. Yeah. I’m a little baffled by why they are doing this in such an odd way if what they seem to actually want is frequency measurement…when 500+ MHz was easily within the realm of discrete logic…

        1. at varying degrees of easily, your standard junk box flipflop, even a fast one, isn’t going to work at anywhere near 500MHz

  3. Well, there are other reasons to have a high frequency clock counter. Time tagging, for instance. Getting the high-frequency clock counter is just the first part. In addition, there’s a difference between measuring the frequency by counting the number of pulses in 1 second and measuring the time difference between two rising edges a whole bunch of times. If you’re looking for frequency stability, a big clock counter is really nice.

  4. Here’s something that’d be fun to experiment with. Feed the clock into a carry chain that extends across the entire chip vertically. Periodically sample the output from the carry chain into flip-flops (4 for each slice). Depending on the various noise and jitter sources, you now have a somewhat noisy image of the input signal frozen in time. You’d have to fall back to other methods for frequencies that would look like DC in the carry chain.

    1. I’ve tried playing with that a while ago, The ‘1’s stretch out and the ‘0’s shrink up. It looks like the ‘1’s hold up a little longer than ideal.

      So it is fine for resolving when the rising or falling edges on a relatively long (e.g. 1 ns or longer) pulse occur, but if you are looking at using it to analyse a cyclic high frequency signal or very short ‘0’ pulses on a normally ‘1’ data signal it doesn’t work too well as the signal soon smears out.

      I had better luck with chaining carefully placed and routed LUT tables, configured as inverters. I managed to get better than 0.5ns resolution on a Spartan 3E, and transmit 512MB/s 8b/10b data between two boards.

      Apart from the very, very tricky layout and signal routing needed, the big downside of this technique is that the delay changes with operating conditions, so you need a reiable way to calibrate against a known signal.

      1. If you have a high frequency clock, it’s almost certainly coming in LVDS, so use the (inverted) partner copy of the signal (using an IBUFDS_DIFF_OUT or equivalent) and feed it down an equivalent copy of the chain, which will eliminate any polarity influences by looking only at 0->1 or 1->0 transitions.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s