Counting Really, Really Fast With An FPGA

June 29, 2014

During one of [Michael]’s many forum lurking sessions, he came across a discussion about frequency counting on a CPLD. He wondered if he could do the same on an FPGA, and how hard it would be to count high clock rates. As it turns out, it’s pretty hard with a naive solution. Being a bit more clever turns the task into a cakewalk, with a low-end FPGA being able to count clocks over 500 MHz.

The simplest solution for counting a clock would be to count its edges for a full second with a huge, 30-bit counter. This is a terrible idea: a long counter has a long carry chain, and the accumulated propagation delay limits how fast it can be clocked. Also, any sampling would have to run at least twice as fast as the input signal, which is not a great idea if you’re counting really fast clocks.

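The solution is to have the input signal drive a very small counter, only five bits wide, and sample the counter using a slower clock on board the FPGA. [Michael] used a 5-bit Gray code, in which only one bit changes per count. That gets rid of the problem of the ‘11111’ to ‘00000’ rollover of a normal binary counter, where every bit flips at once and a badly timed sample can catch a mix of old and new bits.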
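To see why a Gray code helps here, the sketch below uses the standard binary-reflected Gray code (an assumption; the write-up does not say which 5-bit Gray sequence [Michael] used) and checks how many bits change on each step, including the wrap from the last code back to the first. The function names are illustrative.

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Inverse of to_gray: fold the shifted bits back in with XOR."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


BITS = 5
codes = [to_gray(i) for i in range(1 << BITS)]

# Number of bits that differ between each code and its successor,
# including the wrap from the last code back to the first.
flips = [bin(codes[i] ^ codes[(i + 1) % len(codes)]).count("1")
         for i in range(len(codes))]

# Compare with plain binary, where 11111 -> 00000 flips all five bits.
binary_wrap_flips = bin(0b11111 ^ 0b00000).count("1")
```

Every entry in `flips` is 1, so a sample that lands mid-transition can be off by at most one count, whereas the binary rollover changes five bits at once.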

Because [Michael] is using a 5-bit counter that can advance by up to 31 counts between samples taken at 32 MHz, he can theoretically measure a 31 × 32 MHz = 992 MHz clock. There isn’t a chance in hell of the Spartan 6 on his Papilio Pro board ever being able to measure that, but he is able to measure a 500 MHz clock, something that would be impossible without his clever bit of code.
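The 992 MHz figure and the sampling scheme can be sketched in a few lines of Python. This is an idealised model, not [Michael]’s HDL: it assumes a perfect counter with no metastability, ignores the Gray coding, and uses a 1 ms gate instead of a full second so it runs quickly. The names `measure`, `SAMPLE_HZ` and `MODULUS` are made up for the example.

```python
SAMPLE_HZ = 32_000_000   # slower on-board sampling clock from the article
BITS = 5
MODULUS = 1 << BITS      # a 5-bit counter wraps every 32 counts

# The counter may advance at most MODULUS - 1 counts between samples,
# otherwise a wrap cannot be told apart from a small step.
max_input_hz = (MODULUS - 1) * SAMPLE_HZ


def measure(input_hz: int, samples: int) -> int:
    """Return the total edge count recovered over `samples` sample periods."""
    total = 0
    prev = 0
    for k in range(1, samples + 1):
        # Idealised counter value at the k-th sample instant, truncated to 5 bits.
        cur = (input_hz * k // SAMPLE_HZ) % MODULUS
        total += (cur - prev) % MODULUS   # wrap-safe difference
        prev = cur
    return total


# A 1 ms gate (32,000 samples) keeps the simulation quick; scale to Hz.
edges = measure(500_000_000, 32_000)
estimate_hz = edges * 1_000
```

In this model an input of 1,024 MHz makes the counter wrap exactly once per sample, so every difference is zero and the measurement collapses, which is why the limit sits at 31 counts per sample rather than 32.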

26 thoughts on “Counting Really, Really Fast With An FPGA”

Alexis says:

June 29, 2014 at 10:49 am

There is another interesting way to implement a counter like this, and that is to remove the adder from the critical path. Instead of carrying, you keep two numbers that, when added together, make up the count. You can do this with only one full adder per bit, at the cost of an additional flip-flop per bit, and it should run at the speed of a single full adder.
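One reading of this suggestion is what is usually called a carry-save counter. The Python model below is a sketch under that assumption, not Alexis’s actual design: it keeps a sum word and a carry word, updates each bit with a single full adder, and only resolves the two words into a normal count when asked. `WIDTH`, `step` and `value` are illustrative names.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def step(s: int, c: int):
    """Add 1 to the carry-save pair (s, c) without propagating carries."""
    cin = (c << 1) & MASK     # each saved carry feeds the next bit up
    x = 1                     # the increment, applied at bit 0 only
    new_s = s ^ cin ^ x                        # per-bit full-adder sum
    new_c = (s & cin) | (s & x) | (cin & x)    # per-bit full-adder carry
    return new_s, new_c


def value(s: int, c: int) -> int:
    """Resolve the pair into an ordinary count (done off the fast path)."""
    return (s + (c << 1)) & MASK
```

Each call to `step` depends only on a bit and its immediate neighbour, so the per-clock logic depth stays at one full adder regardless of `WIDTH`; the slow full-width addition in `value` is only needed when the count is read out.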

jpa says:

June 29, 2014 at 11:55 am

Or he could just have pipelined the counter. Like https://github.com/PetteriAimonen/dso-quad-logic/blob/fpga_support/fpga/pipelined_counter.vhdl

1. Dojo says:
  
  June 29, 2014 at 12:05 pm
  
  Still, I think his method will yield slightly higher clock rates, since there is almost no routing overhead in the ‘high-speed’ part of the design.
  
Russ says:

June 29, 2014 at 12:50 pm

If we are talking about fast counters, then LFSRs should be mentioned: http://en.wikipedia.org/wiki/Linear_feedback_shift_register#Uses_as_counters

‘LFSR counters have simpler feedback logic than natural binary counters or Gray code counters, and therefore can operate at higher clock rates.’

You could probably do the entire thing with one long shift register, and a single XOR gate.
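As a rough illustration of an LFSR used as a counter, here is a 5-bit Fibonacci LFSR in Python with a single XOR in its feedback path. The tap positions and the lookup-table decode are choices made for this example, not something taken from the project; the table simply records where each state falls in the sequence so a sampled state can be turned back into a count.

```python
def lfsr_step(state: int) -> int:
    """One step of a 5-bit Fibonacci LFSR: shift left, feed back one XOR."""
    feedback = ((state >> 4) ^ (state >> 2)) & 1
    return ((state << 1) | feedback) & 0b11111


# Walk the sequence once from a nonzero seed and record each state's position.
SEED = 0b00001
state_to_count = {}
state = SEED
while state not in state_to_count:
    state_to_count[state] = len(state_to_count)
    state = lfsr_step(state)

period = len(state_to_count)
```

With these taps the register visits all 31 nonzero states before repeating, so it behaves as a modulo-31 counter. A table like `state_to_count` is only practical for short registers; a long LFSR would need a different way to decode its state.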

ejonesss says:

June 29, 2014 at 2:02 pm

Couldn’t you use a frequency divider and count double the clock speed?

Hamster says:

June 29, 2014 at 2:17 pm

Hi – I’m the project’s author.

I was thinking of offering a prize for a faster Spartan 6 LX9 design, where all it had to do was count the frequency of the incoming signal and squirt ASCII down any serial port.
However that would be unfair of me as it is the maximum switching rate of the clock buffer that is the limiting factor, not the programmable logic – as it should be the fastest design that passes timing. 8-)

The underlying logic of an LFSR is faster, but how do you cleanly transfer the state of the LFSR into your clock domain and then convert it back to a binary number? This makes LFSRs better for when you are counting to a preset terminal count and then resetting.

One other strike against LFSRs is that the logic blocks in FPGAs have special features that speed up addition (most notably the carry chain), so a long LFSR will be slower than a 5-bit binary counter.

Mike

1. tekkieneet says:
  
  June 29, 2014 at 2:33 pm
  
  If you are counting frequency, you are essentially looking at the change in clock count over a period of time. e.g. every 0.1sec (or whatever)
  
  Can’t you run the counter itself asynchronously and not worry about getting the reading while it is live? Once the sampling period is over, you gate the input signal to stop the counter. When the counter is stopped, you don’t need to worry about crossing clock domains any more because its bits have already stopped changing. Examine the bits, and when you are done, let the sampling period start again and the counter do its work again.
  
  1. fonz says:
    
    June 29, 2014 at 3:23 pm
    
    there are many ways to skin a cat, but you still end up with the problem that ~500MHz is the limit on the IOs and clocks
    
  2. Hamster says:
    
    June 29, 2014 at 3:23 pm
    
    Yes – you can gate the clock, but then you will need a long counter, 29 bits if you want to count for a second at up to 500MHz.
    
    I’m not sure if that is possible in the FPGA I’m using, as it will require a 0.066 ns carry delay per bit, and it looks to be about 0.12 ns from the datasheet.
    
    To use that technique I might have to use around a 16-bit counter, and would need to pause it every 125us to read back the current count. This would cause errors in the final total.
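The numbers in this comment can be checked with a little arithmetic. The sketch below assumes a one-second gate and a 500 MHz input, as stated above; the variable names are illustrative, and the per-bit budget comes out slightly different from the quoted 0.066 ns depending on whether 29 or 30 bits are assumed.

```python
INPUT_HZ = 500_000_000
GATE_S = 1

# Bits needed to hold one second's worth of counts at 500 MHz.
bits_needed = (INPUT_HZ * GATE_S).bit_length()

# Time available per bit if a carry must ripple through the whole counter
# within one input clock period.
period_ns = 1e9 / INPUT_HZ
carry_budget_ns = period_ns / bits_needed

# How long a 16-bit counter lasts before it wraps at 500 MHz.
wrap_16bit_us = (1 << 16) / INPUT_HZ * 1e6
```

This gives 29 bits, a budget of roughly 0.069 ns per bit against the 0.12 ns quoted from the datasheet, and a 16-bit wrap time of about 131 µs, which is why a 125 µs read-back interval is suggested.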
    
    1. Foobar Bazbot says:
      
      June 29, 2014 at 5:23 pm
      
      AFAICS the carry delay doesn’t matter. It matters if you want a synchronous counter, but we don’t. If you make a simple asynchronous counter (aka ripple counter), there is no carry as such; each bit toggles when the next lower bit goes from 1 to 0. Now reading all the bits at once while the clock is running is inconsistent/incorrect, because the bits are progressively skewed.
      
      But the thing is, you don’t need to read it while it’s running! At the end of the measurement period, you gate the input off, then wait enough time for the last toggle to propagate through (assuming the worst-case, where the MSB is 0 and all other bits are 1, so it’s toggling from 01111… to 1000…., it’s 0.12ns * 30ish, so pick any convenient multiple of a convenient clock that gives you 10ns or so), _then_ read the output.
      
      about ripple counters, for those who don’t know them…
      http://www.eecs.tufts.edu/~dsculley/tutorial/flopsandcounters/flops6.html
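A small behavioural model makes the settling argument concrete. It is only a sketch: it takes the comment’s 0.12 ns per-stage figure at face value, assumes 30 stages, and treats each stage as toggling when the stage below it falls from 1 to 0. `ripple_increment` is a hypothetical helper, with `bits[0]` as the least significant bit.

```python
STAGES = 30
STAGE_DELAY_NS = 0.12   # per-stage figure quoted in the comment


def ripple_increment(bits):
    """Apply one input edge; return how many stages toggled in sequence."""
    toggled = 0
    for i in range(len(bits)):
        bits[i] ^= 1
        toggled += 1
        if bits[i] == 1:   # a 0 -> 1 change does not clock the next stage
            break
    return toggled


# Worst case: MSB is 0 and every lower bit is 1, so the edge ripples
# through all stages one after another.
bits = [1] * (STAGES - 1) + [0]
worst_case_toggles = ripple_increment(bits)
settle_ns = worst_case_toggles * STAGE_DELAY_NS
```

Under these assumptions the worst-case settle time is about 3.6 ns, comfortably inside the 10 ns or so of waiting suggested before reading the count.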
      
      1. russdill says:
        
        June 29, 2014 at 7:08 pm
        
        You can do that for later stages, but for early stages, clocking via local routing greatly decreases your maximum possible frequency.
        
      2. tekkieneet says:
        
        June 30, 2014 at 6:26 am
        
        Guess no one read my comment (just one comment below yours). That’s how they made a 400MHz frequency counter in the old great-granddaddy XC4000: they used a chain of ripple counters. That chip wasn’t fast by any stretch of the imagination.
        
        >Consequently, the first stage is an unconditional divide-by-2,
        as shown in Figure 2. The clock-to-setup delay of 2.44 ns
        permits 400-MHz operation even under worst-case conditions.
        
        Now how much faster is the FPGA used here? If you can make a Gray code counter run at 500MHz to count your input, you already have a way to get that clock into the FPGA and toggle some logic inside. The part lacking is thinking outside of the ‘everything should be synchronous’ mindset.
        
        Once you get the 500MHz in this case down to 250MHz, even your average undergrad EE should be able to make the rest out of synchronous logic in a modern FPGA.
        
  3. tekkieneet says:
    
    June 29, 2014 at 4:08 pm
    
    BTW take a look at this:
    http://caxapa.ru/thumbs/404644/XC4002XL_fcounter_400MHz.pdf
    
    >This article describes a full-featured, single-chip frequency
    counter that operates at 400 MHz, consumes only 130 mW at
    the maximum input frequency, and occupies less than 90% of
    an XC4002XL, the smallest XC4000 family member.
    
    If they can measure 400MHz with the old XC4000, I am pretty sure that the same trick can be done after so many generations of improvements.
    
    Now that was a much better hack.
    
    1. Pat says:
      
      June 30, 2014 at 8:48 am
      
      Sure, you just immediately divide down the input in a CLB flop, and use that to count the clock (and possibly do something to pick up the +/-1, but that’s not a big deal).
      
      The improvement isn’t ‘huge’ though, despite so many generations of improvements, because FPGAs aren’t optimized for this – there is no fast loopback path from a register to itself. The clock speed limitation is a combination of the output delay, the loopback delay, and the setup delay of a single CLB. On the old XC4000, that total was ~2.44 ns. On a Spartan-6 this is ~1.7 ns.
      
      So it doesn’t get you much faster at all. Realistically once you get past 500 MHz, you have to get creative no matter what generation FPGA you’re using.
      
      1. Mike Field says:
        
        July 3, 2014 at 3:19 am
        
        I got round to trying a flip-flop as a /2 prescaler: half a dozen more lines of code and it passes timing at 1.051 ns (950 MHz). However I don’t have any suitable test gear to test it with.
        
        Thanks for the idea, Tekkieneet!
        
        Still feel that it is a bit like cheating – I’ll have to set the gate time to two seconds to make up for the missing LSB :)
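The trade-off described here is easy to model. The sketch below assumes an ideal divide-by-2 prescaler and integer edge counts; `counted_edges` and the 900,000,001 Hz test frequency are invented for the example, and only the 1.051 ns timing figure comes from the comment.

```python
TIMING_NS = 1.051                 # reported timing result
max_hz = 1e9 / TIMING_NS          # clock rate that this period allows


def counted_edges(input_hz: int, gate_s: int, prescale: int = 2) -> int:
    """Edges seen by the counter behind an ideal prescaler."""
    return input_hz * gate_s // prescale


# With a 1 s gate the odd Hz is lost; a 2 s gate recovers it.
one_second = counted_edges(900_000_001, 1)
two_seconds = counted_edges(900_000_001, 2)
```

Here `max_hz` works out to a little over 951 MHz, and doubling the gate time makes the raw count equal the input frequency in Hz again, which is the “missing LSB” being made up for.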
        
2. russdill says:
  
  June 29, 2014 at 7:01 pm
  
  If you are already at the maximum frequency, then yes, there is little advantage to an LFSR. However, an LFSR will always be faster than a simple counter, even with a carry chain, as it only requires a single gate with two inputs. The rest of the chain is just shift registers. You can even pack those shift registers into SRLs, up to 128 bits in a single slice (but that limits flexibility).
  
  As far as reading it periodically, you can just clock it into a second set of registers (periodically) with the high speed clock, then with a low speed clock pass it through logic for converting to numerals. You can also do comparisons utilizing the high speed clock and a single LUT if you are comparing to a 6 bit constant, or using a series of LUTs and a carry chain if you are comparing against a larger constant.
  
3. Pat says:
  
  June 29, 2014 at 7:18 pm
  
  So don’t use a clock buffer. At least not one straight off.
  
  You could probably count higher with a clever input structure, but you’d have to think about it. Toggle frequencies in a CLB are much higher (~800 MHz), and the fact that there’s a high latency path from fabric to clock tree doesn’t matter for a counter.
  
  Heckuva lot more work, though.
  
4. Pat says:
  
  June 29, 2014 at 7:22 pm
  
  Counting the input clock and measuring its frequency are two very different things. Counting accurately is probably limited to around 500 MHz easily. Measuring its frequency, though, could be done with a TDC, and there you can get into the GHz range.
  
  1. Tyler says:
    
    June 29, 2014 at 8:22 pm
    
    Yeah. I’m a little baffled by why they are doing this in such an odd way if what they seem to actually want is frequency measurement…when 500+ MHz was easily within the realm of discrete logic…
    
    1. fonz says:
      
      June 30, 2014 at 8:26 am
      
      at varying degrees of easily, your standard junk box flipflop, even a fast one, isn’t going to work at anywhere near 500MHz
      
    2. tekkieneet says:
      
      June 30, 2014 at 8:46 am
      
      but a divide-by-256 prescaler chip in an old cable TV converter box is good for 1+GHz, and it has a built-in preamplifier with tens of mV of sensitivity.
      
      There are 12GHz prescalers these days. The chip is a bit hard to find though.
      http://www.changpuak.ch/electronics/prescaler_12GHz.php
      
5. Alan says:
  
  November 8, 2015 at 7:09 pm
  
  Would you consider a follow-up with a focus on enhanced resolution? I was reading about interpolating counters, which measure the time between the sample gate opening and the first pulse (and between the sample gate closing and the next pulse). Usually this is done by creating a sawtooth and measuring it with an ADC… BUT there are “all digital” ADC solutions that use LVDS inputs to sample directly on the FPGA.
  That would require tacking a few more digits to the display, of course.
  
Pat says:

June 30, 2014 at 8:58 am

Well, there are other reasons to have a high frequency clock counter. Time tagging, for instance. Getting the high-frequency clock counter is just the first part. In addition, there’s a difference between measuring the frequency by counting the number of pulses in 1 second and measuring the time difference between two rising edges a whole bunch of times. If you’re looking for frequency stability, a big clock counter is really nice.

russdill says:

June 30, 2014 at 9:02 am

Here’s something that’d be fun to experiment with. Feed the clock into a carry chain that extends across the entire chip vertically. Periodically sample the output from the carry chain into flip-flops (4 for each slice). Depending on the various noise and jitter sources, you now have a somewhat noisy image of the input signal frozen in time. You’d have to fall back to other methods for frequencies that would look like DC in the carry chain.

1. Hamster says:
  
  June 30, 2014 at 5:01 pm
  
  I tried playing with that a while ago. The ‘1’s stretch out and the ‘0’s shrink up. It looks like the ‘1’s hold up a little longer than ideal.
  
  So it is fine for resolving when the rising or falling edges on a relatively long (e.g. 1 ns or longer) pulse occur, but if you are looking at using it to analyse a cyclic high frequency signal or very short ‘0’ pulses on a normally ‘1’ data signal it doesn’t work too well as the signal soon smears out.
  
  I had better luck with chaining carefully placed and routed LUTs, configured as inverters. I managed to get better than 0.5ns resolution on a Spartan 3E, and transmit 512MB/s 8b/10b data between two boards.
  
  Apart from the very, very tricky layout and signal routing needed, the big downside of this technique is that the delay changes with operating conditions, so you need a reliable way to calibrate against a known signal.
  
  1. Pat says:
    
    July 1, 2014 at 9:43 am
    
    If you have a high frequency clock, it’s almost certainly coming in LVDS, so use the (inverted) partner copy of the signal (using an IBUFDS_DIFF_OUT or equivalent) and feed it down an equivalent copy of the chain, which will eliminate any polarity influences by looking only at 0->1 or 1->0 transitions.
    