Ask Hackaday: Computing Square Roots On FPGA?

December 26, 2016

Hackaday reader [nats.fr] wrote in with some code from a project that resizes a video stream on the fly using an FPGA. Doing this right means undoing whatever gamma correction has been applied to the original stream, resizing, and then re-applying the gamma. Making life simpler, [nats.fr] settled on a gamma of two, which means taking a bunch of square roots, which isn’t fast on an FPGA.

[nats]’s algorithm is pretty neat: it uses a first-stage lookup to figure out in which broad range the value lies, and then one step of Hero’s algorithm to refine from there. (We think this is equivalent to saying he does a piecewise linear interpolation, but we’re not 100% sure.) Anyway, it works decently.

Of course, when you start looking into the abyss that is special function calculation, you risk falling in. Wikipedia lists more methods of calculating square roots than we have fingers. One of them, CORDIC, avoids even using multiplication by resorting to clever bitshifts and a lookup table. Our go-to in these type of situations, Chebyshev polynomial approximation, didn’t even make the cut. (Although we suspect it would be a contender in the gamma=1.8 or gamma=2.2 cases, especially if combined with range-reduction in a first stage like [nats.fr] does.)

So what’s the best/fastest approximation for sqrt(x) for 16-bit integers on an FPGA? [nats.fr] is using a Spartan 6, so you can use a multiplier, but division is probably best avoided. What about arbitrary, possibly fractional, roots?

33 thoughts on “Ask Hackaday: Computing Square Roots On FPGA?”

hennichodernich says:

December 26, 2016 at 10:32 am

LUT to logarithmic, then every power (or root) is just a multiplication.Then LUT back.

Report comment

Reply
jklu says:

December 26, 2016 at 10:41 am

This paper from the university of Jakarta, Indonesia claims digit-by-digit calculation is the most efficient one of them all:
“An optimized square root algorithm for implementation in FPGA hardware”
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.174.6832&rep=rep1&type=pdf

Report comment

Reply
1. jklu says:
  
  December 26, 2016 at 10:46 am
  
  And since the spartan 6 also has multiplication, this paper from Inria, France might help as well:
  “Multiplicative square root algorithms for FPGAs”
  https://hal.inria.fr/ensl-00475779v1/document
  
  Report comment
  
  Reply
2. Drone says:
  
  December 30, 2016 at 9:55 am
  
  @jklu, Thank you for the link to the paper “AN OPTIMIZED SQUARE ROOT ALGORITHM
  FOR IMPLEMENTATION IN FPGA HARDWARE”. However, I would like to make one correction:
  The paper cited in your link is not from “university of Jakarta, Indonesia” as you said; it is from the Universitas Ahmad Dahlan (UAD) a “private” institution (as if any university in Indonesia can truly be “private”) which located in the city of Yogyakarta (a.k.a, Jogjakarta or simply Jogja) on the Island of Java. Yogyakarta is a significant distance from the Indonesian Capital of Jakarta, which is also located on Java. Here are a couple of links: https://en.wikipedia.org/wiki/Yogyakarta -and- https://uad.ac.id/id
  Best Regards, Drone – from Jakarta
  
  Report comment
  
  Reply
mauswerkz says:

December 26, 2016 at 11:26 am

How does a handheld calculator do it? I would assume the hardware in these is logic-based, similar to an FPGA.

Report comment

Reply
1. Sweeney says:
  
  December 26, 2016 at 11:40 am
  
  You’d assume incorrectly. A calculator used a microprocessor and conventional algorithms. An FPGA uses pure hardware in parallel, and to get maximum performance out of it you want to be able to either pipeline or calculate in a single clock cycle. You’re looking at a different way of thinking about the problem.
  
  Report comment
  
  Reply
  1. SavannahLion says:
    
    December 26, 2016 at 2:59 pm
    
    Would it not depend on the calculator in question? I don’t think the Elka 6521 used a microprocessor in 1965. The Intel 4004 and others like it didn’t hit the market until 1971 or so.
    
    Report comment
    
    Reply
    1. spacecoyote says:
      
      December 26, 2016 at 3:24 pm
      
      That’s not a handheld calculator though.
      
      Report comment
      
      Reply
  2. RÖB says:
    
    December 26, 2016 at 7:10 pm
    
    I seem to remember that a very early calculator (HP?) did this in very few bytes of code so there be an easier way.
    
    As a child (before calculators) I was taught a way of doing square root on paper that took some time because it was rather iterative. The reason it was iterative is that the base was different to the inverse power being calculated. ie – decimal would be a much easier approach if you wanted the tenth root. I later converted that technique to digital electronics and found that the base of 2 (binary) is perfect for calculating a *square* root. It was just bit shifts and addition. Both of which an FPGA would be very good at with minimal resources.
    
    Report comment
    
    Reply
    1. Alan says:
      
      December 26, 2016 at 10:58 pm
      
      Newton-Raphson algorithm? Each answer gets closer to an acceptable result.
      
      If number = N, and current guess = X, calculate new, better guess X’ with: X’ = (N + X^2) / (2X)
      
      Report comment
      
      Reply
    2. Sweeney says:
      
      December 27, 2016 at 3:32 am
      
      Small code doesn’t mean it’s fast. Newton-Raphson is a method of successive approximations and isn’t fast. As I said, you’re looking for a different type of algorithm for maximum performance on an FPGA.
      
      Report comment
      
      Reply
      1. RÖB says:
        
        December 27, 2016 at 3:48 am
        
        True but there are advantages for raw math in FPGA to offset that.
        
        1st is that you may have limited requirements for accuracy and you can tailor the logic to that.
        
        The second is that unlike a CPU, FPGA can do most things in parallel so the over all number of cycles can be reduced dramatically compared to a CPU or even FPGA emulated CPU.
        
        3rd is that most ball park FPGA can run fine at 300MHz or more and that is a lot faster than you can get from any common and small CPU architecture.
        
        Report comment
      2. Alan says:
        
        December 29, 2016 at 4:24 am
        
        I’m not say it’s fast – but it might be the method Rob learned “as a child”.
        
        Report comment
jpa says:

December 26, 2016 at 11:27 am

If one is using 8-bit colors on input and output, a 256-entry lookup table is enough. On input it is trivial, but also on output you don’t need more than 256 entries if you are going to round the result to 8 bits anyway. For example perfect hash functions can be used for fast lookup.

Report comment

Reply
1. jpa says:
  
  December 26, 2016 at 11:33 am
  
  Or in fact one could probably throw a bunch of autogenerated “if linear > 1234 then output 1296 then output <= 36; …" statements at the synthetizer and see how huge the logic would be. It might find a reasonably compact optimization of the LUT when the output is just 8 bits.
  
  Report comment
  
  Reply
2. marty lawson says:
  
  December 27, 2016 at 1:00 pm
  
  AFIK, the fastest way to do a LUT this big would be to use a dual port block-ram initialized with the forward and reverse look up tables. You’d use one port for gamma removal lookups and the other port for gamma re-application lookups.
  
  A successive approximation method should also be just as quick if you pipeline it. I.e. CORDIC or binary digit-by-digit.
  
  Which is better, depends on weather the design is limited more by logic or memory.
  
  Report comment
  
  Reply
Sweeney says:

December 26, 2016 at 11:48 am

There’s a 32 bit square root in VHDL over here:
https://www.csee.umbc.edu/portal/help/VHDL/samples/sqrt32.vhdl

Report comment

Reply
Steven Clark says:

December 26, 2016 at 1:15 pm

Cordic apparently does nearly everything. I just wish i understood it enough to trust I could implement it.

Report comment

Reply
1. MartinR says:
  
  December 26, 2016 at 2:51 pm
  
  I guess one good thing I can say about ISE is the CORDIC wizard allows you to pretty much treat it as a black box and is pretty simple for things like this.
  
  Report comment
  
  Reply
Michael Dunn says:

December 26, 2016 at 1:54 pm

You could use my log/antilog Verilog blocks to do a SQRT in one cycle:
http://www.edn.com/design/systems-design/4424046/Single-cycle-logarithms—antilogs

Report comment

Reply
1. Zaid says:
  
  December 28, 2016 at 10:01 am
  
  I think i used your implementation in my undergrad project.
  
  Report comment
  
  Reply
JIm B says:

December 26, 2016 at 2:17 pm

“Which is best” is an underspecified question. Best for what? Do you need to calculate this infrequently and can live with a 30 cycle latency? Are you going for one result per clock? What are the precision requirements? Are there other calculation that you need to do which might be able to share part of the computing machinery?

Looking at this particular application, where the video components are low precision, and need a high throughput but latency doesn’t matter too much, probably a piecewise linear lookup is fine – two small LUTs, one small multiply/add.

Report comment

Reply
1. Schobi says:
  
  December 26, 2016 at 2:30 pm
  
  This is also a question of required accuracy!
  
  In this case the accuracy can be really low – just to make the scaling look good. I guess you could trick with number of leading zeros (somewhat cheap in an fpga) or very small LUTs and some interpolation.
  For example: y = rightshift(x , numberofleadingzeros(x)/2 )
  But again the Barrel shift is expensive/slow and would use a multiplication…
  
  Report comment
  
  Reply
BeagleBoy says:

December 26, 2016 at 5:01 pm

Since they’re using a Spartan-6, why not use ISE/coregen to generate a cordic block? No implementation required.

Report comment

Reply
1. Alan says:
  
  December 26, 2016 at 11:04 pm
  
  Add a good size RAM / EPROM, and you can look up an 8-bit value simply by selecting a 16-bit address.
  
  Report comment
  
  Reply
Luke Weston says:

December 26, 2016 at 11:25 pm

Remember, an FPGA is not a microcontroller. Unless you write a microcontroller on your FPGA.

What algorithm is best? Well that depends what kind of processor you’ve implemented on your FPGA.

Report comment

Reply
1. RÖB says:
  
  December 26, 2016 at 11:43 pm
  
  There are a large number of other types of implementation between a stateless combinatory machine right though to a Turing complete machine so you don’t have to implement a micro-processor or micro-controller to do maths based things as all the math itself can be encoded without a CPU etc.
  
  http://upload.wikimedia.org/wikipedia/commons/a/a2/Automata_theory.svg
  
  Report comment
  
  Reply
Al says:

December 27, 2016 at 4:11 am

Maybe this would be suitable in your particular case:
https://en.wikipedia.org/wiki/Fast_inverse_square_root

Report comment

Reply
1. RÖB says:
  
  December 27, 2016 at 4:19 pm
  
  What a gem of a link.
  
  I am about to write the kinematics for an arc delta robot to run on an 8-bitter so this is excellent and simple to implement.
  
  Thank you [Al].
  
  Report comment
  
  Reply
Miroslav says:

December 27, 2016 at 4:53 am

Look up tables in memory, as many others suggested. Quite an old concept to simplify logic:

https://en.wikipedia.org/wiki/IBM_1620

Report comment

Reply
imqqmi says:

December 27, 2016 at 6:48 am

Not that it was asked for, but is it checked that the monitors have a gamma adjustment in the on screen display settings? If it has then your problem is solved.
I’m not sure why you do this in two steps instead of one for gamma adjustment. In two steps there’s a loss in accuracy twice instead of doing it in one step.

Report comment

Reply
1. naikrovek says:
  
  December 28, 2016 at 7:02 pm
  
  Resizing an image with gamma correction already applied does not produce a result image with the correct gamma.
  
  Report comment
  
  Reply
Alan says:

January 5, 2017 at 3:02 pm

The Spartan-6 has a 6-input LUT so you can check 6 MSB (D10 – D15) with just one LUT. If all bits are zero, input is < 2048, output is < 45

Beyond that, I'd want to know if (for example) the input range 81 – 99 is truncated to 9, or range 73 – 90 is rounded to 9.
Everything above 65,025 is rounded down to 255.

Report comment

Reply

Hackaday

Ask Hackaday: Computing Square Roots On FPGA?

33 thoughts on “Ask Hackaday: Computing Square Roots On FPGA?”

Leave a ReplyCancel reply

Search

Never miss a hack

If you missed it

The Great Green Wall: Africa’s Ambitious Attempt To Fight Desertification

Your Open-Source Client Options In The Non-Mastodon Fediverse

Supercon 2023: MakeItHackin Automates The Tindie Workflow

The Computers Of Voyager

My Space

Our Columns

Hackaday Podcast Episode 270: A Cluster Of Microcontrollers, A Rocket Engine From Scratch, And A Look Inside Voyager

This Week In Security: TunnelVision, Scarecrows, And Poutine

Ask Hackaday: Do You Calibrate Your Instruments?

FLOSS Weekly Episode 782: Nitric — In Search Of The Right Knob

Displays We Love Hacking: LVDS And EDP

33 thoughts on “Ask Hackaday: Computing Square Roots On FPGA?”

Leave a ReplyCancel reply

Search

Never miss a hack

Subscribe

If you missed it

Our Columns