Using Integer Addition To Approximate Float Multiplication

Once the domain of esoteric scientific and business computing, floating point calculations are now practically everywhere. From video games to large language models and kin, it would seem that a processor without floating point capabilities is pretty much a brick at this point. Yet the truth is that integer-based approximations can be good enough to hit the required accuracy. One example is approximating floating point multiplication with integer addition, which [Malte Skarupke] recently had a poke at, based on the integer-addition-only LLM approach suggested by [Hongyin Luo] and [Wei Sun].

As for the way this works, it does pretty much what it says on the tin: treat the bit patterns of the two floating point inputs as integers, add them together, and then adjust the exponent. This adjustment is what gets you close to the answer, but as the article and its comments illustrate, there are plenty of issues and edge cases to concern yourself with. These include under- and overflow, but also special inputs such as zeros, subnormals, infinities, and NaNs.
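
To make the trick concrete, here is a minimal single-precision sketch of the bit-level idea (our own illustration, not the paper’s exact recipe, and the approx_mul name is ours), assuming positive, normal inputs: reinterpret each float’s bits as an integer, add the two integers, and subtract the exponent bias 0x3F800000 (that is, 127 << 23) so the bias is only counted once.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Approximate a * b for positive, normal floats: add the raw bit
       patterns, then subtract the exponent bias (127 << 23) so it is
       not counted twice. Zeros, subnormals, infinities, and NaNs all
       need special handling that this sketch deliberately skips. */
    static float approx_mul(float a, float b)
    {
        uint32_t ia, ib, ir;
        memcpy(&ia, &a, sizeof ia);   /* reinterpret bits without UB */
        memcpy(&ib, &b, sizeof ib);
        ir = ia + ib - 0x3F800000u;   /* the integer addition itself */
        float r;
        memcpy(&r, &ir, sizeof r);
        return r;
    }

    int main(void)
    {
        printf("%f vs %f\n", approx_mul(3.0f, 5.0f), 3.0f * 5.0f);  /* 14.00 vs 15.00 */
        printf("%f vs %f\n", approx_mul(0.7f, 1.3f), 0.7f * 1.3f);  /* ~0.85 vs ~0.91 */
        return 0;
    }

Adding the mantissa fields acts as a linear approximation of multiplying the mantissas, which is where the error creeps in: exact powers of two come out exactly right, while other inputs land at or somewhat below the true product.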

Unlike scientific calculations, where even minor inaccuracies tend to propagate and cause much larger errors down the line, graphics and LLMs do not care that much about floating point precision, so an error of roughly 7.5% from the integer approach is good enough. The question is whether it’s truly more efficient, as the paper suggests, rather than a fallback as seen with e.g. integer-only audio decoders for platforms without an FPU.

One of the nice things about FP-focused vector processors like GPUs and their derivatives (tensor, ‘neural’, etc.) is that they can churn through a lot of data quite efficiently, so shifting this work onto the ALU of a CPU and expecting (energy) efficiency improvements in return seems quite optimistic.

The Pentium Processor’s Innovative (and Complicated) Method Of Multiplying By Three, Fast

[Ken Shirriff] has been sharing a really low-level look at Intel’s Pentium (1993) processor. The Pentium’s architecture was highly innovative in many ways, and one of [Ken]’s most recent discoveries is that it contains a complex circuit, built from around 9,000 transistors, whose sole purpose is to multiply specifically by three. Why does such an apparently simple operation require such a complex circuit? And why this particular operation, and not something else?

Let’s back up a little to put this all into context. One of the feathers in the Pentium’s cap was its Floating Point Unit (FPU), which was capable of much faster floating point operations than any of its predecessors. [Ken] dove into reverse-engineering the FPU earlier this year, and a close-up look at the Pentium’s silicon die shows that the FPU occupies a significant chunk of it. Nearly half of the FPU is dedicated to performing multiplications, and a comparatively small but significant section of that exists specifically to multiply a number by three. [Ken] calls it the x3 circuit.

The “x3 circuit”, a nontrivial portion of the Pentium processor, is dedicated to multiplying a number by exactly three and contains more transistors than an entire Z80 microprocessor.

Why does the multiplier section of the FPU in the Pentium processor have such specialized (and complex) functionality for such an apparently simple operation? It comes down to how the Pentium multiplies numbers.

Multiplying two 64-bit numbers is done in base-8 (octal), which ultimately requires fewer operations than doing so in base-2 (binary). Instead of handling each bit separately (as in binary multiplication), three bits of the multiplier get handled at a time, requiring fewer shifts and additions overall. But the downside is that multiplying by three must be handled as a special case.

[Ken] gives an excellent explanation of exactly how all that works (which is also an explanation of the radix-8 Booth’s algorithm), but it boils down to this: there are numerous shortcuts for multiplying numbers (multiplying by two is the same as shifting left by one bit, for example), but multiplying by three is the only case without a tidy shortcut. In addition, because the result of multiplying by three feeds into numerous other shortcuts (x5 is really x8 minus x3, for example), it must also be computed very quickly to avoid dragging down those other operations. Straightforward binary multiplication is simply too slow. Hence the reason for giving it so much dedicated attention.
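
As a rough software analogy (our own sketch, not [Ken]’s reverse-engineered hardware, and without the signed digits of Booth recoding), radix-8 multiplication looks like this: compute 3x once up front, and every one of the eight possible digit multiples then falls out of shifts plus that single hard multiple.

    #include <stdint.h>
    #include <stdio.h>

    /* Radix-8 (octal) multiplication: consume three multiplier bits
       per step, so a 32-bit multiply needs 11 partial products rather
       than 32. The price is the one "hard" multiple, 3x, which must
       be ready before the partial products are summed. */
    static uint64_t radix8_mul(uint32_t a, uint32_t b)
    {
        uint64_t x  = a;
        uint64_t x3 = x + (x << 1);              /* the dedicated 3x */
        uint64_t m[8] = {                        /* multiples 0x..7x */
            0, x, x << 1, x3,
            x << 2, (x << 2) + x, x3 << 1, (x3 << 1) + x
        };
        uint64_t result = 0;
        for (int shift = 0; shift < 32; shift += 3) {
            uint32_t digit = (b >> shift) & 7;   /* next octal digit */
            result += m[digit] << shift;
        }
        return result;
    }

    int main(void)
    {
        printf("%llu\n", (unsigned long long)radix8_mul(12345, 6789));
        printf("%llu\n", (unsigned long long)(12345ULL * 6789ULL));  /* matches */
        return 0;
    }

The Pentium’s Booth recoding instead uses signed digits from -4 to +4, which is where identities like x5 = x8 minus x3 come in, but either way 3x is the only multiple that cannot be produced by shifting alone.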

[Ken] goes into considerable detail on how exactly this is done, and carry-lookahead adders turn out to be a key element in saving time. He also points out that this specific piece of functionality used more transistors than an entire Z80 microprocessor. And if that is not a wild enough idea for you, then how about the fact that the Z80 has a new OS available?
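
On the carry-lookahead point, a toy four-bit adder (again our illustration, not the Pentium’s circuit) shows the idea: generate and propagate signals let every carry be computed as a flat AND/OR expression of the inputs, rather than rippling through each lower bit in turn.

    #include <stdio.h>

    /* Four-bit carry-lookahead adder. Each carry below depends only
       on the generate (g) and propagate (p) signals, never on another
       carry, so all carries settle in a constant number of gate
       delays regardless of word width. */
    static unsigned cla_add4(unsigned a, unsigned b)
    {
        unsigned g = a & b;   /* bit i creates a carry by itself   */
        unsigned p = a ^ b;   /* bit i passes an incoming carry on */
        unsigned g0 = g & 1, g1 = (g >> 1) & 1, g2 = (g >> 2) & 1, g3 = (g >> 3) & 1;
        unsigned p1 = (p >> 1) & 1, p2 = (p >> 2) & 1, p3 = (p >> 3) & 1;

        unsigned c1 = g0;
        unsigned c2 = g1 | (p1 & g0);
        unsigned c3 = g2 | (p2 & g1) | (p2 & p1 & g0);
        unsigned c4 = g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & g0);

        unsigned carries = (c1 << 1) | (c2 << 2) | (c3 << 3);
        return ((p ^ carries) & 0xF) | (c4 << 4);  /* four sum bits plus carry-out */
    }

    int main(void)
    {
        printf("%u\n", cla_add4(9, 7));  /* 16 */
        printf("%u\n", cla_add4(5, 3));  /* 8  */
        return 0;
    }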

Home-Built CPU Runs With Home-Built Toolchain

A few years ago, [Takaya Saeki] and fellow students at the University of Tokyo were given a very simple instruction in their ‘CPU exercise’ class, along the lines of:

Take this ray-tracing program written in OCaml and run it on your CPU implemented on an FPGA

Splitting into groups to cover the CPU, FPU, simulator tool, and compiler toolchain, the students started by designing a RISC ISA, then built a CPU around it. You can follow along with the retrospective writeup of the class, then dive into the GitHub pages for each of the components of the system, although the commentary is mainly in Japanese. Hey, you can Google Translate, right?

Upgrading And Desoldering A Fake CPU

[quarterturn] had an old Apple Powerbook 520c sitting around in his junk bin. For its time it was a great computer, but in a more modern light it could use an upgrade. It can’t run BSD, either: you need an FPU for that, and the 520 used the low-cost, FPU-less version of the 68040 as its main processor. You can buy versions of the 68040 with FPUs direct from China, which means turning this old Powerbook into a BSD powerhouse is just a matter of desoldering and upgrading the CPU. That’s exactly what [quarterturn] did, running into an unexpected, if in hindsight unsurprising, setback along the way.

The motherboard for the Powerbook 500 series was cleverly designed, with daughter cards for the CPU itself and RAM upgrades. After pulling the CPU daughter card from his laptop, [quarterturn] faced his nemesis: a 180-pin QFP 68LC040. Removing the CPU was handled relatively easily by liberal application of ChipQuik. A few quick hits with solder braid and some flux cleaned everything up, and the daughter card was ready for a new CPU.

The new FPU-equipped CPU arrived from China, and after some very careful inspection, soldering, and testing, [quarterturn] had a new CPU for his Powerbook. Once the Powerbook was back up and running, there was a slight problem. The chip was fake. Even though the new CPU was labeled as a 68040, it didn’t have an FPU. People will counterfeit anything, including processors from the early 90s. This means no FPU, no BSD, and [quarterturn] is effectively back to square one.

That doesn’t mean this exercise was a complete loss. [quarterturn] did learn a few things from this experience. You can, in fact, desolder a dense QFP with ChipQuik, and you can solder the same chip with a regular soldering iron. Networking across 20 years of the Macintosh operating system is a mess, and caveat emptor doesn’t translate into Mandarin.