Benchmarking Chinese CPUs

When it comes to PCs, Westerners are most familiar with x86/x64 processors from Intel and AMD, with Apple Silicon taking up a significant market share, too. However, in China, a relatively new CPU architecture is on the rise. A fabless semiconductor company called Loongson has been producing chips with its LoongArch architecture since 2021. These chips remain rare outside China, but some in the West have been benchmarking them.

[Daniel Lemire] has recently blogged about the performance of the Loongson 3A6000, which debuted in late 2023. The chip was put through a range of simple benchmarking tests, involving float processing and string transcoding operations. [Daniel] compared it to the Intel Xeon Gold 6338 from 2021, noting the Intel chip pretty much performed better across the board. No surprise given its higher clock rate. Meanwhile, the gang over at [Chips and Cheese] ran even more exhaustive tests on the same chip last year. The Loongson was put through typical tasks like compressing archives and encoding video. The outlet came to the conclusion that the chip was a little weaker than older CPUs like AMD’s Zen 2 line and Intel’s 10th generation Core chips. It’s also limited as a four-core chip compared to modern Intel and AMD lines that often start at 6 cores as a minimum.

If you find yourself interested in Loongson’s products, don’t get too excited. They’re not exactly easy to lay your hands on outside of China, and even the company’s own website is difficult to access from beyond those shores. You might try reaching out to Loongson-oriented online communities if you seek such hardware.

Different CPU architectures have perhaps never been more relevant, particularly as we see the x86 stalwarts doing battle with the rise of desktop and laptop ARM processors. If you’ve found something interesting regarding another obscure kind of CPU, don’t hesitate to let the tipsline know!

Assessing The Energy Efficiency Of Programming Languages

Programming languages are generally defined as a more human-friendly way to program computers than using raw machine code. Within the realm of these languages there is a wide range of how close the programmer is allowed to get to the bare metal, which ultimately can affect the performance and efficiency of the application. One metric that has become more important over the years is that of energy efficiency, as datacenters keep growing along with their power demand. If picking one programming language over another saves even 1% of a datacenter’s electricity consumption, this could prove to be highly beneficial, assuming it weighs up against all other factors one would consider.

There have been some attempts over the years to put a number on the energy efficiency of specific programming languages, with a paper by Rui Pereira et al. from 2021 (preprint PDF) as published in Science of Computer Programming covering the running of a couple of small benchmarks, measuring system power consumption and drawing conclusions based on this. When Hackaday covered the earlier 2017 paper at the time, it was with the expected claim that C is the most efficient programming language, while of course scripting languages like JavaScript, Python and Lua trailed far behind.

With C being effectively high-level assembly code this is probably no surprise, but languages such as C++ and Ada should see no severe performance penalty over C due to their design, which is the part where this particular study begins to fall apart. So what is the truth and can we even capture ‘efficiency’ in a simple ranking?

AVX-512: When The Bits Really Count

For the majority of workloads, fiddling with assembly instructions isn’t worth it. The added complexity and code obfuscation generally outweigh the relatively modest gains. Mainly because compilers have become quite fantastic at generating code and because processors are just so much faster, it is hard to get a meaningful speedup by tweaking a small section of code. That changes when you introduce SIMD instructions and need to decode lots of bitsets fast. Intel’s fancy AVX-512 SIMD instructions can offer some meaningful performance gains with relatively little custom assembly.

Like many software engineers, [Daniel Lemire] had many bitsets (a range of ints/enums encoded into a binary number, each bit corresponding to a different integer or enum). Rather than checking if just a specific flag is present (a bitwise AND), [Daniel] wanted to know all the flags in a given bitset. The easiest way would be to iterate through all of them like so:

while (word != 0) {
  result[i] = trailingzeroes(word);
  word = word & (word - 1);
  i++;
}
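
To make the snippet above concrete, here is a minimal self-contained sketch in C. It assumes GCC or Clang, whose `__builtin_ctzll` builtin stands in for `trailingzeroes`; the names `trailing_zeroes` and `decode_bitset` are made up for illustration and are not taken from [Daniel]'s code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the trailing zero bits of a nonzero word.
 * Assumption: GCC/Clang builtin; other compilers need their own equivalent. */
static uint32_t trailing_zeroes(uint64_t word) {
    return (uint32_t)__builtin_ctzll(word);
}

/* Hypothetical helper: writes the index of every set bit in `word` to
 * `result`, lowest bit first, and returns how many indices were written.
 * `result` must have room for up to 64 entries. */
size_t decode_bitset(uint64_t word, uint32_t *result) {
    assert(result != NULL);
    size_t i = 0;
    while (word != 0) {
        result[i] = trailing_zeroes(word); /* position of the lowest set bit */
        word &= word - 1;                  /* clear that lowest set bit */
        i++;
    }
    return i;
}
```

Calling `decode_bitset(0x29, out)` on a 64-entry array would fill in the positions of the three set bits of binary 101001 and return 3.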

The naive version of this loop is very likely to have a branch misprediction, and either you or the compiler would speed it up by unrolling the loop. However, the AVX-512 instruction set on the latest Intel processors has some handy instructions just for this kind of thing. The instruction is vpcompressd and Intel provides a handy and memorable C/C++ function called _mm512_mask_compressstoreu_epi32.

The function generates an array of integers and you can use the infamous popcnt instruction to get the number of ones. Some early benchmark testing shows the AVX-512 version uses 45% fewer cycles. You might be wondering, doesn’t the processor downclock when wide 512-bit registers are used? Yes. But even with the downclocking, the SIMD version is still 33% faster. The code is up on GitHub if you want to try it yourself.
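
As a rough illustration of the idea, and not [Daniel]'s actual code, the sketch below treats a 64-bit word as four 16-bit masks and lets `_mm512_mask_compressstoreu_epi32` write out the lane indices whose mask bit is set. The function name `decode_bitset_simd` is hypothetical, `__builtin_popcount` assumes GCC or Clang, and the AVX-512 path is only compiled when the compiler targets AVX-512F (for example with `-mavx512f`); otherwise a plain scalar loop is used instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper: writes the index of every set bit in `word` to `out`
 * (lowest first) and returns the count. `out` needs room for 64 entries. */
size_t decode_bitset_simd(uint64_t word, uint32_t *out) {
    assert(out != NULL);
    size_t n = 0;
#if defined(__AVX512F__)
    /* Sixteen 32-bit lanes holding the candidate bit positions 0..15. */
    __m512i lanes = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
    const __m512i step = _mm512_set1_epi32(16);
    for (int chunk = 0; chunk < 4; chunk++) {
        /* The next 16 bits of the word act directly as the lane mask. */
        __mmask16 mask = (__mmask16)(word >> (16 * chunk));
        /* vpcompressd: store only the selected lanes, packed contiguously. */
        _mm512_mask_compressstoreu_epi32(out + n, mask, lanes);
        /* popcnt tells us how many indices were just written. */
        n += (size_t)__builtin_popcount((unsigned)mask);
        lanes = _mm512_add_epi32(lanes, step); /* positions for next chunk */
    }
#else
    /* Scalar fallback when AVX-512F is not enabled at compile time. */
    while (word != 0) {
        out[n++] = (uint32_t)__builtin_ctzll(word);
        word &= word - 1; /* clear the lowest set bit */
    }
#endif
    return n;
}
```

Note that the SIMD path has no data-dependent branch: each word always takes four compress-stores regardless of how many bits are set, which is where the avoided mispredictions come from.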

Print-in-Place Engine Aims To Be The Next Benchy

While there are many in the 3D-printing community who loudly and proudly proclaim never to have stooped to printing a 3DBenchy, there are far more who have turned a new printer loose on the venerable test model, just to see what it can do. But Benchy is getting a little long in the tooth, and with 3D-printers getting better and better, perhaps a better benchmarking model is in order.

Knocking Benchy off its perch is the idea behind this print-in-place engine benchmark, at least according to [SunShine]. And we have to say that he’s come up with an impressive model. It’s a cutaway of a three-cylinder reciprocating engine, complete with crankshaft, connecting rods, pistons, and engine block. It’s designed to print all in one go, with only a little cleanup needed after printing before the model is ready to go. The print-in-place aspect seems to be the main test of a printer — if you can get this engine to actually spin, you’re probably set up pretty well. [SunShine] shares a few tips to get your printer dialed in, and shows a few examples of what can happen when things go wrong. In addition to the complexities of the print-in-place mechanism, the model has a few Easter eggs to really challenge your printer, like the tiny oil channel running the length of the crankshaft.

Whether this model supplants Benchy is up for debate, but even if it doesn’t, it’s still a cool design that would be fun to play with. Either way, as [SunShine] points out, you’ll need a really flat bed to print this one; luckily, he recently came up with a compliant mechanism dial indicator to help with that job.
