Long-term Raspberry Pi watchers will have seen a lot of OS upgrades in their time, from the first Debian Squeeze previews through the Raspbian years to the current Raspberry Pi OS. Their latest OS version is something different though, and could be one of the most important releases in the platform’s history so far, as finally there’s an official release of a 64-bit Raspberry Pi OS.
Would-be 64-bit Pi users have of course had the chance to run 64-bit GNU/Linux operating system builds from other distributions for nearly as long as there have been Pi models with 64-bit processors, but until now the official distribution has only been available as a 32-bit build. In their blog post they outline their reasons for this move in terms of compatibility and performance, and indeed we look forward to giving it a try.
Aside from being a more appropriate OS for a 64-bit Pi, this marks an interesting moment for the folks from Cambridge in that it is the first distribution that won’t run on all Pi models. Instead it requires a Pi 3 or better, which is to say the Pi 3, Zero 2 W, Pi 4, Pi 400, and the more powerful Compute Modules. All models with earlier processors including the original Pi, Pi Zero, and we think the dual-core Pi 2 require a 32-bit version, and while the Pi Zero, B+ and A+ featuring the original CPU are still in production this marks an inevitable move to 64-bit in a similar fashion to that experienced by the PC industry a decade or more ago.
As far as we know the Zero is still flying off the shelves, but this move towards an OS that will leave it behind is the expected signal that eventually there will be a Pi line-up without the original chip being present. We’re sure the 32-bit Pi will be supported for years to come, but it should be clear that the Pi’s future lies firmly in the 64-bit arena. They’ve retained their position as the board to watch, oddly, not by always making the most impressive hardware but by having the best-supported operating system, and this will help them retain that advantage by ensuring that the OS stays relevant.
On the subject of the future course of the Pi ship, our analysis that the Compute Module 4 is their most exciting piece of hardware still stands.
How much difference will it make in performance?
Zero, it’s only a marketing change.
Apps are just ported, not optimized, to 64-bit.
That’s a load of crap. Historically, applications with identical source code run between 10 and 30% faster simply by targeting x64 versus x86. There’s no reason that the difference between AArch64 and classic ARM isn’t just as dramatic.
Not so sure… for x86 vs x64 there is a big difference in the number of available registers, thus allowing the C compiler to do better code optimization; and, AFAIK, the calling convention also uses registers for the first parameters, thus speeding up any call because there are fewer parameters to be written to the stack. But in the case of ARM, I don’t think the difference between the two architectures is as big.
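(For a narrow but concrete illustration of where the gap does show up regardless of calling conventions, here is a minimal C sketch: plain 64-bit arithmetic, which AArch64 executes with single instructions while 32-bit ARM has to synthesize it from pairs of 32-bit multiplies, adds-with-carry and shifts. The function is purely illustrative.)

    /* Illustrative only: 64-bit math that AArch64 does natively but 32-bit
       ARM must build out of 32-bit operations. */
    #include <stdint.h>

    uint64_t mix64(uint64_t a, uint64_t b)
    {
        return a * b + (a >> 17);
    }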
https://hackaday.com/2020/01/28/raspberry-pi-4-benchmarks-32-vs-64-bits/
Think of it as opening the door to the future of RPi
Or infinite, if the software you need isn’t available in 32-bit.
See this video: https://www.youtube.com/watch?v=ET-DJLnNX-Q
Citrix Workspace and MS Teams are two apps that I need to use daily. The fact that there aren’t working packages available for both apps on 64-bit ARM platforms stopped my attempts to get Manjaro OS set up dead in their tracks. Twister OS is 32-bit, unlike Manjaro, so both of those apps will install and work with Twister. The big thing that I didn’t like about Twister, until I found that out, was that it was 32-bit. Now I know why. Manjaro, with KDE Plasma, seemed so promising, but I absolutely need both apps to work. I don’t have time to boot into some other OS when I need to run those apps.
No difference for desktop/browser kind of use, but specific advantages in specific applications. Moreover, there are applications that are optimized for 64 bits and are also dropping 32-bit support (e.g. Elasticsearch). There are many other factors, such as processes being limited to 4GB RAM even with PAE tricks (though in the Raspi case, this would of course only apply to a very specific application on the only model available with 8GB).
Today every app has advanced features that work better in 64 bit. We can start with the browser and go on from there. Just to get started, TLS (ubiquitous now) performs much better in 64 bit. We can move on to image and video decompression. Everybody uses the browser; it is not some esoteric app that nobody uses.
Love how people think whatever they use is “ubiquitous”.
That depends on so many factors, there is no correct answer. Some applications will be faster; some, depending on how they were coded and compiled, could be much slower.
The operation of transferring data between the memory and the microprocessor (cache) is not very cheap, so transferring 64 bits instead of 32 bits could easily have a negative effect on performance.
64-bit is not magic, you are moving twice as many bits about. If you are only using half of the bits, then for code that was optimised to maximum performance using 32-bits (e.g. if the code was using something like bit magic – https://graphics.stanford.edu/~seander/bithacks.html ), you could potentially see a 50% drop in performance.
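(For reference, the kind of 32-bit-tuned “bit magic” being referred to looks like the SWAR population count from the linked bithacks page; every mask and the final shift are sized for exactly 32 bits, so a 64-bit register buys nothing if the data itself never exceeds 32 bits.)

    /* Classic 32-bit SWAR popcount; constants and the final shift are sized
       for a 32-bit word. */
    #include <stdint.h>

    unsigned popcount32(uint32_t v)
    {
        v = v - ((v >> 1) & 0x55555555u);
        v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
        return (((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24;
    }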
“64-bit is not magic, you are moving twice as many bits about.”
Yes and no. Some architectures store information in compressed form in their registers.
So unless 64-Bits of information are loaded, there won’t be a waste.
I think (or rather, hope) the engineers of advanced architectures are smart enough to realize this. ;)
The reverse might also be true, I suppose.
Some architectures may hold 64-Bit of data in their registers all the time.
If merely 32-Bits are stored, leading zeros may be used to fill up the registers.
As you said, it depends on so many factors.. :)
To be fair, registers isn’t the problematic part.
Making an architecture that dynamically scales a register to its content is a dance I have personally attempted. It isn’t particularly fun and not always beneficial.
One can reduce the peak transistor count a bit if one has fewer hardware bits than one can theoretically need to handle, but then the programmer/compiler needs to take this “upper bound” into consideration, and for those that don’t, a crash is imminent. So better just make all registers as large as they logically can be. (This also saves on the hardware needed to allocate space for registers in one’s now more complicated register bank.)
Then there is the downside of dynamically scaling registers to their content.
We need more logic to implement that, adding a series delay to our register accesses and decreasing clock speed, while the additional circuitry also consumes a bit of power.
Though, one can circumvent some of the issues outlined above, mostly thanks to out-of-order execution: by simply having more registers and using register renaming. Then if the out-of-order system runs out of registers of a specific size, it can either repurpose a larger one if available, or just tell the decoder to wait.
In my own opinion, since registers are dedicated hardware and not that insanely expensive from a transistor standpoint, it doesn’t hurt to have enough of them; saving transistors here isn’t a major benefit compared to other areas in an architecture.
It is not just registers… Those are relatively cheap. You would want to match the execution units, internal data bus, cache and maybe the fabric to the size of the registers so that you can load/save them in one cycle instead of two. Those can cost you extra transistors and power (because more signals are being toggled).
E.g. earlier Ryzen takes 2 cycles (by splitting them up) for some of the wider AVX instructions as a trade-off.
Yes, there is a lot of nuance to instruction width as well.
Making everything “wide” isn’t really ideal from a power efficiency standpoint, nor from a resource efficiency standpoint.
Though, making a slew of different sizes of each instruction just in case can likewise be inefficient. It is just another thing one has to balance when implementing the architecture. And one might even have different implementations for different cores targeting different applications.
As stated, trying to save on transistors by dynamically scaling the registers’ individual sizes to fit a variable isn’t as interesting as scaling the instructions or other aspects of the architecture. A bank of registers isn’t particularly resource intensive compared to most instructions.
One of my own hobby architectures was going to dynamically allocate 4 bit chunks for the registers, but that is a lot more complex than having a few registers for each common size and just having a few extra bits used for indicating if it is used for a smaller variable that can fit within them. (So that a 32 bit register can augment an 8 bit one, as an example.)
Though, out of order makes handling registers quite a lot more tricky.
Without it one can go quite far with a simple multi-access SRAM array. But out of order makes the number of accesses per cycle a lot more intense, not to mention it usually adds the requirement for register renaming. (One can technically do without renaming, but then one must be chronological, and that can limit peak execution rate.)
>Though, making a slew of different sizes of each instruction just in case can likewise be inefficient.
Like it or not, sometimes you have to deal with different data sizes when you need to interface with the real world, e.g. communication or I/O standards that might have a different endianness, or packing data into storage.
You’ll need the bare minimum of being able to load/store different sizes. They aren’t difficult as they are similar to Write Enable pins for x16 or x32 memory chips. You only need to do that when you are saving the results to a register.
As for the rest of the instructions, it really depends on what you are going to optimize – speed or size. You could use multiple instructions to mask off registers if you choose the smaller sizes assuming those operations are rare. If they are not, you are into a lot of pain just trying to reduce the instruction size.
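(As a rough sketch of what “use multiple instructions to mask off registers” can look like in software terms, and with the helper name being my own: emulating a byte-wide store on hardware that only does word-wide accesses, via a read-modify-write.)

    /* Hypothetical helper: write one byte into a word-addressable location
       using mask-and-merge, roughly what the extra masking instructions do. */
    #include <stdint.h>

    void store_byte(volatile uint32_t *word, unsigned byte_index, uint8_t value)
    {
        uint32_t shift = (byte_index & 3u) * 8u;
        uint32_t mask  = (uint32_t)0xFFu << shift;
        *word = (*word & ~mask) | ((uint32_t)value << shift);
    }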
I hope you didn’t skip over the whole statement there, since it is a balancing act of what sizes one should build the instructions at. Generally one would ensure that a “big” instruction can still work with smaller data.
For an example: if we have a 64 bit adder, then we can simply cut off half of it to make a 32 bit adder (or two 32 bit adders if we add some more logic). This requires a tiny bit of logic after the 32nd full adder, as well as at the start of our 1st, but the improved flexibility can be worth it, even if it makes the full 64 bit addition take a bit more time.
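(A toy C model of that “cut the carry chain” idea, purely illustrative and not any real datapath: one 64 bit adder that behaves as two independent 32 bit lanes when a split control is asserted.)

    /* Toy model: with split=0 this is a plain 64 bit add; with split=1 the
       carry out of bit 31 is suppressed, giving two independent 32 bit adds
       packed into one 64 bit value. */
    #include <stdint.h>

    uint64_t add64_splittable(uint64_t a, uint64_t b, int split)
    {
        if (!split)
            return a + b;

        uint64_t lo = ((uint64_t)(uint32_t)a + (uint32_t)b) & 0xFFFFFFFFu;
        uint64_t hi = ((a >> 32) + (b >> 32)) & 0xFFFFFFFFu;
        return (hi << 32) | lo;
    }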
Another approach is to just have a dedicated 64 bit only adder, and another 32 bit only adder. Performance wise this can be better, but it takes more space, and is less flexible.
The argument around choosing instruction sizes becomes more nuanced when we start looking at more complex instructions that aren’t a trivial addition. We can in most cases still make our instructions able to handle smaller variables, but it does in some cases add a fair bit of extra complexity. So at some point, it is worthwhile building another execution unit for the smaller variables and letting our scheduling system pick what instructions get sent to what unit.
In the end, our processor would likely end up having a couple of each type, but what size they are targeting, and how many we have of each is debatable and usually application dependent.
If we go back to simple adders: we can build a chip that only has 64 bit adders in it that we divide down to form 32, 16 and 8 bit adders when needed.
Or we can build a chip that has a couple of 64 bit ones and a handful of 16 bit adders, though some applications might rather need more 32 bit performance, so this is a balancing act with no clear best solution. (Unless one knows what exact applications one will run.)
And this is one of the many nuances CPU designers have to meddle with, and why one processor is better than another in one task but not another. And also why clock speed isn’t the only important metric to look at.
(I also somehow feel a lack of nuance in the numbers here, writing 16, 32, 64 feels wrong after having meddled with my own hobby architecture for a few years that uses in between values as well, only adding to the debate about how to balance instruction sizes.)
Your ISA dictates the instruction types and the sizes they operate on. The architecture/implementation is how you do that in hardware, i.e. your 64-bit execution unit.
What they typically do is have a suite of OSes, popular applications and the usual benchmarks (to “match” real-life usage) that is used to evaluate the ISA and the architecture on performance, power, etc.
Yes, it is benchmarks as far as the eye can see.
One can make some logical “assumptions” by just looking at the statistical distribution of variables used to get most of the way there.
But final optimization is usually through benchmarks and/or by saying that the software instead has to optimize to the hardware. The two fields have to meet in the middle somewhere.
Thankfully most data types don’t magically grow larger just because one switches to a 64 bit platform.
Except the pointers and most other memory references.
But larger pointers will result in fewer of them fitting in the same amount of cache, and this could push out other things along the way. It is a balancing act to be fair.
Though, some 64 bit applications just throw in 64 bit values everywhere even when it isn’t needed in the slightest, leading to a lot of extra memory usage for no good reason. Though, the same is true for some 32 bit applications too.
But then there are the applications that store values as strings, so there are always worse offenders when it comes to a complete disregard for minimizing unnecessary memory usage.
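(A quick way to see which types actually grow on a 64 bit userland versus a 32 bit one is to run the same tiny program on both images.)

    /* Pointers, long and size_t double on a 64 bit (LP64) Linux build;
       plain int and the fixed-width types stay the same size. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        printf("int     : %zu bytes\n", sizeof(int));
        printf("int32_t : %zu bytes\n", sizeof(int32_t));
        printf("long    : %zu bytes\n", sizeof(long));
        printf("void *  : %zu bytes\n", sizeof(void *));
        printf("size_t  : %zu bytes\n", sizeof(size_t));
        return 0;
    }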
Some of this reminds me of the transition from Win16 to Win32 back in the 90s..
Some of the then new 32-Bit applications were unnecessarily bloated by comparison.
If the buses can move 64 bits as easily as 32 bits (I’m not super familiar with the Broadcom chip’s architecture or ARM in general, so this may be an incorrect assumption), which I would expect this chip can, then the silicon is already there, and the clock speed won’t change. The only real architectural disadvantage should be (again, unless I’m wrong in my assumption) a marginal increase in RAM usage for addressing, which is inconsequential for any Pi model that supports 64-bit except the Pi Zero 2 W.
Then we get into software support, where everybody will scream at each other, but the Linux ecosystem has, for the most part, moved past 32-bit.
The impact on RAM usage is noticeable.
But the impact on cache usage is likely more impactful.
If using 64 bit instructions doesn’t give a sufficient performance increase to offset the performance decrease from additional cache misses, then moving to 64 bit isn’t worth it. This will of course depend on the application.
Though, address size generally shouldn’t impact the data size of one’s instruction. As is typically seen on a lot of 64 bit architectures that have both instructions and registers far larger than 64 bit. But this is seemingly never the case for 32 bit architectures. (I have even seen 16 bit micros with 32 bit registers and instructions, but never a 32 bit processor that breaks rank….)
So personally I think part of the debate around 32 vs 64 bit is lost.
Though, I would much rather see a 48 bit address space, but that is a different story…
As someone who bench-tested a Zero 2 W before they were released, I’d strongly recommend avoiding the 64-bit Desktop Raspberry Pi OS on that machine. There’s hardly enough memory to do anything with a desktop in 64-bit in 512 MB RAM, and swapping is painful.
Raspberry Pi OS Lite 64-bit is okay, though.
Going to 64 from 32 bit addresses has both pros and cons.
Jeff Geerling already made a video on the topic, showing a roughly 30% increase in memory usage, likely due to various pointers needing 64 bit instead of 32, but there is likely a bunch of other overhead where variables have gotten larger for the sake of larger variables. (If this is important to you, only the applications you make/run will determine that.)
Some might though say that moving to 64 bit allows for more than 4GiB of RAM, but no, ARM already has features for expanding its memory capacity beyond 4GiB on some of its 32 bit processors, though not for a single process.
Though, few architectures seem to support 32 bit addressing (even locally) while still having 64 bit registers available. So in a lot of cases, if one wants 64 bit variables and a slew of extra instructions using those registers, then one has no choice but to also embrace 64 bit addressing, even if one might never need it. (This has led to the common misconception where people mix up address space and variable size. I.e., stop waiting for 128 bit x86: 128 bit instructions and registers already exist in it (for both Intel and AMD processors), while 64 bit addresses cover enough memory as is at 16 EiB.)
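(To the parenthetical point, a small GCC/Clang illustration that data wider than the address size is nothing exotic: the __int128 extension gives 128 bit integer arithmetic on 64 bit targets while pointers stay at 8 bytes. None of this implies 128 bit addressing.)

    /* Works with GCC/Clang on 64 bit targets: 128 bit integer arithmetic
       alongside ordinary 64 bit pointers. */
    #include <stdio.h>

    int main(void)
    {
        unsigned __int128 big = (unsigned __int128)1 << 100;
        printf("high 64 bits: %llu, sizeof(big)=%zu, sizeof(void*)=%zu\n",
               (unsigned long long)(big >> 64), sizeof(big), sizeof(void *));
        return 0;
    }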
Data compression and decompression, and encryption are the most common workloads on the modern desktop and these things all work much better in 64 bit.
In WHOSE desktop? Oh, you mean web browsing. I thought we were talking about computers.
Who actually cares about whether Raspberry Pi OS is 64 bit or not? BFUs don’t, and if you use RPis for something more serious, you use Buildroot, Yocto, NixOS, etc… where it is up to you to decide how many bits you want to use…
Well, I am sure there are some people that will care :) . I had loaded PI OS 64 beta on one of my PI4s to learn 64bit assembly programming when it first was released — for fun of course. It has never missed a beat. I’ll load PI OS 64 on any future PI4 or CM4 modules that I acquire … just because I want to … not need to.
I love the Pi for one very particular thing. A Raspberry Pi is as good as a small AWS EC2 instance. The ones on AWS cost a lot; the Pis cost $35. After buying one every month for an entire year, I put together a 12-Pi cluster.
Now what would someone actually do with such a thing?
1. I learned K8S for free (price of the Pis aside, plus the AWS or other cloud provider’s fees for using their environments for your learning purposes), which led to a promotion at work that came with a significant pay increase.
2. Whatever I used to run at home before is now set up with HA all around, with the only point of failure being the electric power to the house.
I could probably have learned the same thing from buying 2nd-hand servers on eBay, but they are a lot louder and consume a lot more power than Pis do. Plus, they’re big; you can’t just take one of those and do with it what the small size of the Pi allows.
Now, I did do that stuff by running Ubuntu 20 on it, instead of the new Pi OS, but being able to use a fully supported OS from the people who make the hardware is a very nice thing.
In summary, I have my very own private cloud and it cost me a bit over $350, and once the electrician has a chance to hook up the generator to my home, it will have no single point of failure.
> we think the dual-core Pi 2 require a 32-bit version
Both Pi2 branches are quad cores. The old ones are 32-bitters, the new ones are 64-bitters (kinda down-clocked Pi3ers without the wireless stuff chip).
I was hoping to find this comment here, though I didn’t realize there was a 64-bit Pi 2; I thought they were all 32-bit.
The later revisions (1.4 IIRC) had the same SoC as the Pi 3 (I believe the Pi 2’s SoC was EOL’d). So it was basically a Pi 3 minus WiFi and Bluetooth.
Just recently I found out that the 64 bit Pi has no Widevine – necessary to play Netflix, Disney+ and the like. See the last paragraphs of the Rpi news announcement.
I’ve been using 64-bit Debian on a Pi4 and find that v4l2 isn’t quite as flexible as MMAL for interacting with the HQ camera hardware. It doesn’t look like the official release changes this. For example, out of the many exposure modes, only a handful actually work. There doesn’t seem to be a way to adjust camera ISO in video capture mode.
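(For anyone poking at the same thing, a hedged sketch of driving a control directly through the V4L2 ioctl interface; the device node and control choice here are assumptions, and whether a given control is actually wired up depends on the Unicam/sensor driver in use, which is exactly the limitation being described.)

    /* Sketch: try to set analogue gain (the closest V4L2 control to "ISO").
       An EINVAL/EACCES error here is how you discover a control isn't hooked up. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    int main(void)
    {
        int fd = open("/dev/video0", O_RDWR);   /* assumed device node */
        if (fd < 0) { perror("open"); return 1; }

        struct v4l2_control ctrl = {
            .id    = V4L2_CID_ANALOGUE_GAIN,
            .value = 200,                       /* arbitrary example value */
        };
        if (ioctl(fd, VIDIOC_S_CTRL, &ctrl) < 0)
            perror("VIDIOC_S_CTRL");

        close(fd);
        return 0;
    }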
Still patiently waiting to be able to buy a Compute Module, since the supply chain crunch has made some (most) of the SKUs into unicorns…
Know the feeling. I am ready to pick up another RPI4 for a little project I am entertaining, and I’d like to get a CM4 with all the bells and whistles for fun. Unobtainium…
The AArch64 instruction set is quite a bit different from the ARM32 set; the difference is wider than that between the x86-64 and x86 instruction sets. Comparisons generally favour the 64-bit instruction set over the 32-bit one on typical ARM applications.
My problem with 64-bit is that if you have even one 32-bit “legacy” app you will want to bring in a lot of libraries in order to have a working runtime in a dual-architecture setup. We do this on x86 all the time, but we’re also not running off the cheapest microSD card on Amazon. Overall, it’s a small complaint, and it will be good for RPi to move to 64-bit sooner rather than later.
Eventually we’ll see new ARM SoCs that are 64-bit only and won’t even include 32-bit support. It’s hard to say whether RPi will ever find itself on one of these products. But some of the datacenter-centric ARMs have moved in this direction. UEFI alone can make it very difficult to support two architectures, so this limitation is already in the wild to some degree. But the next generation of chips looks to have removed 32-bit support.
The main difference right now in 64-bit Pi is the ability to support more than 4GB RAM. This wasn’t a problem with the Pi 3, since it only had 1GB, but with the Pi 4 now supporting up to 8GB, it’s necessary.
The chip and Raspbian support the Large Physical Address Extension, allowing for a 40 bit address space. So that is 1 TB of RAM, all while still working as if it were just a humble 32 bit processor.
Though, a given process can only access 4 GB.
And according to the raspberrypi.com article, a given process can only access 3GB, as the top 1GB is reserved for the kernel. Nothing I do even comes close to using 3GB of memory in a single process!
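(A quick way to see that per-process limit in practice, as a rough sketch with the exact numbers depending on the kernel split and overcommit settings: ask a single process for ~3.4 GiB and compare a 32 bit build with a 64 bit one.)

    /* A 32 bit process cannot satisfy this request (user address space is
       ~3 GiB, and glibc refuses anything above PTRDIFF_MAX anyway); a 64 bit
       build of the same program gets the allocation, RAM and overcommit
       permitting. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t want = (size_t)3500u * 1024u * 1024u;   /* ~3.4 GiB */
        void *p = malloc(want);
        printf("malloc of %zu MiB %s\n", want >> 20, p ? "succeeded" : "failed");
        free(p);
        return 0;
    }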