Error-Correcting RAM On The Desktop

When running a server, especially one with mission-critical applications, it’s common practice to use error-correcting code (ECC) memory. As the name suggests, it uses an error-correcting algorithm to continually check for and fix certain errors in memory. We don’t often see these memory modules on the desktop for plenty of reasons, among which are increased cost and overhead and decreased performance for only marginal gains, but if your data is of upmost importance even when working on a desktop machine, it is possible to get these modules up and running in certain modern AMD computers.

Specifically, ECC was available on AMD Ryzen CPUs, but since the 7000 series launched with the AM5 socket, the feature has no longer been officially supported. [Rain] decided to upgrade their computer anyway, as there were rumors floating around the Internet that the feature might still be functional. An update to the new motherboard’s UEFI was required, as well as some tweaks to the Linux kernel to make sure there was support for these memory modules. After probing the system’s behavior, [Rain] verified that the ECC RAM was working and properly reporting errors to the operating system.
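For readers who want to see what "reporting errors to the operating system" can look like on Linux, here is a minimal Python sketch. It assumes the kernel's EDAC driver is loaded and exposes its usual per-controller counters (`ce_count` for corrected errors, `ue_count` for uncorrected ones) under `/sys/devices/system/edac/mc`. This is not necessarily how [Rain] checked it, and the function name `read_edac_counts` is made up for this example.

```python
from pathlib import Path
from typing import Dict


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> Dict[str, Dict[str, int]]:
    """Return corrected/uncorrected error counters per EDAC memory controller.

    Assumption: the EDAC sysfs layout (mc0, mc1, ... each containing
    ce_count and ue_count). Returns an empty dict if nothing is found.
    """
    base = Path(root)
    counts: Dict[str, Dict[str, int]] = {}
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry: Dict[str, int] = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        if entry:
            counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found (driver not loaded, or ECC inactive?)")
    for controller, entry in found.items():
        print(controller, entry)
```

An empty result only means that no EDAC controller was visible to this script; it does not by itself prove that ECC is disabled.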

Reporting to the OS and enabling the correct modules is one thing; actually correcting an error is another. It turns out that introducing errors manually and letting the memory correct them is also possible, and [Rain] was able to perform this check too. While ECC RAM may be considered overkill for most desktop users, it offers valuable data integrity for professional or work-related tasks. Just don’t use it for your Super Mario 64 speedruns.

55 thoughts on “Error-Correcting RAM On The Desktop”

    1. Some Ryzen lines seem to quietly support it as well, or it “isn’t officially supported” but seems to work anyway. Very helpful for home NAS builds if you want to be thorough. I think the Ryzen Pro lines generally support it; they’re not meant to be sold direct to the consumer but they’re fairly easy to get hold of anyway.

    2. All Intel Core CPUs support ECC now, but Intel artificially limits it to workstation motherboards (W680 chipset) only. My home server has an Asus PRO WS W680M-ACE SE motherboard, Intel Core i5-13500, and 64GB of DDR5 ECC RAM. The motherboard is expensive, but it does have a BMC and IPMI, which is very handy (and actually ends up cheaper than buying a cheaper motherboard plus a PiKVM for remote control).

    1. Even around 1990, the 30 pin SIMMs were sold in 8 bit and 9 bit variants. Macs used the 8 bit non parity ones but PCs used the 9 bit parity ones. Might not have been able to correct errors, but at least you stood a chance of getting a useful error code, rather than just having your software go inexplicably off the rails.
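The difference between detecting and correcting can be shown with a tiny Python sketch of a 9-bit parity scheme: eight data bits plus one parity bit. Even parity is chosen here purely for illustration (the comment does not say which convention those SIMMs used), and the helper names are hypothetical.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit for an 8-bit value: 1 if the count of set bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte: int, stored_parity: int) -> bool:
    """True if the byte still matches the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity


original = 0b10110010
stored = parity_bit(original)

one_flip = original ^ 0b00000100   # a single bit flipped
two_flips = original ^ 0b00000110  # two bits flipped

print(parity_ok(original, stored))   # True: data intact
print(parity_ok(one_flip, stored))   # False: error detected, but not locatable
print(parity_ok(two_flips, stored))  # True: a double flip slips through
```

Parity can flag a single flipped bit but cannot say which bit it was, and an even number of flips goes unnoticed. That is the gap ECC is meant to close.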

      1. And 72 pin SIMMs came in 32 and 36 bit versions. The 36 bit ones can be used in a 32 bit only computer; the extra RAM is ignored. I think those were the last to offer parity RAM as standard. Afterward, modules were either standard or ECC only.

    1. I’m not surprised a national lab wanted you to buck industry practices. Those large scale systems are where those once-in-a-million-cpu-hour faults can crop up daily.

      Fortunately things have changed in the last few decades. ECC for RAM is usually done over 64 bits with an 8 bit ECC word spread across multiple chips (chipkill ECC), which significantly decreases the odds of a false positive and makes SECDED reliable even in exascale systems. Memory scrubbing for errors and induced faults (pTRR / rowhammer) has moved to engines within DRAM devices themselves. On top of that, modern server CPUs do data poisoning across caches, buses, memory, and storage to prevent errors at any point from spreading.
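As a rough illustration of the SECDED idea (single error correction, double error detection) over 64 data bits with 8 check bits, here is a minimal Python sketch of a textbook extended Hamming code. It is an assumption made for illustration only: real memory controllers and chipkill schemes use different, symbol-oriented codes, and the names `encode` and `decode` are hypothetical.

```python
from typing import List, Optional, Tuple


def _is_data_position(pos: int) -> bool:
    # Positions that are powers of two hold Hamming check bits.
    return pos & (pos - 1) != 0


def encode(data: int, data_bits: int = 64) -> List[int]:
    """Encode data into an extended Hamming codeword (index 0 = overall parity)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1                       # 7 check bits for 64 data bits
    n = data_bits + r
    code = [0] * (n + 1)
    bit = 0
    for pos in range(1, n + 1):
        if _is_data_position(pos):
            code[pos] = (data >> bit) & 1
            bit += 1
    for i in range(r):
        p = 1 << i
        code[p] = sum(code[pos] for pos in range(1, n + 1)
                      if pos & p and pos != p) % 2
    code[0] = sum(code[1:]) % 2      # 8th check bit: parity over everything
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome <= n:
        code[syndrome] ^= 1          # single flip: the syndrome names the bit
        status = "corrected"
    else:
        return "uncorrectable", None # even number of flips: detect only
    data, bit = 0, 0
    for pos in range(1, n + 1):
        if _is_data_position(pos):
            data |= code[pos] << bit
            bit += 1
    return status, data


word = 0xDEADBEEFCAFEF00D
codeword = encode(word)
print(len(codeword))                 # 72 bits: 64 data + 8 check

single = list(codeword); single[10] ^= 1
double = list(codeword); double[5] ^= 1; double[40] ^= 1
print(decode(single)[0], decode(double)[0])
```

With three or more flipped bits a code like this can silently miscorrect, which is one reason SECDED is of limited use against deliberate tampering.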

      1. > Was coming here (from the RSS post) to do the same.

        Which – to mention the incorrect usage of ‘upmost’, or to crack an updog joke? :P

        > You know, though… sometimes we don’t know what we don’t know.

        That is indeed the case. :)

  1. ECC memory can have a higher price, higher power consumption and lower performance compared to non-ECC memory. If it’s not needed it’s best to leave it out. In aerospace it is useful as high altitude and space have higher levels of background radiation which can flip bits. But for desktop PCs I don’t see a use yet.

    My guess is the more bits the higher chance a bit flips, so I guess that, as we are pushing the limits of modern memory in terms of performance and size, error correction might become more and more useful. Or is reliability not an issue? Can anyone verify this?

    I wonder if it can help protect against certain security attacks. If temperature, supply voltage or EMF cannot corrupt memory unnoticed it might make the system more reliable.

    1. ECC is not only used in aerospace applications, but also in automotive stuff. At least safety-related devices usually rely on ECC memory (otherwise they would not pass some certifications). Like you said, this is mainly a reliability issue. For security, the benefit is limited since mostly SECDED ECC is used which means that (deliberate) changes of memory might not be detected reliably.

    2. > ECC memory can have a higher price, higher power consumption and lower performance compared to non-ECC memory.

      Slightly, slightly, and negligibly.

      > If it’s not needed it’s best to leave it out.

      It’s needed.

      > But for desktop PCs I don’t see a use yet.

      … which you follow with a bunch of comments that show you don’t know.

      > My guess is the more bits the higher chance a bit flips,

      Good guess. You have to “guess” this?

      > so I guess that, as we are pushing the limits of modern memory in terms of performance and size, error correction might become more and more useful.

      We went over the limits ages ago. Or, more correctly, the probability of an uncorrected error went beyond what a reasonable person would accept ages ago.

      DRAM gets errors. RAM errors in consumer PCs crash programs and eat data all the time. You can expect that to happen to you every few months on your average machine.

      > Or is reliability not an issue? Can anyone verify this?

      Of course reliability is an issue. How could reliability ever *not* be an issue?

      > I wonder if it can help protect against certain security attacks.

      It’s somewhat effective against Rowhammer. Not effective enough that RAM suppliers shouldn’t be preventing Rowhammer in the underlying chips and refresh logic… but effective enough that they often use it as an excuse *not* to.

      1. “which you follow with a bunch of comments that show you don’t know.”
        That’s why I asked.

        “Of course reliability is an issue. How could reliability ever *not* be an issue?”
        If the memory is reliable it’s not an issue.

        “We went over the limits ages ago. Or, more correctly, the probability of an uncorrected error went beyond what a reasonable person would accept ages ago. DRAM gets errors. RAM errors in consumer PCs crash programs and eat data all the time. You can expect that to happen to you every few months on your average machine.”
        That’s all I wanted to know. I could only find data on failure rates of server RAM. I didn’t know

    3. If you have a CPU, you have ECC. If you have DDR5, you already have ECC. If you have a drive, you have ECC. It’s inside your processor’s cache. It’s inside the memory chips themselves. It’s on every page of your SSD. Your computer is applying ECC protection to data at multiple stops across its lifetime. Then the ECC is ripped off, maybe a checksum is applied for transit across a bus, then a new ECC is re-applied at the new location. Instead of providing end-to-end protection for the many paths where data is just being copied around, your data is left unprotected at its most precarious transitions. Unless you pay Server Prices to Intel to do the same thing, just not in the stupidest way.

      1. Funny, I recently dealt with a new DDR5 RAM stick with bitfade errors (on a new 7600X machine).
        Nothing corrected the errors. It made the system very unstable.

        MemTest86 found the problem.
        It threw me, after decades of good results, this is the first bad, new stick of name brand RAM I’ve ever seen.
        Excluding, of course, the ‘computer show and sale RAM’, but that was just some scumbag selling returns.

  2. I had ECC in my desktop some time back in the early naughts.

    I only bought it because the motherboard I had required it. And I only bought that because it was dual processor. Having two processors was awesome back in the day when they only came with single cores! Being more separated than cores in a single cpu they weren’t vulnerable to some of those bugs we see today that require one to choose between speed and security.

    But wow.. so expensive! It ate so much electricity! And generated so much heat! It eventually became unreliable, I think parts were starting to wear out due to that heat. I am happy to have a bunch of cores stuffed in my single die CPU and non-ECC RAM today.

    1. There’s 128GB of ECC RAM (and dual CPUs) in the system I’m typing this on, and nothing in the case, including the RAM, runs very hot. Not even on the warm side. All air cooled. System cooling is what came with the case.

      1. AMD hardware of the late 90s/early naughts was notoriously hot running. Double the CPU count in that and you were really cooking.

        If you have 128GB of RAM I very much doubt we are talking about the same generation of hardware.

        Enjoy that setup!

    2. Me too, but earlier. Dual Pentium pro.

      When I got rid of it, I checked in bios for the ECC logs. It had corrected one bit in the years I had run it.

      That said, IIRC, that machine had a whopping 8 MB of RAM. Still never bought ECC again.

  3. Wow, AMD on the cutting edge of emulating Apple by ‘innovating’ stuff that everyone else has been doing for years.

    Wonder how long until they announce hardware support for a second mouse button or rounded corners on CPUs

    1. In case you didn’t read TFA, Ryzen CPUs have supported ECC memory for ages (while much of the Core i range didn’t). With the 7000 series ECC support isn’t mentioned, but apparently works.

    2. Lol. Nice troll.

      AMD has had ECC in their AMD64 CPUs since K8’s release in 2003. Apple has changed architectures twice since then, and didn’t invent jack in their CPU and chipset until M1 released three years ago.

        1. Surely, if it’s other people who are thin skinned and not you, then there’s no reason to use kid gloves when correcting something you’ve said? After all, if you’re thick skinned, you can gracefully accept when you’re wrong?

  4. It’s absolutely insane to build a multi-gigabyte DRAM system without ECC, and the only real reason that’s common practice is that Intel decided to do market segmentation by only supporting ECC for “server processors”.

    If you don’t care about the data, why do you bother to have a computer?

    1. ECC on DDR5 is limited to the chips themselves and not the link to the CPU. The CPU doesn’t know that any errors happened (so is unable to spot a chip going bad) and can’t correct an error over the link. You can get ECC DDR5 modules that do both.

  5. If you search for “google ecc error study” you’ll get a lot of useful information, but the tl;dr is that at a single machine scale it isn’t really a problem, but at data center scale it can become a real issue depending on the application. It’s often possible to design the application to tolerate the errors caused by a malfunctioning process or a crashing machine, but they often lead to other problems such as increased tail latency for RPCs.

    Note that it’s an issue for other pieces of equipment too. For example modern data switching ASICs contain enormous amounts of memory that are used during packet switching that are often only protected by a single parity bit (looking at you Broadcom!). At data center scale this can result in a significant number of lost or misdirected packets, increased tail latency etc.

      1. That’s not correct. The study shows that on average, a given DIMM has an 8.2% chance of a *correctable* error per year of runtime. On a home or office PC with up to four DIMMs, that’s a 32.8% chance that you will have a single-bit flip in memory, or roughly one incidence in three years of constant operation. For that to be noticeable, the bit flip must be in critical data: unflushed disk caches, resident programs and data that affects program state. But it’s much more likely that bit is being unused or used for inconsequential data like graphics textures or bitmap caches.

        I’ve had Windows crash for no good reason on a single DIMM laptop more than once every three years.

        This is also based on older DDR1/DDR2/FBDIMM technology that does not have on-die error correction to mitigate cell crosstalk like DDR4 and DDR5.
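The per-machine estimate above can be checked with a short calculation. The 32.8% figure simply adds the per-DIMM rates, which is better read as the expected number of affected DIMMs per year. If the DIMMs are treated as independent (an assumption, as is the hypothetical function name), the chance that at least one of them sees a correctable error is 1 − (1 − p)^n, which comes out slightly lower.

```python
def p_at_least_one(p_per_dimm: float, dimms: int) -> float:
    """Chance that at least one DIMM sees an error in a year.

    Assumes each DIMM fails independently with the same yearly probability.
    """
    return 1.0 - (1.0 - p_per_dimm) ** dimms


p = 0.082          # per-DIMM yearly chance of a correctable error, as quoted above
n = 4              # DIMMs in a typical desktop

print(f"simple sum:       {p * n:.1%}")                 # 32.8%
print(f"at least one hit: {p_at_least_one(p, n):.1%}")  # 29.0%
```

Either way, the conclusion in the comment stands: on the order of one correctable error every three years or so for a fully populated four-slot machine, under these assumptions.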

  6. AMD has pretty much always supported ECC. Even my old Phenom II in a cheapo motherboard supports ECC, even though the motherboard’s chipset is capped at only 4GB. My cheapo AM1M-A with a cheapo 25 watt AMD CPU supports it too.

    It’s Intel that’s largely responsible for the lack of ECC in consumer devices, as a source of product differentiation. Want hardware that actually works? Better pay more. Somehow AMD have been able to include this feature in every CPU for the better part of 15 years….until now that they want to hop on the product differentiation bandwagon too.

    Time and time again research has shown that bitflips happen all the time, but we merely get “lucky” in that they strike something (e.g. a bitmap image) that doesn’t matter. (most sites cite https://web.archive.org/web/20110819185612/http://lambda-diode.com/opinion/ecc-memory which in turn cites https://web.archive.org/web/20090226195204/http://www.boeing.com/assocproducts/radiationlab/publications/SEU_at_Ground_Level.pdf
    Veritasium more recently cites a different report with even worse numbers, presumably due to die shrinkage https://www.youtube.com/channel/UCHnyfMqiRRG1u-2MsSQLbXA)

    Bitflips are hard to diagnose if you can’t even detect they’ve happened. The end result is that chipmakers (Intel) get away with treating consumers and their data and their time as expendable.

      1. From the article:

        Update: Lots of commenters have pointed out that there’s no way to be certain the visits to his domain were the result of bit flips. Typos may also be the cause. Either way, the threat posed to end users remains the same.

  7. I have mission critical software that I have written (several million lines of code) and maintained for the last 40 years, with runtime in the 100s of millions of hours. Most of it is in the health care industry so it has to work. And in all that time I have never seen a single instance where ECC or non-ECC RAM has ever been the issue or saved/killed the day. I have had just flat bad RAM, but those machines rarely even get to their base operating state before they die.
    I have been seeing this same thread done over and over and over and over for the last 50 years. ECC ram just really does not make a hill of dust difference in the real world. Most of the time it is just a pain in the ass to find what ever obscure stick of memory the machine needed the week it was manufactured or to match it to the rest of the ram in it. And it costsssssssssssssssssssssssssssssssssssssssssssssssssssss more.

    1. Oh yeah, and then you get the big server vendors, and you know who they are, where they have the BIOS look for the most exotic features that whatever spec of RAM they use has and refuse to run without them, even though they never use them, because there is no way you will ever find that RAM in any channel other than from the OEM.

  8. My Dell Precision 7250 laptop with Xeon CPU supports ECC, so it’s not just AMD.
    I filled it with 64GB non-ECC though, as ECC SODIMM DDR5 RAM was practically impossible to find here in Japan (and probably still is).
    AFAIK, any Xeon based system supports ECC RAM (but you gotta pay more for a “Windows for Workstations” license because eff-you…).

  9. This thread reminded me that some years ago I had to change the RAM of my desktop computer. The catch is that I was so amazed at the price of the sticks I had found on eBay that I didn’t read the title. I had just bought 4x4GB of ECC RAM.

    The H97-Pro Gamer motherboard I have shouldn’t be compatible with ECC, but after 5 years the computer is still working like a charm.

    I remember checking a thread that said that particular RAM (Kingston KVR16LE11L/4) could work with non-ECC systems. At the time it seemed like nobody knew why. Not caring in any way, I just plugged the memory into the motherboard as soon as I found out it could run without “exploding”. In the end I’ve been very lucky, and I’m almost sure ECC is not working in my system, or is it?

  10. ECC RAM for a machine that’s on all the time is a great investment in my professional and personal opinion. Anecdotally, I’ve an old Dell Precision with a Xeon CPU and 8GB DDR2 ECC that’s been running for 12 years now and never crashes (except on a Windows update when Windows 10 decides the drivers need to be updated). I’ve had multiple desktops, laptops, and tablets of all generations and they crash; servers rarely crash because of the additional redundancy built into the hardware. It’s inevitable that things will have errors. Binary isn’t binary, bits flip incorrectly, and error correction is built in by design on EVERYTHING that stores or transfers any sort of data, so having an additional redundant safeguard on the data that’s constantly being written and read just makes sense.
