Single Event Upsets: High Energy Particles From Outer Space Flipping Bits

Our world is constantly bombarded by high-energy particles from various sources, and if they hit in just the right spot on the sensitive electronics our modern world is built on, they can start flipping bits. Known as Single Event Upsets (SEU), their effect can range from unnoticeable to catastrophic, and [Veritasium] explores this phenomenon in the video after the break.

The existence of radiation has been known since the late 1800s, but the effect of low-level radiation on electronics was only recognized in the 1970s, when trace amounts of radioactive material in the ceramic packaging of Intel DRAM chips started causing errors. The most energetic particles come from outer space and are known as cosmic rays. They originate from supernovas and black holes, and on Earth they have been linked to an impossibly fast Super Mario 64 speedrun and a counting error in a Belgian election. It’s also possible to see their path using a cloud chamber you can build yourself. There are even research projects that use the camera sensors of smartphones as distributed cosmic ray detectors.

Earth’s magnetic field acts as a protective barrier against the majority of these cosmic rays, and there is a measurable increase in radiation as you gain altitude and enter space. In space, serious steps need to be taken to protect spacecraft, and it’s for this reason that the Perseverance rover that landed on Mars this year uses a 20-year-old main computer, the PowerPC RAD750. It has a proven track record of radiation resistance and has been used on more than a dozen spacecraft. Astronauts experience cosmic radiation in the form of flashes of light when they close their eyes, and protecting their DNA from damaging effects is a serious concern for NASA.

It’s impossible to know the true impact of cosmic radiation on our world and even our history. Who knows, one of those impossible-to-replicate software bugs or the inspiration for your latest project might have originated in another galaxy.
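
To get a feel for why one flipped bit can range from unnoticeable to catastrophic, here is a minimal Python sketch that simulates an upset in a stored value. The function names `flip_bit` and `flip_float_bit` and the chosen values are made up for illustration; nothing here models real hardware, it only shows how much a value changes depending on which bit is hit.

```python
import struct


def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted (XOR with a single-bit mask)."""
    return value ^ (1 << bit)


def flip_float_bit(x: float, bit: int) -> float:
    """Flip one bit in the 32-bit IEEE 754 encoding of x and decode it again."""
    (raw,) = struct.unpack("<I", struct.pack("<f", x))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


# Hitting bit 0 of a counter changes it by 1; hitting bit 12 changes it by 4096.
print(flip_bit(5, 0))   # 4
print(flip_bit(5, 12))  # 4101

# In a float, the same single flip can do very different things:
# bit 23 is the lowest exponent bit of a 32-bit float, so 1.0 becomes 0.5.
print(flip_float_bit(1.0, 23))  # 0.5
```

Flipping the same bit a second time restores the original value, which is why error-correcting schemes only need to work out which bit was hit.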

13 thoughts on “Single Event Upsets: High Energy Particles From Outer Space Flipping Bits”

  1. In a lifetime of different computer occupations, starting as a nighttime computer operator in high school (1970s) through retirement, I’ve only been able to prove one true runtime/CPU/memory computer error. It was a COBOL program running on a Burroughs B6800. A program I had written crashed with a data exception. I remember it as an index or subscript being out of range. We printed the trace (a log of the executed statements and data/variable values referenced). The offending index (or subscript) value had magically changed. The log of the instructions executed proved that the value could never have been set to such a value. I don’t care if it was just a fluke memory error or a stray ray from space…I can always say with a straight face: “I’ve seen a computer make a mistake, even when the code was perfect”. BTW, that program never, ever crashed before or after that event while I was with that employer.

  2. I actually had something like this happen overnight this week. Woke up to find that Chrome had crashed due to an illegal opcode. Considering the text segment is W^X and thus shouldn’t have been writable by anything in the whole damn system, it’s quite possible that a cosmic particle with zero respect for page table configs is responsible.

    I once read about an occurrence someone else had, where they kept getting a strange crash in one specific library. Upon comparing it against a copy pulled from the Debian repo, they found that exactly one bit was different, corrupting an instruction to become an illegal opcode. The theory is that the bitflip must have occurred during an update while the file was in RAM, and was subsequently written to disk.

    Unfortunately, while virtually 100% of my desktop computers use ECC RAM, my laptop, where the problem earlier this week occurred, does not. There are VERY few laptops out there that can accept ECC SODIMMs, and SODIMMs with special on-die ECC might well be a myth. That could change with DDR5 and its spec for on-die ECC, but I’m not holding my breath.

    (it’s actually possible this is yet another attempt to mitigate rowhammer, now that TRR is broken and we’ve been at it now since 2014)

  3. This is where a TI Hercules can be a help. Ingenuity has a few. It’s a lockstep dual-core Cortex-R5 (the cores are rotated 90 degrees to each other) with ECC memories, parity, etc. I was playing around using Ada on the RM57 series (the little-endian one); the TMS570 ones are the big-endian series. They sell them for safety-critical stuff, train control, etc. They have a boatload of certifications. The other cores that are popular are the LEON3/4 series; typically these get placed in rad-hard FPGAs.

  4. Doug Sinclair (Sinclair Interplanetary, now RocketLab) prepared a good down-to-earth (har) guide to sensible rad-hardening of space systems. https://s3vi.ndc.nasa.gov/ssri-kb/static/resources/Radiation%20Effects%20and%20COTS%20Parts%20in%20SmallSats.pdf
    (ugh, why do people leave spaces in URLs?).
    Also available as slides from the talk at https://digitalcommons.usu.edu/smallsat/2013/all2013/69/

    Quite a different radiation environment compared to under our ten tons per square meter of atmosphere shielding.

    1. ZFS sometimes leaves me a “scan: scrub repaired …” note to let me know that it fixed something up on my consumer-grade systems. I haven’t seen the same message in the data centre running near identical workloads. Who knows if cosmic rays are to blame though? Other sources of radiation coupled with modern storage operating on the ragged edge of what is possible without UREs could also be a cause.

    2. People who have data worth protecting will use a better filesystem or even store redundant copies.
      For the rest of us, if a bit gets flipped in one of my videos or even some system DLL, it’s not really big enough of a problem to be worth taking extra space.

  5. Single event upsets are a serious problem for large particle physics experiments like ATLAS at CERN. We are trying to track the thousands of particles coming out of a high energy collision with silicon sensors of several kinds (PIXEL, Strip, Pads, etc.). What all these silicon detectors have in common are insanely high readout channel counts, on the order of tens of millions. All these channels are read out with highly integrated readout chips with mixed analog and digital functionality. Each chip reads out, digitizes and zero-suppresses 256 detector channels. Although the 130nm technology used is relatively radiation hard, single event upsets can still happen and can render chips dysfunctional until they get actively reset or power cycled. Because of the high channel count and very high particle densities you will see multiple SEUs in your detector system per minute (or even second), effectively creating a ‘blind spot’ in your silicon sensor tracking system.
    After the initial readout chips were developed and prototyped, it became apparent through further simulations and actual irradiation of the prototype chips that SEUs would occur too frequently. New designs had to be made to triplicate sensitive parts of the chips. This redesign took a couple of years and has delayed our project by almost the same amount of time. This not only screws up the overall schedule but also makes the project way more expensive because of the standing army problem.
    So SEUs are not much of a problem with the computer on your desk, but can cause major headaches when you are building massive systems which have to be fault tolerant.

  6. PowerPC? Interesting.

    The last time I read this story NASA was still using 386s and 486s in space and it was thought they might never upgrade. The larger transistor size in such old chips made them less vulnerable.
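
The triplication mentioned in the comments above can be illustrated with a simple majority vote: keep three copies of a value and, on every read, trust whatever at least two of them agree on. This is only a software sketch of the idea. Real rad-hard designs do this in hardware on individual registers, and the function name `vote` and the example values are assumptions made for illustration.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three copies: each output bit takes the value
    that at least two of the three inputs hold at that position."""
    return (a & b) | (a & c) | (b & c)


stored = 0b1011_0010
copies = [stored, stored, stored]

# Simulate a single event upset: flip bit 5 in one of the three copies.
copies[1] ^= 1 << 5

# The two untouched copies outvote the corrupted one.
print(vote(*copies) == stored)  # True
```

The vote only fails when the same bit is flipped in two copies before the value is rewritten, which is why such schemes are usually paired with periodic refreshing (scrubbing) of the stored copies.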
