Good news this morning from low Earth orbit, where the Hubble Space Telescope is back online after a long and worrisome month of inactivity following a glitch with the observatory’s payload computer.
We recently covered the Hubble payload computer in some depth; at the time, NASA was still very much in the diagnosis phase of the recovery, and had yet to determine a root cause. But the investigation was pointing to one of two possible culprits: the Command Unit/Science Data Formatter (CU/SDF), the module that interfaces with the various science instruments, or the Power Control Unit (PCU), which provides regulated power for everything in the payload computer, more verbosely known as the SI C&DH, or Scientific Instrument Command and Data Handling Unit.
In the two weeks since that report, NASA has made slow but steady progress, methodically testing every aspect of the SI C&DH. It wasn't until just two days ago, on July 14, that NASA made a solid determination of the root cause: the Power Control Unit, or more specifically, the power supply protection circuit on the PCU's 5-volt rail. The circuit is designed to monitor the rail for undervoltage or overvoltage conditions, and to order the SI C&DH to shut down if the voltage is out of spec. It's not entirely clear whether the PCU is actually putting out something other than 5 volts, or whether the protection circuit has degraded since the entire SI C&DH was replaced during the last servicing mission in 2009. Either way, the fix is the same: switch to the backup PCU, a step that was carefully planned out and executed on July 15.
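As a rough illustration of what that protection circuit does (with made-up thresholds and names, not NASA's actual values or flight logic), the behavior amounts to a simple window comparator around the 5-volt rail:

```python
# Toy sketch of an over/undervoltage protection check, loosely modeled on the
# behavior described above. Thresholds and names are illustrative assumptions,
# not Hubble's actual values or flight logic.

UNDERVOLT_LIMIT = 4.75   # hypothetical lower bound for an "in spec" 5 V rail
OVERVOLT_LIMIT = 5.25    # hypothetical upper bound

def rail_in_spec(measured_volts: float) -> bool:
    """Return True if the 5 V rail reading falls inside the allowed window."""
    return UNDERVOLT_LIMIT <= measured_volts <= OVERVOLT_LIMIT

def protection_check(measured_volts: float) -> str:
    """Decide whether to keep running or command a protective shutdown."""
    if rail_in_spec(measured_volts):
        return "keep running"
    # An out-of-spec rail triggers an ordered shutdown of the payload computer
    # so a misbehaving supply can't damage the science instruments.
    return "command SI C&DH shutdown"

if __name__ == "__main__":
    for reading in (5.02, 4.60, 5.31):
        print(f"{reading:.2f} V -> {protection_check(reading)}")
```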
To its credit, the agency took pains to ensure that everyone involved was free from any pressure to rush a fix — the 30-year-old spacecraft was stable, its instruments were all safely shut down, and so the imperative was to fix the problem without causing any collateral damage or taking a step that couldn't be undone. Further kudos go to NASA for transparency — the web page detailing the agency's efforts to save Hubble reads almost like a build log on one of our projects.
There's still quite a bit of work to be done to get Hubble back in business — the science instruments have to be woken up and checked out, for instance — but if all goes well, we should see science data start flowing back from the space telescope soon. It's a relief that NASA was able to pull this fix off, but the fact that Hubble is down to its last backup is a reminder that Hubble's days are numbered, and that the best way to honor the feats of engineering derring-do that saved it this time and many times before is to keep doing great science for as long as possible.
i used to find it a little nerve-wracking to try to fix the server, because if i screwed it up, i had to get on a subway to the colo downtown… 2 hours out of my day, at least, just because i hit enter before thinking things through. these days, i'm not in such a position of authority. if i screw up my home server i just have to go down to the basement and hope the 20-year-old CRT down there still works so i can diagnose the problem. trying not to break the thing is kind of like a game of “floor is hot lava”, where failure has no real consequences but you play nonetheless. still feel like a moron when i kill the network interface before i've thought of how i'm gonna tell it to come back up!
fun to imagine being in these guys' shoes!
Checklists.
Also, lots of different computers, most of which have redundant modules, and which can poke into each other's memory without needing the other computer to actually be running. The NSSC-1 which caused the problem has four memory modules (all non-volatile core memory of 4 kilowords each — the NSSC-1 is an 18-bit beast) which can be switched between on demand by one of the _other_ computers. So you can upload a new firmware image, write it to memory, shut the NSSC-1 down remotely, swap modules, and start it up again, all in flight and without requiring the NSSC-1 to actually be operational.
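Roughly, that remote swap flow looks like the sketch below (the structure, step ordering, and data sizes are invented for illustration, not the real Hubble uplink procedure):

```python
# Illustrative-only sketch of the remote module swap described above. The
# structure and step ordering are assumptions for the example, not the real
# Hubble uplink procedure.

from dataclasses import dataclass, field

@dataclass
class Nssc1:
    """Minimal stand-in for the NSSC-1 and its four switchable memory modules."""
    modules: list = field(default_factory=lambda: [bytearray(4096) for _ in range(4)])
    active_module: int = 0
    running: bool = True

def remote_swap(target: Nssc1, new_image: bytes, spare: int) -> None:
    """Performed by one of the *other* computers; the target need not be running."""
    # 1. Write the new firmware image into a spare (non-active) memory module.
    target.modules[spare][:len(new_image)] = new_image
    # 2. Shut the target computer down remotely.
    target.running = False
    # 3. Switch which module it will execute from.
    target.active_module = spare
    # 4. Bring it back up on the freshly written image.
    target.running = True

if __name__ == "__main__":
    nssc1 = Nssc1()
    remote_swap(nssc1, b"\x01" * 64, spare=2)
    print(f"running={nssc1.running}, active module={nssc1.active_module}")
```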
I got tired of the risk, and set up a pair of servers, with a spare Ethernet port from each server connected to the BMC of the other server. Just for fun, I left the OS reinstall disk in each of their drives. It would be really hard to break badly enough to lock myself out, now.
Good luck after messing up your switch/router configuration
Well, there are direct connections between the servers, and each server has a line from the data center, so I would have to either crater both machines at the same time or mess with the BMC IP settings. Not impossible, but avoidable. I got real tired of driving 150 miles to troubleshoot, so I tried to make it hard to break.
We have a network at our vacation home, so it’s a 4 hour road trip to fix something if I fat-finger it.
Makes router firmware upgrades just that little bit more exciting.
The Telescope That Could. Way to go, NASA!!
Out of curiosity, how deep do the backups go on the various pieces of key equipment? Has that strategy changed between 30 years ago and now?
Yeah, I was surprised at the mention of *triple* redundancy of some components (the memory modules, IIRC)
NASA made a big deal about *no* redundancy (“single string redundancy”) in the recent Mars rovers Curiosity & Perseverance. The idea being: You remove the complexity and weight of redundant components and handover/fallback methods, and instead focus on fault *tolerance*, and allocate the weight and engineering and testing resources to designing the system to fail better and less often in the first place.
Akin’s Law of Spacecraft design #2: “To design a spacecraft right takes an infinite amount of effort. This is why it’s a good idea to design them to operate when some things are wrong.”
My favorite space story is the satellite that had a damaged battery system and went dark. Later, something went just wrong enough to bridge the connection to its solar panels, so it started transmitting again but only when the sun is on its panels.
AO-7?
I'm not surprised at the triple redundancy. It's just logical, common sense.
As a rule of thumb, the majority is “right”. I.e., it's unlikely that 2 of 3 devices fail; normally it's merely one of them. This makes 3 the minimum number needed for voting redundancy (see the sketch below).
Alas, exceptions prove the rule.
There was this story about an underground tram where exactly that redundancy failed.
Two of three circuits failed and the doors opened in the tunnels/closed at the stations. 😂
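For what it's worth, a minimal 2-out-of-3 voter (the sketch mentioned above; generic, not any particular spacecraft's code) also shows how the tram failure mode sneaks in: two bad channels happily outvote the good one.

```python
# Minimal 2-out-of-3 majority voter. Generic illustration only.

from collections import Counter

def majority_vote(readings):
    """Return the value agreed on by at least two channels, or None if all
    three disagree (the case simple voting cannot cover)."""
    value, count = Counter(readings).most_common(1)[0]
    return value if count >= 2 else None

if __name__ == "__main__":
    print(majority_vote(["closed", "closed", "open"]))  # one bad channel is outvoted
    print(majority_vote(["open", "open", "closed"]))    # two bad channels win, tram-style
    print(majority_vote(["open", "closed", "stuck"]))   # no majority at all
```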
Sounds like a case where the work to build in redundancy should have been traded away to make the system more robust and fault-tolerant in the first place.
Hubble has used backup hardware before, but during the servicing missions they would replace the damaged part and restore primary systems. With no planned servicing missions, the backup is all they have. Once that goes, it’s the end of the mission unless a commercial operator can come up with a practical servicing mission. As far as orbital work platforms go, nothing flying right now can hold a candle to the Shuttle, but that doesn’t mean it’s not possible.
Personally my money would be on SpaceX using a modified Crew Dragon, but a few years ago Sierra Nevada said they could do it with Dream Chaser:
https://www.spaceflightinsider.com/missions/space-observatories/trump-space-advisors-considering-hubble-servicing-mission/
SpaceX could mount an extendable device in the “trunk” below the Dragon capsule. That would grab the fixture on Hubble’s base, which was installed during one of the Shuttle’s servicing missions.
That would serve to keep the capsule and telescope together, and ideally allow the Dragon's maneuvering thrusters to boost Hubble's orbit.
The tricky part would be getting the crew to the access hatches and able to open them and remove and install parts. The Shuttle missions had the astronauts’ feet attached to a platform on the Canadarm.
Time to build an updated version of the MMU or modify a SAFER?
Or would it be easier to pull one of the obsolete, unused Keyhole spy satellites out of mothballs and rehab it, which would mean building instruments with optics that adapt a main mirror made for looking down at Earth to looking at things light-years away?
https://www.space.com/16000-spy-satellites-space-telescopes-nasa.html
Fix them up identically then fly them in formation to do some synthetic aperture observation work.
If it totally dies, it becomes junk?
If that happens, Musk and Bezos should race to replace the faulty PSU (and update the other electronics) and claim it for themselves.
They could take the world's most expensive selfie together! I assume that's the billionaires' version of saying GG.
This is great news to hear (we could all use some of that these days). Wonder if a robotic servicing mission would work, or even bringing it down to a lower orbit for a manned servicing mission. It is a national treasure that should be kept going.
I am sure the techs were waving their magic wands chanting “Hubble Hubble toil and trouble…”
The telescope was never intended to be serviced in orbit. The “barn doors” on the side, which allowed all the major components to be swapped out for versions with optics modified to work with the misshapen main mirror, were a consequence of designing it for ease of assembly. It's a good thing the builders didn't decide to rivet the doors shut. That would have resulted in Hubble being an unfixable failure.
Drilling out a bunch of rivets with metal shavings flying loose would not have been a thing anyone would have wanted to do in space.
Even with the big doors on the side, the wire connectors and bolts weren't designed to be removed in zero-G. One astronaut recounted how he corralled a tiny screw from a connector: he held an open plastic zipper bag in one hand while carefully and very gently patting the floating screw this way and that with the other until he was able to slip the bag opening over the screw to catch it. I don't remember if he had to get the screw back out of the bag for the replacement instrument or if its connector had a captive screw.
They could not leave any ‘extra’ parts behind because they could float off and damage other satellites or possibly get inside the Hubble telescope.
On one of the later Hubble servicing missions, the doors that were never meant to be opened again after final assembly on Earth had warped to the point where they wouldn't close. Fortunately NASA's people had thought of everything they possibly could: there was a ratchet strap available, and the doors on Hubble had handles strong enough to loop the strap through to pull them closed so they could be latched shut.
If anyone’s interested in this kind of computer, I did an epic live-coding video where I write an assembler and emulator for the OBP (which begat the AOP, which begat the NSSC-1 which the Hubble uses). The blog post and writeup is here: http://cowlark.com/2021-07-03-obp-simulator
I don't have specs for the NSSC-1, but it's apparently very similar to the OBP, which has an 18-bit data bus and a 16-bit address bus with up to 64 kilowords of addressable memory, all implemented with core memory and NOR gates, running at ~250kHz. The instruction set is surprisingly modern for something developed in 1968 (it was then streamlined a bit for the AOP); it would work fine in a modern embedded microcontroller. And the weirdest feature is… the OBP assembler uses _natural language_.
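Just to make the word size concrete, here's a toy illustration of arithmetic on an 18-bit-word, 16-bit-address machine. It's not an OBP emulator, just the masking such a machine implies:

```python
# Toy illustration of 18-bit-word, 16-bit-address arithmetic, as on the
# OBP/NSSC-1 family described above. Not an emulator; just the wraparound.

WORD_MASK = (1 << 18) - 1   # 0o777777, the largest 18-bit value
ADDR_MASK = (1 << 16) - 1   # 16-bit addresses give 64 kilowords of memory

def add18(a: int, b: int) -> int:
    """Add two 18-bit words, discarding any carry out of bit 17."""
    return (a + b) & WORD_MASK

if __name__ == "__main__":
    print(oct(add18(0o777777, 1)))    # wraps around to 0o0
    print(oct(0o377777 & ADDR_MASK))  # an 18-bit value truncated to a 16-bit address
```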
Find a tech named Mike Nelson to do it… “Mike fixed the Hubble, Mike fixed the Hubble!”
How does he eat & breathe though?
it's amazing what kind of fixes can be phoned in.
Why isn't the switchover automatic?
Waiting for command is just an extra step that could fail.
What if the automatic switchover failed and then it got stuck in a loop trying to switch over? Less manual control = less repairability.
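A common compromise (sketched generically below, not Hubble's actual fault-management logic) is automatic failover with a switchover limit: after a couple of failed swaps the system stops ping-ponging, safes itself, and waits for the ground.

```python
# Generic sketch of bounded automatic failover: try the redundant units, but
# stop after a limit and wait for ground commanding rather than looping.
# Unit names and the limit are illustrative assumptions.

MAX_AUTO_SWITCHOVERS = 2

def run_with_failover(units, is_healthy):
    """Return which unit we ended up on, or a safe-mode verdict."""
    switchovers = 0
    for unit in units:
        if is_healthy(unit):
            return f"running on {unit}"
        switchovers += 1
        if switchovers >= MAX_AUTO_SWITCHOVERS:
            break   # stop churning through hardware; let humans decide
    return "safe mode: awaiting ground command"

if __name__ == "__main__":
    health = {"PCU-A": False, "PCU-B": True}
    print(run_with_failover(["PCU-A", "PCU-B"], lambda u: health[u]))
```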
Automatic switching can produce results like Ariane 5, where the backup failed before the main unit, so it executed a “failover” to a control unit that was already on the fritz from bad data.
That rocket failed due to penny-pinching. The ESA didn't do an all-up ground simulation of all the computers and electronics. The first Ariane 5 had a module carried over from the Ariane 4, but completely unused in the 5. The module monitored the stability of the rocket as it sat on the launch pad, with the purpose of delaying engine ignition unless the rocket was standing perfectly vertical.
In the 5 they had apparently fixed the issue of the rocket wobbling a bit on the pad from wind, or the guidance software was updated to handle launching a degree or so off vertical.
For whatever reason, that module wasn't needed, but it was left powered up and running, connected to the data bus, with its inertial sensors operating.
If the module had been set up to shut down the instant the engines lit, it wouldn’t have been a problem. But to avoid the inconvenience of rebooting the module in the event of a literal last second hold of launch, the module was programmed to keep running for several seconds after liftoff.
The Ariane 4 was still going vertical by the time that module shut down.
The Ariane 5 enters its gravity turn *with that module still running*.
So there's this bit of hardware with one job: monitor how vertical the rocket is. What does it do when the rocket goes very much non-vertical? “Ahhhhhh! Falling over!” Its sensors were quickly overwhelmed and began outputting garbage data, which overflowed buffers and then vomited onto the data bus. That hit the backup systems, crashing them, then the main systems, which attempted to switch to the already-dead backup (a generic sketch of this class of overflow follows below).
The rocket then does its best impression of a Catherine wheel before ground control hits the self-destruct.
All because someone was too cheap to simulate “What if we leave this part plugged in and active for Ariane 5, as it was in all the Ariane 4 flights, instead of plugging in dummy data from it?”.
However much money that simulation run would have cost would have been far less than losing the payload.
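For the record, the inquiry board traced the actual fault to an unhandled overflow while converting a 64-bit floating-point value in that alignment module to a 16-bit signed integer. The sketch mentioned above shows that general class of bug; the values are made up, and this is obviously not the Ada flight code.

```python
# Generic illustration of the conversion-overflow class of bug: a 64-bit float
# stuffed into a 16-bit signed integer without a range check. Values are
# made up; this is not the Ariane flight software, which was written in Ada.

import struct

def to_int16_unprotected(value: float) -> int:
    """Convert without a range check; out-of-range values raise an error,
    analogous to the unhandled operand error on Flight 501."""
    return struct.unpack("<h", struct.pack("<h", int(value)))[0]

def to_int16_saturating(value: float) -> int:
    """A defensive alternative: clamp to the representable range."""
    return max(-32768, min(32767, int(value)))

if __name__ == "__main__":
    horizontal_bias = 1.0e6            # hypothetical out-of-range sensor value
    print(to_int16_saturating(horizontal_bias))
    try:
        to_int16_unprotected(horizontal_bias)
    except struct.error as exc:
        print("unhandled conversion error:", exc)
```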
Good stuff to see. The trend of successful missions can only be cause for rejoicing. I am sure the techs have done a pretty good job.