Manufacturing Problem

A factory is a machine. It takes a fixed set of inputs – circuit boards, plastic enclosures, optimism – and produces a fixed set of outputs in the form of assembled products. Sometimes it is comprised of real machines (see any recent video of a Tesla assembly line) but more often it’s a mixture of mechanical machines and meaty humans working together. Regardless of the exact balance the factory machine is conceived of by a production engineer and goes through the same design, iteration, polish cycle that the rest of the product does (in this sense product development is somewhat fractal). Last year [Michael Ossmann] had a surprise production problem which is both a chilling tale of a nasty hardware bug and a great reminder of how fragile manufacturing can be. It’s a natural fit for this year’s theme of going to production.

Surprise VCC glitching causing CPU reset

The saga begins with [Michael] receiving an urgent message from the factory that an existing product which had been in production for years was failing at such a high rate that they had stopped the production line. There are few worse notes to get from a factory! The issue was apparently “failure to program” and Great Scott Gadgets immediately requested samples from their manufacturer to debug. What follows is a carefully described and very educational debug session from hell, involving reverse engineering ROMs, probing errant voltage rails, and large sample sizes. [Michael] doesn’t give us a sense for how long it took to isolate but given how minute the root cause was we’d bet that it was a long, long time.

The post stands alone as an exemplar for debugging nasty hardware glitches, but we’d like to call attention to the second root cause buried near the end of the post. What stopped the manufacturer wasn’t the hardware problem so much as a process issue which had been exposed. It turned out the bug had always been reproducible in about 3% of units but the factory had never mentioned it. Why? We’d suspect that [Michael]’s guess is correct. The operators who happened to perform the failing step had discovered a workaround years ago and transparently smoothed the failure over. Then there was a staff change and the new operator started flagging the failure instead of fixing it. Arguably this is what should have been happening the entire time, but in this one tiny corner of the process the manufacturing process had been slightly deviated from. For a little more color check out episode #440.2 of the Amp Hour to hear [Chris Gammell] talk about it with [Michael]. It’s a good reminder that a product is only as reliable as the process that builds it, and that process isn’t always as reliable as it seems.