Preventing Embedded Fails With Watchdogs

January 12, 2019

Watchdog timers are an often-overlooked feature of microcontrollers. They function as failsafes that reset the device after a software failure. If your code somehow ends up in an infinite loop and stops servicing the watchdog, the watchdog will time out and force a reset. This is a necessity for safety-critical devices. If the firmware in a pacemaker or an aircraft’s avionics system gets stuck, it isn’t going to end well.

In this oldie-but-goodie, [Jack Ganssle] provides us with a great write-up on watchdog timers. It tells the story of the failed Clementine spacecraft mission, which could have been saved by a watchdog, and elaborates on the design and implementation of watchdog techniques.

If you’re designing a device that needs to be able to handle unexpected failures, this article is definitely worth a read. [Jack] explains many of the pitfalls of using these devices, including why internal watchdogs can’t always be trusted and what features make for a great watchdog.

Thanks to [Jan] for the tip!

24 thoughts on “Preventing Embedded Fails With Watchdogs”

Hirudinea says:

January 12, 2019 at 4:38 pm

“… including why internal watchdogs can’t always be trusted …”, Quis custodiet ipsos custodes?

1. Clemens says:
  
  January 12, 2019 at 7:39 pm
  
  I was thinking about this too. I like the “two controllers watching each other” approach discussed in the blog post… but what if they both fail at the same time for a shared reason?
  
  1. dexdrako says:
    
    January 12, 2019 at 10:45 pm
    
    Then there are bigger problems afoot than a simple reset can fix.
    
  2. Ciprian says:
    
    January 13, 2019 at 12:38 pm
    
    Use two different controllers so the blocking conditions won’t be the same for both.
    But a redundant system should have at least 3 devices, so use 3 different controllers to watch each other.
    
    1. Robert Mateja says:
      
      January 14, 2019 at 7:10 am
      
      What if that fails?
      
      https://mfas3.s3.amazonaws.com/styles/grid-3_thumbnail_retina/s3/MinorityReportSFW_0.jpg
      
2. jafinch78 says:
  
  January 12, 2019 at 10:51 pm
  
  Reminds me of logic and automation, and of how some people dislike both, even when redundancy is added by creating human oversight and review positions. Those positions are still required for maintaining the automated systems at the least, and they prevent human data-entry, transaction and other determinate errors found when investigating the system. It is the challenge of moving from paper-based or even hearsay systems to automated transactions in the simpleton world. Some can’t afford the redundancy, some can; some like indeterminate or undetermined errors and hate the CAPA investigator, even with creative fail-safe mechanisms that keep everyone from looking bad.
  
Thinkerer says:

January 12, 2019 at 4:53 pm

The Boeing 787 has to be powered down then re-powered every 248 days or it crashes, both literally and figuratively because of what may be a simple integer overflow. Where is your watchdog now?

https://www.i-programmer.info/news/149-security/8548-reboot-your-dreamliner-every-248-days-to-avoid-integer-overflow.html

1. paul says:
  
  January 13, 2019 at 4:26 am
  
  Overflows in timers are very easily corrected when calculating time intervals by subtracting 2 timestamps.
  
  You can easily confirm this with pen and paper. Take as an example a 4-bit integer, and draw the numbers in a circle, just like a clock. Then calculate some intervals by subtracting timestamps, and you will see that the interval is calculated correctly even when the integer overflows between the timestamps.
  Because of this I almost always use 16-bit timer variables: they are big enough to handle the time intervals I’m interested in, and they use less space and execution time.
  
  I’m not sure where I learned this. It might well have been one of Jack Ganssle’s articles. His PDF on debouncing, with scope pictures of different kinds of switches, is also a very informative read.
  
  1. Gérard says:
    
    January 13, 2019 at 9:07 am
    
    Not sure I understand:
    T1:      0 -> 7 -> 15
    T2:      3 -> 10 -> 2
    T2 - T1: 3 -> 3 -> 13
    
    Even if there is a distance of 3 on your circle, the last delta is wrong because only one timestamp (T2) has overflowed and not the other (T1), i.e. you send something at T1 and you get an answer at T2.
    So it’s not always wrong, but not always right.
    
    1. MJ says:
      
      January 13, 2019 at 2:31 pm
      
      Paul is actually correct, as long as you treat all your timestamps as unsigned numbers, always compare ‘durations’ (i.e. the difference between two timestamps), and don’t try to compare instants directly, e.g. if (t1 > t0). This is due to the modular arithmetic of unsigned integers.
      
      Example (stolen from https://arduino.stackexchange.com/questions/12587/how-can-i-handle-the-millis-rollover ) :
      
      So let’s say previousMillis is 4,294,967,290 (6 ms before the counter wraps to 0), and currentMillis is 10 (10 ms after rollover). Then currentMillis - previousMillis is actually 16 (not -4,294,967,280), since the result is calculated as an unsigned long, which can’t be negative and so rolls around itself. You can check this simply with:
      
      Serial.println( ( unsigned long ) ( 10 - 4294967290 ) ); // 16
      
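The wraparound arithmetic argued over in this thread can be checked in a few lines of C. This is only a minimal sketch, not code from any of the commenters: the function names `elapsed16`, `elapsed4` and `timed_out` are made up for illustration, and the 4-bit variant exists purely to mirror the pen-and-paper example.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed ticks between two readings of a free-running 16-bit counter.
 * The cast back to uint16_t matters: C promotes both operands before
 * subtracting, so the raw difference can come out negative. */
uint16_t elapsed16(uint16_t now, uint16_t then)
{
    return (uint16_t)(now - then);
}

/* The same idea for a hypothetical 4-bit counter (values 0..15):
 * wrap the difference to 8 bits, then keep only the low 4 bits. */
uint8_t elapsed4(uint8_t now, uint8_t then)
{
    return (uint8_t)((uint8_t)(now - then) & 0x0Fu);
}

/* Timeout test written on durations, never on raw timestamps.
 * Only valid while the true interval is shorter than one full
 * wrap of the counter (65536 ticks here). */
int timed_out(uint16_t now, uint16_t start, uint16_t timeout)
{
    return elapsed16(now, start) >= timeout;
}
```

Feeding in the third column of the 4-bit table, `elapsed4(2, 15)` gives 3 rather than 13, and `elapsed16(10, 65530)` gives 16. The trick only holds while the real interval is shorter than one full wrap of the counter.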
Anon says:

January 13, 2019 at 3:12 am

Nice read indeed.
It is worth noting that, in my experience, radiation-induced single-event effects (SEEs) can interfere with the Device Service Unit (or whatever the manufacturer calls the unit that handles the external RESET), leaving the system unrecoverable even with an external watchdog connected to the reset line.
To overcome this, instead of issuing a reset, the watchdog can drive a MOSFET in the power line of the μC, effectively power cycling it on timeout. Combined with a bit of additional circuitry, this can also be used for latchup protection.

IanS says:

January 13, 2019 at 4:41 am

A career in electronic security brought a lot of experience with watchdog timers. This kind of equipment is intended to operate for long periods without attention.

My favourite was a simple oscillator constructed from a Schmitt inverter, resistor and capacitor that would free-run unless it was hit periodically with a pulse from an AC-coupled output pin. The output of the inverter was connected to the processor reset pin and also provided the initial start-up reset. This was almost unbreakable so long as the timeout was set to allow the crystal to start oscillating (in my experience this can take tens of milliseconds on microprocessors).

When the processor was reset it would generate (say) 16 pulses to ensure the watchdog was fully awake, then continue with program initialisation. The main software loop generated a single pulse per cycle to kick the watchdog* and keep it awake. It is important that this kick does not come from an interrupt handler, as interrupts have been known to keep running even when the processor has run wild.

If a particular process has a long duration such as reading a serial EEPROM then the watchdog can be kicked periodically within the process, but this must be an exceptional condition. Alternatively the processor can reset itself by just suppressing the output pulses.

Thousands of these were in service and some are still active; I have seen at least two of my designs incorporating this class of watchdog during the last six months.

[*] No animals were harmed in the performance of this function.
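The kick pattern described above can be modelled on a desktop machine. What follows is a rough sketch under stated assumptions, not the original firmware: the watchdog is a software model rather than a Schmitt-inverter oscillator, the 100 ms timeout is an invented value, and every name (`watchdog_t`, `wdt_kick`, `firmware_run` and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed timeout for the model; the comment above gives no figure. */
#define WDT_TIMEOUT_MS 100u

typedef struct {
    uint32_t remaining_ms;  /* time left before reset is asserted */
    uint32_t kicks;         /* pulses accepted, for inspection only */
    bool     reset;         /* latched once the timer has expired */
} watchdog_t;

void wdt_init(watchdog_t *w)
{
    w->remaining_ms = WDT_TIMEOUT_MS;
    w->kicks = 0;
    w->reset = false;
}

/* One pulse on the (hypothetical) kick pin restarts the countdown. */
void wdt_kick(watchdog_t *w)
{
    if (!w->reset) {
        w->remaining_ms = WDT_TIMEOUT_MS;
        w->kicks++;
    }
}

/* Advance simulated time; the watchdog runs regardless of the firmware. */
void wdt_advance(watchdog_t *w, uint32_t ms)
{
    if (w->reset) {
        return;
    }
    if (ms >= w->remaining_ms) {
        w->remaining_ms = 0;
        w->reset = true;
    } else {
        w->remaining_ms -= ms;
    }
}

/* Start-up burst of 16 pulses before initialisation continues. */
void firmware_startup(watchdog_t *w)
{
    for (int i = 0; i < 16; i++) {
        wdt_kick(w);
    }
}

/* Run `cycles` passes of the main loop, each taking `cycle_ms`.
 * The only kick is here, once per pass, never in an interrupt handler.
 * Returns true if the watchdog has reset the system. */
bool firmware_run(watchdog_t *w, uint32_t cycles, uint32_t cycle_ms)
{
    for (uint32_t i = 0; i < cycles && !w->reset; i++) {
        wdt_advance(w, cycle_ms);  /* the work done in this pass */
        wdt_kick(w);               /* single pulse per cycle */
    }
    return w->reset;
}
```

With these numbers, fifty 10 ms passes keep the model watchdog quiet, while a single pass that takes 250 ms lets it expire. Because the kick lives only in the main loop, a hung loop cannot be masked by an interrupt that is still firing.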

1. Tegwyn☠Twmfatt says:
  
  January 13, 2019 at 9:06 am
  
  schmitt inverter ….. read ‘555 timer’ ?????
  
  1. John says:
    
    January 13, 2019 at 12:27 pm
    
    A single transistor, capacitor, resistor and diode is a lot less prone to failure.
    
  2. IanS says:
    
    January 16, 2019 at 12:07 am
    
    > … ‘555 timer’ …
    
    1. IanS says:
      
      January 16, 2019 at 12:09 am
      
      That should read:
      
      > … ‘555 timer’ …
      
      Shudder!
      
2. enixon says:
  
  January 13, 2019 at 7:28 pm
  
    Was there a name for these circuits? Brand name or otherwise?
  
  1. IanS says:
    
    January 16, 2019 at 12:11 am
    
    Which circuits? The watchdog, the equipment it was installed in or something else?
    
    1. enixon says:
      
      January 18, 2019 at 11:57 am
      
      You said thousands of these were in service and some are still active, in reference to the watchdog circuit. I thought maybe it was a component or development board version I could order and study.
      
      1. IanS says:
        
        January 21, 2019 at 3:17 am
        
        They were integrated into the designs of various obsolete pieces of electronic security equipment, mainly alarm control panels. Later processors incorporated watchdog timers so the external circuit was no longer required. If I can find a circuit I’ll put it up here.
        
Murray says:

January 13, 2019 at 12:51 pm

Something I read about rather than experienced myself was a system where a continuously resetting watchdog caused the CPU-active LED to appear to be flashing as normal. My solution: at startup, keep the CPU LED on solidly for a second or two before entering the ever-popular once-a-second flash. Indicator philosophy can be interesting, with ideas like the dark cockpit etc. To enforce my own consistency I put my philosophy on the modern version of paper, a web page, found here: http://opend.co.za/tutorials/indicator_concepts/index.html
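One way to picture that indicator scheme is as a pure function of the time since reset. This is just a sketch under assumptions: the 2-second solid period and 50% duty cycle are guesses, and `cpu_led_on` is a made-up name rather than anything from the linked page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LED_SOLID_MS   2000u  /* assumed: solid on for the first 2 s after reset */
#define LED_PERIOD_MS  1000u  /* once-a-second heartbeat afterwards */

/* Desired LED state as a function of milliseconds since the last reset. */
bool cpu_led_on(uint32_t ms_since_reset)
{
    if (ms_since_reset < LED_SOLID_MS) {
        return true;  /* start-up signature: a long, solid "on" */
    }
    /* Heartbeat starts with the off half so the solid period stands out. */
    uint32_t phase = (ms_since_reset - LED_SOLID_MS) % LED_PERIOD_MS;
    return phase >= (LED_PERIOD_MS / 2u);
}
```

If a watchdog keeps resetting the processor faster than the solid period, the time since reset never gets past 2000 ms in this model, so the LED stays lit instead of imitating a healthy heartbeat.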

Austin Denver Morgan says:

January 13, 2019 at 10:28 pm

His mailing list has been discussing watchdogs for the last month; see http://www.ganssle.com/tem/tem364.html and http://www.ganssle.com/tem/tem365.html

John Honniball says:

January 14, 2019 at 6:25 am

An actual watchdog timer going off in a station sign board (Bristol Parkway station):

https://www.flickr.com/photos/anachrocomputer/25505384655

1. Ren says:
  
  January 15, 2019 at 6:45 pm
  
  I can just picture people in the station looking around nervously for a watch dog.
  

Hackaday
