We are somewhat spoiled because electronics today are very reliable compared to even a few decades ago. Most modern electronics obey the bathtub curve. If they don’t fail right away, they won’t fail for a very long time, in all likelihood. However, there are a few cases where that’s not a good enough answer. One is when something really important is at stake, such as the control systems of an airplane. The other is when you are in an environment that might cause failures. In some cases, near a nuclear reactor or in space, for example, you are often dealing with both problems at once. In this installment of Circuit VR, I’ll show you a few common ways to make digital logic circuits more robust, with some examples you can run in the Falstad simulator in your browser.
The most common way to deal with this problem is redundancy. The Space Shuttle, for example, had four computers and at least three had to agree. There was also a fifth computer that could replace a failed computer. The idea is that if one failed, everything would continue to work. The number of computers to use in such a scheme is often a topic of hot debate since with an even number like four, two failures create a dilemma: the vote is tied and you can’t know what the right answer is. Many systems use TMR (triple modular redundancy), which works just as well in the face of one failure. With two failures it will be wrong, but at least the result is unambiguous.
Voting Logic in a Circuit
Consider the circuit shown above. The output should be the vote of the three flip flops. The switch to the right lets you inject a failure by selecting the inverted output on that one stage, but since the output is the majority of the three flip flops, that switch should have no bearing on the output.
I used T flip flops here, but really they could represent any kind of logic no matter how complex. The key is we expect the same output from them. In fact, there’s no reason each one has to get the answer the exact same way, although that is unusual — usually, the circuits are identical. I drew one clock, but in real life, each would probably have its own private clock, since one clock is a single point of failure.
The logic only uses NAND gates, but if you consider the flip flop outputs as A, B, and C, it is easier to think of the logic as (A&B)|(B&C)|(A&C). If you apply De Morgan’s theorem, you can convert that expression to the NAND gates as shown; the two forms are equivalent.
In English, then, you are saying if any two inputs are high, then the output is high. You don’t need to check A&B&C nor do you need to explicitly test cases where any input is low.
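If you want to convince yourself of the De Morgan equivalence without the simulator, here is a minimal Python sketch. It assumes the NAND version is three 2-input NANDs feeding one 3-input NAND, which is the usual arrangement but is my reading of the expression, not a copy of the schematic. The function names are made up for illustration.

```python
from itertools import product

def nand(*inputs):
    """NAND gate: output is 0 only when every input is 1."""
    return 0 if all(inputs) else 1

def majority(a, b, c):
    """AND/OR form of the voter: (A&B)|(B&C)|(A&C)."""
    return (a & b) | (b & c) | (a & c)

def majority_nand(a, b, c):
    """Same voter after De Morgan: NAND of the three pairwise NANDs.
    Assumed gate arrangement; the real schematic may be drawn differently."""
    return nand(nand(a, b), nand(b, c), nand(a, c))

def truth_table():
    """All eight input combinations with the voted output."""
    return [(a, b, c, majority(a, b, c))
            for a, b, c in product((0, 1), repeat=3)]

if __name__ == "__main__":
    for a, b, c, out in truth_table():
        # The two forms should agree on every row.
        assert out == majority_nand(a, b, c)
        print(a, b, c, "->", out)
```

Every row with two or more ones comes out high, and the all-ones row is covered without ever testing A&B&C explicitly.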
If you try the circuit out, you can see the output will change, and flipping the switch won’t matter. If you break one of the inputs and ground it, for example, then you can see a failure when the switch is thrown because you’ll have two faults instead of one. Even then, the short to ground will be right half the time and the circuit will work. The voter doesn’t care how you got the right answer.
Key Ideas: Eventually You Have to Trust Something
If you think about it, you are implicitly trusting the NAND gates in this implementation. That could be a reasonable thing to do. You might implement the logic using some very robust method. Keep in mind that two signals merged with diodes and a pull-up resistor can perform the NAND function if you throw in a single transistor. You could even do the logic (or, at least, the inverter) using relays or using chips that are not as susceptible to failure as other more complex parts. You could also implement the AND/OR version of the circuit pretty easily with diodes only since there is no inversion involved.
You can also replicate the voting logic. This was done in the Saturn rocket’s computer, for example. However, you always end up at some last gate that is a possible single point of failure, and you want to make that part as reliable as you can.
Eventually, though, you have to trust something. Usually, you will also decide how many failures you are willing to accept. For example, many systems are made to be two-fault tolerant. That is, any two faults will not cause improper operation. The voting we are looking at is one-fault tolerant: two flip flops failing at the same time can cause the module to produce the wrong answer.
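To make the one-fault claim concrete, here is a rough Python sketch that tries every combination of failed copies. It assumes the worst case, where a failed copy outputs the exact opposite of the right answer, and it also models the grounded-input experiment from earlier. The helper names are hypothetical.

```python
from itertools import combinations

def majority(a, b, c):
    """Two-out-of-three voter."""
    return (a & b) | (b & c) | (a & c)

def vote_with_faults(good, faulty):
    """Vote three copies of the bit `good`.
    Copies whose index is in `faulty` output the inverted bit (assumed worst case)."""
    outputs = [good ^ 1 if i in faulty else good for i in range(3)]
    return majority(*outputs)

def tolerates(num_faults):
    """True if the vote is correct for every way of inverting `num_faults` copies."""
    return all(
        vote_with_faults(good, set(faulty)) == good
        for good in (0, 1)
        for faulty in combinations(range(3), num_faults)
    )

def vote_grounded_and_inverted(good):
    """Copy 0 shorted to ground, copy 1 inverted, copy 2 healthy."""
    return majority(0, good ^ 1, good)

if __name__ == "__main__":
    for n in range(4):
        print(n, "fault(s) tolerated:", tolerates(n))
    for good in (0, 1):
        print("good =", good, "voted =", vote_grounded_and_inverted(good))
```

With one inverted copy the vote is always right. With two, it is always wrong under this model. The grounded input plus an inverted copy is right only when the correct answer happens to be zero, which is the "right half the time" behavior you see in the simulator.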
Where to Vote
It is tempting to try to make everything use TMR. This is a mistake because it drives up complexity and can actually make things worse. In many cases, you should examine the output of your circuits and vote on the outputs only. In some cases (like the Saturn computer) the logic was split into parts and each part was voted. In the particular case of the Saturn computer, there were seven pipelines (each with three copies). Each copy of a particular pipeline fed a voter and the output of the voter fed the next set of three pipelines.
This also can help with glitches that can easily accumulate if you try to vote everything. To follow with the above example, if one pipeline was slightly slower or faster than the other two, you would get small mismatches. By referencing them to the system clock, you could ensure each pipeline had produced its final answer before using the result of the vote.
Using Multiplexers and Error Detection
Another common way to vote is using a multiplexer. Some FPGAs have multiplexers or can configure them quite easily. Here’s a circuit that not only votes, but uses another mux to detect if there is an error.
This time I included switches to fail any of the flip flops. Two of the inputs select which mux channel to use (doesn’t really matter which two; I used B and C just to make the layout work nicely). If they are both zero that’s I0 and the output is zero. If they are both high, that’s I3 and the output is high.
The I1 and I2 cases are where B and C don’t agree. At that point, A is the tiebreaker. We just feed it through because that’s the bit that has two votes either way.
The top mux detects an error but needs an inverter (yes, in this case, you could have fed the inverted output of the flip flop in if you didn’t want the fail switch in there). If you have I1 or I2 active then you definitely have some kind of failure because B and C don’t agree. In the cases where they do agree, the output is the state of A — sort of. At I0, there’s no error if A is also 0, so the error signal is A. At I3, there’s no error when A is 1, so we invert A and that’s the answer. You could, of course, work out similar logic with the NAND gates, if you like.
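Here is a small Python model of the two multiplexers as described above, checked against the plain majority function. It assumes B and C drive the select lines; since I1 and I2 carry the same signal, it doesn’t matter which one is the high-order select. The function names are my own.

```python
from itertools import product

def mux4(i0, i1, i2, i3, s1, s0):
    """4-to-1 multiplexer: select lines (s1, s0) pick one of I0..I3."""
    return (i0, i1, i2, i3)[(s1 << 1) | s0]

def mux_vote(a, b, c):
    """Voter mux: I0 = 0, I1 = I2 = A (the tiebreaker), I3 = 1."""
    return mux4(0, a, a, 1, b, c)

def mux_error(a, b, c):
    """Error mux: I0 = A, I1 = I2 = 1 (B and C disagree), I3 = NOT A."""
    return mux4(a, 1, 1, a ^ 1, b, c)

def majority(a, b, c):
    """Reference two-out-of-three voter for comparison."""
    return (a & b) | (b & c) | (a & c)

if __name__ == "__main__":
    for a, b, c in product((0, 1), repeat=3):
        # The mux voter should match the gate-level voter on every row.
        assert mux_vote(a, b, c) == majority(a, b, c)
        print(a, b, c, "vote =", mux_vote(a, b, c), "error =", mux_error(a, b, c))
```

The error output is high on every row except 000 and 111, which are the only two cases where all three flip flops agree.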
The FPGA/CPU Connection
It isn’t unusual to do this kind of logic design on FPGAs, and there are several ways it can work. First, some FPGAs build TMR into their flip flops automatically. These are usually known as “space-rated” FPGAs and often have other features such as radiation-resistant circuitry, redundant clock busses, and so on. While these parts are very expensive, they are robust and easy to use since you don’t have to specify voting logic and multiple flip flops in your Verilog or VHDL.
Some synthesis tools can be cued to automatically generate TMR flip flops for some or all flip flops in your design. In other words, you specify a flip flop as normal in your VHDL or Verilog and the tool infers three physical flip flops and votes them for you. Exactly how you do this is tool-dependent, but it is likely to be some kind of option or constraint to the synthesis tool. For example, you can see how this works in Synplify in the video below. Even if you don’t use Synplify, the video has some good background information.
Of course, the final option is to do it yourself like we’ve done here. Depending on what you fear, this might not be the best option. One flip flop failing on an FPGA without radiation or some other specific reason isn’t very likely. And if you need radiation tolerance, just doing the flip flops might not be enough.
Keep in mind, though, that I’m only using a single flip flop as an example. You could use this same idea to vote the outputs of any odd number of sources. You could vote three Arduino outputs, for example. Or you might want to vote three sensor inputs, although if you were only worried about the sensors you could probably do that in software pretty easily, too. Just don’t forget to account for glitches if the outputs are not exactly synchronized.
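As a rough idea of what voting in software might look like, here is a Python sketch with two hypothetical helpers. The first applies the same majority expression bit by bit to three digital words. The second picks the middle of three analog readings, a technique often called mid-value select; that is one common approach I’m assuming here, not the only way to do it.

```python
def vote_word(a, b, c):
    """Bitwise two-out-of-three vote on three integers.
    Each bit position is voted independently, like one voter circuit per bit."""
    return (a & b) | (b & c) | (a & c)

def mid_value_select(x, y, z):
    """Return the middle of three readings, ignoring one outlier high or low.
    Assumed approach for analog sensors; exact-match voting rarely works on noisy data."""
    return sorted((x, y, z))[1]

if __name__ == "__main__":
    # One source disagrees in its upper two bits; the other two outvote it.
    print(bin(vote_word(0b1010, 0b1010, 0b0110)))
    # One temperature sensor has failed high; the middle reading survives.
    print(mid_value_select(20.1, 19.9, 250.0))
```

Neither helper does anything about timing, so you would still need to sample all three sources at a point where they have settled.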
For More Information
Dealing with redundant circuits is a huge topic and I can’t hope to cover everything in a short post. If you want to read more, NASA has quite a few documents that can help you including some discussions of the Shuttle architecture with failure reports and some scholarly papers. If you want to learn more about other kinds of redundancy and why adding redundancy can lower reliability if you aren’t careful, you can check out the Wikipedia article or this page from Rutgers. Specialists spend years studying this kind of design and you still find arguments over what’s the best for any given application. So don’t expect to master this overnight.
Especially since there’s plenty more to consider when making a really robust system. For example, a good design will periodically refresh the outputs so that an error that would correct itself won’t accumulate. Memory probably needs periodic scrubbing and error correction. But these simple circuits are at the heart of some of the most reliable systems we know how to make.