Troubleshooting: A Method For Solving Problems The Right Way

We’ve all experienced that magic moment when, after countless frustrating hours of experimentation and racking our brains, the object of our attention starts working. The 3D printer finally produces good output. The hacked up laptop finally boots. The car engine finally purrs. The question is, do we know why it started working?

This is more important than you might think. Knowing the answer lets you confirm that the core problem was solved; otherwise, you may have just fixed a symptom. And a lack of understanding means that fixing one problem may just create another.

The solution is to adopt a methodical approach to troubleshooting. We’re talking about a structured problem-solving technique that, when used properly, can help us solve a problem at its core without leaving any loose ends. Such a methodology will also leave you knowing why any solution did or didn’t work in the end, and will give you reproducible results.

Understanding The Product or Process

[Image: a complicated car engine. Caption: Oh yeah, let’s just get our wrench and dig right in!]

It is not reasonable to expect that we can effectively repair anything we do not understand. For example, if a car isn’t running quite right, it’ll be pointless to attempt a fix if you don’t understand ignition timing, fuel delivery, and how the engine functions, at least on a basic level.

If you’re trying to build something from scratch or do a significant modification on an existing product, then a solid understanding of what the end product should look like will be needed. If you’re not up to speed on these things, it might be time to do a deep dive, going down the rabbit hole as they say, on the subject at hand.

Wikipedia, technical articles, forums, social media communities (groups.io, Facebook groups, Reddit, etc.), and of course Hackaday are all possible resources for learning. Once you understand how a system is supposed to work, you can begin the next step of troubleshooting: the process of elimination.

Process Of Elimination

Now that you’re as expert as you’re going to be on your subject of choice, it’s time to dive in and see what’s wrong. Armed with our clear vision of what a successful process looks like and the process of elimination, we’re going to investigate. Yep, we’re going full Sherlock Holmes! If you grew up playing a certain trademarked and copyrighted board game, then you’ll have a clue how this works.

The goal is simple: Identify all the steps that make up a successful process, and then check them one by one, starting with the first step — even if (especially if!) you think you know where the problem already is. One step at a time. Not multiples, and definitely not all of them. No skipping ahead. Just one. And then, after you check on just one of the items at a time, you make a note: Did it solve the problem? Yes? No? Not sure? That’s fine.

At each stage, record your results and then move on to the next item on your list. Recording the results is vital to the process. And sometimes we don’t have all the facts until later in the investigation. So being able to review notes will help us spot trends we’d never have noticed otherwise.

Even if you think you solved the problem, keep going through your list. Go through the entire system to make sure that the whole thing works the way it should. Troubleshooting is incomplete if you only look at a portion of the process.
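As a rough illustration, the check-one-thing-then-write-it-down loop can be sketched in a few lines of Python. This is only a minimal sketch: the `Step` class, the `run_checks` function, and the three example checks are hypothetical stand-ins, and a real check would be a measurement or a test rather than a hard-coded answer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    check: Callable[[], bool]  # returns True when this step behaves as expected


def run_checks(steps: List[Step]) -> List[Tuple[int, str, str]]:
    """Check every step in order, one at a time, and record each result."""
    log = []
    for number, step in enumerate(steps, start=1):
        ok = step.check()
        # Write the result down before moving on, whether it passed or not.
        log.append((number, step.name, "ok" if ok else "FAULT"))
    return log


# Hypothetical checklist for a car that is not running right.
steps = [
    Step("battery voltage", lambda: True),
    Step("fuel delivery", lambda: False),
    Step("ignition timing", lambda: True),
]

log = run_checks(steps)
for entry in log:
    print(entry)
```

Note that the loop does not stop at the first fault. It walks the whole list, so the notes cover the entire system rather than just the first thing that looked wrong.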

In more complex systems, a tiered approach will be very useful. Start with a high level overview of the system at hand. Step through the process until you find something broken. Once you isolate the problem area, restart the troubleshooting process in that problem area, making notes as you go. If you do fix a problem, then go back up to the first tier of troubleshooting and continue until the process is completed. This will help you answer the next question.
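The tiered approach might look something like the sketch below. It assumes the system can be described as a tree of parts, each with its own pass/fail check. The `Part` class, the `examine` function, and the example car are all hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Part:
    name: str
    works: Callable[[], bool]  # True when this part behaves as expected
    subparts: List["Part"] = field(default_factory=list)


def examine(part, path=(), notes=None):
    """Check one part, note the result, and only dig deeper if it is faulty."""
    notes = [] if notes is None else notes
    here = path + (part.name,)
    ok = part.works()
    notes.append((" > ".join(here), "ok" if ok else "FAULT"))
    if not ok:
        # Restart the same process one tier down, inside the problem area.
        for sub in part.subparts:
            examine(sub, here, notes)
    return notes


# Hypothetical system: the fault sits in the fuel filter.
car = Part("car", lambda: False, [
    Part("fuel system", lambda: False, [
        Part("fuel pump", lambda: True),
        Part("fuel filter", lambda: False),
    ]),
    Part("ignition", lambda: True),
])

notes = examine(car)
for note in notes:
    print(note)
```

After the fuel system has been dug into, the walk returns to the top tier and still checks the ignition, which mirrors going back up to the first tier and continuing until the process is complete.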

Is It Really Fixed?

If you’ve identified and solved what you believe to be the core problem, then it’s time to verify that the fix is effective. The best way to do this is to put your item or process under test in the same way that it’ll be used. Sometimes this is simple: The Thing works, and it’s fixed; there’s very little in between to be had. Other times, more extensive testing is needed. Imagine fixing a car that won’t start, handing the keys to its owner, and then finding out the hard way that it also had no brakes!

So it may be that you need to go for a ‘test drive’ so to speak. The goal should be to verify that the entire system works as designed, and that you didn’t solve one problem only to create two more.

When Good Troubleshooting Goes Bad

Just like any method or process, it’s quite possible to think we’ve got it right when we don’t. Troubleshooting is no different. If we skip ahead in the process at all or don’t take notes along the way, it’ll be pretty easy to miss the problem. Similarly, if we don’t fully understand the subject, we might not be able to identify when something doesn’t look quite right.

[Image: probing a circuit board with a multimeter]

On the other hand, maybe we’re taking a project over from somebody else and they’ve told us what’s wrong with it, but admit that they don’t know how to fix it. This raises a thorny question: If they don’t know how to fix it, then how can they be sure what’s wrong with it to begin with? Take the incoming info with a grain of salt and verify for yourself what the problem is before you start looking for a solution.

When I was young, I heard the woes of backyard mechanics who lamented that they’d replaced hundreds of dollars worth of parts, but the issue they were experiencing went unsolved. I distinctly recall them blaming the fancy new “electronic stuff” (fuel injection) for their problems. The reality is that they didn’t understand the system they were working with and therefore could not troubleshoot it effectively, and so they stopped analyzing the problem and just reacted to it by throwing parts at it until something hopefully worked. And this is another way to fail at the troubleshooting process.

Alternatives to Troubleshooting

There are some instances when the exact troubleshooting process can be overkill. To lean on the car analogy again, imagine that you have an older fuel injected vehicle that isn’t running right. It may be that due to the age of the overall system, no single thing will really solve the problem. Years of crud, poor connections, and worn out parts all contribute to a vague problem that is difficult to reproduce. Furthermore we don’t know when the system was last serviced. In such cases, taking the shotgun approach may be needed. And no, I’m not talking about taking it out back and shooting it!

The aforementioned troubleshooting method could be likened to a high-precision rifle: We aim carefully and apply a fix. The shotgun method is the exact opposite: We aim in the general direction of the problem and fire multiple projectiles, hoping that one of them hits its mark and solves the problem.

In our ailing EFI example, it might not be unreasonable to replace all of the sensors and any broken connectors. This would be followed by rebuilding the mechanical portion such as the throttle body. Even replacing fuel pumps, filters, and cleaning the fuel delivery system with a fuel additive can be helpful. And then once the system has been brought back to a known state, it can be tested and any remaining faults can be scrutinized using the proper troubleshooting technique.

Another use case for the shotgun approach is when we have a time sensitive issue that needs to be fixed. The root problem may have only a few known causes, so applying all of the fixes at once may be faster in some cases. For example we might not have time to properly troubleshoot a mission critical server with an unknown hardware problem. Swapping the storage system into a new computer will get it back online quickly, and then the previous hardware can be subjected to testing without such time constraints.

No matter the case though, having a solid understanding of the system you’re working on will help you to take the correct approach to solving the problem.

A Noteworthy Note

You might have noticed that the troubleshooting methods discussed are mighty similar to the scientific method that most of us learned in school. And that’s why taking notes is so important.

Adam Savage famously quipped “Remember kids, the only difference between screwing around and science is writing it down!” (This was later attributed to Alex Jason.) And that’s really the point here: Writing things down, making notes about things whether they work or not is a vitally important part of this entire process. Otherwise, we’re just blindly stabbing into the darkness.

I hope that this foray into fixing fiddly things has been useful for you. Do you have your own troubleshooting story, method, or “Aha!” moment to share? Be sure to let us know in the comments below!

48 thoughts on “Troubleshooting: A Method For Solving Problems The Right Way”

  1. This sounds like my logic. However I’ll add one last bit: know whether the troubleshoot is worth it, might be equally important. Time management is another vector that is easy to overlook. I usually try to weigh out the passion:expense ratio. It may be cheaper on time and money to simply replace with a new system or a working copy of the original, but if the passion is high enough there can be exceptions.

    1. I use this approach with code in a large system (mostly APIs since that’s what I do for a living). It’s often easier/safer to violate the DRY principle than to try and fix a finicky system and worry about breaking something else.

      1. Not to mention that sometimes DRY is not actually DRY. If process A shares 80% of the steps with process B, but they are only coincidentally so similar, it’s far better to duplicate the step definitions than try to shoehorn special cases in once they drift apart. There’s a balance to be struck, but over-DRY is not a good place to be.

    2. Exactly. That storage server example means that I would swap to different hardware, and if the problem didn’t reoccur and it’s too risky to re-cycle the server back into production (because the cause might re-occur), or too expensive in man-power to do so, then it’s best to just dispose of it. Let a hobbyist troubleshoot what went wrong and play with the big boy gear for once.

  2. In troubleshooting electronic systems I was taught a technique by the US Air Force. We called it sectionalization. The idea is you locate a midway point in the circuit and test for proper signal there. If the signal is good you just eliminated the entire first half, if the signal is bad you just determined which side of your test the problem is on. You continue to cut the half circuit where the fault is by half until you localize the problem. This test was particularly helpful in international communications circuits that went through many stations on the way to their destination but also applies well to complex circuit paths within equipment as well. It eliminates a lot of testing time because the traditional method would take you step by step from the input only to find the problem way at the output end of a circuit.

    1. ” if the signal is bad you just determined which side of your test the problem is on.”

      Not really, because the load can distort the source, so you still don’t know which side of the circuit is causing it.

      All in all it sounds like bad advice that just happens to work in certain cases, so it is taught as gospel.

      1. I was thinking the same thing, though it will work often enough to be a very valid method to use, simple to try, unlikely to cause further harm, and if it doesn’t work you do know it didn’t work.

        Plus in the case of most user troubleshooting you are not dealing with the IC’s and on PCB errors to fix, just the runs between them (as a failed board would generally be considered a spare part job – repairable or not finding flaws in PCB is much more tricky, even more so now such multilayered boards are common), which is where this method should work pretty reliably.

        1. If the method is supposed to save time, then it’s a hit and miss. You may isolate the error to a section of the circuit, or you may run yourself off to a wild goose chase because you keep looking for the problem in the wrong places thinking it must be there because the method says so.

          The first part is correct – the method shows where the problem is not.
          The second part is incorrect – the method doesn’t show where the problem is.

          1. If your test equipment is causing odd interactions and false positives it will be pretty obvious relatively quickly – as the whole point is to eliminate large sections quickly so if it leads to a place that the error isn’t it got you there fast, as long as you are aware the test equipment can cause buggering up you then know you need a new method.

      2. On the other hand, sometimes it doesn’t distort the signal, e.g. in digital electronics, mechanics, or in software. So it’s a very helpful tool in some situations, but knowing when it is affecting the process (some analogue systems, software race conditions) is also important.

      3. Well it depends on the system and the test, doesn’t it? Sectionalizing is a useful technique in all cases when you understand the limits of your test and the behavior of the system. If you don’t, you were never going to be able to competently troubleshoot it in any case. Calling it bad advice is just wrong.

    2. Also reminds me of the joke about a cable company maintenance engineer who went to a shooting range.

      After ten rounds missing the target, he reached his finger over the end of the barrel and pulled the trigger. His conclusion: the transmitter is sending just fine, the problem must be at the receiving end.

      1. Still depends on the system and your experience with it – if you know that every failure of this system has been at these points historically that is the obvious step one through -n, even if they are the first/last bits in the chain.
        And if you know this is a system that will be hard to access near the middle eliminate the ends you can get at easily.

        The important part is to either already understand how it works and common flaws so you know where is most likely to start, or approach it in whatever systematic elimination methods is most convenient – which in a built system of any sort is often dictated by accessibility.

  3. Ref: Pirsig, _Zen_and_the_Art_of_Motorcycle_Maintenance_

    This is a concept I, and probably most people, need to be reminded of every so often. Especially when working outside of specialty.

      1. Ya. Even Pirsig admitted that the philosophy was a bit sketchy. It was the 70’s, though.

        It was the first pop book to really address the ways of looking at a system, though, and I still use it with engineering students, dated as it is, as it is an approachable story with solid points on several levels.

  4. If you are working on troubleshooting a device or process with another individual, make sure both sides know the “entire thought process.”
    Being remote hands or working with another technician and assuming or blindly following directions is a good way to make more work for everyone.

    Case in point: when I was younger I was troubleshooting an Air-to-Ground radio at an airport with a local communications company.
    The equipment was entirely supported by the remote technicians on the other side of the country, and I just needed to verify which module was broken so a replacement could be sent out.

    After working through the entire system and not isolating the fault, the technician wanted to reboot the shelf to try to see if the digital interface was just hung up or frozen, and right after he asked me to flip the power switch off he quickly said… “WAIT! Don’t turn the radio off yet!”

    You see, when you have a power supply that has been idling for near two decades, sometimes the capacitors can degrade. This platform just happened to have a selftest, and when the very old power supplies were shut down and restarted, the degraded components would cause the selftest to fail, and the radio would never turn back on.
    It would just sit there sadly blinking a red “Fault” LED until you replaced the power supply.

    And then two modules were mailed out for me to replace instead of just one.

    If I had known what he was trying to scrutinize when troubleshooting, I could have unplugged just the digital module to reset it and not shut down the entire shelf, and being less experienced I had not thought to question whether restarting a complex system might have other consequences.

  5. “.The reality is that they didn’t understand the system they were working with and therefore could not troubleshoot it effectively, and so they stopped analyzing the problem and just reacted to it by throwing parts at it until something hopefully worked.”

    Th1

      1. Most automotive mechanics, actually. On YouTube, I follow South Main Auto and Watch Wes Work. These guys take the time to fully understand, test, and verify weird faults that would make the rest of us scratch our heads and just throw expensive parts at the car. (Or more likely, sell it and just buy a new one.)

  6. Damned cat!

    “.The reality is that they didn’t understand the system they were working with and therefore could not troubleshoot it effectively, and so they stopped analyzing the problem and just reacted to it by throwing parts at it until something hopefully worked.”

    That sounds like the process by which 3D printers ended up with autoleveling…

    1. hahahah you have a point but i actually disagree. i think i have a pretty good understanding of the flaws that force me to use mesh leveling, and i did make a pretty good effort directly at those flaws which dramatically reduced the amount of correction that was occurring in the leveling. but that last +/-0.3mm across the bed, it’s a lot of work. it’s not saving me from understanding the problem, it’s saving me from doing (and re-doing) the work!

    2. Totally agree with this. I have chinese i3 clone that was constantly out of level, until i bolted it firmly onto piece of kitchen desk plate. It’s leveled since then and only need to adjust level when i do nozzle clean/change, and glass change. And it holds the leveling even when tossed into backseat of my car and transported. Upon arrival i just put it on the table and send it without worry about leveling.

  7. of course as a software guy, for the good bugs i can almost never follow the whole route…it’s just too long, too many pieces. and even if i can, i don’t know if it always gives me full confidence in the diagnosis or the fix. when i’m getting close, i always go back to this crucial question:

    why did it work fine until today?

    i mean, obviously, to fix the bug i’ve had to spend a lot of time on “why is it broken?” but the “why did it work?” is where the real juice is. the last guy, the boneheaded fool who made this bug (aka me), he was thinking about something, the code correctly solved some problem and if it looks 100% wrong then i haven’t figured out the part that’s right and i still have work to do.

    i’m never really confident about a fix until i have a clear narrative for why it worked so well and for so long and what changed that brought it to my attention today.

  8. “When I was young, I heard the woes of backyard mechanics who lamented that they’d replaced hundreds of dollars worth of parts, but the issue they were experiencing went unsolved. I distinctly recall them blaming the fancy new “electronic stuff” (fuel injection) for their problems. The reality is that they didn’t understand the system they were working with and therefore could not troubleshoot it effectively, and so they stopped analyzing the problem and just reacted to it by throwing parts at it until something hopefully worked. And this is another way to fail at the troubleshooting process.”

    It is not just “backyard mechanics”, sometimes a factory trained specialist has to resort to the “shotgun method” of repair.
    I don’t mean taking a shotgun to it. B^) But replacing parts one by one until the problem goes away. The interaction of numerous parts and sensors does not make a closed “decision tree”. Maybe, someday, someone will figure out a way to solve that particular problem, but by then 95% of that line of automobiles have already been scrapped.

    This article has useful information, I’m making a hard copy of some paragraphs for handy reference.

    1. Indeed it’s hard to cover every possible scenario. I think the difference is that a factory trained tech would know *when* to apply the shotgun method figuratively and perhaps even literally ;-)

  9. Nice write-up but I feel that in most real-world troubleshooting the order of the steps you’ll take will be based on a combination of perceived effort and perceived likelihood. In a car you’ll check the ignition system first because easy even if you think a fuel problem is more likely. And you’ll check the filter, and injectors before the fuel pump even though they come later in the system. In electronics, checking incoming voltage happens to be the first thing and usually the easiest!

    1. Going for the low hanging fruit works some of the time though it can leave bigger problems unsolved. The biggest problem in this case is assuming that the new fuel filter fixed the problem when really it just made the slowly-dying-fuel-pump’s job a little less hard. Without taking a methodical approach it’s hard to know which of the underlying problems is solved or if the problem will return in a month with a dead fuel pump. It’s one of those cases where the quick fix only postpones the final solution. But sometimes the quick fix is enough for the moment. It’s a highly subjective, uh… subject.

      As for your electronics example, it just so happens that the start of every process in the thing is also the easiest to check. Write down results as you go and you’re well on your way :-)

  10. The classic repair site is http://www.repairfaq.org. Originally an adjunct to the sci.electronics.repair newsgroup, it just kept getting bigger. At one point in the nineties, Sam was writing the repair column in Radio Electronics (I guess it was Electronics Now at that point) simply based on his work on the faq.

  11. I’ve been trouble shooting very complex systems for years, some of them with pieces that were completely inaccessible, or so remote they can’t be evaluated. (i.e., satellite and international subsea). One of the best tools I’ve found is the (attributed to) Hewlett Packard “half splitting” solution where you go to the middle point of a failed system for the first analysis point. If the problem is before there then go halfway between the midpoint and the start. Decide at this point either before or after. In theory it seems like you might be using additional steps but in practice it is really fast.
    Try it as a parlor game. “I’m guessing a number (integer) between 0 and 100”
    The question is, “is it greater than 50?” y/n
    y: ” is it greater than 75?”
    n: “is it greater than 62?”, etc.
    Arrive at the answer in 7 guesses or less, typically 5.
    Also works well as a software quick sort.
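The parlor game above is easy to try out in code. What follows is a minimal sketch of the guessing strategy, assuming the only question allowed is “is it greater than X?” and that the range runs from 0 to 100 inclusive. The `guess_number` function and its parameter names are made up for illustration.

```python
def guess_number(secret, low=0, high=100):
    """Find secret in [low, high] by asking only 'is it greater than mid?'."""
    questions = 0
    while low < high:
        mid = (low + high) // 2
        questions += 1
        if secret > mid:
            low = mid + 1   # answer was "yes": discard the lower half
        else:
            high = mid      # answer was "no": discard the upper half
    return low, questions


# Play the game against every possible secret and record the worst case.
worst = max(guess_number(n)[1] for n in range(101))
print(worst)  # → 7
```

The same halving idea carries over to a signal chain: replace “is it greater than X?” with “is the signal still good at this test point?”, under the assumption that everything upstream of the fault tests good and everything downstream tests bad.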

    1. Good old binary search. It’s fun how many fields, procedures, etc think they have invented it 😁 As the overly vigorous debate above highlighted, this can be very effective as long as you know the points where you can measure, evaluate, etc without changing the operation through observation. I think in my application of this for troubleshooting, the hardest part is keeping track of steps and progress. Recently started just writing things in an actual “lab notebook” of sorts, after watching Curious Marc with envy in his methodical, documented process of fixing even highly complex electronic contraptions.

      1. This can be more difficult and informative than you might expect. For an example from my personal experience, many times in development of a VR/AR system, you may hear complaints of “bad tracking”. Starting at the tracking system is usually the wrong approach, not least because that’s often the hardest system to tune. Many, many things, from frame rate issues/inconsistencies, inaccurate distortion correction, bad projection parameters, bad motion prediction, artifacts from progressive scan (rather than global) displays, etc can cause visual artifacts that cause the virtual world to not match expectation. Even various things tracking related might be involved, from bad calibration/offset, bad calibration of cameras/sensors/markers, to more subtle and maddening problems. These may all be attributed by most (even those with experience) to bad tracking, so it’s really important to identify the exact conditions and observations behind a “bad tracking” claim to avoid chasing your tail or wasting time.

        I’d say “ask me how I know” but there’s no need, I’ve had this experience many times.

  12. I disagree with a lot of this article, and my job is constant troubleshooting. You should only fall back to extremely methodical check-everything, once you have eliminated the easy things to check and the “most likely”. Whenever you identify a problem, fix/replace only that thing, and thoroughly test again. Once you have done that, if the problem persists, then the solution is either unexpected or you made some false assumptions, so then is the time to methodically check and test everything, in order.

  13. Used to troubleshoot online analytical instrumentation for a very large chemical plant as part of a team of techs. Some of our gospel:

    1. Log EVERYTHING. Every calibration, every new part, every issue encountered, every change whatsoever to that particular device. I could review every solved fault and issue for the life of that device, sometimes over >20 years of service. It was not uncommon to encounter a weird problem that had cropped up every 5-7 years and had been solved before by multiple techs.

    2. If it’s broken, what’s the last thing that was done to it? Consult that logbook and consider eliminating the last change. This often worked, even if we could not immediately explain why it should have worked.

    3. As certain pieces of equipment being offline cost tens of thousands of dollars per HOUR in lost production, many times we didn’t have the luxury of troubleshooting in-situ. If we could isolate the fault to a board, we swapped it immediately and then did board-level troubleshooting back at the shop.

    4. As time allowed, we reviewed logbooks and ensured we kept adequate spares for those repeated faults. Some items had ridiculously long leadtimes in the best of times and it was crucial to keep those items in inventory.

    5. If other techs work on your stuff, inevitably someone will repair something and fail to order replacement parts. It’s human nature. So make sure to check spares inventory periodically so someone else doesn’t cost you your job. It’s not unknown for a good tech to squirrel away a few extra spare parts someplace. I would if I were you.

  14. Had a boss who told me about how he once fought and fought with a small engine on a pump. Once he’d reached his ultimate level of frustration, he pulled out a .44 magnum and put a LARGE hole in it. I thought this was kind of funny, but it occurred to me to ask, “You’d tell me before I ever bothered you too much, right?”

  15. My method
    After a problem is seen
    1. Assumption. What do I think it is?
    2. Action. What course of action, tied to time and expense, should I try first?
    3. Observe results
    Go to step 1
    Repeat until resolved
    add nuances such as simplicity, cost, time, likelihood
    This approach has never failed me
