Troubleshooting is a critical skill for IT professionals. There’s no getting around it: a vast amount of our time is spent figuring out why something that should work doesn’t. A great deal of our ability to diagnose and solve computer and network-related issues comes from experience. However, there is also a framework that guides us toward finding the answers we need.
While none of this content is exclusive to CompTIA, virtually every CompTIA certification contains a troubleshooting methodology exam objective. This methodology has been constructed over the years based on experience, and it serves as a guide for newer members of the IT community when problem solving.
The CompTIA troubleshooting methodology can be likened to the scientific method in some ways. It’s a 6-step process that frames the problem and guides IT pros to identify and execute a solution.
The CompTIA troubleshooting methodology:
- Identify the problem
- Establish a theory of probable cause
- Test the theory to determine the cause
- Establish a plan of action to resolve the problem and implement the solution
- Verify full system functionality, and, if applicable, implement preventive measures
- Document findings, actions and outcomes
Let’s take a more in-depth look at each of these steps to determine what they really mean.
1. Identify the Problem
Identification is often the easiest step. It may be accomplished via an inbound phone call from a user, a help desk ticket, an email message, a log file entry or any number of other sources. It is not at all uncommon for users to alert you to the problem.
It’s important to recognize that the root cause of specific issues is not always apparent. For example, a failed login attempt might seem to indicate a username or password problem, when instead the real issue may be a lack of network connectivity that prevents the authentication information from being checked against a remote server.
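The failed-login example can be sketched in code: before blaming the credentials, check whether the authentication server is reachable at all. This is a minimal sketch, not a definitive diagnostic. The function names, the result strings and the host and port in the usage note are assumptions made for illustration.

```python
import socket
from typing import Callable


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def classify_login_failure(
    auth_host: str,
    auth_port: int,
    probe: Callable[[str, int], bool] = can_reach,
) -> str:
    """Suggest where to look first after a failed login (hypothetical helper)."""
    if not probe(auth_host, auth_port):
        return "network: authentication server unreachable"
    return "credentials: server reachable, check the username and password"
```

A call such as `classify_login_failure("auth.example.internal", 636)` (a hypothetical server and port) returns a hint about which area to investigate first. It does not prove the root cause; it only rules the network in or out as a suspect.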
As troubleshooters, we want to be very careful to ensure we have identified the root cause of the error, misconfiguration or service interruption before making any changes.
Specific steps here may include:
- Gathering information from log files and error messages
- Questioning users
- Identifying symptoms
- Determining recent changes
- Duplicating the problem
- Approaching multiple problems one at a time
- Narrowing the scope of the problem
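Gathering information from log files is often easier with a small script that counts repeated errors. The sketch below assumes a hypothetical log format of date, time, level and message; real log formats vary, so the regular expression and the sample lines are illustrative only.

```python
import re
from collections import Counter

# Assumed (hypothetical) line format: "<date> <time> <LEVEL> <message>"
LINE_RE = re.compile(r"^(\S+ \S+) (DEBUG|INFO|WARNING|ERROR|CRITICAL) (.*)$")


def summarize_errors(lines, levels=("ERROR", "CRITICAL")):
    """Count how often each message appears at the given severity levels."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(2) in levels:
            counts[match.group(3)] += 1
    # Most frequent messages first.
    return counts.most_common()


# Made-up sample lines, used only to show the shape of the output.
sample = [
    "2024-05-01 09:14:02 INFO service started",
    "2024-05-01 09:15:10 ERROR cannot reach auth server",
    "2024-05-01 09:15:40 ERROR cannot reach auth server",
    "2024-05-01 09:16:05 WARNING retrying",
    "2024-05-01 09:17:00 CRITICAL disk full",
]
print(summarize_errors(sample))
```

Sorting by frequency helps with narrowing the scope: the message that repeats most often is usually a good first symptom to chase.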
2. Establish a Theory of Probable Cause
I’d like to start by pointing out the imprecision of the words in this step. Words such as theory and probable indicate a guess on your part, even if it is a guess backed by data. The wording acknowledges that the root cause may not have been accurately identified in step one. However, the suspected cause is specific enough to begin troubleshooting.
This stage may require significant research on your part. Vendor documentation, your organization’s own documentation and a good old-fashioned Google search may all be required to provide the basis for your theory.
Specific steps here may include:
- Questioning the obvious
- Considering multiple approaches, including top-to-bottom or bottom-to-top for layered technologies (such as networks)
One of the main issues that I’ve observed with newer troubleshooters is failing to question the obvious. In my classes, I rephrase this as “start simple and work toward the complex.” Yes, I am aware that operating systems, networks and cloud deployments are all very complex. However, that does not mean that your issue is complex.
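The bottom-to-top approach and the “start simple” habit can be sketched as an ordered list of checks that stops at the first failure. This is only an illustration: the layer names are assumptions, and the lambdas are hypothetical stand-ins for real probes such as a link check or a port test.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run checks in order; return the name of the first one that fails, or None."""
    for name, check in checks:
        if not check():
            return name
    return None


# Hypothetical stand-ins, ordered from the simplest layer to the most complex.
checks = [
    ("physical: cable connected and link light on", lambda: True),
    ("network: default gateway reachable", lambda: True),
    ("transport: server port reachable", lambda: False),
    ("application: service returns a valid response", lambda: True),
]
print(first_failing_layer(checks))
```

Reversing the list gives the top-to-bottom variant. Either way, the first failing layer becomes the theory of probable cause to test in the next step.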
I have found over the years that careful note-taking is important at this point. Your notes can include data copied from websites, web URLs, suggestions from your team members, etc.
3. Test the Theory to Determine the Cause
What’s most interesting about the first two steps is that they don’t require you to make configuration changes. Changes should not be made until you are reasonably sure you have a solution that you’re ready to implement.
This step is also part of the “information-gathering” phase.
Please note, it is not uncommon for experienced administrators to move very quickly and informally through steps one, two and three. Problems and symptoms are often familiar, making it simple to predict the likely cause of an error message or failed device.
At this stage, you may find yourself circling all the way back to step one: Identify the problem. If you test your theory to discover the likely cause and find that you were incorrect, you may need to start your research all over again. You can check in with users, dig more deeply into log files, exercise your Google-Fu, etc.
Once you’re confident you’ve found the fundamental issue, the next step is to prepare to solve the problem.
4. Establish a Plan of Action and Implement the Solution
If you believe you know the root cause of the issue, you can now plan how to address it. Here are some reasons to plan ahead before blindly jumping into a course of action:
- Some fixes require reboots or other more significant forms of downtime
- You may need to download software, patches, drivers or entire operating system files before proceeding
- Your change management procedures may require you to test modifications to a system’s configuration in a staging environment before implementing the fix in production
- You may need to document a series of complex steps, commands and scripts
- You may need to back up data that might be put at risk during the recovery
- You may need approval from other IT staff members before making changes
Once you’ve completed this stage, you’re ready to do whatever you believe is needed to solve the problem. These steps may include:
- Running your scripts
- Updating your systems or software
- Editing configuration files
- Changing firewall settings
Make sure that you have a rollback plan in place in case the fix you’re attempting does not address the issue. You must be able to reverse your settings to at least get back to where you began.
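For a configuration file edit, a rollback plan can be as simple as saving a copy first and restoring it if the fix does not hold. The sketch below assumes the fix is a full replacement of one text file and that a caller-supplied `verify` function decides whether the change worked. The function name and the `.bak` suffix are assumptions for illustration, not a standard tool.

```python
import shutil
from pathlib import Path
from typing import Callable


def apply_with_rollback(
    config_path: Path,
    new_text: str,
    verify: Callable[[], bool],
) -> bool:
    """Write new_text to config_path; restore the original if verify() fails."""
    backup_path = config_path.with_name(config_path.name + ".bak")
    shutil.copy2(config_path, backup_path)  # save the starting point first
    config_path.write_text(new_text)
    try:
        ok = verify()
    except Exception:
        # Treat an error during verification as a failed fix.
        ok = False
    if not ok:
        shutil.copy2(backup_path, config_path)  # back to where we began
    return ok
```

The backup file is left in place after a successful change, so the previous configuration is still available if a problem appears later.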
In some cases, implementing the proposed fix may be quicker than the research phases that preceded it. Those research phases are essential, however, to make sure you’re addressing the real issue and to minimize downtime.
5. Verify Full System Functionality and Implement Preventive Measures
I once observed a failure at this very stage of troubleshooting. The support person in question was called to investigate a printer that wasn’t working. When he arrived, he noticed the printer was unplugged. He plugged it back in, grumbled about users not understanding computers, and walked away. What he failed to realize, however, was that the printer was jammed and that the users had unplugged it while attempting to fix the jam. The tech walked away without verifying functionality.
When possible, have the users who rely on the system test functionality for you. They are the ones who really know how the system is supposed to act, and they can ensure that it responds to their specific requirements.
Depending on the problem, you may need to apply the fix to multiple servers or other devices. For example, if you’ve discovered a problem with a device driver on a server, you may need to update the drivers on several servers that rely on the same device.
6. Document Findings, Actions and Outcomes
Poor documentation is a pet peeve of mine. It comes from working as a network administrator for an organization that had zero documentation. I was the sixth administrator the company had hired in five years, and no one before me had written anything down. It was a nightmare.
Documenting your troubleshooting steps, changes, updates, theories and research could all be useful in the future when a similar problem arises (or when the same problem turns out not to have been fixed after all).
Another reason to keep good documentation as you go through the entire methodology is to communicate to others what you have tried so far. I was on the phone once with Microsoft tech support for a failed Exchange server. The first thing the tech said was, “What have you tried so far?” I had a three-page list of things we didn’t need to try again.
Such documentation is also useful in case your changes have unintended consequences. You can reverse or adjust your changes much more easily if you have good documentation of exactly what you did.
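One lightweight way to keep such a record is a small structured log that captures each action and its outcome as you go. This is a sketch under assumptions: the class names, field names and text layout are all hypothetical, and a ticketing system or shared document may serve the same purpose.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


def _now() -> str:
    """Current UTC time as an ISO 8601 string, to the second."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


@dataclass
class Entry:
    step: str      # methodology step, e.g. "Test the theory"
    action: str    # what was tried
    outcome: str   # what happened
    timestamp: str = field(default_factory=_now)


@dataclass
class TroubleshootingLog:
    problem: str
    entries: List[Entry] = field(default_factory=list)

    def record(self, step: str, action: str, outcome: str) -> None:
        self.entries.append(Entry(step, action, outcome))

    def already_tried(self) -> List[str]:
        """Actions taken so far, in order; useful when escalating to support."""
        return [entry.action for entry in self.entries]

    def to_text(self) -> str:
        lines = [f"Problem: {self.problem}"]
        for number, e in enumerate(self.entries, start=1):
            lines.append(f"{number}. [{e.timestamp}] {e.step}: {e.action} -> {e.outcome}")
        return "\n".join(lines)
```

Calling `already_tried()` produces the list to read out when someone asks what has been attempted so far, and `to_text()` gives a timestamped history that makes changes easier to reverse.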
Keep It Simple
This troubleshooting methodology is just a guide. Each network environment is unique, and as you gain experience in that environment, you’ll be able to predict the likely causes of issues.
If I could pass on one bit of wisdom to future support staff members, it would be the tidbit above regarding starting simple. In my courses, one of the main troubleshooting checklists I suggest is this:
- Is it plugged in?
- Is it on?
- Did you restart it?
That may seem facetious and over-simplified, but those steps are actually worthwhile (in fact, it’s worthwhile to double-check those steps). However, the real lesson is not those three steps, but rather the spirit of those tasks, which is to start simple and work toward the more complex.
One thing the troubleshooting methodology does not address is time. In many cases, you will be working within the confines of service level agreements (SLAs), regulatory restrictions or security requirements. In those situations, you must be able to accomplish the above steps efficiently.
Being deliberate about following a troubleshooting methodology can make you much more consistent and efficient at finding and resolving system and network issues. I strongly encourage you to formalize such a methodology for your support staff.