How to Diagnose Common Enterprise Network Problems
The Methodology Behind Troubleshooting Enterprise Networks: A Primer on Best Diagnostic Practices
Diagnosing and troubleshooting a problem in an enterprise network can be a daunting task. With multiple branch offices, hundreds or even thousands of hosts, and dozens of routers, switches, and servers from different vendors running different firmware, plus good old-fashioned human error, knowing where to start is key to implementing a quick solution.
There is an established methodology when it comes to diagnosing a large network problem, and following its guidelines will help administrators keep an organized approach to troubleshooting.
Knowing Where to Start
Prior experience with the network in question can aid administrators in finding the issue and fixing it. If most of the issues that arise on a network come from specific errors with a known fix, that history gives administrators a “go-to” first choice for solving a problem. Even without familiarity with the network, following a set procedure will help keep everyone involved on the right track.
The first and most obvious step is defining the problem. If a user is unable to connect to a file server to access their work, that is the problem definition. This step usually takes care of itself: it’s rare to be called in for troubleshooting without a clear issue already presenting itself!
Next, gather information from the affected users or systems. In the above example about a user having trouble connecting to a file server, it would be worth the time to ask some basic questions. When was the last time the user was able to access the server? Has anything changed since then? Are other users also experiencing the same issue? If the problem is more widespread, it’s likely there’s an issue upstream in the network. If it’s isolated to just that one host, there probably isn’t a wider network issue that needs to be addressed. Gathering information might be one of the most important, and often overlooked, steps in troubleshooting a large network. The data and testimony gathered here can be used to guide administrators throughout the rest of the troubleshooting process.
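The answers to those questions are easier to act on when they are recorded in a consistent shape. The following is only a minimal sketch: the `ProblemReport` structure and the `likely_scope` helper are hypothetical names invented for illustration, and the rule they encode (one affected host suggests a local issue, several suggest something upstream) is a first guess, not a diagnosis.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProblemReport:
    """Answers to the basic questions asked of an affected user (hypothetical structure)."""
    description: str
    last_worked: Optional[str] = None  # when access last succeeded, if known
    recent_changes: List[str] = field(default_factory=list)
    affected_hosts: List[str] = field(default_factory=list)


def likely_scope(report: ProblemReport) -> str:
    """Rough first guess at where to look, based on how many hosts are affected."""
    if len(report.affected_hosts) <= 1:
        return "isolated host"
    return "upstream"
```

A report listing only 10.0.0.2 would come back as "isolated host", while one listing several hosts would point upstream.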
Gathering Data with Ping and Trace Route
This step is important enough to merit its own section. The ping and trace route tools provide much more information than their simple functions would imply. A large amount of data can be gathered for later analysis using just these two commands.
Using another example, let’s say that some users in one part of an office are unable to connect to the network. The ping command can be used to gather information and isolate the problem. This diagnostic tool works at the network layer, and starting with it fits the divide and conquer approach to troubleshooting. It sends an ICMP echo request from the host machine to the destination and waits for a reply. Keep in mind that access controls on some interfaces, or a hardware or software firewall, may prevent pings from reaching a host, so the command’s usefulness can be limited, particularly on incoming WAN interfaces.
Cisco recommends a specific four-step procedure when using ping to help diagnose IP errors at the network layer:
- Ping the loopback address. This is 127.0.0.1 and is used specifically for diagnostic purposes. This confirms that TCP/IP is working on the host.
- Ping the local host. This is the affected host’s own IP address, for example 10.0.0.2. If this ping is successful, the address is correctly bound to the host’s network interface.
- Ping the default gateway. If this is successful, the issue likely lies upstream from the host machine.
- Ping an external IP. If this is successful, but the host is still unable to connect to the internet or another network, the cause could be a DNS error, an improperly configured ACL, or an issue with a firewall.
Depending on the information gathered about the problem, some of these steps can be skipped. In the above example, if it’s already known that hosts inside that network can still communicate with each other, it makes sense to skip steps one and two.
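The four ping steps can be scripted so they run in the same order every time and stop at the first failure. This is a rough sketch, not Cisco tooling: `ping_once` and `run_ping_ladder` are hypothetical names, it shells out to the system ping command (`-n` on Windows, `-c` elsewhere), and the gateway and external addresses in the usage note are placeholders. Exit codes from ping vary by platform, so treat a "success" as a hint to verify by hand.

```python
import platform
import subprocess
from typing import Callable, List, Optional, Tuple


def ping_once(address: str, timeout_s: int = 2) -> bool:
    """Send one echo request with the system ping command; True if it exits cleanly."""
    # Windows uses -n for the packet count; Linux and macOS use -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", address],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def run_ping_ladder(
    targets: List[Tuple[str, str]],
    probe: Callable[[str], bool] = ping_once,
) -> Tuple[List[Tuple[str, str, bool]], Optional[str]]:
    """Probe (label, address) pairs in order and stop at the first failure.

    Returns the results collected so far and the label of the step that
    failed, or None if every step succeeded.
    """
    results = []
    for label, address in targets:
        ok = probe(address)
        results.append((label, address, ok))
        if not ok:
            return results, label
    return results, None
```

A call might pass `[("loopback", "127.0.0.1"), ("local host", "10.0.0.2"), ("default gateway", "10.0.0.1"), ("external", "192.0.2.1")]`, where 10.0.0.1 is an assumed gateway and 192.0.2.1 is a reserved documentation address that should be replaced with a real external IP. Skipping steps one and two is just a matter of leaving them out of the list.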
Another powerful command is traceroute (on Cisco IOS, Linux, and macOS) or tracert (at the Windows command prompt). Trace route sends packets toward the destination and reports each router (hop) they pass through along the way. If a hop fails to respond, that is reported back to the user running the command. This can highlight where a potential issue is occurring and give administrators a good idea of where to start looking for the problem.
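When trace route output is saved for later analysis, a short script can point at the first hop that stopped answering. The sketch below assumes Unix-style numeric output (as from `traceroute -n`), where a hop that never replied is printed as asterisks only; Windows tracert formats its lines differently and is not handled. `parse_hops`, `first_silent_hop`, and the sample text are all illustrative, and the sample is made up rather than captured from a real network.

```python
import re
from typing import List, Optional, Tuple

# A hop line starts with the hop number, followed by addresses and timings.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")

# Made-up sample in Unix traceroute -n style, for illustration only.
SAMPLE_OUTPUT = """traceroute to 192.0.2.50 (192.0.2.50), 30 hops max, 60 byte packets
 1  10.0.0.1  0.412 ms  0.398 ms  0.377 ms
 2  10.0.1.1  1.204 ms  1.187 ms  1.165 ms
 3  * * *
 4  * * *
"""


def parse_hops(output: str) -> List[Tuple[int, bool]]:
    """Return (hop_number, responded) for each hop line in the output."""
    hops = []
    for line in output.splitlines():
        match = HOP_LINE.match(line)
        if not match:
            continue  # header or blank line
        hop_number = int(match.group(1))
        fields = match.group(2).split()
        # A hop that never answered is printed as asterisks only.
        responded = any(token != "*" for token in fields)
        hops.append((hop_number, responded))
    return hops


def first_silent_hop(output: str) -> Optional[int]:
    """Return the number of the first hop with no reply, or None if all replied."""
    for hop_number, responded in parse_hops(output):
        if not responded:
            return hop_number
    return None
```

With the sample above, the first silent hop is hop 3, which suggests looking just past 10.0.1.1. A silent hop is not proof of a fault, since some routers are configured not to answer these probes, so it is a place to start looking rather than a verdict.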
Analyzing the Data and Working a Solution
Once the problem has been defined and information has been gathered, the data needs to be analyzed. This can be simple or complex, depending on the data present. Analysis is an important step in troubleshooting a network issue because it indicates which methodology to start working the problem with.
Top-Down or Bottom-Up Approach
These methods are exactly what they sound like: troubleshooting the issue either from the top of the OSI model down, or from the bottom of the OSI model up. They can be effective because, generally speaking, if one layer works, the layers below it are usually working properly. This isn’t always going to be true, but in most cases it will be. The drawback is that if insufficient information was gathered, starting from the wrong end of the model can create an unnecessary amount of extra work. This is why gathering extensive information and analyzing it is so important! If the issue is at the application layer and troubleshooting begins from the physical layer, it will take a lot of time and effort to confirm that the other six layers are working before reaching the actual problem. Depending on access across the network, it can also be difficult or impossible to check the upper layers of the OSI model, so that should be considered before selecting this approach.
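The cost of starting from the wrong end can be made concrete by counting how many layers get checked before the faulty one is reached. This is a small illustrative sketch; `check_order` and `layers_checked_before` are hypothetical helper names, and it treats each layer as one equal unit of work, which real troubleshooting rarely is.

```python
from typing import List

# The seven OSI layers, listed from bottom (layer 1) to top (layer 7).
OSI_LAYERS = [
    "physical",
    "data link",
    "network",
    "transport",
    "session",
    "presentation",
    "application",
]


def check_order(approach: str) -> List[str]:
    """Return the order in which layers are examined for the given approach."""
    if approach == "bottom-up":
        return list(OSI_LAYERS)
    if approach == "top-down":
        return list(reversed(OSI_LAYERS))
    raise ValueError(f"unknown approach: {approach!r}")


def layers_checked_before(fault_layer: str, approach: str) -> int:
    """Count the layers that must be cleared before reaching the faulty one."""
    return check_order(approach).index(fault_layer)
```

For a fault at the application layer, the bottom-up order clears six layers first, while the top-down order reaches it immediately. The reverse holds for a physical fault such as an unplugged cable.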
Divide and Conquer
Often the most effective methodology when information is limited, this approach starts in the middle of the OSI model, usually the network layer, and works outward. This is where the ping and trace route commands come into play. The result of the ping test guides troubleshooting up or down the model. If ping works, there’s likely a problem in an upper layer. If ping fails, there’s an issue at layer 3 or below. This can quickly narrow the search and get administrators working on a solution.
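The branching described here is simple enough to write down directly. The sketch below is one possible reading: given the outcome of a ping test, it returns the layers still worth examining, nearest to the network layer first. The function name `next_layers` and the "nearest first" ordering are assumptions made for illustration.

```python
from typing import List

# The seven OSI layers, listed from bottom (layer 1) to top (layer 7).
OSI_LAYERS = (
    "physical",
    "data link",
    "network",
    "transport",
    "session",
    "presentation",
    "application",
)


def next_layers(ping_ok: bool) -> List[str]:
    """Return the layers to examine next, starting nearest the network layer."""
    network_index = OSI_LAYERS.index("network")
    if ping_ok:
        # Layer 3 and below are presumed fine; work upward from transport.
        return list(OSI_LAYERS[network_index + 1:])
    # Ping failed: start at the network layer and work downward.
    return list(reversed(OSI_LAYERS[:network_index + 1]))
```

A successful ping leaves the transport, session, presentation, and application layers to check; a failed ping leaves the network, data link, and physical layers.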
Improvisation and Other Methods
A handful of methods fall into this category, and generally should only be used when the information gathered points to a very specific issue. Another reason to opt for this method first would be if the same network issue consistently appears and a fix for it is already known. If there is a high likelihood of the problem being quickly found and solved using this method, it will save time and resources over using the other methods. Familiarity with a given network will help administrators decide if this is the right way to go when troubleshooting a problem.
Being Flexible
Every network is different, every problem is different, and administrators need to be able to adapt to a changing network environment in order to quickly and effectively diagnose and fix network issues. While a consistently followed and well-documented troubleshooting plan will help keep everyone on the same page to quickly address potential problems, flexibility is needed in order to speed up response and fix times. Understanding when not to follow procedures is key in maintaining a large network.
Address Recurring Problems
All networks will undergo a significant number of errors and problems. However, if the same issue is constantly rearing its ugly head, looking for a permanent fix is important. If one router is consistently failing, for example, it may be time for a replacement. Redundancy can help to address, but not solve, recurring network problems. Likewise, “stop-gap” or “quick-fix” solutions need to have long-term solutions implemented as soon as possible to prevent future headaches. Getting ahead of a problem is often the best way to solve it.
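Spotting a recurring problem is easier when past incidents are counted rather than remembered. The following is a minimal sketch that assumes incidents are logged as (device, symptom) pairs; that record format, the `recurring_issues` name, and the threshold of three are all hypothetical choices, and a real ticketing system would supply its own fields.

```python
from collections import Counter
from typing import List, Tuple

Incident = Tuple[str, str]  # (device, symptom), an assumed record format


def recurring_issues(
    incidents: List[Incident], threshold: int = 3
) -> List[Tuple[Incident, int]]:
    """Return incidents seen at least `threshold` times, most frequent first."""
    counts = Counter(incidents)
    return [(key, n) for key, n in counts.most_common() if n >= threshold]
```

If a made-up log contained three "link flap" entries for a device called router-3 and one unrelated entry, only the router-3 pair would be returned, flagging it as a candidate for a permanent fix or replacement.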
The Human Factor and Malicious Intrusions
People make mistakes. They forget to plug things in, turn them on, configure them correctly, or just don’t know how to make something work. The best way to combat human error is with knowledge and practice. A well-informed user will cause far fewer networking nightmares than one who has received no instruction at all. Always account for the human factor when analyzing data and looking for a solution to a problem.
Likewise, humans will sometimes have unscrupulous goals when accessing a network. Always follow best security practices and be aware that network errors can sometimes have malicious origins designed to disrupt service. These kinds of attacks come in many forms and the best way to prevent them is with education and proactive defense.
Network Monitoring Software
There is a wealth of networking software available that will help monitor, diagnose, and troubleshoot large networks. From open source tools available freely on the internet to full-service, enterprise-oriented options, there is a software solution for everyone that can aid administrators in managing their networks. Using these tools can expedite network troubleshooting by shifting a large portion of the time and effort from staff to the software.
Every Problem Has a Solution
The biggest hurdle any network administrator will face is always going to be troubleshooting and maintaining their network. The number of potential problems and potential solutions is effectively unlimited, and covering them all is an impossible task. If specific procedures are followed, pinpointing the trouble and getting a fix implemented will be much easier for administrators and their associates.