Developing a Troubleshooting Checklist
There is an old saying that if you practice what you need to do in a crisis, your reaction
tends to be automatic when the crisis occurs. The time to figure out what you should be
looking at to resolve a problem is not when the firewall is down.
Instead, develop a troubleshooting checklist in advance. The reason is simple: There will
already be enough stress and confusion as a result of the failure; there is no need to
increase either by not having a plan. Your troubleshooting checklist is that plan.
Obviously, you cannot plan for every failure that will occur, but you can put together a
strategy that, if executed properly, increases the likelihood of being able to isolate the
problem more rapidly. The primary objective of the troubleshooting checklist is to
provide a methodical and logical approach to troubleshooting the problem. After all,
computer systems (including firewalls) are binary devices: they are on or off. The logic is
simple, and the devices always do exactly what they are supposed to, even when they fail.
A troubleshooting checklist should guide you through that logical troubleshooting
process. I often use the analogy of eating an elephant when I talk about troubleshooting.
If you sit down and try to eat the elephant all at once, you will quickly find yourself
overwhelmed by the task at hand. Troubleshooting is no different: if you try to
troubleshoot the entire problem all at once, you will be just as overwhelmed.
If you instead chop the elephant into smaller, easier-to-manage, steak-sized pieces, the
task becomes more manageable. The same applies to troubleshooting. After the checklist
has given you a methodical and logical approach, its secondary objective is to use the
results obtained by following it to narrow down the potential causes of whatever failure
is occurring.
Keeping in mind that every firewall, environment, and problem is unique, the following
represent a good baseline troubleshooting checklist:
Step 1. Verify the problem reported.
Step 2. Test connectivity.
Step 3. Physically check the firewall.
Step 4. Check for recent changes.
Step 5. Check the firewall logs for errors.
Step 6. Verify the firewall configuration.
Step 7. Verify the firewall ruleset.
Step 8. Verify that any dependent, non-firewall-specific systems are not the culprit.
Step 9. Monitor the network traffic.
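One way to make the checklist actionable is to treat it as ordered data and record an outcome for every step, so that the results can be used to narrow down the potential causes. The following is a minimal Python sketch of that idea, not a prescribed implementation. The step names come from the list above; the function names run_checklist and suspects, and the check functions passed in, are hypothetical placeholders for whatever checks fit your environment.

```python
from typing import Callable, Dict, List, Tuple

# Step names are copied from the baseline checklist, in order.
CHECKLIST: List[str] = [
    "Verify the problem reported",
    "Test connectivity",
    "Physically check the firewall",
    "Check for recent changes",
    "Check the firewall logs for errors",
    "Verify the firewall configuration",
    "Verify the firewall ruleset",
    "Verify dependent, non-firewall-specific systems",
    "Monitor the network traffic",
]


def run_checklist(checks: Dict[str, Callable[[], bool]]) -> List[Tuple[str, str]]:
    """Walk the checklist in order and record an outcome for every step.

    `checks` maps a step name to a hypothetical function that returns
    True when the step finds nothing wrong. Steps without a function
    are recorded as "not checked" rather than silently skipped.
    """
    results = []
    for step in CHECKLIST:
        check = checks.get(step)
        if check is None:
            results.append((step, "not checked"))
        else:
            results.append((step, "ok" if check() else "suspect"))
    return results


def suspects(results: List[Tuple[str, str]]) -> List[str]:
    """Return the steps whose checks pointed at a possible cause."""
    return [step for step, outcome in results if outcome == "suspect"]
```

Recording "not checked" explicitly keeps the sketch honest about which pieces of the elephant have not been looked at yet.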
Step 1: Verify the Problem Reported
One of the most overlooked steps in troubleshooting is to actually verify that the problem
that was reported is occurring as it was reported. Far too often, people report what they
suspect the problem is without being certain that the problem is indeed related to the
firewall. I have lost count of how many times I have heard "I cannot access this server,
the firewall must be down" only to discover that the server itself was down. So before
you begin the actual troubleshooting process, ensure that the problem has been reported
accurately and that you understand it and, if possible, can reproduce or observe it as it
is occurring. The old saying "To know where you are going, you need to know where you
are at" holds true in troubleshooting. Before you can troubleshoot a problem, you need to
make sure you know what the problem is.
Another aspect of verifying the problem is to make sure you treat the problem, not the
symptoms. This is similar to treating a medical patient. If a person comes in with a fever
and all you do is treat the fever (the symptom), you have done nothing to fix the problem
(the illness causing the fever). Accordingly, when the problem is reported, try to
distinguish between the symptoms of the problem (which are normally what is reported)
and the problem itself. The reason for this is simple. If all you do is treat the symptoms,
you may eliminate the reason the problem was reported, but you have not fixed the
problem itself, and a good chance exists that it will reappear at some point in the future.
This is particularly true when it comes to dealing with performance-related issues. It is
easy to lose sight of the problem, treat the symptom, and move on without ever
addressing the root cause of the performance problem.
Step 2: Test Connectivity
In the realm of networking and firewalls, one of the first and most important questions to
ask is this: Is the device up? This is where testing connectivity comes into play. Although
this step is not applicable to every situation, it is usually a good idea to try to connect to
the firewall or system protected by the firewall just to make sure it is up. There are a
number of ways to do this.
Using Ping to Test Connectivity
The de facto standard method of testing connectivity is to send a ping to the target host.
There are a couple of ways that this can be done that will provide additional information
based on the response. To help with understanding the process and the interaction of each
step, see the connectivity testing flowchart in Figure 13-1.
Figure 13-1. Connectivity Testing Flowchart
The first step is to attempt to ping the target host by its host name. If this succeeds, it
validates that everything from name resolution to physical delivery of the data is
working.
If this is not successful, the next step is to attempt to ping the target host by its IP address.
This eliminates name resolution as part of the problem. If this succeeds, the problem is
likely going to be related to name resolution (either Domain Name System [DNS]
resolution or NetBIOS name resolution). Perhaps the DNS server is down or the target
host name is not known. If this does not succeed, it is possible that the target host is
inaccessible for some reason (regardless, however, the problem warrants more attention).
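The name-then-address logic just described can be sketched in Python. This is an illustration under assumptions rather than a definitive tool: it shells out to the system ping command (using -n on Windows and -c elsewhere to send a single echo request), and the function names ping, interpret, and check_host are invented for this example.

```python
import platform
import subprocess


def ping(target: str, timeout_s: float = 5.0) -> bool:
    """Send a single echo request with the system ping command.

    The exit status is only a rough signal; some ping implementations
    report success for replies such as "destination host unreachable".
    """
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        completed = subprocess.run(
            ["ping", count_flag, "1", target],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the ping command could not be started.
        return False
    return completed.returncode == 0


def interpret(ping_by_name_ok: bool, ping_by_ip_ok: bool) -> str:
    """Map the two ping results to the conclusions drawn in the text."""
    if ping_by_name_ok:
        return "name resolution and delivery both work"
    if ping_by_ip_ok:
        return "suspect name resolution (DNS or NetBIOS)"
    return "target may be inaccessible; investigate further"


def check_host(host_name: str, ip_address: str) -> str:
    """Ping by name first, and fall back to the IP address only on failure."""
    name_ok = ping(host_name)
    ip_ok = name_ok or ping(ip_address)
    return interpret(name_ok, ip_ok)
```

Keeping interpret separate from ping means the decision logic can be checked without generating any network traffic.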
If the target host is remote, the next step is to attempt to ping the default gateway of the
source machine as well as another host on the same network as the remote machine (it is
a good idea to use the remote machine's default gateway as the destination). If you can
physically access the target machine, repeat this process in the other direction. These
steps validate that both hosts are able to communicate with their local routers as well as
validating that you can reach something on the remote network. If they cannot, a good
chance exists that the problem exists between the host and its router. It is possible that
there is an invalid Address Resolution Protocol (ARP) entry on either the host or the
router.
If the hosts can successfully ping their respective routers, and you are unable to ping
another host on the remote network, the next step is to perform a traceroute from the
source to the target host. This will enable you to determine the approximate location
where the problem is occurring. In a complex routed environment, it is generally a good
idea to have some baseline results of a functional traceroute because the traceroute is
typically unable to provide the IP address of the failing hop. Only by knowing which hop
normally follows the last successful hop can you tell which specific router
might be the cause.
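To illustrate why a baseline helps, the following Python sketch compares a failing traceroute against a previously recorded good one and reports the hop that would normally come next. It assumes the traceroute output has already been reduced to a list of hop addresses, with None for hops that timed out; the function name suspect_hop is hypothetical, and the addresses in the usage example are documentation addresses, not real routers.

```python
from typing import List, Optional


def suspect_hop(baseline: List[str],
                observed: List[Optional[str]]) -> Optional[str]:
    """Return the baseline hop that normally follows the last responding hop.

    `baseline` is the hop list from a known-good traceroute.
    `observed` is the hop list from the failing traceroute, with None
    for hops that did not respond. Returns None when the trace reached
    the end of the baseline path, or when the last responding hop is not
    in the baseline (the path changed, so no inference is possible).
    """
    responding = [hop for hop in observed if hop is not None]
    if not responding:
        # Nothing answered at all: the first baseline hop is the suspect.
        return baseline[0] if baseline else None
    last_good = responding[-1]
    if last_good not in baseline:
        return None
    next_index = baseline.index(last_good) + 1
    if next_index >= len(baseline):
        return None
    return baseline[next_index]


# Usage example with documentation addresses.
known_good = ["192.0.2.1", "198.51.100.1", "203.0.113.1", "203.0.113.10"]
failing = ["192.0.2.1", "198.51.100.1", None, None]
print(suspect_hop(known_good, failing))  # prints 203.0.113.1
```

The result is only a starting point: the hop named may be down, or the router before it may be unable to forward to it.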
Testing Connectivity Without Using Ping
One thing to keep in mind when testing connectivity is that many firewalls, by design, do
not allow ICMP traffic to traverse the firewall, and thus render the use of ICMP to test
connectivity worthless. You have a couple of options in this event. For one, you can
permit the ICMP traffic for the purposes of troubleshooting the problem and then disable
it again when you are finished. Another option is to use another protocol to determine
whether the remote system is responding at all. For example, just telnetting to many TCP
ports will either confirm or deny whether a remote host is accessible, as shown in
Example 13-1.
Example 13-1. Telnetting to TCP Port 80 to Test Connectivity
C:\Documents and Settings\wnoonan>telnet webserver 80
HTTP/1.1 200 OK
Last-Modified: Tue, 23 Nov 2004 05:23:47 GMT
Date: Thu, 02 Mar 2006 05:29:53 GMT
Connection to host lost.
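The same kind of check can be scripted when a Telnet client is not available. The Python sketch below attempts a TCP connection and reports whether the handshake completed. The function name tcp_port_open and the default timeout are assumptions made for this illustration.

```python
import socket


def tcp_port_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    A True result shows only that something accepted the connection;
    it does not prove the service behind the port is healthy. A False
    result covers refused connections, timeouts, and name resolution
    failures alike, so it confirms a problem without locating it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

For example, tcp_port_open("webserver", 80) would perform roughly the same test as the Telnet session in Example 13-1, without sending an HTTP request.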