Three reasons why the root cause of an incident is hard to identify during troubleshooting
Failing to investigate recent changes
Let us go through a simple example of this point:
I have a VPN connection from my home router to my office. I had urgent work to do and my VPN was failing. In fact, my Internet service was gone! I called support and told them that it must be their fault, since I could see one of the machines connected to their network but nothing else on the Internet. They replied that they hadn't registered any issues on their side. I was furious, until I recalled that I had made a small change on my device earlier that evening. That change caused the outage, but my arrogance kept me from seeing the simple root cause of the problem.
Not knowing the architecture of the failing service
Users encounter a problem and raise an incident. It is the troubleshooter's job to gather all the symptoms and map them to the end-to-end service architecture.
Often, support teams already have an idea of what could be causing the incident, and that can lead them straight to the suspected point of failure. Those judgments are based mostly on experience (both personal and from the company knowledge base) and can drastically reduce the resolution time. Compare this with everyday life: it is the reason we all want to go to the most experienced doctor and the best hospital for a medical exam. However, not all patients have the flu; some come with more complicated problems that require a more systematic approach.
It might sound intuitive that you should investigate all the elements in order of how likely they are to have caused the failure, how easy that is to prove, or a combination of the two. Instead, support teams often focus on the elements they have experience with and try to fit the problem into that area alone. Failing to understand the end-to-end nature of a service can mean many hours or days of not knowing the root cause and relying indefinitely on workarounds.
A customer using data services reported high packet loss. After examining all the routers that made up the solution, my colleagues still could not spot the problem. In the end, it turned out to be faulty cabling, a fundamental element we take for granted and exclude from our troubleshooting process.
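The idea of investigating elements in order of probability and ease of proof can be made explicit by scoring each element: the estimated probability that it is at fault, divided by the effort needed to check it. The Python sketch below is only an illustration of that idea. The `Check` class, the `rank_checks` function and all the numbers are hypothetical, and the estimates would have to come from your own knowledge of the service.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """One candidate point of failure in the end-to-end service path."""
    element: str
    probability: float   # assumed estimate that this element is at fault
    effort_minutes: int  # assumed time needed to confirm or rule it out (> 0)


def rank_checks(checks):
    """Order checks so that likely and cheap-to-verify elements come first."""
    return sorted(
        checks,
        key=lambda c: c.probability / c.effort_minutes,
        reverse=True,
    )


# Hypothetical estimates for a packet-loss incident; illustrative only.
candidates = [
    Check("core router configuration", 0.40, 60),
    Check("edge router hardware", 0.25, 45),
    Check("physical cabling", 0.15, 5),
    Check("provider uplink", 0.20, 40),
]

ranked = rank_checks(candidates)
for check in ranked:
    score = check.probability / check.effort_minutes
    print(f"{check.element}: score {score:.4f}")
```

With these made-up numbers the cabling check comes first: it is not the most probable cause, but it is so quick to rule out that skipping it is hard to justify.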
Rushing into action instead of following a structured approach
This error usually arises from the arrogance of believing we know our system so well that we can skip the simple things and jump straight to the complicated stuff. We spring into action and start changing the system, thus invalidating the original symptoms reported by the customer. If this particular system behavior is occurring for the first time, it is quite likely that the support engineer will fail to find the root cause.
Changing service parameters without a clear idea of how this might affect the service is a shot in the dark, and that is not how you want your business to be run. It is much better to follow the pattern of Dr. House: identify the symptoms, generate hypotheses for the diseases that might cause them, write the list on a flipchart in order of probability, and find a way to prove the root cause in that order.
A company had a problem accessing resources on the Internet. They were using a virtualized firewall that didn't seem to behave properly. The support engineering team increased the resources for the firewall and then upgraded it to another version. They also changed the configuration on their main router. At that point, finding the root cause called for reverting the changes in the order they were made and testing at each step of the process. The cause turned out to be an attack that was flooding the bandwidth.
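The revert-and-test procedure from this example can be sketched in a few lines of Python. This is a simulation under stated assumptions, not a real rollback tool: `find_culprit_by_rollback`, `revert` and `service_is_healthy` are hypothetical names, and in practice the last two would wrap whatever change-management and monitoring tools you actually use.

```python
def find_culprit_by_rollback(changes, revert, service_is_healthy):
    """Revert changes one at a time, in the order given, testing after each.

    Returns the change whose revert restored the service, or None if the
    service is still failing after every change has been reverted.
    """
    for change in changes:
        revert(change)
        if service_is_healthy():
            return change
    return None


# Changes from the example, in the order they were made.
applied_changes = [
    "increase firewall resources",
    "upgrade firewall version",
    "change main router configuration",
]

reverted = []


def revert(change):
    # Simulated revert: only records what was undone.
    reverted.append(change)


def service_is_healthy():
    # Simulated test: the fault lies outside the changes, so it never passes.
    return False


culprit = find_culprit_by_rollback(applied_changes, revert, service_is_healthy)
print("Reverted:", reverted)
print("Culprit:", culprit)
```

A result of `None` is useful information in itself: once every change has been undone and the symptoms persist, the original problem is back in its unaltered form and the search can start again from the symptoms, which in this example pointed to the flooded bandwidth.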
Nina is a highly qualified consultant and leader with more than 20 years of experience in leading highly complex projects and transformations. Nina is a well-known name in the field of Agile, Business Analysis, Enterprise Architecture and IT Service Management.