Common Pitfalls in Incident Management Process and How to Fix Them

by Soumya Ghorpode

In today’s connected, technology-driven world, incidents are a fact of life. Whether the cause is a software glitch, a hardware failure, a security breach, or human error, disruption to services is to be expected. Effective incident management is not only about reacting to issues as they come up; it is about minimizing their impact, restoring service quickly, and learning from every incident so that measures can be put in place to prevent a recurrence. A strong incident management process is key to business continuity, customer satisfaction, and protecting a company’s reputation.

However, even the best-designed incident management processes sometimes break down. In practice, many organizations run into the same problems, which can cause small incidents to blow up into big outages. The results are frustrated customers, financial damage, and lost trust. It is important to identify these weak points and apply targeted fixes. In this piece we look at the most common of these pitfalls and present practical strategies to remediate them.

Common Pitfalls in Incident Management Process

  1. Lack of Clear Definitions and Triage: One of the most basic problems is the absence of a commonly accepted definition of what counts as an “incident” as opposed to a “request” or a “problem”. Without clear severity levels (for instance P1, P2, P3) and impact criteria, teams struggle with initial triage, which leads to poor prioritization and delayed response to critical issues.
  2. Poor Communication (Internal & External): Siloed teams, unstructured updates, and overly technical language are a recipe for slow incident resolution. Meanwhile, stakeholders ranging from affected end users to senior leadership are left in the dark, which causes anxiety, repeated queries, and a perception that the team does not know what it is doing. This is especially true during a major incident.
  3. Inadequate Tools and Technology: Manual processes, scattered communication channels, and out-of-date monitoring systems turn incident response into chaos. Without integrated ITSM platforms, automated alerting, and central dashboards, collaboration and visibility suffer badly.
  4. Insufficient Training and Resources: Even with great processes in place, a team that is poorly trained or under-resourced will struggle. This includes technical skill gaps, team members unfamiliar with key protocols, staffing levels below what is required during peak times, and the absence of cross-training.
  5. Blame Culture vs. Learning Culture: When the focus during and after an incident turns to finger-pointing, teams become less willing to report issues and admit mistakes. A blame culture suppresses transparency, prevents honest analysis, and in the end damages continuous improvement efforts.
  6. Lack of Post-Incident Review (PIR) and Root Cause Analysis (RCA): Many organizations fix the immediate issue and stop there; they do not perform an in-depth post-incident analysis or get to the root cause. Until the underlying cause is identified, the same issues will reappear, creating a never-ending cycle of reactive fixes.
  7. Undefined Roles and Responsibilities: In many cases it is unclear who does what, from detection and initial response through to escalation, communication, and resolution. This leads to duplicated effort, gaps in coverage, and delay, particularly in complex incidents that involve multiple teams.
  8. Ignoring Service Level Agreements (SLAs) and Metrics: When SLAs for incident resolution and communication are not set, tracked, and reported on, there are no defined performance targets. Without performance data it is impossible to identify problem areas, measure performance, or demonstrate the business value of the incident management process.
  9. Failure to Document and Share Knowledge: Information gaps are large in many organizations. If incident resolutions, workarounds, and lessons learned are not recorded in an easy-to-access knowledge base, institutional memory is lost. Teams then repeat work and are slow to resolve the same issues when they come up again.
  10. Lack of Continuous Improvement: A typical mistake is to treat incident management as a fixed process. The operational environment changes constantly, and a process that is not regularly reviewed, adapted, and improved based on feedback and performance data will soon become out of date and ineffective.
How to Fix Them: Strategies for Improvement

Addressing these issues requires a holistic approach that combines process improvement, technology adoption, and culture change.

1. Establish Clear Definitions and Triage Protocols:

Solution: Develop a detailed incident classification guide that includes clear definitions of incident types, severity levels (critical, high, medium, low), and the business impact each level represents. Put in place a uniform triage process so that all incoming reports and alerts are assessed and prioritized the same way. Train all affected staff on these definitions and procedures.
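
A classification guide of this kind can be captured as a small lookup so that triage gives the same answer no matter who performs it. The sketch below is illustrative only: the `Impact` and `Urgency` scales, the scoring rule, and the `classify` function are assumptions made for this example, not a standard, and should be replaced by your organization's own criteria.

```python
from enum import IntEnum


class Impact(IntEnum):
    """How much of the business is affected (hypothetical scale)."""
    LOW = 1      # a single user or a cosmetic issue
    MEDIUM = 2   # one team or a non-critical service is degraded
    HIGH = 3     # a customer-facing service is down


class Urgency(IntEnum):
    """How quickly the situation gets worse (hypothetical scale)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def classify(impact: Impact, urgency: Urgency) -> str:
    """Map impact and urgency to one of the four severity levels."""
    score = impact + urgency  # ranges from 2 to 6
    if score == 6:
        return "critical"
    if score == 5:
        return "high"
    if score == 4:
        return "medium"
    return "low"


print(classify(Impact.HIGH, Urgency.HIGH))   # critical
print(classify(Impact.MEDIUM, Urgency.LOW))  # low
```

Keeping the rule in one place means a change to the criteria is made once and applies to every report and alert that passes through triage.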

2. Implement a Robust Communication Plan:

Solution: Design a tiered approach to communication. For internal teams, use dedicated channels (e.g., Slack, Microsoft Teams, incident management platforms) for real-time updates. For external parties and end users, use status pages, automated email alerts, and regular updates written in plain language. Also define who plays which role in communication (for example, the incident commander and the communication lead) and the chain of command during an incident.

3. Leverage Appropriate Tools and Automation:

Solution: Invest in a fully integrated Incident Management System (IMS) or IT Service Management (ITSM) platform that covers incident logging, routing, tracking, and reporting. Integrate it with monitoring tools to enable automated alerting. Also identify routine tasks, such as initial notifications, ticket creation, and even some diagnostic actions, that can be automated.

4. Invest in Training and Enablement:

Solution: Provide ongoing, in-depth training for all incident responders in technical skills, process adherence, and communication protocols. Run simulation exercises (war games) in a safe environment. Cross-train staff to build redundant capabilities and ensure adequate coverage.

5. Foster a Blameless Learning Culture:

Solution: Shift from asking “who” to asking “what” and “how”. Treat incidents as opportunities for growth and improvement rather than occasions for punishment. Foster open communication and transparent reporting of issues, and in post-incident reviews focus on systemic causes and preventive measures rather than individual failings.

6. Mandate Post-Incident Review (PIR) and Root Cause Analysis (RCA):

Solution: Require PIRs and RCAs for all major incidents and a selection of minor ones. Put in place a structured process for these reviews that includes input from all relevant teams. The aim is to identify the root cause, document the lessons learned, and assign remediation tasks that have owners and deadlines.

7. Define Clear Roles and Responsibilities (RACI Matrix):

Solution: Use a Responsible, Accountable, Consulted, Informed (RACI) chart to define roles for each stage of the incident lifecycle, from detection through to resolution and postmortem. Make sure all team members know what they are responsible for, which reduces confusion and improves coordination.
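
A RACI chart can also be kept as plain data and checked automatically. The following is a minimal sketch under stated assumptions: the stage names, role names, and the `validate_raci` helper are hypothetical, and the rule it enforces (exactly one Accountable and at least one Responsible per stage) is a common RACI convention rather than something every organization follows.

```python
# Hypothetical roles and stages. Codes are R, A, C, I, or "AR" when one
# role is both accountable and responsible for a stage.
RACI = {
    "detection": {
        "on_call_engineer": "R",
        "incident_commander": "A",
        "communication_lead": "I",
    },
    "resolution": {
        "on_call_engineer": "R",
        "incident_commander": "A",
        "subject_matter_expert": "C",
        "communication_lead": "I",
    },
    "post_mortem": {
        "incident_commander": "AR",
        "on_call_engineer": "C",
        "communication_lead": "I",
    },
}


def validate_raci(matrix: dict[str, dict[str, str]]) -> list[str]:
    """Return a list of problems; an empty list means the chart is consistent."""
    problems = []
    for stage, assignments in matrix.items():
        accountable = [role for role, code in assignments.items() if "A" in code]
        responsible = [role for role, code in assignments.items() if "R" in code]
        if len(accountable) != 1:
            problems.append(
                f"{stage}: expected exactly one Accountable, found {len(accountable)}"
            )
        if not responsible:
            problems.append(f"{stage}: no one is Responsible")
    return problems


print(validate_raci(RACI))  # []
```

Running a check like this whenever the chart changes catches the gaps and overlaps that otherwise surface only in the middle of an incident.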

8. Integrate SLAs and Performance Metrics:

Solution: Define SLAs for each level of incident severity (for example, restoration time and communication frequency). Track metrics such as Mean Time To Detect (MTTD), Mean Time To Resolve (MTTR), and incident volume by type to monitor performance, identify trends, and inform improvement actions. Report on these metrics regularly.
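
As a rough illustration of how these metrics can be computed from incident records, consider the sketch below. Everything in it is an assumption made for the example: the `Incident` fields, the sample timestamps, and the restoration targets in `SLA_RESTORE` are made up, and MTTR is measured here from detection to resolution, although some teams measure it from the start of the fault.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    severity: str
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a user reported it
    resolved: datetime   # when service was restored


# Hypothetical restoration targets per severity level.
SLA_RESTORE = {
    "critical": timedelta(hours=1),
    "high": timedelta(hours=4),
}


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    if not deltas:
        raise ValueError("no incidents to average")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean Time To Detect: fault start to detection."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time To Resolve, measured here from detection to resolution."""
    return mean_delta([i.resolved - i.detected for i in incidents])


def sla_breaches(incidents: list[Incident]) -> list[Incident]:
    """Incidents whose restoration time exceeded the target for their severity."""
    return [
        i for i in incidents
        if i.severity in SLA_RESTORE
        and i.resolved - i.detected > SLA_RESTORE[i.severity]
    ]


# Made-up sample data.
incidents = [
    Incident("critical", datetime(2024, 1, 5, 10, 0),
             datetime(2024, 1, 5, 10, 10), datetime(2024, 1, 5, 11, 40)),
    Incident("high", datetime(2024, 1, 9, 12, 0),
             datetime(2024, 1, 9, 12, 20), datetime(2024, 1, 9, 12, 50)),
]
print(mttd(incidents))               # 0:15:00
print(mttr(incidents))               # 1:00:00
print(len(sla_breaches(incidents)))  # 1
```

Whichever definition of MTTR you choose, state it explicitly in reports so that figures stay comparable from one period to the next.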

9. Build and Maintain a Comprehensive Knowledge Base:

Solution: Develop a central, easily searchable knowledge base for incident resolutions, common workarounds, diagnostic procedures, and frequently asked questions. Make it a requirement to update the knowledge base after each incident is resolved. Foster a culture of knowledge sharing across all teams.

10. Implement a Feedback Loop for Continuous Improvement:

Solution: Treat the incident management process as a living document. Review its performance regularly using feedback from incident responders, analysis of incident data, and input from post-incident reviews. Run periodic workshops to discuss process improvements, adapt to new technology or organizational changes, and make the process as efficient as possible.

Common Pitfalls in Incident Management Process & How to Fix Them

Imagine that your company’s primary site goes down. Orders stop. Customers are left in the dark. Your support lines light up like a Christmas tree. Each minute of downtime is a financial hit and also damages your company’s image. An effective incident response strategy is what turns the tide in these situations. It is the barrier that protects your business when things fall apart.

Incident management is about getting back to normal service quickly and limiting how much of the business is affected. Important as it is, many companies have a hard time with it. Their reaction to sudden issues is slow at some times and messy at others, which causes more problems than it should.

This article draws attention to common problems in incident response and offers solutions that work. The aim is to give you strategies that make you better at handling issues as they come up, so that your business can weather any storm that comes its way.

Pitfall 1: Unclear Roles and Responsibilities

When an incident hits, people may descend into chaos if it is not clear who is to do what. Some hold back, while others duplicate the same task. Both waste time. What is needed is defined roles, with each person aware of what they are expected to do.

Lack of a Defined Incident Commander

Picture a fire with no chief. Everyone is running around, but no one is leading. That is what an incident looks like without a clear Incident Commander (IC). This person is in charge. They make the key decisions and run communications. Without an IC the result is stalemate and inaction. Alternatively, different teams may try to solve the problem in different ways that do not always work well together, which makes the issue bigger rather than resolving it.

Ambiguous Communication Channels

Knowing what to share and with whom is very important. Without clear lines of communication, critical information may be lost. Teams may work in silos, so important updates get left out. Think of sending a very important message and not knowing whether it reached the right person at all. The result is delay and confusion.

Overlapping or Missing Skill Sets

At times the right person for the role is simply not present. At other times too many people with the same skill set show up while other required skills are missing. This creates gaps. Teams may lack the expertise to resolve certain issues, and putting the wrong people on key tasks slows everything down.

Pitfall 2: Inadequate Incident Detection and Reporting

Identifying a problem late makes it much harder to repair. Slow detection leads to greater damage, and it also means more time is needed to work out what happened. Getting accurate reports quickly is the first step to a fast resolution.

Manual and Slow Monitoring

Depending on people to check systems constantly is a risk. Outdated monitoring tools fail to flag small issues before they turn into big ones. Because detection is slow, a minor issue grows into a major service outage. The longer an incident goes unnoticed, the more damage it does, and the more work it takes to fix the larger problem.

Siloed Alerting Systems

Problems arise when different teams each have their own separate alerting systems. They may not see how one small event affects another system. The picture is fragmented. Important connections between alerts are lost, and no team sees the full picture.
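
One simple way to recover those lost connections is to pull alerts from every team's tool into one stream and group the ones that fire close together in time. The sketch below assumes each alert is a dictionary with `source`, `time`, and `msg` keys and uses a five-minute window; the `correlate` function, the field names, and the sample alerts are all hypothetical. Real correlation usually also considers service dependencies, which this example ignores.

```python
from datetime import datetime, timedelta


def correlate(alerts: list[dict],
              window: timedelta = timedelta(minutes=5)) -> list[list[dict]]:
    """Group alerts whose timestamps are within `window` of the previous alert."""
    groups: list[list[dict]] = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        if groups and alert["time"] - groups[-1][-1]["time"] <= window:
            groups[-1].append(alert)   # close in time: probably the same event
        else:
            groups.append([alert])     # too far apart: start a new group
    return groups


# Made-up alerts from three separate team tools.
alerts = [
    {"source": "api-monitor", "time": datetime(2024, 3, 1, 10, 2),
     "msg": "error rate up"},
    {"source": "db-monitor", "time": datetime(2024, 3, 1, 10, 0),
     "msg": "connection pool exhausted"},
    {"source": "mail-monitor", "time": datetime(2024, 3, 1, 10, 30),
     "msg": "queue backlog"},
]
for group in correlate(alerts):
    print([a["source"] for a in group])
# ['db-monitor', 'api-monitor']
# ['mail-monitor']
```

Here the database and API alerts land in the same group, which hints that the API errors may be a symptom of the database problem rather than a separate incident.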

Unclear Incident Severity Levels

What do we mean by a “critical” incident? What about a “minor” one? If people have different views on that, which is often the case, confusion follows. Without clear criteria for each level, teams may apply the wrong amount of effort. They may spend time on small issues while large ones are left to grow. This causes poor prioritization and slow response to what really matters.

Pitfall 3: Ineffective Communication and Collaboration

Even if you find an issue quickly, fixing it takes a team effort. Poor communication can make things go south. It can turn a small problem into a large-scale crisis. During a live incident, teams must communicate and work well together.

Lack of a Centralized Communication Hub

Scattered emails or short chats do not do the job in an incident. Think of all that back and forth across many phones at the same time. It is a mess. A dedicated space, such as a dedicated chat room or an incident management tool, brings teams together so that information is shared in real time. Without it, information falls through the cracks and effort is duplicated.

Poor Status Updates and Stakeholder Communication

When an incident occurs, it is very important that people are kept informed. This includes customers and business leaders. Vague or infrequent updates breed worry, which in turn breaks trust. Proactive communication is key. It shows that you are aware of the issue and working to resolve it. People cope better when they know what is being done.

"Blame Game" Culture

If people worry about blame, they will not put in their two cents. They may also hide mistakes. This hinders open communication and delays finding the root issue. A culture that does not blame has teams that work well together. It also encourages people to bring what went wrong into the open. In this way teams learn from incidents faster.

Pitfall 4: Insufficient Post-Incident Analysis (Postmortems)

When the incident is over, the work has just begun. Looking back at what happened is the first step in improving. Many companies skip this step. They address the immediate issue but do not learn from it.

Skipping or Rushing Postmortems

Fixing the problem feels great, but taking the time to go through the review process is what really pays off. Skipping the postmortem, or rushing through it without detail, leaves out important information. The same issues then recur, which means the team is not improving. You cannot improve what you do not analyze.

Focus on Blame Instead of Root Cause

A true postmortem is for identifying what caused an issue to happen. It does not focus on assigning blame. It looks for the root cause of the issue, not the surface symptom. When the focus is instead on blaming individuals, people go into defense mode, which covers up the real issues. The goal is to get to the core of the problem in systems and processes.

Lack of Actionable Follow-Up

Finding issues is easy; fixing them is a different story. A postmortem needs to produce concrete actions. What is to be done, by whom, and by when? If you do not assign tasks and follow up on them, things will not change, and your team will not grow. The same incidents will play out again and again because nothing was actually done to fix them.
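
The "what, by whom, and by when" questions map directly onto a small record that can be checked for overdue work. This is only a sketch: the `ActionItem` fields, the `overdue` helper, and the sample tasks and owners are invented for illustration, and in practice this data would normally live in your ticketing system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str   # what is to be done
    owner: str         # by whom
    due: date          # by when
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose deadline has passed."""
    return [item for item in items if not item.done and item.due < today]


# Made-up follow-ups from a postmortem.
items = [
    ActionItem("Add alert on connection pool usage", "dana", date(2024, 3, 8)),
    ActionItem("Update failover runbook", "lee", date(2024, 3, 15), done=True),
    ActionItem("Load-test the checkout service", "sam", date(2024, 4, 1)),
]
for item in overdue(items, today=date(2024, 3, 20)):
    print(f"{item.owner}: {item.description} (due {item.due})")
# dana: Add alert on connection pool usage (due 2024-03-08)
```

Reviewing the overdue list on a regular schedule is what turns postmortem findings into actual change.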

Pitfall 5: Lack of Continuous Improvement and Training

An incident management process is a living thing that you cannot just set up and leave alone. It has to improve over time. Without continuous improvement and learning, your process will break down and will not be able to handle new issues.

Infrequent Process Review and Updates

Technology is a fast-moving field, and so are the ways in which systems fail. Review your incident playbooks and guides regularly, and update them based on what you learn from each incident. If you do not revisit them often enough, they will become out of date, leaving your teams underprepared.

Inadequate Training for Incident Responders

Even the best-laid plans are useless without trained people behind them. Teams must train continuously. Running drills and practicing various scenarios improves response time in real incidents. Responders who are not prepared will either freeze up or make errors. Regular training instills confidence and skill.

Failure to Leverage Incident Data

Every incident generates data. How long did resolution take? How many incidents occurred in the past month? Which issues kept recurring? Analyzing this information reveals trends and points out the weak spots in the process. Acting on those findings makes it possible to improve and to prevent issues from happening in the first place.
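
Questions like these can be answered with a few lines of analysis once incidents are logged consistently. The sketch below assumes a log of `(month, category)` pairs; the log contents, the category names, and the two helper functions are made up for the example.

```python
from collections import Counter

# Made-up incident log: (month, category) pairs.
incident_log = [
    ("2024-01", "database"),
    ("2024-01", "network"),
    ("2024-02", "database"),
    ("2024-02", "deployment"),
    ("2024-03", "database"),
    ("2024-03", "network"),
]


def incidents_per_month(log):
    """Count how many incidents were logged in each month."""
    return Counter(month for month, _ in log)


def recurring_categories(log, min_count=2):
    """Categories seen at least `min_count` times, most frequent first."""
    counts = Counter(category for _, category in log)
    return [(c, n) for c, n in counts.most_common() if n >= min_count]


print(incidents_per_month(incident_log)["2024-03"])  # 2
print(recurring_categories(incident_log))
# [('database', 3), ('network', 2)]
```

In this sample, database incidents recur every month, which marks that area as the first candidate for root cause work.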


Conclusion

We have discussed the major pitfalls in incident management: confused roles, slow detection, poor communication, weak postmortems, and a failure to keep learning and improving. By identifying and solving these common issues, you can put together a robust incident response plan.

Working on these issues improves the resilience and performance of your incident management process. It gives your company the chance to recover faster from any issue. A fine-tuned incident management plan does not only solve the problems at hand. It also wins your customers’ trust. It safeguards your company’s image. And it is what gets your business through difficult situations without breaking down.


Overcoming the pitfalls that most organizations face in incident management is not a short-term play; it is a long game that requires strong leadership buy-in and a culture of proactiveness and continuous learning. By systematically tackling issues such as poor definitions, bad communication, a lack of proper tools, and neglect of post-incident analysis, organizations can transform their incident management from a reactive, firefighting mode into an efficient, resilient, and constantly improving machine. In the end, effective incident management delivers reduced downtime, improved service quality, and a protected and growing bottom line.