Defining Metrics for Problem Management

Stuart Rance

Editor’s Note:
While reviewing the readership of our blogs, we couldn’t help but notice that certain blogs have never lost their popularity over the years. This is one of them – with thousands of unique views every month. We thank Stuart Rance for words of wisdom that have clearly stood the test of time (the advice is as relevant today as it was when it was originally published). So, ICYMI, we’re pleased to republish this blog for your convenience.

Many people define KPIs for their IT service management processes by looking in books (such as ITIL Service Operation) or by copying metrics that other organizations use. This is rarely going to give good results, because KPIs need to INDICATE the PERFORMANCE of the KEY things you care about (that’s why they’re called Key Performance Indicators). In the worst cases I have seen ITSM processes with huge numbers of so-called KPIs that are measured and reported even though nobody uses the values to drive any changes in behaviour or improvements in business outcomes.

I recently wrote a blog titled Defining Metrics for Change Management in which I explained how you can create KPIs that support what you are trying to achieve. A number of people contacted me after reading that blog to ask for examples of how to derive KPIs for other ITSM processes. I decided to write this blog about problem management KPIs because this is one process where many organizations I have worked with had very poor KPIs. Remember that you shouldn’t just copy the outcomes, critical success factors (CSFs) and KPIs that I describe here. Instead, use them to understand the approach I have taken, and then think about what is important to you and derive metrics that measure the things you care about.

The first step in defining good KPIs is to identify the objectives of problem management: what outcomes does problem management help us to achieve? For me, there are two key outcomes of a good problem management process:

  • Reducing the number of incidents that occur
  • Reducing the business impact of incidents that can’t be avoided

We could just measure the number of incidents and the overall business impact of incidents. These would certainly be valuable things to know, but I’m not sure they’d show how well problem management has been working, because so many other factors could have contributed. So I will break these down a bit and identify some problem management CSFs that could contribute to these outcomes:

  • Identify problems that have caused multiple incidents
  • Implement workarounds that reduce the impact of incidents
  • Initiate changes that reduce the number of incidents

It’s worth noting that I didn’t mention root cause analysis (RCA). I see many problem management people who only think about RCA, but RCA doesn’t deliver any benefit by itself; it’s just a technique that we use in problem management. The worst problem management KPIs that I see are “Average time to root cause”, “Percentage of problems with RCA complete in 3 days”, or similar. These KPIs drive behaviours that we really don’t want, by encouraging problem management people to declare that they have found “the” root cause of a complex situation rather than continuing to analyse and understand it even after they have identified one significant contributory factor.

One of my customers has a process for prioritising problems that takes account of the frequency and business impact of the problem, including the mitigation provided by any workarounds that are in place. They then have a KPI of “Average time to reduce problems to P3 priority.” This reduction can be achieved by resolving the problem, or by implementing a good workaround. The point is that they are measuring problem management based on how well they are reducing pain to the business. I’m not going to suggest that KPI here because it requires a fairly sophisticated approach to problem prioritisation, which not many IT organizations can achieve, but if you can measure this then it’s certainly something you could think about.
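As a rough illustration of how a KPI like “Average time to reduce problems to P3 priority” might be calculated, here is a minimal Python sketch. The record layout (`opened`, `history`), the function names, and the convention that P1 is the highest priority and any value of 3 or more counts as “reduced to P3” are all assumptions made for this example; your ITSM tool will store this differently.

```python
from datetime import datetime, timedelta
from statistics import mean


def time_to_p3(opened, priority_history):
    """Time from opening until the priority first reaches P3 or lower.

    priority_history is a list of (timestamp, priority) tuples, where
    priority 1 is the most severe (an assumption for this sketch).
    Returns None if the problem has never been reduced to P3.
    """
    for changed_at, priority in sorted(priority_history):
        if priority >= 3:
            return changed_at - opened
    return None


def average_time_to_p3(problems):
    """Average time to reach P3 across problems that have reached it."""
    durations = [
        d
        for d in (time_to_p3(p["opened"], p["history"]) for p in problems)
        if d is not None
    ]
    if not durations:
        return None
    return timedelta(seconds=mean(d.total_seconds() for d in durations))


# Hypothetical sample data
problems = [
    {"opened": datetime(2024, 1, 1),
     "history": [(datetime(2024, 1, 1), 1), (datetime(2024, 1, 5), 3)]},
    {"opened": datetime(2024, 1, 10),
     "history": [(datetime(2024, 1, 10), 2), (datetime(2024, 1, 12), 4)]},
    {"opened": datetime(2024, 1, 20),
     "history": [(datetime(2024, 1, 20), 1)]},  # still P1, not counted
]

print(average_time_to_p3(problems))
```

Note that this sketch ignores problems that have not yet reached P3, which flatters the average if serious problems stay open for a long time. You may want to report the number of such problems alongside the average.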

Here are some suggested KPIs that might help to demonstrate the CSFs I have listed above. Remember you shouldn’t just copy these – use a similar process to identify KPIs that will measure what you care about.

CSF1 – Identify problems that have caused multiple incidents

  • Increased percentage of incidents associated with a problem record or known error
  • Top 5 problem report created every month

CSF2 – Implement workarounds that reduce the impact of incidents

  • Increased percentage of incidents for which a knowledge base article provided the solution
  • Increased percentage of incidents closed by users using self-service incident management
  • Reduced impact of incidents associated with previous months’ top 5 problems

CSF3 – Initiate changes that reduce the number of incidents

  • Reduced number of incidents associated with previous months’ top 5 problems
  • Reduced backlog of outstanding problems
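To make a couple of these concrete, the sketch below shows one way the first CSF1 KPI and a simple top 5 report could be derived from exported incident data. The field name `problem_id` and the function names are hypothetical, and ranking problems purely by incident count is a simplification; in practice you would probably also weight by business impact.

```python
from collections import Counter


def percent_linked(incidents):
    """Percentage of incidents associated with a problem record."""
    if not incidents:
        return 0.0
    linked = sum(1 for incident in incidents if incident.get("problem_id"))
    return 100.0 * linked / len(incidents)


def top_problems(incidents, n=5):
    """The n problems linked to the most incidents, as (problem_id, count)."""
    counts = Counter(
        incident["problem_id"]
        for incident in incidents
        if incident.get("problem_id")
    )
    return counts.most_common(n)


# Hypothetical export of one month's incidents
incidents = [
    {"id": "INC1", "problem_id": "PRB1"},
    {"id": "INC2", "problem_id": "PRB1"},
    {"id": "INC3", "problem_id": "PRB2"},
    {"id": "INC4", "problem_id": None},  # not linked to any problem
]

print(percent_linked(incidents))
print(top_problems(incidents))
```

Running the same calculation on the following month’s incidents, restricted to the problems in the previous top 5, gives the “Reduced number of incidents associated with previous months’ top 5 problems” figure.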

I have worded these KPIs as “Increased…” or “Reduced…” because I don’t have the data needed to set explicit targets. As you make use of metrics like these you can put in place numerical targets, based on the baseline that you create when you first start measuring and reporting.
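Here is a minimal sketch of how a baseline could be turned into a numerical target. The sample values, the 10% improvement figure, and the function names are assumptions chosen purely for illustration; the right target depends on your own data and on what your customers care about.

```python
from statistics import mean


def baseline(values):
    """Baseline taken as the mean of the first measurements."""
    return mean(values)


def target_from_baseline(base, improvement, direction):
    """Derive a target from a baseline.

    improvement is a fraction (0.10 means 10%); direction is
    "increase" for KPIs worded "Increased..." and "reduce" for
    KPIs worded "Reduced...".
    """
    if direction == "increase":
        return base * (1 + improvement)
    if direction == "reduce":
        return base * (1 - improvement)
    raise ValueError("direction must be 'increase' or 'reduce'")


def meets_target(value, target, direction):
    """True if the measured value is at least as good as the target."""
    if direction == "increase":
        return value >= target
    return value <= target


# Hypothetical first three months of "% of incidents linked to a problem"
first_months = [40.0, 44.0, 42.0]
base = baseline(first_months)
target = target_from_baseline(base, 0.10, "increase")

print(base, target, meets_target(47.0, target, "increase"))
```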

How well do your problem management metrics measure what your customers care about? Is it time to review your problem management KPIs and align them with your CSFs and objectives?


Update: Since writing this blog, Stuart has helped to write the publication ITIL Practitioner Guidance, which includes lots of helpful suggestions on how to define CSFs and KPIs.

About the Author

Stuart Rance

Stuart is an ITSM and security consultant, trainer, and author who has worked with clients in many countries, helping them create business value for themselves and their customers. He was the author of the 2011 edition of ITIL® Service Transition and lead author of RESILIA™ Cyber Resilience best practice published in June 2015. Now that his children have all left home, he has plenty of time on his hands for contributing to our blog – lucky us!
