
If You Ignore These 7 IT Operations Activities, You’re Heading for Failure

Stuart Rance


July 2nd, 2019



IT organizations tend to devote most of their time and effort to two areas:

Designing, developing, testing, and deploying new and changed software and infrastructure
Managing incidents and problems

These activities are very important, but if you want to deliver high-quality IT services then you also need to actively manage your IT operations. If you don’t, you’re likely to have sudden and catastrophic failures. The activities that could have prevented these failures are often painfully obvious once the failures have happened, so wouldn’t it be a lot better to have the right steps in place before anything goes wrong?


There are a lot of things people can do to manage IT operations effectively. Here are seven operational activities that I think every IT organization needs to carry out:

1. Certificate management

Security certificates are used to authenticate the identity of people and computers, and to support encrypted communication. They’re issued by trusted certificate authorities who vouch for the identity of the person or organization concerned. For example, if you connect to https://www.sysaid.com/ you’ll see a green padlock symbol on your web browser, and if you click this you can see that you have a secure connection to the website.

But certificates come with an expiry date, and if your organization does not renew a certificate in time, it expires. When that happens, any person or system trying to connect to you will see that your certificate is no longer valid. This can have very serious consequences.

For example, Ericsson allowed one of their certificates to expire in December 2018, and this caused a total loss of mobile data connectivity for millions of mobile phone users in the UK and Japan (Ericsson: Expired certificate caused O2 and SoftBank outages). And this isn’t a new problem; back in 2014 an expired certificate caused thousands of credit card terminals in the USA to stop working. These are not isolated examples. There have been many other incidents since.

What this means is that you must keep a record of certificate expiry dates and ensure that you have a reliable process in place for renewing them before they expire. If you don’t have a process in place for managing your security certificates, then sooner or later you’re likely to miss a renewal date and will have to manage the messy consequences of a completely avoidable failure.
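A record of expiry dates can be checked automatically. The Python sketch below is one possible way to do this, not a definitive implementation. The function names, the inventory format (a name mapped to the certificate's `notAfter` string), and the 30-day warning threshold are all assumptions made for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port and return the certificate's 'notAfter' string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days from `now` until a 'notAfter' time such as
    'Jun  1 12:00:00 2030 GMT'. Negative once the certificate has expired."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def expiring_soon(inventory: dict[str, str], now: datetime,
                  warn_days: int = 30) -> list[str]:
    """Names in the (hypothetical) inventory that expire within `warn_days`,
    including any that have already expired."""
    return sorted(
        name for name, not_after in inventory.items()
        if days_until_expiry(not_after, now) <= warn_days
    )
```

You could build the inventory by calling `fetch_not_after` for each public endpoint, then run `expiring_soon` on a daily schedule and raise a ticket for every name it returns. Certificates that are not reachable over the network (for example, those embedded in software) would still need to be recorded by hand.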

2. Resilience testing

Most IT services are designed to cope with common types of failure. For example, you might be using RAID disks so that failure of one disk drive doesn’t cause your service to fail, or you may use multiple network paths between critical points, to ensure you can continue to operate if one network path fails.

Many organizations will have tested these resilience measures before going live. But just testing them once, before you go live, is not enough. If you don’t test your resilience measures regularly, then it’s possible that they won’t work when you need them. I’ve seen services where one disk drive failed, or one network path was decommissioned, and nobody noticed until the other one developed a fault. At that point the entire service failed catastrophically, because there was no resilience left.

Some organizations are so invested in maintaining good resilience that they intentionally inject random failures into their environment, to give them practice at managing routine failures. The best-known example of this is Netflix, who developed Chaos Monkey. This randomly disables computers in their production environment, and provides them with ample opportunities to practice managing the consequences, usually through complete automation of the failover and recovery.

Even if you aren’t brave enough to deploy Chaos Monkey, you still need a schedule for regular testing of your resilience measures, otherwise you can’t be sure they’ll work when you really need them.
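Between scheduled tests, it helps to check that your redundancy is still intact, so that a failed disk or a decommissioned network path is noticed before its partner fails. Here is a minimal Python sketch of that idea. `RedundancyGroup` and the health flags are hypothetical; in practice the flags would come from whatever monitoring you already have.

```python
from dataclasses import dataclass


@dataclass
class RedundancyGroup:
    """A set of interchangeable components, e.g. mirrored disks or network paths."""
    name: str
    health: dict[str, bool]  # component name -> True if it passed its last check
    required: int = 1        # components needed for the service to keep running


def spare_capacity(group: RedundancyGroup) -> int:
    """Healthy components beyond the minimum. Zero means the next fault is an outage."""
    return sum(group.health.values()) - group.required


def groups_without_spares(groups: list[RedundancyGroup]) -> list[str]:
    """Names of groups that have no resilience left, or have already failed."""
    return [g.name for g in groups if spare_capacity(g) <= 0]
```

A group that appears in `groups_without_spares` is still delivering service in most cases, which is exactly why it tends to go unnoticed without a check like this.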


3. Patch management

Software vendors release patches to fix errors, address security vulnerabilities, and provide new functionality. If you don’t have a robust process for discovering, evaluating, and installing these patches then you’re probably running out-of-date software on some of your systems. This leaves you open to security breaches, or at risk of failure from errors that the vendor has already fixed.

In the past it was quite normal to delay installation of patches for a few months, so that you could do thorough testing, and so that other organizations could install them first and discover any side effects. This strategy is now very risky for any security-related patches. The length of time from the release of a patch to the first attacks that exploit the vulnerability it fixes has reduced from months to days, or even hours.
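One simple way to keep this risk visible is to report every security patch that has been available for longer than your policy allows. The Python sketch below assumes a hypothetical `Patch` record and a 7-day limit; both are illustrative choices, not recommendations from any vendor.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Patch:
    patch_id: str
    released: date
    security: bool   # True if the patch addresses a security vulnerability
    installed: bool


def overdue_security_patches(patches: list[Patch], today: date,
                             max_days: int = 7) -> list[Patch]:
    """Uninstalled security patches released more than `max_days` ago, oldest first."""
    overdue = [
        p for p in patches
        if p.security and not p.installed
        and (today - p.released).days > max_days
    ]
    return sorted(overdue, key=lambda p: p.released)
```

The right value for `max_days` depends on how much testing you need and how exposed the affected systems are.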


4. Vulnerability and threat management

All IT operations, no matter how well run they are, have vulnerabilities and are subject to threats. But the ones that hurt the most are the ones you don’t know about, so you need to do what you can to be well informed.

Vulnerability management enables you to discover weaknesses in your applications and infrastructure, so that you can plan to deal with them. These vulnerabilities could include missing patches (see above), but might also include incorrect configurations, like a firewall port that’s been left open, or critical components, such as anti-virus controls, that have been disabled. You need a process to identify potential vulnerabilities and then scan all of your systems to see where you need to make improvements.
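As an illustration of the open firewall port case, the Python sketch below compares the ports that accept connections on a host with a list of ports that are supposed to be open. It is a minimal example under stated assumptions, not a substitute for a proper vulnerability scanner. The function names and the allowlist are hypothetical.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def unexpected_open_ports(host, allowed, candidates, probe=is_port_open):
    """Ports from `candidates` that are open on `host` but not in `allowed`.

    `probe` can be replaced, for example to reuse results from an existing scan.
    """
    return sorted(
        port for port in candidates
        if port not in allowed and probe(host, port)
    )
```

For a web server you might call `unexpected_open_ports(host, allowed={443}, candidates=range(1, 1025))` and investigate anything it returns.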

Threat management involves identifying potential threats that you might need to defend against. Often you’ll identify such threats because they’ve succeeded in environments similar to yours. We can all learn valuable lessons when bad things happen to us, but it’s probably better, and certainly cheaper, to learn from bad things that have already happened elsewhere. If you take the trouble to learn about threats as early as possible, you’re well placed to review and improve your defenses.

5. Backlog management

There are many different areas of IT operations that can build up a backlog of work. For example, you may have addressed your top five problems, but not done anything about the next most critical ones. You may have many low-priority incidents that you’ve made no progress in resolving and that are not being actively managed. You may have a backlog of potential software improvements waiting for a development team to have time to investigate them. If so, you’re not alone. Almost every organization has a backlog of improvement opportunities that it would act on if only it had the time and resources.

But if you don’t manage your backlogs then they continue to build up over time until they become completely unmanageable. This will certainly have an impact on your business; your reputation will suffer and so may your profit margins. Unresolved incidents lead to unhappy users and even more work dealing with their complaints. If you ignore low-priority software improvements, no one ever gets to experience their benefits. And if you persistently fail to resolve your low-priority problems, you’ll eventually end up with an increased number of incidents and reduced productivity.

The most common reason for backlogs to be ignored is that everyone is too busy dealing with things that seem more urgent, but if you make time for backlog management this will often free up time by reducing the number of incidents and complaints that you have to deal with.
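A good first step in backlog management is simply to measure the backlog. The Python sketch below shows one way to do that, assuming a hypothetical `Ticket` record exported from your service desk tool. The 30-day idle limit and the age bands are arbitrary choices for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Ticket:
    ticket_id: str
    opened: date
    last_updated: date


def stale_tickets(tickets: list[Ticket], today: date,
                  max_idle_days: int = 30) -> list[str]:
    """IDs of open tickets that nobody has updated for more than `max_idle_days`."""
    return [
        t.ticket_id for t in tickets
        if (today - t.last_updated).days > max_idle_days
    ]


def age_buckets(tickets: list[Ticket], today: date) -> dict[str, int]:
    """Count open tickets by how long ago they were opened."""
    buckets = {"0-30 days": 0, "31-90 days": 0, "over 90 days": 0}
    for t in tickets:
        age = (today - t.opened).days
        if age <= 30:
            buckets["0-30 days"] += 1
        elif age <= 90:
            buckets["31-90 days"] += 1
        else:
            buckets["over 90 days"] += 1
    return buckets
```

Tracking these numbers week by week shows whether the backlog is growing or shrinking, and the stale list gives you a concrete starting point for a regular review.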


6. Customer engagement

The role of an IT organization is not just to operate IT and deliver IT services. It’s to help people use IT systems and services to create value. You can only do that if you spend time talking to your customers about their experience of your services and use this input to improve how you work.

Spending time talking to your customers about how they use IT can help you to identify, and prioritize, the things you need to do to deliver great services.

7. Supplier reviews

Just as talking to customers can help to ensure you’re doing the right things for them, talking to your suppliers can help to ensure they’re doing the right things for you. After all, from their point of view YOU are the customer, and they need to talk to you to make sure they identify and improve the things they do to deliver great service to you!

Conclusion

IT operations is not just about managing incidents and problems. There are lots of things you need to do on a regular basis if you want to deliver high-quality services that delight your customers. I’ve given you seven examples here to help you get started, but you can probably add a few more to this list.

The more time you spend on managing your IT operations, the less time you will need for managing incidents and problems, and that will enable you to spend even more time improving your services. That’s a win-win situation that everyone can benefit from.

About the Author

Stuart Rance

Stuart is an ITSM and security consultant, trainer, and author who has worked with clients in many countries, helping them create business value for themselves and their customers. He was the author of the 2011 edition of ITIL® Service Transition and lead author of RESILIA™ Cyber Resilience best practice published in June 2015. Now that his children have all left home, he has plenty of time on his hands for contributing to our blog – lucky us!
