When a faulty CrowdStrike software update blue-screened millions of Windows computers around the world, it sent shockwaves through the tech industry and left many wondering what went wrong — and how to future-proof their own systems.
Brian Mathew, founder of Melbourne-based IT consultancy Computer Technicians, emphasises the importance of learning from this incident: “The CrowdStrike outage is a stark reminder of the complexities and vulnerabilities in our interconnected digital world. It is an opportunity to reassess what’s in place, strengthen our approach, and put in place robust backup and recovery plans.”
These are his key takeaways from the largest global outage to date.
1. Overreliance on Single Providers
The CrowdStrike incident highlighted the risks associated with overreliance on a single cybersecurity provider.
When CrowdStrike’s update failed (something most end users didn’t even know was happening in the background), it took millions of computers offline worldwide, disrupting critical services across various sectors. High-profile businesses from Sky News in Australia to Delta Air Lines in the USA were affected.
Lesson: Diversify your cybersecurity solutions. It may be tempting to use a single, comprehensive security suite, but a multi-layered implementation with solutions from different vendors may help reduce the risk of a single point of failure. There are trade-offs, though: more vendors usually means more cost and more complexity to manage.
2. Robust Testing Procedures
CrowdStrike attributed the outage to a bug in its automated content validation system, which allowed a problematic content update to pass validation checks and be pushed out to customers.
Lesson: Implement rigorous testing procedures, including automated and manual checks, before deploying any updates. Where possible, roll out large deployments gradually so that issues can be caught early. Oh, and don’t deploy on a Friday.
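To make the gradual rollout idea concrete, here is a minimal sketch in Python. It is not how CrowdStrike or any particular vendor deploys updates; the function name `staged_rollout`, the `deploy` and `is_healthy` callbacks, the stage sizes and the 2% failure threshold are all assumptions chosen for illustration.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # assumed stages
    max_failure_rate: float = 0.02,  # assumed threshold
) -> list[str]:
    """Deploy to progressively larger slices of hosts, halting on failures.

    Returns the hosts that received the update.
    """
    updated: list[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this stage.
        target_count = max(1, int(len(hosts) * fraction))
        batch = hosts[len(updated):target_count]
        if not batch:
            continue
        for host in batch:
            deploy(host)
        updated.extend(batch)
        # Check the health of the hosts just updated before widening the rollout.
        failures = sum(1 for host in batch if not is_healthy(host))
        if failures / len(batch) > max_failure_rate:
            # Stop here so the remaining hosts never receive the bad update.
            break
    return updated
```

With 100 hosts and the default stages, a bad update that breaks every machine would stop after the first host instead of reaching all 100.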
3. Resilience
According to Parametrix, the direct losses incurred by affected Fortune 500 companies are estimated at an eye-watering $5.4 billion. And most of that won’t be covered by insurance.
If ever you needed a reminder of the importance of building resilient systems that can withstand and recover from unexpected failures, this is it.
Lesson: Design your IT infrastructure with redundancy and fault tolerance in mind. Implement failover systems and disaster recovery plans to ensure business continuity in the face of major outages.
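At its simplest, failover means trying a backup when the primary system fails. The sketch below shows that pattern in Python under stated assumptions: `call_with_failover` is a hypothetical helper, and each "provider" is just a function standing in for a real primary or backup service.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order and return the first successful result."""
    errors: list[Exception] = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(exc)
    # Every provider failed, so surface all collected errors to the caller.
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")
```

A real deployment would add health checks, timeouts and alerting, but the principle is the same: no single component should be the only path to keeping the business running.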
4. Kernel-Level Access
In the wake of the incident, debate has been reignited about the risks associated with giving third-party software kernel-level access in operating systems like Windows.
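Microsoft pointed the finger at a 2009 agreement with the European Commission, which it says obliges it to give third-party security vendors the same level of access to Windows that its own products get. That is something Apple does not grant to third-party developers.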
Lesson: Carefully evaluate the need for kernel-level access in security software. Consider alternative solutions that don’t require such deep system integration, balancing security needs with system stability and resilience.
Conclusion
The CrowdStrike outage was a wake-up call for the entire IT industry, but it could have been far worse. It’s not a matter of if something will go wrong, but when.
Hopefully, there won’t be a repeat for a long time, and the lessons learned will stop future outages from being as widespread as July 2024’s Great Blue Screen of Death.