CrowdStrike released a preliminary post-incident review detailing how a faulty update took down 8.5 million Windows machines last week. The review blames a bug in its cloud-based Content Validator, which let a bad content update pass its checks before it was pushed out on Friday. CrowdStrike promises to improve content update testing and error handling and to roll out content in stages to prevent a repeat.
CrowdStrike’s Falcon software helps businesses detect and respond to malware and security breaches on millions of Windows machines. The problematic content configuration update, meant to gather telemetry on novel threat techniques, caused Windows to crash.
CrowdStrike issues updates in two ways: Sensor Content ships with releases of the Falcon sensor itself, which runs as a kernel driver on Windows, while Rapid Response Content is configuration data that tells the sensor what behavior to detect. A small 40KB Rapid Response Content file caused Friday’s crash.
Update: Our preliminary Post Incident Review (PIR) is available at the link below. Details include the incident overview, remediation actions, and preliminary learnings. More to come in our full Root Cause Analysis (RCA).
Automated recovery techniques, coupled with strategic…
— CrowdStrike (@CrowdStrike) July 24, 2024
Sensor Content, which includes the on-sensor AI and machine learning models, is not pushed dynamically from the cloud; it ships with sensor releases and builds up long-term detection capabilities. Rapid Response Content, the type delivered on Friday, consists of Template Instances: configuration that tells detection logic already present in the sensor which behaviors to match.
CrowdStrike’s cloud system runs validation checks on content before release. Because of a bug in the Content Validator, one Template Instance passed validation despite containing problematic data. The sensor then loaded the faulty content, and the interpreter performed an out-of-bounds memory read that raised an exception inside the kernel driver, which Windows cannot recover from, so the machine crashed with a BSOD.
To prevent recurrence, CrowdStrike will enhance Rapid Response Content testing with local developer testing, content update and rollback testing, stress testing, fuzzing, and fault injection. Stability and content interface testing will also be performed.
Additionally, CrowdStrike will improve its cloud-based Content Validator and add error handling to the Content Interpreter inside the Falcon sensor, so malformed content is rejected rather than crashing the driver. Staggered deployment of Rapid Response Content will roll updates out gradually to larger portions of the install base rather than pushing them to every system at once. Both hardening the driver against bad content and staging rollouts are measures security experts had urged after the outage.
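To make the two fixes concrete, here is an added example, written for this article and not taken from CrowdStrike’s code or from the review. The wire format, the function names and the sample files are all constructed for illustration; the real Rapid Response Content layout is not public. The example shows a cloud-side validator that misses a length check, a sensor-side interpreter that trusts the declared count and would read and write out of bounds on the bad file, and a hardened interpreter that repeats the bound checks itself and returns an error instead of faulting, which is the defense-in-depth point of adding error handling to the interpreter as well as fixing the validator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Constructed wire format for this example; it is NOT CrowdStrike's
 * channel-file layout. Byte 0 declares how many one-byte detection
 * parameters follow; the interpreter copies them into a fixed table. */
#define MAX_PARAMS 8u

typedef struct {
    uint8_t params[MAX_PARAMS];
    size_t  count;
} template_instance_t;

enum { CONTENT_OK = 0, CONTENT_TRUNCATED = -1, CONTENT_TOO_MANY = -2 };

/* Constructed sample content. GOOD is well formed; BAD declares 64
 * parameters in a 2-byte file; TRUNC declares 3 but carries only 1. */
static const uint8_t GOOD_CONTENT[]  = { 3, 0x10, 0x20, 0x30 };
static const uint8_t BAD_CONTENT[]   = { 64, 0xAA };
static const uint8_t TRUNC_CONTENT[] = { 3, 0x10 };

/* Cloud-side validator with a bug: it checks that the file exists and is
 * non-empty, but never compares the declared parameter count with the
 * bytes actually present or with the table the interpreter will fill. */
int validate_buggy(const uint8_t *content, size_t len)
{
    return (content != NULL && len > 0) ? CONTENT_OK : CONTENT_TRUNCATED;
}

/* Corrected validator: every field that will later drive a memory access
 * is checked against the real buffer length and the interpreter's limits. */
int validate_fixed(const uint8_t *content, size_t len)
{
    if (content == NULL || len < 1)
        return CONTENT_TRUNCATED;
    size_t n = content[0];
    if (n > MAX_PARAMS)
        return CONTENT_TOO_MANY;          /* would index past params[] */
    if (len < 1 + n)
        return CONTENT_TRUNCATED;         /* would read past the buffer */
    return CONTENT_OK;
}

/* "Before": the sensor-side interpreter trusts the declared count. Fed
 * BAD_CONTENT, the loop reads past the end of the 2-byte buffer and writes
 * past the end of params[]; inside a kernel driver that is a crash. It is
 * shown for contrast only and is never called on malformed content below.
 * It behaves correctly on well-formed files, which is how such a bug hides. */
int interpret_trusting(const uint8_t *content, size_t len, template_instance_t *out)
{
    (void)len;                            /* length is ignored: the bug */
    size_t n = content[0];
    for (size_t i = 0; i < n; i++)
        out->params[i] = content[1 + i];
    out->count = n;
    return CONTENT_OK;
}

/* "After": the interpreter re-runs the bound checks itself and refuses bad
 * content with an error code instead of faulting, so a validator bug
 * upstream cannot take the machine down. On failure the caller keeps
 * whatever content it was already running. */
int interpret_hardened(const uint8_t *content, size_t len, template_instance_t *out)
{
    int rc = validate_fixed(content, len); /* never rely on the cloud check alone */
    if (rc != CONTENT_OK) {
        out->count = 0;
        return rc;
    }
    size_t n = content[0];
    for (size_t i = 0; i < n; i++)
        out->params[i] = content[1 + i];
    out->count = n;
    return CONTENT_OK;
}

int main(void)
{
    template_instance_t t;

    printf("buggy validator on BAD:   %d (0 = passed, the outage path)\n",
           validate_buggy(BAD_CONTENT, sizeof BAD_CONTENT));
    printf("fixed validator on BAD:   %d\n",
           validate_fixed(BAD_CONTENT, sizeof BAD_CONTENT));
    printf("fixed validator on TRUNC: %d\n",
           validate_fixed(TRUNC_CONTENT, sizeof TRUNC_CONTENT));

    /* The trusting interpreter is only ever run on well-formed content here. */
    printf("trusting interp on GOOD:  %d, %zu params loaded\n",
           interpret_trusting(GOOD_CONTENT, sizeof GOOD_CONTENT, &t), t.count);

    /* The hardened interpreter rejects the malformed file without faulting. */
    printf("hardened interp on BAD:   %d\n",
           interpret_hardened(BAD_CONTENT, sizeof BAD_CONTENT, &t));
    printf("hardened interp on GOOD:  %d, %zu params loaded\n",
           interpret_hardened(GOOD_CONTENT, sizeof GOOD_CONTENT, &t), t.count);
    return 0;
}
```

The design point is that the sensor must treat content as untrusted input even though it comes from the vendor’s own cloud: the validator and the interpreter each check the bounds independently, so a bug in either one leaves the other to catch the bad file, and the failure mode becomes a rejected update rather than a kernel fault.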