Around the same time, Cloudflare’s chief technology officer Dane Knecht explained that a latent bug was responsible in an apologetic X post.
“In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack,” Knecht wrote, referring to a bug that went undetected in testing and has not caused a failure.



They probably mean that they did a change in a config file that is uploaded in their weekly or bi-weekly change window, and that that file was malformed for whichever reason that made the process that reads it crash. The main process depends on said process, and all the chain failed.
Things to improve:
I’m sure they have several of these and sometimes shit happens, but for something as critical as CloudFlare to not have automated integration tests in a testing environment before anything touches prod is pretty bad.
Fail open vs fail closed. Bot detection is a security feature. If the security feature fails, do you disable it and allow unchecked access to the client data? Or do you value Integrity over Availability
Imagine the opposite: they disable the feature and during that timeframe some customers get hacked. The hacks could have been prevented by the Bot detection (that the customer is paying for).
Yes, bot detection is not the most critical security feature and probably not the reason someone gets hacked but having “fail closed” as the default for all security features is absolutely a valid policy. Changing this policy should not be the lesson from this disasters.