Around the same time, Cloudflare’s chief technology officer Dane Knecht explained in an apologetic X post that a latent bug was responsible.

“In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack,” Knecht wrote, referring to a bug that went undetected in testing and had not previously caused a failure.

  • A_norny_mousse@feddit.org · 2 hours ago

    a routine configuration change

    Honest question (I don’t work in IT): this sounds like a contradiction, or at the very least a deliberately placating choice of words. Isn’t a config change the opposite of routine?

    • monkeyslikebananas2@lemmy.world · 2 hours ago

      Not really. Many teams have established processes where engineers make a change in reaction to, or in preparation for, some event. It’s easy to make a mistake during a change like that.

      • 123@programming.dev · 2 hours ago

        E.g.: companies that advertise during a large sporting event might preemptively scale up (or warm up, depending on the language) their servers in preparation for a large load increase after an ad airs or a coupon/promo code is mentioned; see the toy sketch below. Failing to capture the demand that could generate would be seen as wasted $$$.

        Edit: you can’t count on auto-scaling for non-essential products; people won’t come back if the website fails to load on the first attempt.
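
        A toy sketch of that kind of pre-event scale-up (the scale_service() helper and all the numbers are made up for illustration, not any real provider’s API):

        ```python
        # Hypothetical pre-scheduled scale-up around an ad slot; scale_service()
        # is a placeholder, not a real cloud/orchestrator API.
        from datetime import datetime, timedelta

        AD_SLOT = datetime(2025, 2, 9, 18, 30)  # made-up air time of the ad
        WARMUP = timedelta(minutes=30)          # start warming capacity early
        RUSH = timedelta(hours=2)               # expected promo-code rush window

        def desired_replicas(now: datetime) -> int:
            """How many instances we want running at a given moment."""
            if AD_SLOT - WARMUP <= now <= AD_SLOT + RUSH:
                return 50  # pre-warmed capacity for the spike
            return 5       # normal baseline

        def scale_service(count: int) -> None:
            # Placeholder: in practice this would call the provider's scaling API.
            print(f"scaling web tier to {count} instances")

        scale_service(desired_replicas(datetime.now()))
        ```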

      • NotMyOldRedditName@lemmy.world · 2 hours ago

        I don’t think it was a bug making the configuration change; I think there was a bug triggered as a result of that change.

        That specific combination of changes may not have been tested, or applied in production, for months, and it just happened that today was the first time it was needed since some earlier update, hence the “latent” part (toy illustration below).

        But they make changes like that routinely.
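
        Totally made-up illustration of what “latent” can look like here: the broken branch ships long before the incident but never runs until a routine config flip finally selects it.

        ```python
        # Hypothetical latent bug: the v2 branch was deployed long ago but
        # never executed until a routine config change selects it today.
        config = {
            "bot_scoring": "v2",   # routine change: flipped from "v1"
            "feature_list": None,  # upstream never populated this for v2
        }

        def score_request(cfg: dict) -> int:
            if cfg["bot_scoring"] == "v2":
                # Latent bug: assumes feature_list is always a list, so the
                # first time this path runs it blows up.
                return len(cfg["feature_list"])
            return 0  # the v1 path everyone has been exercising for months

        score_request(config)  # raises TypeError only after the "routine" flip
        ```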

        • monkeyslikebananas2@lemmy.world · 2 hours ago

          Yeah, I just read the postmortem. My response was more about the misconception that any configuration change is inherently non-routine.

    • Fushuan [he/him]@lemmy.blahaj.zone · 2 hours ago

      They probably mean that they changed a config file that gets uploaded in their weekly or bi-weekly change window, and that the file was malformed, for whatever reason, in a way that made the process that reads it crash. The main process depends on that process, so the whole chain failed.

      Things to improve:

      • Make the pipeline more resilient: if you have a “bot detection module” that expects a file, and that file is malformed, it shouldn’t take the whole thing down. If the bot detection module crashes, contain it, fire an alert, but keep accepting requests until it’s fixed.
      • Have a check on uploaded files to ensure nothing outside the expected values and format gets through: if a file doesn’t comply with the expected format, the upload fails and the prod environment doesn’t crash.
      • Have proper validation of updated config files so that if something is amiss, nothing crashes and the program makes a controlled decision: if the file is wrong, instead of crashing, the module returns an informed value and lets the main program decide whether to keep going (rough sketch after this list).
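
      A rough sketch of that fail-open reload, assuming a JSON config and made-up module names (not Cloudflare’s actual pipeline):

      ```python
      # Fail-open config reload: validate the new file first; if it's bad,
      # keep serving with the last known-good config and raise an alert.
      import json
      import logging

      log = logging.getLogger("bot_detection")

      def validate(cfg: dict) -> None:
          # Minimal shape check; a real schema would be much stricter.
          if not isinstance(cfg.get("rules"), list):
              raise ValueError("'rules' must be a list")

      def reload_config(path: str, last_good: dict) -> dict:
          try:
              with open(path) as f:
                  candidate = json.load(f)
              validate(candidate)
              return candidate  # only swap in a config that parses and validates
          except (OSError, ValueError) as exc:
              # json.JSONDecodeError subclasses ValueError, so it's covered too.
              log.error("bad config %r: %s -- keeping last good config", path, exc)
              return last_good  # fail open instead of crashing the module
      ```

      The point of the sketch is the design choice: a malformed file never replaces the config that is already serving traffic.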

      I’m sure they have several of these safeguards and sometimes shit happens, but for something as critical as Cloudflare, not having automated integration tests in a testing environment before anything touches prod is pretty bad.