Having high uptime is not the flex you think it is
You shouldn’t have uptime higher than 60 days
I tried telling this to my manager for years. He saw it as a “X days since we last had a problem and needed to reboot the server” and took pride in it.
We finally shut it down at over 5 years of uptime. Some docker containers had been running for 4 years straight.
Yes, that means what you think it does concerning update policies. Yes, the server and some containers were exposed to the internet. No, the backups were never tested.
Yeah these days a high uptime is a mark of shame, not a badge of honour.
Why, by the way?
If a device hasn’t been rebooted in a long time, there’s a much higher chance it won’t come back up after a reboot. That’s made worse by the fact that power loss is sometimes unexpected, so the outage can hit at the worst possible time.
The other issue is that a high-uptime device usually doesn’t have the latest updates installed. Delaying updates creates security issues, and when you finally do get around to updating, far more things change at once.
The flip side is that if you really know your stuff you can get away with fewer restarts, or even none. But then you pretty much have to understand every component and every update you apply while running in that untested state.
This is similar to bugs that go away on a restart. If you don’t know why, then you haven’t really fixed it, just rolled the dice again hoping it won’t reoccur.
As for updates, on regular systems you can update everything but the kernel. You do have to restart the affected services afterwards (often done automatically).
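As a minimal sketch of what that leaves behind (assuming a Debian/Ubuntu-style layout; the reboot-required marker and paths are distro-specific, not universal), something like this can show whether userland patching has left a reboot pending and whether the running kernel still matches what’s on disk:

    #!/usr/bin/env python3
    # Sketch, not a definitive tool: Debian/Ubuntu conventions are assumed.
    import os
    from pathlib import Path

    def reboot_flagged() -> bool:
        # Packages that want a reboot touch this marker file on Debian/Ubuntu.
        return Path("/var/run/reboot-required").exists()

    def kernels() -> tuple[str, list[str]]:
        running = os.uname().release  # the kernel the machine actually booted
        on_disk = sorted(p.name for p in Path("/lib/modules").iterdir() if p.is_dir())
        return running, on_disk

    if __name__ == "__main__":
        running, on_disk = kernels()
        print(f"running kernel : {running}")
        print(f"kernels on disk: {', '.join(on_disk)}")
        if reboot_flagged():
            print("reboot-required is set: patched userland is waiting on a reboot")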
Even on atomic systems you can switcheroo the subvolume underneath a running system.
Unfortunately the kernel is a pretty major component, so kernel updates are a valid reason to reboot. Definitely not as pressing as, say, nginx, sshd, or sudo though; kernel bugs bubbling up into an exposed attack surface is still quite unusual.
Maybe they’re kexec-ing.
Uptime is tracked by the kernel itself, so a kexec “soft-reboot” would still reset it.
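For what it’s worth, the figure everyone brags about is just the kernel’s own counter, the first field of /proc/uptime; a quick sketch:

    from datetime import timedelta

    # /proc/uptime: first field is seconds since this kernel booted, so any
    # boot into a new kernel image (kexec included) starts it from zero again.
    with open("/proc/uptime") as f:
        seconds = float(f.read().split()[0])

    print(f"up {timedelta(seconds=int(seconds))}")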
Modular custom single-program kernel running in a VM live migrated across a cluster?
No security revisions over multiple months?