Is there a daemon that will kill any processes using above a specified % of CPU? I’m having issues where a system is sometimes grinding to a halt due to high CPU usage. I’m not sure what process is doing it (can’t htop as system is frozen); ideally I’d like a daemon that automatically kills processes using more than a given % of CPU, and then logs what process it was for me to look back on later. Alternatively something that just logs processes that use a given % of CPU so that I may look back on it after restarting the system.
The system is being used as a server so it’s unattended a lot of the time; it’s not a situation where I did something on the computer and then CPU usage went up.
Edit: Thanks to the comments pointing out it might be a memory leak rather than CPU usage that’s the issue. I’ve set up earlyoom, which seems to have diagnosed the problem as a clamd memory leak. I’ve been running clamd on the server for ages without problems, so it might be the result of an update; I’ve disabled it for now and will keep monitoring to see if earlyoom catches anything else. If the problem keeps occurring I’ll try some of the other tools people have suggested.
Run your workload in a guest VM and limit its resources to whatever you desire. You can also consider cgroups if you already know which processes are causing all of the trouble.
CPU pressure generally isn’t crippling; the scheduler is pretty clever. I would look into other causes.
Never heard of something like that, and I suspect anyone who started creating it soon filed it under “Really bad ideas” alongside “Whoops, why did my kernel just stop?”
sar is the traditional way to watch for high-load processes, but do the basics first, as sar isn’t exactly trivial to get going. Things like running htop: not only will that give you a simple breakdown of memory usage (others have already pointed out swap load, which is very likely), it also lets you sort by CPU usage. htop is more than just a Linux taskmgr; it’s first-step triage for stuff like this.
The kernel has a way to assign resource limits to each and every process; try googling “Linux kernel limits” or “linux cgroup cpu limit”.
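For a concrete sketch of the cgroup route, assuming cgroup v2 mounted at /sys/fs/cgroup and a root shell; the group name and the limits here are just examples:

```
# create a group and cap it at 50% of one CPU (50ms of CPU time per 100ms period)
mkdir /sys/fs/cgroup/capped
echo "50000 100000" > /sys/fs/cgroup/capped/cpu.max

# optionally cap memory too
echo "1G" > /sys/fs/cgroup/capped/memory.max

# move the suspect process into the group (replace 12345 with its PID)
echo 12345 > /sys/fs/cgroup/capped/cgroup.procs

# if cpu.max/memory.max are missing, the controllers may need enabling first:
# echo "+cpu +memory" > /sys/fs/cgroup/cgroup.subtree_control
```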
The problem is knowing which process causes the load, but if you cannot even htop, then I doubt a daemon could do something.
if you cannot even htop, then I doubt a daemon could do something.
The point is that a daemon can catch it before it reaches that point, by killing processes that are using too many resources before all the system resources are used up.
It’s the task of your CPU scheduler to ensure your system doesn’t freeze, even at 100% CPU usage. If it’s completely unresponsive, it’s more likely you’re running out of memory instead.
If you dare, you can try temporarily killing the system’s swap (using the swapoff command) and see what happens. With no swap, the standard OOM reaper should trigger within a couple of minutes at most if it’s needed, and it should write an entry to the system log indicating which process it killed.

Note that the process killed is not necessarily the one causing the problem. I haven’t had the OOM killer trigger on me in many years (I normally run without swap), but the last time it did, it killed my main browser instance (which was holding a large but not increasing amount of memory at the time) rather than the gcc instance that was causing the memory pressure.
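Concretely, that experiment looks something like this (dmesg/journalctl are just the usual places the OOM killer logs to; the journalctl line only applies on a systemd box):

```
# turn off all swap (needs enough free RAM to absorb whatever is currently swapped out)
swapoff -a

# ...wait for the next incident, then look for OOM-killer entries:
dmesg -T | grep -i 'out of memory'
# or, on a systemd/journald system:
# journalctl -k | grep -i 'killed process'

# re-enable swap when you are done testing
swapon -a
```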
I used to use earlyoom on an old laptop and it worked well for my purposes.
I hear there is a systemd-oomd, but I never tried it.
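If you go the earlyoom route, a minimal invocation looks roughly like this (flags from memory, so double-check earlyoom --help; the thresholds and the sshd pattern are just examples):

```
# start killing when available RAM drops below 5% and free swap below 10%,
# and never pick sshd as the victim so you can still log in afterwards
earlyoom -m 5 -s 10 --avoid '^sshd$'
```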
Edit: sorry, I misread your post to be about memory rather than CPU. Too early in the morning for my brain to work.
Thanks. I’ve had a couple of comments suggesting that it might be a memory leak instead of CPU usage anyway, so I’ve installed earlyoom and we’ll see if that can diagnose the problem; if not, I’ll look into CPU solutions.
It’s not the CPU. All that will do is consume CPU and raise your energy bill.
An almost-complete lockup on Linux is basically always due to running out of memory and having to hit swap. A system can run at 100% CPU and still be usable, but when it hits 100% memory, it will not be usable. For a desktop system, that means keystrokes, if they are registered at all, won’t be registered until minutes have passed. For a server, it will mean all requests time out.
Unfortunately, Linux’s approach to memory management firstly allows this to happen and secondly fails to solve it once it does happen. What is supposed to happen is that the “OOM killer” wakes up and kills off a process to free up memory. That may theoretically happen if you left the machine on for a year, but what actually happens is that the amount of memory needed to run programs exceeds the amount of physical RAM, but swap is still available, so the OOM killer doesn’t give a shit. At this point many, many operations in programs are taking several orders of magnitude longer than they should do because instead of fetching a value from memory they need to:
- context switch to the kernel
- find some memory to write to disk, and write it
- find the requested memory on disk, and read it into memory
- context switch back to the process
So while your PC is running 100-1000x slower than it normally would, the OOM killer is doing nothing. If you manage to consume all your swap space, then, and only then, will the OOM killer wake up and kill something. It may kill the right thing, or it may not.
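For a rough sense of scale: an ordinary RAM access is on the order of 100 nanoseconds, a page-in from an SSD is on the order of 100 microseconds, and from a spinning disk closer to 10 milliseconds, so every access that has to go via disk costs somewhere between roughly a thousand and a hundred thousand normal memory accesses.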
The modern approach is to use a user-space OOM daemon which monitors memory and swap usage and aggressively kills processes before that happens. Unfortunately, this tends to result in killing your (high-memory) web browser, or the whole desktop session.
Sucks. Get more RAM for your server, maybe.
but what actually happens is that the amount of memory needed to run programs exceeds the amount of physical RAM, but swap is still available, so the OOM killer doesn’t give a shit.
Stop giving technical advice, you don’t know what you’re talking about.
No u.
If your system uses systemd you can set resource limits, like CPU, per process or per user.
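For example, something along these lines; the unit name is just a guess at what a clamd service might be called, and the numbers are arbitrary:

```
# cap an existing service at half a core and 1 GiB of RAM
systemctl set-property clamav-daemon.service CPUQuota=50% MemoryMax=1G

# or the same thing as a drop-in file, e.g.
# /etc/systemd/system/clamav-daemon.service.d/limits.conf:
#   [Service]
#   CPUQuota=50%
#   MemoryMax=1G
# followed by: systemctl daemon-reload && systemctl restart clamav-daemon
```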
Afraid I’m using OpenRC.
There’s an old-school tool called sar which can help you figure out what is causing the performance issues. Found a recent guide: Mastering sar in Linux: A Comprehensive Guide.
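Roughly what that looks like in practice; sar and pidstat come from the sysstat package, the background collector usually has to be enabled separately, and paths may differ on your distro:

```
sar -u 1 5    # overall CPU usage, 5 samples one second apart
sar -r 1 5    # memory and swap usage
sar -q 1 5    # load average and run-queue length

# per-process numbers, which is closer to what you want to log:
pidstat -u 60 >> /var/log/pidstat-cpu.log &   # per-process CPU every 60s
pidstat -r 60 >> /var/log/pidstat-mem.log &   # per-process memory every 60s
```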
high cpu usage isn’t going to make your system unusable. it’s probably consuming all your wired ram and thrashing your swap.
- Get some sort of resource monitor running on the machine to collect time-series data about your procs, preferably sent to another machine. Prometheus is simple enough, but SigNoz and Outrace are like DataDog alternatives if you want to go there. (A rough shell-only stopgap is sketched below.)
- Identify what’s running out of control. Check CPU and Memory (most likely a memory leak)
- Check logs to see if something is obviously wrong
- Look and see if there is an update for whatever the proc is that addresses this issue
- If it’s a systems process, set proper limits
In general, it’s not an out-of-control CPU that’s going to halt your machine, it’s memory exhaustion. If you have an out-of-control process taking too much memory, it should get OOM-killed by the kernel, but if you don’t have proper swap configured and not enough RAM, the kernel may not react in time to stop the machine from running out of memory and halting.
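If standing up Prometheus or the like feels like overkill for now, a rough stand-in is a snapshot script run from cron, so there is at least something on disk to read after the next freeze. The log path is made up and the sketch assumes the usual Linux procps ps:

```
#!/bin/sh
# append the current top CPU and memory consumers to a log;
# run from cron every minute or so
LOG=/var/log/proc-snapshots.log

{
    date
    echo "--- top CPU ---"
    ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n 6
    echo "--- top MEM ---"
    ps -eo pid,comm,%cpu,%mem --sort=-%mem | head -n 6
    echo
} >> "$LOG"
```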
Open a console with top/htop and check if it will be visible when the system halts.
From my experience it looks like an out-of-memory situation where some process starts swapping like crazy, or a faulty HDD that tries to read some part of the disk over and over again without success.
Open a console with top/htop and check if it will be visible when the system halts.
That would require me to have a second machine up all the time sshed in with htop open, no? Sometimes this happens on the server while I’m asleep and I don’t really want a second machine running 24/7.
You could set up a cron job that runs a script which picks the process with the highest CPU usage and kills it if it’s above the threshold, and run it every minute (cron won’t go finer than a minute; you’d need a loop inside the script to check more often).
Very rudimentary but should work.
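A sketch of what that script might look like; the threshold, log path, and plain SIGTERM are all arbitrary choices, and note that ps reports %CPU averaged over the whole process lifetime rather than the last few seconds, so it’s cruder than what top/htop show:

```
#!/bin/sh
# kill the single biggest CPU consumer if it exceeds a threshold, and log it
THRESHOLD=90                     # percent; pick your own
LOG=/var/log/cpu-killer.log      # made-up path

# pid and %cpu of the top CPU user (NR==2 is the first line after the header)
set -- $(ps -eo pid,%cpu --sort=-%cpu | awk 'NR==2 {print $1, $2}')
PID=$1; CPU=$2
NAME=$(ps -o comm= -p "$PID")

# integer comparison: strip the decimal part of the %cpu figure
if [ "${CPU%.*}" -ge "$THRESHOLD" ]; then
    echo "$(date): killing $NAME (pid $PID) at ${CPU}% CPU" >> "$LOG"
    kill "$PID"
fi
```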