Hi all, my system crashes when I am running software with a high GPU load (in this case an AI model). It also happens with games, after a random amount of time (I can play for about 30 minutes and then crash, or sometimes not crash at all).
Here is my journalctl log:
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Dumping IP State
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Dumping IP State Completed
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=618, emitted seq=620
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Process python pid 4571 thread python pid 5777
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: GPU reset begin!
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: device lost from bus!
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: [drm] device wedged, but recovered through reset
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: [drm] *ERROR* [CRTC:61:crtc-0] flip_done timed out
I tried to check the path /sys/class/drm/card1/device/devcoredump/data after reboot, but there isn’t anything there (in fact, the devcoredump folder doesn’t even exist).
My specs: GPU: RX 580, CPU: R5 5500 (I am on the latest version of my BIOS).
Is there anything I can do to diagnose the issue? Any help is appreciated. Thanks, everyone.
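On the missing devcoredump: as far as I know, the kernel holds a devcoredump in memory only and drops it after a few minutes, so it will not survive a reboot. Here is a minimal sketch of a watcher that copies the dump out as soon as it appears, assuming the system stays up long enough after the reset. The path is the one from the log above; `save_coredump`, `watch`, and the `coredumps` directory are hypothetical names for illustration.

```python
import shutil
import time
from pathlib import Path
from typing import Optional

# Path taken from the kernel log above; the card index may differ per system.
DUMP = Path("/sys/class/drm/card1/device/devcoredump/data")


def save_coredump(dump: Path, dest_dir: Path) -> Optional[Path]:
    """Copy the dump into dest_dir if it exists; return the new file or None."""
    if not dump.exists():
        return None
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / f"amdgpu-coredump-{int(time.time())}.bin"
    with dump.open("rb") as src, target.open("wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def watch(dump: Path = DUMP, dest_dir: Path = Path("coredumps"),
          interval: float = 5.0) -> Path:
    """Poll until a dump shows up, save it once, and return the saved path."""
    while True:
        saved = save_coredump(dump, dest_dir)
        if saved is not None:
            return saved
        time.sleep(interval)
```

Calling `watch()` from a root shell before starting the workload should leave a copy behind that can be attached to a bug report, provided the machine does not hard-lock first.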
It’s a hotspot. Check the junction temperatures; the average temps won’t reveal the problem. I have a card with this issue as well.
The solution is to monitor temperatures and reduce GPU utilization. I suggest a package that shows junction temps.
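If you would rather not install anything, the amdgpu driver exposes its sensors through hwmon in sysfs. Below is a rough sketch that reads every temperature sensor for a card and labels it. The `read_gpu_temps` helper is a made-up name, and which labels show up (`edge`, `junction`, `mem`) depends on the card; as far as I know, older cards may only expose `edge`.

```python
from pathlib import Path


def read_gpu_temps(card: Path) -> dict[str, float]:
    """Return {label: degrees Celsius} for each temp sensor under the card's hwmon dir."""
    temps = {}
    for temp_input in sorted(card.glob("device/hwmon/hwmon*/temp*_input")):
        label_file = temp_input.with_name(temp_input.name.replace("_input", "_label"))
        # Fall back to the file name when the driver provides no label.
        label = label_file.read_text().strip() if label_file.exists() else temp_input.name
        # hwmon reports temperatures in millidegrees Celsius.
        temps[label] = int(temp_input.read_text()) / 1000.0
    return temps
```

For example, `read_gpu_temps(Path("/sys/class/drm/card1"))` would match the card index from the log in the first post; adjust it for your system.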
I used to have this on my system - 5950X proc with 6700XT graphics card. Heating definitely would make it worse, but even with a big box fan blasting air into the cpu radiator I would get the occasional black screen and crash.
Turns out the RAM speed set in my BIOS was a tad bit too high. I turned it down from like 1400MHz to 1200MHz and haven’t had a single black screen since.
I can’t find the exact forum post that alerted me to the issue, but here are some more that seem to converge on this same root cause and fix:
https://forums.tomshardware.com/threads/screen-goes-black-after-gpu-overclock.3636714/ <- someone with the same GPU
https://www.reddit.com/r/Amd/comments/f53zzz/yet_another_black_screen_and_downclocking_fix_my/
I will take a look at my XMP profile, ty!
Try running a dedicated stress test tool and see what happens. The RX 580 is a pretty old GPU at this point so just hardware failure is always a possibility. If it’s a hardware issue it should fail pretty quickly when you fire up a test.
I tried the fur donut of death before, and yes, it did indeed crash my system after about 8 to 10 minutes, but some people said that is “normal”?
Your system should never crash under a stress test. Something is wrong, possibly physically.
Try underclocking your GPU and see if it’s just old, worn-out silicon that’s no longer stable. Or maybe try just turning the power limit down and see if that helps.
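For the power limit part, amdgpu exposes the cap through hwmon as `power1_cap` (in microwatts), with `power1_cap_min` and `power1_cap_max` as bounds. Here is a minimal sketch, assuming those files exist for your card; `reduced_power_cap` and `apply_power_cap` are hypothetical helper names, and writing the cap needs root.

```python
from pathlib import Path


def reduced_power_cap(hwmon: Path, fraction: float) -> int:
    """Return `fraction` of the current cap in microwatts, clamped to the allowed range."""
    current = int((hwmon / "power1_cap").read_text())
    lowest = int((hwmon / "power1_cap_min").read_text())
    highest = int((hwmon / "power1_cap_max").read_text())
    return max(lowest, min(highest, int(current * fraction)))


def apply_power_cap(hwmon: Path, microwatts: int) -> None:
    """Write the new cap to the hwmon directory; requires root on a real system."""
    (hwmon / "power1_cap").write_text(str(microwatts))
```

Note that the fraction is taken from the current cap, so running it twice lowers the limit twice. The setting does not persist across reboots, which makes it a fairly safe thing to experiment with.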
I see. I did do some light underclocking/undervolting. I guess I will do more.
As a first step, I’d revert to stock clocks/voltages and see what happens.
I have the issue at the default clock/voltage. I underclocked and undervolted to see if it would help (it didn’t). So I will try to do even more (honestly, I just want it to stop crashing, even if it reduces my performance by like 10%).
If your GPU is running in the mid 80s then temps aren’t an issue, and undervolting will probably only make the issue worse. Try only underclocking, leave the voltages stock.
Heat or power. Are you SURE your PSU is healthy enough to handle all your components PLUS this card at full utilization?
Check your temps as well.
It happens consistently when my GPU is at about 82C, so I get that it could be because of heat with the AI model. The weird thing is that the issue still exists in games even when I have frame gen on (so my GPU temp is always about 70C). I am not sure whether my PSU is defective; if it is not, it is a 650W PSU by spec, which should be more than enough for my system.
Well thank you for broaching, because this was my next question: are you trying to run SlopStacks on this?
If you’re not familiar, there are catchall protections for ROCm inference run on unprotected instances that allow memory to be consumed beyond the system limits.
Don’t run LLM junk on your base machine, just run it in a container with some healthy limits. This isn’t because your machine is bad or anything, it’s because you don’t want it interacting with your gaming driver software as they conflict. It will act as a natural trip to kill the container if it’s acting up and help the host stay alive without bleeding into the other stacks that will pollute your running kernel.
Think of it like a canary or firewall to your main environment. If the container keeps dying as well, you’re overstepping your hardware limits and need to pare things back a step or two.
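To make the "container with some healthy limits" idea concrete, here is a rough sketch that builds a podman (or docker) command line with a memory cap and the GPU render devices passed through. The image name and model command in the usage note are placeholders, not real ones, and `container_command` is a hypothetical helper; `--rm`, `--memory`, and `--device` are standard `run` options in both tools.

```python
def container_command(image: str, args: list[str], memory: str = "8g",
                      runtime: str = "podman") -> list[str]:
    """Build an argv list for running `image` with a memory cap and /dev/dri access."""
    return [
        runtime, "run", "--rm",
        "--memory", memory,        # container is killed if it exceeds this
        "--device", "/dev/dri",    # expose the GPU nodes that Vulkan needs
        image, *args,
    ]
```

Passing the result to `subprocess.run(...)`, for instance `container_command("my-llm-image", ["run-model"])`, would start the container; if it keeps getting killed, the workload is asking for more than the limit you set.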
Hi, I am not running it with ROCm (in fact, my GPU is not supported by ROCm). I am using Vulkan to run the model. The reason I am not using any kind of container is that I don’t know how to do it. I am just playing around with this stuff for the first time, so I am not experienced yet. Also, by container you mean something like docker/podman, right?
ROCm is still the API that Vulkan interfaces with.
If you share the specifics of the model you’re trying to run, I can point you in the right direction, but honestly anything will have beginner tutorials for running in a container in the docs at this point.
I have an AMD Ryzen based system. I used to have this issue; it was caused by the CPU overheating. I think the fix was installing auto-cpufreq (Arch), but I tried several things and am not sure exactly what did it. I also cut a hole in the desk cabinet I keep the computer in and installed a fan - increasing airflow may have helped. Anyway, I haven’t had any crashes since I got it to stop overheating. Whatever defaults Arch came with weren’t sufficient to prevent overheating; I’d bet dollars to donuts that’s your issue, too.
Can you tell me the name of the Arch package? I am not able to find it in the Arch repo.
That’s because it isn’t in the repo… (https://aur.archlinux.org/packages/auto-cpufreq / https://github.com/AdnanHodzic/auto-cpufreq)
Search for similar issues on https://gitlab.freedesktop.org/drm/amd/ and also report it there; actual AMD developers can help you there.
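When searching or filing a report, it helps to know exactly which ring timed out. Here is a small sketch that pulls those details out of kernel log text, based on the line format in the log from the first post; `find_ring_timeouts` is a hypothetical helper name, and other kernel versions may word the message differently.

```python
import re

# Matches lines like:
# "amdgpu 0000:10:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=618, emitted seq=620"
RING_TIMEOUT = re.compile(
    r"amdgpu (?P<pci>[0-9a-f:.]+): amdgpu: ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(log_text: str) -> list[dict[str, str]]:
    """Return one dict per ring timeout found in the given log text."""
    return [match.groupdict() for match in RING_TIMEOUT.finditer(log_text)]
```

You can feed it the output of `journalctl -k -b -1` (kernel messages from the previous boot) and use the ring name, here `comp_1.1.1`, as a search term on the issue tracker.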