Hi all, my system crashes when I am running software with a high GPU load (in this case an AI model). It also happens with games, after a random amount of time (I can play for about 30 minutes and then crash, or sometimes not crash at all).

Here is my journalctl log:

Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Dumping IP State
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Dumping IP State Completed
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=618, emitted seq=620
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu:  Process python pid 4571 thread python pid 5777
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: GPU reset begin!
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: device lost from bus!
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: [drm] device wedged, but recovered through reset
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: [drm] *ERROR* [CRTC:61:crtc-0] flip_done timed out
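
For anyone comparing several crashes, here is a minimal sketch that pulls the ring name, sequence numbers, and owning process out of lines like the ones above. The function name `summarize_timeouts` is just an illustration, and the regular expressions assume the exact message wording shown in this log, which may differ between kernel versions.

```python
import re

# Matches amdgpu lines such as:
#   "... amdgpu: ring comp_1.1.1 timeout, signaled seq=618, emitted seq=620"
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
# Matches the follow-up line naming the process that owned the hung job.
PROCESS = re.compile(r"Process (?P<name>\S+) pid (?P<pid>\d+)")


def summarize_timeouts(lines):
    """Return one dict per ring timeout, with the owning process if logged."""
    events = []
    for line in lines:
        m = RING_TIMEOUT.search(line)
        if m:
            events.append({
                "ring": m["ring"],
                "signaled": int(m["signaled"]),
                "emitted": int(m["emitted"]),
                "process": None,
            })
            continue
        m = PROCESS.search(line)
        # Attach the process to the most recent timeout that lacks one.
        if m and events and events[-1]["process"] is None:
            events[-1]["process"] = (m["name"], int(m["pid"]))
    return events


sample = [
    "Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: "
    "ring comp_1.1.1 timeout, signaled seq=618, emitted seq=620",
    "Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu:  "
    "Process python pid 4571 thread python pid 5777",
]
events = summarize_timeouts(sample)
print(events)
```

The `comp_` prefix suggests the hang was on a compute queue rather than the graphics ring, which would fit a Python process running a model. Assuming the journal is persistent, the input could come from `journalctl -k -b -1`, which prints kernel messages from the previous boot.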

I tried to check the path /sys/class/drm/card1/device/devcoredump/data after rebooting, but there isn’t anything there (in fact, the devcoredump folder doesn’t even exist).
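
The missing folder is expected: a devcoredump is held in kernel memory, so it does not survive a reboot, and the kernel also discards it on its own after a few minutes. It has to be copied out while the machine is still up (over SSH, for example, if the display is frozen). Below is a minimal sketch of that copy step. The function name `save_devcoredumps` and the destination directory are hypothetical, and reading the file will most likely need root.

```python
import glob
import shutil
import time
from pathlib import Path


def save_devcoredumps(dest_dir, sysfs_root="/sys/class/drm"):
    """Copy every exposed devcoredump 'data' file under sysfs_root to dest_dir.

    Returns the list of files written. An empty list means the kernel is
    not currently exposing any dump.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    pattern = str(Path(sysfs_root) / "card*" / "device" / "devcoredump" / "data")
    for src in sorted(glob.glob(pattern)):
        card = Path(src).parts[-4]  # e.g. "card1"
        stamp = time.strftime("%Y%m%d-%H%M%S")
        target = dest / f"{card}-{stamp}.devcoredump"
        # Stream the contents; sizes reported for sysfs files are unreliable.
        with open(src, "rb") as fin, open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        saved.append(target)
    return saved


# Example call (hypothetical destination, likely needs root):
# save_devcoredumps("/root/amdgpu-dumps")
```

The `sysfs_root` parameter is only there so the function can be tried against a scratch directory before trusting it on a real crash.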

My specs: GPU: RX 580, CPU: R5 5500 (I am on the latest version of my BIOS)

Is there anything I can do to diagnose the issue? Any help is appreciated. Thanks, everyone.

  • just_another_person@lemmy.world
    2 days ago

    Well, thank you for broaching it, because this was my next question: are you trying to run SlopStacks on this?

    If you’re not familiar, there are catch-all protections for ROCm-run inference on unprotected instances that allow memory to be consumed beyond the system limits.

    Don’t run LLM junk on your base machine; just run it in a container with some healthy limits. This isn’t because your machine is bad or anything. It’s because you don’t want it interacting with your gaming driver software, as they conflict. The container will act as a natural tripwire: it gets killed if it’s acting up, which helps the host stay alive without bleeding into the other stacks that will pollute your running kernel.

    Think of it like a canary or firewall for your main environment. If the container keeps dying as well, you’re overstepping your hardware limits and need to pare things back a step or two.
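
As a rough illustration of "a container with some healthy limits", here is a minimal sketch that only builds (does not run) a `podman run` command with a RAM cap and the GPU device nodes passed through. The helper name `container_command`, the image name, and the model command are hypothetical placeholders. `--rm`, `--memory`, and `--device` are standard podman/docker options.

```python
import shlex


def container_command(image, model_cmd, memory="8g", runtime="podman"):
    """Build (not run) a container command with a RAM cap and GPU access."""
    return [
        runtime, "run", "--rm",
        "--memory", memory,       # cap on system RAM for the container
        "--device", "/dev/dri",   # expose the GPU device nodes for Vulkan
        image,
        *model_cmd,
    ]


# "localhost/my-llm-image" and "run-model" are hypothetical placeholders.
cmd = container_command("localhost/my-llm-image", ["run-model", "--help"])
print(shlex.join(cmd))
```

One caveat, as far as I know: `--memory` limits system RAM, not VRAM, and a hang inside the kernel GPU driver is not contained by the container, so this helps with runaway memory use more than with ring timeouts.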

    • Kiuyn@lemmy.mlOP
      2 days ago

      Hi, I am not running it with ROCm (in fact, my GPU is not supported by ROCm). I am using Vulkan to run the model. The reason I am not using any kind of container is that I don’t know how to do it. I am just playing around with this stuff for the first time, so I am not experienced yet. Also, by container you mean something like Docker/Podman, right?

      • just_another_person@lemmy.world
        2 days ago

        Vulkan doesn’t go through ROCm, but both end up in the same amdgpu kernel driver, so the advice still applies.

        If you share the specifics of the model you’re trying to run, I can point you in the right direction, but honestly, at this point almost anything will have a beginner’s tutorial in its docs for running in a container.