Hi all, I get a crash when I’m running software with a high GPU load (in this case, an AI model). It also happens with games, after a random amount of time: I can play for about 30 minutes and then crash, or sometimes it doesn’t crash at all.

Here is my journalctl log:

Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Dumping IP State
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Dumping IP State Completed
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=618, emitted seq=620
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu:  Process python pid 4571 thread python pid 5777
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: GPU reset begin!
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: device lost from bus!
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: [drm] device wedged, but recovered through reset
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: [drm] *ERROR* [CRTC:61:crtc-0] flip_done timed out
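Because journalctl already keeps these messages, a small parser can pull the important parts out after each crash so they are easy to compare. This is a minimal sketch. It assumes the message format shown above stays the same and that the journal is persistent, since otherwise `journalctl -b -1` has no previous boot to read. My reading of the ring line is that `emitted seq` is the last submission queued on that ring and `signaled seq` is the last one the GPU reported as finished. On that reading, 620 - 618 = 2 submissions from the `python` process were still outstanding when the timeout fired. The function names are made up for this example.

```python
import re
import subprocess
import sys

# Matches lines such as:
#   amdgpu: ring comp_1.1.1 timeout, signaled seq=618, emitted seq=620
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)
# Matches the follow-up line naming the process that owned the stuck work.
PROCESS = re.compile(r"Process (?P<proc>\S+) pid (?P<pid>\d+)")


def parse_ring_timeouts(log_text):
    """Return one dict per amdgpu ring timeout found in a kernel log excerpt."""
    events = []
    for line in log_text.splitlines():
        m = RING_TIMEOUT.search(line)
        if m:
            signaled = int(m["signaled"])
            emitted = int(m["emitted"])
            events.append({
                "ring": m["ring"],
                "signaled": signaled,
                "emitted": emitted,
                # Submissions queued on the ring that never signaled completion.
                "pending": emitted - signaled,
                "process": None,
                "pid": None,
            })
            continue
        p = PROCESS.search(line)
        if p and events and events[-1]["process"] is None:
            # Attach the process line to the most recent timeout.
            events[-1]["process"] = p["proc"]
            events[-1]["pid"] = int(p["pid"])
    return events


def previous_boot_kernel_log():
    """Kernel messages from the previous boot (needs a persistent journal)."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    try:
        for event in parse_ring_timeouts(previous_boot_kernel_log()):
            print(event)
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"could not read the previous boot's log: {exc}", file=sys.stderr)
```

Run it after the next crash and reboot. Each printed dict shows the ring, how many submissions were outstanding, and which process owned them. That makes it easier to see whether the crashes line up with the model run, with a game, or with both.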

After rebooting, I tried to check /sys/class/drm/card1/device/devcoredump/data, but there was nothing there (in fact, the devcoredump folder doesn’t even exist).
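The devcoredump is held only in memory, so a reboot removes it. That would explain the missing folder. I believe the kernel also frees it by itself after a few minutes. One workaround is to poll for the dump and copy it to disk while the machine is still up. This is a minimal sketch. It assumes the `card1` path from the log above and root privileges, because the `data` file is normally readable only by root. The destination directory and the function names are hypothetical. As far as I understand, writing to the `data` file tells the kernel to release the dump, and the sketch relies on that behavior.

```python
import shutil
import time
from datetime import datetime
from pathlib import Path

# Path from the kernel log above; the card number may differ on other systems.
DUMP = Path("/sys/class/drm/card1/device/devcoredump/data")
# Hypothetical destination; any directory on a real disk will do.
OUT_DIR = Path.home() / "amdgpu-dumps"


def save_dump(src=DUMP, out_dir=OUT_DIR):
    """Copy an in-memory devcoredump to disk; return the new file or None."""
    if not src.exists():
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    dest = out_dir / f"devcoredump-{datetime.now():%Y%m%d-%H%M%S}.bin"
    try:
        with src.open("rb") as fin, dest.open("wb") as fout:
            shutil.copyfileobj(fin, fout)
    except FileNotFoundError:
        # The dump disappeared between the existence check and the copy.
        return None
    return dest


def watch(interval=5.0):
    """Poll forever, saving each new dump and then asking the kernel to free it."""
    while True:
        saved = save_dump()
        if saved is not None:
            print(f"saved {saved}", flush=True)
            try:
                # Writing to the data file releases the dump, so the same
                # dump is not copied again on the next poll.
                DUMP.write_text("1")
            except FileNotFoundError:
                pass  # Already released.
        time.sleep(interval)
```

Start `watch()` as root before launching the model or a game. If the whole system locks up instead of recovering through the reset, the script may not get a chance to run.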

My specs: GPU: RX 580, CPU: Ryzen 5 5500 (I am on the latest version of my BIOS).

Is there anything I can do to diagnose the issue? Any help is appreciated. Thanks, everyone.

  • Kiuyn@lemmy.ml (OP) · 2 days ago

    Hi, I am not running it with ROCm (in fact, my GPU isn’t supported by ROCm). I am using Vulkan to run the model. The reason I am not using any kind of container is that I don’t know how to set one up. I am just playing around with this stuff for the first time, so I am not experienced yet. Also, by container you mean something like Docker/Podman, right?

    • just_another_person@lemmy.world · 2 days ago

      ROCm is still the API that Vulkan interfaces with.

      If you share the specifics of the model you’re trying to run, I can point you in the right direction. Honestly, though, almost anything has a beginner tutorial for running it in a container in its docs at this point.