Hi all, my system crashes when I run software that puts a high load on the GPU (in this case, an AI model). It also happens with games, though only after a random amount of time: I can play for about 30 minutes before a crash, or sometimes it doesn’t crash at all.

Here is my journalctl log:

Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Dumping IP State
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: Dumping IP State Completed
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=618, emitted seq=620
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu:  Process python pid 4571 thread python pid 5777
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: GPU reset begin!
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: device lost from bus!
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: [drm] device wedged, but recovered through reset
Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: [drm] *ERROR* [CRTC:61:crtc-0] flip_done timed out
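
To compare crashes across boots, it can help to pull the ring-timeout lines out of a saved kernel log (for example one written with `journalctl -k -b -1 > kernel.log`, which only works if the journal is persistent). The sketch below is a rough example, not an official tool: the function name `find_ring_timeouts` is made up for illustration, and the regex is based only on the message format shown in the log above.

```python
import re

# Pattern for the amdgpu "ring ... timeout" kernel message as it appears in the log above.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(log_text):
    """Return one dict per ring-timeout line found in the log text."""
    hits = []
    for line in log_text.splitlines():
        match = RING_TIMEOUT.search(line)
        if match:
            hits.append({
                "ring": match.group("ring"),
                "signaled": int(match.group("signaled")),
                "emitted": int(match.group("emitted")),
            })
    return hits


# Sample line copied from the log in this post.
sample = (
    "Oct 20 12:57:18 Linux kernel: amdgpu 0000:10:00.0: amdgpu: "
    "ring comp_1.1.1 timeout, signaled seq=618, emitted seq=620"
)
print(find_ring_timeouts(sample))
```

If the same ring (here a compute ring, `comp_1.1.1`) shows up in every crash, that is worth mentioning in a bug report.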

I tried to check the path /sys/class/drm/card1/device/devcoredump/data after rebooting, but there isn’t anything there (in fact, the devcoredump folder doesn’t even exist).
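
A note on the missing dump: as far as I understand, devcoredump data lives in kernel memory and is dropped after a few minutes or on reboot, so it has to be copied while the system is still up. Below is a minimal sketch of a watcher that could be left running before starting the workload. The function name `save_coredump` and the destination directory are hypothetical, and if the machine hard-locks the script may never get the chance to copy anything.

```python
import shutil
import time
from pathlib import Path


def save_coredump(src, dest_dir, poll_seconds=1.0, max_polls=None):
    """Poll until src exists, then copy it into dest_dir.

    Returns the path of the copy, or None if max_polls ran out first.
    With max_polls=None the function polls indefinitely.
    """
    src, dest_dir = Path(src), Path(dest_dir)
    polls = 0
    while max_polls is None or polls < max_polls:
        if src.exists():
            dest_dir.mkdir(parents=True, exist_ok=True)
            dest = dest_dir / f"amdgpu-coredump-{int(time.time())}.bin"
            # Stream the contents; sysfs files do not always report a useful size.
            with src.open("rb") as fin, dest.open("wb") as fout:
                shutil.copyfileobj(fin, fout)
            return dest
        polls += 1
        time.sleep(poll_seconds)
    return None
```

Usage would look something like `save_coredump("/sys/class/drm/card1/device/devcoredump/data", "/var/tmp/gpu-dumps")`, run as root, where the source path is the one from the kernel message and the destination is just an example.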

My specs: GPU: RX 580, CPU: Ryzen 5 5500 (I am on the latest version of my BIOS)

Is there anything I can do to diagnose the issue? Any help is appreciated. Thanks, everyone.

  • Kiuyn@lemmy.mlOP
    2 days ago

    I tried the “fur donut of death” stress test before, and yes, it did crash my system after about 8 to 10 minutes, but some people said that is “normal”?

    • fuckwit_mcbumcrumble@lemmy.dbzer0.com
      2 days ago

      Your system should never crash under a stress test. Something is wrong, possibly physically.

      Try underclocking your GPU and see if it’s just old, worn-out silicon that’s no longer stable. Or maybe try just turning the power limit down and see if that helps.
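
For the power-limit suggestion, amdgpu exposes the cap through hwmon as `power1_cap` (in microwatts), alongside `power1_cap_min` and `power1_cap_max`. The sketch below computes a lowered cap and only writes it when asked to. The helper names are made up for illustration, and the hwmon directory (something like /sys/class/drm/card1/device/hwmon/hwmonN, where N varies per system) is an assumption you would need to check on your own machine. Writing the file needs root.

```python
from pathlib import Path


def reduced_power_cap(current_uw, percent_cut, min_uw=0):
    """Return the cap in microwatts after cutting it by percent_cut percent."""
    if not 0 <= percent_cut < 100:
        raise ValueError("percent_cut must be in [0, 100)")
    new_cap = current_uw * (100 - percent_cut) // 100
    # Never go below the minimum the driver reports.
    return max(new_cap, min_uw)


def lower_power_cap(hwmon_dir, percent_cut, dry_run=True):
    """Read power1_cap from an amdgpu hwmon directory and optionally lower it."""
    hwmon = Path(hwmon_dir)
    current = int((hwmon / "power1_cap").read_text())
    minimum = int((hwmon / "power1_cap_min").read_text())
    target = reduced_power_cap(current, percent_cut, minimum)
    if not dry_run:
        # Requires root; the setting does not persist across reboots.
        (hwmon / "power1_cap").write_text(str(target))
    return target
```

Starting with `dry_run=True` shows the value that would be written, so a 10% cut can be sanity-checked before applying it.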

      • Kiuyn@lemmy.mlOP
        2 days ago

        I see. I already did a light underclock/undervolt. I guess I will go further.

          • Kiuyn@lemmy.mlOP
            2 days ago

            I have the issue at the default clock/voltage. I underclocked and undervolted to see if it would help (it didn’t), so I will try going even further. (Honestly, I just want it to stop crashing, even if that reduces my performance by about 10%.)

            • fuckwit_mcbumcrumble@lemmy.dbzer0.com
              2 days ago

              If your GPU is running in the mid-80s, then temps aren’t an issue, and undervolting will probably only make the issue worse. Try only underclocking and leave the voltages stock.
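
One way to underclock without touching voltages on amdgpu is to restrict the allowed core clock levels: set `power_dpm_force_performance_level` to `manual`, then write the allowed level indices to `pp_dpm_sclk` (both live in the card's device directory under /sys/class/drm and need root). The sketch below only works out which indices to write. The function name is made up, and the sample text is an invented example of what `pp_dpm_sclk` might print, not real output from an RX 580.

```python
import re


def levels_at_or_below(pp_dpm_sclk_text, max_mhz):
    """Return the indices of DPM levels whose clock is at most max_mhz."""
    levels = []
    for line in pp_dpm_sclk_text.splitlines():
        # Lines look like "3: 1145Mhz" with an optional "*" on the active level.
        match = re.match(r"\s*(\d+):\s*(\d+)\s*mhz", line, re.IGNORECASE)
        if match and int(match.group(2)) <= max_mhz:
            levels.append(int(match.group(1)))
    return levels


# Invented sample; read the real file on your own system instead.
sample = (
    "0: 300Mhz\n1: 600Mhz\n2: 900Mhz\n3: 1145Mhz\n"
    "4: 1215Mhz\n5: 1257Mhz\n6: 1300Mhz\n7: 1340Mhz *\n"
)
mask = " ".join(str(i) for i in levels_at_or_below(sample, 1220))
print(mask)  # 0 1 2 3 4
```

The resulting string is what would be written to `pp_dpm_sclk`, which keeps the card from boosting into the top levels while leaving the stock voltage table alone.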