Woke up today to the homeserver being unresponsive. Couldn’t SSH, no video out when I connected a monitor, and even the reset button didn’t do anything. Had to hold the power button to shut it down.

/var/log/syslog doesn’t show anything interesting, other than that the issue happened just after 4am. Log:

2026-02-27T03:55:01.481794-08:00 blackbox CRON[1743418]: (www-data) CMD (/usr/bin/php8.3 /mnt/MONSTERDRIVE/pixelfeddata/pixelfed/artisan schedule:run >> /dev/null 2>&1)
2026-02-27T04:00:00.198504-08:00 blackbox smartd[2126]: Device: /dev/sdd [SAT], CHECK POWER STATUS spins up disk (0x81 -> 0xff)
2026-02-27T04:00:00.291853-08:00 blackbox systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
2026-02-27T04:00:00.298344-08:00 blackbox systemd[1]: sysstat-collect.service: Deactivated successfully.
2026-02-27T04:00:00.298523-08:00 blackbox systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
2026-02-27T04:00:00.299608-08:00 blackbox kernel: kauditd_printk_skb: 8 callbacks suppressed
2026-02-27T04:00:00.299613-08:00 blackbox kernel: audit: type=1130 audit(1772193600.298:798916): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=sysstat-collect comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
2026-02-27T04:00:00.299615-08:00 blackbox kernel: audit: type=1131 audit(1772193600.298:798917): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=sysstat-collect comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
2026-02-27T04:00:01.923610-08:00 blackbox kernel: audit: type=1101 audit(1772193601.922:798918): pid=1744810 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='op=PAM:accounting grantors=pam_permit acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T04:00:01.923614-08:00 blackbox kernel: audit: type=1103 audit(1772193601.922:798919): pid=1744810 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='op=PAM:setcred grantors=pam_permit,pam_cap acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T04:00:01.923615-08:00 blackbox kernel: audit: type=1006 audit(1772193601.922:798920): pid=1744810 uid=0 subj=unconfined old-auid=4294967295 auid=33 tty=(none) old-ses=4294967295 ses=50544 res=1
2026-02-27T04:00:01.923615-08:00 blackbox kernel: audit: type=1300 audit(1772193601.922:798920): arch=c000003e syscall=1 success=yes exit=2 a0=7 a1=7fff81d75200 a2=2 a3=0 items=0 ppid=2654 pid=1744810 auid=33 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=50544 comm="cron" exe="/usr/sbin/cron" subj=unconfined key=(null)
2026-02-27T04:00:01.923616-08:00 blackbox kernel: audit: type=1327 audit(1772193601.922:798920): proctitle=2F7573722F7362696E2F43524F4E002D66002D50
2026-02-27T04:00:01.924259-08:00 blackbox CRON[1744811]: (www-data) CMD (/usr/bin/php8.3 /mnt/MONSTERDRIVE/pixelfeddata/pixelfed/artisan schedule:run >> /dev/null 2>&1)
2026-02-27T04:00:01.924614-08:00 blackbox kernel: audit: type=1105 audit(1772193601.923:798921): pid=1744810 uid=0 auid=33 ses=50544 subj=unconfined msg='op=PAM:session_open grantors=pam_loginuid,pam_env,pam_env,pam_permit,pam_umask,pam_unix,pam_limits acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T04:00:01.925610-08:00 blackbox kernel: audit: type=1110 audit(1772193601.924:798922): pid=1744811 uid=0 auid=33 ses=50544 subj=unconfined msg='op=PAM:setcred grantors=pam_permit,pam_cap acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T04:00:02.357616-08:00 blackbox kernel: audit: type=1104 audit(1772193602.356:798923): pid=1744810 uid=0 auid=33 ses=50544 subj=unconfined msg='op=PAM:setcred grantors=pam_permit acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T09:23:35.786375-08:00 blackbox systemd-modules-load[904]: Inserted module 'dm_multipath'
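
The jump from 04:00:02 to 09:23:35 in the excerpt above marks the hang window. In a longer syslog, a small script can find that kind of gap and show the last line written before the system went quiet. This is a minimal sketch, assuming every line of interest starts with an ISO 8601 timestamp like the ones above; the function name `largest_gap` is made up for illustration.

```python
from datetime import datetime


def largest_gap(lines):
    """Find the biggest jump between consecutive timestamped lines.

    Returns (gap_seconds, last_line_before, first_line_after),
    or None if fewer than two lines carry a parseable timestamp.
    """
    best = None
    prev_ts = None
    prev_line = None
    for line in lines:
        # The timestamp is the first space-separated field on each line.
        stamp = line.split(" ", 1)[0]
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with an ISO timestamp
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_ts, prev_line = ts, line
    return best
```

To run it against the real file, something like `largest_gap(open("/var/log/syslog", errors="replace"))` should work, since the function only needs an iterable of lines.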

Would something like this be a direct hardware failure? Like a power supply hiccup or something? The 4am timing coincides with my electric car starting to charge, but the server is on a dedicated 20A circuit and behind a battery backup. I also don’t see any power issues on my Sense monitor at that time, though it has limited resolution.

Mainboard is a Supermicro H13SAE-MF and I’m using ECC RAM.

I’ve been running this hardware for over a year and never had this issue, but I’m running out of places to look.

Might be time to finally get IPMI working.

  • tal@lemmy.today

    If it happens again and you have Magic Sysrq enabled, you can do Magic Sysrq-t, which may give you some idea of what the system is doing, since you’ll get stack traces. As long as the kernel can talk to the keyboard, it should be able to get that.

    https://en.wikipedia.org/wiki/Magic_sysrq

    You might not see anything on your monitor, but if the system is working well enough to generate the stack traces and log them to the syslog on disk (that is, the kernel’s filesystem and disk subsystems are still functional), you’ll be able to view them after rebooting.
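
Whether the keyboard combination does anything depends on the value in /proc/sys/kernel/sysrq, so it is worth checking before the next hang. Here is a minimal sketch of decoding that value, assuming the bitmask layout described in the kernel's sysrq documentation (0 disables everything, 1 enables everything, and larger values are a bitmask in which 8 covers debugging dumps such as Sysrq-t). The function names are hypothetical.

```python
SYSRQ_ENABLE_DUMP = 0x0008  # assumed bit for debugging dumps (covers Sysrq-t)


def sysrq_allows(setting, bit=SYSRQ_ENABLE_DUMP):
    """Return True if a kernel.sysrq value permits the given function bit."""
    if setting == 1:
        return True  # 1 means every Sysrq function is enabled
    # 0 means disabled; anything else is treated as a bitmask.
    return bool(setting & bit)


def read_sysrq_setting(path="/proc/sys/kernel/sysrq"):
    """Read the current kernel.sysrq value from procfs."""
    with open(path) as f:
        return int(f.read().strip())
```

Calling `sysrq_allows(read_sysrq_setting())` on the server shows whether Sysrq-t would be honored from the keyboard. If it returns False, the value can be raised with sysctl ahead of time.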

    If it can’t even do that, you might be able to set up a serial console. Then, from another system connected to the serial port and running screen, minicom, or something similar, you can send Magic Sysrq over that link and view the output on that machine.

    Some systems have hardware watchdogs, where if a process can’t constantly ping the thing, the system will reboot. That doesn’t solve your problem, but it may mitigate it if you just want it to reboot if things wedge up. The watchdog package in Debian has some software to make use of this.
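
To illustrate the "constantly ping" part: with the Linux watchdog device interface, opening the device arms the timer, each write resets it, and writing the character `V` just before closing asks the driver to disarm it (the "magic close"). The sketch below assumes the board exposes a watchdog at /dev/watchdog and that the driver supports magic close; `pet_watchdog` is a made-up name, and in practice the watchdog package's daemon does this job for you.

```python
import time


def pet_watchdog(dev, pings, interval, sleep=time.sleep):
    """Write to an already opened watchdog device `pings` times, then disarm.

    `dev` is any binary file-like object; for real use it would be
    /dev/watchdog (an assumption about this system).
    """
    for _ in range(pings):
        dev.write(b"\0")  # any write counts as a ping and resets the timer
        dev.flush()
        sleep(interval)
    dev.write(b"V")  # magic close: ask the driver to stop the timer on close
    dev.flush()
```

Real use would look like `with open("/dev/watchdog", "wb", buffering=0) as dev: pet_watchdog(dev, 10, 5)`, run as root. Be careful when experimenting: if the process dies without the magic close, or the interval exceeds the hardware timeout, the machine will reset.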