• cecilkorik@piefed.ca

    I dabble in local AI and this always blows my mind. How do people just casually throw 135B-parameter models around? Are people renting datacenter hardware or GPU time, or building personal AI servers with six 5090s in them, or quantizing them down to 0.025 bits, or what? What’s the secret? How does this work? Am I missing something? The Q4 of Qwen3.5 122B is between 60-80GB for the model weights alone. That’s 3x 5090s minimum, unless I’m doing the math wrong, and then you still need to fit the huge context windows these things have on top of that. I don’t get it.
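    For what it’s worth, the math above checks out. A rough sketch (the ~4.5 bits/weight figure for a Q4-style quant, the 32 GB per 5090, and the flat 10 GB KV-cache allowance are all my assumptions, not measurements):

```python
import math

# Assumption: a Q4-ish quant averages ~4.5 bits per weight once you
# count quantization scales/zero-points alongside the 4-bit values.
def quant_size_gb(params_b: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# Assumption: 32 GB VRAM per 5090, plus a flat 10 GB allowance for
# KV cache / activations (real context cost varies wildly by model).
def gpus_needed(params_b: float, vram_gb: float = 32.0,
                bits_per_weight: float = 4.5,
                context_overhead_gb: float = 10.0) -> int:
    """Minimum cards needed to hold weights plus a rough context budget."""
    total = quant_size_gb(params_b, bits_per_weight) + context_overhead_gb
    return math.ceil(total / vram_gb)

print(round(quant_size_gb(122), 1))  # ~68.6 GB of weights, inside the 60-80 GB range
print(gpus_needed(122))              # 3 cards, matching the estimate above
```

    So yeah: weights alone already spill past two 32 GB cards, and any real context window pushes it firmly to three.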

    Meanwhile I’m over here nearly burning my house down trying to get my poor consumer cards to run glm-4.7-flash.