cross-posted from: https://sh.itjust.works/post/61139432

I seriously can’t believe how much progress he’s made for the FOSS community. He might actually take a bite out of the Big Three’s profits with this.

  • Rhaedas@fedia.io · 1 day ago

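    16GB of VRAM is plenty even for older model setups. Now there are also a few Mixture of Experts (MoE) models, where only a small subset of the weights (the "experts") is active for each token. That lets you keep the parts used on every token on the GPU and leave the expert weights, each of which is only needed for some tokens, in system RAM for the CPU to handle. You get reasonable speed and a much larger, more capable model.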
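To illustrate why MoE models split well between GPU and CPU, here is a toy sketch of top-k expert routing in plain Python. It is a simplified illustration, not how llama.cpp or any specific Qwen model implements it: the gate vectors, the expert functions and k=2 are all hypothetical. The point it shows is that for each token only k of the N experts are actually evaluated, so most expert weights sit idle on any given step.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(xs: Sequence[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: Vector, gate: Sequence[Vector], experts: Sequence[Expert],
                k: int = 2) -> Tuple[Vector, List[int]]:
    # Router score per expert: dot product of the token with that expert's gate vector
    scores = [sum(g_i * x_i for g_i, x_i in zip(g, x)) for g in gate]
    # Keep only the k highest-scoring experts for this token
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)  # only the selected experts are ever evaluated
        out = [o + w * y_j for o, y_j in zip(out, y)]
    return out, top


# Hypothetical demo: 8 toy experts that just scale their input
calls: List[int] = []


def make_expert(idx: int, scale: float) -> Expert:
    def expert(v: Vector) -> Vector:
        calls.append(idx)  # record which experts actually ran
        return [scale * v_j for v_j in v]
    return expert


n_experts = 8
experts = [make_expert(i, float(i + 1)) for i in range(n_experts)]
gate = [[float(i), 0.0] for i in range(n_experts)]
out, chosen = moe_forward([1.0, 2.0], gate, experts, k=2)
print("chosen experts:", chosen)
print("experts evaluated:", len(calls), "of", n_experts)
```

In a real MoE model the experts are large feed-forward weight matrices, which is why keeping the rarely needed ones in system RAM costs less speed than you might expect.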

      • Rhaedas@fedia.io · 5 hours ago

        Most local setups are built around CUDA (NVIDIA). There is AMD support out there, but it’s a completely different stack and setup. As for the kind of model I mentioned, it’s a pretty new approach, so there are only a few out there, maybe just one (Qwen-based). But I did get a 31B model running on my 12GB card. I just had to move from Ollama to llama.cpp to get enough control over the parameters and fine-tune how much of the model went onto the GPU, up to the most it would take. I had Claude help me along the way.
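A rough way to pick a starting value for llama.cpp's GPU layer offload (`-ngl`, a.k.a. `--n-gpu-layers`) is to divide the usable VRAM by the per-layer size. The sketch below assumes every layer is the same size and that a fixed reserve covers the KV cache and other overhead; both are simplifications, and the model file size, layer count and filename are hypothetical, not the poster's actual setup.

```python
import math


def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 2.0) -> int:
    """Rough count of layers that fit in VRAM.

    Assumes equal-sized layers and that reserve_gb covers the KV cache,
    GPU runtime context and other overhead (both are simplifications).
    """
    usable = vram_gb - reserve_gb
    if usable <= 0:
        return 0
    fit = math.floor(usable * n_layers / model_gb)
    return max(0, min(n_layers, fit))


def llama_server_cmd(model_path: str, n_gpu_layers: int, ctx: int = 4096) -> str:
    # -m (model), -ngl (--n-gpu-layers) and -c (--ctx-size) are llama.cpp options;
    # check `llama-server --help` on your build for the full list
    return f"llama-server -m {model_path} -ngl {n_gpu_layers} -c {ctx}"


# Hypothetical numbers: an ~18 GB quantized file with 60 layers on a 12 GB card
ngl = layers_that_fit(model_gb=18.0, n_layers=60, vram_gb=12.0)
print(ngl)
print(llama_server_cmd("model-31b-q4.gguf", ngl))
```

In practice you would start from an estimate like this and step `-ngl` down if you run out of memory, or up while it still loads. For MoE models, recent llama.cpp builds also offer ways to keep expert tensors on the CPU (for example `--override-tensor`); check your build's `--help`.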

        It’s new enough that there aren’t any good abliterated/uncensored models yet.