• melfie@lemmy.zip
    2 hours ago

    Local is potentially even cheaper than that. This guy explains how to get 17 t/s on the Qwen 3.6 35B MoE model with a GTX 1060 that has 6GB of VRAM: https://m.youtube.com/watch?v=8F_5pdcD3HY. He’s using a fork of llama.cpp with TurboQuant, and his newest video, made after this one, uses an even more optimized 28B version of the model. I have cmake building the llama.cpp fork in a Dockerfile at the moment, so we’ll see how it performs on my $800 laptop with an RTX 4060.

    I’m also impressed by how good OpenCode is compared to Claude Code. Qwen 3.6 is not quite as good as Claude, and the MoE version that doesn’t need 24GB+ of VRAM isn’t quite as good as the dense version, but it also doesn’t cost $200 a month with usage limits and a company training its models on your data. If it’s anywhere near “good enough”, I can see this being a daily driver.
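
    For a rough sense of the cost argument, here is a back-of-the-envelope break-even sketch using only the two figures above ($800 laptop, $200 a month). The function name and the optional monthly power cost are hypothetical additions for illustration; no electricity figure has been measured here.

```python
import math

def breakeven_months(hardware_cost, monthly_subscription, monthly_power_cost=0.0):
    """Whole months until a one-time hardware purchase beats a subscription.

    monthly_power_cost is an assumed extra running cost for local inference;
    it defaults to 0 because no real figure is known.
    """
    saving = monthly_subscription - monthly_power_cost
    if saving <= 0:
        return None  # local never pays off under these assumptions
    return math.ceil(hardware_cost / saving)

# Figures from the comment: $800 laptop vs. $200/month subscription.
print(breakeven_months(800, 200))
```

    This ignores any difference in output quality and assumes the laptop would not have been bought anyway, so treat it as an upper bound on how long the hardware takes to pay for itself.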