Do you host your own ML / AI / LLM? What do you use, and what do you use it for?

  • brucethemoose@lemmy.world
    link
    fedilink
    English
    arrow-up
    1
    ·
    edit-2
    il y a 2 heures

    If you’re using docker anyway, and “fast” pure GPU models, you might try a vllm container while you’re at it.

    It should be much faster than even llama.cpp, albeit at the cost of context length, and it supports some exotic 4-bit quantization like SPQA.

    Same with TabbyAPI. It’s quantization is SOTA, though it does not support CPU offloading, and it’s speed is somewhere between vllm and llama.cpp.