Has anyone tried in organization to use self hosted llm models for agentic programming?

Im curious if it makes any sense. My organization spends fortune on tokens from us companies. I want to recommend something…

  • Kkk2237pl@lemmy.worldOP
    link
    fedilink
    arrow-up
    3
    ·
    2 days ago

    How about qwen 3.6 and MacBook with 64GB ram?

    I thought about that AI server, but idk how to calculate how long it pay for itself…

    • DaTingGoBrrr@lemmy.ml
      link
      fedilink
      arrow-up
      4
      ·
      2 days ago

      I am running qwen 3.5 locally using llama.cpp on 8gb of VRAM and 16 gigs of RAM. It works well enough with a 4B to 9B parameter model along with quantization and MTP. More optimizations are on the way with turboquant and possibly other tech.

      It’s just there to assist me, not do all the work, so I am happy as long as I can self host it.

      I can’t say how well my specs would work in a professional setting but for personal use a MacBook should be sufficient in my opinion.

    • MagicShel@lemmy.zip
      link
      fedilink
      English
      arrow-up
      2
      ·
      2 days ago

      I run this setup with 36GB (32+4). Local LLMs can be really effective BUT you are constrained by context size in a way you aren’t on cloud services.

      Cline supports running a local model through lmstudio but my experience feeding it any significant tasks is it just can’t handle reading and holding the contexts to build components for enterprise scale applications.

      I use Claude to write a lot of utility one-off scripts. With a maximum window of 1M tokens I can hit 30+% context just writing Python scripts. API contracts, development standards, existing reusable modules, and sometimes reading the code/documentation of the services I’m going to be calling.

      My MacBook can’t handle 300k token contexts. 30k seems doable. I should see how it handles my utility script folder…

      Anyway that’s still no Claude but if you need a cheaper model and you can afford for developers to spend time on it before ultimately deciding they need to spend for Claude or Codex or Gemini, then rubbing a local model on a beefy MacBook is 100% an option.

      Stepping up from there to building a locally hosted LLM is probably the worst of all worlds. It will be a beefy CapEx, prone to saturation by all the users, and you will most likely still have to punt the hardest jobs to cloud AI. It can certainly be done and done well, but the best example I know runs on $250-500k worth of hardware (to service a pretty big number of users to be fair).

    • 87Six@lemmy.zip
      link
      fedilink
      arrow-up
      3
      ·
      2 days ago

      I mean… RAM? Don’t you need mass VRAM for this kind of thing? Or are they shared on Mac?

      idk how to calculate how long it pay for itself…

      You don’t… Not in this industry. You guess and hope it goes in your favor.

      No calculations matter if the market can jump or drop by 300% in a few months… And that applies to programming, hardware prices, AI subscription prices, regulations between countries when Trump is in office…

      • SeductiveTortoise@piefed.social
        link
        fedilink
        English
        arrow-up
        6
        ·
        edit-2
        2 days ago

        Apple unified memory shares all over CPU, GPU and NPU, you can assign a lot of memory to run local models and there bandwidth is good, depending on the model.

        AMD has something similar with their something something AI CPUs and they go up to 128GB at the moment. Apple can be way faster though. And you were able to buy a Mac Studio with 512GB back when RAM wasn’t worth more than unicorn pee. For… I guess 10k though.

        • 87Six@lemmy.zip
          link
          fedilink
          arrow-up
          2
          ·
          1 day ago

          Apple unified memory shares

          That’s cool asf.

          Apple engineers with better leadership could change the fucking world… But instead they’re used to screw over their own user base.

          If my GPU starts falling back to RAM my game fps drops to 1 lol.