The newer CPU generations come with cores optimized for this stuff (referred to as an NPU). It actually seems to work fairly well for the kind of model you’d run locally.
Barring that, a typical laptop dGPU will also work, although not super efficiently since they often don’t exceed 8 GB of VRAM and thus can’t run most models without partially offloading them to the CPU.
Of course a laptop with a dGPU and NPU cores will make the offloading less painful. So yeah, workable for most reasonably-sized models.
Model runtimes can split the load between a discrete GPU and CPU/RAM.
It’s not as fast as when the whole model fits in the GPU, but it gives you more options. It’s been common for a long time.
Yeah, that’s what I meant by offloading. Depending on the model and runtime it can be a bit fiddly, but it usually works fine.
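For example, with llama.cpp (here via the llama-cpp-python bindings) you just pick how many layers go to the GPU and the rest stay on the CPU. Minimal sketch; the model path and layer count are placeholders you’d tune to whatever fits in your VRAM:

from llama_cpp import Llama

# Partial offload: put some layers on the GPU, run the rest on the CPU.
llm = Llama(
    model_path="./model.gguf",  # placeholder path to a GGUF model file
    n_gpu_layers=20,            # offload 20 layers to the GPU; remaining
                                # layers are evaluated on the CPU
)

out = llm("Explain GPU offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])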
I’m apparently just bad at reading the whole message.