For the longest time, I’ve been trying to figure out a way to “survive” in this new AI age without having to fork over a ton of money just to keep up. I’ve tried using local models via Ollama, and while they definitely work to a degree, they’re (unsurprisingly) not as good as the big model providers.
The local models tend to
- Forget what they’re doing
- Struggle to break larger tasks into smaller ones
- Lose focus easily
- Have weaker coding performance
- Drift over longer sessions
So to improve the reliability of fully local, smaller models (and to keep all my data local and in my own network), I created Loki.
It’s a local-first, batteries-included command line tool and runtime for building and running LLM workflows locally. It’s model agnostic and supports things like
- Agents and agent delegation
- Roles/personas
- MCP Servers
- RAG
- Custom tools
- Macros
- Workflow Scripting
A lot of the features it supports are specifically designed to compensate for weaknesses in smaller local models. For example:
- Auto continuation to keep pushing models to completion instead of stopping halfway through problems
- Parallel agent delegation so tasks can be split into smaller, focused scopes
- Workflow-based execution (“If this, do that”) for building more reliable and repeatable automations
It also supports the major cloud providers if you want them (which definitely helped while testing 😄), but my long-term goal is simple:
Get as close as possible to Claude Code-style reliability using fully local models.
I’m always open to feedback, questions, or ideas.


I’m using a ton of different ones but the main ones I use daily are
gemma4:26bdeepseek-coderdeepseek-r1:32bdevstral:24bgranite-code:34bopenthinker:latestphi4:latestqwen3:30bmixtral:8x22bI’m also going to use this opportunity to plug an amazing project to help figure out which models will work well on my hardware: https://github.com/AlexsJones/llmfit Is amazing!
Isn’t it a huge delay to swap out to a different ~30b model every few minutes depending on the use case?
Unfortunately, yes. It’s one reason I’m trying to figure out a good mechanism to maybe do something like multiple ollama hosts. So like: you can specify what model to use specifically in an agent. But if an agent delegates to a sub-agent, it unloads that model and loads the new one. I’m trying to figure out if there’s a way to “alternate” between multiple hosts (say, ollama running locally and one running on your server), so that when a switch happens, it does it on the secondary host while also looking ahead to see what needs to be switched, if anything, on the primary host.
It supports multiple Ollama hosts right now as-is so what I’ve honestly been doing for the time being is specify which model on which host each agent uses so there’s only loading of one model at the beginning of a session. Then there’s no unloading/loading/etc. The other thing I’ve been trying is to see how small I can get the models to be without losing performance. While the tricks implemented in Loki help dramatically, I know there’s still a lot more I can do to improve it further.