I find I kind of look at the whole agentic harness setup as a genetic algorithm. Your tests and specs are the fitness function for the program you’re evolving, and the LLM is the mutator. At each step it generates some output, it gets tested against the fitness function, the LLM gets feedback and iterates on it. Eventually something working falls out in the end. The better you can define the selection criteria the more you box the agent in the better results you get.
The trick I can recommend for getting the model to code is to ask it to come up with a phased plan composed of focused features, and then to build each feature on its own branch. That way you have a clear unit of work that does a specific thing which makes it much easier to review the code. Can also recommend tools like https://github.com/Fission-AI/OpenSpec for making specs to box the model in when it works.
I really dislike the idea of making the whole program a genetic algorithm - that approach is nice when you don’t have a straightforward approach to employ/enact, but otherwise it feels both overkill and horrendously inefficient.
The next step for my own harness (whenever I get back to working on it) is definitely to look at leveraging structured outputs to help these smaller models iterate towards a longer term goal.
I don’t mean you turn the program itself into a genetic algorithm. I’m saying that the agentic loop for producing code acts as one. The code itself is just regular code. And the loop isn’t really any more inefficient than what you do as a developer. It almost never happens that you write perfect code on a first try in practice. You’ll write some code, run your tests, look how it did, and iterate. That’s precisely the same process the agent follows.
The difference from a typical genetic algorithm is that the LLM is not just randomly generating text that eventually fits into the shape you specified. It’s generating code that’s already close to what’s intended most of the time, and it just needs a bit of massaging to get completely right. That’s the feedback loop here.
Sorry, I misspoke (miswrote?). I meant growing the code through a genetic-algorithm-like process. Though, fundamentally, I don’t think there’s that much difference between applying a selection process on randomized bytes and having an LLM churn on a codebase.
I feel like you’re only considering the time it takes to reach a particular solution when considering what is inefficient - in which case I would agree it’s probably a wash. However, I don’t think an LLM is less energy-hungry than my own body, and I learn by doing, effectively reducing the cost of future coding iterations. I guess if I could run the LLM and surrounding hardware entirely off of solar power I wouldn’t mind nearly as much - though there’s still that part of banging my head against a problem that I believe is crucial for my own growth. I think that, over time and problems/projects, this compounds in a way that letting the LLM figure out the gritty details just won’t.
I think I agree with your last paragraph, though I do wish the LLM was capable of needing less massaging the more it runs. I hope we’ll be able to figure out how to achieve effectively infinite context length so that it doesn’t have to “forget” all of the previous tasks I’ve had it work on.
I find I kind of look at the whole agentic harness setup as a genetic algorithm. Your tests and specs are the fitness function for the program you’re evolving, and the LLM is the mutator. At each step it generates some output, it gets tested against the fitness function, the LLM gets feedback and iterates on it. Eventually something working falls out in the end. The better you can define the selection criteria the more you box the agent in the better results you get.
The trick I can recommend for getting the model to code is to ask it to come up with a phased plan composed of focused features, and then to build each feature on its own branch. That way you have a clear unit of work that does a specific thing which makes it much easier to review the code. Can also recommend tools like https://github.com/Fission-AI/OpenSpec for making specs to box the model in when it works.
I really dislike the idea of making the whole program a genetic algorithm - that approach is nice when you don’t have a straightforward approach to employ/enact, but otherwise it feels both overkill and horrendously inefficient.
The next step for my own harness (whenever I get back to working on it) is definitely to look at leveraging structured outputs to help these smaller models iterate towards a longer term goal.
I don’t mean you turn the program itself into a genetic algorithm. I’m saying that the agentic loop for producing code acts as one. The code itself is just regular code. And the loop isn’t really any more inefficient than what you do as a developer. It almost never happens that you write perfect code on a first try in practice. You’ll write some code, run your tests, look how it did, and iterate. That’s precisely the same process the agent follows.
The difference from a typical genetic algorithm is that the LLM is not just randomly generating text that eventually fits into the shape you specified. It’s generating code that’s already close to what’s intended most of the time, and it just needs a bit of massaging to get completely right. That’s the feedback loop here.
Sorry, I misspoke (miswrote?). I meant growing the code through a genetic-algorithm-like process. Though, fundamentally, I don’t think there’s that much difference between applying a selection process on randomized bytes and having an LLM churn on a codebase.
I feel like you’re only considering the time it takes to reach a particular solution when considering what is inefficient - in which case I would agree it’s probably a wash. However, I don’t think an LLM is less energy-hungry than my own body, and I learn by doing, effectively reducing the cost of future coding iterations. I guess if I could run the LLM and surrounding hardware entirely off of solar power I wouldn’t mind nearly as much - though there’s still that part of banging my head against a problem that I believe is crucial for my own growth. I think that, over time and problems/projects, this compounds in a way that letting the LLM figure out the gritty details just won’t.
I think I agree with your last paragraph, though I do wish the LLM was capable of needing less massaging the more it runs. I hope we’ll be able to figure out how to achieve effectively infinite context length so that it doesn’t have to “forget” all of the previous tasks I’ve had it work on.