In the middle of last year, I had GPT (maybe 5.0 or 5.1) try to find the source of a bug. Naturally, this code didn’t have tests and git bisect wouldn’t work, and it was a UI interaction bug that I’m not even really qualified to write a test for, so I asked Codex to bisect between dates X and Y to find the commit that introduced the bug. Codex immediately told me the offending commit was after this date range (which couldn’t possibly be correct). When I told Codex this was wrong, it named, once or twice, some other commit that was also obviously not the offending one. When I told it those were wrong too, it pointed to a plausible-looking commit. When I asked it to prove or disprove its theory, it told me that it had written a test and confirmed that the alleged commit was the breaking commit.
I then asked it to show me by making a video with the full developer end-to-end stack in the normal browser test environment. It claimed that it didn’t have permissions to do that (which was a lie), but said it could make a video of the repro executing before and after the commit in Playwright with the appropriate test code. The video was convincing: it showed the feature working properly before the commit and failing after it. Something about this didn’t feel right, so I tried reproducing the issue by hand before and after the commit and found out that the whole thing was a fabrication. The video made it look like Codex had reproduced the bug, but it was recorded in an artificial browser environment designed to create a fake repro, not in the real environment.
Like I said, because this was unironically such a great experience, I immediately thought to myself, “how can I get more of this?”


One key piece of getting good results from LLMs is not to have them do anything you can’t do yourself. I catch AI doing weird things all the time and just fix them or have AI fix them.
Left to its own devices, AI will generally produce bad output once the output gets large enough. This is why I argue AI will ultimately not replace developers. Even the best models I’ve seen just make more sophisticated errors. The product must be reviewed and fixed by someone who actually understands how to write it.
The real question is where the threshold lies at which AI costs more than it gains in efficiency. As we’ve seen, a lot of folks don’t gain efficiency; in some cases that’s obvious. Yet other folks do see gains, and the question is whether the difference comes down to domain or technique.