Mid-last year, I had GPT (maybe 5.0 or 5.1) try to find the source of a bug. Naturally, this code didn’t have tests and git bisect wouldn’t work, and it was a UI interaction bug that I’m not even really qualified to write a test for, so I asked Codex to bisect between dates X and Y to find the commit that introduced the bug. Codex immediately told me the offending commit was after this date range (which couldn’t possibly be correct). When I told Codex this was wrong, it named, once or twice, some other commit that was also obviously not the offending one. When I told it those were wrong too, it named a plausible-looking commit. When I asked it to prove or disprove its theory, it told me that it had written a test and confirmed that the alleged commit was the breaking commit.

I then asked it to show me by making a video with the full developer end-to-end stack in the normal browser test environment. It claimed that it didn’t have permissions to do that (which was a lie), but said it could make a video of the repro running in Playwright, before and after the commit, with the appropriate test code. The video was convincing and showed the feature working properly before the commit and failing after it. Something about this didn’t feel right, so I tried reproducing the issue by hand before and after the commit and found out that the whole thing was a fabrication. The video made it look like Codex had reproduced the bug, but it was recorded in an artificial browser environment designed to create a fake repro, not in the real environment.

Like I said, because this was non-ironically such a great experience, I immediately thought to myself, “how can I get more of this?”

  • Eager Eagle@lemmy.world
    16 hours ago

    Reminder that “a coin toss” is only bad odds for problems with binary and equally likely outcomes, and that’s rarely the case for anything an LLM is used for. A 50% chance of saving an hour of work a couple of times a day is pretty good odds. If I have a problem for which a candidate solution is easy to verify, it’s often more effective to let an LLM investigate it for some time before I do, and to jump in only if it fails.
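
    To put rough numbers on that, here is a minimal sketch of the expected-value arithmetic. The function names and the figures (60 minutes of manual work, 5 minutes to verify a candidate fix) are assumptions chosen for illustration, not measurements, and the model assumes that a failed attempt costs only the verification time before you do the work by hand.

```python
def expected_minutes_saved(p_success: float, manual_minutes: float,
                           verify_minutes: float) -> float:
    """Expected net minutes saved per delegated task (hypothetical model).

    Success: you spend verify_minutes instead of manual_minutes.
    Failure: you spend verify_minutes and then do the manual work anyway.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be between 0 and 1")
    saved_on_success = manual_minutes - verify_minutes
    lost_on_failure = verify_minutes
    return p_success * saved_on_success - (1.0 - p_success) * lost_on_failure


def break_even_probability(manual_minutes: float, verify_minutes: float) -> float:
    """Success rate at which delegating neither saves nor costs time."""
    return verify_minutes / manual_minutes


# Illustrative numbers only: a coin-toss success rate on an hour-long task
# whose candidate solution takes five minutes to check.
print(expected_minutes_saved(0.5, 60, 5))  # 25.0
print(break_even_probability(60, 5))
```

    Under these assumptions, delegating pays off whenever the success rate exceeds the ratio of verification time to manual time, which is why cheap verification matters more than the raw hit rate.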

    There have been several little fixes I’ve done in minutes with an agent that would have taken me at least an hour to manually investigate, write a solution for, test, and refactor. So yes, there is something to it, but you need to know how to use it. Continuing to argue in a thread after noticing hallucinations is a clear sign the author doesn’t know how to use an LLM.