In the middle of last year, I had GPT (maybe 5.0 or 5.1) try to find the source of a bug. Naturally, this code didn’t have tests and git bisect wouldn’t work, and it was a UI interaction bug that I’m not even really qualified to write a test for, so I asked Codex to bisect between dates X and Y to find the commit that introduced the bug. Codex immediately told me the offending commit was after this date range (which couldn’t possibly be correct). When I told Codex this was wrong, it named, once or twice, some other commit that was also obviously not the offending commit. When I told it those were wrong, it named a plausible-looking commit. When I asked it to prove or disprove its theory, it told me that it had written a test and confirmed that the alleged commit was the breaking commit.

I then asked it to show me by making a video with the full developer end-to-end stack in the normal browser test environment. It claimed that it didn’t have permissions to do that (which was a lie), but that it could make a video of the repro running before and after the commit in Playwright with the appropriate test code. The video was convincing and showed the feature working properly before the commit and failing after it. Something about this didn’t feel right, so I tried reproducing the issue by hand before and after the commit and found out that the whole thing was a fabrication. The video made it look like Codex had reproduced the bug, but it was an artificial browser environment designed to create a fake repro, not the real environment.

Like I said, because this was non-ironically such a great experience, I immediately thought to myself, “how can I get more of this?”

  • HaraldvonBlauzahn@feddit.orgOP
    22 hours ago

    From the blog:

    Another comment, loosely paraphrased because this is from memory, came from Matt Mullenweg: if you look at people who undertake high-variance activities, like gamblers, they’re often superstitious. You’ll see somebody wear their lucky socks or follow a specific routine before they sit down to play the slots. Using caveman mode, or deciding which model is good because a coding agent coughed up a good result after trying it, is not so different.

    Similar to gambling, that might well apply to agentic coding as a whole.

    Also, gambling is addictive, and addictive things distort reality.