Mid-last year, I had GPT (maybe 5.0 or 5.1) try to find the source of a bug. Naturally, this code didn’t have tests and git bisect wouldn’t work, and it was a UI interaction bug for which I’m not even really qualified to write a test for, so I asked Codex to bisect between dates X and Y to find the commit that introduced this bug. Codex immediately told me the offending commit was after this date range (which couldn’t possibly be correct). On telling Codex this was wrong, it then told me some commit that was obviously also not the offending commit once or twice. On telling it those were wrong, it then told me the offending commit was some plausible looking commit. When I asked it to prove or disprove its theory, it told me that it wrote a test and confirmed that the alleged commit was the breaking commit.

I then asked it to show me by making a video with the full developer end-to-end stack in the normal browser test environment. It claimed that it didn’t have permissions to do that (which was a lie), but it could make video of the execution of the repro before and after the commit in playwright with the appropriate test code. The video was convincing and showed the feature working properly before the commit and failing to work after the commit. Something about this didn’t feel right, so I tried reproducing the issue by hand before and after the commit and found out that the whole thing was a fabrication. The video made it look like Codex had reproduced the bug, but it was an artificial browser environment that was designed to create a fake repro, not the real environment.

Like I said, because this was non-ironically such a great experience, I immediately thought to myself, “how can I get more of this?”

  • HaraldvonBlauzahn@feddit.org (OP)
    22 hours ago

    Something about this didn’t feel right, so I tried reproducing the issue by hand before and after the commit and found out that the whole thing was a fabrication. The video made it look like Codex had reproduced the bug, but it was an artificial browser environment that was designed to create a fake repro, not the real environment.

    Like I said, because this was non-ironically such a great experience, I immediately thought to myself, “how can I get more of this?”

    So, do I get this right? People try it. It does not work better than a coin toss. But because it outputs human language (until now the hallmark of intelligence), people think there is still something to it, get hooked, and continue to use it for code generation?

    • x1gma@lemmy.world
      19 hours ago

      Naturally, this code didn’t have tests

      Codebase with no tests, check.

      it was a UI interaction bug for which I’m not even really qualified to write a test for

      What the hell are they doing fixing a UI bug when they are “not qualified” to write a test for it? Anyhow, not competent enough for the codebase you’re working on, check.

      so I asked Codex to bisect between dates X and Y to find the commit that introduced this bug.

      So, instead of asking the LLM to, e.g., create a proper reproduction as a test case, he asks it to bisect, which the author claimed wasn’t possible for some reason. So, also add: can’t bisect on his own and can’t prompt properly. Check and check.

      [Waffling about hallucinations] I then asked it to show me by making a video with the full developer end-to-end stack in the normal browser test environment. […] The video made it look like Codex had reproduced the bug, but it was an artificial browser environment that was designed to create a fake repro, not the real environment.

      So, the author realized it hallucinates. The author asks for video proof (instead of a fucking test, again). The author is surprised it generated him a video of exactly what they wanted to see, only creating it in a different way than they wanted to.

      This reads like “I have close to zero clue what I’m doing, I also don’t really know how to achieve what I want properly, and now I’m making a salty blog post that my magical text microwave didn’t fix my half-assed description of a problem”. Like, honestly, what the hell was the expectation here?
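To make the bisect point concrete, here is a minimal sketch of what a bisection over a commit range boils down to. The commit names and the `repro` check are hypothetical stand-ins; in a real setup `is_bad` would check out the commit and run an actual reproduction in the real environment, which is roughly what `git bisect run` automates with a script's exit code. The sketch assumes the first commit is good, the last is bad, and the bug never disappears once introduced.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is True.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and the bug stays present once introduced.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # bug already present, look earlier
        else:
            lo = mid  # bug not yet present, look later
    return commits[hi]


# Hypothetical history in which the bug was introduced in commit "e".
history = ["a", "b", "c", "d", "e", "f", "g"]
checked = []


def repro(commit: str) -> bool:
    """Stand-in for a real reproduction; records which commits were tested."""
    checked.append(commit)
    return commit >= "e"


culprit = first_bad_commit(history, repro)
print(culprit, checked)
```

The search itself is trivial. The result is only as trustworthy as the reproduction check, so a check that runs in an artificial environment yields a confident but meaningless answer.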

      • HaraldvonBlauzahn@feddit.org (OP)
        12 hours ago

        What the hell are they doing fixing a UI bug when they are “not qualified” to write a test for it? Anyhow, not competent enough for the codebase you’re working on, check.

        Does the name “Dan Luu” mean anything to you? Do you know his blog?

        In general, I wouldn’t assume that Dan Luu is not competent enough.

        And besides that, what is the point of LLMs / GenAI if you need to be an expert in everything it touches to handle it correctly? If you are an expert, you can already do it yourself.

        Also, if one needs to be an expert in every topic to get good or even acceptable results, this raises more doubt about whether the “intelligence”, “reasoning”, and “capabilities” of these things are in reality the intelligence of the user, since the user does the real work of telling fabrication apart from accidentally good output.

        Reminds me of the old story of the smart horse “Hans”, which could do math, indicating the result with his hooves. But it turned out he could do it only when his owner was around: the horse had learned to notice when his owner agreed with the result, which the owner signaled unconsciously.

        • x1gma@lemmy.world
          12 hours ago

          I’m not assuming he’s not competent, and I’ve looked him up - he’s by no means incompetent. But he himself said he’s not qualified to write tests for that. If you cannot write tests for whatever you’re doing, you shouldn’t be doing that. Someone with his knowledge, or at least the knowledge he should have given his CV, should know that. In this specific case he is incompetent, because what he’s doing is simply wrong on every level.

          You don’t need to be an expert on what you’re doing to use LLMs efficiently. You can also use solid prompts and ideas to have an LLM cancel out your personal lack of knowledge in a specific domain. In any case, expecting an LLM to produce correct output when you’re actively guiding it to do something wrong is simply stupid.

          Any claim of actual intelligence in an LLM is simply not true. Never been, never will be. Artificial intelligence is an umbrella term for ANI, AGI and ASI: artificial narrow, general and super intelligence respectively. A narrow intelligence is not even close to human intelligence, and is hyper-specialized in a single task. All and any LLMs are and always will be ANIs, and their hyper-specialization is basically a stochastic word (well, token) completion on steroids. An AGI is mostly defined as “close to” or “approaching” human intelligence, as in general knowledge and transfer of it into unrelated fields.

          Intelligence, reasoning, and capabilities will not help you at all when you guide it in the wrong direction. You need to keep in mind the absolutely mind-blowing amount of money involved around LLMs. The bubble is too big to fail. Any LLM is a product, and its first and foremost goal is to make you use it so that you pay for it. Therefore the primary directive of the AI is to give you what you ordered, to glaze you, and to be your best, obedient buddy. You want a video of the bug? Of course! Here you have a video of what that bug looks like. Stochastically, that’s the answer to the prompt.

      • Eager Eagle@lemmy.world
        16 hours ago

        This. If you’re at a point that you’re arguing with an LLM, you’ve already lost. Just start a new thread with a different approach, don’t make an article about your inability to use an LLM.

    • atzanteol@sh.itjust.works
      20 hours ago

      The mistake is in assuming the AI is perfect and will be correct all the time.

      If you’re relying on it to be correct and not verifying its output, you’re doing it wrong.

      It’s like doing a search and finding posts in forums. Sometimes what you find is wrong or not appropriate for your situation.

      AI doesn’t replace your need to do critical thinking.

      • HaraldvonBlauzahn@feddit.org (OP)
        15 hours ago

        The mistake is in assuming the AI is perfect and will be correct all the time.

        If you’re relying on it to be correct and not verifying its output, you’re doing it wrong.

        I think that unless you are a total beginner, proper verification will frequently take about as long as, or longer than, writing it yourself.

        Just as it’s harder to read even good and correct legacy code than to write new code.

    • HaraldvonBlauzahn@feddit.org (OP)
      22 hours ago

      From the blog:

      Another comment, this time loosely paraphrased because this is from memory, was that Matt Mullenweg said that, if you look at people who undertake high variance activities, like gamblers, they’re often superstitious. You’ll see somebody wear their lucky socks or have a specific routine they do before they sit down to play the slots. Using caveman mode or deciding which model is good because a coding agent coughed up a good result after trying it is not so different.

      Similar to gambling, that might well apply to agentic coding as a whole.

      Also, gambling is addictive, and addictive things distort reality.

    • Eager Eagle@lemmy.world
      16 hours ago

      Reminder that “a coin toss” is only bad odds for problems with binary and equally likely outcomes. And that’s rarely the case for anything that an LLM is used for. A 50% chance of saving an hour of work a couple of times a day is pretty good odds. If I have a problem for which a candidate solution is easy to verify, it’s often more effective to let an LLM investigate it for some time before I do so, and only jump in if it fails.

      There have been several little fixes I’ve done in minutes with an agent that would have taken me at least an hour to manually investigate, write a solution, test, and refactor. So yes, there is something to it, but you need to know how to use it. Continuing to argue in a thread after noticing hallucinations is a clear sign the author doesn’t know how to use an LLM.
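As a rough back-of-the-envelope sketch of the odds argument: the model below is an assumption, not a measurement. It supposes you always pay the agent's run time plus the time to verify its result, and that on failure you fall back to doing the task manually. All numbers are illustrative.

```python
def expected_minutes_saved(p_success: float, manual_minutes: float,
                           verify_minutes: float, agent_minutes: float = 0.0) -> float:
    """Expected time saved per task by trying the agent first.

    Assumed model: agent_minutes + verify_minutes are always spent; with
    probability (1 - p_success) the task must then still be done manually.
    """
    cost_with_agent = agent_minutes + verify_minutes + (1 - p_success) * manual_minutes
    return manual_minutes - cost_with_agent


# Coin-toss success rate, one-hour task, cheap verification: a net win.
cheap_check = expected_minutes_saved(0.5, manual_minutes=60, verify_minutes=5)

# Same odds, but verification takes as long as doing the work: a net loss.
costly_check = expected_minutes_saved(0.5, manual_minutes=60, verify_minutes=60)

print(cheap_check, costly_check)
```

Under this model the break-even point is where verification plus agent time equals `p_success * manual_minutes`, so the whole argument hinges on how cheap verification really is.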

  • HaraldvonBlauzahn@feddit.org (OP)
    22 hours ago

    And another aspect is: You can, of course, engineer reliable things from unreliable components. Much of hardware works like that. Even my bicycle needs to have two brakes, for redundancy. Cloud computing and things like distributed databases and file systems work like that, at the price of massive complexity.

    I can see that some intelligent people are attracted by the challenge. Like a juggler who tries to keep more balls in the air.

    But for generating code and algorithms, with the price being intelligibility and maintainability, is this a good idea?
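The redundancy idea can be sketched with a little arithmetic. This assumes that component failures are independent, which is a strong assumption, and the 1% failure rate is made up for illustration.

```python
def failure_probability(p_single: float, n: int) -> float:
    """Probability that all n redundant components fail at once.

    Assumes every component fails independently with probability p_single.
    """
    return p_single ** n


# Illustrative numbers only: each brake fails 1% of the time.
for n in (1, 2, 3):
    print(n, failure_probability(0.01, n))
```

Each extra component shrinks the combined failure probability, but for any nonzero `p_single` it never reaches zero, and the calculation stops being valid as soon as failures are correlated.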

    • litchralee@sh.itjust.works
      8 hours ago

      You can, of course, engineer reliable things from unreliable components.

      I think the only way this statement can hold true in all circumstances is if we select an arbitrary boundary for what constitutes “reliable”. And that’s no small matter, because the threshold of reliability in a consumer IoT device would be inappropriate in a commercial or automotive setting, would be deeply wrong for industrial personnel safety, would be manifestly unlawful for a military or aerospace application, and potentially fatal for medical use.

      Engineering is all about balancing a set of objectives, be it cost, time to market, efficiency, size, weight, or competitive advantage, and more. Doubling up as a way to improve reliability necessarily implicates size, complexity, and efficiency, but that’s tolerable for large data centers where the customer counts servers by the number of floors, not the number of Rack Units (RU). But no one would accept installing two pacemakers because one of them might fail early; that’s an intolerable solution to the product’s base objective.

      As it happens, most USA jurisdictions only require a single brake on a bicycle, and it doesn’t even have to be on the more-effective front wheel. But the idea in law is to enforce the absolute minimum of requirements: having no brakes at all is where the line has been drawn, for a mode of transport that rarely gets above 50 kph (~30 MPH). But even then, all commercial bicycles for sale must have two brakes, so the law implicitly allows for some lost redundancy, because even one brake should be enough.

      Could a bicycle brake be developed such that it is inherently always able to stop? Likely yes. Would it appreciably improve macro safety objectives such as by reducing collisions with stationary objects? No, not really.

      And that’s the rub: just because engineers can double up things to get more redundancy, is this any better than the alternative? If an LLM is used as a search engine, is that appreciably better than using “grep”, a battle-tested, secure, locally run application with a lineage harkening back to the 70s?

      The drawback with inherent unreliability is that it can only be statistically reduced, but never eliminated. NASA understands this risk better than most, because cost pressures mean they can’t be using military-grade hardware for everything. Perhaps then, it can be better said that engineers also have to balance risk in their decisions, and as it stands right now, the risk/uncertainty for LLM output is unquantifiable by any existing approach.

      Academics have long been researching ways to make LLMs “safe”, so that their outputs are constrained in concrete ways. But I believe they’ve long concluded that the current approach of generative transformers simply cannot have safety “bolted on” after the fact. New constructions for machine learning will have to be invented with safety from day zero. The academics continue to work on that, while the commercial AI vendors are barreling ahead with LLMs, in spite of their risks and in the pursuit of a return.

      I’ve not seen anything that would suggest the academics are wrong, nor that industry has managed to produce large safety or reliability improvements, so at this point, I only see a plateau and dead end for the industry. Maybe if the industry would put more into R&D and theoretical work, this would be a lot more graceful as they run up against the buffer stops.

    • MagicShel@lemmy.zip
      19 hours ago

      One key piece of getting good results from LLMs is not to have them do anything you can’t do yourself. I catch AI doing weird things all the time and just fix it or have AI fix it accordingly.

      Left to its own devices, AI will generally produce bad output over a large enough size. This is why I argue AI will ultimately not replace developers. Even the best models I’ve seen just make more sophisticated errors. The product must be reviewed and fixed by someone who actually understands how to write it.

      The question is more about the threshold at which AI costs more than is gained in efficiency. As we’ve seen, a lot of folks don’t gain efficiency; that’s obvious in some cases. Yet other folks do see gains, and the question is whether this is a domain issue or a technique issue.