• TehPers@beehaw.org

    The conclusion of this experiment is objectively wrong when generalized. At work, to my disappointment, we have been trying for years to make this approach work, and it has been failure after failure (I wish we’d just stop; eventually we moved on to more useful stuff like building tools adjacent to the problem, which is honestly the only reason I stuck around).

    There are several reasons why this approach cannot succeed:

    1. The outputs of LLMs are nondeterministic, and most problems require determinism. For example, REST API standards require idempotency from some kinds of requests, but an LLM without a fixed seed and a temperature of 0 will return different responses at least some of the time (first sketch below).
    2. Most real-world problems are not simple input-output machines. When calling, say, an API to post a message to Lemmy, that endpoint does a lot of work. It needs to store the message in the database, federate the message, and verify that the message is safe. It also needs to validate the user’s credentials before all of this, and it needs to record telemetry for observability purposes. LLMs are not able to do all of this. They might, if you’re really lucky, generate code that does it, but a single LLM call can’t do it by itself (second sketch below).
    3. Some real-world problems operate on unbounded input sizes. Context windows are bounded and, as currently designed, cannot handle unbounded inputs. See signal processing for an example of a problem an LLM cannot solve because it cannot even receive the full input (third sketch below).
    4. LLM outputs cannot be deterministically improved. You can tweak prompts and so on, but the outputs will not monotonically improve as you do; improving one result often means sacrificing another.
    5. The kinds of models you want to run are not in your control. Using Claude? Ok, Anthropic just updated the model, now all your outputs changed, and you need to update your prompts again. This fucked us over many times.
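
    To make point 1 concrete, here’s a minimal sketch. The `llm_complete` client and the handler names are hypothetical, not anything we actually run; it just contrasts an idempotent handler with an LLM-backed one:

    ```python
    import hashlib

    def llm_complete(prompt: str, temperature: float = 0.0) -> str:
        """Hypothetical wrapper around some hosted model API."""
        raise NotImplementedError  # stand-in: real calls can vary across retries and model updates

    def put_display_name(user_id: str, display_name: str) -> dict:
        """Deterministic, idempotent PUT handler: the same input always yields the same state."""
        normalized = display_name.strip()
        etag = hashlib.sha256(f"{user_id}:{normalized}".encode()).hexdigest()
        return {"user_id": user_id, "display_name": normalized, "etag": etag}

    def put_display_name_llm(user_id: str, display_name: str) -> dict:
        """'LLM as the implementation': retrying the same request can return a different body,
        which breaks the idempotency the REST contract promises."""
        answer = llm_complete(
            f"Normalize the display name {display_name!r} and return JSON.",
            temperature=0.0,  # even at temperature 0, stability across calls is not guaranteed
        )
        return {"user_id": user_id, "raw_llm_output": answer}
    ```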
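
    For point 2, here’s roughly what a “post a message” endpoint has to orchestrate. Every helper below is an in-memory stand-in I made up for illustration; this is not Lemmy’s actual code:

    ```python
    from dataclasses import dataclass
    from itertools import count

    @dataclass
    class User:
        id: int
        name: str

    # Trivial in-memory stand-ins for the real database, federation queue, and telemetry.
    _messages: list[dict] = []
    _federation_queue: list[dict] = []
    _ids = count(1)

    def authenticate(token: str) -> User:
        # Real code validates a session or JWT; here any non-empty token passes.
        if not token:
            raise PermissionError("invalid credentials")
        return User(id=1, name="demo")

    def passes_safety_checks(body: str) -> bool:
        # Real code runs spam/abuse filters; this is a placeholder.
        return "spam" not in body.lower()

    def post_message(token: str, community_id: int, body: str) -> dict:
        user = authenticate(token)              # validate credentials first
        if not passes_safety_checks(body):      # verify the message is safe
            raise ValueError("message rejected")
        message = {"id": next(_ids), "user_id": user.id,
                   "community_id": community_id, "body": body}
        _messages.append(message)               # store it in the database
        _federation_queue.append(message)       # federate it to other instances
        print(f"telemetry: post_message user={user.id} msg={message['id']}")  # observability
        return {"id": message["id"], "status": "created"}
    ```

    A single LLM call produces text. It doesn’t authenticate anyone, write rows, or push anything onto a federation queue; at best it generates code like the above, which you then have to review anyway.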
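
    And for point 3, a toy signal-processing example: a streaming filter handles input of any length in constant memory, which is exactly what a fixed context window can’t do:

    ```python
    from collections import deque
    from typing import Iterable, Iterator

    def moving_average(samples: Iterable[float], window: int = 4) -> Iterator[float]:
        """Streaming moving average: constant memory, unbounded input length."""
        buf: deque[float] = deque(maxlen=window)
        for x in samples:
            buf.append(x)
            yield sum(buf) / len(buf)

    # Consumes a stream of any length, one sample at a time:
    #   for y in moving_average(sensor_stream()):
    #       ...
    # An LLM would instead need the whole signal serialized into its prompt,
    # which is impossible once the signal exceeds the context window.
    ```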

    The list goes on. My suggestion? Just don’t. You’ll spend less time implementing the thing yourself than trying to get an LLM to do it. You’ll save on operating expenses. You’ll be less of an asshole.