• Technus@lemmy.zip · 27 points · 2 days ago

      It’s glorified autocorrect (/predictive text).

      People fight me on this every time I say it, but it’s literally doing the same thing, just with a much longer lookbehind.

      In fact, there’s probably a paper to be written about how LLMs are just lossily compressed Markov chains.
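      For the curious, a word-level Markov chain really is only a few lines. This is a toy sketch (made-up corpus, nothing tuned) of the “predict the next word from the last few” mechanic that LLMs scale up with a vastly longer context:

```python
import random
from collections import defaultdict

def train(text, order=2):
    """Build a Markov chain: map each n-gram of words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return chain

def generate(chain, length=12, seed=0):
    """Sample a continuation by repeatedly picking one of the observed next words."""
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    out = list(state)
    for _ in range(length):
        candidates = chain.get(tuple(out[-len(state):]))
        if not candidates:
            break
        out.append(rng.choice(candidates))
    return " ".join(out)

corpus = "the model predicts the next word and the model repeats the next word"
print(generate(train(corpus)))
```

      Every continuation it emits was seen somewhere in the training text; the LLM version replaces the lookup table with a learned, compressed approximation of one.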

    • CosmoNova@lemmy.world · 12 points · 2 days ago

      That’s what I’ve been arguing for years. It’s not so different from printing out frames of a movie, then scanning them again and claiming the result is a completely new piece of art. Everything has been altered so much that it’s completely different, yet it’s still very much recognizable, with extremely little personal expression involved.

      Oh, but you chose the paper and the printer, so it’s definitely your completely unique work, right? No, of course not.

      AI works pretty much the same. You can tell what protected material an LLM was fed from the output of a given prompt. The theft already happened when the model was trained, and it’s not that hard to prove, really.

      AI companies get away with the biggest heist in human history by being overwhelming, not by being something completely new and unregulated. These things are already regulated; the regulations are just being ignored. The companies have big tech, and therefore politics, to back them up, but definitely not the written law of any country that protects intellectual property.

    • LadyAutumn@lemmy.blahaj.zone · 17 points · 2 days ago

      Kinda. But like, a compression algorithm that isn’t all that good at exact decompression. It’s really good at outputting text that makes you think “wow, that sounds pretty similar to what a person might write”. So even if it’s entirely wrong about something, that’s fine, as long as you’d look at it and be satisfied that its answer sounded right.

      • leftzero@lemmy.dbzer0.com · 11 points · 2 days ago

        It stores the shape of the information, not the information itself.

        Which might be useful from a statistics and analytics viewpoint, but isn’t very practical as an information storage mechanism.

        • TheBlackLounge@lemmy.zip · 8 points · 2 days ago

          As you can learn from reading the article, they do also store the information itself.

          They learn and store a compression algorithm that fits the data, then use it to store that data. The former part isn’t new; AI and compression theory go back decades. What’s new and surprising is that you can get the original work out of attention transformers. Even in traditional overfit models that isn’t a given, and attention transformers shine at generality, so it’s not evident that they should do this. But all models tested do it, so maybe it’s even necessary?

          Storing data isn’t a theoretical failure; some very useful AI algorithms do it by design. It’s a legal and ethical failure because OpenAI et al. have been claiming from the beginning that this isn’t happening, and it also provides proof of the pirated works the models were trained on.
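          The “AI and compression theory go back decades” link is Shannon’s: a model that assigns probability p to the next symbol needs about −log₂(p) bits to encode it, so a better predictor is literally a better compressor. A toy sketch of that bound, using a unigram character model (purely illustrative, nothing like a real LLM):

```python
import math
from collections import Counter

def bits_under_model(text):
    """Estimate the compressed size of `text` in bits under a unigram model
    fitted to the text itself: each character costs -log2(p(char))."""
    counts = Counter(text)
    total = len(text)
    return sum(-math.log2(counts[c] / total) for c in text)

# Predictable text is cheap to encode; varied text is expensive.
low = bits_under_model("aaaaaaaaaaaaaaaa")   # one symbol, zero surprise
high = bits_under_model("abcdefghijklmnop")  # 16 symbols, 4 bits each
print(low, high)
```

          Swap the unigram model for a transformer’s next-token distribution and you have, conceptually, a language-model compressor, which is why memorized training data and compression are two views of the same thing.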

          • leftzero@lemmy.dbzer0.com · 4 points · edited · 2 days ago

            The images in the article clearly show that they’re not storing the data; they’re storing enough information about the data to reconstruct a rough and mostly useless approximation of it. And they do so in such a way that the information about one piece of data can be combined with the information about another to produce a rough and mostly useless approximation of a combination of the two, which was not in the original dataset.

            It’s like playing a telephone game with a description of an image, with the last person drawing the result.

            The legal and ethical failure is in commercially using artists’ works (as training data) without permission, not in storing or even reproducing them, since the slop they produce is evidently an approximation and not the real thing.

            • TheBlackLounge@lemmy.zip · 4 points · edited · 1 day ago

              The law disagrees. Compression has never been a valid argument. A crunchy 360p rip of a movie is a mostly useless approximation but sharing it is definitely illegal.

              Fun fact: you can use MPEG as a very decent perceptual image-comparison algorithm (e.g. for facial recognition) by using the file size of a two-frame video. This works for mostly the same theoretical reasons as neural-network-based methods. Of course, MPEG was built by humans using legally obtained videos for evaluation, but it can’t reproduce any of those at all, so reproduction isn’t a requirement for compression.
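              The MPEG trick is one instance of compression-based similarity, and the same idea works with any compressor: if knowing A helps you compress B, then A and B are similar. A hedged sketch using zlib in place of MPEG (the classic “normalized compression distance”; the test strings are made up):

```python
import zlib

def csize(data: bytes) -> int:
    """Compressed size in bytes at maximum compression level."""
    return len(zlib.compress(data, 9))

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance: near 0 for very similar inputs,
    near 1 (or slightly above) for unrelated ones."""
    ca, cb, cab = csize(a), csize(b), csize(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

x = b"the quick brown fox jumps over the lazy dog " * 20
y = b"the quick brown fox leaps over the lazy cat " * 20
z = b"totally unrelated bytes: 9f8e7d6c5b4a " * 20
print(ncd(x, y), ncd(x, z))  # the x/y distance comes out smaller
```

              zlib, like MPEG, was built and validated on real data it cannot reproduce, which is the commenter’s point: measuring with a compressor doesn’t require it to regurgitate anything.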

    • Prove_your_argument@piefed.social · 15 points · edited · 2 days ago

      Better search results than google though.

      EDIT: DO NOT LOOK AT THE CHAT. LOOK AT THE SOURCE LINKS THEY GIVE YOU.

      Unless it’s a handful of official pages or discussion forums, Google is practically unusable for me now. It absolutely exploded once ChatGPT came on the scene, and SEO has been so perfected that slop is almost all the results you get.

      I wish we had some kind of downvote or report system to remove all the slop, but more clicks mean more referral revenue; better to make people click more.

      Almost all recipe sites now hit me with “We see you’re using an adblocker!” until I turn on reader mode on my phone. Pretty soon that will be blocked too, and I guess I’ll go back to cookbooks or something?

      • Honse@lemmy.dbzer0.com · 28 points · 2 days ago

        No TF it’s not. The AI can only output the hallucinations that are most statistically likely. There’s no way to sort the bad answers from the good. Google at least supplies a wide range of content to sort through to find the best result.

        • Prove_your_argument@piefed.social · 19 points · 2 days ago

          You misunderstand. I’m not saying AI’s chats are better than a quality article. I’m saying their search results are often better.

          Don’t look at what they SAY. Look at the links they provide as sources. Many are bad, but I find I get much better info. With Google I might try 10+ links before I get one that really says what I want; with a chatbot I typically get a relevant link within one or two clicks.

          There is no shortcut for intelligence… but AI “SEO” hasn’t been perfected yet.

          • gustofwind@lemmy.world · 10 points · 2 days ago

            Yep, this has happened to me too.

            I used to always get the results I was looking for; now it’s just pure garbage, but Gemini will have all the expected results as sources.

            Obviously deliberate, to force us to use Gemini and make free searches useless. Maybe it hasn’t been rolled out to everyone yet, but it’s certainly got us.

            • AmbitiousProcess (they/them)@piefed.social · 3 points · 2 days ago

              I’m honestly not even sure it’s deliberate.

              If you give a probability-guessing machine like an LLM the ability to review content, it’s probably just going to rank things the way you expect for your specific search more often than an algorithm built to pull the most relevant links extremely quickly, based on only part of each page as keywords, with no understanding of how the context of your search relates to each page.

              The downside is, of course, that LLMs use way more energy than regular search algorithms, take longer to provide all their citations, etc.

              • Prove_your_argument@piefed.social · 2 points · 1 day ago

                A ton of factors have increased energy costs on the web over the years. It’s insignificant per person, but bandwidth is exponentially higher because all websites carry miles of cruft in their formatting code nowadays. Memory usage is out of control. Transmission and storage of all the metadata your web browser provides in real time as you move your mouse around a page is infinitely higher than what we had in the early days of the web.

                The energy cost of ML will drop as chips progress, but I think the financial reality will come crashing down on the AI industry sooner rather than later, and basically keep it out of reach for most people anyway due to cost. I don’t see much ROI for AI. Right now it’s treated as a capital investment, which helps inflate company worth while it lasts, but after a few years the investment is worthless, and a giant money sink from energy costs if used.

          • Honse@lemmy.dbzer0.com · 1 point · edited · 1 day ago

            Interesting. LLMs have no ability to directly do anything but output text, so the tooling around the LLM is what’s actually searching. They probably use some API from Bing or something; have you compared results with those from Bing? I’d be interested to see how similar they are, and how much extra tooling is used for search. I can’t imagine they want to spend a lot of cycles generating only like 3 search queries per request, unless they have a smaller dedicated model for that. Would be interesting to see the architecture behind it and what’s different from normal search engines.
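            For what it’s worth, the publicly described pattern behind these products is a tool-use loop: the model emits a search query as plain text, separate tooling runs it against a search API, and the results are pasted back into the prompt for a second pass. A hypothetical sketch of that shape; `call_llm` and `web_search` are stand-ins, not any vendor’s real API:

```python
def call_llm(prompt: str) -> str:
    # Stand-in for a hosted model. Real systems return either a tool call
    # (here, a line starting with "SEARCH:") or a final answer.
    if "SEARCH RESULTS" not in prompt:
        return "SEARCH: lemmy llm search architecture"
    return "ANSWER: summarized from the results above"

def web_search(query: str) -> list[str]:
    # Stand-in for a real search API (e.g. Bing's), returning result URLs.
    return [f"https://example.com/result-for-{query.replace(' ', '-')}"]

def answer(question: str) -> str:
    """One round of the search-tool loop: ask, search, ask again with results."""
    prompt = question
    reply = call_llm(prompt)
    if reply.startswith("SEARCH:"):
        results = web_search(reply.removeprefix("SEARCH:").strip())
        prompt += "\n\nSEARCH RESULTS:\n" + "\n".join(results)
        reply = call_llm(prompt)
    return reply

print(answer("How do LLM search products work?"))
```

            The source links people are praising come straight out of the `web_search` step, which is why they can be good even when the generated summary isn’t.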

      • leftzero@lemmy.dbzer0.com · 7 points · edited · 2 days ago

        Because they intentionally broke the search engines to make LLMs look better.

        Search engines used to produce much more useful results than LLMs ever will (even excluding the results LLMs make up), before Google and Microsoft started pushing this garbage.

        • fristislurper@piefed.social · 7 points · 1 day ago

          Nahh, even before LLMs became big, search results were becoming worse and worse because of all the SEO-spam.

          Now you can generate coherent-sounding articles using LLMs, making the amount of trash even bigger.

          Google making their search worse would be dumb, since all these LLMs also rely on it to some degree.

        • Prove_your_argument@piefed.social · 4 points · 1 day ago

          I don’t think so, man. The ad industry only started mass training on SEO 15 years ago. Now it’s practically a household term.

          Google had very different goals and ambitions early on. Things changed; now they’re like any other giant soulless corpo. Their goal is revenue, and as their platform has grown we’ve seen a metamorphosis from a focus on interesting things to a never-ending professional jester troupe on every endpoint, still making pennies on the dollar compared to what Google earns from advertisers. They’re the middlemen, and should be making next to nothing since they produce next to nothing of value, but advertising sells.

          Google Search revenue was something like $175bn in 2024, and only a tiny fraction of that is paid out to websites from clicks. Someone with a proper LLM tuned for SEO can churn out hot garbage nonstop and fill up results in perpetuity, with a guaranteed revenue stream far in excess of what’s possible as a worker in most of the world. There’s just more garbage than ever… and people have found exactly the right formula to rise to the top despite human-useless content. Google doesn’t think of it as a bad thing: $175bn in revenue from search! lol

        • Prove_your_argument@piefed.social · 1 point · 1 day ago

          The genie is out of the bottle forever, unfortunately.

          They could have instituted a system of genuine, authentic reviewers who manually curate content so that your search is great, but that would result in fewer clicks. Fewer clicks means less revenue. They’re financially incentivized to make you click as much as you’re willing to.