GenAI as it currently stands is a fancy text predictor. You ever had your phone suggest the next word in a message you’re typing? It’s that, on crack.
When you really wrap your head around the fact that that is all it’s doing, it loses a lot of its appeal imho. Especially for the cost to do so.
To be more specific (for anyone interested), the next word predictors are usually a type of model called an LSTM (at least I think that’s the most common). This model type has been used for a long time for dealing with sequential data. In 2014 there was a famous paper introducing an attention mechanism. This was a rather brilliant, though relatively minor, extension to how LSTMs work. Essentially, at each step an LSTM generates some data (a hidden state) representing the model’s knowledge of the sequence up to that point. The attention mechanism looks back at these intermediate values, determines how relevant each one is to the current point in the sequence, and pulls in the most relevant bits. This vastly improved the memory of the LSTM over longer sequences.
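For anyone who wants to see the "look back and pull in the relevant bits" step, here is a rough sketch in plain Python. It is a simplification and not the 2014 paper's exact method: relevance is scored with a plain dot product, and the names `attend`, `states`, and `query` plus the toy 2-d vectors are all made up for illustration.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, states):
    """Return (weights, context) for one query over a list of past states."""
    # Relevance of each past state to the current point (assumed: dot product).
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    # Turn the scores into weights that are positive and sum to 1.
    weights = softmax(scores)
    # Context vector: the weighted average of the past states.
    dim = len(states[0])
    context = [sum(w * state[i] for w, state in zip(weights, states))
               for i in range(dim)]
    return weights, context

# Hypothetical hidden states from three earlier steps of the sequence.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical vector for the current point in the sequence.
query = [2.0, 0.0]

weights, context = attend(query, states)
print(weights)  # the second state gets the smallest weight
print(context)
```

The states that line up with the query get most of the weight, which is the "most relevant bits" part. In a real model the states and the scoring function are learned rather than hand-written.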
In 2017 there was another famous paper, “Attention Is All You Need”, which said something to the effect of “the attention mechanism is doing all the work; we don’t need the rest of the LSTM, and we can replace it by running attention between every pair of points in the sequence.” It’s actually significantly slower to run as the sequence grows, since the number of pairs grows with the square of its length, but much, much faster to train because it’s not intrinsically sequential. This is the transformer model that’s the basis of all our LLMs.
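A similarly rough sketch of "attention between every pair of points", again in plain Python. This assumes scaled dot-product scoring and leaves out the learned projections, multiple heads, and masking that real transformers use; `self_attention`, `xs`, and the toy inputs are made-up names for illustration.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Scaled dot-product self-attention with no learned projections.

    Every position acts as a query against every position (itself included).
    """
    dim = len(xs[0])
    scale = math.sqrt(dim)
    outputs, weight_rows = [], []
    # No iteration depends on another iteration's output, so in principle
    # all positions can be computed in parallel.
    for query in xs:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in xs]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, xs))
                        for i in range(dim)])
        weight_rows.append(weights)
    return outputs, weight_rows

# Hypothetical sequence of four 2-d token vectors.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
outputs, weight_rows = self_attention(xs)

# One weight per (query, key) pair: n * n of them for a sequence of length n.
pair_count = sum(len(row) for row in weight_rows)
print(pair_count)
```

Doubling the sequence length quadruples `pair_count`, which is where the extra running cost comes from. The trade is that, unlike an LSTM, no position has to wait for the previous one to finish.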
Obviously some massive simplifications here, but despite being fairly anti-AI, I do love the engineering behind it. So yeah, pretty literally a fancy text predictor, but it turns out that when you throw all the compute you can muster at a fancy word predictor, it makes the world go crazy.