Which came first: the chicken or the egg? Can God create a stone that He cannot lift? These are questions that have troubled people for thousands of years.
Another one has been added: “Can LLMs become smarter than humans?” Here, I would like to add a few more questions: how “smart” are such models? Are they even capable of “reasoning”? What can Large Language Models really do?
Let’s talk about this.
What is an LLM?
Large Language Models (LLMs) are an undeniable and significant advancement in the field of Natural Language Processing (NLP) and Artificial Intelligence (AI). I have never denied this. But I’ve always had a reasonable skepticism regarding their supposedly incredible capabilities. Much of the overestimation, I think, comes from those who aren’t familiar with how things work under the hood.
These models are designed to generate and interact with human language. The word “understanding” is usually added here, but that is not how one should describe how such models work. Because, at least for now, they do not “understand” in the sense that humans do.
LLMs are developed through machine learning, where a model learns from a vast dataset to perform a task. Specifically, these models are pre-trained on extensive corpora of text data, encompassing a wide range of internet text, books, articles, and more. This pre-training phase allows the model to learn the language structure, including syntax, grammar, and semantics, by predicting the next word in a sequence of words.
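Here is a minimal sketch of that next-word-prediction objective, assuming a toy PyTorch setup; the tiny embedding-plus-linear model stands in for a real transformer stack, and all sizes and data are made up for illustration.

```python
# Minimal sketch of the next-token prediction objective (not a real LLM):
# the model is trained to assign high probability to the word that actually
# follows each position in the training text.
import torch
import torch.nn as nn

vocab_size, embed_dim = 50, 32           # toy sizes; real models use far larger values
model = nn.Sequential(                   # stand-in for a transformer stack
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),    # a score (logit) for every word in the vocabulary
)

tokens = torch.randint(0, vocab_size, (1, 16))   # one sequence of 16 token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from token t

logits = model(inputs)                           # shape: (1, 15, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()   # gradients nudge the weights toward better next-word guesses
```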
The models we know today are built around a “neural network,” an architecture loosely inspired by how the human brain works. Each “neuron” is essentially a handful of numbers that get multiplied, summed, and compared, which helps the model look for patterns.
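A single artificial “neuron” really is just arithmetic; here is a minimal sketch in plain Python (the weights and inputs are arbitrary numbers, chosen only to show the mechanics).

```python
# One artificial "neuron": multiply inputs by weights, add them up,
# squash the result. No cognition involved, just arithmetic.
import math

def neuron(inputs, weights, bias):
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-total))   # sigmoid "activation"

print(neuron([0.2, 0.7, 0.1], [0.9, -0.4, 0.3], bias=0.05))
```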
The LLM architecture is based on the transformer, a neural network designed to deal with sequential data. Transformers use mechanisms such as attention and self-attention, which let the model assess the importance of different words in a sentence or document. This is what lets an LLM “understand” context and generate coherent, relevant text: the model “weighs” the different data it has been trained on to find the most appropriate combination, which allows it to generate text similar to what we write.
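Here is a bare-bones sketch of that “weighing”: scaled dot-product self-attention in NumPy. The matrix sizes and random inputs are placeholders, and real transformers add multiple heads, output projections, and many stacked layers on top of this.

```python
# Scaled dot-product self-attention in miniature: every word (row) scores
# every other word, the scores become weights, and the output is a
# weighted mix of the value vectors.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how relevant is word j to word i
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return weights @ V                               # context-aware representations

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                 # 4 "words", 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```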
This is the subtle point where people usually stumble and start overestimating such models: when you write a query to such a model and get a reasonable response, the first thing you want to exclaim is “it thinks”.
But by and large, it’s just a model that counts probabilities to give you the best-generated answer (a small sketch of that final “probability counting” step follows the list below). Training such a model, very simplistically speaking, consists of three main steps:
- Pre-training. The LLM is fed a huge amount of textual data at this stage. This data can include anything, but nowadays such models are trained on texts as diverse as possible: books, articles, encyclopedias, poems, and so on. At this stage, the model finds the patterns that occur most frequently in these sequences, allowing it to pick up grammar, semantics, and punctuation.
- Attention Mechanisms. These allow the LLM to rank information in a task-specific manner. Roughly speaking, the model assigns weights (coefficients) to different pieces of data depending on their frequency and relevance to the context.
- Transformers. This is the architecture the LLM uses to find the most important relationships during text generation. Its key mechanism is self-attention, which allows each element of the sequence to interact with the others, improving the “understanding” of the context.
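To make the “counting probabilities” concrete, here is a tiny, hypothetical example of how raw scores become a distribution over a toy vocabulary and a next word gets picked; the vocabulary and scores are invented for illustration.

```python
# How "counting probabilities" ends: the model's raw scores (logits) are turned
# into a probability distribution and the next word is sampled from it.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
logits = np.array([2.0, 0.5, 1.0, 0.1, 1.5])        # pretend output of the network

probs = np.exp(logits) / np.exp(logits).sum()       # softmax: scores -> probabilities
next_word = np.random.default_rng(0).choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(3))), "->", next_word)
```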
I don’t remember which cartoon I saw it in, but I remember a scene with a huge machine that spun words on an internal drum, assembling sentences and predictions. It’s an unfair example, but it is similar to an LLM: a model that analyzes the text generated so far and then counts and recounts the probabilities to pick what works best next in the generation process.
The limitations
But how much does such a model “understand” what it is writing about? Well, to date, the words “understands”, “thinks” or “reasons” simply do not fit. It works more like T9. I have a little game: when I’m bored, I open a chat with a close friend, write a few words, and then keep tapping T9’s suggestions until I end up sending him a huge message. If it carries any meaning at all, it is hard to grasp, although at first glance it may look like text that makes sense: at least the words in it appear in common sequences.
I’m not trying to draw a strict analogy here, because it would be a false one. I’m just demonstrating how a fairly simple model can generate mostly meaningless text (a toy version is sketched below). But LLMs don’t generate complete nonsense, do they? True, they now generate texts that make sense. We won’t discuss how good such texts are; you can read about this here. Let’s admit it: such texts are often impossible to distinguish from those written by humans.
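As a toy illustration of the T9 idea above (and emphatically not how a real LLM is built), here is a sketch that counts which word most often follows which in a tiny made-up corpus and then greedily strings words together; the corpus and the greedy rule are mine, for illustration only.

```python
# A T9-style toy: remember which word most often follows which, then keep
# picking the most frequent continuation. The output is locally plausible
# word sequences with no understanding behind them.
from collections import Counter, defaultdict

corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat chased the dog .").split()

follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1                 # count bigram frequencies

word, sentence = "the", ["the"]
for _ in range(8):
    word = follows[word].most_common(1)[0][0]  # greedily take the most frequent next word
    sentence.append(word)
print(" ".join(sentence))
```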
This, by the way, is the reason why all tools for detecting AI-generated content fail. They try to find patterns like “uniqueness”, “perplexity” and “burstiness”. The snag is that it doesn’t work: people write differently, and someone may write perfectly literate text using well-known clichés and roughly the same word lengths. Or we can tweak our LLM so that it pretends to be a human writer.
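For reference, perplexity, one of the signals such detectors lean on, is simply the exponential of the average negative log-probability a language model assigns to each word; the per-word probabilities below are made up to show the arithmetic.

```python
# Perplexity in one line: the exponential of the average negative log-probability
# a language model assigns to each word. Low perplexity = "predictable" text,
# which is exactly why careful human writers can trip the detectors too.
import math

token_probs = [0.31, 0.12, 0.45, 0.08, 0.27]   # made-up per-word probabilities
perplexity = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
print(round(perplexity, 2))
```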
The main problem with LLMs, as with today’s AI in general, is their limitations. Let’s compare how humans and LLMs learn. Some experts, such as Pavel Prudkov, write that “LLMs already approach the level of human cognition.” On the other hand, such models have no real reasoning ability or cognition. As the neuroscientist David Eagleman says, “The claim that humans are nothing more than word predictors is wrong though. When I buy food, it’s not because I’m predicting the words “I’m hungry.” It’s because I’m predicting that I’ll be hungry.”
And there is a fundamental difference here, because predicting and “understanding” are two different things. A person learns over an entire lifetime, and from a very limited amount of data. You see, a child doesn’t “download” a huge dataset that is then processed and “weighted”. On the contrary, we learn largely by deduction: we hold some general knowledge, from which we draw more specific conclusions. Language models follow the principle of induction, analyzing huge datasets and extracting from them a general picture of how text should be generated.
The philosopher John Searle, in his paper “Minds, Brains, and Programs” (published in Behavioral and Brain Sciences in 1980), put forward the concept of the “Chinese Room”. It explains why a model such as an LLM cannot “think” or draw “conclusions” in the sense the human brain does. Let’s shift away from Chinese and use an example that’s easier for English speakers to follow.
Let’s consider a person who doesn’t know Polish. We give him a guide to Polish grammar and a dictionary and then ask him a question in Polish.
After spending some time, the person can first “understand” the question by translating it with the dictionary. Then, using the grammar guide and the dictionary, he can give us a correct answer.
And it will sound as if the person knows Polish, even though they don’t speak it. This is very similar to an LLM giving us the most frequently occurring words. But as with the person answering a question in a language they don’t know, this text is not the product of “understanding” or “reasoning” but an answer assembled from the available data. Given more time, the person could process more words and find more appropriate ones.
In an article for TIME, Andy Clark, Professor of Cognitive Philosophy at the University of Sussex, writes: “Words, as evidenced by an abundance of great and not-so-great literary works, already depict patterns of all kinds – for example, patterns in sights, tastes and sounds. This gives generative AI a real window into our world. However, it still lacks that crucial ingredient – action. At best, text-predicting AIs are given a kind of verbal fossilized trace of the impact of our actions on the world. This trace consists of verbal descriptions of actions (“Andy stepped on his cat’s tail”) and verbally articulated information about their typical effects and consequences.”
Therein lies the key difference between an LLM and the human brain: the LLM only has indirect data, descriptions of the interactions it has been able to study. As a result, any description of our world it produces is just a search for the most frequent existing patterns in text, and in no way “reasoning”.
Can a chatbot become sentient?
Remember the story of the Google engineer who claimed a chatbot was sentient? His belief was built on the idea that the bot could communicate like a “seven or eight-year-old child, offering judgments and asking questions.” But this is not what “thinking” is about; it’s a sham. Indeed, training a model on a massive amount of data can provide you with answers like those of “a child knowing physics.” Or you can have a chatbot communicate with you as if it were Bill Geist. But it’s just copying and compiling what the model was able to extract from its dataset. And it has nothing to do with our thinking.
A chatbot can indeed imitate arguments that were derived through critical thinking. But only because those arguments have already been written by someone else, and the model was able to learn them. Continuing the Chinese Room example: I can spend two days (enough time to read, but not enough to study a topic and develop my own system for evaluating and relating to what I read) going through a dozen papers on, say, behavioral economics. Then I can attend a debate and make reasonable arguments about behavioral economics that I’ve read elsewhere. But does that mean I have a real “understanding” of what I’m talking about? No and no again: it means that I have learned a set of arguments and assessments from others and can “weigh” them and choose the most appropriate ones.
Such imitation doesn’t mean that the AI has “consciousness” or is “sentient”, only that it successfully imitates personality because it knows how humans express one. We don’t think parrots can actually talk just because they know how to copy sounds, do we?
This is where it helps to look at how children develop their minds. They do not develop them by analyzing enormous amounts of information (a child’s information is limited). Rather, much of the capacity is innate, and additional information only enhances the child’s ability to “judge”.
Therefore, today’s AI, and LLMs in particular, is very far from AGI, which is presented as artificial intelligence capable of thinking and possessing consciousness in one form or another.
What you can read about LLMs
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning (Adaptive Computation and Machine Learning series). The MIT Press.
- Almeida, I. (September 2023). Introduction to Large Language Models for Business Leaders. Now Next Later AI.
- Wolfram, S. (2023). What Is ChatGPT Doing … and Why Does It Work? Wolfram Media Inc.
- Buttrick, N. (2024). Studying large language models as compression algorithms for human culture. In Trends in Cognitive Sciences. Elsevier BV. https://doi.org/10.1016/j.tics.2024.01.001