Institut Polytechnique de Paris
Ecole Polytechnique ENSTA ENSAE Télécom Paris Télécom SudParis

ChatGPT and the spectacular evolution of natural language processing

25 Apr. 2023
The buzz around ChatGPT has raised interest in recent advances in natural language processing based on pre-trained artificial intelligence models and on the principle of transfer learning. Michalis Vazirgiannis and Moussa Kamal Eddine are specialists in this field in the Data Science and Mining group at the Computer Science Laboratory of the Ecole Polytechnique. They explain how ChatGPT works, the latest developments and the issues at stake.

When OpenAI launched ChatGPT in late November 2022, few were prepared for a viral hit such as the one that followed. Researchers have been working for a long time on natural language processing (NLP) models, which form the basis of chatbots, and ChatGPT was yet another development in this field. So, why such a buzz? To get more insight into ChatGPT, and the concerns and controversy around it, one should first try to understand how it works.

GPT stands for “Generative Pre-trained Transformer”. Pretrained models are machine learning models that have been trained in advance on a large amount of data to solve a specific task (e.g., for NLP models, predicting hidden words in sentences). These models are thus able to memorize dependencies and patterns in the data, and can then solve either downstream tasks (e.g., classifying a document correctly) or generative tasks (e.g., creating a summary of a document). They can also be adapted to a specific task without requiring as much additional data or training time as building a model from scratch. Pretrained models can also be used across modalities such as natural language processing, computer vision and speech recognition, among others.
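To make the hidden-word pretraining objective concrete, here is a deliberately tiny toy sketch in Python: it "pretrains" by counting which word appears between pairs of context words in a miniature corpus, then fills a hidden word from its context. The corpus, function names and counting scheme are our own illustrative choices; real pretrained models learn this with deep neural networks over billions of words.

```python
from collections import Counter

# Toy sketch of the "predict a hidden word" pretraining objective.
# For each (left word, right word) context pair, count which middle
# word occurred between them in the corpus.
corpus = [
    "the sun is a star",
    "the moon is a satellite",
    "the sun is bright",
]

middle = {}
for sentence in corpus:
    words = sentence.split()
    for left, mask, right in zip(words, words[1:], words[2:]):
        middle.setdefault((left, right), Counter())[mask] += 1

def fill_mask(left, right):
    """Predict the most likely hidden word between two context words."""
    return middle[(left, right)].most_common(1)[0][0]

print(fill_mask("the", "is"))  # -> "sun" ("the [?] is" was most often "sun")
```

The same idea, carried out by a large neural network instead of a counting table, is what gives pretrained models their knowledge of language.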

A bit of history

In recent years, large pretrained models for natural language processing have emerged, such as GPT-2, GPT-3, BERT, and RoBERTa. These models were trained on massive amounts of text data, allowing them to generate high-quality outputs for a variety of NLP tasks, including language generation, text classification, and question answering, among others. These models have achieved state-of-the-art performance on several benchmark datasets and have been utilized in various applications, including chatbots, machine translation, and content generation. However, the use of large pretrained models also poses challenges, such as computational requirements and ethical concerns related to their training data and biases.

A second key ingredient is transfer learning, which was a game changer in many AI subfields. Transfer learning is a set of methods that allow us to take advantage of the weights (the parameters of machine learning algorithms) of already trained models. Training on a new task then starts from the weights of the previously trained model instead of from scratch, thus capitalizing on the knowledge already stored in the model.
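A minimal numerical sketch of this idea, under entirely illustrative assumptions (a one-parameter linear model and two made-up related tasks), shows why starting from pretrained weights helps: the model begins the new task already close to a good solution.

```python
import numpy as np

# Toy sketch of transfer learning: fit a linear model y = w*x on
# task A, then reuse its weight as the starting point for a related
# task B instead of starting from scratch. Real transfer learning
# reuses millions of neural-network parameters in the same way.

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y_a = 3.0 * x + 0.01 * rng.normal(size=100)  # task A: slope 3.0
y_b = 3.2 * x + 0.01 * rng.normal(size=100)  # related task B: slope 3.2

def train(x, y, w_init, steps=50, lr=0.1):
    w = w_init
    for _ in range(steps):
        grad = 2 * np.mean((w * x - y) * x)  # d/dw of mean squared error
        w -= lr * grad
    return w

w_pretrained = train(x, y_a, w_init=0.0)  # "pretraining" on task A

# Loss on task B before any task-B training:
loss_scratch = np.mean((0.0 * x - y_b) ** 2)            # starting from scratch
loss_transfer = np.mean((w_pretrained * x - y_b) ** 2)  # starting from task-A weights

print(loss_transfer < loss_scratch)  # True: the pretrained start is already close
```

Finetuning on task B then only has to close a small remaining gap rather than learn everything anew.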

In the context of NLP, transfer learning dates back to 2008, when Collobert and Weston showed that self-supervised learning (in the case of NLP, predicting one part of a sentence given another part of it) could improve the generalization capabilities of AI models on classification tasks. Many new methods and approaches kept emerging afterwards, providing more efficient techniques for transfer learning. One of these recent approaches is GPT. GPT consists of pretraining a Transformer-based decoder - one of the most effective deep learning architectures, able to find all types of dependencies in the data - on the language modeling objective (i.e., predicting the next word in a sentence given the previous ones). The release of GPT-2 in 2019 created a buzz similar to, but less intense than, the one created by ChatGPT. At that time, the capabilities of GPT-2 to generate natural text were so exciting that some researchers started worrying about the risks of releasing the model's weights publicly. Based on these concerns, OpenAI initially decided not to release the full model's weights, arguing that the model could be used maliciously to generate spam, fake news, or even racist and offensive content.
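The language modeling objective can be caricatured with a toy bigram model: predict the next word from the previous one, then generate text by repeatedly appending the prediction. The tiny corpus and greedy decoding below are our own illustrative simplifications; GPT conditions on the whole preceding context with a Transformer decoder trained on web-scale text.

```python
from collections import Counter

# Toy sketch of the language modeling objective: count which word
# follows each word, then generate by always appending the most
# likely next word (greedy decoding).
corpus = "the sun is a star . the moon is a satellite . the sun is a star ."
words = corpus.split()

counts = {}
for prev, nxt in zip(words, words[1:]):
    counts.setdefault(prev, Counter())[nxt] += 1

def next_word(prev):
    """Most likely next word after `prev` in the toy corpus."""
    return counts[prev].most_common(1)[0][0]

def generate(start, length=4):
    out = [start]
    for _ in range(length):
        out.append(next_word(out[-1]))
    return " ".join(out)

print(generate("the"))  # -> "the sun is a star"
```

Scaling this next-word game up from a counting table to billions of neural parameters is, in essence, what GPT-style pretraining does.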

OpenAI later decided to release the weights of GPT-2, and many efforts were built on top of it. Other pretrained models emerged during the same period, including BERT and BART. In our Data Science and Mining team (DaSciM) at the Computer Science Laboratory of the Ecole Polytechnique (LIX*), we contributed to several pretrained models and released them publicly to benefit the entire NLP community, especially researchers with limited resources. These pretrained models include, but are not limited to, BARThez, AraBART, BERTweetFR and JuriBERT.

So, how does ChatGPT work?

The main drawback of pretrained language models is that, despite their ability to generate fluent natural language, their output may not be the one desired by the user. It can be non-factual (i.e., stating false information), toxic, or simply not aligned with the user's intent. This is what ChatGPT tries to mitigate: generating text that responds appropriately to the user's query by following their intent. For example, we queried GPT-2 and ChatGPT with a simple question:

Which one is larger, the sun or the moon?

GPT-2's answer was fluent, however it was not coherent:

The answer is no. The sun and moon are not the same thing. They are two different things, but they do not have a common name. In fact, there are many different names for each of them.

On the other hand, ChatGPT's answer was the following:

The sun and the moon are two celestial bodies that appear in the sky, but they have very different sizes and distances from the Earth. The sun is a star that is about 1.4 million kilometers in diameter, while the moon is a natural satellite that is about 3,474 kilometers in diameter. The sun is about 400 times larger than the moon in terms of diameter, and about 27 million times larger in terms of volume. However, the sun is also about 150 million kilometers away from the Earth, while the moon is only about 384,000 kilometers away. This means that the sun and the moon appear to have roughly the same angular size in the sky, about half a degree. Therefore, the sun is much larger than the moon, but they look similar in size from our perspective.

It is clear from a user's point of view that ChatGPT's answer is far more aligned with their intent.

ChatGPT is an instance of the pretrained language model InstructGPT. Its training requires four phases:

  • Unsupervised pretraining: in this phase, a large neural network (e.g., GPT-3) is trained in a self-supervised fashion on a large corpus of raw text. This is the most computationally expensive phase and involves a massive corpus of raw text extracted from the web. For example, one of the instances of InstructGPT was initialized with the weights of GPT-3 XL, which has 1.3 billion parameters and was pretrained on 570 gigabytes of filtered raw text. GPT-3 XL is pretrained on the language modeling objective, where the model learns to predict the next word in a sentence given the previous ones.
  • Supervised training: in this phase, the pretrained model is finetuned on data annotated by human experts. Recall that the goal is to tweak the model to generate responses that better match user intents. The dataset used in this phase thus contains a set of prompts (queries), which a team of human experts annotates with the desired responses.
  • Training a reward model: in this context, the reward model is a regression neural network trained to estimate the quality of a ChatGPT response. The reward model is again trained on an annotated dataset, containing queries with multiple responses ranked by human annotators.
  • Reinforcement learning: now, having a model (agent) and a reward model, it is feasible to continue finetuning ChatGPT, this time using reinforcement learning (i.e., learning by trial and error, without human annotation) via Proximal Policy Optimization.
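The last two phases can be caricatured in a few lines of Python, on a drastically simplified problem: the "policy" merely picks one of three canned responses, a hand-written reward table stands in for the trained reward model, and a basic REINFORCE-style update stands in for Proximal Policy Optimization. Every name and number here is an illustrative assumption, not OpenAI's actual setup.

```python
import math
import random

# Toy sketch of phases 3 and 4: a policy chooses among three canned
# answers to "which is larger, the sun or the moon?".
responses = ["the sun", "the moon", "no idea"]

# Phase 3 (reward model): in reality a neural network trained on
# human rankings; here the ranking is encoded directly as scores.
reward = {"the sun": 1.0, "the moon": -0.5, "no idea": -1.0}

# Phase 4 (reinforcement learning): one preference score per response,
# turned into probabilities with a softmax.
prefs = {r: 0.0 for r in responses}

def policy_probs():
    exps = {r: math.exp(p) for r, p in prefs.items()}
    total = sum(exps.values())
    return {r: e / total for r, e in exps.items()}

random.seed(0)
lr = 0.5
for _ in range(300):
    probs = policy_probs()
    # Sample a response from the current policy (trial and error).
    choice = random.choices(responses, weights=[probs[r] for r in responses])[0]
    # Policy-gradient update: scale the log-probability gradient of the
    # sampled response by its reward, shifting mass toward high reward.
    for r in responses:
        grad = (1.0 if r == choice else 0.0) - probs[r]
        prefs[r] += lr * reward[choice] * grad

probs = policy_probs()
best = max(probs, key=probs.get)
print(best)  # the policy learns to prefer "the sun"
```

ChatGPT's actual training replaces the canned responses with free text generated token by token, the reward table with a learned neural reward model, and this naive update with Proximal Policy Optimization, but the feedback loop is the same.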

Concerns and controversy

Like any other hot topic around emerging technologies, public opinion about ChatGPT was divided. While some people panicked when they saw what ChatGPT was capable of, others did not look very impressed and tried to deny the potential risks this model could raise. Anyone with solid experience in the field can tell that the capabilities of ChatGPT are impressive and that it is a breakthrough that will somehow shift the current trends. However, we have to keep in mind that ChatGPT is not the first invention to have a significant impact on many social and educational aspects. This was the case with easy access to calculators, the Internet, and recently AI models. When these inventions reached everyone's hands, proper adaptations were made to mitigate their potential risks. ChatGPT is no different; it should not be seen as the ultimate undefeatable AI agent but rather as an AI assistant that may carry some limited risks if we are not aware of its capabilities and do not deal with it consciously. This is, in fact, a call for the NLP community to accelerate its investigation into the possible risks of ChatGPT and how to deal with them.

Another, more general concern is sovereignty over these very large models and their effect on society and the economy. Indeed, large companies have access to unprecedented amounts of data, to which citizens or even governments do not. Training these models also requires massive amounts of computation, demanding unprecedented quantities of specialized processors (GPUs) and energy, which are accessible only to a handful of large companies.

Moreover, the models themselves are closed in most cases, so society and the economy are not capitalizing on the vast potential of this new situation. These are political challenges for the near future that are expected to severely test states and societies. Recently, the next version of GPT, GPT-4, was released with an even larger number of parameters (on the order of trillions!) and with multimodal capabilities - i.e., accepting image and text inputs and emitting text outputs. Google also released “Bard”, a trial version of its AI chatbot, while Microsoft has integrated a version of ChatGPT into the Bing search engine. Finally, in China, very large multimodal pretrained models, like WuDao 2.0, have been released as well. The evolution is overwhelming and hard to keep pace with.

As a result, there have been serious concerns about safety, environmental footprint and other issues related to the use of these huge models. Specifically, an open letter was published and signed by prominent people in the field, calling for a six-month pause in training giant AI models more powerful than GPT-4. Our society must stay alert in the face of these spectacular evolutions.

*LIX: a joint research unit of CNRS, École Polytechnique, Institut Polytechnique de Paris, 91120 Palaiseau, France

About the authors:

Michalis Vazirgiannis, Professor at LIX, leader of the DaSciM group

Moussa Kamal Eddine, PhD student at LIX in the DaSciM group