Rise of AI in language processing: Transformers
- Yasin Uzun, MSc, PhD
- Sep 22, 2024
- 2 min read
Updated: Dec 1, 2024
In my previous post, I touched on recurrent neural networks (RNNs), one of the early architectures to use the "Encoder-Decoder" model: the encoder captures the information in (or abstract meaning of) the text in one language, and the decoder reconstructs that information in another language.
Although RNNs showed considerable improvement in machine translation, they had a drawback: the memory of the model diminished as the text grew longer. In other words, the model forgot what was being talked about in earlier sentences. Multiple enhancements were proposed to solve this problem, such as gated recurrent units (GRUs) and long short-term memory models (LSTMs). In fact, LSTMs were quite successful and were the most widely used models for automated translation until recently, before the rise of a new deep learning model that would transform the field of natural language processing in its entirety.
Although LSTMs were generally successful in automated translation, they had their own drawbacks. Despite their better accuracy, they were, by design, not suitable for parallelization, and they were slow. A seminal paper titled "Attention Is All You Need" (https://arxiv.org/abs/1706.03762) would propose a solution that would not only handle the efficiency problem, but also yield better translations.
At the heart of the transformer model lie the attention layers. "Attention" essentially corresponds to the semantic relationship between words. As an example, consider the sentence: "The car was moving on the highway along the river." In this sentence, the words car and highway are closely related, as both are associated with road transportation, so they have a stronger relationship with each other than with the other words. This information is derived from mathematical embeddings of the words (such as Word2Vec). This context helps us understand that it is not a railroad (train) car but an automobile, because it is moving on the "highway". By exploiting this simple intuitive relationship, the transformer algorithm "pays more attention" to the highly related words in the text when translating a particular word. This concept has major implications for accuracy.
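To make this a bit more concrete, here is a minimal sketch of the scaled dot-product attention operation at the core of the transformer, written in Python with NumPy. The toy sentence length, embedding size, and random vectors are purely illustrative assumptions, not real model weights.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute attention weights and the weighted sum of value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise relatedness of words
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ V, weights                        # each output mixes related words

# Toy example: 9 words ("The car was moving on the highway along the river"),
# each represented by a 4-dimensional embedding (random here, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(9, 4))
output, weights = scaled_dot_product_attention(X, X, X)
print(weights.shape)  # (9, 9): how much each word "attends" to every other word
```

In a trained model, the rows of the weight matrix would concentrate on genuinely related words, so the entry linking "car" and "highway" would be larger than the entries linking "car" to unrelated words.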
By combining the attention layers with additional features such as positional encoding, which essentially tracks the relative position of each word in the text, transformers eliminate the dependency on previous words that forces serialization in RNNs. Thanks to this feature, transformers can be highly parallelized, which is a critical improvement.
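For readers curious about what positional encoding looks like in practice, here is a small sketch of the sinusoidal scheme described in the original transformer paper; the sequence length and model dimension below are arbitrary illustrative values.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix encoding each word position."""
    positions = np.arange(seq_len)[:, None]            # 0, 1, 2, ...
    dims = np.arange(d_model)[None, :]
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions use cosine
    return pe

# These values are simply added to the word embeddings, so the model knows
# word order without having to process the text sequentially.
print(positional_encoding(seq_len=9, d_model=4))
```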
The accuracy and efficiency advances of transformer models over recurrent models made them the top choice not just for automated translation, but for the entire field of natural language processing. You might have noted that, although highly related, text translation is not the only task that large language models (LLMs) such as ChatGPT perform. These "Generative Pre-trained Transformer" (GPT) models do not just translate texts across languages; they do much more and are the main locomotive driving the AI boom in academia, research, industry and the media. I will touch on this "generative" property in my next article.