Transformer architectures are widely used in Natural Language Processing applications, but they have a shortcoming: the substantial computational overhead of the self-attention mechanism. Recent research from Google proposes replacing the self-attention sublayers with simple linear transformations. In this article, we'll see how that is done and evaluate its performance. Let's start from the basics:
The transformer in Natural Language Processing (NLP) is an architecture first proposed in "Attention Is All You Need" (2017), a Google machine translation paper. It is an encoder-decoder architecture built on attention mechanisms. This architecture can be used in many applications such as video captioning, text summarization, question answering, text prediction, and machine translation.
So, what exactly is the attention mechanism?
Keen observation of something or someone is what comes to mind when I hear the word attention, and that is essentially what we need to understand in the case of attention mechanisms. Attention allows the model to focus on certain parts of the input sentence while generating each output word.
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
Put more simply, self-attention lets us capture relationships between words within the same sentence. Let's make it easier to understand with an example.
“I had pasta from the restaurant and it was so good”
What does ‘it’ refer to? Pasta or restaurant?
That is where self-attention comes into play. With self-attention, we can relate 'it' to 'pasta'; the model focuses on the right context around 'it' to produce the correct output.
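To make the idea concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. Everything in it (the 16-dimensional random embeddings and random projection matrices) is purely illustrative; it simply shows how each token ends up with a weight over every other token, which is what would let 'it' attend strongly to 'pasta' once the weights are trained.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a single sequence.

    x: (seq_len, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    """
    q = x @ w_q                                   # queries
    k = x @ w_k                                   # keys
    v = x @ w_v                                   # values
    d_k = q.size(-1)
    # Every token attends to every other token: a (seq_len, seq_len) score matrix.
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

# Toy example: 11 tokens ("I had pasta from the restaurant and it was so good"),
# each with a random 16-dimensional embedding.
torch.manual_seed(0)
x = torch.randn(11, 16)
w_q, w_k, w_v = (torch.randn(16, 16) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)

# weights[7] is the attention that the token "it" (index 7) pays to every token.
# With random, untrained weights the values are meaningless; in a trained model
# this row would put high weight on "pasta".
print(weights[7])
```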
BERT (Bidirectional Encoder Representations from Transformers) was introduced in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". It is a language model based on the transformer architecture. Are BERT and the transformer the same thing?
The answer is NO!
What are the changes and advantages of BERT over the transformer?
In the case of BERT, we take only the encoder part of the transformer, along with its attention mechanism. The main advantage of BERT is that it is bidirectional: being bidirectional lets the architecture capture both forward and backward context. The other important point is that such a deep bidirectional model is far too expensive for most of us to train from scratch. Instead, there is already a model pre-trained on English Wikipedia (2,500M words) and BooksCorpus (800M words), so we only need to fine-tune it for the task at hand. Fine-tuning is task-specific. Woah bang!!!
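As a rough illustration of how little code fine-tuning needs, here is a minimal sketch using the Hugging Face `transformers` library (my choice for the example; the article itself does not prescribe a library). The tiny two-sentence batch and the hyperparameters are purely illustrative.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained encoder; only the small classification head is new.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tiny illustrative batch for a binary sentiment task.
texts = ["I had pasta from the restaurant and it was so good",
         "The service was painfully slow"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)   # forward pass reuses the pre-trained weights
outputs.loss.backward()                   # fine-tuning step; no pre-training from scratch
optimizer.step()
```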
Sounds easier to implement, right?
“Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model.”
BERT has inspired many recent NLP architectures, training approaches, and language models, such as Google's Transformer-XL, OpenAI's GPT-2, XLNet, ERNIE 2.0, RoBERTa, etc.
In recent years, transformers have had a large impact on almost every task in the NLP domain, so there has been a steady stream of optimizations applied to the transformer layers: the better the optimization, the more efficiency we gain without giving up accuracy. Reducing the number of self-attention layers, or even replacing them entirely, has been tried as an optimization. In this article, I will explain FNet in more detail. Yes, we can call it another advancement in the transformer architecture; precisely, another optimization of the transformer layers to improve the network.
I have already given a brief description of BERT.
How does FNet differ from BERT?
Just like BERT, FNet also takes only the encoder part of the transformer. FNet is an attention-free transformer architecture; yes, the attention layers have been replaced in FNet.
Sounds interesting right?
The attention mechanism is not used at all in FNet. So what replaces it? Let's see!
In the FNet architecture, the self-attention layers are replaced by the Fourier transform. The Fourier transform is a term we usually hear in signal processing. Let's see what advantages it brings to the architecture.
The Fourier transform is a mathematical operation that decomposes a signal into its constituent frequencies. It has been used in many deep learning applications, for example to speed up convolutions. Both the Discrete Fourier Transform (DFT) and the Fast Fourier Transform (FFT, an efficient algorithm for computing the DFT) exist, and it is the DFT that appears in most transformer-related work. In FNet, the self-attention sublayer of each transformer encoder layer is replaced with a Fourier sublayer, which applies a 2D DFT to its (sequence length, hidden dimension) embedding input.
The result of a DFT is a complex number, which can be written as a real part plus an imaginary part (a real number multiplied by the imaginary unit i, the number that enables solving equations with no real solutions). FNet keeps only the real part of the result, eliminating the need to modify the (nonlinear) feedforward sublayers or output layers to handle complex numbers.
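Here is a minimal PyTorch sketch of that Fourier sublayer (the official FNet implementation is in JAX/Flax; this is only an illustration of the operation described above, with made-up tensor sizes).

```python
import torch

def fourier_mixing(x):
    """FNet-style Fourier mixing sublayer (a sketch of the idea, not the
    official implementation): apply a 1D DFT along the hidden dimension,
    another along the sequence dimension, and keep only the real part.

    x: (batch, seq_len, hidden_dim) embeddings
    """
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

# Toy input: a batch of 2 sequences, 128 tokens each, hidden size 64.
x = torch.randn(2, 128, 64)
mixed = fourier_mixing(x)
print(mixed.shape)  # torch.Size([2, 128, 64]) -- same shape, and no learnable weights
```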
From the architecture diagram, we can see that the self-attention layers are replaced by Fourier layers.
So, why is it done?
Although transformers with self-attention blocks give better accuracy, they are also computationally expensive: the attention score matrix grows quadratically with the sequence length.
The Fourier transform provides a simple and effective mechanism for token mixing. Fourier transforms are linear and have better complexity than the self-attention mechanism. The computation of the Fourier transform is also much better optimized on GPUs/TPUs than self-attention. Moreover, the Fourier layer has no learnable weights, unlike self-attention, and therefore uses less memory. That is quite a lot of advantages. When dealing with larger data, we can run into out-of-memory errors and long training times, so as data science practitioners we prefer architectures that reduce computation time without compromising accuracy too much. In that regard, FNet reached about 92% of BERT's accuracy on text classification while training seven times as fast on GPUs and twice as fast on TPUs.
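To get a feel for the difference, here is a rough, self-contained timing sketch in PyTorch. This is not the paper's benchmark: it omits the attention projections, runs on whatever device you have, and the absolute numbers will vary, but it puts the quadratic score matrix of self-attention next to the parameter-free Fourier mixing.

```python
import time
import torch

seq_len, hidden = 2048, 768
x = torch.randn(1, seq_len, hidden)

def attention_mixing(x):
    # The (seq_len x seq_len) score matrix is what makes self-attention
    # quadratic in sequence length (query/key/value projections omitted).
    scores = torch.softmax(x @ x.transpose(-2, -1) / hidden ** 0.5, dim=-1)
    return scores @ x

def fourier_mixing(x):
    # Parameter-free token mixing in the style of FNet.
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

for fn in (attention_mixing, fourier_mixing):
    start = time.perf_counter()
    fn(x)
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```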
By replacing the attention sublayer with linear transformations, the authors reduce the complexity and memory footprint of the transformer architecture. They show that FNet offers an excellent compromise between speed, memory footprint, and accuracy, achieving 92% of the accuracy of BERT in a common classification transfer learning setup on the GLUE benchmark (Wang et al., 2018), while training seven times as fast on GPUs and twice as fast on TPUs.
The study shows that replacing a transformer's self-attention sublayers with FNet's Fourier sublayers retains most of the accuracy while significantly speeding up training, indicating the promise of linear transformations as a replacement for attention mechanisms in text classification tasks.
In this article, I first gave an overview of transformers and how BERT differs from the base transformer architecture. It was important to understand BERT before moving on to FNet, since both use only the encoder part of the transformer. The advantages of FNet over BERT were then explained, along with the reasons behind them.
This article is also available on my Medium blog.