Attention not needed!

Attention, Attention

For as long as we have been using Transformer models, there has been a question of how to put them in production, because they need a lot of computing power. It took Google about a year from the introduction of BERT (Bidirectional Encoder Representations from Transformers) to its incorporation into their search algorithm. It was not just the advances in understanding searches with BERT; they also needed new hardware.

“But it’s not just advancements in software that can make this possible: we needed new hardware too. Some of the models we can build with BERT are so complex that they push the limits of what we can do using traditional hardware, so for the first time, we’re using the latest Cloud TPUs to serve search results and get you more relevant information quickly.” (https://blog.google/products/search/search-language-understanding-bert/)

One source of BERT’s cost is its ‘self-attention’ modules: neural network layers that compute compatibility scores between positions of the input to the model. BERT’s architecture stacks multiple attention layers, and each one is quadratic in the sequence length in both space and time. But why use ‘self-attention’ if it is that expensive? Self-attention yields a more interpretable model. These attention layers exhibit behavior related to the syntactic and semantic structure of sentences [1], which we know is helpful when doing various language tasks.
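To make the quadratic cost concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The sizes n and d and the random weights are purely illustrative (not BERT’s actual configuration); the n × n score matrix is where the quadratic space and time come from.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (n, d) input sequence; Wq, Wk, Wv: (d, d) projection matrices.
    The scores matrix is (n, n), which is where the quadratic cost
    in the sequence length n comes from.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d)

n, d = 512, 64                                       # illustrative sizes
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                  # the score matrix was 512 x 512
```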


Addressing Self-Attention Complexity

To address the inefficiency of self-attention, a few approaches have been used to reduce the quadratic complexity to linear via low-dimensional projection and row selection.

Here are some examples of approaches that address the complexity of self-attention in Transformer models:

Linformer - adds linear projection matrices so that self-attention is computed over a smaller, fixed dimension along the sequence, reducing the computation to linear complexity with almost the same performance [2].
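As a rough illustration of the Linformer idea (a sketch under my own assumptions, not the authors’ implementation): project the keys and values from length n down to a fixed k ≪ n before computing attention, so the score matrix becomes n × k instead of n × n. The projection matrices E and F are learned in the real model; here they are random.

```python
import numpy as np

def linformer_style_attention(Q, K, V, E, F):
    """Low-rank attention in the spirit of Linformer [2] (illustrative only).

    Q, K, V: (n, d); E, F: (k, n) sequence-length projections with k << n.
    Projecting K and V along the sequence axis makes the score matrix
    (n, k), i.e. linear rather than quadratic in n.
    """
    K_proj, V_proj = E @ K, F @ V                    # (k, d)
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])     # (n, k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_proj                          # (n, d)

n, d, k = 512, 64, 128                               # illustrative sizes
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
out = linformer_style_attention(Q, K, V, E, F)       # the score matrix was 512 x 128
```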



Informer – instead of computing attention values from all of the inputs at quadratic cost, selects only the queries that contribute most of the attention [3].
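A simplified sketch of the selection idea behind Informer (the actual paper estimates a sparsity measurement from a sample of keys; the exhaustive scoring below is only to show the selection step): score each query by how peaked its attention distribution would be, and compute full attention only for the top-scoring queries.

```python
import numpy as np

def select_dominant_queries(Q, K, u):
    """Pick the u queries whose attention would be most concentrated.

    Mimics the spirit of Informer's selection [3]: queries whose max
    score sits far above their mean score carry most of the attention,
    so only those rows get full attention. (The real method estimates
    this measurement from a random sample of keys instead of all of K.)
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (n, n) here, sampled in the paper
    sparsity = scores.max(axis=-1) - scores.mean(axis=-1)
    return np.argsort(sparsity)[-u:]                 # indices of the dominant queries

n, d, u = 512, 64, 32                                # illustrative sizes
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((n, d)), rng.standard_normal((n, d))
top_idx = select_dominant_queries(Q, K, u)           # attend fully only for these rows
```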

Most of these strategies select a subset of the attention input matrices while making sure the selection does not significantly hurt the Transformer’s performance. Here are a few more of them for further exploration:

Sparse Transformer - https://arxiv.org/pdf/1904.10509.pdf

Log Sparse Transformer - http://arxiv.org/abs/1907.00235

Longformer - http://arxiv.org/abs/2004.05150


NAACL 2022 – Papers

Two papers at NAACL 2022 caught my attention; they may have upsides beyond addressing the complexity of self-attention in Transformer models.


Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences

In this paper, the authors introduce Skeinformer, which accelerates self-attention and improves the selection process on the attention input matrices. It revisits the strategies of Informer and Linformer from the perspective of matrix approximation, specifically through a ‘sketching’ method.

What is sketching?

    A common way to represent data is as a matrix with ‘n’ rows and ‘d’ columns, i.e., an n × d matrix. This n × d matrix (let’s call it ‘A’) is approximated with a more compact r × d matrix (let’s call it ‘B’), or with a product of a few smaller matrices, so that matrix ‘B’ preserves the properties of ‘A’ up to some approximation ratio. The dimension ‘r’ of matrix ‘B’ should be much smaller than ‘n’, so that any downstream computations can be carried out at the smaller size, as in the case of self-attention in Transformer models.

This kind of approximation can be applied to any matrix computation to take advantage of the lower computational cost of working with a reduced-dimension matrix.
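Here is a minimal example of sketching with a Gaussian random projection (one common choice of sketching matrix; the paper studies more refined sketches): compress an n × d matrix A into an r × d sketch B = SA with r ≪ n, and check how well a downstream computation such as the Gram matrix AᵀA is preserved.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 5000, 64, 1000                      # r << n

A = rng.standard_normal((n, d))               # original n x d matrix
S = rng.standard_normal((r, n)) / np.sqrt(r)  # Gaussian sketching matrix
B = S @ A                                     # compact r x d sketch

# A downstream computation that only needs A^T A can use B instead:
exact = A.T @ A
approx = B.T @ B                              # (S A)^T (S A) ≈ A^T A
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(f"relative error of the sketched Gram matrix: {rel_err:.3f}")
```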

Additional references to understand sketching

A brief survey of the literature on the topic - https://www.cs.utah.edu/~jeffp/teaching/cs7931-S15/cs7931.html

Here is a good introduction to sketching - http://arxiv.org/abs/1411.4357

In the paper, the authors evaluate the sketched matrices against the original self-attention. They claim that the approximation tends to outperform the original, and speculate that a good approximation can recover the main signals from the original self-attention.


FNet: Mixing Tokens with Fourier Transforms

In this paper, the whole self-attention layer of a Transformer model was replaced with an unparameterized (no parameters are learned) Fourier Transform. The new model achieves nearly the same performance as the original model.



What is the Fourier Transform?

The Fast Fourier Transform (FFT), a fast algorithm for computing the discrete Fourier transform, is at the center of this paper. The Fourier transform is one of the most ubiquitous transformations in mathematical physics and engineering: it is used to represent data, images, and solutions to partial differential equations, and it is central to how we analyze and process the data we collect. Pretty much all of modern digital communication is built on the FFT (e.g., audio and video signals); sending pictures, audio clips, or any data efficiently would not be possible without it [4].

The Fourier transform represents a function as a sum of sines and cosines of increasing frequencies. Sines and cosines are very useful functions that can model many natural phenomena.
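As a toy illustration of that statement (not taken from the paper): a square wave can be approximated by adding up its first few sine harmonics, and the approximation error shrinks as higher frequencies are included.

```python
import numpy as np

# Partial Fourier series of a square wave: f(t) ≈ (4/π) Σ_{k odd} sin(k t) / k
t = np.linspace(0, 2 * np.pi, 1000)
square = np.sign(np.sin(t))

for n_terms in (1, 3, 5, 7):
    ks = np.arange(1, 2 * n_terms, 2)          # odd harmonics 1, 3, 5, ...
    approx = (4 / np.pi) * np.sum(np.sin(np.outer(ks, t)) / ks[:, None], axis=0)
    err = np.mean(np.abs(square - approx))     # shrinks as more harmonics are added
    print(f"{n_terms} harmonics: mean abs error = {err:.3f}")
```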

In the paper, the authors removed the attention layer of the Transformer model and substituted an FFT. Because the FFT has no learned parameters, it also saves some cycles during backpropagation. The paper claims that FNet performs almost the same as the BERT base model while training about 80% faster, and that it matches the accuracy of other efficient Transformer models on the Long Range Arena benchmark (documents with longer text sequences).
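A minimal sketch of the kind of token-mixing step FNet describes: apply a 2D discrete Fourier transform over the sequence and hidden dimensions and keep only the real part. The full model in the paper still wraps this in the usual feed-forward, residual, and normalization layers; this snippet only shows the parameter-free mixing itself.

```python
import numpy as np

def fourier_mixing(X):
    """Parameter-free token mixing in the spirit of FNet.

    X: (n, d) token embeddings. A DFT is applied along the sequence
    axis and the hidden axis, and only the real part is kept.
    Nothing here is learned, so there is nothing to update during
    backpropagation.
    """
    return np.fft.fft2(X).real                 # still shape (n, d)

n, d = 512, 64                                 # illustrative sizes
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
mixed = fourier_mixing(X)                      # stands in for the attention sublayer
```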



What Next?

Both papers above introduce techniques that can be used beyond Transformer-architecture models. Both sketching and the FFT deserve a second look for other applications. Here is a promising paper - https://www.cs.purdue.edu/homes/spa/papers/feature20.pdf


[1] Vaswani et al., “Attention Is All You Need.”

[2] Wang et al., “Linformer.”

[3] Zhou et al., “Informer.”

[4] Brunton et al., “Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control.”

[5] Bracewell, “Computing with the Hartley Transform.”
