Abstract:
- Most top-performing sequence transduction models (e.g., machine translation models) use complex recurrent or convolutional neural networks.
- These include an encoder and a decoder, often connected by an attention mechanism.
So here we introduce:
- A much simpler model called the Transformer, which uses only attention mechanisms and removes the need for recurrence and convolutions.
Some Additional Points on Transformer:
a. Experiments on machine translation tasks show that the Transformer produces better results with less training time.
b. For example, on English-to-German translation it achieved a BLEU score of 28.4, beating previous translation models built on encoder-decoder architectures with convolutional or recurrent neural networks.
c. Similarly, on English-to-French translation it reached a BLEU score of 41.8 after training for just 3.5 days, far faster than previous models.
Introduction
- As we saw in the earlier slide, RNNs (Recurrent Neural Networks) process data step by step.
- Because of this sequential processing, they take a lot of time and cannot process all positions of a sequence simultaneously.
- Some techniques have made these models faster and more efficient, but they still rely on the slow sequential processing of RNNs.
What we are going to explore here :
Transformer: a neural network that removes sequential data processing and instead works on the principle of the attention mechanism.
Hence it achieves significantly less training time and better accuracy.
What is Attention Mechanism?
So, readers, the attention mechanism is quite a mathematical concept.
Without going deep into its mathematical background, let's understand it in simple language with an example.
- Generally, when data reaches a model through the encoder, the model gives equal importance to every data point and processes the data sequentially.
- The attention mechanism instead gives more importance to the most relevant data points and correspondingly less to the rest.
- This saves a lot of time and increases accuracy, because the model focuses on the most relevant parts, which are the most useful for generating the output!
Example: Suppose a user wants to translate the sentence "The cat sat on the mat."
While translating the word "sat" into a different language, the attention mechanism will give more importance to "cat" and "mat", as they are closely related to "sat"; "the" and "on" will naturally receive less importance.
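The weighting described above can be sketched as a softmax over relevance scores. This is a toy illustration only: the scores below are made-up numbers, not values from any trained model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

# Made-up relevance scores of each source word while translating "sat"
words  = ["The", "cat", "sat", "on", "mat"]
scores = np.array([0.1, 2.0, 1.0, 0.2, 1.8])
weights = softmax(scores)

for word, w in zip(words, weights):
    print(f"{word}: {w:.2f}")
```

Content words like "cat" and "mat" end up with most of the weight, while "The" and "on" get very little.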
Background
- Some previous models, such as the Extended Neural GPU, ByteNet and ConvS2S, used convolutional neural networks to process sequences in parallel, reducing the need for step-by-step computation.
- However, in those models the number of operations needed to relate two positions grows with the distance between them, so dependencies between far-apart words take many more steps to compute.
- Transformers solve this with self-attention, which relates any two positions in a constant number of operations, no matter how far apart they are.
- Additionally, to improve accuracy, Transformers use multi-head attention, allowing the model to focus on multiple parts of the sentence at once.
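A minimal NumPy sketch of multi-head attention as described above. The dimensions and random weight matrices are illustrative (the paper's base model uses d_model = 512 with h = 8 heads); this is a sketch of the idea, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention, batched over heads
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    # Project X to queries/keys/values, split into h heads,
    # attend in each head independently, then concatenate.
    seq, d_model = X.shape
    d_k = d_model // h
    Q = (X @ Wq).reshape(seq, h, d_k).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq, h, d_k).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq, h, d_k).transpose(1, 0, 2)
    heads = attention(Q, K, V)                        # (h, seq, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo                                # final output projection

rng = np.random.default_rng(0)
d_model, h, seq = 8, 2, 4          # tiny sizes for illustration
X = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
```

Each head attends over the same sequence but through its own learned projection, so different heads can specialize in different relationships.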
Model Architecture of Transformer
Scaled Dot Product Attention - 1
Readers, scaled dot-product attention also carries a lot of mathematical background, so without going too deep into it: scaled dot-product attention is a fast and efficient way to calculate how much "attention" each word should get.
Scaled dot-product attention has 3 major inputs:
Q = Query - what am I looking for?
K = Key - what do I have?
V = Value - what information should I get if it matches?
Scaled Dot Product Attention - 2
The formula for scaled dot-product attention is: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
- QK^T: computes the dot product between the queries (Q) and the keys (K), giving similarity scores
- 1/sqrt(d_k): scales the scores to prevent them from getting too big
- softmax: turns the similarity scores into probabilities
- V: the values are weighted by those probabilities to produce the final output
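The formula above can be sketched directly in NumPy. The Q, K and V matrices below are toy values chosen purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # QK^T, scaled by 1/sqrt(d_k)
    weights = softmax(scores)          # each row sums to 1
    return weights @ V, weights        # weighted sum of the values

# Toy example: 2 queries, 3 key/value pairs, d_k = 4
Q = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.]])
K = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 1., 1., 1.]])
V = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each output row is a blend of the value vectors, weighted by how similar the corresponding query is to each key.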
Scaled Dot Product Attention - 3
Let's see one simple example, in two stages: first the attention weights are computed, then the weighted sum.
As we can see, the attention weights before the weighted sum are [0.67, 0.33].
This means the first word was more relevant (it got 67% of the attention) while the second got 33%.
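The 67%/33% split is just a softmax over two similarity scores. The scores below are made up so that the weights come out to roughly those values:

```python
import math

scores = [0.7, 0.0]                       # made-up similarity scores
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]   # softmax
print([round(w, 2) for w in weights])     # [0.67, 0.33]
```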
Position-Wise Feed-Forward Networks
- Simply put, in a Transformer, each position is passed through the same small neural network independently.
- Think of it like this: for a sentence of 10 words, the same two-layer feed-forward network is applied to each word vector, one at a time.
What is the use of Position-Wise Feed-Forward Networks ?
- It adds non-linearity to the model, enabling it to learn more complex transformations.
- It keeps positions independent: each word is treated on its own at this stage.
- It improves the model's expressiveness.
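A minimal sketch of the position-wise feed-forward network, FFN(x) = max(0, xW1 + b1)W2 + b2. The dimensions are scaled down from the paper's d_model = 512 and d_ff = 2048, and the weights are random, purely for illustration:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear layers with a ReLU in between, applied to every position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 10, 8, 32   # paper: d_model = 512, d_ff = 2048
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))   # 10 word vectors
y = position_wise_ffn(x, W1, b1, W2, b2)
# Position independence: row i of y depends only on row i of x
```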
A Glimpse of Positional Encoding
- As we have seen in previous slides, Transformer models have no recurrent or convolutional layers, so they have no built-in mechanism for understanding the order in which to process the data.
- To overcome this problem, positional encodings come into the picture.
- These positional encodings carry information about the position of each word in the sequence.
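The paper uses sinusoidal positional encodings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A sketch (max_len and d_model below are illustrative, and an even d_model is assumed):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Assumes d_model is even: sines in even columns, cosines in odd columns
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(50, 16)   # one 16-dim vector per position
```

These vectors are simply added to the word embeddings, so each token carries both its meaning and its position.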
Information on Training of Model
- The English-to-German Transformer model was trained on the standard WMT 2014 English-German dataset, consisting of about 4.5 million sentence pairs.
- Sentences were encoded using byte-pair encoding, with a shared source-target vocabulary of about 37,000 tokens.
- The English-to-French Transformer model was trained on the WMT 2014 English-French dataset of 36 million sentences, with tokens split into a 32,000 word-piece vocabulary.
Additional Information About Hardware & Optimizer used
- Hardware used – 8 NVIDIA P100 GPUs
- Optimizer: Adam, with β1 = 0.9, β2 = 0.98 and ε = 10^-9, and a learning rate that increases linearly over the first 4,000 warmup steps and then decreases proportionally to the inverse square root of the step number.
The optimizer's main job is to update the model's weights; the learning-rate schedule controls how large those updates are as training progresses.
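The learning-rate schedule used in the paper can be written directly from its formula, lrate = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps, then inverse-sqrt decay
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises until step 4000, then falls off
print(transformer_lr(100), transformer_lr(4000), transformer_lr(40000))
```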
Regularization Techniques used
- Residual dropout: randomly drops a fraction of activations during training, forcing the network not to rely too heavily on any single feature.
- Label smoothing: replaces the hard "0"s and "1"s in the target labels with slightly softened values (e.g., 0.9 instead of 1), which keeps the model from becoming over-confident.
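Label smoothing can be sketched in a few lines. The paper uses a smoothing value of ε = 0.1; the 4-class one-hot example below is illustrative (a real translation model smooths over its whole vocabulary):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # Shrink the "1" toward (1 - eps) and spread eps uniformly over all classes
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / num_classes

y = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(y))   # [0.025 0.025 0.925 0.025]
```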
Readers, for a better understanding of Transformers, Python code is attached along with this presentation that showcases the end-to-end working of a Transformer.