(Original Paper Published by Google in 2014)
[written by Ilya Sutskever, Oriol Vinyals, Quoc V. Le]
Hello Readers!!! Welcome back to the 8th blog of the Interpretable Blog Series.
Today we are going to discuss sequence-to-sequence learning with neural networks in detail, but in a very narrative and simple way.
So, let's start!!!
As usual, let's first understand what is meant by a neural network.
Okay. So, you all have a good idea about machine learning models. A machine learning model is nothing but a mathematical equation whose form fits our data closely (in ML terms we call it a good fit), isn't it?
So, machine learning is about choosing the right mathematical equation, which we often call a machine learning model, suitable for our goal and the nature of the data.
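If that sounds abstract, here is a tiny, made-up NumPy sketch (not from the paper, with invented numbers) of what "choosing an equation that fits the data" looks like in practice:

```python
import numpy as np

# Hypothetical toy data: y is roughly 2*x + 1 with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# "Choosing a mathematical equation": here we assume a straight line y = w*x + b
w, b = np.polyfit(x, y, deg=1)

print(f"Learned equation: y = {w:.2f}*x + {b:.2f}")  # close to y = 2*x + 1, i.e. a good fit
```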
But there are some complex problems which simple machine learning models / equations cannot solve, for example recognizing objects in images that contain a complex blend of many different objects.
To solve such problems, the concept of deep learning arrived. Deep learning is a sub-branch of machine learning which uses neural networks to solve problems.
AND
A neural network is a type of computer program designed to think a little bit like the human brain. It's made of different layers of tiny units called "neurons", and they work together to solve problems.
Here is the very basic architecture of a neural network:
A] Input Layer – takes the inputs, i.e. Input X1, Input X2 and Input X3.
B] Hidden Layer – this is the brain of the neural network. It contains the weighted connections between neurons, and its size and structure can be different for each problem statement.
C] Output Layer – provides the output computed from the hidden layer, i.e. Output Y1 and Output Y2.
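To make this picture concrete, here is a minimal NumPy sketch of the same three-layer idea. The hidden-layer size, the random weights, and the sigmoid activation are my own illustrative choices, not anything fixed by the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input layer: three inputs X1, X2, X3
x = np.array([0.5, -1.0, 2.0])

# Hidden layer: 4 neurons, each with its own weighted connections and a bias
W_hidden = np.random.randn(4, 3) * 0.1
b_hidden = np.zeros(4)
h = sigmoid(W_hidden @ x + b_hidden)

# Output layer: two outputs Y1, Y2 computed from the hidden layer
W_out = np.random.randn(2, 4) * 0.1
b_out = np.zeros(2)
y = W_out @ h + b_out

print("Outputs Y1, Y2:", y)
```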
So now the question arises….
What is sequence to sequence learning?
In the field of AI, sequence-to-sequence learning means we give a sentence (a sequence) as input and get some sentence back as output (whether transformed or translated). Translating languages, summarizing text, or generating responses in a conversation are some common examples of sequence-to-sequence learning.
So, readers, I believe that the idea behind the research paper is clear now.
Deep Neural Networks (DNNs) have shown very good accuracy on a variety of complex tasks. But they cannot be applied directly to sequence-to-sequence mappings, where the lengths of the input and output are not fixed in advance.
The method in this research paper uses one LSTM to map the input sequence (sentence) to a vector of fixed dimensionality, and then another LSTM to decode the target sequence from that vector.
The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and passive voice.
It was also observed that when the order of words in the source sentence is reversed (keeping the target sentence as it is), the translation quality improves noticeably.
| System | BLEU Score | Notes |
| --- | --- | --- |
| LSTM (direct translation) | 34.8 | Penalized for out-of-vocabulary words; handled long sentences well |
| Phrase-based SMT | 33.3 | Standard statistical machine translation system |
| LSTM + SMT (reranking 1000 hypotheses) | 36.5 | Used LSTM to re-rank SMT outputs, close to previous best result |
So, readers, after looking at the above information, you may not be familiar with some of the terms, so I will explain each of them in one sentence.
LSTM – Long Short-Term Memory – a type of recurrent neural network specially suited for sequence-based tasks.
BLEU Score (Bilingual Evaluation Understudy) – an evaluation metric which measures how similar a machine-generated translation is to a human reference translation.
Its range is [0, 1] (often reported scaled to 0–100, as in the table above). Closer to 1 (or 100) means the translation is closer to the human reference, and vice-versa.
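To get a feel for BLEU, here is a small illustrative sketch using NLTK's sentence-level BLEU. The example sentences are made up, and the paper itself reports corpus-level, case-sensitive BLEU rather than this toy per-sentence version:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical human reference translation and machine output, already tokenized
reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# Smoothing avoids a zero score when some higher-order n-grams never match
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")  # a value in [0, 1]; papers usually report it scaled by 100
```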
Deep Neural Networks are very powerful and can solve hard problems like speech recognition and visual object recognition. The main strength of DNNs is their ability to perform arbitrary parallel computation for a modest number of steps.
Despite their strength and power, DNNs can only be applied to problems whose inputs and outputs can be sensibly encoded with vectors of fixed dimensionality.
But many important real-world problems are expressed as sequences whose lengths are not known in advance, so they do not have a fixed dimensionality; this is a limitation of plain DNNs. Question answering, for example, can also be seen as sequence-to-sequence learning, since a sequence of words representing the question is mapped to a sequence of words representing the answer. It was therefore clear that a domain-independent method which learns to map sequences to sequences would be useful.
In the original paper, a straightforward application of the Long Short-Term Memory (LSTM) architecture is used, and it is shown that it can solve general sequence-to-sequence problems. The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain a large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector. The second LSTM is essentially a recurrent neural network language model, except that it is conditioned on the input sequence.
The LSTM's ability to successfully learn on data with long-range temporal dependencies makes it a natural choice for this application, due to the considerable time lag between the inputs and their corresponding outputs.
Below is the architecture.
The model reads "ABC" as the input sentence and produces "WXYZ" as the output sentence, stopping once it emits the end-of-sentence token.
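To make the encoder-decoder idea concrete, here is a minimal PyTorch sketch. The vocabulary size, embedding size, hidden size, and single-layer LSTMs are toy choices of mine and are far smaller than the model in the paper:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy encoder-decoder: one LSTM reads the source, another generates the target."""

    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # The encoder reads the source sequence and returns its final (h, c) state:
        # this is the fixed-dimensional vector that represents the whole sentence.
        _, state = self.encoder(self.embed(src_ids))
        # The decoder is a conditional language model: starting from the encoder
        # state, it predicts the next target word at every timestep.
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)  # logits over the target vocabulary

# Tiny usage example with fake token ids (batch of one sentence)
model = Seq2Seq()
src = torch.tensor([[5, 6, 7]])          # stands in for "A B C"
tgt = torch.tensor([[11, 12, 13, 14]])   # stands in for "W X Y Z"
logits = model(src, tgt)
print(logits.shape)                      # torch.Size([1, 4, 1000])
```

In the real system, training maximizes the probability of the correct translation given the source sentence, and at test time the decoder generates the output word by word until it emits the end-of-sentence token.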
Surprisingly, the LSTM did not suffer on very long sentences, despite the recent experience of other researchers with related architectures. The authors were able to do well on long sentences because they reversed the order of words in the source sentence, but not the target sentences, in the training and test sets. By doing so, they introduced many short-term dependencies that made the optimization problem much simpler.
A useful property of the LSTM is that it learns to map an input sentence of variable length into a fixed-dimensional vector representation.
An RNN is a natural generalization of feedforward neural networks to sequences. Given a sequence of inputs (X1, X2, X3, …, Xn), a standard RNN computes a sequence of outputs (Y1, Y2, Y3, …, Yn) by iterating the following equations (as given in the paper):
h(t) = sigm(W_hx * X(t) + W_hh * h(t-1))
AND
Y(t) = W_yh * h(t)
Here h(t) is the hidden state, which carries information forward from the previous timesteps, and Y(t) is our output at time t. So, readers, I believe you are clear up to this point. The LSTM involves the calculation of a conditional probability, so before moving forward, let's understand the concept of conditional probability with the help of a simple one-liner definition.
So, it is written as P(A|B).
What does it mean?
P(A|B) is nothing but the probability of event A, given that event B has already happened.
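As a quick made-up worked example (the numbers are only for illustration):

```latex
P(A \mid B) = \frac{P(A \cap B)}{P(B)},
\qquad \text{e.g.}\quad
P(\text{rain} \mid \text{clouds}) = \frac{P(\text{rain and clouds})}{P(\text{clouds})} = \frac{0.3}{0.6} = 0.5
```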
So, coming back to the LSTM,
it calculates P((Y1, Y2, Y3, …, Ym) | (X1, X2, X3, …, Xn)), where the output length m can be different from the input length n.
Simply put, it is the probability of the predicted output sequence (Y1, Y2, Y3, …, Ym) given that the input sequence (X1, X2, X3, …, Xn) is provided.
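Concretely, the paper computes this conditional probability one output word at a time: the encoder first compresses the input into a fixed-dimensional vector v, and the decoder multiplies together the probability of each output word given v and all the previously generated words (I have renamed the paper's sequence lengths to n and m to match the notation above):

```latex
p(Y_1, \dots, Y_m \mid X_1, \dots, X_n) \;=\; \prod_{t=1}^{m} p\left(Y_t \mid v,\, Y_1, \dots, Y_{t-1}\right)
```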
So, readers, I believe that you are now familiar with the basic architecture of the model.
In the research paper, they applied this method to the WMT’14 English to French MT task in two ways.
WMT’14 English to French MT refers to a machine translation benchmark task from the 2014 Workshop on Statistical Machine Translation (SMT).
One way is without using SMT and the other is with SMT (Statistical Machine Translation), and finally the results are compared.
For training, a subset of the WMT’14 English to French dataset is used, consisting of 12M sentence pairs with 348M French words and 304M English words.
In the original research paper, things are explained mathematically. To make it easy for you, I will try to explain it in layman's terms.
Once trained, they generated translations using beam search:
Beam search is a smart guessing strategy for generating sequences without checking every single possibility.
So, here are the steps. Let's see how it works:
1. Keep a small set of B partial translations (hypotheses) at each timestep.
2. Extend every partial hypothesis with each possible word from the vocabulary and score the result with the LSTM.
3. Keep only the B most likely hypotheses and discard the rest.
4. When a hypothesis ends with the end-of-sentence token "<EOS>", move it to the set of complete translations.
5. When decoding finishes, output the most likely complete translation.
Beam size (B) matters: a beam size of 1 already worked surprisingly well, and a beam size of 2 provided most of the benefits of beam search. A simplified sketch of the whole procedure follows below.
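Here is a simplified, self-contained Python sketch of beam search. The toy vocabulary and the made-up scoring function stand in for the real decoder LSTM, so treat it as an illustration of the procedure rather than the paper's actual decoder:

```python
import math

# Toy vocabulary; a real decoder would score tens of thousands of words.
VOCAB = ["W", "X", "Y", "Z", "<EOS>"]

def next_word_log_probs(prefix):
    # Stand-in for the decoder LSTM: it simply prefers the next letter of
    # "W X Y Z" and then "<EOS>", so the example stays runnable and deterministic.
    preferred = ["W", "X", "Y", "Z", "<EOS>"][min(len(prefix), 4)]
    return {w: math.log(0.6 if w == preferred else 0.1) for w in VOCAB}

def beam_search(beam_size=2, max_len=6):
    beams = [([], 0.0)]                 # (partial hypothesis, log-probability)
    completed = []
    for _ in range(max_len):
        # Extend every partial hypothesis with every word in the vocabulary
        candidates = []
        for prefix, score in beams:
            for word, lp in next_word_log_probs(prefix).items():
                candidates.append((prefix + [word], score + lp))
        # Keep only the B most likely hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == "<EOS>":
                completed.append((prefix, score))   # finished; leaves the beam
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return max(completed + beams, key=lambda c: c[1])

print(beam_search())  # (['W', 'X', 'Y', 'Z', '<EOS>'], ...)
```

With beam_size=1 this reduces to greedy decoding, which is why the paper's observation that B=1 already works well is interesting.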
They also tested another use case: using the LSTM to rescore the 1000-best translation hypotheses produced by the baseline SMT system, averaging the SMT score of each hypothesis with the LSTM's score for it.
This way, their LSTM could help improve SMT results without replacing it entirely.
While the LSTM is capable of solving problems with long-term dependencies, it was observed that when the source sentences are reversed (keeping the target sentences as they are), the test BLEU score of the LSTM increased from 25.9 to 30.6.
Normally, when we concatenate a source sentence with a target sentence, each word in the source sentence is far from its corresponding word in the target sentence. As a result, the problem has a large “minimal time lag”.
By reversing the words in the source sentence, the average distance between corresponding words in the source and target language is unchanged. However, the first few words in the source language are now very close to the first few words in the target language, so the problem’s minimal time lag is greatly reduced.
So here a new term arrives: minimal time lag. What is that?
It’s the smallest gap in time steps between when the model receives information and when it is expected to use that information to make a correct prediction.
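The reversal trick itself is just a preprocessing step. Here is a tiny illustrative Python sketch (the whitespace tokenization and the example sentence pair are my own simplifications):

```python
def reverse_source(pair):
    """Reverse the word order of the source sentence; leave the target untouched."""
    source, target = pair
    reversed_source = " ".join(reversed(source.split()))
    return reversed_source, target

# Hypothetical English-French pair
print(reverse_source(("I am a student", "je suis etudiant")))
# ('student a am I', 'je suis etudiant')
```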
They trained a large, deep LSTM model for machine translation and found it wasn't too hard to train. The model had 4 layers, with 1,000 memory cells in each layer and 1,000-dimensional word embeddings. They also compared deep vs shallow LSTMs and saw that each additional layer gave a significant performance improvement.
This means the model uses 8,000 real numbers to represent a whole sentence (its hidden-state representation).
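Where does 8,000 come from? My reading of the paper's numbers is that each of the 4 layers keeps two state vectors of 1,000 values each (a cell state and a hidden state):

```latex
4 \text{ layers} \times 2 \text{ state vectors} \times 1000 \text{ units} = 8000 \text{ numbers per sentence}
```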
What’s the reason for deep LSTMs working better? The paper suggests it is their much larger hidden state, which gives the model more capacity to represent each sentence.
Out of the 384 million parameters, 64 million are pure recurrent connections (32 million for the encoder LSTM and 32 million for the decoder LSTM), i.e. the connections that carry memory from one timestep to the next.
As mentioned earlier, the ability to run computations in parallel is one of the strengths of DNNs, and the authors relied on it here.
Let's first understand what problem they faced that made this parallelization necessary.
They wrote their LSTM translation system in C++ and ran it on a single GPU. With the model from the previous section, it could process about 1,700 words per second. This speed was too slow for their needs: training would have taken far too long to complete.
The solution to the above problem was to use multiple GPUs. They had access to a machine with 8 GPUs.
They split the work across these GPUs in two main ways:
A] Each of the 4 LSTM layers ran on its own GPU, so different layers could work on different timesteps at the same time (see the sketch after this list).
B] The remaining 4 GPUs were used to parallelize the softmax over the output vocabulary, with each GPU responsible for a slice of the vocabulary.
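Here is a rough PyTorch sketch of the layer-per-GPU idea. It falls back to the CPU when no GPUs are available, and the sizes are illustrative; the paper's actual system was a custom C++ implementation:

```python
import torch
import torch.nn as nn

# Place each LSTM layer on its own device (the paper used 8 GPUs; we fall back
# to the CPU here just so the sketch runs anywhere).
num_gpus = torch.cuda.device_count()
devices = [torch.device(f"cuda:{i}") for i in range(num_gpus)] or [torch.device("cpu")]

layers = [nn.LSTM(1000, 1000, batch_first=True).to(devices[i % len(devices)])
          for i in range(4)]

x = torch.randn(1, 10, 1000)                 # fake batch: 1 sentence, 10 timesteps
for i, layer in enumerate(layers):
    x = x.to(devices[i % len(devices)])      # ship activations to the layer's device
    x, _ = layer(x)
print(x.shape)                               # torch.Size([1, 10, 1000])
```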
The result they got from that: a speed of about 6,300 words per second with a minibatch size of 128, and training took about ten days.
How they measured translation quality: they used the cased BLEU score (a standard metric for translation quality that compares the model output to human reference translations, and is case-sensitive).
Comparisons
Table 1: The performance of the LSTM on the WMT’14 English to French test set (without the use of SMT (Statistical Machine Translation))
Table 2: The performance of the LSTM on the WMT’14 English to French test set (with the use of SMT (Statistical Machine Translation))
Okay Readers.
Now let's look at the model analysis graphically.
These graphs show how the LSTM groups sentences with similar meanings close together in its learned "semantic space," even if the word order changes. Paraphrases cluster tightly, while sentences with different roles or meanings are far apart, showing that the model captures true sentence-level meaning, not just word patterns.
A large deep LSTM with a limited vocabulary outperformed a standard SMT-based system with an unlimited vocabulary on this large-scale machine translation task. It was also observed that the LSTM did surprisingly well on long sentences.