
NEURAL NETWORKS
OVERVIEW
A seq2seq (encoder-decoder) model with simple dot-product attention will be used to build the text summarization model. The underlying idea behind choosing this architecture is that we have a many-to-many problem at hand (n words as input and m words as output). The figure below shows the detailed architecture diagram for this model.
Neural Networks Architecture

There are four major components in this architecture:
Encoder: The encoder layer of the seq2seq model extracts information from the input text and encodes it into a single vector, called the context vector.
For each input word, the encoder generates a hidden state and uses that hidden state when reading the next input word. A GRU (Gated Recurrent Unit) is used for the encoder layer in order to capture long-term dependencies, mitigating the vanishing/exploding gradient problem encountered while working with vanilla RNNs.
The GRU cell reads one word at a time and, using its update and reset gates, computes the new hidden state.
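The gating logic above can be sketched in a few lines. This is a deliberately minimal, scalar (one-dimensional) pure-Python GRU cell for illustration only, not a real vectorized layer; the weight names (`wz`, `uz`, etc.) are illustrative assumptions, not from any library:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w):
    # Scalar GRU cell: x is the current input, h the previous hidden state.
    z = sigmoid(w["wz"] * x + w["uz"] * h)          # update gate: how much to refresh the state
    r = sigmoid(w["wr"] * x + w["ur"] * h)          # reset gate: how much past state to use
    n = math.tanh(w["wn"] * x + w["un"] * (r * h))  # candidate hidden state
    return (1 - z) * h + z * n                      # interpolate old state and candidate

def encode(tokens, w):
    # Read one (embedded) token at a time; the final hidden state is the context vector.
    h = 0.0
    states = []
    for x in tokens:
        h = gru_step(x, h, w)
        states.append(h)
    return states, h  # all hidden states (kept for attention) and the context vector
```

Note that, unlike an LSTM, the GRU keeps no separate cell state; the two gates act directly on a single hidden state.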
Decoder: The decoder layer of a seq2seq model takes the last hidden state of the encoder, i.e. the context vector, and generates the output words. The decoding process starts once the sentence has been encoded; at each time step the decoder is given a hidden state and an input token.
At the initial time step, the hidden state is the context vector and the input token is SOS (start-of-sentence). The decoding process ends when EOS (end-of-sentence) is generated.
The SOS and EOS tokens are explicitly added at the start and end of each sentence, respectively.
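The decoding loop described above can be sketched as follows. Here `decoder_step` is a hypothetical stand-in for a full GRU-based decoder step (it would normally embed the token, run the GRU, and project to the vocabulary); only the SOS/EOS control flow is shown:

```python
SOS, EOS = "<sos>", "<eos>"

def greedy_decode(context, decoder_step, max_len=20):
    # The first hidden state is the encoder's context vector,
    # and the first input token is SOS.
    h, token = context, SOS
    output = []
    for _ in range(max_len):
        token, h = decoder_step(token, h)  # predict the next word and new hidden state
        if token == EOS:                   # stop once end-of-sentence is produced
            break
        output.append(token)
    return output
```

The `max_len` cap guards against the model never emitting EOS.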
Attention Mechanism: In the plain encoder-decoder architecture, the encoded context vector is passed to the decoder to generate the output sentence. If the input sentence is long, a single context vector cannot capture all the important information; this is where attention comes into the picture.
The main intuition behind attention is to allow the model to focus on the most important parts of the input text. As an added benefit, it also helps mitigate the vanishing gradient problem.
There are different types of attention, such as additive and multiplicative; however, we will use basic dot-product attention for our model.
1. Attention scores are first calculated by computing the dot product of the encoder hidden states (h) and the decoder hidden state (s).
2. These attention scores are converted to a distribution (α) by passing them through a softmax layer.
3. The weighted sum of the encoder hidden states (z) is then computed.
4. Finally, z is concatenated with s and fed through the output softmax layer to generate the next word greedily (by computing the argmax).
In this architecture, instead of using only the encoder's last hidden state, we also feed the decoder a weighted combination of all the encoder hidden states. This helps the model pay attention to important words across long sequences.
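Steps 1-3 above can be sketched in pure Python (operating on plain lists rather than tensors; step 4 lives in the decoder's output layer and is omitted here):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(encoder_states, s):
    # Step 1: score each encoder hidden state h against the decoder state s.
    scores = [sum(hi * si for hi, si in zip(h, s)) for h in encoder_states]
    # Step 2: normalise the scores into a distribution alpha.
    alpha = softmax(scores)
    # Step 3: weighted sum of the encoder states gives the context z.
    dim = len(s)
    z = [sum(a * h[i] for a, h in zip(alpha, encoder_states)) for i in range(dim)]
    return alpha, z
```

Note that dot-product attention assumes the encoder and decoder hidden states have the same dimensionality, which is why no extra weight matrix appears in the scoring step.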
Supporting Equations
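The equation figure is not reproduced here, but the four attention steps can be written compactly as follows (a reconstruction from the text, with $h_i$ the encoder hidden states, $s_t$ the decoder hidden state at step $t$, and $W$ an output projection):

```latex
e_{t,i} = s_t^{\top} h_i
    \qquad \text{(1. dot-product scores)} \\
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j}\exp(e_{t,j})}
    \qquad \text{(2. softmax distribution)} \\
z_t = \sum_{i} \alpha_{t,i}\, h_i
    \qquad \text{(3. weighted sum)} \\
\hat{y}_t = \arg\max\ \operatorname{softmax}\!\left(W\,[z_t\,;\,s_t]\right)
    \qquad \text{(4. greedy output)}
```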

Teacher Forcing: In general, for recurrent neural networks, the output from one time step is fed as input to the next. This process can cause slow convergence, thereby increasing the training time.
What is Teacher Forcing
Teacher forcing addresses this slow convergence by feeding the ground truth to the model: instead of using the decoder's predicted output as the input for the next time step, the actual target word is fed in. If the model predicts a wrong word, all subsequent predictions may be thrown off; teacher forcing corrects the model at each step by supplying the actual value.
Teacher forcing is a fast and effective way to train RNNs; however, it can result in more fragile/unstable models when the generated sequences diverge from what was seen during training.
To deal with this issue, we will follow an approach that randomly chooses between the ground-truth token and the generated output from the previous time step as the input for the current time step.
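The random choice between ground truth and model output can be sketched as below. The `decoder_step` function and the `teacher_forcing_ratio` default are illustrative assumptions:

```python
import random

def decode_for_training(context, target_tokens, decoder_step, teacher_forcing_ratio=0.5):
    # At each step, flip a biased coin: feed the ground-truth token
    # (teacher forcing) or the model's own previous prediction.
    h, inp = context, "<sos>"
    predictions = []
    for truth in target_tokens:
        pred, h = decoder_step(inp, h)
        predictions.append(pred)
        use_teacher = random.random() < teacher_forcing_ratio
        inp = truth if use_teacher else pred  # choose the next step's input
    return predictions
```

Setting `teacher_forcing_ratio=1.0` recovers pure teacher forcing, while `0.0` always feeds the model its own predictions, as at inference time.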