Recurrent Neural Network and Long Term Dependencies

Jul 14, 2019

A recurrent neural network (RNN) is a widely used deep learning model for sequential data. It is one of the most popular models in NLP and has shown great promise on many tasks. The idea behind a recurrent neural network is to make use of sequential information. In a traditional neural network we assume that all inputs are independent of each other, but for many tasks that assumption is too crude. For instance, suppose you want to build software that predicts the next word in a sentence. The network needs to know about the words that came before it. A traditional network does not handle this kind of task well, and this is where the recurrent neural network comes into action.

Humans do not start thinking from scratch every second during a conversation. We understand each word based on our understanding of the previous words, and then process the next word. Ordinary feedforward neural networks have no mechanism for this. Recurrent neural networks are designed to address this issue: they are networks with loops in them, allowing information to persist. Recurrent neural networks are used in a large number of applications such as time series analysis, speech recognition, image captioning, machine translation and much more.

Why RNN?

Consider the sentence “The plane is very fast”. It has five words. Let Xi1, Xi2, Xi3, Xi4, Xi5 be the vectors representing these words. What happens if we train on them using a Multi Layer Perceptron (MLP)?

The figure shows the words being fed to a simple MLP network. Each input to the network is a vector of a particular dimension. Suppose each word is represented by a vector of 10k dimensions. We have 5 words here, so the total input has 50k dimensions. Because the word vectors are fed in order, this network still preserves sequence information. But the idea has serious limitations. Suppose we have another sentence like “This plane has some limitations in landing but landed safely”. This sentence has 10 words, so the input size becomes 10 x 10k = 100k dimensions. In general, sentences come with a wide range of word counts. The model above, however, can only handle a 50k-dimensional input, because it was trained on a 5-word sentence. Test sentences with a different word count do not fit the network's input and cannot be expected to yield the desired results.

One workaround for this limitation is to size the input layer for the longest sentence and pad the shorter ones. If we do so, the network becomes extremely large, potentially with billions of weights, which is a limitation of its own. The solution to this problem is to use recurrent neural networks.
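The size mismatch described above can be sketched in a few lines of NumPy. This is only an illustration: the all-zero word vectors are placeholders, and the hidden layer size of 8 and the helper name sentence_to_input are assumptions made for the example, not part of the original setup.

```python
import numpy as np

WORD_DIM = 10_000  # each word is a 10k-dimensional vector, as in the text
HIDDEN = 8         # hypothetical hidden layer size, chosen only for illustration

def sentence_to_input(num_words, word_dim=WORD_DIM):
    """Concatenate one placeholder vector per word into a single MLP input."""
    word_vectors = [np.zeros(word_dim) for _ in range(num_words)]
    return np.concatenate(word_vectors)

# First-layer weights sized for a 5-word sentence: 5 x 10k = 50k inputs.
W = np.zeros((HIDDEN, 5 * WORD_DIM))

short = sentence_to_input(5)    # "The plane is very fast"
long_ = sentence_to_input(10)   # the 10-word sentence

print(short.shape)        # (50000,)
print(long_.shape)        # (100000,)
print((W @ short).shape)  # (8,)

# The same weights cannot be applied to the longer sentence.
try:
    W @ long_
    fits = True
except ValueError:
    fits = False
print(fits)  # False
```

The first layer is tied to one input length, so every other sentence length would need either padding or a different network.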

Basic Structure Of Recurrent Neural Network

Recurrent neural networks are networks with loops in them, allowing information to persist. In the diagram, a chunk of neural network, A, looks at some input Xt together with the value carried over from the previous step, and outputs a value ht. The loop allows information to be passed from one step of the network to the next.

Here X0, X1, X2, …, Xt represent the inputs and h0, h1, h2, …, ht represent the outputs at time steps 0, 1, 2, …, t.

One of the appeals of recurrent neural networks is the idea that they might be able to connect previous information to the present task, such as using previous video frames to understand the present frame. But can a plain recurrent neural network handle this, or does it require an improved mechanism? Let us find out.
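A minimal sketch of this loop in NumPy follows. The post does not spell out the update rule, so the common tanh form h_t = tanh(W_xh x_t + W_hh h_(t-1) + b_h) is assumed here, and the weight names, sizes and random inputs are illustrative only.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a simple tanh RNN over a sequence; return the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])  # state before the first input
    hs = []
    for x_t in xs:  # one step per element of the sequence
        # The loop: the new state depends on the current input and the previous state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # small illustrative sizes
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

five_words = rng.normal(size=(5, input_dim))
ten_words = rng.normal(size=(10, input_dim))

out_five = rnn_forward(five_words, W_xh, W_hh, b_h)
out_ten = rnn_forward(ten_words, W_xh, W_hh, b_h)
print(out_five.shape)  # (5, 3)
print(out_ten.shape)   # (10, 3)
```

Because the same weights are reused at every step, one set of parameters handles both the 5-step and the 10-step sequence.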

LSTM — The Solution To Long Term Dependencies

Sometimes we just need to look at recent information to perform the present task. For example, consider a language model trying to predict the last word in “the clouds are in the sky”. Here it is easy to predict the next word as sky based on the words immediately before it. But consider the text “I grew up in France I speak fluent French.” Here it is not easy to predict French from the nearby words alone: they suggest that the next word is the name of a language, but which language depends on the mention of France earlier in the input. In such texts it is entirely possible for the gap between the relevant information and the point where it is needed to become very large. In theory, RNNs are absolutely capable of handling such “long-term dependencies.”

A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, recurrent neural networks don’t seem to be able to learn them. The cause is known as the vanishing gradient problem. The network updates its weights using gradient descent, with the gradients computed by propagating the error backward through the time steps. The gradients grow smaller at every step they travel back, so by the time they reach the early time steps they are close to zero. A weight update is proportional to the gradient, so when the gradient is tiny, the weights that connect early inputs to later outputs barely change, and the output barely changes either. The network then learns nothing about long-range dependencies. Therefore, a network facing the vanishing gradient problem cannot converge towards a good solution.
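The shrinking gradient can be observed numerically. The sketch below assumes a tanh RNN with no inputs and recurrent weights deliberately scaled so that every singular value is 0.5. Both choices are assumptions made to keep the effect easy to see, not a description of any particular trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, steps = 16, 30

# Recurrent weights: a scaled orthogonal matrix, so all singular values are 0.5.
Q, _ = np.linalg.qr(rng.normal(size=(hidden_dim, hidden_dim)))
W_hh = 0.5 * Q

# Forward pass of a tanh RNN without inputs, only to obtain the hidden states.
h = rng.normal(size=hidden_dim)
states = []
for _ in range(steps):
    h = np.tanh(W_hh @ h)
    states.append(h)

# Backward pass: push a gradient from the last step back to the first.
grad = np.ones(hidden_dim)
norms = [np.linalg.norm(grad)]
for h_t in reversed(states):
    # Chain rule through h_t = tanh(W_hh @ h_prev): scale by tanh', then apply W_hh^T.
    grad = W_hh.T @ ((1.0 - h_t ** 2) * grad)
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[10], norms[-1])
```

With these weights each backward step at least halves the gradient norm, so after 30 steps almost nothing is left to update the weights that act on the earliest inputs.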

Long Short Term Memory networks

Long Short Term Memory networks (LSTMs) are a special kind of recurrent neural network capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber in 1997. Remembering information for long periods of time is their default behavior. An LSTM unit is made up of a memory cell, an input gate, an output gate and a forget gate. The memory cell is responsible for remembering the previous state, while the gates control how much information is written to, kept in and exposed from that memory.

The memory cell keeps track of the dependencies between the elements in the input sequence. The present input and the previous output are passed to the forget gate, and the output of the forget gate is applied to the previous cell state. After that, the output of the input gate is added to the cell state. Finally, the output gate uses the updated cell state to generate the output.

Forget Gate

Some of the information in the previous cell state is not needed by the present unit of an LSTM. The forget gate is responsible for removing this information from the cell state. Information that the LSTM no longer requires, or that is of less importance, is removed by multiplying the cell state by a filter. This is required for optimizing the performance of the LSTM network. In other words, the forget gate determines how much of the previous state is passed on to the next state.

The gate has two inputs, xt and ht-1. ht-1 is the output of the previous cell and xt is the input at the current time step. The inputs are multiplied by the weight matrices and a bias is added. The sigmoid activation function is then applied to this value, giving a vector of values between 0 and 1: a value near 0 means forget that entry of the cell state, and a value near 1 means keep it.
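The forget gate computation can be sketched as follows. The names W_f, b_f, h_prev (for ht-1) and c_prev (for the previous cell state) are illustrative, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(x_t, h_prev, W_f, b_f):
    """f_t = sigmoid(W_f [h_prev, x_t] + b_f); every entry lies between 0 and 1."""
    concat = np.concatenate([h_prev, x_t])
    return sigmoid(W_f @ concat + b_f)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # small illustrative sizes
W_f = rng.normal(size=(hidden_dim, hidden_dim + input_dim))
b_f = np.zeros(hidden_dim)

x_t = rng.normal(size=input_dim)
h_prev = rng.normal(size=hidden_dim)
c_prev = rng.normal(size=hidden_dim)

f_t = forget_gate(x_t, h_prev, W_f, b_f)
kept = f_t * c_prev  # elementwise filter: near 0 drops a value, near 1 keeps it
print(f_t.shape)  # (3,)
```

The multiplication is elementwise, so each entry of the cell state gets its own keep-or-forget factor.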

Input Gate

The process of adding new information takes place in the input gate. It has three steps. First, the combination of xt and ht-1 is passed through a sigmoid function, which acts as a filter deciding which values of the cell state to update. Second, the same combination is passed through a tanh function to create a vector containing all the possible values that can be added (as perceived from ht-1 and xt) to the cell state. Third, the filter and the candidate vector are multiplied and the result is added to the cell state. This ensures that only information that is important and not redundant is added to the cell state.
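A sketch of the input gate step, under the same assumptions as the standard LSTM formulation: a sigmoid filter, a tanh candidate vector, their elementwise product, and an addition to the cell state. The parameter names W_i, b_i, W_c, b_c and the function name are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_update(x_t, h_prev, c_state, W_i, b_i, W_c, b_c):
    """Add gated candidate values to the cell state and return the new cell state."""
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ concat + b_i)          # which entries to update, in (0, 1)
    c_candidate = np.tanh(W_c @ concat + b_c)  # values that could be added, in (-1, 1)
    return c_state + i_t * c_candidate

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # small illustrative sizes
shape = (hidden_dim, hidden_dim + input_dim)
W_i, W_c = rng.normal(size=shape), rng.normal(size=shape)
b_i, b_c = np.zeros(hidden_dim), np.zeros(hidden_dim)

x_t = rng.normal(size=input_dim)
h_prev = rng.normal(size=hidden_dim)
c_state = rng.normal(size=hidden_dim)  # cell state after the forget gate was applied

c_new = input_gate_update(x_t, h_prev, c_state, W_i, b_i, W_c, b_c)
print(c_new.shape)  # (3,)
```

Since the filter is below 1 and the candidate lies between -1 and 1, a single step can change each cell state entry by less than 1 in magnitude.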

Output Gate

First, a vector is created by applying the tanh function to the cell state. Then a filter is made from the values of ht-1 and xt to regulate which values of that vector are output. This filter again employs a sigmoid function. Finally, the two are multiplied to form the output of the cell, ht, which is also passed on to the next time step.

Most of the remarkable results achieved with recurrent neural networks have been achieved using LSTMs. The real magic behind LSTM networks is that they approach human-level sequence generation quality on some tasks without any magic at all. The explanation and scope of recurrent neural networks do not end here. LSTMs were a big step in what we can accomplish with RNNs, but explaining everything in one blog post would make it too long, so we will come back to it in a later post.
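Putting the three gates together, one LSTM time step can be sketched as below. This follows the standard formulation described in the reference at the end of this post. The parameter dictionary, its key names and the random weights are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: forget, add new information, then produce the output."""
    concat = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])     # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])     # input gate filter
    c_cand = np.tanh(params["W_c"] @ concat + params["b_c"])  # candidate values
    c_t = f_t * c_prev + i_t * c_cand                         # updated cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])     # output gate filter
    h_t = o_t * np.tanh(c_t)                                  # output of the cell
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # small illustrative sizes
params = {}
for name in "fico":  # forget, input, candidate, output
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{name}"] = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a 5-step input sequence
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)  # (3,) (3,)
```

The cell state c is changed only by an elementwise multiplication and an addition at each step, which is what lets information survive over many time steps.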

Reference: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Originally published at https://www.infolks.info on July 14, 2019.