Note: This is part 1 of a three-part mini-series.
Before we actually build a machine learning (ML) model, let us first go over some terms and definitions.
Natural Language Processing, or NLP for short, is very broadly defined as the manipulation of natural language by a computer. Natural language is how we humans communicate with each other, namely through text and speech. Working with natural language has historically been hard. It’s so hard that the Turing test, proposed by Alan Turing to test a machine’s ability to exhibit intelligent behavior indistinguishable from that of a human, uses conversation as its yardstick: a machine passes if a human judge cannot reliably tell it apart from a person after a short text conversation (Turing’s own prediction spoke of five minutes of questioning).
The most popular models for NLP are deep learning models. A deep learning model is loosely inspired by the structure and function of the brain. Deep learning algorithms can automatically extract features from raw data in a process called feature learning. Manually defined features of natural language tend to be over-specified and incomplete, and they take a long time to design and validate. Features learned automatically are easier to adapt to new data, faster to obtain, and can be continuously improved for better performance. By finding features on multiple levels, deep learning models can also represent higher-level features as combinations of several lower-level features. This allows computers to learn difficult and complicated concepts by building them out of simpler ones.
TensorFlow is a free, open-source, and widely used machine learning library developed by the Google Brain team. It is particularly well suited to building and training deep neural networks.
Keras is an open-source application programming interface (API) that runs on top of the TensorFlow library. It is an approachable and highly productive interface for solving machine learning problems, with a focus on deep learning.
A corpus is a large collection of machine-readable text. This is what we will train our machine learning model on. It is common practice to divide a corpus into two sets, one for training and one for testing. The corpus typically requires some form of preprocessing before it is fit for use in a machine learning system.
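As a rough illustration of the train/test split described above, here is a minimal sketch in plain Python. The function name split_corpus, the 80/20 ratio, and the tiny example corpus are assumptions made for this example, not something prescribed by any library.

```python
import random

def split_corpus(sentences, test_fraction=0.2, seed=0):
    """Shuffle a list of sentences and split it into (train, test) lists."""
    shuffled = list(sentences)             # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# A made-up toy corpus; a real one would contain far more text.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
    "the sun rose over the hill",
    "rain fell on the quiet town",
]

train, test = split_corpus(corpus)
print(len(train), "training sentences,", len(test), "test sentence(s)")
```

Shuffling before splitting is a common choice so that the test set is not just the tail end of the corpus, but for text generation you may prefer to keep the original order.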
Machine learning models cannot work with raw text. That’s why we need a way to convert words into a series of numbers that the model can interpret while retaining as much of their meaning as possible. Encoding refers to this process of converting text data into a form that a machine learning model can understand. The first step is usually tokenization: splitting the text into units called tokens, typically words, and mapping each one to a number. There are then several ways to encode these tokens as vectors. The most common are one-hot encoding and dense embedding vectors.
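To make the idea of turning words into numbers concrete, here is a small sketch in plain Python. The helper names build_vocabulary and texts_to_sequences are made up for this example. It assumes words are separated by whitespace and reserves index 0 for padding, which is a common convention rather than a requirement.

```python
def build_vocabulary(texts):
    """Map each distinct lowercase word to an integer index, starting at 1."""
    word_index = {}
    for text in texts:
        for word in text.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1  # 0 is left free for padding
    return word_index

def texts_to_sequences(texts, word_index):
    """Replace every known word with its index; unknown words are skipped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]

texts = ["the cat sat", "the dog sat down"]
word_index = build_vocabulary(texts)
sequences = texts_to_sequences(texts, word_index)

print(word_index)  # {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4, 'down': 5}
print(sequences)   # [[1, 2, 3], [1, 4, 3, 5]]
```

Real tokenizers also deal with punctuation, casing rules, and rare words, which this sketch ignores.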
One-hot encoding converts the text into a series of zeroes and ones. Each word in the vocabulary is assigned a position, and a word is represented by a vector with a one in its own position and zeroes everywhere else. The vectors for the words in a text are then stacked together into a matrix. While this does convert the text into a format the machine learning model can interpret, every pair of distinct words looks equally different, so it cannot capture similarities between words, nor can it represent the meaning of a word.
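The following sketch shows one-hot encoding on a four-word vocabulary in plain Python. The vocabulary and the helper names one_hot_encode and dot are made up for illustration.

```python
def one_hot_encode(words, vocabulary):
    """Return one row per word; each row has a single 1 at that word's position."""
    position = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for word in words:
        row = [0] * len(vocabulary)
        row[position[word]] = 1
        matrix.append(row)
    return matrix

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

vocabulary = ["cat", "dog", "sat", "the"]
encoded = one_hot_encode(["the", "cat", "sat"], vocabulary)

for row in encoded:
    print(row)
# [0, 0, 0, 1]
# [1, 0, 0, 0]
# [0, 0, 1, 0]

# Any two different words have a dot product of 0, so the encoding
# carries no notion of one word being "closer" to another.
print(dot(encoded[0], encoded[1]))  # 0
```

Note that each vector is as long as the vocabulary, so this representation becomes very large and sparse for a realistic corpus.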
Word embedding is the process of representing a word or a phrase as a vector of real numbers rather than just ones and zeroes. Because the values are not restricted to one and zero, the vectors can capture more complex relationships between words: the representation can store important information such as a word’s relationship to other words, its context, and so on.
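To show why dense vectors can express similarity, here is a toy example in plain Python. The three-dimensional vectors below are hand-picked for illustration only; real embeddings are learned from data and usually have tens or hundreds of dimensions. Cosine similarity is one common way to compare them.

```python
import math

# Hypothetical embeddings, chosen by hand so that "cat" and "dog"
# point in a similar direction while "car" does not.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])

print("cat vs dog:", round(cat_dog, 3))
print("cat vs car:", round(cat_car, 3))
```

With these made-up vectors, "cat" comes out much closer to "dog" than to "car", which is exactly the kind of relationship a one-hot encoding cannot express.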
A basic neural network connects together a series of nodes arranged in layers. Each node takes in some data, applies a mathematical function to it, and passes the result on. In a basic neural network, the input data has to be of a fixed size. The input a layer receives is the output of the previous layer, transformed by the weights of the layer. A recurrent neural network (RNN), on the other hand, processes a sequence one element at a time and can remember what it saw at earlier steps. This provides the network with some sort of “context”, and the output at each step is calculated by taking into account this context along with the weights and the current input.
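The difference between an ordinary layer and a recurrent step can be sketched in a few lines of plain Python. This is a simplified illustration, not how a library implements it: the function names, the tanh activation, and the hand-picked weights are all assumptions made for the example.

```python
import math

def dense_layer(inputs, weights, biases):
    """Ordinary layer: each node is tanh(weighted sum of inputs + bias)."""
    return [
        math.tanh(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def rnn_step(x, h_prev, w_x, w_h, biases):
    """Recurrent step: like dense_layer, but also mixes in the previous state."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(wx_row, x))
            + sum(w * hi for w, hi in zip(wh_row, h_prev))
            + b
        )
        for wx_row, wh_row, b in zip(w_x, w_h, biases)
    ]

def run_rnn(sequence, w_x, w_h, biases):
    """Feed a sequence of any length through the same step, carrying the state."""
    h = [0.0] * len(biases)  # the "context" starts out empty
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, biases)
    return h

# Hand-picked weights for 2 inputs and 2 hidden nodes (illustrative only).
w_x = [[0.5, -0.3], [0.1, 0.8]]
w_h = [[0.2, 0.0], [-0.4, 0.6]]
biases = [0.0, 0.1]

a, b = [1.0, 0.0], [0.0, 1.0]
print(run_rnn([a, b], w_x, w_h, biases))
print(run_rnn([b, a], w_x, w_h, biases))  # same inputs, different order
```

The two printed states differ because the recurrent step carries context from one input to the next, so the order of the sequence matters. The same weights are reused at every step, which is what lets the network accept sequences of any length.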
RNNs are well suited to NLP. This is because in human language, we understand each word based on our understanding of the previous words, instead of attempting to understand each word on its own. RNNs achieve this by taking into account the “context” mentioned earlier. One of the main problems with “vanilla” RNNs is that, while they can usually remember the previous few words in a sentence, their ability to preserve the context of earlier inputs degrades as the input sequence gets longer. Irrelevant information accumulates over time and crowds out the relevant information needed to make accurate predictions.
Long short-term memory (LSTM) networks are designed to address this problem.
LSTM networks are a type of RNN that is able to learn long-term dependencies. They do this by selectively “unlearning”, or forgetting, information that is not essential for the task at hand. By doing this, they remove the irrelevant data from the previous inputs the network has to take into account and can thus make better predictions.
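The selective forgetting described above can be sketched with the standard textbook LSTM equations, reduced here to a single unit with a scalar input so it fits in plain Python. The function name lstm_step, the parameter layout, and the hand-set weights are assumptions for illustration. In a real model the weights are learned, and a library layer such as Keras's LSTM works on whole vectors at once.

```python
import math

def sigmoid(z):
    """Squash a number into the range (0, 1), used here as a 'how much' dial."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM cell with scalar input x.

    params maps each gate name to (input weight, recurrent weight, bias).
    Returns the new hidden state h and the new cell state c.
    """
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the cell state to expose

    c = f * c_prev + i * g             # forget some old memory, add some new
    h = o * math.tanh(c)
    return h, c

# Hand-set parameters (not learned): one cell that keeps its memory,
# and one that throws it away.
keep = {"forget": (0.0, 0.0, 10.0), "input": (0.0, 0.0, -10.0),
        "candidate": (1.0, 0.0, 0.0), "output": (0.0, 0.0, 10.0)}
forget = dict(keep, forget=(0.0, 0.0, -10.0))

_, c_kept = lstm_step(0.5, 0.0, 2.0, keep)
_, c_forgotten = lstm_step(0.5, 0.0, 2.0, forget)
print("memory kept:", round(c_kept, 3))
print("memory forgotten:", round(c_forgotten, 3))
```

When the forget gate is close to 1, the old cell state of 2.0 passes through almost unchanged. When it is close to 0, the old state is almost entirely wiped. During training, the network learns when to do which.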
In the upcoming posts, we will look at how to implement an LSTM and generate text with it, and then we will create, train, and evaluate an LSTM model.
Use the following links if you want to know more about the topics we’ve looked at above:
Stay safe and have a nice day!