High School Student Blog: Text Generation Project with Python and TensorFlow (Part 1)
Before we actually build an ML model, let us first understand some terms and definitions.
Natural Language Processing
Natural Language Processing, or NLP for short, is broadly defined as the manipulation of natural language by a computer. Natural language is how we humans communicate with each other, namely through text and speech. Working with natural language has historically been hard. It’s so hard that the Turing test, developed by Alan Turing to test a machine’s ability to exhibit intelligent behavior indistinguishable from a human’s, treats a machine that can hold a convincing conversation with a human for five minutes as one exhibiting human intelligence.
The most popular models for NLP are deep learning models, which are inspired by the structure and function of the brain. Deep learning algorithms can automatically extract features from raw data in a process called feature learning. Manually defined features of natural language tend to be over-specified and incomplete, and they take a long time to design and validate. Features learned automatically are easier to adapt, faster to train with, and can be continuously improved for better performance. By finding features on multiple levels, deep learning models can also represent higher-level features as combinations of several lower-level ones. This allows computers to learn difficult, complicated concepts by building them out of simpler ones.
TensorFlow
TensorFlow is a free, open-source, and widely used machine learning library developed by the Google Brain team. It specializes in the creation of deep learning neural networks.
Keras
Keras is an open-source application programming interface (API) for the TensorFlow library. It is an approachable and highly productive interface for building machine learning models, with a focus on deep learning.
Corpus
A corpus is a large collection of machine-readable text. This is what we will train our machine learning model on. It is common practice to divide a corpus into two sets, one for training and one for testing. The corpus typically requires some form of processing before it is fit for use in a machine learning system.
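Here is a minimal sketch of what that split might look like in Python. The filename corpus.txt and the 90/10 split are placeholders, not values from this project:

```python
# Load a plain-text corpus and split it into a training set and a test set.
# "corpus.txt" is a hypothetical filename.
with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()

split_point = int(len(text) * 0.9)  # use 90% of the text for training
train_text = text[:split_point]
test_text = text[split_point:]

print(f"Training characters: {len(train_text)}, test characters: {len(test_text)}")
```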
Encoding
Machine learning models cannot work with raw text. That’s why we need a way to convert words into numbers the model can interpret, in such a way that they retain their meanings. Encoding refers to this process of converting text data into a form that a machine learning model can understand. The process of splitting text into units (such as words) and mapping those units to numbers is called tokenization. There are several ways in which you can encode words; the most common are one-hot encoding and dense embedded vectors.
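As a quick illustration, here is a minimal tokenization sketch using Keras’ Tokenizer utility. The two sentences are made up purely for demonstration:

```python
# Build a vocabulary from a couple of toy sentences and convert them to
# sequences of integers.
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ["the cat sat on the mat", "the dog sat on the log"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)                    # learn the word-to-index mapping
sequences = tokenizer.texts_to_sequences(sentences)  # replace each word with its index

print(tokenizer.word_index)  # {'the': 1, 'sat': 2, 'on': 3, 'cat': 4, ...}
print(sequences)             # [[1, 4, 2, 3, 1, 5], [1, 6, 2, 3, 1, 7]]
```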
One-Hot Encoding
One-hot encoding converts the text into a series of zeroes and ones. It involves creating a vector for each word in a corpus such that the word is represented by a one in its respective position while all the other positions hold zeroes, and then joining all the vectors together into a matrix. While this does convert the text into a format the machine learning model can interpret, it does not capture similarities between words, nor can it represent the meaning of a word.
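Here is a minimal sketch with NumPy, using a tiny made-up vocabulary:

```python
import numpy as np

# Each word in the vocabulary gets its own position in the vector.
vocab = ["cat", "dog", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# An identity matrix gives one one-hot row per word.
one_hot = np.eye(len(vocab), dtype=int)

print(one_hot[word_to_index["dog"]])  # [0 1 0]
```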
Word Embeddings
Word embedding is the process of representing a word or a phrase as a vector of numbers, using real-valued numbers rather than just ones and zeroes. It can therefore capture more complex relationships between words, and this representation can store important information such as a word’s relationship to other words, its context, etc.
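In Keras, this is handled by an Embedding layer, which maps integer word indices to dense vectors. A minimal sketch, with the vocabulary size and vector dimensions made up:

```python
import numpy as np
from tensorflow.keras.layers import Embedding

embedding = Embedding(input_dim=100, output_dim=8)  # 100-word vocabulary, 8-number vectors

word_indices = np.array([[1, 4, 2, 3]])  # one sequence of four word indices
vectors = embedding(word_indices)

print(vectors.shape)  # (1, 4, 8): each word is now an 8-number vector
```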
Recurrent Neural Network
A basic neural network connects together a series of nodes. Each node takes in some data, applies a mathematical function to it, and passes the result on. In a basic neural network, the input data has to be of a fixed size, and the input a layer receives is the output of the previous layer transformed by the weights of the layer. An RNN, on the other hand, can remember inputs from earlier steps in the sequence. This gives the network a sort of “context”, and the output of each layer is calculated by taking this context into account along with the weights and the output of the previous layer.
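Keras ships a basic recurrent layer, SimpleRNN, that we can use to see the shapes involved. A minimal sketch with made-up sizes:

```python
import numpy as np
from tensorflow.keras.layers import SimpleRNN

rnn = SimpleRNN(units=16)  # the hidden state ("context") holds 16 numbers

# A batch of 1 sequence with 10 time steps and 8 features per step.
sequence = np.random.random((1, 10, 8)).astype("float32")
output = rnn(sequence)

print(output.shape)  # (1, 16): the hidden state after reading the whole sequence
```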
RNNs are very good for NLP. This is because, in human language, we understand each word based on our understanding of the previous words, instead of attempting to understand each word on its own. RNNs achieve this by taking into account the “context” mentioned earlier. One of the main problems with “vanilla” RNNs is that while they can usually remember the previous few words in a sentence, their ability to preserve the context of earlier inputs degrades as the input sequence grows longer. Irrelevant data accumulates over time and crowds out the relevant data needed to make accurate predictions.
LSTMs solve this problem.
Long Short-Term Memory (LSTM) Networks
LSTM networks are a type of RNN that is able to learn long-term dependencies. They do this by selectively “unlearning”, or forgetting, information that is not essential to the task at hand. By doing this, they remove the irrelevant data from the previous inputs the network has to take into account, and can thus make better predictions.
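To give a preview of where we are headed, here is a minimal sketch of the kind of model we will build: an embedding layer, an LSTM, and a softmax over the vocabulary to predict the next word. All of the sizes below are placeholders, not the final choices for this project:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 1000  # hypothetical vocabulary size

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=64),  # dense word vectors
    LSTM(128),                                       # keeps long-term context
    Dense(vocab_size, activation="softmax"),         # next-word probabilities
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# One made-up batch: a single sequence of 20 word indices.
dummy_input = np.random.randint(0, vocab_size, size=(1, 20))
predictions = model(dummy_input)
print(predictions.shape)  # (1, 1000): a probability for each vocabulary word
```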
In the upcoming posts, we will look at how to implement an LSTM and use it for text generation: we will create an LSTM model, train it, and evaluate it.
If you want to read more about tokenization, the encoding link below is a good place to start.
Stay safe and have a nice day!
Use the following links if you want to know more about the topics we’ve looked at above:
Dropout layers: https://keras.io/api/layers/regularization_layers/dropout/ and https://towardsdatascience.com/machine-learning-part-20-dropout-keras-layers-explained-8c9f6dc4c9ab
NLP: https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1
TensorFlow: https://www.tensorflow.org/
Keras: https://keras.io/
Encoding: https://towardsdatascience.com/text-encoding-a-review-7c929514cccf
RNNs: https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e
LSTMs: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Danish Khan is a Student Ambassador in the Inspirit AI Student Ambassadors Program. Inspirit AI is a pre-collegiate enrichment program that exposes curious high school students globally to AI through live online classes. Learn more at https://www.inspiritai.com/.