High School Student Blog: Text Generation Project with Python and Tensorflow (Part 2)
In this post, we’ll be going through the process of implementing an LSTM and text generation.
This is how the text generation will work:
Given a starting character, the model will learn the probabilities regarding what character will come next.
We will then chain these probabilities together to create an output of many characters.
I’ll be using Google Colab for this project, though you can follow along using any other notebook service or using your own hardware. Setting up a Colab notebook is very easy and straightforward since you don’t need to install anything. In case you are using Colab, you can use a GPU as a hardware accelerator to improve training speeds by going to Runtime -> Change runtime type -> Hardware accelerator.
Importing Libraries
We’ll be needing the following libraries:
Numpy
Keras
NLTK (Natural Language ToolKit)
sys
Let’s go ahead and import all of these into our program.
import numpy
import sys
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint

nltk.download("stopwords")
You’ll notice that we downloaded something called “stopwords”. We’ll need this later.
We need data to train our model on. For this project I’ll be using William Shakespeare’s Romeo and Juliet, which is available to download for free on Project Gutenberg. You are free to use any other text you have.
If you are using a file from Project Gutenberg, you’ll notice that there is some information and legal text about the usage of the book at the beginning and end of the text file. You may choose to remove it if you want to.
Upload your text file onto your notebook and read it:
data = open("RomeoAndJuliet.txt").read()
To make things easier for this example, we'll convert all the text to lowercase. We will also do some preprocessing to clean the data and then convert the text file into arrays that our model can use.
We’ll need to convert our words into tokens before we can make arrays. A token is basically a sequence of characters that are grouped together as a useful semantic unit for processing.
Tokenization can be a very complicated process. In English, you can’t remove every punctuation and white space you come across since that can change the meaning of the words. It is even more difficult in languages like Arabic, where a single word can comprise up to four independent tokens. There will be links at the end of the post if you want to learn more about this topic.
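To get a quick feel for what tokens look like, here's a tiny illustration using NLTK's RegexpTokenizer with the r'\w+' pattern (which keeps runs of word characters and drops punctuation and whitespace) on a sample line:

from nltk.tokenize import RegexpTokenizer

# The \w+ pattern keeps letters, digits and underscores; punctuation is dropped.
tokenizer = RegexpTokenizer(r'\w+')
print(tokenizer.tokenize("Two households, both alike in dignity..."))
# ['Two', 'households', 'both', 'alike', 'in', 'dignity']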
We’ll create an instance of a tokenizer and use it on our text file.
In the end, we’ll remove the tokens which are in a list of Stop Words and which do not add significant value to the sentence. This is where the “stopwords” that we downloaded earlier comes in handy. It contains words (‘ourselves’, ‘hers’, ‘between’, ‘yourself’, ‘but’, ‘again’) which do not contribute to the meaning of the sentence in a significant manner. We use this to filter out the stopwords.
Let’s create a function that does all that:
def tokenize(input):
    input = input.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(input)
    filtered = filter(lambda token: token not in stopwords.words('english'), tokens)
    return " ".join(filtered)
Now call this function on our file:
processed_input = tokenize(data)
As stated before we’ll need to convert our data into arrays of numbers. The first step in doing that is to assign a number to each character that appears in our data and to create a dictionary that holds these character—number pairs.
We first sort the list of the set of all characters that are in our data. We then use the enumerate function to get the numbers that represent these characters. We finally create the dictionary which holds these values. We'll also record the total length of our input and the size of our character vocabulary, since we'll need both later.
characters = sorted(list(set(processed_input)))
char_to_num = dict((c, i) for i, c in enumerate(characters))

# Lengths we'll need later when shaping and scaling the data
input_len = len(processed_input)
vocab_len = len(characters)
Now that we’ve transformed the data into the form it needs to be in, we can begin making a dataset out of it, which we’ll feed into our network. We need to define how long we want an individual sequence (one complete mapping of inputs characters as integers) to be. We’ll set a length of 100 for now, and declare empty lists to store our input and output data:
seq_length = 100
x_data = []
y_data = []
We now need to create input-output pairs to train and test the model on. For this, we'll go through the entire list of inputs and chop it up into sequences of 100 characters, converting these characters into numbers. These will be the inputs. The output will be the next character that comes after each sequence of 100 characters, converted into its corresponding number.
for i in range(0, input_len - seq_length, 1):
    in_seq = processed_input[i:i + seq_length]
    out_seq = processed_input[i + seq_length]
    x_data.append([char_to_num[char] for char in in_seq])
    y_data.append(char_to_num[out_seq])
We now have our training data features and labels, stored as x_data and y_data. Now we have to convert our input sequences into a processed numpy array that the model can use. We'll also rescale the values in the array to floats between 0 and 1, since the network trains better on normalized inputs.
X = numpy.reshape(x_data, (len(x_data), seq_length, 1))
X = X / float(vocab_len)
We will now one-hot-encode our label data:
y = np_utils.to_categorical(y_data)
Now that all the data processing is done, we can finally create our LSTM model. We'll define the type of model (in this case, sequential), add a few LSTM layers and dropout layers (to prevent overfitting), and then add a final dense layer that outputs the probability of each character being the next one.
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
We will now compile the model, after which it will be ready for training.
model.compile(loss='categorical_crossentropy', optimizer='adam')
It takes the model quite a while to train, so we'll save the weights and reload them when training is finished. We'll set a checkpoint to save the weights to, and then pass it as a callback when we fit our model.
filepath = "model_weights_saved.hdf5" checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min') desired_callbacks = [checkpoint]
We will now train the model:
model.fit(X, y, epochs=4, batch_size=256, callbacks=desired_callbacks)
Now that we have a trained model, we load in the weights. If you are using Google Colab, now would also be a good time to download your weights if you want to use the model in the future, since all of your files will be deleted when you disconnect from the runtime.
filename = "model_weights_saved.hdf5" model.load_weights(filename) model.compile(loss='categorical_crossentropy', optimizer='adam')
Since we converted all our data into numbers before, we will need to define a dictionary that will convert the output of the model back into readable text.
num_to_char = dict((i, c) for i, c in enumerate(characters))
To use the model for character generation, we provide our model with a random seed sequence drawn from our input data, which it will use to generate a string of characters.
start = numpy.random.randint(0, len(x_data) - 1)
pattern = x_data[start]
Now, to FINALLY generate text, we'll iterate through our chosen number of characters. On each step we convert the current pattern into float values, feed them into the model, and ask it to predict which character comes next. We append the predicted character to pattern, drop the oldest character, and repeat the process for a set number of times (1000 in this example), printing out the generated characters as we go.
for i in range(1000):
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(vocab_len)
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = num_to_char[index]
    seq_in = [num_to_char[value] for value in pattern]  # the current seed, decoded back to characters
    sys.stdout.write(result)
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
This is what my model generated:
oe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe too
It is … slightly disappointing, to say the least. The generated text doesn’t make any sense, and the model immediately started repeating a simple pattern.
However, the longer you train the model, the better it will become. We trained the model for 4 epochs. This is the result of the same model trained for 50 epochs:
stay thou hast shall street ere shou hast shall street ere shou hast shall street ere thou hast shall
The words have started making sense now, even though the model still goes back to repeating a pattern fairly quickly.
And this is the result of the model after it was trained for 150 epochs:
wound shall stay thy lady capulet sun haste serve holy shy sorch bome thou wilt sword shall fortune thy live shall stay thy lady capulet sun haste serve holy shy sorch bome
You can experiment with more training time and increasing the number of layers to get a better model.
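As a starting point for that kind of experiment, here's a sketch of a slightly deeper variant of the same architecture; the layer sizes are just values to try, not tuned results:

# A deeper variant to experiment with; train and use it exactly like the model above.
deeper_model = Sequential()
deeper_model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
deeper_model.add(Dropout(0.2))
deeper_model.add(LSTM(256, return_sequences=True))
deeper_model.add(Dropout(0.2))
deeper_model.add(LSTM(256, return_sequences=True))
deeper_model.add(Dropout(0.2))
deeper_model.add(LSTM(128))
deeper_model.add(Dropout(0.2))
deeper_model.add(Dense(y.shape[1], activation='softmax'))
deeper_model.compile(loss='categorical_crossentropy', optimizer='adam')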
Note: If you decide to train your models for a very long amount of time, keep in mind that a Google Colab notebook recycles after 12 hours if the browser is kept open. Checkpointing would be a good way to get a substantial amount of training done.
If you want to read more about tokenization, you can do so here.
Stay safe and have a nice day!
Use the following links if you want to know more about the topics we’ve looked at above:
Dropout layers: https://keras.io/api/layers/regularization_layers/dropout/ and https://towardsdatascience.com/machine-learning-part-20-dropout-keras-layers-explained-8c9f6dc4c9ab
NLP: https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1
Tensorflow: https://www.tensorflow.org/
Keras: https://keras.io/
Encoding: https://towardsdatascience.com/text-encoding-a-review-7c929514cccf
RNNs: https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e
LSTMs: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Danish Khan is a Student Ambassador in the Inspirit AI Student Ambassadors Program. Inspirit AI is a pre-collegiate enrichment program that exposes curious high school students globally to AI through live online classes. Learn more at https://www.inspiritai.com/.