Generating Drake Rap Lyrics using Language Models and LSTMs

An inside look at your MC-LSTM, soon to be released on iTunes

Ruslan Nikolaev
Towards Data Science


A major part of all future AI applications is building networks that are capable of learning from some dataset and then generating original content. This idea has been applied to Natural Language Processing (NLP), and that is how the AI community developed something called Language Models.

The premise of a Language Model is to learn how sentences are built in some body of text and to use that knowledge to generate new content.

In my case, I wanted to try out rap generation as a fun side project to see if I could recreate the lyrics of the popular Canadian rapper Drake (a.k.a. #6god).

I also want to share a general Machine Learning project pipeline, as I found that building something of your own is often very difficult if you don't know exactly where to start.

1. Getting the Data

It all started with looking for a dataset of all of Drake's songs. I didn't want to waste too much time, so I built a quick script myself that scrapes the web pages of a popular lyrics website, metrolyrics.com.
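Here is a minimal sketch of what such a scraper could look like. The song titles, the URL pattern, and the 'verse' CSS class below are assumptions about metrolyrics' page layout rather than the exact script I used, so treat them as placeholders.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical dataframe of predefined songs: one row per song with its URL.
songs = pd.DataFrame({
    'song': ['Hotline Bling', 'Headlines'],
    'url': ['http://www.metrolyrics.com/hotline-bling-lyrics-drake.html',
            'http://www.metrolyrics.com/headlines-lyrics-drake.html'],
})

lyrics = []
for _, row in songs.iterrows():
    page = requests.get(row['url'])
    soup = BeautifulSoup(page.text, 'html.parser')
    verses = soup.find_all('p', class_='verse')   # adjust to the site's real markup
    lyrics.append('\n'.join(v.get_text() for v in verses))

songs['lyric'] = lyrics
songs.to_csv('drake_lyrics.csv', index=False)
```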

I used the well-known Python package BeautifulSoup to scrape the pages, which I picked up in about 5 minutes from this awesome tutorial by Justin Yek. As a note, I actually predefined which songs I wanted to acquire from metrolyrics, which is why you might notice that I'm iterating over my songs dataframe in the code above.

DataFrame storing all of the songs' lyrics

After running the scraper, I had all of my lyrics in a properly formatted .csv file and was ready to start preprocessing the data and building the model.

About the Model

Now we are going to talk about the model for text generation. This is really what you are here for; it's the real sauce, the raw sauce. I'm going to start by talking about the model design and some important elements that make lyric generation possible, and then we are going to jump into the implementation.

There are two main approaches to building Language Models: (1) Character-level Models and (2) Word-level models.

The main difference between the two comes from what your inputs and outputs are, and I'm going to explain exactly how each one works here.

Character-level model

In the case of a character-level model, your input is a series of characters seed and your model is responsible for predicting the next character new_char. Then you use seed + new_char together to generate the next character, and so on. Note that since your network input must always be of the same shape, we are actually going to lose one character from the seed on every iteration of this process. Here is a simple visualization:

Fig. 2 Iterative process of word generation with Character-level Language Model

At every iteration, the model is basically predicting the most likely next character given the seed characters. In terms of conditional probability, this can be described as finding the character that maximizes P(new_char|seed), where new_char is any character from the alphabet. In our case, the alphabet is the set of all English letters plus a space character. (Note that your alphabet can be very different and can contain any characters you want, depending on the language you are building the model for.)
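To make the iteration concrete, here is a minimal sketch of that loop. The predict_next_char function is a hypothetical stand-in for the trained model; all it has to do is return the character that maximizes P(new_char|seed).

```python
def generate(seed, n_chars, predict_next_char):
    """Grow a string one character at a time from a fixed-length seed.

    predict_next_char is a hypothetical stand-in for the trained model:
    given the current seed, it returns the character that maximizes
    P(new_char | seed).
    """
    generated = seed
    for _ in range(n_chars):
        new_char = predict_next_char(seed)
        generated += new_char
        # The network input must keep the same shape, so we drop the oldest
        # character from the seed and append the newly generated one.
        seed = seed[1:] + new_char
    return generated
```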

Word-level model

The word-level model is almost the same as the character-level one, but it generates the next word instead of the next character. Here is a simple example:

Fig. 3 Iterative process of word generation with Word-level Language Model

Now, in this model, we are still looking ahead by one unit, but this time our unit is a word, not a character. So we are looking for P(new_word|seed), where new_word is any word from our vocabulary.

Notice that we are now searching through a much larger set than before. With the alphabet, we searched through approximately 30 items; now we are searching through many more items at every iteration, so the word-level algorithm is slower per iteration. But since we are generating a whole word instead of a single character, it is actually not that bad at all. As a final note on the word-level model, we can have a very diverse vocabulary, and we usually build it by finding all unique words in our dataset (usually done in the data preprocessing stage). Since vocabularies can get very large, there are many techniques that improve the efficiency of the algorithm, such as word embeddings, but those are for a later article.
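As a rough illustration of how much larger the search space gets, here is a small sketch that builds both sets from the same data, assuming the lyrics are already loaded as a list of lines (the two example lines here are just placeholders):

```python
# Assume `lines` is the list of lyric lines loaded from the .csv file;
# the two lines here are just placeholders.
lines = [
    "tryna keep it simple is a struggle for me",
    "started from the bottom now we here",
]

# Character-level alphabet: roughly 30 items (letters, space, etc.).
alphabet = sorted(set("".join(lines)))

# Word-level vocabulary: every unique word in the dataset,
# which for a real corpus runs into the thousands.
vocabulary = sorted(set(word for line in lines for word in line.split()))

print(len(alphabet), len(vocabulary))
```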

For the purposes of this article, I'm going to focus on the Character-level model because it is simpler to implement, and an understanding of the Character-level model transfers easily to the more complex Word-level model later. As I'm writing this, I have also built a Word-level model and will attach a link to it as soon as I'm done with the write-up [here] (or you can follow me to stay updated 😉)

2. Data Preprocessing

For the Character-level model, we are going to have to preprocess the data in the following ways:

  1. Tokenize the dataset — When we feed inputs into the model, we don't want to feed in raw strings; we want to work with characters instead, since this is a Character-level model. So we are going to split all lines of lyrics into lists of characters.
  2. Define the alphabet — Now that we know every kind of character that might appear in the lyrics (from the previous tokenization step), we want to find all of the unique characters. For the sake of simplicity, and because the entire dataset is not that large (I'm only using 140 songs), I'm going to stick to the English alphabet plus a couple of special characters (like spaces) and ignore all numbers and other symbols (since the dataset is small, I'd rather have my model predict fewer characters).
  3. Create training sequences — We are going to use the idea of a sliding window and create a set of training examples by sliding a window of fixed size over a sentence. Here is a nice way to visualize this:
Fig. 4 Sliding window on the dataset with input/output generation

By moving one character at a time, we generate inputs of 20 characters each and a single output character. As a bonus, since we are moving one character at a time, we are also significantly expanding the size of our dataset.

  4. Label Encode training sequences — Finally, we don't want the model to work with raw characters (though it's possible in theory, because a character is technically just a number, so you could almost say that ASCII has already encoded all of the characters for us). Instead, we are going to associate a unique integer with each character in our alphabet, something you might have heard of as Label Encoding. This is also when we create two very important mappings, character-to-index and index-to-character. With these two mappings, we can always encode any character into its unique integer and also decode the output of the model from an index back to its original character.

  5. One-Hot-Encode the dataset — Since we are working with categorical data, where all characters fall under some category, we are going to have to encode our input columns. Here is a great description of what One-Hot-Encoding actually does, written by Rakshith Vasudev.

Once we are done with these five steps, all that is left is to build the model and train it. Here is a sketch of the previous five steps, if you want to dive deeper into the details.
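This is a condensed, illustrative version rather than the exact code from my repo; the 20-character window matches the description above, but the single example line and the variable names are placeholders.

```python
import numpy as np

# Assume `lines` is the list of lyric lines from the scraped .csv file;
# this single line is just a placeholder.
lines = ["tryna keep it simple is a struggle for me"]

# 1. Tokenize: work at the character level rather than with whole strings.
text = "\n".join(lines).lower()

# 2. Define the alphabet: all unique characters we decide to keep.
alphabet = sorted(set(text))

# 3. Create training sequences with a 20-character sliding window.
seq_len = 20
sequences, next_chars = [], []
for i in range(len(text) - seq_len):
    sequences.append(text[i:i + seq_len])
    next_chars.append(text[i + seq_len])

# 4. Label encode: map every character to a unique integer and back.
char_to_index = {c: i for i, c in enumerate(alphabet)}
index_to_char = {i: c for i, c in enumerate(alphabet)}

# 5. One-hot encode the inputs and outputs.
X = np.zeros((len(sequences), seq_len, len(alphabet)), dtype=np.float32)
y = np.zeros((len(sequences), len(alphabet)), dtype=np.float32)
for i, seq in enumerate(sequences):
    for t, char in enumerate(seq):
        X[i, t, char_to_index[char]] = 1
    y[i, char_to_index[next_chars[i]]] = 1
```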

3. Building the model

To predict the next character from a set of previous characters, we are going to use Recurrent Neural Networks (RNNs), or more specifically a Long Short-Term Memory (LSTM) network. If you are unfamiliar with either concept, I suggest you read up on them: RNNs by Pranoy Radhakrishnan and LSTMs by Eugine Kang. If you just need a refresher or are feeling brave, here is the quick rundown:

RNN refresher

Usually, you see networks that look like a web and converge from many nodes to a single output. Something like this:

Fig. 5 Image of a Neural Network. credit

Here we have a single point of input and a single output. This works great for inputs that are not consecutive, where the order of inputs does not affect the output. But in our case, the order of characters is actually very important because the specific order of characters is what creates unique words.

RNNs tackle this issue by creating a network that takes in consecutive inputs and also uses the activation from the previous node as a parameter for the next one.

Fig. 6 Overview of a simple RNN

Remember our example of the sequence Tryna_keep_it_simple, where we determined that the next character should be _. This is exactly what we want our network to do. We are going to input the sequence of characters, where each character goes into the network as T -> x<1>, r -> x<2>, y -> x<3> ... e -> x<n>, and the network predicts an output y -> _, which is a space, our next character.

LSTM refresher

Simple RNNs have one problem: they are not very good at passing information from very early cells to later ones. For example, if you are looking at the sentence Tryna keep it simple is a struggle for me, predicting the last word me (which could be literally anyone or anything, like Baka, cat, or potato) is very difficult if you can't look back and see which words appeared earlier.

LSTMs solve this problem by adding a little memory to every cell that stores some information about what happened before (what words appeared previously), and that’s why LSTMs look like this:

Fig. 7 LSTM visualization, taken from Andrew Ng’s Deep Learning specialization

As well as passing the a<n> activation, you are also passing c<n>, which contains information about what happened in the previous nodes. That's why LSTMs are better at preserving context and can generally make better predictions for purposes like Language Modeling.

Actually building it

I had learned a bit of Keras before, so I used it as the framework to build the network, but in reality this could all be done by hand; the only difference is that it would take a lot longer.
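Here is roughly what such a Keras model could look like. The layer size, optimizer, and batch size are illustrative choices rather than the exact hyperparameters from my repo, and the placeholder X and y stand in for the one-hot encoded arrays built during preprocessing.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

seq_len = 20          # length of each input sequence (see preprocessing above)
alphabet_size = 30    # approximate size of the character alphabet (assumed)

# Placeholder data so the snippet runs on its own; in the real project X and y
# are the one-hot encoded arrays produced during preprocessing.
X = np.random.randint(0, 2, size=(1000, seq_len, alphabet_size)).astype('float32')
y = np.eye(alphabet_size)[np.random.randint(0, alphabet_size, size=1000)]

model = Sequential()
# A single LSTM layer reads the one-hot encoded character sequences.
model.add(LSTM(128, input_shape=(seq_len, alphabet_size)))
# Dense + softmax outputs a probability P(new_char | seed) for every character.
model.add(Dense(alphabet_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Train in batches of 128 sequences instead of the whole dataset at once.
model.fit(X, y, batch_size=128, epochs=20)
```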

As you can see, we are using an LSTM model and we are also using batching, which means that we train on subsets of the data instead of all of it at once, to slightly speed up the training process.

4. Generating Lyrics

After our network is trained, here is how we are going to look for the next character. We start with a random seed, which is just a simple string input by a user. Then we use the seed as an input to the network to predict the next character, and we repeat this process until we have generated a bunch of new lines, similar to Figure 2 shown above.
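Here is a hedged sketch of that generation loop, assuming the trained model, the 20-character sequence length, and the character-to-index / index-to-character mappings from the earlier sketches. Picking the argmax is the simplest choice; sampling from the predicted distribution instead usually gives more varied lyrics.

```python
import numpy as np

def generate_lyrics(model, seed, n_chars, seq_len, char_to_index, index_to_char):
    """Generate n_chars new characters from a user-supplied seed string.

    Assumes the trained model, seq_len and the two mappings come from the
    preprocessing and training sketches above.
    """
    # Trim or left-pad the seed with spaces so it matches the input length
    # (spaces are part of the alphabet, so they have an index).
    seed = seed.lower()[-seq_len:].rjust(seq_len)
    generated = seed
    for _ in range(n_chars):
        # One-hot encode the current seed the same way as the training data.
        x = np.zeros((1, seq_len, len(char_to_index)))
        for t, char in enumerate(seed):
            x[0, t, char_to_index[char]] = 1
        # Pick the most likely next character; sampling from the predicted
        # distribution instead of taking the argmax gives more varied output.
        probs = model.predict(x, verbose=0)[0]
        new_char = index_to_char[int(np.argmax(probs))]
        generated += new_char
        seed = seed[1:] + new_char
    return generated
```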

Below are some examples of generated lyrics

Note: the lyrics are not censored, so view at your own discretion

You might have noticed that the words sometimes make no sense; that is a very common problem with Character-level models, because the input data is often sliced in the middle of a word, which makes the network learn and generate weird new words that somehow make sense to it.

This is the issue that gets addressed with Word-level models, but for fewer than 200 lines of code, Character-level models are still very impressive.

Other Applications

Ideas described in this Character-level network can be expanded to many other applications that are far more useful than lyric generation.

For example, next word recommendations on your iPhone keyboard work the same way.

Keyboard next word prediction

Imagine if you built an accurate enough Python language model: you could autocomplete not only keywords or variable names, but also large chunks of code, saving programmers tons of time.

You might have noticed that the code here is not complete and some of the pieces are missing. Here is the link to my GitHub repo, where you can dive much deeper into the details of building a similar project yourself.

Credits to the Keras example on GitHub.

All in all, I hope you enjoyed reading this story. Please consider following or clapping 👏 if you did. If you are interested in more content like this, you can follow me here or on any other social media at @nikolaevra.

I will catch you next time!
