Word Clouds

Creating Styled Word Clouds with Python wordcloud and NLTK

Written By: Alec Jackson

Python provides several libraries for working with datasets, and for this project I relied on a few of them. There were three major parts to creating these word clouds. First, we need to create our dataset. I chose to work with one of my favorite movie scripts, Starship Troopers. Any text will work here, but anything smaller than a movie script is going to be short on content. There need to be enough tokens to produce an interesting word cloud once the text has been processed. The Natural Language Toolkit (NLTK) for Python simplifies the text processing. I chose to use a single character's lines as my corpus. This means we need to search the script for the character's name and then parse the lines that follow. There are several ways to do this using NLTK, and the approach will vary depending on the script format.
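As a rough illustration of the extraction step, here is a minimal sketch that uses only the standard library. It assumes a common screenplay layout that may not match your script: a speaker cue in capitals on its own line (optionally followed by a parenthetical such as "(O.S.)"), then dialogue until the next blank line. The function name character_lines is made up for this example.

```python
import re


def character_lines(script_text, character):
    """Collect the dialogue lines spoken by one character.

    Assumed layout: an all-caps cue line, then dialogue until a blank line.
    """
    wanted = character.upper()
    lines = []
    speaking = False
    for raw in script_text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line ends the current block of dialogue.
            speaking = False
        elif line.isupper():
            # All-caps lines are treated as cues (scene headings also land
            # here, but they never match the character name).
            cue = re.sub(r"\(.*?\)", "", line).strip()
            speaking = cue == wanted
        elif speaking and not line.startswith("("):
            # Skip stage directions such as "(quietly)".
            lines.append(line)
    return lines
```

One known weakness of this sketch is that a shouted, all-caps line of dialogue is mistaken for a cue and dropped, so a real script will probably need a stricter rule for what counts as a speaker name.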

Having created a corpus for our character, we want to do several things to clean up the text. First, we need to remove punctuation. NLTK's standard tokenizer treats punctuation marks as separate tokens, which makes this task a little easier. Next, we want to expand contractions. There are several Python libraries that can help with this. Most are relatively straightforward programs that do a basic find and replace with a dictionary. Fortunately, this means that if our corpus contains any contractions that are not listed, we can simply add the contraction and its expanded form to the dictionary. The final part of cleaning the text is to remove stop words. This part is optional, but it will make the word cloud more interesting. NLTK provides a standard list of stopwords you can remove, or you can create your own stopword list fairly easily.
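The three cleaning steps can be sketched in a few lines. To keep the example dependency-free, it swaps NLTK's tokenizer for a regular expression and uses tiny hand-written tables; CONTRACTIONS, STOPWORDS, and clean_text are illustrative names, not part of any library. In a real run you would likely replace the stopword set with NLTK's English stopword list and the dictionary with one from a contractions library.

```python
import re

# Deliberately tiny example tables; extend them for a real script.
CONTRACTIONS = {
    "don't": "do not",
    "i'm": "i am",
    "it's": "it is",
    "we're": "we are",
}
STOPWORDS = {"a", "am", "are", "do", "i", "is", "it", "not", "the", "to", "we"}


def clean_text(text):
    """Lowercase, expand contractions, drop punctuation and stop words."""
    # Normalise curly apostrophes so the dictionary lookup can match.
    text = text.lower().replace("\u2019", "'")
    # Dictionary-based find and replace, one contraction at a time.
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # Keeping only runs of letters discards punctuation and digits.
    tokens = re.findall(r"[a-z]+", text)
    return [token for token in tokens if token not in STOPWORDS]
```

Note that this sketch expands contractions before tokenizing. That ordering is a choice made here because most tokenizers split a contraction into pieces, after which a whole-word dictionary lookup no longer matches.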

Now that the text is clean, we need to prepare some images. I took some frames from Starship Troopers, scrubbed the background out, and then made a silhouette out of the foreground. This doesn't need to be extremely precise, just enough to get the outline. Creating a more precise outline will allow you to do a few interesting things with the final result, but for a basic word cloud it won't be too important. Using a program like GIMP (GNU Image Manipulation Program) or Photoshop, this is a fairly easy task. If the image already has fairly good separation between foreground and background, you can even use a command-line tool like ImageMagick.

The final step is to put this all together using the Python wordcloud library. You just pass your image and text to the library and it will return a word cloud. Just save this output and you've successfully created a word cloud.
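A minimal sketch of this last step might look like the following. It assumes the wordcloud, numpy, and Pillow packages are installed and that the silhouette is dark on a pure white background, since wordcloud treats white mask pixels as areas where no words may be drawn. The helper names and the file names in the usage note are placeholders, not anything from the original project.

```python
from collections import Counter


def word_frequencies(tokens):
    """Count how often each cleaned token appears."""
    return dict(Counter(tokens))


def render_cloud(frequencies, mask_path, out_path):
    """Draw the words inside the silhouette and save the result as an image."""
    # Imported here so the counting helper above can be tried on its own
    # without the imaging packages installed.
    import numpy as np
    from PIL import Image
    from wordcloud import WordCloud

    # Grayscale array: white (255) pixels are masked out, darker pixels
    # are free for words.
    mask = np.array(Image.open(mask_path).convert("L"))
    cloud = WordCloud(background_color="white", mask=mask)
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)
    return cloud
```

With a list of cleaned tokens in hand, a call such as render_cloud(word_frequencies(tokens), "silhouette.png", "cloud.png") would write the finished image to disk. Passing frequencies instead of raw text is a choice made for this sketch so that the library does not redo the tokenizing and stopword removal already performed.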