Word Cloud Tutorial

In this tutorial, we will walk through how to make a word cloud shaped like a picture of your choice, using the words from a text of your choice.

These are word cloud examples from my past projects:

First, you need to install Python and the required packages. I suggest using pip to install them. Here are the packages we need:

pip install wordcloud
pip install matplotlib
pip install scipy
pip install nltk
pip install imageio

You can choose whatever picture you like for the word cloud; the only requirement is that the picture must have a transparent background. Please check out this tutorial on how to make an image background transparent using the free online editor Pixlr. I used Steve Jobs, one of the most innovative people in tech. Here’s the picture that I use for the word cloud:

Steve Jobs PNG image. Retrieved from http://pngimg.com/download/33423
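
The wordcloud documentation describes pure white mask pixels as "masked out", so how a transparent background behaves can depend on the color stored underneath the transparent pixels. Here is a minimal sketch, assuming the picture has been loaded as an RGBA NumPy array, that paints fully transparent pixels white before the array is used as a mask. The helper name `transparent_to_white` is hypothetical and not part of any library.

```python
import numpy as np

def transparent_to_white(rgba):
    """Return an RGB array in which fully transparent pixels are painted white."""
    rgba = np.asarray(rgba)
    if rgba.ndim != 3 or rgba.shape[2] != 4:
        raise ValueError("expected an RGBA array of shape (height, width, 4)")
    rgb = rgba[:, :, :3].copy()        # drop the alpha channel, keep the colors
    rgb[rgba[:, :, 3] == 0] = 255      # alpha 0 (fully transparent) becomes white
    return rgb

# Tiny 1x2 demo image: one transparent black pixel and one opaque red pixel.
demo = np.array([[[0, 0, 0, 0], [200, 30, 30, 255]]], dtype=np.uint8)
mask = transparent_to_white(demo)
print(mask.tolist())
```

The returned array could then be passed as the `mask` argument in place of the raw image array.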

After you choose your picture, it’s time to choose the text. I used Steve Jobs’ 2005 Stanford commencement address. Here’s the link for the speech. If we take a look at that speech, we will find a lot of material we don’t want to put in the word cloud picture, e.g., numbers, abbreviations, pronouns, and function words such as “for” and “or”. That’s why we use NLTK (the Natural Language Toolkit) in Python. Let’s use the first sentence of the speech as an example. We will first split the sentence into its atomic elements (tokens), then remove the meaningless words.

1. Converting the sentence to its atomic elements

from nltk.tokenize import RegexpTokenizer

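tokenizer = RegexpTokenizer(r'[A-Za-z]+')   # match runs of ASCII letters only (digits and punctuation are dropped)

string = 'I am honored to be with you today at your commencement from one of the finest universities in the world.'
raw = string.lower()   # lowercase so the tokens can be compared with the lowercase stop word list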
tokens = tokenizer.tokenize(raw)

print(tokens)
>>['i', 'am', 'honored', 'to', 'be', 'with', 'you', 'today', 'at', 'your', 'commencement', 'from', 'one', 'of', 'the', 'finest', 'universities', 'in', 'the', 'world']
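
To see what the letters-only pattern keeps and drops, here is a small sketch that uses Python's built-in `re` module as a stand-in for the tokenizer. It assumes that matching with `re.findall` behaves like `RegexpTokenizer` in its default mode; the function name `letters_only_tokens` and the sample sentence are made up for illustration.

```python
import re

def letters_only_tokens(text):
    """Lowercase the text and return every run of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text.lower())

# Made-up sample containing a number, punctuation, and a contraction.
sample = "In 2005, Steve Jobs said: don't settle."
print(letters_only_tokens(sample))
```

Note that the number disappears and the contraction "don't" is split into two fragments, 'don' and 't', because the apostrophe is not a letter. Fragments like these are one more reason to filter the tokens afterwards.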

2. Removing meaningless words (stop words). You can find the list of stop words that we are using in this case here.

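from nltk.corpus import stopwords
# If the stop words corpus is missing, run nltk.download('stopwords') once (after `import nltk`)
stop_words = stopwords.words('english')   # English stop words list
stopped_tokens = [word for word in tokens if word not in stop_words]   # keep only the words that are not stop words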

print(stopped_tokens)
>>['honored', 'today', 'commencement', 'one', 'finest', 'universities', 'world']
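
If some frequent words still add little to the picture, the stop list can be extended with your own words. The sketch below assumes a tiny hand-written stop list (it is not NLTK's actual list) and a hypothetical helper `remove_stop_words`; in practice you would pass `stopwords.words('english')` plus your additions.

```python
def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that do not appear in stop_words."""
    stop_set = set(stop_words)   # set lookup avoids scanning a list for every token
    return [word for word in tokens if word not in stop_set]

# Hypothetical mini stop list for illustration; NOT the NLTK list.
base_stop_words = ["i", "am", "to", "be", "with", "you", "the", "of", "in"]
extra_stop_words = ["one", "today"]   # assumed project-specific additions

tokens = ["i", "am", "honored", "to", "be", "with", "you", "today", "one", "world"]
filtered = remove_stop_words(tokens, base_stop_words + extra_stop_words)
print(filtered)
```

Converting the list to a set first also keeps the filtering fast when the text is long.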

Now, we can apply these techniques to our speech. It’s easier to load the speech into Python if we save it as a text file. You can find the text file of Steve Jobs’ 2005 Stanford commencement address here. Then we read the text file into Python.

with open('steve_jobs_speech.txt') as f:      # replace the filename with your speech
    speech = f.read()
print(speech)
>>I am honored to be with you today at your commencement from one ...

Next, we clean the speech (splitting it into atomic elements and removing the meaningless words).

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

tokenizer = RegexpTokenizer(r'[A-Za-z]+')
raw = speech.lower()
tokens = tokenizer.tokenize(raw)
stop_words = stopwords.words('english')   # English stop words list
stopped_tokens = [word for word in tokens if word not in stop_words]   # keep only the words that are not stop words
print(stopped_tokens)
>>['honored', 'today', 'commencement', 'one', 'finest', 'universities', ...]
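
Since a word cloud sizes words by how often they occur, it can help to preview the most frequent tokens before drawing anything. This is a minimal sketch using the standard library's `collections.Counter`; the helper `top_words` is hypothetical and the token list is made-up sample data, not the real speech.

```python
from collections import Counter

def top_words(tokens, n=5):
    """Return the n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)

# Made-up tokens for illustration; use stopped_tokens in the tutorial.
sample_tokens = ["college", "life", "college", "apple", "life", "college"]
print(top_words(sample_tokens, n=2))
```

If an uninteresting word dominates the list, that is a hint to add it to the stop words before generating the picture.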

Finally, we join all the words in the stopped_tokens list back into a single string and create the word cloud from it.

from wordcloud import WordCloud
import imageio
import matplotlib.pyplot as plt

speech_new = " ".join(stopped_tokens)   # concatenate all words into one space-separated string

wordcloud = WordCloud(
    mask=imageio.imread('steve_jobs.png'),   # the picture defines the shape of the cloud
    background_color="white",
)

wordcloud.generate(speech_new)
plt.figure()
plt.title('How to live before you die')
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.savefig("steve_jobs_wordcloud.png", dpi=300)     # you can also save it as JPEG or another format
plt.show()

Here’s the result: