Uploading a Corpus to the NLTK, part 2

A year ago I posted a brief tutorial explaining how to upload a corpus into the Natural Language Toolkit. The post receives frequent hits, and I’ve received a handful of emails asking for further explanation. (The NLTK discussion forums are not easy to navigate, and Natural Language Processing with Python is not too clear about this basic but vital step. It’s discussed partially and in passing in Chapter 3.)

Here’s a more detailed description of the process, as well as information about preparing the corpus for analysis:

1. The corpus first needs to be saved in a plain text format. The .txt file also needs to live somewhere Python can find it: by default, that means the main Python folder rather than Documents, unless you supply the file’s full path. The path for your file should look something like this: c:\Python27\corpus.txt

Once the corpus is in the proper format, open the Python IDLE window, import the Natural Language Toolkit, then open the corpus file and read it in as a raw string using the following code:


Using the type() function, we can see that the uploaded corpus at this point is simply a string of raw text (‘str’), just as it exists in the .txt file itself.
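Since the original screenshots are no longer visible, here is a minimal sketch of step 1, using a tiny stand-in file in place of C:\Python27\corpus.txt; the names ccc and cccraw follow the post’s examples, and the type shown is what Python 3 prints (the post’s Python 2.7 IDLE session would show <type 'str'>):

```python
# Create a tiny stand-in corpus file (in place of C:\Python27\corpus.txt).
with open('corpus.txt', 'w') as f:
    f.write('Rhetoric is the art of persuasion. And rhetoric matters.')

# Open the corpus and read it in as one raw string.
ccc = open('corpus.txt')
cccraw = ccc.read()
ccc.close()

print(type(cccraw))  # <class 'str'>
```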


2. The next step involves ‘tokenizing’ the raw text string. Tokenization turns the raw text into tokens: each word, number, symbol, and punctuation mark becomes its own entity in a long list. The line ccctokens[:100] shows the first 100 items in the now-tokenized corpus. Compare this list to the raw string above, listed after cccraw[:150].


Tokenization is an essential step: most of the NLTK’s analysis functions expect a list of tokens rather than a raw string, and analyses run on raw text are not as accurate as analyses run on tokenized text.
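A sketch of the tokenizing step. The NLTK book typically uses nltk.word_tokenize, which requires the ‘punkt’ model to be downloaded first (nltk.download('punkt')); the sketch below uses wordpunct_tokenize instead, a regex-based tokenizer that ships with NLTK and needs no extra data:

```python
from nltk.tokenize import wordpunct_tokenize

# A short stand-in for the raw string read from the .txt file.
cccraw = 'Rhetoric, says Aristotle, is the counterpart of dialectic.'

# Each word and punctuation mark becomes its own item in a list.
ccctokens = wordpunct_tokenize(cccraw)

print(ccctokens[:5])  # ['Rhetoric', ',', 'says', 'Aristotle', ',']
```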

3. Next, all the words in the tokenized corpus need to be converted to lower-case. This step ensures that the NLTK does not count the same word as two different tokens simply due to orthography: without lower-casing the text, the NLTK will count ‘rhetoric’ and ‘Rhetoric’ as different items. Some research questions will, of course, want to preserve that distinction, but otherwise, skipping this step can muddy your results.
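The lower-casing step is a one-line list comprehension; the token list and names below are stand-ins following the post’s conventions:

```python
# A stand-in token list containing 'Rhetoric' in two capitalizations.
ccctokens = ['Rhetoric', 'is', 'an', 'art', ';', 'rhetoric', 'matters', '.']

# Convert every token to lower-case.
ccclower = [w.lower() for w in ccctokens]

# 'Rhetoric' and 'rhetoric' now count as the same item.
print(ccclower.count('rhetoric'))  # 2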


4. Finally, attach the functionality of the NLTK to your corpus with this line: nltk.Text(tokenname)

‘tokenname’ would be whatever you’ve named your token list in the preceding lines. The variable names used in the examples above (ccc, ccctokens, cccraw) can obviously be changed to whatever you want, but it’s a good idea to keep track of them on paper so that you aren’t constantly scrolling up and down in the IDLE window.
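Putting the final step together, a sketch with a stand-in token list (the names ccclower and ccctext are illustrative, not fixed):

```python
import nltk

# Wrap the lower-cased token list in an NLTK Text object, which
# unlocks methods like count(), concordance(), and collocations().
ccclower = ['rhetoric', 'is', 'the', 'art', 'of', 'persuasion',
            'and', 'rhetoric', 'matters']
ccctext = nltk.Text(ccclower)

print(ccctext.count('rhetoric'))  # 2
```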

Now the corpus is ready to be analyzed with all the power of the Natural Language Toolkit.

14 thoughts on “Uploading a Corpus to the NLTK, part 2”

  1. Hi, just wanted to know, how flexible and extensible NLTK is? I mean, yes, it does come with its own set of methods and functionalities, but suppose, I want to do a pretty advanced task, say compare the story lines of two short stories and decide whether they have similar kind of storyline (say the protagonist defeats the evil guy, and the beautiful lady falls for him blah blah). Obviously NLTK doesn’t have a method for this, but how practical it is to develop such feature using the features NLTK already provides?

    • NLTK and Python can do a lot of stylistic analysis. So, to do “plot line” analysis, you would simply need to reduce story elements to stylistic elements—that is, to words and phrases in use. No corpus analysis tool is going to “know” anything about story, protagonist, bad guy, good guy, et cetera . . . but it CAN know what sorts of words are associated with these elements, as long as you tell it what to look for.

    • Sorry for the late reply, Daniel. I didn’t see this comment for some reason.

      “Plain text” just means you’re putting the corpus into a text reader (e.g., Notepad) and saving it as a .txt file, as opposed to saving it in Microsoft Word. Python can’t access text saved in a .doc file.

      So, just copy your corpus into Notepad or Notepad++ or any other text reader, save it as a .txt file, and access that with NLTK.

  2. Pingback: Loading a corpus into the Natural Language Toolkit |

  3. Hi Seth,
    I can’t figure out what the second line of code (>>> sorted, etc.) is doing in this example from the NLTK handbook – can you help? I’ve been understanding it all until I tripped over this, and I don’t feel I can go on until I’ve figured it out:

    Notice that the one pronunciation is spelt in several ways: nics, niks, nix, even ntic’s with a silent t, for the word atlantic’s. Let’s look for some other mismatches between pronunciation and writing. Can you summarize the purpose of the following examples and explain how they work?

    >>> [w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n']
    ['autumn', 'column', 'condemn', 'damn', 'goddamn', 'hymn', 'solemn']
    >>> sorted(set(w[:2] for w, pron in entries if pron[0] == 'N' and w[0] != 'n'))
    ['gn', 'kn', 'mn', 'pn']

    Thanks so much for your help.

    • The first line is searching for words whose spoken pronunciation ends in bilabial “m” but whose spelling ends in “n.”

      The second line is doing something similar, but looking at the beginnings of words. It is searching for words whose spoken pronunciation begins with coronal “n” but whose spelling does not begin with “n.”

      set(w[:2] . . . ) tells Python to collect the first two letters of each matching word; wrapping the results in set() removes duplicates, and sorted() puts them in alphabetical order.

      if pron[0] == ‘N’ . . . if the pronunciation element in the list begins with “n” (remember, the list they’re dealing with has both a word and its pronunciation)

      and w[0] != ‘n’ . . . but if the word in the list does not begin with “n”

      Make sense? I think that’s more or less what’s going on. The returned list gives us what we’d expect: the funky spellings in English, many of them Greek borrowings, that begin with coronal “n” but aren’t spelled with “n”: PNeumonia, KNowledge, MNemonic, etc . . .

  4. Wonderful, thank you, you are absolutely right – that is the critical detail I have been searching for in the NLTK handbook ever since I started learning Python. But what if I have multiple .txt files? I usually have at least 10–50 such files to analyse, and sometimes as many as 5,000.
    Thank you in advance for your time.

    • Hi Anon. If you want to search for patterns across multiple files, you can just copy each file back to back into a single “master” .txt file. This can work well for certain things, especially if you want to look at lexical dispersion over time. The more sophisticated way to do it is to put the .txt files into different folders, then access the folders rather than individual .txt files. The movie review corpus and the Inaugural Address corpus in NLTK are structured this way. Here’s a link to the movie review corpus:


      Download it (you’ll need 7-Zip to open it, since it’s a .tar file) and see how it’s structured.

      I’m not sure where to find the data for the Inaugural Address corpus, but I haven’t spent much time looking. It would be equally valuable to download it, see how it has been structured, and do thou likewise.
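The folder-based approach described above can be sketched with NLTK’s PlaintextCorpusReader, which is how the book’s own corpora are accessed. The folder name my_corpus and the file contents here are made up for illustration:

```python
import os
from nltk.corpus import PlaintextCorpusReader

# Build a tiny two-file corpus folder, standing in for a real
# collection of .txt files structured like the movie review corpus.
os.makedirs('my_corpus', exist_ok=True)
with open(os.path.join('my_corpus', 'story1.txt'), 'w') as f:
    f.write('The hero wins.')
with open(os.path.join('my_corpus', 'story2.txt'), 'w') as f:
    f.write('The villain loses.')

# Point the reader at the folder; the regex matches every .txt file.
corpus = PlaintextCorpusReader('my_corpus', r'.*\.txt')

print(corpus.fileids())                    # ['story1.txt', 'story2.txt']
print(list(corpus.words('story1.txt')))    # ['The', 'hero', 'wins', '.']
```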
