Loading a corpus into the Natural Language Toolkit

[UPDATED: See this post for a more thorough version of the one below.]

Looking through the forum at the Natural Language Toolkit website, I’ve noticed a lot of people asking how to load their own corpus into NLTK using Python, and how to do things with that corpus. Unfortunately, the answers to those questions aren’t exactly easy to find on the forums; they’re scattered around in different threads, and often a bit vague. Part of this is probably because the REAL coders on the forums want us noobs to figure it out ourselves. And I get that. The more I play with Python, the more I realize that the best way to learn code is to read about the basics, then start dicking around to see what works and what just gives you a scroll of red. You won’t learn anything by constantly asking “PLEASE TO HAVE TEH CODE?” on web forums.

That being said, I think the ability to load a corpus into the program is a pretty basic step. I wonder how many people have abandoned exploring the program because they couldn’t figure out how to load something other than the pre-loaded corpora. So, I’ll start throwing up some EASY lines of code for some BASIC functions, in the hope that noobs like me googling around for answers might run across them.

For now, I’ll provide the basic steps for loading your own non-tagged corpus into the program:

1. Save your corpus in a plain text format (e.g., a .txt file) using Notepad or some other text editor. Depending on what you want to do with your corpus, it might be easier to use Word first to deal with punctuation, capitalization, etc. You can smooth out the text using NLTK, but it’s easy to do it in Word before saving the document in a plain text format. (Getting rid of punctuation and capitalization is important when compiling lexical statistics; most corpus analysis tools will, for example, count rhetoric and Rhetoric as two separate lexical entries because one is capitalized and one isn’t.)

2. Save the .txt file in the Python folder (the folder IDLE starts in, so that the file can be opened by its name alone).

3. Load up IDLE, the Python GUI text-editor.

4. Import the NLTK book:

[Image: nltk1]

5. Import the Texts, like it says to do in the first chapter of the NLTK book. There are certain tools that won’t work unless these are imported.

[Image: ntlk2]
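
The two screenshots for steps 4 and 5 are not reproduced here. As a rough sketch, assuming they showed the same two imports that open chapter 1 of the NLTK book, the steps would look something like this. The try/except wrapper and the `book_loaded` flag are my additions, not part of the original steps; they just turn the two most common failures into a readable message.

```python
# Assumption: steps 4 and 5 are the two imports from chapter 1 of the NLTK book.
try:
    import nltk                # step 4: the toolkit itself
    from nltk.book import *    # step 5: the sample texts used in the book
    book_loaded = True
except (ImportError, LookupError) as err:
    # ImportError: NLTK is not installed (pip install nltk).
    # LookupError: the book data is missing; run nltk.download('book') once.
    book_loaded = False
    print("Could not load the NLTK book:", err)
```

If the second import complains about missing data, running `nltk.download('book')` once should fetch the sample texts.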

6. Now you’re ready to load your own corpus, using the following code:

[Image: nltk3]
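
The screenshot for step 6 is missing as well. Going by the description that follows and by the comments under this post, the original opened the file with open() in 'rU' mode, read it, split it on whitespace, and wrapped the result in an NLTK Text. Here is a minimal sketch along those lines. Assumptions on my part: the helper name `load_tokens`, the tiny stand-in file written to a temporary folder (so the sketch runs even without a real ccc.txt), and the UTF-8 encoding. I have also dropped the 'rU' mode: Python 3 handles all line endings by default, and Python 3.11 and later reject the 'U' flag outright.

```python
import os
import tempfile

def load_tokens(path, encoding="utf-8"):
    """Read a plain text file and split it into a list of words."""
    with open(path, encoding=encoding) as f:
        raw = f.read()
    return raw.split()  # splits on any whitespace, including newlines

# Stand-in for 'ccc.txt' so the sketch runs anywhere; point this at your own file.
folder = tempfile.mkdtemp()
corpus_path = os.path.join(folder, "ccc.txt")
with open(corpus_path, "w", encoding="utf-8") as out:
    out.write("Rhetoric is an art.\nThe study of rhetoric is old.\n")

tokens = load_tokens(corpus_path)

try:
    import nltk
    Abstracts = nltk.Text(tokens)  # the object the NLTK tools work on
except ImportError:
    Abstracts = None  # NLTK not installed; tokens is still a plain list

print(len(tokens), "tokens loaded")
```

Note that a plain split() leaves punctuation attached to words ("art." rather than "art"), which is one reason to clean the text first.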

Basically, these lines read your file, split its text on whitespace into a list of words, and wrap that list in a form that the NLTK can access and read, so that you can run analyses on your corpus using the NLTK tools. Above, ‘ccc.txt’ is the plain text file that I saved in my Python folder. ‘Abstracts’ is just the name I give the loaded text while working with the NLTK. You can name it whatever you want.

Now, just run a basic concordance command to make sure it works:

[Image: nltk4]
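
The concordance screenshot is missing too. Here is a sketch of what such a check might look like, assuming the text is called Abstracts and using 'rhetoric' as the search word (my choice, not necessarily the original's). The short token list is a stand-in so the snippet runs by itself.

```python
# Stand-in tokens; in practice these come from your own file.
tokens = "Rhetoric is an art and the study of rhetoric is old".split()

try:
    import nltk
except ImportError:
    nltk = None  # NLTK not installed

if nltk is not None:
    Abstracts = nltk.Text(tokens)
    # Prints each occurrence of the word with some context on either side.
    Abstracts.concordance("rhetoric")
else:
    print("Install NLTK first: pip install nltk")
```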

And you’re ready to go.
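
A footnote to step 1: if you would rather not clean the text in Word, lowercasing and stripping punctuation takes only a few lines of plain Python. This is a sketch, not the method from the original screenshots; the sample sentence is made up, and `string.punctuation` only covers ASCII punctuation, so curly quotes and dashes would need separate handling.

```python
import string
from collections import Counter

raw = "Rhetoric matters. Is rhetoric, then, an art?"  # made-up sample text

# Build a translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)

# Lowercase, strip punctuation, then split on whitespace.
tokens = raw.lower().translate(table).split()

counts = Counter(tokens)
print(counts["rhetoric"])  # prints 2: 'Rhetoric' and 'rhetoric,' now count as one entry
```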

14 thoughts on “Loading a corpus into the Natural Language Toolkit”

  3. How do I load a language other than English in NLTK? I went through your code and it works well for English. Can you help? When I load my own corpus of news text in isiXhosa, it tells me something else.

  4. Single most useful post on this topic on the web, thanks. PS: The “rU” option to open() means “read and universally handle all types of line endings”.

  5. Pingback: Uploading a Corpus to the NLTK, part 2 |

  6. Hello, I followed the steps you outlined above on how to load my own corpus into the NLTK, but when I got to step 6 I ran into a problem. It gives me the message below:

    >>> f*open (‘opinion_pos.txt’, ‘rU’)
    Traceback (most recent call last):
    File “”, line 1, in
    f*open (‘opinion_pos.txt’, ‘rU’)
    NameError: name ‘f’ is not defined
    >>>
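
    A note on this traceback: `f*open(...)` asks Python to multiply `f` by the result of open(), and `f` doesn't exist yet, hence the NameError. The line is presumably meant to be an assignment, with `=` rather than `*`. A small sketch of the difference, using a temporary stand-in for opinion_pos.txt (the stand-in file and its contents are made up):

```python
import os
import tempfile

# Stand-in for 'opinion_pos.txt' so the sketch runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "opinion_pos.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("good great fine\n")

try:
    f * open(path)      # multiplication: Python looks up f first, and it isn't there
except NameError as err:
    error_message = str(err)
    print(error_message)

f = open(path)          # assignment: this is what creates f
raw = f.read()
f.close()
print(raw.split())
```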

  7. How would one save the Text object to use again? I assume there must be a way to save and retrieve the created corpus without having to open the file and run split on it again the next time an analysis is conducted.

    • Mm, good question. I’ve just been retrieving the corpus each time; it’s only a few lines, and once they’re memorized, it takes a whole sixty seconds to type them out. But, I’m sure there’s a way to save it . . . I’ll look into it and try to post another reply if I figure it out.
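
    One possible way to do it, as a sketch: save the list of tokens with Python's pickle module and rebuild the Text from it next time. I'm saving the tokens rather than the Text object itself because a plain list is certain to pickle cleanly, and rebuilding a Text from a list is quick. The file name and the sample tokens are placeholders.

```python
import os
import pickle
import tempfile

tokens = "the study of rhetoric is old".split()  # stand-in for your corpus

# Save once...
save_path = os.path.join(tempfile.mkdtemp(), "abstracts_tokens.pkl")
with open(save_path, "wb") as out:
    pickle.dump(tokens, out)

# ...and load in a later session instead of re-reading and re-splitting the file.
with open(save_path, "rb") as saved:
    restored = pickle.load(saved)

# Then, with NLTK installed: Abstracts = nltk.Text(restored)
print(restored == tokens)  # prints True
```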
