[UPDATED: See this post for a more thorough version of the one below.]
Looking through the forum at the Natural Language Toolkit website, I’ve noticed a lot of people asking how to load their own corpus into NLTK using Python, and how to do things with that corpus. Unfortunately, the answers to those questions aren’t exactly easy to find on the forums; they’re scattered around in different threads, and often a bit vague. Part of this is probably because the REAL coders on the forums want us noobs to figure it out ourselves. And I get that. The more I play around in Python, the more I realize that the best way to learn code is to read about the basics, then start dicking around to see what works and what just gives you a scroll of red. You won’t learn anything by constantly asking “PLEASE TO HAVE TEH CODE?” on web forums.
That being said, I think the ability to load a corpus into the program is a pretty basic step. I wonder how many people have abandoned exploring the program because they couldn’t figure out how to load something other than the pre-loaded corpora. So, I’ll start throwing up some EASY lines of code for some BASIC functions, in the hope that noobs like me googling around for answers might run across them.
For now, I’ll provide the basic steps for loading your own non-tagged corpus into the program:
1. Save your corpus in a plain text format (e.g., a .txt file) using Notepad or some other text editor. Depending on what you want to do with your corpus, it might be easier to use Word first to deal with punctuation, capitalization, etc. You can smooth out the text using NLTK, but it’s easy to do it in Word before saving the document in a plain text format. (Getting rid of punctuation and capitalization is important when compiling lexical statistics; most corpus analysis tools will, for example, count rhetoric and Rhetoric as two separate lexical entries because one is capitalized and one isn’t.)
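2. Save the .txt file in the Python folder (the folder IDLE treats as its working directory), so you can open the file by name without typing out a full path.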
3. Load up IDLE, the Python GUI text-editor.
4. Import the NLTK book:
5. Import the Texts, like it says to do in the first chapter of the NLTK book. There are certain tools that won’t work unless these are imported.
6. Now you’re ready to load your own corpus, using the following code:
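Here's a minimal sketch of what that code can look like, and it's certainly not the only way to do it. It assumes NLTK is installed and that a UTF-8 file called ccc.txt sits in Python's working folder; the helper names load_tokens and load_text are made up for illustration. Steps 4 and 5 show up only as comments, because the from nltk.book import * line needs the book data downloaded first.

```python
# Steps 4 and 5, as the first chapter of the NLTK book has them:
#   import nltk
#   from nltk.book import *   # needs the book data from nltk.download()


def load_tokens(path):
    """Read a plain text file and split it into a list of words."""
    # Assumption: the file was saved as UTF-8.
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    # str.split() breaks on any whitespace, so punctuation stays attached
    # to the words unless it was cleaned out beforehand (see step 1).
    return raw.split()


def load_text(path):
    """Wrap the word list in an nltk.Text so the NLTK tools can use it."""
    import nltk  # imported here so load_tokens works even without NLTK
    return nltk.Text(load_tokens(path))
```

In IDLE you would then type Abstracts = load_text('ccc.txt'), and Abstracts can be used like text1, text2, and the other texts that come with the book.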
Basically, these lines simply split all the words in your file into a list form that the NLTK can access and read, so that you can run analyses on your corpus using the NLTK tools. Above, ‘ccc.txt’ is the plain text file which was saved in my Python folder. ‘Abstracts’ is just what I label the file while working with the NLTK. You can name it whatever you want.
Now, just run a basic concordance command to make sure it works:
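Something like the sketch below should do it. It assumes NLTK is installed, and the helper name show_concordance is hypothetical; it takes a plain list of words so it can be tried out on a made-up sentence without needing ccc.txt.

```python
def show_concordance(tokens, word):
    """Print each occurrence of `word` with its surrounding context."""
    import nltk  # requires NLTK to be installed
    text = nltk.Text(tokens)
    # Text.concordance prints its results to the screen instead of
    # returning them, so there is nothing to assign here.
    text.concordance(word)
```

For example, show_concordance("the study of rhetoric is old".split(), "rhetoric") should print the match in context. With a corpus already loaded under the label Abstracts, the direct version is just Abstracts.concordance('rhetoric').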
And you’re ready to go.
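One more thing, going back to step 1: if you'd rather not clean the text up in Word, the lowercasing and punctuation-stripping can be done in Python before the words go to NLTK. This is just one rough way to do it with the standard library. The function name normalize is made up for illustration, and it only strips ASCII punctuation, so curly quotes and similar characters would survive.

```python
import string


def normalize(raw):
    """Lowercase the text and drop ASCII punctuation, then split into words."""
    # Translation table that deletes every character in string.punctuation.
    table = str.maketrans("", "", string.punctuation)
    return raw.lower().translate(table).split()


words = normalize("Rhetoric, rhetoric; and RHETORIC!")
print(words)  # ['rhetoric', 'rhetoric', 'and', 'rhetoric']
```

Keep in mind this is blunt: apostrophes and hyphens get deleted too, so "don't" becomes "dont" and "well-known" becomes "wellknown". Whether that matters depends on what you're counting.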