[UPDATED: See this post for a more thorough version of the one below.]
Looking through the forum at the Natural Language Toolkit website, I’ve noticed a lot of people asking how to load their own corpus into NLTK using Python, and how to do things with that corpus. Unfortunately, the answers to those questions aren’t exactly easy to find on the forums; they’re scattered around in different threads, and often a bit vague. Part of this is probably because the REAL coders on the forums want us noobs to figure it out ourselves. And I get that. The more I play with Python, the more I realize that the best way to learn code is to read about the basics, then start dicking around to see what works and what just gives you a scroll of red. You won’t learn anything by constantly asking “PLEASE TO HAVE TEH CODE?” on web forums.
That being said, I think the ability to load a corpus into the program is a pretty basic step. I wonder how many people have abandoned exploring the program because they couldn’t figure out how to load something other than the pre-loaded corpora. So, I’ll start throwing up some EASY lines of code for some BASIC functions, in the hope that noobs like me googling around for answers might run across them.
For now, I’ll provide the basic steps for loading your own non-tagged corpus into the program:
1. Save your corpus in a plain text format (e.g., a .txt file) using Notepad or some other text editor. Depending on what you want to do with your corpus, it might be easier to use Word first to deal with punctuation, capitalization, etc. You can smooth out the text using NLTK, but it’s easy to do it in Word before saving the document in a plain text format. (Getting rid of punctuation and capitalization is important when compiling lexical statistics; most corpus analysis tools will, for example, count rhetoric and Rhetoric as two separate lexical entries because one is capitalized and one isn’t.)
2. Save the .txt file in the Python folder, i.e., the directory Python is working from, so the file can be opened by name alone.
3. Load up IDLE, the Python GUI text-editor.
4. Import the NLTK book:
5. Import the Texts, like it says to do in the first chapter of the NLTK book. There are certain tools that won’t work unless these are imported.
6. Now you’re ready to load your own corpus, using the following code:
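The lines for steps 4 through 6 might look something like the sketch below. It assumes Python 3 and a UTF-8 file (my assumptions, not NLTK requirements), and load_tokens and load_corpus are names made up for this example. The 'rU' flag that comes up in the comments was removed in Python 3.11, so it is left out here. NLTK is imported inside load_corpus so that the plain-Python half still runs on a machine without it.

```python
def load_tokens(path, encoding="utf-8"):
    """Read a plain text file and split it on whitespace into a list of words."""
    # Python 3's text mode already translates every style of line ending,
    # so no special mode flag is needed here.
    with open(path, encoding=encoding) as f:
        raw = f.read()
    return raw.split()


def load_corpus(path):
    """Wrap the word list in an nltk.Text so the NLTK tools can work on it."""
    import nltk  # step 4; imported here so load_tokens works without NLTK
    return nltk.Text(load_tokens(path))


# At the IDLE prompt, with ccc.txt saved where Python can find it:
#   >>> from nltk.book import *   # step 5; needs the book data from nltk.download()
#   >>> Abstracts = load_corpus('ccc.txt')
```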
Basically, these lines simply split all the words in your file into a list form that the NLTK can access and read, so that you can run analyses on your corpus using the NLTK tools. Above, ‘ccc.txt’ is the plain text file which was saved in my Python folder. ‘Abstracts’ is just what I label the file while working with the NLTK. You can name it whatever you want.
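Step 1 mentions that the text can be smoothed out in Python instead of Word. Here is a minimal sketch of that, assuming the corpus has already been split into a list of words; normalize is a hypothetical helper name, and string.punctuation only covers ASCII punctuation, so curly quotes and similar characters would need extra handling.

```python
import string


def normalize(tokens):
    """Lowercase each word and strip punctuation from its edges; drop empties."""
    cleaned = []
    for tok in tokens:
        # strip() only trims the ends, so hyphens or apostrophes
        # in the middle of a word are left alone.
        tok = tok.lower().strip(string.punctuation)
        if tok:
            cleaned.append(tok)
    return cleaned


words = normalize("Rhetoric, rhetoric and RHETORIC!".split())
print(words)  # ['rhetoric', 'rhetoric', 'and', 'rhetoric']
```

With that done, rhetoric and Rhetoric land in the same lexical entry when you start counting.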
Now, just run a basic concordance command to make sure it works:
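Something like the following should do it. This is only a sketch: show_concordance and count_matches are made-up helper names, and the search word rhetoric is just an example. As far as I know, concordance ignores case when matching, so the plain-Python count below can serve as a rough cross-check on the number of hits it reports.

```python
def count_matches(tokens, word):
    """Count hits for a word, ignoring case, as a rough cross-check."""
    target = word.lower()
    return sum(1 for tok in tokens if tok.lower() == target)


def show_concordance(tokens, word):
    """Print each hit for `word` along with its surrounding context."""
    import nltk  # imported here so count_matches runs without NLTK
    nltk.Text(tokens).concordance(word)


tokens = "Rhetoric is old but rhetoric is also new".split()
print(count_matches(tokens, "rhetoric"))  # 2
```

If you already have a Text object from step 6, the one-liner at the prompt is Abstracts.concordance('rhetoric').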
And you’re ready to go.
I managed to load the corpus, but many methods do not work, for instance readme() or words(). Are they broken in NLTK3?
Your simple code is very helpful, I am thankful to you.
Hello Seth, greatly appreciate your helpful post! Have you tried the VADER sentiment analysis tool? Some steps for beginners would be appreciated.
How do I load a language other than English in NLTK? I went through your code and it does well for English. Can you help? When I load my own corpus of news text in isiXhosa, it tells me something else.
I also ran into the same problem loading a corpus written in a writing system other than English (a low-resource Indian language). Please help me out.
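The error text isn't given in either comment, so this is a guess: a common culprit with non-English text is the file's encoding rather than NLTK itself. A small sketch for checking whether a file really is UTF-8 before loading it; check_encoding is a hypothetical helper, and UTF-8 is only an assumed default.

```python
def check_encoding(path, encoding="utf-8"):
    """Return None if the file decodes cleanly, else a note on the first problem."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that could not be decoded
        return f"byte {err.start}: {err.reason}"
    return None
```

If this returns None, opening the file with open(path, encoding='utf-8') and splitting on whitespace should behave the same as it does for English. If it reports a bad byte, re-saving the file as UTF-8 from the text editor is usually the simplest fix.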
Single most useful post on this topic on the web, thanks. PS: The “rU” option to open() means “read and universally handle all types of line endings”.
Thanks. Compliments and value-added comments are always appreciated!
Hello, I followed the steps you outlined above on how to load my own corpus into the NLTK, but when I got to step 6 I ran into a problem, and it gave me the message below:
>>> f*open ('opinion_pos.txt', 'rU')
Traceback (most recent call last):
File "", line 1, in
f*open ('opinion_pos.txt', 'rU')
NameError: name 'f' is not defined
>>>
Emmanuel, you need to define f, which means you need to follow it with an = sign rather than a *. The first line should look like this:
f = open('opinion_pos.txt', 'rU')
How would one save the Text object to use again? I assume there must be a way to save and retrieve the created corpus without having to open the file and run split on it again the next time an analysis is conducted.
Mm, good question. I’ve just been retrieving the corpus each time; it’s only a few lines, and once they’re memorized, it takes a whole sixty seconds to type them out. But, I’m sure there’s a way to save it . . . I’ll look into it and try to post another reply if I figure it out.
maybe write a function to open and create the corpus so that you can call it in one line?
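One possible way to avoid re-splitting the file every time, sketched under assumptions: rather than saving the Text object itself, save the plain word list with Python's pickle module and rebuild the Text from it. save_tokens and load_saved_tokens are made-up names, and this is not something NLTK provides.

```python
import pickle


def save_tokens(tokens, path):
    """Write the word list to disk so the file doesn't need re-splitting next time."""
    with open(path, "wb") as f:
        pickle.dump(list(tokens), f)


def load_saved_tokens(path):
    """Read back a word list written by save_tokens."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

At the prompt that would be save_tokens(Abstracts.tokens, 'abstracts.pkl') once, and nltk.Text(load_saved_tokens('abstracts.pkl')) in later sessions. Only unpickle files you created yourself, since pickle can run arbitrary code on load.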