Inspired by this article on text network analysis (TNA), I applied part of the article’s methodology to my own corpus of article abstracts from Rhetoric Society Quarterly. For this first foray into TNA, I had a simple goal: I wanted to see what concepts cluster around the word “rhetoric” in the abstracts.
I need to be careful applying that article’s methods to my abstracts corpus, because the corpus is really a collection of texts written by many authors, while the methodology linked above is designed for self-contained texts. But for my current question (what concepts occur most frequently beside “rhetoric”?), treating the corpus as a single self-contained text is useful. I want to see whether there is stability in the use of “rhetoric” across dozens of authors, or whether the concepts appearing beside “rhetoric” in the abstracts are myriad and unpredictable. In other words, my question is an indirect way to test the cohesion of rhetorical studies’ discourse about the discipline’s defining term.
First, a video of my initial results; then, the methods and my final results. In the video, the center red node is “rhetoric”; the green nodes are concepts that appear beside the word “rhetoric” in RSQ abstracts. The concepts that fly away from the center node only appear beside “rhetoric” once; the concepts that get pulled toward the center node appear in connection with “rhetoric” more and more often.
Now the methods:
1. Import the Rhetoric Society Quarterly abstracts corpus into Python and load it with the NLTK package.
2. Get rid of stopwords in the corpus, so that a sentence like “The rhetoric of Aristotle is too analytic” becomes “rhetoric Aristotle analytic”.
3. Do a ‘light’ cleaning of stems in the corpus, so that “comparative rhetoric” and “comparison rhetoric” both become “compare rhetoric.” (See below for more on this.)
4. Collect word-sequences that use the word “rhetoric.” I used 3-word sequences.
4.5. Obviously, getting rid of stopwords was essential; otherwise, I’d end up with an unhelpful list like this:
Now I had a list of terms to which rhetoric was connected: N1 rhetoric N2.
5. Next, using Gephi, import the list of words as individual nodes.
6. List the edges between the nodes in Gephi, giving an edge a higher weight each time it recurs in the abstracts corpus. For example, “rhetoric” and “science” occurred beside each other 3 times, so the edge between these two IDs (“rhetoric”=source, “science”=target) was given a weight of 3.
Steps 5 and 6 were done manually. Very time consuming; it took over an hour. I’ll need to find a way to run an algorithm that can do it automatically, especially once I start working with larger texts.
7. The video above provides my first attempt at arranging the nodes and edges visually in Gephi’s Overview interface. I had to tweak the nodes a bit around the center. Not sure why, but the weight of the edges didn’t perfectly match the nodes’ distance from the central node.
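For anyone who wants to skip the manual counting, Steps 2, 4, and 6 could be scripted roughly as follows. This is a minimal sketch, not the script behind the results in this post: the stopword list is a tiny stand-in for NLTK's, the function names and sample abstracts are made up for illustration, and the `Source`, `Target`, `Weight` column headers are an assumption about what Gephi's spreadsheet import expects for an edges table.

```python
import csv
import io
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; NLTK's English list is much longer.
STOPWORDS = {"the", "of", "is", "too", "a", "an", "and", "in", "to", "for", "that", "this"}


def content_words(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords (Step 2)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def rhetoric_edges(abstracts, center="rhetoric"):
    """Count how often each word sits directly beside `center` (Steps 4 and 6).

    Each abstract is processed on its own, so the last word of one abstract
    is never treated as a neighbor of the first word of the next.
    """
    weights = Counter()
    for abstract in abstracts:
        words = content_words(abstract)
        for i, word in enumerate(words):
            if word != center:
                continue
            # The "N1 rhetoric N2" pattern: look one word left and one word right.
            for j in (i - 1, i + 1):
                if 0 <= j < len(words) and words[j] != center:
                    weights[words[j]] += 1
    return weights


def edges_to_csv(weights, center="rhetoric"):
    """Render the weighted edges as CSV text, heaviest edge first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for target, weight in weights.most_common():
        writer.writerow([center, target, weight])
    return buffer.getvalue()


# Made-up sample sentences, not real RSQ abstracts.
abstracts = [
    "The rhetoric of Aristotle is too analytic",
    "The rhetoric of science and the science of rhetoric",
]
weights = rhetoric_edges(abstracts)
print(edges_to_csv(weights))
```

Note that this sketch only matches the exact token “rhetoric”; forms like “rhetorical” would need the stem cleaning from Step 3 applied first.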
Here’s the final visualization. Click to enlarge for detail.
And here’s the list of edges (connections between “rhetoric” and other words) and their weights (number of connections), showing only edges with weights greater than 3:
Reading through the list, please recall Step 3 above. Following the methods in the Nodus Labs article, I clustered words such as “historical,” “historian,” and “histories” under the single node “history.” This de-stemming technique allows me to answer my question about concept-clustering without being so bloody granular that it appears as though no cohesion exists across the abstracts.
So, this list and visualization show us what concepts cluster around the term “rhetoric” in RSQ abstracts. I say concepts rather than words because de-stemming the words places emphasis on the free morpheme, which will always be more ‘conceptual’ than individual words; i.e., the word clusters in a corpus might be “rhetoric historian”, “rhetoric histories”, and “rhetoric historiography”, all of which might be brought under a particular concept cluster, “rhetoric history”.
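The clustering described here can be sketched as a lookup table that folds word forms into a single concept node before the weights are summed. The mapping below is hypothetical and hand-built from the examples in this post; it is not the actual table used for the results, and `merge_concepts` and `heavy_edges` are illustrative names. The sample counts are invented.

```python
from collections import Counter

# Hypothetical mapping from surface forms to concept nodes,
# built only from the examples mentioned in this post.
CONCEPTS = {
    "historical": "history",
    "historian": "history",
    "histories": "history",
    "historiography": "history",
    "comparative": "compare",
    "comparison": "compare",
}


def merge_concepts(weights, concepts=CONCEPTS):
    """Sum edge weights so that every word form counts toward its concept node."""
    merged = Counter()
    for word, weight in weights.items():
        merged[concepts.get(word, word)] += weight
    return merged


def heavy_edges(weights, threshold=3):
    """Keep only edges whose weight is strictly greater than `threshold`."""
    return {word: w for word, w in weights.items() if w > threshold}


# Invented counts, for illustration only.
raw = Counter({"historical": 2, "historian": 1, "history": 3, "science": 3})
merged = merge_concepts(raw)
print(heavy_edges(merged))
```

A hand-built table like this keeps the cleaning ‘light’: only the forms listed get merged, and everything else passes through unchanged.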
The verbs in the list are interesting: “use,” “study,” “compare,” and “examine” tell us both what scholars are doing with rhetoric and also, perhaps, what scholars are saying people outside the academy are doing with rhetoric.
The “rhetoric history” concept cluster carries the heaviest weight. This is not surprising, considering that rhetorical studies is embedded in the Greco-Roman rhetorical tradition, and that a lot of current scholarship is dedicated to historical recovery of oratory outside the traditional canon.
I was surprised by the heavy weight of the “rhetoric scholar” cluster. It must be that authors in RSQ like to talk about themselves. I’ll have to check the actual collocations, but I’ll bet that “rhetorical scholarship” and “scholars of rhetoric” constitute the bulk of this cluster’s connections.
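One way to check that hunch would be to count the raw phrases, before any stopword removal or stem cleaning, in which a “rhetor…” word and a “scholar…” word sit close together. This is a rough sketch under my own assumptions: the prefix matching, the `gap` parameter, and the function name are all illustrative, and the sample sentences are invented rather than taken from RSQ.

```python
import re
from collections import Counter


def raw_collocations(abstracts, left="rhetor", right="scholar", gap=1):
    """Count surface phrases pairing a `left`-prefixed word with a `right`-prefixed word.

    The two words may appear in either order, with at most `gap` other
    words between them (so "scholars of rhetoric" matches with gap=1).
    """
    counts = Counter()
    for abstract in abstracts:
        tokens = re.findall(r"[a-z]+", abstract.lower())
        for i, first in enumerate(tokens):
            for j in range(i + 1, min(i + gap + 2, len(tokens))):
                second = tokens[j]
                forward = first.startswith(left) and second.startswith(right)
                backward = first.startswith(right) and second.startswith(left)
                if forward or backward:
                    counts[" ".join(tokens[i:j + 1])] += 1
    return counts


# Invented sample sentences, for illustration only.
sample = [
    "Rhetorical scholarship has grown.",
    "Scholars of rhetoric disagree.",
    "Recent rhetorical scholarship matters.",
]
phrases = raw_collocations(sample)
for phrase, count in phrases.most_common():
    print(count, phrase)
```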
The “rhetoric epideictic” cluster made me happy. Epideictic oratory and rhetoric were historically considered unworthy of serious attention, but clearly, they’re getting a lot of attention these days. Perhaps we can thank Richard Weaver, whose seminal essay “Language is Sermonic” basically reignited a scholarly interest in epideictic rhetoric.
That’s all for now. Like I said at the beginning, this was a simple text network analysis; it took me a couple hours, maybe a little more, and once I write a few scripts, I can probably automate the process. The next step will be to trace all the clusters in a textual network.