Meaning circulation in Lolita

Text networks allow you to trace the circulation of meaning within a text.

A text network analysis proceeds in the following way: a text is copied into a .txt file and imported into an analytic tool (I use AutoMap) in order to remove stop words and lightly stem the text. Then, using the same tool, the text, now stripped of all but significant content words, is run through an algorithm that treats the content words as a network and produces a co-reference list in .csv format: which words are connected to which other words, and how often? The .csv file is then opened in a network analysis tool (I use Gephi) to visualize these semantic connections. Each word is visualized as a node in the network, and words that appear near each other, within a certain word gap, are joined by edges. (I used a 3-word gap below.)
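The AutoMap half of that pipeline can be sketched in plain Python. This is a minimal stand-in, not the tool's actual algorithm: the stop-word list is a tiny invented sample, stemming is skipped, and the sentence is made up; it just shows the co-occurrence counting (3-word gap) and the CSV export Gephi expects.

```python
import csv
import re
from collections import Counter

# A tiny stand-in stop-word list (a real run would use a full list).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "was", "i", "she", "it"}

def cooccurrence_edges(text, gap=3):
    """Count pairs of content words that appear within `gap` words
    of each other."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    pairs = Counter()
    for i, w in enumerate(words):
        for other in words[i + 1 : i + 1 + gap]:
            if other != w:
                pairs[tuple(sorted((w, other)))] += 1
    return pairs

def write_gephi_csv(pairs, path):
    """Write a Source,Target,Weight edge list that Gephi can import."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target", "Weight"])
        for (a, b), n in pairs.items():
            writer.writerow([a, b, n])

# Invented sentence, purely for illustration.
edges = cooccurrence_edges("the night she touched the girl in the night")
```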

The most interesting network visualization, in my opinion, shows nodes with the highest betweenness centrality, which measures how often a node lies on the shortest paths between other nodes; a node with high betweenness centrality will in essence be an important ‘passageway’ between communities within the network. (Here’s an excellent visual description of the concepts.)

In a text network, a word with high degree centrality is connected to myriad other words. This simply tells you that the word is used frequently in the text and in a variety of contexts (it will, more or less, be a productive creator of bigrams). A word with high betweenness centrality, however, is used in conjunction with other words that themselves connect to further words to form community clusters. This tells you not only that the word is used frequently and in many contexts but that it is used in connection with words that do a lot of semantic work in the text. A word with high betweenness centrality is a word through which many meanings in a text circulate.
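The contrast shows up clearly on a toy graph. Below is a sketch using networkx (the nodes and edges are invented, chosen to echo the clusters discussed later): the bridge word has fewer neighbours than the cluster hubs, so its degree centrality is lower, yet every path between the two clusters runs through it, so its betweenness is the highest in the graph.

```python
import networkx as nx

# Two invented word clusters joined only through the bridge word "touched".
G = nx.Graph()
G.add_edges_from([
    ("night", "dark"), ("night", "lamp"), ("dark", "lamp"),   # cluster 1
    ("girl", "hair"), ("girl", "eyes"), ("hair", "eyes"),     # cluster 2
    ("night", "touched"), ("touched", "girl"),                # the bridge
])

degree = nx.degree_centrality(G)        # share of possible neighbours
between = nx.betweenness_centrality(G)  # share of shortest paths passed through

# "touched" has only two neighbours (low degree), but every shortest path
# between the clusters runs through it, so its betweenness is the maximum.
```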

Using AutoMap and Gephi, and following a methodology similar to the one described here, I created a network of all the lexical connections within the first 10 chapters of Vladimir Nabokov’s Lolita. (View the upcoming videos in full screen; otherwise, you can’t see the nodes I’m talking about.) The results here show which words possess the highest betweenness centrality. The more betweenness centrality, the larger the node.

The results also allow us to trace all the possible connections from one word to any other word, both within individual meaning clusters and through terms with a high level of betweenness centrality. For example, the terms ‘girl’ and ‘night’ have a relatively high betweenness centrality, and they are both connected to one another through the word ‘touched’, which itself is not connected to very many clusters and thus has a low betweenness centrality.

night –> touched –> girl

(Lots of pervy pathways of meaning in Lolita.)

Visualizing all the connections in this textual network is messy. Nabokov was a master stylist, not one to use the same words too often, and certainly not in the same sentence or in the same connective pattern. The average path length in the text is 7.95. Average path length measures how many steps, on average, you need to take to connect two randomly selected nodes. The lower the average path length, the more connected the text. At 7.95, the first 10 chapters of Lolita are not very connected; there are 221 separate meaning clusters. Here’s the messy initial network . . .
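Both statistics are easy to compute programmatically. A sketch with networkx, using two invented three-word clusters in place of the real 221: average path length is only defined within a connected component, so it has to be computed per component.

```python
import networkx as nx

# Two tiny clusters with no path between them, standing in for the
# disconnected meaning clusters of the full network (edges invented).
G = nx.Graph([("night", "touched"), ("touched", "girl"),
              ("fruit", "table"), ("table", "set")])

clusters = list(nx.connected_components(G))
print(len(clusters))  # 2 separate meaning clusters

# Average path length per component: each 3-node chain has pair
# distances 1, 1, and 2, so the average is 4/3.
for comp in clusters:
    sub = G.subgraph(comp)
    print(sorted(comp), nx.average_shortest_path_length(sub))
```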

Using Gephi’s degree range tool, I hid the most disconnected nodes, thereby ‘cleaning’ the visualization of all but the most prominent clusters and connections.
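The same pruning can be done in code if you want it reproducible outside Gephi. A sketch with networkx: drop every node whose degree falls below a threshold (the edges and the threshold of 2 here are invented for illustration).

```python
import networkx as nx

# Toy network: "fruit" has degree 1, everything else degree >= 2.
G = nx.Graph([("night", "girl"), ("night", "touched"),
              ("girl", "touched"), ("touched", "fruit")])

# Keep only nodes at or above the degree threshold, like Gephi's
# degree range filter.
pruned = G.subgraph([n for n, d in G.degree() if d >= 2]).copy()
print(sorted(pruned.nodes()))  # ['girl', 'night', 'touched']
```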

With this cleaner network, I could see a few distinct clusters, as well as the terms with high betweenness centrality, the words that act as conduits between different words and meaning clusters. They were what you’d expect: meaning in Lolita circulates through the favorite words of an enamored pederast. Nymphet, night, girl, age, eyes, hair . . .

More interesting than the overall network, however, were the various paths I found between different terms. In general, three or fewer steps of separation in a social network = a possible cross-influence between two nodes. In our text network, two words separated by three steps or fewer = a possible, latent relationship between the words, perhaps even a relationship that can be expressed in terms of influence.

For example, ‘nymphet’ led backward to ‘annabel’, which had a direct path to ‘lolita’ in one direction and to ‘death’ in the other direction.
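These paths, and the three-step rule, can be checked programmatically. A sketch with networkx, rebuilt from only the handful of edges described in this post (the full network has many more, so real distances could be shorter):

```python
import networkx as nx

# A fragment of the network, reconstructed from the paths described here.
G = nx.Graph([("nymphet", "annabel"), ("annabel", "lolita"),
              ("annabel", "death"), ("death", "felt"), ("felt", "life")])

def latent_relation(G, a, b, max_steps=3):
    """Words within max_steps edges of each other count as a possible
    latent relationship."""
    try:
        return nx.shortest_path_length(G, a, b) <= max_steps
    except nx.NetworkXNoPath:
        return False

print(nx.shortest_path(G, "nymphet", "death"))  # ['nymphet', 'annabel', 'death']
print(latent_relation(G, "nymphet", "life"))    # 4 steps away in this fragment -> False
```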

Remember, this textual network only represents the first 10 chapters of the novel (I didn’t include the fake preface). And yet, already built into this network of lexemes from early in the novel is a clue to Humbert’s eventual demise, a great example of the intimate connection between form and function, style and plot.

Another interesting pathway was the path between ‘life’ and ‘death’. Actually, there were two pathways, one leading through ‘felt’ and another leading, oddly enough, through ‘love’ and then ‘father’.

The ‘father’, ‘love’, ‘death’ triangle is quite interesting . . . and, of course, ‘death’ leads back through ‘felt’ to ‘annabel’, the first nymphet in Humbert’s life.

Finally, two important terms in the network are quite disconnected: ‘nymphet’ and ‘girl’. Which is exactly what we should expect. Humbert goes to great lengths to separate the one from the other, and textually, it’s difficult to trace a lexemic path from one to the other. (Note: at the end of this video, I highlight the word ‘fruit’, which is only connected to ‘table’ and ‘set’. Nabokov apparently declines to use any sort of forbidden fruit metaphor during the first ten chapters of the novel; ‘fruit’ never connects to the pervy words or meaning clusters.)

Even this short analysis has given me some interesting things to discuss if I were actually writing a dissertation on Lolita. The meaning circulation of ‘lolita’, ‘annabel’, and ‘death’ through the conduit ‘nymphet’ would be worth analyzing in more detail, especially considering that this circulation occurs so early in the novel.

Concept clusters in Rhetoric Society Quarterly

Inspired by this article on text network analysis (TNA), I applied part of the article’s methodology to my own corpus of article abstracts from Rhetoric Society Quarterly. For this first foray into TNA, I had a simple goal: I wanted to see what concepts cluster around the word “rhetoric” in the abstracts.

I need to be careful using the above article’s methods on my abstracts corpora because these corpora are really multiple texts written by multiple authors; the methodology linked above is designed to analyze self-contained texts. But for my current question (what concepts occur most frequently beside “rhetoric”?), treating the corpus like a self-contained text is beneficial. I want to see if there is stability in the use of “rhetoric” across dozens of authors, or if the concepts appearing beside “rhetoric” in the abstracts are myriad and unpredictable. In other words, my question is an indirect way to test the cohesion of rhetorical studies’ discourse about that discipline’s defining term.

First, a video of my initial results; then, the methods and my final results. In the video, the center red node is “rhetoric”; the green nodes are concepts that appear beside the word “rhetoric” in RSQ abstracts. The concepts that fly away from the center node only appear beside “rhetoric” once; the concepts that get pulled toward the center node appear in connection with “rhetoric” more and more often.

Now the methods:

1. Import the Rhetoric Society Quarterly abstracts corpus into Python and load it with the NLTK package.

2. Get rid of stopwords in the corpus, so that a sentence like “The rhetoric of Aristotle is too analytic” becomes “rhetoric Aristotle analytic”.

3. Do a ‘light’ cleaning of stems in the corpus, so that “comparative rhetoric” and “comparison rhetoric” both become “compare rhetoric.” (See below for more on this.)

4. Collect word-sequences that use the word “rhetoric.” I used 3-word sequences.

4.5. Obviously, getting rid of stopwords was essential; otherwise, I’d end up with an unhelpful list like this:

Now I had a list of terms to which rhetoric was connected: N1 rhetoric N2.
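Steps 1 through 4 can be sketched in plain Python. This is a stand-in, not my actual script: the stop-word list and the stemming map below are tiny hand-built samples (a real run uses NLTK’s stopword list and stemmers), and the second example sentence is invented.

```python
import re
from collections import Counter

# Stand-ins for NLTK's stopword list and 'light' stemming (Step 3).
STOP_WORDS = {"the", "a", "an", "of", "is", "too", "and", "in", "to"}
STEMS = {"comparative": "compare", "comparison": "compare"}

def preprocess(sentence):
    """Lowercase, drop stopwords, and apply the light stemming map."""
    words = [w for w in re.findall(r"[a-z]+", sentence.lower())
             if w not in STOP_WORDS]
    return [STEMS.get(w, w) for w in words]

def rhetoric_neighbors(sentences):
    """Step 4: collect N1 rhetoric N2 windows, i.e. the content words
    immediately on either side of 'rhetoric'."""
    pairs = Counter()
    for s in sentences:
        words = preprocess(s)
        for i, w in enumerate(words):
            if w == "rhetoric":
                if i > 0:
                    pairs[words[i - 1]] += 1
                if i + 1 < len(words):
                    pairs[words[i + 1]] += 1
    return pairs

counts = rhetoric_neighbors([
    "The rhetoric of Aristotle is too analytic",   # example from Step 2
    "Comparative rhetoric examines new traditions",  # invented
])
```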

5. Next, using Gephi, import the list of words as individual nodes.

6. List the edges between the nodes in Gephi, giving an edge a higher weight each time it recurs in the abstracts corpora. For example, “rhetoric” and “science” occurred beside each other 3 times; so, the edge between these two IDs (“rhetoric”=source, “science”=target) was given a weight of 3.

Steps 5 and 6 were done manually. Very time-consuming; it took over an hour. I’ll need to find a way to run an algorithm that can do it automatically, especially once I start working with larger texts.
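One way to automate steps 5 and 6 is to generate the node and edge tables as CSV files that Gephi imports directly, weighting an edge each time its pair recurs. A sketch, with the pair list and file names invented for illustration:

```python
import csv
from collections import Counter

# Invented word pairs; "rhetoric science" recurs 3 times, as in Step 6.
pairs = [("rhetoric", "science"), ("rhetoric", "science"),
         ("rhetoric", "science"), ("rhetoric", "history")]
weights = Counter(pairs)  # recurring pairs get higher weights

# Step 5: the node table (every distinct word, once).
with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Label"])
    for node in sorted({n for pair in pairs for n in pair}):
        writer.writerow([node, node])

# Step 6: the weighted edge table.
with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in weights.items():
        writer.writerow([source, target, weight])
```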

7. The video above provides my first attempt at arranging the nodes and edges visually in Gephi’s Overview interface. I had to tweak the nodes a bit around the center. Not sure why, but the weight of the edges didn’t perfectly match the nodes’ distance from the central node; presumably the force-directed layout balances edge attraction against node repulsion, so distance only approximates weight.

Here’s the final visualization. Click to enlarge for detail.

And here’s the list of edges (connections between “rhetoric” and other words) and their weights (number of connections), showing only edges with weights greater than 3:

Reading through the list, please recall Step 3 above. Following the methods in the Nodus Labs article, I clustered words such as “historical,” “historian,” and “histories” under the single node “history.” This de-stemming technique allows me to answer my question about concept-clustering without being so bloody granular that it appears as though no cohesion exists across the abstracts.
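In code, this de-stemming step amounts to a lookup table that folds variant word counts into one concept count. A minimal sketch, with the map entries and counts invented for illustration:

```python
from collections import Counter

# Variants collapse onto a single concept node (entries invented).
CONCEPT_MAP = {"historical": "history", "historian": "history",
               "histories": "history", "historiography": "history"}

# Invented word-level counts.
word_counts = Counter({"historian": 4, "histories": 2, "historical": 3})

# Fold the word counts into concept counts.
concept_counts = Counter()
for word, n in word_counts.items():
    concept_counts[CONCEPT_MAP.get(word, word)] += n

print(concept_counts["history"])  # 9
```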

So, this list and visualization show us what concepts cluster around the term “rhetoric” in RSQ abstracts. I say concepts rather than words because de-stemming the words places emphasis on the free morpheme, which will always be more ‘conceptual’ than individual words; i.e., the word clusters in a corpus might be “rhetoric historian”, “rhetoric histories”, and “rhetoric historiography”, all of which might be brought under a particular concept cluster, “rhetoric history”.

The verbs in the list are interesting: “use,” “study,” “compare,” and “examine” tell us both what scholars are doing with rhetoric and also, perhaps, what scholars are saying people outside the academy are doing with rhetoric.

The “rhetoric history” concept cluster carries the heaviest weight. This is not surprising, considering that rhetorical studies is embedded in the Greco-Roman rhetorical tradition, and that a lot of current scholarship is dedicated to historical recovery of oratory outside the traditional canon.

I was surprised by the heavy weight of the “rhetoric scholar” cluster. It must be that authors in RSQ like to talk about themselves. I’ll have to check the actual collocations, but I’ll bet that “rhetorical scholarship” and “scholars of rhetoric” constitute the bulk of this cluster’s connections.

The “rhetoric epideictic” cluster made me happy. Epideictic oratory and rhetoric were historically considered unworthy of serious attention, but clearly, they’re getting a lot of attention these days. Perhaps we can thank Richard Weaver, whose seminal essay “Language is Sermonic” basically reignited a scholarly interest in epideictic rhetoric.

That’s all for now. Like I said at the beginning, this was a simple text network analysis; it took me a couple hours, maybe a little more, and once I write a few scripts, I can probably automate the process. The next step will be to trace all the clusters in a textual network.