Meaning circulation in Lolita

As applied to writing studies, text network analysis is a method by which a researcher can trace the circulation of meaning within a text.

Meaning is generated when different pieces of information are related to one another in some way. Information is relational, however, only according to some system or ontology. For instance, the string of phonetic sounds [k æ t] relies for its meaning on the system of English phonology (as well as the English lexicon).  Within this system, when my vocal tract strings together the different sounds [k], [æ], and aspirated [t], I generate meaning because the different sounds relate to and work with one another to create the English word, cat. On their own, those sounds are not necessarily meaningful; they become meaningful in relation to one another within a specific system. Any text will draw upon many systems to create meaning: syntactic, graphemic, cultural . . .

Meaning circulates within individual texts, but individual texts circulate among other texts and within communities and cultures. So, a larger concern with meaning circulation is not satisfied with analyses of individual texts. However, any study of meaning circulation within larger networks must take the individual text as the starting point (or the ending point, depending on how you approach the question). The inter-textual network does not terminate at the individual text; it simply changes scale, exiting the exterior network and entering the network of the text.

Using AutoMap and Gephi, and following a methodology similar to the one described here, I created a network of all the lexical connections within the first 10 chapters of Vladimir Nabokov’s Lolita. (View the upcoming videos in full screen; otherwise, you can’t see the nodes I’m talking about.) There are different ways to visualize these connections as a text network, but the results here show which words possess the highest levels of betweenness centrality. The higher a word’s betweenness centrality, the larger its node. These are not the words with the most connections but the words that most often sit on the paths between different clusters, which tells us that these terms are used in many different contexts in the text and are therefore the most fluid in meaning.
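To make the measure concrete, here is a minimal sketch of betweenness centrality for an unweighted, undirected word network, written in plain Python using Brandes' algorithm. It is not what Gephi runs internally as far as this post can confirm, and the toy edges are invented for illustration; they are not taken from the Lolita network.

```python
from collections import deque

def build_graph(edges):
    """Turn a list of (word, word) pairs into an undirected adjacency map."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted)."""
    scores = {v: 0.0 for v in adj}
    for s in adj:
        order = []                         # nodes in order of discovery
        preds = {v: [] for v in adj}       # predecessors on shortest paths
        sigma = {v: 0 for v in adj}        # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:            # first time we reach w
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1: # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):          # accumulate from the farthest nodes back
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                scores[w] += delta[w]
    # Each unordered pair was counted from both ends, so halve the totals.
    return {v: score / 2 for v, score in scores.items()}

# Hypothetical toy network: three small clusters that all meet at "girl".
toy_edges = [
    ("girl", "hair"), ("girl", "eyes"), ("hair", "eyes"),
    ("girl", "age"), ("girl", "child"), ("age", "child"),
    ("girl", "night"), ("night", "dark"),
]
scores = betweenness(build_graph(toy_edges))
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word:6s} {score:.1f}")
```

In this toy network "girl" scores highest because every path between the clusters runs through it, even though words like "hair" have nearly as many direct neighbors within their own cluster.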

The results also allow us to trace all the possible connections from one word to any other, both within individual meaning clusters and through terms with a high level of betweenness centrality. For example, the terms ‘girl’ and ‘night’ have a relatively high betweenness centrality, and they are both connected to one another through the word ‘touched’, which itself is not connected to very many clusters and thus has a low betweenness centrality.

night –> touched –> girl

(Lots of pervy pathways of meaning in Lolita.)

Visualizing all the connections in this textual network is messy. Nabokov was a master stylist, not one to use the same words too often, and certainly not in the same sentence or in the same connective pattern. The average path length in the text is 7.95. Average path length measures how many steps you need to take on average to connect two randomly selected nodes. The lower the average path length, the more connected the text. At 7.95, the first 10 chapters in Lolita are not very connected; there are 221 separate meaning clusters. Here’s the messy initial network . . .
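A rough sketch of how average path length can be computed, assuming an unweighted, undirected network and averaging only over pairs of words that can actually reach one another. Whether Gephi handles unreachable pairs the same way is an assumption here, and counting connected components is only one possible reading of "meaning clusters". The toy edges are made up.

```python
from collections import deque

def build_graph(edges):
    """Turn a list of (word, word) pairs into an undirected adjacency map."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def bfs_distances(adj, source):
    """Number of steps from source to every word reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def average_path_length(adj):
    """Mean shortest-path length over all reachable pairs of distinct words."""
    total = pairs = 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

def count_components(adj):
    """Number of separate pieces the network falls into."""
    seen, components = set(), 0
    for s in adj:
        if s not in seen:
            seen.update(bfs_distances(adj, s))
            components += 1
    return components

# Hypothetical toy network: a chain a-b-c plus a separate pair d-e.
toy = build_graph([("a", "b"), ("b", "c"), ("d", "e")])
print(average_path_length(toy))
print(count_components(toy))
```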

Using Gephi’s degree range tool, I hid the most disconnected nodes, thereby ‘cleaning’ the visualization of all but the most prominent clusters and connections.
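The same kind of cleanup can be approximated outside Gephi. The sketch below assumes the filter simply keeps nodes whose degree in the original network meets a minimum and drops every edge that touched a removed node; the threshold and the toy edges are hypothetical.

```python
def build_graph(edges):
    """Turn a list of (word, word) pairs into an undirected adjacency map."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def filter_by_degree(adj, min_degree):
    """Keep only words with at least min_degree connections in the original graph."""
    keep = {word for word, neighbors in adj.items() if len(neighbors) >= min_degree}
    # Drop edges that pointed at removed words.
    return {word: adj[word] & keep for word in keep}

# Hypothetical toy network: "leaf" has a single connection and gets hidden.
toy = build_graph([
    ("hub", "a"), ("hub", "b"), ("hub", "c"),
    ("a", "b"), ("c", "leaf"),
])
cleaned = filter_by_degree(toy, min_degree=2)
print(sorted(cleaned))
```

Note that degree is measured before filtering, so a word like "c" stays visible even though it ends up with only one remaining neighbor.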

With this cleaner network, I could see a few distinct clusters, as well as those terms with high degrees of betweenness centrality, the words that act as conduits between different words and meaning clusters. They were what you’d expect: meaning in Lolita circulates through the favorite words of an enamored pederast. Nymphet, night, girl, age, eyes, hair . . .

More interesting than the overall network, however, were the various paths I found between different terms. In general, 3 or fewer steps of separation in a social network indicate a possibility of cross-influence between two nodes; in our textual network, two nodes separated by 3 steps or fewer indicate a possible, latent relationship between the nodes, perhaps even a relationship that can be expressed in terms of influence.

For example, ‘nymphet’ led backward to ‘annabel’, which had a direct path to ‘lolita’ in one direction and to ‘death’ in the other direction.
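A small sketch of this kind of path tracing, using only the three connections named in the example above and treating them as undirected (an assumption; the real network has many more edges). The function name and the 3-step cutoff parameter are illustrative choices.

```python
def build_graph(edges):
    """Turn a list of (word, word) pairs into an undirected adjacency map."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def paths_within(adj, source, target, max_steps=3):
    """All paths from source to target that use at most max_steps edges
    and never visit the same word twice."""
    found = []

    def walk(path):
        node = path[-1]
        if node == target:
            found.append(path)
            return
        if len(path) - 1 == max_steps:   # already used the allowed number of steps
            return
        for nxt in sorted(adj.get(node, ())):
            if nxt not in path:
                walk(path + [nxt])

    walk([source])
    return found

# Only the connections mentioned in the example above.
edges = [("nymphet", "annabel"), ("annabel", "lolita"), ("annabel", "death")]
network = build_graph(edges)
for path in paths_within(network, "nymphet", "death"):
    print(" -> ".join(path))
```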

Remember, this textual network only represents the first 10 chapters of the novel (I didn’t include the fake preface). And yet, already built into this network of lexemes from early in the novel is a clue to Humbert’s eventual demise, a great example of the intimate connection between form and function, style and plot.

Another interesting pathway was the path between ‘life’ and ‘death’. Actually, there were two pathways, one leading through ‘felt’ and another leading, oddly enough, through ‘love’ and then ‘father’.

The ‘father’, ‘love’, ‘death’ triangle is quite interesting . . . and, of course, ‘death’ leads back through ‘felt’ to ‘annabel’, the first nymphet in Humbert’s life.

Finally, two important terms in the network are quite disconnected: ‘nymphet’ and ‘girl’. This is exactly what we should expect. Humbert goes to great lengths to separate the one from the other, and textually, it’s difficult to trace a lexemic path from one to the other. (Note: at the end of this video, I highlight the word ‘fruit’, which is only connected to ‘table’ and ‘set’. Nabokov apparently declines to use any sort of forbidden fruit metaphor during the first ten chapters of the novel; ‘fruit’ never connects to the pervy words or meaning clusters.)

Even this short analysis has given me some interesting things I could discuss if I were actually writing a dissertation on Lolita. The meaning circulation of ‘lolita’, ‘annabel’, and ‘death’ through the conduit ‘nymphet’ would be worth analyzing in more detail, especially considering that this circulation occurs so early in the novel.

Concept clusters in Rhetoric Society Quarterly

Inspired by this article on text network analysis (TNA), I applied part of the article’s methodology to my own corpus of article abstracts from Rhetoric Society Quarterly. For this first foray into TNA, I had a simple goal: I wanted to see what concepts cluster around the word “rhetoric” in the abstracts.

I need to be careful using the above article’s methods on my abstracts corpora because these corpora are really multiple texts written by multiple authors; the methodology linked above is designed to analyze self-contained texts. But for my current question (what concepts occur most frequently beside “rhetoric”?), treating the corpus like a self-contained text is beneficial.  I want to see if there is stability in the use of “rhetoric” across dozens of authors, or if the concepts appearing beside “rhetoric” in the abstracts are myriad and unpredictable. In other words, my question is an indirect way to test the cohesion of rhetorical studies’ discourse about that discipline’s defining term.

First, a video of my initial results; then, the methods and my final results. In the video, the center red node is “rhetoric”; the green nodes are concepts that appear beside the word “rhetoric” in RSQ abstracts. The concepts that fly away from the center node only appear beside “rhetoric” once; the concepts that get pulled toward the center node appear in connection with “rhetoric” more and more often.

Now the methods:

1. Load the Rhetoric Society Quarterly abstracts corpus into Python using the NLTK package.

2. Get rid of stopwords in the corpus, so that a sentence like “The rhetoric of Aristotle is too analytic” becomes  “rhetoric Aristotle analytic”.

3. Do a ‘light’ cleaning of stems in the corpus, so that “comparative rhetoric” and “comparison rhetoric” both become “compare rhetoric.” (See below for more on this.)

4. Collect word-sequences that use the word “rhetoric.” I used 3-word sequences.

4.5. Obviously, getting rid of stopwords was essential; otherwise, I’d end up with an unhelpful list like this:

Now I had a list of terms to which rhetoric was connected: N1 rhetoric N2.

5. Next, using Gephi, import the list of words as individual nodes.

6. List the edges between the nodes in Gephi, giving an edge a higher weight each time it recurs in the abstracts corpora. For example, “rhetoric” and “science” occurred beside each other 3 times; so, the edge between these two IDs (“rhetoric”=source, “science”=target) was given a weight of 3.

Steps 5 and 6 were done manually. Very time consuming; it took over an hour. I’ll need to find a way to run an algorithm that can do it automatically, especially once I start working with larger texts.

7. The video above provides my first attempt at arranging the nodes and edges visually in Gephi’s Overview interface. I had to tweak the nodes a bit around the center. Not sure why, but the weight of the edges didn’t perfectly match the nodes’ distance from the central node.
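The manual work in Steps 5 and 6 could probably be scripted. Below is a minimal sketch of Steps 2, 4, and 6 in plain Python: strip stopwords, collect the words immediately to either side of “rhetoric”, and count how often each pairing recurs. The stopword set is a tiny hand-written stand-in (not NLTK's list), the second sample sentence is invented, the stem cleaning from Step 3 is skipped, and the Source/Target/Weight header is the column layout Gephi's spreadsheet import is assumed to accept for an edge table.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical stand-in for a real stopword list such as NLTK's.
STOPWORDS = {"the", "of", "is", "too", "a", "an", "and", "in", "to"}

def content_words(text):
    """Lowercase the text, split it into words, and drop stopwords (Step 2)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def neighbor_counts(texts, keyword="rhetoric"):
    """Count the words directly before and after each use of keyword (Steps 4 and 6)."""
    counts = Counter()
    for text in texts:
        words = content_words(text)
        for i, word in enumerate(words):
            if word != keyword:
                continue
            if i > 0:
                counts[words[i - 1]] += 1
            if i + 1 < len(words):
                counts[words[i + 1]] += 1
    return counts

def edges_csv(counts, keyword="rhetoric"):
    """Render the counts as an edge table, heaviest edges first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for target, weight in counts.most_common():
        writer.writerow([keyword, target, weight])
    return buffer.getvalue()

# First sentence is the example from Step 2; the second is made up.
sample_abstracts = [
    "The rhetoric of Aristotle is too analytic",
    "The rhetoric of science and the science of rhetoric",
]
counts = neighbor_counts(sample_abstracts)
print(edges_csv(counts))
```

Writing the returned string to a .csv file would give a table to import into Gephi in place of hand-entered edges; the node list can be derived from the same table.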

Here’s the final visualization. Click to enlarge for detail.

And here’s the list of edges (connections between “rhetoric” and other words) and their weights (number of connections), showing only edges with weights greater than 3:

Reading through the list, please recall Step 3 above. Following the methods in the Nodus Labs article, I clustered words such as “historical,” “historian,” and “histories” under the single node “history.” This de-stemming technique allows me to answer my question about concept-clustering without being so bloody granular that it appears as though no cohesion exists across the abstracts.

So, this list and visualization show us what concepts cluster around the term “rhetoric” in RSQ abstracts. I say concepts rather than words because de-stemming the words places emphasis on the free morpheme, which will always be more ‘conceptual’ than individual words; i.e., the word clusters in a corpus might be “rhetoric historian”, “rhetoric histories”, and “rhetoric historiography”, all of which might be brought under a particular concept cluster, “rhetoric history”.
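The clustering of word forms under a single concept node can be sketched as a simple lookup applied to the edge weights. The mapping below contains only the word forms mentioned in this post; the counts are invented for illustration, and a hand-built dictionary like this is an assumption about how the ‘light’ cleaning might be done, not a record of the actual procedure.

```python
from collections import Counter

# Word forms named in this post, mapped to the concept node they fall under.
CONCEPTS = {
    "historical": "history",
    "historian": "history",
    "histories": "history",
    "comparative": "compare",
    "comparison": "compare",
}

def merge_into_concepts(counts, concepts=CONCEPTS):
    """Add up edge weights for all word forms that share a concept node."""
    merged = Counter()
    for word, weight in counts.items():
        merged[concepts.get(word, word)] += weight
    return merged

# Hypothetical weights for words found beside "rhetoric".
raw_counts = Counter({"historical": 4, "historian": 2, "histories": 1,
                      "comparative": 2, "science": 3})
concept_counts = merge_into_concepts(raw_counts)
print(concept_counts.most_common())
```

Words without an entry in the mapping, like "science" here, pass through unchanged, which keeps the cleaning ‘light’.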

The verbs in the list are interesting: “use,” “study,” “compare,” and “examine” tell us both what scholars are doing with rhetoric and also, perhaps, what scholars are saying people outside the academy are doing with rhetoric.

The “rhetoric history” concept cluster carries the heaviest weight. This is not surprising, considering that rhetorical studies is embedded in the Greco-Roman rhetorical tradition, and that a lot of current scholarship is dedicated to historical recovery of oratory outside the traditional canon.

I was surprised by the heavy weight of the “rhetoric scholar” cluster. It must be that authors in RSQ like to talk about themselves. I’ll have to check the actual collocations, but I’ll bet that “rhetorical scholarship” and “scholars of rhetoric” constitute the bulk of this cluster’s connections.

The “rhetoric epideictic” cluster made me happy. Epideictic oratory and rhetoric were historically considered unworthy of serious attention, but clearly, they’re getting a lot of attention these days. Perhaps we can thank Richard Weaver, whose seminal essay “Language is Sermonic” basically reignited a scholarly interest in epideictic rhetoric.

That’s all for now. Like I said at the beginning, this was a simple text network analysis; it took me a couple hours, maybe a little more, and once I write a few scripts, I can probably automate the process. The next step will be to trace all the clusters in a textual network.