Elliot Rodger’s Manifesto: Text Networks and Corpus Features

Analyzing manifestos is becoming a theme at this blog. Click here for Chris Dorner’s manifesto and here for the Unabomber manifesto.

Manifestos are interesting because they are the most deliberately written and deliberately personal of genres. It’s tenuous to make claims about a person’s psyche based on the linguistic features of his personal emails; it’s far less tenuous to make claims about a person’s psyche based on the linguistic features of his manifesto, especially one written right before he goes on a kill rampage. This one, “My Twisted World,” written by omega male Elliot Rodger, is 140 pages long, and is part manifesto, part autobiography.

I’ve made a lot of text networks over the years—of manifestos, of novels, of poems. Never before have I seen such a long text exhibit this kind of stark, binary division:

RodgersBetweennessCentrality

This network visualizes the nodes with the highest betweenness centrality. The lower, light blue cluster is Elliot’s domestic language; this is where you’ll find words like “friends,” “school,” “house,” et cetera . . . words describing his life in general. The higher, red cluster is Elliot’s sexually frustrated language; this is where you’ll find words like “girls,” “women,” “sex,” “experience,” “beautiful,” “never” . . . words describing his relationships (or lack thereof) with the feminine half of our species.
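For readers who want to experiment with this kind of network, here is a minimal sketch in plain Python. It is not the pipeline used for the figure above: the co-occurrence window, the toy token list, and the function names are all assumptions made for illustration. It links words that appear near each other and then scores each word by betweenness centrality (Brandes' algorithm), so that words bridging two otherwise separate clusters score highest.

```python
from collections import defaultdict, deque

def cooccurrence_graph(tokens, window=2):
    """Link each token to the tokens that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)

def betweenness(graph):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        stack = []
        preds = {v: [] for v in graph}   # predecessors on shortest paths
        sigma = {v: 0 for v in graph}    # number of shortest paths from source
        dist = {v: -1 for v in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:                     # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                     # accumulate dependencies, farthest first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected path was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}

# Toy token stream (hypothetical): two tight clusters joined only through "i".
tokens = "friends school house friends school i girls sex never girls sex".split()
scores = betweenness(cooccurrence_graph(tokens, window=1))
for word, value in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(word, value)
```

On a real text you would first remove stop words (or, as in this post, deliberately keep the pronouns) and choose the window size with care, since both choices change which words end up as bridges.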

It’s quite startling. Although this text is part manifesto and part autobiography, I wasn’t expecting such a clear division: the language Elliot uses to describe his sexually frustrated life is almost wholly severed from the language he uses to describe his life apart from the sex and the “girls” (Elliot uses “girls” far more frequently than he uses “women”—see below). It’s as though Elliot had completely compartmentalized his sexual frustration, and was keeping it at bay. Or trying to. I don’t know how this plays out in individual sections of the manifesto. Nor do I know what it says about Elliot’s mental health more generally. I’ve always believed that compartmentalizing frustrations is, contra popular advice, a rather healthy thing to do. I expected a very, very tortuous and conflicted network to emerge here, indicating that each aspect of Elliot’s life was dripping with sexual angst and misogyny. Not so, it turns out.

Here’s a brief “zoom” on each section:

RodgersDegreeCentralityDomestic

RodgersDegreeCentralityWomen

In the large, zoomed-out network (the first one in the post), notice that the most central nodes are “me” and “my.” I processed the text using AutoMap but decided to retain the pronouns, curious how the feminine, masculine, and personal pronouns would play out in the networks and the dispersion plots. Feminine, masculine, personal: not just pronouns in this particular text. And what emerges when the pronouns are retained is an obvious image of the Personal. Rodger’s manifesto is brimming with self-reference:

RodgersPronouns

Take that with a grain of salt, of course. In making claims about any text with these methods, one should compare features with the features of general text corpora and with texts of a similar type. The Brown Corpus provides some perspective: “It” is the most frequent pronoun in that corpus; “I” is second; “me” is far down the list, past the third-person pronouns.
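A comparison like this one is easiest to make with rates rather than raw counts, since the texts differ in length. The sketch below is one assumed way to do it: the pronoun list, the crude tokenizer, and the per-1,000-words scale are illustrative choices, and loading a reference corpus such as Brown is left out.

```python
import re
from collections import Counter

# Illustrative pronoun list; extend it to suit the comparison being made.
PRONOUNS = {"i", "me", "my", "he", "him", "his", "she", "her", "it", "they", "them"}

def pronoun_rates(text, per=1000):
    """Return pronoun frequencies per `per` tokens, most frequent first."""
    # Crude tokenizer: contractions such as "I'm" split into "i" and "m".
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return {}
    counts = Counter(t for t in tokens if t in PRONOUNS)
    ranked = sorted(counts, key=counts.get, reverse=True)
    return {p: counts[p] * per / len(tokens) for p in ranked}

sample = "I gave my book to her. I left."
print(pronoun_rates(sample))
```

Running the same function over the manifesto and over a reference corpus, with identical tokenization, gives two rankings that can be compared directly.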

Here’s another narcissistic twist, found in the most frequent words in the text. Again, pronouns have been retained. (Click to enlarge.)

RodgersFreqWords

“I” is the most frequent word in the entire text, coming before even the basic functional workhorses of the English language. The Brown Corpus once more provides perspective: “I” is the 11th most frequent word in that general corpus. Of course, as noted, there is an auto-biographic ethos to this manifesto, so it would be worth checking whether or not other auto-biographies bump “I” to the number one spot. Perhaps. But I would be surprised if “I,” “me,” and “my” all clustered in the top 10 in a typical auto-biography—a narcissistic genre by design, yet I imagine that self-aware authors attempt to balance the “I” with a pro-social dose of “thou.” Maybe I’m wrong. It would be worth checking.

More lexical dispersion plots . . .

Much more negation is seen below than is typically found in texts. According to Michael Halliday, most text corpora will exhibit 10% negative polarity and 90% positive polarity. Elliot’s manifesto, however, bursts with negation. Also notice, below, the constant references to “mother” and “father”: his parents are central characters. But not “mom” and “dad.” I’m from Southern California, born and raised, with social experience across the races and classes, but I’ve never heard a single English-only speaker refer to parents as “mother” and “father” instead of “mom” and “dad.” Was Elliot bilingual? Finally, note that Elliot prefers “girl/s” to “woman/en.”

RodgersGirlsGuys

RodgersMotherFather

RodgersNegation
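To put a rough number next to the 10% benchmark, one could count the share of sentences that contain a negation marker. This is only a crude proxy: Halliday's figure concerns clause polarity, whereas the sketch below works on sentences, and the marker list and sentence splitter are assumptions rather than anything used to make the plot above.

```python
import re

# Illustrative negation markers, plus contracted "n't" (straight or curly apostrophe).
NEGATORS = re.compile(
    r"\b(?:no|not|never|nothing|nobody|none|neither|nor|cannot)\b|n['’]t\b",
    re.IGNORECASE,
)

def negation_share(text):
    """Fraction of sentences containing at least one negation marker."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    negative = sum(1 for s in sentences if NEGATORS.search(s))
    return negative / len(sentences)

sample = "I never went. She was kind. They didn't call. We left."
print(negation_share(sample))
```

Because a long sentence has more chances to contain a negator than a short clause does, a sentence-level share will tend to run higher than a clause-level one, so the comparison only means something when the reference text is measured the same way.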

RodgersSexEtc
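A lexical dispersion plot simply marks where in the running text each target word occurs. As a minimal sketch, assuming a plain-text copy of the document and a simple lowercase tokenizer (both assumptions, as are the function names), the positions can be collected and drawn as a text strip like this:

```python
import re

def dispersion_offsets(text, targets):
    """Map each target word to the token positions where it occurs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    wanted = {t.lower() for t in targets}
    offsets = {t: [] for t in wanted}
    for position, token in enumerate(tokens):
        if token in wanted:
            offsets[token].append(position)
    return offsets, len(tokens)

def ascii_dispersion(text, targets, width=60):
    """Draw one row per target word; '|' marks an occurrence, '.' marks none."""
    offsets, n = dispersion_offsets(text, targets)
    lines = []
    for word in targets:
        row = ["."] * width
        for pos in offsets[word.lower()]:
            # Scale the token position onto the fixed-width row.
            row[min(width - 1, pos * width // max(n, 1))] = "|"
        lines.append(f"{word:>8} {''.join(row)}")
    return "\n".join(lines)

sample = "Girls and guys. Girls again."
print(ascii_dispersion(sample, ["girls", "guys"], width=20))
```

Plotting the same offsets with a charting library gives the familiar picture; the text version is enough to see whether a word is spread evenly or bunched at the beginning and end.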

Until I discover that auto-biographical texts always drip with personal pronouns, I would argue that Elliot’s manifesto is the product of an especially narcissistic personality. The boy couldn’t go two sentences without referencing himself in some way.

And what about the misogyny? He uses masculine pronouns as often as he uses feminine pronouns; he refers to his father as often as he refers to his mother—although, it is true, the references to mother become more frequent, relative to father, as Elliot pushes toward his misogynistic climax. Overall, however, the rhetorical energy in the text is not expended on females in particular. This is not an anti-woman screed from beginning to end. Also, recall, the preferred term is “girls,” not “women.” Elliot hated girls. Women—middle-aged, old, married, ensconced in careers, not apt to wear bikinis on the Santa Barbara beach—are hardly on Elliot’s radar. (This ageism also comes through in his YouTube videos.) Despite the “I hate all women” rhetorical flourishes at the very beginning and the very end of his manifesto, Elliot prefers to write about girls—young, blonde, unmarried, pre-career, in sororities, apt to wear bikinis on the Santa Barbara beach.

I noticed something similar in the Unabomber manifesto. Not about the girls. About the beginning and ending: what we remember most from that manifesto is its anti-PC bookends, even though the bulk of the manifesto devotes itself to very different subject matter. The quotes pulled from manifestos (including this one) and published by news outlets are a few subjective anecdotes, not the totality of the text.

Anyway. Pieces of writing that sally forth from such diseased individuals always call to mind what Kenneth Burke said about Mein Kampf:

[Hitler] was helpful enough to put his cards face up on the table, that we might examine his hands. Let us, then, for God’s sake, examine them.

 

Demographic distribution: Gender of citations in CCC, RSQ, and RR abstracts

This post follows up on my discussion of citation frequencies in abstracts in rhetoric and composition journals. To reiterate, a safe assumption to make is that citations in abstracts are “central” to the arguments presented and the research undertaken in the articles themselves; they are particularly informative about overall trends. The genre of the humanities article demands more citations than a core argument actually requires, so looking at citations in abstracts should control for that genre requirement, distilling down all citations to the most vital ones.

The journals: College Composition and Communication (CCC), Rhetoric Society Quarterly (RSQ), and Rhetoric Review (RR). The CCC abstracts run from February 2000 (51.3) to September 2011 (63.1), a total of 261 abstracts. The RSQ abstracts run from Winter 2000 (30.1) to Fall 2011 (41.5), a total of 220 abstracts. The RR abstracts run from 2002 (21.3) to 2011 (30.4), a total of 154 abstracts.

The previous post discussed the “long tail” distribution that emerged from the citation frequencies and what it means for disciplinary identity. This post presents information on the gender of the sources cited in the abstracts, then makes a few comments about demographic distributions in general.

There are 79 unique citations in the CCC abstracts; 159 unique citations in the RSQ abstracts; and 121 unique citations in the RR abstracts. (See previous post for .xls data files.) Here’s how the gender distribution falls: in CCC, 23 out of the 79 sources are female; in RSQ, 39 out of the 159 sources are female; in RR, 36 out of the 121 sources are female.
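The percentages implied by these counts can be checked with a few lines of Python. The counts are the ones given above; only the variable and function names are made up for the example.

```python
# (female sources, total unique sources) per journal, as reported above.
counts = {
    "CCC": (23, 79),
    "RSQ": (39, 159),
    "RR": (36, 121),
}

def female_share(female, total):
    """Percentage of cited sources that are female."""
    return 100 * female / total

for journal, (female, total) in counts.items():
    share = female_share(female, total)
    print(f"{journal}: {share:.1f}% female, {100 - share:.1f}% male")

# CCC: 29.1% female, 70.9% male
# RSQ: 24.5% female, 75.5% male
# RR: 29.8% female, 70.2% male
```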

And here are graphs of the raw counts and of the percentages:

Abstract citations by gender (raw count)

Abstract citations by gender (percentage)

In Authoring a Discipline, Maureen Daly Goggin has shown that by 1990 total contributors to 9 of rhetoric and composition’s major journals—including the 3 analyzed here—had equalized to a nearly 50/50 split between males and females. I imagine this trend has continued into the new millennium, but it would be worthwhile to determine whether or not that’s the case.

What has not equalized, however, is the gender contribution in terms of citations. Odds are, counting all citations in the articles themselves would alleviate the large gap seen in the graphs above. But insofar as we accept that abstract citations represent the most vital sources in each journal, then an obvious gender gap still exists in CCC, RSQ, and RR citations.

In RSQ and RR, this gap, in part, likely has something to do with these journals’ tendencies to publish work on rhetorical history. I pointed this out in the last post: 27 (or 22%) of the RR citations are sources from the 17th century or earlier. 26 (or 16%) of RSQ citations are from the same period. Those numbers would grow if they included figures from the 18th and 19th centuries, as well. The reality is, most of these historical sources are male: Plato, Cicero, Aristotle, Quintilian, et cetera.

I have no ready explanation for why CCC citations should have as large a gender gap as the other journals’ citations, given that CCC builds most of its scholarship on sources from the middle part of the 20th century or later. If we look at the 102 most cited figures in CCC between 1987 and 2011 (Mueller, “Grasping”), we discover that 43/102 (42%) of the sources are female: a gender imbalance, but one not nearly as pronounced as the imbalance that surfaces in abstract citations. I’d be curious to see the gender distribution in Mueller’s entire data set. Is there a nearly 50/50 split between male and female sources across all citations in CCC between 1987 and 2011? If so, we could model the gender imbalance in this journal’s citations as an emergent feature: 50/50 across the entire data set; 58/42 in the most popular citations between 1987 and 2011; 71/29 in abstracts between 2000 and 2011. It’s unfortunate that CCC did not publish abstracts until the late 1990s, so that the dates of the abstracts and the articles could be uniform.

The question of demographic balance is one that spills a lot of digital ink. Just this morning, Scott Weingart visualized the gender (im)balance of Digital Humanities Conference attendees: about a 70/30 split that favors males. And Google recently released the demographic characteristics of its workforce: 30% of its employees are women; 17% of its technical employees are women. 60% of its employees are white; 30% of its employees are Asian (read: East Asian and Indian); and only 3% of its employees are Non-Asian Minorities.

I asked Scott why our default assumption should be uniform demographic distribution. When looking at statistical trends that emerge at large scales, we shouldn’t be surprised to discover that human populations cluster differently. At least, that’s my default assumption. The DH Conference draws more males, but then, an Early Childhood Education conference will draw more females. (I once attended a conference on speech and behavior therapy for autistic children; there were no more than three or four males amid about seventy females.) Or take a look at the National Association for the Education of Young Children. Although we often hear about the male-ness of executive boards, the NAEYC’s executive team is entirely female, and its 17-member governing board boasts 13 females and only 4 males. Looking at all the Early Childhood Education associations and organizations in the country, what gender trends would we expect to find?

The first question to ask about demographic distribution in any particular population (like Google’s workforce or citations in abstracts) is this: What are the characteristics of the larger population from which this particular population is drawing? As long as rhetorical scholars continue to look at rhetorical history, where most of the figures are male, then we can continue to expect many citations in these historical journals to be male. (This may change, however, as more and more rhetorical historians re-discover the history of female oratory.) Or, in Google’s case, if we take the American population as the baseline, assuming a 50/50 gender split, then clearly there is a gender imbalance. But in terms of race and ethnicity, its white workforce is in fact under-represented. Raising the percentage of blacks and Hispanics at Google would mean firing a lot of the Chinese and Indians, unless we want to make whites more under-represented than they already are. (A fairer baseline population would be the percentage of working-age adults in America, or, better yet, the percentage of working-age adults with college degrees; however, those stats are much harder to come by. Total population is a decent but imperfect proxy.)

The point is that we do not always find particular populations boasting a uniform or near-uniform demographic distribution. Why is this? A complex question. Given the totality of the human population (or, more humbly, the totality of any total population in a given geographic area), why do we find the smaller population clusters clustering the way they do around different practices? Why are there more males in CCC citations? Why are there more males at the DH Conference? Why are East Asians and Indians so over-represented at Google? Why are there so few East Asians and Indians in the NFL and the NBA? That populations cluster differently around different practices seems to be a statistical fact. Is it also a future inevitability?

A possible explanation for the emergence of quotative “like” in American English

So Monica was like, “What are you doing here, Chandler?” and Chandler was like, “Uhh nothing” and then Monica was like, “Why are you here with Phoebe?” and Chandler was like, “I don’t know,” and Monica was like, “Whatever!”

Quotative “be like” probably gets on your nerves. Unfortunately for you, it spread like wildfire in the latter half of the 20th century and today is used by native and non-native speakers alike as often as they use traditional say-type quotatives. What is its structure, when did it arise, and why did it spread so quickly? This post offers a possible explanation, based on evidence dragged up from the depths of the Google Books Corpus. To appreciate that evidence, however, we need to start with some discussion of this quotative’s formal properties.

1

One interesting property of quotative “be like” is its ambiguous semantics. In some contexts, it is a stative predicate that denotes internal speech, i.e., thoughts reflexive of an attitude. In other contexts, it is an eventive predicate denoting an actual speech act. Sometimes, the denotation is ambiguous, as in (1):

(1) Monica was like, “Oh my God!”

. . . Did Monica literally say “Oh my God!” or did she just think or feel it?

Another interesting property of quotative “be like” is that it disallows indirect speech.

(2a) Monica was like, “I should go to the mall.”

(2b) *Monica was like that she should go to the mall.

(2c) *Monica was like she should go to the mall

Quotative say of course allows indirect speech:

(3a) Monica said, “I should go to the mall.”

(3b) Monica said that she should go to the mall.

(3c) Monica said she should go to the mall.

Haddican et al. (2012) recognize that quotative “be like” is immune to indirect speech due to its mimetic implicature. (2b) cannot be allowed because quotative “be like” always means something more along these lines:

(4) Monica was like: QUOTE

Given the implied mimesis of this construction, it makes no sense, as in (2b) and (2c), to add an overt complementizer and to change person/tense to produce an indirect, third person report. This property is shared by all uses of quotative “be like,” whether in their stative or eventive readings.

But there’s more to it than a mimetic implicature. Schourup (1982) points out that quotative “go” also shares this mimetic property (although he does not frame it as such). As expected of a quotative with a mimetic implicature, quotative “go” likewise does not allow an indirect speech interpretation via addition of an overt complementizer and shifts in person/tense:

(5a) Monica goes, “I should go to the mall.”

(5b) *Monica goes that she should go to the mall.

Why should these innovative quotatives be so immune to indirect speech and so committed to direct quote marking? Schourup suggests that quotative “go” (and, by extension, quotative “be like”) arose precisely to meet English’s need for a mimetic, unambiguous direct quotation marker. Prior to the occurrence of these new quotatives, English lacked such a marker. Consider (6a) and (6b) below:

(6a) When I talked to him yesterday, Chandler said that you should go to the doctor.

(6b) When I talked to him yesterday, Chandler said you should go to the doctor.

There is no ambiguity in (6a). The speaker of this utterance clearly intends to convey to his interlocutor that Chandler said the interlocutor should go to the doctor. (6b), however, introduces ambiguity. The utterance in (6b) can be interpreted in two ways: a) Chandler said the speaker of the utterance (i.e., I) should go to the doctor; b) Chandler said the speaker’s interlocutor (i.e., you) should go to the doctor. With orthographic conventions, of course, this ambiguity disappears:

(6c) When I talked to him yesterday, Chandler said, “You should go to the doctor.” (So I went.)

However, unlike other languages, spoken English has no “quoting” conventions—it has no direct quote markers for unmarked speech. It is unclear if (6b) is a true quotative or merely an indirect report on speech with a null complementizer.

QuotvsInt

We can imagine speakers needing to clarify this ambiguity:

JOEY: When I talked to him yesterday, Chandler said you should go to the doctor.

ROSS: Wait, he said I should go or you should go?

This ambiguity arises with say-type verbs whenever the complementizer “that” is omitted. It is traditionally understood that English differentiates between direct quotatives and indirectly reported speech via shifts in person and/or tense. However, the overt complementizer is really the central feature of this differentiation. Without an overt complementizer, it is never entirely clear if the embedded clause is a direct quote or an indirect report of speech. Here’s another example:

(7) JOEY: Chandler said I will be responsible for the cat’s funeral.

Without the aid of quote marks, we cannot know whether Chandler or Joey is responsible for the cat’s funeral, even though the embedded clause contains a shift in both person and tense. Of course, if Joey wants to convey that Joey himself will be responsible for the cat’s funeral, he can simply add the overt complementizer: “Chandler said that I will be responsible . . .” However, if Joey wants to convey that Chandler has decided to be responsible, Joey has no way to convey it unambiguously with say-type verbs. He must resort to an indirect speech construction with an overt complementizer. Alternatively, he can resort to non-structural signals: a short pause, a change in intonation, or a mimicry of Chandler’s voice. Or he must abandon say-type constructions altogether and convey his meaning some other way.

Quotative “go” and quotative “be like” solve this ambiguity. These innovative quotatives always signal that the following clause is mimetic, a direct quote of speech or thought. Many languages (Russian, Japanese, Georgian, Ancient Greek, to name just a few) have overt markers to ensure that interior clauses are understood as being directly quoted material, whether or not those quoted clauses contain grammatical shifts (though of course they often do). The quotatives “go” and “be like” serve this same purpose. They are structural, unambiguous markers for direct speech, which is why one cannot use them for indirect speech, and which is also why they have spread so widely and quickly: they have met a real need in the language.

Quotative “go,” however, is attested long before quotative “be like.” The Oxford English Dictionary puts the earliest usage in the early 19th century, initially as a way to mime sounds people made, then later as a way to report on actual speech. Here’s an example from Dickens’ Pickwick Papers:

DickensPickwick

So, although I have said that both quotative “be like” and quotative “go” met a need in English for an unambiguous direct quotation marker, it was “go” that in fact met the need first, by at least a century. This historical fact leads me to suspect that quotative “be like” met a slightly different need: while quotative “go” became a direct quotation marker for speech acts, quotative “be like” became a direct quotation marker for thoughts. As Haddican et al. rightly note, an innovative feature of these quotatives is that they allow direct quotes to be descriptors of states. In other words, the directly marked quotes of “go” denote external speech; the directly marked quotes of “be like” primarily denote internal speech, i.e., thoughts or attitudes. I believe this hypothesis is supported by the earliest uses of quotative “be like,” to which we now turn:

2

Today, young native and non-native speakers of English frequently use “like” as a versatile discourse marker or interjection in addition to its use as a quotative (D’Arcy 2005). D’Arcy provides two extreme examples of discourse marker “like.” Both are taken from a large corpus of spoken English:

(8) I love Carrie. LIKE, Carrie’s LIKE a little LIKE out-of-it but LIKE she’s the funniest, LIKE she’s a space-cadet.      Anyways, so she’s LIKE taking shots, she’s LIKE talking away to me, and she’s LIKE, “What’s wrong with you?”

(9) Well you just cut out LIKE a girl figure and a boy figure and then you’d cut out LIKE a dress or a skirt or a coat, and LIKE you’d colour it.

This usage does not become noticeable in available corpora until the 1980s, so nearly all papers that I have read assume that discourse marker “like” and quotative “be like” arose more or less in tandem during the 1970s, becoming common by the 1980s. However, using the Google Books Corpus, I was able to find an early use of “like” that presages quotative “be like.” This early use also seems to set the stage for the versatile discursive uses of “like” seen in (8) and (9). This early use is the expression, “like wow.” It seems to have arisen during the 1950s (though perhaps earlier) in the early rock ’n’ roll scenes in the Southern United States. Here are some examples.

The first is from 1957: a line from a rock ’n’ roll song by Tommy Sands:

(10) When you walk down the street, my heart skips a beat—man, like wow!

The second is from a 1960 issue of Business Education World:

(11) Like, wow! I’m taking a real cool course called general business. It’s the most.

BusinessEducationWorld

The third is from a novel called The Fugitive Pigeon, published in 1965:

(12) But all of a sudden you’re like wow, you know what I mean?

And by 1971, we have a full example of quotative “be like”; note that this early occurrence uses an expletive as the subject:

(13) But to me it was like, “Oh, why can’t you say, ‘Gee that’s wonderful . . .’”

LifeMagazine1971

These early uses of “like wow” in (10) and (11) denote a stative feeling or attitude rather than any kind of eventive speech act. This is especially clear in (11), where the expression is a direct response to a question about how the speaker is feeling. The quotative in (13) likewise seems to be a stative predicate rather than an eventive one. In fact, in nearly all of the earliest uses of quotative “be like” (from the 1970s and early 1980s, as reported in the Google Books Corpus), the intention is to denote a feeling or attitude, not a direct quote of a speech act. Such eventive predications don’t become common until the 1990s and 2000s.
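Anyone wanting to hunt for candidate examples in their own plain-text corpus could start from a pattern like the one below. It is a rough sketch, not the search used for this post: the regular expression, the function name, and the sample text are assumptions. It also only finds the surface frame (a form of “be,” optionally one adverb, “like,” then an opening quotation mark), so every hit still has to be read in context to decide whether it is stative or eventive.

```python
import re

QUOTATIVE_BE_LIKE = re.compile(
    r"""
    (?:\b(?:am|is|are|was|were)|['’](?:s|m|re))  # a form of "be", full or contracted
    \s+(?:\w+\s+)?                               # optionally one adverb ("was totally like")
    like,?\s*                                    # "like", with an optional comma
    ["“]                                         # an opening quotation mark
    """,
    re.IGNORECASE | re.VERBOSE,
)

def find_be_like(text):
    """Return the matched quotative frames, in order of appearance."""
    return [m.group(0) for m in QUOTATIVE_BE_LIKE.finditer(text)]

sample = (
    'Monica was like, "Oh my God!" He looks like his dad. '
    "She's like, \"What?\" Ross was totally like, “I don’t care!”"
)
for frame in find_be_like(sample):
    print(repr(frame))
```

The optional adverb slot will also let through false positives such as “it was something like, ‘. . .’,” which is one more reason to treat the matches as candidates rather than counts.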

“Like wow,” then, arose in 1950s slang as a stative description. However, the sentence in (14) below suggests that wow was not interpreted as a structurally independent interjection but as an adjective. This is from a 1960 edition of Road and Track magazine:

(14) Man, that crate would look like wow with a Merc grille.

RoadTrack

It is possible that like is an adverb here, but in my estimation it is most likely still a garden variety manner preposition that has innovatively selected for a bare adjective. Typically, like as a preposition only selects NPs as its complement. However, with the advent of “like wow,” it loosened its selection requirements and began to select for adjectives as well. And not just adjectives. The bottom line in this advertisement from Billboard magazine in May 1960 demonstrates that it also began to select for adverbs:

BillboardLikeWowAd

Apparently, in the 1950s and early 1960s, like became a popular and versatile manner preposition. Once like loosened its requirements to select AP complements, it’s easy to see how it could start selecting quotes, thus becoming a new direct quote marker (like narrative “go”); and given the stative denotation of the original phrase “like wow,” it’s also easy to see why stative to be would become the verbal element in this quotative rather than a lexical verb like act or go. Indeed, it appears that the first uses of quotative “be like” were entirely restricted to the phrase “like wow,” ensuring that subsequent uses would likewise have stative readings. (The ad above also shows how easy it would be for like to become an all-around discourse marker once it began to select for a wider range of phrases.)

So, based on the timeline of evidence in the corpus, I posit the following evolution:

LikeEvolution

The emergence of quotative “like”

I follow Haddican et al. in assuming that like in quotative “be like” is still a manner preposition. However, while they assume the preposition did not undergo any change, I argue that like became more versatile in its selection restrictions. This versatility allowed it first to select APs, then to select quotes. Initially, this quotative construction was just an extension of the phrase “like wow,” but it soon began to select any quoted material. And from the beginning, this quotative possessed two features: a) it had an obvious mimetic implicature, ensuring that it would be a direct quote marker, similar to narrative “go”; and b) it had a stative denotation, due to the stative denotation of the original phrase “like wow,” ensuring that the directly marked quotes were reflective of internal speech, i.e., thoughts or attitudes.

A corpus analysis by Buchstaller (2001) has shown that, even today, quotative “go” is much more likely than quotative “be like” to frame “real, occurring speech” (p. 10); in other words, “be like” continues to be used more often as a stative rather than eventive predicate. As I mentioned earlier, Haddican et al. are correct that one innovative aspect of quotative “be like” is that quotes are now able to be descriptors of states; however, I believe they overstate the eventive vs. stative ambiguity that arises in these quotatives. Most of the time, in real contexts, they are as unambiguously stative as they are unambiguously mimetic of the state. Haddican et al. themselves note that even these eventive readings are open to clarification. Asking whether or not someone “literally” said something sounds much odder following a say-type quotative than a “be like” quotative with a putatively eventive reading.

3

Nevertheless, as I showed at the very beginning of this post, there are instances where quotative “be like” seems to denote an eventive speech act. Linguistically, this is odder than it sounds at first. A single verbal construction—like quotative “be like”—should not have a stative and eventive reading. This ambiguity can only happen for two reasons: either there is some special semantic function at work in this construction, or there are in fact two separate quotative constructions, each with its own syntactic structures.

It is tempting to see a correlation between this ambiguity and the putative ambiguity between stative be and eventive be, also known as the be of activity. Consider the following sentences:

(15) Joey was silly.

(16) Rachel asked Joey to be silly.

Both forms of be select an adjective; however, (16), unlike (15), can be taken to mean that Joey performed some silly action. In other words, the small clause in (16) seems to be an eventive predication, not a stative one. It has been argued (Parsons 1990) that this eventive be is not the usual copular form but a completely different verb that means something like “to act”—in other words, English to be is actually a homophonous pair of verbs, similar to auxiliary have and possessive have. Perhaps this lexical ambiguity in be is related to the eventive vs. stative ambiguity in quotative “be like.” The stative reading arises when stative be is involved; the eventive reading arises when the eventive, lexical be is involved.

Haddican et al. argue against this line of thought. Diachronically, we know that quotative “be like” has arisen rapidly in many varieties of English, and that in all of these varieties, the semantics are ambiguous. But if there are in fact two be verbs that underwent this quotative innovation, then we would need to posit two unrelated channels of change: one in which like+QUOTE became a possible complement of stative be and one in which like+QUOTE became a possible complement of eventive be.

This is actually a problematic claim, given that, presumably, stative and eventive be have different structures. The former undergoes its typical V to T movement in English; the latter, given its eventive semantics, would be expected to remain in the VP like any other lexical verb. These underlying structures would demand that we devise different processes by which quotative “be like” arose. However, given the rapidity with which it did in fact arise, it is more probable that it arose via a single process, and the inevitable conclusion is that there is a single, stative verb to be that underwent the process. This conclusion is also verified by the auxiliary-like behavior of be in quotatives involving adverbs and questions:

(17) Ross was totally like, “I don’t care!”

(18) Was Ross like, “I don’t care”?

Although the ambiguous stative vs. eventive reading still occurs here, (17) exhibits raising above AdvP, and (18) exhibits subject-aux inversion. In other words, be in these quotatives behaves like an ordinary copular auxiliary, not a lexical verb. We therefore should not posit a separate, eventive be verb. We need another way to explain the semantic ambiguity of these quotatives.

Haddican et al. explain this ambiguity with Davidsonian semantics. Briefly stated, they argue that there is a single stative be verb, both in these quotative constructions and in English more generally. However, be has a semantic LOCALE function that, in certain contexts, can localize the state in a short-term event, and this localization of an event can force an agentive role onto the subject, even when an adjective has been selected by be. So, in a sentence such as (19), be will have a denotation as in (20):

(19) Joey is being silly.

(20) [[be]] = λSλeλx. ∃s ϵS [e = LOCALE(s) & ARGUMENT(x,e)]

(20) takes a property of state S and localizes it into an event (a moment in which Joey was silly); in the right context, it is not a great leap to coerce this experiencer event into an agentive one. The application of these semantics to “(be) like” quotatives is straightforward:

In the state reading, be like is simply a stage level use of the copula, localised to the event in which the subject of be exhibited the relevant behaviour. The eventive reading arises when the event mapped to is an agentive one, where the most plausible event of an agent behaving in a quotative manner is the relevant speech act. (Haddican et al. 2012, p. 85)

In short, the ambiguity between stative and eventive “be like” arises from a semantic property that forces certain “states of being” to be processed as localized events whereby the experiencer of the event takes on an agentive role. In certain quotative contexts, the embedded quote is processed as an event, and the subject is understood as having caused that event, i.e., as actually saying something rather than just experiencing an attitude.

I agree that it would be better not to posit two homophonous verbs (stative be vs. be of activity) to account for the ambiguous stative vs. eventive denotations of quotative “be like.” Doing so requires two separate analyses and two separate channels of diffusion, which seems unlikely given the rapidity with which this quotative did in fact spread across many varieties of English. However, Haddican et al.’s application of Davidsonian semantics to explain the ambiguous readings runs into a problem in sentences like (21) below, as well as in the earlier example in (13):

(21) It was like, “Oh Mom, Can I film a movie in the house, it won’t be any problem at all.”

This is clearly an eventive predication of quotative “be like.” But instead of an agentive subject we have expletive it. Recall that Haddican et al.’s analysis relies on the notion that stative be has a LOCALE function that locates the state in a temporary moment or event. This localization can coerce an experiencer subject into the role of an agentive subject when the most likely reading (as above) suggests that the temporary event was an actual speech act. As Haddican et al. say themselves, “this event assigns an agentive role to the subject” (p. 85). However, by definition, the expletive in (21) receives no theta role and can therefore be neither the experiencer of a state nor the agent of an event. And yet (21) clearly denotes an eventive reading: the speaker actually spoke the words, or something like them.

The fact that “be like” quotatives can take an eventive (or even a stative) reading when an expletive surfaces in spec-TP suggests that Davidsonian semantics do not explain the ambiguous eventive vs. stative readings associated with these quotatives. (The fact that “be like” quotatives exhibit both experiencer subjects and expletive subjects also suggests that the quote CP is the only obligatory argument assigned by “be like.”)

The only alternative seems to be that there are in fact two homophonous be verbs, and quotative “be like” makes use of both. Maybe this isn’t such a big deal. If I’m right about the diachronic process by which quotative “be like” arose, then we can at least see a two-step process: quotative “be like” was solely a stative predicate in its early use and for most of its early history; only later did it begin to be used as an eventive predicate. And if there are in fact two be verbs, the eventive sounds exactly like the stative and is in fact much rarer than the stative, so I suppose one can see how these facts laid the groundwork for the eventual use of stative “be like” as an eventive predicate.

Distant Reading and the “Evolution” Metaphor

1

Are there any corpora that purposefully avoid “diachronicity”? There are corpora that possess no meta-data about publication dates and whose texts are therefore organized by some other scheme—for example, the IMDB movie review corpus, which is organized according to positive/negative polarity; its texts, as far as I know, are not arranged chronologically or coded for time in any way. And there are cases where time-related data are not available, easily or at all. But have any corpora been compiled with dates—the time element—purposefully elided? Is time ever left out of a corpus because that information might be considered “noise” to researchers?

Maybe in rare situations. But for most corpora whose texts span any length of time greater than a year, the texts are, if possible, arranged chronologically or somehow tagged with date information. In this universe, time flows in one direction, so assembling hundreds or thousands of texts with meta-data related to their dates of publication means the resulting corpus will possess an inherent diachronicity whether we want it to or not. We can re-arrange the corpus for machine-learning purposes, but the “time stamp” is always there, ready to be explored. Who wouldn’t want to explore it?

If we have a lot of texts—any data, really—that span a great length of time, and if we look at features in those data across the time span, what do we end up studying? In nearly all cases, we end up studying patterns of formal change and transformation across spans of time. The “evolution” metaphor suggests itself immediately. Be honest, now, you were thinking about it the minute you compiled the corpus.

One can, of course, use “evolution” as a general synonym for change. This is probably the case for Thomas Miller’s The Evolution of College English and for many other studies whose data extend only to a limited number of representative sources. However, when it comes to distant readings, the word becomes much more tempting. The trees of Moretti’s Graphs, Maps, Trees are explicitly evolutionary:

For Darwin, ‘divergence of character’ interacts throughout history with ‘natural selection and extinction’: as variations grow apart from each other, selection intervenes, allowing only a few to survive. In a seminar a few years ago, I addressed the analogous problem of literary survival, using as a test case the early stages of British detective fiction . . . (70-71)

The same book ends with an afterword by geneticist Alberto Piazza (who worked with Luigi Luca Cavalli-Sforza on The History and Geography of Human Genes). Piazza writes:

[Moretti's writings] struck me by their ambition to tell the ‘stories’ of literary structures, or the evolution over time and space of cultural traits considered not in their singularity, but their complexity. An evolution, in other words, ‘viewed from afar’, analogous at least in certain respects to that which I have taught and practiced in my study of genetics. (95)

Analogous at least in certain respects . . . For Moretti and Piazza, literary evolution is not just a synonym for change in literature. Biological evolution becomes a guiding metaphor (not perfect, by any means) for the processes of formal change analyzed by Moretti. Piazza continues:

The student of biological evolution is especially interested in the root of a [phylogenetic] tree (the time it originated). . . . The student of literary evolution, on the other hand, is interested not so much in the root of the tree (because it is situated in a known historical epoch) as in its trajectory, or metamorphoses. This is an interest much closer to the study of the evolution of a gene, the particular nature of whose mutations, and the filter operated by natural selection, one wants to understand . . . (112-113)

Obviously, for Piazza, Moretti’s study of changes to and migrations of literary form in time and space evokes the processes and mechanisms of biological evolution—there’s not a one-to-one correspondence, of course, and Piazza points this out at length, but the similarities are evocative enough that he, a population geneticist, felt confident publishing his thoughts on the subject.

In Distant Reading, Moretti has more recently acknowledged that the intense data collection and quantitative analysis that has marked work at Stanford’s Literary Lab must at some point heed “the need for a theoretical framework” (122). Regarding that framework, he writes:

The results of the [quantitative] exploration are finally beginning to settle, and the un-theoretical interlude is ending; in fact, a desire for a general theory of the new literary archive is slowly emerging in the world of digital humanities. It is on this new empirical terrain that the next encounter of evolutionary theory and historical materialism is likely to take place. (122)

In Macroanalysis, Matthew Jockers also acknowledges (and resists) the temptation to initiate an encounter between evolutionary theory and the quantitative, diachronic data compiled in his book:

. . . the presence of recurring themes and recurring habits of style inevitably leads us to ask the more difficult questions about influence and about whether these are links in a systematic chain or just arbitrary, coincidental anomalies in a disorganized and chaotic world of authorial creativity, intertextuality, and bidirectional dialogics . . .

“Evolution” leaps to mind as a possible explanation. Information and ideas do behave in ways that seem evolutionary. Nevertheless, I prefer to avoid the word evolution: books are not organisms; they do not breed. The metaphor for this process breaks down quickly, and so I do better to insert myself into the safer, though perhaps more complex, tradition of literary “influence” . . . (155)

And in the last chapter of Why Literary Periods Mattered, Ted Underwood does not mention evolution at all, but there is clearly an evolutionary connotation to the terms he uses to describe digital humanities’ influence on literary scholars’ conception of history:

. . . digital and quantitative methods are a valuable addition to literary study . . . because their ability to represent gradual, macroscopic change brings a healthy theoretical diversity to literary historicism . . .

. . . we need to let quantitative methods do what they do best: map broad patterns and trace gradients of change. (159, 170)

Underwood also discusses “trac[ing] processes of change” (160) and “causal continuity” (161). The entire thrust of Underwood’s argument, in fact, is that distant or quantitative readings of literature will force scholars to stop reading literary history as a series of discrete periods or sharp cultural “turns” and to view it instead as a process of gradual change in response to extra-literary forces: “Romanticism” didn’t just become “Naturalism” any more than Homo erectus one decade decided to become Homo sapiens.

Tracing processes of gradual, macroscopic change . . . if that doesn’t invoke evolutionary theory, I don’t know what does. Underwood doesn’t even need to use the word.

Moretti, Jockers, and Underwood are three big names in digital humanities who have recognized, either explicitly or implicitly, that distant reading puts us face to face with cultural transformation on a large, diachronic scale. Anyone working with DH methods has likely recognized the same thing. Like I said, be honest: you were already thinking about this before you learned to topic model or use the NLTK.

 

2

Human culture changes—its artifacts, its forms. This is not up for debate. Even if we think human history is a series of variations on a theme, the mutability of cultural form remains undeniable, even more undeniable than the mutability of biological form. Distant reading, done cautiously, gives us a macro-scale, quantitative view of that change, a view simply not possible to achieve at the scale of individual texts or artifacts. Given the fact of cultural transformation, then, and DH’s potential to visualize it, to quantify aspects of it, one of two positions must be taken.

1. The diachronic patterns we discover in our distant readings are, to use Jockers’ words, “just arbitrary, coincidental anomalies in a disorganized and chaotic world of authorial creativity, intertextuality, and bidirectional dialogics.” Theorizing the patterns is a fool’s errand.

2. The diachronic patterns we discover are not arbitrary or random. Theorizing the patterns is a worthwhile activity.

Either we believe that there are processes guiding cultural change (or, at least, that it’s worthwhile to discover whether or not there are such processes) or we assume a priori that no such processes exist. (A third position, I suppose, is to believe that such processes exist but we can never know them because they are too complex.) We can all decide differently. But those who adopt the first position should kindly leave the others to their work. In my view, certain criticisms of distant reading amount to an admonition that “What you’re trying to do just can’t be done.” We’ll see.

 

3

When we decide to theorize data from distant readings, what are we theorizing? Moretti, Jockers, and Underwood each provide a similar answer: we are theorizing changes to a cultural form over time and, in some instances, space. Certain questions present themselves immediately: Are the changes novel and divergent, or are they repeating and reticulating? Is the change continuous and gradual, or are there moments of punctuated equilibrium? How do we determine causation? Are purely internal mechanisms at work, or also external dynamics? A complex interplay of both internal mechanisms and external dynamics? How do we reduce data further or add layers of them to untangle the vectors of causation?

To me, all of this sounds purely evolutionary. Even talking about gradual vs. quick change is a discussion taken right out of Darwinian theory.

But we needn’t adopt the metaphor explicitly if we are troubled that it breaks down at certain points. Alex Reid writes:

Matthew Jockers remarks following his own digital-humanistic investigation, “Evolution is the word I am drawn to, and it is a word that I must ultimately eschew. Although my little corpus appears to behave in an evolutionary manner, surely it cannot be as flawlessly rule bound and elegant as evolution” (171). As he notes elsewhere, evolution is a limited metaphor for literary production because “books are not organisms; they do not breed.” He turns instead to the more familiar concept of “influence” . . . Certainly there is no reason to expect that books would “breed” in the same way biological organisms do (even though those organisms reproduce via a rich variety of means). [However], if literary production were imagined to be undertaken through a network of compositional and cognitive agents, then such productions would not be limited to the capacity of a human to be influenced. Jockers may be right that “evolution” is not the most felicitous term, primarily because of its connection to biological reproduction, but an evolutionary-type process, a process as “natural” as it is “cultural,” as “nonhuman” as it is “human,” may exist.

An “evolutionary-type” process of culture is what we’re after, one that is not necessarily reliant on human agency alone. Will it end up being “flawlessly rule bound and elegant as evolution”? First, I think Jockers seriously over-estimates the “flawless” nature of evolutionary theory and population genetics. If the theory of evolution is so flawless and elegant, and all the science settled, what do biologists and geneticists do all day? Here’s a recent statement from the NSF:

Understanding the tree of life has been a goal of evolutionary biologists since the time of Darwin. During the past decade, unprecedented gains in gathering and analyzing phylogenetic data have demonstrated increasingly complex genealogical patterns.

. . . . Our current knowledge of processes such as hybridization, endosymbiosis and lateral gene transfer makes clear that the evolutionary history of life on Earth cannot accurately be depicted as a single, typological, bifurcating tree.

Moretti, it turns out, needn’t worry so much about the fact that cultural evolution reticulates. And Jockers needn’t assume that biological evolution is elegantly settled stuff.

Secondly, as Reid argues, we needn’t hope to discover a system of influence and cultural change that can be reduced to equations. We probably won’t find any such thing. However, within all the textual data, we can optimistically hope to find regularities, patterns that can be used to make predictions about what might be found elsewhere, patterns that might connect without casuistic contrivance to theories from the sciences. Here’s an example, one I’ve used several times on this blog: Derek Mueller’s distant reading of the journal College Composition and Communication. Mueller used article citations as his object of analysis. When he counted and graphed a quarter century of citations in the journal, he discovered patterns that looked like this:

[Figure: the long tail of citation counts in Mueller’s study of CCC]

Actually, based on similar studies of academic citation patterns, we could have predicted that Mueller would discover this power law distribution. It turns out that academic citations—a purely cultural form, a textual artifact constructed through the practices of the academy—behave according to a statistical law that seems to affect all sorts of things, from earthquakes to word frequencies. This example makes a strong case against those who argue that cultural artifacts, constructed by human agents within their contextualized interactions, will not aggregate over time into scientifically recognizable patterns.  Granted, this example comes from mathematics, not evolutionary theory, but it makes the point nicely anyway: the creations of human culture are not necessarily free from non-human processes. Is it foolish to look for the effects of these processes through distant reading?
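One rough way to check whether a set of counts looks like a power law is to plot frequency against rank on log-log axes and see whether the points fall near a straight line. The sketch below fits that line by least squares. It is only an illustration: the function name and the toy counts are my own assumptions, not Mueller’s data or method, and a log-log slope is a quick diagnostic rather than a rigorous statistical test.

```python
import math

def rank_frequency_slope(counts):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical counts that follow count = 1000 / rank exactly (Zipf-like).
toy_counts = [1000 / rank for rank in range(1, 201)]
slope = rank_frequency_slope(toy_counts)
print(f"log-log slope: {slope:.3f}")
```

A slope close to a constant negative value across the whole range is what a power law predicts; real citation counts will be noisier, especially in the tail where many sources tie at a count of one.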

 

4

“Evolution,” “influence,” “gradualism”—whatever we call it in the digital humanities, those of us adopting it on the literary and rhetorical end have a huge advantage over those working in history: we have a well-defined, observable element, an analogue of DNA, to which we can always reduce our objects of study: words. If evolution is going to be a guiding metaphor, we need this observable element because it is through observations of its metamorphoses (in usage, frequency, etc.) that we begin to figure out the mechanisms and dynamics that actually cause or influence those metamorphoses. If we had no well-defined segment to observe and quantify, the evolutionary metaphor could be thrown right out.

To demonstrate its importance, allow me a rhetorical demonstration. First, I’ll write out Piazza’s description of biological evolution found in his afterword to Graphs, Maps, Trees. Then, I’ll reproduce the passage, substituting lexical and rhetorical terms for “genes” but leaving everything else more or less the same. Let’s see how it turns out:

Recognizing the role biological variability plays in the reconstruction of the memory of our (biological) past requires ways to visualize and elaborate data at our disposal on a geographical basis. To this end, let us consider a gene (a segment of DNA possessed of a specific, ascertainable biological function); and for each gene let us analyze its identifiable variants, or alleles. The percentage of individuals who carry a given allele may vary (very widely) from one geographical locality to another. If we can verify the presence or absence of that allele in a sufficient number of individuals living in a circumscribed and uniform geographical area, we can draw maps whose isolines will join all the points with the same proportion of alleles.

The geographical distribution of such genetic frequencies can yield indications and instruments of measurement of the greatest interest for the study of the evolutionary mechanisms that generate genetic differences between human populations. But their interpretation involves quite complex problems. When two human populations are genetically similar, the resemblance may be the result of a common historical origin, but it can also be due to their settlement in similar physical (for example, climatic) environments. Nor should we forget that styles of life and cultural attitudes of an analogous nature (for example, dietary regimes) can favour the increase or decrease to the point of extinction of certain genes.

Why do genes (and hence their frequencies) vary over time and space? They do so because the DNA sequences of which they are composed can change by accident. Such change, or mutations, occurs very rarely, and when it happens, it persists equally rarely in a given population in the long run . . . From an evolutionary point of view, the mechanism of mutation is very important because it introduces innovations . . .

. . . The evolutionary mechanism capable of changing the genetic structure of a population most swiftly is natural selection, which favours the genetic types best adapted for survival to sexual maturity, or with a higher fertility. Natural selection, whose action is continuous over time, having to eliminate mutations that are injurious in a given habitat, is the mechanism that adapts a population to the environment that surrounds it. (100-101)

Now for the “distant reading” version:

Recognizing the role lexical variability plays in the reconstruction of the memory of our (literary and rhetorical) past requires ways to visualize and elaborate data at our disposal on the basis of cultural space (which often correlates with geography). To this end, let us consider a word (a segment of phonemes and morphemes possessed of a specific, ascertainable grammatical or semantic function); and for each word let us analyze its stylistic variants, or synonyms. The percentage of texts that carry a given stylistic variant may vary from one cultural space to another, or from one genre to the other. If we can verify the presence or absence of that variant in a sufficient number of texts produced in a circumscribed and uniform cultural space we can draw maps whose isolines will join all the points with the same proportion of stylistic variants.

The distribution of such lexical frequencies can yield indications and instruments of measurement of the greatest interest for the study of the evolutionary mechanisms that generate lexical differences between “generic populations.” But their interpretation involves quite complex problems. When two rhetorical forms or genres are lexically similar, the resemblance may be the result of a common historical origin, but it can also be due to their development in similar geographic or political environments. Nor should we forget that styles of life and cultural attitudes of an analogous nature (for example, religious dictates) can favour the increase or decrease to the point of extinction of certain lexical items or clusters of lexical items.

Why do words (and hence their frequencies and “clusterings”) vary over time and space? They do so because of stylistic innovations. Such innovation occurs very rarely, and when it happens, it persists equally rarely in a given generic population in the long run . . . From an evolutionary point of view, the mechanism of innovation is very important because it introduces new rhetorical forms . . .

. . . The evolutionary mechanism capable of changing the lexical structure of a rhetorical form or genre most swiftly is cultural selection, which favours the forms best adapted for survival to publication and circulation, or with a higher degree of influence (meaning a higher likelihood of being reproduced by others without too many changes). Cultural selection, whose action is continuous over time, having to eliminate rhetorical innovations or “mutations” that are injurious in a given cultural habitat, is the mechanism that adapts a rhetorical form to the environment that surrounds it.

Obviously, it’s not perfect. I leave it to the reader to decide its persuasive potential.

I think the biggest problem is in the handling of mutations. In biological evolution, genes mutate via chance variations during replication of their segments; these mutations can introduce innovations in an organism’s form or function. In literary evolution, however, no sharp distinction exists between a lower-scale “mutation” and the innovation it introduces. The innovation is the formal mutation. This issue arises because, in literary evolution, as in linguistic evolution, the genotype/phenotype distinction is not as obvious or strictly scaled as it is in evolutionary theory. Words are more phenotype than genotype, unless we want to get lost in an overly complex evocation of morphology and phonology.

The metaphor always breaks down somewhere, but where it works, it is, I think, highly suggestive: the idea is that we track rhetorical forms—constellations of words and their stylistic variants—across time and space, in order to see where the forms replicate and where they disappear. Attach meta-data to the texts that constitute those forms, and we will have what it takes to begin making data-driven arguments about how cultural ecology affects or does not affect cultural form.
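To make the tracking idea concrete, here is a minimal sketch of the “allele frequency” analogue: for each group of texts (a decade, a genre, a region), compute the share of texts that contain each stylistic variant. Everything in it is hypothetical, including the function name, the group labels, and the toy sentences; a real study would need proper tokenization and a defensible list of variants.

```python
from collections import defaultdict

def variant_share_by_group(texts, variants):
    """Share of texts in each group that contain each variant at least once."""
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for group, text in texts:
        totals[group] += 1
        tokens = set(text.lower().split())  # crude whitespace tokenization
        for variant in variants:
            if variant in tokens:
                hits[group][variant] += 1
    return {
        group: {v: hits[group][v] / totals[group] for v in variants}
        for group in totals
    }

# Hypothetical (group, text) pairs; the group label is the meta-data.
texts = [
    ("1850s", "she began to weep"),
    ("1850s", "they weep and pray"),
    ("1950s", "he started to cry"),
    ("1950s", "do not cry or weep"),
]
shares = variant_share_by_group(texts, ["weep", "cry"])
for group, row in shares.items():
    print(group, row)
```

Swapping the group label from decade to genre or place of publication gives the “cultural space” version of the same table.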

It’s an interesting framework in which distant reading might go forward, even if explicit uses of the word “evolution” are abandoned.

Graphing Citations and Making Sense of Disciplinary Divisions

A Pareto distribution: that is the troubling result of Derek Mueller’s distant reading of citations in College Composition and Communication. He found a “long tail” of citations, with a handful of names cited many times but vastly more names cited only once. Out of 8,035 unique citations, 5,761 were cited once and 986 were cited twice. In other words, 84% of citations in CCC occurred only once or twice in a 25-year period.
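The 84% figure follows directly from the three counts reported above. A short check, using only those numbers (the variable names are mine):

```python
total_unique = 8035   # unique citations in Mueller's CCC data
cited_once = 5761
cited_twice = 986

share_once_or_twice = (cited_once + cited_twice) / total_unique
share_once = cited_once / total_unique

print(f"cited once or twice: {share_once_or_twice:.1%}")
print(f"cited exactly once:  {share_once:.1%}")
```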

Troubling, but unsurprising. Physical and social scientists have long known that power law distributions occur across a wide variety of phenomena, including academic citations (Gupta et al. 2005). That a long tail occurs in a rhet/comp journal simply puts our discipline in the same position as everyone else: a small group of scholarly work has gained a “cumulative advantage” or “preferential attachment” and thus become the core set of classic texts recognized by the field, while most other scholars fail to produce texts that cross the tipping point toward their own preferential attachment. It is usually assumed that this core group of scholars is what unites a discipline. To some extent, the assumption is probably true. However, Mueller is right to ask how far a citation trail can lead away from that core group of scholars before we start questioning just how unified a discipline really is.

When graphing citation counts, it’s not problematic to discover a steep drop between the most cited scholar and the tenth most cited scholar; nor is it problematic that most sources are cited infrequently. The problem is not the long tail. The problem, in CCC’s case, is that the long tail very rapidly approaches a value equal to one. This indicates that any given source in CCC is valuable to the scholar citing it but effectively worthless to everybody else who publishes in the journal. If most citations occurred three, four, five times, even that would suggest a certain unity of purpose—what one scholar has found valuable, several others have found valuable as well, in various issues and various contexts. But when the long tail is mostly comprised of sources cited once and never again? That requires a more robust explanation than a nod toward a core group of scholars can provide. Mueller thus raises the right question:

Although we do not at this time have data from all of the major journals to investigate this fully, the changing shape of the graphed distribution reiterates more emphatically a question only hinted at . . . but one nevertheless crucial to the idea of a common disciplinary domain: How flat can the citation distribution become before it is no longer plausible to speak of a discipline?

To answer Mueller’s call for more data, I have compiled article abstracts from CCC and two other major journals in the field—Rhetoric Society Quarterly and Rhetoric Review. I intend this post to serve as a tentative response to the question posed by Mueller at the end of this quote.  The CCC abstracts run from February 2000 (51.3) to September 2011 (63.1), a total of 261 abstracts. The RSQ abstracts run from Winter 2000 (30.1) to Fall 2011 (41.5), a total of 220 abstracts. The RR abstracts run from 2002 (21.3) to 2011 (30.4), a total of 154 abstracts.

Only abstracts, not full articles. However, because only the most important citations appear in abstracts, I think tallying abstract citations offers the best chance to shorten the long tail and partially alleviate the implications of Mueller’s work. It is not a slight to the humanities to point out that articles demand more citations than their arguments actually require: many article citations can be removed without affecting anything vital to an argument. Citations in abstracts, on the other hand, are in most cases central to the argument or study undertaken. If we count only the most important sources in each journal—the ones that surface in abstracts—is the long tail of citation distributions less pronounced? We can expect to discover a long tail. That’s a mathematical inevitability. But if a journal—to say nothing of an entire discipline—is somehow unified, citations in abstracts should have a slightly less extreme power law distribution than citations in the articles themselves. Abstract citations are the “cream of the crop,” those vital enough to make it into the space constraints of the abstract genre: we hope to find fewer citations and therefore a graph that does not drop so precipitously toward x=1.

Methods: Each corpus was loaded into the Natural Language Toolkit and tagged for part of speech. Then I compiled the proper nouns. The proper noun list included personal names but was larger than the set of names alone. I extracted these names, in both noun forms (e.g. ‘Burke’ or ‘Burke’s’) and adjective forms (e.g. ‘Burkean’), and tracked them across the abstracts. I compiled each unique citation as well as the number of abstracts in which each was cited.
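The counting step can be sketched with the standard library alone. The example below assumes the list of names has already been pulled out of the tagged proper nouns (the tagging step is not reproduced here), and it tallies how many abstracts mention each name in a noun, possessive, or “-an” adjective form. The function name, the regular expression, and the sample abstracts are illustrative assumptions, not the script actually used for this post.

```python
import re
from collections import Counter

def count_citing_abstracts(abstracts, names):
    """For each name, count the abstracts that mention it at least once."""
    tallies = Counter()
    for text in abstracts:
        for name in names:
            # Match 'Burke', "Burke's", or 'Burkean' as whole words.
            pattern = rf"\b{re.escape(name)}(?:['’]s|an)?\b"
            if re.search(pattern, text):
                tallies[name] += 1
    return tallies

# Hypothetical abstracts, for illustration only.
abstracts = [
    "This essay reads Burke's pentad against Aristotle.",
    "A Burkean account of identification in first-year writing.",
    "Aristotle and Plato on persuasion.",
]
names = ["Burke", "Aristotle", "Plato"]
tallies = count_citing_abstracts(abstracts, names)
print(tallies.most_common())
```

The suffix pattern is deliberately simple, so irregular adjectives such as “Aristotelian” or “Platonic” would need to be added by hand.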

Finding citation names

Here are spreadsheets with the unique citations and their citation counts in each abstracts corpus: College Composition and Communication. Rhetoric Society Quarterly. Rhetoric Review.

There are 79 unique citations in the CCC abstracts; 159 unique citations in the RSQ abstracts; and 121 unique citations in the RR abstracts. Only six citations occur in both the RSQ and CCC abstracts corpora: Mina Shaughnessy, Kenneth Burke, John Dewey, Donald Davidson, Peter Elbow, and Mikhail Bakhtin. When factoring in RR, only Kenneth Burke, John Dewey, and Peter Elbow are shared across all three corpora. RR and RSQ share quite a few sources, almost all of which are historical figures: Plato, Aristotle, Cicero, Isocrates, and the like. Kenneth Burke is the most frequently cited source in each abstracts corpus: he is cited in 5 separate abstracts in CCC, 17 in RSQ, and 14 in RR. Maybe “rhetoric and composition” should be changed to “Burkean studies.” No surprise: the man has his own journal.
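Putting these counts on a per-abstract basis makes the three corpora easier to compare. The sketch below uses only the abstract totals and unique-citation totals reported in this post; the dictionary layout is just one convenient way to hold them.

```python
corpora = {
    "CCC": {"abstracts": 261, "unique_citations": 79},
    "RSQ": {"abstracts": 220, "unique_citations": 159},
    "RR": {"abstracts": 154, "unique_citations": 121},
}

# Unique citations per abstract for each journal.
ratios = {
    name: c["unique_citations"] / c["abstracts"]
    for name, c in corpora.items()
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} unique citations per abstract")
```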

Based on the raw count of unique citations in each journal (on average, fewer than one per abstract), I think my original suggestion is at least partially correct: counting citations in abstracts controls for the rhetorical demand of articles to cite more sources than necessary. Abstract citations are the stars of the show. Nevertheless, after graphing the citations, Pareto distributions did emerge:

CCC abstract citations

RSQ abstract citations

RR abstract citations

Citations in the CCC abstracts occurred in a slightly more even distribution than citations in CCC articles (cf. Mueller). But then, there aren’t many citations in this corpus, relative to the RSQ and RR corpora. Among the citations that do appear, none occur in numbers much greater than those occurring in only one abstract. The most frequently occurring citation, Burke, appears in five abstracts. Does this graph confirm Mueller’s conclusion about a dappled CCC? To some extent, yes. There’s still a long tail, after all . . .

RSQ citations even more obviously display the Pareto distribution discussed in Mueller’s article. The citations occurring most frequently—Burke and Plato—surface in 17 and 14 abstracts, respectively.

The distribution in RR is also uneven, and the drop of the long tail is even more precipitous than the one in RSQ. Burke is cited in 14 abstracts and the next most frequent source, Aristotle, is cited in 5 abstracts.

These graphs indicate that even in article abstracts—where only the most vital sources are invoked—a small canon of core scholars emerges beside an otherwise long, flat, dappled distribution of citations. More divergence and specialization, then—not just in CCC but in RR and RSQ.
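One rough way to put numbers on a long tail is to ask what share of all abstract-level citations goes to the top source and what fraction of sources appear only once. The sketch below is my own illustration, not the method behind the graphs: the function name is hypothetical, and apart from Burke (14) and Aristotle (5), which are the RR figures quoted above, every count is a made-up placeholder.

```python
def tail_summary(abstract_counts):
    """Summarize how unevenly citations are spread across sources."""
    ranked = sorted(abstract_counts.values(), reverse=True)
    total = sum(ranked)
    singles = sum(1 for n in ranked if n == 1)
    return {
        # Share of all citations claimed by the single most cited source.
        "top_share": ranked[0] / total,
        # Fraction of sources that appear in exactly one abstract.
        "singleton_fraction": singles / len(ranked),
    }

# Burke and Aristotle use the RR numbers from the post; the remaining
# entries are hypothetical placeholders standing in for the long tail.
rr_like = {"Burke": 14, "Aristotle": 5, "source_c": 2,
           "source_d": 1, "source_e": 1, "source_f": 1, "source_g": 1}

summary = tail_summary(rr_like)
print(summary)
```

A steep head and a high singleton fraction together are what the graphs show visually; the real values would have to come from the spreadsheets.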

I think there’s more to it than disciplinary divergence, however. These long tails can undoubtedly be explained mathematically—the conclusion: they’re inevitable—but in this particular case they might also be explainable in prosaic terms. And I believe this prosaic explanation makes sense of the long tail in a way that salvages a shred of disciplinary unity within each journal:

In RR and RSQ, for example, an obvious citation pattern emerges. Five of the ten most cited sources in the RSQ abstracts are historical figures: Plato, Aristotle, Quintilian, Blair, and Cicero. In RR, the exact same thing: Aristotle, Cicero, Isocrates, Plato, Quintilian. But glancing through the long tail in both citation counts, historical figures continue to emerge, mostly from the Greco-Roman world, but from beyond it, as well. In the CCC long tail, on the other hand, historical figures occur less frequently, and only two of them predate the 19th century.

Raw numbers for RR and RSQ: 27 (or 22%) of the RR citations are sources from the 17th century or earlier. 26 (or 16%) of RSQ citations are from the same period. Most are Greco-Roman sources, but Confucius, Montaigne, and Averroes are also scattered throughout the long tail. We might conclude, then, that a decently sized community of historians of rhetoric communicate in RSQ and RR (when they’re not communicating in Rhetorica, presumably). Their communication adds to the long tail, but does it signify disciplinary divergence and specialization?

Rather, here is one disciplinary community—historians of rhetoric—mapped out in unity. Its borders extend slightly into CCC but its principal territory lies in RSQ and RR. An obvious outcome, if you’re involved in the field. However, it also helps us make partial sense of that worrying Pareto distribution: not all of the singular citations that constitute the long tail are as disconnected as the graphs lead us to believe. In RSQ and RR, many singular citations could be grouped together: Plutarch, Laertius, Strabo, Aristophanes—these are, at least, not as indicative of a dappled disciplinary identity as, say, St. Paul and Steven Mailloux.

The same point can be made with pedagogy in the CCC abstracts. It is not surprising, of course, that CCC is home to scholars citing pedagogically inclined sources; however, for a second time, this obvious point helps make sense of the Pareto distribution of citations presented here and in Mueller’s article: Charles Pierce, Mina Shaughnessy, Melvin Tolson, Les Perelman—each appears only once, scattered throughout the long tail of abstract citations. But each is invoked for his or her direct relevance to writing pedagogy. Viewed in this way, the flat distribution of citations seems a little less dappled.

Robo-Graders

I was wrong about the mechanization of student writing. I had assumed another year or two would pass before MOOCs began utilizing essay grading software. Turns out it’s happening now. EdX, founded by Harvard and MIT and probably the most prestigious online course program, has announced that it will implement its own assessment software to grade student writing.

Marc Bousquet’s essay successfully mines the reasons why humanities profs are anxious about algorithmic scoring. The reality is, across many disciplines, the writing we ask our students to do is “already mechanized.” The five-paragraph essay, the research paper, the literature review . . . these are all written genres with well-defined parameters and expectations. And if you have parameters and expectations for a text, it’s quite easy to write algorithms to check whether the parameters were followed and the expectations met.

The only way to ensure that a written product cannot be machine graded is to ensure that it has ill-defined parameters and vague or subjective expectations. For example, the expectations for fiction and poetry are highly subjective—dependent, ultimately, on individual authors and the myriad reasons why people enjoy those authors. It might be possible to machine grade a Stephen King novel on its Stephen-King-ness (based on the expected style and form of a Stephen King novel), but otherwise, it will remain forever impossible to quantitatively ‘score’ novels qua novels or poems qua poems, and there’s no market for doing that anyway. Publishers will never replace their front-line readers and agents with robots who can differentiate good fiction from bad fiction.

However, when we talk about student writing in an academic context, we’re not talking about fiction or poetry. We’re talking about texts that are highly formulaic and designed to follow certain patterns, templates, and standardized rhetorical moves. This description might sound like fingernails on a chalkboard to some, but look, in the academic world, written standards and expectations are necessary to optimize for the clearest possible communication of ideas. The purpose of lower division writing requirements is to enculturate students into the various modes of written communication they are expected to follow as psychologists, historians, literary critics, or whatever.

Each discourse community, each discipline, has its own way of writing, but the differences aren’t anywhere near incommensurable (the major differences exist across the supra-disciplines: hard sciences, soft sciences, social sciences, humanities). No matter the discipline, however, there is a standard way that members of that discipline are expected to write and communicate—in other words, texts in academia will always need to conform to well-defined parameters and expectations. Don’t believe it? One of the most popular handbooks for student writers, They Say/I Say, is a hundred pages of templates. And they work.

So what’s my point? My point is that it’s very possible to machine-grade academic writing in a fair and useful way because academic writing by definition will have surface markers that can be checked with algorithms. Clearly, the one-size-fits-all software programs, like the ones ETS uses, are problematic and too general. Well, all that means is that any day now, a company will start offering essay-grading software tailor-made for your own university’s writing program, or psychology department, or history department, or Writing Across the Curriculum program, or whatever—software designed to score the kind of writing expected in those programs. Never bet against technology and free enterprise.

And that’s another major point—there’s not a market for robot readers at publishing firms, but there certainly is a market for software that can grade student writing. And wherever there’s a need or a want or some other exigence, technology will fill the void. The exigence in academia is that there are more students than ever and less money to pay for full-time faculty to teach these students. Of course, this state of affairs isn’t an exigence for the Ivy League, major state flagships, or other elite institutions—these campuses are not designed for the masses. The undergraduate population at Yale hasn’t changed since 1978. A few years ago, a generous alumnus announced his plans to fund an increase in MIT’s undergraduate body—by a whopping 250 students. Such institutions will continue to be what they are: boutique experiences for the future elite. I imagine that Human-Graded Writing will continue to be a mainstay at these boutique campuses, kind of like Grown Local stickers are a mainstay of Whole Foods.

For the vast majority of undergraduates—those at smaller state colleges, online universities, or those trying to graduate in 4 years by taking courses through EdX—machine-grading will be an inevitable reality. Why? It fulfills both exigencies I mentioned above. It allows colleges to cut costs while simultaneously making it easier to get more students in and out of the door. Instead of employing ten adjuncts or teaching associates to grade papers, you just need a single tenure-track professor who posts lectures and uploads essays with a few clicks.

So, the question for teachers of writing (the question for any professors who value writing in their courses) is not “How can we stop machine-grading from infiltrating the university?” It’s here. It’s available. Rather, the question should be, “How can we best use it?”

Off the top of my head . . .

Grammar, mechanics, and formatting. Unless we’re teaching ESL writing or remedial English, these aspects tend to get downplayed. I know I rarely talk about participial clauses or the accusative case. I overlook errors all the time, focusing instead on higher-order concerns—say, whether or not a secondary source was really put to use or just quoted to fill a requirement. However, I don’t think it’s a good thing that we overlook these errors. We do so because there are only so many minutes in a class or a meeting. With essay-grading software, we can bring sentence-level issues to students’ attention without taking time away from higher-order concerns.

Quicker response times for ESL students, and, perhaps, more detailed responses than a single instructor could provide, especially if she’s teaching half-a-dozen courses. Anyone who has tried to learn a second language knows that waiting a week or two for teacher feedback on your writing is a drag. In my German courses, I always wished I could get quick feedback on a certain turn of phrase or sentence construction, lest something wrong or awkward get imprinted in my developing grammar.

So, I guess my final point is that there are valid uses for essay-grading software, even for those of us teaching at institutions that won’t ever demand its use en masse. Rather than condemn it wholesale, we–and by we, I mean every college, program, professor, and lecturer–should figure out how to adapt to it and use it to our advantage.

Technology and the empirical study of writing

A materialist theory of literary form will ultimately have to concern itself with the organic processes of reading and composition, but the way to do this is through empirical study of readers and writers, not more interpretation of texts, or armchair ruminations. –Cosma Shalizi in a response to Franco Moretti’s Graphs, Maps, Trees (128)

Janet Emig initiated the writing process movement when she published The Composing Processes of Twelfth Graders, an attempt to study writing in an empirical way (lower case e; no Lockean baggage implied) by closely observing and polling several high school seniors as they wrote essays. Today, the shortcomings of her study are obvious—the sample size was small, and she had no way to track granular textual changes as they were made in real time. However, despite its limitations, Emig’s work introduced an important assumption to the field of writing studies:

Writing is a natural, organic phenomenon that can be studied empirically.

Unfortunately for Emig and the process movement, writing studies was and is situated in an academic context that requires a sharp pedagogical focus, and the empirical study of writing has little to no educational value. Studying writers can tell us how people write, but it doesn’t necessarily tell us how to teach writing, academic or creative or otherwise.

Intimately tied to the pedagogical critique of the process movement was the political critique. In early studies of writing processes, certain contextual elements (read: race, gender, class) were ignored. Emig, for example, did not deeply address the racial differences of her subjects. Critics claimed that the study of writing processes would not pay enough attention to relevant cultural factors that affect how, where, and why people write. This critique was weak, however, because all empirical pursuits must by design bracket out certain contextual elements in their early stages. As the pursuit progresses and gathers knowledge, the causes and effects (if any) of various contextual factors can be coded and controlled for. Race, class, and gender are such factors—important ones at that—but we needn’t stop there (cross-linguistic differences would be first and foremost on my mind). Dozens of material factors must be taken into consideration when studying writing. Had the process movement not been abandoned, researchers would have gotten around to controlling for all of them.

Then there was the philosophical critique. The goal of studying writing is to build evidence-based theories about this unique human practice from a variety of angles—stylistic, material, cognitive, neuronal, linguistic—and eventually to see how these levels interact. (E.g., What areas of the brain are operational at various stages of writing and re-writing? Are small stylistic changes or large organizational changes more often influential in shaping a text? How does textual cohesion emerge? What roles do vision and memory play in the way writers work with their texts on word processors? Do writers across languages and writing systems have completely different processes?) However, like many humanities disciplines, writing studies has been influenced by postmodernism and is thus averse to data-driven, quantitative, empirical methods, and not interested in questions—like the ones above—that require these methods. Gary Olson typifies the philosophical critique when he writes that the process movement attempts to “systematize something [writing] that simply is not susceptible to systematization.”

Of course, no evidence is provided for this claim—but then, none is needed. It isn’t a claim at all. It is an a priori assumption, a statement of faith, designed to obviate any empirical work on the subject.

Ironically, the most valid critique of the process movement in writing studies was one that no one ever made: the technological critique. In the 1980s, when the process movement was jostling for academic legitimacy, researchers simply did not have access to the technology that could enable a more robust inquiry into the material, organic processes of writing.

Today, we do have access to that technology. What’s more, the postmodern zeitgeist has waned in its influence, and the political critique was never quite valid. The time seems right for a return, not necessarily to process theory, but to the assumptions it made about data-driven, quantitative, empirical studies of the way humans compose.

Scholars like Chris Anson, Richard Haswell, and Chuck Bazerman are leading the way. Haswell’s call for “RAD” research—replicable, aggregable, data-supported—is essentially a call for empiricism in writing studies. And Anson’s recent report on the use of eye-tracking devices to study writers at work demonstrates how new technologies will enable and benefit this empirical endeavor. Anson’s line of research could lead to major insights into the ways writers access their ‘textual memory’ in order to manage the many semantic strands that comprise any written text. Indeed, this is a perfect example of how technology and RAD methods can test old ideas in writing studies, confirm or complicate those ideas, and fill them in with data-driven details. In 1999, for example, Christina Haas wrote about the way writers manage their texts in Writing Technology: Studies in the Materiality of Literacy:

Clearly, writers interact constantly, closely, and in complex ways with their own texts. Through these interactions, they develop some understanding—some representation—of the text they have created or are creating . . . As the text gets longer and more ideas are introduced and developed, it becomes more difficult to hold an adequate representation in memory of that text, which is out of sight. (117, 121, qtd in Brooke’s Lingua Fracta)

In 2012, enter the eye-tracking software, which can show us where writers look, physically, to develop representations of their texts as they are constructed: What kinds of words or phrases do writers reference most often, as ‘anchors’ for their intellectual wanderings? What are the outside limits of textual vision? Where do writers focus their vision at different stages in the writing process—choosing words, writing sentences, organizing paragraphs, et cetera? Do accomplished writers use their eyes differently than novice writers? Do high-IQ individuals use their eyes less or more while writing; are their visual memories more robust, requiring less visual tracking to make sense of their texts’ cohesion?

Without eye-tracking devices and empirical methodologies, researchers could never hope to answer these and other questions. They would never even think to formulate them.

Outside the field of writing studies, researchers are already using technology to capture and study authorial processes. An IBM study used an application called history flow to study contributions in Wikipedia articles and how numerous contributions endure, change, or are phased out entirely over time. Ben Fry built an amazing visualization of the multiple changes Darwin made to On the Origin of Species across six editions. And on a lesser scale, Timothy Weninger created a time-lapse video that shows the writing of a research paper in various stages (I’m in the process of figuring out how to do something similar, using track changes in Word).

The interest in organic authorial processes extends beyond writing studies, so it boggles my mind that writing studies scholars aren’t at the forefront of this research, which, I grant, is in its early stages. Luckily, things are looking up for RAD, empirical research in writing studies. Now that we can start grounding our theories of composition in real data, it’s only a matter of time before we start gaining empirical insight into this strange, relatively recent human behavior that we call ‘writing’.

Text Network and Corpus Analysis of the Unabomber Manifesto

Introduction

The Unabomber Manifesto—Industrial Society and its Future—was sent to major newspapers in 1995, with an accompanying promise from its author, Ted Kaczynski, to stop exploding things if someone printed the 35,000-word text in full. The New York Times and the Washington Post obliged in September of that year. The manifesto became a major clue in the hunt for the Unabomber, but only a few forensic linguists concluded that Kaczynski, a suspect at the time, had written it. The majority failed to see a connection between the manifesto and other writings by Kaczynski (these are the same people, I can only guess, who remain skeptical about who wrote Romeo and Juliet). In the end, none of it mattered anyway. Evidence found in Kaczynski’s cabin was far more damning than forensic linguistic analyses of the manifesto.

The Manifesto

You expect the manifesto of a domestic terrorist to be insane. Kaczynski is not your average domestic terrorist. A former Berkeley professor of mathematics with a Michigan PhD, Kaczynski could have feasibly published the essay with a legitimate press or magazine and gained a wide academic audience had he not retreated into the woods and his own head. The manifesto is a real argument that, minus its calls for violence, could have been inserted into a legitimate discourse, albeit one that would have resulted in criticism coming Ted’s way.

Ostensibly, the manifesto is a strong critique of contemporary techno-capitalist society. However, if you took a knife to the text, divided it into little passages, you would discover that half of them bend far leftward and could be read aloud without protest in Harvard Yard, while the other half bend far rightward and could only be read aloud without protest at Hillsdale College.

So, there are passages such as this one, which would send heads nodding in every humanities department in America:

The Industrial Revolution and its consequences have been a disaster for the human race. They have greatly increased the life-expectancy of those of us who live in “advanced” countries, but they have destabilized society, have made life unfulfilling, have subjected human beings to indignities, have led to widespread psychological suffering (in the Third World to physical suffering as well) and have inflicted severe damage on the natural world.

Then comes this curveball:

One of the most widespread manifestations of the craziness of our world is leftism, so a discussion of the psychology of leftism can serve as an introduction to the discussion of the problems of modern society in general.

Like many on the left, Kaczynski blames technology and The System for the sad state of the earth and its inhabitants, yet he suggests that the contemporary left (the “oversocialized” left, as Ted puts it) is in fact The System’s most malformed, though logical outgrowth.

At first, I couldn’t recognize the motive behind the manifesto. Its politics seemed too conflicted. Then I noticed a brief mention in Kaczynski’s Wikipedia article that ties him to the anarcho-primitivist tradition, and suddenly the text became more philosophically cohesive.

The Manifesto’s Motive

There are two types of anarcho-primitivists: the Rousseau types and the Hobbes types (my own ad hoc terms). The former are human-centric and collectivist. They believe that dismantling techno-capitalist society will usher in an era of equality and harmony between men and women of all races. The latter are earth-centric and individualistic. They believe that dismantling techno-capitalist society will put a halt to overpopulation and environmental degradation, and allow individuals to live more spiritually and physically fulfilled lives.

The goals aren’t mutually exclusive, but nor are they necessarily aligned. (When it comes to immigration, they are outright opposed.) The Hobbesian primitivists tend to believe that nature, for all its beauty and desirability, isn’t a progressive utopia. Who are these Hobbesians? They are the Monkey Wrench Gang radicals, the Edward Abbeys and Doug Peacocks of the environmental movement, the Garrett Hardins of ecology, the survivalists, the Timothy Treadwells, the (typically) men who love nature more than humanity but harbor no romanticism about either. Kaczynski would have gotten along well in the Monkey Wrench Gang, who held no love for humans or community or society in the aggregate because, to them, human communities are precisely the problem.

Let’s put these categorizations aside for now and look to the text of the manifesto itself. A text network analysis and an analysis with the Natural Language Toolkit (NLTK) can provide us with grounded data about Kaczynski’s motives as they appear in his manifesto. The motives of all authors—or at least their traces—are always left behind in the lexical choices of their texts. Deliberate, written language is like a rhetorical fingerprint.

Text Network Analysis

As I’ve discussed in other posts, a text network analysis proceeds in the following way: a text is copied into a .txt file; it is imported into some analytic tool (I use AutoMap) in order to remove stop words and to lightly stem the text; then, using the same tool, the text—which has now been expunged of all but significant content words—is run through an algorithm that treats the content words like a network and creates a co-reference list in .csv format. What words are connected to what other words, and how often? (In this analysis, I used a two-word gap and a five-word gap.) The .csv file is then opened in a network analysis tool (I use Gephi) in order to visualize these semantic connections. Each word is visualized as a node in the network, and words that appear next to each other—again, within a certain word gap—appear as edges.
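To make the pipeline concrete, here is a minimal Python sketch of the same idea: strip stop words, link content words that fall within a word gap of each other, and write the result as a CSV edge list. It is an illustration under assumptions, not a reproduction of what the analytic tool does internally: the tiny stop list, the function names, the reading of "gap" as a forward window over content words, and the Source/Target/Weight header (a common edge-list layout for network tools) are all my own choices, and stemming is left out.

```python
import csv
import io
from collections import Counter

# Tiny stand-in stop list (an assumption; real tools ship far longer ones).
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "its",
              "has", "have", "been", "for"}

def cooccurrence_edges(text, gap=2):
    """Count pairs of content words that fall within `gap` positions of each other."""
    tokens = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    content = [w for w in tokens if w and w not in STOP_WORDS]
    edges = Counter()
    for i, source in enumerate(content):
        # Look ahead at the next `gap` content words only.
        for target in content[i + 1 : i + 1 + gap]:
            if source != target:
                # Sort the pair so the edge is undirected.
                edges[tuple(sorted((source, target)))] += 1
    return edges

def edges_to_csv(edges):
    """Serialize the weighted edge list as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()

sample = ("The Industrial Revolution and its consequences have been "
          "a disaster for the human race.")
edges = cooccurrence_edges(sample, gap=2)
print(edges_to_csv(edges))
```

Widening the gap from two to five adds edges between words that sit further apart, which is why the two settings produce different networks from the same text.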

The two most important network visualizations, in my opinion, show nodes with the highest levels of Betweenness Centrality and the highest levels of Degree Centrality. The latter measures how many total connections a node has to other individual nodes; so, a node with high degree centrality will simply be connected to many other nodes. The former measures how often a node lies on the shortest paths between other pairs of nodes; so, a node with high betweenness centrality will in essence be an important ‘passageway’ between communities within the network. (Here’s an excellent visual description of the concepts.)

In a textual network, a word with high degree centrality is a word used in connection with myriad other words. This simply tells you that a word is used frequently in a text and in a variety of contexts. A word with high betweenness centrality is a word used frequently and in conjunction with other words that also connect to other nodes to form community clusters. This tells you that a word is not only used frequently and not only in many contexts but also that it is used in connection with words that also do a lot of semantic work in the text. A word with high betweenness centrality is a word through which many meanings in a text circulate.

For example, as you see below, psychological has a high degree centrality in the Unabomber Manifesto but not a high betweenness centrality. This lexical item was therefore used frequently and connected to many different words, such as:

psychological techniques

psychological methods

However, the words to which psychological is connected (techniques and methods) do not themselves perform a lot of semantic work elsewhere in the text. Words like psychological are essentially productive creators of bigrams but not pathways of meaning.

Society, on the other hand, not only has a high degree centrality but also a high betweenness centrality. So, the words that it connects to also have further connections and thus do perform semantic work elsewhere in the text.
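The contrast between the two measures is easier to see on a toy graph. The sketch below uses the standard graph-theoretic definitions: degree as a count of neighbors, and betweenness as the number of shortest paths between other node pairs that run through a node (computed with Brandes' algorithm, unnormalized). The graph and its node names are hypothetical and have nothing to do with the manifesto's actual network; the network tool computes these values for you, so this is only meant to show what the numbers mean.

```python
from collections import defaultdict, deque

def build_graph(edge_list):
    """Undirected adjacency sets from a list of (a, b) pairs."""
    graph = defaultdict(set)
    for a, b in edge_list:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

def degree_centrality(graph):
    """Number of distinct neighbors per node (unnormalized)."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    score = dict.fromkeys(graph, 0.0)
    for start in graph:
        stack, preds = [], defaultdict(list)
        paths = dict.fromkeys(graph, 0)   # number of shortest paths from start
        paths[start] = 1
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from start
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:                      # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != start:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}

# Hypothetical graph: a tight cluster (a, b, c, d), a lone "bridge" node,
# and a second small cluster (e, f, g).
toy = build_graph([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
                   ("b", "d"), ("c", "d"), ("d", "bridge"),
                   ("bridge", "e"), ("e", "f"), ("e", "g")])
deg = degree_centrality(toy)
btw = betweenness_centrality(toy)
print(deg["a"], btw["a"])
print(deg["bridge"], btw["bridge"])
```

Node "a" has three connections but sits on no shortest path between other nodes, while "bridge" has only two connections yet carries every path between the two clusters. The first profile is high degree with low betweenness; the second is the passageway profile.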

Here are the text network visualizations:

Nodes with the highest Degree Centrality in the manifesto

Nodes with the highest Betweenness Centrality in the manifesto

The text is long, so its network is messy. In the 5-word gap network, the manifesto had over 200 separate meaning clusters. In the 2-word gap (seen above), it still had over 150 clusters.

Social, society, people, and human are the words with the highest levels of degree centrality in Kaczynski’s manifesto. Also visible in this network are technology and its derivations, psychological, system, freedom, physical, power, leftist, and modern.

Social, society, and people are the words with the highest levels of betweenness centrality in Kaczynski’s manifesto. Also visible in the network are human, problems, system, change, and natural.

As I mentioned earlier, most commentary on the Unabomber manifesto focuses on a) its attack on technology, and b) its attack on leftism. However, as these text networks demonstrate, the words that do the most semantic work in the text—the words through which most meanings flow—suggest that Kaczynski’s sights were set on society as a whole—its people, its systems. Three other words with relatively many connections—psychological, power, and freedom—further suggest that the ostensible screed against leftism and technology masks a deeper motive that circulates in a diffuse, though nonetheless salient way throughout the text. And in the light of Kaczynski’s possible connection to an anarcho-primitivist tradition, these particularly noticeable nodes make much more sense than they would if we tried to paint him as a madman or, worse, a bitter, conservative academic. If he were only that, we might expect other terms to be more noticeable in the network (e.g., the various derivations of leftism).

One thing a text network does, beyond providing an interesting visualization, is to point the researcher in the direction of terms and n-grams that might be explored more granularly in a corpus analysis tool, such as the NLTK. It provides a map of a text’s semantic circulation, a map that can be followed when we return to the world of pure textuality.

Corpus Analysis

Here is a raw count of the most frequent words in the manifesto:

Line chart of the most frequent words in the manifesto

Certain words weren’t visually important nodes in the text network but were nonetheless used frequently (e.g., goal/s, individual/s, process, industrial, way, work, man, behavior, control); these words were deployed often but in conjunction with a limited number of other terms. Nevertheless, the 20 most frequent words signify a dual emphasis that makes sense if Kaczynski is a certain kind of primitivist: there is the left-wing emphasis on the ills of society, the system, technology, and control; but there is also the right-wing emphasis on individuals and freedom.
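A raw frequency count of this kind takes only a few lines. The sketch below uses the standard library rather than the NLTK (whose FreqDist class offers a similar most_common method); the short stop list and the function name are my own assumptions, and the sample is a single sentence quoted from the manifesto above, not the full text.

```python
import re
from collections import Counter

# Short illustrative stop list (an assumption, not the one used for the chart).
STOP_WORDS = {"a", "an", "as", "can", "in", "is", "of", "one", "our",
              "so", "the", "to", "most"}

def top_words(text, n=20, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop words with their counts."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

sample = ("One of the most widespread manifestations of the craziness of our "
          "world is leftism, so a discussion of the psychology of leftism can "
          "serve as an introduction to the discussion of the problems of "
          "modern society in general.")

top_two = top_words(sample, n=2)
print(top_two)
```

Run over the whole manifesto with a fuller stop list, the same function would yield the kind of ranked list that the chart plots.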

The NLTK can also generate a dispersion plot, which shows where in a text individual words fall. Here is a dispersion plot of the 10 most frequent words:

Dispersion plot of the 10 most frequent words in the manifesto

A striking pattern emerges. Although much has been made of the manifesto’s condemnation of the left, the dispersion plot demonstrates that anti-leftism is not a continuous theme in the text but rather forms the bookends: the manifesto opens and closes with references to leftists, but the bulk of the text does not mention them at all. The focus is elsewhere.

The dispersion of technology and technological provides another striking pattern. More than a third of the text passes before Kaczynski begins to deploy these words in earnest, even though a surface reading of the text leaves the reader with the impression that technological anxieties anchor every aspect of the manifesto.

But compare the dispersion of these supposedly central terms—leftist/s, technology and technological—with the dispersion of other terms in the list. Society, system, people, power, human, and, to a lesser extent, modern all have much more uniform dispersions throughout the manifesto. In other words, these concepts appear more regularly and consistently in each of the manifesto’s 232 numbered paragraphs, and that is precisely what we should expect if Kaczynski is indeed a primitivist who loves nature more than humanity. His ire is most obviously directed at leftists, but more subtly, the motivated energy of his manifesto is pointed in all directions at all society in its malformed, destructive development.
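A dispersion plot is just a picture of token offsets, so the bookend-versus-uniform contrast can also be checked numerically. The NLTK draws the plot itself through the dispersion_plot method on its Text objects; the sketch below skips the drawing and computes the underlying positions, plus a crude coverage score of my own devising (the fraction of equal-sized chunks of the text in which a word turns up). The toy text is hypothetical and merely imitates the bookend pattern described above.

```python
import re

def dispersion_offsets(text, targets):
    """Map each target word to the token positions where it occurs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    wanted = {t.lower() for t in targets}
    offsets = {t: [] for t in wanted}
    for position, token in enumerate(tokens):
        if token in wanted:
            offsets[token].append(position)
    return offsets, len(tokens)

def segment_coverage(positions, total_tokens, segments=10):
    """Fraction of equal-sized segments in which the word appears at least once."""
    if total_tokens == 0:
        return 0.0
    hit = {min(p * segments // total_tokens, segments - 1) for p in positions}
    return len(hit) / segments

# Hypothetical toy text: "leftist" only at the start and end,
# "society" spread throughout.
toy_text = " ".join(["leftist society"]
                    + ["society filler filler"] * 6
                    + ["leftist society"])

offsets, total = dispersion_offsets(toy_text, ["leftist", "society"])
leftist_cov = segment_coverage(offsets["leftist"], total, segments=4)
society_cov = segment_coverage(offsets["society"], total, segments=4)
print(offsets["leftist"], leftist_cov, society_cov)
```

A word that only bookends the text covers few segments, while a word that runs through the whole text covers nearly all of them. Using the manifesto's 232 numbered paragraphs as the segments would be a natural refinement.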

Ranking Native American language health

I recently finished reading Ellen Cushman’s The Cherokee Syllabary, an excellent book on the history and spread of the writing system developed by Sequoyah for the Cherokee tribe. Cushman does a thorough job explaining how the syllabary works as a syllabary, rather than describing it in alphabetic terms. She argues that to explain a syllabary in terms of one-to-one sound-grapheme correspondence (which is often the tack in linguistic work) is already to analyze it in alphabetic terms.

One of Cushman’s central projects in the book is to demonstrate how the Cherokee syllabary—both its structure and graphic representation—grew from Cherokee culture. It was not, she argues, a simple borrowing and re-application of the Roman alphabetic script. Most scholars would disagree with her, including Henry Rogers in Writing Systems: A Linguistic Approach and Steven Roger Fischer in A History of Writing. Fischer claims that “using an English spelling book, [Sequoyah] arbitrarily appointed letters of the alphabet” to correspond with units of sound in Cherokee (287). Cushman counters this claim by pointing out that linguists only make it after looking at the printed form of Cherokee, which, by necessity, remediated Sequoyah’s original syllabary into a more Latinate form. Cushman provides us with pictures of the original syllabary, as well as a new Unicode font that she believes more adequately represents the original style:

Much of Cushman’s book is devoted to showing the connection between Cherokee culture and the syllabary, a connection which obviates the need to assume some sort of alphabetic borrowing.

I’m not at all convinced by this main argument (still lots of Latinate forms up there), but I was quite interested, after reading the book, in another point Cushman makes about what it means to be Native American, both historically and contemporarily. She posits “four pillars of Native peoplehood: language, history, religion, and place” (6). I would argue that language is the most powerful of the four, but Cushman merely claims that the loss of the Cherokee language would “spell the ruin of an integral part of Cherokee identity.”

No doubt it would. And this got me thinking about native language health in general. As regards Cherokee specifically, Cushman writes that “while the Cherokees are one of the largest tribes in the United States, the Cherokee Nation estimates that only a few thousand speak, read, and write the Cherokee language” (6). I checked this statistic and found it to be correct but misleading. Perhaps only a few thousand Cherokees “speak, read, and write” Cherokee, but 16,000 speak the language.

So what about other native languages? Using Ethnologue and the World Atlas of Language Structures, I ranked all native American languages (and a few Canadian languages) by their ‘linguistic health’, measured purely as number of speakers. Here’s a bar chart of native languages with more than 100 speakers. Already, you can notice the seriously skewed curve that I’ll discuss in a moment . . .

Now, no native language in America (or Canada) is ‘healthy’ compared to English, Spanish, Mandarin, Hindi, or the world’s other dominant languages. Nearly all native American languages are endangered, severely endangered, or extinct. Only one—Navajo—escapes the ‘endangered’ list, but even then, Navajo is lately considered ‘vulnerable’ because the youngest generation is switching to English.

Within this continuum of endangered native languages, however, there exists a highly skewed continuum of linguistic health. There are approx. 115 living languages in America, but only 35 possess more than 1,000 speakers. Only 9 possess more than 10,000 speakers. And only 3 possess more than 50,000 speakers. In other words, the great bulk of living native American languages are in bad shape, and will likely go extinct within the next generation, joining the 41 native languages that already have gone extinct. Here’s the ranking of native languages with fewer than 100 speakers:

And yet what interests me about this data is not the obvious point about language loss in our post-colonial present. Language loss is the inevitable outcome in the wake of conquest; Old English itself was lost when the Norman French invaded Britain. Rather, what interests me is that, extinction and severe endangerment being the rule, several languages have managed to become glaring exceptions to the rule. Why?

According to my list, there are approximately 454,515 native language speakers in America (and in parts of Canada, since I’ve included Cree and Ojibwe, Canada’s healthiest native languages, in my list; see the end of this post for more methodological details). At the start of the colonial era, there were somewhere between 2 and 7 million natives living in what is now the U.S. and Canada, with most of that population inhabiting the U.S. Splitting the difference, we can say there were 4.5 million native language speakers pre-conquest but only 454,515 today. That’s a nearly 90% reduction in native language speakers over the course of 500 years.
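The arithmetic behind that figure can be checked in a few lines of Python. This is only a sketch built from the numbers quoted above: the 4.5 million midpoint is a rough estimate, and the variable names are hypothetical.

```python
# Figures quoted in the post; the pre-conquest number is a rough midpoint,
# not a measured count.
low_estimate = 2_000_000
high_estimate = 7_000_000
speakers_then = (low_estimate + high_estimate) / 2  # 4.5 million
speakers_now = 454_515

# Proportional loss of speakers between the two points in time.
reduction = 1 - speakers_now / speakers_then
print(f"Estimated reduction in speakers: {reduction:.1%}")
```

Using the low estimate (2 million) instead of the midpoint would give a smaller reduction, and the high estimate (7 million) a larger one, so the result is only as good as the population estimate it starts from.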

(Note: this is not the same as a reduction in population. There are currently 2.9 million native Americans in the U.S., which, depending on your source, represents anything from a net gain in population between the 15th and 21st centuries to a loss of around 50-60% of the total native population. The comparatively drastic loss in number of native language speakers, however, results from the fact that most native Americans have, both recently and historically, switched to English.)

Speaking of languages, then, not population, it seems as though total annihilation is the most probable outcome for a language after conquest. It seems almost inevitable that a conquered population’s language will eventually become the language of the conqueror. (This is why only 100,000 people speak Irish in Ireland, and why no one speaks an un-Romanized version of English.)

Thus, it’s not surprising that most native languages possess fewer than 1,000 speakers, or that more than half only have between 1 and 100 speakers—i.e., it’s not surprising that more than half of native American languages are practically extinct. If we ignore the nine ‘healthiest’ native languages (the outliers with more than 10,000 speakers), then the total reduction in native language speakers between pre-colonial times and today rapidly approaches 100%.

Which returns us to the interesting thing about this data: the existence of these (comparatively) healthy native American languages. The nine healthiest languages have a total of 368,259 speakers, which translates to 81% of all native language speakers across all tribal languages; and yet these nine languages comprise only 7% of all native languages. In other words, 81% of native language speakers in America and parts of Canada speak only 7% of the existing native languages (less than 4% of all native languages, if we factor extinct languages and all Canadian languages into the equation).
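To make the skew concrete, here is a short Python check of the speaker share, using only the totals quoted in this post. The count of 115 living languages is the approximate figure given earlier, so the per-language average is a rough illustration at best; the variable names are hypothetical.

```python
# Totals quoted in the post.
top_nine_speakers = 368_259
all_speakers = 454_515
living_languages = 115  # approximate count given earlier in the post

# Share of all speakers accounted for by the nine healthiest languages.
top_share = top_nine_speakers / all_speakers

# Speakers left over for every other living language, and their rough average.
remaining_speakers = all_speakers - top_nine_speakers
remaining_languages = living_languages - 9
average_remaining = remaining_speakers / remaining_languages

print(f"Share held by the top nine: {top_share:.0%}")
print(f"Speakers spread over the other {remaining_languages} languages: {remaining_speakers:,}")
print(f"Rough average per remaining language: {average_remaining:.0f}")
```

The average hides the fact that many of the remaining languages sit far below it, since more than half have fewer than 100 speakers.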

I imagine that if we look at any area on the globe where conquered indigenous languages jostle beside more powerful indigenous or colonial languages, we’ll find similar data showing that, even amongst the less powerful languages, there remains a very skewed hierarchy of linguistic health. One can’t help wondering what’s at work here . . .

I enjoy compiling large sets of data like this because certain questions just don’t come into sharp focus until we compile the data. I think most rhet/comp scholars, like Cushman, have a general understanding that certain native American languages are in better shape than others; however, until we take the time to work with the actual data set (all living and extinct native American languages), we won’t discover this skewed pattern within it, and we won’t be able to formulate what, to my mind, are highly interesting and relevant questions: why and how have certain languages managed to survive and (comparatively) thrive while most other native languages have gone extinct or dwindled to only a few hundred speakers? What did these languages and tribal groups have going for them that the others didn’t? Was it a purely linguistic advantage, a purely geopolitical advantage, or a combination of both?

In part, we can read Cushman’s book as an answer to these unformulated questions. While Cushman spends a lot of time (rightly) describing language attrition among contemporary Cherokees, she perhaps doesn’t realize that Cherokee is doing a hell of a lot better than most other native languages. Although her book presents something of a contrast between the language’s current weakened state and the syllabary’s historic role in uniting and strengthening the Cherokee against further Western encroachment, we can see, in light of this data, how the contrast is perhaps instead a partial explanation for the fact that Cherokee isn’t as unhealthy as the vast majority of native American languages. In other words, the existence of the Cherokee syllabary may very well be one of the reasons why Cherokee exists on the healthier side of living native languages, why Cherokee isn’t entirely extinct.

Stylizing Sequoyah’s thought process, Cushman writes, “If whites could have a writing system that so benefited them, filling them with self-respect and earning the respect of others, then Cherokees could have a writing system with all this power as well” (35). After compiling statistics on native language health, I can see that Cushman, in focusing on current language attrition among the Cherokee, misses a deeper exploration of a compelling possibility: that the syllabary’s power not only bolstered the Cherokee people but also perhaps played a part in saving the Cherokee language itself from total extinction. The syllabary’s strengthening role was not merely a historic phenomenon; without it, perhaps there wouldn’t be a Cherokee language today at all.

This is a good example of why I think digital tools and databases have a lot to offer the humanities: without them, patterns go unnoticed and questions go unasked.

Methodological notes: I couldn’t rank linguistic health among native languages without first deciding what “counted” as a native language and what was simply a dialect of a language. This language/dialect issue is sometimes difficult to navigate, and Ethnologue typically gives each dialect its own language code. But such granularity is misleading; Madrid Spanish and Buenos Aires Spanish are different in many respects, but speakers in both places can understand one another because they are still, despite the differences, speaking Spanish.

Mutual intelligibility between speaker populations is the general rule for differentiating a dialect from a separate language, and I’ve done my best to follow that rule. For example, I’ve counted Ojibwe as a single language, even though Ojibwe is in fact a continuum of dialects; on the other hand, I’ve divided the Miwok continuum into different languages (Sierra Miwok, Plains Miwok, et cetera). Speakers of the Miwok languages, while closely related, have difficulty understanding one another in a way that speakers of Ojibwe dialects do not. So, Ojibwe is a single language, while the Miwok ‘dialects’ should really be considered separate languages.

However, none of this made huge differences in the ranking. Some might quibble with my grouping of all Ojibwe or Cree dialects into a single language, but even had I taken out the dialects that aren’t perfectly intelligible with the others, each of these languages still would have retained tens of thousands of speakers. Conversely, even had I counted all Miwok speakers as a single linguistic group, Miwok would still have fewer than 50 speakers.

Finally, when compiling statistics on numbers of speakers for each language, I used field linguists’ counts when they were available, rather than census counts, which tend to err on the side of liberality. (E.g., according to the U.S. census, there are over 150,000 Navajo speakers, but most linguists consider this an unlikely number.)

Meanings of ‘writing’ and ‘rhetoric’ in RSQ and CCC

Earlier this year, I compiled two small corpora of article abstracts from the most prominent journals in the American fields of rhetoric and writing studies: Rhetoric Society Quarterly and College Composition and Communication, respectively. The RSQ abstracts stretch from Winter 2000 (30.1) to Fall 2011 (41.5), for a total of 220 abstracts. The CCC abstracts stretch from February 2000 (51.3) to September 2011 (63.1), for a total of 261 abstracts. I think that article abstracts are a good vantage point for looking at disciplinary trends, because (in the humanities, anyway) researchers tend to write abstracts that function like movie previews. Designed to appeal to a specific disciplinary audience, abstracts signal that their articles ‘belong’ in the field by using all the right buzz words, name-dropping all the right researchers, and making all the right stylistic moves that make other researchers want to read the article.

Using Python and the Natural Language Toolkit to explore these two corpora of abstracts, I’ve discovered both interesting and unsurprising things about how rhetoric and writing studies have taken shape, over the last decade, as separate but ambivalently related disciplines. One of the more interesting pieces of capta demonstrated by the corpora is that the words ‘writing’ and ‘rhetoric’ share grammatical contexts with very different lexical items, suggesting that each word means something different in each journal.

Before I get to the details, though, here’s a bit about my methodology:

With Python and NLTK, you can chart how a word is used similarly or differently in two corpora. For instance, a concordance of the word ‘monstrous’ in Moby Dick reveals contexts such as ‘the monstrous size’ and ‘the monstrous pictures’. Running a few extra commands, you discover that words such as ‘impalpable’, ‘imperial’, and ‘lamentable’ are also used in these same contexts. Running an identical search on Sense and Sensibility, however, reveals that ‘monstrous’ shares contexts with quite different terms: ‘very’, ‘exceedingly’, and ‘remarkably’. Dissimilar contexts reveal different connotations for ‘monstrous’ in each novel, positive or neutral in Austen but negative in Melville. This, basically, was the method I applied for mining the usage of ‘rhetoric’ and ‘writing’ in the abstracts corpora (more details below the tables, though).
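In NLTK, the methods that match this description are Text.concordance, Text.similar, and Text.common_contexts. To show what a shared-context search does without needing the NLTK data downloads, here is a minimal standard-library sketch. It assumes a context is simply the pair of words immediately to the left and right of a term, which may differ in detail from what NLTK computes and from the exact procedure behind the tables below. The toy text is invented for illustration and is not a quotation from either novel.

```python
from collections import defaultdict


def build_context_index(tokens):
    """Map each word to the set of (left, right) neighbour pairs it occurs in."""
    index = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        index[word].add((left, right))
    return index


def similar_words(tokens, target):
    """Rank other words by how many contexts they share with `target`."""
    index = build_context_index(tokens)
    target_contexts = index.get(target, set())
    ranked = []
    for word, contexts in index.items():
        shared = len(contexts & target_contexts)
        if word != target and shared > 0:
            ranked.append((word, shared))
    # Most shared contexts first; ties broken alphabetically.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Invented toy text, purely for illustration.
toy = (
    "the monstrous size of the whale and the imperial size of the ship "
    "and the monstrous pictures and the lamentable pictures"
).split()

print(similar_words(toy, "monstrous"))
```

On this toy text, ‘imperial’ and ‘lamentable’ each share one context with ‘monstrous’ (‘the _ size’ and ‘the _ pictures’, respectively), which is the same kind of evidence the real search surfaces at scale.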

‘Rhetoric’ occurred 244 times in RSQ abstracts and 69 times in CCC abstracts. ‘Writing’ occurred 22 times in RSQ and 251 times in CCC. I compiled common grammatical contexts for each term in each corpus. Each context took the form,

CommonContexts

where N was any term and x was ‘rhetoric’ or ‘writing’, respectively.

RhetoricWritingCommonContexts

The more contexts two terms share, the more likely it is that the two terms, within the specific corpus, are used interchangeably. One way to get your head around this fact is by looking at grammatical contexts without an operative term:

(1) I _ you

In an English corpus, the words that appear in that _ context will be semantically limited. Hundreds, if not thousands, of words will indeed fit in that context, but even given such a large list of lexical items, all the items will nonetheless share some distinguishing semantic value: for example, the words that can appear in the context of (1) can only be transitive or di-transitive verbs, and they cannot be 3rd person present verbs. Right off the bat, this context has limited its possible terms to a fraction of all the words in the English lexicogrammar. Throw in a second context, and the list of terms grows even smaller:

(2) is _ by

Given rules of English morphology and semantics, most of the words that appear in this context will be past tense action or emotive participles (e.g., loved, felt, killed, written, eaten, trapped). Terms that can appear in both (1) and (2) are quite limited: only transitive or di-transitive verbs, no 3rd person presents, and now, no irregular verbs (e.g., written, wrote, eaten, ate).

If we start using contexts that contain more than just semantically null stopwords on both sides, it’s easy to see how the list of terms can grow very short very quickly:

(3) I _ girls

What kind of words can appear in (1), (2), and (3)? No irregular verbs, no 3rd person present verbs, and now, probably no di-transitive verbs, given the lack of a definite article before ‘girls’ (e.g., I put the girls to bed). Words that can appear in all three of these contexts would likely be words that are easily grouped together in some meaningful way.
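The narrowing effect of contexts (1), (2), and (3) can be sketched in Python by collecting the words that fill each frame and intersecting the results. The sentences below are invented for illustration, and the fillers helper is a hypothetical name, not part of NLTK.

```python
def fillers(sentences, left, right):
    """Return the set of words found directly between `left` and `right`."""
    found = set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for a, word, b in zip(tokens, tokens[1:], tokens[2:]):
            if a == left and b == right:
                found.add(word)
    return found


# Invented sentences, purely for illustration.
corpus = [
    "I loved you", "I trapped you", "I wrote you",
    "She is loved by him", "It is trapped by ice", "It is written by me",
    "I loved girls",
]

frame1 = fillers(corpus, "i", "you")    # (1) I _ you
frame2 = fillers(corpus, "is", "by")    # (2) is _ by
frame3 = fillers(corpus, "i", "girls")  # (3) I _ girls

print(sorted(frame1 & frame2))           # words that fit both (1) and (2)
print(sorted(frame1 & frame2 & frame3))  # words that fit all three
```

In this toy corpus the irregular verb drops out at the first intersection, because ‘wrote’ fills (1) while ‘written’ fills (2), and each added frame shrinks the surviving set further.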

So, when a corpus analysis tells us that two words share half a dozen or more contexts in a specific corpus, we can see how these words might share not only grammatical but semantic and definitional attributes within the corpus. The simple example of ‘monstrous’, ‘lamentable’, and ‘imperial’ in Melville demonstrates this statistical fact. It is also supported by the large number of contexts (20!) shared by ‘writing’ and ‘composition’ in the CCC abstracts, two words that I knew, a priori, were synonymous in the American field of writing studies. The analysis bears out this a priori knowledge, thus confirming the methodology.

nltk14

While the terms sharing 2 or 3 contexts in the tables above are interesting, our attention should be focused on the terms near the top of the lists. In RSQ, ‘language’, ‘discourse’, ‘art’, ‘persuasion’, ‘theory’, and ‘texts’ tell us indirectly what the word ‘rhetoric’ means in that journal; in CCC, ‘writing’, ‘composition’, ‘education’, ‘place’, and ‘theory’ provide the same information.

The highlighted terms are the terms that overlap between each journal’s set of common contexts. The overlap is minimal. For ‘rhetoric’, only a single word (‘theory’) overlaps and surfaces in more than 3 distinct contexts in each journal; for ‘writing’, no word overlaps and surfaces in more than 3 distinct contexts. More telling is that ‘writing’ and ‘rhetoric’ themselves possess a high degree of interchangeability in CCC, sharing 7 distinct contexts, but a very low degree in RSQ, sharing only 2 distinct contexts. In other words, these capta suggest that ‘writing’ and ‘rhetoric’ mean nearly the same thing in CCC but do not mean the same thing at all in RSQ.
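The overlap check for ‘rhetoric’ amounts to a set intersection. The sketch below uses only the top terms named in this post for each journal, not the full tables, so it illustrates the comparison rather than reproducing the analysis; the variable names are hypothetical.

```python
# Top terms the post names as sharing contexts with 'rhetoric' in each journal.
rsq_neighbours = {"language", "discourse", "art", "persuasion", "theory", "texts"}
ccc_neighbours = {"writing", "composition", "education", "place", "theory"}

# Terms that appear near the top of both lists.
overlap = rsq_neighbours & ccc_neighbours
print(sorted(overlap))  # ['theory']
```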